2008. január 4., péntek
Building an Easy-to-Use Parser/Parsing Framework (Part I)
Problem/Question/Abstract:
How to create a simple parsing framework to parse any kind of data?
Answer:
Note:
The full sourcecodes for all components & examples descripted in the following article are available as an open-source project under SourceForge:
Parser Framework
A second article was released on 29.01.2002 with a more detailed example:
Building an Easy-to-Use Parser/Parsing Framework (Part II)
Today, we wonna speak about "how to create a simple parser framework" in Delphi. Our goal will be a class solutions which helps up to parse any kind of data and store all relevant informations in an easy-to- access object model.
At the end of this article, we've developed a small utility, which generates a simple object model of a .dtd file and output it's xml pendant from a given starting node. In other words, we're using the parsing framework to create a parser, which is able to parse a .dtd file, extract all neccessary tags, store them in the object model and generates the xml output. Note: This utility uses a simply dtd- parser model, which don't include all routines to parse all kinds of dtd datas - it's up to you to include those features.
Our claims to the framework and object model are:
easy to use.
save/loadable object trees.
integrated error reporting.
expandable.
Okay, now let's start to develope the main parsing engine. Delphi comes with a unit called CopyPrsr which includes the simple stream parser object TCopyParser. Try to take a look into that file to understand how it works - it's located under $(DELPHI)\Source\Internet\CopyPrsr.pas. Our framework parser is derived from that idea, but uses a simple string instead of the stream and includes some additional functions:
The boiler plate:
unit StringParser;
interface
uses
Classes;
const
{ Additional Parser special tokens }
toEOL = char(6);
toBOF = char(7);
type
{ TSysCharSet }
TSysCharSet = set of Char;
{ TStringParser }
TStringParser = class
private
{ Private declarations }
FParseString: string;
FLineTokens: Integer;
FSourceLine: Integer;
FSourcePos: Integer;
FTokenPos: Integer;
FToken: Char;
procedure SkipBlanks;
function GetParseString: string;
function GetSourcePos: Integer;
function GetTokenString: string;
protected
{ Protected declarations }
public
{ Public declarations }
constructor Create;
function LoadFromFile(const FileName: string): Boolean;
function LoadFromStream(const Stream: TStream): Boolean;
function SkipToEOF: string;
function SkipToEOL: string;
function SkipToken: Char;
function SkipTokenString: string;
function SkipToToken(const AToken: Char): string; overload;
function SkipToToken(const AToken: TSysCharSet): string; overload;
function SkipToTokenString(const ATokenString: string): string;
property ParseString: string read GetParseString;
property SourceLine: Integer read FSourceLine;
property SourcePos: Integer read GetSourcePos;
property Token: Char read FToken;
property TokenString: string read GetTokenString;
end;
As you can see, there are many public helper functions which you can use to parse the data. The main functions are LoadFromFile and LoadFromStream, which needs the name of the file to be parsed as the only parameter. Both functions loads the content of the file and store it to the string FParseString which can be accessed through the denominator property:
LoadFromFile/LoadFromStream:
function TStringParser.LoadFromFile(const FileName: string): Boolean;
var
Stream: TMemoryStream;
begin
Result := False;
if not FileExists(FileName) then
Exit;
Stream := TMemoryStream.Create;
try
Stream.LoadFromFile(FileName);
Result := LoadFromStream(Stream);
finally
Stream.Free;
end;
end;
function TStringParser.LoadFromStream(const Stream: TStream): Boolean;
var
MemStream: TMemoryStream;
begin
Result := False;
if not (assigned(Stream)) then
Exit;
MemStream := TMemoryStream.Create;
try
Stream.Seek(0, soFromBeginning);
MemStream.CopyFrom(Stream, Stream.Size);
FParseString := StrPas(MemStream.Memory);
SetLength(FParseString, MemStream.Size);
FParseString := Concat(FParseString, toEOF);
FToken := toBOF;
Result := True;
finally
MemStream.Free;
end;
end;
The main functionality of the parsing engine is the extraction of so- called tokens. A token can be a seperator (like CR, LF or EOF) or a symbol, which can be a keyword if you plan to parse a programing language. The following functions are used to skip blank characters (which are used to seperate symbols and aren't relevant) and to extract/skip the next token symbol:
Token related functions (pullout only):
procedure TStringParser.SkipBlanks;
begin
while True do
begin
FToken := FParseString[FTokenPos];
case FToken of
#10:
begin
Inc(FSourceLine);
FLineTokens := FTokenPos;
end;
toEOF, #33..#255:
Exit;
end;
Inc(FTokenPos);
end;
end;
function TStringParser.SkipToken: Char;
const
KeySet = ['A'..'Z', 'a'..'z', '0'..'9', '_'];
begin
SkipBlanks;
FSourcePos := FTokenPos;
if FParseString[FTokenPos] = toEOF then
FToken := toEOF
else if FParseString[FTokenPos] in KeySet then
begin
while FParseString[FTokenPos] in KeySet do
Inc(FTokenPos);
FToken := toSymbol;
end
else
begin
FToken := FParseString[FTokenPos];
Inc(FTokenPos);
end;
Result := FToken;
end;
function TStringParser.SkipToToken(const AToken: TSysCharSet): string;
begin
FSourcePos := FTokenPos;
while not ((FParseString[FTokenPos] = toEOF) or (FParseString[FTokenPos] in AToken))
do
begin
if FParseString[FTokenPos] = #10 then
begin
Inc(FSourceLine);
FLineTokens := FTokenPos;
end;
Inc(FTokenPos);
end;
if FParseString[FTokenPos] = toEOF then
FToken := toEOF
else
FToken := FParseString[FTokenPos];
Result := GetTokenString;
if FToken <> toEOF then
SkipToken;
end;
The absent functions includes alternativ possibilities to extract or skip the tokens, like SkipToTokenString or SkipToEof. Well, the next step is to create the object model, which holds all our parsed informations. As I mentioned earlier, the goal of this article it to create a simple dtd parser, so we'll create an object model to store dtd
informations.
A dtd file is used to descripe the syntax rules of a xml file, like:
DTD example:
key CDATA #REQUIRED
value CDATA #REQUIRED
>
This example demonstrated the simplest way of a dtd structure. It's not the purpose of this article to develope a highly flexible dtd parser which spots all dtd grammas, so we only include the weightly ones. Root of each object model is the document, which holds all other objects as collections:
The Root Document:
{ TDTDDocument }
TDTDDocument = class(TPersistent)
private
{ Private declarations }
FEntities: TDTDEntities;
FElements: TDTDElements;
procedure SetEntities(Value: TDTDEntities);
procedure SetElements(Value: TDTDElements);
public
{ Public declarations }
constructor Create;
destructor Destroy; override;
procedure Assign(Source: TPersistent); override;
published
{ Published declarations }
property Entities: TDTDEntities read FEntities write SetEntities;
property Elements: TDTDElements read FElements write SetElements;
end;
As you can see, our document gives us the access of some other types of data: Entities and Elements. Entities are very hard to parse, so it's a good lesson for you to include that feature. Parsing elements is quite easier, so this type of data is better to explain here. Look at the dtd example some rows above this. You can see, that a dtd element is descripted as followed:
Our object model needs some extra fields to store such element informations. If you are not familiar with dtd or xml, look at W3CSchools - it's a good starting point to learn more about that topic. So, take a look at the following object structure:
TDTDDocument
|
o--TDTDEntities
|
o--TDTElements
|
o--TDTDElementTyp
|
o--TDTDAttributes
|
o--TDTDAttributeTyp
o--TDTDAttributeStatus
o--Default: string
o--TDTDEnums
o--TDTDChild
|
o--TDTDTyp
o--TDTDChilds
I've tried to "pack" the dtd grammars into an easy object model as you can see above:
Each document contains a collection of elements. Each element is descripted by an elementtype and containes in turn a collection of attributes and childs. Each attribute again is descripted by an attributetype and contains a collection of enum(erations) and so forth. Followed a code cantle from the element implementation, it's not suggestive to show you the whole code here - it's quit long and a little bit more complex:
TDTDElement:
unit DTD_Document;
interface
uses
Classes;
type
...
{ TDTDElementTyp }
TDTDElementTyp =
(etAny, etEmpty, etData, etContainer);
{ TDTDElementStatus }
TDTDElementStatus =
(esRequired, esRequiredSeq, esOptional, esOptionalSeq);
{ TDTDItem }
TDTDItem = class(TCollectionItem)
private
{ Private declarations }
FName: string;
public
{ Public declarations }
procedure Assign(Source: TPersistent); override;
published
{ Published declarations }
property Name: string read FName write FName;
end;
{ TDTDItems }
TDTDItems = class(TCollection)
private
{ Private declarations }
function GetItem(Index: Integer): TDTDItem;
procedure SetItem(Index: Integer; Value: TDTDItem);
public
{ Public declarations }
function Add: TDTDItem;
function Find(const Name: string): TDTDItem;
property Items[Index: Integer]: TDTDItem read GetItem write SetItem; default;
end;
...
{ TDTDElement }
TDTDElement = class(TDTDProperty)
private
{ Private declarations }
FTyp: TDTDElementTyp;
FAttributes: TDTDAttributes;
FChilds: TDTDChilds;
procedure SetAttributes(Value: TDTDAttributes);
procedure SetChilds(Value: TDTDChilds);
public
{ Public declarations }
constructor Create(Collection: TCollection); override;
destructor Destroy; override;
procedure Assign(Source: TPersistent); override;
published
{ Published declarations }
property Typ: TDTDElementTyp read FTyp write FTyp;
property Attributes: TDTDAttributes read FAttributes write SetAttributes;
property Childs: TDTDChilds read FChilds write SetChilds;
end;
{ TDTDElements }
TDTDElements = class(TDTDProperties)
private
{ Private declarations }
function GetItem(Index: Integer): TDTDElement;
procedure SetItem(Index: Integer; Value: TDTDElement);
public
{ Public declarations }
function Add: TDTDElement;
function Find(const Name: string): TDTDElement;
property Items[Index: Integer]: TDTDElement read GetItem write SetItem; default;
end;
...
implementation
...
{ TDTDItem }
procedure TDTDItem.Assign(Source: TPersistent);
begin
if Source is TDTDItem then
begin
Name := TDTDItem(Source).Name;
Exit;
end;
inherited Assign(Source);
end;
{ TDTDItems }
function TDTDItems.Add: TDTDItem;
begin
Result := TDTDItem(inherited Add);
end;
function TDTDItems.Find(const Name: string): TDTDItem;
var
i: Integer;
begin
Result := nil;
for i := 0 to Count - 1 do
if CompareStr(Items[i].Name, Name) = 0 then
begin
Result := Items[i];
Break;
end;
end;
function TDTDItems.GetItem(Index: Integer): TDTDItem;
begin
Result := TDTDItem(inherited GetItem(Index));
end;
procedure TDTDItems.SetItem(Index: Integer; Value: TDTDItem);
begin
inherited SetItem(Index, Value);
end;
...
{ TDTDElement }
constructor TDTDElement.Create(Collection: TCollection);
begin
inherited Create(Collection);
FAttributes := TDTDAttributes.Create(TDTDAttribute);
FChilds := TDTDChilds.Create(TDTDChild);
end;
destructor TDTDElement.Destroy;
begin
FAttributes.Free;
FChilds.Free;
inherited Destroy;
end;
procedure TDTDElement.Assign(Source: TPersistent);
begin
if Source is TDTDElement then
begin
Typ := TDTDElement(Source).Typ;
Attributes.Assign(TDTDElement(Source).Attributes);
Childs.Assign(TDTDElement(Source).Childs);
end;
inherited Assign(Source);
end;
procedure TDTDElement.SetAttributes(Value: TDTDAttributes);
begin
FAttributes.Assign(Value);
end;
procedure TDTDElement.SetChilds(Value: TDTDChilds);
begin
FChilds.Assign(Value);
end;
{ TDTDElements }
function TDTDElements.Add: TDTDElement;
begin
Result := TDTDElement(inherited Add);
end;
function TDTDElements.Find(const Name: string): TDTDElement;
begin
Result := TDTDElement(inherited Find(Name));
end;
function TDTDElements.GetItem(Index: Integer): TDTDElement;
begin
Result := TDTDElement(inherited GetItem(Index));
end;
procedure TDTDElements.SetItem(Index: Integer; Value: TDTDElement);
begin
inherited SetItem(Index, Value);
end;
The advantage of this object model is, that you're able to easily add the document to a standard component and use Delphi's internal streaming technology to load and save the object contents of a parsed file.
The next step will be the developing of the real dtd parser. Do you remember the TStringParser descripted at the top of this article? We'll using this class to build up our parser. But, we don't want to create a parser from scratch each time we're about to parse a new kind of data - it's not mind of a framework. So first, we'll develope a small parser class from which we will inherit our dtd parser. This parent class should include the error reporting and accessable fields to some other informations:
The Private Parser class:
unit PrivateParser;
interface
uses
Classes, SysUtils, StringParser;
type
{ TParserError }
TParserError = class(TCollectionItem)
private
{ Private declarations }
FFileName: string;
FLine: Integer;
FMessage: string;
FPosition: Integer;
public
{ Public declarations }
procedure Assign(Source: TPersistent); override;
published
{ Published declarations }
property FileName: string read FFileName write FFileName;
property Line: Integer read FLine write FLine;
property Message: string read FMessage write FMessage;
property Position: Integer read FPosition write FPosition;
end;
{ TParserErrors }
TParserErrors = class(TCollection)
private
{ Private declarations }
function GetItem(Index: Integer): TParserError;
procedure SetItem(Index: Integer; Value: TParserError);
public
{ Public declarations }
function Add: TParserError;
property Items[Index: Integer]: TParserError read GetItem write SetItem; default;
end;
{ TValidationParser }
TValidationParser = class
private
{ Private declarations }
FErrors: TParserErrors;
procedure SetErrors(const Value: TParserErrors);
public
{ Public declarations }
constructor Create;
destructor Destroy; override;
procedure AddError(const AMessage: string; Parser: TStringParser; const AFileName:
string = '');
procedure AddErrorFmt(const AMessage: string; Params: array of const; Parser:
TStringParser; const AFileName: string = '');
property Errors: TParserErrors read FErrors write SetErrors;
end;
implementation
{ TParserError }
procedure TParserError.Assign(Source: TPersistent);
begin
if Source is TParserError then
begin
Line := TParserError(Source).Line;
Message := TParserError(Source).Message;
Position := TParserError(Source).Position;
Exit;
end;
inherited Assign(Source);
end;
{ TParserErrors }
function TParserErrors.Add: TParserError;
begin
Result := TParserError(inherited Add);
end;
function TParserErrors.GetItem(Index: Integer): TParserError;
begin
Result := TParserError(inherited GetItem(Index));
end;
procedure TParserErrors.SetItem(Index: Integer; Value: TParserError);
begin
inherited SetItem(Index, Value);
end;
{ TValidationParser }
constructor TValidationParser.Create;
begin
inherited Create;
FErrors := TParserErrors.Create(TParserError);
end;
destructor TValidationParser.Destroy;
begin
FErrors.Free;
inherited Destroy;
end;
procedure TValidationParser.SetErrors(const Value: TParserErrors);
begin
FErrors.Assign(Value);
end;
procedure TValidationParser.AddErrorFmt(const AMessage: string; Params: array of
const; Parser: TStringParser; const AFileName: string = '');
begin
with FErrors.Add do
begin
FileName := AFileName;
Line := Parser.SourceLine;
Message := Format(AMessage, Params);
Position := Parser.SourcePos;
end;
end;
procedure TValidationParser.AddError(const AMessage: string; Parser: TStringParser;
const AFileName: string = '');
begin
AddErrorFmt(AMessage, [], Parser, AFileName);
end;
end.
Now we can start developing the real parser by inheriting it from the TValidationParser. Again, I don't want to show you the whole sourcecode here, so I pick up only the sapid one. Our dtd parser is a so- called two-way parser, i.e. it uses the first pass to parse the elements and the second pass to parse the attributes. This is useful, because an attibute can refer to an element which is placed below it and otherwise we'll get some unneeded errors. The main method of our parse is Parse, which needs the name of the file to be parsed as the first parameter, and a pre-initialized TDTDDocument as the second parameter. A sample call should looks like:
Sample Call:
// Create DTDDocument.
DTDDocument := TDTDDocument.Create;
try
// Create DTDParser.
DTDParser := TDTDParser.Create;
try
// Parse File.
DTDParser.Parse(FileName, DTDDocument);
// Display possible Errors.
if DTDParser.Errors.Count > 0 then
begin
for i := 0 to DTDParser.Errors.Count - 1 do
with DTDParser.Errors[i] do
WriteLn(Format('Error in Line %d, Pos %d: %s...', [Line, Position,
Message]));
Exit;
end;
...
// Free DTDParser.
finally
DTDParser.Free;
end;
// Free DTDDocument.
finally
DTDDocument.Free;
end;
But now, let's take a look at some sourcecode lines of the parser implementation. The first think we had to do is to inherited our parser from the parent class:
Parser Implementation (Snippet):
type
{ EDTDParser }
EDTDParser = class(Exception);
{ TDTDParser }
TDTDParser = class(TValidationParser)
private
{ Private declarations }
procedure ParseElement(Parser: TStringParser; Document: TDTDDocument; const Pass:
Integer);
procedure ParseAttlist(Parser: TStringParser; Document: TDTDDocument);
procedure ParseFile(const FileName: string; Document: TDTDDocument; const Pass:
Integer = 0);
public
{ Public declarations }
procedure Parse(const FileName: string; var Document: TDTDDocument);
end;
Afterwards we implement the Parse method which calls the internal method ParseFile on her part:
Method "Parse":
procedure TDTDParser.Parse(const FileName: string; var Document: TDTDDocument);
var
TmpDocument: TDTDDocument;
begin
if not assigned(Document) then
raise EDTDParser.Create('Document not assigned!');
TmpDocument := TDTDDocument.Create;
try
ParseFile(FileName, TmpDocument);
if Errors.Count = 0 then
Document.Assign(TmpDocument);
finally
TmpDocument.Free;
end;
end;
As you can see, we create a special temporar document to store the parsed objects in. I've done this because I don't want to return the document if it is full of errors - I assign a exact copy of the objects only, if no errors occured. The method ParseFile implements the proper parsing calls to the StringParser and creates the real objects. Followed a code snippet of the method body:
Method "ParseFile":
procedure TDTDParser.ParseFile(const FileName: string;
Document: TDTDDocument; const Pass: Integer = 0);
var
Parser: TStringParser;
begin
Parser := TStringParser.Create;
try
if not Parser.LoadFromFile(FileName) then
begin
AddErrorFmt('File "%s" not found', [FileName], Parser);
Exit;
end;
while True do
begin
while not (Parser.Token in [toEOF, '<']) do
Parser.SkipToken;
if Parser.Token = toEOF then
Break;
Parser.SkipToken;
if Parser.Token <> '!' then
begin
if not (Parser.Token in ['?']) and (Pass = 1) then
AddError('InvalidToken', Parser);
Continue;
end;
if Parser.SkipToken <> toSymbol then
begin
if (Parser.Token <> '-') and (Pass = 1) then
AddError('InvalidToken', Parser);
Continue;
end;
if UpperCase(Parser.TokenString) = 'ENTITY' then
Continue;
if UpperCase(Parser.TokenString) = 'ELEMENT' then
ParseElement(Parser, Document, Pass)
else if UpperCase(Parser.TokenString) = 'ATTLIST' then
begin
if Pass = 1 then
ParseAttlist(Parser, Document);
end
else if Pass = 1 then
AddErrorFmt('Invalid Symbol "%s"', [Parser.TokenString], Parser);
end;
if Pass = 0 then
ParseFile(FileName, Document, 1);
finally
Parser.Free;
end;
end;
This method calls some other functions (ParseElement and ParseAttlist) which parses the internal structures of an element or an attribute. Look at the whole sourceode to understand.
What's next??
Well, this article has shown you how easy it is to write a customizeable parser which can parse any kind of data - it's up to you, how complex it should be. The main benefit in using this kind of parsing is, that you don't need to incorporate in complex systems like LexParser.
Continue reading my second article:
Building an Easy-to-Use Parser/Parsing Framework (Part II)
Feliratkozás:
Megjegyzések küldése (Atom)
Nincsenek megjegyzések:
Megjegyzés küldése