2008. január 4., péntek

Building an Easy-to-Use Parser/Parsing Framework (Part I)


Problem/Question/Abstract:

How to create a simple parsing framework to parse any kind of data?

Answer:

Note:

The full sourcecodes for all components & examples descripted in the following article are available as an open-source project under SourceForge:

  Parser Framework

A second article was released on 29.01.2002 with a more detailed example:

Building an Easy-to-Use Parser/Parsing Framework (Part II)

Today, we wonna speak about "how to create a simple parser framework" in Delphi. Our goal will be a class solutions which helps up to parse any kind of data and store all relevant informations in an easy-to- access object model.

At the end of this article, we've developed a small utility, which generates a simple object model of a .dtd file and output it's xml pendant from a given starting node. In other words, we're using the parsing framework to create a parser, which is able to parse a .dtd file, extract all neccessary tags, store them in the object model and generates the xml output. Note: This utility uses a simply dtd- parser model, which don't include all routines to parse all kinds of dtd datas - it's up to you to include those features.

Our claims to the framework and object model are:

easy to use.
save/loadable object trees.
integrated error reporting.
expandable.

Okay, now let's start to develope the main parsing engine. Delphi comes with a unit called CopyPrsr which includes  the simple stream parser object TCopyParser. Try to take a look into that file to understand how it works - it's located under $(DELPHI)\Source\Internet\CopyPrsr.pas. Our framework parser is derived from that idea, but uses a simple string instead of the stream and includes some additional functions:

The boiler plate:

unit StringParser;

interface

uses
  Classes;

const
  { Additional Parser special tokens }

  toEOL = char(6);
  toBOF = char(7);

type
  { TSysCharSet }

  TSysCharSet = set of Char;

  { TStringParser }

  TStringParser = class
  private
    { Private declarations }
    FParseString: string;
    FLineTokens: Integer;
    FSourceLine: Integer;
    FSourcePos: Integer;
    FTokenPos: Integer;
    FToken: Char;
    procedure SkipBlanks;
    function GetParseString: string;
    function GetSourcePos: Integer;
    function GetTokenString: string;
  protected
    { Protected declarations }
  public
    { Public declarations }
    constructor Create;
    function LoadFromFile(const FileName: string): Boolean;
    function LoadFromStream(const Stream: TStream): Boolean;
    function SkipToEOF: string;
    function SkipToEOL: string;
    function SkipToken: Char;
    function SkipTokenString: string;
    function SkipToToken(const AToken: Char): string; overload;
    function SkipToToken(const AToken: TSysCharSet): string; overload;
    function SkipToTokenString(const ATokenString: string): string;
    property ParseString: string read GetParseString;
    property SourceLine: Integer read FSourceLine;
    property SourcePos: Integer read GetSourcePos;
    property Token: Char read FToken;
    property TokenString: string read GetTokenString;
  end;

As you can see, there are many public helper functions which you can use to parse the data. The main functions are LoadFromFile and LoadFromStream, which needs the name of the file to be parsed as the only parameter. Both functions loads the content of the file and store it to the string FParseString which can be accessed through the denominator property:

LoadFromFile/LoadFromStream:

function TStringParser.LoadFromFile(const FileName: string): Boolean;
var
  Stream: TMemoryStream;
begin
  Result := False;
  if not FileExists(FileName) then
    Exit;
  Stream := TMemoryStream.Create;
  try
    Stream.LoadFromFile(FileName);
    Result := LoadFromStream(Stream);
  finally
    Stream.Free;
  end;
end;

function TStringParser.LoadFromStream(const Stream: TStream): Boolean;
var
  MemStream: TMemoryStream;
begin
  Result := False;
  if not (assigned(Stream)) then
    Exit;
  MemStream := TMemoryStream.Create;
  try
    Stream.Seek(0, soFromBeginning);
    MemStream.CopyFrom(Stream, Stream.Size);
    FParseString := StrPas(MemStream.Memory);
    SetLength(FParseString, MemStream.Size);
    FParseString := Concat(FParseString, toEOF);
    FToken := toBOF;
    Result := True;
  finally
    MemStream.Free;
  end;
end;

The main functionality of the parsing engine is the extraction of so- called tokens. A token can be a seperator (like CR, LF or EOF) or a symbol, which can be a keyword if you plan to parse a programing language. The following functions are used to skip blank characters (which are used to seperate symbols and aren't relevant) and to extract/skip the next token symbol:

Token related functions (pullout only):

procedure TStringParser.SkipBlanks;
begin
  while True do
  begin
    FToken := FParseString[FTokenPos];
    case FToken of
      #10:
        begin
          Inc(FSourceLine);
          FLineTokens := FTokenPos;
        end;
      toEOF, #33..#255:
        Exit;
    end;
    Inc(FTokenPos);
  end;
end;

function TStringParser.SkipToken: Char;
const
  KeySet = ['A'..'Z', 'a'..'z', '0'..'9', '_'];
begin
  SkipBlanks;
  FSourcePos := FTokenPos;
  if FParseString[FTokenPos] = toEOF then
    FToken := toEOF
  else if FParseString[FTokenPos] in KeySet then
  begin
    while FParseString[FTokenPos] in KeySet do
      Inc(FTokenPos);
    FToken := toSymbol;
  end
  else
  begin
    FToken := FParseString[FTokenPos];
    Inc(FTokenPos);
  end;
  Result := FToken;
end;

function TStringParser.SkipToToken(const AToken: TSysCharSet): string;
begin
  FSourcePos := FTokenPos;
  while not ((FParseString[FTokenPos] = toEOF) or (FParseString[FTokenPos] in AToken))
    do
  begin
    if FParseString[FTokenPos] = #10 then
    begin
      Inc(FSourceLine);
      FLineTokens := FTokenPos;
    end;
    Inc(FTokenPos);
  end;
  if FParseString[FTokenPos] = toEOF then
    FToken := toEOF
  else
    FToken := FParseString[FTokenPos];
  Result := GetTokenString;
  if FToken <> toEOF then
    SkipToken;
end;

The absent functions includes alternativ possibilities to extract or skip the tokens, like SkipToTokenString or SkipToEof. Well, the next step is to create the object model, which holds all our parsed informations. As I mentioned earlier, the goal of this article it to create a simple dtd parser, so we'll create an object model to store dtd
informations.

A dtd file is used to descripe the syntax rules of a xml file, like:

DTD example:

key CDATA #REQUIRED
value CDATA #REQUIRED
>

This example demonstrated the simplest way of a dtd structure. It's not the purpose of this article to develope a highly flexible dtd parser which spots all dtd grammas, so we only include the weightly ones. Root of each object model is the document, which holds all other objects as collections:

The Root Document:

{ TDTDDocument }

TDTDDocument = class(TPersistent)
private
  { Private declarations }
  FEntities: TDTDEntities;
  FElements: TDTDElements;
  procedure SetEntities(Value: TDTDEntities);
  procedure SetElements(Value: TDTDElements);
public
  { Public declarations }
  constructor Create;
  destructor Destroy; override;
  procedure Assign(Source: TPersistent); override;
published
  { Published declarations }
  property Entities: TDTDEntities read FEntities write SetEntities;
  property Elements: TDTDElements read FElements write SetElements;
end;

As you can see, our document gives us the access of some other types of data: Entities and Elements. Entities are very hard to parse, so it's a good lesson for you to include that feature. Parsing elements is quite easier, so this type of data is better to explain here. Look at the dtd example some rows above this. You can see, that a dtd element is descripted as followed:

Our object model needs some extra fields to store such element informations. If you are not familiar with dtd or xml, look at W3CSchools - it's a good starting point to learn more about that topic. So, take a look at the following  object structure:

TDTDDocument
  |
  o--TDTDEntities
  |
  o--TDTElements
    |
    o--TDTDElementTyp
    |
    o--TDTDAttributes
      |
      o--TDTDAttributeTyp
      o--TDTDAttributeStatus
      o--Default: string
      o--TDTDEnums
    o--TDTDChild
      |
      o--TDTDTyp
      o--TDTDChilds

I've tried to "pack" the dtd grammars into an easy object model as you can see above:

Each document contains a collection of elements. Each element is descripted by an elementtype and containes in turn a collection of attributes and childs. Each attribute again is descripted by an attributetype and contains a collection of enum(erations) and so forth. Followed a code cantle from the element implementation, it's not suggestive to show you the whole code here - it's quit long and a little bit more complex:

TDTDElement:

unit DTD_Document;

interface

uses
  Classes;

type

  ...

  { TDTDElementTyp }

  TDTDElementTyp =
    (etAny, etEmpty, etData, etContainer);

  { TDTDElementStatus }

  TDTDElementStatus =
    (esRequired, esRequiredSeq, esOptional, esOptionalSeq);

  { TDTDItem }

  TDTDItem = class(TCollectionItem)
  private
    { Private declarations }
    FName: string;
  public
    { Public declarations }
    procedure Assign(Source: TPersistent); override;
  published
    { Published declarations }
    property Name: string read FName write FName;
  end;

  { TDTDItems }

  TDTDItems = class(TCollection)
  private
    { Private declarations }
    function GetItem(Index: Integer): TDTDItem;
    procedure SetItem(Index: Integer; Value: TDTDItem);
  public
    { Public declarations }
    function Add: TDTDItem;
    function Find(const Name: string): TDTDItem;
    property Items[Index: Integer]: TDTDItem read GetItem write SetItem; default;
  end;

  ...

  { TDTDElement }

  TDTDElement = class(TDTDProperty)
  private
    { Private declarations }
    FTyp: TDTDElementTyp;
    FAttributes: TDTDAttributes;
    FChilds: TDTDChilds;
    procedure SetAttributes(Value: TDTDAttributes);
    procedure SetChilds(Value: TDTDChilds);
  public
    { Public declarations }
    constructor Create(Collection: TCollection); override;
    destructor Destroy; override;
    procedure Assign(Source: TPersistent); override;
  published
    { Published declarations }
    property Typ: TDTDElementTyp read FTyp write FTyp;
    property Attributes: TDTDAttributes read FAttributes write SetAttributes;
    property Childs: TDTDChilds read FChilds write SetChilds;
  end;

  { TDTDElements }

  TDTDElements = class(TDTDProperties)
  private
    { Private declarations }
    function GetItem(Index: Integer): TDTDElement;
    procedure SetItem(Index: Integer; Value: TDTDElement);
  public
    { Public declarations }
    function Add: TDTDElement;
    function Find(const Name: string): TDTDElement;
    property Items[Index: Integer]: TDTDElement read GetItem write SetItem; default;
  end;

  ...

implementation

...

{ TDTDItem }

procedure TDTDItem.Assign(Source: TPersistent);
begin
  if Source is TDTDItem then
  begin
    Name := TDTDItem(Source).Name;
    Exit;
  end;
  inherited Assign(Source);
end;

{ TDTDItems }

function TDTDItems.Add: TDTDItem;
begin
  Result := TDTDItem(inherited Add);
end;

function TDTDItems.Find(const Name: string): TDTDItem;
var
  i: Integer;
begin
  Result := nil;
  for i := 0 to Count - 1 do
    if CompareStr(Items[i].Name, Name) = 0 then
    begin
      Result := Items[i];
      Break;
    end;
end;

function TDTDItems.GetItem(Index: Integer): TDTDItem;
begin
  Result := TDTDItem(inherited GetItem(Index));
end;

procedure TDTDItems.SetItem(Index: Integer; Value: TDTDItem);
begin
  inherited SetItem(Index, Value);
end;

...

{ TDTDElement }

constructor TDTDElement.Create(Collection: TCollection);
begin
  inherited Create(Collection);
  FAttributes := TDTDAttributes.Create(TDTDAttribute);
  FChilds := TDTDChilds.Create(TDTDChild);
end;

destructor TDTDElement.Destroy;
begin
  FAttributes.Free;
  FChilds.Free;
  inherited Destroy;
end;

procedure TDTDElement.Assign(Source: TPersistent);
begin
  if Source is TDTDElement then
  begin
    Typ := TDTDElement(Source).Typ;
    Attributes.Assign(TDTDElement(Source).Attributes);
    Childs.Assign(TDTDElement(Source).Childs);
  end;
  inherited Assign(Source);
end;

procedure TDTDElement.SetAttributes(Value: TDTDAttributes);
begin
  FAttributes.Assign(Value);
end;

procedure TDTDElement.SetChilds(Value: TDTDChilds);
begin
  FChilds.Assign(Value);
end;

{ TDTDElements }

function TDTDElements.Add: TDTDElement;
begin
  Result := TDTDElement(inherited Add);
end;

function TDTDElements.Find(const Name: string): TDTDElement;
begin
  Result := TDTDElement(inherited Find(Name));
end;

function TDTDElements.GetItem(Index: Integer): TDTDElement;
begin
  Result := TDTDElement(inherited GetItem(Index));
end;

procedure TDTDElements.SetItem(Index: Integer; Value: TDTDElement);
begin
  inherited SetItem(Index, Value);
end;

The advantage of this object model is, that you're able to easily add the document to a standard component and use Delphi's internal streaming technology to load and save the object contents of a parsed file.

The next step will be the developing of the real dtd parser. Do you remember the TStringParser descripted at the top of this article? We'll using this class to build up our parser. But, we don't want to create a parser from scratch each time we're about to parse a new kind of data - it's not mind of a framework. So first, we'll develope a small parser class from which we will inherit our dtd parser. This parent class should include the error reporting and accessable fields to some other informations:

The Private Parser class:

unit PrivateParser;

interface

uses
  Classes, SysUtils, StringParser;

type
  { TParserError }

  TParserError = class(TCollectionItem)
  private
    { Private declarations }
    FFileName: string;
    FLine: Integer;
    FMessage: string;
    FPosition: Integer;
  public
    { Public declarations }
    procedure Assign(Source: TPersistent); override;
  published
    { Published declarations }
    property FileName: string read FFileName write FFileName;
    property Line: Integer read FLine write FLine;
    property Message: string read FMessage write FMessage;
    property Position: Integer read FPosition write FPosition;
  end;

  { TParserErrors }

  TParserErrors = class(TCollection)
  private
    { Private declarations }
    function GetItem(Index: Integer): TParserError;
    procedure SetItem(Index: Integer; Value: TParserError);
  public
    { Public declarations }
    function Add: TParserError;
    property Items[Index: Integer]: TParserError read GetItem write SetItem; default;
  end;

  { TValidationParser }

  TValidationParser = class
  private
    { Private declarations }
    FErrors: TParserErrors;
    procedure SetErrors(const Value: TParserErrors);
  public
    { Public declarations }
    constructor Create;
    destructor Destroy; override;
    procedure AddError(const AMessage: string; Parser: TStringParser; const AFileName:
      string = '');
    procedure AddErrorFmt(const AMessage: string; Params: array of const; Parser:
      TStringParser; const AFileName: string = '');
    property Errors: TParserErrors read FErrors write SetErrors;
  end;

implementation

{ TParserError }

procedure TParserError.Assign(Source: TPersistent);
begin
  if Source is TParserError then
  begin
    Line := TParserError(Source).Line;
    Message := TParserError(Source).Message;
    Position := TParserError(Source).Position;

    Exit;
  end;

  inherited Assign(Source);
end;

{ TParserErrors }

function TParserErrors.Add: TParserError;
begin
  Result := TParserError(inherited Add);
end;

function TParserErrors.GetItem(Index: Integer): TParserError;
begin
  Result := TParserError(inherited GetItem(Index));
end;

procedure TParserErrors.SetItem(Index: Integer; Value: TParserError);
begin
  inherited SetItem(Index, Value);
end;

{ TValidationParser }

constructor TValidationParser.Create;
begin
  inherited Create;
  FErrors := TParserErrors.Create(TParserError);
end;

destructor TValidationParser.Destroy;
begin
  FErrors.Free;
  inherited Destroy;
end;

procedure TValidationParser.SetErrors(const Value: TParserErrors);
begin
  FErrors.Assign(Value);
end;

procedure TValidationParser.AddErrorFmt(const AMessage: string; Params: array of
  const; Parser: TStringParser; const AFileName: string = '');
begin
  with FErrors.Add do
  begin
    FileName := AFileName;
    Line := Parser.SourceLine;
    Message := Format(AMessage, Params);
    Position := Parser.SourcePos;
  end;
end;

procedure TValidationParser.AddError(const AMessage: string; Parser: TStringParser;
  const AFileName: string = '');
begin
  AddErrorFmt(AMessage, [], Parser, AFileName);
end;

end.

Now we can start developing the real parser by inheriting it from the TValidationParser. Again, I don't want to show you the whole sourcecode here, so I pick up only the sapid one. Our dtd parser is a so- called two-way parser, i.e. it uses the first pass to parse the elements and the second pass to parse the attributes. This is useful, because an attibute can refer to an element which is placed below it and otherwise we'll get some unneeded errors. The main method of our parse is Parse, which  needs the name of the file to be parsed as the first parameter, and a pre-initialized TDTDDocument as the second parameter. A sample call should looks like:

Sample Call:

// Create DTDDocument.
DTDDocument := TDTDDocument.Create;
try
  // Create DTDParser.
  DTDParser := TDTDParser.Create;
  try
    // Parse File.
    DTDParser.Parse(FileName, DTDDocument);

    // Display possible Errors.
    if DTDParser.Errors.Count > 0 then
    begin
      for i := 0 to DTDParser.Errors.Count - 1 do
        with DTDParser.Errors[i] do
          WriteLn(Format('Error in Line %d, Pos %d: %s...', [Line, Position,
            Message]));
      Exit;
    end;

    ...

    // Free DTDParser.
  finally
    DTDParser.Free;
  end;

  // Free DTDDocument.
finally
  DTDDocument.Free;
end;

But now, let's take a look at some sourcecode lines of the parser implementation. The first think we had to do is to inherited our parser from the parent class:

Parser Implementation (Snippet):

type
  { EDTDParser }

  EDTDParser = class(Exception);

  { TDTDParser }

  TDTDParser = class(TValidationParser)
  private
    { Private declarations }
    procedure ParseElement(Parser: TStringParser; Document: TDTDDocument; const Pass:
      Integer);
    procedure ParseAttlist(Parser: TStringParser; Document: TDTDDocument);
    procedure ParseFile(const FileName: string; Document: TDTDDocument; const Pass:
      Integer = 0);
  public
    { Public declarations }
    procedure Parse(const FileName: string; var Document: TDTDDocument);
  end;

Afterwards we implement the Parse method which calls the internal method ParseFile on her part:

Method "Parse":

procedure TDTDParser.Parse(const FileName: string; var Document: TDTDDocument);
var
  TmpDocument: TDTDDocument;
begin
  if not assigned(Document) then
    raise EDTDParser.Create('Document not assigned!');
  TmpDocument := TDTDDocument.Create;
  try
    ParseFile(FileName, TmpDocument);
    if Errors.Count = 0 then
      Document.Assign(TmpDocument);
  finally
    TmpDocument.Free;
  end;
end;

As you can see, we create a special temporar document to store the parsed objects in. I've done this because I don't want to return the document if it is full of errors - I assign a exact copy of the objects only, if no errors occured. The method ParseFile implements the proper parsing calls to the StringParser and creates the real objects. Followed a code snippet of the method body:

Method "ParseFile":

procedure TDTDParser.ParseFile(const FileName: string;
  Document: TDTDDocument; const Pass: Integer = 0);
var
  Parser: TStringParser;
begin
  Parser := TStringParser.Create;
  try
    if not Parser.LoadFromFile(FileName) then
    begin
      AddErrorFmt('File "%s" not found', [FileName], Parser);
      Exit;
    end;
    while True do
    begin
      while not (Parser.Token in [toEOF, '<']) do
        Parser.SkipToken;
      if Parser.Token = toEOF then
        Break;
      Parser.SkipToken;
      if Parser.Token <> '!' then
      begin
        if not (Parser.Token in ['?']) and (Pass = 1) then
          AddError('InvalidToken', Parser);
        Continue;
      end;
      if Parser.SkipToken <> toSymbol then
      begin
        if (Parser.Token <> '-') and (Pass = 1) then
          AddError('InvalidToken', Parser);
        Continue;
      end;
      if UpperCase(Parser.TokenString) = 'ENTITY' then
        Continue;
      if UpperCase(Parser.TokenString) = 'ELEMENT' then
        ParseElement(Parser, Document, Pass)
      else if UpperCase(Parser.TokenString) = 'ATTLIST' then
      begin
        if Pass = 1 then
          ParseAttlist(Parser, Document);
      end
      else if Pass = 1 then
        AddErrorFmt('Invalid Symbol "%s"', [Parser.TokenString], Parser);
    end;
    if Pass = 0 then
      ParseFile(FileName, Document, 1);
  finally
    Parser.Free;
  end;
end;

This method calls some other functions (ParseElement and ParseAttlist) which parses the internal structures of an element or an attribute. Look at the whole sourceode to understand.

What's next??

Well, this article has shown you how easy it is to write a customizeable parser which can parse any kind of data - it's up to you, how complex it should be. The main benefit in using this kind of parsing is, that you don't need to incorporate in complex systems like LexParser.

Continue reading my second article:

Building an Easy-to-Use Parser/Parsing Framework (Part II)

Nincsenek megjegyzések:

Megjegyzés küldése