Wednesday, February 24, 2010

How to isolate text between two HTML tags


Problem/Question/Abstract:

I have a TRichEdit.Lines (TStrings) from which I want to extract a string and copy it to another string. I use ScanF to find the beginning of the string, which is '<a href', and nearly the end of the string, which is '</a>'. Then I need to find either the next '<' or the end of the line. Once I have done all this, how do I extract the string and copy it to another string?

Answer:

See the Copy function. The following routine may also be of use to you. It uses PChar-based routines such as StrPos instead of the standard string functions Pos and Copy, simply because working with pointers is a bit easier in this case.

procedure IsolateTextBetweenTags(const S: string; Tag1, Tag2: string; list: TStrings);
var
  pScan, pEnd, pTag1, pTag2: PChar;
  foundText: string;
  searchtext: string;
begin
  {Set up the pointers we need for the search. HTML tag names are not case
  sensitive, so we need to perform the search on an uppercased copy of S}
  searchtext := Uppercase(S);
  Tag1 := Uppercase(Tag1);
  Tag2 := Uppercase(Tag2);
  pTag1 := PChar(Tag1);
  pTag2 := PChar(Tag2);
  pScan := PChar(searchtext);
  repeat
    {Search for the next occurrence of Tag1}
    pScan := StrPos(pScan, pTag1);
    if pScan <> nil then
    begin
      {Found one, hop over it, then search from that position forward for
      the next occurrence of Tag2}
      Inc(pScan, Length(Tag1));
      pEnd := StrPos(pScan, pTag2);
      if pEnd <> nil then
      begin
        {Found start and end tag, isolate the text between them and add it to
        the list. We need to get the text from the original S, however, since
        we want the un-uppercased version! The offsets match because
        Uppercase does not change the length of the string}
        SetString(foundText, PChar(S) + (pScan - PChar(searchtext)), pEnd - pScan);
        list.Add(foundText);
        {Continue next search after the found end tag}
        pScan := pEnd + Length(tag2);
      end
      else
        {Error, no end tag found for start tag, abort}
        pScan := nil;
    end;
  until pScan = nil;
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
  with opendialog1 do
  begin
    filter := 'HTML files|*.HTM;*.HTML';
    if execute then
    begin
      richedit1.PlainText := true;
      richedit1.lines.loadfromfile(filename);
      memo2.clear;
      IsolateTextBetweenTags(richedit1.text, '<H1>', '</H1>', memo2.lines);
    end;
  end;
end;
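
For comparison, here is a minimal sketch of the same extraction written with Pos and Copy, as the answer first suggests. It is not part of the original answer: the name IsolateTextWithPos is hypothetical, and it assumes that SysUtils, StrUtils (for PosEx) and Classes are in the uses clause.

```pascal
{Sketch only. Assumes SysUtils, StrUtils and Classes in the uses clause.
 IsolateTextWithPos is a hypothetical name, not from the original answer}
procedure IsolateTextWithPos(const S, Tag1, Tag2: string; List: TStrings);
var
  SearchText, UTag1, UTag2: string;
  StartPos, EndPos: Integer;
begin
  {An empty tag cannot delimit anything, so bail out early}
  if (Tag1 = '') or (Tag2 = '') then
    Exit;
  {Search in an uppercased copy, but copy the result from the original S}
  SearchText := UpperCase(S);
  UTag1 := UpperCase(Tag1);
  UTag2 := UpperCase(Tag2);
  StartPos := Pos(UTag1, SearchText);
  while StartPos > 0 do
  begin
    {Hop over the start tag, then look for the end tag from there}
    Inc(StartPos, Length(UTag1));
    EndPos := PosEx(UTag2, SearchText, StartPos);
    if EndPos = 0 then
      Break; {no end tag found for this start tag, stop}
    List.Add(Copy(S, StartPos, EndPos - StartPos));
    {Continue the next search after the end tag just found}
    StartPos := PosEx(UTag1, SearchText, EndPos + Length(UTag2));
  end;
end;
```

In the button handler above, it could be called in place of the PChar version as IsolateTextWithPos(richedit1.text, '<H1>', '</H1>', memo2.lines). The string-index version avoids pointer arithmetic, at the cost of carrying character positions around explicitly.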
