2004. február 25., szerda

Parsing the Web


Problem/Question/Abstract:

Three Classes for Grabbing HTML/XML Information

Answer:

I recently bought one of the new digital satellite dishes and ran across an interesting challenge - figuring out just what was on, and when. DirectTV provided a Web site with a basic search engine, but the Web-based search was slow and very narrowly focused. As a typical programmer, I knew I could provide a better, more powerful UI, if I could just figure out how their Web search engine worked.

A quick scan of the HTML source pointed out a relatively simple search form that I could easily duplicate, but the HTML results came back in a mildly complicated HTML table. Brute force code would have been simple enough to construct to parse through the table, but I'd been looking for a reason to build a more general parser, so off I went. If I'd known just how lax the HTML rules are, and just how many hacks there are, I'd have just stuck with the brute force method and saved myself a lot of agony, but since I'm here now ...

The Basics

To put together a parser for HTML, an understanding of the rules is required. HTML originated as an SGML-like syntax, and over time has grown to fit more closely within the confines of SGML. These days the syntax is described within an SGML Data Type Definition (DTD) bringing it into a reasonably well-understood and managed domain. Given that SGML now establishes the underpinnings of HTML, the parser should apply the SGML syntax rules as a starting point. This also allows for the consideration of some simple extensions that allow parsing of XML.

Therefore, the parser is built to work on SGML in general, with specific handlers for exceptions and extensions that occur in HTML and XML. The rules for SGML are straightforward, and provide five basic constructs that we care about:

elements,
attributes,
comments,
SGML directives, and
"everything else"

Elements are the facet of the SGML content with which we are most concerned, and around which the parser is established. Elements have a start tag, content, and an end tag, for example:

<TITLE>HTML Parsing</TITLE>

where TITLE is considered to be the "name" of the tag. Element names are case-insensitive. So we start with the following parsing rules:

Element start and end tags are surrounded by < and > characters.
Element end tags are denoted by the / character immediately following a <.
Element content is surrounded by start and end tags.

HTML Extensions

In HTML, we immediately note that there are exceptions to these rules. For some elements - most notably <P> - the end tags may be omitted, even though the element may have contents. This offers perhaps the most annoying challenge of the HTML parsing rule set, because there are several methods by which the element may be terminated.

To start with, we note another syntax rule from SGML: elements may not span. That is, if an element's start tag is contained within another element's start and end tags, its end tag must also appear there. Put simply, if we encounter an end tag, all omitted end tags are considered "closed" back up to the matching start tag. Also, by observation (I couldn't find a formal rule for this in the HTML specification), virtually all elements with optional end tags close when they encounter another of themselves, <LI> and <OPTION> being fine examples. Further, the HTML reference material does indicate that <P> elements that omit their end tags are terminated by "block elements." The HTML DTD must be consulted to determine which elements are considered block elements. Unfortunately, all of this prevents us from using a general rule, and requires that we become concerned with the HTML DTD.

A quick consideration of the DTD is therefore in order. The DTD calls out which elements must have end tags, and which may omit (or are forbidden to have) end tags. For example, the DTD fragment for <P> is:

<!ELEMENT P - O (%inline;)* >

The important thing to note here is - O. The - indicates the start tag is required, and O means the end tag may be omitted. Compare this to the fragment for <BR>:

<!ELEMENT BR - O EMPTY >

where - O EMPTY indicates that the start tag is required. Since the element is EMPTY, however, the optional end tag is now expressly forbidden. In the fragment for <P>, the (%inline;)* shows the legal contents of the P element. In this case, %inline; refers to a list of elements defined earlier in the DTD. Notably missing from the %inline; list is <P> itself. A perusal of the other ambiguous elements reinforces the observation that in general, an element may not immediately contain itself (although this ambitiously general rule is certainly not guaranteed to remain valid for future releases of the HTML DTD). In the same way that %inline; is defined, so there exists a list name %block; which contains the list of block elements.

This leads to another set of parsing rules:

<P> with an omitted end tag is terminated when a block element is encountered.
Elements are terminated when another element of the same name is encountered.
Elements are terminated if a parent's end tag is encountered; no spans are allowed.

Attributes

Attributes represent the various properties of elements. By definition, attributes appear in name/value pairs within the start tag of the element. For example, in:

<BODY bgcolor="#FFFFFF">

the BODY element has an attribute name bgcolor, and the attribute has a value of #FFFFFF. Double quotation marks or single quotation marks are required to delimit the value, unless the value contains only letters, digits, hyphens, and periods. If the value contains single quotation marks, it should be delimited with double quotation marks, and vice versa. Attribute names are case-insensitive. Also worth noting is that not all attributes have a value. For instance, the NOWRAP attribute of the <TD> element.

Attributes appear within an element's start tag
Attributes are delimited by a space character (ASCII 32)
Attribute values are delimited by " or '

Comments

SGML also provides that its content may include comments. Comments are of the form:

<!-- This is a comment -->

The <! is a markup declaration open delimiter, and indicates that an SGML directive is to follow. Comments are specifically denoted by -- following the open delimiter (white space is not allowed between the <! and the --, but is allowed between the closing -- and >). Further, a comment may contain < and > characters. Comments may not include other comments.

<!-- indicates a comment; --> terminates a comment.
< and > are ignored while parsing a comment.

The remaining SGML directives are denoted by the beginning markup declaration open delimiter <!. To further complicate things, comments may exist within the directives delimited by --.


<! denotes an SGML directive (if it's not a comment)
Comments within the directives are delimited by --

Lastly, we consider what remains. Content of elements not contained within a start tag, end tag, or comment is considered by the parser to be PCData (parsed character data).

Store text not included in element start/end tags or comments as PCData.

XML Extensions

As mentioned before, XML is also derived from SGML. While HTML is basically a DTD described within and using SGML, XML is a subset of SGML capable both of representing data and containing other DTDs of its own. XML also demonstrates that those working on the standards in the programming community actually do learn from the mistakes of those that went before them. For instance, the rules of containment are much more formal in XML than they are in HTML, making parsing a great deal simpler. This means that while a DTD may be included in an XML document for syntax checking purposes, it isn't necessarily required for the actual parsing of the XML content, as it is for HTML.

Knowing this, we can add two more rules and provide DTD-less XML parsing as well. For one, empty elements in HTML are simply called out as such in the DTD with their end tags forbidden (<BR> for example). If an element in XML is to be empty (that is, it will have no content) its start tag may have a / just before the closing > indicating that no content and no end tag will follow. Additionally, XML directives may appear with the ? character rather than !.

Empty elements in XML may be terminated by a / just before the > in the start tag, e.g. <partno/>.
Additional directives appear using ?, instead of the ! character.

Everything Else

Any items encountered in the content that are not contained in element start or end tags, comments, or DTD items are considered by the parser to be PCData. The content of elements fits this bill, as do carriage returns and line feeds encountered by the parser. This leads to the final parsing rule:

Any content located outside of start/end tags, comments, or DTD items is PCData.

Additional Considerations

It is also worth noting that syntax errors and occurrences of "browser-tolerated" HTML inconsistencies are frequently encountered, and as such, should not raise exceptions except in extreme cases. Instead, a warning should be noted and parsing should continue if possible.

The Parser

The goal of parsing the SGML content is to place the data it represents into a form more readily accessible to other components. Since SGML calls out a hierarchical structure a hierarchy is probably the most accurate way to store the parsed content. With that in mind, the parser is built from two primary classes, and a third supporting class.

First and foremost is the THTMLParser class. Its Parse method accepts the content to be parsed, and places the processed results in the Tree property.
Next is the TTagNode class in which the parsed results are contained. This class is a hierarchical storage container with Parent pointing to the TTagNode that contains the current node, and Children containing a list of children immediately owned by the current node.
TTagNodeList is provided as a list container for a collection of TTagNode objects, typically produced by a call to the GetTags method of the TTagNode class.

A Simple Example

Consider the sample HTML shown in Figure 1. The parser would produce from the HTML a hierarchy that can be visualized as in Figure 2. Each of the boxes in the tree represents a TTagNode instance.

<HTML>
<HEAD>
<TITLE>Example HTML</TITLE>
</HEAD>
<BODY>
<!-- Insert non-sensical comment here -->
<H1>Example HTML</H1>
Plain old text right out there in the middle of the document.
<P>Text contained within in paragraph</P>
<B>Unordered List</B>
<UL>
<LI>Item #1
<LI>Item #2
<LI>Item #3
</UL>
</BODY>
</HTML>
Figure 1: Sample HTML.


Figure 2: Hierarchical representation of parsed HTML.

Each node has a NodeType property that indicates what type of node it is. All node types except ntePCData also have text in the Caption property that provides more information about the node's contents. See the table in Figure 3 for details.

Node Contents
NodeType
Caption
HTML elements
nteElement
Element name
Text
ntePCData

Comments
nteComment
!
SGML/XML/DTD directives
nteDTDItem
! or ? and directive

Figure 3: TTagNode node types.

For example, the HTML node in Figure 2 has a NodeType of nteElement, while the comment tag is of type nteComment, and the PCData nodes are of type ntePCData.

The content or text for each HTML element is contained in a node in its Children list. For example, the TITLE node in Figure 2 has a PCData node whose Text property contains "Example HTML". The GetPCData method of a TTagNode returns the PCData text for all children of the node for which it's called. Note, this method is recursive and will return the PCData text for all nodes in the tree beneath the node upon which it's called.

Retrieving Elements

The GetTags method of a TTagNode will return a list of all children that match an element name. If '*' or '' is specified as the element name, then all children will be returned. Note that this method is recursive. The result list is a TTagNodeList.


The code in Figure 4 illustrates how the GetTags method is used to collect a list of all <LI> elements in the HTML from Figure 1 and insert their contents into a list box. The process is as follows:

Create a container for the results, i.e. the Elements list.
Call the GetTags method, passing the desired element name and the container.
Iterate through the container, placing the text for each element in a list box.
Destroy the container.


procedure Button1OnClick(Sender: TObject);
var
  Elements: TTagNodeList;
  Counter: Integer;
begin
  Elements := TTagNodeList.Create;
  HTMLParser1.Tree.GetTags('LI', Elements);
  for Counter := 0 to Elements.Count - 1 do
    ListBox1.Items.Add(Elements[Counter].GetPCData);
  Elements.Free;
end;
Figure 4: Using GetTags.

TTagNodeList has two methods that offer another approach (assuming that a result set has already been acquired): GetTagCount and FindTagByIndex. GetTagCount returns a count of the occurrences of an element name. FindTagByIndex returns the index within the list of the specified occurrence of an element name. For instance, the statement:

ShowMessage(Elements.FindTagByIndex('li', 1).GetPCData);

were it included in Figure 4, would display the text for the second occurrence of the <LI> element in the Elements container. This can prove exceptionally useful for locating a specific tag from target HTML content. For example, if the third <TABLE> in the HTML contained the desired data, the following code would make quick work of locating the root <TABLE> node:

HTMLParser1.GetTags('*', Elements);
Node := Elements.FindTagByIndex('table', 2);
if Assigned(Node) then
begin
  // Perform processing on <TABLE> node.
end;

Working with Results as a Hierarchy

The procedure in Figure 5 provides yet another method for accessing the contents of the Tree (albeit a more brute force approach). In this example, the HTML content is assumed to be fairly straightforward: <TABLE> elements contain <TR> elements, which contain <TH> and <TD> elements. Given this reasonably accurate assumption, the code will walk the children of a <TABLE> node, and for each <TR> node found will walk its children looking for occurrences of either <TH> or <TR>, and add their text (contained in PCData nodes) to a TStrings container, e.g. the Lines property of a TMemo control.

// Return TableNode's contents in a TStrings container.

procedure GetTable(TableNode: TTagNode; Lines: TStrings);
var
  RowCtr,
    DataCtr: Integer;
  Node,
    RowNode: TTagNode;
  TempStr: string;
begin
  Lines.Clear;
  if CompareText(TableNode.Caption, 'table') = 0 then
  begin
    for RowCtr := 0 to TableNode.ChildCount - 1 do
    begin
      RowNode := TableNode.Children[RowCtr];
      if CompareText(RowNode.Caption, 'tr') = 0 then
      begin
        TempStr := '';
        for DataCtr := 0 to RowNode.ChildCount - 1 do
        begin
          Node := RowNode.Children[DataCtr];
          if CompareText(Node.Caption, 'td') = 0 then
            TempStr := TempStr + Node.GetPCData + #9
          else if CompareText(Node.Caption, 'th') = 0 then
            TempStr := TempStr + Node.GetPCData + #9;
        end;
        TempStr := Trim(TempStr);
        if TempStr <> '' then
          Lines.Add(TempStr);
      end;
    end;
  end;
end;
Figure 5: Working with results as a hierarchy.

As an illustration of just how difficult HTML processing can be, the following caveats apply to the code provided in Figure 5 and would have to be handled to provide a robust solution:

Subtables are not handled. That is, <TABLE> elements encountered within a <TD> element are ignored.
Row and column spanning is not handled.
<TBODY>, <THEAD> and a host of other table elements are not considered (although admittedly they are rare).

Working with Results as a List

The procedure in Figure 6 demonstrates working with the parsed results as a list. The goal here is to retrieve a list of all comments, links, meta tags, and images from the document, with the <TITLE> thrown in for good measure. The code does this in several simple steps:

Parse the HTML.
Create a container for the list of matching nodes from the tree.
Call the GetTags method passing '*', and the container ('*' indicates that all items in the tree should be returned).
Iterate through the container collecting matches and place their contents in the StringList.
Destroy the container.


procedure TForm1.Parse(HTML: string; Lines: TStrings);
const
  cKnownTags = '|title|img  |a    |meta |!    ';
  cTITLE = 0;
  cIMG = 1;
  cA = 2;
  cMETA = 3;
  cComment = 4;
var
  Index,
    Counter: Integer;
  TempStr: string;
  Nodes: TTagNodeList;
begin
  HTMLParser1.Parse(HTML);
  Nodes := TTagNodeList.Create;
  // Retrieve all nodes.
  HTMLParser1.Tree.GetTags('*', Nodes);
  for Counter := 0 to Nodes.Count - 1 do
  begin
    TempStr := '|' + LowerCase(Nodes[Counter].Caption);
    // Index of element name.
    Index := Pos(TempStr, cKnownTags);
    if Index > 0 then
    begin
      Index := Index div 6;
      case Index of
        cTITLE:
          Lines.Add('Title=' +
            HTMLDecode(Nodes[Counter].GetPCData));
        cIMG:
          begin
            TempStr := Nodes[Counter].Params.Values['src'];
            if TempStr <> '' then
              Lines.Add(
                Nodes[Counter].Params.Values['src']);
          end;
        cA:
          begin
            TempStr :=
              Nodes[Counter].Params.Values['href'];
            if TempStr <> '' then
              Lines.Add(TempStr + '=' +
                HTMLDecode(Nodes[Counter].GetPCData));
          end;
        cMETA:
          with Nodes[Counter].Params do
            Lines.Add(Values['name'] + '=' +
              Values['content']);
        cComment:
          Lines.Add('[Comment] ' +
            HTMLDecode(Nodes[Counter].Text));
      end; {  case Index }
    end; {  if Index > 0 }
  end; {  for Counter := 0 to Nodes.Count - 1  }
  Nodes.Free;
end;
Figure 6: Working with results as a list.

The important thing to understand here is that the TTagNodeList class is just a list of pointers to nodes from the Tree. This is quite beneficial in that once a desired node is located in the list, it may be used as if it had been acquired by traversing the tree. For example, when the case statement in Figure 6 encounters a TITLE element, its contents are retrieved by making a call to the TITLE node's GetPCData method (which depends on the parsed tree structure behaving as it appears to in Figure 2). Note that the PCData often contains encoded items such as &gt; and &lt; (< and > respectively). HTMLDecode is provided for handling most simple cases, but doesn't handle all cases (notably non-US character encoding).

This example also demonstrates the use of the attributes from an element. When an <A> element is encountered, the HREF attribute is examined. If it exists, the <A> element is treated as a link to some other resource. If the HREF attribute were not specified, this might be an instance of <A> serving as an anchor instead of a link. For more details on how the Params property of the TTagNode behaves, see the Delphi help for the Names and Values properties of the TStrings class.

Searching the www.directv.com Program Guide

Applying the HTML parser to the original need turns out to be another simple (albeit involved) exercise. First, an understanding of the CGI scripts that allow searching of the program guide is required. As it turns out, the search script is rather crude and accepts only three parameters: timezone, category, and search text. timezone is simply a number representing Eastern, Central, Mountain, or Pacific. category allows the search to span all programs, or to be narrowed to certain types of programs such as movies or sports. The search text should be all, or a portion of, the desired program name. The search is specific to program names, and doesn't consider program descriptions.

The results of the search are returned as an HTML table including channel, date, time, duration, and program name. The program name is contained within a link to the program description, which will need to be retrieved as well. This is slightly complicated by the fact that the link provided is written using JavaScript which we cannot simply call. However, the URL produced by the JavaScript function is easy to replicate, as it contains a program ID number that can be passed to another CGI script that returns the desired description.

The next step is to parse the HTML and process the results into a more useful format. TStringTable is provided as a simple container for just this purpose. The TStringTable offers a non-visual equivalent to TStringGrid with a few additional methods to make manipulating the data a bit easier. Once the HTML table has been processed and placed in the string table, a bit of house cleaning is required. For one, there are rows at the end of the HTML table that need to be ignored, as they contain images, not program content. Also, the channel appears only in the first row of a set of programs that occur on that channel.

The contents can now be added to a ListView. Once that task is complete, the descriptions can be fetched using the program IDs to call the description CGI, and then added to the ListView.

Further Study

While these examples are not terribly glamorous, the parser can be applied to more meaningful problems. For instance, a friend of mine has put together an extremely handy application using the Pricewatch site (http://www.pricewatch.com/) to monitor prices on PC hardware. Pricewatch offers a current snapshot of pricing of a particular piece of hardware from various vendors, usually sorted from least expensive to most. However, it doesn't allow for viewing of several different pieces of hardware at once, and it doesn't track the history of the price changes for the hardware. So, the application provides a way to build a list of hardware to be tracked, and then offers a simple trend analysis by gathering and saving off the price information on a regular basis. This provides a useful picture for the consumer of just how quickly the prices are moving downward on a particular item. If the pricing is in a steep downward curve, waiting to purchase might be wise. If the curve is flat, the time to purchase might be at hand.

The parser is used not only to retrieve the pricing data, but also to help deal with one of the more significant issues facing those attempting to interface with sites they do not control: unexpected changes in the target site's contents. In this case, a review of the <form> elements from the main page is performed to ensure that the query mechanism remains intact. As a further safeguard, the search results page is also examined to verify a match against the expected HTML format. If unexpected items are found in either case, processing cannot continue, but at least the user can be warned of the situation.

In a more interesting demonstration of the parser's abilities, it has been combined with a database to create a poor man's OODB (object-oriented database). XML is used to wrap the data, and is then stored in text fields in the database. When needed, the XML is retrieved and the parser used to extract the data. Without going in to detail, this is useful because the data stored in the database can carry semantic information with it (the XML elements) that provides information about the data's structure. In systems where the data structure is dynamic, this provides a simple way to avoid excessive database maintenance and further provides a clean, easily understood mechanism for the exchange of data between various applications and platforms. In the case mentioned here, a legacy defect (bug) tracking system hosted on a Solaris platform was wrapped in a Delphi based UI.

Additional Demonstration Applications

To further demonstrate the power of the classes presented in this article, two additional applications accompany this article. An HTML Engine Demo application (see Figure 7) displays a great deal of information about any selected URL, including meta tags, links, and images.


Figure 7: The HTML Engine Demo application.

The Parser Test application parses any URL, or SGML/HTML/XML document, and displays the results in a TreeView (see Figure 8). It can also display links, selected tags, text, etc.


Figure 8: The Parser Test application.

Room for Improvement

This parser builds a reasonable basis for parsing of HTML and XML but, offers significant room for further development.

Incorporation of a DTD processor (DTDs can be parsed with the existing parser, but no handling of the parsed contents is provided). This would provide two main benefits: more thorough parsing of elements based a true understanding of their legal contents and no need for hard coding a representation of the HTML DTD within the parser. Further, DTD-based XML parsing would then be possible.

A DOM container model to complement the TTagNode model. DOM represents a fairly well understood and commonly encountered model for representing the parsed contents of XML. While it doesn't suit all needs, it does provide a useful, standard way to communicate about the parsed elements.

XQL or other suitable extended model query mechanism. The GetTags method is reasonably sufficient, but for more exhaustive queries against XML contents a more advanced mechanism is desired. For example, it would be extremely handy if GetTags could be passed 'order/partno' to indicate that we're searching for all PARTNO elements that are immediately below an ORDER element.

Further performance tuning. While some attention has been paid to this area, no extreme efforts to speed things up were applied. Most notably, Delphi's string routines are not considered to be as fast as those provided in some third-party string-handling collections (notably HyperString).

Resources and Alternatives

For more information and examples, please visit http://www.dallas.net/~richardp/delphi/components/delphicomps.htm.

HTML 4.0 Specification -http://www.w3.org/TR/REC-html40. This is the single most useful resource to those seeking HTML enlightenment. It is extremely detailed and well written.

HTML 4.0 Loose DTD -http://www.w3.org/TR/REC-html40/loose.dtd. A part of the HTML 4.0 specification, this offers the exact specification of just what the HTML rules are. It is upon this DTD (as opposed to the "strict" DTD) that the parser is designer to operate.

"XML: Creating Structures of Meaning," Visual Developer, Nov/Dec 1998, Vol. 9 No. 4. A quick survey of XML for the beginner. Syntax and use are explored here with an eye to bringing the novice on board.

"Using Internet Explorer's HTML Parser," Dr. Dobb's Journal, #302, August 1999 (http://www.ddj.com/articles/1999/9908/9908toc.htm). This article offers an examination of using the HTML parser that is available within Microsoft's Internet Explorer via COM interface. The source for the article is in C++, but it's not difficult to follow.

"XML from Delphi," Delphi Informant Magazine, July 1999, Vol. 5 No. 7. A beginner's explanation of XML, and an excellent example of using the XML parser included in Internet Explorer 4.0.

Begin Listing One - Searching www.directv.com
unit Unit1;

interface

uses
  Windows, Messages, SysUtils, Classes, Controls, Forms,
  Dialogs, ComCtrls, StdCtrls, ExtCtrls, GetURL, HTMLMisc,
  HTMLParser, StringTable;

type
  TForm1 = class(TForm)
    WIGetURL1: TWIGetURL;
    StatusBar1: TStatusBar;
    Panel2: TPanel;
    pbSearch: TButton;
    ebSearchText: TEdit;
    Panel3: TPanel;
    ListView1: TListView;
    HTMLParser1: THTMLParser;
    procedure pbSearchClick(Sender: TObject);
    procedure WIGetURL1Status(Sender: TObject;
      Status: Integer; StatusInformation: Pointer;
      StatusInformationLength: Integer);
  private
    procedure GetTable(TableNode: TTagNode;
      Table: TStringTable);
    function GetDescription(Node: TTagNode): string;
  end;

var
  Form1: TForm1;

implementation

{$R *.DFM}

const
  cDirecTVSearchURL =
    'http:// 206.17.88.15/cgi-bin/pgm_search.cgi/';
  cDirecTVDescURL =
    'http:// 206.17.88.15/cgi-bin/pgm_desc.cgi/';

  // Return TableNode's contents in Table

procedure TForm1.GetTable(TableNode: TTagNode;
  Table: TStringTable);
var
  RowCtr,
    DataCtr: Integer;
  Node,
    RowNode: TTagNode;
begin
  Table.Clear;
  if LowerCase(TableNode.Caption) = 'table' then
  begin
    for RowCtr := 0 to TableNode.ChildCount - 1 do
    begin
      RowNode := TableNode.Children[RowCtr];
      if LowerCase(RowNode.Caption) = 'tr' then
      begin
        Table.NewRow;
        for DataCtr := 0 to RowNode.ChildCount - 1 do
        begin
          Node := RowNode.Children[DataCtr];
          if LowerCase(Node.Caption) = 'td' then
            Table.AddColumnObject(Node.GetPCData, Node)
          else if LowerCase(Node.Caption) = 'th' then
            Table.AddHeader(Node.GetPCData);
        end;
        if Table.Row[Table.RowCount - 1].Count <= 0 then
          Table.DeleteRow(Table.RowCount - 1);
      end;
    end;
  end;
end;

function TForm1.GetDescription(Node: TTagNode): string;
var
  TempStr: string;
begin
  Result := '';
  if Node.ChildCount > 0 then
    TempStr := Node.Children[0].Params.Values['href']
  else
    TempStr := '';

  if TempStr <> '' then
  begin
    // Parse out the description id
    Delete(TempStr, 1, Pos('(', TempStr));
    Delete(TempStr, Pos(')', TempStr), Length(TempStr));
    WIGetURL1.URL := cDirecTVDescURL + TempStr;
    Screen.Cursor := crHourglass;
    Application.ProcessMessages;
    WIGetURL1.GetURL;
    Screen.Cursor := crDefault;
    StatusBar1.Panels.Items[0].Text := '';
    if WIGetURL1.Status = wiSuccess then
    begin
      TempStr := WIGetURL1.Text;
      // Use brute force to scrape out program description.
      if Pos('<BLOCKQUOTE>', TempStr) > 0 then
      begin
        Delete(TempStr, 1, Pos('<BLOCKQUOTE>', TempStr) + 11);
        Delete(TempStr, Pos('</BLOCKQUOTE>', TempStr),
          Length(TempStr));
      end;
      Result := TempStr;
    end;
  end
end;

procedure TForm1.pbSearchClick(Sender: TObject);
const
  tzPacific = '0'; // Time zones.
  tzMountain = '1';
  tzCentral = '2';
  tzEastern = '3';
  cgMovies = '0'; // Categories.
  cgSports = '1';
  cgSpecials = '2';
  cgSeries = '3';
  cgNews = '4';
  cgShopping = '5';
  cgAllCategories = '-1';
var
  Cols,
    Rows: Integer;
  NewItem: TListItem;
  Node: TTagNode;
  Nodes: TTagNodeList;
  ResultTable: TStringTable;
  TempStr: string;
begin
  if ebSearchText.Text = '' then
    Exit;
  WIGetURL1.URL := cDirecTVSearchURL + tzCentral + '/' +
    cgAllCategories + '/' + urlEncode(ebSearchText.Text);
  Screen.Cursor := crHourglass;
  Application.ProcessMessages;
  WIGetURL1.GetURL;
  Screen.Cursor := crDefault;
  StatusBar1.Panels.Items[0].Text := '';
  if WIGetURL1.Status = wiSuccess then
  begin
    if Pos('No program titles that match',
      WIGetURL1.Text) > 0 then
      ShowMessage('No matches found')
    else
    begin
      // Attempt to parse HTML table we're looking for.
      HTMLParser1.Parse(WIGetURL1.Text);
      Nodes := TTagNodeList.Create;
      HTMLParser1.Tree.GetTags('table', Nodes);
      if Nodes.Count > 0 then
      begin
        ResultTable := TStringTable.Create;
        GetTable(Nodes[0], ResultTable);
        // Get rid of image tags at bottom of search
        // response (the 2nd column has no contents).
        with ResultTable do
          for Rows := RowCount - 1 downto 0 do
            if Cells[1, Rows] = '' then
              DeleteRow(Rows);
        // Ensure all cells are filled appropriately
        // (in the HTML table, a RowSpan attribute
        // allows the "Channel" to be displayed in
        // one cell for several programs).
        with ResultTable do
          for Rows := 0 to RowCount - 1 do
            for Cols := 0 to ColCount - 1 do
              if Cells[Cols, Rows] = '' then
                if Rows > 0 then
                  Cells[Cols, Rows] :=
                    Cells[Cols, Rows - 1];
        // Add items to ListView (Program, Channel,
        // Data, Time).
        ListView1.Items.Clear;
        for Rows := 0 to
          ResultTable.RowCount - 1 do
        begin
          NewItem := ListView1.Items.Add;
          NewItem.Caption :=
            ResultTable.Cells[4, Rows];
          NewItem.SubItems.Add(
            ResultTable.Cells[0, Rows]);
          NewItem.SubItems.Add(
            ResultTable.Cells[1, Rows]);
          NewItem.SubItems.Add(
            ResultTable.Cells[2, Rows]);
        end;
        // Retrieve program descriptions (program id
        // contained in 4th column's node).
        for Rows := 0 to
          ResultTable.RowCount - 1 do
        begin
          // It's rude to whack the server :-)
          Sleep(500);
          Node :=
            TTagNode(ResultTable.Objects[4, Rows]);
          TempStr := GetDescription(Node);
          ListView1.Items[Rows].
            SubItems.Add(TempStr);
        end;
        ResultTable.Free;
      end // if Nodes.Count > 0 ...
      else
        ShowMessage(
          'Error - Expected table...found none');
    end; // else of Pos('No program titles that...
  end // WIGetURL1.Status = wiSuccess...
  else
    ShowMessage('Unable to contact search server [' +
      WIGetURL1.ErrorMessage + ']');
end;

end.
End Listing One


Component Download: http://www.baltsoft.com/files/dkb/attachment/Parsing_the_Web.zip

Nincsenek megjegyzések:

Megjegyzés küldése