Tuesday, May 1, 2007

Introductory Principles of Indexed Searching


Problem/Question/Abstract:

Describe the principles of an indexed search engine (like Google).  How does this work when a large amount of text information is compressed into an index, so that a search can be completed in a very short time?

Answer:

Introduction

There are really two main ways to search a large collection of text documents.  The simplest method is to load each document and scan through it for the search terms; this is referred to as a full text scan.  The second, much faster method is to create an index and then search the index.  An index is a list of terms found in a document or set of documents.  Each word appears only once per document, so the index is much shorter than the original document.  

Creating an index

Finding the words
In order to create an index you must first parse the document.  Parsing is the process of picking out the individual tokens (terms or words) in a piece of text.  A parser is a type of state machine, and there are many existing parsing routines available.  "The Tomes of Delphi: Algorithms and Data Structures" by Julian Bucknall contains many very good parsers.  Here is an example.

A simple parser scans through a string of text, starting at the beginning and looking at each character.  If the character is a letter or number then it is part of a word; if it is white space or punctuation then it is a separator.  Each word is added to a list (e.g. a TStringList) in the order it is found in the document.  Typically each word is converted to the same case (upper or lower).  
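Here is a minimal sketch of such a parser in Delphi.  The TextToWords name and the exact character classes are illustrative assumptions, not taken from any particular library:

uses
  Classes, SysUtils;

// Split raw text into lowercase words.  Letters and digits are part
// of a word; anything else (white space, punctuation) is a separator.
procedure TextToWords(const Text: string; Words: TStringList);
var
  i: Integer;
  Token: string;
begin
  Token := '';
  for i := 1 to Length(Text) do
    if Text[i] in ['A'..'Z', 'a'..'z', '0'..'9'] then
      Token := Token + Text[i]
    else
    begin
      if Token <> '' then
        Words.Add(LowerCase(Token));
      Token := '';
    end;
  if Token <> '' then
    Words.Add(LowerCase(Token));  // flush the final word
end;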

It is really important to consider what you are indexing and how your index will be used when designing your index and your parser.  For example, if you are parsing HTML then you want to exclude most tags (with the obvious exception of META tags, which are handled specially).  Other times you might only want to index summary information about each document.  

Indexing the Words
Now that we have a parsed token list, we need to index it.  The simplest index is just a list of each word found in a document, and a reference to the document.  This reference may be a URL, a document name or any other unique identifier (a GUID or a foreign key to another table describing the document).  A more complex index may include the number of times the word is found in the document or a ranking for where it appears in the document (in the title, keyword section, first paragraph, middle, last, etc.).  This additional information stored with each word is part of what differentiates one search engine's performance from another.
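As a sketch of the simplest case, the token list can be collapsed into a list of unique words with occurrence counts.  Storing the count in the Objects[] slot of a sorted TStringList is just one convenient pre-generics Delphi idiom, assumed here for illustration:

// Collapse a token list into unique words with occurrence counts.
procedure BuildWordCounts(Tokens, Index: TStringList);
var
  i, Slot: Integer;
begin
  Index.Sorted := True;  // a sorted list makes IndexOf a binary search
  for i := 0 to Tokens.Count - 1 do
  begin
    Slot := Index.IndexOf(Tokens[i]);
    if Slot < 0 then
      Index.AddObject(Tokens[i], TObject(1))
    else
      Index.Objects[Slot] := TObject(Integer(Index.Objects[Slot]) + 1);
  end;
end;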

Many times certain words are left out.  These are called stop words.  Stop words are common words, words that will not be searched on, or words that will not enhance the meaning of a search.  Examples of stop words include "THE, A, AN, AND, IF, BUT", words with numbers in them, or anything else you want to filter out.  Selecting stop words is another point where one search engine's performance separates from another's.  

Some web search engines used to leave out words like "HTML" or "WEB" because they were so common, while other search engines would include every word.  Other search engines start with a dictionary list and only index words found in that dictionary.  This leads to trouble when you are indexing names, technical terms or anything else not found in the original dictionary.  

One time I was creating a search engine for a collection of newsgroup articles.  I discovered that there were UUEncoded (similar to MIME or Base64) binaries in the articles.  This resulted in my parser finding words that were hundreds of characters long and total gibberish.  I decided to omit any word longer than 50 characters or shorter than 4.  Making the choices about what to include and what to omit is an important decision, and it will vary based on what content you are indexing.
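A word filter combining the stop list and those length limits might look like the sketch below; the KeepWord name and the sorted stop list are illustrative assumptions:

// Returns True if a word should be indexed: not too short, not too
// long, and not on the stop word list.  StopWords is assumed to be a
// sorted, lowercase TStringList loaded elsewhere.
function KeepWord(const Token: string; StopWords: TStringList): Boolean;
begin
  Result := (Length(Token) >= 4) and (Length(Token) <= 50) and
    (StopWords.IndexOf(Token) < 0);
end;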

So here is an example table structure for your index:

Table: WordList

Document : Number (foreign key to Documents table)
Word : String[20] (if our longest word is 20 characters)
Count : Number (how many times the word is found)

The primary key would be a compound of Document and Word since each word is listed once per document.

Table: Documents

Document : Number (primary key index)
Title : string (title of document)
Location : string (URL or full filename and path)

Optionally you could include the entire document as a blob in this table.  You could also have other tables that list terms (from the META section of the document) or store authors.  Again, this design choice depends on the type of documents you are indexing and the purpose of your search engine.
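To connect the in-memory word counts to this schema, the indexer simply inserts one row per unique word.  Here is a sketch assuming a BDE-style TQuery with a parameterized INSERT; adapt the component and connection setup to your own data access layer:

// Store one row in WordList per unique word in document DocID.
// Index is the sorted word list built earlier, counts in Objects[].
Query.SQL.Text :=
  'INSERT INTO WordList (Document, Word, Count) VALUES (:doc, :wrd, :cnt)';
for i := 0 to Index.Count - 1 do
begin
  Query.ParamByName('doc').AsInteger := DocID;
  Query.ParamByName('wrd').AsString := Index[i];
  Query.ParamByName('cnt').AsInteger := Integer(Index.Objects[i]);
  Query.ExecSQL;
end;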

Searching Your Index

Once the index is stored in a database, you need to be able to search it for documents.  A simple SQL statement to search for documents that contain a single word could look like this:

SELECT *
FROM WordList
WHERE Word = :v_search_term
ORDER BY Count DESC

This returns all documents containing your single search term, ordered by the number of times the word is found.  If you want to stay in SQL, searching on multiple terms involves a join for each term.  Instead, you could retrieve a list of documents for each term and then merge the lists manually.  This is where you would support the AND, OR and NOT keywords.  
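For example, AND becomes an intersection of two per-term document lists.  A minimal sketch (the IntersectDocs name is an illustrative assumption; OR would be a duplicate-free union, and NOT a set difference):

// Keep only the document IDs that appear in both result lists (AND).
// A and B are sorted TStringLists of document IDs, one per term.
function IntersectDocs(A, B: TStringList): TStringList;
var
  i: Integer;
begin
  Result := TStringList.Create;
  for i := 0 to A.Count - 1 do
    if B.IndexOf(A[i]) >= 0 then
      Result.Add(A[i]);
end;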

If you want to allow phrase searching then you could search for each word in the phrase and then search just those documents for the phrase itself.  The same technique could be used for the NEAR keyword.  There are other more advanced techniques that are much quicker, but they are beyond the scope of this document.
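In other words, first intersect the document lists for every word in the phrase, then verify each surviving candidate.  A sketch of the verification pass, where Candidates holds document IDs and LoadDocumentText is a hypothetical helper that fetches the stored document text:

// Drop candidates that do not actually contain the exact phrase.
for i := Candidates.Count - 1 downto 0 do
  if Pos(LowerCase(Phrase),
    LowerCase(LoadDocumentText(Candidates[i]))) = 0 then
    Candidates.Delete(i);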

Once the hits are found and ranked, display the title of each document, possibly with a summary or the context of the hits, and provide a way for your user to reach the document.

Variations

One thing Google does a little differently is look at how pages are linked.  This works really well with the hyperlinked nature of the web.  For example, if you search for Borland, most pages that mention Borland link to www.borland.com.  This is taken to indicate that www.borland.com is a very important site about Borland.  Google also limits the number of hits you get on each domain.

Many search engines also rank pages higher if the search term appears in the URL or title of the page.  They also look at the description and keywords META tags for ranking.  Some search engines will actually ignore a word if it appears too often in a page; this weeds out sites that try to artificially inflate their rankings.  

Phonetics or Soundex is another technique that can be used.  This could be done with an additional table similar to the word table, except that it stores the Soundex value of each word instead of the actual word.  
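Delphi 7 and later ship a Soundex function in the StrUtils unit, so populating such a table is straightforward.  A sketch (the BuildSoundexList name is an illustrative assumption):

uses
  Classes, StrUtils;

// Fill a parallel list with the Soundex code of each indexed word;
// the standard algorithm maps 'Bucknall' to 'B254', for example.  A
// search then converts the user's term to its code and matches on that.
procedure BuildSoundexList(Words, Codes: TStringList);
var
  i: Integer;
begin
  for i := 0 to Words.Count - 1 do
    Codes.Add(Soundex(Words[i]));
end;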

Conclusion

Searching a shorter, well organized index is much quicker than searching an entire document.  Creating and searching an index takes a lot more planning and effort up front, but it quickly pays off if the text is searched often.  Typically, the larger and more complex the index, the more effective the search; if your index gets too large or complex, though, search speed will degrade.  

There are off-the-shelf searching tools available to end users and developers alike.  dtSearch ( http://www.dtsearch.com ) makes a wide range of searching tools for both end users and all types of developers.  Tamarack Associates' Rubicon ( http://www.tamaracka.com ) is a search component for Delphi that provides a lot of flexibility, especially in storage options.  Both are extremely fast and useful, but don't let that stop you from designing your own, especially if you have a specialized need.

See also:

http://www.dtsearch.com/dtsoftware.html#Art_of_The_Text_Query
