2004. január 19., hétfő

Searching Strings by the way they sound (2)


Problem/Question/Abstract:

How to match strings based on the way they sound & not on their spellings.

Answer:

This article is in continuation of my previous article "Searching Strings by the way they sound" and represents an attempt at making the SoundEx() more versatile so as to theoratically accomodate languages other than English - the only restriction being that the language should use the ASCII character set. Another advantage is that the function can be "tuned" to peculiarities of a language e.g. "Knife" is pronounced as "Nife" in English. There is theoratically no limit to this "tunability" - of course with corresponding decrease in performance. But you can get amazing results which are better than what SoundEx() gives.

I have chosen to post a new article rather than update the original one since the original function has been modified quite significantly (in concept) thus making it different from the industry standard SoundEx() function - which was implemented in the original article.

Since the function now supports language "tuning", it can give different results than the industry standard SoundEx(). I have thus renamed the function to "Sound()". This also gives me the freedom to implement it differently.

Sound() returns the same value (M240) for each of Micael/Maical/Michael/Maichael. Additionally, since it has been (partially) tuned for English, it will give the same result (F500) for "Phone"/"Fone".

I guess the "Ultimate" Sound Matching logic will be based on phonemes - of which I currently know very little. If you help me by providing me details of phonemes that you may have, then I will make yet another attempt at improving "Sound()" even further...

I thank Toninho Nunes and Joe Meyer for providing me ideas & inputs respectively.

Please save the code below in a file called "Sounds.pas". You will need to include the file in your source (Uses Sounds) and then you will have access to the Sound() function.

{********************************************************************}
{* Description: Modified Soundex function in which it is attempted to include *}
{* language pecularities which theoratically makes it adaptable to languages  *}
{* other than English - the only restriction being that the language in       *}
{* question should use ASCII character set                                    *}
{********************************************************************}
{* Date Created  : 15-Nov-2000                                                *}
{* Last Modified : 16-Nov-2000                                                *}
{* Version       : 0.10                                                       *}
{* Author        : Paramjeet Reen                                             *}
{* eMail         : Paramjeet.Reen@EudoraMail.com                              *}
{******************************************************************************}
{* This program is based on an algorithm that I had found in a magazine,      *}
{* merged with an algorithm of a program posted by Joe Meyer. I do not        *}
{* gurantee the fitness of this program in any way. Use it at your own risk.  *}
{********************************************************************}
{* Category: Freeware.                                                        *}
{********************************************************************}

unit Sounds;

interface

//Returns a code for InpStr depending upon how it sounds.
function Sound(const InpStr: ShortString): ShortString;

implementation

type
  TReplacePos = (pStart, pMid, pEnd);
  TReplacePosSet = set of TReplacePos;

const
  {********************************************************************}
  {* The following are selected letters of the alphabet which are divided     *}
  {* into their corresponding code (1-6). You might need to modify these for  *}
  {* different languages depending upon whether the language requires         *}
  {* alphabets other than the ones specified below                            *}
  {********************************************************************}
  Chars1 = ['B', 'P', 'F', 'V'];
  Chars2 = ['C', 'S', 'K', 'G', 'J', 'Q', 'X', 'Z'];
  Chars3 = ['D', 'T'];
  Chars4 = ['L'];
  Chars5 = ['M', 'N'];
  Chars6 = ['R'];

procedure ReplaceStr(var InpStr: ShortString; const SubStr, WithStr: ShortString;
  const ReplacePositions: TReplacePosSet);
var
  i: Integer;
begin
  if (pStart in ReplacePositions) then
  begin
    i := Pos(SubStr, InpStr);

    if (i = 1) then
    begin
      Delete(InpStr, i, Length(SubStr));
      Insert(WithStr, InpStr, i);
    end;
  end;

  if (pMid in ReplacePositions) then
  begin
    i := Pos(SubStr, InpStr);

    while (i > 1) and (i <= (Length(InpStr) - Length(SubStr))) do
    begin
      Delete(InpStr, i, Length(SubStr));
      Insert(WithStr, InpStr, i);
      i := Pos(SubStr, InpStr);
    end;
  end;

  if (pEnd in ReplacePositions) then
  begin
    i := Pos(SubStr, InpStr);

    if (i > 1) and (i > (Length(InpStr) - Length(SubStr))) then
    begin
      Delete(InpStr, i, Length(SubStr));
      Insert(WithStr, InpStr, i);
    end;
  end;
end;

function Sound(const InpStr: ShortString): ShortString;
var
  vStr: ShortString;
  PrevCh: Char;
  CurrCh: Char;
  i: Word;
begin
  {********************************************************************}
  {* Uppercase & remove invalid characters from given string                  *}
  {********************************************************************}
  {* Please have a long & hard look at this code if you have modified any of  *}
  {* the constants Chars1,Chars2 ... Chars6 by increasing the overall range   *}
  {* of alphabets                                                             *}
  {********************************************************************}
  vStr := '';
  for i := 1 to Length(InpStr) do
    case InpStr[i] of
      'a'..'z': vStr := vStr + UpCase(InpStr[i]);
      'A'..'Z': vStr := vStr + InpStr[i];
    end; {case}

  if (vStr <> '') then
  begin
    {**************************************************************************}
    {* Language Tweaking Section                                              *}
    {********************************************************************}
    {* Tweak for language peculiarities e.g. "CAt"="KAt", "KNIfe"="NIfe"      *}
    {* "PHone"="Fone", "PSYchology"="SIchology", "EXcel"="Xcel" etc...        *}
    {* You will need to modify these for different languages. Optionally, you *}
    {* may choose not to have this section at all, in which case, the output  *}
    {* of Sound() will correspond to that of SoundEx(). Please note however   *}
    {* the importance of what you replace & the order in which you replace.   *}
    {********************************************************************}
    {* Also, please note that the following replacements are targeted for the *}
    {* English language & that too is subject to improvements                 *}
    {********************************************************************}
    ReplaceStr(vStr, 'CA', 'KA', [pStart, pMid, pEnd]); //arCAde = arKAde
    ReplaceStr(vStr, 'CL', 'KL', [pStart, pMid, pEnd]); //CLass  = Klass
    ReplaceStr(vStr, 'CK', 'K', [pStart, pMid, pEnd]); //baCK   = baK
    ReplaceStr(vStr, 'EX', 'X', [pStart, pMid, pEnd]); //EXcel  = Xcel
    ReplaceStr(vStr, 'X', 'Z', [pStart]); //Xylene = Zylene
    ReplaceStr(vStr, 'PH', 'F', [pStart, pMid, pEnd]); //PHone  = Fone
    ReplaceStr(vStr, 'KN', 'N', [pStart]); //KNife  = Nife
    ReplaceStr(vStr, 'PSY', 'SI', [pStart]); //PSYche = SIche
    ReplaceStr(vStr, 'SCE', 'CE', [pStart, pMid, pEnd]); //SCEne  = CEne

    {********************************************************************}
    {* String Assembly Section                                                *}
    {********************************************************************}
    PrevCh := #0;
    Result := vStr[1];
    for i := 2 to Length(vStr) do
    begin
      if Length(Result) = 4 then
        break;

      CurrCh := vStr[i];
      if (CurrCh <> PrevCh) then
      begin
        if CurrCh in Chars1 then
          Result := Result + '1'
        else if CurrCh in Chars2 then
          Result := Result + '2'
        else if CurrCh in Chars3 then
          Result := Result + '3'
        else if CurrCh in Chars4 then
          Result := Result + '4'
        else if CurrCh in Chars5 then
          Result := Result + '5'
        else if CurrCh in Chars6 then
          Result := Result + '6';

        PrevCh := CurrCh;
      end;
    end;
  end
  else
    Result := '';

  while (Length(Result) < 4) do
    Result := Result + '0';
end;

end.

Nincsenek megjegyzések:

Megjegyzés küldése