Friday, November 11, 2005
Extended E-mail Address Verification and Correction
Problem/Question/Abstract:
Have you ever needed to verify that an e-mail address is correct, or have you had to work with a list of e-mail addresses and realized that some had simple problems that you could easily correct by hand?
Answer:
The functions I present here are designed to do just that. This article presents two functions: one to check that an e-mail address is valid, and another to try to correct an invalid one.
Just what is a correct e-mail address?
The majority of articles I’ve seen on e-mail address verification use an over-simplified approach. The most common one is to check that an ‘@’ symbol is present, that the address has a minimum length (e.g. 7 characters), or both. A better but less common method is to verify that the address contains only characters allowed by the SMTP standard.
The problem with these approaches is that they can only tell you, at the highest level, that an address is POSSIBLY correct. For example:
The address: ------@--------
is accepted as a valid e-mail address, because it contains an @, is at least 7 characters long, and contains only valid characters.
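To make the weakness concrete, here is a minimal sketch of such a naive check. The name NaiveEmailCheck is hypothetical and is not part of the unit listed later in this article; it only combines the two simple tests described above.

```pascal
program NaiveCheckDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

// Hypothetical helper (not from the article's unit): the over-simplified
// test, i.e. an @ is present and the address is at least 7 characters long.
function NaiveEmailCheck(const Email: string): Boolean;
begin
  Result := (Pos('@', Email) > 0) and (Length(Email) >= 7);
end;

begin
  // The nonsense address from the text passes the naive test
  if NaiveEmailCheck('------@--------') then
    WriteLn('accepted')
  else
    WriteLn('rejected');
end.
```

The program prints "accepted", which is exactly the false positive the stricter checks below are meant to catch.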
To ensure an address is truly correct, you must verify that all portions of the e-mail address are valid. The function I present performs the following checks:
a) Ensure an address is not blank
b) Ensure an @ is present
c) Ensure that only valid characters are used
It then splits the validation into two individual sections: username (or mailbox) and domain.
Validation for the username:
a) Ensure it is not blank
b) Ensure the username is not longer than the standard allows (64 characters)
c) Ensure that periods (.) are used properly: there cannot be sequential periods (e.g. David..Lederman is not valid), nor can a period be the first or last character of the username
Validation for the domain name:
a) Ensure it is not blank
b) Ensure the domain name is not longer than the standard allows
c) Ensure that periods (.) are used properly: there cannot be sequential periods (e.g. World..net is not valid), nor can a period be the first or last character of the domain
d) Check each domain segment (e.g. in someplace.somewhere.com, the segments are someplace, somewhere, and com) to ensure that it does not start or end with a hyphen (-) (e.g. somewhere.-someplace.com is not valid)
e) Ensure that at least two domain segments exist (e.g. someplace.com is valid, .com is not valid)
f) Ensure that there are no additional @ symbols in the domain portion
With the steps above, most addresses that pass the simple checks but are not actually correct can be detected and rejected.
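The period and segment rules can be sketched in isolation as follows. This is a simplified illustration, not the code from the unit below: the names PeriodsUsedProperly and SegmentsAreValid are hypothetical, and the sketch only answers yes or no instead of reporting a failure code and position.

```pascal
program DomainRulesSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

// Hypothetical helper: True when S is not blank and has no leading,
// trailing, or sequential periods.
function PeriodsUsedProperly(const S: string): Boolean;
var
  I: Integer;
begin
  Result := False;
  if S = '' then
    Exit;
  if (S[1] = '.') or (S[Length(S)] = '.') then
    Exit;
  for I := 1 to Length(S) - 1 do
    if (S[I] = '.') and (S[I + 1] = '.') then
      Exit;
  Result := True;
end;

// Hypothetical helper: True when Domain has at least two segments and
// no segment starts or ends with a hyphen.
function SegmentsAreValid(const Domain: string): Boolean;
var
  I, SegStart, SegCount: Integer;
begin
  Result := False;
  if not PeriodsUsedProperly(Domain) then
    Exit;
  SegStart := 1;
  SegCount := 0;
  // Walk one position past the end so the last segment is checked too
  for I := 1 to Length(Domain) + 1 do
    if (I > Length(Domain)) or (Domain[I] = '.') then
    begin
      // The segment spans SegStart..I-1 and cannot be empty here
      if (Domain[SegStart] = '-') or (Domain[I - 1] = '-') then
        Exit;
      Inc(SegCount);
      SegStart := I + 1;
    end;
  Result := SegCount >= 2;
end;

procedure Show(const Domain: string);
begin
  if SegmentsAreValid(Domain) then
    WriteLn(Domain, ': ok')
  else
    WriteLn(Domain, ': rejected');
end;

begin
  Show('someplace.somewhere.com');
  Show('somewhere.-someplace.com');
  Show('World..net');
  Show('.com');
end.
```

Only the first of the four sample domains is reported as ok; the others break the hyphen, sequential-period, and leading-period rules respectively.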
The ValidateEmailAddress function:
This function takes 3 parameters:
Email – The e-mail address to check
FailCode – The error code reported by the function if it can’t validate an address
FailPosition – The position of the character (if available) where the validation failure occurred
The function returns True if the address is valid and False if it is invalid. If validation fails, FailCode identifies the exact error:
flUnknown – An unknown error occurred, and was trapped by the exception handler.
flNoSeperator – No @ symbol was found.
flToSmall – The email address was blank.
flUserNameToLong – The user name was longer than the SMTP standard allows.
flDomainNameToLong – The domain name was longer than the SMTP standard allows.
flInvalidChar – An invalid character was found. (FailPosition returns the location of the character)
flMissingUser – The username section is not present.
flMissingDomain – The domain name section is not present.
flMissingDomainSeperator – No period separating the domain segments was found.
flMissingGeneralDomain – No top-level domain was found.
flToManyAtSymbols – More than one @ symbol was found.
Simple validation has no use for FailCode and FailPosition, but they can be used to display an error message: ValidationErrorString takes the FailCode as a parameter and returns a text description of the error, which can then be displayed.
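A possible way to call the validator is sketched below. It assumes the abSMTPRoutines unit listed at the end of this article is compiled with Delphi and is on the search path. DescribeAddress is a hypothetical helper added only for this illustration.

```pascal
program ValidateDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils,
  abSMTPRoutines; // assumption: the unit from this article

// Hypothetical helper: returns 'valid', or the error text for the
// failure code, with the position added for invalid characters.
function DescribeAddress(const Email: string): string;
var
  FailCode, FailPosition: Integer;
begin
  FailCode := flUnknown;
  FailPosition := 0; // only filled in for some failure codes
  if ValidateEmailAddress(Email, FailCode, FailPosition) then
    Result := 'valid'
  else if FailCode = flInvalidChar then
    Result := ValidationErrorString(FailCode) +
      ' (position ' + IntToStr(FailPosition) + ')'
  else
    Result := ValidationErrorString(FailCode);
end;

begin
  WriteLn(DescribeAddress('example@aol.com'));
  WriteLn(DescribeAddress('exampleaol.com'));
  WriteLn(DescribeAddress('example@.aol.com'));
end.
```

Callers that only need a yes/no answer can ignore the two var parameters after the call.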
E-mail Address Correction
Since the e-mail validation routine returns detailed error information, an automated system to correct common e-mail address mistakes can easily be created. The following common mistakes can all be corrected automatically:
example2.aol.com – The most common error (at least in my experience): when typing the @, the user doesn’t hold Shift properly and enters a 2 instead.
example@.aol.com - This error is just an extra character entered by the user; example@aol.com was of course the intended e-mail address.
example8080 @ aol .com – Another common error: stray spaces.
A Cool Screen name@AOL.com – In this case the user entered what they thought was their e-mail address, but while AOL allows screen names to contain spaces, the Internet does not.
myaddress@ispcom - In this case the period was not entered between ISP and Com.
The CorrectEmailAddress function:
The function takes three parameters:
Email – The e-mail address to check and correct
Suggestion – This string, passed by reference, receives the function’s result
MaxCorrections – The maximum number of corrections to attempt before stopping (defaults to 5)
This function simply loops up to MaxCorrections times, validating the e-mail address and then using the FailCode to decide what kind of correction to make. It repeats this until it finds a valid address, determines the address can’t be fixed, or has looped MaxCorrections times.
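As an illustration, the function could be driven like this. The sketch assumes the abSMTPRoutines unit listed at the end of this article is compiled with Delphi and is on the search path. SuggestionFor is a hypothetical wrapper, and the sample addresses are the ones discussed above.

```pascal
program CorrectDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  abSMTPRoutines; // assumption: the unit from this article

const
  Samples: array[0..3] of string = (
    'example@.aol.com',
    'example8080 @ aol .com',
    'A Cool Screen name@AOL.com',
    'myaddress@ispcom');

// Hypothetical wrapper: returns the corrected address, or an empty
// string when no valid suggestion could be produced.
function SuggestionFor(const Email: string): string;
var
  Suggestion: string;
begin
  if CorrectEmailAddress(Email, Suggestion) then
    Result := Suggestion
  else
    Result := '';
end;

var
  I: Integer;
  Fixed: string;
begin
  for I := Low(Samples) to High(Samples) do
  begin
    Fixed := SuggestionFor(Samples[I]);
    if Fixed <> '' then
      WriteLn('"', Samples[I], '" -> "', Fixed, '"')
    else
      WriteLn('"', Samples[I], '" could not be corrected');
  end;
end.
```

Each stray space costs one correction, so an address with many spaces may need a MaxCorrections value larger than the default.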
The following corrections are performed, based on the FailCode (see description above):
flUnknown – Simply stops corrections, as there is no generic way to correct this problem.
flNoSeperator – When this error is encountered, the function performs a simple but powerful fix: it scans the e-mail address for the last 2 and converts it to an @ symbol. This corrects most genuine missed-Shift errors. If it converts a 2 that was not really meant to be an @, chances are the suggestion is no longer the intended address. If the address contains no 2, corrections stop.
flToSmall - Simply stops corrections, as there is no generic way to correct this problem.
flUserNameToLong – Simply stops corrections, as there is no generic way to correct this problem.
flDomainNameToLong – Simply stops corrections, as there is no generic way to correct this problem.
flInvalidChar – In this case the offending character is simply deleted.
flMissingUser – Simply stops corrections, as there is no generic way to correct this problem.
flMissingDomain – Simply stops corrections, as there is no generic way to correct this problem.
flMissingDomainSeperator – A period is inserted before the last three characters (e.g. myaddress@ispcom becomes myaddress@isp.com).
flMissingGeneralDomain – Simply stops corrections, as there is no generic way to correct this problem.
flToManyAtSymbols – Simply stops corrections, as there is no generic way to correct this problem.
While only a small portion of errors can be corrected, the function can correct the most common errors encountered when working with lists of e-mail addresses, specifically when the data is entered by the actual e-mail account holder.
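The "last 2" repair and its main risk can be shown with a small standalone sketch. ReplaceLastTwoWithAt is a hypothetical name. The unit below performs the same replacement inline, using a forward scan that remembers the last match.

```pascal
program LastTwoSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

// Hypothetical helper: returns S with its last '2' turned into '@'.
// S is returned unchanged when it contains no '2'.
function ReplaceLastTwoWithAt(const S: string): string;
var
  I: Integer;
begin
  Result := S;
  for I := Length(Result) downto 1 do
    if Result[I] = '2' then
    begin
      Result[I] := '@';
      Exit;
    end;
end;

begin
  // A genuine missed-Shift error is repaired
  WriteLn(ReplaceLastTwoWithAt('example2aol.com'));
  // A 2 that belonged to the name is converted as well, which yields
  // a well-formed but wrong address
  WriteLn(ReplaceLastTwoWithAt('bob2000aol.com'));
end.
```

The second call produces bob@000aol.com, which passes validation even though it is unlikely to be what the user meant, so suggestions are best shown to a person for confirmation.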
The following is the source code for the functions described above. Feel free to use the code in your own programs, but please leave my name and address intact!
// ---------------------------ooo------------------------------ \\
// ©2000 David Lederman
// dlederman@internettoolscorp.com
// ---------------------------ooo------------------------------ \\
unit abSMTPRoutines;

interface

uses
  SysUtils, Classes;

// ---------------------------ooo------------------------------ \\
// These constants represent the various (known) validation
// errors that can occur.
// ---------------------------ooo------------------------------ \\
const
  flUnknown = 0;
  flNoSeperator = 1;
  flToSmall = 2;
  flUserNameToLong = 3;
  flDomainNameToLong = 4;
  flInvalidChar = 5;
  flMissingUser = 6;
  flMissingDomain = 7;
  flMissingDomainSeperator = 8;
  flMissingGeneralDomain = 9;
  flToManyAtSymbols = 10;

function ValidateEmailAddress(Email: string; var FailCode, FailPosition: Integer): Boolean;
function CorrectEmailAddress(Email: string; var Suggestion: string;
  MaxCorrections: Integer = 5): Boolean;
function ValidationErrorString(Code: Integer): string;

implementation

// ---------------------------ooo------------------------------ \\
// This is a list of error descriptions. It's kept in the
// implementation section as it's not needed directly
// from outside this unit, and can be accessed using
// ValidationErrorString, which does range checking.
// ---------------------------ooo------------------------------ \\
const
  ErrorDescriptions: array[0..10] of string = (
    'Unknown error occurred!',
    'Missing @ symbol!',
    'Data too small!',
    'User name too long!',
    'Domain name too long!',
    'Invalid character!',
    'Missing user name!',
    'Missing domain name!',
    'Missing domain portion (.com, .net, etc.)',
    'Invalid general domain!',
    'Too many @ symbols!');
  // Same characters as before, written as ranges
  AllowedEmailChars: set of Char = ['A'..'Z', 'a'..'z', '0'..'9',
    '@', '-', '.', '_', '''', '+', '$', '/', '%'];
  MaxUsernamePortion = 64; // Per RFC 821 (unchanged in RFC 5321)
  MaxDomainPortion = 255; // Per RFC 5321 (RFC 821 allowed only 64)

function CorrectEmailAddress;
var
  CurITT, RevITT, ITT, FailCode, FailPosition, LastAt: Integer;
begin
  try
    // Reset the suggestion
    Suggestion := Email;
    CurITT := 1;
    // Now loop through to the max depth
    for ITT := CurITT to MaxCorrections do // Iterate
    begin
      // Now try to validate the address
      if ValidateEmailAddress(Suggestion, FailCode, FailPosition) then
      begin
        // The email worked so exit
        Result := True;
        Exit;
      end;
      // Otherwise, try to correct it
      case FailCode of
        flUnknown:
          begin
            // This error can't be fixed
            Result := False;
            Exit;
          end;
        flNoSeperator:
          begin
            // This error can possibly be fixed by finding
            // the last 2 (which was most likely typed instead of an @)
            LastAt := 0;
            for RevITT := 1 to Length(Suggestion) do // Iterate
            begin
              // Look for the 2
              if Suggestion[RevITT] = '2' then
                LastAt := RevITT;
            end; // for
            // Now see if we found a 2
            if LastAt = 0 then
            begin
              // The situation can't get better so exit
              Result := False;
              Exit;
            end;
            // Now convert the 2 to an @ and continue
            Suggestion[LastAt] := '@';
          end;
        flToSmall:
          begin
            // The situation can't get better so exit
            Result := False;
            Exit;
          end;
        flUserNameToLong:
          begin
            // The situation can't get better so exit
            Result := False;
            Exit;
          end;
        flDomainNameToLong:
          begin
            // The situation can't get better so exit
            Result := False;
            Exit;
          end;
        flInvalidChar:
          begin
            // Simply delete the offending char
            Delete(Suggestion, FailPosition, 1);
          end;
        flMissingUser:
          begin
            // The situation can't get better so exit
            Result := False;
            Exit;
          end;
        flMissingDomain:
          begin
            // The situation can't get better so exit
            Result := False;
            Exit;
          end;
        flMissingDomainSeperator:
          begin
            // The best correction we can make here is to go back three
            // characters and insert a period.
            // Very short strings are not special-cased, since at this point
            // we can't make things any worse (suggestion wise)
            Insert('.', Suggestion, Length(Suggestion) - 2);
          end;
        flMissingGeneralDomain:
          begin
            // The situation can't get better so exit
            Result := False;
            Exit;
          end;
        flToManyAtSymbols:
          begin
            // The situation can't get better so exit
            Result := False;
            Exit;
          end;
      end; // case
    end; // for
    // All allowed corrections have been made. Validate once more so the
    // final correction is not lost (originally this always returned False).
    Result := ValidateEmailAddress(Suggestion, FailCode, FailPosition);
  except
    // Just return false
    Result := False;
  end;
end;

// ---------------------------ooo------------------------------ \\
// This function will validate an address, much further than
// simply verifying the syntax as the RFC (821) requires
// ---------------------------ooo------------------------------ \\
function ValidateEmailAddress;
var
  DataLen, SepPos, Itt, DomainStrLen, UserStrLen, LastSep, SepCount, PrevSep: Integer;
  UserStr, DomainStr, SubDomain: string;
begin
  try
    // Get the data length
    DataLen := Length(Email);
    // Make sure that the string is not blank
    if DataLen = 0 then
    begin
      // Set the result and exit
      FailCode := flToSmall;
      Result := False;
      Exit;
    end;
    // First real validation, ensure the @ separator
    SepPos := Pos('@', Email);
    if SepPos = 0 then
    begin
      // Set the result and exit
      FailCode := flNoSeperator;
      Result := False;
      Exit;
    end;
    // Now verify that only the allowed characters are in the address
    for Itt := 1 to DataLen do // Iterate
    begin
      // Make sure the character is allowed
      if not (Email[Itt] in AllowedEmailChars) then
      begin
        // Report an invalid char error and the location
        FailCode := flInvalidChar;
        FailPosition := Itt;
        Result := False;
        Exit;
      end;
    end; // for
    // Now split the string into the two elements: user and domain
    UserStr := Copy(Email, 1, SepPos - 1);
    DomainStr := Copy(Email, SepPos + 1, DataLen);
    // If either the user or domain is missing then there's an error
    if (UserStr = '') then
    begin
      // Report a missing section and exit
      FailCode := flMissingUser;
      Result := False;
      Exit;
    end;
    if (DomainStr = '') then
    begin
      // Report a missing section and exit
      FailCode := flMissingDomain;
      Result := False;
      Exit;
    end;
    // Now get the lengths of the two portions
    DomainStrLen := Length(DomainStr);
    UserStrLen := Length(UserStr);
    // Ensure that neither side is too large (per the standard)
    if DomainStrLen > MaxDomainPortion then
    begin
      FailCode := flDomainNameToLong;
      Result := False;
      Exit;
    end;
    if UserStrLen > MaxUsernamePortion then
    begin
      FailCode := flUserNameToLong;
      Result := False;
      Exit;
    end;
    // Now verify the user portion of the email address
    // Ensure that the period is neither the first nor the last char (or the only char)
    // Check first char
    if (UserStr[1] = '.') then
    begin
      // Report an invalid character and exit
      FailCode := flInvalidChar;
      Result := False;
      FailPosition := 1;
      Exit;
    end;
    // Check end char
    if (UserStr[UserStrLen] = '.') then
    begin
      // Report an invalid character and exit
      FailCode := flInvalidChar;
      Result := False;
      FailPosition := UserStrLen;
      Exit;
    end;
    // No direct checking for a single char is needed since the previous two
    // checks would have detected it.
    // Ensure no sequential periods. Reading UserStr[Itt + 1] is safe because
    // the last char is known not to be a period.
    for Itt := 1 to UserStrLen do // Iterate
    begin
      if UserStr[Itt] = '.' then
      begin
        // Check the next char, to make sure it's not a .
        if UserStr[Itt + 1] = '.' then
        begin
          // Report the error
          FailCode := flInvalidChar;
          Result := False;
          FailPosition := Itt;
          Exit;
        end;
      end;
    end; // for
    { At this point, we've validated the user name, and will now move into the domain. }
    // Ensure that the period is neither the first nor the last char (or the only char)
    // Check first char
    if (DomainStr[1] = '.') then
    begin
      // Report an invalid character and exit
      FailCode := flInvalidChar;
      Result := False;
      // The position here needs to have the user name portion added to it
      // to get the right number, + 1 for the now missing @
      FailPosition := UserStrLen + 2;
      Exit;
    end;
    // Check end char
    if (DomainStr[DomainStrLen] = '.') then
    begin
      // Report an invalid character and exit
      FailCode := flInvalidChar;
      Result := False;
      // The position here needs to have the user name portion added to it
      // to get the right number, + 1 for the now missing @
      FailPosition := UserStrLen + 1 + DomainStrLen;
      Exit;
    end;
    // No direct checking for a single char is needed since the previous two
    // checks would have detected it.
    // Ensure no sequential periods, and while in the loop count the periods,
    // record the last one, and verify that the domain and subdomains
    // don't start or end with a -
    SepCount := 0;
    LastSep := 0;
    PrevSep := 1; // Start of string
    for Itt := 1 to DomainStrLen do // Iterate
    begin
      if DomainStr[Itt] = '.' then
      begin
        // Check the next char, to make sure it's not a .
        if DomainStr[Itt + 1] = '.' then
        begin
          // Report the error
          FailCode := flInvalidChar;
          Result := False;
          FailPosition := UserStrLen + 1 + Itt;
          Exit;
        end;
        // Up the count, record the last sep
        Inc(SepCount);
        LastSep := Itt;
        // Now verify this domain segment
        SubDomain := Copy(DomainStr, PrevSep, LastSep - PrevSep);
        // Make sure it doesn't start with a -
        if SubDomain[1] = '-' then
        begin
          FailCode := flInvalidChar;
          Result := False;
          FailPosition := UserStrLen + 1 + PrevSep;
          Exit;
        end;
        // Make sure it doesn't end with a -
        if SubDomain[Length(SubDomain)] = '-' then
        begin
          FailCode := flInvalidChar;
          Result := False;
          FailPosition := (UserStrLen + 1) + LastSep - 1;
          Exit;
        end;
        // Update the pointer
        PrevSep := LastSep + 1;
      end
      else
      begin
        if DomainStr[Itt] = '@' then
        begin
          // Report an error
          FailPosition := UserStrLen + 1 + Itt;
          FailCode := flToManyAtSymbols;
          Result := False;
          Exit;
        end;
      end;
    end; // for
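    // Verify that there is at least one .
    if SepCount < 1 then
    begin
      FailCode := flMissingDomainSeperator;
      Result := False;
      Exit;
    end;
    // Now do some extended work on the final, most general domain (.com)
    // Verify that it is at least 2 chars. Start after the last period so
    // that the period itself is not counted.
    SubDomain := Copy(DomainStr, LastSep + 1, DomainStrLen);
    if Length(SubDomain) < 2 then
    begin
      FailCode := flMissingGeneralDomain;
      Result := False;
      Exit;
    end;
    // The loop above only checks segments that are followed by a period,
    // so apply the hyphen rule to the final segment here
    if SubDomain[1] = '-' then
    begin
      FailCode := flInvalidChar;
      Result := False;
      FailPosition := UserStrLen + 1 + LastSep + 1;
      Exit;
    end;
    if SubDomain[Length(SubDomain)] = '-' then
    begin
      FailCode := flInvalidChar;
      Result := False;
      FailPosition := UserStrLen + 1 + DomainStrLen;
      Exit;
    end;
    // Well after all that checking, we should now have a valid address
    Result := True;
  except
    // Report the documented "unknown error" code so callers can handle it
    Result := False;
    FailCode := flUnknown;
  end; // try/except
end;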

// ---------------------------ooo------------------------------ \\
// This function returns the error string from the constant
// array, and makes sure that the error code is valid. If
// not, it returns an invalid error code string.
// ---------------------------ooo------------------------------ \\
function ValidationErrorString(Code: Integer): string;
begin
  // Make sure a valid error code is passed
  if (Code < Low(ErrorDescriptions)) or (Code > High(ErrorDescriptions)) then
  begin
    Result := 'Invalid error code!';
    Exit;
  end;
  // Get the error description from the constant array
  Result := ErrorDescriptions[Code];
end;

end.