Wednesday, May 4, 2011

Determining the actual length of a DBCS string (multibyte-character ANSI string)


Problem/Question/Abstract:

How can I get the length in characters of a multibyte-character string? The Length function returns the length in bytes, but in East Asian languages some characters take more than one byte.

Answer:

Solve 1:

Introduction

The Length function returns the length of a string, but what it counts depends on the string type. For short strings (ShortString) and long strings (AnsiString), Length returns the number of bytes the string occupies, while for wide (Unicode) strings (WideString) it returns the number of wide characters (WideChar), that is, the number of bytes divided by two. In Western languages each character of a short or long string takes one byte, whereas in East Asian languages some characters take one byte and others take two.

For this reason, there are two versions of almost all string functions: a fast one that only works correctly with single-byte character strings (SBCS), and a slower one that also handles strings in which a character can take one or two bytes (DBCS), as used in applications distributed internationally. This way we have functions like Pos, LowerCase and UpperCase on one side and AnsiPos, AnsiLowerCase and AnsiUpperCase on the other. Curiously, there is no AnsiLength function that returns the number of characters in a DBCS string.
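
As a minimal sketch of the difference described above (the program name LengthDemo and the variable names are only illustrative), the console program below assigns the same plain ASCII text to an AnsiString and a WideString and prints what Length reports for each:

```pascal
program LengthDemo;
{$APPTYPE CONSOLE}

var
  A: AnsiString;
  W: WideString;
begin
  A := 'abc';
  W := 'abc';
  // AnsiString: Length counts bytes (one byte per ASCII character here).
  WriteLn('AnsiString Length: ', Length(A));
  // WideString: Length counts WideChars, not bytes.
  WriteLn('WideString Length: ', Length(W));
  // The size in bytes of the wide string is Length * SizeOf(WideChar).
  WriteLn('WideString bytes:  ', Length(W) * SizeOf(WideChar));
end.
```

With ASCII-only text both lengths are 3, while the wide string occupies 6 bytes. The byte count and the character count of an AnsiString only diverge once it contains double-byte characters.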

AnsiLength (Draft)

Here is a function that returns the number of characters in a double-byte character string:

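function AnsiLength(const s: AnsiString): Integer;
var
  i, n: Integer;
begin
  Result := 0;
  n := Length(s);  // length in bytes
  i := 1;
  while i <= n do
  begin
    Inc(Result);   // one more character
    // LeadBytes (SysUtils) holds the lead bytes of the system code page;
    // a lead byte is followed by a trail byte, so skip both.
    if s[i] in LeadBytes then
      Inc(i);
    Inc(i);
  end;
end;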

AnsiLength (Final)

Naturally, this function is not optimized. We are not going to mess with assembler, but at least we can use pointers:

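function AnsiLength(const s: AnsiString): Integer;
var
  p, q: PAnsiChar;
begin
  Result := 0;
  p := PAnsiChar(s);    // first byte
  q := p + Length(s);   // one past the last byte
  while p < q do
  begin
    Inc(Result);
    if p^ in LeadBytes then
      Inc(p, 2)         // double-byte character: skip lead and trail byte
    else
      Inc(p);           // single-byte character
  end;
end;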
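
As an alternative sketch, not part of the original tip, the same walk can be left to Windows: CharNextA advances a pointer by one character, so the code does not need to test lead bytes itself. The names CharNextDemo and AnsiLengthCharNext are hypothetical, and unlike the pointer version this loop stops at the first #0 byte, so a string with embedded nulls would be counted only up to that point.

```pascal
program CharNextDemo;
{$APPTYPE CONSOLE}

uses
  Windows;

// Hypothetical helper: counts characters by stepping with CharNextA.
function AnsiLengthCharNext(const s: AnsiString): Integer;
var
  p: PAnsiChar;
begin
  Result := 0;
  p := PAnsiChar(s);      // points to #0 when s is empty
  while p^ <> #0 do
  begin
    Inc(Result);
    p := CharNextA(p);    // moves past one single- or double-byte character
  end;
end;

begin
  // ASCII-only text: every character is one byte, so this prints 6.
  WriteLn(AnsiLengthCharNext('Delphi'));
end.
```

The call overhead of CharNextA makes this slower than testing LeadBytes directly, so it is mainly useful as a cross-check of the hand-written loop.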


Solve 2:

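function AnsiLength(const s: AnsiString): Integer;
begin
  // Requires the Windows unit. The byte length is passed explicitly:
  // with -1 the returned count would also include the terminating null.
  // For an empty string the call fails and returns 0, which is still correct.
  Result := MultiByteToWideChar(CP_ACP, 0, PAnsiChar(s), Length(s), nil, 0);
end;

The documentation on MultiByteToWideChar says:
"If the function succeeds, and cchWideChar is zero, the return value is the required size, in wide characters, for a buffer that can receive the translated string." For the usual DBCS code pages each character converts to one wide character, so this count is the number of characters in the MBCS string.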

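The result of the CP_ACP call depends on the system's ANSI code page. As a hedged sketch of how the same idea could be checked against a specific code page, the program below passes the code page explicitly. The names CodePageLengthDemo and MbcsLength are hypothetical, and the example assumes that code page 932 (Shift-JIS) is installed on the machine.

```pascal
program CodePageLengthDemo;
{$APPTYPE CONSOLE}

uses
  Windows;

// Hypothetical helper: like AnsiLength, but for an explicit code page.
function MbcsLength(const s: AnsiString; CodePage: UINT): Integer;
begin
  // Passing Length(s) keeps the terminating null out of the count.
  // If the call fails (empty string, unavailable code page) the result is 0.
  Result := MultiByteToWideChar(CodePage, 0, PAnsiChar(s), Length(s), nil, 0);
end;

var
  s: AnsiString;
begin
  // Bytes 93 FA and 96 7B are two double-byte Shift-JIS characters,
  // followed by the single-byte character 'A'.
  SetLength(s, 5);
  s[1] := AnsiChar($93);
  s[2] := AnsiChar($FA);
  s[3] := AnsiChar($96);
  s[4] := AnsiChar($7B);
  s[5] := 'A';
  WriteLn('Bytes:      ', Length(s));
  WriteLn('Characters: ', MbcsLength(s, 932));
end.
```

The string is built byte by byte so that it holds the same five bytes regardless of the compiler's handling of character literals. If code page 932 is available, the character count should come out as 3 against a byte count of 5.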

Solve 3:

function AnsiLength(const s: AnsiString): Integer;
begin
  // -1 tells MultiByteToWideChar that the input is null-terminated; the
  // returned count then includes the terminating null, so subtract it.
  Result := MultiByteToWideChar(CP_ACP, 0, PAnsiChar(s), -1, nil, 0);
  if Result > 0 then
    Dec(Result);
end;

The documentation on MultiByteToWideChar says:
"If the function succeeds, and cchWideChar is zero, the return value is the required size, in wide characters, for a buffer that can receive the translated string." The number of wide characters needed for the buffer, minus the terminating null, is the number of characters in the string, that is, its length.

Copyright (c) 2001 Ernesto De Spirito
Visit: http://www.latiumsoftware.com/delphi-newsletter.php
