Be careful with the Ord function in Unicode Delphi versions


Here is a simple test:

program OrdTest;

{$APPTYPE CONSOLE}

uses
  SysUtils;

begin
  try
    Writeln(Ord('Я'), '  ', Ord(Char('Я')));   // 223,  1071
    Assert(Ord('Я') = Ord(Char('Я')));         // Fails
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
  Readln;
end.

When the compiler evaluates the Ord function with a hardcoded character literal, it treats the literal as an ANSI character. In the above example Ord(‘Я’) returns 223 (Cyrillic codepage 1251) instead of 1071 (UTF-16), as one might expect. As a result the assertion fails (tested on Delphi XE).
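One way to sidestep the pitfall is to keep bare character literals out of Ord. The following is a minimal sketch, assuming Delphi 2009 or later; the program name and the constant YaCode are made up for illustration. It writes the character as a four-digit numeric constant, which is a WideChar regardless of the compiler's codepage setting. The explicit Char(...) cast from the test above is another option.

```pascal
program OrdSafe;

{$APPTYPE CONSOLE}

const
  // Hypothetical name: UTF-16 code of Cyrillic 'Я' (U+042F)
  YaCode = $042F;

begin
  // A four-digit numeric character constant is a WideChar,
  // so its value does not depend on the source codepage
  Writeln(Ord(#$042F));            // 1071
  // Comparing against the numeric code avoids the literal altogether
  Writeln(Ord(#$042F) = YaCode);   // TRUE
end.
```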

After reading the comments I tried another test with both Cyrillic ‘Я’ (=223 on 1251 codepage) and German ‘ß’ (=223 on 1252 codepage):

program OrdTest2;

{$APPTYPE CONSOLE}

uses
  SysUtils;

begin
  try
    Writeln(Ord('Я'), '  ', Ord(Char('Я')));
    Writeln(Ord('ß'), '  ', Ord(Char('ß')));
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
  Readln;
end.

If I set the compiler’s codepage to 1251 I get

[screenshot: program output]

If I set the compiler’s codepage to 1252 I get

[screenshot: program output]

because the German ‘ß’ has the same code (223) in both the ANSI 1252 codepage and the UTF-16 encoding.

Parsing UTF8 strings


Windows API functions like CharNextExA do not support UTF8 encoding; you can test CharNextExA with the UTF8 codepage (65001) and UTF8 strings and see that it does not work. If you need to calculate, for example, the number of Unicode codepoints in a UTF8 string, you should parse the UTF8 string manually. Fortunately the task is very simple: every byte of a UTF8 sequence except the first has the bit pattern 10xxxxxx, so it boils down to the following function:

function UTF8IsLeadChar(Ch: AnsiChar): Boolean;
begin
  // Continuation bytes look like 10xxxxxx; anything else starts a codepoint
  Result:= (Ord(Ch) and $C0) <> $80;
end;

The function that returns the number of Unicode codepoints in a UTF8 string looks as follows:

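function UTF8CharCount(const S: UTF8String): Integer;
var
  P: PAnsiChar;
  I: Integer;

begin
  I:= 0;
  P:= PAnsiChar(S);
  Result:= 0;
  // Walk the string byte by byte; Length(S) is the length in bytes
  while I < Length(S) do begin
    // Every lead byte (including plain ASCII) starts a new codepoint
    if UTF8IsLeadChar(P^) then Inc(Result);
    Inc(P);
    Inc(I);
  end;
end;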
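A small usage sketch may help here. It assumes Delphi 2009 or later and restates the two functions above in compact form so that it compiles on its own; the program and variable names are made up for illustration. The test string is built from numeric character codes so that the result does not depend on the source file encoding.

```pascal
program UTF8CountDemo;

{$APPTYPE CONSOLE}

function UTF8IsLeadChar(Ch: AnsiChar): Boolean;
begin
  Result := (Ord(Ch) and $C0) <> $80;
end;

// Compact restatement of UTF8CharCount using string indexing
function UTF8CharCount(const S: UTF8String): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if UTF8IsLeadChar(S[I]) then Inc(Result);
end;

var
  U: string;
  S: UTF8String;
begin
  // 'Я' (U+042F), 'ß' (U+00DF) and 'a'
  U := #$042F#$00DF + 'a';
  S := UTF8String(U);          // converts UTF-16 to UTF-8
  Writeln(Length(S));          // 5 bytes: 2 + 2 + 1
  Writeln(UTF8CharCount(S));   // 3 codepoints
end.
```

Note that Length(S) reports bytes, which is why the two numbers differ.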

How to uppercase [lowercase] strings in Delphi 2009+?


The answer is not as obvious as it used to be.

In time immemorial strings were ASCII strings, and the UpperCase function dates from then. It just replaces the lowercase ‘a’..‘z’ characters with their uppercase ‘A’..‘Z’ versions. Though nowadays UpperCase is a Unicode function, it still ‘uppercases’ only ASCII characters. That is good, but it is not all the uppercase functionality we need.

Early Delphi versions introduced ANSI strings. Uppercase conversion of ANSI strings covers local characters besides ASCII ‘a’..‘z’ and consequently is locale-dependent. The only locale supported by pre-Unicode Delphi versions is the system locale. The AnsiUpperCase function was introduced to uppercase ANSI strings using the system locale. That is not as good as it could be, since sometimes we need a locale different from the system locale, but at least it is clear and unambiguous.

Now in Delphi 2009 we can explicitly define the ANSI locale (codepage) in an AnsiString type, and that is great. ANSI uppercase conversion is locale-sensitive, so now we can perform ANSI string uppercase conversions for any locale by just calling the AnsiUpperCase function, right? The answer is NO.

First, in Delphi 2009 AnsiUpperCase is a Unicode function. I understand the [compatibility] reasons why a Unicode function has the Ansi prefix, but I wish it had never happened.

Second, the real ANSI version of the AnsiUpperCase function, defined in the AnsiStrings unit, is a wrapper for the WinAPI CharUpperBuffA function; it still uses the current system locale and ignores the locale associated with the string type.

As a result it is better to avoid AnsiUpperCase altogether. The function exists only for backward compatibility.
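The difference between UpperCase and the Unicode SysUtils.AnsiUpperCase described above can be seen in a small test. This is a sketch, assuming Delphi 2009 or later on Windows; the program name is made up, and the non-ASCII characters are written as numeric codes to keep the source file encoding out of the picture.

```pascal
program UpperDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S: string;
begin
  // 'a' followed by 'ä' (U+00E4)
  S := 'a' + #$00E4;
  // UpperCase touches only 'a'..'z' and leaves U+00E4 alone
  Writeln(UpperCase(S) = 'A' + #$00E4);       // TRUE
  // SysUtils.AnsiUpperCase takes a Unicode string and maps U+00E4 to U+00C4
  Writeln(AnsiUpperCase(S) = 'A' + #$00C4);   // TRUE
end.
```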

Nowadays the default string type is the Unicode string. A Unicode string can include characters of all different locales at once, so Unicode strings are again locale-independent, as ASCII strings were, right? Well, that is almost right. But to be absolutely correct, the answer is still NO.

The only problem I know of that makes Unicode uppercase conversion locale-dependent is the case of dotless i (ı, $0131) and dotted I (İ, $0130). In most languages the uppercase of i ($69) is I ($49), but in the Turkish locale i ($69) maps to İ ($0130). Similarly, in Turkish the lowercase of I ($49) is ı ($0131).

Now let us go back to Delphi. The new string format contains no locale information for Unicode strings. We have two functions for Unicode uppercase conversion: the already mentioned SysUtils.AnsiUpperCase, kept for backward compatibility, and the new ToUpper function defined in the Character unit. SysUtils.AnsiUpperCase is a wrapper for the WinAPI CharUpperBuffW function and ignores locale-specific issues. CharUpperBuffW works much better than its ANSI analog CharUpperBuffA, but still can’t help with the “Turkish case”. Character.ToUpper is a wrapper for the LCMapString function and is locale-dependent, but the locale parameter is set to the system locale. So if you need “Turkish uppercase” on a system with a different locale (or want to make the result of uppercase conversion independent of the system locale) you must write your own LCMapString wrapper.
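Such a wrapper might look roughly like the sketch below. It is not the Character unit's implementation, just one possible shape under stated assumptions: Delphi 2009 or later on Windows, where LCMapString resolves to the wide version. The names UpperCaseForLocale, TurkishLCID and LinguisticCasing are hypothetical; the last one is declared locally with the value of the WinAPI flag LCMAP_LINGUISTIC_CASING so that the sketch does not depend on the flag being declared in a particular Windows unit version.

```pascal
program LocaleUpperDemo;

{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

const
  // Hypothetical names introduced for this sketch
  TurkishLCID = $041F;            // tr-TR
  LinguisticCasing = $01000000;   // value of LCMAP_LINGUISTIC_CASING

// Uppercases S using the rules of an explicitly given locale
function UpperCaseForLocale(const S: string; Locale: LCID): string;
var
  Flags: DWORD;
  Len: Integer;
begin
  Result := '';
  if S = '' then Exit;
  Flags := LCMAP_UPPERCASE or LinguisticCasing;
  // A call with a zero-sized buffer returns the required length
  Len := LCMapString(Locale, Flags, PChar(S), Length(S), nil, 0);
  if Len = 0 then RaiseLastOSError;
  SetLength(Result, Len);
  if LCMapString(Locale, Flags, PChar(S), Length(S),
                 PChar(Result), Len) = 0 then
    RaiseLastOSError;
end;

var
  R: string;
begin
  R := UpperCaseForLocale('i', TurkishLCID);
  // Print the UTF-16 code of the result rather than the character itself
  Writeln(IntToHex(Ord(R[1]), 4));
end.
```

With the Turkish locale passed explicitly, the result should be İ ($0130) regardless of the system locale. Passing another LCID to the same function gives that locale's mapping instead.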

BTW: the early Delphi 2009 releases contain a side-effect bug in the ToUpper and ToLower implementations. The bug was fixed in Update 3 (build 12.0.3420.21218).