RawByteString type explained

With the introduction of Unicode support Delphi also introduced magic RawByteString type; the word ‘magic’ here means that you can’t implement a type with RawByteString functionality without hidden compiler support.

A common misunderstanding about the RawByteString type is that instances of RawByteString contain no encoding information and because of this the compiler can’t implement implicit string conversion. That is not true.

First of all the RawByteString is AnsiString (1-byte character size). If you typecast a Unicode string to RawByteString type or a RawByteString string to Unicode type the compiler will always implement string conversion; if the compiler has no information about the ANSI codepage of RawByteString it uses the system codepage for conversion.

So the RawByteString magic is for AnsiStrings only.

When you declare a variable of AnsiString type you also declare a codepage; for example

type
  CyrString = type AnsiString(1251);

var
  S: CyrString;

That means the compiler has static codepage information; if you do not declare codepage the compiler assumes the system codepage. The AnsiString instances also have runtime codepage information (codepage field in the string instance header), but usually the compiler never checks the runtime codepage information and uses static codepage information for string conversions.

The magic of RawByteString type is that the compiler has no static codepage information; it does not mean that a RawByteString instance has no runtime codepage information.

If you typecast an AnsiString instance to RawByteString type no string conversion happens.

The RawByteString type is for ANSI strings’ typecasting only, not for creating instances of the type. To understand how it works let us first consider “an educated abuse” of RawByteString:

type
  string1251 = type AnsiString(1251);
  string1252 = type AnsiString(1252);

var
  S1: string1251;
  S2: string1252;

begin
// just initialize it with some data
  S1:= 'АБВГДЕЙКА';

// no string conversion here;
// the S1 string instance is copied 'as is',
//   with codepage information
  S2:= RawByteString(S1);

Now we have an instance (S2) of string1252 type containing data in ANSI 1251 encoding and runtime codepage 1251. But since the compiler normally uses static codepage information the subsequent use of the instance may produce strange results.

Finally an example of correct RawByteString type usage. The following function counts the number of occurrences of ‘?’ character in an ANSI string:

function CountQuestions(const S: RawByteString): Integer;
const
  Mark = $3F;

var
  I: Integer;

begin
  Result:= 0;
  for I:= 1 to Length(S) do begin
    if Byte(S[I]) = Mark
      then Inc(Result);
  end;
end;

The purpose of using RawByteString type for the function’s argument is to avoid unnecessary string conversion.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s