IMPLEMENTATION NOTES
====================
Multibyte Strings
-----------------
Gawk stores strings as a (char *, size_t) pair, not as C strings, in order
to follow the GNU "no arbitrary limits" design principle. There are also
some flag bits associated with awk values, such as whether the type
is string or number, and whether the numeric and/or string values are
current.
Wide character support consists of an additional (wchar_t *, size_t)
pair and a flag bit indicating that the wide character string value
is current.
The primary representation is the char* string, which holds the
multibyte encoding of the string (UTF-8 or whatever is used in the
current locale). This is used for input and output.
The wide character representation is created and managed on an
as-needed basis, that is, only when we need to know about characters
and not bytes: for length(), substr(), index(), match() and the
printf("%c") format. (Did I forget any?)
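The lazy conversion described above can be sketched roughly as follows.
This is an illustration only, not gawk's actual code: the struct, the
flag names and force_wide() are all hypothetical, and copying invalid
bytes through unchanged is a choice made for this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical flag bits; the real gawk names and values may differ. */
enum {
    VAL_STRING  = 1 << 0,   /* value is a string */
    VAL_STRCUR  = 1 << 1,   /* multibyte string value is current */
    VAL_WSTRCUR = 1 << 2    /* wide character string value is current */
};

/* Hypothetical value type, loosely modeled on the description above. */
struct value {
    char    *str;    /* primary representation: multibyte bytes */
    size_t   len;    /* length of str in bytes */
    wchar_t *wstr;   /* secondary representation, built on demand */
    size_t   wlen;   /* length of wstr in characters */
    unsigned flags;
};

/*
 * Build the wide character string if it is not already current.
 * Returns 0 on success, -1 if memory cannot be allocated.
 * Invalid or incomplete sequences are copied through one byte at a time.
 */
static int
force_wide(struct value *v)
{
    mbstate_t st;
    const char *p, *end;
    wchar_t *w;
    size_t n = 0;

    assert(v != NULL);
    if (v->flags & VAL_WSTRCUR)
        return 0;               /* nothing to do */

    /* A string of len bytes holds at most len characters. */
    w = malloc((v->len + 1) * sizeof *w);
    if (w == NULL)
        return -1;

    memset(&st, 0, sizeof st);
    p = v->str;
    end = p + v->len;
    while (p < end) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, p, (size_t)(end - p), &st);

        if (r == (size_t)-1 || r == (size_t)-2) {
            wc = (unsigned char)*p;     /* keep the raw byte */
            r = 1;
            memset(&st, 0, sizeof st);  /* reset the shift state */
        } else if (r == 0) {
            r = 1;                      /* embedded NUL: one byte */
        }
        w[n++] = wc;
        p += r;
    }
    w[n] = L'\0';

    free(v->wstr);
    v->wstr = w;
    v->wlen = n;
    v->flags |= VAL_WSTRCUR;
    return 0;
}
```

With something like this in place, a length() implementation would call
force_wide() and return wlen, while the I/O paths keep using str and len
and never pay for the conversion.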
Fortunately, the GNU regex routines know how to match directly
on the multibyte representation, although I've often wished
for a version of those APIs that would take the wide character
strings directly.
Getting info out of match() is a bit of a pain. The regex routines
return match start and length in terms of byte offsets. Gawk builds
a secondary index that turns these byte offsets into character offsets
(offsets within the wide character string), so that proper values can
be returned.
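One way to build such a secondary index is a table with one entry per
byte that records which character that byte belongs to. The sketch
below assumes this layout; build_char_index() and match_to_chars() are
hypothetical names, and gawk's real index may be organized differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Return a malloc'ed table of len + 1 entries. For b < len, table[b]
 * is the zero-based index of the character containing byte b;
 * table[len] is the total number of characters. Invalid bytes are
 * counted as one character each (a choice made for this sketch).
 */
static size_t *
build_char_index(const char *s, size_t len)
{
    size_t *idx;
    mbstate_t st;
    size_t b = 0, nchars = 0;

    assert(s != NULL);
    idx = malloc((len + 1) * sizeof *idx);
    if (idx == NULL)
        return NULL;

    memset(&st, 0, sizeof st);
    while (b < len) {
        size_t k;
        size_t r = mbrlen(s + b, len - b, &st);

        if (r == (size_t)-1 || r == (size_t)-2 || r == 0) {
            r = 1;
            memset(&st, 0, sizeof st);
        }
        for (k = 0; k < r; k++)
            idx[b + k] = nchars;    /* every byte of this character */
        nchars++;
        b += r;
    }
    idx[len] = nchars;
    return idx;
}

/*
 * Convert a byte-based match (start offset, length) into awk-style
 * character values: a one-based start and a length in characters.
 */
static void
match_to_chars(const size_t *idx, size_t bstart, size_t blen,
               size_t *rstart, size_t *rlength)
{
    assert(idx != NULL && rstart != NULL && rlength != NULL);
    *rstart = idx[bstart] + 1;
    *rlength = idx[bstart + blen] - idx[bstart];
}
```

For example, in a UTF-8 locale the six-byte string "h\xc3\xa9llo" gives
the table 0 1 1 2 3 4 5, so a match reported at byte offset 3 with byte
length 3 becomes a start of 3 and a length of 3 in characters.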
For printf %c, the wide character representation has to be
turned back into multibyte encoded characters and then printed.
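The conversion back to multibyte form can be done with the standard
wcrtomb() function. The following is a minimal sketch, assuming the
wide string is already available; encode_char() and print_first_char()
are hypothetical helpers, not gawk functions.

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX */
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/*
 * Encode wc into buf, which must hold at least MB_LEN_MAX bytes.
 * Returns the number of bytes written, or 0 if wc cannot be
 * represented in the current locale.
 */
static size_t
encode_char(wchar_t wc, char *buf)
{
    mbstate_t st;
    size_t n;

    assert(buf != NULL);
    memset(&st, 0, sizeof st);
    n = wcrtomb(buf, wc, &st);
    return n == (size_t)-1 ? 0 : n;
}

/*
 * Write the first character of a wide string to fp, roughly what
 * printf("%c") does when handed a string argument.
 * Returns the number of bytes written, or -1 on a write error.
 */
static int
print_first_char(FILE *fp, const wchar_t *ws, size_t wlen)
{
    char buf[MB_LEN_MAX];
    size_t n;

    assert(fp != NULL && ws != NULL);
    if (wlen == 0)
        return 0;
    n = encode_char(ws[0], buf);
    return fwrite(buf, 1, n, fp) == n ? (int)n : -1;
}
```

MB_LEN_MAX is a true compile-time constant, so it is safe for sizing
the buffer regardless of the locale in effect.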
Assignment of a string value clears the wide-string-current bit
and the memory for the wchar_t* string is released.
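The invalidation on assignment might look like the sketch below. The
struct, the flag names and assign_string() are hypothetical and only
mirror the description above; gawk's real data structures differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical flag bits, for illustration only. */
enum {
    VAL_STRCUR  = 1 << 1,   /* multibyte string value is current */
    VAL_WSTRCUR = 1 << 2    /* wide character string value is current */
};

/* Hypothetical value type with both representations. */
struct value {
    char    *str;
    size_t   len;
    wchar_t *wstr;
    size_t   wlen;
    unsigned flags;
};

/*
 * Assign a new multibyte string (len bytes, copied) to v.
 * The cached wide string no longer matches, so it is freed and its
 * "current" bit is cleared; it is rebuilt only if needed again.
 * Returns 0 on success, -1 if memory cannot be allocated.
 */
static int
assign_string(struct value *v, const char *bytes, size_t len)
{
    char *copy;

    assert(v != NULL && bytes != NULL);
    copy = malloc(len + 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, bytes, len);
    copy[len] = '\0';

    free(v->str);
    v->str = copy;
    v->len = len;
    v->flags |= VAL_STRCUR;

    free(v->wstr);              /* release the stale wide string */
    v->wstr = NULL;
    v->wlen = 0;
    v->flags &= ~(unsigned)VAL_WSTRCUR;
    return 0;
}
```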
GOTCHAS
*******
There is a significant GOTCHA with GLIBC. The "constant" MB_CUR_MAX
indicates the maximum number of bytes needed to multibyte-encode
a character in the current locale, and it's easy to use in code.
The problem is that this constant is actually a macro defined to call
a function to return the correct value. This makes sense, since with
setlocale() you can change the current locale of the running program.
But, for programs like gawk that don't change their locale mid-run,
using MB_CUR_MAX inside a heavily-called loop is a disaster (e.g., the
loop that parses input data to find the end of the record).
Thus gawk has its own `gawk_mb_cur_max' variable that it initializes upon
start-up, and doesn't touch afterwards. That variable is used everywhere.
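The caching pattern can be sketched as follows. The names
cached_mb_cur_max, init_locale_cache() and find_record_end() are
hypothetical stand-ins; only MB_CUR_MAX, setlocale(), memchr() and
mbrlen() are standard. The record scan assumes a newline terminator,
purely for illustration.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>   /* MB_CUR_MAX */
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for gawk_mb_cur_max: set once at start-up. */
static size_t cached_mb_cur_max = 1;

/* Call once, right after the locale has been established. */
static void
init_locale_cache(void)
{
    setlocale(LC_ALL, "");
    cached_mb_cur_max = MB_CUR_MAX;   /* the only use of the macro */
    assert(cached_mb_cur_max >= 1);
}

/*
 * Return the offset of the first newline in buf, or len if none.
 * The hot path tests the cached variable, not MB_CUR_MAX.
 */
static size_t
find_record_end(const char *buf, size_t len)
{
    mbstate_t st;
    size_t i = 0;

    assert(buf != NULL);
    if (cached_mb_cur_max == 1) {
        /* Single-byte locale: a plain byte search is enough. */
        const char *nl = memchr(buf, '\n', len);
        return nl != NULL ? (size_t)(nl - buf) : len;
    }

    /* Multibyte locale: step over whole characters. */
    memset(&st, 0, sizeof st);
    while (i < len) {
        size_t r;

        if (buf[i] == '\n')
            return i;
        r = mbrlen(buf + i, len - i, &st);
        if (r == (size_t)-1 || r == (size_t)-2 || r == 0) {
            r = 1;                      /* skip a bad byte */
            memset(&st, 0, sizeof st);
        }
        i += r;
    }
    return len;
}
```

Reading a plain variable costs one load, whereas each use of the macro
would be a function call. This is only valid because the program never
changes its locale after start-up.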