author | Arnold D. Robbins <arnold@skeeve.com> | 2019-12-22 20:45:27 +0200 |
---|---|---|
committer | Arnold D. Robbins <arnold@skeeve.com> | 2019-12-22 20:45:27 +0200 |
commit | 2c94bddc97b3ec05ecaebb394bab57be707ca81f (patch) | |
tree | f09f1f283d5f279768ee88a9f8f66131ba5ead21 /doc/implementation-notes.txt | |
parent | 85d4a663bbe5ac966b15b4d671bf892638350f4d (diff) | |
Add implementation-notes.txt to the repo.
Diffstat (limited to 'doc/implementation-notes.txt')
-rw-r--r-- | doc/implementation-notes.txt | 57 |
1 files changed, 57 insertions, 0 deletions
diff --git a/doc/implementation-notes.txt b/doc/implementation-notes.txt
new file mode 100644
index 00000000..275c422b
--- /dev/null
+++ b/doc/implementation-notes.txt
@@ -0,0 +1,57 @@
+IMPLEMENTATION NOTES
+====================
+
+Multibyte Strings
+-----------------
+
+Gawk stores strings as a (char *, size_t) pair, not as C strings, in order
+to follow the GNU "no arbitrary limits" design principle. There are also
+some flag bits associated with awk values, such as whether the type
+is string or number, and if the numeric and/or string values are current.
+
+Wide character support consists of an additional (wchar_t *, size_t)
+pair and a flag bit indicating that the wide character string value
+is current.
+
+The primary representation is the char* string, which holds the
+multibyte encoding of the string (UTF-8 or whatever is used in the
+current locale). This is used for input and output.
+
+The wide character representation is created and managed on an
+as-needed basis. That is, when we need to know about characters
+and not bytes: length(), substr(), index(), match() and printf("%c")
+format. (Did I forget any?)
+
+Fortunately, the GNU regex routines know how to match directly
+on the multibyte representation, although I've often wished
+for a version of those APIs that would take the wide character
+strings directly.
+
+Getting info out of match() is a bit of a pain. The regex routines
+return match start and length in terms of byte offsets. Gawk builds
+a secondary index that turns these offsets into offsets within
+the wide character string, so that proper values can be returned.
+
+For printf %c, the wide character representation has to be
+turned back into multibyte encoded characters and then printed.
+
+Assignment of a string value clears the wide-string-current bit
+and the memory for the wchar_t* string is released.
+
+GOTCHAS
+*******
+
+There is a significant GOTCHA with GLIBC. The "constant" MB_CUR_MAX
+indicates how many bytes are the maximum needed to multibyte-encode
+a character in the current locale. And it's easy to use it in code.
+
+The problem is that this constant is actually a macro defined to call
+a function to return the correct value. This makes sense, since with
+setlocale() you can change the current locale of the running program.
+
+But, for programs like gawk that don't change their locale mid-run,
+using MB_CUR_MAX inside a heavily-called loop is a disaster (e.g., the
+loop that parses input data to find the end of the record).
+
+Thus gawk has its own `gawk_mb_cur_max' variable that it initializes upon
+start-up, and doesn't touch afterwards. That variable is used everywhere.
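
Illustration (not part of the commit): the notes describe a value that
carries a primary multibyte string, a wide character copy built only on
demand, and a flag bit saying whether that copy is current. The sketch
below is one possible reading of that scheme. The names struct value,
WSTRCUR, value_assign(), value_force_wide() and value_char_length() are
hypothetical and are not taken from the gawk sources.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

enum { WSTRCUR = 0x1 };         /* hypothetical "wide string is current" bit */

struct value {                  /* hypothetical stand-in for an awk value */
    char    *str;  size_t len;  /* primary multibyte form, used for I/O */
    wchar_t *wstr; size_t wlen; /* wide copy, created only when needed */
    unsigned flags;
};

/* Assign a new string: the wide copy is stale, so free it and clear the bit.
 * Returns 0 on success, -1 if memory could not be allocated. */
int value_assign(struct value *v, const char *s, size_t len)
{
    char *copy = malloc(len + 1);

    assert(v != NULL && s != NULL);
    if (copy == NULL)
        return -1;
    memcpy(copy, s, len);
    copy[len] = '\0';
    free(v->str);
    v->str = copy;
    v->len = len;
    free(v->wstr);
    v->wstr = NULL;
    v->wlen = 0;
    v->flags &= ~(unsigned) WSTRCUR;
    return 0;
}

/* Build the wide copy if it is not current. Returns 0 on success. */
int value_force_wide(struct value *v)
{
    mbstate_t st;
    size_t b = 0, w = 0;
    wchar_t *ws;

    if (v->flags & WSTRCUR)
        return 0;
    /* A string of len bytes holds at most len characters. */
    ws = malloc((v->len + 1) * sizeof *ws);
    if (ws == NULL)
        return -1;
    memset(&st, 0, sizeof st);
    while (b < v->len) {
        size_t n = mbrtowc(&ws[w], v->str + b, v->len - b, &st);

        if (n == (size_t) -1 || n == (size_t) -2) {
            /* Invalid or truncated sequence: keep the byte as one character. */
            memset(&st, 0, sizeof st);
            ws[w] = (unsigned char) v->str[b];
            n = 1;
        } else if (n == 0) {
            n = 1;              /* embedded NUL, already stored as L'\0' */
        }
        b += n;
        w++;
    }
    ws[w] = L'\0';
    free(v->wstr);
    v->wstr = ws;
    v->wlen = w;
    v->flags |= WSTRCUR;
    return 0;
}

/* Length in characters, as length() needs; falls back to bytes on failure. */
size_t value_char_length(struct value *v)
{
    return value_force_wide(v) == 0 ? v->wlen : v->len;
}
```

The point of the design is that byte-oriented work (input, output, regex
matching) never pays for the conversion. Only character-oriented
operations trigger it, and the result is reused until the next
assignment.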
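
Illustration (not part of the commit): the match() paragraph mentions a
secondary index that converts the byte offsets returned by the regex
routines into character offsets. A minimal sketch of such an index,
assuming the standard mbrtowc() decoder, might look like this. The
function name build_char_index() is hypothetical; gawk's own code may be
organized quite differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: return a malloc'd table of len + 1 entries where
 * index[b] is the character offset of the character that byte b belongs to,
 * and index[len] is the total number of characters. Returns NULL if memory
 * cannot be allocated. Decoding follows the current locale.
 */
size_t *build_char_index(const char *s, size_t len, size_t *nchars)
{
    size_t *index = malloc((len + 1) * sizeof *index);
    mbstate_t st;
    size_t b = 0, c = 0, i;

    assert(s != NULL);
    if (index == NULL)
        return NULL;
    memset(&st, 0, sizeof st);

    while (b < len) {
        size_t n = mbrtowc(NULL, s + b, len - b, &st);

        if (n == (size_t) -1 || n == (size_t) -2) {
            /* Invalid or truncated sequence: count the byte as one character. */
            memset(&st, 0, sizeof st);
            n = 1;
        } else if (n == 0) {
            n = 1;              /* embedded NUL byte is one character */
        }
        for (i = 0; i < n; i++)
            index[b + i] = c;   /* every byte of the character maps to it */
        b += n;
        c++;
    }
    index[len] = c;
    if (nchars != NULL)
        *nchars = c;
    return index;
}
```

With such a table, a match that starts at byte offset so and spans blen
bytes starts at character index[so] and covers index[so + blen] -
index[so] characters. Since awk's RSTART is 1-based, the value reported
to the program would be index[so] + 1.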
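
Illustration (not part of the commit): the GOTCHAS section recommends
reading MB_CUR_MAX once at start-up and using a plain variable afterwards.
The sketch below shows that pattern in a record-scanning loop. The names
my_mb_cur_max, init_mb_cur_max() and find_record_end() are hypothetical;
only gawk_mb_cur_max is named in the notes, and this is not gawk's code.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Cached copy of MB_CUR_MAX, set once at start-up and never changed. */
static size_t my_mb_cur_max = 1;

/* Select the locale ("" means: take it from the environment), then cache
 * the value so that hot loops never evaluate the MB_CUR_MAX macro. */
void init_mb_cur_max(const char *locale_name)
{
    if (setlocale(LC_ALL, locale_name) == NULL)
        setlocale(LC_ALL, "C");
    my_mb_cur_max = MB_CUR_MAX;
}

/* Return the byte offset of the first record separator rs in buf,
 * or len if there is none. */
size_t find_record_end(const char *buf, size_t len, char rs)
{
    mbstate_t st;
    size_t i = 0;

    assert(buf != NULL);
    if (my_mb_cur_max == 1) {
        /* Single-byte locale: a plain byte search is enough. */
        const char *p = memchr(buf, rs, len);

        return p != NULL ? (size_t) (p - buf) : len;
    }

    /* Multibyte locale: step character by character so that a byte inside
     * a multibyte character is never mistaken for the separator. */
    memset(&st, 0, sizeof st);
    while (i < len) {
        size_t n = mbrlen(buf + i, len - i, &st);

        if (n == (size_t) -1 || n == (size_t) -2 || n == 0) {
            memset(&st, 0, sizeof st);
            n = 1;              /* invalid, truncated, or NUL: one byte */
        }
        if (n == 1 && buf[i] == rs)
            return i;
        i += n;
    }
    return len;
}
```

A program would call init_mb_cur_max("") once near the top of main().
This is only safe under the assumption stated in the notes, namely that
the program does not change its locale mid-run.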