Add implementation-notes.txt to the repo.

author: Arnold D. Robbins <arnold@skeeve.com> 2019-12-22 20:45:27 +0200
committer: Arnold D. Robbins <arnold@skeeve.com> 2019-12-22 20:45:27 +0200
commit: 2c94bddc97b3ec05ecaebb394bab57be707ca81f (patch)
tree: f09f1f283d5f279768ee88a9f8f66131ba5ead21 /doc/implementation-notes.txt
parent: 85d4a663bbe5ac966b15b4d671bf892638350f4d (diff)
download: egawk-2c94bddc97b3ec05ecaebb394bab57be707ca81f.tar.gz
egawk-2c94bddc97b3ec05ecaebb394bab57be707ca81f.tar.bz2
egawk-2c94bddc97b3ec05ecaebb394bab57be707ca81f.zip
1 files changed, 57 insertions, 0 deletions
diff --git a/doc/implementation-notes.txt b/doc/implementation-notes.txt
new file mode 100644
index 00000000..275c422b
--- /dev/null
+++ b/doc/implementation-notes.txt
@@ -0,0 +1,57 @@
+IMPLEMENTATION NOTES
+====================
+
+Multibyte Strings
+-----------------
+
+Gawk stores strings as (char *, size_t) pair, not as C strings, in order
+to follow the GNU "no arbitrary limits" design principle. There are also
+some flag bits associated with awk values, such as whether the type
+is string or number, and if the numeric and/or string values are current.
+
+Wide character support consists of an additional (wchar_t *, size_t)
+pair and a flag bit indicating that the wide character string value
+is current.
+
+The primary representation is the char* string, which holds the
+multibyte encoding of the string (UTF-8 or whatever is used in the
+current locale). This is used for input and output.
+
+The wide character representation is created and managed on an
+as-needed basis.  That is, when we need to know about characters
+and not bytes: length(), substr(), index(), match() and printf("%c")
+format.  (Did I forget any?)
+
+Fortunately, the GNU regex routines know how to match directly
+on the multibyte representation, although I've often wished
+for a version of those APIs that would take the wide character
+strings directly.
+
+Getting info out of match() is a bit of a pain. The regex routines
+return match start and length in terms of byte offsets.  Gawk builds
+a secondary index that turns these offsets into offsets within
+the multibyte string, so that proper values can be returned.
+
+For printf %c, the wide character representation has to be
+turned back into multibyte encoded characters and then printed.
+
+Assignment of a string value clears the wide-string-current bit
+and the memory for the wchar_t* string is released.
+
+GOTCHAS
+*******
+
+There is a significant GOTCHA with GLIBC.  The "constant" MB_CUR_MAX
+indicates how many bytes are the maximum needed to multibyte-encode
+a character in the current locale.  And it's easy to use it in code.
+
+The problem is that this constant is actually a macro defined to call
+a function to return the correct value.  This makes sense, since with
+setlocale() you can change the current locale of the running program.
+
+But, for programs like gawk that don't change their locale mid-run,
+using MB_CUR_MAX inside a heavily-called loop is a disaster (e.g., the
+loop that parses input data to find the end of the record).
+
+Thus gawk has it's own `gawk_mb_cur_max' variable that it initializes upon
+start-up, and doesn't touch afterwards. That variable is used everywhere.
author	Arnold D. Robbins <arnold@skeeve.com>	2019-12-22 20:45:27 +0200
committer	Arnold D. Robbins <arnold@skeeve.com>	2019-12-22 20:45:27 +0200
commit	2c94bddc97b3ec05ecaebb394bab57be707ca81f (patch)
tree	f09f1f283d5f279768ee88a9f8f66131ba5ead21 /doc/implementation-notes.txt
parent	85d4a663bbe5ac966b15b4d671bf892638350f4d (diff)
download	egawk-2c94bddc97b3ec05ecaebb394bab57be707ca81f.tar.gz egawk-2c94bddc97b3ec05ecaebb394bab57be707ca81f.tar.bz2 egawk-2c94bddc97b3ec05ecaebb394bab57be707ca81f.zip