aboutsummaryrefslogtreecommitdiffstats
path: root/doc
diff options
context:
space:
mode:
authorArnold D. Robbins <arnold@skeeve.com>2019-12-22 20:45:27 +0200
committerArnold D. Robbins <arnold@skeeve.com>2019-12-22 20:45:27 +0200
commit2c94bddc97b3ec05ecaebb394bab57be707ca81f (patch)
treef09f1f283d5f279768ee88a9f8f66131ba5ead21 /doc
parent85d4a663bbe5ac966b15b4d671bf892638350f4d (diff)
downloadegawk-2c94bddc97b3ec05ecaebb394bab57be707ca81f.tar.gz
egawk-2c94bddc97b3ec05ecaebb394bab57be707ca81f.tar.bz2
egawk-2c94bddc97b3ec05ecaebb394bab57be707ca81f.zip
Add implementation-notes.txt to the repo.
Diffstat (limited to 'doc')
-rw-r--r--doc/ChangeLog4
-rw-r--r--doc/implementation-notes.txt57
2 files changed, 61 insertions, 0 deletions
diff --git a/doc/ChangeLog b/doc/ChangeLog
index 0e12cae7..3a8688b1 100644
--- a/doc/ChangeLog
+++ b/doc/ChangeLog
@@ -1,3 +1,7 @@
+2019-12-22 Arnold D. Robbins <arnold@skeeve.com>
+
+ * implementation-notes.txt: New file.
+
2019-12-15 Arnold D. Robbins <arnold@skeeve.com>
* gawktexi.in: Use `@codequoteundirected on' and
diff --git a/doc/implementation-notes.txt b/doc/implementation-notes.txt
new file mode 100644
index 00000000..275c422b
--- /dev/null
+++ b/doc/implementation-notes.txt
@@ -0,0 +1,57 @@
+IMPLEMENTATION NOTES
+====================
+
+Multibyte Strings
+-----------------
+
+Gawk stores strings as (char *, size_t) pair, not as C strings, in order
+to follow the GNU "no arbitrary limits" design principle. There are also
+some flag bits associated with awk values, such as whether the type
+is string or number, and if the numeric and/or string values are current.
+
+Wide character support consists of an additional (wchar_t *, size_t)
+pair and a flag bit indicating that the wide character string value
+is current.
+
+The primary representation is the char* string, which holds the
+multibyte encoding of the string (UTF-8 or whatever is used in the
+current locale). This is used for input and output.
+
+The wide character representation is created and managed on an
+as-needed basis. That is, when we need to know about characters
+and not bytes: length(), substr(), index(), match() and printf("%c")
+format. (Did I forget any?)
+
+Fortunately, the GNU regex routines know how to match directly
+on the multibyte representation, although I've often wished
+for a version of those APIs that would take the wide character
+strings directly.
+
+Getting info out of match() is a bit of a pain. The regex routines
+return match start and length in terms of byte offsets. Gawk builds
+a secondary index that turns these offsets into offsets within
+the multibyte string, so that proper values can be returned.
+
+For printf %c, the wide character representation has to be
+turned back into multibyte encoded characters and then printed.
+
+Assignment of a string value clears the wide-string-current bit
+and the memory for the wchar_t* string is released.
+
+GOTCHAS
+*******
+
+There is a significant GOTCHA with GLIBC. The "constant" MB_CUR_MAX
+indicates how many bytes are the maximum needed to multibyte-encode
+a character in the current locale. And it's easy to use it in code.
+
+The problem is that this constant is actually a macro defined to call
+a function to return the correct value. This makes sense, since with
+setlocale() you can change the current locale of the running program.
+
+But, for programs like gawk that don't change their locale mid-run,
+using MB_CUR_MAX inside a heavily-called loop is a disaster (e.g., the
+loop that parses input data to find the end of the record).
+
+Thus gawk has it's own `gawk_mb_cur_max' variable that it initializes upon
+start-up, and doesn't touch afterwards. That variable is used everywhere.