doc: redocument UTF-8 in source and literals.

* txr.1: Because invalid UTF-8 bytes are allowed in string literals, that documentation needs to be updated. I'm rewriting it substantially to clarify the difference between text streams and parsing source. In the discussion of escape sequences in string literals, the wording is improved. Because the source code is UTF-8, we could plausibly support escapes which specify bytes (that are then decoded), so that's not the correct rationale for not supporting it.
author: Kaz Kylheku <kaz@kylheku.com> 2021-04-10 09:52:15 -0700
committer: Kaz Kylheku <kaz@kylheku.com> 2021-04-10 09:52:15 -0700
commit: 92612fa6b59c74197394edce9b9379382671cbcf (patch)
tree: 735a5c3c6c013d773f129d782364c465c8d88760
parent: a95e662da8b6e84a2cd9619258204e46580317a2 (diff)
download: txr-92612fa6b59c74197394edce9b9379382671cbcf.tar.gz
txr-92612fa6b59c74197394edce9b9379382671cbcf.tar.bz2
txr-92612fa6b59c74197394edce9b9379382671cbcf.zip
1 files changed, 48 insertions, 32 deletions
diff --git a/txr.1 b/txr.1
index 7d38383e..f9d718fd 100644
--- a/txr.1
+++ b/txr.1
@@ -1742,25 +1742,21 @@ and
 .codn L_CTYPE .
 The program reads and writes only the UTF-8 encoding.
 
-If \*(TX encounters invalid bytes in the UTF-8 input, what happens depends on
-the context in which this occurs. In a query, comments are read without regard
-for encoding, so invalid encoding bytes in comments are not detected. A comment
-is simply a sequence of bytes terminated by a newline.  In lexical elements
-which represent text, such as string literals, invalid or unexpected encoding
-bytes are treated as syntax errors. The scanner issues an error message,
-then discards a byte and resumes scanning.  Certain sequences pass through the
-scanner without triggering an error, namely some overlong UTF-8 sequences.
-These are caught when when the lexeme is subject to UTF-8 decoding, and treated
-in the same manner as other UTF-8 data, described in the following paragraph.
-
-Invalid bytes in data are treated as follows. When an invalid byte is
-encountered in the middle of a multibyte character, or if the input
-ends in the middle of a multibyte character, or if a character is extracted
-which is encoded as an overlong form, the UTF-8 decoder returns to the starting
-byte of the ill-formed multibyte character, and extracts just that byte,
-mapping it to the Unicode character range U+DC00 through U+DCFF.  The decoding
-resumes afresh at the following byte, expecting that byte to be the start
-of a UTF-8 code.
+\*(TX deals with UTF-8 separately in its parser, and in its I/O streams
+implementation.
+
+\*(TX's text streams perform UTF-8 conversion internally,
+such that \*(TX applications work with Unicode code points.
+
+In text streams, invalid UTF-8 bytes are treated as follows. When an invalid
+byte is encountered in the middle of a multibyte character, or if the input
+ends in the middle of a multibyte character, or if an invalid character is decoded,
+such as an overlong form, or a code in the range U+DC00 through U+DCFF, the UTF-8
+decoder returns to the starting byte of the ill-formed multibyte character, and
+extracts just one byte, mapping that byte to the Unicode character range U+DC00
+through U+DCFF, producing that code point as the decoded result.  The decoder
+is then reset to its initial state and begins decoding at the following byte,
+where the same algorithm is repeated.
 
 Furthermore, because \*(TX internally uses a null-terminated character
 representation of strings which easily interoperates with C language
@@ -1769,6 +1765,23 @@ the code U+DC00. On output, this code converts back to a null byte,
 as explained in the previous paragraph. By means of this representational
 trick, \*(TX can handle textual data containing null bytes.
 
+In contrast to the above, the \*(TX parser scans raw UTF-8 bytes from a binary
+stream, rather than using a text stream. The parser performs its own
+recognition of UTF-8 sequences in certain language constructs, using a UTF-8
+decoder only when processing certain kinds of tokens.
+
+Comments are read without regard for encoding, so invalid encoding bytes in
+comments are not detected. A comment is simply a sequence of bytes terminated
+by a newline.
+
+Invalid UTF-8 encountered while scanning identifiers and character names in
+character literal (hash-backslash) syntax is diagnosed as a syntax error.
+
+UTF-8 in string literals is treated in the same way as UTF-8 in text streams.
+Invalid UTF-8 bytes are mapped into code points in the U+DC00 through U+DCFF
+range, and incorporated as such into the resulting string object which the
+literal denotes. The same remarks apply to regular expression literals.
+
 .SS* Regular Expression Directives
 
 In place of a piece of text (see section Text above), a regular expression
@@ -1905,7 +1918,8 @@ Moreover, most Unicode characters beyond U+007F may appear in a
 with certain exceptions. A character may not be used if it is any of the
 Unicode space characters, a member of the high or low surrogate region,
 a member of any Unicode private use area, or is one of the two characters
-U+FFFE or U+FFFF.
+U+FFFE or U+FFFF. These situations produce a syntax error. Invalid UTF-8
+in an identifier is also a syntax error.
 
 The rule still holds that a name cannot look like a number so
 .code +123
@@ -2943,7 +2957,7 @@ numbers and not symbols.
 
 Character literals are introduced by the
 .code #\e
-syntax, which is either
+(hash-backslash) syntax, which is either
 followed by a character name, the letter
 .code x
 followed by hex digits,
@@ -3011,19 +3025,21 @@ as a delimiter. Thus,
 represents
 .strn "!;" .
 
-Note: strings in \*(TX consist of Unicode code points, not UTF-8 bytes;
-therefore the elements of a string literal notation cannot specify individual
-bytes.  Each instance of hexadecimal or octal escape specifies a code point,
-even if its value lies in the 8 bit range.
-However, when a \*(TX string is encoded to UTF-8,
-every code point in the range U+DC00 through U+DCFF is converted to a
-a single byte, by taking the low-order eight bits of its value. By manipulating
-code points in this special range, \*(TX programs can output arbitrary binary
-data into text streams. Also note that the
+Note that the source code syntax of \*(TX string literals is specified
+in UTF-8, which is decoded into an internal string representation consisting
+of code points. The numeric escape sequences are an abstract syntax for
+specifying code points, not for specifying bytes to be inserted into the
+UTF-8 representation, even if they lie in the 8 bit range. Bytes cannot be
+directly specified, other than literally.  However, when a \*(TX string object
+is encoded to UTF-8, every code point lying in the range U+DC00 through U+DCFF
+is converted to a single byte, by taking the low-order eight bits of its
+value.  By manipulating code points in this special range, \*(TX programs can
+reproduce arbitrary byte sequences in text streams. Also note that the
 .code \eu
 escape sequence for specifying code points found in some languages is
-unnecessary and absent.  More detailed information is given in the section
-Character Handling and International Characters.
+unnecessary and absent, since the existing hexadecimal and octal escapes
+satisfy this requirement.  More detailed information is given in the earlier
+section Character Handling and International Characters.
 
 If the line ends in the middle of a literal, it is an error, unless the
 last character is a backslash. This backslash is a special escape which does
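
The byte-to-code-point mapping described in the first hunk can be illustrated with a short Python sketch. This is not TXR's implementation: the function names decode_utf8_lenient and encode_utf8_lenient are hypothetical, and the sketch assumes that only overlong forms, out-of-range values and codes in U+DC00 through U+DCFF count as invalid decoded characters, as the revised text states. It also applies the null-byte rule (U+DC00) mentioned in the unchanged context lines.

```python
def decode_utf8_lenient(data: bytes) -> str:
    """Decode UTF-8, mapping each undecodable byte b to code point U+DC00 + b.

    Illustrative sketch only; not taken from the TXR sources.
    """
    out = []
    i = 0
    while i < len(data):
        b0 = data[i]
        if b0 == 0:
            # Null byte is represented as U+DC00 (per the man page context).
            out.append(chr(0xDC00))
            i += 1
            continue
        if b0 < 0x80:
            out.append(chr(b0))
            i += 1
            continue
        # Determine continuation count, initial bits and smallest legal value.
        if 0xC0 <= b0 <= 0xDF:
            need, cp, min_cp = 1, b0 & 0x1F, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            need, cp, min_cp = 2, b0 & 0x0F, 0x800
        elif 0xF0 <= b0 <= 0xF7:
            need, cp, min_cp = 3, b0 & 0x07, 0x10000
        else:
            need, cp, min_cp = 0, 0, 0  # stray continuation or invalid lead
        tail = data[i + 1:i + 1 + need]
        ok = (need > 0 and len(tail) == need
              and all(0x80 <= t <= 0xBF for t in tail))
        if ok:
            for t in tail:
                cp = (cp << 6) | (t & 0x3F)
            # Reject overlong forms, out-of-range values, and U+DC00..U+DCFF.
            ok = min_cp <= cp <= 0x10FFFF and not (0xDC00 <= cp <= 0xDCFF)
        if ok:
            out.append(chr(cp))
            i += need + 1
        else:
            # Return to the starting byte, emit just that byte, and restart
            # decoding afresh at the following byte.
            out.append(chr(0xDC00 + b0))
            i += 1
    return "".join(out)


def encode_utf8_lenient(text: str) -> bytes:
    """Encode to UTF-8; U+DC00..U+DCFF become one byte (low-order 8 bits)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xDC00 <= cp <= 0xDCFF:
            out.append(cp & 0xFF)
        else:
            # surrogatepass lets other lone surrogates through unchanged.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)
```

Under these assumptions, encode_utf8_lenient(decode_utf8_lenient(b)) returns b for any byte string b, which is the property that lets a program reproduce arbitrary byte sequences in a text stream.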