Diffstat (limited to 'txr.1')
-rw-r--r-- | txr.1 | 92 |
1 files changed, 92 insertions, 0 deletions
@@ -85056,6 +85056,98 @@
 If this variable is
 .codn nil ,
 then JSON numbers are all converted to floating point.
+.coNP Function @ get-csv
+.synb
+.mets (get-csv <> [ source ])
+.syne
+.desc
+The
+.code get-csv
+function reads a single record of CSV ("comma-separated values")
+data from the input
+.metn source ,
+returning a vector of strings.
+
+The
+.meta source
+must be a stream or a string.
+If it is omitted, then
+.code *stdin*
+is used.
+
+The CSV scanning is implemented in a way which is nearly compatible with RFC
+4180, with certain differences, as well as extensions of behavior.
+
+RFC 4180 specifies that the line separators in CSV are CR-LF pairs.
+The specification makes it unclear whether, when these separators occur
+in the data, they are retained in that two-character form or whether
+they may be mapped to a native newline representation.
+
+In contrast, the
+.code get-csv
+function recognizes two equivalent line breaks: CR-LF and LF.
+When a line break occurs in field data, it is represented as a single LF,
+which is the newline character in \*(TL: the character
+.code #\enewline
+denoted in strings by the escape sequence
+.codn \en .
+
+An isolated CR character in the CSV data (one not followed by
+LF) is considered an ordinary character and becomes a constituent
+character of a field; it is never treated as a line break.
+
+RFC 4180 specifies CSV as consisting of 7-bit characters only. The
+.code get-csv
+function extends the behavior by operating on Unicode characters,
+which are decoded from UTF-8 by the underlying stream implementation.
+
+RFC 4180 excludes control characters other than those encoding
+line breaks, and also excludes the character U+007F;
+.code get-csv
+treats control characters as literal field constituent characters.
+A NUL character occurring in the UTF-8 data is mapped by the \*(TX
+stream implementation to the pseudo-null character, and
+.code get-csv
+then allows it as a field constituent.
+
+RFC 4180 neglects to specify behavior when the input deviates from
+the specified syntax. The
+.code get-csv
+function implements the following extensions of behavior for
+nonconforming input:
+
+When the closing quote of a double-quoted field is followed by trailing
+characters, these are added to the field. In other words, when a doubly quoted
+field is closed, then processing of additional characters continues in the
+same manner as for an unquoted field, allowing additional characters to be
+recognized and added to the field prior to the appearance of a comma or end of
+record.
+
+The RFC states that fields containing double quotes should be
+enclosed in double-quotes, with the constituent double-quotes being
+escaped. The
+.code get-csv
+function allows an unquoted field to contain double quote characters,
+which are treated as ordinary characters belonging to the field. In this
+situation, a sequence of two double quotes specifies two double quotes.
+
+The RFC states that the last field of a record must not be followed
+by a comma. Under the
+.code get-csv
+function, this situation is impossible. A trailing comma at the end
+of a record specifies an empty last field, which is not itself
+followed by a comma.
+
+The
+.code get-csv
+function does not recognize or diagnose any errors; it extracts the
+maximal prefix of the input source which constitutes a valid CSV record.
+Characters not belonging to the CSV record remain in the stream.
+Multiple calls to
+.code get-csv
+for the same input stream given as
+.meta source
+extract consecutive CSV records.
 .SH* FOREIGN FUNCTION INTERFACE
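
The scanning rules described in the added text can be modeled with a small state machine. The following is only an illustrative sketch in Python, not the TXR implementation and not TXR Lisp: the name get_csv_record is hypothetical, a Python list stands in for the returned vector, and several behaviors the text above does not specify are assumptions here (returning None when the stream is already at end of input, returning one empty field for an empty line, and ending an unterminated quoted field at end of input).

```python
import io


def get_csv_record(stream):
    """Read one CSV record from a text stream, following the rules in the
    documentation text above. Returns a list of field strings.
    ASSUMPTION: returns None if the stream is already at end of input."""
    fields = []
    field = []
    quoted = False     # currently inside a double-quoted section
    at_start = True    # next character would be the first of a field
    pending = None     # one character of lookahead to be reprocessed
    seen_any = False

    def next_char():
        nonlocal pending
        if pending is not None:
            ch, pending = pending, None
            return ch
        return stream.read(1)

    while True:
        ch = next_char()
        if ch == "":
            # End of input. ASSUMPTION: an unterminated quoted field
            # simply ends here.
            if not seen_any:
                return None
            fields.append("".join(field))
            return fields
        seen_any = True

        if ch == "\r":
            nxt = next_char()
            if nxt == "\n":
                ch = "\n"              # CR-LF is equivalent to LF
            else:
                field.append("\r")     # isolated CR is an ordinary character
                at_start = False
                if nxt != "":
                    pending = nxt      # reprocess the character after CR
                continue

        if quoted:
            if ch == '"':
                nxt = next_char()
                if nxt == '"':
                    field.append('"')  # doubled quote is an escaped quote
                else:
                    quoted = False     # closing quote; continue as unquoted
                    if nxt != "":
                        pending = nxt
            else:
                field.append(ch)       # a line break is stored as a single LF
            continue

        if ch == '"' and at_start:
            quoted = True              # a quote opens a field only at its start
            at_start = False
        elif ch == ",":
            fields.append("".join(field))
            field = []
            at_start = True            # a trailing comma yields an empty field
        elif ch == "\n":
            fields.append("".join(field))
            return fields              # the rest stays in the stream
        else:
            field.append(ch)           # includes quotes inside unquoted fields
            at_start = False


if __name__ == "__main__":
    src = io.StringIO('a,"b ""c"" d"x,e"f,\r\n1,"two\r\nlines",3\n')
    print(get_csv_record(src))
    print(get_csv_record(src))
```

In this sketch, lookahead is kept in a local variable rather than pushed back into the stream. That is sufficient because the character following a CR or a closing quote is always processed as part of the same record, so nothing beyond the record's terminating line break is consumed, and repeated calls on the same stream yield consecutive records.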