Diffstat (limited to 'txr.1')
-rw-r--r-- txr.1 92
1 files changed, 92 insertions, 0 deletions
diff --git a/txr.1 b/txr.1
index a4bf6237..2d3090dd 100644
--- a/txr.1
+++ b/txr.1
@@ -85056,6 +85056,98 @@ If this variable is
.codn nil ,
then JSON numbers are all converted to floating point.
+.coNP Function @ get-csv
+.synb
+.mets (get-csv <> [ source ])
+.syne
+.desc
+The
+.code get-csv
+function reads a single record of CSV ("comma-separated values")
+data from the input
+.metn source ,
+returning a vector of strings.
+
+The
+.meta source
+must be a stream or a string.
+If it is omitted, then
+.code *stdin*
+is used.
+
+The CSV scanning is implemented in a way which is nearly compatible with RFC
+4180, with certain differences, as well as extensions of behavior.
+
+RFC 4180 specifies that the line separators in CSV are CR-LF pairs.
+The specification makes it unclear whether, when these separators occur
+in the data, they are retained in that two-character form or whether
+they may be mapped to a native newline representation.
+
+In contrast, the
+.code get-csv
+function recognizes two equivalent line breaks: CR-LF and LF.
+When a line break occurs in field data, it is represented as a single LF,
+which is the newline character in \*(TL: the character
+.code #\enewline
+denoted in strings by the escape sequence
+.codn \en .
+
+An isolated CR character in the CSV data (one not followed by
+LF) is considered an ordinary character and becomes a constituent
+character of a field; it is never treated as a line break.
+
+RFC 4180 specifies CSV as consisting of 7-bit characters only. The
+.code get-csv
+function extends the behavior by operating on Unicode characters,
+which are decoded from UTF-8 by the underlying stream implementation.
+
+RFC 4180 excludes control characters other than those encoding
+line breaks, and also excludes the character U+007F;
+.code get-csv
+treats control characters as literal field constituent characters.
+A NUL character occurring in the UTF-8 data is mapped by the \*(TX
+stream implementation to the pseudo-null character, and
+.code get-csv
+then allows it as a field constituent.
+
+RFC 4180 neglects to specify behavior when the input deviates from
+the specified syntax. The
+.code get-csv
+function implements the following extensions of behavior for
+nonconforming input:
+
+When the closing quote of a double-quoted field is followed by trailing
+characters, these are added to the field. In other words, when a double-quoted
+field is closed, then processing of additional characters continues in the
+same manner as for an unquoted field, allowing additional characters to be
+recognized and added to the field prior to the appearance of a comma or end of
+record.
+
+The RFC states that fields containing double quotes should be
+enclosed in double quotes, with the constituent double quotes being
+escaped. The
+.code get-csv
+function allows an unquoted field to contain double quote characters,
+which are treated as ordinary characters belonging to the field. In this
+situation, a sequence of two double quotes specifies two double quotes.
+
+The RFC states that the last field of a record must not be followed
+by a comma. Under the
+.code get-csv
+function, this situation is impossible. A trailing comma at the end
+of a record specifies an empty last field, which is not itself
+followed by a comma.
+
+The
+.code get-csv
+function does not recognize or diagnose any errors; it extracts the
+maximal prefix of the input source which constitutes a valid CSV record.
+Characters not belonging to the CSV record remain in the stream.
+Multiple calls to
+.code get-csv
+for the same input stream given as
+.meta source
+extract consecutive CSV records.
.SH* FOREIGN FUNCTION INTERFACE
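
The scanning rules documented in this patch can be illustrated with a small model. The following Python sketch is not the TXR implementation and is not part of the patch; it is one reading of the text above, under stated assumptions. The name `get_csv_record`, the string-plus-offset interface, the list return type (the documented function returns a vector), and the `None` result at end of input are all hypothetical choices made for the illustration.

```python
def get_csv_record(text, pos=0):
    """Scan one CSV record from text starting at pos.

    Returns (fields, next_pos). At end of input returns (None, pos);
    that end-of-input convention is an assumption of this sketch.
    """
    n = len(text)
    if pos >= n:
        return None, pos
    fields = []
    while True:
        buf = []
        # A field is treated as quoted only if it begins with a double quote.
        if pos < n and text[pos] == '"':
            pos += 1
            while pos < n:
                if text.startswith('""', pos):
                    buf.append('"')        # escaped quote inside quoted field
                    pos += 2
                elif text[pos] == '"':
                    pos += 1               # closing quote
                    break
                elif text.startswith('\r\n', pos):
                    buf.append('\n')       # CR-LF in field data becomes LF
                    pos += 2
                else:
                    buf.append(text[pos])  # includes LF and isolated CR
                    pos += 1
        # Unquoted scanning; also handles characters trailing a closing quote.
        end_of_record = True
        while pos < n:
            if text.startswith('\r\n', pos):
                pos += 2                   # CR-LF line break ends the record
                break
            ch = text[pos]
            pos += 1
            if ch == '\n':                 # bare LF also ends the record
                break
            if ch == ',':
                end_of_record = False      # another field follows
                break
            buf.append(ch)                 # quotes and isolated CR are ordinary
        fields.append(''.join(buf))
        if end_of_record:
            return fields, pos
```

Calling the function repeatedly with the returned offset models the documented behavior of consecutive calls on the same stream: each call consumes one record and leaves the remaining characters for the next call.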