More stuff on CSV files.

author: Arnold D. Robbins <arnold@skeeve.com> 2020-04-12 20:33:06 +0300
committer: Arnold D. Robbins <arnold@skeeve.com> 2020-04-12 20:33:06 +0300
commit: 96656f01af15915c943865c8705bc7fc4a9ab436 (patch)
tree: 3a10d0864d7b92c145ddd863e0a23bc464e6f217 /doc/gawk.texi
parent: b60b2e33b6fa1727050c1e97662e1cf79ef1652b (diff)
1 files changed, 77 insertions, 7 deletions
diff --git a/doc/gawk.texi b/doc/gawk.texi
index ca99f017..cbc71886 100644
--- a/doc/gawk.texi
+++ b/doc/gawk.texi
@@ -577,6 +577,7 @@ particular records in a file and perform operations upon them.
 * Allowing trailing data::              Capturing optional trailing data.
 * Fields with fixed data::              Field values with fixed-width data.
 * Splitting By Content::                Defining Fields By Content
+* More CSV::                            More on CSV files.
 * Testing field creation::              Checking how @command{gawk} is
                                         splitting records.
 * Multiple Line::                       Reading multiline records.
@@ -8188,6 +8189,10 @@ four, and @code{$4} has the value @code{"ddd"}.
 @node Splitting By Content
 @section Defining Fields by Content
 
+@menu
+* More CSV::                    More on CSV files.
+@end menu
+
 @c O'Reilly doesn't like it as a note the first thing in the section.
 This @value{SECTION} discusses an advanced
 feature of @command{gawk}.  If you are a novice @command{awk} user,
@@ -8227,7 +8232,9 @@ This regular expression describes the contents of each field.
 
 In the case of CSV data as presented here, each field is either ``anything that
 is not a comma,'' or ``a double quote, anything that is not a double quote, and a
-closing double quote.''  If written as a regular expression constant
+closing double quote.''  (There are more complicated definitions of CSV data,
+treated shortly.)
+If written as a regular expression constant
 (@pxref{Regexp}),
 we would have @code{/([^,]+)|("[^"]+")/}.
 Writing this as a string requires us to escape the double quotes, leading to:
@@ -8283,12 +8290,6 @@ if (substr($i, 1, 1) == "\"") @{
 @}
 @end example
 
-As with @code{FS}, the @code{IGNORECASE} variable (@pxref{User-modified})
-affects field splitting with @code{FPAT}.
-
-Assigning a value to @code{FPAT} overrides field splitting
-with @code{FS} and with @code{FIELDWIDTHS}.
-
 @quotation NOTE
 Some programs export CSV data that contains embedded newlines between
 the double quotes.  @command{gawk} provides no way to deal with this.
@@ -8311,9 +8312,78 @@ FPAT = "([^,]*)|(\"[^\"]+\")"
 @c (star in latter part of value) to allow quoted strings to be empty.
 @c Per email from Ed Morton <mortoneccc@comcast.net>
 
+As with @code{FS}, the @code{IGNORECASE} variable (@pxref{User-modified})
+affects field splitting with @code{FPAT}.
+
+Assigning a value to @code{FPAT} overrides field splitting
+with @code{FS} and with @code{FIELDWIDTHS}.
+
 Finally, the @code{patsplit()} function makes the same functionality
 available for splitting regular strings (@pxref{String Functions}).
 
+@node More CSV
+@subsection More on CSV Files
+
+Manuel Collado notes that in addition to commas, a CSV field can also
+contain quotes, which have to be escaped by doubling them. The previously
+described regexps fail to accept quoted fields with both commas and
+quotes inside. He suggests that the simplest @code{FPAT} expression that
+recognizes this kind of field is @code{/([^,]*)|("([^"]|"")+")/}. He
+provides the following input data to test these variants:
+
+@example
+@c file eg/misc/sample.csv
+p,"q,r",s
+p,"q""r",s
+p,"q,""r",s
+p,"",s
+p,,s
+@c endfile
+@end example
+
+@noindent
+And here is his test program:
+
+@example
+@c file eg/misc/test-csv.awk
+@group
+BEGIN @{
+     fp[0] = "([^,]+)|(\"[^\"]+\")"
+     fp[1] = "([^,]*)|(\"[^\"]+\")"
+     fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")"
+     FPAT =  fp[fpat+0]
+@}
+@end group
+
+@group
+@{
+     print "<" $0 ">"
+     printf("NF = %s ", NF)
+     for (i = 1; i <= NF; i++) @{
+         printf("<%s>", $i)
+     @}
+     print ""
+@}
+@end group
+@c endfile
+@end example
+
+When run on the third variant, it produces:
+
+@example
+$ @kbd{gawk -v fpat=2 -f test-csv.awk sample.csv}
+@print{} <p,"q,r",s>
+@print{} NF = 3 <p><"q,r"><s>
+@print{} <p,"q""r",s>
+@print{} NF = 3 <p><"q""r"><s>
+@print{} <p,"q,""r",s>
+@print{} NF = 3 <p><"q,""r"><s>
+@print{} <p,"",s>
+@print{} NF = 3 <p><""><s>
+@print{} <p,,s>
+@print{} NF = 3 <p><><s>
+@end example
+
 @node Testing field creation
 @section Checking How @command{gawk} Is Splitting Records
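
For readers who want to check the third pattern outside of gawk, the following is a minimal Python sketch of the same field definition. It is an approximation, not gawk's implementation. The names `FIELD`, `split_csv_line`, and `unquote` are hypothetical and not part of the patch. It assumes well-formed records with no embedded newlines. Python's `re` module takes the first alternative that matches, whereas gawk takes the longest match, so the quoted alternative is listed first here.

```python
import re

# Approximation of FPAT = "([^,]*)|(\"([^\"]|\"\")+\")" from the patch.
# The quoted alternative is listed first because Python's re picks the
# first alternative that matches, while gawk picks the longest match.
FIELD = re.compile(r'"(?:[^"]|"")+"|[^,]*')


def split_csv_line(line):
    """Split one CSV record into raw fields, keeping the quotes as FPAT does."""
    if line == "":
        return []
    fields = []
    pos = 0
    while True:
        m = FIELD.match(line, pos)  # always matches; may be the empty string
        fields.append(m.group(0))
        pos = m.end()
        if pos >= len(line):
            break
        if line[pos] != ",":
            # Assumption of this sketch: malformed input is rejected.
            raise ValueError("expected a comma at offset %d" % pos)
        pos += 1  # skip the separating comma
    return fields


def unquote(field):
    """Strip surrounding quotes and collapse doubled quotes (hypothetical helper)."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1].replace('""', '"')
    return field


if __name__ == "__main__":
    for record in ['p,"q,r",s', 'p,"q""r",s', 'p,"q,""r",s', 'p,"",s', 'p,,s']:
        fields = split_csv_line(record)
        print("NF = %d %s" % (len(fields), "".join("<%s>" % f for f in fields)))
```

Run on the five records of `sample.csv`, the sketch should report three fields per record, in line with the gawk output shown in the patch. `unquote` corresponds to the quote-stripping step that the manual performs separately after field splitting.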