author: Arnold D. Robbins <arnold@skeeve.com> 2020-04-12 20:33:06 +0300
committer: Arnold D. Robbins <arnold@skeeve.com> 2020-04-12 20:33:06 +0300
commit: 96656f01af15915c943865c8705bc7fc4a9ab436
tree: 3a10d0864d7b92c145ddd863e0a23bc464e6f217 /doc/gawk.texi
parent: b60b2e33b6fa1727050c1e97662e1cf79ef1652b
More stuff on CSV files.
Diffstat (limited to 'doc/gawk.texi')
-rw-r--r-- doc/gawk.texi 84
1 files changed, 77 insertions, 7 deletions
diff --git a/doc/gawk.texi b/doc/gawk.texi
index ca99f017..cbc71886 100644
--- a/doc/gawk.texi
+++ b/doc/gawk.texi
@@ -577,6 +577,7 @@ particular records in a file and perform operations upon them.
* Allowing trailing data:: Capturing optional trailing data.
* Fields with fixed data:: Field values with fixed-width data.
* Splitting By Content:: Defining Fields By Content
+* More CSV:: More on CSV files.
* Testing field creation:: Checking how @command{gawk} is
splitting records.
* Multiple Line:: Reading multiline records.
@@ -8188,6 +8189,10 @@ four, and @code{$4} has the value @code{"ddd"}.
@node Splitting By Content
@section Defining Fields by Content
+@menu
+* More CSV:: More on CSV files.
+@end menu
+
@c O'Reilly doesn't like it as a note the first thing in the section.
This @value{SECTION} discusses an advanced
feature of @command{gawk}. If you are a novice @command{awk} user,
@@ -8227,7 +8232,9 @@ This regular expression describes the contents of each field.
In the case of CSV data as presented here, each field is either ``anything that
is not a comma,'' or ``a double quote, anything that is not a double quote, and a
-closing double quote.'' If written as a regular expression constant
+closing double quote.'' (There are more complicated definitions of CSV data,
+treated shortly.)
+If written as a regular expression constant
(@pxref{Regexp}),
we would have @code{/([^,]+)|("[^"]+")/}.
Writing this as a string requires us to escape the double quotes, leading to:
@@ -8283,12 +8290,6 @@ if (substr($i, 1, 1) == "\"") @{
@}
@end example
-As with @code{FS}, the @code{IGNORECASE} variable (@pxref{User-modified})
-affects field splitting with @code{FPAT}.
-
-Assigning a value to @code{FPAT} overrides field splitting
-with @code{FS} and with @code{FIELDWIDTHS}.
-
@quotation NOTE
Some programs export CSV data that contains embedded newlines between
the double quotes. @command{gawk} provides no way to deal with this.
@@ -8311,9 +8312,78 @@ FPAT = "([^,]*)|(\"[^\"]+\")"
@c (star in latter part of value) to allow quoted strings to be empty.
@c Per email from Ed Morton <mortoneccc@comcast.net>
+As with @code{FS}, the @code{IGNORECASE} variable (@pxref{User-modified})
+affects field splitting with @code{FPAT}.
+
+Assigning a value to @code{FPAT} overrides field splitting
+with @code{FS} and with @code{FIELDWIDTHS}.
+
Finally, the @code{patsplit()} function makes the same functionality
available for splitting regular strings (@pxref{String Functions}).
+@node More CSV
+@subsection More on CSV Files
+
+Manuel Collado notes that in addition to commas, a CSV field can also
+contain quotes, which have to be escaped by doubling them. The previously
+described regexps fail to accept quoted fields with both commas and
+quotes inside. He suggests that the simplest @code{FPAT} expression that
+recognizes this kind of field is @code{/([^,]*)|("([^"]|"")+")/}. He
+provides the following input data to test these variants:
+
+@example
+@c file eg/misc/sample.csv
+p,"q,r",s
+p,"q""r",s
+p,"q,""r",s
+p,"",s
+p,,s
+@c endfile
+@end example
+
+@noindent
+And here is his test program:
+
+@example
+@c file eg/misc/test-csv.awk
+@group
+BEGIN @{
+ fp[0] = "([^,]+)|(\"[^\"]+\")"
+ fp[1] = "([^,]*)|(\"[^\"]+\")"
+ fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")"
+ FPAT = fp[fpat+0]
+@}
+@end group
+
+@group
+@{
+ print "<" $0 ">"
+ printf("NF = %s ", NF)
+ for (i = 1; i <= NF; i++) @{
+ printf("<%s>", $i)
+ @}
+ print ""
+@}
+@end group
+@c endfile
+@end example
+
+When run on the third variant, it produces:
+
+@example
+$ @kbd{gawk -v fpat=2 -f test-csv.awk sample.csv}
+@print{} <p,"q,r",s>
+@print{} NF = 3 <p><"q,r"><s>
+@print{} <p,"q""r",s>
+@print{} NF = 3 <p><"q""r"><s>
+@print{} <p,"q,""r",s>
+@print{} NF = 3 <p><"q,""r"><s>
+@print{} <p,"",s>
+@print{} NF = 3 <p><""><s>
+@print{} <p,,s>
+@print{} NF = 3 <p><><s>
+@end example
+
@node Testing field creation
@section Checking How @command{gawk} Is Splitting Records