Diffstat (limited to 'doc/gawk.texi')
-rw-r--r-- doc/gawk.texi 215
1 files changed, 151 insertions, 64 deletions
diff --git a/doc/gawk.texi b/doc/gawk.texi
index 353a0c9d..5b9eeed7 100644
--- a/doc/gawk.texi
+++ b/doc/gawk.texi
@@ -568,7 +568,13 @@ particular records in a file and perform operations upon them.
field.
* Field Splitting Summary:: Some final points and a summary table.
* Constant Size:: Reading constant width data.
+* Fixed width data:: Processing fixed-width data.
+* Skipping intervening:: Skipping intervening fields.
+* Allowing trailing data:: Capturing optional trailing data.
+* Fields with fixed data:: Field values with fixed-width data.
* Splitting By Content:: Defining Fields By Content
+* Testing field creation:: Checking how @command{gawk} is
+ splitting records.
* Multiple Line:: Reading multiline records.
* Getline:: Reading files under explicit program
control using the @code{getline}
@@ -6431,6 +6437,8 @@ used with it do not have to be named on the @command{awk} command line
* Field Separators:: The field separator and how to change it.
* Constant Size:: Reading constant width data.
* Splitting By Content:: Defining Fields By Content
+* Testing field creation:: Checking how @command{gawk} is splitting
+ records.
* Multiple Line:: Reading multiline records.
* Getline:: Reading files under explicit program control
using the @code{getline} function.
@@ -7756,18 +7764,30 @@ feature of @command{gawk}. If you are a novice @command{awk} user,
you might want to skip it on the first reading.
@command{gawk} provides a facility for dealing with fixed-width fields
-with no distinctive field separator. For example, data of this nature
-arises in the input for old Fortran programs where numbers are run
-together, or in the output of programs that did not anticipate the use
-of their output as input for other programs.
-
-An example of the latter is a table where all the columns are lined up by
-the use of a variable number of spaces and @emph{empty fields are just
-spaces}. Clearly, @command{awk}'s normal field splitting based on @code{FS}
-does not work well in this case. Although a portable @command{awk} program
-can use a series of @code{substr()} calls on @code{$0}
-(@pxref{String Functions}),
-this is awkward and inefficient for a large number of fields.
+with no distinctive field separator. We discuss this feature in
+the following @value{SUBSECTION}s.
+
+@menu
+* Fixed width data:: Processing fixed-width data.
+* Skipping intervening:: Skipping intervening fields.
+* Allowing trailing data:: Capturing optional trailing data.
+* Fields with fixed data:: Field values with fixed-width data.
+@end menu
+
+@node Fixed width data
+@subsection Processing Fixed-Width Data
+
+An example of fixed-width data would be the input for old Fortran programs
+where numbers are run together, or the output of programs that did not
+anticipate the use of their output as input for other programs.
+
+An example of the latter is a table where all the columns are lined up
+by the use of a variable number of spaces and @emph{empty fields are
+just spaces}. Clearly, @command{awk}'s normal field splitting based
+on @code{FS} does not work well in this case. Although a portable
+@command{awk} program can use a series of @code{substr()} calls on
+@code{$0} (@pxref{String Functions}), this is awkward and inefficient
+for a large number of fields.
@cindex troubleshooting, fatal errors, field widths@comma{} specifying
@cindex @command{w} utility
@@ -7775,14 +7795,12 @@ this is awkward and inefficient for a large number of fields.
@cindex @command{gawk}, @code{FIELDWIDTHS} variable in
The splitting of an input record into fixed-width fields is specified by
assigning a string containing space-separated numbers to the built-in
-variable @code{FIELDWIDTHS}. Each number specifies the width of the field,
-@emph{including} columns between fields. If you want to ignore the columns
-between fields, you can specify the width as a separate field that is
-subsequently ignored.
-Or, starting in @value{PVERSION} 4.2, each field width may optionally be
-preceded by a colon-separated value specifying the number of characters to skip
-before the field starts.
-It is a fatal error to supply a field width that has a negative value.
+variable @code{FIELDWIDTHS}. Each number specifies the width of the
+field, @emph{including} columns between fields. If you want to ignore
+the columns between fields, you can specify the width as a separate
+field that is subsequently ignored. It is a fatal error to supply a
+field width that has a negative value.
+
The following data is the output of the Unix @command{w} utility. It is useful
to illustrate the use of @code{FIELDWIDTHS}:
@@ -7812,7 +7830,7 @@ NR > 2 @{
sub(/^ +/, "", idle) # strip leading spaces
if (idle == "")
idle = 0
- if (idle ~ /:/) @{
+ if (idle ~ /:/) @{ # hh:mm
split(idle, t, ":")
idle = t[1] * 60 + t[2]
@}
@@ -7841,13 +7859,30 @@ brent ttyp0 286
dave ttyq4 1296000
@end example
-Starting in @value{PVERSION} 4.2, this program could be rewritten to
-specify @code{FIELDWIDTHS} like so:
+Another (possibly more practical) example of fixed-width input data
+is the input from a deck of balloting cards. In some parts of
+the United States, voters mark their choices by punching holes in computer
+cards. These cards are then processed to count the votes for any particular
+candidate or on any particular issue. Because a voter may choose not to
+vote on some issue, any column on the card may be empty. An @command{awk}
+program for processing such data could use the @code{FIELDWIDTHS} feature
+to simplify reading the data. (Of course, getting @command{gawk} to run on
+a system with card readers is another story!)
+
+@node Skipping intervening
+@subsection Skipping Intervening Fields
+
+Starting in @value{PVERSION} 4.2, each field width may optionally be
+preceded by a colon-separated value specifying the number of characters
+to skip before the field starts. Thus, the preceding program could be
+rewritten to specify @code{FIELDWIDTHS} like so:
+
@example
BEGIN @{ FIELDWIDTHS = "8 1:5 4:7 6 1:6 1:6 2:33" @}
@end example
+
This strips away some of the white space separating the fields. With such
-a change, the program would produce the following results:
+a change, the program produces the following results:
@example
hzang ttyV3 50
@@ -7859,42 +7894,65 @@ brent ttyp0 286
dave ttyq4 1296000
@end example
-Another (possibly more practical) example of fixed-width input data
-is the input from a deck of balloting cards. In some parts of
-the United States, voters mark their choices by punching holes in computer
-cards. These cards are then processed to count the votes for any particular
-candidate or on any particular issue. Because a voter may choose not to
-vote on some issue, any column on the card may be empty. An @command{awk}
-program for processing such data could use the @code{FIELDWIDTHS} feature
-to simplify reading the data. (Of course, getting @command{gawk} to run on
-a system with card readers is another story!)
+@node Allowing trailing data
+@subsection Capturing Optional Trailing Data
-@cindex @command{gawk}, splitting fields and
-Assigning a value to @code{FS} causes @command{gawk} to use
-@code{FS} for field splitting again. Use @samp{FS = FS} to make this happen,
-without having to know the current value of @code{FS}.
-In order to tell which kind of field splitting is in effect,
-use @code{PROCINFO["FS"]}
-(@pxref{Auto-set}).
-The value is @code{"FS"} if regular field splitting is being used,
-or @code{"FIELDWIDTHS"} if fixed-width field splitting is being used:
+There are times when fixed-width data may be followed by additional data
+that has no fixed length. Such data may or may not be present, but if
+it is, it should be possible to get at it from an @command{awk} program.
+
+Starting with @value{PVERSION} 4.2, in order to provide a way to say ``anything
+else in the record after the defined fields,'' @command{gawk}
+allows you to add a final @samp{*} character to the value of
+@code{FIELDWIDTHS}. There can only be one such character, and it must
+be the final non-whitespace character in @code{FIELDWIDTHS}.
+For example:
@example
-if (PROCINFO["FS"] == "FS")
- @var{regular field splitting} @dots{}
-else if (PROCINFO["FS"] == "FIELDWIDTHS")
- @var{fixed-width field splitting} @dots{}
-else if (PROCINFO["FS"] == "FPAT")
- @var{content-based field splitting} @dots{} @ii{(see next @value{SECTION})}
-else
- @var{API input parser field splitting} @dots{} @ii{(advanced feature)}
+$ @kbd{cat fw.awk} @ii{Show the program}
+@print{} BEGIN @{ FIELDWIDTHS = "2 2 *" @}
+@print{} @{ print NF, $1, $2, $3 @}
+$ @kbd{cat fw.in} @ii{Show sample input}
+@print{} 1234abcdefghi
+$ @kbd{gawk -f fw.awk fw.in} @ii{Run the program}
+@print{} 3 12 34 abcdefghi
@end example
-This information is useful when writing a function
-that needs to temporarily change @code{FS} or @code{FIELDWIDTHS},
-read some records, and then restore the original settings
-(@pxref{Passwd Functions}
-for an example of such a function).
+@node Fields with fixed data
+@subsection Field Values With Fixed-Width Data
+
+So far, so good. But what happens if there isn't as much data as there
+should be based on the contents of @code{FIELDWIDTHS}? Or, what happens
+if there is more data than expected?
+
+For many years, what happened in these cases was not well defined. Starting
+with @value{PVERSION} 4.2, the rules are as follows:
+
+@table @asis
+@item Enough data for some fields
+For example, suppose @code{FIELDWIDTHS} is set to @code{"2 3 4"} and the
+input record is @samp{aabbb}. In this case, @code{NF} is set to two.
+
+@item Not enough data for a field
+For example, suppose @code{FIELDWIDTHS} is set to @code{"2 3 4"} and the
+input record is @samp{aab}. In this case, @code{NF} is set to two and
+@code{$2} has the value @code{"b"}. The idea is that even though there
+aren't as many characters as were expected, there are some, so the data
+should be made available to the program.
+
+@item Too much data
+For example, suppose @code{FIELDWIDTHS} is set to @code{"2 3 4"} and the
+input record is @samp{aabbbccccddd}. In this case, @code{NF} is set to
+three and the extra characters (@samp{ddd}) are ignored. If you want
+@command{gawk} to capture the extra characters, supply a final @samp{*}
+in the value of @code{FIELDWIDTHS}.
+
+@item Too much data, but with @samp{*} supplied
+For example, suppose @code{FIELDWIDTHS} is set to @code{"2 3 4 *"} and the
+input record is @samp{aabbbccccddd}. In this case, @code{NF} is set to
+four, and @code{$4} has the value @code{"ddd"}.
+
+@end table
@node Splitting By Content
@section Defining Fields by Content
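Note on the hunk above: the four cases in the new `Fields with fixed data` table can be checked from the shell. The following is a minimal sketch, assuming gawk 4.2 or later is installed as `gawk`. The helper name `check_fw` is hypothetical and does not come from the manual. It prints `NF` and the last field for each case.

```shell
# Assumption: gawk 4.2 or later is available on PATH as `gawk`.
# check_fw is a hypothetical helper: $1 is the FIELDWIDTHS value,
# $2 is a single input record. It prints NF and the last field.
check_fw() {
    printf '%s\n' "$2" |
        gawk -v fw="$1" 'BEGIN { FIELDWIDTHS = fw }
                         { printf "%d <%s>\n", NF, $NF }'
}

check_fw "2 3 4"   "aabbb"          # enough data for some fields
check_fw "2 3 4"   "aab"            # not enough data for a field
check_fw "2 3 4"   "aabbbccccddd"   # too much data
check_fw "2 3 4 *" "aabbbccccddd"   # too much data, with * supplied
```

If the rules hold as documented in the table, the four calls print `2 <bbb>`, `2 <b>`, `3 <cccc>`, and `4 <ddd>` in that order.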
@@ -7995,8 +8053,6 @@ affects field splitting with @code{FPAT}.
Assigning a value to @code{FPAT} overrides field splitting
with @code{FS} and with @code{FIELDWIDTHS}.
-Similar to @code{FIELDWIDTHS}, the value of @code{PROCINFO["FS"]}
-will be @code{"FPAT"} if content-based field splitting is being used.
@quotation NOTE
Some programs export CSV data that contains embedded newlines between
@@ -8023,13 +8079,44 @@ FPAT = "([^,]*)|(\"[^\"]+\")"
Finally, the @code{patsplit()} function makes the same functionality
available for splitting regular strings (@pxref{String Functions}).
-To recap, @command{gawk} provides three independent methods
-to split input records into fields.
-The mechanism used is based on which of the three
-variables---@code{FS}, @code{FIELDWIDTHS}, or @code{FPAT}---was
-last assigned to. In addition, an API input parser may choose to
-override the record parsing mechanism; please refer to @ref{Input Parsers}
-for further information about this feature.
+
+@node Testing field creation
+@section Checking How @command{gawk} Is Splitting Records
+
+@cindex @command{gawk}, splitting fields and
+As we've seen, @command{gawk} provides three independent methods to split
+input records into fields. The mechanism used is based on which of the
+three variables---@code{FS}, @code{FIELDWIDTHS}, or @code{FPAT}---was
+last assigned to. In addition, an API input parser may choose to override
+the record parsing mechanism; please refer to @ref{Input Parsers} for
+further information about this feature.
+
+To restore normal field splitting after using @code{FIELDWIDTHS}
+and/or @code{FPAT}, simply assign a value to @code{FS}.
+You can use @samp{FS = FS} to do this,
+without having to know the current value of @code{FS}.
+
+In order to tell which kind of field splitting is in effect,
+use @code{PROCINFO["FS"]} (@pxref{Auto-set}).
+The value is @code{"FS"} if regular field splitting is being used,
+@code{"FIELDWIDTHS"} if fixed-width field splitting is being used,
+or @code{"FPAT"} if content-based field splitting is being used:
+
+@example
+if (PROCINFO["FS"] == "FS")
+ @var{regular field splitting} @dots{}
+else if (PROCINFO["FS"] == "FIELDWIDTHS")
+ @var{fixed-width field splitting} @dots{}
+else if (PROCINFO["FS"] == "FPAT")
+ @var{content-based field splitting} @dots{}
+else
+ @var{API input parser field splitting} @dots{} @ii{(advanced feature)}
+@end example
+
+This information is useful when writing a function that needs to
+temporarily change @code{FS} or @code{FIELDWIDTHS}, read some records,
+and then restore the original settings (@pxref{Passwd Functions} for an
+example of such a function).
@node Multiple Line
@section Multiple-Line Records
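Note on the `Testing field creation` node added above: the way `PROCINFO["FS"]` tracks the most recently assigned splitting variable, and the `FS = FS` idiom for restoring regular splitting, can be observed directly. This is a minimal sketch, assuming gawk 4.2 or later is installed as `gawk`. The function name `show_modes` and the sample `FPAT` value are illustrative choices, not taken from the manual.

```shell
# Assumption: gawk is available on PATH as `gawk`.
# show_modes is a hypothetical helper that prints PROCINFO["FS"]
# after each assignment to a field-splitting variable.
show_modes() {
    gawk 'BEGIN {
        print PROCINFO["FS"]      # default: regular FS splitting
        FIELDWIDTHS = "2 3 4"     # switch to fixed-width splitting
        print PROCINFO["FS"]
        FPAT = "[^,]+"            # switch to content-based splitting
        print PROCINFO["FS"]
        FS = FS                   # restore regular splitting without
        print PROCINFO["FS"]      # needing to know the value of FS
    }'
}

show_modes
```

Based on the text above, the expected output is `FS`, `FIELDWIDTHS`, `FPAT`, and `FS`, one per line.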