Diffstat (limited to 'gawk.info-3')
-rw-r--r-- | gawk.info-3 | 1288 |
1 files changed, 1288 insertions, 0 deletions
diff --git a/gawk.info-3 b/gawk.info-3 new file mode 100644 index 00000000..5c87ac3a --- /dev/null +++ b/gawk.info-3 @@ -0,0 +1,1288 @@ +This is Info file gawk.info, produced by Makeinfo-1.54 from the input +file gawk.texi. + + This file documents `awk', a program that you can use to select +particular records in a file and perform operations upon them. + + This is Edition 0.15 of `The GAWK Manual', +for the 2.15 version of the GNU implementation +of AWK. + + Copyright (C) 1989, 1991, 1992, 1993 Free Software Foundation, Inc. + + Permission is granted to make and distribute verbatim copies of this +manual provided the copyright notice and this permission notice are +preserved on all copies. + + Permission is granted to copy and distribute modified versions of +this manual under the conditions for verbatim copying, provided that +the entire resulting derived work is distributed under the terms of a +permission notice identical to this one. + + Permission is granted to copy and distribute translations of this +manual into another language, under the above conditions for modified +versions, except that this permission notice may be stated in a +translation approved by the Foundation. + + +File: gawk.info, Node: Output Separators, Next: OFMT, Prev: Print Examples, Up: Printing + +Output Separators +================= + + As mentioned previously, a `print' statement contains a list of +items, separated by commas. In the output, the items are normally +separated by single spaces. But they do not have to be spaces; a +single space is only the default. You can specify any string of +characters to use as the "output field separator" by setting the +built-in variable `OFS'. The initial value of this variable is the +string `" "', that is, just a single space. + + The output from an entire `print' statement is called an "output +record". Each `print' statement outputs one output record and then +outputs a string called the "output record separator". 
The built-in +variable `ORS' specifies this string. The initial value of the +variable is the string `"\n"' containing a newline character; thus, +normally each `print' statement makes a separate line. + + You can change how output fields and records are separated by +assigning new values to the variables `OFS' and/or `ORS'. The usual +place to do this is in the `BEGIN' rule (*note `BEGIN' and `END' +Special Patterns: BEGIN/END.), so that it happens before any input is +processed. You may also do this with assignments on the command line, +before the names of your input files. + + The following example prints the first and second fields of each +input record separated by a semicolon, with a blank line added after +each line: + + awk 'BEGIN { OFS = ";"; ORS = "\n\n" } + { print $1, $2 }' BBS-list + + If the value of `ORS' does not contain a newline, all your output +will be run together on a single line, unless you output newlines some +other way. + + +File: gawk.info, Node: OFMT, Next: Printf, Prev: Output Separators, Up: Printing + +Controlling Numeric Output with `print' +======================================= + + When you use the `print' statement to print numeric values, `awk' +internally converts the number to a string of characters, and prints +that string. `awk' uses the `sprintf' function to do this conversion. +For now, it suffices to say that the `sprintf' function accepts a +"format specification" that tells it how to format numbers (or +strings), and that there are a number of different ways that numbers +can be formatted. The different format specifications are discussed +more fully in *Note Using `printf' Statements for Fancier Printing: +Printf. + + The built-in variable `OFMT' contains the default format +specification that `print' uses with `sprintf' when it wants to convert +a number to a string for printing. By supplying different format +specifications as the value of `OFMT', you can change how `print' will +print your numbers. 
As a brief example: + + awk 'BEGIN { OFMT = "%d" # print numbers as integers + print 17.23 }' + +will print `17'. + + +File: gawk.info, Node: Printf, Next: Redirection, Prev: OFMT, Up: Printing + +Using `printf' Statements for Fancier Printing +============================================== + + If you want more precise control over the output format than `print' +gives you, use `printf'. With `printf' you can specify the width to +use for each item, and you can specify various stylistic choices for +numbers (such as what radix to use, whether to print an exponent, +whether to print a sign, and how many digits to print after the decimal +point). You do this by specifying a string, called the "format +string", which controls how and where to print the other arguments. + +* Menu: + +* Basic Printf:: Syntax of the `printf' statement. +* Control Letters:: Format-control letters. +* Format Modifiers:: Format-specification modifiers. +* Printf Examples:: Several examples. + + +File: gawk.info, Node: Basic Printf, Next: Control Letters, Prev: Printf, Up: Printf + +Introduction to the `printf' Statement +-------------------------------------- + + The `printf' statement looks like this: + + printf FORMAT, ITEM1, ITEM2, ... + +The entire list of arguments may optionally be enclosed in parentheses. +The parentheses are necessary if any of the item expressions uses a +relational operator; otherwise it could be confused with a redirection +(*note Redirecting Output of `print' and `printf': Redirection.). The +relational operators are `==', `!=', `<', `>', `>=', `<=', `~' and `!~' +(*note Comparison Expressions: Comparison Ops.). + + The difference between `printf' and `print' is the argument FORMAT. +This is an expression whose value is taken as a string; it specifies +how to output each of the other arguments. It is called the "format +string". + + The format string is the same as in the ANSI C library function +`printf'. Most of FORMAT is text to be output verbatim. 
Scattered +among this text are "format specifiers", one per item. Each format +specifier says to output the next item at that place in the format. + + The `printf' statement does not automatically append a newline to its +output. It outputs only what the format specifies. So if you want a +newline, you must include one in the format. The output separator +variables `OFS' and `ORS' have no effect on `printf' statements. + + +File: gawk.info, Node: Control Letters, Next: Format Modifiers, Prev: Basic Printf, Up: Printf + +Format-Control Letters +---------------------- + + A format specifier starts with the character `%' and ends with a +"format-control letter"; it tells the `printf' statement how to output +one item. (If you actually want to output a `%', write `%%'.) The +format-control letter specifies what kind of value to print. The rest +of the format specifier is made up of optional "modifiers" which are +parameters such as the field width to use. + + Here is a list of the format-control letters: + +`c' + This prints a number as an ASCII character. Thus, `printf "%c", + 65' outputs the letter `A'. The output for a string value is the + first character of the string. + +`d' + This prints a decimal integer. + +`i' + This also prints a decimal integer. + +`e' + This prints a number in scientific (exponential) notation. For + example, + + printf "%4.3e", 1950 + + prints `1.950e+03', with a total of four significant figures of + which three follow the decimal point. The `4.3' are "modifiers", + discussed below. + +`f' + This prints a number in floating point notation. + +`g' + This prints a number in either scientific notation or floating + point notation, whichever uses fewer characters. + +`o' + This prints an unsigned octal integer. + +`s' + This prints a string. + +`x' + This prints an unsigned hexadecimal integer. + +`X' + This prints an unsigned hexadecimal integer. 
However, for the + values 10 through 15, it uses the letters `A' through `F' instead + of `a' through `f'. + +`%' + This isn't really a format-control letter, but it does have a + meaning when used after a `%': the sequence `%%' outputs one `%'. + It does not consume an argument. + + +File: gawk.info, Node: Format Modifiers, Next: Printf Examples, Prev: Control Letters, Up: Printf + +Modifiers for `printf' Formats +------------------------------ + + A format specification can also include "modifiers" that can control +how much of the item's value is printed and how much space it gets. The +modifiers come between the `%' and the format-control letter. Here are +the possible modifiers, in the order in which they may appear: + +`-' + The minus sign, used before the width modifier, says to + left-justify the argument within its specified width. Normally + the argument is printed right-justified in the specified width. + Thus, + + printf "%-4s", "foo" + + prints `foo '. + +`WIDTH' + This is a number representing the desired width of a field. + Inserting any number between the `%' sign and the format control + character forces the field to be expanded to this width. The + default way to do this is to pad with spaces on the left. For + example, + + printf "%4s", "foo" + + prints ` foo'. + + The value of WIDTH is a minimum width, not a maximum. If the item + value requires more than WIDTH characters, it can be as wide as + necessary. Thus, + + printf "%4s", "foobar" + + prints `foobar'. + + Preceding the WIDTH with a minus sign causes the output to be + padded with spaces on the right, instead of on the left. + +`.PREC' + This is a number that specifies the precision to use when printing. + This specifies the number of digits you want printed to the right + of the decimal point. For a string, it specifies the maximum + number of characters from the string that should be printed. 
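As a quick check of the modifiers listed above, the following shell sketch runs one `printf' that uses each of them. It assumes only a POSIX `awk' on your `PATH'; the shell variable `out' and the angle brackets around each item are ours, added to make the padding visible, and are not part of the manual's examples.

```shell
# Exercise the `-', WIDTH and `.PREC' modifiers in a single printf call.
# Items, left to right: left-justified string, right-justified string,
# string cut to 3 characters, number in width 6 with 2 decimals.
out=$(LC_ALL=C awk 'BEGIN {
    printf "<%-4s>|<%4s>|<%.3s>|<%6.2f>", "foo", "foo", "abcdefg", 3.14159
}')
echo "$out"    # prints: <foo >|< foo>|<abc>|<  3.14>
```

Reading the result from left to right shows padding on the right, padding on the left, truncation of a string by the precision, and a width combined with a precision for a number.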
+ + The C library `printf''s dynamic WIDTH and PREC capability (for +example, `"%*.*s"') is supported. Instead of supplying explicit WIDTH +and/or PREC values in the format string, you pass them in the argument +list. For example: + + w = 5 + p = 3 + s = "abcdefg" + printf "<%*.*s>\n", w, p, s + +is exactly equivalent to + + s = "abcdefg" + printf "<%5.3s>\n", s + +Both programs output `<**abc>'. (We have used the bullet symbol "*" to +represent a space, to clearly show you that there are two spaces in the +output.) + + Earlier versions of `awk' did not support this capability. You may +simulate it by using concatenation to build up the format string, like +so: + + w = 5 + p = 3 + s = "abcdefg" + printf "<%" w "." p "s>\n", s + +This is not particularly easy to read, however. + + +File: gawk.info, Node: Printf Examples, Prev: Format Modifiers, Up: Printf + +Examples of Using `printf' +-------------------------- + + Here is how to use `printf' to make an aligned table: + + awk '{ printf "%-10s %s\n", $1, $2 }' BBS-list + +prints the names of bulletin boards (`$1') of the file `BBS-list' as a +string of 10 characters, left justified. It also prints the phone +numbers (`$2') afterward on the line. This produces an aligned +two-column table of names and phone numbers: + + aardvark 555-5553 + alpo-net 555-3412 + barfly 555-7685 + bites 555-1675 + camelot 555-0542 + core 555-2912 + fooey 555-1234 + foot 555-6699 + macfoo 555-6480 + sdace 555-3430 + sabafoo 555-2127 + + Did you notice that we did not specify that the phone numbers be +printed as numbers? They had to be printed as strings because the +numbers are separated by a dash. This dash would be interpreted as a +minus sign if we had tried to print the phone numbers as numbers. This +would have led to some pretty confusing results. + + We did not specify a width for the phone numbers because they are the +last things on their lines. We don't need to put spaces after them. 
+ + We could make our table look even nicer by adding headings to the +tops of the columns. To do this, use the `BEGIN' pattern (*note +`BEGIN' and `END' Special Patterns: BEGIN/END.) to force the header to +be printed only once, at the beginning of the `awk' program: + + awk 'BEGIN { print "Name Number" + print "---- ------" } + { printf "%-10s %s\n", $1, $2 }' BBS-list + + Did you notice that we mixed `print' and `printf' statements in the +above example? We could have used just `printf' statements to get the +same results: + + awk 'BEGIN { printf "%-10s %s\n", "Name", "Number" + printf "%-10s %s\n", "----", "------" } + { printf "%-10s %s\n", $1, $2 }' BBS-list + +By outputting each column heading with the same format specification +used for the elements of the column, we have made sure that the headings +are aligned just like the columns. + + The fact that the same format specification is used three times can +be emphasized by storing it in a variable, like this: + + awk 'BEGIN { format = "%-10s %s\n" + printf format, "Name", "Number" + printf format, "----", "------" } + { printf format, $1, $2 }' BBS-list + + See if you can use the `printf' statement to line up the headings and +table data for our `inventory-shipped' example covered earlier in the +section on the `print' statement (*note The `print' Statement: Print.). + + +File: gawk.info, Node: Redirection, Next: Special Files, Prev: Printf, Up: Printing + +Redirecting Output of `print' and `printf' +========================================== + + So far we have been dealing only with output that prints to the +standard output, usually your terminal. Both `print' and `printf' can +also send their output to other places. This is called "redirection". + + A redirection appears after the `print' or `printf' statement. +Redirections in `awk' are written just like redirections in shell +commands, except that they are written inside the `awk' program. 
+ +* Menu: + +* File/Pipe Redirection:: Redirecting Output to Files and Pipes. +* Close Output:: How to close output files and pipes. + + +File: gawk.info, Node: File/Pipe Redirection, Next: Close Output, Prev: Redirection, Up: Redirection + +Redirecting Output to Files and Pipes +------------------------------------- + + Here are the three forms of output redirection. They are all shown +for the `print' statement, but they work identically for `printf' also. + +`print ITEMS > OUTPUT-FILE' + This type of redirection prints the items onto the output file + OUTPUT-FILE. The file name OUTPUT-FILE can be any expression. + Its value is changed to a string and then used as a file name + (*note Expressions as Action Statements: Expressions.). + + When this type of redirection is used, the OUTPUT-FILE is erased + before the first output is written to it. Subsequent writes do not + erase OUTPUT-FILE, but append to it. If OUTPUT-FILE does not + exist, then it is created. + + For example, here is how one `awk' program can write a list of BBS + names to a file `name-list' and a list of phone numbers to a file + `phone-list'. Each output file contains one name or number per + line. + + awk '{ print $2 > "phone-list" + print $1 > "name-list" }' BBS-list + +`print ITEMS >> OUTPUT-FILE' + This type of redirection prints the items onto the output file + OUTPUT-FILE. The difference between this and the single-`>' + redirection is that the old contents (if any) of OUTPUT-FILE are + not erased. Instead, the `awk' output is appended to the file. + +`print ITEMS | COMMAND' + It is also possible to send output through a "pipe" instead of + into a file. This type of redirection opens a pipe to COMMAND + and writes the values of ITEMS through this pipe, to another + process created to execute COMMAND. + + The redirection argument COMMAND is actually an `awk' expression. + Its value is converted to a string, whose contents give the shell + command to be run. 
+ + For example, this produces two files, one unsorted list of BBS + names and one list sorted in reverse alphabetical order: + + awk '{ print $1 > "names.unsorted" + print $1 | "sort -r > names.sorted" }' BBS-list + + Here the unsorted list is written with an ordinary redirection + while the sorted list is written by piping through the `sort' + utility. + + Here is an example that uses redirection to mail a message to a + mailing list `bug-system'. This might be useful when trouble is + encountered in an `awk' script run periodically for system + maintenance. + + report = "mail bug-system" + print "Awk script failed:", $0 | report + print "at record number", FNR, "of", FILENAME | report + close(report) + + We call the `close' function here because it's a good idea to close + the pipe as soon as all the intended output has been sent to it. + *Note Closing Output Files and Pipes: Close Output, for more + information on this. This example also illustrates the use of a + variable to represent a FILE or COMMAND: it is not necessary to + always use a string constant. Using a variable is generally a + good idea, since `awk' requires you to spell the string value + identically every time. + + Redirecting output using `>', `>>', or `|' asks the system to open a +file or pipe only if the particular FILE or COMMAND you've specified +has not already been written to by your program, or if it has been +closed since it was last written to. + + +File: gawk.info, Node: Close Output, Prev: File/Pipe Redirection, Up: Redirection + +Closing Output Files and Pipes +------------------------------ + + When a file or pipe is opened, the file name or command associated +with it is remembered by `awk' and subsequent writes to the same file or +command are appended to the previous writes. The file or pipe stays +open until `awk' exits. This is usually convenient. + + Sometimes there is a reason to close an output file or pipe earlier +than that. 
To do this, use the `close' function, as follows: + + close(FILENAME) + +or + + close(COMMAND) + + The argument FILENAME or COMMAND can be any expression. Its value +must exactly equal the string used to open the file or pipe to begin +with--for example, if you open a pipe with this: + + print $1 | "sort -r > names.sorted" + +then you must close it with this: + + close("sort -r > names.sorted") + + Here are some reasons why you might need to close an output file: + + * To write a file and read it back later on in the same `awk' + program. Close the file when you are finished writing it; then + you can start reading it with `getline' (*note Explicit Input with + `getline': Getline.). + + * To write numerous files, successively, in the same `awk' program. + If you don't close the files, eventually you may exceed a system + limit on the number of open files in one process. So close each + one when you are finished writing it. + + * To make a command finish. When you redirect output through a pipe, + the command reading the pipe normally continues to try to read + input as long as the pipe is open. Often this means the command + cannot really do its work until the pipe is closed. For example, + if you redirect output to the `mail' program, the message is not + actually sent until the pipe is closed. + + * To run the same program a second time, with the same arguments. + This is not the same thing as giving more input to the first run! + + For example, suppose you pipe output to the `mail' program. If you + output several lines redirected to this pipe without closing it, + they make a single message of several lines. By contrast, if you + close the pipe after each line of output, then each line makes a + separate message. + + `close' returns a value of zero if the close succeeded. Otherwise, +the value will be non-zero. In this case, `gawk' sets the variable +`ERRNO' to a string describing the error that occurred. 
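The first reason in the list above, writing a file and then reading it back in the same program, can be tried together with the return value of `close' in a sketch like this one. The temporary file made by `mktemp' and the names `f', `rc', `n', and `line' are our own choices for illustration; they do not come from the manual.

```shell
# Write two records to a file, close it, then read it back with getline.
tmp=$(mktemp)
out=$(awk -v f="$tmp" 'BEGIN {
    print "first"  > f
    print "second" > f          # appended: f is still open
    rc = close(f)               # zero means the close succeeded
    while ((getline line < f) > 0)
        n++                     # count the records read back
    close(f)
    print rc, n
}')
echo "$out"    # prints: 0 2
rm -f "$tmp"
```

Without the first `close', the data might still be sitting in an output buffer when `getline' starts reading, so the count could come out wrong.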
+ + +File: gawk.info, Node: Special Files, Prev: Redirection, Up: Printing + +Standard I/O Streams +==================== + + Running programs conventionally have three input and output streams +already available to them for reading and writing. These are known as +the "standard input", "standard output", and "standard error output". +These streams are, by default, terminal input and output, but they are +often redirected with the shell, via the `<', `<<', `>', `>>', `>&' and +`|' operators. Standard error is used only for writing error messages; +the reason we have two separate streams, standard output and standard +error, is so that they can be redirected separately. + + In other implementations of `awk', the only way to write an error +message to standard error in an `awk' program is as follows: + + print "Serious error detected!\n" | "cat 1>&2" + +This works by opening a pipeline to a shell command which can access the +standard error stream which it inherits from the `awk' process. This +is far from elegant, and is also inefficient, since it requires a +separate process. So people writing `awk' programs have often +neglected to do this. Instead, they have sent the error messages to the +terminal, like this: + + NF != 4 { + printf("line %d skipped: doesn't have 4 fields\n", FNR) > "/dev/tty" + } + +This has the same effect most of the time, but not always: although the +standard error stream is usually the terminal, it can be redirected, and +when that happens, writing to the terminal is not correct. In fact, if +`awk' is run from a background job, it may not have a terminal at all. +Then opening `/dev/tty' will fail. + + `gawk' provides special file names for accessing the three standard +streams. When you redirect input or output in `gawk', if the file name +matches one of these special names, then `gawk' directly uses the +stream it stands for. + +`/dev/stdin' + The standard input (file descriptor 0). 
+ +`/dev/stdout' + The standard output (file descriptor 1). + +`/dev/stderr' + The standard error output (file descriptor 2). + +`/dev/fd/N' + The file associated with file descriptor N. Such a file must have + been opened by the program initiating the `awk' execution + (typically the shell). Unless you take special pains, only + descriptors 0, 1 and 2 are available. + + The file names `/dev/stdin', `/dev/stdout', and `/dev/stderr' are +aliases for `/dev/fd/0', `/dev/fd/1', and `/dev/fd/2', respectively, +but they are more self-explanatory. + + The proper way to write an error message in a `gawk' program is to +use `/dev/stderr', like this: + + NF != 4 { + printf("line %d skipped: doesn't have 4 fields\n", FNR) > "/dev/stderr" + } + + `gawk' also provides special file names that give access to +information about the running `gawk' process. Each of these "files" +provides a single record of information. To read them more than once, +you must first close them with the `close' function (*note Closing +Input Files and Pipes: Close Input.). The filenames are: + +`/dev/pid' + Reading this file returns the process ID of the current process, + in decimal, terminated with a newline. + +`/dev/ppid' + Reading this file returns the parent process ID of the current + process, in decimal, terminated with a newline. + +`/dev/pgrpid' + Reading this file returns the process group ID of the current + process, in decimal, terminated with a newline. + +`/dev/user' + Reading this file returns a single record terminated with a + newline. The fields are separated with blanks. The fields + represent the following information: + + `$1' + The value of the `getuid' system call. + + `$2' + The value of the `geteuid' system call. + + `$3' + The value of the `getgid' system call. + + `$4' + The value of the `getegid' system call. + + If there are any additional fields, they are the group IDs + returned by `getgroups' system call. (Multiple groups may not be + supported on all systems.) 
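To see that messages sent to `/dev/stderr' really are kept apart from the standard output, here is a small sketch built on the `NF != 4' rule shown earlier in this node. It assumes an `awk' that recognizes `/dev/stderr' itself or a system that provides it as a real file. The sample input and the shell variables `prog', `input', `good', and `bad' are made up for the illustration, and the message is reworded without an apostrophe only so that the program fits inside shell single quotes.

```shell
# Records without exactly 4 fields are reported on standard error;
# the first field of every other record goes to standard output.
prog='NF != 4 { printf("line %d skipped: does not have 4 fields\n", FNR) > "/dev/stderr"; next }
      { print $1 }'
input='a b c d
e f
g h i j'

# Capture each stream separately by discarding the other one.
good=$(printf '%s\n' "$input" | awk "$prog" 2>/dev/null)
bad=$(printf '%s\n' "$input" | awk "$prog" 2>&1 >/dev/null)

echo "stdout: $good"
echo "stderr: $bad"
```

Because the two streams are separate, redirecting one of them in the shell leaves the other untouched, which is the point made above about `/dev/tty' not being a correct substitute.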
+ + These special file names may be used on the command line as data +files, as well as for I/O redirections within an `awk' program. They +may not be used as source files with the `-f' option. + + Recognition of these special file names is disabled if `gawk' is in +compatibility mode (*note Invoking `awk': Command Line.). + + *Caution*: Unless your system actually has a `/dev/fd' directory + (or any of the other above listed special files), the + interpretation of these file names is done by `gawk' itself. For + example, using `/dev/fd/4' for output will actually write on file + descriptor 4, and not on a new file descriptor that was `dup''ed + from file descriptor 4. Most of the time this does not matter; + however, it is important to *not* close any of the files related + to file descriptors 0, 1, and 2. If you do close one of these + files, unpredictable behavior will result. + + +File: gawk.info, Node: One-liners, Next: Patterns, Prev: Printing, Up: Top + +Useful "One-liners" +******************* + + Useful `awk' programs are often short, just a line or two. Here is a +collection of useful, short programs to get you started. Some of these +programs contain constructs that haven't been covered yet. The +description of the program will give you a good idea of what is going +on, but please read the rest of the manual to become an `awk' expert! + + Since you are reading this in Info, each line of the example code is +enclosed in quotes, to represent text that you would type literally. +The examples themselves represent shell commands that use single quotes +to keep the shell from interpreting the contents of the program. When +reading the examples, focus on the text between the open and close +quotes. + +`awk '{ if (NF > max) max = NF }' +` END { print max }'' + This program prints the maximum number of fields on any input line. + +`awk 'length($0) > 80'' + This program prints every line longer than 80 characters. 
The sole + rule has a relational expression as its pattern, and has no action + (so the default action, printing the record, is used). + +`awk 'NF > 0'' + This program prints every line that has at least one field. This + is an easy way to delete blank lines from a file (or rather, to + create a new file similar to the old file but from which the blank + lines have been deleted). + +`awk '{ if (NF > 0) print }'' + This program also prints every line that has at least one field. + Here we allow the rule to match every line, then decide in the + action whether to print. + +`awk 'BEGIN { for (i = 1; i <= 7; i++)' +` print int(101 * rand()) }'' + This program prints 7 random numbers from 0 to 100, inclusive. + +`ls -l FILES | awk '{ x += $4 } ; END { print "total bytes: " x }'' + This program prints the total number of bytes used by FILES. + +`expand FILE | awk '{ if (x < length()) x = length() }' +` END { print "maximum line length is " x }'' + This program prints the maximum line length of FILE. The input is + piped through the `expand' program to change tabs into spaces, so + the widths compared are actually the right-margin columns. + +`awk 'BEGIN { FS = ":" }' +` { print $1 | "sort" }' /etc/passwd' + This program prints a sorted list of the login names of all users. + +`awk '{ nlines++ }' +` END { print nlines }'' + This program counts lines in a file. + +`awk 'END { print NR }'' + This program also counts lines in a file, but lets `awk' do the + work. + +`awk '{ print NR, $0 }'' + This program adds line numbers to all its input files, similar to + `cat -n'. + + +File: gawk.info, Node: Patterns, Next: Actions, Prev: One-liners, Up: Top + +Patterns +******** + + Patterns in `awk' control the execution of rules: a rule is executed +when its pattern matches the current input record. This chapter tells +all about how to write patterns. + +* Menu: + +* Kinds of Patterns:: A list of all kinds of patterns. + The following subsections describe + them in detail. 
+* Regexp:: Regular expressions such as `/foo/'. +* Comparison Patterns:: Comparison expressions such as `$1 > 10'. +* Boolean Patterns:: Combining comparison expressions. +* Expression Patterns:: Any expression can be used as a pattern. +* Ranges:: Pairs of patterns specify record ranges. +* BEGIN/END:: Specifying initialization and cleanup rules. +* Empty:: The empty pattern, which matches every record. + + +File: gawk.info, Node: Kinds of Patterns, Next: Regexp, Prev: Patterns, Up: Patterns + +Kinds of Patterns +================= + + Here is a summary of the types of patterns supported in `awk'. + +`/REGULAR EXPRESSION/' + A regular expression as a pattern. It matches when the text of the + input record fits the regular expression. (*Note Regular + Expressions as Patterns: Regexp.) + +`EXPRESSION' + A single expression. It matches when its value, converted to a + number, is nonzero (if a number) or nonnull (if a string). (*Note + Expressions as Patterns: Expression Patterns.) + +`PAT1, PAT2' + A pair of patterns separated by a comma, specifying a range of + records. (*Note Specifying Record Ranges with Patterns: Ranges.) + +`BEGIN' +`END' + Special patterns to supply start-up or clean-up information to + `awk'. (*Note `BEGIN' and `END' Special Patterns: BEGIN/END.) + +`NULL' + The empty pattern matches every input record. (*Note The Empty + Pattern: Empty.) + + +File: gawk.info, Node: Regexp, Next: Comparison Patterns, Prev: Kinds of Patterns, Up: Patterns + +Regular Expressions as Patterns +=============================== + + A "regular expression", or "regexp", is a way of describing a class +of strings. A regular expression enclosed in slashes (`/') is an `awk' +pattern that matches every input record whose text belongs to that +class. + + The simplest regular expression is a sequence of letters, numbers, or +both. Such a regexp matches any string that contains that sequence. +Thus, the regexp `foo' matches any string containing `foo'. 
Therefore, +the pattern `/foo/' matches any input record containing `foo'. Other +kinds of regexps let you specify more complicated classes of strings. + +* Menu: + +* Regexp Usage:: How to Use Regular Expressions +* Regexp Operators:: Regular Expression Operators +* Case-sensitivity:: How to do case-insensitive matching. + + +File: gawk.info, Node: Regexp Usage, Next: Regexp Operators, Prev: Regexp, Up: Regexp + +How to Use Regular Expressions +------------------------------ + + A regular expression can be used as a pattern by enclosing it in +slashes. Then the regular expression is matched against the entire +text of each record. (Normally, it only needs to match some part of +the text in order to succeed.) For example, this prints the second +field of each record that contains `foo' anywhere: + + awk '/foo/ { print $2 }' BBS-list + + Regular expressions can also be used in comparison expressions. Then +you can specify the string to match against; it need not be the entire +current input record. These comparison expressions can be used as +patterns or in `if', `while', `for', and `do' statements. + +`EXP ~ /REGEXP/' + This is true if the expression EXP (taken as a character string) + is matched by REGEXP. The following example matches, or selects, + all input records with the upper-case letter `J' somewhere in the + first field: + + awk '$1 ~ /J/' inventory-shipped + + So does this: + + awk '{ if ($1 ~ /J/) print }' inventory-shipped + +`EXP !~ /REGEXP/' + This is true if the expression EXP (taken as a character string) + is *not* matched by REGEXP. The following example matches, or + selects, all input records whose first field *does not* contain + the upper-case letter `J': + + awk '$1 !~ /J/' inventory-shipped + + The right hand side of a `~' or `!~' operator need not be a constant +regexp (i.e., a string of characters between slashes). It may be any +expression. 
The expression is evaluated, and converted if necessary to +a string; the contents of the string are used as the regexp. A regexp +that is computed in this way is called a "dynamic regexp". For example: + + identifier_regexp = "[A-Za-z_][A-Za-z_0-9]+" + $0 ~ identifier_regexp + +sets `identifier_regexp' to a regexp that describes `awk' variable +names, and tests if the input record matches this regexp. + + +File: gawk.info, Node: Regexp Operators, Next: Case-sensitivity, Prev: Regexp Usage, Up: Regexp + +Regular Expression Operators +---------------------------- + + You can combine regular expressions with the following characters, +called "regular expression operators", or "metacharacters", to increase +the power and versatility of regular expressions. + + Here is a table of metacharacters. All characters not listed in the +table stand for themselves. + +`^' + This matches the beginning of the string or the beginning of a line + within the string. For example: + + ^@chapter + + matches the `@chapter' at the beginning of a string, and can be + used to identify chapter beginnings in Texinfo source files. + +`$' + This is similar to `^', but it matches only at the end of a string + or the end of a line within the string. For example: + + p$ + + matches a record that ends with a `p'. + +`.' + This matches any single character except a newline. For example: + + .P + + matches any single character followed by a `P' in a string. Using + concatenation we can make regular expressions like `U.A', which + matches any three-character sequence that begins with `U' and ends + with `A'. + +`[...]' + This is called a "character set". It matches any one of the + characters that are enclosed in the square brackets. For example: + + [MVX] + + matches any one of the characters `M', `V', or `X' in a string. + + Ranges of characters are indicated by using a hyphen between the + beginning and ending characters, and enclosing the whole thing in + brackets. 
For example: + + [0-9] + + matches any digit. + + To include the character `\', `]', `-' or `^' in a character set, + put a `\' in front of it. For example: + + [d\]] + + matches either `d', or `]'. + + This treatment of `\' is compatible with other `awk' + implementations, and is also mandated by the POSIX Command Language + and Utilities standard. The regular expressions in `awk' are a + superset of the POSIX specification for Extended Regular + Expressions (EREs). POSIX EREs are based on the regular + expressions accepted by the traditional `egrep' utility. + + In `egrep' syntax, backslash is not syntactically special within + square brackets. This means that special tricks have to be used to + represent the characters `]', `-' and `^' as members of a + character set. + + In `egrep' syntax, to match `-', write it as `---', which is a + range containing only `-'. You may also give `-' as the first or + last character in the set. To match `^', put it anywhere except + as the first character of a set. To match a `]', make it the + first character in the set. For example: + + []d^] + + matches either `]', `d' or `^'. + +`[^ ...]' + This is a "complemented character set". The first character after + the `[' *must* be a `^'. It matches any characters *except* those + in the square brackets (or newline). For example: + + [^0-9] + + matches any character that is not a digit. + +`|' + This is the "alternation operator" and it is used to specify + alternatives. For example: + + ^P|[0-9] + + matches any string that matches either `^P' or `[0-9]'. This + means it matches any string that contains a digit or starts with + `P'. + + The alternation applies to the largest possible regexps on either + side. + +`(...)' + Parentheses are used for grouping in regular expressions as in + arithmetic. They can be used to concatenate regular expressions + containing the alternation operator, `|'. 
+ +`*' + This symbol means that the preceding regular expression is to be + repeated as many times as possible to find a match. For example: + + ph* + + applies the `*' symbol to the preceding `h' and looks for matches + to one `p' followed by any number of `h's. This will also match + just `p' if no `h's are present. + + The `*' repeats the *smallest* possible preceding expression. + (Use parentheses if you wish to repeat a larger expression.) It + finds as many repetitions as possible. For example: + + awk '/\(c[ad][ad]*r x\)/ { print }' sample + + prints every record in the input containing a string of the form + `(car x)', `(cdr x)', `(cadr x)', and so on. + +`+' + This symbol is similar to `*', but the preceding expression must be + matched at least once. This means that: + + wh+y + + would match `why' and `whhy' but not `wy', whereas `wh*y' would + match all three of these strings. This is a simpler way of + writing the last `*' example: + + awk '/\(c[ad]+r x\)/ { print }' sample + +`?' + This symbol is similar to `*', but the preceding expression can be + matched once or not at all. For example: + + fe?d + + will match `fed' and `fd', but nothing else. + +`\' + This is used to suppress the special meaning of a character when + matching. For example: + + \$ + + matches the character `$'. + + The escape sequences used for string constants (*note Constant + Expressions: Constants.) are valid in regular expressions as well; + they are also introduced by a `\'. + + In regular expressions, the `*', `+', and `?' operators have the +highest precedence, followed by concatenation, and finally by `|'. As +in arithmetic, parentheses can change how operators are grouped. + + +File: gawk.info, Node: Case-sensitivity, Prev: Regexp Operators, Up: Regexp + +Case-sensitivity in Matching +---------------------------- + + Case is normally significant in regular expressions, both when +matching ordinary characters (i.e., not metacharacters), and inside +character sets. 
Thus a `w' in a regular expression matches only a +lower case `w' and not an upper case `W'. + + The simplest way to do a case-independent match is to use a character +set: `[Ww]'. However, this can be cumbersome if you need to use it +often; and it can make the regular expressions harder for humans to +read. There are two other alternatives that you might prefer. + + One way to do a case-insensitive match at a particular point in the +program is to convert the data to a single case, using the `tolower' or +`toupper' built-in string functions (which we haven't discussed yet; +*note Built-in Functions for String Manipulation: String Functions.). +For example: + + tolower($1) ~ /foo/ { ... } + +converts the first field to lower case before matching against it. + + Another method is to set the variable `IGNORECASE' to a nonzero +value (*note Built-in Variables::.). When `IGNORECASE' is not zero, +*all* regexp operations ignore case. Changing the value of +`IGNORECASE' dynamically controls the case sensitivity of your program +as it runs. Case is significant by default because `IGNORECASE' (like +most variables) is initialized to zero. + + x = "aB" + if (x ~ /ab/) ... # this test will fail + + IGNORECASE = 1 + if (x ~ /ab/) ... # now it will succeed + + In general, you cannot use `IGNORECASE' to make certain rules +case-insensitive and other rules case-sensitive, because there is no way +to set `IGNORECASE' just for the pattern of a particular rule. To do +this, you must use character sets or `tolower'. However, one thing you +can do only with `IGNORECASE' is turn case-sensitivity on or off +dynamically for all the rules at once. + + `IGNORECASE' can be set on the command line, or in a `BEGIN' rule. +Setting `IGNORECASE' from the command line is a way to make a program +case-insensitive without having to edit it. + + The value of `IGNORECASE' has no effect if `gawk' is in +compatibility mode (*note Invoking `awk': Command Line.). 
Case is +always significant in compatibility mode. + + +File: gawk.info, Node: Comparison Patterns, Next: Boolean Patterns, Prev: Regexp, Up: Patterns + +Comparison Expressions as Patterns +================================== + + "Comparison patterns" test relationships such as equality between +two strings or numbers. They are a special case of expression patterns +(*note Expressions as Patterns: Expression Patterns.). They are written +with "relational operators", which are a superset of those in C. Here +is a table of them: + +`X < Y' + True if X is less than Y. + +`X <= Y' + True if X is less than or equal to Y. + +`X > Y' + True if X is greater than Y. + +`X >= Y' + True if X is greater than or equal to Y. + +`X == Y' + True if X is equal to Y. + +`X != Y' + True if X is not equal to Y. + +`X ~ Y' + True if X matches the regular expression described by Y. + +`X !~ Y' + True if X does not match the regular expression described by Y. + + The operands of a relational operator are compared as numbers if they +are both numbers. Otherwise they are converted to, and compared as, +strings (*note Conversion of Strings and Numbers: Conversion., for the +detailed rules). Strings are compared by comparing the first character +of each, then the second character of each, and so on, until there is a +difference. If the two strings are equal until the shorter one runs +out, the shorter one is considered to be less than the longer one. +Thus, `"10"' is less than `"9"', and `"abc"' is less than `"abcd"'. + + The left operand of the `~' and `!~' operators is a string. The +right operand is either a constant regular expression enclosed in +slashes (`/REGEXP/'), or any expression, whose string value is used as +a dynamic regular expression (*note How to Use Regular Expressions: +Regexp Usage.). + + The following example prints the second field of each input record +whose first field is precisely `foo'. 
+ + awk '$1 == "foo" { print $2 }' BBS-list + +Contrast this with the following regular expression match, which would +accept any record with a first field that contains `foo': + + awk '$1 ~ "foo" { print $2 }' BBS-list + +or, equivalently, this one: + + awk '$1 ~ /foo/ { print $2 }' BBS-list + + +File: gawk.info, Node: Boolean Patterns, Next: Expression Patterns, Prev: Comparison Patterns, Up: Patterns + +Boolean Operators and Patterns +============================== + + A "boolean pattern" is an expression which combines other patterns +using the "boolean operators" "or" (`||'), "and" (`&&'), and "not" +(`!'). Whether the boolean pattern matches an input record depends on +whether its subpatterns match. + + For example, the following command prints all records in the input +file `BBS-list' that contain both `2400' and `foo'. + + awk '/2400/ && /foo/' BBS-list + + The following command prints all records in the input file +`BBS-list' that contain *either* `2400' or `foo', or both. + + awk '/2400/ || /foo/' BBS-list + + The following command prints all records in the input file +`BBS-list' that do *not* contain the string `foo'. + + awk '! /foo/' BBS-list + + Note that boolean patterns are a special case of expression patterns +(*note Expressions as Patterns: Expression Patterns.); they are +expressions that use the boolean operators. *Note Boolean Expressions: +Boolean Ops, for complete information on the boolean operators. + + The subpatterns of a boolean pattern can be constant regular +expressions, comparisons, or any other `awk' expressions. Range +patterns are not expressions, so they cannot appear inside boolean +patterns. Likewise, the special patterns `BEGIN' and `END', which +never match any input record, are not expressions and cannot appear +inside boolean patterns. 
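
   The boolean operators described in this node can be tried out without
the `BBS-list' file. The following is only a sketch: the four sample
records and the shell variable names `records', `both' and `fast' are
invented for illustration and do not come from the manual's data files.
The second command combines regexp subpatterns with a comparison, which
the text above says is allowed.

```shell
# Hypothetical sample records (not the BBS-list file used elsewhere).
records='alpha 2400 foo
beta 1200 foo
gamma 2400 bar
delta 300 baz'

# Both subpatterns must match: records containing 2400 and foo.
both=$(printf '%s\n' "$records" | awk '/2400/ && /foo/')

# Regexp subpatterns combined with a negated comparison:
# records containing foo or bar whose second field is not below 2400.
fast=$(printf '%s\n' "$records" | awk '(/foo/ || /bar/) && !($2 < 2400)')

printf '%s\n' "$both"   # alpha 2400 foo
printf '%s\n' "$fast"   # alpha 2400 foo, then gamma 2400 bar
```

   Since no action is given, each matching record is simply printed. The
parentheses in the second pattern are needed because `&&' binds more
tightly than `||'.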
+ + +File: gawk.info, Node: Expression Patterns, Next: Ranges, Prev: Boolean Patterns, Up: Patterns + +Expressions as Patterns +======================= + + Any `awk' expression is also valid as an `awk' pattern. Then the +pattern "matches" if the expression's value is nonzero (if a number) or +nonnull (if a string). + + The expression is reevaluated each time the rule is tested against a +new input record. If the expression uses fields such as `$1', the +value depends directly on the new input record's text; otherwise, it +depends only on what has happened so far in the execution of the `awk' +program, but that may still be useful. + + Comparison patterns are actually a special case of this. For +example, the expression `$5 == "foo"' has the value 1 when the value of +`$5' equals `"foo"', and 0 otherwise; therefore, this expression as a +pattern matches when the two values are equal. + + Boolean patterns are also special cases of expression patterns. + + A constant regexp as a pattern is also a special case of an +expression pattern. `/foo/' as an expression has the value 1 if `foo' +appears in the current input record; thus, as a pattern, `/foo/' +matches any record containing `foo'. + + Other implementations of `awk' that are not yet POSIX compliant are +less general than `gawk': they allow comparison expressions, and +boolean combinations thereof (optionally with parentheses), but not +necessarily other kinds of expressions. + + +File: gawk.info, Node: Ranges, Next: BEGIN/END, Prev: Expression Patterns, Up: Patterns + +Specifying Record Ranges with Patterns +====================================== + + A "range pattern" is made of two patterns separated by a comma, of +the form `BEGPAT, ENDPAT'. It matches ranges of consecutive input +records. The first pattern BEGPAT controls where the range begins, and +the second one ENDPAT controls where it ends. For example, + + awk '$1 == "on", $1 == "off"' + +prints every record between `on'/`off' pairs, inclusive. 
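
   To see which records such a range pattern selects, the same pattern
can be run on a small made-up input. This is a sketch under stated
assumptions: the ten one-word records and the shell variable names
`lines' and `ranged' are invented here, and the action that prefixes
each record with `NR' is added only to make the selection visible.

```shell
# Hypothetical input: one word per record, with two on/off pairs.
lines='a
on
b
c
off
d
on
e
off
f'

# Same range pattern as above; the action prints the record number too.
ranged=$(printf '%s\n' "$lines" |
  awk '$1 == "on", $1 == "off" { print NR ": " $0 }')

printf '%s\n' "$ranged"
# 2: on
# 3: b
# 4: c
# 5: off
# 7: on
# 8: e
# 9: off
```

   Records 1, 6 and 10 fall outside both ranges and are not printed; the
`on' and `off' records themselves are part of each range.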
+ + A range pattern starts out by matching BEGPAT against every input +record; when a record matches BEGPAT, the range pattern becomes "turned +on". The range pattern matches this record. As long as it stays +turned on, it automatically matches every input record read. It also +matches ENDPAT against every input record; when that succeeds, the +range pattern is turned off again for the following record. Now it +goes back to checking BEGPAT against each record. + + The record that turns on the range pattern and the one that turns it +off both match the range pattern. If you don't want to operate on +these records, you can write `if' statements in the rule's action to +distinguish them. + + It is possible for a pattern to be turned both on and off by the same +record, if both conditions are satisfied by that record. Then the +action is executed for just that record. + + +File: gawk.info, Node: BEGIN/END, Next: Empty, Prev: Ranges, Up: Patterns + +`BEGIN' and `END' Special Patterns +================================== + + `BEGIN' and `END' are special patterns. They are not used to match +input records. Rather, they are used for supplying start-up or +clean-up information to your `awk' script. A `BEGIN' rule is executed, +once, before the first input record has been read. An `END' rule is +executed, once, after all the input has been read. For example: + + awk 'BEGIN { print "Analysis of `foo'" } + /foo/ { ++foobar } + END { print "`foo' appears " foobar " times." }' BBS-list + + This program finds the number of records in the input file `BBS-list' +that contain the string `foo'. The `BEGIN' rule prints a title for the +report. There is no need to use the `BEGIN' rule to initialize the +counter `foobar' to zero, as `awk' does this for us automatically +(*note Variables::.). + + The second rule increments the variable `foobar' every time a record +containing the pattern `foo' is read. The `END' rule prints the value +of `foobar' at the end of the run. 
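
   The report program can also be run without the `BBS-list' file. The
version below is a sketch, not the manual's own example: the four input
records and the shell variable name `report' are invented, the quotes
around `foo' in the messages are dropped to keep the shell quoting
simple, and `(foobar + 0)' is an added touch that forces a numeric
value so that an input with no matches would print `0' rather than an
empty string.

```shell
# Hypothetical input records fed on standard input.
report=$(printf '%s\n' 'foo one' 'bar two' 'a foo b' 'baz' |
  awk 'BEGIN { print "Analysis of foo" }
       /foo/ { ++foobar }
       END   { print "foo appears " (foobar + 0) " times." }')

printf '%s\n' "$report"
# Analysis of foo
# foo appears 2 times.
```

   The `BEGIN' rule runs before the first record is read and the `END'
rule after the last one, so the title comes first and the count last.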
+ + The special patterns `BEGIN' and `END' cannot be used in ranges or +with boolean operators (indeed, they cannot be used with any operators). + + An `awk' program may have multiple `BEGIN' and/or `END' rules. They +are executed in the order they appear, all the `BEGIN' rules at +start-up and all the `END' rules at termination. + + Multiple `BEGIN' and `END' sections are useful for writing library +functions, since each library can have its own `BEGIN' or `END' rule to +do its own initialization and/or cleanup. Note that the order in which +library functions are named on the command line controls the order in +which their `BEGIN' and `END' rules are executed. Therefore you have +to be careful to write such rules in library files so that the order in +which they are executed doesn't matter. *Note Invoking `awk': Command +Line, for more information on using library functions. + + If an `awk' program only has a `BEGIN' rule, and no other rules, +then the program exits after the `BEGIN' rule has been run. (Older +versions of `awk' used to keep reading and ignoring input until end of +file was seen.) However, if an `END' rule exists as well, then the +input will be read, even if there are no other rules in the program. +This is necessary in case the `END' rule checks the `NR' variable. + + `BEGIN' and `END' rules must have actions; there is no default +action for these rules since there is no current record when they run. + + +File: gawk.info, Node: Empty, Prev: BEGIN/END, Up: Patterns + +The Empty Pattern +================= + + An empty pattern is considered to match *every* input record. For +example, the program: + + awk '{ print $1 }' BBS-list + +prints the first field of every record. + |
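
   A self-contained variant of this command is sketched below. The two
input records and the shell variable name `first' are invented for
illustration, and the `END' rule is an addition that uses `NR' to show
that every record was seen by the rule with the empty pattern.

```shell
# Hypothetical two-record input; the rule with no pattern runs for both.
first=$(printf '%s\n' 'aardvark 555-5553' 'alpo-net 555-3412' |
  awk '{ print $1 }
       END { print NR " records" }')

printf '%s\n' "$first"
# aardvark
# alpo-net
# 2 records
```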