author      Arnold D. Robbins <arnold@skeeve.com>    2016-10-26 21:49:00 +0300
committer   Arnold D. Robbins <arnold@skeeve.com>    2016-10-26 21:49:00 +0300
commit      e5abd6a16d42fc0f42277919a2d0a2c28476788c (patch)
tree        c620f8fd78ec318c7614efeb485c324be23689b7
parent      ff7329df359352a8f62a21bbb9fea47cba29b589 (diff)
download    egawk-e5abd6a16d42fc0f42277919a2d0a2c28476788c.tar.gz
            egawk-e5abd6a16d42fc0f42277919a2d0a2c28476788c.tar.bz2
            egawk-e5abd6a16d42fc0f42277919a2d0a2c28476788c.zip
Add .info files back.
-rw-r--r--   .gitignore             2
-rw-r--r--   doc/gawk.info      35781
-rw-r--r--   doc/gawkinet.info   4406

3 files changed, 40187 insertions, 2 deletions
@@ -16,5 +16,3 @@ gawk stamp-h1 test/fmtspcl.ok - -doc/*.info diff --git a/doc/gawk.info b/doc/gawk.info new file mode 100644 index 00000000..b8ab365a --- /dev/null +++ b/doc/gawk.info @@ -0,0 +1,35781 @@ +This is gawk.info, produced by makeinfo version 6.1 from gawk.texi. + +Copyright (C) 1989, 1991, 1992, 1993, 1996-2005, 2007, 2009-2016 +Free Software Foundation, Inc. + + + This is Edition 4.1 of 'GAWK: Effective AWK Programming: A User's +Guide for GNU Awk', for the 4.1.4 (or later) version of the GNU +implementation of AWK. + + Permission is granted to copy, distribute and/or modify this document +under the terms of the GNU Free Documentation License, Version 1.3 or +any later version published by the Free Software Foundation; with the +Invariant Sections being "GNU General Public License", with the +Front-Cover Texts being "A GNU Manual", and with the Back-Cover Texts as +in (a) below. A copy of the license is included in the section entitled +"GNU Free Documentation License". + + a. The FSF's Back-Cover Text is: "You have the freedom to copy and + modify this GNU manual." +INFO-DIR-SECTION Text creation and manipulation +START-INFO-DIR-ENTRY +* Gawk: (gawk). A text scanning and processing language. +END-INFO-DIR-ENTRY + +INFO-DIR-SECTION Individual utilities +START-INFO-DIR-ENTRY +* awk: (gawk)Invoking gawk. Text scanning and processing. +END-INFO-DIR-ENTRY + + +File: gawk.info, Node: Top, Next: Foreword3, Up: (dir) + +General Introduction +******************** + +This file documents 'awk', a program that you can use to select +particular records in a file and perform operations upon them. + + Copyright (C) 1989, 1991, 1992, 1993, 1996-2005, 2007, 2009-2016 +Free Software Foundation, Inc. + + + This is Edition 4.1 of 'GAWK: Effective AWK Programming: A User's +Guide for GNU Awk', for the 4.1.4 (or later) version of the GNU +implementation of AWK. + + Permission is granted to copy, distribute and/or modify this document +under the terms of the GNU Free Documentation License, Version 1.3 or +any later version published by the Free Software Foundation; with the +Invariant Sections being "GNU General Public License", with the +Front-Cover Texts being "A GNU Manual", and with the Back-Cover Texts as +in (a) below. A copy of the license is included in the section entitled +"GNU Free Documentation License". + + a. The FSF's Back-Cover Text is: "You have the freedom to copy and + modify this GNU manual." + +* Menu: + +* Foreword3:: Some nice words about this + Info file. +* Foreword4:: More nice words. +* Preface:: What this Info file is about; brief + history and acknowledgments. +* Getting Started:: A basic introduction to using + 'awk'. How to run an 'awk' + program. Command-line syntax. +* Invoking Gawk:: How to run 'gawk'. +* Regexp:: All about matching things using regular + expressions. +* Reading Files:: How to read files and manipulate fields. +* Printing:: How to print using 'awk'. Describes + the 'print' and 'printf' + statements. Also describes redirection of + output. +* Expressions:: Expressions are the basic building blocks + of statements. +* Patterns and Actions:: Overviews of patterns and actions. +* Arrays:: The description and use of arrays. Also + includes array-oriented control statements. +* Functions:: Built-in and user-defined functions. +* Library Functions:: A Library of 'awk' Functions. +* Sample Programs:: Many 'awk' programs with complete + explanations. +* Advanced Features:: Stuff for advanced users, specific to + 'gawk'. 
+* Internationalization:: Getting 'gawk' to speak your + language. +* Debugger:: The 'gawk' debugger. +* Arbitrary Precision Arithmetic:: Arbitrary precision arithmetic with + 'gawk'. +* Dynamic Extensions:: Adding new built-in functions to + 'gawk'. +* Language History:: The evolution of the 'awk' + language. +* Installation:: Installing 'gawk' under various + operating systems. +* Notes:: Notes about adding things to 'gawk' + and possible future work. +* Basic Concepts:: A very quick introduction to programming + concepts. +* Glossary:: An explanation of some unfamiliar terms. +* Copying:: Your right to copy and distribute + 'gawk'. +* GNU Free Documentation License:: The license for this Info file. +* Index:: Concept and Variable Index. + +* History:: The history of 'gawk' and + 'awk'. +* Names:: What name to use to find + 'awk'. +* This Manual:: Using this Info file. Includes + sample input files that you can use. +* Conventions:: Typographical Conventions. +* Manual History:: Brief history of the GNU project and + this Info file. +* How To Contribute:: Helping to save the world. +* Acknowledgments:: Acknowledgments. +* Running gawk:: How to run 'gawk' programs; + includes command-line syntax. +* One-shot:: Running a short throwaway + 'awk' program. +* Read Terminal:: Using no input files (input from the + keyboard instead). +* Long:: Putting permanent 'awk' + programs in files. +* Executable Scripts:: Making self-contained 'awk' + programs. +* Comments:: Adding documentation to 'gawk' + programs. +* Quoting:: More discussion of shell quoting + issues. +* DOS Quoting:: Quoting in Windows Batch Files. +* Sample Data Files:: Sample data files for use in the + 'awk' programs illustrated in + this Info file. +* Very Simple:: A very simple example. +* Two Rules:: A less simple one-line example using + two rules. +* More Complex:: A more complex example. +* Statements/Lines:: Subdividing or combining statements + into lines. +* Other Features:: Other Features of 'awk'. +* When:: When to use 'gawk' and when to + use other things. +* Intro Summary:: Summary of the introduction. +* Command Line:: How to run 'awk'. +* Options:: Command-line options and their + meanings. +* Other Arguments:: Input file names and variable + assignments. +* Naming Standard Input:: How to specify standard input with + other files. +* Environment Variables:: The environment variables + 'gawk' uses. +* AWKPATH Variable:: Searching directories for + 'awk' programs. +* AWKLIBPATH Variable:: Searching directories for + 'awk' shared libraries. +* Other Environment Variables:: The environment variables. +* Exit Status:: 'gawk''s exit status. +* Include Files:: Including other files into your + program. +* Loading Shared Libraries:: Loading shared libraries into your + program. +* Obsolete:: Obsolete Options and/or features. +* Undocumented:: Undocumented Options and Features. +* Invoking Summary:: Invocation summary. +* Regexp Usage:: How to Use Regular Expressions. +* Escape Sequences:: How to write nonprinting characters. +* Regexp Operators:: Regular Expression Operators. +* Bracket Expressions:: What can go between '[...]'. +* Leftmost Longest:: How much text matches. +* Computed Regexps:: Using Dynamic Regexps. +* GNU Regexp Operators:: Operators specific to GNU software. +* Case-sensitivity:: How to do case-insensitive matching. +* Strong Regexp Constants:: Strongly typed regexp constants. +* Regexp Summary:: Regular expressions summary. +* Records:: Controlling how data is split into + records. 
+* awk split records:: How standard 'awk' splits + records. +* gawk split records:: How 'gawk' splits records. +* Fields:: An introduction to fields. +* Nonconstant Fields:: Nonconstant Field Numbers. +* Changing Fields:: Changing the Contents of a Field. +* Field Separators:: The field separator and how to change + it. +* Default Field Splitting:: How fields are normally separated. +* Regexp Field Splitting:: Using regexps as the field separator. +* Single Character Fields:: Making each character a separate + field. +* Command Line Field Separator:: Setting 'FS' from the command + line. +* Full Line Fields:: Making the full line be a single + field. +* Field Splitting Summary:: Some final points and a summary table. +* Constant Size:: Reading constant width data. +* Splitting By Content:: Defining Fields By Content +* Multiple Line:: Reading multiline records. +* Getline:: Reading files under explicit program + control using the 'getline' + function. +* Plain Getline:: Using 'getline' with no + arguments. +* Getline/Variable:: Using 'getline' into a variable. +* Getline/File:: Using 'getline' from a file. +* Getline/Variable/File:: Using 'getline' into a variable + from a file. +* Getline/Pipe:: Using 'getline' from a pipe. +* Getline/Variable/Pipe:: Using 'getline' into a variable + from a pipe. +* Getline/Coprocess:: Using 'getline' from a coprocess. +* Getline/Variable/Coprocess:: Using 'getline' into a variable + from a coprocess. +* Getline Notes:: Important things to know about + 'getline'. +* Getline Summary:: Summary of 'getline' Variants. +* Read Timeout:: Reading input with a timeout. +* Retrying Input:: Retrying input after certain errors. +* Command-line directories:: What happens if you put a directory on + the command line. +* Input Summary:: Input summary. +* Input Exercises:: Exercises. +* Print:: The 'print' statement. +* Print Examples:: Simple examples of 'print' + statements. +* Output Separators:: The output separators and how to + change them. +* OFMT:: Controlling Numeric Output With + 'print'. +* Printf:: The 'printf' statement. +* Basic Printf:: Syntax of the 'printf' statement. +* Control Letters:: Format-control letters. +* Format Modifiers:: Format-specification modifiers. +* Printf Examples:: Several examples. +* Redirection:: How to redirect output to multiple + files and pipes. +* Special FD:: Special files for I/O. +* Special Files:: File name interpretation in + 'gawk'. 'gawk' allows + access to inherited file descriptors. +* Other Inherited Files:: Accessing other open files with + 'gawk'. +* Special Network:: Special files for network + communications. +* Special Caveats:: Things to watch out for. +* Close Files And Pipes:: Closing Input and Output Files and + Pipes. +* Nonfatal:: Enabling Nonfatal Output. +* Output Summary:: Output summary. +* Output Exercises:: Exercises. +* Values:: Constants, Variables, and Regular + Expressions. +* Constants:: String, numeric and regexp constants. +* Scalar Constants:: Numeric and string constants. +* Nondecimal-numbers:: What are octal and hex numbers. +* Regexp Constants:: Regular Expression constants. +* Using Constant Regexps:: When and how to use a regexp constant. +* Variables:: Variables give names to values for + later use. +* Using Variables:: Using variables in your programs. +* Assignment Options:: Setting variables on the command line + and a summary of command-line syntax. + This is an advanced method of input. +* Conversion:: The conversion of strings to numbers + and vice versa. 
+* Strings And Numbers:: How 'awk' Converts Between + Strings And Numbers. +* Locale influences conversions:: How the locale may affect conversions. +* All Operators:: 'gawk''s operators. +* Arithmetic Ops:: Arithmetic operations ('+', + '-', etc.) +* Concatenation:: Concatenating strings. +* Assignment Ops:: Changing the value of a variable or a + field. +* Increment Ops:: Incrementing the numeric value of a + variable. +* Truth Values and Conditions:: Testing for true and false. +* Truth Values:: What is "true" and what is + "false". +* Typing and Comparison:: How variables acquire types and how + this affects comparison of numbers and + strings with '<', etc. +* Variable Typing:: String type versus numeric type. +* Comparison Operators:: The comparison operators. +* POSIX String Comparison:: String comparison with POSIX rules. +* Boolean Ops:: Combining comparison expressions using + boolean operators '||' ("or"), + '&&' ("and") and '!' + ("not"). +* Conditional Exp:: Conditional expressions select between + two subexpressions under control of a + third subexpression. +* Function Calls:: A function call is an expression. +* Precedence:: How various operators nest. +* Locales:: How the locale affects things. +* Expressions Summary:: Expressions summary. +* Pattern Overview:: What goes into a pattern. +* Regexp Patterns:: Using regexps as patterns. +* Expression Patterns:: Any expression can be used as a + pattern. +* Ranges:: Pairs of patterns specify record + ranges. +* BEGIN/END:: Specifying initialization and cleanup + rules. +* Using BEGIN/END:: How and why to use BEGIN/END rules. +* I/O And BEGIN/END:: I/O issues in BEGIN/END rules. +* BEGINFILE/ENDFILE:: Two special patterns for advanced + control. +* Empty:: The empty pattern, which matches every + record. +* Using Shell Variables:: How to use shell variables with + 'awk'. +* Action Overview:: What goes into an action. +* Statements:: Describes the various control + statements in detail. +* If Statement:: Conditionally execute some + 'awk' statements. +* While Statement:: Loop until some condition is + satisfied. +* Do Statement:: Do specified action while looping + until some condition is satisfied. +* For Statement:: Another looping statement, that + provides initialization and increment + clauses. +* Switch Statement:: Switch/case evaluation for conditional + execution of statements based on a + value. +* Break Statement:: Immediately exit the innermost + enclosing loop. +* Continue Statement:: Skip to the end of the innermost + enclosing loop. +* Next Statement:: Stop processing the current input + record. +* Nextfile Statement:: Stop processing the current file. +* Exit Statement:: Stop execution of 'awk'. +* Built-in Variables:: Summarizes the predefined variables. +* User-modified:: Built-in variables that you change to + control 'awk'. +* Auto-set:: Built-in variables where 'awk' + gives you information. +* ARGC and ARGV:: Ways to use 'ARGC' and + 'ARGV'. +* Pattern Action Summary:: Patterns and Actions summary. +* Array Basics:: The basics of arrays. +* Array Intro:: Introduction to Arrays +* Reference to Elements:: How to examine one element of an + array. +* Assigning Elements:: How to change an element of an array. +* Array Example:: Basic Example of an Array +* Scanning an Array:: A variation of the 'for' + statement. It loops through the + indices of an array's existing + elements. +* Controlling Scanning:: Controlling the order in which arrays + are scanned. 
+* Numeric Array Subscripts:: How to use numbers as subscripts in + 'awk'. +* Uninitialized Subscripts:: Using Uninitialized variables as + subscripts. +* Delete:: The 'delete' statement removes an + element from an array. +* Multidimensional:: Emulating multidimensional arrays in + 'awk'. +* Multiscanning:: Scanning multidimensional arrays. +* Arrays of Arrays:: True multidimensional arrays. +* Arrays Summary:: Summary of arrays. +* Built-in:: Summarizes the built-in functions. +* Calling Built-in:: How to call built-in functions. +* Numeric Functions:: Functions that work with numbers, + including 'int()', 'sin()' + and 'rand()'. +* String Functions:: Functions for string manipulation, + such as 'split()', 'match()' + and 'sprintf()'. +* Gory Details:: More than you want to know about + '\' and '&' with + 'sub()', 'gsub()', and + 'gensub()'. +* I/O Functions:: Functions for files and shell + commands. +* Time Functions:: Functions for dealing with timestamps. +* Bitwise Functions:: Functions for bitwise operations. +* Type Functions:: Functions for type information. +* I18N Functions:: Functions for string translation. +* User-defined:: Describes User-defined functions in + detail. +* Definition Syntax:: How to write definitions and what they + mean. +* Function Example:: An example function definition and + what it does. +* Function Caveats:: Things to watch out for. +* Calling A Function:: Don't use spaces. +* Variable Scope:: Controlling variable scope. +* Pass By Value/Reference:: Passing parameters. +* Return Statement:: Specifying the value a function + returns. +* Dynamic Typing:: How variable types can change at + runtime. +* Indirect Calls:: Choosing the function to call at + runtime. +* Functions Summary:: Summary of functions. +* Library Names:: How to best name private global + variables in library functions. +* General Functions:: Functions that are of general use. +* Strtonum Function:: A replacement for the built-in + 'strtonum()' function. +* Assert Function:: A function for assertions in + 'awk' programs. +* Round Function:: A function for rounding if + 'sprintf()' does not do it + correctly. +* Cliff Random Function:: The Cliff Random Number Generator. +* Ordinal Functions:: Functions for using characters as + numbers and vice versa. +* Join Function:: A function to join an array into a + string. +* Getlocaltime Function:: A function to get formatted times. +* Readfile Function:: A function to read an entire file at + once. +* Shell Quoting:: A function to quote strings for the + shell. +* Data File Management:: Functions for managing command-line + data files. +* Filetrans Function:: A function for handling data file + transitions. +* Rewind Function:: A function for rereading the current + file. +* File Checking:: Checking that data files are readable. +* Empty Files:: Checking for zero-length files. +* Ignoring Assigns:: Treating assignments as file names. +* Getopt Function:: A function for processing command-line + arguments. +* Passwd Functions:: Functions for getting user + information. +* Group Functions:: Functions for getting group + information. +* Walking Arrays:: A function to walk arrays of arrays. +* Library Functions Summary:: Summary of library functions. +* Library Exercises:: Exercises. +* Running Examples:: How to run these examples. +* Clones:: Clones of common utilities. +* Cut Program:: The 'cut' utility. +* Egrep Program:: The 'egrep' utility. +* Id Program:: The 'id' utility. +* Split Program:: The 'split' utility. 
+* Tee Program:: The 'tee' utility. +* Uniq Program:: The 'uniq' utility. +* Wc Program:: The 'wc' utility. +* Miscellaneous Programs:: Some interesting 'awk' + programs. +* Dupword Program:: Finding duplicated words in a + document. +* Alarm Program:: An alarm clock. +* Translate Program:: A program similar to the 'tr' + utility. +* Labels Program:: Printing mailing labels. +* Word Sorting:: A program to produce a word usage + count. +* History Sorting:: Eliminating duplicate entries from a + history file. +* Extract Program:: Pulling out programs from Texinfo + source files. +* Simple Sed:: A Simple Stream Editor. +* Igawk Program:: A wrapper for 'awk' that + includes files. +* Anagram Program:: Finding anagrams from a dictionary. +* Signature Program:: People do amazing things with too much + time on their hands. +* Programs Summary:: Summary of programs. +* Programs Exercises:: Exercises. +* Nondecimal Data:: Allowing nondecimal input data. +* Array Sorting:: Facilities for controlling array + traversal and sorting arrays. +* Controlling Array Traversal:: How to use PROCINFO["sorted_in"]. +* Array Sorting Functions:: How to use 'asort()' and + 'asorti()'. +* Two-way I/O:: Two-way communications with another + process. +* TCP/IP Networking:: Using 'gawk' for network + programming. +* Profiling:: Profiling your 'awk' programs. +* Advanced Features Summary:: Summary of advanced features. +* I18N and L10N:: Internationalization and Localization. +* Explaining gettext:: How GNU 'gettext' works. +* Programmer i18n:: Features for the programmer. +* Translator i18n:: Features for the translator. +* String Extraction:: Extracting marked strings. +* Printf Ordering:: Rearranging 'printf' arguments. +* I18N Portability:: 'awk'-level portability + issues. +* I18N Example:: A simple i18n example. +* Gawk I18N:: 'gawk' is also + internationalized. +* I18N Summary:: Summary of I18N stuff. +* Debugging:: Introduction to 'gawk' + debugger. +* Debugging Concepts:: Debugging in General. +* Debugging Terms:: Additional Debugging Concepts. +* Awk Debugging:: Awk Debugging. +* Sample Debugging Session:: Sample debugging session. +* Debugger Invocation:: How to Start the Debugger. +* Finding The Bug:: Finding the Bug. +* List of Debugger Commands:: Main debugger commands. +* Breakpoint Control:: Control of Breakpoints. +* Debugger Execution Control:: Control of Execution. +* Viewing And Changing Data:: Viewing and Changing Data. +* Execution Stack:: Dealing with the Stack. +* Debugger Info:: Obtaining Information about the + Program and the Debugger State. +* Miscellaneous Debugger Commands:: Miscellaneous Commands. +* Readline Support:: Readline support. +* Limitations:: Limitations and future plans. +* Debugging Summary:: Debugging summary. +* Computer Arithmetic:: A quick intro to computer math. +* Math Definitions:: Defining terms used. +* MPFR features:: The MPFR features in 'gawk'. +* FP Math Caution:: Things to know. +* Inexactness of computations:: Floating point math is not exact. +* Inexact representation:: Numbers are not exactly represented. +* Comparing FP Values:: How to compare floating point values. +* Errors accumulate:: Errors get bigger as they go. +* Getting Accuracy:: Getting more accuracy takes some work. +* Try To Round:: Add digits and round. +* Setting precision:: How to set the precision. +* Setting the rounding mode:: How to set the rounding mode. +* Arbitrary Precision Integers:: Arbitrary Precision Integer Arithmetic + with 'gawk'. 
+* POSIX Floating Point Problems:: Standards Versus Existing Practice. +* Floating point summary:: Summary of floating point discussion. +* Extension Intro:: What is an extension. +* Plugin License:: A note about licensing. +* Extension Mechanism Outline:: An outline of how it works. +* Extension API Description:: A full description of the API. +* Extension API Functions Introduction:: Introduction to the API functions. +* General Data Types:: The data types. +* Memory Allocation Functions:: Functions for allocating memory. +* Constructor Functions:: Functions for creating values. +* Registration Functions:: Functions to register things with + 'gawk'. +* Extension Functions:: Registering extension functions. +* Exit Callback Functions:: Registering an exit callback. +* Extension Version String:: Registering a version string. +* Input Parsers:: Registering an input parser. +* Output Wrappers:: Registering an output wrapper. +* Two-way processors:: Registering a two-way processor. +* Printing Messages:: Functions for printing messages. +* Updating ERRNO:: Functions for updating 'ERRNO'. +* Requesting Values:: How to get a value. +* Accessing Parameters:: Functions for accessing parameters. +* Symbol Table Access:: Functions for accessing global + variables. +* Symbol table by name:: Accessing variables by name. +* Symbol table by cookie:: Accessing variables by "cookie". +* Cached values:: Creating and using cached values. +* Array Manipulation:: Functions for working with arrays. +* Array Data Types:: Data types for working with arrays. +* Array Functions:: Functions for working with arrays. +* Flattening Arrays:: How to flatten arrays. +* Creating Arrays:: How to create and populate arrays. +* Redirection API:: How to access and manipulate redirections. +* Extension API Variables:: Variables provided by the API. +* Extension Versioning:: API Version information. +* Extension API Informational Variables:: Variables providing information about + 'gawk''s invocation. +* Extension API Boilerplate:: Boilerplate code for using the API. +* Finding Extensions:: How 'gawk' finds compiled + extensions. +* Extension Example:: Example C code for an extension. +* Internal File Description:: What the new functions will do. +* Internal File Ops:: The code for internal file operations. +* Using Internal File Ops:: How to use an external extension. +* Extension Samples:: The sample extensions that ship with + 'gawk'. +* Extension Sample File Functions:: The file functions sample. +* Extension Sample Fnmatch:: An interface to 'fnmatch()'. +* Extension Sample Fork:: An interface to 'fork()' and + other process functions. +* Extension Sample Inplace:: Enabling in-place file editing. +* Extension Sample Ord:: Character to value to character + conversions. +* Extension Sample Readdir:: An interface to 'readdir()'. +* Extension Sample Revout:: Reversing output sample output + wrapper. +* Extension Sample Rev2way:: Reversing data sample two-way + processor. +* Extension Sample Read write array:: Serializing an array to a file. +* Extension Sample Readfile:: Reading an entire file into a string. +* Extension Sample Time:: An interface to 'gettimeofday()' + and 'sleep()'. +* Extension Sample API Tests:: Tests for the API. +* gawkextlib:: The 'gawkextlib' project. +* Extension summary:: Extension summary. +* Extension Exercises:: Exercises. +* V7/SVR3.1:: The major changes between V7 and + System V Release 3.1. +* SVR4:: Minor changes between System V + Releases 3.1 and 4. 
+* POSIX:: New features from the POSIX standard. +* BTL:: New features from Brian Kernighan's + version of 'awk'. +* POSIX/GNU:: The extensions in 'gawk' not + in POSIX 'awk'. +* Feature History:: The history of the features in + 'gawk'. +* Common Extensions:: Common Extensions Summary. +* Ranges and Locales:: How locales used to affect regexp + ranges. +* Contributors:: The major contributors to + 'gawk'. +* History summary:: History summary. +* Gawk Distribution:: What is in the 'gawk' + distribution. +* Getting:: How to get the distribution. +* Extracting:: How to extract the distribution. +* Distribution contents:: What is in the distribution. +* Unix Installation:: Installing 'gawk' under + various versions of Unix. +* Quick Installation:: Compiling 'gawk' under Unix. +* Shell Startup Files:: Shell convenience functions. +* Additional Configuration Options:: Other compile-time options. +* Configuration Philosophy:: How it's all supposed to work. +* Non-Unix Installation:: Installation on Other Operating + Systems. +* PC Installation:: Installing and Compiling 'gawk' on + Microsoft Windows. +* PC Binary Installation:: Installing a prepared distribution. +* PC Compiling:: Compiling 'gawk' for Windows32. +* PC Using:: Running 'gawk' on Windows32. +* Cygwin:: Building and running 'gawk' + for Cygwin. +* MSYS:: Using 'gawk' In The MSYS + Environment. +* VMS Installation:: Installing 'gawk' on VMS. +* VMS Compilation:: How to compile 'gawk' under + VMS. +* VMS Dynamic Extensions:: Compiling 'gawk' dynamic + extensions on VMS. +* VMS Installation Details:: How to install 'gawk' under + VMS. +* VMS Running:: How to run 'gawk' under VMS. +* VMS GNV:: The VMS GNV Project. +* VMS Old Gawk:: An old version comes with some VMS + systems. +* Bugs:: Reporting Problems and Bugs. +* Bug address:: Where to send reports to. +* Usenet:: Where not to send reports to. +* Maintainers:: Maintainers of non-*nix ports. +* Other Versions:: Other freely available 'awk' + implementations. +* Installation summary:: Summary of installation. +* Compatibility Mode:: How to disable certain 'gawk' + extensions. +* Additions:: Making Additions To 'gawk'. +* Accessing The Source:: Accessing the Git repository. +* Adding Code:: Adding code to the main body of + 'gawk'. +* New Ports:: Porting 'gawk' to a new + operating system. +* Derived Files:: Why derived files are kept in the Git + repository. +* Future Extensions:: New features that may be implemented + one day. +* Implementation Limitations:: Some limitations of the + implementation. +* Extension Design:: Design notes about the extension API. +* Old Extension Problems:: Problems with the old mechanism. +* Extension New Mechanism Goals:: Goals for the new mechanism. +* Extension Other Design Decisions:: Some other design decisions. +* Extension Future Growth:: Some room for future growth. +* Old Extension Mechanism:: Some compatibility for old extensions. +* Notes summary:: Summary of implementation notes. +* Basic High Level:: The high level view. +* Basic Data Typing:: A very quick intro to data types. + + To my parents, for their love, and for the wonderful example they set +for me. + + To my wife Miriam, for making me complete. Thank you for building +your life together with me. + + To our children Chana, Rivka, Nachum and Malka, for enrichening our +lives in innumerable ways. + + +File: gawk.info, Node: Foreword3, Next: Foreword4, Prev: Top, Up: Top + +Foreword to the Third Edition +***************************** + +Arnold Robbins and I are good friends. 
We were introduced in 1990 by +circumstances--and our favorite programming language, AWK. The +circumstances started a couple of years earlier. I was working at a new +job and noticed an unplugged Unix computer sitting in the corner. No +one knew how to use it, and neither did I. However, a couple of days +later, it was running, and I was 'root' and the one-and-only user. That +day, I began the transition from statistician to Unix programmer. + + On one of many trips to the library or bookstore in search of books +on Unix, I found the gray AWK book, a.k.a. Alfred V. Aho, Brian W. +Kernighan, and Peter J. Weinberger's 'The AWK Programming Language' +(Addison-Wesley, 1988). 'awk''s simple programming paradigm--find a +pattern in the input and then perform an action--often reduced complex +or tedious data manipulations to a few lines of code. I was excited to +try my hand at programming in AWK. + + Alas, the 'awk' on my computer was a limited version of the language +described in the gray book. I discovered that my computer had "old +'awk'" and the book described "new 'awk'." I learned that this was +typical; the old version refused to step aside or relinquish its name. +If a system had a new 'awk', it was invariably called 'nawk', and few +systems had it. The best way to get a new 'awk' was to 'ftp' the source +code for 'gawk' from 'prep.ai.mit.edu'. 'gawk' was a version of new +'awk' written by David Trueman and Arnold, and available under the GNU +General Public License. + + (Incidentally, it's no longer difficult to find a new 'awk'. 'gawk' +ships with GNU/Linux, and you can download binaries or source code for +almost any system; my wife uses 'gawk' on her VMS box.) + + My Unix system started out unplugged from the wall; it certainly was +not plugged into a network. So, oblivious to the existence of 'gawk' +and the Unix community in general, and desiring a new 'awk', I wrote my +own, called 'mawk'. Before I was finished, I knew about 'gawk', but it +was too late to stop, so I eventually posted to a 'comp.sources' +newsgroup. + + A few days after my posting, I got a friendly email from Arnold +introducing himself. He suggested we share design and algorithms and +attached a draft of the POSIX standard so that I could update 'mawk' to +support language extensions added after publication of 'The AWK +Programming Language'. + + Frankly, if our roles had been reversed, I would not have been so +open and we probably would have never met. I'm glad we did meet. He is +an AWK expert's AWK expert and a genuinely nice person. Arnold +contributes significant amounts of his expertise and time to the Free +Software Foundation. + + This book is the 'gawk' reference manual, but at its core it is a +book about AWK programming that will appeal to a wide audience. It is a +definitive reference to the AWK language as defined by the 1987 Bell +Laboratories release and codified in the 1992 POSIX Utilities standard. + + On the other hand, the novice AWK programmer can study a wealth of +practical programs that emphasize the power of AWK's basic idioms: +data-driven control flow, pattern matching with regular expressions, and +associative arrays. Those looking for something new can try out +'gawk''s interface to network protocols via special '/inet' files. + + The programs in this book make clear that an AWK program is typically +much smaller and faster to develop than a counterpart written in C. 
+Consequently, there is often a payoff to prototyping an algorithm or +design in AWK to get it running quickly and expose problems early. +Often, the interpreted performance is adequate and the AWK prototype +becomes the product. + + The new 'pgawk' (profiling 'gawk'), produces program execution +counts. I recently experimented with an algorithm that for n lines of +input, exhibited ~ C n^2 performance, while theory predicted ~ C n log n +behavior. A few minutes poring over the 'awkprof.out' profile +pinpointed the problem to a single line of code. 'pgawk' is a welcome +addition to my programmer's toolbox. + + Arnold has distilled over a decade of experience writing and using +AWK programs, and developing 'gawk', into this book. If you use AWK or +want to learn how, then read this book. + + Michael Brennan + Author of 'mawk' + March 2001 + + +File: gawk.info, Node: Foreword4, Next: Preface, Prev: Foreword3, Up: Top + +Foreword to the Fourth Edition +****************************** + +Some things don't change. Thirteen years ago I wrote: "If you use AWK +or want to learn how, then read this book." True then, and still true +today. + + Learning to use a programming language is about more than mastering +the syntax. One needs to acquire an understanding of how to use the +features of the language to solve practical programming problems. A +focus of this book is many examples that show how to use AWK. + + Some things do change. Our computers are much faster and have more +memory. Consequently, speed and storage inefficiencies of a high-level +language matter less. Prototyping in AWK and then rewriting in C for +performance reasons happens less, because more often the prototype is +fast enough. + + Of course, there are computing operations that are best done in C or +C++. With 'gawk' 4.1 and later, you do not have to choose between +writing your program in AWK or in C/C++. You can write most of your +program in AWK and the aspects that require C/C++ capabilities can be +written in C/C++, and then the pieces glued together when the 'gawk' +module loads the C/C++ module as a dynamic plug-in. *note Dynamic +Extensions::, has all the details, and, as expected, many examples to +help you learn the ins and outs. + + I enjoy programming in AWK and had fun (re)reading this book. I +think you will too. + + Michael Brennan + Author of 'mawk' + October 2014 + + +File: gawk.info, Node: Preface, Next: Getting Started, Prev: Foreword4, Up: Top + +Preface +******* + +Several kinds of tasks occur repeatedly when working with text files. +You might want to extract certain lines and discard the rest. Or you +may need to make changes wherever certain patterns appear, but leave the +rest of the file alone. Such jobs are often easy with 'awk'. The 'awk' +utility interprets a special-purpose programming language that makes it +easy to handle simple data-reformatting jobs. + + The GNU implementation of 'awk' is called 'gawk'; if you invoke it +with the proper options or environment variables, it is fully compatible +with the POSIX(1) specification of the 'awk' language and with the Unix +version of 'awk' maintained by Brian Kernighan. This means that all +properly written 'awk' programs should work with 'gawk'. So most of the +time, we don't distinguish between 'gawk' and other 'awk' +implementations. 
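   As a small taste of what lies ahead, here is a purely illustrative
one-line program (the file name 'build.log' is only an assumption; any
text file will do).  It prints each line that contains the string
'error' and discards the rest:

     $ awk '/error/ { print }' build.log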
+ + Using 'awk' you can: + + * Manage small, personal databases + + * Generate reports + + * Validate data + + * Produce indexes and perform other document-preparation tasks + + * Experiment with algorithms that you can adapt later to other + computer languages + + In addition, 'gawk' provides facilities that make it easy to: + + * Extract bits and pieces of data for processing + + * Sort data + + * Perform simple network communications + + * Profile and debug 'awk' programs + + * Extend the language with functions written in C or C++ + + This Info file teaches you about the 'awk' language and how you can +use it effectively. You should already be familiar with basic system +commands, such as 'cat' and 'ls',(2) as well as basic shell facilities, +such as input/output (I/O) redirection and pipes. + + Implementations of the 'awk' language are available for many +different computing environments. This Info file, while describing the +'awk' language in general, also describes the particular implementation +of 'awk' called 'gawk' (which stands for "GNU 'awk'"). 'gawk' runs on a +broad range of Unix systems, ranging from Intel-architecture PC-based +computers up through large-scale systems. 'gawk' has also been ported +to Mac OS X, Microsoft Windows (all versions), and OpenVMS.(3) + +* Menu: + +* History:: The history of 'gawk' and + 'awk'. +* Names:: What name to use to find 'awk'. +* This Manual:: Using this Info file. Includes sample + input files that you can use. +* Conventions:: Typographical Conventions. +* Manual History:: Brief history of the GNU project and this + Info file. +* How To Contribute:: Helping to save the world. +* Acknowledgments:: Acknowledgments. + + ---------- Footnotes ---------- + + (1) The 2008 POSIX standard is accessible online at +<http://www.opengroup.org/onlinepubs/9699919799/>. + + (2) These utilities are available on POSIX-compliant systems, as well +as on traditional Unix-based systems. If you are using some other +operating system, you still need to be familiar with the ideas of I/O +redirection and pipes. + + (3) Some other, obsolete systems to which 'gawk' was once ported are +no longer supported and the code for those systems has been removed. + + +File: gawk.info, Node: History, Next: Names, Up: Preface + +History of 'awk' and 'gawk' +=========================== + + Recipe for a Programming Language + + 1 part 'egrep' 1 part 'snobol' + 2 parts 'ed' 3 parts C + + Blend all parts well using 'lex' and 'yacc'. Document minimally and +release. + + After eight years, add another part 'egrep' and two more parts C. +Document very well and release. + + The name 'awk' comes from the initials of its designers: Alfred V. +Aho, Peter J. Weinberger, and Brian W. Kernighan. The original version +of 'awk' was written in 1977 at AT&T Bell Laboratories. In 1985, a new +version made the programming language more powerful, introducing +user-defined functions, multiple input streams, and computed regular +expressions. This new version became widely available with Unix System +V Release 3.1 (1987). The version in System V Release 4 (1989) added +some new features and cleaned up the behavior in some of the "dark +corners" of the language. The specification for 'awk' in the POSIX +Command Language and Utilities standard further clarified the language. +Both the 'gawk' designers and the original 'awk' designers at Bell +Laboratories provided feedback for the POSIX specification. + + Paul Rubin wrote 'gawk' in 1986. Jay Fenlason completed it, with +advice from Richard Stallman. 
John Woods contributed parts of the code +as well. In 1988 and 1989, David Trueman, with help from me, thoroughly +reworked 'gawk' for compatibility with the newer 'awk'. Circa 1994, I +became the primary maintainer. Current development focuses on bug +fixes, performance improvements, standards compliance, and, +occasionally, new features. + + In May 1997, Ju"rgen Kahrs felt the need for network access from +'awk', and with a little help from me, set about adding features to do +this for 'gawk'. At that time, he also wrote the bulk of 'TCP/IP +Internetworking with 'gawk'' (a separate document, available as part of +the 'gawk' distribution). His code finally became part of the main +'gawk' distribution with 'gawk' version 3.1. + + John Haque rewrote the 'gawk' internals, in the process providing an +'awk'-level debugger. This version became available as 'gawk' version +4.0 in 2011. + + *Note Contributors:: for a full list of those who have made important +contributions to 'gawk'. + + +File: gawk.info, Node: Names, Next: This Manual, Prev: History, Up: Preface + +A Rose by Any Other Name +======================== + +The 'awk' language has evolved over the years. Full details are +provided in *note Language History::. The language described in this +Info file is often referred to as "new 'awk'." By analogy, the original +version of 'awk' is referred to as "old 'awk'." + + On most current systems, when you run the 'awk' utility you get some +version of new 'awk'.(1) If your system's standard 'awk' is the old +one, you will see something like this if you try the test program: + + $ awk 1 /dev/null + error-> awk: syntax error near line 1 + error-> awk: bailing out near line 1 + +In this case, you should find a version of new 'awk', or just install +'gawk'! + + Throughout this Info file, whenever we refer to a language feature +that should be available in any complete implementation of POSIX 'awk', +we simply use the term 'awk'. When referring to a feature that is +specific to the GNU implementation, we use the term 'gawk'. + + ---------- Footnotes ---------- + + (1) Only Solaris systems still use an old 'awk' for the default 'awk' +utility. A more modern 'awk' lives in '/usr/xpg6/bin' on these systems. + + +File: gawk.info, Node: This Manual, Next: Conventions, Prev: Names, Up: Preface + +Using This Book +=============== + +The term 'awk' refers to a particular program as well as to the language +you use to tell this program what to do. When we need to be careful, we +call the language "the 'awk' language," and the program "the 'awk' +utility." This Info file explains both how to write programs in the +'awk' language and how to run the 'awk' utility. The term "'awk' +program" refers to a program written by you in the 'awk' programming +language. + + Primarily, this Info file explains the features of 'awk' as defined +in the POSIX standard. It does so in the context of the 'gawk' +implementation. While doing so, it also attempts to describe important +differences between 'gawk' and other 'awk' implementations.(1) Finally, +it notes any 'gawk' features that are not in the POSIX standard for +'awk'. + + There are sidebars scattered throughout the Info file. They add a +more complete explanation of points that are relevant, but not likely to +be of interest on first reading. All appear in the index, under the +heading "sidebar." + + Most of the time, the examples use complete 'awk' programs. 
Some of +the more advanced minor nodes show only the part of the 'awk' program +that illustrates the concept being described. + + Although this Info file is aimed principally at people who have not +been exposed to 'awk', there is a lot of information here that even the +'awk' expert should find useful. In particular, the description of +POSIX 'awk' and the example programs in *note Library Functions::, and +in *note Sample Programs::, should be of interest. + + This Info file is split into several parts, as follows: + + * Part I describes the 'awk' language and the 'gawk' program in + detail. It starts with the basics, and continues through all of + the features of 'awk'. It contains the following chapters: + + - *note Getting Started::, provides the essentials you need to + know to begin using 'awk'. + + - *note Invoking Gawk::, describes how to run 'gawk', the + meaning of its command-line options, and how it finds 'awk' + program source files. + + - *note Regexp::, introduces regular expressions in general, and + in particular the flavors supported by POSIX 'awk' and 'gawk'. + + - *note Reading Files::, describes how 'awk' reads your data. + It introduces the concepts of records and fields, as well as + the 'getline' command. I/O redirection is first described + here. Network I/O is also briefly introduced here. + + - *note Printing::, describes how 'awk' programs can produce + output with 'print' and 'printf'. + + - *note Expressions::, describes expressions, which are the + basic building blocks for getting most things done in a + program. + + - *note Patterns and Actions::, describes how to write patterns + for matching records, actions for doing something when a + record is matched, and the predefined variables 'awk' and + 'gawk' use. + + - *note Arrays::, covers 'awk''s one-and-only data structure: + the associative array. Deleting array elements and whole + arrays is described, as well as sorting arrays in 'gawk'. The + major node also describes how 'gawk' provides arrays of + arrays. + + - *note Functions::, describes the built-in functions 'awk' and + 'gawk' provide, as well as how to define your own functions. + It also discusses how 'gawk' lets you call functions + indirectly. + + * Part II shows how to use 'awk' and 'gawk' for problem solving. + There is lots of code here for you to read and learn from. This + part contains the following chapters: + + - *note Library Functions::, provides a number of functions + meant to be used from main 'awk' programs. + + - *note Sample Programs::, provides many sample 'awk' programs. + + Reading these two chapters allows you to see 'awk' solving real + problems. + + * Part III focuses on features specific to 'gawk'. It contains the + following chapters: + + - *note Advanced Features::, describes a number of advanced + features. Of particular note are the abilities to control the + order of array traversal, have two-way communications with + another process, perform TCP/IP networking, and profile your + 'awk' programs. + + - *note Internationalization::, describes special features for + translating program messages into different languages at + runtime. + + - *note Debugger::, describes the 'gawk' debugger. + + - *note Arbitrary Precision Arithmetic::, describes advanced + arithmetic facilities. + + - *note Dynamic Extensions::, describes how to add new variables + and functions to 'gawk' by writing extensions in C or C++. 
+ + * Part IV provides the appendices, the Glossary, and two licenses + that cover the 'gawk' source code and this Info file, respectively. + It contains the following appendices: + + - *note Language History::, describes how the 'awk' language has + evolved since its first release to the present. It also + describes how 'gawk' has acquired features over time. + + - *note Installation::, describes how to get 'gawk', how to + compile it on POSIX-compatible systems, and how to compile and + use it on different non-POSIX systems. It also describes how + to report bugs in 'gawk' and where to get other freely + available 'awk' implementations. + + - *note Notes::, describes how to disable 'gawk''s extensions, + as well as how to contribute new code to 'gawk', and some + possible future directions for 'gawk' development. + + - *note Basic Concepts::, provides some very cursory background + material for those who are completely unfamiliar with computer + programming. + + The *note Glossary::, defines most, if not all, of the + significant terms used throughout the Info file. If you find + terms that you aren't familiar with, try looking them up here. + + - *note Copying::, and *note GNU Free Documentation License::, + present the licenses that cover the 'gawk' source code and + this Info file, respectively. + + ---------- Footnotes ---------- + + (1) All such differences appear in the index under the entry +"differences in 'awk' and 'gawk'." + + +File: gawk.info, Node: Conventions, Next: Manual History, Prev: This Manual, Up: Preface + +Typographical Conventions +========================= + +This Info file is written in Texinfo +(http://www.gnu.org/software/texinfo/), the GNU documentation formatting +language. A single Texinfo source file is used to produce both the +printed and online versions of the documentation. This minor node +briefly documents the typographical conventions used in Texinfo. + + Examples you would type at the command line are preceded by the +common shell primary and secondary prompts, '$' and '>'. Input that you +type is shown 'like this'. Output from the command is preceded by the +glyph "-|". This typically represents the command's standard output. +Error messages and other output on the command's standard error are +preceded by the glyph "error->". For example: + + $ echo hi on stdout + -| hi on stdout + $ echo hello on stderr 1>&2 + error-> hello on stderr + + Characters that you type at the keyboard look 'like this'. In +particular, there are special characters called "control characters." +These are characters that you type by holding down both the 'CONTROL' +key and another key, at the same time. For example, a 'Ctrl-d' is typed +by first pressing and holding the 'CONTROL' key, next pressing the 'd' +key, and finally releasing both keys. + + For the sake of brevity, throughout this Info file, we refer to Brian +Kernighan's version of 'awk' as "BWK 'awk'." (*Note Other Versions:: +for information on his and other versions.) + +Dark Corners +------------ + + Dark corners are basically fractal--no matter how much you + illuminate, there's always a smaller but darker one. + -- _Brian Kernighan_ + + Until the POSIX standard (and 'GAWK: Effective AWK Programming'), +many features of 'awk' were either poorly documented or not documented +at all. Descriptions of such features (often called "dark corners") are +noted in this Info file with "(d.c.)." They also appear in the index +under the heading "dark corner." 
+ + But, as noted by the opening quote, any coverage of dark corners is +by definition incomplete. + + Extensions to the standard 'awk' language that are supported by more +than one 'awk' implementation are marked "(c.e.)," and listed in the +index under "common extensions" and "extensions, common." + + +File: gawk.info, Node: Manual History, Next: How To Contribute, Prev: Conventions, Up: Preface + +The GNU Project and This Book +============================= + +The Free Software Foundation (FSF) is a nonprofit organization dedicated +to the production and distribution of freely distributable software. It +was founded by Richard M. Stallman, the author of the original Emacs +editor. GNU Emacs is the most widely used version of Emacs today. + + The GNU(1) Project is an ongoing effort on the part of the Free +Software Foundation to create a complete, freely distributable, +POSIX-compliant computing environment. The FSF uses the GNU General +Public License (GPL) to ensure that its software's source code is always +available to the end user. A copy of the GPL is included for your +reference (*note Copying::). The GPL applies to the C language source +code for 'gawk'. To find out more about the FSF and the GNU Project +online, see the GNU Project's home page (http://www.gnu.org). This Info +file may also be read from GNU's website +(http://www.gnu.org/software/gawk/manual/). + + A shell, an editor (Emacs), highly portable optimizing C, C++, and +Objective-C compilers, a symbolic debugger and dozens of large and small +utilities (such as 'gawk'), have all been completed and are freely +available. The GNU operating system kernel (the HURD), has been +released but remains in an early stage of development. + + Until the GNU operating system is more fully developed, you should +consider using GNU/Linux, a freely distributable, Unix-like operating +system for Intel, Power Architecture, Sun SPARC, IBM S/390, and other +systems.(2) Many GNU/Linux distributions are available for download +from the Internet. + + The Info file itself has gone through multiple previous editions. +Paul Rubin wrote the very first draft of 'The GAWK Manual'; it was +around 40 pages long. Diane Close and Richard Stallman improved it, +yielding a version that was around 90 pages and barely described the +original, "old" version of 'awk'. + + I started working with that version in the fall of 1988. As work on +it progressed, the FSF published several preliminary versions (numbered +0.X). In 1996, edition 1.0 was released with 'gawk' 3.0.0. The FSF +published the first two editions under the title 'The GNU Awk User's +Guide'. + + This edition maintains the basic structure of the previous editions. +For FSF edition 4.0, the content was thoroughly reviewed and updated. +All references to 'gawk' versions prior to 4.0 were removed. Of +significant note for that edition was the addition of *note Debugger::. + + For FSF edition 4.1, the content has been reorganized into parts, and +the major new additions are *note Arbitrary Precision Arithmetic::, and +*note Dynamic Extensions::. + + This Info file will undoubtedly continue to evolve. If you find an +error in the Info file, please report it! *Note Bugs:: for information +on submitting problem reports electronically. + + ---------- Footnotes ---------- + + (1) GNU stands for "GNU's Not Unix." + + (2) The terminology "GNU/Linux" is explained in the *note Glossary::. 
+ + +File: gawk.info, Node: How To Contribute, Next: Acknowledgments, Prev: Manual History, Up: Preface + +How to Contribute +================= + +As the maintainer of GNU 'awk', I once thought that I would be able to +manage a collection of publicly available 'awk' programs and I even +solicited contributions. Making things available on the Internet helps +keep the 'gawk' distribution down to manageable size. + + The initial collection of material, such as it is, is still available +at <ftp://ftp.freefriends.org/arnold/Awkstuff>. In the hopes of doing +something more broad, I acquired the 'awk.info' domain. + + However, I found that I could not dedicate enough time to managing +contributed code: the archive did not grow and the domain went unused +for several years. + + Late in 2008, a volunteer took on the task of setting up an +'awk'-related website--<http://awk.info>--and did a very nice job. + + If you have written an interesting 'awk' program, or have written a +'gawk' extension that you would like to share with the rest of the +world, please see <http://awk.info/?contribute> for how to contribute it +to the website. + + +File: gawk.info, Node: Acknowledgments, Prev: How To Contribute, Up: Preface + +Acknowledgments +=============== + +The initial draft of 'The GAWK Manual' had the following +acknowledgments: + + Many people need to be thanked for their assistance in producing + this manual. Jay Fenlason contributed many ideas and sample + programs. Richard Mlynarik and Robert Chassell gave helpful + comments on drafts of this manual. The paper 'A Supplemental + Document for AWK' by John W. Pierce of the Chemistry Department at + UC San Diego, pinpointed several issues relevant both to 'awk' + implementation and to this manual, that would otherwise have + escaped us. + + I would like to acknowledge Richard M. Stallman, for his vision of a +better world and for his courage in founding the FSF and starting the +GNU Project. + + Earlier editions of this Info file had the following +acknowledgements: + + The following people (in alphabetical order) provided helpful + comments on various versions of this book: Rick Adams, Dr. Nelson + H.F. Beebe, Karl Berry, Dr. Michael Brennan, Rich Burridge, Claire + Cloutier, Diane Close, Scott Deifik, Christopher ("Topher") Eliot, + Jeffrey Friedl, Dr. Darrel Hankerson, Michal Jaegermann, Dr. + Richard J. LeBlanc, Michael Lijewski, Pat Rankin, Miriam Robbins, + Mary Sheehan, and Chuck Toporek. + + Robert J. Chassell provided much valuable advice on the use of + Texinfo. He also deserves special thanks for convincing me _not_ + to title this Info file 'How to Gawk Politely'. Karl Berry helped + significantly with the TeX part of Texinfo. + + I would like to thank Marshall and Elaine Hartholz of Seattle and + Dr. Bert and Rita Schreiber of Detroit for large amounts of quiet + vacation time in their homes, which allowed me to make significant + progress on this Info file and on 'gawk' itself. + + Phil Hughes of SSC contributed in a very important way by loaning + me his laptop GNU/Linux system, not once, but twice, which allowed + me to do a lot of work while away from home. + + David Trueman deserves special credit; he has done a yeoman job of + evolving 'gawk' so that it performs well and without bugs. + Although he is no longer involved with 'gawk', working with him on + this project was a significant pleasure. 
+ + The intrepid members of the GNITS mailing list, and most notably + Ulrich Drepper, provided invaluable help and feedback for the + design of the internationalization features. + + Chuck Toporek, Mary Sheehan, and Claire Cloutier of O'Reilly & + Associates contributed significant editorial help for this Info + file for the 3.1 release of 'gawk'. + + Dr. Nelson Beebe, Andreas Buening, Dr. Manuel Collado, Antonio +Colombo, Stephen Davies, Scott Deifik, Akim Demaille, Daniel Richard G., +Darrel Hankerson, Michal Jaegermann, Ju"rgen Kahrs, Stepan Kasal, John +Malmberg, Dave Pitts, Chet Ramey, Pat Rankin, Andrew Schorr, Corinna +Vinschen, and Eli Zaretskii (in alphabetical order) make up the current +'gawk' "crack portability team." Without their hard work and help, +'gawk' would not be nearly the robust, portable program it is today. It +has been and continues to be a pleasure working with this team of fine +people. + + Notable code and documentation contributions were made by a number of +people. *Note Contributors:: for the full list. + + Thanks to Michael Brennan for the Forewords. + + Thanks to Patrice Dumas for the new 'makeinfo' program. Thanks to +Karl Berry, who continues to work to keep the Texinfo markup language +sane. + + Robert P.J. Day, Michael Brennan, and Brian Kernighan kindly acted as +reviewers for the 2015 edition of this Info file. Their feedback helped +improve the final work. + + I would also like to thank Brian Kernighan for his invaluable +assistance during the testing and debugging of 'gawk', and for his +ongoing help and advice in clarifying numerous points about the +language. We could not have done nearly as good a job on either 'gawk' +or its documentation without his help. + + Brian is in a class by himself as a programmer and technical author. +I have to thank him (yet again) for his ongoing friendship and for being +a role model to me for close to 30 years! Having him as a reviewer is +an exciting privilege. It has also been extremely humbling... + + I must thank my wonderful wife, Miriam, for her patience through the +many versions of this project, for her proofreading, and for sharing me +with the computer. I would like to thank my parents for their love, and +for the grace with which they raised and educated me. Finally, I also +must acknowledge my gratitude to G-d, for the many opportunities He has +sent my way, as well as for the gifts He has given me with which to take +advantage of those opportunities. + + +Arnold Robbins +Nof Ayalon +Israel +February 2015 + + +File: gawk.info, Node: Getting Started, Next: Invoking Gawk, Prev: Preface, Up: Top + +1 Getting Started with 'awk' +**************************** + +The basic function of 'awk' is to search files for lines (or other units +of text) that contain certain patterns. When a line matches one of the +patterns, 'awk' performs specified actions on that line. 'awk' +continues to process input lines in this way until it reaches the end of +the input files. + + Programs in 'awk' are different from programs in most other +languages, because 'awk' programs are "data driven" (i.e., you describe +the data you want to work with and then what to do when you find it). +Most other languages are "procedural"; you have to describe, in great +detail, every step the program should take. When working with +procedural languages, it is usually much harder to clearly describe the +data your program will process. For this reason, 'awk' programs are +often refreshingly easy to read and write. 
+ + When you run 'awk', you specify an 'awk' "program" that tells 'awk' +what to do. The program consists of a series of "rules" (it may also +contain "function definitions", an advanced feature that we will ignore +for now; *note User-defined::). Each rule specifies one pattern to +search for and one action to perform upon finding the pattern. + + Syntactically, a rule consists of a "pattern" followed by an +"action". The action is enclosed in braces to separate it from the +pattern. Newlines usually separate rules. Therefore, an 'awk' program +looks like this: + + PATTERN { ACTION } + PATTERN { ACTION } + ... + +* Menu: + +* Running gawk:: How to run 'gawk' programs; includes + command-line syntax. +* Sample Data Files:: Sample data files for use in the 'awk' + programs illustrated in this Info file. +* Very Simple:: A very simple example. +* Two Rules:: A less simple one-line example using two + rules. +* More Complex:: A more complex example. +* Statements/Lines:: Subdividing or combining statements into + lines. +* Other Features:: Other Features of 'awk'. +* When:: When to use 'gawk' and when to use + other things. +* Intro Summary:: Summary of the introduction. + + +File: gawk.info, Node: Running gawk, Next: Sample Data Files, Up: Getting Started + +1.1 How to Run 'awk' Programs +============================= + +There are several ways to run an 'awk' program. If the program is +short, it is easiest to include it in the command that runs 'awk', like +this: + + awk 'PROGRAM' INPUT-FILE1 INPUT-FILE2 ... + + When the program is long, it is usually more convenient to put it in +a file and run it with a command like this: + + awk -f PROGRAM-FILE INPUT-FILE1 INPUT-FILE2 ... + + This minor node discusses both mechanisms, along with several +variations of each. + +* Menu: + +* One-shot:: Running a short throwaway 'awk' + program. +* Read Terminal:: Using no input files (input from the keyboard + instead). +* Long:: Putting permanent 'awk' programs in + files. +* Executable Scripts:: Making self-contained 'awk' programs. +* Comments:: Adding documentation to 'gawk' + programs. +* Quoting:: More discussion of shell quoting issues. + + +File: gawk.info, Node: One-shot, Next: Read Terminal, Up: Running gawk + +1.1.1 One-Shot Throwaway 'awk' Programs +--------------------------------------- + +Once you are familiar with 'awk', you will often type in simple programs +the moment you want to use them. Then you can write the program as the +first argument of the 'awk' command, like this: + + awk 'PROGRAM' INPUT-FILE1 INPUT-FILE2 ... + +where PROGRAM consists of a series of patterns and actions, as described +earlier. + + This command format instructs the "shell", or command interpreter, to +start 'awk' and use the PROGRAM to process records in the input file(s). +There are single quotes around PROGRAM so the shell won't interpret any +'awk' characters as special shell characters. The quotes also cause the +shell to treat all of PROGRAM as a single argument for 'awk', and allow +PROGRAM to be more than one line long. + + This format is also useful for running short or medium-sized 'awk' +programs from shell scripts, because it avoids the need for a separate +file for the 'awk' program. A self-contained shell script is more +reliable because there are no other files to misplace. + + Later in this chapter, in *note Very Simple::, we'll see examples of +several short, self-contained programs. 
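+
+   In the meantime, here is one complete instance of this format (the
+search string and the file names are placeholders, chosen only for
+illustration):
+
+     awk '/widget/ { print $0 }' notes.txt todo.txt    # placeholders
+
+This command prints every line of 'notes.txt' and 'todo.txt' that
+contains the string 'widget'.  The single quotes keep the shell from
+treating anything in the program as special and deliver the whole
+program to 'awk' as a single argument.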
+ + +File: gawk.info, Node: Read Terminal, Next: Long, Prev: One-shot, Up: Running gawk + +1.1.2 Running 'awk' Without Input Files +--------------------------------------- + +You can also run 'awk' without any input files. If you type the +following command line: + + awk 'PROGRAM' + +'awk' applies the PROGRAM to the "standard input", which usually means +whatever you type on the keyboard. This continues until you indicate +end-of-file by typing 'Ctrl-d'. (On non-POSIX operating systems, the +end-of-file character may be different.) + + As an example, the following program prints a friendly piece of +advice (from Douglas Adams's 'The Hitchhiker's Guide to the Galaxy'), to +keep you from worrying about the complexities of computer programming: + + $ awk 'BEGIN { print "Don\47t Panic!" }' + -| Don't Panic! + + 'awk' executes statements associated with 'BEGIN' before reading any +input. If there are no other statements in your program, as is the case +here, 'awk' just stops, instead of trying to read input it doesn't know +how to process. The '\47' is a magic way (explained later) of getting a +single quote into the program, without having to engage in ugly shell +quoting tricks. + + NOTE: If you use Bash as your shell, you should execute the command + 'set +H' before running this program interactively, to disable the + C shell-style command history, which treats '!' as a special + character. We recommend putting this command into your personal + startup file. + + This next simple 'awk' program emulates the 'cat' utility; it copies +whatever you type on the keyboard to its standard output (why this works +is explained shortly): + + $ awk '{ print }' + Now is the time for all good men + -| Now is the time for all good men + to come to the aid of their country. + -| to come to the aid of their country. + Four score and seven years ago, ... + -| Four score and seven years ago, ... + What, me worry? + -| What, me worry? + Ctrl-d + + +File: gawk.info, Node: Long, Next: Executable Scripts, Prev: Read Terminal, Up: Running gawk + +1.1.3 Running Long Programs +--------------------------- + +Sometimes 'awk' programs are very long. In these cases, it is more +convenient to put the program into a separate file. In order to tell +'awk' to use that file for its program, you type: + + awk -f SOURCE-FILE INPUT-FILE1 INPUT-FILE2 ... + + The '-f' instructs the 'awk' utility to get the 'awk' program from +the file SOURCE-FILE (*note Options::). Any file name can be used for +SOURCE-FILE. For example, you could put the program: + + BEGIN { print "Don't Panic!" } + +into the file 'advice'. Then this command: + + awk -f advice + +does the same thing as this one: + + awk 'BEGIN { print "Don\47t Panic!" }' + +This was explained earlier (*note Read Terminal::). Note that you don't +usually need single quotes around the file name that you specify with +'-f', because most file names don't contain any of the shell's special +characters. Notice that in 'advice', the 'awk' program did not have +single quotes around it. The quotes are only needed for programs that +are provided on the 'awk' command line. (Also, placing the program in a +file allows us to use a literal single quote in the program text, +instead of the magic '\47'.) + + If you want to clearly identify an 'awk' program file as such, you +can add the extension '.awk' to the file name. This doesn't affect the +execution of the 'awk' program but it does make "housekeeping" easier. 
+ + +File: gawk.info, Node: Executable Scripts, Next: Comments, Prev: Long, Up: Running gawk + +1.1.4 Executable 'awk' Programs +------------------------------- + +Once you have learned 'awk', you may want to write self-contained 'awk' +scripts, using the '#!' script mechanism. You can do this on many +systems.(1) For example, you could update the file 'advice' to look +like this: + + #! /bin/awk -f + + BEGIN { print "Don't Panic!" } + +After making this file executable (with the 'chmod' utility), simply +type 'advice' at the shell and the system arranges to run 'awk' as if +you had typed 'awk -f advice': + + $ chmod +x advice + $ advice + -| Don't Panic! + +(We assume you have the current directory in your shell's search path +variable [typically '$PATH']. If not, you may need to type './advice' +at the shell.) + + Self-contained 'awk' scripts are useful when you want to write a +program that users can invoke without their having to know that the +program is written in 'awk'. + + Understanding '#!' + + 'awk' is an "interpreted" language. This means that the 'awk' +utility reads your program and then processes your data according to the +instructions in your program. (This is different from a "compiled" +language such as C, where your program is first compiled into machine +code that is executed directly by your system's processor.) The 'awk' +utility is thus termed an "interpreter". Many modern languages are +interpreted. + + The line beginning with '#!' lists the full file name of an +interpreter to run and a single optional initial command-line argument +to pass to that interpreter. The operating system then runs the +interpreter with the given argument and the full argument list of the +executed program. The first argument in the list is the full file name +of the 'awk' program. The rest of the argument list contains either +options to 'awk', or data files, or both. (Note that on many systems +'awk' may be found in '/usr/bin' instead of in '/bin'.) + + Some systems limit the length of the interpreter name to 32 +characters. Often, this can be dealt with by using a symbolic link. + + You should not put more than one argument on the '#!' line after the +path to 'awk'. It does not work. The operating system treats the rest +of the line as a single argument and passes it to 'awk'. Doing this +leads to confusing behavior--most likely a usage diagnostic of some sort +from 'awk'. + + Finally, the value of 'ARGV[0]' (*note Built-in Variables::) varies +depending upon your operating system. Some systems put 'awk' there, +some put the full pathname of 'awk' (such as '/bin/awk'), and some put +the name of your script ('advice'). (d.c.) Don't rely on the value of +'ARGV[0]' to provide your script name. + + ---------- Footnotes ---------- + + (1) The '#!' mechanism works on GNU/Linux systems, BSD-based systems, +and commercial Unix systems. + + +File: gawk.info, Node: Comments, Next: Quoting, Prev: Executable Scripts, Up: Running gawk + +1.1.5 Comments in 'awk' Programs +-------------------------------- + +A "comment" is some text that is included in a program for the sake of +human readers; it is not really an executable part of the program. +Comments can explain what the program does and how it works. Nearly all +programming languages have provisions for comments, as programs are +typically hard to understand without them. + + In the 'awk' language, a comment starts with the number sign +character ('#') and continues to the end of the line. 
The '#' does not +have to be the first character on the line. The 'awk' language ignores +the rest of a line following a number sign. For example, we could have +put the following into 'advice': + + # This program prints a nice, friendly message. It helps + # keep novice users from being afraid of the computer. + BEGIN { print "Don't Panic!" } + + You can put comment lines into keyboard-composed throwaway 'awk' +programs, but this usually isn't very useful; the purpose of a comment +is to help you or another person understand the program when reading it +at a later time. + + CAUTION: As mentioned in *note One-shot::, you can enclose short to + medium-sized programs in single quotes, in order to keep your shell + scripts self-contained. When doing so, _don't_ put an apostrophe + (i.e., a single quote) into a comment (or anywhere else in your + program). The shell interprets the quote as the closing quote for + the entire program. As a result, usually the shell prints a + message about mismatched quotes, and if 'awk' actually runs, it + will probably print strange messages about syntax errors. For + example, look at the following: + + $ awk 'BEGIN { print "hello" } # let's be cute' + > + + The shell sees that the first two quotes match, and that a new + quoted object begins at the end of the command line. It therefore + prompts with the secondary prompt, waiting for more input. With + Unix 'awk', closing the quoted string produces this result: + + $ awk '{ print "hello" } # let's be cute' + > ' + error-> awk: can't open file be + error-> source line number 1 + + Putting a backslash before the single quote in 'let's' wouldn't + help, because backslashes are not special inside single quotes. + The next node describes the shell's quoting rules. + + +File: gawk.info, Node: Quoting, Prev: Comments, Up: Running gawk + +1.1.6 Shell Quoting Issues +-------------------------- + +* Menu: + +* DOS Quoting:: Quoting in Windows Batch Files. + +For short to medium-length 'awk' programs, it is most convenient to +enter the program on the 'awk' command line. This is best done by +enclosing the entire program in single quotes. This is true whether you +are entering the program interactively at the shell prompt, or writing +it as part of a larger shell script: + + awk 'PROGRAM TEXT' INPUT-FILE1 INPUT-FILE2 ... + + Once you are working with the shell, it is helpful to have a basic +knowledge of shell quoting rules. The following rules apply only to +POSIX-compliant, Bourne-style shells (such as Bash, the GNU Bourne-Again +Shell). If you use the C shell, you're on your own. + + Before diving into the rules, we introduce a concept that appears +throughout this Info file, which is that of the "null", or empty, +string. + + The null string is character data that has no value. In other words, +it is empty. It is written in 'awk' programs like this: '""'. In the +shell, it can be written using single or double quotes: '""' or ''''. +Although the null string has no characters in it, it does exist. For +example, consider this command: + + $ echo "" + +Here, the 'echo' utility receives a single argument, even though that +argument has no characters in it. In the rest of this Info file, we use +the terms "null string" and "empty string" interchangeably. Now, on to +the quoting rules: + + * Quoted items can be concatenated with nonquoted items as well as + with other quoted items. The shell turns everything into one + argument for the command. 
+ + * Preceding any single character with a backslash ('\') quotes that + character. The shell removes the backslash and passes the quoted + character on to the command. + + * Single quotes protect everything between the opening and closing + quotes. The shell does no interpretation of the quoted text, + passing it on verbatim to the command. It is _impossible_ to embed + a single quote inside single-quoted text. Refer back to *note + Comments:: for an example of what happens if you try. + + * Double quotes protect most things between the opening and closing + quotes. The shell does at least variable and command substitution + on the quoted text. Different shells may do additional kinds of + processing on double-quoted text. + + Because certain characters within double-quoted text are processed + by the shell, they must be "escaped" within the text. Of note are + the characters '$', '`', '\', and '"', all of which must be + preceded by a backslash within double-quoted text if they are to be + passed on literally to the program. (The leading backslash is + stripped first.) Thus, the example seen in *note Read Terminal::: + + awk 'BEGIN { print "Don\47t Panic!" }' + + could instead be written this way: + + $ awk "BEGIN { print \"Don't Panic!\" }" + -| Don't Panic! + + Note that the single quote is not special within double quotes. + + * Null strings are removed when they occur as part of a non-null + command-line argument, while explicit null objects are kept. For + example, to specify that the field separator 'FS' should be set to + the null string, use: + + awk -F "" 'PROGRAM' FILES # correct + + Don't use this: + + awk -F"" 'PROGRAM' FILES # wrong! + + In the second case, 'awk' attempts to use the text of the program + as the value of 'FS', and the first file name as the text of the + program! This results in syntax errors at best, and confusing + behavior at worst. + + Mixing single and double quotes is difficult. You have to resort to +shell quoting tricks, like this: + + $ awk 'BEGIN { print "Here is a single quote <'"'"'>" }' + -| Here is a single quote <'> + +This program consists of three concatenated quoted strings. The first +and the third are single-quoted, and the second is double-quoted. + + This can be "simplified" to: + + $ awk 'BEGIN { print "Here is a single quote <'\''>" }' + -| Here is a single quote <'> + +Judge for yourself which of these two is the more readable. + + Another option is to use double quotes, escaping the embedded, +'awk'-level double quotes: + + $ awk "BEGIN { print \"Here is a single quote <'>\" }" + -| Here is a single quote <'> + +This option is also painful, because double quotes, backslashes, and +dollar signs are very common in more advanced 'awk' programs. + + A third option is to use the octal escape sequence equivalents (*note +Escape Sequences::) for the single- and double-quote characters, like +so: + + $ awk 'BEGIN { print "Here is a single quote <\47>" }' + -| Here is a single quote <'> + $ awk 'BEGIN { print "Here is a double quote <\42>" }' + -| Here is a double quote <"> + +This works nicely, but you should comment clearly what the escapes mean. + + A fourth option is to use command-line variable assignment, like +this: + + $ awk -v sq="'" 'BEGIN { print "Here is a single quote <" sq ">" }' + -| Here is a single quote <'> + + (Here, the two string constants and the value of 'sq' are +concatenated into a single string that is printed by 'print'.) 
+ + If you really need both single and double quotes in your 'awk' +program, it is probably best to move it into a separate file, where the +shell won't be part of the picture and you can say what you mean. + + +File: gawk.info, Node: DOS Quoting, Up: Quoting + +1.1.6.1 Quoting in MS-Windows Batch Files +......................................... + +Although this Info file generally only worries about POSIX systems and +the POSIX shell, the following issue arises often enough for many users +that it is worth addressing. + + The "shells" on Microsoft Windows systems use the double-quote +character for quoting, and make it difficult or impossible to include an +escaped double-quote character in a command-line script. The following +example, courtesy of Jeroen Brink, shows how to print all lines in a +file surrounded by double quotes: + + gawk "{ print \"\042\" $0 \"\042\" }" FILE + + +File: gawk.info, Node: Sample Data Files, Next: Very Simple, Prev: Running gawk, Up: Getting Started + +1.2 Data files for the Examples +=============================== + +Many of the examples in this Info file take their input from two sample +data files. The first, 'mail-list', represents a list of peoples' names +together with their email addresses and information about those people. +The second data file, called 'inventory-shipped', contains information +about monthly shipments. In both files, each line is considered to be +one "record". + + In 'mail-list', each record contains the name of a person, his/her +phone number, his/her email address, and a code for his/her relationship +with the author of the list. The columns are aligned using spaces. An +'A' in the last column means that the person is an acquaintance. An 'F' +in the last column means that the person is a friend. An 'R' means that +the person is a relative: + + Amelia 555-5553 amelia.zodiacusque@gmail.com F + Anthony 555-3412 anthony.asserturo@hotmail.com A + Becky 555-7685 becky.algebrarum@gmail.com A + Bill 555-1675 bill.drowning@hotmail.com A + Broderick 555-0542 broderick.aliquotiens@yahoo.com R + Camilla 555-2912 camilla.infusarum@skynet.be R + Fabius 555-1234 fabius.undevicesimus@ucb.edu F + Julie 555-6699 julie.perscrutabor@skeeve.com F + Martin 555-6480 martin.codicibus@hotmail.com A + Samuel 555-3430 samuel.lanceolis@shu.edu A + Jean-Paul 555-2127 jeanpaul.campanorum@nyu.edu R + + The data file 'inventory-shipped' represents information about +shipments during the year. Each record contains the month, the number +of green crates shipped, the number of red boxes shipped, the number of +orange bags shipped, and the number of blue packages shipped, +respectively. There are 16 entries, covering the 12 months of last year +and the first four months of the current year. An empty line separates +the data for the two years: + + Jan 13 25 15 115 + Feb 15 32 24 226 + Mar 15 24 34 228 + Apr 31 52 63 420 + May 16 34 29 208 + Jun 31 42 75 492 + Jul 24 34 67 436 + Aug 15 34 47 316 + Sep 13 55 37 277 + Oct 29 54 68 525 + Nov 20 87 82 577 + Dec 17 35 61 401 + + Jan 21 36 64 620 + Feb 26 58 80 652 + Mar 24 75 70 495 + Apr 21 70 74 514 + + The sample files are included in the 'gawk' distribution, in the +directory 'awklib/eg/data'. 
+ + +File: gawk.info, Node: Very Simple, Next: Two Rules, Prev: Sample Data Files, Up: Getting Started + +1.3 Some Simple Examples +======================== + +The following command runs a simple 'awk' program that searches the +input file 'mail-list' for the character string 'li' (a grouping of +characters is usually called a "string"; the term "string" is based on +similar usage in English, such as "a string of pearls" or "a string of +cars in a train"): + + awk '/li/ { print $0 }' mail-list + +When lines containing 'li' are found, they are printed because +'print $0' means print the current line. (Just 'print' by itself means +the same thing, so we could have written that instead.) + + You will notice that slashes ('/') surround the string 'li' in the +'awk' program. The slashes indicate that 'li' is the pattern to search +for. This type of pattern is called a "regular expression", which is +covered in more detail later (*note Regexp::). The pattern is allowed +to match parts of words. There are single quotes around the 'awk' +program so that the shell won't interpret any of it as special shell +characters. + + Here is what this program prints: + + $ awk '/li/ { print $0 }' mail-list + -| Amelia 555-5553 amelia.zodiacusque@gmail.com F + -| Broderick 555-0542 broderick.aliquotiens@yahoo.com R + -| Julie 555-6699 julie.perscrutabor@skeeve.com F + -| Samuel 555-3430 samuel.lanceolis@shu.edu A + + In an 'awk' rule, either the pattern or the action can be omitted, +but not both. If the pattern is omitted, then the action is performed +for _every_ input line. If the action is omitted, the default action is +to print all lines that match the pattern. + + Thus, we could leave out the action (the 'print' statement and the +braces) in the previous example and the result would be the same: 'awk' +prints all lines matching the pattern 'li'. By comparison, omitting the +'print' statement but retaining the braces makes an empty action that +does nothing (i.e., no lines are printed). + + Many practical 'awk' programs are just a line or two long. Following +is a collection of useful, short programs to get you started. Some of +these programs contain constructs that haven't been covered yet. (The +description of the program will give you a good idea of what is going +on, but you'll need to read the rest of the Info file to become an 'awk' +expert!) Most of the examples use a data file named 'data'. This is +just a placeholder; if you use these programs yourself, substitute your +own file names for 'data'. For future reference, note that there is +often more than one way to do things in 'awk'. At some point, you may +want to look back at these examples and see if you can come up with +different ways to do the same things shown here: + + * Print every line that is longer than 80 characters: + + awk 'length($0) > 80' data + + The sole rule has a relational expression as its pattern and has no + action--so it uses the default action, printing the record. + + * Print the length of the longest input line: + + awk '{ if (length($0) > max) max = length($0) } + END { print max }' data + + The code associated with 'END' executes after all input has been + read; it's the other side of the coin to 'BEGIN'. 
+ + * Print the length of the longest line in 'data': + + expand data | awk '{ if (x < length($0)) x = length($0) } + END { print "maximum line length is " x }' + + This example differs slightly from the previous one: the input is + processed by the 'expand' utility to change TABs into spaces, so + the widths compared are actually the right-margin columns, as + opposed to the number of input characters on each line. + + * Print every line that has at least one field: + + awk 'NF > 0' data + + This is an easy way to delete blank lines from a file (or rather, + to create a new file similar to the old file but from which the + blank lines have been removed). + + * Print seven random numbers from 0 to 100, inclusive: + + awk 'BEGIN { for (i = 1; i <= 7; i++) + print int(101 * rand()) }' + + * Print the total number of bytes used by FILES: + + ls -l FILES | awk '{ x += $5 } + END { print "total bytes: " x }' + + * Print the total number of kilobytes used by FILES: + + ls -l FILES | awk '{ x += $5 } + END { print "total K-bytes:", x / 1024 }' + + * Print a sorted list of the login names of all users: + + awk -F: '{ print $1 }' /etc/passwd | sort + + * Count the lines in a file: + + awk 'END { print NR }' data + + * Print the even-numbered lines in the data file: + + awk 'NR % 2 == 0' data + + If you used the expression 'NR % 2 == 1' instead, the program would + print the odd-numbered lines. + + +File: gawk.info, Node: Two Rules, Next: More Complex, Prev: Very Simple, Up: Getting Started + +1.4 An Example with Two Rules +============================= + +The 'awk' utility reads the input files one line at a time. For each +line, 'awk' tries the patterns of each rule. If several patterns match, +then several actions execute in the order in which they appear in the +'awk' program. If no patterns match, then no actions run. + + After processing all the rules that match the line (and perhaps there +are none), 'awk' reads the next line. (However, *note Next Statement:: +and also *note Nextfile Statement::.) This continues until the program +reaches the end of the file. For example, the following 'awk' program +contains two rules: + + /12/ { print $0 } + /21/ { print $0 } + +The first rule has the string '12' as the pattern and 'print $0' as the +action. The second rule has the string '21' as the pattern and also has +'print $0' as the action. Each rule's action is enclosed in its own +pair of braces. + + This program prints every line that contains the string '12' _or_ the +string '21'. If a line contains both strings, it is printed twice, once +by each rule. + + This is what happens if we run this program on our two sample data +files, 'mail-list' and 'inventory-shipped': + + $ awk '/12/ { print $0 } + > /21/ { print $0 }' mail-list inventory-shipped + -| Anthony 555-3412 anthony.asserturo@hotmail.com A + -| Camilla 555-2912 camilla.infusarum@skynet.be R + -| Fabius 555-1234 fabius.undevicesimus@ucb.edu F + -| Jean-Paul 555-2127 jeanpaul.campanorum@nyu.edu R + -| Jean-Paul 555-2127 jeanpaul.campanorum@nyu.edu R + -| Jan 21 36 64 620 + -| Apr 21 70 74 514 + +Note how the line beginning with 'Jean-Paul' in 'mail-list' was printed +twice, once for each rule. + + +File: gawk.info, Node: More Complex, Next: Statements/Lines, Prev: Two Rules, Up: Getting Started + +1.5 A More Complex Example +========================== + +Now that we've mastered some simple tasks, let's look at what typical +'awk' programs do. 
This example shows how 'awk' can be used to +summarize, select, and rearrange the output of another utility. It uses +features that haven't been covered yet, so don't worry if you don't +understand all the details: + + ls -l | awk '$6 == "Nov" { sum += $5 } + END { print sum }' + + This command prints the total number of bytes in all the files in the +current directory that were last modified in November (of any year). +The 'ls -l' part of this example is a system command that gives you a +listing of the files in a directory, including each file's size and the +date the file was last modified. Its output looks like this: + + -rw-r--r-- 1 arnold user 1933 Nov 7 13:05 Makefile + -rw-r--r-- 1 arnold user 10809 Nov 7 13:03 awk.h + -rw-r--r-- 1 arnold user 983 Apr 13 12:14 awk.tab.h + -rw-r--r-- 1 arnold user 31869 Jun 15 12:20 awkgram.y + -rw-r--r-- 1 arnold user 22414 Nov 7 13:03 awk1.c + -rw-r--r-- 1 arnold user 37455 Nov 7 13:03 awk2.c + -rw-r--r-- 1 arnold user 27511 Dec 9 13:07 awk3.c + -rw-r--r-- 1 arnold user 7989 Nov 7 13:03 awk4.c + +The first field contains read-write permissions, the second field +contains the number of links to the file, and the third field identifies +the file's owner. The fourth field identifies the file's group. The +fifth field contains the file's size in bytes. The sixth, seventh, and +eighth fields contain the month, day, and time, respectively, that the +file was last modified. Finally, the ninth field contains the file +name. + + The '$6 == "Nov"' in our 'awk' program is an expression that tests +whether the sixth field of the output from 'ls -l' matches the string +'Nov'. Each time a line has the string 'Nov' for its sixth field, 'awk' +performs the action 'sum += $5'. This adds the fifth field (the file's +size) to the variable 'sum'. As a result, when 'awk' has finished +reading all the input lines, 'sum' is the total of the sizes of the +files whose lines matched the pattern. (This works because 'awk' +variables are automatically initialized to zero.) + + After the last line of output from 'ls' has been processed, the 'END' +rule executes and prints the value of 'sum'. In this example, the value +of 'sum' is 80600. + + These more advanced 'awk' techniques are covered in later minor nodes +(*note Action Overview::). Before you can move on to more advanced +'awk' programming, you have to know how 'awk' interprets your input and +displays your output. By manipulating fields and using 'print' +statements, you can produce some very useful and impressive-looking +reports. + + +File: gawk.info, Node: Statements/Lines, Next: Other Features, Prev: More Complex, Up: Getting Started + +1.6 'awk' Statements Versus Lines +================================= + +Most often, each line in an 'awk' program is a separate statement or +separate rule, like this: + + awk '/12/ { print $0 } + /21/ { print $0 }' mail-list inventory-shipped + + However, 'gawk' ignores newlines after any of the following symbols +and keywords: + + , { ? : || && do else + +A newline at any other point is considered the end of the statement.(1) + + If you would like to split a single statement into two lines at a +point where a newline would terminate it, you can "continue" it by +ending the first line with a backslash character ('\'). The backslash +must be the final character on the line in order to be recognized as a +continuation character. A backslash is allowed anywhere in the +statement, even in the middle of a string or regular expression. 
For +example: + + awk '/This regular expression is too long, so continue it\ + on the next line/ { print $1 }' + +We have generally not used backslash continuation in our sample +programs. 'gawk' places no limit on the length of a line, so backslash +continuation is never strictly necessary; it just makes programs more +readable. For this same reason, as well as for clarity, we have kept +most statements short in the programs presented throughout the Info +file. Backslash continuation is most useful when your 'awk' program is +in a separate source file instead of entered from the command line. You +should also note that many 'awk' implementations are more particular +about where you may use backslash continuation. For example, they may +not allow you to split a string constant using backslash continuation. +Thus, for maximum portability of your 'awk' programs, it is best not to +split your lines in the middle of a regular expression or a string. + + CAUTION: _Backslash continuation does not work as described with + the C shell._ It works for 'awk' programs in files and for + one-shot programs, _provided_ you are using a POSIX-compliant + shell, such as the Unix Bourne shell or Bash. But the C shell + behaves differently! There you must use two backslashes in a row, + followed by a newline. Note also that when using the C shell, + _every_ newline in your 'awk' program must be escaped with a + backslash. To illustrate: + + % awk 'BEGIN { \ + ? print \\ + ? "hello, world" \ + ? }' + -| hello, world + + Here, the '%' and '?' are the C shell's primary and secondary + prompts, analogous to the standard shell's '$' and '>'. + + Compare the previous example to how it is done with a + POSIX-compliant shell: + + $ awk 'BEGIN { + > print \ + > "hello, world" + > }' + -| hello, world + + 'awk' is a line-oriented language. Each rule's action has to begin +on the same line as the pattern. To have the pattern and action on +separate lines, you _must_ use backslash continuation; there is no other +option. + + Another thing to keep in mind is that backslash continuation and +comments do not mix. As soon as 'awk' sees the '#' that starts a +comment, it ignores _everything_ on the rest of the line. For example: + + $ gawk 'BEGIN { print "dont panic" # a friendly \ + > BEGIN rule + > }' + error-> gawk: cmd. line:2: BEGIN rule + error-> gawk: cmd. line:2: ^ syntax error + +In this case, it looks like the backslash would continue the comment +onto the next line. However, the backslash-newline combination is never +even noticed because it is "hidden" inside the comment. Thus, the +'BEGIN' is noted as a syntax error. + + When 'awk' statements within one rule are short, you might want to +put more than one of them on a line. This is accomplished by separating +the statements with a semicolon (';'). This also applies to the rules +themselves. Thus, the program shown at the start of this minor node +could also be written this way: + + /12/ { print $0 } ; /21/ { print $0 } + + NOTE: The requirement that states that rules on the same line must + be separated with a semicolon was not in the original 'awk' + language; it was added for consistency with the treatment of + statements within an action. + + ---------- Footnotes ---------- + + (1) The '?' and ':' referred to here is the three-operand conditional +expression described in *note Conditional Exp::. Splitting lines after +'?' and ':' is a minor 'gawk' extension; if '--posix' is specified +(*note Options::), then this extension is disabled. 
+ + +File: gawk.info, Node: Other Features, Next: When, Prev: Statements/Lines, Up: Getting Started + +1.7 Other Features of 'awk' +=========================== + +The 'awk' language provides a number of predefined, or "built-in", +variables that your programs can use to get information from 'awk'. +There are other variables your program can set as well to control how +'awk' processes your data. + + In addition, 'awk' provides a number of built-in functions for doing +common computational and string-related operations. 'gawk' provides +built-in functions for working with timestamps, performing bit +manipulation, for runtime string translation (internationalization), +determining the type of a variable, and array sorting. + + As we develop our presentation of the 'awk' language, we will +introduce most of the variables and many of the functions. They are +described systematically in *note Built-in Variables:: and in *note +Built-in::. + + +File: gawk.info, Node: When, Next: Intro Summary, Prev: Other Features, Up: Getting Started + +1.8 When to Use 'awk' +===================== + +Now that you've seen some of what 'awk' can do, you might wonder how +'awk' could be useful for you. By using utility programs, advanced +patterns, field separators, arithmetic statements, and other selection +criteria, you can produce much more complex output. The 'awk' language +is very useful for producing reports from large amounts of raw data, +such as summarizing information from the output of other utility +programs like 'ls'. (*Note More Complex::.) + + Programs written with 'awk' are usually much smaller than they would +be in other languages. This makes 'awk' programs easy to compose and +use. Often, 'awk' programs can be quickly composed at your keyboard, +used once, and thrown away. Because 'awk' programs are interpreted, you +can avoid the (usually lengthy) compilation part of the typical +edit-compile-test-debug cycle of software development. + + Complex programs have been written in 'awk', including a complete +retargetable assembler for eight-bit microprocessors (*note Glossary::, +for more information), and a microcode assembler for a special-purpose +Prolog computer. The original 'awk''s capabilities were strained by +tasks of such complexity, but modern versions are more capable. + + If you find yourself writing 'awk' scripts of more than, say, a few +hundred lines, you might consider using a different programming +language. The shell is good at string and pattern matching; in +addition, it allows powerful use of the system utilities. Python offers +a nice balance between high-level ease of programming and access to +system facilities.(1) + + ---------- Footnotes ---------- + + (1) Other popular scripting languages include Ruby and Perl. + + +File: gawk.info, Node: Intro Summary, Prev: When, Up: Getting Started + +1.9 Summary +=========== + + * Programs in 'awk' consist of PATTERN-ACTION pairs. + + * An ACTION without a PATTERN always runs. The default ACTION for a + pattern without one is '{ print $0 }'. + + * Use either 'awk 'PROGRAM' FILES' or 'awk -f PROGRAM-FILE FILES' to + run 'awk'. + + * You may use the special '#!' header line to create 'awk' programs + that are directly executable. + + * Comments in 'awk' programs start with '#' and continue to the end + of the same line. + + * Be aware of quoting issues when writing 'awk' programs as part of a + larger shell script (or MS-Windows batch file). + + * You may use backslash continuation to continue a source line. 
+ Lines are automatically continued after a comma, open brace, + question mark, colon, '||', '&&', 'do', and 'else'. + + +File: gawk.info, Node: Invoking Gawk, Next: Regexp, Prev: Getting Started, Up: Top + +2 Running 'awk' and 'gawk' +************************** + +This major node covers how to run 'awk', both POSIX-standard and +'gawk'-specific command-line options, and what 'awk' and 'gawk' do with +nonoption arguments. It then proceeds to cover how 'gawk' searches for +source files, reading standard input along with other files, 'gawk''s +environment variables, 'gawk''s exit status, using include files, and +obsolete and undocumented options and/or features. + + Many of the options and features described here are discussed in more +detail later in the Info file; feel free to skip over things in this +major node that don't interest you right now. + +* Menu: + +* Command Line:: How to run 'awk'. +* Options:: Command-line options and their meanings. +* Other Arguments:: Input file names and variable assignments. +* Naming Standard Input:: How to specify standard input with other + files. +* Environment Variables:: The environment variables 'gawk' uses. +* Exit Status:: 'gawk''s exit status. +* Include Files:: Including other files into your program. +* Loading Shared Libraries:: Loading shared libraries into your program. +* Obsolete:: Obsolete Options and/or features. +* Undocumented:: Undocumented Options and Features. +* Invoking Summary:: Invocation summary. + + +File: gawk.info, Node: Command Line, Next: Options, Up: Invoking Gawk + +2.1 Invoking 'awk' +================== + +There are two ways to run 'awk'--with an explicit program or with one or +more program files. Here are templates for both of them; items enclosed +in [...] in these templates are optional: + + 'awk' [OPTIONS] '-f' PROGFILE ['--'] FILE ... + 'awk' [OPTIONS] ['--'] ''PROGRAM'' FILE ... + + In addition to traditional one-letter POSIX-style options, 'gawk' +also supports GNU long options. + + It is possible to invoke 'awk' with an empty program: + + awk '' datafile1 datafile2 + +Doing so makes little sense, though; 'awk' exits silently when given an +empty program. (d.c.) If '--lint' has been specified on the command +line, 'gawk' issues a warning that the program is empty. + + +File: gawk.info, Node: Options, Next: Other Arguments, Prev: Command Line, Up: Invoking Gawk + +2.2 Command-Line Options +======================== + +Options begin with a dash and consist of a single character. GNU-style +long options consist of two dashes and a keyword. The keyword can be +abbreviated, as long as the abbreviation allows the option to be +uniquely identified. If the option takes an argument, either the +keyword is immediately followed by an equals sign ('=') and the +argument's value, or the keyword and the argument's value are separated +by whitespace. If a particular option with a value is given more than +once, it is the last value that counts. + + Each long option for 'gawk' has a corresponding POSIX-style short +option. The long and short options are interchangeable in all contexts. +The following list describes options mandated by the POSIX standard: + +'-F FS' +'--field-separator FS' + Set the 'FS' variable to FS (*note Field Separators::). + +'-f SOURCE-FILE' +'--file SOURCE-FILE' + Read the 'awk' program source from SOURCE-FILE instead of in the + first nonoption argument. This option may be given multiple times; + the 'awk' program consists of the concatenation of the contents of + each specified SOURCE-FILE. 
+ +'-v VAR=VAL' +'--assign VAR=VAL' + Set the variable VAR to the value VAL _before_ execution of the + program begins. Such variable values are available inside the + 'BEGIN' rule (*note Other Arguments::). + + The '-v' option can only set one variable, but it can be used more + than once, setting another variable each time, like this: 'awk + -v foo=1 -v bar=2 ...'. + + CAUTION: Using '-v' to set the values of the built-in + variables may lead to surprising results. 'awk' will reset + the values of those variables as it needs to, possibly + ignoring any initial value you may have given. + +'-W GAWK-OPT' + Provide an implementation-specific option. This is the POSIX + convention for providing implementation-specific options. These + options also have corresponding GNU-style long options. Note that + the long options may be abbreviated, as long as the abbreviations + remain unique. The full list of 'gawk'-specific options is + provided next. + +'--' + Signal the end of the command-line options. The following + arguments are not treated as options even if they begin with '-'. + This interpretation of '--' follows the POSIX argument parsing + conventions. + + This is useful if you have file names that start with '-', or in + shell scripts, if you have file names that will be specified by the + user that could start with '-'. It is also useful for passing + options on to the 'awk' program; see *note Getopt Function::. + + The following list describes 'gawk'-specific options: + +'-b' +'--characters-as-bytes' + Cause 'gawk' to treat all input data as single-byte characters. In + addition, all output written with 'print' or 'printf' is treated as + single-byte characters. + + Normally, 'gawk' follows the POSIX standard and attempts to process + its input data according to the current locale (*note Locales::). + This can often involve converting multibyte characters into wide + characters (internally), and can lead to problems or confusion if + the input data does not contain valid multibyte characters. This + option is an easy way to tell 'gawk', "Hands off my data!" + +'-c' +'--traditional' + Specify "compatibility mode", in which the GNU extensions to the + 'awk' language are disabled, so that 'gawk' behaves just like BWK + 'awk'. *Note POSIX/GNU::, which summarizes the extensions. Also + see *note Compatibility Mode::. + +'-C' +'--copyright' + Print the short version of the General Public License and then + exit. + +'-d'[FILE] +'--dump-variables'['='FILE] + Print a sorted list of global variables, their types, and final + values to FILE. If no FILE is provided, print this list to a file + named 'awkvars.out' in the current directory. No space is allowed + between the '-d' and FILE, if FILE is supplied. + + Having a list of all global variables is a good way to look for + typographical errors in your programs. You would also use this + option if you have a large program with a lot of functions, and you + want to be sure that your functions don't inadvertently use global + variables that you meant to be local. (This is a particularly easy + mistake to make with simple variable names like 'i', 'j', etc.) + +'-D'[FILE] +'--debug'['='FILE] + Enable debugging of 'awk' programs (*note Debugging::). By + default, the debugger reads commands interactively from the + keyboard (standard input). The optional FILE argument allows you + to specify a file with a list of commands for the debugger to + execute noninteractively. No space is allowed between the '-D' and + FILE, if FILE is supplied. 
+ +'-e' PROGRAM-TEXT +'--source' PROGRAM-TEXT + Provide program source code in the PROGRAM-TEXT. This option + allows you to mix source code in files with source code that you + enter on the command line. This is particularly useful when you + have library functions that you want to use from your command-line + programs (*note AWKPATH Variable::). + +'-E' FILE +'--exec' FILE + Similar to '-f', read 'awk' program text from FILE. There are two + differences from '-f': + + * This option terminates option processing; anything else on the + command line is passed on directly to the 'awk' program. + + * Command-line variable assignments of the form 'VAR=VALUE' are + disallowed. + + This option is particularly necessary for World Wide Web CGI + applications that pass arguments through the URL; using this option + prevents a malicious (or other) user from passing in options, + assignments, or 'awk' source code (via '-e') to the CGI + application.(1) This option should be used with '#!' scripts + (*note Executable Scripts::), like so: + + #! /usr/local/bin/gawk -E + + AWK PROGRAM HERE ... + +'-g' +'--gen-pot' + Analyze the source program and generate a GNU 'gettext' portable + object template file on standard output for all string constants + that have been marked for translation. *Note + Internationalization::, for information about this option. + +'-h' +'--help' + Print a "usage" message summarizing the short- and long-style + options that 'gawk' accepts and then exit. + +'-i' SOURCE-FILE +'--include' SOURCE-FILE + Read an 'awk' source library from SOURCE-FILE. This option is + completely equivalent to using the '@include' directive inside your + program. It is very similar to the '-f' option, but there are two + important differences. First, when '-i' is used, the program + source is not loaded if it has been previously loaded, whereas with + '-f', 'gawk' always loads the file. Second, because this option is + intended to be used with code libraries, 'gawk' does not recognize + such files as constituting main program input. Thus, after + processing an '-i' argument, 'gawk' still expects to find the main + source code via the '-f' option or on the command line. + +'-l' EXT +'--load' EXT + Load a dynamic extension named EXT. Extensions are stored as + system shared libraries. This option searches for the library + using the 'AWKLIBPATH' environment variable. The correct library + suffix for your platform will be supplied by default, so it need + not be specified in the extension name. The extension + initialization routine should be named 'dl_load()'. An alternative + is to use the '@load' keyword inside the program to load a shared + library. This advanced feature is described in detail in *note + Dynamic Extensions::. + +'-L'[VALUE] +'--lint'['='VALUE] + Warn about constructs that are dubious or nonportable to other + 'awk' implementations. No space is allowed between the '-L' and + VALUE, if VALUE is supplied. Some warnings are issued when 'gawk' + first reads your program. Others are issued at runtime, as your + program executes. With an optional argument of 'fatal', lint + warnings become fatal errors. This may be drastic, but its use + will certainly encourage the development of cleaner 'awk' programs. + With an optional argument of 'invalid', only warnings about things + that are actually invalid are issued. (This is not fully + implemented yet.) + + Some warnings are only printed once, even if the dubious constructs + they warn about occur multiple times in your 'awk' program. 
Thus, + when eliminating problems pointed out by '--lint', you should take + care to search for all occurrences of each inappropriate construct. + As 'awk' programs are usually short, doing so is not burdensome. + +'-M' +'--bignum' + Select arbitrary-precision arithmetic on numbers. This option has + no effect if 'gawk' is not compiled to use the GNU MPFR and MP + libraries (*note Arbitrary Precision Arithmetic::). + +'-n' +'--non-decimal-data' + Enable automatic interpretation of octal and hexadecimal values in + input data (*note Nondecimal Data::). + + CAUTION: This option can severely break old programs. Use + with care. Also note that this option may disappear in a + future version of 'gawk'. + +'-N' +'--use-lc-numeric' + Force the use of the locale's decimal point character when parsing + numeric input data (*note Locales::). + +'-o'[FILE] +'--pretty-print'['='FILE] + Enable pretty-printing of 'awk' programs. Implies '--no-optimize'. + By default, the output program is created in a file named + 'awkprof.out' (*note Profiling::). The optional FILE argument + allows you to specify a different file name for the output. No + space is allowed between the '-o' and FILE, if FILE is supplied. + + NOTE: In the past, this option would also execute your + program. This is no longer the case. + +'-O' +'--optimize' + Enable 'gawk''s default optimizations on the internal + representation of the program. At the moment, this includes simple + constant folding and tail recursion elimination in function calls. + + These optimizations are enabled by default. This option remains + primarily for backwards compatibility. However, it may be used to + cancel the effect of an earlier '-s' option (see later in this + list). + +'-p'[FILE] +'--profile'['='FILE] + Enable profiling of 'awk' programs (*note Profiling::). Implies + '--no-optimize'. By default, profiles are created in a file named + 'awkprof.out'. The optional FILE argument allows you to specify a + different file name for the profile file. No space is allowed + between the '-p' and FILE, if FILE is supplied. + + The profile contains execution counts for each statement in the + program in the left margin, and function call counts for each + function. + +'-P' +'--posix' + Operate in strict POSIX mode. This disables all 'gawk' extensions + (just like '--traditional') and disables all extensions not allowed + by POSIX. *Note Common Extensions:: for a summary of the extensions + in 'gawk' that are disabled by this option. Also, the following + additional restrictions apply: + + * Newlines are not allowed after '?' or ':' (*note Conditional + Exp::). + + * Specifying '-Ft' on the command line does not set the value of + 'FS' to be a single TAB character (*note Field Separators::). + + * The locale's decimal point character is used for parsing input + data (*note Locales::). + + If you supply both '--traditional' and '--posix' on the command + line, '--posix' takes precedence. 'gawk' issues a warning if both + options are supplied. + +'-r' +'--re-interval' + Allow interval expressions (*note Regexp Operators::) in regexps. + This is now 'gawk''s default behavior. Nevertheless, this option + remains (both for backward compatibility and for use in combination + with '--traditional'). + +'-s' +'--no-optimize' + Disable 'gawk''s default optimizations on the internal + representation of the program. 
+ +'-S' +'--sandbox' + Disable the 'system()' function, input redirections with 'getline', + output redirections with 'print' and 'printf', and dynamic + extensions. This is particularly useful when you want to run 'awk' + scripts from questionable sources and need to make sure the scripts + can't access your system (other than the specified input data + file). + +'-t' +'--lint-old' + Warn about constructs that are not available in the original + version of 'awk' from Version 7 Unix (*note V7/SVR3.1::). + +'-V' +'--version' + Print version information for this particular copy of 'gawk'. This + allows you to determine if your copy of 'gawk' is up to date with + respect to whatever the Free Software Foundation is currently + distributing. It is also useful for bug reports (*note Bugs::). + + As long as program text has been supplied, any other options are +flagged as invalid with a warning message but are otherwise ignored. + + In compatibility mode, as a special case, if the value of FS supplied +to the '-F' option is 't', then 'FS' is set to the TAB character +('"\t"'). This is true only for '--traditional' and not for '--posix' +(*note Field Separators::). + + The '-f' option may be used more than once on the command line. If +it is, 'awk' reads its program source from all of the named files, as if +they had been concatenated together into one big file. This is useful +for creating libraries of 'awk' functions. These functions can be +written once and then retrieved from a standard place, instead of having +to be included in each individual program. The '-i' option is similar +in this regard. (As mentioned in *note Definition Syntax::, function +names must be unique.) + + With standard 'awk', library functions can still be used, even if the +program is entered at the keyboard, by specifying '-f /dev/tty'. After +typing your program, type 'Ctrl-d' (the end-of-file character) to +terminate it. (You may also use '-f -' to read program source from the +standard input, but then you will not be able to also use the standard +input as a source of data.) + + Because it is clumsy using the standard 'awk' mechanisms to mix +source file and command-line 'awk' programs, 'gawk' provides the '-e' +option. This does not require you to preempt the standard input for +your source code; it allows you to easily mix command-line and library +source code (*note AWKPATH Variable::). As with '-f', the '-e' and '-i' +options may also be used multiple times on the command line. + + If no '-f' or '-e' option is specified, then 'gawk' uses the first +nonoption command-line argument as the text of the program source code. + + If the environment variable 'POSIXLY_CORRECT' exists, then 'gawk' +behaves in strict POSIX mode, exactly as if you had supplied '--posix'. +Many GNU programs look for this environment variable to suppress +extensions that conflict with POSIX, but 'gawk' behaves differently: it +suppresses all extensions, even those that do not conflict with POSIX, +and behaves in strict POSIX mode. If '--lint' is supplied on the +command line and 'gawk' turns on POSIX mode because of +'POSIXLY_CORRECT', then it issues a warning message indicating that +POSIX mode is in effect. You would typically set this variable in your +shell's startup file. 
For a Bourne-compatible shell (such as Bash), you +would add these lines to the '.profile' file in your home directory: + + POSIXLY_CORRECT=true + export POSIXLY_CORRECT + + For a C shell-compatible shell,(2) you would add this line to the +'.login' file in your home directory: + + setenv POSIXLY_CORRECT true + + Having 'POSIXLY_CORRECT' set is not recommended for daily use, but it +is good for testing the portability of your programs to other +environments. + + ---------- Footnotes ---------- + + (1) For more detail, please see Section 4.4 of RFC 3875 +(http://www.ietf.org/rfc/rfc3875). Also see the explanatory note sent +to the 'gawk' bug mailing list +(http://lists.gnu.org/archive/html/bug-gawk/2014-11/msg00022.html). + + (2) Not recommended. + + +File: gawk.info, Node: Other Arguments, Next: Naming Standard Input, Prev: Options, Up: Invoking Gawk + +2.3 Other Command-Line Arguments +================================ + +Any additional arguments on the command line are normally treated as +input files to be processed in the order specified. However, an +argument that has the form 'VAR=VALUE', assigns the value VALUE to the +variable VAR--it does not specify a file at all. (See *note Assignment +Options::.) In the following example, COUNT=1 is a variable assignment, +not a file name: + + awk -f program.awk file1 count=1 file2 + + All the command-line arguments are made available to your 'awk' +program in the 'ARGV' array (*note Built-in Variables::). Command-line +options and the program text (if present) are omitted from 'ARGV'. All +other arguments, including variable assignments, are included. As each +element of 'ARGV' is processed, 'gawk' sets 'ARGIND' to the index in +'ARGV' of the current element. + + Changing 'ARGC' and 'ARGV' in your 'awk' program lets you control how +'awk' processes the input files; this is described in more detail in +*note ARGC and ARGV::. + + The distinction between file name arguments and variable-assignment +arguments is made when 'awk' is about to open the next input file. At +that point in execution, it checks the file name to see whether it is +really a variable assignment; if so, 'awk' sets the variable instead of +reading a file. + + Therefore, the variables actually receive the given values after all +previously specified files have been read. In particular, the values of +variables assigned in this fashion are _not_ available inside a 'BEGIN' +rule (*note BEGIN/END::), because such rules are run before 'awk' begins +scanning the argument list. + + The variable values given on the command line are processed for +escape sequences (*note Escape Sequences::). (d.c.) + + In some very early implementations of 'awk', when a variable +assignment occurred before any file names, the assignment would happen +_before_ the 'BEGIN' rule was executed. 'awk''s behavior was thus +inconsistent; some command-line assignments were available inside the +'BEGIN' rule, while others were not. Unfortunately, some applications +came to depend upon this "feature." When 'awk' was changed to be more +consistent, the '-v' option was added to accommodate applications that +depended upon the old behavior. + + The variable assignment feature is most useful for assigning to +variables such as 'RS', 'OFS', and 'ORS', which control input and output +formats, before scanning the data files. It is also useful for +controlling state if multiple passes are needed over a data file. 
For +example: + + awk 'pass == 1 { PASS 1 STUFF } + pass == 2 { PASS 2 STUFF }' pass=1 mydata pass=2 mydata + + Given the variable assignment feature, the '-F' option for setting +the value of 'FS' is not strictly necessary. It remains for historical +compatibility. + + +File: gawk.info, Node: Naming Standard Input, Next: Environment Variables, Prev: Other Arguments, Up: Invoking Gawk + +2.4 Naming Standard Input +========================= + +Often, you may wish to read standard input together with other files. +For example, you may wish to read one file, read standard input coming +from a pipe, and then read another file. + + The way to name the standard input, with all versions of 'awk', is to +use a single, standalone minus sign or dash, '-'. For example: + + SOME_COMMAND | awk -f myprog.awk file1 - file2 + +Here, 'awk' first reads 'file1', then it reads the output of +SOME_COMMAND, and finally it reads 'file2'. + + You may also use '"-"' to name standard input when reading files with +'getline' (*note Getline/File::). + + In addition, 'gawk' allows you to specify the special file name +'/dev/stdin', both on the command line and with 'getline'. Some other +versions of 'awk' also support this, but it is not standard. (Some +operating systems provide a '/dev/stdin' file in the filesystem; +however, 'gawk' always processes this file name itself.) + + +File: gawk.info, Node: Environment Variables, Next: Exit Status, Prev: Naming Standard Input, Up: Invoking Gawk + +2.5 The Environment Variables 'gawk' Uses +========================================= + +A number of environment variables influence how 'gawk' behaves. + +* Menu: + +* AWKPATH Variable:: Searching directories for 'awk' + programs. +* AWKLIBPATH Variable:: Searching directories for 'awk' shared + libraries. +* Other Environment Variables:: The environment variables. + + +File: gawk.info, Node: AWKPATH Variable, Next: AWKLIBPATH Variable, Up: Environment Variables + +2.5.1 The 'AWKPATH' Environment Variable +---------------------------------------- + +The previous minor node described how 'awk' program files can be named +on the command line with the '-f' option. In most 'awk' +implementations, you must supply a precise pathname for each program +file, unless the file is in the current directory. But with 'gawk', if +the file name supplied to the '-f' or '-i' options does not contain a +directory separator '/', then 'gawk' searches a list of directories +(called the "search path") one by one, looking for a file with the +specified name. + + The search path is a string consisting of directory names separated +by colons.(1) 'gawk' gets its search path from the 'AWKPATH' +environment variable. If that variable does not exist, or if it has an +empty value, 'gawk' uses a default path (described shortly). + + The search path feature is particularly helpful for building +libraries of useful 'awk' functions. The library files can be placed in +a standard directory in the default path and then specified on the +command line with a short file name. Otherwise, you would have to type +the full file name for each file. + + By using the '-i' or '-f' options, your command-line 'awk' programs +can use facilities in 'awk' library files (*note Library Functions::). +Path searching is not done if 'gawk' is in compatibility mode. This is +true for both '--traditional' and '--posix'. *Note Options::. + + If the source code file is not found after the initial search, the +path is searched again after adding the suffix '.awk' to the file name. 
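+
+   For instance (using a made-up directory and file names purely for
+illustration), suppose a library file '/home/me/awklib/myfuncs.awk'
+exists and that directory is listed in 'AWKPATH'.  The library can then
+be named by its bare prefix; 'gawk' finds it via the '.awk' suffix
+search, while 'prog.awk' is found in the current directory:
+
+     $ AWKPATH='.:/home/me/awklib' gawk -f myfuncs -f prog.awk datafile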
+ + 'gawk''s path search mechanism is similar to the shell's. (See 'The +Bourne-Again SHell manual' (http://www.gnu.org/software/bash/manual/).) +It treats a null entry in the path as indicating the current directory. +(A null entry is indicated by starting or ending the path with a colon +or by placing two colons next to each other ['::'].) + + NOTE: To include the current directory in the path, either place + '.' as an entry in the path or write a null entry in the path. + + Different past versions of 'gawk' would also look explicitly in the + current directory, either before or after the path search. As of + version 4.1.2, this no longer happens; if you wish to look in the + current directory, you must include '.' either as a separate entry + or as a null entry in the search path. + + The default value for 'AWKPATH' is '.:/usr/local/share/awk'.(2) +Since '.' is included at the beginning, 'gawk' searches first in the +current directory and then in '/usr/local/share/awk'. In practice, this +means that you will rarely need to change the value of 'AWKPATH'. + + *Note Shell Startup Files::, for information on functions that help +to manipulate the 'AWKPATH' variable. + + 'gawk' places the value of the search path that it used into +'ENVIRON["AWKPATH"]'. This provides access to the actual search path +value from within an 'awk' program. + + Although you can change 'ENVIRON["AWKPATH"]' within your 'awk' +program, this has no effect on the running program's behavior. This +makes sense: the 'AWKPATH' environment variable is used to find the +program source files. Once your program is running, all the files have +been found, and 'gawk' no longer needs to use 'AWKPATH'. + + ---------- Footnotes ---------- + + (1) Semicolons on MS-Windows. + + (2) Your version of 'gawk' may use a different directory; it will +depend upon how 'gawk' was built and installed. The actual directory is +the value of '$(datadir)' generated when 'gawk' was configured. You +probably don't need to worry about this, though. + + +File: gawk.info, Node: AWKLIBPATH Variable, Next: Other Environment Variables, Prev: AWKPATH Variable, Up: Environment Variables + +2.5.2 The 'AWKLIBPATH' Environment Variable +------------------------------------------- + +The 'AWKLIBPATH' environment variable is similar to the 'AWKPATH' +variable, but it is used to search for loadable extensions (stored as +system shared libraries) specified with the '-l' option rather than for +source files. If the extension is not found, the path is searched again +after adding the appropriate shared library suffix for the platform. +For example, on GNU/Linux systems, the suffix '.so' is used. The search +path specified is also used for extensions loaded via the '@load' +keyword (*note Loading Shared Libraries::). + + If 'AWKLIBPATH' does not exist in the environment, or if it has an +empty value, 'gawk' uses a default path; this is typically +'/usr/local/lib/gawk', although it can vary depending upon how 'gawk' +was built. + + *Note Shell Startup Files::, for information on functions that help +to manipulate the 'AWKLIBPATH' variable. + + 'gawk' places the value of the search path that it used into +'ENVIRON["AWKLIBPATH"]'. This provides access to the actual search path +value from within an 'awk' program. 
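+
+   A quick way to see which search paths are actually in effect is to
+print both values from a short program.  The output shown here is only
+what a typical default installation might produce; yours may differ:
+
+     $ gawk 'BEGIN { print ENVIRON["AWKPATH"]; print ENVIRON["AWKLIBPATH"] }'
+     -| .:/usr/local/share/awk
+     -| /usr/local/lib/gawk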
+ + +File: gawk.info, Node: Other Environment Variables, Prev: AWKLIBPATH Variable, Up: Environment Variables + +2.5.3 Other Environment Variables +--------------------------------- + +A number of other environment variables affect 'gawk''s behavior, but +they are more specialized. Those in the following list are meant to be +used by regular users: + +'GAWK_MSEC_SLEEP' + Specifies the interval between connection retries, in milliseconds. + On systems that do not support the 'usleep()' system call, the + value is rounded up to an integral number of seconds. + +'GAWK_READ_TIMEOUT' + Specifies the time, in milliseconds, for 'gawk' to wait for input + before returning with an error. *Note Read Timeout::. + +'GAWK_SOCK_RETRIES' + Controls the number of times 'gawk' attempts to retry a two-way + TCP/IP (socket) connection before giving up. *Note TCP/IP + Networking::. Note that when nonfatal I/O is enabled (*note + Nonfatal::), 'gawk' only tries to open a TCP/IP socket once. + +'POSIXLY_CORRECT' + Causes 'gawk' to switch to POSIX-compatibility mode, disabling all + traditional and GNU extensions. *Note Options::. + + The environment variables in the following list are meant for use by +the 'gawk' developers for testing and tuning. They are subject to +change. The variables are: + +'AWKBUFSIZE' + This variable only affects 'gawk' on POSIX-compliant systems. With + a value of 'exact', 'gawk' uses the size of each input file as the + size of the memory buffer to allocate for I/O. Otherwise, the value + should be a number, and 'gawk' uses that number as the size of the + buffer to allocate. (When this variable is not set, 'gawk' uses + the smaller of the file's size and the "default" blocksize, which + is usually the filesystem's I/O blocksize.) + +'AWK_HASH' + If this variable exists with a value of 'gst', 'gawk' switches to + using the hash function from GNU Smalltalk for managing arrays. + This function may be marginally faster than the standard function. + +'AWKREADFUNC' + If this variable exists, 'gawk' switches to reading source files + one line at a time, instead of reading in blocks. This exists for + debugging problems on filesystems on non-POSIX operating systems + where I/O is performed in records, not in blocks. + +'GAWK_MSG_SRC' + If this variable exists, 'gawk' includes the file name and line + number within the 'gawk' source code from which warning and/or + fatal messages are generated. Its purpose is to help isolate the + source of a message, as there are multiple places that produce the + same warning or error message. + +'GAWK_LOCALE_DIR' + Specifies the location of compiled message object files for 'gawk' + itself. This is passed to the 'bindtextdomain()' function when + 'gawk' starts up. + +'GAWK_NO_DFA' + If this variable exists, 'gawk' does not use the DFA regexp matcher + for "does it match" kinds of tests. This can cause 'gawk' to be + slower. Its purpose is to help isolate differences between the two + regexp matchers that 'gawk' uses internally. (There aren't + supposed to be differences, but occasionally theory and practice + don't coordinate with each other.) + +'GAWK_STACKSIZE' + This specifies the amount by which 'gawk' should grow its internal + evaluation stack, when needed. + +'INT_CHAIN_MAX' + This specifies intended maximum number of items 'gawk' will + maintain on a hash chain for managing arrays indexed by integers. + +'STR_CHAIN_MAX' + This specifies intended maximum number of items 'gawk' will + maintain on a hash chain for managing arrays indexed by strings. 
+ +'TIDYMEM' + If this variable exists, 'gawk' uses the 'mtrace()' library calls + from the GNU C library to help track down possible memory leaks. + + +File: gawk.info, Node: Exit Status, Next: Include Files, Prev: Environment Variables, Up: Invoking Gawk + +2.6 'gawk''s Exit Status +======================== + +If the 'exit' statement is used with a value (*note Exit Statement::), +then 'gawk' exits with the numeric value given to it. + + Otherwise, if there were no problems during execution, 'gawk' exits +with the value of the C constant 'EXIT_SUCCESS'. This is usually zero. + + If an error occurs, 'gawk' exits with the value of the C constant +'EXIT_FAILURE'. This is usually one. + + If 'gawk' exits because of a fatal error, the exit status is two. On +non-POSIX systems, this value may be mapped to 'EXIT_FAILURE'. + + +File: gawk.info, Node: Include Files, Next: Loading Shared Libraries, Prev: Exit Status, Up: Invoking Gawk + +2.7 Including Other Files into Your Program +=========================================== + +This minor node describes a feature that is specific to 'gawk'. + + The '@include' keyword can be used to read external 'awk' source +files. This gives you the ability to split large 'awk' source files +into smaller, more manageable pieces, and also lets you reuse common +'awk' code from various 'awk' scripts. In other words, you can group +together 'awk' functions used to carry out specific tasks into external +files. These files can be used just like function libraries, using the +'@include' keyword in conjunction with the 'AWKPATH' environment +variable. Note that source files may also be included using the '-i' +option. + + Let's see an example. We'll start with two (trivial) 'awk' scripts, +namely 'test1' and 'test2'. Here is the 'test1' script: + + BEGIN { + print "This is script test1." + } + +and here is 'test2': + + @include "test1" + BEGIN { + print "This is script test2." + } + + Running 'gawk' with 'test2' produces the following result: + + $ gawk -f test2 + -| This is script test1. + -| This is script test2. + + 'gawk' runs the 'test2' script, which includes 'test1' using the +'@include' keyword. So, to include external 'awk' source files, you +just use '@include' followed by the name of the file to be included, +enclosed in double quotes. + + NOTE: Keep in mind that this is a language construct and the file + name cannot be a string variable, but rather just a literal string + constant in double quotes. + + The files to be included may be nested; e.g., given a third script, +namely 'test3': + + @include "test2" + BEGIN { + print "This is script test3." + } + +Running 'gawk' with the 'test3' script produces the following results: + + $ gawk -f test3 + -| This is script test1. + -| This is script test2. + -| This is script test3. + + The file name can, of course, be a pathname. For example: + + @include "../io_funcs" + +and: + + @include "/usr/awklib/network" + +are both valid. The 'AWKPATH' environment variable can be of great +value when using '@include'. The same rules for the use of the +'AWKPATH' variable in command-line file searches (*note AWKPATH +Variable::) apply to '@include' also. + + This is very helpful in constructing 'gawk' function libraries. If +you have a large script with useful, general-purpose 'awk' functions, +you can break it down into library files and put those files in a +special directory. 
You can then include those "libraries," either by +using the full pathnames of the files, or by setting the 'AWKPATH' +environment variable accordingly and then using '@include' with just the +file part of the full pathname. Of course, you can keep library files +in more than one directory; the more complex the working environment is, +the more directories you may need to organize the files to be included. + + Given the ability to specify multiple '-f' options, the '@include' +mechanism is not strictly necessary. However, the '@include' keyword +can help you in constructing self-contained 'gawk' programs, thus +reducing the need for writing complex and tedious command lines. In +particular, '@include' is very useful for writing CGI scripts to be run +from web pages. + + As mentioned in *note AWKPATH Variable::, the current directory is +always searched first for source files, before searching in 'AWKPATH'; +this also applies to files named with '@include'. + + +File: gawk.info, Node: Loading Shared Libraries, Next: Obsolete, Prev: Include Files, Up: Invoking Gawk + +2.8 Loading Dynamic Extensions into Your Program +================================================ + +This minor node describes a feature that is specific to 'gawk'. + + The '@load' keyword can be used to read external 'awk' extensions +(stored as system shared libraries). This allows you to link in +compiled code that may offer superior performance and/or give you access +to extended capabilities not supported by the 'awk' language. The +'AWKLIBPATH' variable is used to search for the extension. Using +'@load' is completely equivalent to using the '-l' command-line option. + + If the extension is not initially found in 'AWKLIBPATH', another +search is conducted after appending the platform's default shared +library suffix to the file name. For example, on GNU/Linux systems, the +suffix '.so' is used: + + $ gawk '@load "ordchr"; BEGIN {print chr(65)}' + -| A + +This is equivalent to the following example: + + $ gawk -lordchr 'BEGIN {print chr(65)}' + -| A + +For command-line usage, the '-l' option is more convenient, but '@load' +is useful for embedding inside an 'awk' source file that requires access +to an extension. + + *note Dynamic Extensions::, describes how to write extensions (in C +or C++) that can be loaded with either '@load' or the '-l' option. It +also describes the 'ordchr' extension. + + +File: gawk.info, Node: Obsolete, Next: Undocumented, Prev: Loading Shared Libraries, Up: Invoking Gawk + +2.9 Obsolete Options and/or Features +==================================== + +This minor node describes features and/or command-line options from +previous releases of 'gawk' that either are not available in the current +version or are still supported but deprecated (meaning that they will +_not_ be in the next release). + + The process-related special files '/dev/pid', '/dev/ppid', +'/dev/pgrpid', and '/dev/user' were deprecated in 'gawk' 3.1, but still +worked. As of version 4.0, they are no longer interpreted specially by +'gawk'. (Use 'PROCINFO' instead; see *note Auto-set::.) + + +File: gawk.info, Node: Undocumented, Next: Invoking Summary, Prev: Obsolete, Up: Invoking Gawk + +2.10 Undocumented Options and Features +====================================== + + Use the Source, Luke! + -- _Obi-Wan_ + + This minor node intentionally left blank. 
+ + +File: gawk.info, Node: Invoking Summary, Prev: Undocumented, Up: Invoking Gawk + +2.11 Summary +============ + + * Use either 'awk 'PROGRAM' FILES' or 'awk -f PROGRAM-FILE FILES' to + run 'awk'. + + * The three standard options for all versions of 'awk' are '-f', + '-F', and '-v'. 'gawk' supplies these and many others, as well as + corresponding GNU-style long options. + + * Nonoption command-line arguments are usually treated as file names, + unless they have the form 'VAR=VALUE', in which case they are taken + as variable assignments to be performed at that point in processing + the input. + + * All nonoption command-line arguments, excluding the program text, + are placed in the 'ARGV' array. Adjusting 'ARGC' and 'ARGV' + affects how 'awk' processes input. + + * You can use a single minus sign ('-') to refer to standard input on + the command line. 'gawk' also lets you use the special file name + '/dev/stdin'. + + * 'gawk' pays attention to a number of environment variables. + 'AWKPATH', 'AWKLIBPATH', and 'POSIXLY_CORRECT' are the most + important ones. + + * 'gawk''s exit status conveys information to the program that + invoked it. Use the 'exit' statement from within an 'awk' program + to set the exit status. + + * 'gawk' allows you to include other 'awk' source files into your + program using the '@include' statement and/or the '-i' and '-f' + command-line options. + + * 'gawk' allows you to load additional functions written in C or C++ + using the '@load' statement and/or the '-l' option. (This advanced + feature is described later, in *note Dynamic Extensions::.) + + +File: gawk.info, Node: Regexp, Next: Reading Files, Prev: Invoking Gawk, Up: Top + +3 Regular Expressions +********************* + +A "regular expression", or "regexp", is a way of describing a set of +strings. Because regular expressions are such a fundamental part of +'awk' programming, their format and use deserve a separate major node. + + A regular expression enclosed in slashes ('/') is an 'awk' pattern +that matches every input record whose text belongs to that set. The +simplest regular expression is a sequence of letters, numbers, or both. +Such a regexp matches any string that contains that sequence. Thus, the +regexp 'foo' matches any string containing 'foo'. Thus, the pattern +'/foo/' matches any input record containing the three adjacent +characters 'foo' _anywhere_ in the record. Other kinds of regexps let +you specify more complicated classes of strings. + +* Menu: + +* Regexp Usage:: How to Use Regular Expressions. +* Escape Sequences:: How to write nonprinting characters. +* Regexp Operators:: Regular Expression Operators. +* Bracket Expressions:: What can go between '[...]'. +* Leftmost Longest:: How much text matches. +* Computed Regexps:: Using Dynamic Regexps. +* GNU Regexp Operators:: Operators specific to GNU software. +* Case-sensitivity:: How to do case-insensitive matching. +* Strong Regexp Constants:: Strongly typed regexp constants. +* Regexp Summary:: Regular expressions summary. + + +File: gawk.info, Node: Regexp Usage, Next: Escape Sequences, Up: Regexp + +3.1 How to Use Regular Expressions +================================== + +A regular expression can be used as a pattern by enclosing it in +slashes. Then the regular expression is tested against the entire text +of each record. (Normally, it only needs to match some part of the text +in order to succeed.) 
For example, the following prints the second +field of each record where the string 'li' appears anywhere in the +record: + + $ awk '/li/ { print $2 }' mail-list + -| 555-5553 + -| 555-0542 + -| 555-6699 + -| 555-3430 + + Regular expressions can also be used in matching expressions. These +expressions allow you to specify the string to match against; it need +not be the entire current input record. The two operators '~' and '!~' +perform regular expression comparisons. Expressions using these +operators can be used as patterns, or in 'if', 'while', 'for', and 'do' +statements. (*Note Statements::.) For example, the following is true +if the expression EXP (taken as a string) matches REGEXP: + + EXP ~ /REGEXP/ + +This example matches, or selects, all input records with the uppercase +letter 'J' somewhere in the first field: + + $ awk '$1 ~ /J/' inventory-shipped + -| Jan 13 25 15 115 + -| Jun 31 42 75 492 + -| Jul 24 34 67 436 + -| Jan 21 36 64 620 + + So does this: + + awk '{ if ($1 ~ /J/) print }' inventory-shipped + + This next example is true if the expression EXP (taken as a character +string) does _not_ match REGEXP: + + EXP !~ /REGEXP/ + + The following example matches, or selects, all input records whose +first field _does not_ contain the uppercase letter 'J': + + $ awk '$1 !~ /J/' inventory-shipped + -| Feb 15 32 24 226 + -| Mar 15 24 34 228 + -| Apr 31 52 63 420 + -| May 16 34 29 208 + ... + + When a regexp is enclosed in slashes, such as '/foo/', we call it a +"regexp constant", much like '5.27' is a numeric constant and '"foo"' is +a string constant. + + +File: gawk.info, Node: Escape Sequences, Next: Regexp Operators, Prev: Regexp Usage, Up: Regexp + +3.2 Escape Sequences +==================== + +Some characters cannot be included literally in string constants +('"foo"') or regexp constants ('/foo/'). Instead, they should be +represented with "escape sequences", which are character sequences +beginning with a backslash ('\'). One use of an escape sequence is to +include a double-quote character in a string constant. Because a plain +double quote ends the string, you must use '\"' to represent an actual +double-quote character as a part of the string. For example: + + $ awk 'BEGIN { print "He said \"hi!\" to her." }' + -| He said "hi!" to her. + + The backslash character itself is another character that cannot be +included normally; you must write '\\' to put one backslash in the +string or regexp. Thus, the string whose contents are the two +characters '"' and '\' must be written '"\"\\"'. + + Other escape sequences represent unprintable characters such as TAB +or newline. There is nothing to stop you from entering most unprintable +characters directly in a string constant or regexp constant, but they +may look ugly. + + The following list presents all the escape sequences used in 'awk' +and what they represent. Unless noted otherwise, all these escape +sequences apply to both string constants and regexp constants: + +'\\' + A literal backslash, '\'. + +'\a' + The "alert" character, 'Ctrl-g', ASCII code 7 (BEL). (This often + makes some sort of audible noise.) + +'\b' + Backspace, 'Ctrl-h', ASCII code 8 (BS). + +'\f' + Formfeed, 'Ctrl-l', ASCII code 12 (FF). + +'\n' + Newline, 'Ctrl-j', ASCII code 10 (LF). + +'\r' + Carriage return, 'Ctrl-m', ASCII code 13 (CR). + +'\t' + Horizontal TAB, 'Ctrl-i', ASCII code 9 (HT). + +'\v' + Vertical TAB, 'Ctrl-k', ASCII code 11 (VT). + +'\NNN' + The octal value NNN, where NNN stands for 1 to 3 digits between '0' + and '7'. 
For example, the code for the ASCII ESC (escape)
+     character is '\033'.
+
+'\xHH...'
+     The hexadecimal value HH, where HH stands for a sequence of
+     hexadecimal digits ('0'-'9', and either 'A'-'F' or 'a'-'f').  A
+     maximum of two digits is allowed after the '\x'.  Any further
+     hexadecimal digits are treated as simple letters or numbers.
+     (c.e.)  (The '\x' escape sequence is not allowed in POSIX awk.)
+
+          CAUTION: In ISO C, the escape sequence continues until the
+          first nonhexadecimal digit is seen.  For many years, 'gawk'
+          would continue incorporating hexadecimal digits into the value
+          until a non-hexadecimal digit or the end of the string was
+          encountered.  However, using more than two hexadecimal digits
+          produced undefined results.  As of version 4.2, only two
+          digits are processed.
+
+'\/'
+     A literal slash (necessary for regexp constants only).  This
+     sequence is used when you want to write a regexp constant that
+     contains a slash (such as '/.*:\/home\/[[:alnum:]]+:.*/'; the
+     '[[:alnum:]]' notation is discussed in *note Bracket
+     Expressions::).  Because the regexp is delimited by slashes, you
+     need to escape any slash that is part of the pattern, in order to
+     tell 'awk' to keep processing the rest of the regexp.
+
+'\"'
+     A literal double quote (necessary for string constants only).  This
+     sequence is used when you want to write a string constant that
+     contains a double quote (such as '"He said \"hi!\" to her."').
+     Because the string is delimited by double quotes, you need to
+     escape any quote that is part of the string, in order to tell 'awk'
+     to keep processing the rest of the string.
+
+   In 'gawk', a number of additional two-character sequences that begin
+with a backslash have special meaning in regexps.  *Note GNU Regexp
+Operators::.
+
+   In a regexp, a backslash before any character that is not in the
+previous list and not listed in *note GNU Regexp Operators:: means that
+the next character should be taken literally, even if it would normally
+be a regexp operator.  For example, '/a\+b/' matches the three
+characters 'a+b'.
+
+   For complete portability, do not use a backslash before any character
+not shown in the previous list or that is not an operator.
+
+              Backslash Before Regular Characters
+
+   If you place a backslash in a string constant before something that
+is not one of the characters previously listed, POSIX 'awk' purposely
+leaves what happens as undefined.  There are two choices:
+
+Strip the backslash out
+     This is what BWK 'awk' and 'gawk' both do.  For example, '"a\qc"'
+     is the same as '"aqc"'.  (Because this is such an easy bug both to
+     introduce and to miss, 'gawk' warns you about it.)  Consider 'FS =
+     "[ \t]+\|[ \t]+"' to use vertical bars surrounded by whitespace as
+     the field separator.  There should be two backslashes in the
+     string: 'FS = "[ \t]+\\|[ \t]+"'.
+
+Leave the backslash alone
+     Some other 'awk' implementations do this.  In such implementations,
+     typing '"a\qc"' is the same as typing '"a\\qc"'.
+
+   To summarize:
+
+   * The escape sequences in the preceding list are always processed
+     first, for both string constants and regexp constants.  This
+     happens very early, as soon as 'awk' reads your program.
+
+   * 'gawk' processes both regexp constants and dynamic regexps (*note
+     Computed Regexps::), for the special operators listed in *note GNU
+     Regexp Operators::.
+
+   * A backslash before any other character means to treat that
+     character literally.
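+
+   As a small runnable check of the last point (the sample input is
+made up), a backslash before '+' makes 'gawk' look for a literal plus
+sign:
+
+     $ echo 'a+b' | gawk '/a\+b/ { print "matched a literal plus" }'
+     -| matched a literal plus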
+ + Escape Sequences for Metacharacters + + Suppose you use an octal or hexadecimal escape to represent a regexp +metacharacter. (See *note Regexp Operators::.) Does 'awk' treat the +character as a literal character or as a regexp operator? + + Historically, such characters were taken literally. (d.c.) However, +the POSIX standard indicates that they should be treated as real +metacharacters, which is what 'gawk' does. In compatibility mode (*note +Options::), 'gawk' treats the characters represented by octal and +hexadecimal escape sequences literally when used in regexp constants. +Thus, '/a\52b/' is equivalent to '/a\*b/'. + + +File: gawk.info, Node: Regexp Operators, Next: Bracket Expressions, Prev: Escape Sequences, Up: Regexp + +3.3 Regular Expression Operators +================================ + +You can combine regular expressions with special characters, called +"regular expression operators" or "metacharacters", to increase the +power and versatility of regular expressions. + + The escape sequences described in *note Escape Sequences:: are valid +inside a regexp. They are introduced by a '\' and are recognized and +converted into corresponding real characters as the very first step in +processing regexps. + + Here is a list of metacharacters. All characters that are not escape +sequences and that are not listed here stand for themselves: + +'\' + This suppresses the special meaning of a character when matching. + For example, '\$' matches the character '$'. + +'^' + This matches the beginning of a string. '^@chapter' matches + '@chapter' at the beginning of a string, for example, and can be + used to identify chapter beginnings in Texinfo source files. The + '^' is known as an "anchor", because it anchors the pattern to + match only at the beginning of the string. + + It is important to realize that '^' does not match the beginning of + a line (the point right after a '\n' newline character) embedded in + a string. The condition is not true in the following example: + + if ("line1\nLINE 2" ~ /^L/) ... + +'$' + This is similar to '^', but it matches only at the end of a string. + For example, 'p$' matches a record that ends with a 'p'. The '$' + is an anchor and does not match the end of a line (the point right + before a '\n' newline character) embedded in a string. The + condition in the following example is not true: + + if ("line1\nLINE 2" ~ /1$/) ... + +'.' (period) + This matches any single character, _including_ the newline + character. For example, '.P' matches any single character followed + by a 'P' in a string. Using concatenation, we can make a regular + expression such as 'U.A', which matches any three-character + sequence that begins with 'U' and ends with 'A'. + + In strict POSIX mode (*note Options::), '.' does not match the NUL + character, which is a character with all bits equal to zero. + Otherwise, NUL is just another character. Other versions of 'awk' + may not be able to match the NUL character. + +'['...']' + This is called a "bracket expression".(1) It matches any _one_ of + the characters that are enclosed in the square brackets. For + example, '[MVX]' matches any one of the characters 'M', 'V', or 'X' + in a string. A full discussion of what can be inside the square + brackets of a bracket expression is given in *note Bracket + Expressions::. + +'[^'...']' + This is a "complemented bracket expression". The first character + after the '[' _must_ be a '^'. It matches any characters _except_ + those in the square brackets. 
For example, '[^awk]' matches any + character that is not an 'a', 'w', or 'k'. + +'|' + This is the "alternation operator" and it is used to specify + alternatives. The '|' has the lowest precedence of all the regular + expression operators. For example, '^P|[aeiouy]' matches any + string that matches either '^P' or '[aeiouy]'. This means it + matches any string that starts with 'P' or contains (anywhere + within it) a lowercase English vowel. + + The alternation applies to the largest possible regexps on either + side. + +'('...')' + Parentheses are used for grouping in regular expressions, as in + arithmetic. They can be used to concatenate regular expressions + containing the alternation operator, '|'. For example, + '@(samp|code)\{[^}]+\}' matches both '@code{foo}' and '@samp{bar}'. + (These are Texinfo formatting control sequences. The '+' is + explained further on in this list.) + +'*' + This symbol means that the preceding regular expression should be + repeated as many times as necessary to find a match. For example, + 'ph*' applies the '*' symbol to the preceding 'h' and looks for + matches of one 'p' followed by any number of 'h's. This also + matches just 'p' if no 'h's are present. + + There are two subtle points to understand about how '*' works. + First, the '*' applies only to the single preceding regular + expression component (e.g., in 'ph*', it applies just to the 'h'). + To cause '*' to apply to a larger subexpression, use parentheses: + '(ph)*' matches 'ph', 'phph', 'phphph', and so on. + + Second, '*' finds as many repetitions as possible. If the text to + be matched is 'phhhhhhhhhhhhhhooey', 'ph*' matches all of the 'h's. + +'+' + This symbol is similar to '*', except that the preceding expression + must be matched at least once. This means that 'wh+y' would match + 'why' and 'whhy', but not 'wy', whereas 'wh*y' would match all + three. + +'?' + This symbol is similar to '*', except that the preceding expression + can be matched either once or not at all. For example, 'fe?d' + matches 'fed' and 'fd', but nothing else. + +'{'N'}' +'{'N',}' +'{'N','M'}' + One or two numbers inside braces denote an "interval expression". + If there is one number in the braces, the preceding regexp is + repeated N times. If there are two numbers separated by a comma, + the preceding regexp is repeated N to M times. If there is one + number followed by a comma, then the preceding regexp is repeated + at least N times: + + 'wh{3}y' + Matches 'whhhy', but not 'why' or 'whhhhy'. + + 'wh{3,5}y' + Matches 'whhhy', 'whhhhy', or 'whhhhhy' only. + + 'wh{2,}y' + Matches 'whhy', 'whhhy', and so on. + + Interval expressions were not traditionally available in 'awk'. + They were added as part of the POSIX standard to make 'awk' and + 'egrep' consistent with each other. + + Initially, because old programs may use '{' and '}' in regexp + constants, 'gawk' did _not_ match interval expressions in regexps. + + However, beginning with version 4.0, 'gawk' does match interval + expressions by default. This is because compatibility with POSIX + has become more important to most 'gawk' users than compatibility + with old programs. + + For programs that use '{' and '}' in regexp constants, it is good + practice to always escape them with a backslash. 
Then the regexp + constants are valid and work the way you want them to, using any + version of 'awk'.(2) + + Finally, when '{' and '}' appear in regexp constants in a way that + cannot be interpreted as an interval expression (such as '/q{a}/'), + then they stand for themselves. + + In regular expressions, the '*', '+', and '?' operators, as well as +the braces '{' and '}', have the highest precedence, followed by +concatenation, and finally by '|'. As in arithmetic, parentheses can +change how operators are grouped. + + In POSIX 'awk' and 'gawk', the '*', '+', and '?' operators stand for +themselves when there is nothing in the regexp that precedes them. For +example, '/+/' matches a literal plus sign. However, many other +versions of 'awk' treat such a usage as a syntax error. + + If 'gawk' is in compatibility mode (*note Options::), interval +expressions are not available in regular expressions. + + ---------- Footnotes ---------- + + (1) In other literature, you may see a bracket expression referred to +as either a "character set", a "character class", or a "character list". + + (2) Use two backslashes if you're using a string constant with a +regexp operator or function. + + +File: gawk.info, Node: Bracket Expressions, Next: Leftmost Longest, Prev: Regexp Operators, Up: Regexp + +3.4 Using Bracket Expressions +============================= + +As mentioned earlier, a bracket expression matches any character among +those listed between the opening and closing square brackets. + + Within a bracket expression, a "range expression" consists of two +characters separated by a hyphen. It matches any single character that +sorts between the two characters, based upon the system's native +character set. For example, '[0-9]' is equivalent to '[0123456789]'. +(See *note Ranges and Locales:: for an explanation of how the POSIX +standard and 'gawk' have changed over time. This is mainly of +historical interest.) + + With the increasing popularity of the Unicode character standard +(http://www.unicode.org), there is an additional wrinkle to consider. +Octal and hexadecimal escape sequences inside bracket expressions are +taken to represent only single-byte characters (characters whose values +fit within the range 0-256). To match a range of characters where the +endpoints of the range are larger than 256, enter the multibyte +encodings of the characters directly. + + To include one of the characters '\', ']', '-', or '^' in a bracket +expression, put a '\' in front of it. For example: + + [d\]] + +matches either 'd' or ']'. Additionally, if you place ']' right after +the opening '[', the closing bracket is treated as one of the characters +to be matched. + + The treatment of '\' in bracket expressions is compatible with other +'awk' implementations and is also mandated by POSIX. The regular +expressions in 'awk' are a superset of the POSIX specification for +Extended Regular Expressions (EREs). POSIX EREs are based on the +regular expressions accepted by the traditional 'egrep' utility. + + "Character classes" are a feature introduced in the POSIX standard. +A character class is a special notation for describing lists of +characters that have a specific attribute, but the actual characters can +vary from country to country and/or from character set to character set. +For example, the notion of what is an alphabetic character differs +between the United States and France. + + A character class is only valid in a regexp _inside_ the brackets of +a bracket expression. 
Character classes consist of '[:', a keyword +denoting the class, and ':]'. *note Table 3.1: table-char-classes. +lists the character classes defined by the POSIX standard. + +Class Meaning +-------------------------------------------------------------------------- +'[:alnum:]' Alphanumeric characters +'[:alpha:]' Alphabetic characters +'[:blank:]' Space and TAB characters +'[:cntrl:]' Control characters +'[:digit:]' Numeric characters +'[:graph:]' Characters that are both printable and visible (a space is + printable but not visible, whereas an 'a' is both) +'[:lower:]' Lowercase alphabetic characters +'[:print:]' Printable characters (characters that are not control + characters) +'[:punct:]' Punctuation characters (characters that are not letters, + digits, control characters, or space characters) +'[:space:]' Space characters (such as space, TAB, and formfeed, to name + a few) +'[:upper:]' Uppercase alphabetic characters +'[:xdigit:]'Characters that are hexadecimal digits + +Table 3.1: POSIX character classes + + For example, before the POSIX standard, you had to write +'/[A-Za-z0-9]/' to match alphanumeric characters. If your character set +had other alphabetic characters in it, this would not match them. With +the POSIX character classes, you can write '/[[:alnum:]]/' to match the +alphabetic and numeric characters in your character set. + + Some utilities that match regular expressions provide a nonstandard +'[:ascii:]' character class; 'awk' does not. However, you can simulate +such a construct using '[\x00-\x7F]'. This matches all values +numerically between zero and 127, which is the defined range of the +ASCII character set. Use a complemented character list ('[^\x00-\x7F]') +to match any single-byte characters that are not in the ASCII range. + + Two additional special sequences can appear in bracket expressions. +These apply to non-ASCII character sets, which can have single symbols +(called "collating elements") that are represented with more than one +character. They can also have several characters that are equivalent +for "collating", or sorting, purposes. (For example, in French, a plain +"e" and a grave-accented "e`" are equivalent.) These sequences are: + +Collating symbols + Multicharacter collating elements enclosed between '[.' and '.]'. + For example, if 'ch' is a collating element, then '[[.ch.]]' is a + regexp that matches this collating element, whereas '[ch]' is a + regexp that matches either 'c' or 'h'. + +Equivalence classes + Locale-specific names for a list of characters that are equal. The + name is enclosed between '[=' and '=]'. For example, the name 'e' + might be used to represent all of "e," "e^," "e`," and "e'." In + this case, '[[=e=]]' is a regexp that matches any of 'e', 'e^', + 'e'', or 'e`'. + + These features are very valuable in non-English-speaking locales. + + CAUTION: The library functions that 'gawk' uses for regular + expression matching currently recognize only POSIX character + classes; they do not recognize collating symbols or equivalence + classes. + + Inside a bracket expression, an opening bracket ('[') that does not +start a character class, collating element or equivalence class is taken +literally. This is also true of '.' and '*'. + + +File: gawk.info, Node: Leftmost Longest, Next: Computed Regexps, Prev: Bracket Expressions, Up: Regexp + +3.5 How Much Text Matches? 
+========================== + +Consider the following: + + echo aaaabcd | awk '{ sub(/a+/, "<A>"); print }' + + This example uses the 'sub()' function to make a change to the input +record. ('sub()' replaces the first instance of any text matched by the +first argument with the string provided as the second argument; *note +String Functions::.) Here, the regexp '/a+/' indicates "one or more 'a' +characters," and the replacement text is '<A>'. + + The input contains four 'a' characters. 'awk' (and POSIX) regular +expressions always match the leftmost, _longest_ sequence of input +characters that can match. Thus, all four 'a' characters are replaced +with '<A>' in this example: + + $ echo aaaabcd | awk '{ sub(/a+/, "<A>"); print }' + -| <A>bcd + + For simple match/no-match tests, this is not so important. But when +doing text matching and substitutions with the 'match()', 'sub()', +'gsub()', and 'gensub()' functions, it is very important. *Note String +Functions::, for more information on these functions. Understanding +this principle is also important for regexp-based record and field +splitting (*note Records::, and also *note Field Separators::). + + +File: gawk.info, Node: Computed Regexps, Next: GNU Regexp Operators, Prev: Leftmost Longest, Up: Regexp + +3.6 Using Dynamic Regexps +========================= + +The righthand side of a '~' or '!~' operator need not be a regexp +constant (i.e., a string of characters between slashes). It may be any +expression. The expression is evaluated and converted to a string if +necessary; the contents of the string are then used as the regexp. A +regexp computed in this way is called a "dynamic regexp" or a "computed +regexp": + + BEGIN { digits_regexp = "[[:digit:]]+" } + $0 ~ digits_regexp { print } + +This sets 'digits_regexp' to a regexp that describes one or more digits, +and tests whether the input record matches this regexp. + + NOTE: When using the '~' and '!~' operators, be aware that there is + a difference between a regexp constant enclosed in slashes and a + string constant enclosed in double quotes. If you are going to use + a string constant, you have to understand that the string is, in + essence, scanned _twice_: the first time when 'awk' reads your + program, and the second time when it goes to match the string on + the lefthand side of the operator with the pattern on the right. + This is true of any string-valued expression (such as + 'digits_regexp', shown in the previous example), not just string + constants. + + What difference does it make if the string is scanned twice? The +answer has to do with escape sequences, and particularly with +backslashes. To get a backslash into a regular expression inside a +string, you have to type two backslashes. + + For example, '/\*/' is a regexp constant for a literal '*'. Only one +backslash is needed. To do the same thing with a string, you have to +type '"\\*"'. The first backslash escapes the second one so that the +string actually contains the two characters '\' and '*'. + + Given that you can use both regexp and string constants to describe +regular expressions, which should you use? The answer is "regexp +constants," for several reasons: + + * String constants are more complicated to write and more difficult + to read. Using regexp constants makes your programs less + error-prone. Not understanding the difference between the two + kinds of constants is a common source of errors. + + * It is more efficient to use regexp constants. 
'awk' can note that + you have supplied a regexp and store it internally in a form that + makes pattern matching more efficient. When using a string + constant, 'awk' must first convert the string into this internal + form and then perform the pattern matching. + + * Using regexp constants is better form; it shows clearly that you + intend a regexp match. + + Using '\n' in Bracket Expressions of Dynamic Regexps + + Some older versions of 'awk' do not allow the newline character to be +used inside a bracket expression for a dynamic regexp: + + $ awk '$0 ~ "[ \t\n]"' + error-> awk: newline in character class [ + error-> ]... + error-> source line number 1 + error-> context is + error-> $0 ~ "[ >>> \t\n]" <<< + + But a newline in a regexp constant works with no problem: + + $ awk '$0 ~ /[ \t\n]/' + here is a sample line + -| here is a sample line + Ctrl-d + + 'gawk' does not have this problem, and it isn't likely to occur often +in practice, but it's worth noting for future reference. + + +File: gawk.info, Node: GNU Regexp Operators, Next: Case-sensitivity, Prev: Computed Regexps, Up: Regexp + +3.7 'gawk'-Specific Regexp Operators +==================================== + +GNU software that deals with regular expressions provides a number of +additional regexp operators. These operators are described in this +minor node and are specific to 'gawk'; they are not available in other +'awk' implementations. Most of the additional operators deal with word +matching. For our purposes, a "word" is a sequence of one or more +letters, digits, or underscores ('_'): + +'\s' + Matches any whitespace character. Think of it as shorthand for + '[[:space:]]'. + +'\S' + Matches any character that is not whitespace. Think of it as + shorthand for '[^[:space:]]'. + +'\w' + Matches any word-constituent character--that is, it matches any + letter, digit, or underscore. Think of it as shorthand for + '[[:alnum:]_]'. + +'\W' + Matches any character that is not word-constituent. Think of it as + shorthand for '[^[:alnum:]_]'. + +'\<' + Matches the empty string at the beginning of a word. For example, + '/\<away/' matches 'away' but not 'stowaway'. + +'\>' + Matches the empty string at the end of a word. For example, + '/stow\>/' matches 'stow' but not 'stowaway'. + +'\y' + Matches the empty string at either the beginning or the end of a + word (i.e., the word boundar*y*). For example, '\yballs?\y' + matches either 'ball' or 'balls', as a separate word. + +'\B' + Matches the empty string that occurs between two word-constituent + characters. For example, '/\Brat\B/' matches 'crate', but it does + not match 'dirty rat'. '\B' is essentially the opposite of '\y'. + + There are two other operators that work on buffers. In Emacs, a +"buffer" is, naturally, an Emacs buffer. Other GNU programs, including +'gawk', consider the entire string to match as the buffer. The +operators are: + +'\`' + Matches the empty string at the beginning of a buffer (string) + +'\'' + Matches the empty string at the end of a buffer (string) + + Because '^' and '$' always work in terms of the beginning and end of +strings, these operators don't add any new capabilities for 'awk'. They +are provided for compatibility with other GNU software. + + In other GNU software, the word-boundary operator is '\b'. However, +that conflicts with the 'awk' language's definition of '\b' as +backspace, so 'gawk' uses a different letter. An alternative method +would have been to require two backslashes in the GNU operators, but +this was deemed too confusing. 
The current method of using '\y' for the +GNU '\b' appears to be the lesser of two evils. + + The various command-line options (*note Options::) control how 'gawk' +interprets characters in regexps: + +No options + In the default case, 'gawk' provides all the facilities of POSIX + regexps and the GNU regexp operators described in *note Regexp + Operators::. + +'--posix' + Match only POSIX regexps; the GNU operators are not special (e.g., + '\w' matches a literal 'w'). Interval expressions are allowed. + +'--traditional' + Match traditional Unix 'awk' regexps. The GNU operators are not + special, and interval expressions are not available. Because BWK + 'awk' supports them, the POSIX character classes ('[[:alnum:]]', + etc.) are available. Characters described by octal and + hexadecimal escape sequences are treated literally, even if they + represent regexp metacharacters. + +'--re-interval' + Allow interval expressions in regexps, if '--traditional' has been + provided. Otherwise, interval expressions are available by + default. + + +File: gawk.info, Node: Case-sensitivity, Next: Strong Regexp Constants, Prev: GNU Regexp Operators, Up: Regexp + +3.8 Case Sensitivity in Matching +================================ + +Case is normally significant in regular expressions, both when matching +ordinary characters (i.e., not metacharacters) and inside bracket +expressions. Thus, a 'w' in a regular expression matches only a +lowercase 'w' and not an uppercase 'W'. + + The simplest way to do a case-independent match is to use a bracket +expression--for example, '[Ww]'. However, this can be cumbersome if you +need to use it often, and it can make the regular expressions harder to +read. There are two alternatives that you might prefer. + + One way to perform a case-insensitive match at a particular point in +the program is to convert the data to a single case, using the +'tolower()' or 'toupper()' built-in string functions (which we haven't +discussed yet; *note String Functions::). For example: + + tolower($1) ~ /foo/ { ... } + +converts the first field to lowercase before matching against it. This +works in any POSIX-compliant 'awk'. + + Another method, specific to 'gawk', is to set the variable +'IGNORECASE' to a nonzero value (*note Built-in Variables::). When +'IGNORECASE' is not zero, _all_ regexp and string operations ignore +case. + + Changing the value of 'IGNORECASE' dynamically controls the case +sensitivity of the program as it runs. Case is significant by default +because 'IGNORECASE' (like most variables) is initialized to zero: + + x = "aB" + if (x ~ /ab/) ... # this test will fail + + IGNORECASE = 1 + if (x ~ /ab/) ... # now it will succeed + + In general, you cannot use 'IGNORECASE' to make certain rules case +insensitive and other rules case sensitive, as there is no +straightforward way to set 'IGNORECASE' just for the pattern of a +particular rule.(1) To do this, use either bracket expressions or +'tolower()'. However, one thing you can do with 'IGNORECASE' only is +dynamically turn case sensitivity on or off for all the rules at once. + + 'IGNORECASE' can be set on the command line or in a 'BEGIN' rule +(*note Other Arguments::; also *note Using BEGIN/END::). Setting +'IGNORECASE' from the command line is a way to make a program case +insensitive without having to edit it. + + In multibyte locales, the equivalences between upper- and lowercase +characters are tested based on the wide-character values of the locale's +character set. 
Otherwise, the characters are tested based on the +ISO-8859-1 (ISO Latin-1) character set. This character set is a +superset of the traditional 128 ASCII characters, which also provides a +number of characters suitable for use with European languages.(2) + + The value of 'IGNORECASE' has no effect if 'gawk' is in compatibility +mode (*note Options::). Case is always significant in compatibility +mode. + + ---------- Footnotes ---------- + + (1) Experienced C and C++ programmers will note that it is possible, +using something like 'IGNORECASE = 1 && /foObAr/ { ... }' and +'IGNORECASE = 0 || /foobar/ { ... }'. However, this is somewhat obscure +and we don't recommend it. + + (2) If you don't understand this, don't worry about it; it just means +that 'gawk' does the right thing. + + +File: gawk.info, Node: Strong Regexp Constants, Next: Regexp Summary, Prev: Case-sensitivity, Up: Regexp + +3.9 Strongly Typed Regexp Constants +=================================== + +This minor node describes a 'gawk'-specific feature. + + Regexp constants ('/.../') hold a strange position in the 'awk' +language. In most contexts, they act like an expression: '$0 ~ /.../'. +In other contexts, they denote only a regexp to be matched. In no case +are they really a "first class citizen" of the language. That is, you +cannot define a scalar variable whose type is "regexp" in the same sense +that you can define a variable to be a number or a string: + + num = 42 Numeric variable + str = "hi" String variable + re = /foo/ Wrong! re is the result of $0 ~ /foo/ + + +File: gawk.info, Node: Regexp Summary, Prev: Strong Regexp Constants, Up: Regexp + +3.10 Summary +============ + + * Regular expressions describe sets of strings to be matched. In + 'awk', regular expression constants are written enclosed between + slashes: '/'...'/'. + + * Regexp constants may be used standalone in patterns and in + conditional expressions, or as part of matching expressions using + the '~' and '!~' operators. + + * Escape sequences let you represent nonprintable characters and also + let you represent regexp metacharacters as literal characters to be + matched. + + * Regexp operators provide grouping, alternation, and repetition. + + * Bracket expressions give you a shorthand for specifying sets of + characters that can match at a particular point in a regexp. + Within bracket expressions, POSIX character classes let you specify + certain groups of characters in a locale-independent fashion. + + * Regular expressions match the leftmost longest text in the string + being matched. This matters for cases where you need to know the + extent of the match, such as for text substitution and when the + record separator is a regexp. + + * Matching expressions may use dynamic regexps (i.e., string values + treated as regular expressions). + + * 'gawk''s 'IGNORECASE' variable lets you control the case + sensitivity of regexp matching. In other 'awk' versions, use + 'tolower()' or 'toupper()'. + + +File: gawk.info, Node: Reading Files, Next: Printing, Prev: Regexp, Up: Top + +4 Reading Input Files +********************* + +In the typical 'awk' program, 'awk' reads all input either from the +standard input (by default, this is the keyboard, but often it is a pipe +from another command) or from files whose names you specify on the 'awk' +command line. If you specify input files, 'awk' reads them in order, +processing all the data from one before going on to the next. 
The name +of the current input file can be found in the predefined variable +'FILENAME' (*note Built-in Variables::). + + The input is read in units called "records", and is processed by the +rules of your program one record at a time. By default, each record is +one line. Each record is automatically split into chunks called +"fields". This makes it more convenient for programs to work on the +parts of a record. + + On rare occasions, you may need to use the 'getline' command. The +'getline' command is valuable both because it can do explicit input from +any number of files, and because the files used with it do not have to +be named on the 'awk' command line (*note Getline::). + +* Menu: + +* Records:: Controlling how data is split into records. +* Fields:: An introduction to fields. +* Nonconstant Fields:: Nonconstant Field Numbers. +* Changing Fields:: Changing the Contents of a Field. +* Field Separators:: The field separator and how to change it. +* Constant Size:: Reading constant width data. +* Splitting By Content:: Defining Fields By Content +* Multiple Line:: Reading multiline records. +* Getline:: Reading files under explicit program control + using the 'getline' function. +* Read Timeout:: Reading input with a timeout. +* Retrying Input:: Retrying input after certain errors. +* Command-line directories:: What happens if you put a directory on the + command line. +* Input Summary:: Input summary. +* Input Exercises:: Exercises. + + +File: gawk.info, Node: Records, Next: Fields, Up: Reading Files + +4.1 How Input Is Split into Records +=================================== + +'awk' divides the input for your program into records and fields. It +keeps track of the number of records that have been read so far from the +current input file. This value is stored in a predefined variable +called 'FNR', which is reset to zero every time a new file is started. +Another predefined variable, 'NR', records the total number of input +records read so far from all data files. It starts at zero, but is +never automatically reset to zero. + +* Menu: + +* awk split records:: How standard 'awk' splits records. +* gawk split records:: How 'gawk' splits records. + + +File: gawk.info, Node: awk split records, Next: gawk split records, Up: Records + +4.1.1 Record Splitting with Standard 'awk' +------------------------------------------ + +Records are separated by a character called the "record separator". By +default, the record separator is the newline character. This is why +records are, by default, single lines. To use a different character for +the record separator, simply assign that character to the predefined +variable 'RS'. + + Like any other variable, the value of 'RS' can be changed in the +'awk' program with the assignment operator, '=' (*note Assignment +Ops::). The new record-separator character should be enclosed in +quotation marks, which indicate a string constant. Often, the right +time to do this is at the beginning of execution, before any input is +processed, so that the very first record is read with the proper +separator. To do this, use the special 'BEGIN' pattern (*note +BEGIN/END::). For example: + + awk 'BEGIN { RS = "u" } + { print $0 }' mail-list + +changes the value of 'RS' to 'u', before reading any input. The new +value is a string whose first character is the letter "u"; as a result, +records are separated by the letter "u". Then the input file is read, +and the second rule in the 'awk' program (the action with no pattern) +prints each record. 
Because each 'print' statement adds a newline at +the end of its output, this 'awk' program copies the input with each 'u' +changed to a newline. Here are the results of running the program on +'mail-list': + + $ awk 'BEGIN { RS = "u" } + > { print $0 }' mail-list + -| Amelia 555-5553 amelia.zodiac + -| sq + -| e@gmail.com F + -| Anthony 555-3412 anthony.assert + -| ro@hotmail.com A + -| Becky 555-7685 becky.algebrar + -| m@gmail.com A + -| Bill 555-1675 bill.drowning@hotmail.com A + -| Broderick 555-0542 broderick.aliq + -| otiens@yahoo.com R + -| Camilla 555-2912 camilla.inf + -| sar + -| m@skynet.be R + -| Fabi + -| s 555-1234 fabi + -| s. + -| ndevicesim + -| s@ + -| cb.ed + -| F + -| J + -| lie 555-6699 j + -| lie.perscr + -| tabor@skeeve.com F + -| Martin 555-6480 martin.codicib + -| s@hotmail.com A + -| Sam + -| el 555-3430 sam + -| el.lanceolis@sh + -| .ed + -| A + -| Jean-Pa + -| l 555-2127 jeanpa + -| l.campanor + -| m@ny + -| .ed + -| R + -| + +Note that the entry for the name 'Bill' is not split. In the original +data file (*note Sample Data Files::), the line looks like this: + + Bill 555-1675 bill.drowning@hotmail.com A + +It contains no 'u', so there is no reason to split the record, unlike +the others, which each have one or more occurrences of the 'u'. In +fact, this record is treated as part of the previous record; the newline +separating them in the output is the original newline in the data file, +not the one added by 'awk' when it printed the record! + + Another way to change the record separator is on the command line, +using the variable-assignment feature (*note Other Arguments::): + + awk '{ print $0 }' RS="u" mail-list + +This sets 'RS' to 'u' before processing 'mail-list'. + + Using an alphabetic character such as 'u' for the record separator is +highly likely to produce strange results. Using an unusual character +such as '/' is more likely to produce correct behavior in the majority +of cases, but there are no guarantees. The moral is: Know Your Data. + + When using regular characters as the record separator, there is one +unusual case that occurs when 'gawk' is being fully POSIX-compliant +(*note Options::). Then, the following (extreme) pipeline prints a +surprising '1': + + $ echo | gawk --posix 'BEGIN { RS = "a" } ; { print NF }' + -| 1 + + There is one field, consisting of a newline. The value of the +built-in variable 'NF' is the number of fields in the current record. +(In the normal case, 'gawk' treats the newline as whitespace, printing +'0' as the result. Most other versions of 'awk' also act this way.) + + Reaching the end of an input file terminates the current input +record, even if the last character in the file is not the character in +'RS'. (d.c.) + + The empty string '""' (a string without any characters) has a special +meaning as the value of 'RS'. It means that records are separated by +one or more blank lines and nothing else. *Note Multiple Line:: for +more details. + + If you change the value of 'RS' in the middle of an 'awk' run, the +new value is used to delimit subsequent records, but the record +currently being processed, as well as records already processed, are not +affected. + + After the end of the record has been determined, 'gawk' sets the +variable 'RT' to the text in the input that matched 'RS'. 
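+
+   For example, here is one way to see 'RT' in action when 'RS' is the
+single character 'X' (a small illustration; the square brackets are
+printed only to make the value of 'RT' visible):
+
+     $ printf 'firstXsecond' | gawk 'BEGIN { RS = "X" }
+     > { print NR, $0, "[" RT "]" }'
+     -| 1 first [X]
+     -| 2 second []
+
+The first record ends at the 'X', so 'RT' is 'X'.  The second record is
+terminated by the end of the input instead, so 'RT' is the null string.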
+ + +File: gawk.info, Node: gawk split records, Prev: awk split records, Up: Records + +4.1.2 Record Splitting with 'gawk' +---------------------------------- + +When using 'gawk', the value of 'RS' is not limited to a one-character +string. It can be any regular expression (*note Regexp::). (c.e.) In +general, each record ends at the next string that matches the regular +expression; the next record starts at the end of the matching string. +This general rule is actually at work in the usual case, where 'RS' +contains just a newline: a record ends at the beginning of the next +matching string (the next newline in the input), and the following +record starts just after the end of this string (at the first character +of the following line). The newline, because it matches 'RS', is not +part of either record. + + When 'RS' is a single character, 'RT' contains the same single +character. However, when 'RS' is a regular expression, 'RT' contains +the actual input text that matched the regular expression. + + If the input file ends without any text matching 'RS', 'gawk' sets +'RT' to the null string. + + The following example illustrates both of these features. It sets +'RS' equal to a regular expression that matches either a newline or a +series of one or more uppercase letters with optional leading and/or +trailing whitespace: + + $ echo record 1 AAAA record 2 BBBB record 3 | + > gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" } + > { print "Record =", $0,"and RT = [" RT "]" }' + -| Record = record 1 and RT = [ AAAA ] + -| Record = record 2 and RT = [ BBBB ] + -| Record = record 3 and RT = [ + -| ] + +The square brackets delineate the contents of 'RT', letting you see the +leading and trailing whitespace. The final value of 'RT' is a newline. +*Note Simple Sed:: for a more useful example of 'RS' as a regexp and +'RT'. + + If you set 'RS' to a regular expression that allows optional trailing +text, such as 'RS = "abc(XYZ)?"', it is possible, due to implementation +constraints, that 'gawk' may match the leading part of the regular +expression, but not the trailing part, particularly if the input text +that could match the trailing part is fairly long. 'gawk' attempts to +avoid this problem, but currently, there's no guarantee that this will +never happen. + + NOTE: Remember that in 'awk', the '^' and '$' anchor metacharacters + match the beginning and end of a _string_, and not the beginning + and end of a _line_. As a result, something like 'RS = + "^[[:upper:]]"' can only match at the beginning of a file. This is + because 'gawk' views the input file as one long string that happens + to contain newline characters. It is thus best to avoid anchor + metacharacters in the value of 'RS'. + + The use of 'RS' as a regular expression and the 'RT' variable are +'gawk' extensions; they are not available in compatibility mode (*note +Options::). In compatibility mode, only the first character of the +value of 'RS' determines the end of the record. + + 'RS = "\0"' Is Not Portable + + There are times when you might want to treat an entire data file as a +single record. The only way to make this happen is to give 'RS' a value +that you know doesn't occur in the input file. This is hard to do in a +general way, such that a program always works for arbitrary input files. + + You might think that for text files, the NUL character, which +consists of a character with all bits equal to zero, is a good value to +use for 'RS' in this case: + + BEGIN { RS = "\0" } # whole file becomes one record? 
+ + 'gawk' in fact accepts this, and uses the NUL character for the +record separator. This works for certain special files, such as +'/proc/environ' on GNU/Linux systems, where the NUL character is in fact +the record separator. However, this usage is _not_ portable to most +other 'awk' implementations. + + Almost all other 'awk' implementations(1) store strings internally as +C-style strings. C strings use the NUL character as the string +terminator. In effect, this means that 'RS = "\0"' is the same as 'RS = +""'. (d.c.) + + It happens that recent versions of 'mawk' can use the NUL character +as a record separator. However, this is a special case: 'mawk' does not +allow embedded NUL characters in strings. (This may change in a future +version of 'mawk'.) + + *Note Readfile Function:: for an interesting way to read whole files. +If you are using 'gawk', see *note Extension Sample Readfile:: for +another option. + + ---------- Footnotes ---------- + + (1) At least that we know about. + + +File: gawk.info, Node: Fields, Next: Nonconstant Fields, Prev: Records, Up: Reading Files + +4.2 Examining Fields +==================== + +When 'awk' reads an input record, the record is automatically "parsed" +or separated by the 'awk' utility into chunks called "fields". By +default, fields are separated by "whitespace", like words in a line. +Whitespace in 'awk' means any string of one or more spaces, TABs, or +newlines; other characters that are considered whitespace by other +languages (such as formfeed, vertical tab, etc.) are _not_ considered +whitespace by 'awk'. + + The purpose of fields is to make it more convenient for you to refer +to these pieces of the record. You don't have to use them--you can +operate on the whole record if you want--but fields are what make simple +'awk' programs so powerful. + + You use a dollar sign ('$') to refer to a field in an 'awk' program, +followed by the number of the field you want. Thus, '$1' refers to the +first field, '$2' to the second, and so on. (Unlike in the Unix shells, +the field numbers are not limited to single digits. '$127' is the 127th +field in the record.) For example, suppose the following is a line of +input: + + This seems like a pretty nice example. + +Here the first field, or '$1', is 'This', the second field, or '$2', is +'seems', and so on. Note that the last field, '$7', is 'example.'. +Because there is no space between the 'e' and the '.', the period is +considered part of the seventh field. + + 'NF' is a predefined variable whose value is the number of fields in +the current record. 'awk' automatically updates the value of 'NF' each +time it reads a record. No matter how many fields there are, the last +field in a record can be represented by '$NF'. So, '$NF' is the same as +'$7', which is 'example.'. If you try to reference a field beyond the +last one (such as '$8' when the record has only seven fields), you get +the empty string. (If used in a numeric operation, you get zero.) + + The use of '$0', which looks like a reference to the "zeroth" field, +is a special case: it represents the whole input record. Use it when +you are not interested in specific fields. Here are some more examples: + + $ awk '$1 ~ /li/ { print $0 }' mail-list + -| Amelia 555-5553 amelia.zodiacusque@gmail.com F + -| Julie 555-6699 julie.perscrutabor@skeeve.com F + +This example prints each record in the file 'mail-list' whose first +field contains the string 'li'. 
+ + By contrast, the following example looks for 'li' in _the entire +record_ and prints the first and last fields for each matching input +record: + + $ awk '/li/ { print $1, $NF }' mail-list + -| Amelia F + -| Broderick R + -| Julie F + -| Samuel A + + +File: gawk.info, Node: Nonconstant Fields, Next: Changing Fields, Prev: Fields, Up: Reading Files + +4.3 Nonconstant Field Numbers +============================= + +A field number need not be a constant. Any expression in the 'awk' +language can be used after a '$' to refer to a field. The value of the +expression specifies the field number. If the value is a string, rather +than a number, it is converted to a number. Consider this example: + + awk '{ print $NR }' + +Recall that 'NR' is the number of records read so far: one in the first +record, two in the second, and so on. So this example prints the first +field of the first record, the second field of the second record, and so +on. For the twentieth record, field number 20 is printed; most likely, +the record has fewer than 20 fields, so this prints a blank line. Here +is another example of using expressions as field numbers: + + awk '{ print $(2*2) }' mail-list + + 'awk' evaluates the expression '(2*2)' and uses its value as the +number of the field to print. The '*' represents multiplication, so the +expression '2*2' evaluates to four. The parentheses are used so that +the multiplication is done before the '$' operation; they are necessary +whenever there is a binary operator(1) in the field-number expression. +This example, then, prints the type of relationship (the fourth field) +for every line of the file 'mail-list'. (All of the 'awk' operators are +listed, in order of decreasing precedence, in *note Precedence::.) + + If the field number you compute is zero, you get the entire record. +Thus, '$(2-2)' has the same value as '$0'. Negative field numbers are +not allowed; trying to reference one usually terminates the program. +(The POSIX standard does not define what happens when you reference a +negative field number. 'gawk' notices this and terminates your program. +Other 'awk' implementations may behave differently.) + + As mentioned in *note Fields::, 'awk' stores the current record's +number of fields in the built-in variable 'NF' (also *note Built-in +Variables::). Thus, the expression '$NF' is not a special feature--it +is the direct consequence of evaluating 'NF' and using its value as a +field number. + + ---------- Footnotes ---------- + + (1) A "binary operator", such as '*' for multiplication, is one that +takes two operands. The distinction is required because 'awk' also has +unary (one-operand) and ternary (three-operand) operators. + + +File: gawk.info, Node: Changing Fields, Next: Field Separators, Prev: Nonconstant Fields, Up: Reading Files + +4.4 Changing the Contents of a Field +==================================== + +The contents of a field, as seen by 'awk', can be changed within an +'awk' program; this changes what 'awk' perceives as the current input +record. (The actual input is untouched; 'awk' _never_ modifies the +input file.) Consider the following example and its output: + + $ awk '{ nboxes = $3 ; $3 = $3 - 10 + > print nboxes, $3 }' inventory-shipped + -| 25 15 + -| 32 22 + -| 24 14 + ... + +The program first saves the original value of field three in the +variable 'nboxes'. The '-' sign represents subtraction, so this program +reassigns field three, '$3', as the original value of field three minus +ten: '$3 - 10'. (*Note Arithmetic Ops::.) 
Then it prints the original +and new values for field three. (Someone in the warehouse made a +consistent mistake while inventorying the red boxes.) + + For this to work, the text in '$3' must make sense as a number; the +string of characters must be converted to a number for the computer to +do arithmetic on it. The number resulting from the subtraction is +converted back to a string of characters that then becomes field three. +*Note Conversion::. + + When the value of a field is changed (as perceived by 'awk'), the +text of the input record is recalculated to contain the new field where +the old one was. In other words, '$0' changes to reflect the altered +field. Thus, this program prints a copy of the input file, with 10 +subtracted from the second field of each line: + + $ awk '{ $2 = $2 - 10; print $0 }' inventory-shipped + -| Jan 3 25 15 115 + -| Feb 5 32 24 226 + -| Mar 5 24 34 228 + ... + + It is also possible to assign contents to fields that are out of +range. For example: + + $ awk '{ $6 = ($5 + $4 + $3 + $2) + > print $6 }' inventory-shipped + -| 168 + -| 297 + -| 301 + ... + +We've just created '$6', whose value is the sum of fields '$2', '$3', +'$4', and '$5'. The '+' sign represents addition. For the file +'inventory-shipped', '$6' represents the total number of parcels shipped +for a particular month. + + Creating a new field changes 'awk''s internal copy of the current +input record, which is the value of '$0'. Thus, if you do 'print $0' +after adding a field, the record printed includes the new field, with +the appropriate number of field separators between it and the previously +existing fields. + + This recomputation affects and is affected by 'NF' (the number of +fields; *note Fields::). For example, the value of 'NF' is set to the +number of the highest field you create. The exact format of '$0' is +also affected by a feature that has not been discussed yet: the "output +field separator", 'OFS', used to separate the fields (*note Output +Separators::). + + Note, however, that merely _referencing_ an out-of-range field does +_not_ change the value of either '$0' or 'NF'. Referencing an +out-of-range field only produces an empty string. For example: + + if ($(NF+1) != "") + print "can't happen" + else + print "everything is normal" + +should print 'everything is normal', because 'NF+1' is certain to be out +of range. (*Note If Statement:: for more information about 'awk''s +'if-else' statements. *Note Typing and Comparison:: for more +information about the '!=' operator.) + + It is important to note that making an assignment to an existing +field changes the value of '$0' but does not change the value of 'NF', +even when you assign the empty string to a field. For example: + + $ echo a b c d | awk '{ OFS = ":"; $2 = "" + > print $0; print NF }' + -| a::c:d + -| 4 + +The field is still there; it just has an empty value, delimited by the +two colons between 'a' and 'c'. This example shows what happens if you +create a new field: + + $ echo a b c d | awk '{ OFS = ":"; $2 = ""; $6 = "new" + > print $0; print NF }' + -| a::c:d::new + -| 6 + +The intervening field, '$5', is created with an empty value (indicated +by the second pair of adjacent colons), and 'NF' is updated with the +value six. + + Decrementing 'NF' throws away the values of the fields after the new +value of 'NF' and recomputes '$0'. (d.c.) 
Here is an example: + + $ echo a b c d e f | awk '{ print "NF =", NF; + > NF = 3; print $0 }' + -| NF = 6 + -| a b c + + CAUTION: Some versions of 'awk' don't rebuild '$0' when 'NF' is + decremented. + + Finally, there are times when it is convenient to force 'awk' to +rebuild the entire record, using the current values of the fields and +'OFS'. To do this, use the seemingly innocuous assignment: + + $1 = $1 # force record to be reconstituted + print $0 # or whatever else with $0 + +This forces 'awk' to rebuild the record. It does help to add a comment, +as we've shown here. + + There is a flip side to the relationship between '$0' and the fields. +Any assignment to '$0' causes the record to be reparsed into fields +using the _current_ value of 'FS'. This also applies to any built-in +function that updates '$0', such as 'sub()' and 'gsub()' (*note String +Functions::). + + Understanding '$0' + + It is important to remember that '$0' is the _full_ record, exactly +as it was read from the input. This includes any leading or trailing +whitespace, and the exact whitespace (or other characters) that +separates the fields. + + It is a common error to try to change the field separators in a +record simply by setting 'FS' and 'OFS', and then expecting a plain +'print' or 'print $0' to print the modified record. + + But this does not work, because nothing was done to change the record +itself. Instead, you must force the record to be rebuilt, typically +with a statement such as '$1 = $1', as described earlier. + + +File: gawk.info, Node: Field Separators, Next: Constant Size, Prev: Changing Fields, Up: Reading Files + +4.5 Specifying How Fields Are Separated +======================================= + +* Menu: + +* Default Field Splitting:: How fields are normally separated. +* Regexp Field Splitting:: Using regexps as the field separator. +* Single Character Fields:: Making each character a separate field. +* Command Line Field Separator:: Setting 'FS' from the command line. +* Full Line Fields:: Making the full line be a single field. +* Field Splitting Summary:: Some final points and a summary table. + +The "field separator", which is either a single character or a regular +expression, controls the way 'awk' splits an input record into fields. +'awk' scans the input record for character sequences that match the +separator; the fields themselves are the text between the matches. + + In the examples that follow, we use the bullet symbol (*) to +represent spaces in the output. If the field separator is 'oo', then +the following line: + + moo goo gai pan + +is split into three fields: 'm', '*g', and '*gai*pan'. Note the leading +spaces in the values of the second and third fields. + + The field separator is represented by the predefined variable 'FS'. +Shell programmers take note: 'awk' does _not_ use the name 'IFS' that is +used by the POSIX-compliant shells (such as the Unix Bourne shell, 'sh', +or Bash). + + The value of 'FS' can be changed in the 'awk' program with the +assignment operator, '=' (*note Assignment Ops::). Often, the right +time to do this is at the beginning of execution before any input has +been processed, so that the very first record is read with the proper +separator. To do this, use the special 'BEGIN' pattern (*note +BEGIN/END::). For example, here we set the value of 'FS' to the string +'","': + + awk 'BEGIN { FS = "," } ; { print $2 }' + +Given the input line: + + John Q. 
Smith, 29 Oak St., Walamazoo, MI 42139 + +this 'awk' program extracts and prints the string '*29*Oak*St.'. + + Sometimes the input data contains separator characters that don't +separate fields the way you thought they would. For instance, the +person's name in the example we just used might have a title or suffix +attached, such as: + + John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139 + +The same program would extract '*LXIX' instead of '*29*Oak*St.'. If you +were expecting the program to print the address, you would be surprised. +The moral is to choose your data layout and separator characters +carefully to prevent such problems. (If the data is not in a form that +is easy to process, perhaps you can massage it first with a separate +'awk' program.) + + +File: gawk.info, Node: Default Field Splitting, Next: Regexp Field Splitting, Up: Field Separators + +4.5.1 Whitespace Normally Separates Fields +------------------------------------------ + +Fields are normally separated by whitespace sequences (spaces, TABs, and +newlines), not by single spaces. Two spaces in a row do not delimit an +empty field. The default value of the field separator 'FS' is a string +containing a single space, '" "'. If 'awk' interpreted this value in +the usual way, each space character would separate fields, so two spaces +in a row would make an empty field between them. The reason this does +not happen is that a single space as the value of 'FS' is a special +case--it is taken to specify the default manner of delimiting fields. + + If 'FS' is any other single character, such as '","', then each +occurrence of that character separates two fields. Two consecutive +occurrences delimit an empty field. If the character occurs at the +beginning or the end of the line, that too delimits an empty field. The +space character is the only single character that does not follow these +rules. + + +File: gawk.info, Node: Regexp Field Splitting, Next: Single Character Fields, Prev: Default Field Splitting, Up: Field Separators + +4.5.2 Using Regular Expressions to Separate Fields +-------------------------------------------------- + +The previous node discussed the use of single characters or simple +strings as the value of 'FS'. More generally, the value of 'FS' may be +a string containing any regular expression. In this case, each match in +the record for the regular expression separates fields. For example, +the assignment: + + FS = ", \t" + +makes every area of an input line that consists of a comma followed by a +space and a TAB into a field separator. ('\t' is an "escape sequence" +that stands for a TAB; *note Escape Sequences::, for the complete list +of similar escape sequences.) + + For a less trivial example of a regular expression, try using single +spaces to separate fields the way single commas are used. 'FS' can be +set to '"[ ]"' (left bracket, space, right bracket). This regular +expression matches a single space and nothing else (*note Regexp::). + + There is an important difference between the two cases of 'FS = " "' +(a single space) and 'FS = "[ \t\n]+"' (a regular expression matching +one or more spaces, TABs, or newlines). For both values of 'FS', fields +are separated by "runs" (multiple adjacent occurrences) of spaces, TABs, +and/or newlines. However, when the value of 'FS' is '" "', 'awk' first +strips leading and trailing whitespace from the record and then decides +where the fields are. 
For example, the following pipeline prints 'b': + + $ echo ' a b c d ' | awk '{ print $2 }' + -| b + +However, this pipeline prints 'a' (note the extra spaces around each +letter): + + $ echo ' a b c d ' | awk 'BEGIN { FS = "[ \t\n]+" } + > { print $2 }' + -| a + +In this case, the first field is null, or empty. + + The stripping of leading and trailing whitespace also comes into play +whenever '$0' is recomputed. For instance, study this pipeline: + + $ echo ' a b c d' | awk '{ print; $2 = $2; print }' + -| a b c d + -| a b c d + +The first 'print' statement prints the record as it was read, with +leading whitespace intact. The assignment to '$2' rebuilds '$0' by +concatenating '$1' through '$NF' together, separated by the value of +'OFS' (which is a space by default). Because the leading whitespace was +ignored when finding '$1', it is not part of the new '$0'. Finally, the +last 'print' statement prints the new '$0'. + + There is an additional subtlety to be aware of when using regular +expressions for field splitting. It is not well specified in the POSIX +standard, or anywhere else, what '^' means when splitting fields. Does +the '^' match only at the beginning of the entire record? Or is each +field separator a new string? It turns out that different 'awk' +versions answer this question differently, and you should not rely on +any specific behavior in your programs. (d.c.) + + As a point of information, BWK 'awk' allows '^' to match only at the +beginning of the record. 'gawk' also works this way. For example: + + $ echo 'xxAA xxBxx C' | + > gawk -F '(^x+)|( +)' '{ for (i = 1; i <= NF; i++) + > printf "-->%s<--\n", $i }' + -| --><-- + -| -->AA<-- + -| -->xxBxx<-- + -| -->C<-- + + +File: gawk.info, Node: Single Character Fields, Next: Command Line Field Separator, Prev: Regexp Field Splitting, Up: Field Separators + +4.5.3 Making Each Character a Separate Field +-------------------------------------------- + +There are times when you may want to examine each character of a record +separately. This can be done in 'gawk' by simply assigning the null +string ('""') to 'FS'. (c.e.) In this case, each individual character +in the record becomes a separate field. For example: + + $ echo a b | gawk 'BEGIN { FS = "" } + > { + > for (i = 1; i <= NF; i = i + 1) + > print "Field", i, "is", $i + > }' + -| Field 1 is a + -| Field 2 is + -| Field 3 is b + + Traditionally, the behavior of 'FS' equal to '""' was not defined. +In this case, most versions of Unix 'awk' simply treat the entire record +as only having one field. (d.c.) In compatibility mode (*note +Options::), if 'FS' is the null string, then 'gawk' also behaves this +way. + + +File: gawk.info, Node: Command Line Field Separator, Next: Full Line Fields, Prev: Single Character Fields, Up: Field Separators + +4.5.4 Setting 'FS' from the Command Line +---------------------------------------- + +'FS' can be set on the command line. Use the '-F' option to do so. For +example: + + awk -F, 'PROGRAM' INPUT-FILES + +sets 'FS' to the ',' character. Notice that the option uses an +uppercase 'F' instead of a lowercase 'f'. The latter option ('-f') +specifies a file containing an 'awk' program. + + The value used for the argument to '-F' is processed in exactly the +same way as assignments to the predefined variable 'FS'. Any special +characters in the field separator must be escaped appropriately. For +example, to use a '\' as the field separator on the command line, you +would have to type: + + # same as FS = "\\" + awk -F\\\\ '...' files ... 
+ +Because '\' is used for quoting in the shell, 'awk' sees '-F\\'. Then +'awk' processes the '\\' for escape characters (*note Escape +Sequences::), finally yielding a single '\' to use for the field +separator. + + As a special case, in compatibility mode (*note Options::), if the +argument to '-F' is 't', then 'FS' is set to the TAB character. If you +type '-F\t' at the shell, without any quotes, the '\' gets deleted, so +'awk' figures that you really want your fields to be separated with TABs +and not 't's. Use '-v FS="t"' or '-F"[t]"' on the command line if you +really do want to separate your fields with 't's. Use '-F '\t'' when +not in compatibility mode to specify that TABs separate fields. + + As an example, let's use an 'awk' program file called 'edu.awk' that +contains the pattern '/edu/' and the action 'print $1': + + /edu/ { print $1 } + + Let's also set 'FS' to be the '-' character and run the program on +the file 'mail-list'. The following command prints a list of the names +of the people that work at or attend a university, and the first three +digits of their phone numbers: + + $ awk -F- -f edu.awk mail-list + -| Fabius 555 + -| Samuel 555 + -| Jean + +Note the third line of output. The third line in the original file +looked like this: + + Jean-Paul 555-2127 jeanpaul.campanorum@nyu.edu R + + The '-' as part of the person's name was used as the field separator, +instead of the '-' in the phone number that was originally intended. +This demonstrates why you have to be careful in choosing your field and +record separators. + + Perhaps the most common use of a single character as the field +separator occurs when processing the Unix system password file. On many +Unix systems, each user has a separate entry in the system password +file, with one line per user. The information in these lines is +separated by colons. The first field is the user's login name and the +second is the user's encrypted or shadow password. (A shadow password +is indicated by the presence of a single 'x' in the second field.) A +password file entry might look like this: + + arnold:x:2076:10:Arnold Robbins:/home/arnold:/bin/bash + + The following program searches the system password file and prints +the entries for users whose full name is not indicated: + + awk -F: '$5 == ""' /etc/passwd + + +File: gawk.info, Node: Full Line Fields, Next: Field Splitting Summary, Prev: Command Line Field Separator, Up: Field Separators + +4.5.5 Making the Full Line Be a Single Field +-------------------------------------------- + +Occasionally, it's useful to treat the whole input line as a single +field. This can be done easily and portably simply by setting 'FS' to +'"\n"' (a newline):(1) + + awk -F'\n' 'PROGRAM' FILES ... + +When you do this, '$1' is the same as '$0'. + + Changing 'FS' Does Not Affect the Fields + + According to the POSIX standard, 'awk' is supposed to behave as if +each record is split into fields at the time it is read. In particular, +this means that if you change the value of 'FS' after a record is read, +the values of the fields (i.e., how they were split) should reflect the +old value of 'FS', not the new one. + + However, many older implementations of 'awk' do not work this way. +Instead, they defer splitting the fields until a field is actually +referenced. The fields are split using the _current_ value of 'FS'! +(d.c.) This behavior can be difficult to diagnose. 
The following +example illustrates the difference between the two methods: + + sed 1q /etc/passwd | awk '{ FS = ":" ; print $1 }' + +which usually prints: + + root + +on an incorrect implementation of 'awk', while 'gawk' prints the full +first line of the file, something like: + + root:x:0:0:Root:/: + + (The 'sed'(2) command prints just the first line of '/etc/passwd'.) + + ---------- Footnotes ---------- + + (1) Thanks to Andrew Schorr for this tip. + + (2) The 'sed' utility is a "stream editor." Its behavior is also +defined by the POSIX standard. + + +File: gawk.info, Node: Field Splitting Summary, Prev: Full Line Fields, Up: Field Separators + +4.5.6 Field-Splitting Summary +----------------------------- + +It is important to remember that when you assign a string constant as +the value of 'FS', it undergoes normal 'awk' string processing. For +example, with Unix 'awk' and 'gawk', the assignment 'FS = "\.."' assigns +the character string '".."' to 'FS' (the backslash is stripped). This +creates a regexp meaning "fields are separated by occurrences of any two +characters." If instead you want fields to be separated by a literal +period followed by any single character, use 'FS = "\\.."'. + + The following list summarizes how fields are split, based on the +value of 'FS' ('==' means "is equal to"): + +'FS == " "' + Fields are separated by runs of whitespace. Leading and trailing + whitespace are ignored. This is the default. + +'FS == ANY OTHER SINGLE CHARACTER' + Fields are separated by each occurrence of the character. Multiple + successive occurrences delimit empty fields, as do leading and + trailing occurrences. The character can even be a regexp + metacharacter; it does not need to be escaped. + +'FS == REGEXP' + Fields are separated by occurrences of characters that match + REGEXP. Leading and trailing matches of REGEXP delimit empty + fields. + +'FS == ""' + Each individual character in the record becomes a separate field. + (This is a common extension; it is not specified by the POSIX + standard.) + + 'FS' and 'IGNORECASE' + + The 'IGNORECASE' variable (*note User-modified::) affects field +splitting _only_ when the value of 'FS' is a regexp. It has no effect +when 'FS' is a single character, even if that character is a letter. +Thus, in the following code: + + FS = "c" + IGNORECASE = 1 + $0 = "aCa" + print $1 + +The output is 'aCa'. If you really want to split fields on an +alphabetic character while ignoring case, use a regexp that will do it +for you (e.g., 'FS = "[c]"'). In this case, 'IGNORECASE' will take +effect. + + +File: gawk.info, Node: Constant Size, Next: Splitting By Content, Prev: Field Separators, Up: Reading Files + +4.6 Reading Fixed-Width Data +============================ + +This minor node discusses an advanced feature of 'gawk'. If you are a +novice 'awk' user, you might want to skip it on the first reading. + + 'gawk' provides a facility for dealing with fixed-width fields with +no distinctive field separator. For example, data of this nature arises +in the input for old Fortran programs where numbers are run together, or +in the output of programs that did not anticipate the use of their +output as input for other programs. + + An example of the latter is a table where all the columns are lined +up by the use of a variable number of spaces and _empty fields are just +spaces_. Clearly, 'awk''s normal field splitting based on 'FS' does not +work well in this case. 
Although a portable 'awk' program can use a +series of 'substr()' calls on '$0' (*note String Functions::), this is +awkward and inefficient for a large number of fields. + + The splitting of an input record into fixed-width fields is specified +by assigning a string containing space-separated numbers to the built-in +variable 'FIELDWIDTHS'. Each number specifies the width of the field, +_including_ columns between fields. If you want to ignore the columns +between fields, you can specify the width as a separate field that is +subsequently ignored. It is a fatal error to supply a field width that +has a negative value. The following data is the output of the Unix 'w' +utility. It is useful to illustrate the use of 'FIELDWIDTHS': + + 10:06pm up 21 days, 14:04, 23 users + User tty login idle JCPU PCPU what + hzuo ttyV0 8:58pm 9 5 vi p24.tex + hzang ttyV3 6:37pm 50 -csh + eklye ttyV5 9:53pm 7 1 em thes.tex + dportein ttyV6 8:17pm 1:47 -csh + gierd ttyD3 10:00pm 1 elm + dave ttyD4 9:47pm 4 4 w + brent ttyp0 26Jun91 4:46 26:46 4:41 bash + dave ttyq4 26Jun9115days 46 46 wnewmail + + The following program takes this input, converts the idle time to +number of seconds, and prints out the first two fields and the +calculated idle time: + + BEGIN { FIELDWIDTHS = "9 6 10 6 7 7 35" } + NR > 2 { + idle = $4 + sub(/^ +/, "", idle) # strip leading spaces + if (idle == "") + idle = 0 + if (idle ~ /:/) { + split(idle, t, ":") + idle = t[1] * 60 + t[2] + } + if (idle ~ /days/) + idle *= 24 * 60 * 60 + + print $1, $2, idle + } + + NOTE: The preceding program uses a number of 'awk' features that + haven't been introduced yet. + + Running the program on the data produces the following results: + + hzuo ttyV0 0 + hzang ttyV3 50 + eklye ttyV5 0 + dportein ttyV6 107 + gierd ttyD3 1 + dave ttyD4 0 + brent ttyp0 286 + dave ttyq4 1296000 + + Another (possibly more practical) example of fixed-width input data +is the input from a deck of balloting cards. In some parts of the +United States, voters mark their choices by punching holes in computer +cards. These cards are then processed to count the votes for any +particular candidate or on any particular issue. Because a voter may +choose not to vote on some issue, any column on the card may be empty. +An 'awk' program for processing such data could use the 'FIELDWIDTHS' +feature to simplify reading the data. (Of course, getting 'gawk' to run +on a system with card readers is another story!) + + Assigning a value to 'FS' causes 'gawk' to use 'FS' for field +splitting again. Use 'FS = FS' to make this happen, without having to +know the current value of 'FS'. In order to tell which kind of field +splitting is in effect, use 'PROCINFO["FS"]' (*note Auto-set::). The +value is '"FS"' if regular field splitting is being used, or +'"FIELDWIDTHS"' if fixed-width field splitting is being used: + + if (PROCINFO["FS"] == "FS") + REGULAR FIELD SPLITTING ... + else if (PROCINFO["FS"] == "FIELDWIDTHS") + FIXED-WIDTH FIELD SPLITTING ... + else + CONTENT-BASED FIELD SPLITTING ... (see next minor node) + + This information is useful when writing a function that needs to +temporarily change 'FS' or 'FIELDWIDTHS', read some records, and then +restore the original settings (*note Passwd Functions:: for an example +of such a function). + + +File: gawk.info, Node: Splitting By Content, Next: Multiple Line, Prev: Constant Size, Up: Reading Files + +4.7 Defining Fields by Content +============================== + +This minor node discusses an advanced feature of 'gawk'. 
If you are a +novice 'awk' user, you might want to skip it on the first reading. + + Normally, when using 'FS', 'gawk' defines the fields as the parts of +the record that occur in between each field separator. In other words, +'FS' defines what a field _is not_, instead of what a field _is_. +However, there are times when you really want to define the fields by +what they are, and not by what they are not. + + The most notorious such case is so-called "comma-separated values" +(CSV) data. Many spreadsheet programs, for example, can export their +data into text files, where each record is terminated with a newline, +and fields are separated by commas. If commas only separated the data, +there wouldn't be an issue. The problem comes when one of the fields +contains an _embedded_ comma. In such cases, most programs embed the +field in double quotes.(1) So, we might have data like this: + + Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA + + The 'FPAT' variable offers a solution for cases like this. The value +of 'FPAT' should be a string that provides a regular expression. This +regular expression describes the contents of each field. + + In the case of CSV data as presented here, each field is either +"anything that is not a comma," or "a double quote, anything that is not +a double quote, and a closing double quote." If written as a regular +expression constant (*note Regexp::), we would have +'/([^,]+)|("[^"]+")/'. Writing this as a string requires us to escape +the double quotes, leading to: + + FPAT = "([^,]+)|(\"[^\"]+\")" + + Putting this to use, here is a simple program to parse the data: + + BEGIN { + FPAT = "([^,]+)|(\"[^\"]+\")" + } + + { + print "NF = ", NF + for (i = 1; i <= NF; i++) { + printf("$%d = <%s>\n", i, $i) + } + } + + When run, we get the following: + + $ gawk -f simple-csv.awk addresses.csv + NF = 7 + $1 = <Robbins> + $2 = <Arnold> + $3 = <"1234 A Pretty Street, NE"> + $4 = <MyTown> + $5 = <MyState> + $6 = <12345-6789> + $7 = <USA> + + Note the embedded comma in the value of '$3'. + + A straightforward improvement when processing CSV data of this sort +would be to remove the quotes when they occur, with something like this: + + if (substr($i, 1, 1) == "\"") { + len = length($i) + $i = substr($i, 2, len - 2) # Get text within the two quotes + } + + As with 'FS', the 'IGNORECASE' variable (*note User-modified::) +affects field splitting with 'FPAT'. + + Assigning a value to 'FPAT' overrides field splitting with 'FS' and +with 'FIELDWIDTHS'. Similar to 'FIELDWIDTHS', the value of +'PROCINFO["FS"]' will be '"FPAT"' if content-based field splitting is +being used. + + NOTE: Some programs export CSV data that contains embedded newlines + between the double quotes. 'gawk' provides no way to deal with + this. Even though a formal specification for CSV data exists, + there isn't much more to be done; the 'FPAT' mechanism provides an + elegant solution for the majority of cases, and the 'gawk' + developers are satisfied with that. + + As written, the regexp used for 'FPAT' requires that each field +contain at least one character. A straightforward modification +(changing the first '+' to '*') allows fields to be empty: + + FPAT = "([^,]*)|(\"[^\"]+\")" + + Finally, the 'patsplit()' function makes the same functionality +available for splitting regular strings (*note String Functions::). + + To recap, 'gawk' provides three independent methods to split input +records into fields. 
The mechanism used is based on which of the three +variables--'FS', 'FIELDWIDTHS', or 'FPAT'--was last assigned to. + + ---------- Footnotes ---------- + + (1) The CSV format lacked a formal standard definition for many +years. RFC 4180 (http://www.ietf.org/rfc/rfc4180.txt) standardizes the +most common practices. + + +File: gawk.info, Node: Multiple Line, Next: Getline, Prev: Splitting By Content, Up: Reading Files + +4.8 Multiple-Line Records +========================= + +In some databases, a single line cannot conveniently hold all the +information in one entry. In such cases, you can use multiline records. +The first step in doing this is to choose your data format. + + One technique is to use an unusual character or string to separate +records. For example, you could use the formfeed character (written +'\f' in 'awk', as in C) to separate them, making each record a page of +the file. To do this, just set the variable 'RS' to '"\f"' (a string +containing the formfeed character). Any other character could equally +well be used, as long as it won't be part of the data in a record. + + Another technique is to have blank lines separate records. By a +special dispensation, an empty string as the value of 'RS' indicates +that records are separated by one or more blank lines. When 'RS' is set +to the empty string, each record always ends at the first blank line +encountered. The next record doesn't start until the first nonblank +line that follows. No matter how many blank lines appear in a row, they +all act as one record separator. (Blank lines must be completely empty; +lines that contain only whitespace do not count.) + + You can achieve the same effect as 'RS = ""' by assigning the string +'"\n\n+"' to 'RS'. This regexp matches the newline at the end of the +record and one or more blank lines after the record. In addition, a +regular expression always matches the longest possible sequence when +there is a choice (*note Leftmost Longest::). So, the next record +doesn't start until the first nonblank line that follows--no matter how +many blank lines appear in a row, they are considered one record +separator. + + However, there is an important difference between 'RS = ""' and 'RS = +"\n\n+"'. In the first case, leading newlines in the input data file +are ignored, and if a file ends without extra blank lines after the last +record, the final newline is removed from the record. In the second +case, this special processing is not done. (d.c.) + + Now that the input is separated into records, the second step is to +separate the fields in the records. One way to do this is to divide +each of the lines into fields in the normal manner. This happens by +default as the result of a special feature. When 'RS' is set to the +empty string _and_ 'FS' is set to a single character, the newline +character _always_ acts as a field separator. This is in addition to +whatever field separations result from 'FS'.(1) + + The original motivation for this special exception was probably to +provide useful behavior in the default case (i.e., 'FS' is equal to +'" "'). This feature can be a problem if you really don't want the +newline character to separate fields, because there is no way to prevent +it. However, you can work around this by using the 'split()' function +to break up the record manually (*note String Functions::). If you have +a single-character field separator, you can work around the special +feature in a different way, by making 'FS' into a regexp for that single +character. 
For example, if the field separator is a percent character, +instead of 'FS = "%"', use 'FS = "[%]"'. + + Another way to separate fields is to put each field on a separate +line: to do this, just set the variable 'FS' to the string '"\n"'. +(This single-character separator matches a single newline.) A practical +example of a data file organized this way might be a mailing list, where +blank lines separate the entries. Consider a mailing list in a file +named 'addresses', which looks like this: + + Jane Doe + 123 Main Street + Anywhere, SE 12345-6789 + + John Smith + 456 Tree-lined Avenue + Smallville, MW 98765-4321 + ... + +A simple program to process this file is as follows: + + # addrs.awk --- simple mailing list program + + # Records are separated by blank lines. + # Each line is one field. + BEGIN { RS = "" ; FS = "\n" } + + { + print "Name is:", $1 + print "Address is:", $2 + print "City and State are:", $3 + print "" + } + + Running the program produces the following output: + + $ awk -f addrs.awk addresses + -| Name is: Jane Doe + -| Address is: 123 Main Street + -| City and State are: Anywhere, SE 12345-6789 + -| + -| Name is: John Smith + -| Address is: 456 Tree-lined Avenue + -| City and State are: Smallville, MW 98765-4321 + -| + ... + + *Note Labels Program:: for a more realistic program dealing with +address lists. The following list summarizes how records are split, +based on the value of 'RS'. ('==' means "is equal to.") + +'RS == "\n"' + Records are separated by the newline character ('\n'). In effect, + every line in the data file is a separate record, including blank + lines. This is the default. + +'RS == ANY SINGLE CHARACTER' + Records are separated by each occurrence of the character. + Multiple successive occurrences delimit empty records. + +'RS == ""' + Records are separated by runs of blank lines. When 'FS' is a + single character, then the newline character always serves as a + field separator, in addition to whatever value 'FS' may have. + Leading and trailing newlines in a file are ignored. + +'RS == REGEXP' + Records are separated by occurrences of characters that match + REGEXP. Leading and trailing matches of REGEXP delimit empty + records. (This is a 'gawk' extension; it is not specified by the + POSIX standard.) + + If not in compatibility mode (*note Options::), 'gawk' sets 'RT' to +the input text that matched the value specified by 'RS'. But if the +input file ended without any text that matches 'RS', then 'gawk' sets +'RT' to the null string. + + ---------- Footnotes ---------- + + (1) When 'FS' is the null string ('""') or a regexp, this special +feature of 'RS' does not apply. It does apply to the default field +separator of a single space: 'FS = " "'. + + +File: gawk.info, Node: Getline, Next: Read Timeout, Prev: Multiple Line, Up: Reading Files + +4.9 Explicit Input with 'getline' +================================= + +So far we have been getting our input data from 'awk''s main input +stream--either the standard input (usually your keyboard, sometimes the +output from another program) or the files specified on the command line. +The 'awk' language has a special built-in command called 'getline' that +can be used to read input under your explicit control. + + The 'getline' command is used in several different ways and should +_not_ be used by beginners. The examples that follow the explanation of +the 'getline' command include material that has not been covered yet. 
+Therefore, come back and study the 'getline' command _after_ you have +reviewed the rest of this Info file and have a good knowledge of how +'awk' works. + + The 'getline' command returns 1 if it finds a record and 0 if it +encounters the end of the file. If there is some error in getting a +record, such as a file that cannot be opened, then 'getline' returns -1. +In this case, 'gawk' sets the variable 'ERRNO' to a string describing +the error that occurred. + + If 'ERRNO' indicates that the I/O operation may be retried, and +'PROCINFO["INPUT", "RETRY"]' is set, then 'getline' returns -2 instead +of -1, and further calls to 'getline' may be attempted. *Note Retrying +Input:: for further information about this feature. + + In the following examples, COMMAND stands for a string value that +represents a shell command. + + NOTE: When '--sandbox' is specified (*note Options::), reading + lines from files, pipes, and coprocesses is disabled. + +* Menu: + +* Plain Getline:: Using 'getline' with no arguments. +* Getline/Variable:: Using 'getline' into a variable. +* Getline/File:: Using 'getline' from a file. +* Getline/Variable/File:: Using 'getline' into a variable from a + file. +* Getline/Pipe:: Using 'getline' from a pipe. +* Getline/Variable/Pipe:: Using 'getline' into a variable from a + pipe. +* Getline/Coprocess:: Using 'getline' from a coprocess. +* Getline/Variable/Coprocess:: Using 'getline' into a variable from a + coprocess. +* Getline Notes:: Important things to know about 'getline'. +* Getline Summary:: Summary of 'getline' Variants. + + +File: gawk.info, Node: Plain Getline, Next: Getline/Variable, Up: Getline + +4.9.1 Using 'getline' with No Arguments +--------------------------------------- + +The 'getline' command can be used without arguments to read input from +the current input file. All it does in this case is read the next input +record and split it up into fields. This is useful if you've finished +processing the current record, but want to do some special processing on +the next record _right now_. For example: + + # Remove text between /* and */, inclusive + { + if ((i = index($0, "/*")) != 0) { + out = substr($0, 1, i - 1) # leading part of the string + rest = substr($0, i + 2) # ... */ ... + j = index(rest, "*/") # is */ in trailing part? + if (j > 0) { + rest = substr(rest, j + 2) # remove comment + } else { + while (j == 0) { + # get more text + if (getline <= 0) { + print("unexpected EOF or error:", ERRNO) > "/dev/stderr" + exit + } + # build up the line using string concatenation + rest = rest $0 + j = index(rest, "*/") # is */ in trailing part? + if (j != 0) { + rest = substr(rest, j + 2) + break + } + } + } + # build up the output line using string concatenation + $0 = out rest + } + print $0 + } + + This 'awk' program deletes C-style comments ('/* ... */') from the +input. It uses a number of features we haven't covered yet, including +string concatenation (*note Concatenation::) and the 'index()' and +'substr()' built-in functions (*note String Functions::). By replacing +the 'print $0' with other statements, you could perform more complicated +processing on the decommented input, such as searching for matches of a +regular expression. (This program has a subtle problem--it does not +work if one comment ends and another begins on the same line.) + + This form of the 'getline' command sets 'NF', 'NR', 'FNR', 'RT', and +the value of '$0'. + + NOTE: The new value of '$0' is used to test the patterns of any + subsequent rules. 
The original value of '$0' that triggered the + rule that executed 'getline' is lost. By contrast, the 'next' + statement reads a new record but immediately begins processing it + normally, starting with the first rule in the program. *Note Next + Statement::. + + +File: gawk.info, Node: Getline/Variable, Next: Getline/File, Prev: Plain Getline, Up: Getline + +4.9.2 Using 'getline' into a Variable +------------------------------------- + +You can use 'getline VAR' to read the next record from 'awk''s input +into the variable VAR. No other processing is done. For example, +suppose the next line is a comment or a special string, and you want to +read it without triggering any rules. This form of 'getline' allows you +to read that line and store it in a variable so that the main +read-a-line-and-check-each-rule loop of 'awk' never sees it. The +following example swaps every two lines of input: + + { + if ((getline tmp) > 0) { + print tmp + print $0 + } else + print $0 + } + +It takes the following list: + + wan + tew + free + phore + +and produces these results: + + tew + wan + phore + free + + The 'getline' command used in this way sets only the variables 'NR', +'FNR', and 'RT' (and, of course, VAR). The record is not split into +fields, so the values of the fields (including '$0') and the value of +'NF' do not change. + + +File: gawk.info, Node: Getline/File, Next: Getline/Variable/File, Prev: Getline/Variable, Up: Getline + +4.9.3 Using 'getline' from a File +--------------------------------- + +Use 'getline < FILE' to read the next record from FILE. Here, FILE is a +string-valued expression that specifies the file name. '< FILE' is +called a "redirection" because it directs input to come from a different +place. For example, the following program reads its input record from +the file 'secondary.input' when it encounters a first field with a value +equal to 10 in the current input file: + + { + if ($1 == 10) { + getline < "secondary.input" + print + } else + print + } + + Because the main input stream is not used, the values of 'NR' and +'FNR' are not changed. However, the record it reads is split into +fields in the normal manner, so the values of '$0' and the other fields +are changed, resulting in a new value of 'NF'. 'RT' is also set. + + According to POSIX, 'getline < EXPRESSION' is ambiguous if EXPRESSION +contains unparenthesized operators other than '$'; for example, 'getline +< dir "/" file' is ambiguous because the concatenation operator (not +discussed yet; *note Concatenation::) is not parenthesized. You should +write it as 'getline < (dir "/" file)' if you want your program to be +portable to all 'awk' implementations. + + +File: gawk.info, Node: Getline/Variable/File, Next: Getline/Pipe, Prev: Getline/File, Up: Getline + +4.9.4 Using 'getline' into a Variable from a File +------------------------------------------------- + +Use 'getline VAR < FILE' to read input from the file FILE, and put it in +the variable VAR. As earlier, FILE is a string-valued expression that +specifies the file from which to read. + + In this version of 'getline', none of the predefined variables are +changed and the record is not split into fields. The only variable +changed is VAR.(1) For example, the following program copies all the +input files to the output, except for records that say +'@include FILENAME'. 
Such a record is replaced by the contents of the +file FILENAME: + + { + if (NF == 2 && $1 == "@include") { + while ((getline line < $2) > 0) + print line + close($2) + } else + print + } + + Note here how the name of the extra input file is not built into the +program; it is taken directly from the data, specifically from the +second field on the '@include' line. + + The 'close()' function is called to ensure that if two identical +'@include' lines appear in the input, the entire specified file is +included twice. *Note Close Files And Pipes::. + + One deficiency of this program is that it does not process nested +'@include' statements (i.e., '@include' statements in included files) +the way a true macro preprocessor would. *Note Igawk Program:: for a +program that does handle nested '@include' statements. + + ---------- Footnotes ---------- + + (1) This is not quite true. 'RT' could be changed if 'RS' is a +regular expression. + + +File: gawk.info, Node: Getline/Pipe, Next: Getline/Variable/Pipe, Prev: Getline/Variable/File, Up: Getline + +4.9.5 Using 'getline' from a Pipe +--------------------------------- + + Omniscience has much to recommend it. Failing that, attention to + details would be useful. + -- _Brian Kernighan_ + + The output of a command can also be piped into 'getline', using +'COMMAND | getline'. In this case, the string COMMAND is run as a shell +command and its output is piped into 'awk' to be used as input. This +form of 'getline' reads one record at a time from the pipe. For +example, the following program copies its input to its output, except +for lines that begin with '@execute', which are replaced by the output +produced by running the rest of the line as a shell command: + + { + if ($1 == "@execute") { + tmp = substr($0, 10) # Remove "@execute" + while ((tmp | getline) > 0) + print + close(tmp) + } else + print + } + +The 'close()' function is called to ensure that if two identical +'@execute' lines appear in the input, the command is run for each one. +*Note Close Files And Pipes::. Given the input: + + foo + bar + baz + @execute who + bletch + +the program might produce: + + foo + bar + baz + arnold ttyv0 Jul 13 14:22 + miriam ttyp0 Jul 13 14:23 (murphy:0) + bill ttyp1 Jul 13 14:23 (murphy:0) + bletch + +Notice that this program ran the command 'who' and printed the result. +(If you try this program yourself, you will of course get different +results, depending upon who is logged in on your system.) + + This variation of 'getline' splits the record into fields, sets the +value of 'NF', and recomputes the value of '$0'. The values of 'NR' and +'FNR' are not changed. 'RT' is set. + + According to POSIX, 'EXPRESSION | getline' is ambiguous if EXPRESSION +contains unparenthesized operators other than '$'--for example, '"echo " +"date" | getline' is ambiguous because the concatenation operator is not +parenthesized. You should write it as '("echo " "date") | getline' if +you want your program to be portable to all 'awk' implementations. + + NOTE: Unfortunately, 'gawk' has not been consistent in its + treatment of a construct like '"echo " "date" | getline'. Most + versions, including the current version, treat it at as '("echo " + "date") | getline'. (This is also how BWK 'awk' behaves.) Some + versions instead treat it as '"echo " ("date" | getline)'. (This + is how 'mawk' behaves.) In short, _always_ use explicit + parentheses, and then you won't have to worry. 
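+
+   To make that advice concrete, here is a small sketch in which the
+command is built up by concatenation and then explicitly parenthesized
+(it assumes a 'date' utility that accepts the '+%Y' format, which
+prints the current year):
+
+     BEGIN {
+         cmd = "date"
+         fmt = " +%Y"
+         # Concatenate first, then pipe the result into getline
+         while (((cmd fmt) | getline) > 0)
+             print "this year is", $0
+         close(cmd fmt)
+     }
+
+Because of the parentheses, there is no doubt that the entire string
+'date +%Y' is the command whose output is read.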
+ + +File: gawk.info, Node: Getline/Variable/Pipe, Next: Getline/Coprocess, Prev: Getline/Pipe, Up: Getline + +4.9.6 Using 'getline' into a Variable from a Pipe +------------------------------------------------- + +When you use 'COMMAND | getline VAR', the output of COMMAND is sent +through a pipe to 'getline' and into the variable VAR. For example, the +following program reads the current date and time into the variable +'current_time', using the 'date' utility, and then prints it: + + BEGIN { + "date" | getline current_time + close("date") + print "Report printed on " current_time + } + + In this version of 'getline', none of the predefined variables are +changed and the record is not split into fields. However, 'RT' is set. + + According to POSIX, 'EXPRESSION | getline VAR' is ambiguous if +EXPRESSION contains unparenthesized operators other than '$'; for +example, '"echo " "date" | getline VAR' is ambiguous because the +concatenation operator is not parenthesized. You should write it as +'("echo " "date") | getline VAR' if you want your program to be portable +to other 'awk' implementations. + + +File: gawk.info, Node: Getline/Coprocess, Next: Getline/Variable/Coprocess, Prev: Getline/Variable/Pipe, Up: Getline + +4.9.7 Using 'getline' from a Coprocess +-------------------------------------- + +Reading input into 'getline' from a pipe is a one-way operation. The +command that is started with 'COMMAND | getline' only sends data _to_ +your 'awk' program. + + On occasion, you might want to send data to another program for +processing and then read the results back. 'gawk' allows you to start a +"coprocess", with which two-way communications are possible. This is +done with the '|&' operator. Typically, you write data to the coprocess +first and then read the results back, as shown in the following: + + print "SOME QUERY" |& "db_server" + "db_server" |& getline + +which sends a query to 'db_server' and then reads the results. + + The values of 'NR' and 'FNR' are not changed, because the main input +stream is not used. However, the record is split into fields in the +normal manner, thus changing the values of '$0', of the other fields, +and of 'NF' and 'RT'. + + Coprocesses are an advanced feature. They are discussed here only +because this is the minor node on 'getline'. *Note Two-way I/O::, where +coprocesses are discussed in more detail. + + +File: gawk.info, Node: Getline/Variable/Coprocess, Next: Getline Notes, Prev: Getline/Coprocess, Up: Getline + +4.9.8 Using 'getline' into a Variable from a Coprocess +------------------------------------------------------ + +When you use 'COMMAND |& getline VAR', the output from the coprocess +COMMAND is sent through a two-way pipe to 'getline' and into the +variable VAR. + + In this version of 'getline', none of the predefined variables are +changed and the record is not split into fields. The only variable +changed is VAR. However, 'RT' is set. + + Coprocesses are an advanced feature. They are discussed here only +because this is the minor node on 'getline'. *Note Two-way I/O::, where +coprocesses are discussed in more detail. 
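+
+   As a minimal sketch of this form (the 'sort' command here is only an
+example; any line-oriented filter would do, and the two-argument
+'close()' used to signal end of input is described in *note Two-way
+I/O::):
+
+     BEGIN {
+         cmd = "sort"             # start the coprocess
+         print "zebra" |& cmd     # send it some data
+         print "apple" |& cmd
+         close(cmd, "to")         # 'sort' now sees end-of-file
+         while ((cmd |& getline line) > 0)   # read each result into 'line'
+             print line
+         close(cmd)
+     }
+
+Each output line from 'sort' is read into the variable 'line' without
+being split into fields; the program prints 'apple' and then 'zebra'.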
+ + +File: gawk.info, Node: Getline Notes, Next: Getline Summary, Prev: Getline/Variable/Coprocess, Up: Getline + +4.9.9 Points to Remember About 'getline' +---------------------------------------- + +Here are some miscellaneous points about 'getline' that you should bear +in mind: + + * When 'getline' changes the value of '$0' and 'NF', 'awk' does _not_ + automatically jump to the start of the program and start testing + the new record against every pattern. However, the new record is + tested against any subsequent rules. + + * Some very old 'awk' implementations limit the number of pipelines + that an 'awk' program may have open to just one. In 'gawk', there + is no such limit. You can open as many pipelines (and coprocesses) + as the underlying operating system permits. + + * An interesting side effect occurs if you use 'getline' without a + redirection inside a 'BEGIN' rule. Because an unredirected + 'getline' reads from the command-line data files, the first + 'getline' command causes 'awk' to set the value of 'FILENAME'. + Normally, 'FILENAME' does not have a value inside 'BEGIN' rules, + because you have not yet started to process the command-line data + files. (d.c.) (See *note BEGIN/END::; also *note Auto-set::.) + + * Using 'FILENAME' with 'getline' ('getline < FILENAME') is likely to + be a source of confusion. 'awk' opens a separate input stream from + the current input file. However, by not using a variable, '$0' and + 'NF' are still updated. If you're doing this, it's probably by + accident, and you should reconsider what it is you're trying to + accomplish. + + * *note Getline Summary::, presents a table summarizing the 'getline' + variants and which variables they can affect. It is worth noting + that those variants that do not use redirection can cause + 'FILENAME' to be updated if they cause 'awk' to start reading a new + input file. + + * If the variable being assigned is an expression with side effects, + different versions of 'awk' behave differently upon encountering + end-of-file. Some versions don't evaluate the expression; many + versions (including 'gawk') do. Here is an example, courtesy of + Duncan Moore: + + BEGIN { + system("echo 1 > f") + while ((getline a[++c] < "f") > 0) { } + print c + } + + Here, the side effect is the '++c'. Is 'c' incremented if + end-of-file is encountered before the element in 'a' is assigned? + + 'gawk' treats 'getline' like a function call, and evaluates the + expression 'a[++c]' before attempting to read from 'f'. However, + some versions of 'awk' only evaluate the expression once they know + that there is a string value to be assigned. + + +File: gawk.info, Node: Getline Summary, Prev: Getline Notes, Up: Getline + +4.9.10 Summary of 'getline' Variants +------------------------------------ + +*note Table 4.1: table-getline-variants. summarizes the eight variants +of 'getline', listing which predefined variables are set by each one, +and whether the variant is standard or a 'gawk' extension. Note: for +each variant, 'gawk' sets the 'RT' predefined variable. 
+ +Variant Effect 'awk' / 'gawk' +------------------------------------------------------------------------- +'getline' Sets '$0', 'NF', 'FNR', 'awk' + 'NR', and 'RT' +'getline' VAR Sets VAR, 'FNR', 'NR', 'awk' + and 'RT' +'getline <' FILE Sets '$0', 'NF', and 'RT' 'awk' +'getline VAR < FILE' Sets VAR and 'RT' 'awk' +COMMAND '| getline' Sets '$0', 'NF', and 'RT' 'awk' +COMMAND '| getline' Sets VAR and 'RT' 'awk' +VAR +COMMAND '|& getline' Sets '$0', 'NF', and 'RT' 'gawk' +COMMAND '|& getline' Sets VAR and 'RT' 'gawk' +VAR + +Table 4.1: 'getline' variants and what they set + + +File: gawk.info, Node: Read Timeout, Next: Retrying Input, Prev: Getline, Up: Reading Files + +4.10 Reading Input with a Timeout +================================= + +This minor node describes a feature that is specific to 'gawk'. + + You may specify a timeout in milliseconds for reading input from the +keyboard, a pipe, or two-way communication, including TCP/IP sockets. +This can be done on a per-input, per-command, or per-connection basis, +by setting a special element in the 'PROCINFO' array (*note Auto-set::): + + PROCINFO["input_name", "READ_TIMEOUT"] = TIMEOUT IN MILLISECONDS + + When set, this causes 'gawk' to time out and return failure if no +data is available to read within the specified timeout period. For +example, a TCP client can decide to give up on receiving any response +from the server after a certain amount of time: + + Service = "/inet/tcp/0/localhost/daytime" + PROCINFO[Service, "READ_TIMEOUT"] = 100 + if ((Service |& getline) > 0) + print $0 + else if (ERRNO != "") + print ERRNO + + Here is how to read interactively from the user(1) without waiting +for more than five seconds: + + PROCINFO["/dev/stdin", "READ_TIMEOUT"] = 5000 + while ((getline < "/dev/stdin") > 0) + print $0 + + 'gawk' terminates the read operation if input does not arrive after +waiting for the timeout period, returns failure, and sets 'ERRNO' to an +appropriate string value. A negative or zero value for the timeout is +the same as specifying no timeout at all. + + A timeout can also be set for reading from the keyboard in the +implicit loop that reads input records and matches them against +patterns, like so: + + $ gawk 'BEGIN { PROCINFO["-", "READ_TIMEOUT"] = 5000 } + > { print "You entered: " $0 }' + gawk + -| You entered: gawk + + In this case, failure to respond within five seconds results in the +following error message: + + error-> gawk: cmd. line:2: (FILENAME=- FNR=1) fatal: error reading input file `-': Connection timed out + + The timeout can be set or changed at any time, and will take effect +on the next attempt to read from the input device. In the following +example, we start with a timeout value of one second, and progressively +reduce it by one-tenth of a second until we wait indefinitely for the +input to arrive: + + PROCINFO[Service, "READ_TIMEOUT"] = 1000 + while ((Service |& getline) > 0) { + print $0 + PROCINFO[Service, "READ_TIMEOUT"] -= 100 + } + + NOTE: You should not assume that the read operation will block + exactly after the tenth record has been printed. It is possible + that 'gawk' will read and buffer more than one record's worth of + data the first time. Because of this, changing the value of + timeout like in the preceding example is not very useful. + + If the 'PROCINFO' element is not present and the 'GAWK_READ_TIMEOUT' +environment variable exists, 'gawk' uses its value to initialize the +timeout value. 
The exclusive use of the environment variable to specify +timeout has the disadvantage of not being able to control it on a +per-command or per-connection basis. + + 'gawk' considers a timeout event to be an error even though the +attempt to read from the underlying device may succeed in a later +attempt. This is a limitation, and it also means that you cannot use +this to multiplex input from two or more sources. *Note Retrying +Input:: for a way to enable later I/O attempts to succeed. + + Assigning a timeout value prevents read operations from blocking +indefinitely. But bear in mind that there are other ways 'gawk' can +stall waiting for an input device to be ready. A network client can +sometimes take a long time to establish a connection before it can start +reading any data, or the attempt to open a FIFO special file for reading +can block indefinitely until some other process opens it for writing. + + ---------- Footnotes ---------- + + (1) This assumes that standard input is the keyboard. + + +File: gawk.info, Node: Retrying Input, Next: Command-line directories, Prev: Read Timeout, Up: Reading Files + +4.11 Retrying Reads After Certain Input Errors +============================================== + +This minor node describes a feature that is specific to 'gawk'. + + When 'gawk' encounters an error while reading input, by default +'getline' returns -1, and subsequent attempts to read from that file +result in an end-of-file indication. However, you may optionally +instruct 'gawk' to allow I/O to be retried when certain errors are +encountered by setting a special element in the 'PROCINFO' array (*note +Auto-set::): + + PROCINFO["INPUT_NAME", "RETRY"] = 1 + + When this element exists, 'gawk' checks the value of the system (C +language) 'errno' variable when an I/O error occurs. If 'errno' +indicates a subsequent I/O attempt may succeed, 'getline' instead +returns -2 and further calls to 'getline' may succeed. This applies to +the 'errno' values 'EAGAIN', 'EWOULDBLOCK', 'EINTR', or 'ETIMEDOUT'. + + This feature is useful in conjunction with 'PROCINFO["INPUT_NAME", +"READ_TIMEOUT"]' or situations where a file descriptor has been +configured to behave in a non-blocking fashion. + + +File: gawk.info, Node: Command-line directories, Next: Input Summary, Prev: Retrying Input, Up: Reading Files + +4.12 Directories on the Command Line +==================================== + +According to the POSIX standard, files named on the 'awk' command line +must be text files; it is a fatal error if they are not. Most versions +of 'awk' treat a directory on the command line as a fatal error. + + By default, 'gawk' produces a warning for a directory on the command +line, but otherwise ignores it. This makes it easier to use shell +wildcards with your 'awk' program: + + $ gawk -f whizprog.awk * Directories could kill this program + + If either of the '--posix' or '--traditional' options is given, then +'gawk' reverts to treating a directory on the command line as a fatal +error. + + *Note Extension Sample Readdir:: for a way to treat directories as +usable data from an 'awk' program. + + +File: gawk.info, Node: Input Summary, Next: Input Exercises, Prev: Command-line directories, Up: Reading Files + +4.13 Summary +============ + + * Input is split into records based on the value of 'RS'. The + possibilities are as follows: + + Value of 'RS' Records are split on 'awk' / 'gawk' + ... 
+ --------------------------------------------------------------------------- + Any single That character 'awk' + character + The empty string Runs of two or more 'awk' + ('""') newlines + A regexp Text that matches the 'gawk' + regexp + + * 'FNR' indicates how many records have been read from the current + input file; 'NR' indicates how many records have been read in + total. + + * 'gawk' sets 'RT' to the text matched by 'RS'. + + * After splitting the input into records, 'awk' further splits the + records into individual fields, named '$1', '$2', and so on. '$0' + is the whole record, and 'NF' indicates how many fields there are. + The default way to split fields is between whitespace characters. + + * Fields may be referenced using a variable, as in '$NF'. Fields may + also be assigned values, which causes the value of '$0' to be + recomputed when it is later referenced. Assigning to a field with + a number greater than 'NF' creates the field and rebuilds the + record, using 'OFS' to separate the fields. Incrementing 'NF' does + the same thing. Decrementing 'NF' throws away fields and rebuilds + the record. + + * Field splitting is more complicated than record splitting: + + Field separator value Fields are split ... 'awk' / + 'gawk' + --------------------------------------------------------------------------- + 'FS == " "' On runs of whitespace 'awk' + 'FS == ANY SINGLE On that character 'awk' + CHARACTER' + 'FS == REGEXP' On text matching the regexp 'awk' + 'FS == ""' Such that each individual 'gawk' + character is a separate + field + 'FIELDWIDTHS == LIST OF Based on character position 'gawk' + COLUMNS' + 'FPAT == REGEXP' On the text surrounding 'gawk' + text matching the regexp + + * Using 'FS = "\n"' causes the entire record to be a single field + (assuming that newlines separate records). + + * 'FS' may be set from the command line using the '-F' option. This + can also be done using command-line variable assignment. + + * Use 'PROCINFO["FS"]' to see how fields are being split. + + * Use 'getline' in its various forms to read additional records from + the default input stream, from a file, or from a pipe or coprocess. + + * Use 'PROCINFO[FILE, "READ_TIMEOUT"]' to cause reads to time out for + FILE. + + * Directories on the command line are fatal for standard 'awk'; + 'gawk' ignores them if not in POSIX mode. + + +File: gawk.info, Node: Input Exercises, Prev: Input Summary, Up: Reading Files + +4.14 Exercises +============== + + 1. Using the 'FIELDWIDTHS' variable (*note Constant Size::), write a + program to read election data, where each record represents one + voter's votes. Come up with a way to define which columns are + associated with each ballot item, and print the total votes, + including abstentions, for each item. + + 2. *note Plain Getline::, presented a program to remove C-style + comments ('/* ... */') from the input. That program does not work + if one comment ends on one line and another one starts later on the + same line. That can be fixed by making one simple change. What is + it? + + +File: gawk.info, Node: Printing, Next: Expressions, Prev: Reading Files, Up: Top + +5 Printing Output +***************** + +One of the most common programming actions is to "print", or output, +some or all of the input. Use the 'print' statement for simple output, +and the 'printf' statement for fancier formatting. The 'print' +statement is not limited when computing _which_ values to print. 
+However, with two exceptions, you cannot specify _how_ to print +them--how many columns, whether to use exponential notation or not, and +so on. (For the exceptions, *note Output Separators:: and *note +OFMT::.) For printing with specifications, you need the 'printf' +statement (*note Printf::). + + Besides basic and formatted printing, this major node also covers I/O +redirections to files and pipes, introduces the special file names that +'gawk' processes internally, and discusses the 'close()' built-in +function. + +* Menu: + +* Print:: The 'print' statement. +* Print Examples:: Simple examples of 'print' statements. +* Output Separators:: The output separators and how to change them. +* OFMT:: Controlling Numeric Output With 'print'. +* Printf:: The 'printf' statement. +* Redirection:: How to redirect output to multiple files and + pipes. +* Special FD:: Special files for I/O. +* Special Files:: File name interpretation in 'gawk'. + 'gawk' allows access to inherited file + descriptors. +* Close Files And Pipes:: Closing Input and Output Files and Pipes. +* Nonfatal:: Enabling Nonfatal Output. +* Output Summary:: Output summary. +* Output Exercises:: Exercises. + + +File: gawk.info, Node: Print, Next: Print Examples, Up: Printing + +5.1 The 'print' Statement +========================= + +Use the 'print' statement to produce output with simple, standardized +formatting. You specify only the strings or numbers to print, in a list +separated by commas. They are output, separated by single spaces, +followed by a newline. The statement looks like this: + + print ITEM1, ITEM2, ... + +The entire list of items may be optionally enclosed in parentheses. The +parentheses are necessary if any of the item expressions uses the '>' +relational operator; otherwise it could be confused with an output +redirection (*note Redirection::). + + The items to print can be constant strings or numbers, fields of the +current record (such as '$1'), variables, or any 'awk' expression. +Numeric values are converted to strings and then printed. + + The simple statement 'print' with no items is equivalent to 'print +$0': it prints the entire current record. To print a blank line, use +'print ""'. To print a fixed piece of text, use a string constant, such +as '"Don't Panic"', as one item. If you forget to use the double-quote +characters, your text is taken as an 'awk' expression, and you will +probably get an error. Keep in mind that a space is printed between any +two items. + + Note that the 'print' statement is a statement and not an +expression--you can't use it in the pattern part of a pattern-action +statement, for example. + + +File: gawk.info, Node: Print Examples, Next: Output Separators, Prev: Print, Up: Printing + +5.2 'print' Statement Examples +============================== + +Each 'print' statement makes at least one line of output. However, it +isn't limited to only one line. If an item value is a string containing +a newline, the newline is output along with the rest of the string. A +single 'print' statement can make any number of lines this way. 
+ + The following is an example of printing a string that contains +embedded newlines (the '\n' is an escape sequence, used to represent the +newline character; *note Escape Sequences::): + + $ awk 'BEGIN { print "line one\nline two\nline three" }' + -| line one + -| line two + -| line three + + The next example, which is run on the 'inventory-shipped' file, +prints the first two fields of each input record, with a space between +them: + + $ awk '{ print $1, $2 }' inventory-shipped + -| Jan 13 + -| Feb 15 + -| Mar 15 + ... + + A common mistake in using the 'print' statement is to omit the comma +between two items. This often has the effect of making the items run +together in the output, with no space. The reason for this is that +juxtaposing two string expressions in 'awk' means to concatenate them. +Here is the same program, without the comma: + + $ awk '{ print $1 $2 }' inventory-shipped + -| Jan13 + -| Feb15 + -| Mar15 + ... + + To someone unfamiliar with the 'inventory-shipped' file, neither +example's output makes much sense. A heading line at the beginning +would make it clearer. Let's add some headings to our table of months +('$1') and green crates shipped ('$2'). We do this using a 'BEGIN' rule +(*note BEGIN/END::) so that the headings are only printed once: + + awk 'BEGIN { print "Month Crates" + print "----- ------" } + { print $1, $2 }' inventory-shipped + +When run, the program prints the following: + + Month Crates + ----- ------ + Jan 13 + Feb 15 + Mar 15 + ... + +The only problem, however, is that the headings and the table data don't +line up! We can fix this by printing some spaces between the two +fields: + + awk 'BEGIN { print "Month Crates" + print "----- ------" } + { print $1, " ", $2 }' inventory-shipped + + Lining up columns this way can get pretty complicated when there are +many columns to fix. Counting spaces for two or three columns is +simple, but any more than this can take up a lot of time. This is why +the 'printf' statement was created (*note Printf::); one of its +specialties is lining up columns of data. + + NOTE: You can continue either a 'print' or 'printf' statement + simply by putting a newline after any comma (*note + Statements/Lines::). + + +File: gawk.info, Node: Output Separators, Next: OFMT, Prev: Print Examples, Up: Printing + +5.3 Output Separators +===================== + +As mentioned previously, a 'print' statement contains a list of items +separated by commas. In the output, the items are normally separated by +single spaces. However, this doesn't need to be the case; a single +space is simply the default. Any string of characters may be used as +the "output field separator" by setting the predefined variable 'OFS'. +The initial value of this variable is the string '" "' (i.e., a single +space). + + The output from an entire 'print' statement is called an "output +record". Each 'print' statement outputs one output record, and then +outputs a string called the "output record separator" (or 'ORS'). The +initial value of 'ORS' is the string '"\n"' (i.e., a newline character). +Thus, each 'print' statement normally makes a separate line. + + In order to change how output fields and records are separated, +assign new values to the variables 'OFS' and 'ORS'. The usual place to +do this is in the 'BEGIN' rule (*note BEGIN/END::), so that it happens +before any input is processed. It can also be done with assignments on +the command line, before the names of the input files, or using the '-v' +command-line option (*note Options::). 
The following example prints the +first and second fields of each input record, separated by a semicolon, +with a blank line added after each newline: + + $ awk 'BEGIN { OFS = ";"; ORS = "\n\n" } + > { print $1, $2 }' mail-list + -| Amelia;555-5553 + -| + -| Anthony;555-3412 + -| + -| Becky;555-7685 + -| + -| Bill;555-1675 + -| + -| Broderick;555-0542 + -| + -| Camilla;555-2912 + -| + -| Fabius;555-1234 + -| + -| Julie;555-6699 + -| + -| Martin;555-6480 + -| + -| Samuel;555-3430 + -| + -| Jean-Paul;555-2127 + -| + + If the value of 'ORS' does not contain a newline, the program's +output runs together on a single line. + + +File: gawk.info, Node: OFMT, Next: Printf, Prev: Output Separators, Up: Printing + +5.4 Controlling Numeric Output with 'print' +=========================================== + +When printing numeric values with the 'print' statement, 'awk' +internally converts each number to a string of characters and prints +that string. 'awk' uses the 'sprintf()' function to do this conversion +(*note String Functions::). For now, it suffices to say that the +'sprintf()' function accepts a "format specification" that tells it how +to format numbers (or strings), and that there are a number of different +ways in which numbers can be formatted. The different format +specifications are discussed more fully in *note Control Letters::. + + The predefined variable 'OFMT' contains the format specification that +'print' uses with 'sprintf()' when it wants to convert a number to a +string for printing. The default value of 'OFMT' is '"%.6g"'. The way +'print' prints numbers can be changed by supplying a different format +specification for the value of 'OFMT', as shown in the following +example: + + $ awk 'BEGIN { + > OFMT = "%.0f" # print numbers as integers (rounds) + > print 17.23, 17.54 }' + -| 17 18 + +According to the POSIX standard, 'awk''s behavior is undefined if 'OFMT' +contains anything but a floating-point conversion specification. (d.c.) + + +File: gawk.info, Node: Printf, Next: Redirection, Prev: OFMT, Up: Printing + +5.5 Using 'printf' Statements for Fancier Printing +================================================== + +For more precise control over the output format than what is provided by +'print', use 'printf'. With 'printf' you can specify the width to use +for each item, as well as various formatting choices for numbers (such +as what output base to use, whether to print an exponent, whether to +print a sign, and how many digits to print after the decimal point). + +* Menu: + +* Basic Printf:: Syntax of the 'printf' statement. +* Control Letters:: Format-control letters. +* Format Modifiers:: Format-specification modifiers. +* Printf Examples:: Several examples. + + +File: gawk.info, Node: Basic Printf, Next: Control Letters, Up: Printf + +5.5.1 Introduction to the 'printf' Statement +-------------------------------------------- + +A simple 'printf' statement looks like this: + + printf FORMAT, ITEM1, ITEM2, ... + +As for 'print', the entire list of arguments may optionally be enclosed +in parentheses. Here too, the parentheses are necessary if any of the +item expressions uses the '>' relational operator; otherwise, it can be +confused with an output redirection (*note Redirection::). + + The difference between 'printf' and 'print' is the FORMAT argument. +This is an expression whose value is taken as a string; it specifies how +to output each of the other arguments. It is called the "format +string". 
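+
+   For instance, in the following statement (the name and number are
+invented purely for illustration), the format string is
+'"%s scored %d points\n"', and the remaining arguments supply the
+values that are printed in place of '%s' and '%d':
+
+     printf "%s scored %d points\n", "Becky", 42
+
+This prints 'Becky scored 42 points'.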
+ + The format string is very similar to that in the ISO C library +function 'printf()'. Most of FORMAT is text to output verbatim. +Scattered among this text are "format specifiers"--one per item. Each +format specifier says to output the next item in the argument list at +that place in the format. + + The 'printf' statement does not automatically append a newline to its +output. It outputs only what the format string specifies. So if a +newline is needed, you must include one in the format string. The +output separator variables 'OFS' and 'ORS' have no effect on 'printf' +statements. For example: + + $ awk 'BEGIN { + > ORS = "\nOUCH!\n"; OFS = "+" + > msg = "Don\47t Panic!" + > printf "%s\n", msg + > }' + -| Don't Panic! + +Here, neither the '+' nor the 'OUCH!' appears in the output message. + + +File: gawk.info, Node: Control Letters, Next: Format Modifiers, Prev: Basic Printf, Up: Printf + +5.5.2 Format-Control Letters +---------------------------- + +A format specifier starts with the character '%' and ends with a +"format-control letter"--it tells the 'printf' statement how to output +one item. The format-control letter specifies what _kind_ of value to +print. The rest of the format specifier is made up of optional +"modifiers" that control _how_ to print the value, such as the field +width. Here is a list of the format-control letters: + +'%c' + Print a number as a character; thus, 'printf "%c", 65' outputs the + letter 'A'. The output for a string value is the first character + of the string. + + NOTE: The POSIX standard says the first character of a string + is printed. In locales with multibyte characters, 'gawk' + attempts to convert the leading bytes of the string into a + valid wide character and then to print the multibyte encoding + of that character. Similarly, when printing a numeric value, + 'gawk' allows the value to be within the numeric range of + values that can be held in a wide character. If the + conversion to multibyte encoding fails, 'gawk' uses the low + eight bits of the value as the character to print. + + Other 'awk' versions generally restrict themselves to printing + the first byte of a string or to numeric values within the + range of a single byte (0-255). + +'%d', '%i' + Print a decimal integer. The two control letters are equivalent. + (The '%i' specification is for compatibility with ISO C.) + +'%e', '%E' + Print a number in scientific (exponential) notation. For example: + + printf "%4.3e\n", 1950 + + prints '1.950e+03', with a total of four significant figures, three + of which follow the decimal point. (The '4.3' represents two + modifiers, discussed in the next node.) '%E' uses 'E' instead of + 'e' in the output. + +'%f' + Print a number in floating-point notation. For example: + + printf "%4.3f", 1950 + + prints '1950.000', with a total of four significant figures, three + of which follow the decimal point. (The '4.3' represents two + modifiers, discussed in the next node.) + + On systems supporting IEEE 754 floating-point format, values + representing negative infinity are formatted as '-inf' or + '-infinity', and positive infinity as 'inf' or 'infinity'. The + special "not a number" value formats as '-nan' or 'nan' (*note Math + Definitions::). + +'%F' + Like '%f', but the infinity and "not a number" values are spelled + using uppercase letters. + + The '%F' format is a POSIX extension to ISO C; not all systems + support it. On those that don't, 'gawk' uses '%f' instead. 
+ +'%g', '%G' + Print a number in either scientific notation or in floating-point + notation, whichever uses fewer characters; if the result is printed + in scientific notation, '%G' uses 'E' instead of 'e'. + +'%o' + Print an unsigned octal integer (*note Nondecimal-numbers::). + +'%s' + Print a string. + +'%u' + Print an unsigned decimal integer. (This format is of marginal + use, because all numbers in 'awk' are floating point; it is + provided primarily for compatibility with C.) + +'%x', '%X' + Print an unsigned hexadecimal integer; '%X' uses the letters 'A' + through 'F' instead of 'a' through 'f' (*note + Nondecimal-numbers::). + +'%%' + Print a single '%'. This does not consume an argument and it + ignores any modifiers. + + NOTE: When using the integer format-control letters for values that + are outside the range of the widest C integer type, 'gawk' switches + to the '%g' format specifier. If '--lint' is provided on the + command line (*note Options::), 'gawk' warns about this. Other + versions of 'awk' may print invalid values or do something else + entirely. (d.c.) + + +File: gawk.info, Node: Format Modifiers, Next: Printf Examples, Prev: Control Letters, Up: Printf + +5.5.3 Modifiers for 'printf' Formats +------------------------------------ + +A format specification can also include "modifiers" that can control how +much of the item's value is printed, as well as how much space it gets. +The modifiers come between the '%' and the format-control letter. We +use the bullet symbol "*" in the following examples to represent spaces +in the output. Here are the possible modifiers, in the order in which +they may appear: + +'N$' + An integer constant followed by a '$' is a "positional specifier". + Normally, format specifications are applied to arguments in the + order given in the format string. With a positional specifier, the + format specification is applied to a specific argument, instead of + what would be the next argument in the list. Positional specifiers + begin counting with one. Thus: + + printf "%s %s\n", "don't", "panic" + printf "%2$s %1$s\n", "panic", "don't" + + prints the famous friendly message twice. + + At first glance, this feature doesn't seem to be of much use. It + is in fact a 'gawk' extension, intended for use in translating + messages at runtime. *Note Printf Ordering::, which describes how + and why to use positional specifiers. For now, we ignore them. + +'-' (Minus) + The minus sign, used before the width modifier (see later on in + this list), says to left-justify the argument within its specified + width. Normally, the argument is printed right-justified in the + specified width. Thus: + + printf "%-4s", "foo" + + prints 'foo*'. + +SPACE + For numeric conversions, prefix positive values with a space and + negative values with a minus sign. + +'+' + The plus sign, used before the width modifier (see later on in this + list), says to always supply a sign for numeric conversions, even + if the data to format is positive. The '+' overrides the space + modifier. + +'#' + Use an "alternative form" for certain control letters. For '%o', + supply a leading zero. For '%x' and '%X', supply a leading '0x' or + '0X' for a nonzero result. For '%e', '%E', '%f', and '%F', the + result always contains a decimal point. For '%g' and '%G', + trailing zeros are not removed from the result. + +'0' + A leading '0' (zero) acts as a flag indicating that output should + be padded with zeros instead of spaces. This applies only to the + numeric output formats. 
This flag only has an effect when the + field width is wider than the value to print. + +''' + A single quote or apostrophe character is a POSIX extension to ISO + C. It indicates that the integer part of a floating-point value, or + the entire part of an integer decimal value, should have a + thousands-separator character in it. This only works in locales + that support such characters. For example: + + $ cat thousands.awk Show source program + -| BEGIN { printf "%'d\n", 1234567 } + $ LC_ALL=C gawk -f thousands.awk + -| 1234567 Results in "C" locale + $ LC_ALL=en_US.UTF-8 gawk -f thousands.awk + -| 1,234,567 Results in US English UTF locale + + For more information about locales and internationalization issues, + see *note Locales::. + + NOTE: The ''' flag is a nice feature, but its use complicates + things: it becomes difficult to use it in command-line + programs. For information on appropriate quoting tricks, see + *note Quoting::. + +WIDTH + This is a number specifying the desired minimum width of a field. + Inserting any number between the '%' sign and the format-control + character forces the field to expand to this width. The default + way to do this is to pad with spaces on the left. For example: + + printf "%4s", "foo" + + prints '*foo'. + + The value of WIDTH is a minimum width, not a maximum. If the item + value requires more than WIDTH characters, it can be as wide as + necessary. Thus, the following: + + printf "%4s", "foobar" + + prints 'foobar'. + + Preceding the WIDTH with a minus sign causes the output to be + padded with spaces on the right, instead of on the left. + +'.PREC' + A period followed by an integer constant specifies the precision to + use when printing. The meaning of the precision varies by control + letter: + + '%d', '%i', '%o', '%u', '%x', '%X' + Minimum number of digits to print. + + '%e', '%E', '%f', '%F' + Number of digits to the right of the decimal point. + + '%g', '%G' + Maximum number of significant digits. + + '%s' + Maximum number of characters from the string that should + print. + + Thus, the following: + + printf "%.4s", "foobar" + + prints 'foob'. + + The C library 'printf''s dynamic WIDTH and PREC capability (e.g., +'"%*.*s"') is supported. Instead of supplying explicit WIDTH and/or +PREC values in the format string, they are passed in the argument list. +For example: + + w = 5 + p = 3 + s = "abcdefg" + printf "%*.*s\n", w, p, s + +is exactly equivalent to: + + s = "abcdefg" + printf "%5.3s\n", s + +Both programs output '**abc'. Earlier versions of 'awk' did not support +this capability. If you must use such a version, you may simulate this +feature by using concatenation to build up the format string, like so: + + w = 5 + p = 3 + s = "abcdefg" + printf "%" w "." p "s\n", s + +This is not particularly easy to read, but it does work. + + C programmers may be used to supplying additional modifiers ('h', +'j', 'l', 'L', 't', and 'z') in 'printf' format strings. These are not +valid in 'awk'. Most 'awk' implementations silently ignore them. If +'--lint' is provided on the command line (*note Options::), 'gawk' warns +about their use. If '--posix' is supplied, their use is a fatal error. 
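+
+   As a final, purely illustrative example (the values are arbitrary),
+several of these modifiers may be combined in one format specification:
+
+     printf "%+08.2f|%-6s|%#o\n", 3.14159, "ok", 8
+
+prints '+0003.14|ok****|010' (with "*" again standing for spaces): the
+first item gets a sign, zero padding, a width of 8, and two digits
+after the decimal point; the second is left-justified in a field of
+width 6; and the third is printed in octal with a leading zero.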
+ + +File: gawk.info, Node: Printf Examples, Prev: Format Modifiers, Up: Printf + +5.5.4 Examples Using 'printf' +----------------------------- + +The following simple example shows how to use 'printf' to make an +aligned table: + + awk '{ printf "%-10s %s\n", $1, $2 }' mail-list + +This command prints the names of the people ('$1') in the file +'mail-list' as a string of 10 characters that are left-justified. It +also prints the phone numbers ('$2') next on the line. This produces an +aligned two-column table of names and phone numbers, as shown here: + + $ awk '{ printf "%-10s %s\n", $1, $2 }' mail-list + -| Amelia 555-5553 + -| Anthony 555-3412 + -| Becky 555-7685 + -| Bill 555-1675 + -| Broderick 555-0542 + -| Camilla 555-2912 + -| Fabius 555-1234 + -| Julie 555-6699 + -| Martin 555-6480 + -| Samuel 555-3430 + -| Jean-Paul 555-2127 + + In this case, the phone numbers had to be printed as strings because +the numbers are separated by dashes. Printing the phone numbers as +numbers would have produced just the first three digits: '555'. This +would have been pretty confusing. + + It wasn't necessary to specify a width for the phone numbers because +they are last on their lines. They don't need to have spaces after +them. + + The table could be made to look even nicer by adding headings to the +tops of the columns. This is done using a 'BEGIN' rule (*note +BEGIN/END::) so that the headers are only printed once, at the beginning +of the 'awk' program: + + awk 'BEGIN { print "Name Number" + print "---- ------" } + { printf "%-10s %s\n", $1, $2 }' mail-list + + The preceding example mixes 'print' and 'printf' statements in the +same program. Using just 'printf' statements can produce the same +results: + + awk 'BEGIN { printf "%-10s %s\n", "Name", "Number" + printf "%-10s %s\n", "----", "------" } + { printf "%-10s %s\n", $1, $2 }' mail-list + +Printing each column heading with the same format specification used for +the column elements ensures that the headings are aligned just like the +columns. + + The fact that the same format specification is used three times can +be emphasized by storing it in a variable, like this: + + awk 'BEGIN { format = "%-10s %s\n" + printf format, "Name", "Number" + printf format, "----", "------" } + { printf format, $1, $2 }' mail-list + + +File: gawk.info, Node: Redirection, Next: Special FD, Prev: Printf, Up: Printing + +5.6 Redirecting Output of 'print' and 'printf' +============================================== + +So far, the output from 'print' and 'printf' has gone to the standard +output, usually the screen. Both 'print' and 'printf' can also send +their output to other places. This is called "redirection". + + NOTE: When '--sandbox' is specified (*note Options::), redirecting + output to files, pipes, and coprocesses is disabled. + + A redirection appears after the 'print' or 'printf' statement. +Redirections in 'awk' are written just like redirections in shell +commands, except that they are written inside the 'awk' program. + + There are four forms of output redirection: output to a file, output +appended to a file, output through a pipe to another command, and output +to a coprocess. We show them all for the 'print' statement, but they +work identically for 'printf': + +'print ITEMS > OUTPUT-FILE' + This redirection prints the items into the output file named + OUTPUT-FILE. The file name OUTPUT-FILE can be any expression. Its + value is changed to a string and then used as a file name (*note + Expressions::). 
+ + When this type of redirection is used, the OUTPUT-FILE is erased + before the first output is written to it. Subsequent writes to the + same OUTPUT-FILE do not erase OUTPUT-FILE, but append to it. (This + is different from how you use redirections in shell scripts.) If + OUTPUT-FILE does not exist, it is created. For example, here is + how an 'awk' program can write a list of peoples' names to one file + named 'name-list', and a list of phone numbers to another file + named 'phone-list': + + $ awk '{ print $2 > "phone-list" + > print $1 > "name-list" }' mail-list + $ cat phone-list + -| 555-5553 + -| 555-3412 + ... + $ cat name-list + -| Amelia + -| Anthony + ... + + Each output file contains one name or number per line. + +'print ITEMS >> OUTPUT-FILE' + This redirection prints the items into the preexisting output file + named OUTPUT-FILE. The difference between this and the single-'>' + redirection is that the old contents (if any) of OUTPUT-FILE are + not erased. Instead, the 'awk' output is appended to the file. If + OUTPUT-FILE does not exist, then it is created. + +'print ITEMS | COMMAND' + It is possible to send output to another program through a pipe + instead of into a file. This redirection opens a pipe to COMMAND, + and writes the values of ITEMS through this pipe to another process + created to execute COMMAND. + + The redirection argument COMMAND is actually an 'awk' expression. + Its value is converted to a string whose contents give the shell + command to be run. For example, the following produces two files, + one unsorted list of peoples' names, and one list sorted in reverse + alphabetical order: + + awk '{ print $1 > "names.unsorted" + command = "sort -r > names.sorted" + print $1 | command }' mail-list + + The unsorted list is written with an ordinary redirection, while + the sorted list is written by piping through the 'sort' utility. + + The next example uses redirection to mail a message to the mailing + list 'bug-system'. This might be useful when trouble is + encountered in an 'awk' script run periodically for system + maintenance: + + report = "mail bug-system" + print("Awk script failed:", $0) | report + print("at record number", FNR, "of", FILENAME) | report + close(report) + + The 'close()' function is called here because it's a good idea to + close the pipe as soon as all the intended output has been sent to + it. *Note Close Files And Pipes:: for more information. + + This example also illustrates the use of a variable to represent a + FILE or COMMAND--it is not necessary to always use a string + constant. Using a variable is generally a good idea, because (if + you mean to refer to that same file or command) 'awk' requires that + the string value be written identically every time. + +'print ITEMS |& COMMAND' + This redirection prints the items to the input of COMMAND. The + difference between this and the single-'|' redirection is that the + output from COMMAND can be read with 'getline'. Thus, COMMAND is a + "coprocess", which works together with but is subsidiary to the + 'awk' program. + + This feature is a 'gawk' extension, and is not available in POSIX + 'awk'. *Note Getline/Coprocess::, for a brief discussion. *Note + Two-way I/O::, for a more complete discussion. + + Redirecting output using '>', '>>', '|', or '|&' asks the system to +open a file, pipe, or coprocess only if the particular FILE or COMMAND +you specify has not already been written to by your program or if it has +been closed since it was last written to. 
+ + It is a common error to use '>' redirection for the first 'print' to +a file, and then to use '>>' for subsequent output: + + # clear the file + print "Don't panic" > "guide.txt" + ... + # append + print "Avoid improbability generators" >> "guide.txt" + +This is indeed how redirections must be used from the shell. But in +'awk', it isn't necessary. In this kind of case, a program should use +'>' for all the 'print' statements, because the output file is only +opened once. (It happens that if you mix '>' and '>>' output is +produced in the expected order. However, mixing the operators for the +same file is definitely poor style, and is confusing to readers of your +program.) + + Many older 'awk' implementations limit the number of pipelines that +an 'awk' program may have open to just one! In 'gawk', there is no such +limit. 'gawk' allows a program to open as many pipelines as the +underlying operating system permits. + + Piping into 'sh' + + A particularly powerful way to use redirection is to build command +lines and pipe them into the shell, 'sh'. For example, suppose you have +a list of files brought over from a system where all the file names are +stored in uppercase, and you wish to rename them to have names in all +lowercase. The following program is both simple and efficient: + + { printf("mv %s %s\n", $0, tolower($0)) | "sh" } + + END { close("sh") } + + The 'tolower()' function returns its argument string with all +uppercase characters converted to lowercase (*note String Functions::). +The program builds up a list of command lines, using the 'mv' utility to +rename the files. It then sends the list to the shell for execution. + + *Note Shell Quoting:: for a function that can help in generating +command lines to be fed to the shell. + + +File: gawk.info, Node: Special FD, Next: Special Files, Prev: Redirection, Up: Printing + +5.7 Special Files for Standard Preopened Data Streams +===================================================== + +Running programs conventionally have three input and output streams +already available to them for reading and writing. These are known as +the "standard input", "standard output", and "standard error output". +These open streams (and any other open files or pipes) are often +referred to by the technical term "file descriptors". + + These streams are, by default, connected to your keyboard and screen, +but they are often redirected with the shell, via the '<', '<<', '>', +'>>', '>&', and '|' operators. Standard error is typically used for +writing error messages; the reason there are two separate streams, +standard output and standard error, is so that they can be redirected +separately. + + In traditional implementations of 'awk', the only way to write an +error message to standard error in an 'awk' program is as follows: + + print "Serious error detected!" | "cat 1>&2" + +This works by opening a pipeline to a shell command that can access the +standard error stream that it inherits from the 'awk' process. This is +far from elegant, and it also requires a separate process. So people +writing 'awk' programs often don't do this. Instead, they send the +error messages to the screen, like this: + + print "Serious error detected!" > "/dev/tty" + +('/dev/tty' is a special file supplied by the operating system that is +connected to your keyboard and screen. It represents the "terminal,"(1) +which on modern systems is a keyboard and screen, not a serial console.) 
+This generally has the same effect, but not always: although the +standard error stream is usually the screen, it can be redirected; when +that happens, writing to the screen is not correct. In fact, if 'awk' +is run from a background job, it may not have a terminal at all. Then +opening '/dev/tty' fails. + + 'gawk', BWK 'awk', and 'mawk' provide special file names for +accessing the three standard streams. If the file name matches one of +these special names when 'gawk' (or one of the others) redirects input +or output, then it directly uses the descriptor that the file name +stands for. These special file names work for all operating systems +that 'gawk' has been ported to, not just those that are POSIX-compliant: + +'/dev/stdin' + The standard input (file descriptor 0). + +'/dev/stdout' + The standard output (file descriptor 1). + +'/dev/stderr' + The standard error output (file descriptor 2). + + With these facilities, the proper way to write an error message then +becomes: + + print "Serious error detected!" > "/dev/stderr" + + Note the use of quotes around the file name. Like with any other +redirection, the value must be a string. It is a common error to omit +the quotes, which leads to confusing results. + + 'gawk' does not treat these file names as special when in +POSIX-compatibility mode. However, because BWK 'awk' supports them, +'gawk' does support them even when invoked with the '--traditional' +option (*note Options::). + + ---------- Footnotes ---------- + + (1) The "tty" in '/dev/tty' stands for "Teletype," a serial terminal. + + +File: gawk.info, Node: Special Files, Next: Close Files And Pipes, Prev: Special FD, Up: Printing + +5.8 Special File names in 'gawk' +================================ + +Besides access to standard input, standard output, and standard error, +'gawk' provides access to any open file descriptor. Additionally, there +are special file names reserved for TCP/IP networking. + +* Menu: + +* Other Inherited Files:: Accessing other open files with + 'gawk'. +* Special Network:: Special files for network communications. +* Special Caveats:: Things to watch out for. + + +File: gawk.info, Node: Other Inherited Files, Next: Special Network, Up: Special Files + +5.8.1 Accessing Other Open Files with 'gawk' +-------------------------------------------- + +Besides the '/dev/stdin', '/dev/stdout', and '/dev/stderr' special file +names mentioned earlier, 'gawk' provides syntax for accessing any other +inherited open file: + +'/dev/fd/N' + The file associated with file descriptor N. Such a file must be + opened by the program initiating the 'awk' execution (typically the + shell). Unless special pains are taken in the shell from which + 'gawk' is invoked, only descriptors 0, 1, and 2 are available. + + The file names '/dev/stdin', '/dev/stdout', and '/dev/stderr' are +essentially aliases for '/dev/fd/0', '/dev/fd/1', and '/dev/fd/2', +respectively. However, those names are more self-explanatory. + + Note that using 'close()' on a file name of the form '"/dev/fd/N"', +for file descriptor numbers above two, does actually close the given +file descriptor. + + +File: gawk.info, Node: Special Network, Next: Special Caveats, Prev: Other Inherited Files, Up: Special Files + +5.8.2 Special Files for Network Communications +---------------------------------------------- + +'gawk' programs can open a two-way TCP/IP connection, acting as either a +client or a server. 
This is done using a special file name of the form: + + /NET-TYPE/PROTOCOL/LOCAL-PORT/REMOTE-HOST/REMOTE-PORT + + The NET-TYPE is one of 'inet', 'inet4', or 'inet6'. The PROTOCOL is +one of 'tcp' or 'udp', and the other fields represent the other +essential pieces of information for making a networking connection. +These file names are used with the '|&' operator for communicating with +a coprocess (*note Two-way I/O::). This is an advanced feature, +mentioned here only for completeness. Full discussion is delayed until +*note TCP/IP Networking::. + + +File: gawk.info, Node: Special Caveats, Prev: Special Network, Up: Special Files + +5.8.3 Special File name Caveats +------------------------------- + +Here are some things to bear in mind when using the special file names +that 'gawk' provides: + + * Recognition of the file names for the three standard preopened + files is disabled only in POSIX mode. + + * Recognition of the other special file names is disabled if 'gawk' + is in compatibility mode (either '--traditional' or '--posix'; + *note Options::). + + * 'gawk' _always_ interprets these special file names. For example, + using '/dev/fd/4' for output actually writes on file descriptor 4, + and not on a new file descriptor that is 'dup()'ed from file + descriptor 4. Most of the time this does not matter; however, it + is important to _not_ close any of the files related to file + descriptors 0, 1, and 2. Doing so results in unpredictable + behavior. + + +File: gawk.info, Node: Close Files And Pipes, Next: Nonfatal, Prev: Special Files, Up: Printing + +5.9 Closing Input and Output Redirections +========================================= + +If the same file name or the same shell command is used with 'getline' +more than once during the execution of an 'awk' program (*note +Getline::), the file is opened (or the command is executed) the first +time only. At that time, the first record of input is read from that +file or command. The next time the same file or command is used with +'getline', another record is read from it, and so on. + + Similarly, when a file or pipe is opened for output, 'awk' remembers +the file name or command associated with it, and subsequent writes to +the same file or command are appended to the previous writes. The file +or pipe stays open until 'awk' exits. + + This implies that special steps are necessary in order to read the +same file again from the beginning, or to rerun a shell command (rather +than reading more output from the same command). The 'close()' function +makes these things possible: + + close(FILENAME) + +or: + + close(COMMAND) + + The argument FILENAME or COMMAND can be any expression. Its value +must _exactly_ match the string that was used to open the file or start +the command (spaces and other "irrelevant" characters included). For +example, if you open a pipe with this: + + "sort -r names" | getline foo + +then you must close it with this: + + close("sort -r names") + + Once this function call is executed, the next 'getline' from that +file or command, or the next 'print' or 'printf' to that file or +command, reopens the file or reruns the command. Because the expression +that you use to close a file or pipeline must exactly match the +expression used to open the file or run the command, it is good practice +to use a variable to store the file name or command. The previous +example becomes the following: + + sortcom = "sort -r names" + sortcom | getline foo + ... 
+ close(sortcom) + +This helps avoid hard-to-find typographical errors in your 'awk' +programs. Here are some of the reasons for closing an output file: + + * To write a file and read it back later on in the same 'awk' + program. Close the file after writing it, then begin reading it + with 'getline'. + + * To write numerous files, successively, in the same 'awk' program. + If the files aren't closed, eventually 'awk' may exceed a system + limit on the number of open files in one process. It is best to + close each one when the program has finished writing it. + + * To make a command finish. When output is redirected through a + pipe, the command reading the pipe normally continues to try to + read input as long as the pipe is open. Often this means the + command cannot really do its work until the pipe is closed. For + example, if output is redirected to the 'mail' program, the message + is not actually sent until the pipe is closed. + + * To run the same program a second time, with the same arguments. + This is not the same thing as giving more input to the first run! + + For example, suppose a program pipes output to the 'mail' program. + If it outputs several lines redirected to this pipe without closing + it, they make a single message of several lines. By contrast, if + the program closes the pipe after each line of output, then each + line makes a separate message. + + If you use more files than the system allows you to have open, 'gawk' +attempts to multiplex the available open files among your data files. +'gawk''s ability to do this depends upon the facilities of your +operating system, so it may not always work. It is therefore both good +practice and good portability advice to always use 'close()' on your +files when you are done with them. In fact, if you are using a lot of +pipes, it is essential that you close commands when done. For example, +consider something like this: + + { + ... + command = ("grep " $1 " /some/file | my_prog -q " $3) + while ((command | getline) > 0) { + PROCESS OUTPUT OF command + } + # need close(command) here + } + + This example creates a new pipeline based on data in _each_ record. +Without the call to 'close()' indicated in the comment, 'awk' creates +child processes to run the commands, until it eventually runs out of +file descriptors for more pipelines. + + Even though each command has finished (as indicated by the +end-of-file return status from 'getline'), the child process is not +terminated;(1) more importantly, the file descriptor for the pipe is not +closed and released until 'close()' is called or 'awk' exits. + + 'close()' silently does nothing if given an argument that does not +represent a file, pipe, or coprocess that was opened with a redirection. +In such a case, it returns a negative value, indicating an error. In +addition, 'gawk' sets 'ERRNO' to a string indicating the error. + + Note also that 'close(FILENAME)' has no "magic" effects on the +implicit loop that reads through the files named on the command line. +It is, more likely, a close of a file that was never opened with a +redirection, so 'awk' silently does nothing, except return a negative +value. + + When using the '|&' operator to communicate with a coprocess, it is +occasionally useful to be able to close one end of the two-way pipe +without closing the other. This is done by supplying a second argument +to 'close()'. As in any other call to 'close()', the first argument is +the name of the command or special file used to start the coprocess. 
+The second argument should be a string, with either of the values '"to"' +or '"from"'. Case does not matter. As this is an advanced feature, +discussion is delayed until *note Two-way I/O::, which describes it in +more detail and gives an example. + + Using 'close()''s Return Value + + In many older versions of Unix 'awk', the 'close()' function is +actually a statement. (d.c.) It is a syntax error to try and use the +return value from 'close()': + + command = "..." + command | getline info + retval = close(command) # syntax error in many Unix awks + + 'gawk' treats 'close()' as a function. The return value is -1 if the +argument names something that was never opened with a redirection, or if +there is a system problem closing the file or process. In these cases, +'gawk' sets the predefined variable 'ERRNO' to a string describing the +problem. + + In 'gawk', starting with version 4.2, when closing a pipe or +coprocess (input or output), the return value is the exit status of the +command, as described in *note Table 5.1: +table-close-pipe-return-values.(2) Otherwise, it is the return value +from the system's 'close()' or 'fclose()' C functions when closing input +or output files, respectively. This value is zero if the close +succeeds, or -1 if it fails. + +Situation Return value from 'close()' +-------------------------------------------------------------------------- +Normal exit of command Command's exit status +Death by signal of command 256 + number of murderous signal +Death by signal of command 512 + number of murderous signal +with core dump +Some kind of error -1 + +Table 5.1: Return values from 'close()' of a pipe + + The POSIX standard is very vague; it says that 'close()' returns zero +on success and a nonzero value otherwise. In general, different +implementations vary in what they report when closing pipes; thus, the +return value cannot be used portably. (d.c.) In POSIX mode (*note +Options::), 'gawk' just returns zero when closing a pipe. + + ---------- Footnotes ---------- + + (1) The technical terminology is rather morbid. The finished child +is called a "zombie," and cleaning up after it is referred to as +"reaping." + + (2) Prior to version 4.2, the return value from closing a pipe or +co-process was the full 16-bit exit value as defined by the 'wait()' +system call. + + +File: gawk.info, Node: Nonfatal, Next: Output Summary, Prev: Close Files And Pipes, Up: Printing + +5.10 Enabling Nonfatal Output +============================= + +This minor node describes a 'gawk'-specific feature. + + In standard 'awk', output with 'print' or 'printf' to a nonexistent +file, or some other I/O error (such as filling up the disk) is a fatal +error. + + $ gawk 'BEGIN { print "hi" > "/no/such/file" }' + error-> gawk: cmd. line:1: fatal: can't redirect to `/no/such/file' (No such file or directory) + + 'gawk' makes it possible to detect that an error has occurred, +allowing you to possibly recover from the error, or at least print an +error message of your choosing before exiting. You can do this in one +of two ways: + + * For all output files, by assigning any value to + 'PROCINFO["NONFATAL"]'. + + * On a per-file basis, by assigning any value to 'PROCINFO[FILENAME, + "NONFATAL"]'. Here, FILENAME is the name of the file to which you + wish output to be nonfatal. + + Once you have enabled nonfatal output, you must check 'ERRNO' after +every relevant 'print' or 'printf' statement to see if something went +wrong. 
It is also a good idea to initialize 'ERRNO' to zero before +attempting the output. For example: + + $ gawk ' + > BEGIN { + > PROCINFO["NONFATAL"] = 1 + > ERRNO = 0 + > print "hi" > "/no/such/file" + > if (ERRNO) { + > print("Output failed:", ERRNO) > "/dev/stderr" + > exit 1 + > } + > }' + error-> Output failed: No such file or directory + + Here, 'gawk' did not produce a fatal error; instead it let the 'awk' +program code detect the problem and handle it. + + This mechanism works also for standard output and standard error. +For standard output, you may use 'PROCINFO["-", "NONFATAL"]' or +'PROCINFO["/dev/stdout", "NONFATAL"]'. For standard error, use +'PROCINFO["/dev/stderr", "NONFATAL"]'. + + When attempting to open a TCP/IP socket (*note TCP/IP Networking::), +'gawk' tries multiple times. The 'GAWK_SOCK_RETRIES' environment +variable (*note Other Environment Variables::) allows you to override +'gawk''s builtin default number of attempts. However, once nonfatal I/O +is enabled for a given socket, 'gawk' only retries once, relying on +'awk'-level code to notice that there was a problem. + + +File: gawk.info, Node: Output Summary, Next: Output Exercises, Prev: Nonfatal, Up: Printing + +5.11 Summary +============ + + * The 'print' statement prints comma-separated expressions. Each + expression is separated by the value of 'OFS' and terminated by the + value of 'ORS'. 'OFMT' provides the conversion format for numeric + values for the 'print' statement. + + * The 'printf' statement provides finer-grained control over output, + with format-control letters for different data types and various + flags that modify the behavior of the format-control letters. + + * Output from both 'print' and 'printf' may be redirected to files, + pipes, and coprocesses. + + * 'gawk' provides special file names for access to standard input, + output, and error, and for network communications. + + * Use 'close()' to close open file, pipe, and coprocess redirections. + For coprocesses, it is possible to close only one direction of the + communications. + + * Normally errors with 'print' or 'printf' are fatal. 'gawk' lets + you make output errors be nonfatal either for all files or on a + per-file basis. You must then check for errors after every + relevant output statement. + + +File: gawk.info, Node: Output Exercises, Prev: Output Summary, Up: Printing + +5.12 Exercises +============== + + 1. Rewrite the program: + + awk 'BEGIN { print "Month Crates" + print "----- ------" } + { print $1, " ", $2 }' inventory-shipped + + from *note Output Separators::, by using a new value of 'OFS'. + + 2. Use the 'printf' statement to line up the headings and table data + for the 'inventory-shipped' example that was covered in *note + Print::. + + 3. What happens if you forget the double quotes when redirecting + output, as follows: + + BEGIN { print "Serious error detected!" > /dev/stderr } + + +File: gawk.info, Node: Expressions, Next: Patterns and Actions, Prev: Printing, Up: Top + +6 Expressions +************* + +Expressions are the basic building blocks of 'awk' patterns and actions. +An expression evaluates to a value that you can print, test, or pass to +a function. Additionally, an expression can assign a new value to a +variable or a field by using an assignment operator. + + An expression can serve as a pattern or action statement on its own. +Most other kinds of statements contain one or more expressions that +specify the data on which to operate. 
As in other languages, +expressions in 'awk' can include variables, array references, constants, +and function calls, as well as combinations of these with various +operators. + +* Menu: + +* Values:: Constants, Variables, and Regular Expressions. +* All Operators:: 'gawk''s operators. +* Truth Values and Conditions:: Testing for true and false. +* Function Calls:: A function call is an expression. +* Precedence:: How various operators nest. +* Locales:: How the locale affects things. +* Expressions Summary:: Expressions summary. + + +File: gawk.info, Node: Values, Next: All Operators, Up: Expressions + +6.1 Constants, Variables, and Conversions +========================================= + +Expressions are built up from values and the operations performed upon +them. This minor node describes the elementary objects that provide the +values used in expressions. + +* Menu: + +* Constants:: String, numeric and regexp constants. +* Using Constant Regexps:: When and how to use a regexp constant. +* Variables:: Variables give names to values for later use. +* Conversion:: The conversion of strings to numbers and vice + versa. + + +File: gawk.info, Node: Constants, Next: Using Constant Regexps, Up: Values + +6.1.1 Constant Expressions +-------------------------- + +The simplest type of expression is the "constant", which always has the +same value. There are three types of constants: numeric, string, and +regular expression. + + Each is used in the appropriate context when you need a data value +that isn't going to change. Numeric constants can have different forms, +but are internally stored in an identical manner. + +* Menu: + +* Scalar Constants:: Numeric and string constants. +* Nondecimal-numbers:: What are octal and hex numbers. +* Regexp Constants:: Regular Expression constants. + + +File: gawk.info, Node: Scalar Constants, Next: Nondecimal-numbers, Up: Constants + +6.1.1.1 Numeric and String Constants +.................................... + +A "numeric constant" stands for a number. This number can be an +integer, a decimal fraction, or a number in scientific (exponential) +notation.(1) Here are some examples of numeric constants that all have +the same value: + + 105 + 1.05e+2 + 1050e-1 + + A "string constant" consists of a sequence of characters enclosed in +double quotation marks. For example: + + "parrot" + +represents the string whose contents are 'parrot'. Strings in 'gawk' +can be of any length, and they can contain any of the possible eight-bit +ASCII characters, including ASCII NUL (character code zero). Other +'awk' implementations may have difficulty with some character codes. + + ---------- Footnotes ---------- + + (1) The internal representation of all numbers, including integers, +uses double-precision floating-point numbers. On most modern systems, +these are in IEEE 754 standard format. *Note Arbitrary Precision +Arithmetic::, for much more information. + + +File: gawk.info, Node: Nondecimal-numbers, Next: Regexp Constants, Prev: Scalar Constants, Up: Constants + +6.1.1.2 Octal and Hexadecimal Numbers +..................................... + +In 'awk', all numbers are in decimal (i.e., base 10). Many other +programming languages allow you to specify numbers in other bases, often +octal (base 8) and hexadecimal (base 16). In octal, the numbers go 0, +1, 2, 3, 4, 5, 6, 7, 10, 11, 12, and so on. Just as '11' in decimal is +1 times 10 plus 1, so '11' in octal is 1 times 8 plus 1. This equals 9 +in decimal. In hexadecimal, there are 16 digits. 
Because the everyday +decimal number system only has ten digits ('0'-'9'), the letters 'a' +through 'f' are used to represent the rest. (Case in the letters is +usually irrelevant; hexadecimal 'a' and 'A' have the same value.) Thus, +'11' in hexadecimal is 1 times 16 plus 1, which equals 17 in decimal. + + Just by looking at plain '11', you can't tell what base it's in. So, +in C, C++, and other languages derived from C, there is a special +notation to signify the base. Octal numbers start with a leading '0', +and hexadecimal numbers start with a leading '0x' or '0X': + +'11' + Decimal value 11 + +'011' + Octal 11, decimal value 9 + +'0x11' + Hexadecimal 11, decimal value 17 + + This example shows the difference: + + $ gawk 'BEGIN { printf "%d, %d, %d\n", 011, 11, 0x11 }' + -| 9, 11, 17 + + Being able to use octal and hexadecimal constants in your programs is +most useful when working with data that cannot be represented +conveniently as characters or as regular numbers, such as binary data of +various sorts. + + 'gawk' allows the use of octal and hexadecimal constants in your +program text. However, such numbers in the input data are not treated +differently; doing so by default would break old programs. (If you +really need to do this, use the '--non-decimal-data' command-line +option; *note Nondecimal Data::.) If you have octal or hexadecimal +data, you can use the 'strtonum()' function (*note String Functions::) +to convert the data into a number. Most of the time, you will want to +use octal or hexadecimal constants when working with the built-in +bit-manipulation functions; see *note Bitwise Functions:: for more +information. + + Unlike in some early C implementations, '8' and '9' are not valid in +octal constants. For example, 'gawk' treats '018' as decimal 18: + + $ gawk 'BEGIN { print "021 is", 021 ; print 018 }' + -| 021 is 17 + -| 18 + + Octal and hexadecimal source code constants are a 'gawk' extension. +If 'gawk' is in compatibility mode (*note Options::), they are not +available. + + A Constant's Base Does Not Affect Its Value + + Once a numeric constant has been converted internally into a number, +'gawk' no longer remembers what the original form of the constant was; +the internal value is always used. This has particular consequences for +conversion of numbers to strings: + + $ gawk 'BEGIN { printf "0x11 is <%s>\n", 0x11 }' + -| 0x11 is <17> + + +File: gawk.info, Node: Regexp Constants, Prev: Nondecimal-numbers, Up: Constants + +6.1.1.3 Regular Expression Constants +.................................... + +A "regexp constant" is a regular expression description enclosed in +slashes, such as '/^beginning and end$/'. Most regexps used in 'awk' +programs are constant, but the '~' and '!~' matching operators can also +match computed or dynamic regexps (which are typically just ordinary +strings or variables that contain a regexp, but could be more complex +expressions). + + +File: gawk.info, Node: Using Constant Regexps, Next: Variables, Prev: Constants, Up: Values + +6.1.2 Using Regular Expression Constants +---------------------------------------- + +When used on the righthand side of the '~' or '!~' operators, a regexp +constant merely stands for the regexp that is to be matched. However, +regexp constants (such as '/foo/') may be used like simple expressions. +When a regexp constant appears by itself, it has the same meaning as if +it appeared in a pattern (i.e., '($0 ~ /foo/)'). (d.c.) *Note +Expression Patterns::. 
This means that the following two code segments: + + if ($0 ~ /barfly/ || $0 ~ /camelot/) + print "found" + +and: + + if (/barfly/ || /camelot/) + print "found" + +are exactly equivalent. One rather bizarre consequence of this rule is +that the following Boolean expression is valid, but does not do what its +author probably intended: + + # Note that /foo/ is on the left of the ~ + if (/foo/ ~ $1) print "found foo" + +This code is "obviously" testing '$1' for a match against the regexp +'/foo/'. But in fact, the expression '/foo/ ~ $1' really means '($0 ~ +/foo/) ~ $1'. In other words, first match the input record against the +regexp '/foo/'. The result is either zero or one, depending upon the +success or failure of the match. That result is then matched against +the first field in the record. Because it is unlikely that you would +ever really want to make this kind of test, 'gawk' issues a warning when +it sees this construct in a program. Another consequence of this rule +is that the assignment statement: + + matches = /foo/ + +assigns either zero or one to the variable 'matches', depending upon the +contents of the current input record. + + Constant regular expressions are also used as the first argument for +the 'gensub()', 'sub()', and 'gsub()' functions, as the second argument +of the 'match()' function, and as the third argument of the 'split()' +and 'patsplit()' functions (*note String Functions::). Modern +implementations of 'awk', including 'gawk', allow the third argument of +'split()' to be a regexp constant, but some older implementations do +not. (d.c.) Because some built-in functions accept regexp constants as +arguments, confusion can arise when attempting to use regexp constants +as arguments to user-defined functions (*note User-defined::). For +example: + + function mysub(pat, repl, str, global) + { + if (global) + gsub(pat, repl, str) + else + sub(pat, repl, str) + return str + } + + { + ... + text = "hi! hi yourself!" + mysub(/hi/, "howdy", text, 1) + ... + } + + In this example, the programmer wants to pass a regexp constant to +the user-defined function 'mysub()', which in turn passes it on to +either 'sub()' or 'gsub()'. However, what really happens is that the +'pat' parameter is assigned a value of either one or zero, depending +upon whether or not '$0' matches '/hi/'. 'gawk' issues a warning when +it sees a regexp constant used as a parameter to a user-defined +function, because passing a truth value in this way is probably not what +was intended. + + +File: gawk.info, Node: Variables, Next: Conversion, Prev: Using Constant Regexps, Up: Values + +6.1.3 Variables +--------------- + +"Variables" are ways of storing values at one point in your program for +use later in another part of your program. They can be manipulated +entirely within the program text, and they can also be assigned values +on the 'awk' command line. + +* Menu: + +* Using Variables:: Using variables in your programs. +* Assignment Options:: Setting variables on the command line and a + summary of command-line syntax. This is an + advanced method of input. + + +File: gawk.info, Node: Using Variables, Next: Assignment Options, Up: Variables + +6.1.3.1 Using Variables in a Program +.................................... + +Variables let you give names to values and refer to them later. +Variables have already been used in many of the examples. The name of a +variable must be a sequence of letters, digits, or underscores, and it +may not begin with a digit. 
Here, a "letter" is any one of the 52 +upper- and lowercase English letters. Other characters that may be +defined as letters in non-English locales are not valid in variable +names. Case is significant in variable names; 'a' and 'A' are distinct +variables. + + A variable name is a valid expression by itself; it represents the +variable's current value. Variables are given new values with +"assignment operators", "increment operators", and "decrement operators" +(*note Assignment Ops::). In addition, the 'sub()' and 'gsub()' +functions can change a variable's value, and the 'match()', 'split()', +and 'patsplit()' functions can change the contents of their array +parameters (*note String Functions::). + + A few variables have special built-in meanings, such as 'FS' (the +field separator) and 'NF' (the number of fields in the current input +record). *Note Built-in Variables:: for a list of the predefined +variables. These predefined variables can be used and assigned just +like all other variables, but their values are also used or changed +automatically by 'awk'. All predefined variables' names are entirely +uppercase. + + Variables in 'awk' can be assigned either numeric or string values. +The kind of value a variable holds can change over the life of a +program. By default, variables are initialized to the empty string, +which is zero if converted to a number. There is no need to explicitly +initialize a variable in 'awk', which is what you would do in C and in +most other traditional languages. + + +File: gawk.info, Node: Assignment Options, Prev: Using Variables, Up: Variables + +6.1.3.2 Assigning Variables on the Command Line +............................................... + +Any 'awk' variable can be set by including a "variable assignment" among +the arguments on the command line when 'awk' is invoked (*note Other +Arguments::). Such an assignment has the following form: + + VARIABLE=TEXT + +With it, a variable is set either at the beginning of the 'awk' run or +in between input files. When the assignment is preceded with the '-v' +option, as in the following: + + -v VARIABLE=TEXT + +the variable is set at the very beginning, even before the 'BEGIN' rules +execute. The '-v' option and its assignment must precede all the file +name arguments, as well as the program text. (*Note Options:: for more +information about the '-v' option.) Otherwise, the variable assignment +is performed at a time determined by its position among the input file +arguments--after the processing of the preceding input file argument. +For example: + + awk '{ print $n }' n=4 inventory-shipped n=2 mail-list + +prints the value of field number 'n' for all input records. Before the +first file is read, the command line sets the variable 'n' equal to +four. This causes the fourth field to be printed in lines from +'inventory-shipped'. After the first file has finished, but before the +second file is started, 'n' is set to two, so that the second field is +printed in lines from 'mail-list': + + $ awk '{ print $n }' n=4 inventory-shipped n=2 mail-list + -| 15 + -| 24 + ... + -| 555-5553 + -| 555-3412 + ... + + Command-line arguments are made available for explicit examination by +the 'awk' program in the 'ARGV' array (*note ARGC and ARGV::). 'awk' +processes the values of command-line assignments for escape sequences +(*note Escape Sequences::). (d.c.) 
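+
+   For example, a '\n' in an assigned value becomes a real newline by
+the time the program sees it (a small illustration; the variable name
+'greeting' is arbitrary):
+
+     $ gawk -v greeting='hello\nworld' 'BEGIN { print greeting }'
+     -| hello
+     -| world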
+ + +File: gawk.info, Node: Conversion, Prev: Variables, Up: Values + +6.1.4 Conversion of Strings and Numbers +--------------------------------------- + +Number-to-string and string-to-number conversion are generally +straightforward. There can be subtleties to be aware of; this minor +node discusses this important facet of 'awk'. + +* Menu: + +* Strings And Numbers:: How 'awk' Converts Between Strings And + Numbers. +* Locale influences conversions:: How the locale may affect conversions. + + +File: gawk.info, Node: Strings And Numbers, Next: Locale influences conversions, Up: Conversion + +6.1.4.1 How 'awk' Converts Between Strings and Numbers +...................................................... + +Strings are converted to numbers and numbers are converted to strings, +if the context of the 'awk' program demands it. For example, if the +value of either 'foo' or 'bar' in the expression 'foo + bar' happens to +be a string, it is converted to a number before the addition is +performed. If numeric values appear in string concatenation, they are +converted to strings. Consider the following: + + two = 2; three = 3 + print (two three) + 4 + +This prints the (numeric) value 27. The numeric values of the variables +'two' and 'three' are converted to strings and concatenated together. +The resulting string is converted back to the number 23, to which 4 is +then added. + + If, for some reason, you need to force a number to be converted to a +string, concatenate that number with the empty string, '""'. To force a +string to be converted to a number, add zero to that string. A string +is converted to a number by interpreting any numeric prefix of the +string as numerals: '"2.5"' converts to 2.5, '"1e3"' converts to 1,000, +and '"25fix"' has a numeric value of 25. Strings that can't be +interpreted as valid numbers convert to zero. + + The exact manner in which numbers are converted into strings is +controlled by the 'awk' predefined variable 'CONVFMT' (*note Built-in +Variables::). Numbers are converted using the 'sprintf()' function with +'CONVFMT' as the format specifier (*note String Functions::). + + 'CONVFMT''s default value is '"%.6g"', which creates a value with at +most six significant digits. For some applications, you might want to +change it to specify more precision. On most modern machines, 17 digits +is usually enough to capture a floating-point number's value exactly.(1) + + Strange results can occur if you set 'CONVFMT' to a string that +doesn't tell 'sprintf()' how to format floating-point numbers in a +useful way. For example, if you forget the '%' in the format, 'awk' +converts all numbers to the same constant string. + + As a special case, if a number is an integer, then the result of +converting it to a string is _always_ an integer, no matter what the +value of 'CONVFMT' may be. Given the following code fragment: + + CONVFMT = "%2.2f" + a = 12 + b = a "" + +'b' has the value '"12"', not '"12.00"'. (d.c.) + + Pre-POSIX 'awk' Used 'OFMT' for String Conversion + + Prior to the POSIX standard, 'awk' used the value of 'OFMT' for +converting numbers to strings. 'OFMT' specifies the output format to +use when printing numbers with 'print'. 'CONVFMT' was introduced in +order to separate the semantics of conversion from the semantics of +printing. Both 'CONVFMT' and 'OFMT' have the same default value: +'"%.6g"'. In the vast majority of cases, old 'awk' programs do not +change their behavior. *Note Print:: for more information on the +'print' statement. 
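+
+   To see the conversion format take effect with a value that is not
+an integer, here is a short sketch:
+
+     $ gawk 'BEGIN { CONVFMT = "%2.2f"; a = 3.14159; print a "" }'
+     -| 3.14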
+ + ---------- Footnotes ---------- + + (1) Pathological cases can require up to 752 digits (!), but we doubt +that you need to worry about this. + + +File: gawk.info, Node: Locale influences conversions, Prev: Strings And Numbers, Up: Conversion + +6.1.4.2 Locales Can Influence Conversion +........................................ + +Where you are can matter when it comes to converting between numbers and +strings. The local character set and language--the "locale"--can affect +numeric formats. In particular, for 'awk' programs, it affects the +decimal point character and the thousands-separator character. The +'"C"' locale, and most English-language locales, use the period +character ('.') as the decimal point and don't have a thousands +separator. However, many (if not most) European and non-English locales +use the comma (',') as the decimal point character. European locales +often use either a space or a period as the thousands separator, if they +have one. + + The POSIX standard says that 'awk' always uses the period as the +decimal point when reading the 'awk' program source code, and for +command-line variable assignments (*note Other Arguments::). However, +when interpreting input data, for 'print' and 'printf' output, and for +number-to-string conversion, the local decimal point character is used. +(d.c.) In all cases, numbers in source code and in input data cannot +have a thousands separator. Here are some examples indicating the +difference in behavior, on a GNU/Linux system: + + $ export POSIXLY_CORRECT=1 Force POSIX behavior + $ gawk 'BEGIN { printf "%g\n", 3.1415927 }' + -| 3.14159 + $ LC_ALL=en_DK.utf-8 gawk 'BEGIN { printf "%g\n", 3.1415927 }' + -| 3,14159 + $ echo 4,321 | gawk '{ print $1 + 1 }' + -| 5 + $ echo 4,321 | LC_ALL=en_DK.utf-8 gawk '{ print $1 + 1 }' + -| 5,321 + +The 'en_DK.utf-8' locale is for English in Denmark, where the comma acts +as the decimal point separator. In the normal '"C"' locale, 'gawk' +treats '4,321' as 4, while in the Danish locale, it's treated as the +full number including the fractional part, 4.321. + + Some earlier versions of 'gawk' fully complied with this aspect of +the standard. However, many users in non-English locales complained +about this behavior, because their data used a period as the decimal +point, so the default behavior was restored to use a period as the +decimal point character. You can use the '--use-lc-numeric' option +(*note Options::) to force 'gawk' to use the locale's decimal point +character. ('gawk' also uses the locale's decimal point character when +in POSIX mode, either via '--posix' or the 'POSIXLY_CORRECT' environment +variable, as shown previously.) + + *note Table 6.1: table-locale-affects. describes the cases in which +the locale's decimal point character is used and when a period is used. +Some of these features have not been described yet. + +Feature Default '--posix' or + '--use-lc-numeric' +------------------------------------------------------------ +'%'g' Use locale Use locale +'%g' Use period Use locale +Input Use period Use locale +'strtonum()'Use period Use locale + +Table 6.1: Locale decimal point versus a period + + Finally, modern-day formal standards and the IEEE standard +floating-point representation can have an unusual but important effect +on the way 'gawk' converts some special string values to numbers. The +details are presented in *note POSIX Floating Point Problems::. 
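+
+   As a final illustration, the following short session (assuming the
+'en_DK.utf-8' locale is installed and 'POSIXLY_CORRECT' is not set)
+shows the '%g' row of *note Table 6.1: table-locale-affects. in
+practice:
+
+     $ LC_ALL=en_DK.utf-8 gawk 'BEGIN { printf "%g\n", 3.1415927 }'
+     -| 3.14159
+     $ LC_ALL=en_DK.utf-8 gawk --use-lc-numeric \
+     >        'BEGIN { printf "%g\n", 3.1415927 }'
+     -| 3,14159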
+ + +File: gawk.info, Node: All Operators, Next: Truth Values and Conditions, Prev: Values, Up: Expressions + +6.2 Operators: Doing Something with Values +========================================== + +This minor node introduces the "operators" that make use of the values +provided by constants and variables. + +* Menu: + +* Arithmetic Ops:: Arithmetic operations ('+', '-', + etc.) +* Concatenation:: Concatenating strings. +* Assignment Ops:: Changing the value of a variable or a field. +* Increment Ops:: Incrementing the numeric value of a variable. + + +File: gawk.info, Node: Arithmetic Ops, Next: Concatenation, Up: All Operators + +6.2.1 Arithmetic Operators +-------------------------- + +The 'awk' language uses the common arithmetic operators when evaluating +expressions. All of these arithmetic operators follow normal precedence +rules and work as you would expect them to. + + The following example uses a file named 'grades', which contains a +list of student names as well as three test scores per student (it's a +small class): + + Pat 100 97 58 + Sandy 84 72 93 + Chris 72 92 89 + +This program takes the file 'grades' and prints the average of the +scores: + + $ awk '{ sum = $2 + $3 + $4 ; avg = sum / 3 + > print $1, avg }' grades + -| Pat 85 + -| Sandy 83 + -| Chris 84.3333 + + The following list provides the arithmetic operators in 'awk', in +order from the highest precedence to the lowest: + +'X ^ Y' +'X ** Y' + Exponentiation; X raised to the Y power. '2 ^ 3' has the value + eight; the character sequence '**' is equivalent to '^'. (c.e.) + +'- X' + Negation. + +'+ X' + Unary plus; the expression is converted to a number. + +'X * Y' + Multiplication. + +'X / Y' + Division; because all numbers in 'awk' are floating-point numbers, + the result is _not_ rounded to an integer--'3 / 4' has the value + 0.75. (It is a common mistake, especially for C programmers, to + forget that _all_ numbers in 'awk' are floating point, and that + division of integer-looking constants produces a real number, not + an integer.) + +'X % Y' + Remainder; further discussion is provided in the text, just after + this list. + +'X + Y' + Addition. + +'X - Y' + Subtraction. + + Unary plus and minus have the same precedence, the multiplication +operators all have the same precedence, and addition and subtraction +have the same precedence. + + When computing the remainder of 'X % Y', the quotient is rounded +toward zero to an integer and multiplied by Y. This result is +subtracted from X; this operation is sometimes known as "trunc-mod." +The following relation always holds: + + b * int(a / b) + (a % b) == a + + One possibly undesirable effect of this definition of remainder is +that 'X % Y' is negative if X is negative. Thus: + + -17 % 8 = -1 + + In other 'awk' implementations, the signedness of the remainder may +be machine-dependent. + + NOTE: The POSIX standard only specifies the use of '^' for + exponentiation. For maximum portability, do not use the '**' + operator. + + +File: gawk.info, Node: Concatenation, Next: Assignment Ops, Prev: Arithmetic Ops, Up: All Operators + +6.2.2 String Concatenation +-------------------------- + + It seemed like a good idea at the time. + -- _Brian Kernighan_ + + There is only one string operation: concatenation. It does not have +a specific operator to represent it. Instead, concatenation is +performed by writing expressions next to one another, with no operator. 
+For example: + + $ awk '{ print "Field number one: " $1 }' mail-list + -| Field number one: Amelia + -| Field number one: Anthony + ... + + Without the space in the string constant after the ':', the line runs +together. For example: + + $ awk '{ print "Field number one:" $1 }' mail-list + -| Field number one:Amelia + -| Field number one:Anthony + ... + + Because string concatenation does not have an explicit operator, it +is often necessary to ensure that it happens at the right time by using +parentheses to enclose the items to concatenate. For example, you might +expect that the following code fragment concatenates 'file' and 'name': + + file = "file" + name = "name" + print "something meaningful" > file name + +This produces a syntax error with some versions of Unix 'awk'.(1) It is +necessary to use the following: + + print "something meaningful" > (file name) + + Parentheses should be used around concatenation in all but the most +common contexts, such as on the righthand side of '='. Be careful about +the kinds of expressions used in string concatenation. In particular, +the order of evaluation of expressions used for concatenation is +undefined in the 'awk' language. Consider this example: + + BEGIN { + a = "don't" + print (a " " (a = "panic")) + } + +It is not defined whether the second assignment to 'a' happens before or +after the value of 'a' is retrieved for producing the concatenated +value. The result could be either 'don't panic', or 'panic panic'. + + The precedence of concatenation, when mixed with other operators, is +often counter-intuitive. Consider this example: + + $ awk 'BEGIN { print -12 " " -24 }' + -| -12-24 + + This "obviously" is concatenating -12, a space, and -24. But where +did the space disappear to? The answer lies in the combination of +operator precedences and 'awk''s automatic conversion rules. To get the +desired result, write the program this way: + + $ awk 'BEGIN { print -12 " " (-24) }' + -| -12 -24 + + This forces 'awk' to treat the '-' on the '-24' as unary. Otherwise, +it's parsed as follows: + + -12 ('" "' - 24) + => -12 (0 - 24) + => -12 (-24) + => -12-24 + + As mentioned earlier, when mixing concatenation with other operators, +_parenthesize_. Otherwise, you're never quite sure what you'll get. + + ---------- Footnotes ---------- + + (1) It happens that BWK 'awk', 'gawk', and 'mawk' all "get it right," +but you should not rely on this. + + +File: gawk.info, Node: Assignment Ops, Next: Increment Ops, Prev: Concatenation, Up: All Operators + +6.2.3 Assignment Expressions +---------------------------- + +An "assignment" is an expression that stores a (usually different) value +into a variable. For example, let's assign the value one to the +variable 'z': + + z = 1 + + After this expression is executed, the variable 'z' has the value +one. Whatever old value 'z' had before the assignment is forgotten. + + Assignments can also store string values. For example, the following +stores the value '"this food is good"' in the variable 'message': + + thing = "food" + predicate = "good" + message = "this " thing " is " predicate + +This also illustrates string concatenation. The '=' sign is called an +"assignment operator". It is the simplest assignment operator because +the value of the righthand operand is stored unchanged. Most operators +(addition, concatenation, and so on) have no effect except to compute a +value. If the value isn't used, there's no reason to use the operator. 
+An assignment operator is different; it does produce a value, but even +if you ignore it, the assignment still makes itself felt through the +alteration of the variable. We call this a "side effect". + + The lefthand operand of an assignment need not be a variable (*note +Variables::); it can also be a field (*note Changing Fields::) or an +array element (*note Arrays::). These are all called "lvalues", which +means they can appear on the lefthand side of an assignment operator. +The righthand operand may be any expression; it produces the new value +that the assignment stores in the specified variable, field, or array +element. (Such values are called "rvalues".) + + It is important to note that variables do _not_ have permanent types. +A variable's type is simply the type of whatever value was last assigned +to it. In the following program fragment, the variable 'foo' has a +numeric value at first, and a string value later on: + + foo = 1 + print foo + foo = "bar" + print foo + +When the second assignment gives 'foo' a string value, the fact that it +previously had a numeric value is forgotten. + + String values that do not begin with a digit have a numeric value of +zero. After executing the following code, the value of 'foo' is five: + + foo = "a string" + foo = foo + 5 + + NOTE: Using a variable as a number and then later as a string can + be confusing and is poor programming style. The previous two + examples illustrate how 'awk' works, _not_ how you should write + your programs! + + An assignment is an expression, so it has a value--the same value +that is assigned. Thus, 'z = 1' is an expression with the value one. +One consequence of this is that you can write multiple assignments +together, such as: + + x = y = z = 5 + +This example stores the value five in all three variables ('x', 'y', and +'z'). It does so because the value of 'z = 5', which is five, is stored +into 'y' and then the value of 'y = z = 5', which is five, is stored +into 'x'. + + Assignments may be used anywhere an expression is called for. For +example, it is valid to write 'x != (y = 1)' to set 'y' to one, and then +test whether 'x' equals one. But this style tends to make programs hard +to read; such nesting of assignments should be avoided, except perhaps +in a one-shot program. + + Aside from '=', there are several other assignment operators that do +arithmetic with the old value of the variable. For example, the +operator '+=' computes a new value by adding the righthand value to the +old value of the variable. Thus, the following assignment adds five to +the value of 'foo': + + foo += 5 + +This is equivalent to the following: + + foo = foo + 5 + +Use whichever makes the meaning of your program clearer. + + There are situations where using '+=' (or any assignment operator) is +_not_ the same as simply repeating the lefthand operand in the righthand +expression. For example: + + # Thanks to Pat Rankin for this example + BEGIN { + foo[rand()] += 5 + for (x in foo) + print x, foo[x] + + bar[rand()] = bar[rand()] + 5 + for (x in bar) + print x, bar[x] + } + +The indices of 'bar' are practically guaranteed to be different, because +'rand()' returns different values each time it is called. (Arrays and +the 'rand()' function haven't been covered yet. *Note Arrays::, and +*note Numeric Functions:: for more information.) This example +illustrates an important fact about assignment operators: the lefthand +expression is only evaluated _once_. 
+ + It is up to the implementation as to which expression is evaluated +first, the lefthand or the righthand. Consider this example: + + i = 1 + a[i += 2] = i + 1 + +The value of 'a[3]' could be either two or four. + + *note Table 6.2: table-assign-ops. lists the arithmetic assignment +operators. In each case, the righthand operand is an expression whose +value is converted to a number. + +Operator Effect +-------------------------------------------------------------------------- +LVALUE '+=' Add INCREMENT to the value of LVALUE. +INCREMENT +LVALUE '-=' Subtract DECREMENT from the value of LVALUE. +DECREMENT +LVALUE '*=' Multiply the value of LVALUE by COEFFICIENT. +COEFFICIENT +LVALUE '/=' DIVISOR Divide the value of LVALUE by DIVISOR. +LVALUE '%=' MODULUS Set LVALUE to its remainder by MODULUS. +LVALUE '^=' POWER Raise LVALUE to the power POWER. +LVALUE '**=' POWER Raise LVALUE to the power POWER. (c.e.) + +Table 6.2: Arithmetic assignment operators + + NOTE: Only the '^=' operator is specified by POSIX. For maximum + portability, do not use the '**=' operator. + + Syntactic Ambiguities Between '/=' and Regular Expressions + + There is a syntactic ambiguity between the '/=' assignment operator +and regexp constants whose first character is an '='. (d.c.) This is +most notable in some commercial 'awk' versions. For example: + + $ awk /==/ /dev/null + error-> awk: syntax error at source line 1 + error-> context is + error-> >>> /= <<< + error-> awk: bailing out at source line 1 + +A workaround is: + + awk '/[=]=/' /dev/null + + 'gawk' does not have this problem; BWK 'awk' and 'mawk' also do not. + + +File: gawk.info, Node: Increment Ops, Prev: Assignment Ops, Up: All Operators + +6.2.4 Increment and Decrement Operators +--------------------------------------- + +"Increment" and "decrement operators" increase or decrease the value of +a variable by one. An assignment operator can do the same thing, so the +increment operators add no power to the 'awk' language; however, they +are convenient abbreviations for very common operations. + + The operator used for adding one is written '++'. It can be used to +increment a variable either before or after taking its value. To +"pre-increment" a variable 'v', write '++v'. This adds one to the value +of 'v'--that new value is also the value of the expression. (The +assignment expression 'v += 1' is completely equivalent.) Writing the +'++' after the variable specifies "post-increment". This increments the +variable value just the same; the difference is that the value of the +increment expression itself is the variable's _old_ value. Thus, if +'foo' has the value four, then the expression 'foo++' has the value +four, but it changes the value of 'foo' to five. In other words, the +operator returns the old value of the variable, but with the side effect +of incrementing it. + + The post-increment 'foo++' is nearly the same as writing '(foo += 1) +- 1'. It is not perfectly equivalent because all numbers in 'awk' are +floating point--in floating point, 'foo + 1 - 1' does not necessarily +equal 'foo'. But the difference is minute as long as you stick to +numbers that are fairly small (less than 10e12). + + Fields and array elements are incremented just like variables. (Use +'$(i++)' when you want to do a field reference and a variable increment +at the same time. The parentheses are necessary because of the +precedence of the field reference operator '$'.) 
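+
+   For example, the following fragment shows how the value of a
+post-increment expression differs from the variable's new value:
+
+     foo = 4
+     print foo++        # prints 4; 'foo' is now 5
+     print ++foo        # prints 6; 'foo' is now 6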
+ + The decrement operator '--' works just like '++', except that it +subtracts one instead of adding it. As with '++', it can be used before +the lvalue to pre-decrement or after it to post-decrement. Following is +a summary of increment and decrement expressions: + +'++LVALUE' + Increment LVALUE, returning the new value as the value of the + expression. + +'LVALUE++' + Increment LVALUE, returning the _old_ value of LVALUE as the value + of the expression. + +'--LVALUE' + Decrement LVALUE, returning the new value as the value of the + expression. (This expression is like '++LVALUE', but instead of + adding, it subtracts.) + +'LVALUE--' + Decrement LVALUE, returning the _old_ value of LVALUE as the value + of the expression. (This expression is like 'LVALUE++', but + instead of adding, it subtracts.) + + Operator Evaluation Order + + Doctor, it hurts when I do this! + Then don't do that! + -- _Groucho Marx_ + +What happens for something like the following? + + b = 6 + print b += b++ + +Or something even stranger? + + b = 6 + b += ++b + b++ + print b + + In other words, when do the various side effects prescribed by the +postfix operators ('b++') take effect? When side effects happen is +"implementation-defined". In other words, it is up to the particular +version of 'awk'. The result for the first example may be 12 or 13, and +for the second, it may be 22 or 23. + + In short, doing things like this is not recommended and definitely +not anything that you can rely upon for portability. You should avoid +such things in your own programs. + + +File: gawk.info, Node: Truth Values and Conditions, Next: Function Calls, Prev: All Operators, Up: Expressions + +6.3 Truth Values and Conditions +=============================== + +In certain contexts, expression values also serve as "truth values"; +i.e., they determine what should happen next as the program runs. This +minor node describes how 'awk' defines "true" and "false" and how values +are compared. + +* Menu: + +* Truth Values:: What is "true" and what is "false". +* Typing and Comparison:: How variables acquire types and how this + affects comparison of numbers and strings with + '<', etc. +* Boolean Ops:: Combining comparison expressions using boolean + operators '||' ("or"), '&&' + ("and") and '!' ("not"). +* Conditional Exp:: Conditional expressions select between two + subexpressions under control of a third + subexpression. + + +File: gawk.info, Node: Truth Values, Next: Typing and Comparison, Up: Truth Values and Conditions + +6.3.1 True and False in 'awk' +----------------------------- + +Many programming languages have a special representation for the +concepts of "true" and "false." Such languages usually use the special +constants 'true' and 'false', or perhaps their uppercase equivalents. +However, 'awk' is different. It borrows a very simple concept of true +and false from C. In 'awk', any nonzero numeric value _or_ any nonempty +string value is true. Any other value (zero or the null string, '""') +is false. The following program prints 'A strange truth value' three +times: + + BEGIN { + if (3.1415927) + print "A strange truth value" + if ("Four Score And Seven Years Ago") + print "A strange truth value" + if (j = 57) + print "A strange truth value" + } + + There is a surprising consequence of the "nonzero or non-null" rule: +the string constant '"0"' is actually true, because it is non-null. +(d.c.) 
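+
+   For example, these two one-liners make the distinction concrete:
+
+     $ gawk 'BEGIN { if ("0") print "true"; else print "false" }'
+     -| true
+     $ gawk 'BEGIN { if (0) print "true"; else print "false" }'
+     -| false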
+ + +File: gawk.info, Node: Typing and Comparison, Next: Boolean Ops, Prev: Truth Values, Up: Truth Values and Conditions + +6.3.2 Variable Typing and Comparison Expressions +------------------------------------------------ + + The Guide is definitive. Reality is frequently inaccurate. + -- _Douglas Adams, 'The Hitchhiker's Guide to the Galaxy'_ + + Unlike in other programming languages, in 'awk' variables do not have +a fixed type. Instead, they can be either a number or a string, +depending upon the value that is assigned to them. We look now at how +variables are typed, and how 'awk' compares variables. + +* Menu: + +* Variable Typing:: String type versus numeric type. +* Comparison Operators:: The comparison operators. +* POSIX String Comparison:: String comparison with POSIX rules. + + +File: gawk.info, Node: Variable Typing, Next: Comparison Operators, Up: Typing and Comparison + +6.3.2.1 String Type versus Numeric Type +....................................... + +The POSIX standard introduced the concept of a "numeric string", which +is simply a string that looks like a number--for example, '" +2"'. This +concept is used for determining the type of a variable. The type of the +variable is important because the types of two variables determine how +they are compared. Variable typing follows these rules: + + * A numeric constant or the result of a numeric operation has the + "numeric" attribute. + + * A string constant or the result of a string operation has the + "string" attribute. + + * Fields, 'getline' input, 'FILENAME', 'ARGV' elements, 'ENVIRON' + elements, and the elements of an array created by 'match()', + 'split()', and 'patsplit()' that are numeric strings have the + "strnum" attribute. Otherwise, they have the "string" attribute. + Uninitialized variables also have the "strnum" attribute. + + * Attributes propagate across assignments but are not changed by any + use. + + The last rule is particularly important. In the following program, +'a' has numeric type, even though it is later used in a string +operation: + + BEGIN { + a = 12.345 + b = a " is a cute number" + print b + } + + When two operands are compared, either string comparison or numeric +comparison may be used. This depends upon the attributes of the +operands, according to the following symmetric matrix: + + +------------------------------- + | STRING NUMERIC STRNUM + -----+------------------------------- + | + STRING | string string string + | + NUMERIC | string numeric numeric + | + STRNUM | string numeric numeric + -----+------------------------------- + + The basic idea is that user input that looks numeric--and _only_ user +input--should be treated as numeric, even though it is actually made of +characters and is therefore also a string. Thus, for example, the +string constant '" +3.14"', when it appears in program source code, is a +string--even though it looks numeric--and is _never_ treated as a number +for comparison purposes. + + In short, when one operand is a "pure" string, such as a string +constant, then a string comparison is performed. Otherwise, a numeric +comparison is performed. + + This point bears additional emphasis: All user input is made of +characters, and so is first and foremost of string type; input strings +that look numeric are additionally given the strnum attribute. Thus, +the six-character input string ' +3.14' receives the strnum attribute. +In contrast, the eight characters '" +3.14"' appearing in program text +comprise a string constant. 
The following examples print '1' when the +comparison between the two different constants is true, and '0' +otherwise: + + $ echo ' +3.14' | awk '{ print($0 == " +3.14") }' True + -| 1 + $ echo ' +3.14' | awk '{ print($0 == "+3.14") }' False + -| 0 + $ echo ' +3.14' | awk '{ print($0 == "3.14") }' False + -| 0 + $ echo ' +3.14' | awk '{ print($0 == 3.14) }' True + -| 1 + $ echo ' +3.14' | awk '{ print($1 == " +3.14") }' False + -| 0 + $ echo ' +3.14' | awk '{ print($1 == "+3.14") }' True + -| 1 + $ echo ' +3.14' | awk '{ print($1 == "3.14") }' False + -| 0 + $ echo ' +3.14' | awk '{ print($1 == 3.14) }' True + -| 1 + + +File: gawk.info, Node: Comparison Operators, Next: POSIX String Comparison, Prev: Variable Typing, Up: Typing and Comparison + +6.3.2.2 Comparison Operators +............................ + +"Comparison expressions" compare strings or numbers for relationships +such as equality. They are written using "relational operators", which +are a superset of those in C. *note Table 6.3: table-relational-ops. +describes them. + +Expression Result +-------------------------------------------------------------------------- +X '<' Y True if X is less than Y +X '<=' Y True if X is less than or equal to Y +X '>' Y True if X is greater than Y +X '>=' Y True if X is greater than or equal to Y +X '==' Y True if X is equal to Y +X '!=' Y True if X is not equal to Y +X '~' Y True if the string X matches the regexp denoted by Y +X '!~' Y True if the string X does not match the regexp + denoted by Y +SUBSCRIPT 'in' True if the array ARRAY has an element with the +ARRAY subscript SUBSCRIPT + +Table 6.3: Relational operators + + Comparison expressions have the value one if true and zero if false. +When comparing operands of mixed types, numeric operands are converted +to strings using the value of 'CONVFMT' (*note Conversion::). + + Strings are compared by comparing the first character of each, then +the second character of each, and so on. Thus, '"10"' is less than +'"9"'. If there are two strings where one is a prefix of the other, the +shorter string is less than the longer one. Thus, '"abc"' is less than +'"abcd"'. + + It is very easy to accidentally mistype the '==' operator and leave +off one of the '=' characters. The result is still valid 'awk' code, +but the program does not do what is intended: + + if (a = b) # oops! should be a == b + ... + else + ... + +Unless 'b' happens to be zero or the null string, the 'if' part of the +test always succeeds. Because the operators are so similar, this kind +of error is very difficult to spot when scanning the source code. + + The following list of expressions illustrates the kinds of +comparisons 'awk' performs, as well as what the result of each +comparison is: + +'1.5 <= 2.0' + Numeric comparison (true) + +'"abc" >= "xyz"' + String comparison (false) + +'1.5 != " +2"' + String comparison (true) + +'"1e2" < "3"' + String comparison (true) + +'a = 2; b = "2"' +'a == b' + String comparison (true) + +'a = 2; b = " +2"' +'a == b' + String comparison (false) + + In this example: + + $ echo 1e2 3 | awk '{ print ($1 < $2) ? "true" : "false" }' + -| false + +the result is 'false' because both '$1' and '$2' are user input. They +are numeric strings--therefore both have the strnum attribute, dictating +a numeric comparison. The purpose of the comparison rules and the use +of numeric strings is to attempt to produce the behavior that is "least +surprising," while still "doing the right thing." 
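+
+   A related pair of one-liners makes the same point with '10' and
+'9': as fields (i.e., user input), they compare numerically, but as
+string constants they compare as strings:
+
+     $ echo 10 9 | gawk '{ print ($1 < $2) ? "true" : "false" }'
+     -| false
+     $ gawk 'BEGIN { print ("10" < "9") ? "true" : "false" }'
+     -| true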
+ + String comparisons and regular expression comparisons are very +different. For example: + + x == "foo" + +has the value one, or is true if the variable 'x' is precisely 'foo'. +By contrast: + + x ~ /foo/ + +has the value one if 'x' contains 'foo', such as '"Oh, what a fool am +I!"'. + + The righthand operand of the '~' and '!~' operators may be either a +regexp constant ('/'...'/') or an ordinary expression. In the latter +case, the value of the expression as a string is used as a dynamic +regexp (*note Regexp Usage::; also *note Computed Regexps::). + + A constant regular expression in slashes by itself is also an +expression. '/REGEXP/' is an abbreviation for the following comparison +expression: + + $0 ~ /REGEXP/ + + One special place where '/foo/' is _not_ an abbreviation for '$0 ~ +/foo/' is when it is the righthand operand of '~' or '!~'. *Note Using +Constant Regexps::, where this is discussed in more detail. + + +File: gawk.info, Node: POSIX String Comparison, Prev: Comparison Operators, Up: Typing and Comparison + +6.3.2.3 String Comparison Based on Locale Collating Order +......................................................... + +The POSIX standard used to say that all string comparisons are performed +based on the locale's "collating order". This is the order in which +characters sort, as defined by the locale (for more discussion, *note +Locales::). This order is usually very different from the results +obtained when doing straight byte-by-byte comparison.(1) + + Because this behavior differs considerably from existing practice, +'gawk' only implemented it when in POSIX mode (*note Options::). Here +is an example to illustrate the difference, in an 'en_US.UTF-8' locale: + + $ gawk 'BEGIN { printf("ABC < abc = %s\n", + > ("ABC" < "abc" ? "TRUE" : "FALSE")) }' + -| ABC < abc = TRUE + $ gawk --posix 'BEGIN { printf("ABC < abc = %s\n", + > ("ABC" < "abc" ? "TRUE" : "FALSE")) }' + -| ABC < abc = FALSE + + Fortunately, as of August 2016, comparison based on locale collating +order is no longer required for the '==' and '!=' operators.(2) +However, comparison based on locales is still required for '<', '<=', +'>', and '>='. POSIX thus recommends as follows: + + Since the '==' operator checks whether strings are identical, not + whether they collate equally, applications needing to check whether + strings collate equally can use: + + a <= b && a >= b + + As of version 4.2, 'gawk' continues to use locale collating order for +'<', '<=', '>', and '>=' only in POSIX mode. + + ---------- Footnotes ---------- + + (1) Technically, string comparison is supposed to behave the same way +as if the strings were compared with the C 'strcoll()' function. + + (2) See the Austin Group website +(http://austingroupbugs.net/view.php?id=1070). + + +File: gawk.info, Node: Boolean Ops, Next: Conditional Exp, Prev: Typing and Comparison, Up: Truth Values and Conditions + +6.3.3 Boolean Expressions +------------------------- + +A "Boolean expression" is a combination of comparison expressions or +matching expressions, using the Boolean operators "or" ('||'), "and" +('&&'), and "not" ('!'), along with parentheses to control nesting. The +truth value of the Boolean expression is computed by combining the truth +values of the component expressions. Boolean expressions are also +referred to as "logical expressions". The terms are equivalent. + + Boolean expressions can be used wherever comparison and matching +expressions can be used. 
They can be used in 'if', 'while', 'do', and +'for' statements (*note Statements::). They have numeric values (one if +true, zero if false) that come into play if the result of the Boolean +expression is stored in a variable or used in arithmetic. + + In addition, every Boolean expression is also a valid pattern, so you +can use one as a pattern to control the execution of rules. The Boolean +operators are: + +'BOOLEAN1 && BOOLEAN2' + True if both BOOLEAN1 and BOOLEAN2 are true. For example, the + following statement prints the current input record if it contains + both 'edu' and 'li': + + if ($0 ~ /edu/ && $0 ~ /li/) print + + The subexpression BOOLEAN2 is evaluated only if BOOLEAN1 is true. + This can make a difference when BOOLEAN2 contains expressions that + have side effects. In the case of '$0 ~ /foo/ && ($2 == bar++)', + the variable 'bar' is not incremented if there is no substring + 'foo' in the record. + +'BOOLEAN1 || BOOLEAN2' + True if at least one of BOOLEAN1 or BOOLEAN2 is true. For example, + the following statement prints all records in the input that + contain _either_ 'edu' or 'li': + + if ($0 ~ /edu/ || $0 ~ /li/) print + + The subexpression BOOLEAN2 is evaluated only if BOOLEAN1 is false. + This can make a difference when BOOLEAN2 contains expressions that + have side effects. (Thus, this test never really distinguishes + records that contain both 'edu' and 'li'--as soon as 'edu' is + matched, the full test succeeds.) + +'! BOOLEAN' + True if BOOLEAN is false. For example, the following program + prints 'no home!' in the unusual event that the 'HOME' environment + variable is not defined: + + BEGIN { if (! ("HOME" in ENVIRON)) + print "no home!" } + + (The 'in' operator is described in *note Reference to Elements::.) + + The '&&' and '||' operators are called "short-circuit" operators +because of the way they work. Evaluation of the full expression is +"short-circuited" if the result can be determined partway through its +evaluation. + + Statements that end with '&&' or '||' can be continued simply by +putting a newline after them. But you cannot put a newline in front of +either of these operators without using backslash continuation (*note +Statements/Lines::). + + The actual value of an expression using the '!' operator is either +one or zero, depending upon the truth value of the expression it is +applied to. The '!' operator is often useful for changing the sense of +a flag variable from false to true and back again. For example, the +following program is one way to print lines in between special +bracketing lines: + + $1 == "START" { interested = ! interested; next } + interested { print } + $1 == "END" { interested = ! interested; next } + +The variable 'interested', as with all 'awk' variables, starts out +initialized to zero, which is also false. When a line is seen whose +first field is 'START', the value of 'interested' is toggled to true, +using '!'. The next rule prints lines as long as 'interested' is true. +When a line is seen whose first field is 'END', 'interested' is toggled +back to false.(1) + + Most commonly, the '!' operator is used in the conditions of 'if' and +'while' statements, where it often makes more sense to phrase the logic +in the negative: + + if (! SOME CONDITION || SOME OTHER CONDITION) { + ... DO WHATEVER PROCESSING ... + } + + NOTE: The 'next' statement is discussed in *note Next Statement::. + 'next' tells 'awk' to skip the rest of the rules, get the next + record, and start processing the rules over again at the top. 
The + reason it's there is to avoid printing the bracketing 'START' and + 'END' lines. + + ---------- Footnotes ---------- + + (1) This program has a bug; it prints lines starting with 'END'. How +would you fix it? + + +File: gawk.info, Node: Conditional Exp, Prev: Boolean Ops, Up: Truth Values and Conditions + +6.3.4 Conditional Expressions +----------------------------- + +A "conditional expression" is a special kind of expression that has +three operands. It allows you to use one expression's value to select +one of two other expressions. The conditional expression in 'awk' is +the same as in the C language, as shown here: + + SELECTOR ? IF-TRUE-EXP : IF-FALSE-EXP + +There are three subexpressions. The first, SELECTOR, is always computed +first. If it is "true" (not zero or not null), then IF-TRUE-EXP is +computed next, and its value becomes the value of the whole expression. +Otherwise, IF-FALSE-EXP is computed next, and its value becomes the +value of the whole expression. For example, the following expression +produces the absolute value of 'x': + + x >= 0 ? x : -x + + Each time the conditional expression is computed, only one of +IF-TRUE-EXP and IF-FALSE-EXP is used; the other is ignored. This is +important when the expressions have side effects. For example, this +conditional expression examines element 'i' of either array 'a' or array +'b', and increments 'i': + + x == y ? a[i++] : b[i++] + +This is guaranteed to increment 'i' exactly once, because each time only +one of the two increment expressions is executed and the other is not. +*Note Arrays::, for more information about arrays. + + As a minor 'gawk' extension, a statement that uses '?:' can be +continued simply by putting a newline after either character. However, +putting a newline in front of either character does not work without +using backslash continuation (*note Statements/Lines::). If '--posix' +is specified (*note Options::), this extension is disabled. + + +File: gawk.info, Node: Function Calls, Next: Precedence, Prev: Truth Values and Conditions, Up: Expressions + +6.4 Function Calls +================== + +A "function" is a name for a particular calculation. This enables you +to ask for it by name at any point in the program. For example, the +function 'sqrt()' computes the square root of a number. + + A fixed set of functions are "built in", which means they are +available in every 'awk' program. The 'sqrt()' function is one of +these. *Note Built-in:: for a list of built-in functions and their +descriptions. In addition, you can define functions for use in your +program. *Note User-defined:: for instructions on how to do this. +Finally, 'gawk' lets you write functions in C or C++ that may be called +from your program (*note Dynamic Extensions::). + + The way to use a function is with a "function call" expression, which +consists of the function name followed immediately by a list of +"arguments" in parentheses. The arguments are expressions that provide +the raw materials for the function's calculations. When there is more +than one argument, they are separated by commas. If there are no +arguments, just write '()' after the function name. The following +examples show function calls with and without arguments: + + sqrt(x^2 + y^2) one argument + atan2(y, x) two arguments + rand() no arguments + + CAUTION: Do not put any space between the function name and the + opening parenthesis! 
A user-defined function name looks just like + the name of a variable--a space would make the expression look like + concatenation of a variable with an expression inside parentheses. + With built-in functions, space before the parenthesis is harmless, + but it is best not to get into the habit of using space to avoid + mistakes with user-defined functions. + + Each function expects a particular number of arguments. For example, +the 'sqrt()' function must be called with a single argument, the number +of which to take the square root: + + sqrt(ARGUMENT) + + Some of the built-in functions have one or more optional arguments. +If those arguments are not supplied, the functions use a reasonable +default value. *Note Built-in:: for full details. If arguments are +omitted in calls to user-defined functions, then those arguments are +treated as local variables. Such local variables act like the empty +string if referenced where a string value is required, and like zero if +referenced where a numeric value is required (*note User-defined::). + + As an advanced feature, 'gawk' provides indirect function calls, +which is a way to choose the function to call at runtime, instead of +when you write the source code to your program. We defer discussion of +this feature until later; see *note Indirect Calls::. + + Like every other expression, the function call has a value, often +called the "return value", which is computed by the function based on +the arguments you give it. In this example, the return value of +'sqrt(ARGUMENT)' is the square root of ARGUMENT. The following program +reads numbers, one number per line, and prints the square root of each +one: + + $ awk '{ print "The square root of", $1, "is", sqrt($1) }' + 1 + -| The square root of 1 is 1 + 3 + -| The square root of 3 is 1.73205 + 5 + -| The square root of 5 is 2.23607 + Ctrl-d + + A function can also have side effects, such as assigning values to +certain variables or doing I/O. This program shows how the 'match()' +function (*note String Functions::) changes the variables 'RSTART' and +'RLENGTH': + + { + if (match($1, $2)) + print RSTART, RLENGTH + else + print "no match" + } + +Here is a sample run: + + $ awk -f matchit.awk + aaccdd c+ + -| 3 2 + foo bar + -| no match + abcdefg e + -| 5 1 + + +File: gawk.info, Node: Precedence, Next: Locales, Prev: Function Calls, Up: Expressions + +6.5 Operator Precedence (How Operators Nest) +============================================ + +"Operator precedence" determines how operators are grouped when +different operators appear close by in one expression. For example, '*' +has higher precedence than '+'; thus, 'a + b * c' means to multiply 'b' +and 'c', and then add 'a' to the product (i.e., 'a + (b * c)'). + + The normal precedence of the operators can be overruled by using +parentheses. Think of the precedence rules as saying where the +parentheses are assumed to be. In fact, it is wise to always use +parentheses whenever there is an unusual combination of operators, +because other people who read the program may not remember what the +precedence is in this case. Even experienced programmers occasionally +forget the exact rules, which leads to mistakes. Explicit parentheses +help prevent any such mistakes. + + When operators of equal precedence are used together, the leftmost +operator groups first, except for the assignment, conditional, and +exponentiation operators, which group in the opposite order. Thus, 'a - +b + c' groups as '(a - b) + c' and 'a = b = c' groups as 'a = (b = c)'. 
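+
+   As a small illustrative sketch (the numeric values are arbitrary;
+they only serve to make the grouping visible), the following 'BEGIN'
+rule demonstrates both of these grouping rules:
+
+     BEGIN {
+         a = 10; b = 4; c = 2
+         print a - b + c    # groups as (a - b) + c, so this prints 8
+         a = b = c          # groups as a = (b = c); both become 2
+         print a, b         # prints "2 2"
+     }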
+ + Normally the precedence of prefix unary operators does not matter, +because there is only one way to interpret them: innermost first. Thus, +'$++i' means '$(++i)' and '++$x' means '++($x)'. However, when another +operator follows the operand, then the precedence of the unary operators +can matter. '$x^2' means '($x)^2', but '-x^2' means '-(x^2)', because +'-' has lower precedence than '^', whereas '$' has higher precedence. +Also, operators cannot be combined in a way that violates the precedence +rules; for example, '$$0++--' is not a valid expression because the +first '$' has higher precedence than the '++'; to avoid the problem the +expression can be rewritten as '$($0++)--'. + + This list presents 'awk''s operators, in order of highest to lowest +precedence: + +'('...')' + Grouping. + +'$' + Field reference. + +'++ --' + Increment, decrement. + +'^ **' + Exponentiation. These operators group right to left. + +'+ - !' + Unary plus, minus, logical "not." + +'* / %' + Multiplication, division, remainder. + +'+ -' + Addition, subtraction. + +String concatenation + There is no special symbol for concatenation. The operands are + simply written side by side (*note Concatenation::). + +'< <= == != > >= >> | |&' + Relational and redirection. The relational operators and the + redirections have the same precedence level. Characters such as + '>' serve both as relationals and as redirections; the context + distinguishes between the two meanings. + + Note that the I/O redirection operators in 'print' and 'printf' + statements belong to the statement level, not to expressions. The + redirection does not produce an expression that could be the + operand of another operator. As a result, it does not make sense + to use a redirection operator near another operator of lower + precedence without parentheses. Such combinations (e.g., 'print + foo > a ? b : c') result in syntax errors. The correct way to + write this statement is 'print foo > (a ? b : c)'. + +'~ !~' + Matching, nonmatching. + +'in' + Array membership. + +'&&' + Logical "and." + +'||' + Logical "or." + +'?:' + Conditional. This operator groups right to left. + +'= += -= *= /= %= ^= **=' + Assignment. These operators group right to left. + + NOTE: The '|&', '**', and '**=' operators are not specified by + POSIX. For maximum portability, do not use them. + + +File: gawk.info, Node: Locales, Next: Expressions Summary, Prev: Precedence, Up: Expressions + +6.6 Where You Are Makes a Difference +==================================== + +Modern systems support the notion of "locales": a way to tell the system +about the local character set and language. The ISO C standard defines +a default '"C"' locale, which is an environment that is typical of what +many C programmers are used to. + + Once upon a time, the locale setting used to affect regexp matching, +but this is no longer true (*note Ranges and Locales::). + + Locales can affect record splitting. For the normal case of 'RS = +"\n"', the locale is largely irrelevant. For other single-character +record separators, setting 'LC_ALL=C' in the environment will give you +much better performance when reading records. Otherwise, 'gawk' has to +make several function calls, _per input character_, to find the record +terminator. + + Locales can affect how dates and times are formatted (*note Time +Functions::). For example, a common way to abbreviate the date +September 4, 2015, in the United States is "9/4/15." In many countries +in Europe, however, it is abbreviated "4.9.15." 
Thus, the '%x' +specification in a '"US"' locale might produce '9/4/15', while in a +'"EUROPE"' locale, it might produce '4.9.15'. + + According to POSIX, string comparison is also affected by locales +(similar to regular expressions). The details are presented in *note +POSIX String Comparison::. + + Finally, the locale affects the value of the decimal point character +used when 'gawk' parses input data. This is discussed in detail in +*note Conversion::. + + +File: gawk.info, Node: Expressions Summary, Prev: Locales, Up: Expressions + +6.7 Summary +=========== + + * Expressions are the basic elements of computation in programs. + They are built from constants, variables, function calls, and + combinations of the various kinds of values with operators. + + * 'awk' supplies three kinds of constants: numeric, string, and + regexp. 'gawk' lets you specify numeric constants in octal and + hexadecimal (bases 8 and 16) as well as decimal (base 10). In + certain contexts, a standalone regexp constant such as '/foo/' has + the same meaning as '$0 ~ /foo/'. + + * Variables hold values between uses in computations. A number of + built-in variables provide information to your 'awk' program, and a + number of others let you control how 'awk' behaves. + + * Numbers are automatically converted to strings, and strings to + numbers, as needed by 'awk'. Numeric values are converted as if + they were formatted with 'sprintf()' using the format in 'CONVFMT'. + Locales can influence the conversions. + + * 'awk' provides the usual arithmetic operators (addition, + subtraction, multiplication, division, modulus), and unary plus and + minus. It also provides comparison operators, Boolean operators, + an array membership testing operator, and regexp matching + operators. String concatenation is accomplished by placing two + expressions next to each other; there is no explicit operator. The + three-operand '?:' operator provides an "if-else" test within + expressions. + + * Assignment operators provide convenient shorthands for common + arithmetic operations. + + * In 'awk', a value is considered to be true if it is nonzero _or_ + non-null. Otherwise, the value is false. + + * A variable's type is set upon each assignment and may change over + its lifetime. The type determines how it behaves in comparisons + (string or numeric). + + * Function calls return a value that may be used as part of a larger + expression. Expressions used to pass parameter values are fully + evaluated before the function is called. 'awk' provides built-in + and user-defined functions; this is described in *note Functions::. + + * Operator precedence specifies the order in which operations are + performed, unless explicitly overridden by parentheses. 'awk''s + operator precedence is compatible with that of C. + + * Locales can affect the format of data as output by an 'awk' + program, and occasionally the format for data read as input. + + +File: gawk.info, Node: Patterns and Actions, Next: Arrays, Prev: Expressions, Up: Top + +7 Patterns, Actions, and Variables +********************************** + +As you have already seen, each 'awk' statement consists of a pattern +with an associated action. This major node describes how you build +patterns and actions, what kinds of things you can do within actions, +and 'awk''s predefined variables. + + The pattern-action rules and the statements available for use within +actions form the core of 'awk' programming. 
In a sense, everything +covered up to here has been the foundation that programs are built on +top of. Now it's time to start building something useful. + +* Menu: + +* Pattern Overview:: What goes into a pattern. +* Using Shell Variables:: How to use shell variables with 'awk'. +* Action Overview:: What goes into an action. +* Statements:: Describes the various control statements in + detail. +* Built-in Variables:: Summarizes the predefined variables. +* Pattern Action Summary:: Patterns and Actions summary. + + +File: gawk.info, Node: Pattern Overview, Next: Using Shell Variables, Up: Patterns and Actions + +7.1 Pattern Elements +==================== + +* Menu: + +* Regexp Patterns:: Using regexps as patterns. +* Expression Patterns:: Any expression can be used as a pattern. +* Ranges:: Pairs of patterns specify record ranges. +* BEGIN/END:: Specifying initialization and cleanup rules. +* BEGINFILE/ENDFILE:: Two special patterns for advanced control. +* Empty:: The empty pattern, which matches every record. + +Patterns in 'awk' control the execution of rules--a rule is executed +when its pattern matches the current input record. The following is a +summary of the types of 'awk' patterns: + +'/REGULAR EXPRESSION/' + A regular expression. It matches when the text of the input record + fits the regular expression. (*Note Regexp::.) + +'EXPRESSION' + A single expression. It matches when its value is nonzero (if a + number) or non-null (if a string). (*Note Expression Patterns::.) + +'BEGPAT, ENDPAT' + A pair of patterns separated by a comma, specifying a "range" of + records. The range includes both the initial record that matches + BEGPAT and the final record that matches ENDPAT. (*Note Ranges::.) + +'BEGIN' +'END' + Special patterns for you to supply startup or cleanup actions for + your 'awk' program. (*Note BEGIN/END::.) + +'BEGINFILE' +'ENDFILE' + Special patterns for you to supply startup or cleanup actions to be + done on a per-file basis. (*Note BEGINFILE/ENDFILE::.) + +'EMPTY' + The empty pattern matches every input record. (*Note Empty::.) + + +File: gawk.info, Node: Regexp Patterns, Next: Expression Patterns, Up: Pattern Overview + +7.1.1 Regular Expressions as Patterns +------------------------------------- + +Regular expressions are one of the first kinds of patterns presented in +this book. This kind of pattern is simply a regexp constant in the +pattern part of a rule. Its meaning is '$0 ~ /PATTERN/'. The pattern +matches when the input record matches the regexp. For example: + + /foo|bar|baz/ { buzzwords++ } + END { print buzzwords, "buzzwords seen" } + + +File: gawk.info, Node: Expression Patterns, Next: Ranges, Prev: Regexp Patterns, Up: Pattern Overview + +7.1.2 Expressions as Patterns +----------------------------- + +Any 'awk' expression is valid as an 'awk' pattern. The pattern matches +if the expression's value is nonzero (if a number) or non-null (if a +string). The expression is reevaluated each time the rule is tested +against a new input record. If the expression uses fields such as '$1', +the value depends directly on the new input record's text; otherwise, it +depends on only what has happened so far in the execution of the 'awk' +program. + + Comparison expressions, using the comparison operators described in +*note Typing and Comparison::, are a very common kind of pattern. +Regexp matching and nonmatching are also very common expressions. The +left operand of the '~' and '!~' operators is a string. 
The right +operand is either a constant regular expression enclosed in slashes +('/REGEXP/'), or any expression whose string value is used as a dynamic +regular expression (*note Computed Regexps::). The following example +prints the second field of each input record whose first field is +precisely 'li': + + $ awk '$1 == "li" { print $2 }' mail-list + +(There is no output, because there is no person with the exact name +'li'.) Contrast this with the following regular expression match, which +accepts any record with a first field that contains 'li': + + $ awk '$1 ~ /li/ { print $2 }' mail-list + -| 555-5553 + -| 555-6699 + + A regexp constant as a pattern is also a special case of an +expression pattern. The expression '/li/' has the value one if 'li' +appears in the current input record. Thus, as a pattern, '/li/' matches +any record containing 'li'. + + Boolean expressions are also commonly used as patterns. Whether the +pattern matches an input record depends on whether its subexpressions +match. For example, the following command prints all the records in +'mail-list' that contain both 'edu' and 'li': + + $ awk '/edu/ && /li/' mail-list + -| Samuel 555-3430 samuel.lanceolis@shu.edu A + + The following command prints all records in 'mail-list' that contain +_either_ 'edu' or 'li' (or both, of course): + + $ awk '/edu/ || /li/' mail-list + -| Amelia 555-5553 amelia.zodiacusque@gmail.com F + -| Broderick 555-0542 broderick.aliquotiens@yahoo.com R + -| Fabius 555-1234 fabius.undevicesimus@ucb.edu F + -| Julie 555-6699 julie.perscrutabor@skeeve.com F + -| Samuel 555-3430 samuel.lanceolis@shu.edu A + -| Jean-Paul 555-2127 jeanpaul.campanorum@nyu.edu R + + The following command prints all records in 'mail-list' that do _not_ +contain the string 'li': + + $ awk '! /li/' mail-list + -| Anthony 555-3412 anthony.asserturo@hotmail.com A + -| Becky 555-7685 becky.algebrarum@gmail.com A + -| Bill 555-1675 bill.drowning@hotmail.com A + -| Camilla 555-2912 camilla.infusarum@skynet.be R + -| Fabius 555-1234 fabius.undevicesimus@ucb.edu F + -| Martin 555-6480 martin.codicibus@hotmail.com A + -| Jean-Paul 555-2127 jeanpaul.campanorum@nyu.edu R + + The subexpressions of a Boolean operator in a pattern can be constant +regular expressions, comparisons, or any other 'awk' expressions. Range +patterns are not expressions, so they cannot appear inside Boolean +patterns. Likewise, the special patterns 'BEGIN', 'END', 'BEGINFILE', +and 'ENDFILE', which never match any input record, are not expressions +and cannot appear inside Boolean patterns. + + The precedence of the different operators that can appear in patterns +is described in *note Precedence::. + + +File: gawk.info, Node: Ranges, Next: BEGIN/END, Prev: Expression Patterns, Up: Pattern Overview + +7.1.3 Specifying Record Ranges with Patterns +-------------------------------------------- + +A "range pattern" is made of two patterns separated by a comma, in the +form 'BEGPAT, ENDPAT'. It is used to match ranges of consecutive input +records. The first pattern, BEGPAT, controls where the range begins, +while ENDPAT controls where the pattern ends. For example, the +following: + + awk '$1 == "on", $1 == "off"' myfile + +prints every record in 'myfile' between 'on'/'off' pairs, inclusive. + + A range pattern starts out by matching BEGPAT against every input +record. When a record matches BEGPAT, the range pattern is "turned on", +and the range pattern matches this record as well. 
As long as the range +pattern stays turned on, it automatically matches every input record +read. The range pattern also matches ENDPAT against every input record; +when this succeeds, the range pattern is "turned off" again for the +following record. Then the range pattern goes back to checking BEGPAT +against each record. + + The record that turns on the range pattern and the one that turns it +off both match the range pattern. If you don't want to operate on these +records, you can write 'if' statements in the rule's action to +distinguish them from the records you are interested in. + + It is possible for a pattern to be turned on and off by the same +record. If the record satisfies both conditions, then the action is +executed for just that record. For example, suppose there is text +between two identical markers (e.g., the '%' symbol), each on its own +line, that should be ignored. A first attempt would be to combine a +range pattern that describes the delimited text with the 'next' +statement (not discussed yet, *note Next Statement::). This causes +'awk' to skip any further processing of the current record and start +over again with the next input record. Such a program looks like this: + + /^%$/,/^%$/ { next } + { print } + +This program fails because the range pattern is both turned on and +turned off by the first line, which just has a '%' on it. To accomplish +this task, write the program in the following manner, using a flag: + + /^%$/ { skip = ! skip; next } + skip == 1 { next } # skip lines with `skip' set + + In a range pattern, the comma (',') has the lowest precedence of all +the operators (i.e., it is evaluated last). Thus, the following program +attempts to combine a range pattern with another, simpler test: + + echo Yes | awk '/1/,/2/ || /Yes/' + + The intent of this program is '(/1/,/2/) || /Yes/'. However, 'awk' +interprets this as '/1/, (/2/ || /Yes/)'. This cannot be changed or +worked around; range patterns do not combine with other patterns: + + $ echo Yes | gawk '(/1/,/2/) || /Yes/' + error-> gawk: cmd. line:1: (/1/,/2/) || /Yes/ + error-> gawk: cmd. line:1: ^ syntax error + + As a minor point of interest, although it is poor style, POSIX allows +you to put a newline after the comma in a range pattern. (d.c.) + + +File: gawk.info, Node: BEGIN/END, Next: BEGINFILE/ENDFILE, Prev: Ranges, Up: Pattern Overview + +7.1.4 The 'BEGIN' and 'END' Special Patterns +-------------------------------------------- + +All the patterns described so far are for matching input records. The +'BEGIN' and 'END' special patterns are different. They supply startup +and cleanup actions for 'awk' programs. 'BEGIN' and 'END' rules must +have actions; there is no default action for these rules because there +is no current record when they run. 'BEGIN' and 'END' rules are often +referred to as "'BEGIN' and 'END' blocks" by longtime 'awk' programmers. + +* Menu: + +* Using BEGIN/END:: How and why to use BEGIN/END rules. +* I/O And BEGIN/END:: I/O issues in BEGIN/END rules. + + +File: gawk.info, Node: Using BEGIN/END, Next: I/O And BEGIN/END, Up: BEGIN/END + +7.1.4.1 Startup and Cleanup Actions +................................... + +A 'BEGIN' rule is executed once only, before the first input record is +read. Likewise, an 'END' rule is executed once only, after all the +input is read. For example: + + $ awk ' + > BEGIN { print "Analysis of \"li\"" } + > /li/ { ++n } + > END { print "\"li\" appears in", n, "records." }' mail-list + -| Analysis of "li" + -| "li" appears in 4 records. 
+ + This program finds the number of records in the input file +'mail-list' that contain the string 'li'. The 'BEGIN' rule prints a +title for the report. There is no need to use the 'BEGIN' rule to +initialize the counter 'n' to zero, as 'awk' does this automatically +(*note Variables::). The second rule increments the variable 'n' every +time a record containing the pattern 'li' is read. The 'END' rule +prints the value of 'n' at the end of the run. + + The special patterns 'BEGIN' and 'END' cannot be used in ranges or +with Boolean operators (indeed, they cannot be used with any operators). +An 'awk' program may have multiple 'BEGIN' and/or 'END' rules. They are +executed in the order in which they appear: all the 'BEGIN' rules at +startup and all the 'END' rules at termination. 'BEGIN' and 'END' rules +may be intermixed with other rules. This feature was added in the 1987 +version of 'awk' and is included in the POSIX standard. The original +(1978) version of 'awk' required the 'BEGIN' rule to be placed at the +beginning of the program, the 'END' rule to be placed at the end, and +only allowed one of each. This is no longer required, but it is a good +idea to follow this template in terms of program organization and +readability. + + Multiple 'BEGIN' and 'END' rules are useful for writing library +functions, because each library file can have its own 'BEGIN' and/or +'END' rule to do its own initialization and/or cleanup. The order in +which library functions are named on the command line controls the order +in which their 'BEGIN' and 'END' rules are executed. Therefore, you +have to be careful when writing such rules in library files so that the +order in which they are executed doesn't matter. *Note Options:: for +more information on using library functions. *Note Library Functions::, +for a number of useful library functions. + + If an 'awk' program has only 'BEGIN' rules and no other rules, then +the program exits after the 'BEGIN' rules are run.(1) However, if an +'END' rule exists, then the input is read, even if there are no other +rules in the program. This is necessary in case the 'END' rule checks +the 'FNR' and 'NR' variables. + + ---------- Footnotes ---------- + + (1) The original version of 'awk' kept reading and ignoring input +until the end of the file was seen. + + +File: gawk.info, Node: I/O And BEGIN/END, Prev: Using BEGIN/END, Up: BEGIN/END + +7.1.4.2 Input/Output from 'BEGIN' and 'END' Rules +................................................. + +There are several (sometimes subtle) points to be aware of when doing +I/O from a 'BEGIN' or 'END' rule. The first has to do with the value of +'$0' in a 'BEGIN' rule. Because 'BEGIN' rules are executed before any +input is read, there simply is no input record, and therefore no fields, +when executing 'BEGIN' rules. References to '$0' and the fields yield a +null string or zero, depending upon the context. One way to give '$0' a +real value is to execute a 'getline' command without a variable (*note +Getline::). Another way is simply to assign a value to '$0'. + + The second point is similar to the first, but from the other +direction. Traditionally, due largely to implementation issues, '$0' +and 'NF' were _undefined_ inside an 'END' rule. The POSIX standard +specifies that 'NF' is available in an 'END' rule. It contains the +number of fields from the last input record. Most probably due to an +oversight, the standard does not say that '$0' is also preserved, +although logically one would think that it should be. 
In fact, all of +BWK 'awk', 'mawk', and 'gawk' preserve the value of '$0' for use in +'END' rules. Be aware, however, that some other implementations and +many older versions of Unix 'awk' do not. + + The third point follows from the first two. The meaning of 'print' +inside a 'BEGIN' or 'END' rule is the same as always: 'print $0'. If +'$0' is the null string, then this prints an empty record. Many +longtime 'awk' programmers use an unadorned 'print' in 'BEGIN' and 'END' +rules, to mean 'print ""', relying on '$0' being null. Although one +might generally get away with this in 'BEGIN' rules, it is a very bad +idea in 'END' rules, at least in 'gawk'. It is also poor style, because +if an empty line is needed in the output, the program should print one +explicitly. + + Finally, the 'next' and 'nextfile' statements are not allowed in a +'BEGIN' rule, because the implicit +read-a-record-and-match-against-the-rules loop has not started yet. +Similarly, those statements are not valid in an 'END' rule, because all +the input has been read. (*Note Next Statement:: and *note Nextfile +Statement::.) + + +File: gawk.info, Node: BEGINFILE/ENDFILE, Next: Empty, Prev: BEGIN/END, Up: Pattern Overview + +7.1.5 The 'BEGINFILE' and 'ENDFILE' Special Patterns +---------------------------------------------------- + +This minor node describes a 'gawk'-specific feature. + + Two special kinds of rule, 'BEGINFILE' and 'ENDFILE', give you +"hooks" into 'gawk''s command-line file processing loop. As with the +'BEGIN' and 'END' rules (*note BEGIN/END::), all 'BEGINFILE' rules in a +program are merged, in the order they are read by 'gawk', and all +'ENDFILE' rules are merged as well. + + The body of the 'BEGINFILE' rules is executed just before 'gawk' +reads the first record from a file. 'FILENAME' is set to the name of +the current file, and 'FNR' is set to zero. + + The 'BEGINFILE' rule provides you the opportunity to accomplish two +tasks that would otherwise be difficult or impossible to perform: + + * You can test if the file is readable. Normally, it is a fatal + error if a file named on the command line cannot be opened for + reading. However, you can bypass the fatal error and move on to + the next file on the command line. + + You do this by checking if the 'ERRNO' variable is not the empty + string; if so, then 'gawk' was not able to open the file. In this + case, your program can execute the 'nextfile' statement (*note + Nextfile Statement::). This causes 'gawk' to skip the file + entirely. Otherwise, 'gawk' exits with the usual fatal error. + + * If you have written extensions that modify the record handling (by + inserting an "input parser"; *note Input Parsers::), you can invoke + them at this point, before 'gawk' has started processing the file. + (This is a _very_ advanced feature, currently used only by the + 'gawkextlib' project (http://sourceforge.net/projects/gawkextlib).) + + The 'ENDFILE' rule is called when 'gawk' has finished processing the +last record in an input file. For the last input file, it will be +called before any 'END' rules. The 'ENDFILE' rule is executed even for +empty input files. + + Normally, when an error occurs when reading input in the normal +input-processing loop, the error is fatal. However, if an 'ENDFILE' +rule is present, the error becomes non-fatal, and instead 'ERRNO' is +set. This makes it possible to catch and process I/O errors at the +level of the 'awk' program. 
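+
+   For example, here is a minimal sketch of the readability check just
+described (the warning text is merely illustrative):
+
+     BEGINFILE {
+         if (ERRNO != "") {
+             print "warning: skipping", FILENAME > "/dev/stderr"
+             nextfile    # skip this file instead of a fatal error
+         }
+     }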
+ + The 'next' statement (*note Next Statement::) is not allowed inside +either a 'BEGINFILE' or an 'ENDFILE' rule. The 'nextfile' statement is +allowed only inside a 'BEGINFILE' rule, not inside an 'ENDFILE' rule. + + The 'getline' statement (*note Getline::) is restricted inside both +'BEGINFILE' and 'ENDFILE': only redirected forms of 'getline' are +allowed. + + 'BEGINFILE' and 'ENDFILE' are 'gawk' extensions. In most other 'awk' +implementations, or if 'gawk' is in compatibility mode (*note +Options::), they are not special. + + +File: gawk.info, Node: Empty, Prev: BEGINFILE/ENDFILE, Up: Pattern Overview + +7.1.6 The Empty Pattern +----------------------- + +An empty (i.e., nonexistent) pattern is considered to match _every_ +input record. For example, the program: + + awk '{ print $1 }' mail-list + +prints the first field of every record. + + +File: gawk.info, Node: Using Shell Variables, Next: Action Overview, Prev: Pattern Overview, Up: Patterns and Actions + +7.2 Using Shell Variables in Programs +===================================== + +'awk' programs are often used as components in larger programs written +in shell. For example, it is very common to use a shell variable to +hold a pattern that the 'awk' program searches for. There are two ways +to get the value of the shell variable into the body of the 'awk' +program. + + A common method is to use shell quoting to substitute the variable's +value into the program inside the script. For example, consider the +following program: + + printf "Enter search pattern: " + read pattern + awk "/$pattern/ "'{ nmatches++ } + END { print nmatches, "found" }' /path/to/data + +The 'awk' program consists of two pieces of quoted text that are +concatenated together to form the program. The first part is +double-quoted, which allows substitution of the 'pattern' shell variable +inside the quotes. The second part is single-quoted. + + Variable substitution via quoting works, but can potentially be +messy. It requires a good understanding of the shell's quoting rules +(*note Quoting::), and it's often difficult to correctly match up the +quotes when reading the program. + + A better method is to use 'awk''s variable assignment feature (*note +Assignment Options::) to assign the shell variable's value to an 'awk' +variable. Then use dynamic regexps to match the pattern (*note Computed +Regexps::). The following shows how to redo the previous example using +this technique: + + printf "Enter search pattern: " + read pattern + awk -v pat="$pattern" '$0 ~ pat { nmatches++ } + END { print nmatches, "found" }' /path/to/data + +Now, the 'awk' program is just one single-quoted string. The assignment +'-v pat="$pattern"' still requires double quotes, in case there is +whitespace in the value of '$pattern'. The 'awk' variable 'pat' could +be named 'pattern' too, but that would be more confusing. Using a +variable also provides more flexibility, as the variable can be used +anywhere inside the program--for printing, as an array subscript, or for +any other use--without requiring the quoting tricks at every point in +the program. + + +File: gawk.info, Node: Action Overview, Next: Statements, Prev: Using Shell Variables, Up: Patterns and Actions + +7.3 Actions +=========== + +An 'awk' program or script consists of a series of rules and function +definitions interspersed. (Functions are described later. *Note +User-defined::.) A rule contains a pattern and an action, either of +which (but not both) may be omitted. 
The purpose of the "action" is to +tell 'awk' what to do once a match for the pattern is found. Thus, in +outline, an 'awk' program generally looks like this: + + [PATTERN] '{ ACTION }' + PATTERN ['{ ACTION }'] + ... + 'function NAME(ARGS) { ... }' + ... + + An action consists of one or more 'awk' "statements", enclosed in +braces ('{...}'). Each statement specifies one thing to do. The +statements are separated by newlines or semicolons. The braces around +an action must be used even if the action contains only one statement, +or if it contains no statements at all. However, if you omit the action +entirely, omit the braces as well. An omitted action is equivalent to +'{ print $0 }': + + /foo/ { } match 'foo', do nothing -- empty action + /foo/ match 'foo', print the record -- omitted action + + The following types of statements are supported in 'awk': + +Expressions + Call functions or assign values to variables (*note Expressions::). + Executing this kind of statement simply computes the value of the + expression. This is useful when the expression has side effects + (*note Assignment Ops::). + +Control statements + Specify the control flow of 'awk' programs. The 'awk' language + gives you C-like constructs ('if', 'for', 'while', and 'do') as + well as a few special ones (*note Statements::). + +Compound statements + Enclose one or more statements in braces. A compound statement is + used in order to put several statements together in the body of an + 'if', 'while', 'do', or 'for' statement. + +Input statements + Use the 'getline' command (*note Getline::). Also supplied in + 'awk' are the 'next' statement (*note Next Statement::) and the + 'nextfile' statement (*note Nextfile Statement::). + +Output statements + Such as 'print' and 'printf'. *Note Printing::. + +Deletion statements + For deleting array elements. *Note Delete::. + + +File: gawk.info, Node: Statements, Next: Built-in Variables, Prev: Action Overview, Up: Patterns and Actions + +7.4 Control Statements in Actions +================================= + +"Control statements", such as 'if', 'while', and so on, control the flow +of execution in 'awk' programs. Most of 'awk''s control statements are +patterned after similar statements in C. + + All the control statements start with special keywords, such as 'if' +and 'while', to distinguish them from simple expressions. Many control +statements contain other statements. For example, the 'if' statement +contains another statement that may or may not be executed. The +contained statement is called the "body". To include more than one +statement in the body, group them into a single "compound statement" +with braces, separating them with newlines or semicolons. + +* Menu: + +* If Statement:: Conditionally execute some 'awk' + statements. +* While Statement:: Loop until some condition is satisfied. +* Do Statement:: Do specified action while looping until some + condition is satisfied. +* For Statement:: Another looping statement, that provides + initialization and increment clauses. +* Switch Statement:: Switch/case evaluation for conditional + execution of statements based on a value. +* Break Statement:: Immediately exit the innermost enclosing loop. +* Continue Statement:: Skip to the end of the innermost enclosing + loop. +* Next Statement:: Stop processing the current input record. +* Nextfile Statement:: Stop processing the current file. +* Exit Statement:: Stop execution of 'awk'. 
+ + +File: gawk.info, Node: If Statement, Next: While Statement, Up: Statements + +7.4.1 The 'if'-'else' Statement +------------------------------- + +The 'if'-'else' statement is 'awk''s decision-making statement. It +looks like this: + + 'if (CONDITION) THEN-BODY' ['else ELSE-BODY'] + +The CONDITION is an expression that controls what the rest of the +statement does. If the CONDITION is true, THEN-BODY is executed; +otherwise, ELSE-BODY is executed. The 'else' part of the statement is +optional. The condition is considered false if its value is zero or the +null string; otherwise, the condition is true. Refer to the following: + + if (x % 2 == 0) + print "x is even" + else + print "x is odd" + + In this example, if the expression 'x % 2 == 0' is true (i.e., if the +value of 'x' is evenly divisible by two), then the first 'print' +statement is executed; otherwise, the second 'print' statement is +executed. If the 'else' keyword appears on the same line as THEN-BODY +and THEN-BODY is not a compound statement (i.e., not surrounded by +braces), then a semicolon must separate THEN-BODY from the 'else'. To +illustrate this, the previous example can be rewritten as: + + if (x % 2 == 0) print "x is even"; else + print "x is odd" + +If the ';' is left out, 'awk' can't interpret the statement and it +produces a syntax error. Don't actually write programs this way, +because a human reader might fail to see the 'else' if it is not the +first thing on its line. + + +File: gawk.info, Node: While Statement, Next: Do Statement, Prev: If Statement, Up: Statements + +7.4.2 The 'while' Statement +--------------------------- + +In programming, a "loop" is a part of a program that can be executed two +or more times in succession. The 'while' statement is the simplest +looping statement in 'awk'. It repeatedly executes a statement as long +as a condition is true. For example: + + while (CONDITION) + BODY + +BODY is a statement called the "body" of the loop, and CONDITION is an +expression that controls how long the loop keeps running. The first +thing the 'while' statement does is test the CONDITION. If the +CONDITION is true, it executes the statement BODY. (The CONDITION is +true when the value is not zero and not a null string.) After BODY has +been executed, CONDITION is tested again, and if it is still true, BODY +executes again. This process repeats until the CONDITION is no longer +true. If the CONDITION is initially false, the body of the loop never +executes and 'awk' continues with the statement following the loop. +This example prints the first three fields of each record, one per line: + + awk ' + { + i = 1 + while (i <= 3) { + print $i + i++ + } + }' inventory-shipped + +The body of this loop is a compound statement enclosed in braces, +containing two statements. The loop works in the following manner: +first, the value of 'i' is set to one. Then, the 'while' statement +tests whether 'i' is less than or equal to three. This is true when 'i' +equals one, so the 'i'th field is printed. Then the 'i++' increments +the value of 'i' and the loop repeats. The loop terminates when 'i' +reaches four. + + A newline is not required between the condition and the body; +however, using one makes the program clearer unless the body is a +compound statement or else is very simple. The newline after the open +brace that begins the compound statement is not required either, but the +program is harder to read without it. 
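+
+   As one more small sketch (the dash separator is arbitrary), the
+following rule uses a 'while' loop whose iteration count depends on
+the data: it joins the fields of each record with dashes.  On a record
+with a single field the condition 'i <= NF' is false right away, so
+the body never executes and the field is printed unchanged:
+
+     {
+         out = $1
+         i = 2
+         while (i <= NF) {
+             out = out "-" $i
+             i++
+         }
+         print out
+     }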
+ + +File: gawk.info, Node: Do Statement, Next: For Statement, Prev: While Statement, Up: Statements + +7.4.3 The 'do'-'while' Statement +-------------------------------- + +The 'do' loop is a variation of the 'while' looping statement. The 'do' +loop executes the BODY once and then repeats the BODY as long as the +CONDITION is true. It looks like this: + + do + BODY + while (CONDITION) + + Even if the CONDITION is false at the start, the BODY executes at +least once (and only once, unless executing BODY makes CONDITION true). +Contrast this with the corresponding 'while' statement: + + while (CONDITION) + BODY + +This statement does not execute the BODY even once if the CONDITION is +false to begin with. The following is an example of a 'do' statement: + + { + i = 1 + do { + print $0 + i++ + } while (i <= 10) + } + +This program prints each input record 10 times. However, it isn't a +very realistic example, because in this case an ordinary 'while' would +do just as well. This situation reflects actual experience; only +occasionally is there a real use for a 'do' statement. + + +File: gawk.info, Node: For Statement, Next: Switch Statement, Prev: Do Statement, Up: Statements + +7.4.4 The 'for' Statement +------------------------- + +The 'for' statement makes it more convenient to count iterations of a +loop. The general form of the 'for' statement looks like this: + + for (INITIALIZATION; CONDITION; INCREMENT) + BODY + +The INITIALIZATION, CONDITION, and INCREMENT parts are arbitrary 'awk' +expressions, and BODY stands for any 'awk' statement. + + The 'for' statement starts by executing INITIALIZATION. Then, as +long as the CONDITION is true, it repeatedly executes BODY and then +INCREMENT. Typically, INITIALIZATION sets a variable to either zero or +one, INCREMENT adds one to it, and CONDITION compares it against the +desired number of iterations. For example: + + awk ' + { + for (i = 1; i <= 3; i++) + print $i + }' inventory-shipped + +This prints the first three fields of each input record, with one field +per line. + + It isn't possible to set more than one variable in the INITIALIZATION +part without using a multiple assignment statement such as 'x = y = 0'. +This makes sense only if all the initial values are equal. (But it is +possible to initialize additional variables by writing their assignments +as separate statements preceding the 'for' loop.) + + The same is true of the INCREMENT part. Incrementing additional +variables requires separate statements at the end of the loop. The C +compound expression, using C's comma operator, is useful in this +context, but it is not supported in 'awk'. + + Most often, INCREMENT is an increment expression, as in the previous +example. But this is not required; it can be any expression whatsoever. +For example, the following statement prints all the powers of two +between 1 and 100: + + for (i = 1; i <= 100; i *= 2) + print i + + If there is nothing to be done, any of the three expressions in the +parentheses following the 'for' keyword may be omitted. Thus, +'for (; x > 0;)' is equivalent to 'while (x > 0)'. If the CONDITION is +omitted, it is treated as true, effectively yielding an "infinite loop" +(i.e., a loop that never terminates). + + In most cases, a 'for' loop is an abbreviation for a 'while' loop, as +shown here: + + INITIALIZATION + while (CONDITION) { + BODY + INCREMENT + } + +The only exception is when the 'continue' statement (*note Continue +Statement::) is used inside the loop. 
Changing a 'for' statement to a +'while' statement in this way can change the effect of the 'continue' +statement inside the loop. + + The 'awk' language has a 'for' statement in addition to a 'while' +statement because a 'for' loop is often both less work to type and more +natural to think of. Counting the number of iterations is very common +in loops. It can be easier to think of this counting as part of looping +rather than as something to do inside the loop. + + There is an alternative version of the 'for' loop, for iterating over +all the indices of an array: + + for (i in array) + DO SOMETHING WITH array[i] + +*Note Scanning an Array:: for more information on this version of the +'for' loop. + + +File: gawk.info, Node: Switch Statement, Next: Break Statement, Prev: For Statement, Up: Statements + +7.4.5 The 'switch' Statement +---------------------------- + +This minor node describes a 'gawk'-specific feature. If 'gawk' is in +compatibility mode (*note Options::), it is not available. + + The 'switch' statement allows the evaluation of an expression and the +execution of statements based on a 'case' match. Case statements are +checked for a match in the order they are defined. If no suitable +'case' is found, the 'default' section is executed, if supplied. + + Each 'case' contains a single constant, be it numeric, string, or +regexp. The 'switch' expression is evaluated, and then each 'case''s +constant is compared against the result in turn. The type of constant +determines the comparison: numeric or string do the usual comparisons. +A regexp constant does a regular expression match against the string +value of the original expression. The general form of the 'switch' +statement looks like this: + + switch (EXPRESSION) { + case VALUE OR REGULAR EXPRESSION: + CASE-BODY + default: + DEFAULT-BODY + } + + Control flow in the 'switch' statement works as it does in C. Once a +match to a given case is made, the case statement bodies execute until a +'break', 'continue', 'next', 'nextfile', or 'exit' is encountered, or +the end of the 'switch' statement itself. For example: + + while ((c = getopt(ARGC, ARGV, "aksx")) != -1) { + switch (c) { + case "a": + # report size of all files + all_files = TRUE; + break + case "k": + BLOCK_SIZE = 1024 # 1K block size + break + case "s": + # do sums only + sum_only = TRUE + break + case "x": + # don't cross filesystems + fts_flags = or(fts_flags, FTS_XDEV) + break + case "?": + default: + usage() + break + } + } + + Note that if none of the statements specified here halt execution of +a matched 'case' statement, execution falls through to the next 'case' +until execution halts. In this example, the 'case' for '"?"' falls +through to the 'default' case, which is to call a function named +'usage()'. (The 'getopt()' function being called here is described in +*note Getopt Function::.) + + +File: gawk.info, Node: Break Statement, Next: Continue Statement, Prev: Switch Statement, Up: Statements + +7.4.6 The 'break' Statement +--------------------------- + +The 'break' statement jumps out of the innermost 'for', 'while', or 'do' +loop that encloses it. 
The following example finds the smallest divisor +of any integer, and also identifies prime numbers: + + # find smallest divisor of num + { + num = $1 + for (divisor = 2; divisor * divisor <= num; divisor++) { + if (num % divisor == 0) + break + } + if (num % divisor == 0) + printf "Smallest divisor of %d is %d\n", num, divisor + else + printf "%d is prime\n", num + } + + When the remainder is zero in the first 'if' statement, 'awk' +immediately "breaks out" of the containing 'for' loop. This means that +'awk' proceeds immediately to the statement following the loop and +continues processing. (This is very different from the 'exit' +statement, which stops the entire 'awk' program. *Note Exit +Statement::.) + + The following program illustrates how the CONDITION of a 'for' or +'while' statement could be replaced with a 'break' inside an 'if': + + # find smallest divisor of num + { + num = $1 + for (divisor = 2; ; divisor++) { + if (num % divisor == 0) { + printf "Smallest divisor of %d is %d\n", num, divisor + break + } + if (divisor * divisor > num) { + printf "%d is prime\n", num + break + } + } + } + + The 'break' statement is also used to break out of the 'switch' +statement. This is discussed in *note Switch Statement::. + + The 'break' statement has no meaning when used outside the body of a +loop or 'switch'. However, although it was never documented, historical +implementations of 'awk' treated the 'break' statement outside of a loop +as if it were a 'next' statement (*note Next Statement::). (d.c.) +Recent versions of BWK 'awk' no longer allow this usage, nor does +'gawk'. + + +File: gawk.info, Node: Continue Statement, Next: Next Statement, Prev: Break Statement, Up: Statements + +7.4.7 The 'continue' Statement +------------------------------ + +Similar to 'break', the 'continue' statement is used only inside 'for', +'while', and 'do' loops. It skips over the rest of the loop body, +causing the next cycle around the loop to begin immediately. Contrast +this with 'break', which jumps out of the loop altogether. + + The 'continue' statement in a 'for' loop directs 'awk' to skip the +rest of the body of the loop and resume execution with the +increment-expression of the 'for' statement. The following program +illustrates this fact: + + BEGIN { + for (x = 0; x <= 20; x++) { + if (x == 5) + continue + printf "%d ", x + } + print "" + } + +This program prints all the numbers from 0 to 20--except for 5, for +which the 'printf' is skipped. Because the increment 'x++' is not +skipped, 'x' does not remain stuck at 5. Contrast the 'for' loop from +the previous example with the following 'while' loop: + + BEGIN { + x = 0 + while (x <= 20) { + if (x == 5) + continue + printf "%d ", x + x++ + } + print "" + } + +This program loops forever once 'x' reaches 5, because the increment +('x++') is never reached. + + The 'continue' statement has no special meaning with respect to the +'switch' statement, nor does it have any meaning when used outside the +body of a loop. Historical versions of 'awk' treated a 'continue' +statement outside a loop the same way they treated a 'break' statement +outside a loop: as if it were a 'next' statement (*note Next +Statement::). (d.c.) Recent versions of BWK 'awk' no longer work this +way, nor does 'gawk'. 
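+
+   To illustrate that last point, here is a small 'gawk'-specific
+sketch (it uses the 'switch' extension described earlier); it prints
+'1 2 4 5' because the 'continue' does not merely leave the 'switch'
+but starts the next iteration of the enclosing 'for' loop:
+
+     BEGIN {
+         for (i = 1; i <= 5; i++) {
+             switch (i) {
+             case 3:
+                 continue       # resume the 'for' loop with the next i
+             default:
+                 printf "%d ", i
+                 break
+             }
+         }
+         print ""               # finish the line of output
+     }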
+ + +File: gawk.info, Node: Next Statement, Next: Nextfile Statement, Prev: Continue Statement, Up: Statements + +7.4.8 The 'next' Statement +-------------------------- + +The 'next' statement forces 'awk' to immediately stop processing the +current record and go on to the next record. This means that no further +rules are executed for the current record, and the rest of the current +rule's action isn't executed. + + Contrast this with the effect of the 'getline' function (*note +Getline::). That also causes 'awk' to read the next record immediately, +but it does not alter the flow of control in any way (i.e., the rest of +the current action executes with a new input record). + + At the highest level, 'awk' program execution is a loop that reads an +input record and then tests each rule's pattern against it. If you +think of this loop as a 'for' statement whose body contains the rules, +then the 'next' statement is analogous to a 'continue' statement. It +skips to the end of the body of this implicit loop and executes the +increment (which reads another record). + + For example, suppose an 'awk' program works only on records with four +fields, and it shouldn't fail when given bad input. To avoid +complicating the rest of the program, write a "weed out" rule near the +beginning, in the following manner: + + NF != 4 { + printf("%s:%d: skipped: NF != 4\n", FILENAME, FNR) > "/dev/stderr" + next + } + +Because of the 'next' statement, the program's subsequent rules won't +see the bad record. The error message is redirected to the standard +error output stream, as error messages should be. For more detail, see +*note Special Files::. + + If the 'next' statement causes the end of the input to be reached, +then the code in any 'END' rules is executed. *Note BEGIN/END::. + + The 'next' statement is not allowed inside 'BEGINFILE' and 'ENDFILE' +rules. *Note BEGINFILE/ENDFILE::. + + According to the POSIX standard, the behavior is undefined if the +'next' statement is used in a 'BEGIN' or 'END' rule. 'gawk' treats it +as a syntax error. Although POSIX does not disallow it, most other +'awk' implementations don't allow the 'next' statement inside function +bodies (*note User-defined::). Just as with any other 'next' statement, +a 'next' statement inside a function body reads the next record and +starts processing it with the first rule in the program. + + +File: gawk.info, Node: Nextfile Statement, Next: Exit Statement, Prev: Next Statement, Up: Statements + +7.4.9 The 'nextfile' Statement +------------------------------ + +The 'nextfile' statement is similar to the 'next' statement. However, +instead of abandoning processing of the current record, the 'nextfile' +statement instructs 'awk' to stop processing the current data file. + + Upon execution of the 'nextfile' statement, 'FILENAME' is updated to +the name of the next data file listed on the command line, 'FNR' is +reset to one, and processing starts over with the first rule in the +program. If the 'nextfile' statement causes the end of the input to be +reached, then the code in any 'END' rules is executed. An exception to +this is when 'nextfile' is invoked during execution of any statement in +an 'END' rule; in this case, it causes the program to stop immediately. +*Note BEGIN/END::. + + The 'nextfile' statement is useful when there are many data files to +process but it isn't necessary to process every record in every file. 
+Without 'nextfile', in order to move on to the next data file, a program +would have to continue scanning the unwanted records. The 'nextfile' +statement accomplishes this much more efficiently. + + In 'gawk', execution of 'nextfile' causes additional things to +happen: any 'ENDFILE' rules are executed if 'gawk' is not currently in +an 'END' or 'BEGINFILE' rule, 'ARGIND' is incremented, and any +'BEGINFILE' rules are executed. ('ARGIND' hasn't been introduced yet. +*Note Built-in Variables::.) + + With 'gawk', 'nextfile' is useful inside a 'BEGINFILE' rule to skip +over a file that would otherwise cause 'gawk' to exit with a fatal +error. In this case, 'ENDFILE' rules are not executed. *Note +BEGINFILE/ENDFILE::. + + Although it might seem that 'close(FILENAME)' would accomplish the +same as 'nextfile', this isn't true. 'close()' is reserved for closing +files, pipes, and coprocesses that are opened with redirections. It is +not related to the main processing that 'awk' does with the files listed +in 'ARGV'. + + NOTE: For many years, 'nextfile' was a common extension. In + September 2012, it was accepted for inclusion into the POSIX + standard. See the Austin Group website + (http://austingroupbugs.net/view.php?id=607). + + The current version of BWK 'awk' and 'mawk' also support 'nextfile'. +However, they don't allow the 'nextfile' statement inside function +bodies (*note User-defined::). 'gawk' does; a 'nextfile' inside a +function body reads the first record from the next file and starts +processing it with the first rule in the program, just as any other +'nextfile' statement. + + +File: gawk.info, Node: Exit Statement, Prev: Nextfile Statement, Up: Statements + +7.4.10 The 'exit' Statement +--------------------------- + +The 'exit' statement causes 'awk' to immediately stop executing the +current rule and to stop processing input; any remaining input is +ignored. The 'exit' statement is written as follows: + + 'exit' [RETURN CODE] + + When an 'exit' statement is executed from a 'BEGIN' rule, the program +stops processing everything immediately. No input records are read. +However, if an 'END' rule is present, as part of executing the 'exit' +statement, the 'END' rule is executed (*note BEGIN/END::). If 'exit' is +used in the body of an 'END' rule, it causes the program to stop +immediately. + + An 'exit' statement that is not part of a 'BEGIN' or 'END' rule stops +the execution of any further automatic rules for the current record, +skips reading any remaining input records, and executes the 'END' rule +if there is one. 'gawk' also skips any 'ENDFILE' rules; they do not +execute. + + In such a case, if you don't want the 'END' rule to do its job, set a +variable to a nonzero value before the 'exit' statement and check that +variable in the 'END' rule. *Note Assert Function:: for an example that +does this. + + If an argument is supplied to 'exit', its value is used as the exit +status code for the 'awk' process. If no argument is supplied, 'exit' +causes 'awk' to return a "success" status. In the case where an +argument is supplied to a first 'exit' statement, and then 'exit' is +called a second time from an 'END' rule with no argument, 'awk' uses the +previously supplied exit value. (d.c.) *Note Exit Status:: for more +information. + + For example, suppose an error condition occurs that is difficult or +impossible to handle. Conventionally, programs report this by exiting +with a nonzero status. 
An 'awk' program can do this using an 'exit' +statement with a nonzero argument, as shown in the following example: + + BEGIN { + if (("date" | getline date_now) <= 0) { + print "Can't get system date" > "/dev/stderr" + exit 1 + } + print "current date is", date_now + close("date") + } + + NOTE: For full portability, exit values should be between zero and + 126, inclusive. Negative values, and values of 127 or greater, may + not produce consistent results across different operating systems. + + +File: gawk.info, Node: Built-in Variables, Next: Pattern Action Summary, Prev: Statements, Up: Patterns and Actions + +7.5 Predefined Variables +======================== + +Most 'awk' variables are available to use for your own purposes; they +never change unless your program assigns values to them, and they never +affect anything unless your program examines them. However, a few +variables in 'awk' have special built-in meanings. 'awk' examines some +of these automatically, so that they enable you to tell 'awk' how to do +certain things. Others are set automatically by 'awk', so that they +carry information from the internal workings of 'awk' to your program. + + This minor node documents all of 'gawk''s predefined variables, most +of which are also documented in the major nodes describing their areas +of activity. + +* Menu: + +* User-modified:: Built-in variables that you change to control + 'awk'. +* Auto-set:: Built-in variables where 'awk' gives + you information. +* ARGC and ARGV:: Ways to use 'ARGC' and 'ARGV'. + + +File: gawk.info, Node: User-modified, Next: Auto-set, Up: Built-in Variables + +7.5.1 Built-in Variables That Control 'awk' +------------------------------------------- + +The following is an alphabetical list of variables that you can change +to control how 'awk' does certain things. + + The variables that are specific to 'gawk' are marked with a pound +sign ('#'). These variables are 'gawk' extensions. In other 'awk' +implementations or if 'gawk' is in compatibility mode (*note Options::), +they are not special. (Any exceptions are noted in the description of +each variable.) + +'BINMODE #' + On non-POSIX systems, this variable specifies use of binary mode + for all I/O. Numeric values of one, two, or three specify that + input files, output files, or all files, respectively, should use + binary I/O. A numeric value less than zero is treated as zero, and + a numeric value greater than three is treated as three. + Alternatively, string values of '"r"' or '"w"' specify that input + files and output files, respectively, should use binary I/O. A + string value of '"rw"' or '"wr"' indicates that all files should + use binary I/O. Any other string value is treated the same as + '"rw"', but causes 'gawk' to generate a warning message. 'BINMODE' + is described in more detail in *note PC Using::. 'mawk' (*note + Other Versions::) also supports this variable, but only using + numeric values. + +'CONVFMT' + A string that controls the conversion of numbers to strings (*note + Conversion::). It works by being passed, in effect, as the first + argument to the 'sprintf()' function (*note String Functions::). + Its default value is '"%.6g"'. 'CONVFMT' was introduced by the + POSIX standard. + +'FIELDWIDTHS #' + A space-separated list of columns that tells 'gawk' how to split + input with fixed columnar boundaries. Assigning a value to + 'FIELDWIDTHS' overrides the use of 'FS' and 'FPAT' for field + splitting. *Note Constant Size:: for more information. 
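+
+     As a small illustration (the widths and the sample date here are
+     invented for this example, not taken from elsewhere in this Info
+     file), a date in fixed-width form might be split like so:
+
+          $ echo 2015-10-26 | gawk 'BEGIN { FIELDWIDTHS = "4 1 2 1 2" }
+          >                         { print $1, $3, $5 }'
+          -| 2015 10 26
+
+     Here '$1', '$3', and '$5' hold the four-digit year and the
+     two-digit month and day, while '$2' and '$4' hold the
+     one-character separators.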
+ +'FPAT #' + A regular expression (as a string) that tells 'gawk' to create the + fields based on text that matches the regular expression. + Assigning a value to 'FPAT' overrides the use of 'FS' and + 'FIELDWIDTHS' for field splitting. *Note Splitting By Content:: + for more information. + +'FS' + The input field separator (*note Field Separators::). The value is + a single-character string or a multicharacter regular expression + that matches the separations between fields in an input record. If + the value is the null string ('""'), then each character in the + record becomes a separate field. (This behavior is a 'gawk' + extension. POSIX 'awk' does not specify the behavior when 'FS' is + the null string. Nonetheless, some other versions of 'awk' also + treat '""' specially.) + + The default value is '" "', a string consisting of a single space. + As a special exception, this value means that any sequence of + spaces, TABs, and/or newlines is a single separator. It also + causes spaces, TABs, and newlines at the beginning and end of a + record to be ignored. + + You can set the value of 'FS' on the command line using the '-F' + option: + + awk -F, 'PROGRAM' INPUT-FILES + + If 'gawk' is using 'FIELDWIDTHS' or 'FPAT' for field splitting, + assigning a value to 'FS' causes 'gawk' to return to the normal, + 'FS'-based field splitting. An easy way to do this is to simply + say 'FS = FS', perhaps with an explanatory comment. + +'IGNORECASE #' + If 'IGNORECASE' is nonzero or non-null, then all string comparisons + and all regular expression matching are case-independent. This + applies to regexp matching with '~' and '!~', the 'gensub()', + 'gsub()', 'index()', 'match()', 'patsplit()', 'split()', and + 'sub()' functions, record termination with 'RS', and field + splitting with 'FS' and 'FPAT'. However, the value of 'IGNORECASE' + does _not_ affect array subscripting and it does not affect field + splitting when using a single-character field separator. *Note + Case-sensitivity::. + +'LINT #' + When this variable is true (nonzero or non-null), 'gawk' behaves as + if the '--lint' command-line option is in effect (*note Options::). + With a value of '"fatal"', lint warnings become fatal errors. With + a value of '"invalid"', only warnings about things that are + actually invalid are issued. (This is not fully implemented yet.) + Any other true value prints nonfatal warnings. Assigning a false + value to 'LINT' turns off the lint warnings. + + This variable is a 'gawk' extension. It is not special in other + 'awk' implementations. Unlike with the other special variables, + changing 'LINT' does affect the production of lint warnings, even + if 'gawk' is in compatibility mode. Much as the '--lint' and + '--traditional' options independently control different aspects of + 'gawk''s behavior, the control of lint warnings during program + execution is independent of the flavor of 'awk' being executed. + +'OFMT' + A string that controls conversion of numbers to strings (*note + Conversion::) for printing with the 'print' statement. It works by + being passed as the first argument to the 'sprintf()' function + (*note String Functions::). Its default value is '"%.6g"'. + Earlier versions of 'awk' used 'OFMT' to specify the format for + converting numbers to strings in general expressions; this is now + done by 'CONVFMT'. + +'OFS' + The output field separator (*note Output Separators::). It is + output between the fields printed by a 'print' statement. 
Its + default value is '" "', a string consisting of a single space. + +'ORS' + The output record separator. It is output at the end of every + 'print' statement. Its default value is '"\n"', the newline + character. (*Note Output Separators::.) + +'PREC #' + The working precision of arbitrary-precision floating-point + numbers, 53 bits by default (*note Setting precision::). + +'ROUNDMODE #' + The rounding mode to use for arbitrary-precision arithmetic on + numbers, by default '"N"' ('roundTiesToEven' in the IEEE 754 + standard; *note Setting the rounding mode::). + +'RS' + The input record separator. Its default value is a string + containing a single newline character, which means that an input + record consists of a single line of text. It can also be the null + string, in which case records are separated by runs of blank lines. + If it is a regexp, records are separated by matches of the regexp + in the input text. (*Note Records::.) + + The ability for 'RS' to be a regular expression is a 'gawk' + extension. In most other 'awk' implementations, or if 'gawk' is in + compatibility mode (*note Options::), just the first character of + 'RS''s value is used. + +'SUBSEP' + The subscript separator. It has the default value of '"\034"' and + is used to separate the parts of the indices of a multidimensional + array. Thus, the expression 'foo["A", "B"]' really accesses + 'foo["A\034B"]' (*note Multidimensional::). + +'TEXTDOMAIN #' + Used for internationalization of programs at the 'awk' level. It + sets the default text domain for specially marked string constants + in the source text, as well as for the 'dcgettext()', + 'dcngettext()', and 'bindtextdomain()' functions (*note + Internationalization::). The default value of 'TEXTDOMAIN' is + '"messages"'. + + +File: gawk.info, Node: Auto-set, Next: ARGC and ARGV, Prev: User-modified, Up: Built-in Variables + +7.5.2 Built-in Variables That Convey Information +------------------------------------------------ + +The following is an alphabetical list of variables that 'awk' sets +automatically on certain occasions in order to provide information to +your program. + + The variables that are specific to 'gawk' are marked with a pound +sign ('#'). These variables are 'gawk' extensions. In other 'awk' +implementations or if 'gawk' is in compatibility mode (*note Options::), +they are not special: + +'ARGC', 'ARGV' + The command-line arguments available to 'awk' programs are stored + in an array called 'ARGV'. 'ARGC' is the number of command-line + arguments present. *Note Other Arguments::. Unlike most 'awk' + arrays, 'ARGV' is indexed from 0 to 'ARGC' - 1. In the following + example: + + $ awk 'BEGIN { + > for (i = 0; i < ARGC; i++) + > print ARGV[i] + > }' inventory-shipped mail-list + -| awk + -| inventory-shipped + -| mail-list + + 'ARGV[0]' contains 'awk', 'ARGV[1]' contains 'inventory-shipped', + and 'ARGV[2]' contains 'mail-list'. The value of 'ARGC' is three, + one more than the index of the last element in 'ARGV', because the + elements are numbered from zero. + + The names 'ARGC' and 'ARGV', as well as the convention of indexing + the array from 0 to 'ARGC' - 1, are derived from the C language's + method of accessing command-line arguments. + + The value of 'ARGV[0]' can vary from system to system. Also, you + should note that the program text is _not_ included in 'ARGV', nor + are any of 'awk''s command-line options. *Note ARGC and ARGV:: for + information about how 'awk' uses these variables. (d.c.) 
+
+'ARGIND #'
+     The index in 'ARGV' of the current file being processed.  Every
+     time 'gawk' opens a new data file for processing, it sets 'ARGIND'
+     to the index in 'ARGV' of the file name.  When 'gawk' is processing
+     the input files, 'FILENAME == ARGV[ARGIND]' is always true.
+
+     This variable is useful in file processing; it allows you to tell
+     how far along you are in the list of data files as well as to
+     distinguish between successive instances of the same file name on
+     the command line.
+
+     While you can change the value of 'ARGIND' within your 'awk'
+     program, 'gawk' automatically sets it to a new value when it opens
+     the next file.
+
+'ENVIRON'
+     An associative array containing the values of the environment.  The
+     array indices are the environment variable names; the elements are
+     the values of the particular environment variables.  For example,
+     'ENVIRON["HOME"]' might be '/home/arnold'.
+
+     For POSIX 'awk', changing this array does not affect the
+     environment passed on to any programs that 'awk' may spawn via
+     redirection or the 'system()' function.
+
+     However, beginning with version 4.2, if not in POSIX compatibility
+     mode, 'gawk' does update its own environment when 'ENVIRON' is
+     changed, thus changing the environment seen by programs that it
+     creates.  You should therefore be especially careful if you modify
+     'ENVIRON["PATH"]', which is the search path for finding executable
+     programs.
+
+     This can also affect the running 'gawk' program, since some of the
+     built-in functions may pay attention to certain environment
+     variables.  The most notable instance of this is 'mktime()' (*note
+     Time Functions::), which pays attention to the value of the 'TZ'
+     environment variable on many systems.
+
+     Some operating systems may not have environment variables.  On such
+     systems, the 'ENVIRON' array is empty (except for
+     'ENVIRON["AWKPATH"]' and 'ENVIRON["AWKLIBPATH"]'; *note AWKPATH
+     Variable:: and *note AWKLIBPATH Variable::).
+
+'ERRNO #'
+     If a system error occurs during a redirection for 'getline', during
+     a read for 'getline', or during a 'close()' operation, then 'ERRNO'
+     contains a string describing the error.
+
+     In addition, 'gawk' clears 'ERRNO' before opening each command-line
+     input file.  This enables checking if the file is readable inside a
+     'BEGINFILE' pattern (*note BEGINFILE/ENDFILE::).
+
+     Otherwise, 'ERRNO' works similarly to the C variable 'errno'.
+     Except for the case just mentioned, 'gawk' _never_ clears it (sets
+     it to zero or '""').  Thus, you should only expect its value to be
+     meaningful when an I/O operation returns a failure value, such as
+     'getline' returning -1.  You are, of course, free to clear it
+     yourself before doing an I/O operation.
+
+     If the value of 'ERRNO' corresponds to a system error in the C
+     'errno' variable, then 'PROCINFO["errno"]' will be set to the value
+     of 'errno'.  For non-system errors, 'PROCINFO["errno"]' will be
+     zero.
+
+'FILENAME'
+     The name of the current input file.  When no data files are listed
+     on the command line, 'awk' reads from the standard input and
+     'FILENAME' is set to '"-"'.  'FILENAME' changes each time a new
+     file is read (*note Reading Files::).  Inside a 'BEGIN' rule, the
+     value of 'FILENAME' is '""', because there are no input files being
+     processed yet.(1)  (d.c.)  Note, though, that using 'getline'
+     (*note Getline::) inside a 'BEGIN' rule can give 'FILENAME' a
+     value.
+
+'FNR'
+     The current record number in the current file.
'awk' increments + 'FNR' each time it reads a new record (*note Records::). 'awk' + resets 'FNR' to zero each time it starts a new input file. + +'NF' + The number of fields in the current input record. 'NF' is set each + time a new record is read, when a new field is created, or when + '$0' changes (*note Fields::). + + Unlike most of the variables described in this node, assigning a + value to 'NF' has the potential to affect 'awk''s internal + workings. In particular, assignments to 'NF' can be used to create + fields in or remove fields from the current record. *Note Changing + Fields::. + +'FUNCTAB #' + An array whose indices and corresponding values are the names of + all the built-in, user-defined, and extension functions in the + program. + + NOTE: Attempting to use the 'delete' statement with the + 'FUNCTAB' array causes a fatal error. Any attempt to assign + to an element of 'FUNCTAB' also causes a fatal error. + +'NR' + The number of input records 'awk' has processed since the beginning + of the program's execution (*note Records::). 'awk' increments + 'NR' each time it reads a new record. + +'PROCINFO #' + The elements of this array provide access to information about the + running 'awk' program. The following elements (listed + alphabetically) are guaranteed to be available: + + 'PROCINFO["egid"]' + The value of the 'getegid()' system call. + + 'PROCINFO["errno"]' + The value of the C 'errno' variable when 'ERRNO' is set to the + associated error message. + + 'PROCINFO["euid"]' + The value of the 'geteuid()' system call. + + 'PROCINFO["FS"]' + This is '"FS"' if field splitting with 'FS' is in effect, + '"FIELDWIDTHS"' if field splitting with 'FIELDWIDTHS' is in + effect, or '"FPAT"' if field matching with 'FPAT' is in + effect. + + 'PROCINFO["gid"]' + The value of the 'getgid()' system call. + + 'PROCINFO["identifiers"]' + A subarray, indexed by the names of all identifiers used in + the text of the 'awk' program. An "identifier" is simply the + name of a variable (be it scalar or array), built-in function, + user-defined function, or extension function. For each + identifier, the value of the element is one of the following: + + '"array"' + The identifier is an array. + + '"builtin"' + The identifier is a built-in function. + + '"extension"' + The identifier is an extension function loaded via + '@load' or '-l'. + + '"scalar"' + The identifier is a scalar. + + '"untyped"' + The identifier is untyped (could be used as a scalar or + an array; 'gawk' doesn't know yet). + + '"user"' + The identifier is a user-defined function. + + The values indicate what 'gawk' knows about the identifiers + after it has finished parsing the program; they are _not_ + updated while the program runs. + + 'PROCINFO["pgrpid"]' + The process group ID of the current process. + + 'PROCINFO["pid"]' + The process ID of the current process. + + 'PROCINFO["ppid"]' + The parent process ID of the current process. + + 'PROCINFO["strftime"]' + The default time format string for 'strftime()'. Assigning a + new value to this element changes the default. *Note Time + Functions::. + + 'PROCINFO["uid"]' + The value of the 'getuid()' system call. + + 'PROCINFO["version"]' + The version of 'gawk'. + + The following additional elements in the array are available to + provide information about the MPFR and GMP libraries if your + version of 'gawk' supports arbitrary-precision arithmetic (*note + Arbitrary Precision Arithmetic::): + + 'PROCINFO["gmp_version"]' + The version of the GNU MP library. 
+ + 'PROCINFO["mpfr_version"]' + The version of the GNU MPFR library. + + 'PROCINFO["prec_max"]' + The maximum precision supported by MPFR. + + 'PROCINFO["prec_min"]' + The minimum precision required by MPFR. + + The following additional elements in the array are available to + provide information about the version of the extension API, if your + version of 'gawk' supports dynamic loading of extension functions + (*note Dynamic Extensions::): + + 'PROCINFO["api_major"]' + The major version of the extension API. + + 'PROCINFO["api_minor"]' + The minor version of the extension API. + + On some systems, there may be elements in the array, '"group1"' + through '"groupN"' for some N. N is the number of supplementary + groups that the process has. Use the 'in' operator to test for + these elements (*note Reference to Elements::). + + The following elements allow you to change 'gawk''s behavior: + + 'PROCINFO["NONFATAL"]' + If this element exists, then I/O errors for all output + redirections become nonfatal. *Note Nonfatal::. + + 'PROCINFO["OUTPUT_NAME", "NONFATAL"]' + Make output errors for OUTPUT_NAME be nonfatal. *Note + Nonfatal::. + + 'PROCINFO["COMMAND", "pty"]' + For two-way communication to COMMAND, use a pseudo-tty instead + of setting up a two-way pipe. *Note Two-way I/O:: for more + information. + + 'PROCINFO["INPUT_NAME", "READ_TIMEOUT"]' + Set a timeout for reading from input redirection INPUT_NAME. + *Note Read Timeout:: for more information. + + 'PROCINFO["sorted_in"]' + If this element exists in 'PROCINFO', its value controls the + order in which array indices will be processed by 'for (INDX + in ARRAY)' loops. This is an advanced feature, so we defer + the full description until later; see *note Scanning an + Array::. + +'RLENGTH' + The length of the substring matched by the 'match()' function + (*note String Functions::). 'RLENGTH' is set by invoking the + 'match()' function. Its value is the length of the matched string, + or -1 if no match is found. + +'RSTART' + The start index in characters of the substring that is matched by + the 'match()' function (*note String Functions::). 'RSTART' is set + by invoking the 'match()' function. Its value is the position of + the string where the matched substring starts, or zero if no match + was found. + +'RT #' + The input text that matched the text denoted by 'RS', the record + separator. It is set every time a record is read. + +'SYMTAB #' + An array whose indices are the names of all defined global + variables and arrays in the program. 'SYMTAB' makes 'gawk''s + symbol table visible to the 'awk' programmer. It is built as + 'gawk' parses the program and is complete before the program starts + to run. + + The array may be used for indirect access to read or write the + value of a variable: + + foo = 5 + SYMTAB["foo"] = 4 + print foo # prints 4 + + The 'isarray()' function (*note Type Functions::) may be used to + test if an element in 'SYMTAB' is an array. Also, you may not use + the 'delete' statement with the 'SYMTAB' array. + + You may use an index for 'SYMTAB' that is not a predefined + identifier: + + SYMTAB["xxx"] = 5 + print SYMTAB["xxx"] + + This works as expected: in this case 'SYMTAB' acts just like a + regular array. The only difference is that you can't then delete + 'SYMTAB["xxx"]'. + + The 'SYMTAB' array is more interesting than it looks. Andrew + Schorr points out that it effectively gives 'awk' data pointers. 
+ Consider his example: + + # Indirect multiply of any variable by amount, return result + + function multiply(variable, amount) + { + return SYMTAB[variable] *= amount + } + + You would use it like this: + + BEGIN { + answer = 10.5 + multiply("answer", 4) + print "The answer is", answer + } + + When run, this produces: + + $ gawk -f answer.awk + -| The answer is 42 + + NOTE: In order to avoid severe time-travel paradoxes,(2) + neither 'FUNCTAB' nor 'SYMTAB' is available as an element + within the 'SYMTAB' array. + + Changing 'NR' and 'FNR' + + 'awk' increments 'NR' and 'FNR' each time it reads a record, instead +of setting them to the absolute value of the number of records read. +This means that a program can change these variables and their new +values are incremented for each record. (d.c.) The following example +shows this: + + $ echo '1 + > 2 + > 3 + > 4' | awk 'NR == 2 { NR = 17 } + > { print NR }' + -| 1 + -| 17 + -| 18 + -| 19 + +Before 'FNR' was added to the 'awk' language (*note V7/SVR3.1::), many +'awk' programs used this feature to track the number of records in a +file by resetting 'NR' to zero when 'FILENAME' changed. + + ---------- Footnotes ---------- + + (1) Some early implementations of Unix 'awk' initialized 'FILENAME' +to '"-"', even if there were data files to be processed. This behavior +was incorrect and should not be relied upon in your programs. + + (2) Not to mention difficult implementation issues. + + +File: gawk.info, Node: ARGC and ARGV, Prev: Auto-set, Up: Built-in Variables + +7.5.3 Using 'ARGC' and 'ARGV' +----------------------------- + +*note Auto-set:: presented the following program describing the +information contained in 'ARGC' and 'ARGV': + + $ awk 'BEGIN { + > for (i = 0; i < ARGC; i++) + > print ARGV[i] + > }' inventory-shipped mail-list + -| awk + -| inventory-shipped + -| mail-list + +In this example, 'ARGV[0]' contains 'awk', 'ARGV[1]' contains +'inventory-shipped', and 'ARGV[2]' contains 'mail-list'. Notice that +the 'awk' program is not entered in 'ARGV'. The other command-line +options, with their arguments, are also not entered. This includes +variable assignments done with the '-v' option (*note Options::). +Normal variable assignments on the command line _are_ treated as +arguments and do show up in the 'ARGV' array. Given the following +program in a file named 'showargs.awk': + + BEGIN { + printf "A=%d, B=%d\n", A, B + for (i = 0; i < ARGC; i++) + printf "\tARGV[%d] = %s\n", i, ARGV[i] + } + END { printf "A=%d, B=%d\n", A, B } + +Running it produces the following: + + $ awk -v A=1 -f showargs.awk B=2 /dev/null + -| A=1, B=0 + -| ARGV[0] = awk + -| ARGV[1] = B=2 + -| ARGV[2] = /dev/null + -| A=1, B=2 + + A program can alter 'ARGC' and the elements of 'ARGV'. Each time +'awk' reaches the end of an input file, it uses the next element of +'ARGV' as the name of the next input file. By storing a different +string there, a program can change which files are read. Use '"-"' to +represent the standard input. Storing additional elements and +incrementing 'ARGC' causes additional files to be read. + + If the value of 'ARGC' is decreased, that eliminates input files from +the end of the list. By recording the old value of 'ARGC' elsewhere, a +program can treat the eliminated arguments as something other than file +names. + + To eliminate a file from the middle of the list, store the null +string ('""') into 'ARGV' in place of the file's name. As a special +feature, 'awk' ignores file names that have been replaced with the null +string. 
Another option is to use the 'delete' statement to remove +elements from 'ARGV' (*note Delete::). + + All of these actions are typically done in the 'BEGIN' rule, before +actual processing of the input begins. *Note Split Program:: and *note +Tee Program:: for examples of each way of removing elements from 'ARGV'. + + To actually get options into an 'awk' program, end the 'awk' options +with '--' and then supply the 'awk' program's options, in the following +manner: + + awk -f myprog.awk -- -v -q file1 file2 ... + + The following fragment processes 'ARGV' in order to examine, and then +remove, the previously mentioned command-line options: + + BEGIN { + for (i = 1; i < ARGC; i++) { + if (ARGV[i] == "-v") + verbose = 1 + else if (ARGV[i] == "-q") + debug = 1 + else if (ARGV[i] ~ /^-./) { + e = sprintf("%s: unrecognized option -- %c", + ARGV[0], substr(ARGV[i], 2, 1)) + print e > "/dev/stderr" + } else + break + delete ARGV[i] + } + } + + Ending the 'awk' options with '--' isn't necessary in 'gawk'. Unless +'--posix' has been specified, 'gawk' silently puts any unrecognized +options into 'ARGV' for the 'awk' program to deal with. As soon as it +sees an unknown option, 'gawk' stops looking for other options that it +might otherwise recognize. The previous command line with 'gawk' would +be: + + gawk -f myprog.awk -q -v file1 file2 ... + +Because '-q' is not a valid 'gawk' option, it and the following '-v' are +passed on to the 'awk' program. (*Note Getopt Function:: for an 'awk' +library function that parses command-line options.) + + When designing your program, you should choose options that don't +conflict with 'gawk''s, because it will process any options that it +accepts before passing the rest of the command line on to your program. +Using '#!' with the '-E' option may help (*note Executable Scripts:: and +*note Options::). + + +File: gawk.info, Node: Pattern Action Summary, Prev: Built-in Variables, Up: Patterns and Actions + +7.6 Summary +=========== + + * Pattern-action pairs make up the basic elements of an 'awk' + program. Patterns are either normal expressions, range + expressions, or regexp constants; one of the special keywords + 'BEGIN', 'END', 'BEGINFILE', or 'ENDFILE'; or empty. The action + executes if the current record matches the pattern. Empty + (missing) patterns match all records. + + * I/O from 'BEGIN' and 'END' rules has certain constraints. This is + also true, only more so, for 'BEGINFILE' and 'ENDFILE' rules. The + latter two give you "hooks" into 'gawk''s file processing, allowing + you to recover from a file that otherwise would cause a fatal error + (such as a file that cannot be opened). + + * Shell variables can be used in 'awk' programs by careful use of + shell quoting. It is easier to pass a shell variable into 'awk' by + using the '-v' option and an 'awk' variable. + + * Actions consist of statements enclosed in curly braces. Statements + are built up from expressions, control statements, compound + statements, input and output statements, and deletion statements. + + * The control statements in 'awk' are 'if'-'else', 'while', 'for', + and 'do'-'while'. 'gawk' adds the 'switch' statement. There are + two flavors of 'for' statement: one for performing general looping, + and the other for iterating through an array. + + * 'break' and 'continue' let you exit early or start the next + iteration of a loop (or get out of a 'switch'). 
+ + * 'next' and 'nextfile' let you read the next record and start over + at the top of your program or skip to the next input file and start + over, respectively. + + * The 'exit' statement terminates your program. When executed from + an action (or function body), it transfers control to the 'END' + statements. From an 'END' statement body, it exits immediately. + You may pass an optional numeric value to be used as 'awk''s exit + status. + + * Some predefined variables provide control over 'awk', mainly for + I/O. Other variables convey information from 'awk' to your program. + + * 'ARGC' and 'ARGV' make the command-line arguments available to your + program. Manipulating them from a 'BEGIN' rule lets you control + how 'awk' will process the provided data files. + + +File: gawk.info, Node: Arrays, Next: Functions, Prev: Patterns and Actions, Up: Top + +8 Arrays in 'awk' +***************** + +An "array" is a table of values called "elements". The elements of an +array are distinguished by their "indices". Indices may be either +numbers or strings. + + This major node describes how arrays work in 'awk', how to use array +elements, how to scan through every element in an array, and how to +remove array elements. It also describes how 'awk' simulates +multidimensional arrays, as well as some of the less obvious points +about array usage. The major node moves on to discuss 'gawk''s facility +for sorting arrays, and ends with a brief description of 'gawk''s +ability to support true arrays of arrays. + +* Menu: + +* Array Basics:: The basics of arrays. +* Numeric Array Subscripts:: How to use numbers as subscripts in + 'awk'. +* Uninitialized Subscripts:: Using Uninitialized variables as subscripts. +* Delete:: The 'delete' statement removes an element + from an array. +* Multidimensional:: Emulating multidimensional arrays in + 'awk'. +* Arrays of Arrays:: True multidimensional arrays. +* Arrays Summary:: Summary of arrays. + + +File: gawk.info, Node: Array Basics, Next: Numeric Array Subscripts, Up: Arrays + +8.1 The Basics of Arrays +======================== + +This minor node presents the basics: working with elements in arrays one +at a time, and traversing all of the elements in an array. + +* Menu: + +* Array Intro:: Introduction to Arrays +* Reference to Elements:: How to examine one element of an array. +* Assigning Elements:: How to change an element of an array. +* Array Example:: Basic Example of an Array +* Scanning an Array:: A variation of the 'for' statement. It + loops through the indices of an array's + existing elements. +* Controlling Scanning:: Controlling the order in which arrays are + scanned. + + +File: gawk.info, Node: Array Intro, Next: Reference to Elements, Up: Array Basics + +8.1.1 Introduction to Arrays +---------------------------- + + Doing linear scans over an associative array is like trying to club + someone to death with a loaded Uzi. + -- _Larry Wall_ + + The 'awk' language provides one-dimensional arrays for storing groups +of related strings or numbers. Every 'awk' array must have a name. +Array names have the same syntax as variable names; any valid variable +name would also be a valid array name. But one name cannot be used in +both ways (as an array and as a variable) in the same 'awk' program. + + Arrays in 'awk' superficially resemble arrays in other programming +languages, but there are fundamental differences. In 'awk', it isn't +necessary to specify the size of an array before starting to use it. 
+Additionally, any number or string, not just consecutive integers, may
+be used as an array index.
+
+   In most other languages, arrays must be "declared" before use,
+including a specification of how many elements or components they
+contain.  In such languages, the declaration causes a contiguous block
+of memory to be allocated for that many elements.  Usually, an index in
+the array must be a nonnegative integer.  For example, the index zero
+specifies the first element in the array, which is actually stored at
+the beginning of the block of memory.  Index one specifies the second
+element, which is stored in memory right after the first element, and so
+on.  It is impossible to add more elements to the array, because it has
+room only for as many elements as given in the declaration.  (Some
+languages allow arbitrary starting and ending indices--e.g., '15 ..
+27'--but the size of the array is still fixed when the array is
+declared.)
+
+   A contiguous array of four elements might look like *note Figure 8.1:
+figure-array-elements, conceptually, if the element values are eight,
+'"foo"', '""', and 30.
+
++---------+---------+--------+---------+
+|    8    |  "foo"  |   ""   |    30   |    Value
++---------+---------+--------+---------+
+     0         1        2         3        Index
+
+Figure 8.1: A contiguous array
+
+Only the values are stored; the indices are implicit from the order of
+the values.  Here, eight is the value at index zero, because eight
+appears in the position with zero elements before it.
+
+   Arrays in 'awk' are different--they are "associative".  This means
+that each array is a collection of pairs--an index and its corresponding
+array element value:
+
+     Index   Value
+------------------------
+     '3'     '30'
+     '1'     '"foo"'
+     '0'     '8'
+     '2'     '""'
+
+The pairs are shown in jumbled order because their order is
+irrelevant.(1)
+
+   One advantage of associative arrays is that new pairs can be added at
+any time.  For example, suppose a tenth element is added to the array
+whose value is '"number ten"'.  The result is:
+
+     Index   Value
+-------------------------------
+     '10'    '"number ten"'
+     '3'     '30'
+     '1'     '"foo"'
+     '0'     '8'
+     '2'     '""'
+
+Now the array is "sparse", which just means some indices are missing.
+It has elements 0-3 and 10, but doesn't have elements 4, 5, 6, 7, 8, or
+9.
+
+   Another consequence of associative arrays is that the indices don't
+have to be nonnegative integers.  Any number, or even a string, can be
+an index.  For example, the following is an array that translates words
+from English to French:
+
+     Index     Value
+------------------------
+     '"dog"'   '"chien"'
+     '"cat"'   '"chat"'
+     '"one"'   '"un"'
+     '1'       '"un"'
+
+Here we decided to translate the number one in both spelled-out and
+numeric form--thus illustrating that a single array can have both
+numbers and strings as indices.  (In fact, array subscripts are always
+strings.  There are some subtleties to how numbers work when used as
+array subscripts; this is discussed in more detail in *note Numeric
+Array Subscripts::.)  Here, the number '1' isn't double-quoted, because
+'awk' automatically converts it to a string.
+
+   The value of 'IGNORECASE' has no effect upon array subscripting.  The
+identical string value used to store an array element must be used to
+retrieve it.  When 'awk' creates an array (e.g., with the 'split()'
+built-in function), that array's indices are consecutive integers
+starting at one.  (*Note String Functions::.)
+
+   'awk''s arrays are efficient--the time to access an element is
+independent of the number of elements in the array.
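+
+   As a small illustrative sketch (the array name 'french' is invented
+here; the words are simply the English-to-French table above written
+out as 'awk' code), a single array can mix string and numeric indices
+freely:
+
+     BEGIN {
+         french["dog"] = "chien"
+         french["cat"] = "chat"
+         french["one"] = "un"
+         french[1]     = "un"
+         print french["cat"], french[1]
+     }
+
+Running this prints 'chat un'; the numeric index 1 is converted to the
+string '"1"' before it is used as a subscript.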
+ + ---------- Footnotes ---------- + + (1) The ordering will vary among 'awk' implementations, which +typically use hash tables to store array elements and values. + + +File: gawk.info, Node: Reference to Elements, Next: Assigning Elements, Prev: Array Intro, Up: Array Basics + +8.1.2 Referring to an Array Element +----------------------------------- + +The principal way to use an array is to refer to one of its elements. +An "array reference" is an expression as follows: + + ARRAY[INDEX-EXPRESSION] + +Here, ARRAY is the name of an array. The expression INDEX-EXPRESSION is +the index of the desired element of the array. + + The value of the array reference is the current value of that array +element. For example, 'foo[4.3]' is an expression referencing the +element of array 'foo' at index '4.3'. + + A reference to an array element that has no recorded value yields a +value of '""', the null string. This includes elements that have not +been assigned any value as well as elements that have been deleted +(*note Delete::). + + NOTE: A reference to an element that does not exist _automatically_ + creates that array element, with the null string as its value. (In + some cases, this is unfortunate, because it might waste memory + inside 'awk'.) + + Novice 'awk' programmers often make the mistake of checking if an + element exists by checking if the value is empty: + + # Check if "foo" exists in a: Incorrect! + if (a["foo"] != "") ... + + This is incorrect for two reasons. First, it _creates_ 'a["foo"]' + if it didn't exist before! Second, it is valid (if a bit unusual) + to set an array element equal to the empty string. + + To determine whether an element exists in an array at a certain +index, use the following expression: + + INDX in ARRAY + +This expression tests whether the particular index INDX exists, without +the side effect of creating that element if it is not present. The +expression has the value one (true) if 'ARRAY[INDX]' exists and zero +(false) if it does not exist. (We use INDX here, because 'index' is the +name of a built-in function.) For example, this statement tests whether +the array 'frequencies' contains the index '2': + + if (2 in frequencies) + print "Subscript 2 is present." + + Note that this is _not_ a test of whether the array 'frequencies' +contains an element whose _value_ is two. There is no way to do that +except to scan all the elements. Also, this _does not_ create +'frequencies[2]', while the following (incorrect) alternative does: + + if (frequencies[2] != "") + print "Subscript 2 is present." + + +File: gawk.info, Node: Assigning Elements, Next: Array Example, Prev: Reference to Elements, Up: Array Basics + +8.1.3 Assigning Array Elements +------------------------------ + +Array elements can be assigned values just like 'awk' variables: + + ARRAY[INDEX-EXPRESSION] = VALUE + +ARRAY is the name of an array. The expression INDEX-EXPRESSION is the +index of the element of the array that is assigned a value. The +expression VALUE is the value to assign to that element of the array. + + +File: gawk.info, Node: Array Example, Next: Scanning an Array, Prev: Assigning Elements, Up: Array Basics + +8.1.4 Basic Array Example +------------------------- + +The following program takes a list of lines, each beginning with a line +number, and prints them out in order of line number. The line numbers +are not in order when they are first read--instead, they are scrambled. +This program sorts the lines by making an array using the line numbers +as subscripts. 
The program then prints out the lines in sorted order of +their numbers. It is a very simple program and gets confused upon +encountering repeated numbers, gaps, or lines that don't begin with a +number: + + { + if ($1 > max) + max = $1 + arr[$1] = $0 + } + + END { + for (x = 1; x <= max; x++) + print arr[x] + } + + The first rule keeps track of the largest line number seen so far; it +also stores each line into the array 'arr', at an index that is the +line's number. The second rule runs after all the input has been read, +to print out all the lines. When this program is run with the following +input: + + 5 I am the Five man + 2 Who are you? The new number two! + 4 . . . And four on the floor + 1 Who is number one? + 3 I three you. + +Its output is: + + 1 Who is number one? + 2 Who are you? The new number two! + 3 I three you. + 4 . . . And four on the floor + 5 I am the Five man + + If a line number is repeated, the last line with a given number +overrides the others. Gaps in the line numbers can be handled with an +easy improvement to the program's 'END' rule, as follows: + + END { + for (x = 1; x <= max; x++) + if (x in arr) + print arr[x] + } + + +File: gawk.info, Node: Scanning an Array, Next: Controlling Scanning, Prev: Array Example, Up: Array Basics + +8.1.5 Scanning All Elements of an Array +--------------------------------------- + +In programs that use arrays, it is often necessary to use a loop that +executes once for each element of an array. In other languages, where +arrays are contiguous and indices are limited to nonnegative integers, +this is easy: all the valid indices can be found by counting from the +lowest index up to the highest. This technique won't do the job in +'awk', because any number or string can be an array index. So 'awk' has +a special kind of 'for' statement for scanning an array: + + for (VAR in ARRAY) + BODY + +This loop executes BODY once for each index in ARRAY that the program +has previously used, with the variable VAR set to that index. + + The following program uses this form of the 'for' statement. The +first rule scans the input records and notes which words appear (at +least once) in the input, by storing a one into the array 'used' with +the word as the index. The second rule scans the elements of 'used' to +find all the distinct words that appear in the input. It prints each +word that is more than 10 characters long and also prints the number of +such words. *Note String Functions:: for more information on the +built-in function 'length()'. + + # Record a 1 for each word that is used at least once + { + for (i = 1; i <= NF; i++) + used[$i] = 1 + } + + # Find number of distinct words more than 10 characters long + END { + for (x in used) { + if (length(x) > 10) { + ++num_long_words + print x + } + } + print num_long_words, "words longer than 10 characters" + } + +*Note Word Sorting:: for a more detailed example of this type. + + The order in which elements of the array are accessed by this +statement is determined by the internal arrangement of the array +elements within 'awk' and in standard 'awk' cannot be controlled or +changed. This can lead to problems if new elements are added to ARRAY +by statements in the loop body; it is not predictable whether the 'for' +loop will reach them. Similarly, changing VAR inside the loop may +produce strange results. It is best to avoid such things. + + As a point of information, 'gawk' sets up the list of elements to be +iterated over before the loop starts, and does not change it. 
But not +all 'awk' versions do so. Consider this program, named 'loopcheck.awk': + + BEGIN { + a["here"] = "here" + a["is"] = "is" + a["a"] = "a" + a["loop"] = "loop" + for (i in a) { + j++ + a[j] = j + print i + } + } + + Here is what happens when run with 'gawk' (and 'mawk'): + + $ gawk -f loopcheck.awk + -| here + -| loop + -| a + -| is + + Contrast this to BWK 'awk': + + $ nawk -f loopcheck.awk + -| loop + -| here + -| is + -| a + -| 1 + + +File: gawk.info, Node: Controlling Scanning, Prev: Scanning an Array, Up: Array Basics + +8.1.6 Using Predefined Array Scanning Orders with 'gawk' +-------------------------------------------------------- + +This node describes a feature that is specific to 'gawk'. + + By default, when a 'for' loop traverses an array, the order is +undefined, meaning that the 'awk' implementation determines the order in +which the array is traversed. This order is usually based on the +internal implementation of arrays and will vary from one version of +'awk' to the next. + + Often, though, you may wish to do something simple, such as "traverse +the array by comparing the indices in ascending order," or "traverse the +array by comparing the values in descending order." 'gawk' provides two +mechanisms that give you this control: + + * Set 'PROCINFO["sorted_in"]' to one of a set of predefined values. + We describe this now. + + * Set 'PROCINFO["sorted_in"]' to the name of a user-defined function + to use for comparison of array elements. This advanced feature is + described later in *note Array Sorting::. + + The following special values for 'PROCINFO["sorted_in"]' are +available: + +'"@unsorted"' + Array elements are processed in arbitrary order, which is the + default 'awk' behavior. + +'"@ind_str_asc"' + Order by indices in ascending order compared as strings; this is + the most basic sort. (Internally, array indices are always + strings, so with 'a[2*5] = 1' the index is '"10"' rather than + numeric 10.) + +'"@ind_num_asc"' + Order by indices in ascending order but force them to be treated as + numbers in the process. Any index with a non-numeric value will + end up positioned as if it were zero. + +'"@val_type_asc"' + Order by element values in ascending order (rather than by + indices). Ordering is by the type assigned to the element (*note + Typing and Comparison::). All numeric values come before all + string values, which in turn come before all subarrays. (Subarrays + have not been described yet; *note Arrays of Arrays::.) + +'"@val_str_asc"' + Order by element values in ascending order (rather than by + indices). Scalar values are compared as strings. Subarrays, if + present, come out last. + +'"@val_num_asc"' + Order by element values in ascending order (rather than by + indices). Scalar values are compared as numbers. Subarrays, if + present, come out last. When numeric values are equal, the string + values are used to provide an ordering: this guarantees consistent + results across different versions of the C 'qsort()' function,(1) + which 'gawk' uses internally to perform the sorting. + +'"@ind_str_desc"' + Like '"@ind_str_asc"', but the string indices are ordered from high + to low. + +'"@ind_num_desc"' + Like '"@ind_num_asc"', but the numeric indices are ordered from + high to low. + +'"@val_type_desc"' + Like '"@val_type_asc"', but the element values, based on type, are + ordered from high to low. Subarrays, if present, come out first. 
+ +'"@val_str_desc"' + Like '"@val_str_asc"', but the element values, treated as strings, + are ordered from high to low. Subarrays, if present, come out + first. + +'"@val_num_desc"' + Like '"@val_num_asc"', but the element values, treated as numbers, + are ordered from high to low. Subarrays, if present, come out + first. + + The array traversal order is determined before the 'for' loop starts +to run. Changing 'PROCINFO["sorted_in"]' in the loop body does not +affect the loop. For example: + + $ gawk ' + > BEGIN { + > a[4] = 4 + > a[3] = 3 + > for (i in a) + > print i, a[i] + > }' + -| 4 4 + -| 3 3 + $ gawk ' + > BEGIN { + > PROCINFO["sorted_in"] = "@ind_str_asc" + > a[4] = 4 + > a[3] = 3 + > for (i in a) + > print i, a[i] + > }' + -| 3 3 + -| 4 4 + + When sorting an array by element values, if a value happens to be a +subarray then it is considered to be greater than any string or numeric +value, regardless of what the subarray itself contains, and all +subarrays are treated as being equal to each other. Their order +relative to each other is determined by their index strings. + + Here are some additional things to bear in mind about sorted array +traversal: + + * The value of 'PROCINFO["sorted_in"]' is global. That is, it + affects all array traversal 'for' loops. If you need to change it + within your own code, you should see if it's defined and save and + restore the value: + + ... + if ("sorted_in" in PROCINFO) { + save_sorted = PROCINFO["sorted_in"] + PROCINFO["sorted_in"] = "@val_str_desc" # or whatever + } + ... + if (save_sorted) + PROCINFO["sorted_in"] = save_sorted + + * As already mentioned, the default array traversal order is + represented by '"@unsorted"'. You can also get the default + behavior by assigning the null string to 'PROCINFO["sorted_in"]' or + by just deleting the '"sorted_in"' element from the 'PROCINFO' + array with the 'delete' statement. (The 'delete' statement hasn't + been described yet; *note Delete::.) + + In addition, 'gawk' provides built-in functions for sorting arrays; +see *note Array Sorting Functions::. + + ---------- Footnotes ---------- + + (1) When two elements compare as equal, the C 'qsort()' function does +not guarantee that they will maintain their original relative order +after sorting. Using the string value to provide a unique ordering when +the numeric values are equal ensures that 'gawk' behaves consistently +across different environments. + + +File: gawk.info, Node: Numeric Array Subscripts, Next: Uninitialized Subscripts, Prev: Array Basics, Up: Arrays + +8.2 Using Numbers to Subscript Arrays +===================================== + +An important aspect to remember about arrays is that _array subscripts +are always strings_. When a numeric value is used as a subscript, it is +converted to a string value before being used for subscripting (*note +Conversion::). This means that the value of the predefined variable +'CONVFMT' can affect how your program accesses elements of an array. +For example: + + xyz = 12.153 + data[xyz] = 1 + CONVFMT = "%2.2f" + if (xyz in data) + printf "%s is in data\n", xyz + else + printf "%s is not in data\n", xyz + +This prints '12.15 is not in data'. The first statement gives 'xyz' a +numeric value. Assigning to 'data[xyz]' subscripts 'data' with the +string value '"12.153"' (using the default conversion value of +'CONVFMT', '"%.6g"'). Thus, the array element 'data["12.153"]' is +assigned the value one. The program then changes the value of +'CONVFMT'. 
The test '(xyz in data)' generates a new string value from +'xyz'--this time '"12.15"'--because the value of 'CONVFMT' only allows +two significant digits. This test fails, because '"12.15"' is different +from '"12.153"'. + + According to the rules for conversions (*note Conversion::), integer +values always convert to strings as integers, no matter what the value +of 'CONVFMT' may happen to be. So the usual case of the following +works: + + for (i = 1; i <= maxsub; i++) + do something with array[i] + + The "integer values always convert to strings as integers" rule has +an additional consequence for array indexing. Octal and hexadecimal +constants (*note Nondecimal-numbers::) are converted internally into +numbers, and their original form is forgotten. This means, for example, +that 'array[17]', 'array[021]', and 'array[0x11]' all refer to the same +element! + + As with many things in 'awk', the majority of the time things work as +you would expect them to. But it is useful to have a precise knowledge +of the actual rules, as they can sometimes have a subtle effect on your +programs. + + +File: gawk.info, Node: Uninitialized Subscripts, Next: Delete, Prev: Numeric Array Subscripts, Up: Arrays + +8.3 Using Uninitialized Variables as Subscripts +=============================================== + +Suppose it's necessary to write a program to print the input data in +reverse order. A reasonable attempt to do so (with some test data) +might look like this: + + $ echo 'line 1 + > line 2 + > line 3' | awk '{ l[lines] = $0; ++lines } + > END { + > for (i = lines - 1; i >= 0; i--) + > print l[i] + > }' + -| line 3 + -| line 2 + + Unfortunately, the very first line of input data did not appear in +the output! + + Upon first glance, we would think that this program should have +worked. The variable 'lines' is uninitialized, and uninitialized +variables have the numeric value zero. So, 'awk' should have printed +the value of 'l[0]'. + + The issue here is that subscripts for 'awk' arrays are _always_ +strings. Uninitialized variables, when used as strings, have the value +'""', not zero. Thus, 'line 1' ends up stored in 'l[""]'. The +following version of the program works correctly: + + { l[lines++] = $0 } + END { + for (i = lines - 1; i >= 0; i--) + print l[i] + } + + Here, the '++' forces 'lines' to be numeric, thus making the "old +value" numeric zero. This is then converted to '"0"' as the array +subscript. + + Even though it is somewhat unusual, the null string ('""') is a valid +array subscript. (d.c.) 'gawk' warns about the use of the null string +as a subscript if '--lint' is provided on the command line (*note +Options::). + + +File: gawk.info, Node: Delete, Next: Multidimensional, Prev: Uninitialized Subscripts, Up: Arrays + +8.4 The 'delete' Statement +========================== + +To remove an individual element of an array, use the 'delete' statement: + + delete ARRAY[INDEX-EXPRESSION] + + Once an array element has been deleted, any value the element once +had is no longer available. It is as if the element had never been +referred to or been given a value. The following is an example of +deleting elements in an array: + + for (i in frequencies) + delete frequencies[i] + +This example removes all the elements from the array 'frequencies'. 
+Once an element is deleted, a subsequent 'for' statement to scan the +array does not report that element and using the 'in' operator to check +for the presence of that element returns zero (i.e., false): + + delete foo[4] + if (4 in foo) + print "This will never be printed" + + It is important to note that deleting an element is _not_ the same as +assigning it a null value (the empty string, '""'). For example: + + foo[4] = "" + if (4 in foo) + print "This is printed, even though foo[4] is empty" + + It is not an error to delete an element that does not exist. +However, if '--lint' is provided on the command line (*note Options::), +'gawk' issues a warning message when an element that is not in the array +is deleted. + + All the elements of an array may be deleted with a single statement +by leaving off the subscript in the 'delete' statement, as follows: + + delete ARRAY + + Using this version of the 'delete' statement is about three times +more efficient than the equivalent loop that deletes each element one at +a time. + + This form of the 'delete' statement is also supported by BWK 'awk' +and 'mawk', as well as by a number of other implementations. + + NOTE: For many years, using 'delete' without a subscript was a + common extension. In September 2012, it was accepted for inclusion + into the POSIX standard. See the Austin Group website + (http://austingroupbugs.net/view.php?id=544). + + The following statement provides a portable but nonobvious way to +clear out an array:(1) + + split("", array) + + The 'split()' function (*note String Functions::) clears out the +target array first. This call asks it to split apart the null string. +Because there is no data to split out, the function simply clears the +array and then returns. + + CAUTION: Deleting all the elements from an array does not change + its type; you cannot clear an array and then use the array's name + as a scalar (i.e., a regular variable). For example, the following + does not work: + + a[1] = 3 + delete a + a = 3 + + ---------- Footnotes ---------- + + (1) Thanks to Michael Brennan for pointing this out. + + +File: gawk.info, Node: Multidimensional, Next: Arrays of Arrays, Prev: Delete, Up: Arrays + +8.5 Multidimensional Arrays +=========================== + +* Menu: + +* Multiscanning:: Scanning multidimensional arrays. + +A "multidimensional array" is an array in which an element is identified +by a sequence of indices instead of a single index. For example, a +two-dimensional array requires two indices. The usual way (in many +languages, including 'awk') to refer to an element of a two-dimensional +array named 'grid' is with 'grid[X,Y]'. + + Multidimensional arrays are supported in 'awk' through concatenation +of indices into one string. 'awk' converts the indices into strings +(*note Conversion::) and concatenates them together, with a separator +between them. This creates a single string that describes the values of +the separate indices. The combined string is used as a single index +into an ordinary, one-dimensional array. The separator used is the +value of the built-in variable 'SUBSEP'. + + For example, suppose we evaluate the expression 'foo[5,12] = "value"' +when the value of 'SUBSEP' is '"@"'. The numbers 5 and 12 are converted +to strings and concatenated with an '@' between them, yielding '"5@12"'; +thus, the array element 'foo["5@12"]' is set to '"value"'. + + Once the element's value is stored, 'awk' has no record of whether it +was stored with a single index or a sequence of indices. 
The two +expressions 'foo[5,12]' and 'foo[5 SUBSEP 12]' are always equivalent. + + The default value of 'SUBSEP' is the string '"\034"', which contains +a nonprinting character that is unlikely to appear in an 'awk' program +or in most input data. The usefulness of choosing an unlikely character +comes from the fact that index values that contain a string matching +'SUBSEP' can lead to combined strings that are ambiguous. Suppose that +'SUBSEP' is '"@"'; then 'foo["a@b", "c"]' and 'foo["a", "b@c"]' are +indistinguishable because both are actually stored as 'foo["a@b@c"]'. + + To test whether a particular index sequence exists in a +multidimensional array, use the same operator ('in') that is used for +single-dimensional arrays. Write the whole sequence of indices in +parentheses, separated by commas, as the left operand: + + if ((SUBSCRIPT1, SUBSCRIPT2, ...) in ARRAY) + ... + + Here is an example that treats its input as a two-dimensional array +of fields; it rotates this array 90 degrees clockwise and prints the +result. It assumes that all lines have the same number of elements: + + { + if (max_nf < NF) + max_nf = NF + max_nr = NR + for (x = 1; x <= NF; x++) + vector[x, NR] = $x + } + + END { + for (x = 1; x <= max_nf; x++) { + for (y = max_nr; y >= 1; --y) + printf("%s ", vector[x, y]) + printf("\n") + } + } + +When given the input: + + 1 2 3 4 5 6 + 2 3 4 5 6 1 + 3 4 5 6 1 2 + 4 5 6 1 2 3 + +the program produces the following output: + + 4 3 2 1 + 5 4 3 2 + 6 5 4 3 + 1 6 5 4 + 2 1 6 5 + 3 2 1 6 + + +File: gawk.info, Node: Multiscanning, Up: Multidimensional + +8.5.1 Scanning Multidimensional Arrays +-------------------------------------- + +There is no special 'for' statement for scanning a "multidimensional" +array. There cannot be one, because, in truth, 'awk' does not have +multidimensional arrays or elements--there is only a multidimensional +_way of accessing_ an array. + + However, if your program has an array that is always accessed as +multidimensional, you can get the effect of scanning it by combining the +scanning 'for' statement (*note Scanning an Array::) with the built-in +'split()' function (*note String Functions::). It works in the +following manner: + + for (combined in array) { + split(combined, separate, SUBSEP) + ... + } + +This sets the variable 'combined' to each concatenated combined index in +the array, and splits it into the individual indices by breaking it +apart where the value of 'SUBSEP' appears. The individual indices then +become the elements of the array 'separate'. + + Thus, if a value is previously stored in 'array[1, "foo"]', then an +element with index '"1\034foo"' exists in 'array'. (Recall that the +default value of 'SUBSEP' is the character with code 034.) Sooner or +later, the 'for' statement finds that index and does an iteration with +the variable 'combined' set to '"1\034foo"'. Then the 'split()' +function is called as follows: + + split("1\034foo", separate, "\034") + +The result is to set 'separate[1]' to '"1"' and 'separate[2]' to +'"foo"'. Presto! The original sequence of separate indices is +recovered. + + +File: gawk.info, Node: Arrays of Arrays, Next: Arrays Summary, Prev: Multidimensional, Up: Arrays + +8.6 Arrays of Arrays +==================== + +'gawk' goes beyond standard 'awk''s multidimensional array access and +provides true arrays of arrays. Elements of a subarray are referred to +by their own indices enclosed in square brackets, just like the elements +of the main array. 
For example, the following creates a two-element
+subarray at index '1' of the main array 'a':
+
+     a[1][1] = 1
+     a[1][2] = 2
+
+   This simulates a true two-dimensional array.  Each subarray element
+can contain another subarray as a value, which in turn can hold other
+arrays as well.  In this way, you can create arrays of three or more
+dimensions.  The indices can be any 'awk' expressions, including scalars
+separated by commas (i.e., a regular 'awk' simulated multidimensional
+subscript).  So the following is valid in 'gawk':
+
+     a[1][3][1, "name"] = "barney"
+
+   Each subarray and the main array can be of different length.  In
+fact, the elements of an array or its subarray do not all have to have
+the same type.  This means that the main array and any of its subarrays
+can be nonrectangular, or jagged in structure.  You can assign a scalar
+value to the index '4' of the main array 'a', even though 'a[1]' is
+itself an array and not a scalar:
+
+     a[4] = "An element in a jagged array"
+
+   The terms "dimension", "row", and "column" are meaningless when
+applied to such an array, but we will use "dimension" henceforth to
+imply the maximum number of indices needed to refer to an existing
+element.  The type of any element that has already been assigned cannot
+be changed by assigning a value of a different type.  You have to first
+delete the current element, which effectively makes 'gawk' forget about
+the element at that index:
+
+     delete a[4]
+     a[4][5][6][7] = "An element in a four-dimensional array"
+
+This removes the scalar value from index '4' and then inserts a
+three-level nested subarray containing a scalar.  You can also delete an
+entire subarray or subarray of subarrays:
+
+     delete a[4][5]
+     a[4][5] = "An element in subarray a[4]"
+
+   But recall that you cannot delete the main array 'a' and then use it
+as a scalar.
+
+   The built-in functions that take array arguments can also be used
+with subarrays.  For example, the following code fragment uses
+'length()' (*note String Functions::) to determine the number of
+elements in the main array 'a' and its subarrays:
+
+     print length(a), length(a[1]), length(a[1][3])
+
+This results in the following output for our main array 'a':
+
+     2, 3, 1
+
+The 'SUBSCRIPT in ARRAY' expression (*note Reference to Elements::)
+works similarly for both regular 'awk'-style arrays and arrays of
+arrays.  For example, the tests '1 in a', '3 in a[1]', and '(1, "name")
+in a[1][3]' all evaluate to one (true) for our array 'a'.
+
+   The 'for (item in array)' statement (*note Scanning an Array::) can
+be nested to scan all the elements of an array of arrays if it is
+rectangular in structure.  In order to print the contents (scalar
+values) of a two-dimensional array of arrays (i.e., in which each
+first-level element is itself an array, not necessarily of the same
+length), you could use the following code:
+
+     for (i in array)
+         for (j in array[i])
+             print array[i][j]
+
+   The 'isarray()' function (*note Type Functions::) lets you test if an
+array element is itself an array:
+
+     for (i in array) {
+         if (isarray(array[i])) {
+             for (j in array[i]) {
+                 print array[i][j]
+             }
+         }
+         else
+             print array[i]
+     }
+
+   If the structure of a jagged array of arrays is known in advance, you
+can often devise workarounds using control statements. 
For example, the +following code prints the elements of our main array 'a': + + for (i in a) { + for (j in a[i]) { + if (j == 3) { + for (k in a[i][j]) + print a[i][j][k] + } else + print a[i][j] + } + } + +*Note Walking Arrays:: for a user-defined function that "walks" an +arbitrarily dimensioned array of arrays. + + Recall that a reference to an uninitialized array element yields a +value of '""', the null string. This has one important implication when +you intend to use a subarray as an argument to a function, as +illustrated by the following example: + + $ gawk 'BEGIN { split("a b c d", b[1]); print b[1][1] }' + error-> gawk: cmd. line:1: fatal: split: second argument is not an array + + The way to work around this is to first force 'b[1]' to be an array +by creating an arbitrary index: + + $ gawk 'BEGIN { b[1][1] = ""; split("a b c d", b[1]); print b[1][1] }' + -| a + + +File: gawk.info, Node: Arrays Summary, Prev: Arrays of Arrays, Up: Arrays + +8.7 Summary +=========== + + * Standard 'awk' provides one-dimensional associative arrays (arrays + indexed by string values). All arrays are associative; numeric + indices are converted automatically to strings. + + * Array elements are referenced as 'ARRAY[INDX]'. Referencing an + element creates it if it did not exist previously. + + * The proper way to see if an array has an element with a given index + is to use the 'in' operator: 'INDX in ARRAY'. + + * Use 'for (INDX in ARRAY) ...' to scan through all the individual + elements of an array. In the body of the loop, INDX takes on the + value of each element's index in turn. + + * The order in which a 'for (INDX in ARRAY)' loop traverses an array + is undefined in POSIX 'awk' and varies among implementations. + 'gawk' lets you control the order by assigning special predefined + values to 'PROCINFO["sorted_in"]'. + + * Use 'delete ARRAY[INDX]' to delete an individual element. To + delete all of the elements in an array, use 'delete ARRAY'. This + latter feature has been a common extension for many years and is + now standard, but may not be supported by all commercial versions + of 'awk'. + + * Standard 'awk' simulates multidimensional arrays by separating + subscript values with commas. The values are concatenated into a + single string, separated by the value of 'SUBSEP'. The fact that + such a subscript was created in this way is not retained; thus, + changing 'SUBSEP' may have unexpected consequences. You can use + '(SUB1, SUB2, ...) in ARRAY' to see if such a multidimensional + subscript exists in ARRAY. + + * 'gawk' provides true arrays of arrays. You use a separate set of + square brackets for each dimension in such an array: + 'data[row][col]', for example. Array elements may thus be either + scalar values (number or string) or other arrays. + + * Use the 'isarray()' built-in function to determine if an array + element is itself a subarray. + + +File: gawk.info, Node: Functions, Next: Library Functions, Prev: Arrays, Up: Top + +9 Functions +*********** + +This major node describes 'awk''s built-in functions, which fall into +three categories: numeric, string, and I/O. 'gawk' provides additional +groups of functions to work with values that represent time, do bit +manipulation, sort arrays, provide type information, and +internationalize and localize programs. + + Besides the built-in functions, 'awk' has provisions for writing new +functions that the rest of a program can use. The second half of this +major node describes these "user-defined" functions. 
Finally, we +explore indirect function calls, a 'gawk'-specific extension that lets +you determine at runtime what function is to be called. + +* Menu: + +* Built-in:: Summarizes the built-in functions. +* User-defined:: Describes User-defined functions in detail. +* Indirect Calls:: Choosing the function to call at runtime. +* Functions Summary:: Summary of functions. + + +File: gawk.info, Node: Built-in, Next: User-defined, Up: Functions + +9.1 Built-in Functions +====================== + +"Built-in" functions are always available for your 'awk' program to +call. This minor node defines all the built-in functions in 'awk'; some +of these are mentioned in other minor nodes but are summarized here for +your convenience. + +* Menu: + +* Calling Built-in:: How to call built-in functions. +* Numeric Functions:: Functions that work with numbers, including + 'int()', 'sin()' and 'rand()'. +* String Functions:: Functions for string manipulation, such as + 'split()', 'match()' and + 'sprintf()'. +* I/O Functions:: Functions for files and shell commands. +* Time Functions:: Functions for dealing with timestamps. +* Bitwise Functions:: Functions for bitwise operations. +* Type Functions:: Functions for type information. +* I18N Functions:: Functions for string translation. + + +File: gawk.info, Node: Calling Built-in, Next: Numeric Functions, Up: Built-in + +9.1.1 Calling Built-in Functions +-------------------------------- + +To call one of 'awk''s built-in functions, write the name of the +function followed by arguments in parentheses. For example, 'atan2(y + +z, 1)' is a call to the function 'atan2()' and has two arguments. + + Whitespace is ignored between the built-in function name and the +opening parenthesis, but nonetheless it is good practice to avoid using +whitespace there. User-defined functions do not permit whitespace in +this way, and it is easier to avoid mistakes by following a simple +convention that always works--no whitespace after a function name. + + Each built-in function accepts a certain number of arguments. In +some cases, arguments can be omitted. The defaults for omitted +arguments vary from function to function and are described under the +individual functions. In some 'awk' implementations, extra arguments +given to built-in functions are ignored. However, in 'gawk', it is a +fatal error to give extra arguments to a built-in function. + + When a function is called, expressions that create the function's +actual parameters are evaluated completely before the call is performed. +For example, in the following code fragment: + + i = 4 + j = sqrt(i++) + +the variable 'i' is incremented to the value five before 'sqrt()' is +called with a value of four for its actual parameter. The order of +evaluation of the expressions used for the function's parameters is +undefined. Thus, avoid writing programs that assume that parameters are +evaluated from left to right or from right to left. For example: + + i = 5 + j = atan2(++i, i *= 2) + + If the order of evaluation is left to right, then 'i' first becomes +six, and then 12, and 'atan2()' is called with the two arguments six and +12. But if the order of evaluation is right to left, 'i' first becomes +10, then 11, and 'atan2()' is called with the two arguments 11 and 10. + + +File: gawk.info, Node: Numeric Functions, Next: String Functions, Prev: Calling Built-in, Up: Built-in + +9.1.2 Numeric Functions +----------------------- + +The following list describes all of the built-in functions that work +with numbers. 
Optional parameters are enclosed in square +brackets ([ ]): + +'atan2(Y, X)' + Return the arctangent of 'Y / X' in radians. You can use 'pi = + atan2(0, -1)' to retrieve the value of pi. + +'cos(X)' + Return the cosine of X, with X in radians. + +'exp(X)' + Return the exponential of X ('e ^ X') or report an error if X is + out of range. The range of values X can have depends on your + machine's floating-point representation. + +'int(X)' + Return the nearest integer to X, located between X and zero and + truncated toward zero. For example, 'int(3)' is 3, 'int(3.9)' is + 3, 'int(-3.9)' is -3, and 'int(-3)' is -3 as well. + +'intdiv(NUMERATOR, DENOMINATOR, RESULT)' + Perform integer division, similar to the standard C function of the + same name. First, truncate 'numerator' and 'denominator' towards + zero, creating integer values. Clear the 'result' array, and then + set 'result["quotient"]' to the result of 'numerator / + denominator', truncated towards zero to an integer, and set + 'result["remainder"]' to the result of 'numerator % denominator', + truncated towards zero to an integer. This function is primarily + intended for use with arbitrary length integers; it avoids creating + MPFR arbitrary precision floating-point values (*note Arbitrary + Precision Integers::). + + This function is a 'gawk' extension. It is not available in + compatibility mode (*note Options::). + +'log(X)' + Return the natural logarithm of X, if X is positive; otherwise, + return 'NaN' ("not a number") on IEEE 754 systems. Additionally, + 'gawk' prints a warning message when 'x' is negative. + +'rand()' + Return a random number. The values of 'rand()' are uniformly + distributed between zero and one. The value could be zero but is + never one.(1) + + Often random integers are needed instead. Following is a + user-defined function that can be used to obtain a random + nonnegative integer less than N: + + function randint(n) + { + return int(n * rand()) + } + + The multiplication produces a random number greater than or equal + to zero and less than 'n'. Using 'int()', this result is made into + an integer between zero and 'n' - 1, inclusive. + + The following example uses a similar function to produce random + integers between one and N. This program prints a new random + number for each input record: + + # Function to roll a simulated die. + function roll(n) { return 1 + int(rand() * n) } + + # Roll 3 six-sided dice and + # print total number of points. + { + printf("%d points\n", roll(6) + roll(6) + roll(6)) + } + + CAUTION: In most 'awk' implementations, including 'gawk', + 'rand()' starts generating numbers from the same starting + number, or "seed", each time you run 'awk'.(2) Thus, a + program generates the same results each time you run it. The + numbers are random within one 'awk' run but predictable from + run to run. This is convenient for debugging, but if you want + a program to do different things each time it is used, you + must change the seed to a value that is different in each run. + To do this, use 'srand()'. + +'sin(X)' + Return the sine of X, with X in radians. + +'sqrt(X)' + Return the positive square root of X. 'gawk' prints a warning + message if X is negative. Thus, 'sqrt(4)' is 2. + +'srand('[X]')' + Set the starting point, or seed, for generating random numbers to + the value X. + + Each seed value leads to a particular sequence of random + numbers.(3) Thus, if the seed is set to the same value a second + time, the same sequence of random numbers is produced again. 
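+
+     As a small illustrative sketch (the seed value 42 here is
+     arbitrary), the following program prints '1', because resetting
+     the seed to the same value restarts the same sequence:
+
+          BEGIN {
+              srand(42); a = rand()    # seed, then take one value
+              srand(42); b = rand()    # reseed with the same value
+              print (a == b)           # prints 1
+          }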
+ + CAUTION: Different 'awk' implementations use different + random-number generators internally. Don't expect the same + 'awk' program to produce the same series of random numbers + when executed by different versions of 'awk'. + + If the argument X is omitted, as in 'srand()', then the current + date and time of day are used for a seed. This is the way to get + random numbers that are truly unpredictable. + + The return value of 'srand()' is the previous seed. This makes it + easy to keep track of the seeds in case you need to consistently + reproduce sequences of random numbers. + + POSIX does not specify the initial seed; it differs among 'awk' + implementations. + + ---------- Footnotes ---------- + + (1) The C version of 'rand()' on many Unix systems is known to +produce fairly poor sequences of random numbers. However, nothing +requires that an 'awk' implementation use the C 'rand()' to implement +the 'awk' version of 'rand()'. In fact, 'gawk' uses the BSD 'random()' +function, which is considerably better than 'rand()', to produce random +numbers. + + (2) 'mawk' uses a different seed each time. + + (3) Computer-generated random numbers really are not truly random. +They are technically known as "pseudorandom". This means that although +the numbers in a sequence appear to be random, you can in fact generate +the same sequence of random numbers over and over again. + + +File: gawk.info, Node: String Functions, Next: I/O Functions, Prev: Numeric Functions, Up: Built-in + +9.1.3 String-Manipulation Functions +----------------------------------- + +The functions in this minor node look at or change the text of one or +more strings. + + 'gawk' understands locales (*note Locales::) and does all string +processing in terms of _characters_, not _bytes_. This distinction is +particularly important to understand for locales where one character may +be represented by multiple bytes. Thus, for example, 'length()' returns +the number of characters in a string, and not the number of bytes used +to represent those characters. Similarly, 'index()' works with +character indices, and not byte indices. + + CAUTION: A number of functions deal with indices into strings. For + these functions, the first character of a string is at position + (index) one. This is different from C and the languages descended + from it, where the first character is at position zero. You need + to remember this when doing index calculations, particularly if you + are used to C. + + In the following list, optional parameters are enclosed in square +brackets ([ ]). Several functions perform string substitution; the full +discussion is provided in the description of the 'sub()' function, which +comes toward the end, because the list is presented alphabetically. + + Those functions that are specific to 'gawk' are marked with a pound +sign ('#'). They are not available in compatibility mode (*note +Options::): + +* Menu: + +* Gory Details:: More than you want to know about '\' and + '&' with 'sub()', 'gsub()', and + 'gensub()'. + +'asort('SOURCE [',' DEST [',' HOW ] ]') #' +'asorti('SOURCE [',' DEST [',' HOW ] ]') #' + These two functions are similar in behavior, so they are described + together. + + NOTE: The following description ignores the third argument, + HOW, as it requires understanding features that we have not + discussed yet. Thus, the discussion here is a deliberate + simplification. (We do provide all the details later on; see + *note Array Sorting Functions:: for the full story.) 
+ + Both functions return the number of elements in the array SOURCE. + For 'asort()', 'gawk' sorts the values of SOURCE and replaces the + indices of the sorted values of SOURCE with sequential integers + starting with one. If the optional array DEST is specified, then + SOURCE is duplicated into DEST. DEST is then sorted, leaving the + indices of SOURCE unchanged. + + When comparing strings, 'IGNORECASE' affects the sorting (*note + Array Sorting Functions::). If the SOURCE array contains subarrays + as values (*note Arrays of Arrays::), they will come last, after + all scalar values. Subarrays are _not_ recursively sorted. + + For example, if the contents of 'a' are as follows: + + a["last"] = "de" + a["first"] = "sac" + a["middle"] = "cul" + + A call to 'asort()': + + asort(a) + + results in the following contents of 'a': + + a[1] = "cul" + a[2] = "de" + a[3] = "sac" + + The 'asorti()' function works similarly to 'asort()'; however, the + _indices_ are sorted, instead of the values. Thus, in the previous + example, starting with the same initial set of indices and values + in 'a', calling 'asorti(a)' would yield: + + a[1] = "first" + a[2] = "last" + a[3] = "middle" + +'gensub(REGEXP, REPLACEMENT, HOW' [', TARGET']') #' + Search the target string TARGET for matches of the regular + expression REGEXP. If HOW is a string beginning with 'g' or 'G' + (short for "global"), then replace all matches of REGEXP with + REPLACEMENT. Otherwise, HOW is treated as a number indicating + which match of REGEXP to replace. If no TARGET is supplied, use + '$0'. It returns the modified string as the result of the function + and the original target string is _not_ changed. + + 'gensub()' is a general substitution function. Its purpose is to + provide more features than the standard 'sub()' and 'gsub()' + functions. + + 'gensub()' provides an additional feature that is not available in + 'sub()' or 'gsub()': the ability to specify components of a regexp + in the replacement text. This is done by using parentheses in the + regexp to mark the components and then specifying '\N' in the + replacement text, where N is a digit from 1 to 9. For example: + + $ gawk ' + > BEGIN { + > a = "abc def" + > b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a) + > print b + > }' + -| def abc + + As with 'sub()', you must type two backslashes in order to get one + into the string. In the replacement text, the sequence '\0' + represents the entire matched text, as does the character '&'. + + The following example shows how you can use the third argument to + control which match of the regexp should be changed: + + $ echo a b c a b c | + > gawk '{ print gensub(/a/, "AA", 2) }' + -| a b c AA b c + + In this case, '$0' is the default target string. 'gensub()' + returns the new string as its result, which is passed directly to + 'print' for printing. + + If the HOW argument is a string that does not begin with 'g' or + 'G', or if it is a number that is less than or equal to zero, only + one substitution is performed. If HOW is zero, 'gawk' issues a + warning message. + + If REGEXP does not match TARGET, 'gensub()''s return value is the + original unchanged value of TARGET. + +'gsub(REGEXP, REPLACEMENT' [', TARGET']')' + Search TARGET for _all_ of the longest, leftmost, _nonoverlapping_ + matching substrings it can find and replace them with REPLACEMENT. + The 'g' in 'gsub()' stands for "global," which means replace + everywhere. 
For example: + + { gsub(/Britain/, "United Kingdom"); print } + + replaces all occurrences of the string 'Britain' with 'United + Kingdom' for all input records. + + The 'gsub()' function returns the number of substitutions made. If + the variable to search and alter (TARGET) is omitted, then the + entire input record ('$0') is used. As in 'sub()', the characters + '&' and '\' are special, and the third argument must be assignable. + +'index(IN, FIND)' + Search the string IN for the first occurrence of the string FIND, + and return the position in characters where that occurrence begins + in the string IN. Consider the following example: + + $ awk 'BEGIN { print index("peanut", "an") }' + -| 3 + + If FIND is not found, 'index()' returns zero. + + With BWK 'awk' and 'gawk', it is a fatal error to use a regexp + constant for FIND. Other implementations allow it, simply treating + the regexp constant as an expression meaning '$0 ~ /regexp/'. + (d.c.) + +'length('[STRING]')' + Return the number of characters in STRING. If STRING is a number, + the length of the digit string representing that number is + returned. For example, 'length("abcde")' is five. By contrast, + 'length(15 * 35)' works out to three. In this example, 15 * 35 = + 525, and 525 is then converted to the string '"525"', which has + three characters. + + If no argument is supplied, 'length()' returns the length of '$0'. + + NOTE: In older versions of 'awk', the 'length()' function + could be called without any parentheses. Doing so is + considered poor practice, although the 2008 POSIX standard + explicitly allows it, to support historical practice. For + programs to be maximally portable, always supply the + parentheses. + + If 'length()' is called with a variable that has not been used, + 'gawk' forces the variable to be a scalar. Other implementations + of 'awk' leave the variable without a type. (d.c.) Consider: + + $ gawk 'BEGIN { print length(x) ; x[1] = 1 }' + -| 0 + error-> gawk: fatal: attempt to use scalar `x' as array + + $ nawk 'BEGIN { print length(x) ; x[1] = 1 }' + -| 0 + + If '--lint' has been specified on the command line, 'gawk' issues a + warning about this. + + With 'gawk' and several other 'awk' implementations, when given an + array argument, the 'length()' function returns the number of + elements in the array. (c.e.) This is less useful than it might + seem at first, as the array is not guaranteed to be indexed from + one to the number of elements in it. If '--lint' is provided on + the command line (*note Options::), 'gawk' warns that passing an + array argument is not portable. If '--posix' is supplied, using an + array argument is a fatal error (*note Arrays::). + +'match(STRING, REGEXP' [', ARRAY']')' + Search STRING for the longest, leftmost substring matched by the + regular expression REGEXP and return the character position (index) + at which that substring begins (one, if it starts at the beginning + of STRING). If no match is found, return zero. + + The REGEXP argument may be either a regexp constant ('/'...'/') or + a string constant ('"'...'"'). In the latter case, the string is + treated as a regexp to be matched. *Note Computed Regexps:: for a + discussion of the difference between the two forms, and the + implications for writing your program correctly. + + The order of the first two arguments is the opposite of most other + string functions that work with regular expressions, such as + 'sub()' and 'gsub()'. 
It might help to remember that for + 'match()', the order is the same as for the '~' operator: 'STRING ~ + REGEXP'. + + The 'match()' function sets the predefined variable 'RSTART' to the + index. It also sets the predefined variable 'RLENGTH' to the + length in characters of the matched substring. If no match is + found, 'RSTART' is set to zero, and 'RLENGTH' to -1. + + For example: + + { + if ($1 == "FIND") + regex = $2 + else { + where = match($0, regex) + if (where != 0) + print "Match of", regex, "found at", where, "in", $0 + } + } + + This program looks for lines that match the regular expression + stored in the variable 'regex'. This regular expression can be + changed. If the first word on a line is 'FIND', 'regex' is changed + to be the second word on that line. Therefore, if given: + + FIND ru+n + My program runs + but not very quickly + FIND Melvin + JF+KM + This line is property of Reality Engineering Co. + Melvin was here. + + 'awk' prints: + + Match of ru+n found at 12 in My program runs + Match of Melvin found at 1 in Melvin was here. + + If ARRAY is present, it is cleared, and then the zeroth element of + ARRAY is set to the entire portion of STRING matched by REGEXP. If + REGEXP contains parentheses, the integer-indexed elements of ARRAY + are set to contain the portion of STRING matching the corresponding + parenthesized subexpression. For example: + + $ echo foooobazbarrrrr | + > gawk '{ match($0, /(fo+).+(bar*)/, arr) + > print arr[1], arr[2] }' + -| foooo barrrrr + + In addition, multidimensional subscripts are available providing + the start index and length of each matched subexpression: + + $ echo foooobazbarrrrr | + > gawk '{ match($0, /(fo+).+(bar*)/, arr) + > print arr[1], arr[2] + > print arr[1, "start"], arr[1, "length"] + > print arr[2, "start"], arr[2, "length"] + > }' + -| foooo barrrrr + -| 1 5 + -| 9 7 + + There may not be subscripts for the start and index for every + parenthesized subexpression, because they may not all have matched + text; thus, they should be tested for with the 'in' operator (*note + Reference to Elements::). + + The ARRAY argument to 'match()' is a 'gawk' extension. In + compatibility mode (*note Options::), using a third argument is a + fatal error. + +'patsplit(STRING, ARRAY' [', FIELDPAT' [', SEPS' ] ]') #' + Divide STRING into pieces defined by FIELDPAT and store the pieces + in ARRAY and the separator strings in the SEPS array. The first + piece is stored in 'ARRAY[1]', the second piece in 'ARRAY[2]', and + so forth. The third argument, FIELDPAT, is a regexp describing the + fields in STRING (just as 'FPAT' is a regexp describing the fields + in input records). It may be either a regexp constant or a string. + If FIELDPAT is omitted, the value of 'FPAT' is used. 'patsplit()' + returns the number of elements created. 'SEPS[I]' is the separator + string between 'ARRAY[I]' and 'ARRAY[I+1]'. Any leading separator + will be in 'SEPS[0]'. + + The 'patsplit()' function splits strings into pieces in a manner + similar to the way input lines are split into fields using 'FPAT' + (*note Splitting By Content::). + + Before splitting the string, 'patsplit()' deletes any previously + existing elements in the arrays ARRAY and SEPS. + +'split(STRING, ARRAY' [', FIELDSEP' [', SEPS' ] ]')' + Divide STRING into pieces separated by FIELDSEP and store the + pieces in ARRAY and the separator strings in the SEPS array. The + first piece is stored in 'ARRAY[1]', the second piece in + 'ARRAY[2]', and so forth. 
The string value of the third argument, + FIELDSEP, is a regexp describing where to split STRING (much as + 'FS' can be a regexp describing where to split input records). If + FIELDSEP is omitted, the value of 'FS' is used. 'split()' returns + the number of elements created. SEPS is a 'gawk' extension, with + 'SEPS[I]' being the separator string between 'ARRAY[I]' and + 'ARRAY[I+1]'. If FIELDSEP is a single space, then any leading + whitespace goes into 'SEPS[0]' and any trailing whitespace goes + into 'SEPS[N]', where N is the return value of 'split()' (i.e., the + number of elements in ARRAY). + + The 'split()' function splits strings into pieces in a manner + similar to the way input lines are split into fields. For example: + + split("cul-de-sac", a, "-", seps) + + splits the string '"cul-de-sac"' into three fields using '-' as the + separator. It sets the contents of the array 'a' as follows: + + a[1] = "cul" + a[2] = "de" + a[3] = "sac" + + and sets the contents of the array 'seps' as follows: + + seps[1] = "-" + seps[2] = "-" + + The value returned by this call to 'split()' is three. + + As with input field-splitting, when the value of FIELDSEP is '" "', + leading and trailing whitespace is ignored in values assigned to + the elements of ARRAY but not in SEPS, and the elements are + separated by runs of whitespace. Also, as with input field + splitting, if FIELDSEP is the null string, each individual + character in the string is split into its own array element. + (c.e.) + + Note, however, that 'RS' has no effect on the way 'split()' works. + Even though 'RS = ""' causes the newline character to also be an + input field separator, this does not affect how 'split()' splits + strings. + + Modern implementations of 'awk', including 'gawk', allow the third + argument to be a regexp constant ('/'...'/') as well as a string. + (d.c.) The POSIX standard allows this as well. *Note Computed + Regexps:: for a discussion of the difference between using a string + constant or a regexp constant, and the implications for writing + your program correctly. + + Before splitting the string, 'split()' deletes any previously + existing elements in the arrays ARRAY and SEPS. + + If STRING is null, the array has no elements. (So this is a + portable way to delete an entire array with one statement. *Note + Delete::.) + + If STRING does not match FIELDSEP at all (but is not null), ARRAY + has one element only. The value of that element is the original + STRING. + + In POSIX mode (*note Options::), the fourth argument is not + allowed. + +'sprintf(FORMAT, EXPRESSION1, ...)' + Return (without printing) the string that 'printf' would have + printed out with the same arguments (*note Printf::). For example: + + pival = sprintf("pi = %.2f (approx.)", 22/7) + + assigns the string 'pi = 3.14 (approx.)' to the variable 'pival'. + +'strtonum(STR) #' + Examine STR and return its numeric value. If STR begins with a + leading '0', 'strtonum()' assumes that STR is an octal number. If + STR begins with a leading '0x' or '0X', 'strtonum()' assumes that + STR is a hexadecimal number. For example: + + $ echo 0x11 | + > gawk '{ printf "%d\n", strtonum($1) }' + -| 17 + + Using the 'strtonum()' function is _not_ the same as adding zero to + a string value; the automatic coercion of strings to numbers works + only for decimal data, not for octal or hexadecimal.(1) + + Note also that 'strtonum()' uses the current locale's decimal point + for recognizing numbers (*note Locales::). 
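+
+     To see the difference from simple coercion (a small sketch; the
+     string '"0x11"' is just an example value), compare the following
+     two statements:
+
+          BEGIN {
+              s = "0x11"
+              print s + 0          # prints 0 (decimal coercion only)
+              print strtonum(s)    # prints 17
+          }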
+ +'sub(REGEXP, REPLACEMENT' [', TARGET']')' + Search TARGET, which is treated as a string, for the leftmost, + longest substring matched by the regular expression REGEXP. Modify + the entire string by replacing the matched text with REPLACEMENT. + The modified string becomes the new value of TARGET. Return the + number of substitutions made (zero or one). + + The REGEXP argument may be either a regexp constant ('/'...'/') or + a string constant ('"'...'"'). In the latter case, the string is + treated as a regexp to be matched. *Note Computed Regexps:: for a + discussion of the difference between the two forms, and the + implications for writing your program correctly. + + This function is peculiar because TARGET is not simply used to + compute a value, and not just any expression will do--it must be a + variable, field, or array element so that 'sub()' can store a + modified value there. If this argument is omitted, then the + default is to use and alter '$0'.(2) For example: + + str = "water, water, everywhere" + sub(/at/, "ith", str) + + sets 'str' to 'wither, water, everywhere', by replacing the + leftmost longest occurrence of 'at' with 'ith'. + + If the special character '&' appears in REPLACEMENT, it stands for + the precise substring that was matched by REGEXP. (If the regexp + can match more than one string, then this precise substring may + vary.) For example: + + { sub(/candidate/, "& and his wife"); print } + + changes the first occurrence of 'candidate' to 'candidate and his + wife' on each input line. Here is another example: + + $ awk 'BEGIN { + > str = "daabaaa" + > sub(/a+/, "C&C", str) + > print str + > }' + -| dCaaCbaaa + + This shows how '&' can represent a nonconstant string and also + illustrates the "leftmost, longest" rule in regexp matching (*note + Leftmost Longest::). + + The effect of this special character ('&') can be turned off by + putting a backslash before it in the string. As usual, to insert + one backslash in the string, you must write two backslashes. + Therefore, write '\\&' in a string constant to include a literal + '&' in the replacement. For example, the following shows how to + replace the first '|' on each line with an '&': + + { sub(/\|/, "\\&"); print } + + As mentioned, the third argument to 'sub()' must be a variable, + field, or array element. Some versions of 'awk' allow the third + argument to be an expression that is not an lvalue. In such a + case, 'sub()' still searches for the pattern and returns zero or + one, but the result of the substitution (if any) is thrown away + because there is no place to put it. Such versions of 'awk' accept + expressions like the following: + + sub(/USA/, "United States", "the USA and Canada") + + For historical compatibility, 'gawk' accepts such erroneous code. + However, using any other nonchangeable object as the third + parameter causes a fatal error and your program will not run. + + Finally, if the REGEXP is not a regexp constant, it is converted + into a string, and then the value of that string is treated as the + regexp to match. + +'substr(STRING, START' [', LENGTH' ]')' + Return a LENGTH-character-long substring of STRING, starting at + character number START. The first character of a string is + character number one.(3) For example, 'substr("washington", 5, 3)' + returns '"ing"'. + + If LENGTH is not present, 'substr()' returns the whole suffix of + STRING that begins at character number START. For example, + 'substr("washington", 5)' returns '"ington"'. 
The whole suffix is + also returned if LENGTH is greater than the number of characters + remaining in the string, counting from character START. + + If START is less than one, 'substr()' treats it as if it was one. + (POSIX doesn't specify what to do in this case: BWK 'awk' acts this + way, and therefore 'gawk' does too.) If START is greater than the + number of characters in the string, 'substr()' returns the null + string. Similarly, if LENGTH is present but less than or equal to + zero, the null string is returned. + + The string returned by 'substr()' _cannot_ be assigned. Thus, it + is a mistake to attempt to change a portion of a string, as shown + in the following example: + + string = "abcdef" + # try to get "abCDEf", won't work + substr(string, 3, 3) = "CDE" + + It is also a mistake to use 'substr()' as the third argument of + 'sub()' or 'gsub()': + + gsub(/xyz/, "pdq", substr($0, 5, 20)) # WRONG + + (Some commercial versions of 'awk' treat 'substr()' as assignable, + but doing so is not portable.) + + If you need to replace bits and pieces of a string, combine + 'substr()' with string concatenation, in the following manner: + + string = "abcdef" + ... + string = substr(string, 1, 2) "CDE" substr(string, 6) + +'tolower(STRING)' + Return a copy of STRING, with each uppercase character in the + string replaced with its corresponding lowercase character. + Nonalphabetic characters are left unchanged. For example, + 'tolower("MiXeD cAsE 123")' returns '"mixed case 123"'. + +'toupper(STRING)' + Return a copy of STRING, with each lowercase character in the + string replaced with its corresponding uppercase character. + Nonalphabetic characters are left unchanged. For example, + 'toupper("MiXeD cAsE 123")' returns '"MIXED CASE 123"'. + + Matching the Null String + + In 'awk', the '*' operator can match the null string. This is +particularly important for the 'sub()', 'gsub()', and 'gensub()' +functions. For example: + + $ echo abc | awk '{ gsub(/m*/, "X"); print }' + -| XaXbXcX + +Although this makes a certain amount of sense, it can be surprising. + + ---------- Footnotes ---------- + + (1) Unless you use the '--non-decimal-data' option, which isn't +recommended. *Note Nondecimal Data:: for more information. + + (2) Note that this means that the record will first be regenerated +using the value of 'OFS' if any fields have been changed, and that the +fields will be updated after the substitution, even if the operation is +a "no-op" such as 'sub(/^/, "")'. + + (3) This is different from C and C++, in which the first character is +number zero. + + +File: gawk.info, Node: Gory Details, Up: String Functions + +9.1.3.1 More about '\' and '&' with 'sub()', 'gsub()', and 'gensub()' +..................................................................... + + CAUTION: This subsubsection has been reported to cause headaches. + You might want to skip it upon first reading. + + When using 'sub()', 'gsub()', or 'gensub()', and trying to get +literal backslashes and ampersands into the replacement text, you need +to remember that there are several levels of "escape processing" going +on. + + First, there is the "lexical" level, which is when 'awk' reads your +program and builds an internal copy of it to execute. Then there is the +runtime level, which is when 'awk' actually scans the replacement string +to determine what to generate. + + At both levels, 'awk' looks for a defined set of characters that can +come after a backslash. 
At the lexical level, it looks for the escape +sequences listed in *note Escape Sequences::. Thus, for every '\' that +'awk' processes at the runtime level, you must type two backslashes at +the lexical level. When a character that is not valid for an escape +sequence follows the '\', BWK 'awk' and 'gawk' both simply remove the +initial '\' and put the next character into the string. Thus, for +example, '"a\qb"' is treated as '"aqb"'. + + At the runtime level, the various functions handle sequences of '\' +and '&' differently. The situation is (sadly) somewhat complex. +Historically, the 'sub()' and 'gsub()' functions treated the +two-character sequence '\&' specially; this sequence was replaced in the +generated text with a single '&'. Any other '\' within the REPLACEMENT +string that did not precede an '&' was passed through unchanged. This +is illustrated in *note Table 9.1: table-sub-escapes. + + You type 'sub()' sees 'sub()' generates + ----- ------- ---------- + '\&' '&' The matched text + '\\&' '\&' A literal '&' + '\\\&' '\&' A literal '&' + '\\\\&' '\\&' A literal '\&' + '\\\\\&' '\\&' A literal '\&' + '\\\\\\&' '\\\&' A literal '\\&' + '\\q' '\q' A literal '\q' + +Table 9.1: Historical escape sequence processing for 'sub()' and +'gsub()' + +This table shows the lexical-level processing, where an odd number of +backslashes becomes an even number at the runtime level, as well as the +runtime processing done by 'sub()'. (For the sake of simplicity, the +rest of the following tables only show the case of even numbers of +backslashes entered at the lexical level.) + + The problem with the historical approach is that there is no way to +get a literal '\' followed by the matched text. + + Several editions of the POSIX standard attempted to fix this problem +but weren't successful. The details are irrelevant at this point in +time. + + At one point, the 'gawk' maintainer submitted proposed text for a +revised standard that reverts to rules that correspond more closely to +the original existing practice. The proposed rules have special cases +that make it possible to produce a '\' preceding the matched text. This +is shown in *note Table 9.2: table-sub-proposed. + + You type 'sub()' sees 'sub()' generates + ----- ------- ---------- + '\\\\\\&' '\\\&' A literal '\&' + '\\\\&' '\\&' A literal '\', followed by the matched text + '\\&' '\&' A literal '&' + '\\q' '\q' A literal '\q' + '\\\\' '\\' '\\' + +Table 9.2: 'gawk' rules for 'sub()' and backslash + + In a nutshell, at the runtime level, there are now three special +sequences of characters ('\\\&', '\\&', and '\&') whereas historically +there was only one. However, as in the historical case, any '\' that is +not part of one of these three sequences is not special and appears in +the output literally. + + 'gawk' 3.0 and 3.1 follow these rules for 'sub()' and 'gsub()'. The +POSIX standard took much longer to be revised than was expected. In +addition, the 'gawk' maintainer's proposal was lost during the +standardization process. The final rules are somewhat simpler. The +results are similar except for one case. + + The POSIX rules state that '\&' in the replacement string produces a +literal '&', '\\' produces a literal '\', and '\' followed by anything +else is not special; the '\' is placed straight into the output. These +rules are presented in *note Table 9.3: table-posix-sub. 
+ + You type 'sub()' sees 'sub()' generates + ----- ------- ---------- + '\\\\\\&' '\\\&' A literal '\&' + '\\\\&' '\\&' A literal '\', followed by the matched text + '\\&' '\&' A literal '&' + '\\q' '\q' A literal '\q' + '\\\\' '\\' '\' + +Table 9.3: POSIX rules for 'sub()' and 'gsub()' + + The only case where the difference is noticeable is the last one: +'\\\\' is seen as '\\' and produces '\' instead of '\\'. + + Starting with version 3.1.4, 'gawk' followed the POSIX rules when +'--posix' was specified (*note Options::). Otherwise, it continued to +follow the proposed rules, as that had been its behavior for many years. + + When version 4.0.0 was released, the 'gawk' maintainer made the POSIX +rules the default, breaking well over a decade's worth of backward +compatibility.(1) Needless to say, this was a bad idea, and as of +version 4.0.1, 'gawk' resumed its historical behavior, and only follows +the POSIX rules when '--posix' is given. + + The rules for 'gensub()' are considerably simpler. At the runtime +level, whenever 'gawk' sees a '\', if the following character is a +digit, then the text that matched the corresponding parenthesized +subexpression is placed in the generated output. Otherwise, no matter +what character follows the '\', it appears in the generated text and the +'\' does not, as shown in *note Table 9.4: table-gensub-escapes. + + You type 'gensub()' sees 'gensub()' generates + ----- --------- ------------ + '&' '&' The matched text + '\\&' '\&' A literal '&' + '\\\\' '\\' A literal '\' + '\\\\&' '\\&' A literal '\', then the matched text + '\\\\\\&' '\\\&' A literal '\&' + '\\q' '\q' A literal 'q' + +Table 9.4: Escape sequence processing for 'gensub()' + + Because of the complexity of the lexical- and runtime-level +processing and the special cases for 'sub()' and 'gsub()', we recommend +the use of 'gawk' and 'gensub()' when you have to do substitutions. + + ---------- Footnotes ---------- + + (1) This was rather naive of him, despite there being a note in this +minor node indicating that the next major version would move to the +POSIX rules. + + +File: gawk.info, Node: I/O Functions, Next: Time Functions, Prev: String Functions, Up: Built-in + +9.1.4 Input/Output Functions +---------------------------- + +The following functions relate to input/output (I/O). Optional +parameters are enclosed in square brackets ([ ]): + +'close('FILENAME [',' HOW]')' + Close the file FILENAME for input or output. Alternatively, the + argument may be a shell command that was used for creating a + coprocess, or for redirecting to or from a pipe; then the coprocess + or pipe is closed. *Note Close Files And Pipes:: for more + information. + + When closing a coprocess, it is occasionally useful to first close + one end of the two-way pipe and then to close the other. This is + done by providing a second argument to 'close()'. This second + argument (HOW) should be one of the two string values '"to"' or + '"from"', indicating which end of the pipe to close. Case in the + string does not matter. *Note Two-way I/O::, which discusses this + feature in more detail and gives an example. + + Note that the second argument to 'close()' is a 'gawk' extension; + it is not available in compatibility mode (*note Options::). + +'fflush('[FILENAME]')' + Flush any buffered output associated with FILENAME, which is either + a file opened for writing or a shell command for redirecting output + to a pipe or coprocess. 
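+
+     For instance (a sketch only; '/dev/stderr' is one of 'gawk''s
+     special file names), a long-running program might force each
+     progress message out as soon as it is written:
+
+          {
+              if (NR % 1000 == 0) {
+                  printf("%d records processed\n", NR) > "/dev/stderr"
+                  fflush("/dev/stderr")
+              }
+          }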
+ + Many utility programs "buffer" their output (i.e., they save + information to write to a disk file or the screen in memory until + there is enough for it to be worthwhile to send the data to the + output device). This is often more efficient than writing every + little bit of information as soon as it is ready. However, + sometimes it is necessary to force a program to "flush" its buffers + (i.e., write the information to its destination, even if a buffer + is not full). This is the purpose of the 'fflush()' + function--'gawk' also buffers its output, and the 'fflush()' + function forces 'gawk' to flush its buffers. + + Brian Kernighan added 'fflush()' to his 'awk' in April 1992. For + two decades, it was a common extension. In December 2012, it was + accepted for inclusion into the POSIX standard. See the Austin + Group website (http://austingroupbugs.net/view.php?id=634). + + POSIX standardizes 'fflush()' as follows: if there is no argument, + or if the argument is the null string ('""'), then 'awk' flushes + the buffers for _all_ open output files and pipes. + + NOTE: Prior to version 4.0.2, 'gawk' would flush only the + standard output if there was no argument, and flush all output + files and pipes if the argument was the null string. This was + changed in order to be compatible with Brian Kernighan's + 'awk', in the hope that standardizing this feature in POSIX + would then be easier (which indeed proved to be the case). + + With 'gawk', you can use 'fflush("/dev/stdout")' if you wish + to flush only the standard output. + + 'fflush()' returns zero if the buffer is successfully flushed; + otherwise, it returns a nonzero value. ('gawk' returns -1.) In + the case where all buffers are flushed, the return value is zero + only if all buffers were flushed successfully. Otherwise, it is + -1, and 'gawk' warns about the problem FILENAME. + + 'gawk' also issues a warning message if you attempt to flush a file + or pipe that was opened for reading (such as with 'getline'), or if + FILENAME is not an open file, pipe, or coprocess. In such a case, + 'fflush()' returns -1, as well. + + Interactive Versus Noninteractive Buffering + + As a side point, buffering issues can be even more confusing if + your program is "interactive" (i.e., communicating with a user + sitting at a keyboard).(1) + + Interactive programs generally "line buffer" their output (i.e., + they write out every line). Noninteractive programs wait until + they have a full buffer, which may be many lines of output. Here + is an example of the difference: + + $ awk '{ print $1 + $2 }' + 1 1 + -| 2 + 2 3 + -| 5 + Ctrl-d + + Each line of output is printed immediately. Compare that behavior + with this example: + + $ awk '{ print $1 + $2 }' | cat + 1 1 + 2 3 + Ctrl-d + -| 2 + -| 5 + + Here, no output is printed until after the 'Ctrl-d' is typed, + because it is all buffered and sent down the pipe to 'cat' in one + shot. + +'system(COMMAND)' + Execute the operating system command COMMAND and then return to the + 'awk' program. Return COMMAND's exit status (see further on). + + For example, if the following fragment of code is put in your 'awk' + program: + + END { + system("date | mail -s 'awk run done' root") + } + + the system administrator is sent mail when the 'awk' program + finishes processing input and begins its end-of-input processing. + + Note that redirecting 'print' or 'printf' into a pipe is often + enough to accomplish your task. 
If you need to run many commands, + it is more efficient to simply print them down a pipeline to the + shell: + + while (MORE STUFF TO DO) + print COMMAND | "/bin/sh" + close("/bin/sh") + + However, if your 'awk' program is interactive, 'system()' is useful + for running large self-contained programs, such as a shell or an + editor. Some operating systems cannot implement the 'system()' + function. 'system()' causes a fatal error if it is not supported. + + NOTE: When '--sandbox' is specified, the 'system()' function + is disabled (*note Options::). + + On POSIX systems, a command's exit status is a 16-bit number. The + exit value passed to the C 'exit()' function is held in the + high-order eight bits. The low-order bits indicate if the process + was killed by a signal (bit 7) and if so, the guilty signal number + (bits 0-6). + + Traditionally, 'awk''s 'system()' function has simply returned the + exit status value divided by 256. In the normal case this gives + the exit status but in the case of death-by-signal it yields a + fractional floating-point value.(2) POSIX states that 'awk''s + 'system()' should return the full 16-bit value. + + 'gawk' steers a middle ground. The return values are summarized in + *note Table 9.5: table-system-return-values. + + Situation Return value from 'system()' + -------------------------------------------------------------------------- + '--traditional' C 'system()''s value divided by 256 + '--posix' C 'system()''s value + Normal exit of command Command's exit status + Death by signal of command 256 + number of murderous signal + Death by signal of command 512 + number of murderous signal + with core dump + Some kind of error -1 + + Table 9.5: Return values from 'system()' + + Controlling Output Buffering with 'system()' + + The 'fflush()' function provides explicit control over output +buffering for individual files and pipes. However, its use is not +portable to many older 'awk' implementations. An alternative method to +flush output buffers is to call 'system()' with a null string as its +argument: + + system("") # flush output + +'gawk' treats this use of the 'system()' function as a special case and +is smart enough not to run a shell (or other command interpreter) with +the empty command. Therefore, with 'gawk', this idiom is not only +useful, it is also efficient. Although this method should work with +other 'awk' implementations, it does not necessarily avoid starting an +unnecessary shell. (Other implementations may only flush the buffer +associated with the standard output and not necessarily all buffered +output.) + + If you think about what a programmer expects, it makes sense that +'system()' should flush any pending output. The following program: + + BEGIN { + print "first print" + system("echo system echo") + print "second print" + } + +must print: + + first print + system echo + second print + +and not: + + system echo + first print + second print + + If 'awk' did not flush its buffers before calling 'system()', you +would see the latter (undesirable) output. + + ---------- Footnotes ---------- + + (1) A program is interactive if the standard output is connected to a +terminal device. On modern systems, this means your keyboard and +screen. + + (2) In private correspondence, Dr. Kernighan has indicated to me that +the way this was done was probably a mistake. 
+ + +File: gawk.info, Node: Time Functions, Next: Bitwise Functions, Prev: I/O Functions, Up: Built-in + +9.1.5 Time Functions +-------------------- + +'awk' programs are commonly used to process log files containing +timestamp information, indicating when a particular log record was +written. Many programs log their timestamps in the form returned by the +'time()' system call, which is the number of seconds since a particular +epoch. On POSIX-compliant systems, it is the number of seconds since +1970-01-01 00:00:00 UTC, not counting leap seconds.(1) All known +POSIX-compliant systems support timestamps from 0 through 2^31 - 1, +which is sufficient to represent times through 2038-01-19 03:14:07 UTC. +Many systems support a wider range of timestamps, including negative +timestamps that represent times before the epoch. + + In order to make it easier to process such log files and to produce +useful reports, 'gawk' provides the following functions for working with +timestamps. They are 'gawk' extensions; they are not specified in the +POSIX standard.(2) However, recent versions of 'mawk' (*note Other +Versions::) also support these functions. Optional parameters are +enclosed in square brackets ([ ]): + +'mktime(DATESPEC)' + Turn DATESPEC into a timestamp in the same form as is returned by + 'systime()'. It is similar to the function of the same name in ISO + C. The argument, DATESPEC, is a string of the form + '"YYYY MM DD HH MM SS [DST]"'. The string consists of six or seven + numbers representing, respectively, the full year including + century, the month from 1 to 12, the day of the month from 1 to 31, + the hour of the day from 0 to 23, the minute from 0 to 59, the + second from 0 to 60,(3) and an optional daylight-savings flag. + + The values of these numbers need not be within the ranges + specified; for example, an hour of -1 means 1 hour before midnight. + The origin-zero Gregorian calendar is assumed, with year 0 + preceding year 1 and year -1 preceding year 0. The time is assumed + to be in the local time zone. If the daylight-savings flag is + positive, the time is assumed to be daylight savings time; if zero, + the time is assumed to be standard time; and if negative (the + default), 'mktime()' attempts to determine whether daylight savings + time is in effect for the specified time. + + If DATESPEC does not contain enough elements or if the resulting + time is out of range, 'mktime()' returns -1. + +'strftime('[FORMAT [',' TIMESTAMP [',' UTC-FLAG] ] ]')' + Format the time specified by TIMESTAMP based on the contents of the + FORMAT string and return the result. It is similar to the function + of the same name in ISO C. If UTC-FLAG is present and is either + nonzero or non-null, the value is formatted as UTC (Coordinated + Universal Time, formerly GMT or Greenwich Mean Time). Otherwise, + the value is formatted for the local time zone. The TIMESTAMP is + in the same format as the value returned by the 'systime()' + function. If no TIMESTAMP argument is supplied, 'gawk' uses the + current time of day as the timestamp. Without a FORMAT argument, + 'strftime()' uses the value of 'PROCINFO["strftime"]' as the format + string (*note Built-in Variables::). The default string value is + '"%a %b %e %H:%M:%S %Z %Y"'. This format string produces output + that is equivalent to that of the 'date' utility. You can assign a + new value to 'PROCINFO["strftime"]' to change the default format; + see the following list for the various format directives. 
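+
+     For example, the following small sketch converts a fixed date into
+     a timestamp with 'mktime()' and formats it back with 'strftime()'
+     (the format directives used here are described in the list that
+     follows):
+
+          BEGIN {
+              ts = mktime("2015 12 31 23 59 59")
+              print strftime("%Y-%m-%d %H:%M:%S", ts)
+          }
+
+     Because both functions use the local time zone, this prints
+     '2015-12-31 23:59:59' (provided the timestamp is within the range
+     your system supports).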
+ +'systime()' + Return the current time as the number of seconds since the system + epoch. On POSIX systems, this is the number of seconds since + 1970-01-01 00:00:00 UTC, not counting leap seconds. It may be a + different number on other systems. + + The 'systime()' function allows you to compare a timestamp from a log +file with the current time of day. In particular, it is easy to +determine how long ago a particular record was logged. It also allows +you to produce log records using the "seconds since the epoch" format. + + The 'mktime()' function allows you to convert a textual +representation of a date and time into a timestamp. This makes it easy +to do before/after comparisons of dates and times, particularly when +dealing with date and time data coming from an external source, such as +a log file. + + The 'strftime()' function allows you to easily turn a timestamp into +human-readable information. It is similar in nature to the 'sprintf()' +function (*note String Functions::), in that it copies nonformat +specification characters verbatim to the returned string, while +substituting date and time values for format specifications in the +FORMAT string. + + 'strftime()' is guaranteed by the 1999 ISO C standard(4) to support +the following date format specifications: + +'%a' + The locale's abbreviated weekday name. + +'%A' + The locale's full weekday name. + +'%b' + The locale's abbreviated month name. + +'%B' + The locale's full month name. + +'%c' + The locale's "appropriate" date and time representation. (This is + '%A %B %d %T %Y' in the '"C"' locale.) + +'%C' + The century part of the current year. This is the year divided by + 100 and truncated to the next lower integer. + +'%d' + The day of the month as a decimal number (01-31). + +'%D' + Equivalent to specifying '%m/%d/%y'. + +'%e' + The day of the month, padded with a space if it is only one digit. + +'%F' + Equivalent to specifying '%Y-%m-%d'. This is the ISO 8601 date + format. + +'%g' + The year modulo 100 of the ISO 8601 week number, as a decimal + number (00-99). For example, January 1, 2012, is in week 53 of + 2011. Thus, the year of its ISO 8601 week number is 2011, even + though its year is 2012. Similarly, December 31, 2012, is in week + 1 of 2013. Thus, the year of its ISO week number is 2013, even + though its year is 2012. + +'%G' + The full year of the ISO week number, as a decimal number. + +'%h' + Equivalent to '%b'. + +'%H' + The hour (24-hour clock) as a decimal number (00-23). + +'%I' + The hour (12-hour clock) as a decimal number (01-12). + +'%j' + The day of the year as a decimal number (001-366). + +'%m' + The month as a decimal number (01-12). + +'%M' + The minute as a decimal number (00-59). + +'%n' + A newline character (ASCII LF). + +'%p' + The locale's equivalent of the AM/PM designations associated with a + 12-hour clock. + +'%r' + The locale's 12-hour clock time. (This is '%I:%M:%S %p' in the + '"C"' locale.) + +'%R' + Equivalent to specifying '%H:%M'. + +'%S' + The second as a decimal number (00-60). + +'%t' + A TAB character. + +'%T' + Equivalent to specifying '%H:%M:%S'. + +'%u' + The weekday as a decimal number (1-7). Monday is day one. + +'%U' + The week number of the year (with the first Sunday as the first day + of week one) as a decimal number (00-53). + +'%V' + The week number of the year (with the first Monday as the first day + of week one) as a decimal number (01-53). The method for + determining the week number is as specified by ISO 8601. 
(To wit: + if the week containing January 1 has four or more days in the new + year, then it is week one; otherwise it is the last week [52 or 53] + of the previous year and the next week is week one.) + +'%w' + The weekday as a decimal number (0-6). Sunday is day zero. + +'%W' + The week number of the year (with the first Monday as the first day + of week one) as a decimal number (00-53). + +'%x' + The locale's "appropriate" date representation. (This is '%A %B %d + %Y' in the '"C"' locale.) + +'%X' + The locale's "appropriate" time representation. (This is '%T' in + the '"C"' locale.) + +'%y' + The year modulo 100 as a decimal number (00-99). + +'%Y' + The full year as a decimal number (e.g., 2015). + +'%z' + The time zone offset in a '+HHMM' format (e.g., the format + necessary to produce RFC 822/RFC 1036 date headers). + +'%Z' + The time zone name or abbreviation; no characters if no time zone + is determinable. + +'%Ec %EC %Ex %EX %Ey %EY %Od %Oe %OH' +'%OI %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy' + "Alternative representations" for the specifications that use only + the second letter ('%c', '%C', and so on).(5) (These facilitate + compliance with the POSIX 'date' utility.) + +'%%' + A literal '%'. + + If a conversion specifier is not one of those just listed, the +behavior is undefined.(6) + + For systems that are not yet fully standards-compliant, 'gawk' +supplies a copy of 'strftime()' from the GNU C Library. It supports all +of the just-listed format specifications. If that version is used to +compile 'gawk' (*note Installation::), then the following additional +format specifications are available: + +'%k' + The hour (24-hour clock) as a decimal number (0-23). Single-digit + numbers are padded with a space. + +'%l' + The hour (12-hour clock) as a decimal number (1-12). Single-digit + numbers are padded with a space. + +'%s' + The time as a decimal timestamp in seconds since the epoch. + + Additionally, the alternative representations are recognized but +their normal representations are used. + + The following example is an 'awk' implementation of the POSIX 'date' +utility. Normally, the 'date' utility prints the current date and time +of day in a well-known format. However, if you provide an argument to +it that begins with a '+', 'date' copies nonformat specifier characters +to the standard output and interprets the current time according to the +format specifiers in the string. For example: + + $ date '+Today is %A, %B %d, %Y.' + -| Today is Monday, September 22, 2014. + + Here is the 'gawk' version of the 'date' utility. It has a shell +"wrapper" to handle the '-u' option, which requires that 'date' run as +if the time zone is set to UTC: + + #! /bin/sh + # + # date --- approximate the POSIX 'date' command + + case $1 in + -u) TZ=UTC0 # use UTC + export TZ + shift ;; + esac + + gawk 'BEGIN { + format = PROCINFO["strftime"] + exitval = 0 + + if (ARGC > 2) + exitval = 1 + else if (ARGC == 2) { + format = ARGV[1] + if (format ~ /^\+/) + format = substr(format, 2) # remove leading + + } + print strftime(format) + exit exitval + }' "$@" + + ---------- Footnotes ---------- + + (1) *Note Glossary::, especially the entries "Epoch" and "UTC." + + (2) The GNU 'date' utility can also do many of the things described +here. Its use may be preferable for simple time-related operations in +shell scripts. + + (3) Occasionally there are minutes in a year with a leap second, +which is why the seconds can go up to 60. 
+ + (4) Unfortunately, not every system's 'strftime()' necessarily +supports all of the conversions listed here. + + (5) If you don't understand any of this, don't worry about it; these +facilities are meant to make it easier to "internationalize" programs. +Other internationalization features are described in *note +Internationalization::. + + (6) This is because ISO C leaves the behavior of the C version of +'strftime()' undefined and 'gawk' uses the system's version of +'strftime()' if it's there. Typically, the conversion specifier either +does not appear in the returned string or appears literally. + + +File: gawk.info, Node: Bitwise Functions, Next: Type Functions, Prev: Time Functions, Up: Built-in + +9.1.6 Bit-Manipulation Functions +-------------------------------- + + I can explain it for you, but I can't understand it for you. + -- _Anonymous_ + + Many languages provide the ability to perform "bitwise" operations on +two integer numbers. In other words, the operation is performed on each +successive pair of bits in the operands. Three common operations are +bitwise AND, OR, and XOR. The operations are described in *note Table +9.6: table-bitwise-ops. + + Bit operator + | AND | OR | XOR + |--+--+--+--+--+-- + Operands | 0 | 1 | 0 | 1 | 0 | 1 + -------+--+--+--+--+--+-- + 0 | 0 0 | 0 1 | 0 1 + 1 | 0 1 | 1 1 | 1 0 + +Table 9.6: Bitwise operations + + As you can see, the result of an AND operation is 1 only when _both_ +bits are 1. The result of an OR operation is 1 if _either_ bit is 1. +The result of an XOR operation is 1 if either bit is 1, but not both. +The next operation is the "complement"; the complement of 1 is 0 and the +complement of 0 is 1. Thus, this operation "flips" all the bits of a +given value. + + Finally, two other common operations are to shift the bits left or +right. For example, if you have a bit string '10111001' and you shift +it right by three bits, you end up with '00010111'.(1) If you start +over again with '10111001' and shift it left by three bits, you end up +with '11001000'. The following list describes 'gawk''s built-in +functions that implement the bitwise operations. Optional parameters +are enclosed in square brackets ([ ]): + +'and(V1, V2 [, ...])' + Return the bitwise AND of the arguments. There must be at least + two. + +'compl(VAL)' + Return the bitwise complement of VAL. + +'lshift(VAL, COUNT)' + Return the value of VAL, shifted left by COUNT bits. + +'or(V1, V2 [, ...])' + Return the bitwise OR of the arguments. There must be at least + two. + +'rshift(VAL, COUNT)' + Return the value of VAL, shifted right by COUNT bits. + +'xor(V1, V2 [, ...])' + Return the bitwise XOR of the arguments. There must be at least + two. + + CAUTION: Beginning with 'gawk' 4.1 4.2, negative operands are not + allowed for any of these functions. A negative operand produces a + fatal error. See the sidebar "Beware The Smoke and Mirrors!" for + more information as to why. + + Here is a user-defined function (*note User-defined::) that +illustrates the use of these functions: + + # bits2str --- turn a byte into readable ones and zeros + + function bits2str(bits, data, mask) + { + if (bits == 0) + return "0" + + mask = 1 + for (; bits != 0; bits = rshift(bits, 1)) + data = (and(bits, mask) ? 
"1" : "0") data + + while ((length(data) % 8) != 0) + data = "0" data + + return data + } + + BEGIN { + printf "123 = %s\n", bits2str(123) + printf "0123 = %s\n", bits2str(0123) + printf "0x99 = %s\n", bits2str(0x99) + comp = compl(0x99) + printf "compl(0x99) = %#x = %s\n", comp, bits2str(comp) + shift = lshift(0x99, 2) + printf "lshift(0x99, 2) = %#x = %s\n", shift, bits2str(shift) + shift = rshift(0x99, 2) + printf "rshift(0x99, 2) = %#x = %s\n", shift, bits2str(shift) + } + +This program produces the following output when run: + + $ gawk -f testbits.awk + -| 123 = 01111011 + -| 0123 = 01010011 + -| 0x99 = 10011001 + -| compl(0x99) = 0x3fffffffffff66 = 00111111111111111111111111111111111111111111111101100110 + -| lshift(0x99, 2) = 0x264 = 0000001001100100 + -| rshift(0x99, 2) = 0x26 = 00100110 + + The 'bits2str()' function turns a binary number into a string. +Initializing 'mask' to one creates a binary value where the rightmost +bit is set to one. Using this mask, the function repeatedly checks the +rightmost bit. ANDing the mask with the value indicates whether the +rightmost bit is one or not. If so, a '"1"' is concatenated onto the +front of the string. Otherwise, a '"0"' is added. The value is then +shifted right by one bit and the loop continues until there are no more +one bits. + + If the initial value is zero, it returns a simple '"0"'. Otherwise, +at the end, it pads the value with zeros to represent multiples of 8-bit +quantities. This is typical in modern computers. + + The main code in the 'BEGIN' rule shows the difference between the +decimal and octal values for the same numbers (*note +Nondecimal-numbers::), and then demonstrates the results of the +'compl()', 'lshift()', and 'rshift()' functions. + + Beware The Smoke and Mirrors! + + It other languages, bitwise operations are performed on integer +values, not floating-point values. As a general statement, such +operations work best when performed on unsigned integers. + + 'gawk' attempts to treat the arguments to the bitwise functions as +unsigned integers. For this reason, negative arguments produce a fatal +error. + + In normal operation, for all of these functions, first the +double-precision floating-point value is converted to the widest C +unsigned integer type, then the bitwise operation is performed. If the +result cannot be represented exactly as a C 'double', leading nonzero +bits are removed one by one until it can be represented exactly. The +result is then converted back into a C 'double'.(2) + + However, when using arbitrary precision arithmetic with the '-M' +option (*note Arbitrary Precision Arithmetic::), the results may differ. +This is particularly noticable with the 'compl()' function: + + $ gawk 'BEGIN { print compl(42) }' + -| 9007199254740949 + $ gawk -M 'BEGIN { print compl(42) }' + -| -43 + + What's going on becomes clear when printing the results in +hexadecimal: + + $ gawk 'BEGIN { printf "%#x\n", compl(42) }' + -| 0x1fffffffffffd5 + $ gawk -M 'BEGIN { printf "%#x\n", compl(42) }' + -| 0xffffffffffffffd5 + + When using the '-M' option, under the hood, 'gawk' uses GNU MP +arbitrary precision integers which have at least 64 bits of precision. +When not using '-M', 'gawk' stores integral values in regular +double-precision floating point, which only maintain 53 bits of +precision. Furthermore, the GNU MP library treats (or least seems to +treat) the leading bit as a sign bit; thus the result with '-M' in this +case is a negative number. 
+ + In short, using 'gawk' for any but the simplest kind of bitwise +operations is probably a bad idea; caveat emptor! + + ---------- Footnotes ---------- + + (1) This example shows that zeros come in on the left side. For +'gawk', this is always true, but in some languages, it's possible to +have the left side fill with ones. + + (2) If you don't understand this paragraph, the upshot is that 'gawk' +can only store a particular range of integer values; numbers outside +that range are reduced to fit within the range. + + +File: gawk.info, Node: Type Functions, Next: I18N Functions, Prev: Bitwise Functions, Up: Built-in + +9.1.7 Getting Type Information +------------------------------ + +'gawk' provides two functions that lets you distinguish the type of a +variable. This is necessary for writing code that traverses every +element of an array of arrays (*note Arrays of Arrays::), and in other +contexts. + +'isarray(X)' + Return a true value if X is an array. Otherwise, return false. + +'typeof(X)' + Return one of the following strings, depending upon the type of X: + + '"array"' + X is an array. + + '"number"' + X is a number. + + '"string"' + X is a string. + + '"strnum"' + X is a string that might be a number, such as a field or the + result of calling 'split()'. (I.e., X has the STRNUM + attribute; *note Variable Typing::.) + + '"unassigned"' + X is a scalar variable that has not been assigned a value yet. + For example: + + BEGIN { + a[1] # creates a[1] but it has no assigned value + print typeof(a[1]) # scalar_u + } + + '"untyped"' + X has not yet been used yet at all; it can become a scalar or + an array. For example: + + BEGIN { + print typeof(x) # x never used --> untyped + mk_arr(x) + print typeof(x) # x now an array --> array + } + + function mk_arr(a) { a[1] = 1 } + + 'isarray()' is meant for use in two circumstances. The first is when +traversing a multidimensional array: you can test if an element is +itself an array or not. The second is inside the body of a user-defined +function (not discussed yet; *note User-defined::), to test if a +parameter is an array or not. + + NOTE: Using 'isarray()' at the global level to test variables makes + no sense. Because you are the one writing the program, you are + supposed to know if your variables are arrays or not. And in fact, + due to the way 'gawk' works, if you pass the name of a variable + that has not been previously used to 'isarray()', 'gawk' ends up + turning it into a scalar. + + The 'typeof()' function is general; it allows you to determine if a +variable or function parameter is a scalar, an array. + + 'isarray()' is deprecated; you should use 'typeof()' instead. You +should replace any existing uses of 'isarray(var)' in your code with +'typeof(var) == "array"'. + + +File: gawk.info, Node: I18N Functions, Prev: Type Functions, Up: Built-in + +9.1.8 String-Translation Functions +---------------------------------- + +'gawk' provides facilities for internationalizing 'awk' programs. These +include the functions described in the following list. The descriptions +here are purposely brief. *Note Internationalization::, for the full +story. Optional parameters are enclosed in square brackets ([ ]): + +'bindtextdomain(DIRECTORY' [',' DOMAIN]')' + Set the directory in which 'gawk' will look for message translation + files, in case they will not or cannot be placed in the "standard" + locations (e.g., during testing). It returns the directory in + which DOMAIN is "bound." + + The default DOMAIN is the value of 'TEXTDOMAIN'. 
If DIRECTORY is + the null string ('""'), then 'bindtextdomain()' returns the current + binding for the given DOMAIN. + +'dcgettext(STRING' [',' DOMAIN [',' CATEGORY] ]')' + Return the translation of STRING in text domain DOMAIN for locale + category CATEGORY. The default value for DOMAIN is the current + value of 'TEXTDOMAIN'. The default value for CATEGORY is + '"LC_MESSAGES"'. + +'dcngettext(STRING1, STRING2, NUMBER' [',' DOMAIN [',' CATEGORY] ]')' + Return the plural form used for NUMBER of the translation of + STRING1 and STRING2 in text domain DOMAIN for locale category + CATEGORY. STRING1 is the English singular variant of a message, + and STRING2 is the English plural variant of the same message. The + default value for DOMAIN is the current value of 'TEXTDOMAIN'. The + default value for CATEGORY is '"LC_MESSAGES"'. + + +File: gawk.info, Node: User-defined, Next: Indirect Calls, Prev: Built-in, Up: Functions + +9.2 User-Defined Functions +========================== + +Complicated 'awk' programs can often be simplified by defining your own +functions. User-defined functions can be called just like built-in ones +(*note Function Calls::), but it is up to you to define them (i.e., to +tell 'awk' what they should do). + +* Menu: + +* Definition Syntax:: How to write definitions and what they mean. +* Function Example:: An example function definition and what it + does. +* Function Caveats:: Things to watch out for. +* Return Statement:: Specifying the value a function returns. +* Dynamic Typing:: How variable types can change at runtime. + + +File: gawk.info, Node: Definition Syntax, Next: Function Example, Up: User-defined + +9.2.1 Function Definition Syntax +-------------------------------- + + It's entirely fair to say that the awk syntax for local variable + definitions is appallingly awful. + -- _Brian Kernighan_ + + Definitions of functions can appear anywhere between the rules of an +'awk' program. Thus, the general form of an 'awk' program is extended +to include sequences of rules _and_ user-defined function definitions. +There is no need to put the definition of a function before all uses of +the function. This is because 'awk' reads the entire program before +starting to execute any of it. + + The definition of a function named NAME looks like this: + + 'function' NAME'('[PARAMETER-LIST]')' + '{' + BODY-OF-FUNCTION + '}' + +Here, NAME is the name of the function to define. A valid function name +is like a valid variable name: a sequence of letters, digits, and +underscores that doesn't start with a digit. Here too, only the 52 +upper- and lowercase English letters may be used in a function name. +Within a single 'awk' program, any particular name can only be used as a +variable, array, or function. + + PARAMETER-LIST is an optional list of the function's arguments and +local variable names, separated by commas. When the function is called, +the argument names are used to hold the argument values given in the +call. + + A function cannot have two parameters with the same name, nor may it +have a parameter with the same name as the function itself. + + CAUTION: According to the POSIX standard, function parameters + cannot have the same name as one of the special predefined + variables (*note Built-in Variables::), nor may a function + parameter have the same name as another function. + + Not all versions of 'awk' enforce these restrictions. 'gawk' + always enforces the first restriction. With '--posix' (*note + Options::), it also enforces the second restriction. 
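+
+   For example, because 'gawk' always enforces the first restriction,
+it rejects a definition such as the following (the function name
+'broken' is purely illustrative):
+
+     function broken(NF)    # invalid: NF is a predefined variable
+     {
+         return NF + 1
+     }
+
+Renaming the parameter (say, to 'nfields') makes the definition
+acceptable.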
+ + Local variables act like the empty string if referenced where a +string value is required, and like zero if referenced where a numeric +value is required. This is the same as the behavior of regular +variables that have never been assigned a value. (There is more to +understand about local variables; *note Dynamic Typing::.) + + The BODY-OF-FUNCTION consists of 'awk' statements. It is the most +important part of the definition, because it says what the function +should actually _do_. The argument names exist to give the body a way +to talk about the arguments; local variables exist to give the body +places to keep temporary values. + + Argument names are not distinguished syntactically from local +variable names. Instead, the number of arguments supplied when the +function is called determines how many argument variables there are. +Thus, if three argument values are given, the first three names in +PARAMETER-LIST are arguments and the rest are local variables. + + It follows that if the number of arguments is not the same in all +calls to the function, some of the names in PARAMETER-LIST may be +arguments on some occasions and local variables on others. Another way +to think of this is that omitted arguments default to the null string. + + Usually when you write a function, you know how many names you intend +to use for arguments and how many you intend to use as local variables. +It is conventional to place some extra space between the arguments and +the local variables, in order to document how your function is supposed +to be used. + + During execution of the function body, the arguments and local +variable values hide, or "shadow", any variables of the same names used +in the rest of the program. The shadowed variables are not accessible +in the function definition, because there is no way to name them while +their names have been taken away for the arguments and local variables. +All other variables used in the 'awk' program can be referenced or set +normally in the function's body. + + The arguments and local variables last only as long as the function +body is executing. Once the body finishes, you can once again access +the variables that were shadowed while the function was running. + + The function body can contain expressions that call functions. They +can even call this function, either directly or by way of another +function. When this happens, we say the function is "recursive". The +act of a function calling itself is called "recursion". + + All the built-in functions return a value to their caller. +User-defined functions can do so also, using the 'return' statement, +which is described in detail in *note Return Statement::. Many of the +subsequent examples in this minor node use the 'return' statement. + + In many 'awk' implementations, including 'gawk', the keyword +'function' may be abbreviated 'func'. (c.e.) However, POSIX only +specifies the use of the keyword 'function'. This actually has some +practical implications. If 'gawk' is in POSIX-compatibility mode (*note +Options::), then the following statement does _not_ define a function: + + func foo() { a = sqrt($1) ; print a } + +Instead, it defines a rule that, for each record, concatenates the value +of the variable 'func' with the return value of the function 'foo'. If +the resulting string is non-null, the action is executed. This is +probably not what is desired. 
('awk' accepts this input as +syntactically valid, because functions may be used before they are +defined in 'awk' programs.(1)) + + To ensure that your 'awk' programs are portable, always use the +keyword 'function' when defining a function. + + ---------- Footnotes ---------- + + (1) This program won't actually run, because 'foo()' is undefined. + + +File: gawk.info, Node: Function Example, Next: Function Caveats, Prev: Definition Syntax, Up: User-defined + +9.2.2 Function Definition Examples +---------------------------------- + +Here is an example of a user-defined function, called 'myprint()', that +takes a number and prints it in a specific format: + + function myprint(num) + { + printf "%6.3g\n", num + } + +To illustrate, here is an 'awk' rule that uses our 'myprint()' function: + + $3 > 0 { myprint($3) } + +This program prints, in our special format, all the third fields that +contain a positive number in our input. Therefore, when given the +following input: + + 1.2 3.4 5.6 7.8 + 9.10 11.12 -13.14 15.16 + 17.18 19.20 21.22 23.24 + +this program, using our function to format the results, prints: + + 5.6 + 21.2 + + This function deletes all the elements in an array (recall that the +extra whitespace signifies the start of the local variable list): + + function delarray(a, i) + { + for (i in a) + delete a[i] + } + + When working with arrays, it is often necessary to delete all the +elements in an array and start over with a new list of elements (*note +Delete::). Instead of having to repeat this loop everywhere that you +need to clear out an array, your program can just call 'delarray()'. +(This guarantees portability. The use of 'delete ARRAY' to delete the +contents of an entire array is a relatively recent(1) addition to the +POSIX standard.) + + The following is an example of a recursive function. It takes a +string as an input parameter and returns the string in reverse order. +Recursive functions must always have a test that stops the recursion. +In this case, the recursion terminates when the input string is already +empty: + + function rev(str) + { + if (str == "") + return "" + + return (rev(substr(str, 2)) substr(str, 1, 1)) + } + + If this function is in a file named 'rev.awk', it can be tested this +way: + + $ echo "Don't Panic!" | + > gawk -e '{ print rev($0) }' -f rev.awk + -| !cinaP t'noD + + The C 'ctime()' function takes a timestamp and returns it as a +string, formatted in a well-known fashion. The following example uses +the built-in 'strftime()' function (*note Time Functions::) to create an +'awk' version of 'ctime()': + + # ctime.awk + # + # awk version of C ctime(3) function + + function ctime(ts, format) + { + format = "%a %b %e %H:%M:%S %Z %Y" + + if (ts == 0) + ts = systime() # use current time as default + return strftime(format, ts) + } + + You might think that 'ctime()' could use 'PROCINFO["strftime"]' for +its format string. That would be a mistake, because 'ctime()' is +supposed to return the time formatted in a standard fashion, and +user-level code could have changed 'PROCINFO["strftime"]'. + + ---------- Footnotes ---------- + + (1) Late in 2012. + + +File: gawk.info, Node: Function Caveats, Next: Return Statement, Prev: Function Example, Up: User-defined + +9.2.3 Calling User-Defined Functions +------------------------------------ + +"Calling a function" means causing the function to run and do its job. +A function call is an expression and its value is the value returned by +the function. + +* Menu: + +* Calling A Function:: Don't use spaces. 
+* Variable Scope:: Controlling variable scope. +* Pass By Value/Reference:: Passing parameters. + + +File: gawk.info, Node: Calling A Function, Next: Variable Scope, Up: Function Caveats + +9.2.3.1 Writing a Function Call +............................... + +A function call consists of the function name followed by the arguments +in parentheses. 'awk' expressions are what you write in the call for +the arguments. Each time the call is executed, these expressions are +evaluated, and the values become the actual arguments. For example, +here is a call to 'foo()' with three arguments (the first being a string +concatenation): + + foo(x y, "lose", 4 * z) + + CAUTION: Whitespace characters (spaces and TABs) are not allowed + between the function name and the opening parenthesis of the + argument list. If you write whitespace by mistake, 'awk' might + think that you mean to concatenate a variable with an expression in + parentheses. However, it notices that you used a function name and + not a variable name, and reports an error. + + +File: gawk.info, Node: Variable Scope, Next: Pass By Value/Reference, Prev: Calling A Function, Up: Function Caveats + +9.2.3.2 Controlling Variable Scope +.................................. + +Unlike in many languages, there is no way to make a variable local to a +'{' ... '}' block in 'awk', but you can make a variable local to a +function. It is good practice to do so whenever a variable is needed +only in that function. + + To make a variable local to a function, simply declare the variable +as an argument after the actual function arguments (*note Definition +Syntax::). Look at the following example, where variable 'i' is a +global variable used by both functions 'foo()' and 'bar()': + + function bar() + { + for (i = 0; i < 3; i++) + print "bar's i=" i + } + + function foo(j) + { + i = j + 1 + print "foo's i=" i + bar() + print "foo's i=" i + } + + BEGIN { + i = 10 + print "top's i=" i + foo(0) + print "top's i=" i + } + + Running this script produces the following, because the 'i' in +functions 'foo()' and 'bar()' and at the top level refer to the same +variable instance: + + top's i=10 + foo's i=1 + bar's i=0 + bar's i=1 + bar's i=2 + foo's i=3 + top's i=3 + + If you want 'i' to be local to both 'foo()' and 'bar()', do as +follows (the extra space before 'i' is a coding convention to indicate +that 'i' is a local variable, not an argument): + + function bar( i) + { + for (i = 0; i < 3; i++) + print "bar's i=" i + } + + function foo(j, i) + { + i = j + 1 + print "foo's i=" i + bar() + print "foo's i=" i + } + + BEGIN { + i = 10 + print "top's i=" i + foo(0) + print "top's i=" i + } + + Running the corrected script produces the following: + + top's i=10 + foo's i=1 + bar's i=0 + bar's i=1 + bar's i=2 + foo's i=1 + top's i=10 + + Besides scalar values (strings and numbers), you may also have local +arrays. By using a parameter name as an array, 'awk' treats it as an +array, and it is local to the function. In addition, recursive calls +create new arrays. Consider this example: + + function some_func(p1, a) + { + if (p1++ > 3) + return + + a[p1] = p1 + + some_func(p1) + + printf("At level %d, index %d %s found in a\n", + p1, (p1 - 1), (p1 - 1) in a ? "is" : "is not") + printf("At level %d, index %d %s found in a\n", + p1, p1, p1 in a ? 
"is" : "is not") + print "" + } + + BEGIN { + some_func(1) + } + + When run, this program produces the following output: + + At level 4, index 3 is not found in a + At level 4, index 4 is found in a + + At level 3, index 2 is not found in a + At level 3, index 3 is found in a + + At level 2, index 1 is not found in a + At level 2, index 2 is found in a + + +File: gawk.info, Node: Pass By Value/Reference, Prev: Variable Scope, Up: Function Caveats + +9.2.3.3 Passing Function Arguments by Value Or by Reference +........................................................... + +In 'awk', when you declare a function, there is no way to declare +explicitly whether the arguments are passed "by value" or "by +reference". + + Instead, the passing convention is determined at runtime when the +function is called, according to the following rule: if the argument is +an array variable, then it is passed by reference. Otherwise, the +argument is passed by value. + + Passing an argument by value means that when a function is called, it +is given a _copy_ of the value of this argument. The caller may use a +variable as the expression for the argument, but the called function +does not know this--it only knows what value the argument had. For +example, if you write the following code: + + foo = "bar" + z = myfunc(foo) + +then you should not think of the argument to 'myfunc()' as being "the +variable 'foo'." Instead, think of the argument as the string value +'"bar"'. If the function 'myfunc()' alters the values of its local +variables, this has no effect on any other variables. Thus, if +'myfunc()' does this: + + function myfunc(str) + { + print str + str = "zzz" + print str + } + +to change its first argument variable 'str', it does _not_ change the +value of 'foo' in the caller. The role of 'foo' in calling 'myfunc()' +ended when its value ('"bar"') was computed. If 'str' also exists +outside of 'myfunc()', the function body cannot alter this outer value, +because it is shadowed during the execution of 'myfunc()' and cannot be +seen or changed from there. + + However, when arrays are the parameters to functions, they are _not_ +copied. Instead, the array itself is made available for direct +manipulation by the function. This is usually termed "call by +reference". Changes made to an array parameter inside the body of a +function _are_ visible outside that function. + + NOTE: Changing an array parameter inside a function can be very + dangerous if you do not watch what you are doing. For example: + + function changeit(array, ind, nvalue) + { + array[ind] = nvalue + } + + BEGIN { + a[1] = 1; a[2] = 2; a[3] = 3 + changeit(a, 2, "two") + printf "a[1] = %s, a[2] = %s, a[3] = %s\n", + a[1], a[2], a[3] + } + + prints 'a[1] = 1, a[2] = two, a[3] = 3', because 'changeit()' + stores '"two"' in the second element of 'a'. + + Some 'awk' implementations allow you to call a function that has not +been defined. They only report a problem at runtime, when the program +actually tries to call the function. For example: + + BEGIN { + if (0) + foo() + else + bar() + } + function bar() { ... } + # note that `foo' is not defined + +Because the 'if' statement will never be true, it is not really a +problem that 'foo()' has not been defined. Usually, though, it is a +problem if a program calls an undefined function. + + If '--lint' is specified (*note Options::), 'gawk' reports calls to +undefined functions. 
+ + Some 'awk' implementations generate a runtime error if you use either +the 'next' statement or the 'nextfile' statement (*note Next +Statement::, and *note Nextfile Statement::) inside a user-defined +function. 'gawk' does not have this limitation. + + +File: gawk.info, Node: Return Statement, Next: Dynamic Typing, Prev: Function Caveats, Up: User-defined + +9.2.4 The 'return' Statement +---------------------------- + +As seen in several earlier examples, the body of a user-defined function +can contain a 'return' statement. This statement returns control to the +calling part of the 'awk' program. It can also be used to return a +value for use in the rest of the 'awk' program. It looks like this: + + 'return' [EXPRESSION] + + The EXPRESSION part is optional. Due most likely to an oversight, +POSIX does not define what the return value is if you omit the +EXPRESSION. Technically speaking, this makes the returned value +undefined, and therefore, unpredictable. In practice, though, all +versions of 'awk' simply return the null string, which acts like zero if +used in a numeric context. + + A 'return' statement without an EXPRESSION is assumed at the end of +every function definition. So, if control reaches the end of the +function body, then technically the function returns an unpredictable +value. In practice, it returns the empty string. 'awk' does _not_ warn +you if you use the return value of such a function. + + Sometimes, you want to write a function for what it does, not for +what it returns. Such a function corresponds to a 'void' function in C, +C++, or Java, or to a 'procedure' in Ada. Thus, it may be appropriate +to not return any value; simply bear in mind that you should not be +using the return value of such a function. + + The following is an example of a user-defined function that returns a +value for the largest number among the elements of an array: + + function maxelt(vec, i, ret) + { + for (i in vec) { + if (ret == "" || vec[i] > ret) + ret = vec[i] + } + return ret + } + +You call 'maxelt()' with one argument, which is an array name. The +local variables 'i' and 'ret' are not intended to be arguments; there is +nothing to stop you from passing more than one argument to 'maxelt()' +but the results would be strange. The extra space before 'i' in the +function parameter list indicates that 'i' and 'ret' are local +variables. You should follow this convention when defining functions. + + The following program uses the 'maxelt()' function. It loads an +array, calls 'maxelt()', and then reports the maximum number in that +array: + + function maxelt(vec, i, ret) + { + for (i in vec) { + if (ret == "" || vec[i] > ret) + ret = vec[i] + } + return ret + } + + # Load all fields of each record into nums. + { + for(i = 1; i <= NF; i++) + nums[NR, i] = $i + } + + END { + print maxelt(nums) + } + + Given the following input: + + 1 5 23 8 16 + 44 3 5 2 8 26 + 256 291 1396 2962 100 + -6 467 998 1101 + 99385 11 0 225 + +the program reports (predictably) that 99,385 is the largest value in +the array. + + +File: gawk.info, Node: Dynamic Typing, Prev: Return Statement, Up: User-defined + +9.2.5 Functions and Their Effects on Variable Typing +---------------------------------------------------- + +'awk' is a very fluid language. It is possible that 'awk' can't tell if +an identifier represents a scalar variable or an array until runtime. 
+Here is an annotated sample program: + + function foo(a) + { + a[1] = 1 # parameter is an array + } + + BEGIN { + b = 1 + foo(b) # invalid: fatal type mismatch + + foo(x) # x uninitialized, becomes an array dynamically + x = 1 # now not allowed, runtime error + } + + In this example, the first call to 'foo()' generates a fatal error, +so 'awk' will not report the second error. If you comment out that +call, though, then 'awk' does report the second error. + + Usually, such things aren't a big issue, but it's worth being aware +of them. + + +File: gawk.info, Node: Indirect Calls, Next: Functions Summary, Prev: User-defined, Up: Functions + +9.3 Indirect Function Calls +=========================== + +This section describes an advanced, 'gawk'-specific extension. + + Often, you may wish to defer the choice of function to call until +runtime. For example, you may have different kinds of records, each of +which should be processed differently. + + Normally, you would have to use a series of 'if'-'else' statements to +decide which function to call. By using "indirect" function calls, you +can specify the name of the function to call as a string variable, and +then call the function. Let's look at an example. + + Suppose you have a file with your test scores for the classes you are +taking, and you wish to get the sum and the average of your test scores. +The first field is the class name. The following fields are the +functions to call to process the data, up to a "marker" field 'data:'. +Following the marker, to the end of the record, are the various numeric +test scores. + + Here is the initial file: + + Biology_101 sum average data: 87.0 92.4 78.5 94.9 + Chemistry_305 sum average data: 75.2 98.3 94.7 88.2 + English_401 sum average data: 100.0 95.6 87.1 93.4 + + To process the data, you might write initially: + + { + class = $1 + for (i = 2; $i != "data:"; i++) { + if ($i == "sum") + sum() # processes the whole record + else if ($i == "average") + average() + ... # and so on + } + } + +This style of programming works, but can be awkward. With "indirect" +function calls, you tell 'gawk' to use the _value_ of a variable as the +_name_ of the function to call. + + The syntax is similar to that of a regular function call: an +identifier immediately followed by an opening parenthesis, any +arguments, and then a closing parenthesis, with the addition of a +leading '@' character: + + the_func = "sum" + result = @the_func() # calls the sum() function + + Here is a full program that processes the previously shown data, +using indirect function calls: + + # indirectcall.awk --- Demonstrate indirect function calls + + # average --- return the average of the values in fields $first - $last + + function average(first, last, sum, i) + { + sum = 0; + for (i = first; i <= last; i++) + sum += $i + + return sum / (last - first + 1) + } + + # sum --- return the sum of the values in fields $first - $last + + function sum(first, last, ret, i) + { + ret = 0; + for (i = first; i <= last; i++) + ret += $i + + return ret + } + + These two functions expect to work on fields; thus, the parameters +'first' and 'last' indicate where in the fields to start and end. 
+Otherwise, they perform the expected computations and are not unusual: + + # For each record, print the class name and the requested statistics + { + class_name = $1 + gsub(/_/, " ", class_name) # Replace _ with spaces + + # find start + for (i = 1; i <= NF; i++) { + if ($i == "data:") { + start = i + 1 + break + } + } + + printf("%s:\n", class_name) + for (i = 2; $i != "data:"; i++) { + the_function = $i + printf("\t%s: <%s>\n", $i, @the_function(start, NF) "") + } + print "" + } + + This is the main processing for each record. It prints the class +name (with underscores replaced with spaces). It then finds the start +of the actual data, saving it in 'start'. The last part of the code +loops through each function name (from '$2' up to the marker, 'data:'), +calling the function named by the field. The indirect function call +itself occurs as a parameter in the call to 'printf'. (The 'printf' +format string uses '%s' as the format specifier so that we can use +functions that return strings, as well as numbers. Note that the result +from the indirect call is concatenated with the empty string, in order +to force it to be a string value.) + + Here is the result of running the program: + + $ gawk -f indirectcall.awk class_data1 + -| Biology 101: + -| sum: <352.8> + -| average: <88.2> + -| + -| Chemistry 305: + -| sum: <356.4> + -| average: <89.1> + -| + -| English 401: + -| sum: <376.1> + -| average: <94.025> + + The ability to use indirect function calls is more powerful than you +may think at first. The C and C++ languages provide "function +pointers," which are a mechanism for calling a function chosen at +runtime. One of the most well-known uses of this ability is the C +'qsort()' function, which sorts an array using the famous "quicksort" +algorithm (see the Wikipedia article +(http://en.wikipedia.org/wiki/Quicksort) for more information). To use +this function, you supply a pointer to a comparison function. This +mechanism allows you to sort arbitrary data in an arbitrary fashion. + + We can do something similar using 'gawk', like this: + + # quicksort.awk --- Quicksort algorithm, with user-supplied + # comparison function + + # quicksort --- C.A.R. Hoare's quicksort algorithm. See Wikipedia + # or almost any algorithms or computer science text. + + function quicksort(data, left, right, less_than, i, last) + { + if (left >= right) # do nothing if array contains fewer + return # than two elements + + quicksort_swap(data, left, int((left + right) / 2)) + last = left + for (i = left + 1; i <= right; i++) + if (@less_than(data[i], data[left])) + quicksort_swap(data, ++last, i) + quicksort_swap(data, left, last) + quicksort(data, left, last - 1, less_than) + quicksort(data, last + 1, right, less_than) + } + + # quicksort_swap --- helper function for quicksort, should really be inline + + function quicksort_swap(data, i, j, temp) + { + temp = data[i] + data[i] = data[j] + data[j] = temp + } + + The 'quicksort()' function receives the 'data' array, the starting +and ending indices to sort ('left' and 'right'), and the name of a +function that performs a "less than" comparison. It then implements the +quicksort algorithm. + + To make use of the sorting function, we return to our previous +example. 
The first thing to do is write some comparison functions: + + # num_lt --- do a numeric less than comparison + + function num_lt(left, right) + { + return ((left + 0) < (right + 0)) + } + + # num_ge --- do a numeric greater than or equal to comparison + + function num_ge(left, right) + { + return ((left + 0) >= (right + 0)) + } + + The 'num_ge()' function is needed to perform a descending sort; when +used to perform a "less than" test, it actually does the opposite +(greater than or equal to), which yields data sorted in descending +order. + + Next comes a sorting function. It is parameterized with the starting +and ending field numbers and the comparison function. It builds an +array with the data and calls 'quicksort()' appropriately, and then +formats the results as a single string: + + # do_sort --- sort the data according to `compare' + # and return it as a string + + function do_sort(first, last, compare, data, i, retval) + { + delete data + for (i = 1; first <= last; first++) { + data[i] = $first + i++ + } + + quicksort(data, 1, i-1, compare) + + retval = data[1] + for (i = 2; i in data; i++) + retval = retval " " data[i] + + return retval + } + + Finally, the two sorting functions call 'do_sort()', passing in the +names of the two comparison functions: + + # sort --- sort the data in ascending order and return it as a string + + function sort(first, last) + { + return do_sort(first, last, "num_lt") + } + + # rsort --- sort the data in descending order and return it as a string + + function rsort(first, last) + { + return do_sort(first, last, "num_ge") + } + + Here is an extended version of the data file: + + Biology_101 sum average sort rsort data: 87.0 92.4 78.5 94.9 + Chemistry_305 sum average sort rsort data: 75.2 98.3 94.7 88.2 + English_401 sum average sort rsort data: 100.0 95.6 87.1 93.4 + + Finally, here are the results when the enhanced program is run: + + $ gawk -f quicksort.awk -f indirectcall.awk class_data2 + -| Biology 101: + -| sum: <352.8> + -| average: <88.2> + -| sort: <78.5 87.0 92.4 94.9> + -| rsort: <94.9 92.4 87.0 78.5> + -| + -| Chemistry 305: + -| sum: <356.4> + -| average: <89.1> + -| sort: <75.2 88.2 94.7 98.3> + -| rsort: <98.3 94.7 88.2 75.2> + -| + -| English 401: + -| sum: <376.1> + -| average: <94.025> + -| sort: <87.1 93.4 95.6 100.0> + -| rsort: <100.0 95.6 93.4 87.1> + + Another example where indirect functions calls are useful can be +found in processing arrays. This is described in *note Walking +Arrays::. + + Remember that you must supply a leading '@' in front of an indirect +function call. + + Starting with version 4.1.2 of 'gawk', indirect function calls may +also be used with built-in functions and with extension functions (*note +Dynamic Extensions::). There are some limitations when calling built-in +functions indirectly, as follows. + + * You cannot pass a regular expression constant to a built-in + function through an indirect function call.(1) This applies to the + 'sub()', 'gsub()', 'gensub()', 'match()', 'split()' and + 'patsplit()' functions. + + * If calling 'sub()' or 'gsub()', you may only pass two arguments, + since those functions are unusual in that they update their third + argument. This means that '$0' will be updated. + + 'gawk' does its best to make indirect function calls efficient. For +example, in the following case: + + for (i = 1; i <= n; i++) + @the_func() + +'gawk' looks up the actual function to call only once. 
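+
+   Here is a minimal sketch of an indirect call to a built-in
+function (assuming a version of 'gawk' that is new enough, as just
+described):
+
+     $ gawk 'BEGIN {
+     >     the_func = "toupper"
+     >     print @the_func("hello, world")
+     > }'
+     -| HELLO, WORLD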
+ + ---------- Footnotes ---------- + + (1) This may change in a future version; recheck the documentation +that comes with your version of 'gawk' to see if it has. + + +File: gawk.info, Node: Functions Summary, Prev: Indirect Calls, Up: Functions + +9.4 Summary +=========== + + * 'awk' provides built-in functions and lets you define your own + functions. + + * POSIX 'awk' provides three kinds of built-in functions: numeric, + string, and I/O. 'gawk' provides functions that sort arrays, work + with values representing time, do bit manipulation, determine + variable type (array versus scalar), and internationalize and + localize programs. 'gawk' also provides several extensions to some + of standard functions, typically in the form of additional + arguments. + + * Functions accept zero or more arguments and return a value. The + expressions that provide the argument values are completely + evaluated before the function is called. Order of evaluation is + not defined. The return value can be ignored. + + * The handling of backslash in 'sub()' and 'gsub()' is not simple. + It is more straightforward in 'gawk''s 'gensub()' function, but + that function still requires care in its use. + + * User-defined functions provide important capabilities but come with + some syntactic inelegancies. In a function call, there cannot be + any space between the function name and the opening left + parenthesis of the argument list. Also, there is no provision for + local variables, so the convention is to add extra parameters, and + to separate them visually from the real parameters by extra + whitespace. + + * User-defined functions may call other user-defined (and built-in) + functions and may call themselves recursively. Function parameters + "hide" any global variables of the same names. You cannot use the + name of a reserved variable (such as 'ARGC') as the name of a + parameter in user-defined functions. + + * Scalar values are passed to user-defined functions by value. Array + parameters are passed by reference; any changes made by the + function to array parameters are thus visible after the function + has returned. + + * Use the 'return' statement to return from a user-defined function. + An optional expression becomes the function's return value. Only + scalar values may be returned by a function. + + * If a variable that has never been used is passed to a user-defined + function, how that function treats the variable can set its nature: + either scalar or array. + + * 'gawk' provides indirect function calls using a special syntax. By + setting a variable to the name of a function, you can determine at + runtime what function will be called at that point in the program. + This is equivalent to function pointers in C and C++. + + +File: gawk.info, Node: Library Functions, Next: Sample Programs, Prev: Functions, Up: Top + +10 A Library of 'awk' Functions +******************************* + +*note User-defined:: describes how to write your own 'awk' functions. +Writing functions is important, because it allows you to encapsulate +algorithms and program tasks in a single place. It simplifies +programming, making program development more manageable and making +programs more readable. + + In their seminal 1976 book, 'Software Tools',(1) Brian Kernighan and +P.J. 
Plauger wrote: + + Good Programming is not learned from generalities, but by seeing + how significant programs can be made clean, easy to read, easy to + maintain and modify, human-engineered, efficient and reliable, by + the application of common sense and good programming practices. + Careful study and imitation of good programs leads to better + writing. + + In fact, they felt this idea was so important that they placed this +statement on the cover of their book. Because we believe strongly that +their statement is correct, this major node and *note Sample Programs::, +provide a good-sized body of code for you to read and, we hope, to learn +from. + + This major node presents a library of useful 'awk' functions. Many +of the sample programs presented later in this Info file use these +functions. The functions are presented here in a progression from +simple to complex. + + *note Extract Program:: presents a program that you can use to +extract the source code for these example library functions and programs +from the Texinfo source for this Info file. (This has already been done +as part of the 'gawk' distribution.) + + If you have written one or more useful, general-purpose 'awk' +functions and would like to contribute them to the 'awk' user community, +see *note How To Contribute::, for more information. + + The programs in this major node and in *note Sample Programs::, +freely use 'gawk'-specific features. Rewriting these programs for +different implementations of 'awk' is pretty straightforward: + + * Diagnostic error messages are sent to '/dev/stderr'. Use '| "cat + 1>&2"' instead of '> "/dev/stderr"' if your system does not have a + '/dev/stderr', or if you cannot use 'gawk'. + + * A number of programs use 'nextfile' (*note Nextfile Statement::) to + skip any remaining input in the input file. + + * Finally, some of the programs choose to ignore upper- and lowercase + distinctions in their input. They do so by assigning one to + 'IGNORECASE'. You can achieve almost the same effect(2) by adding + the following rule to the beginning of the program: + + # ignore case + { $0 = tolower($0) } + + Also, verify that all regexp and string constants used in + comparisons use only lowercase letters. + +* Menu: + +* Library Names:: How to best name private global variables in + library functions. +* General Functions:: Functions that are of general use. +* Data File Management:: Functions for managing command-line data + files. +* Getopt Function:: A function for processing command-line + arguments. +* Passwd Functions:: Functions for getting user information. +* Group Functions:: Functions for getting group information. +* Walking Arrays:: A function to walk arrays of arrays. +* Library Functions Summary:: Summary of library functions. +* Library Exercises:: Exercises. + + ---------- Footnotes ---------- + + (1) Sadly, over 35 years later, many of the lessons taught by this +book have yet to be learned by a vast number of practicing programmers. + + (2) The effects are not identical. Output of the transformed record +will be in all lowercase, while 'IGNORECASE' preserves the original +contents of the input record. + + +File: gawk.info, Node: Library Names, Next: General Functions, Up: Library Functions + +10.1 Naming Library Function Global Variables +============================================= + +Due to the way the 'awk' language evolved, variables are either "global" +(usable by the entire program) or "local" (usable just by a specific +function). 
There is no intermediate state analogous to 'static' +variables in C. + + Library functions often need to have global variables that they can +use to preserve state information between calls to the function--for +example, 'getopt()''s variable '_opti' (*note Getopt Function::). Such +variables are called "private", as the only functions that need to use +them are the ones in the library. + + When writing a library function, you should try to choose names for +your private variables that will not conflict with any variables used by +either another library function or a user's main program. For example, +a name like 'i' or 'j' is not a good choice, because user programs often +use variable names like these for their own purposes. + + The example programs shown in this major node all start the names of +their private variables with an underscore ('_'). Users generally don't +use leading underscores in their variable names, so this convention +immediately decreases the chances that the variable names will be +accidentally shared with the user's program. + + In addition, several of the library functions use a prefix that helps +indicate what function or set of functions use the variables--for +example, '_pw_byname()' in the user database routines (*note Passwd +Functions::). This convention is recommended, as it even further +decreases the chance of inadvertent conflict among variable names. Note +that this convention is used equally well for variable names and for +private function names.(1) + + As a final note on variable naming, if a function makes global +variables available for use by a main program, it is a good convention +to start those variables' names with a capital letter--for example, +'getopt()''s 'Opterr' and 'Optind' variables (*note Getopt Function::). +The leading capital letter indicates that it is global, while the fact +that the variable name is not all capital letters indicates that the +variable is not one of 'awk''s predefined variables, such as 'FS'. + + It is also important that _all_ variables in library functions that +do not need to save state are, in fact, declared local.(2) If this is +not done, the variables could accidentally be used in the user's +program, leading to bugs that are very difficult to track down: + + function lib_func(x, y, l1, l2) + { + ... + # some_var should be local but by oversight is not + USE VARIABLE some_var + ... + } + + A different convention, common in the Tcl community, is to use a +single associative array to hold the values needed by the library +function(s), or "package." This significantly decreases the number of +actual global names in use. For example, the functions described in +*note Passwd Functions:: might have used array elements +'PW_data["inited"]', 'PW_data["total"]', 'PW_data["count"]', and +'PW_data["awklib"]', instead of '_pw_inited', '_pw_awklib', '_pw_total', +and '_pw_count'. + + The conventions presented in this minor node are exactly that: +conventions. You are not required to write your programs this way--we +merely recommend that you do so. + + ---------- Footnotes ---------- + + (1) Although all the library routines could have been rewritten to +use this convention, this was not done, in order to show how our own +'awk' programming style has evolved and to provide some basis for this +discussion. + + (2) 'gawk''s '--dump-variables' command-line option is useful for +verifying this. 
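+
+   As a purely illustrative sketch of these conventions (the library
+name 'mylib' and its variable names are invented for this example), a
+small library file might look like this:
+
+     # mylib.awk --- hypothetical library following the above conventions
+
+     function mylib_init(    i)        # i is a local variable
+     {
+         for (i = 1; i <= 3; i++)
+             _mylib_data[i] = i * i    # _mylib_data is private
+         Mylib_count = 3               # Mylib_count is for user code
+     }
+
+The leading underscore keeps '_mylib_data' from colliding with names
+in the main program, and the initial capital letter in 'Mylib_count'
+marks it as a global value that user code is expected to use.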
+ + +File: gawk.info, Node: General Functions, Next: Data File Management, Prev: Library Names, Up: Library Functions + +10.2 General Programming +======================== + +This minor node presents a number of functions that are of general +programming use. + +* Menu: + +* Strtonum Function:: A replacement for the built-in + 'strtonum()' function. +* Assert Function:: A function for assertions in 'awk' + programs. +* Round Function:: A function for rounding if 'sprintf()' + does not do it correctly. +* Cliff Random Function:: The Cliff Random Number Generator. +* Ordinal Functions:: Functions for using characters as numbers and + vice versa. +* Join Function:: A function to join an array into a string. +* Getlocaltime Function:: A function to get formatted times. +* Readfile Function:: A function to read an entire file at once. +* Shell Quoting:: A function to quote strings for the shell. + + +File: gawk.info, Node: Strtonum Function, Next: Assert Function, Up: General Functions + +10.2.1 Converting Strings to Numbers +------------------------------------ + +The 'strtonum()' function (*note String Functions::) is a 'gawk' +extension. The following function provides an implementation for other +versions of 'awk': + + # mystrtonum --- convert string to number + + function mystrtonum(str, ret, n, i, k, c) + { + if (str ~ /^0[0-7]*$/) { + # octal + n = length(str) + ret = 0 + for (i = 1; i <= n; i++) { + c = substr(str, i, 1) + # index() returns 0 if c not in string, + # includes c == "0" + k = index("1234567", c) + + ret = ret * 8 + k + } + } else if (str ~ /^0[xX][[:xdigit:]]+$/) { + # hexadecimal + str = substr(str, 3) # lop off leading 0x + n = length(str) + ret = 0 + for (i = 1; i <= n; i++) { + c = substr(str, i, 1) + c = tolower(c) + # index() returns 0 if c not in string, + # includes c == "0" + k = index("123456789abcdef", c) + + ret = ret * 16 + k + } + } else if (str ~ \ + /^[-+]?([0-9]+([.][0-9]*([Ee][0-9]+)?)?|([.][0-9]+([Ee][-+]?[0-9]+)?))$/) { + # decimal number, possibly floating point + ret = str + 0 + } else + ret = "NOT-A-NUMBER" + + return ret + } + + # BEGIN { # gawk test harness + # a[1] = "25" + # a[2] = ".31" + # a[3] = "0123" + # a[4] = "0xdeadBEEF" + # a[5] = "123.45" + # a[6] = "1.e3" + # a[7] = "1.32" + # a[8] = "1.32E2" + # + # for (i = 1; i in a; i++) + # print a[i], strtonum(a[i]), mystrtonum(a[i]) + # } + + The function first looks for C-style octal numbers (base 8). If the +input string matches a regular expression describing octal numbers, then +'mystrtonum()' loops through each character in the string. It sets 'k' +to the index in '"1234567"' of the current octal digit. The return +value will either be the same number as the digit, or zero if the +character is not there, which will be true for a '0'. This is safe, +because the regexp test in the 'if' ensures that only octal values are +converted. + + Similar logic applies to the code that checks for and converts a +hexadecimal value, which starts with '0x' or '0X'. The use of +'tolower()' simplifies the computation for finding the correct numeric +value for each hexadecimal digit. + + Finally, if the string matches the (rather complicated) regexp for a +regular decimal integer or floating-point number, the computation 'ret = +str + 0' lets 'awk' convert the value to a number. + + A commented-out test program is included, so that the function can be +tested with 'gawk' and the results compared to the built-in 'strtonum()' +function. 
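+
+   For a quick check from the command line, assuming the function has
+been saved in a file named 'mystrtonum.awk' (the file name here is
+only for illustration):
+
+     $ gawk -e 'BEGIN { print mystrtonum("0x1A"), mystrtonum("011") }' \
+     >      -f mystrtonum.awk
+     -| 26 9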
+ + +File: gawk.info, Node: Assert Function, Next: Round Function, Prev: Strtonum Function, Up: General Functions + +10.2.2 Assertions +----------------- + +When writing large programs, it is often useful to know that a condition +or set of conditions is true. Before proceeding with a particular +computation, you make a statement about what you believe to be the case. +Such a statement is known as an "assertion". The C language provides an +'<assert.h>' header file and corresponding 'assert()' macro that a +programmer can use to make assertions. If an assertion fails, the +'assert()' macro arranges to print a diagnostic message describing the +condition that should have been true but was not, and then it kills the +program. In C, using 'assert()' looks this: + + #include <assert.h> + + int myfunc(int a, double b) + { + assert(a <= 5 && b >= 17.1); + ... + } + + If the assertion fails, the program prints a message similar to this: + + prog.c:5: assertion failed: a <= 5 && b >= 17.1 + + The C language makes it possible to turn the condition into a string +for use in printing the diagnostic message. This is not possible in +'awk', so this 'assert()' function also requires a string version of the +condition that is being tested. Following is the function: + + # assert --- assert that a condition is true. Otherwise, exit. + + function assert(condition, string) + { + if (! condition) { + printf("%s:%d: assertion failed: %s\n", + FILENAME, FNR, string) > "/dev/stderr" + _assert_exit = 1 + exit 1 + } + } + + END { + if (_assert_exit) + exit 1 + } + + The 'assert()' function tests the 'condition' parameter. If it is +false, it prints a message to standard error, using the 'string' +parameter to describe the failed condition. It then sets the variable +'_assert_exit' to one and executes the 'exit' statement. The 'exit' +statement jumps to the 'END' rule. If the 'END' rule finds +'_assert_exit' to be true, it exits immediately. + + The purpose of the test in the 'END' rule is to keep any other 'END' +rules from running. When an assertion fails, the program should exit +immediately. If no assertions fail, then '_assert_exit' is still false +when the 'END' rule is run normally, and the rest of the program's 'END' +rules execute. For all of this to work correctly, 'assert.awk' must be +the first source file read by 'awk'. The function can be used in a +program in the following way: + + function myfunc(a, b) + { + assert(a <= 5 && b >= 17.1, "a <= 5 && b >= 17.1") + ... + } + +If the assertion fails, you see a message similar to the following: + + mydata:1357: assertion failed: a <= 5 && b >= 17.1 + + There is a small problem with this version of 'assert()'. An 'END' +rule is automatically added to the program calling 'assert()'. +Normally, if a program consists of just a 'BEGIN' rule, the input files +and/or standard input are not read. However, now that the program has +an 'END' rule, 'awk' attempts to read the input data files or standard +input (*note Using BEGIN/END::), most likely causing the program to hang +as it waits for input. + + There is a simple workaround to this: make sure that such a 'BEGIN' +rule always ends with an 'exit' statement. + + +File: gawk.info, Node: Round Function, Next: Cliff Random Function, Prev: Assert Function, Up: General Functions + +10.2.3 Rounding Numbers +----------------------- + +The way 'printf' and 'sprintf()' (*note Printf::) perform rounding often +depends upon the system's C 'sprintf()' subroutine. 
On many machines, +'sprintf()' rounding is "unbiased", which means it doesn't always round +a trailing .5 up, contrary to naive expectations. In unbiased rounding, +.5 rounds to even, rather than always up, so 1.5 rounds to 2 but 4.5 +rounds to 4. This means that if you are using a format that does +rounding (e.g., '"%.0f"'), you should check what your system does. The +following function does traditional rounding; it might be useful if your +'awk''s 'printf' does unbiased rounding: + + # round.awk --- do normal rounding + + function round(x, ival, aval, fraction) + { + ival = int(x) # integer part, int() truncates + + # see if fractional part + if (ival == x) # no fraction + return ival # ensure no decimals + + if (x < 0) { + aval = -x # absolute value + ival = int(aval) + fraction = aval - ival + if (fraction >= .5) + return int(x) - 1 # -2.5 --> -3 + else + return int(x) # -2.3 --> -2 + } else { + fraction = x - ival + if (fraction >= .5) + return ival + 1 + else + return ival + } + } + + # test harness + # { print $0, round($0) } + + +File: gawk.info, Node: Cliff Random Function, Next: Ordinal Functions, Prev: Round Function, Up: General Functions + +10.2.4 The Cliff Random Number Generator +---------------------------------------- + +The Cliff random number generator +(http://mathworld.wolfram.com/CliffRandomNumberGenerator.html) is a very +simple random number generator that "passes the noise sphere test for +randomness by showing no structure." It is easily programmed, in less +than 10 lines of 'awk' code: + + # cliff_rand.awk --- generate Cliff random numbers + + BEGIN { _cliff_seed = 0.1 } + + function cliff_rand() + { + _cliff_seed = (100 * log(_cliff_seed)) % 1 + if (_cliff_seed < 0) + _cliff_seed = - _cliff_seed + return _cliff_seed + } + + This algorithm requires an initial "seed" of 0.1. Each new value +uses the current seed as input for the calculation. If the built-in +'rand()' function (*note Numeric Functions::) isn't random enough, you +might try using this function instead. + + +File: gawk.info, Node: Ordinal Functions, Next: Join Function, Prev: Cliff Random Function, Up: General Functions + +10.2.5 Translating Between Characters and Numbers +------------------------------------------------- + +One commercial implementation of 'awk' supplies a built-in function, +'ord()', which takes a character and returns the numeric value for that +character in the machine's character set. If the string passed to +'ord()' has more than one character, only the first one is used. + + The inverse of this function is 'chr()' (from the function of the +same name in Pascal), which takes a number and returns the corresponding +character. Both functions are written very nicely in 'awk'; there is no +real reason to build them into the 'awk' interpreter: + + # ord.awk --- do ord and chr + + # Global identifiers: + # _ord_: numerical values indexed by characters + # _ord_init: function to initialize _ord_ + + BEGIN { _ord_init() } + + function _ord_init( low, high, i, t) + { + low = sprintf("%c", 7) # BEL is ascii 7 + if (low == "\a") { # regular ascii + low = 0 + high = 127 + } else if (sprintf("%c", 128 + 7) == "\a") { + # ascii, mark parity + low = 128 + high = 255 + } else { # ebcdic(!) + low = 0 + high = 255 + } + + for (i = low; i <= high; i++) { + t = sprintf("%c", i) + _ord_[t] = i + } + } + + Some explanation of the numbers used by '_ord_init()' is worthwhile. 
+The most prominent character set in use today is ASCII.(1) Although an +8-bit byte can hold 256 distinct values (from 0 to 255), ASCII only +defines characters that use the values from 0 to 127.(2) In the now +distant past, at least one minicomputer manufacturer used ASCII, but +with mark parity, meaning that the leftmost bit in the byte is always 1. +This means that on those systems, characters have numeric values from +128 to 255. Finally, large mainframe systems use the EBCDIC character +set, which uses all 256 values. There are other character sets in use +on some older systems, but they are not really worth worrying about: + + function ord(str, c) + { + # only first character is of interest + c = substr(str, 1, 1) + return _ord_[c] + } + + function chr(c) + { + # force c to be numeric by adding 0 + return sprintf("%c", c + 0) + } + + #### test code #### + # BEGIN { + # for (;;) { + # printf("enter a character: ") + # if (getline var <= 0) + # break + # printf("ord(%s) = %d\n", var, ord(var)) + # } + # } + + An obvious improvement to these functions is to move the code for the +'_ord_init' function into the body of the 'BEGIN' rule. It was written +this way initially for ease of development. There is a "test program" +in a 'BEGIN' rule, to test the function. It is commented out for +production use. + + ---------- Footnotes ---------- + + (1) This is changing; many systems use Unicode, a very large +character set that includes ASCII as a subset. On systems with full +Unicode support, a character can occupy up to 32 bits, making simple +tests such as used here prohibitively expensive. + + (2) ASCII has been extended in many countries to use the values from +128 to 255 for country-specific characters. If your system uses these +extensions, you can simplify '_ord_init()' to loop from 0 to 255. + + +File: gawk.info, Node: Join Function, Next: Getlocaltime Function, Prev: Ordinal Functions, Up: General Functions + +10.2.6 Merging an Array into a String +------------------------------------- + +When doing string processing, it is often useful to be able to join all +the strings in an array into one long string. The following function, +'join()', accomplishes this task. It is used later in several of the +application programs (*note Sample Programs::). + + Good function design is important; this function needs to be general, +but it should also have a reasonable default behavior. It is called +with an array as well as the beginning and ending indices of the +elements in the array to be merged. This assumes that the array indices +are numeric--a reasonable assumption, as the array was likely created +with 'split()' (*note String Functions::): + + # join.awk --- join an array into a string + + function join(array, start, end, sep, result, i) + { + if (sep == "") + sep = " " + else if (sep == SUBSEP) # magic value + sep = "" + result = array[start] + for (i = start + 1; i <= end; i++) + result = result sep array[i] + return result + } + + An optional additional argument is the separator to use when joining +the strings back together. If the caller supplies a nonempty value, +'join()' uses it; if it is not supplied, it has a null value. In this +case, 'join()' uses a single space as a default separator for the +strings. If the value is equal to 'SUBSEP', then 'join()' joins the +strings with no separator between them. 
'SUBSEP' serves as a "magic" +value to indicate that there should be no separation between the +component strings.(1) + + ---------- Footnotes ---------- + + (1) It would be nice if 'awk' had an assignment operator for +concatenation. The lack of an explicit operator for concatenation makes +string operations more difficult than they really need to be. + + +File: gawk.info, Node: Getlocaltime Function, Next: Readfile Function, Prev: Join Function, Up: General Functions + +10.2.7 Managing the Time of Day +------------------------------- + +The 'systime()' and 'strftime()' functions described in *note Time +Functions:: provide the minimum functionality necessary for dealing with +the time of day in human-readable form. Although 'strftime()' is +extensive, the control formats are not necessarily easy to remember or +intuitively obvious when reading a program. + + The following function, 'getlocaltime()', populates a user-supplied +array with preformatted time information. It returns a string with the +current time formatted in the same way as the 'date' utility: + + # getlocaltime.awk --- get the time of day in a usable format + + # Returns a string in the format of output of date(1) + # Populates the array argument time with individual values: + # time["second"] -- seconds (0 - 59) + # time["minute"] -- minutes (0 - 59) + # time["hour"] -- hours (0 - 23) + # time["althour"] -- hours (0 - 12) + # time["monthday"] -- day of month (1 - 31) + # time["month"] -- month of year (1 - 12) + # time["monthname"] -- name of the month + # time["shortmonth"] -- short name of the month + # time["year"] -- year modulo 100 (0 - 99) + # time["fullyear"] -- full year + # time["weekday"] -- day of week (Sunday = 0) + # time["altweekday"] -- day of week (Monday = 0) + # time["dayname"] -- name of weekday + # time["shortdayname"] -- short name of weekday + # time["yearday"] -- day of year (0 - 365) + # time["timezone"] -- abbreviation of timezone name + # time["ampm"] -- AM or PM designation + # time["weeknum"] -- week number, Sunday first day + # time["altweeknum"] -- week number, Monday first day + + function getlocaltime(time, ret, now, i) + { + # get time once, avoids unnecessary system calls + now = systime() + + # return date(1)-style output + ret = strftime("%a %b %e %H:%M:%S %Z %Y", now) + + # clear out target array + delete time + + # fill in values, force numeric values to be + # numeric by adding 0 + time["second"] = strftime("%S", now) + 0 + time["minute"] = strftime("%M", now) + 0 + time["hour"] = strftime("%H", now) + 0 + time["althour"] = strftime("%I", now) + 0 + time["monthday"] = strftime("%d", now) + 0 + time["month"] = strftime("%m", now) + 0 + time["monthname"] = strftime("%B", now) + time["shortmonth"] = strftime("%b", now) + time["year"] = strftime("%y", now) + 0 + time["fullyear"] = strftime("%Y", now) + 0 + time["weekday"] = strftime("%w", now) + 0 + time["altweekday"] = strftime("%u", now) + 0 + time["dayname"] = strftime("%A", now) + time["shortdayname"] = strftime("%a", now) + time["yearday"] = strftime("%j", now) + 0 + time["timezone"] = strftime("%Z", now) + time["ampm"] = strftime("%p", now) + time["weeknum"] = strftime("%U", now) + 0 + time["altweeknum"] = strftime("%W", now) + 0 + + return ret + } + + The string indices are easier to use and read than the various +formats required by 'strftime()'. The 'alarm' program presented in +*note Alarm Program:: uses this function. 
A more general design for the +'getlocaltime()' function would have allowed the user to supply an +optional timestamp value to use instead of the current time. + + +File: gawk.info, Node: Readfile Function, Next: Shell Quoting, Prev: Getlocaltime Function, Up: General Functions + +10.2.8 Reading a Whole File at Once +----------------------------------- + +Often, it is convenient to have the entire contents of a file available +in memory as a single string. A straightforward but naive way to do +that might be as follows: + + function readfile(file, tmp, contents) + { + if ((getline tmp < file) < 0) + return + + contents = tmp + while (getline tmp < file) > 0) + contents = contents RT tmp + + close(file) + return contents + } + + This function reads from 'file' one record at a time, building up the +full contents of the file in the local variable 'contents'. It works, +but is not necessarily efficient. + + The following function, based on a suggestion by Denis Shirokov, +reads the entire contents of the named file in one shot: + + # readfile.awk --- read an entire file at once + + function readfile(file, tmp, save_rs) + { + save_rs = RS + RS = "^$" + getline tmp < file + close(file) + RS = save_rs + + return tmp + } + + It works by setting 'RS' to '^$', a regular expression that will +never match if the file has contents. 'gawk' reads data from the file +into 'tmp', attempting to match 'RS'. The match fails after each read, +but fails quickly, such that 'gawk' fills 'tmp' with the entire contents +of the file. (*Note Records:: for information on 'RT' and 'RS'.) + + In the case that 'file' is empty, the return value is the null +string. Thus, calling code may use something like: + + contents = readfile("/some/path") + if (length(contents) == 0) + # file was empty ... + + This tests the result to see if it is empty or not. An equivalent +test would be 'contents == ""'. + + *Note Extension Sample Readfile:: for an extension function that also +reads an entire file into memory. + + +File: gawk.info, Node: Shell Quoting, Prev: Readfile Function, Up: General Functions + +10.2.9 Quoting Strings to Pass to the Shell +------------------------------------------- + +Michael Brennan offers the following programming pattern, which he uses +frequently: + + #! /bin/sh + + awkp=' + ... + ' + + INPUT_PROGRAM | awk "$awkp" | /bin/sh + + For example, a program of his named 'flac-edit' has this form: + + $ flac-edit -song="Whoope! That's Great" file.flac + + It generates the following output, which is to be piped to the shell +('/bin/sh'): + + chmod +w file.flac + metaflac --remove-tag=TITLE file.flac + LANG=en_US.88591 metaflac --set-tag=TITLE='Whoope! That'"'"'s Great' file.flac + chmod -w file.flac + + Note the need for shell quoting. The function 'shell_quote()' does +it. 'SINGLE' is the one-character string '"'"' and 'QSINGLE' is the +three-character string '"\"'\""': + + # shell_quote --- quote an argument for passing to the shell + + function shell_quote(s, # parameter + SINGLE, QSINGLE, i, X, n, ret) # locals + { + if (s == "") + return "\"\"" + + SINGLE = "\x27" # single quote + QSINGLE = "\"\x27\"" + n = split(s, X, SINGLE) + + ret = SINGLE X[1] SINGLE + for (i = 2; i <= n; i++) + ret = ret QSINGLE SINGLE X[i] SINGLE + + return ret + } + + +File: gawk.info, Node: Data File Management, Next: Getopt Function, Prev: General Functions, Up: Library Functions + +10.3 Data file Management +========================= + +This minor node presents functions that are useful for managing +command-line data files. 
+ +* Menu: + +* Filetrans Function:: A function for handling data file transitions. +* Rewind Function:: A function for rereading the current file. +* File Checking:: Checking that data files are readable. +* Empty Files:: Checking for zero-length files. +* Ignoring Assigns:: Treating assignments as file names. + + +File: gawk.info, Node: Filetrans Function, Next: Rewind Function, Up: Data File Management + +10.3.1 Noting Data file Boundaries +---------------------------------- + +The 'BEGIN' and 'END' rules are each executed exactly once, at the +beginning and end of your 'awk' program, respectively (*note +BEGIN/END::). We (the 'gawk' authors) once had a user who mistakenly +thought that the 'BEGIN' rules were executed at the beginning of each +data file and the 'END' rules were executed at the end of each data +file. + + When informed that this was not the case, the user requested that we +add new special patterns to 'gawk', named 'BEGIN_FILE' and 'END_FILE', +that would have the desired behavior. He even supplied us the code to +do so. + + Adding these special patterns to 'gawk' wasn't necessary; the job can +be done cleanly in 'awk' itself, as illustrated by the following library +program. It arranges to call two user-supplied functions, 'beginfile()' +and 'endfile()', at the beginning and end of each data file. Besides +solving the problem in only nine(!) lines of code, it does so +_portably_; this works with any implementation of 'awk': + + # transfile.awk + # + # Give the user a hook for filename transitions + # + # The user must supply functions beginfile() and endfile() + # that each take the name of the file being started or + # finished, respectively. + + FILENAME != _oldfilename { + if (_oldfilename != "") + endfile(_oldfilename) + _oldfilename = FILENAME + beginfile(FILENAME) + } + + END { endfile(FILENAME) } + + This file must be loaded before the user's "main" program, so that +the rule it supplies is executed first. + + This rule relies on 'awk''s 'FILENAME' variable, which automatically +changes for each new data file. The current file name is saved in a +private variable, '_oldfilename'. If 'FILENAME' does not equal +'_oldfilename', then a new data file is being processed and it is +necessary to call 'endfile()' for the old file. Because 'endfile()' +should only be called if a file has been processed, the program first +checks to make sure that '_oldfilename' is not the null string. The +program then assigns the current file name to '_oldfilename' and calls +'beginfile()' for the file. Because, like all 'awk' variables, +'_oldfilename' is initialized to the null string, this rule executes +correctly even for the first data file. + + The program also supplies an 'END' rule to do the final processing +for the last file. Because this 'END' rule comes before any 'END' rules +supplied in the "main" program, 'endfile()' is called first. Once +again, the value of multiple 'BEGIN' and 'END' rules should be clear. + + If the same data file occurs twice in a row on the command line, then +'endfile()' and 'beginfile()' are not executed at the end of the first +pass and at the beginning of the second pass. 
The following version +solves the problem: + + # ftrans.awk --- handle datafile transitions + # + # user supplies beginfile() and endfile() functions + + FNR == 1 { + if (_filename_ != "") + endfile(_filename_) + _filename_ = FILENAME + beginfile(FILENAME) + } + + END { endfile(_filename_) } + + *note Wc Program:: shows how this library function can be used and +how it simplifies writing the main program. + + So Why Does 'gawk' Have 'BEGINFILE' and 'ENDFILE'? + + You are probably wondering, if 'beginfile()' and 'endfile()' +functions can do the job, why does 'gawk' have 'BEGINFILE' and 'ENDFILE' +patterns? + + Good question. Normally, if 'awk' cannot open a file, this causes an +immediate fatal error. In this case, there is no way for a user-defined +function to deal with the problem, as the mechanism for calling it +relies on the file being open and at the first record. Thus, the main +reason for 'BEGINFILE' is to give you a "hook" to catch files that +cannot be processed. 'ENDFILE' exists for symmetry, and because it +provides an easy way to do per-file cleanup processing. For more +information, refer to *note BEGINFILE/ENDFILE::. + + +File: gawk.info, Node: Rewind Function, Next: File Checking, Prev: Filetrans Function, Up: Data File Management + +10.3.2 Rereading the Current File +--------------------------------- + +Another request for a new built-in function was for a function that +would make it possible to reread the current file. The requesting user +didn't want to have to use 'getline' (*note Getline::) inside a loop. + + However, as long as you are not in the 'END' rule, it is quite easy +to arrange to immediately close the current input file and then start +over with it from the top. For lack of a better name, we'll call the +function 'rewind()': + + # rewind.awk --- rewind the current file and start over + + function rewind( i) + { + # shift remaining arguments up + for (i = ARGC; i > ARGIND; i--) + ARGV[i] = ARGV[i-1] + + # make sure gawk knows to keep going + ARGC++ + + # make current file next to get done + ARGV[ARGIND+1] = FILENAME + + # do it + nextfile + } + + The 'rewind()' function relies on the 'ARGIND' variable (*note +Auto-set::), which is specific to 'gawk'. It also relies on the +'nextfile' keyword (*note Nextfile Statement::). Because of this, you +should not call it from an 'ENDFILE' rule. (This isn't necessary +anyway, because 'gawk' goes to the next file as soon as an 'ENDFILE' +rule finishes!) + + You need to be careful calling 'rewind()'. You can end up causing +infinite recursion if you don't pay attention. Here is an example use: + + $ cat data + -| a + -| b + -| c + -| d + -| e + + $ cat test.awk + -| FNR == 3 && ! rewound { + -| rewound = 1 + -| rewind() + -| } + -| + -| { print FILENAME, FNR, $0 } + + $ gawk -f rewind.awk -f test.awk data + -| data 1 a + -| data 2 b + -| data 1 a + -| data 2 b + -| data 3 c + -| data 4 d + -| data 5 e + + +File: gawk.info, Node: File Checking, Next: Empty Files, Prev: Rewind Function, Up: Data File Management + +10.3.3 Checking for Readable Data files +--------------------------------------- + +Normally, if you give 'awk' a data file that isn't readable, it stops +with a fatal error. 
There are times when you might want to just ignore +such files and keep going.(1) You can do this by prepending the +following program to your 'awk' program: + + # readable.awk --- library file to skip over unreadable files + + BEGIN { + for (i = 1; i < ARGC; i++) { + if (ARGV[i] ~ /^[a-zA-Z_][a-zA-Z0-9_]*=.*/ \ + || ARGV[i] == "-" || ARGV[i] == "/dev/stdin") + continue # assignment or standard input + else if ((getline junk < ARGV[i]) < 0) # unreadable + delete ARGV[i] + else + close(ARGV[i]) + } + } + + This works, because the 'getline' won't be fatal. Removing the +element from 'ARGV' with 'delete' skips the file (because it's no longer +in the list). See also *note ARGC and ARGV::. + + Because 'awk' variable names only allow the English letters, the +regular expression check purposely does not use character classes such +as '[:alpha:]' and '[:alnum:]' (*note Bracket Expressions::). + + ---------- Footnotes ---------- + + (1) The 'BEGINFILE' special pattern (*note BEGINFILE/ENDFILE::) +provides an alternative mechanism for dealing with files that can't be +opened. However, the code here provides a portable solution. + + +File: gawk.info, Node: Empty Files, Next: Ignoring Assigns, Prev: File Checking, Up: Data File Management + +10.3.4 Checking for Zero-Length Files +------------------------------------- + +All known 'awk' implementations silently skip over zero-length files. +This is a by-product of 'awk''s implicit +read-a-record-and-match-against-the-rules loop: when 'awk' tries to read +a record from an empty file, it immediately receives an end-of-file +indication, closes the file, and proceeds on to the next command-line +data file, _without_ executing any user-level 'awk' program code. + + Using 'gawk''s 'ARGIND' variable (*note Built-in Variables::), it is +possible to detect when an empty data file has been skipped. Similar to +the library file presented in *note Filetrans Function::, the following +library file calls a function named 'zerofile()' that the user must +provide. The arguments passed are the file name and the position in +'ARGV' where it was found: + + # zerofile.awk --- library file to process empty input files + + BEGIN { Argind = 0 } + + ARGIND > Argind + 1 { + for (Argind++; Argind < ARGIND; Argind++) + zerofile(ARGV[Argind], Argind) + } + + ARGIND != Argind { Argind = ARGIND } + + END { + if (ARGIND > Argind) + for (Argind++; Argind <= ARGIND; Argind++) + zerofile(ARGV[Argind], Argind) + } + + The user-level variable 'Argind' allows the 'awk' program to track +its progress through 'ARGV'. Whenever the program detects that 'ARGIND' +is greater than 'Argind + 1', it means that one or more empty files were +skipped. The action then calls 'zerofile()' for each such file, +incrementing 'Argind' along the way. + + The 'Argind != ARGIND' rule simply keeps 'Argind' up to date in the +normal case. + + Finally, the 'END' rule catches the case of any empty files at the +end of the command-line arguments. Note that the test in the condition +of the 'for' loop uses the '<=' operator, not '<'. + + +File: gawk.info, Node: Ignoring Assigns, Prev: Empty Files, Up: Data File Management + +10.3.5 Treating Assignments as File names +----------------------------------------- + +Occasionally, you might not want 'awk' to process command-line variable +assignments (*note Assignment Options::). In particular, if you have a +file name that contains an '=' character, 'awk' treats the file name as +an assignment and does not process it. 
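+
+   For example (the file names here are invented), suppose a directory
+contains an ordinary data file whose name is literally 'x=y':
+
+     $ awk '{ print FILENAME }' data1 x=y data2
+
+Here 'awk' reads 'data1', then performs the assignment 'x = "y"', and
+then reads 'data2'; the file named 'x=y' is never opened.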
+ + Some users have suggested an additional command-line option for +'gawk' to disable command-line assignments. However, some simple +programming with a library file does the trick: + + # noassign.awk --- library file to avoid the need for a + # special option that disables command-line assignments + + function disable_assigns(argc, argv, i) + { + for (i = 1; i < argc; i++) + if (argv[i] ~ /^[a-zA-Z_][a-zA-Z0-9_]*=.*/) + argv[i] = ("./" argv[i]) + } + + BEGIN { + if (No_command_assign) + disable_assigns(ARGC, ARGV) + } + + You then run your program this way: + + awk -v No_command_assign=1 -f noassign.awk -f yourprog.awk * + + The function works by looping through the arguments. It prepends +'./' to any argument that matches the form of a variable assignment, +turning that argument into a file name. + + The use of 'No_command_assign' allows you to disable command-line +assignments at invocation time, by giving the variable a true value. +When not set, it is initially zero (i.e., false), so the command-line +arguments are left alone. + + +File: gawk.info, Node: Getopt Function, Next: Passwd Functions, Prev: Data File Management, Up: Library Functions + +10.4 Processing Command-Line Options +==================================== + +Most utilities on POSIX-compatible systems take options on the command +line that can be used to change the way a program behaves. 'awk' is an +example of such a program (*note Options::). Often, options take +"arguments" (i.e., data that the program needs to correctly obey the +command-line option). For example, 'awk''s '-F' option requires a +string to use as the field separator. The first occurrence on the +command line of either '--' or a string that does not begin with '-' +ends the options. + + Modern Unix systems provide a C function named 'getopt()' for +processing command-line arguments. The programmer provides a string +describing the one-letter options. If an option requires an argument, +it is followed in the string with a colon. 'getopt()' is also passed +the count and values of the command-line arguments and is called in a +loop. 'getopt()' processes the command-line arguments for option +letters. Each time around the loop, it returns a single character +representing the next option letter that it finds, or '?' if it finds an +invalid option. When it returns -1, there are no options left on the +command line. + + When using 'getopt()', options that do not take arguments can be +grouped together. Furthermore, options that take arguments require that +the argument be present. The argument can immediately follow the option +letter, or it can be a separate command-line argument. + + Given a hypothetical program that takes three command-line options, +'-a', '-b', and '-c', where '-b' requires an argument, all of the +following are valid ways of invoking the program: + + prog -a -b foo -c data1 data2 data3 + prog -ac -bfoo -- data1 data2 data3 + prog -acbfoo data1 data2 data3 + + Notice that when the argument is grouped with its option, the rest of +the argument is considered to be the option's argument. In this +example, '-acbfoo' indicates that all of the '-a', '-b', and '-c' +options were supplied, and that 'foo' is the argument to the '-b' +option. + + 'getopt()' provides four external variables that the programmer can +use: + +'optind' + The index in the argument value array ('argv') where the first + nonoption command-line argument can be found. + +'optarg' + The string value of the argument to an option. 
+ +'opterr' + Usually 'getopt()' prints an error message when it finds an invalid + option. Setting 'opterr' to zero disables this feature. (An + application might want to print its own error message.) + +'optopt' + The letter representing the command-line option. + + The following C fragment shows how 'getopt()' might process +command-line arguments for 'awk': + + int + main(int argc, char *argv[]) + { + ... + /* print our own message */ + opterr = 0; + while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) { + switch (c) { + case 'f': /* file */ + ... + break; + case 'F': /* field separator */ + ... + break; + case 'v': /* variable assignment */ + ... + break; + case 'W': /* extension */ + ... + break; + case '?': + default: + usage(); + break; + } + } + ... + } + + As a side point, 'gawk' actually uses the GNU 'getopt_long()' +function to process both normal and GNU-style long options (*note +Options::). + + The abstraction provided by 'getopt()' is very useful and is quite +handy in 'awk' programs as well. Following is an 'awk' version of +'getopt()'. This function highlights one of the greatest weaknesses in +'awk', which is that it is very poor at manipulating single characters. +Repeated calls to 'substr()' are necessary for accessing individual +characters (*note String Functions::).(1) + + The discussion that follows walks through the code a bit at a time: + + # getopt.awk --- Do C library getopt(3) function in awk + + # External variables: + # Optind -- index in ARGV of first nonoption argument + # Optarg -- string value of argument to current option + # Opterr -- if nonzero, print our own diagnostic + # Optopt -- current option letter + + # Returns: + # -1 at end of options + # "?" for unrecognized option + # <c> a character representing the current option + + # Private Data: + # _opti -- index in multiflag option, e.g., -abc + + The function starts out with comments presenting a list of the global +variables it uses, what the return values are, what they mean, and any +global variables that are "private" to this library function. Such +documentation is essential for any program, and particularly for library +functions. + + The 'getopt()' function first checks that it was indeed called with a +string of options (the 'options' parameter). If 'options' has a zero +length, 'getopt()' immediately returns -1: + + function getopt(argc, argv, options, thisopt, i) + { + if (length(options) == 0) # no options given + return -1 + + if (argv[Optind] == "--") { # all done + Optind++ + _opti = 0 + return -1 + } else if (argv[Optind] !~ /^-[^:[:space:]]/) { + _opti = 0 + return -1 + } + + The next thing to check for is the end of the options. A '--' ends +the command-line options, as does any command-line argument that does +not begin with a '-'. 'Optind' is used to step through the array of +command-line arguments; it retains its value across calls to 'getopt()', +because it is a global variable. + + The regular expression that is used, '/^-[^:[:space:]/', checks for a +'-' followed by anything that is not whitespace and not a colon. If the +current command-line argument does not match this pattern, it is not an +option, and it ends option processing. Continuing on: + + if (_opti == 0) + _opti = 2 + thisopt = substr(argv[Optind], _opti, 1) + Optopt = thisopt + i = index(options, thisopt) + if (i == 0) { + if (Opterr) + printf("%c -- invalid option\n", thisopt) > "/dev/stderr" + if (_opti >= length(argv[Optind])) { + Optind++ + _opti = 0 + } else + _opti++ + return "?" 
+ } + + The '_opti' variable tracks the position in the current command-line +argument ('argv[Optind]'). If multiple options are grouped together +with one '-' (e.g., '-abx'), it is necessary to return them to the user +one at a time. + + If '_opti' is equal to zero, it is set to two, which is the index in +the string of the next character to look at (we skip the '-', which is +at position one). The variable 'thisopt' holds the character, obtained +with 'substr()'. It is saved in 'Optopt' for the main program to use. + + If 'thisopt' is not in the 'options' string, then it is an invalid +option. If 'Opterr' is nonzero, 'getopt()' prints an error message on +the standard error that is similar to the message from the C version of +'getopt()'. + + Because the option is invalid, it is necessary to skip it and move on +to the next option character. If '_opti' is greater than or equal to +the length of the current command-line argument, it is necessary to move +on to the next argument, so 'Optind' is incremented and '_opti' is reset +to zero. Otherwise, 'Optind' is left alone and '_opti' is merely +incremented. + + In any case, because the option is invalid, 'getopt()' returns '"?"'. +The main program can examine 'Optopt' if it needs to know what the +invalid option letter actually is. Continuing on: + + if (substr(options, i + 1, 1) == ":") { + # get option argument + if (length(substr(argv[Optind], _opti + 1)) > 0) + Optarg = substr(argv[Optind], _opti + 1) + else + Optarg = argv[++Optind] + _opti = 0 + } else + Optarg = "" + + If the option requires an argument, the option letter is followed by +a colon in the 'options' string. If there are remaining characters in +the current command-line argument ('argv[Optind]'), then the rest of +that string is assigned to 'Optarg'. Otherwise, the next command-line +argument is used ('-xFOO' versus '-x FOO'). In either case, '_opti' is +reset to zero, because there are no more characters left to examine in +the current command-line argument. Continuing: + + if (_opti == 0 || _opti >= length(argv[Optind])) { + Optind++ + _opti = 0 + } else + _opti++ + return thisopt + } + + Finally, if '_opti' is either zero or greater than the length of the +current command-line argument, it means this element in 'argv' is +through being processed, so 'Optind' is incremented to point to the next +element in 'argv'. If neither condition is true, then only '_opti' is +incremented, so that the next option letter can be processed on the next +call to 'getopt()'. + + The 'BEGIN' rule initializes both 'Opterr' and 'Optind' to one. +'Opterr' is set to one, because the default behavior is for 'getopt()' +to print a diagnostic message upon seeing an invalid option. 'Optind' +is set to one, because there's no reason to look at the program name, +which is in 'ARGV[0]': + + BEGIN { + Opterr = 1 # default is to diagnose + Optind = 1 # skip ARGV[0] + + # test program + if (_getopt_test) { + while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1) + printf("c = <%c>, Optarg = <%s>\n", + _go_c, Optarg) + printf("non-option arguments:\n") + for (; Optind < ARGC; Optind++) + printf("\tARGV[%d] = <%s>\n", + Optind, ARGV[Optind]) + } + } + + The rest of the 'BEGIN' rule is a simple test program. 
Here are the +results of two sample runs of the test program: + + $ awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x + -| c = <a>, Optarg = <> + -| c = <c>, Optarg = <> + -| c = <b>, Optarg = <ARG> + -| non-option arguments: + -| ARGV[3] = <bax> + -| ARGV[4] = <-x> + + $ awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc + -| c = <a>, Optarg = <> + error-> x -- invalid option + -| c = <?>, Optarg = <> + -| non-option arguments: + -| ARGV[4] = <xyz> + -| ARGV[5] = <abc> + + In both runs, the first '--' terminates the arguments to 'awk', so +that it does not try to interpret the '-a', etc., as its own options. + + NOTE: After 'getopt()' is through, user-level code must clear out + all the elements of 'ARGV' from 1 to 'Optind', so that 'awk' does + not try to process the command-line options as file names. + + Using '#!' with the '-E' option may help avoid conflicts between your +program's options and 'gawk''s options, as '-E' causes 'gawk' to abandon +processing of further options (*note Executable Scripts:: and *note +Options::). + + Several of the sample programs presented in *note Sample Programs::, +use 'getopt()' to process their arguments. + + ---------- Footnotes ---------- + + (1) This function was written before 'gawk' acquired the ability to +split strings into single characters using '""' as the separator. We +have left it alone, as using 'substr()' is more portable. + + +File: gawk.info, Node: Passwd Functions, Next: Group Functions, Prev: Getopt Function, Up: Library Functions + +10.5 Reading the User Database +============================== + +The 'PROCINFO' array (*note Built-in Variables::) provides access to the +current user's real and effective user and group ID numbers, and, if +available, the user's supplementary group set. However, because these +are numbers, they do not provide very useful information to the average +user. There needs to be some way to find the user information +associated with the user and group ID numbers. This minor node presents +a suite of functions for retrieving information from the user database. +*Note Group Functions:: for a similar suite that retrieves information +from the group database. + + The POSIX standard does not define the file where user information is +kept. Instead, it provides the '<pwd.h>' header file and several C +language subroutines for obtaining user information. The primary +function is 'getpwent()', for "get password entry." The "password" +comes from the original user database file, '/etc/passwd', which stores +user information along with the encrypted passwords (hence the name). + + Although an 'awk' program could simply read '/etc/passwd' directly, +this file may not contain complete information about the system's set of +users.(1) To be sure you are able to produce a readable and complete +version of the user database, it is necessary to write a small C program +that calls 'getpwent()'. 'getpwent()' is defined as returning a pointer +to a 'struct passwd'. Each time it is called, it returns the next entry +in the database. When there are no more entries, it returns 'NULL', the +null pointer. When this happens, the C program should call 'endpwent()' +to close the database. Following is 'pwcat', a C program that "cats" +the password database: + + /* + * pwcat.c + * + * Generate a printable version of the password database. 
+ */ + #include <stdio.h> + #include <pwd.h> + + int + main(int argc, char **argv) + { + struct passwd *p; + + while ((p = getpwent()) != NULL) + printf("%s:%s:%ld:%ld:%s:%s:%s\n", + p->pw_name, p->pw_passwd, (long) p->pw_uid, + (long) p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell); + + endpwent(); + return 0; + } + + If you don't understand C, don't worry about it. The output from +'pwcat' is the user database, in the traditional '/etc/passwd' format of +colon-separated fields. The fields are: + +Login name + The user's login name. + +Encrypted password + The user's encrypted password. This may not be available on some + systems. + +User-ID + The user's numeric user ID number. (On some systems, it's a C + 'long', and not an 'int'. Thus, we cast it to 'long' for all + cases.) + +Group-ID + The user's numeric group ID number. (Similar comments about 'long' + versus 'int' apply here.) + +Full name + The user's full name, and perhaps other information associated with + the user. + +Home directory + The user's login (or "home") directory (familiar to shell + programmers as '$HOME'). + +Login shell + The program that is run when the user logs in. This is usually a + shell, such as Bash. + + A few lines representative of 'pwcat''s output are as follows: + + $ pwcat + -| root:x:0:1:Operator:/:/bin/sh + -| nobody:*:65534:65534::/: + -| daemon:*:1:1::/: + -| sys:*:2:2::/:/bin/csh + -| bin:*:3:3::/bin: + -| arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh + -| miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh + -| andy:abcca2:113:10:Andy Jacobs:/home/andy:/bin/sh + ... + + With that introduction, following is a group of functions for getting +user information. There are several functions here, corresponding to +the C functions of the same names: + + # passwd.awk --- access password file information + + BEGIN { + # tailor this to suit your system + _pw_awklib = "/usr/local/libexec/awk/" + } + + function _pw_init( oldfs, oldrs, olddol0, pwcat, using_fw, using_fpat) + { + if (_pw_inited) + return + + oldfs = FS + oldrs = RS + olddol0 = $0 + using_fw = (PROCINFO["FS"] == "FIELDWIDTHS") + using_fpat = (PROCINFO["FS"] == "FPAT") + FS = ":" + RS = "\n" + + pwcat = _pw_awklib "pwcat" + while ((pwcat | getline) > 0) { + _pw_byname[$1] = $0 + _pw_byuid[$3] = $0 + _pw_bycount[++_pw_total] = $0 + } + close(pwcat) + _pw_count = 0 + _pw_inited = 1 + FS = oldfs + if (using_fw) + FIELDWIDTHS = FIELDWIDTHS + else if (using_fpat) + FPAT = FPAT + RS = oldrs + $0 = olddol0 + } + + The 'BEGIN' rule sets a private variable to the directory where +'pwcat' is stored. Because it is used to help out an 'awk' library +routine, we have chosen to put it in '/usr/local/libexec/awk'; however, +you might want it to be in a different directory on your system. + + The function '_pw_init()' fills three copies of the user information +into three associative arrays. The arrays are indexed by username +('_pw_byname'), by user ID number ('_pw_byuid'), and by order of +occurrence ('_pw_bycount'). The variable '_pw_inited' is used for +efficiency, as '_pw_init()' needs to be called only once. + + Because this function uses 'getline' to read information from +'pwcat', it first saves the values of 'FS', 'RS', and '$0'. It notes in +the variable 'using_fw' whether field splitting with 'FIELDWIDTHS' is in +effect or not. Doing so is necessary, as these functions could be +called from anywhere within a user's program, and the user may have his +or her own way of splitting records and fields. 
This makes it possible +to restore the correct field-splitting mechanism later. The test can +only be true for 'gawk'. It is false if using 'FS' or 'FPAT', or on +some other 'awk' implementation. + + The code that checks for using 'FPAT', using 'using_fpat' and +'PROCINFO["FS"]', is similar. + + The main part of the function uses a loop to read database lines, +split the lines into fields, and then store the lines into each array as +necessary. When the loop is done, '_pw_init()' cleans up by closing the +pipeline, setting '_pw_inited' to one, and restoring 'FS' (and +'FIELDWIDTHS' or 'FPAT' if necessary), 'RS', and '$0'. The use of +'_pw_count' is explained shortly. + + The 'getpwnam()' function takes a username as a string argument. If +that user is in the database, it returns the appropriate line. +Otherwise, it relies on the array reference to a nonexistent element to +create the element with the null string as its value: + + function getpwnam(name) + { + _pw_init() + return _pw_byname[name] + } + + Similarly, the 'getpwuid()' function takes a user ID number argument. +If that user number is in the database, it returns the appropriate line. +Otherwise, it returns the null string: + + function getpwuid(uid) + { + _pw_init() + return _pw_byuid[uid] + } + + The 'getpwent()' function simply steps through the database, one +entry at a time. It uses '_pw_count' to track its current position in +the '_pw_bycount' array: + + function getpwent() + { + _pw_init() + if (_pw_count < _pw_total) + return _pw_bycount[++_pw_count] + return "" + } + + The 'endpwent()' function resets '_pw_count' to zero, so that +subsequent calls to 'getpwent()' start over again: + + function endpwent() + { + _pw_count = 0 + } + + A conscious design decision in this suite is that each subroutine +calls '_pw_init()' to initialize the database arrays. The overhead of +running a separate process to generate the user database, and the I/O to +scan it, are only incurred if the user's main program actually calls one +of these functions. If this library file is loaded along with a user's +program, but none of the routines are ever called, then there is no +extra runtime overhead. (The alternative is move the body of +'_pw_init()' into a 'BEGIN' rule, which always runs 'pwcat'. This +simplifies the code but runs an extra process that may never be needed.) + + In turn, calling '_pw_init()' is not too expensive, because the +'_pw_inited' variable keeps the program from reading the data more than +once. If you are worried about squeezing every last cycle out of your +'awk' program, the check of '_pw_inited' could be moved out of +'_pw_init()' and duplicated in all the other functions. In practice, +this is not necessary, as most 'awk' programs are I/O-bound, and such a +change would clutter up the code. + + The 'id' program in *note Id Program:: uses these functions. + + ---------- Footnotes ---------- + + (1) It is often the case that password information is stored in a +network database. + + +File: gawk.info, Node: Group Functions, Next: Walking Arrays, Prev: Passwd Functions, Up: Library Functions + +10.6 Reading the Group Database +=============================== + +Much of the discussion presented in *note Passwd Functions:: applies to +the group database as well. Although there has traditionally been a +well-known file ('/etc/group') in a well-known format, the POSIX +standard only provides a set of C library routines ('<grp.h>' and +'getgrent()') for accessing the information. 
Even though this file may +exist, it may not have complete information. Therefore, as with the +user database, it is necessary to have a small C program that generates +the group database as its output. 'grcat', a C program that "cats" the +group database, is as follows: + + /* + * grcat.c + * + * Generate a printable version of the group database. + */ + #include <stdio.h> + #include <grp.h> + + int + main(int argc, char **argv) + { + struct group *g; + int i; + + while ((g = getgrent()) != NULL) { + printf("%s:%s:%ld:", g->gr_name, g->gr_passwd, + (long) g->gr_gid); + for (i = 0; g->gr_mem[i] != NULL; i++) { + printf("%s", g->gr_mem[i]); + if (g->gr_mem[i+1] != NULL) + putchar(','); + } + putchar('\n'); + } + endgrent(); + return 0; + } + + Each line in the group database represents one group. The fields are +separated with colons and represent the following information: + +Group Name + The group's name. + +Group Password + The group's encrypted password. In practice, this field is never + used; it is usually empty or set to '*'. + +Group ID Number + The group's numeric group ID number; the association of name to + number must be unique within the file. (On some systems it's a C + 'long', and not an 'int'. Thus, we cast it to 'long' for all + cases.) + +Group Member List + A comma-separated list of usernames. These users are members of + the group. Modern Unix systems allow users to be members of + several groups simultaneously. If your system does, then there are + elements '"group1"' through '"groupN"' in 'PROCINFO' for those + group ID numbers. (Note that 'PROCINFO' is a 'gawk' extension; + *note Built-in Variables::.) + + Here is what running 'grcat' might produce: + + $ grcat + -| wheel:*:0:arnold + -| nogroup:*:65534: + -| daemon:*:1: + -| kmem:*:2: + -| staff:*:10:arnold,miriam,andy + -| other:*:20: + ... + + Here are the functions for obtaining information from the group +database. There are several, modeled after the C library functions of +the same names: + + # group.awk --- functions for dealing with the group file + + BEGIN { + # Change to suit your system + _gr_awklib = "/usr/local/libexec/awk/" + } + + function _gr_init( oldfs, oldrs, olddol0, grcat, + using_fw, using_fpat, n, a, i) + { + if (_gr_inited) + return + + oldfs = FS + oldrs = RS + olddol0 = $0 + using_fw = (PROCINFO["FS"] == "FIELDWIDTHS") + using_fpat = (PROCINFO["FS"] == "FPAT") + FS = ":" + RS = "\n" + + grcat = _gr_awklib "grcat" + while ((grcat | getline) > 0) { + if ($1 in _gr_byname) + _gr_byname[$1] = _gr_byname[$1] "," $4 + else + _gr_byname[$1] = $0 + if ($3 in _gr_bygid) + _gr_bygid[$3] = _gr_bygid[$3] "," $4 + else + _gr_bygid[$3] = $0 + + n = split($4, a, "[ \t]*,[ \t]*") + for (i = 1; i <= n; i++) + if (a[i] in _gr_groupsbyuser) + _gr_groupsbyuser[a[i]] = _gr_groupsbyuser[a[i]] " " $1 + else + _gr_groupsbyuser[a[i]] = $1 + + _gr_bycount[++_gr_count] = $0 + } + close(grcat) + _gr_count = 0 + _gr_inited++ + FS = oldfs + if (using_fw) + FIELDWIDTHS = FIELDWIDTHS + else if (using_fpat) + FPAT = FPAT + RS = oldrs + $0 = olddol0 + } + + The 'BEGIN' rule sets a private variable to the directory where +'grcat' is stored. Because it is used to help out an 'awk' library +routine, we have chosen to put it in '/usr/local/libexec/awk'. You +might want it to be in a different directory on your system. + + These routines follow the same general outline as the user database +routines (*note Passwd Functions::). The '_gr_inited' variable is used +to ensure that the database is scanned no more than once. 
The +'_gr_init()' function first saves 'FS', 'RS', and '$0', and then sets +'FS' and 'RS' to the correct values for scanning the group information. +It also takes care to note whether 'FIELDWIDTHS' or 'FPAT' is being +used, and to restore the appropriate field-splitting mechanism. + + The group information is stored in several associative arrays. The +arrays are indexed by group name ('_gr_byname'), by group ID number +('_gr_bygid'), and by position in the database ('_gr_bycount'). There +is an additional array indexed by username ('_gr_groupsbyuser'), which +is a space-separated list of groups to which each user belongs. + + Unlike in the user database, it is possible to have multiple records +in the database for the same group. This is common when a group has a +large number of members. A pair of such entries might look like the +following: + + tvpeople:*:101:johnny,jay,arsenio + tvpeople:*:101:david,conan,tom,joan + + For this reason, '_gr_init()' looks to see if a group name or group +ID number is already seen. If so, the usernames are simply concatenated +onto the previous list of users.(1) + + Finally, '_gr_init()' closes the pipeline to 'grcat', restores 'FS' +(and 'FIELDWIDTHS' or 'FPAT', if necessary), 'RS', and '$0', initializes +'_gr_count' to zero (it is used later), and makes '_gr_inited' nonzero. + + The 'getgrnam()' function takes a group name as its argument, and if +that group exists, it is returned. Otherwise, it relies on the array +reference to a nonexistent element to create the element with the null +string as its value: + + function getgrnam(group) + { + _gr_init() + return _gr_byname[group] + } + + The 'getgrgid()' function is similar; it takes a numeric group ID and +looks up the information associated with that group ID: + + function getgrgid(gid) + { + _gr_init() + return _gr_bygid[gid] + } + + The 'getgruser()' function does not have a C counterpart. It takes a +username and returns the list of groups that have the user as a member: + + function getgruser(user) + { + _gr_init() + return _gr_groupsbyuser[user] + } + + The 'getgrent()' function steps through the database one entry at a +time. It uses '_gr_count' to track its position in the list: + + function getgrent() + { + _gr_init() + if (++_gr_count in _gr_bycount) + return _gr_bycount[_gr_count] + return "" + } + + The 'endgrent()' function resets '_gr_count' to zero so that +'getgrent()' can start over again: + + function endgrent() + { + _gr_count = 0 + } + + As with the user database routines, each function calls '_gr_init()' +to initialize the arrays. Doing so only incurs the extra overhead of +running 'grcat' if these functions are used (as opposed to moving the +body of '_gr_init()' into a 'BEGIN' rule). + + Most of the work is in scanning the database and building the various +associative arrays. The functions that the user calls are themselves +very simple, relying on 'awk''s associative arrays to do work. + + The 'id' program in *note Id Program:: uses these functions. + + ---------- Footnotes ---------- + + (1) There is a subtle problem with the code just presented. Suppose +that the first time there were no names. This code adds the names with +a leading comma. It also doesn't check that there is a '$4'. + + +File: gawk.info, Node: Walking Arrays, Next: Library Functions Summary, Prev: Group Functions, Up: Library Functions + +10.7 Traversing Arrays of Arrays +================================ + +*note Arrays of Arrays:: described how 'gawk' provides arrays of arrays. 
+In particular, any element of an array may be either a scalar or another +array. The 'isarray()' function (*note Type Functions::) lets you +distinguish an array from a scalar. The following function, +'walk_array()', recursively traverses an array, printing the element +indices and values. You call it with the array and a string +representing the name of the array: + + function walk_array(arr, name, i) + { + for (i in arr) { + if (isarray(arr[i])) + walk_array(arr[i], (name "[" i "]")) + else + printf("%s[%s] = %s\n", name, i, arr[i]) + } + } + +It works by looping over each element of the array. If any given +element is itself an array, the function calls itself recursively, +passing the subarray and a new string representing the current index. +Otherwise, the function simply prints the element's name, index, and +value. Here is a main program to demonstrate: + + BEGIN { + a[1] = 1 + a[2][1] = 21 + a[2][2] = 22 + a[3] = 3 + a[4][1][1] = 411 + a[4][2] = 42 + + walk_array(a, "a") + } + + When run, the program produces the following output: + + $ gawk -f walk_array.awk + -| a[1] = 1 + -| a[2][1] = 21 + -| a[2][2] = 22 + -| a[3] = 3 + -| a[4][1][1] = 411 + -| a[4][2] = 42 + + The function just presented simply prints the name and value of each +scalar array element. However, it is easy to generalize it, by passing +in the name of a function to call when walking an array. The modified +function looks like this: + + function process_array(arr, name, process, do_arrays, i, new_name) + { + for (i in arr) { + new_name = (name "[" i "]") + if (isarray(arr[i])) { + if (do_arrays) + @process(new_name, arr[i]) + process_array(arr[i], new_name, process, do_arrays) + } else + @process(new_name, arr[i]) + } + } + + The arguments are as follows: + +'arr' + The array. + +'name' + The name of the array (a string). + +'process' + The name of the function to call. + +'do_arrays' + If this is true, the function can handle elements that are + subarrays. + + If subarrays are to be processed, that is done before walking them +further. + + When run with the following scaffolding, the function produces the +same results as does the earlier version of 'walk_array()': + + BEGIN { + a[1] = 1 + a[2][1] = 21 + a[2][2] = 22 + a[3] = 3 + a[4][1][1] = 411 + a[4][2] = 42 + + process_array(a, "a", "do_print", 0) + } + + function do_print(name, element) + { + printf "%s = %s\n", name, element + } + + +File: gawk.info, Node: Library Functions Summary, Next: Library Exercises, Prev: Walking Arrays, Up: Library Functions + +10.8 Summary +============ + + * Reading programs is an excellent way to learn Good Programming. + The functions and programs provided in this major node and the next + are intended to serve that purpose. + + * When writing general-purpose library functions, put some thought + into how to name any global variables so that they won't conflict + with variables from a user's program. 
+ + * The functions presented here fit into the following categories: + + General problems + Number-to-string conversion, testing assertions, rounding, + random number generation, converting characters to numbers, + joining strings, getting easily usable time-of-day + information, and reading a whole file in one shot + + Managing data files + Noting data file boundaries, rereading the current file, + checking for readable files, checking for zero-length files, + and treating assignments as file names + + Processing command-line options + An 'awk' version of the standard C 'getopt()' function + + Reading the user and group databases + Two sets of routines that parallel the C library versions + + Traversing arrays of arrays + Two functions that traverse an array of arrays to any depth + + +File: gawk.info, Node: Library Exercises, Prev: Library Functions Summary, Up: Library Functions + +10.9 Exercises +============== + + 1. In *note Empty Files::, we presented the 'zerofile.awk' program, + which made use of 'gawk''s 'ARGIND' variable. Can this problem be + solved without relying on 'ARGIND'? If so, how? + + 2. As a related challenge, revise that code to handle the case where + an intervening value in 'ARGV' is a variable assignment. + + +File: gawk.info, Node: Sample Programs, Next: Advanced Features, Prev: Library Functions, Up: Top + +11 Practical 'awk' Programs +*************************** + +*note Library Functions::, presents the idea that reading programs in a +language contributes to learning that language. This major node +continues that theme, presenting a potpourri of 'awk' programs for your +reading enjoyment. + + Many of these programs use library functions presented in *note +Library Functions::. + +* Menu: + +* Running Examples:: How to run these examples. +* Clones:: Clones of common utilities. +* Miscellaneous Programs:: Some interesting 'awk' programs. +* Programs Summary:: Summary of programs. +* Programs Exercises:: Exercises. + + +File: gawk.info, Node: Running Examples, Next: Clones, Up: Sample Programs + +11.1 Running the Example Programs +================================= + +To run a given program, you would typically do something like this: + + awk -f PROGRAM -- OPTIONS FILES + +Here, PROGRAM is the name of the 'awk' program (such as 'cut.awk'), +OPTIONS are any command-line options for the program that start with a +'-', and FILES are the actual data files. + + If your system supports the '#!' executable interpreter mechanism +(*note Executable Scripts::), you can instead run your program directly: + + cut.awk -c1-8 myfiles > results + + If your 'awk' is not 'gawk', you may instead need to use this: + + cut.awk -- -c1-8 myfiles > results + + +File: gawk.info, Node: Clones, Next: Miscellaneous Programs, Prev: Running Examples, Up: Sample Programs + +11.2 Reinventing Wheels for Fun and Profit +========================================== + +This minor node presents a number of POSIX utilities implemented in +'awk'. Reinventing these programs in 'awk' is often enjoyable, because +the algorithms can be very clearly expressed, and the code is usually +very concise and simple. This is true because 'awk' does so much for +you. + + It should be noted that these programs are not necessarily intended +to replace the installed versions on your system. Nor may all of these +programs be fully compliant with the most recent POSIX standard. This +is not a problem; their purpose is to illustrate 'awk' language +programming for "real-world" tasks. 
+ + The programs are presented in alphabetical order. + +* Menu: + +* Cut Program:: The 'cut' utility. +* Egrep Program:: The 'egrep' utility. +* Id Program:: The 'id' utility. +* Split Program:: The 'split' utility. +* Tee Program:: The 'tee' utility. +* Uniq Program:: The 'uniq' utility. +* Wc Program:: The 'wc' utility. + + +File: gawk.info, Node: Cut Program, Next: Egrep Program, Up: Clones + +11.2.1 Cutting Out Fields and Columns +------------------------------------- + +The 'cut' utility selects, or "cuts," characters or fields from its +standard input and sends them to its standard output. Fields are +separated by TABs by default, but you may supply a command-line option +to change the field "delimiter" (i.e., the field-separator character). +'cut''s definition of fields is less general than 'awk''s. + + A common use of 'cut' might be to pull out just the login names of +logged-on users from the output of 'who'. For example, the following +pipeline generates a sorted, unique list of the logged-on users: + + who | cut -c1-8 | sort | uniq + + The options for 'cut' are: + +'-c LIST' + Use LIST as the list of characters to cut out. Items within the + list may be separated by commas, and ranges of characters can be + separated with dashes. The list '1-8,15,22-35' specifies + characters 1 through 8, 15, and 22 through 35. + +'-f LIST' + Use LIST as the list of fields to cut out. + +'-d DELIM' + Use DELIM as the field-separator character instead of the TAB + character. + +'-s' + Suppress printing of lines that do not contain the field delimiter. + + The 'awk' implementation of 'cut' uses the 'getopt()' library +function (*note Getopt Function::) and the 'join()' library function +(*note Join Function::). + + The program begins with a comment describing the options, the library +functions needed, and a 'usage()' function that prints out a usage +message and exits. 'usage()' is called if invalid arguments are +supplied: + + # cut.awk --- implement cut in awk + + # Options: + # -f list Cut fields + # -d c Field delimiter character + # -c list Cut characters + # + # -s Suppress lines without the delimiter + # + # Requires getopt() and join() library functions + + function usage() + { + print("usage: cut [-f list] [-d c] [-s] [files...]") > "/dev/stderr" + print("usage: cut [-c list] [files...]") > "/dev/stderr" + exit 1 + } + + Next comes a 'BEGIN' rule that parses the command-line options. It +sets 'FS' to a single TAB character, because that is 'cut''s default +field separator. The rule then sets the output field separator to be +the same as the input field separator. A loop using 'getopt()' steps +through the command-line options. Exactly one of the variables +'by_fields' or 'by_chars' is set to true, to indicate that processing +should be done by fields or by characters, respectively. 
When cutting +by characters, the output field separator is set to the null string: + + BEGIN { + FS = "\t" # default + OFS = FS + while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) { + if (c == "f") { + by_fields = 1 + fieldlist = Optarg + } else if (c == "c") { + by_chars = 1 + fieldlist = Optarg + OFS = "" + } else if (c == "d") { + if (length(Optarg) > 1) { + printf("cut: using first character of %s" \ + " for delimiter\n", Optarg) > "/dev/stderr" + Optarg = substr(Optarg, 1, 1) + } + fs = FS = Optarg + OFS = FS + if (FS == " ") # defeat awk semantics + FS = "[ ]" + } else if (c == "s") + suppress = 1 + else + usage() + } + + # Clear out options + for (i = 1; i < Optind; i++) + ARGV[i] = "" + + The code must take special care when the field delimiter is a space. +Using a single space ('" "') for the value of 'FS' is incorrect--'awk' +would separate fields with runs of spaces, TABs, and/or newlines, and we +want them to be separated with individual spaces. To this end, we save +the original space character in the variable 'fs' for later use; after +setting 'FS' to '"[ ]"' we can't use it directly to see if the field +delimiter character is in the string. + + Also remember that after 'getopt()' is through (as described in *note +Getopt Function::), we have to clear out all the elements of 'ARGV' from +1 to 'Optind', so that 'awk' does not try to process the command-line +options as file names. + + After dealing with the command-line options, the program verifies +that the options make sense. Only one or the other of '-c' and '-f' +should be used, and both require a field list. Then the program calls +either 'set_fieldlist()' or 'set_charlist()' to pull apart the list of +fields or characters: + + if (by_fields && by_chars) + usage() + + if (by_fields == 0 && by_chars == 0) + by_fields = 1 # default + + if (fieldlist == "") { + print "cut: needs list for -c or -f" > "/dev/stderr" + exit 1 + } + + if (by_fields) + set_fieldlist() + else + set_charlist() + } + + 'set_fieldlist()' splits the field list apart at the commas into an +array. Then, for each element of the array, it looks to see if the +element is actually a range, and if so, splits it apart. The function +checks the range to make sure that the first number is smaller than the +second. Each number in the list is added to the 'flist' array, which +simply lists the fields that will be printed. Normal field splitting is +used. The program lets 'awk' handle the job of doing the field +splitting: + + function set_fieldlist( n, m, i, j, k, f, g) + { + n = split(fieldlist, f, ",") + j = 1 # index in flist + for (i = 1; i <= n; i++) { + if (index(f[i], "-") != 0) { # a range + m = split(f[i], g, "-") + if (m != 2 || g[1] >= g[2]) { + printf("cut: bad field list: %s\n", + f[i]) > "/dev/stderr" + exit 1 + } + for (k = g[1]; k <= g[2]; k++) + flist[j++] = k + } else + flist[j++] = f[i] + } + nfields = j - 1 + } + + The 'set_charlist()' function is more complicated than +'set_fieldlist()'. The idea here is to use 'gawk''s 'FIELDWIDTHS' +variable (*note Constant Size::), which describes constant-width input. +When using a character list, that is exactly what we have. + + Setting up 'FIELDWIDTHS' is more complicated than simply listing the +fields that need to be printed. We have to keep track of the fields to +print and also the intervening characters that have to be skipped. For +example, suppose you wanted characters 1 through 8, 15, and 22 through +35. You would use '-c 1-8,15,22-35'. The necessary value for +'FIELDWIDTHS' is '"8 6 1 6 14"'. 
This yields five fields, and the +fields to print are '$1', '$3', and '$5'. The intermediate fields are +"filler", which is stuff in between the desired data. 'flist' lists the +fields to print, and 't' tracks the complete field list, including +filler fields: + + function set_charlist( field, i, j, f, g, n, m, t, + filler, last, len) + { + field = 1 # count total fields + n = split(fieldlist, f, ",") + j = 1 # index in flist + for (i = 1; i <= n; i++) { + if (index(f[i], "-") != 0) { # range + m = split(f[i], g, "-") + if (m != 2 || g[1] >= g[2]) { + printf("cut: bad character list: %s\n", + f[i]) > "/dev/stderr" + exit 1 + } + len = g[2] - g[1] + 1 + if (g[1] > 1) # compute length of filler + filler = g[1] - last - 1 + else + filler = 0 + if (filler) + t[field++] = filler + t[field++] = len # length of field + last = g[2] + flist[j++] = field - 1 + } else { + if (f[i] > 1) + filler = f[i] - last - 1 + else + filler = 0 + if (filler) + t[field++] = filler + t[field++] = 1 + last = f[i] + flist[j++] = field - 1 + } + } + FIELDWIDTHS = join(t, 1, field - 1) + nfields = j - 1 + } + + Next is the rule that processes the data. If the '-s' option is +given, then 'suppress' is true. The first 'if' statement makes sure +that the input record does have the field separator. If 'cut' is +processing fields, 'suppress' is true, and the field separator character +is not in the record, then the record is skipped. + + If the record is valid, then 'gawk' has split the data into fields, +either using the character in 'FS' or using fixed-length fields and +'FIELDWIDTHS'. The loop goes through the list of fields that should be +printed. The corresponding field is printed if it contains data. If +the next field also has data, then the separator character is written +out between the fields: + + { + if (by_fields && suppress && index($0, fs) == 0) + next + + for (i = 1; i <= nfields; i++) { + if ($flist[i] != "") { + printf "%s", $flist[i] + if (i < nfields && $flist[i+1] != "") + printf "%s", OFS + } + } + print "" + } + + This version of 'cut' relies on 'gawk''s 'FIELDWIDTHS' variable to do +the character-based cutting. It is possible in other 'awk' +implementations to use 'substr()' (*note String Functions::), but it is +also extremely painful. The 'FIELDWIDTHS' variable supplies an elegant +solution to the problem of picking the input line apart by characters. + + +File: gawk.info, Node: Egrep Program, Next: Id Program, Prev: Cut Program, Up: Clones + +11.2.2 Searching for Regular Expressions in Files +------------------------------------------------- + +The 'egrep' utility searches files for patterns. It uses regular +expressions that are almost identical to those available in 'awk' (*note +Regexp::). You invoke it as follows: + + 'egrep' [OPTIONS] ''PATTERN'' FILES ... + + The PATTERN is a regular expression. In typical usage, the regular +expression is quoted to prevent the shell from expanding any of the +special characters as file name wildcards. Normally, 'egrep' prints the +lines that matched. If multiple file names are provided on the command +line, each output line is preceded by the name of the file and a colon. + + The options to 'egrep' are as follows: + +'-c' + Print out a count of the lines that matched the pattern, instead of + the lines themselves. + +'-s' + Be silent. No output is produced and the exit value indicates + whether the pattern was matched. + +'-v' + Invert the sense of the test. 
'egrep' prints the lines that do + _not_ match the pattern and exits successfully if the pattern is + not matched. + +'-i' + Ignore case distinctions in both the pattern and the input data. + +'-l' + Only print (list) the names of the files that matched, not the + lines that matched. + +'-e PATTERN' + Use PATTERN as the regexp to match. The purpose of the '-e' option + is to allow patterns that start with a '-'. + + This version uses the 'getopt()' library function (*note Getopt +Function::) and the file transition library program (*note Filetrans +Function::). + + The program begins with a descriptive comment and then a 'BEGIN' rule +that processes the command-line arguments with 'getopt()'. The '-i' +(ignore case) option is particularly easy with 'gawk'; we just use the +'IGNORECASE' predefined variable (*note Built-in Variables::): + + # egrep.awk --- simulate egrep in awk + # + # Options: + # -c count of lines + # -s silent - use exit value + # -v invert test, success if no match + # -i ignore case + # -l print filenames only + # -e argument is pattern + # + # Requires getopt and file transition library functions + + BEGIN { + while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) { + if (c == "c") + count_only++ + else if (c == "s") + no_print++ + else if (c == "v") + invert++ + else if (c == "i") + IGNORECASE = 1 + else if (c == "l") + filenames_only++ + else if (c == "e") + pattern = Optarg + else + usage() + } + + Next comes the code that handles the 'egrep'-specific behavior. If +no pattern is supplied with '-e', the first nonoption on the command +line is used. The 'awk' command-line arguments up to 'ARGV[Optind]' are +cleared, so that 'awk' won't try to process them as files. If no files +are specified, the standard input is used, and if multiple files are +specified, we make sure to note this so that the file names can precede +the matched lines in the output: + + if (pattern == "") + pattern = ARGV[Optind++] + + for (i = 1; i < Optind; i++) + ARGV[i] = "" + if (Optind >= ARGC) { + ARGV[1] = "-" + ARGC = 2 + } else if (ARGC - Optind > 1) + do_filenames++ + + # if (IGNORECASE) + # pattern = tolower(pattern) + } + + The last two lines are commented out, as they are not needed in +'gawk'. They should be uncommented if you have to use another version +of 'awk'. + + The next set of lines should be uncommented if you are not using +'gawk'. This rule translates all the characters in the input line into +lowercase if the '-i' option is specified.(1) The rule is commented out +as it is not necessary with 'gawk': + + #{ + # if (IGNORECASE) + # $0 = tolower($0) + #} + + The 'beginfile()' function is called by the rule in 'ftrans.awk' when +each new file is processed. In this case, it is very simple; all it +does is initialize a variable 'fcount' to zero. 'fcount' tracks how +many lines in the current file matched the pattern. Naming the +parameter 'junk' shows we know that 'beginfile()' is called with a +parameter, but that we're not interested in its value: + + function beginfile(junk) + { + fcount = 0 + } + + The 'endfile()' function is called after each file has been +processed. It affects the output only when the user wants a count of +the number of lines that matched. 'no_print' is true only if the exit +status is desired. 'count_only' is true if line counts are desired. +'egrep' therefore only prints line counts if printing and counting are +enabled. The output format must be adjusted depending upon the number +of files to process. 
Finally, 'fcount' is added to 'total', so that we +know the total number of lines that matched the pattern: + + function endfile(file) + { + if (! no_print && count_only) { + if (do_filenames) + print file ":" fcount + else + print fcount + } + + total += fcount + } + + The 'BEGINFILE' and 'ENDFILE' special patterns (*note +BEGINFILE/ENDFILE::) could be used, but then the program would be +'gawk'-specific. Additionally, this example was written before 'gawk' +acquired 'BEGINFILE' and 'ENDFILE'. + + The following rule does most of the work of matching lines. The +variable 'matches' is true if the line matched the pattern. If the user +wants lines that did not match, the sense of 'matches' is inverted using +the '!' operator. 'fcount' is incremented with the value of 'matches', +which is either one or zero, depending upon a successful or unsuccessful +match. If the line does not match, the 'next' statement just moves on +to the next record. + + A number of additional tests are made, but they are only done if we +are not counting lines. First, if the user only wants the exit status +('no_print' is true), then it is enough to know that _one_ line in this +file matched, and we can skip on to the next file with 'nextfile'. +Similarly, if we are only printing file names, we can print the file +name, and then skip to the next file with 'nextfile'. Finally, each +line is printed, with a leading file name and colon if necessary: + + { + matches = ($0 ~ pattern) + if (invert) + matches = ! matches + + fcount += matches # 1 or 0 + + if (! matches) + next + + if (! count_only) { + if (no_print) + nextfile + + if (filenames_only) { + print FILENAME + nextfile + } + + if (do_filenames) + print FILENAME ":" $0 + else + print + } + } + + The 'END' rule takes care of producing the correct exit status. If +there are no matches, the exit status is one; otherwise, it is zero: + + END { + exit (total == 0) + } + + The 'usage()' function prints a usage message in case of invalid +options, and then exits: + + function usage() + { + print("Usage: egrep [-csvil] [-e pat] [files ...]") > "/dev/stderr" + print("\n\tegrep [-csvil] pat [files ...]") > "/dev/stderr" + exit 1 + } + + ---------- Footnotes ---------- + + (1) It also introduces a subtle bug; if a match happens, we output +the translated line, not the original. + + +File: gawk.info, Node: Id Program, Next: Split Program, Prev: Egrep Program, Up: Clones + +11.2.3 Printing Out User Information +------------------------------------ + +The 'id' utility lists a user's real and effective user ID numbers, real +and effective group ID numbers, and the user's group set, if any. 'id' +only prints the effective user ID and group ID if they are different +from the real ones. If possible, 'id' also supplies the corresponding +user and group names. The output might look like this: + + $ id + -| uid=1000(arnold) gid=1000(arnold) groups=1000(arnold),4(adm),7(lp),27(sudo) + + This information is part of what is provided by 'gawk''s 'PROCINFO' +array (*note Built-in Variables::). However, the 'id' utility provides +a more palatable output than just individual numbers. + + Here is a simple version of 'id' written in 'awk'. It uses the user +database library functions (*note Passwd Functions::) and the group +database library functions (*note Group Functions::) from *note Library +Functions::. + + The program is fairly straightforward. All the work is done in the +'BEGIN' rule. The user and group ID numbers are obtained from +'PROCINFO'. The code is repetitive. 
The entry in the user database for +the real user ID number is split into parts at the ':'. The name is the +first field. Similar code is used for the effective user ID number and +the group numbers: + + # id.awk --- implement id in awk + # + # Requires user and group library functions + # output is: + # uid=12(foo) euid=34(bar) gid=3(baz) \ + # egid=5(blat) groups=9(nine),2(two),1(one) + + BEGIN { + uid = PROCINFO["uid"] + euid = PROCINFO["euid"] + gid = PROCINFO["gid"] + egid = PROCINFO["egid"] + + printf("uid=%d", uid) + pw = getpwuid(uid) + pr_first_field(pw) + + if (euid != uid) { + printf(" euid=%d", euid) + pw = getpwuid(euid) + pr_first_field(pw) + } + + printf(" gid=%d", gid) + pw = getgrgid(gid) + pr_first_field(pw) + + if (egid != gid) { + printf(" egid=%d", egid) + pw = getgrgid(egid) + pr_first_field(pw) + } + + for (i = 1; ("group" i) in PROCINFO; i++) { + if (i == 1) + printf(" groups=") + group = PROCINFO["group" i] + printf("%d", group) + pw = getgrgid(group) + pr_first_field(pw) + if (("group" (i+1)) in PROCINFO) + printf(",") + } + + print "" + } + + function pr_first_field(str, a) + { + if (str != "") { + split(str, a, ":") + printf("(%s)", a[1]) + } + } + + The test in the 'for' loop is worth noting. Any supplementary groups +in the 'PROCINFO' array have the indices '"group1"' through '"groupN"' +for some N (i.e., the total number of supplementary groups). However, +we don't know in advance how many of these groups there are. + + This loop works by starting at one, concatenating the value with +'"group"', and then using 'in' to see if that value is in the array +(*note Reference to Elements::). Eventually, 'i' is incremented past +the last group in the array and the loop exits. + + The loop is also correct if there are _no_ supplementary groups; then +the condition is false the first time it's tested, and the loop body +never executes. + + The 'pr_first_field()' function simply isolates out some code that is +used repeatedly, making the whole program shorter and cleaner. In +particular, moving the check for the empty string into this function +saves several lines of code. + + +File: gawk.info, Node: Split Program, Next: Tee Program, Prev: Id Program, Up: Clones + +11.2.4 Splitting a Large File into Pieces +----------------------------------------- + +The 'split' program splits large text files into smaller pieces. Usage +is as follows:(1) + + 'split' ['-COUNT'] [FILE] [PREFIX] + + By default, the output files are named 'xaa', 'xab', and so on. Each +file has 1,000 lines in it, with the likely exception of the last file. +To change the number of lines in each file, supply a number on the +command line preceded with a minus sign (e.g., '-500' for files with 500 +lines in them instead of 1,000). To change the names of the output +files to something like 'myfileaa', 'myfileab', and so on, supply an +additional argument that specifies the file name prefix. + + Here is a version of 'split' in 'awk'. It uses the 'ord()' and +'chr()' functions presented in *note Ordinal Functions::. + + The program first sets its defaults, and then tests to make sure +there are not too many arguments. It then looks at each argument in +turn. The first argument could be a minus sign followed by a number. +If it is, this happens to look like a negative number, so it is made +positive, and that is the count of lines. 
The data file name is skipped +over and the final argument is used as the prefix for the output file +names: + + # split.awk --- do split in awk + # + # Requires ord() and chr() library functions + # usage: split [-count] [file] [outname] + + BEGIN { + outfile = "x" # default + count = 1000 + if (ARGC > 4) + usage() + + i = 1 + if (i in ARGV && ARGV[i] ~ /^-[[:digit:]]+$/) { + count = -ARGV[i] + ARGV[i] = "" + i++ + } + # test argv in case reading from stdin instead of file + if (i in ARGV) + i++ # skip datafile name + if (i in ARGV) { + outfile = ARGV[i] + ARGV[i] = "" + } + + s1 = s2 = "a" + out = (outfile s1 s2) + } + + The next rule does most of the work. 'tcount' (temporary count) +tracks how many lines have been printed to the output file so far. If +it is greater than 'count', it is time to close the current file and +start a new one. 's1' and 's2' track the current suffixes for the file +name. If they are both 'z', the file is just too big. Otherwise, 's1' +moves to the next letter in the alphabet and 's2' starts over again at +'a': + + { + if (++tcount > count) { + close(out) + if (s2 == "z") { + if (s1 == "z") { + printf("split: %s is too large to split\n", + FILENAME) > "/dev/stderr" + exit 1 + } + s1 = chr(ord(s1) + 1) + s2 = "a" + } + else + s2 = chr(ord(s2) + 1) + out = (outfile s1 s2) + tcount = 1 + } + print > out + } + +The 'usage()' function simply prints an error message and exits: + + function usage() + { + print("usage: split [-num] [file] [outname]") > "/dev/stderr" + exit 1 + } + + This program is a bit sloppy; it relies on 'awk' to automatically +close the last file instead of doing it in an 'END' rule. It also +assumes that letters are contiguous in the character set, which isn't +true for EBCDIC systems. + + ---------- Footnotes ---------- + + (1) This is the traditional usage. The POSIX usage is different, but +not relevant for what the program aims to demonstrate. + + +File: gawk.info, Node: Tee Program, Next: Uniq Program, Prev: Split Program, Up: Clones + +11.2.5 Duplicating Output into Multiple Files +--------------------------------------------- + +The 'tee' program is known as a "pipe fitting." 'tee' copies its +standard input to its standard output and also duplicates it to the +files named on the command line. Its usage is as follows: + + 'tee' ['-a'] FILE ... + + The '-a' option tells 'tee' to append to the named files, instead of +truncating them and starting over. + + The 'BEGIN' rule first makes a copy of all the command-line arguments +into an array named 'copy'. 'ARGV[0]' is not needed, so it is not +copied. 'tee' cannot use 'ARGV' directly, because 'awk' attempts to +process each file name in 'ARGV' as input data. + + If the first argument is '-a', then the flag variable 'append' is set +to true, and both 'ARGV[1]' and 'copy[1]' are deleted. If 'ARGC' is +less than two, then no file names were supplied and 'tee' prints a usage +message and exits. Finally, 'awk' is forced to read the standard input +by setting 'ARGV[1]' to '"-"' and 'ARGC' to two: + + # tee.awk --- tee in awk + # + # Copy standard input to all named output files. + # Append content if -a option is supplied. + # + BEGIN { + for (i = 1; i < ARGC; i++) + copy[i] = ARGV[i] + + if (ARGV[1] == "-a") { + append = 1 + delete ARGV[1] + delete copy[1] + ARGC-- + } + if (ARGC < 2) { + print "usage: tee [-a] file ..." > "/dev/stderr" + exit 1 + } + ARGV[1] = "-" + ARGC = 2 + } + + The following single rule does all the work. Because there is no +pattern, it is executed for each line of input. 
The body of the rule +simply prints the line into each file on the command line, and then to +the standard output: + + { + # moving the if outside the loop makes it run faster + if (append) + for (i in copy) + print >> copy[i] + else + for (i in copy) + print > copy[i] + print + } + +It is also possible to write the loop this way: + + for (i in copy) + if (append) + print >> copy[i] + else + print > copy[i] + +This is more concise, but it is also less efficient. The 'if' is tested +for each record and for each output file. By duplicating the loop body, +the 'if' is only tested once for each input record. If there are N +input records and M output files, the first method only executes N 'if' +statements, while the second executes N'*'M 'if' statements. + + Finally, the 'END' rule cleans up by closing all the output files: + + END { + for (i in copy) + close(copy[i]) + } + + +File: gawk.info, Node: Uniq Program, Next: Wc Program, Prev: Tee Program, Up: Clones + +11.2.6 Printing Nonduplicated Lines of Text +------------------------------------------- + +The 'uniq' utility reads sorted lines of data on its standard input, and +by default removes duplicate lines. In other words, it only prints +unique lines--hence the name. 'uniq' has a number of options. The +usage is as follows: + + 'uniq' ['-udc' ['-N']] ['+N'] [INPUTFILE [OUTPUTFILE]] + + The options for 'uniq' are: + +'-d' + Print only repeated (duplicated) lines. + +'-u' + Print only nonrepeated (unique) lines. + +'-c' + Count lines. This option overrides '-d' and '-u'. Both repeated + and nonrepeated lines are counted. + +'-N' + Skip N fields before comparing lines. The definition of fields is + similar to 'awk''s default: nonwhitespace characters separated by + runs of spaces and/or TABs. + +'+N' + Skip N characters before comparing lines. Any fields specified + with '-N' are skipped first. + +'INPUTFILE' + Data is read from the input file named on the command line, instead + of from the standard input. + +'OUTPUTFILE' + The generated output is sent to the named output file, instead of + to the standard output. + + Normally 'uniq' behaves as if both the '-d' and '-u' options are +provided. + + 'uniq' uses the 'getopt()' library function (*note Getopt Function::) +and the 'join()' library function (*note Join Function::). + + The program begins with a 'usage()' function and then a brief outline +of the options and their meanings in comments. The 'BEGIN' rule deals +with the command-line arguments and options. It uses a trick to get +'getopt()' to handle options of the form '-25', treating such an option +as the option letter '2' with an argument of '5'. If indeed two or more +digits are supplied ('Optarg' looks like a number), 'Optarg' is +concatenated with the option digit and then the result is added to zero +to make it into a number. If there is only one digit in the option, +then 'Optarg' is not needed. In this case, 'Optind' must be decremented +so that 'getopt()' processes it next time. This code is admittedly a +bit tricky. + + If no options are supplied, then the default is taken, to print both +repeated and nonrepeated lines. The output file, if provided, is +assigned to 'outputfile'. Early on, 'outputfile' is initialized to the +standard output, '/dev/stdout': + + # uniq.awk --- do uniq in awk + # + # Requires getopt() and join() library functions + + function usage() + { + print("Usage: uniq [-udc [-n]] [+n] [ in [ out ]]") > "/dev/stderr" + exit 1 + } + + # -c count lines. 
overrides -d and -u + # -d only repeated lines + # -u only nonrepeated lines + # -n skip n fields + # +n skip n characters, skip fields first + + BEGIN { + count = 1 + outputfile = "/dev/stdout" + opts = "udc0:1:2:3:4:5:6:7:8:9:" + while ((c = getopt(ARGC, ARGV, opts)) != -1) { + if (c == "u") + non_repeated_only++ + else if (c == "d") + repeated_only++ + else if (c == "c") + do_count++ + else if (index("0123456789", c) != 0) { + # getopt() requires args to options + # this messes us up for things like -5 + if (Optarg ~ /^[[:digit:]]+$/) + fcount = (c Optarg) + 0 + else { + fcount = c + 0 + Optind-- + } + } else + usage() + } + + if (ARGV[Optind] ~ /^\+[[:digit:]]+$/) { + charcount = substr(ARGV[Optind], 2) + 0 + Optind++ + } + + for (i = 1; i < Optind; i++) + ARGV[i] = "" + + if (repeated_only == 0 && non_repeated_only == 0) + repeated_only = non_repeated_only = 1 + + if (ARGC - Optind == 2) { + outputfile = ARGV[ARGC - 1] + ARGV[ARGC - 1] = "" + } + } + + The following function, 'are_equal()', compares the current line, +'$0', to the previous line, 'last'. It handles skipping fields and +characters. If no field count and no character count are specified, +'are_equal()' returns one or zero depending upon the result of a simple +string comparison of 'last' and '$0'. + + Otherwise, things get more complicated. If fields have to be +skipped, each line is broken into an array using 'split()' (*note String +Functions::); the desired fields are then joined back into a line using +'join()'. The joined lines are stored in 'clast' and 'cline'. If no +fields are skipped, 'clast' and 'cline' are set to 'last' and '$0', +respectively. Finally, if characters are skipped, 'substr()' is used to +strip off the leading 'charcount' characters in 'clast' and 'cline'. +The two strings are then compared and 'are_equal()' returns the result: + + function are_equal( n, m, clast, cline, alast, aline) + { + if (fcount == 0 && charcount == 0) + return (last == $0) + + if (fcount > 0) { + n = split(last, alast) + m = split($0, aline) + clast = join(alast, fcount+1, n) + cline = join(aline, fcount+1, m) + } else { + clast = last + cline = $0 + } + if (charcount) { + clast = substr(clast, charcount + 1) + cline = substr(cline, charcount + 1) + } + + return (clast == cline) + } + + The following two rules are the body of the program. The first one +is executed only for the very first line of data. It sets 'last' equal +to '$0', so that subsequent lines of text have something to be compared +to. + + The second rule does the work. The variable 'equal' is one or zero, +depending upon the results of 'are_equal()''s comparison. If 'uniq' is +counting repeated lines, and the lines are equal, then it increments the +'count' variable. Otherwise, it prints the line and resets 'count', +because the two lines are not equal. + + If 'uniq' is not counting, and if the lines are equal, 'count' is +incremented. Nothing is printed, as the point is to remove duplicates. +Otherwise, if 'uniq' is counting repeated lines and more than one line +is seen, or if 'uniq' is counting nonrepeated lines and only one line is +seen, then the line is printed, and 'count' is reset. 
+ + Finally, similar logic is used in the 'END' rule to print the final +line of input data: + + NR == 1 { + last = $0 + next + } + + { + equal = are_equal() + + if (do_count) { # overrides -d and -u + if (equal) + count++ + else { + printf("%4d %s\n", count, last) > outputfile + last = $0 + count = 1 # reset + } + next + } + + if (equal) + count++ + else { + if ((repeated_only && count > 1) || + (non_repeated_only && count == 1)) + print last > outputfile + last = $0 + count = 1 + } + } + + END { + if (do_count) + printf("%4d %s\n", count, last) > outputfile + else if ((repeated_only && count > 1) || + (non_repeated_only && count == 1)) + print last > outputfile + close(outputfile) + } + + +File: gawk.info, Node: Wc Program, Prev: Uniq Program, Up: Clones + +11.2.7 Counting Things +---------------------- + +The 'wc' (word count) utility counts lines, words, and characters in one +or more input files. Its usage is as follows: + + 'wc' ['-lwc'] [FILES ...] + + If no files are specified on the command line, 'wc' reads its +standard input. If there are multiple files, it also prints total +counts for all the files. The options and their meanings are as +follows: + +'-l' + Count only lines. + +'-w' + Count only words. A "word" is a contiguous sequence of + nonwhitespace characters, separated by spaces and/or TABs. + Luckily, this is the normal way 'awk' separates fields in its input + data. + +'-c' + Count only characters. + + Implementing 'wc' in 'awk' is particularly elegant, because 'awk' +does a lot of the work for us; it splits lines into words (i.e., fields) +and counts them, it counts lines (i.e., records), and it can easily tell +us how long a line is. + + This program uses the 'getopt()' library function (*note Getopt +Function::) and the file-transition functions (*note Filetrans +Function::). + + This version has one notable difference from traditional versions of +'wc': it always prints the counts in the order lines, words, and +characters. Traditional versions note the order of the '-l', '-w', and +'-c' options on the command line, and print the counts in that order. + + The 'BEGIN' rule does the argument processing. The variable +'print_total' is true if more than one file is named on the command +line: + + # wc.awk --- count lines, words, characters + + # Options: + # -l only count lines + # -w only count words + # -c only count characters + # + # Default is to count lines, words, characters + # + # Requires getopt() and file transition library functions + + BEGIN { + # let getopt() print a message about + # invalid options. we ignore them + while ((c = getopt(ARGC, ARGV, "lwc")) != -1) { + if (c == "l") + do_lines = 1 + else if (c == "w") + do_words = 1 + else if (c == "c") + do_chars = 1 + } + for (i = 1; i < Optind; i++) + ARGV[i] = "" + + # if no options, do all + if (! do_lines && ! do_words && ! do_chars) + do_lines = do_words = do_chars = 1 + + print_total = (ARGC - i > 1) + } + + The 'beginfile()' function is simple; it just resets the counts of +lines, words, and characters to zero, and saves the current file name in +'fname': + + function beginfile(file) + { + lines = words = chars = 0 + fname = FILENAME + } + + The 'endfile()' function adds the current file's numbers to the +running totals of lines, words, and characters. It then prints out +those numbers for the file that was just read. 
It relies on +'beginfile()' to reset the numbers for the following data file: + + function endfile(file) + { + tlines += lines + twords += words + tchars += chars + if (do_lines) + printf "\t%d", lines + if (do_words) + printf "\t%d", words + if (do_chars) + printf "\t%d", chars + printf "\t%s\n", fname + } + + There is one rule that is executed for each line. It adds the length +of the record, plus one, to 'chars'.(1) Adding one plus the record +length is needed because the newline character separating records (the +value of 'RS') is not part of the record itself, and thus not included +in its length. Next, 'lines' is incremented for each line read, and +'words' is incremented by the value of 'NF', which is the number of +"words" on this line: + + # do per line + { + chars += length($0) + 1 # get newline + lines++ + words += NF + } + + Finally, the 'END' rule simply prints the totals for all the files: + + END { + if (print_total) { + if (do_lines) + printf "\t%d", tlines + if (do_words) + printf "\t%d", twords + if (do_chars) + printf "\t%d", tchars + print "\ttotal" + } + } + + ---------- Footnotes ---------- + + (1) Because 'gawk' understands multibyte locales, this code counts +characters, not bytes. + + +File: gawk.info, Node: Miscellaneous Programs, Next: Programs Summary, Prev: Clones, Up: Sample Programs + +11.3 A Grab Bag of 'awk' Programs +================================= + +This minor node is a large "grab bag" of miscellaneous programs. We +hope you find them both interesting and enjoyable. + +* Menu: + +* Dupword Program:: Finding duplicated words in a document. +* Alarm Program:: An alarm clock. +* Translate Program:: A program similar to the 'tr' utility. +* Labels Program:: Printing mailing labels. +* Word Sorting:: A program to produce a word usage count. +* History Sorting:: Eliminating duplicate entries from a history + file. +* Extract Program:: Pulling out programs from Texinfo source + files. +* Simple Sed:: A Simple Stream Editor. +* Igawk Program:: A wrapper for 'awk' that includes + files. +* Anagram Program:: Finding anagrams from a dictionary. +* Signature Program:: People do amazing things with too much time on + their hands. + + +File: gawk.info, Node: Dupword Program, Next: Alarm Program, Up: Miscellaneous Programs + +11.3.1 Finding Duplicated Words in a Document +--------------------------------------------- + +A common error when writing large amounts of prose is to accidentally +duplicate words. Typically you will see this in text as something like +"the the program does the following..." When the text is online, often +the duplicated words occur at the end of one line and the beginning of +another, making them very difficult to spot. + + This program, 'dupword.awk', scans through a file one line at a time +and looks for adjacent occurrences of the same word. It also saves the +last word on a line (in the variable 'prev') for comparison with the +first word on the next line. + + The first two statements make sure that the line is all lowercase, so +that, for example, "The" and "the" compare equal to each other. The +next statement replaces nonalphanumeric and nonwhitespace characters +with spaces, so that punctuation does not affect the comparison either. +The characters are replaced with spaces so that formatting controls +don't create nonsense words (e.g., the Texinfo '@code{NF}' becomes +'codeNF' if punctuation is simply deleted). 
The record is then resplit +into fields, yielding just the actual words on the line, and ensuring +that there are no empty fields. + + If there are no fields left after removing all the punctuation, the +current record is skipped. Otherwise, the program loops through each +word, comparing it to the previous one: + + # dupword.awk --- find duplicate words in text + { + $0 = tolower($0) + gsub(/[^[:alnum:][:blank:]]/, " "); + $0 = $0 # re-split + if (NF == 0) + next + if ($1 == prev) + printf("%s:%d: duplicate %s\n", + FILENAME, FNR, $1) + for (i = 2; i <= NF; i++) + if ($i == $(i-1)) + printf("%s:%d: duplicate %s\n", + FILENAME, FNR, $i) + prev = $NF + } + + +File: gawk.info, Node: Alarm Program, Next: Translate Program, Prev: Dupword Program, Up: Miscellaneous Programs + +11.3.2 An Alarm Clock Program +----------------------------- + + Nothing cures insomnia like a ringing alarm clock. + -- _Arnold Robbins_ + Sleep is for web developers. + -- _Erik Quanstrom_ + + The following program is a simple "alarm clock" program. You give it +a time of day and an optional message. At the specified time, it prints +the message on the standard output. In addition, you can give it the +number of times to repeat the message as well as a delay between +repetitions. + + This program uses the 'getlocaltime()' function from *note +Getlocaltime Function::. + + All the work is done in the 'BEGIN' rule. The first part is argument +checking and setting of defaults: the delay, the count, and the message +to print. If the user supplied a message without the ASCII BEL +character (known as the "alert" character, '"\a"'), then it is added to +the message. (On many systems, printing the ASCII BEL generates an +audible alert. Thus, when the alarm goes off, the system calls +attention to itself in case the user is not looking at the computer.) +Just for a change, this program uses a 'switch' statement (*note Switch +Statement::), but the processing could be done with a series of +'if'-'else' statements instead. Here is the program: + + # alarm.awk --- set an alarm + # + # Requires getlocaltime() library function + # usage: alarm time [ "message" [ count [ delay ] ] ] + + BEGIN { + # Initial argument sanity checking + usage1 = "usage: alarm time ['message' [count [delay]]]" + usage2 = sprintf("\t(%s) time ::= hh:mm", ARGV[1]) + + if (ARGC < 2) { + print usage1 > "/dev/stderr" + print usage2 > "/dev/stderr" + exit 1 + } + switch (ARGC) { + case 5: + delay = ARGV[4] + 0 + # fall through + case 4: + count = ARGV[3] + 0 + # fall through + case 3: + message = ARGV[2] + break + default: + if (ARGV[1] !~ /[[:digit:]]?[[:digit:]]:[[:digit:]]{2}/) { + print usage1 > "/dev/stderr" + print usage2 > "/dev/stderr" + exit 1 + } + break + } + + # set defaults for once we reach the desired time + if (delay == 0) + delay = 180 # 3 minutes + if (count == 0) + count = 5 + if (message == "") + message = sprintf("\aIt is now %s!\a", ARGV[1]) + else if (index(message, "\a") == 0) + message = "\a" message "\a" + + The next minor node of code turns the alarm time into hours and +minutes, converts it (if necessary) to a 24-hour clock, and then turns +that time into a count of the seconds since midnight. Next it turns the +current time into a count of seconds since midnight. 
The difference +between the two is how long to wait before setting off the alarm: + + # split up alarm time + split(ARGV[1], atime, ":") + hour = atime[1] + 0 # force numeric + minute = atime[2] + 0 # force numeric + + # get current broken down time + getlocaltime(now) + + # if time given is 12-hour hours and it's after that + # hour, e.g., `alarm 5:30' at 9 a.m. means 5:30 p.m., + # then add 12 to real hour + if (hour < 12 && now["hour"] > hour) + hour += 12 + + # set target time in seconds since midnight + target = (hour * 60 * 60) + (minute * 60) + + # get current time in seconds since midnight + current = (now["hour"] * 60 * 60) + \ + (now["minute"] * 60) + now["second"] + + # how long to sleep for + naptime = target - current + if (naptime <= 0) { + print "alarm: time is in the past!" > "/dev/stderr" + exit 1 + } + + Finally, the program uses the 'system()' function (*note I/O +Functions::) to call the 'sleep' utility. The 'sleep' utility simply +pauses for the given number of seconds. If the exit status is not zero, +the program assumes that 'sleep' was interrupted and exits. If 'sleep' +exited with an OK status (zero), then the program prints the message in +a loop, again using 'sleep' to delay for however many seconds are +necessary: + + # zzzzzz..... go away if interrupted + if (system(sprintf("sleep %d", naptime)) != 0) + exit 1 + + # time to notify! + command = sprintf("sleep %d", delay) + for (i = 1; i <= count; i++) { + print message + # if sleep command interrupted, go away + if (system(command) != 0) + break + } + + exit 0 + } + + +File: gawk.info, Node: Translate Program, Next: Labels Program, Prev: Alarm Program, Up: Miscellaneous Programs + +11.3.3 Transliterating Characters +--------------------------------- + +The system 'tr' utility transliterates characters. For example, it is +often used to map uppercase letters into lowercase for further +processing: + + GENERATE DATA | tr 'A-Z' 'a-z' | PROCESS DATA ... + + 'tr' requires two lists of characters.(1) When processing the input, +the first character in the first list is replaced with the first +character in the second list, the second character in the first list is +replaced with the second character in the second list, and so on. If +there are more characters in the "from" list than in the "to" list, the +last character of the "to" list is used for the remaining characters in +the "from" list. + + Once upon a time, a user proposed adding a transliteration function +to 'gawk'. The following program was written to prove that character +transliteration could be done with a user-level function. This program +is not as complete as the system 'tr' utility, but it does most of the +job. + + The 'translate' program was written long before 'gawk' acquired the +ability to split each character in a string into separate array +elements. Thus, it makes repeated use of the 'substr()', 'index()', and +'gsub()' built-in functions (*note String Functions::). There are two +functions. The first, 'stranslate()', takes three arguments: + +'from' + A list of characters from which to translate + +'to' + A list of characters to which to translate + +'target' + The string on which to do the translation + + Associative arrays make the translation part fairly easy. 't_ar' +holds the "to" characters, indexed by the "from" characters. Then a +simple loop goes through 'from', one character at a time. For each +character in 'from', if the character appears in 'target', it is +replaced with the corresponding 'to' character. 
+ + The 'translate()' function calls 'stranslate()', using '$0' as the +target. The main program sets two global variables, 'FROM' and 'TO', +from the command line, and then changes 'ARGV' so that 'awk' reads from +the standard input. + + Finally, the processing rule simply calls 'translate()' for each +record: + + # translate.awk --- do tr-like stuff + # Bugs: does not handle things like tr A-Z a-z; it has + # to be spelled out. However, if `to' is shorter than `from', + # the last character in `to' is used for the rest of `from'. + + function stranslate(from, to, target, lf, lt, ltarget, t_ar, i, c, + result) + { + lf = length(from) + lt = length(to) + ltarget = length(target) + for (i = 1; i <= lt; i++) + t_ar[substr(from, i, 1)] = substr(to, i, 1) + if (lt < lf) + for (; i <= lf; i++) + t_ar[substr(from, i, 1)] = substr(to, lt, 1) + for (i = 1; i <= ltarget; i++) { + c = substr(target, i, 1) + if (c in t_ar) + c = t_ar[c] + result = result c + } + return result + } + + function translate(from, to) + { + return $0 = stranslate(from, to, $0) + } + + # main program + BEGIN { + if (ARGC < 3) { + print "usage: translate from to" > "/dev/stderr" + exit + } + FROM = ARGV[1] + TO = ARGV[2] + ARGC = 2 + ARGV[1] = "-" + } + + { + translate(FROM, TO) + print + } + + It is possible to do character transliteration in a user-level +function, but it is not necessarily efficient, and we (the 'gawk' +developers) started to consider adding a built-in function. However, +shortly after writing this program, we learned that Brian Kernighan had +added the 'toupper()' and 'tolower()' functions to his 'awk' (*note +String Functions::). These functions handle the vast majority of the +cases where character transliteration is necessary, and so we chose to +simply add those functions to 'gawk' as well and then leave well enough +alone. + + An obvious improvement to this program would be to set up the 't_ar' +array only once, in a 'BEGIN' rule. However, this assumes that the +"from" and "to" lists will never change throughout the lifetime of the +program. + + Another obvious improvement is to enable the use of ranges, such as +'a-z', as allowed by the 'tr' utility. Look at the code for 'cut.awk' +(*note Cut Program::) for inspiration. + + ---------- Footnotes ---------- + + (1) On some older systems, including Solaris, the system version of +'tr' may require that the lists be written as range expressions enclosed +in square brackets ('[a-z]') and quoted, to prevent the shell from +attempting a file name expansion. This is not a feature. + + +File: gawk.info, Node: Labels Program, Next: Word Sorting, Prev: Translate Program, Up: Miscellaneous Programs + +11.3.4 Printing Mailing Labels +------------------------------ + +Here is a "real-world"(1) program. This script reads lists of names and +addresses and generates mailing labels. Each page of labels has 20 +labels on it, two across and 10 down. The addresses are guaranteed to +be no more than five lines of data. Each address is separated from the +next by a blank line. + + The basic idea is to read 20 labels' worth of data. Each line of +each label is stored in the 'line' array. The single rule takes care of +filling the 'line' array and printing the page when 20 labels have been +read. + + The 'BEGIN' rule simply sets 'RS' to the empty string, so that 'awk' +splits records at blank lines (*note Records::). It sets 'MAXLINES' to +100, because 100 is the maximum number of lines on the page (20 * 5 = +100). + + Most of the work is done in the 'printpage()' function. 
The label +lines are stored sequentially in the 'line' array. But they have to +print horizontally: 'line[1]' next to 'line[6]', 'line[2]' next to +'line[7]', and so on. Two loops accomplish this. The outer loop, +controlled by 'i', steps through every 10 lines of data; this is each +row of labels. The inner loop, controlled by 'j', goes through the +lines within the row. As 'j' goes from 0 to 4, 'i+j' is the 'j'th line +in the row, and 'i+j+5' is the entry next to it. The output ends up +looking something like this: + + line 1 line 6 + line 2 line 7 + line 3 line 8 + line 4 line 9 + line 5 line 10 + ... + +The 'printf' format string '%-41s' left-aligns the data and prints it +within a fixed-width field. + + As a final note, an extra blank line is printed at lines 21 and 61, +to keep the output lined up on the labels. This is dependent on the +particular brand of labels in use when the program was written. You +will also note that there are two blank lines at the top and two blank +lines at the bottom. + + The 'END' rule arranges to flush the final page of labels; there may +not have been an even multiple of 20 labels in the data: + + # labels.awk --- print mailing labels + + # Each label is 5 lines of data that may have blank lines. + # The label sheets have 2 blank lines at the top and 2 at + # the bottom. + + BEGIN { RS = "" ; MAXLINES = 100 } + + function printpage( i, j) + { + if (Nlines <= 0) + return + + printf "\n\n" # header + + for (i = 1; i <= Nlines; i += 10) { + if (i == 21 || i == 61) + print "" + for (j = 0; j < 5; j++) { + if (i + j > MAXLINES) + break + printf " %-41s %s\n", line[i+j], line[i+j+5] + } + print "" + } + + printf "\n\n" # footer + + delete line + } + + # main rule + { + if (Count >= 20) { + printpage() + Count = 0 + Nlines = 0 + } + n = split($0, a, "\n") + for (i = 1; i <= n; i++) + line[++Nlines] = a[i] + for (; i <= 5; i++) + line[++Nlines] = "" + Count++ + } + + END { + printpage() + } + + ---------- Footnotes ---------- + + (1) "Real world" is defined as "a program actually used to get +something done." + + +File: gawk.info, Node: Word Sorting, Next: History Sorting, Prev: Labels Program, Up: Miscellaneous Programs + +11.3.5 Generating Word-Usage Counts +----------------------------------- + +When working with large amounts of text, it can be interesting to know +how often different words appear. For example, an author may overuse +certain words, in which case he or she might wish to find synonyms to +substitute for words that appear too often. This node develops a +program for counting words and presenting the frequency information in a +useful format. + + At first glance, a program like this would seem to do the job: + + # wordfreq-first-try.awk --- print list of word frequencies + + { + for (i = 1; i <= NF; i++) + freq[$i]++ + } + + END { + for (word in freq) + printf "%s\t%d\n", word, freq[word] + } + + The program relies on 'awk''s default field-splitting mechanism to +break each line up into "words" and uses an associative array named +'freq', indexed by each word, to count the number of times the word +occurs. In the 'END' rule, it prints the counts. + + This program has several problems that prevent it from being useful +on real text files: + + * The 'awk' language considers upper- and lowercase characters to be + distinct. Therefore, "bartender" and "Bartender" are not treated + as the same word. 
This is undesirable, because words are + capitalized if they begin sentences in normal text, and a frequency + analyzer should not be sensitive to capitalization. + + * Words are detected using the 'awk' convention that fields are + separated just by whitespace. Other characters in the input + (except newlines) don't have any special meaning to 'awk'. This + means that punctuation characters count as part of words. + + * The output does not come out in any useful order. You're more + likely to be interested in which words occur most frequently or in + having an alphabetized table of how frequently each word occurs. + + The first problem can be solved by using 'tolower()' to remove case +distinctions. The second problem can be solved by using 'gsub()' to +remove punctuation characters. Finally, we solve the third problem by +using the system 'sort' utility to process the output of the 'awk' +script. Here is the new version of the program: + + # wordfreq.awk --- print list of word frequencies + + { + $0 = tolower($0) # remove case distinctions + # remove punctuation + gsub(/[^[:alnum:]_[:blank:]]/, "", $0) + for (i = 1; i <= NF; i++) + freq[$i]++ + } + + END { + for (word in freq) + printf "%s\t%d\n", word, freq[word] + } + + The regexp '/[^[:alnum:]_[:blank:]]/' might have been written +'/[[:punct:]]/', but then underscores would also be removed, and we want +to keep them. + + Assuming we have saved this program in a file named 'wordfreq.awk', +and that the data is in 'file1', the following pipeline: + + awk -f wordfreq.awk file1 | sort -k 2nr + +produces a table of the words appearing in 'file1' in order of +decreasing frequency. + + The 'awk' program suitably massages the data and produces a word +frequency table, which is not ordered. The 'awk' script's output is +then sorted by the 'sort' utility and printed on the screen. + + The options given to 'sort' specify a sort that uses the second field +of each input line (skipping one field), that the sort keys should be +treated as numeric quantities (otherwise '15' would come before '5'), +and that the sorting should be done in descending (reverse) order. + + The 'sort' could even be done from within the program, by changing +the 'END' action to: + + END { + sort = "sort -k 2nr" + for (word in freq) + printf "%s\t%d\n", word, freq[word] | sort + close(sort) + } + + This way of sorting must be used on systems that do not have true +pipes at the command-line (or batch-file) level. See the general +operating system documentation for more information on how to use the +'sort' program. + + +File: gawk.info, Node: History Sorting, Next: Extract Program, Prev: Word Sorting, Up: Miscellaneous Programs + +11.3.6 Removing Duplicates from Unsorted Text +--------------------------------------------- + +The 'uniq' program (*note Uniq Program::) removes duplicate lines from +_sorted_ data. + + Suppose, however, you need to remove duplicate lines from a data file +but that you want to preserve the order the lines are in. A good +example of this might be a shell history file. The history file keeps a +copy of all the commands you have entered, and it is not unusual to +repeat a command several times in a row. Occasionally you might want to +compact the history by removing duplicate entries. Yet it is desirable +to maintain the order of the original commands. + + This simple program does the job. It uses two arrays. The 'data' +array is indexed by the text of each line. For each line, 'data[$0]' is +incremented. 
If a particular line has not been seen before, then +'data[$0]' is zero. In this case, the text of the line is stored in +'lines[count]'. Each element of 'lines' is a unique command, and the +indices of 'lines' indicate the order in which those lines are +encountered. The 'END' rule simply prints out the lines, in order: + + # histsort.awk --- compact a shell history file + # Thanks to Byron Rakitzis for the general idea + + { + if (data[$0]++ == 0) + lines[++count] = $0 + } + + END { + for (i = 1; i <= count; i++) + print lines[i] + } + + This program also provides a foundation for generating other useful +information. For example, using the following 'print' statement in the +'END' rule indicates how often a particular command is used: + + print data[lines[i]], lines[i] + +This works because 'data[$0]' is incremented each time a line is seen. + + +File: gawk.info, Node: Extract Program, Next: Simple Sed, Prev: History Sorting, Up: Miscellaneous Programs + +11.3.7 Extracting Programs from Texinfo Source Files +---------------------------------------------------- + +The nodes *note Library Functions::, and *note Sample Programs::, are +the top level nodes for a large number of 'awk' programs. If you want +to experiment with these programs, it is tedious to type them in by +hand. Here we present a program that can extract parts of a Texinfo +input file into separate files. + + This Info file is written in Texinfo +(http://www.gnu.org/software/texinfo/), the GNU Project's document +formatting language. A single Texinfo source file can be used to +produce both printed documentation, with TeX, and online documentation. +(The Texinfo language is described fully, starting with *note (Texinfo, +texinfo,Texinfo---The GNU Documentation Format)Top::.) + + For our purposes, it is enough to know three things about Texinfo +input files: + + * The "at" symbol ('@') is special in Texinfo, much as the backslash + ('\') is in C or 'awk'. Literal '@' symbols are represented in + Texinfo source files as '@@'. + + * Comments start with either '@c' or '@comment'. The file-extraction + program works by using special comments that start at the beginning + of a line. + + * Lines containing '@group' and '@end group' commands bracket example + text that should not be split across a page boundary. + (Unfortunately, TeX isn't always smart enough to do things exactly + right, so we have to give it some help.) + + The following program, 'extract.awk', reads through a Texinfo source +file and does two things, based on the special comments. Upon seeing +'@c system ...', it runs a command, by extracting the command text from +the control line and passing it on to the 'system()' function (*note I/O +Functions::). Upon seeing '@c file FILENAME', each subsequent line is +sent to the file FILENAME, until '@c endfile' is encountered. The rules +in 'extract.awk' match either '@c' or '@comment' by letting the 'omment' +part be optional. Lines containing '@group' and '@end group' are simply +removed. 'extract.awk' uses the 'join()' library function (*note Join +Function::). + + The example programs in the online Texinfo source for 'GAWK: +Effective AWK Programming' ('gawktexi.in') have all been bracketed +inside 'file' and 'endfile' lines. The 'gawk' distribution uses a copy +of 'extract.awk' to extract the sample programs and install many of them +in a standard directory where 'gawk' can find them. The Texinfo file +looks something like this: + + ... 
+ This program has a @code{BEGIN} rule + that prints a nice message: + + @example + @c file examples/messages.awk + BEGIN @{ print "Don't panic!" @} + @c endfile + @end example + + It also prints some final advice: + + @example + @c file examples/messages.awk + END @{ print "Always avoid bored archaeologists!" @} + @c endfile + @end example + ... + + 'extract.awk' begins by setting 'IGNORECASE' to one, so that mixed +upper- and lowercase letters in the directives won't matter. + + The first rule handles calling 'system()', checking that a command is +given ('NF' is at least three) and also checking that the command exits +with a zero exit status, signifying OK: + + # extract.awk --- extract files and run programs from Texinfo files + + BEGIN { IGNORECASE = 1 } + + /^@c(omment)?[ \t]+system/ { + if (NF < 3) { + e = ("extract: " FILENAME ":" FNR) + e = (e ": badly formed `system' line") + print e > "/dev/stderr" + next + } + $1 = "" + $2 = "" + stat = system($0) + if (stat != 0) { + e = ("extract: " FILENAME ":" FNR) + e = (e ": warning: system returned " stat) + print e > "/dev/stderr" + } + } + +The variable 'e' is used so that the rule fits nicely on the screen. + + The second rule handles moving data into files. It verifies that a +file name is given in the directive. If the file named is not the +current file, then the current file is closed. Keeping the current file +open until a new file is encountered allows the use of the '>' +redirection for printing the contents, keeping open-file management +simple. + + The 'for' loop does the work. It reads lines using 'getline' (*note +Getline::). For an unexpected end-of-file, it calls the +'unexpected_eof()' function. If the line is an "endfile" line, then it +breaks out of the loop. If the line is an '@group' or '@end group' +line, then it ignores it and goes on to the next line. Similarly, +comments within examples are also ignored. + + Most of the work is in the following few lines. If the line has no +'@' symbols, the program can print it directly. Otherwise, each leading +'@' must be stripped off. To remove the '@' symbols, the line is split +into separate elements of the array 'a', using the 'split()' function +(*note String Functions::). The '@' symbol is used as the separator +character. Each element of 'a' that is empty indicates two successive +'@' symbols in the original line. For each two empty elements ('@@' in +the original file), we have to add a single '@' symbol back in. + + When the processing of the array is finished, 'join()' is called with +the value of 'SUBSEP' (*note Multidimensional::), to rejoin the pieces +back into a single line. That line is then printed to the output file: + + /^@c(omment)?[ \t]+file/ { + if (NF != 3) { + e = ("extract: " FILENAME ":" FNR ": badly formed `file' line") + print e > "/dev/stderr" + next + } + if ($3 != curfile) { + if (curfile != "") + close(curfile) + curfile = $3 + } + + for (;;) { + if ((getline line) <= 0) + unexpected_eof() + if (line ~ /^@c(omment)?[ \t]+endfile/) + break + else if (line ~ /^@(end[ \t]+)?group/) + continue + else if (line ~ /^@c(omment+)?[ \t]+/) + continue + if (index(line, "@") == 0) { + print line > curfile + continue + } + n = split(line, a, "@") + # if a[1] == "", means leading @, + # don't add one back in. + for (i = 2; i <= n; i++) { + if (a[i] == "") { # was an @@ + a[i] = "@" + if (a[i+1] == "") + i++ + } + } + print join(a, 1, n, SUBSEP) > curfile + } + } + + An important thing to note is the use of the '>' redirection. 
Output +done with '>' only opens the file once; it stays open and subsequent +output is appended to the file (*note Redirection::). This makes it +easy to mix program text and explanatory prose for the same sample +source file (as has been done here!) without any hassle. The file is +only closed when a new data file name is encountered or at the end of +the input file. + + Finally, the function 'unexpected_eof()' prints an appropriate error +message and then exits. The 'END' rule handles the final cleanup, +closing the open file: + + function unexpected_eof() + { + printf("extract: %s:%d: unexpected EOF or error\n", + FILENAME, FNR) > "/dev/stderr" + exit 1 + } + + END { + if (curfile) + close(curfile) + } + + +File: gawk.info, Node: Simple Sed, Next: Igawk Program, Prev: Extract Program, Up: Miscellaneous Programs + +11.3.8 A Simple Stream Editor +----------------------------- + +The 'sed' utility is a "stream editor", a program that reads a stream of +data, makes changes to it, and passes it on. It is often used to make +global changes to a large file or to a stream of data generated by a +pipeline of commands. Although 'sed' is a complicated program in its +own right, its most common use is to perform global substitutions in the +middle of a pipeline: + + COMMAND1 < orig.data | sed 's/old/new/g' | COMMAND2 > result + + Here, 's/old/new/g' tells 'sed' to look for the regexp 'old' on each +input line and globally replace it with the text 'new' (i.e., all the +occurrences on a line). This is similar to 'awk''s 'gsub()' function +(*note String Functions::). + + The following program, 'awksed.awk', accepts at least two +command-line arguments: the pattern to look for and the text to replace +it with. Any additional arguments are treated as data file names to +process. If none are provided, the standard input is used: + + # awksed.awk --- do s/foo/bar/g using just print + # Thanks to Michael Brennan for the idea + + function usage() + { + print "usage: awksed pat repl [files...]" > "/dev/stderr" + exit 1 + } + + BEGIN { + # validate arguments + if (ARGC < 3) + usage() + + RS = ARGV[1] + ORS = ARGV[2] + + # don't use arguments as files + ARGV[1] = ARGV[2] = "" + } + + # look ma, no hands! + { + if (RT == "") + printf "%s", $0 + else + print + } + + The program relies on 'gawk''s ability to have 'RS' be a regexp, as +well as on the setting of 'RT' to the actual text that terminates the +record (*note Records::). + + The idea is to have 'RS' be the pattern to look for. 'gawk' +automatically sets '$0' to the text between matches of the pattern. +This is text that we want to keep, unmodified. Then, by setting 'ORS' +to the replacement text, a simple 'print' statement outputs the text we +want to keep, followed by the replacement text. + + There is one wrinkle to this scheme, which is what to do if the last +record doesn't end with text that matches 'RS'. Using a 'print' +statement unconditionally prints the replacement text, which is not +correct. However, if the file did not end in text that matches 'RS', +'RT' is set to the null string. In this case, we can print '$0' using +'printf' (*note Printf::). + + The 'BEGIN' rule handles the setup, checking for the right number of +arguments and calling 'usage()' if there is a problem. Then it sets +'RS' and 'ORS' from the command-line arguments and sets 'ARGV[1]' and +'ARGV[2]' to the null string, so that they are not treated as file names +(*note ARGC and ARGV::). + + The 'usage()' function prints an error message and exits. 
Finally, +the single rule handles the printing scheme outlined earlier, using +'print' or 'printf' as appropriate, depending upon the value of 'RT'. + + +File: gawk.info, Node: Igawk Program, Next: Anagram Program, Prev: Simple Sed, Up: Miscellaneous Programs + +11.3.9 An Easy Way to Use Library Functions +------------------------------------------- + +In *note Include Files::, we saw how 'gawk' provides a built-in +file-inclusion capability. However, this is a 'gawk' extension. This +minor node provides the motivation for making file inclusion available +for standard 'awk', and shows how to do it using a combination of shell +and 'awk' programming. + + Using library functions in 'awk' can be very beneficial. It +encourages code reuse and the writing of general functions. Programs +are smaller and therefore clearer. However, using library functions is +only easy when writing 'awk' programs; it is painful when running them, +requiring multiple '-f' options. If 'gawk' is unavailable, then so too +is the 'AWKPATH' environment variable and the ability to put 'awk' +functions into a library directory (*note Options::). It would be nice +to be able to write programs in the following manner: + + # library functions + @include getopt.awk + @include join.awk + ... + + # main program + BEGIN { + while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1) + ... + ... + } + + The following program, 'igawk.sh', provides this service. It +simulates 'gawk''s searching of the 'AWKPATH' variable and also allows +"nested" includes (i.e., a file that is included with '@include' can +contain further '@include' statements). 'igawk' makes an effort to only +include files once, so that nested includes don't accidentally include a +library function twice. + + 'igawk' should behave just like 'gawk' externally. This means it +should accept all of 'gawk''s command-line arguments, including the +ability to have multiple source files specified via '-f' and the ability +to mix command-line and library source files. + + The program is written using the POSIX Shell ('sh') command +language.(1) It works as follows: + + 1. Loop through the arguments, saving anything that doesn't represent + 'awk' source code for later, when the expanded program is run. + + 2. For any arguments that do represent 'awk' text, put the arguments + into a shell variable that will be expanded. There are two cases: + + a. Literal text, provided with '-e' or '--source'. This text is + just appended directly. + + b. Source file names, provided with '-f'. We use a neat trick + and append '@include FILENAME' to the shell variable's + contents. Because the file-inclusion program works the way + 'gawk' does, this gets the text of the file included in the + program at the correct point. + + 3. Run an 'awk' program (naturally) over the shell variable's contents + to expand '@include' statements. The expanded program is placed in + a second shell variable. + + 4. Run the expanded program with 'gawk' and any other original + command-line arguments that the user supplied (such as the data + file names). + + This program uses shell variables extensively: for storing +command-line arguments and the text of the 'awk' program that will +expand the user's program, for the user's original program, and for the +expanded program. Doing so removes some potential problems that might +arise were we to use temporary files instead, at the cost of making the +script somewhat more complicated. 
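+
+   Before walking through the script, it may help to see the heart of
+step 3 in isolation.  The following fragment is only a sketch of what
+"expanding '@include' statements" means; unlike the real 'expand_prog'
+program shown later, it does no 'AWKPATH' searching, handles no nested
+includes, and does not detect duplicates:
+
+     # include-sketch.awk --- naive @include expansion (sketch only)
+     {
+         if (tolower($1) != "@include") {
+             print               # ordinary program text passes through
+             next
+         }
+         while ((getline line < $2) > 0)  # copy the included file's text
+             print line
+         close($2)
+     }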
+ + The initial part of the program turns on shell tracing if the first +argument is 'debug'. + + The next part loops through all the command-line arguments. There +are several cases of interest: + +'--' + This ends the arguments to 'igawk'. Anything else should be passed + on to the user's 'awk' program without being evaluated. + +'-W' + This indicates that the next option is specific to 'gawk'. To make + argument processing easier, the '-W' is appended to the front of + the remaining arguments and the loop continues. (This is an 'sh' + programming trick. Don't worry about it if you are not familiar + with 'sh'.) + +'-v', '-F' + These are saved and passed on to 'gawk'. + +'-f', '--file', '--file=', '-Wfile=' + The file name is appended to the shell variable 'program' with an + '@include' statement. The 'expr' utility is used to remove the + leading option part of the argument (e.g., '--file='). (Typical + 'sh' usage would be to use the 'echo' and 'sed' utilities to do + this work. Unfortunately, some versions of 'echo' evaluate escape + sequences in their arguments, possibly mangling the program text. + Using 'expr' avoids this problem.) + +'--source', '--source=', '-Wsource=' + The source text is appended to 'program'. + +'--version', '-Wversion' + 'igawk' prints its version number, runs 'gawk --version' to get the + 'gawk' version information, and then exits. + + If none of the '-f', '--file', '-Wfile', '--source', or '-Wsource' +arguments are supplied, then the first nonoption argument should be the +'awk' program. If there are no command-line arguments left, 'igawk' +prints an error message and exits. Otherwise, the first argument is +appended to 'program'. In any case, after the arguments have been +processed, the shell variable 'program' contains the complete text of +the original 'awk' program. + + The program is as follows: + + #! /bin/sh + # igawk --- like gawk but do @include processing + + if [ "$1" = debug ] + then + set -x + shift + fi + + # A literal newline, so that program text is formatted correctly + n=' + ' + + # Initialize variables to empty + program= + opts= + + while [ $# -ne 0 ] # loop over arguments + do + case $1 in + --) shift + break ;; + + -W) shift + # The ${x?'message here'} construct prints a + # diagnostic if $x is the null string + set -- -W"${@?'missing operand'}" + continue ;; + + -[vF]) opts="$opts $1 '${2?'missing operand'}'" + shift ;; + + -[vF]*) opts="$opts '$1'" ;; + + -f) program="$program$n@include ${2?'missing operand'}" + shift ;; + + -f*) f=$(expr "$1" : '-f\(.*\)') + program="$program$n@include $f" ;; + + -[W-]file=*) + f=$(expr "$1" : '-.file=\(.*\)') + program="$program$n@include $f" ;; + + -[W-]file) + program="$program$n@include ${2?'missing operand'}" + shift ;; + + -[W-]source=*) + t=$(expr "$1" : '-.source=\(.*\)') + program="$program$n$t" ;; + + -[W-]source) + program="$program$n${2?'missing operand'}" + shift ;; + + -[W-]version) + echo igawk: version 3.0 1>&2 + gawk --version + exit 0 ;; + + -[W-]*) opts="$opts '$1'" ;; + + *) break ;; + esac + shift + done + + if [ -z "$program" ] + then + program=${1?'missing program'} + shift + fi + + # At this point, `program' has the program. + + The 'awk' program to process '@include' directives is stored in the +shell variable 'expand_prog'. Doing this keeps the shell script +readable. The 'awk' program reads through the user's program, one line +at a time, using 'getline' (*note Getline::). The input file names and +'@include' statements are managed using a stack. 
As each '@include' is +encountered, the current file name is "pushed" onto the stack and the +file named in the '@include' directive becomes the current file name. +As each file is finished, the stack is "popped," and the previous input +file becomes the current input file again. The process is started by +making the original file the first one on the stack. + + The 'pathto()' function does the work of finding the full path to a +file. It simulates 'gawk''s behavior when searching the 'AWKPATH' +environment variable (*note AWKPATH Variable::). If a file name has a +'/' in it, no path search is done. Similarly, if the file name is +'"-"', then that string is used as-is. Otherwise, the file name is +concatenated with the name of each directory in the path, and an attempt +is made to open the generated file name. The only way to test if a file +can be read in 'awk' is to go ahead and try to read it with 'getline'; +this is what 'pathto()' does.(2) If the file can be read, it is closed +and the file name is returned: + + expand_prog=' + + function pathto(file, i, t, junk) + { + if (index(file, "/") != 0) + return file + + if (file == "-") + return file + + for (i = 1; i <= ndirs; i++) { + t = (pathlist[i] "/" file) + if ((getline junk < t) > 0) { + # found it + close(t) + return t + } + } + return "" + } + + The main program is contained inside one 'BEGIN' rule. The first +thing it does is set up the 'pathlist' array that 'pathto()' uses. +After splitting the path on ':', null elements are replaced with '"."', +which represents the current directory: + + BEGIN { + path = ENVIRON["AWKPATH"] + ndirs = split(path, pathlist, ":") + for (i = 1; i <= ndirs; i++) { + if (pathlist[i] == "") + pathlist[i] = "." + } + + The stack is initialized with 'ARGV[1]', which will be +'"/dev/stdin"'. The main loop comes next. Input lines are read in +succession. Lines that do not start with '@include' are printed +verbatim. If the line does start with '@include', the file name is in +'$2'. 'pathto()' is called to generate the full path. If it cannot, +then the program prints an error message and continues. + + The next thing to check is if the file is included already. The +'processed' array is indexed by the full file name of each included file +and it tracks this information for us. If the file is seen again, a +warning message is printed. Otherwise, the new file name is pushed onto +the stack and processing continues. + + Finally, when 'getline' encounters the end of the input file, the +file is closed and the stack is popped. When 'stackptr' is less than +zero, the program is done: + + stackptr = 0 + input[stackptr] = ARGV[1] # ARGV[1] is first file + + for (; stackptr >= 0; stackptr--) { + while ((getline < input[stackptr]) > 0) { + if (tolower($1) != "@include") { + print + continue + } + fpath = pathto($2) + if (fpath == "") { + printf("igawk: %s:%d: cannot find %s\n", + input[stackptr], FNR, $2) > "/dev/stderr" + continue + } + if (! (fpath in processed)) { + processed[fpath] = input[stackptr] + input[++stackptr] = fpath # push onto stack + } else + print $2, "included in", input[stackptr], + "already included in", + processed[fpath] > "/dev/stderr" + } + close(input[stackptr]) + } + }' # close quote ends `expand_prog' variable + + processed_program=$(gawk -- "$expand_prog" /dev/stdin << EOF + $program + EOF + ) + + The shell construct 'COMMAND << MARKER' is called a "here document". +Everything in the shell script up to the MARKER is fed to COMMAND as +input. 
The shell processes the contents of the here document for +variable and command substitution (and possibly other things as well, +depending upon the shell). + + The shell construct '$(...)' is called "command substitution". The +output of the command inside the parentheses is substituted into the +command line. Because the result is used in a variable assignment, it +is saved as a single string, even if the results contain whitespace. + + The expanded program is saved in the variable 'processed_program'. +It's done in these steps: + + 1. Run 'gawk' with the '@include'-processing program (the value of the + 'expand_prog' shell variable) reading standard input. + + 2. Standard input is the contents of the user's program, from the + shell variable 'program'. Feed its contents to 'gawk' via a here + document. + + 3. Save the results of this processing in the shell variable + 'processed_program' by using command substitution. + + The last step is to call 'gawk' with the expanded program, along with +the original options and command-line arguments that the user supplied: + + eval gawk $opts -- '"$processed_program"' '"$@"' + + The 'eval' command is a shell construct that reruns the shell's +parsing process. This keeps things properly quoted. + + This version of 'igawk' represents the fifth version of this program. +There are four key simplifications that make the program work better: + + * Using '@include' even for the files named with '-f' makes building + the initial collected 'awk' program much simpler; all the + '@include' processing can be done once. + + * Not trying to save the line read with 'getline' in the 'pathto()' + function when testing for the file's accessibility for use with the + main program simplifies things considerably. + + * Using a 'getline' loop in the 'BEGIN' rule does it all in one + place. It is not necessary to call out to a separate loop for + processing nested '@include' statements. + + * Instead of saving the expanded program in a temporary file, putting + it in a shell variable avoids some potential security problems. + This has the disadvantage that the script relies upon more features + of the 'sh' language, making it harder to follow for those who + aren't familiar with 'sh'. + + Also, this program illustrates that it is often worthwhile to combine +'sh' and 'awk' programming together. You can usually accomplish quite a +lot, without having to resort to low-level programming in C or C++, and +it is frequently easier to do certain kinds of string and argument +manipulation using the shell than it is in 'awk'. + + Finally, 'igawk' shows that it is not always necessary to add new +features to a program; they can often be layered on top.(3) + + ---------- Footnotes ---------- + + (1) Fully explaining the 'sh' language is beyond the scope of this +book. We provide some minimal explanations, but see a good shell +programming book if you wish to understand things in more depth. + + (2) On some very old versions of 'awk', the test 'getline junk < t' +can loop forever if the file exists but is empty. + + (3) 'gawk' does '@include' processing itself in order to support the +use of 'awk' programs as Web CGI scripts. + + +File: gawk.info, Node: Anagram Program, Next: Signature Program, Prev: Igawk Program, Up: Miscellaneous Programs + +11.3.10 Finding Anagrams from a Dictionary +------------------------------------------ + +An interesting programming challenge is to search for "anagrams" in a +word list (such as '/usr/share/dict/words' on many GNU/Linux systems). 
+One word is an anagram of another if both words contain the same letters +(e.g., "babbling" and "blabbing"). + + Column 2, Problem C, of Jon Bentley's 'Programming Pearls', Second +Edition, presents an elegant algorithm. The idea is to give words that +are anagrams a common signature, sort all the words together by their +signatures, and then print them. Dr. Bentley observes that taking the +letters in each word and sorting them produces those common signatures. + + The following program uses arrays of arrays to bring together words +with the same signature and array sorting to print the words in sorted +order: + + # anagram.awk --- An implementation of the anagram-finding algorithm + # from Jon Bentley's "Programming Pearls," 2nd edition. + # Addison Wesley, 2000, ISBN 0-201-65788-0. + # Column 2, Problem C, section 2.8, pp 18-20. + + /'s$/ { next } # Skip possessives + + The program starts with a header, and then a rule to skip possessives +in the dictionary file. The next rule builds up the data structure. +The first dimension of the array is indexed by the signature; the second +dimension is the word itself: + + { + key = word2key($1) # Build signature + data[key][$1] = $1 # Store word with signature + } + + The 'word2key()' function creates the signature. It splits the word +apart into individual letters, sorts the letters, and then joins them +back together: + + # word2key --- split word apart into letters, sort, and join back together + + function word2key(word, a, i, n, result) + { + n = split(word, a, "") + asort(a) + + for (i = 1; i <= n; i++) + result = result a[i] + + return result + } + + Finally, the 'END' rule traverses the array and prints out the +anagram lists. It sends the output to the system 'sort' command because +otherwise the anagrams would appear in arbitrary order: + + END { + sort = "sort" + for (key in data) { + # Sort words with same key + nwords = asorti(data[key], words) + if (nwords == 1) + continue + + # And print. Minor glitch: trailing space at end of each line + for (j = 1; j <= nwords; j++) + printf("%s ", words[j]) | sort + print "" | sort + } + close(sort) + } + + Here is some partial output when the program is run: + + $ gawk -f anagram.awk /usr/share/dict/words | grep '^b' + ... + babbled blabbed + babbler blabber brabble + babblers blabbers brabbles + babbling blabbing + babbly blabby + babel bable + babels beslab + babery yabber + ... + + +File: gawk.info, Node: Signature Program, Prev: Anagram Program, Up: Miscellaneous Programs + +11.3.11 And Now for Something Completely Different +-------------------------------------------------- + +The following program was written by Davide Brini and is published on +his website (http://backreference.org/2011/02/03/obfuscated-awk/). It +serves as his signature in the Usenet group 'comp.lang.awk'. He +supplies the following copyright terms: + + Copyright (C) 2008 Davide Brini + + Copying and distribution of the code published in this page, with + or without modification, are permitted in any medium without + royalty provided the copyright notice and this notice are + preserved. + + Here is the program: + + awk 'BEGIN{O="~"~"~";o="=="=="==";o+=+o;x=O""O;while(X++<=x+o+o)c=c"%c"; + printf c,(x-O)*(x-O),x*(x-o)-o,x*(x-O)+x-O-o,+x*(x-O)-x+o,X*(o*o+O)+x-O, + X*(X-x)-o*o,(x+X)*o*o+o,x*(X-x)-O-O,x-O+(O+o+X+x)*(o+O),X*X-X*(x-O)-x+O, + O+X*(o*(o+O)+O),+x+O+X*o,x*(x-o),(o+X+x)*o*o-(x-O-O),O+(X-x)*(X+O),x-O}' + + We leave it to you to determine what the program does. 
(If you are +truly desperate to understand it, see Chris Johansen's explanation, +which is embedded in the Texinfo source file for this Info file.) + + +File: gawk.info, Node: Programs Summary, Next: Programs Exercises, Prev: Miscellaneous Programs, Up: Sample Programs + +11.4 Summary +============ + + * The programs provided in this major node continue on the theme that + reading programs is an excellent way to learn Good Programming. + + * Using '#!' to make 'awk' programs directly runnable makes them + easier to use. Otherwise, invoke the program using 'awk -f ...'. + + * Reimplementing standard POSIX programs in 'awk' is a pleasant + exercise; 'awk''s expressive power lets you write such programs in + relatively few lines of code, yet they are functionally complete + and usable. + + * One of standard 'awk''s weaknesses is working with individual + characters. The ability to use 'split()' with the empty string as + the separator can considerably simplify such tasks. + + * The examples here demonstrate the usefulness of the library + functions from *note Library Functions:: for a number of real (if + small) programs. + + * Besides reinventing POSIX wheels, other programs solved a selection + of interesting problems, such as finding duplicate words in text, + printing mailing labels, and finding anagrams. + + +File: gawk.info, Node: Programs Exercises, Prev: Programs Summary, Up: Sample Programs + +11.5 Exercises +============== + + 1. Rewrite 'cut.awk' (*note Cut Program::) using 'split()' with '""' + as the separator. + + 2. In *note Egrep Program::, we mentioned that 'egrep -i' could be + simulated in versions of 'awk' without 'IGNORECASE' by using + 'tolower()' on the line and the pattern. In a footnote there, we + also mentioned that this solution has a bug: the translated line is + output, and not the original one. Fix this problem. + + 3. The POSIX version of 'id' takes options that control which + information is printed. Modify the 'awk' version (*note Id + Program::) to accept the same arguments and perform in the same + way. + + 4. The 'split.awk' program (*note Split Program::) assumes that + letters are contiguous in the character set, which isn't true for + EBCDIC systems. Fix this problem. (Hint: Consider a different way + to work through the alphabet, without relying on 'ord()' and + 'chr()'.) + + 5. In 'uniq.awk' (*note Uniq Program::, the logic for choosing which + lines to print represents a "state machine", which is "a device + that can be in one of a set number of stable conditions depending + on its previous condition and on the present values of its + inputs."(1) Brian Kernighan suggests that "an alternative approach + to state machines is to just read the input into an array, then use + indexing. It's almost always easier code, and for most inputs + where you would use this, just as fast." Rewrite the logic to + follow this suggestion. + + 6. Why can't the 'wc.awk' program (*note Wc Program::) just use the + value of 'FNR' in 'endfile()'? Hint: Examine the code in *note + Filetrans Function::. + + 7. Manipulation of individual characters in the 'translate' program + (*note Translate Program::) is painful using standard 'awk' + functions. Given that 'gawk' can split strings into individual + characters using '""' as the separator, how might you use this + feature to simplify the program? + + 8. The 'extract.awk' program (*note Extract Program::) was written + before 'gawk' had the 'gensub()' function. Use it to simplify the + code. + + 9. 
Compare the performance of the 'awksed.awk' program (*note Simple + Sed::) with the more straightforward: + + BEGIN { + pat = ARGV[1] + repl = ARGV[2] + ARGV[1] = ARGV[2] = "" + } + + { gsub(pat, repl); print } + + 10. What are the advantages and disadvantages of 'awksed.awk' versus + the real 'sed' utility? + + 11. In *note Igawk Program::, we mentioned that not trying to save the + line read with 'getline' in the 'pathto()' function when testing + for the file's accessibility for use with the main program + simplifies things considerably. What problem does this engender + though? + + 12. As an additional example of the idea that it is not always + necessary to add new features to a program, consider the idea of + having two files in a directory in the search path: + + 'default.awk' + This file contains a set of default library functions, such as + 'getopt()' and 'assert()'. + + 'site.awk' + This file contains library functions that are specific to a + site or installation; i.e., locally developed functions. + Having a separate file allows 'default.awk' to change with new + 'gawk' releases, without requiring the system administrator to + update it each time by adding the local functions. + + One user suggested that 'gawk' be modified to automatically read + these files upon startup. Instead, it would be very simple to + modify 'igawk' to do this. Since 'igawk' can process nested + '@include' directives, 'default.awk' could simply contain + '@include' statements for the desired library functions. Make this + change. + + 13. Modify 'anagram.awk' (*note Anagram Program::), to avoid the use + of the external 'sort' utility. + + ---------- Footnotes ---------- + + (1) This is the definition returned from entering 'define: state +machine' into Google. + + +File: gawk.info, Node: Advanced Features, Next: Internationalization, Prev: Sample Programs, Up: Top + +12 Advanced Features of 'gawk' +****************************** + + Write documentation as if whoever reads it is a violent psychopath + who knows where you live. + -- _Steve English, as quoted by Peter Langston_ + + This major node discusses advanced features in 'gawk'. It's a bit of +a "grab bag" of items that are otherwise unrelated to each other. +First, we look at a command-line option that allows 'gawk' to recognize +nondecimal numbers in input data, not just in 'awk' programs. Then, +'gawk''s special features for sorting arrays are presented. Next, +two-way I/O, discussed briefly in earlier parts of this Info file, is +described in full detail, along with the basics of TCP/IP networking. +Finally, we see how 'gawk' can "profile" an 'awk' program, making it +possible to tune it for performance. + + Additional advanced features are discussed in separate major nodes of +their own: + + * *note Internationalization::, discusses how to internationalize + your 'awk' programs, so that they can speak multiple national + languages. + + * *note Debugger::, describes 'gawk''s built-in command-line debugger + for debugging 'awk' programs. + + * *note Arbitrary Precision Arithmetic::, describes how you can use + 'gawk' to perform arbitrary-precision arithmetic. + + * *note Dynamic Extensions::, discusses the ability to dynamically + add new built-in functions to 'gawk'. + +* Menu: + +* Nondecimal Data:: Allowing nondecimal input data. +* Array Sorting:: Facilities for controlling array traversal and + sorting arrays. +* Two-way I/O:: Two-way communications with another process. +* TCP/IP Networking:: Using 'gawk' for network programming. 
+* Profiling:: Profiling your 'awk' programs. +* Advanced Features Summary:: Summary of advanced features. + + +File: gawk.info, Node: Nondecimal Data, Next: Array Sorting, Up: Advanced Features + +12.1 Allowing Nondecimal Input Data +=================================== + +If you run 'gawk' with the '--non-decimal-data' option, you can have +nondecimal values in your input data: + + $ echo 0123 123 0x123 | + > gawk --non-decimal-data '{ printf "%d, %d, %d\n", $1, $2, $3 }' + -| 83, 123, 291 + + For this feature to work, write your program so that 'gawk' treats +your data as numeric: + + $ echo 0123 123 0x123 | gawk '{ print $1, $2, $3 }' + -| 0123 123 0x123 + +The 'print' statement treats its expressions as strings. Although the +fields can act as numbers when necessary, they are still strings, so +'print' does not try to treat them numerically. You need to add zero to +a field to force it to be treated as a number. For example: + + $ echo 0123 123 0x123 | gawk --non-decimal-data ' + > { print $1, $2, $3 + > print $1 + 0, $2 + 0, $3 + 0 }' + -| 0123 123 0x123 + -| 83 123 291 + + Because it is common to have decimal data with leading zeros, and +because using this facility could lead to surprising results, the +default is to leave it disabled. If you want it, you must explicitly +request it. + + CAUTION: _Use of this option is not recommended._ It can break old + programs very badly. Instead, use the 'strtonum()' function to + convert your data (*note String Functions::). This makes your + programs easier to write and easier to read, and leads to less + surprising results. + + This option may disappear in a future version of 'gawk'. + + +File: gawk.info, Node: Array Sorting, Next: Two-way I/O, Prev: Nondecimal Data, Up: Advanced Features + +12.2 Controlling Array Traversal and Array Sorting +================================================== + +'gawk' lets you control the order in which a 'for (INDX in ARRAY)' loop +traverses an array. + + In addition, two built-in functions, 'asort()' and 'asorti()', let +you sort arrays based on the array values and indices, respectively. +These two functions also provide control over the sorting criteria used +to order the elements during sorting. + +* Menu: + +* Controlling Array Traversal:: How to use PROCINFO["sorted_in"]. +* Array Sorting Functions:: How to use 'asort()' and 'asorti()'. + + +File: gawk.info, Node: Controlling Array Traversal, Next: Array Sorting Functions, Up: Array Sorting + +12.2.1 Controlling Array Traversal +---------------------------------- + +By default, the order in which a 'for (INDX in ARRAY)' loop scans an +array is not defined; it is generally based upon the internal +implementation of arrays inside 'awk'. + + Often, though, it is desirable to be able to loop over the elements +in a particular order that you, the programmer, choose. 'gawk' lets you +do this. + + *note Controlling Scanning:: describes how you can assign special, +predefined values to 'PROCINFO["sorted_in"]' in order to control the +order in which 'gawk' traverses an array during a 'for' loop. + + In addition, the value of 'PROCINFO["sorted_in"]' can be a function +name.(1) This lets you traverse an array based on any custom criterion. +The array elements are ordered according to the return value of this +function. 
The comparison function should be defined with at least four +arguments: + + function comp_func(i1, v1, i2, v2) + { + COMPARE ELEMENTS 1 AND 2 IN SOME FASHION + RETURN < 0; 0; OR > 0 + } + + Here, 'i1' and 'i2' are the indices, and 'v1' and 'v2' are the +corresponding values of the two elements being compared. Either 'v1' or +'v2', or both, can be arrays if the array being traversed contains +subarrays as values. (*Note Arrays of Arrays:: for more information +about subarrays.) The three possible return values are interpreted as +follows: + +'comp_func(i1, v1, i2, v2) < 0' + Index 'i1' comes before index 'i2' during loop traversal. + +'comp_func(i1, v1, i2, v2) == 0' + Indices 'i1' and 'i2' come together, but the relative order with + respect to each other is undefined. + +'comp_func(i1, v1, i2, v2) > 0' + Index 'i1' comes after index 'i2' during loop traversal. + + Our first comparison function can be used to scan an array in +numerical order of the indices: + + function cmp_num_idx(i1, v1, i2, v2) + { + # numerical index comparison, ascending order + return (i1 - i2) + } + + Our second function traverses an array based on the string order of +the element values rather than by indices: + + function cmp_str_val(i1, v1, i2, v2) + { + # string value comparison, ascending order + v1 = v1 "" + v2 = v2 "" + if (v1 < v2) + return -1 + return (v1 != v2) + } + + The third comparison function makes all numbers, and numeric strings +without any leading or trailing spaces, come out first during loop +traversal: + + function cmp_num_str_val(i1, v1, i2, v2, n1, n2) + { + # numbers before string value comparison, ascending order + n1 = v1 + 0 + n2 = v2 + 0 + if (n1 == v1) + return (n2 == v2) ? (n1 - n2) : -1 + else if (n2 == v2) + return 1 + return (v1 < v2) ? -1 : (v1 != v2) + } + + Here is a main program to demonstrate how 'gawk' behaves using each +of the previous functions: + + BEGIN { + data["one"] = 10 + data["two"] = 20 + data[10] = "one" + data[100] = 100 + data[20] = "two" + + f[1] = "cmp_num_idx" + f[2] = "cmp_str_val" + f[3] = "cmp_num_str_val" + for (i = 1; i <= 3; i++) { + printf("Sort function: %s\n", f[i]) + PROCINFO["sorted_in"] = f[i] + for (j in data) + printf("\tdata[%s] = %s\n", j, data[j]) + print "" + } + } + + Here are the results when the program is run: + + $ gawk -f compdemo.awk + -| Sort function: cmp_num_idx Sort by numeric index + -| data[two] = 20 + -| data[one] = 10 Both strings are numerically zero + -| data[10] = one + -| data[20] = two + -| data[100] = 100 + -| + -| Sort function: cmp_str_val Sort by element values as strings + -| data[one] = 10 + -| data[100] = 100 String 100 is less than string 20 + -| data[two] = 20 + -| data[10] = one + -| data[20] = two + -| + -| Sort function: cmp_num_str_val Sort all numeric values before all strings + -| data[one] = 10 + -| data[two] = 20 + -| data[100] = 100 + -| data[10] = one + -| data[20] = two + + Consider sorting the entries of a GNU/Linux system password file +according to login name. The following program sorts records by a +specific field position and can be used for this purpose: + + # passwd-sort.awk --- simple program to sort by field position + # field position is specified by the global variable POS + + function cmp_field(i1, v1, i2, v2) + { + # comparison by value, as string, and ascending order + return v1[POS] < v2[POS] ? 
-1 : (v1[POS] != v2[POS]) + } + + { + for (i = 1; i <= NF; i++) + a[NR][i] = $i + } + + END { + PROCINFO["sorted_in"] = "cmp_field" + if (POS < 1 || POS > NF) + POS = 1 + for (i in a) { + for (j = 1; j <= NF; j++) + printf("%s%c", a[i][j], j < NF ? ":" : "") + print "" + } + } + + The first field in each entry of the password file is the user's +login name, and the fields are separated by colons. Each record defines +a subarray, with each field as an element in the subarray. Running the +program produces the following output: + + $ gawk -v POS=1 -F: -f sort.awk /etc/passwd + -| adm:x:3:4:adm:/var/adm:/sbin/nologin + -| apache:x:48:48:Apache:/var/www:/sbin/nologin + -| avahi:x:70:70:Avahi daemon:/:/sbin/nologin + ... + + The comparison should normally always return the same value when +given a specific pair of array elements as its arguments. If +inconsistent results are returned, then the order is undefined. This +behavior can be exploited to introduce random order into otherwise +seemingly ordered data: + + function cmp_randomize(i1, v1, i2, v2) + { + # random order (caution: this may never terminate!) + return (2 - 4 * rand()) + } + + As already mentioned, the order of the indices is arbitrary if two +elements compare equal. This is usually not a problem, but letting the +tied elements come out in arbitrary order can be an issue, especially +when comparing item values. The partial ordering of the equal elements +may change the next time the array is traversed, if other elements are +added to or removed from the array. One way to resolve ties when +comparing elements with otherwise equal values is to include the indices +in the comparison rules. Note that doing this may make the loop +traversal less efficient, so consider it only if necessary. The +following comparison functions force a deterministic order, and are +based on the fact that the (string) indices of two elements are never +equal: + + function cmp_numeric(i1, v1, i2, v2) + { + # numerical value (and index) comparison, descending order + return (v1 != v2) ? (v2 - v1) : (i2 - i1) + } + + function cmp_string(i1, v1, i2, v2) + { + # string value (and index) comparison, descending order + v1 = v1 i1 + v2 = v2 i2 + return (v1 > v2) ? -1 : (v1 != v2) + } + + A custom comparison function can often simplify ordered loop +traversal, and the sky is really the limit when it comes to designing +such a function. + + When string comparisons are made during a sort, either for element +values where one or both aren't numbers, or for element indices handled +as strings, the value of 'IGNORECASE' (*note Built-in Variables::) +controls whether the comparisons treat corresponding upper- and +lowercase letters as equivalent or distinct. + + Another point to keep in mind is that in the case of subarrays, the +element values can themselves be arrays; a production comparison +function should use the 'isarray()' function (*note Type Functions::) to +check for this, and choose a defined sorting order for subarrays. + + All sorting based on 'PROCINFO["sorted_in"]' is disabled in POSIX +mode, because the 'PROCINFO' array is not special in that case. + + As a side note, sorting the array indices before traversing the array +has been reported to add a 15% to 20% overhead to the execution time of +'awk' programs. For this reason, sorted array traversal is not the +default. + + ---------- Footnotes ---------- + + (1) This is why the predefined sorting orders start with an '@' +character, which cannot be part of an identifier. 
+ + +File: gawk.info, Node: Array Sorting Functions, Prev: Controlling Array Traversal, Up: Array Sorting + +12.2.2 Sorting Array Values and Indices with 'gawk' +--------------------------------------------------- + +In most 'awk' implementations, sorting an array requires writing a +'sort()' function. This can be educational for exploring different +sorting algorithms, but usually that's not the point of the program. +'gawk' provides the built-in 'asort()' and 'asorti()' functions (*note +String Functions::) for sorting arrays. For example: + + POPULATE THE ARRAY data + n = asort(data) + for (i = 1; i <= n; i++) + DO SOMETHING WITH data[i] + + After the call to 'asort()', the array 'data' is indexed from 1 to +some number N, the total number of elements in 'data'. (This count is +'asort()''s return value.) 'data[1]' <= 'data[2]' <= 'data[3]', and so +on. The default comparison is based on the type of the elements (*note +Typing and Comparison::). All numeric values come before all string +values, which in turn come before all subarrays. + + An important side effect of calling 'asort()' is that _the array's +original indices are irrevocably lost_. As this isn't always desirable, +'asort()' accepts a second argument: + + POPULATE THE ARRAY source + n = asort(source, dest) + for (i = 1; i <= n; i++) + DO SOMETHING WITH dest[i] + + In this case, 'gawk' copies the 'source' array into the 'dest' array +and then sorts 'dest', destroying its indices. However, the 'source' +array is not affected. + + Often, what's needed is to sort on the values of the _indices_ +instead of the values of the elements. To do that, use the 'asorti()' +function. The interface and behavior are identical to that of +'asort()', except that the index values are used for sorting and become +the values of the result array: + + { source[$0] = some_func($0) } + + END { + n = asorti(source, dest) + for (i = 1; i <= n; i++) { + Work with sorted indices directly: + DO SOMETHING WITH dest[i] + ... + Access original array via sorted indices: + DO SOMETHING WITH source[dest[i]] + } + } + + So far, so good. Now it starts to get interesting. Both 'asort()' +and 'asorti()' accept a third string argument to control comparison of +array elements. When we introduced 'asort()' and 'asorti()' in *note +String Functions::, we ignored this third argument; however, now is the +time to describe how this argument affects these two functions. + + Basically, the third argument specifies how the array is to be +sorted. There are two possibilities. As with 'PROCINFO["sorted_in"]', +this argument may be one of the predefined names that 'gawk' provides +(*note Controlling Scanning::), or it may be the name of a user-defined +function (*note Controlling Array Traversal::). + + In the latter case, _the function can compare elements in any way it +chooses_, taking into account just the indices, just the values, or +both. This is extremely powerful. + + Once the array is sorted, 'asort()' takes the _values_ in their final +order and uses them to fill in the result array, whereas 'asorti()' +takes the _indices_ in their final order and uses them to fill in the +result array. + + NOTE: Copying array indices and elements isn't expensive in terms + of memory. Internally, 'gawk' maintains "reference counts" to + data. For example, when 'asort()' copies the first array to the + second one, there is only one copy of the original array elements' + data, even though both arrays use the values. 
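+
+   As a small illustration of the first possibility, one of the
+predefined ordering names described in *note Controlling Scanning:: may
+be passed directly as the third argument.  The following sketch (the
+array contents are made up purely for illustration) sorts the values of
+'data' in descending numeric order:
+
+     BEGIN {
+         data[1] = 30; data[2] = 10; data[3] = 20
+
+         n = asort(data, dest, "@val_num_desc")
+         for (i = 1; i <= n; i++)
+             print dest[i]       # prints 30, 20, and 10, one per line
+     }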
+ + Because 'IGNORECASE' affects string comparisons, the value of +'IGNORECASE' also affects sorting for both 'asort()' and 'asorti()'. +Note also that the locale's sorting order does _not_ come into play; +comparisons are based on character values only.(1) + + The following example demonstrates the use of a comparison function +with 'asort()'. The comparison function, 'case_fold_compare()', maps +both values to lowercase in order to compare them ignoring case. + + # case_fold_compare --- compare as strings, ignoring case + + function case_fold_compare(i1, v1, i2, v2, l, r) + { + l = tolower(v1) + r = tolower(v2) + + if (l < r) + return -1 + else if (l == r) + return 0 + else + return 1 + } + + And here is the test program for it: + + # Test program + + BEGIN { + Letters = "abcdefghijklmnopqrstuvwxyz" \ + "ABCDEFGHIJKLMNOPQRSTUVWXYZ" + split(Letters, data, "") + + asort(data, result, "case_fold_compare") + + j = length(result) + for (i = 1; i <= j; i++) { + printf("%s", result[i]) + if (i % (j/2) == 0) + printf("\n") + else + printf(" ") + } + } + + When run, we get the following: + + $ gawk -f case_fold_compare.awk + -| A a B b c C D d e E F f g G H h i I J j k K l L M m + -| n N O o p P Q q r R S s t T u U V v w W X x y Y z Z + + ---------- Footnotes ---------- + + (1) This is true because locale-based comparison occurs only when in +POSIX-compatibility mode, and because 'asort()' and 'asorti()' are +'gawk' extensions, they are not available in that case. + + +File: gawk.info, Node: Two-way I/O, Next: TCP/IP Networking, Prev: Array Sorting, Up: Advanced Features + +12.3 Two-Way Communications with Another Process +================================================ + +It is often useful to be able to send data to a separate program for +processing and then read the result. This can always be done with +temporary files: + + # Write the data for processing + tempfile = ("mydata." PROCINFO["pid"]) + while (NOT DONE WITH DATA) + print DATA | ("subprogram > " tempfile) + close("subprogram > " tempfile) + + # Read the results, remove tempfile when done + while ((getline newdata < tempfile) > 0) + PROCESS newdata APPROPRIATELY + close(tempfile) + system("rm " tempfile) + +This works, but not elegantly. Among other things, it requires that the +program be run in a directory that cannot be shared among users; for +example, '/tmp' will not do, as another user might happen to be using a +temporary file with the same name.(1) + + However, with 'gawk', it is possible to open a _two-way_ pipe to +another process. The second process is termed a "coprocess", as it runs +in parallel with 'gawk'. The two-way connection is created using the +'|&' operator (borrowed from the Korn shell, 'ksh'):(2) + + do { + print DATA |& "subprogram" + "subprogram" |& getline results + } while (DATA LEFT TO PROCESS) + close("subprogram") + + The first time an I/O operation is executed using the '|&' operator, +'gawk' creates a two-way pipeline to a child process that runs the other +program. Output created with 'print' or 'printf' is written to the +program's standard input, and output from the program's standard output +can be read by the 'gawk' program using 'getline'. As is the case with +processes started by '|', the subprogram can be any program, or pipeline +of programs, that can be started by the shell. + + There are some cautionary items to be aware of: + + * As the code inside 'gawk' currently stands, the coprocess's + standard error goes to the same place that the parent 'gawk''s + standard error goes. 
It is not possible to read the child's + standard error separately. + + * I/O buffering may be a problem. 'gawk' automatically flushes all + output down the pipe to the coprocess. However, if the coprocess + does not flush its output, 'gawk' may hang when doing a 'getline' + in order to read the coprocess's results. This could lead to a + situation known as "deadlock", where each process is waiting for + the other one to do something. + + It is possible to close just one end of the two-way pipe to a +coprocess, by supplying a second argument to the 'close()' function of +either '"to"' or '"from"' (*note Close Files And Pipes::). These +strings tell 'gawk' to close the end of the pipe that sends data to the +coprocess or the end that reads from it, respectively. + + This is particularly necessary in order to use the system 'sort' +utility as part of a coprocess; 'sort' must read _all_ of its input data +before it can produce any output. The 'sort' program does not receive +an end-of-file indication until 'gawk' closes the write end of the pipe. + + When you have finished writing data to the 'sort' utility, you can +close the '"to"' end of the pipe, and then start reading sorted data via +'getline'. For example: + + BEGIN { + command = "LC_ALL=C sort" + n = split("abcdefghijklmnopqrstuvwxyz", a, "") + + for (i = n; i > 0; i--) + print a[i] |& command + close(command, "to") + + while ((command |& getline line) > 0) + print "got", line + close(command) + } + + This program writes the letters of the alphabet in reverse order, one +per line, down the two-way pipe to 'sort'. It then closes the write end +of the pipe, so that 'sort' receives an end-of-file indication. This +causes 'sort' to sort the data and write the sorted data back to the +'gawk' program. Once all of the data has been read, 'gawk' terminates +the coprocess and exits. + + As a side note, the assignment 'LC_ALL=C' in the 'sort' command +ensures traditional Unix (ASCII) sorting from 'sort'. This is not +strictly necessary here, but it's good to know how to do this. + + Be careful when closing the '"from"' end of a two-way pipe; in this +case 'gawk' waits for the child process to exit, which may cause your +program to hang. (Thus, this particular feature is of much less use in +practice than being able to close the '"to"' end.) + + CAUTION: Normally, it is a fatal error to write to the '"to"' end + of a two-way pipe which has been closed, and it is also a fatal + error to read from the '"from"' end of a two-way pipe that has been + closed. + + You may set 'PROCINFO["COMMAND", "NONFATAL"]' to make such + operations become nonfatal, in which case you then need to check + 'ERRNO' after each 'print', 'printf', or 'getline'. *Note + Nonfatal::, for more information. + + You may also use pseudo-ttys (ptys) for two-way communication instead +of pipes, if your system supports them. This is done on a per-command +basis, by setting a special element in the 'PROCINFO' array (*note +Auto-set::), like so: + + command = "sort -nr" # command, save in convenience variable + PROCINFO[command, "pty"] = 1 # update PROCINFO + print ... |& command # start two-way pipe + ... + +If your system does not have ptys, or if all the system's ptys are in +use, 'gawk' automatically falls back to using regular pipes. + + Using ptys usually avoids the buffer deadlock issues described +earlier, at some loss in performance. This is because the tty driver +buffers and sends data line-by-line. 
On systems with the 'stdbuf' (part +of the GNU Coreutils package +(http://www.gnu.org/software/coreutils/coreutils.html)), you can use +that program instead of ptys. + + Note also that ptys are not fully transparent. Certain binary +control codes, such 'Ctrl-d' for end-of-file, are interpreted by the tty +driver and not passed through. + + CAUTION: Finally, coprocesses open up the possibility of "deadlock" + between 'gawk' and the program running in the coprocess. This can + occur if you send "too much" data to the coprocess before reading + any back; each process is blocked writing data with noone available + to read what they've already written. There is no workaround for + deadlock; careful programming and knowledge of the behavior of the + coprocess are required. + + ---------- Footnotes ---------- + + (1) Michael Brennan suggests the use of 'rand()' to generate unique +file names. This is a valid point; nevertheless, temporary files remain +more difficult to use than two-way pipes. + + (2) This is very different from the same operator in the C shell and +in Bash. + + +File: gawk.info, Node: TCP/IP Networking, Next: Profiling, Prev: Two-way I/O, Up: Advanced Features + +12.4 Using 'gawk' for Network Programming +========================================= + + 'EMRED': + A host is a host from coast to coast, + and nobody talks to a host that's close, + unless the host that isn't close + is busy, hung, or dead. + -- _Mike O'Brien (aka Mr. Protocol)_ + + In addition to being able to open a two-way pipeline to a coprocess +on the same system (*note Two-way I/O::), it is possible to make a +two-way connection to another process on another system across an IP +network connection. + + You can think of this as just a _very long_ two-way pipeline to a +coprocess. The way 'gawk' decides that you want to use TCP/IP +networking is by recognizing special file names that begin with one of +'/inet/', '/inet4/', or '/inet6/'. + + The full syntax of the special file name is +'/NET-TYPE/PROTOCOL/LOCAL-PORT/REMOTE-HOST/REMOTE-PORT'. The components +are: + +NET-TYPE + Specifies the kind of Internet connection to make. Use '/inet4/' + to force IPv4, and '/inet6/' to force IPv6. Plain '/inet/' (which + used to be the only option) uses the system default, most likely + IPv4. + +PROTOCOL + The protocol to use over IP. This must be either 'tcp', or 'udp', + for a TCP or UDP IP connection, respectively. TCP should be used + for most applications. + +LOCAL-PORT + The local TCP or UDP port number to use. Use a port number of '0' + when you want the system to pick a port. This is what you should + do when writing a TCP or UDP client. You may also use a well-known + service name, such as 'smtp' or 'http', in which case 'gawk' + attempts to determine the predefined port number using the C + 'getaddrinfo()' function. + +REMOTE-HOST + The IP address or fully qualified domain name of the Internet host + to which you want to connect. + +REMOTE-PORT + The TCP or UDP port number to use on the given REMOTE-HOST. Again, + use '0' if you don't care, or else a well-known service name. + + NOTE: Failure in opening a two-way socket will result in a nonfatal + error being returned to the calling code. The value of 'ERRNO' + indicates the error (*note Auto-set::). + + Consider the following very simple example: + + BEGIN { + Service = "/inet/tcp/0/localhost/daytime" + Service |& getline + print $0 + close(Service) + } + + This program reads the current date and time from the local system's +TCP 'daytime' server. 
It then prints the results and closes the +connection. + + Because this topic is extensive, the use of 'gawk' for TCP/IP +programming is documented separately. See *note (General Introduction, +gawkinet, TCP/IP Internetworking with 'gawk')Top::, for a much more +complete introduction and discussion, as well as extensive examples. + + NOTE: 'gawk' can only open direct sockets. There is currently no + way to access services available over Secure Socket Layer (SSL); + this includes any web service whose URL starts with 'https://'. + + +File: gawk.info, Node: Profiling, Next: Advanced Features Summary, Prev: TCP/IP Networking, Up: Advanced Features + +12.5 Profiling Your 'awk' Programs +================================== + +You may produce execution traces of your 'awk' programs. This is done +by passing the option '--profile' to 'gawk'. When 'gawk' has finished +running, it creates a profile of your program in a file named +'awkprof.out'. Because it is profiling, it also executes up to 45% +slower than 'gawk' normally does. + + As shown in the following example, the '--profile' option can be used +to change the name of the file where 'gawk' will write the profile: + + gawk --profile=myprog.prof -f myprog.awk data1 data2 + +In the preceding example, 'gawk' places the profile in 'myprog.prof' +instead of in 'awkprof.out'. + + Here is a sample session showing a simple 'awk' program, its input +data, and the results from running 'gawk' with the '--profile' option. +First, the 'awk' program: + + BEGIN { print "First BEGIN rule" } + + END { print "First END rule" } + + /foo/ { + print "matched /foo/, gosh" + for (i = 1; i <= 3; i++) + sing() + } + + { + if (/foo/) + print "if is true" + else + print "else is true" + } + + BEGIN { print "Second BEGIN rule" } + + END { print "Second END rule" } + + function sing( dummy) + { + print "I gotta be me!" + } + + Following is the input data: + + foo + bar + baz + foo + junk + + Here is the 'awkprof.out' that results from running the 'gawk' +profiler on this program and data (this example also illustrates that +'awk' programmers sometimes get up very early in the morning to work): + + # gawk profile, created Mon Sep 29 05:16:21 2014 + + # BEGIN rule(s) + + BEGIN { + 1 print "First BEGIN rule" + } + + BEGIN { + 1 print "Second BEGIN rule" + } + + # Rule(s) + + 5 /foo/ { # 2 + 2 print "matched /foo/, gosh" + 6 for (i = 1; i <= 3; i++) { + 6 sing() + } + } + + 5 { + 5 if (/foo/) { # 2 + 2 print "if is true" + 3 } else { + 3 print "else is true" + } + } + + # END rule(s) + + END { + 1 print "First END rule" + } + + END { + 1 print "Second END rule" + } + + + # Functions, listed alphabetically + + 6 function sing(dummy) + { + 6 print "I gotta be me!" + } + + This example illustrates many of the basic features of profiling +output. They are as follows: + + * The program is printed in the order 'BEGIN' rules, 'BEGINFILE' + rules, pattern-action rules, 'ENDFILE' rules, 'END' rules, and + functions, listed alphabetically. Multiple 'BEGIN' and 'END' rules + retain their separate identities, as do multiple 'BEGINFILE' and + 'ENDFILE' rules. + + * Pattern-action rules have two counts. The first count, to the left + of the rule, shows how many times the rule's pattern was _tested_. + The second count, to the right of the rule's opening left brace in + a comment, shows how many times the rule's action was _executed_. + The difference between the two indicates how many times the rule's + pattern evaluated to false. 
+ + * Similarly, the count for an 'if'-'else' statement shows how many + times the condition was tested. To the right of the opening left + brace for the 'if''s body is a count showing how many times the + condition was true. The count for the 'else' indicates how many + times the test failed. + + * The count for a loop header (such as 'for' or 'while') shows how + many times the loop test was executed. (Because of this, you can't + just look at the count on the first statement in a rule to + determine how many times the rule was executed. If the first + statement is a loop, the count is misleading.) + + * For user-defined functions, the count next to the 'function' + keyword indicates how many times the function was called. The + counts next to the statements in the body show how many times those + statements were executed. + + * The layout uses "K&R" style with TABs. Braces are used everywhere, + even when the body of an 'if', 'else', or loop is only a single + statement. + + * Parentheses are used only where needed, as indicated by the + structure of the program and the precedence rules. For example, + '(3 + 5) * 4' means add three and five, then multiply the total by + four. However, '3 + 5 * 4' has no parentheses, and means '3 + (5 * + 4)'. + + * Parentheses are used around the arguments to 'print' and 'printf' + only when the 'print' or 'printf' statement is followed by a + redirection. Similarly, if the target of a redirection isn't a + scalar, it gets parenthesized. + + * 'gawk' supplies leading comments in front of the 'BEGIN' and 'END' + rules, the 'BEGINFILE' and 'ENDFILE' rules, the pattern-action + rules, and the functions. + + The profiled version of your program may not look exactly like what +you typed when you wrote it. This is because 'gawk' creates the +profiled version by "pretty-printing" its internal representation of the +program. The advantage to this is that 'gawk' can produce a standard +representation. Also, things such as: + + /foo/ + +come out as: + + /foo/ { + print $0 + } + +which is correct, but possibly unexpected. + + Besides creating profiles when a program has completed, 'gawk' can +produce a profile while it is running. This is useful if your 'awk' +program goes into an infinite loop and you want to see what has been +executed. To use this feature, run 'gawk' with the '--profile' option +in the background: + + $ gawk --profile -f myprog & + [1] 13992 + +The shell prints a job number and process ID number; in this case, +13992. Use the 'kill' command to send the 'USR1' signal to 'gawk': + + $ kill -USR1 13992 + +As usual, the profiled version of the program is written to +'awkprof.out', or to a different file if one was specified with the +'--profile' option. + + Along with the regular profile, as shown earlier, the profile file +includes a trace of any active functions: + + # Function Call Stack: + + # 3. baz + # 2. bar + # 1. foo + # -- main -- + + You may send 'gawk' the 'USR1' signal as many times as you like. +Each time, the profile and function call trace are appended to the +output profile file. + + If you use the 'HUP' signal instead of the 'USR1' signal, 'gawk' +produces the profile and the function call trace and then exits. + + When 'gawk' runs on MS-Windows systems, it uses the 'INT' and 'QUIT' +signals for producing the profile, and in the case of the 'INT' signal, +'gawk' exits. This is because these systems don't support the 'kill' +command, so the only signals you can deliver to a program are those +generated by the keyboard. 
The 'INT' signal is generated by the +'Ctrl-c' or 'Ctrl-BREAK' key, while the 'QUIT' signal is generated by +the 'Ctrl-\' key. + + Finally, 'gawk' also accepts another option, '--pretty-print'. When +called this way, 'gawk' "pretty-prints" the program into 'awkprof.out', +without any execution counts. + + NOTE: Once upon a time, the '--pretty-print' option would also run + your program. This is is no longer the case. + + There is a significant difference between the output created when +profiling, and that created when pretty-printing. Pretty-printed output +preserves the original comments that were in the program, although their +placement may not correspond exactly to their original locations in the +source code.(1) + + However, as a deliberate design decision, profiling output _omits_ +the original program's comments. This allows you to focus on the +execution count data and helps you avoid the temptation to use the +profiler for pretty-printing. + + Additionally, pretty-printed output does not have the leading +indentation that the profiling output does. This makes it easy to +pretty-print your code once development is completed, and then use the +result as the final version of your program. + + Because the internal representation of your program is formatted to +recreate an 'awk' program, profiling and pretty-printing automatically +disable 'gawk''s default optimizations. + + ---------- Footnotes ---------- + + (1) 'gawk' does the best it can to preserve the distinction between +comments at the end of a statement and comments on lines by themselves. +Due to implementation constraints, it does not always do so correctly, +particularly for 'switch' statements. The 'gawk' maintainers hope to +improve this in a subsequent release. + + +File: gawk.info, Node: Advanced Features Summary, Prev: Profiling, Up: Advanced Features + +12.6 Summary +============ + + * The '--non-decimal-data' option causes 'gawk' to treat octal- and + hexadecimal-looking input data as octal and hexadecimal. This + option should be used with caution or not at all; use of + 'strtonum()' is preferable. Note that this option may disappear in + a future version of 'gawk'. + + * You can take over complete control of sorting in 'for (INDX in + ARRAY)' array traversal by setting 'PROCINFO["sorted_in"]' to the + name of a user-defined function that does the comparison of array + elements based on index and value. + + * Similarly, you can supply the name of a user-defined comparison + function as the third argument to either 'asort()' or 'asorti()' to + control how those functions sort arrays. Or you may provide one of + the predefined control strings that work for + 'PROCINFO["sorted_in"]'. + + * You can use the '|&' operator to create a two-way pipe to a + coprocess. You read from the coprocess with 'getline' and write to + it with 'print' or 'printf'. Use 'close()' to close off the + coprocess completely, or optionally, close off one side of the + two-way communications. + + * By using special file names with the '|&' operator, you can open a + TCP/IP (or UDP/IP) connection to remote hosts on the Internet. + 'gawk' supports both IPv4 and IPv6. + + * You can generate statement count profiles of your program. This + can help you determine which parts of your program may be taking + the most time and let you tune them more easily. Sending the + 'USR1' signal while profiling causes 'gawk' to dump the profile and + keep going, including a function call stack. + + * You can also just "pretty-print" the program. 
+ + +File: gawk.info, Node: Internationalization, Next: Debugger, Prev: Advanced Features, Up: Top + +13 Internationalization with 'gawk' +*********************************** + +Once upon a time, computer makers wrote software that worked only in +English. Eventually, hardware and software vendors noticed that if +their systems worked in the native languages of non-English-speaking +countries, they were able to sell more systems. As a result, +internationalization and localization of programs and software systems +became a common practice. + + For many years, the ability to provide internationalization was +largely restricted to programs written in C and C++. This major node +describes the underlying library 'gawk' uses for internationalization, +as well as how 'gawk' makes internationalization features available at +the 'awk' program level. Having internationalization available at the +'awk' level gives software developers additional flexibility--they are +no longer forced to write in C or C++ when internationalization is a +requirement. + +* Menu: + +* I18N and L10N:: Internationalization and Localization. +* Explaining gettext:: How GNU 'gettext' works. +* Programmer i18n:: Features for the programmer. +* Translator i18n:: Features for the translator. +* I18N Example:: A simple i18n example. +* Gawk I18N:: 'gawk' is also internationalized. +* I18N Summary:: Summary of I18N stuff. + + +File: gawk.info, Node: I18N and L10N, Next: Explaining gettext, Up: Internationalization + +13.1 Internationalization and Localization +========================================== + +"Internationalization" means writing (or modifying) a program once, in +such a way that it can use multiple languages without requiring further +source code changes. "Localization" means providing the data necessary +for an internationalized program to work in a particular language. Most +typically, these terms refer to features such as the language used for +printing error messages, the language used to read responses, and +information related to how numerical and monetary values are printed and +read. + + +File: gawk.info, Node: Explaining gettext, Next: Programmer i18n, Prev: I18N and L10N, Up: Internationalization + +13.2 GNU 'gettext' +================== + +'gawk' uses GNU 'gettext' to provide its internationalization features. +The facilities in GNU 'gettext' focus on messages: strings printed by a +program, either directly or via formatting with 'printf' or +'sprintf()'.(1) + + When using GNU 'gettext', each application has its own "text domain". +This is a unique name, such as 'kpilot' or 'gawk', that identifies the +application. A complete application may have multiple +components--programs written in C or C++, as well as scripts written in +'sh' or 'awk'. All of the components use the same text domain. + + To make the discussion concrete, assume we're writing an application +named 'guide'. Internationalization consists of the following steps, in +this order: + + 1. The programmer reviews the source for all of 'guide''s components + and marks each string that is a candidate for translation. For + example, '"`-F': option required"' is a good candidate for + translation. A table with strings of option names is not (e.g., + 'gawk''s '--profile' option should remain the same, no matter what + the local language). + + 2. The programmer indicates the application's text domain ('"guide"') + to the 'gettext' library, by calling the 'textdomain()' function. + + 3. 
Messages from the application are extracted from the source code + and collected into a portable object template file ('guide.pot'), + which lists the strings and their translations. The translations + are initially empty. The original (usually English) messages serve + as the key for lookup of the translations. + + 4. For each language with a translator, 'guide.pot' is copied to a + portable object file ('.po') and translations are created and + shipped with the application. For example, there might be a + 'fr.po' for a French translation. + + 5. Each language's '.po' file is converted into a binary message + object ('.gmo') file. A message object file contains the original + messages and their translations in a binary format that allows fast + lookup of translations at runtime. + + 6. When 'guide' is built and installed, the binary translation files + are installed in a standard place. + + 7. For testing and development, it is possible to tell 'gettext' to + use '.gmo' files in a different directory than the standard one by + using the 'bindtextdomain()' function. + + 8. At runtime, 'guide' looks up each string via a call to 'gettext()'. + The returned string is the translated string if available, or the + original string if not. + + 9. If necessary, it is possible to access messages from a different + text domain than the one belonging to the application, without + having to switch the application's default text domain back and + forth. + + In C (or C++), the string marking and dynamic translation lookup are +accomplished by wrapping each string in a call to 'gettext()': + + printf("%s", gettext("Don't Panic!\n")); + + The tools that extract messages from source code pull out all strings +enclosed in calls to 'gettext()'. + + The GNU 'gettext' developers, recognizing that typing 'gettext(...)' +over and over again is both painful and ugly to look at, use the macro +'_' (an underscore) to make things easier: + + /* In the standard header file: */ + #define _(str) gettext(str) + + /* In the program text: */ + printf("%s", _("Don't Panic!\n")); + +This reduces the typing overhead to just three extra characters per +string and is considerably easier to read as well. + + There are locale "categories" for different types of locale-related +information. The defined locale categories that 'gettext' knows about +are: + +'LC_MESSAGES' + Text messages. This is the default category for 'gettext' + operations, but it is possible to supply a different one + explicitly, if necessary. (It is almost never necessary to supply + a different category.) + +'LC_COLLATE' + Text-collation information (i.e., how different characters and/or + groups of characters sort in a given language). + +'LC_CTYPE' + Character-type information (alphabetic, digit, upper- or lowercase, + and so on) as well as character encoding. This information is + accessed via the POSIX character classes in regular expressions, + such as '/[[:alnum:]]/' (*note Bracket Expressions::). + +'LC_MONETARY' + Monetary information, such as the currency symbol, and whether the + symbol goes before or after a number. + +'LC_NUMERIC' + Numeric information, such as which characters to use for the + decimal point and the thousands separator.(2) + +'LC_TIME' + Time- and date-related information, such as 12- or 24-hour clock, + month printed before or after the day in a date, local month + abbreviations, and so on. + +'LC_ALL' + All of the above. (Not too useful in the context of 'gettext'.) 
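   From an 'awk' program, these category names are supplied as plain
strings.  For example, the following call (a sketch that uses the
'dcgettext()' function described later, in *note Programmer i18n::, and
the 'guide' text domain from the running example) explicitly requests
the 'LC_MESSAGES' category:

     message = dcgettext("Don't Panic!", "guide", "LC_MESSAGES")

Omitting the third argument has the same effect, because
'"LC_MESSAGES"' is the default category.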
+ + NOTE: As described in *note Locales::, environment variables with + the same name as the locale categories ('LC_CTYPE', 'LC_ALL', etc.) + influence 'gawk''s behavior (and that of other utilities). + + Normally, these variables also affect how the 'gettext' library + finds translations. However, the 'LANGUAGE' environment variable + overrides the 'LC_XXX' variables. Many GNU/Linux systems may + define this variable without your knowledge, causing 'gawk' to not + find the correct translations. If this happens to you, look to see + if 'LANGUAGE' is defined, and if so, use the shell's 'unset' + command to remove it. + + For testing translations of 'gawk' itself, you can set the +'GAWK_LOCALE_DIR' environment variable. See the documentation for the C +'bindtextdomain()' function and also see *note Other Environment +Variables::. + + ---------- Footnotes ---------- + + (1) For some operating systems, the 'gawk' port doesn't support GNU +'gettext'. Therefore, these features are not available if you are using +one of those operating systems. Sorry. + + (2) Americans use a comma every three decimal places and a period for +the decimal point, while many Europeans do exactly the opposite: +1,234.56 versus 1.234,56. + + +File: gawk.info, Node: Programmer i18n, Next: Translator i18n, Prev: Explaining gettext, Up: Internationalization + +13.3 Internationalizing 'awk' Programs +====================================== + +'gawk' provides the following variables for internationalization: + +'TEXTDOMAIN' + This variable indicates the application's text domain. For + compatibility with GNU 'gettext', the default value is + '"messages"'. + +'_"your message here"' + String constants marked with a leading underscore are candidates + for translation at runtime. String constants without a leading + underscore are not translated. + + 'gawk' provides the following functions for internationalization: + +'dcgettext(STRING [, DOMAIN [, CATEGORY]])' + Return the translation of STRING in text domain DOMAIN for locale + category CATEGORY. The default value for DOMAIN is the current + value of 'TEXTDOMAIN'. The default value for CATEGORY is + '"LC_MESSAGES"'. + + If you supply a value for CATEGORY, it must be a string equal to + one of the known locale categories described in *note Explaining + gettext::. You must also supply a text domain. Use 'TEXTDOMAIN' + if you want to use the current domain. + + CAUTION: The order of arguments to the 'awk' version of the + 'dcgettext()' function is purposely different from the order + for the C version. The 'awk' version's order was chosen to be + simple and to allow for reasonable 'awk'-style default + arguments. + +'dcngettext(STRING1, STRING2, NUMBER [, DOMAIN [, CATEGORY]])' + Return the plural form used for NUMBER of the translation of + STRING1 and STRING2 in text domain DOMAIN for locale category + CATEGORY. STRING1 is the English singular variant of a message, + and STRING2 is the English plural variant of the same message. The + default value for DOMAIN is the current value of 'TEXTDOMAIN'. The + default value for CATEGORY is '"LC_MESSAGES"'. + + The same remarks about argument order as for the 'dcgettext()' + function apply. + +'bindtextdomain(DIRECTORY [, DOMAIN ])' + Change the directory in which 'gettext' looks for '.gmo' files, in + case they will not or cannot be placed in the standard locations + (e.g., during testing). Return the directory in which DOMAIN is + "bound." + + The default DOMAIN is the value of 'TEXTDOMAIN'. 
If DIRECTORY is + the null string ('""'), then 'bindtextdomain()' returns the current + binding for the given DOMAIN. + + To use these facilities in your 'awk' program, follow these steps: + + 1. Set the variable 'TEXTDOMAIN' to the text domain of your program. + This is best done in a 'BEGIN' rule (*note BEGIN/END::), or it can + also be done via the '-v' command-line option (*note Options::): + + BEGIN { + TEXTDOMAIN = "guide" + ... + } + + 2. Mark all translatable strings with a leading underscore ('_') + character. It _must_ be adjacent to the opening quote of the + string. For example: + + print _"hello, world" + x = _"you goofed" + printf(_"Number of users is %d\n", nusers) + + 3. If you are creating strings dynamically, you can still translate + them, using the 'dcgettext()' built-in function:(1) + + if (groggy) + message = dcgettext("%d customers disturbing me\n", "adminprog") + else + message = dcgettext("enjoying %d customers\n", "adminprog") + printf(message, ncustomers) + + Here, the call to 'dcgettext()' supplies a different text domain + ('"adminprog"') in which to find the message, but it uses the + default '"LC_MESSAGES"' category. + + The previous example only works if 'ncustomers' is greater than + one. This example would be better done with 'dcngettext()': + + if (groggy) + message = dcngettext("%d customer disturbing me\n", + "%d customers disturbing me\n", "adminprog") + else + message = dcngettext("enjoying %d customer\n", + "enjoying %d customers\n", "adminprog") + printf(message, ncustomers) + + 4. During development, you might want to put the '.gmo' file in a + private directory for testing. This is done with the + 'bindtextdomain()' built-in function: + + BEGIN { + TEXTDOMAIN = "guide" # our text domain + if (Testing) { + # where to find our files + bindtextdomain("testdir") + # joe is in charge of adminprog + bindtextdomain("../joe/testdir", "adminprog") + } + ... + } + + *Note I18N Example:: for an example program showing the steps to +create and use translations from 'awk'. + + ---------- Footnotes ---------- + + (1) Thanks to Bruno Haible for this example. + + +File: gawk.info, Node: Translator i18n, Next: I18N Example, Prev: Programmer i18n, Up: Internationalization + +13.4 Translating 'awk' Programs +=============================== + +Once a program's translatable strings have been marked, they must be +extracted to create the initial '.pot' file. As part of translation, it +is often helpful to rearrange the order in which arguments to 'printf' +are output. + + 'gawk''s '--gen-pot' command-line option extracts the messages and is +discussed next. After that, 'printf''s ability to rearrange the order +for 'printf' arguments at runtime is covered. + +* Menu: + +* String Extraction:: Extracting marked strings. +* Printf Ordering:: Rearranging 'printf' arguments. +* I18N Portability:: 'awk'-level portability issues. + + +File: gawk.info, Node: String Extraction, Next: Printf Ordering, Up: Translator i18n + +13.4.1 Extracting Marked Strings +-------------------------------- + +Once your 'awk' program is working, and all the strings have been marked +and you've set (and perhaps bound) the text domain, it is time to +produce translations. First, use the '--gen-pot' command-line option to +create the initial '.pot' file: + + gawk --gen-pot -f guide.awk > guide.pot + + When run with '--gen-pot', 'gawk' does not execute your program. +Instead, it parses it as usual and prints all marked strings to standard +output in the format of a GNU 'gettext' Portable Object file. 
Also +included in the output are any constant strings that appear as the first +argument to 'dcgettext()' or as the first and second argument to +'dcngettext()'.(1) You should distribute the generated '.pot' file with +your 'awk' program; translators will eventually use it to provide you +translations that you can also then distribute. *Note I18N Example:: +for the full list of steps to go through to create and test translations +for 'guide'. + + ---------- Footnotes ---------- + + (1) The 'xgettext' utility that comes with GNU 'gettext' can handle +'.awk' files. + + +File: gawk.info, Node: Printf Ordering, Next: I18N Portability, Prev: String Extraction, Up: Translator i18n + +13.4.2 Rearranging 'printf' Arguments +------------------------------------- + +Format strings for 'printf' and 'sprintf()' (*note Printf::) present a +special problem for translation. Consider the following:(1) + + printf(_"String `%s' has %d characters\n", + string, length(string))) + + A possible German translation for this might be: + + "%d Zeichen lang ist die Zeichenkette `%s'\n" + + The problem should be obvious: the order of the format specifications +is different from the original! Even though 'gettext()' can return the +translated string at runtime, it cannot change the argument order in the +call to 'printf'. + + To solve this problem, 'printf' format specifiers may have an +additional optional element, which we call a "positional specifier". +For example: + + "%2$d Zeichen lang ist die Zeichenkette `%1$s'\n" + + Here, the positional specifier consists of an integer count, which +indicates which argument to use, and a '$'. Counts are one-based, and +the format string itself is _not_ included. Thus, in the following +example, 'string' is the first argument and 'length(string)' is the +second: + + $ gawk 'BEGIN { + > string = "Don\47t Panic" + > printf "%2$d characters live in \"%1$s\"\n", + > string, length(string) + > }' + -| 11 characters live in "Don't Panic" + + If present, positional specifiers come first in the format +specification, before the flags, the field width, and/or the precision. + + Positional specifiers can be used with the dynamic field width and +precision capability: + + $ gawk 'BEGIN { + > printf("%*.*s\n", 10, 20, "hello") + > printf("%3$*2$.*1$s\n", 20, 10, "hello") + > }' + -| hello + -| hello + + NOTE: When using '*' with a positional specifier, the '*' comes + first, then the integer position, and then the '$'. This is + somewhat counterintuitive. + + 'gawk' does not allow you to mix regular format specifiers and those +with positional specifiers in the same string: + + $ gawk 'BEGIN { printf "%d %3$s\n", 1, 2, "hi" }' + error-> gawk: cmd. line:1: fatal: must use `count$' on all formats or none + + NOTE: There are some pathological cases that 'gawk' may fail to + diagnose. In such cases, the output may not be what you expect. + It's still a bad idea to try mixing them, even if 'gawk' doesn't + detect it. + + Although positional specifiers can be used directly in 'awk' +programs, their primary purpose is to help in producing correct +translations of format strings into languages different from the one in +which the program is first written. + + ---------- Footnotes ---------- + + (1) This example is borrowed from the GNU 'gettext' manual. 
+ + +File: gawk.info, Node: I18N Portability, Prev: Printf Ordering, Up: Translator i18n + +13.4.3 'awk' Portability Issues +------------------------------- + +'gawk''s internationalization features were purposely chosen to have as +little impact as possible on the portability of 'awk' programs that use +them to other versions of 'awk'. Consider this program: + + BEGIN { + TEXTDOMAIN = "guide" + if (Test_Guide) # set with -v + bindtextdomain("/test/guide/messages") + print _"don't panic!" + } + +As written, it won't work on other versions of 'awk'. However, it is +actually almost portable, requiring very little change: + + * Assignments to 'TEXTDOMAIN' won't have any effect, because + 'TEXTDOMAIN' is not special in other 'awk' implementations. + + * Non-GNU versions of 'awk' treat marked strings as the concatenation + of a variable named '_' with the string following it.(1) + Typically, the variable '_' has the null string ('""') as its + value, leaving the original string constant as the result. + + * By defining "dummy" functions to replace 'dcgettext()', + 'dcngettext()', and 'bindtextdomain()', the 'awk' program can be + made to run, but all the messages are output in the original + language. For example: + + function bindtextdomain(dir, domain) + { + return dir + } + + function dcgettext(string, domain, category) + { + return string + } + + function dcngettext(string1, string2, number, domain, category) + { + return (number == 1 ? string1 : string2) + } + + * The use of positional specifications in 'printf' or 'sprintf()' is + _not_ portable. To support 'gettext()' at the C level, many + systems' C versions of 'sprintf()' do support positional + specifiers. But it works only if enough arguments are supplied in + the function call. Many versions of 'awk' pass 'printf' formats + and arguments unchanged to the underlying C library version of + 'sprintf()', but only one format and argument at a time. What + happens if a positional specification is used is anybody's guess. + However, because the positional specifications are primarily for + use in _translated_ format strings, and because non-GNU 'awk's + never retrieve the translated string, this should not be a problem + in practice. + + ---------- Footnotes ---------- + + (1) This is good fodder for an "Obfuscated 'awk'" contest. + + +File: gawk.info, Node: I18N Example, Next: Gawk I18N, Prev: Translator i18n, Up: Internationalization + +13.5 A Simple Internationalization Example +========================================== + +Now let's look at a step-by-step example of how to internationalize and +localize a simple 'awk' program, using 'guide.awk' as our original +source: + + BEGIN { + TEXTDOMAIN = "guide" + bindtextdomain(".") # for testing + print _"Don't Panic" + print _"The Answer Is", 42 + print "Pardon me, Zaphod who?" + } + +Run 'gawk --gen-pot' to create the '.pot' file: + + $ gawk --gen-pot -f guide.awk > guide.pot + +This produces: + + #: guide.awk:4 + msgid "Don't Panic" + msgstr "" + + #: guide.awk:5 + msgid "The Answer Is" + msgstr "" + + + This original portable object template file is saved and reused for +each language into which the application is translated. The 'msgid' is +the original string and the 'msgstr' is the translation. + + NOTE: Strings not marked with a leading underscore do not appear in + the 'guide.pot' file. + + Next, the messages must be translated. 
Here is a translation to a +hypothetical dialect of English, called "Mellow":(1) + + $ cp guide.pot guide-mellow.po + ADD TRANSLATIONS TO guide-mellow.po ... + +Following are the translations: + + #: guide.awk:4 + msgid "Don't Panic" + msgstr "Hey man, relax!" + + #: guide.awk:5 + msgid "The Answer Is" + msgstr "Like, the scoop is" + + + The next step is to make the directory to hold the binary message +object file and then to create the 'guide.mo' file. We pretend that our +file is to be used in the 'en_US.UTF-8' locale, because we have to use a +locale name known to the C 'gettext' routines. The directory layout +shown here is standard for GNU 'gettext' on GNU/Linux systems. Other +versions of 'gettext' may use a different layout: + + $ mkdir en_US.UTF-8 en_US.UTF-8/LC_MESSAGES + + The 'msgfmt' utility does the conversion from human-readable '.po' +file to machine-readable '.mo' file. By default, 'msgfmt' creates a +file named 'messages'. This file must be renamed and placed in the +proper directory (using the '-o' option) so that 'gawk' can find it: + + $ msgfmt guide-mellow.po -o en_US.UTF-8/LC_MESSAGES/guide.mo + + Finally, we run the program to test it: + + $ gawk -f guide.awk + -| Hey man, relax! + -| Like, the scoop is 42 + -| Pardon me, Zaphod who? + + If the three replacement functions for 'dcgettext()', 'dcngettext()', +and 'bindtextdomain()' (*note I18N Portability::) are in a file named +'libintl.awk', then we can run 'guide.awk' unchanged as follows: + + $ gawk --posix -f guide.awk -f libintl.awk + -| Don't Panic + -| The Answer Is 42 + -| Pardon me, Zaphod who? + + ---------- Footnotes ---------- + + (1) Perhaps it would be better if it were called "Hippy." Ah, well. + + +File: gawk.info, Node: Gawk I18N, Next: I18N Summary, Prev: I18N Example, Up: Internationalization + +13.6 'gawk' Can Speak Your Language +=================================== + +'gawk' itself has been internationalized using the GNU 'gettext' +package. (GNU 'gettext' is described in complete detail in *note (GNU +'gettext' utilities, gettext, GNU 'gettext' utilities)Top::.) As of +this writing, the latest version of GNU 'gettext' is version 0.19.4 +(ftp://ftp.gnu.org/gnu/gettext/gettext-0.19.4.tar.gz). + + If a translation of 'gawk''s messages exists, then 'gawk' produces +usage messages, warnings, and fatal errors in the local language. + + +File: gawk.info, Node: I18N Summary, Prev: Gawk I18N, Up: Internationalization + +13.7 Summary +============ + + * Internationalization means writing a program such that it can use + multiple languages without requiring source code changes. + Localization means providing the data necessary for an + internationalized program to work in a particular language. + + * 'gawk' uses GNU 'gettext' to let you internationalize and localize + 'awk' programs. A program's text domain identifies the program for + grouping all messages and other data together. + + * You mark a program's strings for translation by preceding them with + an underscore. Once that is done, the strings are extracted into a + '.pot' file. This file is copied for each language into a '.po' + file, and the '.po' files are compiled into '.gmo' files for use at + runtime. + + * You can use positional specifications with 'sprintf()' and 'printf' + to rearrange the placement of argument values in formatted strings + and output. This is useful for the translation of format control + strings. + + * The internationalization features have been designed so that they + can be easily worked around in a standard 'awk'. 
+ + * 'gawk' itself has been internationalized and ships with a number of + translations for its messages. + + +File: gawk.info, Node: Debugger, Next: Arbitrary Precision Arithmetic, Prev: Internationalization, Up: Top + +14 Debugging 'awk' Programs +*************************** + +It would be nice if computer programs worked perfectly the first time +they were run, but in real life, this rarely happens for programs of any +complexity. Thus, most programming languages have facilities available +for "debugging" programs, and now 'awk' is no exception. + + The 'gawk' debugger is purposely modeled after the GNU Debugger (GDB) +(http://www.gnu.org/software/gdb/) command-line debugger. If you are +familiar with GDB, learning how to use 'gawk' for debugging your program +is easy. + +* Menu: + +* Debugging:: Introduction to 'gawk' debugger. +* Sample Debugging Session:: Sample debugging session. +* List of Debugger Commands:: Main debugger commands. +* Readline Support:: Readline support. +* Limitations:: Limitations and future plans. +* Debugging Summary:: Debugging summary. + + +File: gawk.info, Node: Debugging, Next: Sample Debugging Session, Up: Debugger + +14.1 Introduction to the 'gawk' Debugger +======================================== + +This minor node introduces debugging in general and begins the +discussion of debugging in 'gawk'. + +* Menu: + +* Debugging Concepts:: Debugging in General. +* Debugging Terms:: Additional Debugging Concepts. +* Awk Debugging:: Awk Debugging. + + +File: gawk.info, Node: Debugging Concepts, Next: Debugging Terms, Up: Debugging + +14.1.1 Debugging in General +--------------------------- + +(If you have used debuggers in other languages, you may want to skip +ahead to *note Awk Debugging::.) + + Of course, a debugging program cannot remove bugs for you, because it +has no way of knowing what you or your users consider a "bug" versus a +"feature." (Sometimes, we humans have a hard time with this ourselves.) +In that case, what can you expect from such a tool? The answer to that +depends on the language being debugged, but in general, you can expect +at least the following: + + * The ability to watch a program execute its instructions one by one, + giving you, the programmer, the opportunity to think about what is + happening on a time scale of seconds, minutes, or hours, rather + than the nanosecond time scale at which the code usually runs. + + * The opportunity to not only passively observe the operation of your + program, but to control it and try different paths of execution, + without having to change your source files. + + * The chance to see the values of data in the program at any point in + execution, and also to change that data on the fly, to see how that + affects what happens afterward. (This often includes the ability + to look at internal data structures besides the variables you + actually defined in your code.) + + * The ability to obtain additional information about your program's + state or even its internal structure. + + All of these tools provide a great amount of help in using your own +skills and understanding of the goals of your program to find where it +is going wrong (or, for that matter, to better comprehend a perfectly +functional program that you or someone else wrote). 
+ + +File: gawk.info, Node: Debugging Terms, Next: Awk Debugging, Prev: Debugging Concepts, Up: Debugging + +14.1.2 Debugging Concepts +------------------------- + +Before diving in to the details, we need to introduce several important +concepts that apply to just about all debuggers. The following list +defines terms used throughout the rest of this major node: + +"Stack frame" + Programs generally call functions during the course of their + execution. One function can call another, or a function can call + itself (recursion). You can view the chain of called functions + (main program calls A, which calls B, which calls C), as a stack of + executing functions: the currently running function is the topmost + one on the stack, and when it finishes (returns), the next one down + then becomes the active function. Such a stack is termed a "call + stack". + + For each function on the call stack, the system maintains a data + area that contains the function's parameters, local variables, and + return value, as well as any other "bookkeeping" information needed + to manage the call stack. This data area is termed a "stack + frame". + + 'gawk' also follows this model, and gives you access to the call + stack and to each stack frame. You can see the call stack, as well + as from where each function on the stack was invoked. Commands + that print the call stack print information about each stack frame + (as detailed later on). + +"Breakpoint" + During debugging, you often wish to let the program run until it + reaches a certain point, and then continue execution from there one + statement (or instruction) at a time. The way to do this is to set + a "breakpoint" within the program. A breakpoint is where the + execution of the program should break off (stop), so that you can + take over control of the program's execution. You can add and + remove as many breakpoints as you like. + +"Watchpoint" + A watchpoint is similar to a breakpoint. The difference is that + breakpoints are oriented around the code: stop when a certain point + in the code is reached. A watchpoint, however, specifies that + program execution should stop when a _data value_ is changed. This + is useful, as sometimes it happens that a variable receives an + erroneous value, and it's hard to track down where this happens + just by looking at the code. By using a watchpoint, you can stop + whenever a variable is assigned to, and usually find the errant + code quite quickly. + + +File: gawk.info, Node: Awk Debugging, Prev: Debugging Terms, Up: Debugging + +14.1.3 'awk' Debugging +---------------------- + +Debugging an 'awk' program has some specific aspects that are not shared +with programs written in other languages. + + First of all, the fact that 'awk' programs usually take input line by +line from a file or files and operate on those lines using specific +rules makes it especially useful to organize viewing the execution of +the program in terms of these rules. As we will see, each 'awk' rule is +treated almost like a function call, with its own specific block of +instructions. + + In addition, because 'awk' is by design a very concise language, it +is easy to lose sight of everything that is going on "inside" each line +of 'awk' code. The debugger provides the opportunity to look at the +individual primitive instructions carried out by the higher-level 'awk' +commands. 
+ + +File: gawk.info, Node: Sample Debugging Session, Next: List of Debugger Commands, Prev: Debugging, Up: Debugger + +14.2 Sample 'gawk' Debugging Session +==================================== + +In order to illustrate the use of 'gawk' as a debugger, let's look at a +sample debugging session. We will use the 'awk' implementation of the +POSIX 'uniq' command described earlier (*note Uniq Program::) as our +example. + +* Menu: + +* Debugger Invocation:: How to Start the Debugger. +* Finding The Bug:: Finding the Bug. + + +File: gawk.info, Node: Debugger Invocation, Next: Finding The Bug, Up: Sample Debugging Session + +14.2.1 How to Start the Debugger +-------------------------------- + +Starting the debugger is almost exactly like running 'gawk' normally, +except you have to pass an additional option, '--debug', or the +corresponding short option, '-D'. The file(s) containing the program +and any supporting code are given on the command line as arguments to +one or more '-f' options. ('gawk' is not designed to debug command-line +programs, only programs contained in files.) In our case, we invoke the +debugger like this: + + $ gawk -D -f getopt.awk -f join.awk -f uniq.awk -1 inputfile + +where both 'getopt.awk' and 'uniq.awk' are in '$AWKPATH'. (Experienced +users of GDB or similar debuggers should note that this syntax is +slightly different from what you are used to. With the 'gawk' debugger, +you give the arguments for running the program in the command line to +the debugger rather than as part of the 'run' command at the debugger +prompt.) The '-1' is an option to 'uniq.awk'. + + Instead of immediately running the program on 'inputfile', as 'gawk' +would ordinarily do, the debugger merely loads all the program source +files, compiles them internally, and then gives us a prompt: + + gawk> + +from which we can issue commands to the debugger. At this point, no +code has been executed. + + +File: gawk.info, Node: Finding The Bug, Prev: Debugger Invocation, Up: Sample Debugging Session + +14.2.2 Finding the Bug +---------------------- + +Let's say that we are having a problem using (a faulty version of) +'uniq.awk' in the "field-skipping" mode, and it doesn't seem to be +catching lines which should be identical when skipping the first field, +such as: + + awk is a wonderful program! + gawk is a wonderful program! + + This could happen if we were thinking (C-like) of the fields in a +record as being numbered in a zero-based fashion, so instead of the +lines: + + clast = join(alast, fcount+1, n) + cline = join(aline, fcount+1, m) + +we wrote: + + clast = join(alast, fcount, n) + cline = join(aline, fcount, m) + + The first thing we usually want to do when trying to investigate a +problem like this is to put a breakpoint in the program so that we can +watch it at work and catch what it is doing wrong. A reasonable spot +for a breakpoint in 'uniq.awk' is at the beginning of the function +'are_equal()', which compares the current line with the previous one. +To set the breakpoint, use the 'b' (breakpoint) command: + + gawk> b are_equal + -| Breakpoint 1 set at file `awklib/eg/prog/uniq.awk', line 63 + + The debugger tells us the file and line number where the breakpoint +is. Now type 'r' or 'run' and the program runs until it hits the +breakpoint for the first time: + + gawk> r + -| Starting program: + -| Stopping in Rule ... 
+ -| Breakpoint 1, are_equal(n, m, clast, cline, alast, aline) + at `awklib/eg/prog/uniq.awk':63 + -| 63 if (fcount == 0 && charcount == 0) + gawk> + + Now we can look at what's going on inside our program. First of all, +let's see how we got to where we are. At the prompt, we type 'bt' +(short for "backtrace"), and the debugger responds with a listing of the +current stack frames: + + gawk> bt + -| #0 are_equal(n, m, clast, cline, alast, aline) + at `awklib/eg/prog/uniq.awk':68 + -| #1 in main() at `awklib/eg/prog/uniq.awk':88 + + This tells us that 'are_equal()' was called by the main program at +line 88 of 'uniq.awk'. (This is not a big surprise, because this is the +only call to 'are_equal()' in the program, but in more complex programs, +knowing who called a function and with what parameters can be the key to +finding the source of the problem.) + + Now that we're in 'are_equal()', we can start looking at the values +of some variables. Let's say we type 'p n' ('p' is short for "print"). +We would expect to see the value of 'n', a parameter to 'are_equal()'. +Actually, the debugger gives us: + + gawk> p n + -| n = untyped variable + +In this case, 'n' is an uninitialized local variable, because the +function was called without arguments (*note Function Calls::). + + A more useful variable to display might be the current record: + + gawk> p $0 + -| $0 = "gawk is a wonderful program!" + +This might be a bit puzzling at first, as this is the second line of our +test input. Let's look at 'NR': + + gawk> p NR + -| NR = 2 + +So we can see that 'are_equal()' was only called for the second record +of the file. Of course, this is because our program contains a rule for +'NR == 1': + + NR == 1 { + last = $0 + next + } + + OK, let's just check that that rule worked correctly: + + gawk> p last + -| last = "awk is a wonderful program!" + + Everything we have done so far has verified that the program has +worked as planned, up to and including the call to 'are_equal()', so the +problem must be inside this function. To investigate further, we must +begin "stepping through" the lines of 'are_equal()'. We start by typing +'n' (for "next"): + + gawk> n + -| 66 if (fcount > 0) { + + This tells us that 'gawk' is now ready to execute line 66, which +decides whether to give the lines the special "field-skipping" treatment +indicated by the '-1' command-line option. (Notice that we skipped from +where we were before, at line 63, to here, because the condition in line +63, 'if (fcount == 0 && charcount == 0)', was false.) + + Continuing to step, we now get to the splitting of the current and +last records: + + gawk> n + -| 67 n = split(last, alast) + gawk> n + -| 68 m = split($0, aline) + + At this point, we should be curious to see what our records were +split into, so we try to look: + + gawk> p n m alast aline + -| n = 5 + -| m = untyped variable + -| alast = array, 5 elements + -| aline = untyped variable + +(The 'p' command can take more than one argument, similar to 'awk''s +'print' statement.) + + This is kind of disappointing, though. All we found out is that +there are five elements in 'alast'; 'm' and 'aline' don't have values +because we are at line 68 but haven't executed it yet. This information +is useful enough (we now know that none of the words were accidentally +left out), but what if we want to see inside the array? + + The first choice would be to use subscripts: + + gawk> p alast[0] + -| "0" not in array `alast' + +Oops! 
+ + gawk> p alast[1] + -| alast["1"] = "awk" + + This would be kind of slow for a 100-member array, though, so 'gawk' +provides a shortcut (reminiscent of another language not to be +mentioned): + + gawk> p @alast + -| alast["1"] = "awk" + -| alast["2"] = "is" + -| alast["3"] = "a" + -| alast["4"] = "wonderful" + -| alast["5"] = "program!" + + It looks like we got this far OK. Let's take another step or two: + + gawk> n + -| 69 clast = join(alast, fcount, n) + gawk> n + -| 70 cline = join(aline, fcount, m) + + Well, here we are at our error (sorry to spoil the suspense). What +we had in mind was to join the fields starting from the second one to +make the virtual record to compare, and if the first field were numbered +zero, this would work. Let's look at what we've got: + + gawk> p cline clast + -| cline = "gawk is a wonderful program!" + -| clast = "awk is a wonderful program!" + + Hey, those look pretty familiar! They're just our original, +unaltered input records. A little thinking (the human brain is still +the best debugging tool), and we realize that we were off by one! + + We get out of the debugger: + + gawk> q + -| The program is running. Exit anyway (y/n)? y + +Then we get into an editor: + + clast = join(alast, fcount+1, n) + cline = join(aline, fcount+1, m) + +and problem solved! + + +File: gawk.info, Node: List of Debugger Commands, Next: Readline Support, Prev: Sample Debugging Session, Up: Debugger + +14.3 Main Debugger Commands +=========================== + +The 'gawk' debugger command set can be divided into the following +categories: + + * Breakpoint control + + * Execution control + + * Viewing and changing data + + * Working with the stack + + * Getting information + + * Miscellaneous + + Each of these are discussed in the following subsections. In the +following descriptions, commands that may be abbreviated show the +abbreviation on a second description line. A debugger command name may +also be truncated if that partial name is unambiguous. The debugger has +the built-in capability to automatically repeat the previous command +just by hitting 'Enter'. This works for the commands 'list', 'next', +'nexti', 'step', 'stepi', and 'continue' executed without any argument. + +* Menu: + +* Breakpoint Control:: Control of Breakpoints. +* Debugger Execution Control:: Control of Execution. +* Viewing And Changing Data:: Viewing and Changing Data. +* Execution Stack:: Dealing with the Stack. +* Debugger Info:: Obtaining Information about the Program and + the Debugger State. +* Miscellaneous Debugger Commands:: Miscellaneous Commands. + + +File: gawk.info, Node: Breakpoint Control, Next: Debugger Execution Control, Up: List of Debugger Commands + +14.3.1 Control of Breakpoints +----------------------------- + +As we saw earlier, the first thing you probably want to do in a +debugging session is to get your breakpoints set up, because your +program will otherwise just run as if it was not under the debugger. +The commands for controlling breakpoints are: + +'break' [[FILENAME':']N | FUNCTION] ['"EXPRESSION"'] +'b' [[FILENAME':']N | FUNCTION] ['"EXPRESSION"'] + Without any argument, set a breakpoint at the next instruction to + be executed in the selected stack frame. Arguments can be one of + the following: + + N + Set a breakpoint at line number N in the current source file. + + FILENAME':'N + Set a breakpoint at line number N in source file FILENAME. + + FUNCTION + Set a breakpoint at entry to (the first instruction of) + function FUNCTION. 
+ + Each breakpoint is assigned a number that can be used to delete it + from the breakpoint list using the 'delete' command. + + With a breakpoint, you may also supply a condition. This is an + 'awk' expression (enclosed in double quotes) that the debugger + evaluates whenever the breakpoint is reached. If the condition is + true, then the debugger stops execution and prompts for a command. + Otherwise, it continues executing the program. + +'clear' [[FILENAME':']N | FUNCTION] + Without any argument, delete any breakpoint at the next instruction + to be executed in the selected stack frame. If the program stops + at a breakpoint, this deletes that breakpoint so that the program + does not stop at that location again. Arguments can be one of the + following: + + N + Delete breakpoint(s) set at line number N in the current + source file. + + FILENAME':'N + Delete breakpoint(s) set at line number N in source file + FILENAME. + + FUNCTION + Delete breakpoint(s) set at entry to function FUNCTION. + +'condition' N '"EXPRESSION"' + Add a condition to existing breakpoint or watchpoint N. The + condition is an 'awk' expression _enclosed in double quotes_ that + the debugger evaluates whenever the breakpoint or watchpoint is + reached. If the condition is true, then the debugger stops + execution and prompts for a command. Otherwise, the debugger + continues executing the program. If the condition expression is + not specified, any existing condition is removed (i.e., the + breakpoint or watchpoint is made unconditional). + +'delete' [N1 N2 ...] [N-M] +'d' [N1 N2 ...] [N-M] + Delete specified breakpoints or a range of breakpoints. Delete all + defined breakpoints if no argument is supplied. + +'disable' [N1 N2 ... | N-M] + Disable specified breakpoints or a range of breakpoints. Without + any argument, disable all breakpoints. + +'enable' ['del' | 'once'] [N1 N2 ...] [N-M] +'e' ['del' | 'once'] [N1 N2 ...] [N-M] + Enable specified breakpoints or a range of breakpoints. Without + any argument, enable all breakpoints. Optionally, you can specify + how to enable the breakpoints: + + 'del' + Enable the breakpoints temporarily, then delete each one when + the program stops at it. + + 'once' + Enable the breakpoints temporarily, then disable each one when + the program stops at it. + +'ignore' N COUNT + Ignore breakpoint number N the next COUNT times it is hit. + +'tbreak' [[FILENAME':']N | FUNCTION] +'t' [[FILENAME':']N | FUNCTION] + Set a temporary breakpoint (enabled for only one stop). The + arguments are the same as for 'break'. + + +File: gawk.info, Node: Debugger Execution Control, Next: Viewing And Changing Data, Prev: Breakpoint Control, Up: List of Debugger Commands + +14.3.2 Control of Execution +--------------------------- + +Now that your breakpoints are ready, you can start running the program +and observing its behavior. There are more commands for controlling +execution of the program than we saw in our earlier example: + +'commands' [N] +'silent' +... +'end' + Set a list of commands to be executed upon stopping at a breakpoint + or watchpoint. N is the breakpoint or watchpoint number. Without + a number, the last one set is used. The actual commands follow, + starting on the next line, and terminated by the 'end' command. If + the command 'silent' is in the list, the usual messages about + stopping at a breakpoint and the source line are not printed. 
Any + command in the list that resumes execution (e.g., 'continue') + terminates the list (an implicit 'end'), and subsequent commands + are ignored. For example: + + gawk> commands + > silent + > printf "A silent breakpoint; i = %d\n", i + > info locals + > set i = 10 + > continue + > end + gawk> + +'continue' [COUNT] +'c' [COUNT] + Resume program execution. If continued from a breakpoint and COUNT + is specified, ignore the breakpoint at that location the next COUNT + times before stopping. + +'finish' + Execute until the selected stack frame returns. Print the returned + value. + +'next' [COUNT] +'n' [COUNT] + Continue execution to the next source line, stepping over function + calls. The argument COUNT controls how many times to repeat the + action, as in 'step'. + +'nexti' [COUNT] +'ni' [COUNT] + Execute one (or COUNT) instruction(s), stepping over function + calls. + +'return' [VALUE] + Cancel execution of a function call. If VALUE (either a string or + a number) is specified, it is used as the function's return value. + If used in a frame other than the innermost one (the currently + executing function; i.e., frame number 0), discard all inner frames + in addition to the selected one, and the caller of that frame + becomes the innermost frame. + +'run' +'r' + Start/restart execution of the program. When restarting, the + debugger retains the current breakpoints, watchpoints, command + history, automatic display variables, and debugger options. + +'step' [COUNT] +'s' [COUNT] + Continue execution until control reaches a different source line in + the current stack frame, stepping inside any function called within + the line. If the argument COUNT is supplied, steps that many times + before stopping, unless it encounters a breakpoint or watchpoint. + +'stepi' [COUNT] +'si' [COUNT] + Execute one (or COUNT) instruction(s), stepping inside function + calls. (For illustration of what is meant by an "instruction" in + 'gawk', see the output shown under 'dump' in *note Miscellaneous + Debugger Commands::.) + +'until' [[FILENAME':']N | FUNCTION] +'u' [[FILENAME':']N | FUNCTION] + Without any argument, continue execution until a line past the + current line in the current stack frame is reached. With an + argument, continue execution until the specified location is + reached, or the current stack frame returns. + + +File: gawk.info, Node: Viewing And Changing Data, Next: Execution Stack, Prev: Debugger Execution Control, Up: List of Debugger Commands + +14.3.3 Viewing and Changing Data +-------------------------------- + +The commands for viewing and changing variables inside of 'gawk' are: + +'display' [VAR | '$'N] + Add variable VAR (or field '$N') to the display list. The value of + the variable or field is displayed each time the program stops. + Each variable added to the list is identified by a unique number: + + gawk> display x + -| 10: x = 1 + + This displays the assigned item number, the variable name, and its + current value. If the display variable refers to a function + parameter, it is silently deleted from the list as soon as the + execution reaches a context where no such variable of the given + name exists. Without argument, 'display' displays the current + values of items on the list. + +'eval "AWK STATEMENTS"' + Evaluate AWK STATEMENTS in the context of the running program. You + can do anything that an 'awk' program would do: assign values to + variables, call functions, and so on. + +'eval' PARAM, ... 
+AWK STATEMENTS +'end' + This form of 'eval' is similar, but it allows you to define "local + variables" that exist in the context of the AWK STATEMENTS, instead + of using variables or function parameters defined by the program. + +'print' VAR1[',' VAR2 ...] +'p' VAR1[',' VAR2 ...] + Print the value of a 'gawk' variable or field. Fields must be + referenced by constants: + + gawk> print $3 + + This prints the third field in the input record (if the specified + field does not exist, it prints 'Null field'). A variable can be + an array element, with the subscripts being constant string values. + To print the contents of an array, prefix the name of the array + with the '@' symbol: + + gawk> print @a + + This prints the indices and the corresponding values for all + elements in the array 'a'. + +'printf' FORMAT [',' ARG ...] + Print formatted text. The FORMAT may include escape sequences, + such as '\n' (*note Escape Sequences::). No newline is printed + unless one is specified. + +'set' VAR'='VALUE + Assign a constant (number or string) value to an 'awk' variable or + field. String values must be enclosed between double quotes + ('"'...'"'). + + You can also set special 'awk' variables, such as 'FS', 'NF', 'NR', + and so on. + +'watch' VAR | '$'N ['"EXPRESSION"'] +'w' VAR | '$'N ['"EXPRESSION"'] + Add variable VAR (or field '$N') to the watch list. The debugger + then stops whenever the value of the variable or field changes. + Each watched item is assigned a number that can be used to delete + it from the watch list using the 'unwatch' command. + + With a watchpoint, you may also supply a condition. This is an + 'awk' expression (enclosed in double quotes) that the debugger + evaluates whenever the watchpoint is reached. If the condition is + true, then the debugger stops execution and prompts for a command. + Otherwise, 'gawk' continues executing the program. + +'undisplay' [N] + Remove item number N (or all items, if no argument) from the + automatic display list. + +'unwatch' [N] + Remove item number N (or all items, if no argument) from the watch + list. + + +File: gawk.info, Node: Execution Stack, Next: Debugger Info, Prev: Viewing And Changing Data, Up: List of Debugger Commands + +14.3.4 Working with the Stack +----------------------------- + +Whenever you run a program that contains any function calls, 'gawk' +maintains a stack of all of the function calls leading up to where the +program is right now. You can see how you got to where you are, and +also move around in the stack to see what the state of things was in the +functions that called the one you are in. The commands for doing this +are: + +'backtrace' [COUNT] +'bt' [COUNT] +'where' [COUNT] + Print a backtrace of all function calls (stack frames), or + innermost COUNT frames if COUNT > 0. Print the outermost COUNT + frames if COUNT < 0. The backtrace displays the name and arguments + to each function, the source file name, and the line number. The + alias 'where' for 'backtrace' is provided for longtime GDB users + who may be used to that command. + +'down' [COUNT] + Move COUNT (default 1) frames down the stack toward the innermost + frame. Then select and print the frame. + +'frame' [N] +'f' [N] + Select and print stack frame N. Frame 0 is the currently + executing, or "innermost", frame (function call); frame 1 is the + frame that called the innermost one. The highest-numbered frame is + the one for the main program. 
The printed information consists of + the frame number, function and argument names, source file, and the + source line. + +'up' [COUNT] + Move COUNT (default 1) frames up the stack toward the outermost + frame. Then select and print the frame. + + +File: gawk.info, Node: Debugger Info, Next: Miscellaneous Debugger Commands, Prev: Execution Stack, Up: List of Debugger Commands + +14.3.5 Obtaining Information About the Program and the Debugger State +--------------------------------------------------------------------- + +Besides looking at the values of variables, there is often a need to get +other sorts of information about the state of your program and of the +debugging environment itself. The 'gawk' debugger has one command that +provides this information, appropriately called 'info'. 'info' is used +with one of a number of arguments that tell it exactly what you want to +know: + +'info' WHAT +'i' WHAT + The value for WHAT should be one of the following: + + 'args' + List arguments of the selected frame. + + 'break' + List all currently set breakpoints. + + 'display' + List all items in the automatic display list. + + 'frame' + Give a description of the selected stack frame. + + 'functions' + List all function definitions including source file names and + line numbers. + + 'locals' + List local variables of the selected frame. + + 'source' + Print the name of the current source file. Each time the + program stops, the current source file is the file containing + the current instruction. When the debugger first starts, the + current source file is the first file included via the '-f' + option. The 'list FILENAME:LINENO' command can be used at any + time to change the current source. + + 'sources' + List all program sources. + + 'variables' + List all global variables. + + 'watch' + List all items in the watch list. + + Additional commands give you control over the debugger, the ability +to save the debugger's state, and the ability to run debugger commands +from a file. The commands are: + +'option' [NAME['='VALUE]] +'o' [NAME['='VALUE]] + Without an argument, display the available debugger options and + their current values. 'option NAME' shows the current value of the + named option. 'option NAME=VALUE' assigns a new value to the named + option. The available options are: + + 'history_size' + Set the maximum number of lines to keep in the history file + './.gawk_history'. The default is 100. + + 'listsize' + Specify the number of lines that 'list' prints. The default + is 15. + + 'outfile' + Send 'gawk' output to a file; debugger output still goes to + standard output. An empty string ('""') resets output to + standard output. + + 'prompt' + Change the debugger prompt. The default is 'gawk> '. + + 'save_history' ['on' | 'off'] + Save command history to file './.gawk_history'. The default + is 'on'. + + 'save_options' ['on' | 'off'] + Save current options to file './.gawkrc' upon exit. The + default is 'on'. Options are read back into the next session + upon startup. + + 'trace' ['on' | 'off'] + Turn instruction tracing on or off. The default is 'off'. + +'save' FILENAME + Save the commands from the current session to the given file name, + so that they can be replayed using the 'source' command. + +'source' FILENAME + Run command(s) from a file; an error in any command does not + terminate execution of subsequent commands. Comments (lines + starting with '#') are allowed in a command file. Empty lines are + ignored; they do _not_ repeat the last command. 
You can't restart + the program by having more than one 'run' command in the file. + Also, the list of commands may include additional 'source' + commands; however, the 'gawk' debugger will not source the same + file more than once in order to avoid infinite recursion. + + In addition to, or instead of, the 'source' command, you can use + the '-D FILE' or '--debug=FILE' command-line options to execute + commands from a file non-interactively (*note Options::). + + +File: gawk.info, Node: Miscellaneous Debugger Commands, Prev: Debugger Info, Up: List of Debugger Commands + +14.3.6 Miscellaneous Commands +----------------------------- + +There are a few more commands that do not fit into the previous +categories, as follows: + +'dump' [FILENAME] + Dump byte code of the program to standard output or to the file + named in FILENAME. This prints a representation of the internal + instructions that 'gawk' executes to implement the 'awk' commands + in a program. This can be very enlightening, as the following + partial dump of Davide Brini's obfuscated code (*note Signature + Program::) demonstrates: + + gawk> dump + -| # BEGIN + -| + -| [ 1:0xfcd340] Op_rule : [in_rule = BEGIN] [source_file = brini.awk] + -| [ 1:0xfcc240] Op_push_i : "~" [MALLOC|STRING|STRCUR] + -| [ 1:0xfcc2a0] Op_push_i : "~" [MALLOC|STRING|STRCUR] + -| [ 1:0xfcc280] Op_match : + -| [ 1:0xfcc1e0] Op_store_var : O + -| [ 1:0xfcc2e0] Op_push_i : "==" [MALLOC|STRING|STRCUR] + -| [ 1:0xfcc340] Op_push_i : "==" [MALLOC|STRING|STRCUR] + -| [ 1:0xfcc320] Op_equal : + -| [ 1:0xfcc200] Op_store_var : o + -| [ 1:0xfcc380] Op_push : o + -| [ 1:0xfcc360] Op_plus_i : 0 [MALLOC|NUMCUR|NUMBER] + -| [ 1:0xfcc220] Op_push_lhs : o [do_reference = true] + -| [ 1:0xfcc300] Op_assign_plus : + -| [ :0xfcc2c0] Op_pop : + -| [ 1:0xfcc400] Op_push : O + -| [ 1:0xfcc420] Op_push_i : "" [MALLOC|STRING|STRCUR] + -| [ :0xfcc4a0] Op_no_op : + -| [ 1:0xfcc480] Op_push : O + -| [ :0xfcc4c0] Op_concat : [expr_count = 3] [concat_flag = 0] + -| [ 1:0xfcc3c0] Op_store_var : x + -| [ 1:0xfcc440] Op_push_lhs : X [do_reference = true] + -| [ 1:0xfcc3a0] Op_postincrement : + -| [ 1:0xfcc4e0] Op_push : x + -| [ 1:0xfcc540] Op_push : o + -| [ 1:0xfcc500] Op_plus : + -| [ 1:0xfcc580] Op_push : o + -| [ 1:0xfcc560] Op_plus : + -| [ 1:0xfcc460] Op_leq : + -| [ :0xfcc5c0] Op_jmp_false : [target_jmp = 0xfcc5e0] + -| [ 1:0xfcc600] Op_push_i : "%c" [MALLOC|STRING|STRCUR] + -| [ :0xfcc660] Op_no_op : + -| [ 1:0xfcc520] Op_assign_concat : c + -| [ :0xfcc620] Op_jmp : [target_jmp = 0xfcc440] + -| + ... + -| + -| [ 2:0xfcc5a0] Op_K_printf : [expr_count = 17] [redir_type = ""] + -| [ :0xfcc140] Op_no_op : + -| [ :0xfcc1c0] Op_atexit : + -| [ :0xfcc640] Op_stop : + -| [ :0xfcc180] Op_no_op : + -| [ :0xfcd150] Op_after_beginfile : + -| [ :0xfcc160] Op_no_op : + -| [ :0xfcc1a0] Op_after_endfile : + gawk> + +'exit' + Exit the debugger. See the entry for 'quit', later in this list. + +'help' +'h' + Print a list of all of the 'gawk' debugger commands with a short + summary of their usage. 'help COMMAND' prints the information + about the command COMMAND. + +'list' ['-' | '+' | N | FILENAME':'N | N-M | FUNCTION] +'l' ['-' | '+' | N | FILENAME':'N | N-M | FUNCTION] + Print the specified lines (default 15) from the current source file + or the file named FILENAME. The possible arguments to 'list' are + as follows: + + '-' (Minus) + Print lines before the lines last printed. + + '+' + Print lines after the lines last printed. 'list' without any + argument does the same thing. 
+ + N + Print lines centered around line number N. + + N-M + Print lines from N to M. + + FILENAME':'N + Print lines centered around line number N in source file + FILENAME. This command may change the current source file. + + FUNCTION + Print lines centered around the beginning of the function + FUNCTION. This command may change the current source file. + +'quit' +'q' + Exit the debugger. Debugging is great fun, but sometimes we all + have to tend to other obligations in life, and sometimes we find + the bug and are free to go on to the next one! As we saw earlier, + if you are running a program, the debugger warns you when you type + 'q' or 'quit', to make sure you really want to quit. + +'trace' ['on' | 'off'] + Turn on or off continuous printing of the instructions that are + about to be executed, along with the 'awk' lines they implement. + The default is 'off'. + + It is to be hoped that most of the "opcodes" in these instructions + are fairly self-explanatory, and using 'stepi' and 'nexti' while + 'trace' is on will make them into familiar friends. + + +File: gawk.info, Node: Readline Support, Next: Limitations, Prev: List of Debugger Commands, Up: Debugger + +14.4 Readline Support +===================== + +If 'gawk' is compiled with the GNU Readline library +(http://cnswww.cns.cwru.edu/php/chet/readline/readline.html), you can +take advantage of that library's command completion and history +expansion features. The following types of completion are available: + +Command completion + Command names. + +Source file name completion + Source file names. Relevant commands are 'break', 'clear', 'list', + 'tbreak', and 'until'. + +Argument completion + Non-numeric arguments to a command. Relevant commands are 'enable' + and 'info'. + +Variable name completion + Global variable names, and function arguments in the current + context if the program is running. Relevant commands are + 'display', 'print', 'set', and 'watch'. + + +File: gawk.info, Node: Limitations, Next: Debugging Summary, Prev: Readline Support, Up: Debugger + +14.5 Limitations +================ + +We hope you find the 'gawk' debugger useful and enjoyable to work with, +but as with any program, especially in its early releases, it still has +some limitations. A few that it's worth being aware of are: + + * At this point, the debugger does not give a detailed explanation of + what you did wrong when you type in something it doesn't like. + Rather, it just responds 'syntax error'. When you do figure out + what your mistake was, though, you'll feel like a real guru. + + * If you perused the dump of opcodes in *note Miscellaneous Debugger + Commands:: (or if you are already familiar with 'gawk' internals), + you will realize that much of the internal manipulation of data in + 'gawk', as in many interpreters, is done on a stack. 'Op_push', + 'Op_pop', and the like are the "bread and butter" of most 'gawk' + code. + + Unfortunately, as of now, the 'gawk' debugger does not allow you to + examine the stack's contents. That is, the intermediate results of + expression evaluation are on the stack, but cannot be printed. + Rather, only variables that are defined in the program can be + printed. Of course, a workaround for this is to use more explicit + variables at the debugging stage and then change back to obscure, + perhaps more optimal code later. + + * There is no way to look "inside" the process of compiling regular + expressions to see if you got it right. 
As an 'awk' programmer, + you are expected to know the meaning of '/[^[:alnum:][:blank:]]/'. + + * The 'gawk' debugger is designed to be used by running a program + (with all its parameters) on the command line, as described in + *note Debugger Invocation::. There is no way (as of now) to attach + or "break into" a running program. This seems reasonable for a + language that is used mainly for quickly executing, short programs. + + * The 'gawk' debugger only accepts source code supplied with the '-f' + option. + + One other point is worth discussing. Conventional debuggers run in a +separate process (and thus address space) from the programs that they +debug (the "debuggee", if you will). + + The 'gawk' debugger is different; it is an integrated part of 'gawk' +itself. This makes it possible, in rare cases, for 'gawk' to become an +excellent demonstrator of Heisenberg Uncertainty physics, where the mere +act of observing something can change it. Consider the following:(1) + + $ cat test.awk + -| { print typeof($1), typeof($2) } + $ cat test.data + -| abc 123 + $ gawk -f test.awk test.data + -| strnum strnum + + This is all as expected: field data has the STRNUM attribute (*note +Variable Typing::). Now watch what happens when we run this program +under the debugger: + + $ gawk -D -f test.awk test.data + gawk> w $1 Set watchpoint on $1 + -| Watchpoint 1: $1 + gawk> w $2 Set watchpoint on $2 + -| Watchpoint 2: $2 + gawk> r Start the program + -| Starting program: + -| Stopping in Rule ... + -| Watchpoint 1: $1 Watchpoint fires + -| Old value: "" + -| New value: "abc" + -| main() at `test.awk':1 + -| 1 { print typeof($1), typeof($2) } + gawk> n Keep going ... + -| Watchpoint 2: $2 Watchpoint fires + -| Old value: "" + -| New value: "123" + -| main() at `test.awk':1 + -| 1 { print typeof($1), typeof($2) } + gawk> n Get result from typeof() + -| strnum number Result for $2 isn't right + -| Program exited normally with exit value: 0 + gawk> quit + + In this case, the act of comparing the new value of '$2' with the old +one caused 'gawk' to evaluate it and determine that it is indeed a +number, and this is reflected in the result of 'typeof()'. + + Cases like this where the debugger is not transparent to the +program's execution should be rare. If you encounter one, please report +it (*note Bugs::). + + ---------- Footnotes ---------- + + (1) Thanks to Hermann Peifer for this example. + + +File: gawk.info, Node: Debugging Summary, Prev: Limitations, Up: Debugger + +14.6 Summary +============ + + * Programs rarely work correctly the first time. Finding bugs is + called debugging, and a program that helps you find bugs is a + debugger. 'gawk' has a built-in debugger that works very similarly + to the GNU Debugger, GDB. + + * Debuggers let you step through your program one statement at a + time, examine and change variable and array values, and do a number + of other things that let you understand what your program is + actually doing (as opposed to what it is supposed to do). + + * Like most debuggers, the 'gawk' debugger works in terms of stack + frames, and lets you set both breakpoints (stop at a point in the + code) and watchpoints (stop when a data value changes). + + * The debugger command set is fairly complete, providing control over + breakpoints, execution, viewing and changing data, working with the + stack, getting information, and other tasks. 
+
+   * If the GNU Readline library is available when 'gawk' is compiled,
+     it is used by the debugger to provide command-line history and
+     editing.
+
+   * Usually, the debugger does not affect the program being debugged,
+     but occasionally it can.
+
+
+File: gawk.info,  Node: Arbitrary Precision Arithmetic,  Next: Dynamic Extensions,  Prev: Debugger,  Up: Top
+
+15 Arithmetic and Arbitrary-Precision Arithmetic with 'gawk'
+************************************************************
+
+This major node introduces some basic concepts relating to how computers
+do arithmetic and defines some important terms.  It then proceeds to
+describe floating-point arithmetic, which is what 'awk' uses for all its
+computations, including a discussion of arbitrary-precision
+floating-point arithmetic, which is a feature available only in 'gawk'.
+It continues on to present arbitrary-precision integers, and concludes
+with a description of some points where 'gawk' and the POSIX standard
+are not quite in agreement.
+
+     NOTE: Most users of 'gawk' can safely skip this chapter.  But if
+     you want to do scientific calculations with 'gawk', this is the
+     place to be.
+
+* Menu:
+
+* Computer Arithmetic::         A quick intro to computer math.
+* Math Definitions::            Defining terms used.
+* MPFR features::               The MPFR features in 'gawk'.
+* FP Math Caution::             Things to know.
+* Arbitrary Precision Integers:: Arbitrary Precision Integer Arithmetic with
+                                 'gawk'.
+* POSIX Floating Point Problems:: Standards Versus Existing Practice.
+* Floating point summary::      Summary of floating point discussion.
+
+
+File: gawk.info,  Node: Computer Arithmetic,  Next: Math Definitions,  Up: Arbitrary Precision Arithmetic
+
+15.1 A General Description of Computer Arithmetic
+=================================================
+
+Until now, we have worked with data as either numbers or strings.
+Ultimately, however, computers represent everything in terms of "binary
+digits", or "bits".  A decimal digit can take on any of 10 values: zero
+through nine.  A binary digit can take on either of two values, zero or
+one.  Using binary, computers (and computer software) can represent and
+manipulate numerical and character data.  In general, the more bits you
+can use to represent a particular thing, the greater the range of
+possible values it can take on.
+
+   Modern computers support at least two, and often more, ways to do
+arithmetic.  Each kind of arithmetic uses a different representation
+(organization of the bits) for the numbers.  The kinds of arithmetic
+that interest us are:
+
+Decimal arithmetic
+     This is the kind of arithmetic you learned in elementary school,
+     using paper and pencil (and/or a calculator).  In theory, numbers
+     can have an arbitrary number of digits on either side (or both
+     sides) of the decimal point, and the results of a computation are
+     always exact.
+
+     Some modern systems can do decimal arithmetic in hardware, but
+     usually you need a special software library to provide access to
+     these instructions.  There are also libraries that do decimal
+     arithmetic entirely in software.
+
+     Despite the fact that some users expect 'gawk' to be performing
+     decimal arithmetic,(1) it does not do so.
+
+Integer arithmetic
+     In school, integer values were referred to as "whole" numbers--that
+     is, numbers without any fractional part, such as 1, 42, or -17.
+     The advantage to integer numbers is that they represent values
+     exactly.  The disadvantage is that their range is limited.
+ + In computers, integer values come in two flavors: "signed" and + "unsigned". Signed values may be negative or positive, whereas + unsigned values are always greater than or equal to zero. + + In computer systems, integer arithmetic is exact, but the possible + range of values is limited. Integer arithmetic is generally faster + than floating-point arithmetic. + +Floating-point arithmetic + Floating-point numbers represent what were called in school "real" + numbers (i.e., those that have a fractional part, such as + 3.1415927). The advantage to floating-point numbers is that they + can represent a much larger range of values than can integers. The + disadvantage is that there are numbers that they cannot represent + exactly. + + Modern systems support floating-point arithmetic in hardware, with + a limited range of values. There are software libraries that allow + the use of arbitrary-precision floating-point calculations. + + POSIX 'awk' uses "double-precision" floating-point numbers, which + can hold more digits than "single-precision" floating-point + numbers. 'gawk' has facilities for performing arbitrary-precision + floating-point arithmetic, which we describe in more detail + shortly. + + Computers work with integer and floating-point values of different +ranges. Integer values are usually either 32 or 64 bits in size. +Single-precision floating-point values occupy 32 bits, whereas +double-precision floating-point values occupy 64 bits. Floating-point +values are always signed. The possible ranges of values are shown in +*note Table 15.1: table-numeric-ranges. + +Numeric representation Minimum value Maximum value +--------------------------------------------------------------------------- +32-bit signed integer -2,147,483,648 2,147,483,647 +32-bit unsigned 0 4,294,967,295 +integer +64-bit signed integer -9,223,372,036,854,775,8089,223,372,036,854,775,807 +64-bit unsigned 0 18,446,744,073,709,551,615 +integer +Single-precision 1.175494e-38 3.402823e38 +floating point +(approximate) +Double-precision 2.225074e-308 1.797693e308 +floating point +(approximate) + +Table 15.1: Value ranges for different numeric representations + + ---------- Footnotes ---------- + + (1) We don't know why they expect this, but they do. + + +File: gawk.info, Node: Math Definitions, Next: MPFR features, Prev: Computer Arithmetic, Up: Arbitrary Precision Arithmetic + +15.2 Other Stuff to Know +======================== + +The rest of this major node uses a number of terms. Here are some +informal definitions that should help you work your way through the +material here: + +"Accuracy" + A floating-point calculation's accuracy is how close it comes to + the real (paper and pencil) value. + +"Error" + The difference between what the result of a computation "should be" + and what it actually is. It is best to minimize error as much as + possible. + +"Exponent" + The order of magnitude of a value; some number of bits in a + floating-point value store the exponent. + +"Inf" + A special value representing infinity. Operations involving + another number and infinity produce infinity. + +"NaN" + "Not a number."(1) A special value that results from attempting a + calculation that has no answer as a real number. In such a case, + programs can either receive a floating-point exception, or get + 'NaN' back as the result. The IEEE 754 standard recommends that + systems return 'NaN'. 
Some examples: + + 'sqrt(-1)' + This makes sense in the range of complex numbers, but not in + the range of real numbers, so the result is 'NaN'. + + 'log(-8)' + -8 is out of the domain of 'log()', so the result is 'NaN'. + +"Normalized" + How the significand (see later in this list) is usually stored. + The value is adjusted so that the first bit is one, and then that + leading one is assumed instead of physically stored. This provides + one extra bit of precision. + +"Precision" + The number of bits used to represent a floating-point number. The + more bits, the more digits you can represent. Binary and decimal + precisions are related approximately, according to the formula: + + PREC = 3.322 * DPS + + Here, _prec_ denotes the binary precision (measured in bits) and + _dps_ (short for decimal places) is the decimal digits. + +"Rounding mode" + How numbers are rounded up or down when necessary. More details + are provided later. + +"Significand" + A floating-point value consists of the significand multiplied by 10 + to the power of the exponent. For example, in '1.2345e67', the + significand is '1.2345'. + +"Stability" + From the Wikipedia article on numerical stability + (http://en.wikipedia.org/wiki/Numerical_stability): "Calculations + that can be proven not to magnify approximation errors are called + "numerically stable"." + + See the Wikipedia article on accuracy and precision +(http://en.wikipedia.org/wiki/Accuracy_and_precision) for more +information on some of those terms. + + On modern systems, floating-point hardware uses the representation +and operations defined by the IEEE 754 standard. Three of the standard +IEEE 754 types are 32-bit single precision, 64-bit double precision, and +128-bit quadruple precision. The standard also specifies extended +precision formats to allow greater precisions and larger exponent +ranges. ('awk' uses only the 64-bit double-precision format.) + + *note Table 15.2: table-ieee-formats. lists the precision and +exponent field values for the basic IEEE 754 binary formats. + +Name Total bits Precision Minimum Maximum + exponent exponent +--------------------------------------------------------------------------- +Single 32 24 -126 +127 +Double 64 53 -1022 +1023 +Quadruple 128 113 -16382 +16383 + +Table 15.2: Basic IEEE format values + + NOTE: The precision numbers include the implied leading one that + gives them one extra bit of significand. + + ---------- Footnotes ---------- + + (1) Thanks to Michael Brennan for this description, which we have +paraphrased, and for the examples. + + +File: gawk.info, Node: MPFR features, Next: FP Math Caution, Prev: Math Definitions, Up: Arbitrary Precision Arithmetic + +15.3 Arbitrary-Precision Arithmetic Features in 'gawk' +====================================================== + +By default, 'gawk' uses the double-precision floating-point values +supplied by the hardware of the system it runs on. However, if it was +compiled to do so, and the '-M' command-line option is supplied, 'gawk' +uses the GNU MPFR (http://www.mpfr.org) and GNU MP (http://gmplib.org) +(GMP) libraries for arbitrary-precision arithmetic on numbers. You can +see if MPFR support is available like so: + + $ gawk --version + -| GNU Awk 4.1.2, API: 1.1 (GNU MPFR 3.1.0-p3, GNU MP 5.0.2) + -| Copyright (C) 1989, 1991-2015 Free Software Foundation. + ... + +(You may see different version numbers than what's shown here. That's +OK; what's important is to see that GNU MPFR and GNU MP are listed in +the output.) 
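+
+   If you want a program to confirm this for itself at run time, one
+possibility (a sketch, which relies on the 'PROCINFO' elements described
+next being available in your build) is to look for the MPFR version
+element; the version it reports matches whatever '--version' prints:
+
+     $ gawk -M 'BEGIN {
+     >     if ("mpfr_version" in PROCINFO)
+     >         print "MPFR", PROCINFO["mpfr_version"], "available"
+     >     else
+     >         print "MPFR support is not available"
+     > }'
+     -| MPFR 3.1.0-p3 available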
+ + Additionally, there are a few elements available in the 'PROCINFO' +array to provide information about the MPFR and GMP libraries (*note +Auto-set::). + + The MPFR library provides precise control over precisions and +rounding modes, and gives correctly rounded, reproducible, +platform-independent results. With the '-M' command-line option, all +floating-point arithmetic operators and numeric functions can yield +results to any desired precision level supported by MPFR. + + Two predefined variables, 'PREC' and 'ROUNDMODE', provide control +over the working precision and the rounding mode. The precision and the +rounding mode are set globally for every operation to follow. *Note +Setting precision:: and *note Setting the rounding mode:: for more +information. + + +File: gawk.info, Node: FP Math Caution, Next: Arbitrary Precision Integers, Prev: MPFR features, Up: Arbitrary Precision Arithmetic + +15.4 Floating-Point Arithmetic: Caveat Emptor! +============================================== + + Math class is tough! + -- _Teen Talk Barbie, July 1992_ + + This minor node provides a high-level overview of the issues involved +when doing lots of floating-point arithmetic.(1) The discussion applies +to both hardware and arbitrary-precision floating-point arithmetic. + + CAUTION: The material here is purposely general. If you need to do + serious computer arithmetic, you should do some research first, and + not rely just on what we tell you. + +* Menu: + +* Inexactness of computations:: Floating point math is not exact. +* Getting Accuracy:: Getting more accuracy takes some work. +* Try To Round:: Add digits and round. +* Setting precision:: How to set the precision. +* Setting the rounding mode:: How to set the rounding mode. + + ---------- Footnotes ---------- + + (1) There is a very nice paper on floating-point arithmetic +(http://www.validlab.com/goldberg/paper.pdf) by David Goldberg, "What +Every Computer Scientist Should Know About Floating-Point Arithmetic," +'ACM Computing Surveys' *23*, 1 (1991-03): 5-48. This is worth reading +if you are interested in the details, but it does require a background +in computer science. + + +File: gawk.info, Node: Inexactness of computations, Next: Getting Accuracy, Up: FP Math Caution + +15.4.1 Floating-Point Arithmetic Is Not Exact +--------------------------------------------- + +Binary floating-point representations and arithmetic are inexact. +Simple values like 0.1 cannot be precisely represented using binary +floating-point numbers, and the limited precision of floating-point +numbers means that slight changes in the order of operations or the +precision of intermediate storage can change the result. To make +matters worse, with arbitrary-precision floating-point arithmetic, you +can set the precision before starting a computation, but then you cannot +be sure of the number of significant decimal places in the final result. + +* Menu: + +* Inexact representation:: Numbers are not exactly represented. +* Comparing FP Values:: How to compare floating point values. +* Errors accumulate:: Errors get bigger as they go. + + +File: gawk.info, Node: Inexact representation, Next: Comparing FP Values, Up: Inexactness of computations + +15.4.1.1 Many Numbers Cannot Be Represented Exactly +................................................... + +So, before you start to write any code, you should think about what you +really want and what's really happening. 
Consider the two numbers in
+the following example:
+
+     x = 0.875             # 1/2 + 1/4 + 1/8
+     y = 0.425
+
+   Unlike the number in 'y', the number stored in 'x' is exactly
+representable in binary because it can be written as a finite sum of one
+or more fractions whose denominators are all powers of two.  When 'gawk'
+reads a floating-point number from program source, it automatically
+rounds that number to whatever precision your machine supports.  If you
+try to print the numeric content of a variable using an output format
+string of '"%.17g"', it may not produce the same number as you assigned
+to it:
+
+     $ gawk 'BEGIN { x = 0.875; y = 0.425
+     >               printf("%0.17g, %0.17g\n", x, y) }'
+     -| 0.875, 0.42499999999999999
+
+   Often the error is so small you do not even notice it, and if you do,
+you can always specify how much precision you would like in your output.
+Usually this is a format string like '"%.15g"', which, when used in the
+previous example, produces an output identical to the input.
+
+
+File: gawk.info,  Node: Comparing FP Values,  Next: Errors accumulate,  Prev: Inexact representation,  Up: Inexactness of computations
+
+15.4.1.2 Be Careful Comparing Values
+....................................
+
+Because the underlying representation can be a little bit off from the
+exact value, comparing floating-point values to see if they are exactly
+equal is generally a bad idea.  Here is an example where it does not
+work like you would expect:
+
+     $ gawk 'BEGIN { print (0.1 + 12.2 == 12.3) }'
+     -| 0
+
+   The general wisdom when comparing floating-point values is to see if
+they are within some small range of each other (called a "delta", or
+"tolerance").  You have to decide how small a delta is important to you.
+Code to do this looks something like the following:
+
+     delta = 0.00001                 # for example
+     difference = abs(a - b)         # compute the absolute difference
+     if (difference < delta)
+         # all ok
+     else
+         # not ok
+
+(We assume that you have a simple absolute value function named 'abs()'
+defined elsewhere in your program.)
+
+
+File: gawk.info,  Node: Errors accumulate,  Prev: Comparing FP Values,  Up: Inexactness of computations
+
+15.4.1.3 Errors Accumulate
+..........................
+
+The loss of accuracy during a single computation with floating-point
+numbers usually isn't enough to worry about.  However, if you compute a
+value that is the result of a sequence of floating-point operations, the
+error can accumulate and greatly affect the computation itself.  Here is
+an attempt to compute the value of pi using one of its many series
+representations:
+
+     BEGIN {
+         x = 1.0 / sqrt(3.0)
+         n = 6
+         for (i = 1; i < 30; i++) {
+             n = n * 2.0
+             x = (sqrt(x * x + 1) - 1) / x
+             printf("%.15f\n", n * x)
+         }
+     }
+
+   When run, the early errors propagate through later computations,
+causing the loop to terminate prematurely after attempting to divide by
+zero:
+
+     $ gawk -f pi.awk
+     -| 3.215390309173475
+     -| 3.159659942097510
+     -| 3.146086215131467
+     -| 3.142714599645573
+     ...
+     -| 3.224515243534819
+     -| 2.791117213058638
+     -| 0.000000000000000
+     error-> gawk: pi.awk:6: fatal: division by zero attempted
+
+   Here is an additional example where the inaccuracies in internal
+representations yield an unexpected result:
+
+     $ gawk 'BEGIN {
+     >   for (d = 1.1; d <= 1.5; d += 0.1)    # loop five times (?)
+ > i++ + > print i + > }' + -| 4 + + +File: gawk.info, Node: Getting Accuracy, Next: Try To Round, Prev: Inexactness of computations, Up: FP Math Caution + +15.4.2 Getting the Accuracy You Need +------------------------------------ + +Can arbitrary-precision arithmetic give exact results? There are no +easy answers. The standard rules of algebra often do not apply when +using floating-point arithmetic. Among other things, the distributive +and associative laws do not hold completely, and order of operation may +be important for your computation. Rounding error, cumulative precision +loss, and underflow are often troublesome. + + When 'gawk' tests the expressions '0.1 + 12.2' and '12.3' for +equality using the machine double-precision arithmetic, it decides that +they are not equal! (*Note Comparing FP Values::.) You can get the +result you want by increasing the precision; 56 bits in this case does +the job: + + $ gawk -M -v PREC=56 'BEGIN { print (0.1 + 12.2 == 12.3) }' + -| 1 + + If adding more bits is good, perhaps adding even more bits of +precision is better? Here is what happens if we use an even larger +value of 'PREC': + + $ gawk -M -v PREC=201 'BEGIN { print (0.1 + 12.2 == 12.3) }' + -| 0 + + This is not a bug in 'gawk' or in the MPFR library. It is easy to +forget that the finite number of bits used to store the value is often +just an approximation after proper rounding. The test for equality +succeeds if and only if _all_ bits in the two operands are exactly the +same. Because this is not necessarily true after floating-point +computations with a particular precision and effective rounding mode, a +straight test for equality may not work. Instead, compare the two +numbers to see if they are within the desirable delta of each other. + + In applications where 15 or fewer decimal places suffice, hardware +double-precision arithmetic can be adequate, and is usually much faster. +But you need to keep in mind that every floating-point operation can +suffer a new rounding error with catastrophic consequences, as +illustrated by our earlier attempt to compute the value of pi. Extra +precision can greatly enhance the stability and the accuracy of your +computation in such cases. + + Additionally, you should understand that repeated addition is not +necessarily equivalent to multiplication in floating-point arithmetic. +In the example in *note Errors accumulate::: + + $ gawk 'BEGIN { + > for (d = 1.1; d <= 1.5; d += 0.1) # loop five times (?) + > i++ + > print i + > }' + -| 4 + +you may or may not succeed in getting the correct result by choosing an +arbitrarily large value for 'PREC'. Reformulation of the problem at +hand is often the correct approach in such situations. + + +File: gawk.info, Node: Try To Round, Next: Setting precision, Prev: Getting Accuracy, Up: FP Math Caution + +15.4.3 Try a Few Extra Bits of Precision and Rounding +----------------------------------------------------- + +Instead of arbitrary-precision floating-point arithmetic, often all you +need is an adjustment of your logic or a different order for the +operations in your calculation. 
The stability and the accuracy of the +computation of pi in the earlier example can be enhanced by using the +following simple algebraic transformation: + + (sqrt(x * x + 1) - 1) / x == x / (sqrt(x * x + 1) + 1) + +After making this change, the program converges to pi in under 30 +iterations: + + $ gawk -f pi2.awk + -| 3.215390309173473 + -| 3.159659942097501 + -| 3.146086215131436 + -| 3.142714599645370 + -| 3.141873049979825 + ... + -| 3.141592653589797 + -| 3.141592653589797 + + +File: gawk.info, Node: Setting precision, Next: Setting the rounding mode, Prev: Try To Round, Up: FP Math Caution + +15.4.4 Setting the Precision +---------------------------- + +'gawk' uses a global working precision; it does not keep track of the +precision or accuracy of individual numbers. Performing an arithmetic +operation or calling a built-in function rounds the result to the +current working precision. The default working precision is 53 bits, +which you can modify using the predefined variable 'PREC'. You can also +set the value to one of the predefined case-insensitive strings shown in +*note Table 15.3: table-predefined-precision-strings, to emulate an IEEE +754 binary format. + +'PREC' IEEE 754 binary format +--------------------------------------------------- +'"half"' 16-bit half-precision +'"single"' Basic 32-bit single precision +'"double"' Basic 64-bit double precision +'"quad"' Basic 128-bit quadruple precision +'"oct"' 256-bit octuple precision + +Table 15.3: Predefined precision strings for 'PREC' + + The following example illustrates the effects of changing precision +on arithmetic operations: + + $ gawk -M -v PREC=100 'BEGIN { x = 1.0e-400; print x + 0 + > PREC = "double"; print x + 0 }' + -| 1e-400 + -| 0 + + CAUTION: Be wary of floating-point constants! When reading a + floating-point constant from program source code, 'gawk' uses the + default precision (that of a C 'double'), unless overridden by an + assignment to the special variable 'PREC' on the command line, to + store it internally as an MPFR number. Changing the precision + using 'PREC' in the program text does _not_ change the precision of + a constant. + + If you need to represent a floating-point constant at a higher + precision than the default and cannot use a command-line assignment + to 'PREC', you should either specify the constant as a string, or + as a rational number, whenever possible. The following example + illustrates the differences among various ways to print a + floating-point constant: + + $ gawk -M 'BEGIN { PREC = 113; printf("%0.25f\n", 0.1) }' + -| 0.1000000000000000055511151 + $ gawk -M -v PREC=113 'BEGIN { printf("%0.25f\n", 0.1) }' + -| 0.1000000000000000000000000 + $ gawk -M 'BEGIN { PREC = 113; printf("%0.25f\n", "0.1") }' + -| 0.1000000000000000000000000 + $ gawk -M 'BEGIN { PREC = 113; printf("%0.25f\n", 1/10) }' + -| 0.1000000000000000000000000 + + +File: gawk.info, Node: Setting the rounding mode, Prev: Setting precision, Up: FP Math Caution + +15.4.5 Setting the Rounding Mode +-------------------------------- + +The 'ROUNDMODE' variable provides program-level control over the +rounding mode. The correspondence between 'ROUNDMODE' and the IEEE +rounding modes is shown in *note Table 15.4: table-gawk-rounding-modes. 
+ +Rounding mode IEEE name 'ROUNDMODE' +--------------------------------------------------------------------------- +Round to nearest, ties to even 'roundTiesToEven' '"N"' or '"n"' +Round toward positive infinity 'roundTowardPositive' '"U"' or '"u"' +Round toward negative infinity 'roundTowardNegative' '"D"' or '"d"' +Round toward zero 'roundTowardZero' '"Z"' or '"z"' +Round to nearest, ties away 'roundTiesToAway' '"A"' or '"a"' +from zero + +Table 15.4: 'gawk' rounding modes + + 'ROUNDMODE' has the default value '"N"', which selects the IEEE 754 +rounding mode 'roundTiesToEven'. In *note Table 15.4: +table-gawk-rounding-modes, the value '"A"' selects 'roundTiesToAway'. +This is only available if your version of the MPFR library supports it; +otherwise, setting 'ROUNDMODE' to '"A"' has no effect. + + The default mode 'roundTiesToEven' is the most preferred, but the +least intuitive. This method does the obvious thing for most values, by +rounding them up or down to the nearest digit. For example, rounding +1.132 to two digits yields 1.13, and rounding 1.157 yields 1.16. + + However, when it comes to rounding a value that is exactly halfway +between, things do not work the way you probably learned in school. In +this case, the number is rounded to the nearest even digit. So rounding +0.125 to two digits rounds down to 0.12, but rounding 0.6875 to three +digits rounds up to 0.688. You probably have already encountered this +rounding mode when using 'printf' to format floating-point numbers. For +example: + + BEGIN { + x = -4.5 + for (i = 1; i < 10; i++) { + x += 1.0 + printf("%4.1f => %2.0f\n", x, x) + } + } + +produces the following output when run on the author's system:(1) + + -3.5 => -4 + -2.5 => -2 + -1.5 => -2 + -0.5 => 0 + 0.5 => 0 + 1.5 => 2 + 2.5 => 2 + 3.5 => 4 + 4.5 => 4 + + The theory behind 'roundTiesToEven' is that it more or less evenly +distributes upward and downward rounds of exact halves, which might +cause any accumulating round-off error to cancel itself out. This is +the default rounding mode for IEEE 754 computing functions and +operators. + + The other rounding modes are rarely used. Rounding toward positive +infinity ('roundTowardPositive') and toward negative infinity +('roundTowardNegative') are often used to implement interval arithmetic, +where you adjust the rounding mode to calculate upper and lower bounds +for the range of output. The 'roundTowardZero' mode can be used for +converting floating-point numbers to integers. The rounding mode +'roundTiesToAway' rounds the result to the nearest number and selects +the number with the larger magnitude if a tie occurs. + + Some numerical analysts will tell you that your choice of rounding +style has tremendous impact on the final outcome, and advise you to wait +until final output for any rounding. Instead, you can often avoid +round-off error problems by setting the precision initially to some +value sufficiently larger than the final desired precision, so that the +accumulation of round-off error does not influence the outcome. If you +suspect that results from your computation are sensitive to accumulation +of round-off error, look for a significant difference in output when you +change the rounding mode to be sure. + + ---------- Footnotes ---------- + + (1) It is possible for the output to be completely different if the C +library in your system does not use the IEEE 754 even-rounding rule to +round halfway cases for 'printf'. 
+ + +File: gawk.info, Node: Arbitrary Precision Integers, Next: POSIX Floating Point Problems, Prev: FP Math Caution, Up: Arbitrary Precision Arithmetic + +15.5 Arbitrary-Precision Integer Arithmetic with 'gawk' +======================================================= + +When given the '-M' option, 'gawk' performs all integer arithmetic using +GMP arbitrary-precision integers. Any number that looks like an integer +in a source or data file is stored as an arbitrary-precision integer. +The size of the integer is limited only by the available memory. For +example, the following computes 5^4^3^2, the result of which is beyond +the limits of ordinary hardware double-precision floating-point values: + + $ gawk -M 'BEGIN { + > x = 5^4^3^2 + > print "number of digits =", length(x) + > print substr(x, 1, 20), "...", substr(x, length(x) - 19, 20) + > }' + -| number of digits = 183231 + -| 62060698786608744707 ... 92256259918212890625 + + If instead you were to compute the same value using +arbitrary-precision floating-point values, the precision needed for +correct output (using the formula 'prec = 3.322 * dps') would be 3.322 x +183231, or 608693. + + The result from an arithmetic operation with an integer and a +floating-point value is a floating-point value with a precision equal to +the working precision. The following program calculates the eighth term +in Sylvester's sequence(1) using a recurrence: + + $ gawk -M 'BEGIN { + > s = 2.0 + > for (i = 1; i <= 7; i++) + > s = s * (s - 1) + 1 + > print s + > }' + -| 113423713055421845118910464 + + The output differs from the actual number, +113,423,713,055,421,844,361,000,443, because the default precision of 53 +bits is not enough to represent the floating-point results exactly. You +can either increase the precision (100 bits is enough in this case), or +replace the floating-point constant '2.0' with an integer, to perform +all computations using integer arithmetic to get the correct output. + + Sometimes 'gawk' must implicitly convert an arbitrary-precision +integer into an arbitrary-precision floating-point value. This is +primarily because the MPFR library does not always provide the relevant +interface to process arbitrary-precision integers or mixed-mode numbers +as needed by an operation or function. In such a case, the precision is +set to the minimum value necessary for exact conversion, and the working +precision is not used for this purpose. If this is not what you need or +want, you can employ a subterfuge and convert the integer to floating +point first, like this: + + gawk -M 'BEGIN { n = 13; print (n + 0.0) % 2.0 }' + + You can avoid this issue altogether by specifying the number as a +floating-point value to begin with: + + gawk -M 'BEGIN { n = 13.0; print n % 2.0 }' + + Note that for this particular example, it is likely best to just use +the following: + + gawk -M 'BEGIN { n = 13; print n % 2 }' + + When dividing two arbitrary precision integers with either '/' or +'%', the result is typically an arbitrary precision floating point value +(unless the denominator evenly divides into the numerator). In order to +do integer division or remainder with arbitrary precision integers, use +the built-in 'intdiv()' function (*note Numeric Functions::). 
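+
+   For example, here is one possible session (it assumes that your
+'gawk' provides 'intdiv()'; see *note Numeric Functions::), splitting
+2^64 into an exact quotient and remainder by ten:
+
+     $ gawk -M 'BEGIN {
+     >     intdiv(2^64, 10, result)
+     >     print result["quotient"], result["remainder"]
+     > }'
+     -| 1844674407370955161 6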
+ + You can simulate the 'intdiv()' function in standard 'awk' using this +user-defined function: + + # intdiv --- do integer division + + function intdiv(numerator, denominator, result) + { + split("", result) + + numerator = int(numerator) + denominator = int(denominator) + result["quotient"] = int(numerator / denominator) + result["remainder"] = int(numerator % denominator) + + return 0.0 + } + + The following example program, contributed by Katie Wasserman, uses +'intdiv()' to compute the digits of pi to as many places as you choose +to set: + + # pi.awk --- compute the digits of pi + + BEGIN { + digits = 100000 + two = 2 * 10 ^ digits + pi = two + for (m = digits * 4; m > 0; --m) { + d = m * 2 + 1 + x = pi * m + intdiv(x, d, result) + pi = result["quotient"] + pi = pi + two + } + print pi + } + + When asked about the algorithm used, Katie replied: + + It's not that well known but it's not that obscure either. It's + Euler's modification to Newton's method for calculating pi. Take a + look at lines (23) - (25) here: + <http://mathworld.wolfram.com/PiFormulas.html>. + + The algorithm I wrote simply expands the multiply by 2 and works + from the innermost expression outwards. I used this to program HP + calculators because it's quite easy to modify for tiny memory + devices with smallish word sizes. See + <http://www.hpmuseum.org/cgi-sys/cgiwrap/hpmuseum/articles.cgi?read=899>. + + ---------- Footnotes ---------- + + (1) Weisstein, Eric W. 'Sylvester's Sequence'. From MathWorld--A +Wolfram Web Resource +(<http://mathworld.wolfram.com/SylvestersSequence.html>). + + +File: gawk.info, Node: POSIX Floating Point Problems, Next: Floating point summary, Prev: Arbitrary Precision Integers, Up: Arbitrary Precision Arithmetic + +15.6 Standards Versus Existing Practice +======================================= + +Historically, 'awk' has converted any nonnumeric-looking string to the +numeric value zero, when required. Furthermore, the original definition +of the language and the original POSIX standards specified that 'awk' +only understands decimal numbers (base 10), and not octal (base 8) or +hexadecimal numbers (base 16). + + Changes in the language of the 2001 and 2004 POSIX standards can be +interpreted to imply that 'awk' should support additional features. +These features are: + + * Interpretation of floating-point data values specified in + hexadecimal notation (e.g., '0xDEADBEEF'). (Note: data values, + _not_ source code constants.) + + * Support for the special IEEE 754 floating-point values "not a + number" (NaN), positive infinity ("inf"), and negative infinity + ("-inf"). In particular, the format for these values is as + specified by the ISO 1999 C standard, which ignores case and can + allow implementation-dependent additional characters after the + 'nan' and allow either 'inf' or 'infinity'. + + The first problem is that both of these are clear changes to +historical practice: + + * The 'gawk' maintainer feels that supporting hexadecimal + floating-point values, in particular, is ugly, and was never + intended by the original designers to be part of the language. + + * Allowing completely alphabetic strings to have valid numeric values + is also a very severe departure from historical practice. + + The second problem is that the 'gawk' maintainer feels that this +interpretation of the standard, which required a certain amount of +"language lawyering" to arrive at in the first place, was not even +intended by the standard developers. 
In other words, "We see how you
+got where you are, but we don't think that that's where you want to be."
+
+   Recognizing these issues, but attempting to provide compatibility
+with the earlier versions of the standard, the 2008 POSIX standard added
+explicit wording to allow, but not require, that 'awk' support
+hexadecimal floating-point values and special values for "not a number"
+and infinity.
+
+   Although the 'gawk' maintainer continues to feel that providing those
+features is inadvisable, nevertheless, on systems that support IEEE
+floating point, it seems reasonable to provide _some_ way to support NaN
+and infinity values.  The solution implemented in 'gawk' is as follows:
+
+   * With the '--posix' command-line option, 'gawk' becomes "hands off."
+     String values are passed directly to the system library's
+     'strtod()' function, and if it successfully returns a numeric
+     value, that is what's used.(1)  By definition, the results are not
+     portable across different systems.  They are also a little
+     surprising:
+
+          $ echo nanny | gawk --posix '{ print $1 + 0 }'
+          -| nan
+          $ echo 0xDeadBeef | gawk --posix '{ print $1 + 0 }'
+          -| 3735928559
+
+   * Without '--posix', 'gawk' interprets the four string values '+inf',
+     '-inf', '+nan', and '-nan' specially, producing the corresponding
+     special numeric values.  The leading sign acts as a signal to
+     'gawk' (and the user) that the value is really numeric.
+     Hexadecimal floating point is not supported (unless you also use
+     '--non-decimal-data', which is _not_ recommended).  For example:
+
+          $ echo nanny | gawk '{ print $1 + 0 }'
+          -| 0
+          $ echo +nan | gawk '{ print $1 + 0 }'
+          -| nan
+          $ echo 0xDeadBeef | gawk '{ print $1 + 0 }'
+          -| 0
+
+     'gawk' ignores case in the four special values.  Thus, '+nan' and
+     '+NaN' are the same.
+
+   ---------- Footnotes ----------
+
+   (1) You asked for it, you got it.
+
+
+File: gawk.info,  Node: Floating point summary,  Prev: POSIX Floating Point Problems,  Up: Arbitrary Precision Arithmetic
+
+15.7 Summary
+============
+
+   * Most computer arithmetic is done using either integers or
+     floating-point values.  Standard 'awk' uses double-precision
+     floating-point values.
+
+   * In the early 1990s, Barbie mistakenly said, "Math class is tough!"
+     Although math isn't tough, floating-point arithmetic isn't the same
+     as pencil-and-paper math, and care must be taken:
+
+        - Not all numbers can be represented exactly.
+
+        - Comparing values should use a delta, instead of being done
+          directly with '==' and '!='.
+
+        - Errors accumulate.
+
+        - Operations are not always truly associative or distributive.
+
+   * Increasing the accuracy can help, but it is not a panacea.
+
+   * Often, increasing the accuracy and then rounding to the desired
+     number of digits produces reasonable results.
+
+   * Use '-M' (or '--bignum') to enable MPFR arithmetic.  Use 'PREC' to
+     set the precision in bits, and 'ROUNDMODE' to set the IEEE 754
+     rounding mode.
+
+   * With '-M', 'gawk' performs arbitrary-precision integer arithmetic
+     using the GMP library.  This is faster and more space-efficient
+     than using MPFR for the same calculations.
+
+   * There are several areas with respect to floating-point numbers
+     where 'gawk' disagrees with the POSIX standard.  It pays to be
+     aware of them.
+
+   * Overall, there is no need to be unduly suspicious about the results
+     from floating-point arithmetic.  The lesson to remember is that
+     floating-point arithmetic is always more complex than arithmetic
+     using pencil and paper.
In order to take advantage of the power of + floating-point arithmetic, you need to know its limitations and + work within them. For most casual use of floating-point + arithmetic, you will often get the expected result if you simply + round the display of your final results to the correct number of + significant decimal digits. + + * As general advice, avoid presenting numerical data in a manner that + implies better precision than is actually the case. + + +File: gawk.info, Node: Dynamic Extensions, Next: Language History, Prev: Arbitrary Precision Arithmetic, Up: Top + +16 Writing Extensions for 'gawk' +******************************** + +It is possible to add new functions written in C or C++ to 'gawk' using +dynamically loaded libraries. This facility is available on systems +that support the C 'dlopen()' and 'dlsym()' functions. This major node +describes how to create extensions using code written in C or C++. + + If you don't know anything about C programming, you can safely skip +this major node, although you may wish to review the documentation on +the extensions that come with 'gawk' (*note Extension Samples::), and +the information on the 'gawkextlib' project (*note gawkextlib::). The +sample extensions are automatically built and installed when 'gawk' is. + + NOTE: When '--sandbox' is specified, extensions are disabled (*note + Options::). + +* Menu: + +* Extension Intro:: What is an extension. +* Plugin License:: A note about licensing. +* Extension Mechanism Outline:: An outline of how it works. +* Extension API Description:: A full description of the API. +* Finding Extensions:: How 'gawk' finds compiled extensions. +* Extension Example:: Example C code for an extension. +* Extension Samples:: The sample extensions that ship with + 'gawk'. +* gawkextlib:: The 'gawkextlib' project. +* Extension summary:: Extension summary. +* Extension Exercises:: Exercises. + + +File: gawk.info, Node: Extension Intro, Next: Plugin License, Up: Dynamic Extensions + +16.1 Introduction +================= + +An "extension" (sometimes called a "plug-in") is a piece of external +compiled code that 'gawk' can load at runtime to provide additional +functionality, over and above the built-in capabilities described in the +rest of this Info file. + + Extensions are useful because they allow you (of course) to extend +'gawk''s functionality. For example, they can provide access to system +calls (such as 'chdir()' to change directory) and to other C library +routines that could be of use. As with most software, "the sky is the +limit"; if you can imagine something that you might want to do and can +write in C or C++, you can write an extension to do it! + + Extensions are written in C or C++, using the "application +programming interface" (API) defined for this purpose by the 'gawk' +developers. The rest of this major node explains the facilities that +the API provides and how to use them, and presents a small example +extension. In addition, it documents the sample extensions included in +the 'gawk' distribution and describes the 'gawkextlib' project. *Note +Extension Design::, for a discussion of the extension mechanism goals +and design. + + +File: gawk.info, Node: Plugin License, Next: Extension Mechanism Outline, Prev: Extension Intro, Up: Dynamic Extensions + +16.2 Extension Licensing +======================== + +Every dynamic extension must be distributed under a license that is +compatible with the GNU GPL (*note Copying::). 
+ + In order for the extension to tell 'gawk' that it is properly +licensed, the extension must define the global symbol +'plugin_is_GPL_compatible'. If this symbol does not exist, 'gawk' emits +a fatal error and exits when it tries to load your extension. + + The declared type of the symbol should be 'int'. It does not need to +be in any allocated section, though. The code merely asserts that the +symbol exists in the global scope. Something like this is enough: + + int plugin_is_GPL_compatible; + + +File: gawk.info, Node: Extension Mechanism Outline, Next: Extension API Description, Prev: Plugin License, Up: Dynamic Extensions + +16.3 How It Works at a High Level +================================= + +Communication between 'gawk' and an extension is two-way. First, when +an extension is loaded, 'gawk' passes it a pointer to a 'struct' whose +fields are function pointers. This is shown in *note Figure 16.1: +figure-load-extension. + + + Struct + +---+ + | | + +---+ + +---------------| | + | +---+ dl_load(api_p, id); + | | | ___________________ + | +---+ | + | +---------| | __________________ | + | | +---+ || + | | | | || + | | +---+ || + | | +---| | || + | | | +---+ \\ || / + | | | \\ / + v v v \\/ ++-------+-+---+-+---+-+------------------+--------------------+ +| |x| |x| |x| |OOOOOOOOOOOOOOOOOOOO| +| |x| |x| |x| |OOOOOOOOOOOOOOOOOOOO| +| |x| |x| |x| |OOOOOOOOOOOOOOOOOOOO| ++-------+-+---+-+---+-+------------------+--------------------+ + + gawk Main Program Address Space Extension" + +Figure 16.1: Loading the extension + + The extension can call functions inside 'gawk' through these function +pointers, at runtime, without needing (link-time) access to 'gawk''s +symbols. One of these function pointers is to a function for +"registering" new functions. This is shown in *note Figure 16.2: +figure-register-new-function. + + + + +--------------------------------------------+ + | | + V | ++-------+-+---+-+---+-+------------------+--------------+-+---+ +| |x| |x| |x| |OOOOOOOOOOOOOO|X|OOO| +| |x| |x| |x| |OOOOOOOOOOOOOO|X|OOO| +| |x| |x| |x| |OOOOOOOOOOOOOO|X|OOO| ++-------+-+---+-+---+-+------------------+--------------+-+---+ + + gawk Main Program Address Space Extension" + +Figure 16.2: Registering a new function + + In the other direction, the extension registers its new functions +with 'gawk' by passing function pointers to the functions that provide +the new feature ('do_chdir()', for example). 'gawk' associates the +function pointer with a name and can then call it, using a defined +calling convention. This is shown in *note Figure 16.3: +figure-call-new-function. + + + chdir(\"/path\") (*fnptr)(1); + } + +--------------------------------------------+ + | | + | V ++-------+-+---+-+---+-+------------------+--------------+-+---+ +| |x| |x| |x| |OOOOOOOOOOOOOO|X|OOO| +| |x| |x| |x| |OOOOOOOOOOOOOO|X|OOO| +| |x| |x| |x| |OOOOOOOOOOOOOO|X|OOO| ++-------+-+---+-+---+-+------------------+--------------+-+---+ + + gawk Main Program Address Space Extension" + +Figure 16.3: Calling the new function + + The 'do_XXX()' function, in turn, then uses the function pointers in +the API 'struct' to do its work, such as updating variables or arrays, +printing messages, setting 'ERRNO', and so on. + + Convenience macros make calling through the function pointers look +like regular function calls so that extension code is quite readable and +understandable. + + Although all of this sounds somewhat complicated, the result is that +extension code is quite straightforward to write and to read. 
You can +see this in the sample extension 'filefuncs.c' (*note Extension +Example::) and also in the 'testext.c' code for testing the APIs. + + Some other bits and pieces: + + * The API provides access to 'gawk''s 'do_XXX' values, reflecting + command-line options, like 'do_lint', 'do_profiling', and so on + (*note Extension API Variables::). These are informational: an + extension cannot affect their values inside 'gawk'. In addition, + attempting to assign to them produces a compile-time error. + + * The API also provides major and minor version numbers, so that an + extension can check if the 'gawk' it is loaded with supports the + facilities it was compiled with. (Version mismatches "shouldn't" + happen, but we all know how _that_ goes.) *Note Extension + Versioning:: for details. + + +File: gawk.info, Node: Extension API Description, Next: Finding Extensions, Prev: Extension Mechanism Outline, Up: Dynamic Extensions + +16.4 API Description +==================== + +C or C++ code for an extension must include the header file 'gawkapi.h', +which declares the functions and defines the data types used to +communicate with 'gawk'. This (rather large) minor node describes the +API in detail. + +* Menu: + +* Extension API Functions Introduction:: Introduction to the API functions. +* General Data Types:: The data types. +* Memory Allocation Functions:: Functions for allocating memory. +* Constructor Functions:: Functions for creating values. +* Registration Functions:: Functions to register things with + 'gawk'. +* Printing Messages:: Functions for printing messages. +* Updating ERRNO:: Functions for updating 'ERRNO'. +* Requesting Values:: How to get a value. +* Accessing Parameters:: Functions for accessing parameters. +* Symbol Table Access:: Functions for accessing global + variables. +* Array Manipulation:: Functions for working with arrays. +* Redirection API:: How to access and manipulate redirections. +* Extension API Variables:: Variables provided by the API. +* Extension API Boilerplate:: Boilerplate code for using the API. + + +File: gawk.info, Node: Extension API Functions Introduction, Next: General Data Types, Up: Extension API Description + +16.4.1 Introduction +------------------- + +Access to facilities within 'gawk' is achieved by calling through +function pointers passed into your extension. + + API function pointers are provided for the following kinds of +operations: + + * Allocating, reallocating, and releasing memory. + + * Registration functions. You may register: + + - Extension functions + - Exit callbacks + - A version string + - Input parsers + - Output wrappers + - Two-way processors + + All of these are discussed in detail later in this major node. + + * Printing fatal, warning, and "lint" warning messages. + + * Updating 'ERRNO', or unsetting it. + + * Accessing parameters, including converting an undefined parameter + into an array. + + * Symbol table access: retrieving a global variable, creating one, or + changing one. + + * Creating and releasing cached values; this provides an efficient + way to use values for multiple variables and can be a big + performance win. + + * Manipulating arrays: + + - Retrieving, adding, deleting, and modifying elements + + - Getting the count of elements in an array + + - Creating a new array + + - Clearing an array + + - Flattening an array for easy C-style looping over all its + indices and elements + + * Accessing and manipulating redirections. 
+ + Some points about using the API: + + * The following types, macros, and/or functions are referenced in + 'gawkapi.h'. For correct use, you must therefore include the + corresponding standard header file _before_ including 'gawkapi.h': + + C entity Header file + ------------------------------------------- + 'EOF' '<stdio.h>' + Values for 'errno' '<errno.h>' + 'FILE' '<stdio.h>' + 'NULL' '<stddef.h>' + 'memcpy()' '<string.h>' + 'memset()' '<string.h>' + 'size_t' '<sys/types.h>' + 'struct stat' '<sys/stat.h>' + + Due to portability concerns, especially to systems that are not + fully standards-compliant, it is your responsibility to include the + correct files in the correct way. This requirement is necessary in + order to keep 'gawkapi.h' clean, instead of becoming a portability + hodge-podge as can be seen in some parts of the 'gawk' source code. + + * The 'gawkapi.h' file may be included more than once without ill + effect. Doing so, however, is poor coding practice. + + * Although the API only uses ISO C 90 features, there is an + exception; the "constructor" functions use the 'inline' keyword. + If your compiler does not support this keyword, you should either + place '-Dinline=''' on your command line or use the GNU Autotools + and include a 'config.h' file in your extensions. + + * All pointers filled in by 'gawk' point to memory managed by 'gawk' + and should be treated by the extension as read-only. Memory for + _all_ strings passed into 'gawk' from the extension _must_ come + from calling one of 'gawk_malloc()', 'gawk_calloc()', or + 'gawk_realloc()', and is managed by 'gawk' from then on. + + * The API defines several simple 'struct's that map values as seen + from 'awk'. A value can be a 'double', a string, or an array (as + in multidimensional arrays, or when creating a new array). String + values maintain both pointer and length, because embedded NUL + characters are allowed. + + NOTE: By intent, strings are maintained using the current + multibyte encoding (as defined by 'LC_XXX' environment + variables) and not using wide characters. This matches how + 'gawk' stores strings internally and also how characters are + likely to be input into and output from files. + + * When retrieving a value (such as a parameter or that of a global + variable or array element), the extension requests a specific type + (number, string, scalar, value cookie, array, or "undefined"). + When the request is "undefined," the returned value will have the + real underlying type. + + However, if the request and actual type don't match, the access + function returns "false" and fills in the type of the actual value + that is there, so that the extension can, e.g., print an error + message (such as "scalar passed where array expected"). + + You may call the API functions by using the function pointers +directly, but the interface is not so pretty. To make extension code +look more like regular code, the 'gawkapi.h' header file defines several +macros that you should use in your code. This minor node presents the +macros as if they were functions. + + +File: gawk.info, Node: General Data Types, Next: Memory Allocation Functions, Prev: Extension API Functions Introduction, Up: Extension API Description + +16.4.2 General-Purpose Data Types +--------------------------------- + + I have a true love/hate relationship with unions. + -- _Arnold Robbins_ + + That's the thing about unions: the compiler will arrange things so + they can accommodate both love and hate. 
+ -- _Chet Ramey_ + + The extension API defines a number of simple types and structures for +general-purpose use. Additional, more specialized, data structures are +introduced in subsequent minor nodes, together with the functions that +use them. + + The general-purpose types and structures are as follows: + +'typedef void *awk_ext_id_t;' + A value of this type is received from 'gawk' when an extension is + loaded. That value must then be passed back to 'gawk' as the first + parameter of each API function. + +'#define awk_const ...' + This macro expands to 'const' when compiling an extension, and to + nothing when compiling 'gawk' itself. This makes certain fields in + the API data structures unwritable from extension code, while + allowing 'gawk' to use them as it needs to. + +'typedef enum awk_bool {' +' awk_false = 0,' +' awk_true' +'} awk_bool_t;' + A simple Boolean type. + +'typedef struct awk_string {' +' char *str; /* data */' +' size_t len; /* length thereof, in chars */' +'} awk_string_t;' + This represents a mutable string. 'gawk' owns the memory pointed + to if it supplied the value. Otherwise, it takes ownership of the + memory pointed to. _Such memory must come from calling one of the + 'gawk_malloc()', 'gawk_calloc()', or 'gawk_realloc()' functions!_ + + As mentioned earlier, strings are maintained using the current + multibyte encoding. + +'typedef enum {' +' AWK_UNDEFINED,' +' AWK_NUMBER,' +' AWK_STRING,' +' AWK_ARRAY,' +' AWK_SCALAR, /* opaque access to a variable */' +' AWK_VALUE_COOKIE /* for updating a previously created value */' +'} awk_valtype_t;' + This 'enum' indicates the type of a value. It is used in the + following 'struct'. + +'typedef struct awk_value {' +' awk_valtype_t val_type;' +' union {' +' awk_string_t s;' +' double d;' +' awk_array_t a;' +' awk_scalar_t scl;' +' awk_value_cookie_t vc;' +' } u;' +'} awk_value_t;' + An "'awk' value." The 'val_type' member indicates what kind of + value the 'union' holds, and each member is of the appropriate + type. + +'#define str_value u.s' +'#define num_value u.d' +'#define array_cookie u.a' +'#define scalar_cookie u.scl' +'#define value_cookie u.vc' + Using these macros makes accessing the fields of the 'awk_value_t' + more readable. + +'typedef void *awk_scalar_t;' + Scalars can be represented as an opaque type. These values are + obtained from 'gawk' and then passed back into it. This is + discussed in a general fashion in the text following this list, and + in more detail in *note Symbol table by cookie::. + +'typedef void *awk_value_cookie_t;' + A "value cookie" is an opaque type representing a cached value. + This is also discussed in a general fashion in the text following + this list, and in more detail in *note Cached values::. + + Scalar values in 'awk' are either numbers or strings. The +'awk_value_t' struct represents values. The 'val_type' member indicates +what is in the 'union'. + + Representing numbers is easy--the API uses a C 'double'. Strings +require more work. Because 'gawk' allows embedded NUL bytes in string +values, a string must be represented as a pair containing a data pointer +and length. This is the 'awk_string_t' type. + + Identifiers (i.e., the names of global variables) can be associated +with either scalar values or with arrays. In addition, 'gawk' provides +true arrays of arrays, where any given array element can itself be an +array. Discussion of arrays is delayed until *note Array +Manipulation::. 
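+
+   As a simple illustration, an extension might fill in an
+'awk_value_t' by hand and then examine it, along the following lines.
+This is only a sketch (the variable name and values are arbitrary);
+the constructor functions described later are the preferred way to
+build values:
+
+     awk_value_t val;
+
+     /* a numeric value: set the type tag, then the number member */
+     val.val_type = AWK_NUMBER;
+     val.num_value = 3.1415927;      /* same as val.u.d = 3.1415927 */
+
+     /* when examining a value, always check val_type first */
+     if (val.val_type == AWK_NUMBER)
+         printf("number: %g\n", val.num_value);
+     else if (val.val_type == AWK_STRING)
+         printf("string: %.*s\n",
+                (int) val.str_value.len, val.str_value.str);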
+ + The various macros listed earlier make it easier to use the elements +of the 'union' as if they were fields in a 'struct'; this is a common +coding practice in C. Such code is easier to write and to read, but it +remains _your_ responsibility to make sure that the 'val_type' member +correctly reflects the type of the value in the 'awk_value_t' struct. + + Conceptually, the first three members of the 'union' (number, string, +and array) are all that is needed for working with 'awk' values. +However, because the API provides routines for accessing and changing +the value of a global scalar variable only by using the variable's name, +there is a performance penalty: 'gawk' must find the variable each time +it is accessed and changed. This turns out to be a real issue, not just +a theoretical one. + + Thus, if you know that your extension will spend considerable time +reading and/or changing the value of one or more scalar variables, you +can obtain a "scalar cookie"(1) object for that variable, and then use +the cookie for getting the variable's value or for changing the +variable's value. The 'awk_scalar_t' type holds a scalar cookie, and +the 'scalar_cookie' macro provides access to the value of that type in +the 'awk_value_t' struct. Given a scalar cookie, 'gawk' can directly +retrieve or modify the value, as required, without having to find it +first. + + The 'awk_value_cookie_t' type and 'value_cookie' macro are similar. +If you know that you wish to use the same numeric or string _value_ for +one or more variables, you can create the value once, retaining a "value +cookie" for it, and then pass in that value cookie whenever you wish to +set the value of a variable. This saves storage space within the +running 'gawk' process and reduces the time needed to create the value. + + ---------- Footnotes ---------- + + (1) See the "cookie" entry in the Jargon file +(http://catb.org/jargon/html/C/cookie.html) for a definition of +"cookie", and the "magic cookie" entry in the Jargon file +(http://catb.org/jargon/html/M/magic-cookie.html) for a nice example. +See also the entry for "Cookie" in the *note Glossary::. + + +File: gawk.info, Node: Memory Allocation Functions, Next: Constructor Functions, Prev: General Data Types, Up: Extension API Description + +16.4.3 Memory Allocation Functions and Convenience Macros +--------------------------------------------------------- + +The API provides a number of "memory allocation" functions for +allocating memory that can be passed to 'gawk', as well as a number of +convenience macros. This node presents them all as function prototypes, +in the way that extension code would use them: + +'void *gawk_malloc(size_t size);' + Call the correct version of 'malloc()' to allocate storage that may + be passed to 'gawk'. + +'void *gawk_calloc(size_t nmemb, size_t size);' + Call the correct version of 'calloc()' to allocate storage that may + be passed to 'gawk'. + +'void *gawk_realloc(void *ptr, size_t size);' + Call the correct version of 'realloc()' to allocate storage that + may be passed to 'gawk'. + +'void gawk_free(void *ptr);' + Call the correct version of 'free()' to release storage that was + allocated with 'gawk_malloc()', 'gawk_calloc()', or + 'gawk_realloc()'. 
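+
+   For example, an extension might build a string that will eventually
+be handed to 'gawk' like so.  This is only a sketch; the 'emalloc()'
+convenience macro described next does the error checking for you:
+
+     char *msg;
+     const char hello[] = "hello, world";
+
+     msg = (char *) gawk_malloc(sizeof(hello));
+     if (msg != NULL) {
+         strcpy(msg, hello);
+         /* msg may now be handed to gawk, which takes ownership */
+     }
+
+   Memory that your extension allocates purely for its own internal use
+need not come from these functions; the requirement applies only to
+memory that is passed to 'gawk'.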
+ + The API has to provide these functions because it is possible for an +extension to be compiled and linked against a different version of the C +library than was used for the 'gawk' executable.(1) If 'gawk' were to +use its version of 'free()' when the memory came from an unrelated +version of 'malloc()', unexpected behavior would likely result. + + Two convenience macros may be used for allocating storage from +'gawk_malloc()' and 'gawk_realloc()'. If the allocation fails, they +cause 'gawk' to exit with a fatal error message. They should be used as +if they were procedure calls that do not return a value: + +'#define emalloc(pointer, type, size, message) ...' + The arguments to this macro are as follows: + + 'pointer' + The pointer variable to point at the allocated storage. + + 'type' + The type of the pointer variable. This is used to create a + cast for the call to 'gawk_malloc()'. + + 'size' + The total number of bytes to be allocated. + + 'message' + A message to be prefixed to the fatal error message. + Typically this is the name of the function using the macro. + + For example, you might allocate a string value like so: + + awk_value_t result; + char *message; + const char greet[] = "Don't Panic!"; + + emalloc(message, char *, sizeof(greet), "myfunc"); + strcpy(message, greet); + make_malloced_string(message, strlen(message), & result); + +'#define erealloc(pointer, type, size, message) ...' + This is like 'emalloc()', but it calls 'gawk_realloc()' instead of + 'gawk_malloc()'. The arguments are the same as for the 'emalloc()' + macro. + + ---------- Footnotes ---------- + + (1) This is more common on MS-Windows systems, but it can happen on +Unix-like systems as well. + + +File: gawk.info, Node: Constructor Functions, Next: Registration Functions, Prev: Memory Allocation Functions, Up: Extension API Description + +16.4.4 Constructor Functions +---------------------------- + +The API provides a number of "constructor" functions for creating string +and numeric values, as well as a number of convenience macros. This +node presents them all as function prototypes, in the way that extension +code would use them: + +'static inline awk_value_t *' +'make_const_string(const char *string, size_t length, awk_value_t *result);' + This function creates a string value in the 'awk_value_t' variable + pointed to by 'result'. It expects 'string' to be a C string + constant (or other string data), and automatically creates a _copy_ + of the data for storage in 'result'. It returns 'result'. + +'static inline awk_value_t *' +'make_malloced_string(const char *string, size_t length, awk_value_t *result);' + This function creates a string value in the 'awk_value_t' variable + pointed to by 'result'. It expects 'string' to be a 'char *' value + pointing to data previously obtained from 'gawk_malloc()', + 'gawk_calloc()', or 'gawk_realloc()'. The idea here is that the + data is passed directly to 'gawk', which assumes responsibility for + it. It returns 'result'. + +'static inline awk_value_t *' +'make_null_string(awk_value_t *result);' + This specialized function creates a null string (the "undefined" + value) in the 'awk_value_t' variable pointed to by 'result'. It + returns 'result'. + +'static inline awk_value_t *' +'make_number(double num, awk_value_t *result);' + This function simply creates a numeric value in the 'awk_value_t' + variable pointed to by 'result'. 
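+
+   For example, the following fragment creates one numeric and one
+constant string value.  (The variable names are arbitrary; this is
+only a sketch.)
+
+     awk_value_t num_result, str_result;
+     const char msg[] = "hello from the extension";
+
+     /* a numeric value */
+     make_number(3.1415927, & num_result);
+
+     /* a string value; make_const_string() makes a copy of msg */
+     make_const_string(msg, strlen(msg), & str_result);
+
+   The difference between 'make_const_string()' and
+'make_malloced_string()' is who owns the data: the former copies the
+data you pass in, while the latter hands 'gawk' memory that you
+previously obtained from 'gawk_malloc()', 'gawk_calloc()', or
+'gawk_realloc()' (as in the 'emalloc()' example shown earlier).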
+ + +File: gawk.info, Node: Registration Functions, Next: Printing Messages, Prev: Constructor Functions, Up: Extension API Description + +16.4.5 Registration Functions +----------------------------- + +This minor node describes the API functions for registering parts of +your extension with 'gawk'. + +* Menu: + +* Extension Functions:: Registering extension functions. +* Exit Callback Functions:: Registering an exit callback. +* Extension Version String:: Registering a version string. +* Input Parsers:: Registering an input parser. +* Output Wrappers:: Registering an output wrapper. +* Two-way processors:: Registering a two-way processor. + + +File: gawk.info, Node: Extension Functions, Next: Exit Callback Functions, Up: Registration Functions + +16.4.5.1 Registering An Extension Function +.......................................... + +Extension functions are described by the following record: + + typedef struct awk_ext_func { + const char *name; + awk_value_t *(*function)(int num_actual_args, awk_value_t *result); + size_t max_expected_args; + } awk_ext_func_t; + + The fields are: + +'const char *name;' + The name of the new function. 'awk'-level code calls the function + by this name. This is a regular C string. + + Function names must obey the rules for 'awk' identifiers. That is, + they must begin with either an English letter or an underscore, + which may be followed by any number of letters, digits, and + underscores. Letter case in function names is significant. + +'awk_value_t *(*function)(int num_actual_args, awk_value_t *result);' + This is a pointer to the C function that provides the extension's + functionality. The function must fill in '*result' with either a + number or a string. 'gawk' takes ownership of any string memory. + As mentioned earlier, string memory _must_ come from one of + 'gawk_malloc()', 'gawk_calloc()', or 'gawk_realloc()'. + + The 'num_actual_args' argument tells the C function how many actual + parameters were passed from the calling 'awk' code. + + The function must return the value of 'result'. This is for the + convenience of the calling code inside 'gawk'. + +'size_t max_expected_args;' + This is the maximum number of arguments the function expects to + receive. Each extension function may decide what to do if the + number of arguments isn't what it expected. As with real 'awk' + functions, it is likely OK to ignore extra arguments. This value + does not affect actual program execution. + + Extension functions should compare this value to the number of + actual arguments passed and possibly issue a lint warning if there + is an undesirable mismatch. Of course, if '--lint=fatal' is used, + this would cause the program to exit. + + Once you have a record representing your extension function, you +register it with 'gawk' using this API function: + +'awk_bool_t add_ext_func(const char *namespace, const awk_ext_func_t *func);' + This function returns true upon success, false otherwise. The + 'namespace' parameter is currently not used; you should pass in an + empty string ('""'). The 'func' pointer is the address of a + 'struct' representing your function, as just described. + + +File: gawk.info, Node: Exit Callback Functions, Next: Extension Version String, Prev: Extension Functions, Up: Registration Functions + +16.4.5.2 Registering An Exit Callback Function +.............................................. + +An "exit callback" function is a function that 'gawk' calls before it +exits. 
Such functions are useful if you have general "cleanup" tasks +that should be performed in your extension (such as closing database +connections or other resource deallocations). You can register such a +function with 'gawk' using the following function: + +'void awk_atexit(void (*funcp)(void *data, int exit_status),' +' void *arg0);' + The parameters are: + + 'funcp' + A pointer to the function to be called before 'gawk' exits. + The 'data' parameter will be the original value of 'arg0'. + The 'exit_status' parameter is the exit status value that + 'gawk' intends to pass to the 'exit()' system call. + + 'arg0' + A pointer to private data that 'gawk' saves in order to pass + to the function pointed to by 'funcp'. + + Exit callback functions are called in last-in, first-out (LIFO) +order--that is, in the reverse order in which they are registered with +'gawk'. + + +File: gawk.info, Node: Extension Version String, Next: Input Parsers, Prev: Exit Callback Functions, Up: Registration Functions + +16.4.5.3 Registering An Extension Version String +................................................ + +You can register a version string that indicates the name and version of +your extension with 'gawk', as follows: + +'void register_ext_version(const char *version);' + Register the string pointed to by 'version' with 'gawk'. Note that + 'gawk' does _not_ copy the 'version' string, so it should not be + changed. + + 'gawk' prints all registered extension version strings when it is +invoked with the '--version' option. + + +File: gawk.info, Node: Input Parsers, Next: Output Wrappers, Prev: Extension Version String, Up: Registration Functions + +16.4.5.4 Customized Input Parsers +................................. + +By default, 'gawk' reads text files as its input. It uses the value of +'RS' to find the end of the record, and then uses 'FS' (or 'FIELDWIDTHS' +or 'FPAT') to split it into fields (*note Reading Files::). +Additionally, it sets the value of 'RT' (*note Built-in Variables::). + + If you want, you can provide your own custom input parser. An input +parser's job is to return a record to the 'gawk' record-processing code, +along with indicators for the value and length of the data to be used +for 'RT', if any. + + To provide an input parser, you must first provide two functions +(where XXX is a prefix name for your extension): + +'awk_bool_t XXX_can_take_file(const awk_input_buf_t *iobuf);' + This function examines the information available in 'iobuf' (which + we discuss shortly). Based on the information there, it decides if + the input parser should be used for this file. If so, it should + return true. Otherwise, it should return false. It should not + change any state (variable values, etc.) within 'gawk'. + +'awk_bool_t XXX_take_control_of(awk_input_buf_t *iobuf);' + When 'gawk' decides to hand control of the file over to the input + parser, it calls this function. This function in turn must fill in + certain fields in the 'awk_input_buf_t' structure and ensure that + certain conditions are true. It should then return true. If an + error of some kind occurs, it should not fill in any fields and + should return false; then 'gawk' will not use the input parser. + The details are presented shortly. 
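+
+   For example, an input parser for a hypothetical format might base
+its decision on the file's suffix.  This is only a sketch; the
+'awk_input_buf_t' fields that it uses are described shortly:
+
+     /* data_can_take_file --- accept files whose names end in ".data" */
+
+     static awk_bool_t
+     data_can_take_file(const awk_input_buf_t *iobuf)
+     {
+         size_t len;
+
+         if (iobuf == NULL || iobuf->name == NULL)
+             return awk_false;
+
+         len = strlen(iobuf->name);
+         if (len >= 5 && strcmp(iobuf->name + len - 5, ".data") == 0)
+             return awk_true;
+
+         return awk_false;
+     }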
+ + Your extension should package these functions inside an +'awk_input_parser_t', which looks like this: + + typedef struct awk_input_parser { + const char *name; /* name of parser */ + awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf); + awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf); + awk_const struct awk_input_parser *awk_const next; /* for gawk */ + } awk_input_parser_t; + + The fields are: + +'const char *name;' + The name of the input parser. This is a regular C string. + +'awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);' + A pointer to your 'XXX_can_take_file()' function. + +'awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);' + A pointer to your 'XXX_take_control_of()' function. + +'awk_const struct input_parser *awk_const next;' + This is for use by 'gawk'; therefore it is marked 'awk_const' so + that the extension cannot modify it. + + The steps are as follows: + + 1. Create a 'static awk_input_parser_t' variable and initialize it + appropriately. + + 2. When your extension is loaded, register your input parser with + 'gawk' using the 'register_input_parser()' API function (described + next). + + An 'awk_input_buf_t' looks like this: + + typedef struct awk_input { + const char *name; /* filename */ + int fd; /* file descriptor */ + #define INVALID_HANDLE (-1) + void *opaque; /* private data for input parsers */ + int (*get_record)(char **out, struct awk_input *iobuf, + int *errcode, char **rt_start, size_t *rt_len); + ssize_t (*read_func)(); + void (*close_func)(struct awk_input *iobuf); + struct stat sbuf; /* stat buf */ + } awk_input_buf_t; + + The fields can be divided into two categories: those for use +(initially, at least) by 'XXX_can_take_file()', and those for use by +'XXX_take_control_of()'. The first group of fields and their uses are +as follows: + +'const char *name;' + The name of the file. + +'int fd;' + A file descriptor for the file. If 'gawk' was able to open the + file, then 'fd' will _not_ be equal to 'INVALID_HANDLE'. + Otherwise, it will. + +'struct stat sbuf;' + If the file descriptor is valid, then 'gawk' will have filled in + this structure via a call to the 'fstat()' system call. + + The 'XXX_can_take_file()' function should examine these fields and +decide if the input parser should be used for the file. The decision +can be made based upon 'gawk' state (the value of a variable defined +previously by the extension and set by 'awk' code), the name of the +file, whether or not the file descriptor is valid, the information in +the 'struct stat', or any combination of these factors. + + Once 'XXX_can_take_file()' has returned true, and 'gawk' has decided +to use your input parser, it calls 'XXX_take_control_of()'. That +function then fills either the 'get_record' field or the 'read_func' +field in the 'awk_input_buf_t'. It must also ensure that 'fd' is _not_ +set to 'INVALID_HANDLE'. The following list describes the fields that +may be filled by 'XXX_take_control_of()': + +'void *opaque;' + This is used to hold any state information needed by the input + parser for this file. It is "opaque" to 'gawk'. The input parser + is not required to use this pointer. + +'int (*get_record)(char **out,' +' struct awk_input *iobuf,' +' int *errcode,' +' char **rt_start,' +' size_t *rt_len);' + This function pointer should point to a function that creates the + input records. Said function is the core of the input parser. Its + behavior is described in the text following this list. 
+ +'ssize_t (*read_func)();' + This function pointer should point to a function that has the same + behavior as the standard POSIX 'read()' system call. It is an + alternative to the 'get_record' pointer. Its behavior is also + described in the text following this list. + +'void (*close_func)(struct awk_input *iobuf);' + This function pointer should point to a function that does the + "teardown." It should release any resources allocated by + 'XXX_take_control_of()'. It may also close the file. If it does + so, it should set the 'fd' field to 'INVALID_HANDLE'. + + If 'fd' is still not 'INVALID_HANDLE' after the call to this + function, 'gawk' calls the regular 'close()' system call. + + Having a "teardown" function is optional. If your input parser + does not need it, do not set this field. Then, 'gawk' calls the + regular 'close()' system call on the file descriptor, so it should + be valid. + + The 'XXX_get_record()' function does the work of creating input +records. The parameters are as follows: + +'char **out' + This is a pointer to a 'char *' variable that is set to point to + the record. 'gawk' makes its own copy of the data, so the + extension must manage this storage. + +'struct awk_input *iobuf' + This is the 'awk_input_buf_t' for the file. The fields should be + used for reading data ('fd') and for managing private state + ('opaque'), if any. + +'int *errcode' + If an error occurs, '*errcode' should be set to an appropriate code + from '<errno.h>'. + +'char **rt_start' +'size_t *rt_len' + If the concept of a "record terminator" makes sense, then + '*rt_start' should be set to point to the data to be used for 'RT', + and '*rt_len' should be set to the length of the data. Otherwise, + '*rt_len' should be set to zero. 'gawk' makes its own copy of this + data, so the extension must manage this storage. + + The return value is the length of the buffer pointed to by '*out', or +'EOF' if end-of-file was reached or an error occurred. + + It is guaranteed that 'errcode' is a valid pointer, so there is no +need to test for a 'NULL' value. 'gawk' sets '*errcode' to zero, so +there is no need to set it unless an error occurs. + + If an error does occur, the function should return 'EOF' and set +'*errcode' to a value greater than zero. In that case, if '*errcode' +does not equal zero, 'gawk' automatically updates the 'ERRNO' variable +based on the value of '*errcode'. (In general, setting '*errcode = +errno' should do the right thing.) + + As an alternative to supplying a function that returns an input +record, you may instead supply a function that simply reads bytes, and +let 'gawk' parse the data into records. If you do so, the data should +be returned in the multibyte encoding of the current locale. Such a +function should follow the same behavior as the 'read()' system call, +and you fill in the 'read_func' pointer with its address in the +'awk_input_buf_t' structure. + + By default, 'gawk' sets the 'read_func' pointer to point to the +'read()' system call. So your extension need not set this field +explicitly. + + NOTE: You must choose one method or the other: either a function + that returns a record, or one that returns raw data. In + particular, if you supply a function to get a record, 'gawk' will + call it, and will never call the raw read function. + + 'gawk' ships with a sample extension that reads directories, +returning records for each entry in a directory (*note Extension Sample +Readdir::). You may wish to use that code as a guide for writing your +own input parser. 
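+
+   To make this more concrete, here is a sketch of a 'get_record'
+routine for a hypothetical format consisting of fixed-length 80-byte
+records.  It assumes that 'XXX_take_control_of()' stored an 80-byte
+buffer in 'iobuf->opaque'; a real input parser would need more careful
+handling of short reads and end-of-file:
+
+     /* fixed_get_record --- return one 80-byte record per call */
+
+     static int
+     fixed_get_record(char **out, struct awk_input *iobuf,
+                      int *errcode, char **rt_start, size_t *rt_len)
+     {
+         char *buf = (char *) iobuf->opaque;   /* set up earlier */
+         ssize_t n = read(iobuf->fd, buf, 80);
+
+         if (n <= 0) {
+             if (n < 0)
+                 *errcode = errno;    /* gawk updates ERRNO from this */
+             return EOF;              /* end of file or error */
+         }
+
+         *out = buf;          /* gawk makes its own copy of the data */
+         *rt_len = 0;         /* this format has no record terminator */
+         return n;            /* length of the record */
+     }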
+ + When writing an input parser, you should think about (and document) +how it is expected to interact with 'awk' code. You may want it to +always be called, and to take effect as appropriate (as the 'readdir' +extension does). Or you may want it to take effect based upon the value +of an 'awk' variable, as the XML extension from the 'gawkextlib' project +does (*note gawkextlib::). In the latter case, code in a 'BEGINFILE' +rule can look at 'FILENAME' and 'ERRNO' to decide whether or not to +activate an input parser (*note BEGINFILE/ENDFILE::). + + You register your input parser with the following function: + +'void register_input_parser(awk_input_parser_t *input_parser);' + Register the input parser pointed to by 'input_parser' with 'gawk'. + + +File: gawk.info, Node: Output Wrappers, Next: Two-way processors, Prev: Input Parsers, Up: Registration Functions + +16.4.5.5 Customized Output Wrappers +................................... + +An "output wrapper" is the mirror image of an input parser. It allows +an extension to take over the output to a file opened with the '>' or +'>>' I/O redirection operators (*note Redirection::). + + The output wrapper is very similar to the input parser structure: + + typedef struct awk_output_wrapper { + const char *name; /* name of the wrapper */ + awk_bool_t (*can_take_file)(const awk_output_buf_t *outbuf); + awk_bool_t (*take_control_of)(awk_output_buf_t *outbuf); + awk_const struct awk_output_wrapper *awk_const next; /* for gawk */ + } awk_output_wrapper_t; + + The members are as follows: + +'const char *name;' + This is the name of the output wrapper. + +'awk_bool_t (*can_take_file)(const awk_output_buf_t *outbuf);' + This points to a function that examines the information in the + 'awk_output_buf_t' structure pointed to by 'outbuf'. It should + return true if the output wrapper wants to take over the file, and + false otherwise. It should not change any state (variable values, + etc.) within 'gawk'. + +'awk_bool_t (*take_control_of)(awk_output_buf_t *outbuf);' + The function pointed to by this field is called when 'gawk' decides + to let the output wrapper take control of the file. It should fill + in appropriate members of the 'awk_output_buf_t' structure, as + described next, and return true if successful, false otherwise. + +'awk_const struct output_wrapper *awk_const next;' + This is for use by 'gawk'; therefore it is marked 'awk_const' so + that the extension cannot modify it. + + The 'awk_output_buf_t' structure looks like this: + + typedef struct awk_output_buf { + const char *name; /* name of output file */ + const char *mode; /* mode argument to fopen */ + FILE *fp; /* stdio file pointer */ + awk_bool_t redirected; /* true if a wrapper is active */ + void *opaque; /* for use by output wrapper */ + size_t (*gawk_fwrite)(const void *buf, size_t size, size_t count, + FILE *fp, void *opaque); + int (*gawk_fflush)(FILE *fp, void *opaque); + int (*gawk_ferror)(FILE *fp, void *opaque); + int (*gawk_fclose)(FILE *fp, void *opaque); + } awk_output_buf_t; + + Here too, your extension will define 'XXX_can_take_file()' and +'XXX_take_control_of()' functions that examine and update data members +in the 'awk_output_buf_t'. The data members are as follows: + +'const char *name;' + The name of the output file. + +'const char *mode;' + The mode string (as would be used in the second argument to + 'fopen()') with which the file was opened. + +'FILE *fp;' + The 'FILE' pointer from '<stdio.h>'. 
'gawk' opens the file before + attempting to find an output wrapper. + +'awk_bool_t redirected;' + This field must be set to true by the 'XXX_take_control_of()' + function. + +'void *opaque;' + This pointer is opaque to 'gawk'. The extension should use it to + store a pointer to any private data associated with the file. + +'size_t (*gawk_fwrite)(const void *buf, size_t size, size_t count,' +' FILE *fp, void *opaque);' +'int (*gawk_fflush)(FILE *fp, void *opaque);' +'int (*gawk_ferror)(FILE *fp, void *opaque);' +'int (*gawk_fclose)(FILE *fp, void *opaque);' + These pointers should be set to point to functions that perform the + equivalent function as the '<stdio.h>' functions do, if + appropriate. 'gawk' uses these function pointers for all output. + 'gawk' initializes the pointers to point to internal "pass-through" + functions that just call the regular '<stdio.h>' functions, so an + extension only needs to redefine those functions that are + appropriate for what it does. + + The 'XXX_can_take_file()' function should make a decision based upon +the 'name' and 'mode' fields, and any additional state (such as 'awk' +variable values) that is appropriate. + + When 'gawk' calls 'XXX_take_control_of()', that function should fill +in the other fields as appropriate, except for 'fp', which it should +just use normally. + + You register your output wrapper with the following function: + +'void register_output_wrapper(awk_output_wrapper_t *output_wrapper);' + Register the output wrapper pointed to by 'output_wrapper' with + 'gawk'. + + +File: gawk.info, Node: Two-way processors, Prev: Output Wrappers, Up: Registration Functions + +16.4.5.6 Customized Two-way Processors +...................................... + +A "two-way processor" combines an input parser and an output wrapper for +two-way I/O with the '|&' operator (*note Redirection::). It makes +identical use of the 'awk_input_parser_t' and 'awk_output_buf_t' +structures as described earlier. + + A two-way processor is represented by the following structure: + + typedef struct awk_two_way_processor { + const char *name; /* name of the two-way processor */ + awk_bool_t (*can_take_two_way)(const char *name); + awk_bool_t (*take_control_of)(const char *name, + awk_input_buf_t *inbuf, + awk_output_buf_t *outbuf); + awk_const struct awk_two_way_processor *awk_const next; /* for gawk */ + } awk_two_way_processor_t; + + The fields are as follows: + +'const char *name;' + The name of the two-way processor. + +'awk_bool_t (*can_take_two_way)(const char *name);' + The function pointed to by this field should return true if it + wants to take over two-way I/O for this file name. It should not + change any state (variable values, etc.) within 'gawk'. + +'awk_bool_t (*take_control_of)(const char *name,' +' awk_input_buf_t *inbuf,' +' awk_output_buf_t *outbuf);' + The function pointed to by this field should fill in the + 'awk_input_buf_t' and 'awk_output_buf_t' structures pointed to by + 'inbuf' and 'outbuf', respectively. These structures were + described earlier. + +'awk_const struct two_way_processor *awk_const next;' + This is for use by 'gawk'; therefore it is marked 'awk_const' so + that the extension cannot modify it. + + As with the input parser and output processor, you provide "yes I can +take this" and "take over for this" functions, 'XXX_can_take_two_way()' +and 'XXX_take_control_of()'. 
+
+   You register your two-way processor with the following function:
+
+'void register_two_way_processor(awk_two_way_processor_t *two_way_processor);'
+     Register the two-way processor pointed to by 'two_way_processor'
+     with 'gawk'.
+
+
+File: gawk.info,  Node: Printing Messages,  Next: Updating ERRNO,  Prev: Registration Functions,  Up: Extension API Description
+
+16.4.6 Printing Messages
+------------------------
+
+You can print different kinds of warning messages from your extension,
+as described here.  Note that for these functions, you must pass in the
+extension ID received from 'gawk' when the extension was loaded:(1)
+
+'void fatal(awk_ext_id_t id, const char *format, ...);'
+     Print a message and then cause 'gawk' to exit immediately.
+
+'void nonfatal(awk_ext_id_t id, const char *format, ...);'
+     Print a nonfatal error message.
+
+'void warning(awk_ext_id_t id, const char *format, ...);'
+     Print a warning message.
+
+'void lintwarn(awk_ext_id_t id, const char *format, ...);'
+     Print a "lint warning."  Normally this is the same as printing a
+     warning message, but if 'gawk' was invoked with '--lint=fatal',
+     then lint warnings become fatal error messages.
+
+   All of these functions are otherwise like the C 'printf()' family of
+functions, where the 'format' parameter is a string with literal
+characters and formatting codes intermixed.
+
+   ---------- Footnotes ----------
+
+   (1) Because the API uses only ISO C 90 features, it cannot make use
+of the ISO C 99 variadic macro feature to hide that parameter.  More's
+the pity.
+
+
+File: gawk.info,  Node: Updating ERRNO,  Next: Requesting Values,  Prev: Printing Messages,  Up: Extension API Description
+
+16.4.7 Updating 'ERRNO'
+-----------------------
+
+The following functions allow you to update the 'ERRNO' variable:
+
+'void update_ERRNO_int(int errno_val);'
+     Set 'ERRNO' to the string equivalent of the error code in
+     'errno_val'.  The value should be one of the defined error codes in
+     '<errno.h>', and 'gawk' turns it into a (possibly translated)
+     string using the C 'strerror()' function.
+
+'void update_ERRNO_string(const char *string);'
+     Set 'ERRNO' directly to the string value of 'string'.  'gawk' makes
+     a copy of the value of 'string'.
+
+'void unset_ERRNO(void);'
+     Unset 'ERRNO'.
+
+
+File: gawk.info,  Node: Requesting Values,  Next: Accessing Parameters,  Prev: Updating ERRNO,  Up: Extension API Description
+
+16.4.8 Requesting Values
+------------------------
+
+All of the functions that return values from 'gawk' work in the same
+way.  You pass in an 'awk_valtype_t' value to indicate what kind of
+value you expect.  If the actual value matches what you requested, the
+function returns true and fills in the 'awk_value_t' result.  Otherwise,
+the function returns false, and the 'val_type' member indicates the type
+of the actual value.  You may then print an error message or reissue the
+request for the actual value type, as appropriate.  This behavior is
+summarized in *note Table 16.1: table-value-types-returned.
+ + Type of Actual Value +-------------------------------------------------------------------------- + String Number Array Undefined +------------------------------------------------------------------------------ + String String String False False + Number Number if Number False False + can be + converted, + else false +Type Array False False Array False +Requested Scalar Scalar Scalar False False + Undefined String Number Array Undefined + Value False False False False + cookie + +Table 16.1: API value types returned + + +File: gawk.info, Node: Accessing Parameters, Next: Symbol Table Access, Prev: Requesting Values, Up: Extension API Description + +16.4.9 Accessing and Updating Parameters +---------------------------------------- + +Two functions give you access to the arguments (parameters) passed to +your extension function. They are: + +'awk_bool_t get_argument(size_t count,' +' awk_valtype_t wanted,' +' awk_value_t *result);' + Fill in the 'awk_value_t' structure pointed to by 'result' with the + 'count'th argument. Return true if the actual type matches + 'wanted', and false otherwise. In the latter case, + 'result->val_type' indicates the actual type (*note Table 16.1: + table-value-types-returned.). Counts are zero-based--the first + argument is numbered zero, the second one, and so on. 'wanted' + indicates the type of value expected. + +'awk_bool_t set_argument(size_t count, awk_array_t array);' + Convert a parameter that was undefined into an array; this provides + call by reference for arrays. Return false if 'count' is too big, + or if the argument's type is not undefined. *Note Array + Manipulation:: for more information on creating arrays. + + +File: gawk.info, Node: Symbol Table Access, Next: Array Manipulation, Prev: Accessing Parameters, Up: Extension API Description + +16.4.10 Symbol Table Access +--------------------------- + +Two sets of routines provide access to global variables, and one set +allows you to create and release cached values. + +* Menu: + +* Symbol table by name:: Accessing variables by name. +* Symbol table by cookie:: Accessing variables by "cookie". +* Cached values:: Creating and using cached values. + + +File: gawk.info, Node: Symbol table by name, Next: Symbol table by cookie, Up: Symbol Table Access + +16.4.10.1 Variable Access and Update by Name +............................................ + +The following routines provide the ability to access and update global +'awk'-level variables by name. In compiler terminology, identifiers of +different kinds are termed "symbols", thus the "sym" in the routines' +names. The data structure that stores information about symbols is +termed a "symbol table". The functions are as follows: + +'awk_bool_t sym_lookup(const char *name,' +' awk_valtype_t wanted,' +' awk_value_t *result);' + Fill in the 'awk_value_t' structure pointed to by 'result' with the + value of the variable named by the string 'name', which is a + regular C string. 'wanted' indicates the type of value expected. + Return true if the actual type matches 'wanted', and false + otherwise. In the latter case, 'result->val_type' indicates the + actual type (*note Table 16.1: table-value-types-returned.). + +'awk_bool_t sym_update(const char *name, awk_value_t *value);' + Update the variable named by the string 'name', which is a regular + C string. The variable is added to 'gawk''s symbol table if it is + not there. Return true if everything worked, and false otherwise. 
+ + Changing types (scalar to array or vice versa) of an existing + variable is _not_ allowed, nor may this routine be used to update + an array. This routine cannot be used to update any of the + predefined variables (such as 'ARGC' or 'NF'). + + An extension can look up the value of 'gawk''s special variables. +However, with the exception of the 'PROCINFO' array, an extension cannot +change any of those variables. + + CAUTION: It is possible for the lookup of 'PROCINFO' to fail. This + happens if the 'awk' program being run does not reference + 'PROCINFO'; in this case, 'gawk' doesn't bother to create the array + and populate it. + + +File: gawk.info, Node: Symbol table by cookie, Next: Cached values, Prev: Symbol table by name, Up: Symbol Table Access + +16.4.10.2 Variable Access and Update by Cookie +.............................................. + +A "scalar cookie" is an opaque handle that provides access to a global +variable or array. It is an optimization that avoids looking up +variables in 'gawk''s symbol table every time access is needed. This +was discussed earlier, in *note General Data Types::. + + The following functions let you work with scalar cookies: + +'awk_bool_t sym_lookup_scalar(awk_scalar_t cookie,' +' awk_valtype_t wanted,' +' awk_value_t *result);' + Retrieve the current value of a scalar cookie. Once you have + obtained a scalar cookie using 'sym_lookup()', you can use this + function to get its value more efficiently. Return false if the + value cannot be retrieved. + +'awk_bool_t sym_update_scalar(awk_scalar_t cookie, awk_value_t *value);' + Update the value associated with a scalar cookie. Return false if + the new value is not of type 'AWK_STRING' or 'AWK_NUMBER'. Here + too, the predefined variables may not be updated. + + It is not obvious at first glance how to work with scalar cookies or +what their raison d'e^tre really is. In theory, the 'sym_lookup()' and +'sym_update()' routines are all you really need to work with variables. +For example, you might have code that looks up the value of a variable, +evaluates a condition, and then possibly changes the value of the +variable based on the result of that evaluation, like so: + + /* do_magic --- do something really great */ + + static awk_value_t * + do_magic(int nargs, awk_value_t *result) + { + awk_value_t value; + + if ( sym_lookup("MAGIC_VAR", AWK_NUMBER, & value) + && some_condition(value.num_value)) { + value.num_value += 42; + sym_update("MAGIC_VAR", & value); + } + + return make_number(0.0, result); + } + +This code looks (and is) simple and straightforward. So what's the +problem? + + Well, consider what happens if 'awk'-level code associated with your +extension calls the 'magic()' function (implemented in C by +'do_magic()'), once per record, while processing hundreds of thousands +or millions of records. The 'MAGIC_VAR' variable is looked up in the +symbol table once or twice per function call! + + The symbol table lookup is really pure overhead; it is considerably +more efficient to get a cookie that represents the variable, and use +that to get the variable's value and update it as needed.(1) + + Thus, the way to use cookies is as follows. First, install your +extension's variable in 'gawk''s symbol table using 'sym_update()', as +usual. 
Then get a scalar cookie for the variable using 'sym_lookup()': + + static awk_scalar_t magic_var_cookie; /* cookie for MAGIC_VAR */ + + static void + my_extension_init() + { + awk_value_t value; + + /* install initial value */ + sym_update("MAGIC_VAR", make_number(42.0, & value)); + + /* get the cookie */ + sym_lookup("MAGIC_VAR", AWK_SCALAR, & value); + + /* save the cookie */ + magic_var_cookie = value.scalar_cookie; + ... + } + + Next, use the routines in this minor node for retrieving and updating +the value through the cookie. Thus, 'do_magic()' now becomes something +like this: + + /* do_magic --- do something really great */ + + static awk_value_t * + do_magic(int nargs, awk_value_t *result) + { + awk_value_t value; + + if ( sym_lookup_scalar(magic_var_cookie, AWK_NUMBER, & value) + && some_condition(value.num_value)) { + value.num_value += 42; + sym_update_scalar(magic_var_cookie, & value); + } + ... + + return make_number(0.0, result); + } + + NOTE: The previous code omitted error checking for presentation + purposes. Your extension code should be more robust and carefully + check the return values from the API functions. + + ---------- Footnotes ---------- + + (1) The difference is measurable and quite real. Trust us. + + +File: gawk.info, Node: Cached values, Prev: Symbol table by cookie, Up: Symbol Table Access + +16.4.10.3 Creating and Using Cached Values +.......................................... + +The routines in this minor node allow you to create and release cached +values. Like scalar cookies, in theory, cached values are not +necessary. You can create numbers and strings using the functions in +*note Constructor Functions::. You can then assign those values to +variables using 'sym_update()' or 'sym_update_scalar()', as you like. + + However, you can understand the point of cached values if you +remember that _every_ string value's storage _must_ come from +'gawk_malloc()', 'gawk_calloc()', or 'gawk_realloc()'. If you have 20 +variables, all of which have the same string value, you must create 20 +identical copies of the string.(1) + + It is clearly more efficient, if possible, to create a value once, +and then tell 'gawk' to reuse the value for multiple variables. That is +what the routines in this minor node let you do. The functions are as +follows: + +'awk_bool_t create_value(awk_value_t *value, awk_value_cookie_t *result);' + Create a cached string or numeric value from 'value' for efficient + later assignment. Only values of type 'AWK_NUMBER' and + 'AWK_STRING' are allowed. Any other type is rejected. + 'AWK_UNDEFINED' could be allowed, but doing so would result in + inferior performance. + +'awk_bool_t release_value(awk_value_cookie_t vc);' + Release the memory associated with a value cookie obtained from + 'create_value()'. + + You use value cookies in a fashion similar to the way you use scalar +cookies. In the extension initialization routine, you create the value +cookie: + + static awk_value_cookie_t answer_cookie; /* static value cookie */ + + static void + my_extension_init() + { + awk_value_t value; + char *long_string; + size_t long_string_len; + + /* code from earlier */ + ... + /* ... fill in long_string and long_string_len ... */ + make_malloced_string(long_string, long_string_len, & value); + create_value(& value, & answer_cookie); /* create cookie */ + ... 
+     }
+
+   Once the value is created, you can use it as the value of any number
+of variables:
+
+     static awk_value_t *
+     do_magic(int nargs, awk_value_t *result)
+     {
+         awk_value_t new_value;
+
+         ...    /* as earlier */
+
+         new_value.val_type = AWK_VALUE_COOKIE;
+         new_value.value_cookie = answer_cookie;
+         sym_update("VAR1", & new_value);
+         sym_update("VAR2", & new_value);
+         ...
+         sym_update("VAR100", & new_value);
+         ...
+     }
+
+Using value cookies in this way saves considerable storage, as all of
+'VAR1' through 'VAR100' share the same value.
+
+   You might be wondering, "Is this sharing problematic?  What happens
+if 'awk' code assigns a new value to 'VAR1'; are all the others changed
+too?"
+
+   That's a great question.  The answer is that no, it's not a problem.
+Internally, 'gawk' uses "reference-counted strings".  This means that
+many variables can share the same string value, and 'gawk' keeps track
+of the usage.  When a variable's value changes, 'gawk' simply decrements
+the reference count on the old value and updates the variable to use the
+new value.
+
+   Finally, as part of your cleanup action (*note Exit Callback
+Functions::) you should release any cached values that you created,
+using 'release_value()'.
+
+   ---------- Footnotes ----------
+
+   (1) Numeric values are clearly less problematic, requiring only a C
+'double' to store.
+
+
+File: gawk.info,  Node: Array Manipulation,  Next: Redirection API,  Prev: Symbol Table Access,  Up: Extension API Description
+
+16.4.11 Array Manipulation
+--------------------------
+
+The primary data structure(1) in 'awk' is the associative array (*note
+Arrays::).  Extensions need to be able to manipulate 'awk' arrays.  The
+API provides a number of data structures for working with arrays,
+functions for working with individual elements, and functions for
+working with arrays as a whole.  This includes the ability to "flatten"
+an array so that it is easy for C code to traverse every element in an
+array.  The array data structures integrate nicely with the data
+structures for values to make it easy to both work with and create true
+arrays of arrays (*note General Data Types::).
+
+* Menu:
+
+* Array Data Types::            Data types for working with arrays.
+* Array Functions::             Functions for working with arrays.
+* Flattening Arrays::           How to flatten arrays.
+* Creating Arrays::             How to create and populate arrays.
+
+   ---------- Footnotes ----------
+
+   (1) OK, the only data structure.
+
+
+File: gawk.info,  Node: Array Data Types,  Next: Array Functions,  Up: Array Manipulation
+
+16.4.11.1 Array Data Types
+..........................
+
+The data types associated with arrays are as follows:
+
+'typedef void *awk_array_t;'
+     If you request the value of an array variable, you get back an
+     'awk_array_t' value.  This value is opaque(1) to the extension; it
+     uniquely identifies the array but can only be used by passing it
+     into API functions or receiving it from API functions.  This is
+     very similar to the way 'FILE *' values are used with the
+     '<stdio.h>' library routines.
+
+'typedef struct awk_element {'
+'    /* convenience linked list pointer, not used by gawk */'
+'    struct awk_element *next;'
+'    enum {'
+'        AWK_ELEMENT_DEFAULT = 0,  /* set by gawk */'
+'        AWK_ELEMENT_DELETE = 1    /* set by extension */'
+'    } flags;'
+'    awk_value_t index;'
+'    awk_value_t value;'
+'} awk_element_t;'
+     The 'awk_element_t' is a "flattened" array element.  'awk' produces
+     an array of these inside the 'awk_flat_array_t' (see the next
+     item).  Individual elements may be marked for deletion.
New + elements must be added individually, one at a time, using the + separate API for that purpose. The fields are as follows: + + 'struct awk_element *next;' + This pointer is for the convenience of extension writers. It + allows an extension to create a linked list of new elements + that can then be added to an array in a loop that traverses + the list. + + 'enum { ... } flags;' + A set of flag values that convey information between the + extension and 'gawk'. Currently there is only one: + 'AWK_ELEMENT_DELETE'. Setting it causes 'gawk' to delete the + element from the original array upon release of the flattened + array. + + 'index' + 'value' + The index and value of the element, respectively. _All_ + memory pointed to by 'index' and 'value' belongs to 'gawk'. + +'typedef struct awk_flat_array {' +' awk_const void *awk_const opaque1; /* for use by gawk */' +' awk_const void *awk_const opaque2; /* for use by gawk */' +' awk_const size_t count; /* how many elements */' +' awk_element_t elements[1]; /* will be extended */' +'} awk_flat_array_t;' + This is a flattened array. When an extension gets one of these + from 'gawk', the 'elements' array is of actual size 'count'. The + 'opaque1' and 'opaque2' pointers are for use by 'gawk'; therefore + they are marked 'awk_const' so that the extension cannot modify + them. + + ---------- Footnotes ---------- + + (1) It is also a "cookie," but the 'gawk' developers did not wish to +overuse this term. + + +File: gawk.info, Node: Array Functions, Next: Flattening Arrays, Prev: Array Data Types, Up: Array Manipulation + +16.4.11.2 Array Functions +......................... + +The following functions relate to individual array elements: + +'awk_bool_t get_element_count(awk_array_t a_cookie, size_t *count);' + For the array represented by 'a_cookie', place in '*count' the + number of elements it contains. A subarray counts as a single + element. Return false if there is an error. + +'awk_bool_t get_array_element(awk_array_t a_cookie,' +' const awk_value_t *const index,' +' awk_valtype_t wanted,' +' awk_value_t *result);' + For the array represented by 'a_cookie', return in '*result' the + value of the element whose index is 'index'. 'wanted' specifies + the type of value you wish to retrieve. Return false if 'wanted' + does not match the actual type or if 'index' is not in the array + (*note Table 16.1: table-value-types-returned.). + + The value for 'index' can be numeric, in which case 'gawk' converts + it to a string. Using nonintegral values is possible, but requires + that you understand how such values are converted to strings (*note + Conversion::); thus, using integral values is safest. + + As with _all_ strings passed into 'gawk' from an extension, the + string value of 'index' must come from 'gawk_malloc()', + 'gawk_calloc()', or 'gawk_realloc()', and 'gawk' releases the + storage. + +'awk_bool_t set_array_element(awk_array_t a_cookie,' +' const awk_value_t *const index,' +' const awk_value_t *const value);' + In the array represented by 'a_cookie', create or modify the + element whose index is given by 'index'. The 'ARGV' and 'ENVIRON' + arrays may not be changed, although the 'PROCINFO' array can be. + +'awk_bool_t set_array_element_by_elem(awk_array_t a_cookie,' +' awk_element_t element);' + Like 'set_array_element()', but take the 'index' and 'value' from + 'element'. This is a convenience macro. 
+ +'awk_bool_t del_array_element(awk_array_t a_cookie,' +' const awk_value_t* const index);' + Remove the element with the given index from the array represented + by 'a_cookie'. Return true if the element was removed, or false if + the element did not exist in the array. + + The following functions relate to arrays as a whole: + +'awk_array_t create_array(void);' + Create a new array to which elements may be added. *Note Creating + Arrays:: for a discussion of how to create a new array and add + elements to it. + +'awk_bool_t clear_array(awk_array_t a_cookie);' + Clear the array represented by 'a_cookie'. Return false if there + was some kind of problem, true otherwise. The array remains an + array, but after calling this function, it has no elements. This + is equivalent to using the 'delete' statement (*note Delete::). + +'awk_bool_t flatten_array(awk_array_t a_cookie, awk_flat_array_t **data);' + For the array represented by 'a_cookie', create an + 'awk_flat_array_t' structure and fill it in. Set the pointer whose + address is passed as 'data' to point to this structure. Return + true upon success, or false otherwise. *Note Flattening Arrays::, + for a discussion of how to flatten an array and work with it. + +'awk_bool_t release_flattened_array(awk_array_t a_cookie,' +' awk_flat_array_t *data);' + When done with a flattened array, release the storage using this + function. You must pass in both the original array cookie and the + address of the created 'awk_flat_array_t' structure. The function + returns true upon success, false otherwise. + + +File: gawk.info, Node: Flattening Arrays, Next: Creating Arrays, Prev: Array Functions, Up: Array Manipulation + +16.4.11.3 Working With All The Elements of an Array +................................................... + +To "flatten" an array is to create a structure that represents the full +array in a fashion that makes it easy for C code to traverse the entire +array. Some of the code in 'extension/testext.c' does this, and also +serves as a nice example showing how to use the APIs. + + We walk through that part of the code one step at a time. First, the +'gawk' script that drives the test extension: + + @load "testext" + BEGIN { + n = split("blacky rusty sophie raincloud lucky", pets) + printf("pets has %d elements\n", length(pets)) + ret = dump_array_and_delete("pets", "3") + printf("dump_array_and_delete(pets) returned %d\n", ret) + if ("3" in pets) + printf("dump_array_and_delete() did NOT remove index \"3\"!\n") + else + printf("dump_array_and_delete() did remove index \"3\"!\n") + print "" + } + +This code creates an array with 'split()' (*note String Functions::) and +then calls 'dump_array_and_delete()'. That function looks up the array +whose name is passed as the first argument, and deletes the element at +the index passed in the second argument. The 'awk' code then prints the +return value and checks if the element was indeed deleted. Here is the +C code that implements 'dump_array_and_delete()'. It has been edited +slightly for presentation. 
+ + The first part declares variables, sets up the default return value +in 'result', and checks that the function was called with the correct +number of arguments: + + static awk_value_t * + dump_array_and_delete(int nargs, awk_value_t *result) + { + awk_value_t value, value2, value3; + awk_flat_array_t *flat_array; + size_t count; + char *name; + int i; + + assert(result != NULL); + make_number(0.0, result); + + if (nargs != 2) { + printf("dump_array_and_delete: nargs not right " + "(%d should be 2)\n", nargs); + goto out; + } + + The function then proceeds in steps, as follows. First, retrieve the +name of the array, passed as the first argument, followed by the array +itself. If either operation fails, print an error message and return: + + /* get argument named array as flat array and print it */ + if (get_argument(0, AWK_STRING, & value)) { + name = value.str_value.str; + if (sym_lookup(name, AWK_ARRAY, & value2)) + printf("dump_array_and_delete: sym_lookup of %s passed\n", + name); + else { + printf("dump_array_and_delete: sym_lookup of %s failed\n", + name); + goto out; + } + } else { + printf("dump_array_and_delete: get_argument(0) failed\n"); + goto out; + } + + For testing purposes and to make sure that the C code sees the same +number of elements as the 'awk' code, the second step is to get the +count of elements in the array and print it: + + if (! get_element_count(value2.array_cookie, & count)) { + printf("dump_array_and_delete: get_element_count failed\n"); + goto out; + } + + printf("dump_array_and_delete: incoming size is %lu\n", + (unsigned long) count); + + The third step is to actually flatten the array, and then to +double-check that the count in the 'awk_flat_array_t' is the same as the +count just retrieved: + + if (! flatten_array(value2.array_cookie, & flat_array)) { + printf("dump_array_and_delete: could not flatten array\n"); + goto out; + } + + if (flat_array->count != count) { + printf("dump_array_and_delete: flat_array->count (%lu)" + " != count (%lu)\n", + (unsigned long) flat_array->count, + (unsigned long) count); + goto out; + } + + The fourth step is to retrieve the index of the element to be +deleted, which was passed as the second argument. Remember that +argument counts passed to 'get_argument()' are zero-based, and thus the +second argument is numbered one: + + if (! get_argument(1, AWK_STRING, & value3)) { + printf("dump_array_and_delete: get_argument(1) failed\n"); + goto out; + } + + The fifth step is where the "real work" is done. The function loops +over every element in the array, printing the index and element values. +In addition, upon finding the element with the index that is supposed to +be deleted, the function sets the 'AWK_ELEMENT_DELETE' bit in the +'flags' field of the element. When the array is released, 'gawk' +traverses the flattened array, and deletes any elements that have this +flag bit set: + + for (i = 0; i < flat_array->count; i++) { + printf("\t%s[\"%.*s\"] = %s\n", + name, + (int) flat_array->elements[i].index.str_value.len, + flat_array->elements[i].index.str_value.str, + valrep2str(& flat_array->elements[i].value)); + + if (strcmp(value3.str_value.str, + flat_array->elements[i].index.str_value.str) == 0) { + flat_array->elements[i].flags |= AWK_ELEMENT_DELETE; + printf("dump_array_and_delete: marking element \"%s\" " + "for deletion\n", + flat_array->elements[i].index.str_value.str); + } + } + + The sixth step is to release the flattened array. 
This tells 'gawk'
that the extension is no longer using the array, and that it should
delete any elements marked for deletion.  'gawk' also frees any storage
that was allocated, so you should not use the pointer ('flat_array' in
this code) once you have called 'release_flattened_array()':

     if (! release_flattened_array(value2.array_cookie, flat_array)) {
         printf("dump_array_and_delete: could not release flattened array\n");
         goto out;
     }

   Finally, because everything was successful, the function sets the
return value to success, and returns:

         make_number(1.0, result);
     out:
         return result;
     }

   Here is the output from running this part of the test:

     pets has 5 elements
     dump_array_and_delete: sym_lookup of pets passed
     dump_array_and_delete: incoming size is 5
             pets["1"] = "blacky"
             pets["2"] = "rusty"
             pets["3"] = "sophie"
     dump_array_and_delete: marking element "3" for deletion
             pets["4"] = "raincloud"
             pets["5"] = "lucky"
     dump_array_and_delete(pets) returned 1
     dump_array_and_delete() did remove index "3"!


File: gawk.info,  Node: Creating Arrays,  Prev: Flattening Arrays,  Up: Array Manipulation

16.4.11.4 How To Create and Populate Arrays
...........................................

Besides working with arrays created by 'awk' code, you can create arrays
and populate them as you see fit, and then 'awk' code can access them
and manipulate them.

   There are two important points about creating arrays from extension
code:

   * You must install a new array into 'gawk''s symbol table immediately
     upon creating it.  Once you have done so, you can then populate the
     array.

     Similarly, if installing a new array as a subarray of an existing
     array, you must add the new array to its parent before adding any
     elements to it.

     Thus, the correct way to build an array is to work "top down."
     Create the array, and immediately install it in 'gawk''s symbol
     table using 'sym_update()', or install it as an element in a
     previously existing array using 'set_array_element()'.  We show
     example code shortly.

   * Due to 'gawk' internals, after using 'sym_update()' to install an
     array into 'gawk', you have to retrieve the array cookie from the
     value passed in to 'sym_update()' before doing anything else with
     it, like so:

          awk_value_t val;
          awk_array_t new_array;

          new_array = create_array();
          val.val_type = AWK_ARRAY;
          val.array_cookie = new_array;

          /* install array in the symbol table */
          sym_update("array", & val);

          new_array = val.array_cookie;    /* YOU MUST DO THIS */

     If installing an array as a subarray, you must also retrieve the
     value of the array cookie after the call to 'set_array_element()'.

   The following C code is a simple test extension to create an array
with two regular elements and with a subarray.  The leading '#include'
directives and boilerplate variable declarations (*note Extension API
Boilerplate::) are omitted for brevity.  The first step is to create a
new array and then install it in the symbol table:

     /* create_new_array --- create a named array */

     static void
     create_new_array()
     {
         awk_array_t a_cookie;
         awk_array_t subarray;
         awk_value_t index, value;

         a_cookie = create_array();
         value.val_type = AWK_ARRAY;
         value.array_cookie = a_cookie;

         if (! sym_update("new_array", & value))
             printf("create_new_array: sym_update(\"new_array\") failed!\n");
         a_cookie = value.array_cookie;

Note how 'a_cookie' is reset from the 'array_cookie' field in the
'value' structure.
+ + The second step is to install two regular values into 'new_array': + + (void) make_const_string("hello", 5, & index); + (void) make_const_string("world", 5, & value); + if (! set_array_element(a_cookie, & index, & value)) { + printf("fill_in_array: set_array_element failed\n"); + return; + } + + (void) make_const_string("answer", 6, & index); + (void) make_number(42.0, & value); + if (! set_array_element(a_cookie, & index, & value)) { + printf("fill_in_array: set_array_element failed\n"); + return; + } + + The third step is to create the subarray and install it: + + (void) make_const_string("subarray", 8, & index); + subarray = create_array(); + value.val_type = AWK_ARRAY; + value.array_cookie = subarray; + if (! set_array_element(a_cookie, & index, & value)) { + printf("fill_in_array: set_array_element failed\n"); + return; + } + subarray = value.array_cookie; + + The final step is to populate the subarray with its own element: + + (void) make_const_string("foo", 3, & index); + (void) make_const_string("bar", 3, & value); + if (! set_array_element(subarray, & index, & value)) { + printf("fill_in_array: set_array_element failed\n"); + return; + } + } + + Here is a sample script that loads the extension and then dumps the +array: + + @load "subarray" + + function dumparray(name, array, i) + { + for (i in array) + if (isarray(array[i])) + dumparray(name "[\"" i "\"]", array[i]) + else + printf("%s[\"%s\"] = %s\n", name, i, array[i]) + } + + BEGIN { + dumparray("new_array", new_array); + } + + Here is the result of running the script: + + $ AWKLIBPATH=$PWD ./gawk -f subarray.awk + -| new_array["subarray"]["foo"] = bar + -| new_array["hello"] = world + -| new_array["answer"] = 42 + +(*Note Finding Extensions:: for more information on the 'AWKLIBPATH' +environment variable.) + + +File: gawk.info, Node: Redirection API, Next: Extension API Variables, Prev: Array Manipulation, Up: Extension API Description + +16.4.12 Accessing and Manipulating Redirections +----------------------------------------------- + +The following function allows extensions to access and manipulate +redirections. + +'awk_bool_t get_file(const char *name,' +' size_t name_len,' +' const char *filetype,' +' int fd,' +' const awk_input_buf_t **ibufp,' +' const awk_output_buf_t **obufp);' + Look up a file in 'gawk''s internal redirection table. If 'name' + is 'NULL' or 'name_len' is zero, return data for the currently open + input file corresponding to 'FILENAME'. (This does not access the + 'filetype' argument, so that may be undefined). If the file is not + already open, attempt to open it. The 'filetype' argument must be + zero-terminated and should be one of: + + '">"' + A file opened for output. + + '">>"' + A file opened for append. + + '"<"' + A file opened for input. + + '"|>"' + A pipe opened for output. + + '"|<"' + A pipe opened for input. + + '"|&"' + A two-way coprocess. + + On error, return a 'false' value. Otherwise, return 'true', and + return additional information about the redirection in the 'ibufp' + and 'obufp' pointers. For input redirections, the '*ibufp' value + should be non-'NULL', and '*obufp' should be 'NULL'. For output + redirections, the '*obufp' value should be non-'NULL', and '*ibufp' + should be 'NULL'. For two-way coprocesses, both values should be + non-'NULL'. + + In the usual case, the extension is interested in '(*ibufp)->fd' + and/or 'fileno((*obufp)->fp)'. 
If the file is not already open, + and the 'fd' argument is non-negative, 'gawk' will use that file + descriptor instead of opening the file in the usual way. If 'fd' + is non-negative, but the file exists already, 'gawk' ignores 'fd' + and returns the existing file. It is the caller's responsibility + to notice that neither the 'fd' in the returned 'awk_input_buf_t' + nor the 'fd' in the returned 'awk_output_buf_t' matches the + requested value. + + Note that supplying a file descriptor is currently _not_ supported + for pipes. However, supplying a file descriptor should work for + input, output, append, and two-way (coprocess) sockets. If + 'filetype' is two-way, 'gawk' assumes that it is a socket! Note + that in the two-way case, the input and output file descriptors may + differ. To check for success, you must check whether either + matches. + + It is anticipated that this API function will be used to implement +I/O multiplexing and a socket library. + + +File: gawk.info, Node: Extension API Variables, Next: Extension API Boilerplate, Prev: Redirection API, Up: Extension API Description + +16.4.13 API Variables +--------------------- + +The API provides two sets of variables. The first provides information +about the version of the API (both with which the extension was +compiled, and with which 'gawk' was compiled). The second provides +information about how 'gawk' was invoked. + +* Menu: + +* Extension Versioning:: API Version information. +* Extension API Informational Variables:: Variables providing information about + 'gawk''s invocation. + + +File: gawk.info, Node: Extension Versioning, Next: Extension API Informational Variables, Up: Extension API Variables + +16.4.13.1 API Version Constants and Variables +............................................. + +The API provides both a "major" and a "minor" version number. The API +versions are available at compile time as C preprocessor defines to +support conditional compilation, and as enum constants to facilitate +debugging: + +API Version C preprocessor define enum constant +--------------------------------------------------------------------------- +Major gawk_api_major_version GAWK_API_MAJOR_VERSION +Minor gawk_api_minor_version GAWK_API_MINOR_VERSION + +Table 16.2: gawk API version constants + + The minor version increases when new functions are added to the API. +Such new functions are always added to the end of the API 'struct'. + + The major version increases (and the minor version is reset to zero) +if any of the data types change size or member order, or if any of the +existing functions change signature. + + It could happen that an extension may be compiled against one version +of the API but loaded by a version of 'gawk' using a different version. +For this reason, the major and minor API versions of the running 'gawk' +are included in the API 'struct' as read-only constant integers: + +'api->major_version' + The major version of the running 'gawk' + +'api->minor_version' + The minor version of the running 'gawk' + + It is up to the extension to decide if there are API +incompatibilities. 
Typically, a check like this is enough: + + if (api->major_version != GAWK_API_MAJOR_VERSION + || api->minor_version < GAWK_API_MINOR_VERSION) { + fprintf(stderr, "foo_extension: version mismatch with gawk!\n"); + fprintf(stderr, "\tmy version (%d, %d), gawk version (%d, %d)\n", + GAWK_API_MAJOR_VERSION, GAWK_API_MINOR_VERSION, + api->major_version, api->minor_version); + exit(1); + } + + Such code is included in the boilerplate 'dl_load_func()' macro +provided in 'gawkapi.h' (discussed in *note Extension API +Boilerplate::). + + +File: gawk.info, Node: Extension API Informational Variables, Prev: Extension Versioning, Up: Extension API Variables + +16.4.13.2 Informational Variables +................................. + +The API provides access to several variables that describe whether the +corresponding command-line options were enabled when 'gawk' was invoked. +The variables are: + +'do_debug' + This variable is true if 'gawk' was invoked with '--debug' option. + +'do_lint' + This variable is true if 'gawk' was invoked with '--lint' option. + +'do_mpfr' + This variable is true if 'gawk' was invoked with '--bignum' option. + +'do_profile' + This variable is true if 'gawk' was invoked with '--profile' + option. + +'do_sandbox' + This variable is true if 'gawk' was invoked with '--sandbox' + option. + +'do_traditional' + This variable is true if 'gawk' was invoked with '--traditional' + option. + + The value of 'do_lint' can change if 'awk' code modifies the 'LINT' +predefined variable (*note Built-in Variables::). The others should not +change during execution. + + +File: gawk.info, Node: Extension API Boilerplate, Prev: Extension API Variables, Up: Extension API Description + +16.4.14 Boilerplate Code +------------------------ + +As mentioned earlier (*note Extension Mechanism Outline::), the function +definitions as presented are really macros. To use these macros, your +extension must provide a small amount of boilerplate code (variables and +functions) toward the top of your source file, using predefined names as +described here. The boilerplate needed is also provided in comments in +the 'gawkapi.h' header file: + + /* Boilerplate code: */ + int plugin_is_GPL_compatible; + + static gawk_api_t *const api; + static awk_ext_id_t ext_id; + static const char *ext_version = NULL; /* or ... = "some string" */ + + static awk_ext_func_t func_table[] = { + { "name", do_name, 1 }, + /* ... */ + }; + + /* EITHER: */ + + static awk_bool_t (*init_func)(void) = NULL; + + /* OR: */ + + static awk_bool_t + init_my_extension(void) + { + ... + } + + static awk_bool_t (*init_func)(void) = init_my_extension; + + dl_load_func(func_table, some_name, "name_space_in_quotes") + + These variables and functions are as follows: + +'int plugin_is_GPL_compatible;' + This asserts that the extension is compatible with the GNU GPL + (*note Copying::). If your extension does not have this, 'gawk' + will not load it (*note Plugin License::). + +'static gawk_api_t *const api;' + This global 'static' variable should be set to point to the + 'gawk_api_t' pointer that 'gawk' passes to your 'dl_load()' + function. This variable is used by all of the macros. + +'static awk_ext_id_t ext_id;' + This global static variable should be set to the 'awk_ext_id_t' + value that 'gawk' passes to your 'dl_load()' function. This + variable is used by all of the macros. + +'static const char *ext_version = NULL; /* or ... 
= "some string" */' + This global 'static' variable should be set either to 'NULL', or to + point to a string giving the name and version of your extension. + +'static awk_ext_func_t func_table[] = { ... };' + This is an array of one or more 'awk_ext_func_t' structures, as + described earlier (*note Extension Functions::). It can then be + looped over for multiple calls to 'add_ext_func()'. + +'static awk_bool_t (*init_func)(void) = NULL;' +' OR' +'static awk_bool_t init_my_extension(void) { ... }' +'static awk_bool_t (*init_func)(void) = init_my_extension;' + If you need to do some initialization work, you should define a + function that does it (creates variables, opens files, etc.) and + then define the 'init_func' pointer to point to your function. The + function should return 'awk_false' upon failure, or 'awk_true' if + everything goes well. + + If you don't need to do any initialization, define the pointer and + initialize it to 'NULL'. + +'dl_load_func(func_table, some_name, "name_space_in_quotes")' + This macro expands to a 'dl_load()' function that performs all the + necessary initializations. + + The point of all the variables and arrays is to let the 'dl_load()' +function (from the 'dl_load_func()' macro) do all the standard work. It +does the following: + + 1. Check the API versions. If the extension major version does not + match 'gawk''s, or if the extension minor version is greater than + 'gawk''s, it prints a fatal error message and exits. + + 2. Load the functions defined in 'func_table'. If any of them fails + to load, it prints a warning message but continues on. + + 3. If the 'init_func' pointer is not 'NULL', call the function it + points to. If it returns 'awk_false', print a warning message. + + 4. If 'ext_version' is not 'NULL', register the version string with + 'gawk'. + + +File: gawk.info, Node: Finding Extensions, Next: Extension Example, Prev: Extension API Description, Up: Dynamic Extensions + +16.5 How 'gawk' Finds Extensions +================================ + +Compiled extensions have to be installed in a directory where 'gawk' can +find them. If 'gawk' is configured and built in the default fashion, +the directory in which to find extensions is '/usr/local/lib/gawk'. You +can also specify a search path with a list of directories to search for +compiled extensions. *Note AWKLIBPATH Variable:: for more information. + + +File: gawk.info, Node: Extension Example, Next: Extension Samples, Prev: Finding Extensions, Up: Dynamic Extensions + +16.6 Example: Some File Functions +================================= + + No matter where you go, there you are. + -- _Buckaroo Banzai_ + + Two useful functions that are not in 'awk' are 'chdir()' (so that an +'awk' program can change its directory) and 'stat()' (so that an 'awk' +program can gather information about a file). In order to illustrate +the API in action, this minor node implements these functions for 'gawk' +in an extension. + +* Menu: + +* Internal File Description:: What the new functions will do. +* Internal File Ops:: The code for internal file operations. +* Using Internal File Ops:: How to use an external extension. + + +File: gawk.info, Node: Internal File Description, Next: Internal File Ops, Up: Extension Example + +16.6.1 Using 'chdir()' and 'stat()' +----------------------------------- + +This minor node shows how to use the new functions at the 'awk' level +once they've been integrated into the running 'gawk' interpreter. Using +'chdir()' is very straightforward. 
It takes one argument, the new +directory to change to: + + @load "filefuncs" + ... + newdir = "/home/arnold/funstuff" + ret = chdir(newdir) + if (ret < 0) { + printf("could not change to %s: %s\n", newdir, ERRNO) > "/dev/stderr" + exit 1 + } + ... + + The return value is negative if the 'chdir()' failed, and 'ERRNO' +(*note Built-in Variables::) is set to a string indicating the error. + + Using 'stat()' is a bit more complicated. The C 'stat()' function +fills in a structure that has a fair amount of information. The right +way to model this in 'awk' is to fill in an associative array with the +appropriate information: + + file = "/home/arnold/.profile" + ret = stat(file, fdata) + if (ret < 0) { + printf("could not stat %s: %s\n", + file, ERRNO) > "/dev/stderr" + exit 1 + } + printf("size of %s is %d bytes\n", file, fdata["size"]) + + The 'stat()' function always clears the data array, even if the +'stat()' fails. It fills in the following elements: + +'"name"' + The name of the file that was 'stat()'ed. + +'"dev"' +'"ino"' + The file's device and inode numbers, respectively. + +'"mode"' + The file's mode, as a numeric value. This includes both the file's + type and its permissions. + +'"nlink"' + The number of hard links (directory entries) the file has. + +'"uid"' +'"gid"' + The numeric user and group ID numbers of the file's owner. + +'"size"' + The size in bytes of the file. + +'"blocks"' + The number of disk blocks the file actually occupies. This may not + be a function of the file's size if the file has holes. + +'"atime"' +'"mtime"' +'"ctime"' + The file's last access, modification, and inode update times, + respectively. These are numeric timestamps, suitable for + formatting with 'strftime()' (*note Time Functions::). + +'"pmode"' + The file's "printable mode." This is a string representation of + the file's type and permissions, such as is produced by 'ls + -l'--for example, '"drwxr-xr-x"'. + +'"type"' + A printable string representation of the file's type. The value is + one of the following: + + '"blockdev"' + '"chardev"' + The file is a block or character device ("special file"). + + '"directory"' + The file is a directory. + + '"fifo"' + The file is a named pipe (also known as a FIFO). + + '"file"' + The file is just a regular file. + + '"socket"' + The file is an 'AF_UNIX' ("Unix domain") socket in the + filesystem. + + '"symlink"' + The file is a symbolic link. + +'"devbsize"' + The size of a block for the element indexed by '"blocks"'. This + information is derived from either the 'DEV_BSIZE' constant defined + in '<sys/param.h>' on most systems, or the 'S_BLKSIZE' constant in + '<sys/stat.h>' on BSD systems. For some other systems, "a priori" + knowledge is used to provide a value. Where no value can be + determined, it defaults to 512. + + Several additional elements may be present, depending upon the +operating system and the type of the file. You can test for them in +your 'awk' program by using the 'in' operator (*note Reference to +Elements::): + +'"blksize"' + The preferred block size for I/O to the file. This field is not + present on all POSIX-like systems in the C 'stat' structure. + +'"linkval"' + If the file is a symbolic link, this element is the name of the + file the link points to (i.e., the value of the link). + +'"rdev"' +'"major"' +'"minor"' + If the file is a block or character device file, then these values + represent the numeric device number and the major and minor + components of that number, respectively. 
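
   For example, here is a minimal sketch that tests for these optional
elements with the 'in' operator before using them.  It reuses the 'file'
and 'fdata' names from the earlier example; '/dev/null' is simply a
convenient file that is a character device on most systems:

     @load "filefuncs"
     ...
     file = "/dev/null"
     if (stat(file, fdata) >= 0) {
         if ("linkval" in fdata)
             printf("%s is a symbolic link to %s\n", file, fdata["linkval"])
         if ("rdev" in fdata)
             printf("%s is a device: major %d, minor %d\n",
                    file, fdata["major"], fdata["minor"])
         if ("blksize" in fdata)
             printf("preferred I/O block size is %d\n", fdata["blksize"])
     }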
+ + +File: gawk.info, Node: Internal File Ops, Next: Using Internal File Ops, Prev: Internal File Description, Up: Extension Example + +16.6.2 C Code for 'chdir()' and 'stat()' +---------------------------------------- + +Here is the C code for these extensions.(1) + + The file includes a number of standard header files, and then +includes the 'gawkapi.h' header file, which provides the API +definitions. Those are followed by the necessary variable declarations +to make use of the API macros and boilerplate code (*note Extension API +Boilerplate::): + + #ifdef HAVE_CONFIG_H + #include <config.h> + #endif + + #include <stdio.h> + #include <assert.h> + #include <errno.h> + #include <stdlib.h> + #include <string.h> + #include <unistd.h> + + #include <sys/types.h> + #include <sys/stat.h> + + #include "gawkapi.h" + + #include "gettext.h" + #define _(msgid) gettext(msgid) + #define N_(msgid) msgid + + #include "gawkfts.h" + #include "stack.h" + + static const gawk_api_t *api; /* for convenience macros to work */ + static awk_ext_id_t *ext_id; + static awk_bool_t init_filefuncs(void); + static awk_bool_t (*init_func)(void) = init_filefuncs; + static const char *ext_version = "filefuncs extension: version 1.0"; + + int plugin_is_GPL_compatible; + + By convention, for an 'awk' function 'foo()', the C function that +implements it is called 'do_foo()'. The function should have two +arguments. The first is an 'int', usually called 'nargs', that +represents the number of actual arguments for the function. The second +is a pointer to an 'awk_value_t' structure, usually named 'result': + + /* do_chdir --- provide dynamically loaded chdir() function for gawk */ + + static awk_value_t * + do_chdir(int nargs, awk_value_t *result) + { + awk_value_t newdir; + int ret = -1; + + assert(result != NULL); + + if (do_lint && nargs != 1) + lintwarn(ext_id, + _("chdir: called with incorrect number of arguments, " + "expecting 1")); + + The 'newdir' variable represents the new directory to change to, +which is retrieved with 'get_argument()'. Note that the first argument +is numbered zero. + + If the argument is retrieved successfully, the function calls the +'chdir()' system call. If the 'chdir()' fails, 'ERRNO' is updated: + + if (get_argument(0, AWK_STRING, & newdir)) { + ret = chdir(newdir.str_value.str); + if (ret < 0) + update_ERRNO_int(errno); + } + + Finally, the function returns the return value to the 'awk' level: + + return make_number(ret, result); + } + + The 'stat()' extension is more involved. First comes a function that +turns a numeric mode into a printable representation (e.g., octal '0644' +becomes '-rw-r--r--'). This is omitted here for brevity: + + /* format_mode --- turn a stat mode field into something readable */ + + static char * + format_mode(unsigned long fmode) + { + ... + } + + Next comes a function for reading symbolic links, which is also +omitted here for brevity: + + /* read_symlink --- read a symbolic link into an allocated buffer. + ... */ + + static char * + read_symlink(const char *fname, size_t bufsize, ssize_t *linksize) + { + ... 
+ } + + Two helper functions simplify entering values in the array that will +contain the result of the 'stat()': + + /* array_set --- set an array element */ + + static void + array_set(awk_array_t array, const char *sub, awk_value_t *value) + { + awk_value_t index; + + set_array_element(array, + make_const_string(sub, strlen(sub), & index), + value); + + } + + /* array_set_numeric --- set an array element with a number */ + + static void + array_set_numeric(awk_array_t array, const char *sub, double num) + { + awk_value_t tmp; + + array_set(array, sub, make_number(num, & tmp)); + } + + The following function does most of the work to fill in the +'awk_array_t' result array with values obtained from a valid 'struct +stat'. This work is done in a separate function to support the 'stat()' +function for 'gawk' and also to support the 'fts()' extension, which is +included in the same file but whose code is not shown here (*note +Extension Sample File Functions::). + + The first part of the function is variable declarations, including a +table to map file types to strings: + + /* fill_stat_array --- do the work to fill an array with stat info */ + + static int + fill_stat_array(const char *name, awk_array_t array, struct stat *sbuf) + { + char *pmode; /* printable mode */ + const char *type = "unknown"; + awk_value_t tmp; + static struct ftype_map { + unsigned int mask; + const char *type; + } ftype_map[] = { + { S_IFREG, "file" }, + { S_IFBLK, "blockdev" }, + { S_IFCHR, "chardev" }, + { S_IFDIR, "directory" }, + #ifdef S_IFSOCK + { S_IFSOCK, "socket" }, + #endif + #ifdef S_IFIFO + { S_IFIFO, "fifo" }, + #endif + #ifdef S_IFLNK + { S_IFLNK, "symlink" }, + #endif + #ifdef S_IFDOOR /* Solaris weirdness */ + { S_IFDOOR, "door" }, + #endif /* S_IFDOOR */ + }; + int j, k; + + The destination array is cleared, and then code fills in various +elements based on values in the 'struct stat': + + /* empty out the array */ + clear_array(array); + + /* fill in the array */ + array_set(array, "name", make_const_string(name, strlen(name), + & tmp)); + array_set_numeric(array, "dev", sbuf->st_dev); + array_set_numeric(array, "ino", sbuf->st_ino); + array_set_numeric(array, "mode", sbuf->st_mode); + array_set_numeric(array, "nlink", sbuf->st_nlink); + array_set_numeric(array, "uid", sbuf->st_uid); + array_set_numeric(array, "gid", sbuf->st_gid); + array_set_numeric(array, "size", sbuf->st_size); + array_set_numeric(array, "blocks", sbuf->st_blocks); + array_set_numeric(array, "atime", sbuf->st_atime); + array_set_numeric(array, "mtime", sbuf->st_mtime); + array_set_numeric(array, "ctime", sbuf->st_ctime); + + /* for block and character devices, add rdev, + major and minor numbers */ + if (S_ISBLK(sbuf->st_mode) || S_ISCHR(sbuf->st_mode)) { + array_set_numeric(array, "rdev", sbuf->st_rdev); + array_set_numeric(array, "major", major(sbuf->st_rdev)); + array_set_numeric(array, "minor", minor(sbuf->st_rdev)); + } + +The latter part of the function makes selective additions to the +destination array, depending upon the availability of certain members +and/or the type of the file. 
It then returns zero, for success: + + #ifdef HAVE_STRUCT_STAT_ST_BLKSIZE + array_set_numeric(array, "blksize", sbuf->st_blksize); + #endif /* HAVE_STRUCT_STAT_ST_BLKSIZE */ + + pmode = format_mode(sbuf->st_mode); + array_set(array, "pmode", make_const_string(pmode, strlen(pmode), + & tmp)); + + /* for symbolic links, add a linkval field */ + if (S_ISLNK(sbuf->st_mode)) { + char *buf; + ssize_t linksize; + + if ((buf = read_symlink(name, sbuf->st_size, + & linksize)) != NULL) + array_set(array, "linkval", + make_malloced_string(buf, linksize, & tmp)); + else + warning(ext_id, _("stat: unable to read symbolic link `%s'"), + name); + } + + /* add a type field */ + type = "unknown"; /* shouldn't happen */ + for (j = 0, k = sizeof(ftype_map)/sizeof(ftype_map[0]); j < k; j++) { + if ((sbuf->st_mode & S_IFMT) == ftype_map[j].mask) { + type = ftype_map[j].type; + break; + } + } + + array_set(array, "type", make_const_string(type, strlen(type), & tmp)); + + return 0; + } + + The third argument to 'stat()' was not discussed previously. This +argument is optional. If present, it causes 'do_stat()' to use the +'stat()' system call instead of the 'lstat()' system call. This is done +by using a function pointer: 'statfunc'. 'statfunc' is initialized to +point to 'lstat()' (instead of 'stat()') to get the file information, in +case the file is a symbolic link. However, if the third argument is +included, 'statfunc' is set to point to 'stat()', instead. + + Here is the 'do_stat()' function, which starts with variable +declarations and argument checking: + + /* do_stat --- provide a stat() function for gawk */ + + static awk_value_t * + do_stat(int nargs, awk_value_t *result) + { + awk_value_t file_param, array_param; + char *name; + awk_array_t array; + int ret; + struct stat sbuf; + /* default is lstat() */ + int (*statfunc)(const char *path, struct stat *sbuf) = lstat; + + assert(result != NULL); + + if (nargs != 2 && nargs != 3) { + if (do_lint) + lintwarn(ext_id, + _("stat: called with wrong number of arguments")); + return make_number(-1, result); + } + + Then comes the actual work. First, the function gets the arguments. +Next, it gets the information for the file. If the called function +('lstat()' or 'stat()') returns an error, the code sets 'ERRNO' and +returns: + + /* file is first arg, array to hold results is second */ + if ( ! get_argument(0, AWK_STRING, & file_param) + || ! get_argument(1, AWK_ARRAY, & array_param)) { + warning(ext_id, _("stat: bad parameters")); + return make_number(-1, result); + } + + if (nargs == 3) { + statfunc = stat; + } + + name = file_param.str_value.str; + array = array_param.array_cookie; + + /* always empty out the array */ + clear_array(array); + + /* stat the file; if error, set ERRNO and return */ + ret = statfunc(name, & sbuf); + if (ret < 0) { + update_ERRNO_int(errno); + return make_number(ret, result); + } + + The tedious work is done by 'fill_stat_array()', shown earlier. When +done, the function returns the result from 'fill_stat_array()': + + ret = fill_stat_array(name, array, & sbuf); + + return make_number(ret, result); + } + + Finally, it's necessary to provide the "glue" that loads the new +function(s) into 'gawk'. + + The 'filefuncs' extension also provides an 'fts()' function, which we +omit here (*note Extension Sample File Functions::). For its sake, +there is an initialization function: + + /* init_filefuncs --- initialization routine */ + + static awk_bool_t + init_filefuncs(void) + { + ... + } + + We are almost done. 
We need an array of 'awk_ext_func_t' structures +for loading each function into 'gawk': + + static awk_ext_func_t func_table[] = { + { "chdir", do_chdir, 1 }, + { "stat", do_stat, 2 }, + #ifndef __MINGW32__ + { "fts", do_fts, 3 }, + #endif + }; + + Each extension must have a routine named 'dl_load()' to load +everything that needs to be loaded. It is simplest to use the +'dl_load_func()' macro in 'gawkapi.h': + + /* define the dl_load() function using the boilerplate macro */ + + dl_load_func(func_table, filefuncs, "") + + And that's it! + + ---------- Footnotes ---------- + + (1) This version is edited slightly for presentation. See +'extension/filefuncs.c' in the 'gawk' distribution for the complete +version. + + +File: gawk.info, Node: Using Internal File Ops, Prev: Internal File Ops, Up: Extension Example + +16.6.3 Integrating the Extensions +--------------------------------- + +Now that the code is written, it must be possible to add it at runtime +to the running 'gawk' interpreter. First, the code must be compiled. +Assuming that the functions are in a file named 'filefuncs.c', and IDIR +is the location of the 'gawkapi.h' header file, the following steps(1) +create a GNU/Linux shared library: + + $ gcc -fPIC -shared -DHAVE_CONFIG_H -c -O -g -IIDIR filefuncs.c + $ gcc -o filefuncs.so -shared filefuncs.o + + Once the library exists, it is loaded by using the '@load' keyword: + + # file testff.awk + @load "filefuncs" + + BEGIN { + "pwd" | getline curdir # save current directory + close("pwd") + + chdir("/tmp") + system("pwd") # test it + chdir(curdir) # go back + + print "Info for testff.awk" + ret = stat("testff.awk", data) + print "ret =", ret + for (i in data) + printf "data[\"%s\"] = %s\n", i, data[i] + print "testff.awk modified:", + strftime("%m %d %Y %H:%M:%S", data["mtime"]) + + print "\nInfo for JUNK" + ret = stat("JUNK", data) + print "ret =", ret + for (i in data) + printf "data[\"%s\"] = %s\n", i, data[i] + print "JUNK modified:", strftime("%m %d %Y %H:%M:%S", data["mtime"]) + } + + The 'AWKLIBPATH' environment variable tells 'gawk' where to find +extensions (*note Finding Extensions::). We set it to the current +directory and run the program: + + $ AWKLIBPATH=$PWD gawk -f testff.awk + -| /tmp + -| Info for testff.awk + -| ret = 0 + -| data["blksize"] = 4096 + -| data["devbsize"] = 512 + -| data["mtime"] = 1412004710 + -| data["mode"] = 33204 + -| data["type"] = file + -| data["dev"] = 2053 + -| data["gid"] = 1000 + -| data["ino"] = 10358899 + -| data["ctime"] = 1412004710 + -| data["blocks"] = 8 + -| data["nlink"] = 1 + -| data["name"] = testff.awk + -| data["atime"] = 1412004716 + -| data["pmode"] = -rw-rw-r-- + -| data["size"] = 666 + -| data["uid"] = 1000 + -| testff.awk modified: 09 29 2014 18:31:50 + -| + -| Info for JUNK + -| ret = -1 + -| JUNK modified: 01 01 1970 02:00:00 + + ---------- Footnotes ---------- + + (1) In practice, you would probably want to use the GNU Autotools +(Automake, Autoconf, Libtool, and 'gettext') to configure and build your +libraries. Instructions for doing so are beyond the scope of this Info +file. *Note gawkextlib:: for Internet links to the tools. + + +File: gawk.info, Node: Extension Samples, Next: gawkextlib, Prev: Extension Example, Up: Dynamic Extensions + +16.7 The Sample Extensions in the 'gawk' Distribution +===================================================== + +This minor node provides a brief overview of the sample extensions that +come in the 'gawk' distribution. 
Some of them are intended for +production use (e.g., the 'filefuncs', 'readdir', and 'inplace' +extensions). Others mainly provide example code that shows how to use +the extension API. + +* Menu: + +* Extension Sample File Functions:: The file functions sample. +* Extension Sample Fnmatch:: An interface to 'fnmatch()'. +* Extension Sample Fork:: An interface to 'fork()' and other + process functions. +* Extension Sample Inplace:: Enabling in-place file editing. +* Extension Sample Ord:: Character to value to character + conversions. +* Extension Sample Readdir:: An interface to 'readdir()'. +* Extension Sample Revout:: Reversing output sample output wrapper. +* Extension Sample Rev2way:: Reversing data sample two-way processor. +* Extension Sample Read write array:: Serializing an array to a file. +* Extension Sample Readfile:: Reading an entire file into a string. +* Extension Sample Time:: An interface to 'gettimeofday()' + and 'sleep()'. +* Extension Sample API Tests:: Tests for the API. + + +File: gawk.info, Node: Extension Sample File Functions, Next: Extension Sample Fnmatch, Up: Extension Samples + +16.7.1 File-Related Functions +----------------------------- + +The 'filefuncs' extension provides three different functions, as +follows. The usage is: + +'@load "filefuncs"' + This is how you load the extension. + +'result = chdir("/some/directory")' + The 'chdir()' function is a direct hook to the 'chdir()' system + call to change the current directory. It returns zero upon success + or a value less than zero upon error. In the latter case, it + updates 'ERRNO'. + +'result = stat("/some/path", statdata' [', follow']')' + The 'stat()' function provides a hook into the 'stat()' system + call. It returns zero upon success or a value less than zero upon + error. In the latter case, it updates 'ERRNO'. + + By default, it uses the 'lstat()' system call. However, if passed + a third argument, it uses 'stat()' instead. + + In all cases, it clears the 'statdata' array. When the call is + successful, 'stat()' fills the 'statdata' array with information + retrieved from the filesystem, as follows: + + Subscript Field in 'struct stat' File type + ---------------------------------------------------------------- + '"name"' The file name All + '"dev"' 'st_dev' All + '"ino"' 'st_ino' All + '"mode"' 'st_mode' All + '"nlink"' 'st_nlink' All + '"uid"' 'st_uid' All + '"gid"' 'st_gid' All + '"size"' 'st_size' All + '"atime"' 'st_atime' All + '"mtime"' 'st_mtime' All + '"ctime"' 'st_ctime' All + '"rdev"' 'st_rdev' Device files + '"major"' 'st_major' Device files + '"minor"' 'st_minor' Device files + '"blksize"' 'st_blksize' All + '"pmode"' A human-readable version of the All + mode value, like that printed by + 'ls' (for example, '"-rwxr-xr-x"') + '"linkval"' The value of the symbolic link Symbolic + links + '"type"' The type of the file as a All + string--one of '"file"', + '"blockdev"', '"chardev"', + '"directory"', '"socket"', + '"fifo"', '"symlink"', '"door"', + or '"unknown"' (not all systems + support all file types) + +'flags = or(FTS_PHYSICAL, ...)' +'result = fts(pathlist, flags, filedata)' + Walk the file trees provided in 'pathlist' and fill in the + 'filedata' array, as described next. 'flags' is the bitwise OR of + several predefined values, also described in a moment. Return zero + if there were no errors, otherwise return -1. + + The 'fts()' function provides a hook to the C library 'fts()' +routines for traversing file hierarchies. 
Instead of returning data +about one file at a time in a stream, it fills in a multidimensional +array with data about each file and directory encountered in the +requested hierarchies. + + The arguments are as follows: + +'pathlist' + An array of file names. The element values are used; the index + values are ignored. + +'flags' + This should be the bitwise OR of one or more of the following + predefined constant flag values. At least one of 'FTS_LOGICAL' or + 'FTS_PHYSICAL' must be provided; otherwise 'fts()' returns an error + value and sets 'ERRNO'. The flags are: + + 'FTS_LOGICAL' + Do a "logical" file traversal, where the information returned + for a symbolic link refers to the linked-to file, and not to + the symbolic link itself. This flag is mutually exclusive + with 'FTS_PHYSICAL'. + + 'FTS_PHYSICAL' + Do a "physical" file traversal, where the information returned + for a symbolic link refers to the symbolic link itself. This + flag is mutually exclusive with 'FTS_LOGICAL'. + + 'FTS_NOCHDIR' + As a performance optimization, the C library 'fts()' routines + change directory as they traverse a file hierarchy. This flag + disables that optimization. + + 'FTS_COMFOLLOW' + Immediately follow a symbolic link named in 'pathlist', + whether or not 'FTS_LOGICAL' is set. + + 'FTS_SEEDOT' + By default, the C library 'fts()' routines do not return + entries for '.' (dot) and '..' (dot-dot). This option causes + entries for dot-dot to also be included. (The extension + always includes an entry for dot; more on this in a moment.) + + 'FTS_XDEV' + During a traversal, do not cross onto a different mounted + filesystem. + +'filedata' + The 'filedata' array holds the results. 'fts()' first clears it. + Then it creates an element in 'filedata' for every element in + 'pathlist'. The index is the name of the directory or file given + in 'pathlist'. The element for this index is itself an array. + There are two cases: + + _The path is a file_ + In this case, the array contains two or three elements: + + '"path"' + The full path to this file, starting from the "root" that + was given in the 'pathlist' array. + + '"stat"' + This element is itself an array, containing the same + information as provided by the 'stat()' function + described earlier for its 'statdata' argument. The + element may not be present if the 'stat()' system call + for the file failed. + + '"error"' + If some kind of error was encountered, the array will + also contain an element named '"error"', which is a + string describing the error. + + _The path is a directory_ + In this case, the array contains one element for each entry in + the directory. If an entry is a file, that element is the + same as for files, just described. If the entry is a + directory, that element is (recursively) an array describing + the subdirectory. If 'FTS_SEEDOT' was provided in the flags, + then there will also be an element named '".."'. This element + will be an array containing the data as provided by 'stat()'. + + In addition, there will be an element whose index is '"."'. + This element is an array containing the same two or three + elements as for a file: '"path"', '"stat"', and '"error"'. + + The 'fts()' function returns zero if there were no errors. +Otherwise, it returns -1. + + NOTE: The 'fts()' extension does not exactly mimic the interface of + the C library 'fts()' routines, choosing instead to provide an + interface that is based on associative arrays, which is more + comfortable to use from an 'awk' program. 
This includes the lack + of a comparison function, because 'gawk' already provides powerful + array sorting facilities. Although an 'fts_read()'-like interface + could have been provided, this felt less natural than simply + creating a multidimensional array to represent the file hierarchy + and its information. + + See 'test/fts.awk' in the 'gawk' distribution for an example use of +the 'fts()' extension function. + + +File: gawk.info, Node: Extension Sample Fnmatch, Next: Extension Sample Fork, Prev: Extension Sample File Functions, Up: Extension Samples + +16.7.2 Interface to 'fnmatch()' +------------------------------- + +This extension provides an interface to the C library 'fnmatch()' +function. The usage is: + +'@load "fnmatch"' + This is how you load the extension. + +'result = fnmatch(pattern, string, flags)' + The return value is zero on success, 'FNM_NOMATCH' if the string + did not match the pattern, or a different nonzero value if an error + occurred. + + In addition to the 'fnmatch()' function, the 'fnmatch' extension adds +one constant ('FNM_NOMATCH'), and an array of flag values named 'FNM'. + + The arguments to 'fnmatch()' are: + +'pattern' + The file name wildcard to match + +'string' + The file name string + +'flag' + Either zero, or the bitwise OR of one or more of the flags in the + 'FNM' array + + The flags are as follows: + +Array element Corresponding flag defined by 'fnmatch()' +-------------------------------------------------------------------------- +'FNM["CASEFOLD"]' 'FNM_CASEFOLD' +'FNM["FILE_NAME"]' 'FNM_FILE_NAME' +'FNM["LEADING_DIR"]''FNM_LEADING_DIR' +'FNM["NOESCAPE"]' 'FNM_NOESCAPE' +'FNM["PATHNAME"]' 'FNM_PATHNAME' +'FNM["PERIOD"]' 'FNM_PERIOD' + + Here is an example: + + @load "fnmatch" + ... + flags = or(FNM["PERIOD"], FNM["NOESCAPE"]) + if (fnmatch("*.a", "foo.c", flags) == FNM_NOMATCH) + print "no match" + + +File: gawk.info, Node: Extension Sample Fork, Next: Extension Sample Inplace, Prev: Extension Sample Fnmatch, Up: Extension Samples + +16.7.3 Interface to 'fork()', 'wait()', and 'waitpid()' +------------------------------------------------------- + +The 'fork' extension adds three functions, as follows: + +'@load "fork"' + This is how you load the extension. + +'pid = fork()' + This function creates a new process. The return value is zero in + the child and the process ID number of the child in the parent, or + -1 upon error. In the latter case, 'ERRNO' indicates the problem. + In the child, 'PROCINFO["pid"]' and 'PROCINFO["ppid"]' are updated + to reflect the correct values. + +'ret = waitpid(pid)' + This function takes a numeric argument, which is the process ID to + wait for. The return value is that of the 'waitpid()' system call. + +'ret = wait()' + This function waits for the first child to die. The return value + is that of the 'wait()' system call. + + There is no corresponding 'exec()' function. + + Here is an example: + + @load "fork" + ... + if ((pid = fork()) == 0) + print "hello from the child" + else + print "hello from the parent" + + +File: gawk.info, Node: Extension Sample Inplace, Next: Extension Sample Ord, Prev: Extension Sample Fork, Up: Extension Samples + +16.7.4 Enabling In-Place File Editing +------------------------------------- + +The 'inplace' extension emulates GNU 'sed''s '-i' option, which performs +"in-place" editing of each input file. It uses the bundled +'inplace.awk' include file to invoke the extension properly: + + # inplace --- load and invoke the inplace extension. 
+ + @load "inplace" + + # Please set INPLACE_SUFFIX to make a backup copy. For example, you may + # want to set INPLACE_SUFFIX to .bak on the command line or in a BEGIN rule. + + # By default, each filename on the command line will be edited inplace. + # But you can selectively disable this by adding an inplace=0 argument + # prior to files that you do not want to process this way. You can then + # reenable it later on the commandline by putting inplace=1 before files + # that you wish to be subject to inplace editing. + + # N.B. We call inplace_end() in the BEGINFILE and END rules so that any + # actions in an ENDFILE rule will be redirected as expected. + + BEGIN { + inplace = 1 # enabled by default + } + + BEGINFILE { + if (_inplace_filename != "") + inplace_end(_inplace_filename, INPLACE_SUFFIX) + if (inplace) + inplace_begin(_inplace_filename = FILENAME, INPLACE_SUFFIX) + else + _inplace_filename = "" + } + + END { + if (_inplace_filename != "") + inplace_end(_inplace_filename, INPLACE_SUFFIX) + } + + For each regular file that is processed, the extension redirects +standard output to a temporary file configured to have the same owner +and permissions as the original. After the file has been processed, the +extension restores standard output to its original destination. If +'INPLACE_SUFFIX' is not an empty string, the original file is linked to +a backup file name created by appending that suffix. Finally, the +temporary file is renamed to the original file name. + + Note that the use of this feature can be controlled by placing +'inplace=0' on the command-line prior to listing files that should not +be processed this way. You can reenable inplace editing by adding an +'inplace=1' argument prior to files that should be subject to inplace +editing. + + The '_inplace_filename' variable serves to keep track of the current +filename so as to not invoke 'inplace_end()' before processing the first +file. + + If any error occurs, the extension issues a fatal error to terminate +processing immediately without damaging the original file. + + Here are some simple examples: + + $ gawk -i inplace '{ gsub(/foo/, "bar") }; { print }' file1 file2 file3 + + To keep a backup copy of the original files, try this: + + $ gawk -i inplace -v INPLACE_SUFFIX=.bak '{ gsub(/foo/, "bar") } + > { print }' file1 file2 file3 + + Please note that, while the extension does attempt to preserve +ownership and permissions, it makes no attempt to copy the ACLs from the +original file. + + If the program dies prematurely, as might happen if an unhandled +signal is received, a temporary file may be left behind. + + +File: gawk.info, Node: Extension Sample Ord, Next: Extension Sample Readdir, Prev: Extension Sample Inplace, Up: Extension Samples + +16.7.5 Character and Numeric values: 'ord()' and 'chr()' +-------------------------------------------------------- + +The 'ordchr' extension adds two functions, named 'ord()' and 'chr()', as +follows: + +'@load "ordchr"' + This is how you load the extension. + +'number = ord(string)' + Return the numeric value of the first character in 'string'. + +'char = chr(number)' + Return a string whose first character is that represented by + 'number'. + + These functions are inspired by the Pascal language functions of the +same name. Here is an example: + + @load "ordchr" + ... 
+ printf("The numeric value of 'A' is %d\n", ord("A")) + printf("The string value of 65 is %s\n", chr(65)) + + +File: gawk.info, Node: Extension Sample Readdir, Next: Extension Sample Revout, Prev: Extension Sample Ord, Up: Extension Samples + +16.7.6 Reading Directories +-------------------------- + +The 'readdir' extension adds an input parser for directories. The usage +is as follows: + + @load "readdir" + + When this extension is in use, instead of skipping directories named +on the command line (or with 'getline'), they are read, with each entry +returned as a record. + + The record consists of three fields. The first two are the inode +number and the file name, separated by a forward slash character. On +systems where the directory entry contains the file type, the record has +a third field (also separated by a slash), which is a single letter +indicating the type of the file. The letters and their corresponding +file types are shown in *note Table 16.3: table-readdir-file-types. + +Letter File type +-------------------------------------------------------------------------- +'b' Block device +'c' Character device +'d' Directory +'f' Regular file +'l' Symbolic link +'p' Named pipe (FIFO) +'s' Socket +'u' Anything else (unknown) + +Table 16.3: File types returned by the 'readdir' extension + + On systems without the file type information, the third field is +always 'u'. + + NOTE: On GNU/Linux systems, there are filesystems that don't + support the 'd_type' entry (see the readdir(3) manual page), and so + the file type is always 'u'. You can use the 'filefuncs' extension + to call 'stat()' in order to get correct type information. + + Here is an example: + + @load "readdir" + ... + BEGIN { FS = "/" } + { print "file name is", $2 } + + +File: gawk.info, Node: Extension Sample Revout, Next: Extension Sample Rev2way, Prev: Extension Sample Readdir, Up: Extension Samples + +16.7.7 Reversing Output +----------------------- + +The 'revoutput' extension adds a simple output wrapper that reverses the +characters in each output line. Its main purpose is to show how to +write an output wrapper, although it may be mildly amusing for the +unwary. Here is an example: + + @load "revoutput" + + BEGIN { + REVOUT = 1 + print "don't panic" > "/dev/stdout" + } + + The output from this program is 'cinap t'nod'. + + +File: gawk.info, Node: Extension Sample Rev2way, Next: Extension Sample Read write array, Prev: Extension Sample Revout, Up: Extension Samples + +16.7.8 Two-Way I/O Example +-------------------------- + +The 'revtwoway' extension adds a simple two-way processor that reverses +the characters in each line sent to it for reading back by the 'awk' +program. Its main purpose is to show how to write a two-way processor, +although it may also be mildly amusing. The following example shows how +to use it: + + @load "revtwoway" + + BEGIN { + cmd = "/magic/mirror" + print "don't panic" |& cmd + cmd |& getline result + print result + close(cmd) + } + + The output from this program is: 'cinap t'nod'. + + +File: gawk.info, Node: Extension Sample Read write array, Next: Extension Sample Readfile, Prev: Extension Sample Rev2way, Up: Extension Samples + +16.7.9 Dumping and Restoring an Array +------------------------------------- + +The 'rwarray' extension adds two functions, named 'writea()' and +'reada()', as follows: + +'@load "rwarray"' + This is how you load the extension. 
+ +'ret = writea(file, array)' + This function takes a string argument, which is the name of the + file to which to dump the array, and the array itself as the second + argument. 'writea()' understands arrays of arrays. It returns one + on success, or zero upon failure. + +'ret = reada(file, array)' + 'reada()' is the inverse of 'writea()'; it reads the file named as + its first argument, filling in the array named as the second + argument. It clears the array first. Here too, the return value + is one on success, or zero upon failure. + + The array created by 'reada()' is identical to that written by +'writea()' in the sense that the contents are the same. However, due to +implementation issues, the array traversal order of the re-created array +is likely to be different from that of the original array. As array +traversal order in 'awk' is by default undefined, this is (technically) +not a problem. If you need to guarantee a particular traversal order, +use the array sorting features in 'gawk' to do so (*note Array +Sorting::). + + The file contains binary data. All integral values are written in +network byte order. However, double-precision floating-point values are +written as native binary data. Thus, arrays containing only string data +can theoretically be dumped on systems with one byte order and restored +on systems with a different one, but this has not been tried. + + Here is an example: + + @load "rwarray" + ... + ret = writea("arraydump.bin", array) + ... + ret = reada("arraydump.bin", array) + + +File: gawk.info, Node: Extension Sample Readfile, Next: Extension Sample Time, Prev: Extension Sample Read write array, Up: Extension Samples + +16.7.10 Reading an Entire File +------------------------------ + +The 'readfile' extension adds a single function named 'readfile()', and +an input parser: + +'@load "readfile"' + This is how you load the extension. + +'result = readfile("/some/path")' + The argument is the name of the file to read. The return value is + a string containing the entire contents of the requested file. + Upon error, the function returns the empty string and sets 'ERRNO'. + +'BEGIN { PROCINFO["readfile"] = 1 }' + In addition, the extension adds an input parser that is activated + if 'PROCINFO["readfile"]' exists. When activated, each input file + is returned in its entirety as '$0'. 'RT' is set to the null + string. + + Here is an example: + + @load "readfile" + ... + contents = readfile("/path/to/file"); + if (contents == "" && ERRNO != "") { + print("problem reading file", ERRNO) > "/dev/stderr" + ... + } + + +File: gawk.info, Node: Extension Sample Time, Next: Extension Sample API Tests, Prev: Extension Sample Readfile, Up: Extension Samples + +16.7.11 Extension Time Functions +-------------------------------- + +The 'time' extension adds two functions, named 'gettimeofday()' and +'sleep()', as follows: + +'@load "time"' + This is how you load the extension. + +'the_time = gettimeofday()' + Return the time in seconds that has elapsed since 1970-01-01 UTC as + a floating-point value. If the time is unavailable on this + platform, return -1 and set 'ERRNO'. The returned time should have + sub-second precision, but the actual precision may vary based on + the platform. If the standard C 'gettimeofday()' system call is + available on this platform, then it simply returns the value. + Otherwise, if on MS-Windows, it tries to use + 'GetSystemTimeAsFileTime()'. + +'result = sleep(SECONDS)' + Attempt to sleep for SECONDS seconds. 
If SECONDS is negative, or
+ the attempt to sleep fails, return -1 and set 'ERRNO'. Otherwise,
+ return zero after sleeping for the indicated amount of time. Note
+ that SECONDS may be a floating-point (nonintegral) value.
+ Implementation details: depending on platform availability, this
+ function tries to use 'nanosleep()' or 'select()' to implement the
+ delay.
+
+
+ File: gawk.info, Node: Extension Sample API Tests, Prev: Extension Sample Time, Up: Extension Samples
+
+ 16.7.12 API Tests
+ -----------------
+
+ The 'testext' extension exercises parts of the extension API that are
+ not tested by the other samples. The 'extension/testext.c' file
+ contains both the C code for the extension and 'awk' test code inside C
+ comments that run the tests. The testing framework extracts the 'awk'
+ code and runs the tests. See the source file for more information.
+
+
+ File: gawk.info, Node: gawkextlib, Next: Extension summary, Prev: Extension Samples, Up: Dynamic Extensions
+
+ 16.8 The 'gawkextlib' Project
+ =============================
+
+ The 'gawkextlib' (http://sourceforge.net/projects/gawkextlib/) project
+ provides a number of 'gawk' extensions, including one for processing XML
+ files. This is the evolution of the original 'xgawk' (XML 'gawk')
+ project.
+
+ As of this writing, there are eight extensions:
+
+ * 'errno' extension
+
+ * GD graphics library extension
+
+ * MPFR library extension (this provides access to a number of MPFR
+ functions that 'gawk''s native MPFR support does not)
+
+ * PDF extension
+
+ * PostgreSQL extension
+
+ * Redis extension
+
+ * Select extension
+
+ * XML parser extension, using the Expat
+ (http://expat.sourceforge.net) XML parsing library
+
+ You can check out the code for the 'gawkextlib' project using the Git
+ (http://git-scm.com) distributed source code control system. The
+ command is as follows:
+
+ git clone git://git.code.sf.net/p/gawkextlib/code gawkextlib-code
+
+ You will need to have the Expat (http://expat.sourceforge.net) XML
+ parser library installed in order to build and use the XML extension.
+
+ In addition, you must have the GNU Autotools installed (Autoconf
+ (http://www.gnu.org/software/autoconf), Automake
+ (http://www.gnu.org/software/automake), Libtool
+ (http://www.gnu.org/software/libtool), and GNU 'gettext'
+ (http://www.gnu.org/software/gettext)).
+
+ The simple recipe for building and testing 'gawkextlib' is as
+ follows. First, build and install 'gawk':
+
+ cd .../path/to/gawk/code
+ ./configure --prefix=/tmp/newgawk Install in /tmp/newgawk for now
+ make && make check Build and check that all is OK
+ make install Install gawk
+
+ Next, go to <http://sourceforge.net/projects/gawkextlib/files> to
+ download 'gawkextlib' and any extensions that you would like to build.
+ The 'README' file at that site explains how to build the code. If you
+ installed 'gawk' in a non-standard location, you will need to specify
+ './configure --with-gawk=/PATH/TO/GAWK' to find it. You may need to use
+ the 'sudo' utility to install both 'gawk' and 'gawkextlib', depending
+ upon how your system works. A sketch of what such a session might look
+ like appears at the end of this minor node.
+
+ If you write an extension that you wish to share with other 'gawk'
+ users, consider doing so through the 'gawkextlib' project. See the
+ project's website for more information.
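+
+ As a rough illustration only (the 'README' file mentioned above remains
+ the authoritative reference), a 'gawkextlib' build against the 'gawk'
+ installed in '/tmp/newgawk' earlier might look something like the
+ following; the source directory name is a placeholder:
+
+ cd .../path/to/gawkextlib/code Change to the gawkextlib source
+ ./configure --with-gawk=/tmp/newgawk Point at the gawk installed above
+ make && make check Build and test the extensions
+ make install Install them (perhaps via sudo)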
+ + +File: gawk.info, Node: Extension summary, Next: Extension Exercises, Prev: gawkextlib, Up: Dynamic Extensions + +16.9 Summary +============ + + * You can write extensions (sometimes called plug-ins) for 'gawk' in + C or C++ using the application programming interface (API) defined + by the 'gawk' developers. + + * Extensions must have a license compatible with the GNU General + Public License (GPL), and they must assert that fact by declaring a + variable named 'plugin_is_GPL_compatible'. + + * Communication between 'gawk' and an extension is two-way. 'gawk' + passes a 'struct' to the extension that contains various data + fields and function pointers. The extension can then call into + 'gawk' via the supplied function pointers to accomplish certain + tasks. + + * One of these tasks is to "register" the name and implementation of + new 'awk'-level functions with 'gawk'. The implementation takes + the form of a C function pointer with a defined signature. By + convention, implementation functions are named 'do_XXXX()' for some + 'awk'-level function 'XXXX()'. + + * The API is defined in a header file named 'gawkapi.h'. You must + include a number of standard header files _before_ including it in + your source file. + + * API function pointers are provided for the following kinds of + operations: + + * Allocating, reallocating, and releasing memory + + * Registration functions (you may register extension functions, + exit callbacks, a version string, input parsers, output + wrappers, and two-way processors) + + * Printing fatal, nonfatal, warning, and "lint" warning messages + + * Updating 'ERRNO', or unsetting it + + * Accessing parameters, including converting an undefined + parameter into an array + + * Symbol table access (retrieving a global variable, creating + one, or changing one) + + * Creating and releasing cached values; this provides an + efficient way to use values for multiple variables and can be + a big performance win + + * Manipulating arrays (retrieving, adding, deleting, and + modifying elements; getting the count of elements in an array; + creating a new array; clearing an array; and flattening an + array for easy C-style looping over all its indices and + elements) + + * The API defines a number of standard data types for representing + 'awk' values, array elements, and arrays. + + * The API provides convenience functions for constructing values. It + also provides memory management functions to ensure compatibility + between memory allocated by 'gawk' and memory allocated by an + extension. + + * _All_ memory passed from 'gawk' to an extension must be treated as + read-only by the extension. + + * _All_ memory passed from an extension to 'gawk' must come from the + API's memory allocation functions. 'gawk' takes responsibility for + the memory and releases it when appropriate. + + * The API provides information about the running version of 'gawk' so + that an extension can make sure it is compatible with the 'gawk' + that loaded it. + + * It is easiest to start a new extension by copying the boilerplate + code described in this major node. Macros in the 'gawkapi.h' + header file make this easier to do. + + * The 'gawk' distribution includes a number of small but useful + sample extensions. The 'gawkextlib' project includes several more + (larger) extensions. If you wish to write an extension and + contribute it to the community of 'gawk' users, the 'gawkextlib' + project is the place to do so. 
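+
+ As a concrete, purely illustrative companion to the sample extensions
+ just mentioned, building and loading one of them by hand on a GNU/Linux
+ system might look roughly like this; the exact compiler options vary by
+ platform, and the distribution's build machinery normally supplies them
+ for you:
+
+ gcc -fPIC -shared -Iinclude -o filefuncs.so filefuncs.c
+ gawk -l filefuncs -f yourprog.awk data
+
+ Here 'yourprog.awk' and 'data' are placeholder names for your own
+ program and input file; the '-l' option asks 'gawk' to load the
+ 'filefuncs' extension via the 'AWKLIBPATH' search path.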
+
+ File: gawk.info, Node: Extension Exercises, Prev: Extension summary, Up: Dynamic Extensions
+
+ 16.10 Exercises
+ ===============
+
+ 1. Add functions to implement system calls such as 'chown()',
+ 'chmod()', and 'umask()' to the file operations extension presented
+ in *note Internal File Ops::.
+
+ 2. Write an input parser that prints a prompt if the input is from a
+ "terminal" device. You can use the 'isatty()' function to tell if
+ the input file is a terminal. (Hint: this function is usually
+ expensive to call; try to call it just once.) The content of the
+ prompt should come from a variable settable by 'awk'-level code.
+ You can write the prompt to standard error. However, for best
+ results, open a new file descriptor (or file pointer) on '/dev/tty'
+ and print the prompt there, in case standard error has been
+ redirected.
+
+ Why is standard error a better choice than standard output for
+ writing the prompt? Which reading mechanism should you replace,
+ the one to get a record, or the one to read raw bytes?
+
+ 3. (Hard.) How would you provide namespaces in 'gawk', so that the
+ names of functions in different extensions don't conflict with each
+ other? If you come up with a really good scheme, contact the
+ 'gawk' maintainer to tell him about it.
+
+ 4. Write a wrapper script that provides an interface similar to 'sed
+ -i' for the "inplace" extension presented in *note Extension Sample
+ Inplace::.
+
+
+ File: gawk.info, Node: Language History, Next: Installation, Prev: Dynamic Extensions, Up: Top
+
+ Appendix A The Evolution of the 'awk' Language
+ **********************************************
+
+ This Info file describes the GNU implementation of 'awk', which follows
+ the POSIX specification. Many longtime 'awk' users learned 'awk'
+ programming with the original 'awk' implementation in Version 7 Unix.
+ (This implementation was the basis for 'awk' in Berkeley Unix, through
+ 4.3-Reno. Subsequent versions of Berkeley Unix, and, for a while, some
+ systems derived from 4.4BSD-Lite, used various versions of 'gawk' for
+ their 'awk'.) This major node briefly describes the evolution of the
+ 'awk' language, with cross-references to other parts of the Info file
+ where you can find more information.
+
+ * Menu:
+
+ * V7/SVR3.1:: The major changes between V7 and System V
+ Release 3.1.
+ * SVR4:: Minor changes between System V Releases 3.1
+ and 4.
+ * POSIX:: New features from the POSIX standard.
+ * BTL:: New features from Brian Kernighan's version of
+ 'awk'.
+ * POSIX/GNU:: The extensions in 'gawk' not in POSIX
+ 'awk'.
+ * Feature History:: The history of the features in 'gawk'.
+ * Common Extensions:: Common Extensions Summary.
+ * Ranges and Locales:: How locales used to affect regexp ranges.
+ * Contributors:: The major contributors to 'gawk'.
+ * History summary:: History summary.
+
+
+ File: gawk.info, Node: V7/SVR3.1, Next: SVR4, Up: Language History
+
+ A.1 Major Changes Between V7 and SVR3.1
+ =======================================
+
+ The 'awk' language evolved considerably between the release of Version 7
+ Unix (1978) and the new version that was first made generally available
+ in System V Release 3.1 (1987). 
This minor node summarizes the changes, +with cross-references to further details: + + * The requirement for ';' to separate rules on a line (*note + Statements/Lines::) + + * User-defined functions and the 'return' statement (*note + User-defined::) + + * The 'delete' statement (*note Delete::) + + * The 'do'-'while' statement (*note Do Statement::) + + * The built-in functions 'atan2()', 'cos()', 'sin()', 'rand()', and + 'srand()' (*note Numeric Functions::) + + * The built-in functions 'gsub()', 'sub()', and 'match()' (*note + String Functions::) + + * The built-in functions 'close()' and 'system()' (*note I/O + Functions::) + + * The 'ARGC', 'ARGV', 'FNR', 'RLENGTH', 'RSTART', and 'SUBSEP' + predefined variables (*note Built-in Variables::) + + * Assignable '$0' (*note Changing Fields::) + + * The conditional expression using the ternary operator '?:' (*note + Conditional Exp::) + + * The expression 'INDX in ARRAY' outside of 'for' statements (*note + Reference to Elements::) + + * The exponentiation operator '^' (*note Arithmetic Ops::) and its + assignment operator form '^=' (*note Assignment Ops::) + + * C-compatible operator precedence, which breaks some old 'awk' + programs (*note Precedence::) + + * Regexps as the value of 'FS' (*note Field Separators::) and as the + third argument to the 'split()' function (*note String + Functions::), rather than using only the first character of 'FS' + + * Dynamic regexps as operands of the '~' and '!~' operators (*note + Computed Regexps::) + + * The escape sequences '\b', '\f', and '\r' (*note Escape + Sequences::) + + * Redirection of input for the 'getline' function (*note Getline::) + + * Multiple 'BEGIN' and 'END' rules (*note BEGIN/END::) + + * Multidimensional arrays (*note Multidimensional::) + + +File: gawk.info, Node: SVR4, Next: POSIX, Prev: V7/SVR3.1, Up: Language History + +A.2 Changes Between SVR3.1 and SVR4 +=================================== + +The System V Release 4 (1989) version of Unix 'awk' added these features +(some of which originated in 'gawk'): + + * The 'ENVIRON' array (*note Built-in Variables::) + + * Multiple '-f' options on the command line (*note Options::) + + * The '-v' option for assigning variables before program execution + begins (*note Options::) + + * The '--' signal for terminating command-line options + + * The '\a', '\v', and '\x' escape sequences (*note Escape + Sequences::) + + * A defined return value for the 'srand()' built-in function (*note + Numeric Functions::) + + * The 'toupper()' and 'tolower()' built-in string functions for case + translation (*note String Functions::) + + * A cleaner specification for the '%c' format-control letter in the + 'printf' function (*note Control Letters::) + + * The ability to dynamically pass the field width and precision + ('"%*.*d"') in the argument list of 'printf' and 'sprintf()' (*note + Control Letters::) + + * The use of regexp constants, such as '/foo/', as expressions, where + they are equivalent to using the matching operator, as in '$0 ~ + /foo/' (*note Using Constant Regexps::) + + * Processing of escape sequences inside command-line variable + assignments (*note Assignment Options::) + + +File: gawk.info, Node: POSIX, Next: BTL, Prev: SVR4, Up: Language History + +A.3 Changes Between SVR4 and POSIX 'awk' +======================================== + +The POSIX Command Language and Utilities standard for 'awk' (1992) +introduced the following changes into the language: + + * The use of '-W' for implementation-specific options (*note + Options::) + 
+ * The use of 'CONVFMT' for controlling the conversion of numbers to + strings (*note Conversion::) + + * The concept of a numeric string and tighter comparison rules to go + with it (*note Typing and Comparison::) + + * The use of predefined variables as function parameter names is + forbidden (*note Definition Syntax::) + + * More complete documentation of many of the previously undocumented + features of the language + + In 2012, a number of extensions that had been commonly available for +many years were finally added to POSIX. They are: + + * The 'fflush()' built-in function for flushing buffered output + (*note I/O Functions::) + + * The 'nextfile' statement (*note Nextfile Statement::) + + * The ability to delete all of an array at once with 'delete ARRAY' + (*note Delete::) + + *Note Common Extensions:: for a list of common extensions not +permitted by the POSIX standard. + + The 2008 POSIX standard can be found online at +<http://www.opengroup.org/onlinepubs/9699919799/>. + + +File: gawk.info, Node: BTL, Next: POSIX/GNU, Prev: POSIX, Up: Language History + +A.4 Extensions in Brian Kernighan's 'awk' +========================================= + +Brian Kernighan has made his version available via his home page (*note +Other Versions::). + + This minor node describes common extensions that originally appeared +in his version of 'awk': + + * The '**' and '**=' operators (*note Arithmetic Ops:: and *note + Assignment Ops::) + + * The use of 'func' as an abbreviation for 'function' (*note + Definition Syntax::) + + * The 'fflush()' built-in function for flushing buffered output + (*note I/O Functions::) + + *Note Common Extensions:: for a full list of the extensions available +in his 'awk'. + + +File: gawk.info, Node: POSIX/GNU, Next: Feature History, Prev: BTL, Up: Language History + +A.5 Extensions in 'gawk' Not in POSIX 'awk' +=========================================== + +The GNU implementation, 'gawk', adds a large number of features. They +can all be disabled with either the '--traditional' or '--posix' options +(*note Options::). + + A number of features have come and gone over the years. This minor +node summarizes the additional features over POSIX 'awk' that are in the +current version of 'gawk'. 
+ + * Additional predefined variables: + + - The 'ARGIND', 'BINMODE', 'ERRNO', 'FIELDWIDTHS', 'FPAT', + 'IGNORECASE', 'LINT', 'PROCINFO', 'RT', and 'TEXTDOMAIN' + variables (*note Built-in Variables::) + + * Special files in I/O redirections: + + - The '/dev/stdin', '/dev/stdout', '/dev/stderr', and + '/dev/fd/N' special file names (*note Special Files::) + + - The '/inet', '/inet4', and '/inet6' special files for TCP/IP + networking using '|&' to specify which version of the IP + protocol to use (*note TCP/IP Networking::) + + * Changes and/or additions to the language: + + - The '\x' escape sequence (*note Escape Sequences::) + + - Full support for both POSIX and GNU regexps (*note Regexp::) + + - The ability for 'FS' and for the third argument to 'split()' + to be null strings (*note Single Character Fields::) + + - The ability for 'RS' to be a regexp (*note Records::) + + - The ability to use octal and hexadecimal constants in 'awk' + program source code (*note Nondecimal-numbers::) + + - The '|&' operator for two-way I/O to a coprocess (*note + Two-way I/O::) + + - Indirect function calls (*note Indirect Calls::) + + - Directories on the command line produce a warning and are + skipped (*note Command-line directories::) + + - Output with 'print' and 'printf' need not be fatal (*note + Nonfatal::) + + * New keywords: + + - The 'BEGINFILE' and 'ENDFILE' special patterns (*note + BEGINFILE/ENDFILE::) + + - The 'switch' statement (*note Switch Statement::) + + * Changes to standard 'awk' functions: + + - The optional second argument to 'close()' that allows closing + one end of a two-way pipe to a coprocess (*note Two-way I/O::) + + - POSIX compliance for 'gsub()' and 'sub()' with '--posix' + + - The 'length()' function accepts an array argument and returns + the number of elements in the array (*note String Functions::) + + - The optional third argument to the 'match()' function for + capturing text-matching subexpressions within a regexp (*note + String Functions::) + + - Positional specifiers in 'printf' formats for making + translations easier (*note Printf Ordering::) + + - The 'split()' function's additional optional fourth argument, + which is an array to hold the text of the field separators + (*note String Functions::) + + * Additional functions only in 'gawk': + + - The 'gensub()', 'patsplit()', and 'strtonum()' functions for + more powerful text manipulation (*note String Functions::) + + - The 'asort()' and 'asorti()' functions for sorting arrays + (*note Array Sorting::) + + - The 'mktime()', 'systime()', and 'strftime()' functions for + working with timestamps (*note Time Functions::) + + - The 'and()', 'compl()', 'lshift()', 'or()', 'rshift()', and + 'xor()' functions for bit manipulation (*note Bitwise + Functions::) + + - The 'isarray()' function to check if a variable is an array or + not (*note Type Functions::) + + - The 'bindtextdomain()', 'dcgettext()', and 'dcngettext()' + functions for internationalization (*note Programmer i18n::) + + - The 'intdiv()' function for doing integer division and + remainder (*note Numeric Functions::) + + * Changes and/or additions in the command-line options: + + - The 'AWKPATH' environment variable for specifying a path + search for the '-f' command-line option (*note Options::) + + - The 'AWKLIBPATH' environment variable for specifying a path + search for the '-l' command-line option (*note Options::) + + - The '-b', '-c', '-C', '-d', '-D', '-e', '-E', '-g', '-h', + '-i', '-l', '-L', '-M', '-n', '-N', '-o', '-O', '-p', '-P', + '-r', 
'-s', '-S', '-t', and '-V' short options. Also, the + ability to use GNU-style long-named options that start with + '--', and the '--assign', '--bignum', '--characters-as-bytes', + '--copyright', '--debug', '--dump-variables', '--exec', + '--field-separator', '--file', '--gen-pot', '--help', + '--include', '--lint', '--lint-old', '--load', + '--non-decimal-data', '--optimize', '--no-optimize', + '--posix', '--pretty-print', '--profile', '--re-interval', + '--sandbox', '--source', '--traditional', '--use-lc-numeric', + and '--version' long options (*note Options::). + + * Support for the following obsolete systems was removed from the + code and the documentation for 'gawk' version 4.0: + + - Amiga + + - Atari + + - BeOS + + - Cray + + - MIPS RiscOS + + - MS-DOS with the Microsoft Compiler + + - MS-Windows with the Microsoft Compiler + + - NeXT + + - SunOS 3.x, Sun 386 (Road Runner) + + - Tandem (non-POSIX) + + - Prestandard VAX C compiler for VAX/VMS + + - GCC for VAX and Alpha has not been tested for a while. + + * Support for the following obsolete system was removed from the code + for 'gawk' version 4.1: + + - Ultrix + + * Support for the following systems was removed from the code for + 'gawk' version 4.2: + + - MirBSD + + +File: gawk.info, Node: Feature History, Next: Common Extensions, Prev: POSIX/GNU, Up: Language History + +A.6 History of 'gawk' Features +============================== + +This minor node describes the features in 'gawk' over and above those in +POSIX 'awk', in the order they were added to 'gawk'. + + Version 2.10 of 'gawk' introduced the following features: + + * The 'AWKPATH' environment variable for specifying a path search for + the '-f' command-line option (*note Options::). + + * The 'IGNORECASE' variable and its effects (*note + Case-sensitivity::). + + * The '/dev/stdin', '/dev/stdout', '/dev/stderr' and '/dev/fd/N' + special file names (*note Special Files::). + + Version 2.13 of 'gawk' introduced the following features: + + * The 'FIELDWIDTHS' variable and its effects (*note Constant Size::). + + * The 'systime()' and 'strftime()' built-in functions for obtaining + and printing timestamps (*note Time Functions::). + + * Additional command-line options (*note Options::): + + - The '-W lint' option to provide error and portability checking + for both the source code and at runtime. + + - The '-W compat' option to turn off the GNU extensions. + + - The '-W posix' option for full POSIX compliance. + + Version 2.14 of 'gawk' introduced the following feature: + + * The 'next file' statement for skipping to the next data file (*note + Nextfile Statement::). + + Version 2.15 of 'gawk' introduced the following features: + + * New variables (*note Built-in Variables::): + + - 'ARGIND', which tracks the movement of 'FILENAME' through + 'ARGV'. + + - 'ERRNO', which contains the system error message when + 'getline' returns -1 or 'close()' fails. + + * The '/dev/pid', '/dev/ppid', '/dev/pgrpid', and '/dev/user' special + file names. These have since been removed. + + * The ability to delete all of an array at once with 'delete ARRAY' + (*note Delete::). + + * Command-line option changes (*note Options::): + + - The ability to use GNU-style long-named options that start + with '--'. + + - The '--source' option for mixing command-line and library-file + source code. 
+ + Version 3.0 of 'gawk' introduced the following features: + + * New or changed variables: + + - 'IGNORECASE' changed, now applying to string comparison as + well as regexp operations (*note Case-sensitivity::). + + - 'RT', which contains the input text that matched 'RS' (*note + Records::). + + * Full support for both POSIX and GNU regexps (*note Regexp::). + + * The 'gensub()' function for more powerful text manipulation (*note + String Functions::). + + * The 'strftime()' function acquired a default time format, allowing + it to be called with no arguments (*note Time Functions::). + + * The ability for 'FS' and for the third argument to 'split()' to be + null strings (*note Single Character Fields::). + + * The ability for 'RS' to be a regexp (*note Records::). + + * The 'next file' statement became 'nextfile' (*note Nextfile + Statement::). + + * The 'fflush()' function from BWK 'awk' (then at Bell Laboratories; + *note I/O Functions::). + + * New command-line options: + + - The '--lint-old' option to warn about constructs that are not + available in the original Version 7 Unix version of 'awk' + (*note V7/SVR3.1::). + + - The '-m' option from BWK 'awk'. (Brian was still at Bell + Laboratories at the time.) This was later removed from both + his 'awk' and from 'gawk'. + + - The '--re-interval' option to provide interval expressions in + regexps (*note Regexp Operators::). + + - The '--traditional' option was added as a better name for + '--compat' (*note Options::). + + * The use of GNU Autoconf to control the configuration process (*note + Quick Installation::). + + * Amiga support. This has since been removed. + + Version 3.1 of 'gawk' introduced the following features: + + * New variables (*note Built-in Variables::): + + - 'BINMODE', for non-POSIX systems, which allows binary I/O for + input and/or output files (*note PC Using::). + + - 'LINT', which dynamically controls lint warnings. + + - 'PROCINFO', an array for providing process-related + information. + + - 'TEXTDOMAIN', for setting an application's + internationalization text domain (*note + Internationalization::). + + * The ability to use octal and hexadecimal constants in 'awk' program + source code (*note Nondecimal-numbers::). + + * The '|&' operator for two-way I/O to a coprocess (*note Two-way + I/O::). + + * The '/inet' special files for TCP/IP networking using '|&' (*note + TCP/IP Networking::). + + * The optional second argument to 'close()' that allows closing one + end of a two-way pipe to a coprocess (*note Two-way I/O::). + + * The optional third argument to the 'match()' function for capturing + text-matching subexpressions within a regexp (*note String + Functions::). + + * Positional specifiers in 'printf' formats for making translations + easier (*note Printf Ordering::). + + * A number of new built-in functions: + + - The 'asort()' and 'asorti()' functions for sorting arrays + (*note Array Sorting::). + + - The 'bindtextdomain()', 'dcgettext()' and 'dcngettext()' + functions for internationalization (*note Programmer i18n::). + + - The 'extension()' function and the ability to add new built-in + functions dynamically (*note Dynamic Extensions::). + + - The 'mktime()' function for creating timestamps (*note Time + Functions::). + + - The 'and()', 'or()', 'xor()', 'compl()', 'lshift()', + 'rshift()', and 'strtonum()' functions (*note Bitwise + Functions::). + + * The support for 'next file' as two words was removed completely + (*note Nextfile Statement::). 
+ + * Additional command-line options (*note Options::): + + - The '--dump-variables' option to print a list of all global + variables. + + - The '--exec' option, for use in CGI scripts. + + - The '--gen-po' command-line option and the use of a leading + underscore to mark strings that should be translated (*note + String Extraction::). + + - The '--non-decimal-data' option to allow non-decimal input + data (*note Nondecimal Data::). + + - The '--profile' option and 'pgawk', the profiling version of + 'gawk', for producing execution profiles of 'awk' programs + (*note Profiling::). + + - The '--use-lc-numeric' option to force 'gawk' to use the + locale's decimal point for parsing input data (*note + Conversion::). + + * The use of GNU Automake to help in standardizing the configuration + process (*note Quick Installation::). + + * The use of GNU 'gettext' for 'gawk''s own message output (*note + Gawk I18N::). + + * BeOS support. This was later removed. + + * Tandem support. This was later removed. + + * The Atari port became officially unsupported and was later removed + entirely. + + * The source code changed to use ISO C standard-style function + definitions. + + * POSIX compliance for 'sub()' and 'gsub()' (*note Gory Details::). + + * The 'length()' function was extended to accept an array argument + and return the number of elements in the array (*note String + Functions::). + + * The 'strftime()' function acquired a third argument to enable + printing times as UTC (*note Time Functions::). + + Version 4.0 of 'gawk' introduced the following features: + + * Variable additions: + + - 'FPAT', which allows you to specify a regexp that matches the + fields, instead of matching the field separator (*note + Splitting By Content::). + + - If 'PROCINFO["sorted_in"]' exists, 'for(iggy in foo)' loops + sort the indices before looping over them. The value of this + element provides control over how the indices are sorted + before the loop traversal starts (*note Controlling + Scanning::). + + - 'PROCINFO["strftime"]', which holds the default format for + 'strftime()' (*note Time Functions::). + + * The special files '/dev/pid', '/dev/ppid', '/dev/pgrpid' and + '/dev/user' were removed. + + * Support for IPv6 was added via the '/inet6' special file. '/inet4' + forces IPv4 and '/inet' chooses the system default, which is + probably IPv4 (*note TCP/IP Networking::). + + * The use of '\s' and '\S' escape sequences in regular expressions + (*note GNU Regexp Operators::). + + * Interval expressions became part of default regular expressions + (*note Regexp Operators::). + + * POSIX character classes work even with '--traditional' (*note + Regexp Operators::). + + * 'break' and 'continue' became invalid outside a loop, even with + '--traditional' (*note Break Statement::, and also see *note + Continue Statement::). + + * 'fflush()', 'nextfile', and 'delete ARRAY' are allowed if '--posix' + or '--traditional', since they are all now part of POSIX. + + * An optional third argument to 'asort()' and 'asorti()', specifying + how to sort (*note String Functions::). + + * The behavior of 'fflush()' changed to match BWK 'awk' and for + POSIX; now both 'fflush()' and 'fflush("")' flush all open output + redirections (*note I/O Functions::). + + * The 'isarray()' function which distinguishes if an item is an array + or not, to make it possible to traverse arrays of arrays (*note + Type Functions::). 
+ + * The 'patsplit()' function which gives the same capability as + 'FPAT', for splitting (*note String Functions::). + + * An optional fourth argument to the 'split()' function, which is an + array to hold the values of the separators (*note String + Functions::). + + * Arrays of arrays (*note Arrays of Arrays::). + + * The 'BEGINFILE' and 'ENDFILE' special patterns (*note + BEGINFILE/ENDFILE::). + + * Indirect function calls (*note Indirect Calls::). + + * 'switch' / 'case' are enabled by default (*note Switch + Statement::). + + * Command-line option changes (*note Options::): + + - The '-b' and '--characters-as-bytes' options which prevent + 'gawk' from treating input as a multibyte string. + + - The redundant '--compat', '--copyleft', and '--usage' long + options were removed. + + - The '--gen-po' option was finally renamed to the correct + '--gen-pot'. + + - The '--sandbox' option which disables certain features. + + - All long options acquired corresponding short options, for use + in '#!' scripts. + + * Directories named on the command line now produce a warning, not a + fatal error, unless '--posix' or '--traditional' are used (*note + Command-line directories::). + + * The 'gawk' internals were rewritten, bringing the 'dgawk' debugger + and possibly improved performance (*note Debugger::). + + * Per the GNU Coding Standards, dynamic extensions must now define a + global symbol indicating that they are GPL-compatible (*note Plugin + License::). + + * In POSIX mode, string comparisons use 'strcoll()' / 'wcscoll()' + (*note POSIX String Comparison::). + + * The option for raw sockets was removed, since it was never + implemented (*note TCP/IP Networking::). + + * Ranges of the form '[d-h]' are treated as if they were in the C + locale, no matter what kind of regexp is being used, and even if + '--posix' (*note Ranges and Locales::). + + * Support was removed for the following systems: + + - Atari + + - Amiga + + - BeOS + + - Cray + + - MIPS RiscOS + + - MS-DOS with Microsoft Compiler + + - MS-Windows with Microsoft Compiler + + - NeXT + + - SunOS 3.x, Sun 386 (Road Runner) + + - Tandem (non-POSIX) + + - Prestandard VAX C compiler for VAX/VMS + + Version 4.1 of 'gawk' introduced the following features: + + * Three new arrays: 'SYMTAB', 'FUNCTAB', and + 'PROCINFO["identifiers"]' (*note Auto-set::). + + * The three executables 'gawk', 'pgawk', and 'dgawk', were merged + into one, named just 'gawk'. As a result the command-line options + changed. + + * Command-line option changes (*note Options::): + + - The '-D' option invokes the debugger. + + - The '-i' and '--include' options load 'awk' library files. + + - The '-l' and '--load' options load compiled dynamic + extensions. + + - The '-M' and '--bignum' options enable MPFR. + + - The '-o' option only does pretty-printing. + + - The '-p' option is used for profiling. + + - The '-R' option was removed. + + * Support for high precision arithmetic with MPFR (*note Arbitrary + Precision Arithmetic::). + + * The 'and()', 'or()' and 'xor()' functions changed to allow any + number of arguments, with a minimum of two (*note Bitwise + Functions::). + + * The dynamic extension interface was completely redone (*note + Dynamic Extensions::). + + * Redirected 'getline' became allowed inside 'BEGINFILE' and + 'ENDFILE' (*note BEGINFILE/ENDFILE::). + + * The 'where' command was added to the debugger (*note Execution + Stack::). + + * Support for Ultrix was removed. 
+ + Version 4.2 introduced the following changes: + + * Changes to 'ENVIRON' are reflected into 'gawk''s environment and + that of programs that it runs. *Note Auto-set::. + + * The '--pretty-print' option no longer runs the 'awk' program too. + *Note Options::. + + * The 'igawk' program and its manual page are no longer installed + when 'gawk' is built. *Note Igawk Program::. + + * The 'intdiv()' function. *Note Numeric Functions::. + + * The maximum number of hexadecimal digits in '\x' escapes is now + two. *Note Escape Sequences::. + + * Nonfatal output with 'print' and 'printf'. *Note Nonfatal::. + + * For many years, POSIX specified that default field splitting only + allowed spaces and tabs to separate fields, and this was how 'gawk' + behaved with '--posix'. As of 2013, the standard restored + historical behavior, and now default field splitting with '--posix' + also allows newlines to separate fields. + + * Support for MirBSD was removed. + + * Support for GNU/Linux on Alpha was removed. + + +File: gawk.info, Node: Common Extensions, Next: Ranges and Locales, Prev: Feature History, Up: Language History + +A.7 Common Extensions Summary +============================= + +The following table summarizes the common extensions supported by +'gawk', Brian Kernighan's 'awk', and 'mawk', the three most widely used +freely available versions of 'awk' (*note Other Versions::). + +Feature BWK 'awk' 'mawk' 'gawk' Now standard +-------------------------------------------------------------------------- +'\x' escape sequence X X X +'FS' as null string X X X +'/dev/stdin' special file X X X +'/dev/stdout' special file X X X +'/dev/stderr' special file X X X +'delete' without subscript X X X X +'fflush()' function X X X X +'length()' of an array X X X +'nextfile' statement X X X X +'**' and '**=' operators X X +'func' keyword X X +'BINMODE' variable X X +'RS' as regexp X X +Time-related functions X X + + +File: gawk.info, Node: Ranges and Locales, Next: Contributors, Prev: Common Extensions, Up: Language History + +A.8 Regexp Ranges and Locales: A Long Sad Story +=============================================== + +This minor node describes the confusing history of ranges within regular +expressions and their interactions with locales, and how this affected +different versions of 'gawk'. + + The original Unix tools that worked with regular expressions defined +character ranges (such as '[a-z]') to match any character between the +first character in the range and the last character in the range, +inclusive. Ordering was based on the numeric value of each character in +the machine's native character set. Thus, on ASCII-based systems, +'[a-z]' matched all the lowercase letters, and only the lowercase +letters, as the numeric values for the letters from 'a' through 'z' were +contiguous. (On an EBCDIC system, the range '[a-z]' includes additional +nonalphabetic characters as well.) + + Almost all introductory Unix literature explained range expressions +as working in this fashion, and in particular, would teach that the +"correct" way to match lowercase letters was with '[a-z]', and that +'[A-Z]' was the "correct" way to match uppercase letters. And indeed, +this was true.(1) + + The 1992 POSIX standard introduced the idea of locales (*note +Locales::). 
Because many locales include other letters besides the +plain 26 letters of the English alphabet, the POSIX standard added +character classes (*note Bracket Expressions::) as a way to match +different kinds of characters besides the traditional ones in the ASCII +character set. + + However, the standard _changed_ the interpretation of range +expressions. In the '"C"' and '"POSIX"' locales, a range expression +like '[a-dx-z]' is still equivalent to '[abcdxyz]', as in ASCII. But +outside those locales, the ordering was defined to be based on +"collation order". + + What does that mean? In many locales, 'A' and 'a' are both less than +'B'. In other words, these locales sort characters in dictionary order, +and '[a-dx-z]' is typically not equivalent to '[abcdxyz]'; instead, it +might be equivalent to '[ABCXYabcdxyz]', for example. + + This point needs to be emphasized: much literature teaches that you +should use '[a-z]' to match a lowercase character. But on systems with +non-ASCII locales, this also matches all of the uppercase characters +except 'A' or 'Z'! This was a continuous cause of confusion, even well +into the twenty-first century. + + To demonstrate these issues, the following example uses the 'sub()' +function, which does text replacement (*note String Functions::). Here, +the intent is to remove trailing uppercase characters: + + $ echo something1234abc | gawk-3.1.8 '{ sub("[A-Z]*$", ""); print }' + -| something1234a + +This output is unexpected, as the 'bc' at the end of 'something1234abc' +should not normally match '[A-Z]*'. This result is due to the locale +setting (and thus you may not see it on your system). + + Similar considerations apply to other ranges. For example, '["-/]' +is perfectly valid in ASCII, but is not valid in many Unicode locales, +such as 'en_US.UTF-8'. + + Early versions of 'gawk' used regexp matching code that was not +locale-aware, so ranges had their traditional interpretation. + + When 'gawk' switched to using locale-aware regexp matchers, the +problems began; especially as both GNU/Linux and commercial Unix vendors +started implementing non-ASCII locales, _and making them the default_. +Perhaps the most frequently asked question became something like, "Why +does '[A-Z]' match lowercase letters?!?" + + This situation existed for close to 10 years, if not more, and the +'gawk' maintainer grew weary of trying to explain that 'gawk' was being +nicely standards-compliant, and that the issue was in the user's locale. +During the development of version 4.0, he modified 'gawk' to always +treat ranges in the original, pre-POSIX fashion, unless '--posix' was +used (*note Options::).(2) + + Fortunately, shortly before the final release of 'gawk' 4.0, the +maintainer learned that the 2008 standard had changed the definition of +ranges, such that outside the '"C"' and '"POSIX"' locales, the meaning +of range expressions was _undefined_.(3) + + By using this lovely technical term, the standard gives license to +implementers to implement ranges in whatever way they choose. The +'gawk' maintainer chose to apply the pre-POSIX meaning both with the +default regexp matching and when '--traditional' or '--posix' are used. +In all cases 'gawk' remains POSIX-compliant. + + ---------- Footnotes ---------- + + (1) And Life was good. + + (2) And thus was born the Campaign for Rational Range Interpretation +(or RRI). A number of GNU tools have already implemented this change, or +will soon. Thanks to Karl Berry for coining the phrase "Rational Range +Interpretation." 
+ + (3) See the standard +(http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05) +and its rationale +(http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_05). + + +File: gawk.info, Node: Contributors, Next: History summary, Prev: Ranges and Locales, Up: Language History + +A.9 Major Contributors to 'gawk' +================================ + + Always give credit where credit is due. + -- _Anonymous_ + + This minor node names the major contributors to 'gawk' and/or this +Info file, in approximate chronological order: + + * Dr. Alfred V. Aho, Dr. Peter J. Weinberger, and Dr. Brian W. + Kernighan, all of Bell Laboratories, designed and implemented Unix + 'awk', from which 'gawk' gets the majority of its feature set. + + * Paul Rubin did the initial design and implementation in 1986, and + wrote the first draft (around 40 pages) of this Info file. + + * Jay Fenlason finished the initial implementation. + + * Diane Close revised the first draft of this Info file, bringing it + to around 90 pages. + + * Richard Stallman helped finish the implementation and the initial + draft of this Info file. He is also the founder of the FSF and the + GNU Project. + + * John Woods contributed parts of the code (mostly fixes) in the + initial version of 'gawk'. + + * In 1988, David Trueman took over primary maintenance of 'gawk', + making it compatible with "new" 'awk', and greatly improving its + performance. + + * Conrad Kwok, Scott Garfinkle, and Kent Williams did the initial + ports to MS-DOS with various versions of MSC. + + * Pat Rankin provided the VMS port and its documentation. + + * Hal Peterson provided help in porting 'gawk' to Cray systems. + (This is no longer supported.) + + * Kai Uwe Rommel provided the initial port to OS/2 and its + documentation. + + * Michal Jaegermann provided the port to Atari systems and its + documentation. (This port is no longer supported.) He continues + to provide portability checking, and has done a lot of work to make + sure 'gawk' works on non-32-bit systems. + + * Fred Fish provided the port to Amiga systems and its documentation. + (With Fred's sad passing, this is no longer supported.) + + * Scott Deifik maintained the MS-DOS port using DJGPP. + + * Eli Zaretskii currently maintains the MS-Windows port using MinGW. + + * Juan Grigera provided a port to Windows32 systems. (This is no + longer supported.) + + * For many years, Dr. Darrel Hankerson acted as coordinator for the + various ports to different PC platforms and created binary + distributions for various PC operating systems. He was also + instrumental in keeping the documentation up to date for the + various PC platforms. + + * Christos Zoulas provided the 'extension()' built-in function for + dynamically adding new functions. (This was obsoleted at 'gawk' + 4.1.) + + * Ju"rgen Kahrs contributed the initial version of the TCP/IP + networking code and documentation, and motivated the inclusion of + the '|&' operator. + + * Stephen Davies provided the initial port to Tandem systems and its + documentation. (However, this is no longer supported.) He was + also instrumental in the initial work to integrate the byte-code + internals into the 'gawk' code base. + + * Matthew Woehlke provided improvements for Tandem's POSIX-compliant + systems. + + * Martin Brown provided the port to BeOS and its documentation. + (This is no longer supported.) + + * Arno Peters did the initial work to convert 'gawk' to use GNU + Automake and GNU 'gettext'. 
+ + * Alan J. Broder provided the initial version of the 'asort()' + function as well as the code for the optional third argument to the + 'match()' function. + + * Andreas Buening updated the 'gawk' port for OS/2. + + * Isamu Hasegawa, of IBM in Japan, contributed support for multibyte + characters. + + * Michael Benzinger contributed the initial code for 'switch' + statements. + + * Patrick T.J. McPhee contributed the code for dynamic loading in + Windows32 environments. (This is no longer supported.) + + * Anders Wallin helped keep the VMS port going for several years. + + * Assaf Gordon contributed the code to implement the '--sandbox' + option. + + * John Haque made the following contributions: + + - The modifications to convert 'gawk' into a byte-code + interpreter, including the debugger + + - The addition of true arrays of arrays + + - The additional modifications for support of + arbitrary-precision arithmetic + + - The initial text of *note Arbitrary Precision Arithmetic:: + + - The work to merge the three versions of 'gawk' into one, for + the 4.1 release + + - Improved array internals for arrays indexed by integers + + - The improved array sorting features were also driven by John, + together with Pat Rankin + + * Panos Papadopoulos contributed the original text for *note Include + Files::. + + * Efraim Yawitz contributed the original text for *note Debugger::. + + * The development of the extension API first released with 'gawk' 4.1 + was driven primarily by Arnold Robbins and Andrew Schorr, with + notable contributions from the rest of the development team. + + * John Malmberg contributed significant improvements to the OpenVMS + port and the related documentation. + + * Antonio Giovanni Colombo rewrote a number of examples in the early + chapters that were severely dated, for which I am incredibly + grateful. + + * Arnold Robbins has been working on 'gawk' since 1988, at first + helping David Trueman, and as the primary maintainer since around + 1994. + + +File: gawk.info, Node: History summary, Prev: Contributors, Up: Language History + +A.10 Summary +============ + + * The 'awk' language has evolved over time. The first release was + with V7 Unix, circa 1978. In 1987, for System V Release 3.1, major + additions, including user-defined functions, were made to the + language. Additional changes were made for System V Release 4, in + 1989. Since then, further minor changes have happened under the + auspices of the POSIX standard. + + * Brian Kernighan's 'awk' provides a small number of extensions that + are implemented in common with other versions of 'awk'. + + * 'gawk' provides a large number of extensions over POSIX 'awk'. + They can be disabled with either the '--traditional' or '--posix' + options. + + * The interaction of POSIX locales and regexp matching in 'gawk' has + been confusing over the years. Today, 'gawk' implements Rational + Range Interpretation, where ranges of the form '[a-z]' match _only_ + the characters numerically between 'a' through 'z' in the machine's + native character set. Usually this is ASCII, but it can be EBCDIC + on IBM S/390 systems. + + * Many people have contributed to 'gawk' development over the years. + We hope that the list provided in this major node is complete and + gives the appropriate credit where credit is due. 
+ + +File: gawk.info, Node: Installation, Next: Notes, Prev: Language History, Up: Top + +Appendix B Installing 'gawk' +**************************** + +This appendix provides instructions for installing 'gawk' on the various +platforms that are supported by the developers. The primary developer +supports GNU/Linux (and Unix), whereas the other ports are contributed. +*Note Bugs:: for the email addresses of the people who maintain the +respective ports. + +* Menu: + +* Gawk Distribution:: What is in the 'gawk' distribution. +* Unix Installation:: Installing 'gawk' under various + versions of Unix. +* Non-Unix Installation:: Installation on Other Operating Systems. +* Bugs:: Reporting Problems and Bugs. +* Other Versions:: Other freely available 'awk' + implementations. +* Installation summary:: Summary of installation. + + +File: gawk.info, Node: Gawk Distribution, Next: Unix Installation, Up: Installation + +B.1 The 'gawk' Distribution +=========================== + +This minor node describes how to get the 'gawk' distribution, how to +extract it, and then what is in the various files and subdirectories. + +* Menu: + +* Getting:: How to get the distribution. +* Extracting:: How to extract the distribution. +* Distribution contents:: What is in the distribution. + + +File: gawk.info, Node: Getting, Next: Extracting, Up: Gawk Distribution + +B.1.1 Getting the 'gawk' Distribution +------------------------------------- + +There are two ways to get GNU software: + + * Copy it from someone else who already has it. + + * Retrieve 'gawk' from the Internet host 'ftp.gnu.org', in the + directory '/gnu/gawk'. Both anonymous 'ftp' and 'http' access are + supported. If you have the 'wget' program, you can use a command + like the following: + + wget http://ftp.gnu.org/gnu/gawk/gawk-4.1.4.tar.gz + + The GNU software archive is mirrored around the world. The +up-to-date list of mirror sites is available from the main FSF website +(http://www.gnu.org/order/ftp.html). Try to use one of the mirrors; +they will be less busy, and you can usually find one closer to your +site. + + You may also retrieve the 'gawk' source code from the official Git +repository; for more information see *note Accessing The Source::. + + +File: gawk.info, Node: Extracting, Next: Distribution contents, Prev: Getting, Up: Gawk Distribution + +B.1.2 Extracting the Distribution +--------------------------------- + +'gawk' is distributed as several 'tar' files compressed with different +compression programs: 'gzip', 'bzip2', and 'xz'. For simplicity, the +rest of these instructions assume you are using the one compressed with +the GNU Gzip program ('gzip'). + + Once you have the distribution (e.g., 'gawk-4.1.4.tar.gz'), use +'gzip' to expand the file and then use 'tar' to extract it. You can use +the following pipeline to produce the 'gawk' distribution: + + gzip -d -c gawk-4.1.4.tar.gz | tar -xvpf - + + On a system with GNU 'tar', you can let 'tar' do the decompression +for you: + + tar -xvpzf gawk-4.1.4.tar.gz + +Extracting the archive creates a directory named 'gawk-4.1.4' in the +current directory. + + The distribution file name is of the form 'gawk-V.R.P.tar.gz'. The V +represents the major version of 'gawk', the R represents the current +release of version V, and the P represents a "patch level", meaning that +minor bugs have been fixed in the release. The current patch level is +4, but when retrieving distributions, you should get the version with +the highest version, release, and patch level. 
(Note, however, that +patch levels greater than or equal to 70 denote "beta" or nonproduction +software; you might not want to retrieve such a version unless you don't +mind experimenting.) If you are not on a Unix or GNU/Linux system, you +need to make other arrangements for getting and extracting the 'gawk' +distribution. You should consult a local expert. + + +File: gawk.info, Node: Distribution contents, Prev: Extracting, Up: Gawk Distribution + +B.1.3 Contents of the 'gawk' Distribution +----------------------------------------- + +The 'gawk' distribution has a number of C source files, documentation +files, subdirectories, and files related to the configuration process +(*note Unix Installation::), as well as several subdirectories related +to different non-Unix operating systems: + +Various '.c', '.y', and '.h' files + These files contain the actual 'gawk' source code. + +'ABOUT-NLS' + A file containing information about GNU 'gettext' and translations. + +'AUTHORS' + A file with some information about the authorship of 'gawk'. It + exists only to satisfy the pedants at the Free Software Foundation. + +'README' +'README_d/README.*' + Descriptive files: 'README' for 'gawk' under Unix and the rest for + the various hardware and software combinations. + +'INSTALL' + A file providing an overview of the configuration and installation + process. + +'ChangeLog' + A detailed list of source code changes as bugs are fixed or + improvements made. + +'ChangeLog.0' + An older list of source code changes. + +'NEWS' + A list of changes to 'gawk' since the last release or patch. + +'NEWS.0' + An older list of changes to 'gawk'. + +'COPYING' + The GNU General Public License. + +'POSIX.STD' + A description of behaviors in the POSIX standard for 'awk' that are + left undefined, or where 'gawk' may not comply fully, as well as a + list of things that the POSIX standard should describe but does + not. + +'doc/awkforai.txt' + Pointers to the original draft of a short article describing why + 'gawk' is a good language for artificial intelligence (AI) + programming. + +'doc/bc_notes' + A brief description of 'gawk''s "byte code" internals. + +'doc/README.card' +'doc/ad.block' +'doc/awkcard.in' +'doc/cardfonts' +'doc/colors' +'doc/macros' +'doc/no.colors' +'doc/setter.outline' + The 'troff' source for a five-color 'awk' reference card. A modern + version of 'troff' such as GNU 'troff' ('groff') is needed to + produce the color version. See the file 'README.card' for + instructions if you have an older 'troff'. + +'doc/gawk.1' + The 'troff' source for a manual page describing 'gawk'. This is + distributed for the convenience of Unix users. + +'doc/gawktexi.in' +'doc/sidebar.awk' + The Texinfo source file for this Info file. It should be processed + by 'doc/sidebar.awk' before processing with 'texi2dvi' or + 'texi2pdf' to produce a printed document, and with 'makeinfo' to + produce an Info or HTML file. The 'Makefile' takes care of this + processing and produces printable output via 'texi2dvi' or + 'texi2pdf'. + +'doc/gawk.texi' + The file produced after processing 'gawktexi.in' with + 'sidebar.awk'. + +'doc/gawk.info' + The generated Info file for this Info file. + +'doc/gawkinet.texi' + The Texinfo source file for *note (General Introduction, gawkinet, + TCP/IP Internetworking with 'gawk')Top::. It should be processed + with TeX (via 'texi2dvi' or 'texi2pdf') to produce a printed + document and with 'makeinfo' to produce an Info or HTML file. 
+ +'doc/gawkinet.info' + The generated Info file for 'TCP/IP Internetworking with 'gawk''. + +'doc/igawk.1' + The 'troff' source for a manual page describing the 'igawk' program + presented in *note Igawk Program::. (Since 'gawk' can do its own + '@include' processing, neither 'igawk' nor 'igawk.1' are + installed.) + +'doc/Makefile.in' + The input file used during the configuration process to generate + the actual 'Makefile' for creating the documentation. + +'Makefile.am' +'*/Makefile.am' + Files used by the GNU Automake software for generating the + 'Makefile.in' files used by Autoconf and 'configure'. + +'Makefile.in' +'aclocal.m4' +'bisonfix.awk' +'config.guess' +'configh.in' +'configure.ac' +'configure' +'custom.h' +'depcomp' +'install-sh' +'missing_d/*' +'mkinstalldirs' +'m4/*' + These files and subdirectories are used when configuring and + compiling 'gawk' for various Unix systems. Most of them are + explained in *note Unix Installation::. The rest are there to + support the main infrastructure. + +'po/*' + The 'po' library contains message translations. + +'awklib/extract.awk' +'awklib/Makefile.am' +'awklib/Makefile.in' +'awklib/eg/*' + The 'awklib' directory contains a copy of 'extract.awk' (*note + Extract Program::), which can be used to extract the sample + programs from the Texinfo source file for this Info file. It also + contains a 'Makefile.in' file, which 'configure' uses to generate a + 'Makefile'. 'Makefile.am' is used by GNU Automake to create + 'Makefile.in'. The library functions from *note Library + Functions::, are included as ready-to-use files in the 'gawk' + distribution. They are installed as part of the installation + process. The rest of the programs in this Info file are available + in appropriate subdirectories of 'awklib/eg'. + +'extension/*' + The source code, manual pages, and infrastructure files for the + sample extensions included with 'gawk'. *Note Dynamic + Extensions::, for more information. + +'extras/*' + Additional non-essential files. Currently, this directory contains + some shell startup files to be installed in '/etc/profile.d' to aid + in manipulating the 'AWKPATH' and 'AWKLIBPATH' environment + variables. *Note Shell Startup Files::, for more information. + +'posix/*' + Files needed for building 'gawk' on POSIX-compliant systems. + +'pc/*' + Files needed for building 'gawk' under MS-Windows (*note PC + Installation:: for details). + +'vms/*' + Files needed for building 'gawk' under Vax/VMS and OpenVMS (*note + VMS Installation:: for details). + +'test/*' + A test suite for 'gawk'. You can use 'make check' from the + top-level 'gawk' directory to run your version of 'gawk' against + the test suite. If 'gawk' successfully passes 'make check', then + you can be confident of a successful port. + + +File: gawk.info, Node: Unix Installation, Next: Non-Unix Installation, Prev: Gawk Distribution, Up: Installation + +B.2 Compiling and Installing 'gawk' on Unix-Like Systems +======================================================== + +Usually, you can compile and install 'gawk' by typing only two commands. +However, if you use an unusual system, you may need to configure 'gawk' +for your system yourself. + +* Menu: + +* Quick Installation:: Compiling 'gawk' under Unix. +* Shell Startup Files:: Shell convenience functions. +* Additional Configuration Options:: Other compile-time options. +* Configuration Philosophy:: How it's all supposed to work. 
+ + +File: gawk.info, Node: Quick Installation, Next: Shell Startup Files, Up: Unix Installation + +B.2.1 Compiling 'gawk' for Unix-Like Systems +-------------------------------------------- + +The normal installation steps should work on all modern commercial +Unix-derived systems, GNU/Linux, BSD-based systems, and the Cygwin +environment for MS-Windows. + + After you have extracted the 'gawk' distribution, 'cd' to +'gawk-4.1.4'. As with most GNU software, you configure 'gawk' for your +system by running the 'configure' program. This program is a Bourne +shell script that is generated automatically using GNU Autoconf. (The +Autoconf software is described fully starting with *note (Autoconf, +autoconf,Autoconf---Generating Automatic Configuration Scripts)Top::.) + + To configure 'gawk', simply run 'configure': + + sh ./configure + + This produces a 'Makefile' and 'config.h' tailored to your system. +The 'config.h' file describes various facts about your system. You +might want to edit the 'Makefile' to change the 'CFLAGS' variable, which +controls the command-line options that are passed to the C compiler +(such as optimization levels or compiling for debugging). + + Alternatively, you can add your own values for most 'make' variables +on the command line, such as 'CC' and 'CFLAGS', when running +'configure': + + CC=cc CFLAGS=-g sh ./configure + +See the file 'INSTALL' in the 'gawk' distribution for all the details. + + After you have run 'configure' and possibly edited the 'Makefile', +type: + + make + +Shortly thereafter, you should have an executable version of 'gawk'. +That's all there is to it! To verify that 'gawk' is working properly, +run 'make check'. All of the tests should succeed. If these steps do +not work, or if any of the tests fail, check the files in the 'README_d' +directory to see if you've found a known problem. If the failure is not +described there, send in a bug report (*note Bugs::). + + Of course, once you've built 'gawk', it is likely that you will wish +to install it. To do so, you need to run the command 'make install', as +a user with the appropriate permissions. How to do this varies by +system, but on many systems you can use the 'sudo' command to do so. +The command then becomes 'sudo make install'. It is likely that you +will be asked for your password, and you will have to have been set up +previously as a user who is allowed to run the 'sudo' command. + + +File: gawk.info, Node: Shell Startup Files, Next: Additional Configuration Options, Prev: Quick Installation, Up: Unix Installation + +B.2.2 Shell Startup Files +------------------------- + +The distribution contains shell startup files 'gawk.sh' and 'gawk.csh' +containing functions to aid in manipulating the 'AWKPATH' and +'AWKLIBPATH' environment variables. On a Fedora system, these files +should be installed in '/etc/profile.d'; on other platforms, the +appropriate location may be different. + +'gawkpath_default' + Reset the 'AWKPATH' environment variable to its default value. + +'gawkpath_prepend' + Add the argument to the front of the 'AWKPATH' environment + variable. + +'gawkpath_append' + Add the argument to the end of the 'AWKPATH' environment variable. + +'gawklibpath_default' + Reset the 'AWKLIBPATH' environment variable to its default value. + +'gawklibpath_prepend' + Add the argument to the front of the 'AWKLIBPATH' environment + variable. + +'gawklibpath_append' + Add the argument to the end of the 'AWKLIBPATH' environment + variable. 
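+
+   As an illustration, assuming that 'gawk.sh' has been installed in
+'/etc/profile.d' and that your own 'awk' library files live in a
+hypothetical directory '/home/jane/awklib', a Bourne-style shell
+session might look something like this:
+
+     . /etc/profile.d/gawk.sh             # load the convenience functions
+     gawkpath_prepend /home/jane/awklib   # search this directory first
+     echo "$AWKPATH"                      # inspect the result
+     gawkpath_default                     # restore the default value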
+ + +File: gawk.info, Node: Additional Configuration Options, Next: Configuration Philosophy, Prev: Shell Startup Files, Up: Unix Installation + +B.2.3 Additional Configuration Options +-------------------------------------- + +There are several additional options you may use on the 'configure' +command line when compiling 'gawk' from scratch, including: + +'--disable-extensions' + Disable configuring and building the sample extensions in the + 'extension' directory. This is useful for cross-compiling. The + default action is to dynamically check if the extensions can be + configured and compiled. + +'--disable-lint' + Disable all lint checking within 'gawk'. The '--lint' and + '--lint-old' options (*note Options::) are accepted, but silently + do nothing. Similarly, setting the 'LINT' variable (*note + User-modified::) has no effect on the running 'awk' program. + + When used with the GNU Compiler Collection's (GCC's) automatic + dead-code-elimination, this option cuts almost 23K bytes off the + size of the 'gawk' executable on GNU/Linux x86_64 systems. Results + on other systems and with other compilers are likely to vary. + Using this option may bring you some slight performance + improvement. + + CAUTION: Using this option will cause some of the tests in the + test suite to fail. This option may be removed at a later + date. + +'--disable-nls' + Disable all message-translation facilities. This is usually not + desirable, but it may bring you some slight performance + improvement. + +'--with-whiny-user-strftime' + Force use of the included version of the C 'strftime()' function + for deficient systems. + + Use the command './configure --help' to see the full list of options +supplied by 'configure'. + + +File: gawk.info, Node: Configuration Philosophy, Prev: Additional Configuration Options, Up: Unix Installation + +B.2.4 The Configuration Process +------------------------------- + +This minor node is of interest only if you know something about using +the C language and Unix-like operating systems. + + The source code for 'gawk' generally attempts to adhere to formal +standards wherever possible. This means that 'gawk' uses library +routines that are specified by the ISO C standard and by the POSIX +operating system interface standard. The 'gawk' source code requires +using an ISO C compiler (the 1990 standard). + + Many Unix systems do not support all of either the ISO or the POSIX +standards. The 'missing_d' subdirectory in the 'gawk' distribution +contains replacement versions of those functions that are most likely to +be missing. + + The 'config.h' file that 'configure' creates contains definitions +that describe features of the particular operating system where you are +attempting to compile 'gawk'. The three things described by this file +are: what header files are available, so that they can be correctly +included, what (supposedly) standard functions are actually available in +your C libraries, and various miscellaneous facts about your operating +system. For example, there may not be an 'st_blksize' element in the +'stat' structure. In this case, 'HAVE_STRUCT_STAT_ST_BLKSIZE' is +undefined. + + It is possible for your C compiler to lie to 'configure'. It may do +so by not exiting with an error when a library function is not +available. To get around this, edit the 'custom.h' file. 
Use an +'#ifdef' that is appropriate for your system, and either '#define' any +constants that 'configure' should have defined but didn't, or '#undef' +any constants that 'configure' defined and should not have. The +'custom.h' file is automatically included by the 'config.h' file. + + It is also possible that the 'configure' program generated by +Autoconf will not work on your system in some other fashion. If you do +have a problem, the 'configure.ac' file is the input for Autoconf. You +may be able to change this file and generate a new version of +'configure' that works on your system (*note Bugs:: for information on +how to report problems in configuring 'gawk'). The same mechanism may +be used to send in updates to 'configure.ac' and/or 'custom.h'. + + +File: gawk.info, Node: Non-Unix Installation, Next: Bugs, Prev: Unix Installation, Up: Installation + +B.3 Installation on Other Operating Systems +=========================================== + +This minor node describes how to install 'gawk' on various non-Unix +systems. + +* Menu: + +* PC Installation:: Installing and Compiling 'gawk' on + Microsoft Windows. +* VMS Installation:: Installing 'gawk' on VMS. + + +File: gawk.info, Node: PC Installation, Next: VMS Installation, Up: Non-Unix Installation + +B.3.1 Installation on MS-Windows +-------------------------------- + +This minor node covers installation and usage of 'gawk' on Intel +architecture machines running any version of MS-Windows. In this minor +node, the term "Windows32" refers to any of Microsoft Windows +95/98/ME/NT/2000/XP/Vista/7/8/10. + + See also the 'README_d/README.pc' file in the distribution. + +* Menu: + +* PC Binary Installation:: Installing a prepared distribution. +* PC Compiling:: Compiling 'gawk' for Windows32. +* PC Using:: Running 'gawk' on Windows32. +* Cygwin:: Building and running 'gawk' for + Cygwin. +* MSYS:: Using 'gawk' In The MSYS Environment. + + +File: gawk.info, Node: PC Binary Installation, Next: PC Compiling, Up: PC Installation + +B.3.1.1 Installing a Prepared Distribution for MS-Windows Systems +................................................................. + +The only supported binary distribution for MS-Windows systems is that +provided by Eli Zaretskii's "ezwinports" +(https://sourceforge.net/projects/ezwinports/) project. Install the +compiled 'gawk' from there. + + +File: gawk.info, Node: PC Compiling, Next: PC Using, Prev: PC Binary Installation, Up: PC Installation + +B.3.1.2 Compiling 'gawk' for PC Operating Systems +................................................. + +'gawk' can be compiled for Windows32 using MinGW (Windows32). The file +'README_d/README.pc' in the 'gawk' distribution contains additional +notes, and 'pc/Makefile' contains important information on compilation +options. + + To build 'gawk' for Windows32, copy the files in the 'pc' directory +(_except_ for 'ChangeLog') to the directory with the rest of the 'gawk' +sources, then invoke 'make' with the appropriate target name as an +argument to build 'gawk'. The 'Makefile' copied from the 'pc' directory +contains a configuration section with comments and may need to be edited +in order to work with your 'make' utility. + + The 'Makefile' supports a number of targets for building various +MS-DOS and Windows32 versions. A list of targets is printed if the +'make' command is given without a target. As an example, to build a +native MS-Windows binary of 'gawk' using the MinGW tools, type 'make +mingw32'. 
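+
+   As a rough sketch, assuming an MSYS or similar Unix-style shell in
+the top-level 'gawk' source directory, the whole sequence for the
+MinGW build described above might look something like this:
+
+     for f in pc/*
+     do
+         case "$f" in
+         pc/ChangeLog) ;;        # skip it; keep the top-level ChangeLog
+         *) cp "$f" . ;;         # copy the remaining pc files here
+         esac
+     done
+     make mingw32                # build a native MS-Windows binary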
+ + +File: gawk.info, Node: PC Using, Next: Cygwin, Prev: PC Compiling, Up: PC Installation + +B.3.1.3 Using 'gawk' on PC Operating Systems +............................................ + +Under MS-Windows, the Cygwin and MinGW environments support both the +'|&' operator and TCP/IP networking (*note TCP/IP Networking::). + + The MS-Windows version of 'gawk' searches for program files as +described in *note AWKPATH Variable::. However, semicolons (rather than +colons) separate elements in the 'AWKPATH' variable. If 'AWKPATH' is +not set or is empty, then the default search path is +'.;c:/lib/awk;c:/gnu/lib/awk'. + + Under MS-Windows, 'gawk' (and many other text programs) silently +translates end-of-line '\r\n' to '\n' on input and '\n' to '\r\n' on +output. A special 'BINMODE' variable (c.e.) allows control over these +translations and is interpreted as follows: + + * If 'BINMODE' is '"r"' or one, then binary mode is set on read + (i.e., no translations on reads). + + * If 'BINMODE' is '"w"' or two, then binary mode is set on write + (i.e., no translations on writes). + + * If 'BINMODE' is '"rw"' or '"wr"' or three, binary mode is set for + both read and write. + + * 'BINMODE=NON-NULL-STRING' is the same as 'BINMODE=3' (i.e., no + translations on reads or writes). However, 'gawk' issues a warning + message if the string is not one of '"rw"' or '"wr"'. + +The modes for standard input and standard output are set one time only +(after the command line is read, but before processing any of the 'awk' +program). Setting 'BINMODE' for standard input or standard output is +accomplished by using an appropriate '-v BINMODE=N' option on the +command line. 'BINMODE' is set at the time a file or pipe is opened and +cannot be changed midstream. + + The name 'BINMODE' was chosen to match 'mawk' (*note Other +Versions::). 'mawk' and 'gawk' handle 'BINMODE' similarly; however, +'mawk' adds a '-W BINMODE=N' option and an environment variable that can +set 'BINMODE', 'RS', and 'ORS'. The files 'binmode[1-3].awk' (under +'gnu/lib/awk' in some of the prepared binary distributions) have been +chosen to match 'mawk''s '-W BINMODE=N' option. These can be changed or +discarded; in particular, the setting of 'RS' giving the fewest +"surprises" is open to debate. 'mawk' uses 'RS = "\r\n"' if binary mode +is set on read, which is appropriate for files with the MS-DOS-style +end-of-line. + + To illustrate, the following examples set binary mode on writes for +standard output and other files, and set 'ORS' as the "usual" +MS-DOS-style end-of-line: + + gawk -v BINMODE=2 -v ORS="\r\n" ... + +or: + + gawk -v BINMODE=w -f binmode2.awk ... + +These give the same result as the '-W BINMODE=2' option in 'mawk'. The +following changes the record separator to '"\r\n"' and sets binary mode +on reads, but does not affect the mode on standard input: + + gawk -v RS="\r\n" -e "BEGIN { BINMODE = 1 }" ... + +or: + + gawk -f binmode1.awk ... + +With proper quoting, in the first example the setting of 'RS' can be +moved into the 'BEGIN' rule. + + +File: gawk.info, Node: Cygwin, Next: MSYS, Prev: PC Using, Up: PC Installation + +B.3.1.4 Using 'gawk' In The Cygwin Environment +.............................................. + +'gawk' can be built and used "out of the box" under MS-Windows if you +are using the Cygwin environment (http://www.cygwin.com). This +environment provides an excellent simulation of GNU/Linux, using Bash, +GCC, GNU Make, and other GNU programs. 
Compilation and installation for +Cygwin is the same as for a Unix system: + + tar -xvpzf gawk-4.1.4.tar.gz + cd gawk-4.1.4 + ./configure + make && make check + + When compared to GNU/Linux on the same system, the 'configure' step +on Cygwin takes considerably longer. However, it does finish, and then +the 'make' proceeds as usual. + + +File: gawk.info, Node: MSYS, Prev: Cygwin, Up: PC Installation + +B.3.1.5 Using 'gawk' In The MSYS Environment +............................................ + +In the MSYS environment under MS-Windows, 'gawk' automatically uses +binary mode for reading and writing files. Thus, there is no need to +use the 'BINMODE' variable. + + This can cause problems with other Unix-like components that have +been ported to MS-Windows that expect 'gawk' to do automatic translation +of '"\r\n"', because it won't. + + +File: gawk.info, Node: VMS Installation, Prev: PC Installation, Up: Non-Unix Installation + +B.3.2 Compiling and Installing 'gawk' on Vax/VMS and OpenVMS +------------------------------------------------------------ + +This node describes how to compile and install 'gawk' under VMS. The +older designation "VMS" is used throughout to refer to OpenVMS. + +* Menu: + +* VMS Compilation:: How to compile 'gawk' under VMS. +* VMS Dynamic Extensions:: Compiling 'gawk' dynamic extensions on + VMS. +* VMS Installation Details:: How to install 'gawk' under VMS. +* VMS Running:: How to run 'gawk' under VMS. +* VMS GNV:: The VMS GNV Project. +* VMS Old Gawk:: An old version comes with some VMS systems. + + +File: gawk.info, Node: VMS Compilation, Next: VMS Dynamic Extensions, Up: VMS Installation + +B.3.2.1 Compiling 'gawk' on VMS +............................... + +To compile 'gawk' under VMS, there is a 'DCL' command procedure that +issues all the necessary 'CC' and 'LINK' commands. There is also a +'Makefile' for use with the 'MMS' and 'MMK' utilities. From the source +directory, use either: + + $ @[.vms]vmsbuild.com + +or: + + $ MMS/DESCRIPTION=[.vms]descrip.mms gawk + +or: + + $ MMK/DESCRIPTION=[.vms]descrip.mms gawk + + 'MMK' is an open source, free, near-clone of 'MMS' and can better +handle ODS-5 volumes with upper- and lowercase file names. 'MMK' is +available from <https://github.com/endlesssoftware/mmk>. + + With ODS-5 volumes and extended parsing enabled, the case of the +target parameter may need to be exact. + + 'gawk' has been tested under VAX/VMS 7.3 and Alpha/VMS 7.3-1 using +Compaq C V6.4, and under Alpha/VMS 7.3, Alpha/VMS 7.3-2, and IA64/VMS +8.3. The most recent builds used HP C V7.3 on Alpha VMS 8.3 and both +Alpha and IA64 VMS 8.4 used HP C 7.3.(1) + + *Note VMS GNV:: for information on building 'gawk' as a PCSI kit that +is compatible with the GNV product. + + ---------- Footnotes ---------- + + (1) The IA64 architecture is also known as "Itanium." + + +File: gawk.info, Node: VMS Dynamic Extensions, Next: VMS Installation Details, Prev: VMS Compilation, Up: VMS Installation + +B.3.2.2 Compiling 'gawk' Dynamic Extensions on VMS +.................................................. + +The extensions that have been ported to VMS can be built using one of +the following commands: + + $ MMS/DESCRIPTION=[.vms]descrip.mms extensions + +or: + + $ MMK/DESCRIPTION=[.vms]descrip.mms extensions + + 'gawk' uses 'AWKLIBPATH' as either an environment variable or a +logical name to find the dynamic extensions. 
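+
+   For example, a 'DCL' command along the following lines (the device
+and directory names here are purely illustrative) would point 'gawk'
+at a directory of compiled extensions:
+
+     $ DEFINE AWKLIBPATH DISK1:[GNUBIN.AWKLIB]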
+ + Dynamic extensions need to be compiled with the same compiler options +for floating-point, pointer size, and symbol name handling as were used +to compile 'gawk' itself. Alpha and Itanium should use IEEE floating +point. The pointer size is 32 bits, and the symbol name handling should +be exact case with CRC shortening for symbols longer than 32 bits. + + For Alpha and Itanium: + + /name=(as_is,short) + /float=ieee/ieee_mode=denorm_results + + For VAX: + + /name=(as_is,short) + + Compile-time macros need to be defined before the first VMS-supplied +header file is included, as follows: + + #if (__CRTL_VER >= 70200000) && !defined (__VAX) + #define _LARGEFILE 1 + #endif + + #ifndef __VAX + #ifdef __CRTL_VER + #if __CRTL_VER >= 80200000 + #define _USE_STD_STAT 1 + #endif + #endif + #endif + + If you are writing your own extensions to run on VMS, you must supply +these definitions yourself. The 'config.h' file created when building +'gawk' on VMS does this for you; if instead you use that file or a +similar one, then you must remember to include it before any +VMS-supplied header files. + + +File: gawk.info, Node: VMS Installation Details, Next: VMS Running, Prev: VMS Dynamic Extensions, Up: VMS Installation + +B.3.2.3 Installing 'gawk' on VMS +................................ + +To use 'gawk', all you need is a "foreign" command, which is a 'DCL' +symbol whose value begins with a dollar sign. For example: + + $ GAWK :== $disk1:[gnubin]gawk + +Substitute the actual location of 'gawk.exe' for '$disk1:[gnubin]'. The +symbol should be placed in the 'login.com' of any user who wants to run +'gawk', so that it is defined every time the user logs on. +Alternatively, the symbol may be placed in the system-wide 'sylogin.com' +procedure, which allows all users to run 'gawk'. + + If your 'gawk' was installed by a PCSI kit into the 'GNV$GNU:' +directory tree, the program will be known as 'GNV$GNU:[bin]gnv$gawk.exe' +and the help file will be 'GNV$GNU:[vms_help]gawk.hlp'. + + The PCSI kit also installs a 'GNV$GNU:[vms_bin]gawk_verb.cld' file +that can be used to add 'gawk' and 'awk' as DCL commands. + + For just the current process you can use: + + $ set command gnv$gnu:[vms_bin]gawk_verb.cld + + Or the system manager can use 'GNV$GNU:[vms_bin]gawk_verb.cld' to add +the 'gawk' and 'awk' to the system-wide 'DCLTABLES'. + + The DCL syntax is documented in the 'gawk.hlp' file. + + Optionally, the 'gawk.hlp' entry can be loaded into a VMS help +library: + + $ LIBRARY/HELP sys$help:helplib [.vms]gawk.hlp + +(You may want to substitute a site-specific help library rather than the +standard VMS library 'HELPLIB'.) After loading the help text, the +command: + + $ HELP GAWK + +provides information about both the 'gawk' implementation and the 'awk' +programming language. + + The logical name 'AWK_LIBRARY' can designate a default location for +'awk' program files. For the '-f' option, if the specified file name +has no device or directory path information in it, 'gawk' looks in the +current directory first, then in the directory specified by the +translation of 'AWK_LIBRARY' if the file is not found. If, after +searching in both directories, the file still is not found, 'gawk' +appends the suffix '.awk' to the file name and retries the file search. +If 'AWK_LIBRARY' has no definition, a default value of 'SYS$LIBRARY:' is +used for it. + + +File: gawk.info, Node: VMS Running, Next: VMS GNV, Prev: VMS Installation Details, Up: VMS Installation + +B.3.2.4 Running 'gawk' on VMS +............................. 
+ +Command-line parsing and quoting conventions are significantly different +on VMS, so examples in this Info file or from other sources often need +minor changes. They _are_ minor though, and all 'awk' programs should +run correctly. + + Here are a couple of trivial tests: + + $ gawk -- "BEGIN {print ""Hello, World!""}" + $ gawk -"W" version + ! could also be -"W version" or "-W version" + +Note that uppercase and mixed-case text must be quoted. + + The VMS port of 'gawk' includes a 'DCL'-style interface in addition +to the original shell-style interface (see the help entry for details). +One side effect of dual command-line parsing is that if there is only a +single parameter (as in the quoted string program), the command becomes +ambiguous. To work around this, the normally optional '--' flag is +required to force Unix-style parsing rather than 'DCL' parsing. If any +other dash-type options (or multiple parameters such as data files to +process) are present, there is no ambiguity and '--' can be omitted. + + The 'exit' value is a Unix-style value and is encoded into a VMS exit +status value when the program exits. + + The VMS severity bits will be set based on the 'exit' value. A +failure is indicated by 1, and VMS sets the 'ERROR' status. A fatal +error is indicated by 2, and VMS sets the 'FATAL' status. All other +values will have the 'SUCCESS' status. The exit value is encoded to +comply with VMS coding standards and will have the 'C_FACILITY_NO' of +'0x350000' with the constant '0xA000' added to the number shifted over +by 3 bits to make room for the severity codes. + + To extract the actual 'gawk' exit code from the VMS status, use: + + unix_status = (vms_status .and. %x7f8) / 8 + +A C program that uses 'exec()' to call 'gawk' will get the original +Unix-style exit value. + + Older versions of 'gawk' for VMS treated a Unix exit code 0 as 1, a +failure as 2, a fatal error as 4, and passed all the other numbers +through. This violated the VMS exit status coding requirements. + + VAX/VMS floating point uses unbiased rounding. *Note Round +Function::. + + VMS reports time values in GMT unless one of the 'SYS$TIMEZONE_RULE' +or 'TZ' logical names is set. Older versions of VMS, such as VAX/VMS +7.3, do not set these logical names. + + The default search path, when looking for 'awk' program files +specified by the '-f' option, is '"SYS$DISK:[],AWK_LIBRARY:"'. The +logical name 'AWKPATH' can be used to override this default. The format +of 'AWKPATH' is a comma-separated list of directory specifications. +When defining it, the value should be quoted so that it retains a single +translation and not a multitranslation 'RMS' searchlist. + + This restriction also applies to running 'gawk' under GNV, as +redirection is always to a DCL command. + + If you are redirecting data to a VMS command or utility, the current +implementation requires that setting up a VMS foreign command that runs +a command file before invoking 'gawk'. (This restriction may be removed +in a future release of 'gawk' on VMS.) + + Without this command file, the input data will also appear prepended +to the output data. + + This also allows simulating POSIX commands that are not found on VMS +or the use of GNV utilities. + + The example below is for 'gawk' redirecting data to the VMS 'sort' +command. + + $ sort = "@device:[dir]vms_gawk_sort.com" + + The command file needs to be of the format in the example below. + + The first line inhibits the passed input data from also showing up in +the output. 
It must be in the format in the example. + + The next line creates a foreign command that overrides the outer +foreign command which prevents an infinite recursion of command files. + + The next to the last command redirects 'sys$input' to be +'sys$command', in order to pick up the data that is being redirected to +the command. + + The last line runs the actual command. It must be the last command +as the data redirected from 'gawk' will be read when the command file +ends. + + $!'f$verify(0,0)' + $ sort := sort + $ define/user sys$input sys$command: + $ sort sys$input: sys$output: + + +File: gawk.info, Node: VMS GNV, Next: VMS Old Gawk, Prev: VMS Running, Up: VMS Installation + +B.3.2.5 The VMS GNV Project +........................... + +The VMS GNV package provides a build environment similar to POSIX with +ports of a collection of open source tools. The 'gawk' found in the GNV +base kit is an older port. Currently, the GNV project is being +reorganized to supply individual PCSI packages for each component. See +<https://sourceforge.net/p/gnv/wiki/InstallingGNVPackages/>. + + The normal build procedure for 'gawk' produces a program that is +suitable for use with GNV. + + The file 'vms/gawk_build_steps.txt' in the distribution documents the +procedure for building a VMS PCSI kit that is compatible with GNV. + + +File: gawk.info, Node: VMS Old Gawk, Prev: VMS GNV, Up: VMS Installation + +B.3.2.6 Some VMS Systems Have An Old Version of 'gawk' +...................................................... + +Some versions of VMS have an old version of 'gawk'. To access it, +define a symbol, as follows: + + $ gawk :== $sys$common:[syshlp.examples.tcpip.snmp]gawk.exe + + This is apparently version 2.15.6, which is extremely old. We +recommend compiling and using the current version. + + +File: gawk.info, Node: Bugs, Next: Other Versions, Prev: Non-Unix Installation, Up: Installation + +B.4 Reporting Problems and Bugs +=============================== + + There is nothing more dangerous than a bored archaeologist. + -- _Douglas Adams, 'The Hitchhiker's Guide to the Galaxy'_ + + If you have problems with 'gawk' or think that you have found a bug, +report it to the developers; we cannot promise to do anything, but we +might well want to fix it. + +* Menu: + +* Bug address:: Where to send reports to. +* Usenet:: Where not to send reports to. +* Maintainers:: Maintainers of non-*nix ports. + + +File: gawk.info, Node: Bug address, Next: Usenet, Up: Bugs + +B.4.1 Submitting Bug Reports +---------------------------- + +Before reporting a bug, make sure you have really found a genuine bug. +Carefully reread the documentation and see if it says you can do what +you're trying to do. If it's not clear whether you should be able to do +something or not, report that too; it's a bug in the documentation! + + Before reporting a bug or trying to fix it yourself, try to isolate +it to the smallest possible 'awk' program and input data file that +reproduce the problem. Then send us the program and data file, some +idea of what kind of Unix system you're using, the compiler you used to +compile 'gawk', and the exact results 'gawk' gave you. Also say what +you expected to occur; this helps us decide whether the problem is +really in the documentation. + + Make sure to include the version number of 'gawk' you are using. You +can get this information with the command 'gawk --version'. + + Once you have a precise problem description, send email to +<bug-gawk@gnu.org>. 
+ + The 'gawk' maintainers subscribe to this address, and thus they will +receive your bug report. Although you can send mail to the maintainers +directly, the bug reporting address is preferred because the email list +is archived at the GNU Project. _All email must be in English. This is +the only language understood in common by all the maintainers._ In +addition, please be sure to send all mail in _plain text_, not (or not +exclusively) in HTML. + + NOTE: Many distributions of GNU/Linux and the various BSD-based + operating systems have their own bug reporting systems. If you + report a bug using your distribution's bug reporting system, you + should also send a copy to <bug-gawk@gnu.org>. + + This is for two reasons. First, although some distributions + forward bug reports "upstream" to the GNU mailing list, many don't, + so there is a good chance that the 'gawk' maintainers won't even + see the bug report! Second, mail to the GNU list is archived, and + having everything at the GNU Project keeps things self-contained + and not dependent on other organizations. + + Non-bug suggestions are always welcome as well. If you have +questions about things that are unclear in the documentation or are just +obscure features, ask on the bug list; we will try to help you out if we +can. + + +File: gawk.info, Node: Usenet, Next: Maintainers, Prev: Bug address, Up: Bugs + +B.4.2 Please Don't Post Bug Reports to USENET +--------------------------------------------- + + I gave up on Usenet a couple of years ago and haven't really looked + back. It's like sports talk radio--you feel smarter for not having + read it. + -- _Chet Ramey_ + + Please do _not_ try to report bugs in 'gawk' by posting to the +Usenet/Internet newsgroup 'comp.lang.awk'. Although some of the 'gawk' +developers occasionally read this newgroup, the primary 'gawk' +maintainer no longer does. Thus it's virtually guaranteed that he will +_not_ see your posting. The steps described here are the only +officially recognized way for reporting bugs. Really. + + +File: gawk.info, Node: Maintainers, Prev: Usenet, Up: Bugs + +B.4.3 Reporting Problems with Non-Unix Ports +-------------------------------------------- + +If you find bugs in one of the non-Unix ports of 'gawk', send an email +to the bug list, with a copy to the person who maintains that port. The +maintainers are named in the following list, as well as in the 'README' +file in the 'gawk' distribution. Information in the 'README' file +should be considered authoritative if it conflicts with this Info file. + + The people maintaining the various 'gawk' ports are: + +Unix and POSIX Arnold Robbins, <arnold@skeeve.com> +systems +MS-Windows with MinGW Eli Zaretskii, <eliz@gnu.org> + +OS/2 Andreas Buening, <andreas.buening@nexgo.de> + +VMS John Malmberg, <wb8tyw@qsl.net> + +z/OS (OS/390) Daniel Richard G. <skunk@iSKUNK.ORG> + Dave Pitts (Maintainer Emeritus), <dpitts@cozx.com> + + If your bug is also reproducible under Unix, send a copy of your +report to the <bug-gawk@gnu.org> email list as well. + + The DJGPP port is no longer supported; it will remain in the code +base for a while in case a volunteer wishes to take it over. If this +does not happen, then eventually code for this port will be removed. 
+ + +File: gawk.info, Node: Other Versions, Next: Installation summary, Prev: Bugs, Up: Installation + +B.5 Other Freely Available 'awk' Implementations +================================================ + + It's kind of fun to put comments like this in your awk code: + '// Do C++ comments work? answer: yes! of course' + -- _Michael Brennan_ + + There are a number of other freely available 'awk' implementations. +This minor node briefly describes where to get them: + +Unix 'awk' + Brian Kernighan, one of the original designers of Unix 'awk', has + made his implementation of 'awk' freely available. You can + retrieve this version via his home page + (http://www.cs.princeton.edu/~bwk). It is available in several + archive formats: + + Shell archive + <http://www.cs.princeton.edu/~bwk/btl.mirror/awk.shar> + + Compressed 'tar' file + <http://www.cs.princeton.edu/~bwk/btl.mirror/awk.tar.gz> + + Zip file + <http://www.cs.princeton.edu/~bwk/btl.mirror/awk.zip> + + You can also retrieve it from GitHub: + + git clone git://github.com/onetrueawk/awk bwkawk + + This command creates a copy of the Git (http://git-scm.com) + repository in a directory named 'bwkawk'. If you leave that + argument off the 'git' command line, the repository copy is created + in a directory named 'awk'. + + This version requires an ISO C (1990 standard) compiler; the C + compiler from GCC (the GNU Compiler Collection) works quite nicely. + + *Note Common Extensions:: for a list of extensions in this 'awk' + that are not in POSIX 'awk'. + + As a side note, Dan Bornstein has created a Git repository tracking + all the versions of BWK 'awk' that he could find. It's available + at <git://github.com/danfuzz/one-true-awk>. + +'mawk' + Michael Brennan wrote an independent implementation of 'awk', + called 'mawk'. It is available under the GPL (*note Copying::), + just as 'gawk' is. + + The original distribution site for the 'mawk' source code no longer + has it. A copy is available at + <http://www.skeeve.com/gawk/mawk1.3.3.tar.gz>. + + In 2009, Thomas Dickey took on 'mawk' maintenance. Basic + information is available on the project's web page + (http://www.invisible-island.net/mawk). The download URL is + <http://invisible-island.net/datafiles/release/mawk.tar.gz>. + + Once you have it, 'gunzip' may be used to decompress this file. + Installation is similar to 'gawk''s (*note Unix Installation::). + + *Note Common Extensions:: for a list of extensions in 'mawk' that + are not in POSIX 'awk'. + +'awka' + Written by Andrew Sumner, 'awka' translates 'awk' programs into C, + compiles them, and links them with a library of functions that + provide the core 'awk' functionality. It also has a number of + extensions. + + The 'awk' translator is released under the GPL, and the library is + under the LGPL. + + To get 'awka', go to <http://sourceforge.net/projects/awka>. + + The project seems to be frozen; no new code changes have been made + since approximately 2001. + +'pawk' + Nelson H.F. Beebe at the University of Utah has modified BWK 'awk' + to provide timing and profiling information. It is different from + 'gawk' with the '--profile' option (*note Profiling::) in that it + uses CPU-based profiling, not line-count profiling. You may find + it at either + <ftp://ftp.math.utah.edu/pub/pawk/pawk-20030606.tar.gz> or + <http://www.math.utah.edu/pub/pawk/pawk-20030606.tar.gz>. + +BusyBox 'awk' + BusyBox is a GPL-licensed program providing small versions of many + applications within a single executable. It is aimed at embedded + systems. 
It includes a full implementation of POSIX 'awk'. When + building it, be careful not to do 'make install' as it will + overwrite copies of other applications in your '/usr/local/bin'. + For more information, see the project's home page + (http://busybox.net). + +The OpenSolaris POSIX 'awk' + The versions of 'awk' in '/usr/xpg4/bin' and '/usr/xpg6/bin' on + Solaris are more or less POSIX-compliant. They are based on the + 'awk' from Mortice Kern Systems for PCs. We were able to make this + code compile and work under GNU/Linux with 1-2 hours of work. + Making it more generally portable (using GNU Autoconf and/or + Automake) would take more work, and this has not been done, at + least to our knowledge. + + The source code used to be available from the OpenSolaris website. + However, that project was ended and the website shut down. + Fortunately, the Illumos project + (http://wiki.illumos.org/display/illumos/illumos+Home) makes this + implementation available. You can view the files one at a time + from + <https://github.com/joyent/illumos-joyent/blob/master/usr/src/cmd/awk_xpg4>. + +'jawk' + This is an interpreter for 'awk' written in Java. It claims to be + a full interpreter, although because it uses Java facilities for + I/O and for regexp matching, the language it supports is different + from POSIX 'awk'. More information is available on the project's + home page (http://jawk.sourceforge.net). + +Libmawk + This is an embeddable 'awk' interpreter derived from 'mawk'. For + more information, see <http://repo.hu/projects/libmawk/>. + +'pawk' + This is a Python module that claims to bring 'awk'-like features to + Python. See <https://github.com/alecthomas/pawk> for more + information. (This is not related to Nelson Beebe's modified + version of BWK 'awk', described earlier.) + +QSE 'awk' + This is an embeddable 'awk' interpreter. For more information, see + <http://code.google.com/p/qse/> and <http://awk.info/?tools/qse>. + +'QTawk' + This is an independent implementation of 'awk' distributed under + the GPL. It has a large number of extensions over standard 'awk' + and may not be 100% syntactically compatible with it. See + <http://www.quiktrim.org/QTawk.html> for more information, + including the manual. The download link there is out of date; see + <http://www.quiktrim.org/#AdditionalResources> for the latest + download link. + + The project may also be frozen; no new code changes have been made + since approximately 2014. + +Other versions + See also the "Versions and implementations" section of the + Wikipedia article + (http://en.wikipedia.org/wiki/Awk_language#Versions_and_implementations) + on 'awk' for information on additional versions. + + +File: gawk.info, Node: Installation summary, Prev: Other Versions, Up: Installation + +B.6 Summary +=========== + + * The 'gawk' distribution is available from the GNU Project's main + distribution site, 'ftp.gnu.org'. The canonical build recipe is: + + wget http://ftp.gnu.org/gnu/gawk/gawk-4.1.4.tar.gz + tar -xvpzf gawk-4.1.4.tar.gz + cd gawk-4.1.4 + ./configure && make && make check + + * 'gawk' may be built on non-POSIX systems as well. The currently + supported systems are MS-Windows using MSYS, MinGW, and Cygwin, and + both Vax/VMS and OpenVMS. Instructions for each system are included + in this major node. + + * Bug reports should be sent via email to <bug-gawk@gnu.org>. Bug + reports should be in English and should include the version of + 'gawk', how it was compiled, and a short program and data file that + demonstrate the problem. 
+ + * There are a number of other freely available 'awk' implementations. + Many are POSIX-compliant; others are less so. + + +File: gawk.info, Node: Notes, Next: Basic Concepts, Prev: Installation, Up: Top + +Appendix C Implementation Notes +******************************* + +This appendix contains information mainly of interest to implementers +and maintainers of 'gawk'. Everything in it applies specifically to +'gawk' and not to other implementations. + +* Menu: + +* Compatibility Mode:: How to disable certain 'gawk' + extensions. +* Additions:: Making Additions To 'gawk'. +* Future Extensions:: New features that may be implemented one day. +* Implementation Limitations:: Some limitations of the implementation. +* Extension Design:: Design notes about the extension API. +* Old Extension Mechanism:: Some compatibility for old extensions. +* Notes summary:: Summary of implementation notes. + + +File: gawk.info, Node: Compatibility Mode, Next: Additions, Up: Notes + +C.1 Downward Compatibility and Debugging +======================================== + +*Note POSIX/GNU::, for a summary of the GNU extensions to the 'awk' +language and program. All of these features can be turned off by +invoking 'gawk' with the '--traditional' option or with the '--posix' +option. + + If 'gawk' is compiled for debugging with '-DDEBUG', then there is one +more option available on the command line: + +'-Y' +'--parsedebug' + Print out the parse stack information as the program is being + parsed. + + This option is intended only for serious 'gawk' developers and not +for the casual user. It probably has not even been compiled into your +version of 'gawk', since it slows down execution. + + +File: gawk.info, Node: Additions, Next: Future Extensions, Prev: Compatibility Mode, Up: Notes + +C.2 Making Additions to 'gawk' +============================== + +If you find that you want to enhance 'gawk' in a significant fashion, +you are perfectly free to do so. That is the point of having free +software; the source code is available and you are free to change it as +you want (*note Copying::). + + This minor node discusses the ways you might want to change 'gawk' as +well as any considerations you should bear in mind. + +* Menu: + +* Accessing The Source:: Accessing the Git repository. +* Adding Code:: Adding code to the main body of + 'gawk'. +* New Ports:: Porting 'gawk' to a new operating + system. +* Derived Files:: Why derived files are kept in the Git + repository. + + +File: gawk.info, Node: Accessing The Source, Next: Adding Code, Up: Additions + +C.2.1 Accessing The 'gawk' Git Repository +----------------------------------------- + +As 'gawk' is Free Software, the source code is always available. *note +Gawk Distribution:: describes how to get and build the formal, released +versions of 'gawk'. + + However, if you want to modify 'gawk' and contribute back your +changes, you will probably wish to work with the development version. +To do so, you will need to access the 'gawk' source code repository. +The code is maintained using the Git distributed version control system +(http://git-scm.com). You will need to install it if your system +doesn't have it. Once you have done so, use the command: + + git clone git://git.savannah.gnu.org/gawk.git + +This clones the 'gawk' repository. 
If you are behind a firewall that +does not allow you to use the Git native protocol, you can still access +the repository using: + + git clone http://git.savannah.gnu.org/r/gawk.git + + Once you have made changes, you can use 'git diff' to produce a +patch, and send that to the 'gawk' maintainer; see *note Bugs::, for how +to do that. + + Once upon a time there was Git-CVS gateway for use by people who +could not install Git. However, this gateway no longer works, so you +may have better luck using a more modern version control system like +Bazaar, that has a Git plug-in for working with Git repositories. + + +File: gawk.info, Node: Adding Code, Next: New Ports, Prev: Accessing The Source, Up: Additions + +C.2.2 Adding New Features +------------------------- + +You are free to add any new features you like to 'gawk'. However, if +you want your changes to be incorporated into the 'gawk' distribution, +there are several steps that you need to take in order to make it +possible to include them: + + 1. Before building the new feature into 'gawk' itself, consider + writing it as an extension (*note Dynamic Extensions::). If that's + not possible, continue with the rest of the steps in this list. + + 2. Be prepared to sign the appropriate paperwork. In order for the + FSF to distribute your changes, you must either place those changes + in the public domain and submit a signed statement to that effect, + or assign the copyright in your changes to the FSF. Both of these + actions are easy to do and _many_ people have done so already. If + you have questions, please contact me (*note Bugs::), or + <assign@gnu.org>. + + 3. Get the latest version. It is much easier for me to integrate + changes if they are relative to the most recent distributed version + of 'gawk', or better yet, relative to the latest code in the Git + repository. If your version of 'gawk' is very old, I may not be + able to integrate your changes at all. (*Note Getting::, for + information on getting the latest version of 'gawk'.) + + 4. See *note (Version, standards, GNU Coding Standards)Top::. This + document describes how GNU software should be written. If you + haven't read it, please do so, preferably _before_ starting to + modify 'gawk'. (The 'GNU Coding Standards' are available from the + GNU Project's website (http://www.gnu.org/prep/standards/). + Texinfo, Info, and DVI versions are also available.) + + 5. Use the 'gawk' coding style. The C code for 'gawk' follows the + instructions in the 'GNU Coding Standards', with minor exceptions. + The code is formatted using the traditional "K&R" style, + particularly as regards to the placement of braces and the use of + TABs. In brief, the coding rules for 'gawk' are as follows: + + * Use ANSI/ISO style (prototype) function headers when defining + functions. + + * Put the name of the function at the beginning of its own line. + + * Use '#elif' instead of nesting '#if' inside '#else'. + + * Put the return type of the function, even if it is 'int', on + the line above the line with the name and arguments of the + function. + + * Put spaces around parentheses used in control structures + ('if', 'while', 'for', 'do', 'switch', and 'return'). + + * Do not put spaces in front of parentheses used in function + calls. + + * Put spaces around all C operators and after commas in function + calls. + + * Do not use the comma operator to produce multiple side + effects, except in 'for' loop initialization and increment + parts, and in macro bodies. 
+ + * Use real TABs for indenting, not spaces. + + * Use the "K&R" brace layout style. + + * Use comparisons against 'NULL' and ''\0'' in the conditions of + 'if', 'while', and 'for' statements, as well as in the 'case's + of 'switch' statements, instead of just the plain pointer or + character value. + + * Use 'true' and 'false' for 'bool' values, the 'NULL' symbolic + constant for pointer values, and the character constant ''\0'' + where appropriate, instead of '1' and '0'. + + * Provide one-line descriptive comments for each function. + + * Do not use the 'alloca()' function for allocating memory off + the stack. Its use causes more portability trouble than is + worth the minor benefit of not having to free the storage. + Instead, use 'malloc()' and 'free()'. + + * Do not use comparisons of the form '! strcmp(a, b)' or + similar. As Henry Spencer once said, "'strcmp()' is not a + boolean!" Instead, use 'strcmp(a, b) == 0'. + + * If adding new bit flag values, use explicit hexadecimal + constants ('0x001', '0x002', '0x004', and son on) instead of + shifting one left by successive amounts ('(1<<0)', '(1<<1)', + and so on). + + NOTE: If I have to reformat your code to follow the coding + style used in 'gawk', I may not bother to integrate your + changes at all. + + 6. Update the documentation. Along with your new code, please supply + new sections and/or chapters for this Info file. If at all + possible, please use real Texinfo, instead of just supplying + unformatted ASCII text (although even that is better than no + documentation at all). Conventions to be followed in 'GAWK: + Effective AWK Programming' are provided after the '@bye' at the end + of the Texinfo source file. If possible, please update the 'man' + page as well. + + You will also have to sign paperwork for your documentation + changes. + + 7. Submit changes as unified diffs. Use 'diff -u -r -N' to compare + the original 'gawk' source tree with your version. I recommend + using the GNU version of 'diff', or best of all, 'git diff' or 'git + format-patch'. Send the output produced by 'diff' to me when you + submit your changes. (*Note Bugs::, for the electronic mail + information.) + + Using this format makes it easy for me to apply your changes to the + master version of the 'gawk' source code (using 'patch'). If I + have to apply the changes manually, using a text editor, I may not + do so, particularly if there are lots of changes. + + 8. Include an entry for the 'ChangeLog' file with your submission. + This helps further minimize the amount of work I have to do, making + it easier for me to accept patches. It is simplest if you just + make this part of your diff. + + Although this sounds like a lot of work, please remember that while +you may write the new code, I have to maintain it and support it. If it +isn't possible for me to do that with a minimum of extra work, then I +probably will not. + + +File: gawk.info, Node: New Ports, Next: Derived Files, Prev: Adding Code, Up: Additions + +C.2.3 Porting 'gawk' to a New Operating System +---------------------------------------------- + +If you want to port 'gawk' to a new operating system, there are several +steps: + + 1. Follow the guidelines in *note Adding Code::, concerning coding + style, submission of diffs, and so on. + + 2. Be prepared to sign the appropriate paperwork. In order for the + FSF to distribute your code, you must either place your code in the + public domain and submit a signed statement to that effect, or + assign the copyright in your code to the FSF. 
Both of these actions + are easy to do and _many_ people have done so already. If you have + questions, please contact me, or <gnu@gnu.org>. + + 3. When doing a port, bear in mind that your code must coexist + peacefully with the rest of 'gawk' and the other ports. Avoid + gratuitous changes to the system-independent parts of the code. If + at all possible, avoid sprinkling '#ifdef's just for your port + throughout the code. + + If the changes needed for a particular system affect too much of + the code, I probably will not accept them. In such a case, you + can, of course, distribute your changes on your own, as long as you + comply with the GPL (*note Copying::). + + 4. A number of the files that come with 'gawk' are maintained by other + people. Thus, you should not change them unless it is for a very + good reason; i.e., changes are not out of the question, but changes + to these files are scrutinized extra carefully. The files are + 'dfa.c', 'dfa.h', 'getopt.c', 'getopt.h', 'getopt1.c', + 'getopt_int.h', 'gettext.h', 'regcomp.c', 'regex.c', 'regex.h', + 'regex_internal.c', 'regex_internal.h', and 'regexec.c'. + + 5. A number of other files are provided by the GNU Autotools + (Autoconf, Automake, and GNU 'gettext'). You should not change + them either, unless it is for a very good reason. The files are + 'ABOUT-NLS', 'config.guess', 'config.rpath', 'config.sub', + 'depcomp', 'INSTALL', 'install-sh', 'missing', 'mkinstalldirs', + 'xalloc.h', and 'ylwrap'. + + 6. Be willing to continue to maintain the port. Non-Unix operating + systems are supported by volunteers who maintain the code needed to + compile and run 'gawk' on their systems. If no-one volunteers to + maintain a port, it becomes unsupported and it may be necessary to + remove it from the distribution. + + 7. Supply an appropriate 'gawkmisc.???' file. Each port has its own + 'gawkmisc.???' that implements certain operating system specific + functions. This is cleaner than a plethora of '#ifdef's scattered + throughout the code. The 'gawkmisc.c' in the main source directory + includes the appropriate 'gawkmisc.???' file from each + subdirectory. Be sure to update it as well. + + Each port's 'gawkmisc.???' file has a suffix reminiscent of the + machine or operating system for the port--for example, + 'pc/gawkmisc.pc' and 'vms/gawkmisc.vms'. The use of separate + suffixes, instead of plain 'gawkmisc.c', makes it possible to move + files from a port's subdirectory into the main subdirectory, + without accidentally destroying the real 'gawkmisc.c' file. + (Currently, this is only an issue for the PC operating system + ports.) + + 8. Supply a 'Makefile' as well as any other C source and header files + that are necessary for your operating system. All your code should + be in a separate subdirectory, with a name that is the same as, or + reminiscent of, either your operating system or the computer + system. If possible, try to structure things so that it is not + necessary to move files out of the subdirectory into the main + source directory. If that is not possible, then be sure to avoid + using names for your files that duplicate the names of files in the + main source directory. + + 9. Update the documentation. Please write a section (or sections) for + this Info file describing the installation and compilation steps + needed to compile and/or install 'gawk' for your system. 
+ + Following these steps makes it much easier to integrate your changes +into 'gawk' and have them coexist happily with other operating systems' +code that is already there. + + In the code that you supply and maintain, feel free to use a coding +style and brace layout that suits your taste. + + +File: gawk.info, Node: Derived Files, Prev: New Ports, Up: Additions + +C.2.4 Why Generated Files Are Kept In Git +----------------------------------------- + +If you look at the 'gawk' source in the Git repository, you will notice +that it includes files that are automatically generated by GNU +infrastructure tools, such as 'Makefile.in' from Automake and even +'configure' from Autoconf. + + This is different from many Free Software projects that do not store +the derived files, because that keeps the repository less cluttered, and +it is easier to see the substantive changes when comparing versions and +trying to understand what changed between commits. + + However, there are several reasons why the 'gawk' maintainer likes to +have everything in the repository. + + First, because it is then easy to reproduce any given version +completely, without relying upon the availability of (older, likely +obsolete, and maybe even impossible to find) other tools. + + As an extreme example, if you ever even think about trying to +compile, oh, say, the V7 'awk', you will discover that not only do you +have to bootstrap the V7 'yacc' to do so, but you also need the V7 +'lex'. And the latter is pretty much impossible to bring up on a modern +GNU/Linux system.(1) + + (Or, let's say 'gawk' 1.2 required 'bison' whatever-it-was in 1989 +and that there was no 'awkgram.c' file in the repository. Is there a +guarantee that we could find that 'bison' version? Or that _it_ would +build?) + + If the repository has all the generated files, then it's easy to just +check them out and build. (Or _easier_, depending upon how far back we +go.) + + And that brings us to the second (and stronger) reason why all the +files really need to be in Git. It boils down to who do you cater +to--the 'gawk' developer(s), or the user who just wants to check out a +version and try it out? + + The 'gawk' maintainer wants it to be possible for any interested +'awk' user in the world to just clone the repository, check out the +branch of interest and build it. Without their having to have the +correct version(s) of the autotools.(2) That is the point of the +'bootstrap.sh' file. It touches the various other files in the right +order such that + + # The canonical incantation for building GNU software: + ./bootstrap.sh && ./configure && make + +will _just work_. + + This is extremely important for the 'master' and 'gawk-X.Y-stable' +branches. + + Further, the 'gawk' maintainer would argue that it's also important +for the 'gawk' developers. When he tried to check out the 'xgawk' +branch(3) to build it, he couldn't. (No 'ltmain.sh' file, and he had no +idea how to create it, and that was not the only problem.) + + He felt _extremely_ frustrated. With respect to that branch, the +maintainer is no different than Jane User who wants to try to build +'gawk-4.1-stable' or 'master' from the repository. + + Thus, the maintainer thinks that it's not just important, but +critical, that for any given branch, the above incantation _just works_. + + A third reason to have all the files is that without them, using 'git +bisect' to try to find the commit that introduced a bug is exceedingly +difficult. 
The maintainer tried to do that on another project that +requires running bootstrapping scripts just to create 'configure' and so +on; it was really painful. When the repository is self-contained, using +'git bisect' in it is very easy. + + What are some of the consequences and/or actions to take? + + 1. We don't mind that there are differing files in the different + branches as a result of different versions of the autotools. + + A. It's the maintainer's job to merge them and he will deal with + it. + + B. He is really good at 'git diff x y > /tmp/diff1 ; gvim + /tmp/diff1' to remove the diffs that aren't of interest in + order to review code. + + 2. It would certainly help if everyone used the same versions of the + GNU tools as he does, which in general are the latest released + versions of Automake, Autoconf, 'bison', and GNU 'gettext'. + + Installing from source is quite easy. It's how the maintainer + worked for years (and still works). He had '/usr/local/bin' at the + front of his 'PATH' and just did: + + wget http://ftp.gnu.org/gnu/PACKAGE/PACKAGE-X.Y.Z.tar.gz + tar -xpzvf PACKAGE-X.Y.Z.tar.gz + cd PACKAGE-X.Y.Z + ./configure && make && make check + make install # as root + + Most of the above was originally written by the maintainer to other +'gawk' developers. It raised the objection from one of the developers +"... that anybody pulling down the source from Git is not an end user." + + However, this is not true. There are "power 'awk' users" who can +build 'gawk' (using the magic incantation shown previously) but who +can't program in C. Thus, the major branches should be kept buildable +all the time. + + It was then suggested that there be a 'cron' job to create nightly +tarballs of "the source." Here, the problem is that there are source +trees, corresponding to the various branches! So, nightly tarballs +aren't the answer, especially as the repository can go for weeks without +significant change being introduced. + + Fortunately, the Git server can meet this need. For any given branch +named BRANCHNAME, use: + + wget http://git.savannah.gnu.org/cgit/gawk.git/snapshot/gawk-BRANCHNAME.tar.gz + +to retrieve a snapshot of the given branch. + + ---------- Footnotes ---------- + + (1) We tried. It was painful. + + (2) There is one GNU program that is (in our opinion) severely +difficult to bootstrap from the Git repository. For example, on the +author's old (but still working) PowerPC Macintosh with Mac OS X 10.5, +it was necessary to bootstrap a ton of software, starting with Git +itself, in order to try to work with the latest code. It's not +pleasant, and especially on older systems, it's a big waste of time. + + Starting with the latest tarball was no picnic either. The +maintainers had dropped '.gz' and '.bz2' files and only distribute +'.tar.xz' files. It was necessary to bootstrap 'xz' first! + + (3) A branch (since removed) created by one of the other developers +that did not include the generated files. + + +File: gawk.info, Node: Future Extensions, Next: Implementation Limitations, Prev: Additions, Up: Notes + +C.3 Probable Future Extensions +============================== + + AWK is a language similar to PERL, only considerably more elegant. + -- _Arnold Robbins_ + + Hey! + -- _Larry Wall_ + + The 'TODO' file in the 'master' branch of the 'gawk' Git repository +lists possible future enhancements. Some of these relate to the source +code, and others to possible new features. Please see that file for the +list. 
*Note Additions::, if you are interested in tackling any of the
+projects listed there.
+
+
+File: gawk.info, Node: Implementation Limitations, Next: Extension Design, Prev: Future Extensions, Up: Notes
+
+C.4 Some Limitations of the Implementation
+==========================================
+
+The following table describes limits of 'gawk' on a Unix-like system
+(although it is variable even then). Other systems may have different
+limits.
+
+Item                           Limit
+--------------------------------------------------------------------------
+Characters in a character      2^(number of bits per byte)
+class
+Length of input record         'MAX_INT'
+Length of output record        Unlimited
+Length of source line          Unlimited
+Number of fields in a          'MAX_LONG'
+record
+Number of file redirections    Unlimited
+Number of input records in     'MAX_LONG'
+one file
+Number of input records        'MAX_LONG'
+total
+Number of pipe redirections    min(number of processes per user, number
+                               of open files)
+Numeric values                 Double-precision floating point (if not
+                               using MPFR)
+Size of a field                'MAX_INT'
+Size of a literal string       'MAX_INT'
+Size of a printf string        'MAX_INT'
+
+
+File: gawk.info, Node: Extension Design, Next: Old Extension Mechanism, Prev: Implementation Limitations, Up: Notes
+
+C.5 Extension API Design
+========================
+
+This minor node documents the design of the extension API, including a
+discussion of some of the history and problems that needed to be solved.
+
+   The first version of extensions for 'gawk' was developed in the
+mid-1990s and released with 'gawk' 3.1 in the late 1990s. The basic
+mechanisms and design remained unchanged for close to 15 years, until
+2012.
+
+   The old extension mechanism used data types and functions from 'gawk'
+itself, with a "clever hack" to install extension functions.
+
+   'gawk' included some sample extensions, of which a few were really
+useful. However, it was clear from the outset that the extension
+mechanism was bolted onto the side and was not really well thought out.
+
+* Menu:
+
+* Old Extension Problems::           Problems with the old mechanism.
+* Extension New Mechanism Goals::    Goals for the new mechanism.
+* Extension Other Design Decisions:: Some other design decisions.
+* Extension Future Growth::          Some room for future growth.
+
+
+File: gawk.info, Node: Old Extension Problems, Next: Extension New Mechanism Goals, Up: Extension Design
+
+C.5.1 Problems With The Old Mechanism
+-------------------------------------
+
+The old extension mechanism had several problems:
+
+   * It depended heavily upon 'gawk' internals. Any time the 'NODE'
+     structure(1) changed, an extension would have to be recompiled.
+     Furthermore, to really write extensions required understanding
+     something about 'gawk''s internal functions. There was some
+     documentation in this Info file, but it was quite minimal.
+
+   * Being able to call into 'gawk' from an extension required linker
+     facilities that are common on Unix-derived systems but that did not
+     work on MS-Windows systems; users wanting extensions on MS-Windows
+     had to statically link them into 'gawk', even though MS-Windows
+     supports dynamic loading of shared objects.
+
+   * The API would change occasionally as 'gawk' changed; no
+     compatibility between versions was ever offered or planned for.
+
+   Despite the drawbacks, the 'xgawk' project developers forked 'gawk'
+and developed several significant extensions. They also enhanced
+'gawk''s facilities relating to file inclusion and shared object access.
+
+   A new API was desired for a long time, but only in 2012 did the
+'gawk' maintainer and the 'xgawk' developers finally start working on it
+together. More information about the 'xgawk' project is provided in
+*note gawkextlib::.
+
+   ---------- Footnotes ----------
+
+   (1) A critical central data structure inside 'gawk'.
+
+
+File: gawk.info, Node: Extension New Mechanism Goals, Next: Extension Other Design Decisions, Prev: Old Extension Problems, Up: Extension Design
+
+C.5.2 Goals For A New Mechanism
+-------------------------------
+
+Some goals for the new API were:
+
+   * The API should be independent of 'gawk' internals. Changes in
+     'gawk' internals should not be visible to the writer of an
+     extension function.
+
+   * The API should provide _binary_ compatibility across 'gawk'
+     releases as long as the API itself does not change.
+
+   * The API should enable extensions written in C or C++ to have
+     roughly the same "appearance" to 'awk'-level code as 'awk'
+     functions do. This means that extensions should have:
+
+        - The ability to access function parameters.
+
+        - The ability to turn an undefined parameter into an array (call
+          by reference).
+
+        - The ability to create, access and update global variables.
+
+        - Easy access to all the elements of an array at once ("array
+          flattening") in order to loop over all the elements in an easy
+          fashion for C code.
+
+        - The ability to create arrays (including 'gawk''s true arrays
+          of arrays).
+
+   Some additional important goals were:
+
+   * The API should use only features in ISO C 90, so that extensions
+     can be written using the widest range of C and C++ compilers. The
+     header should include the appropriate '#ifdef __cplusplus' and
+     'extern "C"' magic so that a C++ compiler could be used. (If using
+     C++, the runtime system has to be smart enough to call any
+     constructors and destructors, as 'gawk' is a C program. As of this
+     writing, this has not been tested.) A generic sketch of this
+     header idiom appears at the end of this minor node.
+
+   * The API mechanism should not require access to 'gawk''s symbols(1)
+     by the compile-time or dynamic linker, in order to enable creation
+     of extensions that also work on MS-Windows.
+
+   During development, it became clear that there were other features
+that should be available to extensions, which were also subsequently
+provided:
+
+   * Extensions should have the ability to hook into 'gawk''s I/O
+     redirection mechanism. In particular, the 'xgawk' developers
+     provided a so-called "open hook" to take over reading records.
+     During development, this was generalized to allow extensions to
+     hook into input processing, output processing, and two-way I/O.
+
+   * An extension should be able to provide a "call back" function to
+     perform cleanup actions when 'gawk' exits.
+
+   * An extension should be able to provide a version string so that
+     'gawk''s '--version' option can provide information about
+     extensions as well.
+
+   The requirement to avoid access to 'gawk''s symbols is, at first
+glance, a difficult one to meet.
+
+   One design, apparently used by Perl and Ruby and maybe others, would
+be to make the mainline 'gawk' code into a library, with the 'gawk'
+utility a small C 'main()' function linked against the library.
+
+   This seemed like the tail wagging the dog, complicating build and
+installation and making a simple copy of the 'gawk' executable from one
+system to another (or one place to another on the same system!) into a
+chancy operation.
+
+   Pat Rankin suggested the solution that was adopted. *Note Extension
+Mechanism Outline::, for the details.
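+
+   As promised above, here is a generic sketch of the '#ifdef
+__cplusplus' and 'extern "C"' idiom mentioned in the ISO C 90 goal.
+This is only an illustration of the standard technique, not the actual
+contents of 'gawkapi.h'; the header and function names here are made up
+for the example:
+
+     /* sample_ext.h --- hypothetical extension header */
+     #ifndef SAMPLE_EXT_H
+     #define SAMPLE_EXT_H
+
+     #ifdef __cplusplus
+     extern "C" {
+     #endif
+
+     /* Declarations here use only ISO C 90 features; a C++
+        compiler sees them with C linkage. */
+     int sample_func(const char *name, int flags);
+
+     #ifdef __cplusplus
+     }
+     #endif
+
+     #endif /* SAMPLE_EXT_H */
+
+   With this wrapper in place, the same header can be used unchanged
+from both C and C++ extension code.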
+ + ---------- Footnotes ---------- + + (1) The "symbols" are the variables and functions defined inside +'gawk'. Access to these symbols by code external to 'gawk' loaded +dynamically at runtime is problematic on MS-Windows. + + +File: gawk.info, Node: Extension Other Design Decisions, Next: Extension Future Growth, Prev: Extension New Mechanism Goals, Up: Extension Design + +C.5.3 Other Design Decisions +---------------------------- + +As an arbitrary design decision, extensions can read the values of +predefined variables and arrays (such as 'ARGV' and 'FS'), but cannot +change them, with the exception of 'PROCINFO'. + + The reason for this is to prevent an extension function from +affecting the flow of an 'awk' program outside its control. While a +real 'awk' function can do what it likes, that is at the discretion of +the programmer. An extension function should provide a service or make +a C API available for use within 'awk', and not mess with 'FS' or 'ARGC' +and 'ARGV'. + + In addition, it becomes easy to start down a slippery slope. How +much access to 'gawk' facilities do extensions need? Do they need +'getline'? What about calling 'gsub()' or compiling regular +expressions? What about calling into 'awk' functions? (_That_ would be +messy.) + + In order to avoid these issues, the 'gawk' developers chose to start +with the simplest, most basic features that are still truly useful. + + Another decision is that although 'gawk' provides nice things like +MPFR, and arrays indexed internally by integers, these features are not +being brought out to the API in order to keep things simple and close to +traditional 'awk' semantics. (In fact, arrays indexed internally by +integers are so transparent that they aren't even documented!) + + Additionally, all functions in the API check that their pointer input +parameters are not 'NULL'. If they are, they return an error. (It is a +good idea for extension code to verify that pointers received from +'gawk' are not 'NULL'. Such a thing should not happen, but the 'gawk' +developers are only human, and they have been known to occasionally make +mistakes.) + + With time, the API will undoubtedly evolve; the 'gawk' developers +expect this to be driven by user needs. For now, the current API seems +to provide a minimal yet powerful set of features for creating +extensions. + + +File: gawk.info, Node: Extension Future Growth, Prev: Extension Other Design Decisions, Up: Extension Design + +C.5.4 Room For Future Growth +---------------------------- + +The API can later be expanded, in two ways: + + * 'gawk' passes an "extension id" into the extension when it first + loads the extension. The extension then passes this id back to + 'gawk' with each function call. This mechanism allows 'gawk' to + identify the extension calling into it, should it need to know. + + * Similarly, the extension passes a "name space" into 'gawk' when it + registers each extension function. This accommodates a possible + future mechanism for grouping extension functions and possibly + avoiding name conflicts. + + Of course, as of this writing, no decisions have been made with +respect to any of the above. + + +File: gawk.info, Node: Old Extension Mechanism, Next: Notes summary, Prev: Extension Design, Up: Notes + +C.6 Compatibility For Old Extensions +==================================== + +*note Dynamic Extensions::, describes the supported API and mechanisms +for writing extensions for 'gawk'. This API was introduced in version +4.1. 
However, for many years 'gawk' provided an extension mechanism +that required knowledge of 'gawk' internals and that was not as well +designed. + + In order to provide a transition period, 'gawk' version 4.1 continues +to support the original extension mechanism. This will be true for the +life of exactly one major release. This support will be withdrawn, and +removed from the source code, at the next major release. + + Briefly, original-style extensions should be compiled by including +the 'awk.h' header file in the extension source code. Additionally, you +must define the identifier 'GAWK' when building (use '-DGAWK' with +Unix-style compilers). Otherwise, the definitions in 'gawkapi.h' will +cause conflicts with those in 'awk.h' and your extension will not +compile. + + Just as in previous versions, you load an old-style extension with +the 'extension()' built-in function (which is not otherwise documented). +This function in turn finds and loads the shared object file containing +the extension and calls its 'dl_load()' C routine. + + Because original-style and new-style extensions use different +initialization routines ('dl_load()' versus 'dlload()'), they may safely +be installed in the same directory (to be found by 'AWKLIBPATH') without +conflict. + + The 'gawk' development team strongly recommends that you convert any +old extensions that you may have to use the new API described in *note +Dynamic Extensions::. + + +File: gawk.info, Node: Notes summary, Prev: Old Extension Mechanism, Up: Notes + +C.7 Summary +=========== + + * 'gawk''s extensions can be disabled with either the '--traditional' + option or with the '--posix' option. The '--parsedebug' option is + available if 'gawk' is compiled with '-DDEBUG'. + + * The source code for 'gawk' is maintained in a publicly accessible + Git repository. Anyone may check it out and view the source. + + * Contributions to 'gawk' are welcome. Following the steps outlined + in this major node will make it easier to integrate your + contributions into the code base. This applies both to new feature + contributions and to ports to additional operating systems. + + * 'gawk' has some limits--generally those that are imposed by the + machine architecture. + + * The extension API design was intended to solve a number of problems + with the previous extension mechanism, enable features needed by + the 'xgawk' project, and provide binary compatibility going + forward. + + * The previous extension mechanism is still supported in version 4.1 + of 'gawk', but it _will_ be removed in the next major release. + + +File: gawk.info, Node: Basic Concepts, Next: Glossary, Prev: Notes, Up: Top + +Appendix D Basic Programming Concepts +************************************* + +This major node attempts to define some of the basic concepts and terms +that are used throughout the rest of this Info file. As this Info file +is specifically about 'awk', and not about computer programming in +general, the coverage here is by necessity fairly cursory and +simplistic. (If you need more background, there are many other +introductory texts that you should refer to instead.) + +* Menu: + +* Basic High Level:: The high level view. +* Basic Data Typing:: A very quick intro to data types. + + +File: gawk.info, Node: Basic High Level, Next: Basic Data Typing, Up: Basic Concepts + +D.1 What a Program Does +======================= + +At the most basic level, the job of a program is to process some input +data and produce results. See *note Figure D.1: figure-general-flow. 
+
+                  _______
++------+         /       \         +---------+
+| Data | -----> < Program > -----> | Results |
++------+         \_______/         +---------+
+
+Figure D.1: General Program Flow
+
+   The "program" in the figure can be either a compiled program(1) (such
+as 'ls'), or it may be "interpreted". In the latter case, a
+machine-executable program such as 'awk' reads your program, and then
+uses the instructions in your program to process the data.
+
+   When you write a program, it usually consists of the following, very
+basic set of steps, as shown in *note Figure D.2: figure-process-flow.:
+
+                              ______
++----------------+           / More \  No       +----------+
+| Initialization | --------> <  Data  > ------> | Clean Up |
++----------------+     ^      \  ?  /           +----------+
+                       |      +--+-+
+                       |         | Yes
+                       |         |
+                       |         V
+                       |     +---------+
+                       +-----+ Process |
+                             +---------+
+
+Figure D.2: Basic Program Steps
+
+Initialization
+     These are the things you do before actually starting to process
+     data, such as checking arguments, initializing any data you need to
+     work with, and so on. This step corresponds to 'awk''s 'BEGIN'
+     rule (*note BEGIN/END::).
+
+     If you were baking a cake, this might consist of laying out all the
+     mixing bowls and the baking pan, and making sure you have all the
+     ingredients that you need.
+
+Processing
+     This is where the actual work is done. Your program reads data,
+     one logical chunk at a time, and processes it as appropriate.
+
+     In most programming languages, you have to manually manage the
+     reading of data, checking to see if there is more each time you
+     read a chunk. 'awk''s pattern-action paradigm (*note Getting
+     Started::) handles the mechanics of this for you.
+
+     In baking a cake, the processing corresponds to the actual labor:
+     breaking eggs, mixing the flour, water, and other ingredients, and
+     then putting the cake into the oven.
+
+Clean Up
+     Once you've processed all the data, you may have things you need to
+     do before exiting. This step corresponds to 'awk''s 'END' rule
+     (*note BEGIN/END::).
+
+     After the cake comes out of the oven, you still have to wrap it in
+     plastic wrap to keep anyone from tasting it, as well as wash the
+     mixing bowls and utensils.
+
+   An "algorithm" is a detailed set of instructions necessary to
+accomplish a task, or process data. It is much the same as a recipe for
+baking a cake. Programs implement algorithms. Often, it is up to you
+to design the algorithm and implement it, simultaneously.
+
+   The "logical chunks" we talked about previously are called "records",
+similar to the records a company keeps on employees, a school keeps for
+students, or a doctor keeps for patients. Each record has many
+component parts, such as first and last names, date of birth, address,
+and so on. The component parts are referred to as the "fields" of the
+record.
+
+   The act of reading data is termed "input", and that of generating
+results, not too surprisingly, is termed "output". They are often
+referred to together as "input/output," and even more often, as "I/O"
+for short. (You will also see "input" and "output" used as verbs.)
+
+   'awk' manages the reading of data for you, as well as the breaking it
+up into records and fields. Your program's job is to tell 'awk' what to
+do with the data. You do this by describing "patterns" in the data to
+look for, and "actions" to execute when those patterns are seen. This
+"data-driven" nature of 'awk' programs usually makes them both easier to
+write and easier to read.
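+
+   For readers who know C, the following short program sketches the
+three steps of Figure D.2 written out by hand. It is purely an
+illustration of what 'awk' does for you automatically; it is not part
+of 'gawk', and the details (such as the fixed-size line buffer) are
+deliberately simplistic:
+
+     /* Echo input lines and count them:
+        initialize, process, clean up. */
+     #include <stdio.h>
+
+     int
+     main(void)
+     {
+         char line[1024];
+         long nr = 0;    /* initialization (like 'awk''s 'BEGIN' rule) */
+
+         /* processing: read and handle one record at a time */
+         while (fgets(line, sizeof line, stdin) != NULL) {
+             nr++;
+             fputs(line, stdout);
+         }
+
+         /* clean up (like 'awk''s 'END' rule) */
+         printf("%ld lines seen\n", nr);
+         return 0;
+     }
+
+   In 'awk', the explicit reading loop disappears: a 'BEGIN' rule, one
+or more pattern-action rules, and an 'END' rule express the same three
+steps, and 'awk' itself handles the mechanics of reading records.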
+ + ---------- Footnotes ---------- + + (1) Compiled programs are typically written in lower-level languages +such as C, C++, or Ada, and then translated, or "compiled", into a form +that the computer can execute directly. + + +File: gawk.info, Node: Basic Data Typing, Prev: Basic High Level, Up: Basic Concepts + +D.2 Data Values in a Computer +============================= + +In a program, you keep track of information and values in things called +"variables". A variable is just a name for a given value, such as +'first_name', 'last_name', 'address', and so on. 'awk' has several +predefined variables, and it has special names to refer to the current +input record and the fields of the record. You may also group multiple +associated values under one name, as an array. + + Data, particularly in 'awk', consists of either numeric values, such +as 42 or 3.1415927, or string values. String values are essentially +anything that's not a number, such as a name. Strings are sometimes +referred to as "character data", since they store the individual +characters that comprise them. Individual variables, as well as numeric +and string variables, are referred to as "scalar" values. Groups of +values, such as arrays, are not scalars. + + *note Computer Arithmetic::, provided a basic introduction to numeric +types (integer and floating-point) and how they are used in a computer. +Please review that information, including a number of caveats that were +presented. + + While you are probably used to the idea of a number without a value +(i.e., zero), it takes a bit more getting used to the idea of +zero-length character data. Nevertheless, such a thing exists. It is +called the "null string". The null string is character data that has no +value. In other words, it is empty. It is written in 'awk' programs +like this: '""'. + + Humans are used to working in decimal; i.e., base 10. In base 10, +numbers go from 0 to 9, and then "roll over" into the next column. +(Remember grade school? 42 = 4 x 10 + 2.) + + There are other number bases though. Computers commonly use base 2 +or "binary", base 8 or "octal", and base 16 or "hexadecimal". In +binary, each column represents two times the value in the column to its +right. Each column may contain either a 0 or a 1. Thus, binary 1010 +represents (1 x 8) + (0 x 4) + (1 x 2) + (0 x 1), or decimal 10. Octal +and hexadecimal are discussed more in *note Nondecimal-numbers::. + + At the very lowest level, computers store values as groups of binary +digits, or "bits". Modern computers group bits into groups of eight, +called "bytes". Advanced applications sometimes have to manipulate bits +directly, and 'gawk' provides functions for doing so. + + Programs are written in programming languages. Hundreds, if not +thousands, of programming languages exist. One of the most popular is +the C programming language. The C language had a very strong influence +on the design of the 'awk' language. + + There have been several versions of C. The first is often referred to +as "K&R" C, after the initials of Brian Kernighan and Dennis Ritchie, +the authors of the first book on C. (Dennis Ritchie created the +language, and Brian Kernighan was one of the creators of 'awk'.) + + In the mid-1980s, an effort began to produce an international +standard for C. This work culminated in 1989, with the production of the +ANSI standard for C. This standard became an ISO standard in 1990. In +1999, a revised ISO C standard was approved and released. 
Where it +makes sense, POSIX 'awk' is compatible with 1999 ISO C. + + +File: gawk.info, Node: Glossary, Next: Copying, Prev: Basic Concepts, Up: Top + +Glossary +******** + +Action + A series of 'awk' statements attached to a rule. If the rule's + pattern matches an input record, 'awk' executes the rule's action. + Actions are always enclosed in braces. (*Note Action Overview::.) + +Ada + A programming language originally defined by the U.S. Department of + Defense for embedded programming. It was designed to enforce good + Software Engineering practices. + +Amazing 'awk' Assembler + Henry Spencer at the University of Toronto wrote a retargetable + assembler completely as 'sed' and 'awk' scripts. It is thousands + of lines long, including machine descriptions for several eight-bit + microcomputers. It is a good example of a program that would have + been better written in another language. You can get it from + <http://awk.info/?awk100/aaa>. + +Amazingly Workable Formatter ('awf') + Henry Spencer at the University of Toronto wrote a formatter that + accepts a large subset of the 'nroff -ms' and 'nroff -man' + formatting commands, using 'awk' and 'sh'. It is available from + <http://awk.info/?tools/awf>. + +Anchor + The regexp metacharacters '^' and '$', which force the match to the + beginning or end of the string, respectively. + +ANSI + The American National Standards Institute. This organization + produces many standards, among them the standards for the C and C++ + programming languages. These standards often become international + standards as well. See also "ISO." + +Argument + An argument can be two different things. It can be an option or a + file name passed to a command while invoking it from the command + line, or it can be something passed to a "function" inside a + program, e.g. inside 'awk'. + + In the latter case, an argument can be passed to a function in two + ways. Either it is given to the called function by value, i.e., a + copy of the value of the variable is made available to the called + function, but the original variable cannot be modified by the + function itself; or it is given by reference, i.e., a pointer to + the interested variable is passed to the function, which can then + directly modify it. In 'awk' scalars are passed by value, and + arrays are passed by reference. See "Pass By Value/Reference." + +Array + A grouping of multiple values under the same name. Most languages + just provide sequential arrays. 'awk' provides associative arrays. + +Assertion + A statement in a program that a condition is true at this point in + the program. Useful for reasoning about how a program is supposed + to behave. + +Assignment + An 'awk' expression that changes the value of some 'awk' variable + or data object. An object that you can assign to is called an + "lvalue". The assigned values are called "rvalues". *Note + Assignment Ops::. + +Associative Array + Arrays in which the indices may be numbers or strings, not just + sequential integers in a fixed range. + +'awk' Language + The language in which 'awk' programs are written. + +'awk' Program + An 'awk' program consists of a series of "patterns" and "actions", + collectively known as "rules". For each input record given to the + program, the program's rules are all processed in turn. 'awk' + programs may also contain function definitions. + +'awk' Script + Another name for an 'awk' program. + +Bash + The GNU version of the standard shell (the Bourne-Again SHell). + See also "Bourne Shell." 
+
+Binary
+     Base-two notation, where the digits are '0'-'1'. Since electronic
+     circuitry works "naturally" in base 2 (just think of Off/On),
+     everything inside a computer is calculated using base 2. Each
+     digit represents the presence (or absence) of a power of 2 and is
+     called a "bit". So, for example, the base-two number '10101' is
+     the same as decimal 21, ((1 x 16) + (1 x 4) + (1 x 1)).
+
+     Since base-two numbers quickly become very long to read and write,
+     they are usually grouped by 3 (i.e., they are read as octal
+     numbers), or by 4 (i.e., they are read as hexadecimal numbers).
+     There is no direct way to insert base 2 numbers in a C program. If
+     need arises, such numbers are usually inserted as octal or
+     hexadecimal numbers. The number of base-two digits that fit into
+     registers used for representing integer numbers in computers is a
+     rough indication of the computing power of the computer itself.
+     Most computers nowadays use 64 bits for representing integer
+     numbers in their registers, but 32-bit, 16-bit and 8-bit registers
+     have been widely used in the past. *Note Nondecimal-numbers::.
+
+Bit
+     Short for "Binary Digit." All values in computer memory ultimately
+     reduce to binary digits: values that are either zero or one.
+     Groups of bits may be interpreted differently--as integers,
+     floating-point numbers, character data, addresses of other memory
+     objects, or other data. 'awk' lets you work with floating-point
+     numbers and strings. 'gawk' lets you manipulate bit values with
+     the built-in functions described in *note Bitwise Functions::.
+
+     Computers are often defined by how many bits they use to represent
+     integer values. Typical systems are 32-bit systems, but 64-bit
+     systems are becoming increasingly popular, and 16-bit systems have
+     essentially disappeared.
+
+Boolean Expression
+     Named after the English mathematician Boole. See also "Logical
+     Expression."
+
+Bourne Shell
+     The standard shell ('/bin/sh') on Unix and Unix-like systems,
+     originally written by Steven R. Bourne at Bell Laboratories. Many
+     shells (Bash, 'ksh', 'pdksh', 'zsh') are generally upwardly
+     compatible with the Bourne shell.
+
+Braces
+     The characters '{' and '}'. Braces are used in 'awk' for
+     delimiting actions, compound statements, and function bodies.
+
+Bracket Expression
+     Inside a "regular expression", an expression included in square
+     brackets, meant to designate a single character as belonging to a
+     specified character class. A bracket expression can contain a list
+     of one or more characters, like '[abc]', a range of characters,
+     like '[A-Z]', or a name, delimited by ':', that designates a known
+     set of characters, like '[:digit:]'. The form of bracket
+     expression enclosed between ':' is independent of the underlying
+     representation of the characters themselves, which could utilize the
+     ASCII, EBCDIC, or Unicode codesets, depending on the architecture
+     of the computer system, and on localization. See also "Regular
+     Expression."
+
+Built-in Function
+     The 'awk' language provides built-in functions that perform various
+     numerical, I/O-related, and string computations. Examples are
+     'sqrt()' (for the square root of a number) and 'substr()' (for a
+     substring of a string). 'gawk' provides functions for timestamp
+     management, bit manipulation, array sorting, type checking, and
+     runtime string translation. (*Note Built-in::.)
+
+Built-in Variable
+     'ARGC', 'ARGV', 'CONVFMT', 'ENVIRON', 'FILENAME', 'FNR', 'FS',
+     'NF', 'NR', 'OFMT', 'OFS', 'ORS', 'RLENGTH', 'RSTART', 'RS', and
+     'SUBSEP' are the variables that have special meaning to 'awk'. In
+     addition, 'ARGIND', 'BINMODE', 'ERRNO', 'FIELDWIDTHS', 'FPAT',
+     'IGNORECASE', 'LINT', 'PROCINFO', 'RT', and 'TEXTDOMAIN' are the
+     variables that have special meaning to 'gawk'. Changing some of
+     them affects 'awk''s running environment. (*Note Built-in
+     Variables::.)
+
+C
+     The system programming language that most GNU software is written
+     in. The 'awk' programming language has C-like syntax, and this
+     Info file points out similarities between 'awk' and C when
+     appropriate.
+
+     In general, 'gawk' attempts to be as similar to the 1990 version of
+     ISO C as makes sense.
+
+C Shell
+     The C Shell ('csh' or its improved version, 'tcsh') is a Unix shell
+     that was created by Bill Joy in the late 1970s. The C shell was
+     differentiated from other shells by its interactive features and
+     overall style, which looks more like C. The C Shell is not backward
+     compatible with the Bourne Shell, so special attention is required
+     when converting scripts written for other Unix shells to the C
+     shell, especially with regard to the management of shell variables.
+     See also "Bourne Shell."
+
+C++
+     A popular object-oriented programming language derived from C.
+
+Character Class
+     See "Bracket Expression."
+
+Character List
+     See "Bracket Expression."
+
+Character Set
+     The set of numeric codes used by a computer system to represent the
+     characters (letters, numbers, punctuation, etc.) of a particular
+     country or place. The most common character set in use today is
+     ASCII (American Standard Code for Information Interchange). Many
+     European countries use an extension of ASCII known as ISO-8859-1
+     (ISO Latin-1). The Unicode character set (http://www.unicode.org)
+     is increasingly popular and standard, and is particularly widely
+     used on GNU/Linux systems.
+
+CHEM
+     A preprocessor for 'pic' that reads descriptions of molecules and
+     produces 'pic' input for drawing them. It was written in 'awk' by
+     Brian Kernighan and Jon Bentley, and is available from
+     <http://netlib.org/typesetting/chem>.
+
+Comparison Expression
+     A relation that is either true or false, such as 'a < b'.
+     Comparison expressions are used in 'if', 'while', 'do', and 'for'
+     statements, and in patterns to select which input records to
+     process. (*Note Typing and Comparison::.)
+
+Compiler
+     A program that translates human-readable source code into
+     machine-executable object code. The object code is then executed
+     directly by the computer. See also "Interpreter."
+
+Complemented Bracket Expression
+     The negation of a "bracket expression". All that is _not_
+     described by a given bracket expression. The symbol '^' precedes
+     the negated bracket expression. E.g.: '[^[:digit:]]' designates
+     whatever character is not a digit. '[^bad]' designates whatever
+     character is not one of the letters 'b', 'a', or 'd'. See "Bracket
+     Expression."
+
+Compound Statement
+     A series of 'awk' statements, enclosed in curly braces. Compound
+     statements may be nested. (*Note Statements::.)
+
+Computed Regexps
+     See "Dynamic Regular Expressions."
+
+Concatenation
+     Concatenating two strings means sticking them together, one after
+     another, producing a new string. For example, the string 'foo'
+     concatenated with the string 'bar' gives the string 'foobar'.
+     (*Note Concatenation::.)
+ +Conditional Expression + An expression using the '?:' ternary operator, such as 'EXPR1 ? + EXPR2 : EXPR3'. The expression EXPR1 is evaluated; if the result + is true, the value of the whole expression is the value of EXPR2; + otherwise the value is EXPR3. In either case, only one of EXPR2 + and EXPR3 is evaluated. (*Note Conditional Exp::.) + +Control Statement + A control statement is an instruction to perform a given operation + or a set of operations inside an 'awk' program, if a given + condition is true. Control statements are: 'if', 'for', 'while', + and 'do' (*note Statements::). + +Cookie + A peculiar goodie, token, saying or remembrance produced by or + presented to a program. (With thanks to Professor Doug McIlroy.) + +Coprocess + A subordinate program with which two-way communications is + possible. + +Curly Braces + See "Braces." + +Dark Corner + An area in the language where specifications often were (or still + are) not clear, leading to unexpected or undesirable behavior. + Such areas are marked in this Info file with "(d.c.)" in the text + and are indexed under the heading "dark corner." + +Data Driven + A description of 'awk' programs, where you specify the data you are + interested in processing, and what to do when that data is seen. + +Data Objects + These are numbers and strings of characters. Numbers are converted + into strings and vice versa, as needed. (*Note Conversion::.) + +Deadlock + The situation in which two communicating processes are each waiting + for the other to perform an action. + +Debugger + A program used to help developers remove "bugs" from (de-bug) their + programs. + +Double Precision + An internal representation of numbers that can have fractional + parts. Double precision numbers keep track of more digits than do + single precision numbers, but operations on them are sometimes more + expensive. This is the way 'awk' stores numeric values. It is the + C type 'double'. + +Dynamic Regular Expression + A dynamic regular expression is a regular expression written as an + ordinary expression. It could be a string constant, such as + '"foo"', but it may also be an expression whose value can vary. + (*Note Computed Regexps::.) + +Empty String + See "Null String." + +Environment + A collection of strings, of the form 'NAME=VAL', that each program + has available to it. Users generally place values into the + environment in order to provide information to various programs. + Typical examples are the environment variables 'HOME' and 'PATH'. + +Epoch + The date used as the "beginning of time" for timestamps. Time + values in most systems are represented as seconds since the epoch, + with library functions available for converting these values into + standard date and time formats. + + The epoch on Unix and POSIX systems is 1970-01-01 00:00:00 UTC. See + also "GMT" and "UTC." + +Escape Sequences + A special sequence of characters used for describing nonprinting + characters, such as '\n' for newline or '\033' for the ASCII ESC + (Escape) character. (*Note Escape Sequences::.) + +Extension + An additional feature or change to a programming language or + utility not defined by that language's or utility's standard. + 'gawk' has (too) many extensions over POSIX 'awk'. + +FDL + See "Free Documentation License." + +Field + When 'awk' reads an input record, it splits the record into pieces + separated by whitespace (or by a separator regexp that you can + change by setting the predefined variable 'FS'). Such pieces are + called fields. 
If the pieces are of fixed length, you can use the + built-in variable 'FIELDWIDTHS' to describe their lengths. If you + wish to specify the contents of fields instead of the field + separator, you can use the predefined variable 'FPAT' to do so. + (*Note Field Separators::, *note Constant Size::, and *note + Splitting By Content::.) + +Flag + A variable whose truth value indicates the existence or + nonexistence of some condition. + +Floating-Point Number + Often referred to in mathematical terms as a "rational" or real + number, this is just a number that can have a fractional part. See + also "Double Precision" and "Single Precision." + +Format + Format strings control the appearance of output in the 'strftime()' + and 'sprintf()' functions, and in the 'printf' statement as well. + Also, data conversions from numbers to strings are controlled by + the format strings contained in the predefined variables 'CONVFMT' + and 'OFMT'. (*Note Control Letters::.) + +Fortran + Shorthand for FORmula TRANslator, one of the first programming + languages available for scientific calculations. It was created by + John Backus, and has been available since 1957. It is still in use + today. + +Free Documentation License + This document describes the terms under which this Info file is + published and may be copied. (*Note GNU Free Documentation + License::.) + +Free Software Foundation + A nonprofit organization dedicated to the production and + distribution of freely distributable software. It was founded by + Richard M. Stallman, the author of the original Emacs editor. GNU + Emacs is the most widely used version of Emacs today. + +FSF + See "Free Software Foundation." + +Function + A part of an 'awk' program that can be invoked from every point of + the program, to perform a task. 'awk' has several built-in + functions. Users can define their own functions in every part of + the program. Function can be recursive, i.e., they may invoke + themselves. *Note Functions::. In 'gawk' it is also possible to + have functions shared among different programs, and included where + required using the '@include' directive (*note Include Files::). + In 'gawk' the name of the function that should be invoked can be + generated at run time, i.e., dynamically. The 'gawk' extension API + provides constructor functions (*note Constructor Functions::). + +'gawk' + The GNU implementation of 'awk'. + +General Public License + This document describes the terms under which 'gawk' and its source + code may be distributed. (*Note Copying::.) + +GMT + "Greenwich Mean Time." This is the old term for UTC. It is the + time of day used internally for Unix and POSIX systems. See also + "Epoch" and "UTC." + +GNU + "GNU's not Unix". An on-going project of the Free Software + Foundation to create a complete, freely distributable, + POSIX-compliant computing environment. + +GNU/Linux + A variant of the GNU system using the Linux kernel, instead of the + Free Software Foundation's Hurd kernel. The Linux kernel is a + stable, efficient, full-featured clone of Unix that has been ported + to a variety of architectures. It is most popular on PC-class + systems, but runs well on a variety of other systems too. The + Linux kernel source code is available under the terms of the GNU + General Public License, which is perhaps its most important aspect. + +GPL + See "General Public License." + +Hexadecimal + Base 16 notation, where the digits are '0'-'9' and 'A'-'F', with + 'A' representing 10, 'B' representing 11, and so on, up to 'F' for + 15. 
Hexadecimal numbers are written in C using a leading '0x', to + indicate their base. Thus, '0x12' is 18 ((1 x 16) + 2). *Note + Nondecimal-numbers::. + +I/O + Abbreviation for "Input/Output," the act of moving data into and/or + out of a running program. + +Input Record + A single chunk of data that is read in by 'awk'. Usually, an 'awk' + input record consists of one line of text. (*Note Records::.) + +Integer + A whole number, i.e., a number that does not have a fractional + part. + +Internationalization + The process of writing or modifying a program so that it can use + multiple languages without requiring further source code changes. + +Interpreter + A program that reads human-readable source code directly, and uses + the instructions in it to process data and produce results. 'awk' + is typically (but not always) implemented as an interpreter. See + also "Compiler." + +Interval Expression + A component of a regular expression that lets you specify repeated + matches of some part of the regexp. Interval expressions were not + originally available in 'awk' programs. + +ISO + The International Organization for Standardization. This + organization produces international standards for many things, + including programming languages, such as C and C++. In the + computer arena, important standards like those for C, C++, and + POSIX become both American national and ISO international standards + simultaneously. This Info file refers to Standard C as "ISO C" + throughout. See the ISO website + (http://www.iso.org/iso/home/about.htm) for more information about + the name of the organization and its language-independent + three-letter acronym. + +Java + A modern programming language originally developed by Sun + Microsystems (now Oracle) supporting Object-Oriented programming. + Although usually implemented by compiling to the instructions for a + standard virtual machine (the JVM), the language can be compiled to + native code. + +Keyword + In the 'awk' language, a keyword is a word that has special + meaning. Keywords are reserved and may not be used as variable + names. + + 'gawk''s keywords are: 'BEGIN', 'BEGINFILE', 'END', 'ENDFILE', + 'break', 'case', 'continue', 'default' 'delete', 'do...while', + 'else', 'exit', 'for...in', 'for', 'function', 'func', 'if', + 'next', 'nextfile', 'switch', and 'while'. + +Korn Shell + The Korn Shell ('ksh') is a Unix shell which was developed by David + Korn at Bell Laboratories in the early 1980s. The Korn Shell is + backward-compatible with the Bourne shell and includes many + features of the C shell. See also "Bourne Shell." + +Lesser General Public License + This document describes the terms under which binary library + archives or shared objects, and their source code may be + distributed. + +LGPL + See "Lesser General Public License." + +Linux + See "GNU/Linux." + +Localization + The process of providing the data necessary for an + internationalized program to work in a particular language. + +Logical Expression + An expression using the operators for logic, AND, OR, and NOT, + written '&&', '||', and '!' in 'awk'. Often called Boolean + expressions, after the mathematician who pioneered this kind of + mathematical logic. + +Lvalue + An expression that can appear on the left side of an assignment + operator. In most languages, lvalues can be variables or array + elements. In 'awk', a field designator can also be used as an + lvalue. + +Matching + The act of testing a string against a regular expression. 
If the + regexp describes the contents of the string, it is said to "match" + it. + +Metacharacters + Characters used within a regexp that do not stand for themselves. + Instead, they denote regular expression operations, such as + repetition, grouping, or alternation. + +Nesting + Nesting is where information is organized in layers, or where + objects contain other similar objects. In 'gawk' the '@include' + directive can be nested. The "natural" nesting of arithmetic and + logical operations can be changed using parentheses (*note + Precedence::). + +No-op + An operation that does nothing. + +Null String + A string with no characters in it. It is represented explicitly in + 'awk' programs by placing two double quote characters next to each + other ('""'). It can appear in input data by having two successive + occurrences of the field separator appear next to each other. + +Number + A numeric-valued data object. Modern 'awk' implementations use + double precision floating-point to represent numbers. Ancient + 'awk' implementations used single precision floating-point. + +Octal + Base-eight notation, where the digits are '0'-'7'. Octal numbers + are written in C using a leading '0', to indicate their base. + Thus, '013' is 11 ((1 x 8) + 3). *Note Nondecimal-numbers::. + +Output Record + A single chunk of data that is written out by 'awk'. Usually, an + 'awk' output record consists of one or more lines of text. *Note + Records::. + +Pattern + Patterns tell 'awk' which input records are interesting to which + rules. + + A pattern is an arbitrary conditional expression against which + input is tested. If the condition is satisfied, the pattern is + said to "match" the input record. A typical pattern might compare + the input record against a regular expression. (*Note Pattern + Overview::.) + +PEBKAC + An acronym describing what is possibly the most frequent source of + computer usage problems. (Problem Exists Between Keyboard And + Chair.) + +Plug-in + See "Extensions." + +POSIX + The name for a series of standards that specify a Portable + Operating System interface. The "IX" denotes the Unix heritage of + these standards. The main standard of interest for 'awk' users is + 'IEEE Standard for Information Technology, Standard 1003.1-2008'. + The 2008 POSIX standard can be found online at + <http://www.opengroup.org/onlinepubs/9699919799/>. + +Precedence + The order in which operations are performed when operators are used + without explicit parentheses. + +Private + Variables and/or functions that are meant for use exclusively by + library functions and not for the main 'awk' program. Special care + must be taken when naming such variables and functions. (*Note + Library Names::.) + +Range (of input lines) + A sequence of consecutive lines from the input file(s). A pattern + can specify ranges of input lines for 'awk' to process or it can + specify single lines. (*Note Pattern Overview::.) + +Record + See "Input record" and "Output record." + +Recursion + When a function calls itself, either directly or indirectly. If + this is clear, stop, and proceed to the next entry. Otherwise, + refer to the entry for "recursion." + +Redirection + Redirection means performing input from something other than the + standard input stream, or performing output to something other than + the standard output stream. + + You can redirect input to the 'getline' statement using the '<', + '|', and '|&' operators. 
You can redirect the output of the + 'print' and 'printf' statements to a file or a system command, + using the '>', '>>', '|', and '|&' operators. (*Note Getline::, + and *note Redirection::.) + +Reference Counts + An internal mechanism in 'gawk' to minimize the amount of memory + needed to store the value of string variables. If the value + assumed by a variable is used in more than one place, only one copy + of the value itself is kept, and the associated reference count is + increased when the same value is used by an additional variable, + and decreased when the related variable is no longer in use. When + the reference count goes to zero, the memory space used to store + the value of the variable is freed. + +Regexp + See "Regular Expression." + +Regular Expression + A regular expression ("regexp" for short) is a pattern that denotes + a set of strings, possibly an infinite set. For example, the + regular expression 'R.*xp' matches any string starting with the + letter 'R' and ending with the letters 'xp'. In 'awk', regular + expressions are used in patterns and in conditional expressions. + Regular expressions may contain escape sequences. (*Note + Regexp::.) + +Regular Expression Constant + A regular expression constant is a regular expression written + within slashes, such as '/foo/'. This regular expression is chosen + when you write the 'awk' program and cannot be changed during its + execution. (*Note Regexp Usage::.) + +Regular Expression Operators + See "Metacharacters." + +Rounding + Rounding the result of an arithmetic operation can be tricky. More + than one way of rounding exists, and in 'gawk' it is possible to + choose which method should be used in a program. *Note Setting the + rounding mode::. + +Rule + A segment of an 'awk' program that specifies how to process single + input records. A rule consists of a "pattern" and an "action". + 'awk' reads an input record; then, for each rule, if the input + record satisfies the rule's pattern, 'awk' executes the rule's + action. Otherwise, the rule does nothing for that input record. + +Rvalue + A value that can appear on the right side of an assignment + operator. In 'awk', essentially every expression has a value. + These values are rvalues. + +Scalar + A single value, be it a number or a string. Regular variables are + scalars; arrays and functions are not. + +Search Path + In 'gawk', a list of directories to search for 'awk' program source + files. In the shell, a list of directories to search for + executable programs. + +'sed' + See "Stream Editor." + +Seed + The initial value, or starting point, for a sequence of random + numbers. + +Shell + The command interpreter for Unix and POSIX-compliant systems. The + shell works both interactively, and as a programming language for + batch files, or shell scripts. + +Short-Circuit + The nature of the 'awk' logical operators '&&' and '||'. If the + value of the entire expression is determinable from evaluating just + the lefthand side of these operators, the righthand side is not + evaluated. (*Note Boolean Ops::.) + +Side Effect + A side effect occurs when an expression has an effect aside from + merely producing a value. Assignment expressions, increment and + decrement expressions, and function calls have side effects. + (*Note Assignment Ops::.) + +Single Precision + An internal representation of numbers that can have fractional + parts. 
Single precision numbers keep track of fewer digits than do + double precision numbers, but operations on them are sometimes less + expensive in terms of CPU time. This is the type used by some + ancient versions of 'awk' to store numeric values. It is the C + type 'float'. + +Space + The character generated by hitting the space bar on the keyboard. + +Special File + A file name interpreted internally by 'gawk', instead of being + handed directly to the underlying operating system--for example, + '/dev/stderr'. (*Note Special Files::.) + +Statement + An expression inside an 'awk' program in the action part of a + pattern-action rule, or inside an 'awk' function. A statement can + be a variable assignment, an array operation, a loop, etc. + +Stream Editor + A program that reads records from an input stream and processes + them one or more at a time. This is in contrast with batch + programs, which may expect to read their input files in entirety + before starting to do anything, as well as with interactive + programs which require input from the user. + +String + A datum consisting of a sequence of characters, such as 'I am a + string'. Constant strings are written with double quotes in the + 'awk' language and may contain escape sequences. (*Note Escape + Sequences::.) + +Tab + The character generated by hitting the 'TAB' key on the keyboard. + It usually expands to up to eight spaces upon output. + +Text Domain + A unique name that identifies an application. Used for grouping + messages that are translated at runtime into the local language. + +Timestamp + A value in the "seconds since the epoch" format used by Unix and + POSIX systems. Used for the 'gawk' functions 'mktime()', + 'strftime()', and 'systime()'. See also "Epoch," "GMT," and "UTC." + +Unix + A computer operating system originally developed in the early + 1970's at AT&T Bell Laboratories. It initially became popular in + universities around the world and later moved into commercial + environments as a software development system and network server + system. There are many commercial versions of Unix, as well as + several work-alike systems whose source code is freely available + (such as GNU/Linux, NetBSD (http://www.netbsd.org), FreeBSD + (http://www.freebsd.org), and OpenBSD (http://www.openbsd.org)). + +UTC + The accepted abbreviation for "Universal Coordinated Time." This + is standard time in Greenwich, England, which is used as a + reference time for day and date calculations. See also "Epoch" and + "GMT." + +Variable + A name for a value. In 'awk', variables may be either scalars or + arrays. + +Whitespace + A sequence of space, TAB, or newline characters occurring inside an + input record or a string. + + +File: gawk.info, Node: Copying, Next: GNU Free Documentation License, Prev: Glossary, Up: Top + +GNU General Public License +************************** + + Version 3, 29 June 2007 + + Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/> + + Everyone is permitted to copy and distribute verbatim copies of this + license document, but changing it is not allowed. + +Preamble +======== + +The GNU General Public License is a free, copyleft license for software +and other kinds of works. + + The licenses for most software and other practical works are designed +to take away your freedom to share and change the works. By contrast, +the GNU General Public License is intended to guarantee your freedom to +share and change all versions of a program--to make sure it remains free +software for all its users. 
We, the Free Software Foundation, use the +GNU General Public License for most of our software; it applies also to +any other work released this way by its authors. You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +them if you wish), that you receive source code or can get it if you +want it, that you can change the software or use pieces of it in new +free programs, and that you know you can do these things. + + To protect your rights, we need to prevent others from denying you +these rights or asking you to surrender the rights. Therefore, you have +certain responsibilities if you distribute copies of the software, or if +you modify it: responsibilities to respect the freedom of others. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must pass on to the recipients the same +freedoms that you received. You must make sure that they, too, receive +or can get the source code. And you must show them these terms so they +know their rights. + + Developers that use the GNU GPL protect your rights with two steps: +(1) assert copyright on the software, and (2) offer you this License +giving you legal permission to copy, distribute and/or modify it. + + For the developers' and authors' protection, the GPL clearly explains +that there is no warranty for this free software. For both users' and +authors' sake, the GPL requires that modified versions be marked as +changed, so that their problems will not be attributed erroneously to +authors of previous versions. + + Some devices are designed to deny users access to install or run +modified versions of the software inside them, although the manufacturer +can do so. This is fundamentally incompatible with the aim of +protecting users' freedom to change the software. The systematic +pattern of such abuse occurs in the area of products for individuals to +use, which is precisely where it is most unacceptable. Therefore, we +have designed this version of the GPL to prohibit the practice for those +products. If such problems arise substantially in other domains, we +stand ready to extend this provision to those domains in future versions +of the GPL, as needed to protect the freedom of users. + + Finally, every program is threatened constantly by software patents. +States should not allow patents to restrict development and use of +software on general-purpose computers, but in those that do, we wish to +avoid the special danger that patents applied to a free program could +make it effectively proprietary. To prevent this, the GPL assures that +patents cannot be used to render the program non-free. + + The precise terms and conditions for copying, distribution and +modification follow. + +TERMS AND CONDITIONS +==================== + + 0. Definitions. + + "This License" refers to version 3 of the GNU General Public + License. + + "Copyright" also means copyright-like laws that apply to other + kinds of works, such as semiconductor masks. + + "The Program" refers to any copyrightable work licensed under this + License. Each licensee is addressed as "you". "Licensees" and + "recipients" may be individuals or organizations. + + To "modify" a work means to copy from or adapt all or part of the + work in a fashion requiring copyright permission, other than the + making of an exact copy. 
The resulting work is called a "modified + version" of the earlier work or a work "based on" the earlier work. + + A "covered work" means either the unmodified Program or a work + based on the Program. + + To "propagate" a work means to do anything with it that, without + permission, would make you directly or secondarily liable for + infringement under applicable copyright law, except executing it on + a computer or modifying a private copy. Propagation includes + copying, distribution (with or without modification), making + available to the public, and in some countries other activities as + well. + + To "convey" a work means any kind of propagation that enables other + parties to make or receive copies. Mere interaction with a user + through a computer network, with no transfer of a copy, is not + conveying. + + An interactive user interface displays "Appropriate Legal Notices" + to the extent that it includes a convenient and prominently visible + feature that (1) displays an appropriate copyright notice, and (2) + tells the user that there is no warranty for the work (except to + the extent that warranties are provided), that licensees may convey + the work under this License, and how to view a copy of this + License. If the interface presents a list of user commands or + options, such as a menu, a prominent item in the list meets this + criterion. + + 1. Source Code. + + The "source code" for a work means the preferred form of the work + for making modifications to it. "Object code" means any non-source + form of a work. + + A "Standard Interface" means an interface that either is an + official standard defined by a recognized standards body, or, in + the case of interfaces specified for a particular programming + language, one that is widely used among developers working in that + language. + + The "System Libraries" of an executable work include anything, + other than the work as a whole, that (a) is included in the normal + form of packaging a Major Component, but which is not part of that + Major Component, and (b) serves only to enable use of the work with + that Major Component, or to implement a Standard Interface for + which an implementation is available to the public in source code + form. A "Major Component", in this context, means a major + essential component (kernel, window system, and so on) of the + specific operating system (if any) on which the executable work + runs, or a compiler used to produce the work, or an object code + interpreter used to run it. + + The "Corresponding Source" for a work in object code form means all + the source code needed to generate, install, and (for an executable + work) run the object code and to modify the work, including scripts + to control those activities. However, it does not include the + work's System Libraries, or general-purpose tools or generally + available free programs which are used unmodified in performing + those activities but which are not part of the work. For example, + Corresponding Source includes interface definition files associated + with source files for the work, and the source code for shared + libraries and dynamically linked subprograms that the work is + specifically designed to require, such as by intimate data + communication or control flow between those subprograms and other + parts of the work. + + The Corresponding Source need not include anything that users can + regenerate automatically from other parts of the Corresponding + Source. 
+ + The Corresponding Source for a work in source code form is that + same work. + + 2. Basic Permissions. + + All rights granted under this License are granted for the term of + copyright on the Program, and are irrevocable provided the stated + conditions are met. This License explicitly affirms your unlimited + permission to run the unmodified Program. The output from running + a covered work is covered by this License only if the output, given + its content, constitutes a covered work. This License acknowledges + your rights of fair use or other equivalent, as provided by + copyright law. + + You may make, run and propagate covered works that you do not + convey, without conditions so long as your license otherwise + remains in force. You may convey covered works to others for the + sole purpose of having them make modifications exclusively for you, + or provide you with facilities for running those works, provided + that you comply with the terms of this License in conveying all + material for which you do not control copyright. Those thus making + or running the covered works for you must do so exclusively on your + behalf, under your direction and control, on terms that prohibit + them from making any copies of your copyrighted material outside + their relationship with you. + + Conveying under any other circumstances is permitted solely under + the conditions stated below. Sublicensing is not allowed; section + 10 makes it unnecessary. + + 3. Protecting Users' Legal Rights From Anti-Circumvention Law. + + No covered work shall be deemed part of an effective technological + measure under any applicable law fulfilling obligations under + article 11 of the WIPO copyright treaty adopted on 20 December + 1996, or similar laws prohibiting or restricting circumvention of + such measures. + + When you convey a covered work, you waive any legal power to forbid + circumvention of technological measures to the extent such + circumvention is effected by exercising rights under this License + with respect to the covered work, and you disclaim any intention to + limit operation or modification of the work as a means of + enforcing, against the work's users, your or third parties' legal + rights to forbid circumvention of technological measures. + + 4. Conveying Verbatim Copies. + + You may convey verbatim copies of the Program's source code as you + receive it, in any medium, provided that you conspicuously and + appropriately publish on each copy an appropriate copyright notice; + keep intact all notices stating that this License and any + non-permissive terms added in accord with section 7 apply to the + code; keep intact all notices of the absence of any warranty; and + give all recipients a copy of this License along with the Program. + + You may charge any price or no price for each copy that you convey, + and you may offer support or warranty protection for a fee. + + 5. Conveying Modified Source Versions. + + You may convey a work based on the Program, or the modifications to + produce it from the Program, in the form of source code under the + terms of section 4, provided that you also meet all of these + conditions: + + a. The work must carry prominent notices stating that you + modified it, and giving a relevant date. + + b. The work must carry prominent notices stating that it is + released under this License and any conditions added under + section 7. This requirement modifies the requirement in + section 4 to "keep intact all notices". + + c. 
You must license the entire work, as a whole, under this + License to anyone who comes into possession of a copy. This + License will therefore apply, along with any applicable + section 7 additional terms, to the whole of the work, and all + its parts, regardless of how they are packaged. This License + gives no permission to license the work in any other way, but + it does not invalidate such permission if you have separately + received it. + + d. If the work has interactive user interfaces, each must display + Appropriate Legal Notices; however, if the Program has + interactive interfaces that do not display Appropriate Legal + Notices, your work need not make them do so. + + A compilation of a covered work with other separate and independent + works, which are not by their nature extensions of the covered + work, and which are not combined with it such as to form a larger + program, in or on a volume of a storage or distribution medium, is + called an "aggregate" if the compilation and its resulting + copyright are not used to limit the access or legal rights of the + compilation's users beyond what the individual works permit. + Inclusion of a covered work in an aggregate does not cause this + License to apply to the other parts of the aggregate. + + 6. Conveying Non-Source Forms. + + You may convey a covered work in object code form under the terms + of sections 4 and 5, provided that you also convey the + machine-readable Corresponding Source under the terms of this + License, in one of these ways: + + a. Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by the + Corresponding Source fixed on a durable physical medium + customarily used for software interchange. + + b. Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by a + written offer, valid for at least three years and valid for as + long as you offer spare parts or customer support for that + product model, to give anyone who possesses the object code + either (1) a copy of the Corresponding Source for all the + software in the product that is covered by this License, on a + durable physical medium customarily used for software + interchange, for a price no more than your reasonable cost of + physically performing this conveying of source, or (2) access + to copy the Corresponding Source from a network server at no + charge. + + c. Convey individual copies of the object code with a copy of the + written offer to provide the Corresponding Source. This + alternative is allowed only occasionally and noncommercially, + and only if you received the object code with such an offer, + in accord with subsection 6b. + + d. Convey the object code by offering access from a designated + place (gratis or for a charge), and offer equivalent access to + the Corresponding Source in the same way through the same + place at no further charge. You need not require recipients + to copy the Corresponding Source along with the object code. + If the place to copy the object code is a network server, the + Corresponding Source may be on a different server (operated by + you or a third party) that supports equivalent copying + facilities, provided you maintain clear directions next to the + object code saying where to find the Corresponding Source. + Regardless of what server hosts the Corresponding Source, you + remain obligated to ensure that it is available for as long as + needed to satisfy these requirements. 
+ + e. Convey the object code using peer-to-peer transmission, + provided you inform other peers where the object code and + Corresponding Source of the work are being offered to the + general public at no charge under subsection 6d. + + A separable portion of the object code, whose source code is + excluded from the Corresponding Source as a System Library, need + not be included in conveying the object code work. + + A "User Product" is either (1) a "consumer product", which means + any tangible personal property which is normally used for personal, + family, or household purposes, or (2) anything designed or sold for + incorporation into a dwelling. In determining whether a product is + a consumer product, doubtful cases shall be resolved in favor of + coverage. For a particular product received by a particular user, + "normally used" refers to a typical or common use of that class of + product, regardless of the status of the particular user or of the + way in which the particular user actually uses, or expects or is + expected to use, the product. A product is a consumer product + regardless of whether the product has substantial commercial, + industrial or non-consumer uses, unless such uses represent the + only significant mode of use of the product. + + "Installation Information" for a User Product means any methods, + procedures, authorization keys, or other information required to + install and execute modified versions of a covered work in that + User Product from a modified version of its Corresponding Source. + The information must suffice to ensure that the continued + functioning of the modified object code is in no case prevented or + interfered with solely because modification has been made. + + If you convey an object code work under this section in, or with, + or specifically for use in, a User Product, and the conveying + occurs as part of a transaction in which the right of possession + and use of the User Product is transferred to the recipient in + perpetuity or for a fixed term (regardless of how the transaction + is characterized), the Corresponding Source conveyed under this + section must be accompanied by the Installation Information. But + this requirement does not apply if neither you nor any third party + retains the ability to install modified object code on the User + Product (for example, the work has been installed in ROM). + + The requirement to provide Installation Information does not + include a requirement to continue to provide support service, + warranty, or updates for a work that has been modified or installed + by the recipient, or for the User Product in which it has been + modified or installed. Access to a network may be denied when the + modification itself materially and adversely affects the operation + of the network or violates the rules and protocols for + communication across the network. + + Corresponding Source conveyed, and Installation Information + provided, in accord with this section must be in a format that is + publicly documented (and with an implementation available to the + public in source code form), and must require no special password + or key for unpacking, reading or copying. + + 7. Additional Terms. + + "Additional permissions" are terms that supplement the terms of + this License by making exceptions from one or more of its + conditions. 
Additional permissions that are applicable to the + entire Program shall be treated as though they were included in + this License, to the extent that they are valid under applicable + law. If additional permissions apply only to part of the Program, + that part may be used separately under those permissions, but the + entire Program remains governed by this License without regard to + the additional permissions. + + When you convey a copy of a covered work, you may at your option + remove any additional permissions from that copy, or from any part + of it. (Additional permissions may be written to require their own + removal in certain cases when you modify the work.) You may place + additional permissions on material, added by you to a covered work, + for which you have or can give appropriate copyright permission. + + Notwithstanding any other provision of this License, for material + you add to a covered work, you may (if authorized by the copyright + holders of that material) supplement the terms of this License with + terms: + + a. Disclaiming warranty or limiting liability differently from + the terms of sections 15 and 16 of this License; or + + b. Requiring preservation of specified reasonable legal notices + or author attributions in that material or in the Appropriate + Legal Notices displayed by works containing it; or + + c. Prohibiting misrepresentation of the origin of that material, + or requiring that modified versions of such material be marked + in reasonable ways as different from the original version; or + + d. Limiting the use for publicity purposes of names of licensors + or authors of the material; or + + e. Declining to grant rights under trademark law for use of some + trade names, trademarks, or service marks; or + + f. Requiring indemnification of licensors and authors of that + material by anyone who conveys the material (or modified + versions of it) with contractual assumptions of liability to + the recipient, for any liability that these contractual + assumptions directly impose on those licensors and authors. + + All other non-permissive additional terms are considered "further + restrictions" within the meaning of section 10. If the Program as + you received it, or any part of it, contains a notice stating that + it is governed by this License along with a term that is a further + restriction, you may remove that term. If a license document + contains a further restriction but permits relicensing or conveying + under this License, you may add to a covered work material governed + by the terms of that license document, provided that the further + restriction does not survive such relicensing or conveying. + + If you add terms to a covered work in accord with this section, you + must place, in the relevant source files, a statement of the + additional terms that apply to those files, or a notice indicating + where to find the applicable terms. + + Additional terms, permissive or non-permissive, may be stated in + the form of a separately written license, or stated as exceptions; + the above requirements apply either way. + + 8. Termination. + + You may not propagate or modify a covered work except as expressly + provided under this License. Any attempt otherwise to propagate or + modify it is void, and will automatically terminate your rights + under this License (including any patent licenses granted under the + third paragraph of section 11). 
+ + However, if you cease all violation of this License, then your + license from a particular copyright holder is reinstated (a) + provisionally, unless and until the copyright holder explicitly and + finally terminates your license, and (b) permanently, if the + copyright holder fails to notify you of the violation by some + reasonable means prior to 60 days after the cessation. + + Moreover, your license from a particular copyright holder is + reinstated permanently if the copyright holder notifies you of the + violation by some reasonable means, this is the first time you have + received notice of violation of this License (for any work) from + that copyright holder, and you cure the violation prior to 30 days + after your receipt of the notice. + + Termination of your rights under this section does not terminate + the licenses of parties who have received copies or rights from you + under this License. If your rights have been terminated and not + permanently reinstated, you do not qualify to receive new licenses + for the same material under section 10. + + 9. Acceptance Not Required for Having Copies. + + You are not required to accept this License in order to receive or + run a copy of the Program. Ancillary propagation of a covered work + occurring solely as a consequence of using peer-to-peer + transmission to receive a copy likewise does not require + acceptance. However, nothing other than this License grants you + permission to propagate or modify any covered work. These actions + infringe copyright if you do not accept this License. Therefore, + by modifying or propagating a covered work, you indicate your + acceptance of this License to do so. + + 10. Automatic Licensing of Downstream Recipients. + + Each time you convey a covered work, the recipient automatically + receives a license from the original licensors, to run, modify and + propagate that work, subject to this License. You are not + responsible for enforcing compliance by third parties with this + License. + + An "entity transaction" is a transaction transferring control of an + organization, or substantially all assets of one, or subdividing an + organization, or merging organizations. If propagation of a + covered work results from an entity transaction, each party to that + transaction who receives a copy of the work also receives whatever + licenses to the work the party's predecessor in interest had or + could give under the previous paragraph, plus a right to possession + of the Corresponding Source of the work from the predecessor in + interest, if the predecessor has it or can get it with reasonable + efforts. + + You may not impose any further restrictions on the exercise of the + rights granted or affirmed under this License. For example, you + may not impose a license fee, royalty, or other charge for exercise + of rights granted under this License, and you may not initiate + litigation (including a cross-claim or counterclaim in a lawsuit) + alleging that any patent claim is infringed by making, using, + selling, offering for sale, or importing the Program or any portion + of it. + + 11. Patents. + + A "contributor" is a copyright holder who authorizes use under this + License of the Program or a work on which the Program is based. + The work thus licensed is called the contributor's "contributor + version". 
+ + A contributor's "essential patent claims" are all patent claims + owned or controlled by the contributor, whether already acquired or + hereafter acquired, that would be infringed by some manner, + permitted by this License, of making, using, or selling its + contributor version, but do not include claims that would be + infringed only as a consequence of further modification of the + contributor version. For purposes of this definition, "control" + includes the right to grant patent sublicenses in a manner + consistent with the requirements of this License. + + Each contributor grants you a non-exclusive, worldwide, + royalty-free patent license under the contributor's essential + patent claims, to make, use, sell, offer for sale, import and + otherwise run, modify and propagate the contents of its contributor + version. + + In the following three paragraphs, a "patent license" is any + express agreement or commitment, however denominated, not to + enforce a patent (such as an express permission to practice a + patent or covenant not to sue for patent infringement). To "grant" + such a patent license to a party means to make such an agreement or + commitment not to enforce a patent against the party. + + If you convey a covered work, knowingly relying on a patent + license, and the Corresponding Source of the work is not available + for anyone to copy, free of charge and under the terms of this + License, through a publicly available network server or other + readily accessible means, then you must either (1) cause the + Corresponding Source to be so available, or (2) arrange to deprive + yourself of the benefit of the patent license for this particular + work, or (3) arrange, in a manner consistent with the requirements + of this License, to extend the patent license to downstream + recipients. "Knowingly relying" means you have actual knowledge + that, but for the patent license, your conveying the covered work + in a country, or your recipient's use of the covered work in a + country, would infringe one or more identifiable patents in that + country that you have reason to believe are valid. + + If, pursuant to or in connection with a single transaction or + arrangement, you convey, or propagate by procuring conveyance of, a + covered work, and grant a patent license to some of the parties + receiving the covered work authorizing them to use, propagate, + modify or convey a specific copy of the covered work, then the + patent license you grant is automatically extended to all + recipients of the covered work and works based on it. + + A patent license is "discriminatory" if it does not include within + the scope of its coverage, prohibits the exercise of, or is + conditioned on the non-exercise of one or more of the rights that + are specifically granted under this License. You may not convey a + covered work if you are a party to an arrangement with a third + party that is in the business of distributing software, under which + you make payment to the third party based on the extent of your + activity of conveying the work, and under which the third party + grants, to any of the parties who would receive the covered work + from you, a discriminatory patent license (a) in connection with + copies of the covered work conveyed by you (or copies made from + those copies), or (b) primarily for and in connection with specific + products or compilations that contain the covered work, unless you + entered into that arrangement, or that patent license was granted, + prior to 28 March 2007. 
+ + Nothing in this License shall be construed as excluding or limiting + any implied license or other defenses to infringement that may + otherwise be available to you under applicable patent law. + + 12. No Surrender of Others' Freedom. + + If conditions are imposed on you (whether by court order, agreement + or otherwise) that contradict the conditions of this License, they + do not excuse you from the conditions of this License. If you + cannot convey a covered work so as to satisfy simultaneously your + obligations under this License and any other pertinent obligations, + then as a consequence you may not convey it at all. For example, + if you agree to terms that obligate you to collect a royalty for + further conveying from those to whom you convey the Program, the + only way you could satisfy both those terms and this License would + be to refrain entirely from conveying the Program. + + 13. Use with the GNU Affero General Public License. + + Notwithstanding any other provision of this License, you have + permission to link or combine any covered work with a work licensed + under version 3 of the GNU Affero General Public License into a + single combined work, and to convey the resulting work. The terms + of this License will continue to apply to the part which is the + covered work, but the special requirements of the GNU Affero + General Public License, section 13, concerning interaction through + a network will apply to the combination as such. + + 14. Revised Versions of this License. + + The Free Software Foundation may publish revised and/or new + versions of the GNU General Public License from time to time. Such + new versions will be similar in spirit to the present version, but + may differ in detail to address new problems or concerns. + + Each version is given a distinguishing version number. If the + Program specifies that a certain numbered version of the GNU + General Public License "or any later version" applies to it, you + have the option of following the terms and conditions either of + that numbered version or of any later version published by the Free + Software Foundation. If the Program does not specify a version + number of the GNU General Public License, you may choose any + version ever published by the Free Software Foundation. + + If the Program specifies that a proxy can decide which future + versions of the GNU General Public License can be used, that + proxy's public statement of acceptance of a version permanently + authorizes you to choose that version for the Program. + + Later license versions may give you additional or different + permissions. However, no additional obligations are imposed on any + author or copyright holder as a result of your choosing to follow a + later version. + + 15. Disclaimer of Warranty. + + THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY + APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE + COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" + WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, + INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF + MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE + RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. + SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL + NECESSARY SERVICING, REPAIR OR CORRECTION. + + 16. Limitation of Liability. 
+ + IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN + WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES + AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR + DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR + CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE + THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA + BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD + PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER + PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF + THE POSSIBILITY OF SUCH DAMAGES. + + 17. Interpretation of Sections 15 and 16. + + If the disclaimer of warranty and limitation of liability provided + above cannot be given local legal effect according to their terms, + reviewing courts shall apply local law that most closely + approximates an absolute waiver of all civil liability in + connection with the Program, unless a warranty or assumption of + liability accompanies a copy of the Program in return for a fee. + +END OF TERMS AND CONDITIONS +=========================== + +How to Apply These Terms to Your New Programs +============================================= + +If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these +terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +state the exclusion of warranty; and each file should have at least the +"copyright" line and a pointer to where the full notice is found. + + ONE LINE TO GIVE THE PROGRAM'S NAME AND A BRIEF IDEA OF WHAT IT DOES. + Copyright (C) YEAR NAME OF AUTHOR + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 3 of the License, or (at + your option) any later version. + + This program is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. + + Also add information on how to contact you by electronic and paper +mail. + + If the program does terminal interaction, make it output a short +notice like this when it starts in an interactive mode: + + PROGRAM Copyright (C) YEAR NAME OF AUTHOR + This program comes with ABSOLUTELY NO WARRANTY; for details type 'show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type 'show c' for details. + + The hypothetical commands 'show w' and 'show c' should show the +appropriate parts of the General Public License. Of course, your +program's commands might be different; for a GUI interface, you would +use an "about box". + + You should also get your employer (if you work as a programmer) or +school, if any, to sign a "copyright disclaimer" for the program, if +necessary. For more information on this, and how to apply and follow +the GNU GPL, see <http://www.gnu.org/licenses/>. + + The GNU General Public License does not permit incorporating your +program into proprietary programs. 
If your program is a subroutine +library, you may consider it more useful to permit linking proprietary +applications with the library. If this is what you want to do, use the +GNU Lesser General Public License instead of this License. But first, +please read <http://www.gnu.org/philosophy/why-not-lgpl.html>. + + +File: gawk.info, Node: GNU Free Documentation License, Next: Index, Prev: Copying, Up: Top + +GNU Free Documentation License +****************************** + + Version 1.3, 3 November 2008 + + Copyright (C) 2000, 2001, 2002, 2007, 2008 Free Software Foundation, Inc. + <http://fsf.org/> + + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + 0. PREAMBLE + + The purpose of this License is to make a manual, textbook, or other + functional and useful document "free" in the sense of freedom: to + assure everyone the effective freedom to copy and redistribute it, + with or without modifying it, either commercially or + noncommercially. Secondarily, this License preserves for the + author and publisher a way to get credit for their work, while not + being considered responsible for modifications made by others. + + This License is a kind of "copyleft", which means that derivative + works of the document must themselves be free in the same sense. + It complements the GNU General Public License, which is a copyleft + license designed for free software. + + We have designed this License in order to use it for manuals for + free software, because free software needs free documentation: a + free program should come with manuals providing the same freedoms + that the software does. But this License is not limited to + software manuals; it can be used for any textual work, regardless + of subject matter or whether it is published as a printed book. We + recommend this License principally for works whose purpose is + instruction or reference. + + 1. APPLICABILITY AND DEFINITIONS + + This License applies to any manual or other work, in any medium, + that contains a notice placed by the copyright holder saying it can + be distributed under the terms of this License. Such a notice + grants a world-wide, royalty-free license, unlimited in duration, + to use that work under the conditions stated herein. The + "Document", below, refers to any such manual or work. Any member + of the public is a licensee, and is addressed as "you". You accept + the license if you copy, modify or distribute the work in a way + requiring permission under copyright law. + + A "Modified Version" of the Document means any work containing the + Document or a portion of it, either copied verbatim, or with + modifications and/or translated into another language. + + A "Secondary Section" is a named appendix or a front-matter section + of the Document that deals exclusively with the relationship of the + publishers or authors of the Document to the Document's overall + subject (or to related matters) and contains nothing that could + fall directly within that overall subject. (Thus, if the Document + is in part a textbook of mathematics, a Secondary Section may not + explain any mathematics.) The relationship could be a matter of + historical connection with the subject or with related matters, or + of legal, commercial, philosophical, ethical or political position + regarding them. 
+ + The "Invariant Sections" are certain Secondary Sections whose + titles are designated, as being those of Invariant Sections, in the + notice that says that the Document is released under this License. + If a section does not fit the above definition of Secondary then it + is not allowed to be designated as Invariant. The Document may + contain zero Invariant Sections. If the Document does not identify + any Invariant Sections then there are none. + + The "Cover Texts" are certain short passages of text that are + listed, as Front-Cover Texts or Back-Cover Texts, in the notice + that says that the Document is released under this License. A + Front-Cover Text may be at most 5 words, and a Back-Cover Text may + be at most 25 words. + + A "Transparent" copy of the Document means a machine-readable copy, + represented in a format whose specification is available to the + general public, that is suitable for revising the document + straightforwardly with generic text editors or (for images composed + of pixels) generic paint programs or (for drawings) some widely + available drawing editor, and that is suitable for input to text + formatters or for automatic translation to a variety of formats + suitable for input to text formatters. A copy made in an otherwise + Transparent file format whose markup, or absence of markup, has + been arranged to thwart or discourage subsequent modification by + readers is not Transparent. An image format is not Transparent if + used for any substantial amount of text. A copy that is not + "Transparent" is called "Opaque". + + Examples of suitable formats for Transparent copies include plain + ASCII without markup, Texinfo input format, LaTeX input format, + SGML or XML using a publicly available DTD, and standard-conforming + simple HTML, PostScript or PDF designed for human modification. + Examples of transparent image formats include PNG, XCF and JPG. + Opaque formats include proprietary formats that can be read and + edited only by proprietary word processors, SGML or XML for which + the DTD and/or processing tools are not generally available, and + the machine-generated HTML, PostScript or PDF produced by some word + processors for output purposes only. + + The "Title Page" means, for a printed book, the title page itself, + plus such following pages as are needed to hold, legibly, the + material this License requires to appear in the title page. For + works in formats which do not have any title page as such, "Title + Page" means the text near the most prominent appearance of the + work's title, preceding the beginning of the body of the text. + + The "publisher" means any person or entity that distributes copies + of the Document to the public. + + A section "Entitled XYZ" means a named subunit of the Document + whose title either is precisely XYZ or contains XYZ in parentheses + following text that translates XYZ in another language. (Here XYZ + stands for a specific section name mentioned below, such as + "Acknowledgements", "Dedications", "Endorsements", or "History".) + To "Preserve the Title" of such a section when you modify the + Document means that it remains a section "Entitled XYZ" according + to this definition. + + The Document may include Warranty Disclaimers next to the notice + which states that this License applies to the Document. 
These + Warranty Disclaimers are considered to be included by reference in + this License, but only as regards disclaiming warranties: any other + implication that these Warranty Disclaimers may have is void and + has no effect on the meaning of this License. + + 2. VERBATIM COPYING + + You may copy and distribute the Document in any medium, either + commercially or noncommercially, provided that this License, the + copyright notices, and the license notice saying this License + applies to the Document are reproduced in all copies, and that you + add no other conditions whatsoever to those of this License. You + may not use technical measures to obstruct or control the reading + or further copying of the copies you make or distribute. However, + you may accept compensation in exchange for copies. If you + distribute a large enough number of copies you must also follow the + conditions in section 3. + + You may also lend copies, under the same conditions stated above, + and you may publicly display copies. + + 3. COPYING IN QUANTITY + + If you publish printed copies (or copies in media that commonly + have printed covers) of the Document, numbering more than 100, and + the Document's license notice requires Cover Texts, you must + enclose the copies in covers that carry, clearly and legibly, all + these Cover Texts: Front-Cover Texts on the front cover, and + Back-Cover Texts on the back cover. Both covers must also clearly + and legibly identify you as the publisher of these copies. The + front cover must present the full title with all words of the title + equally prominent and visible. You may add other material on the + covers in addition. Copying with changes limited to the covers, as + long as they preserve the title of the Document and satisfy these + conditions, can be treated as verbatim copying in other respects. + + If the required texts for either cover are too voluminous to fit + legibly, you should put the first ones listed (as many as fit + reasonably) on the actual cover, and continue the rest onto + adjacent pages. + + If you publish or distribute Opaque copies of the Document + numbering more than 100, you must either include a machine-readable + Transparent copy along with each Opaque copy, or state in or with + each Opaque copy a computer-network location from which the general + network-using public has access to download using public-standard + network protocols a complete Transparent copy of the Document, free + of added material. If you use the latter option, you must take + reasonably prudent steps, when you begin distribution of Opaque + copies in quantity, to ensure that this Transparent copy will + remain thus accessible at the stated location until at least one + year after the last time you distribute an Opaque copy (directly or + through your agents or retailers) of that edition to the public. + + It is requested, but not required, that you contact the authors of + the Document well before redistributing any large number of copies, + to give them a chance to provide you with an updated version of the + Document. + + 4. MODIFICATIONS + + You may copy and distribute a Modified Version of the Document + under the conditions of sections 2 and 3 above, provided that you + release the Modified Version under precisely this License, with the + Modified Version filling the role of the Document, thus licensing + distribution and modification of the Modified Version to whoever + possesses a copy of it. 
In addition, you must do these things in + the Modified Version: + + A. Use in the Title Page (and on the covers, if any) a title + distinct from that of the Document, and from those of previous + versions (which should, if there were any, be listed in the + History section of the Document). You may use the same title + as a previous version if the original publisher of that + version gives permission. + + B. List on the Title Page, as authors, one or more persons or + entities responsible for authorship of the modifications in + the Modified Version, together with at least five of the + principal authors of the Document (all of its principal + authors, if it has fewer than five), unless they release you + from this requirement. + + C. State on the Title page the name of the publisher of the + Modified Version, as the publisher. + + D. Preserve all the copyright notices of the Document. + + E. Add an appropriate copyright notice for your modifications + adjacent to the other copyright notices. + + F. Include, immediately after the copyright notices, a license + notice giving the public permission to use the Modified + Version under the terms of this License, in the form shown in + the Addendum below. + + G. Preserve in that license notice the full lists of Invariant + Sections and required Cover Texts given in the Document's + license notice. + + H. Include an unaltered copy of this License. + + I. Preserve the section Entitled "History", Preserve its Title, + and add to it an item stating at least the title, year, new + authors, and publisher of the Modified Version as given on the + Title Page. If there is no section Entitled "History" in the + Document, create one stating the title, year, authors, and + publisher of the Document as given on its Title Page, then add + an item describing the Modified Version as stated in the + previous sentence. + + J. Preserve the network location, if any, given in the Document + for public access to a Transparent copy of the Document, and + likewise the network locations given in the Document for + previous versions it was based on. These may be placed in the + "History" section. You may omit a network location for a work + that was published at least four years before the Document + itself, or if the original publisher of the version it refers + to gives permission. + + K. For any section Entitled "Acknowledgements" or "Dedications", + Preserve the Title of the section, and preserve in the section + all the substance and tone of each of the contributor + acknowledgements and/or dedications given therein. + + L. Preserve all the Invariant Sections of the Document, unaltered + in their text and in their titles. Section numbers or the + equivalent are not considered part of the section titles. + + M. Delete any section Entitled "Endorsements". Such a section + may not be included in the Modified Version. + + N. Do not retitle any existing section to be Entitled + "Endorsements" or to conflict in title with any Invariant + Section. + + O. Preserve any Warranty Disclaimers. + + If the Modified Version includes new front-matter sections or + appendices that qualify as Secondary Sections and contain no + material copied from the Document, you may at your option designate + some or all of these sections as invariant. To do this, add their + titles to the list of Invariant Sections in the Modified Version's + license notice. These titles must be distinct from any other + section titles. 
+ + You may add a section Entitled "Endorsements", provided it contains + nothing but endorsements of your Modified Version by various + parties--for example, statements of peer review or that the text + has been approved by an organization as the authoritative + definition of a standard. + + You may add a passage of up to five words as a Front-Cover Text, + and a passage of up to 25 words as a Back-Cover Text, to the end of + the list of Cover Texts in the Modified Version. Only one passage + of Front-Cover Text and one of Back-Cover Text may be added by (or + through arrangements made by) any one entity. If the Document + already includes a cover text for the same cover, previously added + by you or by arrangement made by the same entity you are acting on + behalf of, you may not add another; but you may replace the old + one, on explicit permission from the previous publisher that added + the old one. + + The author(s) and publisher(s) of the Document do not by this + License give permission to use their names for publicity for or to + assert or imply endorsement of any Modified Version. + + 5. COMBINING DOCUMENTS + + You may combine the Document with other documents released under + this License, under the terms defined in section 4 above for + modified versions, provided that you include in the combination all + of the Invariant Sections of all of the original documents, + unmodified, and list them all as Invariant Sections of your + combined work in its license notice, and that you preserve all + their Warranty Disclaimers. + + The combined work need only contain one copy of this License, and + multiple identical Invariant Sections may be replaced with a single + copy. If there are multiple Invariant Sections with the same name + but different contents, make the title of each such section unique + by adding at the end of it, in parentheses, the name of the + original author or publisher of that section if known, or else a + unique number. Make the same adjustment to the section titles in + the list of Invariant Sections in the license notice of the + combined work. + + In the combination, you must combine any sections Entitled + "History" in the various original documents, forming one section + Entitled "History"; likewise combine any sections Entitled + "Acknowledgements", and any sections Entitled "Dedications". You + must delete all sections Entitled "Endorsements." + + 6. COLLECTIONS OF DOCUMENTS + + You may make a collection consisting of the Document and other + documents released under this License, and replace the individual + copies of this License in the various documents with a single copy + that is included in the collection, provided that you follow the + rules of this License for verbatim copying of each of the documents + in all other respects. + + You may extract a single document from such a collection, and + distribute it individually under this License, provided you insert + a copy of this License into the extracted document, and follow this + License in all other respects regarding verbatim copying of that + document. + + 7. AGGREGATION WITH INDEPENDENT WORKS + + A compilation of the Document or its derivatives with other + separate and independent documents or works, in or on a volume of a + storage or distribution medium, is called an "aggregate" if the + copyright resulting from the compilation is not used to limit the + legal rights of the compilation's users beyond what the individual + works permit. 
When the Document is included in an aggregate, this + License does not apply to the other works in the aggregate which + are not themselves derivative works of the Document. + + If the Cover Text requirement of section 3 is applicable to these + copies of the Document, then if the Document is less than one half + of the entire aggregate, the Document's Cover Texts may be placed + on covers that bracket the Document within the aggregate, or the + electronic equivalent of covers if the Document is in electronic + form. Otherwise they must appear on printed covers that bracket + the whole aggregate. + + 8. TRANSLATION + + Translation is considered a kind of modification, so you may + distribute translations of the Document under the terms of section + 4. Replacing Invariant Sections with translations requires special + permission from their copyright holders, but you may include + translations of some or all Invariant Sections in addition to the + original versions of these Invariant Sections. You may include a + translation of this License, and all the license notices in the + Document, and any Warranty Disclaimers, provided that you also + include the original English version of this License and the + original versions of those notices and disclaimers. In case of a + disagreement between the translation and the original version of + this License or a notice or disclaimer, the original version will + prevail. + + If a section in the Document is Entitled "Acknowledgements", + "Dedications", or "History", the requirement (section 4) to + Preserve its Title (section 1) will typically require changing the + actual title. + + 9. TERMINATION + + You may not copy, modify, sublicense, or distribute the Document + except as expressly provided under this License. Any attempt + otherwise to copy, modify, sublicense, or distribute it is void, + and will automatically terminate your rights under this License. + + However, if you cease all violation of this License, then your + license from a particular copyright holder is reinstated (a) + provisionally, unless and until the copyright holder explicitly and + finally terminates your license, and (b) permanently, if the + copyright holder fails to notify you of the violation by some + reasonable means prior to 60 days after the cessation. + + Moreover, your license from a particular copyright holder is + reinstated permanently if the copyright holder notifies you of the + violation by some reasonable means, this is the first time you have + received notice of violation of this License (for any work) from + that copyright holder, and you cure the violation prior to 30 days + after your receipt of the notice. + + Termination of your rights under this section does not terminate + the licenses of parties who have received copies or rights from you + under this License. If your rights have been terminated and not + permanently reinstated, receipt of a copy of some or all of the + same material does not give you any rights to use it. + + 10. FUTURE REVISIONS OF THIS LICENSE + + The Free Software Foundation may publish new, revised versions of + the GNU Free Documentation License from time to time. Such new + versions will be similar in spirit to the present version, but may + differ in detail to address new problems or concerns. See + <http://www.gnu.org/copyleft/>. + + Each version of the License is given a distinguishing version + number. 
If the Document specifies that a particular numbered + version of this License "or any later version" applies to it, you + have the option of following the terms and conditions either of + that specified version or of any later version that has been + published (not as a draft) by the Free Software Foundation. If the + Document does not specify a version number of this License, you may + choose any version ever published (not as a draft) by the Free + Software Foundation. If the Document specifies that a proxy can + decide which future versions of this License can be used, that + proxy's public statement of acceptance of a version permanently + authorizes you to choose that version for the Document. + + 11. RELICENSING + + "Massive Multiauthor Collaboration Site" (or "MMC Site") means any + World Wide Web server that publishes copyrightable works and also + provides prominent facilities for anybody to edit those works. A + public wiki that anybody can edit is an example of such a server. + A "Massive Multiauthor Collaboration" (or "MMC") contained in the + site means any set of copyrightable works thus published on the MMC + site. + + "CC-BY-SA" means the Creative Commons Attribution-Share Alike 3.0 + license published by Creative Commons Corporation, a not-for-profit + corporation with a principal place of business in San Francisco, + California, as well as future copyleft versions of that license + published by that same organization. + + "Incorporate" means to publish or republish a Document, in whole or + in part, as part of another Document. + + An MMC is "eligible for relicensing" if it is licensed under this + License, and if all works that were first published under this + License somewhere other than this MMC, and subsequently + incorporated in whole or in part into the MMC, (1) had no cover + texts or invariant sections, and (2) were thus incorporated prior + to November 1, 2008. + + The operator of an MMC Site may republish an MMC contained in the + site under CC-BY-SA on the same site at any time before August 1, + 2009, provided the MMC is eligible for relicensing. + +ADDENDUM: How to use this License for your documents +==================================================== + +To use this License in a document you have written, include a copy of +the License in the document and put the following copyright and license +notices just after the title page: + + Copyright (C) YEAR YOUR NAME. + Permission is granted to copy, distribute and/or modify this document + under the terms of the GNU Free Documentation License, Version 1.3 + or any later version published by the Free Software Foundation; + with no Invariant Sections, no Front-Cover Texts, and no Back-Cover + Texts. A copy of the license is included in the section entitled ``GNU + Free Documentation License''. + + If you have Invariant Sections, Front-Cover Texts and Back-Cover +Texts, replace the "with...Texts." line with this: + + with the Invariant Sections being LIST THEIR TITLES, with + the Front-Cover Texts being LIST, and with the Back-Cover Texts + being LIST. + + If you have Invariant Sections without Cover Texts, or some other +combination of the three, merge those two alternatives to suit the +situation. + + If your document contains nontrivial examples of program code, we +recommend releasing these examples in parallel under your choice of free +software license, such as the GNU General Public License, to permit +their use in free software. 
+ + +File: gawk.info, Node: Index, Prev: GNU Free Documentation License, Up: Top + +Index +***** + + +* Menu: + +* ! (exclamation point), ! operator: Boolean Ops. (line 69) +* ! (exclamation point), ! operator <1>: Precedence. (line 51) +* ! (exclamation point), ! operator <2>: Ranges. (line 47) +* ! (exclamation point), ! operator <3>: Egrep Program. (line 174) +* ! (exclamation point), != operator: Comparison Operators. + (line 11) +* ! (exclamation point), != operator <1>: Precedence. (line 64) +* ! (exclamation point), !~ operator: Regexp Usage. (line 19) +* ! (exclamation point), !~ operator <1>: Computed Regexps. (line 6) +* ! (exclamation point), !~ operator <2>: Case-sensitivity. (line 26) +* ! (exclamation point), !~ operator <3>: Regexp Constants. (line 6) +* ! (exclamation point), !~ operator <4>: Comparison Operators. + (line 11) +* ! (exclamation point), !~ operator <5>: Comparison Operators. + (line 98) +* ! (exclamation point), !~ operator <6>: Precedence. (line 79) +* ! (exclamation point), !~ operator <7>: Expression Patterns. + (line 24) +* " (double quote), in regexp constants: Computed Regexps. (line 30) +* " (double quote), in shell commands: Quoting. (line 54) +* # (number sign), #! (executable scripts): Executable Scripts. + (line 6) +* # (number sign), commenting: Comments. (line 6) +* $ (dollar sign), $ field operator: Fields. (line 19) +* $ (dollar sign), $ field operator <1>: Precedence. (line 42) +* $ (dollar sign), incrementing fields and arrays: Increment Ops. + (line 30) +* $ (dollar sign), regexp operator: Regexp Operators. (line 35) +* % (percent sign), % operator: Precedence. (line 54) +* % (percent sign), %= operator: Assignment Ops. (line 129) +* % (percent sign), %= operator <1>: Precedence. (line 94) +* & (ampersand), && operator: Boolean Ops. (line 59) +* & (ampersand), && operator <1>: Precedence. (line 85) +* & (ampersand), gsub()/gensub()/sub() functions and: Gory Details. + (line 6) +* ' (single quote): One-shot. (line 15) +* ' (single quote) in gawk command lines: Long. (line 35) +* ' (single quote), in shell commands: Quoting. (line 48) +* ' (single quote), vs. apostrophe: Comments. (line 27) +* ' (single quote), with double quotes: Quoting. (line 73) +* () (parentheses), in a profile: Profiling. (line 146) +* () (parentheses), regexp operator: Regexp Operators. (line 81) +* * (asterisk), * operator, as multiplication operator: Precedence. + (line 54) +* * (asterisk), * operator, as regexp operator: Regexp Operators. + (line 89) +* * (asterisk), * operator, null strings, matching: String Functions. + (line 537) +* * (asterisk), ** operator: Arithmetic Ops. (line 81) +* * (asterisk), ** operator <1>: Precedence. (line 48) +* * (asterisk), **= operator: Assignment Ops. (line 129) +* * (asterisk), **= operator <1>: Precedence. (line 94) +* * (asterisk), *= operator: Assignment Ops. (line 129) +* * (asterisk), *= operator <1>: Precedence. (line 94) +* + (plus sign), + operator: Precedence. (line 51) +* + (plus sign), + operator <1>: Precedence. (line 57) +* + (plus sign), ++ operator: Increment Ops. (line 11) +* + (plus sign), ++ operator <1>: Increment Ops. (line 40) +* + (plus sign), ++ operator <2>: Precedence. (line 45) +* + (plus sign), += operator: Assignment Ops. (line 81) +* + (plus sign), += operator <1>: Precedence. (line 94) +* + (plus sign), regexp operator: Regexp Operators. (line 105) +* , (comma), in range patterns: Ranges. (line 6) +* - (hyphen), - operator: Precedence. (line 51) +* - (hyphen), - operator <1>: Precedence. 
(line 57) +* - (hyphen), -- operator: Increment Ops. (line 48) +* - (hyphen), -- operator <1>: Precedence. (line 45) +* - (hyphen), -= operator: Assignment Ops. (line 129) +* - (hyphen), -= operator <1>: Precedence. (line 94) +* - (hyphen), filenames beginning with: Options. (line 60) +* - (hyphen), in bracket expressions: Bracket Expressions. (line 25) +* --assign option: Options. (line 32) +* --bignum option: Options. (line 203) +* --characters-as-bytes option: Options. (line 69) +* --copyright option: Options. (line 89) +* --debug option: Options. (line 108) +* --disable-extensions configuration option: Additional Configuration Options. + (line 9) +* --disable-lint configuration option: Additional Configuration Options. + (line 15) +* --disable-nls configuration option: Additional Configuration Options. + (line 32) +* --dump-variables option: Options. (line 94) +* --dump-variables option, using for library functions: Library Names. + (line 45) +* --exec option: Options. (line 125) +* --field-separator option: Options. (line 21) +* --file option: Options. (line 25) +* --gen-pot option: Options. (line 147) +* --gen-pot option <1>: String Extraction. (line 6) +* --gen-pot option <2>: String Extraction. (line 6) +* --help option: Options. (line 154) +* --include option: Options. (line 159) +* --lint option: Command Line. (line 20) +* --lint option <1>: Options. (line 184) +* --lint-old option: Options. (line 299) +* --load option: Options. (line 172) +* --no-optimize option: Options. (line 285) +* --non-decimal-data option: Options. (line 209) +* --non-decimal-data option <1>: Nondecimal Data. (line 6) +* --non-decimal-data option, strtonum() function and: Nondecimal Data. + (line 35) +* --optimize option: Options. (line 234) +* --posix option: Options. (line 257) +* --posix option, --traditional option and: Options. (line 272) +* --pretty-print option: Options. (line 223) +* --profile option: Options. (line 245) +* --profile option <1>: Profiling. (line 12) +* --re-interval option: Options. (line 278) +* --sandbox option: Options. (line 290) +* --sandbox option, disabling system() function: I/O Functions. + (line 129) +* --sandbox option, input redirection with getline: Getline. (line 19) +* --sandbox option, output redirection with print, printf: Redirection. + (line 6) +* --source option: Options. (line 117) +* --traditional option: Options. (line 82) +* --traditional option, --posix option and: Options. (line 272) +* --use-lc-numeric option: Options. (line 218) +* --version option: Options. (line 304) +* --with-whiny-user-strftime configuration option: Additional Configuration Options. + (line 37) +* -b option: Options. (line 69) +* -c option: Options. (line 82) +* -C option: Options. (line 89) +* -d option: Options. (line 94) +* -D option: Options. (line 108) +* -e option: Options. (line 117) +* -E option: Options. (line 125) +* -e option <1>: Options. (line 340) +* -f option: Long. (line 12) +* -F option: Options. (line 21) +* -f option <1>: Options. (line 25) +* -F option, -Ft sets FS to TAB: Options. (line 312) +* -F option, command-line: Command Line Field Separator. + (line 6) +* -f option, multiple uses: Options. (line 317) +* -g option: Options. (line 147) +* -h option: Options. (line 154) +* -i option: Options. (line 159) +* -l option: Options. (line 172) +* -l option <1>: Options. (line 184) +* -L option: Options. (line 299) +* -M option: Options. (line 203) +* -n option: Options. (line 209) +* -N option: Options. (line 218) +* -o option: Options. 
(line 223) +* -O option: Options. (line 234) +* -p option: Options. (line 245) +* -P option: Options. (line 257) +* -r option: Options. (line 278) +* -s option: Options. (line 285) +* -S option: Options. (line 290) +* -v option: Options. (line 32) +* -V option: Options. (line 304) +* -v option <1>: Assignment Options. (line 12) +* -W option: Options. (line 47) +* . (period), regexp operator: Regexp Operators. (line 44) +* .gmo files: Explaining gettext. (line 42) +* .gmo files, specifying directory of: Explaining gettext. (line 54) +* .gmo files, specifying directory of <1>: Programmer i18n. (line 48) +* .mo files, converting from .po: I18N Example. (line 66) +* .po files: Explaining gettext. (line 37) +* .po files <1>: Translator i18n. (line 6) +* .po files, converting to .mo: I18N Example. (line 66) +* .pot files: Explaining gettext. (line 31) +* / (forward slash) to enclose regular expressions: Regexp. (line 10) +* / (forward slash), / operator: Precedence. (line 54) +* / (forward slash), /= operator: Assignment Ops. (line 129) +* / (forward slash), /= operator <1>: Precedence. (line 94) +* / (forward slash), /= operator, vs. /=.../ regexp constant: Assignment Ops. + (line 149) +* / (forward slash), patterns and: Expression Patterns. (line 24) +* /= operator vs. /=.../ regexp constant: Assignment Ops. (line 149) +* /dev/... special files: Special FD. (line 48) +* /dev/fd/N special files (gawk): Special FD. (line 48) +* /inet/... special files (gawk): TCP/IP Networking. (line 6) +* /inet4/... special files (gawk): TCP/IP Networking. (line 6) +* /inet6/... special files (gawk): TCP/IP Networking. (line 6) +* ; (semicolon), AWKPATH variable and: PC Using. (line 9) +* ; (semicolon), separating statements in actions: Statements/Lines. + (line 90) +* ; (semicolon), separating statements in actions <1>: Action Overview. + (line 19) +* ; (semicolon), separating statements in actions <2>: Statements. + (line 10) +* < (left angle bracket), < operator: Comparison Operators. + (line 11) +* < (left angle bracket), < operator <1>: Precedence. (line 64) +* < (left angle bracket), < operator (I/O): Getline/File. (line 6) +* < (left angle bracket), <= operator: Comparison Operators. + (line 11) +* < (left angle bracket), <= operator <1>: Precedence. (line 64) +* = (equals sign), = operator: Assignment Ops. (line 6) +* = (equals sign), == operator: Comparison Operators. + (line 11) +* = (equals sign), == operator <1>: Precedence. (line 64) +* > (right angle bracket), > operator: Comparison Operators. + (line 11) +* > (right angle bracket), > operator <1>: Precedence. (line 64) +* > (right angle bracket), > operator (I/O): Redirection. (line 22) +* > (right angle bracket), >= operator: Comparison Operators. + (line 11) +* > (right angle bracket), >= operator <1>: Precedence. (line 64) +* > (right angle bracket), >> operator (I/O): Redirection. (line 50) +* > (right angle bracket), >> operator (I/O) <1>: Precedence. (line 64) +* ? (question mark), ?: operator: Precedence. (line 91) +* ? (question mark), regexp operator: Regexp Operators. (line 111) +* ? (question mark), regexp operator <1>: GNU Regexp Operators. + (line 62) +* @-notation for indirect function calls: Indirect Calls. (line 47) +* @include directive: Include Files. (line 8) +* @load directive: Loading Shared Libraries. + (line 8) +* [] (square brackets), regexp operator: Regexp Operators. (line 56) +* \ (backslash): Comments. (line 50) +* \ (backslash), as field separator: Command Line Field Separator. 
+ (line 24) +* \ (backslash), continuing lines and: Statements/Lines. (line 19) +* \ (backslash), continuing lines and, comments and: Statements/Lines. + (line 75) +* \ (backslash), continuing lines and, in csh: Statements/Lines. + (line 43) +* \ (backslash), gsub()/gensub()/sub() functions and: Gory Details. + (line 6) +* \ (backslash), in bracket expressions: Bracket Expressions. (line 25) +* \ (backslash), in escape sequences: Escape Sequences. (line 6) +* \ (backslash), in escape sequences <1>: Escape Sequences. (line 103) +* \ (backslash), in escape sequences, POSIX and: Escape Sequences. + (line 108) +* \ (backslash), in regexp constants: Computed Regexps. (line 30) +* \ (backslash), in shell commands: Quoting. (line 48) +* \ (backslash), regexp operator: Regexp Operators. (line 18) +* \ (backslash), \" escape sequence: Escape Sequences. (line 85) +* \ (backslash), \' operator (gawk): GNU Regexp Operators. + (line 59) +* \ (backslash), \/ escape sequence: Escape Sequences. (line 76) +* \ (backslash), \< operator (gawk): GNU Regexp Operators. + (line 33) +* \ (backslash), \> operator (gawk): GNU Regexp Operators. + (line 37) +* \ (backslash), \a escape sequence: Escape Sequences. (line 34) +* \ (backslash), \b escape sequence: Escape Sequences. (line 38) +* \ (backslash), \B operator (gawk): GNU Regexp Operators. + (line 46) +* \ (backslash), \f escape sequence: Escape Sequences. (line 41) +* \ (backslash), \n escape sequence: Escape Sequences. (line 44) +* \ (backslash), \NNN escape sequence: Escape Sequences. (line 56) +* \ (backslash), \r escape sequence: Escape Sequences. (line 47) +* \ (backslash), \s operator (gawk): GNU Regexp Operators. + (line 13) +* \ (backslash), \S operator (gawk): GNU Regexp Operators. + (line 17) +* \ (backslash), \t escape sequence: Escape Sequences. (line 50) +* \ (backslash), \v escape sequence: Escape Sequences. (line 53) +* \ (backslash), \w operator (gawk): GNU Regexp Operators. + (line 22) +* \ (backslash), \W operator (gawk): GNU Regexp Operators. + (line 28) +* \ (backslash), \x escape sequence: Escape Sequences. (line 61) +* \ (backslash), \y operator (gawk): GNU Regexp Operators. + (line 41) +* \ (backslash), \` operator (gawk): GNU Regexp Operators. + (line 57) +* ^ (caret), in bracket expressions: Bracket Expressions. (line 25) +* ^ (caret), in FS: Regexp Field Splitting. + (line 59) +* ^ (caret), regexp operator: Regexp Operators. (line 22) +* ^ (caret), regexp operator <1>: GNU Regexp Operators. + (line 62) +* ^ (caret), ^ operator: Precedence. (line 48) +* ^ (caret), ^= operator: Assignment Ops. (line 129) +* ^ (caret), ^= operator <1>: Precedence. (line 94) +* _ (underscore), C macro: Explaining gettext. (line 71) +* _ (underscore), in names of private variables: Library Names. + (line 29) +* _ (underscore), translatable string: Programmer i18n. (line 69) +* _gr_init() user-defined function: Group Functions. (line 83) +* _ord_init() user-defined function: Ordinal Functions. (line 16) +* _pw_init() user-defined function: Passwd Functions. (line 105) +* {} (braces): Profiling. (line 142) +* {} (braces), actions and: Action Overview. (line 19) +* {} (braces), statements, grouping: Statements. (line 10) +* | (vertical bar): Regexp Operators. (line 70) +* | (vertical bar), | operator (I/O): Getline/Pipe. (line 10) +* | (vertical bar), | operator (I/O) <1>: Redirection. (line 57) +* | (vertical bar), | operator (I/O) <2>: Precedence. (line 64) +* | (vertical bar), |& operator (I/O): Getline/Coprocess. 
(line 6) +* | (vertical bar), |& operator (I/O) <1>: Redirection. (line 96) +* | (vertical bar), |& operator (I/O) <2>: Precedence. (line 64) +* | (vertical bar), |& operator (I/O) <3>: Two-way I/O. (line 27) +* | (vertical bar), |& operator (I/O), pipes, closing: Close Files And Pipes. + (line 120) +* | (vertical bar), || operator: Boolean Ops. (line 59) +* | (vertical bar), || operator <1>: Precedence. (line 88) +* ~ (tilde), ~ operator: Regexp Usage. (line 19) +* ~ (tilde), ~ operator <1>: Computed Regexps. (line 6) +* ~ (tilde), ~ operator <2>: Case-sensitivity. (line 26) +* ~ (tilde), ~ operator <3>: Regexp Constants. (line 6) +* ~ (tilde), ~ operator <4>: Comparison Operators. + (line 11) +* ~ (tilde), ~ operator <5>: Comparison Operators. + (line 98) +* ~ (tilde), ~ operator <6>: Precedence. (line 79) +* ~ (tilde), ~ operator <7>: Expression Patterns. (line 24) +* accessing fields: Fields. (line 6) +* accessing global variables from extensions: Symbol Table Access. + (line 6) +* account information: Passwd Functions. (line 16) +* account information <1>: Group Functions. (line 6) +* actions: Action Overview. (line 6) +* actions, control statements in: Statements. (line 6) +* actions, default: Very Simple. (line 35) +* actions, empty: Very Simple. (line 40) +* Ada programming language: Glossary. (line 11) +* adding, features to gawk: Adding Code. (line 6) +* adding, fields: Changing Fields. (line 53) +* advanced features, fixed-width data: Constant Size. (line 6) +* advanced features, gawk: Advanced Features. (line 6) +* advanced features, network programming: TCP/IP Networking. (line 6) +* advanced features, nondecimal input data: Nondecimal Data. (line 6) +* advanced features, processes, communicating with: Two-way I/O. + (line 6) +* advanced features, specifying field content: Splitting By Content. + (line 9) +* Aho, Alfred: History. (line 17) +* Aho, Alfred <1>: Contributors. (line 12) +* alarm clock example program: Alarm Program. (line 11) +* alarm.awk program: Alarm Program. (line 31) +* algorithms: Basic High Level. (line 57) +* allocating memory for extensions: Memory Allocation Functions. + (line 6) +* amazing awk assembler (aaa): Glossary. (line 16) +* amazingly workable formatter (awf): Glossary. (line 24) +* ambiguity, syntactic: /= operator vs. /=.../ regexp constant: Assignment Ops. + (line 149) +* ampersand (&), && operator: Boolean Ops. (line 59) +* ampersand (&), && operator <1>: Precedence. (line 85) +* ampersand (&), gsub()/gensub()/sub() functions and: Gory Details. + (line 6) +* anagram.awk program: Anagram Program. (line 21) +* anagrams, finding: Anagram Program. (line 6) +* and: Bitwise Functions. (line 40) +* AND bitwise operation: Bitwise Functions. (line 6) +* and Boolean-logic operator: Boolean Ops. (line 6) +* ANSI: Glossary. (line 34) +* API informational variables: Extension API Informational Variables. + (line 6) +* API version: Extension Versioning. + (line 6) +* arbitrary precision: Arbitrary Precision Arithmetic. + (line 6) +* arbitrary precision integers: Arbitrary Precision Integers. + (line 6) +* archaeologists: Bugs. (line 6) +* arctangent: Numeric Functions. (line 12) +* ARGC/ARGV variables: Auto-set. (line 15) +* ARGC/ARGV variables, command-line arguments: Other Arguments. + (line 15) +* ARGC/ARGV variables, how to use: ARGC and ARGV. (line 6) +* ARGC/ARGV variables, portability and: Executable Scripts. (line 59) +* ARGIND variable: Auto-set. (line 44) +* ARGIND variable, command-line arguments: Other Arguments. 
(line 15) +* arguments, command-line: Other Arguments. (line 6) +* arguments, command-line <1>: Auto-set. (line 15) +* arguments, command-line <2>: ARGC and ARGV. (line 6) +* arguments, command-line, invoking awk: Command Line. (line 6) +* arguments, in function calls: Function Calls. (line 18) +* arguments, processing: Getopt Function. (line 6) +* ARGV array, indexing into: Other Arguments. (line 15) +* arithmetic operators: Arithmetic Ops. (line 6) +* array manipulation in extensions: Array Manipulation. (line 6) +* array members: Reference to Elements. + (line 6) +* array scanning order, controlling: Controlling Scanning. + (line 14) +* array, number of elements: String Functions. (line 200) +* arrays: Arrays. (line 6) +* arrays of arrays: Arrays of Arrays. (line 6) +* arrays, an example of using: Array Example. (line 6) +* arrays, and IGNORECASE variable: Array Intro. (line 100) +* arrays, as parameters to functions: Pass By Value/Reference. + (line 44) +* arrays, associative: Array Intro. (line 48) +* arrays, associative, library functions and: Library Names. (line 58) +* arrays, deleting entire contents: Delete. (line 39) +* arrays, elements that don't exist: Reference to Elements. + (line 23) +* arrays, elements, assigning values: Assigning Elements. (line 6) +* arrays, elements, deleting: Delete. (line 6) +* arrays, elements, order of access by in operator: Scanning an Array. + (line 48) +* arrays, elements, retrieving number of: String Functions. (line 42) +* arrays, for statement and: Scanning an Array. (line 20) +* arrays, indexing: Array Intro. (line 48) +* arrays, merging into strings: Join Function. (line 6) +* arrays, multidimensional: Multidimensional. (line 10) +* arrays, multidimensional, scanning: Multiscanning. (line 11) +* arrays, numeric subscripts: Numeric Array Subscripts. + (line 6) +* arrays, referencing elements: Reference to Elements. + (line 6) +* arrays, scanning: Scanning an Array. (line 6) +* arrays, sorting: Array Sorting Functions. + (line 6) +* arrays, sorting, and IGNORECASE variable: Array Sorting Functions. + (line 83) +* arrays, sparse: Array Intro. (line 76) +* arrays, subscripts, uninitialized variables as: Uninitialized Subscripts. + (line 6) +* arrays, unassigned elements: Reference to Elements. + (line 18) +* artificial intelligence, gawk and: Distribution contents. + (line 52) +* ASCII: Ordinal Functions. (line 45) +* ASCII <1>: Glossary. (line 196) +* asort: String Functions. (line 42) +* asort <1>: Array Sorting Functions. + (line 6) +* asort() function (gawk), arrays, sorting: Array Sorting Functions. + (line 6) +* asorti: String Functions. (line 42) +* asorti <1>: Array Sorting Functions. + (line 6) +* asorti() function (gawk), arrays, sorting: Array Sorting Functions. + (line 6) +* assert() function (C library): Assert Function. (line 6) +* assert() user-defined function: Assert Function. (line 28) +* assertions: Assert Function. (line 6) +* assign values to variables, in debugger: Viewing And Changing Data. + (line 58) +* assignment operators: Assignment Ops. (line 6) +* assignment operators, evaluation order: Assignment Ops. (line 110) +* assignment operators, lvalues/rvalues: Assignment Ops. (line 31) +* assignments as filenames: Ignoring Assigns. (line 6) +* associative arrays: Array Intro. (line 48) +* asterisk (*), * operator, as multiplication operator: Precedence. + (line 54) +* asterisk (*), * operator, as regexp operator: Regexp Operators. + (line 89) +* asterisk (*), * operator, null strings, matching: String Functions. 
+ (line 537) +* asterisk (*), ** operator: Arithmetic Ops. (line 81) +* asterisk (*), ** operator <1>: Precedence. (line 48) +* asterisk (*), **= operator: Assignment Ops. (line 129) +* asterisk (*), **= operator <1>: Precedence. (line 94) +* asterisk (*), *= operator: Assignment Ops. (line 129) +* asterisk (*), *= operator <1>: Precedence. (line 94) +* atan2: Numeric Functions. (line 12) +* automatic displays, in debugger: Debugger Info. (line 24) +* awf (amazingly workable formatter) program: Glossary. (line 24) +* awk debugging, enabling: Options. (line 108) +* awk language, POSIX version: Assignment Ops. (line 138) +* awk profiling, enabling: Options. (line 245) +* awk programs: Getting Started. (line 12) +* awk programs <1>: Executable Scripts. (line 6) +* awk programs <2>: Two Rules. (line 6) +* awk programs, complex: When. (line 27) +* awk programs, documenting: Comments. (line 6) +* awk programs, documenting <1>: Library Names. (line 6) +* awk programs, examples of: Sample Programs. (line 6) +* awk programs, execution of: Next Statement. (line 16) +* awk programs, internationalizing: I18N Functions. (line 6) +* awk programs, internationalizing <1>: Programmer i18n. (line 6) +* awk programs, lengthy: Long. (line 6) +* awk programs, lengthy, assertions: Assert Function. (line 6) +* awk programs, location of: Options. (line 25) +* awk programs, location of <1>: Options. (line 125) +* awk programs, location of <2>: Options. (line 159) +* awk programs, one-line examples: Very Simple. (line 46) +* awk programs, profiling: Profiling. (line 6) +* awk programs, running: Running gawk. (line 6) +* awk programs, running <1>: Long. (line 6) +* awk programs, running, from shell scripts: One-shot. (line 22) +* awk programs, running, without input files: Read Terminal. (line 16) +* awk programs, shell variables in: Using Shell Variables. + (line 6) +* awk, function of: Getting Started. (line 6) +* awk, gawk and: Preface. (line 21) +* awk, gawk and <1>: This Manual. (line 14) +* awk, history of: History. (line 17) +* awk, implementation issues, pipes: Redirection. (line 129) +* awk, implementations: Other Versions. (line 6) +* awk, implementations, limits: Getline Notes. (line 14) +* awk, invoking: Command Line. (line 6) +* awk, new vs. old: Names. (line 6) +* awk, new vs. old, OFMT variable: Strings And Numbers. (line 56) +* awk, POSIX and: Preface. (line 21) +* awk, POSIX and, See Also POSIX awk: Preface. (line 21) +* awk, regexp constants and: Comparison Operators. + (line 103) +* awk, See Also gawk: Preface. (line 34) +* awk, terms describing: This Manual. (line 6) +* awk, uses for: Preface. (line 21) +* awk, uses for <1>: Getting Started. (line 12) +* awk, uses for <2>: When. (line 6) +* awk, versions of: V7/SVR3.1. (line 6) +* awk, versions of, changes between SVR3.1 and SVR4: SVR4. (line 6) +* awk, versions of, changes between SVR4 and POSIX awk: POSIX. + (line 6) +* awk, versions of, changes between V7 and SVR3.1: V7/SVR3.1. (line 6) +* awk, versions of, See Also Brian Kernighan's awk: BTL. (line 6) +* awk, versions of, See Also Brian Kernighan's awk <1>: Other Versions. + (line 13) +* awka compiler for awk: Other Versions. (line 68) +* AWKLIBPATH environment variable: AWKLIBPATH Variable. (line 6) +* AWKPATH environment variable: AWKPATH Variable. (line 6) +* AWKPATH environment variable <1>: PC Using. (line 9) +* awkprof.out file: Profiling. (line 6) +* awksed.awk program: Simple Sed. (line 25) +* awkvars.out file: Options. 
(line 94) +* b debugger command (alias for break): Breakpoint Control. (line 11) +* backslash (\): Comments. (line 50) +* backslash (\), as field separator: Command Line Field Separator. + (line 24) +* backslash (\), continuing lines and: Statements/Lines. (line 19) +* backslash (\), continuing lines and, comments and: Statements/Lines. + (line 75) +* backslash (\), continuing lines and, in csh: Statements/Lines. + (line 43) +* backslash (\), gsub()/gensub()/sub() functions and: Gory Details. + (line 6) +* backslash (\), in bracket expressions: Bracket Expressions. (line 25) +* backslash (\), in escape sequences: Escape Sequences. (line 6) +* backslash (\), in escape sequences <1>: Escape Sequences. (line 103) +* backslash (\), in escape sequences, POSIX and: Escape Sequences. + (line 108) +* backslash (\), in regexp constants: Computed Regexps. (line 30) +* backslash (\), in shell commands: Quoting. (line 48) +* backslash (\), regexp operator: Regexp Operators. (line 18) +* backslash (\), \" escape sequence: Escape Sequences. (line 85) +* backslash (\), \' operator (gawk): GNU Regexp Operators. + (line 59) +* backslash (\), \/ escape sequence: Escape Sequences. (line 76) +* backslash (\), \< operator (gawk): GNU Regexp Operators. + (line 33) +* backslash (\), \> operator (gawk): GNU Regexp Operators. + (line 37) +* backslash (\), \a escape sequence: Escape Sequences. (line 34) +* backslash (\), \b escape sequence: Escape Sequences. (line 38) +* backslash (\), \B operator (gawk): GNU Regexp Operators. + (line 46) +* backslash (\), \f escape sequence: Escape Sequences. (line 41) +* backslash (\), \n escape sequence: Escape Sequences. (line 44) +* backslash (\), \NNN escape sequence: Escape Sequences. (line 56) +* backslash (\), \r escape sequence: Escape Sequences. (line 47) +* backslash (\), \s operator (gawk): GNU Regexp Operators. + (line 13) +* backslash (\), \S operator (gawk): GNU Regexp Operators. + (line 17) +* backslash (\), \t escape sequence: Escape Sequences. (line 50) +* backslash (\), \v escape sequence: Escape Sequences. (line 53) +* backslash (\), \w operator (gawk): GNU Regexp Operators. + (line 22) +* backslash (\), \W operator (gawk): GNU Regexp Operators. + (line 28) +* backslash (\), \x escape sequence: Escape Sequences. (line 61) +* backslash (\), \y operator (gawk): GNU Regexp Operators. + (line 41) +* backslash (\), \` operator (gawk): GNU Regexp Operators. + (line 57) +* backtrace debugger command: Execution Stack. (line 13) +* Beebe, Nelson H.F.: Acknowledgments. (line 60) +* Beebe, Nelson H.F. <1>: Other Versions. (line 82) +* BEGIN pattern: Field Separators. (line 44) +* BEGIN pattern <1>: BEGIN/END. (line 6) +* BEGIN pattern <2>: Using BEGIN/END. (line 6) +* BEGIN pattern, and profiling: Profiling. (line 62) +* BEGIN pattern, assert() user-defined function and: Assert Function. + (line 83) +* BEGIN pattern, Boolean patterns and: Expression Patterns. (line 70) +* BEGIN pattern, exit statement and: Exit Statement. (line 12) +* BEGIN pattern, getline and: Getline Notes. (line 19) +* BEGIN pattern, headings, adding: Print Examples. (line 42) +* BEGIN pattern, next/nextfile statements and: I/O And BEGIN/END. + (line 36) +* BEGIN pattern, next/nextfile statements and <1>: Next Statement. + (line 44) +* BEGIN pattern, OFS/ORS variables, assigning values to: Output Separators. + (line 20) +* BEGIN pattern, operators and: Using BEGIN/END. (line 17) +* BEGIN pattern, print statement and: I/O And BEGIN/END. (line 15) +* BEGIN pattern, pwcat program: Passwd Functions. 
(line 143) +* BEGIN pattern, running awk programs and: Cut Program. (line 63) +* BEGIN pattern, TEXTDOMAIN variable and: Programmer i18n. (line 60) +* BEGINFILE pattern: BEGINFILE/ENDFILE. (line 6) +* BEGINFILE pattern, Boolean patterns and: Expression Patterns. + (line 70) +* beginfile() user-defined function: Filetrans Function. (line 62) +* Bentley, Jon: Glossary. (line 206) +* Benzinger, Michael: Contributors. (line 98) +* Berry, Karl: Acknowledgments. (line 33) +* Berry, Karl <1>: Acknowledgments. (line 75) +* Berry, Karl <2>: Ranges and Locales. (line 74) +* binary input/output: User-modified. (line 15) +* bindtextdomain: I18N Functions. (line 11) +* bindtextdomain <1>: Programmer i18n. (line 48) +* bindtextdomain() function (C library): Explaining gettext. (line 50) +* bindtextdomain() function (gawk), portability and: I18N Portability. + (line 33) +* BINMODE variable: User-modified. (line 15) +* BINMODE variable <1>: PC Using. (line 16) +* bit-manipulation functions: Bitwise Functions. (line 6) +* bits2str() user-defined function: Bitwise Functions. (line 69) +* bitwise AND: Bitwise Functions. (line 40) +* bitwise complement: Bitwise Functions. (line 44) +* bitwise OR: Bitwise Functions. (line 50) +* bitwise XOR: Bitwise Functions. (line 57) +* bitwise, complement: Bitwise Functions. (line 25) +* bitwise, operations: Bitwise Functions. (line 6) +* bitwise, shift: Bitwise Functions. (line 32) +* body, in actions: Statements. (line 10) +* body, in loops: While Statement. (line 14) +* Boolean expressions: Boolean Ops. (line 6) +* Boolean expressions, as patterns: Expression Patterns. (line 39) +* Boolean operators, See Boolean expressions: Boolean Ops. (line 6) +* Bourne shell, quoting rules for: Quoting. (line 18) +* braces ({}): Profiling. (line 142) +* braces ({}), actions and: Action Overview. (line 19) +* braces ({}), statements, grouping: Statements. (line 10) +* bracket expressions: Regexp Operators. (line 56) +* bracket expressions <1>: Bracket Expressions. (line 6) +* bracket expressions, character classes: Bracket Expressions. + (line 40) +* bracket expressions, collating elements: Bracket Expressions. + (line 86) +* bracket expressions, collating symbols: Bracket Expressions. + (line 93) +* bracket expressions, complemented: Regexp Operators. (line 64) +* bracket expressions, equivalence classes: Bracket Expressions. + (line 99) +* bracket expressions, non-ASCII: Bracket Expressions. (line 86) +* bracket expressions, range expressions: Bracket Expressions. + (line 6) +* break debugger command: Breakpoint Control. (line 11) +* break statement: Break Statement. (line 6) +* breakpoint: Debugging Terms. (line 33) +* breakpoint at location, how to delete: Breakpoint Control. (line 36) +* breakpoint commands: Debugger Execution Control. + (line 10) +* breakpoint condition: Breakpoint Control. (line 54) +* breakpoint, delete by number: Breakpoint Control. (line 64) +* breakpoint, how to disable or enable: Breakpoint Control. (line 69) +* breakpoint, setting: Breakpoint Control. (line 11) +* Brennan, Michael: Foreword3. (line 84) +* Brennan, Michael <1>: Foreword4. (line 33) +* Brennan, Michael <2>: Acknowledgments. (line 79) +* Brennan, Michael <3>: Delete. (line 56) +* Brennan, Michael <4>: Simple Sed. (line 25) +* Brennan, Michael <5>: Other Versions. (line 6) +* Brennan, Michael <6>: Other Versions. (line 48) +* Brian Kernighan's awk: When. (line 21) +* Brian Kernighan's awk <1>: Escape Sequences. (line 112) +* Brian Kernighan's awk <2>: GNU Regexp Operators. 
+ (line 85) +* Brian Kernighan's awk <3>: Regexp Field Splitting. + (line 67) +* Brian Kernighan's awk <4>: Getline/Pipe. (line 62) +* Brian Kernighan's awk <5>: Concatenation. (line 36) +* Brian Kernighan's awk <6>: I/O And BEGIN/END. (line 15) +* Brian Kernighan's awk <7>: Break Statement. (line 51) +* Brian Kernighan's awk <8>: Continue Statement. (line 44) +* Brian Kernighan's awk <9>: Nextfile Statement. (line 47) +* Brian Kernighan's awk <10>: Delete. (line 51) +* Brian Kernighan's awk <11>: String Functions. (line 493) +* Brian Kernighan's awk <12>: Gory Details. (line 19) +* Brian Kernighan's awk <13>: I/O Functions. (line 43) +* Brian Kernighan's awk, extensions: BTL. (line 6) +* Brian Kernighan's awk, source code: Other Versions. (line 13) +* Brini, Davide: Signature Program. (line 6) +* Brink, Jeroen: DOS Quoting. (line 10) +* Broder, Alan J.: Contributors. (line 89) +* Brown, Martin: Contributors. (line 83) +* BSD-based operating systems: Glossary. (line 748) +* bt debugger command (alias for backtrace): Execution Stack. (line 13) +* Buening, Andreas: Acknowledgments. (line 60) +* Buening, Andreas <1>: Contributors. (line 93) +* Buening, Andreas <2>: Maintainers. (line 14) +* buffering, input/output: I/O Functions. (line 166) +* buffering, input/output <1>: Two-way I/O. (line 53) +* buffering, interactive vs. noninteractive: I/O Functions. (line 76) +* buffers, flushing: I/O Functions. (line 32) +* buffers, flushing <1>: I/O Functions. (line 166) +* buffers, operators for: GNU Regexp Operators. + (line 51) +* bug reports, email address, bug-gawk@gnu.org: Bug address. (line 22) +* bug-gawk@gnu.org bug reporting address: Bug address. (line 22) +* built-in functions: Functions. (line 6) +* built-in functions, evaluation order: Calling Built-in. (line 30) +* BusyBox Awk: Other Versions. (line 92) +* c.e., See common extensions: Conventions. (line 51) +* call by reference: Pass By Value/Reference. + (line 44) +* call by value: Pass By Value/Reference. + (line 15) +* call stack, display in debugger: Execution Stack. (line 13) +* caret (^), in bracket expressions: Bracket Expressions. (line 25) +* caret (^), regexp operator: Regexp Operators. (line 22) +* caret (^), regexp operator <1>: GNU Regexp Operators. + (line 62) +* caret (^), ^ operator: Precedence. (line 48) +* caret (^), ^= operator: Assignment Ops. (line 129) +* caret (^), ^= operator <1>: Precedence. (line 94) +* case keyword: Switch Statement. (line 6) +* case sensitivity, and regexps: User-modified. (line 76) +* case sensitivity, and string comparisons: User-modified. (line 76) +* case sensitivity, array indices and: Array Intro. (line 100) +* case sensitivity, converting case: String Functions. (line 523) +* case sensitivity, example programs: Library Functions. (line 53) +* case sensitivity, gawk: Case-sensitivity. (line 26) +* case sensitivity, regexps and: Case-sensitivity. (line 6) +* CGI, awk scripts for: Options. (line 125) +* character classes, See bracket expressions: Regexp Operators. + (line 56) +* character lists in regular expression: Bracket Expressions. (line 6) +* character lists, See bracket expressions: Regexp Operators. (line 56) +* character sets (machine character encodings): Ordinal Functions. + (line 45) +* character sets (machine character encodings) <1>: Glossary. (line 196) +* character sets, See Also bracket expressions: Regexp Operators. + (line 56) +* characters, counting: Wc Program. (line 6) +* characters, transliterating: Translate Program. 
(line 6) +* characters, values of as numbers: Ordinal Functions. (line 6) +* Chassell, Robert J.: Acknowledgments. (line 33) +* chdir() extension function: Extension Sample File Functions. + (line 12) +* chem utility: Glossary. (line 206) +* chr() extension function: Extension Sample Ord. + (line 15) +* chr() user-defined function: Ordinal Functions. (line 16) +* clear debugger command: Breakpoint Control. (line 36) +* Cliff random numbers: Cliff Random Function. + (line 6) +* cliff_rand() user-defined function: Cliff Random Function. + (line 12) +* close: Close Files And Pipes. + (line 18) +* close <1>: I/O Functions. (line 10) +* close file or coprocess: I/O Functions. (line 10) +* close() function, portability: Close Files And Pipes. + (line 81) +* close() function, return value: Close Files And Pipes. + (line 132) +* close() function, two-way pipes and: Two-way I/O. (line 60) +* Close, Diane: Manual History. (line 34) +* Close, Diane <1>: Contributors. (line 21) +* Collado, Manuel: Acknowledgments. (line 60) +* collating elements: Bracket Expressions. (line 86) +* collating symbols: Bracket Expressions. (line 93) +* Colombo, Antonio: Acknowledgments. (line 60) +* Colombo, Antonio <1>: Contributors. (line 141) +* columns, aligning: Print Examples. (line 69) +* columns, cutting: Cut Program. (line 6) +* comma (,), in range patterns: Ranges. (line 6) +* command completion, in debugger: Readline Support. (line 6) +* command line, arguments: Other Arguments. (line 6) +* command line, arguments <1>: Auto-set. (line 15) +* command line, arguments <2>: ARGC and ARGV. (line 6) +* command line, directories on: Command-line directories. + (line 6) +* command line, formats: Running gawk. (line 12) +* command line, FS on, setting: Command Line Field Separator. + (line 6) +* command line, invoking awk from: Command Line. (line 6) +* command line, option -f: Long. (line 12) +* command line, options: Options. (line 6) +* command line, options, end of: Options. (line 55) +* command line, variables, assigning on: Assignment Options. (line 6) +* command-line options, processing: Getopt Function. (line 6) +* command-line options, string extraction: String Extraction. (line 6) +* commands debugger command: Debugger Execution Control. + (line 10) +* commands to execute at breakpoint: Debugger Execution Control. + (line 10) +* commenting: Comments. (line 6) +* commenting, backslash continuation and: Statements/Lines. (line 75) +* common extensions, ** operator: Arithmetic Ops. (line 30) +* common extensions, **= operator: Assignment Ops. (line 138) +* common extensions, /dev/stderr special file: Special FD. (line 48) +* common extensions, /dev/stdin special file: Special FD. (line 48) +* common extensions, /dev/stdout special file: Special FD. (line 48) +* common extensions, BINMODE variable: PC Using. (line 16) +* common extensions, delete to delete entire arrays: Delete. (line 39) +* common extensions, func keyword: Definition Syntax. (line 99) +* common extensions, length() applied to an array: String Functions. + (line 200) +* common extensions, RS as a regexp: gawk split records. (line 6) +* common extensions, single character fields: Single Character Fields. + (line 6) +* common extensions, \x escape sequence: Escape Sequences. (line 61) +* comp.lang.awk newsgroup: Usenet. (line 11) +* comparison expressions: Typing and Comparison. + (line 9) +* comparison expressions, as patterns: Expression Patterns. (line 14) +* comparison expressions, string vs. regexp: Comparison Operators. 
+ (line 79) +* compatibility mode (gawk), extensions: POSIX/GNU. (line 6) +* compatibility mode (gawk), file names: Special Caveats. (line 9) +* compatibility mode (gawk), hexadecimal numbers: Nondecimal-numbers. + (line 59) +* compatibility mode (gawk), octal numbers: Nondecimal-numbers. + (line 59) +* compatibility mode (gawk), specifying: Options. (line 82) +* compiled programs: Basic High Level. (line 13) +* compiled programs <1>: Glossary. (line 218) +* compiling gawk for Cygwin: Cygwin. (line 6) +* compiling gawk for MS-Windows: PC Compiling. (line 11) +* compiling gawk for VMS: VMS Compilation. (line 6) +* compl: Bitwise Functions. (line 44) +* complement, bitwise: Bitwise Functions. (line 25) +* compound statements, control statements and: Statements. (line 10) +* concatenating: Concatenation. (line 9) +* condition debugger command: Breakpoint Control. (line 54) +* conditional expressions: Conditional Exp. (line 6) +* configuration option, --disable-extensions: Additional Configuration Options. + (line 9) +* configuration option, --disable-lint: Additional Configuration Options. + (line 15) +* configuration option, --disable-nls: Additional Configuration Options. + (line 32) +* configuration option, --with-whiny-user-strftime: Additional Configuration Options. + (line 37) +* configuration options, gawk: Additional Configuration Options. + (line 6) +* constant regexps: Regexp Usage. (line 57) +* constants, nondecimal: Nondecimal Data. (line 6) +* constants, numeric: Scalar Constants. (line 6) +* constants, types of: Constants. (line 6) +* continue program, in debugger: Debugger Execution Control. + (line 33) +* continue statement: Continue Statement. (line 6) +* control statements: Statements. (line 6) +* controlling array scanning order: Controlling Scanning. + (line 14) +* convert string to lower case: String Functions. (line 524) +* convert string to number: String Functions. (line 391) +* convert string to upper case: String Functions. (line 530) +* converting integer array subscripts: Numeric Array Subscripts. + (line 31) +* converting, dates to timestamps: Time Functions. (line 76) +* converting, numbers to strings: Strings And Numbers. (line 6) +* converting, numbers to strings <1>: Bitwise Functions. (line 108) +* converting, strings to numbers: Strings And Numbers. (line 6) +* converting, strings to numbers <1>: Bitwise Functions. (line 108) +* CONVFMT variable: Strings And Numbers. (line 29) +* CONVFMT variable <1>: User-modified. (line 30) +* CONVFMT variable, and array subscripts: Numeric Array Subscripts. + (line 6) +* cookie: Glossary. (line 257) +* coprocesses: Redirection. (line 96) +* coprocesses <1>: Two-way I/O. (line 27) +* coprocesses, closing: Close Files And Pipes. + (line 6) +* coprocesses, getline from: Getline/Coprocess. (line 6) +* cos: Numeric Functions. (line 16) +* cosine: Numeric Functions. (line 16) +* counting: Wc Program. (line 6) +* csh utility: Statements/Lines. (line 43) +* csh utility, POSIXLY_CORRECT environment variable: Options. (line 358) +* csh utility, |& operator, comparison with: Two-way I/O. (line 27) +* ctime() user-defined function: Function Example. (line 74) +* currency symbols, localization: Explaining gettext. (line 104) +* current system time: Time Functions. (line 66) +* custom.h file: Configuration Philosophy. + (line 30) +* customized input parser: Input Parsers. (line 6) +* customized output wrapper: Output Wrappers. (line 6) +* customized two-way processor: Two-way processors. (line 6) +* cut utility: Cut Program. 
(line 6) +* cut utility <1>: Cut Program. (line 6) +* cut.awk program: Cut Program. (line 45) +* d debugger command (alias for delete): Breakpoint Control. (line 64) +* d.c., See dark corner: Conventions. (line 42) +* dark corner: Conventions. (line 42) +* dark corner <1>: Glossary. (line 268) +* dark corner, "0" is actually true: Truth Values. (line 24) +* dark corner, /= operator vs. /=.../ regexp constant: Assignment Ops. + (line 149) +* dark corner, array subscripts: Uninitialized Subscripts. + (line 43) +* dark corner, break statement: Break Statement. (line 51) +* dark corner, close() function: Close Files And Pipes. + (line 132) +* dark corner, command-line arguments: Assignment Options. (line 43) +* dark corner, continue statement: Continue Statement. (line 44) +* dark corner, CONVFMT variable: Strings And Numbers. (line 39) +* dark corner, escape sequences: Other Arguments. (line 38) +* dark corner, escape sequences, for metacharacters: Escape Sequences. + (line 144) +* dark corner, exit statement: Exit Statement. (line 30) +* dark corner, field separators: Full Line Fields. (line 22) +* dark corner, FILENAME variable: Getline Notes. (line 19) +* dark corner, FILENAME variable <1>: Auto-set. (line 108) +* dark corner, FNR/NR variables: Auto-set. (line 357) +* dark corner, format-control characters: Control Letters. (line 18) +* dark corner, format-control characters <1>: Control Letters. + (line 93) +* dark corner, FS as null string: Single Character Fields. + (line 20) +* dark corner, input files: awk split records. (line 110) +* dark corner, invoking awk: Command Line. (line 16) +* dark corner, length() function: String Functions. (line 186) +* dark corner, locale's decimal point character: Locale influences conversions. + (line 17) +* dark corner, multiline records: Multiple Line. (line 35) +* dark corner, NF variable, decrementing: Changing Fields. (line 107) +* dark corner, OFMT variable: OFMT. (line 27) +* dark corner, regexp as second argument to index(): String Functions. + (line 164) +* dark corner, regexp constants: Using Constant Regexps. + (line 6) +* dark corner, regexp constants, /= operator and: Assignment Ops. + (line 149) +* dark corner, regexp constants, as arguments to user-defined functions: Using Constant Regexps. + (line 43) +* dark corner, split() function: String Functions. (line 361) +* dark corner, strings, storing: gawk split records. (line 82) +* dark corner, value of ARGV[0]: Auto-set. (line 39) +* dark corner, ^, in FS: Regexp Field Splitting. + (line 59) +* data, fixed-width: Constant Size. (line 6) +* data-driven languages: Basic High Level. (line 74) +* database, group, reading: Group Functions. (line 6) +* database, users, reading: Passwd Functions. (line 6) +* date utility, GNU: Time Functions. (line 17) +* date utility, POSIX: Time Functions. (line 253) +* dates, converting to timestamps: Time Functions. (line 76) +* dates, information related to, localization: Explaining gettext. + (line 112) +* Davies, Stephen: Acknowledgments. (line 60) +* Davies, Stephen <1>: Contributors. (line 75) +* Day, Robert P.J.: Acknowledgments. (line 79) +* dcgettext: I18N Functions. (line 21) +* dcgettext <1>: Programmer i18n. (line 20) +* dcgettext() function (gawk), portability and: I18N Portability. + (line 33) +* dcngettext: I18N Functions. (line 27) +* dcngettext <1>: Programmer i18n. (line 37) +* dcngettext() function (gawk), portability and: I18N Portability. + (line 33) +* deadlocks: Two-way I/O. 
(line 53) +* debugger commands, b (break): Breakpoint Control. (line 11) +* debugger commands, backtrace: Execution Stack. (line 13) +* debugger commands, break: Breakpoint Control. (line 11) +* debugger commands, bt (backtrace): Execution Stack. (line 13) +* debugger commands, c (continue): Debugger Execution Control. + (line 33) +* debugger commands, clear: Breakpoint Control. (line 36) +* debugger commands, commands: Debugger Execution Control. + (line 10) +* debugger commands, condition: Breakpoint Control. (line 54) +* debugger commands, continue: Debugger Execution Control. + (line 33) +* debugger commands, d (delete): Breakpoint Control. (line 64) +* debugger commands, delete: Breakpoint Control. (line 64) +* debugger commands, disable: Breakpoint Control. (line 69) +* debugger commands, display: Viewing And Changing Data. + (line 8) +* debugger commands, down: Execution Stack. (line 23) +* debugger commands, dump: Miscellaneous Debugger Commands. + (line 9) +* debugger commands, e (enable): Breakpoint Control. (line 73) +* debugger commands, enable: Breakpoint Control. (line 73) +* debugger commands, end: Debugger Execution Control. + (line 10) +* debugger commands, eval: Viewing And Changing Data. + (line 23) +* debugger commands, f (frame): Execution Stack. (line 27) +* debugger commands, finish: Debugger Execution Control. + (line 39) +* debugger commands, frame: Execution Stack. (line 27) +* debugger commands, h (help): Miscellaneous Debugger Commands. + (line 69) +* debugger commands, help: Miscellaneous Debugger Commands. + (line 69) +* debugger commands, i (info): Debugger Info. (line 13) +* debugger commands, ignore: Breakpoint Control. (line 87) +* debugger commands, info: Debugger Info. (line 13) +* debugger commands, l (list): Miscellaneous Debugger Commands. + (line 75) +* debugger commands, list: Miscellaneous Debugger Commands. + (line 75) +* debugger commands, n (next): Debugger Execution Control. + (line 43) +* debugger commands, next: Debugger Execution Control. + (line 43) +* debugger commands, nexti: Debugger Execution Control. + (line 49) +* debugger commands, ni (nexti): Debugger Execution Control. + (line 49) +* debugger commands, o (option): Debugger Info. (line 57) +* debugger commands, option: Debugger Info. (line 57) +* debugger commands, p (print): Viewing And Changing Data. + (line 35) +* debugger commands, print: Viewing And Changing Data. + (line 35) +* debugger commands, printf: Viewing And Changing Data. + (line 53) +* debugger commands, q (quit): Miscellaneous Debugger Commands. + (line 102) +* debugger commands, quit: Miscellaneous Debugger Commands. + (line 102) +* debugger commands, r (run): Debugger Execution Control. + (line 62) +* debugger commands, return: Debugger Execution Control. + (line 54) +* debugger commands, run: Debugger Execution Control. + (line 62) +* debugger commands, s (step): Debugger Execution Control. + (line 68) +* debugger commands, set: Viewing And Changing Data. + (line 58) +* debugger commands, si (stepi): Debugger Execution Control. + (line 75) +* debugger commands, silent: Debugger Execution Control. + (line 10) +* debugger commands, step: Debugger Execution Control. + (line 68) +* debugger commands, stepi: Debugger Execution Control. + (line 75) +* debugger commands, t (tbreak): Breakpoint Control. (line 90) +* debugger commands, tbreak: Breakpoint Control. (line 90) +* debugger commands, trace: Miscellaneous Debugger Commands. + (line 110) +* debugger commands, u (until): Debugger Execution Control. 
+ (line 82) +* debugger commands, undisplay: Viewing And Changing Data. + (line 79) +* debugger commands, until: Debugger Execution Control. + (line 82) +* debugger commands, unwatch: Viewing And Changing Data. + (line 83) +* debugger commands, up: Execution Stack. (line 36) +* debugger commands, w (watch): Viewing And Changing Data. + (line 66) +* debugger commands, watch: Viewing And Changing Data. + (line 66) +* debugger commands, where (backtrace): Execution Stack. (line 13) +* debugger default list amount: Debugger Info. (line 69) +* debugger history file: Debugger Info. (line 81) +* debugger history size: Debugger Info. (line 65) +* debugger options: Debugger Info. (line 57) +* debugger prompt: Debugger Info. (line 78) +* debugger, how to start: Debugger Invocation. (line 6) +* debugger, read commands from a file: Debugger Info. (line 97) +* debugging awk programs: Debugger. (line 6) +* debugging gawk, bug reports: Bugs. (line 9) +* decimal point character, locale specific: Options. (line 269) +* decrement operators: Increment Ops. (line 35) +* default keyword: Switch Statement. (line 6) +* Deifik, Scott: Acknowledgments. (line 60) +* Deifik, Scott <1>: Contributors. (line 54) +* Deifik, Scott <2>: Maintainers. (line 14) +* delete ARRAY: Delete. (line 39) +* delete breakpoint at location: Breakpoint Control. (line 36) +* delete breakpoint by number: Breakpoint Control. (line 64) +* delete debugger command: Breakpoint Control. (line 64) +* delete statement: Delete. (line 6) +* delete watchpoint: Viewing And Changing Data. + (line 83) +* deleting elements in arrays: Delete. (line 6) +* deleting entire arrays: Delete. (line 39) +* Demaille, Akim: Acknowledgments. (line 60) +* describe call stack frame, in debugger: Debugger Info. (line 27) +* differences between gawk and awk: String Functions. (line 200) +* differences in awk and gawk, ARGC/ARGV variables: ARGC and ARGV. + (line 89) +* differences in awk and gawk, ARGIND variable: Auto-set. (line 44) +* differences in awk and gawk, array elements, deleting: Delete. + (line 39) +* differences in awk and gawk, AWKLIBPATH environment variable: AWKLIBPATH Variable. + (line 6) +* differences in awk and gawk, AWKPATH environment variable: AWKPATH Variable. + (line 6) +* differences in awk and gawk, BEGIN/END patterns: I/O And BEGIN/END. + (line 15) +* differences in awk and gawk, BEGINFILE/ENDFILE patterns: BEGINFILE/ENDFILE. + (line 6) +* differences in awk and gawk, BINMODE variable: User-modified. + (line 15) +* differences in awk and gawk, BINMODE variable <1>: PC Using. + (line 16) +* differences in awk and gawk, close() function: Close Files And Pipes. + (line 81) +* differences in awk and gawk, close() function <1>: Close Files And Pipes. + (line 132) +* differences in awk and gawk, command-line directories: Command-line directories. + (line 6) +* differences in awk and gawk, ERRNO variable: Auto-set. (line 87) +* differences in awk and gawk, error messages: Special FD. (line 19) +* differences in awk and gawk, FIELDWIDTHS variable: User-modified. + (line 37) +* differences in awk and gawk, FPAT variable: User-modified. (line 43) +* differences in awk and gawk, FUNCTAB variable: Auto-set. (line 134) +* differences in awk and gawk, function arguments (gawk): Calling Built-in. + (line 16) +* differences in awk and gawk, getline command: Getline. (line 19) +* differences in awk and gawk, IGNORECASE variable: User-modified. + (line 76) +* differences in awk and gawk, implementation limitations: Getline Notes. 
+ (line 14) +* differences in awk and gawk, implementation limitations <1>: Redirection. + (line 129) +* differences in awk and gawk, indirect function calls: Indirect Calls. + (line 6) +* differences in awk and gawk, input/output operators: Getline/Coprocess. + (line 6) +* differences in awk and gawk, input/output operators <1>: Redirection. + (line 96) +* differences in awk and gawk, line continuations: Conditional Exp. + (line 34) +* differences in awk and gawk, LINT variable: User-modified. (line 87) +* differences in awk and gawk, match() function: String Functions. + (line 262) +* differences in awk and gawk, print/printf statements: Format Modifiers. + (line 13) +* differences in awk and gawk, PROCINFO array: Auto-set. (line 148) +* differences in awk and gawk, read timeouts: Read Timeout. (line 6) +* differences in awk and gawk, record separators: awk split records. + (line 124) +* differences in awk and gawk, regexp constants: Using Constant Regexps. + (line 43) +* differences in awk and gawk, regular expressions: Case-sensitivity. + (line 26) +* differences in awk and gawk, retrying input: Retrying Input. + (line 6) +* differences in awk and gawk, RS/RT variables: gawk split records. + (line 58) +* differences in awk and gawk, RT variable: Auto-set. (line 295) +* differences in awk and gawk, single-character fields: Single Character Fields. + (line 6) +* differences in awk and gawk, split() function: String Functions. + (line 348) +* differences in awk and gawk, strings: Scalar Constants. (line 20) +* differences in awk and gawk, strings, storing: gawk split records. + (line 76) +* differences in awk and gawk, SYMTAB variable: Auto-set. (line 299) +* differences in awk and gawk, TEXTDOMAIN variable: User-modified. + (line 152) +* differences in awk and gawk, trunc-mod operation: Arithmetic Ops. + (line 66) +* directories, command-line: Command-line directories. + (line 6) +* directories, searching: Programs Exercises. (line 70) +* directories, searching for loadable extensions: AWKLIBPATH Variable. + (line 6) +* directories, searching for source files: AWKPATH Variable. (line 6) +* disable breakpoint: Breakpoint Control. (line 69) +* disable debugger command: Breakpoint Control. (line 69) +* display debugger command: Viewing And Changing Data. + (line 8) +* display debugger options: Debugger Info. (line 57) +* division: Arithmetic Ops. (line 44) +* do-while statement: Do Statement. (line 6) +* do-while statement, use of regexps in: Regexp Usage. (line 19) +* documentation, of awk programs: Library Names. (line 6) +* documentation, online: Manual History. (line 11) +* documents, searching: Dupword Program. (line 6) +* dollar sign ($), $ field operator: Fields. (line 19) +* dollar sign ($), $ field operator <1>: Precedence. (line 42) +* dollar sign ($), incrementing fields and arrays: Increment Ops. + (line 30) +* dollar sign ($), regexp operator: Regexp Operators. (line 35) +* double quote ("), in regexp constants: Computed Regexps. (line 30) +* double quote ("), in shell commands: Quoting. (line 54) +* down debugger command: Execution Stack. (line 23) +* Drepper, Ulrich: Acknowledgments. (line 52) +* Duman, Patrice: Acknowledgments. (line 75) +* dump all variables of a program: Options. (line 94) +* dump debugger command: Miscellaneous Debugger Commands. + (line 9) +* dupword.awk program: Dupword Program. (line 31) +* dynamic profiling: Profiling. (line 177) +* dynamically loaded extensions: Dynamic Extensions. 
(line 6) +* e debugger command (alias for enable): Breakpoint Control. (line 73) +* EBCDIC: Ordinal Functions. (line 45) +* effective group ID of gawk user: Auto-set. (line 153) +* effective user ID of gawk user: Auto-set. (line 161) +* egrep utility: Bracket Expressions. (line 34) +* egrep utility <1>: Egrep Program. (line 6) +* egrep.awk program: Egrep Program. (line 53) +* elements in arrays, assigning values: Assigning Elements. (line 6) +* elements in arrays, deleting: Delete. (line 6) +* elements in arrays, order of access by in operator: Scanning an Array. + (line 48) +* elements in arrays, scanning: Scanning an Array. (line 6) +* elements of arrays: Reference to Elements. + (line 6) +* email address for bug reports, bug-gawk@gnu.org: Bug address. + (line 22) +* empty array elements: Reference to Elements. + (line 18) +* empty pattern: Empty. (line 6) +* empty strings: awk split records. (line 114) +* empty strings, See null strings: Regexp Field Splitting. + (line 43) +* EMRED: TCP/IP Networking. (line 6) +* enable breakpoint: Breakpoint Control. (line 73) +* enable debugger command: Breakpoint Control. (line 73) +* end debugger command: Debugger Execution Control. + (line 10) +* END pattern: BEGIN/END. (line 6) +* END pattern <1>: Using BEGIN/END. (line 6) +* END pattern, and profiling: Profiling. (line 62) +* END pattern, assert() user-defined function and: Assert Function. + (line 75) +* END pattern, Boolean patterns and: Expression Patterns. (line 70) +* END pattern, exit statement and: Exit Statement. (line 12) +* END pattern, next/nextfile statements and: I/O And BEGIN/END. + (line 36) +* END pattern, next/nextfile statements and <1>: Next Statement. + (line 44) +* END pattern, operators and: Using BEGIN/END. (line 17) +* END pattern, print statement and: I/O And BEGIN/END. (line 15) +* ENDFILE pattern: BEGINFILE/ENDFILE. (line 6) +* ENDFILE pattern, Boolean patterns and: Expression Patterns. (line 70) +* endfile() user-defined function: Filetrans Function. (line 62) +* endgrent() function (C library): Group Functions. (line 213) +* endgrent() user-defined function: Group Functions. (line 216) +* endpwent() function (C library): Passwd Functions. (line 208) +* endpwent() user-defined function: Passwd Functions. (line 211) +* English, Steve: Advanced Features. (line 6) +* ENVIRON array: Auto-set. (line 59) +* environment variables used by gawk: Environment Variables. + (line 6) +* environment variables, in ENVIRON array: Auto-set. (line 59) +* epoch, definition of: Glossary. (line 312) +* equals sign (=), = operator: Assignment Ops. (line 6) +* equals sign (=), == operator: Comparison Operators. + (line 11) +* equals sign (=), == operator <1>: Precedence. (line 64) +* EREs (Extended Regular Expressions): Bracket Expressions. (line 34) +* ERRNO variable: Auto-set. (line 87) +* ERRNO variable <1>: TCP/IP Networking. (line 54) +* ERRNO variable, with BEGINFILE pattern: BEGINFILE/ENDFILE. (line 26) +* ERRNO variable, with close() function: Close Files And Pipes. + (line 140) +* ERRNO variable, with getline command: Getline. (line 19) +* error handling: Special FD. (line 19) +* error handling, ERRNO variable and: Auto-set. (line 87) +* error output: Special FD. (line 6) +* escape processing, gsub()/gensub()/sub() functions: Gory Details. + (line 6) +* escape sequences, in strings: Escape Sequences. (line 6) +* eval debugger command: Viewing And Changing Data. + (line 23) +* evaluate expressions, in debugger: Viewing And Changing Data. + (line 23) +* evaluation order: Increment Ops. 
(line 60) +* evaluation order, concatenation: Concatenation. (line 41) +* evaluation order, functions: Calling Built-in. (line 30) +* examining fields: Fields. (line 6) +* exclamation point (!), ! operator: Boolean Ops. (line 69) +* exclamation point (!), ! operator <1>: Precedence. (line 51) +* exclamation point (!), ! operator <2>: Egrep Program. (line 174) +* exclamation point (!), != operator: Comparison Operators. + (line 11) +* exclamation point (!), != operator <1>: Precedence. (line 64) +* exclamation point (!), !~ operator: Regexp Usage. (line 19) +* exclamation point (!), !~ operator <1>: Computed Regexps. (line 6) +* exclamation point (!), !~ operator <2>: Case-sensitivity. (line 26) +* exclamation point (!), !~ operator <3>: Regexp Constants. (line 6) +* exclamation point (!), !~ operator <4>: Comparison Operators. + (line 11) +* exclamation point (!), !~ operator <5>: Comparison Operators. + (line 98) +* exclamation point (!), !~ operator <6>: Precedence. (line 79) +* exclamation point (!), !~ operator <7>: Expression Patterns. + (line 24) +* exit debugger command: Miscellaneous Debugger Commands. + (line 66) +* exit statement: Exit Statement. (line 6) +* exit status, of gawk: Exit Status. (line 6) +* exit status, of VMS: VMS Running. (line 28) +* exit the debugger: Miscellaneous Debugger Commands. + (line 66) +* exit the debugger <1>: Miscellaneous Debugger Commands. + (line 102) +* exp: Numeric Functions. (line 19) +* expand utility: Very Simple. (line 73) +* Expat XML parser library: gawkextlib. (line 37) +* exponent: Numeric Functions. (line 19) +* expressions: Expressions. (line 6) +* expressions, as patterns: Expression Patterns. (line 6) +* expressions, assignment: Assignment Ops. (line 6) +* expressions, Boolean: Boolean Ops. (line 6) +* expressions, comparison: Typing and Comparison. + (line 9) +* expressions, conditional: Conditional Exp. (line 6) +* expressions, matching, See comparison expressions: Typing and Comparison. + (line 9) +* expressions, selecting: Conditional Exp. (line 6) +* Extended Regular Expressions (EREs): Bracket Expressions. (line 34) +* extension API: Extension API Description. + (line 6) +* extension API informational variables: Extension API Informational Variables. + (line 6) +* extension API version: Extension Versioning. + (line 6) +* extension API, version number: Auto-set. (line 246) +* extension example: Extension Example. (line 6) +* extension registration: Registration Functions. + (line 6) +* extension search path: Finding Extensions. (line 6) +* extensions distributed with gawk: Extension Samples. (line 6) +* extensions, allocating memory: Memory Allocation Functions. + (line 6) +* extensions, Brian Kernighan's awk: BTL. (line 6) +* extensions, Brian Kernighan's awk <1>: Common Extensions. (line 6) +* extensions, common, ** operator: Arithmetic Ops. (line 30) +* extensions, common, **= operator: Assignment Ops. (line 138) +* extensions, common, /dev/stderr special file: Special FD. (line 48) +* extensions, common, /dev/stdin special file: Special FD. (line 48) +* extensions, common, /dev/stdout special file: Special FD. (line 48) +* extensions, common, BINMODE variable: PC Using. (line 16) +* extensions, common, delete to delete entire arrays: Delete. (line 39) +* extensions, common, fflush() function: I/O Functions. (line 43) +* extensions, common, func keyword: Definition Syntax. (line 99) +* extensions, common, length() applied to an array: String Functions. + (line 200) +* extensions, common, RS as a regexp: gawk split records. 
(line 6) +* extensions, common, single character fields: Single Character Fields. + (line 6) +* extensions, common, \x escape sequence: Escape Sequences. (line 61) +* extensions, in gawk, not in POSIX awk: POSIX/GNU. (line 6) +* extensions, loading, @load directive: Loading Shared Libraries. + (line 8) +* extensions, mawk: Common Extensions. (line 6) +* extensions, where to find: gawkextlib. (line 6) +* extract.awk program: Extract Program. (line 79) +* extraction, of marked strings (internationalization): String Extraction. + (line 6) +* f debugger command (alias for frame): Execution Stack. (line 27) +* false, logical: Truth Values. (line 6) +* FDL (Free Documentation License): GNU Free Documentation License. + (line 8) +* features, adding to gawk: Adding Code. (line 6) +* features, deprecated: Obsolete. (line 6) +* features, undocumented: Undocumented. (line 6) +* Fenlason, Jay: History. (line 30) +* Fenlason, Jay <1>: Contributors. (line 19) +* fflush: I/O Functions. (line 28) +* field numbers: Nonconstant Fields. (line 6) +* field operator $: Fields. (line 19) +* field operators, dollar sign as: Fields. (line 19) +* field separator, in multiline records: Multiple Line. (line 41) +* field separator, on command line: Command Line Field Separator. + (line 6) +* field separator, POSIX and: Full Line Fields. (line 16) +* field separators: Field Separators. (line 15) +* field separators <1>: User-modified. (line 50) +* field separators <2>: User-modified. (line 113) +* field separators, choice of: Field Separators. (line 50) +* field separators, FIELDWIDTHS variable and: User-modified. (line 37) +* field separators, FPAT variable and: User-modified. (line 43) +* field separators, regular expressions as: Field Separators. (line 50) +* field separators, regular expressions as <1>: Regexp Field Splitting. + (line 6) +* field separators, See Also OFS: Changing Fields. (line 64) +* field separators, spaces as: Cut Program. (line 103) +* fields: Reading Files. (line 14) +* fields <1>: Fields. (line 6) +* fields <2>: Basic High Level. (line 62) +* fields, adding: Changing Fields. (line 53) +* fields, changing contents of: Changing Fields. (line 6) +* fields, cutting: Cut Program. (line 6) +* fields, examining: Fields. (line 6) +* fields, number of: Fields. (line 33) +* fields, numbers: Nonconstant Fields. (line 6) +* fields, printing: Print Examples. (line 20) +* fields, separating: Field Separators. (line 15) +* fields, separating <1>: Field Separators. (line 15) +* fields, single-character: Single Character Fields. + (line 6) +* FIELDWIDTHS variable: Constant Size. (line 22) +* FIELDWIDTHS variable <1>: User-modified. (line 37) +* file descriptors: Special FD. (line 6) +* file inclusion, @include directive: Include Files. (line 8) +* file names, distinguishing: Auto-set. (line 55) +* file names, in compatibility mode: Special Caveats. (line 9) +* file names, standard streams in gawk: Special FD. (line 48) +* FILENAME variable: Reading Files. (line 6) +* FILENAME variable <1>: Auto-set. (line 108) +* FILENAME variable, getline, setting with: Getline Notes. (line 19) +* filenames, assignments as: Ignoring Assigns. (line 6) +* files, .gmo: Explaining gettext. (line 42) +* files, .gmo, specifying directory of: Explaining gettext. (line 54) +* files, .gmo, specifying directory of <1>: Programmer i18n. (line 48) +* files, .mo, converting from .po: I18N Example. (line 66) +* files, .po: Explaining gettext. (line 37) +* files, .po <1>: Translator i18n. 
(line 6) +* files, .po, converting to .mo: I18N Example. (line 66) +* files, .pot: Explaining gettext. (line 31) +* files, /dev/... special files: Special FD. (line 48) +* files, /inet/... (gawk): TCP/IP Networking. (line 6) +* files, /inet4/... (gawk): TCP/IP Networking. (line 6) +* files, /inet6/... (gawk): TCP/IP Networking. (line 6) +* files, awk programs in: Long. (line 6) +* files, awkprof.out: Profiling. (line 6) +* files, awkvars.out: Options. (line 94) +* files, closing: I/O Functions. (line 10) +* files, descriptors, See file descriptors: Special FD. (line 6) +* files, group: Group Functions. (line 6) +* files, initialization and cleanup: Filetrans Function. (line 6) +* files, input, See input files: Read Terminal. (line 16) +* files, log, timestamps in: Time Functions. (line 6) +* files, managing: Data File Management. + (line 6) +* files, managing, data file boundaries: Filetrans Function. (line 6) +* files, message object: Explaining gettext. (line 42) +* files, message object, converting from portable object files: I18N Example. + (line 66) +* files, message object, specifying directory of: Explaining gettext. + (line 54) +* files, message object, specifying directory of <1>: Programmer i18n. + (line 48) +* files, multiple passes over: Other Arguments. (line 56) +* files, multiple, duplicating output into: Tee Program. (line 6) +* files, output, See output files: Close Files And Pipes. + (line 6) +* files, password: Passwd Functions. (line 16) +* files, portable object: Explaining gettext. (line 37) +* files, portable object <1>: Translator i18n. (line 6) +* files, portable object template: Explaining gettext. (line 31) +* files, portable object, converting to message object files: I18N Example. + (line 66) +* files, portable object, generating: Options. (line 147) +* files, processing, ARGIND variable and: Auto-set. (line 50) +* files, reading: Rewind Function. (line 6) +* files, reading, multiline records: Multiple Line. (line 6) +* files, searching for regular expressions: Egrep Program. (line 6) +* files, skipping: File Checking. (line 6) +* files, source, search path for: Programs Exercises. (line 70) +* files, splitting: Split Program. (line 6) +* files, Texinfo, extracting programs from: Extract Program. (line 6) +* find substring in string: String Functions. (line 155) +* finding extensions: Finding Extensions. (line 6) +* finish debugger command: Debugger Execution Control. + (line 39) +* Fish, Fred: Contributors. (line 51) +* fixed-width data: Constant Size. (line 6) +* flag variables: Boolean Ops. (line 69) +* flag variables <1>: Tee Program. (line 20) +* floating-point, numbers, arbitrary precision: Arbitrary Precision Arithmetic. + (line 6) +* floating-point, VAX/VMS: VMS Running. (line 50) +* flush buffered output: I/O Functions. (line 28) +* fnmatch() extension function: Extension Sample Fnmatch. + (line 12) +* FNR variable: Records. (line 6) +* FNR variable <1>: Auto-set. (line 118) +* FNR variable, changing: Auto-set. (line 357) +* for statement: For Statement. (line 6) +* for statement, looping over arrays: Scanning an Array. (line 20) +* fork() extension function: Extension Sample Fork. + (line 11) +* format specifiers: Basic Printf. (line 15) +* format specifiers, mixing regular with positional specifiers: Printf Ordering. + (line 57) +* format specifiers, printf statement: Control Letters. (line 6) +* format specifiers, strftime() function (gawk): Time Functions. + (line 89) +* format time string: Time Functions. 
(line 48) +* formats, numeric output: OFMT. (line 6) +* formatting output: Printf. (line 6) +* formatting strings: String Functions. (line 384) +* forward slash (/) to enclose regular expressions: Regexp. (line 10) +* forward slash (/), / operator: Precedence. (line 54) +* forward slash (/), /= operator: Assignment Ops. (line 129) +* forward slash (/), /= operator <1>: Precedence. (line 94) +* forward slash (/), /= operator, vs. /=.../ regexp constant: Assignment Ops. + (line 149) +* forward slash (/), patterns and: Expression Patterns. (line 24) +* FPAT variable: Splitting By Content. + (line 25) +* FPAT variable <1>: User-modified. (line 43) +* frame debugger command: Execution Stack. (line 27) +* Free Documentation License (FDL): GNU Free Documentation License. + (line 8) +* Free Software Foundation (FSF): Manual History. (line 6) +* Free Software Foundation (FSF) <1>: Getting. (line 10) +* Free Software Foundation (FSF) <2>: Glossary. (line 372) +* Free Software Foundation (FSF) <3>: Glossary. (line 405) +* FreeBSD: Glossary. (line 748) +* FS variable: Field Separators. (line 15) +* FS variable <1>: User-modified. (line 50) +* FS variable, --field-separator option and: Options. (line 21) +* FS variable, as null string: Single Character Fields. + (line 20) +* FS variable, as TAB character: Options. (line 266) +* FS variable, changing value of: Field Separators. (line 34) +* FS variable, running awk programs and: Cut Program. (line 63) +* FS variable, setting from command line: Command Line Field Separator. + (line 6) +* FS, containing ^: Regexp Field Splitting. + (line 59) +* FS, in multiline records: Multiple Line. (line 41) +* FSF (Free Software Foundation): Manual History. (line 6) +* FSF (Free Software Foundation) <1>: Getting. (line 10) +* FSF (Free Software Foundation) <2>: Glossary. (line 372) +* FSF (Free Software Foundation) <3>: Glossary. (line 405) +* fts() extension function: Extension Sample File Functions. + (line 60) +* FUNCTAB array: Auto-set. (line 134) +* function calls: Function Calls. (line 6) +* function calls, indirect: Indirect Calls. (line 6) +* function calls, indirect, @-notation for: Indirect Calls. (line 47) +* function definition example: Function Example. (line 6) +* function pointers: Indirect Calls. (line 6) +* functions, arrays as parameters to: Pass By Value/Reference. + (line 44) +* functions, built-in: Function Calls. (line 10) +* functions, built-in <1>: Functions. (line 6) +* functions, built-in, evaluation order: Calling Built-in. (line 30) +* functions, defining: Definition Syntax. (line 10) +* functions, library: Library Functions. (line 6) +* functions, library, assertions: Assert Function. (line 6) +* functions, library, associative arrays and: Library Names. (line 58) +* functions, library, C library: Getopt Function. (line 6) +* functions, library, character values as numbers: Ordinal Functions. + (line 6) +* functions, library, Cliff random numbers: Cliff Random Function. + (line 6) +* functions, library, command-line options: Getopt Function. (line 6) +* functions, library, example program for using: Igawk Program. + (line 6) +* functions, library, group database, reading: Group Functions. + (line 6) +* functions, library, managing data files: Data File Management. + (line 6) +* functions, library, managing time: Getlocaltime Function. + (line 6) +* functions, library, merging arrays into strings: Join Function. + (line 6) +* functions, library, rounding numbers: Round Function. 
(line 6) +* functions, library, user database, reading: Passwd Functions. + (line 6) +* functions, names of: Definition Syntax. (line 24) +* functions, recursive: Definition Syntax. (line 89) +* functions, string-translation: I18N Functions. (line 6) +* functions, undefined: Pass By Value/Reference. + (line 68) +* functions, user-defined: User-defined. (line 6) +* functions, user-defined, calling: Function Caveats. (line 6) +* functions, user-defined, counts, in a profile: Profiling. (line 137) +* functions, user-defined, library of: Library Functions. (line 6) +* functions, user-defined, next/nextfile statements and: Next Statement. + (line 44) +* functions, user-defined, next/nextfile statements and <1>: Nextfile Statement. + (line 47) +* G-d: Acknowledgments. (line 94) +* G., Daniel Richard: Acknowledgments. (line 60) +* G., Daniel Richard <1>: Maintainers. (line 14) +* Garfinkle, Scott: Contributors. (line 35) +* gawk program, dynamic profiling: Profiling. (line 177) +* gawk version: Auto-set. (line 221) +* gawk, ARGIND variable in: Other Arguments. (line 15) +* gawk, awk and: Preface. (line 21) +* gawk, awk and <1>: This Manual. (line 14) +* gawk, bitwise operations in: Bitwise Functions. (line 40) +* gawk, break statement in: Break Statement. (line 51) +* gawk, character classes and: Bracket Expressions. (line 108) +* gawk, coding style in: Adding Code. (line 37) +* gawk, command-line options, and regular expressions: GNU Regexp Operators. + (line 73) +* gawk, configuring: Configuration Philosophy. + (line 6) +* gawk, configuring, options: Additional Configuration Options. + (line 6) +* gawk, continue statement in: Continue Statement. (line 44) +* gawk, distribution: Distribution contents. + (line 6) +* gawk, ERRNO variable in: Getline. (line 19) +* gawk, ERRNO variable in <1>: Close Files And Pipes. + (line 140) +* gawk, ERRNO variable in <2>: BEGINFILE/ENDFILE. (line 26) +* gawk, ERRNO variable in <3>: Auto-set. (line 87) +* gawk, ERRNO variable in <4>: TCP/IP Networking. (line 54) +* gawk, escape sequences: Escape Sequences. (line 121) +* gawk, extensions, disabling: Options. (line 257) +* gawk, features, adding: Adding Code. (line 6) +* gawk, features, advanced: Advanced Features. (line 6) +* gawk, field separators and: User-modified. (line 71) +* gawk, FIELDWIDTHS variable in: Constant Size. (line 22) +* gawk, FIELDWIDTHS variable in <1>: User-modified. (line 37) +* gawk, file names in: Special Files. (line 6) +* gawk, format-control characters: Control Letters. (line 18) +* gawk, format-control characters <1>: Control Letters. (line 93) +* gawk, FPAT variable in: Splitting By Content. + (line 25) +* gawk, FPAT variable in <1>: User-modified. (line 43) +* gawk, FUNCTAB array in: Auto-set. (line 134) +* gawk, function arguments and: Calling Built-in. (line 16) +* gawk, hexadecimal numbers and: Nondecimal-numbers. (line 41) +* gawk, IGNORECASE variable in: Case-sensitivity. (line 26) +* gawk, IGNORECASE variable in <1>: User-modified. (line 76) +* gawk, IGNORECASE variable in <2>: Array Intro. (line 100) +* gawk, IGNORECASE variable in <3>: String Functions. (line 58) +* gawk, IGNORECASE variable in <4>: Array Sorting Functions. + (line 83) +* gawk, implementation issues: Notes. (line 6) +* gawk, implementation issues, debugging: Compatibility Mode. (line 6) +* gawk, implementation issues, downward compatibility: Compatibility Mode. + (line 6) +* gawk, implementation issues, limits: Getline Notes. (line 14) +* gawk, implementation issues, pipes: Redirection. 
(line 129) +* gawk, installing: Installation. (line 6) +* gawk, internationalization and, See internationalization: Internationalization. + (line 13) +* gawk, interpreter, adding code to: Using Internal File Ops. + (line 6) +* gawk, interval expressions and: Regexp Operators. (line 139) +* gawk, line continuation in: Conditional Exp. (line 34) +* gawk, LINT variable in: User-modified. (line 87) +* gawk, list of contributors to: Contributors. (line 6) +* gawk, MS-Windows version of: PC Using. (line 9) +* gawk, newlines in: Statements/Lines. (line 12) +* gawk, octal numbers and: Nondecimal-numbers. (line 41) +* gawk, predefined variables and: Built-in Variables. (line 14) +* gawk, PROCINFO array in: Auto-set. (line 148) +* gawk, PROCINFO array in <1>: Time Functions. (line 47) +* gawk, PROCINFO array in <2>: Two-way I/O. (line 114) +* gawk, regexp constants and: Using Constant Regexps. + (line 28) +* gawk, regular expressions, case sensitivity: Case-sensitivity. + (line 26) +* gawk, regular expressions, operators: GNU Regexp Operators. + (line 6) +* gawk, regular expressions, precedence: Regexp Operators. (line 161) +* gawk, RT variable in: awk split records. (line 124) +* gawk, RT variable in <1>: Multiple Line. (line 130) +* gawk, RT variable in <2>: Auto-set. (line 295) +* gawk, See Also awk: Preface. (line 34) +* gawk, source code, obtaining: Getting. (line 6) +* gawk, splitting fields and: Constant Size. (line 86) +* gawk, string-translation functions: I18N Functions. (line 6) +* gawk, SYMTAB array in: Auto-set. (line 299) +* gawk, TEXTDOMAIN variable in: User-modified. (line 152) +* gawk, timestamps: Time Functions. (line 6) +* gawk, uses for: Preface. (line 34) +* gawk, versions of, information about, printing: Options. (line 304) +* gawk, VMS version of: VMS Installation. (line 6) +* gawk, word-boundary operator: GNU Regexp Operators. + (line 66) +* gawkextlib: gawkextlib. (line 6) +* gawkextlib project: gawkextlib. (line 6) +* gawklibpath_append shell function: Shell Startup Files. (line 29) +* gawklibpath_default shell function: Shell Startup Files. (line 22) +* gawklibpath_prepend shell function: Shell Startup Files. (line 25) +* gawkpath_append shell function: Shell Startup Files. (line 19) +* gawkpath_default shell function: Shell Startup Files. (line 12) +* gawkpath_prepend shell function: Shell Startup Files. (line 15) +* General Public License (GPL): Glossary. (line 396) +* General Public License, See GPL: Manual History. (line 11) +* generate time values: Time Functions. (line 25) +* gensub: Using Constant Regexps. + (line 43) +* gensub <1>: String Functions. (line 89) +* gensub() function (gawk), escape processing: Gory Details. (line 6) +* getaddrinfo() function (C library): TCP/IP Networking. (line 39) +* getgrent() function (C library): Group Functions. (line 6) +* getgrent() function (C library) <1>: Group Functions. (line 202) +* getgrent() user-defined function: Group Functions. (line 6) +* getgrent() user-defined function <1>: Group Functions. (line 205) +* getgrgid() function (C library): Group Functions. (line 184) +* getgrgid() user-defined function: Group Functions. (line 187) +* getgrnam() function (C library): Group Functions. (line 173) +* getgrnam() user-defined function: Group Functions. (line 178) +* getgruser() function (C library): Group Functions. (line 193) +* getgruser() function, user-defined: Group Functions. (line 196) +* getline command: Reading Files. (line 20) +* getline command, coprocesses, using from: Getline/Coprocess. 
+ (line 6) +* getline command, coprocesses, using from <1>: Close Files And Pipes. + (line 6) +* getline command, deadlock and: Two-way I/O. (line 53) +* getline command, explicit input with: Getline. (line 6) +* getline command, FILENAME variable and: Getline Notes. (line 19) +* getline command, return values: Getline. (line 19) +* getline command, variants: Getline Summary. (line 6) +* getline command, _gr_init() user-defined function: Group Functions. + (line 83) +* getline command, _pw_init() function: Passwd Functions. (line 154) +* getline from a file: Getline/File. (line 6) +* getline into a variable: Getline/Variable. (line 6) +* getline statement, BEGINFILE/ENDFILE patterns and: BEGINFILE/ENDFILE. + (line 53) +* getlocaltime() user-defined function: Getlocaltime Function. + (line 16) +* getopt() function (C library): Getopt Function. (line 15) +* getopt() user-defined function: Getopt Function. (line 108) +* getopt() user-defined function <1>: Getopt Function. (line 134) +* getpwent() function (C library): Passwd Functions. (line 16) +* getpwent() function (C library) <1>: Passwd Functions. (line 196) +* getpwent() user-defined function: Passwd Functions. (line 16) +* getpwent() user-defined function <1>: Passwd Functions. (line 200) +* getpwnam() function (C library): Passwd Functions. (line 175) +* getpwnam() user-defined function: Passwd Functions. (line 180) +* getpwuid() function (C library): Passwd Functions. (line 186) +* getpwuid() user-defined function: Passwd Functions. (line 190) +* gettext library: Explaining gettext. (line 6) +* gettext library, locale categories: Explaining gettext. (line 81) +* gettext() function (C library): Explaining gettext. (line 63) +* gettimeofday() extension function: Extension Sample Time. + (line 12) +* git utility: gawkextlib. (line 31) +* git utility <1>: Other Versions. (line 29) +* git utility <2>: Accessing The Source. + (line 10) +* git utility <3>: Adding Code. (line 112) +* Git, use of for gawk source code: Derived Files. (line 6) +* GNITS mailing list: Acknowledgments. (line 52) +* GNU awk, See gawk: Preface. (line 51) +* GNU Free Documentation License: GNU Free Documentation License. + (line 8) +* GNU General Public License: Glossary. (line 396) +* GNU Lesser General Public License: Glossary. (line 491) +* GNU long options: Command Line. (line 13) +* GNU long options <1>: Options. (line 6) +* GNU long options, printing list of: Options. (line 154) +* GNU Project: Manual History. (line 11) +* GNU Project <1>: Glossary. (line 405) +* GNU/Linux: Manual History. (line 28) +* GNU/Linux <1>: I18N Example. (line 57) +* GNU/Linux <2>: Glossary. (line 748) +* Gordon, Assaf: Contributors. (line 106) +* GPL (General Public License): Manual History. (line 11) +* GPL (General Public License) <1>: Glossary. (line 396) +* GPL (General Public License), printing: Options. (line 89) +* grcat program: Group Functions. (line 16) +* Grigera, Juan: Contributors. (line 58) +* group database, reading: Group Functions. (line 6) +* group file: Group Functions. (line 6) +* group ID of gawk user: Auto-set. (line 170) +* groups, information about: Group Functions. (line 6) +* gsub: Using Constant Regexps. + (line 43) +* gsub <1>: String Functions. (line 139) +* gsub() function, arguments of: String Functions. (line 463) +* gsub() function, escape processing: Gory Details. (line 6) +* h debugger command (alias for help): Miscellaneous Debugger Commands. + (line 69) +* Hankerson, Darrel: Acknowledgments. (line 60) +* Hankerson, Darrel <1>: Contributors. 
(line 61) +* Haque, John: Contributors. (line 109) +* Hartholz, Elaine: Acknowledgments. (line 38) +* Hartholz, Marshall: Acknowledgments. (line 38) +* Hasegawa, Isamu: Contributors. (line 95) +* help debugger command: Miscellaneous Debugger Commands. + (line 69) +* hexadecimal numbers: Nondecimal-numbers. (line 6) +* hexadecimal values, enabling interpretation of: Options. (line 209) +* history expansion, in debugger: Readline Support. (line 6) +* histsort.awk program: History Sorting. (line 25) +* Hughes, Phil: Acknowledgments. (line 43) +* HUP signal, for dynamic profiling: Profiling. (line 209) +* hyphen (-), - operator: Precedence. (line 51) +* hyphen (-), - operator <1>: Precedence. (line 57) +* hyphen (-), -- operator: Increment Ops. (line 48) +* hyphen (-), -- operator <1>: Precedence. (line 45) +* hyphen (-), -= operator: Assignment Ops. (line 129) +* hyphen (-), -= operator <1>: Precedence. (line 94) +* hyphen (-), filenames beginning with: Options. (line 60) +* hyphen (-), in bracket expressions: Bracket Expressions. (line 25) +* i debugger command (alias for info): Debugger Info. (line 13) +* id utility: Id Program. (line 6) +* id.awk program: Id Program. (line 31) +* if statement: If Statement. (line 6) +* if statement, actions, changing: Ranges. (line 25) +* if statement, use of regexps in: Regexp Usage. (line 19) +* igawk.sh program: Igawk Program. (line 124) +* ignore breakpoint: Breakpoint Control. (line 87) +* ignore debugger command: Breakpoint Control. (line 87) +* IGNORECASE variable: User-modified. (line 76) +* IGNORECASE variable, and array indices: Array Intro. (line 100) +* IGNORECASE variable, and array sorting functions: Array Sorting Functions. + (line 83) +* IGNORECASE variable, in example programs: Library Functions. + (line 53) +* IGNORECASE variable, with ~ and !~ operators: Case-sensitivity. + (line 26) +* Illumos: Other Versions. (line 109) +* Illumos, POSIX-compliant awk: Other Versions. (line 109) +* implementation issues, gawk: Notes. (line 6) +* implementation issues, gawk, debugging: Compatibility Mode. (line 6) +* implementation issues, gawk, limits: Getline Notes. (line 14) +* implementation issues, gawk, limits <1>: Redirection. (line 129) +* in operator: Comparison Operators. + (line 11) +* in operator <1>: Precedence. (line 82) +* in operator <2>: For Statement. (line 75) +* in operator, index existence in multidimensional arrays: Multidimensional. + (line 41) +* in operator, order of array access: Scanning an Array. (line 48) +* in operator, testing if array element exists: Reference to Elements. + (line 38) +* in operator, use in loops: Scanning an Array. (line 17) +* including files, @include directive: Include Files. (line 8) +* increment operators: Increment Ops. (line 6) +* index: String Functions. (line 155) +* indexing arrays: Array Intro. (line 48) +* indirect function calls: Indirect Calls. (line 6) +* indirect function calls, @-notation: Indirect Calls. (line 47) +* infinite precision: Arbitrary Precision Arithmetic. + (line 6) +* info debugger command: Debugger Info. (line 13) +* initialization, automatic: More Complex. (line 39) +* inplace extension: Extension Sample Inplace. + (line 6) +* input files: Reading Files. (line 6) +* input files, closing: Close Files And Pipes. + (line 6) +* input files, counting elements in: Wc Program. (line 6) +* input files, examples: Sample Data Files. (line 6) +* input files, reading: Reading Files. (line 6) +* input files, running awk without: Read Terminal. 
(line 6) +* input files, running awk without <1>: Read Terminal. (line 16) +* input files, variable assignments and: Other Arguments. (line 26) +* input pipeline: Getline/Pipe. (line 10) +* input record, length of: String Functions. (line 177) +* input redirection: Getline/File. (line 6) +* input, data, nondecimal: Nondecimal Data. (line 6) +* input, explicit: Getline. (line 6) +* input, files, See input files: Multiple Line. (line 6) +* input, multiline records: Multiple Line. (line 6) +* input, splitting into records: Records. (line 6) +* input, standard: Read Terminal. (line 6) +* input, standard <1>: Special FD. (line 6) +* input/output functions: I/O Functions. (line 6) +* input/output, binary: User-modified. (line 15) +* input/output, from BEGIN and END: I/O And BEGIN/END. (line 6) +* input/output, two-way: Two-way I/O. (line 27) +* insomnia, cure for: Alarm Program. (line 6) +* installation, VMS: VMS Installation. (line 6) +* installing gawk: Installation. (line 6) +* instruction tracing, in debugger: Debugger Info. (line 90) +* int: Numeric Functions. (line 24) +* INT signal (MS-Windows): Profiling. (line 212) +* intdiv: Numeric Functions. (line 29) +* intdiv <1>: Numeric Functions. (line 29) +* integer array indices: Numeric Array Subscripts. + (line 31) +* integers, arbitrary precision: Arbitrary Precision Integers. + (line 6) +* integers, unsigned: Computer Arithmetic. (line 41) +* interacting with other programs: I/O Functions. (line 107) +* internationalization: I18N Functions. (line 6) +* internationalization <1>: I18N and L10N. (line 6) +* internationalization, localization: User-modified. (line 152) +* internationalization, localization <1>: Internationalization. + (line 13) +* internationalization, localization, character classes: Bracket Expressions. + (line 108) +* internationalization, localization, gawk and: Internationalization. + (line 13) +* internationalization, localization, locale categories: Explaining gettext. + (line 81) +* internationalization, localization, marked strings: Programmer i18n. + (line 13) +* internationalization, localization, portability and: I18N Portability. + (line 6) +* internationalizing a program: Explaining gettext. (line 6) +* interpreted programs: Basic High Level. (line 13) +* interpreted programs <1>: Glossary. (line 445) +* interval expressions, regexp operator: Regexp Operators. (line 116) +* inventory-shipped file: Sample Data Files. (line 32) +* invoke shell command: I/O Functions. (line 107) +* isarray: Type Functions. (line 11) +* ISO: Glossary. (line 456) +* ISO 8859-1: Glossary. (line 196) +* ISO Latin-1: Glossary. (line 196) +* Jacobs, Andrew: Passwd Functions. (line 90) +* Jaegermann, Michal: Acknowledgments. (line 60) +* Jaegermann, Michal <1>: Contributors. (line 46) +* Java implementation of awk: Other Versions. (line 117) +* Java programming language: Glossary. (line 468) +* jawk: Other Versions. (line 117) +* Jedi knights: Undocumented. (line 6) +* Johansen, Chris: Signature Program. (line 25) +* join() user-defined function: Join Function. (line 18) +* Kahrs, Ju"rgen: Acknowledgments. (line 60) +* Kahrs, Ju"rgen <1>: Contributors. (line 71) +* Kasal, Stepan: Acknowledgments. (line 60) +* Kenobi, Obi-Wan: Undocumented. (line 6) +* Kernighan, Brian: History. (line 17) +* Kernighan, Brian <1>: Conventions. (line 38) +* Kernighan, Brian <2>: Acknowledgments. (line 79) +* Kernighan, Brian <3>: Getline/Pipe. (line 6) +* Kernighan, Brian <4>: Concatenation. (line 6) +* Kernighan, Brian <5>: Library Functions. 
(line 12) +* Kernighan, Brian <6>: BTL. (line 6) +* Kernighan, Brian <7>: Contributors. (line 12) +* Kernighan, Brian <8>: Other Versions. (line 13) +* Kernighan, Brian <9>: Basic Data Typing. (line 54) +* Kernighan, Brian <10>: Glossary. (line 206) +* kill command, dynamic profiling: Profiling. (line 186) +* Knights, jedi: Undocumented. (line 6) +* Kwok, Conrad: Contributors. (line 35) +* l debugger command (alias for list): Miscellaneous Debugger Commands. + (line 75) +* labels.awk program: Labels Program. (line 51) +* Langston, Peter: Advanced Features. (line 6) +* LANGUAGE environment variable: Explaining gettext. (line 120) +* languages, data-driven: Basic High Level. (line 74) +* LC_ALL locale category: Explaining gettext. (line 117) +* LC_COLLATE locale category: Explaining gettext. (line 94) +* LC_CTYPE locale category: Explaining gettext. (line 98) +* LC_MESSAGES locale category: Explaining gettext. (line 88) +* LC_MESSAGES locale category, bindtextdomain() function (gawk): Programmer i18n. + (line 101) +* LC_MONETARY locale category: Explaining gettext. (line 104) +* LC_NUMERIC locale category: Explaining gettext. (line 108) +* LC_TIME locale category: Explaining gettext. (line 112) +* left angle bracket (<), < operator: Comparison Operators. + (line 11) +* left angle bracket (<), < operator <1>: Precedence. (line 64) +* left angle bracket (<), < operator (I/O): Getline/File. (line 6) +* left angle bracket (<), <= operator: Comparison Operators. + (line 11) +* left angle bracket (<), <= operator <1>: Precedence. (line 64) +* left shift: Bitwise Functions. (line 47) +* left shift, bitwise: Bitwise Functions. (line 32) +* leftmost longest match: Multiple Line. (line 26) +* length: String Functions. (line 170) +* length of input record: String Functions. (line 177) +* length of string: String Functions. (line 170) +* Lesser General Public License (LGPL): Glossary. (line 491) +* LGPL (Lesser General Public License): Glossary. (line 491) +* libmawk: Other Versions. (line 125) +* libraries of awk functions: Library Functions. (line 6) +* libraries of awk functions, assertions: Assert Function. (line 6) +* libraries of awk functions, associative arrays and: Library Names. + (line 58) +* libraries of awk functions, character values as numbers: Ordinal Functions. + (line 6) +* libraries of awk functions, command-line options: Getopt Function. + (line 6) +* libraries of awk functions, example program for using: Igawk Program. + (line 6) +* libraries of awk functions, group database, reading: Group Functions. + (line 6) +* libraries of awk functions, managing, data files: Data File Management. + (line 6) +* libraries of awk functions, managing, time: Getlocaltime Function. + (line 6) +* libraries of awk functions, merging arrays into strings: Join Function. + (line 6) +* libraries of awk functions, rounding numbers: Round Function. + (line 6) +* libraries of awk functions, user database, reading: Passwd Functions. + (line 6) +* line breaks: Statements/Lines. (line 6) +* line continuations: Boolean Ops. (line 64) +* line continuations, gawk: Conditional Exp. (line 34) +* line continuations, in print statement: Print Examples. (line 75) +* line continuations, with C shell: More Complex. (line 31) +* lines, blank, printing: Print. (line 22) +* lines, counting: Wc Program. (line 6) +* lines, duplicate, removing: History Sorting. (line 6) +* lines, matching ranges of: Ranges. (line 6) +* lines, skipping between markers: Ranges. (line 43) +* lint checking: User-modified. 
(line 87) +* lint checking, array elements: Delete. (line 34) +* lint checking, array subscripts: Uninitialized Subscripts. + (line 43) +* lint checking, empty programs: Command Line. (line 16) +* lint checking, issuing warnings: Options. (line 184) +* lint checking, POSIXLY_CORRECT environment variable: Options. + (line 343) +* lint checking, undefined functions: Pass By Value/Reference. + (line 85) +* LINT variable: User-modified. (line 87) +* Linux: Manual History. (line 28) +* Linux <1>: I18N Example. (line 57) +* Linux <2>: Glossary. (line 748) +* list all global variables, in debugger: Debugger Info. (line 48) +* list debugger command: Miscellaneous Debugger Commands. + (line 75) +* list function definitions, in debugger: Debugger Info. (line 30) +* loading extensions, @load directive: Loading Shared Libraries. + (line 8) +* loading, extensions: Options. (line 172) +* local variables, in a function: Variable Scope. (line 6) +* locale categories: Explaining gettext. (line 81) +* locale decimal point character: Options. (line 269) +* locale, definition of: Locales. (line 6) +* localization: I18N and L10N. (line 6) +* localization, See internationalization, localization: I18N and L10N. + (line 6) +* log: Numeric Functions. (line 44) +* log files, timestamps in: Time Functions. (line 6) +* logarithm: Numeric Functions. (line 44) +* logical false/true: Truth Values. (line 6) +* logical operators, See Boolean expressions: Boolean Ops. (line 6) +* login information: Passwd Functions. (line 16) +* long options: Command Line. (line 13) +* loops: While Statement. (line 6) +* loops, break statement and: Break Statement. (line 6) +* loops, continue statements and: For Statement. (line 64) +* loops, count for header, in a profile: Profiling. (line 131) +* loops, do-while: Do Statement. (line 6) +* loops, exiting: Break Statement. (line 6) +* loops, for, array scanning: Scanning an Array. (line 6) +* loops, for, iterative: For Statement. (line 6) +* loops, See Also while statement: While Statement. (line 6) +* loops, while: While Statement. (line 6) +* ls utility: More Complex. (line 15) +* lshift: Bitwise Functions. (line 47) +* lvalues/rvalues: Assignment Ops. (line 31) +* mail-list file: Sample Data Files. (line 6) +* mailing labels, printing: Labels Program. (line 6) +* mailing list, GNITS: Acknowledgments. (line 52) +* Malmberg, John: Acknowledgments. (line 60) +* Malmberg, John <1>: Maintainers. (line 14) +* Malmberg, John E.: Contributors. (line 138) +* mark parity: Ordinal Functions. (line 45) +* marked string extraction (internationalization): String Extraction. + (line 6) +* marked strings, extracting: String Extraction. (line 6) +* Marx, Groucho: Increment Ops. (line 60) +* match: String Functions. (line 210) +* match regexp in string: String Functions. (line 210) +* match() function, RSTART/RLENGTH variables: String Functions. + (line 227) +* matching, expressions, See comparison expressions: Typing and Comparison. + (line 9) +* matching, leftmost longest: Multiple Line. (line 26) +* matching, null strings: String Functions. (line 537) +* mawk utility: Escape Sequences. (line 121) +* mawk utility <1>: Getline/Pipe. (line 62) +* mawk utility <2>: Concatenation. (line 36) +* mawk utility <3>: Nextfile Statement. (line 47) +* mawk utility <4>: Other Versions. (line 48) +* maximum precision supported by MPFR library: Auto-set. (line 235) +* McIlroy, Doug: Glossary. (line 257) +* McPhee, Patrick: Contributors. (line 101) +* message object files: Explaining gettext. 
(line 42) +* message object files, converting from portable object files: I18N Example. + (line 66) +* message object files, specifying directory of: Explaining gettext. + (line 54) +* message object files, specifying directory of <1>: Programmer i18n. + (line 48) +* messages from extensions: Printing Messages. (line 6) +* metacharacters in regular expressions: Regexp Operators. (line 6) +* metacharacters, escape sequences for: Escape Sequences. (line 140) +* minimum precision required by MPFR library: Auto-set. (line 238) +* mktime: Time Functions. (line 25) +* modifiers, in format specifiers: Format Modifiers. (line 6) +* monetary information, localization: Explaining gettext. (line 104) +* Moore, Duncan: Getline Notes. (line 40) +* msgfmt utility: I18N Example. (line 66) +* multiple precision: Arbitrary Precision Arithmetic. + (line 6) +* multiple-line records: Multiple Line. (line 6) +* n debugger command (alias for next): Debugger Execution Control. + (line 43) +* names, arrays/variables: Library Names. (line 6) +* names, functions: Definition Syntax. (line 24) +* names, functions <1>: Library Names. (line 6) +* namespace issues: Library Names. (line 6) +* namespace issues, functions: Definition Syntax. (line 24) +* NetBSD: Glossary. (line 748) +* networks, programming: TCP/IP Networking. (line 6) +* networks, support for: Special Network. (line 6) +* newlines: Statements/Lines. (line 6) +* newlines <1>: Options. (line 263) +* newlines <2>: Boolean Ops. (line 69) +* newlines, as record separators: awk split records. (line 12) +* newlines, in dynamic regexps: Computed Regexps. (line 60) +* newlines, in regexp constants: Computed Regexps. (line 70) +* newlines, printing: Print Examples. (line 11) +* newlines, separating statements in actions: Action Overview. + (line 19) +* newlines, separating statements in actions <1>: Statements. (line 10) +* next debugger command: Debugger Execution Control. + (line 43) +* next file statement: Feature History. (line 168) +* next statement: Boolean Ops. (line 95) +* next statement <1>: Next Statement. (line 6) +* next statement, BEGIN/END patterns and: I/O And BEGIN/END. (line 36) +* next statement, BEGINFILE/ENDFILE patterns and: BEGINFILE/ENDFILE. + (line 49) +* next statement, user-defined functions and: Next Statement. (line 44) +* nextfile statement: Nextfile Statement. (line 6) +* nextfile statement, BEGIN/END patterns and: I/O And BEGIN/END. + (line 36) +* nextfile statement, BEGINFILE/ENDFILE patterns and: BEGINFILE/ENDFILE. + (line 26) +* nextfile statement, user-defined functions and: Nextfile Statement. + (line 47) +* nexti debugger command: Debugger Execution Control. + (line 49) +* NF variable: Fields. (line 33) +* NF variable <1>: Auto-set. (line 123) +* NF variable, decrementing: Changing Fields. (line 107) +* ni debugger command (alias for nexti): Debugger Execution Control. + (line 49) +* noassign.awk program: Ignoring Assigns. (line 15) +* non-existent array elements: Reference to Elements. + (line 23) +* not Boolean-logic operator: Boolean Ops. (line 6) +* NR variable: Records. (line 6) +* NR variable <1>: Auto-set. (line 143) +* NR variable, changing: Auto-set. (line 357) +* null strings: awk split records. (line 114) +* null strings <1>: Regexp Field Splitting. + (line 43) +* null strings <2>: Truth Values. (line 6) +* null strings <3>: Basic Data Typing. (line 26) +* null strings in gawk arguments, quoting and: Quoting. (line 82) +* null strings, and deleting array elements: Delete. 
(line 27) +* null strings, as array subscripts: Uninitialized Subscripts. + (line 43) +* null strings, converting numbers to strings: Strings And Numbers. + (line 21) +* null strings, matching: String Functions. (line 537) +* number as string of bits: Bitwise Functions. (line 108) +* number of array elements: String Functions. (line 200) +* number sign (#), #! (executable scripts): Executable Scripts. + (line 6) +* number sign (#), commenting: Comments. (line 6) +* numbers, as array subscripts: Numeric Array Subscripts. + (line 6) +* numbers, as values of characters: Ordinal Functions. (line 6) +* numbers, Cliff random: Cliff Random Function. + (line 6) +* numbers, converting: Strings And Numbers. (line 6) +* numbers, converting <1>: Bitwise Functions. (line 108) +* numbers, converting, to strings: User-modified. (line 30) +* numbers, converting, to strings <1>: User-modified. (line 104) +* numbers, hexadecimal: Nondecimal-numbers. (line 6) +* numbers, octal: Nondecimal-numbers. (line 6) +* numbers, rounding: Round Function. (line 6) +* numeric constants: Scalar Constants. (line 6) +* numeric functions: Numeric Functions. (line 6) +* numeric, output format: OFMT. (line 6) +* numeric, strings: Variable Typing. (line 6) +* o debugger command (alias for option): Debugger Info. (line 57) +* obsolete features: Obsolete. (line 6) +* octal numbers: Nondecimal-numbers. (line 6) +* octal values, enabling interpretation of: Options. (line 209) +* OFMT variable: OFMT. (line 15) +* OFMT variable <1>: Strings And Numbers. (line 56) +* OFMT variable <2>: User-modified. (line 104) +* OFMT variable, POSIX awk and: OFMT. (line 27) +* OFS variable: Changing Fields. (line 64) +* OFS variable <1>: Output Separators. (line 6) +* OFS variable <2>: User-modified. (line 113) +* OpenBSD: Glossary. (line 748) +* OpenSolaris: Other Versions. (line 100) +* operating systems, BSD-based: Manual History. (line 28) +* operating systems, PC, gawk on: PC Using. (line 6) +* operating systems, PC, gawk on, installing: PC Installation. + (line 6) +* operating systems, porting gawk to: New Ports. (line 6) +* operating systems, See Also GNU/Linux, PC operating systems, Unix: Installation. + (line 6) +* operations, bitwise: Bitwise Functions. (line 6) +* operators, arithmetic: Arithmetic Ops. (line 6) +* operators, assignment: Assignment Ops. (line 6) +* operators, assignment <1>: Assignment Ops. (line 31) +* operators, assignment, evaluation order: Assignment Ops. (line 110) +* operators, Boolean, See Boolean expressions: Boolean Ops. (line 6) +* operators, decrement/increment: Increment Ops. (line 6) +* operators, GNU-specific: GNU Regexp Operators. + (line 6) +* operators, input/output: Getline/File. (line 6) +* operators, input/output <1>: Getline/Pipe. (line 10) +* operators, input/output <2>: Getline/Coprocess. (line 6) +* operators, input/output <3>: Redirection. (line 22) +* operators, input/output <4>: Redirection. (line 96) +* operators, input/output <5>: Precedence. (line 64) +* operators, input/output <6>: Precedence. (line 64) +* operators, input/output <7>: Precedence. (line 64) +* operators, logical, See Boolean expressions: Boolean Ops. (line 6) +* operators, precedence: Increment Ops. (line 60) +* operators, precedence <1>: Precedence. (line 6) +* operators, relational, See operators, comparison: Typing and Comparison. + (line 9) +* operators, short-circuit: Boolean Ops. (line 59) +* operators, string: Concatenation. (line 9) +* operators, string-matching: Regexp Usage. 
(line 19) +* operators, string-matching, for buffers: GNU Regexp Operators. + (line 51) +* operators, word-boundary (gawk): GNU Regexp Operators. + (line 66) +* option debugger command: Debugger Info. (line 57) +* options, command-line: Options. (line 6) +* options, command-line, end of: Options. (line 55) +* options, command-line, invoking awk: Command Line. (line 6) +* options, command-line, processing: Getopt Function. (line 6) +* options, deprecated: Obsolete. (line 6) +* options, long: Command Line. (line 13) +* options, long <1>: Options. (line 6) +* options, printing list of: Options. (line 154) +* or: Bitwise Functions. (line 50) +* OR bitwise operation: Bitwise Functions. (line 6) +* or Boolean-logic operator: Boolean Ops. (line 6) +* ord() extension function: Extension Sample Ord. + (line 12) +* ord() user-defined function: Ordinal Functions. (line 16) +* order of evaluation, concatenation: Concatenation. (line 41) +* ORS variable: Output Separators. (line 20) +* ORS variable <1>: User-modified. (line 119) +* output field separator, See OFS variable: Changing Fields. (line 64) +* output record separator, See ORS variable: Output Separators. + (line 20) +* output redirection: Redirection. (line 6) +* output wrapper: Output Wrappers. (line 6) +* output, buffering: I/O Functions. (line 32) +* output, buffering <1>: I/O Functions. (line 166) +* output, duplicating into files: Tee Program. (line 6) +* output, files, closing: Close Files And Pipes. + (line 6) +* output, format specifier, OFMT: OFMT. (line 15) +* output, formatted: Printf. (line 6) +* output, pipes: Redirection. (line 57) +* output, printing, See printing: Printing. (line 6) +* output, records: Output Separators. (line 20) +* output, standard: Special FD. (line 6) +* p debugger command (alias for print): Viewing And Changing Data. + (line 35) +* Papadopoulos, Panos: Contributors. (line 129) +* parent process ID of gawk process: Auto-set. (line 210) +* parentheses (), in a profile: Profiling. (line 146) +* parentheses (), regexp operator: Regexp Operators. (line 81) +* password file: Passwd Functions. (line 16) +* patsplit: String Functions. (line 296) +* patterns: Patterns and Actions. + (line 6) +* patterns, comparison expressions as: Expression Patterns. (line 14) +* patterns, counts, in a profile: Profiling. (line 118) +* patterns, default: Very Simple. (line 35) +* patterns, empty: Empty. (line 6) +* patterns, expressions as: Regexp Patterns. (line 6) +* patterns, ranges in: Ranges. (line 6) +* patterns, regexp constants as: Expression Patterns. (line 34) +* patterns, types of: Pattern Overview. (line 15) +* pawk (profiling version of Brian Kernighan's awk): Other Versions. + (line 82) +* pawk, awk-like facilities for Python: Other Versions. (line 129) +* PC operating systems, gawk on: PC Using. (line 6) +* PC operating systems, gawk on, installing: PC Installation. (line 6) +* percent sign (%), % operator: Precedence. (line 54) +* percent sign (%), %= operator: Assignment Ops. (line 129) +* percent sign (%), %= operator <1>: Precedence. (line 94) +* period (.), regexp operator: Regexp Operators. (line 44) +* Perl: Future Extensions. (line 6) +* Peters, Arno: Contributors. (line 86) +* Peterson, Hal: Contributors. (line 40) +* pipe, closing: Close Files And Pipes. + (line 6) +* pipe, input: Getline/Pipe. (line 10) +* pipe, output: Redirection. (line 57) +* Pitts, Dave: Acknowledgments. (line 60) +* Pitts, Dave <1>: Maintainers. (line 14) +* Plauger, P.J.: Library Functions. (line 12) +* plug-in: Extension Intro. 
(line 6) +* plus sign (+), + operator: Precedence. (line 51) +* plus sign (+), + operator <1>: Precedence. (line 57) +* plus sign (+), ++ operator: Increment Ops. (line 11) +* plus sign (+), ++ operator <1>: Increment Ops. (line 40) +* plus sign (+), ++ operator <2>: Precedence. (line 45) +* plus sign (+), += operator: Assignment Ops. (line 81) +* plus sign (+), += operator <1>: Precedence. (line 94) +* plus sign (+), regexp operator: Regexp Operators. (line 105) +* pointers to functions: Indirect Calls. (line 6) +* portability: Escape Sequences. (line 103) +* portability, #! (executable scripts): Executable Scripts. (line 33) +* portability, ** operator and: Arithmetic Ops. (line 81) +* portability, **= operator and: Assignment Ops. (line 144) +* portability, ARGV variable: Executable Scripts. (line 59) +* portability, backslash continuation and: Statements/Lines. (line 30) +* portability, backslash in escape sequences: Escape Sequences. + (line 108) +* portability, close() function and: Close Files And Pipes. + (line 81) +* portability, data files as single record: gawk split records. + (line 65) +* portability, deleting array elements: Delete. (line 56) +* portability, example programs: Library Functions. (line 42) +* portability, functions, defining: Definition Syntax. (line 114) +* portability, gawk: New Ports. (line 6) +* portability, gettext library and: Explaining gettext. (line 11) +* portability, internationalization and: I18N Portability. (line 6) +* portability, length() function: String Functions. (line 179) +* portability, new awk vs. old awk: Strings And Numbers. (line 56) +* portability, next statement in user-defined functions: Pass By Value/Reference. + (line 88) +* portability, NF variable, decrementing: Changing Fields. (line 115) +* portability, operators: Increment Ops. (line 60) +* portability, operators, not in POSIX awk: Precedence. (line 97) +* portability, POSIXLY_CORRECT environment variable: Options. (line 363) +* portability, substr() function: String Functions. (line 513) +* portable object files: Explaining gettext. (line 37) +* portable object files <1>: Translator i18n. (line 6) +* portable object files, converting to message object files: I18N Example. + (line 66) +* portable object files, generating: Options. (line 147) +* portable object template files: Explaining gettext. (line 31) +* porting gawk: New Ports. (line 6) +* positional specifiers, printf statement: Format Modifiers. (line 13) +* positional specifiers, printf statement <1>: Printf Ordering. + (line 6) +* positional specifiers, printf statement, mixing with regular formats: Printf Ordering. + (line 57) +* POSIX awk: This Manual. (line 14) +* POSIX awk <1>: Assignment Ops. (line 138) +* POSIX awk, ** operator and: Precedence. (line 97) +* POSIX awk, **= operator and: Assignment Ops. (line 144) +* POSIX awk, < operator and: Getline/File. (line 26) +* POSIX awk, arithmetic operators and: Arithmetic Ops. (line 30) +* POSIX awk, backslashes in string constants: Escape Sequences. + (line 108) +* POSIX awk, BEGIN/END patterns: I/O And BEGIN/END. (line 15) +* POSIX awk, bracket expressions and: Bracket Expressions. (line 34) +* POSIX awk, bracket expressions and, character classes: Bracket Expressions. + (line 40) +* POSIX awk, bracket expressions and, character classes <1>: Bracket Expressions. + (line 108) +* POSIX awk, break statement and: Break Statement. (line 51) +* POSIX awk, changes in awk versions: POSIX. (line 6) +* POSIX awk, continue statement and: Continue Statement. 
(line 44) +* POSIX awk, CONVFMT variable and: User-modified. (line 30) +* POSIX awk, date utility and: Time Functions. (line 253) +* POSIX awk, field separators and: Full Line Fields. (line 16) +* POSIX awk, function keyword in: Definition Syntax. (line 99) +* POSIX awk, functions and, gsub()/sub(): Gory Details. (line 90) +* POSIX awk, functions and, length(): String Functions. (line 179) +* POSIX awk, GNU long options and: Options. (line 15) +* POSIX awk, interval expressions in: Regexp Operators. (line 135) +* POSIX awk, next/nextfile statements and: Next Statement. (line 44) +* POSIX awk, numeric strings and: Variable Typing. (line 6) +* POSIX awk, OFMT variable and: OFMT. (line 27) +* POSIX awk, OFMT variable and <1>: Strings And Numbers. (line 56) +* POSIX awk, period (.), using: Regexp Operators. (line 51) +* POSIX awk, printf format strings and: Format Modifiers. (line 157) +* POSIX awk, regular expressions and: Regexp Operators. (line 161) +* POSIX awk, timestamps and: Time Functions. (line 6) +* POSIX awk, | I/O operator and: Getline/Pipe. (line 56) +* POSIX mode: Options. (line 257) +* POSIX mode <1>: Options. (line 343) +* POSIX, awk and: Preface. (line 21) +* POSIX, gawk extensions not included in: POSIX/GNU. (line 6) +* POSIX, programs, implementing in awk: Clones. (line 6) +* POSIXLY_CORRECT environment variable: Options. (line 343) +* PREC variable: User-modified. (line 124) +* precedence: Increment Ops. (line 60) +* precedence <1>: Precedence. (line 6) +* precedence, regexp operators: Regexp Operators. (line 156) +* predefined variables: Built-in Variables. (line 6) +* predefined variables, -v option, setting with: Options. (line 41) +* predefined variables, conveying information: Auto-set. (line 6) +* predefined variables, user-modifiable: User-modified. (line 6) +* print debugger command: Viewing And Changing Data. + (line 35) +* print statement: Printing. (line 16) +* print statement, BEGIN/END patterns and: I/O And BEGIN/END. (line 15) +* print statement, commas, omitting: Print Examples. (line 30) +* print statement, I/O operators in: Precedence. (line 70) +* print statement, line continuations and: Print Examples. (line 75) +* print statement, OFMT variable and: User-modified. (line 113) +* print statement, See Also redirection, of output: Redirection. + (line 17) +* print statement, sprintf() function and: Round Function. (line 6) +* print variables, in debugger: Viewing And Changing Data. + (line 35) +* printf debugger command: Viewing And Changing Data. + (line 53) +* printf statement: Printing. (line 16) +* printf statement <1>: Printf. (line 6) +* printf statement, columns, aligning: Print Examples. (line 69) +* printf statement, format-control characters: Control Letters. + (line 6) +* printf statement, I/O operators in: Precedence. (line 70) +* printf statement, modifiers: Format Modifiers. (line 6) +* printf statement, positional specifiers: Format Modifiers. (line 13) +* printf statement, positional specifiers <1>: Printf Ordering. + (line 6) +* printf statement, positional specifiers, mixing with regular formats: Printf Ordering. + (line 57) +* printf statement, See Also redirection, of output: Redirection. + (line 17) +* printf statement, sprintf() function and: Round Function. (line 6) +* printf statement, syntax of: Basic Printf. (line 6) +* printing: Printing. (line 6) +* printing messages from extensions: Printing Messages. (line 6) +* printing, list of options: Options. (line 154) +* printing, mailing labels: Labels Program. 
(line 6) +* printing, unduplicated lines of text: Uniq Program. (line 6) +* printing, user information: Id Program. (line 6) +* private variables: Library Names. (line 11) +* process group ID of gawk process: Auto-set. (line 204) +* process ID of gawk process: Auto-set. (line 207) +* processes, two-way communications with: Two-way I/O. (line 6) +* processing data: Basic High Level. (line 6) +* PROCINFO array: Auto-set. (line 148) +* PROCINFO array <1>: Time Functions. (line 47) +* PROCINFO array <2>: Passwd Functions. (line 6) +* PROCINFO array, and communications via ptys: Two-way I/O. (line 114) +* PROCINFO array, and group membership: Group Functions. (line 6) +* PROCINFO array, and user and group ID numbers: Id Program. (line 15) +* PROCINFO array, testing the field splitting: Passwd Functions. + (line 154) +* PROCINFO, values of sorted_in: Controlling Scanning. + (line 26) +* profiling awk programs: Profiling. (line 6) +* profiling awk programs, dynamically: Profiling. (line 177) +* program identifiers: Auto-set. (line 173) +* program, definition of: Getting Started. (line 21) +* programming conventions, --non-decimal-data option: Nondecimal Data. + (line 35) +* programming conventions, ARGC/ARGV variables: Auto-set. (line 35) +* programming conventions, exit statement: Exit Statement. (line 38) +* programming conventions, function parameters: Return Statement. + (line 44) +* programming conventions, functions, calling: Calling Built-in. + (line 10) +* programming conventions, functions, writing: Definition Syntax. + (line 71) +* programming conventions, gawk extensions: Internal File Ops. + (line 45) +* programming conventions, private variable names: Library Names. + (line 23) +* programming language, recipe for: History. (line 6) +* programming languages, Ada: Glossary. (line 11) +* programming languages, data-driven vs. procedural: Getting Started. + (line 12) +* programming languages, Java: Glossary. (line 468) +* programming, basic steps: Basic High Level. (line 18) +* programming, concepts: Basic Concepts. (line 6) +* programming, concepts <1>: Basic Concepts. (line 6) +* pwcat program: Passwd Functions. (line 23) +* q debugger command (alias for quit): Miscellaneous Debugger Commands. + (line 102) +* QSE awk: Other Versions. (line 135) +* Quanstrom, Erik: Alarm Program. (line 8) +* question mark (?), ?: operator: Precedence. (line 91) +* question mark (?), regexp operator: Regexp Operators. (line 111) +* question mark (?), regexp operator <1>: GNU Regexp Operators. + (line 62) +* QuikTrim Awk: Other Versions. (line 139) +* quit debugger command: Miscellaneous Debugger Commands. + (line 102) +* QUIT signal (MS-Windows): Profiling. (line 212) +* quoting in gawk command lines: Long. (line 26) +* quoting in gawk command lines, tricks for: Quoting. (line 91) +* quoting, for small awk programs: Comments. (line 27) +* r debugger command (alias for run): Debugger Execution Control. + (line 62) +* Rakitzis, Byron: History Sorting. (line 25) +* Ramey, Chet: Acknowledgments. (line 60) +* Ramey, Chet <1>: General Data Types. (line 6) +* rand: Numeric Functions. (line 49) +* random numbers, Cliff: Cliff Random Function. + (line 6) +* random numbers, rand()/srand() functions: Numeric Functions. + (line 49) +* random numbers, seed of: Numeric Functions. (line 79) +* range expressions (regexps): Bracket Expressions. (line 6) +* range patterns: Ranges. (line 6) +* range patterns, line continuation and: Ranges. (line 64) +* Rankin, Pat: Acknowledgments. 
(line 60) +* Rankin, Pat <1>: Assignment Ops. (line 99) +* Rankin, Pat <2>: Contributors. (line 38) +* reada() extension function: Extension Sample Read write array. + (line 18) +* readable data files, checking: File Checking. (line 6) +* readable.awk program: File Checking. (line 11) +* readdir extension: Extension Sample Readdir. + (line 9) +* readfile() extension function: Extension Sample Readfile. + (line 12) +* readfile() user-defined function: Readfile Function. (line 30) +* reading input files: Reading Files. (line 6) +* recipe for a programming language: History. (line 6) +* record separators: awk split records. (line 6) +* record separators <1>: User-modified. (line 133) +* record separators, changing: awk split records. (line 85) +* record separators, regular expressions as: awk split records. + (line 124) +* record separators, with multiline records: Multiple Line. (line 10) +* records: Reading Files. (line 14) +* records <1>: Basic High Level. (line 62) +* records, multiline: Multiple Line. (line 6) +* records, printing: Print. (line 22) +* records, splitting input into: Records. (line 6) +* records, terminating: awk split records. (line 124) +* records, treating files as: gawk split records. (line 92) +* recursive functions: Definition Syntax. (line 89) +* redirect gawk output, in debugger: Debugger Info. (line 73) +* redirection of input: Getline/File. (line 6) +* redirection of output: Redirection. (line 6) +* redirection on VMS: VMS Running. (line 64) +* reference counting, sorting arrays: Array Sorting Functions. + (line 77) +* regexp: Regexp. (line 6) +* regexp constants: Regexp Usage. (line 57) +* regexp constants <1>: Regexp Constants. (line 6) +* regexp constants <2>: Comparison Operators. + (line 103) +* regexp constants, /=.../, /= operator and: Assignment Ops. (line 149) +* regexp constants, as patterns: Expression Patterns. (line 34) +* regexp constants, in gawk: Using Constant Regexps. + (line 28) +* regexp constants, slashes vs. quotes: Computed Regexps. (line 30) +* regexp constants, vs. string constants: Computed Regexps. (line 40) +* register extension: Registration Functions. + (line 6) +* regular expressions: Regexp. (line 6) +* regular expressions as field separators: Field Separators. (line 50) +* regular expressions, anchors in: Regexp Operators. (line 22) +* regular expressions, as field separators: Regexp Field Splitting. + (line 6) +* regular expressions, as patterns: Regexp Usage. (line 6) +* regular expressions, as patterns <1>: Regexp Patterns. (line 6) +* regular expressions, as record separators: awk split records. + (line 124) +* regular expressions, case sensitivity: Case-sensitivity. (line 6) +* regular expressions, case sensitivity <1>: User-modified. (line 76) +* regular expressions, computed: Computed Regexps. (line 6) +* regular expressions, constants, See regexp constants: Regexp Usage. + (line 57) +* regular expressions, dynamic: Computed Regexps. (line 6) +* regular expressions, dynamic, with embedded newlines: Computed Regexps. + (line 60) +* regular expressions, gawk, command-line options: GNU Regexp Operators. + (line 73) +* regular expressions, interval expressions and: Options. (line 278) +* regular expressions, leftmost longest match: Leftmost Longest. + (line 6) +* regular expressions, operators: Regexp Usage. (line 19) +* regular expressions, operators <1>: Regexp Operators. (line 6) +* regular expressions, operators, for buffers: GNU Regexp Operators. 
+ (line 51) +* regular expressions, operators, for words: GNU Regexp Operators. + (line 6) +* regular expressions, operators, gawk: GNU Regexp Operators. + (line 6) +* regular expressions, operators, precedence of: Regexp Operators. + (line 156) +* regular expressions, searching for: Egrep Program. (line 6) +* relational operators, See comparison operators: Typing and Comparison. + (line 9) +* replace in string: String Functions. (line 409) +* retrying input: Retrying Input. (line 6) +* return debugger command: Debugger Execution Control. + (line 54) +* return statement, user-defined functions: Return Statement. (line 6) +* return value, close() function: Close Files And Pipes. + (line 132) +* rev() user-defined function: Function Example. (line 54) +* revoutput extension: Extension Sample Revout. + (line 11) +* revtwoway extension: Extension Sample Rev2way. + (line 12) +* rewind() user-defined function: Rewind Function. (line 15) +* right angle bracket (>), > operator: Comparison Operators. + (line 11) +* right angle bracket (>), > operator <1>: Precedence. (line 64) +* right angle bracket (>), > operator (I/O): Redirection. (line 22) +* right angle bracket (>), >= operator: Comparison Operators. + (line 11) +* right angle bracket (>), >= operator <1>: Precedence. (line 64) +* right angle bracket (>), >> operator (I/O): Redirection. (line 50) +* right angle bracket (>), >> operator (I/O) <1>: Precedence. (line 64) +* right shift: Bitwise Functions. (line 54) +* right shift, bitwise: Bitwise Functions. (line 32) +* Ritchie, Dennis: Basic Data Typing. (line 54) +* RLENGTH variable: Auto-set. (line 282) +* RLENGTH variable, match() function and: String Functions. (line 227) +* Robbins, Arnold: Command Line Field Separator. + (line 71) +* Robbins, Arnold <1>: Getline/Pipe. (line 40) +* Robbins, Arnold <2>: Passwd Functions. (line 90) +* Robbins, Arnold <3>: Alarm Program. (line 6) +* Robbins, Arnold <4>: General Data Types. (line 6) +* Robbins, Arnold <5>: Contributors. (line 145) +* Robbins, Arnold <6>: Maintainers. (line 14) +* Robbins, Arnold <7>: Future Extensions. (line 6) +* Robbins, Bill: Getline/Pipe. (line 40) +* Robbins, Harry: Acknowledgments. (line 94) +* Robbins, Jean: Acknowledgments. (line 94) +* Robbins, Miriam: Acknowledgments. (line 94) +* Robbins, Miriam <1>: Getline/Pipe. (line 40) +* Robbins, Miriam <2>: Passwd Functions. (line 90) +* Rommel, Kai Uwe: Contributors. (line 43) +* round to nearest integer: Numeric Functions. (line 24) +* round() user-defined function: Round Function. (line 16) +* rounding numbers: Round Function. (line 6) +* ROUNDMODE variable: User-modified. (line 128) +* RS variable: awk split records. (line 12) +* RS variable <1>: User-modified. (line 133) +* RS variable, multiline records and: Multiple Line. (line 17) +* rshift: Bitwise Functions. (line 54) +* RSTART variable: Auto-set. (line 288) +* RSTART variable, match() function and: String Functions. (line 227) +* RT variable: awk split records. (line 124) +* RT variable <1>: Multiple Line. (line 130) +* RT variable <2>: Auto-set. (line 295) +* Rubin, Paul: History. (line 30) +* Rubin, Paul <1>: Contributors. (line 16) +* rule, definition of: Getting Started. (line 21) +* run debugger command: Debugger Execution Control. + (line 62) +* rvalues/lvalues: Assignment Ops. (line 31) +* s debugger command (alias for step): Debugger Execution Control. + (line 68) +* sample debugging session: Sample Debugging Session. + (line 6) +* sandbox mode: Options. (line 290) +* save debugger options: Debugger Info. 
(line 85) +* scalar or array: Type Functions. (line 11) +* scalar values: Basic Data Typing. (line 13) +* scanning arrays: Scanning an Array. (line 6) +* scanning multidimensional arrays: Multiscanning. (line 11) +* Schorr, Andrew: Acknowledgments. (line 60) +* Schorr, Andrew <1>: Auto-set. (line 327) +* Schorr, Andrew <2>: Contributors. (line 134) +* Schreiber, Bert: Acknowledgments. (line 38) +* Schreiber, Rita: Acknowledgments. (line 38) +* search and replace in strings: String Functions. (line 89) +* search in string: String Functions. (line 155) +* search paths: Programs Exercises. (line 70) +* search paths <1>: PC Using. (line 9) +* search paths <2>: VMS Running. (line 57) +* search paths, for loadable extensions: AWKLIBPATH Variable. (line 6) +* search paths, for source files: AWKPATH Variable. (line 6) +* search paths, for source files <1>: Programs Exercises. (line 70) +* search paths, for source files <2>: PC Using. (line 9) +* search paths, for source files <3>: VMS Running. (line 57) +* searching, files for regular expressions: Egrep Program. (line 6) +* searching, for words: Dupword Program. (line 6) +* sed utility: Full Line Fields. (line 22) +* sed utility <1>: Simple Sed. (line 6) +* sed utility <2>: Glossary. (line 16) +* seeding random number generator: Numeric Functions. (line 79) +* semicolon (;), AWKPATH variable and: PC Using. (line 9) +* semicolon (;), separating statements in actions: Statements/Lines. + (line 90) +* semicolon (;), separating statements in actions <1>: Action Overview. + (line 19) +* semicolon (;), separating statements in actions <2>: Statements. + (line 10) +* separators, field: User-modified. (line 50) +* separators, field <1>: User-modified. (line 113) +* separators, field, FIELDWIDTHS variable and: User-modified. (line 37) +* separators, field, FPAT variable and: User-modified. (line 43) +* separators, for records: awk split records. (line 6) +* separators, for records <1>: awk split records. (line 85) +* separators, for records <2>: User-modified. (line 133) +* separators, for records, regular expressions as: awk split records. + (line 124) +* separators, for statements in actions: Action Overview. (line 19) +* separators, subscript: User-modified. (line 146) +* set breakpoint: Breakpoint Control. (line 11) +* set debugger command: Viewing And Changing Data. + (line 58) +* set directory of message catalogs: I18N Functions. (line 11) +* set watchpoint: Viewing And Changing Data. + (line 66) +* shadowing of variable values: Definition Syntax. (line 77) +* shell quoting, rules for: Quoting. (line 6) +* shells, piping commands into: Redirection. (line 136) +* shells, quoting: Using Shell Variables. + (line 12) +* shells, quoting, rules for: Quoting. (line 18) +* shells, scripts: One-shot. (line 22) +* shells, sea: Undocumented. (line 9) +* shells, variables: Using Shell Variables. + (line 6) +* shift, bitwise: Bitwise Functions. (line 32) +* short-circuit operators: Boolean Ops. (line 59) +* show all source files, in debugger: Debugger Info. (line 45) +* show breakpoints: Debugger Info. (line 21) +* show function arguments, in debugger: Debugger Info. (line 18) +* show local variables, in debugger: Debugger Info. (line 34) +* show name of current source file, in debugger: Debugger Info. + (line 37) +* show watchpoints: Debugger Info. (line 51) +* si debugger command (alias for stepi): Debugger Execution Control. + (line 75) +* side effects: Concatenation. (line 41) +* side effects <1>: Increment Ops. (line 11) +* side effects <2>: Increment Ops. 
(line 75) +* side effects, array indexing: Reference to Elements. + (line 43) +* side effects, asort() function: Array Sorting Functions. + (line 24) +* side effects, assignment expressions: Assignment Ops. (line 22) +* side effects, Boolean operators: Boolean Ops. (line 30) +* side effects, conditional expressions: Conditional Exp. (line 22) +* side effects, decrement/increment operators: Increment Ops. (line 11) +* side effects, FILENAME variable: Getline Notes. (line 19) +* side effects, function calls: Function Calls. (line 57) +* side effects, statements: Action Overview. (line 32) +* sidebar, A Constant's Base Does Not Affect Its Value: Nondecimal-numbers. + (line 63) +* sidebar, Backslash Before Regular Characters: Escape Sequences. + (line 106) +* sidebar, Beware The Smoke and Mirrors!: Bitwise Functions. (line 126) +* sidebar, Changing FS Does Not Affect the Fields: Full Line Fields. + (line 14) +* sidebar, Changing NR and FNR: Auto-set. (line 355) +* sidebar, Controlling Output Buffering with system(): I/O Functions. + (line 164) +* sidebar, Escape Sequences for Metacharacters: Escape Sequences. + (line 138) +* sidebar, FS and IGNORECASE: Field Splitting Summary. + (line 37) +* sidebar, Interactive Versus Noninteractive Buffering: I/O Functions. + (line 74) +* sidebar, Matching the Null String: String Functions. (line 535) +* sidebar, Operator Evaluation Order: Increment Ops. (line 58) +* sidebar, Piping into sh: Redirection. (line 134) +* sidebar, Pre-POSIX awk Used OFMT for String Conversion: Strings And Numbers. + (line 54) +* sidebar, Recipe for a Programming Language: History. (line 6) +* sidebar, RS = "\0" Is Not Portable: gawk split records. (line 63) +* sidebar, So Why Does gawk Have BEGINFILE and ENDFILE?: Filetrans Function. + (line 83) +* sidebar, Syntactic Ambiguities Between /= and Regular Expressions: Assignment Ops. + (line 147) +* sidebar, Understanding #!: Executable Scripts. (line 31) +* sidebar, Understanding $0: Changing Fields. (line 134) +* sidebar, Using close()'s Return Value: Close Files And Pipes. + (line 130) +* sidebar, Using \n in Bracket Expressions of Dynamic Regexps: Computed Regexps. + (line 58) +* SIGHUP signal, for dynamic profiling: Profiling. (line 209) +* SIGINT signal (MS-Windows): Profiling. (line 212) +* signals, HUP/SIGHUP, for profiling: Profiling. (line 209) +* signals, INT/SIGINT (MS-Windows): Profiling. (line 212) +* signals, QUIT/SIGQUIT (MS-Windows): Profiling. (line 212) +* signals, USR1/SIGUSR1, for profiling: Profiling. (line 186) +* signature program: Signature Program. (line 6) +* SIGQUIT signal (MS-Windows): Profiling. (line 212) +* SIGUSR1 signal, for dynamic profiling: Profiling. (line 186) +* silent debugger command: Debugger Execution Control. + (line 10) +* sin: Numeric Functions. (line 90) +* sine: Numeric Functions. (line 90) +* single quote ('): One-shot. (line 15) +* single quote (') in gawk command lines: Long. (line 35) +* single quote ('), in shell commands: Quoting. (line 48) +* single quote ('), vs. apostrophe: Comments. (line 27) +* single quote ('), with double quotes: Quoting. (line 73) +* single-character fields: Single Character Fields. + (line 6) +* single-step execution, in the debugger: Debugger Execution Control. + (line 43) +* Skywalker, Luke: Undocumented. (line 6) +* sleep utility: Alarm Program. (line 109) +* sleep() extension function: Extension Sample Time. + (line 22) +* Solaris, POSIX-compliant awk: Other Versions. (line 100) +* sort array: String Functions. 
(line 42) +* sort array indices: String Functions. (line 42) +* sort function, arrays, sorting: Array Sorting Functions. + (line 6) +* sort utility: Word Sorting. (line 50) +* sort utility, coprocesses and: Two-way I/O. (line 66) +* sorting characters in different languages: Explaining gettext. + (line 94) +* source code, awka: Other Versions. (line 68) +* source code, Brian Kernighan's awk: Other Versions. (line 13) +* source code, BusyBox Awk: Other Versions. (line 92) +* source code, gawk: Gawk Distribution. (line 6) +* source code, Illumos awk: Other Versions. (line 109) +* source code, jawk: Other Versions. (line 117) +* source code, libmawk: Other Versions. (line 125) +* source code, mawk: Other Versions. (line 48) +* source code, mixing: Options. (line 117) +* source code, pawk: Other Versions. (line 82) +* source code, pawk (Python version): Other Versions. (line 129) +* source code, QSE awk: Other Versions. (line 135) +* source code, QuikTrim Awk: Other Versions. (line 139) +* source code, Solaris awk: Other Versions. (line 100) +* source files, search path for: Programs Exercises. (line 70) +* sparse arrays: Array Intro. (line 76) +* Spencer, Henry: Glossary. (line 16) +* split: String Functions. (line 315) +* split string into array: String Functions. (line 296) +* split utility: Split Program. (line 6) +* split() function, array elements, deleting: Delete. (line 61) +* split.awk program: Split Program. (line 30) +* sprintf: OFMT. (line 15) +* sprintf <1>: String Functions. (line 384) +* sprintf() function, OFMT variable and: User-modified. (line 113) +* sprintf() function, print/printf statements and: Round Function. + (line 6) +* sqrt: Numeric Functions. (line 93) +* square brackets ([]), regexp operator: Regexp Operators. (line 56) +* square root: Numeric Functions. (line 93) +* srand: Numeric Functions. (line 97) +* stack frame: Debugging Terms. (line 10) +* Stallman, Richard: Manual History. (line 6) +* Stallman, Richard <1>: Acknowledgments. (line 18) +* Stallman, Richard <2>: Contributors. (line 24) +* Stallman, Richard <3>: Glossary. (line 372) +* standard error: Special FD. (line 6) +* standard input: Read Terminal. (line 6) +* standard input <1>: Special FD. (line 6) +* standard output: Special FD. (line 6) +* starting the debugger: Debugger Invocation. (line 6) +* stat() extension function: Extension Sample File Functions. + (line 18) +* statements, compound, control statements and: Statements. (line 10) +* statements, control, in actions: Statements. (line 6) +* statements, multiple: Statements/Lines. (line 90) +* step debugger command: Debugger Execution Control. + (line 68) +* stepi debugger command: Debugger Execution Control. + (line 75) +* stop automatic display, in debugger: Viewing And Changing Data. + (line 79) +* stream editors: Full Line Fields. (line 22) +* stream editors <1>: Simple Sed. (line 6) +* strftime: Time Functions. (line 48) +* string constants: Scalar Constants. (line 15) +* string constants, vs. regexp constants: Computed Regexps. (line 40) +* string extraction (internationalization): String Extraction. + (line 6) +* string length: String Functions. (line 170) +* string operators: Concatenation. (line 9) +* string, regular expression match: String Functions. (line 210) +* string-manipulation functions: String Functions. (line 6) +* string-matching operators: Regexp Usage. (line 19) +* string-translation functions: I18N Functions. (line 6) +* strings splitting, example: String Functions. (line 334) +* strings, converting: Strings And Numbers. 
(line 6) +* strings, converting <1>: Bitwise Functions. (line 108) +* strings, converting letter case: String Functions. (line 523) +* strings, converting, numbers to: User-modified. (line 30) +* strings, converting, numbers to <1>: User-modified. (line 104) +* strings, empty, See null strings: awk split records. (line 114) +* strings, extracting: String Extraction. (line 6) +* strings, for localization: Programmer i18n. (line 13) +* strings, length limitations: Scalar Constants. (line 20) +* strings, merging arrays into: Join Function. (line 6) +* strings, null: Regexp Field Splitting. + (line 43) +* strings, numeric: Variable Typing. (line 6) +* strtonum: String Functions. (line 391) +* strtonum() function (gawk), --non-decimal-data option and: Nondecimal Data. + (line 35) +* sub: Using Constant Regexps. + (line 43) +* sub <1>: String Functions. (line 409) +* sub() function, arguments of: String Functions. (line 463) +* sub() function, escape processing: Gory Details. (line 6) +* subscript separators: User-modified. (line 146) +* subscripts in arrays, multidimensional: Multidimensional. (line 10) +* subscripts in arrays, multidimensional, scanning: Multiscanning. + (line 11) +* subscripts in arrays, numbers as: Numeric Array Subscripts. + (line 6) +* subscripts in arrays, uninitialized variables as: Uninitialized Subscripts. + (line 6) +* SUBSEP variable: User-modified. (line 146) +* SUBSEP variable, and multidimensional arrays: Multidimensional. + (line 16) +* substitute in string: String Functions. (line 89) +* substr: String Functions. (line 482) +* substring: String Functions. (line 482) +* Sumner, Andrew: Other Versions. (line 68) +* supplementary groups of gawk process: Auto-set. (line 251) +* switch statement: Switch Statement. (line 6) +* SYMTAB array: Auto-set. (line 299) +* syntactic ambiguity: /= operator vs. /=.../ regexp constant: Assignment Ops. + (line 149) +* system: I/O Functions. (line 107) +* systime: Time Functions. (line 66) +* t debugger command (alias for tbreak): Breakpoint Control. (line 90) +* tbreak debugger command: Breakpoint Control. (line 90) +* Tcl: Library Names. (line 58) +* TCP/IP: TCP/IP Networking. (line 6) +* TCP/IP, support for: Special Network. (line 6) +* tee utility: Tee Program. (line 6) +* tee.awk program: Tee Program. (line 26) +* temporary breakpoint: Breakpoint Control. (line 90) +* terminating records: awk split records. (line 124) +* testbits.awk program: Bitwise Functions. (line 69) +* testext extension: Extension Sample API Tests. + (line 6) +* Texinfo: Conventions. (line 6) +* Texinfo <1>: Library Functions. (line 33) +* Texinfo <2>: Dupword Program. (line 17) +* Texinfo <3>: Extract Program. (line 12) +* Texinfo <4>: Distribution contents. + (line 77) +* Texinfo <5>: Adding Code. (line 100) +* Texinfo, chapter beginnings in files: Regexp Operators. (line 22) +* Texinfo, extracting programs from source files: Extract Program. + (line 6) +* text, printing: Print. (line 22) +* text, printing, unduplicated lines of: Uniq Program. (line 6) +* TEXTDOMAIN variable: User-modified. (line 152) +* TEXTDOMAIN variable <1>: Programmer i18n. (line 8) +* TEXTDOMAIN variable, BEGIN pattern and: Programmer i18n. (line 60) +* TEXTDOMAIN variable, portability and: I18N Portability. (line 20) +* textdomain() function (C library): Explaining gettext. (line 28) +* tilde (~), ~ operator: Regexp Usage. (line 19) +* tilde (~), ~ operator <1>: Computed Regexps. (line 6) +* tilde (~), ~ operator <2>: Case-sensitivity. 
(line 26) +* tilde (~), ~ operator <3>: Regexp Constants. (line 6) +* tilde (~), ~ operator <4>: Comparison Operators. + (line 11) +* tilde (~), ~ operator <5>: Comparison Operators. + (line 98) +* tilde (~), ~ operator <6>: Precedence. (line 79) +* tilde (~), ~ operator <7>: Expression Patterns. (line 24) +* time functions: Time Functions. (line 6) +* time, alarm clock example program: Alarm Program. (line 11) +* time, localization and: Explaining gettext. (line 112) +* time, managing: Getlocaltime Function. + (line 6) +* time, retrieving: Time Functions. (line 17) +* timeout, reading input: Read Timeout. (line 6) +* timestamps: Time Functions. (line 6) +* timestamps <1>: Time Functions. (line 66) +* timestamps, converting dates to: Time Functions. (line 76) +* timestamps, formatted: Getlocaltime Function. + (line 6) +* tolower: String Functions. (line 524) +* toupper: String Functions. (line 530) +* tr utility: Translate Program. (line 6) +* trace debugger command: Miscellaneous Debugger Commands. + (line 110) +* traceback, display in debugger: Execution Stack. (line 13) +* translate string: I18N Functions. (line 21) +* translate.awk program: Translate Program. (line 55) +* treating files, as single records: gawk split records. (line 92) +* troubleshooting, --non-decimal-data option: Options. (line 209) +* troubleshooting, == operator: Comparison Operators. + (line 37) +* troubleshooting, awk uses FS not IFS: Field Separators. (line 29) +* troubleshooting, backslash before nonspecial character: Escape Sequences. + (line 108) +* troubleshooting, division: Arithmetic Ops. (line 44) +* troubleshooting, fatal errors, field widths, specifying: Constant Size. + (line 22) +* troubleshooting, fatal errors, printf format strings: Format Modifiers. + (line 157) +* troubleshooting, fflush() function: I/O Functions. (line 63) +* troubleshooting, function call syntax: Function Calls. (line 30) +* troubleshooting, gawk: Compatibility Mode. (line 6) +* troubleshooting, gawk, bug reports: Bugs. (line 9) +* troubleshooting, gawk, fatal errors, function arguments: Calling Built-in. + (line 16) +* troubleshooting, getline function: File Checking. (line 25) +* troubleshooting, gsub()/sub() functions: String Functions. (line 473) +* troubleshooting, match() function: String Functions. (line 291) +* troubleshooting, print statement, omitting commas: Print Examples. + (line 30) +* troubleshooting, printing: Redirection. (line 112) +* troubleshooting, quotes with file names: Special FD. (line 62) +* troubleshooting, readable data files: File Checking. (line 6) +* troubleshooting, regexp constants vs. string constants: Computed Regexps. + (line 40) +* troubleshooting, string concatenation: Concatenation. (line 27) +* troubleshooting, substr() function: String Functions. (line 500) +* troubleshooting, system() function: I/O Functions. (line 129) +* troubleshooting, typographical errors, global variables: Options. + (line 99) +* true, logical: Truth Values. (line 6) +* Trueman, David: History. (line 30) +* Trueman, David <1>: Acknowledgments. (line 47) +* Trueman, David <2>: Contributors. (line 31) +* trunc-mod operation: Arithmetic Ops. (line 66) +* truth values: Truth Values. (line 6) +* type conversion: Strings And Numbers. (line 21) +* type, of variable: Type Functions. (line 14) +* typeof: Type Functions. (line 14) +* u debugger command (alias for until): Debugger Execution Control. + (line 82) +* unassigned array elements: Reference to Elements. + (line 18) +* undefined functions: Pass By Value/Reference. 
+ (line 68) +* underscore (_), C macro: Explaining gettext. (line 71) +* underscore (_), in names of private variables: Library Names. + (line 29) +* underscore (_), translatable string: Programmer i18n. (line 69) +* undisplay debugger command: Viewing And Changing Data. + (line 79) +* undocumented features: Undocumented. (line 6) +* Unicode: Ordinal Functions. (line 45) +* Unicode <1>: Ranges and Locales. (line 61) +* Unicode <2>: Glossary. (line 196) +* uninitialized variables, as array subscripts: Uninitialized Subscripts. + (line 6) +* uniq utility: Uniq Program. (line 6) +* uniq.awk program: Uniq Program. (line 65) +* Unix: Glossary. (line 748) +* Unix awk, backslashes in escape sequences: Escape Sequences. + (line 121) +* Unix awk, close() function and: Close Files And Pipes. + (line 132) +* Unix awk, password files, field separators and: Command Line Field Separator. + (line 62) +* Unix, awk scripts and: Executable Scripts. (line 6) +* unsigned integers: Computer Arithmetic. (line 41) +* until debugger command: Debugger Execution Control. + (line 82) +* unwatch debugger command: Viewing And Changing Data. + (line 83) +* up debugger command: Execution Stack. (line 36) +* user database, reading: Passwd Functions. (line 6) +* user-defined functions: User-defined. (line 6) +* user-defined, functions, counts, in a profile: Profiling. (line 137) +* user-defined, variables: Variables. (line 6) +* user-modifiable variables: User-modified. (line 6) +* users, information about, printing: Id Program. (line 6) +* users, information about, retrieving: Passwd Functions. (line 16) +* USR1 signal, for dynamic profiling: Profiling. (line 186) +* values, numeric: Basic Data Typing. (line 13) +* values, string: Basic Data Typing. (line 13) +* variable assignments and input files: Other Arguments. (line 26) +* variable type: Type Functions. (line 14) +* variable typing: Typing and Comparison. + (line 9) +* variables: Other Features. (line 6) +* variables <1>: Basic Data Typing. (line 6) +* variables, assigning on command line: Assignment Options. (line 6) +* variables, built-in: Using Variables. (line 23) +* variables, flag: Boolean Ops. (line 69) +* variables, getline command into, using: Getline/Variable. (line 6) +* variables, getline command into, using <1>: Getline/Variable/File. + (line 6) +* variables, getline command into, using <2>: Getline/Variable/Pipe. + (line 6) +* variables, getline command into, using <3>: Getline/Variable/Coprocess. + (line 6) +* variables, global, for library functions: Library Names. (line 11) +* variables, global, printing list of: Options. (line 94) +* variables, initializing: Using Variables. (line 23) +* variables, local to a function: Variable Scope. (line 6) +* variables, predefined: Built-in Variables. (line 6) +* variables, predefined -v option, setting with: Options. (line 41) +* variables, predefined conveying information: Auto-set. (line 6) +* variables, private: Library Names. (line 11) +* variables, setting: Options. (line 32) +* variables, shadowing: Definition Syntax. (line 77) +* variables, types of: Assignment Ops. (line 39) +* variables, types of, comparison expressions and: Typing and Comparison. + (line 9) +* variables, uninitialized, as array subscripts: Uninitialized Subscripts. + (line 6) +* variables, user-defined: Variables. (line 6) +* version of gawk: Auto-set. (line 221) +* version of gawk extension API: Auto-set. (line 246) +* version of GNU MP library: Auto-set. (line 229) +* version of GNU MPFR library: Auto-set. 
(line 231) +* vertical bar (|): Regexp Operators. (line 70) +* vertical bar (|), | operator (I/O): Getline/Pipe. (line 10) +* vertical bar (|), | operator (I/O) <1>: Precedence. (line 64) +* vertical bar (|), |& operator (I/O): Getline/Coprocess. (line 6) +* vertical bar (|), |& operator (I/O) <1>: Precedence. (line 64) +* vertical bar (|), |& operator (I/O) <2>: Two-way I/O. (line 27) +* vertical bar (|), || operator: Boolean Ops. (line 59) +* vertical bar (|), || operator <1>: Precedence. (line 88) +* Vinschen, Corinna: Acknowledgments. (line 60) +* w debugger command (alias for watch): Viewing And Changing Data. + (line 66) +* w utility: Constant Size. (line 22) +* wait() extension function: Extension Sample Fork. + (line 22) +* waitpid() extension function: Extension Sample Fork. + (line 18) +* walk_array() user-defined function: Walking Arrays. (line 14) +* Wall, Larry: Array Intro. (line 6) +* Wall, Larry <1>: Future Extensions. (line 6) +* Wallin, Anders: Contributors. (line 104) +* warnings, issuing: Options. (line 184) +* watch debugger command: Viewing And Changing Data. + (line 66) +* watchpoint: Debugging Terms. (line 42) +* wc utility: Wc Program. (line 6) +* wc.awk program: Wc Program. (line 46) +* Weinberger, Peter: History. (line 17) +* Weinberger, Peter <1>: Contributors. (line 12) +* where debugger command: Execution Stack. (line 13) +* where debugger command (alias for backtrace): Execution Stack. + (line 13) +* while statement: While Statement. (line 6) +* while statement, use of regexps in: Regexp Usage. (line 19) +* whitespace, as field separators: Default Field Splitting. + (line 6) +* whitespace, functions, calling: Calling Built-in. (line 10) +* whitespace, newlines as: Options. (line 263) +* Williams, Kent: Contributors. (line 35) +* Woehlke, Matthew: Contributors. (line 80) +* Woods, John: Contributors. (line 28) +* word boundaries, matching: GNU Regexp Operators. + (line 41) +* word, regexp definition of: GNU Regexp Operators. + (line 6) +* word-boundary operator (gawk): GNU Regexp Operators. + (line 66) +* wordfreq.awk program: Word Sorting. (line 56) +* words, counting: Wc Program. (line 6) +* words, duplicate, searching for: Dupword Program. (line 6) +* words, usage counts, generating: Word Sorting. (line 6) +* writea() extension function: Extension Sample Read write array. + (line 12) +* xgettext utility: String Extraction. (line 13) +* xor: Bitwise Functions. (line 57) +* XOR bitwise operation: Bitwise Functions. (line 6) +* Yawitz, Efraim: Contributors. (line 132) +* Zaretskii, Eli: Acknowledgments. (line 60) +* Zaretskii, Eli <1>: Contributors. (line 56) +* Zaretskii, Eli <2>: Maintainers. (line 14) +* zerofile.awk program: Empty Files. (line 20) +* Zoulas, Christos: Contributors. 
(line 67) + + + +Tag Table: +Node: Top1200 +Node: Foreword342530 +Node: Foreword446972 +Node: Preface48504 +Ref: Preface-Footnote-151363 +Ref: Preface-Footnote-251470 +Ref: Preface-Footnote-351704 +Node: History51846 +Node: Names54198 +Ref: Names-Footnote-155292 +Node: This Manual55439 +Ref: This Manual-Footnote-161924 +Node: Conventions62024 +Node: Manual History64378 +Ref: Manual History-Footnote-167373 +Ref: Manual History-Footnote-267414 +Node: How To Contribute67488 +Node: Acknowledgments68617 +Node: Getting Started73503 +Node: Running gawk75942 +Node: One-shot77132 +Node: Read Terminal78395 +Node: Long80388 +Node: Executable Scripts81901 +Ref: Executable Scripts-Footnote-184696 +Node: Comments84799 +Node: Quoting87283 +Node: DOS Quoting92800 +Node: Sample Data Files93475 +Node: Very Simple96070 +Node: Two Rules100972 +Node: More Complex102857 +Node: Statements/Lines105723 +Ref: Statements/Lines-Footnote-1110182 +Node: Other Features110447 +Node: When111383 +Ref: When-Footnote-1113137 +Node: Intro Summary113202 +Node: Invoking Gawk114086 +Node: Command Line115600 +Node: Options116398 +Ref: Options-Footnote-1132497 +Ref: Options-Footnote-2132727 +Node: Other Arguments132752 +Node: Naming Standard Input135699 +Node: Environment Variables136792 +Node: AWKPATH Variable137350 +Ref: AWKPATH Variable-Footnote-1140761 +Ref: AWKPATH Variable-Footnote-2140795 +Node: AWKLIBPATH Variable141056 +Node: Other Environment Variables142313 +Node: Exit Status146134 +Node: Include Files146811 +Node: Loading Shared Libraries150406 +Node: Obsolete151834 +Node: Undocumented152526 +Node: Invoking Summary152823 +Node: Regexp154483 +Node: Regexp Usage156002 +Node: Escape Sequences158039 +Node: Regexp Operators164271 +Ref: Regexp Operators-Footnote-1171687 +Ref: Regexp Operators-Footnote-2171834 +Node: Bracket Expressions171932 +Ref: table-char-classes174408 +Node: Leftmost Longest177545 +Node: Computed Regexps178848 +Node: GNU Regexp Operators182275 +Node: Case-sensitivity185954 +Ref: Case-sensitivity-Footnote-1188850 +Ref: Case-sensitivity-Footnote-2189085 +Node: Strong Regexp Constants189193 +Node: Regexp Summary189982 +Node: Reading Files191457 +Node: Records193620 +Node: awk split records194353 +Node: gawk split records199284 +Ref: gawk split records-Footnote-1203824 +Node: Fields203861 +Node: Nonconstant Fields206602 +Ref: Nonconstant Fields-Footnote-1208838 +Node: Changing Fields209042 +Node: Field Separators214970 +Node: Default Field Splitting217668 +Node: Regexp Field Splitting218786 +Node: Single Character Fields222139 +Node: Command Line Field Separator223199 +Node: Full Line Fields226417 +Ref: Full Line Fields-Footnote-1227939 +Ref: Full Line Fields-Footnote-2227985 +Node: Field Splitting Summary228086 +Node: Constant Size230160 +Node: Splitting By Content234738 +Ref: Splitting By Content-Footnote-1238709 +Node: Multiple Line238872 +Ref: Multiple Line-Footnote-1244754 +Node: Getline244933 +Node: Plain Getline247400 +Node: Getline/Variable250039 +Node: Getline/File251188 +Node: Getline/Variable/File252574 +Ref: Getline/Variable/File-Footnote-1254177 +Node: Getline/Pipe254265 +Node: Getline/Variable/Pipe256970 +Node: Getline/Coprocess258103 +Node: Getline/Variable/Coprocess259368 +Node: Getline Notes260108 +Node: Getline Summary262903 +Ref: table-getline-variants263325 +Node: Read Timeout264073 +Ref: Read Timeout-Footnote-1267979 +Node: Retrying Input268037 +Node: Command-line directories269236 +Node: Input Summary270142 +Node: Input Exercises273314 +Node: Printing274042 +Node: Print275876 +Node: 
Print Examples277333 +Node: Output Separators280113 +Node: OFMT282130 +Node: Printf283486 +Node: Basic Printf284271 +Node: Control Letters285845 +Node: Format Modifiers289833 +Node: Printf Examples295848 +Node: Redirection298334 +Node: Special FD305175 +Ref: Special FD-Footnote-1308343 +Node: Special Files308417 +Node: Other Inherited Files309034 +Node: Special Network310035 +Node: Special Caveats310895 +Node: Close Files And Pipes311844 +Ref: table-close-pipe-return-values318751 +Ref: Close Files And Pipes-Footnote-1319534 +Ref: Close Files And Pipes-Footnote-2319682 +Node: Nonfatal319834 +Node: Output Summary322159 +Node: Output Exercises323381 +Node: Expressions324060 +Node: Values325248 +Node: Constants325926 +Node: Scalar Constants326617 +Ref: Scalar Constants-Footnote-1327481 +Node: Nondecimal-numbers327731 +Node: Regexp Constants330744 +Node: Using Constant Regexps331270 +Node: Variables334433 +Node: Using Variables335090 +Node: Assignment Options337000 +Node: Conversion338873 +Node: Strings And Numbers339397 +Ref: Strings And Numbers-Footnote-1342460 +Node: Locale influences conversions342569 +Ref: table-locale-affects345327 +Node: All Operators345945 +Node: Arithmetic Ops346574 +Node: Concatenation349080 +Ref: Concatenation-Footnote-1351927 +Node: Assignment Ops352034 +Ref: table-assign-ops357025 +Node: Increment Ops358338 +Node: Truth Values and Conditions361798 +Node: Truth Values362872 +Node: Typing and Comparison363920 +Node: Variable Typing364740 +Node: Comparison Operators368364 +Ref: table-relational-ops368783 +Node: POSIX String Comparison372278 +Ref: POSIX String Comparison-Footnote-1373973 +Ref: POSIX String Comparison-Footnote-2374112 +Node: Boolean Ops374196 +Ref: Boolean Ops-Footnote-1378678 +Node: Conditional Exp378770 +Node: Function Calls380506 +Node: Precedence384383 +Node: Locales388042 +Node: Expressions Summary389674 +Node: Patterns and Actions392247 +Node: Pattern Overview393367 +Node: Regexp Patterns395044 +Node: Expression Patterns395586 +Node: Ranges399367 +Node: BEGIN/END402475 +Node: Using BEGIN/END403236 +Ref: Using BEGIN/END-Footnote-1405972 +Node: I/O And BEGIN/END406078 +Node: BEGINFILE/ENDFILE408392 +Node: Empty411299 +Node: Using Shell Variables411616 +Node: Action Overview413890 +Node: Statements416215 +Node: If Statement418063 +Node: While Statement419558 +Node: Do Statement421586 +Node: For Statement422734 +Node: Switch Statement425892 +Node: Break Statement428278 +Node: Continue Statement430370 +Node: Next Statement432197 +Node: Nextfile Statement434580 +Node: Exit Statement437232 +Node: Built-in Variables439635 +Node: User-modified440768 +Node: Auto-set448354 +Ref: Auto-set-Footnote-1463007 +Ref: Auto-set-Footnote-2463213 +Node: ARGC and ARGV463269 +Node: Pattern Action Summary467482 +Node: Arrays469912 +Node: Array Basics471241 +Node: Array Intro472085 +Ref: figure-array-elements474060 +Ref: Array Intro-Footnote-1476764 +Node: Reference to Elements476892 +Node: Assigning Elements479356 +Node: Array Example479847 +Node: Scanning an Array481606 +Node: Controlling Scanning484628 +Ref: Controlling Scanning-Footnote-1490027 +Node: Numeric Array Subscripts490343 +Node: Uninitialized Subscripts492527 +Node: Delete494146 +Ref: Delete-Footnote-1496898 +Node: Multidimensional496955 +Node: Multiscanning500050 +Node: Arrays of Arrays501641 +Node: Arrays Summary506408 +Node: Functions508501 +Node: Built-in509539 +Node: Calling Built-in510620 +Node: Numeric Functions512616 +Ref: Numeric Functions-Footnote-1517449 +Ref: Numeric Functions-Footnote-2517806 
+Ref: Numeric Functions-Footnote-3517854 +Node: String Functions518126 +Ref: String Functions-Footnote-1541630 +Ref: String Functions-Footnote-2541758 +Ref: String Functions-Footnote-3542006 +Node: Gory Details542093 +Ref: table-sub-escapes543884 +Ref: table-sub-proposed545403 +Ref: table-posix-sub546766 +Ref: table-gensub-escapes548307 +Ref: Gory Details-Footnote-1549130 +Node: I/O Functions549284 +Ref: table-system-return-values555866 +Ref: I/O Functions-Footnote-1557846 +Ref: I/O Functions-Footnote-2557994 +Node: Time Functions558114 +Ref: Time Functions-Footnote-1568636 +Ref: Time Functions-Footnote-2568704 +Ref: Time Functions-Footnote-3568862 +Ref: Time Functions-Footnote-4568973 +Ref: Time Functions-Footnote-5569085 +Ref: Time Functions-Footnote-6569312 +Node: Bitwise Functions569578 +Ref: table-bitwise-ops570172 +Ref: Bitwise Functions-Footnote-1576217 +Ref: Bitwise Functions-Footnote-2576390 +Node: Type Functions576581 +Node: I18N Functions579113 +Node: User-defined580764 +Node: Definition Syntax581569 +Ref: Definition Syntax-Footnote-1587256 +Node: Function Example587327 +Ref: Function Example-Footnote-1590249 +Node: Function Caveats590271 +Node: Calling A Function590789 +Node: Variable Scope591747 +Node: Pass By Value/Reference594741 +Node: Return Statement598240 +Node: Dynamic Typing601219 +Node: Indirect Calls602149 +Ref: Indirect Calls-Footnote-1612400 +Node: Functions Summary612528 +Node: Library Functions615233 +Ref: Library Functions-Footnote-1618840 +Ref: Library Functions-Footnote-2618983 +Node: Library Names619154 +Ref: Library Names-Footnote-1622614 +Ref: Library Names-Footnote-2622837 +Node: General Functions622923 +Node: Strtonum Function624026 +Node: Assert Function627048 +Node: Round Function630374 +Node: Cliff Random Function631915 +Node: Ordinal Functions632931 +Ref: Ordinal Functions-Footnote-1635994 +Ref: Ordinal Functions-Footnote-2636246 +Node: Join Function636456 +Ref: Join Function-Footnote-1638226 +Node: Getlocaltime Function638426 +Node: Readfile Function642168 +Node: Shell Quoting644140 +Node: Data File Management645541 +Node: Filetrans Function646173 +Node: Rewind Function650269 +Node: File Checking652175 +Ref: File Checking-Footnote-1653509 +Node: Empty Files653710 +Node: Ignoring Assigns655689 +Node: Getopt Function657239 +Ref: Getopt Function-Footnote-1668708 +Node: Passwd Functions668908 +Ref: Passwd Functions-Footnote-1677747 +Node: Group Functions677835 +Ref: Group Functions-Footnote-1685733 +Node: Walking Arrays685940 +Node: Library Functions Summary688948 +Node: Library Exercises690354 +Node: Sample Programs690819 +Node: Running Examples691589 +Node: Clones692317 +Node: Cut Program693541 +Node: Egrep Program703470 +Ref: Egrep Program-Footnote-1710982 +Node: Id Program711092 +Node: Split Program714772 +Ref: Split Program-Footnote-1718231 +Node: Tee Program718360 +Node: Uniq Program721150 +Node: Wc Program728576 +Ref: Wc Program-Footnote-1732831 +Node: Miscellaneous Programs732925 +Node: Dupword Program734138 +Node: Alarm Program736168 +Node: Translate Program741023 +Ref: Translate Program-Footnote-1745588 +Node: Labels Program745858 +Ref: Labels Program-Footnote-1749209 +Node: Word Sorting749293 +Node: History Sorting753365 +Node: Extract Program755200 +Node: Simple Sed762729 +Node: Igawk Program765803 +Ref: Igawk Program-Footnote-1780134 +Ref: Igawk Program-Footnote-2780336 +Ref: Igawk Program-Footnote-3780458 +Node: Anagram Program780573 +Node: Signature Program783635 +Node: Programs Summary784882 +Node: Programs Exercises786096 +Ref: Programs 
Exercises-Footnote-1790225 +Node: Advanced Features790316 +Node: Nondecimal Data792306 +Node: Array Sorting793897 +Node: Controlling Array Traversal794597 +Ref: Controlling Array Traversal-Footnote-1802964 +Node: Array Sorting Functions803082 +Ref: Array Sorting Functions-Footnote-1808173 +Node: Two-way I/O808369 +Ref: Two-way I/O-Footnote-1814919 +Ref: Two-way I/O-Footnote-2815106 +Node: TCP/IP Networking815188 +Node: Profiling818306 +Ref: Profiling-Footnote-1826799 +Node: Advanced Features Summary827122 +Node: Internationalization828966 +Node: I18N and L10N830446 +Node: Explaining gettext831133 +Ref: Explaining gettext-Footnote-1837025 +Ref: Explaining gettext-Footnote-2837210 +Node: Programmer i18n837375 +Ref: Programmer i18n-Footnote-1842230 +Node: Translator i18n842279 +Node: String Extraction843073 +Ref: String Extraction-Footnote-1844205 +Node: Printf Ordering844291 +Ref: Printf Ordering-Footnote-1847077 +Node: I18N Portability847141 +Ref: I18N Portability-Footnote-1849597 +Node: I18N Example849660 +Ref: I18N Example-Footnote-1852466 +Node: Gawk I18N852539 +Node: I18N Summary853184 +Node: Debugger854525 +Node: Debugging855547 +Node: Debugging Concepts855988 +Node: Debugging Terms857797 +Node: Awk Debugging860372 +Node: Sample Debugging Session861278 +Node: Debugger Invocation861812 +Node: Finding The Bug863198 +Node: List of Debugger Commands869676 +Node: Breakpoint Control871009 +Node: Debugger Execution Control874703 +Node: Viewing And Changing Data878065 +Node: Execution Stack881439 +Node: Debugger Info883076 +Node: Miscellaneous Debugger Commands887147 +Node: Readline Support892235 +Node: Limitations893131 +Ref: Limitations-Footnote-1897362 +Node: Debugging Summary897413 +Node: Arbitrary Precision Arithmetic898692 +Node: Computer Arithmetic900108 +Ref: table-numeric-ranges903699 +Ref: Computer Arithmetic-Footnote-1904421 +Node: Math Definitions904478 +Ref: table-ieee-formats907792 +Ref: Math Definitions-Footnote-1908395 +Node: MPFR features908500 +Node: FP Math Caution910217 +Ref: FP Math Caution-Footnote-1911289 +Node: Inexactness of computations911658 +Node: Inexact representation912618 +Node: Comparing FP Values913978 +Node: Errors accumulate915060 +Node: Getting Accuracy916493 +Node: Try To Round919203 +Node: Setting precision920102 +Ref: table-predefined-precision-strings920799 +Node: Setting the rounding mode922629 +Ref: table-gawk-rounding-modes923003 +Ref: Setting the rounding mode-Footnote-1926411 +Node: Arbitrary Precision Integers926590 +Ref: Arbitrary Precision Integers-Footnote-1931507 +Node: POSIX Floating Point Problems931656 +Ref: POSIX Floating Point Problems-Footnote-1935538 +Node: Floating point summary935576 +Node: Dynamic Extensions937766 +Node: Extension Intro939319 +Node: Plugin License940585 +Node: Extension Mechanism Outline941382 +Ref: figure-load-extension941821 +Ref: figure-register-new-function943386 +Ref: figure-call-new-function944478 +Node: Extension API Description946540 +Node: Extension API Functions Introduction948072 +Node: General Data Types952931 +Ref: General Data Types-Footnote-1958886 +Node: Memory Allocation Functions959185 +Ref: Memory Allocation Functions-Footnote-1962030 +Node: Constructor Functions962129 +Node: Registration Functions963874 +Node: Extension Functions964559 +Node: Exit Callback Functions967182 +Node: Extension Version String968432 +Node: Input Parsers969095 +Node: Output Wrappers978977 +Node: Two-way processors983489 +Node: Printing Messages985754 +Ref: Printing Messages-Footnote-1986925 +Node: Updating ERRNO987078 
+Node: Requesting Values987817 +Ref: table-value-types-returned988554 +Node: Accessing Parameters989437 +Node: Symbol Table Access990672 +Node: Symbol table by name991184 +Node: Symbol table by cookie993205 +Ref: Symbol table by cookie-Footnote-1997357 +Node: Cached values997421 +Ref: Cached values-Footnote-11000928 +Node: Array Manipulation1001019 +Ref: Array Manipulation-Footnote-11002110 +Node: Array Data Types1002147 +Ref: Array Data Types-Footnote-11004805 +Node: Array Functions1004897 +Node: Flattening Arrays1008755 +Node: Creating Arrays1015663 +Node: Redirection API1020432 +Node: Extension API Variables1023263 +Node: Extension Versioning1023896 +Ref: gawk-api-version1024333 +Node: Extension API Informational Variables1026089 +Node: Extension API Boilerplate1027153 +Node: Finding Extensions1030967 +Node: Extension Example1031526 +Node: Internal File Description1032324 +Node: Internal File Ops1036404 +Ref: Internal File Ops-Footnote-11048166 +Node: Using Internal File Ops1048306 +Ref: Using Internal File Ops-Footnote-11050689 +Node: Extension Samples1050963 +Node: Extension Sample File Functions1052492 +Node: Extension Sample Fnmatch1060141 +Node: Extension Sample Fork1061628 +Node: Extension Sample Inplace1062846 +Node: Extension Sample Ord1066056 +Node: Extension Sample Readdir1066892 +Ref: table-readdir-file-types1067781 +Node: Extension Sample Revout1068586 +Node: Extension Sample Rev2way1069175 +Node: Extension Sample Read write array1069915 +Node: Extension Sample Readfile1071857 +Node: Extension Sample Time1072952 +Node: Extension Sample API Tests1074300 +Node: gawkextlib1074792 +Node: Extension summary1077239 +Node: Extension Exercises1080941 +Node: Language History1082439 +Node: V7/SVR3.11084095 +Node: SVR41086247 +Node: POSIX1087681 +Node: BTL1089060 +Node: POSIX/GNU1089789 +Node: Feature History1095651 +Node: Common Extensions1110021 +Node: Ranges and Locales1111304 +Ref: Ranges and Locales-Footnote-11115920 +Ref: Ranges and Locales-Footnote-21115947 +Ref: Ranges and Locales-Footnote-31116182 +Node: Contributors1116403 +Node: History summary1121963 +Node: Installation1123343 +Node: Gawk Distribution1124287 +Node: Getting1124771 +Node: Extracting1125732 +Node: Distribution contents1127370 +Node: Unix Installation1133455 +Node: Quick Installation1134137 +Node: Shell Startup Files1136551 +Node: Additional Configuration Options1137629 +Node: Configuration Philosophy1139434 +Node: Non-Unix Installation1141803 +Node: PC Installation1142263 +Node: PC Binary Installation1143101 +Node: PC Compiling1143536 +Node: PC Using1144653 +Node: Cygwin1147698 +Node: MSYS1148468 +Node: VMS Installation1148969 +Node: VMS Compilation1149760 +Ref: VMS Compilation-Footnote-11150989 +Node: VMS Dynamic Extensions1151047 +Node: VMS Installation Details1152732 +Node: VMS Running1154985 +Node: VMS GNV1159264 +Node: VMS Old Gawk1159999 +Node: Bugs1160470 +Node: Bug address1161133 +Node: Usenet1163530 +Node: Maintainers1164305 +Node: Other Versions1165681 +Node: Installation summary1172265 +Node: Notes1173300 +Node: Compatibility Mode1174165 +Node: Additions1174947 +Node: Accessing The Source1175872 +Node: Adding Code1177307 +Node: New Ports1183526 +Node: Derived Files1188014 +Ref: Derived Files-Footnote-11193499 +Ref: Derived Files-Footnote-21193534 +Ref: Derived Files-Footnote-31194132 +Node: Future Extensions1194246 +Node: Implementation Limitations1194904 +Node: Extension Design1196087 +Node: Old Extension Problems1197241 +Ref: Old Extension Problems-Footnote-11198759 +Node: Extension New Mechanism 
Goals1198816 +Ref: Extension New Mechanism Goals-Footnote-11202180 +Node: Extension Other Design Decisions1202369 +Node: Extension Future Growth1204482 +Node: Old Extension Mechanism1205318 +Node: Notes summary1207081 +Node: Basic Concepts1208263 +Node: Basic High Level1208944 +Ref: figure-general-flow1209226 +Ref: figure-process-flow1209911 +Ref: Basic High Level-Footnote-11213212 +Node: Basic Data Typing1213397 +Node: Glossary1216725 +Node: Copying1248672 +Node: GNU Free Documentation License1286211 +Node: Index1311329 + +End Tag Table diff --git a/doc/gawkinet.info b/doc/gawkinet.info new file mode 100644 index 00000000..d5a7abf8 --- /dev/null +++ b/doc/gawkinet.info @@ -0,0 +1,4406 @@ +This is gawkinet.info, produced by makeinfo version 6.1 from +gawkinet.texi. + +This is Edition 1.4 of 'TCP/IP Internetworking with 'gawk'', for the +4.1.4 (or later) version of the GNU implementation of AWK. + + + Copyright (C) 2000, 2001, 2002, 2004, 2009, 2010, 2016 Free Software +Foundation, Inc. + + + Permission is granted to copy, distribute and/or modify this document +under the terms of the GNU Free Documentation License, Version 1.3 or +any later version published by the Free Software Foundation; with the +Invariant Sections being "GNU General Public License", the Front-Cover +texts being (a) (see below), and with the Back-Cover Texts being (b) +(see below). A copy of the license is included in the section entitled +"GNU Free Documentation License". + + a. "A GNU Manual" + + b. "You have the freedom to copy and modify this GNU manual. Buying + copies from the FSF supports it in developing GNU and promoting + software freedom." +INFO-DIR-SECTION Network applications +START-INFO-DIR-ENTRY +* Gawkinet: (gawkinet). TCP/IP Internetworking With 'gawk'. +END-INFO-DIR-ENTRY + + This file documents the networking features in GNU 'awk'. + + This is Edition 1.4 of 'TCP/IP Internetworking with 'gawk'', for the +4.1.4 (or later) version of the GNU implementation of AWK. + + + Copyright (C) 2000, 2001, 2002, 2004, 2009, 2010, 2016 Free Software +Foundation, Inc. + + + Permission is granted to copy, distribute and/or modify this document +under the terms of the GNU Free Documentation License, Version 1.3 or +any later version published by the Free Software Foundation; with the +Invariant Sections being "GNU General Public License", the Front-Cover +texts being (a) (see below), and with the Back-Cover Texts being (b) +(see below). A copy of the license is included in the section entitled +"GNU Free Documentation License". + + a. "A GNU Manual" + + b. "You have the freedom to copy and modify this GNU manual. Buying + copies from the FSF supports it in developing GNU and promoting + software freedom." + + +File: gawkinet.info, Node: Top, Next: Preface, Prev: (dir), Up: (dir) + +General Introduction +******************** + +This file documents the networking features in GNU Awk ('gawk') version +4.0 and later. + + This is Edition 1.4 of 'TCP/IP Internetworking with 'gawk'', for the +4.1.4 (or later) version of the GNU implementation of AWK. + + + Copyright (C) 2000, 2001, 2002, 2004, 2009, 2010, 2016 Free Software +Foundation, Inc. + + + Permission is granted to copy, distribute and/or modify this document +under the terms of the GNU Free Documentation License, Version 1.3 or +any later version published by the Free Software Foundation; with the +Invariant Sections being "GNU General Public License", the Front-Cover +texts being (a) (see below), and with the Back-Cover Texts being (b) +(see below). 
A copy of the license is included in the section entitled +"GNU Free Documentation License". + + a. "A GNU Manual" + + b. "You have the freedom to copy and modify this GNU manual. Buying + copies from the FSF supports it in developing GNU and promoting + software freedom." + +* Menu: + +* Preface:: About this document. +* Introduction:: About networking. +* Using Networking:: Some examples. +* Some Applications and Techniques:: More extended examples. +* Links:: Where to find the stuff mentioned in this + document. +* GNU Free Documentation License:: The license for this document. +* Index:: The index. + +* Stream Communications:: Sending data streams. +* Datagram Communications:: Sending self-contained messages. +* The TCP/IP Protocols:: How these models work in the Internet. +* Basic Protocols:: The basic protocols. +* Ports:: The idea behind ports. +* Making Connections:: Making TCP/IP connections. +* Gawk Special Files:: How to do 'gawk' networking. +* Special File Fields:: The fields in the special file name. +* Comparing Protocols:: Differences between the protocols. +* File /inet/tcp:: The TCP special file. +* File /inet/udp:: The UDP special file. +* TCP Connecting:: Making a TCP connection. +* Troubleshooting:: Troubleshooting TCP/IP connections. +* Interacting:: Interacting with a service. +* Setting Up:: Setting up a service. +* Email:: Reading email. +* Web page:: Reading a Web page. +* Primitive Service:: A primitive Web service. +* Interacting Service:: A Web service with interaction. +* CGI Lib:: A simple CGI library. +* Simple Server:: A simple Web server. +* Caveats:: Network programming caveats. +* Challenges:: Where to go from here. +* PANIC:: An Emergency Web Server. +* GETURL:: Retrieving Web Pages. +* REMCONF:: Remote Configuration Of Embedded Systems. +* URLCHK:: Look For Changed Web Pages. +* WEBGRAB:: Extract Links From A Page. +* STATIST:: Graphing A Statistical Distribution. +* MAZE:: Walking Through A Maze In Virtual Reality. +* MOBAGWHO:: A Simple Mobile Agent. +* STOXPRED:: Stock Market Prediction As A Service. +* PROTBASE:: Searching Through A Protein Database. + + +File: gawkinet.info, Node: Preface, Next: Introduction, Prev: Top, Up: Top + +Preface +******* + +In May of 1997, Ju"rgen Kahrs felt the need for network access from +'awk', and, with a little help from me, set about adding features to do +this for 'gawk'. At that time, he wrote the bulk of this Info file. + + The code and documentation were added to the 'gawk' 3.1 development +tree, and languished somewhat until I could finally get down to some +serious work on that version of 'gawk'. This finally happened in the +middle of 2000. + + Meantime, Ju"rgen wrote an article about the Internet special files +and '|&' operator for 'Linux Journal', and made a networking patch for +the production versions of 'gawk' available from his home page. In +August of 2000 (for 'gawk' 3.0.6), this patch also made it to the main +GNU 'ftp' distribution site. + + For release with 'gawk', I edited Ju"rgen's prose for English grammar +and style, as he is not a native English speaker. I also rearranged the +material somewhat for what I felt was a better order of presentation, +and (re)wrote some of the introductory material. + + The majority of this document and the code are his work, and the high +quality and interesting ideas speak for themselves. It is my hope that +these features will be of significant value to the 'awk' community. 
+ + +Arnold Robbins +Nof Ayalon, ISRAEL +March, 2001 + + +File: gawkinet.info, Node: Introduction, Next: Using Networking, Prev: Preface, Up: Top + +1 Networking Concepts +********************* + +This major node provides a (necessarily) brief introduction to computer +networking concepts. For many applications of 'gawk' to TCP/IP +networking, we hope that this is enough. For more advanced tasks, you +will need deeper background, and it may be necessary to switch to +lower-level programming in C or C++. + + There are two real-life models for the way computers send messages to +each other over a network. While the analogies are not perfect, they +are close enough to convey the major concepts. These two models are the +phone system (reliable byte-stream communications), and the postal +system (best-effort datagrams). + +* Menu: + +* Stream Communications:: Sending data streams. +* Datagram Communications:: Sending self-contained messages. +* The TCP/IP Protocols:: How these models work in the Internet. +* Making Connections:: Making TCP/IP connections. + + +File: gawkinet.info, Node: Stream Communications, Next: Datagram Communications, Prev: Introduction, Up: Introduction + +1.1 Reliable Byte-streams (Phone Calls) +======================================= + +When you make a phone call, the following steps occur: + + 1. You dial a number. + + 2. The phone system connects to the called party, telling them there + is an incoming call. (Their phone rings.) + + 3. The other party answers the call, or, in the case of a computer + network, refuses to answer the call. + + 4. Assuming the other party answers, the connection between you is now + a "duplex" (two-way), "reliable" (no data lost), sequenced (data + comes out in the order sent) data stream. + + 5. You and your friend may now talk freely, with the phone system + moving the data (your voices) from one end to the other. From your + point of view, you have a direct end-to-end connection with the + person on the other end. + + The same steps occur in a duplex reliable computer networking +connection. There is considerably more overhead in setting up the +communications, but once it's done, data moves in both directions, +reliably, in sequence. + + +File: gawkinet.info, Node: Datagram Communications, Next: The TCP/IP Protocols, Prev: Stream Communications, Up: Introduction + +1.2 Best-effort Datagrams (Mailed Letters) +========================================== + +Suppose you mail three different documents to your office on the other +side of the country on two different days. Doing so entails the +following. + + 1. Each document travels in its own envelope. + + 2. Each envelope contains both the sender and the recipient address. + + 3. Each envelope may travel a different route to its destination. + + 4. The envelopes may arrive in a different order from the one in which + they were sent. + + 5. One or more may get lost in the mail. (Although, fortunately, this + does not occur very often.) + + 6. In a computer network, one or more "packets" may also arrive + multiple times. (This doesn't happen with the postal system!) + + The important characteristics of datagram communications, like those +of the postal system are thus: + + * Delivery is "best effort;" the data may never get there. + + * Each message is self-contained, including the source and + destination addresses. + + * Delivery is _not_ sequenced; packets may arrive out of order, + and/or multiple times. + + * Unlike the phone system, overhead is considerably lower. 
It is not + necessary to set up the call first. + + The price the user pays for the lower overhead of datagram +communications is exactly the lower reliability; it is often necessary +for user-level protocols that use datagram communications to add their +own reliability features on top of the basic communications. + + +File: gawkinet.info, Node: The TCP/IP Protocols, Next: Making Connections, Prev: Datagram Communications, Up: Introduction + +1.3 The Internet Protocols +========================== + +The Internet Protocol Suite (usually referred to as just TCP/IP)(1) +consists of a number of different protocols at different levels or +"layers." For our purposes, three protocols provide the fundamental +communications mechanisms. All other defined protocols are referred to +as user-level protocols (e.g., HTTP, used later in this Info file). + +* Menu: + +* Basic Protocols:: The basic protocols. +* Ports:: The idea behind ports. + + ---------- Footnotes ---------- + + (1) It should be noted that although the Internet seems to have +conquered the world, there are other networking protocol suites in +existence and in use. + + +File: gawkinet.info, Node: Basic Protocols, Next: Ports, Prev: The TCP/IP Protocols, Up: The TCP/IP Protocols + +1.3.1 The Basic Internet Protocols +---------------------------------- + +IP + The Internet Protocol. This protocol is almost never used directly + by applications. It provides the basic packet delivery and routing + infrastructure of the Internet. Much like the phone company's + switching centers or the Post Office's trucks, it is not of much + day-to-day interest to the regular user (or programmer). It + happens to be a best effort datagram protocol. In the early + twenty-first century, there are two versions of this protocol in + use: + + IPv4 + The original version of the Internet Protocol, with 32-bit + addresses, on which most of the current Internet is based. + + IPv6 + The "next generation" of the Internet Protocol, with 128-bit + addresses. This protocol is in wide use in certain parts of + the world, but has not yet replaced IPv4.(1) + + Versions of the other protocols that sit "atop" IP exist for both + IPv4 and IPv6. However, as the IPv6 versions are fundamentally the + same as the original IPv4 versions, we will not distinguish further + between them. + +UDP + The User Datagram Protocol. This is a best effort datagram + protocol. It provides a small amount of extra reliability over IP, + and adds the notion of "ports", described in *note TCP and UDP + Ports: Ports. + +TCP + The Transmission Control Protocol. This is a duplex, reliable, + sequenced byte-stream protocol, again layered on top of IP, and + also providing the notion of ports. This is the protocol that you + will most likely use when using 'gawk' for network programming. + + All other user-level protocols use either TCP or UDP to do their +basic communications. Examples are SMTP (Simple Mail Transfer +Protocol), FTP (File Transfer Protocol), and HTTP (HyperText Transfer +Protocol). + + ---------- Footnotes ---------- + + (1) There isn't an IPv5. + + +File: gawkinet.info, Node: Ports, Prev: Basic Protocols, Up: The TCP/IP Protocols + +1.3.2 TCP and UDP Ports +----------------------- + +In the postal system, the address on an envelope indicates a physical +location, such as a residence or office building. But there may be more +than one person at the location; thus you have to further quantify the +recipient by putting a person or company name on the envelope. 
+ + In the phone system, one phone number may represent an entire +company, in which case you need a person's extension number in order to +reach that individual directly. Or, when you call a home, you have to +say, "May I please speak to ..." before talking to the person directly. + + IP networking provides the concept of addressing. An IP address +represents a particular computer, but no more. In order to reach the +mail service on a system, or the FTP or WWW service on a system, you +must have some way to further specify which service you want. In the +Internet Protocol suite, this is done with "port numbers", which +represent the services, much like an extension number used with a phone +number. + + Port numbers are 16-bit integers. Unix and Unix-like systems reserve +ports below 1024 for "well known" services, such as SMTP, FTP, and HTTP. +Numbers 1024 and above may be used by any application, although there is +no promise made that a particular port number is always available. + + +File: gawkinet.info, Node: Making Connections, Prev: The TCP/IP Protocols, Up: Introduction + +1.4 Making TCP/IP Connections (And Some Terminology) +==================================================== + +Two terms come up repeatedly when discussing networking: "client" and +"server". For now, we'll discuss these terms at the "connection level", +when first establishing connections between two processes on different +systems over a network. (Once the connection is established, the higher +level, or "application level" protocols, such as HTTP or FTP, determine +who is the client and who is the server. Often, it turns out that the +client and server are the same in both roles.) + + The "server" is the system providing the service, such as the web +server or email server. It is the "host" (system) which is _connected +to_ in a transaction. For this to work though, the server must be +expecting connections. Much as there has to be someone at the office +building to answer the phone(1), the server process (usually) has to be +started first and be waiting for a connection. + + The "client" is the system requesting the service. It is the system +_initiating the connection_ in a transaction. (Just as when you pick up +the phone to call an office or store.) + + In the TCP/IP framework, each end of a connection is represented by a +pair of (ADDRESS, PORT) pairs. For the duration of the connection, the +ports in use at each end are unique, and cannot be used simultaneously +by other processes on the same system. (Only after closing a connection +can a new one be built up on the same port. This is contrary to the +usual behavior of fully developed web servers which have to avoid +situations in which they are not reachable. We have to pay this price +in order to enjoy the benefits of a simple communication paradigm in +'gawk'.) + + Furthermore, once the connection is established, communications are +"synchronous".(2) I.e., each end waits on the other to finish +transmitting, before replying. This is much like two people in a phone +conversation. While both could talk simultaneously, doing so usually +doesn't work too well. + + In the case of TCP, the synchronicity is enforced by the protocol +when sending data. Data writes "block" until the data have been +received on the other end. For both TCP and UDP, data reads block until +there is incoming data waiting to be read. This is summarized in the +following table, where an "X" indicates that the given action blocks. 
+The first column shows whether reads block, the second whether writes
+block.
+ +TCP X X +UDP X + + ---------- Footnotes ---------- + + (1) In the days before voice mail systems! + + (2) For the technically savvy, data reads block--if there's no +incoming data, the program is made to wait until there is, instead of +receiving a "there's no data" error return. + + +File: gawkinet.info, Node: Using Networking, Next: Some Applications and Techniques, Prev: Introduction, Up: Top + +2 Networking With 'gawk' +************************ + +The 'awk' programming language was originally developed as a +pattern-matching language for writing short programs to perform data +manipulation tasks. 'awk''s strength is the manipulation of textual +data that is stored in files. It was never meant to be used for +networking purposes. To exploit its features in a networking context, +it's necessary to use an access mode for network connections that +resembles the access of files as closely as possible. + + 'awk' is also meant to be a prototyping language. It is used to +demonstrate feasibility and to play with features and user interfaces. +This can be done with file-like handling of network connections. 'gawk' +trades the lack of many of the advanced features of the TCP/IP family of +protocols for the convenience of simple connection handling. The +advanced features are available when programming in C or Perl. In fact, +the network programming in this major node is very similar to what is +described in books such as 'Internet Programming with Python', 'Advanced +Perl Programming', or 'Web Client Programming with Perl'. + + However, you can do the programming here without first having to +learn object-oriented ideology; underlying languages such as Tcl/Tk, +Perl, Python; or all of the libraries necessary to extend these +languages before they are ready for the Internet. + + This major node demonstrates how to use the TCP protocol. The UDP +protocol is much less important for most users. + +* Menu: + +* Gawk Special Files:: How to do 'gawk' networking. +* TCP Connecting:: Making a TCP connection. +* Troubleshooting:: Troubleshooting TCP/IP connections. +* Interacting:: Interacting with a service. +* Setting Up:: Setting up a service. +* Email:: Reading email. +* Web page:: Reading a Web page. +* Primitive Service:: A primitive Web service. +* Interacting Service:: A Web service with interaction. +* Simple Server:: A simple Web server. +* Caveats:: Network programming caveats. +* Challenges:: Where to go from here. + + +File: gawkinet.info, Node: Gawk Special Files, Next: TCP Connecting, Prev: Using Networking, Up: Using Networking + +2.1 'gawk''s Networking Mechanisms +================================== + +The '|&' operator for use in communicating with a "coprocess" is +described in *note Two-way Communications With Another Process: +(gawk)Two-way I/O. It shows how to do two-way I/O to a separate process, +sending it data with 'print' or 'printf' and reading data with +'getline'. If you haven't read it already, you should detour there to +do so. + + 'gawk' transparently extends the two-way I/O mechanism to simple +networking through the use of special file names. When a "coprocess" +that matches the special files we are about to describe is started, +'gawk' creates the appropriate network connection, and then two-way I/O +proceeds as usual. + + At the C, C++, and Perl level, networking is accomplished via +"sockets", an Application Programming Interface (API) originally +developed at the University of California at Berkeley that is now used +almost universally for TCP/IP networking. 
Socket level programming, +while fairly straightforward, requires paying attention to a number of +details, as well as using binary data. It is not well-suited for use +from a high-level language like 'awk'. The special files provided in +'gawk' hide the details from the programmer, making things much simpler +and easier to use. + + The special file name for network access is made up of several +fields, all of which are mandatory: + + /NET-TYPE/PROTOCOL/LOCALPORT/HOSTNAME/REMOTEPORT + + The NET-TYPE field lets you specify IPv4 versus IPv6, or lets you +allow the system to choose. + +* Menu: + +* Special File Fields:: The fields in the special file name. +* Comparing Protocols:: Differences between the protocols. + + +File: gawkinet.info, Node: Special File Fields, Next: Comparing Protocols, Prev: Gawk Special Files, Up: Gawk Special Files + +2.1.1 The Fields of the Special File Name +----------------------------------------- + +This node explains the meaning of all the other fields, as well as the +range of values and the defaults. All of the fields are mandatory. To +let the system pick a value, or if the field doesn't apply to the +protocol, specify it as '0': + +NET-TYPE + This is one of 'inet4' for IPv4, 'inet6' for IPv6, or 'inet' to use + the system default (which is likely to be IPv4). For the rest of + this document, we will use the generic '/inet' in our descriptions + of how 'gawk''s networking works. + +PROTOCOL + Determines which member of the TCP/IP family of protocols is + selected to transport the data across the network. There are two + possible values (always written in lowercase): 'tcp' and 'udp'. + The exact meaning of each is explained later in this node. + +LOCALPORT + Determines which port on the local machine is used to communicate + across the network. Application-level clients usually use '0' to + indicate they do not care which local port is used--instead they + specify a remote port to connect to. It is vital for + application-level servers to use a number different from '0' here + because their service has to be available at a specific publicly + known port number. It is possible to use a name from + '/etc/services' here. + +HOSTNAME + Determines which remote host is to be at the other end of the + connection. Application-level servers must fill this field with a + '0' to indicate their being open for all other hosts to connect to + them and enforce connection level server behavior this way. It is + not possible for an application-level server to restrict its + availability to one remote host by entering a host name here. + Application-level clients must enter a name different from '0'. + The name can be either symbolic (e.g., 'jpl-devvax.jpl.nasa.gov') + or numeric (e.g., '128.149.1.143'). + +REMOTEPORT + Determines which port on the remote machine is used to communicate + across the network. For '/inet/tcp' and '/inet/udp', + application-level clients _must_ use a number other than '0' to + indicate to which port on the remote machine they want to connect. + Application-level servers must not fill this field with a '0'. + Instead they specify a local port to which clients connect. It is + possible to use a name from '/etc/services' here. + + Experts in network programming will notice that the usual +client/server asymmetry found at the level of the socket API is not +visible here. This is for the sake of simplicity of the high-level +concept. If this asymmetry is necessary for your application, use +another language. 
For 'gawk', it is more important to enable users to +write a client program with a minimum of code. What happens when first +accessing a network connection is seen in the following pseudocode: + + if ((name of remote host given) && (other side accepts connection)) { + rendez-vous successful; transmit with getline or print + } else { + if ((other side did not accept) && (localport == 0)) + exit unsuccessful + if (TCP) { + set up a server accepting connections + this means waiting for the client on the other side to connect + } else + ready + } + + The exact behavior of this algorithm depends on the values of the +fields of the special file name. When in doubt, *note Table 2.1: +table-inet-components. gives you the combinations of values and their +meaning. If this table is too complicated, focus on the three lines +printed in *bold*. All the examples in *note Networking With 'gawk': +Using Networking, use only the patterns printed in bold letters. + +PROTOCOL LOCAL HOST NAME REMOTE RESULTING CONNECTION-LEVEL + PORT PORT BEHAVIOR +------------------------------------------------------------------------------ +*tcp* *0* *x* *x* *Dedicated client, fails if + immediately connecting to a + server on the other side + fails* +udp 0 x x Dedicated client +*tcp, *x* *x* *x* *Client, switches to +udp* dedicated server if + necessary* +*tcp, *x* *0* *0* *Dedicated server* +udp* +tcp, udp x x 0 Invalid +tcp, udp 0 0 x Invalid +tcp, udp x 0 x Invalid +tcp, udp 0 0 0 Invalid +tcp, udp 0 x 0 Invalid + +Table 2.1: /inet Special File Components + + In general, TCP is the preferred mechanism to use. It is the +simplest protocol to understand and to use. Use UDP only if +circumstances demand low-overhead. + + +File: gawkinet.info, Node: Comparing Protocols, Prev: Special File Fields, Up: Gawk Special Files + +2.1.2 Comparing Protocols +------------------------- + +This node develops a pair of programs (sender and receiver) that do +nothing but send a timestamp from one machine to another. The sender +and the receiver are implemented with each of the two protocols +available and demonstrate the differences between them. + +* Menu: + +* File /inet/tcp:: The TCP special file. +* File /inet/udp:: The UDP special file. + + +File: gawkinet.info, Node: File /inet/tcp, Next: File /inet/udp, Prev: Comparing Protocols, Up: Comparing Protocols + +2.1.2.1 '/inet/tcp' +................... + +Once again, always use TCP. (Use UDP when low overhead is a necessity, +and use RAW for network experimentation.) The first example is the +sender program: + + # Server + BEGIN { + print strftime() |& "/inet/tcp/8888/0/0" + close("/inet/tcp/8888/0/0") + } + + The receiver is very simple: + + # Client + BEGIN { + "/inet/tcp/0/localhost/8888" |& getline + print $0 + close("/inet/tcp/0/localhost/8888") + } + + TCP guarantees that the bytes arrive at the receiving end in exactly +the same order that they were sent. No byte is lost (except for broken +connections), doubled, or out of order. Some overhead is necessary to +accomplish this, but this is the price to pay for a reliable service. +It does matter which side starts first. The sender/server has to be +started first, and it waits for the receiver to read a line. + + +File: gawkinet.info, Node: File /inet/udp, Prev: File /inet/tcp, Up: Comparing Protocols + +2.1.2.2 '/inet/udp' +................... + +The server and client programs that use UDP are almost identical to +their TCP counterparts; only the PROTOCOL has changed. As before, it +does matter which side starts first. 
The receiving side blocks and +waits for the sender. In this case, the receiver/client has to be +started first: + + # Server + BEGIN { + print strftime() |& "/inet/udp/8888/0/0" + close("/inet/udp/8888/0/0") + } + + The receiver is almost identical to the TCP receiver: + + # Client + BEGIN { + print "hi!" |& "/inet/udp/0/localhost/8888" + "/inet/udp/0/localhost/8888" |& getline + print $0 + close("/inet/udp/0/localhost/8888") + } + + In the case of UDP, the initial 'print' command is the one that +actually sends data so that there is a connection. UDP and "connection" +sounds strange to anyone who has learned that UDP is a connectionless +protocol. Here, "connection" means that the 'connect()' system call has +completed its work and completed the "association" between a certain +socket and an IP address. Thus there are subtle differences between +'connect()' for TCP and UDP; see the man page for details.(1) + + UDP cannot guarantee that the datagrams at the receiving end will +arrive in exactly the same order they were sent. Some datagrams could +be lost, some doubled, and some out of order. But no overhead is +necessary to accomplish this. This unreliable behavior is good enough +for tasks such as data acquisition, logging, and even stateless services +like the original versions of NFS. + + ---------- Footnotes ---------- + + (1) This subtlety is just one of many details that are hidden in the +socket API, invisible and intractable for the 'gawk' user. The +developers are currently considering how to rework the network +facilities to make them easier to understand and use. + + +File: gawkinet.info, Node: TCP Connecting, Next: Troubleshooting, Prev: Gawk Special Files, Up: Using Networking + +2.2 Establishing a TCP Connection +================================= + +Let's observe a network connection at work. Type in the following +program and watch the output. Within a second, it connects via TCP +('/inet/tcp') to the machine it is running on ('localhost') and asks the +service 'daytime' on the machine what time it is: + + BEGIN { + "/inet/tcp/0/localhost/daytime" |& getline + print $0 + close("/inet/tcp/0/localhost/daytime") + } + + Even experienced 'awk' users will find the second line strange in two +respects: + + * A special file is used as a shell command that pipes its output + into 'getline'. One would rather expect to see the special file + being read like any other file ('getline < + "/inet/tcp/0/localhost/daytime")'. + + * The operator '|&' has not been part of any 'awk' implementation + (until now). It is actually the only extension of the 'awk' + language needed (apart from the special files) to introduce network + access. + + The '|&' operator was introduced in 'gawk' 3.1 in order to overcome +the crucial restriction that access to files and pipes in 'awk' is +always unidirectional. It was formerly impossible to use both access +modes on the same file or pipe. Instead of changing the whole concept +of file access, the '|&' operator behaves exactly like the usual pipe +operator except for two additions: + + * Normal shell commands connected to their 'gawk' program with a '|&' + pipe can be accessed bidirectionally. The '|&' turns out to be a + quite general, useful, and natural extension of 'awk'. + + * Pipes that consist of a special file name for network connections + are not executed as shell commands. Instead, they can be read and + written to, just like a full-duplex network connection. 
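+
+   As a quick side-by-side sketch (assuming the 'rev' utility is
+installed and a local 'daytime' service is running), the following
+program uses both kinds of coprocess: the first is a shell command
+that really gets executed, while the second is a special file name
+that merely opens a TCP connection:
+
+     BEGIN {
+         # A shell command as coprocess: "rev" is executed
+         print "gawk" |& "rev"
+         close("rev", "to")     # send EOF so "rev" flushes its output
+         "rev" |& getline reversed
+         print reversed         # prints "kwag"
+         close("rev")
+
+         # A special file name: no command is run; gawk instead opens
+         # a full-duplex TCP connection to the local daytime service
+         "/inet/tcp/0/localhost/daytime" |& getline now
+         print now
+         close("/inet/tcp/0/localhost/daytime")
+     }
+
+Apart from the file names, both halves are driven by exactly the same
+'print', 'getline', and 'close()' statements; that is the point of the
+'|&' extension.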
+ + In the earlier example, the '|&' operator tells 'getline' to read a +line from the special file '/inet/tcp/0/localhost/daytime'. We could +also have printed a line into the special file. But instead we just +read a line with the time, printed it, and closed the connection. +(While we could just let 'gawk' close the connection by finishing the +program, in this Info file we are pedantic and always explicitly close +the connections.) + + +File: gawkinet.info, Node: Troubleshooting, Next: Interacting, Prev: TCP Connecting, Up: Using Networking + +2.3 Troubleshooting Connection Problems +======================================= + +It may well be that for some reason the program shown in the previous +example does not run on your machine. When looking at possible reasons +for this, you will learn much about typical problems that arise in +network programming. First of all, your implementation of 'gawk' may +not support network access because it is a pre-3.1 version or you do not +have a network interface in your machine. Perhaps your machine uses +some other protocol, such as DECnet or Novell's IPX. For the rest of +this major node, we will assume you work on a Unix machine that supports +TCP/IP. If the previous example program does not run on your machine, it +may help to replace the name 'localhost' with the name of your machine +or its IP address. If it does, you could replace 'localhost' with the +name of another machine in your vicinity--this way, the program connects +to another machine. Now you should see the date and time being printed +by the program, otherwise your machine may not support the 'daytime' +service. Try changing the service to 'chargen' or 'ftp'. This way, the +program connects to other services that should give you some response. +If you are curious, you should have a look at your '/etc/services' file. +It could look like this: + + # /etc/services: + # + # Network services, Internet style + # + # Name Number/Protocol Alternate name # Comments + + echo 7/tcp + echo 7/udp + discard 9/tcp sink null + discard 9/udp sink null + daytime 13/tcp + daytime 13/udp + chargen 19/tcp ttytst source + chargen 19/udp ttytst source + ftp 21/tcp + telnet 23/tcp + smtp 25/tcp mail + finger 79/tcp + www 80/tcp http # WorldWideWeb HTTP + www 80/udp # HyperText Transfer Protocol + pop-2 109/tcp postoffice # POP version 2 + pop-2 109/udp + pop-3 110/tcp # POP version 3 + pop-3 110/udp + nntp 119/tcp readnews untp # USENET News + irc 194/tcp # Internet Relay Chat + irc 194/udp + ... + + Here, you find a list of services that traditional Unix machines +usually support. If your GNU/Linux machine does not do so, it may be +that these services are switched off in some startup script. Systems +running some flavor of Microsoft Windows usually do _not_ support these +services. Nevertheless, it _is_ possible to do networking with 'gawk' +on Microsoft Windows.(1) The first column of the file gives the name of +the service, and the second column gives a unique number and the +protocol that one can use to connect to this service. The rest of the +line is treated as a comment. You see that some services ('echo') +support TCP as well as UDP. + + ---------- Footnotes ---------- + + (1) Microsoft preferred to ignore the TCP/IP family of protocols +until 1995. Then came the rise of the Netscape browser as a landmark +"killer application." Microsoft added TCP/IP support and their own +browser to Microsoft Windows 95 at the last minute. 
They even +back-ported their TCP/IP implementation to Microsoft Windows for +Workgroups 3.11, but it was a rather rudimentary and half-hearted +implementation. Nevertheless, the equivalent of '/etc/services' resides +under 'C:\WINNT\system32\drivers\etc\services' on Microsoft Windows 2000 +and Microsoft Windows XP. + + +File: gawkinet.info, Node: Interacting, Next: Setting Up, Prev: Troubleshooting, Up: Using Networking + +2.4 Interacting with a Network Service +====================================== + +The next program makes use of the possibility to really interact with a +network service by printing something into the special file. It asks +the so-called 'finger' service if a user of the machine is logged in. +When testing this program, try to change 'localhost' to some other +machine name in your local network: + + BEGIN { + NetService = "/inet/tcp/0/localhost/finger" + print "NAME" |& NetService + while ((NetService |& getline) > 0) + print $0 + close(NetService) + } + + After telling the service on the machine which user to look for, the +program repeatedly reads lines that come as a reply. When no more lines +are coming (because the service has closed the connection), the program +also closes the connection. Try replacing '"NAME"' with your login name +(or the name of someone else logged in). For a list of all users +currently logged in, replace NAME with an empty string ('""'). + + The final 'close()' command could be safely deleted from the above +script, because the operating system closes any open connection by +default when a script reaches the end of execution. In order to avoid +portability problems, it is best to always close connections explicitly. +With the Linux kernel, for example, proper closing results in flushing +of buffers. Letting the close happen by default may result in +discarding buffers. + + When looking at '/etc/services' you may have noticed that the +'daytime' service is also available with 'udp'. In the earlier example, +change 'tcp' to 'udp', and change 'finger' to 'daytime'. After starting +the modified program, you see the expected day and time message. The +program then hangs, because it waits for more lines coming from the +service. However, they never come. This behavior is a consequence of +the differences between TCP and UDP. When using UDP, neither party is +automatically informed about the other closing the connection. +Continuing to experiment this way reveals many other subtle differences +between TCP and UDP. To avoid such trouble, one should always remember +the advice Douglas E. Comer and David Stevens give in Volume III of +their series 'Internetworking With TCP' (page 14): + + When designing client-server applications, beginners are strongly + advised to use TCP because it provides reliable, + connection-oriented communication. Programs only use UDP if the + application protocol handles reliability, the application requires + hardware broadcast or multicast, or the application cannot tolerate + virtual circuit overhead. + + +File: gawkinet.info, Node: Setting Up, Next: Email, Prev: Interacting, Up: Using Networking + +2.5 Setting Up a Service +======================== + +The preceding programs behaved as clients that connect to a server +somewhere on the Internet and request a particular service. Now we set +up such a service to mimic the behavior of the 'daytime' service. Such +a server does not know in advance who is going to connect to it over the +network. 
Therefore, we cannot insert a name for the host to connect to +in our special file name. + + Start the following program in one window. Notice that the service +does not have the name 'daytime', but the number '8888'. From looking +at '/etc/services', you know that names like 'daytime' are just +mnemonics for predetermined 16-bit integers. Only the system +administrator ('root') could enter our new service into '/etc/services' +with an appropriate name. Also notice that the service name has to be +entered into a different field of the special file name because we are +setting up a server, not a client: + + BEGIN { + print strftime() |& "/inet/tcp/8888/0/0" + close("/inet/tcp/8888/0/0") + } + + Now open another window on the same machine. Copy the client program +given as the first example (*note Establishing a TCP Connection: TCP +Connecting.) to a new file and edit it, changing the name 'daytime' to +'8888'. Then start the modified client. You should get a reply like +this: + + Sat Sep 27 19:08:16 CEST 1997 + +Both programs explicitly close the connection. + + Now we will intentionally make a mistake to see what happens when the +name '8888' (the so-called port) is already used by another service. +Start the server program in both windows. The first one works, but the +second one complains that it could not open the connection. Each port +on a single machine can only be used by one server program at a time. +Now terminate the server program and change the name '8888' to 'echo'. +After restarting it, the server program does not run any more, and you +know why: there is already an 'echo' service running on your machine. +But even if this isn't true, you would not get your own 'echo' server +running on a Unix machine, because the ports with numbers smaller than +1024 ('echo' is at port 7) are reserved for 'root'. On machines running +some flavor of Microsoft Windows, there is no restriction that reserves +ports 1 to 1024 for a privileged user; hence, you can start an 'echo' +server there. + + Turning this short server program into something really useful is +simple. Imagine a server that first reads a file name from the client +through the network connection, then does something with the file and +sends a result back to the client. The server-side processing could be: + + BEGIN { + NetService = "/inet/tcp/8888/0/0" + NetService |& getline + CatPipe = ("cat " $1) # sets $0 and the fields + while ((CatPipe | getline) > 0) + print $0 |& NetService + close(NetService) + } + +and we would have a remote copying facility. Such a server reads the +name of a file from any client that connects to it and transmits the +contents of the named file across the net. The server-side processing +could also be the execution of a command that is transmitted across the +network. From this example, you can see how simple it is to open up a +security hole on your machine. If you allow clients to connect to your +machine and execute arbitrary commands, anyone would be free to do 'rm +-rf *'. + + +File: gawkinet.info, Node: Email, Next: Web page, Prev: Setting Up, Up: Using Networking + +2.6 Reading Email +================= + +The distribution of email is usually done by dedicated email servers +that communicate with your machine using special protocols. To receive +email, we will use the Post Office Protocol (POP). Sending can be done +with the much older Simple Mail Transfer Protocol (SMTP). + + When you type in the following program, replace the EMAILHOST by the +name of your local email server. 
Ask your administrator if the server +has a POP service, and then use its name or number in the program below. +Now the program is ready to connect to your email server, but it will +not succeed in retrieving your mail because it does not yet know your +login name or password. Replace them in the program and it shows you +the first email the server has in store: + + BEGIN { + POPService = "/inet/tcp/0/EMAILHOST/pop3" + RS = ORS = "\r\n" + print "user NAME" |& POPService + POPService |& getline + print "pass PASSWORD" |& POPService + POPService |& getline + print "retr 1" |& POPService + POPService |& getline + if ($1 != "+OK") exit + print "quit" |& POPService + RS = "\r\n\\.\r\n" + POPService |& getline + print $0 + close(POPService) + } + + The record separators 'RS' and 'ORS' are redefined because the +protocol (POP) requires CR-LF to separate lines. After identifying +yourself to the email service, the command 'retr 1' instructs the +service to send the first of all your email messages in line. If the +service replies with something other than '+OK', the program exits; +maybe there is no email. Otherwise, the program first announces that it +intends to finish reading email, and then redefines 'RS' in order to +read the entire email as multiline input in one record. From the POP +RFC, we know that the body of the email always ends with a single line +containing a single dot. The program looks for this using 'RS = +"\r\n\\.\r\n"'. When it finds this sequence in the mail message, it +quits. You can invoke this program as often as you like; it does not +delete the message it reads, but instead leaves it on the server. + + +File: gawkinet.info, Node: Web page, Next: Primitive Service, Prev: Email, Up: Using Networking + +2.7 Reading a Web Page +====================== + +Retrieving a web page from a web server is as simple as retrieving email +from an email server. We only have to use a similar, but not identical, +protocol and a different port. The name of the protocol is HyperText +Transfer Protocol (HTTP) and the port number is usually 80. As in the +preceding node, ask your administrator about the name of your local web +server or proxy web server and its port number for HTTP requests. + + The following program employs a rather crude approach toward +retrieving a web page. It uses the prehistoric syntax of HTTP 0.9, +which almost all web servers still support. The most noticeable thing +about it is that the program directs the request to the local proxy +server whose name you insert in the special file name (which in turn +calls 'www.yahoo.com'): + + BEGIN { + RS = ORS = "\r\n" + HttpService = "/inet/tcp/0/PROXY/80" + print "GET http://www.yahoo.com" |& HttpService + while ((HttpService |& getline) > 0) + print $0 + close(HttpService) + } + + Again, lines are separated by a redefined 'RS' and 'ORS'. The 'GET' +request that we send to the server is the only kind of HTTP request that +existed when the web was created in the early 1990s. HTTP calls this +'GET' request a "method," which tells the service to transmit a web page +(here the home page of the Yahoo! search engine). Version 1.0 added +the request methods 'HEAD' and 'POST'. The current version of HTTP is +1.1,(1) and knows the additional request methods 'OPTIONS', 'PUT', +'DELETE', and 'TRACE'. You can fill in any valid web address, and the +program prints the HTML code of that page to your screen. + + Notice the similarity between the responses of the POP and HTTP +services. 
First, you get a header that is terminated by an empty line, +and then you get the body of the page in HTML. The lines of the headers +also have the same form as in POP. There is the name of a parameter, +then a colon, and finally the value of that parameter. + + Images ('.png' or '.gif' files) can also be retrieved this way, but +then you get binary data that should be redirected into a file. Another +application is calling a CGI (Common Gateway Interface) script on some +server. CGI scripts are used when the contents of a web page are not +constant, but generated instantly at the moment you send a request for +the page. For example, to get a detailed report about the current +quotes of Motorola stock shares, call a CGI script at Yahoo! with the +following: + + get = "GET http://quote.yahoo.com/q?s=MOT&d=t" + print get |& HttpService + + You can also request weather reports this way. + + ---------- Footnotes ---------- + + (1) Version 1.0 of HTTP was defined in RFC 1945. HTTP 1.1 was +initially specified in RFC 2068. In June 1999, RFC 2068 was made +obsolete by RFC 2616, an update without any substantial changes. + + +File: gawkinet.info, Node: Primitive Service, Next: Interacting Service, Prev: Web page, Up: Using Networking + +2.8 A Primitive Web Service +=========================== + +Now we know enough about HTTP to set up a primitive web service that +just says '"Hello, world"' when someone connects to it with a browser. +Compared to the situation in the preceding node, our program changes the +role. It tries to behave just like the server we have observed. Since +we are setting up a server here, we have to insert the port number in +the 'localport' field of the special file name. The other two fields +(HOSTNAME and REMOTEPORT) have to contain a '0' because we do not know +in advance which host will connect to our service. + + In the early 1990s, all a server had to do was send an HTML document +and close the connection. Here, we adhere to the modern syntax of HTTP. +The steps are as follows: + + 1. Send a status line telling the web browser that everything is okay. + + 2. Send a line to tell the browser how many bytes follow in the body + of the message. This was not necessary earlier because both + parties knew that the document ended when the connection closed. + Nowadays it is possible to stay connected after the transmission of + one web page. This is to avoid the network traffic necessary for + repeatedly establishing TCP connections for requesting several + images. Thus, there is the need to tell the receiving party how + many bytes will be sent. The header is terminated as usual with an + empty line. + + 3. Send the '"Hello, world"' body in HTML. The useless 'while' loop + swallows the request of the browser. We could actually omit the + loop, and on most machines the program would still work. First, + start the following program: + + BEGIN { + RS = ORS = "\r\n" + HttpService = "/inet/tcp/8080/0/0" + Hello = "<HTML><HEAD>" \ + "<TITLE>A Famous Greeting</TITLE></HEAD>" \ + "<BODY><H1>Hello, world</H1></BODY></HTML>" + Len = length(Hello) + length(ORS) + print "HTTP/1.0 200 OK" |& HttpService + print "Content-Length: " Len ORS |& HttpService + print Hello |& HttpService + while ((HttpService |& getline) > 0) + continue; + close(HttpService) + } + + Now, on the same machine, start your favorite browser and let it +point to <http://localhost:8080> (the browser needs to know on which +port our server is listening for requests). 
If this does not work, the +browser probably tries to connect to a proxy server that does not know +your machine. If so, change the browser's configuration so that the +browser does not try to use a proxy to connect to your machine. + + +File: gawkinet.info, Node: Interacting Service, Next: Simple Server, Prev: Primitive Service, Up: Using Networking + +2.9 A Web Service with Interaction +================================== + +This node shows how to set up a simple web server. The subnode is a +library file that we will use with all the examples in *note Some +Applications and Techniques::. + +* Menu: + +* CGI Lib:: A simple CGI library. + + Setting up a web service that allows user interaction is more +difficult and shows us the limits of network access in 'gawk'. In this +node, we develop a main program (a 'BEGIN' pattern and its action) that +will become the core of event-driven execution controlled by a graphical +user interface (GUI). Each HTTP event that the user triggers by some +action within the browser is received in this central procedure. +Parameters and menu choices are extracted from this request, and an +appropriate measure is taken according to the user's choice. For +example: + + BEGIN { + if (MyHost == "") { + "uname -n" | getline MyHost + close("uname -n") + } + if (MyPort == 0) MyPort = 8080 + HttpService = "/inet/tcp/" MyPort "/0/0" + MyPrefix = "http://" MyHost ":" MyPort + SetUpServer() + while ("awk" != "complex") { + # header lines are terminated this way + RS = ORS = "\r\n" + Status = 200 # this means OK + Reason = "OK" + Header = TopHeader + Document = TopDoc + Footer = TopFooter + if (GETARG["Method"] == "GET") { + HandleGET() + } else if (GETARG["Method"] == "HEAD") { + # not yet implemented + } else if (GETARG["Method"] != "") { + print "bad method", GETARG["Method"] + } + Prompt = Header Document Footer + print "HTTP/1.0", Status, Reason |& HttpService + print "Connection: Close" |& HttpService + print "Pragma: no-cache" |& HttpService + len = length(Prompt) + length(ORS) + print "Content-length:", len |& HttpService + print ORS Prompt |& HttpService + # ignore all the header lines + while ((HttpService |& getline) > 0) + ; + # stop talking to this client + close(HttpService) + # wait for new client request + HttpService |& getline + # do some logging + print systime(), strftime(), $0 + # read request parameters + CGI_setup($1, $2, $3) + } + } + + This web server presents menu choices in the form of HTML links. +Therefore, it has to tell the browser the name of the host it is +residing on. When starting the server, the user may supply the name of +the host from the command line with 'gawk -v MyHost="Rumpelstilzchen"'. +If the user does not do this, the server looks up the name of the host +it is running on for later use as a web address in HTML documents. The +same applies to the port number. These values are inserted later into +the HTML content of the web pages to refer to the home system. + + Each server that is built around this core has to initialize some +application-dependent variables (such as the default home page) in a +procedure 'SetUpServer()', which is called immediately before entering +the infinite loop of the server. For now, we will write an instance +that initiates a trivial interaction. 
With this home page, the client +user can click on two possible choices, and receive the current date +either in human-readable format or in seconds since 1970: + + function SetUpServer() { + TopHeader = "<HTML><HEAD>" + TopHeader = TopHeader \ + "<title>My name is GAWK, GNU AWK</title></HEAD>" + TopDoc = "<BODY><h2>\ + Do you prefer your date <A HREF=" MyPrefix \ + "/human>human</A> or \ + <A HREF=" MyPrefix "/POSIX>POSIXed</A>?</h2>" ORS ORS + TopFooter = "</BODY></HTML>" + } + + On the first run through the main loop, the default line terminators +are set and the default home page is copied to the actual home page. +Since this is the first run, 'GETARG["Method"]' is not initialized yet, +hence the case selection over the method does nothing. Now that the +home page is initialized, the server can start communicating to a client +browser. + + It does so by printing the HTTP header into the network connection +('print ... |& HttpService'). This command blocks execution of the +server script until a client connects. If this server script is +compared with the primitive one we wrote before, you will notice two +additional lines in the header. The first instructs the browser to +close the connection after each request. The second tells the browser +that it should never try to _remember_ earlier requests that had +identical web addresses (no caching). Otherwise, it could happen that +the browser retrieves the time of day in the previous example just once, +and later it takes the web page from the cache, always displaying the +same time of day although time advances each second. + + Having supplied the initial home page to the browser with a valid +document stored in the parameter 'Prompt', it closes the connection and +waits for the next request. When the request comes, a log line is +printed that allows us to see which request the server receives. The +final step in the loop is to call the function 'CGI_setup()', which +reads all the lines of the request (coming from the browser), processes +them, and stores the transmitted parameters in the array 'PARAM'. The +complete text of these application-independent functions can be found in +*note A Simple CGI Library: CGI Lib. For now, we use a simplified +version of 'CGI_setup()': + + function CGI_setup( method, uri, version, i) { + delete GETARG; delete MENU; delete PARAM + GETARG["Method"] = $1 + GETARG["URI"] = $2 + GETARG["Version"] = $3 + i = index($2, "?") + # is there a "?" indicating a CGI request? + if (i > 0) { + split(substr($2, 1, i-1), MENU, "[/:]") + split(substr($2, i+1), PARAM, "&") + for (i in PARAM) { + j = index(PARAM[i], "=") + GETARG[substr(PARAM[i], 1, j-1)] = \ + substr(PARAM[i], j+1) + } + } else { # there is no "?", no need for splitting PARAMs + split($2, MENU, "[/:]") + } + } + + At first, the function clears all variables used for global storage +of request parameters. The rest of the function serves the purpose of +filling the global parameters with the extracted new values. To +accomplish this, the name of the requested resource is split into parts +and stored for later evaluation. If the request contains a '?', then +the request has CGI variables seamlessly appended to the web address. +Everything in front of the '?' is split up into menu items, and +everything behind the '?' is a list of 'VARIABLE=VALUE' pairs (separated +by '&') that also need splitting. This way, CGI variables are isolated +and stored. This procedure lacks recognition of special characters that +are transmitted in coded form(1). 
Here, any optional request header and +body parts are ignored. We do not need header parameters and the +request body. However, when refining our approach or working with the +'POST' and 'PUT' methods, reading the header and body becomes +inevitable. Header parameters should then be stored in a global array +as well as the body. + + On each subsequent run through the main loop, one request from a +browser is received, evaluated, and answered according to the user's +choice. This can be done by letting the value of the HTTP method guide +the main loop into execution of the procedure 'HandleGET()', which +evaluates the user's choice. In this case, we have only one +hierarchical level of menus, but in the general case, menus are nested. +The menu choices at each level are separated by '/', just as in file +names. Notice how simple it is to construct menus of arbitrary depth: + + function HandleGET() { + if ( MENU[2] == "human") { + Footer = strftime() TopFooter + } else if (MENU[2] == "POSIX") { + Footer = systime() TopFooter + } + } + + The disadvantage of this approach is that our server is slow and can +handle only one request at a time. Its main advantage, however, is that +the server consists of just one 'gawk' program. No need for installing +an 'httpd', and no need for static separate HTML files, CGI scripts, or +'root' privileges. This is rapid prototyping. This program can be +started on the same host that runs your browser. Then let your browser +point to <http://localhost:8080>. + + It is also possible to include images into the HTML pages. Most +browsers support the not very well-known '.xbm' format, which may +contain only monochrome pictures but is an ASCII format. Binary images +are possible but not so easy to handle. Another way of including images +is to generate them with a tool such as GNUPlot, by calling the tool +with the 'system()' function or through a pipe. + + ---------- Footnotes ---------- + + (1) As defined in RFC 2068. + + +File: gawkinet.info, Node: CGI Lib, Prev: Interacting Service, Up: Interacting Service + +2.9.1 A Simple CGI Library +-------------------------- + + HTTP is like being married: you have to be able to handle whatever + you're given, while being very careful what you send back. + Phil Smith III, + <http://www.netfunny.com/rhf/jokes/99/Mar/http.html> + + In *note A Web Service with Interaction: Interacting Service, we saw +the function 'CGI_setup()' as part of the web server "core logic" +framework. The code presented there handles almost everything necessary +for CGI requests. One thing it doesn't do is handle encoded characters +in the requests. For example, an '&' is encoded as a percent sign +followed by the hexadecimal value: '%26'. These encoded values should +be decoded. Following is a simple library to perform these tasks. This +code is used for all web server examples used throughout the rest of +this Info file. If you want to use it for your own web server, store +the source code into a file named 'inetlib.awk'. Then you can include +these functions into your code by placing the following statement into +your program (on the first line of your script): + + @include inetlib.awk + +But beware, this mechanism is only possible if you invoke your web +server script with 'igawk' instead of the usual 'awk' or 'gawk'. 
Here +is the code: + + # CGI Library and core of a web server + # Global arrays + # GETARG --- arguments to CGI GET command + # MENU --- menu items (path names) + # PARAM --- parameters of form x=y + + # Optional variable MyHost contains host address + # Optional variable MyPort contains port number + # Needs TopHeader, TopDoc, TopFooter + # Sets MyPrefix, HttpService, Status, Reason + + BEGIN { + if (MyHost == "") { + "uname -n" | getline MyHost + close("uname -n") + } + if (MyPort == 0) MyPort = 8080 + HttpService = "/inet/tcp/" MyPort "/0/0" + MyPrefix = "http://" MyHost ":" MyPort + SetUpServer() + while ("awk" != "complex") { + # header lines are terminated this way + RS = ORS = "\r\n" + Status = 200 # this means OK + Reason = "OK" + Header = TopHeader + Document = TopDoc + Footer = TopFooter + if (GETARG["Method"] == "GET") { + HandleGET() + } else if (GETARG["Method"] == "HEAD") { + # not yet implemented + } else if (GETARG["Method"] != "") { + print "bad method", GETARG["Method"] + } + Prompt = Header Document Footer + print "HTTP/1.0", Status, Reason |& HttpService + print "Connection: Close" |& HttpService + print "Pragma: no-cache" |& HttpService + len = length(Prompt) + length(ORS) + print "Content-length:", len |& HttpService + print ORS Prompt |& HttpService + # ignore all the header lines + while ((HttpService |& getline) > 0) + continue + # stop talking to this client + close(HttpService) + # wait for new client request + HttpService |& getline + # do some logging + print systime(), strftime(), $0 + CGI_setup($1, $2, $3) + } + } + + function CGI_setup( method, uri, version, i) + { + delete GETARG + delete MENU + delete PARAM + GETARG["Method"] = method + GETARG["URI"] = uri + GETARG["Version"] = version + + i = index(uri, "?") + if (i > 0) { # is there a "?" indicating a CGI request? + split(substr(uri, 1, i-1), MENU, "[/:]") + split(substr(uri, i+1), PARAM, "&") + for (i in PARAM) { + PARAM[i] = _CGI_decode(PARAM[i]) + j = index(PARAM[i], "=") + GETARG[substr(PARAM[i], 1, j-1)] = \ + substr(PARAM[i], j+1) + } + } else { # there is no "?", no need for splitting PARAMs + split(uri, MENU, "[/:]") + } + for (i in MENU) # decode characters in path + if (i > 4) # but not those in host name + MENU[i] = _CGI_decode(MENU[i]) + } + + This isolates details in a single function, 'CGI_setup()'. Decoding +of encoded characters is pushed off to a helper function, +'_CGI_decode()'. The use of the leading underscore ('_') in the +function name is intended to indicate that it is an "internal" function, +although there is nothing to enforce this: + + function _CGI_decode(str, hexdigs, i, pre, code1, code2, + val, result) + { + hexdigs = "123456789abcdef" + + i = index(str, "%") + if (i == 0) # no work to do + return str + + do { + pre = substr(str, 1, i-1) # part before %xx + code1 = substr(str, i+1, 1) # first hex digit + code2 = substr(str, i+2, 1) # second hex digit + str = substr(str, i+3) # rest of string + + code1 = tolower(code1) + code2 = tolower(code2) + val = index(hexdigs, code1) * 16 \ + + index(hexdigs, code2) + + result = result pre sprintf("%c", val) + i = index(str, "%") + } while (i != 0) + if (length(str) > 0) + result = result str + return result + } + + This works by splitting the string apart around an encoded character. +The two digits are converted to lowercase characters and looked up in a +string of hex digits. Note that '0' is not in the string on purpose; +'index()' returns zero when it's not found, automatically giving the +correct value! 
Once the hexadecimal value is converted from characters +in a string into a numerical value, 'sprintf()' converts the value back +into a real character. The following is a simple test harness for the +above functions: + + BEGIN { + CGI_setup("GET", + "http://www.gnu.org/cgi-bin/foo?p1=stuff&p2=stuff%26junk" \ + "&percent=a %25 sign", + "1.0") + for (i in MENU) + printf "MENU[\"%s\"] = %s\n", i, MENU[i] + for (i in PARAM) + printf "PARAM[\"%s\"] = %s\n", i, PARAM[i] + for (i in GETARG) + printf "GETARG[\"%s\"] = %s\n", i, GETARG[i] + } + + And this is the result when we run it: + + $ gawk -f testserv.awk + -| MENU["4"] = www.gnu.org + -| MENU["5"] = cgi-bin + -| MENU["6"] = foo + -| MENU["1"] = http + -| MENU["2"] = + -| MENU["3"] = + -| PARAM["1"] = p1=stuff + -| PARAM["2"] = p2=stuff&junk + -| PARAM["3"] = percent=a % sign + -| GETARG["p1"] = stuff + -| GETARG["percent"] = a % sign + -| GETARG["p2"] = stuff&junk + -| GETARG["Method"] = GET + -| GETARG["Version"] = 1.0 + -| GETARG["URI"] = http://www.gnu.org/cgi-bin/foo?p1=stuff& + p2=stuff%26junk&percent=a %25 sign + + +File: gawkinet.info, Node: Simple Server, Next: Caveats, Prev: Interacting Service, Up: Using Networking + +2.10 A Simple Web Server +======================== + +In the preceding node, we built the core logic for event-driven GUIs. +In this node, we finally extend the core to a real application. No one +would actually write a commercial web server in 'gawk', but it is +instructive to see that it is feasible in principle. + + The application is ELIZA, the famous program by Joseph Weizenbaum +that mimics the behavior of a professional psychotherapist when talking +to you. Weizenbaum would certainly object to this description, but this +is part of the legend around ELIZA. Take the site-independent core logic +and append the following code: + + function SetUpServer() { + SetUpEliza() + TopHeader = \ + "<HTML><title>An HTTP-based System with GAWK</title>\ + <HEAD><META HTTP-EQUIV=\"Content-Type\"\ + CONTENT=\"text/html; charset=iso-8859-1\"></HEAD>\ + <BODY BGCOLOR=\"#ffffff\" TEXT=\"#000000\"\ + LINK=\"#0000ff\" VLINK=\"#0000ff\"\ + ALINK=\"#0000ff\"> <A NAME=\"top\">" + TopDoc = "\ + <h2>Please choose one of the following actions:</h2>\ + <UL>\ + <LI>\ + <A HREF=" MyPrefix "/AboutServer>About this server</A>\ + </LI><LI>\ + <A HREF=" MyPrefix "/AboutELIZA>About Eliza</A></LI>\ + <LI>\ + <A HREF=" MyPrefix \ + "/StartELIZA>Start talking to Eliza</A></LI></UL>" + TopFooter = "</BODY></HTML>" + } + + 'SetUpServer()' is similar to the previous example, except for +calling another function, 'SetUpEliza()'. This approach can be used to +implement other kinds of servers. The only changes needed to do so are +hidden in the functions 'SetUpServer()' and 'HandleGET()'. Perhaps it +might be necessary to implement other HTTP methods. The 'igawk' program +that comes with 'gawk' may be useful for this process. + + When extending this example to a complete application, the first +thing to do is to implement the function 'SetUpServer()' to initialize +the HTML pages and some variables. These initializations determine the +way your HTML pages look (colors, titles, menu items, etc.). + + The function 'HandleGET()' is a nested case selection that decides +which page the user wants to see next. Each nesting level refers to a +menu level of the GUI. Each case implements a certain action of the +menu. 
On the deepest level of case selection, the handler essentially +knows what the user wants and stores the answer into the variable that +holds the HTML page contents: + + function HandleGET() { + # A real HTTP server would treat some parts of the URI as a file name. + # We take parts of the URI as menu choices and go on accordingly. + if(MENU[2] == "AboutServer") { + Document = "This is not a CGI script.\ + This is an httpd, an HTML file, and a CGI script all \ + in one GAWK script. It needs no separate www-server, \ + no installation, and no root privileges.\ + <p>To run it, do this:</p><ul>\ + <li> start this script with \"gawk -f httpserver.awk\",</li>\ + <li> and on the same host let your www browser open location\ + \"http://localhost:8080\"</li>\ + </ul>\<p>\ Details of HTTP come from:</p><ul>\ + <li>Hethmon: Illustrated Guide to HTTP</p>\ + <li>RFC 2068</li></ul><p>JK 14.9.1997</p>" + } else if (MENU[2] == "AboutELIZA") { + Document = "This is an implementation of the famous ELIZA\ + program by Joseph Weizenbaum. It is written in GAWK and\ + uses an HTML GUI." + } else if (MENU[2] == "StartELIZA") { + gsub(/\+/, " ", GETARG["YouSay"]) + # Here we also have to substitute coded special characters + Document = "<form method=GET>" \ + "<h3>" ElizaSays(GETARG["YouSay"]) "</h3>\ + <p><input type=text name=YouSay value=\"\" size=60>\ + <br><input type=submit value=\"Tell her about it\"></p></form>" + } + } + + Now we are down to the heart of ELIZA, so you can see how it works. +Initially the user does not say anything; then ELIZA resets its money +counter and asks the user to tell what comes to mind open heartedly. +The subsequent answers are converted to uppercase characters and stored +for later comparison. ELIZA presents the bill when being confronted +with a sentence that contains the phrase "shut up." Otherwise, it looks +for keywords in the sentence, conjugates the rest of the sentence, +remembers the keyword for later use, and finally selects an answer from +the set of possible answers: + + function ElizaSays(YouSay) { + if (YouSay == "") { + cost = 0 + answer = "HI, IM ELIZA, TELL ME YOUR PROBLEM" + } else { + q = toupper(YouSay) + gsub("'", "", q) + if(q == qold) { + answer = "PLEASE DONT REPEAT YOURSELF !" + } else { + if (index(q, "SHUT UP") > 0) { + answer = "WELL, PLEASE PAY YOUR BILL. ITS EXACTLY ... $"\ + int(100*rand()+30+cost/100) + } else { + qold = q + w = "-" # no keyword recognized yet + for (i in k) { # search for keywords + if (index(q, i) > 0) { + w = i + break + } + } + if (w == "-") { # no keyword, take old subject + w = wold + subj = subjold + } else { # find subject + subj = substr(q, index(q, w) + length(w)+1) + wold = w + subjold = subj # remember keyword and subject + } + for (i in conj) + gsub(i, conj[i], q) # conjugation + # from all answers to this keyword, select one randomly + answer = r[indices[int(split(k[w], indices) * rand()) + 1]] + # insert subject into answer + gsub("_", subj, answer) + } + } + } + cost += length(answer) # for later payment : 1 cent per character + return answer + } + + In the long but simple function 'SetUpEliza()', you can see tables +for conjugation, keywords, and answers.(1) The associative array 'k' +contains indices into the array of answers 'r'. 
To choose an answer, +ELIZA just picks an index randomly: + + function SetUpEliza() { + srand() + wold = "-" + subjold = " " + + # table for conjugation + conj[" ARE " ] = " AM " + conj["WERE " ] = "WAS " + conj[" YOU " ] = " I " + conj["YOUR " ] = "MY " + conj[" IVE " ] =\ + conj[" I HAVE " ] = " YOU HAVE " + conj[" YOUVE " ] =\ + conj[" YOU HAVE "] = " I HAVE " + conj[" IM " ] =\ + conj[" I AM " ] = " YOU ARE " + conj[" YOURE " ] =\ + conj[" YOU ARE " ] = " I AM " + + # table of all answers + r[1] = "DONT YOU BELIEVE THAT I CAN _" + r[2] = "PERHAPS YOU WOULD LIKE TO BE ABLE TO _ ?" + ... + + # table for looking up answers that + # fit to a certain keyword + k["CAN YOU"] = "1 2 3" + k["CAN I"] = "4 5" + k["YOU ARE"] =\ + k["YOURE"] = "6 7 8 9" + ... + } + + Some interesting remarks and details (including the original source +code of ELIZA) are found on Mark Humphrys' home page. Yahoo! also has +a page with a collection of ELIZA-like programs. Many of them are +written in Java, some of them disclosing the Java source code, and a few +even explain how to modify the Java source code. + + ---------- Footnotes ---------- + + (1) The version shown here is abbreviated. The full version comes +with the 'gawk' distribution. + + +File: gawkinet.info, Node: Caveats, Next: Challenges, Prev: Simple Server, Up: Using Networking + +2.11 Network Programming Caveats +================================ + +By now it should be clear that debugging a networked application is more +complicated than debugging a single-process single-hosted application. +The behavior of a networked application sometimes looks noncausal +because it is not reproducible in a strong sense. Whether a network +application works or not sometimes depends on the following: + + * How crowded the underlying network is + + * If the party at the other end is running or not + + * The state of the party at the other end + + The most difficult problems for a beginner arise from the hidden +states of the underlying network. After closing a TCP connection, it's +often necessary to wait a short while before reopening the connection. +Even more difficult is the establishment of a connection that previously +ended with a "broken pipe." Those connections have to "time out" for a +minute or so before they can reopen. Check this with the command +'netstat -a', which provides a list of still "active" connections. + + +File: gawkinet.info, Node: Challenges, Prev: Caveats, Up: Using Networking + +2.12 Where To Go From Here +========================== + +Now, you have learned enough to build your own application. You could, +for example, take part in the Loebner Contest to win $100,000. The +Loebner Prize is the first formal instantiation of a Turing Test. Hugh +Loebner agreed with The Cambridge Center for Behavioral Studies to +underwrite a contest designed to implement the Turing Test. Dr. Loebner +pledged a Grand Prize of $100,000 for the first computer whose responses +were indistinguishable from a human's. Each year an annual prize of +$2000 and a bronze medal is awarded to the _most_ human computer. The +winner of the annual contest is the best entry relative to other entries +that year, irrespective of how good it is in an absolute sense. Here is +an example of a conversation with the winning program of 1997: + + (c)1997 Cambridge Center For Behavioral Studies all rights reserved [CONVERSE] + [Intelligent Research Ltd.] Start at: [1997/04/29 08:59:40] + PROGRAM[10:39:20]Hi, my name is Catherine, what's yours? + JUDGE02[10:39:36]My name is Nathan. 
+ PROGRAM[10:39:37]Well, it's nice to meet you, Nathan. + PROGRAM[10:39:45]Did you see that story on CNN last night about the lesbian + PROGRAM[10:40:03]couple who came out at a White House party on Sunday? + JUDGE02[10:40:45]No. I'm sorry I missed that. How did the Prez take it? + PROGRAM[10:40:47]Ellen Degeneres was one of them - she was kissing her lover + PROGRAM[10:41:08]in front of Bill Clinton. Do you think he was very + PROGRAM[10:41:21]embarrassed? + JUDGE02[10:41:37]You tell me. I didn't see it. + PROGRAM[10:41:38]Are you still a Friend of Bill Clinton? + JUDGE02[10:42:22]Yes. I would like to talk about something else. Do you know anything about sculpture. + PROGRAM[10:42:24]Well, I suppose he's doing his best like most of us, but he + PROGRAM[10:42:35]does seem to attract an UNUSUAL amount of bad publicity, + PROGRAM[10:42:51]wouldnt you say--can it all be bad luck? + JUDGE02[10:43:20]I asked you if you knew anything about sculpture. + PROGRAM[10:43:21]OK, but what happens now--does it all drift on, with + PROGRAM[10:43:36]everyone waiting to see how much his friends have said about + PROGRAM[10:43:47]him under threat of lots and lots of jail-or will something + PROGRAM[10:44:01]dramatic actually happen? + JUDGE02[10:44:19]Do you know what sculpture is? + PROGRAM[10:44:20]Well, youre more a fan of his than I am I guess---though I + PROGRAM[10:44:34]cant SWEAR what Id have done in that locked hotel room, I + PROGRAM[10:44:51]must say. + JUDGE02[10:45:26]Have you ever been to the Salmagundi Club? + PROGRAM[10:45:29]Never. to go back to the original discussion Time to get + PROGRAM[10:45:43]off this one I think and onto something more down to earth! + JUDGE02[10:46:20]What is more down to earth than art? + PROGRAM[10:46:21]I don't know what -- to be honest. - + PROGRAM[10:46:30]I suppose it depends on where you're coming from, but as the + PROGRAM[10:46:46]song goes in My Fair Lady, "Why can't a woman be more like a + PROGRAM[10:47:03]man? " + JUDGE02[10:47:52]I liked that song but I prefer "Why can't children be more like we are" + JUDGE02[10:48:10]from another play the name of which I cannot remember. + PROGRAM[10:48:12]What, if anything, do you enjoy about being a man? + JUDGE02[10:49:01]I would enjoy all aspects of life no matter what my gender. + PROGRAM[10:49:02]Do you think of yourself as being attractive? + + This program insists on always speaking about the same story around +Bill Clinton. You see, even a program with a rather narrow mind can +behave so much like a human being that it can win this prize. It is +quite common to let these programs talk to each other via network +connections. But during the competition itself, the program and its +computer have to be present at the place the competition is held. We +all would love to see a 'gawk' program win in such an event. Maybe it +is up to you to accomplish this? + + Some other ideas for useful networked applications: + * Read the file 'doc/awkforai.txt' in the 'gawk' distribution. It + was written by Ronald P. Loui (at the time, Associate Professor of + Computer Science, at Washington University in St. Louis, + <loui@ai.wustl.edu>) and summarizes why he taught 'gawk' to + students of Artificial Intelligence. Here are some passages from + the text: + + The GAWK manual can be consumed in a single lab session and + the language can be mastered by the next morning by the + average student. 
GAWK's automatic initialization, implicit + coercion, I/O support and lack of pointers forgive many of the + mistakes that young programmers are likely to make. Those who + have seen C but not mastered it are happy to see that GAWK + retains some of the same sensibilities while adding what must + be regarded as spoonsful of syntactic sugar. + ... + There are further simple answers. Probably the best is the + fact that increasingly, undergraduate AI programming is + involving the Web. Oren Etzioni (University of Washington, + Seattle) has for a while been arguing that the "softbot" is + replacing the mechanical engineers' robot as the most + glamorous AI testbed. If the artifact whose behavior needs to + be controlled in an intelligent way is the software agent, + then a language that is well-suited to controlling the + software environment is the appropriate language. That would + imply a scripting language. If the robot is KAREL, then the + right language is "turn left; turn right." If the robot is + Netscape, then the right language is something that can + generate 'netscape -remote + 'openURL(http://cs.wustl.edu/~loui)'' with elan. + ... + AI programming requires high-level thinking. There have + always been a few gifted programmers who can write high-level + programs in assembly language. Most however need the ambient + abstraction to have a higher floor. + ... + Second, inference is merely the expansion of notation. No + matter whether the logic that underlies an AI program is + fuzzy, probabilistic, deontic, defeasible, or deductive, the + logic merely defines how strings can be transformed into other + strings. A language that provides the best support for string + processing in the end provides the best support for logic, for + the exploration of various logics, and for most forms of + symbolic processing that AI might choose to call "reasoning" + instead of "logic." The implication is that PROLOG, which + saves the AI programmer from having to write a unifier, saves + perhaps two dozen lines of GAWK code at the expense of + strongly biasing the logic and representational expressiveness + of any approach. + + Now that 'gawk' itself can connect to the Internet, it should be + obvious that it is suitable for writing intelligent web agents. + + * 'awk' is strong at pattern recognition and string processing. So, + it is well suited to the classic problem of language translation. + A first try could be a program that knows the 100 most frequent + English words and their counterparts in German or French. The + service could be implemented by regularly reading email with the + program above, replacing each word by its translation and sending + the translation back via SMTP. Users would send English email to + their translation service and get back a translated email message + in return. As soon as this works, more effort can be spent on a + real translation program. + + * Another dialogue-oriented application (on the verge of ridicule) is + the email "support service." Troubled customers write an email to + an automatic 'gawk' service that reads the email. It looks for + keywords in the mail and assembles a reply email accordingly. By + carefully investigating the email header, and repeating these + keywords through the reply email, it is rather simple to give the + customer a feeling that someone cares. Ideally, such a service + would search a database of previous cases for solutions. 
If none + exists, the database could, for example, consist of all the + newsgroups, mailing lists and FAQs on the Internet. + + +File: gawkinet.info, Node: Some Applications and Techniques, Next: Links, Prev: Using Networking, Up: Top + +3 Some Applications and Techniques +********************************** + +In this major node, we look at a number of self-contained scripts, with +an emphasis on concise networking. Along the way, we work towards +creating building blocks that encapsulate often needed functions of the +networking world, show new techniques that broaden the scope of problems +that can be solved with 'gawk', and explore leading edge technology that +may shape the future of networking. + + We often refer to the site-independent core of the server that we +built in *note A Simple Web Server: Simple Server. When building new +and nontrivial servers, we always copy this building block and append +new instances of the two functions 'SetUpServer()' and 'HandleGET()'. + + This makes a lot of sense, since this scheme of event-driven +execution provides 'gawk' with an interface to the most widely accepted +standard for GUIs: the web browser. Now, 'gawk' can rival even Tcl/Tk. + + Tcl and 'gawk' have much in common. Both are simple scripting +languages that allow us to quickly solve problems with short programs. +But Tcl has Tk on top of it, and 'gawk' had nothing comparable up to +now. While Tcl needs a large and ever-changing library (Tk, which was +bound to the X Window System until recently), 'gawk' needs just the +networking interface and some kind of browser on the client's side. +Besides better portability, the most important advantage of this +approach (embracing well-established standards such HTTP and HTML) is +that _we do not need to change the language_. We let others do the work +of fighting over protocols and standards. We can use HTML, JavaScript, +VRML, or whatever else comes along to do our work. + +* Menu: + +* PANIC:: An Emergency Web Server. +* GETURL:: Retrieving Web Pages. +* REMCONF:: Remote Configuration Of Embedded Systems. +* URLCHK:: Look For Changed Web Pages. +* WEBGRAB:: Extract Links From A Page. +* STATIST:: Graphing A Statistical Distribution. +* MAZE:: Walking Through A Maze In Virtual Reality. +* MOBAGWHO:: A Simple Mobile Agent. +* STOXPRED:: Stock Market Prediction As A Service. +* PROTBASE:: Searching Through A Protein Database. + + +File: gawkinet.info, Node: PANIC, Next: GETURL, Prev: Some Applications and Techniques, Up: Some Applications and Techniques + +3.1 PANIC: An Emergency Web Server +================================== + +At first glance, the '"Hello, world"' example in *note A Primitive Web +Service: Primitive Service, seems useless. By adding just a few lines, +we can turn it into something useful. + + The PANIC program tells everyone who connects that the local site is +not working. When a web server breaks down, it makes a difference if +customers get a strange "network unreachable" message, or a short +message telling them that the server has a problem. In such an +emergency, the hard disk and everything on it (including the regular web +service) may be unavailable. Rebooting the web server off a diskette +makes sense in this setting. + + To use the PANIC program as an emergency web server, all you need are +the 'gawk' executable and the program below on a diskette. By default, +it connects to port 8080. 
A different value may be supplied on the +command line: + + BEGIN { + RS = ORS = "\r\n" + if (MyPort == 0) MyPort = 8080 + HttpService = "/inet/tcp/" MyPort "/0/0" + Hello = "<HTML><HEAD><TITLE>Out Of Service</TITLE>" \ + "</HEAD><BODY><H1>" \ + "This site is temporarily out of service." \ + "</H1></BODY></HTML>" + Len = length(Hello) + length(ORS) + while ("awk" != "complex") { + print "HTTP/1.0 200 OK" |& HttpService + print "Content-Length: " Len ORS |& HttpService + print Hello |& HttpService + while ((HttpService |& getline) > 0) + continue; + close(HttpService) + } + } + + +File: gawkinet.info, Node: GETURL, Next: REMCONF, Prev: PANIC, Up: Some Applications and Techniques + +3.2 GETURL: Retrieving Web Pages +================================ + +GETURL is a versatile building block for shell scripts that need to +retrieve files from the Internet. It takes a web address as a +command-line parameter and tries to retrieve the contents of this +address. The contents are printed to standard output, while the header +is printed to '/dev/stderr'. A surrounding shell script could analyze +the contents and extract the text or the links. An ASCII browser could +be written around GETURL. But more interestingly, web robots are +straightforward to write on top of GETURL. On the Internet, you can find +several programs of the same name that do the same job. They are +usually much more complex internally and at least 10 times longer. + + At first, GETURL checks if it was called with exactly one web +address. Then, it checks if the user chose to use a special proxy +server whose name is handed over in a variable. By default, it is +assumed that the local machine serves as proxy. GETURL uses the 'GET' +method by default to access the web page. By handing over the name of a +different method (such as 'HEAD'), it is possible to choose a different +behavior. With the 'HEAD' method, the user does not receive the body of +the page content, but does receive the header: + + BEGIN { + if (ARGC != 2) { + print "GETURL - retrieve Web page via HTTP 1.0" + print "IN:\n the URL as a command-line parameter" + print "PARAM(S):\n -v Proxy=MyProxy" + print "OUT:\n the page content on stdout" + print " the page header on stderr" + print "JK 16.05.1997" + print "ADR 13.08.2000" + exit + } + URL = ARGV[1]; ARGV[1] = "" + if (Proxy == "") Proxy = "127.0.0.1" + if (ProxyPort == 0) ProxyPort = 80 + if (Method == "") Method = "GET" + HttpService = "/inet/tcp/0/" Proxy "/" ProxyPort + ORS = RS = "\r\n\r\n" + print Method " " URL " HTTP/1.0" |& HttpService + HttpService |& getline Header + print Header > "/dev/stderr" + while ((HttpService |& getline) > 0) + printf "%s", $0 + close(HttpService) + } + + This program can be changed as needed, but be careful with the last +lines. Make sure transmission of binary data is not corrupted by +additional line breaks. Even as it is now, the byte sequence +'"\r\n\r\n"' would disappear if it were contained in binary data. Don't +get caught in a trap when trying a quick fix on this one. + + +File: gawkinet.info, Node: REMCONF, Next: URLCHK, Prev: GETURL, Up: Some Applications and Techniques + +3.3 REMCONF: Remote Configuration of Embedded Systems +===================================================== + +Today, you often find powerful processors in embedded systems. +Dedicated network routers and controllers for all kinds of machinery are +examples of embedded systems. 
Processors like the Intel 80x86 or the +AMD Elan are able to run multitasking operating systems, such as XINU or +GNU/Linux in embedded PCs. These systems are small and usually do not +have a keyboard or a display. Therefore it is difficult to set up their +configuration. There are several widespread ways to set them up: + + * DIP switches + + * Read Only Memories such as EPROMs + + * Serial lines or some kind of keyboard + + * Network connections via 'telnet' or SNMP + + * HTTP connections with HTML GUIs + + In this node, we look at a solution that uses HTTP connections to +control variables of an embedded system that are stored in a file. +Since embedded systems have tight limits on resources like memory, it is +difficult to employ advanced techniques such as SNMP and HTTP servers. +'gawk' fits in quite nicely with its single executable which needs just +a short script to start working. The following program stores the +variables in a file, and a concurrent process in the embedded system may +read the file. The program uses the site-independent part of the simple +web server that we developed in *note A Web Service with Interaction: +Interacting Service. As mentioned there, all we have to do is to write +two new procedures 'SetUpServer()' and 'HandleGET()': + + function SetUpServer() { + TopHeader = "<HTML><title>Remote Configuration</title>" + TopDoc = "<BODY>\ + <h2>Please choose one of the following actions:</h2>\ + <UL>\ + <LI><A HREF=" MyPrefix "/AboutServer>About this server</A></LI>\ + <LI><A HREF=" MyPrefix "/ReadConfig>Read Configuration</A></LI>\ + <LI><A HREF=" MyPrefix "/CheckConfig>Check Configuration</A></LI>\ + <LI><A HREF=" MyPrefix "/ChangeConfig>Change Configuration</A></LI>\ + <LI><A HREF=" MyPrefix "/SaveConfig>Save Configuration</A></LI>\ + </UL>" + TopFooter = "</BODY></HTML>" + if (ConfigFile == "") ConfigFile = "config.asc" + } + + The function 'SetUpServer()' initializes the top level HTML texts as +usual. It also initializes the name of the file that contains the +configuration parameters and their values. In case the user supplies a +name from the command line, that name is used. The file is expected to +contain one parameter per line, with the name of the parameter in column +one and the value in column two. + + The function 'HandleGET()' reflects the structure of the menu tree as +usual. The first menu choice tells the user what this is all about. +The second choice reads the configuration file line by line and stores +the parameters and their values. Notice that the record separator for +this file is '"\n"', in contrast to the record separator for HTTP. The +third menu choice builds an HTML table to show the contents of the +configuration file just read. The fourth choice does the real work of +changing parameters, and the last one just saves the configuration into +a file: + + function HandleGET() { + if(MENU[2] == "AboutServer") { + Document = "This is a GUI for remote configuration of an\ + embedded system. It is is implemented as one GAWK script." + } else if (MENU[2] == "ReadConfig") { + RS = "\n" + while ((getline < ConfigFile) > 0) + config[$1] = $2; + close(ConfigFile) + RS = "\r\n" + Document = "Configuration has been read." + } else if (MENU[2] == "CheckConfig") { + Document = "<TABLE BORDER=1 CELLPADDING=5>" + for (i in config) + Document = Document "<TR><TD>" i "</TD>" \ + "<TD>" config[i] "</TD></TR>" + Document = Document "</TABLE>" + } else if (MENU[2] == "ChangeConfig") { + if ("Param" in GETARG) { # any parameter to set? 
+ if (GETARG["Param"] in config) { # is parameter valid? + config[GETARG["Param"]] = GETARG["Value"] + Document = (GETARG["Param"] " = " GETARG["Value"] ".") + } else { + Document = "Parameter <b>" GETARG["Param"] "</b> is invalid." + } + } else { + Document = "<FORM method=GET><h4>Change one parameter</h4>\ + <TABLE BORDER CELLPADDING=5>\ + <TR><TD>Parameter</TD><TD>Value</TD></TR>\ + <TR><TD><input type=text name=Param value=\"\" size=20></TD>\ + <TD><input type=text name=Value value=\"\" size=40></TD>\ + </TR></TABLE><input type=submit value=\"Set\"></FORM>" + } + } else if (MENU[2] == "SaveConfig") { + for (i in config) + printf("%s %s\n", i, config[i]) > ConfigFile + close(ConfigFile) + Document = "Configuration has been saved." + } + } + + We could also view the configuration file as a database. From this +point of view, the previous program acts like a primitive database +server. Real SQL database systems also make a service available by +providing a TCP port that clients can connect to. But the application +level protocols they use are usually proprietary and also change from +time to time. This is also true for the protocol that MiniSQL uses. + + +File: gawkinet.info, Node: URLCHK, Next: WEBGRAB, Prev: REMCONF, Up: Some Applications and Techniques + +3.4 URLCHK: Look for Changed Web Pages +====================================== + +Most people who make heavy use of Internet resources have a large +bookmark file with pointers to interesting web sites. It is impossible +to regularly check by hand if any of these sites have changed. A +program is needed to automatically look at the headers of web pages and +tell which ones have changed. URLCHK does the comparison after using +GETURL with the 'HEAD' method to retrieve the header. + + Like GETURL, this program first checks that it is called with exactly +one command-line parameter. URLCHK also takes the same command-line +variables 'Proxy' and 'ProxyPort' as GETURL, because these variables are +handed over to GETURL for each URL that gets checked. The one and only +parameter is the name of a file that contains one line for each URL. In +the first column, we find the URL, and the second and third columns hold +the length of the URL's body when checked for the two last times. Now, +we follow this plan: + + 1. Read the URLs from the file and remember their most recent lengths + + 2. Delete the contents of the file + + 3. For each URL, check its new length and write it into the file + + 4. If the most recent and the new length differ, tell the user + + It may seem a bit peculiar to read the URLs from a file together with +their two most recent lengths, but this approach has several advantages. +You can call the program again and again with the same file. 
After +running the program, you can regenerate the changed URLs by extracting +those lines that differ in their second and third columns: + + BEGIN { + if (ARGC != 2) { + print "URLCHK - check if URLs have changed" + print "IN:\n the file with URLs as a command-line parameter" + print " file contains URL, old length, new length" + print "PARAMS:\n -v Proxy=MyProxy -v ProxyPort=8080" + print "OUT:\n same as file with URLs" + print "JK 02.03.1998" + exit + } + URLfile = ARGV[1]; ARGV[1] = "" + if (Proxy != "") Proxy = " -v Proxy=" Proxy + if (ProxyPort != "") ProxyPort = " -v ProxyPort=" ProxyPort + while ((getline < URLfile) > 0) + Length[$1] = $3 + 0 + close(URLfile) # now, URLfile is read in and can be updated + GetHeader = "gawk " Proxy ProxyPort " -v Method=\"HEAD\" -f geturl.awk " + for (i in Length) { + GetThisHeader = GetHeader i " 2>&1" + while ((GetThisHeader | getline) > 0) + if (toupper($0) ~ /CONTENT-LENGTH/) NewLength = $2 + 0 + close(GetThisHeader) + print i, Length[i], NewLength > URLfile + if (Length[i] != NewLength) # report only changed URLs + print i, Length[i], NewLength + } + close(URLfile) + } + + Another thing that may look strange is the way GETURL is called. +Before calling GETURL, we have to check if the proxy variables need to +be passed on. If so, we prepare strings that will become part of the +command line later. In 'GetHeader()', we store these strings together +with the longest part of the command line. Later, in the loop over the +URLs, 'GetHeader()' is appended with the URL and a redirection operator +to form the command that reads the URL's header over the Internet. +GETURL always produces the headers over '/dev/stderr'. That is the +reason why we need the redirection operator to have the header piped in. + + This program is not perfect because it assumes that changing URLs +results in changed lengths, which is not necessarily true. A more +advanced approach is to look at some other header line that holds time +information. But, as always when things get a bit more complicated, +this is left as an exercise to the reader. + + +File: gawkinet.info, Node: WEBGRAB, Next: STATIST, Prev: URLCHK, Up: Some Applications and Techniques + +3.5 WEBGRAB: Extract Links from a Page +====================================== + +Sometimes it is necessary to extract links from web pages. Browsers do +it, web robots do it, and sometimes even humans do it. Since we have a +tool like GETURL at hand, we can solve this problem with some help from +the Bourne shell: + + BEGIN { RS = "http://[#%&\\+\\-\\./0-9\\:;\\?A-Z_a-z\\~]*" } + RT != "" { + command = ("gawk -v Proxy=MyProxy -f geturl.awk " RT \ + " > doc" NR ".html") + print command + } + + Notice that the regular expression for URLs is rather crude. A +precise regular expression is much more complex. But this one works +rather well. One problem is that it is unable to find internal links of +an HTML document. Another problem is that 'ftp', 'telnet', 'news', +'mailto', and other kinds of links are missing in the regular +expression. However, it is straightforward to add them, if doing so is +necessary for other tasks. + + This program reads an HTML file and prints all the HTTP links that it +finds. It relies on 'gawk''s ability to use regular expressions as +record separators. With 'RS' set to a regular expression that matches +links, the second action is executed each time a non-empty link is +found. We can find the matching link itself in 'RT'. 
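+
+   The 'RS'/'RT' mechanism is easy to try out in isolation.  The
+following sketch (the file name 'findlinks.awk' is only a suggestion)
+uses the very same record separator and simply prints each link that
+'RT' captured, one per line:
+
+     BEGIN { RS = "http://[#%&\\+\\-\\./0-9\\:;\\?A-Z_a-z\\~]*" }
+     RT != "" { print RT }   # 'RT' holds the text that matched 'RS'
+
+   Running GETURL's output through it, for example with 'gawk -f
+geturl.awk http://www.suse.de | gawk -f findlinks.awk', yields the bare
+list of URLs without the surrounding shell commands that the full
+program emits.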
+ + The action could use the 'system()' function to let another GETURL +retrieve the page, but here we use a different approach. This simple +program prints shell commands that can be piped into 'sh' for execution. +This way it is possible to first extract the links, wrap shell commands +around them, and pipe all the shell commands into a file. After editing +the file, execution of the file retrieves exactly those files that we +really need. In case we do not want to edit, we can retrieve all the +pages like this: + + gawk -f geturl.awk http://www.suse.de | gawk -f webgrab.awk | sh + + After this, you will find the contents of all referenced documents in +files named 'doc*.html' even if they do not contain HTML code. The most +annoying thing is that we always have to pass the proxy to GETURL. If +you do not like to see the headers of the web pages appear on the +screen, you can redirect them to '/dev/null'. Watching the headers +appear can be quite interesting, because it reveals interesting details +such as which web server the companies use. Now, it is clear how the +clever marketing people use web robots to determine the market shares of +Microsoft and Netscape in the web server market. + + Port 80 of any web server is like a small hole in a repellent +firewall. After attaching a browser to port 80, we usually catch a +glimpse of the bright side of the server (its home page). With a tool +like GETURL at hand, we are able to discover some of the more concealed +or even "indecent" services (i.e., lacking conformity to standards of +quality). It can be exciting to see the fancy CGI scripts that lie +there, revealing the inner workings of the server, ready to be called: + + * With a command such as: + + gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/ + + some servers give you a directory listing of the CGI files. + Knowing the names, you can try to call some of them and watch for + useful results. Sometimes there are executables in such + directories (such as Perl interpreters) that you may call remotely. + If there are subdirectories with configuration data of the web + server, this can also be quite interesting to read. + + * The well-known Apache web server usually has its CGI files in the + directory '/cgi-bin'. There you can often find the scripts + 'test-cgi' and 'printenv'. Both tell you some things about the + current connection and the installation of the web server. Just + call: + + gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/test-cgi + gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/printenv + + * Sometimes it is even possible to retrieve system files like the web + server's log file--possibly containing customer data--or even the + file '/etc/passwd'. (We don't recommend this!) + + *Caution:* Although this may sound funny or simply irrelevant, we are +talking about severe security holes. Try to explore your own system +this way and make sure that none of the above reveals too much +information about your system. + + +File: gawkinet.info, Node: STATIST, Next: MAZE, Prev: WEBGRAB, Up: Some Applications and Techniques + +3.6 STATIST: Graphing a Statistical Distribution +================================================ + +In the HTTP server examples we've shown thus far, we never present an +image to the browser and its user. Presenting images is one task. +Generating images that reflect some user input and presenting these +dynamically generated images is another. 
In this node, we use GNUPlot +for generating '.png', '.ps', or '.gif' files.(1) + + The program we develop takes the statistical parameters of two +samples and computes the t-test statistics. As a result, we get the +probabilities that the means and the variances of both samples are the +same. In order to let the user check plausibility, the program presents +an image of the distributions. The statistical computation follows +'Numerical Recipes in C: The Art of Scientific Computing' by William H. +Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. +Since 'gawk' does not have a built-in function for the computation of +the beta function, we use the 'ibeta()' function of GNUPlot. As a side +effect, we learn how to use GNUPlot as a sophisticated calculator. The +comparison of means is done as in 'tutest', paragraph 14.2, page 613, +and the comparison of variances is done as in 'ftest', page 611 in +'Numerical Recipes'. + + As usual, we take the site-independent code for servers and append +our own functions 'SetUpServer()' and 'HandleGET()': + + function SetUpServer() { + TopHeader = "<HTML><title>Statistics with GAWK</title>" + TopDoc = "<BODY>\ + <h2>Please choose one of the following actions:</h2>\ + <UL>\ + <LI><A HREF=" MyPrefix "/AboutServer>About this server</A></LI>\ + <LI><A HREF=" MyPrefix "/EnterParameters>Enter Parameters</A></LI>\ + </UL>" + TopFooter = "</BODY></HTML>" + GnuPlot = "gnuplot 2>&1" + m1=m2=0; v1=v2=1; n1=n2=10 + } + + Here, you see the menu structure that the user sees. Later, we will +see how the program structure of the 'HandleGET()' function reflects the +menu structure. What is missing here is the link for the image we +generate. In an event-driven environment, request, generation, and +delivery of images are separated. + + Notice the way we initialize the 'GnuPlot' command string for the +pipe. By default, GNUPlot outputs the generated image via standard +output, as well as the results of 'print'(ed) calculations via standard +error. The redirection causes standard error to be mixed into standard +output, enabling us to read results of calculations with 'getline'. By +initializing the statistical parameters with some meaningful defaults, +we make sure the user gets an image the first time he uses the program. + + Following is the rather long function 'HandleGET()', which implements +the contents of this service by reacting to the different kinds of +requests from the browser. Before you start playing with this script, +make sure that your browser supports JavaScript and that it also has +this option switched on. The script uses a short snippet of JavaScript +code for delayed opening of a window with an image. A more detailed +explanation follows: + + function HandleGET() { + if(MENU[2] == "AboutServer") { + Document = "This is a GUI for a statistical computation.\ + It compares means and variances of two distributions.\ + It is implemented as one GAWK script and uses GNUPLOT." + } else if (MENU[2] == "EnterParameters") { + Document = "" + if ("m1" in GETARG) { # are there parameters to compare? 
+ Document = Document "<SCRIPT LANGUAGE=\"JavaScript\">\ + setTimeout(\"window.open(\\\"" MyPrefix "/Image" systime()\ + "\\\",\\\"dist\\\", \\\"status=no\\\");\", 1000); </SCRIPT>" + m1 = GETARG["m1"]; v1 = GETARG["v1"]; n1 = GETARG["n1"] + m2 = GETARG["m2"]; v2 = GETARG["v2"]; n2 = GETARG["n2"] + t = (m1-m2)/sqrt(v1/n1+v2/n2) + df = (v1/n1+v2/n2)*(v1/n1+v2/n2)/((v1/n1)*(v1/n1)/(n1-1) \ + + (v2/n2)*(v2/n2) /(n2-1)) + if (v1>v2) { + f = v1/v2 + df1 = n1 - 1 + df2 = n2 - 1 + } else { + f = v2/v1 + df1 = n2 - 1 + df2 = n1 - 1 + } + print "pt=ibeta(" df/2 ",0.5," df/(df+t*t) ")" |& GnuPlot + print "pF=2.0*ibeta(" df2/2 "," df1/2 "," \ + df2/(df2+df1*f) ")" |& GnuPlot + print "print pt, pF" |& GnuPlot + RS="\n"; GnuPlot |& getline; RS="\r\n" # $1 is pt, $2 is pF + print "invsqrt2pi=1.0/sqrt(2.0*pi)" |& GnuPlot + print "nd(x)=invsqrt2pi/sd*exp(-0.5*((x-mu)/sd)**2)" |& GnuPlot + print "set term png small color" |& GnuPlot + #print "set term postscript color" |& GnuPlot + #print "set term gif medium size 320,240" |& GnuPlot + print "set yrange[-0.3:]" |& GnuPlot + print "set label 'p(m1=m2) =" $1 "' at 0,-0.1 left" |& GnuPlot + print "set label 'p(v1=v2) =" $2 "' at 0,-0.2 left" |& GnuPlot + print "plot mu=" m1 ",sd=" sqrt(v1) ", nd(x) title 'sample 1',\ + mu=" m2 ",sd=" sqrt(v2) ", nd(x) title 'sample 2'" |& GnuPlot + print "quit" |& GnuPlot + GnuPlot |& getline Image + while ((GnuPlot |& getline) > 0) + Image = Image RS $0 + close(GnuPlot) + } + Document = Document "\ + <h3>Do these samples have the same Gaussian distribution?</h3>\ + <FORM METHOD=GET> <TABLE BORDER CELLPADDING=5>\ + <TR>\ + <TD>1. Mean </TD> + <TD><input type=text name=m1 value=" m1 " size=8></TD>\ + <TD>1. Variance</TD> + <TD><input type=text name=v1 value=" v1 " size=8></TD>\ + <TD>1. Count </TD> + <TD><input type=text name=n1 value=" n1 " size=8></TD>\ + </TR><TR>\ + <TD>2. Mean </TD> + <TD><input type=text name=m2 value=" m2 " size=8></TD>\ + <TD>2. Variance</TD> + <TD><input type=text name=v2 value=" v2 " size=8></TD>\ + <TD>2. Count </TD> + <TD><input type=text name=n2 value=" n2 " size=8></TD>\ + </TR> <input type=submit value=\"Compute\">\ + </TABLE></FORM><BR>" + } else if (MENU[2] ~ "Image") { + Reason = "OK" ORS "Content-type: image/png" + #Reason = "OK" ORS "Content-type: application/x-postscript" + #Reason = "OK" ORS "Content-type: image/gif" + Header = Footer = "" + Document = Image + } + } + + As usual, we give a short description of the service in the first +menu choice. The third menu choice shows us that generation and +presentation of an image are two separate actions. While the latter +takes place quite instantly in the third menu choice, the former takes +place in the much longer second choice. Image data passes from the +generating action to the presenting action via the variable 'Image' that +contains a complete '.png' image, which is otherwise stored in a file. +If you prefer '.ps' or '.gif' images over the default '.png' images, you +may select these options by uncommenting the appropriate lines. But +remember to do so in two places: when telling GNUPlot which kind of +images to generate, and when transmitting the image at the end of the +program. + + Looking at the end of the program, the way we pass the 'Content-type' +to the browser is a bit unusual. It is appended to the 'OK' of the +first header line to make sure the type information becomes part of the +header. 
The other variables that get transmitted across the network are +made empty, because in this case we do not have an HTML document to +transmit, but rather raw image data to contain in the body. + + Most of the work is done in the second menu choice. It starts with a +strange JavaScript code snippet. When first implementing this server, +we used a short '"<IMG SRC=" MyPrefix "/Image>"' here. But then +browsers got smarter and tried to improve on speed by requesting the +image and the HTML code at the same time. When doing this, the browser +tries to build up a connection for the image request while the request +for the HTML text is not yet completed. The browser tries to connect to +the 'gawk' server on port 8080 while port 8080 is still in use for +transmission of the HTML text. The connection for the image cannot be +built up, so the image appears as "broken" in the browser window. We +solved this problem by telling the browser to open a separate window for +the image, but only after a delay of 1000 milliseconds. By this time, +the server should be ready for serving the next request. + + But there is one more subtlety in the JavaScript code. Each time the +JavaScript code opens a window for the image, the name of the image is +appended with a timestamp ('systime()'). Why this constant change of +name for the image? Initially, we always named the image 'Image', but +then the Netscape browser noticed the name had _not_ changed since the +previous request and displayed the previous image (caching behavior). +The server core is implemented so that browsers are told _not_ to cache +anything. Obviously HTTP requests do not always work as expected. One +way to circumvent the cache of such overly smart browsers is to change +the name of the image with each request. These three lines of +JavaScript caused us a lot of trouble. + + The rest can be broken down into two phases. At first, we check if +there are statistical parameters. When the program is first started, +there usually are no parameters because it enters the page coming from +the top menu. Then, we only have to present the user a form that he can +use to change statistical parameters and submit them. Subsequently, the +submission of the form causes the execution of the first phase because +_now_ there _are_ parameters to handle. + + Now that we have parameters, we know there will be an image +available. Therefore we insert the JavaScript code here to initiate the +opening of the image in a separate window. Then, we prepare some +variables that will be passed to GNUPlot for calculation of the +probabilities. Prior to reading the results, we must temporarily change +'RS' because GNUPlot separates lines with newlines. After instructing +GNUPlot to generate a '.png' (or '.ps' or '.gif') image, we initiate the +insertion of some text, explaining the resulting probabilities. The +final 'plot' command actually generates the image data. This raw binary +has to be read in carefully without adding, changing, or deleting a +single byte. Hence the unusual initialization of 'Image' and completion +with a 'while' loop. + + When using this server, it soon becomes clear that it is far from +being perfect. It mixes source code of six scripting languages or +protocols: + + * GNU 'awk' implements a server for the protocol: + * HTTP which transmits: + * HTML text which contains a short piece of: + * JavaScript code opening a separate window. + * A Bourne shell script is used for piping commands into: + * GNUPlot to generate the image to be opened. 
+ + After all this work, the GNUPlot image opens in the JavaScript window +where it can be viewed by the user. + + It is probably better not to mix up so many different languages. The +result is not very readable. Furthermore, the statistical part of the +server does not take care of invalid input. Among others, using +negative variances will cause invalid results. + + ---------- Footnotes ---------- + + (1) Due to licensing problems, the default installation of GNUPlot +disables the generation of '.gif' files. If your installed version does +not accept 'set term gif', just download and install the most recent +version of GNUPlot and the GD library (http://www.boutell.com/gd/) by +Thomas Boutell. Otherwise you still have the chance to generate some +ASCII-art style images with GNUPlot by using 'set term dumb'. (We tried +it and it worked.) + + +File: gawkinet.info, Node: MAZE, Next: MOBAGWHO, Prev: STATIST, Up: Some Applications and Techniques + +3.7 MAZE: Walking Through a Maze In Virtual Reality +=================================================== + + In the long run, every program becomes rococo, and then rubble. + Alan Perlis + + By now, we know how to present arbitrary 'Content-type's to a +browser. In this node, our server will present a 3D world to our +browser. The 3D world is described in a scene description language +(VRML, Virtual Reality Modeling Language) that allows us to travel +through a perspective view of a 2D maze with our browser. Browsers with +a VRML plugin enable exploration of this technology. We could do one of +those boring 'Hello world' examples here, that are usually presented +when introducing novices to VRML. If you have never written any VRML +code, have a look at the VRML FAQ. Presenting a static VRML scene is a +bit trivial; in order to expose 'gawk''s new capabilities, we will +present a dynamically generated VRML scene. The function +'SetUpServer()' is very simple because it only sets the default HTML +page and initializes the random number generator. As usual, the +surrounding server lets you browse the maze. + + function SetUpServer() { + TopHeader = "<HTML><title>Walk through a maze</title>" + TopDoc = "\ + <h2>Please choose one of the following actions:</h2>\ + <UL>\ + <LI><A HREF=" MyPrefix "/AboutServer>About this server</A>\ + <LI><A HREF=" MyPrefix "/VRMLtest>Watch a simple VRML scene</A>\ + </UL>" + TopFooter = "</HTML>" + srand() + } + + The function 'HandleGET()' is a bit longer because it first computes +the maze and afterwards generates the VRML code that is sent across the +network. As shown in the STATIST example (*note STATIST::), we set the +type of the content to VRML and then store the VRML representation of +the maze as the page content. We assume that the maze is stored in a 2D +array. Initially, the maze consists of walls only. Then, we add an +entry and an exit to the maze and let the rest of the work be done by +the function 'MakeMaze()'. Now, only the wall fields are left in the +maze. By iterating over the these fields, we generate one line of VRML +code for each wall field. + + function HandleGET() { + if (MENU[2] == "AboutServer") { + Document = "If your browser has a VRML 2 plugin,\ + this server shows you a simple VRML scene." 
+ } else if (MENU[2] == "VRMLtest") { + XSIZE = YSIZE = 11 # initially, everything is wall + for (y = 0; y < YSIZE; y++) + for (x = 0; x < XSIZE; x++) + Maze[x, y] = "#" + delete Maze[0, 1] # entry is not wall + delete Maze[XSIZE-1, YSIZE-2] # exit is not wall + MakeMaze(1, 1) + Document = "\ + #VRML V2.0 utf8\n\ + Group {\n\ + children [\n\ + PointLight {\n\ + ambientIntensity 0.2\n\ + color 0.7 0.7 0.7\n\ + location 0.0 8.0 10.0\n\ + }\n\ + DEF B1 Background {\n\ + skyColor [0 0 0, 1.0 1.0 1.0 ]\n\ + skyAngle 1.6\n\ + groundColor [1 1 1, 0.8 0.8 0.8, 0.2 0.2 0.2 ]\n\ + groundAngle [ 1.2 1.57 ]\n\ + }\n\ + DEF Wall Shape {\n\ + geometry Box {size 1 1 1}\n\ + appearance Appearance { material Material { diffuseColor 0 0 1 } }\n\ + }\n\ + DEF Entry Viewpoint {\n\ + position 0.5 1.0 5.0\n\ + orientation 0.0 0.0 -1.0 0.52\n\ + }\n" + for (i in Maze) { + split(i, t, SUBSEP) + Document = Document " Transform { translation " + Document = Document t[1] " 0 -" t[2] " children USE Wall }\n" + } + Document = Document " ] # end of group for world\n}" + Reason = "OK" ORS "Content-type: model/vrml" + Header = Footer = "" + } + } + + Finally, we have a look at 'MakeMaze()', the function that generates +the 'Maze' array. When entered, this function assumes that the array +has been initialized so that each element represents a wall element and +the maze is initially full of wall elements. Only the entrance and the +exit of the maze should have been left free. The parameters of the +function tell us which element must be marked as not being a wall. +After this, we take a look at the four neighboring elements and remember +which we have already treated. Of all the neighboring elements, we take +one at random and walk in that direction. Therefore, the wall element +in that direction has to be removed and then, we call the function +recursively for that element. The maze is only completed if we iterate +the above procedure for _all_ neighboring elements (in random order) and +for our present element by recursively calling the function for the +present element. This last iteration could have been done in a loop, +but it is done much simpler recursively. + + Notice that elements with coordinates that are both odd are assumed +to be on our way through the maze and the generating process cannot +terminate as long as there is such an element not being 'delete'd. All +other elements are potentially part of the wall. 
+ + function MakeMaze(x, y) { + delete Maze[x, y] # here we are, we have no wall here + p = 0 # count unvisited fields in all directions + if (x-2 SUBSEP y in Maze) d[p++] = "-x" + if (x SUBSEP y-2 in Maze) d[p++] = "-y" + if (x+2 SUBSEP y in Maze) d[p++] = "+x" + if (x SUBSEP y+2 in Maze) d[p++] = "+y" + if (p>0) { # if there are unvisited fields, go there + p = int(p*rand()) # choose one unvisited field at random + if (d[p] == "-x") { delete Maze[x - 1, y]; MakeMaze(x - 2, y) + } else if (d[p] == "-y") { delete Maze[x, y - 1]; MakeMaze(x, y - 2) + } else if (d[p] == "+x") { delete Maze[x + 1, y]; MakeMaze(x + 2, y) + } else if (d[p] == "+y") { delete Maze[x, y + 1]; MakeMaze(x, y + 2) + } # we are back from recursion + MakeMaze(x, y); # try again while there are unvisited fields + } + } + + +File: gawkinet.info, Node: MOBAGWHO, Next: STOXPRED, Prev: MAZE, Up: Some Applications and Techniques + +3.8 MOBAGWHO: a Simple Mobile Agent +=================================== + + There are two ways of constructing a software design: One way is to + make it so simple that there are obviously no deficiencies, and the + other way is to make it so complicated that there are no obvious + deficiencies. + C. A. R. Hoare + + A "mobile agent" is a program that can be dispatched from a computer +and transported to a remote server for execution. This is called +"migration", which means that a process on another system is started +that is independent from its originator. Ideally, it wanders through a +network while working for its creator or owner. In places like the UMBC +Agent Web, people are quite confident that (mobile) agents are a +software engineering paradigm that enables us to significantly increase +the efficiency of our work. Mobile agents could become the mediators +between users and the networking world. For an unbiased view at this +technology, see the remarkable paper 'Mobile Agents: Are they a good +idea?'.(1) + + When trying to migrate a process from one system to another, a server +process is needed on the receiving side. Depending on the kind of +server process, several ways of implementation come to mind. How the +process is implemented depends upon the kind of server process: + + * HTTP can be used as the protocol for delivery of the migrating + process. In this case, we use a common web server as the receiving + server process. A universal CGI script mediates between migrating + process and web server. Each server willing to accept migrating + agents makes this universal service available. HTTP supplies the + 'POST' method to transfer some data to a file on the web server. + When a CGI script is called remotely with the 'POST' method instead + of the usual 'GET' method, data is transmitted from the client + process to the standard input of the server's CGI script. So, to + implement a mobile agent, we must not only write the agent program + to start on the client side, but also the CGI script to receive the + agent on the server side. + + * The 'PUT' method can also be used for migration. HTTP does not + require a CGI script for migration via 'PUT'. However, with common + web servers there is no advantage to this solution, because web + servers such as Apache require explicit activation of a special + 'PUT' script. + + * 'Agent Tcl' pursues a different course; it relies on a dedicated + server process with a dedicated protocol specialized for receiving + mobile agents. + + Our agent example abuses a common web server as a migration tool. 
+So, it needs a universal CGI script on the receiving side (the web +server). The receiving script is activated with a 'POST' request when +placed into a location like '/httpd/cgi-bin/PostAgent.sh'. Make sure +that the server system uses a version of 'gawk' that supports network +access (Version 3.1 or later; verify with 'gawk --version'). + + #!/bin/sh + MobAg=/tmp/MobileAgent.$$ + # direct script to mobile agent file + cat > $MobAg + # execute agent concurrently + gawk -f $MobAg $MobAg > /dev/null & + # HTTP header, terminator and body + gawk 'BEGIN { print "\r\nAgent started" }' + rm $MobAg # delete script file of agent + + By making its process id ('$$') part of the unique file name, the +script avoids conflicts between concurrent instances of the script. +First, all lines from standard input (the mobile agent's source code) +are copied into this unique file. Then, the agent is started as a +concurrent process and a short message reporting this fact is sent to +the submitting client. Finally, the script file of the mobile agent is +removed because it is no longer needed. Although it is a short script, +there are several noteworthy points: + +Security + _There is none_. In fact, the CGI script should never be made + available on a server that is part of the Internet because everyone + would be allowed to execute arbitrary commands with it. This + behavior is acceptable only when performing rapid prototyping. + +Self-Reference + Each migrating instance of an agent is started in a way that + enables it to read its own source code from standard input and use + the code for subsequent migrations. This is necessary because it + needs to treat the agent's code as data to transmit. 'gawk' is not + the ideal language for such a job. Lisp and Tcl are more suitable + because they do not make a distinction between program code and + data. + +Independence + After migration, the agent is not linked to its former home in any + way. By reporting 'Agent started', it waves "Goodbye" to its + origin. The originator may choose to terminate or not. + + The originating agent itself is started just like any other +command-line script, and reports the results on standard output. By +letting the name of the original host migrate with the agent, the agent +that migrates to a host far away from its origin can report the result +back home. Having arrived at the end of the journey, the agent +establishes a connection and reports the results. This is the reason +for determining the name of the host with 'uname -n' and storing it in +'MyOrigin' for later use. We may also set variables with the '-v' +option from the command line. This interactivity is only of importance +in the context of starting a mobile agent; therefore this 'BEGIN' +pattern and its action do not take part in migration: + + BEGIN { + if (ARGC != 2) { + print "MOBAG - a simple mobile agent" + print "CALL:\n gawk -f mobag.awk mobag.awk" + print "IN:\n the name of this script as a command-line parameter" + print "PARAM:\n -v MyOrigin=myhost.com" + print "OUT:\n the result on stdout" + print "JK 29.03.1998 01.04.1998" + exit + } + if (MyOrigin == "") { + "uname -n" | getline MyOrigin + close("uname -n") + } + } + + Since 'gawk' cannot manipulate and transmit parts of the program +directly, the source code is read and stored in strings. Therefore, the +program scans itself for the beginning and the ending of functions. +Each line in between is appended to the code string until the end of the +function has been reached. 
A special case is this part of the program +itself. It is not a function. Placing a similar framework around it +causes it to be treated like a function. Notice that this mechanism +works for all the functions of the source code, but it cannot guarantee +that the order of the functions is preserved during migration: + + #ReadMySelf + /^function / { FUNC = $2 } + /^END/ || /^#ReadMySelf/ { FUNC = $1 } + FUNC != "" { MOBFUN[FUNC] = MOBFUN[FUNC] RS $0 } + (FUNC != "") && (/^}/ || /^#EndOfMySelf/) \ + { FUNC = "" } + #EndOfMySelf + + The web server code in *note A Web Service with Interaction: +Interacting Service, was first developed as a site-independent core. +Likewise, the 'gawk'-based mobile agent starts with an agent-independent +core, to which can be appended application-dependent functions. What +follows is the only application-independent function needed for the +mobile agent: + + function migrate(Destination, MobCode, Label) { + MOBVAR["Label"] = Label + MOBVAR["Destination"] = Destination + RS = ORS = "\r\n" + HttpService = "/inet/tcp/0/" Destination + for (i in MOBFUN) + MobCode = (MobCode "\n" MOBFUN[i]) + MobCode = MobCode "\n\nBEGIN {" + for (i in MOBVAR) + MobCode = (MobCode "\n MOBVAR[\"" i "\"] = \"" MOBVAR[i] "\"") + MobCode = MobCode "\n}\n" + print "POST /cgi-bin/PostAgent.sh HTTP/1.0" |& HttpService + print "Content-length:", length(MobCode) ORS |& HttpService + printf "%s", MobCode |& HttpService + while ((HttpService |& getline) > 0) + print $0 + close(HttpService) + } + + The 'migrate()' function prepares the aforementioned strings +containing the program code and transmits them to a server. A +consequence of this modular approach is that the 'migrate()' function +takes some parameters that aren't needed in this application, but that +will be in future ones. Its mandatory parameter 'Destination' holds the +name (or IP address) of the server that the agent wants as a host for +its code. The optional parameter 'MobCode' may contain some 'gawk' code +that is inserted during migration in front of all other code. The +optional parameter 'Label' may contain a string that tells the agent +what to do in program execution after arrival at its new home site. One +of the serious obstacles in implementing a framework for mobile agents +is that it does not suffice to migrate the code. It is also necessary +to migrate the state of execution of the agent. In contrast to 'Agent +Tcl', this program does not try to migrate the complete set of +variables. The following conventions are used: + + * Each variable in an agent program is local to the current host and + does _not_ migrate. + + * The array 'MOBFUN' shown above is an exception. It is handled by + the function 'migrate()' and does migrate with the application. + + * The other exception is the array 'MOBVAR'. Each variable that + takes part in migration has to be an element of this array. + 'migrate()' also takes care of this. + + Now it's clear what happens to the 'Label' parameter of the function +'migrate()'. It is copied into 'MOBVAR["Label"]' and travels alongside +the other data. Since travelling takes place via HTTP, records must be +separated with '"\r\n"' in 'RS' and 'ORS' as usual. The code assembly +for migration takes place in three steps: + + * Iterate over 'MOBFUN' to collect all functions verbatim. + + * Prepare a 'BEGIN' pattern and put assignments to mobile variables + into the action part. + + * Transmission itself resembles GETURL: the header with the request + and the 'Content-length' is followed by the body. 
In case there is + any reply over the network, it is read completely and echoed to + standard output to avoid irritating the server. + + The application-independent framework is now almost complete. What +follows is the 'END' pattern that is executed when the mobile agent has +finished reading its own code. First, it checks whether it is already +running on a remote host or not. In case initialization has not yet +taken place, it starts 'MyInit()'. Otherwise (later, on a remote host), +it starts 'MyJob()': + + END { + if (ARGC != 2) exit # stop when called with wrong parameters + if (MyOrigin != "") # is this the originating host? + MyInit() # if so, initialize the application + else # we are on a host with migrated data + MyJob() # so we do our job + } + + All that's left to extend the framework into a complete application +is to write two application-specific functions: 'MyInit()' and +'MyJob()'. Keep in mind that the former is executed once on the +originating host, while the latter is executed after each migration: + + function MyInit() { + MOBVAR["MyOrigin"] = MyOrigin + MOBVAR["Machines"] = "localhost/80 max/80 moritz/80 castor/80" + split(MOBVAR["Machines"], Machines) # which host is the first? + migrate(Machines[1], "", "") # go to the first host + while (("/inet/tcp/8080/0/0" |& getline) > 0) # wait for result + print $0 # print result + close("/inet/tcp/8080/0/0") + } + + As mentioned earlier, this agent takes the name of its origin +('MyOrigin') with it. Then, it takes the name of its first destination +and goes there for further work. Notice that this name has the port +number of the web server appended to the name of the server, because the +function 'migrate()' needs it this way to create the 'HttpService' +variable. Finally, it waits for the result to arrive. The 'MyJob()' +function runs on the remote host: + + function MyJob() { + # forget this host + sub(MOBVAR["Destination"], "", MOBVAR["Machines"]) + MOBVAR["Result"]=MOBVAR["Result"] SUBSEP SUBSEP MOBVAR["Destination"] ":" + while (("who" | getline) > 0) # who is logged in? + MOBVAR["Result"] = MOBVAR["Result"] SUBSEP $0 + close("who") + if (index(MOBVAR["Machines"], "/") > 0) { # any more machines to visit? + split(MOBVAR["Machines"], Machines) # which host is next? + migrate(Machines[1], "", "") # go there + } else { # no more machines + gsub(SUBSEP, "\n", MOBVAR["Result"]) # send result to origin + print MOBVAR["Result"] |& "/inet/tcp/0/" MOBVAR["MyOrigin"] "/8080" + close("/inet/tcp/0/" MOBVAR["MyOrigin"] "/8080") + } + } + + After migrating, the first thing to do in 'MyJob()' is to delete the +name of the current host from the list of hosts to visit. Now, it is +time to start the real work by appending the host's name to the result +string, and reading line by line who is logged in on this host. A very +annoying circumstance is the fact that the elements of 'MOBVAR' cannot +hold the newline character ('"\n"'). If they did, migration of this +string did not work because the string didn't obey the syntax rule for a +string in 'gawk'. 'SUBSEP' is used as a temporary replacement. If the +list of hosts to visit holds at least one more entry, the agent migrates +to that place to go on working there. Otherwise, we replace the +'SUBSEP's with a newline character in the resulting string, and report +it to the originating host, whose name is stored in +'MOBVAR["MyOrigin"]'. 
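+
+   The 'SUBSEP' trick is easy to check on its own.  The following
+little sketch (the host name is taken from the example above; the
+'who' lines are made up) accumulates a multiline result the same way
+'MyJob()' does and converts the 'SUBSEP' characters back into newlines
+only just before printing:
+
+     BEGIN {
+         Result = Result SUBSEP SUBSEP "castor:"
+         Result = Result SUBSEP "jk    tty1    Oct 16 08:00"
+         Result = Result SUBSEP "root  tty2    Oct 16 08:05"
+         gsub(SUBSEP, "\n", Result)   # restore newlines for the report
+         print Result
+     }
+
+   Because 'SUBSEP' defaults to the unprintable character '"\034"', it
+is unlikely to collide with anything 'who' prints, which is what makes
+it a safe placeholder here.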
+ + ---------- Footnotes ---------- + + (1) <http://www.research.ibm.com/massive/mobag.ps> + + +File: gawkinet.info, Node: STOXPRED, Next: PROTBASE, Prev: MOBAGWHO, Up: Some Applications and Techniques + +3.9 STOXPRED: Stock Market Prediction As A Service +================================================== + + Far out in the uncharted backwaters of the unfashionable end of the + Western Spiral arm of the Galaxy lies a small unregarded yellow + sun. + + Orbiting this at a distance of roughly ninety-two million miles is + an utterly insignificant little blue-green planet whose + ape-descendent life forms are so amazingly primitive that they + still think digital watches are a pretty neat idea. + + This planet has -- or rather had -- a problem, which was this: most + of the people living on it were unhappy for pretty much of the + time. Many solutions were suggested for this problem, but most of + these were largely concerned with the movements of small green + pieces of paper, which is odd because it wasn't the small green + pieces of paper that were unhappy. + Douglas Adams, 'The Hitch Hiker's Guide to the Galaxy' + + Valuable services on the Internet are usually _not_ implemented as +mobile agents. There are much simpler ways of implementing services. +All Unix systems provide, for example, the 'cron' service. Unix system +users can write a list of tasks to be done each day, each week, twice a +day, or just once. The list is entered into a file named 'crontab'. +For example, to distribute a newsletter on a daily basis this way, use +'cron' for calling a script each day early in the morning. + + # run at 8 am on weekdays, distribute the newsletter + 0 8 * * 1-5 $HOME/bin/daily.job >> $HOME/log/newsletter 2>&1 + + The script first looks for interesting information on the Internet, +assembles it in a nice form and sends the results via email to the +customers. + + The following is an example of a primitive newsletter on stock market +prediction. It is a report which first tries to predict the change of +each share in the Dow Jones Industrial Index for the particular day. +Then it mentions some especially promising shares as well as some shares +which look remarkably bad on that day. The report ends with the usual +disclaimer which tells every child _not_ to try this at home and hurt +anybody. + + Good morning Uncle Scrooge, + + This is your daily stock market report for Monday, October 16, 2000. + Here are the predictions for today: + + AA neutral + GE up + JNJ down + MSFT neutral + ... + UTX up + DD down + IBM up + MO down + WMT up + DIS up + INTC up + MRK down + XOM down + EK down + IP down + + The most promising shares for today are these: + + INTC http://biz.yahoo.com/n/i/intc.html + + The stock shares to avoid today are these: + + EK http://biz.yahoo.com/n/e/ek.html + IP http://biz.yahoo.com/n/i/ip.html + DD http://biz.yahoo.com/n/d/dd.html + ... + + The script as a whole is rather long. In order to ease the pain of +studying other people's source code, we have broken the script up into +meaningful parts which are invoked one after the other. The basic +structure of the script is as follows: + + BEGIN { + Init() + ReadQuotes() + CleanUp() + Prediction() + Report() + SendMail() + } + + The earlier parts store data into variables and arrays which are +subsequently used by later parts of the script. The 'Init()' function +first checks if the script is invoked correctly (without any +parameters). If not, it informs the user of the correct usage. 
What
+follows are preparations for the retrieval of the historical quote data.
+The names of the 30 stock shares are stored in an array 'name' along
+with the current date in 'day', 'month', and 'year'.
+
+   All users who are separated from the Internet by a firewall and have
+to direct their Internet accesses to a proxy must supply the name of the
+proxy to this script with the '-v Proxy=NAME' option.  For most users,
+the default proxy and port number should suffice.
+
+     function Init() {
+       if (ARGC != 1) {
+         print "STOXPRED - daily stock share prediction"
+         print "IN:\n no parameters, nothing on stdin"
+         print "PARAM:\n -v Proxy=MyProxy -v ProxyPort=80"
+         print "OUT:\n commented predictions as email"
+         print "JK 09.10.2000"
+         exit
+       }
+       # Remember ticker symbols from Dow Jones Industrial Index
+       StockCount = split("AA GE JNJ MSFT AXP GM JPM PG BA HD KO \
+           SBC C HON MCD T CAT HWP MMM UTX DD IBM MO WMT DIS INTC \
+           MRK XOM EK IP", name);
+       # Remember the current date as the end of the time series
+       day   = strftime("%d")
+       month = strftime("%m")
+       year  = strftime("%Y")
+       if (Proxy == "")     Proxy = "chart.yahoo.com"
+       if (ProxyPort == 0)  ProxyPort = 80
+       YahooData = "/inet/tcp/0/" Proxy "/" ProxyPort
+     }
+
+   There are two really interesting parts in the script.  One is the
+function which reads the historical stock quotes from an Internet
+server.  The other is the one that does the actual prediction.  In the
+following function we see how the quotes are read from the Yahoo server.
+The data which comes from the server is in CSV format (comma-separated
+values):
+
+     Date,Open,High,Low,Close,Volume
+     9-Oct-00,22.75,22.75,21.375,22.375,7888500
+     6-Oct-00,23.8125,24.9375,21.5625,22,10701100
+     5-Oct-00,24.4375,24.625,23.125,23.50,5810300
+
+   Lines contain values of the same time instant, whereas columns are
+separated by commas and contain the kind of data that is described in
+the header (first) line.  At first, 'gawk' is instructed to separate
+columns by commas ('FS = ","').  In the loop that follows, a connection
+to the Yahoo server is first opened, then a download takes place, and
+finally the connection is closed.  All this happens once for each ticker
+symbol.  In the body of this loop, an Internet address is built up as a
+string according to the rules of the Yahoo server.  The starting and
+ending date are chosen to be exactly the same, but one year apart in the
+past.  All the action is initiated within the 'printf' command which
+transmits the request for data to the Yahoo server.
+
+   In the inner loop, the server's data is first read and then scanned
+line by line.  Only lines which have six columns and the name of a month
+in the first column contain relevant data.  This data is stored in the
+two-dimensional array 'quote'; one dimension being time, the other being
+the ticker symbol.  During retrieval of the first stock's data, the
+calendar names of the time instances are stored in the array 'days'
+because we need them later.
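+
+   To see what this comma-separated field splitting amounts to, here is
+a tiny stand-alone sketch of ours (not part of STOXPRED; the data line
+is copied from the sample above):
+
+     BEGIN {
+       FS = ","
+       $0 = "9-Oct-00,22.75,22.75,21.375,22.375,7888500"
+       print "date: " $1 ", close: " $5 ", columns: " NF
+     }
+
+   Assigning to '$0' rebuilds the fields with the current 'FS', so '$1'
+holds the date, '$5' holds the closing price, and 'NF' is 6.  The inner
+loop below applies the same kind of test ('NF == 6' and a month name in
+'$1') to each line that arrives from the server.  The complete function
+follows.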
+ + function ReadQuotes() { + # Retrieve historical data for each ticker symbol + FS = "," + for (stock = 1; stock <= StockCount; stock++) { + URL = "http://chart.yahoo.com/table.csv?s=" name[stock] \ + "&a=" month "&b=" day "&c=" year-1 \ + "&d=" month "&e=" day "&f=" year \ + "g=d&q=q&y=0&z=" name[stock] "&x=.csv" + printf("GET " URL " HTTP/1.0\r\n\r\n") |& YahooData + while ((YahooData |& getline) > 0) { + if (NF == 6 && $1 ~ /Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec/) { + if (stock == 1) + days[++daycount] = $1; + quote[$1, stock] = $5 + } + } + close(YahooData) + } + FS = " " + } + + Now that we _have_ the data, it can be checked once again to make +sure that no individual stock is missing or invalid, and that all the +stock quotes are aligned correctly. Furthermore, we renumber the time +instances. The most recent day gets day number 1 and all other days get +consecutive numbers. All quotes are rounded toward the nearest whole +number in US Dollars. + + function CleanUp() { + # clean up time series; eliminate incomplete data sets + for (d = 1; d <= daycount; d++) { + for (stock = 1; stock <= StockCount; stock++) + if (! ((days[d], stock) in quote)) + stock = StockCount + 10 + if (stock > StockCount + 1) + continue + datacount++ + for (stock = 1; stock <= StockCount; stock++) + data[datacount, stock] = int(0.5 + quote[days[d], stock]) + } + delete quote + delete days + } + + Now we have arrived at the second really interesting part of the +whole affair. What we present here is a very primitive prediction +algorithm: _If a stock fell yesterday, assume it will also fall today; +if it rose yesterday, assume it will rise today_. (Feel free to replace +this algorithm with a smarter one.) If a stock changed in the same +direction on two consecutive days, this is an indication which should be +highlighted. Two-day advances are stored in 'hot' and two-day declines +in 'avoid'. + + The rest of the function is a sanity check. It counts the number of +correct predictions in relation to the total number of predictions one +could have made in the year before. + + function Prediction() { + # Predict each ticker symbol by prolonging yesterday's trend + for (stock = 1; stock <= StockCount; stock++) { + if (data[1, stock] > data[2, stock]) { + predict[stock] = "up" + } else if (data[1, stock] < data[2, stock]) { + predict[stock] = "down" + } else { + predict[stock] = "neutral" + } + if ((data[1, stock] > data[2, stock]) && (data[2, stock] > data[3, stock])) + hot[stock] = 1 + if ((data[1, stock] < data[2, stock]) && (data[2, stock] < data[3, stock])) + avoid[stock] = 1 + } + # Do a plausibility check: how many predictions proved correct? + for (s = 1; s <= StockCount; s++) { + for (d = 1; d <= datacount-2; d++) { + if (data[d+1, s] > data[d+2, s]) { + UpCount++ + } else if (data[d+1, s] < data[d+2, s]) { + DownCount++ + } else { + NeutralCount++ + } + if (((data[d, s] > data[d+1, s]) && (data[d+1, s] > data[d+2, s])) || + ((data[d, s] < data[d+1, s]) && (data[d+1, s] < data[d+2, s])) || + ((data[d, s] == data[d+1, s]) && (data[d+1, s] == data[d+2, s]))) + CorrectCount++ + } + } + } + + At this point the hard work has been done: the array 'predict' +contains the predictions for all the ticker symbols. It is up to the +function 'Report()' to find some nice words to introduce the desired +information. 
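+
+   As an aside, the invitation above to replace the prediction rule
+with a smarter one is easy to take up.  The following sketch is an
+assumption of ours, not part of the original script (the name
+'SmoothedPrediction()' is made up); it predicts 'up' whenever the most
+recent close lies above the average of the last five closes, reusing
+the 'data', 'predict', 'StockCount', and 'datacount' conventions
+introduced above:
+
+     function SmoothedPrediction(    stock, d, sum, avg) {
+       for (stock = 1; stock <= StockCount; stock++) {
+         sum = 0
+         for (d = 1; d <= 5 && d <= datacount; d++)
+           sum += data[d, stock]
+         avg = sum / (d - 1)                    # average of the recent closes
+         if      (data[1, stock] > avg) predict[stock] = "up"
+         else if (data[1, stock] < avg) predict[stock] = "down"
+         else                           predict[stock] = "neutral"
+       }
+     }
+
+   Here, then, is 'Report()':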
+ + function Report() { + # Generate report + report = "\nThis is your daily " + report = report "stock market report for "strftime("%A, %B %d, %Y")".\n" + report = report "Here are the predictions for today:\n\n" + for (stock = 1; stock <= StockCount; stock++) + report = report "\t" name[stock] "\t" predict[stock] "\n" + for (stock in hot) { + if (HotCount++ == 0) + report = report "\nThe most promising shares for today are these:\n\n" + report = report "\t" name[stock] "\t\thttp://biz.yahoo.com/n/" \ + tolower(substr(name[stock], 1, 1)) "/" tolower(name[stock]) ".html\n" + } + for (stock in avoid) { + if (AvoidCount++ == 0) + report = report "\nThe stock shares to avoid today are these:\n\n" + report = report "\t" name[stock] "\t\thttp://biz.yahoo.com/n/" \ + tolower(substr(name[stock], 1, 1)) "/" tolower(name[stock]) ".html\n" + } + report = report "\nThis sums up to " HotCount+0 " winners and " AvoidCount+0 + report = report " losers. When using this kind\nof prediction scheme for" + report = report " the 12 months which lie behind us,\nwe get " UpCount + report = report " 'ups' and " DownCount " 'downs' and " NeutralCount + report = report " 'neutrals'. Of all\nthese " UpCount+DownCount+NeutralCount + report = report " predictions " CorrectCount " proved correct next day.\n" + report = report "A success rate of "\ + int(100*CorrectCount/(UpCount+DownCount+NeutralCount)) "%.\n" + report = report "Random choice would have produced a 33% success rate.\n" + report = report "Disclaimer: Like every other prediction of the stock\n" + report = report "market, this report is, of course, complete nonsense.\n" + report = report "If you are stupid enough to believe these predictions\n" + report = report "you should visit a doctor who can treat your ailment." + } + + The function 'SendMail()' goes through the list of customers and +opens a pipe to the 'mail' command for each of them. Each one receives +an email message with a proper subject heading and is addressed with his +full name. + + function SendMail() { + # send report to customers + customer["uncle.scrooge@ducktown.gov"] = "Uncle Scrooge" + customer["more@utopia.org" ] = "Sir Thomas More" + customer["spinoza@denhaag.nl" ] = "Baruch de Spinoza" + customer["marx@highgate.uk" ] = "Karl Marx" + customer["keynes@the.long.run" ] = "John Maynard Keynes" + customer["bierce@devil.hell.org" ] = "Ambrose Bierce" + customer["laplace@paris.fr" ] = "Pierre Simon de Laplace" + for (c in customer) { + MailPipe = "mail -s 'Daily Stock Prediction Newsletter'" c + print "Good morning " customer[c] "," | MailPipe + print report "\n.\n" | MailPipe + close(MailPipe) + } + } + + Be patient when running the script by hand. Retrieving the data for +all the ticker symbols and sending the emails may take several minutes +to complete, depending upon network traffic and the speed of the +available Internet link. The quality of the prediction algorithm is +likely to be disappointing. Try to find a better one. Should you find +one with a success rate of more than 50%, please tell us about it! It +is only for the sake of curiosity, of course. ':-)' + + +File: gawkinet.info, Node: PROTBASE, Prev: STOXPRED, Up: Some Applications and Techniques + +3.10 PROTBASE: Searching Through A Protein Database +=================================================== + + Hoare's Law of Large Problems: Inside every large problem is a + small problem struggling to get out. + + Yahoo's database of stock market data is just one among the many +large databases on the Internet. 
Another one is located at NCBI +(National Center for Biotechnology Information). Established in 1988 as +a national resource for molecular biology information, NCBI creates +public databases, conducts research in computational biology, develops +software tools for analyzing genome data, and disseminates biomedical +information. In this section, we look at one of NCBI's public services, +which is called BLAST (Basic Local Alignment Search Tool). + + You probably know that the information necessary for reproducing +living cells is encoded in the genetic material of the cells. The +genetic material is a very long chain of four base nucleotides. It is +the order of appearance (the sequence) of nucleotides which contains the +information about the substance to be produced. Scientists in +biotechnology often find a specific fragment, determine the nucleotide +sequence, and need to know where the sequence at hand comes from. This +is where the large databases enter the game. At NCBI, databases store +the knowledge about which sequences have ever been found and where they +have been found. When the scientist sends his sequence to the BLAST +service, the server looks for regions of genetic material in its +database which look the most similar to the delivered nucleotide +sequence. After a search time of some seconds or minutes the server +sends an answer to the scientist. In order to make access simple, NCBI +chose to offer their database service through popular Internet +protocols. There are four basic ways to use the so-called BLAST +services: + + * The easiest way to use BLAST is through the web. Users may simply + point their browsers at the NCBI home page and link to the BLAST + pages. NCBI provides a stable URL that may be used to perform + BLAST searches without interactive use of a web browser. This is + what we will do later in this section. A demonstration client and + a 'README' file demonstrate how to access this URL. + + * Currently, 'blastcl3' is the standard network BLAST client. You + can download 'blastcl3' from the anonymous FTP location. + + * BLAST 2.0 can be run locally as a full executable and can be used + to run BLAST searches against private local databases, or + downloaded copies of the NCBI databases. BLAST 2.0 executables may + be found on the NCBI anonymous FTP server. + + * The NCBI BLAST Email server is the best option for people without + convenient access to the web. A similarity search can be performed + by sending a properly formatted mail message containing the + nucleotide or protein query sequence to <blast@ncbi.nlm.nih.gov>. + The query sequence is compared against the specified database using + the BLAST algorithm and the results are returned in an email + message. For more information on formulating email BLAST searches, + you can send a message consisting of the word "HELP" to the same + address, <blast@ncbi.nlm.nih.gov>. + + Our starting point is the demonstration client mentioned in the first +option. The 'README' file that comes along with the client explains the +whole process in a nutshell. In the rest of this section, we first show +what such requests look like. Then we show how to use 'gawk' to +implement a client in about 10 lines of code. Finally, we show how to +interpret the result returned from the service. 
+ + Sequences are expected to be represented in the standard IUB/IUPAC +amino acid and nucleic acid codes, with these exceptions: lower-case +letters are accepted and are mapped into upper-case; a single hyphen or +dash can be used to represent a gap of indeterminate length; and in +amino acid sequences, 'U' and '*' are acceptable letters (see below). +Before submitting a request, any numerical digits in the query sequence +should either be removed or replaced by appropriate letter codes (e.g., +'N' for unknown nucleic acid residue or 'X' for unknown amino acid +residue). The nucleic acid codes supported are: + + A --> adenosine M --> A C (amino) + C --> cytidine S --> G C (strong) + G --> guanine W --> A T (weak) + T --> thymidine B --> G T C + U --> uridine D --> G A T + R --> G A (purine) H --> A C T + Y --> T C (pyrimidine) V --> G C A + K --> G T (keto) N --> A G C T (any) + - gap of indeterminate length + + Now you know the alphabet of nucleotide sequences. The last two +lines of the following example query show you such a sequence, which is +obviously made up only of elements of the alphabet just described. +Store this example query into a file named 'protbase.request'. You are +now ready to send it to the server with the demonstration client. + + PROGRAM blastn + DATALIB month + EXPECT 0.75 + BEGIN + >GAWK310 the gawking gene GNU AWK + tgcttggctgaggagccataggacgagagcttcctggtgaagtgtgtttcttgaaatcat + caccaccatggacagcaaa + + The actual search request begins with the mandatory parameter +'PROGRAM' in the first column followed by the value 'blastn' (the name +of the program) for searching nucleic acids. The next line contains the +mandatory search parameter 'DATALIB' with the value 'month' for the +newest nucleic acid sequences. The third line contains an optional +'EXPECT' parameter and the value desired for it. The fourth line +contains the mandatory 'BEGIN' directive, followed by the query sequence +in FASTA/Pearson format. Each line of information must be less than 80 +characters in length. + + The "month" database contains all new or revised sequences released +in the last 30 days and is useful for searching against new sequences. +There are five different blast programs, 'blastn' being the one that +compares a nucleotide query sequence against a nucleotide sequence +database. + + The last server directive that must appear in every request is the +'BEGIN' directive. The query sequence should immediately follow the +'BEGIN' directive and must appear in FASTA/Pearson format. A sequence +in FASTA/Pearson format begins with a single-line description. The +description line, which is required, is distinguished from the lines of +sequence data that follow it by having a greater-than ('>') symbol in +the first column. For the purposes of the BLAST server, the text of the +description is arbitrary. + + If you prefer to use a client written in 'gawk', just store the +following 10 lines of code into a file named 'protbase.awk' and use this +client instead. Invoke it with 'gawk -f protbase.awk protbase.request'. +Then wait a minute and watch the result coming in. In order to +replicate the demonstration client's behavior as closely as possible, +this client does not use a proxy server. We could also have extended +the client program in *note Retrieving Web Pages: GETURL, to implement +the client request from 'protbase.awk' as a special case. 
+ + { request = request "\n" $0 } + + END { + BLASTService = "/inet/tcp/0/www.ncbi.nlm.nih.gov/80" + printf "POST /cgi-bin/BLAST/nph-blast_report HTTP/1.0\n" |& BLASTService + printf "Content-Length: " length(request) "\n\n" |& BLASTService + printf request |& BLASTService + while ((BLASTService |& getline) > 0) + print $0 + close(BLASTService) + } + + The demonstration client from NCBI is 214 lines long (written in C) +and it is not immediately obvious what it does. Our client is so short +that it _is_ obvious what it does. First it loops over all lines of the +query and stores the whole query into a variable. Then the script +establishes an Internet connection to the NCBI server and transmits the +query by framing it with a proper HTTP request. Finally it receives and +prints the complete result coming from the server. + + Now, let us look at the result. It begins with an HTTP header, which +you can ignore. Then there are some comments about the query having +been filtered to avoid spuriously high scores. After this, there is a +reference to the paper that describes the software being used for +searching the data base. After a repetition of the original query's +description we find the list of significant alignments: + + Sequences producing significant alignments: (bits) Value + + gb|AC021182.14|AC021182 Homo sapiens chromosome 7 clone RP11-733... 38 0.20 + gb|AC021056.12|AC021056 Homo sapiens chromosome 3 clone RP11-115... 38 0.20 + emb|AL160278.10|AL160278 Homo sapiens chromosome 9 clone RP11-57... 38 0.20 + emb|AL391139.11|AL391139 Homo sapiens chromosome X clone RP11-35... 38 0.20 + emb|AL365192.6|AL365192 Homo sapiens chromosome 6 clone RP3-421H... 38 0.20 + emb|AL138812.9|AL138812 Homo sapiens chromosome 11 clone RP1-276... 38 0.20 + gb|AC073881.3|AC073881 Homo sapiens chromosome 15 clone CTD-2169... 38 0.20 + + This means that the query sequence was found in seven human +chromosomes. But the value 0.20 (20%) means that the probability of an +accidental match is rather high (20%) in all cases and should be taken +into account. You may wonder what the first column means. It is a key +to the specific database in which this occurrence was found. The unique +sequence identifiers reported in the search results can be used as +sequence retrieval keys via the NCBI server. The syntax of sequence +header lines used by the NCBI BLAST server depends on the database from +which each sequence was obtained. The table below lists the identifiers +for the databases from which the sequences were derived. + + Database Name Identifier Syntax + ============================ ======================== + GenBank gb|accession|locus + EMBL Data Library emb|accession|locus + DDBJ, DNA Database of Japan dbj|accession|locus + NBRF PIR pir||entry + Protein Research Foundation prf||name + SWISS-PROT sp|accession|entry name + Brookhaven Protein Data Bank pdb|entry|chain + Kabat's Sequences of Immuno... gnl|kabat|identifier + Patents pat|country|number + GenInfo Backbone Id bbs|number + + For example, an identifier might be 'gb|AC021182.14|AC021182', where +the 'gb' tag indicates that the identifier refers to a GenBank sequence, +'AC021182.14' is its GenBank ACCESSION, and 'AC021182' is the GenBank +LOCUS. The identifier contains no spaces, so that a space indicates the +end of the identifier. + + Let us continue in the result listing. Each of the seven alignments +mentioned above is subsequently described in detail. We will have a +closer look at the first of them. 
+ + >gb|AC021182.14|AC021182 Homo sapiens chromosome 7 clone RP11-733N23, WORKING DRAFT SEQUENCE, 4 + unordered pieces + Length = 176383 + + Score = 38.2 bits (19), Expect = 0.20 + Identities = 19/19 (100%) + Strand = Plus / Plus + + Query: 35 tggtgaagtgtgtttcttg 53 + ||||||||||||||||||| + Sbjct: 69786 tggtgaagtgtgtttcttg 69804 + + This alignment was located on the human chromosome 7. The fragment +on which part of the query was found had a total length of 176383. Only +19 of the nucleotides matched and the matching sequence ran from +character 35 to 53 in the query sequence and from 69786 to 69804 in the +fragment on chromosome 7. If you are still reading at this point, you +are probably interested in finding out more about Computational Biology +and you might appreciate the following hints. + + 1. There is a book called 'Introduction to Computational Biology' by + Michael S. Waterman, which is worth reading if you are seriously + interested. You can find a good book review on the Internet. + + 2. While Waterman's book can explain to you the algorithms employed + internally in the database search engines, most practitioners + prefer to approach the subject differently. The applied side of + Computational Biology is called Bioinformatics, and emphasizes the + tools available for day-to-day work as well as how to actually + _use_ them. One of the very few affordable books on Bioinformatics + is 'Developing Bioinformatics Computer Skills'. + + 3. The sequences _gawk_ and _gnuawk_ are in widespread use in the + genetic material of virtually every earthly living being. Let us + take this as a clear indication that the divine creator has + intended 'gawk' to prevail over other scripting languages such as + 'perl', 'tcl', or 'python' which are not even proper sequences. + (:-) + + +File: gawkinet.info, Node: Links, Next: GNU Free Documentation License, Prev: Some Applications and Techniques, Up: Top + +4 Related Links +*************** + +This section lists the URLs for various items discussed in this major +node. They are presented in the order in which they appear. + +'Internet Programming with Python' + <http://www.fsbassociates.com/books/python.htm> + +'Advanced Perl Programming' + <http://www.oreilly.com/catalog/advperl> + +'Web Client Programming with Perl' + <http://www.oreilly.com/catalog/webclient> + +Richard Stevens's home page and book + <http://www.kohala.com/~rstevens> + +The SPAK home page + <http://www.userfriendly.net/linux/RPM/contrib/libc6/i386/spak-0.6b-1.i386.html> + +Volume III of 'Internetworking with TCP/IP', by Comer and Stevens + <http://www.cs.purdue.edu/homes/dec/tcpip3s.cont.html> + +XBM Graphics File Format + <http://www.wotsit.org/download.asp?f=xbm> + +GNUPlot + <http://www.cs.dartmouth.edu/gnuplot_info.html> + +Mark Humphrys' Eliza page + <http://www.compapp.dcu.ie/~humphrys/eliza.html> + +Yahoo! 
Eliza Information + <http://dir.yahoo.com/Recreation/Games/Computer_Games/Internet_Games/Web_Games/Artificial_Intelligence> + +Java versions of Eliza + <http://www.tjhsst.edu/Psych/ch1/eliza.html> + +Java versions of Eliza with source code + <http://home.adelphia.net/~lifeisgood/eliza/eliza.htm> + +Eliza Programs with Explanations + <http://chayden.net/chayden/eliza/Eliza.shtml> + +Loebner Contest + <http://acm.org/~loebner/loebner-prize.htmlx> + +Tck/Tk Information + <http://www.scriptics.com/> + +Intel 80x86 Processors + <http://developer.intel.com/design/platform/embedpc/what_is.htm> + +AMD Elan Processors + <http://www.amd.com/products/epd/processors/4.32bitcont/32bitcont/index.html> + +XINU + <http://willow.canberra.edu.au/~chrisc/xinu.html> + +GNU/Linux + <http://uclinux.lineo.com/> + +Embedded PCs + <http://dir.yahoo.com/Business_and_Economy/Business_to_Business/Computers/Hardware/Embedded_Control/> + +MiniSQL + <http://www.hughes.com.au/library/> + +Market Share Surveys + <http://www.netcraft.com/survey> + +'Numerical Recipes in C: The Art of Scientific Computing' + <http://www.nr.com> + +VRML + <http://www.vrml.org> + +The VRML FAQ + <http://www.vrml.org/technicalinfo/specifications/specifications.htm#FAQ> + +The UMBC Agent Web + <http://www.cs.umbc.edu/agents> + +Apache Web Server + <http://www.apache.org> + +National Center for Biotechnology Information (NCBI) + <http://www.ncbi.nlm.nih.gov> + +Basic Local Alignment Search Tool (BLAST) + <http://www.ncbi.nlm.nih.gov/BLAST/blast_overview.html> + +NCBI Home Page + <http://www.ncbi.nlm.nih.gov> + +BLAST Pages + <http://www.ncbi.nlm.nih.gov/BLAST> + +BLAST Demonstration Client + <ftp://ncbi.nlm.nih.gov/blast/blasturl/> + +BLAST anonymous FTP location + <ftp://ncbi.nlm.nih.gov/blast/network/netblast/> + +BLAST 2.0 Executables + <ftp://ncbi.nlm.nih.gov/blast/executables/> + +IUB/IUPAC Amino Acid and Nucleic Acid Codes + <http://www.uthscsa.edu/geninfo/blastmail.html#item6> + +FASTA/Pearson Format + <http://www.ncbi.nlm.nih.gov/BLAST/fasta.html> + +Fasta/Pearson Sequence in Java + <http://www.kazusa.or.jp/java/codon_table_java/> + +Book Review of 'Introduction to Computational Biology' + <http://www.acm.org/crossroads/xrds5-1/introcb.html> + +'Developing Bioinformatics Computer Skills' + <http://www.oreilly.com/catalog/bioskills/> + + +File: gawkinet.info, Node: GNU Free Documentation License, Next: Index, Prev: Links, Up: Top + +GNU Free Documentation License +****************************** + + Version 1.3, 3 November 2008 + + Copyright (C) 2000, 2001, 2002, 2007, 2008 Free Software Foundation, Inc. + <http://fsf.org/> + + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + 0. PREAMBLE + + The purpose of this License is to make a manual, textbook, or other + functional and useful document "free" in the sense of freedom: to + assure everyone the effective freedom to copy and redistribute it, + with or without modifying it, either commercially or + noncommercially. Secondarily, this License preserves for the + author and publisher a way to get credit for their work, while not + being considered responsible for modifications made by others. + + This License is a kind of "copyleft", which means that derivative + works of the document must themselves be free in the same sense. + It complements the GNU General Public License, which is a copyleft + license designed for free software. 
+ + We have designed this License in order to use it for manuals for + free software, because free software needs free documentation: a + free program should come with manuals providing the same freedoms + that the software does. But this License is not limited to + software manuals; it can be used for any textual work, regardless + of subject matter or whether it is published as a printed book. We + recommend this License principally for works whose purpose is + instruction or reference. + + 1. APPLICABILITY AND DEFINITIONS + + This License applies to any manual or other work, in any medium, + that contains a notice placed by the copyright holder saying it can + be distributed under the terms of this License. Such a notice + grants a world-wide, royalty-free license, unlimited in duration, + to use that work under the conditions stated herein. The + "Document", below, refers to any such manual or work. Any member + of the public is a licensee, and is addressed as "you". You accept + the license if you copy, modify or distribute the work in a way + requiring permission under copyright law. + + A "Modified Version" of the Document means any work containing the + Document or a portion of it, either copied verbatim, or with + modifications and/or translated into another language. + + A "Secondary Section" is a named appendix or a front-matter section + of the Document that deals exclusively with the relationship of the + publishers or authors of the Document to the Document's overall + subject (or to related matters) and contains nothing that could + fall directly within that overall subject. (Thus, if the Document + is in part a textbook of mathematics, a Secondary Section may not + explain any mathematics.) The relationship could be a matter of + historical connection with the subject or with related matters, or + of legal, commercial, philosophical, ethical or political position + regarding them. + + The "Invariant Sections" are certain Secondary Sections whose + titles are designated, as being those of Invariant Sections, in the + notice that says that the Document is released under this License. + If a section does not fit the above definition of Secondary then it + is not allowed to be designated as Invariant. The Document may + contain zero Invariant Sections. If the Document does not identify + any Invariant Sections then there are none. + + The "Cover Texts" are certain short passages of text that are + listed, as Front-Cover Texts or Back-Cover Texts, in the notice + that says that the Document is released under this License. A + Front-Cover Text may be at most 5 words, and a Back-Cover Text may + be at most 25 words. + + A "Transparent" copy of the Document means a machine-readable copy, + represented in a format whose specification is available to the + general public, that is suitable for revising the document + straightforwardly with generic text editors or (for images composed + of pixels) generic paint programs or (for drawings) some widely + available drawing editor, and that is suitable for input to text + formatters or for automatic translation to a variety of formats + suitable for input to text formatters. A copy made in an otherwise + Transparent file format whose markup, or absence of markup, has + been arranged to thwart or discourage subsequent modification by + readers is not Transparent. An image format is not Transparent if + used for any substantial amount of text. A copy that is not + "Transparent" is called "Opaque". 
+ + Examples of suitable formats for Transparent copies include plain + ASCII without markup, Texinfo input format, LaTeX input format, + SGML or XML using a publicly available DTD, and standard-conforming + simple HTML, PostScript or PDF designed for human modification. + Examples of transparent image formats include PNG, XCF and JPG. + Opaque formats include proprietary formats that can be read and + edited only by proprietary word processors, SGML or XML for which + the DTD and/or processing tools are not generally available, and + the machine-generated HTML, PostScript or PDF produced by some word + processors for output purposes only. + + The "Title Page" means, for a printed book, the title page itself, + plus such following pages as are needed to hold, legibly, the + material this License requires to appear in the title page. For + works in formats which do not have any title page as such, "Title + Page" means the text near the most prominent appearance of the + work's title, preceding the beginning of the body of the text. + + The "publisher" means any person or entity that distributes copies + of the Document to the public. + + A section "Entitled XYZ" means a named subunit of the Document + whose title either is precisely XYZ or contains XYZ in parentheses + following text that translates XYZ in another language. (Here XYZ + stands for a specific section name mentioned below, such as + "Acknowledgements", "Dedications", "Endorsements", or "History".) + To "Preserve the Title" of such a section when you modify the + Document means that it remains a section "Entitled XYZ" according + to this definition. + + The Document may include Warranty Disclaimers next to the notice + which states that this License applies to the Document. These + Warranty Disclaimers are considered to be included by reference in + this License, but only as regards disclaiming warranties: any other + implication that these Warranty Disclaimers may have is void and + has no effect on the meaning of this License. + + 2. VERBATIM COPYING + + You may copy and distribute the Document in any medium, either + commercially or noncommercially, provided that this License, the + copyright notices, and the license notice saying this License + applies to the Document are reproduced in all copies, and that you + add no other conditions whatsoever to those of this License. You + may not use technical measures to obstruct or control the reading + or further copying of the copies you make or distribute. However, + you may accept compensation in exchange for copies. If you + distribute a large enough number of copies you must also follow the + conditions in section 3. + + You may also lend copies, under the same conditions stated above, + and you may publicly display copies. + + 3. COPYING IN QUANTITY + + If you publish printed copies (or copies in media that commonly + have printed covers) of the Document, numbering more than 100, and + the Document's license notice requires Cover Texts, you must + enclose the copies in covers that carry, clearly and legibly, all + these Cover Texts: Front-Cover Texts on the front cover, and + Back-Cover Texts on the back cover. Both covers must also clearly + and legibly identify you as the publisher of these copies. The + front cover must present the full title with all words of the title + equally prominent and visible. You may add other material on the + covers in addition. 
Copying with changes limited to the covers, as + long as they preserve the title of the Document and satisfy these + conditions, can be treated as verbatim copying in other respects. + + If the required texts for either cover are too voluminous to fit + legibly, you should put the first ones listed (as many as fit + reasonably) on the actual cover, and continue the rest onto + adjacent pages. + + If you publish or distribute Opaque copies of the Document + numbering more than 100, you must either include a machine-readable + Transparent copy along with each Opaque copy, or state in or with + each Opaque copy a computer-network location from which the general + network-using public has access to download using public-standard + network protocols a complete Transparent copy of the Document, free + of added material. If you use the latter option, you must take + reasonably prudent steps, when you begin distribution of Opaque + copies in quantity, to ensure that this Transparent copy will + remain thus accessible at the stated location until at least one + year after the last time you distribute an Opaque copy (directly or + through your agents or retailers) of that edition to the public. + + It is requested, but not required, that you contact the authors of + the Document well before redistributing any large number of copies, + to give them a chance to provide you with an updated version of the + Document. + + 4. MODIFICATIONS + + You may copy and distribute a Modified Version of the Document + under the conditions of sections 2 and 3 above, provided that you + release the Modified Version under precisely this License, with the + Modified Version filling the role of the Document, thus licensing + distribution and modification of the Modified Version to whoever + possesses a copy of it. In addition, you must do these things in + the Modified Version: + + A. Use in the Title Page (and on the covers, if any) a title + distinct from that of the Document, and from those of previous + versions (which should, if there were any, be listed in the + History section of the Document). You may use the same title + as a previous version if the original publisher of that + version gives permission. + + B. List on the Title Page, as authors, one or more persons or + entities responsible for authorship of the modifications in + the Modified Version, together with at least five of the + principal authors of the Document (all of its principal + authors, if it has fewer than five), unless they release you + from this requirement. + + C. State on the Title page the name of the publisher of the + Modified Version, as the publisher. + + D. Preserve all the copyright notices of the Document. + + E. Add an appropriate copyright notice for your modifications + adjacent to the other copyright notices. + + F. Include, immediately after the copyright notices, a license + notice giving the public permission to use the Modified + Version under the terms of this License, in the form shown in + the Addendum below. + + G. Preserve in that license notice the full lists of Invariant + Sections and required Cover Texts given in the Document's + license notice. + + H. Include an unaltered copy of this License. + + I. Preserve the section Entitled "History", Preserve its Title, + and add to it an item stating at least the title, year, new + authors, and publisher of the Modified Version as given on the + Title Page. 
If there is no section Entitled "History" in the + Document, create one stating the title, year, authors, and + publisher of the Document as given on its Title Page, then add + an item describing the Modified Version as stated in the + previous sentence. + + J. Preserve the network location, if any, given in the Document + for public access to a Transparent copy of the Document, and + likewise the network locations given in the Document for + previous versions it was based on. These may be placed in the + "History" section. You may omit a network location for a work + that was published at least four years before the Document + itself, or if the original publisher of the version it refers + to gives permission. + + K. For any section Entitled "Acknowledgements" or "Dedications", + Preserve the Title of the section, and preserve in the section + all the substance and tone of each of the contributor + acknowledgements and/or dedications given therein. + + L. Preserve all the Invariant Sections of the Document, unaltered + in their text and in their titles. Section numbers or the + equivalent are not considered part of the section titles. + + M. Delete any section Entitled "Endorsements". Such a section + may not be included in the Modified Version. + + N. Do not retitle any existing section to be Entitled + "Endorsements" or to conflict in title with any Invariant + Section. + + O. Preserve any Warranty Disclaimers. + + If the Modified Version includes new front-matter sections or + appendices that qualify as Secondary Sections and contain no + material copied from the Document, you may at your option designate + some or all of these sections as invariant. To do this, add their + titles to the list of Invariant Sections in the Modified Version's + license notice. These titles must be distinct from any other + section titles. + + You may add a section Entitled "Endorsements", provided it contains + nothing but endorsements of your Modified Version by various + parties--for example, statements of peer review or that the text + has been approved by an organization as the authoritative + definition of a standard. + + You may add a passage of up to five words as a Front-Cover Text, + and a passage of up to 25 words as a Back-Cover Text, to the end of + the list of Cover Texts in the Modified Version. Only one passage + of Front-Cover Text and one of Back-Cover Text may be added by (or + through arrangements made by) any one entity. If the Document + already includes a cover text for the same cover, previously added + by you or by arrangement made by the same entity you are acting on + behalf of, you may not add another; but you may replace the old + one, on explicit permission from the previous publisher that added + the old one. + + The author(s) and publisher(s) of the Document do not by this + License give permission to use their names for publicity for or to + assert or imply endorsement of any Modified Version. + + 5. COMBINING DOCUMENTS + + You may combine the Document with other documents released under + this License, under the terms defined in section 4 above for + modified versions, provided that you include in the combination all + of the Invariant Sections of all of the original documents, + unmodified, and list them all as Invariant Sections of your + combined work in its license notice, and that you preserve all + their Warranty Disclaimers. 
+ + The combined work need only contain one copy of this License, and + multiple identical Invariant Sections may be replaced with a single + copy. If there are multiple Invariant Sections with the same name + but different contents, make the title of each such section unique + by adding at the end of it, in parentheses, the name of the + original author or publisher of that section if known, or else a + unique number. Make the same adjustment to the section titles in + the list of Invariant Sections in the license notice of the + combined work. + + In the combination, you must combine any sections Entitled + "History" in the various original documents, forming one section + Entitled "History"; likewise combine any sections Entitled + "Acknowledgements", and any sections Entitled "Dedications". You + must delete all sections Entitled "Endorsements." + + 6. COLLECTIONS OF DOCUMENTS + + You may make a collection consisting of the Document and other + documents released under this License, and replace the individual + copies of this License in the various documents with a single copy + that is included in the collection, provided that you follow the + rules of this License for verbatim copying of each of the documents + in all other respects. + + You may extract a single document from such a collection, and + distribute it individually under this License, provided you insert + a copy of this License into the extracted document, and follow this + License in all other respects regarding verbatim copying of that + document. + + 7. AGGREGATION WITH INDEPENDENT WORKS + + A compilation of the Document or its derivatives with other + separate and independent documents or works, in or on a volume of a + storage or distribution medium, is called an "aggregate" if the + copyright resulting from the compilation is not used to limit the + legal rights of the compilation's users beyond what the individual + works permit. When the Document is included in an aggregate, this + License does not apply to the other works in the aggregate which + are not themselves derivative works of the Document. + + If the Cover Text requirement of section 3 is applicable to these + copies of the Document, then if the Document is less than one half + of the entire aggregate, the Document's Cover Texts may be placed + on covers that bracket the Document within the aggregate, or the + electronic equivalent of covers if the Document is in electronic + form. Otherwise they must appear on printed covers that bracket + the whole aggregate. + + 8. TRANSLATION + + Translation is considered a kind of modification, so you may + distribute translations of the Document under the terms of section + 4. Replacing Invariant Sections with translations requires special + permission from their copyright holders, but you may include + translations of some or all Invariant Sections in addition to the + original versions of these Invariant Sections. You may include a + translation of this License, and all the license notices in the + Document, and any Warranty Disclaimers, provided that you also + include the original English version of this License and the + original versions of those notices and disclaimers. In case of a + disagreement between the translation and the original version of + this License or a notice or disclaimer, the original version will + prevail. 
+ + If a section in the Document is Entitled "Acknowledgements", + "Dedications", or "History", the requirement (section 4) to + Preserve its Title (section 1) will typically require changing the + actual title. + + 9. TERMINATION + + You may not copy, modify, sublicense, or distribute the Document + except as expressly provided under this License. Any attempt + otherwise to copy, modify, sublicense, or distribute it is void, + and will automatically terminate your rights under this License. + + However, if you cease all violation of this License, then your + license from a particular copyright holder is reinstated (a) + provisionally, unless and until the copyright holder explicitly and + finally terminates your license, and (b) permanently, if the + copyright holder fails to notify you of the violation by some + reasonable means prior to 60 days after the cessation. + + Moreover, your license from a particular copyright holder is + reinstated permanently if the copyright holder notifies you of the + violation by some reasonable means, this is the first time you have + received notice of violation of this License (for any work) from + that copyright holder, and you cure the violation prior to 30 days + after your receipt of the notice. + + Termination of your rights under this section does not terminate + the licenses of parties who have received copies or rights from you + under this License. If your rights have been terminated and not + permanently reinstated, receipt of a copy of some or all of the + same material does not give you any rights to use it. + + 10. FUTURE REVISIONS OF THIS LICENSE + + The Free Software Foundation may publish new, revised versions of + the GNU Free Documentation License from time to time. Such new + versions will be similar in spirit to the present version, but may + differ in detail to address new problems or concerns. See + <http://www.gnu.org/copyleft/>. + + Each version of the License is given a distinguishing version + number. If the Document specifies that a particular numbered + version of this License "or any later version" applies to it, you + have the option of following the terms and conditions either of + that specified version or of any later version that has been + published (not as a draft) by the Free Software Foundation. If the + Document does not specify a version number of this License, you may + choose any version ever published (not as a draft) by the Free + Software Foundation. If the Document specifies that a proxy can + decide which future versions of this License can be used, that + proxy's public statement of acceptance of a version permanently + authorizes you to choose that version for the Document. + + 11. RELICENSING + + "Massive Multiauthor Collaboration Site" (or "MMC Site") means any + World Wide Web server that publishes copyrightable works and also + provides prominent facilities for anybody to edit those works. A + public wiki that anybody can edit is an example of such a server. + A "Massive Multiauthor Collaboration" (or "MMC") contained in the + site means any set of copyrightable works thus published on the MMC + site. + + "CC-BY-SA" means the Creative Commons Attribution-Share Alike 3.0 + license published by Creative Commons Corporation, a not-for-profit + corporation with a principal place of business in San Francisco, + California, as well as future copyleft versions of that license + published by that same organization. 
+ + "Incorporate" means to publish or republish a Document, in whole or + in part, as part of another Document. + + An MMC is "eligible for relicensing" if it is licensed under this + License, and if all works that were first published under this + License somewhere other than this MMC, and subsequently + incorporated in whole or in part into the MMC, (1) had no cover + texts or invariant sections, and (2) were thus incorporated prior + to November 1, 2008. + + The operator of an MMC Site may republish an MMC contained in the + site under CC-BY-SA on the same site at any time before August 1, + 2009, provided the MMC is eligible for relicensing. + +ADDENDUM: How to use this License for your documents +==================================================== + +To use this License in a document you have written, include a copy of +the License in the document and put the following copyright and license +notices just after the title page: + + Copyright (C) YEAR YOUR NAME. + Permission is granted to copy, distribute and/or modify this document + under the terms of the GNU Free Documentation License, Version 1.3 + or any later version published by the Free Software Foundation; + with no Invariant Sections, no Front-Cover Texts, and no Back-Cover + Texts. A copy of the license is included in the section entitled ``GNU + Free Documentation License''. + + If you have Invariant Sections, Front-Cover Texts and Back-Cover +Texts, replace the "with...Texts." line with this: + + with the Invariant Sections being LIST THEIR TITLES, with + the Front-Cover Texts being LIST, and with the Back-Cover Texts + being LIST. + + If you have Invariant Sections without Cover Texts, or some other +combination of the three, merge those two alternatives to suit the +situation. + + If your document contains nontrivial examples of program code, we +recommend releasing these examples in parallel under your choice of free +software license, such as the GNU General Public License, to permit +their use in free software. + + +File: gawkinet.info, Node: Index, Prev: GNU Free Documentation License, Up: Top + +Index +***** + + +* Menu: + +* /inet/ files (gawk): Gawk Special Files. (line 34) +* /inet/tcp special files (gawk): File /inet/tcp. (line 6) +* /inet/udp special files (gawk): File /inet/udp. (line 6) +* | (vertical bar), |& operator (I/O): TCP Connecting. (line 25) +* advanced features, network connections: Troubleshooting. (line 6) +* agent: Challenges. (line 75) +* agent <1>: MOBAGWHO. (line 6) +* AI: Challenges. (line 75) +* apache: WEBGRAB. (line 72) +* apache <1>: MOBAGWHO. (line 42) +* Bioinformatics: PROTBASE. (line 227) +* BLAST, Basic Local Alignment Search Tool: PROTBASE. (line 6) +* blocking: Making Connections. (line 35) +* Boutell, Thomas: STATIST. (line 6) +* CGI (Common Gateway Interface): MOBAGWHO. (line 42) +* CGI (Common Gateway Interface), dynamic web pages and: Web page. + (line 45) +* CGI (Common Gateway Interface), library: CGI Lib. (line 11) +* clients: Making Connections. (line 21) +* Clinton, Bill: Challenges. (line 58) +* Common Gateway Interface, See CGI: Web page. (line 45) +* Computational Biology: PROTBASE. (line 227) +* contest: Challenges. (line 6) +* cron utility: STOXPRED. (line 23) +* CSV format: STOXPRED. (line 128) +* Dow Jones Industrial Index: STOXPRED. (line 44) +* ELIZA program: Simple Server. (line 11) +* ELIZA program <1>: Simple Server. (line 178) +* email: Email. (line 11) +* FASTA/Pearson format: PROTBASE. 
(line 102) +* FDL (Free Documentation License): GNU Free Documentation License. + (line 6) +* filenames, for network access: Gawk Special Files. (line 29) +* files, /inet/ (gawk): Gawk Special Files. (line 34) +* files, /inet/tcp (gawk): File /inet/tcp. (line 6) +* files, /inet/udp (gawk): File /inet/udp. (line 6) +* finger utility: Setting Up. (line 22) +* Free Documentation License (FDL): GNU Free Documentation License. + (line 6) +* FTP (File Transfer Protocol): Basic Protocols. (line 45) +* gawk, networking: Using Networking. (line 6) +* gawk, networking, connections: Special File Fields. (line 53) +* gawk, networking, connections <1>: TCP Connecting. (line 6) +* gawk, networking, filenames: Gawk Special Files. (line 29) +* gawk, networking, See Also email: Email. (line 6) +* gawk, networking, service, establishing: Setting Up. (line 6) +* gawk, networking, troubleshooting: Caveats. (line 6) +* gawk, web and, See web service: Interacting Service. (line 6) +* getline command: TCP Connecting. (line 11) +* GETURL program: GETURL. (line 6) +* GIF image format: Web page. (line 45) +* GIF image format <1>: STATIST. (line 6) +* GNU Free Documentation License: GNU Free Documentation License. + (line 6) +* GNU/Linux: Troubleshooting. (line 54) +* GNU/Linux <1>: Interacting. (line 27) +* GNU/Linux <2>: REMCONF. (line 6) +* GNUPlot utility: Interacting Service. (line 189) +* GNUPlot utility <1>: STATIST. (line 6) +* Hoare, C.A.R.: MOBAGWHO. (line 6) +* Hoare, C.A.R. <1>: PROTBASE. (line 6) +* hostname field: Special File Fields. (line 34) +* HTML (Hypertext Markup Language): Web page. (line 29) +* HTTP (Hypertext Transfer Protocol): Basic Protocols. (line 45) +* HTTP (Hypertext Transfer Protocol) <1>: Web page. (line 6) +* HTTP (Hypertext Transfer Protocol), record separators and: Web page. + (line 29) +* HTTP server, core logic: Interacting Service. (line 6) +* HTTP server, core logic <1>: Interacting Service. (line 24) +* Humphrys, Mark: Simple Server. (line 178) +* Hypertext Markup Language (HTML): Web page. (line 29) +* Hypertext Transfer Protocol, See HTTP: Web page. (line 6) +* image format: STATIST. (line 6) +* images, in web pages: Interacting Service. (line 189) +* images, retrieving over networks: Web page. (line 45) +* input/output, two-way, See Also gawk, networking: Gawk Special Files. + (line 19) +* Internet, See networks: Interacting. (line 48) +* JavaScript: STATIST. (line 56) +* Linux: Troubleshooting. (line 54) +* Linux <1>: Interacting. (line 27) +* Linux <2>: REMCONF. (line 6) +* Lisp: MOBAGWHO. (line 98) +* localport field: Gawk Special Files. (line 34) +* Loebner, Hugh: Challenges. (line 6) +* Loui, Ronald: Challenges. (line 75) +* MAZE: MAZE. (line 6) +* Microsoft Windows: WEBGRAB. (line 43) +* Microsoft Windows, networking: Troubleshooting. (line 54) +* Microsoft Windows, networking, ports: Setting Up. (line 37) +* MiniSQL: REMCONF. (line 109) +* MOBAGWHO program: MOBAGWHO. (line 6) +* NCBI, National Center for Biotechnology Information: PROTBASE. + (line 6) +* network type field: Special File Fields. (line 11) +* networks, gawk and: Using Networking. (line 6) +* networks, gawk and, connections: Special File Fields. (line 53) +* networks, gawk and, connections <1>: TCP Connecting. (line 6) +* networks, gawk and, filenames: Gawk Special Files. (line 29) +* networks, gawk and, See Also email: Email. (line 6) +* networks, gawk and, service, establishing: Setting Up. (line 6) +* networks, gawk and, troubleshooting: Caveats. (line 6) +* networks, ports, reserved: Setting Up. 
(line 37) +* networks, ports, specifying: Special File Fields. (line 24) +* networks, See Also web pages: PANIC. (line 6) +* Numerical Recipes: STATIST. (line 24) +* ORS variable, HTTP and: Web page. (line 29) +* ORS variable, POP and: Email. (line 36) +* PANIC program: PANIC. (line 6) +* Perl: Using Networking. (line 14) +* Perl, gawk networking and: Using Networking. (line 24) +* Perlis, Alan: MAZE. (line 6) +* pipes, networking and: TCP Connecting. (line 30) +* PNG image format: Web page. (line 45) +* PNG image format <1>: STATIST. (line 6) +* POP (Post Office Protocol): Email. (line 6) +* POP (Post Office Protocol) <1>: Email. (line 36) +* Post Office Protocol (POP): Email. (line 6) +* PostScript: STATIST. (line 138) +* PROLOG: Challenges. (line 75) +* PROTBASE: PROTBASE. (line 6) +* protocol field: Special File Fields. (line 17) +* PS image format: STATIST. (line 6) +* Python: Using Networking. (line 14) +* Python, gawk networking and: Using Networking. (line 24) +* record separators, HTTP and: Web page. (line 29) +* record separators, POP and: Email. (line 36) +* REMCONF program: REMCONF. (line 6) +* remoteport field: Gawk Special Files. (line 34) +* RFC 1939: Email. (line 6) +* RFC 1939 <1>: Email. (line 36) +* RFC 1945: Web page. (line 29) +* RFC 2068: Web page. (line 6) +* RFC 2068 <1>: Interacting Service. (line 104) +* RFC 2616: Web page. (line 6) +* RFC 821: Email. (line 6) +* robot: Challenges. (line 84) +* robot <1>: WEBGRAB. (line 6) +* RS variable, HTTP and: Web page. (line 29) +* RS variable, POP and: Email. (line 36) +* servers: Making Connections. (line 14) +* servers <1>: Setting Up. (line 22) +* servers, as hosts: Special File Fields. (line 34) +* servers, HTTP: Interacting Service. (line 6) +* servers, web: Simple Server. (line 6) +* Simple Mail Transfer Protocol (SMTP): Email. (line 6) +* SMTP (Simple Mail Transfer Protocol): Basic Protocols. (line 45) +* SMTP (Simple Mail Transfer Protocol) <1>: Email. (line 6) +* STATIST program: STATIST. (line 6) +* STOXPRED program: STOXPRED. (line 6) +* synchronous communications: Making Connections. (line 35) +* Tcl/Tk: Using Networking. (line 14) +* Tcl/Tk, gawk and: Using Networking. (line 24) +* Tcl/Tk, gawk and <1>: Some Applications and Techniques. + (line 22) +* TCP (Transmission Control Protocol): Using Networking. (line 29) +* TCP (Transmission Control Protocol) <1>: File /inet/tcp. (line 6) +* TCP (Transmission Control Protocol), connection, establishing: TCP Connecting. + (line 6) +* TCP (Transmission Control Protocol), UDP and: Interacting. (line 48) +* TCP/IP, network type, selecting: Special File Fields. (line 11) +* TCP/IP, protocols, selecting: Special File Fields. (line 17) +* TCP/IP, sockets and: Gawk Special Files. (line 19) +* Transmission Control Protocol, See TCP: Using Networking. (line 29) +* troubleshooting, gawk, networks: Caveats. (line 6) +* troubleshooting, networks, connections: Troubleshooting. (line 6) +* troubleshooting, networks, timeouts: Caveats. (line 18) +* UDP (User Datagram Protocol): File /inet/udp. (line 6) +* UDP (User Datagram Protocol), TCP and: Interacting. (line 48) +* Unix, network ports and: Setting Up. (line 37) +* URLCHK program: URLCHK. (line 6) +* User Datagram Protocol, See UDP: File /inet/udp. (line 6) +* vertical bar (|), |& operator (I/O): TCP Connecting. (line 25) +* VRML: MAZE. (line 6) +* web browsers, See web service: Interacting Service. (line 6) +* web pages: Web page. (line 6) +* web pages, images in: Interacting Service. (line 189) +* web pages, retrieving: GETURL. 
(line 6)
+* web servers: Simple Server. (line 6)
+* web service: Primitive Service. (line 6)
+* web service <1>: PANIC. (line 6)
+* WEBGRAB program: WEBGRAB. (line 6)
+* Weizenbaum, Joseph: Simple Server. (line 11)
+* XBM image format: Interacting Service. (line 189)
+* Yahoo!: REMCONF. (line 6)
+* Yahoo! <1>: STOXPRED. (line 6)
+
+
+
+Tag Table:
+Node: Top2022
+Node: Preface5665
+Node: Introduction7040
+Node: Stream Communications8066
+Node: Datagram Communications9240
+Node: The TCP/IP Protocols10870
+Ref: The TCP/IP Protocols-Footnote-111554
+Node: Basic Protocols11711
+Ref: Basic Protocols-Footnote-113756
+Node: Ports13785
+Node: Making Connections15192
+Ref: Making Connections-Footnote-117750
+Ref: Making Connections-Footnote-217797
+Node: Using Networking17978
+Node: Gawk Special Files20301
+Node: Special File Fields22110
+Ref: table-inet-components26003
+Node: Comparing Protocols27312
+Node: File /inet/tcp27846
+Node: File /inet/udp28874
+Ref: File /inet/udp-Footnote-130573
+Node: TCP Connecting30827
+Node: Troubleshooting33173
+Ref: Troubleshooting-Footnote-136232
+Node: Interacting36805
+Node: Setting Up39545
+Node: Email43048
+Node: Web page45380
+Ref: Web page-Footnote-148197
+Node: Primitive Service48395
+Node: Interacting Service51136
+Ref: Interacting Service-Footnote-160303
+Node: CGI Lib60335
+Node: Simple Server67310
+Ref: Simple Server-Footnote-175053
+Node: Caveats75154
+Node: Challenges76299
+Node: Some Applications and Techniques84997
+Node: PANIC87462
+Node: GETURL89186
+Node: REMCONF91819
+Node: URLCHK97314
+Node: WEBGRAB101166
+Node: STATIST105628
+Ref: STATIST-Footnote-1117377
+Node: MAZE117822
+Node: MOBAGWHO124029
+Ref: MOBAGWHO-Footnote-1138047
+Node: STOXPRED138102
+Node: PROTBASE152390
+Node: Links165506
+Node: GNU Free Documentation License168939
+Node: Index194059
+
+End Tag Table