Info file gawk-info, produced by Makeinfo, -*- Text -*- from input
file gawk.texinfo.
This file documents `awk', a program that you can use to select
particular records in a file and perform operations upon them.
Copyright (C) 1989 Free Software Foundation, Inc.
Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.
Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.
Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions, except that this permission notice may be stated in a
translation approved by the Foundation.
File: gawk-info, Node: For, Next: Break, Prev: Do, Up: Statements
The `for' Statement
===================
The `for' statement makes it more convenient to count iterations of a
loop. The general form of the `for' statement looks like this:
for (INITIALIZATION; CONDITION; INCREMENT)
BODY
This statement starts by executing INITIALIZATION. Then, as long as
CONDITION is true, it repeatedly executes BODY and then INCREMENT.
Typically INITIALIZATION sets a variable to either zero or one,
INCREMENT adds 1 to it, and CONDITION compares it against the desired
number of iterations.
Here is an example of a `for' statement:
awk '{ for (i = 1; i <= 3; i++)
           print $i
}'
This prints the first three fields of each input record, one field
per line.
In the `for' statement, BODY stands for any statement, but
INITIALIZATION, CONDITION and INCREMENT are just expressions. You
cannot set more than one variable in the INITIALIZATION part unless
you use a multiple assignment statement such as `x = y = 0', which is
possible only if all the initial values are equal. (But you can
initialize additional variables by writing their assignments as
separate statements preceding the `for' loop.)
The same is true of the INCREMENT part; to increment additional
variables, you must write separate statements at the end of the loop.
The C compound expression, using C's comma operator, would be useful
in this context, but it is not supported in `awk'.
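The workaround described above can be sketched as follows. This is
only an illustration; the counters `i' and `j' are hypothetical names,
not taken from any program in this manual. The second variable is
initialized before the loop and updated at the end of the body:

```shell
# Hypothetical counters: i counts up, j counts down.
# j is set before the loop and updated at the end of the body,
# because the for header accepts only one expression per part.
out=$(awk 'BEGIN {
    j = 3
    for (i = 1; i <= 3; i++) {
        print i, j
        j--
    }
}')
printf '%s\n' "$out"
```

Each line of output shows the two counters moving in opposite
directions.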
Most often, INCREMENT is an increment expression, as in the example
above. But this is not required; it can be any expression whatever.
For example, this statement prints odd numbers from 1 to 100:
# print odd numbers from 1 to 100
for (i = 1; i <= 100; i += 2)
print i
Any of the three expressions following `for' may be omitted if you
don't want it to do anything. Thus, `for (;x > 0;)' is equivalent to
`while (x > 0)'. If the CONDITION part is empty, it is treated as
TRUE, effectively yielding an infinite loop.
In most cases, a `for' loop is an abbreviation for a `while' loop, as
shown here:
INITIALIZATION
while (CONDITION) {
    BODY
    INCREMENT
}
(The only exception is when the `continue' statement (*note
Continue::.) is used inside the loop; changing a `for' statement to a
`while' statement in this way can change the effect of the `continue'
statement inside the loop.)
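A minimal sketch of that exception, assuming a loop that should skip
the value 3: in the `for' version `continue' still runs the
increment, while in the `while' version the increment has to be
repeated by hand before `continue'. The variable names are
illustrative only.

```shell
# for version: continue still runs the increment x++.
for_out=$(awk 'BEGIN {
    for (x = 0; x <= 5; x++) {
        if (x == 3)
            continue
        printf "%d ", x
    }
    print ""
}')

# while version: the increment is repeated before continue;
# without it x would stay at 3 and the loop would never end.
while_out=$(awk 'BEGIN {
    x = 0
    while (x <= 5) {
        if (x == 3) {
            x++
            continue
        }
        printf "%d ", x
        x++
    }
    print ""
}')
printf '%s\n%s\n' "$for_out" "$while_out"
```

Both loops print the same numbers only because the `while' version
increments `x' explicitly before continuing.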
The `awk' language has a `for' statement in addition to a `while'
statement because often a `for' loop is both less work to type and
more natural to think of. Counting the number of iterations is very
common in loops. It can be easier to think of this counting as part
of looping rather than as something to do inside the loop.
The next section has more complicated examples of `for' loops.
There is an alternate version of the `for' loop, for iterating over
all the indices of an array:
for (i in array)
PROCESS array[i]
*Note Arrays::, for more information on this version of the `for' loop.
File: gawk-info, Node: Break, Next: Continue, Prev: For, Up: Statements
The `break' Statement
=====================
The `break' statement jumps out of the innermost `for', `while', or
`do'--`while' loop that encloses it. The following example finds the
smallest divisor of any number, and also identifies prime numbers:
awk '# find smallest divisor of num
{ num = $1
  for (div = 2; div*div <= num; div++)
      if (num % div == 0)
          break
  if (num % div == 0)
      printf "Smallest divisor of %d is %d\n", num, div
  else
      printf "%d is prime\n", num }'
When the remainder is zero in the first `if' statement, `awk'
immediately "breaks" out of the containing `for' loop. This means
that `awk' proceeds immediately to the statement following the loop
and continues processing. (This is very different from the `exit'
statement (*note Exit::.) which stops the entire `awk' program.)
Here is another program equivalent to the previous one. It
illustrates how the CONDITION of a `for' or `while' could just as
well be replaced with a `break' inside an `if':
awk '# find smallest divisor of num
{ num = $1
  for (div = 2; ; div++) {
      if (num % div == 0) {
          printf "Smallest divisor of %d is %d\n", num, div
          break
      }
      if (div*div > num) {
          printf "%d is prime\n", num
          break
      }
  }
}'
File: gawk-info, Node: Continue, Next: Next, Prev: Break, Up: Statements
The `continue' Statement
========================
The `continue' statement, like `break', is used only inside `for',
`while', and `do'--`while' loops. It skips over the rest of the loop
body, causing the next cycle around the loop to begin immediately.
Contrast this with `break', which jumps out of the loop altogether.
Here is an example:
# print names that don't contain the string "ignore"
# first, save the text of each line
{ names[NR] = $0 }
# print what we're interested in
END {
    for (x in names) {
        if (names[x] ~ /ignore/)
            continue
        print names[x]
    }
}
If an input record contains the string `ignore', this example skips
the `print' statement for that record and continues with the next
iteration of the loop.
This isn't a practical example of `continue', since it would be just
as easy to write the loop like this:
for (x in names)
    if (names[x] !~ /ignore/)
        print names[x]
The `continue' statement causes `awk' to skip the rest of what is
inside a `for' loop, but it resumes execution with the increment part
of the `for' loop. The following program illustrates this fact:
awk 'BEGIN {
    for (x = 0; x <= 20; x++) {
        if (x == 5)
            continue
        printf ("%d ", x)
    }
    print ""
}'
This program prints all the numbers from 0 to 20, except for 5, for
which the `printf' is skipped. Since the increment `x++' is not
skipped, `x' does not remain stuck at 5.
File: gawk-info, Node: Next, Next: Exit, Prev: Continue, Up: Statements
The `next' Statement
====================
The `next' statement forces `awk' to immediately stop processing the
current record and go on to the next record. This means that no
further rules are executed for the current record. The rest of the
current rule's action is not executed either.
Contrast this with the effect of the `getline' function (*note
Getline::.). That too causes `awk' to read the next record
immediately, but it does not alter the flow of control in any way.
So the rest of the current action executes with a new input record.
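The difference can be seen in a small sketch. The input and the
pattern `/^#/' are assumptions made up for this illustration. With
`next', a matching record is dropped and no later rule sees it; with
`getline', the following record replaces the current one, the second
rule still runs, and the new record is not tested against the first
rule again.

```shell
input='#c1
A
#c2
#c3
B'

# next: abandon the current record; no later rule sees it.
next_out=$(printf '%s\n' "$input" | awk '/^#/ { next } { print }')

# getline: read the following record into the current one, then
# carry on; the second rule runs on whatever was just read.
getline_out=$(printf '%s\n' "$input" | awk '/^#/ { getline } { print }')

printf 'next:\n%s\ngetline:\n%s\n' "$next_out" "$getline_out"
```

In the `getline' run the record `#c3' is printed, because it was read
by `getline' and so never passed through the first rule's pattern.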
At the highest level, `awk' program execution is a loop that reads
an input record and then tests each rule pattern against it. If you
think of this loop as a `for' statement whose body contains the
rules, then the `next' statement is analogous to a `continue'
statement: it skips to the end of the body of the loop, and executes
the increment (which reads another record).
For example, if your `awk' program works only on records with four
fields, and you don't want it to fail when given bad input, you might
use the following rule near the beginning of the program:
NF != 4 {
    printf ("line %d skipped: doesn't have 4 fields\n", FNR) > "/dev/tty"
    next
}
so that the following rules will not see the bad record. The error
message is redirected to `/dev/tty' (the terminal), so that it won't
get lost amid the rest of the program's regular output.
File: gawk-info, Node: Exit, Prev: Next, Up: Statements
The `exit' Statement
====================
The `exit' statement causes `awk' to immediately stop executing the
current rule and to stop processing input; any remaining input is
ignored.
If an `exit' statement is executed from a `BEGIN' rule the program
stops processing everything immediately. No input records will be
read. However, if an `END' rule is present, it will be executed
(*note BEGIN/END::.).
If `exit' is used as part of an `END' rule, it causes the program to
stop immediately.
An `exit' statement that is part of an ordinary rule (that is, not part
of a `BEGIN' or `END' rule) stops the execution of any further
automatic rules, but the `END' rule is executed if there is one. If
you don't want the `END' rule to do its job in this case, you can set
a variable to nonzero before the `exit' statement, and check that
variable in the `END' rule.
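A minimal sketch of this technique follows. The flag name `failed',
the test `$1 == "bad"' and the status value 2 are all assumptions
chosen for the illustration.

```shell
status=0
out=$(printf 'ok\nbad\nok\n' | awk '
$1 == "bad" {
    failed = 1      # hypothetical flag checked by the END rule
    exit 2
}
{ count++ }
END {
    if (failed)
        exit 2      # skip the summary, keep the nonzero status
    print count, "records"
}') || status=$?
printf 'output=[%s] status=%d\n' "$out" "$status"
```

Because the `END' rule checks the flag first, the summary line is not
printed when the program stops early, and the shell still sees the
nonzero exit status.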
If an argument is supplied to `exit', its value is used as the exit
status code for the `awk' process. If no argument is supplied,
`exit' returns status zero (success).
For example, let's say you've discovered an error condition you
really don't know how to handle. Conventionally, programs report
this by exiting with a nonzero status. Your `awk' program can do
this using an `exit' statement with a nonzero argument. Here's an
example of this:
BEGIN {
    if (("date" | getline date_now) <= 0) {
        print "Can't get system date"
        exit 4
    }
}
File: gawk-info, Node: Arrays, Next: Built-in, Prev: Statements, Up: Top
Actions: Using Arrays in `awk'
******************************
An "array" is a table of various values, called "elements". The
elements of an array are distinguished by their "indices". Names of
arrays in `awk' are strings of alphanumeric characters and
underscores, just like regular variables.
You cannot use the same identifier as both a variable and as an array
name in one `awk' program.
* Menu:
* Intro: Array Intro. Basic facts about arrays in `awk'.
* Reference to Elements:: How to examine one element of an array.
* Assigning Elements:: How to change an element of an array.
* Example: Array Example. Sample program explained.
* Scanning an Array:: A variation of the `for' statement. It loops
through the indices of an array's existing elements.
* Delete:: The `delete' statement removes an element from an array.
* Multi-dimensional:: Emulating multi--dimensional arrays in `awk'.
* Multi-scanning:: Scanning multi--dimensional arrays.
File: gawk-info, Node: Array Intro, Next: Reference to Elements, Up: Arrays
Introduction to Arrays
======================
The `awk' language has one--dimensional "arrays" for storing groups
of related strings or numbers. Each array must have a name; valid
array names are the same as valid variable names, and they do
conflict with variable names: you can't have both an array and a
variable with the same name at any point in an `awk' program.
Arrays in `awk' superficially resemble arrays in other programming
languages; but there are fundamental differences. In `awk', you
don't need to declare the size of an array before you start to use it.
What's more, in `awk' any number or even a string may be used as an
array index.
In most other languages, you have to "declare" an array and specify
how many elements or components it has. In such languages, the
declaration causes a contiguous block of memory to be allocated for
that many elements. An index in the array must be a nonnegative
integer; for example, the index 0 specifies the first element in the
array, which is actually stored at the beginning of the block of
memory. Index 1 specifies the second element, which is stored in
memory right after the first element, and so on. It is impossible to
add more elements to the array, because it has room for only as many
elements as you declared. (Some languages have arrays whose first
index is 1, others require that you specify both the first and last
index when you declare the array. In such a language, an array could
be indexed, for example, from -3 to 17.) A contiguous array of four
elements might look like this, conceptually, if the element values
are 8, `"foo"', `""' and 30:
+--------+--------+-------+--------+
|    8   |  "foo" |   ""  |   30   |    value
+--------+--------+-------+--------+
     0        1       2        3        index
Only the values are stored; the indices are implicit from the order
of the values. 8 is the value at index 0, because 8 appears in the
position with 0 elements before it.
Arrays in `awk' are different: they are "associative". This means
that each array is a collection of pairs: an index, and its
corresponding array element value:
Element 4 Value 30
Element 2 Value "foo"
Element 1 Value 8
Element 3 Value ""
We have shown the pairs in jumbled order because their order doesn't
mean anything.
One advantage of an associative array is that new pairs can be added
at any time. For example, suppose we add to that array a tenth
element whose value is `"number ten"'. The result is this:
Element 10 Value "number ten"
Element 4 Value 30
Element 2 Value "foo"
Element 1 Value 8
Element 3 Value ""
Now the array is "sparse" (i.e. some indices are missing): it has
elements number 4 and 10, but doesn't have an element 5, 6, 7, 8, or 9.
Another consequence of associative arrays is that the indices don't
have to be positive integers. Any number, or even a string, can be
an index. For example, here is an array which translates words from
English into French:
Element "dog" Value "chien"
Element "cat" Value "chat"
Element "one" Value "un"
Element 1 Value "un"
Here we decided to translate the number 1 in both spelled--out and
numeral form--thus illustrating that a single array can have both
numbers and strings as indices.
When `awk' creates an array for you, e.g. with the `split' built--in
function (*note String Functions::.), that array's indices start at
the number one.
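The points above can be tried out with a short sketch. The array
names `fr' and `parts' are hypothetical. It stores both string and
numeric indices in one array, then checks that `split' numbers its
elements from one, so that no element 0 exists.

```shell
out=$(awk 'BEGIN {
    fr["dog"] = "chien"
    fr["one"] = "un"
    fr[1] = "un"
    print fr["dog"], fr["one"], fr[1]

    n = split("a b c", parts, " ")
    print n, parts[1], parts[3], (0 in parts)
}')
printf '%s\n' "$out"
```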
File: gawk-info, Node: Reference to Elements, Next: Assigning Elements, Prev: Array Intro, Up: Arrays
Referring to an Array Element
=============================
The principal way of using an array is to refer to one of its elements.
An array reference is an expression which looks like this:
ARRAY[INDEX]
Here ARRAY is the name of an array. The expression INDEX is the
index of the element of the array that you want. The value of the
array reference is the current value of that array element.
For example, `foo[4.3]' is an expression for the element of array
`foo' at index 4.3.
If you refer to an array element that has no recorded value, the
value of the reference is `""', the null string. This includes
elements to which you have not assigned any value, and elements that
have been deleted (*note Delete::.). Such a reference automatically
creates that array element, with the null string as its value. (In
some cases, this is unfortunate, because it might waste memory inside
`awk'.)
You can find out if an element exists in an array at a certain index
with the expression:
INDEX in ARRAY
This expression tests whether or not the particular index exists,
without the side effect of creating that element if it is not present.
The expression has the value 1 (true) if `ARRAY[INDEX]' exists,
and 0 (false) if it does not exist.
For example, to find out whether the array `frequencies' contains the
subscript `"2"', you would ask:
if ("2" in frequencies) print "Subscript \"2\" is present."
Note that this is *not* a test of whether or not the array
`frequencies' contains an element whose *value* is `"2"'. (There is
no way to do that except to scan all the elements.) Also, this *does
not* create `frequencies["2"]', while the following (incorrect)
alternative would:
if (frequencies["2"] != "") print "Subscript \"2\" is present."
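The side effect can be observed by counting the elements after each
kind of test. This is a sketch only; the array name `freq' and the
counting loop are assumptions added for the illustration.

```shell
out=$(awk 'BEGIN {
    # The in test does not create the element.
    if ("2" in freq) print "unexpected"
    n = 0; for (k in freq) n++
    print "after in test:", n

    # A plain reference creates freq["2"] with a null value.
    if (freq["2"] != "") print "unexpected"
    n = 0; for (k in freq) n++
    print "after reference:", n
}')
printf '%s\n' "$out"
```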
File: gawk-info, Node: Assigning Elements, Next: Array Example, Prev: Reference to Elements, Up: Arrays
Assigning Array Elements
========================
Array elements are lvalues: they can be assigned values just like
`awk' variables:
ARRAY[SUBSCRIPT] = VALUE
Here ARRAY is the name of your array. The expression SUBSCRIPT is
the index of the element of the array that you want to assign a
value. The expression VALUE is the value you are assigning to that
element of the array.
File: gawk-info, Node: Array Example, Next: Scanning an Array, Prev: Assigning Elements, Up: Arrays
Basic Example of an Array
=========================
The following program takes a list of lines, each beginning with a
line number, and prints them out in order of line number. The line
numbers are not in order, however, when they are first read: they
are scrambled. This program sorts the lines by making an array using
the line numbers as subscripts. It then prints out the lines in
sorted order of their numbers. It is a very simple program, and will
get confused if it encounters repeated numbers, gaps, or lines that
don't begin with a number.
BEGIN {
    max=0
}
{
    if ($1 > max)
        max = $1
    arr[$1] = $0
}
END {
    for (x = 1; x <= max; x++)
        print arr[x]
}
The first rule just initializes the variable `max'. (This is not
strictly necessary, since an uninitialized variable has the null
string as its value, and the null string is effectively zero when
used in a context where a number is required.)
The second rule keeps track of the largest line number seen so far;
it also stores each line into the array `arr', at an index that is
the line's number.
The third rule runs after all the input has been read, to print out
all the lines.
When this program is run with the following input:
5 I am the Five man
2 Who are you? The new number two!
4 . . . And four on the floor
1 Who is number one?
3 I three you.
its output is this:
1 Who is number one?
2 Who are you? The new number two!
3 I three you.
4 . . . And four on the floor
5 I am the Five man
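One way to make the program tolerate gaps in the numbering is to
test each index with the `in' operator before printing. The sketch
below is a variation on the program above, not a replacement for it;
the sample input is made up, and repeated numbers or lines without a
leading number are still not handled.

```shell
out=$(printf '5 five\n2 two\n9 nine\n' | awk '
{
    if ($1 > max)
        max = $1
    arr[$1] = $0
}
END {
    for (x = 1; x <= max; x++)
        if (x in arr)       # skip missing line numbers
            print arr[x]
}')
printf '%s\n' "$out"
```

Without the `in' test, each missing number would print an empty line
and would also create an empty element in `arr'.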
File: gawk-info, Node: Scanning an Array, Next: Delete, Prev: Array Example, Up: Arrays
Scanning All Elements of an Array
=================================
In programs that use arrays, often you need a loop that will execute
once for each element of an array. In other languages, where arrays
are contiguous and indices are limited to nonnegative integers, this is
easy: the largest index is one less than the length of the array, and
you can find all the valid indices by counting from zero up to that
value. This technique won't do the job in `awk', since any number or
string may be an array index. So `awk' has a special kind of `for'
statement for scanning an array:
for (VAR in ARRAY)
BODY
This loop executes BODY once for each different value that your
program has previously used as an index in ARRAY, with the variable
VAR set to that index.
Here is a program that uses this form of the `for' statement. The
first rule scans the input records and notes which words appear (at
least once) in the input, by storing a 1 into the array `used' with
the word as index. The second rule scans the elements of `used' to
find all the distinct words that appear in the input. It prints each
word that is more than 10 characters long, and also prints the number
of such words. *Note Built-in::, for more information on the
built--in function `length'.
# Record a 1 for each word that is used at least once.
{
    for (i = 1; i <= NF; i++)
        used[$i] = 1
}
# Find number of distinct words more than 10 characters long.
END {
    num_long_words = 0
    for (x in used)
        if (length(x) > 10) {
            ++num_long_words
            print x
        }
    print num_long_words, "words longer than 10 characters"
}
*Note Sample Program::, for a more detailed example of this type.
The order in which elements of the array are accessed by this
statement is determined by the internal arrangement of the array
elements within `awk' and cannot be controlled or changed. This can
lead to problems if new elements are added to ARRAY by statements in
BODY; you cannot predict whether or not the `for' loop will reach
them. Similarly, changing VAR inside the loop can produce strange
results. It is best to avoid such things.
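Since the scanning order cannot be controlled from inside `awk', one
common approach when ordered output matters is to hand the output to
an external sorting program. The sketch below assumes the standard
`sort' utility is available; the array name `count' and the sample
input are hypothetical.

```shell
out=$(printf 'b a b\nc a b\n' | awk '
{ for (i = 1; i <= NF; i++) count[$i]++ }
END { for (w in count) print w, count[w] }' | sort)
printf '%s\n' "$out"
```

The `for' loop may visit the words in any order; the final order
comes entirely from `sort'.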
File: gawk-info, Node: Delete, Next: Multi-dimensional, Prev: Scanning an Array, Up: Arrays
The `delete' Statement
======================
You can remove an individual element of an array using the `delete'
statement:
delete ARRAY[INDEX]
When an array element is deleted, it is as if you had never referred
to it and had never given it any value. Any value the element
formerly had can no longer be obtained.
Here is an example of deleting elements in an array:
awk '{ for (i in frequencies)
           delete frequencies[i]
}'
This example removes all the elements from the array `frequencies'.
If you delete an element, the `for' statement to scan the array will
not report that element, and the `in' operator to check for the
presence of that element will return 0:
delete foo[4]
if (4 in foo)
    print "This will never be printed"
File: gawk-info, Node: Multi-dimensional, Next: Multi-scanning, Prev: Delete, Up: Arrays
Multi--dimensional arrays
=========================
A multi--dimensional array is an array in which an element is
identified by a sequence of indices, not a single index. For
example, a two--dimensional array requires two indices. The usual
way (in most languages, including `awk') to refer to an element of a
two--dimensional array named `grid' is with `grid[x,y]'.
Multi--dimensional arrays are supported in `awk' through
concatenation of indices into one string. What happens is that `awk'
converts the indices into strings (*note Conversion::.) and
concatenates them together, with a separator between them. This
creates a single string that describes the values of the separate
indices. The combined string is used as a single index into an
ordinary, one--dimensional array. The separator used is the value of
the special variable `SUBSEP'.
For example, suppose the value of `SUBSEP' is `","' and the
expression `foo[5,12]="value"' is executed. The numbers 5 and 12
will be concatenated with a comma between them, yielding `"5,12"';
thus, the array element `foo["5,12"]' will be set to `"value"'.
Once the element's value is stored, `awk' has no record of whether it
was stored with a single index or a sequence of indices. The two
expressions `foo[5,12]' and `foo[5 SUBSEP 12]' always have the same
value.
The default value of `SUBSEP' is not a comma; it is the string
`"\034"', which contains a nonprinting character that is unlikely to
appear in an `awk' program or in the input data.
The usefulness of choosing an unlikely character comes from the fact
that index values that contain a string matching `SUBSEP' lead to
combined strings that are ambiguous. Suppose that `SUBSEP' is a
comma; then `foo["a,b", "c"]' and `foo["a", "b,c"]' will be
indistinguishable because both are actually stored as `foo["a,b,c"]'.
Because `SUBSEP' is `"\034"', such confusion can actually happen only
when an index contains the character `"\034"', which is a rare event.
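The ambiguity is easy to reproduce in a sketch that deliberately
sets `SUBSEP' to a comma. The counting loop is an addition made for
this illustration.

```shell
out=$(awk 'BEGIN {
    SUBSEP = ","
    foo["a,b", "c"] = "first"
    foo["a", "b,c"] = "second"     # same combined index "a,b,c"
    n = 0; for (k in foo) n++
    print n, foo["a,b,c"]
}')
printf '%s\n' "$out"
```

Only one element exists afterwards, and the second assignment has
overwritten the first.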
You can test whether a particular index--sequence exists in a
``multi--dimensional'' array with the same operator `in' used for
single dimensional arrays. Instead of a single index as the
left--hand operand, write the whole sequence of indices, separated by
commas, in parentheses:
(SUBSCRIPT1, SUBSCRIPT2, ...) in ARRAY
The following example treats its input as a two--dimensional array of
fields; it rotates this array 90 degrees clockwise and prints the
result. It assumes that all lines have the same number of elements.
awk 'BEGIN {
    max_nf = max_nr = 0
}
{
    if (max_nf < NF)
        max_nf = NF
    max_nr = NR
    for (x = 1; x <= NF; x++)
        vector[x, NR] = $x
}
END {
    for (x = 1; x <= max_nf; x++) {
        for (y = max_nr; y >= 1; --y)
            printf("%s ", vector[x, y])
        printf("\n")
    }
}'
When given the input:
1 2 3 4 5 6
2 3 4 5 6 1
3 4 5 6 1 2
4 5 6 1 2 3
it produces:
4 3 2 1
5 4 3 2
6 5 4 3
1 6 5 4
2 1 6 5
3 2 1 6
File: gawk-info, Node: Multi-scanning, Prev: Multi-dimensional, Up: Arrays
Scanning Multi--dimensional Arrays
==================================
There is no special `for' statement for scanning a
``multi--dimensional'' array; there cannot be one, because in truth
there are no multi--dimensional arrays or elements; there is only a
multi--dimensional *way of accessing* an array.
However, if your program has an array that is always accessed as
multi--dimensional, you can get the effect of scanning it by
combining the scanning `for' statement (*note Scanning an Array::.)
with the `split' built--in function (*note String Functions::.). It
works like this:
for (combined in ARRAY) {
    split (combined, separate, SUBSEP)
    ...
}
This finds each concatenated, combined index in the array, and splits
it into the individual indices by breaking it apart where the value
of `SUBSEP' appears. The split--out indices become the elements of
the array `separate'.
Thus, suppose you have previously stored a value in `ARRAY[1, "foo"]'; then
an element with index `"1\034foo"' exists in ARRAY. (Recall that the
default value of `SUBSEP' contains the character with code 034.)
Sooner or later the `for' statement will find that index and do an
iteration with `combined' set to `"1\034foo"'. Then the `split'
function will be called as follows:
split ("1\034foo", separate, "\034")
The result of this is to set `separate[1]' to 1 and `separate[2]' to
`"foo"'. Presto, the original sequence of separate indices has been
recovered.
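A complete version of this loop might look like the following
sketch. The array name `grid' and its contents are hypothetical, and
the output is passed through `sort' only because the scanning order
is not predictable.

```shell
out=$(awk 'BEGIN {
    grid[1, "foo"] = "x"
    grid[2, "bar"] = "y"
    for (combined in grid) {
        # Recover the separate indices from the combined one.
        split(combined, separate, SUBSEP)
        print separate[1], separate[2], grid[combined]
    }
}' | sort)
printf '%s\n' "$out"
```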
File: gawk-info, Node: Built-in, Next: User-defined, Prev: Arrays, Up: Top
Built--in functions
*******************
"Built--in" functions are functions always available for your `awk'
program to call. This chapter defines all the built--in functions
that exist; some of them are mentioned in other sections, but they
are summarized here for your convenience. (You can also define new
functions yourself. *Note User-defined::.)
In most cases, any extra arguments given to built--in functions are
ignored. The defaults for omitted arguments vary from function to
function and are described under the individual functions.
The name of a built--in function need not be followed immediately by
the opening left parenthesis of the arguments; whitespace is allowed.
However, it is wise to write no space there, since user--defined
functions do not allow space.
When a function is called, expressions that create the function's
actual parameters are evaluated completely before the function call
is performed. For example, in the code fragment:
i = 4
j = myfunc(i++)
the variable `i' will be set to 5 before `myfunc' is called with a
value of 4 for its actual parameter.
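As a small sketch of this evaluation order, the hypothetical function `show` below (not part of `awk') reports both its argument and the global variable `i` as seen from inside the call.

```shell
out=$(awk '
function show(v) {
    # i is a global variable, so this reports its value during the call.
    return "argument=" v " i=" i
}
BEGIN {
    i = 4
    j = show(i++)    # the argument expression is evaluated before the call
    print j
}')
echo "$out"
```

The function receives 4, but by the time its body runs `i` has already been incremented to 5.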
* Menu:
* Numeric Functions:: Functions that work with numbers,
including `int', `sin' and `rand'.
* String Functions:: Functions for string manipulation,
such as `split', `match', and `sprintf'.
* I/O Functions:: Functions for files and shell commands
File: gawk-info, Node: Numeric Functions, Next: String Functions, Up: Built-in
Numeric Built--in Functions
===========================
The general syntax of the numeric built--in functions is the same for
each. Here is an example of that syntax:
awk '# Read input records containing a pair of points: x0, y0, x1, y1.
# Print the points and the distance between them.
{ printf "%f %f %f %f %f\n", $1, $2, $3, $4,
sqrt(($3-$1) * ($3-$1) + ($4-$2) * ($4-$2)) }'
This computes the distance as the square root of an expression built
from the field values. It then prints the first four fields of the
input record, followed by the result of the square root calculation.
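As a quick check of the distance calculation, here is a sketch that feeds one record to a similar rule. It assumes the field order x0, y0, x1, y1 stated in the program's comment; the variable names `dx` and `dy` are introduced only for this illustration.

```shell
out=$(echo "0 0 3 4" | awk '{
    dx = $3 - $1     # x1 - x0
    dy = $4 - $2     # y1 - y0
    printf "%.1f\n", sqrt(dx * dx + dy * dy)
}')
echo "$out"
```

The points (0, 0) and (3, 4) are the corners of a 3-4-5 right triangle, so the result is `5.0`.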
Here is the full list of numeric built--in functions:
`int(X)'
This gives you the integer part of X, truncated toward 0. This
produces the nearest integer to X, located between X and 0.
For example, `int(3)' is 3, `int(3.9)' is 3, `int(-3.9)' is -3,
and `int(-3)' is -3 as well.
`sqrt(X)'
This gives you the positive square root of X. It reports an
error if X is negative.
`exp(X)'
This gives you the exponential of X, or reports an error if X is
out of range. The range of values X can have depends on your
machine's floating point representation.
`log(X)'
This gives you the natural logarithm of X, if X is positive;
otherwise, it reports an error.
`sin(X)'
This gives you the sine of X, with X in radians.
`cos(X)'
This gives you the cosine of X, with X in radians.
`atan2(Y, X)'
This gives you the arctangent of Y/X, with the result in radians.
`rand()'
This gives you a random number. The values of `rand()' are
uniformly--distributed between 0 and 1. The value may be 0 but
is never 1.
Often you want random integers instead. Here is a user--defined
function you can use to obtain a random nonnegative integer less
than N:
function randint(n) {
return int(n * rand())
}
The multiplication produces a random real number at least 0, and
less than N. We then make it an integer (using `int') between 0
and `N-1'.
Here is an example where a similar function is used to produce
random integers between 1 and N:
awk '
# Function to roll a simulated die.
function roll(n) { return 1 + int(rand() * n) }
# Roll 3 six--sided dice and print total number of points.
{
printf("%d points\n", roll(6)+roll(6)+roll(6))
}'
*Note* that `rand()' starts generating numbers from the same
point, or "seed", each time you run `awk'. This means that the
same program will produce the same results each time you run it.
The numbers are random within one `awk' run, but predictable
from run to run. This is convenient for debugging, but if you
want a program to do different things each time it is used, you
must change the seed to a value that will be different in each
run. To do this, use `srand'.
`srand(X)'
The function `srand(X)' sets the starting point, or "seed", for
generating random numbers to the value X.
Each seed value leads to a particular sequence of ``random''
numbers. Thus, if you set the seed to the same value a second
time, you will get the same sequence of ``random'' numbers again.
If you omit the argument X, as in `srand()', then the current
date and time of day are used for a seed. This is the way to
get random numbers that differ from one run to the next.
The return value of `srand()' is the previous seed. This makes
it easy to keep track of the seeds for use in consistently
reproducing sequences of random numbers.
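The sketch below exercises several of the numeric functions listed above. The seed value 42 and the variable names are arbitrary choices made for this illustration, and the die-rolling expression is the one used in the `roll` example.

```shell
out=$(awk 'BEGIN {
    print int(3.9), int(-3.9)          # truncation toward 0
    printf "%.4f\n", atan2(0, -1)      # the angle of the point (-1, 0), i.e. pi
    printf "%.4f\n", exp(1)

    # The same seed should lead to the same sequence of numbers.
    srand(42); first = rand()
    srand(42); second = rand()
    same = (first == second) ? "same" : "different"
    print same

    # Simulated die rolls should always land between 1 and 6.
    ok = 1
    for (k = 0; k < 1000; k++) {
        r = 1 + int(rand() * 6)
        if (r < 1 || r > 6)
            ok = 0
    }
    result = ok ? "in range" : "out of range"
    print result
}')
echo "$out"
```

This should print `3 -3`, `3.1416`, `2.7183`, `same` and `in range`, one per line.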
File: gawk-info, Node: String Functions, Next: I/O Functions, Prev: Numeric Functions, Up: Built-in
Built--in Functions for String Manipulation
===========================================
`index(IN, FIND)'
This searches the string IN for the first occurrence of the
string FIND, and returns the position where that occurrence
begins in the string IN. For example:
awk 'BEGIN { print index("peanut", "an") }'
prints `3'. If FIND is not found, `index' returns 0.
`length(STRING)'
This gives you the number of characters in STRING. If STRING is
a number, the length of the digit string representing that
number is returned. For example, `length("abcde")' is 5.
By contrast, `length(15 * 35)' works out to 3. How? Well, 15 * 35
= 525, and 525 is then converted to the string `"525"', which
has three characters.
`match(STRING, REGEXP)'
The `match' function searches the string, STRING, for the
longest, leftmost substring matched by the regular expression,
REGEXP. It returns the character position, or "index", of where
that substring begins (1, if it starts at the beginning of
STRING). If no match is found, it returns 0.
The `match' function sets the special variable `RSTART' to the
index. It also sets the special variable `RLENGTH' to the
length of the matched substring. If no match is found, `RSTART'
is set to 0, and `RLENGTH' to -1.
For example:
awk '{
if ($1 == "FIND")
regex = $2
else {
where = match($0, regex)
if (where)
print "Match of", regex, "found at", where, "in", $0
}
}'
This program looks for lines that match the regular expression
stored in the variable `regex'. This regular expression can be
changed. If the first word on a line is `FIND', `regex' is
changed to be the second word on that line. Therefore, given:
FIND fo*bar
My program was a foobar
But none of it would doobar
FIND Melvin
JF+KM
This line is property of The Reality Engineering Co.
This file was created by Melvin.
`awk' prints:
Match of fo*bar found at 18 in My program was a foobar
Match of Melvin found at 26 in This file was created by Melvin.
`split(STRING, ARRAY, FIELD_SEPARATOR)'
This divides STRING up into pieces separated by FIELD_SEPARATOR,
and stores the pieces in ARRAY. The first piece is stored in
`ARRAY[1]', the second piece in `ARRAY[2]', and so forth. The
string value of the third argument, FIELD_SEPARATOR, is used as
a regexp to search for to find the places to split STRING. If
the FIELD_SEPARATOR is omitted, the value of `FS' is used.
`split' returns the number of elements created.
The `split' function, then, splits strings into pieces in a
manner similar to the way input lines are split into fields.
For example:
split("auto-da-fe", a, "-")
splits the string `auto-da-fe' into three fields using `-' as
the separator. It sets the contents of the array `a' as follows:
a[1] = "auto"
a[2] = "da"
a[3] = "fe"
The value returned by this call to `split' is 3.
`sprintf(FORMAT, EXPRESSION1,...)'
This returns (without printing) the string that `printf' would
have printed out with the same arguments (*note Printf::.). For
example:
sprintf("pi = %.2f (approx.)", 22/7)
returns the string `"pi = 3.14 (approx.)"'.
`sub(REGEXP, REPLACEMENT_STRING, TARGET_VARIABLE)'
The `sub' function alters the value of TARGET_VARIABLE. It
searches this value, which should be a string, for the leftmost
substring matched by the regular expression, REGEXP, extending
this match as far as possible. Then the entire string is
changed by replacing the matched text with REPLACEMENT_STRING.
The modified string becomes the new value of TARGET_VARIABLE.
This function is peculiar because TARGET_VARIABLE is not simply
used to compute a value, and not just any expression will do: it
must be a variable, field or array reference, so that `sub' can
store a modified value there. If this argument is omitted, then
the default is to use and alter `$0'.
For example:
str = "water, water, everywhere"
sub(/at/, "ith", str)
sets `str' to `"wither, water, everywhere"', by replacing the
leftmost, longest occurrence of `at' with `ith'.
The `sub' function returns the number of substitutions made
(either one or zero).
The special character, `&', in the replacement string,
REPLACEMENT_STRING, stands for the precise substring that was
matched by REGEXP. (If the regexp can match more than one
string, then this precise substring may vary.) For example:
awk '{ sub(/candidate/, "& and his wife"); print }'
will change the first occurrence of ``candidate'' to ``candidate
and his wife'' on each input line.
The effect of this special character can be turned off by
preceding it with a backslash (`\&'). To include a backslash in
the replacement string, it too must be preceded with a (second)
backslash.
Note: if you use `sub' with a third argument that is not a
variable, field or array element reference, then it will still
search for the pattern and return 0 or 1, but the modified
string is thrown away because there is no place to put it. For
example:
sub(/USA/, "United States", "the USA and Canada")
will indeed produce a string `"the United States and Canada"',
but there will be no way to use that string!
`gsub(REGEXP, REPLACEMENT_STRING, TARGET_VARIABLE)'
This is similar to the `sub' function, except `gsub' replaces
*all* of the longest, leftmost, *non--overlapping* matching
substrings it can find. The ``g'' in `gsub' stands for
"global", which means replace *everywhere*. For example:
awk '{ gsub(/Britain/, "United Kingdom"); print }'
replaces all occurrences of the string `Britain' with `United
Kingdom' for all input records.
The `gsub' function returns the number of substitutions made.
If the variable to be searched and altered, TARGET_VARIABLE, is
omitted, then the entire input record, `$0', is used.
The characters `&' and `\' are special in `gsub' as they are in
`sub' (see immediately above).
`substr(STRING, START, LENGTH)'
This returns a LENGTH--character--long substring of STRING,
starting at character number START. The first character of a
string is character number one. For example,
`substr("washington", 5, 3)' returns `"ing"'.
If LENGTH is not present, this function returns the whole suffix
of STRING that begins at character number START. For example,
`substr("washington", 5)' returns `"ington"'.
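To tie the string functions together, here is a sketch that calls most of them in a single `BEGIN` rule. The test strings come from the examples above, except for `"foobar"` in the `match` call and the variable names, which are invented here.

```shell
out=$(awk 'BEGIN {
    print index("peanut", "an")
    print length("abcde"), length(15 * 35)

    # match returns the starting index and also sets RSTART and RLENGTH.
    where = match("foobar", /o+/)
    print where, RSTART, RLENGTH

    n = split("auto-da-fe", a, "-")
    print n, a[1], a[2], a[3]

    # sub replaces only the leftmost match; gsub replaces every match.
    str = "water, water, everywhere"
    sub(/at/, "ith", str)
    print str
    str2 = "water, water, everywhere"
    count = gsub(/at/, "ith", str2)
    print count, str2

    print substr("washington", 5, 3), substr("washington", 5)
}')
echo "$out"
```

Note how `sub` changes only the first `water`, while `gsub` changes both and returns 2.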
File: gawk-info, Node: I/O Functions, Prev: String Functions, Up: Built-in
Built--in Functions for I/O to Files and Commands
=================================================
`close(FILENAME)'
Close the file FILENAME. The argument may alternatively be a
shell command that was used for redirecting to or from a pipe;
then the pipe is closed.
*Note Close Input::, regarding closing input files and pipes.
*Note Close Output::, regarding closing output files and pipes.
`system(COMMAND)'
The system function allows the user to execute operating system
commands and then return to the `awk' program. The `system'
function executes the command given by the string value of
COMMAND. It returns, as its value, the status returned by the
command that was executed. This is known as returning the "exit
status".
For example, if the following fragment of code is put in your
`awk' program:
END {
system("mail -s 'awk run done' operator < /dev/null")
}
the system operator will be sent mail when the `awk' program
finishes processing input and begins its end--of--input
processing.
Note that much the same result can be obtained by redirecting
`print' or `printf' into a pipe. However, if your `awk' program
is interactive, this function is useful for cranking up large
self--contained programs, such as a shell or an editor.
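Here is a brief sketch of both functions, assuming a POSIX shell is available to run the commands. The commands `exit 3` and `echo hello` are stand-ins chosen for this illustration, and reading from a pipe with `getline` is described elsewhere in this manual.

```shell
out=$(awk 'BEGIN {
    # system returns the exit status of the command it ran.
    status = system("exit 3")

    # Closing a command pipe lets the same command be run again
    # from the start; without close, the second getline would hit
    # end of input and leave the variable second empty.
    cmd = "echo hello"
    cmd | getline first
    close(cmd)
    cmd | getline second
    close(cmd)

    print status, first, second
}')
echo "$out"
```

This should print `3 hello hello`.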
File: gawk-info, Node: User-defined, Next: Special, Prev: Built-in, Up: Top
User--defined Functions
***********************
Complicated `awk' programs can often be simplified by defining your
own functions. User--defined functions can be called just like
built--in ones (*note Function Calls::.), but it is up to you to
define them--to tell `awk' what they should do.
* Menu:
* Definition Syntax:: How to write definitions and what they mean.
* Function Example:: An example function definition and what it does.
* Function Caveats:: Things to watch out for.
* Return Statement:: Specifying the value a function returns.
File: gawk-info, Node: Definition Syntax, Next: Function Example, Up: User-defined
Syntax of Function Definitions
==============================
The definition of a function named NAME looks like this:
function NAME (PARAMETER-LIST) {
BODY-OF-FUNCTION
}
A valid function name is like a valid variable name: a sequence of
letters, digits and underscores, not starting with a digit.
Such function definitions can appear anywhere between the rules of
the `awk' program. The general format of an `awk' program, then, is
now modified to include sequences of rules *and* user--defined
function definitions.
The function definition need not precede all the uses of the function.
This is because `awk' reads the entire program before starting to
execute any of it.
The PARAMETER-LIST is a list of the function's "local" variable
names, separated by commas. Within the body of the function, local
variables refer to arguments with which the function is called. If
the function is called with fewer arguments than it has local
variables, this is not an error; the extra local variables are simply
set to the null string.
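The sketch below illustrates two of the points just made: a function may be defined after the rule that calls it, and a parameter with no matching argument starts out as the null string. The function `greet` is hypothetical.

```shell
out=$(awk '
BEGIN {
    print greet("world")
    print greet()            # name is the null string inside this call
}
# The definition may follow its uses.
function greet(name) {
    if (name == "")
        name = "nobody"
    return "hello, " name
}')
echo "$out"
```

The first call prints `hello, world` and the second prints `hello, nobody`.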
The local variable values hide or "shadow" any variables of the same
names used in the rest of the program. The shadowed variables are
not accessible in the function definition, because there is no way to
name them while their names have been taken away for the local
variables. All other variables used in the `awk' program can be
referenced or set normally in the function definition.
The local variables last only as long as the function is executing.
Once the function finishes, the shadowed variables come back.
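As a small sketch of shadowing, the hypothetical function `clobber` below has a local variable `x` and also assigns to `y`, which is not in its parameter list and is therefore the global variable.

```shell
out=$(awk '
function clobber(x) {
    x = "local"       # changes only the local x
    y = "changed"     # y is not local, so this changes the global y
}
BEGIN {
    x = "global x"
    y = "global y"
    clobber("ignored")
    print x
    print y
}')
echo "$out"
```

After the call, the global `x` still holds `global x`, while `y` holds `changed`.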
The BODY-OF-FUNCTION part of the definition is the most important
part, because this is what says what the function should actually *do*.
The local variables exist to give the body a way to talk about the
arguments.
Functions may be "recursive", i.e., they can call themselves, either
directly, or indirectly (via calling a second function that calls the
first again).
The keyword `function' may also be written `func'.
File: gawk-info, Node: Function Example, Next: Function Caveats, Prev: Definition Syntax, Up: User-defined
Function Definition Example
===========================
Here is an example of a user--defined function, called `myprint',
that takes a number and prints it in a specific format.
function myprint(num)
{
printf "%6.3g\n", num
}
To illustrate, here is an `awk' rule that uses, or "calls", our
`myprint' function:
$3 > 0 { myprint($3) }
This program prints, in our special format, all the third fields that
contain a positive number in our input. Therefore, when given:
1.2 3.4 5.6 7.8
9.10 11.12 13.14 15.16
17.18 19.20 21.22 23.24
this program, using our function to format the results, will print:
5.6
13.1
21.2
Here is a rather contrived example of a recursive function. It
prints a string backwards:
function rev (str, len) {
if (len == 0) {
printf "\n"
return
}
printf "%c", substr(str, len, 1)
rev(str, len - 1)
}
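One way to try out `rev` is sketched below. The caller is expected to pass the length of the string as the second argument; the test string `"awk"` is an arbitrary choice.

```shell
out=$(awk '
function rev(str, len) {
    if (len == 0) {
        printf "\n"
        return
    }
    # Print the last remaining character, then recurse on the rest.
    printf "%c", substr(str, len, 1)
    rev(str, len - 1)
}
BEGIN { rev("awk", length("awk")) }')
echo "$out"
```

This prints `kwa`.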
File: gawk-info, Node: Function Caveats, Next: Return Statement, Prev: Function Example, Up: User-defined
Caveats of Function Calling
===========================
*Note* that there cannot be any blanks between the function name and
the left parenthesis of the argument list, when calling a function.
This is so `awk' can tell you are not trying to concatenate the value
of a variable with the value of an expression inside the parentheses.
When a function is called, it is given a *copy* of the values of its
arguments. This is called "passing by value". The caller may use a
variable as the expression for the argument, but the called function
does not know this: all it knows is what value the argument had. For
example, if you write this code:
foo = "bar"
z = myfunc(foo)
then you should not think of the argument to `myfunc' as being ``the
variable `foo'''. Instead, think of the argument as the string
value, `"bar"'.
If the function `myfunc' alters the values of its local variables,
this has no effect on any other variables. In particular, if
`myfunc' does this:
function myfunc (win) {
print win
win = "zzz"
print win
}
to change its first argument variable `win', this *does not* change
the value of `foo' in the caller. The role of `foo' in calling
`myfunc' ended when its value, `"bar"', was computed. If `win' also
exists outside of `myfunc', this definition will not change it--that
value is shadowed during the execution of `myfunc' and cannot be seen
or changed from there.
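The following sketch puts the pieces above into one runnable program. It is a variation on `myfunc` that returns the altered value instead of printing it, so that the caller can compare it with `foo`.

```shell
out=$(awk '
function myfunc(win) {
    win = "zzz"       # alters only the local copy
    return win
}
BEGIN {
    foo = "bar"
    z = myfunc(foo)
    print foo, z
}')
echo "$out"
```

The output is `bar zzz`: the caller's `foo` is untouched, even though the function changed its own copy.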
However, when arrays are the parameters to functions, they are *not*
copied. Instead, the array itself is made available for direct
manipulation by the function. This is usually called "passing by
reference". Changes made to an array parameter inside the body of a
function *are* visible outside that function. *This can be very
dangerous if you don't watch what you are doing.* For example:
function changeit (array, ind, nvalue) {
array[ind] = nvalue
}
BEGIN {
a[1] = 1 ; a[2] = 2 ; a[3] = 3
changeit(a, 2, "two")
printf "a[1] = %s, a[2] = %s, a[3] = %s\n", a[1], a[2], a[3]
}
will print `a[1] = 1, a[2] = two, a[3] = 3', because the call to
`changeit' stores `"two"' in the second element of `a'.
File: gawk-info, Node: Return Statement, Prev: Function Caveats, Up: User-defined
The `return' statement
======================
The body of a user--defined function can contain a `return' statement.
This statement returns control to the rest of the `awk' program. It
can also be used to return a value for use in the rest of the `awk'
program. It looks like:
`return EXPRESSION'
The EXPRESSION part is optional. If it is omitted, then the returned
value is undefined and, therefore, unpredictable.
A `return' statement with no value expression is assumed at the end
of every function definition. So if control reaches the end of the
function definition, then the function returns an unpredictable value.
Here is an example of a user--defined function that returns a value
for the largest number among the elements of an array:
function maxelt (vec, i, ret) {
for (i in vec) {
if (ret == "" || vec[i] > ret)
ret = vec[i]
}
return ret
}
You call `maxelt' with one argument, an array name. The local
variables `i' and `ret' are not intended to be arguments; while there
is nothing to stop you from passing two or three arguments to
`maxelt', the results would be strange.
When writing a function definition, it is conventional to separate
the parameters from the local variables with extra spaces, as shown
above in the definition of `maxelt'.
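As another sketch of this convention, the hypothetical function `sum_to` below takes one real argument, `n`, and declares `i` and `total` as local variables after the extra spaces.

```shell
out=$(awk '
function sum_to(n,    i, total) {
    # total starts out null on every call, which acts as 0 here.
    for (i = 1; i <= n; i++)
        total += i
    return total
}
BEGIN {
    i = 99                       # a global i, untouched by sum_to
    print sum_to(4), sum_to(3), i
}')
echo "$out"
```

This should print `10 6 99`: each call starts with fresh local variables, and the global `i` keeps its value.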
Here is a program that uses, or calls, our `maxelt' function. This
program loads an array, calls `maxelt', and then reports the maximum
number in that array:
awk '
function maxelt (vec, i, ret) {
for (i in vec) {
if (ret == "" || vec[i] > ret)
ret = vec[i]
}
return ret
}
# Load all fields of each record into nums.
{
for(i = 1; i <= NF; i++)
nums[NR, i] = $i
}
END {
print maxelt(nums)
}'
Given the following input:
1 5 23 8 16
44 3 5 2 8 26
256 291 1396 2962 100
-6 467 998 1101
99385 11 0 225
our program tells us (predictably) that:
99385
is the largest number in our array.
File: gawk-info, Node: Special, Next: Sample Program, Prev: User-defined, Up: Top
Special Variables
*****************
Most `awk' variables are available for you to use for your own
purposes; they will never change except when your program assigns
them, and will never affect anything except when your program
examines them.
A few variables have special meanings. Some of them `awk' examines
automatically, so that they enable you to tell `awk' how to do
certain things. Others are set automatically by `awk', so that they
carry information from the internal workings of `awk' to your program.
Most of these variables are also documented in the chapters where
their areas of activity are described.
* Menu:
* User-modified:: Special variables that you change to control `awk'.
* Auto-set:: Special variables where `awk' gives you information.