Awk Command in Linux by Example

The Linux Newbie Guide

Processing text-based data with awk
Basic usage of awk
Awk program patterns
User-defined variables
Built-in variables
BEGIN and END blocks
Output functions print and printf
Associative arrays
Executing system commands using system
Closing files/pipelines using close
Reading input data using getline
Mathematical functions
String functions
User-defined functions(UDFs)

Processing text-based data with awk
Many UNIX/Linux command names originate from peculiar abbreviations, and awk is no exception! Its name comes from the initials of its three authors, Alfred Aho, Peter Weinberger, and Brian Kernighan.

Awk is an interpreted language that heavily borrows syntax from the C programming language, incorporating the essence of C's text processing and output formatting capabilities. Additionally, awk supports features that the original C language lacks, such as matching with regular expressions and the use of Associative Arrays.

As a result, the most significant difference between awk and C lies in their application. C is a general-purpose programming language with a large and complex syntax, while awk is compact and concise, particularly well-suited to processing textual records and formatting text. Awk was popular throughout the 1980s; from around the 1990s it gradually yielded ground to another general-purpose interpreted language, Perl.

Basic usage of awk
In comparison to sed, which processes text on a line-by-line basis, awk can also handle text using "fields."

For example, consider the output of ls -l, which lists detailed file information. The output contains 8 fields, separated by spaces:

$ export TIME_STYLE=long-iso ←Set the time format (environment settings may affect the output format of "ls -l")
$ ls -l
drwxr-xr-x	2	aaa	aaa	4096	2011-09-07	11:44	Desktop
drwxr-xr-x	2	aaa	aaa	4096	2011-09-07	11:44	Documents
drwxr-xr-x	2	aaa	aaa	4096	2011-09-07	11:44	Music
drwxr-xr-x	2	aaa	aaa	4096	2011-09-07	11:44	Pictures
drwxr-xr-x	2	aaa	aaa	4096	2011-09-07	11:44	Public
↑	↑	↑	↑	↑	↑	↑	↑
$1	$2	$3	$4	$5	$6	$7	$8 ←field variables

If we pipe the output of ls -l into awk, written as ls -l | awk {}, awk automatically stores the whole line in "$0" and each field in its own field variable, "$1" to "$N" (where N is the number of fields). Because an awk program usually contains characters that the shell would otherwise interpret itself (such as "$", braces, and spaces), it is generally recommended to enclose the program in single quotes ('); options and file names stay outside the quotes. The command is therefore written as ls -l | awk '{}' .

In the example above, the first line has 8 fields, which will result in 9 field variables with values from "$0" to "$8" as follows:

Variable	Content	Note
$0	drwxr-xr-x 2 aaa aaa 4096 2011-09-07 11:44 Desktop	"$0" contains the entire line's string
$1	drwxr-xr-x	string from field=1
$2	2	string from field=2
$3	aaa	string from field=3
$4	aaa	string from field=4
$5	4096	string from field=5
$6	2011-09-07	string from field=6
$7	11:44	string from field=7
$8	Desktop	string from field=8

The most special field variable is "$0," which contains the entire line's content, and when "$0" is modified, it automatically updates other field variables "$1" to "$N."

If we write only ls -l | awk '{}', there will be no output because there are no output functions and no specified field to be output. The most commonly used built-in output function in awk is "print." For example, if I want to output only the size field (field=5) and the filename field (field=8) of the ls -l output, I can write:

Example:

$ ls -l | awk '{print $5,$8}'← Output only fields 5 and 8
4096 Desktop
4096 Documents
4096 Music
. . .

In the "print" function, the comma "," represents the Output Field Separator (OFS), and the default OFS is a space, so each comma in the print function outputs a space. You can try removing the comma and entering ls -l | awk '{print $5 $8}' to see the difference in output.

If there are multiple commands, they should be separated by a semicolon ";" or written on the next line. For example, ls -l | awk '{size=$5; file_name=$8; print size, file_name;}' (in this example, "size" and "file_name" are user-defined variables).
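
The behavior of the comma in "print" and of the field variables can be checked with a short sketch. The input lines are made up for illustration, and the shell variable names are arbitrary:

```shell
# Comma: the fields are joined with OFS (a space by default).
with_comma=$(echo '4096 Desktop' | awk '{ print $1, $2 }')
# No comma: the two strings are simply concatenated.
no_comma=$(echo '4096 Desktop' | awk '{ print $1 $2 }')
# Assigning to a field rebuilds $0 from the fields.
rebuilt=$(echo 'a b c' | awk '{ $2 = "Z"; print $0 }')

echo "$with_comma"   # 4096 Desktop
echo "$no_comma"     # 4096Desktop
echo "$rebuilt"      # a Z c
```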

Once you understand the usage of field variables "$N," it is easy to use this feature to change the output format. The most basic usage of awk is to change the output format, as shown in the example below, where awk is used to modify the original output of ls.

Example:

$ ls -l | awk '{print "File",$8,"size =",$5,"Byte"}'
File Desktop size = 4096 Byte
File Documents size = 4096 Byte
File Music size = 4096 Byte
. . .

In the above example, you can add strings to the "print" function to output them. Strings to be added should be enclosed in double quotes.

One of the reasons why awk was popular is its ability to perform calculations using field variables, such as "$3*base-1". Continuing the previous example, if I want to display the file size field in KiB (labeled "KB" in the output below), I can divide "$5" by 1024.

Example:

$ ls -l | awk '{print "File",$8,"size =",$5/1024,"KB"}'
File Desktop size = 4 KB
File Documents size = 4 KB
File Music size = 4 KB
. . .
$ ls -l | awk '{print "File",$8,"size =",$5/1024,"KB"}' > reformate.txt ← Save the new output as a file "reformate.txt"

As for input data in awk, besides coming from a pipeline "|", it can also come from a file. For example, awk '{print $3}' data.txt prints field 3 of the file "data.txt". If you want to create an interactive program that accepts input from the keyboard, awk also supports the "-" symbol for standard input, as shown in the example below, which takes two numbers as input and outputs their product:

Example:

$ awk '{print $1*$2}' - ←The trailing "-" symbol represents standard input (keyboard)
3.14 1.41421 ←Enter any two numbers
4.44062 ←Output the product (Press <Ctrl-D> to end)

To execute awk, you can use the methods mentioned above, or, like sed, use the "-f" option to use an external script file or write it as a shell script. When using an external script file, remove the single quotes (') enclosing the "{}," as shown in the example below:

Example: (Using an external awk script file)

$ cat awk_scr ←Suppose there is an external script file named "awk_scr" with the following content
{print "File",$8,"size =",$5/1024,"KB"}
$ ls -l | awk -f awk_scr ← Use the "-f" option to use the external script file "awk_scr"

Example: (Write as a shell script)

$ cat awk_scr1 ←Suppose there is a file named "awk_scr1" with the following content
awk '{print "File",$8,"size =",$5/1024,"KB"}'
$ chmod +x awk_scr1 ←Give "awk_scr1" executable permissions
$ ls -l | ./awk_scr1 ←Execute "awk_scr1"


Awk program patterns
In addition to basic output patterns, awk also possesses powerful program patterns since awk itself is a scripting language. Whether or not to learn the programming aspect of awk depends on individual needs and expertise. The following explanation is provided for you to make your judgment.

While many tasks can be achieved using C language or shell scripts, the entry barrier for C language is higher, and for small tasks, using C might be overkill. On the other hand, shell scripts might fall short when it comes to text processing. However, for those skilled in awk scripting and with creativity, awk can almost entirely replace all filtering programs like grep, sed, tr, cut, etc. Moreover, it has calculation and statistical capabilities. If you need to process textual data records, awk is the first choice. Some have even tested that awk can be over 30 times faster than shell scripts for the same functionality. The syntax of awk scripting heavily borrows from C language syntax, so if you are already familiar with C/C++/Java, learning awk's scripting language will be relatively simple. However, if you are not familiar with C, it might be more challenging.

Readers can become proficient in awk scripting without knowing C language since awk scripting is much simpler. However, in this explanation, we assume that readers are already familiar with C language, so we won't specifically explain C language instructions and syntax. Instead, we will focus on the differences between awk and C language.

The structure of an awk command is mainly awk '[Pattern] [{Actions}]' [Files], where, in awk terminology, "Pattern" is not necessarily a regular expression pattern, but rather a condition. "{Actions}" represents the statements to be executed, and "Files" is the text data file to be processed. Of course, besides files, awk can also process data from other commands through pipelines.

A Pattern is not always present. If it exists, the Actions will be executed when the Pattern is satisfied; otherwise, the Actions will not be executed. For example, to filter files based on size, the command can be written as ls -l | awk '$5 > 8192 {print $5, $8}'. It means that if the content of field 5 is greater than 8192, the "print $5, $8" action will be executed. If there is no Pattern, as in ls -l | awk '{print $5, $8}', the Actions "{print $5, $8}" will be executed regardless of any conditions.

"{Actions}" can also be omitted. When omitted, the default action is "print $0." For example, awk 'NR <= 5' /etc/passwd works like head -n 5 /etc/passwd. (In this example, "NR" is a built-in variable).

Awk's Patterns provide similar judgment syntax to C language, such as:

awk relational operators
Operator	Meaning
==	equals
!=	not equals
>	greater than
>=	greater than or equal to
<	less than
<=	less than or equal to
&&	logical AND
||	logical OR
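
The operators in the table can be combined in a Pattern. The following sketch runs them on a few made-up records (a file name and a size in bytes); the data and the variable names are only for illustration:

```shell
data=$(printf '%s\n' 'notes.txt 512' 'video.mp4 900000' 'photo.jpg 20480')

# Pattern with an action: print the name when the size lies in a range.
mid_sized=$(echo "$data" | awk '$2 > 8192 && $2 < 100000 { print $1 }')
# Pattern without an action: matching records are printed whole.
not_tiny=$(echo "$data" | awk '$2 >= 1000')

echo "$mid_sized"   # photo.jpg
echo "$not_tiny"
# video.mp4 900000
# photo.jpg 20480
```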

Unlike traditional C language, Awk's Patterns can also use regular expressions for matching judgments, represented by "~" for matching and "!~" for non-matching.
(In fact, the current version of awk supports extended regular expressions.)

Syntax:

awk regular expression matching
Operator	Meaning
string ~ /regular expression/ [Actions]	If the string matches the regular expression, execute Actions
string !~ /regular expression/ [Actions]	If the string does not match the regular expression, execute Actions
/regular expression/ [Actions]	If the current input line matches the regular expression, execute Actions (omitting the string and "~" symbol will use $0 to match the regular expression).
!/regular expression/ [Actions]	If the current input line does not match the regular expression, execute Actions.

To match regular expressions, remember to enclose them in paired slashes "/". For example, in the command ls /etc | awk '$1 ~ /pr*e/', it means that if the content of field 1 in any line matches the regular expression "pr*e", then that line will be output.

If you omit both the string to match and the tilde symbol "~", the meaning can be considered as a search. In this case, if the current input line contains a match to the regular expression, the specified actions will be executed. For instance, awk '/colou*r/' file works similarly to the grep command.
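
As a quick check of these matching forms, the sketch below feeds four made-up words to awk (the word list and variable names are assumptions for illustration):

```shell
words=$(printf '%s\n' color colour colr cooler)

# Search form: print lines containing "colo", any number of "u", then "r".
searched=$(echo "$words" | awk '/colou*r/')
# Negated match on field 1: print lines that do NOT start with "col".
negated=$(echo "$words" | awk '$1 !~ /^col/')

echo "$searched"
# color
# colour
echo "$negated"    # cooler
```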

The "{Actions}" part is not limited to just "print." Various commands and syntax similar to C language are also valid.

List of awk syntax

if ( conditional ) statement [ else statement ]
while ( conditional ) statement
do {statement} while (conditional)
for ( expression ; conditional ; expression ) statement
for ( variable in array ) statement
break
continue
{ [ statement ] ...}
variable=expression
[ command | ] getline [ var ] [ < file ]
print [ expression-list ] [ > expression ]
printf format [ , expression-list ] [ > expression ]
function name( parameter-list ) { statement }
next
exit

Reference Source:
http://www.grymoire.com/Unix/Awk.html

Additionally, awk comments, like sed, are denoted by the "#" symbol.
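
A few of the statements listed above ("if ... else", "next", and "#" comments) can be seen together in a minimal sketch. The input records (a name and a score) and the pass mark of 60 are invented for this example:

```shell
out=$(printf '%s\n' '# name score' 'alice 72' 'bob 45' | awk '
/^#/ { next }                # skip input lines that start with "#"
{
    if ($2 >= 60)            # C-like if/else on field 2
        print $1, "pass"
    else
        print $1, "fail"
}')
echo "$out"
# alice pass
# bob fail
```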


User-defined variables

User-defined variables in awk are different from those in C language. In awk, variables do not require declaration and are typeless global variables.

For example, you can create a variable "score" without declaring it, and you can assign various values to "score" without specifying its type. The following examples are all valid:

score = 99 (integer)
score = 99.99 (floating-point)
score = "ninety-nine" (string)
score = NF (built-in variable)
score = $9 (field variable)

The reason why awk does not require variable type declaration like C language (e.g., int x) is that awk stores values as strings or numbers and converts between the two only when necessary. For instance, the commands awk 'BEGIN {print 3 * 7}' and awk 'BEGIN {print "3" * "7"}' both output the same result, 21 (BEGIN is explained in the section "BEGIN and END blocks"). However, it is still a good practice to use quotes (") to indicate that the data is a string if you know that you are dealing with strings and not numbers.

For example:

$ cat awk_scr2
BEGIN {
   brand="555" #←Variable "brand" is a string "555"
   unit_price = 0.8 #←Variable "unit_price" is a numerical value 0.8
   dozen = 12 #←Variable "dozen" is a numerical value 12
   print brand,"cigarettes a dozen price=",unit_price * dozen
  }
$ awk -f awk_scr2
555 cigarettes a dozen price= 9.6
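
The automatic conversion between strings and numbers can be observed directly. In this sketch the variable "x" and the values are arbitrary; an arithmetic operator forces a number, while placing two values side by side concatenates them as strings:

```shell
product=$(awk 'BEGIN { print "3" * "7" }')          # strings used as numbers
sum=$(awk 'BEGIN { x = "10"; print x + 5 }')        # string "10" becomes 10
joined=$(awk 'BEGIN { x = 10; print x "5" }')       # number 10 becomes "10"

echo "$product"   # 21
echo "$sum"       # 15
echo "$joined"    # 105
```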

Arrays in awk are also typeless variables and do not require declaration or definition of size. Subscripts with more than one dimension, such as array[i, j], can also be used. For instance, the following is the multiplication table (9x9), a classic C exercise, written in awk:

Example:

$ cat awk_m_table
BEGIN{ #← Multiplication Table example in awk
   for( i=1; i<=9; i++ ){
      for( j=1; j<=9; j++ ){
         array[i, j] = i * j
         print i" * "j" = "array[i,j]
      }
   }
}
$ awk -f awk_m_table
1 * 1 = 1
1 * 2 = 2
. . .
9 * 7 = 63
9 * 8 = 72
9 * 9 = 81


Built-in variables
Built-in variables (also known as intrinsic variables) in awk are different from user-defined variables in that their values are generated by awk during its operation. Some built-in variables have default values, but users can modify these default values to change the behavior of awk.

Besides field variables, there are many built-in variables available in awk for various operations. Built-in variables are written in uppercase letters, so it is recommended not to use all uppercase names for user-defined variables. This practice helps to avoid name collisions and makes it clear which variables are built-in and which are user-defined.

For example, in the code snippet "for (i = 0; i < NF; i++)," it is evident that "NF" is a built-in variable, while "i" is a user-defined variable.

Two of the most commonly used built-in variables are "NF" and "NR." "NF" (Number of Fields) stores the number of fields in the current line, and "NR" (Number of Records) stores the number of lines read so far, which is the current line number when a single input is processed (in awk terminology, a line is called a "record").

For example:

$ echo 'ab cd ef' | awk '{print NF}' ←Since there are three fields, NF=3
3

Below is a list of all built-in variables in awk:

awk built-in variables
Variable	Meaning	Default	Regular-Expression support	Note
ARGC	The number of command-line arguments passed to awk	-
ARGV	An array containing the command-line arguments passed to awk	-
FILENAME	The name of the current input file	-
FNR	The record number in the current input file.	-
FS	The input field separator	blank & tab	Yes	Refer to the BEGIN example
IGNORECASE	When set to a non-zero value, matches are case-insensitive	0	Yes	The GNU version of awk (gawk) supports this built-in variable
NF	The number of fields in the current input record.	-
NR	The total number of input records processed so far.	-		Refer to the END example
OFMT	The output format for numbers	%.6g		Refer to the print example
OFS	Output Field Separator	blank		Refer to the print example
ORS	Output Record Separator	newline
RS	The input record separator	newline	Yes
RSTART	Index of the first character matched by the "match()" function	-		Note: Refer to string functions
RLENGTH	Length of the string matched by the "match()" function	-		Note: Refer to string functions
SUBSEP	The subscript separator for array elements	"\034"		Refer to associative arrays

The built-in variables listed in the table may not be supported in all versions of awk, but in modern versions like GNU awk (gawk), most of them should be available (you can check by entering awk --version in the terminal).
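
The difference between "NR" and "FNR" in the table only shows up when awk reads more than one file. The sketch below creates two small temporary files (their names and contents are made up) and prints both counters for every record:

```shell
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/one.txt"
printf 'c\n'    > "$tmpdir/two.txt"

# NR keeps counting across files; FNR restarts at 1 for each file.
out=$(awk '{ print NR, FNR, $0 }' "$tmpdir/one.txt" "$tmpdir/two.txt")
rm -rf "$tmpdir"

echo "$out"
# 1 1 a
# 2 2 b
# 3 1 c
```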

The examples below illustrate some differences between awk and C, and the other built-in variables will be explained or tested when needed in subsequent applications.

ARGC and ARGV are two built-in variables in awk that are similar to the argc and argv [ ] in C. In C, they are used to read input parameters, while in awk, they represent the list of input files. For example, in the command awk '{}' abc def ghi (where abc, def, and ghi are filenames), the values of ARGC and ARGV are as follows:

ARGC=4
ARGV[0]="awk"
ARGV[1]="abc"
ARGV[2]="def"
ARGV[3]="ghi"

Hence, ARGC is often used as the upper bound of a loop over ARGV. For example, the following script lists the read-in files (ARGV[0], the command name "awk", is printed first):

awk 'BEGIN {for( i=0; i<ARGC; i++) print ARGV[i]}' /etc/*.conf
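
A variant of the same loop that starts at index 1 skips the command name. Because the program has only a BEGIN block, the arguments abc, def, and ghi (placeholder names reused from the example above) are never opened and do not need to exist:

```shell
out=$(awk 'BEGIN { for (i = 1; i < ARGC; i++) print i, ARGV[i] }' abc def ghi)
echo "$out"
# 1 abc
# 2 def
# 3 ghi
```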

By default, awk uses spaces and tabs ("\t") as field separators and newline as the record separator for input data. However, not all data may have these separators. In such cases, you can modify the built-in variables "FS" (Field Separator) and "RS" (Record Separator) accordingly, and you can even use regular expressions for this purpose.

For example, consider the following script with a field separator that is neither whitespace nor tab but rather ":" or "-". You can set FS="[:-]" to specify this field separator. Since you may not know the exact number of fields in each record, you can effectively use "NF" (Number of Fields) in a loop to handle the data:

Example: (Input field separators are either ":" or "-", and output each field's data)

$ cat awk_scr3
BEGIN {
   FS="[:-]" #← Set the field separator to ":" or "-"
}
{
   for( i=1; i<=NF; i++ )
      print $i
}
$ echo "ab-cd ef:gh-ij" | awk -f awk_scr3
ab
cd ef
gh
ij

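Record separators can be changed in the same way. For example, DOS/Windows formatted line breaks can be handled by setting RS = "\r\n" (a record separator longer than one character is supported by gawk, but is not guaranteed in every version of awk). Similarly, you can define "OFS" (Output Field Separator) and "ORS" (Output Record Separator) for formatting the output data accordingly.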
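
The effect of "OFS" and "ORS" can be sketched on made-up input; the sample line imitates the ":"-separated layout of /etc/passwd, and the separators chosen here are arbitrary:

```shell
# Read ":"-separated fields, write them separated by ",".
fields=$(echo 'root:x:0:0' | awk 'BEGIN { FS = ":"; OFS = "," } { print $1, $3 }')
# End every output record with ";" instead of a newline.
records=$(printf 'a\nb\n' | awk 'BEGIN { ORS = ";" } { print }')

echo "$fields"    # root,0
echo "$records"   # a;b;
```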


BEGIN and END blocks
BEGIN and END blocks in awk are special code blocks used to perform actions before processing the data "BEGIN {}" and after processing all the data "END {}". The main part of the awk program, where the processing of each record takes place, does not have any special markers and is executed for each input record. The structure of an awk program can be roughly divided into three parts:

[BEGIN { statement }]: The BEGIN block is executed once before processing any input data. It is commonly used for initialization and setup tasks.
[{main}]: This is the main part of the awk program, where the processing of each record (line) of input data takes place. The main block is executed for each input record.
[END { statement }]: The END block is executed once after processing all the input data. It is commonly used to perform final calculations or display summary results.

For example, to process the "/etc/shadow" file, which has field separators as ":", you can set the "FS" (Field Separator) variable in the BEGIN block to ":". Then, in the main block, you can check if the second field ($2) is empty to find out which accounts do not have passwords set. Finally, in the END block, you can display the total number of accounts without passwords.

Here's an example of finding accounts without passwords using awk: (Requires logging in as root to read the file "/etc/shadow")

# cat awk_nopasswd #←Script to find accounts without passwords
BEGIN {
   FS=":" # ←Set the field separator to ":"
   total=0 # ← Initialize a user-defined variable "total" to 0
}

{ # ←Main program block
   if ( $2 == "" )
   {
      print $1 ": no password"
      total ++
   }
}

END { print "Total no password account=",total } # ←END block

To execute this awk program, you can use the following command:

# cat /etc/shadow | awk -f awk_nopasswd
john: no password
fossett: no password
Total no password account= 2

In another example, we have a simple awk program that only contains the BEGIN block to print "Hello AWK":

Example:

$ awk 'BEGIN{print "Hello AWK"}'
Hello AWK

Since the program consists only of a BEGIN block, awk does not wait for any input data: "Hello AWK" is printed once and the program ends.

In the last example, we use the END block to print the value of the built-in variable "NR," which represents the total number of input records (lines), simulating the behavior of the wc -l command:

Example:

$ awk 'END {print NR}' /usr/share/dict/linux.words ←simulating the behavior of the wc -l command
479829

In this example, "NR" will be the total number of lines in the "/usr/share/dict/linux.words" file.
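
The three parts can also work together on numeric data. This sketch, using three made-up input numbers and a hypothetical variable "sum", initializes in BEGIN, accumulates in the main block, and reports in END:

```shell
out=$(printf '%s\n' 10 20 30 | awk '
BEGIN { sum = 0 }                  # runs once, before any input
      { sum += $1 }                # runs for every record
END   { print "lines =", NR, "sum =", sum, "avg =", sum / NR }')
echo "$out"    # lines = 3 sum = 60 avg = 20
```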


Output functions print and printf

print:
In awk, if you only have a pattern (condition), the {Actions} part can be omitted. When it is omitted, the default action is print $0, which prints the entire record (line). For example:

awk '/regex1/,/regex2/{print $0}' file
awk '/regex1/,/regex2/{print}' file
awk '/regex1/,/regex2/' file (Equivalent to sed: sed '/regex1/,/regex2/!d' file)

All three commands are equivalent: each prints every line from a line matching regex1 through the next line matching regex2 in the input file.

Example:

$ awk '/^ayy*/,/^azz*/' /usr/share/dict/linux.words ←List all words starting with ay to az in the dictionary
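
For a system without that dictionary file, the range form can be tried on a few made-up lines; the markers START and END below are arbitrary strings, not awk keywords:

```shell
# Print from the first line matching START through the next line matching END.
out=$(printf '%s\n' intro START one two END outro | awk '/START/,/END/')
echo "$out"
# START
# one
# two
# END
```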

The comma " ," inside print represents the Output Field Separator (OFS), which is a space by default but can be changed.

For example:

$ awk 'BEGIN {print "hello","awk"}' ←Output: hello awk
hello awk
$ awk 'BEGIN {OFS="<-->";print "hello","awk"}' ←Output: hello<-->awk
hello<-->awk

The default Output Record Separator (ORS) for print is a newline. You can change it using the built-in variable ORS. For example, to output data in DOS/Windows format:
awk 'BEGIN {ORS="\r\n"}{print}' unix_file > dos_file can simulate the unix2dos command.

The default numeric output format for print is "%.6g", which prints a number with up to 6 significant digits, in either floating-point or scientific notation. Integer values are printed as integers regardless of this setting. You can change this format using the built-in variable "OFMT". For example:

Example:

$ awk 'BEGIN{print 0123456789.0123456789}' ←The default output is in 6-digit scientific notation
1.23457e+08
$ awk 'BEGIN{OFMT="%.3f";print 0123456789.0123456789}' ←Change the floating-point output to 3 decimal places
123456789.012
$ awk 'BEGIN{OFMT="%d";print 0123456789.0123456789}' ←Change the output to an integer
123456789
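
A smaller number makes the effect of "OFMT" easier to read. In this sketch (the values are arbitrary) the non-integer is reformatted, while the integer 42 is printed unchanged:

```shell
default_fmt=$(awk 'BEGIN { print 3.14159265 }')
two_places=$(awk 'BEGIN { OFMT = "%.2f"; print 3.14159265, 42 }')

echo "$default_fmt"   # 3.14159
echo "$two_places"    # 3.14 42
```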

printf:
To have more control over the output format, awk provides a printf function similar to the one in C.

For example:

$ awk 'BEGIN{ printf ("%d %s %1.2f\n",2,"Cheeseburgers",4.699)}'
2 Cheeseburgers 4.70

If you are not familiar with C, the format inside printf might look unfamiliar. The placeholders (indicated by "%") inside the double quotes are used to specify the data types and width of the output. The printf function does not use the Output Record Separator (ORS), so you need to manually add a \n for a new line.

The letter following "%" in a placeholder such as "%d" or "%s" indicates the data type for output. Some commonly used placeholders and data types are:

awk printf format
Symbol	Data Type
%c	ASCII character
%d	Integer
%e	Scientific notation
%f	Floating-point
%g	Automatic choice between scientific notation and floating-point
%o	Octal
%s	String
%x	Hexadecimal

In addition to specifying the data type for the output, you can also specify the width of the data. For example, in the format "%1.2f", "1" is the minimum width of the whole output field, and "2" is the number of digits after the decimal point (the dot "." separates the width from the precision). If the width is omitted, the system will determine it, as shown in the examples below:

awk printf format for width
Symbol	Meaning
%f	Floating-point with no specified width (system default: 6 decimal places)
%3d	Integer in a field at least 3 characters wide
%.2f	Floating-point with 2 decimal places
%2.f	Floating-point in a field at least 2 characters wide, with no decimal places

Example:

$ awk 'BEGIN{ printf ("%f \n",4.699)}' ←Output: 4.699000 (default width)
4.699000
$ awk 'BEGIN{ printf ("%.2f \n",4.699)}' ←Output: 4.70 (two decimal places)
4.70
$ awk 'BEGIN{ printf ("%2.f \n",4.699)}' ←Output: 5 (rounded to the nearest integer, width 2)
 5
$ awk 'BEGIN{ printf ("%3d \n",4.699)}' ←Output: 4 (integer part, no rounding, width 3)
  4

By default, the output is right-justified. To left-justify the output, you can use the "-" flag, written as "%-":

Example:

$ echo 65 66| awk '{printf ("%10c%10c \n",$1,$2)}' ←Output ASCII 65 & 66 with a width of 10 characters (right-aligned)
         A         B
$ echo 65 66| awk '{printf ("%10c%-10c \n",$1,$2)}' ←Force the second character to be left-aligned
         AB
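
Width, precision, and the "-" flag are typically combined to line up columns. The sketch below uses two made-up records (a name and a price) and adds "|" characters only to make the field edges visible:

```shell
# %-8s  : string, left-justified in 8 characters
# %6.2f : number, right-justified in 6 characters with 2 decimal places
out=$(printf '%s\n' 'apple 1.5' 'kiwi 12.25' | awk '{ printf("%-8s|%6.2f|\n", $1, $2) }')
echo "$out"
# apple   |  1.50|
# kiwi    | 12.25|
```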


Associative Arrays
All arrays in awk are "Associative Arrays": their indices are strings, unlike traditional C arrays, which only use numeric indices (a numeric index in awk is simply converted to a string). Associative arrays may feel different to users familiar with traditional programming languages, but they are a distinctive and powerful aspect of awk.

So, what exactly are associative arrays? In an associative array, strings are used as keys to access the corresponding values. Imagine an associative array as an Excel worksheet, where the indices are represented by strings such as "A1," "A2," "B1," "B2," and so on. In awk, you can write to an associative array using the syntax: array_name[index_string] = value.

For example, consider the following two entries written to the associative array "color" (you don't need to declare or define its size in advance, and you can use it directly):

color["RED"] = 2.1
color["BLUE"] = "TV"

We can imagine this as an Excel worksheet (although it's only one-dimensional) with the following contents:

RED	BLUE	←index_string
2.1	"TV"	←value

To retrieve the content of a cell in the associative array, the format is "array_name[index_string]", for example, "color["RED"]". The following example demonstrates writing into the associative array and printing its content:

$ awk 'BEGIN{color["RED"]=2.1;color["BLUE"]="TV";print color["RED"],color["BLUE"]}'
2.1 TV ←The content of color["RED"] is "2.1" and the content of color["BLUE"] is "TV"

While in Excel, you can visually see which cells hold data, how can you know how much data and which index strings are stored in the associative array? Awk provides the following syntax to traverse the entire array: for (index_variable in array) do something with array[index_variable].

Using the example above, if we want to print the entire content of the "color" associative array, the code would be written as "for (i in color) print i, color[i]", and it is implemented as follows:

$ awk 'BEGIN{color["RED"]=2.1;color["BLUE"]="TV";for (i in color) print i,color[i]}'
BLUE TV
RED 2.1

In the above example, the command "for (i in color)" (variable "i" can be named differently) automatically searches the entire "color" array. If there are elements in the array, it will store the index string in the variable "i". Therefore, in the above example, "print i" outputs the index string, and "print color[i]" outputs the content associated with that index string. (One thing to note is that "for (i in color) print i" outputs in a random order).
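
When a predictable order is needed, one common approach is to pipe the output of the loop through the sort command. This is a sketch, with a third made-up entry "GREEN" added to the "color" array; LC_ALL=C is set only so that the sort order does not depend on the locale:

```shell
out=$(awk 'BEGIN {
    color["RED"] = 2.1; color["BLUE"] = "TV"; color["GREEN"] = 7
    for (k in color)              # visits every index, in no fixed order
        print k, color[k]
}' | LC_ALL=C sort)
echo "$out"
# BLUE TV
# GREEN 7
# RED 2.1
```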

Now, let's explore a practical application of associative arrays. Suppose we have a text file named "parts.db" that contains information about selectable colors in a computer store's peripherals:

KEYBOARD white black
MOUSE blue red black white yellow
CASE black
MONITOR white silver red

I want to count the occurrence of each color, and I can easily accomplish this using associative arrays in awk, as shown in the following example:

$ cat awk_scr4 ←program to count the occurrence of each color
{
    for( i=2; i<=NF; i++ )
         color[$i]++ #←equivalent to color[$i] = color[$i] +1
}
END {
      for( j in color )
         printf( "%10s %d \n", j, color[j] )
    }
$ awk -f awk_scr4 parts.db ←execute "awk_scr4" to count occurrences in the file "parts.db"
       red 2
     white 3
     black 3
      blue 1
    silver 1
    yellow 1

How to interpret the program? The code segment "for( i=2; i<=NF; i++)" starts the loop from 2 because the field "$1" does not contain colors. The loop runs for each field in a row, and since the number of fields is not fixed, the built-in variable "NF" is used to control the loop.

In the loop, "color[$i]++" is used to count occurrences. For example, when the program reads the first row of "parts.db" and "$2" contains the string "white," the expression "color["white"]++" is executed, and the value of "color["white"]" becomes 1. Then, when the program reads the second row and field 5 "$5" also contains the string "white," "color["white"]++" is executed again, and the value of "color["white"]" becomes 2. This process continues, allowing us to count the occurrence of each string.

This example demonstrates how associative arrays can simplify the task. Without using associative arrays, the program would be longer and more complex.

When introducing User-defined variables, I used the multiplication table to demonstrate two-dimensional arrays. However, awk does not directly support two-dimensional arrays. Instead, it cleverly uses associative arrays to simulate them. For example, a two-dimensional array element "arrayA[3,7]" is converted into the string-indexed element "arrayA["3\0347"]", where the separator "\034" is the default value of the built-in variable "SUBSEP". If this conflicts with the data you are processing, you can set "SUBSEP" to some other value.

The following example shows an experiment with a two-dimensional array, which is actually an associative array.

$ awk 'BEGIN{arrayA[3,7]="INDIGO";print arrayA["3\0347"];print arrayA[3,7]}'
INDIGO ←arrayA[3,7] is equivalent to arrayA["3\0347"], so the output is the same
INDIGO
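
Because the two subscripts are stored as one string, a "for ... in" loop returns the combined index. A minimal sketch of recovering the separate subscripts with the standard split() function follows; the array name "grid" and the helper array "part" are made up for this example:

```shell
out=$(awk 'BEGIN {
    grid[3, 7] = "INDIGO"
    for (key in grid) {
        split(key, part, SUBSEP)      # part[1] = "3", part[2] = "7"
        print part[1], part[2], grid[key]
    }
    if ((3, 7) in grid)               # membership test with two subscripts
        print "found"
}')
echo "$out"
# 3 7 INDIGO
# found
```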

delete: Deleting arrays
Arrays, whether traditional or associative, consume memory resources, so it is useful to be able to delete their contents when necessary. You can delete an array or specific elements from an array using the following syntax:

Instruction	Note
delete array_name	Delete the entire array
delete array_name["string"]	Delete a cell (one-dimensional) in an associative array
delete array_name[2,3]	Delete a cell (two-dimensional) in an associative array
delete array_name[10]	Delete a cell (one-dimensional) with a numeric index

Example:

$ cat awk_scr5
{
    for( i=2; i<=NF; i++ )
         color[$i]++
    delete color ["yellow"] # ←Delete a cell "color["yellow"]" from the associative array
}
. . .
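
Whether an element still exists after "delete" can be tested with the "in" operator. This is a self-contained sketch with two made-up entries, separate from the script above:

```shell
out=$(awk 'BEGIN {
    color["red"] = 2; color["yellow"] = 1
    delete color["yellow"]            # remove one element only
    if ("yellow" in color)
        print "yellow present"
    else
        print "yellow deleted"
    if ("red" in color)
        print "red present"
}')
echo "$out"
# yellow deleted
# red present
```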


Executing system commands using system
awk has gained popularity for its ability to easily execute system commands and utilize pipelines and redirection. In the following example, we modify the script from "awk_scr4" to redirect the computed result to a file.

Example:

$ cat awk_scr6
BEGIN { #←BEGIN block
        outfile = "result"
      }
{ #←Main program block
    for( i=2; i<=NF; i++ )
         color[$i]++
}
END { #←END block
      for( j in color )
         printf( "%10s %d \n", j, color[j] ) > outfile #←Redirect the result to a file

      print "***** Result Statistics *****" > outfile #←Redirect the output to the file
    }
$ awk -f awk_scr6 parts.db ←Execute "awk_scr6" script (file "parts.db" follows the Associative Arrays example)
$ cat result ←View the file "result"
       red 2
     white 3
     black 3
      blue 1
    silver 1
    yellow 1
***** Result Statistics *****

In the above example, the redirection ">" in awk works differently than in the shell. The first time awk writes to a file with ">", an existing file of that name is emptied (truncated) before writing. However, subsequent output to the same file with ">" in the same run is appended, just as with ">>".
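
This behavior can be checked with a temporary file that already holds some text. In this sketch the file is created with mktemp, and its name is passed to awk through the standard "-v" option as a variable named "out" (both the variable name and the sample text are assumptions for illustration):

```shell
tmpfile=$(mktemp)
echo 'old content' > "$tmpfile"

# First ">" empties the file; second ">" in the same run appends.
awk -v out="$tmpfile" 'BEGIN { print "first" > out; print "second" > out }'

result=$(cat "$tmpfile")
rm -f "$tmpfile"
echo "$result"
# first
# second
```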

If you want to execute a system command inside awk to display the output on the screen, you can simply use the "system" command. The modified "awk_scr7" script below demonstrates this:

Example:

$ cat awk_scr7
... (BEGIN and main program blocks are the same as in "awk_scr6")

END {
      for( j in color )
         printf( "%10s %d \n", j, color[j] ) > outfile

      print "***** Result Statistics *****" > outfile
      system ("cat "outfile) # ←Execute the system command "cat"
    }
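
Some awk implementations flush pending output before system() runs the command, but relying on that is an assumption. A more cautious sketch closes the output file first and also checks the value returned by system(), which is the exit status of the command. The file name and data below are placeholders.

```shell
tmp=$(mktemp -d)
out=$(awk -v outfile="$tmp/result" 'BEGIN {
    print "red 2" > outfile
    close(outfile)                  # make sure the data is written before cat reads the file
    rc = system("cat " outfile)     # rc holds the exit status of the command
    if (rc != 0) print "cat failed with status " rc
}')
echo "$out"
rm -r "$tmp"
```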


Closing files/pipelines using close
The examples 'awk_scr6' and 'awk_scr7' create a file with "> outfile". Files opened this way should be closed with close("file") once you are done writing to them. Otherwise, there may be unexpected bugs (similar to leaving a door open while going out, which might result in unpredicted consequences like theft).

Why do we need to close files? When awk opens a file for output, it internally establishes a pointer that links to that file. This is why the symbol > in example 'awk_scr6' acts as both redirection and append redirection: the first > creates the file and establishes the pointer, and as long as that link exists, later > writes to the same file are appended. The command close("file") severs the file's pointer link. If awk writes to many files without using close("file") properly, it becomes hard to tell which files are still open and whether a given > will recreate a file or append to it.

A simple way to tell whether a file is still open is to look at what ">" does. While the file is open, ">" appends to it. Once the file has been closed, ">" creates the file again and discards its previous contents. The following examples 'awk_scr8' and 'awk_scr9' demonstrate this concept.

Example:

$ cat awk_scr8 # ←Example: File awk_scr8
BEGIN {
        print "abc" > "fileA" # ←Creates file "fileA"
        print "123" > "fileA" # ←"fileA" is still open, so this appends to it
      }
$ awk -f awk_scr8 # ←Executing awk_scr8
$ cat fileA
abc
123

$ cat awk_scr9 # ←Example: File awk_scr9
BEGIN {
        print "abc" > "fileA" # ←Creates file "fileA"
        close ("fileA") # ←Closes "fileA" (severs the file's pointer link)
        print "123" > "fileA" # ←Pointer link is severed, so this creates "fileA" again
      }
$ awk -f awk_scr9 # ←Executing awk_scr9
$ cat fileA
123  ←The later result overwrites the previous one

Using "close"
close has two uses:

close("file"): Closes the specified file.
close("command"): Closes the pipeline opened to the specified command.

The second usage is shown in the example 'awk_scr10', which is a modification of 'awk_scr9' with an additional pipeline that uses the tr command to convert lowercase to uppercase. When closing a pipeline, the string passed to close must match the command string used to open the pipeline exactly. Otherwise, awk treats it as a different pipeline.

For example, the two forms of the tr command, tr 'a-z' 'A-Z' and tr '[:lower:]' '[:upper:]', have the same meaning, which is converting lowercase characters to uppercase characters. However, for awk's close function, these two forms are treated differently.

Example:

$ cat awk_scr10
BEGIN {
        print "abc" | "tr 'a-z' 'A-Z' > fileA" # ←Outputs to a pipeline that uses tr to convert lowercase to uppercase
        close ("tr 'a-z' 'A-Z' > fileA") # ←Close the pipeline ("command") by repeating the whole command string
        system ("echo '123' >>" "fileA") # ←The pipeline is closed, so append to fileA with the shell's '>>'
      }
$ awk -f awk_scr10 ←Executing awk_scr10
$ cat fileA
ABC
123

In the example above, if you are unsure during debugging whether the close call is correct, you can print its return value with print close("tr 'a-z' 'A-Z' > fileA"). A non-zero value indicates a problem with the close call (possibly a typo in the command string). After debugging, you can remove the print statement.
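
One way to avoid a mismatch between the command string used for the pipeline and the one given to close is to keep the command in a variable. The sketch below uses sort as a stand-in command (an assumption for illustration). sort cannot print anything until its input ends, so close(cmd) is also what makes its output appear before the final line.

```shell
out=$(awk 'BEGIN {
    cmd = "sort"              # the command string is stored once and reused
    print "pear"  | cmd
    print "apple" | cmd
    close(cmd)                # sort receives end of input, prints its result, and exits
    print "--- done ---"
}')
echo "$out"
```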


Reading input data using getline
In awk, if you want to read multiple files, you can use awk file1 file2. However, if you want to read the output of multiple system commands simultaneously (e.g., the output of both ls and cat commands), how can you achieve that? For this purpose, awk provides the "getline" command to read the output of system commands or data files (primarily used to read the output of system commands).

When used alone, getline reads the next line from the current input and stores it in "$0" (the fields are split again accordingly). If written in the main program block, it reads the line after the current one, because the main program has already read the current line. For example:

Example:

$ seq 1 10 | awk 'BEGIN{getline;print}' ← Reads one line at a time
1
$ seq 1 10 | awk '{getline;print}' ←As getline is written in the main program, it reads the next line, hence every other line is skipped in the output
2
4
6
8
10

Reading one line at a time may not seem very useful, so in practical applications, we use a loop to read all the data. But how do we know the loop's termination condition? The getline command returns a value after each read operation, with the following meanings:

getline read result	Return value
Success	1
Failure	-1
End Of File (EOF)	0

In the following example, we print the return value of getline:

$ seq 1 2 | awk 'BEGIN{print getline; print getline; print getline}'
1 ←getline read successfully ($0=1), return value = 1
1 ←getline read successfully ($0=2), return value = 1
0 ←getline reached the end of the input (nothing left to read), return value = 0

Since getline returns 1 when it reads a record successfully, we can use a while loop to repeatedly execute getline and read the entire data. For instance:

Example:

$ seq 1 3 | awk 'BEGIN{while (getline) print}' ←As `getline` will return 0 at the end of the file, it exits the while loop.
1
2
3

getline can not only read from the current file but can also be used with pipelines or redirections to read data from data files or the output of specific commands. The possible formats are:

getline [var]	Standalone usage; reads the next line from the current input
getline [var] < "FILE"	Reads data from a file
"COMMAND" | getline [var]	Reads data from the output of a command

Here, "var" is an optional user-defined variable. Without "var", the line that was read is stored in "$0". With "var", the line is stored in "var" instead. For example, "getline cell" assigns the line to the variable "cell" and leaves "$0" unchanged.
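
As a small illustration (the variable name "nxt" is arbitrary), reading into a variable makes it possible to handle two consecutive lines at once, since "$0" still holds the current line:

```shell
out=$(seq 1 4 | awk '{ getline nxt; print $0, nxt }')   # $0 = current line, nxt = following line
echo "$out"
```

Each output line pairs a line with the one after it: "1 2" and "3 4".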

In the second format, getline [var] < "FILE", if "FILE" is replaced with a hyphen, as in getline < "-", getline reads from standard input, which allows interactive input to the program.

In the third format, "COMMAND" | getline [var], getline reads data from the output of the specified command. For example, instead of using a shell pipeline such as seq 1 10 | awk 'BEGIN{while (getline) print}', you can write awk 'BEGIN{while ("seq 1 10" | getline) print}' to read the output of the command seq 1 10 directly.
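
Because getline returns -1 on failure, a loop written as "while (getline ...)" would never end if the file could not be opened. A commonly used, more defensive form compares the return value with 0. The sketch below applies it to the second and third formats; the temporary file and the path "/no/such/file" are placeholders.

```shell
tmp=$(mktemp)
printf 'alpha\nbeta\n' > "$tmp"
out=$(awk -v file="$tmp" 'BEGIN {
    while ((getline line < file) > 0)      # second format: read from a file
        print "file: " line
    close(file)

    cmd = "seq 1 2"
    while ((cmd | getline num) > 0)        # third format: read from a command
        print "cmd: " num
    close(cmd)

    while ((getline line < "/no/such/file") > 0)   # getline returns -1, so the body never runs
        print "not reached"
}')
echo "$out"
rm -f "$tmp"
```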

The following example reads the output of two system commands, ls -F and ls -A. It lists the empty directories in the working directory. The program first uses the command ls -F to output the file names through a pipeline to awk. If the filename indicates a directory, the program uses getline to read the output of ls -A for that directory to check if it is empty.

Example: (Detecting empty directories in the working directory)

$ cat awk_scr11
/\/$/ { #←Pattern, equivalent to "if ($0 ~ /\/$/)": the block runs only if the filename is a directory; otherwise awk moves on to the next filename
    DirName = $0
    ListCount = 0
    cmd = "ls -A " DirName

    while ((cmd | getline) > 0) # ←Use getline to read the output of the system command "ls -A"
        ListCount++
    close(cmd) # ←Close the pipeline when done

    if (ListCount == 0) # ←If ListCount=0, it is an empty directory
        print "Directory --> " DirName " is empty"
}
$ ls -F | awk -f awk_scr11
Directory --> dir2/ is empty
Directory --> Documents/ is empty
Directory --> Download/ is empty

The program segment "/\/$/" uses regular expressions to filter the output from ls -F (if the filename is a directory, it will have a trailing "/" character, e.g., "Documents/", and the subsequent actions will be performed on that filename).
The "while" loop around getline ends as soon as getline returns 0 (End Of File). For an empty directory, "ls -A" prints nothing, so the loop body never runs and the variable "ListCount" stays 0, indicating an empty directory.


Mathematical functions
In addition to basic arithmetic operations, awk also provides the following useful mathematical operators:

Operator	Description	Example	Example Result
%	Modulo	7%5	2
^	Exponentiation	2^3	8

If these operators are not enough, awk also provides convenient mathematical functions. Each function returns a computed result that can be assigned to a variable, as in "A=int(3.8)", or printed directly, as in "print int(3.8)".

The following table lists the mathematical functions supported by awk, where "x" and "y" are the input values:

Math Function Name	Description	Example	Example Result
sin( x )	Sine; where x is in radians (radians = degrees/180 * PI)	sin(90/180*3.1416)	1
cos( x )	Cosine; where x is in radians (radians = degrees/180 * PI)	cos(180/180*3.1416)	-1
atan2( y, x )	Arc-tangent of y/x; returns the angle in radians	atan2(30,45)	0.588003
exp( x )	e^x	exp(1)	2.71828
log( x )	Natural logarithm (base e)	log(5)	1.60944
sqrt( x )	Square root	sqrt(9)	3
int( x )	Integer value (truncates the decimal part)	int(5.6)	5
rand( )	Random number generator, where 0 <= rand() < 1
srand( [x] )	Initializes rand(), where x is the random seed (if omitted, the current date and time will be used as the seed)

Most mathematical functions are straightforward (don't ask me about math; I returned it to my teacher a long time ago). This section only introduces some functions that are prone to errors or have special considerations.

The function "rand( )" is a random number generator that returns a number greater than or equal to 0 and less than 1. For example, you can use a "for" loop to execute "rand( )" ten times:

Example:

$ awk 'BEGIN{for (i=1;i<=10;i++) print rand()}'
0.237788
0.291066
0.845814
0.152208
0.585537
0.193475
0.810623
0.173531
0.484983
0.151863

Though the results may seem random, in many awk implementations (gawk, for example) running the same statement again produces the same sequence. rand() is a pseudo-random generator: it starts from a fixed default "seed" and computes each result from the previous one. To eliminate this predictable behavior, you can use "srand( )" to change the random seed. The following example demonstrates the usage of "srand( )".

Example:

$ awk 'BEGIN{srand();for (i=1;i<=10;i++) print rand()}'
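
Conversely, passing an explicit seed makes a run repeatable, which can help when testing a script. A small sketch (the seed value 42 is arbitrary):

```shell
a=$(awk 'BEGIN { srand(42); print rand(), rand() }')
b=$(awk 'BEGIN { srand(42); print rand(), rand() }')
# With the same seed and the same awk, both runs yield the same numbers.
[ "$a" = "$b" ] && echo "same sequence"
```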

Why is the range of the random number generator "0 <= rand( ) < 1" ? Because it is easy to apply to any desired range. For instance, if you want to use awk to randomly select numbers for a lottery, such as the "Lotto 6/49," you can use "rand( ) * 49", truncate the decimal part using "int( )", and then add 1:

Example: (Lotto 6/49 number generator)

$ awk 'BEGIN{srand();for (i=1;i<=6;i++) print int(rand()*49)+1}'
Output omitted
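
Note that the one-liner above can print the same number more than once. If six different numbers are needed, one possible approach (a sketch, not the only way) is to record the picks in an associative array and draw again on a repeat. The names "picked" and "count" are illustrative.

```shell
out=$(awk 'BEGIN {
    srand()
    while (count < 6) {
        n = int(rand() * 49) + 1        # integer from 1 to 49
        if (!(n in picked)) {           # keep only numbers not drawn before
            picked[n] = 1
            count++
        }
    }
    for (n in picked) print n           # order of "for (n in ...)" is unspecified
}')
echo "$out"
```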

If there is a mathematical operation that awk's built-in functions do not support, you can use getline along with an external command to obtain the result. The following example uses the echo command together with the shell's arithmetic expansion "$((2#1100))" (supported by bash) to convert the binary number 1100 to decimal and stores the result in the variable "dec".

Example:

$ awk 'BEGIN{"echo '$((2#1100))' " | getline dec ;print dec}'
12


String functions
awk is very friendly to string operations. To concatenate two strings, simply place them together with a space in between. For example, the two strings "123" and "abc" can be combined into the new string "123abc".

Example:

$ awk 'BEGIN{str1="123";str2= str1 "abc"; print str2}'
123abc
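
Since awk variables are typeless, the same values can be concatenated or added depending on the operator. The following sketch (values chosen arbitrarily) shows the difference:

```shell
out=$(awk 'BEGIN {
    a = "12"; b = "3"
    print a b          # concatenation
    print a + b        # arithmetic: the strings are converted to numbers
    print (a b) + 1    # concatenate first, then add
}')
echo "$out"
```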

However, string operations are not limited to simple concatenation. awk provides the following functions for more advanced string operations:

String Function Name	Description	Example	Example Result
sub(regex, replace [,string])	Replace the first occurrence of a substring with a new substring	st1="google goooogle"; sub(/go+g/,"YAHOO",st1)	Returns 1; st1="YAHOOle goooogle"
gsub(regex, replace [,string])	Replace all occurrences of a substring with a new substring	st1="google goooogle"; gsub(/go+g/,"YAHOO",st1)	Returns 2; st1="YAHOOle YAHOOle"
index(string, substring)	Find the position of a substring in a string	index("this","is")	3
match(string, regex)	Find the position and length of the first occurrence of a pattern in a string	match("123xyzxyzxyz456",/(xyz)+/)	RSTART=4 RLENGTH=9
length [(string)]	Get the length of a string	length("yahoo")	5
substr(string, index [,length])	Extract a substring from a string	substr("12345678",3,4)	"3456"
split(string, Array [,regex])	Split a string into an array using a delimiter	split("abc:de-fgh",arrA,/[:-]/)	arrA[1]="abc" arrA[2]="de" arrA[3]="fgh"
tolower(string)	Convert uppercase letters to lowercase	tolower("Yahoo! 123")	"yahoo! 123"
toupper(string)	Convert lowercase letters to uppercase	toupper("Yahoo! 123")	"YAHOO! 123"
sprintf(format, data1, data2 ...)	Format data as a string similar to printf	sprintf("%.4f",3.141592654)	3.1416

The function names alone do not always make it clear how a string function is used, and some are hard to explain without practical examples. Therefore, each string function in the table above is given a simple explanation and a practical test below.

sub(regex, replace [,string]): Replace the first occurrence of a substring with a new substring.
This function is similar to sed 's/Regex/Replace/'. It searches "string" (or "$0" if "string" is omitted) for the regular expression "regex" and replaces the first match with "replace". Only one replacement is made, and the function returns the number of replacements (0 or 1).

Example:

$ echo "google goooogle" |awk '{sub(/go+g/,"YAHOO");print }' ← If the string "google goooogle" matches the regular expression "go+g", it will be replaced with "YAHOO" once.
YAHOOle goooogle

gsub(regex, replace [,string]): Replace all occurrences of a substring with a new substring.
Similar to sed 's/Regex/Replace/g'. It works like "sub( )", except that it replaces every match in the string.

Example:

$ awk 'BEGIN{st1="google gooooogle";print gsub(/go+g/,"YAHOO",st1);print st1}'
2 ← Return value of gsub(): the number of replacements made
YAHOOle YAHOOle ← Replaced result
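
In the replacement string of sub() and gsub(), the character "&" stands for the text that was matched. A short sketch using the same sample string:

```shell
out=$(echo "google goooogle" | awk '{ gsub(/o+/, "[&]"); print }')   # wrap every run of "o" in brackets
echo "$out"
```

This prints "g[oo]gle g[oooo]gle".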

index(string, substring): Find the position of a substring in a string
Returns the position of the first occurrence of the substring in the string. If not found, it returns 0.

Example:

$ echo 'this' | awk '{print index($0,"is")}'
3

match(string, regex): Find the position and length of the first occurrence of a pattern in a string
Similar to "index()", but uses the regular expression "regex" to match the string. The position and length of the match are recorded in the built-in variables "RSTART" and "RLENGTH".

Example:

$ echo '123xyzxyzxyz456' | awk '{match($0,/(xyz)+/); print RSTART,RLENGTH}'
4 9
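
"RSTART" and "RLENGTH" are often passed straight to substr() to pull out the matched text. A minimal sketch with a made-up input line:

```shell
out=$(echo 'order id=4711 shipped' | awk '{
    if (match($0, /[0-9]+/))                  # match() returns 0 when nothing matches
        print substr($0, RSTART, RLENGTH)     # extract exactly the matched part
}')
echo "$out"
```

This prints "4711".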

length [(string)]: Get the length of a string
If the string is omitted, it returns the length of "$0" by default.

Example:

$ echo 'yahoo' | awk '{print length()}'
5

substr(string, index [,length]): Extract a substring from a string
Returns a substring starting from "index" with a length of "length". If "length" is omitted, it extends to the end of the string.

Example:

$ echo '123456789' | awk '{print substr($0,3,4)}'
3456
$ echo '123456789' | awk '{print substr($0,3)}'
3456789
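
split(string, Array [,regex]): Split a string into an array using a delimiter
Splits the string into the elements Array[1], Array[2], and so on, and returns the number of elements. If the last argument "[regex]" is omitted, the field separator "FS" is used (by default, spaces or tabs). Because the order of "for (i in Array)" is not guaranteed, the example loops from 1 to the returned count.

Example:

$ echo "abc de fgh" | awk '{n=split($0,arrayA);for (i=1;i<=n;i++) print arrayA[i]}'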
abc
de
fgh
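
The optional third argument can be sketched with the string from the table above, using the regular expression /[:-]/ so that both ":" and "-" act as delimiters:

```shell
out=$(awk 'BEGIN {
    n = split("abc:de-fgh", arrA, /[:-]/)    # n is the number of pieces
    for (i = 1; i <= n; i++)
        print i, arrA[i]
}')
echo "$out"
```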

tolower(string): Convert uppercase letters to lowercase

toupper(string): Convert lowercase letters to uppercase

Example:

$ awk 'BEGIN{print tolower("Yahoo! 123")}'
yahoo! 123
$ awk 'BEGIN{print toupper("Yahoo! 123")}'
YAHOO! 123

sprintf(format, data1, data2 ...): Format data as a string similar to "printf"
Works like printf, but instead of printing the result, it returns the formatted string, which can be assigned to a variable.
In the following example, "%.4f" rounds the number to four decimal places.

Example:

$ echo '3.141592654' | awk '{new=sprintf("%.4f",$0);print new}'
3.1416
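
Other printf-style formats work the same way. As an illustrative sketch (the values are arbitrary), "%05d" pads a number with zeros and "%-6s" left-aligns a string in a six-character column:

```shell
out=$(awk 'BEGIN {
    id   = sprintf("%05d", 42)       # zero-padded to width 5
    name = sprintf("%-6s|", "ab")    # left-aligned, padded with spaces to width 6
    print id
    print name
}')
echo "$out"
```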


User-defined functions(UDFs)
User-defined functions (UDFs) allow you to create custom functions in awk when the built-in functions are not sufficient for your needs. The syntax for defining a UDF is as follows:

function name (para 1, para 2, para 3...)
{
body-of-function [return value]
}

UDFs in awk are similar to traditional C language functions, but they do not require explicit declaration and are typeless. To define a UDF, put the keyword "function" before the function name.

Here's an example of a simple user-defined function called "abs( )" that calculates the absolute value of a number:

Example:

$ cat awk_abs
{
     print abs($0) # ←Call the user-defined absolute value function 'abs()'
}

function abs (value) # ←User-defined absolute value function 'abs()'
{
     if (value < 0)
          value = value * (-1)

     return value # ← Use 'return' if there is a return value
}
$ echo "-13.38" | awk -f awk_abs
13.38
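
Variables used inside a function are global unless they appear in the parameter list. A common awk idiom is therefore to list extra parameters that the caller never passes and use them as local variables. The sketch below illustrates this; the function name "max_field" and the input are made up.

```shell
out=$(echo "3 9 4" | awk '
function max_field(    i, m) {    # i and m are extra parameters used as local variables
    m = $1
    for (i = 2; i <= NF; i++)
        if ($i > m) m = $i
    return m
}
{
    i = "untouched"               # global i, not affected by the local i in the function
    print max_field(), i
}')
echo "$out"
```

This prints "9 untouched", showing that the loop counter inside the function did not overwrite the global "i".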


[Note] A bare pattern such as "string ~ /regex/" is treated by awk as "{if (string ~ /regex/) print}". A pattern must stand outside any "{ }" block, so when writing an external script file, do not wrap the pattern in an outermost "{ }". For example, awk '$8 ~ /pr*e/' behaves like awk '{if ($8 ~ /pr*e/) print}'.

This is an important distinction to be aware of when using awk's pattern matching in script files. The expression "string ~ /regex/" is a shorthand way of writing an if statement that checks if "string" matches the regular expression "regex" and then executes the print statement.
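
The equivalence can be checked with a small made-up input, where the second field is tested against /pr*e/:

```shell
a=$(printf 'x pre\ny post\nz pe\n' | awk '$2 ~ /pr*e/')                   # pattern only
b=$(printf 'x pre\ny post\nz pe\n' | awk '{ if ($2 ~ /pr*e/) print }')    # explicit if
echo "$a"
[ "$a" = "$b" ] && echo "both forms print the same lines"
```

Both forms print the lines "x pre" and "z pe", since "r*" also matches zero occurrences of "r".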