Awk is an interpreted language that heavily borrows syntax from the C programming language, incorporating the essence of C's text processing and output formatting capabilities. Additionally, awk supports features that the original C language lacks, such as matching with regular expressions and the use of Associative Arrays.
As a result, the most significant difference between awk and C lies in their application: C is a general-purpose programming language with numerous complex constructs and rich syntax, while awk is compact and concise, particularly well suited to handling textual data records and text formatting. Awk enjoyed great popularity in the 1980s; around the 1990s it gradually yielded ground to another general-purpose interpreted language, Perl.
For example, consider the output of ls -l, which lists detailed file information. The output contains 8 fields, separated by spaces:
$ export TIME_STYLE=long-iso   ←Set the time format (environment settings may affect the output format of "ls -l")
$ ls -l
drwxr-xr-x 2 aaa aaa 4096 2011-09-07 11:44 Desktop
drwxr-xr-x 2 aaa aaa 4096 2011-09-07 11:44 Documents
drwxr-xr-x 2 aaa aaa 4096 2011-09-07 11:44 Music
drwxr-xr-x 2 aaa aaa 4096 2011-09-07 11:44 Pictures
drwxr-xr-x 2 aaa aaa 4096 2011-09-07 11:44 Public
    ↑      ↑  ↑   ↑    ↑       ↑        ↑      ↑
   $1     $2 $3  $4   $5      $6       $7     $8   ←field variables
Variable | Content | Note |
$0 | drwxr-xr-x 2 aaa aaa 4096 2011-09-07 11:44 Desktop | "$0" contains the entire line's string |
$1 | drwxr-xr-x | string from field=1 |
$2 | 2 | string from field=2 |
$3 | aaa | string from field=3 |
$4 | aaa | string from field=4 |
$5 | 4096 | string from field=5 |
$6 | 2011-09-07 | string from field=6 |
$7 | 11:44 | string from field=7 |
$8 | Desktop | string from field=8 |
The most special field variable is "$0," which contains the entire line's content, and when "$0" is modified, it automatically updates other field variables "$1" to "$N."
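A minimal sketch of this behavior, using hypothetical input: assigning a new string to "$0" makes awk re-split it, so "$1".."$N" and NF reflect the new content.

```shell
# Assigning to $0 re-splits the record into fields,
# so $1 and NF now describe the new string "x y".
echo 'a b c' | awk '{ $0 = "x y"; print $1, NF }'   # prints: x 2
```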
If we write only ls -l | awk '{}', there will be no output, because no output function and no field to output are specified. The most commonly used built-in output function in awk is "print". For example, to output only the size field (field 5) and the filename field (field 8) of the ls -l output:

$ ls -l | awk '{print $5,$8}'   ←Output only fields 5 and 8
4096 Desktop
4096 Documents
4096 Music
.
.
.
In the "print" function, the comma "," represents the Output Field Separator (OFS), and the default OFS is a space, so each comma in the print function outputs a space. You can try removing the comma and entering ls -l | awk '{print $5 $8}' to see the difference in output.
If there are multiple commands, they should be separated by a semicolon ";" or written on the next line. For example, ls -l | awk '{size=$5; file_name=$8; print size, file_name;}' (in this example, "size" and "file_name" are user-defined variables).
Once you understand the field variables "$N", it is easy to use them to change the output format, which is the most basic use of awk. In the example below, awk reworks the original output of ls:

$ ls -l | awk '{print "File",$8,"size =",$5,"Byte"}'
File Desktop size = 4096 Byte
File Documents size = 4096 Byte
File Music size = 4096 Byte
.
.
.
In the above example, you can add strings to the "print" function to output them. Strings to be added should be enclosed in double quotes.
One of the reasons awk was popular is its ability to perform calculations with field variables, such as "$3*base-1". Continuing the previous example, to display the file-size field in KiB, divide "$5" by 1024.
Example:
$ ls -l | awk '{print "File",$8,"size =",$5/1024,"KB"}'
File Desktop size = 4 KB
File Documents size = 4 KB
File Music size = 4 KB
.
.
.
$ ls -l | awk '{print "File",$8,"size =",$5/1024,"KB"}' > reformate.txt   ←Save the new output to the file "reformate.txt"
$ awk '{print $1*$2}' -   ←The trailing "-" represents standard input (keyboard)
3.14 1.41421   ←Enter any two numbers
4.44062   ←Output the product (press <Ctrl-D> to end)
$ cat awk_scr   ←Suppose there is an external script file named "awk_scr" with the following content
{print "File",$8,"size =",$5/1024,"KB"}
$ ls -l | awk -f awk_scr   ←Use the "-f" option to run the external script file "awk_scr"
$ cat awk_scr1   ←Suppose there is a file named "awk_scr1" with the following content
awk '{print "File",$8,"size =",$5/1024,"KB"}'
$ chmod +x awk_scr1   ←Give "awk_scr1" executable permission
$ ls -l | ./awk_scr1   ←Execute "awk_scr1"
While many tasks can be achieved using C language or shell scripts, the entry barrier for C language is higher, and for small tasks, using C might be overkill. On the other hand, shell scripts might fall short when it comes to text processing. However, for those skilled in awk scripting and with creativity, awk can almost entirely replace all filtering programs like grep, sed, tr, cut, etc. Moreover, it has calculation and statistical capabilities. If you need to process textual data records, awk is the first choice. Some have even tested that awk can be over 30 times faster than shell scripts for the same functionality. The syntax of awk scripting heavily borrows from C language syntax, so if you are already familiar with C/C++/Java, learning awk's scripting language will be relatively simple. However, if you are not familiar with C, it might be more challenging.
Readers can become proficient in awk scripting without knowing C language since awk scripting is much simpler. However, in this explanation, we assume that readers are already familiar with C language, so we won't specifically explain C language instructions and syntax. Instead, we will focus on the differences between awk and C language.
The structure of an awk program is mainly awk '[Pattern] [{Actions}]' [Files], where, in awk terminology, "Pattern" is not a regular-expression pattern but a condition, "{Actions}" represents the statements to be executed, and "Files" is the text data file to be processed. Besides files, awk can also process data from other commands through pipelines.
A Pattern is not always present. If it exists, the Actions will be executed when the Pattern is satisfied; otherwise, the Actions will not be executed. For example, to filter files based on size, the command can be written as ls -l | awk '$5 > 8192 {print $5, $8}'. It means that if the content of field 5 is greater than 8192, the "print $5, $8" action will be executed. If there is no Pattern, as in ls -l | awk '{print $5, $8}', the Actions "{print $5, $8}" will be executed regardless of any conditions.
"{Actions}" can also be omitted. When omitted, the default action is "print $0." For example, awk 'NR <= 5' /etc/passwd works like the "head" command. (In this example, "NR" is a built-in variable).
Awk's Patterns provide similar judgment syntax to C language, such as:
awk relational operators | |
Operator | Meaning |
== | equals |
!= | not equals |
> | greater than |
>= | greater than or equal to |
< | less than |
<= | less than or equal to |
&& | logical AND |
|| | logical OR |
awk regular-expression matching operators |
Operator | Meaning |
string ~ /regular expression/ [Actions] | If the string matches the regular expression, execute Actions |
string !~ /regular expression/ [Actions] | If the string does not match the regular expression, execute Actions |
/regular expression/ [Actions] | If the current input line matches the regular expression, execute Actions (omitting the string and the "~" operator matches $0 against the regular expression) |
!/regular expression/ [Actions] | If the current input line does not match the regular expression, execute Actions |
To match regular expressions, remember to enclose them in paired slashes "/". For example, in the command ls /etc | awk '$1 ~ /pr*e/', it means that if the content of field 1 in any line matches the regular expression "pr*e", then that line will be output.
If you omit both the string to match and the tilde symbol "~", the meaning can be considered as a search. In this case, if the current input line contains a match to the regular expression, the specified actions will be executed. For instance, awk '/colou*r/' file works similarly to the grep command.
The "{Actions}" part is not limited to just "print." Various commands and syntax similar to C language are also valid.
List of awk syntax:
if ( conditional ) statement [ else statement ]
while ( conditional ) statement
do { statement } while ( conditional )
for ( expression ; conditional ; expression ) statement
for ( variable in array ) statement
break
continue
{ [ statement ] ... }
variable = expression
[command |] getline [var] [< file]
print [ expression-list ] [ > expression ]
printf format [ , expression-list ] [ > expression ]
function ( )
next
exit
Reference Source: http://www.grymoire.com/Unix/Awk.html |
User-defined variables in awk are different from those in C language. In awk, variables do not require declaration and are typeless global variables.
For example, you can create a variable "score" without declaring it, and you can assign various values to "score" without specifying its type. The following examples are all valid:
score = 99 (integer)

The reason awk does not require variable type declarations like C (e.g., int x) is that awk treats all data as text and performs type conversion only when necessary. For instance, the commands awk 'BEGIN {print 3 * 7}' and awk 'BEGIN {print "3" * "7"}' both output the same result (the usage of BEGIN is described in the BEGIN and END section). However, it is still good practice to use double quotes (") to mark data as a string when you know you are dealing with strings rather than numbers.
For example:
$ cat awk_scr2
BEGIN {
    brand = "555"        #←Variable "brand" is the string "555"
    unit_price = 0.8     #←Variable "unit_price" is the numerical value 0.8
    dozen = 12           #←Variable "dozen" is the numerical value 12
    print brand, "cigarettes a dozen price=", unit_price * dozen
}
$ awk -f awk_scr2
555 cigarettes a dozen price= 9.6
Arrays in awk are also typeless variables and do not require declaration or definition of size. They support up to two-dimensional arrays. For instance, the following is an example of the multiplication table (9x9) written in C language and rewritten in awk:
Example:
$ cat awk_m_table
BEGIN {   #←Multiplication table (9x9) example in awk
    for ( i=1; i<=9; i++ ) {
        for ( j=1; j<=9; j++ ) {
            array[i, j] = i * j
            print i" * "j" = "array[i, j]
        }
    }
}
$ awk -f awk_m_table
1 * 1 = 1
1 * 2 = 2
.
.
.
9 * 7 = 63
9 * 8 = 72
9 * 9 = 81
Besides field variables, there are many built-in variables available in awk for various operations. Built-in variables are written in uppercase letters, so it is recommended not to use all uppercase names for user-defined variables. This practice helps to avoid name collisions and makes it clear which variables are built-in and which are user-defined.
For example, in the code snippet "for (i = 0; i < NF; i++)," it is evident that "NF" is a built-in variable, while "i" is a user-defined variable.
Two of the most commonly used built-in variables are "NF" and "NR." "NF" (Number of Fields) stores the number of fields in each line, and "NR" (Number of Records) stores the current line number in the file (in awk terminology, a line is called a "record").
For example:
$ echo 'ab cd ef' | awk '{print NF}'   ←Since there are three fields, NF=3
3
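A similarly small sketch for "NR": it holds the running record (line) number, so printing it next to $0 numbers the input lines.

```shell
# NR increments for every record read, so this prefixes each line
# with its line number.
printf 'alpha\nbeta\n' | awk '{ print NR ": " $0 }'
```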
awk built-in variables |
Variable | Meaning | Default | Regular-expression support | Note |
ARGC | The number of command-line arguments passed to awk | - | | |
ARGV | The array of command-line arguments passed to awk | - | | |
FILENAME | The name of the current input file | - | | |
FNR | The record number in the current input file | - | | |
FS | The input field separator | blank & tab | Yes | Refer to the BEGIN example |
IGNORECASE | When set to a non-zero value, matches are case-insensitive | 0 | Yes | The GNU version of awk (gawk) supports this built-in variable |
NF | The number of fields in the current input record | - | | |
NR | The total number of input records processed so far | - | | Refer to the END example |
OFMT | The output format for numbers | %.6g | | Refer to the print example |
OFS | The output field separator | blank | | Refer to the print example |
ORS | The output record separator | newline | | |
RS | The input record separator | newline | Yes | |
RSTART | Index of the first character matched by the match() function | - | | Refer to string functions |
RLENGTH | Length of the string matched by the match() function | - | | Refer to string functions |
SUBSEP | The subscript separator for array elements | "\034" | | Refer to associative arrays |
The built-in variables listed in the table may not be supported in all versions of awk, but in modern versions like GNU awk (gawk), most of them should be available (you can check by entering awk --version in the terminal).
The examples below illustrate some differences between awk and C, and the other built-in variables will be explained or tested when needed in subsequent applications.
ARGC and ARGV are two built-in variables in awk that are similar to the argc and argv [ ] in C. In C, they are used to read input parameters, while in awk, they represent the list of input files. For example, in the command awk '{}' abc def ghi (where abc, def, and ghi are filenames), the values of ARGC and ARGV are as follows:
ARGC=4
ARGV[0]="awk"
ARGV[1]="abc"
ARGV[2]="def"
ARGV[3]="ghi"
Hence, ARGC is often used as a loop bound. For example, the following script lists the files passed in:
awk 'BEGIN {for( i=0; i<ARGC; i++) print ARGV[i]}' /etc/*.conf
By default, awk uses whitespace, tab("\t"), and newline as the field and record separators for input data. However, not all data may have these separators. In such cases, you can modify the built-in variables "FS" (Field Separator) and "RS" (Record Separator) accordingly, and you can even use regular expressions for this purpose.
For example, consider the following script with a field separator that is neither whitespace nor tab but rather ":" or "-". You can set FS="[:-]" to specify this field separator. Since you may not know the exact number of fields in each record, you can effectively use "NF" (Number of Fields) in a loop to handle the data:
Example: (Input field separators are either ":" or "-"; output each field's data)
$ cat awk_scr3
BEGIN {
    FS="[:-]"   #←Set the field separator to ":" or "-"
}
{
    for ( i=1; i<=NF; i++ )
        print $i
}
$ echo "ab-cd ef:gh-ij" | awk -f awk_scr3
ab
cd
ef
gh
ij
[BEGIN { statement }]: The BEGIN block is executed once before processing any input data. It is commonly used for initialization and setup tasks.
[{main}]: This is the main part of the awk program, where the processing of each record (line) of input data takes place. The main block is executed for each input record.
[END { statement }]: The END block is executed once after processing all the input data. It is commonly used to perform final calculations or display summary results.
For example, to process the "/etc/shadow" file, which has field separators as ":", you can set the "FS" (Field Separator) variable in the BEGIN block to ":". Then, in the main block, you can check if the second field ($2) is empty to find out which accounts do not have passwords set. Finally, in the END block, you can display the total number of accounts without passwords.
Here's an example of finding accounts without passwords using awk: (requires logging in as root to read the file "/etc/shadow")

# cat awk_nopasswd   ←Script to find accounts without passwords
BEGIN {
    FS=":"      #←Set the field separator to ":"
    total=0     #←Initialize the user-defined variable "total" to 0
}
{               #←Main program block
    if ( $2 == "" ) {
        print $1 ": no password"
        total++
    }
}
END { print "Total no password account=", total }   #←END block

# cat /etc/shadow | awk -f awk_nopasswd
john: no password
fossett: no password
Total no password account= 2
$ awk 'BEGIN{print "Hello AWK"}'
Hello AWK
Since there is no input data, the main block is not executed, and "Hello AWK" is printed only once, thanks to the BEGIN block.
In the last example, we use the END block to print the built-in variable "NR", which holds the total number of input records (lines), simulating the behavior of the wc -l command:

$ awk 'END {print NR}' /usr/share/dict/linux.words   ←Simulating the behavior of "wc -l"
479829
$ awk '/^ayy*/,/^azz*/' /usr/share/dict/linux.words   ←List all words starting with ay to az in the dictionary
$ awk 'BEGIN {print "hello","awk"}'   ←Output: hello awk
hello awk
$ awk 'BEGIN {OFS="<-->";print "hello","awk"}'   ←Output: hello<-->awk
hello<-->awk
$ awk 'BEGIN{print 0123456789.0123456789}'   ←The default output is 6 significant digits (scientific notation)
$ awk 'BEGIN{OFMT="%.3f";print 0123456789.0123456789}'   ←Change the floating-point output to 3 decimal places
123456789.012
$ awk 'BEGIN{OFMT="%d";print 0123456789.0123456789}'   ←Change the output to an integer
123456789
$ awk 'BEGIN{ printf ("%d %s %1.2f\n",2,"Cheeseburgers",4.699)}'
2 Cheeseburgers 4.70
awk printf format | |
Symbol | Data Type |
%c | ASCII character |
%d | Integer |
%e | Scientific notation |
%f | Floating-point |
%g | Automatic choice between scientific notation and floating-point |
%o | Octal |
%s | String |
%x | Hexadecimal |
awk printf format for width |
Symbol | Meaning |
%f | Floating-point with no specified width (system default) |
%3d | Integer printed in a field at least 3 characters wide |
%.2f | Floating-point with 2 decimal places |
%2.f | Floating-point in a field at least 2 characters wide, with no decimal places (rounded) |
$ awk 'BEGIN{ printf ("%f \n",4.699)}'   ←Output: 4.699000 (default width)
4.699000
$ awk 'BEGIN{ printf ("%.2f \n",4.699)}'   ←Output: 4.70 (two decimal places)
4.70
$ awk 'BEGIN{ printf ("%2.f \n",4.699)}'   ←Output: 5 (rounded to the nearest integer)
5
$ awk 'BEGIN{ printf ("%3d \n",4.699)}'   ←Output: 4 (integer part only, no rounding)
  4
$ echo 65 66 | awk '{printf ("%10c%10c \n",$1,$2)}'   ←Output ASCII 65 & 66, each 10 characters wide (right-aligned)
         A         B
$ echo 65 66 | awk '{printf ("%10c%-10c \n",$1,$2)}'   ←Force the second character to be left-aligned
         AB
So, what exactly are associative arrays? In an associative array, strings are used as keys to access the corresponding values. Imagine an associative array as an Excel worksheet, where the indices are represented by strings such as "A1," "A2," "B1," "B2," and so on. In awk, you can write to an associative array using the syntax: array_name[index_string] = value.
For example, consider the following two entries written to the associative array "color" (you don't need to declare or define its size in advance, and you can use it directly):
color["RED"] = 2.1
color["BLUE"] = "TV"
We can imagine this as an Excel worksheet (although it's only one-dimensional) with the following contents:
RED | BLUE | ←index_string |
2.1 | "TV" | ←value |

$ awk 'BEGIN{color["RED"]=2.1;color["BLUE"]="TV";print color["RED"],color["BLUE"]}'
2.1 TV   ←The content of color["RED"] is "2.1" and the content of color["BLUE"] is "TV"
While in Excel you can visually see which cells hold data, how can you know how many entries, and which index strings, are stored in an associative array? Awk provides the following syntax to traverse the entire array: for (index_variable in array) do something with array[index_variable].
Using the example above, if we want to print the entire content of the "color" associative array, the code would be written as "for (i in color) print i, color[i]", and it is implemented as follows:
$ awk 'BEGIN{color["RED"]=2.1;color["BLUE"]="TV";for (i in color) print i,color[i]}'
BLUE TV
RED 2.1
Suppose the file "parts.db" records a part name in the first field, followed by that part's colors, one part per line, for example:

KEYBOARD white black
I want to count the occurrence of each color, and I can easily accomplish this using associative arrays in awk, as shown in the following example:
$ cat awk_scr4   ←Program to count the occurrences of each color
{
    for ( i=2; i<=NF; i++ )
        color[$i]++   #←Equivalent to color[$i] = color[$i] + 1
}
END {
    for ( j in color )
        printf( "%10s %d \n", j, color[j] )
}
$ awk -f awk_scr4 parts.db   ←Execute "awk_scr4" to count occurrences in the file "parts.db"
       red 2
     white 3
     black 3
      blue 1
    silver 1
    yellow 1
How to interpret the program? The code segment "for( i=2; i<=NF; i++)" starts the loop from 2 because the field "$1" does not contain colors. The loop runs for each field in a row, and since the number of fields is not fixed, the built-in variable "NF" is used to control the loop.
In the loop, "color[$i]++" does the counting. For example, when the program reads the first row of "parts.db" and "$2" contains the string "white", the expression color["white"]++ is executed and the value of color["white"] becomes 1. When a later row contains "white" in another field, color["white"]++ is executed again and the value becomes 2. This process continues, counting the occurrences of each string.
This example demonstrates how associative arrays can simplify the task. Without using associative arrays, the program would be longer and more complex.
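The same counting idiom in a fully self-contained form, with inline sample data standing in for the parts.db file (the traversal order of "for (w in count)" is unspecified, so the output is sorted for readability):

```shell
# Tally every word on standard input into an associative array,
# then dump the tallies in the END block.
printf 'red blue red\nblue red\n' |
awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
     END { for (w in count) print w, count[w] }' | sort
# prints: blue 2
#         red 3
```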
When introducing user-defined variables, I used the multiplication table to demonstrate two-dimensional arrays. However, awk does not directly support two-dimensional arrays; it cleverly uses associative arrays to simulate them. For example, the two-dimensional element "arrayA[3,7]" is converted into the string-indexed element arrayA["3\0347"], where "\034" is the value of the built-in variable "SUBSEP". If this character conflicts with the data you are processing, you can set "SUBSEP" to some other value.
The following example shows an experiment with a two-dimensional array, which is actually an associative array.

$ awk 'BEGIN{arrayA[3,7]="INDIGO";print arrayA["3\0347"];print arrayA[3,7]}'
INDIGO   ←arrayA[3,7] is equivalent to arrayA["3\0347"], so the output is the same
INDIGO
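Because a[1,2] is really a["1" SUBSEP "2"], the two indices can be recovered from a "for (k in a)" key with split() on SUBSEP. A small sketch:

```shell
# Store one "two-dimensional" element, then take its key apart
# again by splitting on the SUBSEP character.
awk 'BEGIN {
  a[1,2] = "x"
  for (k in a) {
    split(k, idx, SUBSEP)
    print idx[1], idx[2], a[k]
  }
}'
# prints: 1 2 x
```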
Instruction | Note |
delete array_name | Delete the entire array |
delete array_name["string"] | Delete a cell (one-dimensional) in an associative array |
delete array_name[2,3] | Delete a cell (two-dimensional) in an associative array |
delete array_name[10] | Delete a cell (one-dimensional) in an associative array |
$ cat awk_scr5
{
    for ( i=2; i<=NF; i++ )
        color[$i]++
    delete color["yellow"]   #←Delete the cell color["yellow"] from the associative array
}
.
.
.
$ cat awk_scr6
BEGIN {   #←BEGIN block
    outfile = "result"
}
{         #←Main program block
    for ( i=2; i<=NF; i++ )
        color[$i]++
}
END {     #←END block
    for ( j in color )
        printf( "%10s %d \n", j, color[j] ) > outfile   #←Redirect the result to a file
    print "***** Result Statistics *****" > outfile     #←Redirect the output to the same file
}
$ awk -f awk_scr6 parts.db   ←Execute the "awk_scr6" script (the file "parts.db" is from the associative-arrays example)
$ cat result   ←View the file "result"
       red 2
     white 3
     black 3
      blue 1
    silver 1
    yellow 1
***** Result Statistics *****
In the above example, the redirection ">" in awk works differently from the shell: the first time ">" writes to a given file, the file is created anew (an existing file is truncated), but every subsequent ">" write to the same file within the same awk run behaves like append redirection ">>".
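A two-line demonstration of this semantics ("demo_out" is just a scratch filename for this sketch): both print statements use ">", yet both lines end up in the file.

```shell
# The first ">" truncates demo_out; the second ">" to the same
# name, within the same awk run, appends.
awk 'BEGIN { print "first" > "demo_out"; print "second" > "demo_out" }'
cat demo_out    # shows both lines: first, then second
rm -f demo_out  # clean up the scratch file
```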
If you want to execute a system command inside awk to display the output on the screen, you can simply use the "system" command. The modified "awk_scr7" script below demonstrates this:
$ cat awk_scr7
... (The BEGIN and main program blocks are the same as in "awk_scr6")
END {
    for ( j in color )
        printf( "%10s %d \n", j, color[j] ) > outfile
    print "***** Result Statistics *****" > outfile
    system ("cat " outfile)   #←Execute the system command "cat"
}
Why do we need to close files? When awk creates a file, it internally keeps an open file handle (a pointer) linked to that file. For instance, the ">" symbol in the example "awk_scr6" serves both as redirection and as append redirection: when ">" first creates a file, it establishes the pointer link, and as long as that link exists, subsequent ">" writes append. The command close("file") severs the file's pointer link. If awk writes many files simultaneously without closing them properly, confusion can result, since it is unclear which file is being processed and whether output should truncate or append.
A useful way to determine whether a file is still open is to check whether ">" appends; once the file is closed, ">" creates it anew. The examples "awk_scr8" and "awk_scr9" demonstrate this concept.

$ cat awk_scr8   ←Example: file awk_scr8
BEGIN {
    print "abc" > "fileA"   #←Creates file "fileA"
    print "123" > "fileA"   #←Appends to "fileA"
}
$ awk -f awk_scr8   ←Execute "awk_scr8"
$ cat fileA
abc
123
$ cat awk_scr9   ←Example: file awk_scr9
BEGIN {
    print "abc" > "fileA"   #←Creates file "fileA"
    close ("fileA")         #←Closes "fileA" (severs the file's pointer link)
    print "123" > "fileA"   #←The pointer link is severed, so this creates "fileA" again
}
$ awk -f awk_scr9   ←Execute "awk_scr9"
$ cat fileA
123   ←The later result overwrites the previous one
$ cat awk_scr10
BEGIN {
    print "abc" | "tr 'a-z' 'A-Z' > fileA"   #←Output through a pipeline, using tr to convert lowercase to uppercase
    close ("tr 'a-z' 'A-Z' > fileA")         #←Close the pipeline by naming the whole command
    system ("echo '123' >> " "fileA")        #←The pipeline is closed, so use append redirection ">>"
}
$ awk -f awk_scr10   ←Execute "awk_scr10"
$ cat fileA
ABC
123
When used alone, getline reads one line from the current input and stores it in $0 and the field variables. If written in the main program block, it reads the next line, since the main loop has already read the current line. For example:
Example:
$ seq 1 10 | awk 'BEGIN{getline;print}'   ←Reads one line
1
$ seq 1 10 | awk '{getline;print}'   ←As getline is written in the main program, it reads the next line, so the output skips lines
2
4
6
8
10
Reading one line at a time may not seem very useful, so in practical applications, we use a loop to read all the data. But how do we know the loop's termination condition? The getline command returns a value after each read operation, with the following meanings:
getline read result | Return value |
Success | 1 |
Failure | -1 |
End Of File (EOF) | 0 |
$ seq 1 2 | awk 'BEGIN{print getline; print getline; print getline}'
1   ←getline read successfully ($0=1), return value = 1
1   ←getline read successfully ($0=2), return value = 1
0   ←getline reached the end of the input (EOF), return value = 0
$ seq 1 3 | awk 'BEGIN{while (getline) print}'   ←getline returns 0 at the end of the file, so the while loop exits
1
2
3
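One caveat worth a sketch: getline returns -1 on a read error, and -1 is also "true", so the defensive idiom compares the return value with zero explicitly.

```shell
# Loop only while getline reports success (> 0); sum the lines
# of the piped-in input and print the total.
seq 1 3 | awk 'BEGIN { while ((getline line) > 0) total += line; print total }'
# prints: 6
```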
getline [var] | Standalone usage: reads the next record from the current input |
getline [var] < "FILE" | Reads data from a file |
"COMMAND" | getline [var] | Reads data from the output of a command |
Here, "var" is a user-defined variable. If "var" is supplied, the line that is read is stored in "var" instead of "$0". For example, "getline cell" assigns the line read to the variable "cell".
In the second format, getline [var] < "FILE", if "FILE" is replaced with a hyphen, as in getline < "-", standard input is read, allowing interactive input to the program.
In the third format, "COMMAND" | getline [var], getline reads data from the output of the specified command. For example, instead of using a shell pipeline, you can write awk 'BEGIN{while ("seq 1 10" | getline) print}' to read the output of the command seq 1 10 directly.
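A variation on that one-liner, storing each line in a variable and checking the return value, which sums the command's output entirely inside awk:

```shell
# Read "seq 1 10" line by line through getline and total the values.
awk 'BEGIN { while (("seq 1 10" | getline n) > 0) sum += n; print sum }'
# prints: 55
```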
The following example reads the output of two system commands, ls -F and ls -A. It lists the empty directories in the working directory. The program first uses the command ls -F to output the file names through a pipeline to awk. If the filename indicates a directory, the program uses getline to read the output of ls -A for that directory to check if it is empty.
Example: (Detecting empty directories in the working directory)
$ cat awk_scr11
/\/$/ {   #←Equivalent to "if ($0 ~ /\/$/)": if the filename is a directory, continue; otherwise process the next filename
    DirName=$0
    while (("ls -A " DirName) | getline)   #←Use getline to read the output of the system command "ls -A"
        ListCount++
    if (ListCount == 0)   #←If ListCount=0, it is an empty directory
        print "Directory --> "DirName" is empty"
    ListCount=0
}
$ ls -F | awk -f awk_scr11
Directory --> dir2 is empty
Directory --> Documents is empty
Directory --> Download is empty
awk math operators |
Operator | Description | Example | Example Result |
% | Modulo (remainder) | 7%5 | 2 |
^ | Exponentiation | 2^3 | 8 |
Math Function Name | Description | Example | Example Result |
sin( x ) | Sine; x is in radians (radians = degrees/180 * PI) | sin(90/180*3.1416) | 1 |
cos( x ) | Cosine; x is in radians (radians = degrees/180 * PI) | cos(180/180*3.1416) | -1 |
atan2( y, x ) | Arc-tangent of y/x; returns the angle in radians | atan2(30,45) | 0.588003 |
exp( x ) | e^x | exp(1) | 2.71828 |
log( x ) | Natural logarithm (base e) | log(5) | 1.60944 |
sqrt( x ) | Square root | sqrt(9) | 3 |
int( x ) | Integer value (truncates the decimal part) | int(5.6) | 5 |
rand( ) | Random number generator, where 0 <= rand() < 1 | | |
srand( [x] ) | Initializes rand(), where x is the random seed (if omitted, the current date and time are used as the seed) | | |
Most mathematical functions are straightforward (don't ask me about math; I returned it to my teacher a long time ago). This section only introduces some functions that are prone to errors or have special considerations.
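One such pitfall is int(): it truncates toward zero rather than rounding or flooring, which surprises people with negative numbers. A quick check:

```shell
# int() simply drops the fractional part, so -5.6 becomes -5,
# not -6 as floor() would give.
awk 'BEGIN { print int(5.6), int(-5.6) }'
# prints: 5 -5
```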
The function rand() is a random-number generator that produces a value greater than or equal to 0 and less than 1. For example, you can use a "for" loop to execute rand() ten times:

$ awk 'BEGIN{for (i=1;i<=10;i++) print rand()}'
0.237788
0.291066
0.845814
0.152208
0.585537
0.193475
0.810623
0.173531
0.484983
0.151863
$ awk 'BEGIN{srand();for (i=1;i<=10;i++) print rand()}'   ←srand() seeds the generator from the current date and time, so each run produces a different sequence
Why is the range of the random number generator "0 <= rand( ) < 1" ? Because it is easy to apply to any desired range. For instance, if you want to use awk to randomly select numbers for a lottery, such as the "Lotto 6/49," you can use "rand( ) * 49", truncate the decimal part using "int( )", and then add 1:
Example: (Lotto 6/49 number generator)
$ awk 'BEGIN{srand();for (i=1;i<=6;i++) print int(rand()*49)+1}'
(output omitted)
$ awk 'BEGIN{"echo '$((2#1100))' " | getline dec ;print dec}'   ←The shell expands $((2#1100)) (binary 1100) to 12 before awk runs
12
$ awk 'BEGIN{str1="123";str2= str1 "abc"; print str2}'   ←Strings are concatenated by writing them next to each other
123abc
String Function Name | Description | Example | Example Result |
sub(regex, replace [,string]) | Replace the first occurrence matching regex with a new substring; returns the number of replacements | st1="google goooogle" sub(/go+g/,"YAHOO",st1) | 1, st1="YAHOOle goooogle" |
gsub(regex, replace [,string]) | Replace all occurrences matching regex with a new substring; returns the number of replacements | st1="google goooogle" gsub(/go+g/,"YAHOO",st1) | 2, st1="YAHOOle YAHOOle" |
index(string, substring) | Find the position of a substring in a string | index("this","is") | 3 |
match(string, regex) | Find the position and length of the first match of a pattern in a string | match("123xyzxyzxyz456",/(xyz)+/) | RSTART=4, RLENGTH=9 |
length [(string)] | Get the length of a string | length("yahoo") | 5 |
substr(string, index [,length]) | Extract a substring from a string | substr("12345678",3,4) | "3456" |
split(string, Array [,regex]) | Split a string into an array using a separator; returns the number of elements | split("abc:de-fgh",arrA,/[:-]/) | arrA[1]="abc" arrA[2]="de" arrA[3]="fgh" |
tolower( string ) | Convert uppercase letters to lowercase | tolower("Yahoo! 123") | "yahoo! 123" |
toupper( string ) | Convert lowercase letters to uppercase | toupper("Yahoo! 123") | "YAHOO! 123" |
sprintf(format, data1, data2 ...) | Format data as a string, like printf | sprintf("%.4f",3.14162654) | "3.1416" |
$ echo "google goooogle" | awk '{sub(/go+g/,"YAHOO");print}'   ←The first substring matching "go+g" is replaced with "YAHOO"
YAHOOle goooogle
$ awk 'BEGIN{st1="google gooooogle";print gsub(/go+g/,"YAHOO",st1);print st1}'
2   ←Printing the result of gsub() gives the number of replacements made
YAHOOle YAHOOle   ←Replaced result
$ echo 'this' | awk '{print index($0,"is")}'
3
$ echo '123xyzxyzxyz456' | awk '{match($0,/(xyz)+/); print RSTART,RLENGTH}'
4 9
$ echo 'yahoo' | awk '{print length()}'
5
$ echo '123456789' | awk '{print substr($0,3,4)}'
3456
$ echo '123456789' | awk '{print substr($0,3)}'   ←With no length, the substring runs to the end of the string
3456789
$ echo "abc de fgh" | awk '{split($0,arrayA);for (i in arrayA) print arrayA[i]}'
abc
de
fgh
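As the string-function table notes, split() also accepts a regular-expression separator, and its return value is the number of pieces produced. A small sketch:

```shell
# Split on either ":" or "-" and report the piece count plus
# the resulting array elements.
echo 'abc:de-fgh' | awk '{ n = split($0, a, /[:-]/); print n, a[1], a[2], a[3] }'
# prints: 3 abc de fgh
```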
$ awk 'BEGIN{print tolower("Yahoo! 123")}'
yahoo! 123
$ awk 'BEGIN{print toupper("Yahoo! 123")}'
YAHOO! 123
In the following example, "%.4f" is used to round the decimal to four decimal places.
Example:
$ echo '3.141592654' | awk '{new=sprintf("%.4f",$0);print new}'
3.1416
function name (para1, para2, para3 ...)
{
    body-of-function
    [return value]
}
UDFs in awk are similar to traditional C functions, but they do not require explicit declaration and are typeless. To define a UDF, add the keyword "function" before the function name.
Here's an example of a simple user-defined function called abs() that calculates the absolute value of a number:

$ cat awk_abs
{
    print abs($0)   #←Call the user-defined absolute-value function abs()
}
function abs (value)   #←User-defined absolute-value function abs()
{
    if (value < 0)
        value = value * (-1)
    return value   #←Use 'return' if there is a return value
}
$ echo "-13.38" | awk -f awk_abs
13.38
This is an important distinction to be aware of when using awk's pattern matching in script files. The expression "string ~ /regex/" is a shorthand way of writing a test that checks whether "string" matches the regular expression "regex" and, if so, executes the associated actions (by default, print $0).