The Linux Newbie Guide
All rights reserved, please indicate the source when citing
 

 awk

Processing text-based data with awk
        Basic usage of awk
        Awk program patterns
            User-defined variables
            Built-in variables
            BEGIN and END blocks
            Output functions print and printf
            Associative arrays
            Executing system commands using system
            Closing files/pipelines using close
            Reading input data using getline
            Mathematical functions
            String functions
            User-defined functions(UDFs)

  Processing text-based data with awk
Many UNIX/Linux command names originate from peculiar abbreviations, and awk is no exception! Its name comes from the initials of its three authors, Alfred Aho, Peter Weinberger, and Brian Kernighan.

Awk is an interpreted language that heavily borrows syntax from the C programming language, incorporating the essence of C's text processing and output formatting capabilities. Additionally, awk supports features that the original C language lacks, such as matching with regular expressions and the use of Associative Arrays.

As a result, the most significant difference between awk and C lies in their application. C is a general-purpose programming language with numerous complex commands and syntax, while awk is compact and concise, and particularly well suited to processing textual records and formatting text. Awk was popular throughout the 1980s; from around the 1990s it gradually yielded ground to another general-purpose interpreted language, Perl.



Basic usage of awk
In comparison to sed, which processes text on a line-by-line basis, awk can also handle text using "fields."

For example, consider the output of ls -l, which lists detailed file information. The output contains 8 fields, separated by spaces:

export TIME_STYLE=long-iso ←Set the time format (environment settings may affect the output format of "ls -l")
ls -l
drwxr-xr-x  2   aaa  aaa  4096  2011-09-07  11:44  Desktop
drwxr-xr-x  2   aaa  aaa  4096  2011-09-07  11:44  Documents
drwxr-xr-x  2   aaa  aaa  4096  2011-09-07  11:44  Music
drwxr-xr-x  2   aaa  aaa  4096  2011-09-07  11:44  Pictures
drwxr-xr-x  2   aaa  aaa  4096  2011-09-07  11:44  Public
↑           ↑   ↑    ↑    ↑     ↑           ↑      ↑
$1          $2  $3   $4   $5    $6          $7     $8 ←field variables

If we pipe the output of ls -l into awk, written as ls -l | awk {}, awk stores the entire line in "$0" and each field in its own field variable, "$1" to "$N" (where N is the number of fields). Because an awk program usually contains characters that the shell would otherwise interpret itself (such as "$", spaces, and double quotes), it is generally recommended to enclose the program in single quotes ('), leaving only options and file names outside the quotes. The command should therefore be written as ls -l | awk '{}' .

In the example above, the first line has 8 fields, which will result in 9 field variables with values from "$0" to "$8" as follows:
Variable  Content                                              Note
$0        drwxr-xr-x 2 aaa aaa 4096 2011-09-07 11:44 Desktop   "$0" contains the entire line's string
$1        drwxr-xr-x                                           string from field 1
$2        2                                                    string from field 2
$3        aaa                                                  string from field 3
$4        aaa                                                  string from field 4
$5        4096                                                 string from field 5
$6        2011-09-07                                           string from field 6
$7        11:44                                                string from field 7
$8        Desktop                                              string from field 8

The most special field variable is "$0," which contains the entire line's content, and when "$0" is modified, it automatically updates other field variables "$1" to "$N."
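
The re-splitting behavior can be sketched as follows. This is a minimal illustration that uses an arbitrary input line ("a b c") instead of real ls output; NF is awk's built-in variable holding the number of fields, and the shell variable name "result" is only an assumption for this example.

```shell
# The input line "a b c" is an arbitrary sample, not real ls output.
# Assigning a new string to $0 makes awk split the line into fields again,
# so $1, $2 and NF all reflect the new content.
result=$(echo "a b c" | awk '{ $0 = "x y"; print $1, $2, NF }')
echo "$result"
```

After the assignment the line has only two fields, so the old third field no longer exists.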

If we write only ls -l | awk '{}', there will be no output because there are no output functions and no specified field to be output. The most commonly used built-in output function in awk is "print." For example, if I want to output only the size field (field=5) and the filename field (field=8) of the ls -l output, I can write:

Example:
$ ls -l | awk '{print $5,$8}'← Output only fields 5 and 8
4096 Desktop
4096 Documents
4096 Music
. . .

In the "print" function, the comma "," represents the Output Field Separator (OFS), and the default OFS is a space, so each comma in the print function outputs a space. You can try removing the comma and entering ls -l | awk '{print $5 $8}' to see the difference in output.
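
As a small sketch of the difference, the commands below run the same print with and without the comma on a made-up line. The third command additionally assumes you want another separator and sets OFS with awk's standard -v option; the shell variable names are chosen only for this illustration.

```shell
# Made-up sample line with a size in field 1 and a name in field 2.
line="4096 Desktop"

# With a comma, print puts OFS (a space by default) between the values.
with_comma=$(echo "$line" | awk '{ print $1, $2 }')

# Without a comma, the two values are concatenated with nothing in between.
without_comma=$(echo "$line" | awk '{ print $1 $2 }')

# Setting OFS with -v changes what each comma produces.
custom_ofs=$(echo "$line" | awk -v OFS=":" '{ print $1, $2 }')

printf '%s\n' "$with_comma" "$without_comma" "$custom_ofs"
```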

If there are multiple commands, they should be separated by a semicolon ";" or written on the next line. For example, ls -l | awk '{size=$5; file_name=$8; print size, file_name;}' (in this example, "size" and "file_name" are user-defined variables).

Once you understand the usage of field variables "$N," it is easy to use this feature to change the output format. The most basic usage of awk is to change the output format, as shown in the example below, where awk is used to modify the original output of ls.

Example:
$ ls -l | awk '{print "File",$8,"size =",$5,"Byte"}'
File Desktop size = 4096 Byte
File Documents size = 4096 Byte
File Music size = 4096 Byte
. . .

In the above example, you can add strings to the "print" function to output them. Strings to be added should be enclosed in double quotes.

One of the reasons why awk was popular is its ability to perform calculations using field variables, such as "$3*base-1". Continuing the previous example, if I want to display the file size field in KiB (labeled "KB" in the output), I can divide "$5" by 1024.

Example:
$ ls -l | awk '{print "File",$8,"size =",$5/1024,"KB"}'
File Desktop size = 4 KB
File Documents size = 4 KB
File Music size = 4 KB
. . .
$ ls -l | awk '{print "File",$8,"size =",$5/1024,"KB"}' > reformate.txt ← Save the new output as a file "reformate.txt"

As for input data in awk, besides coming from a pipeline "|", it can also come from a file. For example, awk '{print $3}' data.txt prints field 3 of the file "data.txt". If you want to create an interactive program that accepts input from the keyboard, awk also supports the "-" symbol for standard input, as shown in the example below, which takes two numbers as input and outputs their product:


Example:
$ awk '{print $1*$2}' - ←The trailing "-" symbol represents standard input (keyboard)
3.14  1.41421 ←Enter any two numbers
4.44062 ←Output the product (Press <Ctrl-D> to end)

To execute awk, you can use the methods mentioned above, or, like sed, use the "-f" option to use an external script file or write it as a shell script. When using an external script file, remove the single quotes (') enclosing the "{}," as shown in the example below:

Example: (Using an external awk script file)
$ cat awk_scr ←Suppose there is an external script file named "awk_scr" with the following content
{print "File",$8,"size =",$5/1024,"KB"}
$ ls -l | awk -f awk_scr ← Use the "-f" option to use the external script file "awk_scr"

Example: (Write as a shell script)
$ cat awk_scr1 ←Suppose there is a file named "awk_scr1" with the following content
awk '{print "File",$8,"size =",$5/1024,"KB"}'
$ chmod +x awk_scr1 ←Give "awk_scr1" executable permissions
$ ls -l | ./awk_scr1 ←Execute "awk_scr1"



  Awk program patterns
In addition to its basic output usage, awk also has powerful program patterns, since awk itself is a scripting language. Whether to learn the programming side of awk depends on individual needs and expertise; the following explanation is provided to help you judge.

While many tasks can be achieved using the C language or shell scripts, C has a higher entry barrier and is overkill for small tasks, whereas shell scripts tend to fall short when it comes to text processing. For those skilled in awk scripting, awk can replace almost all filtering programs such as grep, sed, tr, and cut, and it adds calculation and statistical capabilities on top. If you need to process textual data records, awk is the first choice. Some tests have reportedly found awk to be over 30 times faster than shell scripts for the same functionality. The syntax of awk scripting borrows heavily from C, so if you are already familiar with C/C++/Java, learning awk's scripting language will be relatively simple; if you are not, it may be more challenging.

Readers can become proficient in awk scripting without knowing C language since awk scripting is much simpler. However, in this explanation, we assume that readers are already familiar with C language, so we won't specifically explain C language instructions and syntax. Instead, we will focus on the differences between awk and C language.

The structure of an awk program is mainly awk '[Pattern] [{Actions}]' [Files], where, in awk terminology, "Pattern" is not limited to a regular expression pattern but is a condition in general. "{Actions}" represents the statements to be executed, and "Files" is the text data file to be processed. Of course, besides files, awk can also process data from other commands through pipelines.

A Pattern is not always present. If it exists, the Actions will be executed when the Pattern is satisfied; otherwise, the Actions will not be executed. For example, to filter files based on size, the command can be written as ls -l | awk '$5 > 8192 {print $5, $8}'. It means that if the content of field 5 is greater than 8192, the "print $5, $8" action will be executed. If there is no Pattern, as in ls -l | awk '{print $5, $8}', the Actions "{print $5, $8}" will be executed regardless of any conditions.
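
The size filter above can be tried without depending on a real directory. The sketch below feeds awk three made-up lines laid out like "ls -l" output (the file names and sizes are invented for illustration) and keeps only those whose field 5 exceeds 8192.

```shell
# Three made-up lines laid out like "ls -l" output (8 fields each).
sample='-rw-r--r-- 1 aaa aaa 512 2011-09-07 11:44 small.txt
-rw-r--r-- 1 aaa aaa 8192 2011-09-07 11:44 edge.txt
-rw-r--r-- 1 aaa aaa 20480 2011-09-07 11:44 big.txt'

# Only lines whose field 5 is numerically greater than 8192 run the action.
big_files=$(printf '%s\n' "$sample" | awk '$5 > 8192 { print $5, $8 }')
echo "$big_files"
```

The line with a size of exactly 8192 is not printed, because the Pattern uses ">" rather than ">=".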

"{Actions}" can also be omitted. When omitted, the default action is "print $0." For example, awk 'NR <= 5' /etc/passwd works like head -n 5 /etc/passwd. (In this example, "NR" is a built-in variable that holds the number of the current input line.)

Awk's Patterns support comparison and logical operators similar to those of the C language, such as:

awk relational and logical operators
Operator Meaning
== equals
!= not equals
> greater than
>= greater than or equal to
< less than
<= less than or equal to
&& logical AND
|| logical OR
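
A short sketch of how these operators combine in a Pattern, using made-up "ls -l"-style lines (owners, sizes, and file names are invented for illustration). The last command has no Actions at all, so awk prints each matching line in full.

```shell
# Made-up lines laid out like "ls -l" output: owner in field 3, size in field 5.
sample='-rw-r--r-- 1 aaa aaa 512 2011-09-07 11:44 notes.txt
-rw-r--r-- 1 bbb bbb 4096 2011-09-07 11:44 data.bin
-rw-r--r-- 1 aaa aaa 20480 2011-09-07 11:44 big.log'

# && needs both comparisons to hold: owner is "aaa" AND size is at least 1024.
both=$(printf '%s\n' "$sample" | awk '$3 == "aaa" && $5 >= 1024 { print $8 }')

# || needs at least one to hold: size below 1024 OR owner is not "aaa".
either=$(printf '%s\n' "$sample" | awk '$5 < 1024 || $3 != "aaa" { print $8 }')

# Pattern only, no Actions: input lines 2 and 3 are printed whole.
last_two=$(printf '%s\n' "$sample" | awk 'NR >= 2 && NR <= 3')

echo "$both"
echo "$either"
echo "$last_two"
```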

Unlike the traditional C language, awk's Patterns can also use regular expressions for matching, with "~" meaning "matches" and "!~" meaning "does not match".
(In fact, the current version of awk supports extended regular expressions.)

Syntax:

awk regular expression matching operators
Syntax                                     Meaning
string ~ /regular expression/ [Actions]    If the string matches the regular expression, execute Actions [note]
string !~ /regular expression/ [Actions]   If the string does not match the regular expression, execute Actions
/regular expression/ [Actions]             If the current input line matches the regular expression, execute Actions (omitting the string and the "~" symbol matches the regular expression against $0)
!/regular expression/ [Actions]            If the current input line does not match the regular expression, execute Actions

To match regular expressions, remember to enclose them in paired slashes "/". For example, in the command ls /etc | awk '$1 ~ /pr*e/', it means that if the content of field 1 in any line matches the regular expression "pr*e", then that line will be output.
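
To see which names "pr*e" actually selects, the sketch below uses a made-up list of names in place of the real output of ls /etc. The regular expression means a "p", followed by zero or more "r", followed by an "e", anywhere in the field, so it matches "pe" as well as "pre".

```shell
# Made-up names, one per line, standing in for the output of "ls /etc".
names='openssl
presets
profile
hosts'

# "~" keeps lines whose field 1 contains "pe", "pre", "prre", and so on.
matched=$(printf '%s\n' "$names" | awk '$1 ~ /pr*e/')

# "!~" keeps the lines whose field 1 does not contain such a match.
not_matched=$(printf '%s\n' "$names" | awk '$1 !~ /pr*e/')

echo "$matched"
echo "$not_matched"
```

Note that "profile" is not matched: its "pr" is followed by "o", not "e".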

If you omit both the string to match and the tilde symbol "~", the meaning can be considered as a search. In this case, if the current input line contains a match to the regular expression, the specified actions will be executed. For instance, awk '/colou*r/' file works similarly to the grep command.
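
The comparison with grep can be checked directly. This sketch runs both commands on the same made-up text (the sentences are invented for illustration) and stores the results in shell variables so they can be compared.

```shell
# Made-up input text; three of the four lines contain a match.
text='The color red
The colour blue
No match here
colouur is a typo'

# With only /regex/ as the Pattern, awk tests the whole line ($0).
found=$(printf '%s\n' "$text" | awk '/colou*r/')

# grep with the same regular expression selects the same lines.
same_as_grep=$(printf '%s\n' "$text" | grep 'colou*r')

echo "$found"
```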

The "{Actions}" part is not limited to just "print." Various commands and syntax similar to C language are also valid.



List of awk syntax Note
if ( conditional ) statement [ else statement ]
while ( conditional ) statement
do {statement} while (conditional)
for ( expression ; conditional ; expression ) statement
for ( variable in array ) statement
break
continue
{ [ statement ] ...}
variable=expression
[ command | ] getline [ var ] [ < file ]
print [ expression-list ] [ > expression ]
printf format [ , expression-list ] [ > expression ]
function name ( parameter-list ) { statements }
next
exit
Reference Source:
http://www.grymoire.com/Unix/Awk.html

Additionally, awk comments, like sed, are denoted by the "#" symbol.
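
A brief sketch of C-like statements and comments inside Actions. The input numbers are made up, and the shell variable names "report" and "total" are assumptions for this illustration only.

```shell
# Made-up file sizes in bytes, one per line.
sizes='512
4096
20480'

report=$(printf '%s\n' "$sizes" | awk '
# Lines starting with "#" are comments and are ignored by awk.
{
    if ($1 >= 1024)
        print $1 / 1024, "KiB"   # large values are converted
    else
        print $1, "Byte"         # small values are printed as they are
}')
echo "$report"

# A C-style for loop that adds up every field of the line.
total=$(echo "1 2 3 4" | awk '{ sum = 0; for (i = 1; i <= NF; i++) sum += $i; print sum }')
echo "$total"
```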





[Note] The pattern "string ~ /regex/" written without Actions behaves as if awk expanded it to "{if (string ~ /regex/) print}" during execution. For example, awk '$8 ~ /pr*e/' is equivalent to awk '{if ($8 ~ /pr*e/) print}'. In other words, the pattern form is shorthand for an if statement that checks whether "string" matches the regular expression "regex" and prints the line when it does. This distinction matters in external script files: write the pattern "string ~ /regex/" on its own, without the surrounding quotes and without wrapping it in an outer "{ }".
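
The equivalence described in the note can be checked with a small sketch. The two input lines are made up to resemble "ls -l" output, and the shell variable names are assumptions for this illustration.

```shell
# Made-up lines laid out like "ls -l" output; the file name is field 8.
sample='-rw-r--r-- 1 aaa aaa 512 2011-09-07 11:44 presets
-rw-r--r-- 1 aaa aaa 4096 2011-09-07 11:44 hosts'

# Pattern form: no Actions, so matching lines are printed by default.
short_form=$(printf '%s\n' "$sample" | awk '$8 ~ /pr*e/')

# Explicit form: the same test written as an if statement inside Actions.
long_form=$(printf '%s\n' "$sample" | awk '{ if ($8 ~ /pr*e/) print }')

echo "$short_form"
echo "$long_form"
```

Both commands print only the line whose file name is "presets".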