Unix/Linux has a special group of software tools called "filters." A filter reads text from standard input, transforms it, and writes the result to standard output. Some filters (tr, for example) do not accept file names as arguments at all, so they are used through pipes or redirection rather than on their own. Filters primarily work line by line to filter, search, modify, replace, insert, delete, or compute statistics on data.
The output of a filter normally goes to stdout, which by default is the terminal screen. To save the output to a file, redirect it, for example: cat fileA | tr -s '\n' > fileB.
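The redirection described above can be sketched as follows. This is a minimal illustration with made-up sample data; the variable names src, dst, and result are assumptions introduced only for this example.

```shell
# Create a small input file containing runs of blank lines (hypothetical data).
src=$(mktemp)
dst=$(mktemp)
printf 'first\n\n\nsecond\n\nthird\n' > "$src"

# tr reads only standard input, so feed it by redirection and
# capture its standard output in another file.
tr -s '\n' < "$src" > "$dst"

result=$(cat "$dst")
printf '%s\n' "$result"
rm -f "$src" "$dst"
```

Redirecting the input with "<" avoids the extra cat process used in the one-liner above, while producing the same file.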
Commonly used filter tools in Unix/Linux include grep, cut, col, tr, uniq, sort, sed, and awk. Among them, sed and awk have their own scripting languages and are relatively complex, so they are explained separately in the sed and awk sections.
$ grep home /etc/passwd ← display the lines in the file "/etc/passwd" that contain the string "home"; this can be used to see the currently established accounts
aaa:x:500:500::/home/aaa:/bin/bash
bbb:x:501:501::/home/bbb:/bin/bash
frank:x:502:502:Frank Wang:/home/frank:/bin/bash
phoebe:x:503:503::/home/phoebe:/bin/bash
To keep the shell from misinterpreting the pattern (PATTERN), particularly when it contains regular-expression metacharacters, enclose it in single quotes ('') or double quotes (""). The example above is therefore better written as grep '/home' /etc/passwd.
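To see why quoting matters, the sketch below runs the same pattern quoted and unquoted in a scratch directory that happens to contain a file named "abc". The directory, the file name, and the sample input are all assumptions made for this illustration.

```shell
# Work in an empty scratch directory that holds a single file named "abc".
workdir=$(mktemp -d)
cd "$workdir" || exit 1
: > abc

# Unquoted: the shell expands a* to the file name "abc" before grep runs,
# so grep actually searches for the string "abc".
unquoted=$(printf 'a\nabc\naaa\nxyz\n' | grep -c a*)

# Quoted: grep receives the regular expression a* unchanged
# (zero or more "a"), which matches every line.
quoted=$(printf 'a\nabc\naaa\nxyz\n' | grep -c 'a*')

echo "unquoted: $unquoted, quoted: $quoted"
cd / && rm -rf "$workdir"
```

The two counts differ (1 versus 4) even though the commands look almost identical, which is why quoting the pattern is the safer habit.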
grep itself is not difficult; the challenge lies in becoming proficient with regular expressions.
Example:
$ ls -F /etc | grep '/$' ← display only the directories under "/etc" and exclude files from the output
a2ps/
acpi/
alsa/
alternatives/
audisp/
audit/
avahi/
blkid/
bluetooth/
(The following is omitted)
$ ls -d /etc/* | grep '[[:digit:]]' | grep '[[:upper:]]' ← list the entries in the "/etc" directory whose names contain both a digit and an uppercase letter (the bracket expressions are quoted so that the shell does not expand them)
/etc/X11
$ ls -d /etc/* | grep '[A-Z].*' ← list the entries in the "/etc" directory whose names contain an uppercase letter. The pattern is not anchored, so the letter may appear anywhere in the name, and the trailing ".*" is redundant; use '^/etc/[A-Z]' to require an uppercase first character
/etc/ConsoleKit
/etc/DIR_COLORS
/etc/group.OLD
/etc/Muttrc
$ grep -r 'colou*r' /etc/gconf ← search the files in the /etc/gconf directory (including subdirectories) for lines that contain "color" or "colour" (strictly, "u*" matches any number of "u" characters, including none)
/etc/gconf/gconf.xml.defaults/%gconf-tree-or.xml: <entry name="color_shading_ty
/etc/gconf/gconf.xml.defaults/%gconf-tree-or.xml: <entry name="secondary_color">
/etc/gconf/gconf.xml.defaults/%gconf-tree-or.xml: <entry name="primary_color">
grep has two commonly used options: -F and -E. The -F option is used to disable the interpretation of regular expressions, treating the pattern as a literal string. On the other hand, the -E option enables extended regular expression matching.
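The difference between the default (basic) matching, -F, and -E can be sketched on a few sample lines. The sample text and the variable names below are assumptions chosen for illustration.

```shell
# Four hypothetical input lines.
sample=$(printf 'a.c\nabc\ncolor\ncolour\n')

# Default (basic regular expression): "." matches any single character.
bre=$(printf '%s\n' "$sample" | grep -c 'a.c')

# -F: the pattern is a literal string, so "." matches only a dot.
fixed=$(printf '%s\n' "$sample" | grep -F -c 'a.c')

# -E: extended syntax, where "?" makes the preceding "u" optional.
ere=$(printf '%s\n' "$sample" | grep -E -c 'colou?r')

echo "BRE: $bre, fixed: $fixed, ERE: $ere"
```

With this input the counts are 2, 1, and 2 respectively.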
Here are some possible options and usages of grep:
Syntax: [STDIN] grep [-option][--option] [FILE] or [STDOUT]
Command name/Function/Command user | Options | Function | Note
grep/ Find strings in files/ Any
-a | Treat binary files as text |
-A # | Also display the # lines that follow each matching line | e.g. grep -A 2 "search_string" file.txt
-B # | Also display the # lines that precede each matching line | e.g. grep -B 2 "search_string" file.txt
-C # | Also display # lines before and after each matching line | e.g. grep -C 2 "search_string" file.txt
-c | Display only the number of matching lines |
-D [read][skip] | How to handle device files, named pipes (FIFOs), and socket files | The available actions are "read": treat the device file as a normal file; "skip": do not process the device file
-d [read][skip][recurse] | How to handle directories | This option may not be fully supported or may cause errors on some versions of the OS or file system. The available actions are "read": treat the directory as an ordinary file; "skip": do not process the directory; "recurse": process the directory and its subdirectories, the same as the option "-r"
-e | Specify the pattern | Mainly used for patterns that start with "-" (such a pattern would otherwise be mistaken for an option); it can be repeated to give several patterns
-E | Interpret the pattern as an extended regular expression |
-f | Read the patterns from a file, one pattern per line |
-F | Search with fixed strings (i.e. the pattern is not interpreted as a regular expression) |
-G | Interpret the pattern as a basic regular expression (the default) |
-h | Do not list file names when searching multiple files | Only makes a difference when searching multiple files
-H | List the file name together with each matching line (this is the default when searching multiple files) |
-i | Ignore case differences |
-I | Treat binary files as if they contained no match | Suppresses the message "Binary file XXX matches", which could otherwise disrupt the output
-l | List only the names of files that contain a match | Mainly used for multi-file search
-L | List only the names of files that contain no match | Mainly used for multi-file search
-n | Prefix each matching line with its line number |
-q | No output; only the exit status is set | Mainly used for tests in bash scripts
-r | Search subdirectories as well (recursive search) |
-v | Invert the search, that is, output the lines that do not match |
-w | Match only "whole words" | For example, the whole word "apple" matches, but strings such as "apples" or "applets" do not
-x | Match only when the entire line matches the pattern |
--help | Display the command's built-in help and usage information |
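A few of the options in the table can be compared side by side on the same input. The sketch below uses four made-up lines; the data and variable names are assumptions for illustration only.

```shell
# Hypothetical sample input.
sample=$(printf 'apple pie\napples\nApple\nbanana\n')

plain=$(printf '%s\n' "$sample" | grep -c 'apple')     # lines containing "apple"
nocase=$(printf '%s\n' "$sample" | grep -ci 'apple')   # -i: ignore case
word=$(printf '%s\n' "$sample" | grep -cw 'apple')     # -w: whole word only
inverted=$(printf '%s\n' "$sample" | grep -cv 'apple') # -v: lines that do NOT match
numbered=$(printf '%s\n' "$sample" | grep -n 'banana') # -n: prefix the line number

echo "plain=$plain nocase=$nocase word=$word inverted=$inverted numbered=$numbered"
```

Here "-w" rejects "apples" because the match must be delimited by non-word characters, and "-v" counts the two lines that the plain search skipped.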
$ grep -n 'google' re.txt ← show each matching line prefixed with its line number
$ cat MY_PATTERN ← for example, suppose the pattern file "MY_PATTERN" has the following content:
TAIWAN
[Tt]aiwan
$ grep -f MY_PATTERN *.txt ← use the patterns in the pattern file "MY_PATTERN" to search all files with the extension "txt"
$ cat my_file ← for example, suppose the file "my_file" has the following content:
Introduction to Linux
Linux is a multi-user & multi-task OS
$ grep -e '-user' my_file ← search for the string "-user" in the file "my_file"
Linux is a multi-user & multi-task OS
$ grep -ne 'mail' -ne 'news' /etc/passwd ← search for the strings "mail" and "news" and list the line numbers
9:mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
10:news:x:9:13:news:/etc/news:
27:mailnull:x:47:47::/var/spool/mqueue:/sbin/nologin
$ grep -q 'google' re.txt && cp re.txt re.txt~ ← if the file "re.txt" contains the string "google", back up this file
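The "-q" example can also be written as an if statement, which is the usual form inside scripts. This is a minimal sketch: the scratch directory, the file content, and the variables status and backup_exists are assumptions for illustration.

```shell
# Build a hypothetical re.txt inside a scratch directory.
workdir=$(mktemp -d)
printf 'search with google\nanother line\n' > "$workdir/re.txt"

# grep -q prints nothing; only its exit status is used by "if".
if grep -q 'google' "$workdir/re.txt"; then
    cp "$workdir/re.txt" "$workdir/re.txt~"
    status="backed up"
else
    status="no match"
fi

# Record whether the backup file was created, then clean up.
backup_exists=no
[ -f "$workdir/re.txt~" ] && backup_exists=yes
echo "$status ($backup_exists)"
rm -rf "$workdir"
```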
$ egrep 'goo?' re.txt ← search for the string "go" or "goo" ("o?" means zero or one "o"; to match "go", "goo" ... "gooooooooooo" use 'goo*' or 'go+'). egrep is equivalent to grep -E
$ ls /etc/ | egrep 'pr(o|e)' ← search for names that contain "pre" or "pro"
iproute2
modprobe.conf
modprobe.conf~
modprobe.d
prelink.cache
(The following is omitted)
$ fgrep 's.' re.txt ← search for the exact string "s." without using regular expressions (fgrep is equivalent to grep -F)
google Goggles Solves SUDOKU Puzzles.
$ fgrep '.*' /etc/*.conf ← search for the literal string ".*" (without interpreting it as the regular expression meaning "zero or more of any character")
$ echo -e '12\t3\t456\t789' 12 3 456 789 |
$ echo -e '12\t3\t456\t789' | cut -f 2,4 3 789 |
-f n | Extract field n |
-f n,m,o,p | Extract fields n,m,o and p |
-f n-m | Extract fields from n to m |
-f n- | Extract fields from n to last |
-f -n | Extract fields from first to nth |
-f n-m,o-p,q,r- | Extract fields from n to m, o to p, q and r to the end |
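The field-list forms in the table above can be tried on a single colon-delimited line. The sample line and variable names are assumptions made for this sketch.

```shell
# One hypothetical record with five colon-separated fields.
line='a:b:c:d:e'

single=$(printf '%s\n' "$line" | cut -d: -f2)    # field 2 only
listed=$(printf '%s\n' "$line" | cut -d: -f1,3)  # fields 1 and 3
ranged=$(printf '%s\n' "$line" | cut -d: -f2-4)  # fields 2 through 4
from4=$(printf '%s\n' "$line" | cut -d: -f4-)    # field 4 to the last
upto2=$(printf '%s\n' "$line" | cut -d: -f-2)    # first field through field 2

echo "$single $listed $ranged $from4 $upto2"
```

When more than one field is selected, cut joins them with the input delimiter, so the results here still contain ":".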
$ cat /etc/passwd | grep '/home' | cut -d":" -f1 aaa bbb patrick cindy danny |
$ echo '123456789' | cut -c 5- ←Extract characters from 5 to the last character 56789 |
Syntax: [STDIN] cut [-option][--option][CHAR][FILE]
Command name/Function/Command user | Options | Function | Note
cut/ Retrieve fields/ Any
-b | The extraction unit is the byte | In a single-byte (e.g. English) locale, the "-c" option is equivalent to the "-b" option
-c | The extraction unit is the character |
-d "CHAR" | Specify a custom field delimiter |
-f FIELD[,FIELD] | Set the output fields | Fields are split on the delimiter given with "-d", or on the default tab delimiter
-s | Do not output lines that contain no field delimiter |
-n | Do not split multibyte characters | Used with "-b" for the multibyte characters found in non-English locales
--output-delimiter | Customize the output delimiter string |
--help | Display the command's built-in help and usage information |
Some of these options need further explanation. When cut -d is used to set the field delimiter, a line that does not contain the delimiter is, by default, output in full. Add the -s option to suppress such lines.
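The effect of "-s" can be sketched with three made-up lines, one of which has no delimiter. The input and variable names are assumptions for illustration.

```shell
# The middle line contains no ":" delimiter (hypothetical data).
input=$(printf 'a:b\nno delimiter here\nc:d\n')

# Default: the line without a delimiter is passed through unchanged.
without_s=$(printf '%s\n' "$input" | cut -d: -f1)

# With -s: the line without a delimiter is dropped.
with_s=$(printf '%s\n' "$input" | cut -s -d: -f1)

printf 'without -s:\n%s\nwith -s:\n%s\n' "$without_s" "$with_s"
```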
The option "--output-delimiter" allows you to freely set the output delimiter string. In the following example, we will change the original tab delimiter to the string "---" for output.
$ echo -e '12\t3\t456\t789' | cut --output-delimiter="---" -f 1- ← custom output delimiter string "---"
12---3---456---789
$ export TIME_STYLE=long-iso ←Set the date/time format (different environment settings will affect the output format of "ls -l") | |||||||
$ ls -l --time-style=long-iso |
drwxr-xr-x | 2 | aaa | aaa | 4096 | 2011-09-07 | 11:44 | Desktop |
drwxr-xr-x | 2 | aaa | aaa | 4096 | 2011-09-07 | 11:44 | Documents |
-rw-rw-r-- | 2 | aaa | aaa | 8 | 2011-08-08 | 12:42 | fileA |
-rw-rw-r-- | 2 | aaa | aaa | 12 | 2011-08-07 | 12:34 | fileB |
-rw-rw-r-- | 2 | aaa | aaa | 112 | 2011-03-08 | 10:12 | MypProject |
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 ← field numbers |
$ man regex
REGEX(3) Linux Programmer’s Manual REGEX(3)
NAME
regcomp, regexec, regerror, regfree - POSIX regex functions
SYNOPSIS
#include <sys/types.h>
#include <regex.h>
int regcomp(regex_t *preg, const char *regex, int cflags);
int regexec(const regex_t *preg, const char *string, size_t nmatch,
(The following is omitted)
When viewing the man (manual) pages for a command, you may notice that the documentation contains formatting elements such as bold text or underlines. This is because the man pages are not plain text files; they contain control characters that control the display or operation of the text.
Sometimes, for convenience or editing purposes, you may want to save the manual page of a command to a file. You can do this using the command man regex > regex.txt, which redirects the output of the man command to a file named "regex.txt". However, when you open this file with a text editor such as vi, you may see garbled or unreadable characters, as shown below:
REGEX(3) Linux Programmer’s Manual REGEX(3)
N^HNA^HAM^HME^HE
regcomp, regexec, regerror, regfree - POSIX regex functions
S^HSY^HYN^HNO^HOP^HPS^HSI^HIS^HS
#^H#i^Hin^Hnc^Hcl^Hlu^Hud^Hde^He <^H<s^Hsy^Hys^Hs/^H/t^Hty^Hyp^Hpe^Hes^Hs
(The following is omitted)
In this case, you can use the col program to clean up the file. col itself filters out reverse line feeds ("RLF") and half reverse line feeds ("HRLF"). The "-b" option additionally drops backspaces (the ^H shown above) and keeps only the last character written to each column position, which removes the overstrike sequences used for bold and underlined text.
So, in the given example, using the command man regex | col -b > regex.txt will filter out the control characters that cause garbled text, allowing you to use a regular text editor to read or edit the file.
The name col is an abbreviation of the word "colander." col is generally operated through pipelines (e.g., COMMAND | col) or redirection (e.g., col -x < FILE > SAVE_FILE).
Another useful option is -x, which converts tabs to corresponding spaces. This can be handy when you have a file that needs to convert tabs to spaces for formatting purposes.
$ echo -e '12\t\t3\t456\t789' | sed -n 'l' ← use the command "sed" to make the tabs visible (displayed as \t)
12\t\t3\t456\t789$
$ echo -e '12\t3\t456\t789' | col -x | sed -n 'l' ← use "col -x" to convert tabs to spaces
12      3       456     789$
$ echo '012345' | tr 1 A ←Convert character "1" to "A" 0A2345 |
$ echo 'abcdef 123 XYZ' | tr 'a-z' 'A-Z' ← convert lowercase characters to uppercase ABCDEF 123 XYZ $ echo 'abcdef 123 XYZ' | tr '[:digit:]' 'i-z' ← convert numbers to lowercase English and start from i abcdef jkl XYZ |
tr also accepts explicit character lists such as tr 'abcde' 'ijklm', where the characters 'i', 'j', 'k', 'l', and 'm' replace 'a', 'b', 'c', 'd', and 'e' respectively. Note that this is a character-by-character mapping; it does not replace the string 'abcde' with the string 'ijklm'.
In the following example, the character '5' is replaced with 'i', and any character '6' would be replaced with 's' (this input contains no '6'):
Example: $ echo "Hello 12345 World" | tr '56' 'is'
Hello 1234i World
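The difference between mapping characters and replacing a string can be sketched by running tr and sed on the same input. The sample text and variable names are assumptions for illustration.

```shell
# tr maps single characters: every 5 becomes i and every 6 becomes s,
# regardless of the order in which they appear.
by_char=$(printf '65 56\n' | tr '56' 'is')

# sed replaces the string "56" as a unit, so "65" is left untouched.
by_string=$(printf '65 56\n' | sed 's/56/is/g')

echo "tr: $by_char"
echo "sed: $by_string"
```

tr turns the input into "si is", while sed produces "65 is".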
Although sed is powerful, it reads its input one line at a time into the "pattern space" with the trailing newline removed, so it cannot easily match or replace newline characters. In such cases, tr is the simpler tool.
For example, classic Mac OS (up to Mac OS 9) used "CR" as its newline character. To convert a UNIX/Linux file to a format readable by such a vintage computer, tr can translate every "LF" (\n) into "CR" (\r). The escape sequences available for control characters are listed below.
ASCII control character representations | Dec | Hex | ASCII abbr. | Name/Meaning
\a | 7 | 07 | BEL | bell
\b | 8 | 08 | BS | backspace
\t | 9 | 09 | TAB | horizontal tab
\n | 10 | 0A | LF | line feed, new line
\v | 11 | 0B | VT | vertical tab
\f | 12 | 0C | FF (NP) | form feed, new page
\r | 13 | 0D | CR | carriage return
\\ | 92 | 5C | | character "\"
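The LF-to-CR conversion mentioned before the table could look like the following sketch. The temporary files and the variable names are assumptions introduced for this example.

```shell
# A hypothetical UNIX text file with LF line endings.
unix_file=$(mktemp)
mac_file=$(mktemp)
printf 'line one\nline two\n' > "$unix_file"

# Translate every LF into CR (classic Mac OS line ending).
tr '\n' '\r' < "$unix_file" > "$mac_file"

# Count the remaining LF and the new CR characters.
lf_count=$(( $(tr -cd '\n' < "$mac_file" | wc -c) ))
cr_count=$(( $(tr -cd '\r' < "$mac_file" | wc -c) ))

# Converting back restores the original text.
round_trip=$(tr '\r' '\n' < "$mac_file")

echo "LF: $lf_count, CR: $cr_count"
rm -f "$unix_file" "$mac_file"
```

Swapping the two arguments of tr performs the reverse conversion, from the CR format back to UNIX.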
Syntax: [STDIN] tr [-option][--option] CHAR SET1 [CHAR SET2]
Command name/Function/Command user | Options | Function
tr/ Translate characters/ Any
-c | Use the complement of CHAR SET1 (every character not listed in it)
-d | Delete the characters in CHAR SET1
-s | Squeeze each run of consecutive repeated characters into a single character
-t | Truncate CHAR SET1 to the length of CHAR SET2, so that surplus source characters are left untranslated
$ echo "1  2   3    4" | tr -s " " ← squeeze runs of spaces into a single space
1 2 3 4
$ echo -e "1\t\t\t2  3   4" | tr -s " \t" ← squeeze runs of spaces and runs of tabs (a single tab remains between "1" and "2")
1	2 3 4
$ tr -s '\n' < fileA > fileB ← remove the extra blank lines from "fileA" and save the result as "fileB"
$ sed -n '5,20 p' fileA | tr -s "\n" ← remove redundant blank lines from lines 5~20 of "fileA"
$ echo 'busy buzzing bumblebees buzzing busying' | tr -d 'busy' ← delete every character "b", "u", "s", or "y" (not the string "busy")
 zzing mleee zzing ing
$ echo -e "1\t\t\t2 3 4" | tr -s " \t" | tr -d " \t" ← first squeeze the repeated spaces and tabs, then delete the remaining spaces and tabs
1234
$ echo 'abcdef 123 XYZ' | tr -d '[:digit:]' ← delete all digits
abcdef  XYZ
$ tr -d '\r' < DOS_FILE > UNIX_FILE ← delete the '\r' characters from a Windows/DOS file
$ echo 'abcdef 123 XYZ' | tr -c '[:alpha:]' '-' ← convert every non-alphabetic character (including the trailing newline) to "-"
abcdef-----XYZ-
$ echo 'abcdef 123 XYZ' | tr -t 'abcde' 'AB' ← the target set holds only "A" and "B", fewer than the source set "a"~"e", so only the first two characters are converted
ABcdef 123 XYZ
The basic usage of sort is straightforward, and once you see an example, you will understand it. Below is an example with a file named "equip," which is a list of computer equipment purchased by the company this year.
$ cat equip
xerox apr 4
acer1 feb 1
XEROX-FUJI may 5
printer1 oct 6
acer2 oct 10
printer1 jul 3
ASUS1 sep 4
Apple jun 5
IBM2 dec 7
acer2 oct 10
ASUS2 nov 20
IBM1 mar 1
$ LANG=C ← setting the "LANG" environment variable to "C" or "POSIX" configures the locale to use the ASCII character set
$ sort equip
ASUS1 sep 4
ASUS2 nov 20
Apple jun 5
IBM1 mar 1
IBM2 dec 7
XEROX-FUJI may 5
acer1 feb 1
acer2 oct 10
acer2 oct 10
printer1 jul 3
printer1 oct 6
xerox apr 4
The result is now consistently ordered, but it still looks strange that "ASUS" and "acer", which both start with the letter "A", end up so far apart. This happens because in the ASCII table all uppercase letters come before the lowercase letters.
To sort the list without regard to case, use the "-f" option. It tells sort to fold lowercase letters to uppercase before comparing, so "ASUS" and "acer" are both treated as starting with "A" and are then ordered by their remaining characters.
$ sort -f equip ← the option "-f" makes the sort case-insensitive
acer1 feb 1
acer2 oct 10
acer2 oct 10
Apple jun 5
ASUS1 sep 4
ASUS2 nov 20
IBM1 mar 1
IBM2 dec 7
printer1 jul 3
printer1 oct 6
xerox apr 4
XEROX-FUJI may 5
$ sort -k 3 equip ← with the "-k" option you can sort on a chosen column of your data (here, column 3)
IBM1 mar 1
acer1 feb 1
acer2 oct 10
acer2 oct 10
ASUS2 nov 20
printer1 jul 3
ASUS1 sep 4
(The following is omitted)
Note that the previous example still looks a bit strange: in ascending order, the value "20" in the third column should come after "3." The reason is that sort compares fields as text, character by character: it compares the first character of each field and moves on to the next character only when they are equal. Judged by the first character, "1x" and "2x" therefore always come before "3."
To resolve this issue, you can use the -n option. This option tells the sort command to perform a numerical sort based on the values in the specified field. Here's an example:
$ sort -n -k 3 equip
acer1 feb 1
IBM1 mar 1
printer1 jul 3
ASUS1 sep 4
xerox apr 4
Apple jun 5
XEROX-FUJI may 5
printer1 oct 6
IBM2 dec 7
acer2 oct 10
acer2 oct 10
ASUS2 nov 20
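The difference between text order and numeric order can be reproduced with three numbers. The sample values and variable names below are assumptions for illustration; LC_ALL=C is set so that the text comparison is byte by byte.

```shell
# Text comparison: "10" and "100" sort before "9" because "1" < "9".
as_text=$(printf '10\n9\n100\n' | LC_ALL=C sort | paste -s -d ' ' -)

# Numeric comparison with -n: values are compared as numbers.
as_number=$(printf '10\n9\n100\n' | LC_ALL=C sort -n | paste -s -d ' ' -)

echo "text:    $as_text"
echo "numeric: $as_number"
```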
The sort command also supports sorting on multiple fields with different priorities: add one -k option per field, in priority order.
The following example sorts on field 2 and, where field 2 is equal, sorts further on field 5. Note that -k2 by itself means "from field 2 to the end of the line"; to restrict a key to a single field, write it as -k2,2.
$ ls -l /etc | sort -k2 -k5 ←Sort according to field 5 when field 2 is equal (The above is omitted) -rw-r--r-- 1 root root 77598 Nov 25 00:30 ld.so.cache -rw-r--r-- 1 root root 84649 Aug 23 2007 sensors.conf -rw-r--r-- 1 root root 117276 Sep 17 2007 Muttrc -rw-r--r-- 1 root root 362047 Apr 18 2007 services -rw-r--r-- 1 root root 412666 Jan 26 14:21 prelink.cache (The following is omitted) |
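How much the key boundaries matter can be sketched with three made-up records. The data and variable names are assumptions for illustration, and LC_ALL=C is set so that the result does not depend on the locale.

```shell
# Hypothetical records: name, group, number.
data=$(printf 'a x 10\nb x 9\nc w 5\n')

# -k2 runs from field 2 to the end of the line, so the number is compared
# as text ("10" sorts before "9") and -k3 never gets to break the tie.
open_keys=$(printf '%s\n' "$data" | LC_ALL=C sort -k2 -k3 | paste -s -d ',' -)

# -k2,2 stops at field 2; ties are then broken numerically on field 3.
bounded_keys=$(printf '%s\n' "$data" | LC_ALL=C sort -k2,2 -k3,3n | paste -s -d ',' -)

echo "open:    $open_keys"
echo "bounded: $bounded_keys"
```

Only the bounded form puts "b x 9" ahead of "a x 10" within group "x".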
$ sort -M -k 2 equip ← sort according to field 2 (the month)
acer1 feb 1
IBM1 mar 1
printer1 jul 3
ASUS1 sep 4
(The following is omitted)
The "-k" option also accepts a character position: "-k 2.3", for example, means that the sort key starts at the third character of column 2.
The following example assumes that an employee "id" has been prefixed to each entry in the list. To ignore the "id" prefix when sorting, make the key start at the 6th character of field 1, as shown below.
$ cat equip1 ← an employee id has been added in front of each entry
id03_xerox apr 4
id04_XEROX-FUJI may 5
id06_acer1 feb 1
id05_IBM1 mar 1
id09_IBM2 dec 7
id12_printer1 jul 3
$ sort -k1.6 equip1 ← start sorting at the sixth character of field 1 (the output shown ignores case; with LANG=C, add "-f" to get the same order)
id06_acer1 feb 1
id05_IBM1 mar 1
id09_IBM2 dec 7
id12_printer1 jul 3
id03_xerox apr 4
id04_XEROX-FUJI may 5
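A character offset in the key can be tried on a smaller list. The three all-lowercase records below are made up for this sketch so that case does not affect the order; LC_ALL=C keeps the comparison byte by byte.

```shell
# Hypothetical entries with a five-character "idNN_" prefix.
data=$(printf 'id03_xerox\nid06_acer\nid05_ibm\n')

# Whole-line sort: ordered by the id prefix.
by_id=$(printf '%s\n' "$data" | LC_ALL=C sort | paste -s -d ',' -)

# -k1.6: the key starts at the 6th character of field 1, skipping the prefix.
by_name=$(printf '%s\n' "$data" | LC_ALL=C sort -k1.6 | paste -s -d ',' -)

echo "by id:   $by_id"
echo "by name: $by_name"
```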
Syntax: [STDIN] sort [-option] [FILES]
Command name/Function/Command user | Options | Function
sort/ Sorting/ Any
-b | Ignore leading whitespace characters when comparing
-d | Dictionary order: only whitespace, digits, and letters are considered
-f | Ignore case
-g | Sort by general numeric value; similar to the option "-n" but also understands scientific notation such as "1.23E10"
-i | Ignore non-printable characters
-k FIELD[.START CHAR] | Sort by the specified column, optionally starting at a given character
-M | Sort by month name ("JAN", "FEB" ... "DEC")
-n | Sort by numeric value
-o FILE | Write the output to FILE instead of the screen
-R | Random order
-r | Reverse sort (largest to smallest)
-t CHAR | Specify the field delimiter character
-u | Output only one of each set of duplicate lines
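Several of these options are commonly combined. The sketch below sorts made-up passwd-style records on their third colon-separated field; the data and variable names are assumptions for illustration.

```shell
# Hypothetical "name:x:uid" records, one of them duplicated.
data=$(printf 'frank:x:502\naaa:x:500\nbbb:x:501\naaa:x:500\n')

# -t: sets the delimiter, -k3,3n sorts numerically on field 3,
# and -u keeps one line for each distinct key.
unique_by_uid=$(printf '%s\n' "$data" | LC_ALL=C sort -t: -k3,3n -u | paste -s -d ',' -)

# The "r" modifier reverses the order of that key (largest uid first).
reversed=$(printf '%s\n' "$data" | LC_ALL=C sort -t: -k3,3nr | paste -s -d ',' -)

echo "unique:   $unique_by_uid"
echo "reversed: $reversed"
```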
$ sort equip | uniq > result ← sort the content of the "equip" file and then use "uniq" to remove adjacent duplicate lines (equivalent to "sort -u equip > result")
$ echo -e 'lineA\nlineA\nlineB' | uniq ← adjacent duplicate lines are removed
lineA
lineB
$ echo -e 'lineA\nlineB\nlineA' | uniq ← non-adjacent duplicate lines are kept; only consecutive duplicates are removed
lineA
lineB
lineA
$ echo -e 'lineA\n\n\n\n\n\nlineB' | uniq ← consecutive empty lines are reduced to a single empty line
lineA

lineB
Syntax: [STDIN] uniq [-option] [FILES]
Command name/Function/Command user | Options | Function
uniq/ Delete adjacent duplicate lines/ Any
-c | Prefix each line with the number of times it occurred
-d | Show only the lines that are repeated (adjacent duplicates), one copy per group
-f # | Skip the first # fields when comparing
-s # | Skip the first # characters when comparing
-u | Opposite of "-d": list only the lines that are not repeated
-w # | Compare at most # characters per line
$ echo -e 'lineA\nLineB\nLineB\nLineC' | uniq -u ← list only the lines that are not repeated
lineA
LineC
$ echo -e 'lineA\nLineB\nLineB\nLineC' | uniq -c ← prefix each line with its number of occurrences
1 lineA
2 LineB
1 LineC
$ uniq -s 9 fileA ← skip the first 9 characters of each line when comparing
$ uniq -w 9 fileA ← compare only the first 9 characters of each line
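Because uniq only looks at adjacent lines, it is usually combined with sort. The sketch below counts how often each word occurs in a made-up list; the data and variable names are assumptions, and sed strips the padding that uniq -c places in front of each count.

```shell
# Hypothetical unsorted input with repeated words.
words=$(printf 'b\na\nb\nc\nb\na\n')

# sort groups equal lines together, uniq -c counts each group,
# and sort -rn orders the groups by count, largest first.
counts=$(printf '%s\n' "$words" | LC_ALL=C sort | uniq -c | LC_ALL=C sort -rn \
         | sed 's/^ *//' | paste -s -d ',' -)

# -d lists the words that occur more than once; -u lists those that occur once.
repeated=$(printf '%s\n' "$words" | LC_ALL=C sort | uniq -d | paste -s -d ',' -)
single=$(printf '%s\n' "$words" | LC_ALL=C sort | uniq -u | paste -s -d ',' -)

echo "counts: $counts"
echo "repeated: $repeated, single: $single"
```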