The Linux Newbie Guide
All rights reserved, please indicate the source when citing
 

Regular Expressions

1.0 Introduction to Regular Expressions
1.1 Basic Regular Expressions (RE/BRE)
        Bracket Expressions (POSIX bracket notation)
        POSIX Characters
        . : Match any single character
        * : Match the preceding character repeated zero or more times
        ^ : Match at the start of a line
        $ : Match at the end of a line
        & : Refer to the matched string
        {a, b} : Match the preceding character repeated a to b times
        ( ) : Group part of a pattern
        < > : Match a whole word
        ( )\1 : Back-reference to a remembered group
        | : OR match
1.2 Extended Regular Expressions (ERE)
        | : OR match
        + : Match the preceding character repeated one or more times
        ? : Match the preceding character repeated zero or one time

  1.0 Introduction to Regular Expressions
"Regular expressions" (abbreviated as RE, regex, or regexp) are used to match strings or characters that follow a certain rule.

For example, a company's product catalogs may spell the word sometimes as "color" and sometimes as "colour". To search for either spelling, I only need the pattern "colou*r", which finds both "color" and "colour". I also often use the regular expression "\<[^aeiouyAEIOUY]*\>" to find words without vowels and check for possible typos. Regular expressions take only a little time to learn and are convenient and practical.

A matching description written in basic or extended regular expressions is called a "pattern", such as "\<[^aeiouyAEIOUY]*\>" or "colou*r" in the example above.
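
As a quick check of the "colou*r" pattern, the sketch below feeds a few sample words to grep. The word list is made up for illustration. Note that "u*" means zero or more "u", so the pattern accepts "colouur" as well as the two intended spellings.

```shell
# Sample words (made up for this illustration), one per line.
words=$(printf '%s\n' color colour colouur colr)

# BRE: "u*" matches zero or more "u" characters.
# -x requires the whole line to match the pattern.
matched=$(printf '%s\n' "$words" | grep -x 'colou*r')

# Prints color, colour and colouur, one per line ("colr" has no "o" before "r").
printf '%s\n' "$matched"
```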

The example above is only a small taste of regular expressions. Regular expressions were originally popular only in a few UNIX tools such as grep and sed. Because they are so powerful and easy to use, they gradually spread elsewhere. For example, Notepad++, a widely used Windows editor, supports regular expressions. What you learn can therefore be applied in many places, and it is well worth learning.

MS Word wildcard

Many people confuse regular expressions with wildcards and cannot tell them apart. This is no wonder, because the two use overlapping symbols whose meanings are not necessarily the same. For example, as a wildcard "*" matches any string by itself, while in a regular expression "*" repeats the preceding character zero or more times. What is even more confusing is that since UNIX V6 was released in 1975, "globbing patterns" (also called "glob") have extended the wildcard syntax. Since then wildcards have also supported bracket expressions, so their syntax overlaps with that of regular expressions (although the results are not necessarily the same).

In most cases, wildcards operate only on file names (for example, ls /dev/[sh]d*). Regular expressions can be thought of as an enhanced version of wildcards: they can also match the contents of files or the output of programs (for example, seq 1 999 | grep '5\{2,3\}'), and they match strings that follow a given rule more precisely and flexibly. Regular expressions also have obvious shortcomings: the syntax is not easy for earthlings to read, and not all commands or tools support it.
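
The grep command from the paragraph above can be tried directly. The sketch below only adds a count and the first match so the result is easy to check; the variable names are chosen for this illustration.

```shell
# Numbers from 1 to 999 that contain two or three consecutive 5s.
# \{2,3\} is the BRE interval: the preceding "5" repeated 2 to 3 times.
count=$(seq 1 999 | grep -c '5\{2,3\}')
first=$(seq 1 999 | grep '5\{2,3\}' | head -n 1)

# Prints "55 19": the first match is 55, and 19 numbers match in total
# (55, 155, ..., 955 plus 550 to 559, with 555 counted once).
echo "$first $count"
```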

Before explaining regular expressions, here is an example that illustrates the biggest disadvantage of wildcards. The goal is to use wildcards to list the files or directories in "/etc" whose names begin with an uppercase letter.

Example:
$ cd /etc
$ LANG=POSIX    # Set the locale to "POSIX" (equivalent to "LANG=C", or "LANG=" to clear the language setting)
$ ls -d [A-Z]* | head -n 5    # Use wildcards to list the first 5 files or directories whose names begin with an uppercase letter
ConsoleKit
DIR_COLORS
Muttrc
DIR_COLORS.xterm
Muttrc.local
$ LANG=en_US.UTF-8    # Set the locale to "en_US.UTF-8"
$ ls -d [A-Z]* | head -n 5    # Repeat the same command; the output is different now
bashrc
blkid
bluetooth
bonobo-activation
capi.conf

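This experiment illustrates one of the biggest disadvantages of wildcards: the same command can produce different output on different machines or in different environments. The locale determines how the range [A-Z] is collated. In the en_US.UTF-8 environment tested here, the collation order interleaves lowercase and uppercase letters (a A b B c C ...), so [A-Z] also matches names such as "bashrc" that begin with a lowercase letter.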

If the same task is rewritten with ls piped into grep, which supports regular expressions, the output in the test below is the same under both locale settings.

Example:
$ cd /etc
$ LANG=POSIX
$ ls -d * | grep '^[A-Z]' | head -n 5    # Use ls with grep to list the first 5 files or directories whose names begin with an uppercase letter
ConsoleKit
DIR_COLORS
DIR_COLORS.xterm
Muttrc
Muttrc.local
$ LANG=en_US.UTF-8    # Set the locale to "en_US.UTF-8"
$ ls -d * | grep '^[A-Z]' | head -n 5    # Run the command again; the results are consistent now
ConsoleKit
DIR_COLORS
DIR_COLORS.xterm
Muttrc
Muttrc.local
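
If the goal is "the name begins with an uppercase letter" regardless of locale, one option is the POSIX character class [[:upper:]], which can be used in both wildcards and regular expressions. The following is a minimal sketch, assuming a shell and grep that support POSIX character classes. It uses a scratch directory and made-up file names instead of /etc.

```shell
# Work in a scratch directory so the result does not depend on /etc.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch Alpha beta Gamma delta   # hypothetical file names

# Wildcard version: [[:upper:]] names the class of uppercase letters
# instead of relying on how the range A-Z is collated.
glob_result=$(printf '%s\n' [[:upper:]]*)

# Regular-expression version: ^ anchors the match to the start of the name.
grep_result=$(ls | grep '^[[:upper:]]')

# Both variables hold "Alpha" and "Gamma", one per line.
printf '%s\n' "$glob_result" "$grep_result"

# Clean up the scratch directory.
cd / && rm -rf "$dir"
```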

Regular expressions themselves are not difficult, but writing patterns that match correctly takes experience and practice. Regular expressions are divided into basic regular expressions (BRE) and extended regular expressions (ERE). However, the level of support varies among tools. The table below lists the level of support in commonly used Unix/Linux tools.

Utility    Basic regular expressions    Extended regular expressions
shell    
vi  
locate  
find
grep
sed
awk





  
1.1 Basic Regular Expressions (RE/BRE)

Basic regular expressions are often abbreviated as regex, RE, BRE, or "posix-basic". They are the most basic form of regular expressions. Unless stated otherwise, "regular expression" usually refers to the basic regular expression. If a command's man page mentions support for "regexp", "regex", or "posix-basic", the command supports basic regular expressions.

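In basic regular expressions, the characters "{ }", "( )", "< >", and "|" are ordinary literal characters by default. To give them their special meaning they must be escaped with a backslash, as in "\{ \}", "\( \)", "\< \>", and "\|" (the last two are extensions found in GNU tools rather than part of POSIX BRE). In extended regular expressions, "{ }", "( )", and "|" are metacharacters without the backslash. This is the most notable difference between basic regular expressions and extended regular expressions (ERE).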
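
The effect of the backslash can be seen with grep, which uses basic regular expressions by default and extended ones with -E. This is a minimal sketch; the two sample lines are made up for illustration.

```shell
# Two sample lines: the first repeats "a", the second contains literal braces.
sample=$(printf '%s\n' 'aa' 'a{2}')

# BRE: \{2\} is an interval, so this matches the line "aa".
bre_interval=$(printf '%s\n' "$sample" | grep 'a\{2\}')

# BRE: unescaped braces are ordinary characters, so this matches "a{2}".
bre_literal=$(printf '%s\n' "$sample" | grep 'a{2}')

# ERE (grep -E): unescaped braces form the interval, so this matches "aa".
ere_interval=$(printf '%s\n' "$sample" | grep -E 'a{2}')

# Prints "aa", "a{2}" and "aa", one per line.
printf '%s\n' "$bre_interval" "$bre_literal" "$ere_interval"
```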

Since the vi (vim) editor has good regular expression support and highlights matched strings, it is well suited for practicing regular expressions, so the following examples use the search function in vi's normal mode.
(If matched strings are not highlighted, enter ":set hlsearch" in vi or add "set hlsearch" to the vi configuration file. On some Linux distributions such as Fedora, you may need to install "vim-enhanced" for matched text to be highlighted, using "sudo dnf install vim-enhanced" or "sudo yum install vim-enhanced".)

Please enter the following English tongue twisters in vim and save them as "re.txt". The file may be used again when introducing extended regular expressions, grep, and sed (you can copy and paste if you don't want to type it out):

(vi editor)
busy buzzing bumblebees buzzing busying
6 silly sisters selling shining shoes
The driver was drunk and drove the doctor's car into deep ditch.
can you can a can as a canner can can a can

google Goggles Solves SUDOKU Puzzles.
How much oil boil can a gum boil boil if a gum boil can boil oil?
Where's the peck of pickled peppers Peter Piper picked?
55 Flags freely flutter from the floating frigate
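
For readers who prefer the command line to vi, the same text can be searched with grep. The sketch below writes the practice text to a temporary file (rather than "re.txt", to avoid overwriting anything) and tries three of the basic metacharacters from the outline. The variable names are chosen for this illustration.

```shell
# Write the practice text to a temporary file.
retxt=$(mktemp)
cat > "$retxt" <<'EOF'
busy buzzing bumblebees buzzing busying
6 silly sisters selling shining shoes
The driver was drunk and drove the doctor's car into deep ditch.
can you can a can as a canner can can a can

google Goggles Solves SUDOKU Puzzles.
How much oil boil can a gum boil boil if a gum boil can boil oil?
Where's the peck of pickled peppers Peter Piper picked?
55 Flags freely flutter from the floating frigate
EOF

# ^ anchors to the start of a line: count lines that begin with a digit.
digit_lines=$(grep -c '^[0-9]' "$retxt")

# $ anchors to the end of a line: count lines that end with "can".
can_lines=$(grep -c 'can$' "$retxt")

# . matches any single character: "b", any one character, then "il".
boil_lines=$(grep -c 'b.il' "$retxt")

# Prints "2 1 1".
echo "$digit_lines $can_lines $boil_lines"
rm -f "$retxt"
```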




1.2 Extended Regular Expressions (ERE)


But why are there "extended regular expressions" at all? I personally think (this is not necessarily correct; I will verify it when I have time) it is because the OR operator "|" was left out when basic regular expressions were defined.

Why define extended regular expressions just for the OR operator "|"? Is "|" that important? Yes! As a simple example, suppose I want to match the word "as" or "if" without using "|". I might write "[ai][sf]", but then I have to pray that the text contains no word "is", which the pattern also matches.

However, simply adding "|" as an OR operator to basic regular expressions would cause compatibility problems: an existing pattern might use "|" only to match the literal character "|", not to perform an OR match. The solution was for basic regular expressions to write the OR operator with an escape character, as "\|", and for a new notation called "extended regular expressions" to use "|" directly for OR matching, with a few other things changed along the way.
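
The "as" or "if" example can be tried with grep. This is a minimal sketch with a made-up word list; the "\|" form in the last command is assumed to be run with GNU grep, where it is available as an extension to basic regular expressions.

```shell
# Candidate words, one per line ("is" and "af" are the unwanted ones).
words=$(printf '%s\n' as if is af)

# Bracket expressions combine freely, so all four words match.
bracket=$(printf '%s\n' "$words" | grep -x '[ai][sf]')

# ERE alternation matches exactly the two intended words.
ere=$(printf '%s\n' "$words" | grep -xE 'as|if')

# BRE with \| (GNU grep assumed) selects the same two words.
bre=$(printf '%s\n' "$words" | grep -x 'as\|if')

printf '%s\n' "bracket:" "$bracket" "ere:" "$ere" "bre:" "$bre"
```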

Extended regular expressions, also called "posix-extended", are often abbreviated as ERE. Their differences from basic regular expressions are described below.
So extended regular expressions are nothing special, are they? Generally, a command that supports extended regular expressions selects them with the option "-E", as in grep -E. grep is a typical command that supports extended regular expressions.

The new items added to the extended regular expression are as follows:
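
The outline at the top of this page lists three such items: "|", "+" and "?". The following is a minimal sketch of each with grep -E, using made-up sample lines.

```shell
# Three sample lines: 1, 10 and 100.
sample=$(printf '%s\n' 1 10 100)

# + : the preceding character one or more times ("1" then at least one "0").
plus=$(printf '%s\n' "$sample" | grep -xE '10+')

# ? : the preceding character zero or one time ("1" then at most one "0").
optional=$(printf '%s\n' "$sample" | grep -xE '10?')

# | : either alternative.
either=$(printf '%s\n' "$sample" | grep -xE '1|100')

# plus holds 10 and 100; optional holds 1 and 10; either holds 1 and 100.
printf '%s\n' "plus:" "$plus" "optional:" "$optional" "either:" "$either"
```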







[Note]
ASCII table (source: "https://en.wikipedia.org/wiki/ASCII")
Repr: common Linux representation. XON/XOFF: software flow control on/off.
Dec  Hex  Abbr  Repr  Name                    Dec  Hex  Glyph    Dec  Hex  Glyph  Dec  Hex  Glyph
0    00   NUL         null                    32   20   (space)  65   41   A      98   62   b
1    01   SOH         start of heading        33   21   !        66   42   B      99   63   c
2    02   STX         start of text           34   22   "        67   43   C      100  64   d
3    03   ETX         end of text             35   23   #        68   44   D      101  65   e
4    04   EOT         end of transmission     36   24   $        69   45   E      102  66   f
5    05   ENQ         enquiry                 37   25   %        70   46   F      103  67   g
6    06   ACK         acknowledge             38   26   &        71   47   G      104  68   h
7    07   BEL   \a    bell                    39   27   '        72   48   H      105  69   i
8    08   BS    \b    backspace               40   28   (        73   49   I      106  6A   j
9    09   TAB   \t    horizontal tab          41   29   )        74   4A   J      107  6B   k
10   0A   LF    \n    line feed, new line     42   2A   *        75   4B   K      108  6C   l
11   0B   VT    \v    vertical tab            43   2B   +        76   4C   L      109  6D   m
12   0C   FF    \f    NP form feed, new page  44   2C   ,        77   4D   M      110  6E   n
13   0D   CR    \r    carriage return         45   2D   -        78   4E   N      111  6F   o
14   0E   SO          shift out               46   2E   .        79   4F   O      112  70   p
15   0F   SI          shift in                47   2F   /        80   50   P      113  71   q
16   10   DLE         data link escape        48   30   0        81   51   Q      114  72   r
17   11   DC1         device ctrl. 1 (XON)    49   31   1        82   52   R      115  73   s
18   12   DC2         device ctrl. 2          50   32   2        83   53   S      116  74   t
19   13   DC3         device ctrl. 3 (XOFF)   51   33   3        84   54   T      117  75   u
20   14   DC4         device ctrl. 4          52   34   4        85   55   U      118  76   v
21   15   NAK         negative ack.           53   35   5        86   56   V      119  77   w
22   16   SYN         syn. idle               54   36   6        87   57   W      120  78   x
23   17   ETB         end of trans. block     55   37   7        88   58   X      121  79   y
24   18   CAN         cancel                  56   38   8        89   59   Y      122  7A   z
25   19   EM          end of medium           57   39   9        90   5A   Z      123  7B   {
26   1A   SUB         substitute              58   3A   :        91   5B   [      124  7C   |
27   1B   ESC         escape                  59   3B   ;        92   5C   \      125  7D   }
28   1C   FS          file separator          60   3C   <        93   5D   ]      126  7E   ~
29   1D   GS          group separator         61   3D   =        94   5E   ^      127  7F   DEL (invisible)
30   1E   RS          record separator        62   3E   >        95   5F   _
31   1F   US          unit separator          63   3F   ?        96   60   `
127  7F   DEL         delete                  64   40   @        97   61   a