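1.0 Introduction to Regular Expressions
Regular expressions (abbreviated RE, regex, or regexp; in Chinese 正規表示法 or 正則表達式) serve one purpose: to match strings or characters that conform to a given rule.
For example, a company's product catalogs may label color as "color" in some places and "colour" in others. If one day I need to search for either keyword, the single regular expression "colou*r" finds both spellings (strictly, "u*" means zero or more "u", so it would also match "colouur"). I also often use the regular expression "\<[^aeiouyAEIOUY]*\>" to find words that contain no vowels, as a first-pass check for typing errors. A little time spent learning gives you the sharp eye of regular expressions, which is convenient and practical.
A matching statement written in regular expressions or extended regular expressions is called a "pattern", such as "\<[^aeiouyAEIOUY]*\>" or "colou*r" in the example above.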
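The "colou*r" pattern can be tried directly with grep. The following is a minimal sketch; the three catalog lines are made up for illustration and are not from the original text.

```shell
# Hypothetical catalog lines (assumed sample data).
sample='red color swatch
blue colour swatch
grey coler swatch'

# "u*" means zero or more "u", so both spellings match; "coler" does not.
matches=$(printf '%s\n' "$sample" | grep 'colou*r')
printf '%s\n' "$matches"

# -c counts the matching lines instead of printing them.
count=$(printf '%s\n' "$sample" | grep -c 'colou*r')
echo "$count"
```

Only the "color" and "colour" lines are printed, so the count is 2.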
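The example above is only a small taste. Regular expressions were originally popular only in UNIX utilities such as grep and sed, but because they are so powerful and handy they gradually spread elsewhere. Even Microsoft provides a fair degree of regular-expression support in Word/Excel, so what you learn applies in many places and is well worth the effort.
<The figure below shows MS Word using a regular expression to find "tha" or "thi">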
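Many people confuse regular expressions with wildcards. This is understandable: the two share some symbols, but the symbols do not necessarily mean the same thing. Adding to the confusion, UNIX V6, released in 1975, introduced "globbing patterns" (also called "glob") [Note 1.0], which extended wildcard syntax. Since then wildcards have also offered bracket expressions, which partly overlap with regular-expression syntax (although the results are not necessarily the same).
In most cases wildcards can only operate on file names (for example ls /dev/[sh]d* ). Regular expressions can be thought of as an enhanced version of wildcards: they can also match against file contents or program output (for example seq 1 999 | grep '5\{2,3\}' ), and they match strings against a rule with more precision and flexibility. Regular expressions also have clear shortcomings: the syntax reads like Martian to earthlings, and not every command or tool supports regular expressions.
Before explaining regular expressions, here is an example of the biggest weakness of wildcards. The goal below is to use a wildcard to list the files or directories in "/etc" whose first character is uppercase:
Example: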
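    $ cd /etc
    $ LANG=POSIX                  ← set the locale to "POSIX" (equivalent to "LANG=C", or to "LANG=", which clears any configured locale)
    $ ls -d [A-Z]* | head -n 5    ← use a wildcard to list the first 5 files or directories whose first character is uppercase
    ConsoleKit
    DIR_COLORS
    Muttrc
    DIR_COLORS.xterm
    Muttrc.local
    $ LANG=en_US.UTF-8            ← set the locale to "en_US.UTF-8"
    $ ls -d [A-Z]* | head -n 5    ← run the same command again
    bashrc                        ← the output is different?
    blkid
    bluetooth
    bonobo-activation
    capi.conf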
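    $ cd /etc
    $ LANG=POSIX
    $ ls -d * | grep '^[A-Z]' | head -n 5    ← use ls with grep to list the first 5 files or directories whose first character is uppercase ("^" anchors the test to the first character)
    ConsoleKit
    DIR_COLORS
    DIR_COLORS.xterm
    Muttrc
    Muttrc.local
    $ LANG=en_US.UTF-8                       ← set the locale to "en_US.UTF-8"
    $ ls -d * | grep '^[A-Z]' | head -n 5    ← test again and compare
    ConsoleKit                               ← the result is now consistent
    DIR_COLORS
    DIR_COLORS.xterm
    Muttrc
    Muttrc.local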
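One way to sidestep the locale-dependent meaning of the range A-Z is to name the character class itself. The sketch below assumes a bash-like shell and uses a throwaway directory with hypothetical file names instead of /etc, so the result does not depend on what is installed.

```shell
# Throwaway directory with hypothetical file names (assumed sample data).
tmpdir=$(mktemp -d)
touch "$tmpdir/Alpha" "$tmpdir/beta" "$tmpdir/Gamma" "$tmpdir/delta"

# [[:upper:]] means "an uppercase letter" and does not rely on how the
# current locale orders the characters between A and Z.
upper_glob=$(cd "$tmpdir" && echo [[:upper:]]*)

# The same class inside a regular expression, anchored to the first character.
upper_count=$(ls "$tmpdir" | grep -c '^[[:upper:]]')

echo "$upper_glob"
echo "$upper_count"
rm -rf "$tmpdir"
```

Both the wildcard and the regular expression select only Alpha and Gamma here, whichever of the two locales from the example above is active.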
Utility | Basic regular expressions (BRE) | Extended regular expressions (ERE) |
shell | ||
vi | ○ | |
locate | ○ | |
find | ○ | ○ |
grep | ○ | ○ |
sed | ○ | ○ |
awk | ○ | ○ |
    busy buzzing bumblebees buzzing busying
    6 silly sisters selling shining shoes
    The driver was drunk and drove the doctor's car into deep ditch.
    can you can a can as a canner can can a can
    google Goggles Solves SUDOKU Puzzles.
    How much oil boil can a gum boil boil if a gum boil can boil oil?
    Where's the peck of pickled peppers Peter Piper picked?
    55 Flags freely flutter from the floating frigate
    busy buzzing bumblebees buzzing busying
    6 silly sisters selling shining shoes
    The driver was drunk and drove the doctor's car into deep ditch.
    (middle omitted)
    /bu[nms]    ← on the vi status line, search for "bu" followed by "n", "m", or "s"
Regular Expressions | Matches |
[0-9] | any digit |
[a-z] | any lowercase letter |
[A-Z] | any uppercase letter |
[a-zA-Z] or [A-Za-z] | any letter |
[0-9a-zA-Z] | any digit or letter |
    busy buzzing bumblebees buzzing busying
    6 silly sisters selling shining shoes
    The driver was drunk and drove the doctor's car into deep ditch.
    can you can a can as a canner can can a can
    google Goggles Solves SUDOKU Puzzles.
    How much oil boil can a gum boil boil if a gum boil can boil oil?
    Where's the peck of pickled peppers Peter Piper picked?
    (middle omitted)
    /[A-Z]he    ← search for an uppercase letter followed by "he"

    busy buzzing bumblebees buzzing busying
    6 silly sisters selling shining shoes
    The driver was drunk and drove the doctor's car into deep ditch.
    can you can a can as a canner can can a can
    (middle omitted)
    /[^A-Z]he    ← search for a character that is not uppercase, followed by "he"
POSIX Characters (matched characters) | |||
POSIX | ASCII | Description | Note |
[:alnum:] | [A-Za-z0-9] | letters and digits | |
[:alpha:] | [A-Za-z] | letters | |
[:blank:] | [ \t] | space (ASCII = 20H) and TAB (ASCII = 9H) | |
[:cntrl:] | [0H-1FH,7FH] | control characters | |
[:digit:] | [0-9] | digits | |
[:graph:] | [21H-7EH] | visible characters | |
[:upper:] | [A-Z] | uppercase letters | |
[:lower:] | [a-z] | lowercase letters | |
[:print:] | [20H-7EH] | visible characters plus space | |
[:punct:] | [\]\[!"#$%&')(*+,./:;<=>?@\^_`{|}~-] | punctuation and symbols | |
[:space:] | [ \t\r\n\v\f] | whitespace (characters that are not displayed) | Whitespace characters (such as newline) are normally not displayed, but Linux has fixed notations for some commonly used whitespace characters (see the column of common Linux notations in the ASCII table of [Note 1.1]) |
[:xdigit:] | [A-Fa-f0-9] | hexadecimal digits | |
    busy buzzing bumblebees buzzing busying
    6 silly sisters selling shining shoes
    The driver was drunk and drove the doctor's car into deep ditch.
    can you can a can as a canner can can a can
    google Goggles Solves SUDOKU Puzzles.
    How much oil boil can a gum boil boil if a gum boil can boil oil?
    Where's the peck of pickled peppers Peter Piper picked?
    55 Flags freely flutter from the floating frigate
    (middle omitted)
    /[[:digit:]]    ← use POSIX Characters to search for Arabic numerals
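A POSIX class is written inside a bracket expression, hence the doubled brackets in [[:digit:]]. The sketch below applies it with grep to three lines of the sample text; -o is assumed to be available (GNU or BSD grep).

```shell
text='6 silly sisters selling shining shoes
can you can a can as a canner can can a can
55 Flags freely flutter from the floating frigate'

# Number of lines that contain at least one digit.
digit_lines=$(printf '%s\n' "$text" | grep -c '[[:digit:]]')

# Each run of one or more digits, joined onto a single line.
numbers=$(printf '%s\n' "$text" | grep -o '[[:digit:]][[:digit:]]*' | paste -sd ' ' -)

echo "$digit_lines"
echo "$numbers"
```

Two lines contain digits, and the extracted runs are 6 and 55.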
    busy buzzing bumblebees buzzing busying
    6 silly sisters selling shining shoes
    The driver was drunk and drove the doctor's car into deep ditch.
    (middle omitted)
    /s.ll    ← "." matches any single character, so both "sill" and "sell" on the second line match
Regular Expressions | Matches | Examples |
.* | zero or more of any character | [\]\[!"#$ABCabc0123\n\a\t etc.
[0-9][0-9]* | one or more consecutive digits | 0, 11, 12345, 543543543
[A-Z][A-Z]* | one or more consecutive uppercase letters | Y, IJK, ZZZZZZ
[a-z][a-z]* | one or more consecutive lowercase letters | z, xyz, abcdefghijk
Goo*gle | one or more "o" between "G" and "gle" | Gogle, Google, Gooooooooooogle
yaho* | "yah" followed by zero or more "o" | yah, yaho, yahoooooooooooooo
.*k | from the first character, any characters up to a "k" | This is a book
G.* | from "G" to the end of the line | Good morning Mr. Chen
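The rows of the table above can be checked with grep. This sketch uses made-up word lists; "^" and "$" are added so that each word must match as a whole, and -o is assumed to be available.

```shell
words='Gogle
Google
Gooooogle
Ggle'

# "oo*" is one "o" followed by zero or more "o", so "Ggle" is rejected.
matched=$(printf '%s\n' "$words" | grep -c '^Goo*gle$')

# ".*" is greedy: it runs to the LAST "k" on the line, not the first.
greedy=$(printf '%s\n' 'look at the book now' | grep -o '.*k')

echo "$matched"
echo "$greedy"
```

Three of the four words match, and the greedy match is "look at the book" rather than just "look".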
    busy buzzing bumblebees buzzing busying
    6 silly sisters selling shining shoes
    The driver was drunk and drove the doctor's car into deep ditch.
    can you can a can as a canner can can a can
    (middle omitted)
    /can*

    busy buzzing bumblebees buzzing busying
    6 silly sisters selling shining shoes
    The driver was drunk and drove the doctor's car into deep ditch.
    can you can a can as a canner can can a can
    (middle omitted)
    /^can

    busy buzzing bumblebees buzzing busying
    6 silly sisters selling shining shoes
    The driver was drunk and drove the doctor's car into deep ditch.
    can you can a can as a canner can can a can
    google Goggles Solves SUDOKU Puzzles.
    (middle omitted)
    /^[^A-Z]    ← match a character at the start of the line that is not uppercase

    busy buzzing bumblebees buzzing busying
    6 silly sisters selling shining shoes
    The driver was drunk and drove the doctor's car into deep ditch.
    can you can a can as a canner can can a can
    (middle omitted)
    /an$

    busy buzzing bumblebees buzzing busying
    6 silly sisters selling shining shoes
    The driver was drunk and drove the doctor's car into deep ditch.
    can you can a can as a canner can can a can
                                                     ← blank line
    google Goggles Solves SUDOKU Puzzles.
    (middle omitted)
    /^$    ← match blank lines
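The anchors "^" (start of line) and "$" (end of line) used in the searches above can be counted with grep -c. A minimal sketch on four lines of the sample text, one of them blank:

```shell
text='busy buzzing bumblebees buzzing busying
can you can a can as a canner can can a can

google Goggles Solves SUDOKU Puzzles.'

starts=$(printf '%s\n' "$text" | grep -c '^can')   # lines that begin with "can"
ends=$(printf '%s\n' "$text" | grep -c 'an$')      # lines that end with "an"
blank=$(printf '%s\n' "$text" | grep -c '^$')      # lines with nothing between start and end

echo "$starts $ends $blank"
```

Each count is 1: the second line both starts with "can" and ends with "an", and the third line is blank.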
    busy buzzing bumblebees buzzing busying
    6 silly sisters selling shining shoes
    (The) driver was drunk and drove the doctor's car into deep ditch.
    can you can a can as a canner can can a can
    google (Goggles) (Solves) SUDOKU (Puzzles).
    (middle omitted)
    :1,$ s/[A-Z][a-z][a-z]*/(&)/g    ← wrap in "( )" every word whose first letter is uppercase and whose following letters are lowercase
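The same substitution works in sed, where "&" in the replacement also stands for the whole matched text. A sketch on one line of the sample file, with LC_ALL=C so the ranges keep their ASCII meaning:

```shell
line='google Goggles Solves SUDOKU Puzzles.'

# "&" is replaced by whatever [A-Z][a-z][a-z]* matched.
# "SUDOKU" is left alone because no uppercase letter in it is followed by a lowercase one.
wrapped=$(printf '%s\n' "$line" | LC_ALL=C sed 's/[A-Z][a-z][a-z]*/(&)/g')

echo "$wrapped"
```

This prints "google (Goggles) (Solves) SUDOKU (Puzzles)."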
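    What are the numbers for I II III IV V VI VII VIII VIIII in Roman numerals?
    (middle omitted)
    /VI\{2,3\}    ← search for "V" followed by 2 to 3 consecutive "I"

    gogle google gooogle goooogle gooooogle is a gd god good goood
    (middle omitted)
    /go\{,2\}d    ← search for strings with at most 2 "o" between "g" and "d"

    gogle google gooogle goooogle gooooogle is a gd god good goood
    (middle omitted)
    /o\{3,\}    ← search for strings with at least 3 "o" (without the comma, "\{3\}" means exactly 3)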
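The interval forms are easier to compare when every word is matched as a whole. The sketch below anchors each pattern with "^" and "$" and writes the upper-bound form as "\{0,2\}", which is the portable spelling; "\{,2\}" is assumed here to be a GNU shorthand for the same thing.

```shell
romans='VI
VII
VIII
VIIII'
g_words='gd
god
good
goood'
o_words='gogle
google
gooogle
goooogle
gooooogle'

two_to_three=$(printf '%s\n' "$romans" | grep -c '^VI\{2,3\}$')    # VII, VIII
at_most_two=$(printf '%s\n' "$g_words" | grep -c '^go\{0,2\}d$')   # gd, god, good
exactly_three=$(printf '%s\n' "$o_words" | grep -c '^go\{3\}gle$') # gooogle
at_least_three=$(printf '%s\n' "$o_words" | grep -c '^go\{3,\}gle$')

echo "$two_to_three $at_most_two $exactly_three $at_least_three"
```

The counts are 2, 3, 1, and 3. The last two show the difference between "\{3\}" (exactly three) and "\{3,\}" (three or more).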
    1,000 milliliter equal 1 liter.
    (middle omitted)
    /\(li\)*ter    ← match "ter" preceded by 0 to ∞ repetitions of "li"

    busy buzzing bumblebees buzzing busying
    6 silly sisters selling shining shoes
    The driver was drunk and drove the doctor's car into deep ditch.
    can you can a can as a canner can can a can
    google Goggles Solves SUDOKU Puzzles.
    (middle omitted)
    /\<[[:upper:]]*\>    ← match words made up entirely of uppercase letters

    busy buzzing bumblebees buzzing busying
    6 silly sisters selling shining shoes
    (middle omitted)
    /\([a-zA-Z]\)\1    ← search for two adjacent, identical letters

    busy buzzing bumblebees buzzing busying
    silly 6 sisters selling shining shoes
    The driver was drunk and drove the doctor's car into deep ditch.
    (middle omitted)
    :1,$ s/\(\<[0-9]*\>\) \(\<[a-z]*\>\) /\2 \1 /g    ← if a word made up of digits is followed by a word made up of lowercase letters, swap the two
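Groups and back-references carry over to grep and sed. The sketch below is a simplified variant of the two commands above: it drops the "\<" and "\>" word anchors and uses "[0-9][0-9]*" so that each group must contain at least one character. grep -o is assumed to be available.

```shell
# \1 repeats whatever the first \( \) group matched, so this finds doubled letters.
doubles=$(printf '%s\n' 'busy buzzing bumblebees' | grep -o '\([a-z]\)\1' | paste -sd ' ' -)

# In sed, \1 and \2 in the replacement refer to the first and second groups,
# so writing "\2 \1" swaps the number and the word that follows it.
swapped=$(printf '%s\n' '6 silly sisters selling shining shoes' \
  | sed 's/\([0-9][0-9]*\) \([a-z][a-z]*\)/\2 \1/')

echo "$doubles"
echo "$swapped"
```

The doubled letters found are "zz" and "ee", and the swapped line begins "silly 6 sisters".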
    busy buzzing bumblebees buzzing busying
    6 silly sisters selling shining shoes
    The driver was drunk and drove the doctor's car into deep ditch.
    can you can a can as a canner can can a can
    google Goggles Solves SUDOKU Puzzles.
    (middle omitted)
    /bus\|buzz    ← match "bus" or "buzz"
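1.2 Extended Regular Expressions (ERE)
Why is there an "extended regular expression" at all? In my view (which is not necessarily right; I will check when I have time), it is because the definition of basic regular expressions left out the alternation ("or") operator "|".
Why define extended regular expressions just for the alternation symbol "|"? Is "|" that important? Yes! As a simple example, suppose I want to match the word "as" or "if" without using "|". I might write "[ai][sf]", but then I have to hope the text never contains the word "is", which that pattern also matches.
But simply adding the alternation symbol "|" to basic regular expressions would cause compatibility problems, for example with older patterns that only meant to match the literal character "|" and not to perform alternation.
The solution was that basic regular expressions write alternation with an escape character, as "\|", while a new notation called "extended regular expressions" uses a bare "|" for alternation and also changes a few other things along the way.
Extended regular expressions, also called "posix-extended" and commonly abbreviated ERE, differ from basic regular expressions as follows.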
    $ grep -E 'ca(r|n)' re.txt      ← match "car" or "can" (note: no escape characters needed)
    The driver was drunk and drove the doctor's car into deep ditch.
    can you can a can as a canner can can a can
    How much oil boil can a gum boil boil if a gum boil can boil oil?
    $ grep 'ca\(r\|n\)' re.txt      ← without the "-E" option the pattern is a basic regular expression, and alternation needs escape characters
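The "as" or "if" problem described earlier in this section can be measured directly. This sketch uses a made-up four-word list and assumes GNU grep, since "\|" in a basic regular expression is treated here as a GNU feature.

```shell
words='as
if
is
af'

# The bracket workaround accepts every combination of [ai] and [sf].
loose=$(printf '%s\n' "$words" | grep -c '^[ai][sf]$')

# Alternation accepts only the two intended words.
bre=$(printf '%s\n' "$words" | grep -c '^\(as\|if\)$')   # basic syntax, escaped
ere=$(printf '%s\n' "$words" | grep -Ec '^(as|if)$')     # extended syntax, bare

echo "$loose $bre $ere"
```

The bracket version matches all 4 words, while both alternation forms match only 2.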
    $ grep -E 'go+gle' re.txt           ← list "gogle", "google", or "goooooooooooogle"
    google Goggles Solves SUDOKU Puzzles.
    $ seq 1 10000 | grep -E '^199+'     ← list the numbers that begin with 199, 1999, 19999999, and so on
    199
    1990
    1991
    1992
    1993
    1994
    1995
    1996
    1997
    1998
    1999
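In extended syntax "+" means one or more of the preceding item, which basic syntax has to spell out as the item followed by the item with "*". A short sketch comparing the two on the seq example (seq is assumed to be available, as in the example above):

```shell
# ERE: "19" followed by one or more "9".
ere_count=$(seq 1 10000 | grep -Ec '^199+')

# BRE equivalent: "199" followed by zero or more "9".
bre_count=$(seq 1 10000 | grep -c '^1999*')

echo "$ere_count $bre_count"
```

Both patterns select the same 11 numbers: 199 and 1990 through 1999.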
    $ grep -E '[+-]?[0-9]+\.[0-9]+([Ee][+-][0-9]+)?' fileA    ← list the real numbers in "fileA"
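To see what the real-number pattern accepts, it can be run against a few candidate values. The values below are made up for illustration, and -o is assumed to be available. Because the pattern is not anchored, it also matches a number embedded in other text.

```shell
pattern='[+-]?[0-9]+\.[0-9]+([Ee][+-][0-9]+)?'
candidates='3.14
-0.5
6.02e+23
42
abc'

# "42" has no decimal point and "abc" has no digits, so neither matches.
count=$(printf '%s\n' "$candidates" | grep -Ec "$pattern")

# -o shows that the optional exponent group is included in the match.
extracted=$(printf '%s\n' 'x=6.02e+23;' | grep -Eo "$pattern")

echo "$count"
echo "$extracted"
```

Three of the five candidates match, and the extracted text is "6.02e+23". Note that the exponent, when present, must carry an explicit sign in this pattern.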
Dec | Hex | Abbr. | Common Linux notation | Name / meaning | Dec | Hex | Character | Dec | Hex | Character | Dec | Hex | Character |
0 | 0 | NUL | | null character (null) | 32 | 20 | (space) | 65 | 41 | A | 98 | 62 | b |
1 | 1 | SOH | | start of heading | 33 | 21 | ! | 66 | 42 | B | 99 | 63 | c |
2 | 2 | STX | | start of text | 34 | 22 | " | 67 | 43 | C | 100 | 64 | d |
3 | 3 | ETX | | end of text | 35 | 23 | # | 68 | 44 | D | 101 | 65 | e |
4 | 4 | EOT | | end of transmission | 36 | 24 | $ | 69 | 45 | E | 102 | 66 | f |
5 | 5 | ENQ | | enquiry | 37 | 25 | % | 70 | 46 | F | 103 | 67 | g |
6 | 6 | ACK | | acknowledge | 38 | 26 | & | 71 | 47 | G | 104 | 68 | h |
7 | 7 | BEL | \a | bell | 39 | 27 | ' | 72 | 48 | H | 105 | 69 | i |
8 | 8 | BS | \b | backspace | 40 | 28 | ( | 73 | 49 | I | 106 | 6A | j |
9 | 9 | TAB | \t | horizontal tab | 41 | 29 | ) | 74 | 4A | J | 107 | 6B | k |
10 | 0A | LF | \n | line feed, new line | 42 | 2A | * | 75 | 4B | K | 108 | 6C | l |
11 | 0B | VT | \v | vertical tab | 43 | 2B | + | 76 | 4C | L | 109 | 6D | m |
12 | 0C | FF | \f | form feed, new page | 44 | 2C | , | 77 | 4D | M | 110 | 6E | n |
13 | 0D | CR | \r | carriage return | 45 | 2D | - | 78 | 4E | N | 111 | 6F | o |
14 | 0E | SO | | shift out | 46 | 2E | . | 79 | 4F | O | 112 | 70 | p |
15 | 0F | SI | | shift in | 47 | 2F | / | 80 | 50 | P | 113 | 71 | q |
16 | 10 | DLE | | data link escape | 48 | 30 | 0 | 81 | 51 | Q | 114 | 72 | r |
17 | 11 | DC1 | | device control 1 (XON, resumes transmission under software flow control) | 49 | 31 | 1 | 82 | 52 | R | 115 | 73 | s |
18 | 12 | DC2 | | device control 2 | 50 | 32 | 2 | 83 | 53 | S | 116 | 74 | t |
19 | 13 | DC3 | | device control 3 (XOFF, pauses transmission under software flow control) | 51 | 33 | 3 | 84 | 54 | T | 117 | 75 | u |
20 | 14 | DC4 | | device control 4 | 52 | 34 | 4 | 85 | 55 | U | 118 | 76 | v |
21 | 15 | NAK | | negative acknowledge | 53 | 35 | 5 | 86 | 56 | V | 119 | 77 | w |
22 | 16 | SYN | | synchronous idle | 54 | 36 | 6 | 87 | 57 | W | 120 | 78 | x |
23 | 17 | ETB | | end of transmission block | 55 | 37 | 7 | 88 | 58 | X | 121 | 79 | y |
24 | 18 | CAN | | cancel | 56 | 38 | 8 | 89 | 59 | Y | 122 | 7A | z |
25 | 19 | EM | | end of medium | 57 | 39 | 9 | 90 | 5A | Z | 123 | 7B | { |
26 | 1A | SUB | | substitute | 58 | 3A | : | 91 | 5B | [ | 124 | 7C | | |
27 | 1B | ESC | | escape | 59 | 3B | ; | 92 | 5C | \ | 125 | 7D | } |
28 | 1C | FS | | file separator | 60 | 3C | < | 93 | 5D | ] | 126 | 7E | ~ |
29 | 1D | GS | | group separator | 61 | 3D | = | 94 | 5E | ^ | 127 | 7F | DEL (not displayed) |
30 | 1E | RS | | record separator | 62 | 3E | > | 95 | 5F | _ | | | |
31 | 1F | US | | unit separator | 63 | 3F | ? | 96 | 60 | ` | | | |
127 | 7F | DEL | | delete | 64 | 40 | @ | 97 | 61 | a | | | |