Regular Expressions
This page serves to guide us as we learn regular expressions. References [3,4,7,23] seem to be the best source of help.
- Remember, shell metacharacters are expanded before the shell passes arguments to a command, so regex patterns must be quoted (single quotes are safest) to prevent the shell from interpreting them
- Structure of regex
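- Anchors - used to specify the position of the pattern : ^ (start of line), $ (end of line), \b and \B (word boundary / not a word boundary), \< and \> (start / end of a word). The last four are GNU extensions.
- character sets - match exactly one character in a single position : [ ] with any of the following: A-Z, a-z, 0-9, or a class predefined in the current locale such as [:alnum:], [:alpha:], [:digit:], ...
- modifiers - specify how many times the previous character can be repeated : ?, *, +, {n,m}. The dot . is not a repetition modifier; it matches any single character.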
- back references: \n where n is a digit that points to the n'th parenthesized subexpr
- Basic and Extended regex
- Basic: supported by vi/sed/grep/more
- Extended: supported by awk/egrep (egrep is equivalent to grep -E)
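To make the basic/extended distinction concrete, here is a minimal sketch assuming GNU grep. The sample strings are made up for illustration. The same interval is written once in basic and once in extended syntax, followed by a back reference in basic syntax.

```shell
#!/usr/bin/env bash
# Hypothetical sample line, not taken from any real data file.
line='aaa 2017-09-20 foo'

# Basic regex (grep default): interval braces must be backslash-escaped.
bre=$(printf '%s\n' "$line" | grep -o '[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}')

# Extended regex (grep -E): the same interval without backslashes.
ere=$(printf '%s\n' "$line" | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}')

# Back reference in basic syntax: \1 repeats whatever group 1 matched.
dup=$(printf '%s\n' 'abab xyz' | grep -o '\(ab\)\1')

echo "$bre"
echo "$ere"
echo "$dup"
```

Both date searches print 2017-09-20, and the back reference search prints abab.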
- Simple examples
- to match a line containing the substring 'value= 10.2 ms'
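One possible way to do this with grep is sketched below. The three input lines are hypothetical. The point is that the dot must be escaped, otherwise it matches any character and 'value= 1002 ms' would match as well.

```shell
#!/usr/bin/env bash
# Hypothetical input lines for illustration only.
sample=$(printf '%s\n' \
    'run 1 value= 10.2 ms ok' \
    'run 2 value= 9.7 ms ok' \
    'run 3 value= 1002 ms ok')

# Escaped dot: matches only the literal substring 'value= 10.2 ms'.
exact=$(printf '%s\n' "$sample" | grep 'value= 10\.2 ms')

# Unescaped dot: '.' matches any character, so run 3 is counted too.
loose=$(printf '%s\n' "$sample" | grep -c 'value= 10.2 ms')

echo "$exact"
echo "lines matched by the loose pattern: $loose"
```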
Install the check-regexp program (apt-get install source-highlight) - this is a very helpful tool. Give it a regexp and a text string and it shows not only if the regexp works as expected but provides further information as to how regexp might work.
- check-regexp '^00:00:[0-9]{1,2}' '00:00:14.1234 IP 130.127.49.2 '
- The {1,2} indicates we require exactly 1 or 2 digits 0...9.
- Note: [0-9]+ would match one or more
- check-regexp '^00:00:[0-9]' '00:00:01.1234 IP 130.127.49.2 2'
- searching : 00:00:01.1234 IP 130.127.49.2 2
- for the regexp : ^00:00:[0-9]
- num of subexps : 0
- what[0]: 00:00:0
- suffix: 1.1234 IP 130.127.49.2 2
- total number of matches: 1
- check-regexp '^00:00:[0-9]{1,2}' '00:00:01.1234 IP 130.127.49.2 2'
- searching : 00:00:01.1234 IP 130.127.49.2 2
- for the regexp : ^00:00:[0-9]{1,2}
- num of subexps : 0
- what[0]: 00:00:01
- suffix: .1234 IP 130.127.49.2 2
- total number of matches: 1
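If check-regexp is not installed, bash itself can give a similar report. This is only a sketch, assuming bash (not plain sh): the [[ =~ ]] operator uses extended regex syntax and stores the match in the BASH_REMATCH array. The variable names are our own.

```shell
#!/usr/bin/env bash
text='00:00:01.1234 IP 130.127.49.2 2'
re='^00:00:[0-9]{1,2}'   # kept in a variable so the shell does not re-parse the pattern

if [[ $text =~ $re ]]; then
    match=${BASH_REMATCH[0]}    # the whole match, comparable to what[0]
    suffix=${text#"$match"}     # text after the match (valid here because the pattern is anchored with ^)
    echo "match : $match"
    echo "suffix: $suffix"
else
    echo "no match"
fi
```

For this input the script reports the match 00:00:01 and the suffix .1234 IP 130.127.49.2 2, in line with the check-regexp output above.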
Regular expressions examples
- "^#*" matches 0 or more "#" characters, but ONLY starting at the beginning of a line. Because 0 occurrences are allowed, it matches every line; use "^##*" to require at least one "#".
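- Let's say we need to find lines in a file whose first word :
- Starts with a T
- Is the first word on the line
- Has a lower case second letter
- Is exactly three letters long
- Has a vowel as its third letter
- /^T[a-z][aeiou]\>/
- Note: the trailing \> (end of word, a GNU extension) enforces the three-letter length. Without it, /^T[a-z][aeiou]/ also matches longer words such as 'Their'.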
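A quick way to try this pattern is sketched below with made-up test lines. It assumes GNU grep, since the \> end-of-word anchor (added here to enforce the three-letter length) is a GNU extension. LC_ALL=C is set so that [a-z] means exactly the 26 lower case ASCII letters regardless of locale.

```shell
#!/usr/bin/env bash
# Hypothetical test lines: only 'The cat' and 'Too late' satisfy every rule.
words=$(printf '%s\n' 'The cat' 'Too late' 'Their dog' 'TWO' 'a Tie' 'Tap water' |
    LC_ALL=C grep '^T[a-z][aeiou]\>')

echo "$words"
```

'Their dog' fails the length rule, 'TWO' the lower case rule, 'a Tie' the start-of-line rule, and 'Tap water' the vowel rule.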
- Example of a UDP trace obtained from tcpdump. The file format is a line for each packet captured:
- 00:00:00.000000 IP 130.127.49.48.33581 > 130.127.49.144.commplex-main: UDP, length 1000
- All we want to do is catch and return the full lines that begin with '00:00:0'
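- To match the timestamp prefix with grep, assuming basic regexp :
- grep -o '^00:00:[0-9]\{1,2\}' t.trace
- Note: in basic regexp the \ is what makes grep treat the curly braces as a repetition count; unescaped, { and } are matched literally. The -o prints only the matched text; drop it to print the full matching lines.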
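The sketch below tries these grep variants on a small temporary trace file, assuming GNU grep. The first trace line is the one shown above; the other two are invented for the test. It shows the matched text with -o, a count of full lines beginning with '00:00:0', and what happens in basic regexp when the braces are left unescaped.

```shell
#!/usr/bin/env bash
trace=$(mktemp)
cat > "$trace" <<'EOF'
00:00:00.000000 IP 130.127.49.48.33581 > 130.127.49.144.commplex-main: UDP, length 1000
00:00:14.123400 IP 130.127.49.48.33581 > 130.127.49.144.commplex-main: UDP, length 1000
00:01:02.000000 IP 130.127.49.48.33581 > 130.127.49.144.commplex-main: UDP, length 1000
EOF

# Escaped braces act as an interval; -o prints only the matched prefix.
only=$(grep -o '^00:00:[0-9]\{1,2\}' "$trace")

# Count the full lines that begin with 00:00:0 (no -o, so whole lines are selected).
full=$(grep -c '^00:00:0' "$trace")

# Unescaped braces are literal characters in basic regexp, so nothing matches.
# grep exits non-zero when there is no match, hence the '|| true'.
literal=$(grep -c '^00:00:[0-9]{1,2}' "$trace" || true)

rm -f "$trace"
echo "$only"
echo "full lines: $full, literal-brace matches: $literal"
```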
- Example: given a data file that contains the output of the ping program, we want to filter out everything from each line except the RTT
- The ping command: 'ping -D 202.58.60.194 > ping1.dat' produces a file with lines that include:
- PING 202.58.60.194 (202.58.60.194) 56(84) bytes of data.
[1504138863.190651] 64 bytes from 202.58.60.194: icmp_seq=1 ttl=237 time=250 ms
- ....
- [1504147755.429038] 64 bytes from 202.58.60.194: icmp_seq=8879 ttl=237 time=249 ms
[1504147756.425207] 64 bytes from 202.58.60.194: icmp_seq=8880 ttl=237 time=244 ms
- --- 202.58.60.194 ping statistics ---
8880 packets transmitted, 8857 received, 0% packet loss, time 8893241ms
rtt min/avg/max/mdev = 236.538/244.049/500.643/8.191 ms
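- First, we need to clean up the file by removing the first line and the trailing summary: the 3 statistics lines plus the blank line that ping prints before them, so 4 lines in total (note, depending on how we filter, this step might not be necessary)
- This pipes the file contents starting with line 2 to head, which displays all but the last 4 lines (a negative count for head -n is a GNU extension)
- tail -n +2 ping1.dat | head -n -4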
- Next, pipe the resulting stream to grep
- tail -n +2 ping1.dat | head -n -4 | grep -o "time="
- The '-o' tells grep to print only the parts of each line that match, so this command prints just 'time=' once per matching line
- Grep looks at each line in the input stream and, without the '-o', displays every line that contains the string 'time='
- From these lines we want to only show the rtt.
- We can use the -o and regexp to limit the match with a regular expr of "time=[0-9]*[.]*[0-9]*"
- This matches lines that contain a substring beginning with 'time=' followed by 0 or more numeric digits and then 0 or more dots, followed by 0 or more digits.
- tail -n +2 ping1.dat | head -n -4 | grep -o "time=[0-9]*[.]*[0-9]*" ...which returns:
time=250
time=241
.....
- Now...how can we remove the 'time=' ? Multiple ways....
- Add a final component to the pipeline:
- sed "s/time=//g" which substitutes the substring 'time=' with nothing, i.e. it removes it
- Or....use awk by adding the following to the pipeline:
- awk 'BEGIN {FS = "="}{printf("%f\n",$2)}' #Note that we need to change the delimiter to '='
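Both variants can be checked on a small sample, sketched below under the assumption of GNU grep, sed and awk. The temporary file holds a shortened copy of the ping output shown above, with the blank line that ping prints before the statistics. Since only the reply lines contain 'time=', the tail/head cleanup is skipped here.

```shell
#!/usr/bin/env bash
ping_file=$(mktemp)
cat > "$ping_file" <<'EOF'
PING 202.58.60.194 (202.58.60.194) 56(84) bytes of data.
[1504138863.190651] 64 bytes from 202.58.60.194: icmp_seq=1 ttl=237 time=250 ms
[1504147755.429038] 64 bytes from 202.58.60.194: icmp_seq=8879 ttl=237 time=249 ms
[1504147756.425207] 64 bytes from 202.58.60.194: icmp_seq=8880 ttl=237 time=244 ms

--- 202.58.60.194 ping statistics ---
8880 packets transmitted, 8857 received, 0% packet loss, time 8893241ms
rtt min/avg/max/mdev = 236.538/244.049/500.643/8.191 ms
EOF

# Variant 1: grep isolates 'time=<number>', sed deletes the 'time=' prefix.
rtt_sed=$(grep -o 'time=[0-9]*[.]*[0-9]*' "$ping_file" | sed 's/time=//')

# Variant 2: awk splits each match on '=' and prints the second field as a float.
rtt_awk=$(grep -o 'time=[0-9]*[.]*[0-9]*' "$ping_file" |
    awk 'BEGIN {FS = "="} {printf("%f\n", $2)}')

rm -f "$ping_file"
echo "$rtt_sed"
echo "$rtt_awk"
```

The sed variant keeps the RTT exactly as ping printed it, while the awk variant reformats it with six decimal places.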
- For the last example, suppose we want to reduce the ping data file to a 2 column RTT.dat file in which each entry consists of the timestamp and the RTT.
- tail -n +2 ping1.dat | head -n -4 | sed 's/\[//g;s/\]//g;s/time=//g' | awk '{printf("%f %s\n",$1, $8)}' > RTT.dat
- Note: %s keeps a fractional RTT such as 244.5 as printed, which %d would truncate
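As an alternative sketch, the same two columns can be produced by awk alone, selecting only the lines that contain 'time='. This illustrates the earlier remark that the tail/head cleanup might not be necessary. It assumes the ping line layout shown above, where the timestamp is field 1 and the RTT is field 8; the sample file is a shortened copy of that output.

```shell
#!/usr/bin/env bash
ping_file=$(mktemp)
cat > "$ping_file" <<'EOF'
PING 202.58.60.194 (202.58.60.194) 56(84) bytes of data.
[1504138863.190651] 64 bytes from 202.58.60.194: icmp_seq=1 ttl=237 time=250 ms
[1504147755.429038] 64 bytes from 202.58.60.194: icmp_seq=8879 ttl=237 time=249 ms
[1504147756.425207] 64 bytes from 202.58.60.194: icmp_seq=8880 ttl=237 time=244 ms

--- 202.58.60.194 ping statistics ---
8880 packets transmitted, 8857 received, 0% packet loss, time 8893241ms
rtt min/avg/max/mdev = 236.538/244.049/500.643/8.191 ms
EOF

# Only reply lines contain 'time=', so the header and summary are skipped.
rtt_table=$(awk '/time=/ {
    ts = $1; rtt = $8
    gsub(/\[/, "", ts); gsub(/\]/, "", ts)   # drop the brackets around the timestamp
    sub(/time=/, "", rtt)                    # drop the time= prefix
    print ts, rtt
}' "$ping_file")

rm -f "$ping_file"
echo "$rtt_table"
```

Redirecting the awk output to RTT.dat instead of capturing it in a variable would give the 2 column file.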
- Script: finds only what is in between two delimiters (begin and end) in a given file
- dataFName=$1 begin=$2 end=$3
- if (( $# < 3 )); then
- echo "Usage: [data filename] [beginning delimiter] [end delimiter] [out filename]"
- exit 1
- fi
- #The following variable can be used to 'eat' any number of digits, followed by any number of dots, then any number of digits (Question: what is the final space and *?)
- regex="[0-9]*[.]*[0-9]* *"
- #grep -o "$begin$regex$end" : this will produce the pattern with begin and end delimiters
- #sed "s/$end//g" | sed "s/$begin//g" : this strips off the delims
- awk '{printf("%d\t%2.3f\t%2.6f\t%2.6f\t%9.0f\n",$1, $6, $9, $8, $7)}'
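The fragments above can be put together roughly as follows. This is only a sketch of how the pieces might fit, not the original script: the function name extract_between is our own, the output-file argument is left out, and a final sed that trims trailing spaces is an addition. The delimiters are pasted straight into the regex, so any regex metacharacters in them are interpreted rather than matched literally.

```shell
#!/usr/bin/env bash
# extract_between FILE BEGIN END
# Prints the numeric text found between BEGIN and END on each matching line.
extract_between() {
    if [ "$#" -lt 3 ]; then
        echo "Usage: extract_between [data filename] [beginning delimiter] [end delimiter]" >&2
        return 1
    fi
    dataFName=$1 begin=$2 end=$3
    # Digits, optional dots, digits, then any spaces before the end delimiter.
    regex="[0-9]*[.]*[0-9]* *"
    grep -o "$begin$regex$end" "$dataFName" |
        sed "s/$end//g" | sed "s/$begin//g" |   # strip off the delimiters
        sed 's/ *$//'                            # assumption: also trim the spaces left behind
}

# Usage on a shortened copy of the ping output shown earlier.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[1504138863.190651] 64 bytes from 202.58.60.194: icmp_seq=1 ttl=237 time=250 ms
[1504147755.429038] 64 bytes from 202.58.60.194: icmp_seq=8879 ttl=237 time=249 ms
EOF
out=$(extract_between "$sample" 'time=' 'ms')
rm -f "$sample"
echo "$out"
```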
Last update: 9/20/2017