This website is preserved for historical and scholarly reference and is no longer actively maintained.

Regular Expressions

This page serves to guide us as we learn regular expressions. References [3,4,7,23] seem to be the best source of help.

Remember, shell metacharacters are expanded before the shell passes arguments to a command, so regex patterns must be in quotes to prevent the shell from interpreting them.
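A small sketch of why the quotes matter. The scratch directory, the file name ab, and the contents of data.txt are all assumptions made up for this illustration:

```shell
# Work in a scratch directory so the unquoted pattern has a file to expand to.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch ab                          # a file name that the glob a* matches
printf 'aaa\nab\nb\n' > data.txt

# Unquoted: the shell rewrites a* to "ab" before grep starts,
# so grep actually runs as: grep -c ab data.txt
unquoted=$(grep -c a* data.txt)

# Quoted: grep receives the regex a* itself (zero or more a's),
# which matches every line.
quoted=$(grep -c 'a*' data.txt)

echo "unquoted pattern matched $unquoted line(s)"
echo "quoted pattern matched $quoted line(s)"
```

The same command gives two different answers depending only on what happens to be in the current directory, which is why quoting the pattern is the safe habit.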
Structure of regex
- Anchors - used to specify the position of the pattern : ^, $, \b, \B, \<, \>
- Character sets - match any one character from a set in a single position : [ ] with any of the following: A-Z, a-z, 0-9, or a class predefined in the current locale such as [:alnum:], [:alpha:], [:digit:], ...
- Modifiers - specify how many times the previous character can be repeated : ?, *, +, {n,m} (the dot . is not a modifier; it matches any single character)
- Back references: \n where n is a digit that points to the n'th parenthesized subexpression
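To make the four building blocks concrete, here is a rough sketch using grep in its basic syntax. The sample text is invented for the illustration, and the \< \> word anchors assume GNU grep:

```shell
sample='the cat sat
concatenate
cat
aa bb aa
x9'

# Anchors: ^ and $ pin the pattern to the start and end of the line.
anchored=$(printf '%s\n' "$sample" | grep -c '^cat$')

# Word anchors: "cat" only where it stands as a whole word.
word=$(printf '%s\n' "$sample" | grep -c '\<cat\>')

# Character sets with locale classes: a letter followed by a digit.
class=$(printf '%s\n' "$sample" | grep '[[:alpha:]][[:digit:]]')

# Modifier: exactly two consecutive a's (braces are escaped in basic syntax).
repeat=$(printf '%s\n' "$sample" | grep -c 'a\{2\}')

# Back reference: \1 repeats whatever the first \( \) group matched.
backref=$(printf '%s\n' "$sample" | grep '\([a-z][a-z]\) .* \1')

echo "$anchored $word $class $repeat $backref"
```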
Basic and Extended regex
- Basic: supported by vi, sed, grep, and more
- Extended: supported by awk and egrep (grep -E)
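A minimal sketch of the practical difference, assuming three made-up input lines. The same "one or more a's followed by b" match is written once per syntax:

```shell
text='ab
aab
b'

# Extended syntax (grep -E, the current spelling of egrep):
# + means "one or more" of the preceding item.
ere=$(printf '%s\n' "$text" | grep -cE '^a+b$')

# Basic syntax has no bare + operator; the portable form is \{1,\}.
bre=$(printf '%s\n' "$text" | grep -c '^a\{1,\}b$')

# awk uses extended syntax inside its /.../ patterns.
awk_hits=$(printf '%s\n' "$text" | awk '/^a+b$/ { n++ } END { print n + 0 }')

echo "$ere $bre $awk_hits"
```

All three counts should agree, since each command expresses the same pattern in the syntax its tool expects.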
Simple examples
- to match a line containing the substring 'value= 10.2 ms'
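One possible way to write this match, sketched with grep in basic syntax. The surrounding log lines are hypothetical, and the pattern generalizes the literal 10.2 to any decimal number, which is an assumption about the intent:

```shell
log='start
value= 10.2 ms
value= fast
other= 3.1 ms'

# The literal text "value= ", then digits, an escaped dot, digits, and " ms".
hits=$(printf '%s\n' "$log" | grep 'value= [0-9][0-9]*\.[0-9][0-9]* ms')

echo "$hits"
```

The dot is escaped as \. so that it matches only a literal decimal point rather than any character.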

Install the check-regexp program (apt-get install source-highlight); it is a very helpful tool. Give it a regexp and a text string, and it shows not only whether the regexp matches as expected but also which text was matched (what[0]), what text follows the match (suffix), and how many subexpressions and matches were found.

check-regexp '^00:00:[0-9]{1,2}' '00:00:14.1234 IP 130.127.49.2 '
- The {1,2} indicates we require exactly 1 or 2 digits 0...9.
- Note: [0-9]+ would match one or more
- check-regexp '^00:00:[0-9]' '00:00:01.1234 IP 130.127.49.2 2'
  - searching : 00:00:01.1234 IP 130.127.49.2 2
  - for the regexp : ^00:00:[0-9]
  - num of subexps : 0
  - what[0]: 00:00:0
  - suffix: 1.1234 IP 130.127.49.2 2
  - total number of matches: 1
- check-regexp '^00:00:[0-9]{1,2}' '00:00:01.1234 IP 130.127.49.2 2'
  - searching : 00:00:01.1234 IP 130.127.49.2 2
  - for the regexp : ^00:00:[0-9]{1,2}
  - num of subexps : 0
  - what[0]: 00:00:01
  - suffix: .1234 IP 130.127.49.2 2
  - total number of matches: 1

Regular expressions examples

"^#*" matches 0 or more "#" characters, but ONLY at the beginning of a line. Because zero occurrences are allowed, it matches every line; use "^##*" to require at least one "#".
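A quick sketch of how "^#*" behaves in grep, using three invented lines. It also shows a variant, "^##*", that requires at least one "#":

```shell
cfg='# comment
## heading
setting=1'

# ^#* is satisfied by zero # characters, so every line is selected.
all=$(printf '%s\n' "$cfg" | grep -c '^#*')

# ^##* needs one # followed by zero or more, so only comment lines match.
comments=$(printf '%s\n' "$cfg" | grep -c '^##*')

echo "$all $comments"
```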
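Let's say we needed to find lines in a file whose first word:
- Starts with a T
- Has a lower case second letter
- Has a vowel as its third letter
- Is exactly three letters long
- /^T[a-z][aeiou]\>/ (the ^ makes it the first word on the line, and the \> end-of-word anchor enforces the three-letter length; without it the pattern would also match the start of longer words such as 'Theory')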
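The "exactly three letters long" requirement is the easy one to miss, so here is a rough comparison of the pattern with and without an end-of-word anchor. The word list is made up, \> assumes GNU grep, and LC_ALL=C is set so that [a-z] means only the lower case ASCII letters:

```shell
words='The cat
Theory
Tie
TOe
the end'

# With \> the third letter must be the last letter of the word.
strict=$(printf '%s\n' "$words" | LC_ALL=C grep '^T[a-z][aeiou]\>')

# Without it, a longer word such as "Theory" also matches.
loose=$(printf '%s\n' "$words" | LC_ALL=C grep -c '^T[a-z][aeiou]')

echo "$strict"
echo "loose count: $loose"
```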
Example of a UDP trace obtained from tcpdump. The file format is a line for each packet captured:
- 00:00:00.000000 IP 130.127.49.48.33581 > 130.127.49.144.commplex-main: UDP, length 1000
- All we want to do is catch and return the full lines that begin with '00:00:0'
- The check-regexp runs shown earlier apply here: '^00:00:[0-9]{1,2}' matches the leading timestamp of such lines, while '^00:00:[0-9]' stops after the first digit of the seconds field.
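- To do the same match with grep, assuming basic regexp :
  - grep -o '^00:00:[0-9]\{1,2\}' t.trace
    - Note the \ is necessary: in basic regexp, unescaped curly braces are ordinary characters, and it is the \{ \} form that makes grep treat them as a repeat count
    - The -o prints only the matched text; omit it to return the full lines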
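A sketch of the difference between printing only the match and returning whole lines. The first trace line is the one shown above; the other two are invented for the illustration, and the temporary file stands in for t.trace:

```shell
trace=$(mktemp)
cat > "$trace" <<'EOF'
00:00:00.000000 IP 130.127.49.48.33581 > 130.127.49.144.commplex-main: UDP, length 1000
00:00:14.123400 IP 130.127.49.48.33581 > 130.127.49.144.commplex-main: UDP, length 1000
00:01:02.000000 IP 130.127.49.48.33581 > 130.127.49.144.commplex-main: UDP, length 1000
EOF

# -o prints just the text the pattern matched, one match per line.
only_match=$(grep -o '^00:00:[0-9]\{1,2\}' "$trace")

# Without -o grep returns whole lines; here, those beginning with 00:00:0.
full_line_count=$(grep -c '^00:00:0' "$trace")

rm -f "$trace"
echo "$only_match"
echo "full lines beginning with 00:00:0: $full_line_count"
```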

Example: given a data file that contains the output of the ping program, we want to filter out everything from each line except for the RTT.
- The ping command: 'ping -D 202.58.60.194 > ping1.dat' produces a file with lines that include:
  - PING 202.58.60.194 (202.58.60.194) 56(84) bytes of data.
    [1504138863.190651] 64 bytes from 202.58.60.194: icmp_seq=1 ttl=237 time=250 ms
  - ....
  - [1504147755.429038] 64 bytes from 202.58.60.194: icmp_seq=8879 ttl=237 time=249 ms
    [1504147756.425207] 64 bytes from 202.58.60.194: icmp_seq=8880 ttl=237 time=244 ms
  - --- 202.58.60.194 ping statistics ---
    8880 packets transmitted, 8857 received, 0% packet loss, time 8893241ms
    rtt min/avg/max/mdev = 236.538/244.049/500.643/8.191 ms
- First, we need to clean up the file by removing the first line and the last 4 lines: the 3-line statistics block and the blank line that ping prints before it (note, depending on how we filter, this step might not be necessary)
  - This pipes the file contents starting with line 2 to head, which displays all but the last 4 lines
    - tail -n +2 ping1.dat | head -n -4
  - Next, pipe the resulting stream to grep
    - tail -n +2 ping1.dat | head -n -4 | grep -o "time="
      - The '-o' tells grep to print only the parts of each line that match
      - Grep looks at each line of the input stream and, without the '-o', displays every line that contains the string 'time='
      - From these lines we want to show only the RTT.
      - We can use -o and a regexp to limit the match, with a regular expression of "time=[0-9]*[.]*[0-9]*"
        
        This matches a substring beginning with 'time=' followed by 0 or more numeric digits, then 0 or more dots, then 0 or more digits.
      - tail -n +2 ping1.dat | head -n -4 | grep -o "time=[0-9]*[.]*[0-9]*" ...which returns:
        time=250
        time=241
        .....
    - Now, how can we remove the 'time='? There are multiple ways:
      - Add a final component to the pipeline:
        
        sed "s/time=//g" which substitutes the substring 'time=' with nothing, that is, it removes it
      - Or use awk by adding the following to the pipeline:
        
        awk 'BEGIN {FS = "="}{printf("%f\n",$2)}' # Note that we need to change the field separator to '='
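The two ways of stripping 'time=' can be tried side by side. This is a sketch on a shortened copy of the ping output (two reply lines only, written to a temporary file in place of ping1.dat); the awk variant prints the field as-is rather than through %f, which is a small deviation chosen to keep the output identical to the sed variant:

```shell
ping_file=$(mktemp)
cat > "$ping_file" <<'EOF'
PING 202.58.60.194 (202.58.60.194) 56(84) bytes of data.
[1504138863.190651] 64 bytes from 202.58.60.194: icmp_seq=1 ttl=237 time=250 ms
[1504147756.425207] 64 bytes from 202.58.60.194: icmp_seq=8880 ttl=237 time=244 ms

--- 202.58.60.194 ping statistics ---
8880 packets transmitted, 8857 received, 0% packet loss, time 8893241ms
rtt min/avg/max/mdev = 236.538/244.049/500.643/8.191 ms
EOF

# Variant 1: sed deletes the 'time=' prefix from each match.
rtt_sed=$(grep -o 'time=[0-9]*[.]*[0-9]*' "$ping_file" | sed 's/time=//')

# Variant 2: awk splits each match on '=' and prints the second field.
rtt_awk=$(grep -o 'time=[0-9]*[.]*[0-9]*' "$ping_file" | awk -F= '{ print $2 }')

rm -f "$ping_file"
echo "$rtt_sed"
```

Because only the reply lines contain the string 'time=', grep drops the header and statistics lines on its own here, so the tail and head steps are not needed for this particular filter.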
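Continuing the last example, suppose we want to reduce the ping data file to a 2 column RTT.dat file, with each entry consisting of the timestamp and the RTT:
- tail -n +2 ping1.dat | head -n -4 | sed 's/\[//g;s/\]//g;s/time=//g' | awk '{printf("%f %d \n",$1, $8)}' > RTT.dat
  - Note that %d truncates the RTT to an integer; use %f instead if the RTT values have a fractional part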
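A sketch of the two-column reduction on a shortened sample. Two choices here are assumptions that differ from the command above: reply lines are selected with grep 'time=' in place of tail and head, and both columns are printed with %s so that the values appear exactly as ping wrote them:

```shell
ping_file=$(mktemp)
cat > "$ping_file" <<'EOF'
PING 202.58.60.194 (202.58.60.194) 56(84) bytes of data.
[1504138863.190651] 64 bytes from 202.58.60.194: icmp_seq=1 ttl=237 time=250 ms
[1504147756.425207] 64 bytes from 202.58.60.194: icmp_seq=8880 ttl=237 time=244 ms

--- 202.58.60.194 ping statistics ---
8880 packets transmitted, 8857 received, 0% packet loss, time 8893241ms
rtt min/avg/max/mdev = 236.538/244.049/500.643/8.191 ms
EOF

# Keep reply lines, strip the brackets and 'time=', then print
# field 1 (timestamp) and field 8 (RTT).
rtt_table=$(grep 'time=' "$ping_file" \
    | sed 's/\[//g;s/\]//g;s/time=//g' \
    | awk '{ printf("%s %s\n", $1, $8) }')

rm -f "$ping_file"
echo "$rtt_table"
```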
The following script finds only what is in between two delimiters (begin and end) in a given file:

dataFName=$1
begin=$2
end=$3
# Require at least the data filename and the two delimiters
if (( $# < 3 )); then
  echo "Usage: [data filename] [beginning delimiter] [end delimiter] [out filename]"
  exit 1
fi
#The following variable can be used to 'eat' any number of digits, followed by any number of dots, any number of digits (Question: what is the final space and *?)
regex="[0-9]*[.]*[0-9]* *"
#grep -o "$begin$regex$end" : this will produce the pattern with begin and end delimiters
#sed "s/$end//g" | sed "s/$begin//g" : this strips off the delims
awk '{printf("%d\t%2.3f\t%2.6f\t%2.6f\t%9.0f\n",$1, $6, $9, $8, $7)}'
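The grep and sed steps described in the comments can be combined into one pipeline. The following is a minimal sketch, not the original script: the function name between, the sample data, and the final trimming step are assumptions, and it only works when the delimiters contain no regex metacharacters or slashes:

```shell
# between FILE BEGIN END: print the number found between BEGIN and END.
between() {
    # digits, optional dots, digits, then optional spaces before END
    regex='[0-9]*[.]*[0-9]* *'
    grep -o "$2$regex$3" "$1" \
        | sed "s/$3//g" \
        | sed "s/$2//g" \
        | sed 's/ *$//'          # trim the spaces left in front of END
}

data=$(mktemp)
printf '%s\n' 'icmp_seq=1 ttl=237 time=250 ms' \
              'icmp_seq=2 ttl=237 time=244.5 ms' > "$data"

result=$(between "$data" 'time=' 'ms')
rm -f "$data"
echo "$result"
```

The " *" at the end of the regex is what lets the match span the space between the number and the closing delimiter, as in 'time=250 ms'.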

Last update: 9/20/2017