Regular Expressions

This page serves to guide us as we learn regular expressions. References [3,4,7,23] seem to be the best sources of help.

Remember: the shell expands its metacharacters before it passes arguments to a command, so regex patterns must be quoted (single quotes are safest) to prevent the shell from interpreting them.
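A minimal sketch of why the quotes matter. The scratch directory and the file name apple.txt are made up for the demonstration; echo stands in for any command that would receive the pattern.

```shell
# Scratch directory with one file, so the glob a* has something to match.
# (The directory and file name are hypothetical, chosen for this demo.)
workdir=$(mktemp -d)
touch "$workdir/apple.txt"

# Unquoted: the shell replaces a* with the matching file name before echo runs.
unquoted=$(cd "$workdir" && echo a*)
# Quoted: the pattern reaches the command untouched.
quoted=$(cd "$workdir" && echo 'a*')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
rm -r "$workdir"
```

The same thing happens with grep: an unquoted pattern that happens to match file names is replaced by those names before grep ever sees it.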
Structure of regex
- Anchors: specify the position of the pattern: ^, $, \b, \B, \<, \> (the last four are extensions that not every tool supports)
- Character sets: match any one character from a set at a single position: [ ] containing ranges such as A-Z, a-z, 0-9, or classes predefined in the current locale such as [:alnum:], [:alpha:], [:digit:], ... A dot (.) matches any single character.
- Modifiers: specify how many times the previous character can be repeated: ?, *, +, {n,m}
- Backreferences: \n, where n is a digit that refers to the n'th parenthesized subexpression
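The four building blocks above can be tried out with grep. This is only a sketch in basic regexp syntax; the sample lines are invented for the purpose.

```shell
# Made-up sample input, one entry per line.
lines=$(printf '%s\n' 'error: disk full' 'no error here' 'id 42' 'abab' 'abcd')

# Anchor: ^ ties the match to the start of the line.
anchored=$(printf '%s\n' "$lines" | grep '^error')

# Character set plus modifier: exactly two digits in a row
# (in basic syntax the braces are written \{ and \}).
two_digits=$(printf '%s\n' "$lines" | grep '[[:digit:]]\{2\}')

# Backreference: \1 repeats whatever the first \( \) group matched.
repeated=$(printf '%s\n' "$lines" | grep '\(ab\)\1')

echo "anchored:   $anchored"
echo "two digits: $two_digits"
echo "repeated:   $repeated"
```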
Basic and Extended regex
- Basic: supported by vi, sed, grep, more
- Extended: supported by awk, egrep (the same as grep -E)
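A small sketch of how the two dialects differ in practice, assuming a grep that supports the -E and -o options (GNU and BSD grep both do). The input strings are made up.

```shell
# In basic syntax a bare + is an ordinary character; in extended syntax it
# means "one or more of the previous item".
basic=$(printf '%s\n' 'a+b' 'aab' | grep 'a+b')
extended=$(printf '%s\n' 'a+b' 'aab' | grep -E 'a+b')

# Repetition counts: \{n,m\} in basic syntax, {n,m} in extended syntax.
interval_basic=$(echo '00:00:14.1234' | grep -o '^00:00:[0-9]\{1,2\}')
interval_ext=$(echo '00:00:14.1234' | grep -oE '^00:00:[0-9]{1,2}')

echo "basic 'a+b' matched:    $basic"
echo "extended 'a+b' matched: $extended"
echo "basic interval:    $interval_basic"
echo "extended interval: $interval_ext"
```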

Install the check-regexp program, which ships with the source-highlight package (apt-get install source-highlight). It is a very helpful tool: give it a regexp and a text string and it shows not only whether the regexp matches as expected but also how it matched (the matched text, the subexpressions, and the remaining suffix).

check-regexp '^00:00:[0-9]{1,2}' '00:00:14.1234 IP 130.127.49.2 '
- The {1,2} requires at least one and at most two digits (0 through 9) at that position.
- Note: [0-9]+ would match one or more digits
- check-regexp '^00:00:[0-9]' '00:00:01.1234 IP 130.127.49.2 2'
  - searching : 00:00:01.1234 IP 130.127.49.2 2
  - for the regexp : ^00:00:[0-9]
  - num of subexps : 0
  - what[0]: 00:00:0
  - suffix: 1.1234 IP 130.127.49.2 2
  - total number of matches: 1
- check-regexp '^00:00:[0-9]{1,2}' '00:00:01.1234 IP 130.127.49.2 2'
  - searching : 00:00:01.1234 IP 130.127.49.2 2
  - for the regexp : ^00:00:[0-9]{1,2}
  - num of subexps : 0
  - what[0]: 00:00:01
  - suffix: .1234 IP 130.127.49.2 2
  - total number of matches: 1
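If check-regexp is not installed, a similar comparison can be sketched with grep -oE (extended syntax, print only the matched text). The input string below is deliberately artificial: it has three digits after the second colon so that the difference between [0-9], [0-9]{1,2} and [0-9]+ becomes visible.

```shell
# Artificial input with three digits where the seconds would be.
ts='00:00:123.4'

one=$(echo "$ts" | grep -oE '^00:00:[0-9]')         # a single digit
upto2=$(echo "$ts" | grep -oE '^00:00:[0-9]{1,2}')  # one or two digits
plus=$(echo "$ts" | grep -oE '^00:00:[0-9]+')       # one or more digits

echo "[0-9]      -> $one"
echo "[0-9]{1,2} -> $upto2"
echo "[0-9]+     -> $plus"
```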

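Regular expression examples

"^#*" matches zero or more "#" characters, but only at the beginning of a line. Because zero repetitions also count as a match, this pattern matches every line; use "^##*" to require at least one "#".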
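Let's say we needed to find lines in a file whose first word:
- Starts with a T
- Is the first word on the line
- Has a lower-case second letter
- Is exactly three letters long
- Has a vowel as its third letter
The pattern is /^T[a-z][aeiou]\>/. The ^ makes it the first word on the line, and \> (end of word) enforces the three-letter length; without it the pattern would also match longer words such as "Then".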
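A quick way to check the three-letter-word pattern is to run it over a few invented lines. This sketch assumes GNU grep, since \> is an extension; the sample lines are made up.

```shell
# Made-up sample input.
lines=$(printf '%s\n' 'The cat sat' 'Tea time' 'Then we left' 'Top hat' 'the end')

# Without an end-of-word anchor, "Then" slips through because it starts with "The".
loose=$(printf '%s\n' "$lines" | grep -c '^T[a-z][aeiou]')

# With \> the word must end right after the third letter.
strict=$(printf '%s\n' "$lines" | grep -c '^T[a-z][aeiou]\>')

echo "lines matched without \\>: $loose"
echo "lines matched with \\>:    $strict"
```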
Example of a UDP trace obtained from tcpdump. The file format is a line for each packet captured:
- 00:00:00.000000 IP 130.127.49.48.33581 > 130.127.49.144.commplex-main: UDP, length 1000
- All we want to do is catch and return the full lines that begin with '00:00:0'
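- To match '^00:00:[0-9]{1,2}' with grep, assuming basic regexp:
  - grep -o '^00:00:[0-9]\{1,2\}' t.trace
    - Note: in basic regexp the backslashes are required; \{ and \} mark the repetition count, whereas unescaped curly braces are matched literally
    - Note: -o prints only the matched text; drop it to print the full matching lines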
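The difference between printing the matched text and printing the full line can be seen on a small trace. In this sketch the first line is the sample packet from above; the other two lines and the temporary file are made up for illustration.

```shell
# Temporary trace file; only the first line comes from the real capture.
trace=$(mktemp)
cat > "$trace" <<'EOF'
00:00:00.000000 IP 130.127.49.48.33581 > 130.127.49.144.commplex-main: UDP, length 1000
00:00:12.500000 IP 130.127.49.48.33581 > 130.127.49.144.commplex-main: UDP, length 1000
00:01:03.250000 IP 130.127.49.48.33581 > 130.127.49.144.commplex-main: UDP, length 1000
EOF

# -o prints just the part of each line that the pattern matched.
matched=$(grep -o '^00:00:[0-9]\{1,2\}' "$trace")

# Without -o, grep prints every full line that begins with 00:00:0.
full=$(grep '^00:00:0' "$trace")

echo "matched text:"; echo "$matched"
echo "full lines:";   echo "$full"
rm -f "$trace"
```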
The next example is a script that matches lines beginning with 00:00:0 and returns just the timestamp and length.
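The script itself is not reproduced on this page, so what follows is only one possible sketch of the idea, not the original. It assumes the tcpdump line format shown above, with the timestamp first and "length N" at the end of the line; the trace contents apart from the first line are made up.

```shell
# Temporary trace file; only the first line comes from the real capture.
trace=$(mktemp)
cat > "$trace" <<'EOF'
00:00:00.000000 IP 130.127.49.48.33581 > 130.127.49.144.commplex-main: UDP, length 1000
00:00:05.250000 IP 130.127.49.48.33581 > 130.127.49.144.commplex-main: UDP, length 512
00:00:12.500000 IP 130.127.49.48.33581 > 130.127.49.144.commplex-main: UDP, length 1000
EOF

# \( \) groups capture the timestamp and the length; \1 and \2 put them back.
# -n together with the p flag prints only the lines where the substitution happened.
summary=$(sed -n 's/^\(00:00:0[0-9.]*\) .*length \([0-9]*\)$/\1 \2/p' "$trace")

echo "$summary"
rm -f "$trace"
```

The third line is skipped because its timestamp starts with 00:00:1 rather than 00:00:0.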

Example: a script that prints only what lies between two delimiters (begin and end) in a given file

#!/bin/bash
dataFName=$1
begin=$2
end=$3
# All three arguments are required; the result goes to standard output.
if (( $# < 3 )); then
    echo "Usage: $0 [data filename] [beginning delimiter] [end delimiter]"
    exit 1
fi
# The following variable can be used to 'eat' any number of digits, followed by any number of dots, then any number of digits (Question: what is the final space and *?)
regex="[0-9]*[.]*[0-9]* *"
# grep -o "$begin$regex$end" : this prints only the matched text, including the begin and end delimiters
# sed "s/$end//g" | sed "s/$begin//g" : this strips off the delimiters
grep -o "$begin$regex$end" "$dataFName" | sed "s/$end//g" | sed "s/$begin//g"
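To try the pipeline without saving the script, the same steps can be wrapped in a function. This is a sketch: the function name extract_between, the delimiters 'rate=' and 'Mbps', and the input lines are all made up for illustration.

```shell
# Same pipeline as the script, wrapped in a (hypothetical) function.
# $1 = data file, $2 = beginning delimiter, $3 = end delimiter
extract_between() {
    regex='[0-9]*[.]*[0-9]* *'
    grep -o "$2$regex$3" "$1" | sed "s/$3//g" | sed "s/$2//g"
}

# Made-up input file.
data=$(mktemp)
printf '%s\n' 'rate=12.5 Mbps' 'rate=7 Mbps' 'no numbers here' > "$data"

rates=$(extract_between "$data" 'rate=' 'Mbps')
echo "$rates"
rm -f "$data"
```

Each output line keeps the space that the trailing " *" matched, because only the delimiters are stripped. Delimiters that contain characters special to grep or sed (such as / or .) would need escaping first.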

Last update: 6/30/2017