Regular Expressions
This page serves to guide us as we learn regular expressions. References [3,4,7,23] seem to be the best sources of help.
- Remember: the shell expands its metacharacters before it passes arguments to a command, so regex patterns must be in quotes to prevent the shell from interpreting them
- Structure of regex
- Anchors - used to specify the position of the pattern : ^, $, \b, \B, \<, \>
- Character sets - match any one character from a set in a single position : [ ] with any of the following: A-Z, a-z, 0-9, or a class predefined in the current locale such as [:alnum:], [:alpha:], [:digit:], ...
- Modifiers - specify how many times the previous character can be repeated : ?, *, +, {n,m} (the dot . is not a modifier; it matches any single character)
- Backreferences - \n, where n is a digit that points to the n'th parenthesized subexpression
- Basic and Extended regex
- Basic: supported by vi/sed/grep/more
- Extended: supported by awk/egrep (egrep is equivalent to grep -E)
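The building blocks listed above can be tried directly with grep. The following is a minimal sketch, not a definitive reference: the sample text is made up for illustration, and every pattern is single-quoted so the shell passes it to grep unchanged.

```shell
# Made-up sample text, one test string per line.
sample='Tea time
the cat sat
Top 10 list
abc abc
## comment'

# Anchor: ^ ties the match to the start of the line
# (prints "Tea time" and "Top 10 list").
printf '%s\n' "$sample" | grep '^T'

# Character set using a POSIX class: lines containing a digit
# (prints "Top 10 list").
printf '%s\n' "$sample" | grep '[[:digit:]]'

# Modifier in basic syntax: \{2\} means "two of the previous character"
# (prints "## comment").
printf '%s\n' "$sample" | grep '^#\{2\}'

# Backreference: \1 must repeat whatever the first \( \) group matched
# (prints "abc abc").
printf '%s\n' "$sample" | grep '\([a-z][a-z]*\) \1'

# The same modifier in extended syntax (grep -E) needs no backslashes
# (prints "## comment").
printf '%s\n' "$sample" | grep -E '^#{2}'
```

The only difference between the third and fifth commands is the syntax flavor: basic regexps need the backslashes around the braces, extended regexps do not.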
Install the check-regexp program (apt-get install source-highlight); it is a very helpful tool. Give it a regexp and a text string and it shows not only whether the regexp matches as expected but also the number of subexpressions, the text that was matched, the text left over after the match (the suffix), and the total number of matches.
- check-regexp '^00:00:[0-9]{1,2}' '00:00:14.1234 IP 130.127.49.2 '
- The {1,2} indicates we require 1 or 2 digits in the range 0...9.
- Note: [0-9]+ would match one or more digits
- check-regexp '^00:00:[0-9]' '00:00:01.1234 IP 130.127.49.2 2'
- searching : 00:00:01.1234 IP 130.127.49.2 2
- for the regexp : ^00:00:[0-9]
- num of subexps : 0
- what[0]: 00:00:0
- suffix: 1.1234 IP 130.127.49.2 2
- total number of matches: 1
- check-regexp '^00:00:[0-9]{1,2}' '00:00:01.1234 IP 130.127.49.2 2'
- searching : 00:00:01.1234 IP 130.127.49.2 2
- for the regexp : ^00:00:[0-9]{1,2}
- num of subexps : 0
- what[0]: 00:00:01
- suffix: .1234 IP 130.127.49.2 2
- total number of matches: 1
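If check-regexp is not installed, part of its report can be approximated in Bash itself. The sketch below is an assumption-laden stand-in, not the tool's implementation: show_match is a hypothetical helper name, and it relies on Bash's [[ =~ ]] operator (which uses extended regexp syntax) and the BASH_REMATCH array to print the matched text and the suffix.

```shell
#!/usr/bin/env bash
# show_match REGEX TEXT
# Hypothetical helper: prints the matched text and what follows it,
# loosely imitating the "what[0]" and "suffix" lines of check-regexp.
show_match() {
    local regex=$1 text=$2
    if [[ $text =~ $regex ]]; then
        local matched=${BASH_REMATCH[0]}
        printf 'what[0]: %s\n' "$matched"
        # Strip everything up to and including the first occurrence of the
        # matched text. This equals the true suffix for ^-anchored patterns.
        printf 'suffix: %s\n' "${text#*"$matched"}"
    else
        printf 'no match\n'
    fi
}

show_match '^00:00:[0-9]' '00:00:01.1234 IP 130.127.49.2 2'
show_match '^00:00:[0-9]{1,2}' '00:00:01.1234 IP 130.127.49.2 2'
```

The first call reports 00:00:0 with suffix 1.1234 IP 130.127.49.2 2, and the second reports 00:00:01 with suffix .1234 IP 130.127.49.2 2, which agrees with the check-regexp output listed above.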
Regular expressions examples
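- "^#*" matches 0 or more "#" characters, but ONLY starting at the beginning of a line. Because zero occurrences are allowed, every line matches; use "^##*" (or "^#+" in extended syntax) to require at least one "#"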
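- Let's say we needed to find lines in a file that contain a word which:
- Starts with a T
- Is the first word on the line
- Has a lower-case second letter
- Is exactly three letters long
- Has a vowel as its third letter
- /^T[a-z][aeiou]\>/ (the trailing \> anchors the end of the word; without it, longer words such as "Tree" would also match)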
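The five criteria can be checked with grep on a few lines. This is a sketch under stated assumptions: the sample lines are invented, LC_ALL=C is set so that [a-z] means strictly lower-case ASCII letters, and the end of the word is expressed portably as "a non-letter or end of line" rather than with the GNU \> anchor.

```shell
# Made-up sample lines.
words='Tea is hot
Tree tops
The end
tea leaves
TEa party'

# Without an end-of-word check, "Tree tops" slips through because the
# pattern only looks at the first three letters (prints 3 lines).
printf '%s\n' "$words" | LC_ALL=C grep '^T[a-z][aeiou]'

# ([^[:alpha:]]|$) requires a non-letter or the end of the line after the
# third letter, so only three-letter words remain
# (prints "Tea is hot" and "The end").
printf '%s\n' "$words" | LC_ALL=C grep -E '^T[a-z][aeiou]([^[:alpha:]]|$)'
```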
- Example of a UDP trace obtained from tcpdump. The file format is a line for each packet captured:
- 00:00:00.000000 IP 130.127.49.48.33581 > 130.127.49.144.commplex-main: UDP, length 1000
- All we want to do is catch and return the full lines that begin with '00:00:0'
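- To do the same match with grep, assuming basic regexps :
- grep -o '^00:00:[0-9]\{1,2\}' t.trace
- Note: in basic regexps the \ is what makes grep treat the curly braces as a repetition count; unescaped, they are matched as literal characters. The -o option prints only the matched text; drop it to print the full lines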
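The effect of the backslashes and of -o can be compared side by side. In this sketch the first trace line is the one quoted earlier on this page; the other two lines are made up so that there is something to filter out. The last command assumes GNU grep behavior for unescaped braces in basic syntax.

```shell
# Three trace lines: the first is from the page, the other two are invented.
trace_lines='00:00:00.000000 IP 130.127.49.48.33581 > 130.127.49.144.commplex-main: UDP, length 1000
00:00:14.123400 IP 130.127.49.48.33581 > 130.127.49.144.commplex-main: UDP, length 1000
00:01:02.000000 IP 130.127.49.48.33581 > 130.127.49.144.commplex-main: UDP, length 1000'

# Basic syntax, no -o: prints the two full lines from minute 00:00.
printf '%s\n' "$trace_lines" | grep '^00:00:[0-9]\{1,2\}'

# Basic syntax with -o: prints only the matched text (00:00:00 and 00:00:14).
printf '%s\n' "$trace_lines" | grep -o '^00:00:[0-9]\{1,2\}'

# Extended syntax: the same match, with no backslashes on the braces.
printf '%s\n' "$trace_lines" | grep -E -o '^00:00:[0-9]{1,2}'

# Basic syntax with unescaped braces: they are literal characters, so
# nothing in the trace matches and the fallback message is printed.
printf '%s\n' "$trace_lines" | grep '^00:00:[0-9]{1,2}' || echo 'no lines matched'
```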
- Next example: a script that matches lines with 00:00:0 and returns just the timestamp and length
- Example: a script that finds only what is between two delimiters (begin and end) in a given file
- dataFName=$1 begin=$2 end=$3
- if (( $# < 3 )); then
- echo "Usage: [data filename] [beginning delimiter] [end delimiter]"
- exit 1
- fi
- # The following variable can be used to 'eat' any number of digits, followed by any number of dots, then any number of digits (Question: what is the final space and *?)
- regex="[0-9]*[.]*[0-9]* *"
- # grep -o "$begin$regex$end" : this produces the pattern with the begin and end delimiters
- # sed "s/$end//g" | sed "s/$begin//g" : this strips off the delimiters
- grep -o "$begin$regex$end" "$dataFName" | sed "s/$end//g" | sed "s/$begin//g"
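The same pipeline can be wrapped in a function and tried on a small file. This is a minimal sketch rather than the script itself: the function name between and the sample data are hypothetical, and it assumes the delimiters contain neither "/" nor characters that are special in a basic regexp, since both grep and sed interpret them as patterns.

```shell
# between FILE BEGIN END
# Hypothetical wrapper: prints the numbers found between the BEGIN and
# END delimiters in FILE, one per line, with the delimiters stripped.
between() {
    if [ "$#" -lt 3 ]; then
        echo "Usage: between [data filename] [beginning delimiter] [end delimiter]" >&2
        return 1
    fi
    # grep -o keeps only "BEGIN<number>END"; sed then removes END and BEGIN.
    grep -o "$2[0-9]*[.]*[0-9]* *$3" "$1" | sed -e "s/$3//g" -e "s/$2//g"
}

# Made-up data file for the demonstration.
data=$(mktemp)
printf '%s\n' 'ping 1: time=12.5ms' 'ping 2: time=7ms' 'no timing here' > "$data"

# Prints 12.5 and 7, one per line; the third line has no match.
between "$data" 'time=' 'ms'
rm -f "$data"
```

If a space separated the number from the end delimiter (for example "12.5 ms"), the trailing " *" in the pattern would absorb it and the output would keep that space.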
Last update: 6/30/2017