Shell scripts | Regular expressions
When working with text data, something that stands out in scripting is using regular expressions (aka regex). Regular expressions are symbolic notations we can use to identify patterns in text data.
Although regex is available in almost every scripting and programming language (if not all of them), the syntax varies slightly from language to language.
Regular expressions combine literals and metacharacters to form patterns. They can be used both inside shell scripts and alongside command-line tools.
With regular expressions we can:
- Search in huge text files to find specific words.
- Validate input to match what the program may require (e.g., don't accept letters when asking for a number in input).
- Replace particular words or letters automatically in a document (e.g., capitalize the first letter after a period).
- Coordinate actions inside the command line tools (e.g., redirecting some parameters only if a regex condition is met).
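As a taste of the input validation case, here is a minimal sketch. The helper name is_number is made up for this illustration, and the anchors, the + quantifier and the -E flag it relies on are covered later in this episode.

```shell
#!/bin/sh
# is_number (hypothetical helper): succeed only when the argument
# consists entirely of digits.
is_number() {
    # ^ and $ anchor the pattern so the whole input must be digits
    printf '%s\n' "$1" | grep -Eq '^[0-9]+$'
}

for input in 42 4x2 ""; do
    if is_number "$input"; then
        echo "'$input' accepted"
    else
        echo "'$input' rejected"
    fi
done
# prints:
# '42' accepted
# '4x2' rejected
# '' rejected
```

The -q flag keeps grep silent, so only its exit status is used to decide.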
To work with regex on the command line in this episode, let's create a document containing some sample data.
$ touch data.txt

# ~/data.txt
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqurtuvwxyz
0123456789
big
bag
bug
love
live
like
.[{()\^$|?*+
codeberg.org
[email protected]
[email protected]
432-111-7654
234.111.4567
Sir Camelot James
Sir James
Mr. Watson
Mr John
A car should not fly
A plane cannot sail
A boat can float
Let's find James in this text:
$ grep 'James' data.txt
Sir Camelot James
Sir James
The example above takes a literal ('James') and will print any line that contains the five characters J, a, m, e, s consecutively and in that order.
$ grep 'J[oa]' data.txt
Sir Camelot James
Sir James
Mr John
In this case we're passing a meta-character pattern as regex. The regular expression will match any instance of the character J followed by either the character o or a.
Note that in both cases we enclosed the regex with quotes. This is important since the majority of the meta-characters used by regex are meaningful to the shell too.
The accepted meta-characters in regular expressions are:
$ ^ . - * ? + ( ) { } [ ] \ |
Let's take a more in-depth look at them. POSIX distinguishes between basic regex and extended regex.
- Basic regex includes ^ $ . [ ] *
- Extended regex adds { } ( ) ? + |
The backslash \ works as the escape character in both.
Basic regex
Basic regex (BRE) treats any character outside that set as a literal. To use one of the extended metacharacters as a metacharacter in BRE, we have to put a backslash in front of it (for example \{ \} and \( \)).
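A small sketch of that difference, using the curly-bracket interval explained further down. The three sample lines are made up for this illustration and are not part of data.txt.

```shell
# Illustrative sample input (not the data.txt from above)
sample='ac
abc
abbc'

# BRE: the braces need a backslash to act as an interval
printf '%s\n' "$sample" | grep 'ab\{2\}c'      # prints abbc

# ERE: the same interval without backslashes
printf '%s\n' "$sample" | grep -E 'ab{2}c'     # prints abbc

# BRE without backslashes: the braces are literal characters here,
# so none of the sample lines match
printf '%s\n' "$sample" | grep 'ab{2}c' || echo "no match"
```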
— Anchors
- Caret ^ matches the beginning of a line. This matches a position, not a character.
# print all directories and files in the home directory whose names begin with "Do"
$ cd
$ printf "%s\n" * | grep "^Do"
Documents
Downloads
- Dollar sign $ matches the end of a line. This matches a position, not a character.
# print all directories and files in the home directory whose names end with "s"
$ cd
$ printf "%s\n" * | grep "s$"
Documents
Downloads
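Both anchors can be combined so that the pattern has to cover the entire line. A quick sketch, where the third sample line ("James Sir") is invented for the illustration:

```shell
names='Sir Camelot James
Sir James
James Sir'

# ^ pins the match to the start of the line
printf '%s\n' "$names" | grep '^Sir'          # first two lines

# $ pins the match to the end of the line
printf '%s\n' "$names" | grep 'James$'        # first two lines

# both together: nothing may come before or after the literal
printf '%s\n' "$names" | grep '^Sir James$'   # only "Sir James"
```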
— Any Character
- Period (or dot) . matches any single character except line breaks. Each dot stands for exactly one character, so it adds one to the length of the required match.
$ grep '.co' data.txt
[email protected]
[email protected]

$ grep 'b.g' data.txt
big
bag
bug

$ grep 'l.v' data.txt
love
live

$ grep 'l..e' data.txt
love
live
like
A plane cannot sail
— Character sets
- Square brackets [] match any one of the characters given between the brackets. The set can be a fixed list or a range, and the match can be either positive or negated.
# character set
[ABC]
# negated set
[^ABC]
# range
[A-Z]
# combined range
[A-Za-gh0-9]
# an equivalent to dot (.) is
[^\n\r]
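To see the sets in action with grep, here is a short sketch. The word list extends the big/bag/bug lines with two made-up entries (bog and b4g) so that every variant has something to show.

```shell
words='big
bag
bug
bog
b4g'

printf '%s\n' "$words" | grep 'b[aeiou]g'     # set: big, bag, bug, bog
printf '%s\n' "$words" | grep 'b[^aeiou]g'    # negated set: b4g
printf '%s\n' "$words" | grep 'b[0-9]g'       # range: b4g
printf '%s\n' "$words" | grep 'b[a-h0-9]g'    # combined range: bag, b4g
```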
— Match zero or more
- Asterisk * matches zero or more occurrences of the previous character.
# look for the character b followed by the character e zero or more times
$ grep 'be*' data.txt
abcdefghijklmnopqurtuvwxyz
big
bag
bug
codeberg.org
A boat can float
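Because zero occurrences also count, 'be*' ends up matching every line that contains a b. A tiny sketch (with made-up sample lines) shows how to require at least one e using only basic regex:

```shell
words='b
be
bee
a'

# zero or more e's after b: a lone b is enough
printf '%s\n' "$words" | grep 'be*'     # b, be, bee

# one literal e, then zero or more e's: at least one e is required
printf '%s\n' "$words" | grep 'bee*'    # be, bee
```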
Extended regex
Extended regular expressions (ERE) can be used with Unix utilities like grep by passing the command-line flag -E. Other Unix utilities like awk or egrep (an obsolescent alias for grep -E) use ERE by default.
— Alternation
- Vertical bar | acts like a logical OR operator. It matches the expression before or after the vertical bar, and the patterns are tested in order.
# two
$ grep -E 'car|plane' data.txt
A car should not fly
A plane cannot sail

# more than two
$ grep -E 'car|plane|boat' data.txt
A car should not fly
A plane cannot sail
A boat can float
— Escape
- Backslash \ removes or adds special meaning to the next character. (Handy when we want to look for a character that is actually a metacharacter.)
# search for dot (.) and don't treat it as a metachar.
$ grep '\.' data.txt
.[{()\^$|?*+
codeberg.org
234.111.4567
Mr. Watson
To represent non-printable characters we can use the following:
- \t matches a tab.
- \r matches a carriage return.
- \n matches a newline.
Combined with some specific letters after it, the backslash gives us more functionality. These shorthands come from Perl-style regex rather than POSIX: GNU grep accepts \s and \w, but \d needs grep -P.
- \s matches anything considered white space, like tabs, line breaks, etc.
- \d matches any digit. A handy alternative to [0-9].
- \w matches anything considered a word character.
Using the uppercase of s, d, and w inverts the meaning of the lowercase shorthand.
- \S matches anything not considered white space.
- \D matches anything not considered a digit.
- \W matches anything not considered a word character.
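When a tool does not understand these shorthands, the POSIX bracket classes are a portable way to say the same thing. A minimal sketch, assuming any POSIX grep; the third sample line is made up for the illustration:

```shell
lines='432-111-7654
Mr John
no_digits_here'

# [[:digit:]] is the portable spelling of \d
printf '%s\n' "$lines" | grep -E '[[:digit:]]{3}'      # 432-111-7654

# [[:space:]] is the portable spelling of \s
printf '%s\n' "$lines" | grep -E 'Mr[[:space:]]John'   # Mr John

# [^[:alnum:]_] is the portable spelling of \W
printf '%s\n' "$lines" | grep -E '[^[:alnum:]_]'       # first two lines
```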
— Match zero or once.
- Question mark ? matches the preceding element zero or one time.
- ab?c matches either ac or abc.
- (ab)? matches either the empty string or ab.
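Applied to the kind of lines we have in data.txt, the question mark lets us accept a title with or without its dot. A short sketch; the "Mrs Hudson" line is invented to show a non-match:

```shell
titles='Mr. Watson
Mr John
Mrs Hudson'

# \. is a literal dot, and ? makes it optional
printf '%s\n' "$titles" | grep -E '^Mr\.? '
# prints:
# Mr. Watson
# Mr John
```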
— Match once or more
- Plus sign + matches one or more occurrences of the preceding element.
- ab+c matches abc, abbbc but not ac.
- [abc]+ matches a, b, c, ca, cba, abccb, etc.
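The same two patterns can be tried with grep -E. The sample lines below are made up for the illustration:

```shell
codes='ac
abc
abbbc
abd'

# at least one b between a and c
printf '%s\n' "$codes" | grep -E 'ab+c'        # abc, abbbc

# whole line made only of the letters a, b and c
printf '%s\n' "$codes" | grep -E '^[abc]+$'    # ac, abc, abbbc
```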
— Match n specific times
- Curly brackets {} match the preceding element the number of times defined inside them. The count can be fixed (n) or cover a range between n and m.
# {n} matches the preceding item exactly n times
$ grep -E '[0-9]{3}' data.txt
0123456789
432-111-7654
234.111.4567

# {n,} matches at least n times
$ grep -E '[2-6]{4,}' data.txt
0123456789

# {n,m} matches at least n, but no more than m times
$ grep -E 'n{1,2}' data.txt
abcdefghijklmnopqurtuvwxyz
Mr. Watson
Mr John
A car should not fly
A plane cannot sail
A boat can float

# {,m} matches at most m times (a GNU extension)
# on its own it also matches zero occurrences, so it prints the whole document
We can take advantage of this to avoid long regular expression syntax. As an example, if we would like to find every phone number inside a clients_data file we could do it as follows:
# American format 234.555.6789
\d{3}.\d{3}.\d{4}

# British format 7222 555 555
\d{4}\s?\d{3}\s?\d{3}
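Since \d is not available in plain grep -E, a runnable version of the same idea could look like the sketch below. The client names and the inline sample standing in for the clients_data file are assumptions made for the illustration, and [.-] is used so the separator has to be a dot or a dash.

```shell
# Made-up stand-in for the contents of clients_data
clients='Ann 234.555.6789
Bob 432-111-7654
Eve 7222 555 555
Sam 12.34.56'

# American style: 3 digits, separator, 3 digits, separator, 4 digits
printf '%s\n' "$clients" | grep -E '[0-9]{3}[.-][0-9]{3}[.-][0-9]{4}'
# prints the Ann and Bob lines

# British style: 4 digits, optional space, 3 digits, optional space, 3 digits
printf '%s\n' "$clients" | grep -E '[0-9]{4} ?[0-9]{3} ?[0-9]{3}'
# prints the Eve line
```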
— Grouping
- Parentheses () group several characters of a regular expression together, so a quantifier such as ? applies to the whole group.
# this will match whether Camelot is present or not
$ grep -E 'Sir (Camelot )?James' data.txt
Sir Camelot James
Sir James
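Groups are also handy to limit how far an alternation reaches. A small sketch using three of the data.txt lines:

```shell
vehicles='A car should not fly
A plane cannot sail
A boat can float'

# the group keeps "^A " and the trailing space outside the alternation
printf '%s\n' "$vehicles" | grep -E '^A (car|boat) '
# prints:
# A car should not fly
# A boat can float
```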
Summing up
By default, the grep tool prints the whole line that contains the regex match. There are ways to print only the isolated match, for example by combining it with other command-line tools. The same way we used regex with grep, we can use it with commands and tools like ed, awk or sed.
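A rough sketch of those options follows. It assumes a grep that supports the -o flag (GNU and BSD grep do, although it is not part of POSIX), and the sample line is made up from two of the numbers in data.txt.

```shell
line='call 234.111.4567 or 432-111-7654'

# grep -o prints each match on its own line instead of the whole line
printf '%s\n' "$line" | grep -Eo '[0-9]{3}[.-][0-9]{3}[.-][0-9]{4}'
# prints:
# 234.111.4567
# 432-111-7654

# sed substitutes whatever the regex matches (BRE by default)
printf '%s\n' "$line" | sed 's/[0-9]\{4\}/XXXX/g'
# prints: call 234.111.XXXX or 432-111-XXXX

# awk prints the lines matching the ERE between the slashes
printf '%s\n' "$line" | awk '/^call /'
```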
I encourage you to start using regular expressions in your code, no matter the programming language (if you haven't yet). Learning how to mix the building blocks together into effective patterns takes time and practice. Check out regex101 for a pleasant way to learn regular expressions in an interactive playground (:
For an advanced example, think about a shell script that looks through the files in a remote directory and finds every plain text file named account_data. After verifying that each file contains login credentials, it encrypts the file. All done by matching regular expressions.
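A very rough sketch of how such a script could start is shown below. Everything specific in it is an assumption made for illustration: the function name find_credential_files, a local directory standing in for the remote one, and the idea that credentials look like user=... or password=... lines. The encryption step is left as a placeholder.

```shell
#!/bin/sh
# find_credential_files DIR (hypothetical helper): print every file under
# DIR whose name starts with account_data and that contains a line shaped
# like "user=..." or "password=..." (an assumed credentials format).
find_credential_files() {
    find "$1" -type f -name 'account_data*' | while IFS= read -r file; do
        # -q: we only need the exit status, not the matching lines
        if grep -Eq '^(user|password)=.+' "$file"; then
            printf '%s\n' "$file"
        fi
    done
}

# Usage sketch: swap the echo for the encryption tool of your choice.
# find_credential_files /path/to/dir | while IFS= read -r f; do
#     echo "would encrypt: $f"
# done
```

Only the file name filter and the grep pattern would need to change to adapt it to a real layout.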
It's a bit funny to store confidential data in plain text files, but the guys at Facebook, Instagram and Whatsapp have been saving hashes, credentials and backups from people as plain text files on their servers for a while. ¯\(ツ)/¯