Shell scripts | Regular expressions

When working with text data, something that stands out in scripting is using regular expressions (aka regex). Regular expressions are symbolic notations we can use to identify patterns in text data.

Although regex are available in almost every scripting and programming language (if not all of them), the syntax varies slightly from language to language.

Regular expressions combine literals and metacharacters to form patterns. They can be used both inside shell scripts and alongside command-line tools.

With regular expressions we can:

Search in huge text files to find specific words.
Validate input to match what the program may require (e.g., don't accept letters when asking for a number in input).
Replace particular words or letters automatically in a document (e.g., capitalize the first letter of each word that follows a period).
Coordinate actions inside the command line tools (e.g., redirecting some parameters only if a regex condition is met).
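
As a quick taste of the validating and replacing items, here is a minimal sketch with grep and sed. The function name is_number and the sample strings are made up for illustration; they are not part of any tool.

```shell
# Validate: accept the input only if it is made of digits and nothing else.
is_number() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+$'
}

is_number "2024" && echo "2024 is a number"
is_number "20x4" || echo "20x4 is not a number"

# Replace: swap every "colour" for "color" in a line of text.
fixed=$(printf '%s\n' "my colour, your colour" | sed 's/colour/color/g')
echo "$fixed"
```

The pieces of that pattern (anchors, sets, quantifiers) are covered one by one further down.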

To practice regex on the command line in this episode, let's create a document containing some sample data.

$ touch data.txt

# ~/data.txt
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqurtuvwxyz     

0123456789

big bag bug 
love live like

.[{()\^$|?*+    

codeberg.org
user@domain.com
admin@other.co.uk

432-111-7654   
234.111.4567

Sir Camelot James
Sir James
Mr. Watson
Mr John

A car should not fly
A plane cannot sail
A boat can float

Let's find James in this text:

$ grep 'James' data.txt

Sir Camelot James
Sir James

The example above takes a literal ('James') and will print any line that contains the five characters J, a, m, e, s consecutively and in that order.

$ grep 'J[oa]' data.txt

Sir Camelot James
Sir James
Mr John

In this case we're passing a meta-character pattern as regex. The regular expression will match any instance of the character J followed by either the character o or a.

Note that in both cases we enclosed the regex with quotes. This is important since the majority of the meta-characters used by regex are meaningful to the shell too.
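
To see what the shell does with an unquoted pattern, here is a small sketch. It assumes a throwaway directory (created with mktemp) holding two empty files named Jo and Ja, which are invented just for this demonstration.

```shell
# A throwaway directory so the pattern has file names to expand to.
tmpdir=$(mktemp -d)
touch "$tmpdir/Jo" "$tmpdir/Ja"

# Unquoted: the shell replaces J[oa] with the matching file names
# before the command ever sees it.
unquoted=$(cd "$tmpdir" && echo J[oa])

# Quoted: the pattern reaches the command untouched.
quoted=$(cd "$tmpdir" && echo 'J[oa]')

echo "unquoted: $unquoted"   # unquoted: Ja Jo
echo "quoted:   $quoted"     # quoted:   J[oa]

rm -rf "$tmpdir"
```

With grep in place of echo, the unquoted version would search for the text Ja inside the file Jo, which is rarely what we meant.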

The accepted meta-characters in regular expressions are:

$ ^ . - * ? + ( ) { } [ ] \ |

Let's take a more in-depth look at them. POSIX distinguishes between basic regex and extended regex.

Basic regex includes ^ $ . [ ] * \
Extended regex adds { } ( ) ? + |

Basic regex

In basic regex (BRE) every character outside that set is a literal. To use { } ( ) as metacharacters we have to put a backslash in front of them (\{ \} \( \)). GNU grep also accepts \? \+ and \| as extensions.
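
A minimal sketch of that difference, using three made-up sample lines instead of data.txt:

```shell
sample=$(printf '%s\n' 'ab' 'abb' 'a{2}')

# BRE (plain grep): braces are literal unless escaped.
bre_literal=$(printf '%s\n' "$sample" | grep 'a{2}')      # a{2}
bre_interval=$(printf '%s\n' "$sample" | grep 'b\{2\}')   # abb

# ERE (grep -E): braces are metacharacters without the backslash.
ere_interval=$(printf '%s\n' "$sample" | grep -E 'b{2}')  # abb

printf '%s\n' "$bre_literal" "$bre_interval" "$ere_interval"
```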

— Anchors

Caret ^ matches the beginning of a line. This matches a position, not a character.

# print all directories and files in the home directory whose names begin with "Do"
$ cd
$ printf "%s\n" * | grep "^Do"
Documents
Downloads

Dollar sign $ matches the end of a line. This matches a position, not a character.

# print all directories and files in the home directory whose names end with "s"
$ cd
$ printf "%s\n" * | grep "s$"
Documents
Downloads
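
Both anchors can be combined to describe a whole line. A small sketch on made-up sample lines (one of them empty):

```shell
names=$(printf '%s\n' 'Sir James' '' 'Sir Camelot James' 'James Sir')

# ^ alone: count the lines that start with "Sir".
starts=$(printf '%s\n' "$names" | grep -c '^Sir')     # 2

# $ alone: count the lines that end with "James".
ends=$(printf '%s\n' "$names" | grep -c 'James$')     # 2

# Both: the line must be exactly "Sir James", nothing before or after.
exact=$(printf '%s\n' "$names" | grep '^Sir James$')  # Sir James

# ^$ matches empty lines: start and end are the same position.
empty=$(printf '%s\n' "$names" | grep -c '^$')        # 1

echo "$starts $ends $exact $empty"
```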

— Any Character

Period (or dot) . matches any single character except line breaks. Each dot stands for exactly one character, so inside a regular expression it increases the length of the required match.

$ grep '.co' data.txt
user@domain.com
admin@other.co.uk

$ grep 'b.g' data.txt
big bag bug

$ grep 'l.v' data.txt
love live like

$ grep 'l..e' data.txt
love live like
A plane cannot sail

— Character sets

Square brackets [] match any one of the characters given between the brackets. The set can be a fixed list or a range, and it can be either positive (match these) or negated (match anything but these).

# character set [ABC]

# negate set [^ABC]

# range [A-Z]

# combined range [A-Za-gh0-9]

# in flavors that accept \n and \r inside brackets (PCRE, for instance), [^\n\r] is an equivalent to dot (.)
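
Here is how those forms behave with grep, sketched on a few made-up words rather than on data.txt:

```shell
words=$(printf '%s\n' 'big' 'bag' 'bug' 'bog' 'b4g')

# Fixed set: a or i in the middle.
set_match=$(printf '%s\n' "$words" | grep 'b[ai]g')      # big, bag

# Negated set: anything except a or i in the middle.
negated=$(printf '%s\n' "$words" | grep 'b[^ai]g')       # bug, bog, b4g

# Range: any digit in the middle.
range=$(printf '%s\n' "$words" | grep 'b[0-9]g')         # b4g

# Combined ranges: a to i, or any digit.
combined=$(printf '%s\n' "$words" | grep 'b[a-i0-9]g')   # big, bag, b4g

printf '%s\n' "$combined"
```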

— Match zero or more

Asterisk * matches zero or more occurrences of the preceding element.

# look for the character b followed by the character e zero or more times
# (zero is allowed, so every line containing a b matches)
$ grep 'be*' data.txt
abcdefghijklmnopqurtuvwxyz
big bag bug
codeberg.org
A boat can float
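
Because zero occurrences count as a match, the asterisk is often paired with something that forces at least one character. A short sketch on made-up lines:

```shell
lines=$(printf '%s\n' 'b' 'be' 'bee' 'xyz')

# e* allows zero e's, so every line containing b matches.
zero_or_more=$(printf '%s\n' "$lines" | grep -c 'be*')    # 3

# Writing the e once before e* requires at least one e.
one_or_more=$(printf '%s\n' "$lines" | grep -c 'bee*')    # 2

# .* means "any character, zero or more times": anything between b and e.
between=$(printf '%s\n' 'b---e' 'be' 'eb' | grep -c 'b.*e')   # 2

echo "$zero_or_more $one_or_more $between"
```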

Extended regex

Extended regular expressions (ERE) can be used with *nix utilities like grep by including the command-line flag -E. Other Unix utilities like awk use ERE by default, and so does egrep, although recent GNU grep releases mark egrep as obsolescent in favor of grep -E.

— Alternation

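Vertical bar | acts like a logical OR operator: it matches the expression before or after the vertical bar. Backtracking engines (Perl, Python and similar) try the alternatives from left to right; with grep the order does not change which lines are printed.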

# two
$ grep -E 'car|plane' data.txt

A car should not fly
A plane cannot sail

# more than two
$ grep -E 'car|plane|boat' data.txt

A car should not fly
A plane cannot sail
A boat can float

— Escape

Backslash \ removes or adds special meaning to the next character. (Handy when we want to look for a character that is actually a metacharacter).

# search for dot (.) and don't treat it as a metachar.
$ grep '\.' data.txt
.[{()\^$|?*+   
codeberg.org
user@domain.com
admin@other.co.uk
234.111.4567
Mr. Watson

To represent non-printable characters, flavors such as PCRE (grep -P in GNU grep) offer the following escapes:

- \t matches a tab.

- \r matches a carriage return.

- \n matches a newline.

Combined with some specific letters after it, the backslash gives us more functionality. These shorthands come from Perl-style flavors: GNU grep accepts \s and \w in both basic and extended mode, while \d needs grep -P.

- \s matches anything considered a white space, like tabs, line breaks, etc.

- \d matches any digit. A handy alternative to [0-9].

- \w matches anything considered a word character (letters, digits and the underscore).

Using the uppercase of s, d, and w makes the expression match the opposite of the lowercase version.

- \S matches anything not considered a white space.

- \D matches anything not considered a digit.

- \W matches anything not considered a word character.
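
Where these shorthands are not available, POSIX character classes inside brackets are a portable alternative. A minimal sketch with grep, on made-up sample lines:

```shell
# [[:digit:]] plays the role of \d.
digits=$(printf '%s\n' 'room 101' 'lobby' | grep '[[:digit:]]')   # room 101

# [[:space:]] plays the role of \s (the third sample line contains a tab).
spaces=$(printf 'a b\nab\na\tb\n' | grep -c '[[:space:]]')        # 2

# [^[:alnum:]_] plays the role of \W: anything that is not a word character.
non_word=$(printf '%s\n' 'under_score' 'hy-phen' | grep '[^[:alnum:]_]')   # hy-phen

echo "$digits | $spaces | $non_word"
```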

— Match zero or one time

Question mark ? matches the preceding element zero or one time.

- ab?c matches either ac or abc.

- (ab)? matches the empty string or ab.
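
The same two patterns tried with grep -E, as a small sketch on made-up candidates. The anchors are added so that the whole line has to fit the pattern:

```shell
# b? allows zero or one b between a and c, so abbc is left out.
optional_char=$(printf '%s\n' 'ac' 'abc' 'abbc' | grep -E '^ab?c$')    # ac, abc

# (ab)? makes the whole group optional in front of c.
optional_group=$(printf '%s\n' 'c' 'abc' 'ac' | grep -E '^(ab)?c$')    # c, abc

printf '%s\n' "$optional_char" "$optional_group"
```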

— Match once or more

Plus sign + matches one or more occurrences of the preceding element.

- ab+c matches abc, abbbc but not ac.

- [abc]+ matches a, b, c, ca, cba, abccb, etc.
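
A short sketch of both cases with grep -E. The -o option (print only the matched text) is assumed to be available, as it is in GNU and BSD grep; the sample strings are made up.

```shell
# At least one b is required between a and c.
one_or_more=$(printf '%s\n' 'ac' 'abc' 'abbbc' | grep -E 'ab+c')   # abc, abbbc

# With a character set, + matches a run of any mix of the listed characters.
run=$(printf '%s\n' 'xxcbaxx' | grep -Eo '[abc]+')                 # cba

printf '%s\n' "$one_or_more" "$run"
```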

— Match n specific times

Curly brackets {} match the preceding element the number of times defined inside them. The count can be fixed or cover a range between n and m.

# {n} matches the preceding item exactly n times.
$ egrep '[0-9]{3}' data.txt
0123456789
432-111-7654  
234.111.4567

# {n,} matches at least n times
$ egrep '[2-6]{4,}' data.txt
0123456789

# {n,m} matches at least n, but no more than m times (no space after the comma)
$ egrep 'n{1,2}' data.txt
abcdefghijklmnopqurtuvwxyz
user@domain.com
admin@other.co.uk
Mr. Watson
Mr John
A car should not fly
A plane cannot sail
A boat can float

# {,m} matches at most m times (a GNU extension; the portable form is {0,m})
# zero matches are allowed, so on its own it matches every line and prints the whole document.
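
Note that an unanchored interval also finds its match inside longer runs. A small sketch on made-up numbers shows how the anchors change the result:

```shell
numbers=$(printf '%s\n' '12' '123' '1234' '12345')

# Unanchored: three digits anywhere, so longer runs match too.
unanchored=$(printf '%s\n' "$numbers" | grep -Ec '[0-9]{3}')    # 3

# Anchored: the whole line must be exactly three digits.
exact=$(printf '%s\n' "$numbers" | grep -E '^[0-9]{3}$')        # 123

# Anchored range: lines made of two to four digits.
ranged=$(printf '%s\n' "$numbers" | grep -Ec '^[0-9]{2,4}$')    # 3

echo "$unanchored $exact $ranged"
```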

We can take advantage of this to avoid long regular expression syntax. As an example, if we wanted to find every phone number inside a clients_data file we could use patterns like the following (\d and \s require a Perl-compatible flavor, e.g. grep -P):

#American format 234.555.6789
\d{3}\.\d{3}\.\d{4}

#British format 7222 555 555
\d{4}\s?\d{3}\s?\d{3}
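
A sketch of the same idea for grep -E, with [0-9] and [[:space:]] standing in for the Perl-style shorthands. The client names and numbers are invented, and the sample lines stand in for the clients_data file:

```shell
clients=$(printf '%s\n' \
    'Alice 234.555.6789' \
    'Bob 7222 555 555' \
    'Carol 7222555555' \
    'Dave 23-55-67')

# American format: 3 digits, dot, 3 digits, dot, 4 digits.
american=$(printf '%s\n' "$clients" |
    grep -E '[0-9]{3}\.[0-9]{3}\.[0-9]{4}')

# British format: 4 + 3 + 3 digits with an optional space between groups.
british=$(printf '%s\n' "$clients" |
    grep -E '[0-9]{4}[[:space:]]?[0-9]{3}[[:space:]]?[0-9]{3}')

printf '%s\n' "$american" "$british"
```

The dots are escaped so that only a literal dot counts as a separator.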

— Grouping

Parentheses () group several characters of a regular expression together, so that an operator such as ? applies to the whole group.

# this will match whether Camelot is present or not
$ grep -E 'Sir (Camelot )?James' data.txt
Sir Camelot James
Sir James
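
Groups also combine with alternation and with the quantifiers seen earlier. A minimal sketch on made-up lines:

```shell
titles=$(printf '%s\n' 'Sir James' 'Mr. Watson' 'Mr John' 'Dr Who')

# The group limits the alternation to the title; \.? allows an optional dot.
titled=$(printf '%s\n' "$titles" | grep -E '^(Sir|Mr\.?) [A-Z]')

# A quantifier after a group repeats the whole group.
repeated=$(printf '%s\n' 'haha' 'ha' 'hahaha' | grep -Ec '^(ha){2,}$')   # 2

printf '%s\n' "$titled" "$repeated"
```

Without the parentheses, 'Sir|Mr\.? [A-Z]' would read as "Sir" or "Mr followed by a capital letter", which is a different pattern.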

Summing up

By default, grep prints the whole line that contains the regex match. There are ways to print only the isolated match, either with grep's own -o option (available in GNU and BSD grep) or by combining it with other command-line tools. The same way we used regex with grep, we can use it with tools like ed, awk or sed.
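
A short sketch of the three tools working on the same made-up line. The e-mail pattern is deliberately loose and only meant for illustration:

```shell
line='Contact: user@domain.com (primary)'

# grep -o prints only the matched part instead of the whole line.
only_match=$(printf '%s\n' "$line" | grep -Eo '[[:alnum:]._]+@[[:alnum:].]+')

# sed uses a regex to rewrite the line (here: mask the user name).
masked=$(printf '%s\n' "$line" | sed -E 's/[[:alnum:]._]+@/***@/')

# awk runs an action on lines matching a regex (here: print the 2nd field).
field=$(printf '%s\n' "$line" | awk '/@/ { print $2 }')

printf '%s\n' "$only_match" "$masked" "$field"
```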

I encourage you to start using regular expressions in your code, no matter the programming language (if you haven't already). Learning how to mix the building blocks together into effective patterns takes time and practice. Check out regex101 for a pleasant experience learning regular expressions in an interactive playground (:

For an advanced example, think about a shell script that looks through the files in a remote directory and finds every plain text file with the name account_data. After verifying that each file contains login credentials, it encrypts the file. All done by matching regular expressions.
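
A rough sketch of the searching and verifying part could look like this. Everything in it is an assumption made for illustration: the helper name find_credential_files, the idea that the remote directory is already mounted locally, the credential format (user or password followed by : or =) and the demo files. The encryption step is left out.

```shell
# Hypothetical helper: list the files named account_data* under a directory
# that contain at least one line that looks like a credential.
find_credential_files() {
    find "$1" -type f -name 'account_data*' -exec \
        grep -Eil '^(user(name)?|pass(word)?)[[:space:]]*[:=]' {} +
}

# Demo on a throwaway directory standing in for the mounted remote one.
demo=$(mktemp -d)
printf 'username = alice\npassword = s3cret\n' > "$demo/account_data.txt"
printf 'nothing sensitive here\n' > "$demo/account_data_notes.txt"
printf 'password: hunter2\n' > "$demo/readme.txt"

found=$(find_credential_files "$demo")
echo "$found"

# Each path in $found could now be handed to an encryption tool of our choice.
rm -rf "$demo"
```

Only account_data.txt is reported: readme.txt has a credential but the wrong name, and account_data_notes.txt has the right name but no credential.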

— It's a bit funny to store confidential data in plain text files, but the guys at Facebook, Instagram and WhatsApp have been saving hashes, credentials and backups from people as plain text files on their servers for a while. ¯\_(ツ)_/¯