Shell Scripting
March 23, 2020

Shell scripts | Awk & Sed

Shell scripting covers almost every essential need to create automated command-line programs. But what about going beyond the standards and extending our arsenal with some external tools? Let's dive a bit inside awk and sed.

Write programs to handle text streams, because that is a universal interface. — Doug McIlroy.

  • Awk is a programming language that lets us manipulate structured data.
  • Sed is a stream editor to manipulate, filter, and transform text.

Both of them are stream-oriented: they read input one line at a time and send the result to standard output, so the input file itself is not modified unless we explicitly ask for it.

Although their syntax may look cryptic, awk and sed can solve a lot of complex tasks in a single line of code. Combined with regular expressions, they are a Swiss Army knife for anyone working with text files. Since we're working inside a *nix system, this is perfect for us.

One of the most useful cases for awk and sed is parsing files and generating reports. It's a bit complicated to explain both tools without seeing them in action. To work through this post without searching too much for a file to parse, create a file named pieces-list and populate it with the following text:

Name= "Capacitor" ID= 3456 quant.= 204 Man.= "Bosch"
Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"
Name= "Fan-Frame" ID= 7864 quant.= 131 Man.= "Mitsubishi"
Name= "Bluetooth-Emmiter" ID= 19085, quant.= 184 Man.= "Intel"
Name= "WiFi-Card", ID= 2941, quant.= 115, Man.= "Intel"
Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"

AWK

Awk is a full-fledged programming language and a powerful file parser. It offers a more general computational model for processing a file, allowing us to replace an entire shell script with an awk one-liner.

Awk programs consist of a series of rules. Rules generally consist of a pattern and a set of actions.

When a file is processed, awk reads it line by line, checks whether the line matches one or more of the patterns in the program and executes the actions associated with each matching pattern, taking the selected line as its input.

If you've been reading the blog, you'll notice that we've used awk previously to configure our panel bar.

— The basic command-line syntax to invoke awk is:

$ awk [options] 'pattern {actions}' inputFile

We've seen how to get an output of a file before, using the cat command.

$ cat pieces-list

We've also seen how to split data to print only the parts we want using grep.

$ cat pieces-list | grep Intel

To start working with awk let's use it to print our pieces-list file:

$ awk '{ print }' pieces-list

Both cat and the new awk command should produce the same output.

With awk we can use patterns too:

$ awk '/Intel/ { print }' pieces-list

Patterns are regular expressions written between forward slashes.

This is useful but we still get a complete line containing the pattern we were looking for. One powerful feature of awk is that we can select pieces (named fields) of the line.

Named fields are represented with a dollar sign and the position number ($N).

$ awk '/Intel/ { print $2 }' pieces-list

Sometimes a line has to meet some condition to be useful for us. We can use boolean expressions as patterns too.

In this example, the condition is that the sixth field has to be greater than one hundred:

$ awk '$6 > 100 { print $2 }' pieces-list
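Patterns and conditions can live side by side in one program, and every line is tested against each rule. The sketch below is only an illustration: it uses three lines copied from pieces-list, and the "intel:" and "empty:" labels are invented for the example.

```shell
#!/bin/sh
# Three lines copied from pieces-list so the example is self-contained.
sample='Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"
Name= "Bluetooth-Emmiter" ID= 19085, quant.= 184 Man.= "Intel"
Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"'

# Two rules in one program: each input line is checked against both patterns,
# and only the actions of the matching rules are executed.
report=$(printf '%s\n' "$sample" | awk '
  /Intel/ { print "intel: " $2 }
  $6 == 0 { print "empty: " $2 }
')
echo "$report"
```

A line that matched both rules would trigger both actions, in the order the rules are written.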

By default, fields are separated by spaces or tabs. If we want to use another character as the field separator, we have to say so with the -F option (which sets awk's FS variable). For example, to split on commas:

-F,
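As a quick sketch of a custom separator in action, the record below is an invented comma-separated line (it is not part of pieces-list), and awk is told to split it on commas instead of whitespace.

```shell
#!/bin/sh
# Hypothetical comma-separated record, used only for this illustration.
record='Capacitor,3456,204,Bosch'

# -F, makes the comma the field separator, so $2 is the ID and $4 the manufacturer.
summary=$(printf '%s\n' "$record" | awk -F, '{ print $4 ": " $2 }')
echo "$summary"
```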

Awk allows us to use some internal functions to perform several actions.

  • length() returns the number of characters in the specified field.
$ awk '{ print length($2) }' pieces-list
  • printf formats the output of the specified fields. A width such as %19s right-aligns the value, while %-19s left-aligns it.
$ awk '$6 > 1 { printf "%-19s\n", $2 }' pieces-list

— We can go further with awk and store all our commands inside a file so it's easier to apply the same line of commands for multiple files.

Awk command files can contain two special patterns:

  • BEGIN{} is a pattern whose actions run only once, before any input is read.
  • END{} is a pattern whose actions run only once, after all the input has been processed.

Let's create a script to store our awk commands:

$ touch steps.awk

Now we can try some awk examples on our pieces-list file.

— Format output

Our example pieces-list text is a bit messy. Wouldn't it be great to have each field ordered in nice columns?

First we need to define how many characters wide each column needs to be. This is given by the longest value in each field.

Using the builtin function length($N) we can get those values.
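One possible way to collect those widths in a single pass is sketched here. The variable names name and man are made up for the example, and the temporary directory is only there so the snippet runs on its own. Note that the double quotes around each value are counted too.

```shell
#!/bin/sh
# Recreate pieces-list in a throwaway directory.
workdir=$(mktemp -d)
cat > "$workdir/pieces-list" <<'EOF'
Name= "Capacitor" ID= 3456 quant.= 204 Man.= "Bosch"
Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"
Name= "Fan-Frame" ID= 7864 quant.= 131 Man.= "Mitsubishi"
Name= "Bluetooth-Emmiter" ID= 19085, quant.= 184 Man.= "Intel"
Name= "WiFi-Card", ID= 2941, quant.= 115, Man.= "Intel"
Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"
EOF

# Remember the longest value seen in field 2 (name) and field 8 (manufacturer);
# the END rule runs once after the last line and prints both maxima.
widths=$(awk '
  length($2) > name { name = length($2) }
  length($8) > man  { man  = length($8) }
  END { print "name=" name, "manufacturer=" man }
' "$workdir/pieces-list")
echo "$widths"

rm -r "$workdir"
```

With the sample data this reports 19 characters for the name column and 12 for the manufacturer column.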

Let's define our main columns with the given values in our BEGIN pattern:

BEGIN{ printf "\n%-15s %-22s %-5s %9s\n", "MANUFACTURER ", "| PIECE NAME ","| ID ","|QUANTITY"}

In the main body we need a similar line for each one of the products in the list. This time we have to change our printed format in the fields that need to output a number:

{printf "%-16s %-22s %6d %9d\n", $8, $2, $4, $6}

To execute our stored awk commands, we simply tell awk to read the program from the file with the -f option:

$ awk -f steps.awk pieces-list

Our result should look similar to this:

MANUFACTURER   | PIECE NAME          | ID    | QUANTITY
-------------------------------------------------------

"Bosch"         "Capacitor"            3456         204
"Phillips"      "Battery"              2760           0
"Mitsubishi"    "Fan-Frame"            7864         131
"Intel"         "Bluetooth-Emmiter"   19085         184
"Intel"         "WiFi-Card"            2941         115
"OEM"           "Fan"                  4512          98

Just as we formatted our messy example file, we could format web server IP traffic logs, username and password databases... the possibilities are endless.

— Process command-line arguments

We can take input from the user and pass it as a variable to perform actions with our data.

Let's say we want to ask the user for a product's ID and report the manufacturer's name, the product's name and its available quantity.

Create a search.awk script with the following instructions:

BEGIN{ print "Search results:\n" }

{if ( id == $4 ) print "Item ID " $4 "\n\t— Manufacturer: " $8 "\n\t— Piece Name: " $2, "\n\t—Stock Amount: " $6}

END{ print "\n---------------------------------\n"}

In this case we use a variable named id to compare against our ID field. To make it work, we run the script assigning a value to the variable with the -v option:

$ awk -v id=3456 -f search.awk pieces-list

Search results:

Item ID 3456
   — Manufacturer: "Bosch"
   — Piece Name: "Capacitor" 
   — Stock Amount: 204

---------------------------------
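The same -v mechanism works for any value, not only IDs. As a rough sketch, the snippet below passes a stock threshold through a variable called min (a name chosen just for this example) and lists the IDs of the items at or above it.

```shell
#!/bin/sh
# Recreate pieces-list in a throwaway directory.
workdir=$(mktemp -d)
cat > "$workdir/pieces-list" <<'EOF'
Name= "Capacitor" ID= 3456 quant.= 204 Man.= "Bosch"
Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"
Name= "Fan-Frame" ID= 7864 quant.= 131 Man.= "Mitsubishi"
Name= "Bluetooth-Emmiter" ID= 19085, quant.= 184 Man.= "Intel"
Name= "WiFi-Card", ID= 2941, quant.= 115, Man.= "Intel"
Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"
EOF

# -v assigns min before the program starts reading input.
# Adding 0 forces a numeric value even when a field carries a trailing comma.
well_stocked=$(awk -v min=150 '$6 + 0 >= min { print $4 + 0 }' "$workdir/pieces-list")
echo "$well_stocked"

rm -r "$workdir"
```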

— Arithmetic and string operators

As in almost every programming language, we can perform arithmetic operations inside awk, passing fields as the values to operate on:

$ awk '{result += $6} END{printf "total amount of items: %d\n", result}' pieces-list 

total amount of items: 732
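Building on that running total, a couple of extra figures can be derived in the END rule. This is only a sketch: the empty counter and the wording of the message are invented for the example, and NR is awk's built-in count of lines read.

```shell
#!/bin/sh
# Recreate pieces-list in a throwaway directory.
workdir=$(mktemp -d)
cat > "$workdir/pieces-list" <<'EOF'
Name= "Capacitor" ID= 3456 quant.= 204 Man.= "Bosch"
Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"
Name= "Fan-Frame" ID= 7864 quant.= 131 Man.= "Mitsubishi"
Name= "Bluetooth-Emmiter" ID= 19085, quant.= 184 Man.= "Intel"
Name= "WiFi-Card", ID= 2941, quant.= 115, Man.= "Intel"
Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"
EOF

# Accumulate the quantity on every line and count the items with no stock;
# the END rule divides the total by the number of lines read (NR).
stats=$(awk '
  { total += $6; if ($6 + 0 == 0) empty++ }
  END { printf "average stock: %.1f, out of stock: %d\n", total / NR, empty }
' "$workdir/pieces-list")
echo "$stats"

rm -r "$workdir"
```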

SED

Sed automates actions that seem a natural extension of interactive text editing. Most of these actions like replacing text, deleting lines, inserting new text, removing words... could be done manually from a text editor.

Automating all the editing instructions in one place and executing them in one pass can turn hours of manual work into minutes of automated computing.

The command-line syntax to invoke sed is:

$ sed [options] instructions inputFile

If we pass an empty instruction and no options, sed simply prints the file to the terminal:

$ sed '' pieces-list

As you can see, the structure of calling sed is similar to calling awk.

These are a few instructions we can combine:

  • / acts as a separator for numbers or patterns.
/patternA/patternB/

  • s substitutes the text matching a pattern with a replacement. By default only the first match on each line is replaced.
s/orig/new/ 

We can restrict the substitution to a single line by putting its line number before the s character.

2s/orig/new/

This will replace orig with new in the second line of the file.

  • g is a flag for s that replaces every match on the line, not just the first.
s/orig/new/g

  • w writes the contents of the pattern space into a file.
w /path/to/output_file

  • d deletes lines. Nd deletes line number N. Adding an exclamation point inverts the selection, so N!d deletes every line except line N.
$ sed '1d' inputfile
$ sed '1!d' inputfile

We can run multiple commands with sed by separating them with semicolons inside the single quotes on the command line, or by writing all the commands into a file (conventionally with the extension .sed) and loading it with the -f option.
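To see how these instructions behave together, here is a small sketch on an invented three-line input (not pieces-list). The variable names are hypothetical and only hold the results for display.

```shell
#!/bin/sh
# Invented three-line input, just to make the effect of each command visible.
input='alpha beta
beta beta
gamma beta'

# 1d drops the first line; the s command then runs on the remaining lines.
# Without the g flag only the first "beta" on each line is replaced.
first_only=$(printf '%s\n' "$input" | sed '1d; s/beta/B/')

# With g, every "beta" on the line is replaced.
everywhere=$(printf '%s\n' "$input" | sed '1d; s/beta/B/g')

# 1!d is the inverse of 1d: it deletes everything except line 1.
only_first_line=$(printf '%s\n' "$input" | sed '1!d')

printf '%s\n---\n%s\n---\n%s\n' "$first_only" "$everywhere" "$only_first_line"
```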

Let's use some sed power to work inside our pieces-list.

— Find and replace

$ sed 's/quant./Quantity/' pieces-list
$ sed 's/Man./Manufacturer/' pieces-list

Each command replaces only the first match of the pattern on every line of pieces-list. Note that an unescaped . matches any character; writing quant\. matches a literal dot only.

We can also delete a pattern by leaving the replacement empty:

$ sed 's/"//g' pieces-list

Now let's combine all three instructions in a single call, separated by semicolons:

$ sed 's/quant./Quantity/; s/Man./Manufacturer/; s/"//g' pieces-list

so the output looks like this:

Name= Capacitor ID= 3456 Quantity= 204 Manufacturer= Bosch
Name= Battery ID= 2760 Quantity= 0 Manufacturer= Phillips
Name= Fan-Frame ID= 7864 Quantity= 131 Manufacturer= Mitsubishi
Name= Bluetooth-Emmiter ID= 19085, Quantity= 184 Manufacturer= Intel
Name= WiFi-Card, ID= 2941, Quantity= 115, Manufacturer= Intel
Name= Fan ID= 4512 Quantity= 98 Manufacturer= OEM

— Extract and edit

Another powerful feature of sed is the ability to extract information from a file, edit it in memory and write the edited data to another file, all without using pipelines.

Let's use a file for storing the instructions for sed.

$ vim extract.sed

We want to process the whole file without knowing how many lines it has, so instead of line numbers we use a range address that goes from a line matching the first pattern through the next line matching the second:

/Name=/,/Man.=/

so the commands only act on the lines inside that range.

After the address we can open a block with curly brackets, much like a function body, to hold the commands to execute.

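/Name=/,/Man.=/ {
# remove every double quote
s/"//g
# drop everything up to the manufacturer label and the spaces after it
s/.*Man.= *//
# write what is left of the line to a file
w manufacturer_list
}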

Now we can run sed with this file. It prints the result to the terminal and also writes it to manufacturer_list:

$ sed -f extract.sed pieces-list

Bosch
Phillips
Mitsubishi
Intel
Intel
OEM

Of course all of this can be scripted through pipelines but using just sed we've achieved the same in fewer lines and less time.
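To confirm that the w command really produced a file, the following sketch runs a variant of extract.sed (one that also strips the space after Man.=) in a temporary directory and reads the result back. The directory and the saved variable exist only for this demonstration.

```shell
#!/bin/sh
# Recreate pieces-list and the sed script in a throwaway directory.
workdir=$(mktemp -d)
cat > "$workdir/pieces-list" <<'EOF'
Name= "Capacitor" ID= 3456 quant.= 204 Man.= "Bosch"
Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"
Name= "Fan-Frame" ID= 7864 quant.= 131 Man.= "Mitsubishi"
Name= "Bluetooth-Emmiter" ID= 19085, quant.= 184 Man.= "Intel"
Name= "WiFi-Card", ID= 2941, quant.= 115, Man.= "Intel"
Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"
EOF
cat > "$workdir/extract.sed" <<'EOF'
/Name=/,/Man.=/ {
s/"//g
s/.*Man.= *//
w manufacturer_list
}
EOF

# The w file name is relative, so run sed from inside the work directory.
# Standard output is discarded: we only care about the file written by w.
( cd "$workdir" && sed -f extract.sed pieces-list > /dev/null )

saved=$(cat "$workdir/manufacturer_list")
echo "$saved"

rm -r "$workdir"
```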

Combining Awk and Sed

We've seen that we can clean our text data with sed and format it with awk. Let's go a step further and combine both to get a better report.

$ sed 's/"//g' pieces-list | awk -f steps.awk

This way we remove the double quotes from all names and get a clean result.

We can also sort the results using any field as the key. In this case we are going to sort the items by manufacturer name:

$ sed 's/"//g' pieces-list | awk '{ print $8 " " $0 }' | sort | awk -f steps.awk

We know what the sed line does. Let's analyze the awk one:

  1. After the first pipe, we call awk to print the eighth value of the list with print $8.
  2. Next, we add a blank space with " ". This acts as our separator. Since the file is using spaces, we keep the method.
  3. Lastly we print the whole corresponding line so the next program in the pipe can read the information correctly.

The result is going to look weird. The formatted list will probably look like this:

MANUFACTURER   | PIECE NAME          | ID    | QUANTITY
-------------------------------------------------------

Man.=           Name=                     0           0
Man.=           Name=                     0           0
Man.=           Name=                     0           0
Man.=           Name=                     0           0
Man.=           Name=                     0           0
Man.=           Name=                     0           0

---- End of report. Time: 06:44 | Date: 2020-04-01 ----

Since we prepended the eighth field as a sort key, every original field has moved one position to the right: the manufacturer is now $9, the name $3, the ID $5 and the quantity $7. We need to update the field numbers inside our steps.awk file accordingly.
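The shift is easy to check on a single line. In this sketch (one line copied from pieces-list, with before and after as made-up variable names) the manufacturer sits in field 8 at first and in field 9 once the key has been prepended.

```shell
#!/bin/sh
# One line copied from pieces-list.
line='Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"'

# In the original line the manufacturer is field 8.
before=$(printf '%s\n' "$line" | awk '{ print $8 }')

# After prepending $8 as a sort key, field 8 holds the label and field 9 the value.
after=$(printf '%s\n' "$line" | awk '{ print $8 " " $0 }' | awk '{ print $8, $9 }')

echo "before: $before"
echo "after:  $after"
```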

Having to track all these steps individually and in different files is not practical, which is why writing shell scripts for multiple tasks is so handy (yes, we can call sed and awk from within a shell script!).

— Create a script named format-report.sh and open it.

  • Remember this is a shell script, so declare the interpreter on the first line of the file.
#!/bin/sh

  • First we need to order our list based on the manufacturer's name.
awk '{ print $8 " " $0 }' "$@" | sort |

  • We have to add a header for our report using the BEGIN pattern from awk.
awk 'BEGIN{ printf "\n%-15s %-22s %-5s %9s\n", "MANUFACTURER ", "| PIECE NAME ","| ID ","| QUANTITY"
print "-------------------------------------------------------\n"}

  • Next we execute the main loop of awk to print the formatted list.
{printf "%-18s %-23s %6d %9d\n", $9, $3, $5, $7}

  • And we can add some condition to check if an item is out of stock.
{ if ($7 < 1) printf "\nWarning! Item %d is out of stock.(%s from %s)\n", $5, $3, $9}

  • Once the main loop is done we can print a footer for our report using the END pattern, indicating time and date.
END { "date +%Y-%m-%d" | getline d; "date +%H:%M" | getline t; print "\n---- End of report. Time: " t " | Date: " d " ----" }' |

  • Lastly we call sed to get rid of the double quotes that the names inside the list have.
sed 's/"//g'
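Put together, the pieces above could look roughly like the sketch below. It writes the assembled script and a copy of pieces-list into a temporary directory (an assumption made only so the example runs on its own), makes the script executable and calls it. Spacing inside the awk program has been tidied, so treat it as one possible arrangement rather than the exact file.

```shell
#!/bin/sh
# Recreate pieces-list in a throwaway directory.
workdir=$(mktemp -d)
cat > "$workdir/pieces-list" <<'EOF'
Name= "Capacitor" ID= 3456 quant.= 204 Man.= "Bosch"
Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"
Name= "Fan-Frame" ID= 7864 quant.= 131 Man.= "Mitsubishi"
Name= "Bluetooth-Emmiter" ID= 19085, quant.= 184 Man.= "Intel"
Name= "WiFi-Card", ID= 2941, quant.= 115, Man.= "Intel"
Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"
EOF

# Write the assembled script; the quoted 'SCRIPT' delimiter keeps $8, $0, etc. literal.
cat > "$workdir/format-report.sh" <<'SCRIPT'
#!/bin/sh
# Prefix each line with the manufacturer, sort on it, format, then strip quotes.
awk '{ print $8 " " $0 }' "$@" | sort |
awk 'BEGIN { printf "\n%-15s %-22s %-5s %9s\n", "MANUFACTURER ", "| PIECE NAME ", "| ID ", "| QUANTITY"
             print "-------------------------------------------------------\n" }
     { printf "%-18s %-23s %6d %9d\n", $9, $3, $5, $7 }
     { if ($7 < 1) printf "\nWarning! Item %d is out of stock.(%s from %s)\n", $5, $3, $9 }
     END { "date +%Y-%m-%d" | getline d; "date +%H:%M" | getline t
           print "\n---- End of report. Time: " t " | Date: " d " ----" }' |
sed 's/"//g'
SCRIPT

# Make the script executable and run it on the sample file.
chmod +x "$workdir/format-report.sh"
report=$("$workdir/format-report.sh" "$workdir/pieces-list")
echo "$report"

rm -r "$workdir"
```

Keeping the double quotes until the final sed step is why the column widths in the main rule are two characters wider than the ones in the header.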

In order to run the script, save it, change its permissions to make it executable, and pass the pieces-list as the first argument:

$ ./format-report.sh pieces-list

We should see something similar to this:

MANUFACTURER   | PIECE NAME          | ID |    QUANTITY
-------------------------------------------------------

Bosch           Capacitor              3456         204
Intel           Bluetooth-Emmiter     19085         184
Intel           WiFi-Card              2941         115
Mitsubishi      Fan-Frame              7864         131
OEM             Fan                    4512          98
Phillips        Battery                2760           0

Warning! Item 2760 is out of stock.(Battery from Phillips)

---- End of report. Time: 07:40 | Date: 2020-04-01 ----

Summing up

A fundamental part of the power of *nix systems is pipes: the ability to combine programs as building blocks in many ways to create automated workflows.

We've seen how to manage text data without touching a manual text editor in several ways, so now we can introduce these techniques using awk and sed to our pipe workflow with a new level of flexibility.