Shell scripts | Awk & Sed
Shell scripting covers almost every essential need to create automated command-line programs. But what about going beyond the standards and extending our arsenal with some external tools? Let's dive a bit inside awk and sed.
Write programs to handle text streams, because that is a universal interface. — Doug McIlroy.
- Awk is a programming language that lets us manipulate structured data.
- Sed is a stream editor to manipulate, filter, and transform text.
Both are stream-oriented: they read their input one line at a time and write the result to standard output, so the input file itself is left unchanged unless we explicitly ask for it to be modified.
Although their syntax may look cryptic, awk and sed can solve a lot of complex tasks in a single line of code. Combined with regular expressions, they give us a Swiss Army knife for working with text files. Since we're working inside a *nix system this is perfect for us.
One of the most useful applications of awk and sed is parsing files and generating reports. Both tools are hard to explain without seeing them in action, so to follow this post without hunting for a file to parse, create a file named pieces-list and populate it with the following text:
Name= "Capacitor" ID= 3456 quant.= 204 Man.= "Bosch"
Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"
Name= "Fan-Frame" ID= 7864 quant.= 131 Man.= "Mitsubishi"
Name= "Bluetooth-Emmiter" ID= 19085, quant.= 184 Man.= "Intel"
Name= "WiFi-Card", ID= 2941, quant.= 115, Man.= "Intel"
Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"
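If you'd rather not open an editor, a here-document is one way to create the file. This is only a sketch: it assumes we are happy to write pieces-list into the current directory, and the quoted 'EOF' delimiter is there so the shell does not try to expand anything inside the text.

```shell
# Write the sample data to pieces-list in the current directory.
# Quoting 'EOF' stops the shell from expanding anything in the text.
cat > pieces-list <<'EOF'
Name= "Capacitor" ID= 3456 quant.= 204 Man.= "Bosch"
Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"
Name= "Fan-Frame" ID= 7864 quant.= 131 Man.= "Mitsubishi"
Name= "Bluetooth-Emmiter" ID= 19085, quant.= 184 Man.= "Intel"
Name= "WiFi-Card", ID= 2941, quant.= 115, Man.= "Intel"
Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"
EOF

# Count the records to confirm the file was written (one record per line).
wc -l < pieces-list
```

Every record sits on its own line with eight space-separated fields, which is the layout the rest of the examples rely on.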
AWK
Awk is a full-fledged programming language and a powerful file parser. It offers a more general computational model for processing a file, often allowing us to replace an entire shell script with an awk one-liner.
Awk programs consist of a series of rules. Rules generally consist of a pattern and a set of actions.
When a file is processed, awk reads it line by line, checks whether each line matches one or more of the patterns in the program, and executes the actions associated with every matching pattern, taking the selected line as its input.
If you've been reading the blog, you'll notice that we've used awk previously to configure our panel bar.
— The basic command-line syntax to invoke awk
is:
$ awk [options] 'pattern {actions}' inputFile
We've already seen how to print the contents of a file using the cat command.
$ cat pieces-list
We've also seen how to filter that output so only the lines we want are printed, using grep.
$ cat pieces-list | grep Intel
To start working with awk, let's use it to print our pieces-list file:
$ awk '{ print }' pieces-list
We should get the same output from both cat and the new awk method.
With awk we can use patterns too:
$ awk '/Intel/ { print }' pieces-list
Patterns are declared between forward slashes.
This is useful, but we still get the complete line containing the pattern we were looking for. One powerful feature of awk is that we can select pieces of the line, called fields.
Fields are referenced with a dollar sign followed by their position number ($N); $0 stands for the whole line.
$ awk '/Intel/ { print $2 }' pieces-list
Sometimes a line is only useful to us if it meets some condition. We can use boolean expressions as patterns too.
In this example, the condition is that the sixth field has to be greater than one hundred:
$ awk '$6 > 100 { print $2 }' pieces-list
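To see a condition acting as a pattern without depending on the real file, here is a small sketch. The three records are made up for the illustration and only copy the layout of pieces-list, and the matches variable is just a convenient place to hold the output.

```shell
# Three made-up records in the same layout as pieces-list.
# Only the ones whose sixth field is greater than 100 are printed.
matches=$(printf '%s\n' \
  'Name= "A" ID= 1 quant.= 250 Man.= "X"' \
  'Name= "B" ID= 2 quant.= 40 Man.= "Y"' \
  'Name= "C" ID= 3 quant.= 101 Man.= "Z"' |
  awk '$6 > 100 { print $2 }')

echo "$matches"
```

The record with a quantity of 40 is skipped, so only the names of the first and third records come out.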
By default, fields are separated by runs of spaces or tabs. If we want to use another character or pattern as the field separator, we have to say so with the -F option (or by setting the FS variable). For example, to split on commas:
-F,
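A quick sketch of the separator in action, using a made-up comma-separated line instead of our pieces-list. Both spellings below should behave the same: -F is the command-line option and FS is the variable it sets.

```shell
# A made-up comma-separated record, only for this illustration.
line='Capacitor,3456,204,Bosch'

# -F, sets the field separator from the command line.
with_option=$(printf '%s\n' "$line" | awk -F, '{ print $4 }')

# -v FS=, assigns the same separator through the FS variable.
with_variable=$(printf '%s\n' "$line" | awk -v FS=, '{ print $4 }')

echo "$with_option $with_variable"
```

In both cases the fourth field is the manufacturer, because the commas are now what splits the line.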
Awk provides built-in functions to perform several common actions.
length() returns the number of characters in the specified field.
$ awk '{ print length($2) }' pieces-list
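printf formats the output of the specified fields. We can align items to the left or to the right using %-Ns and %Ns respectively, where N is the width of the column. Unlike print, printf does not add a newline, so we include \n ourselves.
$ awk '$6 > 1 { printf "%-19s\n", $2 }' pieces-list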
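The padding is hard to see on a terminal, so this sketch wraps each column in square brackets. The two input lines and the column widths (10 and 5) are arbitrary choices for the example.

```shell
# %-10s pads on the right (left-aligned); %5d pads on the left (right-aligned).
# The brackets are only there to make the padding visible.
aligned=$(printf 'Fan 98\nCapacitor 204\n' |
  awk '{ printf "[%-10s][%5d]\n", $1, $2 }')

echo "$aligned"
```

The names line up on their left edge and the numbers on their right edge, which is exactly what we want for a report.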
— We can go further with awk
and store all our commands inside a file so it's easier to apply the same line of commands for multiple files.
Awk command files can contain two special patterns:
- BEGIN{} is executed only once, before any input is read.
- END{} is executed only once, after all the input has been processed.
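A minimal sketch of how the three kinds of rules take turns, fed with three throwaway lines. The count variable is our own invention for the example; awk creates it the first time it is used.

```shell
report=$(printf 'a\nb\nc\n' | awk '
BEGIN { print "start" }          # runs once, before any input is read
      { count++ }                # no pattern: runs for every input line
END   { print "lines: " count }  # runs once, after the last line
')

echo "$report"
```

BEGIN is a good place for headers and END for totals, which is how we are going to use them in the next examples.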
Let's create a script to store our awk commands:
$ touch steps.awk
Now we can run some awk examples against our pieces-list file.
— Format output
Our example pieces-list text is a bit messy. Wouldn't it be great to have each field arranged in neat columns?
First we need to decide how wide each column has to be. The width is given by the longest value in each field.
Using the built-in function length($N) we can get those values.
Let's define our column headers with those widths in our BEGIN pattern:
BEGIN{ printf "\n%-15s %-22s %-5s %9s\n", "MANUFACTURER ", "| PIECE NAME ","| ID ","|QUANTITY"}
In the main body we need a similar line for each one of the products in the list. This time we have to change our printed format in the fields that need to output a number:
{printf "%-16s %-22s %6d %9d\n", $8, $2, $4, $6}
To execute our stored awk commands, we simply tell awk to read them from the file with the -f option:
$ awk -f steps.awk pieces-list
Our result should look similar to this:
MANUFACTURER    | PIECE NAME            | ID  | QUANTITY
-------------------------------------------------------
"Bosch"          "Capacitor"              3456       204
"Phillips"       "Battery"                2760         0
"Mitsubishi"     "Fan-Frame"              7864       131
"Intel"          "Bluetooth-Emmiter"     19085       184
"Intel"          "WiFi-Card"              2941       115
"OEM"            "Fan"                    4512        98
Just as we did with our messy example file, we could format a web server's IP traffic log, username and password databases... the possibilities are endless.
— Process command-line arguments
We can take input from the user and pass it as a variable to perform actions with our data.
Let's say we want to ask the user for a product's ID and report the manufacturer's name, the product's name and its available quantity.
In a search.awk script we can write the following instructions:
BEGIN{ print "Search results:\n" }
{
    if ( id == $4 )
        print "Item ID " $4 "\n\t— Manufacturer: " $8 "\n\t— Piece Name: " $2 "\n\t— Stock Amount: " $6
}
END{ print "\n---------------------------------\n"}
In this case we have created a variable named id to compare against our ID field. To make it work, we run the script and assign a value to the variable with the -v option:
$ awk -v id=3456 -f search.awk pieces-list
Search results:

Item ID 3456
    — Manufacturer: "Bosch"
    — Piece Name: "Capacitor"
    — Stock Amount: 204

---------------------------------
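The same idea fits in a one-liner if we use the comparison as the pattern instead of an if inside the action. This is a sketch with a single made-up record; the found variable is only there to hold the result.

```shell
# id is set on the command line with -v; the rule fires only when
# the fourth field equals it.
found=$(printf '%s\n' 'Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"' |
  awk -v id=4512 '$4 == id { print $2 }')

echo "$found"
```

Since -v assigns the variable before any input is read, it would also be visible inside a BEGIN pattern.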
— Arithmetic and string operators
As in almost every programming language, we can perform arithmetic operations inside awk, using fields as operands:
$ awk '{result += $6} END{printf "total amount of items: %d\n", result}' pieces-list
total amount of items: 732
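Going one small step beyond the sum, here is a sketch that computes an average. It assumes the built-in NR variable (the number of records read so far), which we have not used before, and three made-up lines whose sixth field holds the quantity.

```shell
# Accumulate field 6 for every line, then divide by the number of
# records (NR) once the input is exhausted.
average=$(printf 'x x x x x 10\nx x x x x 20\nx x x x x 30\n' |
  awk '{ total += $6 } END { if (NR > 0) printf "%.1f\n", total / NR }')

echo "$average"
```

The NR > 0 check is only there to avoid a division by zero when the input is empty.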
SED
Sed automates actions that feel like a natural extension of interactive text editing. Most of these actions (replacing text, deleting lines, inserting new text, removing words...) could be done manually in a text editor.
Gathering all the editing instructions in one place and executing them in a single pass can turn hours of manual work into minutes of automated computing.
The command-line syntax to invoke sed is:
$ sed [options] instructions inputFile
If we run sed with an empty instruction, it simply prints our file to the terminal:
$ sed '' pieces-list
As you can see, calling sed is structurally similar to calling awk.
These are a few of the instructions we can combine:
/ acts as the delimiter around patterns and between the parts of an instruction.
/patternA/patternB/
s substitutes the first occurrence of a pattern on each line with a replacement.
s/orig/new/
We can restrict the substitution to a single line by putting that line's number before the s character.
2s/orig/new/
This will replace orig with new only on the second line of the file.
g (global) makes the substitution apply to every match on a line, not just the first one.
s/orig/new/g
w
writes the contents of the pattern space into a file.
w /path/to/output_file
d deletes a line: Nd deletes line number N. Adding an exclamation point negates the address, so N!d deletes every line except line N.
$ sed '1d' inputfile
$ sed '1!d' inputfile
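The instructions above are easier to compare side by side on the same input. The three lines below are made up for the sketch, and each result is kept in its own variable so we can print them one after another.

```shell
# Three throwaway lines to edit.
lines='a-b-c
x-y-z
last'

first_match=$(printf '%s\n' "$lines" | sed 's/-/+/')     # first match on each line
all_matches=$(printf '%s\n' "$lines" | sed 's/-/+/g')    # every match on each line
second_line=$(printf '%s\n' "$lines" | sed '2s/-/+/g')   # substitution limited to line 2
without_first=$(printf '%s\n' "$lines" | sed '1d')       # delete line 1
only_first=$(printf '%s\n' "$lines" | sed '1!d')         # delete everything except line 1

printf '%s\n---\n' "$first_match" "$all_matches" "$second_line" "$without_first" "$only_first"
```

None of these calls touches the original text: sed only changes what it writes to standard output.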
Running multiple commands with sed can be achieved by separating them with semicolons (;) inside the single quotes on the command line, or by writing all the commands into a file (conventionally with the extension .sed) and passing it to sed with the -f option.
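As a sketch of the command-line variants, the two calls below should produce the same result. The second one uses the -e option, which we have not mentioned so far: it lets us pass each instruction as a separate argument. The input line is made up.

```shell
# Both instructions inside one pair of quotes, separated by a semicolon.
with_semicolon=$(printf '%s\n' 'a "b" c' | sed 's/"//g; s/a/A/')

# The same two instructions, each passed with its own -e option.
with_e=$(printf '%s\n' 'a "b" c' | sed -e 's/"//g' -e 's/a/A/')

echo "$with_semicolon"
echo "$with_e"
```

Note that the semicolon lives inside the quotes. Outside them the shell would treat it as the end of the command.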
Let's use some sed power to work on our pieces-list.
— Find and replace
$ sed 's/quant./Quantity/' pieces-list
$ sed 's/Man./Manufacturer/' pieces-list
Without the g flag, each of these commands replaces only the first match of the original pattern on every line.
We can also remove a pattern altogether by leaving the replacement empty:
$ sed 's/"//g' pieces-list
Now let's combine all these instructions in a single call:
$ sed 's/quant./Quantity/; s/Man./Manufacturer/; s/"//g' pieces-list
so the output for our pieces-list looks like this:
Name= Capacitor ID= 3456 Quantity= 204 Manufacturer= Bosch
Name= Battery ID= 2760 Quantity= 0 Manufacturer= Phillips
Name= Fan-Frame ID= 7864 Quantity= 131 Manufacturer= Mitsubishi
Name= Bluetooth-Emmiter ID= 19085, Quantity= 184 Manufacturer= Intel
Name= WiFi-Card, ID= 2941, Quantity= 115, Manufacturer= Intel
Name= Fan ID= 4512 Quantity= 98 Manufacturer= OEM
— Extract and edit
Another powerful thing we can do with sed is extract information from a file, edit that information in memory, and write the edited data to another file, all without using pipelines.
Let's use a file to store the instructions for sed.
$ vim extract.sed
We want to inspect the whole file without knowing how many lines it has. Instead of line numbers, we give sed a range that runs from a line matching a first pattern through to a line matching a second pattern:
/Name=/,/Man.=/
so we work with the text contained between the start and the end pattern.
For that range we can open a block in curly braces, much like a function body, to hold the commands we want to execute.
/Name=/,/Man.=/ {
    s/"//g
    s/.*Man.=//g
    w manufacturer_list
}
Now we can run sed with this file to create our output file.
$ sed -f extract.sed pieces-list
Bosch
Phillips
Mitsubishi
Intel
Intel
OEM
Of course, all of this can be scripted through pipelines, but using just sed we've achieved the same in fewer lines and less time.
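Range addresses are easier to picture on a tiny input. This sketch assumes two things we have not used yet: the -n option, which stops sed from printing every line automatically, and the p command, which prints the lines we select. The START and END markers are made up.

```shell
# Print only the lines from the one matching START through the one
# matching END; -n suppresses the default output, p prints the range.
section=$(printf 'intro\nSTART\nkeep me\nEND\noutro\n' |
  sed -n '/START/,/END/p')

echo "$section"
```

Both marker lines are part of the range, so they show up in the result together with everything in between.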
Combining Awk and Sed
We've seen that we can clean up our text data with sed and format it with awk. Let's go a step further and combine both to get a better report.
$ sed 's/"//g' pieces-list | awk -f steps.awk
This way we remove the double quotes from all names and get a clean result.
We can also sort the results using any field as the sort key. In this case we are going to sort the items by manufacturer name:
$ sed 's/"//g' pieces-list | awk '{ print $8 " " $0 }' | sort | awk -f steps.awk
We know what the sed line does. Let's analyze the first awk one:
- After the first pipe, we call awk to print the eighth field of each line with print $8.
- Next, we add a blank space with " ". This acts as our separator. Since the file is using spaces, we keep the method.
- Lastly, we print the whole corresponding line with $0 so the next program in the pipe can read the information correctly.
The result is going to look a bit odd. The formatted list may look like this:
MANUFACTURER    | PIECE NAME            | ID  | QUANTITY
-------------------------------------------------------
Man.=            Name=                       0         0
Man.=            Name=                       0         0
Man.=            Name=                       0         0
Man.=            Name=                       0         0
Man.=            Name=                       0         0
Man.=            Name=                       0         0

---- End of report. Time: 06:44 | Date: 2020-04-01 ----
Since we are prepending the eighth field as a sort key, every original field has moved one position to the right, so we need to increase by one each field number printed in our steps.awk file.
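To check the shift ourselves, we can ask awk how many fields it sees after the key has been prepended. This sketch uses a single made-up record without quotes and the built-in NF variable (the number of fields in the current line).

```shell
# The first awk prepends field 8; the second one reports the new
# field count and where the manufacturer and the name ended up.
shifted=$(printf '%s\n' 'Name= Fan ID= 4512 quant.= 98 Man.= OEM' |
  awk '{ print $8 " " $0 }' |
  awk '{ print NF, $1, $9, $3 }')

echo "$shifted"
```

The line now has nine fields: the manufacturer appears both as $1 (the sort key) and as $9, and the piece name has moved from $2 to $3.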
Having to track all these steps individually and in different files is not practical. That's why writing shell scripts for multiple tasks is so handy (yes, we can call sed and awk from within a shell script!).
— Create a script named format-report.sh
and open it.
- Remember this is a shell script, so indicate it with a shebang at the beginning of the file.
#!/bin/sh
- First we need to order our list based on the manufacturer's name.
awk '{print $8" " $0 }' "$@" | sort |
- We have to add a header for our report using the BEGIN pattern from awk.
awk 'BEGIN{ printf "\n%-15s %-22s %-5s %9s\n", "MANUFACTURER ", "| PIECE NAME ","| ID ","| QUANTITY"
print "-------------------------------------------------------\n"}
- Next we execute the main loop of awk to print the formatted list.
{printf "%-18s %-23s %6d %9d\n", $9, $3, $5, $7}
- And we can add some condition to check if an item is out of stock.
{ if ($7 < 1) printf "\nWarning! Item %d is out of stock.(%s from %s)\n", $5, $3, $9}
- Once the main loop is done we can print a footer for our report using the END pattern, indicating time and date.
END {"date +%Y-%m-%d"|getline d; "date +%H:%M"|getline t; print "\n---- End of report. Time: " t " | Date: " d " ----"}' |
- Lastly we call sed to get rid of the double quotes that the names inside the list have.
sed 's/"//g'
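Joined together, the fragments above might look like the sketch below. A few things here are our own assumptions rather than part of the steps: the script is written out with a here-document so the example can be run in one go, the date formats are passed without inner quotes so nothing clashes with the single quotes around the awk program, and sample-list with its two records is made up just to try the script.

```shell
# Write the assembled script to disk (the quoted 'EOF' keeps it verbatim).
cat > format-report.sh <<'EOF'
#!/bin/sh
# Prefix each line with the manufacturer (field 8) so sort orders by it.
awk '{ print $8 " " $0 }' "$@" | sort |
awk '
BEGIN {
    printf "\n%-15s %-22s %-5s %9s\n", "MANUFACTURER ", "| PIECE NAME ", "| ID ", "| QUANTITY"
    print "-------------------------------------------------------\n"
}
# After the prefix every original field sits one position to the right.
{ printf "%-18s %-23s %6d %9d\n", $9, $3, $5, $7 }
{ if ($7 < 1) printf "\nWarning! Item %d is out of stock.(%s from %s)\n", $5, $3, $9 }
END {
    "date +%Y-%m-%d" | getline d
    "date +%H:%M" | getline t
    print "\n---- End of report. Time: " t " | Date: " d " ----"
}' |
sed 's/"//g'
EOF
chmod +x format-report.sh

# Two made-up records, only to show the script running.
printf '%s\n' \
  'Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"' \
  'Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"' > sample-list

# Calling it through sh works whether or not the executable bit is set.
report=$(sh format-report.sh sample-list)
echo "$report"
```

Because sed removes the quotes after awk has already padded the columns, the names end up two characters narrower than the widths we asked for. Running sed first would keep the columns tighter, at the cost of changing the order of the steps.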
To run the script, save it, change its permissions to make it executable (for example with chmod +x format-report.sh), and pass pieces-list as the first argument:
$ ./format-report.sh pieces-list
We should see something similar to this:
MANUFACTURER    | PIECE NAME            | ID  | QUANTITY
-------------------------------------------------------

Bosch            Capacitor               3456       204
Intel            Bluetooth-Emmiter      19085       184
Intel            WiFi-Card               2941       115
Mitsubishi       Fan-Frame               7864       131
OEM              Fan                     4512        98
Phillips         Battery                 2760         0

Warning! Item 2760 is out of stock.(Battery from Phillips)

---- End of report. Time: 07:40 | Date: 2020-04-01 ----
Summing up
Pipes, and the ability to use them to combine programs as building blocks in many ways to create automated workflows, are a fundamental part of the power of *nix systems.
We've seen several ways to manage text data without opening a text editor, so now we can bring awk and sed into our pipe workflows with a new level of flexibility.