Starting an AWK Script
Template and Tips for Writing AWK Scripts
August 24, 2019
Note: This is an unedited draft.
AWK is a utility program available on most Unix-like systems, and it really shines with tabular text input. In this post I will describe my starter template for the AWK scripts I write.
The Data
The data for our input needs to be tabular. For our example we’ll be using a CSV file containing stock trading data with the following header:
id,local_date,symbol,low,high,open,close
Here id is a system identifier that probably has no value to us. low, high, open, and close are monetary values of the stock at its lowest, highest, start of day, and end of day respectively.
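To make the layout concrete, here are a few rows in that format. The values are made up for illustration, not real quotes:

id,local_date,symbol,low,high,open,close
1,2019-01-02,AAPL,38.50,39.10,38.72,39.05
2,2019-01-02,MSFT,97.20,101.75,99.55,101.12
3,2019-06-03,AAPL,42.57,44.48,43.90,43.32
4,2018-12-28,AAPL,37.50,39.63,38.96,39.04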
Writing the AWK Script
The most basic template for AWK will usually consist of 3 sections: the BEGIN block, the END block, and a body section in between for processing each record (typically a line) in the input. However, any of these sections can be omitted.
BEGIN {
# ...config & setup before input consumption starts.
}
# body here with patterns and blocks.
$0 ~ /xyz/ { }
END {
# ...stuff to do after input is consumed.
}
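To see when each section fires, here is a throwaway run that just counts its input lines; BEGIN runs once before any input, the body runs once per record, and END runs once after the last record:

```shell
# A minimal illustration of the three sections.
printf 'a\nb\nc\n' | awk '
BEGIN { print "starting" }               # once, before any input
      { count++ }                        # body: once per record
END   { print "saw " count " records" }  # once, after all input
'
```

This prints "starting" followed by "saw 3 records".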
Tailoring the Script to the Data
The first step in tailoring this template to our stock market data is to set up some utility variables to help us clearly identify columns in the input. Usually this can be done by taking the header of the CSV and reformatting it like so:
BEGIN {
    FS=","
    id=1
    local_date=2
    symbol=3
    low=4
    high=5
    open=6
    closing=7   # "close" is a built-in awk function, so it can't be used as a variable name
}
Each header field gets the value of its position, starting at 1. Zero represents the whole line, and each number after that represents the value at that column. While in awk you could also just use the numbers directly, such as $1, $2, etc., I’ve found that it can get quite complex and error prone to continue to use the numbers rather than names. Save yourself some grief and assign aliases to your field positions.
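For example, with the aliases above, $low reads the same field as $4 would, but the intent is much clearer. The input line here is a made-up row in our CSV layout:

```shell
# $low holds 4, so $low is the fourth field of the record.
echo '1,2019-01-02,AAPL,38.50,39.10,38.72,39.05' |
  awk 'BEGIN { FS=","; low=4 } { print "low of the day: " $low }'
```

This prints "low of the day: 38.50".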
Here we’ve also set the value of FS. FS stands for field separator; it is the value that separates each of our columns (aka fields) in our CSV. The default value of FS is whitespace, so it’s important to set this if the input is not whitespace separated.
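To see why this matters, compare splitting the same comma-separated line with and without FS set (the brackets just make an empty field visible):

```shell
# Without FS the comma-joined line is a single field, so $2 is empty.
echo 'AAPL,38.50,39.10' | awk '{ print "[" $2 "]" }'
# With FS="," the line splits into three fields, and $2 is 38.50.
echo 'AAPL,38.50,39.10' | awk 'BEGIN { FS="," } { print "[" $2 "]" }'
```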
Deriving Meaningful Information
Next we can start writing the body of our awk script. Suppose we’re interested in how the stock for AAPL is doing, and in particular we want to see its lowest low and highest high:
$symbol ~ /AAPL/ {
    highestHigh = $high > highestHigh ? $high : highestHigh
    # lowestLow starts out empty, so treat the first matching row specially.
    lowestLow = (lowestLow == "" || $low < lowestLow) ? $low : lowestLow
}
This snippet will look at every line, tracking the highest high and the lowest low as it goes. However we’re only collecting the values and not doing anything with them. Since we need to look at all the data to know the two extremes, we have to wait until the end to print out the final values. This is the type of thing we would then do in the END block:
END {
print "Highest High: " highestHigh
print "Lowest Low: " lowestLow
}
Similarly, if we wanted to restrict our original match to only values in a particular year, we could add to the match condition to ignore input outside of the years we are interested in.
$symbol ~ /AAPL/ && $local_date ~ /2019/ {
# ...same body as above
}
Putting it all together
The complete script is below. All that’s left to do is execute it on our input. We pass awk our script with -f, followed by the name of our data file:
awk -f stock-analysis.awk stock-data.csv
stock-analysis.awk:
BEGIN {
    FS=","
    id=1
    local_date=2
    symbol=3
    low=4
    high=5
    open=6
    closing=7   # "close" is a built-in awk function, so it can't be used as a variable name
}
$symbol ~ /AAPL/ && $local_date ~ /2019/ {
    highestHigh = $high > highestHigh ? $high : highestHigh
    # lowestLow starts out empty, so treat the first matching row specially.
    lowestLow = (lowestLow == "" || $low < lowestLow) ? $low : lowestLow
}
END {
print "Highest High: " highestHigh
print "Lowest Low: " lowestLow
}
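To try this end to end without hunting for real market data, we can feed the script a few made-up rows and check the result. The data values below are illustrative only; the script includes a small guard for the first matching row and renames close, since that is a reserved awk function name:

```shell
# Write the script and some sample data, then run them together.
cat > stock-analysis.awk <<'EOF'
BEGIN {
    FS=","
    id=1
    local_date=2
    symbol=3
    low=4
    high=5
    open=6
    closing=7
}
$symbol ~ /AAPL/ && $local_date ~ /2019/ {
    highestHigh = $high > highestHigh ? $high : highestHigh
    lowestLow = (lowestLow == "" || $low < lowestLow) ? $low : lowestLow
}
END {
    print "Highest High: " highestHigh
    print "Lowest Low: " lowestLow
}
EOF

cat > stock-data.csv <<'EOF'
id,local_date,symbol,low,high,open,close
1,2019-01-02,AAPL,38.50,39.10,38.72,39.05
2,2019-01-02,MSFT,97.20,101.75,99.55,101.12
3,2019-06-03,AAPL,42.57,44.48,43.90,43.32
4,2018-12-28,AAPL,37.50,39.63,38.96,39.04
EOF

awk -f stock-analysis.awk stock-data.csv
# Only the two 2019 AAPL rows count, giving 44.48 and 38.50.
```

The MSFT row fails the symbol match and the 2018 row fails the year match, so the output is "Highest High: 44.48" and "Lowest Low: 38.50".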
Conclusion
AWK is a simple but powerful language. Most people will likely get along fine with one-liners on the command line. However, AWK can also be a way to quickly analyze some data without having to worry about creating file pointers, line iterators, and the general I/O boilerplate required in most programming languages. AWK stays very close to your calculation and can be a great way to inspect a tabular blob of data.