Starting an AWK Script

Template and Tips for Writing AWK Scripts

August 24, 2019

Note: This is an unedited draft.

AWK is a utility program available on most Unix-like systems, and it really shines with tabular text input. In this post I'll describe the starter template I use for the AWK scripts I write.

The Data

The input data needs to be tabular. For our example we'll be using a CSV file of stock trading data with the following header:

id,local_date,symbol,low,high,open,close

Here id is a system identifier that probably has no value to us. low, high, open, and close are the monetary values of the stock at its lowest, its highest, the start of the day, and the end of the day, respectively.
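A few rows of input might look like this (the values below are made up purely for illustration):

```csv
id,local_date,symbol,low,high,open,close
1,2019-08-01,AAPL,206.50,210.10,208.00,209.00
2,2019-08-02,AAPL,204.30,208.90,207.50,205.70
```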

Writing the AWK Script

The most basic AWK template usually consists of three sections: the BEGIN block, the END block, and a body section in between for processing each record (typically a line) in the input. Any of these sections can be omitted.

BEGIN {
  # ...config & setup before input consumption starts.
}

# body here with patterns and blocks.
$0 ~ /xyz/  {  }

END {
  # ...stuff to do after input is consumed.
}
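To see all three sections fire, here's a tiny self-contained run (the input is just three throwaway lines fed in via printf):

```shell
printf 'one\ntwo\nthree\n' | awk '
BEGIN { print "start" }          # runs once, before any input is read
/two/ { print "matched:", $0 }   # body: runs for each record matching the pattern
END   { print "done" }           # runs once, after all input is consumed
'
```

BEGIN prints before the first line is read, the body fires only on the line matching /two/, and END prints after the input is exhausted.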

Tailoring the Script to the Data

The first step in tailoring this template to our stock market data is to set up some utility variables that clearly identify the columns in the input. Usually this can be done by taking the header of the CSV and reformatting it like so:

BEGIN {
  FS=","
  id=1
  local_date=2
  symbol=3
  low=4
  high=5
  open=6
  close_=7  # "close" is a built-in AWK function, so it can't be used as a variable name
}

Each header name gets the value of its position, starting at 1. $0 represents the whole line, and each number after that represents the value at that column. While in AWK you could also just use the numbers directly, such as $1, $2, etc., I've found that sticking with bare numbers gets complex and error prone as a script grows. Save yourself some grief and assign aliases to your field positions.
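As a quick sketch of the difference, compare referencing a field by raw number versus by alias (the row below is made up):

```shell
row="42,2019-08-01,AAPL,206.50,210.10,208.00,209.00"

# By raw number -- works, but the reader has to count columns:
echo "$row" | awk -F, '{ print $3 }'

# By alias -- the intent is obvious at a glance:
echo "$row" | awk -F, 'BEGIN { symbol=3 } { print $symbol }'
```

Both commands print AAPL; only the second one says why.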

Here we've also set the value of FS, which stands for field separator: it is the value that separates each of our columns (a.k.a. fields) in our CSV. By default FS splits fields on runs of whitespace, so it's important to set it whenever the input is not whitespace separated.
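A quick illustration of the default behavior versus an explicit FS:

```shell
# Default FS: fields split on runs of whitespace
echo "one two   three" | awk '{ print $2 }'

# Explicit FS: fields split on commas
echo "one,two,three" | awk 'BEGIN { FS="," } { print $2 }'
```

Both commands print "two".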

Deriving Meaningful Information

Next we can start writing the body of our AWK script. Suppose we're interested in how AAPL stock is doing, and in particular we want to see its lowest low and highest high.

$symbol ~ /AAPL/ {
    highestHigh = $high > highestHigh ? $high : highestHigh
    # lowestLow starts out empty, so treat the first matching row as the initial low
    lowestLow = (lowestLow == "" || $low < lowestLow) ? $low : lowestLow
}

This snippet looks at every line, tracking the highest high and the lowest low as it goes. However, we're only collecting the values, not doing anything with them. Since we need to see all the data before we know the two extremes, we have to wait until the end to print the final values. This is the kind of thing we do in END:

END {
  print "Highest High: " highestHigh 
  print "Lowest Low: " lowestLow
}
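The same running min/max pattern can be exercised on its own with a few made-up numbers:

```shell
printf '3.5\n1.2\n2.8\n' | awk '
  { max = ($1 > max) ? $1 : max
    # min starts out empty, so adopt the first value we see
    min = (min == "" || $1 < min) ? $1 : min }
  END { print "max:", max, "min:", min }
'
```

The empty-string check matters: an uninitialized variable compares as 0, so without it a running minimum over positive values would never update.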

Similarly, if we wanted to restrict our match to a particular year, we could add to the match condition to ignore input outside the years we're interested in.

$symbol ~ /AAPL/ && $local_date ~ /2019/ {
  # ...same body as above
}

Putting it all together

The complete script is below. All that's left is to execute it on our input: either pipe the data into awk along with the -f flag and our script's file name, or pass the data file name directly as the last argument.

cat stock-data.csv | awk -f stock-analysis.awk

# or, skipping the extra cat process:
awk -f stock-analysis.awk stock-data.csv

stock-analysis.awk:

BEGIN {
  FS=","
  id=1
  local_date=2
  symbol=3
  low=4
  high=5
  open=6
  close_=7  # "close" is a built-in AWK function, so it can't be used as a variable name
}

$symbol ~ /AAPL/ && $local_date ~ /2019/ {
    highestHigh = $high > highestHigh ? $high : highestHigh
    lowestLow = (lowestLow == "" || $low < lowestLow) ? $low : lowestLow
}

END {
  print "Highest High: " highestHigh 
  print "Lowest Low: " lowestLow
}
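To sanity-check the whole pipeline end to end, you can write the script and a couple of rows to files and run them together. The data here is fabricated for illustration only, and the script is a condensed version of the one above:

```shell
# Write a condensed copy of the script to disk
cat > stock-analysis.awk <<'EOF'
BEGIN { FS=","; local_date=2; symbol=3; low=4; high=5 }
$symbol ~ /AAPL/ && $local_date ~ /2019/ {
  highestHigh = $high > highestHigh ? $high : highestHigh
  lowestLow = (lowestLow == "" || $low < lowestLow) ? $low : lowestLow
}
END {
  print "Highest High: " highestHigh
  print "Lowest Low: " lowestLow
}
EOF

# Two made-up AAPL rows plus the header
cat > stock-data.csv <<'EOF'
id,local_date,symbol,low,high,open,close
1,2019-08-01,AAPL,206.50,210.10,208.00,209.00
2,2019-08-02,AAPL,204.30,208.90,207.50,205.70
EOF

awk -f stock-analysis.awk stock-data.csv
```

Note that the header row passes through the body harmlessly: its symbol field is the literal word "symbol", which never matches /AAPL/.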

Conclusion

AWK is a simple but powerful language. Most people will get by with one-liners on the command line. But AWK can also be a way to quickly analyze data without having to worry about creating file pointers, line iterators, and the general I/O boilerplate required in most programming languages. AWK stays very close to your calculation and can be a great way to inspect a tabular blob of data.

Additional Reading Resources