lesscake.com

A short introduction to AWK

AWK is a programming language that is very well suited to manipulating text. It is powerful yet simple to learn, and comes pre-installed on most Unix-like systems.

I very recently discovered the existence of AWK, and since then I've found multiple situations where it has been super useful. In this tutorial we will walk through the basics of how to use it with 4 simple examples.

General idea

Before diving into the code, let's go over a very short overview of the language.

To use AWK, we type something like this in a Terminal. The program goes inside single quotes so that the shell leaves characters such as $ untouched.

awk 'program' input

More concretely, we will write this type of code.

awk 'BEGIN {c=0} {c++; if(c<=3) print $0}' input.txt

It's normal if this doesn't make any sense for now, but by the end of this tutorial it will become clear.

Example 1

Let's imagine we have a file called "input.txt" containing this.

foo,aaa,bbb,ccc
bar,aaa,bbb,ccc
bar,aaa,bbb,ccc
foo,aaa,bbb,ccc
foo,aaa,bbb,ccc

And we want to extract all lines starting with the letters "foo". Then we should use AWK with a single pattern { actions } command.

# The pattern
# A regex that will match lines starting with "foo"
/^foo/ 

# The actions
{ 
  # Print the current line
  print $0;
}

Here's a rough English translation of the above code.

For each line of the file
  If the current line starts with "foo" (/^foo/)
    Then print the current line (print $0)
  Otherwise do nothing

This means that once we've gone through the whole file, only the lines starting with "foo" will be displayed on the screen.

To test our code, we will type it on the command line, so it is better to shorten it to a single line:

/^foo/ {print $0}

Then in the Terminal we write this, with the program in single quotes so that the shell does not expand $0 itself.

awk '/^foo/ {print $0}' input.txt

And as planned, we only get the lines starting with "foo".

foo,aaa,bbb,ccc
foo,aaa,bbb,ccc
foo,aaa,bbb,ccc
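
To try this end to end, here is a minimal sketch that recreates the sample file and runs the command. It assumes a POSIX shell and that overwriting input.txt in the current directory is acceptable; the variable names result and short are only for illustration.

```shell
# Recreate the sample file from the article (this overwrites input.txt).
cat > input.txt <<'EOF'
foo,aaa,bbb,ccc
bar,aaa,bbb,ccc
bar,aaa,bbb,ccc
foo,aaa,bbb,ccc
foo,aaa,bbb,ccc
EOF

# Keep only the lines that start with "foo".
result=$(awk '/^foo/ {print $0}' input.txt)
echo "$result"

# Printing the whole line is AWK's default action, so the action
# can be left out entirely when that is all we want.
short=$(awk '/^foo/' input.txt)
echo "$short"
```

Both commands should print the same three lines.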

Example 2

Let's continue our previous example. This time, instead of printing the lines that start with "foo", we just want to count these lines to know how many there are.

We keep the same pattern /^foo/, but the action has to be different. We don't want to print the current line, we want to count it. So we could try to remove print $0 and replace it with count++.

/^foo/ { 
  count++;
}

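This is a good start, but there are 2 issues with this code: the variable count is never explicitly initialized, and it is never printed. AWK does treat an unset variable as 0 in arithmetic, but setting it ourselves makes the intent clear.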

We can fix that by using the special patterns BEGIN and END.

# Pattern that matches once, before any input is read
BEGIN {
  # Initialize the variable
  count = 0;
}

# Pattern that matches every line starting with "foo"
/^foo/ {
  # Increment the variable
  count++;
}

# Pattern that matches once, after all input has been read
END {
  # Print the variable
  print count;
}

In this case we have 3 pattern { actions } commands working together.

Next, we should shorten our code by removing comments, newlines, and semicolons. We may also rename count to just c.

BEGIN {c=0} /^foo/ {c++} END {print c}

Then in the Terminal we type this to check how many lines start with "foo". The program is in single quotes so that the shell passes it to AWK unchanged.

awk 'BEGIN {c=0} /^foo/ {c++} END {print c}' input.txt
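
As a quick check, the sketch below recreates the sample file and runs the counter. It assumes a POSIX shell and overwrites input.txt in the current directory. The second command uses a made-up pattern, /^zzz/, that matches nothing, to show that the BEGIN block makes the program print 0 in that case.

```shell
# Recreate the sample file from the article (this overwrites input.txt).
cat > input.txt <<'EOF'
foo,aaa,bbb,ccc
bar,aaa,bbb,ccc
bar,aaa,bbb,ccc
foo,aaa,bbb,ccc
foo,aaa,bbb,ccc
EOF

# Count the lines that start with "foo".
count=$(awk 'BEGIN {c=0} /^foo/ {c++} END {print c}' input.txt)
echo "$count"

# With a pattern that never matches, c keeps the value set in BEGIN,
# so a 0 is printed rather than an empty line.
none=$(awk 'BEGIN {c=0} /^zzz/ {c++} END {print c}' input.txt)
echo "$none"
```

With the sample file, the first command prints 3 and the second prints 0.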

Example 3

This time, we are going to print the first 3 lines of our file.

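The code is going to be similar to the previous example, but with 2 new additions: an action without a pattern, which is performed for every line, and an if statement inside that action.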

# Pattern that matches once, before any input is read
BEGIN {
  # Initialize a variable
  count = 0;
}

# If we omit the pattern, the action will be performed for every line
{
  # Increment count by 1
  count++;

  # If we have counted at most 3 lines
  if (count <= 3) {
    # Then print the current line
    print $0;
  }
}

And here's the complete version.

awk 'BEGIN {c=0} {c++; if(c<=3) print $0}' input.txt

This is the code that was mentioned in the introduction. Pretty simple once we understand how it works, right?
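
Here is a minimal runnable sketch of this example, assuming a POSIX shell and that input.txt in the current directory may be overwritten. The second command is an alternative that is not part of the walkthrough above: it relies on NR, a built-in AWK variable that holds the number of lines read so far, so no manual counter is needed.

```shell
# Recreate the sample file from the article (this overwrites input.txt).
cat > input.txt <<'EOF'
foo,aaa,bbb,ccc
bar,aaa,bbb,ccc
bar,aaa,bbb,ccc
foo,aaa,bbb,ccc
foo,aaa,bbb,ccc
EOF

# Print the first 3 lines using our own counter.
first3=$(awk 'BEGIN {c=0} {c++; if(c<=3) print $0}' input.txt)
echo "$first3"

# Same result using the built-in line counter NR as the pattern;
# the default action prints the line.
with_nr=$(awk 'NR <= 3' input.txt)
echo "$with_nr"
```

Both versions should print the first three lines of the file.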

Example 4

For the last example, we use the same file once more.

foo,aaa,bbb,ccc
bar,aaa,bbb,ccc
bar,aaa,bbb,ccc
foo,aaa,bbb,ccc
foo,aaa,bbb,ccc

Each line is divided into 4 fields, and we wish to keep only the first and last ones.

foo ccc
bar ccc
bar ccc
foo ccc
foo ccc

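To achieve this, we need to learn about 2 useful features of the language: the FS variable, which sets the separator used to split each line into fields, and the field variables $1, $2, $3, and so on, which hold the individual fields of the current line.

So if we set FS = ",", then the line foo,aaa,bbb,ccc would be split into 4 fields: $1 -> foo, $2 -> aaa, $3 -> bbb, and $4 -> ccc.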
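
A small sketch to see the splitting in action on a single line, assuming a POSIX shell. The line is piped into AWK instead of being read from a file, and the separator is written with double quotes because that is how AWK delimits strings. NF is a built-in variable, not covered in this article, that holds the number of fields in the current line; the shell variable names are only for illustration.

```shell
line='foo,aaa,bbb,ccc'

# Number of fields once "," is the separator.
nf=$(printf '%s\n' "$line" | awk 'BEGIN {FS=","} {print NF}')
echo "$nf"

# The second field on its own.
second=$(printf '%s\n' "$line" | awk 'BEGIN {FS=","} {print $2}')
echo "$second"
```

This should print 4, then aaa.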

With this new knowledge, we are able to achieve our goal.

# Once at the beginning
BEGIN {
  # Set "," as the new field separator
  FS = ",";
}

# For every line
{
  # Only print the first and fourth fields, separated by a space
  print $1 " " $4
}

Typing this command in a Terminal will give us the expected result.

awk 'BEGIN {FS=","} {print $1 " " $4}' input.txt
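
For completeness, here is a runnable sketch of this example, assuming a POSIX shell and that input.txt in the current directory may be overwritten. The second command is an optional variation that goes beyond the walkthrough above: the -F command-line option sets the field separator directly, which saves the BEGIN block.

```shell
# Recreate the sample file from the article (this overwrites input.txt).
cat > input.txt <<'EOF'
foo,aaa,bbb,ccc
bar,aaa,bbb,ccc
bar,aaa,bbb,ccc
foo,aaa,bbb,ccc
foo,aaa,bbb,ccc
EOF

# Keep the first and fourth fields, setting FS in a BEGIN block.
out=$(awk 'BEGIN {FS=","} {print $1 " " $4}' input.txt)
echo "$out"

# Same result with the separator given through the -F option.
out_f=$(awk -F, '{print $1 " " $4}' input.txt)
echo "$out_f"
```

Both commands should print the five "foo ccc" / "bar ccc" lines shown earlier.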

Conclusion

There is more to learn about AWK. However, with just the information from this article, there's already a lot that can be done.

The source code of all the examples discussed here is available on GitHub.

If you have any questions or feedback: thomas@lesscake.com :-)