Next: Performance, Previous: Quick Start, Up: General Introduction
Our first example uses pm-gawk
to streamline analysis of a prose corpus,
Mark Twain's Tom Sawyer and Huckleberry Finn,
available from
https://gutenberg.org/files/74/74-0.txt
and
https://gutenberg.org/files/76/76-0.txt.
We first convert non-alphabetic characters to newlines (so each line
has at most one word) and convert to lowercase:
$ tr -c a-zA-Z '\n' < 74-0.txt | tr A-Z a-z > sawyer.txt
$ tr -c a-zA-Z '\n' < 76-0.txt | tr A-Z a-z > finn.txt
It’s easy to count word frequencies with AWK’s associative arrays.
pm-gawk
makes these arrays persistent, so we need not re-ingest the
entire corpus every time we ask a new question (“read once, analyze
happily ever after”):
$ truncate -s 100M twain.pma
$ export GAWK_PERSIST_FILE=twain.pma
$ gawk '{ts[$1]++}' sawyer.txt                    # ingest
$ gawk 'BEGIN{print ts["work"], ts["play"]}'      # query
92 11
$ gawk 'BEGIN{print ts["necktie"], ts["knife"]}'  # query
2 27
The truncate command above creates a heap file large enough
to store all of the data it must eventually contain, with plenty of
room to spare.  (As we'll see in Sparse Heap Files, this isn't
wasteful.)  The export command ensures that subsequent gawk
invocations activate pm-gawk.  The first pm-gawk command stores
Tom Sawyer's word frequencies in associative array ts[].
Because this array is persistent, subsequent pm-gawk commands can
access it without having to parse the input file again.
Expanding our analysis to encompass a second book is easy. Let’s
populate a new associative array hf[]
with the word frequencies
in Huckleberry Finn:
$ gawk '{hf[$1]++}' finn.txt
Now we can freely intermix accesses to both books’ data conveniently and efficiently, without the overhead and coding fuss of repeated input parsing:
$ gawk 'BEGIN{print ts["river"], hf["river"]}'
26 142
By making AWK more interactive, pm-gawk
invites casual conversations
with data. If we’re curious what words in Finn are absent from
Sawyer, answers (including “flapdoodle,” “yellocution,” and
“sockdolager”) are easy to find:
$ gawk 'BEGIN{for(w in hf) if (!(w in ts)) print w}'
Rumors of Twain’s death may be exaggerated. If he publishes new books in the future, it will be easy to incorporate them into our analysis incrementally. The performance benefits of incremental processing for common AWK chores such as log file analysis are discussed in https://queue.acm.org/detail.cfm?id=3534855 and the companion paper cited therein, and below in Performance.
Exercise: The “Markov” AWK script on page 79 of Kernighan & Pike’s
The Practice of Programming generates random text reminiscent
of a given corpus using a simple statistical modeling technique. This
script consists of a “learning” or “training” phase followed by an
output-generation phase. Use pm-gawk
to de-couple the two phases and
to allow the statistical model to incrementally ingest additions to
the input corpus.
Our second example considers another domain that plays to AWK’s strengths, data analysis. For simplicity we’ll create two small input files of numeric data.
$ printf '1\n2\n3\n4\n5\n' > A.dat
$ printf '5\n6\n7\n8\n9\n' > B.dat
A conventional non-persistent AWK script can compute basic summary statistics:
$ cat summary_conventional.awk
1 == NR  { min = max = $1 }
min > $1 { min = $1 }
max < $1 { max = $1 }
         { sum += $1 }
END      { print "min: " min " max: " max " mean: " sum/NR }
$ gawk -f summary_conventional.awk A.dat B.dat
min: 1 max: 9 mean: 5
To use pm-gawk
for the same purpose, we first create a heap file for
our AWK script variables and tell pm-gawk
where to find it via the
usual environment variable:
$ truncate -s 10M stats.pma
$ export GAWK_PERSIST_FILE=stats.pma
pm-gawk requires changing the above script to ensure that min
and max are initialized exactly once, when the heap file is
first used, and not every time the script runs.  Furthermore,
whereas script-defined variables such as min retain their
values across pm-gawk executions, built-in AWK variables such as
NR are reset to zero every time pm-gawk runs, so we can't use
them in the same way.  Here's a modified script for pm-gawk:
$ cat summary_persistent.awk
! init   { min = max = $1; init = 1 }
min > $1 { min = $1 }
max < $1 { max = $1 }
         { sum += $1; ++n }
END      { print "min: " min " max: " max " mean: " sum/n }
Note the different pattern on the first line and the introduction of
n to supplant NR.  When used with pm-gawk, this new
initialization logic supports the same kind of cumulative processing
that we saw in the text-analysis scenario.  For example, we can ingest
our input files separately:
$ gawk -f summary_persistent.awk A.dat
min: 1 max: 5 mean: 3
$ gawk -f summary_persistent.awk B.dat
min: 1 max: 9 mean: 5
As expected, after the second pm-gawk
invocation consumes the
second input file, the output matches that of the non-persistent
script that read both files at once.
Exercise: Amend the AWK scripts above to compute the median and
mode(s) using both conventional gawk and pm-gawk.  (The median is the
number in the middle of a sorted list; if the length of the list is
even, average the two numbers at the middle.  The modes are the values
that occur most frequently.)
Our third and final set of examples shows that pm-gawk allows us to
bundle both script-defined data and user-defined functions
in a persistent heap that may be passed freely between unrelated AWK
scripts.
The following shell transcript repeatedly invokes pm-gawk
to create and
then employ a user-defined function. These separate invocations
involve several different AWK scripts that communicate via the heap
file. Each invocation can add user-defined functions and add or
remove data from the heap that subsequent invocations will access.
$ truncate -s 10M funcs.pma
$ export GAWK_PERSIST_FILE=funcs.pma
$ gawk 'function count(A,t) {for(i in A)t++; return ""==t?0:t}'
$ gawk 'BEGIN { a["x"] = 4; a["y"] = 5; a["z"] = 6 }'
$ gawk 'BEGIN { print count(a) }'
3
$ gawk 'BEGIN { delete a["x"] }'
$ gawk 'BEGIN { print count(a) }'
2
$ gawk 'BEGIN { delete a }'
$ gawk 'BEGIN { print count(a) }'
0
$ gawk 'BEGIN { for (i=0; i<47; i++) a[i]=i }'
$ gawk 'BEGIN { print count(a) }'
47
The first pm-gawk command creates user-defined function count(),
which returns the number of entries in a given associative array; note
that variable t is local to count(), not global.  The
next pm-gawk command populates a persistent associative array with
three entries; not surprisingly, the count() call in the
following pm-gawk command finds these three entries.  The next two
pm-gawk commands respectively delete an array entry and print the
reduced count, 2.  The two commands after that delete the entire array
and print a count of zero.  Finally, the last two pm-gawk commands
populate the array with 47 entries and count them.
The following shell script invokes pm-gawk
repeatedly to create a
collection of user-defined functions that perform basic operations on
quadratic polynomials: evaluation at a given point, computing the
discriminant, and using the quadratic formula to find the roots. It
then factorizes x^2 + x - 12 into (x - 3)(x + 4).
#!/bin/sh
rm -f poly.pma
truncate -s 10M poly.pma
export GAWK_PERSIST_FILE=poly.pma
gawk 'function q(x) { return a*x^2 + b*x + c }'
gawk 'function p(x) { return "q(" x ") = " q(x) }'
gawk 'BEGIN { print p(2) }'              # evaluate & print
gawk 'BEGIN{ a = 1; b = 1; c = -12 }'    # new coefficients
gawk 'BEGIN { print p(2) }'              # eval/print again
gawk 'function d(s) { return s * sqrt(b^2 - 4*a*c)}'
gawk 'BEGIN{ print "discriminant (must be >=0): " d(1)}'
gawk 'function r(s) { return (-b + d(s))/(2*a)}'
gawk 'BEGIN{ print "root: " r( 1) " " p(r( 1)) }'
gawk 'BEGIN{ print "root: " r(-1) " " p(r(-1)) }'
gawk 'function abs(n) { return n >= 0 ? n : -n }'
gawk 'function sgn(x) { return x >= 0 ? "- " : "+ " } '
gawk 'function f(s) { return "(x " sgn(r(s)) abs(r(s))}'
gawk 'BEGIN{ print "factor: " f( 1) ")" }'
gawk 'BEGIN{ print "factor: " f(-1) ")" }'
rm -f poly.pma