Next: Performance, Previous: Quick Start, Up: General Introduction
Our first example uses pm-gawk
to streamline analysis of a prose corpus,
Mark Twain's Tom Sawyer and Huckleberry Finn,
available from
https://gutenberg.org/files/74/74-0.txt
and
https://gutenberg.org/files/76/76-0.txt.
We first convert non-alphabetic characters to newlines (so each line
has at most one word) and convert to lowercase:
$ tr -c a-zA-Z '\n' < 74-0.txt | tr A-Z a-z > sawyer.txt
$ tr -c a-zA-Z '\n' < 76-0.txt | tr A-Z a-z > finn.txt
It’s easy to count word frequencies with AWK’s associative arrays.
pm-gawk
makes these arrays persistent, so we need not re-ingest the
entire corpus every time we ask a new question (“read once, analyze
happily ever after”):
$ truncate -s 100M twain.pma
$ export GAWK_PERSIST_FILE=twain.pma
$ gawk '{ts[$1]++}' sawyer.txt                    # ingest
$ gawk 'BEGIN{print ts["work"], ts["play"]}'      # query
92 11
$ gawk 'BEGIN{print ts["necktie"], ts["knife"]}'  # query
2 27
The truncate command above creates a heap file large enough
to store all of the data it must eventually contain, with plenty of
room to spare.  (As we'll see in Sparse Heap Files, this isn't
wasteful.)  The export command ensures that subsequent gawk
invocations activate pm-gawk.  The first pm-gawk command stores
Tom Sawyer's word frequencies in associative array ts[].
Because this array is persistent, subsequent pm-gawk commands can
access it without having to parse the input file again.
Expanding our analysis to encompass a second book is easy. Let’s
populate a new associative array hf[]
with the word frequencies
in Huckleberry Finn:
$ gawk '{hf[$1]++}' finn.txt
Now we can freely intermix accesses to both books’ data conveniently and efficiently, without the overhead and coding fuss of repeated input parsing:
$ gawk 'BEGIN{print ts["river"], hf["river"]}'
26 142
By making AWK more interactive, pm-gawk
invites casual conversations
with data. If we’re curious what words in Finn are absent from
Sawyer, answers (including “flapdoodle,” “yellocution,” and
“sockdolager”) are easy to find:
$ gawk 'BEGIN{for(w in hf) if (!(w in ts)) print w}'
Rumors of Twain’s death may be exaggerated. If he publishes new books in the future, it will be easy to incorporate them into our analysis incrementally. The performance benefits of incremental processing for common AWK chores such as log file analysis are discussed in https://queue.acm.org/detail.cfm?id=3534855 and the companion paper cited therein, and below in Performance.
Exercise: The “Markov” AWK script on page 79 of Kernighan & Pike’s
The Practice of Programming generates random text reminiscent
of a given corpus using a simple statistical modeling technique. This
script consists of a “learning” or “training” phase followed by an
output-generation phase. Use pm-gawk
to de-couple the two phases and
to allow the statistical model to incrementally ingest additions to
the input corpus.
Our second example considers another domain that plays to AWK’s strengths, data analysis. For simplicity we’ll create two small input files of numeric data.
$ printf '1\n2\n3\n4\n5\n' > A.dat
$ printf '5\n6\n7\n8\n9\n' > B.dat
A conventional non-persistent AWK script can compute basic summary statistics:
$ cat summary_conventional.awk
1 == NR  { min = max = $1 }
min > $1 { min = $1 }
max < $1 { max = $1 }
         { sum += $1 }
END      { print "min: " min " max: " max " mean: " sum/NR }
$ gawk -f summary_conventional.awk A.dat B.dat
min: 1 max: 9 mean: 5
To use pm-gawk
for the same purpose, we first create a heap file for
our AWK script variables and tell pm-gawk
where to find it via the
usual environment variable:
$ truncate -s 10M stats.pma
$ export GAWK_PERSIST_FILE=stats.pma
pm-gawk requires changing the above script to ensure that min
and max are initialized exactly once, when the heap file is
first used, and not every time the script runs.  Furthermore,
whereas script-defined variables such as min retain their
values across pm-gawk executions, built-in AWK variables such as
NR are reset to zero every time pm-gawk runs, so we can't use
them in the same way.  Here's a modified script for pm-gawk:
$ cat summary_persistent.awk
! init   { min = max = $1; init = 1 }
min > $1 { min = $1 }
max < $1 { max = $1 }
         { sum += $1; ++n }
END      { print "min: " min " max: " max " mean: " sum/n }
Note the different pattern on the first line and the introduction of
n to supplant NR.  When used with pm-gawk, this new
initialization logic supports the same kind of cumulative processing
that we saw in the text-analysis scenario.  For example, we can ingest
our input files separately:
$ gawk -f summary_persistent.awk A.dat
min: 1 max: 5 mean: 3
$ gawk -f summary_persistent.awk B.dat
min: 1 max: 9 mean: 5
As expected, after the second pm-gawk
invocation consumes the
second input file, the output matches that of the non-persistent
script that read both files at once.
Exercise: Amend the AWK scripts above to compute the median and
mode(s) using both conventional gawk and pm-gawk.  (The median is the
number in the middle of a sorted list; if the length of the list is
even, average the two numbers at the middle.  The modes are the values
that occur most frequently.)
Our third and final set of examples shows that pm-gawk allows us to
bundle both script-defined data and user-defined functions
in a persistent heap that may be passed freely between unrelated AWK
scripts.
The following shell transcript repeatedly invokes pm-gawk
to create and
then employ a user-defined function. These separate invocations
involve several different AWK scripts that communicate via the heap
file. Each invocation can add user-defined functions and add or
remove data from the heap that subsequent invocations will access.
$ truncate -s 10M funcs.pma
$ export GAWK_PERSIST_FILE=funcs.pma
$ gawk 'function count(A,t) {for(i in A)t++; return ""==t?0:t}'
$ gawk 'BEGIN { a["x"] = 4; a["y"] = 5; a["z"] = 6 }'
$ gawk 'BEGIN { print count(a) }'
3
$ gawk 'BEGIN { delete a["x"] }'
$ gawk 'BEGIN { print count(a) }'
2
$ gawk 'BEGIN { delete a }'
$ gawk 'BEGIN { print count(a) }'
0
$ gawk 'BEGIN { for (i=0; i<47; i++) a[i]=i }'
$ gawk 'BEGIN { print count(a) }'
47
The first pm-gawk command creates user-defined function count(),
which returns the number of entries in a given associative array; note
that variable t is local to count(), not global.  The
next pm-gawk command populates a persistent associative array with
three entries; not surprisingly, the count() call in the
following pm-gawk command finds these three entries.  The next two
pm-gawk commands respectively delete an array entry and print the
reduced count, 2.  The two commands after that delete the entire array
and print a count of zero.  Finally, the last two pm-gawk commands
populate the array with 47 entries and count them.
The following shell script invokes pm-gawk
repeatedly to create a
collection of user-defined functions that perform basic operations on
quadratic polynomials: evaluation at a given point, computing the
discriminant, and using the quadratic formula to find the roots. It
then factorizes x^2 + x - 12 into (x - 3)(x + 4).
#!/bin/sh
rm -f poly.pma
truncate -s 10M poly.pma
export GAWK_PERSIST_FILE=poly.pma
gawk 'function q(x) { return a*x^2 + b*x + c }'
gawk 'function p(x) { return "q(" x ") = " q(x) }'
gawk 'BEGIN { print p(2) }'              # evaluate & print
gawk 'BEGIN{ a = 1; b = 1; c = -12 }'    # new coefficients
gawk 'BEGIN { print p(2) }'              # eval/print again
gawk 'function d(s) { return s * sqrt(b^2 - 4*a*c)}'
gawk 'BEGIN{ print "discriminant (must be >=0): " d(1)}'
gawk 'function r(s) { return (-b + d(s))/(2*a)}'
gawk 'BEGIN{ print "root: " r( 1) " " p(r( 1)) }'
gawk 'BEGIN{ print "root: " r(-1) " " p(r(-1)) }'
gawk 'function abs(n) { return n >= 0 ? n : -n }'
gawk 'function sgn(x) { return x >= 0 ? "- " : "+ " } '
gawk 'function f(s) { return "(x " sgn(r(s)) abs(r(s))}'
gawk 'BEGIN{ print "factor: " f( 1) ")" }'
gawk 'BEGIN{ print "factor: " f(-1) ")" }'
rm -f poly.pma