Next: Virtual Memory and Big Data, Up: Performance
pm-gawk
preserves the efficiency of data access when data structures
are created by one process and later re-used by a different process.
Consider the associative arrays used to analyze Mark Twain’s books in
Examples. We created arrays ts[]
and hf[]
by
reading files sawyer.txt and finn.txt. If N denotes
the total volume of data in these files, building the associative
arrays typically requires time proportional to N, or “O(N)
expected time” in the lingo of asymptotic analysis. If W is the
number of unique words in the input files, the size of the associative
arrays will be proportional to W, or O(W). Accessing
individual array elements requires only constant or O(1)
expected time, not O(N) or O(W) time, because gawk
implements arrays as hash tables.
The performance advantage of pm-gawk
arises when different processes
create and access associative arrays. Accessing an element of a
persistent array created by a previous pm-gawk
process, as we did
earlier in
BEGIN{print ts["river"], hf["river"]},
still requires only O(1) time, which is asymptotically far
superior to the alternatives. Naïvely reconstructing
arrays by re-ingesting all raw inputs in every gawk
process that
accesses the arrays would of course require O(N) time—a
profligate cost if the text corpus is large. Dumping arrays to files
and re-loading them as needed would reduce the preparation time for
access to O(W). That can be a substantial improvement in
practice; N is roughly 19 times larger than W in our Twain
corpus. Nonetheless O(W) remains far slower than pm-gawk
’s
O(1). As we’ll see in Results, the difference is not merely
theoretical.
The persistent memory implementation beneath pm-gawk
enables it to
avoid work proportional to N or W when accessing an element of
a persistent associative array. Under the hood, pm-gawk
stores
script-defined AWK variables such as associative arrays in a
persistent heap laid out in a memory-mapped file (the heap file).
When an AWK script accesses an element of an associative array, pm-gawk
performs a lookup on the corresponding hash table, which in turn
accesses memory on the persistent heap. Modern operating systems
implement memory-mapped files in such a way that these memory accesses
trigger the bare minimum of data movement required: Only those parts
of the heap file containing needed data are “paged in” to the memory
of the pm-gawk
process. In the worst case, the heap file is not in the
file system’s in-memory cache, so the required pages must be faulted
into memory from storage. Our asymptotic analysis of efficiency
applies regardless of whether the heap file is cached or not. The
entire heap file is not accessed merely to access an element of
a persistent associative array.
Persistent memory thus enables pm-gawk
to offer the flexibility of
de-coupling data ingestion from analytic queries without the fuss and
overhead of serializing and loading data structures and without
sacrificing constant-time access to the associative arrays that make
AWK scripting convenient and productive.
Next: Virtual Memory and Big Data, Up: Performance