Previous: Experiments, Up: Performance
Running the script of Experiments with default parameters on an aging laptop yielded the results summarized in the table below. More extensive experiments, not reported here, yield qualitatively similar results. Keep in mind that performance measurements are often sensitive to seemingly irrelevant factors. For example, the program that runs first may have the advantage of a cooler CPU; later contestants may start with a hot CPU and consequent clock throttling by a modern processor’s thermal regulation apparatus. Very generally, performance measurement is a notoriously tricky business. For scripting, whose main motive is convenience rather than speed, the proper role for performance measurements is to qualitatively test hypotheses such as those that follow from asymptotic analyses and to provide a rough idea of when various approaches are practical.
run time peak memory intermediate AWK script (sec) footprint (K) storage (K) naive O(N) 242.132 2,081,360 n/a rwarray build O(N) 250.288 2,846,868 156,832 rwarray query O(W) 11.653 2,081,444 freqtbl build O(N) 288.408 2,400,120 69,112 freqtbl query O(W) 11.624 2,336,616 pm-gawk build O(N) 251.946 2,079,520 2,076,608 pm-gawk query O(1) 0.026 3,252
The results are consistent with the asymptotic analysis of
Constant-Time Array Access. All four approaches require roughly
four minutes to read the synthetic input data. The
naïve approach must do this every time it performs a
query, but the other three build an associative array to support
queries and separately serve such queries. The freqtbl
and
rwarray
approaches build an associative array of word
frequencies, serialize it to an intermediate file, and then read the
entire intermediate file prior to serving queries; the former does
this manually and the latter uses a gawk
extension. Both can serve
queries in roughly ten seconds, not four minutes. As we’d expect from
the asymptotic analysis, performing work proportional to the number of
words is preferable to work proportional to the size of the raw input
corpus: O(W) time is faster than O(N). And as we’d expect,
pm-gawk
’s constant-time queries are faster still, by roughly two orders
of magnitude. For the computations considered here, pm-gawk
makes the
difference between blink-of-an-eye interactive queries and response
times long enough for the user’s mind to wander.
Whereas freqtbl
and rwarray
reconstruct an associative
array prior to accessing an individual element, pm-gawk
stores a
ready-made associative array in persistent memory. That’s why its
intermediate file (the heap file) is much larger than the other two
intermediate files, why the heap file is nearly as large as pm-gawk
’s
peak memory footprint while building the persistent array, and why its
memory footprint is very small while serving a query that accesses a
single array element. The upside of the large heap file is O(1)
access instead of O(W)—a classic time-space tradeoff. If
storage is a scarce resource, all three intermediate files can be
compressed, freqtbl
by a factor of roughly 2.7, rwarray
by roughly 5.6x, and pm-gawk
by roughly 11x using xz
.
Compression is CPU-intensive and slow, another time-space tradeoff.
Previous: Experiments, Up: Performance