Results (Persistent-Memory gawk User Manual)

4.6 Results

Running the script of Experiments with default parameters on an aging laptop yielded the results summarized in the table below. More extensive experiments, not reported here, yield qualitatively similar results. Keep in mind that performance measurements are often sensitive to seemingly irrelevant factors. For example, the program that runs first may have the advantage of a cooler CPU; later contestants may start with a hot CPU and consequent clock throttling by a modern processor’s thermal regulation apparatus. Very generally, performance measurement is a notoriously tricky business. For scripting, whose main motive is convenience rather than speed, the proper role for performance measurements is to qualitatively test hypotheses such as those that follow from asymptotic analyses and to provide a rough idea of when various approaches are practical.

                            run time      peak memory    intermediate
        AWK script            (sec)      footprint (K)    storage (K)

        naive         O(N)   242.132       2,081,360           n/a
        rwarray build O(N)   250.288       2,846,868         156,832
        rwarray query O(W)    11.653       2,081,444
        freqtbl build O(N)   288.408       2,400,120          69,112
        freqtbl query O(W)    11.624       2,336,616
        pm-gawk build O(N)   251.946       2,079,520       2,076,608
        pm-gawk query O(1)     0.026           3,252

The results are consistent with the asymptotic analysis of Constant-Time Array Access. All four approaches require roughly four minutes to read the synthetic input data. The naïve approach must do this every time it performs a query, but the other three build an associative array to support queries and separately serve such queries. The freqtbl and rwarray approaches build an associative array of word frequencies, serialize it to an intermediate file, and then read the entire intermediate file prior to serving queries; the former does this manually and the latter uses a gawk extension. Both can serve queries in roughly ten seconds, not four minutes. As we’d expect from the asymptotic analysis, performing work proportional to the number of words is preferable to work proportional to the size of the raw input corpus: O(W) time is faster than O(N). And as we’d expect, pm-gawk’s constant-time queries are faster still, by roughly two orders of magnitude. For the computations considered here, pm-gawk makes the difference between blink-of-an-eye interactive queries and response times long enough for the user’s mind to wander.

Whereas freqtbl and rwarray reconstruct an associative array prior to accessing an individual element, pm-gawk stores a ready-made associative array in persistent memory. That’s why its intermediate file (the heap file) is much larger than the other two intermediate files, why the heap file is nearly as large as pm-gawk’s peak memory footprint while building the persistent array, and why its memory footprint is very small while serving a query that accesses a single array element. The upside of the large heap file is O(1) access instead of O(W)—a classic time-space tradeoff. If storage is a scarce resource, all three intermediate files can be compressed, freqtbl by a factor of roughly 2.7, rwarray by roughly 5.6x, and pm-gawk by roughly 11x using xz. Compression is CPU-intensive and slow, another time-space tradeoff.