Next: Results, Previous: Persistence versus Durability, Up: Performance
The C-shell (csh
) script listed below illustrates concepts
and implements tips presented in this chapter. It produced the
results discussed in Results in roughly 20 minutes on an aging
laptop. You can cut and paste the code listing below into a file, or
download it from http://web.eecs.umich.edu/~tpkelly/pma/.
The script measures the performance of four different ways to support
word frequency queries over a text corpus: The naïve
approach of reading the corpus into an associative array for every
query; manually dumping a text representation of the word-frequency
table and manually loading it prior to a query; using gawk
’s
rwarray
extension to dump and load an associative array; and
using pm-gawk
to maintain a persistent associative array.
Comments at the top explain prerequisites. Lines 8–10 set input
parameters: the directory where tests are run and where files
including the heap file are held, the off-the-shelf timer used to
measure run times and other performance characteristics such as peak
memory usage, and the size of the input. The default input size
results in pm-gawk
memory footprints under 3 GiB, which is large enough
for interesting results and small enough to fit in DRAM and avoid
paging on today’s computers. Lines 13–14 define a homebrew timer.
Two sections of the script are relevant if the default run directory is changed from /dev/shm/ to a directory in a conventional storage-backed file system: Lines 15–17 define the mechanism for clearing file data cached in DRAM; lines 23–30 set Linux kernel parameters to discourage eager paging.
Lines 37–70 spit out, compile, and run a little C program to generate a random text corpus. This program is fast, flexible, and deterministic, generating the same random output given the same parameters.
Lines 71–100 run the four different AWK approaches on the same random input, reporting separately the time to build and to query the associative array containing word frequencies.
#!/bin/csh -f # Set PMG envar to path of pm-gawk executable and AWKLIBPATH # 2 # to find rwarray.so # 3 # Requires "sudo" to work; consider this for /etc/sudoers file: # 4 # Defaults:youruserid !authenticate # 5 echo 'begin: ' `date` `date +%s` # 6 unsetenv GAWK_PERSIST_FILE # disable persistence until wanted # 7 set dir = '/dev/shm' # where heap file et al. will live # 8 set tmr = '/usr/bin/time' # can also use shell built-in "time" # 9 set isz = 1073741824 # input size; 1 GiB # 10 # set isz = 100000000 # small input for quick testing # 11 cd $dir # tick/tock/tyme below are homebrew timer, good within ~2ms # 12 alias tick 'set t1 = `date +%s.%N`' ; alias tock 'set t2 = `date +%s.%N`' # 13 alias tyme '$PMG -v t1=$t1 -v t2=$t2 "BEGIN{print t2-t1}"' # 14 alias tsync 'tick ; sync ; tock ; echo "sync time: " `tyme`' # 15 alias drop_caches 'echo 3 | sudo tee /proc/sys/vm/drop_caches' # 16 alias snd 'tsync; drop_caches' # 17 echo "pm-gawk: $PMG" ; echo 'std gawk: ' `which gawk` # 18 echo "run dir: $dir" ; echo 'pwd: ' `pwd` # 19 echo 'dir content:' ; ls -l $dir |& $PMG '{print " " $0}' # 20 echo 'timer: ' $tmr ; echo 'AWKLIBPATH: ' $AWKLIBPATH # 21
echo 'OS params:' ; set vm = '/proc/sys/vm/dirty' # 22 sudo sh -c "echo 100 > ${vm}_background_ratio" # restore these # 23 sudo sh -c "echo 100 > ${vm}_ratio" # paging params # 24 sudo sh -c "echo 1080000 > ${vm}_expire_centisecs" # to defaults # 25 sudo sh -c "echo 1080000 > ${vm}_writeback_centisecs" # if necessary # 26 foreach d ( ${vm}_background_ratio ${vm}_ratio \ # 27 ${vm}_expire_centisecs ${vm}_writeback_centisecs ) # 28 printf " %-38s %7d\n" $d `cat $d` # 29 end # 30 tick ; tock ; echo 'timr ovrhd: ' `tyme` 's (around 2ms for TK)' # 31 tick ; $PMG 'BEGIN{print "pm-gawk? yes"}' # 32 tock ; echo 'pmg ovrhd: ' `tyme` 's (around 4-5 ms for TK)' # 33 set inp = 'input.dat' # 34 echo 'input size ' $isz # 35 echo "input file: $inp" # 36 set rg = rgen # spit out and compile C program to generate random inputs # 37 rm -f $inp $rg.c $rg # 38 cat <<EOF > $rg.c # 39 // generate N random words, one per line, no blank lines # 40 // charset is e.g. 'abcdefg@' where '@' becomes newline # 41 #include <stdio.h> # 42 #include <stdlib.h> # 43 #include <string.h> # 44 #define RCH c = a[rand() % L]; # 45 #define PICK do { RCH } while (0) # 46 #define PICKCH do { RCH } while (c == '@') # 47 #define FP(...) fprintf(stderr, __VA_ARGS__) # 48 int main(int argc, char *argv[]) { # 49 if (4 != argc) { # 50 FP("usage: %s charset N seed\n", # 51 argv[0]); return 1; } # 52 char c, *a = argv[1]; size_t L = strlen(a); # 53 long int N = atol(argv[2]); # 54 srand( atol(argv[3])); # 55 if (2 > N) { FP("N == %ld < 2\n", N); return 2; } # 56 PICKCH; # 57 for (;;) { # 58 if (2 == N) { PICKCH; putchar(c); putchar('\n'); break; } # 59 if ('@' == c) { putchar('\n'); PICKCH; } # 60 else { putchar( c ); PICK; } # 61 if (0 >= --N) break; # 62 } # 63 } # 64 EOF # 65 gcc -std=c11 -Wall -Wextra -O3 -o $rg $rg.c # 66 set t = '@@@@@@@' ; set c = "abcdefghijklmnopqrstuvwxyz$t$t$t$t$t$t" # 67 tick ; ./$rg "$c" $isz 47 > $inp ; tock ; echo 'gen time: ' `tyme` # 68 echo "input file: $inp" # 69 echo 'input wc: ' `wc < $inp` ; echo 'input uniq: ' `sort -u $inp | wc` # 70
snd ############################################################################ # 71 tick ; $tmr $PMG '{n[$1]++}END{print "output: " n["foo"]}' $inp # 72 tock ; echo 'T naive O(N): ' `tyme` ; echo '' # 73 rm -f rwa # 74 snd ############################################################################ # 75 echo '' # 76 tick ; $tmr $PMG -l rwarray '{n[$1]++}END{print "writea",writea("rwa",n)}' $inp # 77 tock ; echo 'T rwarray build O(N): ' `tyme` ; echo '' # 78 snd # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 79 tick ; $tmr $PMG -l rwarray 'BEGIN{print "reada",reada("rwa",n); \ # 80 print "output: " n["foo"]}' # 81 tock ; echo 'T rwarray query O(W): ' `tyme` ; echo '' # 82 rm -f ft # 83 snd ############################################################################ # 84 tick ; $tmr $PMG '{n[$1]++}END{for(w in n)print n[w], w}' $inp > ft # 85 tock ; echo 'T freqtbl build O(N): ' `tyme` ; echo '' # 86 snd # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 87 tick ; $tmr $PMG '{n[$2] = $1}END{print "output: " n["foo"]}' ft # 88 tock ; echo 'T freqtbl query O(W): ' `tyme` ; echo '' # 89 rm -f heap.pma # 90 snd ############################################################################ # 91 truncate -s 3G heap.pma # enlarge if needed # 92 setenv GAWK_PERSIST_FILE heap.pma # 93 tick ; $tmr $PMG '{n[$1]++}' $inp # 94 tock ; echo 'T pm-gawk build O(N): ' `tyme` ; echo '' # 95 snd # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 96 tick ; $tmr $PMG 'BEGIN{print "output: " n["foo"]}' # 97 tock ; echo 'T pm-gawk query O(1): ' `tyme` ; echo '' # 98 unsetenv GAWK_PERSIST_FILE # 99 snd ############################################################################ # 100 echo 'Note: all output lines above should be identical' ; echo '' # 101 echo 'dir content:' ; ls -l $dir |& $PMG '{print " " $0}' # 102 echo '' ; echo 'storage footprints:' # 103 foreach f ( rwa ft heap.pma ) # compression is very slow, so we comment it out # 104 echo " $f " `du -BK $dir/$f` # `xz --best < $dir/$f | wc -c` 'bytes xz' # 105 end # 106 echo '' ; echo 'end: ' `date` `date +%s` ; echo '' # 107
Next: Results, Previous: Persistence versus Durability, Up: Performance