Experiments (Persistent-Memory gawk User Manual)

4.5 Experiments

The C-shell (csh) script listed below illustrates concepts and implements tips presented in this chapter. It produced the results discussed in Results in roughly 20 minutes on an aging laptop. You can cut and paste the code listing below into a file, or download it from http://web.eecs.umich.edu/~tpkelly/pma/.

The script measures the performance of four different ways to support word frequency queries over a text corpus: The naïve approach of reading the corpus into an associative array for every query; manually dumping a text representation of the word-frequency table and manually loading it prior to a query; using gawk’s rwarray extension to dump and load an associative array; and using pm-gawk to maintain a persistent associative array.

Comments at the top explain prerequisites. Lines 8–10 set input parameters: the directory where tests are run and where files including the heap file are held, the off-the-shelf timer used to measure run times and other performance characteristics such as peak memory usage, and the size of the input. The default input size results in pm-gawk memory footprints under 3 GiB, which is large enough for interesting results and small enough to fit in DRAM and avoid paging on today’s computers. Lines 13–14 define a homebrew timer.

Two sections of the script are relevant if the default run directory is changed from /dev/shm/ to a directory in a conventional storage-backed file system: Lines 15–17 define the mechanism for clearing file data cached in DRAM; lines 23–30 set Linux kernel parameters to discourage eager paging.

Lines 37–70 spit out, compile, and run a little C program to generate a random text corpus. This program is fast, flexible, and deterministic, generating the same random output given the same parameters.

Lines 71–100 run the four different AWK approaches on the same random input, reporting separately the time to build and to query the associative array containing word frequencies.

#!/bin/csh -f
# Set PMG envar to path of pm-gawk executable and AWKLIBPATH                        #   2
#    to find rwarray.so                                                             #   3
# Requires "sudo" to work; consider this for /etc/sudoers file:                     #   4
#    Defaults:youruserid    !authenticate                                           #   5
echo 'begin:      ' `date` `date +%s`                                               #   6
unsetenv GAWK_PERSIST_FILE  # disable persistence until wanted                      #   7
set dir = '/dev/shm'        # where heap file et al. will live                      #   8
set tmr = '/usr/bin/time'   # can also use shell built-in "time"                    #   9
set isz = 1073741824        # input size; 1 GiB                                     #  10
# set isz = 100000000       # small input for quick testing                         #  11
cd $dir   # tick/tock/tyme below are homebrew timer, good within ~2ms               #  12
alias tick 'set t1 = `date +%s.%N`' ; alias tock 'set t2 = `date +%s.%N`'           #  13
alias tyme '$PMG -v t1=$t1 -v t2=$t2 "BEGIN{print t2-t1}"'                          #  14
alias tsync 'tick ; sync ; tock ; echo "sync time:  " `tyme`'                       #  15
alias drop_caches 'echo 3 | sudo tee /proc/sys/vm/drop_caches'                      #  16
alias snd 'tsync; drop_caches'                                                      #  17
echo "pm-gawk:     $PMG" ; echo 'std gawk:   ' `which gawk`                         #  18
echo "run dir:     $dir" ; echo 'pwd:        ' `pwd`                                #  19
echo 'dir content:'      ; ls -l $dir |& $PMG '{print "             " $0}'          #  20
echo 'timer:      ' $tmr ; echo 'AWKLIBPATH: ' $AWKLIBPATH                          #  21

echo 'OS params:' ; set vm = '/proc/sys/vm/dirty'                                   #  22
sudo sh -c "echo 100     > ${vm}_background_ratio"     # restore these              #  23
sudo sh -c "echo 100     > ${vm}_ratio"                # paging params              #  24
sudo sh -c "echo 1080000 > ${vm}_expire_centisecs"     # to defaults                #  25
sudo sh -c "echo 1080000 > ${vm}_writeback_centisecs"  # if necessary               #  26
foreach d ( ${vm}_background_ratio  ${vm}_ratio \                                   #  27
            ${vm}_expire_centisecs  ${vm}_writeback_centisecs )                     #  28
    printf "             %-38s %7d\n" $d `cat $d`                                   #  29
end                                                                                 #  30
tick ; tock ; echo 'timr ovrhd: ' `tyme` 's (around 2ms for TK)'                    #  31
tick ; $PMG 'BEGIN{print "pm-gawk?     yes"}'                                       #  32
tock ; echo 'pmg ovrhd:  ' `tyme` 's (around 4-5 ms for TK)'                        #  33
set inp = 'input.dat'                                                               #  34
echo 'input size  ' $isz                                                            #  35
echo "input file:  $inp"                                                            #  36
set rg = rgen  # spit out and compile C program to generate random inputs           #  37
rm -f $inp  $rg.c $rg                                                               #  38
cat <<EOF > $rg.c                                                                   #  39
// generate N random words, one per line, no blank lines                            #  40
// charset is e.g. 'abcdefg@' where '@' becomes newline                             #  41
#include <stdio.h>                                                                  #  42
#include <stdlib.h>                                                                 #  43
#include <string.h>                                                                 #  44
#define RCH     c = a[rand() % L];                                                  #  45
#define PICK    do { RCH } while (0)                                                #  46
#define PICKCH  do { RCH } while (c == '@')                                         #  47
#define FP(...) fprintf(stderr, __VA_ARGS__)                                        #  48
int main(int argc, char *argv[]) {                                                  #  49
  if (4 != argc) {                                                                  #  50
    FP("usage:  %s  charset  N  seed\n",                                            #  51
                    argv[0]);   return 1; }                                         #  52
  char c, *a =      argv[1];    size_t L = strlen(a);                               #  53
  long int N = atol(argv[2]);                                                       #  54
  srand(       atol(argv[3]));                                                      #  55
  if (2 > N) { FP("N == %ld < 2\n", N); return 2; }                                 #  56
  PICKCH;                                                                           #  57
  for (;;) {                                                                        #  58
    if (2 == N) { PICKCH; putchar(c); putchar('\n'); break; }                       #  59
    if ('@' == c) { putchar('\n'); PICKCH; }                                        #  60
    else          { putchar( c  ); PICK;   }                                        #  61
    if (0 >= --N) break;                                                            #  62
  }                                                                                 #  63
}                                                                                   #  64
EOF                                                                                 #  65
gcc -std=c11 -Wall -Wextra -O3 -o $rg $rg.c                                         #  66
set t = '@@@@@@@' ; set c = "abcdefghijklmnopqrstuvwxyz$t$t$t$t$t$t"                #  67
tick ; ./$rg "$c" $isz 47 > $inp ; tock ; echo 'gen time:   ' `tyme`                #  68
echo "input file:  $inp"                                                            #  69
echo 'input wc:   ' `wc < $inp` ; echo 'input uniq: ' `sort -u $inp | wc`           #  70

snd ############################################################################    #  71
tick ; $tmr $PMG '{n[$1]++}END{print "output: " n["foo"]}' $inp                     #  72
tock ; echo 'T naive O(N):          ' `tyme` ; echo ''                              #  73
rm -f rwa                                                                           #  74
snd ############################################################################    #  75
echo ''                                                                             #  76
tick ; $tmr $PMG -l rwarray '{n[$1]++}END{print "writea",writea("rwa",n)}' $inp     #  77
tock ; echo 'T rwarray build O(N):  ' `tyme` ; echo ''                              #  78
snd  # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #    #  79
tick ; $tmr $PMG -l rwarray 'BEGIN{print "reada",reada("rwa",n); \                  #  80
                                   print "output: " n["foo"]}'                      #  81
tock ; echo 'T rwarray query O(W):   ' `tyme` ; echo ''                             #  82
rm -f ft                                                                            #  83
snd ############################################################################    #  84
tick ; $tmr $PMG '{n[$1]++}END{for(w in n)print n[w], w}' $inp > ft                 #  85
tock ; echo 'T freqtbl build O(N):  ' `tyme` ; echo ''                              #  86
snd  # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #    #  87
tick ; $tmr $PMG '{n[$2] = $1}END{print "output: " n["foo"]}' ft                    #  88
tock ; echo 'T freqtbl query O(W):   ' `tyme` ; echo ''                             #  89
rm -f heap.pma                                                                      #  90
snd ############################################################################    #  91
truncate -s 3G heap.pma  # enlarge if needed                                        #  92
setenv GAWK_PERSIST_FILE heap.pma                                                   #  93
tick ; $tmr $PMG '{n[$1]++}' $inp                                                   #  94
tock ; echo 'T pm-gawk build O(N):  ' `tyme` ; echo ''                              #  95
snd  # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #    #  96
tick ; $tmr $PMG 'BEGIN{print "output: " n["foo"]}'                                 #  97
tock ; echo 'T pm-gawk query O(1):    ' `tyme` ; echo ''                            #  98
unsetenv GAWK_PERSIST_FILE                                                          #  99
snd ############################################################################    # 100
echo 'Note:  all output lines above should be identical' ; echo ''                  # 101
echo 'dir content:' ; ls -l $dir |& $PMG '{print "              " $0}'              # 102
echo '' ; echo 'storage footprints:'                                                # 103
foreach f ( rwa ft heap.pma )  # compression is very slow, so we comment it out     # 104
    echo "    $f " `du -BK $dir/$f` # `xz --best < $dir/$f | wc -c` 'bytes xz'      # 105
end                                                                                 # 106
echo '' ; echo 'end:        ' `date` `date +%s` ; echo ''                           # 107