Next: Experiments, Previous: Sparse Heap Files, Up: Performance
Arguably the most important general guideline for good performance in
computer systems is, “pay only for what you need.”(3) To apply this
maxim to pm-gawk we must distinguish two concepts that are frequently
conflated: persistence and durability.(4) (A third logically distinct
concept is the subject of Data Integrity.)
Persistent data outlive the processes that access them, but
don’t necessarily last forever. For example, as explained in “man
mq_overview”, message queues are persistent because they exist until
the system shuts down. Durable data reside on a
physical medium that retains its contents even without continuously
supplied power. For example, hard disk drives (HDDs) and solid state
drives (SSDs) store durable data. Confusion arises because persistence
and durability are often correlated: Data in ordinary file systems
backed by HDDs or SSDs are typically both persistent and durable.
Familiarity with fsync() and msync() might lead us to believe that
durability is a subset of persistence, but in fact the two
characteristics are orthogonal: Data in the swap area are durable but
not persistent; data in DRAM-backed file systems such as /dev/shm/ are
persistent but not durable.
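A short pm-gawk session makes the distinction concrete. Placing the
heap file in /dev/shm/ (the file name and size below are arbitrary), a
counter survives across separate gawk processes yet would vanish on
reboot:

$ truncate -s 100M /dev/shm/counter.pma    # sparse heap file in DRAM
$ export GAWK_PERSIST_FILE=/dev/shm/counter.pma
$ gawk 'BEGIN { print ++n }'
1
$ gawk 'BEGIN { print ++n }'
2

The variable n is persistent, because the heap file outlives each gawk
process, but it is not durable, because /dev/shm/ is DRAM-backed.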
Durability often costs more than persistence, so performance-conscious
pm-gawk users pay the added premium for durability only when
persistence alone is not sufficient. Two ways to avoid unwanted
durability overheads were discussed in Virtual Memory and Big Data:
Place pm-gawk’s heap file in a DRAM-backed file system, or disable
eager writeback to the heap file. Expedients such as these enable you
to remove durability overheads from the critical path of multi-stage
data analyses even when you want heap files to eventually be durable:
Allow pm-gawk to run at full speed with persistence alone; force the
heap file to durability (using the cp and sync utilities as necessary)
after output has been emitted to the next stage of the analysis and
the pm-gawk process using the heap has terminated.
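For instance, a two-stage analysis might run entirely against a
DRAM-backed heap and defer durability to the very end. A sketch,
assuming hypothetical script and path names (stage1.awk, stage2.awk,
/data/):

$ truncate -s 1G /dev/shm/analysis.pma
$ export GAWK_PERSIST_FILE=/dev/shm/analysis.pma
$ gawk -f stage1.awk input.dat > stage1.out    # full speed: persistence only
$ gawk -f stage2.awk stage1.out > final.out    # heap shared across stages
$ cp /dev/shm/analysis.pma /data/analysis.pma  # copy heap to durable storage
$ sync /data/analysis.pma                      # flush the copy to the device

Recent versions of the sync utility accept file arguments; if yours
does not, run plain sync, which flushes all file systems.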
Experimenting with synthetic data builds intuition for how persistence and durability affect performance. You can write a little AWK or C program to generate a stream of random text, or just cobble together a quick and dirty generator on the command line:
$ openssl rand --base64 1000000 | tr -c a-zA-Z '\n' > random.dat
Varying the size of random inputs can, for example, reveal where
performance “falls off the cliff” as pm-gawk’s memory footprint
exceeds the capacity of DRAM and paging begins.
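One way to probe for the cliff is to time the same job over inputs of
increasing size. A rough sketch, with illustrative sizes and a trivial
frequency-counting script (scale the input sizes to your machine’s
DRAM, and make the sparse heap file big enough for the largest run):

$ for n in 10000000 100000000 1000000000; do
>   openssl rand --base64 $n | tr -c a-zA-Z '\n' > random_$n.dat
>   truncate -s 10G heap_$n.pma    # fresh heap per run; size illustrative
>   GAWK_PERSIST_FILE=heap_$n.pma /usr/bin/time -v \
>     gawk '{ count[$1]++ }' random_$n.dat
> done

GNU time’s -v output reports the maximum resident set size, which
helps correlate slowdowns with the onset of paging.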
Experiments require careful methodology, especially when the heap file
is in a storage-backed file system. Overlooking the file system’s DRAM
cache can easily mislead the interpretation of results and foil
repeatability. Fortunately, Linux allows us to invalidate the file
system cache and thus mimic a “cold start” condition resembling the
immediate aftermath of a machine reboot. Accesses to ordinary files on
durable storage will then be served from the storage devices, not from
cache. Read about sync and /proc/sys/vm/drop_caches at
https://www.kernel.org/doc/html/latest/admin-guide/sysctl/vm.html.
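As documented there, the recipe requires root privileges: first flush
dirty data with sync, then evict clean cached pages before each timed
cold-start run:

$ sync
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3

Writing 3 drops both the page cache and the reclaimable slab caches;
subsequent timed runs then read the heap file from the storage device
rather than from DRAM.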
---------- Footnotes ----------

(3) Remarkably, this guideline is widely ignored in surprising ways.
Certain well-known textbook algorithms continue to grind away
fruitlessly long after having computed all of their output. See
https://queue.acm.org/detail.cfm?id=3424304.

(4) In recent years the term “persistent memory” has sometimes been
used to denote novel byte-addressable non-volatile memory (NVM)
hardware, an unfortunate practice that contradicts sensible
long-standing convention and causes needless confusion. NVM provides
durability. Persistent memory is a software abstraction that doesn’t
require NVM. See https://queue.acm.org/detail.cfm?id=3358957.