Next: Experiments, Previous: Sparse Heap Files, Up: Performance


4.4 Persistence versus Durability

Arguably the most important general guideline for good performance in computer systems is, “pay only for what you need.”(3) To apply this maxim to pm-gawk we must distinguish two concepts that are frequently conflated: persistence and durability.(4) (A third logically distinct concept is the subject of Data Integrity.)

Persistent data outlive the processes that access them, but don’t necessarily last forever. For example, as explained in man mq_overview, message queues are persistent because they exist until the system shuts down. Durable data reside on a physical medium that retains its contents even without continuously supplied power. For example, hard disk drives and solid state drives store durable data. Confusion arises because persistence and durability are often correlated: Data in ordinary file systems backed by HDDs or SSDs are typically both persistent and durable. Familiarity with fsync() and msync() might lead us to believe that durability is a subset of persistence, but in fact the two characteristics are orthogonal: Data in the swap area are durable but not persistent; data in DRAM-backed file systems such as /dev/shm/ are persistent but not durable.
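
A quick shell session makes the distinction concrete. This is a minimal sketch assuming a Linux system with a DRAM-backed tmpfs mounted at /dev/shm/; the file name demo.txt is a hypothetical placeholder:

        $ echo persistent > /dev/shm/demo.txt   # one process writes the file...
        $ cat /dev/shm/demo.txt                 # ...a later process still sees it
        persistent

The file outlives the processes that wrote and read it, so it is persistent; because tmpfs resides in DRAM, a reboot would destroy it, so it is not durable.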

Durability often costs more than persistence, so performance-conscious pm-gawk users pay the added premium for durability only when persistence alone is not sufficient. Two ways to avoid unwanted durability overheads were discussed in Virtual Memory and Big Data: Place pm-gawk’s heap file in a DRAM-backed file system, or disable eager writeback to the heap file. Expedients such as these enable you to remove durability overheads from the critical path of multi-stage data analyses even when you want heap files to eventually be durable: Allow pm-gawk to run at full speed with persistence alone; force the heap file to durability (using the cp and sync utilities as necessary) after output has been emitted to the next stage of the analysis and the pm-gawk process using the heap has terminated.
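
For instance, the following sketch keeps durability off the critical path. It assumes pm-gawk’s GAWK_PERSIST_FILE environment variable and a truncate-created sparse heap file, as described earlier in this manual; the file names, sizes, and the AWK one-liner are hypothetical:

        $ truncate -s 1G /dev/shm/heap.pma        # sparse heap file in DRAM-backed fs
        $ export GAWK_PERSIST_FILE=/dev/shm/heap.pma
        $ gawk '{ freq[$1]++ } END { print length(freq) }' input.dat > output.dat
        $ unset GAWK_PERSIST_FILE
        $ cp /dev/shm/heap.pma /home/me/heap.pma  # after the run, copy to durable fs...
        $ sync /home/me/heap.pma                  # ...and flush it down to the device

The pm-gawk run itself pays only for persistence; the cp and sync happen after the output has already been handed to the next stage.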

Experimenting with synthetic data builds intuition for how persistence and durability affect performance. You can write a little AWK or C program to generate a stream of random text, or just cobble together a quick and dirty generator on the command line:

        $ openssl rand --base64 1000000 | tr -c a-zA-Z '\n' > random.dat

Varying the size of random inputs can, for example, reveal where performance “falls off a cliff” as pm-gawk’s memory footprint exceeds the capacity of DRAM and paging begins.
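
A sweep over input sizes might look like the following sketch; the sizes, file names, heap capacity, and word-counting script are arbitrary placeholders:

        $ for size in 1000000 10000000 100000000
        > do
        >     openssl rand --base64 $size | tr -c a-zA-Z '\n' > random.dat
        >     rm -f heap.pma && truncate -s 4G heap.pma   # fresh sparse heap per run
        >     time env GAWK_PERSIST_FILE=heap.pma gawk '{ freq[$1]++ }' random.dat
        > done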

Experiments require careful methodology, especially when the heap file resides in a storage-backed file system. Overlooking the file system’s DRAM cache can easily mislead the interpretation of results and foil repeatability. Fortunately, Linux allows us to invalidate the file system cache and thus mimic a “cold start” condition resembling the immediate aftermath of a machine reboot. Accesses to ordinary files on durable storage will then be served from the storage devices, not from the cache. Read about sync and /proc/sys/vm/drop_caches at
https://www.kernel.org/doc/html/latest/admin-guide/sysctl/vm.html.
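
The conventional recipe is shown below; it requires root privileges, and writing 3 drops the page cache together with the dentry and inode caches:

        $ sync                                        # flush dirty data to storage first
        $ echo 3 | sudo tee /proc/sys/vm/drop_caches  # invalidate clean cached pages
        3

Repeating this recipe before each timed run makes cold-start measurements repeatable.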


Footnotes

(3)

Remarkably, this guideline is widely ignored in surprising ways. Certain well-known textbook algorithms continue to grind away fruitlessly long after having computed all of their output.
See https://queue.acm.org/detail.cfm?id=3424304.

(4)

In recent years the term “persistent memory” has sometimes been used to denote novel byte-addressable non-volatile memory hardware—an unfortunate practice that contradicts sensible long-standing convention and causes needless confusion. NVM provides durability. Persistent memory is a software abstraction that doesn’t require NVM. See https://queue.acm.org/detail.cfm?id=3358957.

