Next: Sparse Heap Files, Previous: Constant-Time Array Access, Up: Performance
Small data sets seldom spoil the delights of AWK by causing
performance troubles, with or without persistence. As the size of the
gawk interpreter’s internal data structures approaches the capacity
of physical memory, however, acceptable performance requires
understanding modern operating systems and sometimes tuning them.
Fortunately pm-gawk offers new degrees of control for
performance-conscious users tackling large data sets. A terse
mnemonic captures the basic principle: Precluding paging promotes peak
performance and prevents perplexity.
Modern operating systems feature virtual memory that strives to appear both larger than installed DRAM (which is small) and faster than installed storage devices (which are slow). As a program’s memory footprint approaches the capacity of DRAM, the virtual memory system transparently pages (moves) the program’s data between DRAM and a swap area on a storage device. Paging can degrade performance mildly or severely, depending on the program’s memory access patterns. Random accesses to large data structures can trigger excessive paging and dramatic slowdown. Unfortunately, the hash tables beneath AWK’s signature associative arrays inherently require random memory accesses, so large associative arrays can be problematic.
Persistence changes the rules in our favor: The OS pages data to
pm-gawk’s heap file instead of the swap area. This won’t help
performance much if the heap file resides in a conventional
storage-backed file system. On Unix-like systems, however, we may
place the heap file in a DRAM-backed file system such as
/dev/shm/, which entirely prevents paging to slow storage
devices. Temporarily placing the heap file in such a file system is a
reasonable expedient, with two caveats: First, keep in mind that
DRAM-backed file systems perish when the machine reboots or crashes,
so you must copy the heap file to a conventional storage-backed file
system when your computation is done. Second, pm-gawk’s memory
footprint can’t exceed available DRAM if you place the heap file in a
DRAM-backed file system.
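The expedient described above can be sketched as a small shell helper. This is a minimal sketch under stated assumptions, not part of pm-gawk: the function name run_with_dram_heap and the DRAM_DIR variable are hypothetical, the heap file is assumed to be located through the GAWK_PERSIST_FILE environment variable, and /dev/shm/ is assumed to be a DRAM-backed file system on your machine.

```shell
# Hypothetical helper (not part of pm-gawk): run a command with its heap file
# staged in a DRAM-backed directory, then copy the heap back to durable storage.
# DRAM_DIR is an assumed override; /dev/shm is the usual DRAM-backed mount on Linux.
run_with_dram_heap() {
    durable_heap=$1
    shift
    dram_heap="${DRAM_DIR:-/dev/shm}/$(basename "$durable_heap").$$"

    # Stage the existing heap file in the DRAM-backed file system.
    cp "$durable_heap" "$dram_heap" || return 1

    # Run the given command with GAWK_PERSIST_FILE pointing at the DRAM copy.
    GAWK_PERSIST_FILE=$dram_heap "$@"
    status=$?

    # DRAM-backed files perish on reboot or crash, so save the heap now.
    cp "$dram_heap" "$durable_heap" || return 1
    rm -f "$dram_heap"
    return "$status"
}

# Example call (heap.pma and script.awk are illustrative names):
#   run_with_dram_heap heap.pma gawk -f script.awk
```

The helper addresses only the first caveat, and only when the command finishes normally: a crash or reboot in mid-computation still loses whatever changed since the heap was staged. The second caveat stands unchanged, because the staged heap must still fit in available DRAM.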
Tuning OS paging parameters may improve performance if you decide to
run pm-gawk with a heap file in a conventional storage-backed file
system. Some OSes have unhelpful default habits regarding modified
(“dirty”) memory backed by files. For example, the OS may write
dirty memory pages to the heap file periodically and/or when the OS
believes that “too much” memory is dirty. Such “eager” writeback
can degrade performance noticeably and brings no benefit to pm-gawk.
Fortunately some OSes allow paging defaults to be overridden so that
writeback is “lazy” rather than eager. For Linux see the discussion
of the dirty_* parameters at
https://www.kernel.org/doc/html/latest/admin-guide/sysctl/vm.html.
Changing these parameters can prevent wasteful eager paging: [2]

    $ echo 100    | sudo tee /proc/sys/vm/dirty_background_ratio
    $ echo 100    | sudo tee /proc/sys/vm/dirty_ratio
    $ echo 300000 | sudo tee /proc/sys/vm/dirty_expire_centisecs
    $ echo 50000  | sudo tee /proc/sys/vm/dirty_writeback_centisecs

Tuning paging parameters can help non-persistent gawk as well as
pm-gawk. [Disclaimer: OS tuning is an occult art, and your mileage may
vary.]
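Because mileage varies, it may be prudent to record the current writeback settings before changing them, so that they can be restored afterward. The following is a minimal sketch: the function name show_dirty_params and the VM_DIR variable are hypothetical and exist only for illustration, and on a real Linux system the files are read from /proc/sys/vm.

```shell
# Hypothetical helper: print the current Linux writeback settings as
# "vm.name=value" lines. VM_DIR is an assumed override for illustration;
# by default the values are read from /proc/sys/vm.
show_dirty_params() {
    for name in dirty_background_ratio dirty_ratio \
                dirty_expire_centisecs dirty_writeback_centisecs
    do
        printf 'vm.%s=%s\n' "$name" "$(cat "${VM_DIR:-/proc/sys/vm}/$name")"
    done
}

# Example (dirty.saved is an illustrative file name):
#   show_dirty_params > dirty.saved     # before tuning
#   sudo sysctl -p dirty.saved          # later, to restore the old values
```

The output uses the key=value form read by sysctl -p, so the saved file can be fed back to sysctl once the large computation is finished.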
[2] The tee rigmarole is explained at
https://askubuntu.com/questions/1098059/which-is-the-right-way-to-drop-caches-in-lubuntu.