The pm-gawk persistence feature is based on a new persistent memory
allocator, pma, whose design is described in
https://queue.acm.org/detail.cfm?id=3534855. It is instructive
to trace the evolutionary paths that led to pma and pm-gawk.
I wrote many AWK scripts during my dissertation research on Web
caching twenty years ago, most of which processed log files from Web
servers and Web caches. Persistent gawk would have made these
scripts smaller, faster, and easier to write, but at the time I was
unable even to imagine that pm-gawk was possible. So I wrote a lot of
bothersome, inefficient code that manually dumped and re-loaded AWK
script variables to and from text files. A decade would pass before
my colleagues and I began to connect the dots that make persistent
scripting possible, and a further decade would pass before pm-gawk
came together.
Circa 2011 while working at HP Labs I developed a fault-tolerant
distributed computing platform called “Ken,” which contained a
persistent memory allocator that resembles a simplified pma: It
presented a malloc()-like C interface and it allocated memory
from a file-backed memory mapping. Experience with Ken convinced me
that the software abstraction of persistent memory offers important
attractions compared with the alternatives for managing persistent
data (e.g., relational databases and key-value stores).
Unfortunately, Ken’s allocator is so deeply intertwined with the rest
of Ken that it’s essentially inseparable; to enjoy the benefits of
Ken’s persistent memory, one must “buy in” to a larger and more
complicated value proposition. Whatever its other virtues might be,
Ken isn’t ideal for showcasing the benefits of persistent memory in
isolation.
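
The general idea of a malloc()-like interface backed by a file mapping can be illustrated with a toy bump allocator. This is only a minimal sketch under stated assumptions: it assumes a POSIX system, every name in it (toy_init, toy_malloc, toy_set_root, and so on) is hypothetical, and it is not taken from Ken or pma, whose real designs are considerably more capable (this sketch never frees memory, for example).

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* All names here are hypothetical; this is not Ken's or pma's API. */
#define HEAP_SIZE ((size_t)1 << 20)             /* fixed 1 MiB heap file */
#define ALIGN_UP(n) (((n) + 15) & ~(size_t)15)  /* round up to 16 bytes */

typedef struct {
    size_t used;  /* bytes consumed so far, header included */
    size_t root;  /* offset of a user-chosen "root" object, 0 if unset */
} heap_header;

static heap_header *heap;  /* start of the current mapping */

/* Map the heap file, creating and sizing it on first use.
   Returns 0 on success, -1 on failure. */
int toy_init(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, (off_t)HEAP_SIZE) != 0) { close(fd); return -1; }
    void *p = mmap(NULL, HEAP_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);  /* the mapping remains valid after close() */
    if (p == MAP_FAILED) return -1;
    heap = p;
    if (heap->used == 0)  /* a freshly created file is zero-filled */
        heap->used = ALIGN_UP(sizeof(heap_header));
    return 0;
}

/* Hand out the next n bytes of the mapped file; never frees. */
void *toy_malloc(size_t n)
{
    if (heap == NULL || n == 0 || n > HEAP_SIZE) return NULL;
    n = ALIGN_UP(n);
    if (n > HEAP_SIZE - heap->used) return NULL;
    void *p = (char *)heap + heap->used;
    heap->used += n;
    return p;
}

/* Record an object as the entry point to the persistent data.
   It is stored as an offset so it survives remapping. */
void toy_set_root(void *p)
{
    heap->root = (size_t)((char *)p - (char *)heap);
}

void *toy_get_root(void)
{
    return heap->root ? (char *)heap + heap->root : NULL;
}

void toy_close(void)
{
    munmap(heap, HEAP_SIZE);
    heap = NULL;
}
```

A program would call toy_init() with a heap file name, allocate with toy_malloc(), register one object with toy_set_root(), and on a later run call toy_init() on the same file and recover its data through toy_get_root(). The root is kept as an offset because, in this sketch, nothing guarantees that the file is mapped at the same address on every run.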
Another entangled aspect of Ken was a crash-tolerance mechanism that,
in retrospect, can be viewed as a user-space implementation of
failure-atomic msync(). The first post-Ken disentanglement
effort isolated the crash-tolerance mechanism and implemented it in
the Linux kernel, calling the result “failure-atomic msync()”
(FAMS). FAMS strengthens the semantics of standard msync() by
guaranteeing that the durable state of a memory-mapped file always
reflects the most recent successful msync() call, even in the
presence of failures such as power
outages and OS or application crashes. The original Linux kernel FAMS
prototype is described in a paper by Park et al. in EuroSys 2013. My
colleagues and I subsequently implemented FAMS in several different
ways including in file systems (FAST 2015) and user-space libraries.
My most recent FAMS implementation, which leverages the reflink
copying feature described elsewhere in this manual, is now the
foundation of a new crash-tolerance feature in the venerable and
ubiquitous GNU dbm (gdbm) database
(https://queue.acm.org/detail.cfm?id=3487353).
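
For readers unfamiliar with msync(), the following minimal sketch shows the ordinary POSIX call whose semantics FAMS strengthens. It assumes a POSIX system; the helper name bump_counter is hypothetical, and the code does not implement FAMS. With ordinary msync(), the operating system may write modified pages back to the file at any time, so a crash between calls can leave the file reflecting only some of the changes made since the last call; that is the gap a failure-atomic variant closes.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: increment a 64-bit counter stored at the start
   of a file through a shared mapping, then ask the OS to make the
   change durable. Returns the new value, or -1 on error. */
long long bump_counter(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return -1;
    /* Ensure the file is large enough; existing contents are kept. */
    if (ftruncate(fd, (off_t)sizeof(int64_t)) != 0) { close(fd); return -1; }
    int64_t *c = mmap(NULL, sizeof *c, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    close(fd);
    if (c == MAP_FAILED) return -1;

    *c += 1;  /* an ordinary store into the mapped file */

    /* Ordinary msync(): blocks until the update has been written back.
       It makes no promise about what the file holds if a crash occurs
       before or during the call. */
    int rc = msync(c, sizeof *c, MS_SYNC);
    long long v = (rc == 0) ? (long long)*c : -1;
    munmap(c, sizeof *c);
    return v;
}
```

Calling bump_counter() twice on a new file yields 1 and then 2, because the counter lives in the file rather than in the process.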
In recent years my attention has returned to the advantages of persistent memory programming, lately a hot topic thanks to the commercial availability of byte-addressable non-volatile memory hardware (which, confusingly, is nowadays marketed as “persistent memory”). The software abstraction of persistent memory and the corresponding programming style, however, are perfectly compatible with conventional computers—machines with neither non-volatile memory nor any other special hardware or software. I wrote a few papers making this point, for example https://queue.acm.org/detail.cfm?id=3358957.
In early 2022 I wrote a new stand-alone persistent memory allocator,
pma, to make persistent memory programming easy on conventional
hardware. The pma interface is compatible with malloc() and, unlike
Ken’s allocator, pma is not coupled to a particular crash-tolerance
mechanism. Using pma is easy and, at least to some, enjoyable.
Ken had been integrated into prototype forks of both the V8 JavaScript
interpreter and a Scheme interpreter, so it was natural to consider
whether pma might similarly enhance an interpreted scripting
language. GNU AWK was a natural choice because the source code is
orderly and because gawk has a single primary maintainer with an
open mind regarding new features.
Jianan Li, Zi Fan Tan, Haris Volos, and I began considering
persistence for gawk in late 2021. While I was writing pma, they
prototyped pm-gawk in a fork of the gawk source. Experience
with the prototype confirmed the expected convenience and efficiency
benefits of pm-gawk, and by spring 2022 Arnold Robbins was
implementing persistence in the official version of gawk. The
persistence feature in official gawk differs slightly from the
prototype: The
former uses an environment variable to pass the heap file name to the
interpreter whereas the latter uses a mandatory command-line option.
In many respects, however, the two implementations are similar. A
description of the prototype, including performance measurements, is
available at
http://nvmw.ucsd.edu/nvmw2022-program/nvmw2022-data/nvmw2022-paper35-final_version_your_extended_abstract.pdf.
I enjoy several aspects of pm-gawk. It’s unobtrusive; as you gain
familiarity and experience, it fades into the background of your
scripting. It’s simple in both concept and implementation, and more
importantly it simplifies your scripts; much of its value is measured
not in the code it enables you to write but rather in the code it lets
you discard. It’s all that I needed for my dissertation research
twenty years ago, and more. Anecdotally, it appears to inspire
creativity in early adopters, who have devised uses that pm-gawk’s
designers never anticipated. I’m curious to see what new purposes
you find for it.