Next: Acknowledgments, Previous: Performance, Up: General Introduction
Mishaps including power outages, OS kernel panics, scripting bugs, and command-line typos can harm your data, but precautions can mitigate these risks. In scripting scenarios it usually suffices to create safe backups of important files at appropriate times. As simple as this sounds, care is needed to achieve genuine protection and to reduce the costs of backups. Here’s a prudent yet frugal way to back up a heap file between uses:
    $ backup_base=heap_bk_`date +%s`
    $ cp --reflink=always heap.pma $backup_base.pma
    $ chmod a-w $backup_base.pma
    $ sync
    $ touch $backup_base.done
    $ chmod a-w $backup_base.done
    $ sync
    $ ls -l heap*
    -rw-rw-r--. 1 me me 4096000 Aug  6 15:53 heap.pma
    -r--r--r--. 1 me me       0 Aug  6 16:16 heap_bk_1659827771.done
    -r--r--r--. 1 me me 4096000 Aug  6 16:16 heap_bk_1659827771.pma
Timestamps in backup filenames make it easy to find the most recent copy if the heap file is damaged, even if the files’ last-modification times are inadvertently altered.
The cp command’s --reflink option reduces both the storage footprint of the copy and the time required to make it. Just as sparse files provide “pay as you go” storage footprints, reflink copying offers “pay as you change” storage costs. [5] A reflink copy shares storage with the original file. The file system ensures that subsequent changes to either file don’t affect the other. Reflink copying is not available on all file systems; XFS, Btrfs, and OCFS2 currently support it. [6] Fortunately you can install, say, an XFS file system inside an ordinary file on some other file system, such as ext4. [7]
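Because reflink support depends on the file system, a backup script may want a fallback. The following is a minimal sketch, assuming GNU cp; the helper name backup_copy is hypothetical and not part of any tool. It tries a reflink copy first and otherwise makes an ordinary sparse copy, reporting which method succeeded.

```shell
# Hypothetical helper: copy file $1 to $2, preferring a reflink copy.
# Assumes GNU cp, which provides --reflink and --sparse.
backup_copy() {
    if cp --reflink=always -- "$1" "$2" 2>/dev/null; then
        echo reflink    # storage is shared with the original until either file changes
    else
        # Reflink is unavailable here, so make a full copy that keeps holes sparse.
        cp --sparse=always -- "$1" "$2" && echo sparse
    fi
}
```

For example, `backup_copy heap.pma $backup_base.pma` prints `reflink` on a file system that supports reflink copying and `sparse` elsewhere. The sparse fallback preserves the storage footprint of a sparse heap file but not the speed or the shared storage of a reflink copy.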
After creating a backup copy of the heap file we use sync to force it down to durable media. Otherwise the copy may reside only in volatile DRAM (the file system’s cache), where an OS crash or power failure could corrupt it. [8] After syncing the backup we create and sync a “success indicator” file with extension .done to address a nasty corner case: power may fail while a backup is being copied from the primary heap file, leaving either file, or both, corrupt on storage. This possibility is particularly worrisome for jobs that run unattended. Upon reboot, each .done file attests that the corresponding backup succeeded, making it easy to identify the most recent successful backup.
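Selecting the most recent successful backup can itself be scripted. The sketch below assumes the naming scheme used above (heap_bk_ followed by epoch seconds, with extensions .pma and .done); the function name latest_good_backup is hypothetical. It compares the timestamps embedded in the filenames rather than file modification times, and it ignores any backup that lacks a .done marker.

```shell
# Hypothetical helper: print the newest backup in directory $1 (default: .)
# that has a matching .done marker. Returns nonzero if there is none.
latest_good_backup() (
    dir=${1:-.}
    best=
    best_ts=-1
    for done_file in "$dir"/heap_bk_*.done; do
        [ -e "$done_file" ] || continue            # the glob matched nothing
        base=${done_file%.done}
        ts=${base##*heap_bk_}                      # timestamp embedded in the name
        case $ts in ''|*[!0-9]*) continue ;; esac  # skip names without a numeric timestamp
        [ -f "$base.pma" ] || continue             # marker without a backup file
        if [ "$ts" -gt "$best_ts" ]; then
            best_ts=$ts
            best=$base.pma
        fi
    done
    [ -n "$best" ] && printf '%s\n' "$best"
)
```

After a crash, something like `cp --reflink=always "$(latest_good_backup .)" heap.pma` followed by `chmod u+w heap.pma` would restore the heap from that backup; whether to restore automatically or inspect the files first is a judgment call for each job.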
Finally, if you’re serious about tolerating failures you must “train as you would fight” by testing your hardware/software stack against realistic failures. For realistic power-failure testing, see https://queue.acm.org/detail.cfm?id=3400902.
[5] The system call that implements reflink copying is described in man ioctl_ficlone.
[6] The --reflink option creates copies as sparse as the original. If reflink copying is not available, --sparse=always should be used.
[7] See https://www.usenix.org/system/files/login/articles/login_winter19_08_kelly.pdf.
[8] On some OSes sync provides very weak guarantees, but on Linux sync returns only after all file system data are flushed down to durable storage. If your sync is unreliable, write a little C program that calls fsync() to flush a file. To be safe, also call fsync() on every enclosing directory on the file’s realpath() up to the root.
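The directory walk described in this footnote can be sketched in shell rather than C. This is a stand-in for the C program, not a replacement for it: it assumes a GNU coreutils sync that accepts file arguments and flushes each named file or directory (older versions ignore their arguments), plus GNU realpath and dirname. The function names enclosing_paths and flush_path are hypothetical.

```shell
# Print the resolved path of $1, then each enclosing directory up to /.
enclosing_paths() (
    p=$(realpath -- "$1") || exit 1
    while :; do
        printf '%s\n' "$p"
        [ "$p" = / ] && break
        p=$(dirname -- "$p")
    done
)

# Hypothetical helper: flush a file and every directory above it.
# Assumes `sync PATH` flushes just that path, as in recent GNU coreutils.
# Paths containing newlines are not handled.
flush_path() (
    enclosing_paths "$1" | while IFS= read -r p; do
        sync -- "$p" || exit 1
    done
)
```

For example, `flush_path $backup_base.pma` would flush the backup, then its directory, then each parent directory in turn. It returns nonzero if any flush fails, so a script can refuse to create the .done file in that case.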