Due to historical reasons, there are several formats of tar archives. All of them are based on the same principles, but have some subtle differences that often make them incompatible with each other.
GNU tar is able to create and handle archives in a variety of formats. The most frequently used formats are (in alphabetical order):
‘gnu’
Format used by GNU tar versions up to 1.13.25. This format is derived from an early POSIX standard, adding some improvements such as sparse file handling and incremental archives. Unfortunately these features were implemented in a way incompatible with other archive formats.
Archives in ‘gnu’ format are able to hold file names of unlimited length.
‘oldgnu’
Format used by GNU tar versions prior to 1.12.
‘v7’
Archive format, compatible with the V7 implementation of tar. This format imposes a number of limitations, several of which are summarized in the table below.
This format has traditionally been used by Automake when producing Makefiles. This practice will change in the future; in the meantime, however, it means that projects containing file names more than 100 bytes long will not be able to use GNU tar 1.35 and Automake prior to 1.9.
‘ustar’
Archive format defined by POSIX.1-1988 and later. It stores symbolic ownership information. It is also able to store special files. However, it imposes several restrictions as well, some of which are summarized in the table below.
‘star’
The format used by the late Jörg Schilling’s star implementation. GNU tar is able to read ‘star’ archives but currently does not produce them.
‘posix’
The format defined by POSIX.1-2001 and later. This is the most flexible and feature-rich format. It does not impose arbitrary restrictions on file sizes or file name lengths. This format is more recent, so some tar implementations cannot handle it properly. However, any tar implementation able to read ‘ustar’ archives should be able to read most ‘posix’ archives as well, except that it will extract any additional information (such as long file names) as extra plain text files.
This archive format will be the default format for future versions of GNU tar.
The following table summarizes the limitations of each of these formats:
Format | UID | File Size | File Name | Devn |
---|---|---|---|---|
gnu | 1.8e19 | Unlimited | Unlimited | 63 |
oldgnu | 1.8e19 | Unlimited | Unlimited | 63 |
v7 | 2097151 | 8 GiB - 1 | 99 | n/a |
ustar | 2097151 | 8 GiB - 1 | 255 | 21 |
posix | Unlimited | Unlimited | Unlimited | Unlimited |
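The numeric limits in the ‘v7’ and ‘ustar’ rows are what you get when a value is stored as a fixed-width string of octal digits. The following sketch checks the arithmetic, assuming 7 octal digits for a user ID and 11 for a file size; these widths are an assumption about the header layout and are not stated in the table itself:

```shell
#!/bin/sh
# Assumption: IDs occupy 7 octal digits (21 bits), sizes 11 octal digits (33 bits).
uid_max=$(( (1 << 21) - 1 ))
size_max=$(( (1 << 33) - 1 ))
gib8=$(( 8 * 1024 * 1024 * 1024 ))

echo "largest UID:  $uid_max"
echo "largest size: $size_max bytes (8 GiB is $gib8 bytes)"
```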
The default format for GNU tar is defined at compilation time. You may check it by running tar --help and examining the last lines of its output. Usually, GNU tar is configured to create archives in ‘gnu’ format; however, a future version will switch to ‘posix’.
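Because the default is a build-time choice, a script that cares about the archive format can name it explicitly with ‘--format’. A minimal sketch, assuming GNU tar is the tar found in PATH; the directory and file names are made up for illustration:

```shell
#!/bin/sh
set -e
work=$(mktemp -d)                  # scratch directory (hypothetical layout)
mkdir "$work/src"
echo hello > "$work/src/file.txt"

# The last lines of --help report the compiled-in defaults.
tar --help | tail -n 5

# Archive the same tree in two explicitly chosen formats.
tar --format=ustar -cf "$work/demo-ustar.tar" -C "$work" src
tar --format=posix -cf "$work/demo-posix.tar" -C "$work" src

# Both archives list the same members, whatever the default format is.
ustar_list=$(tar -tf "$work/demo-ustar.tar")
posix_list=$(tar -tf "$work/demo-posix.tar")
echo "$posix_list"
rm -rf "$work"
```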
8.1 Using Less Space through Compression
8.2 Handling File Attributes
8.3 Making tar Archives More Portable
8.4 Making tar Archives More Reproducible
8.5 Comparison of tar and cpio
8.1.1 Creating and Reading Compressed Archives
8.1.2 Archiving Sparse Files
GNU tar is able to create and read compressed archives. It supports a wide variety of compression programs, namely: gzip, bzip2, lzip, lzma, lzop, zstd, xz and traditional compress. The latter is supported mostly for backward compatibility, and we recommend against using it, because it is far less effective than the other compression programs (21).
Creating a compressed archive is simple: you just specify a compression option along with the usual archive creation commands. Available compression options are summarized in the table below:
Long | Short | Archive format |
---|---|---|
‘--gzip’ | ‘-z’ | gzip |
‘--bzip2’ | ‘-j’ | bzip2 |
‘--xz’ | ‘-J’ | xz |
‘--lzip’ | | lzip |
‘--lzma’ | | lzma |
‘--lzop’ | | lzop |
‘--zstd’ | | zstd |
‘--compress’ | ‘-Z’ | compress |
For example:
$ tar czf archive.tar.gz .
You can also let GNU tar select the compression program based on the suffix of the archive file name. This is done using the ‘--auto-compress’ (‘-a’) command line option. For example, the following invocation will use bzip2 for compression:
$ tar caf archive.tar.bz2 .
whereas the following one will use lzma:
$ tar caf archive.tar.lzma .
For a complete list of file name suffixes recognized by GNU tar, see auto-compress.
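The suffix-based selection can be checked without inspecting the archive by hand. The sketch below assumes GNU tar and gzip are installed and uses made-up file names; it creates two archives whose suffixes both map to gzip and asks gzip itself to validate them:

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
mkdir "$work/src"
echo hello > "$work/src/file.txt"

# -a picks the compressor from the suffix; both suffixes below select gzip.
tar -caf "$work/one.tar.gz" -C "$work" src
tar -caf "$work/two.tgz" -C "$work" src

# gzip -t succeeds only if the file really contains valid gzip data.
gzip -t "$work/one.tar.gz" && one_ok=yes
gzip -t "$work/two.tgz" && two_ok=yes
echo "one.tar.gz: $one_ok, two.tgz: $two_ok"
rm -rf "$work"
```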
Reading a compressed archive is even simpler: you don’t need to specify any additional options, as GNU tar recognizes its format automatically. Thus, the following commands will list and extract the archive created in the previous example:
# List the compressed archive
$ tar tf archive.tar.gz
# Extract the compressed archive
$ tar xf archive.tar.gz
The format recognition algorithm is based on signatures, special byte sequences at the beginning of a file that are specific to certain compression formats. If this approach fails, tar falls back to using the archive name suffix to determine its format (see auto-compress, for a list of recognized suffixes).
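To see signature-based recognition at work, one can give a compressed archive a name whose suffix says nothing about its format. This is only a sketch, assuming GNU tar, gzip and od are available; ‘archive.bin’ is an arbitrary name chosen for the example:

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
mkdir "$work/src"
echo hello > "$work/src/file.txt"

tar -czf "$work/archive.tar.gz" -C "$work" src
# Rename the archive so that its suffix no longer hints at gzip.
mv "$work/archive.tar.gz" "$work/archive.bin"

# gzip data starts with the two bytes 1f 8b; this is the signature.
magic=$(od -An -tx1 -N2 "$work/archive.bin" | tr -d ' \n')
# No decompression option is given: tar relies on the signature alone.
listing=$(tar -tf "$work/archive.bin")

echo "signature: $magic"
echo "$listing"
rm -rf "$work"
```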
Some compression programs are able to handle different compression formats. GNU tar takes advantage of this if the principal decompressor for the given format is not available. For example, if compress is not installed, tar will try to use gzip. As of version 1.35 the following alternatives are tried (22):
Format | Main decompressor | Alternatives |
---|---|---|
compress | compress | gzip |
lzma | lzma | xz |
bzip2 | bzip2 | lbzip2 |
The only case when you have to specify a decompression option while reading the archive is when reading from a pipe or from a tape drive that does not support random access. However, in this case GNU tar will indicate which option you should use. For example:
$ cat archive.tar.gz | tar tf -
tar: Archive is compressed. Use -z option
tar: Error is not recoverable: exiting now
If you see such diagnostics, just add the suggested option to the invocation of GNU tar:
$ cat archive.tar.gz | tar tzf -
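In a script, the pipe case can be handled in either of two ways: pass the decompression option to tar, or decompress earlier in the pipeline so that tar sees a plain archive. A short sketch of both, assuming GNU tar and gzip, with made-up file names:

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
mkdir "$work/src"
echo hello > "$work/src/file.txt"
tar -czf "$work/archive.tar.gz" -C "$work" src

# Variant 1: tell tar which decompressor to use, since it cannot probe a pipe.
with_option=$(cat "$work/archive.tar.gz" | tar -tzf -)

# Variant 2: decompress in the pipeline; tar then reads an uncompressed stream.
with_filter=$(gzip -dc "$work/archive.tar.gz" | tar -tf -)

echo "$with_option"
rm -rf "$work"
```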
Notice also that there are several restrictions on operations on compressed archives. First of all, compressed archives cannot be modified, i.e., you cannot update (‘--update’, alias ‘-u’) them or delete (‘--delete’) members from them or add (‘--append’, alias ‘-r’) members to them. Likewise, you cannot append another tar archive to a compressed archive using ‘--concatenate’ (‘-A’). Secondly, multi-volume archives cannot be compressed.
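One way around the first restriction is to decompress the archive, modify the plain archive, and compress it again. The following is a sketch of that workaround rather than a tar feature; it assumes GNU tar and gzip and uses made-up file names:

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
mkdir "$work/src"
echo one > "$work/src/a.txt"
tar -czf "$work/archive.tar.gz" -C "$work" src
echo two > "$work/src/b.txt"

# 1. Decompress: gzip -d replaces archive.tar.gz with archive.tar.
gzip -d "$work/archive.tar.gz"
# 2. Append the new member to the uncompressed archive.
tar -rf "$work/archive.tar" -C "$work" src/b.txt
# 3. Compress again.
gzip "$work/archive.tar"

listing=$(tar -tf "$work/archive.tar.gz")
echo "$listing"
rm -rf "$work"
```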
The following options select a particular compressor program:
‘-z’
‘--gzip’
Filter the archive through gzip.
‘-J’
‘--xz’
Filter the archive through xz.
‘-j’
‘--bzip2’
Filter the archive through bzip2.
‘--lzip’
Filter the archive through lzip.
‘--lzma’
Filter the archive through lzma.
‘--lzop’
Filter the archive through lzop.
‘--zstd’
Filter the archive through zstd.
‘-Z’
‘--compress’
Filter the archive through compress.
When any of these options is given, GNU tar searches for the compressor binary in the current path and invokes it. The name of the compressor program is specified at compilation time using a corresponding ‘--with-compname’ option to configure, e.g. ‘--with-bzip2’ to select a specific bzip2 binary. See section Using lbzip2 with GNU tar, for a detailed discussion.
The output produced by tar --help shows the actual compressor names along with each of these options.
You can use any of these options on physical devices (tape drives, etc.) and remote files as well as on normal files; data to or from such devices or remote files is reblocked by another copy of the tar program to enforce the specified (or default) record size. The default compression parameters are used. You can override them by using the ‘-I’ option (see below), e.g.:
$ tar -cf archive.tar.gz -I 'gzip -9 -n' subdir
A more traditional way to do this is to use a pipe:
$ tar cf - subdir | gzip -9 -n > archive.tar.gz
Compressed archives are easily corrupted, because compressed files have little redundancy. The adaptive nature of the compression scheme means that the compression tables are implicitly spread all over the archive. If you lose a few blocks, the dynamic construction of the compression tables becomes unsynchronized, and there is little chance that you could recover later in the archive.
Other compression options provide better control over creating compressed archives. These are:
‘-a’
‘--auto-compress’
Select a compression program to use by the archive file name suffix. The following suffixes are recognized:
Suffix | Compression program |
---|---|
‘.gz’ | gzip |
‘.tgz’ | gzip |
‘.taz’ | gzip |
‘.Z’ | compress |
‘.taZ’ | compress |
‘.bz2’ | bzip2 |
‘.tz2’ | bzip2 |
‘.tbz2’ | bzip2 |
‘.tbz’ | bzip2 |
‘.lz’ | lzip |
‘.lzma’ | lzma |
‘.tlz’ | lzma |
‘.lzo’ | lzop |
‘.xz’ | xz |
‘.zst’ | zstd |
‘.tzst’ | zstd |
‘-I command’
‘--use-compress-program=command’
Use external compression program command. Use this option if you want to specify options for the compression program, or if you are not happy with the compression program associated with the suffix at compile time, or if you have a compression program that GNU tar does not support. The command argument is a valid command invocation, as you would type it at the command line prompt, with any additional options as needed. Enclose it in quotes if it contains white space (see section Running External Commands).
The command should follow two conventions:
First, when invoked without additional options, it should read data from standard input, compress it and output it on standard output.
Secondly, if invoked with the additional ‘-d’ option, it should do exactly the opposite, i.e., read the compressed data from the standard input and produce uncompressed data on the standard output.
The latter requirement means that you must not use the ‘-d’ option as a part of the command itself.
The ‘--use-compress-program’ option, in particular, lets you implement your own filters, not necessarily dealing with compression/decompression. For example, suppose you wish to implement PGP encryption on top of compression, using gpg (see gpg, the encryption and signing tool, in the GNU Privacy Guard Manual). The following script does that:
#! /bin/sh
case $1 in
-d) gpg --decrypt - | gzip -d -c;;
'') gzip -c | gpg -s;;
*) echo "Unknown option $1">&2; exit 1;;
esac
Suppose you name it ‘gpgz’ and save it somewhere in your PATH. Then the following command will create a compressed archive signed with your private key:
$ tar -cf foo.tar.gpgz -Igpgz .
Likewise, the command below will list its contents:
$ tar -tf foo.tar.gpgz -Igpgz .
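The two conventions can be tried out without any keys by writing a filter that only wraps gzip. In the sketch below the script name ‘gzwrap’ and all paths are hypothetical, and GNU tar and gzip are assumed to be installed; the filter is given by its full path so that it need not be in PATH:

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
mkdir "$work/src"
echo hello > "$work/src/file.txt"

# A filter obeying both conventions: compress by default, decompress with -d.
cat > "$work/gzwrap" <<'EOF'
#!/bin/sh
case $1 in
-d) gzip -d -c ;;
'') gzip -c ;;
*)  echo "Unknown option $1" >&2; exit 1 ;;
esac
EOF
chmod +x "$work/gzwrap"

# tar runs the filter without options when creating the archive...
tar -cf "$work/demo.tar.gzw" -I "$work/gzwrap" -C "$work" src
# ...and adds -d itself when reading it back.
listing=$(tar -tf "$work/demo.tar.gzw" -I "$work/gzwrap")
echo "$listing"
rm -rf "$work"
```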
8.1.1.1 Using lbzip2 with GNU tar.
Lbzip2 is a multithreaded utility for handling ‘bzip2’ compression, written by Laszlo Ersek. It makes use of multiple processors to speed up its operation and in general works considerably faster than bzip2. For a detailed description of lbzip2 see http://freshmeat.net/projects/lbzip2 and lbzip2: parallel bzip2 utility.
Recent versions of lbzip2 are mostly command line compatible with bzip2, which makes it possible to automatically invoke it via the ‘--bzip2’ GNU tar command line option. To do so, GNU tar must be configured with the ‘--with-bzip2’ command line option, like this:
$ ./configure --with-bzip2=lbzip2 [other-options]
Once configured and compiled this way, tar --help will show the following:
$ tar --help | grep -- --bzip2
-j, --bzip2 filter the archive through lbzip2
which means that running tar --bzip2 will invoke lbzip2.
Files in the file system occasionally have holes. A hole in a file is a section of the file’s contents which was never written. The contents of a hole read as all zeros. On many operating systems, actual disk storage is not allocated for holes, but they are counted in the length of the file. If you archive such a file, tar could create an archive longer than the original. To have tar attempt to recognize the holes in a file, use ‘--sparse’ (‘-S’). When you use this option, then, for any file using less disk space than would be expected from its length, tar searches the file for holes. It then records in the archive where the holes (consecutive stretches of zeros) are, and archives only the “real contents” of the file. On extraction (using ‘--sparse’ is not needed on extraction), holes are re-created in any such file wherever they were found. Thus, if you use ‘--sparse’, tar archives won’t take more space than the original.
GNU tar uses two methods for detecting holes in sparse files. These methods are described later in this subsection.
‘-S’
‘--sparse’
This option instructs tar to test each file for sparseness before attempting to archive it. If the file is found to be sparse, it is treated specially, which reduces the amount of space used by its image in the archive.
This option is meaningful only when creating or updating archives. It has no effect on extraction.
Consider using ‘--sparse’ when performing file system backups, to avoid archiving the expanded forms of files stored sparsely in the system.
Even if your system has no sparse files currently, some may be created in the future. If you use ‘--sparse’ while making file system backups as a matter of course, you can be assured the archive will never take more space on the media than the files take on disk (otherwise, archiving a disk filled with sparse files might take hundreds of tapes). See section Using tar to Perform Incremental Dumps.
However, be aware that the ‘--sparse’ option may present a serious drawback. Namely, in order to determine the positions of holes in a file, tar may have to read it before trying to archive it, so in total the file may be read twice. This may happen when your OS or your FS does not support the SEEK_HOLE/SEEK_DATA feature of lseek (see ‘--hole-detection’, below).
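The effect of ‘--sparse’ on archive size is easy to observe with a file that is one large hole. This is a sketch only: it assumes GNU tar, a dd that extends the file when seeking past its end, and a file system that supports holes; on a file system without sparse-file support the two archives may come out the same size.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
mkdir "$work/src"

# Create a 16 MiB file without writing any data: seek to the end, copy nothing.
dd if=/dev/zero of="$work/src/sparse.img" bs=1 count=0 seek=16777216 2>/dev/null

tar -cf "$work/plain.tar" -C "$work" src      # holes archived as literal zeros
tar -cSf "$work/sparse.tar" -C "$work" src    # holes recorded, not stored

plain_size=$(wc -c < "$work/plain.tar")
sparse_size=$(wc -c < "$work/sparse.tar")
echo "without --sparse: $plain_size bytes"
echo "with --sparse:    $sparse_size bytes"
rm -rf "$work"
```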
When using the ‘POSIX’ archive format, GNU tar is able to store sparse files in three distinct ways, called sparse formats. A sparse format is identified by its number, consisting, as usual, of two decimal numbers delimited by a dot. By default, format ‘1.0’ is used. If, for some reason, you wish to use an earlier format, you can select it using the ‘--sparse-version’ option.
‘--sparse-version=version’
Select the format to store sparse files in. Valid version values are: ‘0.0’, ‘0.1’ and ‘1.0’. See section Storing Sparse Files, for a detailed description of each format.
Using the ‘--sparse-version’ option implies ‘--sparse’.
‘--hole-detection=method’
Enforce concrete hole detection method. Before the real contents of a sparse file are stored, tar needs to gather knowledge about file sparseness. This is because it needs to have the file’s map of holes stored into the tar header before it starts archiving the file contents. Currently, two methods of hole detection are implemented:
‘--hole-detection=seek’
Uses the SEEK_HOLE and SEEK_DATA extensions of the lseek system call, which are able to reuse file system knowledge about sparse file contents, so the detection is usually very fast. To use this feature, your file system and operating system must support it. At the time of this writing (2015) this feature, in spite of not being accepted by POSIX, is fairly widely supported by different operating systems.
‘--hole-detection=raw’
Reads the whole file before archiving it and treats consecutive stretches of zeros as holes. This is usually much slower than ‘seek’, although more portable.
When no ‘--hole-detection’ option is given, tar uses the ‘seek’ method, if supported by the operating system.
Using the ‘--hole-detection’ option implies ‘--sparse’.
When tar reads files, it updates their access times. To avoid this, use the ‘--atime-preserve[=METHOD]’ option, which can either reset the access time retroactively or avoid changing it in the first place.
‘--atime-preserve’
‘--atime-preserve=replace’
‘--atime-preserve=system’
Preserve the access times of files that are read. This works only for files that you own, unless you have superuser privileges.
‘--atime-preserve=replace’ works on most systems, but it also restores the data modification time and updates the status change time. Hence it doesn’t interact with incremental dumps nicely (see section Using tar to Perform Incremental Dumps), and it can set access or data modification times incorrectly if other programs access the file while tar is running.
‘--atime-preserve=system’ avoids changing the access time in the first place, if the operating system supports this. Unfortunately, this may or may not work on any given operating system or file system. If tar knows for sure it won’t work, it complains right away.
Currently ‘--atime-preserve’ with no operand defaults to ‘--atime-preserve=replace’, but this is intended to change to ‘--atime-preserve=system’ when the latter is better-supported.
‘-m’
‘--touch’
Do not extract data modification time.
When this option is used, tar leaves the data modification times of the files it extracts as the times when the files were extracted, instead of setting them to the times recorded in the archive.
This option is meaningless with ‘--list’ (‘-t’).
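The difference is visible when the archived file has an old modification time. The sketch below assumes that the option described above is GNU tar’s ‘--touch’ (‘-m’) and that the shell’s test command supports the ‘-nt’ (newer than) comparison; the file names are made up:

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
mkdir "$work/src" "$work/keep" "$work/touch"
echo hello > "$work/src/old.txt"
touch -t 200101010000 "$work/src/old.txt"    # pretend the file dates from 2001

tar -cf "$work/a.tar" -C "$work/src" old.txt
tar -xf "$work/a.tar" -C "$work/keep"        # mtime restored from the archive
tar -xmf "$work/a.tar" -C "$work/touch"      # -m: mtime is the extraction time

if [ "$work/touch/old.txt" -nt "$work/keep/old.txt" ]; then
    touched_is_newer=yes
else
    touched_is_newer=no
fi
echo "copy extracted with -m is newer: $touched_is_newer"
rm -rf "$work"
```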
‘--same-owner’
Create extracted files with the same ownership they have in the archive.
This is the default behavior for the superuser, so this option is meaningful only for non-root users, when tar is executed on those systems able to give files away. This is considered a security flaw by many people, at least because it makes it quite difficult to correctly account users for the disk space they occupy. Also, the suid or sgid attributes of files are easily and silently lost when files are given away.
When writing an archive, tar writes the user ID and user name separately. If it can’t find a user name (because the user ID is not in ‘/etc/passwd’), then it does not write one. When restoring, it tries to look the name (if one was written) up in ‘/etc/passwd’. If it fails, then it uses the user ID stored in the archive instead.
‘--no-same-owner’
‘-o’
Do not attempt to restore ownership when extracting. This is the default behavior for ordinary users, so this option has an effect only for the superuser.
The ‘--numeric-owner’ option allows (ANSI) archives to be written without user/group name information or such information to be ignored when extracting. It effectively disables the generation and/or use of user/group name information. This option forces extraction using the numeric ids from the archive, ignoring the names.
This is useful in certain circumstances, when restoring a backup from an emergency floppy with different passwd/group files for example. It is otherwise impossible to extract files with the right ownerships if the password file in use during the extraction does not match the one belonging to the file system(s) being extracted. This occurs, for example, if you are restoring your files after a major crash and had booted from an emergency floppy with no password file or put your disk into another machine to do the restore.
The numeric ids are always saved into tar archives. The identifying names are added at create time when provided by the system, unless ‘--format=oldgnu’ is used. Numeric ids could be used when moving archives between a collection of machines using a centralized management for attribution of numeric ids to users and groups. This is often done using the NIS capabilities.
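Since both the names and the numeric ids are stored, a verbose listing can show either. A small sketch, assuming GNU tar and made-up file names, that lists the same archive both ways:

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
mkdir "$work/src"
echo hello > "$work/src/file.txt"
tar -cf "$work/a.tar" -C "$work" src

uid=$(id -u)
# Default verbose listing: owner shown by name when one was recorded.
by_name=$(tar -tvf "$work/a.tar")
# With --numeric-owner the recorded names are ignored and the ids are shown.
by_id=$(tar --numeric-owner -tvf "$work/a.tar")

echo "$by_name"
echo "$by_id"
rm -rf "$work"
```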
When making a tar file for distribution to other sites, it is sometimes cleaner to use a single owner for all files in the distribution, and nicer to specify the write permission bits of the files as stored in the archive independently of their actual value on the file system. The way to prepare a clean distribution is usually to have some Makefile rule creating a directory, copying all needed files in that directory, then setting ownership and permissions as wanted (there are a lot of possible schemes), and only then making a tar archive out of this directory, before cleaning everything out. Of course, we could add a lot of options to GNU tar for fine tuning permissions and ownership. This is not a good approach, I think. GNU tar is already crowded with options and moreover, the approach just explained gives you a great deal of control already.
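The staging approach can be sketched as a few shell commands of the kind such a Makefile rule would run. The project name, file names and the permission scheme are all hypothetical, and changing ownership is left out because it normally requires superuser privileges:

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
mkdir "$work/project"
echo 'int main(void) { return 0; }' > "$work/project/main.c"
chmod 600 "$work/project/main.c"       # the working copy is private

# Copy what is needed into a staging directory.
stage="$work/stage/project-1.0"        # hypothetical distribution name
mkdir -p "$stage"
cp "$work/project/main.c" "$stage/"

# Normalize permissions on the copy: readable by all, writable by owner only.
chmod -R u+rwX,go+rX,go-w "$stage"

# Archive the staging directory, then clean it out.
tar -cf "$work/project-1.0.tar" -C "$work/stage" project-1.0
listing=$(tar -tvf "$work/project-1.0.tar")
rm -rf "$work/stage"

echo "$listing"
rm -rf "$work"
```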
‘-p’
‘--same-permissions’
‘--preserve-permissions’
Extract all protection information.
This option causes tar to set the modes (access permissions) of extracted files exactly as recorded in the archive. If this option is not used, the current umask setting limits the permissions on extracted files. This option is by default enabled when tar is executed by a superuser.
This option is meaningless with ‘--list’ (‘-t’).
8.3 Making tar Archives More Portable
Creating a tar archive on a particular system that is meant to be useful later on many other machines and with other versions of tar is more challenging than you might think. tar archive formats have been evolving since the first versions of Unix. Many such formats are around, and are not always compatible with each other. This section discusses a few problems, and gives some advice about making tar archives more portable.
One golden rule is simplicity. For example, limit your tar archives to contain only regular files and directories, avoiding other kinds of special files. Do not attempt to save sparse files or contiguous files as such. Let’s discuss a few more problems, in turn.
8.3.1 Portable Names
8.3.2 Symbolic Links
8.3.3 Hard Links
8.3.4 Old V7 Archives
8.3.5 Ustar Archive Format
8.3.6 GNU and old GNU tar format
8.3.7 GNU tar and POSIX tar
8.3.8 Checksumming Problems
8.3.9 Large or Negative Values (large files, negative time stamps, etc.)
8.3.10 How to Extract GNU-Specific Data Using Other tar Implementations
Use portable file and member names. A name is portable if it contains only ASCII letters and digits, ‘/’, ‘.’, ‘_’, and ‘-’; it cannot be empty, start with ‘-’ or ‘//’, or contain ‘/-’. Avoid deep directory nesting. For portability to old Unix hosts, limit your file name components to 14 characters or less.
If you intend your tar archives to be read on case-insensitive file systems like FAT32, you should not rely on case distinction for file names.
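The character rules for portable names lend themselves to a mechanical check. The function below is a sketch with a hypothetical name, ‘is_portable_name’; it covers only the character and prefix rules, not the nesting depth, the 14-character component limit, or case clashes:

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL    # keep the A-Z and a-z ranges limited to ASCII

# Return 0 if the name follows the character rules for portable names.
is_portable_name() {
    case $1 in
        '' | -* | //* | */-*) return 1 ;;   # empty, starts with - or //, or contains /-
        *[!A-Za-z0-9/._-]*)   return 1 ;;   # a character outside the allowed set
    esac
    return 0
}

for name in src/main.c -rf 'my file.txt' dir/-x //server/share; do
    if is_portable_name "$name"; then
        echo "portable:     $name"
    else
        echo "not portable: $name"
    fi
done
```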
Normally, when tar archives a symbolic link, it writes a block to the archive naming the target of the link. In that way, the tar archive is a faithful record of the file system contents. When ‘--dereference’ (‘-h’) is used with ‘--create’ (‘-c’), tar archives the files symbolic links point to, instead of the links themselves.
When creating portable archives, use ‘--dereference’ (‘-h’): some systems do not support symbolic links, and moreover, your distribution might be unusable if it contains unresolved symbolic links.
When reading from an archive, the ‘--dereference’ (‘-h’) option causes tar to follow an already-existing symbolic link when tar writes or reads a file named in the archive. Ordinarily, tar does not follow such a link, though it may remove the link before writing a new file. See section Options Controlling the Overwriting of Existing Files.
The ‘--dereference’ option is unsafe if an untrusted user can modify directories while tar is running. See section Security.
Normally, when tar archives a hard link, it writes a block to the archive naming the target of the link (a ‘1’ type block). In that way, the actual file contents are stored in the archive only once. For example, consider the following two files:
$ ls -l
-rw-r--r-- 2 gray staff 4 2007-10-30 15:11 one
-rw-r--r-- 2 gray staff 4 2007-10-30 15:11 jeden
Here, ‘jeden’ is a link to ‘one’. When archiving this directory with a verbose level 2, you will get an output similar to the following:
$ tar cvvf ../archive.tar .
drwxr-xr-x gray/staff 0 2007-10-30 15:13 ./
-rw-r--r-- gray/staff 4 2007-10-30 15:11 ./jeden
hrw-r--r-- gray/staff 0 2007-10-30 15:11 ./one link to ./jeden
The last line shows that, instead of storing two copies of the file, tar stored it only once, under the name ‘jeden’, and stored file ‘one’ as a hard link to this file.
It may be important to know that all hard links to the given file are stored in the archive. For example, this may be necessary for exact reproduction of the file system. The following option does that:
‘-l’
‘--check-links’
Check the number of links dumped for each processed file. If this number does not match the total number of hard links for the file, print a warning message.
For example, trying to archive only file ‘jeden’ with this option produces the following diagnostics:
$ tar -c -f ../archive.tar -l jeden
tar: Missing links to 'jeden'.
Although creating special records for hard links helps keep a faithful record of the file system contents and makes archives more compact, it may present some difficulties when extracting individual members from the archive. For example, trying to extract file ‘one’ from the archive created in previous examples produces, in the absence of file ‘jeden’:
$ tar xf archive.tar ./one
tar: ./one: Cannot hard link to './jeden': No such file or directory
tar: Error exit delayed from previous errors
The reason for this behavior is that tar cannot seek back in the archive to the previous member (in this case, ‘jeden’), to extract it (23).
If you wish to avoid such problems at the cost of a bigger archive, use the following option:
‘--hard-dereference’
Dereference hard links and store the files they refer to.
For example, trying this option on our two sample files, we get two copies in the archive, each of which can then be extracted independently of the other:
$ tar -c -vv -f ../archive.tar --hard-dereference .
drwxr-xr-x gray/staff 0 2007-10-30 15:13 ./
-rw-r--r-- gray/staff 4 2007-10-30 15:11 ./jeden
-rw-r--r-- gray/staff 4 2007-10-30 15:11 ./one
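The two behaviors can be compared side by side by counting the ‘link to’ entries in a verbose listing. A sketch, assuming GNU tar and a file system that supports hard links; the C locale is forced so that the listing text is predictable:

```shell
#!/bin/sh
set -e
LC_ALL=C; export LC_ALL
work=$(mktemp -d)
mkdir "$work/src"
echo data > "$work/src/jeden"
ln "$work/src/jeden" "$work/src/one"     # second name for the same file

tar -cf "$work/links.tar" -C "$work" src
tar -cf "$work/copies.tar" --hard-dereference -C "$work" src

# grep -c prints the number of matching lines; "|| true" keeps set -e quiet on zero.
links=$(tar -tvf "$work/links.tar" | grep -c 'link to' || true)
copies=$(tar -tvf "$work/copies.tar" | grep -c 'link to' || true)

echo "hard link records by default:           $links"
echo "hard link records with the option given: $copies"
rm -rf "$work"
```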
Certain old versions of tar cannot handle additional information recorded by newer tar programs. To create an archive in V7 format (not ANSI), which can be read by these old versions, specify the ‘--format=v7’ option in conjunction with ‘--create’ (‘-c’) (tar also accepts ‘--portability’ or ‘--old-archive’ for this option). When you specify it, tar leaves out information about directories, pipes, fifos, contiguous files, and device files, and specifies file ownership by group and user IDs instead of group and user names.
When updating an archive, do not use ‘--format=v7’ unless the archive was created using this option.
In most cases, a new format archive can be read by an old tar program without serious trouble, so this option should seldom be needed. On the other hand, most modern tars are able to read old format archives, so it might be safer for you to always use ‘--format=v7’ for your distributions. Notice, however, that ‘ustar’ format is a better alternative, as it is free from many of the drawbacks of ‘v7’.
The archive format defined by the POSIX.1-1988 specification is called ustar. Although it is more flexible than the V7 format, it still has many restrictions (see section ustar, for the detailed description of ustar format). Along with V7 format, ustar format is a good choice for archives intended to be read with other implementations of tar.
To create an archive in ustar format, use the ‘--format=ustar’ option in conjunction with ‘--create’ (‘-c’).
8.3.6 GNU and old GNU tar format
GNU tar was based on an early draft of the POSIX 1003.1 ustar standard. GNU extensions to tar, such as the support for file names longer than 100 characters, use portions of the tar header record which were specified in that POSIX draft as unused. Subsequent changes in POSIX have allocated the same parts of the header record for other purposes. As a result, GNU tar format is incompatible with the current POSIX specification, and with tar programs that follow it.
In the majority of cases, tar will be configured to create this format by default. This will change in future releases, since we plan to make ‘POSIX’ format the default.
To force creation of a GNU tar archive, use option ‘--format=gnu’.
8.3.7 GNU tar and POSIX tar
Starting from version 1.14 GNU tar features full support for POSIX.1-2001 archives.
A POSIX conformant archive will be created if tar was given the ‘--format=posix’ (‘--format=pax’) option. No special option is required to read and extract from a POSIX archive.
8.3.7.1 Controlling Extended Header Keywords
‘--pax-option=keyword-list’
Handle keywords in PAX extended headers. This option is equivalent to the ‘-o’ option of the pax utility.
Keyword-list is a comma-separated list of keyword options, each keyword option taking one of the following forms:
delete=pattern
When used with one of archive-creation commands, this option instructs tar to omit from extended header records that it produces any keywords matching the string pattern. If the pattern contains shell metacharacters like ‘*’, it should be quoted to prevent the shell from expanding the pattern before tar sees it.
When used in extract or list mode, this option instructs tar to ignore any keywords matching the given pattern in the extended header records. In both cases, matching is performed using the pattern matching notation described in POSIX 1003.2, 3.13 (see section Wildcards Patterns and Matching). For example:
--pax-option 'delete=security.*'
would suppress security-related information.
exthdr.name=string
This keyword allows user control over the name that is written into the ustar header blocks for the extended headers. The name is obtained from string after making the following substitutions:
Meta-character | Replaced By |
---|---|
%d | The directory name of the file, equivalent to the result of the dirname utility on the translated file name. |
%f | The name of the file with the directory information stripped, equivalent to the result of the basename utility on the translated file name. |
%p | The process ID of the tar process. |
%% | A ‘%’ character. |
Any other ‘%’ characters in string produce undefined results.
If no option ‘exthdr.name=string’ is specified, tar will use the following default value:
%d/PaxHeaders/%f
This default helps make the archive more reproducible. See section Making tar Archives More Reproducible. POSIX recommends using ‘%d/PaxHeaders.%p/%f’ instead, which means that two archives created with the same set of options and containing the same set of files will differ byte for byte. That value will be used as the default if the environment variable POSIXLY_CORRECT is set.
exthdr.mtime=value
This keyword defines the value of the ‘mtime’ field that is written into the ustar header blocks for the extended headers. By default, the ‘mtime’ field is set to the modification time of the archive member described by that extended header (or to the value of the ‘--mtime’ option, if supplied).
globexthdr.name=string
This keyword allows user control over the name that is written into the ustar header blocks for global extended header records. The name is obtained from the contents of string, after making the following substitutions:
Meta-character | Replaced By |
---|---|
%n | An integer that represents the sequence number of the global extended header record in the archive, starting at 1. |
%p | The process ID of the tar process. |
%% | A ‘%’ character. |
Any other ‘%’ characters in string produce undefined results.
If no option ‘globexthdr.name=string’ is specified, tar will use the following default value:
$TMPDIR/GlobalHead.%n
If the environment variable POSIXLY_CORRECT is set, the following value is used instead:
$TMPDIR/GlobalHead.%p.%n
In both cases, ‘$TMPDIR’ stands for the value of the TMPDIR environment variable. If TMPDIR is not set, tar uses ‘/tmp’.
globexthdr.mtime=value
This keyword defines the value of the ‘mtime’ field that is written into the ustar header blocks for the global extended headers. By default, the ‘mtime’ field is set to the time when tar was invoked.
keyword=value
When used with one of the archive-creation commands, these keyword/value pairs will be included at the beginning of the archive in a global extended header record. When used with one of the archive-reading commands, tar will behave as if it had encountered these keyword/value pairs at the beginning of the archive in a global extended header record.
keyword:=value
When used with one of the archive-creation commands, these keyword/value pairs will be included as records at the beginning of an extended header for each file. This is effectively equivalent to the keyword=value form except that it creates no global extended header records.
When used with one of the archive-reading commands, tar will behave as if these keyword/value pairs were included as records at the end of each extended header; thus, they will override any global or file-specific extended header record keywords of the same names.
For example, in the command:
tar --format=posix --create \
    --file archive --pax-option gname:=user .
the group name will be forced to ‘user’ for all files stored in the archive.
In any of the forms described above, the value may be a string enclosed in curly braces. In that case, the string between the braces is understood either as a textual time representation, as described in Date input formats, or as the name of an existing file, starting with ‘/’ or ‘.’. In the latter case, the modification time of that file is used.
For example, to set all modification times to the current date, use the following option:
--pax-option 'mtime:={now}'
As another example, the following option helps make the archive more reproducible. See section Making tar Archives More Reproducible.
--pax-option delete=atime
If you extract files from such an archive and recreate the archive from them, you will also need to eliminate changes due to ctime:
--pax-option 'delete=atime,delete=ctime'
Normally tar saves an mtime value with subsecond resolution in an extended header for any file with a timestamp that is not on a one-second boundary. This is in addition to the traditional mtime timestamp in the header block. Although you can suppress subsecond timestamp resolution with ‘--pax-option delete=mtime’, this hack will not work for timestamps before 1970 or after 2242-03-16 12:56:31 UTC.
If the environment variable POSIXLY_CORRECT is set, two POSIX archives created using the same options on the same set of files might not be byte-to-byte equivalent even with the above options. This is because the POSIX default for extended header names includes the tar process ID, which typically differs at each run. To produce byte-to-byte equivalent archives in this case, either unset POSIXLY_CORRECT, or use the following option, which can be combined with the above options:
--pax-option exthdr.name=%d/PaxHeaders/%f
SunOS and HP-UX tar fail to accept archives created using GNU tar and containing non-ASCII file names, that is, file names having characters with the eighth bit set, because they use signed checksums, whereas GNU tar uses unsigned checksums when creating archives, as required by the POSIX standards. On reading, GNU tar computes both checksums and accepts either of them.
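To see why only names with the eighth bit set are affected, here is a minimal sketch that computes both sums over the first 512-byte header block of a file. The function name tar_header_checksums is hypothetical. The sketch assumes the usual ustar layout, in which the eight bytes of the checksum field itself (offsets 148 through 155) are counted as spaces; it prints the unsigned sum followed by the signed one.

```shell
# Hypothetical helper: print "UNSIGNED SIGNED" checksums of the first
# 512-byte header block of FILE.
tar_header_checksums() {
    dd if="$1" bs=512 count=1 2>/dev/null |
    od -An -v -tu1 |
    awk '{
        for (i = 1; i <= NF; i++) {
            # The chksum field itself is summed as eight spaces.
            b = (n >= 148 && n < 156) ? 32 : $i
            n++
            u += b
            s += (b > 127) ? b - 256 : b
        }
    }
    END { print u + 0, s + 0 }'
}

# Usage: tar_header_checksums archive.tar
```

The two numbers are equal as long as every header byte is below 128, and differ by 256 for each byte with the eighth bit set.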
It is somewhat worrying that a lot of people may go around backing up their files using faulty (or at least non-standard) software, not learning about it until it’s time to restore their missing files with an incompatible file extractor, or vice versa.
GNU tar computes checksums both ways, and accepts either of them on read, so GNU tar can read Sun tapes even with their wrong checksums. GNU tar produces the standard checksum, however, raising incompatibilities with Sun. That is to say, GNU tar has not been modified to produce incorrect archives to be read by buggy tars.
I’ve been told that more recent Sun tar now reads standard archives, so maybe Sun did a similar patch, after all?
The story seems to be that when Sun first imported tar sources on their system, they recompiled it without realizing that the checksums were computed differently, because of a change in the default signedness of char in their compiler. So they started computing checksums wrongly. When they later realized their mistake, they merely decided to stay compatible with it, and with themselves afterwards. Presumably, but I do not really know, HP-UX has chosen their tar archives to be compatible with Sun’s.
The current standards do not favor Sun tar format. In any case, it now falls on the shoulders of SunOS and HP-UX users to get a tar able to read the good archives they receive.
The above sections suggest using the ‘oldest possible’ archive format if in doubt. However, sometimes this is not possible. If you attempt to archive a file whose metadata cannot be represented using the required format, GNU tar will print an error message and skip that file. You will then have to switch to a format that is able to handle such values. The format summary table (see section Controlling the Archive Format) will help you do so.
In particular, when trying to archive files 8 GiB or larger, or with timestamps not in the range 1970-01-01 00:00:00 through 2242-03-16 12:56:31 UTC, you will have to choose between GNU and POSIX archive formats. When considering which format to choose, bear in mind that the GNU format uses two’s-complement base-256 notation to store values that do not fit into the standard ustar range. Such archives can generally be read only by a GNU tar implementation. Moreover, they sometimes cannot be correctly restored on other hosts even by GNU tar. For example, using two’s-complement representation for negative time stamps that assumes a signed 32-bit time_t generates archives that are not portable to hosts with differing time_t representations.
On the other hand, POSIX archives, generally speaking, can be extracted by any tar implementation that understands the older ustar format. The exceptions are files 8 GiB or larger, or files dated before 1970-01-01 00:00:00 or after 2242-03-16 12:56:31 UTC.
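The two limits mentioned above are the same number: 8 GiB is 8^11 bytes, and 2242-03-16 12:56:31 UTC is 8^11 - 1 seconds after the start of 1970, which is the largest value that fits in eleven octal digits. The following sketch, with the hypothetical name fits_ustar, checks a size and a modification time against that limit. It assumes both values are given as decimal integers and does not look at any other metadata.

```shell
# Hypothetical helper (not part of tar): report whether a size in bytes and
# an mtime in seconds since 1970-01-01 00:00:00 UTC fit the ustar range.
USTAR_LIMIT=8589934592   # 8^11 = 2^33

fits_ustar() {
    size=$1 mtime=$2
    if [ "$size" -ge 0 ] && [ "$size" -lt "$USTAR_LIMIT" ] &&
       [ "$mtime" -ge 0 ] && [ "$mtime" -lt "$USTAR_LIMIT" ]; then
        echo "ustar is sufficient"
    else
        echo "needs gnu or posix format"
    fi
}

fits_ustar 65391 1151179560        # a small file with a recent timestamp
fits_ustar 8589934592 1151179560   # exactly 8 GiB: out of range
```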
How to Extract GNU-Specific Data Using Other tar Implementations
In previous sections you became acquainted with various quirks necessary to make your archives portable. Sometimes you may need to extract archives containing GNU-specific members using some third-party tar implementation or an older version of GNU tar. Of course, your best bet is to have GNU tar installed, but if that is impossible for some reason, this section explains how to cope without it.
When we speak about GNU-specific members we mean two classes of them: members split between the volumes of a multi-volume archive and sparse members. You will always be able to recover such members if the archive is in PAX format. In addition, split members can be recovered from archives in old GNU format. The following subsections describe the required procedures in detail.
8.3.10.1 Extracting Members Split Between Volumes | Members Split Between Volumes | |
8.3.10.2 Extracting Sparse Members | Sparse Members |
Extracting Members Split Between Volumes
If a member is split between several volumes of an old GNU format archive, most third-party tar implementations will fail to extract it. To extract it, use the tarcat program (see section Concatenate Volumes into a Single Archive). This program is available from the GNU tar home page. It concatenates several archive volumes into a single valid archive. For example, if you have three volumes named from ‘vol-1.tar’ to ‘vol-3.tar’, you can do the following to extract them using a third-party tar:
$ tarcat vol-1.tar vol-2.tar vol-3.tar | tar xf -
You could use this approach for most (although not all) PAX format archives as well. However, extracting split members from a PAX archive is a much easier task, because PAX volumes are constructed in such a way that each part of a split member is extracted to a different file by tar implementations that are not aware of GNU extensions. More specifically, the very first part retains its original name, and all subsequent parts are named using the pattern:
%d/GNUFileParts/%f.%n
where symbols preceded by ‘%’ are macro characters that have the following meaning:
Meta-character | Replaced By |
---|---|
%d | The directory name of the file, equivalent to the result of the dirname utility on its full name. |
%f | The file name of the file, equivalent to the result of the basename utility on its full name. |
%p | The process ID of the tar process that created the archive. |
%n | Ordinal number of this particular part. |
For example, if the file ‘var/longfile’ was split during archive creation between three volumes, then the member names will be:
var/longfile
var/GNUFileParts/longfile.1
var/GNUFileParts/longfile.2
When you extract your archive using a third-party tar, these files will be created on your disk, and the only thing you will need to do to restore your file in its original form is concatenate them in the proper order, for example:
$ cd var
$ cat GNUFileParts/longfile.1 \
    GNUFileParts/longfile.2 >> longfile
$ rm -rf GNUFileParts
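When a member was split into many parts, typing the names by hand is error-prone, and a shell wildcard would sort ‘longfile.10’ before ‘longfile.2’. A minimal sketch of a loop that appends the parts in numeric order follows. The function name join_split_member is hypothetical, and the sketch assumes the parts are numbered consecutively from 1 with none missing.

```shell
# Hypothetical helper: append GNUFileParts/NAME.1, NAME.2, ... to NAME
# in numeric order, stopping at the first missing part.
join_split_member() {
    name=$1
    n=1
    while [ -f "GNUFileParts/$name.$n" ]; do
        cat "GNUFileParts/$name.$n" >> "$name" || return 1
        n=$((n + 1))
    done
}

# Usage, from the directory that holds the first part:
#   cd var && join_split_member longfile
```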
Note that if the tar implementation you use supports PAX format archives, it will probably emit warnings about unknown keywords during extraction. They will look like this:
Tar file too small
Unknown extended header keyword 'GNU.volume.filename' ignored.
Unknown extended header keyword 'GNU.volume.size' ignored.
Unknown extended header keyword 'GNU.volume.offset' ignored.
You can safely ignore these warnings.
If your tar implementation is not PAX-aware, you will get more warnings and more files generated on your disk, e.g.:
$ tar xf vol-1.tar
var/PaxHeaders/longfile: Unknown file type 'x', extracted as normal file
Unexpected EOF in archive
$ tar xf vol-2.tar
tmp/GlobalHead.1: Unknown file type 'g', extracted as normal file
GNUFileParts/PaxHeaders/sparsefile.1: Unknown file type 'x', extracted as normal file
Ignore these warnings. The ‘PaxHeaders.*’ directories created will contain files with extended header keywords describing the extracted files. You can delete them, unless they describe sparse members. Read further to learn more about them.
Extracting Sparse Members
Any tar implementation will be able to extract sparse members from a PAX archive. However, the extracted files will be condensed, i.e., any zero blocks will be removed from them. When we restore such a condensed file to its original form, by adding zero blocks (or holes) back to their original locations, we call this process expanding a condensed sparse file.
To expand a file, you will need a simple auxiliary program called xsparse. It is available in source form from the GNU tar home page.
Let’s begin with archive members in sparse format version 1.0, which are the easiest to expand. The condensed file will contain both file map and file data, so no additional data will be needed to restore it. If the original file name was ‘dir/name’, then the condensed file will be named ‘dir/GNUSparseFile.n/name’, where n is a decimal number.
To expand a version 1.0 file, run xsparse as follows:
$ xsparse ‘cond-file’
where ‘cond-file’ is the name of the condensed file. The utility will deduce the name for the resulting expanded file from the name of the condensed file; for instance, a condensed file named ‘dir/GNUSparseFile.n/name’ is expanded to ‘dir/name’.
In the unlikely case that the deduced name does not suit your needs, you can explicitly specify the output file name as a second argument to the command:
$ xsparse ‘cond-file’ ‘out-file’
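If you want to know in advance where the expanded file will land, the mapping seen in the examples of this section can be approximated in the shell. The function below is only a sketch with a hypothetical name, expanded_name. It covers the ‘dir/GNUSparseFile.n/name’ pattern and nothing else, so the name xsparse actually chooses may differ in other cases.

```shell
# Hypothetical helper: map DIR/GNUSparseFile.N/NAME to DIR/NAME.
# Only this one naming pattern is handled; anything else is printed as is.
expanded_name() {
    case $1 in
        */GNUSparseFile.*/*)
            printf '%s/%s\n' "$(dirname "$(dirname "$1")")" "$(basename "$1")" ;;
        GNUSparseFile.*/*)
            basename "$1" ;;
        *)
            printf '%s\n' "$1" ;;
    esac
}

expanded_name /home/gray/GNUSparseFile.6058/sparsefile
```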
It is often a good idea to run xsparse in dry run mode first. In this mode, the command does not actually expand the file, but verbosely lists all actions it would be taking to do so. The dry run mode is enabled by the ‘-n’ command line argument:
$ xsparse -n /home/gray/GNUSparseFile.6058/sparsefile
Reading v.1.0 sparse map
Expanding file '/home/gray/GNUSparseFile.6058/sparsefile' to '/home/gray/sparsefile'
Finished dry run
To actually expand the file, you would run:
$ xsparse /home/gray/GNUSparseFile.6058/sparsefile
The program behaves the way UNIX utilities traditionally do: it keeps quiet unless it has something important to tell you (e.g., an error condition). If you wish it to produce verbose output, similar to that from the dry run mode, use the ‘-v’ option:
$ xsparse -v /home/gray/GNUSparseFile.6058/sparsefile
Reading v.1.0 sparse map
Expanding file '/home/gray/GNUSparseFile.6058/sparsefile' to '/home/gray/sparsefile'
Done
Additionally, if your tar implementation has extracted the extended headers for this file, you can instruct xsparse to use them in order to verify the integrity of the expanded file. The option ‘-x’ sets the name of the extended header file to use. Continuing our example:
$ xsparse -v -x /home/gray/PaxHeaders/sparsefile \
    /home/gray/GNUSparseFile.6058/sparsefile
Reading extended header file
Found variable GNU.sparse.major = 1
Found variable GNU.sparse.minor = 0
Found variable GNU.sparse.name = sparsefile
Found variable GNU.sparse.realsize = 217481216
Reading v.1.0 sparse map
Expanding file '/home/gray/GNUSparseFile.6058/sparsefile' to '/home/gray/sparsefile'
Done
An extended header is a special tar archive header that precedes an archive member and contains a set of variables describing the member properties that cannot be stored in the standard ustar header. While optional for expanding sparse version 1.0 members, the use of extended headers is mandatory when expanding sparse members in older sparse formats: v.0.0 and v.0.1. (The sparse formats are described in detail in Storing Sparse Files.) So, for these formats, the question is: how to obtain extended headers from the archive?
If you use a tar implementation that does not support PAX format, extended headers for each member will be extracted as a separate file. If we represent the member name as ‘dir/name’, then the extended header file will be named ‘dir/PaxHeaders/name’.
Things become more difficult if your tar implementation does support PAX headers, because in this case you will have to manually extract the headers. We recommend the following algorithm:
1. Consult the documentation of your tar implementation for an option that prints block numbers along with the archive listing (analogous to GNU tar’s ‘-R’ option). For example, star has ‘-block-number’.
2. Obtain a verbose listing using that option, and find the block numbers of the sparse member in question and of the member immediately following it. For example, running star on our archive we obtain:
$ star -t -v -block-number -f arc.tar
…
star: Unknown extended header keyword 'GNU.sparse.size' ignored.
star: Unknown extended header keyword 'GNU.sparse.numblocks' ignored.
star: Unknown extended header keyword 'GNU.sparse.name' ignored.
star: Unknown extended header keyword 'GNU.sparse.map' ignored.
block 56: 425984 -rw-r--r-- gray/users Jun 25 14:46 2006 GNUSparseFile.28124/sparsefile
block 897: 65391 -rw-r--r-- gray/users Jun 24 20:06 2006 README
…
(as usual, ignore the warnings about unknown keywords.)
3. Let size be the size of the sparse member, Bs be its block number, and Bn be the block number of the next member. Then compute:
N = Bn - Bs - size/512 - 2
This number gives the size of the extended header part in tar blocks. In our example, this formula gives: 897 - 56 - 425984 / 512 - 2 = 7.
4. Use dd to extract the headers:
dd if=archive of=hname bs=512 skip=Bs count=N
where archive is the archive name, hname is a name of the file to store the extended header in, and Bs and N are computed in previous steps.
In our example, this command will be
$ dd if=arc.tar of=xhdr bs=512 skip=56 count=7
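The arithmetic of the last two steps is easy to get wrong by hand, so here is a small sketch that does it for you. The function name exthdr_dd_command is hypothetical. It takes Bs as the block number of the sparse member and Bn as that of the following member, matching the worked example (897 - 56 - ...), and only prints the dd command rather than running it.

```shell
# Hypothetical helper: print the dd command that extracts the extended header.
# usage: exthdr_dd_command ARCHIVE OUTFILE BS BN SIZE
exthdr_dd_command() {
    archive=$1 hname=$2 bs=$3 bn=$4 size=$5
    # Data blocks of the member; rounding up is an assumption for sizes
    # that are not a multiple of 512 (the formula above writes size/512).
    data_blocks=$(( (size + 511) / 512 ))
    n=$(( bn - bs - data_blocks - 2 ))
    echo "dd if=$archive of=$hname bs=512 skip=$bs count=$n"
}

exthdr_dd_command arc.tar xhdr 56 897 425984
```

With the numbers from the star listing, this prints the same dd command as the one shown in the example.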
Finally, you can expand the condensed file, using the obtained header:
$ xsparse -v -x xhdr GNUSparseFile.28124/sparsefile
Reading extended header file
Found variable GNU.sparse.size = 217481216
Found variable GNU.sparse.numblocks = 208
Found variable GNU.sparse.name = sparsefile
Found variable GNU.sparse.map = 0,2048,1050624,2048,…
Expanding file 'GNUSparseFile.28124/sparsefile' to 'sparsefile'
Done
Making tar Archives More Reproducible
Sometimes it is important for an archive to be reproducible, so that one can easily verify it to have been derived solely from its input. We call an archive reproducible if an archive created from the same set of input files with the same command line options is byte-to-byte equivalent to the original one.
However, two archives created by GNU tar from two sets of input files might differ even if the input files have the same contents and GNU tar was invoked the same way on both sets of input. This can happen if the inputs have different modification dates or other metadata, or if the input directories’ entries are in different orders.
To avoid this problem when creating an archive, and thus make the archive reproducible, you can run GNU tar in the C locale with some or all of the following options:
--sort=name
Omit irrelevant information about directory entry order.
--format=posix
Avoid problems with large files or files with unusual timestamps. This also enables ‘--pax-option’ options mentioned below.
--pax-option='exthdr.name=%d/PaxHeaders/%f'
Omit the process ID of tar. This option is needed only if POSIXLY_CORRECT is set in the environment.
--pax-option='delete=atime,delete=ctime'
Omit irrelevant information about file access or status change time.
--clamp-mtime --mtime="$SOURCE_EPOCH"
Omit irrelevant information about file timestamps after ‘$SOURCE_EPOCH’, which should be a time no less than any timestamp of any source file.
--numeric-owner
Omit irrelevant information about user and group names.
--owner=0 --group=0
Omit irrelevant information about file ownership and group.
--mode='go+u,go-w'
Omit irrelevant information about file permissions.
When creating a reproducible archive from version-controlled source files, it can be useful to set each file’s modification time to be that of its last commit, so that the timestamps are reproducible from the version-control repository. If these timestamps are all on integer second boundaries, and if you use ‘--format=posix --pax-option='delete=atime,delete=ctime' --clamp-mtime --mtime="$SOURCE_EPOCH"’ where $SOURCE_EPOCH is the time of the most recent commit, and if all non-source files have timestamps greater than $SOURCE_EPOCH, then GNU tar should generate an archive in ustar format, since no POSIX features will be needed and the archive will be in the ustar subset of posix format.
Also, if compressing, use a reproducible compression format; e.g., with gzip you should use the ‘--no-name’ (‘-n’) option.
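One way to convince yourself that ‘-n’ matters is to compress two copies of the same data that differ only in name and timestamp, and compare the results. The sketch below uses a hypothetical helper, gzip_stable_cksum, and assumes a gzip that accepts the short options ‘-n’, ‘-9’ and ‘-c’.

```shell
# Hypothetical helper: checksum of a file's gzip stream made with -n, so that
# neither the file name nor its timestamp is stored in the output.
gzip_stable_cksum() {
    gzip -n -9 -c "$1" | cksum
}

# Usage: the two checksums should match when a and b hold the same data,
# whatever their names and modification times are:
#   gzip_stable_cksum a; gzip_stable_cksum b
```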
Here is an example set of shell commands to produce a reproducible tarball with git and gzip, which you can tailor to your project’s needs.
function get_commit_time() {
  TZ=UTC0 git log -1 \
    --format=tformat:%cd \
    --date=format:%Y-%m-%dT%H:%M:%SZ \
    "$@"
}
#
# Set each source file timestamp to that of its latest commit.
git ls-files | while read -r file; do
  commit_time=$(get_commit_time "$file") &&
  touch -md $commit_time "$file"
done
#
# Set timestamp of each directory under $FILES
# to the latest timestamp of any descendant.
find $FILES -depth -type d -exec sh -c \
  'touch -r "$0/$(ls -At "$0" | head -n 1)" "$0"' \
  {} ';'
#
# Create $ARCHIVE.tgz from $FILES, pretending that
# the modification time for each newer file
# is that of the most recent commit of any source file.
SOURCE_EPOCH=$(get_commit_time)
TARFLAGS="
  --sort=name --format=posix
  --pax-option=exthdr.name=%d/PaxHeaders/%f
  --pax-option=delete=atime,delete=ctime
  --clamp-mtime --mtime=$SOURCE_EPOCH
  --numeric-owner --owner=0 --group=0
  --mode=go+u,go-w
"
GZIPFLAGS="--no-name --best"
LC_ALL=C tar $TARFLAGS -cf - $FILES |
  gzip $GZIPFLAGS > $ARCHIVE.tgz
Comparison of tar and cpio
The cpio archive formats, like tar, do have maximum file name lengths. The binary and old ASCII formats have a maximum file name length of 256, and the new ASCII and CRC ASCII formats have a maximum file name length of 1024. GNU cpio can read and write archives with arbitrary file name lengths, but other cpio implementations may crash inexplicably trying to read them.
tar handles symbolic links in the form in which it comes in BSD; cpio doesn’t handle symbolic links in the form in which it comes in System V prior to SVR4, and some vendors may have added symlinks to their system without enhancing cpio to know about them. Others may have enhanced it in a way other than the way I did it at Sun, and which was adopted by AT&T (and which is, I think, also present in the cpio that Berkeley picked up from AT&T and put into a later BSD release; I think I gave them my changes).
(SVR4 does some funny stuff with tar; basically, its cpio can handle tar format input, and write it on output, and it probably handles symbolic links. They may not have bothered doing anything to enhance tar as a result.)
cpio handles special files; traditional tar doesn’t.
tar comes with V7, System III, System V, and BSD source; cpio comes only with System III, System V, and later BSD (4.3-tahoe and later).
tar’s way of handling multiple hard links to a file can handle file systems that support 32-bit i-numbers (e.g., the BSD file system); cpio’s way requires you to play some games (in its “binary” format, i-numbers are only 16 bits, and in its “portable ASCII” format, they’re 18 bits; it would have to play games with the “file system ID” field of the header to make sure that the file system ID/i-number pairs of different files were always different), and I don’t know which cpios, if any, play those games. Those that don’t might get confused and think two files are the same file when they’re not, and make hard links between them.
tar’s way of handling multiple hard links to a file places only one copy of the link on the tape, but the name attached to that copy is the only one you can use to retrieve the file; cpio’s way puts one copy for every link, but you can retrieve it using any of the names.
What type of checksum (if any) is used, and how is it calculated?
See the attached manual pages for tar and cpio format. tar uses a checksum which is the sum of all the bytes in the tar header for a file; cpio uses no checksum.
If anyone knows why cpio was made when tar was present at the unix scene,
It wasn’t. cpio first showed up in PWB/UNIX 1.0; no generally-available version of UNIX had tar at the time. I don’t know whether any version that was generally available within AT&T had tar, or, if so, whether the people within AT&T who did cpio knew about it.
On restore, if there is a corruption on a tape, tar will stop at that point, while cpio will skip over it and try to restore the rest of the files.
The main difference is just in the command syntax and header format.
tar is a little more tape-oriented in that everything is blocked to start on a record boundary.
Are there any differences between the two in their ability to recover crashed archives? (Is there any chance of recovering crashed archives at all?)
Theoretically it should be easier under tar since the blocking lets you find a header with some variation of ‘dd skip=nn’. However, modern cpios and variations have an option to just search for the next file header after an error with a reasonable chance of resyncing. However, lots of tape driver software won’t allow you to continue past a media error, which should be the only reason for getting out of sync unless a file changed sizes while you were writing the archive.
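The ‘dd skip=nn’ idea can be sketched as a loop that looks at every 512-byte block and reports those that carry the ‘ustar’ magic string at byte offset 257, where ustar, posix and gnu format headers keep it. The function name find_tar_headers is hypothetical. The check is a heuristic: old V7 headers have no magic, and file data may contain the same bytes by coincidence. Reading a few bytes at a time with dd is also slow on large archives.

```shell
# Hypothetical helper: list block numbers that look like tar headers.
find_tar_headers() {
    archive=$1
    size=$(wc -c < "$archive")
    blocks=$((size / 512))
    n=0
    while [ "$n" -lt "$blocks" ]; do
        # Read the 5 magic bytes at offset 257 of block n, dropping NULs.
        magic=$(dd if="$archive" bs=1 skip=$((n * 512 + 257)) count=5 2>/dev/null |
                tr -d '\0')
        [ "$magic" = ustar ] && echo "$n"
        n=$((n + 1))
    done
}

# Usage: find_tar_headers damaged.tar
# Each number printed is a candidate value for nn in 'dd bs=512 skip=nn'.
```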
If anyone knows why cpio was made when tar was present at the unix scene, please tell me about this too.
Probably because it is more media efficient (by not blocking everything and using only the space needed for the headers where tar always uses 512 bytes per file header) and it knows how to archive special files.
You might want to look at the freely available alternatives. The major ones are afio, GNU tar, and pax, each of which has its own extensions with some backwards compatibility.
Sparse files were tarred as sparse files (which you can easily test, because the resulting archive gets smaller, and GNU cpio can no longer read it).
This document was generated on August 23, 2023 using texi2html 5.0.