One of the most common tasks that find is used for is locating files that can be deleted.
This example concentrates on the actual deletion task rather than on sophisticated ways of locating the files that need to be deleted. We’ll assume that the files we want to delete are old files underneath /var/tmp/stuff.
The traditional way to delete files in /var/tmp/stuff that have not been modified in over 90 days would have been:
find /var/tmp/stuff -mtime +90 -exec /bin/rm {} \;
The above command uses ‘-exec’ to run the /bin/rm command to remove each file. This approach works, and in fact would have worked in Version 7 Unix in 1979. However, there are a number of problems with it.
The most obvious problem with the approach above is that it causes find to fork every time it finds a file that it needs to delete, and the child process then has to use the exec system call to launch /bin/rm. All this is quite inefficient. If we are going to use /bin/rm to do this job, it is better to make it delete more than one file at a time.
The most obvious way of doing this is to use the shell’s command expansion feature:
/bin/rm `find /var/tmp/stuff -mtime +90 -print`
or you could use the more modern form
/bin/rm $(find /var/tmp/stuff -mtime +90 -print)
The commands above are much more efficient than the first attempt. However, there is a problem with them. The shell has a maximum command length which is imposed by the operating system (the actual limit varies between systems). This means that while the command expansion technique will usually work, it will suddenly fail when there are lots of files to delete. Since the task is to delete unwanted files, this is precisely the time we don’t want things to go wrong.
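The limit in question is the operating system's cap on the total size of the arguments (and environment) that can be passed to a new program via exec. Both the getconf utility and the ARG_MAX name are standard POSIX, though the value it reports varies between systems:

```shell
# Print the system's limit on the combined length of the arguments
# and environment that can be passed to an exec'd program.
getconf ARG_MAX
```

On modern Linux systems this is typically on the order of megabytes; historical systems were far more restrictive, which is why the command expansion technique fails precisely when the file list grows large.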
So, is there a way to be more efficient in the use of fork() and exec() without running up against this limit? Yes, we can be almost optimally efficient by making use of the xargs command. The xargs command reads arguments from its standard input and builds them into command lines. We can use it like this:
find /var/tmp/stuff -mtime +90 -print | xargs /bin/rm
For example, if the files found by find are /var/tmp/stuff/A, /var/tmp/stuff/B and /var/tmp/stuff/C, then xargs might issue the commands
/bin/rm /var/tmp/stuff/A /var/tmp/stuff/B
/bin/rm /var/tmp/stuff/C
The above assumes that xargs has a very small maximum command line length. The real limit is much larger, but the idea is that xargs will run /bin/rm as many times as necessary to get the job done, given the limits on command line length.
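This batching behaviour is easiest to observe with the ‘-n’ option of xargs, which artificially lowers the number of arguments allowed per invocation; echo stands in for /bin/rm so the demonstration is harmless:

```shell
# Feed xargs three arguments but allow at most two per command line;
# it runs echo twice, just as it would split a huge file list for rm.
printf '%s\n' A B C | xargs -n 2 echo
# prints:
#   A B
#   C
```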
This usage of xargs is pretty efficient, and the xargs command is widely implemented (all modern versions of Unix offer it).
So far then, the news is all good. However, there is bad news too.
Unix-like systems allow any characters to appear in file names with the exception of the ASCII NUL character and the slash. Slashes can occur in path names (as the directory separator) but not in the names of actual directory entries. This means that the list of files that xargs reads could in fact contain white space characters: spaces, tabs and newlines. Since, by default, xargs assumes that the list of files it is reading uses white space as an argument separator, it cannot correctly handle the case where a filename actually includes white space. This makes the default behaviour of xargs almost useless for handling arbitrary data.
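The splitting can be demonstrated without touching any real files. Here a single (made-up) name containing a space arrives on the standard input of xargs, which by default wrongly treats it as two separate arguments:

```shell
# One "filename" goes in, but the default separator rules split it in
# two; 'xargs -n 1 echo' then echoes each resulting argument on its
# own line, revealing the breakage.
printf '%s\n' 'hello world' | xargs -n 1 echo
# prints:
#   hello
#   world
```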
To solve this problem, GNU findutils introduced the ‘-print0’ action for find. This uses the ASCII NUL character to separate the entries in the file list that it produces. This is the ideal choice of separator, since it is the only character that cannot appear within a path name. The ‘-0’ option to xargs makes it assume that arguments are separated with ASCII NUL instead of white space. It also turns off another misfeature in the default behaviour of xargs, which is that it pays attention to quote characters in its input. Some versions of xargs also terminate when they see a lone ‘_’ in the input, but GNU xargs no longer does that (since it has become an optional behaviour in the Unix standard).
So, putting find -print0 together with xargs -0, we get this command:
find /var/tmp/stuff -mtime +90 -print0 | xargs -0 /bin/rm
The result is an efficient way of proceeding that correctly handles all the possible characters that could appear in the list of files to delete. This is good news. However, there is, as I’m sure you’re expecting, also more bad news. The problem is that this is not a portable construct; although other versions of Unix (notably BSD-derived ones) support ‘-print0’, it’s not universal. So, is there a more universal mechanism?
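Before moving on, the claim can be checked with a small experiment in a scratch directory (the directory and the filename here are invented for illustration): the NUL-separated pipeline survives a space in a filename that would have broken the default xargs behaviour.

```shell
# Create a scratch directory containing a file whose name has a space.
dir=$(mktemp -d)
touch "$dir/two words"

# The NUL-separated list keeps the name intact, so rm removes exactly
# that file instead of looking for files called "two" and "words".
find "$dir" -name '*words*' -print0 | xargs -0 rm

ls "$dir"      # the directory is empty again
rmdir "$dir"
```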
There is indeed a more universal mechanism, which is a slight modification to the ‘-exec’ action. The normal ‘-exec’ action assumes that the command to run is terminated with a semicolon (the semicolon normally has to be quoted in order to protect it from interpretation as the shell command separator). The SVR4 edition of Unix introduced a slight variation, which involves terminating the command with ‘+’ instead:
find /var/tmp/stuff -mtime +90 -exec /bin/rm {} \+
The above use of ‘-exec’ causes find to build up a long command line and then issue it. This can be less efficient than some uses of xargs; for example, xargs allows new command lines to be built up while the previous command is still executing, and allows you to specify a number of commands to run in parallel.
However, the find … -exec … + construct has the advantage of wide portability. GNU findutils did not support ‘-exec … +’ until version 4.2.12; one of the reasons for this is that it already had the ‘-print0’ action in any case.
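As with xargs, the batching can be observed by substituting echo for /bin/rm; all the names arrive in a single invocation. The scratch files below are invented for the example:

```shell
# Create a few scratch files and let find hand them to one echo call.
dir=$(mktemp -d)
touch "$dir/a" "$dir/b" "$dir/c"

# With '+', find appends as many names as will fit onto one command
# line, so echo prints all three paths on a single line of output.
find "$dir" -type f -exec echo {} +

rm -r "$dir"
```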
The command above seems to be efficient and portable. However, within it lurks a security problem, one shared with all the commands we’ve tried in this worked example so far. The security problem is a race condition: if somebody can manipulate the filesystem that you are searching while you are searching it, they can persuade your find command to cause the deletion of a file that you can delete but they normally cannot.
The problem occurs because the ‘-exec’ action is defined by the POSIX standard to invoke its command with the same working directory that find had when it was started. This means that the arguments which replace the {} include a relative path from find’s starting point down to the file that needs to be deleted. For example,
find /var/tmp/stuff -mtime +90 -exec /bin/rm {} \+
might actually issue the command:
/bin/rm /var/tmp/stuff/A /var/tmp/stuff/B /var/tmp/stuff/passwd
Notice the file /var/tmp/stuff/passwd. Likewise, the command:
cd /var/tmp && find stuff -mtime +90 -exec /bin/rm {} \+
might actually issue the command:
/bin/rm stuff/A stuff/B stuff/passwd
If an attacker can rename stuff to something else (making use of their write permissions in /var/tmp), they can replace it with a symbolic link to /etc. That means that the /bin/rm command will be invoked on /etc/passwd. If you are running your find command as root, the attacker has just managed to delete a vital file. All they needed to do to achieve this was replace a subdirectory with a symbolic link at the vital moment.
There is, however, a simple solution to the problem: an action which works a lot like -exec but doesn’t need to traverse a chain of directories to reach the file that it needs to work on. This is the ‘-execdir’ action, which was introduced by the BSD family of operating systems. The command
find /var/tmp/stuff -mtime +90 -execdir /bin/rm {} \+
might delete a set of files by performing these actions:
/bin/rm ./file1 ./file2 ./file3
/bin/rm ./file99 ./file100 ./file101
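The difference in what the two actions hand to the command can be observed safely by substituting echo for /bin/rm (the directory names below are made up):

```shell
dir=$(mktemp -d)
mkdir -p "$dir/sub"
touch "$dir/sub/f"

# -exec passes a path that traverses the intermediate directories;
# that chain is exactly what an attacker can swap for a symlink.
find "$dir" -name f -exec echo {} \;      # prints .../sub/f

# -execdir runs the command from the file's own directory, so the
# argument is just ./f and no directory chain needs to be trusted.
find "$dir" -name f -execdir echo {} \;   # prints ./f

rm -r "$dir"
```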
This is a much more secure method: we are no longer exposed to a race condition. For many typical uses of find, this is the best strategy. It’s reasonably efficient, but the length of the command line is limited not just by the operating system limits, but also by how many files we actually need to delete from each directory.
Is it possible to do any better? In the case of general file processing, no. However, in the specific case of deleting files it is indeed possible to do better.
The most efficient and secure method of solving this problem is to use the ‘-delete’ action:
find /var/tmp/stuff -mtime +90 -delete
This alternative is more efficient than any of the ‘-exec’ or ‘-execdir’ actions, since it entirely avoids the overhead of forking a new process and using exec to run /bin/rm. It is also normally more efficient than xargs, for the same reason. The file deletion is performed from the directory containing the entry to be deleted, so the ‘-delete’ action has the same security advantages as the ‘-execdir’ action.
The ‘-delete’ action was introduced by the BSD family of operating systems.
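A quick check in a throwaway directory (using -name rather than -mtime so the result is immediate) confirms the behaviour:

```shell
dir=$(mktemp -d)
touch "$dir/old-file"

# No external command is run at all: find itself unlinks the entry
# from within the directory that contains it.
find "$dir" -name 'old-file' -delete

ls "$dir"     # nothing left
rmdir "$dir"
```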
Is it possible to improve things still further? Not without either modifying the operating system, or having more specific knowledge of the layout of the filesystem and disk I/O subsystem, or both.
The find command traverses the filesystem, reading directories, and then issues a separate system call for each file to be deleted. If we could modify the operating system, there are potential gains that could be made; for example, directory entries could be deleted directly by inode number (readdir() also returns the inode number of each directory entry). Such possibilities sound interesting, but from the kernel’s point of view it is difficult to enforce standard Unix access controls for processing by inode number, so such a facility would probably need to be restricted to the superuser.
Another way of improving performance would be to increase the parallelism of the process. For example, if the directory hierarchy we are searching is actually spread across a number of disks, we might somehow be able to arrange for find to process each disk in parallel. In practice GNU find doesn’t have such an intimate understanding of the system’s filesystem layout and disk I/O subsystem.
However, since the system administrator can have such an understanding they can take advantage of it like so:
find /var/tmp/stuff1 -mtime +90 -delete &
find /var/tmp/stuff2 -mtime +90 -delete &
find /var/tmp/stuff3 -mtime +90 -delete &
find /var/tmp/stuff4 -mtime +90 -delete &
wait
In the example above, four separate instances of find are used to search four subdirectories in parallel. The wait command simply waits for all of these to complete. Whether this approach is more or less efficient than a single instance of find depends on a number of things, such as whether the subdirectories really do reside on independent disks.
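At a small scale the same pattern can be exercised in scratch directories created for the purpose (again using -name instead of -mtime so the deletions happen immediately):

```shell
# Two independent trees, each cleaned by its own background find.
d1=$(mktemp -d)
d2=$(mktemp -d)
touch "$d1/junk" "$d2/junk"

find "$d1" -name junk -delete &
find "$d2" -name junk -delete &
wait    # block until both background instances have finished

rmdir "$d1" "$d2"   # succeeds only because both are empty again
```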
The fastest and most secure way to delete files with the help of find is to use ‘-delete’. Using xargs -0 -P N can also make effective use of the disk, but it is not as secure.
In the case where we’re doing things other than deleting files, the most secure alternative is ‘-execdir … +’, but this is not as portable as the insecure action ‘-exec … +’.
The ‘-delete’ action is not completely portable, but the only other possibility which is as secure (‘-execdir’) is no more portable. The most efficient portable alternative is ‘-exec … +’, but this is insecure and isn’t supported by versions of GNU findutils prior to 4.2.12.