One of the most common tasks that find is used for is locating files that can be deleted.
This example concentrates on the actual deletion task rather than on sophisticated ways of locating the files that need to be deleted. We’ll assume that the files we want to delete are old files underneath /var/tmp/stuff.
The traditional way to delete files in /var/tmp/stuff that have not been modified in over 90 days would have been:
find /var/tmp/stuff -mtime +90 -exec /bin/rm {} \;
The above command uses ‘-exec’ to run the /bin/rm command to remove each file. This approach works, and in fact would have worked in Version 7 Unix in 1979. However, there are a number of problems with it.
The most obvious problem with the approach above is that it causes find to fork every time it finds a file that it needs to delete, and the child process then has to use the exec system call to launch /bin/rm. All this is quite inefficient. If we are going to use /bin/rm to do this job, it is better to make it delete more than one file at a time.
The most obvious way of doing this is to use the shell’s command expansion feature:
/bin/rm `find /var/tmp/stuff -mtime +90 -print`
or you could use the more modern form
/bin/rm $(find /var/tmp/stuff -mtime +90 -print)
The commands above are much more efficient than the first attempt. However, there is a problem with them. The shell has a maximum command length which is imposed by the operating system (the actual limit varies between systems). This means that while the command expansion technique will usually work, it will suddenly fail when there are lots of files to delete. Since the task is to delete unwanted files, this is precisely the time we don’t want things to go wrong.
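The limit in question is the operating system's cap on the total size of the arguments (and environment) that can be passed to a new program via exec. Both the getconf utility and the ARG_MAX name are standard POSIX, though the value it reports varies between systems:

```shell
# Print the system's limit on the combined length of the arguments
# and environment that can be passed to an exec'd program.
getconf ARG_MAX
```

On modern Linux systems this is typically on the order of megabytes; historical systems were far more restrictive, which is why the command expansion technique fails precisely when the file list grows large.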
So, is there a way to be more efficient in the use of fork() and exec() without running up against this limit? Yes, we can be almost optimally efficient by making use of the xargs command. The xargs command reads arguments from its standard input and builds them into command lines. We can use it like this:
find /var/tmp/stuff -mtime +90 -print | xargs /bin/rm
For example, if the files found by find are /var/tmp/stuff/A, /var/tmp/stuff/B and /var/tmp/stuff/C, then xargs might issue the commands
/bin/rm /var/tmp/stuff/A /var/tmp/stuff/B
/bin/rm /var/tmp/stuff/C
The above assumes that xargs has a very small maximum command line length. The real limit is much larger, but the idea is that xargs will run /bin/rm as many times as necessary to get the job done, given the limits on command line length.
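This batching behaviour is easiest to observe with the ‘-n’ option of xargs, which artificially lowers the number of arguments allowed per invocation; echo stands in for /bin/rm so the demonstration is harmless:

```shell
# Feed xargs three arguments but allow at most two per command line;
# it runs echo twice, just as it would split a huge file list for rm.
printf '%s\n' A B C | xargs -n 2 echo
# prints:
#   A B
#   C
```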
This usage of xargs is pretty efficient, and the xargs command is widely implemented (all modern versions of Unix offer it).
So far then, the news is all good. However, there is bad news too.
Unix-like systems allow any characters to appear in file names with the exception of the ASCII NUL character and the slash. Slashes can occur in path names (as the directory separator) but not in the names of actual directory entries. This means that the list of files that xargs reads could in fact contain white space characters: spaces, tabs and newlines. Since, by default, xargs assumes that the list of files it is reading uses white space as an argument separator, it cannot correctly handle the case where a filename actually includes white space. This makes the default behaviour of xargs almost useless for handling arbitrary data.
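The splitting can be demonstrated without touching any real files. Here a single (made-up) name containing a space arrives on the standard input of xargs, which by default wrongly treats it as two separate arguments:

```shell
# One "filename" goes in, but the default separator rules split it in
# two; 'xargs -n 1 echo' then echoes each resulting argument on its
# own line, revealing the breakage.
printf '%s\n' 'hello world' | xargs -n 1 echo
# prints:
#   hello
#   world
```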
To solve this problem, GNU findutils introduced the ‘-print0’ action for find. This uses the ASCII NUL character to separate the entries in the file list that it produces. This is the ideal choice of separator, since it is the only character that cannot appear within a path name. The ‘-0’ option to xargs makes it assume that arguments are separated with ASCII NUL instead of white space. It also turns off another misfeature in the default behaviour of xargs, which is that it pays attention to quote characters in its input. Some versions of xargs also terminate when they see a lone ‘_’ in the input, but GNU xargs no longer does that (since it has become an optional behaviour in the Unix standard).
So, putting find -print0 together with xargs -0, we get this command:
find /var/tmp/stuff -mtime +90 -print0 | xargs -0 /bin/rm
The result is an efficient way of proceeding that correctly handles all the possible characters that could appear in the list of files to delete. This is good news. However, there is, as I’m sure you’re expecting, also more bad news. The problem is that this is not a portable construct; although other versions of Unix (notably BSD-derived ones) support ‘-print0’, it’s not universal. So, is there a more universal mechanism?
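Before moving on, the claim can be checked with a small experiment in a scratch directory (the directory and the filename here are invented for illustration): the NUL-separated pipeline survives a space in a filename that would have broken the default xargs behaviour.

```shell
# Create a scratch directory containing a file whose name has a space.
dir=$(mktemp -d)
touch "$dir/two words"

# The NUL-separated list keeps the name intact, so rm removes exactly
# that file instead of looking for files called "two" and "words".
find "$dir" -name '*words*' -print0 | xargs -0 rm

ls "$dir"      # the directory is empty again
rmdir "$dir"
```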
There is indeed a more universal mechanism, which is a slight modification to the ‘-exec’ action. The normal ‘-exec’ action assumes that the command to run is terminated with a semicolon (the semicolon normally has to be quoted in order to protect it from interpretation as the shell command separator). The SVR4 edition of Unix introduced a slight variation, which involves terminating the command with ‘+’ instead:
find /var/tmp/stuff -mtime +90 -exec /bin/rm {} \+
The above use of ‘-exec’ causes find to build up a long command line and then issue it. This can be less efficient than some uses of xargs; for example, xargs allows new command lines to be built up while the previous command is still executing, and allows you to specify a number of commands to run in parallel.
However, the find … -exec … + construct has the advantage of wide portability. GNU findutils did not support ‘-exec … +’ until version 4.2.12; one of the reasons for this is that it already had the ‘-print0’ action in any case.
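As with xargs, the batching can be observed by substituting echo for /bin/rm; all the names arrive in a single invocation. The scratch files below are invented for the example:

```shell
# Create a few scratch files and let find hand them to one echo call.
dir=$(mktemp -d)
touch "$dir/a" "$dir/b" "$dir/c"

# With '+', find appends as many names as will fit onto one command
# line, so echo prints all three paths on a single line of output.
find "$dir" -type f -exec echo {} +

rm -r "$dir"
```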
The command above seems to be efficient and portable. However, within it lurks a security problem, one shared with all the commands we’ve tried in this worked example so far. The security problem is a race condition: if somebody can manipulate the filesystem that you are searching while you are searching it, they can persuade your find command to cause the deletion of a file that you can delete but they normally cannot.
The problem occurs because the ‘-exec’ action is defined by the POSIX standard to invoke its command with the same working directory that find had when it was started. This means that the arguments which replace the {} include a relative path from find’s starting point down to the file that needs to be deleted. For example,
find /var/tmp/stuff -mtime +90 -exec /bin/rm {} \+
might actually issue the command:
/bin/rm /var/tmp/stuff/A /var/tmp/stuff/B /var/tmp/stuff/passwd
Notice the file /var/tmp/stuff/passwd. Likewise, the command:
cd /var/tmp && find stuff -mtime +90 -exec /bin/rm {} \+
might actually issue the command:
/bin/rm stuff/A stuff/B stuff/passwd
If an attacker can rename stuff to something else (making use of their write permissions in /var/tmp), they can replace it with a symbolic link to /etc. That means that the /bin/rm command will be invoked on /etc/passwd. If you are running your find command as root, the attacker has just managed to delete a vital file. All they needed to do to achieve this was replace a subdirectory with a symbolic link at the vital moment.
There is, however, a simple solution to the problem: an action which works a lot like -exec but doesn’t need to traverse a chain of directories to reach the file that it needs to work on. This is the ‘-execdir’ action, which was introduced by the BSD family of operating systems. The command
find /var/tmp/stuff -mtime +90 -execdir /bin/rm {} \+
might delete a set of files by performing these actions:
/bin/rm ./file1 ./file2 ./file3
/bin/rm ./file99 ./file100 ./file101
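The difference in what the two actions hand to the command can be observed safely by substituting echo for /bin/rm (the directory names below are made up):

```shell
dir=$(mktemp -d)
mkdir -p "$dir/sub"
touch "$dir/sub/f"

# -exec passes a path that traverses the intermediate directories;
# that chain is exactly what an attacker can swap for a symlink.
find "$dir" -name f -exec echo {} \;      # prints .../sub/f

# -execdir runs the command from the file's own directory, so the
# argument is just ./f and no directory chain needs to be trusted.
find "$dir" -name f -execdir echo {} \;   # prints ./f

rm -r "$dir"
```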
This is a much more secure method: we are no longer exposed to a race condition. For many typical uses of find, this is the best strategy. It’s reasonably efficient, but the length of the command line is limited not just by the operating system limits, but also by how many files we actually need to delete from each directory.
Is it possible to do any better? In the case of general file processing, no. However, in the specific case of deleting files it is indeed possible to do better.
The most efficient and secure method of solving this problem is to use the ‘-delete’ action:
find /var/tmp/stuff -mtime +90 -delete
This alternative is more efficient than any of the ‘-exec’ or ‘-execdir’ actions, since it entirely avoids the overhead of forking a new process and using exec to run /bin/rm. It is also normally more efficient than xargs, for the same reason. The file deletion is performed from the directory containing the entry to be deleted, so the ‘-delete’ action has the same security advantages as the ‘-execdir’ action.
The ‘-delete’ action was introduced by the BSD family of operating systems.
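A quick check in a throwaway directory (using -name rather than -mtime so the result is immediate) confirms the behaviour:

```shell
dir=$(mktemp -d)
touch "$dir/old-file"

# No external command is run at all: find itself unlinks the entry
# from within the directory that contains it.
find "$dir" -name 'old-file' -delete

ls "$dir"     # nothing left
rmdir "$dir"
```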
Is it possible to improve things still further? Not without either modifying the operating system, or having more specific knowledge of the layout of the filesystem and disk I/O subsystem, or both.
The find command traverses the filesystem, reading directories, and then issues a separate system call for each file to be deleted. If we could modify the operating system, there are potential gains that could be made; for example, directory entries could be deleted directly by inode number (readdir() also returns the inode number of each directory entry). Such possibilities sound interesting, but from the kernel’s point of view it is difficult to enforce standard Unix access controls for processing by inode number, so such a facility would probably need to be restricted to the superuser.
Another way of improving performance would be to increase the parallelism of the process. For example, if the directory hierarchy we are searching is actually spread across a number of disks, we might somehow be able to arrange for find to process each disk in parallel. In practice GNU find doesn’t have such an intimate understanding of the system’s filesystem layout and disk I/O subsystem.
However, since the system administrator can have such an understanding they can take advantage of it like so:
find /var/tmp/stuff1 -mtime +90 -delete &
find /var/tmp/stuff2 -mtime +90 -delete &
find /var/tmp/stuff3 -mtime +90 -delete &
find /var/tmp/stuff4 -mtime +90 -delete &
wait
In the example above, four separate instances of find are used to search four subdirectories in parallel. The wait command simply waits for all of these to complete. Whether this approach is more or less efficient than a single instance of find depends on a number of things, such as whether the subdirectories really do reside on independent disks.
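At a small scale the same pattern can be exercised in scratch directories created for the purpose (again using -name instead of -mtime so the deletions happen immediately):

```shell
# Two independent trees, each cleaned by its own background find.
d1=$(mktemp -d)
d2=$(mktemp -d)
touch "$d1/junk" "$d2/junk"

find "$d1" -name junk -delete &
find "$d2" -name junk -delete &
wait    # block until both background instances have finished

rmdir "$d1" "$d2"   # succeeds only because both are empty again
```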
The fastest and most secure way to delete files with the help of find is to use ‘-delete’. Using xargs -0 -P N can also make effective use of the disk, but it is not as secure.
In the case where we’re doing things other than deleting files, the most secure alternative is ‘-execdir … +’, but this is not as portable as the insecure action ‘-exec … +’.
The ‘-delete’ action is not completely portable, but the only other possibility which is as secure (‘-execdir’) is no more portable. The most efficient portable alternative is ‘-exec … +’, but this is insecure and isn’t supported by versions of GNU findutils prior to 4.2.12.