Typically grep
is an efficient way to search text. However,
it can be quite slow in some cases, and it can search large files
where even minor performance tweaking can help significantly.
Although the algorithm used by grep
is an implementation
detail that can change from release to release, understanding its
basic strengths and weaknesses can help you improve its performance.
The grep
command operates partly via a set of automata that
are designed for efficiency, and partly via a slower matcher that
takes over when the fast matchers run into unusual features like
back-references. When feasible, the Boyer–Moore fast string
searching algorithm is used to match a single fixed pattern, and the
Aho–Corasick algorithm is used to match multiple fixed patterns.
Generally speaking grep
operates more efficiently in
single-byte locales, since it can avoid the special processing needed
for multi-byte characters. If your patterns will work just as well
that way, setting LC_ALL
to a single-byte locale can help
performance considerably. Setting ‘LC_ALL='C'’ can be
particularly efficient, as grep
is tuned for that locale.
Outside the ‘C’ locale, case-insensitive search, and search for bracket expressions like ‘[a-z]’ and ‘[[=a=]b]’, can be surprisingly inefficient due to difficulties in fast portable access to concepts like multi-character collating elements.
Interval expressions may be implemented internally via repetition. For example, ‘^(a|bc){2,4}$’ might be implemented as ‘^(a|bc)(a|bc)((a|bc)(a|bc)?)?$’. A large repetition count may exhaust memory or greatly slow matching. Even small counts can cause problems if cascaded; for example, ‘grep -E ".*{10,}{10,}{10,}{10,}{10,}"’ is likely to overflow a stack. Fortunately, regular expressions like these are typically artificial, and cascaded repetitions do not conform to POSIX so cannot be used in portable programs anyway.
A back-reference such as ‘\1’ can hurt performance significantly in some cases, since back-references cannot in general be implemented via a finite state automaton, and instead trigger a backtracking algorithm that can be quite inefficient. For example, although the pattern ‘^(.*)\1{14}(.*)\2{13}$’ matches only lines whose lengths can be written as a sum 15x + 14y for nonnegative integers x and y, the pattern matcher does not perform linear Diophantine analysis and instead backtracks through all possible matching strings, using an algorithm that is exponential in the worst case.
On some operating systems that support files with holes—large
regions of zeros that are not physically present on secondary
storage—grep
can skip over the holes efficiently without
needing to read the zeros. This optimization is not available if the
-a (--binary-files=text) option is used (see File and Directory Selection), unless the -z (--null-data)
option is also used (see Other Options).
For efficiency grep
does not always read all its input.
For example, the shell command ‘sed '/^...$/d' | grep -q X’ can
cause grep
to exit immediately after reading a line
containing ‘X’, without bothering to read the rest of its input data.
This in turn can cause sed
to exit with a nonzero status because
sed
cannot write to its output pipe after grep
exits.
For more about the algorithms used by grep
and about
related string matching algorithms, see:
grep
.
grep
in the future.
grep
.