To use sed effectively, you need to understand regular expressions (regexp for short). A regular expression is a pattern that is matched against a subject string from left to right. Most characters are ordinary: they stand for themselves in a pattern, and match the corresponding characters in the subject. As a trivial example, the pattern
The quick brown fox
matches a portion of a subject string that is identical to itself. The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by the use of special characters, which do not stand for themselves but instead are interpreted in some special way. Here is a brief description of regular expression syntax as used in sed.
*
Matches a sequence of zero or more instances of matches for the preceding regular expression, which must be an ordinary character, a special character preceded by \, a ., a grouped regexp (see below), or a bracket expression. As a GNU extension, a postfixed regular expression can also be followed by *; for example, a** is equivalent to a*. POSIX 1003.1-2001 says that * stands for itself when it appears at the start of a regular expression or subexpression, but many non-GNU implementations do not support this, so portable scripts should use \* in these contexts instead.
\+
As *, but matches one or more. It is a GNU extension.
\?
As *, but only matches zero or one. It is a GNU extension.
\{i\}
As *, but matches exactly i sequences (i is a decimal integer; for portability, keep it between 0 and 255 inclusive).
\{i,j\}
Matches between i and j sequences, inclusive.
\{i,\}
Matches i or more sequences.
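The repetition operators above can be compared side by side. The following is a minimal sketch, assuming GNU sed (needed for \+ and \?); the helper name match and the sample lines are illustrative only. Each pattern is anchored with ^ and $ so that the whole line has to match.

```shell
# Hypothetical helper: print, on one line, the sample lines that match
# the regular expression given as the first argument.
match() {
  printf '%s\n' b ab aab aaab | sed -n "/$1/p" | paste -s -d ' ' -
}

star=$(match '^a*b$')            # zero or more a:  b ab aab aaab
plus=$(match '^a\+b$')           # one or more a (GNU extension):  ab aab aaab
opt=$(match '^a\?b$')            # zero or one a (GNU extension):  b ab
exact=$(match '^a\{2\}b$')       # exactly two a:   aab
range=$(match '^a\{1,2\}b$')     # one or two a:    ab aab
atleast=$(match '^a\{2,\}b$')    # two or more a:   aab aaab

printf '%s\n' "$star" "$plus" "$opt" "$exact" "$range" "$atleast"
```

In a script that must run on non-GNU implementations, a\{1,\} and a\{0,1\} can stand in for a\+ and a\?.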
\(regexp\)
Groups the inner regexp as a whole. Grouping is used to apply postfix operators, like \(abcd\)*: this will search for zero or more whole sequences of ‘abcd’, while abcd* would search for ‘abc’ followed by zero or more occurrences of ‘d’. Note that support for \(abcd\)* is required by POSIX 1003.1-2001, but many non-GNU implementations do not support it, so it is not universally portable. Grouping is also used for back references (see below).
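The difference between \(abcd\)* and abcd* is easy to see with a substitution. This is a small sketch with an illustrative input string; the variable names are not part of sed.

```shell
# With grouping, * repeats the whole sequence "abcd".
grouped=$(printf '%s\n' 'abcdabcdd' | sed 's/\(abcd\)*/X/')
# Without grouping, * applies only to the final "d".
ungrouped=$(printf '%s\n' 'abcdabcdd' | sed 's/abcd*/X/')

printf '%s\n' "$grouped"     # Xd
printf '%s\n' "$ungrouped"   # Xabcdd
```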
.
Matches any character, including newline.
^
Matches the null string at the beginning of the pattern space; that is, what appears after the circumflex must appear at the beginning of the pattern space.
In most scripts, pattern space is initialized to the content of each line (see How sed works). So, it is a useful simplification to think of ^#include as matching only lines where ‘#include’ is the first thing on the line; if there are spaces before it, for example, the match fails. This simplification is valid as long as the original content of pattern space is not modified, for example with an s command.
^ acts as a special character only at the beginning of the regular expression or subexpression (that is, after \( or \|). Portable scripts should avoid ^ at the beginning of a subexpression, though, as POSIX allows implementations that treat ^ as an ordinary character in that context.
$
It is the same as ^, but refers to the end of pattern space. $ also acts as a special character only at the end of the regular expression or subexpression (that is, before \) or \|), and its use at the end of a subexpression is not portable.
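The behavior of ^ and $ described above can be tried out as follows. This is a sketch with made-up input lines; the last command illustrates that the anchors apply to the current content of pattern space, not to the original input line.

```shell
input=$(printf '%s\n' \
  '#include <stdio.h>' \
  '  #include <math.h>' \
  'int x; // #include' \
  'return 0;')

# ^ requires the match to start at the beginning of pattern space.
at_start=$(printf '%s\n' "$input" | sed -n '/^#include/p')
# $ requires the match to end at the end of pattern space.
at_end=$(printf '%s\n' "$input" | sed -n '/#include$/p')
# Once s has stripped the leading spaces, ^#include matches the modified text.
after_s=$(printf '%s\n' '  #include <math.h>' | sed -n -e 's/^ *//' -e '/^#include/p')

printf '%s\n' "$at_start"   # #include <stdio.h>
printf '%s\n' "$at_end"     # int x; // #include
printf '%s\n' "$after_s"    # #include <math.h>
```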
[list]
[^list]
Matches any single character in list: for example, [aeiou] matches all vowels. A list may include sequences like char1-char2, which matches any character between char1 and char2, inclusive.
A leading ^ reverses the meaning of list, so that it matches any single character not in list. To include ] in the list, make it the first character (after the ^ if needed); to include - in the list, make it the first or last; to include ^, put it after the first character.
The characters $, *, ., [, and \ are normally not special within list. For example, [\*] matches either ‘\’ or ‘*’, because the \ is not special here. However, strings like [.ch.], [=a=], and [:space:] are special within list and represent collating symbols, equivalence classes, and character classes, respectively, and [ is therefore special within list when it is followed by ., =, or :. Also, when not in POSIXLY_CORRECT mode, special escapes like \n and \t are recognized within list. See Escapes.
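The bracket expression rules above can be checked with a few substitutions. This is a sketch using illustrative inputs and variable names; the [\*] line assumes the behavior described above, where \ is not special inside a list.

```shell
# A plain list: every vowel is replaced.
vowels=$(printf '%s\n' 'regular expression' | sed 's/[aeiou]/_/g')
# A range: any digit.
digits=$(printf '%s\n' 'room 101b' | sed 's/[0-9]/#/g')
# A negated list: anything that is not a lowercase letter.
negated=$(printf '%s\n' 'a-b_c d' | sed 's/[^a-z]/./g')
# ] placed first and - placed last are both taken literally.
literals=$(printf '%s\n' 'x[1]-y' | sed 's/[]-]/:/g')
# Inside a list, \ and * stand for themselves.
backslash=$(printf '%s\n' 'a\b*c' | sed 's/[\*]/+/g')
# A character class: space and tab both match [:space:].
spaces=$(printf 'a b\tc\n' | sed 's/[[:space:]]/_/g')

printf '%s\n' "$vowels"     # r_g_l_r _xpr_ss__n
printf '%s\n' "$digits"     # room ###b
printf '%s\n' "$negated"    # a.b.c.d
printf '%s\n' "$literals"   # x[1::y
printf '%s\n' "$backslash"  # a+b+c
printf '%s\n' "$spaces"     # a_b_c
```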
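regexp1\|regexp2
Matches either regexp1 or regexp2. Use parentheses to group complex alternatives. It is a GNU extension.
regexp1regexp2
Matches the concatenation of regexp1 and regexp2. Concatenation binds more tightly than \|, ^, and $, but less tightly than the other regular expression operators.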
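Because concatenation binds more tightly than \|, the pattern ab\|cd means ‘ab’ or ‘cd’, not ‘a’, then ‘b’ or ‘c’, then ‘d’. The following sketch assumes GNU sed, since \| is a GNU extension; the inputs and variable names are illustrative.

```shell
# Simple alternation between two words.
pets=$(printf '%s\n' 'cat dog bird' | sed 's/cat\|dog/pet/g')
# Without grouping: matches "ab" or "cd".
ungrouped=$(printf '%s\n' 'ab acd abd' | sed 's/ab\|cd/X/g')
# With grouping: matches "abd" or "acd".
grouped=$(printf '%s\n' 'ab acd abd' | sed 's/a\(b\|c\)d/X/g')

printf '%s\n' "$pets"        # pet pet bird
printf '%s\n' "$ungrouped"   # X aX Xd
printf '%s\n' "$grouped"     # ab X X
```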
\digit
Matches the digit-th \(...\) parenthesized subexpression in the regular expression. This is called a back reference. Subexpressions are implicitly numbered by counting occurrences of \( left-to-right.
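A short sketch of back references, using illustrative inputs: the first command finds lines containing a doubled character, and the second shows how nested groups are numbered by the position of their opening \(.

```shell
# \(.\)\1 matches any character followed by the same character again.
doubled=$(printf '%s\n' 'hello' 'world' 'book' | sed -n '/\(.\)\1/p')
# Group 1 is "ab" (outer parentheses); group 2 is "b" (inner parentheses).
nested=$(printf '%s\n' 'abab-b' 'abab-a' | sed -n '/^\(a\(b\)\)\1-\2$/p')

printf '%s\n' "$doubled"   # hello and book, one per line
printf '%s\n' "$nested"    # abab-b
```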
\n
Matches the newline character.
\char
Matches char, where char is one of $, *, ., [, \, or ^. Note that the only C-like backslash sequences that you can portably assume to be interpreted are \n and \\; in particular, \t is not portable, and matches a ‘t’ under most implementations of sed, rather than a tab character.
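The escapes above can be exercised as follows. This is a sketch with illustrative inputs; since \t is not portable, the last command passes a literal tab character to sed through a shell variable (the name tab is arbitrary).

```shell
# \. matches only a literal dot, so "3x14" is not selected.
literal_dot=$(printf '%s\n' '3.14' '3x14' | sed -n '/3\.14/p')
# \* matches a literal asterisk.
literal_star=$(printf '%s\n' 'a*b' | sed 's/a\*b/ok/')
# N appends the next line to pattern space; \n then matches the embedded newline.
joined=$(printf '%s\n' 'first' 'second' | sed -e 'N' -e 's/\n/+/')
# Portable alternative to \t: put a real tab character into the script.
tab=$(printf '\t')
untabbed=$(printf 'a\tb\n' | sed "s/$tab/_/")

printf '%s\n' "$literal_dot"    # 3.14
printf '%s\n' "$literal_star"   # ok
printf '%s\n' "$joined"         # first+second
printf '%s\n' "$untabbed"       # a_b
```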
Note that the regular expression matcher is greedy, i.e., matches are attempted from left to right and, if two or more matches are possible starting at the same character, it selects the longest.
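The leftmost-then-longest rule can be observed directly. In this sketch (inputs and variable names are illustrative), the first substitution takes the longest match from the first ‘<’, the second narrows the match with a bracket expression, and the third shows that an earlier match wins over a longer one that starts later.

```shell
# .* runs to the last ">" on the line, not the first.
greedy=$(printf '%s\n' '<b>bold</b> text' | sed 's/<.*>/X/')
# [^>]* cannot cross a ">", so only the first tag is matched.
narrow=$(printf '%s\n' '<b>bold</b> text' | sed 's/<[^>]*>/X/')
# The first run "aa" is replaced even though a longer run follows.
leftmost=$(printf '%s\n' 'xaaxaaaa' | sed 's/aa*/_/')

printf '%s\n' "$greedy"     # X text
printf '%s\n' "$narrow"     # Xbold</b> text
printf '%s\n' "$leftmost"   # x_xaaaa
```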
Examples: