35.9 Emacs versus POSIX Regular Expressions
Regular expression syntax varies significantly among computer programs.
When writing Elisp code that generates regular expressions for use by other
programs, it is helpful to know how syntax variants differ.
To give a feel for the variation, this section discusses how
Emacs regular expressions differ from two syntax variants standardized by POSIX:
basic regular expressions (BREs) and extended regular expressions (EREs).
Plain grep uses BREs, and ‘grep -E’ uses EREs.
Emacs regular expressions have a syntax closer to EREs than to BREs,
with some extensions. Here is a summary of how POSIX BREs and EREs
differ from Emacs regular expressions.
- In POSIX BREs ‘+’ and ‘?’ are not special.
The only backslash escape sequences are ‘\(…\)’,
‘\{…\}’, ‘\1’ through ‘\9’, along with the
escaped special characters ‘\$’, ‘\*’, ‘\.’, ‘\[’,
‘\\’, and ‘\^’.
Therefore ‘\(?:’ acts like ‘\([?]:’.
POSIX does not define how other BRE escapes behave;
for example, GNU grep treats ‘\|’ like Emacs does,
but does not support all the Emacs escapes.
- In POSIX BREs, it is an implementation option whether ‘^’ is special
after ‘\(’; GNU grep treats it like Emacs does.
In POSIX EREs, ‘^’ is always special outside of bracket expressions,
which means the ERE ‘x^’ never matches.
In Emacs regular expressions, ‘^’ is special only at the
beginning of the regular expression, or after ‘\(’, ‘\(?:’
or ‘\|’.
- In POSIX BREs, it is an implementation option whether ‘$’ is
special before ‘\)’; GNU grep treats it like Emacs
does. In POSIX EREs, ‘$’ is always special outside of bracket
expressions (see bracket expressions), which means
the ERE ‘$x’ never matches. In Emacs regular expressions,
‘$’ is special only at the end of the regular expression, or
before ‘\)’ or ‘\|’.
- In POSIX EREs ‘{’, ‘(’ and ‘|’ are special,
and ‘)’ is special when matched with a preceding ‘(’.
These special characters do not use preceding backslashes;
‘(?’ produces undefined results.
The only backslash escape sequences are the escaped special characters
‘\$’, ‘\(’, ‘\)’, ‘\*’, ‘\+’, ‘\.’,
‘\?’, ‘\[’, ‘\\’, ‘\^’, ‘\{’ and ‘\|’.
POSIX does not define how other ERE escapes behave;
for example, GNU ‘grep -E’ treats ‘\1’ like Emacs does,
but does not support all the Emacs escapes.
- In POSIX BREs and EREs, undefined results are produced by repetition
operators at the start of a regular expression or subexpression
(possibly preceded by ‘^’), except that the repetition operator
‘*’ has the same behavior in BREs as in Emacs.
In Emacs, these operators are treated as ordinary.
- In BREs and EREs, undefined results are produced by two repetition
operators in sequence. In Emacs, these have well-defined behavior,
e.g., ‘a**’ is equivalent to ‘a*’.
- In BREs and EREs, undefined results are produced by empty regular
expressions or subexpressions. In Emacs these have well-defined
behavior, e.g., ‘\(\)*’ matches the empty string.
- In BREs and EREs, undefined results are produced for the named
character classes ‘[:ascii:]’, ‘[:multibyte:]’,
‘[:nonascii:]’, ‘[:unibyte:]’, and ‘[:word:]’.
Emacs regular expressions support all of these classes.
- BREs and EREs can contain collating symbols and equivalence
class expressions within bracket expressions, e.g., ‘[[.ch.]d[=a=]]’.
Emacs regular expressions do not support this.
- BREs, EREs, and the strings they match cannot contain encoding errors
or NUL bytes. In Emacs these constructs simply match themselves.
- BRE and ERE searching always finds the longest match.
Emacs searching by default does not necessarily do so.
See Longest-match searching for regular expression matches.
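
The Emacs side of several items in this summary can be checked directly
from Lisp. The following is a minimal sketch rather than part of Emacs:
the helper ‘demo-regexp-matched-text’ is a hypothetical name introduced
only for illustration. It relies on the standard functions
‘string-match’, ‘posix-string-match’ and ‘match-string’. The sketch
says nothing about how a particular POSIX implementation treats the
same patterns, since several of them produce undefined results there.

```elisp
;; Hypothetical helper, not part of Emacs: return the text that
;; REGEXP matches in STRING, or nil if there is no match.
(defun demo-regexp-matched-text (regexp string &optional longest)
  "Return the text REGEXP matches in STRING, or nil if there is no match.
If LONGEST is non-nil, use `posix-string-match' to find the longest match."
  (save-match-data
    (when (if longest
              (posix-string-match regexp string)
            (string-match regexp string))
      (match-string 0 string))))

;; `^' and `$' are ordinary away from the positions where they anchor.
(demo-regexp-matched-text "x^" "ax^b")        ; => "x^"
(demo-regexp-matched-text "$x" "a$xb")        ; => "$x"

;; A repetition operator with nothing to repeat is ordinary.
(demo-regexp-matched-text "*a" "b*a")         ; => "*a"

;; Two repetition operators in sequence behave like one.
(demo-regexp-matched-text "a**" "aaab")       ; => "aaa"

;; An empty subexpression matches the empty string.
(demo-regexp-matched-text "\\(\\)*" "abc")    ; => ""

;; Default matching stops at the first alternative that succeeds,
;; whereas the POSIX-style function finds the longest match.
(demo-regexp-matched-text "a\\|ab" "ab")      ; => "a"
(demo-regexp-matched-text "a\\|ab" "ab" t)    ; => "ab"
```

When generating a regular expression for another program, cases such
as these are worth avoiding or rewriting, because a pattern that is
well defined in Emacs may be undefined or mean something else in a BRE
or an ERE.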