POSIX Regexps (GNU Emacs Lisp Reference Manual)

35.9 Emacs versus POSIX Regular Expressions

Regular expression syntax varies significantly among computer programs. When writing Elisp code that generates regular expressions for use by other programs, it is helpful to know how syntax variants differ. To give a feel for the variation, this section discusses how Emacs regular expressions differ from two syntax variants standardized by POSIX: basic regular expressions (BREs) and extended regular expressions (EREs). Plain grep uses BREs, and ‘grep -E’ uses EREs.

Emacs regular expressions have a syntax closer to EREs than to BREs, with some extensions. Here is a summary of how POSIX BREs and EREs differ from Emacs regular expressions.

In POSIX BREs ‘+’ and ‘?’ are not special. The only backslash escape sequences are ‘\(…\)’, ‘\{…\}’, ‘\1’ through ‘\9’, along with the escaped special characters ‘\$’, ‘\*’, ‘\.’, ‘\[’, ‘\\’, and ‘\^’. Therefore ‘\(?:’ acts like ‘\([?]:’. POSIX does not define how other BRE escapes behave; for example, GNU grep treats ‘\|’ like Emacs does, but does not support all the Emacs escapes.
In POSIX BREs, it is an implementation option whether ‘^’ is special after ‘\(’; GNU grep treats it like Emacs does. In POSIX EREs, ‘^’ is always special outside of bracket expressions, which means the ERE ‘x^’ never matches. In Emacs regular expressions, ‘^’ is special only at the beginning of the regular expression, or after ‘\(’, ‘\(?:’ or ‘\|’.
In POSIX BREs, it is an implementation option whether ‘$’ is special before ‘\)’; GNU grep treats it like Emacs does. In POSIX EREs, ‘$’ is always special outside of bracket expressions (see bracket expressions), which means the ERE ‘$x’ never matches. In Emacs regular expressions, ‘$’ is special only at the end of the regular expression, or before ‘\)’ or ‘\|’.
In POSIX EREs ‘{’, ‘(’ and ‘|’ are special, and ‘)’ is special when matched with a preceding ‘(’. These special characters do not use preceding backslashes; ‘(?’ produces undefined results. The only backslash escape sequences are the escaped special characters ‘\$’, ‘\(’, ‘\)’, ‘\*’, ‘\+’, ‘\.’, ‘\?’, ‘\[’, ‘\\’, ‘\^’, ‘\{’ and ‘\|’. POSIX does not define how other ERE escapes behave; for example, GNU ‘grep -E’ treats ‘\1’ like Emacs does, but does not support all the Emacs escapes.
In POSIX BREs and EREs, undefined results are produced by repetition operators at the start of a regular expression or subexpression (possibly preceded by ‘^’), except that the repetition operator ‘*’ has the same behavior in BREs as in Emacs. In Emacs, these operators are treated as ordinary.
In BREs and EREs, undefined results are produced by two repetition operators in sequence. In Emacs, these have well-defined behavior, e.g., ‘a**’ is equivalent to ‘a*’.
In BREs and EREs, undefined results are produced by empty regular expressions or subexpressions. In Emacs these have well-defined behavior, e.g., ‘\(\)*’ matches the empty string.
In BREs and EREs, undefined results are produced for the named character classes ‘[:ascii:]’, ‘[:multibyte:]’, ‘[:nonascii:]’, ‘[:unibyte:]’, and ‘[:word:]’.
BREs and EREs can contain collating symbols and equivalence class expressions within bracket expressions, e.g., ‘[[.ch.]d[=a=]]’. Emacs regular expressions do not support this.
BREs, EREs, and the strings they match cannot contain encoding errors or NUL bytes. In Emacs these constructs simply match themselves.
BRE and ERE searching always finds the longest match. Emacs searching by default does not necessarily do so. See Longest-match searching for regular expression matches.
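
Several of the Emacs-side behaviors summarized above can be checked directly from Lisp. The following is a minimal sketch, not part of the manual's own examples: the helper ‘my-regexp-demo-span’ is a hypothetical name introduced only for illustration, while ‘string-match’ and ‘posix-string-match’ are the standard matching functions. It covers only how Emacs treats these patterns; how a particular POSIX implementation treats them may differ, since POSIX leaves several of these cases undefined.

```elisp
;; Hypothetical helper (not an Emacs built-in): return (START . END) of
;; the first match of REGEXP in STRING using MATCHER, or nil if no match.
(defun my-regexp-demo-span (matcher regexp string)
  "Return (START . END) for the match of REGEXP in STRING, or nil.
MATCHER should be `string-match' or `posix-string-match'."
  (save-match-data
    (when (funcall matcher regexp string)
      (cons (match-beginning 0) (match-end 0)))))

;; `^' is ordinary when it is not at the start of the regexp,
;; and `$' is ordinary when it is not at the end.
(my-regexp-demo-span #'string-match "x^" "ax^b")      ; ⇒ (1 . 3)
(my-regexp-demo-span #'string-match "$x" "a$xb")      ; ⇒ (1 . 3)

;; A repetition operator at the start of the regexp is ordinary.
(my-regexp-demo-span #'string-match "*a" "x*a")       ; ⇒ (1 . 3)

;; Two repetition operators in sequence: `a**' behaves like `a*'.
(my-regexp-demo-span #'string-match "a**" "aaa")      ; ⇒ (0 . 3)

;; An empty regexp matches the empty string.
(my-regexp-demo-span #'string-match "" "abc")         ; ⇒ (0 . 0)

;; Default Emacs matching stops at the first alternative that succeeds;
;; the POSIX-style function keeps looking for the longest match.
(my-regexp-demo-span #'string-match "a\\|ab" "ab")        ; ⇒ (0 . 1)
(my-regexp-demo-span #'posix-string-match "a\\|ab" "ab")  ; ⇒ (0 . 2)
```

The last two forms show why a regexp that works in Emacs can select a different span of text when handed to a POSIX matcher, even when both accept the same syntax.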
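
When Lisp code passes a literal string to a program that expects a POSIX regular expression, the escape sequences listed above for BREs and EREs suggest which characters need a preceding backslash. The following sketch assumes exactly those two lists; the function names ‘my-posix-quote’, ‘my-bre-quote’ and ‘my-ere-quote’ are hypothetical and are not provided by Emacs. It quotes literal text only and makes no attempt to translate an Emacs regular expression into POSIX syntax.

```elisp
;; Hypothetical helpers, named here only for illustration.
(defun my-posix-quote (string specials)
  "Return STRING with a backslash before each character in SPECIALS.
SPECIALS is a string listing the characters to escape."
  (let ((special-chars (string-to-list specials)))
    (mapconcat (lambda (ch)
                 (if (memq ch special-chars)
                     (string ?\\ ch)    ; backslash, then the character
                   (string ch)))
               string "")))

(defun my-bre-quote (string)
  "Quote STRING for literal use in a POSIX BRE.
Escapes the characters $ * . [ \\ ^, the BRE escaped specials."
  (my-posix-quote string "$*.[\\^"))

(defun my-ere-quote (string)
  "Quote STRING for literal use in a POSIX ERE.
Escapes the characters $ ( ) * + . ? [ \\ ^ { |, the ERE escaped specials."
  (my-posix-quote string "$()*+.?[\\^{|"))

(my-bre-quote "a.b*c")      ; ⇒ "a\\.b\\*c"
(my-bre-quote "f(x)|g+")    ; ⇒ "f(x)|g+"
(my-ere-quote "f(x)|g+")    ; ⇒ "f\\(x\\)\\|g\\+"
```

The second and third forms differ because ‘(’, ‘)’, ‘|’ and ‘+’ are ordinary in a BRE but special in an ERE, so a single quoting function cannot serve both variants.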