19 Regular Expressions

The character ‘.’ matches any single character except the null character.

+

matches one or more occurrences of the previous atom or regexp.

?

matches zero or one occurrence of the previous atom or regexp.

\+

matches a ‘+’.

\?

matches a ‘?’.
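The repetition operators and their escaped forms can be tried out with a short sketch. Python's `re` module is an assumption made purely for illustration: it is a different engine from the one documented here, but ‘+’, ‘?’, ‘\+’ and ‘\?’ behave the same way in it. The helper name `matches` is hypothetical.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r"ab+", "abbb"))      # True: '+' accepts one or more 'b'
print(matches(r"ab+", "a"))         # False: at least one 'b' is required
print(matches(r"colou?r", "color")) # True: '?' accepts zero occurrences of 'u'
print(matches(r"1\+1", "1+1"))      # True: '\+' is a literal plus sign
print(matches(r"why\?", "why?"))    # True: '\?' is a literal question mark
```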

Bracket expressions are used to match ranges of characters. Bracket expressions where the range is backward, for example ‘[z-a]’, are invalid. Within square brackets, ‘\’ is taken literally. Character classes are supported; for example ‘[[:digit:]]’ matches a single decimal digit.
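As a rough illustration of ranges, the sketch below uses Python's `re` module, which is an assumption and not the engine described here. Python also rejects a backward range, but it does not support character classes such as ‘[[:digit:]]’ and it treats ‘\’ inside brackets as an escape, so neither of those points is shown. The range ‘[0-9]’ stands in for ‘[[:digit:]]’ on ASCII input. The helper name `is_valid` is hypothetical.

```python
import re

def is_valid(pattern: str) -> bool:
    """Return True if the pattern compiles, False if the engine rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(is_valid(r"[a-z]"))             # True: forward range
print(is_valid(r"[z-a]"))             # False: backward range is rejected
print(re.findall(r"[0-9]", "a1b22"))  # ['1', '2', '2']
```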

GNU extensions are supported:

\w

matches a character within a word

\W

matches a character which is not within a word

\<

matches the beginning of a word

\>

matches the end of a word

\b

matches a word boundary

\B

matches a position which is not a word boundary

\`

matches the beginning of the whole input

\'

matches the end of the whole input
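Some of these extensions can be sketched with Python's `re` module, again only as an assumed stand-in for the engine described here. Python shares ‘\w’, ‘\W’, ‘\b’ and ‘\B’ but has no ‘\<’, ‘\>’, ‘\`’ or ‘\'’, so those four are not shown. The sample strings are made up for the example.

```python
import re

text = "cat concatenate"

# '\b' requires a word boundary on each side, so only the standalone word matches.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]
print(whole_words)  # [0]

# '\B' requires the absence of a boundary, so only the embedded 'cat' matches.
embedded = [m.start() for m in re.finditer(r"\Bcat\B", text)]
print(embedded)     # [7]

# '\w' matches word characters, '\W' matches everything else.
print(re.findall(r"\w+", "foo, bar!"))  # ['foo', 'bar']
print(re.findall(r"\W", "a-b c"))       # ['-', ' ']
```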

Grouping is performed with parentheses ‘()’. An unmatched ‘)’ matches just itself. A backslash followed by a digit acts as a back-reference and matches the same thing as the previous grouped expression indicated by that number. For example, ‘\2’ matches the second group expression. The order of group expressions is determined by the position of their opening parenthesis ‘(’.
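Group numbering and back-references can be illustrated with a small sketch. Python's `re` module is assumed here only because it numbers groups by opening parenthesis and handles ‘\2’ the same way. It is not the engine described above, and it rejects an unmatched ‘)’ rather than treating it as a literal, so that rule is not shown.

```python
import re

# Groups are numbered by the position of their opening parenthesis:
# group 1 is the outer pair, group 2 is '(a)', group 3 is '(b)'.
nested = re.fullmatch(r"((a)(b))", "ab")
print(nested.groups())  # ('ab', 'a', 'b')

# '\2' must repeat exactly the text matched by the second group.
print(re.fullmatch(r"(a+)(b+)\2", "aabbbb") is not None)  # True: group 2 is 'bb'
print(re.fullmatch(r"(a)(b)\2", "aba") is not None)       # False: 'a' is not 'b'
```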

The alternation operator is ‘|’.

The characters ‘^’ and ‘$’ always represent the beginning and end of a string respectively, except within square brackets. Within brackets, an initial ‘^’ inverts the character class being matched.
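Alternation, anchoring and an inverted bracket expression can be combined in a short sketch. Python's `re` module is an assumption used for illustration only. Its ‘$’ also matches just before a trailing newline, so the inputs below avoid newlines. The helper name `is_pet` is hypothetical.

```python
import re

def is_pet(text: str) -> bool:
    """Return True if text is exactly 'cat' or exactly 'dog'."""
    # '^' and '$' anchor the alternation to the whole string.
    return re.search(r"^(cat|dog)$", text) is not None

print(is_pet("dog"))     # True
print(is_pet("hotdog"))  # False: '^' does not match in the middle of the string

# Inside brackets an initial '^' inverts the set: here, runs of non-digits.
print(re.findall(r"[^0-9]+", "ab12cd"))  # ['ab', 'cd']
```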

‘*’, ‘+’ and ‘?’ are special at any point in a regular expression except the following places, where they are not allowed:

  1. At the beginning of a regular expression
  2. After an open-group, ‘(’
  3. After the alternation operator, ‘|’
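The three positions listed above can be checked with a brief sketch. Python's `re` module is assumed as a stand-in because it also refuses a repetition operator with nothing to repeat. The error behaviour of the engine described here may differ in detail. The helper name `compiles` is hypothetical.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if the pattern is accepted, False if it is rejected."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r"*a"))    # False: '*' at the beginning of the expression
print(compiles(r"(+a)"))  # False: '+' directly after an open-group
print(compiles(r"a|?b"))  # False: '?' directly after the alternation operator
print(compiles(r"a*"))    # True: '*' follows an atom
```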

Intervals are specified by ‘{’ and ‘}’. Invalid intervals such as ‘a{1z’ are not accepted.
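A valid interval can be sketched as follows, assuming Python's `re` module for illustration. Note one difference: Python treats a malformed interval such as ‘a{1z’ as literal text instead of rejecting it, so only well-formed intervals are shown. The helper name `matches` is hypothetical.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r"a{2,3}", "a"))     # False: fewer than two
print(matches(r"a{2,3}", "aa"))    # True
print(matches(r"a{2,3}", "aaa"))   # True
print(matches(r"a{2,3}", "aaaa"))  # False: more than three
print(matches(r"a{2}", "aa"))      # True: exactly two
print(matches(r"a{2,}", "aaaaa"))  # True: two or more
```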

The longest possible match is returned; this applies to the regular expression as a whole and (subject to this constraint) to sub-expressions within groups.
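The longest-match rule is easiest to see by contrast with an engine that does not follow it. Python's `re` module, assumed here for illustration, returns the first alternative that succeeds, not the longest. The hypothetical helper `leftmost_longest` below emulates the rule for the overall match by brute force. It is a sketch of the idea, not how a real matcher is implemented, and it does not address the sub-expression part of the rule.

```python
import re
from typing import Optional, Tuple

def leftmost_longest(pattern: str, text: str) -> Optional[Tuple[int, int]]:
    """Return the span of the leftmost match, preferring the longest one."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        # Try the longest candidate ending first, then shorter ones.
        for end in range(len(text), start - 1, -1):
            if compiled.fullmatch(text, start, end):
                return (start, end)
    return None

first_alternative = re.search(r"a|ab", "xab").span()
print(first_alternative)                 # (1, 2): Python stops at 'a'
print(leftmost_longest(r"a|ab", "xab"))  # (1, 3): the longest match is 'ab'
```

Because the helper tries every start and end position, its cost grows quadratically with the input length, so it is only suitable for small experiments.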