Back-reference Operator (GNU Gnulib)


18.3.8 The Back-reference Operator (\digit)

If the syntax bit RE_NO_BK_REF isn’t set, then Regex recognizes back-references. A back-reference matches a specified preceding group. The back-reference operator is represented by ‘\digit’ anywhere after the end of a regular expression’s digit-th group (see Grouping Operators (( … ) or \( … \))).

digit must be between ‘1’ and ‘9’. The matcher assigns numbers 1 through 9 to the first nine groups it encounters. By using one of ‘\1’ through ‘\9’ after the corresponding group’s close-group operator, you can match a substring identical to the one that the group matched.

Back-references match according to the following (in all examples below, ‘(’ represents the open-group, ‘)’ the close-group, ‘{’ the open-interval and ‘}’ the close-interval operator):

If the group matches a substring, the back-reference matches an identical substring. For example, ‘(a)\1’ matches ‘aa’ and ‘(bana)na\1bo\1’ matches ‘bananabanabobana’. Likewise, ‘(.*)\1’ matches any (newline-free if the syntax bit RE_DOT_NEWLINE isn’t set) string that is composed of two identical halves; the ‘(.*)’ matches the first half and the ‘\1’ matches the second half.
If the group matches more than once (as it might if followed by, e.g., a repetition operator), then the back-reference matches the substring the group last matched. For example, ‘((a*)b)*\1\2’ matches ‘aabababa’; first group 1 (the outer one) matches ‘aab’ and group 2 (the inner one) matches ‘aa’. Then group 1 matches ‘ab’ and group 2 matches ‘a’. So, ‘\1’ matches ‘ab’ and ‘\2’ matches ‘a’.
If the group doesn’t participate in a match, i.e., it is part of an alternative not taken or a repetition operator allows zero repetitions of it, then the back-reference makes the whole match fail. For example, ‘(one()|two())-and-(three\2|four\3)’ matches ‘one-and-three’ and ‘two-and-four’, but not ‘one-and-four’ or ‘two-and-three’. To see why, suppose the pattern has matched ‘one-and-’: its group 2 matched the empty string and its group 3 didn’t participate in the match. If the matcher then matches ‘four’, it must back-reference group 3, because ‘\3’ follows the ‘four’, and the match fails because group 3 didn’t participate in the match.
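
The three rules above can be checked with a short script. This is only a sketch: it uses Python's `re` module rather than Gnulib Regex, on the assumption that both treat these particular patterns the same way. In Python, ‘(’ and ‘)’ are always the group operators. The helper name `full` is hypothetical.

```python
import re

def full(pattern, text):
    """Return True if the pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

# Rule 1: the back-reference matches the same substring the group matched.
rule1 = [
    full(r"(a)\1", "aa"),
    full(r"(bana)na\1bo\1", "bananabanabobana"),
    full(r"(.*)\1", "abcabc"),   # two identical halves
    full(r"(.*)\1", "abcabd"),   # halves differ, so this one fails
]

# Rule 2: after a repeated group, the back-reference sees the last substring
# the group matched.
m = re.fullmatch(r"((a*)b)*\1\2", "aabababa")
last_groups = (m.group(1), m.group(2)) if m else None

# Rule 3: a back-reference to a group that did not participate makes the
# match fail.
choice = r"(one()|two())-and-(three\2|four\3)"
rule3 = {
    text: full(choice, text)
    for text in ("one-and-three", "two-and-four", "one-and-four", "two-and-three")
}

print(rule1)
print(last_groups)
print(rule3)
```

Here `last_groups` holds what groups 1 and 2 matched on their final repetition, which is the text that ‘\1’ and ‘\2’ then had to match.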

You can use a back-reference as an argument to a repetition operator. For example, ‘(a(b))\2*’ matches ‘a’ followed by one or more ‘b’s: the group itself matches ‘ab’, and ‘\2*’ then matches zero or more further ‘b’s. Similarly, ‘(a(b))\2{3}’ matches ‘abbbb’.
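
A quick way to see how a repetition operator applies to a back-reference is to try a few candidate strings. The sketch below again uses Python's `re` module as a stand-in, on the assumption that it agrees with Regex for these two patterns. The variable names are illustrative only.

```python
import re

# 'ab' from the groups, then zero or more further copies of group 2 ('b').
star = re.compile(r"(a(b))\2*")
# 'ab' from the groups, then exactly three further copies of group 2.
exact = re.compile(r"(a(b))\2{3}")

star_results = {
    text: star.fullmatch(text) is not None
    for text in ("a", "ab", "abb", "abbbb")
}
exact_results = {
    text: exact.fullmatch(text) is not None
    for text in ("abbb", "abbbb", "abbbbb")
}

print(star_results)
print(exact_results)
```

Each candidate is tested as a whole-string match, so the results show exactly how many ‘b’s each pattern accepts.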

If there is no preceding digit-th subexpression, the regular expression is invalid.
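
As an illustration of this rule, the sketch below compiles three patterns with Python's `re` module, which also rejects a back-reference that has no preceding group. How Gnulib Regex itself reports the error is not shown here. The helper name `compiles` is hypothetical.

```python
import re

def compiles(pattern):
    """Return True if the pattern compiles, False if it is rejected."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

valid = compiles(r"(a)\1")     # group 1 precedes the reference
missing = compiles(r"(a)\2")   # there is no second group
early = compiles(r"\1(a)")     # the reference comes before its group

print(valid, missing, early)
```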

Back-references can greatly slow down matching, as they can generate exponentially many matching possibilities that can consume both time and memory to explore. Also, the POSIX specification for back-references is at times unclear. Furthermore, many regular expression implementations have back-reference bugs that can cause programs to return incorrect answers or even crash, and fixing these bugs has often been low-priority: for example, as of 2020 the GNU C library bug database contained back-reference bugs 52, 10844, 11053, 24269 and 25322, with little sign of forthcoming fixes. Luckily, back-references are rarely useful and it should be little trouble to avoid them in practical applications.
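
As one illustration of avoiding a back-reference, the "two identical halves" test that ‘(.*)\1’ performs can be done with a plain string comparison. This is a minimal sketch in Python, not part of Gnulib. The function names are hypothetical, and `re.DOTALL` is used so that ‘.’ also matches newlines.

```python
import re

def is_doubled(text):
    """True if text consists of two identical halves, checked without a regex."""
    half, odd = divmod(len(text), 2)
    return odd == 0 and text[:half] == text[half:]

def is_doubled_backref(text):
    """The same test written with a back-reference, for comparison."""
    return re.fullmatch(r"(.*)\1", text, re.DOTALL) is not None

samples = ["", "aa", "abcabc", "abcabd", "aaa", "ab\nab\n"]
agree = all(is_doubled(s) == is_doubled_backref(s) for s in samples)

print(agree)
```

The direct comparison runs in time linear in the length of the string, whereas the back-reference version may have to try many split points.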
