Regexp Field Splitting (The GNU Awk User’s Guide)

Next: Making Each Character a Separate Field, Previous: Whitespace Normally Separates Fields, Up: Specifying How Fields Are Separated [Contents][Index]

4.5.2 Using Regular Expressions to Separate Fields ¶

The previous subsection discussed the use of single characters or simple strings as the value of FS. More generally, the value of FS may be a string containing any regular expression. In this case, each match in the record for the regular expression separates fields. For example, the assignment:

FS = ", \t"

makes every area of an input line that consists of a comma followed by a space and a TAB into a field separator.

For a less trivial example of a regular expression, try using single spaces to separate fields the way single commas are used. FS can be set to "[ ]" (left bracket, space, right bracket). This regular expression matches a single space and nothing else (see Regular Expressions).

There is an important difference between the two cases of ‘FS = " "’ (a single space) and ‘FS = "[ \t\n]+"’ (a regular expression matching one or more spaces, TABs, or newlines). For both values of FS, fields are separated by runs (multiple adjacent occurrences) of spaces, TABs, and/or newlines. However, when the value of FS is " ", awk first strips leading and trailing whitespace from the record and then decides where the fields are. For example, the following pipeline prints ‘b’:

$ echo ' a b c d ' | awk '{ print $2 }'
-| b

However, this pipeline prints ‘a’ (note the extra spaces around each letter):

$ echo ' a  b  c  d ' | awk 'BEGIN { FS = "[ \t\n]+" }
>                                  { print $2 }'
-| a

In this case, the first field is null, or empty.

The stripping of leading and trailing whitespace also comes into play whenever $0 is recomputed. For instance, study this pipeline:

$ echo '   a b c d' | awk '{ print; $2 = $2; print }'
-|    a b c d
-| a b c d

The first print statement prints the record as it was read, with leading whitespace intact. The assignment to $2 rebuilds $0 by concatenating $1 through $NF together, separated by the value of OFS (which is a space by default). Because the leading whitespace was ignored when finding $1, it is not part of the new $0. Finally, the last print statement prints the new $0.

There is an additional subtlety to be aware of when using regular expressions for field splitting. It is not well specified in the POSIX standard, or anywhere else, what ‘^’ means when splitting fields. Does the ‘^’ match only at the beginning of the entire record? Or is each field separator a new string? It turns out that different awk versions answer this question differently, and you should not rely on any specific behavior in your programs. (d.c.)

As a point of information, BWK awk allows ‘^’ to match only at the beginning of the record. gawk also works this way. For example:

$ echo 'xxAA  xxBxx  C' |
> gawk -F '(^x+)|( +)' '{ for (i = 1; i <= NF; i++)
>                             printf "-->%s<--\n", $i }'
-| --><--
-| -->AA<--
-| -->xxBxx<--
-| -->C<--

Finally, field splitting with regular expressions works differently than regexp matching with the sub(), gsub(), and gensub() (see String-Manipulation Functions). Those functions allow a regexp to match the empty string; field splitting does not. Thus, for example ‘FS = "()"’ does not split fields between characters.