Programming with awk

Extended regular expressions

awk provides more powerful patterns for searching for strings of characters than the comparisons illustrated in the previous section. These patterns are called regular expressions, and are like those in grep(C) and lex(CP). The simplest extended regular expression is a string of characters enclosed in slashes, like

   /Asia/

This program prints all input records that contain the substring Asia. (If a record contains Asia as part of a larger string like Asian or Pan-Asiatic, it is also printed.) In general, if re is an extended regular expression, then the pattern

   /re/

matches any line that contains a substring specified by the extended regular expression re.
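As a minimal sketch of this behavior, the following shell fragment pipes a few made-up input lines through the program /Asia/. The input lines and the variable name out are assumptions chosen for illustration, not part of any real data file.

```sh
# Hypothetical input lines, chosen only to illustrate substring matching.
out=$(printf '%s\n' 'Asia' 'Asian markets' 'Pan-Asiatic league' 'Europe' |
      awk '/Asia/')

# Every line that contains the substring Asia is kept; Europe is dropped.
printf '%s\n' "$out"
```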

To restrict a match to a specific field, you use the matching operators ~ (matches) and !~ (does not match). The program

   $4 ~ /Asia/ { print $1 }

prints the first field of all lines in which the fourth field matches Asia, while the program

   $4 !~ /Asia/ { print $1 }

prints the first field of all lines in which the fourth field does not match Asia.
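The two programs can be compared side by side with a small sketch. The four-field records below (name, area, population, continent) are invented for the example, as are the shell variable names.

```sh
# Hypothetical four-field records: name, area, population, continent.
data=$(printf '%s\n' \
    'China 3705 1032 Asia' \
    'Canada 3852 25 America' \
    'India 1267 746 Asia' \
    'France 211 55 Europe')

# First field of records whose fourth field matches Asia.
in_asia=$(printf '%s\n' "$data" | awk '$4 ~ /Asia/ { print $1 }')

# First field of records whose fourth field does not match Asia.
not_asia=$(printf '%s\n' "$data" | awk '$4 !~ /Asia/ { print $1 }')

printf 'match:\n%s\nno match:\n%s\n' "$in_asia" "$not_asia"
```

Between them, the two programs account for every input record exactly once, because ~ and !~ are complementary tests on the same field.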

In extended regular expressions, the symbols

   \ ^ $ . [] * + ? () | {}

are metacharacters with special meanings like the metacharacters in the SCO OpenServer shell. For example, the metacharacters ^ and $ match the beginning and end, respectively, of a string, and the metacharacter . (dot) matches any single character. Thus,

   /^.$/

matches all records that contain exactly one character.

A group of characters enclosed in square brackets matches any one of the enclosed characters; for example, /[ABC]/ matches records containing any one of A, B, or C anywhere. Ranges of letters or digits can be abbreviated within square brackets: /[a-zA-Z]/ matches any single letter in the default locale.

If the first character after the [ is a ^, this complements the class so it matches any character not in the set: /[^a-zA-Z]/ matches any non-letter. The character + means ``one or more.'' Thus, the program

   $2 !~ /^[0-9]+$/

prints all records in which the second field is not a string of one or more digits (^ for beginning of string, [0-9]+ for one or more digits, and $ for end of string). Programs of this type are often used for data validation.
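A short sketch of such a validation pass follows. The records are hypothetical; the second field is assumed to be an unsigned integer when valid.

```sh
# Hypothetical records; a valid second field is a string of digits only.
out=$(printf '%s\n' 'a 42' 'b 4x2' 'c 007' 'd' 'e -5' |
      awk '$2 !~ /^[0-9]+$/')

# Records whose second field is missing or contains a non-digit are reported.
printf '%s\n' "$out"
```

The record d is reported too: its second field is empty, and an empty string does not contain one or more digits.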

awk also accepts the newer square bracket constructs. These constructs permit programs to be sensitive to the current locale. For example, instead of using [^a-zA-Z] to mean ``non-letter'' and [0-9] to mean ``digit'' as above, you can use [^[:alpha:]] and [[:digit:]], which are more descriptive and more portable. See grep(C) for more details.
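The sketch below compares a range-based class with its bracket-construct counterpart on made-up input. It assumes an awk that implements these constructs and a locale in which the digits are 0 through 9; some older awk implementations may not recognize [[:digit:]] or [[:alpha:]].

```sh
# Hypothetical records used for both forms of the digit test.
data=$(printf '%s\n' 'a 42' 'b 4x2' 'c 007')

# Range-based class and bracket-construct class, applied to the second field.
by_range=$(printf '%s\n' "$data" | awk '$2 !~ /^[0-9]+$/')
by_class=$(printf '%s\n' "$data" | awk '$2 !~ /^[[:digit:]]+$/')

# Lines that contain at least one character that is not a letter.
non_letter=$(printf '%s\n' 'abc' 'ab1' 'a-b' | awk '/[^[:alpha:]]/')

printf '%s\n' "$by_range" "$by_class" "$non_letter"
```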

Parentheses () are used for grouping and the character | is used for alternatives. The program

   /(apple|cherry) (pie|tart)/

matches lines containing any one of the four substrings apple pie, apple tart, cherry pie, or cherry tart.
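To illustrate grouping and alternation together, this sketch runs the program over a few invented lines; the input text is an assumption made only for the example.

```sh
# Hypothetical input lines; only some contain one of the four substrings.
out=$(printf '%s\n' 'apple pie' 'cherry tart' 'apple cake' \
                    'peach pie' 'warm apple tart' |
      awk '/(apple|cherry) (pie|tart)/')

# apple cake and peach pie are dropped; the match may occur anywhere in a line.
printf '%s\n' "$out"
```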

Extended regular expressions provide a more general form of repetition via the ``interval'' operator. This operator has the form {low,high}, with the high limit optional. The three operators ?, * and + are equivalent, respectively, to the interval constructs {0,1}, {0,} and {1,}. To denote an exact number of matches, use the form {count}.

To turn off the special meaning of a metacharacter, precede it by a \ (backslash). Thus, the program

   /b\$/

prints all lines containing b followed by a dollar sign.
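The difference the backslash makes can be seen in a small sketch with invented input. The shell single quotes pass the backslash and dollar sign to awk unchanged.

```sh
# Hypothetical input lines containing b, with and without a dollar sign.
data=$(printf '%s\n' 'b$' 'ab$c' 'b' 'ab')

# Escaped: b followed by a literal dollar sign, anywhere in the line.
escaped=$(printf '%s\n' "$data" | awk '/b\$/')

# Unescaped: $ keeps its special meaning, so this matches lines ending in b.
anchored=$(printf '%s\n' "$data" | awk '/b$/')

printf 'escaped:\n%s\nanchored:\n%s\n' "$escaped" "$anchored"
```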

In addition to recognizing metacharacters, awk recognizes the following C programming language escape sequences within regular expressions and strings:

   \b      backspace
   \f      formfeed
   \n      newline
   \r      carriage return
   \t      tab
   \ddd    octal value ddd
   \"      quotation mark
   \c      any other character c literally
   \xhhh   hexadecimal value hhh

For example, to print all lines containing a tab, use the program

   /\t/
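As a quick sketch, the fragment below feeds awk two made-up lines, only the first of which contains a tab. The tab in the input is produced by the shell's printf; the \t inside the slashes is interpreted by awk.

```sh
# printf expands \t to a tab in the first hypothetical line only.
out=$(printf 'one\ttwo\nthree four\n' | awk '/\t/')

# Only the line containing a tab character is printed.
printf '%s\n' "$out"
```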

awk interprets any string or variable on the right side of a ~ or !~ as an extended regular expression. For example, we could have written the program

   $2 !~ /^[0-9]+$/

as

   BEGIN     { digits = "^[0-9]+$" }
   $2 !~ digits
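A minimal sketch, using invented records, suggests that the pattern held in a variable selects the same records as the pattern written between slashes. The variable name digits comes from the program above; the shell variable names are assumptions.

```sh
# Hypothetical records shared by both forms of the program.
data=$(printf '%s\n' 'a 42' 'b 4x2' 'c 007')

# Pattern written as a regular expression constant.
literal=$(printf '%s\n' "$data" | awk '$2 !~ /^[0-9]+$/')

# Same pattern stored in the string variable digits.
dynamic=$(printf '%s\n' "$data" |
          awk 'BEGIN { digits = "^[0-9]+$" }
               $2 !~ digits')

printf '%s\n' "$literal" "$dynamic"
```

Storing the pattern in a variable is convenient when the same expression is tested in several places or is built up at run time.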

Suppose you want to search for a string of characters like ^[0-9]+$. When a literal quoted string like "^[0-9]+$" is used as an extended regular expression, one extra level of backslashes is needed to protect metacharacters. This is because one level of backslashes is removed when a string is originally parsed. If a backslash is needed in front of a character to turn off its special meaning in an extended regular expression, then that backslash needs a preceding backslash to protect it in a string.

For example, suppose you want to match strings containing b followed by a dollar sign. The extended regular expression for this pattern is b\$. If you want to create a string to represent this extended regular expression, you must add one more backslash: "b\\$". The two extended regular expressions on each of the following lines are equivalent:

   x ~ "b\\$"	x ~ /b\$/
   x ~ "b\$"	x ~ /b$/
   x ~ "b$"	x ~ /b$/
   x ~ "\\t"	x ~ /\t/
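The first equivalence can be checked with a short sketch on made-up input. The awk programs are enclosed in shell single quotes so that the shell passes every backslash through to awk untouched.

```sh
# Hypothetical input lines: two contain b followed by a dollar sign.
data=$(printf '%s\n' 'b$' 'ab$c' 'ab')

# Regular expression constant: one backslash protects the dollar sign.
as_regex=$(printf '%s\n' "$data" | awk '$0 ~ /b\$/')

# String: the extra backslash survives string parsing as a single backslash.
as_string=$(printf '%s\n' "$data" | awk '$0 ~ "b\\$"')

printf '%s\n' "$as_regex" "$as_string"
```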

The precise form of extended regular expressions and the substrings they match is given in ``awk extended regular expressions''. The unary operators *, +, ? and intervals have the highest precedence, then concatenation, and then alternation |. All operators are left associative. In the table, r stands for any extended regular expression.

awk extended regular expressions

   Expression      Matches

   c               any non-metacharacter c
   \c              character c literally
   ^               beginning of string
   $               end of string
   .               any character but newline
   [s]             any character in set s
   [^s]            any character not in set s
   r*              zero or more r's
   r+              one or more r's
   r?              zero or one r
   r{low,high}     at least low r's but not more than high
   (r)             r
   r1r2            r1 then r2 (concatenation)
   r1|r2           r1 or r2 (alternation)
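The effect of operator precedence can be sketched with a few invented lines. Because alternation binds most loosely, an anchor applies only to the alternative it is written in unless parentheses group the alternatives; because repetition binds more tightly than concatenation, + applies only to the character or group immediately before it.

```sh
# Hypothetical input for the alternation test.
data=$(printf '%s\n' 'abx' 'xcd' 'ab' 'cd' 'xabx')

# Parsed as (^ab)|(cd$): lines that start with ab or end with cd.
loose=$(printf '%s\n' "$data" | awk '/^ab|cd$/')

# Parentheses group the alternatives: lines that are exactly ab or cd.
tight=$(printf '%s\n' "$data" | awk '/^(ab|cd)$/')

# Repetition binds tighter than concatenation: a(b+) versus (ab)+.
one_a=$(printf '%s\n' 'abbb' 'abab' | awk '/^ab+$/')
pairs=$(printf '%s\n' 'abbb' 'abab' | awk '/^(ab)+$/')

printf '%s\n' "$loose" "$tight" "$one_a" "$pairs"
```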
