A Portable C++ Regular Expression Facility - Regex(C++)

Subexpressions

Every regular expression consists of a number of subexpressions. As in egrep(C), we distort the standard mathematical terminology and consider only those components of the regular expression enclosed by matching pairs of parentheses to be subexpressions. Every subexpression has a subexpression number. The subexpression number of a subexpression can be calculated by starting at the left of the pattern and counting left parentheses, the leftmost left parenthesis being number one. The subexpression grouped on the left by the n'th left parenthesis is subexpression number n. Thus, in a(b(c))(d), the first, second and third subexpressions are (b(c)), (c) and (d), respectively. By convention the entire regular expression is the zero'th subexpression.
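The counting rule can be sketched in a few lines of standard C++. This is only an illustration of the numbering convention, not the library's implementation: PatternSpan and find_subexpression are hypothetical names, and the sketch assumes every parenthesis in the pattern is a grouping parenthesis (it does not handle escaped parentheses or bracket expressions).

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch; it is not the library's Substrinfo.
struct PatternSpan {
    std::size_t i = 0;    // index of the subexpression's left parenthesis
    std::size_t len = 0;  // length, including both parentheses
    bool found = false;   // false if there is no subexpression number n
};

// Locate subexpression number n by counting left parentheses from the left.
// Assumption: every '(' and ')' in the pattern is a grouping parenthesis.
PatternSpan find_subexpression(const std::string& pattern, int n) {
    PatternSpan result;
    if (n == 0) {  // by convention, the entire pattern
        result.len = pattern.size();
        result.found = true;
        return result;
    }
    int seen = 0;  // left parentheses counted so far
    for (std::size_t start = 0; start < pattern.size(); ++start) {
        if (pattern[start] != '(' || ++seen != n) continue;
        // Found the n'th left parenthesis; scan for its matching right one.
        int depth = 0;
        for (std::size_t j = start; j < pattern.size(); ++j) {
            if (pattern[j] == '(') {
                ++depth;
            } else if (pattern[j] == ')' && --depth == 0) {
                result.i = start;
                result.len = j - start + 1;
                result.found = true;
                return result;
            }
        }
        return result;  // unbalanced parentheses: report not found
    }
    return result;  // fewer than n left parentheses
}
```

Applied to a(b(c))(d), the sketch reports subexpression 1 at index 1 with length 6, subexpression 2 at index 3 with length 3, and subexpression 3 at index 7 with length 3.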

The subexpressions of the pattern can be picked out using Regex::subex.

       Regex r("(foo|bar)(baz)*");
       Substrinfo ss = r.subex(2);  // ss.i = 9, ss.len = 5

Obeying the general convention stated at the end of the previous section, subex will construct the actual substring if an optional String argument is provided.

       Regex r("(foo|bar)(baz)*");
       String the_subex;
       r.subex(2, the_subex);  // the_subex = "(baz)"

As expected, if the specified subexpression does not exist, or if the pattern is invalid, the value returned by subex will test false.

       Regex r("(foo|bar)(baz)*");
       if (r.subex(3)) ...  // false

Sometimes when matching it is necessary to be able to pick out the substrings of the target which matched the various subexpressions of the pattern. For example, suppose the pattern is (foo)+(baz); after a successful match, we might want to pick out the substring of the target which matched (baz). Such subexpression information is modeled in the library by the single class Subex. Notice that the matching substring by definition matches the zero'th subexpression; here we are interested in finding the substrings of the matching substring which matched the other (parenthesized) pattern subexpressions.

To get subexpression matching information, the programmer must do two things. First, the optional Subex argument to match must be supplied:

       Regex r("(foo)+(baz)");
       Subex subs;
       r.match("foobazbazbar", subs);
       // ...

When match returns, subs will encapsulate all the subexpression matching information for that match. Second, this information must be retrieved by calling Subex::operator().

       // ... from above
       Substrinfo ss = subs(2);  // ss.i = 3, ss.len = 3

Since operator() is a function returning a Substrinfo, by our convention it also takes an optional String argument which, if present, will be assigned the actual substring of the target.

       // ... from above
       String mss;
       subs(2, mss);  // mss = "baz"
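For readers without access to this library, roughly the same information can be obtained from the standard header <regex>. The sketch below is an analogy, not a description of how Regex(C++) works: it uses std::regex_search with the default ECMAScript grammar, and Span and submatch are hypothetical names chosen to mirror Substrinfo and Subex::operator().

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical stand-in for Substrinfo, plus the matched text.
struct Span {
    long i;            // start of the submatch in the target, -1 if none
    long len;          // length of the submatch
    bool matched;      // plays the role of "tests true"
    std::string text;  // the substring itself
};

// Search target for pattern and report what subexpression n matched.
// Uses std::regex (ECMAScript grammar), not Regex::match.
Span submatch(const std::string& pattern, const std::string& target,
              std::size_t n) {
    std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(target, m, re) || n >= m.size() || !m[n].matched) {
        return Span{-1, 0, false, ""};
    }
    return Span{static_cast<long>(m.position(n)),
                static_cast<long>(m.length(n)), true, m.str(n)};
}
```

With the pattern (foo)+(baz) and the target foobazbazbar, submatch(..., 2) reports position 3, length 3 and the text baz, in agreement with the values shown above.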

As may be expected, the return value of operator() will test false if the specified pattern subexpression did not match anything.

       Regex r("(foo)|(baz)");
       Subex subs;
       r.match("foobar", subs);
       if (subs(2)) ...  // false

Note that even if the target had been ``foobazbar'', subs(2) would still have been false, since match always finds the leftmost longest matching substring. Of course, if the match fails, then all the subexpression matches test false.

       Regex r("(foo)|(baz)");
       Subex subs;
       r.match("blech", subs);
       if (subs(0)) ...  // false
       if (subs(1)) ...  // false
       if (subs(2)) ...  // false

There is a difference between ``not matching anything'' and ``matching the empty substring.'' In the above program, the various subexpressions failed to match anything. In the following program, the subexpression ((baz)*) matches the empty substring beginning at position 3:

       Regex r("foo((baz)*)bar");
       Subex subs;
       r.match("foobar", subs);
       Substrinfo ss = subs(1);  // ss.i = 3, ss.len = 0
       if (ss) ...  // true!

Observe how we wrote the pattern in the last example: foo((baz)*)bar. Notice that the outer pair of parentheses is redundant: the simpler pattern foo(baz)*bar is equivalent. The reason we used the extra parentheses was so that we could get a handle (``subexpression number one'') on the (baz)* subexpression.
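The three possible outcomes for a subexpression (no match, an empty match, a non-empty match) can also be observed with std::regex, which records them in sub_match::matched and sub_match::length(). The sketch below is an analogy under that assumption; GroupState and classify are hypothetical names. Note too that std::regex with the ECMAScript grammar does not promise leftmost longest matching, although it gives the same answers for the patterns used here.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical three-way classification used only in this sketch.
enum class GroupState { NotMatched, MatchedEmpty, MatchedNonEmpty };

// Report how subexpression n fared when pattern is searched for in target.
// A failed search, or a subexpression that took no part in the match,
// yields NotMatched; a zero-length submatch yields MatchedEmpty.
GroupState classify(const std::string& pattern, const std::string& target,
                    std::size_t n) {
    std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(target, m, re) || n >= m.size() || !m[n].matched) {
        return GroupState::NotMatched;
    }
    return m.length(n) == 0 ? GroupState::MatchedEmpty
                            : GroupState::MatchedNonEmpty;
}
```

For foo((baz)*)bar against foobar, subexpression 1 is classified as MatchedEmpty and subexpression 2 as NotMatched; for (foo)|(baz) against blech, every subexpression, including number zero, is NotMatched.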

Concerning the closure operator *, consider the following program (again notice the redundant parentheses used in the pattern):

       Regex r("foo((baz)*)bar");
       Subex subs;
       r.match("foobazbazbar", subs);
       String sub1, sub2;
       Substrinfo ss1 = subs(1, sub1);
           // ss1.i = 3, ss1.len = 6, sub1 = "bazbaz"
       Substrinfo ss2 = subs(2, sub2);
           // ss2.i = 6, ss2.len = 3, sub2 = "baz"

The point of this example is that whereas the closure ((baz)*) is taken to match the entire substring ``bazbaz'', the argument (baz) of the closure is taken to match only the final repetition of ``baz''. (There is no particular justification for this behavior; it is adopted simply to conform with the behavior of the various regular expression facilities found in the section ``Invalid patterns''.) The other repetition meta-characters (? and +) work the same way. So, for example, we have the following:

       Regex r("foo((baz)?)bar");
       Subex subs;
       r.match("foobazbar", subs);
       String sub1, sub2;
       Substrinfo ss1 = subs(1, sub1);
           // ss1.i = 3, ss1.len = 3, sub1 = "baz"
       Substrinfo ss2 = subs(2, sub2);
           // ss2.i = 3, ss2.len = 3, sub2 = "baz"

But

       Regex r("foo((baz)?)bar");
       Subex subs;
       r.match("foobar", subs);
       String sub1, sub2;
       Substrinfo ss1 = subs(1, sub1);
           // ss1.i = 3, ss1.len = 0, sub1 = ""
       Substrinfo ss2 = subs(2, sub2);
           // ss2.i = -1, ss2.len = 0
       if (ss1) ...  // true!
       if (ss2) ...  // false!
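The last-repetition rule is not peculiar to this library. As a cross-check, the sketch below lists the position and length of every subexpression match using std::regex with its default ECMAScript grammar, which also keeps only the final repetition of a group inside a closure. The helper group_spans is a hypothetical name, and the value -1 for a subexpression that matched nothing is a convention of this sketch, chosen to mirror the examples above.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Return (position, length) for subexpressions 0..n of a successful search.
// A subexpression that matched nothing is reported as (-1, 0).
// If the search fails, the result is empty.
std::vector<std::pair<long, long>> group_spans(const std::string& pattern,
                                               const std::string& target) {
    std::vector<std::pair<long, long>> spans;
    std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(target, m, re)) return spans;
    for (std::size_t k = 0; k < m.size(); ++k) {
        if (m[k].matched) {
            spans.emplace_back(static_cast<long>(m.position(k)),
                               static_cast<long>(m.length(k)));
        } else {
            spans.emplace_back(-1L, 0L);
        }
    }
    return spans;
}
```

For foo((baz)*)bar against foobazbazbar this yields (3, 6) for subexpression 1 and (6, 3) for subexpression 2; for foo((baz)?)bar against foobar it yields (3, 0) and (-1, 0).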

If Subex::operator() is called on a Subex which has not yet participated in a match, its return value tests false:

       Regex r("foo(bar)*");
       Subex subs;
       if (subs(1)) ...  // false