Grouping in Perl Regular Expressions

Perl Regular Expressions – Part 3

Perl Course

Foreword: In this part of the series, I explain regex Grouping and Capturing in Perl.

By: Chrysanthus Date Published: 5 Oct 2015

Introduction

This is part 3 of my series, Perl Regular Expressions. In this part of the series, I explain regex Grouping and Capturing in Perl. You should have read the previous parts of the series before reaching here, as this is a continuation.

Grouping
We can use parentheses to group characters in a pattern. Consider the following pattern:

          /The (guitarist)/

“guitarist” is in parentheses. The parentheses form a group, which has the text, “guitarist”. Consider the following:

         /The (guitarist is good)/

“guitarist is good” is in parentheses. The parentheses form a group, which has the text, “guitarist is good”.

Perl treats a group as a single entity. A group on its own does not serve any purpose. It becomes important when used in conjunction with other pattern techniques, such as alternation. There is another use, capturing, which we shall see below.

Sub Strings with common Parts
Imagine that you have a bookshop where there is a bookkeeper and a bookshelf. Here, bookkeeper is the person who looks after the books. Also imagine that you have any of the following subject strings:

          $subject = "There is a bookshelf in my shop.";

          $subject = "I have a bookkeeper.";

          $subject = "The bookkeeper takes care of the bookshelf.";

In your code, you might not know which subject is present (the string might have been taken from somewhere and assigned to a variable); however, let us say your interest is to know whether there is a bookshelf or bookkeeper in the subject string. The regex for this can be:

          /bookshelf|bookkeeper/

Note that in this regex, we have to type the word, “book” twice. We can avoid this double typing by using the following regex:

          /book(shelf|keeper)/

This second regex is more compact, because we do not have to type the word, “book” twice. In the second one we have the group: (shelf|keeper). Perl treats a group as a single unit. Within this group, Perl has to choose “shelf” or “keeper”. In this way, Perl looks for bookshelf or bookkeeper. The following expression produces a match for all the above subjects.

           $subject =~ /book(shelf|keeper)/

Here, $subject can be any of the above strings.
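As a small sketch of the above, the following code tries the grouped pattern against each of the three subjects. The array names @subjects and @results are my own, used only for illustration.

```perl
use strict;
use warnings;

# The three subject strings from this section.
my @subjects = (
    "There is a bookshelf in my shop.",
    "I have a bookkeeper.",
    "The bookkeeper takes care of the bookshelf.",
);

my @results;    # illustrative: 1 for a match, 0 for no match

foreach my $subject (@subjects) {
    # "book" followed by either "shelf" or "keeper".
    my $matched = ($subject =~ /book(shelf|keeper)/) ? 1 : 0;
    push @results, $matched;
    print $subject, " => ", ($matched ? "Matched" : "Not Matched"), "\n";
}
```

Each of the three subjects should be reported as matched.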

Here, the word “book” is common to both words and it is the first part of both words. The common sub string can also be the last part of two words or phrases; the same technique applies, but this time the group is in front, like this:

           /(non-common|non-common)common/

The pattern can actually become complex and you would have the same pattern as above in different places of a bigger pattern.
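To make the case of a common ending concrete, here is a short sketch. The words “blackboard” and “whiteboard”, and the subject strings, are my own examples and are not taken from the text above.

```perl
use strict;
use warnings;

# Example subjects (assumed for illustration only).
my @subjects = (
    "Write it on the whiteboard.",
    "The blackboard is clean.",
    "The keyboard is new.",
);

my @results;    # illustrative: 1 for a match, 0 for no match

foreach my $subject (@subjects) {
    # The group comes first; the common part "board" follows it.
    my $matched = ($subject =~ /(black|white)board/) ? 1 : 0;
    push @results, $matched;
    print $subject, " => ", ($matched ? "Matched" : "Not Matched"), "\n";
}
```

The third subject has “board” in it, but not preceded by “black” or “white”, so it should not match.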

Sub Strings with Alternation at Beginning of Subject String in Group
Let us look at the case of sub strings with common part where matching has to occur at the beginning of the subject string.

Consider the following pattern:

              /(^x|y)z/

The anchor metacharacter ^ is used to match the regex at the beginning of the subject. The above pattern matches 'xz' at the start of the subject string or 'yz' anywhere in the subject string. The following expressions match:

         "xz 5678" =~ /(^x|y)z/

         "34 yz 56 G" =~ /(^x|y)z/

Here, the anchor is inside the group and applies only to the x alternative, not to y. If you want xz or yz to match only at the beginning of the subject, then the anchor has to be placed outside the group:

    /^(x|y)z/

With this, the following does not produce a match:

         "34 yz 56 G" =~ /^(x|y)z/

while the following produces a match:

    "yz 56 G" =~ /^(x|y)z/

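The difference between the two anchor positions can be checked with a sketch like the one below. The hash names %inside and %outside, and the fourth subject string, are my own additions for illustration.

```perl
use strict;
use warnings;

my @subjects = ("xz 5678", "34 yz 56 G", "yz 56 G", "34 xz 56 G");
my (%inside, %outside);    # illustrative names, not from the text above

foreach my $subject (@subjects) {
    # Anchor inside the group: ^ applies to the x alternative only.
    $inside{$subject}  = ($subject =~ /(^x|y)z/) ? 1 : 0;
    # Anchor outside the group: ^ applies to both alternatives.
    $outside{$subject} = ($subject =~ /^(x|y)z/) ? 1 : 0;

    print $subject, ": anchor inside gives ", $inside{$subject},
          ", anchor outside gives ", $outside{$subject}, "\n";
}
```

Here 1 means a match and 0 means no match. The subject "34 yz 56 G" should match only with the anchor inside the group, while "34 xz 56 G" should match with neither pattern.
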
More on Sub Strings with Common Parts
Imagine that you want to match “book” or “bookkeeper” or “bookkeepers”.  The sub string “book” occurs in the three phrases (sub strings). The sub string “bookkeeper” occurs in two of the phrases; and the sub string “bookkeepers” occurs in only one of the phrases. Our aim in this section is to develop an efficient pattern to match sub strings such as the above three.

You can do this:

        /book|bookkeeper|bookkeepers/

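The problem here (inefficiency) is that you have to type “book” three times and you have to type “keeper” two times. There is a second problem: Perl tries the alternatives from left to right and stops at the first one that succeeds, so this pattern always stops at “book”, even when the subject has “bookkeepers”.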

The following pattern is efficient:

         /book(keeper(s|)|)/

First of all, note here that you have a nested group; groups can be nested. There are also two alternation metacharacters, |; one inside an inner nested group; the other inside the outer group.

In the inner nested group, Perl has to choose between “s” and nothing. In the outer group, Perl has to choose between “keeper(s|)” and nothing. “book” will always be chosen. In this way, Perl will match “book”, “bookkeeper” or “bookkeepers”. The situation here is similar to the first situation above, but is more complex.
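The following sketch compares the two patterns of this section on the same subject. It uses Perl's built-in variable $&, which holds the text of the last successful match. The subject string and the names $plain and $nested are my own, for illustration only.

```perl
use strict;
use warnings;

my $subject = "Two bookkeepers work here.";
my ($plain, $nested);    # illustrative names

# Plain alternation: the first alternative, "book", succeeds first.
if ($subject =~ /book|bookkeeper|bookkeepers/) {
    $plain = $&;
}

# Nested groups: "keeper" and then "s" are tried before "nothing".
if ($subject =~ /book(keeper(s|)|)/) {
    $nested = $&;
}

print "Plain alternation matched: ", $plain, "\n";
print "Nested groups matched: ", $nested, "\n";
```

Under these assumptions, the plain alternation should report “book” and the nested pattern should report “bookkeepers”.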

Capturing Matches
The grouping metacharacters (), that is, the parentheses, also serve another purpose: they allow the capture (return) of sub strings that match in the subject. A pattern is usually not an exact word or an exact phrase. After matching has occurred, can you know the exact word or phrase in the subject that was matched? Yes, you can, and it is thanks to grouping.

Groups that match can be remembered by Perl in the internal variables, $1, $2, $3, $4, $5, etc.

Let us look at an example first before we continue. Consider the following code:

use strict;

if ("This is one and that is two." =~ /(one).*(two)/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

print '$1 is: ', $1, "\n";

print '$2 is: ', $2, "\n";

The subject is "This is one and that is two." Let us look at the pattern. The pattern is /(one).*(two)/; it will match any sub string in the subject that begins with “one” and ends with “two”. Remember that the dot followed by the asterisk represents any sequence of zero or more characters (other than the newline character).

Note that there are two groups in the pattern: (one) and (two). In the subject, you have the sub string “one”, then after some distance, the sub string “two”. The group (one) matches the sub string “one” in the subject, and the group (two) matches the sub string “two”. Because of this matching, the sub string “one” in the subject is assigned to the internal variable, $1; the sub string “two” in the subject is assigned to the internal variable, $2. In the code, the last two statements print out the values of these two variables. If the pattern had no groups (no parentheses), matching would still occur but nothing would be assigned to the internal variables $1 and $2. In other words, nothing would be captured. The output of the code is:

Matched
$1 is: one
$2 is: two

Groups are numbered by the position of their opening parenthesis in the pattern, counting from the left: the sub string captured by the first group goes to $1, that of the second group to $2, that of the third to $3, and so on. If the pattern has 9 groups and all of them take part in the match, then the 9 sub strings in the subject are assigned to $1, $2 … $9 respectively. This is how you remember or capture sub strings in the subject after matching. Note: if there is no group, then there is nothing to remember; no assignment occurs and nothing is captured.
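Here is a small sketch of how the variables are numbered when one group sits inside another. The date string and the array name @groups are my own, chosen only for illustration.

```perl
use strict;
use warnings;

my $date = "2015-10-05";    # assumed example subject
my @groups;                 # illustrative: copies of $1 .. $4

# Opening parentheses, counted from the left:
#   1st: ((\d{4})-(\d\d))   year and month together
#   2nd: (\d{4})            year
#   3rd: (\d\d)             month
#   4th: (\d\d)             day
if ($date =~ /((\d{4})-(\d\d))-(\d\d)/) {
    @groups = ($1, $2, $3, $4);
    print '$1 is: ', $1, "\n";
    print '$2 is: ', $2, "\n";
    print '$3 is: ', $3, "\n";
    print '$4 is: ', $4, "\n";
}
```

The outer group opens first, so it should go to $1 as “2015-10”, followed by “2015”, “10” and “05” in $2, $3 and $4.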

What about Nested Groups?
Consider the following code:

use strict;

if ("bookkeepers, bookkeeper and book go together." =~ /book(keeper(s|)|)/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

print '$1 is: ', $1, "\n";

print '$2 is: ', $2, "\n";

print '$3 is: ', $3, "\n";

The output is:

Matched
$1 is: keepers
$2 is: s
$3 is:

The pattern would match “bookkeepers”, “bookkeeper” or “book”. However, we have two groups; one inside the other. It is these two groups that can be remembered. That is why for the output, $3 has nothing to display (as nothing was assigned to it).

The outer group in the pattern is (keeper(s|)|) and the inner group is (s|). The outer group corresponds to “keepers” in “bookkeepers”. The inner group corresponds to “s” at the end of “bookkeepers”.

Let me explain this capturing further. “(keeper(s|)|)” means “keeper(s|)” or nothing, and “keeper(s|)” means “keepers” or “keeper”. Perl tries the alternatives from left to right, so at the start of the subject it takes “keeper” and then “s”; “keepers” next to “book” is therefore captured in $1. “(s|)” is also a group, meaning “s” or nothing, and it captures the “s” in $2. Note that it is the whole pattern that is matched, not just a group: the matched sub string is “bookkeepers”, and each group captures the part of it that the group corresponds to.

Capturing and matching are not the same thing. Matching finds a sub string in the subject that fits the whole pattern. Capturing then takes the part of that sub string that corresponds to each group and assigns it to a variable.
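The sketch below shows what the two groups hold when only “book” is matched. The subject string and the names $outer and $inner are my own, for illustration. The outer group takes its “nothing” alternative, so I expect $1 to be an empty string; the inner group takes no part in the match, so I expect $2 to be undefined.

```perl
use strict;
use warnings;

my ($outer, $inner);    # illustrative copies of $1 and $2

if ("I lost my book." =~ /book(keeper(s|)|)/) {
    ($outer, $inner) = ($1, $2);

    # Report whether each variable is defined, and its length if so.
    print '$1 is: ',
        (defined $outer ? "defined, length " . length($outer) : "undef"), "\n";
    print '$2 is: ',
        (defined $inner ? "defined, length " . length($inner) : "undef"), "\n";
}
```

An empty string and an undefined value both print as nothing, which is why the code tests them with defined instead of printing them directly.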

Capturing in List Context
In list context, a successful match, /regex/, with groupings returns the list of captured group values ($1, $2, ...); a failed match returns an empty list. I illustrate this by showing you how to match time; this is an important example. The following statement produces a match and assigns the three captured values:

my ($hrs, $mins, $secs) = ($theTime =~ /(\d\d):(\d\d):(\d\d)/);

Note that the match here is not used as an if-condition; it is on the right side of an assignment. The following code illustrates this:

use strict;

my $theTime = "10:20:15";

my ($hrs, $mins, $secs) = ($theTime =~ /(\d\d):(\d\d):(\d\d)/);

print "Hrs is: ", $hrs, "\n";

print "Mins is: ", $mins, "\n";

print "Secs is: ", $secs, "\n";

The output of this code is:

Hrs is: 10
Mins is: 20
Secs is: 15

The parentheses on the left of the assignment put the match in list context, so the three captured values are assigned to $hrs, $mins and $secs in order. You can also assign the result to an array instead of a list of scalars.
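As a sketch of the array form, and of what happens when the match fails, consider the following code. The subject strings and the names @parts, @none and $report are my own, for illustration.

```perl
use strict;
use warnings;

# Assigning to an array keeps every captured value.
my @parts = ("10:20:15" =~ /(\d\d):(\d\d):(\d\d)/);
print "Parts: @parts\n";

# A failed match in list context returns an empty list.
my @none = ("no time here" =~ /(\d\d):(\d\d):(\d\d)/);
print "Number of parts on failure: ", scalar(@none), "\n";

# The list assignment can itself be used as the if-condition.
my $report;
if (my ($hrs, $mins, $secs) = ("23:59:01" =~ /(\d\d):(\d\d):(\d\d)/)) {
    $report = "$hrs hrs, $mins mins, $secs secs";
}
else {
    $report = "Not a time";
}
print $report, "\n";
```

A list assignment used as a condition is true when the right side produced at least one value, so the if-block runs only when the match succeeds.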

Non-Capturing Group
If you do not want a group to be captured, then begin the group with ?: immediately after the opening parenthesis, as in (?:two). In the following code, the second group is not captured, so $2 is never assigned and prints nothing:

use strict;

if ("This is one and that is two." =~ /(one).*(?:two)/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

print '$1 is: ', $1, "\n";

print '$2 is: ', $2, "\n";

The output is:

Matched
$1 is: one
$2 is:
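A non-capturing group still groups; it just does not take a number. The sketch below, with a subject string and variable names of my own, shows how the numbering of the remaining group changes.

```perl
use strict;
use warnings;

my $subject = "Dear Ms Jones,";    # assumed example subject
my ($with_capture, $without_capture);

# Both groups capture: the title goes to $1 and the name to $2.
if ($subject =~ /(Mr|Ms) (\w+)/) {
    $with_capture = $2;
}

# The title group is non-capturing, so the name moves up to $1.
if ($subject =~ /(?:Mr|Ms) (\w+)/) {
    $without_capture = $1;
}

print "Name from \$2 (two capturing groups): ", $with_capture, "\n";
print "Name from \$1 (one capturing group): ", $without_capture, "\n";
```

In both cases the name “Jones” should be printed; only the variable that holds it differs.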

Time to take a break. We continue in the next part of the series.

Chrys

Broad Network
