Broad Network


Regex Groupings in Perl

Regular Expressions in Perl for the Novice – Part 4

Foreword: In this part of the series, I explain regex Grouping and Capturing in Perl.

By: Chrysanthus Date Published: 13 Aug 2012

Introduction
This is the fourth part of my series, Regular Expressions in Perl for the Novice. In this part of the series, I explain regex Grouping and Capturing in Perl.

Groupings
We can use parentheses to group characters in a pattern. Consider the following pattern:

          /The (guitarist)/

“guitarist” is in parentheses. The parentheses form a group, which has the text, “guitarist”. Consider the following:

         /The (guitarist is good)/

“guitarist is good” is in parentheses. The parentheses form a group, which has the text, “guitarist is good”.

Perl treats a group as an entity. A group on its own does not serve much purpose; it becomes important when used in conjunction with other pattern techniques. There is another use, which we shall see below.

Sub Strings with common Parts
Imagine that you have a bookshop where there is a bookkeeper and a bookshelf. Here, bookkeeper is the person who looks after the books. Also imagine that you have any of the following available strings:

          $availStr = "There is a bookshelf in my shop.";

          $availStr = "I have a bookkeeper.";

          $availStr = "The bookkeeper takes care of the bookshelf.";

In your code, you might not know which available string is present (the string might have been taken from somewhere and assigned to a variable); however, let us say your interest is to know whether there is a bookshelf or bookkeeper in the subject string. The regex for this can be:

          /bookshelf|bookkeeper/

Note that in the above regex, we have to type the word, “book” twice. We can avoid this double typing by using the following regex:

          /book(shelf|keeper)/

This second regex is more concise, because we do not have to type the word, “book” twice. In the second one we have the group: (shelf|keeper). Perl treats a group as a single unit, and within this group Perl has to choose “shelf” or “keeper”. In this way, Perl looks for “book” followed by either “shelf” or “keeper”, that is, bookshelf or bookkeeper. The following expression produces a match:

           $availStr =~ /book(shelf|keeper)/

Here, $availStr can be any of the above strings.
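To make this concrete, here is a minimal sketch that tries the pattern against each of the three strings above. The array names @candidates and @results are my own choices for illustration.

```perl
use strict;
use warnings;

# The three candidate strings from the text above.
my @candidates = (
    "There is a bookshelf in my shop.",
    "I have a bookkeeper.",
    "The bookkeeper takes care of the bookshelf.",
);

my @results;
foreach my $availStr (@candidates) {
    # The group (shelf|keeper) supplies the part that differs after "book".
    my $found = ($availStr =~ /book(shelf|keeper)/) ? "Matched" : "Not Matched";
    push @results, $found;
    print "$found: $availStr\n";
}
```

Each of the three strings should be reported as matched, since each contains bookshelf or bookkeeper.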

Here, the word “book” is common to both words and is the first part of both. When the common sub string is instead the last part of two words or phrases, the same technique applies, but this time the group comes first, like this:

           /(non-common|non-common)common/

The pattern can actually become complex and you would have the same pattern as above in different places of a bigger pattern.
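As a small sketch of the group-in-front form, assume the two words are “bookkeeper” and “shopkeeper” (the word “shopkeeper” and the test strings are my own illustration, not part of the bookshop example above):

```perl
use strict;
use warnings;

my @suffixResults;
foreach my $availStr ("I have a bookkeeper.", "Ask the shopkeeper.", "He is a keeper.") {
    # (book|shop) is the non-common part; "keeper" is the common ending.
    my $found = ($availStr =~ /(book|shop)keeper/) ? 1 : 0;
    push @suffixResults, $found;
    print $found ? "Matched" : "Not Matched", ": $availStr\n";
}
```

The last string has “keeper” without “book” or “shop” in front of it, so it is not matched.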

Sub Strings with Alternation at Beginning of Available String in Group
Let us look at the case of sub strings with common part where matching has to occur at the beginning of the available string.

Consider the following pattern:

              /(^x|y)z/

The anchor metacharacter ^ is used to match the regex at the beginning of the available string. The above pattern matches 'xz' at the start of the available string or 'yz' anywhere in the available string. The following expressions match:

         "xz 5678" =~ /(^x|y)z/

         "34 yz 56 G" =~ /(^x|y)z/
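The sketch below checks the two strings above and adds a third string of my own, in which 'xz' is present but not at the start, to show that the anchor applies only to the x alternative.

```perl
use strict;
use warnings;

my @anchorResults;
foreach my $availStr ("xz 5678", "34 yz 56 G", "34 xz 56 G") {
    # ^x must be at the very start; y may be anywhere; z follows either one.
    my $found = ($availStr =~ /(^x|y)z/) ? 1 : 0;
    push @anchorResults, $found;
    print $found ? "Matched" : "Not Matched", ": $availStr\n";
}
```

The first two strings match. The third does not, because its 'xz' is in the middle of the string and there is no 'yz'.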

More on Sub Strings with Common Parts
Imagine that you want to match “book” or “bookkeeper” or “bookkeepers”. The sub string “book” occurs in the three phrases (sub strings). The sub string “bookkeeper” occurs in two of the phrases, and the sub string “bookkeepers” occurs in only one of the phrases. Our aim in this section is to develop a compact pattern to match sub strings such as the above three.

You can do this:

        /book|bookkeeper|bookkeepers/

The problem here is that you have to type “book” three times and “keeper” two times.

The following pattern is more compact:

         /book(keeper(s|)|)/

First of all, note here that you have a nested group; groups can be nested. There are also two alternation metacharacters, |; one inside an inner nested group; the other inside the outer group.

In the inner nested group, Perl has to choose between “s” and nothing. In the outer group, Perl has to choose between “keeper(s|)” and nothing. “book” will always be chosen. In this way, Perl will match “book”, “bookkeeper” or “bookkeepers”. The situation here is similar to the first situation above, but is more complex.
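The two patterns in this section differ in more than typing effort. Perl tries the alternatives from left to right and takes the first one that works, so the flat pattern stops at “book” even when a longer word is present. The sketch below prints the text each pattern actually matched, using Perl's $& variable; the array names @nested and @flat are my own.

```perl
use strict;
use warnings;

my (@nested, @flat);
foreach my $word ("book", "bookkeeper", "bookkeepers") {
    # Nested groups: the longer alternatives are tried before "nothing".
    if ($word =~ /book(keeper(s|)|)/) {
        push @nested, $&;
    }
    # Flat alternation: "book" is the first alternative, so it wins.
    if ($word =~ /book|bookkeeper|bookkeepers/) {
        push @flat, $&;
    }
}

print "nested: @nested\n";
print "flat:   @flat\n";
```

With the nested pattern the matched text is the whole word each time; with the flat pattern it is “book” each time. If you want the flat form to prefer the longer words, one option is to list the longest alternative first.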

Capturing Matches
The grouping metacharacters (), that is, parentheses, also serve another completely different purpose: they allow the capture of the sub strings in the available string that matched. The pattern is usually not an exact word or an exact phrase. After matching has occurred, can you know the exact word or phrase in the available string that was matched? Yes, you can, and it is thanks to grouping.

Each group in a pattern is remembered when matching occurs. In other words, you can know the exact sub string in the available string that corresponds to each group, when matching has occurred. Perl has many internal variables; the ones used here are $1, $2, $3, $4, $5, $6, $7, $8 and $9. Perl does not stop at nine: a pattern with more groups continues with $10, $11 and so on.

Let us look at an example first before we continue. Consider the following code:

use strict;

if ("This is one and that is two." =~ /(one).*(two)/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

print '$1 is: ', $1, "\n";

print '$2 is: ', $2, "\n";

The available string is "This is one and that is two." Let us look at the pattern. The pattern is /(one).*(two)/; it will match any sub string in the available string that begins with “one” and ends with “two”. Remember that the dot followed by the asterisk matches any sequence of zero or more characters (other than a newline).

Note that there are two groups in the pattern. The groups are (one) and (two). In the available string, you have the sub string “one”, then after some distance you have the sub string “two”. The group (one) matches the sub string “one” in the available string. The group (two) matches the sub string “two” in the available string. Because of this matching, the sub string “one” in the available string is assigned to the internal variable, $1; the sub string “two” in the available string is assigned to the internal variable, $2. In the code, the last two statements print out the values of these two variables. Without the groups (parentheses), matching would still occur but nothing would be assigned to the internal variables ($1 and $2). In other words, nothing would be captured. The output of the code is:

Matched
$1 is: one
$2 is: two

You can have several groups (pairs of parentheses) in the pattern. The groups are numbered from left to right: when the pattern matches, the sub string matched by the first group is assigned to $1, that of the second group to $2, that of the third to $3, and so on. This is how you remember or capture sub strings in the available string after matching. Note: if there is no group, then there is nothing to remember (no assignment will occur); no capture. You do not need many groups; a pattern can have just one.
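The capture variables are only set by a successful match; after a failed match they still hold whatever an earlier successful match left in them. For that reason it is safer to read them inside the if-block. Here is a minimal sketch of that habit; the variable names $first and $second are my own.

```perl
use strict;
use warnings;

my $availStr = "This is one and that is two.";
my ($first, $second);

if ($availStr =~ /(one).*(two)/) {
    # Copy the captures straight away, while they are known to be fresh.
    ($first, $second) = ($1, $2);
    print "first: $first, second: $second\n";
}
else {
    print "Not Matched\n";
}
```

Copying $1 and $2 into named variables also protects the values from being overwritten by a later match.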

What about Nested Groups?
Consider the following code:

use strict;

if ("bookkeepers, bookkeeper and book go together." =~ /book(keeper(s|)|)/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

print '$1 is: ', $1, "\n";

print '$2 is: ', $2, "\n";

print '$3 is: ', $3, "\n";

This is the output of the above code:

Matched
$1 is: keepers
$2 is: s
$3 is:

The pattern would match “bookkeepers”, “bookkeeper” or “book”. However, we have two groups; one inside the other. It is these two groups that can be remembered. That is why for the output, $3 has nothing to display, as nothing was assigned to it.

The outer group in the pattern is (keeper(s|)|) and the inner group is (s|). The outer group corresponds to “keepers” in “bookkeepers”. The inner group corresponds to “s” at the end of “bookkeepers”.

Let me explain this capturing a bit more. “(keeper(s|)|)” means “keeper(s|)” or nothing, and “keeper(s|)” means “keepers” or “keeper”; so “keepers” next to “book” is captured. “(s|)” is a group and any group can be captured; it means “s” or nothing. Note that it is the whole pattern that is matched, not a group on its own. The matched sub string that has our “s” is “bookkeepers”, the first word in the available string. As “bookkeepers” is matched, our “s” is captured.

Capturing and matching are not the same thing. After matching occurs, the part of the matched sub string that corresponds to each group in the pattern is captured (assigned to a variable).
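With nested groups, the numbering follows the opening parentheses from left to right, so the outer group is $1 and the inner group is $2. The sketch below applies the same pattern to each of the three words on its own and records what each variable holds. The @report array and the quoting around the values are my own additions, there to make an empty string visible and to tell it apart from a variable that was not set at all.

```perl
use strict;
use warnings;

my @report;
foreach my $word ("bookkeepers", "bookkeeper", "book") {
    if ($word =~ /book(keeper(s|)|)/) {
        # A group that took no part in the match leaves its variable undefined.
        my $outer = defined $1 ? "'$1'" : "undef";
        my $inner = defined $2 ? "'$2'" : "undef";
        push @report, "$word: \$1=$outer \$2=$inner";
    }
}

print "$_\n" for @report;
```

For “bookkeeper” the inner group matches nothing, so $2 is an empty string. For “book” the outer group matches nothing and the inner group is never reached, so $1 is an empty string and $2 is undefined.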

Capturing in List Context
In list context, a match, /regex/, with groupings returns the list of captured group values ($1, $2, ...). I illustrate this by showing you how to match time; this is an important example. The following produces a match:

my ($hrs, $mins, $secs) = ($theTime =~ /(\d\d):(\d\d):(\d\d)/);

This statement is not the condition of an if statement; the list returned by the match is assigned directly to the three variables. The following code illustrates this:

use strict;

my $theTime = "10:20:15";

my ($hrs, $mins, $secs) = ($theTime =~ /(\d\d):(\d\d):(\d\d)/);

print "Hrs is: ", $hrs, "\n";

print "Mins is: ", $mins, "\n";

print "Secs is: ", $secs, "\n";

The output of this code is:

Hrs is: 10
Mins is: 20
Secs is: 15

If you know the meaning of List Context in Perl, everything in the code should be self-explanatory. You can also assign the result to an array in place of the list of variables.
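Here is a minimal sketch of the array form. It also shows what happens when the pattern does not match: in list context a failed match returns an empty list, so the array ends up with no elements. The names @parts and @none and the second test string are my own.

```perl
use strict;
use warnings;

my $theTime = "10:20:15";

# Successful match: the array receives the three captured values in order.
my @parts = ($theTime =~ /(\d\d):(\d\d):(\d\d)/);
print "Parts: @parts\n";

# Failed match: the array is left empty.
my @none = ("no time here" =~ /(\d\d):(\d\d):(\d\d)/);
print "Count on failure: ", scalar(@none), "\n";
```

Checking the number of elements in the array is one simple way to find out whether the match succeeded.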

Time to take a break. We continue in the next part of the series.

Chrys
