Capturing Matches in Perl Regular Expressions

Advanced Perl Regular Expressions – Part 2

Foreword: In this part of the series I explain how to capture matches in Perl regular expression operations; the word “capture” here means holding the sub-string matched in the subject.

By: Chrysanthus Date Published: 2 Apr 2016

Introduction

This is part 2 of my series, Advanced Perl Regular Expressions. In this part of the series I explain how to capture matches in Perl regular expression operations; the word “capture” here means holding the sub-string matched in the subject. You should have read the previous part of the series before reaching here; this is a continuation.

The Binding Operator
The binding operator, =~, is the main operator used for matching in Perl. It has the subject string on the left and the regex on the right. In list context (for example, when the result is assigned to an array), the binding operator with its left and right operands returns a list of the matched sub-strings if the regex has the g modifier and no groups. In this case, if there is only one match, only one sub-string will be returned in the list. If there is more than one match, more than one sub-string will be returned. If there is no match, the returned list will be empty. Read and try the following code that illustrates this:

use strict;

    my $subject = "one two three four";

    my @arr = $subject =~ /tw./g;
    print $arr[0], "\n";

    # reuse the array declared above; a second "my @arr" would mask the first
    @arr = $subject =~ /tw.|thre./g;
    print $arr[0], "\n";
    print $arr[1], "\n";

The output is,

two
two
three

Remember, the dot metacharacter matches any single character at its position in the subject, except a newline (unless the s modifier is used).
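
The three cases described above (one match, several matches, no match) can be checked by counting the elements of the returned list. This is only a minimal sketch; the array names @one, @many and @none and the pattern /xyz/ are chosen here for illustration.

```perl
use strict;
use warnings;

my $subject = "one two three four";

# List context with the g modifier and no groups:
# every overall match is returned.
my @one  = $subject =~ /tw./g;          # one match
my @many = $subject =~ /tw.|thre./g;    # two matches
my @none = $subject =~ /xyz/g;          # no match, so an empty list

print scalar(@one),  "\n";    # 1
print scalar(@many), "\n";    # 2
print scalar(@none), "\n";    # 0
```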

Grouping
When you look at the subject, you may be interested in a particular part of the overall sub-string to be matched. You target that part by placing parentheses around the corresponding sub-pattern in the regex. The sub-pattern within parentheses is called a group. After a successful match with the binding operator, the part of the subject matched by the group is available on its own. This method does not need the g modifier. Read and try the following code that illustrates this:

use strict;

    my $subject = "one two three four five";
    $subject =~ /tw. (thre.) fou./;
    print $1;

The overall matched sub-string is “two three four”. The part of it matched by the group, (thre.), is “three”, and that is what $1 holds. $1 is a predefined special variable. Note that the spaces in the overall pattern must also match, because the whole pattern has to match, not just the group.
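
Since $1 is only meaningful after a successful match, it is safer to test the match before reading it. The following is a small sketch of that habit, using the same subject and regex as above; the variable $captured is introduced here only for illustration.

```perl
use strict;
use warnings;

my $subject = "one two three four five";
my $captured;

# The binding operation is true only if the whole pattern matched.
if ($subject =~ /tw. (thre.) fou./) {
    $captured = $1;                  # safe: the match succeeded
    print "captured: $captured\n";   # captured: three
}
else {
    print "no match\n";
}
```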

There are many such special variables: $1, $2, $3, etc. $1 holds the capture of the first group, $2 holds the capture of the second group, and so on. Read and try the following code:

use strict;

    my $subject = "one two three four five six seven eight nine ten eleven";
    $subject =~ /(on.) (tw.) (thre.) (fou.) (fiv.) (si.) (seve.) (eigh.) (nin.) (te.) (eleve.)/;
    print $1, "\n";
    print $11, "\n";

This kind of capturing can be used if you know the number of groups in advance; otherwise, use the previous approach with the g modifier. With this approach, it is possible to have nested groups. Groups are numbered by the position of their opening parenthesis, counting from the left, so with nested groups the outermost group is assigned to $1, the immediate inner one to $2, the next inner one to $3, etc. Read and try the following code:

use strict;

    my $subject = "one two three four five six seven eight nine ten eleven";
    $subject =~ /(on. (tw.) thre.)/;
    print $1, "\n";
    print $2, "\n";

In the output, $1 is “one two three” and $2 is “two”. The outer group includes the nested group. With many groups, the $4 group for example may nest the $5 group. In this code the overall matched group is the outer group.
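
To see how the numbering follows the opening parentheses when nested and non-nested groups are mixed, here is a short sketch. The regex is a variation of the one above, made up for this illustration, and @groups is just a helper array.

```perl
use strict;
use warnings;

my $subject = "one two three four five";
my @groups;

# Opening parentheses, counted from the left:
#   1st -> outer group, 2nd -> (tw.), 3rd -> (thre.), 4th -> (fou.)
if ($subject =~ /(on. (tw.) (thre.)) (fou.)/) {
    @groups = ($1, $2, $3, $4);
    for my $n (1 .. 4) {
        print '$', $n, " = ", $groups[$n - 1], "\n";
    }
}
```

Here $1 is “one two three”, $2 is “two”, $3 is “three” and $4 is “four”.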

Instead of using $1, $2, $3, etc. you can assign the matched sub-strings of the groups to an array. The matched sub-strings have to correspond to groups (in parentheses) in the regex. If there is no group in the regex and the g modifier is absent, a successful match returns the one-element list (1) rather than any sub-string; a failed match returns an empty list. Consider the following statement:

    my @arr = "This is one and that is two." =~ /(one).*(two)/;

The operator, =~ is simply of a higher precedence than the operator, =. So, the =~ operator with its left and right operands, is evaluated first. The result becomes the right operand of the = operator, which is evaluated next. So the array receives the sub-strings that correspond to the groups in the regex. The overall matched sub-string does not go to the array in this case, because it does not have its own parentheses. Read and try the following code:

use strict;

    my @arr = "This is one and that is two." =~ /(one).*(two)/;
    # print each captured sub-string on its own line
    foreach my $item (@arr)
        {
            print $item, "\n";
        }
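
The same list can also be assigned straight to scalars. The sketch below shows that, together with what the binding operation returns in list context when the regex has no group or when the match fails. The names $first, $second, @flag and @empty are illustrative.

```perl
use strict;
use warnings;

my $text = "This is one and that is two.";

# The group captures can be assigned directly to a list of scalars.
my ($first, $second) = $text =~ /(one).*(two)/;
print "$first $second\n";            # one two

# No groups and no g modifier: a successful match returns the list (1).
my @flag = $text =~ /one.*two/;
print scalar(@flag), "\n";           # 1

# A failed match returns an empty list, with or without groups.
my @empty = $text =~ /(three)/;
print scalar(@empty), "\n";          # 0
```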

Alternative Capture Group Numbering
Here, alternative means Or. Consider the USA date, 8:5:13 (month:day:year). The month can be written as 8 or 08; the day of the month can be written as 5 or 05; the year can be written as 13 or 2013. There are several ways in which this date can be written because of the different alternatives for each of the figures. A subject for the date may be "8:05:2013"; another subject may be "08:5:13", the same date written in a different way. A regex to match the whole date and capture the different possible figures is,

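    /(?:(\d)|(\d\d)):(?:(\d)|(\d\d)):(?:(\d\d)|(\d\d\d\d))/

where \d represents a digit, | means Or, and (?:...) groups the alternatives of one figure without capturing. The (?:...) is needed because | has the lowest precedence in a regex; without it, each | would split the whole regex, colons included, into separate alternatives. We would have a statement like,

    $subject =~ /(?:(\d)|(\d\d)):(?:(\d)|(\d\d)):(?:(\d\d)|(\d\d\d\d))/;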

In this case, $1 is for (\d), $2 is for (\d\d), $3 is for (\d), $4 is for (\d\d), $5 is for (\d\d) and $6 is for (\d\d\d\d). That is alright but the $ variables involved are too many.
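
A short sketch makes the cost of six variables visible. It assumes the alternatives of each figure are wrapped in a non-capturing group, (?:...), so that each | applies to one figure only. The subject "08:5:13" and the helper array @caps are chosen for illustration.

```perl
use strict;
use warnings;

my $subject = "08:5:13";
my @caps;

# (?:...) groups without capturing, so the six capture groups keep
# the numbers $1 to $6 from left to right.
if ($subject =~ /(?:(\d)|(\d\d)):(?:(\d)|(\d\d)):(?:(\d\d)|(\d\d\d\d))/) {
    @caps = ($1, $2, $3, $4, $5, $6);
    for my $i (0 .. $#caps) {
        my $value = defined $caps[$i] ? $caps[$i] : "(undef)";
        print '$', $i + 1, " = $value\n";
    }
}
```

For this subject only $2, $3 and $5 hold a value; the groups of the alternatives that were not used are left undefined, so the code has to check two variables per figure.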

There are three figures (month, day, year) in the regex. Would it not be nice if we had one dollar variable for each figure? In fact it is possible to have one $ variable for each figure. That is: $1 for the alternatives, (\d)|(\d\d); $2 for the alternatives, (\d)|(\d\d); and $3 for the alternatives, (\d\d)|(\d\d\d\d). To achieve this, place parentheses round the alternatives of each figure in the regex. Just after the opening parenthesis of each figure group, type ?|. Here, ? means embedded extension (see later) and | means alternative (Or). This construct, (?|...), is known as branch reset and needs Perl 5.10 or later: inside it, every alternative starts numbering its groups from the same number. Read and try the following code that illustrates alternative capture group numbering:

use strict;

    my $subject = "08:5:2013";
    $subject =~ /(?|(\d)|(\d\d)):(?|(\d)|(\d\d)):(?|(\d\d)|(\d\d\d\d))/;
    print $1,':', $2, ':', $3;

The output is,

    08:5:20

The output month is what you want; the output day of the month is also what you want; but the year is not what you want. The group for the year is (?|(\d\d)|(\d\d\d\d)). Alternatives are tried from left to right, and the first one that allows the overall regex to match is taken; the remaining alternatives are ignored. Here (\d\d) matches “20” and nothing follows it in the regex, so the match succeeds at once and (\d\d\d\d) is never tried. One way to solve this type of problem is to put the longer alternative first: (?|(\d\d\d\d)|(\d\d)). With this you would capture 2013 for the output year of the above code, and a two-digit year would still be captured by the second alternative.
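
The reordered year group can be tried on both forms of the date. This is a minimal sketch; the list of subjects and the array @results are made up for the illustration.

```perl
use strict;
use warnings;

my @results;

for my $subject ("08:5:2013", "8:05:13") {
    # Four-digit alternative first, so a long year is not cut at two digits.
    if ($subject =~ /(?|(\d)|(\d\d)):(?|(\d)|(\d\d)):(?|(\d\d\d\d)|(\d\d))/) {
        my $date = join(':', $1, $2, $3);
        push @results, $date;
        print $date, "\n";
    }
}
```

The output is 08:5:2013 followed by 8:05:13, so both the four-digit and the two-digit year are captured whole in $3.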

Named Capture Groups
Since Perl 5.10 you can give a group a name. Consider the following date regex:

    /((\d)|(\d\d)):((\d)|(\d\d)):((\d\d\d\d)|(\d\d))/

This regex has nested groups. Two alternative groups for the month are nested in one larger group; two alternative groups for the day of the month are nested in one larger group; two alternative groups for the year are nested in one larger group. You can refer to each of these larger groups with a name and use the name to obtain the matched sub-string in the subject, instead of using $1, $4 and $7 (the numbers of the three larger groups, counting opening parentheses from the left). You can also use a name for an inner nested group, but for this example that will not really be useful, since you are interested in the month and not in the different (alternative) forms of the month. You need to know how to code a name in the regex and you need to know how to obtain the matched sub-string from the name.

To give a group a name, use one of the following syntaxes just after the opening parenthesis of the group:

    (?<name>...)
or
    (?'name'...)

where ... stands for the pattern of the group, that is, what would be inside the parentheses. I prefer the syntax, (?'name'...). Again, ? means embedded extension (see later) and <name> or 'name' is for the name of the group. So the above date regex can be modified to

     /(?'month'(\d)|(\d\d)):(?'dayMonth'(\d)|(\d\d)):(?'year'(\d\d\d\d)|(\d\d))/

There is a predefined hash with the name %+ . After evaluation of the regex by Perl, this hash will have key/value pairs where for each pair, the key is the name of the regex group and the value is the captured (matched) sub-string of the group. Consider the following statement:

    "08:5:2013" =~ /(?'month'(\d)|(\d\d)):(?'dayMonth'(\d)|(\d\d)):(?'year'(\d\d\d\d)|(\d\d))/

Here, the subject is, "08:5:2013" (which is not held by a variable). The name, month will capture “08”, the name, dayMonth will capture “5”, the name, year will capture “2013”. Read and try the following code, which also shows how the hash is used:

use strict;

    "08:5:2013" =~ /(?'month'(\d)|(\d\d)):(?'dayMonth'(\d)|(\d\d)):(?'year'(\d\d\d\d)|(\d\d))/;
    print $+{'month'},':', $+{'dayMonth'}, ':', $+{'year'};

Your output should be,

    08:5:2013

A named capture group is not another way of coding an alternative capture group. With alternative capture groups, instead of having $1 and $2 you have $1, instead of having $3 and $4 you have $2, and so on. It is just a coincidence that I have used a similar coding example for alternative capture groups and named capture groups. A named capture group is still applicable to the smallest group, as in,

    "one two three" =~ /(?'aa'on.) (?'bb'tw.) (?'cc'thre.)/;

where aa is the name for the sub-string, “one”, bb is the name for “two” and cc is the name for “three”.
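
The %+ hash can be walked like any other hash. The sketch below lists every name with its captured sub-string; it also shows that a named group still gets a number, so $1 and $+{aa} hold the same text. The copy %named is introduced here only for illustration.

```perl
use strict;
use warnings;

my %named;

if ("one two three" =~ /(?'aa'on.) (?'bb'tw.) (?'cc'thre.)/) {
    # Copy %+ straight away: it always refers to the last successful match.
    %named = %+;

    for my $name (sort keys %named) {
        print $name, " => ", $named{$name}, "\n";
    }

    # Named groups are still numbered from the left.
    print "same\n" if $1 eq $+{aa};
}
```

The output is aa => one, bb => two, cc => three, each on its own line, followed by same.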

Capturing into an Array
Note: in list context (for example, when assigning to an array), what the binding operation returns depends on the g modifier and on the groups in the regex. With the g modifier and no groups, the array receives every overall matched sub-string (see above). With the g modifier and groups, the overall matched sub-string does not go to the array; the sub-strings captured by the groups go instead, for every match found. Without the g modifier, the array receives the group captures of the single match. If you want the overall match to go to the array, place the overall pattern in a group (in parentheses); you then capture only what is in each group (nested or not), not the text matched outside the groups. If you want to capture everything of interest in the subject into the array, place every matching possibility in a group and use the g modifier.
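
These combinations can be compared side by side. The following is only a sketch; the patterns and the array names @whole, @parts and @nested are made up for the comparison.

```perl
use strict;
use warnings;

my $subject = "one two three four";

# g modifier, no groups: the list holds every overall match.
my @whole = $subject =~ /t\w+/g;
print join(",", @whole), "\n";     # two,three

# g modifier with groups: the list holds the groups of every match, in order.
my @parts = $subject =~ /(t)(\w+)/g;
print join(",", @parts), "\n";     # t,wo,t,hree

# No g modifier, overall pattern in a group: the overall match comes first.
my @nested = $subject =~ /(tw. (thre.))/;
print join(",", @nested), "\n";    # two three,three
```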

That is it for this part of the series. Before we leave this part, always remember that the whole regex has to match, not just the groups inside the regex (so do not ignore corresponding spaces in the sub-string that have to be in the regex).

It has been a long ride. We take a break here and continue in the next part of the series.

Chrys

Broad Network
