Position Information in Perl Regular Expression

Advanced Perl Regular Expressions – Part 1

Foreword: In this part of the series, I explain how to obtain position information from the subject (the available string) of a Perl match made with the binding operator.

By: Chrysanthus Date Published: 2 Apr 2016

Introduction

This is part 1 of my series, Advanced Perl Regular Expressions. In this part of the series, I explain how to obtain position information from the subject of a Perl match made with the binding operator. I use a localhost web server and ActivePerl for the code samples. If you are using traditional Perl, then begin your script with a line such as #!/usr/bin/perl. In this series, the available string is called the subject. The subject is the string in which the regular expression finds the match. Note: the abbreviation for regular expression in this series is regex.

This series is part of my advanced Perl course. Perl is widely renowned for excellence in text processing, and regular expressions are one of the big factors behind this fame. Perl regular expressions display an efficiency and flexibility unknown in most other computer languages. Your studies of the advanced course in Perl will not be complete if you do not cover the advanced topics of Perl regular expressions, as in this series.

You should have covered the professional series on Perl regular expressions titled Perl Regular Expressions. If you have covered and understood that series, you should be able to solve about 80% of regular expression problems in Perl. By the time you complete this series, you should be able to solve much of the remaining 20%. Most of the time, you do not use this 20% of the material. However, occasionally, some of the principles in the 20% become crucial and you have no choice but to use them.

Pre-knowledge
At the bottom of this page are links to the different series you should have learned before coming here, to better understand this one.

Difference Between Regular Expression and Pattern
There is not much difference between regular expression and pattern. Consider the example,

    /[crb]at/

The delimiting forward slashes are an example of Perl quote-like operators. The whole expression including the forward slashes is the regular expression. The content, [crb]at, within the forward slashes is the pattern.
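As a small illustration of this distinction, here is a sketch in which the same pattern, [crb]at, is placed between different delimiters and also stored with qr//. The sample string and the variable names are my own assumptions, chosen only for the demonstration:

```perl
use strict;
use warnings;

my $text = "The rat ran.";    # hypothetical sample string

# The same pattern, [crb]at, inside three different regex forms.
my $with_slashes = $text =~ /[crb]at/  ? 1 : 0;   # default delimiters
my $with_braces  = $text =~ m{[crb]at} ? 1 : 0;   # m with alternative delimiters
my $pattern      = qr/[crb]at/;                   # pattern kept as a compiled regex
my $with_qr      = $text =~ $pattern   ? 1 : 0;

print "$with_slashes $with_braces $with_qr\n";    # prints: 1 1 1
```

The delimiters change, but the pattern and the text it matches stay the same.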

The pos() Function
After a match with the global (g) modifier, the predefined pos() function returns the position in the subject where the search for the next match will begin. Successive matches are easily found in a while-loop. Read and try the following script:

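use strict;
use warnings;

my $subject = "A cat is an animal. A rat is an animal. A bat is a creature.";

# With the g modifier, each pass of the loop finds the next match.
while ($subject =~ /[cbr]at/g)
  {
    # pos() gives the index just after the match that was found.
    print "Next search starts at position: ", pos($subject), "\n";
  }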

Here is the output of the code above:

Next search starts at position: 5
Next search starts at position: 25
Next search starts at position: 45

Note how the binding operator has been used. The while-loop condition returns true when a match occurs and false when there are no more matches. With the g modifier, each new search begins where the previous match ended.

The pos() function takes as its argument the variable holding the subject. Index counting of the characters in the subject, or in any string in Perl, begins from zero (‘A’ in the subject is at position zero). So, after matching cat, the search continued from index 5; after matching rat, the search continued from index 25; and after matching bat, the search continued from index 45.
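The following sketch shows three further things that can be done with pos(). It computes where each match starts, it checks the value of pos() after the loop has ended, and it assigns to pos() to choose where the next search begins. The capture group, the variable names and the starting index 21 are my own choices for the illustration:

```perl
use strict;
use warnings;

my $subject = "A cat is an animal. A rat is an animal. A bat is a creature.";

my @starts;    # start index of every match
while ($subject =~ /([cbr]at)/g) {
    # pos() is the index just past the match, so subtract the match length.
    push @starts, pos($subject) - length($1);
}
print "Matches start at: @starts\n";              # 2 22 42

# When the last attempt fails, pos() is reset and returns undef.
my $after_loop = defined(pos($subject)) ? pos($subject) : 'undef';
print "pos() after the loop: $after_loop\n";      # undef

# pos() can be assigned to: begin the next g search at index 21.
pos($subject) = 21;
my $next = $subject =~ /[cbr]at/g ? pos($subject) : -1;
print "Search from index 21 stops at: $next\n";   # 25
```

Starting at index 21 skips cat, so the first match found is rat and pos() becomes 25.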

Portions of the Subject concerning Matching
There are three predefined variables, which are ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH}. After a successful match with the binding operator (=~), these variables hold (and can return) certain portions of the subject. They are read-only and dynamically scoped to the enclosing block. ${^PREMATCH} holds the portion of the subject from the beginning up to the match. ${^MATCH} holds the portion of the subject that matched. ${^POSTMATCH} holds the portion of the subject after the match, to the end of the subject string. These variables do not depend on the g modifier. In Perl 5.10 through 5.18, they are only guaranteed to be set when the regex has the p modifier; from Perl 5.20 onward, the p modifier is accepted but does nothing, and the variables are set after every successful match. The script below uses p so that it works in all these versions. Each of the variables works per match. In the following script, the first binding operator is in the “global” (file) scope. The second one is in the condition of a Perl while-loop, where the g modifier is needed only to step through the matches. Read and try the code:

use strict;

    my $subject0 = "This is a big car owner.";
    $subject0 =~ /big/p;
    print ${^PREMATCH}, "\n";
    print ${^MATCH}, "\n";
    print ${^POSTMATCH}, "\n\n";

    my $subject1 = "A cat is an animal. A rat is an animal. A bat is a creature.";

    while($subject1 =~ /[cbr]at/gp)
     {
        print ${^PREMATCH}, "\n";
        print ${^MATCH}, "\n";
        print "Next search starts at position: ", pos($subject1), "\n";
        print ${^POSTMATCH}, "\n\n";
     }

The values of the variables are set only after the binding operation has been executed and has matched. Each successful match gives the variables new values.
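As a check on how the three portions fit together, here is a sketch that joins them back into the subject. It assumes Perl 5.10 or later and uses the p modifier, which the Perl documentation specifies for these variables in versions before 5.20. The variable names are my own:

```perl
use strict;
use warnings;

my $subject = "This is a big car owner.";

my ($before, $matched, $after) = ('', '', '');
if ($subject =~ /big/p) {
    # Copy the three portions while the match variables are still current.
    ($before, $matched, $after) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

print "[$before][$matched][$after]\n";   # [This is a ][big][ car owner.]

# The three portions, joined in order, give back the whole subject.
my $rebuilt = $before . $matched . $after;
print "Rebuilt string equals the subject\n" if $rebuilt eq $subject;
```

Copying the values into ordinary variables straight after the match is a precaution, since a later successful match would replace them.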

Arrays to Hold Start and End Positions
There are two special array variables called @- and @+. For each match, the start index in the subject is held in $-[0] of @- and the end index is held in the corresponding element, $+[0], of @+. The elements of the arrays have to be read in the scope in which the binding operator is used, before another successful match replaces them. In the following program, there are two binding operators: one has one match, and the other, with the g modifier, has three matches. The pair of arrays works per match, even if the g modifier is involved. Read and try the code:

use strict;
use warnings;

    my $subject0 = "This is a big car owner.";
    # One match, without the g modifier.
    $subject0 =~ /big/;
    print $-[0], ', ', $+[0], "\n";

    my $subject1 = "A cat is an animal. A rat is an animal. A bat is a creature.";

    # Three matches, one for each pass of the loop.
    while ($subject1 =~ /[cbr]at/g)
        {
            print $-[0], ', ', $+[0], "\n";
        }

The start index, $-[0], is the position of the first character of the match in the subject. The end index, $+[0], is the position just after the last character of the match in the subject. For the whole match, only the first element of each array is used. The other elements, which are $-[1] and $+[1], $-[2] and $+[2], etc., hold the start and end positions of the text matched by the corresponding capture groups ($1, $2, etc.), when the pattern has parenthesized groups.
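The offsets can be passed to substr() to recover the matched text. The sketch below does this for the whole match and for one capture group, whose offsets are in $-[1] and $+[1] according to the Perl documentation (perlvar). The shortened subject, the capture group and the variable names are my own assumptions for the illustration:

```perl
use strict;
use warnings;

my $subject = "A cat is an animal.";

my ($whole, $group, $group_start, $group_end) = ('', '', -1, -1);
if ($subject =~ /([cbr])at/) {
    # Whole match: starts at $-[0], ends just before $+[0].
    $whole = substr($subject, $-[0], $+[0] - $-[0]);

    # First capture group ($1): its offsets are in element 1 of each array.
    ($group_start, $group_end) = ($-[1], $+[1]);
    $group = substr($subject, $group_start, $group_end - $group_start);
}

print "$whole $group $group_start $group_end\n";   # cat c 2 3
```

The length of any matched portion is its end index minus its start index, which is why the end index points just after the match.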

That is it for this part of the series. We stop here and continue in the next part.

Chrys

Broad Network
