More Regular Expressions in Perl

Regular Expressions in Perl for the Novice – Part 8

Foreword: The earlier parts of this series covered enough of regular expressions in Perl to solve many everyday problems. Sooner or later, though, you will want to do more with regexes. This last part shows you how.

By: Chrysanthus Date Published: 13 Aug 2012

Introduction

This is the eighth and last part of my series, Regular Expressions in Perl for the Novice. It covers compiling regular expressions, embedding comments and modifiers in a pattern, and non-capturing groups.

Compiling Regular Expressions
You may use the same regular expression over and over in a script. A pattern written as a plain literal, such as /World/, is compiled once, when the script itself is compiled. A pattern that interpolates variables, however, is built at run time and has to be recompiled whenever the interpolated text changes. To avoid repeated work, and to keep a pattern in one place, you can compile the regex once and reuse the compiled form throughout your script. The qr// operator compiles a regex and returns a compiled form of it that can be assigned to a variable. So you can have:

          my $reg = qr/pattern/;

The $reg can now be used in a binding operation. So you can have the following code segment:

      my $availableString = "Hello World!";

      my $reg = qr/World/;

      $availableString =~ $reg;

The above binding expression produces a match.

The second statement above does the compilation and assignment to the variable. Consider the following:

      my $availableString = "Hello World";

      my $reg = qr/World/;

      $availableString =~ $reg;

      $availableString =~ /World/;

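The second statement here compiles the regex and assigns the result to a variable. The last two statements perform the same match. For a fixed literal such as /World/ there is little or no speed difference, because Perl compiles a literal pattern only once. The gain from qr// comes when the pattern is built from variables at run time, or when the same pattern is reused in several places.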
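As a minimal sketch of where qr// helps, the script below builds a pattern from a variable, compiles it once, and reuses it in a loop. The names $word, @lines and $count, and the sample strings, are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data for this illustration.
my $word  = "World";
my @lines = ("Hello World!", "Goodbye Moon", "World news");

# Compile once. \Q...\E quotes any regex metacharacters in $word.
my $reg = qr/\Q$word\E/;

my $count = 0;
for my $line (@lines) {
    $count++ if $line =~ $reg;    # reuse the compiled pattern
}

print "$count of ", scalar(@lines), " lines matched\n";
```

With the sample data above, two of the three lines contain "World".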

You can include the compiled $reg inside another regex, e.g.

       "Hello World!" =~ /$reg!/

The above statement matches the same text as

       "Hello World!" =~ /World!/

Note the presence of the exclamation sign in the regex; $reg and ‘!’ together form the regex.
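The following sketch shows a compiled regex used inside another regex. It also shows that a compiled regex keeps its own modifiers when it is included in a larger pattern. The variable $ci and the result variables are names chosen for this illustration.

```perl
use strict;
use warnings;

my $reg = qr/World/;
my $ci  = qr/world/i;    # compiled with its own i modifier

# A compiled regex followed by a literal '!'
my $plain = ("Hello World!" =~ /$reg!/) ? "Matched" : "Not Matched";

# The i modifier travels with $ci; "Hello " stays case sensitive
my $mixed = ("Hello World!" =~ /Hello $ci!/) ? "Matched" : "Not Matched";

print "With \$reg: $plain\n";
print "With \$ci:  $mixed\n";
```

Both lines report a match. In the second pattern only the part supplied by $ci is case-insensitive.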

Embedding Comments and Modifiers in a Regular Expression
Embedding Comments
We saw earlier how a comment can be embedded in a regex with the x modifier. The construct described in this section lets you embed a comment inline, without the x modifier.

The expression to use to embed a comment is

               (?#Comment)

You start with ‘(?#’, type your comment, and end with ‘)’. The word “Internet” normally starts with ‘I’ in upper case. The regex,

         /the I(?i)nternet/

can be commented as follows:

/the I(?# the first part of the regex)(?i)nternet(?# I for Internet must be in upper case)/

We saw the use of the x modifier to include a comment in a regex in part VI. The “(?#Comment)” construct is good when your regex and comments are on one line. If you want your regex and its comments to span more than one line, use the x modifier instead. With the x modifier, white space in the pattern is ignored, so any white space that must be matched literally has to be escaped, as with the space between “the” and “I” below:

$availableString =~ /the\ I   # the first part of the regex
                     nternet  # I for Internet must be in upper case
                    /x

The following code illustrates this:

use strict;

my $availableString = "Use the Internet.";

if ($availableString =~ /the\ I   # the first part of the regex
                         nternet  # I for Internet must be in upper case
                        /x)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

With the x modifier, # begins a comment that runs to the end of the line.

Whether or not you use the x modifier, note that the “(?#Comment)” construct cannot be nested: the comment ends at the first ‘)’, so you cannot have “(?#Comment(?#Comment))” in a regex.
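The inline comment construct can be tried with a short script such as the sketch below. No x modifier is used, so the space between "the" and "I" is matched literally. The variable $result is a name chosen for this illustration.

```perl
use strict;
use warnings;

my $availableString = "Use the Internet.";

# Inline comments; the pattern itself is just "the Internet"
my $result =
    ($availableString =~ /the I(?# capital I is required)nternet(?# rest of the word)/)
    ? "Matched"
    : "Not Matched";

print "$result\n";
```

The comments are ignored by the regex engine, so the script reports a match.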

Embedding Modifiers
You can embed modifiers in the regex (in the pattern). I will use the case-less modifier, i, to illustrate this. Remember, the case-less modifier makes the matching case-insensitive. A modifier is embedded by placing it between “(?” and “)”, that is, just after the ‘?’ sign. Where the modifier sits, and whether it is inside a group, determine how much of the regex it affects (see below).

Consider the available string,

              "XYZ"

and the regex,

              /(?i)xyz/

Note the construct “(?i)”, which carries the i modifier. The above regex would match the whole of the above string. The following expression produces a match:

            "XYZ" =~ /(?i)xyz/

Consider the following regex:

              /xy(?i)z/

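Here, the modifier has been put just before the last character, ‘z’. The effect is not the same as before. An embedded modifier takes effect only from the point where it appears, up to the end of the regex (or the end of the enclosing group, if it is inside one). In /xy(?i)z/, “xy” is still case sensitive and only “z” is case-insensitive, so the regex matches “xyz” and “xyZ” but not “XYZ”. So

      /(?i)xyz/ and /xyz/i

mean the same thing, while /xy(?i)z/ does not.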
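How the position of an embedded modifier affects matching can be checked with a short script. The sketch below is an illustration, not one of the original examples. It relies on the behaviour described in the Perl documentation (perlre), where an embedded modifier applies from the point at which it appears to the end of the enclosing group or pattern. The names @tests and %result are chosen for this illustration.

```perl
use strict;
use warnings;

# Each entry: a label for printing, and the compiled regex
my @tests = (
    [ '/(?i)xyz/', qr/(?i)xyz/ ],
    [ '/xy(?i)z/', qr/xy(?i)z/ ],
    [ '/xyz/i',    qr/xyz/i    ],
);

my %result;
for my $subject ("XYZ", "xyZ") {
    for my $test (@tests) {
        my ($label, $re) = @$test;
        $result{"$subject $label"} = ($subject =~ $re) ? "match" : "no match";
        printf "%-4s =~ %-10s %s\n", $subject, $label, $result{"$subject $label"};
    }
}
```

"XYZ" is matched by /(?i)xyz/ and /xyz/i but not by /xy(?i)z/, because "xy" in that pattern remains case sensitive. "xyZ" is matched by all three.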

Non-capturing Groups
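A group is a part of the pattern enclosed in parentheses. By default, the text matched by each group is captured: it is stored in the special variables $1, $2 and so on, and in list context the match returns the captured texts as a list, which can be assigned to an array. Consider the following code: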

use strict;

my @arr = ("This is one and that is two." =~ /(one).*(two)/);

print $arr[0], "\n";

print $arr[1], "\n";

This is the output of the above code:

one
two

The output consists of the words, “one” and “two”. These are elements captured and stored in the array, @arr.

You may not want to capture every group. If you do not want to capture a group, precede the content of the group with “?:”. To prevent the group “(one)” above from being captured, you need “(?:one)” for the group. The group still groups its content in the pattern (for example, for alternation or a quantifier), but its text is not captured. The following code illustrates this:

use strict;

my @arr = ("This is one and that is two." =~ /(?:one).*(two)/);

print "$_\n" foreach @arr;    # print every captured element

The output of the code is:

two

We prevented the first group, “(one)” from being captured by transforming it into, “(?:one)”. From the output, we see that “one” of the available string has not been captured, as we expected. “two” has been captured, and it is the only element in the array.

To make a group non-capturing, use the following syntax:

(?:groupContent)
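Captured text is also available in the variables $1, $2 and so on. The sketch below uses them to show how a non-capturing group changes the numbering. The variables $text, $first and $second are names chosen for this illustration.

```perl
use strict;
use warnings;

my $text = "This is one and that is two.";
my ($first, $second) = ("", "");

# Two capturing groups: $1 and $2 are both set
if ($text =~ /(one).*(two)/) {
    $first = "$1,$2";
}

# First group is non-capturing: "two" is now in $1
if ($text =~ /(?:one).*(two)/) {
    $second = $1;
}

print "capturing:     $first\n";
print "non-capturing: $second\n";
```

The first line shows "one,two". The second shows only "two", because the non-capturing group is skipped when the capture variables are numbered.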

Including Modifiers in Non-Capturing Groups
We have seen how you can embed modifiers in a regex. You may want to include a modifier in a non-capturing group. There are two ways of doing this: embed the modifier inside the group, as in “(?:(?i)one)”, or place it between the ‘?’ and the ‘:’, as in “(?i:one)”. Let us say you want to include the modifier, i, in the non-capturing group “(?:one)” above. The first way looks like this:

          (?:(?i)one)

The following expression produces a match:

         "This is ONE and that is two." =~ /(?:(?i)one).*(two)/

You can use the following code to test:

use strict;

if ("This is ONE and that is two." =~ /(?:(?i)one).*(two)/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }
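The shorthand form, in which the modifier sits between the ‘?’ and the ‘:’, can be compared with the nested form using a sketch like the one below. The array @forms and the list @captured are names chosen for this illustration.

```perl
use strict;
use warnings;

my $text = "This is ONE and that is two.";

# Nested form and shorthand form of the same non-capturing group
my @forms = (qr/(?:(?i)one).*(two)/, qr/(?i:one).*(two)/);

my @captured;
for my $re (@forms) {
    if ($text =~ $re) {
        push @captured, $1;
        print "Matched, captured: $1\n";
    }
    else {
        print "Not Matched\n";
    }
}
```

Both forms match, and in both cases the only captured text is "two".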

Modifiers in Groups
A modifier placed at the end of the regex, as in /xyz/i, affects the whole regex. The question you may have is this: “If the modifier is in a group, does it have its effect only in the group or in the rest of the regex as well?”

Let us just write four short scripts to verify that. This is the first:

use strict;

if ("This is ONE and that is two." =~ /((?i)one).*(two)/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

The regex above is “/((?i)one).*(two)/”. Note that in the available string, “ONE” is in upper case. Matching occurs in the above code. Here we are dealing with a capturing group. Consider the following code still with a capturing group:

use strict;

if ("This is ONE and that is TWO." =~ /((?i)one).*(two)/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

The regex above is still “/((?i)one).*(two)/”. Note that in the available string, “ONE” is still in upper case and “TWO”, this time, is also in upper case. Matching does not occur in the code above. In the regex, “two” is in lower case and lies outside the group that holds the modifier; this is why matching does not occur.

The above two programs deal with capturing groups. Would non-capturing groups behave in the same way? We shall use two more simple programs to verify this. Consider the following:

use strict;

if ("This is ONE and that is two." =~ /(?:(?i)one).*(two)/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

The regex above is “/(?:(?i)one).*(two)/”. Note that in the available string, “ONE” is in upper case. Matching occurs in the above code. Here we are dealing with a non-capturing group. Consider the following code, which is also with a non-capturing group:

use strict;

if ("This is ONE and that is TWO." =~ /(?:(?i)one).*(two)/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

The regex above is still “/(?:(?i)one).*(two)/”. Note that in the available string, “ONE” is still in upper case and “TWO”, this time, is also in upper case. Matching does not occur in the code above. In the regex, “two” is in lower case and lies outside the group that holds the modifier; this is why matching does not occur.

Well, this is a fact: whether you are dealing with capturing or non-capturing groups, a modifier inside a group applies only from where it appears to the end of that group; it does not affect the rest of the regex.

That is it for this section.

And finally, we have come to the end of the series. We have seen many things. If you have understood the series, then you will be able to do a lot with regular expressions in Perl. Your immediate problem now is how to handle patterns; that is, how to quickly build an efficient pattern, and how to look at a pattern and deduce the set of sub strings it can match. I intend to write a short series on Handling Patterns in Perl Regular Expressions.

Chrys

Broad Network
