Basic Modifiers in Perl

Perl Regular Expressions – Part 4

Perl Course

Foreword: When you search a string, you may not know whether what you are looking for is in lower case, upper case or mixed case. A modifier lets you make the match case insensitive. Other modifiers do other things. You will learn some of them in this part of the series.

By: Chrysanthus Date Published: 5 Oct 2015

Introduction

This is part 4 of my series, Perl Regular Expressions. This part covers modifiers: the i modifier, which makes a match case insensitive, and a few others that change how a regex is applied. You should have read the previous parts of the series before coming here, as this is a continuation.

The i Modifier
By default, matching is case sensitive. To make it case insensitive, you have to use what is called the i modifier.

So if we have the regex,

          /send/

and then we also have

    my $subject = "Click the Send button.";

Then the following code will not produce a match:

use strict;

my  $subject = "Click the Send button.";

if ($subject =~ /send/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

The regex did not match the subject string because the regex has “send” where s is in lower case, but the subject has “Send” where S is in upper case. If you want this matching to be case insensitive, then your regex will have to be,

         /send/i

Note the i just next to the second forward slash. The following code will produce a match.

use strict;

my  $subject = "Click the Send button.";

if ($subject =~ /send/i)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

Matching has occurred because we have made the regex case insensitive, with the i modifier.

The g Modifier
The subject may contain more than one substring that matches the regex. So far, we have been assuming that only one substring in the subject produces a match. If you want more than one substring to be matched, you have to use the g modifier. You use it just as you use the i modifier. One situation in which you need the g modifier is when you want to collect all the substrings in the subject that match.

Consider the following subject string:

my $subject = "A cat is an animal. A rat is an animal. A bat is a creature.";

In the above subject, you have the substrings cat, rat and bat, in that order. Each of these substrings matches the following regex:

                   /[cbr]at/

Normally, this pattern will match only the first substring, “cat”. If you want “cat”, “rat” and “bat” all to be matched, you have to use the g modifier as follows:

                  /[cbr]at/g

Putting the g modifier at the end of the regex is not useful by itself. You also need to collect the substrings that matched. The following statement collects all the matched substrings in an array.

     my @arr = ($subject =~ /[cbr]at/g);

The array is @arr. The first item in the array is “cat”; the second is “rat” and the third is “bat”. The following code illustrates this:

use strict;

my $subject = "A cat is an animal. A rat is an animal. A bat is a creature.";

my @arr = ($subject =~ /[cbr]at/g);

print "First Item is: ", $arr[0], "\n";

print "Second Item is: ", $arr[1], "\n";

print "Third Item is: ", $arr[2], "\n";

The output is:

First Item is: cat
Second Item is: rat
Third Item is: bat

Note: g as a modifier means global.
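Since the global match returns a list, the number of matches can be read from the array as well. The following is a minimal sketch of that idea; the variable name $count is only an illustration and does not come from the code above.

```perl
use strict;
use warnings;

my $subject = "A cat is an animal. A rat is an animal. A bat is a creature.";

# In list context, a match with g returns every matched substring.
my @arr = ($subject =~ /[cbr]at/g);

# The number of elements in the array is the number of matches.
my $count = scalar(@arr);

print "Number of matches: $count\n";
print "Matches: @arr\n";
```

Here the count printed is 3, and the matches are listed in the order in which they occur in the subject.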

The pos() function
After a match, the pos() function returns the position in the subject at which the search for the next match will begin. This works with the global modifier.

In the above case, after the first match of “cat”, pos() would return 5. Positions in a string are counted from zero. A good way to use the pos() function is in a while loop. The following code illustrates this:

use strict;

my $subject = "A cat is an animal. A rat is an animal. A bat is a creature.";

while($subject =~ /[cbr]at/g)
  {
    print "Next search starts at position: ", pos($subject), "\n";
  }

This is the output of the code above:

Next search starts at position: 5
Next search starts at position: 25
Next search starts at position: 45

The pos() function takes the subject variable as its argument. It can also be used to set the position in the subject where the search will continue; see later.
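As a brief preview of setting the position, the sketch below assigns a value to pos() before the loop starts. The starting value 20 (the beginning of the second sentence) and the array name @found are choices made for this illustration only.

```perl
use strict;
use warnings;

my $subject = "A cat is an animal. A rat is an animal. A bat is a creature.";

# Assigning to pos() sets where the next match with g will start.
# Position 20 is the start of the second sentence, so "cat" is skipped.
pos($subject) = 20;

my @found;
while ($subject =~ /([cbr]at)/g)
  {
    push @found, $1;
    print "Found $1; next search starts at position: ", pos($subject), "\n";
  }
```

With this starting position, only “rat” and “bat” are found, and the positions reported are 25 and 45.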

The s and m modifiers
The s modifier treats the string as a single line and the m modifier treats it as multiple lines. In many cases, we get what we want without these modifiers. Sometimes, however, we want to keep track of newline characters. A file on the hard disk might be made up of many lines of text, each ending with the \n character. By default, the ^ and $ characters anchor at the beginning and at the end of the subject string. We can make them anchor at the beginning and end of lines. The s and m modifiers affect the interpretation of ^, $ and the dot metacharacter. Here is the full description of the s and m modifiers:

- no modifiers: Here there is no modifier after the second forward slash. Under this condition, '.' matches any character except "\n". ^ matches only at the start of the string, and $ matches only at the end of the subject string or just before a \n at the end. This is the default behavior.

- s modifier: This makes the subject string behave like one long line, regardless of any newline characters in it. So '.' matches any character, even "\n". ^ matches only at the start of the string, and $ matches only at the end of the string or just before a \n at the end.

- m modifier: This makes the subject behave like a set of multiple lines, where consecutive lines are separated by the \n character. '.' still matches any character except "\n". ^ matches at the beginning of the string or just after a \n character within the string, while $ matches just before any \n character or at the end of the string. In this way, ^ and $ are able to match at the start or end of any line within the subject.

I shall use examples to illustrate the above three conditions. I start by looking at the first condition.

No Modifiers
Read the first condition above again. Consider the following subject:

my $subject = "The first sentence.\n The second sentence.\n The third sentence.\n";

The subject string has three lines. The following expression produces a match.

$subject =~ /second/

The substring “second”, in the second line, is matched. Consider the following pattern:

/(^.*$)/

This pattern (regex) is expected under normal circumstances, to match the whole string. Let us see if it does so with the above multi-line subject string. Consider the following code:

use strict;

my $subject = "The first sentence.\n The second sentence.\n The third sentence.\n";

(my $matchedString) = $subject =~ /(^.*$)/;

# $matchedString is undefined when there is no match
print defined($matchedString) ? $matchedString : "No match\n";

If you run this code, no matching will occur. This is because of the presence of the \n character inside the subject string. So the following expression, taken from the code above, does not produce a match:

"The first sentence.\n The second sentence.\n The third sentence.\n" =~ /(^.*$)/

By default, ‘.’ does not match the \n character. In the expression, ^ would match the start of the subject and $ would match the end of the subject. However, ‘.’ does not match the \n and so the matching fails. I hope you now appreciate what the first condition is talking about.
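The first condition also says that $ can match just before a \n at the end of the string. A minimal sketch of that case follows; the one-line subject and the variable names $line and $matched are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical one-line subject that ends with a newline.
my $line = "Only one line.\n";

# No modifiers: '.' stops at "\n", and $ matches just before the final "\n".
my ($matched) = $line =~ /(^.*$)/;

print "Matched: [$matched]\n";
```

The captured text is the line without its final newline, because '.' never consumed the \n.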

The s Modifier
Read the second condition above again. We shall do something similar to what we did above. Consider the following subject string:

my $subject = "The first sentence.\n The second sentence.\n The third sentence.\n";

The subject string has three lines. The following expression produces a match.

         $subject =~ /second/s

Note that the s modifier has been used. The substring “second”, in the second line, is matched. Consider the following pattern:

          /(^.*$)/s

This pattern (regex) is supposed to match the whole string. Let us see if it does so with the above multi-line subject string, now that there is the s modifier. Consider the following code:

use strict;

my $subject = "The first sentence.\n The second sentence.\n The third sentence.\n";

(my $matchedString) = $subject =~ /(^.*$)/s;

print $matchedString;

This is the output:

The first sentence.
 The second sentence.
 The third sentence.

In the output, the whole string has been displayed. The newline characters themselves are not visible; each \n causes the text that follows it to be displayed on the next line. That is why the three sentences appear on three separate lines.

Here, because of the s modifier, /(^.*$)/s matches the whole subject string. I hope you now appreciate what the second condition is talking about.
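The effect of the s modifier on the dot can also be seen in isolation. The sketch below uses a made-up two-line subject, $text, and compares the same pattern with and without s.

```perl
use strict;
use warnings;

# Hypothetical two-line subject.
my $text = "first\nsecond";

# Without s, '.' cannot match the "\n" between the two words.
my $plain = ($text =~ /first.second/) ? "Matched" : "Not Matched";

# With s, '.' matches "\n" as well.
my $single = ($text =~ /first.second/s) ? "Matched" : "Not Matched";

print "Without s: $plain\n";
print "With s: $single\n";
```

Only the second pattern matches, since the character between the two words is a newline.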

The m Modifier
Read the third condition above again. Here we look at the effect of the m modifier. Consider the following subject string:

my $subject = "The first sentence.\n The second sentence.\n The third sentence.\n";

The subject has three lines. The following expression produces a match.

         $subject =~ /second/m

Note that the m modifier has been used. The substring “second”, in the second line, is matched. Consider the following pattern:

          /(^.*$)/m

With the m modifier, this pattern (regex) should match a line. Let us see if it does so with the above multi-line subject string. Consider the following code:

use strict;

my $subject = "The first sentence.\n The second sentence.\n The third sentence.\n";

(my $matchedString) = ($subject =~ /(^.*$)/m);

print $matchedString;

The output of the code is:

The first sentence.

So it matched the first line, as expected. The m modifier has further subtleties, which this series does not go into.
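One common use is still worth a short sketch: combining m with the g modifier to collect every line. The subject $text below is a made-up string with no trailing newline, and @lines is a name chosen for this illustration.

```perl
use strict;
use warnings;

# Hypothetical three-line subject (no trailing newline).
my $text = "first line\nsecond line\nthird line";

# m lets ^ and $ match at every line; g collects every match.
my @lines = ($text =~ /^(.*)$/mg);

print "Line: $_\n" for @lines;
```

Each element of @lines holds one line of the subject, without its newline character.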

The x Modifier
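If you want to include comments in your regex, you can use the x modifier. With the x modifier, whitespace in the regex is ignored (unless it is escaped or inside a character class), and a # begins a comment that runs to the end of the line. We shall see an example later.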
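In the meantime, here is a minimal sketch of the idea, reusing the pattern /[cbr]at/g from earlier in this part. The layout and the comments inside the regex are only one possible way of writing it.

```perl
use strict;
use warnings;

my $subject = "A cat is an animal. A rat is an animal. A bat is a creature.";

# The same pattern as [cbr]at, spread over several lines and commented using x.
my @arr = ($subject =~ /
    [cbr]   # one of the letters c, b or r
    at      # followed by the letters a and t
/xg);

print "@arr\n";
```

Note that a comment inside the regex must not contain the delimiter (the forward slash here), because that would end the regex early.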

Using more than one Modifier
We shall soon take a break. Before we have the break, know that you can have more than one modifier in a regex, as in:

          /send/ig
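A short sketch of this combination follows. The subject string is made up for the illustration, and it contains the same word in three different cases.

```perl
use strict;
use warnings;

# Hypothetical subject with the same word in three different cases.
my $subject = "Click Send. Do not click SEND twice. Then send again.";

# i ignores case; g collects every occurrence.
my @found = ($subject =~ /send/ig);

print "Found: @found\n";
```

All three occurrences are collected, each in the case in which it appears in the subject.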

It is time. Let us take the break. Rendezvous in the next part of the series.

Chrys

Broad Network
