The Text Line and Perl Regular Expression

Advanced Perl Regular Expressions – Part 3

Foreword: In this part of the series I explain how to handle text lines in Perl regular expressions; I explain the use of the ^, $ and dot metacharacters and the s and m modifiers.

By: Chrysanthus Date Published: 2 Apr 2016

Introduction

This is part 3 of my series, Advanced Perl Regular Expressions. In this part of the series I explain how to handle text lines in Perl regular expressions. To be precise, ^, $ and the dot are metacharacters, while s and m are modifiers. You should have read the previous parts of the series because this is a continuation. However, if you are already very good with Perl regular expressions, you can understand this tutorial without reading the previous parts of the series.

The dot Metacharacter
The dot (.) metacharacter is used to match any character within a string but will not match the newline (\n) character. Read and try the following code:

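use strict;

    # First segment: the dot stands for the i in "wind".
    my $subject = "The wind is blowing to the north.";
    if ($subject =~ /(w.nd)/)
        {
            print $1, "\n";
        }

    # Second segment: reuse $subject instead of declaring it again with my.
    $subject = "You have to escape \" to avoid interpolation.";
    if ($subject =~ /(escape . to)/)
        {
            print $1, "\n";
        }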

In the first code segment of interest, the word “wind” is matched, with the dot corresponding to i. In the second code segment, the phrase “escape " to” is matched, with the dot corresponding to ". The \" typed in the source code is an escape sequence, but Perl resolves it when it builds the string, so the subject holds a single " character and the dot matches that one character.
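The point about the escaped quote can be checked directly. The following is a minimal sketch; the variable names ($quote, $quote_length and so on) are my own illustrative choices and are not part of the code above:

```perl
use strict;
use warnings;

# The escape \" is resolved when Perl builds the string,
# so the string holds exactly one character: "
my $quote        = "\"";
my $quote_length = length $quote;

# The dot matches that single character like any other character.
my $dot_matches_quote = ($quote =~ /./) ? 1 : 0;

# Without any modifier, the dot refuses the newline character.
my $dot_matches_newline = ("\n" =~ /./) ? 1 : 0;

print "length of string: $quote_length\n";          # 1
print "dot matches quote: $dot_matches_quote\n";     # 1
print "dot matches newline: $dot_matches_newline\n"; # 0
```

The regex never sees a backslash here; it only sees the characters that are actually stored in the string.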

The dot and the s Modifier
The subject string may have newline (\n) characters. The content of a text file is a typical source of such a subject. Recall: in a text file, a line is terminated by the \n character, which is not displayed in the text editor (you see only the effect). If you want the dot to match a \n character, you have to add the modifier s after the closing delimiter of the regex. Read and try the following code:

use strict;

    my $subject = "This is it\n That is it";
    if ($subject =~ /(it. That)/s)
        {
            print $1, "\n";
        }

Without the s modifier there would be no match. With the s modifier there is a match: the phrase “it\n That” is matched, with the dot in the regex corresponding to \n.
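To see the difference side by side, here is a small sketch that runs the same regex with and without s on the subject used above. The names $without_s and $with_s are assumptions of mine for illustration:

```perl
use strict;
use warnings;

my $subject = "This is it\n That is it";

# Same pattern twice; only the s modifier differs.
my $without_s = ($subject =~ /it. That/)  ? "matched" : "not matched";
my $with_s    = ($subject =~ /it. That/s) ? "matched" : "not matched";

print "without s: $without_s\n";   # without s: not matched
print "with s: $with_s\n";         # with s: matched
```

The only character between “it” and “ That” in the subject is \n, so the dot can bridge the gap only when s is present.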

With or without the s modifier, the dot metacharacter will match any character before or after the \n character in the subject string. The following code illustrates this. Read and try it.

use strict;

    my $subject = "This is a line\n This is another line";
    if ($subject =~ /l.ne/)
        {
            print "Matched", "\n";
        }
    if ($subject =~ /an.ther/)
        {
            print "Matched", "\n";
        }
    if ($subject =~ /l.ne/s)
        {
            print "Matched", "\n";
        }
    if ($subject =~ /an.ther/s)
        {
            print "Matched", "\n";
        }

So, the presence of the \n character does not disturb the matching of the dot metacharacter against the other characters; however, the dot will not match \n itself if the s modifier is not used.

Matching Start and End

Beginning of a string and Beginning of a Line
The ^ metacharacter in a regex matches the beginning of a string whether or not the string has \n characters within. A string that does not have \n within is considered as a one-line string. A multi-line string is a string that has one or more \n characters. The following program, which you should read and try, illustrates this:

use strict;

    my $subject0 = "This is a sentence";
    if ($subject0 =~ /^This/)
        {
            print "Matched", "\n";
        }
    my $subject1 = "This is a sentence\nThis is another sentence";
    if ($subject1 =~ /^This is a sentence/)
        {
            print "Matched", "\n";
        }
    if ($subject1 =~ /^This is another/)
        {
            print "Matched", "\n";
        }
    else
        {
            print "Not Matched", "\n";
        }

The first two if-constructs output “Matched”. The last if-construct outputs “Not Matched”: without the m modifier, ^ matches only the very beginning of the subject string and never the beginning of a line inside it, whether or not \n is present in the subject.

Note: ^ is normally typed at the beginning of the regex.

End of a String and End of a Line
The $ metacharacter in a regex matches the end of a string whether or not the string has \n characters within. (If the very last character of the string is \n, $ also matches just before that final \n.) A string that does not have \n within is considered as a one-line string. A multi-line string is a string that has one or more \n characters. The following program, which you should read and try, illustrates this:

use strict;

    my $subject0 = "This is a sentence";
    if ($subject0 =~ /a sentence$/)
        {
            print "Matched", "\n";
        }
    my $subject1 = "This is a sentence\nThis is another sentence";
    if ($subject1 =~ /another sentence$/)
        {
            print "Matched", "\n";
        }
    if ($subject1 =~ /a sentence$/)
        {
            print "Matched", "\n";
        }
    else
        {
            print "Not Matched", "\n";
        }

The first two if-constructs output “Matched”. The last if-construct outputs “Not Matched”: without the m modifier, $ matches only at the end of the subject string and never at the end of a line inside it, whether or not \n is present in the subject.
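One detail is worth checking with code: without m, $ also matches just before a \n that is the very last character of the string, which is what you get when a line is read from a file and not chomped. The sketch below assumes two hypothetical variables, $line and $inner, that are not part of the program above:

```perl
use strict;
use warnings;

# A line as it typically arrives from a file: the newline is still attached.
my $line = "This is a sentence\n";
my $matches_before_final_newline = ($line =~ /sentence$/) ? 1 : 0;

# A newline in the middle of the string is different: without m,
# $ does not match before it.
my $inner = "This is a sentence\nThis is another sentence";
my $matches_before_inner_newline = ($inner =~ /a sentence$/) ? 1 : 0;

print "before final newline: $matches_before_final_newline\n";   # 1
print "before inner newline: $matches_before_inner_newline\n";   # 0
```

So a trailing \n does not stop $ from matching, but a \n inside the string does, unless the m modifier is used.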

Note: $ is normally typed at the end of the regex.

Matching a Line within Lines
The m modifier, together with the ^ and $ metacharacters, is used to identify a line within lines. With m, the ^ matches the beginning of any line within a multi-line string and the $ matches the end of any line within a multi-line string. Read and try the following code that illustrates matching of a line:

use strict;

    my $subject = "You can use The as article\nThe second line\nA line is good";
    if ($subject =~ /^The second line$/m)
        {
            print "Matched", "\n";
        }
    if ($subject =~ /^You can line is good$/)
        {
            print "Matched", "\n";
        }
    else
        {
            print "Not Matched", "\n";
        }

For the first code segment (if-construct) of interest, there is a match because ^ and $ in the regex start and end a particular line in the subject. In the second code segment (if-construct) of interest, there is no match because the line suggested by the regex, i.e. “You can line is good”, does not exist in the subject.

To every rule, there is an exception: in the presence of the m modifier, ^ still matches the beginning of the subject string, and it also matches the beginning of every line inside the subject string (that is, just after every \n). Likewise, in the presence of m, $ matches the end of the subject string and also the end of every line inside it (just before every \n). Which of these positions is actually used depends on the coding: a single match stops at the first position, counting from the left, where the whole regex succeeds, while with the g modifier or a loop, successive matches move on to the later lines. The following code illustrates this:

use strict;

    my $subject = "I am a man\nYou are a man\nShe is a woman";
    if ($subject =~ /(^I am)/m)
        {
            print $1, "\n";
        }
    if ($subject =~ /(^You)/m)
        {
            print $1, "\n";
        }
    if ($subject =~ /(woman$)/m)
        {
            print $1, "\n";
        }
    if ($subject =~ /(man$)/m)
        {
            print $1, "\n";
        }

The output is:

I am
You
woman
man

Try the code.
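The effect of the g modifier mentioned above can be sketched as follows. In list context, a match with g and one capturing group returns every captured text, so with m you get one result per line. The array names @first_words and @last_words are assumptions of mine:

```perl
use strict;
use warnings;

my $subject = "I am a man\nYou are a man\nShe is a woman";

# With m and g, ^ is tried at the start of the string and after each \n.
my @first_words = $subject =~ /^(\w+)/mg;

# With m and g, $ is tried before each \n and at the end of the string.
my @last_words = $subject =~ /(\w+)$/mg;

print join(", ", @first_words), "\n";   # I, You, She
print join(", ", @last_words), "\n";    # man, man, woman
```

Without g, each of these regexes would stop at the first line and return only “I” and “man” respectively.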

You can use the ^ metacharacter to match the beginning of a line anywhere in the string, without matching the end of the line. You can also use the $ character to match the end of any line in the string, without matching the beginning of the line. Read and try the following code that illustrates this:

use strict;

    my $subject = "You can use The in first line right\nThe second line\nA line is good";
    if ($subject =~ /(^The....)/m)
        {
            print $1, "\n";
        }
    if ($subject =~ /(The....)/)
        {
            print $1, "\n";
        }
    if ($subject =~ /(..line$)/m)
        {
            print $1, "\n";
        }
    if ($subject =~ /(..line)/)
        {
            print $1, "\n";
        }

The output is:

The sec
The in
d line
t line

In the first if-construct, the beginning of the second line is matched because of the ^ and the m modifier of the regex. In the second if-construct, it is a phrase within the first line that is matched because of the absence of ^ and m. In the third if-construct, it is the end of the second line that is matched because of the presence of $ and m. In the fourth if-construct, it is a phrase within the first line that is matched because of the absence of $ and m.

Note: in the absence of the m modifier, the ^ will match only the beginning of the string, independent of whether the string has \n characters. Also, in the absence of m, the $ will match only the end of the string, independent of whether the string has \n characters. Read and try the following code for these:

use strict;

    my $subject = "The first line\nThe second line\nThe third line";
    if ($subject =~ /(^The....)/)
        {
            print $1, "\n";
        }
    if ($subject =~ /(....line$)/)
        {
            print $1, "\n";
        }

Also note that ^ and $ match positions, not characters. The ^ matches just after the opening of the string (or, with m, just after a newline character) and the $ matches just before the close of the string (or, with m, just before the next newline character). Neither of them consumes any character of the subject.
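A substitution makes these positions visible, because text can be inserted at a position without replacing any character. The following is a minimal sketch; the marker strings “> ” and “;” and the variable names $quoted and $marked are arbitrary choices of mine:

```perl
use strict;
use warnings;

my $subject = "The first line\nThe second line\nThe third line";

# Copy the subject, then insert "> " at every position where ^ matches.
(my $quoted = $subject) =~ s/^/> /mg;

# Copy the subject, then insert ";" at every position where $ matches.
(my $marked = $subject) =~ s/$/;/mg;

print $quoted, "\n";
print $marked, "\n";
```

Every original character, including each \n, is still present in $quoted and $marked; only the inserted markers are new, which shows that ^ and $ matched between characters.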

Comment on the Dot and Start and End Metacharacters
So far as lines are concerned, the start (^) and end ($) metacharacters as a pair, and the dot metacharacter, handle the string in a somewhat similar way. The dot metacharacter will match any character in a string except a \n present in the string. If you want the dot metacharacter to match a \n present in the string, you have to use the s modifier. In a somewhat similar way, the ^ and $ pair matches the beginning and end of a string whether or not the \n character is present. If you want the pair to match a line within the string, you have to use the m modifier.

Matching \n with the start or end of line
In a multi-line string, you may want to match some text before and after the \n character as well as the \n character itself. It is simple: just use the dot at the position of the \n character in the subject and remember to use the s modifier. One of the early code samples above illustrates this.

Do not confuse end-of-line with end-of-string: the end of a line inside the string is marked by \n, while the end of the string is the literal end of the string, which may or may not have a \n just before it.

If you want to match the \n character as well as the beginning or end of a line, in general terms, use the s and m modifiers. The s modifier makes the dot match the \n character, while the m modifier makes the regex (with ^ or $) match the beginning or end of a line within the string. In the absence of m only the very beginning or very end of the string will be matched by the regex. In the absence of s the dot in the regex cannot match \n in the subject (\n in regex will match \n in the subject at all times). Read and try the following code that gives an illustration for the \n character and end of a line:

use strict;

    my $subject = "The first line\nThe second line\nThe third line\nThe fourth line";
    if ($subject =~ /(line...........line$)/sm)
        {
            print $1, "\n";
        }

The matched portion of the subject is,

    line\nThe third line

The regex has 11 dots: one dot per corresponding character in the subject. The first dot is for \n. The first “line” in the subject cannot start the match, because 12 characters (“\nThe second ”) separate it from the next “line”; the 11 characters “\nThe third ” fit exactly, so the portion of the string that has been matched has the third line. If the regex had twelve dots, then the portion of the string matched would have been that with the second line. I hope you appreciate the fact that, because of the sm modifiers, the \n and the next end of line have been matched. You can modify the above code for \n and previous start of line, using ^.

One can say that with the sm combination, the dot treats \n like any other character, while ^ and $ work line by line within the multi-line string.
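As one possible version of the modification suggested above, here is a sketch that matches the start of a line, the rest of that line, the \n, and the first word of the next line. The regex and the variable names are my own illustration, not the only way to do it:

```perl
use strict;
use warnings;

my $subject = "The first line\nThe second line\nThe third line\nThe fourth line";

# ^ needs m to match at the start of the second line;
# the dot after "line" needs s to match the \n.
my ($with_sm) = $subject =~ /(^The second line.The)/sm;

# Drop either modifier and the match fails, leaving the variable undefined.
my ($with_m_only) = $subject =~ /(^The second line.The)/m;
my ($with_s_only) = $subject =~ /(^The second line.The)/s;

print defined $with_sm     ? "sm: matched\n" : "sm: no match\n";   # sm: matched
print defined $with_m_only ? "m: matched\n"  : "m: no match\n";    # m: no match
print defined $with_s_only ? "s: matched\n"  : "s: no match\n";    # s: no match
```

With m alone, the dot cannot cross the \n; with s alone, ^ can only sit at the very beginning of the string, where “The second line” does not occur.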

Conclusion
So far as lines are concerned, the start and end metacharacters as a pair, and the dot metacharacter, handle the string in a somewhat similar way. The dot metacharacter will match any character in a string except a \n present in the string. If you want the dot metacharacter to match a \n present in the string, you have to use the s modifier. In a somewhat similar way, the ^ and $ pair matches the beginning and end of a string whether or not the \n character is present. If you want the pair to match a line within the string, you have to use the m modifier. You can use the ^ and the m modifier to match any beginning of line in a multi-line string; you can use the $ and the m modifier to match any end of line in a multi-line string. You do not always have to use ^ and $ together. To match \n with the start or end of a line, you use the dot metacharacter with the ^ or $ accordingly, and with the sm modifier combination.

It has been a long ride. We have to take a break. Rendezvous in the next part of the series.

Chrys

Broad Network
