More Regular Expression Patterns in Perl

Regular Expressions in Perl for the Novice – Part 3

Foreword: In this part of the series, we continue to analyze patterns in Perl regular expressions.

By: Chrysanthus Date Published: 13 Aug 2012

Introduction

This is the third part of my series, Regular Expressions in Perl for the Novice. In this part of the series, we continue to analyze patterns in Perl Regular Expressions.

Matching Repetitions
In the available string, characters or groups of characters may repeat. We shall cover groups of characters later; for now, let us concentrate on a single character repeating itself. Quantifier metacharacters let us match repetitions of a single character or of a group of characters in the available string. These metacharacters are ?, *, +, and {}. They let us specify how many repeats we are looking for. A quantifier is placed immediately after the character, character class, or grouping (see later) in the regex. Here they are with their meanings, where x refers to a particular character:

x*         :   match 'x' 0 or more times, i.e., any number of times

x+         :   match 'x' 1 or more times, i.e., at least once

x?         :   match 'x' 0 or 1 times

x{n,}      :   match 'x' at least n times; note the comma

x{n}       :   match 'x' exactly n times

x{n,m}     :   match 'x' at least n times, but not more than m times

Note: the letter ‘x’ above stands for any character of a text, e.g. ‘b’, ‘c’, ‘d’, ‘1’, ‘2’, etc. The quantifier is typed inside a pattern (regex).

Examples
*
Matches the preceding item 0 or more times. Because zero occurrences count as a match, /o*/ on its own succeeds against any string: against "A ghost booooed" it matches the empty string at the very start, before the ‘A’, rather than the ‘o’ in ‘ghost’. To give the regex more meaning you have to combine it with other characters. For example, /bo*/ matches 'boooo' in "A ghost booooed" and 'b' in "A bird warbled", but nothing in "A goat grunted", even though this last string has an ‘o’.

+
Matches the preceding item 1 or more times. Equivalent to {1,} – see below. /a+/ matches the 'a' in "candy" and all the a's in "caaaaaaandy".

?
Matches the preceding item 0 or 1 time. /e?le?/ matches the 'el' in "angel" and the 'le' in "angle". /e?le?/ means you have a word which has ‘l’ optionally preceded by ‘e’ and optionally followed by ‘e’. This means it will also match the ‘l’ in “lying”. By the time you finish this series, you will know how to modify the regex to restrict it to match only “angel” or “angle”.

{n,}
Where n is a positive integer. This matches at least n occurrences of the preceding item.

For example, /a{2,}/ doesn't match the 'a' in "candy", but matches all of the a's in "caandy" and in "caaaaaaandy".

{n}
Where n is a positive integer. This matches exactly n occurrences of the preceding item. /a{2}/ doesn't match the 'a' in "candy", but it matches all of the a's in "caandy", and only the first two a's in "caaandy".

{n,m}
Where n and m are positive integers. This matches at least n and at most m occurrences of the preceding item.

For example, /a{1,3}/ matches nothing in "cndy", the 'a' in "candy", both a's in "caandy", and the first three a's in "caaaaaaandy". Notice that when matching "caaaaaaandy", the match is "aaa", even though the available string has more a's in it.
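The examples above can be checked with a short script. What follows is only a sketch: first_match() is a helper name I made up for illustration (it is not a Perl built-in), and it relies on the special variable $&, which holds the text of the last successful match.

```perl
use strict;
use warnings;

# first_match() is a hypothetical helper for this sketch, not a Perl built-in.
# It returns the text of the leftmost match, or undef when nothing matches.
sub first_match {
    my ($string, $pattern) = @_;
    return $string =~ /$pattern/ ? $& : undef;
}

# Each row holds an available string and a pattern from the examples above.
my @cases = (
    [ "A ghost booooed", 'o*'     ],   # empty match at the very start
    [ "A ghost booooed", 'bo*'    ],   # 'boooo'
    [ "A bird warbled",  'bo*'    ],   # 'b'
    [ "A goat grunted",  'bo*'    ],   # no match
    [ "angel",           'e?le?'  ],   # 'el'
    [ "angle",           'e?le?'  ],   # 'le'
    [ "candy",           'a{2,}'  ],   # no match
    [ "caaandy",         'a{2}'   ],   # 'aa'
    [ "caaaaaaandy",     'a{1,3}' ],   # 'aaa'
);

for my $case (@cases) {
    my ($string, $pattern) = @$case;
    my $found  = first_match($string, $pattern);
    my $result = defined $found ? "matched '$found'" : "no match";
    print "/$pattern/ against \"$string\": $result\n";
}
```

The first row is the one worth studying: /o*/ reports a match, but the matched text is empty, because zero o's at the start of the string already satisfy the quantifier.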

The following code produces a match:

my $year = "2009";

$year =~ /\d{2,4}/

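This is a simple check that the year contains a run of at least 2 digits. Because the pattern is not anchored, a longer string such as "20091" also matches: the regex simply matches its first 4 digits. You can try the above with the following program: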

use strict;
use warnings;

my $year = "2009";

# Succeeds when $year contains a run of 2 to 4 digits
if ($year =~ /\d{2,4}/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }
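If the intention is to accept a year of 2 to 4 digits and nothing else, the pattern needs anchoring. The sketch below assumes the anchors ^ (start of string) and $ (end of string), which are standard Perl but are not explained in this part; is_short_year() is a hypothetical helper name chosen only for the example.

```perl
use strict;
use warnings;

# is_short_year() is a hypothetical helper name used only in this sketch.
# ^ anchors the match at the start of the string and $ at the end
# (or just before a trailing newline), so nothing may surround the digits.
sub is_short_year {
    my ($text) = @_;
    return $text =~ /^\d{2,4}$/ ? 1 : 0;
}

# Made-up candidate values to compare the two patterns.
for my $candidate ("2009", "09", "7", "20091", "year 2009") {
    my $unanchored = $candidate =~ /\d{2,4}/   ? "yes" : "no";
    my $anchored   = is_short_year($candidate) ? "yes" : "no";
    print "$candidate: unanchored=$unanchored, anchored=$anchored\n";
}
```

With the unanchored pattern, "20091" and "year 2009" both match; with the anchored one, only "2009" and "09" do.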

Matching Alternation
We can match different character strings with the alternation metacharacter '|'. To match ‘pig’ or ‘sheep’, we form the regex /pig|sheep/. Perl will try to match the regex at the earliest possible point in the available string. At each character position, Perl will first try to match the first alternative, ‘pig’. If ‘pig’ doesn't match, Perl will then try the next alternative, ‘sheep’. If ‘sheep’ doesn't match either, then Perl moves on to the next position and starts with the first alternative again.

Some examples:
The following produces a match:

            "pigs are a group of animals" =~ /pig|sheep|cow/

Here, ‘pig’ is matched. There is no ‘sheep’ or ‘cow’ in the available string.

Note that in the available string, it is the sequence of letters ‘p’, ‘i’, and ‘g’ that is matched, not ‘pigs’: there is no ‘s’ after “pig” in the regex. What is matched is not a word but a sub-string, which consists of characters and may even be a single character.

Note as well that the space in the available string is a character, which could be part of a matched sub-string. What I have just said applies to all other matching, not only alternation.

The following produces a match:

            "sheep are a group of animals" =~ /pig|sheep|cow/

Here, ‘sheep’ is matched. There is no ‘pig’ or ‘cow’ in the available string. The search did not find ‘pig’, so it matched ‘sheep’.

The following produces a match:

            "cows are a group of animals" =~ /pig|sheep|cow/

Here, ‘cow’ is matched. There is no ‘pig’ or ‘sheep’ in the available string. The search did not find ‘pig’ or ‘sheep’, so it matched ‘cow’.

Now, in the following expression ‘pig’ and not ‘sheep’ is matched.

            "pigs and sheep are groups of animals" =~ /pig|sheep|cow/

This is because ‘pig’ appears before ‘sheep’ in the available string.

Also in the following expression ‘sheep’ and not ‘pig’ is matched.

            "pigs and sheep are groups of animals" =~ /sheep|pig|cow/

This is because, even though ‘sheep’ is the first alternative in the regex, ‘pig’ appears before ‘sheep’ in the available string.
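The rule that the earliest position wins can be observed directly. In the sketch below, which_alternative() is a hypothetical helper of mine that returns the matched sub-string through the special variable $&, and "piglets" is a made-up string added to show the one situation where the order of the alternatives does matter: when two of them can match at the same position.

```perl
use strict;
use warnings;

# which_alternative() is a hypothetical helper for this sketch.
# It returns the sub-string that the regex matched, or undef on failure.
sub which_alternative {
    my ($string, $regex) = @_;
    return $string =~ $regex ? $& : undef;
}

my $string = "pigs and sheep are groups of animals";

# The earliest position in the string wins, whatever the order in the regex.
my $pig_first   = which_alternative($string, qr/pig|sheep|cow/);   # 'pig'
my $sheep_first = which_alternative($string, qr/sheep|pig|cow/);   # 'pig'

# Order matters only when two alternatives can match at the same position.
my $short_first = which_alternative("piglets", qr/pig|piglet/);    # 'pig'
my $long_first  = which_alternative("piglets", qr/piglet|pig/);    # 'piglet'

print "$pig_first, $sheep_first, $short_first, $long_first\n";
```

At a given position Perl takes the first alternative that succeeds, not the longest one, so placing the longer alternative first is the usual way to prefer it.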

Metacharacters
There are some characters that you cannot use literally in a regex, because they have special meanings there. Here they are:

    \ {} [] () ^ $ . | * + ? /

They are called metacharacters. The forward slash is special only because it is the delimiter that encloses the regex.

A metacharacter can be matched literally by putting a backslash before it. The following examples illustrate this:

"3+3=6" =~ /3+3/                # doesn't match because '+' is a metacharacter
"3+3=6" =~ /3\+3/               # matches because '\+' is an ordinary '+'

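The following expression produces a match. Note that the forward slash inside the pattern is escaped as well, because it would otherwise end the regex:
"www.website.com/contact.html" =~ /www\.website\.com\/contact\.html/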

Always remember that to match a literal period (such as a decimal point) in a pattern (regex), you have to escape it, that is, “\.”. An unescaped period matches any character except a newline.
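The effect of the backslash can be verified with the strings used above. This is a minimal sketch; "wwwXwebsiteYcom" is a made-up string, added only to show what an unescaped period lets through.

```perl
use strict;
use warnings;

my $sum = "3+3=6";

# Unescaped, '+' is a quantifier: the regex looks for one or more 3s
# followed by another 3, which "3+3=6" does not contain.
my $unescaped_plus = $sum =~ /3+3/  ? 1 : 0;   # 0

# Escaped, '\+' stands for the plus sign itself.
my $escaped_plus   = $sum =~ /3\+3/ ? 1 : 0;   # 1

# Periods and the delimiter '/' are escaped to match them literally.
my $url = "www.website.com/contact.html";
my $url_matches = $url =~ /www\.website\.com\/contact\.html/ ? 1 : 0;   # 1

# An unescaped '.' matches any character except a newline,
# so this made-up string matches too.
my $loose_match = "wwwXwebsiteYcom" =~ /www.website.com/ ? 1 : 0;       # 1

print "$unescaped_plus $escaped_plus $url_matches $loose_match\n";
```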

Combining Matching Features
You can combine matching features. We have seen some of these, such as in /[cbr]at/. This is another example:

    $year =~ /\d{2,4}/

The above checks that $year contains a run of at least 2 digits; the regex matches at most 4 of them.

Variable in Regex
In a pattern, you can have a variable in place of a sub string. Consider the following statement:

my $var = "dog";

The following statement matches:

"This is his dog by me." =~ /his $var by/

Here, the pattern /his dog by/ is the same as /his $var by/. In the latter pattern, “dog” has been replaced by $var.
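A small script makes the substitution visible. It is only a sketch; the second value, "cat", is made up to show that the pattern follows whatever $var holds at the moment of matching.

```perl
use strict;
use warnings;

my $sentence = "This is his dog by me.";
my $var      = "dog";

# $var is interpolated into the pattern before matching takes place.
my $with_variable = $sentence =~ /his $var by/ ? "match" : "no match";
my $with_literal  = $sentence =~ /his dog by/  ? "match" : "no match";

# Changing the variable changes the pattern ("cat" is a made-up value).
$var = "cat";
my $after_change = $sentence =~ /his $var by/ ? "match" : "no match";

print "variable: $with_variable\n";   # match
print "literal:  $with_literal\n";    # match
print "changed:  $after_change\n";    # no match
```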

Upper and Lower case in Regex
Letters
In a pattern, it is possible for you to have a lower case letter converted to an upper case letter, or an upper case letter converted to a lower case letter. You have to use the escape sequences \u and \l to do the work. Note that matching is case sensitive.

The Escape sequences \u and \l
The escape sequence \u converts the next lower case letter in a pattern into an upper case letter. The following expression produces a match.

         "This is Mr. Smith." =~ /is \umr/

In the available string, you have the upper case letter for M. In the regex (pattern), you have the lower case letter for M. The escape sequence \u changes m to M in the regex.

The escape sequence, \l converts the next upper case letter in a pattern into a lower case letter. The following expression produces a match.

         "The lady is here." =~ /\lLady/

In the available string, you have the lower case letter for L. In the regex (pattern), you have the uppercase letter for L. The escape sequence \l changes L to l in the regex.

If the next letter in the pattern is already in uppercase, the escape sequence \u in front of it has no effect. If the next letter is already in lowercase, the escape sequence \l in front of it has no effect.

If the next letter in the regex is inside a variable, \u and \l will still do their work. The following code produces a match.

    my $var = "perl";
    "This is Perl" =~ /\u$var/

In $var, the letter p is in lowercase; in the available string, it is in uppercase.
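The three examples above can be run together. The sketch also shows that \u and \l behave the same way in an ordinary double-quoted string, which is a convenient way to see what the pattern becomes after the conversion.

```perl
use strict;
use warnings;

my $var = "perl";

# \u upper-cases the next letter; \l lower-cases it.
my $mr_matches   = "This is Mr. Smith." =~ /is \umr/ ? 1 : 0;   # 1
my $lady_matches = "The lady is here."  =~ /\lLady/  ? 1 : 0;   # 1

# \u also applies to the first letter of an interpolated variable.
my $with_u    = "This is Perl" =~ /\u$var/ ? 1 : 0;   # 1
my $without_u = "This is Perl" =~ /$var/   ? 1 : 0;   # 0, matching is case sensitive

# The same escapes work in double-quoted strings.
my $upper_first = "\u$var";    # "Perl"
my $lower_first = "\lLady";    # "lady"

print "$mr_matches $lady_matches $with_u $without_u $upper_first $lower_first\n";
```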

Sub-strings
In a pattern, it is possible for you to have a sub-string in lower case converted to upper case, or a sub-string in upper case converted to lower case. You have to use the escape sequences \U and \L to do the work. Note here that we have U, not u, and L, not l. Here, we are dealing with sub-strings and not single letters as above. The upper case U and L are for sub-strings.

If you have \U or \L in a pattern, the conversion takes place till the end of the pattern. If you do not want the conversion to continue to the end of the pattern, put \E where you want the conversion to stop.

The following expression produces a match:

    "The boy IS BIG" =~ /\Uis big/

Here, \U converts the sub string “is big” to “IS BIG” in the pattern. “IS BIG” is in the available string. So matching occurs.

The following expression does not produce a match.

     "The boy IS BIG" =~ /\Uis\E big/

In the pattern, \U with \E converts only “is” to “IS”, while in the available string, we have “IS BIG”. Matching is case sensitive. So, no match occurs.

The following expression produces a match.

    "The boy IS big" =~ /\Uis\E big/

Here, in the pattern, \U with \E converts “is” to “IS”. In the available string, only the sub-string “IS” is in upper case, while “big” is in lower case. So matching occurs.

The use of \L to convert a sub string to lower case can be similarly explained. \L can work with \E as above.

\U, \L and \E can also work with variables in the pattern.
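The following sketch runs the three examples above and adds two of my own for illustration: one with \L, and one where \U and \E surround a variable. The string "the boy is big" and the value of $var are made up for the purpose.

```perl
use strict;
use warnings;

# \U converts everything up to \E, or up to the end of the pattern.
my $whole_upper  = "The boy IS BIG" =~ /\Uis big/    ? 1 : 0;   # 1
my $partial_miss = "The boy IS BIG" =~ /\Uis\E big/  ? 1 : 0;   # 0
my $partial_hit  = "The boy IS big" =~ /\Uis\E big/  ? 1 : 0;   # 1

# \L works the same way in the other direction (made-up example).
my $whole_lower  = "the boy is big" =~ /\LIS BIG/    ? 1 : 0;   # 1

# \U ... \E around a variable converts the interpolated value.
my $var = "big";
my $variable_upper = "The boy IS BIG" =~ /IS \U$var\E/ ? 1 : 0;   # 1

# In a double-quoted string the same escapes show what the pattern becomes.
my $converted = "\Uis\E big";   # "IS big"

print "$whole_upper $partial_miss $partial_hit $whole_lower $variable_upper $converted\n";
```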

Let us take a break here. We continue in the next part of the series.

Chrys

Broad Network
