Broad Network


Patterns in Perl Regular Expressions

Perl Regular Expressions – Part 2

Perl Course

Foreword: In this part of the series, I analyze patterns in Perl Regular Expressions.

By: Chrysanthus Date Published: 5 Oct 2015

Introduction

This is part 2 of my series, Perl Regular Expressions. In this part of the series, I analyze patterns in Perl Regular Expressions. You should have read the previous part of the series before reaching here, as this is a continuation.

Character Classes
The Square Brackets
A character class is a set of possible characters, of which one character would match at a particular point (character), in the subject string. Character classes are denoted by square brackets [...], with the set (class) of characters to possibly match, inside. Here are some examples:

Let your subject string be

                             “He has a cat.”

You may know that he has an animal, but it does not matter to you which animal he has. You will be satisfied if he has a cat, bat or a rat. Note that the words, “cat”, “bat” and “rat”, each has “at” but begins with a “c” or “b” or “r”. The regex to check this is,

                              /[bcr]at/

The following produces a match

                    "He has a cat. " =~ /[bcr]at/

Here, because of the square brackets, we interpret the regex as follows: the pattern matches any sub-string whose first character is ‘b’, ‘c’ or ‘r’, followed by “at”.

The square brackets denote a class of characters. However, it is any one character in the class (square brackets) that is to match, not all of them together. Here, the class is the group of letters ‘b’, ‘c’ and ‘r’; only one has to match, in conjunction with “at”.
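
The class above can be tried against a few subjects. This is a minimal sketch; the helper name has_pet and the extra sentences are made up for illustration and are not part of the original example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the subject contains "bat", "cat" or "rat".
sub has_pet {
    my ($subject) = @_;
    return $subject =~ /[bcr]at/ ? 1 : 0;
}

for my $subject ("He has a cat.", "He has a bat.", "He has a rat.", "He has a dog.") {
    printf "%-15s %s\n", $subject, has_pet($subject) ? "match" : "no match";
}
```

Only one character from the class is consumed at that position, so the last subject does not match.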

Range of Characters
The ‘-’ Character
There may come a time when you would want to match any occurrence of a digit from 0 to 9, or a lowercase character from ‘a’ to ‘z’, or an uppercase character from A to Z. These are ranges of characters, and for each range you would want to know if one character in the range exists in the subject (I will address the issue of multiple occurrences of a character of a range in the subject string later).

The ‘-’ character is used for this. So the range 0 to 9 is denoted by 0-9; ‘a’ to ‘z’ by a-z; and A to Z by A-Z.

The following code produces a match:

                     "ID5id" =~ /[0-9]/

Recall that the square brackets indicate that any character they contain should be tested for matching. A range of characters is a class, and so you have to use the square brackets in this situation. In this binding, a match occurs from 5 in the range 0 to 9, across to 5 in the subject string, “ID5id”.

The above expression is the same as

               "ID5id" =~ /[0123456789]/

Note the use of the square brackets. The following code will produce a match for a similar reason:

                     "ID5i" =~ /[a-z]/

A match occurs between ‘i’ in the range a-z and ‘i’, the only lowercase letter in our present subject. Matching is case sensitive.

Of course you can combine a range with other characters in the regex. The regex /ID[0-9]id/ will match “ID4id”, “ID5id”, “ID6id”; in fact any word beginning with ‘ID’ followed by a digit and then ‘id’. So

                "ID2id is an identifier" =~ /ID[0-9]id/

produces a match.

Note: the range format gives a short form of writing a class. It is any one element in the square brackets that is matched.
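
As a rough illustration of a range combined with ordinary characters, the sketch below wraps /ID[0-9]id/ in a helper. The name has_identifier and the second and third subjects are assumptions added for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the subject contains ID, exactly one digit, then id.
sub has_identifier {
    my ($subject) = @_;
    return $subject =~ /ID[0-9]id/ ? 1 : 0;
}

# The digit 2 is in the range 0-9.
print has_identifier("ID2id is an identifier") ? "match\n" : "no match\n";
# x is not in the range 0-9.
print has_identifier("IDxid is an identifier") ? "match\n" : "no match\n";
# Matching is case sensitive, so id2ID does not fit the pattern.
print has_identifier("id2ID is an identifier") ? "match\n" : "no match\n";
```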

Negation
Character ranges and some special regex characters can be negated.

Any character except a digit is written as

             [^0-9]

This refers to any character existing, that is not in the range 0-9. The following code produces a match:

                      "12P34" =~ /[^0-9]/

P is not in the range [0-9]; P is outside. Concerning all characters, P is in the range [^0-9]. Note the presence and absence of the ‘^’ character between the classes [0-9] and [^0-9], in this paragraph.

The special character used for negation is “^”.

The range outside [a-z] is [^a-z]. That is [^a-z] is the negation of [a-z].

The range outside [A-Z] is [^A-Z]. That is [^A-Z] is the negation of [A-Z].

We shall see other negations below.
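
A small sketch of a negated range follows. It assumes the special variable $&, which Perl sets to the text of the last successful match; the helper name first_non_digit is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first character that is not a digit,
# or undef if the subject consists of digits only.
sub first_non_digit {
    my ($subject) = @_;
    return $subject =~ /[^0-9]/ ? $& : undef;
}

for my $subject ("12P34", "1234") {
    my $found = first_non_digit($subject);
    print "$subject: ", (defined $found ? "found '$found'" : "digits only"), "\n";
}
```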

Abbreviations for Common Character Classes
\d
\d means, any digit, and it abbreviates [0-9]. The following code produces a match:

               "ID5id is an ID" =~ /ID\did/

Negated \d
\D is negated \d. It represents any character that is not a digit, that is [^0-9].

\s
A space, \t, \r, \n and \f are whitespace characters. The space (‘ ’) is produced when you press the spacebar of your keyboard. \t is produced when you press the Tab key. \r is the carriage return character, \n is the newline character and \f is the form feed character.

\s is the abbreviation for any whitespace character. For ASCII text, \s is equivalent to [ \t\r\n\f] (Perl 5.18 and later also treat the vertical tab as whitespace).

The following expression produces a match:

             "The first line.\r\nThe second line. " =~ /\n/

The following expression also produces a match:

             "The first line.\r\nThe second line. " =~ /\s/

\s is a class of whitespace characters.

Negated \s
\S
\S is negated \s. It represents any character that is not whitespace, that is [^\s].

\S, [^\s] and [^ \t\r\n\f] are equivalent.

The negation symbol negates the class (within the square brackets).

\w
This is a word character. It represents any alphanumeric character including the underscore. For ASCII text, \w and [0-9a-zA-Z_] are equivalent.

Negated \w
\W is negated \w. It represents any non-word character. \W and [^\w] are equivalent.

The Period ‘.’
The period ‘.’ matches any character except \n. For example, /.s/ matches 'is' in the subject string, "An apple is on the tree". /.s/ represents two characters, which are any character (except \n) followed by ‘s’.

You can use the \d, \s, \w, \D, \S, \W abbreviations both inside and outside of a character class.
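
The abbreviations can be compared side by side. In this sketch the helper classes_of is a made-up name, and the subject "v3.5" is an assumed example; $& is the Perl variable holding the text of the last successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: lists which abbreviations match a single character.
sub classes_of {
    my ($char) = @_;
    my @names;
    push @names, '\d' if $char =~ /\d/;
    push @names, '\s' if $char =~ /\s/;
    push @names, '\w' if $char =~ /\w/;
    push @names, '\W' if $char =~ /\W/;
    return join ' ', @names;
}

print "'5' matches: ", classes_of("5"), "\n";   # a digit is also a word character
print "'_' matches: ", classes_of("_"), "\n";   # the underscore is a word character
print "' ' matches: ", classes_of(" "), "\n";   # whitespace is a non-word character
print "'-' matches: ", classes_of("-"), "\n";

# An abbreviation used inside a class: a digit or a literal period.
print "class matched '$&'\n" if "v3.5" =~ /[\d.]/;

# The period outside a class: any character except \n, followed by s.
print "period matched '$&'\n" if "An apple is on the tree" =~ /.s/;
```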

Beginning and End of a String
The aim here is to see how you can match a regex to the beginning of the subject string or the end of the subject string (or both the beginning and the end).

The ^ Character for Matching at the Beginning
If you want the matching to occur at the beginning of the subject, start the regex with the ‘^’ character.

The following expression produces a match:

                 "one and two" =~ /^one/

The following expression does not produce a match:

                 "The one I saw" =~ /^one/

In the first case the word ‘one’ is at the beginning of the subject string. In the second case, the word ‘one’ is not at the beginning of the subject.

At this point, you may ask, “Is ‘^’ not the negation symbol?” Well it is the negation symbol. The problem is to know when to use it. When used inside a class (square brackets) it is the negation symbol; when used at the beginning of a regex, just after the forward slash, it is the regex character for matching at the beginning of the subject. It is an anchor metacharacter (reserved character).

The $ Character for Matching at End
If you want the matching to occur at the end of the subject, end the regex with the ‘$’ character.

The following expression produces a match:

                  "This is the last" =~ /last$/

The following expression does not produce a match:

                  "The last boy" =~ /last$/

In the first case the word ‘last’ is at the end of the subject. In the second case, the word ‘last’ is not at the end of the subject.

Note: $ actually matches the end of the subject, or just before a newline character at the end of the subject.

^ and $ are called anchor meta characters.
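
The two anchors can be checked with a short sketch. The helper names are hypothetical, and the subject ending in a newline is an added example to show the note about $ above.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two anchored patterns from the text.
sub starts_with_one {
    my ($subject) = @_;
    return $subject =~ /^one/ ? 1 : 0;
}

sub ends_with_last {
    my ($subject) = @_;
    return $subject =~ /last$/ ? 1 : 0;
}

print "starts: ", starts_with_one("one and two"), "\n";       # 'one' is at the beginning
print "starts: ", starts_with_one("The one I saw"), "\n";     # 'one' is not at the beginning
print "ends:   ", ends_with_last("This is the last"), "\n";   # 'last' is at the end
print "ends:   ", ends_with_last("The last boy"), "\n";       # 'last' is not at the end
print "ends:   ", ends_with_last("This is the last\n"), "\n"; # $ also matches before a final newline
```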

Matching the Whole String
Before we continue, note that the .* character combination (period followed by asterisk) in the pattern, matches any sub string including a sub string of zero length.

You can match the whole subject, using the ‘^’ with the ‘$’ characters. The following code produces a match:

                 "beginning and end" =~ /^be.*end$/

The following code also produces a match:

                "beginning with end" =~ /^be.*end$/

The subject of the first case is, “beginning and end”. The subject of the second case is “beginning with end”. The difference occurs in the word in the middle (and/with).

The regex pattern of both cases is the same. The pattern begins with ‘^’ and ends with ‘$’. The regex indicates that the subject to be matched has to begin with “be”, followed by any character, any number of times; and the subject has to end with “end”.

Note: All along, when we say match, we are actually searching the subject for a sub-string, represented by the pattern of the regex. However, when you are matching the whole subject, the regex represents the whole string.

So, you can now match a whole string. By the time you complete this series, you will be able to match a whole subject string having particular words within the string. I will not show you how to do that. It will be an exercise for you. You will simply need to combine many of the features I explain in the series.
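
To see that both anchors together constrain the whole subject, here is a sketch with two extra subjects that are assumptions added for the example; the helper name spans_be_to_end is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if the whole subject starts with "be" and ends with "end".
sub spans_be_to_end {
    my ($subject) = @_;
    return $subject =~ /^be.*end$/ ? 1 : 0;
}

for my $subject ("beginning and end", "beginning with end",
                 "the beginning and end", "beginning and ending") {
    printf "%-22s %s\n", $subject, spans_be_to_end($subject) ? "match" : "no match";
}
```

The third subject fails because it does not begin with “be”, and the fourth fails because it does not end with “end”.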

Matching Repetitions
In the subject, a character or a group of characters may repeat itself. I will talk about groups of characters, as a topic, later. For now, let us concentrate on single character repeating itself. There are quantifier metacharacters that allow us to match repetition of a single character or a group of characters, in the subject. These meta characters are: ?, * , + , and {}. They allow us to decide on the number of repeats we are looking for. Quantifiers are put immediately after the character, character class, or grouping (see later) in the regex. In the following list, I give you the meanings of quantifiers, where x refers to a particular character:

x*         :   match 'x' 0 or more times, i.e., any number of times

x+         :   match 'x' 1 or more times, i.e., at least once

x?         :   match 'x' 0 or 1 times

x{n,}      :   match 'x' at least n times; note the comma

x{n}       :   match 'x' exactly n times

x{n,m}     :   match 'x' at least n times, but not more than m times

Note: the letter ‘x’ above stands for any character of a text, e.g. ‘b’, ‘c’, ‘d’, ‘1’, ‘2’, etc. The quantifier is typed inside a pattern (regex).

Examples
*
Matches the preceding item 0 or more times. On its own, /o*/ can match zero characters, so against the subject string "A ghost booooed" it succeeds at the very start of the subject with a zero-length match, before it ever reaches an ‘o’. To give the regex more meaning you have to combine it with other characters. For example, /bo*/ matches 'boooo' in "A ghost booooed" and 'b' in "A bird warbled", but nothing in "A goat grunted", even though this last string has an ‘o’. The second regex is ‘b’ followed by ‘o’ zero or more times.

+
Matches the preceding item 1 or more times. Equivalent to {1,} – see below. /a+/ matches the 'a' in "candy" and all the a's in "caaaaaaandy".

?
Matches the preceding item 0 or 1 time. /e?le?/ matches the 'el' in "angel" and the 'le' in "angle.". /e?le?/ means, you have a word which has ‘l’ optionally preceded by ‘e’ and optionally followed by ‘e’. This means, it will also match, “lying”. By the time you finish this series, you will know how to modify the regex, to restrict it to match only “angel” or “angle”.

{n,}
Where n is a positive integer. This matches at least n occurrences of the preceding item.

For example, /a{2,}/ doesn't match the 'a' in "candy", but matches all of the a's in "caandy" and in "caaaaaaandy".

{n}
Where n is a positive integer. This matches exactly n occurrences of the preceding item. /a{2}/ doesn't match the 'a' in "candy," but it matches all of the a's in "caandy," and only the first two a's in "caaandy."

{n,m}
Where n and m are positive integers. This matches at least n and at most m occurrences of the preceding item.

For example, /a{1,3}/ matches nothing in "cndy", the 'a' in "candy," the first two a's in "caandy," and the first three a's in "caaaaaaandy". Notice that when matching "caaaaaaandy", the match is "aaa", even though the subject had more a's.
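
The quantifier examples above can be checked by printing what each pattern actually matched. This is a sketch only: the helper matched_text is a made-up name, the patterns are passed in as strings, and $& is the Perl variable holding the text of the last successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text matched by $pattern in $subject,
# or undef if there is no match. $pattern is interpolated into the regex.
sub matched_text {
    my ($subject, $pattern) = @_;
    return $subject =~ /$pattern/ ? $& : undef;
}

my @cases = (
    [ "A ghost booooed", 'o*'     ],   # zero-length match at the start of the subject
    [ "A ghost booooed", 'bo*'    ],
    [ "A bird warbled",  'bo*'    ],
    [ "A goat grunted",  'bo*'    ],
    [ "angel",           'e?le?'  ],
    [ "angle",           'e?le?'  ],
    [ "candy",           'a{2,}'  ],
    [ "caaandy",         'a{2}'   ],
    [ "caaaaandy",       'a{1,3}' ],
);

for my $case (@cases) {
    my ($subject, $pattern) = @$case;
    my $text = matched_text($subject, $pattern);
    printf "%-16s /%s/ => %s\n", $subject, $pattern,
        defined $text ? "'$text'" : "(no match)";
}
```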

The following code produces a match:

my $year = "2009";
$year =~ /\d{2,4}/

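As written, the pattern only checks that the subject contains a run of at least 2 digits; it would also match "20095", by matching the first four digits. To validate that the year is at least 2 digits and not more than 4 digits, and nothing else, anchor the pattern: /^\d{2,4}$/. You can try the match with the following program: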

use strict;
use warnings;

my $year = "2009";

# ^ and $ anchor the pattern, so the whole string must be 2 to 4 digits.
if ($year =~ /^\d{2,4}$/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

Matching Alternation
We can match different character strings with the alternation metacharacter '|'. To match ‘pig’ or ‘sheep’, we form the regex, /pig|sheep/. Perl will try to match the regex at the earliest possible point in the subject string. At each character position, Perl will first try to match the first alternative, ‘pig’. If ‘pig’ doesn't match, Perl will then try the next alternative, ‘sheep’. If ‘sheep’ doesn't match either, then Perl moves on to the next position and starts with the first alternative again.

Some examples:
The following produces a match:

            "pigs are a group of animals" =~ /pig|sheep|cow/

Here, ‘pig’ is matched. There is no ‘sheep’ or ‘cow’ in the subject string.

Note that in the subject, it is the set of letters, ‘p’,’i’, and ’g’ that is matched. It is not ‘pigs’ that is matched. There is no ‘s’ after “pig” in the regex. ‘pig’ is a sub-string among all the characters in the subject that is matched. Also note that it is not a word that is matched, but a sub-string (which consists of characters and may even be one character).

Note as well, that the space in the subject is a character, which could be a member of a sub string. What I have just said, applies to all other matching, not only alternations.

The following produces a match:

            "sheep are a group of animals" =~ /pig|sheep|cow/

Here, ‘sheep’ is matched. There is no ‘pig’ or ‘cow’ in the subject string. The search did not see ‘pig’, so it matched ‘sheep’.

The following produces a match:

            "cows are a group of animals" =~ /pig|sheep|cow/

Here, ‘cow’ is matched. There is no ‘pig’ or ‘sheep’ in the subject. The search did not see ‘pig’ or ‘sheep’, so it matched ‘cow’.

Now, in the following expression ‘pig’ and not ‘sheep’ is matched.

            "pigs and sheep are groups of animals" =~ /pig|sheep|cow/

This is because ‘pig’ appears in the subject before ‘sheep’, and ‘pig’ is also the first alternative in the regex.

In the following expression, still ‘pig’ and not ‘sheep’ is matched.

            "pigs and sheep are groups of animals" =~ /sheep|pig|cow/

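This is because, even though ‘sheep’ is the first alternative in the regex, ‘pig’ appears first in the subject. Perl always matches at the earliest possible position in the subject; the order of the alternatives only matters among alternatives that can match at the same position.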
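
The effect of position versus order of alternatives can be sketched as follows. The helper first_alternative is a hypothetical name, the patterns are passed as strings, and the last two patterns (pig|pigs and pigs|pig) are assumptions added to show a case where the order of alternatives does change the result.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text matched by $pattern in $subject, or undef.
sub first_alternative {
    my ($subject, $pattern) = @_;
    return $subject =~ /$pattern/ ? $& : undef;
}

my $subject = "pigs and sheep are groups of animals";

# The earliest position in the subject wins, whatever the order of alternatives.
for my $pattern ('pig|sheep|cow', 'sheep|pig|cow') {
    print "/$pattern/ matched '", first_alternative($subject, $pattern), "'\n";
}

# When two alternatives can match at the same position, the first one listed wins.
for my $pattern ('pig|pigs', 'pigs|pig') {
    print "/$pattern/ matched '", first_alternative($subject, $pattern), "'\n";
}
```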

Metacharacters
There are some characters that you cannot use in a regex directly. These characters simply have special meanings in the regex. Here they are:

    { } [ ] ( ) ^ $ . | * + ? \


They are called metacharacters.

A metacharacter can be matched by putting a backslash before it. The following examples illustrate this:

"3+3=6" =~ /3+3/                # doesn't match because '+' is a metacharacter
"3+3=6" =~ /3\+3/             # matches because '+' becomes an ordinary '+'

The following expression produces a match:

"www.website.com/contact.html" =~ /www\.website\.com\/contact\.html/

Always remember that a period (decimal point) meant as a literal character in a pattern (regex) has to be escaped, that is “\.”, unless it is inside a character class.

So, to use a metacharacter in a pattern (regex), escape it with a backslash. You also have to escape a forward slash character within a pattern, if the regex is delimited by forward slashes.
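
Escaping by hand can be compared with Perl's built-in quotemeta function, which returns its argument with the non-word characters backslashed. quotemeta is not covered in the text above, so treat this as a side sketch; the helper name contains_literal is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $subject contains $literal as plain text.
# quotemeta backslashes the metacharacters before the text is used in the regex.
sub contains_literal {
    my ($subject, $literal) = @_;
    my $escaped = quotemeta $literal;
    return $subject =~ /$escaped/ ? 1 : 0;
}

print "unescaped: ", ("3+3=6" =~ /3+3/  ? "match" : "no match"), "\n";
print "escaped:   ", ("3+3=6" =~ /3\+3/ ? "match" : "no match"), "\n";
print "quotemeta: ", quotemeta("3+3"), "\n";
print "helper:    ", (contains_literal("3+3=6", "3+3") ? "match" : "no match"), "\n";
```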

Combining Matching Features
You can combine matching features. You should have seen some of these, such as in /[cbr]at/. Here is another example:

    $year =~ /\d{2,4}/

This checks that $year contains a run of at least 2 digits. To require that the whole string is at least 2 but not more than 4 digits, add the anchors: /^\d{2,4}$/.

Variable in Regex
In a pattern, you can have a variable in place of a sub string. Consider the following statement:

my $var = "dog";

The following statement matches:

"This is his dog near me." =~ /his $var near/

Here, the pattern, /his dog near/ is the same as /his $var near/. In the latter pattern, “dog” has been replaced by $var.
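
A variable in a pattern lets one regex serve several values. The sketch below assumes a made-up helper, near_me, and an extra animal name that is not in the original example.

```perl
use strict;
use warnings;

# Hypothetical helper: the value of $animal is interpolated into the pattern.
sub near_me {
    my ($subject, $animal) = @_;
    return $subject =~ /his $animal near/ ? 1 : 0;
}

my $subject = "This is his dog near me.";

for my $animal ("dog", "cat") {
    print "$animal: ", near_me($subject, $animal) ? "match" : "no match", "\n";
}
```

Note that the content of the variable is read as part of the regex, so any metacharacters it holds keep their special meaning.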

Upper and Lower case in Regex
Letters
In a pattern, it is possible for you to have a lower case letter converted to an upper case letter or have an upper case letter, converted to a lower case letter. You have to use the escape sequences \u and \l to do the work. Note that matching is case sensitive.

The Escape sequences \u and \l
The escape sequence, \u converts the next lower case letter in a pattern into an upper case letter. The following expression produces a match.

         "This is Mr. Smith." =~ /is \umr/

In the subject, you have the upper case letter for M. In the regex (pattern), you have the lower case letter for M. The escape sequence \u changes m to M in the regex.

The escape sequence, \l converts the next upper case letter in a pattern into a lower case letter. The following expression produces a match.

         "The lady is here." =~ /\lLady/

In the subject, you have the lower case letter for L. In the regex (pattern), you have the uppercase letter for L. The escape sequence \l changes L to l in the regex.

If the next letter in the pattern is already in uppercase, the escape sequence, \u in front of it has no effect. If the next letter is already in lowercase, the escape sequence, \l in front of it has no effect.

If the next letter in the regex is inside a variable, \u and \l will still do their work. The following code produces a match.

    my $var = "perl";
    "This is Perl" =~ /\u$var/

In $var, the ‘p’ is in lowercase; in the subject string, it is in uppercase.
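
The two single-letter escapes can be tried with variables as in this sketch. The helper names are hypothetical, and the subjects with the "wrong" case are added for contrast.

```perl
use strict;
use warnings;

# Hypothetical helpers: \u upper-cases, and \l lower-cases, the first letter of $word.
sub has_capitalized {
    my ($subject, $word) = @_;
    return $subject =~ /\u$word/ ? 1 : 0;
}

sub has_uncapitalized {
    my ($subject, $word) = @_;
    return $subject =~ /\l$word/ ? 1 : 0;
}

print has_capitalized("This is Perl", "perl")        ? "match\n" : "no match\n";
print has_capitalized("This is perl", "perl")        ? "match\n" : "no match\n";
print has_uncapitalized("The lady is here.", "Lady") ? "match\n" : "no match\n";
print has_uncapitalized("The Lady is here.", "Lady") ? "match\n" : "no match\n";
```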

Sub-strings
In a pattern, it is possible for you to have a sub-string in lower case converted to upper case or have a sub-string in upper case, converted to lower case. You have to use the escape sequences \U and \L to do the work. Note here that we have U, not u and L, not l. Here, we are dealing with sub-strings and not single letters as above. The upper case U and L are for sub-strings.

If you have \U or \L in a pattern, the conversion takes place till the end of the pattern. If you do not want the conversion to take place to the end of the pattern, put \E where you want the conversion to stop.

The following expression produces a match:

    "The boy IS BIG" =~ /\Uis big/

Here, \U converts the sub string “is big” to “IS BIG” in the pattern. “IS BIG” is in the subject string. So matching occurs.

The following expression does not produce a match.

     "The boy IS BIG" =~ /\Uis\E big/

In the pattern, \U with \E converts only “is” to “IS”, while in the subject string, we have “IS BIG”. Matching is case sensitive. So, no match occurs.

The following expression produces a match.

    "The boy IS big" =~ /\Uis\E big/

Here, in the pattern, \U with \E converts only “is” to “IS”, leaving “ big” in lower case. The subject contains “IS big”, so matching occurs.

The use of \L to convert a sub string to lower case can be similarly explained. \L can work with \E as well.

\U and \E can also work with variables in the pattern.
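
A sketch of \U, \L and \E applied to variables follows. The helper names and the lower case subject are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helpers: convert the interpolated text before matching.
sub has_upper {
    my ($subject, $phrase) = @_;
    return $subject =~ /\U$phrase\E/ ? 1 : 0;
}

sub has_lower {
    my ($subject, $phrase) = @_;
    return $subject =~ /\L$phrase\E/ ? 1 : 0;
}

# Only $first is upper-cased; \E stops the conversion before $rest.
sub has_mixed {
    my ($subject, $first, $rest) = @_;
    return $subject =~ /\U$first\E$rest/ ? 1 : 0;
}

print has_upper("The boy IS BIG", "is big")      ? "match\n" : "no match\n";
print has_lower("the boy is big", "IS BIG")      ? "match\n" : "no match\n";
print has_mixed("The boy IS big", "is", " big")  ? "match\n" : "no match\n";
print has_mixed("The boy IS BIG", "is", " big")  ? "match\n" : "no match\n";
```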

Well, at this point, we really must take a break. We continue in the next part of the series.

Chrys
