Broad Network


Regular Expression Patterns in Perl

Regular Expressions in Perl for the Novice – Part 2

Foreword: In this part of the series, we start analyzing patterns in Perl Regular Expressions.

By: Chrysanthus Date Published: 13 Aug 2012

Introduction

This is the second part of my series, Regular Expressions in Perl for the Novice. In this part of the series, we start analyzing patterns in Perl Regular Expressions.

Character Classes
The Square Brackets
A character class defines a set of possible characters, any one of which may match at a particular position in the available string. Character classes are denoted by square brackets [...], with the set (class) of candidate characters inside. Here are some examples:

Let your available string be

                             “He has a cat.”

You may know that he has an animal, but it does not matter to you which animal he has. You will be satisfied if he has a cat, bat or a rat. Note that the words, “cat”, “bat” and “rat”, each has “at” but begins with a “c” or “b” or “r”. The regex to check this is

                              /[bcr]at/

The following produces a match

                    "He has a cat. " =~ /[bcr]at/

Here, because of the square brackets, we interpret the regex as follows: the pattern matches any three-character sequence whose first character is a “b”, “c” or “r”, the rest of the characters being ‘at’.

The square brackets denote a class of elements. However, it is any one element in the class (square brackets) that is to be matched, not all of them together. Here, the class is the group of letters, ‘b’, ‘c’ and ‘r’; only one has to match in conjunction with “at”.
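
As a minimal sketch, the class above can be tried on several strings at once. Only the first string comes from the text; the other strings and the names @sentences and @matched are made up for illustration.

```perl
use strict;
use warnings;

# Illustrative strings; only "He has a cat." appears in the article.
my @sentences = (
    "He has a cat.",
    "He has a bat.",
    "He has a rat.",
    "He has a hat.",
);

# Keep the strings in which one of b, c or r is immediately followed by "at".
my @matched = grep { /[bcr]at/ } @sentences;

# "He has a hat." is left out: 'h' is not in the class.
print "$_\n" for @matched;
```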

Range of Characters
The ‘-‘ Character
There may come a time when you want to match any occurrence of a digit from 0 to 9, or a lowercase character from ‘a’ to ‘z’, or an uppercase character from A to Z. These are ranges of characters, and for each range you want to know whether one character in the range exists in the available string (I will address the issue of multiple occurrences of a character of a range in the available string later).

The ‘-‘ Character is used for this. So the range 0 to 9 is denoted by 0-9; ‘a’ to ‘z’ by a-z; and A to Z by A-Z.

The following code produces a match:

                      "ID5id" =~ /[0-9]/

Recall that the square brackets indicate that any one of the elements they contain may be tested for matching. A range of characters is a class (see above), and so you have to use the square brackets, as in the above expression. In the above case, a match occurs between 5 in the range 0 to 9 and 5 in the available string, “ID5id”.

The above expression is the same as

               "ID5id" =~ /[0123456789]/

Note the use of the square brackets. The following code will produce a match for a similar reason:

                       "ID5i" =~ /[a-z]/

A match occurs between ‘i’ in the range a-z and ‘i’, the only lowercase letter in our present available string. Matching is case sensitive.

Of course you can combine a range with other characters in the regex. The regex /ID[0-9]id/ will match “ID4id”, “ID5id”, “ID6id”; in fact any word beginning with ‘ID’ followed by a digit and then ‘id’. So

                "ID2id is an ID" =~ /ID[0-9]id/

produces a match.

Note: the range format gives a short form of writing a class. It is any one element in the square brackets that is matched.
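
The range examples above can be collected into one short script. This is only a sketch; the variable names are illustrative, and the last string ("id2id") is an assumption added to show case sensitivity.

```perl
use strict;
use warnings;

my $id = "ID5id";

# Each test yields 1 on a match and 0 otherwise.
my $has_digit  = $id =~ /[0-9]/          ? 1 : 0;  # the 5 is in 0-9
my $same_class = $id =~ /[0123456789]/   ? 1 : 0;  # long form of the same class
my $has_lower  = "ID5i" =~ /[a-z]/       ? 1 : 0;  # the i is in a-z
my $combined   = "ID2id is an ID" =~ /ID[0-9]id/ ? 1 : 0;

# Matching is case sensitive: lowercase "id" does not satisfy "ID".
my $wrong_case = "id2id" =~ /ID[0-9]id/  ? 1 : 0;

print "digit=$has_digit long=$same_class lower=$has_lower ",
      "combined=$combined wrong_case=$wrong_case\n";
```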

Negation
Character ranges and some special regex characters can be negated.

Any character except a digit is written as

             [^0-9]

This matches any single character that is not in the range 0-9. The following code produces a match:

                      "12P34" =~ /[^0-9]/

P is not in the range [0-9]; P is outside it, and so P is matched by [^0-9]. Note that the only difference between the classes [0-9] and [^0-9] in this paragraph is the ‘^’ character.

The special character used for negation is “^”, placed as the first character inside the square brackets.

The range outside [a-z] is [^a-z]. That is [^a-z] is the negation of [a-z].

The range outside [A-Z] is [^A-Z]. That is [^A-Z] is the negation of [A-Z].

We shall see other negations below.
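
A small sketch of the negated class follows. The strings "1234" and the empty string are assumptions added for contrast, and %result is an illustrative name. Note that a negated class still needs one character to match, so it fails on an empty string.

```perl
use strict;
use warnings;

my %result;
for my $s ("12P34", "1234", "") {
    # [^0-9] needs one character that is not a digit.
    $result{$s} = $s =~ /[^0-9]/ ? "match" : "no match";
    print "'$s' => $result{$s}\n";
}
```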

Abbreviations for Common Character Classes
\d
\d means any digit; it abbreviates [0-9]. The following code produces a match:

               "ID5id is an ID" =~ /ID\did/

Negated \d
\D is negated \d. It represents any character that is not a digit, that is [^0-9].

\s
The space character, \t, \r, \n and \f are white space characters. ‘\ ‘ or simply ‘ ‘ is produced when you press the spacebar of your keyboard. \t is produced when you press the tab key on your keyboard. \r is the carriage return character. \n is the new line character and \f is the form feed character.

\s is the abbreviation for any white space character. That is, for these characters, \s is equivalent to [ \t\r\n\f]. (From Perl 5.18 onward, \s also matches the vertical tab.)

The following expression produces a match:

             "The first line.\r\nThe second line. " =~ /\n/

The following expression also produces a match:

             "The first line.\r\nThe second line. " =~ /\s/

\s is a class of white space characters.

Negated \s
\S
\S is negated \s. It represents any character that is not a white space, that is [^\s].

\S, [^\s] and [^ \t\r\n\f] are equivalent.

The negation symbol negates the class (within the square brackets).

\w
\w is the abbreviation for a word character. It represents any alphanumeric character or the underscore. \w and [0-9a-zA-Z_] are equivalent.

Negated \w
\W is negated \w. It represents any non-word character. \W and [^\w] are equivalent.

The Period ‘.’
The period ‘.’ matches any character except \n. For example, /.s/ matches 'is' in the available string, "An apple is on the tree". /.s/ represents two characters, which are any character (except \n) followed by ‘s’.

You can use the \d\s\w\D\S\W abbreviations both inside and outside of character classes.
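
The abbreviations and the period can be tried together as follows. This is a sketch only: the variable names and the strings "_", "a-b", "a b" and "\ns" are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $id = "ID5id is an ID";

my $digit     = $id =~ /ID\did/   ? 1 : 0;  # \d stands for the 5
my $non_digit = "12P34" =~ /\D/   ? 1 : 0;  # \D stands for the P
my $space     = $id =~ /\s/       ? 1 : 0;  # the space after "ID5id"
my $word      = "_" =~ /\w/       ? 1 : 0;  # underscore is a word character
my $non_word  = "a-b" =~ /\W/     ? 1 : 0;  # hyphen is not a word character

# Abbreviations inside a class: a digit or a white space character.
my $in_class  = "a b" =~ /[\d\s]/ ? 1 : 0;  # the space is in the class
my $no_class  = "a-b" =~ /[\d\s]/ ? 1 : 0;  # nothing in the class is present

# The period: any character except \n, followed by 's'.
my $dot       = "An apple is on the tree" =~ /.s/ ? 1 : 0;
my $dot_nl    = "\ns" =~ /.s/     ? 1 : 0;  # only \n precedes the 's'

print "$digit $non_digit $space $word $non_word ",
      "$in_class $no_class $dot $dot_nl\n";
```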

Beginning and End of a String
The aim here is to see how you can match a regex to the beginning of the available string or the end of the available string (or both the beginning and the end).

The ^ Character for Matching at the Beginning
If you want the matching to occur at the beginning of the available string, start the regex with the ‘^’ character.

The following expression produces a match:

                 "one and two" =~ /^one/

The following expression does not produce a match:

                 "The one I saw" =~ /^one/

In the first case the word ‘one’ is at the beginning of the available string. In the second case, the word ‘one’ is not at the beginning of the available string.

At this point, you may ask, “Is ‘^’ not the negation symbol?” Well, it is, but its meaning depends on where it appears. When used as the first character inside a class (square brackets) it is the negation symbol; when used at the beginning of a regex, just after the forward slash, it is the regex character for matching at the beginning of the available string. In that position it is an anchor metacharacter.

The $ Character for Matching at End
If you want the matching to occur at the end of the available string, end the regex with the ‘$’ character.

The following expression produces a match:

                  "This is the last" =~ /last$/

The following expression does not produce a match:

                  "The last boy" =~ /last$/

In the first case the word ‘last’ is at the end of the available string. In the second case, the word ‘last’ is not at the end of the available string.

Note: $ actually matches the end of the available string, or just before a newline character at the end of the available string.

^ and $ are called anchor metacharacters.
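
The anchor examples above, plus the newline case from the note, can be sketched as one script. The variable names are illustrative, and the string ending in "\n" is an assumption added to show the note in action.

```perl
use strict;
use warnings;

my $starts    = "one and two" =~ /^one/        ? 1 : 0;  # 'one' is at the beginning
my $not_start = "The one I saw" =~ /^one/      ? 1 : 0;  # 'one' is in the middle
my $ends      = "This is the last" =~ /last$/  ? 1 : 0;  # 'last' is at the end
my $not_end   = "The last boy" =~ /last$/      ? 1 : 0;  # 'last' is in the middle

# $ also matches just before a newline that ends the string.
my $before_nl = "This is the last\n" =~ /last$/ ? 1 : 0;

print "$starts $not_start $ends $not_end $before_nl\n";
```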

Matching the Whole String
Now, note that the .* combination (period followed by asterisk) in a pattern matches any sub string that does not contain \n, including a sub string of zero length.

You can match the whole available string, using the ‘^’ with the ‘$’ characters. The following code produces a match:

                 "beginning and end" =~ /^be.*end$/

The following code also produces a match:

                 "beginning with end" =~ /^be.*end$/

The available string of the first case is, “beginning and end”. The available string of the second case is “beginning with end”. The difference occurs in the word in the middle (and/with).

The regex pattern of both cases is the same. The pattern begins with ‘^’ and ends with ‘$’. The regex indicates that the available string to be matched has to begin with “be”, followed by any character (except \n), any number of times; and the available string has to end with “end”.

Note: All along, when we say match, we are actually searching the available string for a sub-string, represented by the pattern of the regex. However, when you are matching the whole available string, the regex represents the whole string.
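
A minimal sketch of whole-string matching follows. The helper name whole_match and the last three test strings are assumptions made for illustration; "beend" is an artificial string that shows .* matching a sub string of zero length.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the whole string fits the pattern, else 0.
sub whole_match {
    my ($string) = @_;
    return $string =~ /^be.*end$/ ? 1 : 0;
}

for my $s ("beginning and end",      # match
           "beginning with end",     # match
           "beend",                  # match: .* matches nothing
           "the beginning and end",  # no match: does not begin with "be"
           "beginning and ending") { # no match: does not end with "end"
    print "'$s' => ", whole_match($s), "\n";
}
```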

So, you can now match a whole string. By the time you complete this series, you will be able to match a whole available string having particular words within the string. I will not show you how to do that. It will be an exercise for you. You will simply need to combine many of the features I explain in the series.

Wow, we have done a lot so far, but there are still many things to learn. Regular expressions may be new to you, so we shall continue to take it step by step.

This is a good place to take a break. We continue in the next part of the series.

Chrys
