Building a Regular Expression in Perl

Regular Expressions in Perl for the Novice – Part 6

Foreword: Many of the examples we have come across are simple examples. In this section we look at two examples that are more demanding.

By: Chrysanthus Date Published: 13 Aug 2012

Introduction

This is the sixth part of my series, Regular Expressions in Perl for the Novice. Many of the examples we have come across are simple examples. In this section we look at two examples that are more demanding. Before we leave this part of the series, we talk about what is called Backtracking, and then we look again at the x modifier.

Steps required to build a Regex
These are the steps required to build a regex:

- Specify the task in detail,

- Break down the problem into smaller parts,

- Translate the small parts into regexes,

- Combine the regexes,

- Optimize the final combined regex.

Two Examples
Example 1
Hexadecimal Color Code Check
Specifying the Task in Detail
An example of a hexadecimal color code is #4C8. Another example is #44CC88.
- It begins with a hash, followed by either 3 hexadecimal digits or 6 hexadecimal digits.
- Hexadecimal digits are: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F.
- The hexadecimal letters can be in lower or upper case.

Breaking Down the Problem into Smaller Parts
- It begins with a #.
- It is followed by 3 hexadecimal digits or
- 6 hexadecimal digits
- There is no character after the 3 or 6 hexadecimal digits.

Translating into regexes
There are four small parts above. The first part gives the regex:

                  /^#/

The second part gives the regex:

                 /[0-9a-fA-F]{3}/

The third part gives the regex:

                 /[0-9a-fA-F]{6}/

The last part gives the regex:

              /$/

Combining the Regexes
This is the combined regex:

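               /^#(([0-9a-fA-F]{3}$)|([0-9a-fA-F]{6}$))/

Note the alternation metacharacter, |, for the three or six hexadecimal digits. Also note the parentheses: the inner pairs separate the two alternatives, and the outer pair keeps the alternation inside the part that follows ^#. Without the outer pair, the | would split the whole regex in two, and the second alternative would match any string that ends with six hexadecimal digits, with or without a leading #.

Optimizing the Combined Regex
This means shortening the combined regex. Note that 0-9 can be abbreviated to \d. So in the combined regex, we change the two occurrences of 0-9 to \d. (In Unicode strings \d also matches digits from other scripts; this makes no difference for plain ASCII input.) There are three groups; we reduce these to one group by removing the two inner pairs of brackets. The optimized regex is:

               /^#([\da-fA-F]{3}$|[\da-fA-F]{6}$)/

This expression is shorter than the above by six characters.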

The following code illustrates use of the regex:

use strict;
use warnings;

# In list context the match returns the text captured by the group.
my ($matchedString) = "#44CC88" =~ /^#([\da-fA-F]{3}$|[\da-fA-F]{6}$)/;

print "$matchedString\n";

The output is:

44CC88
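
To see how the optimized regex behaves on more than one input, the following sketch wraps it in a small helper and runs it over a few sample strings. The sub name is_hex_color and the sample strings are made up for this illustration; they are not part of the example above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $s is a 3- or 6-digit hexadecimal
# color code, and 0 otherwise.
sub is_hex_color {
    my ($s) = @_;
    return $s =~ /^#([\da-fA-F]{3}$|[\da-fA-F]{6}$)/ ? 1 : 0;
}

# Sample strings chosen for illustration only.
for my $s ('#4C8', '#44CC88', '#44CC8', '44CC88', '#44CG88') {
    printf "%-8s %s\n", $s, is_hex_color($s) ? 'valid' : 'invalid';
}
```

The first two strings are reported as valid and the other three as invalid: the third has five digits, the fourth has no hash, and the fifth contains G, which is not a hexadecimal digit. Note that $ also matches just before a trailing newline, so a string such as "#4C8\n" passes as well; if that matters, \z can be used in place of $.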

Example 2
User Name Check
Specifying the Task in Detail
Assume that we have a site where users have to log in. We can tell the user that the name should contain letters in lower or upper case and/or digits from 0 to 9 and/or the underscore, _. We also insist that the name must not be less than 3 characters or greater than 18 characters. In this example we have imposed the specification details.

Breaking Down the Problem into Smaller Parts
Name is made up of
- letters of the alphabet in lower or upper case, between 3 and 18 letters, inclusive, and/or
- digits from 0 to 9, between 3 and 18 digits, inclusive, and/or
- the underscore, between 3 and 18 underscores, inclusive. This means you can have up to 18 underscores for a name. Let us allow that for simplicity.
- We must limit the whole name to between 3 and 18 characters, inclusive.

Translating into regexes
The regex for the first part is:

               /^[a-zA-Z]{3,18}$/

The regex for the second part is:

               /^[0-9]{3,18}$/

The regex for the third part is:

               /^[_]{3,18}$/

Combining the Regexes
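In the breakdown section, the above three parts are combined with the phrase “and/or”. There is no direct way of doing this, so we have to deduce it: a name made of any mixture of these characters is a name whose every character belongs to one combined class. This is the combined regex: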

               /^[a-zA-Z0-9_]{3,18}$/

Optimizing the Combined Regex
This means shortening the combined regex. Note that the class [a-zA-Z0-9_] is abbreviated to \w. Since \w is already a class on its own, the square brackets are no longer needed. The optimized regex is:

               /^\w{3,18}$/
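
As a quick check of the user name regex, the sketch below tries it on a few names. It uses \w{3,18}, which matches the same strings as [\w]{3,18}. The sub name is_valid_username and the sample names are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $name consists of 3 to 18 word
# characters (letters, digits, underscore), and 0 otherwise.
sub is_valid_username {
    my ($name) = @_;
    return $name =~ /^\w{3,18}$/ ? 1 : 0;
}

# Sample names chosen for illustration only; the last one is 19 characters long.
for my $name ('bob', 'Mary_Jane_99', 'ab', 'john smith', 'x' x 19) {
    printf "%-20s %s\n", $name, is_valid_username($name) ? 'accepted' : 'rejected';
}
```

The first two names are accepted; the others are rejected for being too short, containing a space, and being too long. One caveat: on Unicode strings \w also matches letters and digits outside the ASCII range. If only a-z, A-Z, 0-9 and _ should be allowed, the /a modifier (Perl 5.14 and later) or the explicit class [a-zA-Z0-9_] can be used instead.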

Backtracking
We have seen how to match alternatives using the alternation metacharacter, |. When matching alternatives, Perl uses a process known as backtracking. I will illustrate this with an example. Consider the following expression:

"12345" =~ /(124|123)(46|4|45)/

I will explain backtracking by explaining the operation of the above expression. The following steps explain how Perl resolves the above expression.

A) It starts with the first number in the available string '1'.

B) It tries the first alternative in the first group '124'.

C) It matches '1' followed by '2'. That is all right.

D) It notices that '4' in the regex doesn't match '3' in the available string – that is a dead end. So it backtracks two characters in the available string and picks the second alternative in the first group '123'.

E) It matches '1' followed by '2' followed by '3'. The first group is satisfied.

F) It moves on to the second group and picks the first alternative '46'.

G) It matches the '4' in the available string.

H) However, '6' in the regex doesn't match '5' in the available string, so that is a dead end. It backtracks one character in the available string and picks the second alternative in the second group '4'.

I) '4' matches. The second grouping is satisfied.

J) We are at the end of the regex; we are done! We have matched '1234' out of the available string "12345".

There are two things to note about this process. First, the third alternative in the second group, '45', also allows a match, but the process stopped before it got to the third alternative: at a given character position, the leftmost alternative that allows the whole regex to match wins. Secondly, the process was able to get a match at the first character position of the available string, '1'. If there were no matches at the first position, Perl would move to the second character position, '2', and attempt the match all over again. Perl gives up and declares "12345" =~ /(124|123)(46|4|45)/ to be false only when all possible paths at all possible character positions have been exhausted.
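
The result of the steps above can be checked with a short script. This is only a sketch: the sub name match_parts is invented here, and the second regex, with the alternatives '4' and '45' swapped, is an addition for comparison and is not part of the walkthrough.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the whole match and the two captured
# groups, or an empty list if the regex does not match.
sub match_parts {
    my ($string, $regex) = @_;
    return $string =~ $regex ? ($&, $1, $2) : ();
}

# The regex from the walkthrough: '4' is tried before '45'.
print join(', ', match_parts("12345", qr/(124|123)(46|4|45)/)), "\n";

# Same alternatives, but with '45' listed before '4'.
print join(', ', match_parts("12345", qr/(124|123)(46|45|4)/)), "\n";
```

The first line printed is "1234, 123, 4", which agrees with step J. The second is "12345, 123, 45": with '45' placed before '4', the leftmost successful alternative is now the longer one, so the order of the alternatives decides what is matched.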

The x Modifier Details
This modifier is set by putting x (in lower case) just next to the second forward slash of the regex. That is:

                          /pattern/x

Whitespace data characters in the pattern are totally ignored, except when escaped or inside a character class, when this modifier is set. When this modifier is set, characters between an unescaped # outside a character class and the next newline character, inclusive, are also ignored. I will illustrate all this.

It says whitespace data characters in the pattern are totally ignored except when escaped or inside a character class. Consider the available string:

          $availableString = "I am a man sitting down.";

The following expression with the x modifier does not produce a match.

          $availableString =~ /man sitting down/x

This is because, with the x modifier present, the single spaces in the regex between “man” and “sitting” and between “sitting” and “down” are ignored, so the regex is effectively /mansittingdown/. If you remove the corresponding spaces in the available string you will have a match, with the x modifier. The following available string will produce a match with the above regex:

         $availableString = "I am a mansittingdown.";

If you want the original available string and regex to match, then you have to escape the spaces in the regex. The following expression produces a match with the original available string:

          $availableString =~ /man\ sitting\ down/x

An escaped single space is “\ ”.

Let us now talk about whitespace in a character class. Note that whitespace is actually [ \t\r\n\f], not just “ ”. However, let us continue our illustration using “ ”. We use the same available string, that is:

          $availableString = "I am a man sitting down.";

If we want to match the space in front of “sitting”, followed by “sitting”, with the x modifier, then our regex could be:

              /[ ]sitting/x

Note that the whitespace in the character class has not been escaped. That is, with the x modifier, whitespace inside a character class is matched as it stands, while whitespace outside a character class must be escaped in order to be matched. The following expression produces a match:

             $availableString =~ /[ ]sitting/x
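
The three cases discussed so far (unescaped spaces, escaped spaces, and a space inside a character class) can be compared side by side. The sketch below assumes the same available string; the labels and the @tests array are made up for this illustration.

```perl
use strict;
use warnings;

my $availableString = "I am a man sitting down.";

# Each entry is a label followed by a regex compiled with the x modifier.
my @tests = (
    [ 'unescaped spaces', qr/man sitting down/x   ],
    [ 'escaped spaces',   qr/man\ sitting\ down/x ],
    [ 'space in class',   qr/[ ]sitting/x         ],
);

for my $test (@tests) {
    my ($label, $regex) = @$test;
    printf "%-17s %s\n", $label,
        $availableString =~ $regex ? 'Matched' : 'Not Matched';
}
```

Only the first regex fails to match, because its spaces are ignored and it looks for “mansittingdown”.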

With the x modifier, any text between an unescaped # character outside a character class and the next newline character is ignored. The newline here is the real line break that you get by pressing the Enter key while typing the regex. If the regex has no line break after the #, the comment runs up to the closing delimiter. Consider the following code:

use strict;

my $availableString = "I am a man sitting down.";

if ($availableString =~ /man\ #Comment goes here
                                 sitting/x)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

The available string is:

    $availableString = "I am a man sitting down.";

The regex is:

     /man\ #Comment goes here
     sitting/x

Note the escaped space after “man”, the # character, and the newline character obtained after the word “here” by pressing the Enter key of the keyboard. The comment and the unescaped whitespace are ignored, so a match is produced. The sub string that is actually matched is “man sitting”.

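In the following code, the two characters \n are typed between “here” and “sitting” instead of pressing the Enter key. A match is also produced, but for a different reason: inside a comment, \n is not converted into a newline character, so the comment runs up to the closing delimiter and the regex that is actually used is /man\ /x.

use strict;

my $availableString = "I am a man sitting down.";

if ($availableString =~ /man\ #Comment goes here\nsitting/x)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

Notice that “sitting” is now part of the comment; the sub string that is actually matched is “man ”. To end a comment inside a regex, use a real line break.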

When the x modifier is set, you can add comments to your regex, which is especially useful when the regex is complex.
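
As an illustration, here is one way the hexadecimal color regex from Example 1 might be laid out with the x modifier. This is a sketch, not the only possible layout: the variable name $hexColor is invented here, the $ has been moved outside the group (which does not change what is matched), and the # of the color code has to be escaped, otherwise it would start a comment.

```perl
use strict;
use warnings;

my $hexColor = qr/
    ^ \#                 # a literal hash at the start of the string
    (                    # capture the digits
        [\da-fA-F]{3}    # three hexadecimal digits
      |                  # or
        [\da-fA-F]{6}    # six hexadecimal digits
    )
    $                    # nothing after the digits
/x;

# Sample strings chosen for illustration only.
for my $s ('#4C8', '#44CC88', '#44CC8') {
    print $s, ' ', ($s =~ $hexColor ? "Matched" : "Not Matched"), "\n";
}
```

The first two strings match and the third, with five digits, does not. The comments must not contain the delimiter character, here the forward slash, because Perl looks for the end of the regex before the comments are removed.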

Let us take a break here and continue in the next part of the series.

Chrys

Broad Network
