Perl Building of a Regular Expression

Perl Regular Expressions – Part 5

Perl Course

Foreword: Many of the examples we have come across so far are simple. In this tutorial we look at two examples that are more demanding. I then talk about what is called backtracking, and we see how to embed comments in a regex. I also talk about a few other things.

By: Chrysanthus Date Published: 5 Oct 2015

Introduction

This is part 5 of my series, Perl Regular Expressions. You should have read the previous parts of the series before coming here, as this is a continuation.

Steps required to build a Regex
These are the steps required to build a regex:

- Specify the task in detail,

- Break down the problem into smaller parts,

- Translate the small parts into regexes,

- Combine the regexes,

- Optimize the final combined regex.

Two Examples

Example 1
Hexadecimal Color Code Check

Specifying the Task in Detail
An example of a hexadecimal color code is #4C8. Another example is #44CC88.
- It begins with a hash sign, #, followed by either 3 hexadecimal digits or 6 hexadecimal digits.
- Hexadecimal digits are: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F.
- The hexadecimal letters can be in lower or upper case.

Breaking Down the Problem into Smaller Parts
- It begins with a #.
- It is followed by 3 hexadecimal digits Or
- 6 hexadecimal digits.
- There is no character after the 3 or 6 hexadecimal digits.

Translating into regexes
There are four small parts above. The first part gives the regex:

                  /^#/

The second part gives the regex:

                 /[0-9a-fA-F]{3}/

The third part gives the regex:

                 /[0-9a-fA-F]{6}/

The last part gives the regex:

              /$/

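Combining the Regexes
This is the combined regex:

               /^#(([0-9a-fA-F]{3})|([0-9a-fA-F]{6}))$/

Note the alternation metacharacter, |, between the three-digit and the six-digit alternatives. Also note the parentheses: each alternative is a group, and an outer pair encloses the alternation. The outer pair is essential. Without it, | would split the whole regex into /^#([0-9a-fA-F]{3})/ and /([0-9a-fA-F]{6})$/, so that a string such as "xyz44CC88" would match.

Optimizing the Combined Regex
This means shortening the combined regex. Within a character class, 0-9 can be abbreviated to \d, so we change the two occurrences of 0-9 to \d. (For strings containing non-ASCII characters, \d also matches digits from other scripts; keep 0-9 if that matters.) The two inner groups are not needed either, because the outer group already encloses the alternation; we drop them and keep one group. The optimized regex is:

               /^#([\da-fA-F]{3}|[\da-fA-F]{6})$/

This expression is shorter than the combined regex by six characters: two from the \d abbreviations and four from the inner parentheses.

The following code illustrates use of the regex:

use strict;

my ($matchedString) = "#44CC88" =~ /^#([\da-fA-F]{3}|[\da-fA-F]{6})$/;

print $matchedString;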

The output is:

44CC88
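
As a further check, here is a minimal sketch that wraps the 3-or-6 hexadecimal digit pattern in a subroutine and runs it against a few candidate strings. The subroutine name is_hex_color and the sample strings are made up for illustration; they are not part of any library.

```perl
use strict;
use warnings;

# is_hex_color is an illustrative name, not a library function.
# Returns 1 if the string is # followed by exactly 3 or 6 hex digits, else 0.
sub is_hex_color {
    my ($str) = @_;
    return $str =~ /^#([\da-fA-F]{3}|[\da-fA-F]{6})$/ ? 1 : 0;
}

# Two well-formed codes followed by four malformed ones.
for my $candidate ("#4C8", "#44CC88", "#44CC8", "44CC88", "#44CG88", "xyz44CC88") {
    printf "%-10s %s\n", $candidate, is_hex_color($candidate) ? "valid" : "invalid";
}
```

The first two strings are reported as valid and the other four as invalid. Note that "xyz44CC88" is rejected only because ^# applies to both alternatives.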

Example 2
User Name Check

Specifying the Task in Detail
Assume that we have a site where users have to log in. We can tell the user that the name should contain letters in lower or upper case and/or digits from 0 to 9 and/or the underscore, _. We also insist that the name must not be shorter than 3 characters or longer than 18 characters. In this example we have imposed the specification details ourselves.

Breaking Down the Problem into Smaller Parts
The name is made up of
- letters of the alphabet in lower or upper case, between 3 and 18 letters, inclusive, And/Or
- digits from 0 to 9, between 3 and 18 digits, inclusive, And/Or
- the underscore, between 3 and 18 characters, inclusive. This means you can have up to 18 underscores for a name. Let us allow that for simplicity.

Translating into regexes
The regex for the first part is:

               /^[a-zA-Z]{3,18}$/

The regex for the second part is:

               /^[0-9]{3,18}$/

The regex for the third part is:

               /^[_]{3,18}$/

Combining the Regexes
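In the breakdown section, the three parts above are joined by the phrase “And/Or”. There is no regex operator that does this directly, so we have to deduce the combination: since any character of the name may come from any of the three sets, we merge the three classes into one. This is the combined regex: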

               /^[a-zA-Z0-9_]{3,18}$/

Optimizing the Combined Regex
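This means shortening the combined regex. Note that the class [a-zA-Z0-9_] is abbreviated to \w. Since \w is already a character class, the square brackets are no longer needed. (For strings containing non-ASCII characters, \w also matches letters and digits from other scripts; keep [a-zA-Z0-9_] if that matters.) The optimized regex is:

               /^\w{3,18}$/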
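
The following is a minimal sketch of how the user name check might be used. The subroutine name is_valid_username and the sample names are assumptions made for illustration only.

```perl
use strict;
use warnings;

# is_valid_username is an illustrative name, not a library function.
# Accepts 3 to 18 characters drawn from letters, digits and the underscore.
sub is_valid_username {
    my ($name) = @_;
    return $name =~ /^\w{3,18}$/ ? 1 : 0;
}

# Three acceptable names, then one too short, one with a space, one too long.
for my $name ("bob", "Mary_Jane99", "___", "ab", "john smith", "a" x 19) {
    printf "%-20s %s\n", $name, is_valid_username($name) ? "accepted" : "rejected";
}
```

The first three names are accepted; the last three are rejected for being too short, for containing a space, and for being too long, respectively.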

Backtracking
You should have seen how to match alternatives using the alternation metacharacter, |. When matching alternatives, Perl uses a process known as backtracking. I will illustrate this with an example. Consider the following expression:

"12345" =~ /(124|123)(46|4|45)/

I will explain backtracking by explaining the operation of the above expression. The following steps explain how Perl resolves the above expression.

A) It starts with the first number in the subject string '1'.

B) It tries the first alternative in the first group '124'.

C) It matches ‘1’ followed by ‘2’. That is all right so far.

D) It notices that '4' in the regex doesn't match '3' in the subject – that is a dead end. So it backtracks two characters in the subject and picks the second alternative in the first group '123'.

E) It matches '1' followed by '2' followed by '3'. The first group is satisfied.

F) It moves on to the second group and picks the first alternative '46'.

G) It matches the '4' in the subject.

H) However, '6' in the regex doesn't match '5' in the subject, so that is a dead end. It backtracks one character in the subject and picks the second alternative in the second group '4'.

I) '4' matches. The second group is satisfied.

J) We are at the end of the regex; we are done! We have matched '1234' out of the subject string "12345".

There are two things to note about this process. First, the third alternative in the second group, '45', also allows a match, but the process stopped before it got to the third alternative: at a given character position in the subject, the first alternative (reading from the left) that allows the whole regex to match wins. Secondly, the process was able to get a match at the first character position of the subject string, '1'. If there were no match at the first position, Perl would move to the second character position, '2', and attempt the match all over again. Perl gives up and declares "12345" =~ /(124|123)(46|4|45)/ to be false only when all possible paths at all possible character positions have been exhausted.
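
The outcome of the walk-through can be checked with a short script. This is only a sketch: the variable names are my own, and the second regex, with '45' moved ahead of '4', is an extra case added to show that the order of the alternatives decides the result.

```perl
use strict;
use warnings;

my ($whole, $first, $second, $reordered);

# The regex from the walk-through: '4' comes before '45' in the second group.
if ("12345" =~ /(124|123)(46|4|45)/) {
    ($whole, $first, $second) = ($&, $1, $2);
    print "Whole match:  $whole\n";     # 1234
    print "First group:  $first\n";     # 123
    print "Second group: $second\n";    # 4
}

# Same alternatives, but '45' is now tried before '4'.
if ("12345" =~ /(124|123)(46|45|4)/) {
    $reordered = $&;
    print "Reordered:    $reordered\n"; # 12345
}
```

With the original order the match is '1234'; with '45' placed before '4' the match becomes '12345', even though the subject string is the same.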

Embedding Comments and Modifiers in a Regular Expression

Embedding Comments
The construct to use to embed a comment is:

               (?#Comment)

You start with ‘(?#’, type your comment, and end with ‘)’. As an example, the word “Internet” normally starts with ‘I’ in upper case. The regex,

         /The (?i)INTERNET/

where the embedded (?i) makes “INTERNET” case insensitive, can be commented as follows:

     /The (?i)(?# for the rest of the regex)INTERNET(?# I alone for Internet should be in uppercase)/

There are two comments in the regex. (?i) is not a comment; it is an embedded modifier. A comment is ignored in the regex. The following code produces a match.

use strict;

if ("The Internet" =~ /The (?i)(?# for the rest of the regex)INTERNET(?# I alone for Internet should be in uppercase)/)
  {
    print "Matched";
  }
else
  {
    print "Not Matched";
  }

The embedded (?i) makes all the regex text to its right case insensitive.

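Using the “(?#Comment)” construct is good when your regex and comments are on one line. If you want your regex and its comments to be on more than one line, then you should use the x modifier. With the x modifier, whitespace in the regex is ignored, and # begins a comment that runs to the end of the line. Any space that must be matched literally therefore has to be escaped with a backslash, as follows:

"The Internet" =~ /The (?i)# the first part of the regex
\                              INTERNET# I alone for Internet should be in uppercase
                             /x

Note that the normal # comment symbol has been used at the end of a line, instead of the embedded “(?#Comment)”. Also note the \ at the start of the second line: it escapes the single space that follows it, and that escaped space is what matches the space between “The” and “Internet”. The unescaped space after “The” on the first line and the remaining whitespace in front of INTERNET are ignored. The following code illustrates the use: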

use strict;

if ("The Internet" =~ /The (?i)# the first part of the regex
\                              INTERNET# I alone for Internet should be in uppercase
                             /x)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

Matching has occurred. With the x modifier, # begins a comment.

Note that the “(?#Comment)” construct cannot be nested: the comment ends at the first closing parenthesis, so you cannot have “(?#Comment(?#Comment))” in a regex. For the same reason, the comment text cannot contain a literal ‘)’.
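
The x modifier is most useful on longer patterns. As a sketch, here is the hexadecimal color regex from Example 1 laid out over several lines with comments. The variable name $hex_color and the sample strings are made up for illustration. Note that the # of the color code has to be escaped, otherwise it would start a comment.

```perl
use strict;
use warnings;

# $hex_color is an illustrative name. The pattern is compiled once with qr.
my $hex_color = qr/
    ^\#                # a literal hash sign, escaped so it is not a comment
    (                  # capture the digits
        [\da-fA-F]{3}  # three hexadecimal digits
      |                # or
        [\da-fA-F]{6}  # six hexadecimal digits
    )
    $                  # nothing after the digits
/x;

for my $candidate ("#4C8", "#44CC88", "#12345") {
    my ($digits) = $candidate =~ $hex_color;
    printf "%-8s %s\n", $candidate, defined $digits ? "digits: $digits" : "no match";
}
```

The first two strings match and their digits are captured; "#12345" has five digits and does not match.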

Non-capturing Groups
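A group is a part of a regex enclosed in parentheses. By default, the text matched by each group is captured: it is stored in the special variables $1, $2, and so on, and when the match is evaluated in list context, the captured texts are returned as a list. Consider the following code: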

use strict;

my @arr = ("This is one and that is two." =~ /(one).*(two)/);

print $arr[0], "\n";

print $arr[1], "\n";

The output is:

one
two

The output consists of the words, “one” and “two”. These are elements captured and stored in the array, @arr.

You may not want to capture every group. If you do not want to capture a group, precede the content of the group with “?:”. To prevent the group “(one)” from being captured, you need “(?:one)” for the group. The group still remains valid in the pattern with its other grouping advantage, but it is not captured. The following code illustrates this:

use strict;

my @arr = ("This is one and that is two." =~ /(?:one).*(two)/);

print $arr[0], "\n";   # "two" is now the first and only element

The output of the code is:

two

We prevented the first group, “(one)” from being captured by transforming it into, “(?:one)”. From the output, we see that “one” in the subject has not been captured, as we anticipated. “two” has been captured, and it is the only element in the array.

To make a group non-capturing, use the following syntax:

               (?:groupContent)
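
The following sketch counts the captures with and without the non-capturing group, and then shows a non-capturing group used purely for grouping under a quantifier. The variable names and the "abababc" string are made up for illustration.

```perl
use strict;
use warnings;

my $subject = "This is one and that is two.";

# Two capturing groups give two elements; one non-capturing group gives one.
my @with    = $subject =~ /(one).*(two)/;
my @without = $subject =~ /(?:one).*(two)/;

print scalar(@with),    " captured: @with\n";     # 2 captured: one two
print scalar(@without), " captured: @without\n";  # 1 captured: two

# (?:ab)+ repeats "ab" as a unit without using up a capture slot,
# so the first captured text is "c".
my ($tail) = "abababc" =~ /^(?:ab)+(c)$/;
print "tail: $tail\n";                            # tail: c
```

Grouping without capturing keeps the numbering of the groups you do care about simple.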

Including Modifiers in Non-Capturing Groups
You should have seen how you could embed modifiers in a regex. You may want to include a modifier in a non-capturing group. There are two ways of doing this. Let us say you want to include the modifier, i in the non-capturing sub group “(?:one)” above. You can do it like this:

          (?:(?i)one)

The following expression produces a match:

         "This is ONE and that is two." =~ /(?:(?i)one).*(two)/

You can use the following code to test:

use strict;

if ("This is ONE and that is two." =~ /(?:(?i)one).*(two)/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

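The syntax is:

    (?:(?modifier)groupContent)

The second way is shorter: place the modifier between the ? and the : of the non-capturing group, as in (?i:one). The syntax is:

    (?modifier:groupContent)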
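
Perl also accepts the modifier between the ? and the : of a non-capturing group, as in (?i:one). The following sketch checks that this shorter form and (?:(?i)one) behave alike on the subject string used above. The array names are made up for illustration.

```perl
use strict;
use warnings;

my $subject = "This is ONE and that is two.";

# Modifier embedded inside a non-capturing group.
my @long  = $subject =~ /(?:(?i)one).*(two)/;

# Modifier written between ? and : of the non-capturing group.
my @short = $subject =~ /(?i:one).*(two)/;

print "long form:  @long\n";    # long form:  two
print "short form: @short\n";   # short form: two
```

Both forms match “ONE” case-insensitively and capture only “two”.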

Modifiers in Groups
A modifier embedded in a regex, outside any group, has its effect from that point to the end of the regex. The question you may have is this: “If the modifier is in a group, does it have its effect only in the group or in the whole regex?”

Let us just write four short scripts to verify that. This is the first:

use strict;

if ("This is ONE and that is two." =~ /((?i)one).*(two)/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

The regex is “/((?i)one).*(two)/”. Note that in the subject string, “ONE” is in upper case. Matching occurs in the code. Here we are dealing with a capturing group. Consider the following code still with a capturing group:

use strict;

if ("This is ONE and that is TWO." =~ /((?i)one).*(two)/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

The regex is still “/((?i)one).*(two)/”. Note that in the subject, “ONE” is still in upper case and “TWO”, this time, is also in upper case. Matching does not occur in the code. In the regex, “two” is in lower case and lies outside the group that holds (?i); this is why matching does not occur.

The above two programs deal with capturing groups. Would non-capturing groups behave in the same way? We shall use two more simple programs to verify this. Consider the following:

use strict;

if ("This is ONE and that is two." =~ /(?:(?i)one).*(two)/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

The regex is “/(?:(?i)one).*(two)/”. Note that in the subject, “ONE” is in upper case. Matching occurs in the code. Here we are dealing with a non-capturing group. Consider the following code, which is also with a non-capturing group:

use strict;

if ("This is ONE and that is TWO." =~ /(?:(?i)one).*(two)/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

The regex is still “/(?:(?i)one).*(two)/”. Note that in the subject, “ONE” is still in the upper case and “TWO” this time is in the upper case. Matching does not occur in the code. In the regex “two” is in lower case; this is why matching does not occur.

So, this is a fact: whether you are dealing with capturing or non-capturing groups, a modifier embedded inside a group has its effect from the point where it appears to the end of that group; it does not affect the rest of the regex.
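
The four scripts above can be condensed into one. The sketch below runs both regexes against both subject strings, and adds a third regex, with (?i) placed outside any group, for contrast. The hash and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Illustrative labels mapped to the regexes under test.
my %regex_for = (
    'capturing'     => qr/((?i)one).*(two)/,
    'non-capturing' => qr/(?:(?i)one).*(two)/,
    'outside group' => qr/(?i)(one).*(two)/,
);

my @subjects = (
    "This is ONE and that is two.",
    "This is ONE and that is TWO.",
);

my %result;
for my $kind (sort keys %regex_for) {
    for my $subject (@subjects) {
        # Record and print whether this regex matches this subject.
        $result{$kind}{$subject} = $subject =~ $regex_for{$kind} ? "Matched" : "Not Matched";
        printf "%-14s %-30s %s\n", $kind, $subject, $result{$kind}{$subject};
    }
}
```

The capturing and non-capturing regexes match only the subject with “two” in lower case, while the regex with (?i) outside the groups matches both subjects.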

Well, let us take a break here and continue in the next part of the series.

Chrys

Broad Network
