Broad Network


Building a Regular Expression in PHP

PHP Regular Expressions – Part VI

Foreword: In this section we look at two examples that are more demanding.

By: Chrysanthus Date Published: 11 Aug 2012

Introduction

Many of the examples we have come across are simple examples. In this section we look at two examples that are more demanding. Before we leave this part of the series, we shall talk about what is called Backtracking.

Steps required to Build a Regex
These are the steps required to build a regex:

- Specify the task in detail,

- Break down the problem into smaller parts,

- Translate the small parts into regexes,

- Combine the regexes,

- Optimize the final combined regexes.

Two Examples
Example 1
Hexadecimal Color Code Check
Specifying the Task in Detail
An example of a hexadecimal color code is #4C8. Another example is #44CC88.
- A hexadecimal color code begins with a hash, followed by either 3 hexadecimal digits or 6 hexadecimal digits.
- Hexadecimal digits are: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F.
- The hexadecimal letters can be in lower or upper case.

Breaking Down the Problem into Smaller Parts
- It begins with a #.
- It is followed by 3 hexadecimal digits or
- 6 hexadecimal digits
- There is no character after the 3 or 6 hexadecimal digits.

Translating into regexes
There are four small parts above. The first part gives the regex:

                  /^#/

The second part gives the regex:

                 /[0-9a-fA-F]{3}/

The third part gives the regex:

                 /[0-9a-fA-F]{6}/

The fourth part gives the regex:

              /$/
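Before combining them, each small regex can be tried on its own. The following is only a sketch for illustration; the sample string and the names $sample, $parts and $results are assumptions and not part of the original example.

```php
<?php
// Try each of the four small regexes separately against one sample string.
// $sample, $parts and $results are illustrative names chosen for this sketch.
$sample = "#4C8";

$parts = array(
    "begins with #"        => "/^#/",
    "3 hexadecimal digits" => "/[0-9a-fA-F]{3}/",
    "6 hexadecimal digits" => "/[0-9a-fA-F]{6}/",
    "end of subject"       => "/$/",
);

$results = array();
foreach ($parts as $label => $regex) {
    // preg_match() returns 1 on a match and 0 when there is no match
    $results[$label] = preg_match($regex, $sample);
    echo $label . ": " . ($results[$label] ? "Matched" : "Not Matched") . "<br />";
}
```

With the sample "#4C8", only the six-digit part fails to match, which is what we expect of a three-digit color code.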

Combining the Regexes
This is the combined regex:

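               /^#([0-9a-fA-F]{3}|[0-9a-fA-F]{6})$/

Note the alternation metacharacter, |, for the three or six hexadecimal digits. Also note that a single pair of parentheses encloses both alternatives. The alternation metacharacter has the lowest precedence, so without this grouping the ^# would apply only to the first alternative and a subject that merely ends in six hexadecimal digits would be accepted, even with no leading #.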
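Where the parentheses sit decides what the alternation metacharacter applies to. As a hedged illustration (the sample strings and the variable names are assumptions made for this sketch), the following code compares a regex in which | splits the whole pattern with one in which | is confined to a single group between ^# and $.

```php
<?php
// | has the lowest precedence, so it splits everything around it
// unless a pair of parentheses confines it.
$split    = "/^#([0-9a-fA-F]{3}$)|([0-9a-fA-F]{6}$)/"; // "^#" + 3 digits + "$"  OR  6 digits + "$"
$confined = "/^#([0-9a-fA-F]{3}|[0-9a-fA-F]{6})$/";    // "^#", then 3 or 6 digits, then "$"

$samples = array("#4C8", "#44CC88", "44CC88", "xyz44CC88");

$splitResults = array();
$confinedResults = array();
foreach ($samples as $s) {
    $splitResults[$s]    = preg_match($split, $s);
    $confinedResults[$s] = preg_match($confined, $s);
    echo $s . ": split=" . $splitResults[$s] . ", confined=" . $confinedResults[$s] . "<br />";
}
```

With the first form, a subject that merely ends in six hexadecimal digits is accepted even without the leading #. The second form rejects such subjects.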

Optimizing the Combined Regex
This means shortening the combined regex. Note that 0-9 is abbreviated to \d. So in the combined regex, we change the two occurrences of 0-9 to \d. The optimized regex is:

               /^#([\da-fA-F]{3}|[\da-fA-F]{6})$/

Each replacement saves one character, so this expression is shorter by two characters.

The following code illustrates this:

<?php
    $subject = "#44CC88";

    if (preg_match("/^#([\da-fA-F]{3}|[\da-fA-F]{6})$/", $subject))
     echo "Matched" . "<br />";
    else
     echo "Not Matched" . "<br />";
?>
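To see how the check behaves on more than one subject, it can be wrapped in a small function and run over a few candidates. This is a minimal sketch: isHexColor() is a hypothetical helper name and the candidate strings are assumptions chosen for illustration.

```php
<?php
// isHexColor() is a hypothetical helper name used only for this sketch.
function isHexColor($subject)
{
    // ^# : leading hash; then 3 or 6 hexadecimal digits; $ : nothing may follow
    return preg_match('/^#([\da-fA-F]{3}|[\da-fA-F]{6})$/', $subject) === 1;
}

foreach (array("#4C8", "#44cc88", "#44CC8", "#GGG", "4C8") as $candidate) {
    echo $candidate . ": " . (isHexColor($candidate) ? "Matched" : "Not Matched") . "<br />";
}
```

The five-digit code, the code with letters outside A to F, and the code without a hash should all be rejected.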

Example 2
User Name Check
Specifying the Task in Detail
Assume that we have a site where users have to log in. We can tell the user that the name should contain letters in lower or upper case and/or digits from 0 to 9 and/or the underscore, _. We also insist that the name must not be shorter than 3 characters or longer than 18 characters. In this example we have imposed the specification details ourselves.

Breaking Down the Problem into Smaller Parts
A login name is made up of
- letters of the alphabet in lower or upper case, between 3 and 18 letters, inclusive, and/or
- digits from 0 to 9, between 3 and 18 digits, inclusive, and/or
- the underscore, between 3 and 18 underscores, inclusive. This means you can have up to 18 underscores for a name. Let us allow that for simplicity.
- We must limit the subject string to between 3 and 18 characters.

Translating into regexes
The regex for the first point is:

               /^[a-zA-Z]{3,18}$/

The regex for the second point is:

               /^[0-9]{3,18}$/

The regex for the third point is:

               /^[_]{3,18}$/

The fourth point is inherent in the above regexes.

Combining the Regexes
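In the breakdown section, the above three points are combined with the phrase “and/or”. There is no direct way of expressing this in a regex, so we have to deduce it: since each character of the name may come from any of the three sets, we merge the three sets into a single character class and keep the same length limits. This is the combined regex: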

               /^[a-zA-Z0-9_]{3,18}$/
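The combined regex can be tried against a handful of names. The following is a sketch only: isValidUserName() is a hypothetical helper name and the candidate names are assumptions chosen for illustration.

```php
<?php
// isValidUserName() is a hypothetical helper name used only for this sketch.
function isValidUserName($name)
{
    // 3 to 18 characters, each one a letter, a digit or the underscore
    return preg_match('/^[a-zA-Z0-9_]{3,18}$/', $name) === 1;
}

$candidates = array("jo", "john_smith", "John99", "___", "john smith", "a_very_long_user_name_indeed");

foreach ($candidates as $candidate) {
    echo $candidate . ": " . (isValidUserName($candidate) ? "Matched" : "Not Matched") . "<br />";
}
```

The name that is too short, the name with a space, and the name that is too long should be rejected, while a name made only of underscores is accepted, as allowed in the breakdown.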

Optimizing the Combined Regex
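This means shortening the combined regex. Note that the class [a-zA-Z0-9_] is abbreviated to \w. The optimized regex is:

               /^[\w]{3,18}$/

Since \w already stands for a whole class, the square brackets are optional here: /^\w{3,18}$/ is equivalent.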
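The length limits can be probed at their boundaries. The sketch below is an illustration under the assumption that names made only of the letter "a" are good enough test subjects; the variable names are chosen for this sketch.

```php
<?php
// Probe the {3,18} quantifier at its boundaries with names made of "a".
$regex = '/^[\w]{3,18}$/';

$results = array();
foreach (array(2, 3, 18, 19) as $length) {
    $name = str_repeat("a", $length);   // a name of exactly $length characters
    $results[$length] = preg_match($regex, $name);
    echo $length . " characters: " . ($results[$length] ? "Matched" : "Not Matched") . "<br />";
}
```

Names of 3 and 18 characters should match, while names of 2 and 19 characters should not.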

Backtracking
We have seen how to match alternatives using the alternation metacharacter, |. When matching alternatives, PHP uses a process known as backtracking. I will illustrate this with an example. Consider the following expression:

            preg_match("/(124|123)(46|4|45)/", "12345")

I will explain backtracking by explaining the operation of the above expression. The following steps explain how PHP resolves this expression.

A) It starts with the first character in the subject string, '1'.

B) It tries the first alternative in the first subpattern '124'.

C) It sees the matching of ‘1’ followed by ‘2’. That is all right.

D) It notices that '4' in the regex doesn't match '3' in the subject string – that is a dead end. So it backtracks two characters in the subject string and picks the second alternative in the first subpattern '123'.

E) It matches '1' followed by '2' followed by '3'. The first subpattern is satisfied.

F) It moves on to the second subpattern and picks the first alternative '46'.

G) It matches the '4' in the subject string.

H) However, '6' in the regex doesn't match '5' in the subject string, so that is a dead end. It backtracks one character in the subject string and picks the second alternative in the second subpattern '4'.

I) '4' matches. The second subpattern is satisfied.

J) We are at the end of the regex; we are done! We have matched '1234' out of the subject string "12345".

There are two things to note about this process. First, the third alternative in the second subpattern, '45', would also allow a match, but the process stopped before it got to the third alternative: at a given character position, the leftmost alternative that leads to a match wins. Secondly, the process was able to get a match at the first character position of the subject string, '1'. If there were no match at the first position, PHP would move to the second character position, '2', and attempt the match all over again. preg_match() gives up and returns 0, meaning no match, only when all possible paths at all possible character positions have been exhausted.
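The outcome of the steps above can be inspected by passing a third argument to preg_match(), which receives the matched text. This is a small sketch; the variable names $found and $matches are chosen for illustration.

```php
<?php
// preg_match() fills $matches: index 0 is the whole match,
// index 1 and index 2 are what the first and second subpatterns matched.
$found = preg_match("/(124|123)(46|4|45)/", "12345", $matches);

echo "Whole match: " . $matches[0] . "<br />";        // 1234
echo "First subpattern: " . $matches[1] . "<br />";   // 123
echo "Second subpattern: " . $matches[2] . "<br />";  // 4
```

The second subpattern holds '4' and not '45', which shows that the process stopped at the first alternative that allowed the whole regex to succeed.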

The x Modifier Details
This modifier is set by putting x in lower case just next to the second forward slash of the regex. That is:

                          /pattern/x

If this modifier is set, whitespace data characters in the pattern are totally ignored except when escaped or inside a character class, and characters between an unescaped # outside a character class and the next newline character, inclusive, are also ignored. I will illustrate all this.

It says whitespace data characters in the pattern are totally ignored except when escaped or inside a character class. Consider the subject string:

          $subject = "I am a man sitting down.";

The following expression with the x modifier does not produce a match.

          preg_match("/man sitting down/x", $subject)

This is because, in the presence of the x modifier, the single spaces in the regex between “man” and “sitting” and between “sitting” and “down” are ignored. If you remove the corresponding spaces in the subject, you will have a match with the x modifier. The following subject will produce a match with the above regex:

         $subject = "I am a mansittingdown.";

If you want the original subject and regex to match, then you have to escape the spaces in the regex. The following expression produces a match with the original subject:

          preg_match("/man\ sitting\ down/x", $subject)

An escaped single space is “\ ”, that is, a backslash followed by a space.
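The difference between unescaped and escaped spaces under the x modifier can be checked side by side. The following is a short sketch; the variable names $unescaped and $escaped are chosen for illustration.

```php
<?php
$subject = "I am a man sitting down.";

// Unescaped spaces are dropped by x, so this regex is really "mansittingdown".
$unescaped = preg_match("/man sitting down/x", $subject);

// Escaped spaces survive the x modifier and must match real spaces.
$escaped = preg_match("/man\ sitting\ down/x", $subject);

echo "Unescaped: " . ($unescaped ? "Matched" : "Not Matched") . "<br />";
echo "Escaped: " . ($escaped ? "Matched" : "Not Matched") . "<br />";
```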

Let us now talk about whitespace in a character class. Note that whitespace is actually [ \t\r\n\f], not only “ ”. However, let us continue our illustration using “ ”. We use the same subject, that is:

          $subject = "I am a man sitting down.";

If we want to match the space in front of “sitting”, followed by “sitting”, with the x modifier, then our regex could be:

              /[ ]sitting/x

Note that the whitespace in the character class has not been escaped. That is, with the x modifier, whitespace inside a character class need not be escaped, while whitespace outside a character class must be escaped if it is to be matched. The following expression produces a match:

           preg_match("/[ ]sitting/x", $subject)
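To confirm that the space inside the character class really takes part in the match, the matched text can be printed. This is only a sketch; the names $inClass and $matches are chosen for illustration.

```php
<?php
$subject = "I am a man sitting down.";

// Inside [ ] the space is kept even with the x modifier.
// $matches[0] receives the text that was actually matched.
$inClass = preg_match("/[ ]sitting/x", $subject, $matches);

echo ($inClass ? "Matched: '" . $matches[0] . "'" : "Not Matched") . "<br />";
```

The matched text begins with a space, so the space in the character class was not ignored.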

With the x modifier, any text between an unescaped # character and the next implicit or explicit newline character is ignored. An implicit newline character is achieved by pressing the Enter key when you are typing. An explicit newline character is achieved by typing \n inside a double-quoted string. Consider the following code:

<?php
    $subject = "I am a man sitting down.";

    $re = "/man\ #Comment goes here
           sitting/x";

    if (preg_match($re, $subject))
     echo "Matched" . "<br />";
    else
     echo "Not Matched" . "<br />";
?>

The subject is:

    $subject = "I am a man sitting down.";

The regex is:

    $re = "/man\ #Comment goes here
           sitting/x";

Note the escaped space after “man”, the # character, and the implicit newline character obtained by pressing the Enter key after the word “here”. The comment and the unescaped whitespace are ignored, so the effective regex is “man\ sitting”. A match is produced. The substring that is actually matched is “man sitting”.

In the following code, the newline character is explicit, written as \n. A match is also produced.

<?php
    $subject = "I am a man sitting down.";

    $re = "/man\ #Comment goes here\nsitting/x";

    if (preg_match($re, $subject))
     echo "Matched" . "<br />";
    else
     echo "Not Matched" . "<br />";
?>

Notice the explicit newline character, \n, between the words “here” and “sitting”. Because the string is double-quoted, PHP converts \n to a newline character before the regex is compiled, and that newline ends the comment.

When the x modifier is set, you can add comments to your regex, which is especially useful when the regex is complex.
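As an illustration of such comments, the hexadecimal color code regex from Example 1 can be spread over several lines. This is a sketch under stated assumptions: the layout and the variable name $hexColor are chosen for illustration, and the literal hash is written as \# because an unescaped # would start a comment.

```php
<?php
// A commented version of the hexadecimal color code regex, using the x modifier.
// The string is single-quoted, so the backslashes reach the regex engine unchanged.
$hexColor = '/
    ^ \#                  # a literal hash, escaped because a bare one starts a comment
    ( [\da-fA-F]{3}       # either three hexadecimal digits
    | [\da-fA-F]{6}       # or six hexadecimal digits
    )
    $                     # nothing may follow
/x';

foreach (array("#4C8", "#44CC88", "44CC88") as $subject) {
    echo $subject . ": " . (preg_match($hexColor, $subject) ? "Matched" : "Not Matched") . "<br />";
}
```

Take care that a comment does not contain the delimiter, / in this case, since PHP would treat it as the end of the regex.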

Let us take a break here and continue in the next part of the series.

Chrys
