Broad Network


Building and Using PHP Regular Expressions

PHP Regular Expressions with Security Considerations - Part 7

Foreword: In this part of the series, I explain how to build and use PHP regular expressions.

By: Chrysanthus Date Published: 18 Jan 2019

Introduction

This is part 7 of my series, PHP Regular Expressions with Security Considerations. In this part of the series, I explain how to build and use PHP regular expressions. You should have read the previous parts of the series before this one, as this is a continuation.

Building a Regular Expression

Steps required to build a Regex
These are the steps required to build a regex:

- Specify the task in detail,

- Break down the problem into smaller parts,

- Translate the small parts into regexes,

- Combine the regexes,

- Optimize the final combined regexes.

Two Examples

Example 1
Hexadecimal Color Code Check

Specifying the Task in Detail
An example of a hexadecimal color code is #4C8. Another example is #44CC88.
- A hexadecimal color code begins with a hash, followed by either 3 hexadecimal digits or 6 hexadecimal digits.
- Hexadecimal digits are: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F.
- The hexadecimal letters can be in lower or upper case.

Breaking Down the Problem into Smaller Parts
- It begins with a #, anchored at the start of the subject
- It is followed by 3 hexadecimal digits or
- 6 hexadecimal digits
- There is no character after the 3 or 6 hexadecimal digits.

Translating into regexes
There are four small parts above. The first part gives the regex:

                  /^#/

The second part gives the regex:

                 /[0-9a-fA-F]{3}/

The third part gives the regex:

                 /[0-9a-fA-F]{6}/

The last part gives the regex:

              /$/

Combining the Regexes
This is the combined regex:

               /^#([0-9a-fA-F]{3}|[0-9a-fA-F]{6})$/

Note the alternation metacharacter, |, for the three or six hexadecimal digits. Also note that the parentheses enclose both alternatives, so that the # at the start and the $ at the end apply to each of them. If the alternation were left outside the parentheses, as in /^#([0-9a-fA-F]{3}$)|([0-9a-fA-F]{6}$)/, the second alternative would not be tied to ^#, and any subject ending in six hexadecimal digits would match.

Optimizing the Combined Regex
This means shortening the combined regex. Note that 0-9 is abbreviated to \d. So in the combined regex, we change the two occurrences of 0-9 to \d. The optimized regex is:

               /^#([\da-fA-F]{3}|[\da-fA-F]{6})$/

This expression is shorter than the above by two characters.

The following code illustrates its use:

<?php

    $subject = "#44CC88";

    if (preg_match('/^#([\da-fA-F]{3}|[\da-fA-F]{6})$/', $subject) === 1)
        echo 'Matched';
    else
        echo 'Not Matched';

?>
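As a minimal sketch, the check can be wrapped in a function and tried against several subjects. The function name isHexColor() and the variable $loose are hypothetical, chosen only for this illustration. The $loose pattern shows what happens when the alternation is not enclosed in a single group.

```php
<?php

// Hypothetical helper name; not part of the original article.
function isHexColor(string $subject): bool
{
    // The alternation sits inside one group, so ^# and $ apply to both branches.
    return preg_match('/^#([\da-fA-F]{3}|[\da-fA-F]{6})$/', $subject) === 1;
}

// A pattern whose alternation is NOT grouped: the second branch has
// no ^# in front of it, so it accepts any string ending in six hex digits.
$loose = '/^#[\da-fA-F]{3}$|[\da-fA-F]{6}$/';

var_dump(isHexColor('#4C8'));               // bool(true)
var_dump(isHexColor('#44CC88'));            // bool(true)
var_dump(isHexColor('#44CC8'));             // bool(false): five digits
var_dump(isHexColor('xyz44CC88'));          // bool(false): no leading #
var_dump(preg_match($loose, 'xyz44CC88'));  // int(1): the loose pattern is fooled
```

Comparing the return value with === 1 keeps a regex error (false) from being mistaken for "no match" (0).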

Example 2
User Name Check

Specifying the Task in Detail
Assume that we have a site where users have to log in. We can tell the user that his login name should contain letters in lower or upper case and/or digits from 0 to 9 and/or the underscore, _. We also insist that the name must not be shorter than 3 characters or longer than 18 characters. In this example, we have imposed the specification details ourselves.

Breaking Down the Problem into Smaller Parts
From above, a login name is made up of
- letters of the alphabet in lower or upper case, between 3 and 18 letters, inclusive, and/or
- digits from 0 to 9, between 3 and 18 digits, inclusive, and/or
- the underscore, between 3 and 18 underscores, inclusive. This means that a name can consist of up to 18 underscores. Let us allow that for simplicity.
- We must limit the subject string to between 3 and 18 characters.

Translating into Regexes
The regex for the first part is:

               /^[a-zA-Z]{3,18}$/

The regex for the second part is:

               /^[0-9]{3,18}$/

The regex for the third part is:

               /^[_]{3,18}$/

The fourth part is inherent in the above regexes.

Combining the Regexes
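In the breakdown section, the above three parts are combined with the phrase “and/or”. There is no direct way of doing this, so we have to deduce it: any of the allowed characters may appear at each position, so all of them go into a single character class. This is the combined regex: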

               /^[a-zA-Z0-9_]{3,18}$/

Optimizing the Combined Regex
This means shortening the combined regex. Note that the class [a-zA-Z0-9_] is abbreviated to \w. The optimized regex is:

               /^\w{3,18}$/
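The following is a minimal sketch of the user name check as a function. The name isValidLoginName() is hypothetical. The D modifier is an addition not used in the text above: in PHP's regex engine, $ on its own also matches just before a newline that ends the subject, which matters when the pattern is used to validate input.

```php
<?php

// Hypothetical helper name; not part of the original article.
function isValidLoginName(string $name): bool
{
    // \w stands for [a-zA-Z0-9_] here (assuming the default locale and no u modifier).
    // The D modifier makes $ match only at the very end of the subject;
    // without it, $ also matches just before a final newline.
    return preg_match('/^\w{3,18}$/D', $name) === 1;
}

var_dump(isValidLoginName('John_Smith1'));        // bool(true)
var_dump(isValidLoginName('ab'));                 // bool(false): too short
var_dump(isValidLoginName(str_repeat('a', 19)));  // bool(false): too long
var_dump(isValidLoginName('john smith'));         // bool(false): space not allowed
var_dump(isValidLoginName("john\n"));             // bool(false): trailing newline rejected
var_dump(preg_match('/^\w{3,18}$/', "john\n"));   // int(1): accepted without D
```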

Backtracking

We have seen how to match alternatives using the alternation metacharacter, |. When matching alternatives, PHP's regex engine (PCRE) uses a process known as backtracking. I will illustrate this with an example. Consider the following expression:

            preg_match("/(124|123)(46|4|45)/", "12345")

I will explain backtracking by explaining the operation of the above expression. The following steps explain how PHP resolves this expression.

- It starts with the first number in the subject string '1'.

- It tries the first alternative in the first group '124'.

- It sees the matching of ‘1’ followed by ‘2’. That is all right.

- It notices that '4' in the regex doesn't match '3' in the subject string – that is a dead end. So it backtracks two characters in the subject string and picks the second alternative in the first group '123'.

- It matches '1' followed by '2' followed by '3'. The first group is satisfied.

- It moves on to the second group and picks the first alternative '46'.

- It matches the '4' in the subject string.

- However, '6' in the regex doesn't match '5' in the subject string, so that is a dead end. It backtracks one character in the subject string and picks the second alternative in the second group '4'.

- '4' matches. The second group is satisfied.

- We are at the end of the regex; we are done! We have matched '1234' out of the subject string "12345".

There are two things to note about this process. First, the third alternative in the second group, '45', would also allow a match, but the process stopped before it got to the third alternative: at a given character position, the leftmost alternative that works wins. Secondly, the process was able to get a match at the first character position of the subject string, '1'. If there were no match at the first position, PHP would move to the second character position, '2', and attempt the match all over again. PHP gives up and reports that "12345" does not match /(124|123)(46|4|45)/ (preg_match() returns 0) only when all possible paths at all possible character positions have been exhausted.
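The outcome of the steps above can be observed with preg_match() and its optional $matches array. This is a small sketch; the second pattern, with $ appended, is an added illustration showing that the engine backtracks further, to the third alternative, when the shorter match does not let the rest of the pattern succeed.

```php
<?php

$subject = "12345";

// A match is found starting at the first character of the subject.
$result = preg_match('/(124|123)(46|4|45)/', $subject, $matches);

var_dump($result);       // int(1)
echo $matches[0], "\n";  // 1234 (whole match)
echo $matches[1], "\n";  // 123  (first group, second alternative)
echo $matches[2], "\n";  // 4    (second group, second alternative)

// With $ added, '4' alone leaves '5' unmatched before the end of the subject,
// so the engine backtracks once more and tries the third alternative, '45'.
preg_match('/(124|123)(46|4|45)$/', $subject, $anchored);

echo $anchored[0], "\n"; // 12345
echo $anchored[2], "\n"; // 45
```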

Using Regular Expressions

Obtaining the Match
Consider the simple regex, /.ir../ and the subject, “boys and girls in a family”. The regex would match “girls” in the subject. The question is, how do you return and use “girls” further down in the code? You have to use the preg_match() function with its third argument. The syntax is:

    preg_match ( string $pattern , string $subject [, array &$matches])

When a match occurs, the found sub string goes into the array, $matches. You do not have to pre-declare this array. You can have any name of your choice for the array.

If $matches is provided (optionally added to the function call), then it is filled with the results of the search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized (group) subpattern, and so on.

The preg_match() function returns 1 if the pattern matches the given subject, 0 if it does not, or FALSE if an error occurred (for example, if the coding in the pattern does not make sense).

Try the following code:  

<?php

    $subject = "boys and girls in a family";
    $re = "/.ir../";

    preg_match($re, $subject, $matches);
    echo $matches[0];

?>

The output is:

    girls

Note: If nothing is found, preg_match() returns 0 and $matches is left as an empty array, so check the return value before reading $matches[0]. You do not precede $matches with & in the function call.
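As a sketch of how the return value and the $matches array work together, the following code adds a pair of parentheses to the regex (an addition made only for this illustration) so that $matches[1] is filled as well, and it distinguishes the three possible return values.

```php
<?php

$subject = "boys and girls in a family";

// The parentheses capture the single character in front of "ir".
$result = preg_match('/(.)ir../', $subject, $matches);

if ($result === 1) {
    echo $matches[0], "\n";  // girls (text matched by the full pattern)
    echo $matches[1], "\n";  // g     (text matched by the first group)
} elseif ($result === 0) {
    echo "No match\n";
} else {
    // false: the pattern could not be run
    echo "Regex error code: ", preg_last_error(), "\n";
}

// With no match, the return value is 0 and the array is empty.
$noMatch = preg_match('/xyz/', $subject, $none);
var_dump($noMatch);      // int(0)
var_dump(count($none));  // int(0)
```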

Search and Replace
You can search for a match in the subject, and have the sub strings matched (found) replaced. Consider the following subject string:

             "I am a man. You are a man."

The sub string “man” occurs in this subject in two places. You can have each occurrence of the sub string “man” replaced by “woman”. You do this using the preg_replace() function, whose simplified syntax is:

        mixed preg_replace ( mixed $pattern , mixed $replacement , mixed $subject)

If matches are found, a new string (a modified copy of the subject) is returned; if none are found, an unchanged copy of the subject is returned; NULL is returned if an error occurred. The original subject remains the same.

The following code illustrates this:

<?php

    $subject = "I am a man. You are a man.";

    $str = preg_replace("/man/", 'woman', $subject);

    echo $subject, '<br>';
    echo $str;

?>

The output is:

             I am a man. You are a man.
             I am a woman. You are a woman.

There are four statements in the code. The first statement is the declaration and assignment of the subject string. The second statement does the replacement. The first argument of the preg_replace() function is the regex; the second argument is the replacement sub string. The third argument is the subject.

The first echo construct displays the subject. The second echo construct displays the string returned by the preg_replace() function.

From the output, we see that the subject remains unchanged. The returned string is a copy of the subject in which all the occurrences of the sub string “man” have been replaced with “woman”.

If you want to replace only a limited number of occurrences, counting from the left, then you have to use an additional argument, called the limit argument. The following code illustrates this:

<?php

    $subject = "I am a man. You are a man.";

    $str = preg_replace("/man/", 'woman', $subject, 1);

    echo $subject, '<br>';
    echo $str;

?>

The output is:

             I am a man. You are a man.
             I am a woman. You are a man.

The value of the limit argument is the number, 1 (not in quotes). And so 1 (the first) occurrence has been replaced. The limit argument is actually the maximum number of occurrences that can be replaced, beginning from the left. If there is no occurrence, nothing will be replaced.
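The preg_replace() function also accepts an optional fifth argument, a variable that receives the number of replacements actually made. The sketch below uses it together with the limit argument. The word boundary \b in the pattern is an addition made for this illustration, so that “man” inside a longer word such as “woman” would be left alone.

```php
<?php

$subject = "I am a man. You are a man.";

// A limit of -1 means no limit; $count receives the number of replacements made.
$str = preg_replace('/\bman\b/', 'woman', $subject, -1, $count);

if ($str === null) {
    echo "Regex error code: ", preg_last_error(), "\n";
} else {
    echo $str, "\n";    // I am a woman. You are a woman.
    echo $count, "\n";  // 2
}

// With a limit of 1, only the leftmost occurrence is replaced.
$first = preg_replace('/\bman\b/', 'woman', $subject, 1, $firstCount);
echo $first, "\n";       // I am a woman. You are a man.
echo $firstCount, "\n";  // 1
```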

The Split Operation
PHP has a function called preg_split(). This function splits the string (subject) into an array of sub strings. The simplified syntax is:

array preg_split ( string $separator , string $subject [, int $limit = -1])

The subject is the string to be split. It is not changed by the split. The separator is a regex. The returned array contains the separated sub strings. The limit is an integer giving the maximum number of sub strings to return. Some subjects have a tail that you do not want to split. If you know how many sub strings you want, pass that number as the limit; whatever remains of the subject after the earlier pieces goes unsplit into the array as the last sub string. The limit argument is optional. If it is absent (or is -1 or 0), the whole subject is split.

Consider the following subject string:

    $subject = "one two three";

If we know the regex (pattern) to identify the space between words, then we can split this string into an array made up of the words, “one”, “two” and “three”. \  (a backslash followed by a space) matches a literal space character. + matches the preceding item one or more times. The regex to separate the above words is

               \ +

The + is needed because the gap between two words might have been created by hitting the spacebar more than once. The following code illustrates the use of the split function, with a longer subject and a limit of 3:

<?php

    $subject = "one two three four five";

    $arr = preg_split("/\ +/", $subject, 3);

    echo $arr[0], '<br>';
    echo $arr[1], '<br>';
    echo $arr[2], '<br>';

?>

In the subject string the words are separated by spaces. The output of the above code is:

one
two
three four five

The split function has split the words in the subject string using the space between the words, and put the words as elements in the returned array. The word, “split” is not really proper in this section, since the subject string remains unchanged; however, that is the vocabulary the PHP manual uses.

It is possible to have words in a string separated by a comma and a space, like

    $subject = "one, two, three";

The regex to separate these words is:

          /, +/

The following code illustrates this:

<?php

    $subject = "one, two, three";

    $arr = preg_split("/, +/", $subject);

    echo $arr[0], '<br>';
    echo $arr[1], '<br>';
    echo $arr[2], '<br>';

?>

The output of the above code is:

one
two
three

It is possible to split and have empty strings as array elements. In the following code, the regex is a comma, and there are two consecutive commas in the subject:

<?php

    $subject = "one, two,,three";

    $arr = preg_split("/,/", $subject);

    echo $arr[0], '<br>';
    echo $arr[1], '<br>';
    echo $arr[2], '<br>';
    echo $arr[3], '<br>';

?>

The output is:

one
two

three

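where the third value is an empty string and not a space. (The second value is actually " two", with a leading space, because the regex consumes only the comma; a browser does not display that space.)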
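If the empty elements and the stray spaces are not wanted, one possible approach, sketched below, is to let the separator regex absorb the spaces around each comma and to pass the PREG_SPLIT_NO_EMPTY flag as the fourth argument of preg_split(). The limit of -1 means no limit.

```php
<?php

$subject = "one, two,,three";

// Splitting on the comma alone keeps the leading space and the empty string.
$raw = preg_split('/,/', $subject);
var_dump(count($raw));  // int(4)

// \s* absorbs optional white space around each comma;
// PREG_SPLIT_NO_EMPTY drops empty strings from the result.
$arr = preg_split('/\s*,\s*/', $subject, -1, PREG_SPLIT_NO_EMPTY);

foreach ($arr as $word) {
    echo $word, "\n";   // one, two, three on separate lines
}
```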

We have seen a lot in this part of the series. Let us stop here and continue in the next part.

Chrys

