Broad Network


Grouping in PHP Regular Expressions

PHP Regular Expressions with Security Considerations - Part 4

Foreword: You can actually group parts of a pattern and do something with it. That is what this part of the series is about.

By: Chrysanthus Date Published: 18 Jan 2019

Introduction

This is part 4 of my series, PHP Regular Expressions with Security Considerations. You can actually group parts of a pattern and do something with it. That is what this part of the series is about. You should have read the previous parts of the series before coming here, as this is the continuation.

Groupings
We can use parentheses to group characters in a pattern. Consider the following pattern:

          /The (guitarist)/

“guitarist” is in parentheses. The parentheses form a group, which has the text, “guitarist”. Consider the following:

         /The (guitarist is good)/

“guitarist is good” is in parentheses. The parentheses form a group, which has the text, “guitarist is good”.

PHP treats a group as a single unit. A group on its own serves no purpose; it becomes useful when combined with other pattern techniques, such as alternation. Groups also have a second use, capturing, which we shall see below.

More about the preg_match() Function
The syntax of the function can be given as follows:

    preg_match ( string $pattern , string $subject [, array &$matches])

When a match occurs, the matched sub string goes into the array, $matches. You do not have to pre-declare this array, and you can give it any name of your choice. Try the following code:

<?php

    preg_match("/World/", "Hello World!", $matches);
    echo $matches[0];

?>

The output is:

    World

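Note that $matches in the function call is not preceded by &. The & appears only in the syntax of the function, where it indicates that the array is passed by reference.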
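The following is a minimal sketch of checking the return value of preg_match() before reading the array. It relies on the documented behaviour that preg_match() returns 1 when the pattern matches, 0 when it does not, and false on error; the variable name $result is my own choice for illustration.

```php
<?php

    // preg_match() returns 1 on a match, 0 on no match, and false on error.
    $result = preg_match("/World/", "Hello World!", $matches);

    if ($result === 1) {
        // $matches[0] holds the text matched by the whole pattern.
        echo "Matched: ", $matches[0];
    } elseif ($result === 0) {
        echo "No match";
    } else {
        echo "Regex error";
    }

?>
```

With the subject above, the script should display: Matched: World. Checking the return value first means the script never reads $matches[0] when it does not exist.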

Sub Strings with common Parts
Imagine that you have a bookshop where there is a bookkeeper and a bookshelf. Here, bookkeeper is the person who looks after the books. Also imagine that you have any of the following subject strings:

          $subject = "There is a bookshelf in my shop.";

          $subject = "I have a bookkeeper.";

          $subject = "The bookkeeper takes care of the bookshelf.";

In your code, you might not know which subject string is present (the string might have been taken from somewhere and assigned to a variable); however, let us say your interest is to know whether there is a bookshelf or bookkeeper in the subject string. The regex for this can be:

          /bookshelf|bookkeeper/

Note that in the above regex, we have to type the word, “book” twice. We can avoid this double typing by using the following regex:

          /book(shelf|keeper)/

This second regex is more economical to write, because we do not have to type the word, “book”, twice. In the second case we have the group, (shelf|keeper). PHP treats a group as a single unit. Within this group, PHP has to choose “shelf” or “keeper”. In this way, PHP looks for “bookshelf” or “bookkeeper” in the subject string. The following code produces a match; preg_match() returns 1 and fills the array, $matches.

    <?php

        $subject = "There is a bookshelf in my shop.";
        preg_match("/book(shelf|keeper)/", $subject, $matches);
        print_r($matches);

    ?>

The output is:

    Array
        (
           [0] => bookshelf
           [1] => shelf
        )

Here, the subject variable can be any of the above strings.
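As a quick check, the sketch below runs the same pattern against all three subject strings given earlier. The array name $subjects and the loop are my own additions for illustration; the line break "\n" assumes you are running the script from the command line.

```php
<?php

    // The three subject strings from above, collected in one array.
    $subjects = array(
        "There is a bookshelf in my shop.",
        "I have a bookkeeper.",
        "The bookkeeper takes care of the bookshelf."
    );

    foreach ($subjects as $subject) {
        // preg_match() stops at the first match it finds in the subject.
        preg_match("/book(shelf|keeper)/", $subject, $matches);
        echo $matches[0], " / ", $matches[1], "\n";
    }

?>
```

The three lines displayed should be “bookshelf / shelf”, “bookkeeper / keeper” and “bookkeeper / keeper”. For the third subject, only “bookkeeper” is reported, because preg_match() stops at the first match, reading from the left.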

Here, the word “book” is common to both words and is the first part of both words. The common sub string can also be the second part of two phrases; the same technique applies, but this time the group comes first, like this:

           /(non-common|non-common)common/

A pattern can become more complex, with groups like the ones above appearing in different places of a bigger pattern.
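To make the group-in-front case concrete, here is a small sketch. The words “keyboard” and “whiteboard” and the subject strings are hypothetical examples of my own; “board” plays the role of the common second part.

```php
<?php

    // Hypothetical subjects; "board" is the common second part.
    $subjects = array(
        "I bought a keyboard.",
        "Clean the whiteboard.",
        "Nothing to find here."
    );

    foreach ($subjects as $subject) {
        // The group (key|white) comes first, followed by the common part.
        if (preg_match("/(key|white)board/", $subject, $matches)) {
            echo $matches[0], " (group: ", $matches[1], ")\n";
        } else {
            echo "no match\n";
        }
    }

?>
```

The output should be “keyboard (group: key)”, “whiteboard (group: white)” and “no match”, in that order.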

Sub Strings with Alternation at Beginning of Subject String in Group
Let us look at the case of sub strings with a common part, where one of the alternatives has to match at the beginning of the subject string, using the anchor, ^.

Consider the following pattern:

              /(^x|y)z/

The anchor metacharacter ^ is used to match the regex at the beginning of the subject string. The above pattern matches 'xz' at the start of the subject string or 'yz' anywhere in the subject string. The following expressions match:

    preg_match("/(^x|y)z/", "xz 5678")

    preg_match("/(^x|y)z/", "34 yz 56 G")
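The position of the anchor matters. The sketch below compares the anchor inside the group with the anchor placed in front of the group; the third subject string, "34 xz 56", and the variable names are my own additions for illustration.

```php
<?php

    $subjects = array("xz 5678", "34 yz 56 G", "34 xz 56");

    foreach ($subjects as $subject) {
        // ^ inside the group: only the x alternative is tied to the start.
        $inside = preg_match("/(^x|y)z/", $subject);

        // ^ in front of the group: both alternatives are tied to the start.
        $outside = preg_match("/^(x|y)z/", $subject);

        echo $subject, " => inside: ", $inside, ", in front: ", $outside, "\n";
    }

?>
```

Here 1 means a match and 0 means no match. "xz 5678" should give 1 and 1; "34 yz 56 G" should give 1 and 0; "34 xz 56" should give 0 and 0, since its 'xz' is not at the start and there is no 'yz'.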

More on Sub Strings with Common Parts
Imagine that you want to match “book” or “bookkeeper” or “bookkeepers”. The sub string “book” occurs in all three phrases (sub strings). The sub string “bookkeeper” occurs in two of the phrases, and the sub string “bookkeepers” occurs in only one of the phrases. Our aim in this section is to develop an efficient pattern to match sub strings such as these three.

You can do this:

        /book|bookkeeper|bookkeepers/

The problem here (inefficiency) is that you have to type “book” three times and you have to type “keeper” two times.

The following pattern is efficient:

         /book(keeper(s|)|)/

First of all, note here that you have a nested group; groups can be nested. There are also two alternation metacharacters, |; one inside an inner nested group; the other inside an outer group.

In the inner nested group, PHP has to choose between “s” and nothing. In the outer group, PHP has to choose between “keeper(s|)” and nothing. “book” will always be chosen. In this way, PHP will match “book”, “bookkeeper” or “bookkeepers”. The situation here is similar to the first situation above, but is more complex.
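The two patterns also differ in what they report. Alternatives are tried from left to right, so with “book” listed first, the flat pattern is satisfied as soon as “book” is found. The sketch below illustrates this; the subject string and the array names $flat and $nested are hypothetical, chosen only for the comparison.

```php
<?php

    // Hypothetical subject for the comparison.
    $subject = "All bookkeepers are here.";

    // Alternatives are tried left to right, so "book" is found first.
    preg_match("/book|bookkeeper|bookkeepers/", $subject, $flat);
    echo $flat[0], "\n";

    // The nested groups try "keeper" and "s" before falling back to nothing.
    preg_match("/book(keeper(s|)|)/", $subject, $nested);
    echo $nested[0], "\n";

?>
```

The first echo should display “book” and the second should display “bookkeepers”. So the nested pattern not only saves typing; it also reports the longest of the three words present at that position.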

Capturing
The grouping metacharacters, ( ), that is, the parentheses, also serve another completely different purpose: they allow the capture of the sub strings in the subject that matched. A pattern (regex) is not usually an exact word or an exact phrase. After matching has occurred with the subject, can you know the exact word or phrase that was matched? Yes, you can, thanks to grouping (parentheses).

Consider the following subject string:

    "This is one and that is two."

The pattern, /(one).*(two)/, matches the sub string “one and that is two” in the subject. The whole matched sub string is captured in an array when you use the preg_match() function. The portions of the subject string matched by each group are also captured in the array. In this case, “one” and “two” are captured.

The following code illustrates this:

<?php

        preg_match("/(one).*(two)/", "This is one and that is two.", $matches);
        echo $matches[0], '<br>';
        echo $matches[1], '<br>';
        echo $matches[2];

?>

For the output, the first echo construct displays:

         one and that is two

The second echo construct displays:

         one

The third echo construct displays:

         two

Let us now look at the code. The first statement is:

        preg_match("/(one).*(two)/", "This is one and that is two.", $matches);

First of all, note that we have used the preg_match() function. There are two groups in the regex. After the matching process, the sub string in the subject that matches the whole regex goes into $matches[0]. The sub string that matches the first group goes into $matches[1]. The sub string that matches the second group goes into $matches[2]. Note: the array acquires these sub strings only if there is a match. If there is no match, preg_match() returns 0 and $matches is an empty array.

After the first statement in the script, the other three echo statements display the three array values, accordingly. From the explanation given above, there can only be three elements in the array. This explains what you have as the output.

So, to capture the whole regex including any of its groups, you need the preg_match() function and an array to hold the matched sub strings. The whole regex gives rise to the value of the first element of the array. The first group in the regex (from left to right) gives rise to the value of the second element of the array. The second group gives rise to the value of the third element in the array. The rest of the groups, if available, fill the array in that order. The length of the array is the number of sub strings captured. This consists of the sub string in the subject that corresponds to the whole regex and the other sub strings corresponding to the groups in the regex. Remember, if there is no match, the array is empty and the return value of preg_match() is 0.
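The sketch below shows what preg_match() leaves behind when the pattern does not match. The subject string and the variable name $found are my own, for illustration only.

```php
<?php

    // A hypothetical subject that does not contain "one ... two".
    $found = preg_match("/(one).*(two)/", "Nothing relevant here.", $matches);

    var_dump($found);            // int(0): no match
    var_dump(count($matches));   // int(0): the array is empty

    // Read the captured groups only after confirming the match.
    if ($found === 1) {
        echo $matches[1], " ", $matches[2];
    } else {
        echo "No groups captured";
    }

?>
```

Testing the return value before reading $matches[1] or $matches[2] avoids using array elements that do not exist.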

What about Nested Groups
Consider the following code:

<?php

        preg_match("/school (boy(s|)|)/", "I like school boys.", $matches);
        echo $matches[0], '<br>';
        echo $matches[1], '<br>';
        echo $matches[2];

?>

This is the output of the above code:

school boys
boys
s

The whole pattern would match “school boys”, “school boy” or “school”. We have two groups; one inside the other.

The outer group in the regex is (boy(s|)|) and the inner group is (s|). The outer group corresponds to “boys” in “school boys”. The inner group corresponds to “s” at the end of “school boys”.

Let me explain this capturing a bit more. “(boy(s|)|)” means “boy(s|)” or nothing, and “boy(s|)” means “boys” or “boy”; so “boys” next to “school ” is captured. “(s|)” is a group and any group can be captured; it means “s” or nothing. Note that it is not a group on its own that is matched; it is the whole pattern that is matched. The matched sub string that contains our “s” is “school boys”. As “school boys” is matched, our “s” is captured.

Capturing and matching are not the same things. After matching occurs, if there is any group in the matched sub string in the subject string, it is captured.

Now, you may ask, “With nested groups, how do we know which group is considered the first group, which is considered the second, which is considered the third, and so on?” This is the answer: the leftmost opening parenthesis, “(”, is for the first group; the second opening parenthesis from the left is for the second group; the opening parenthesis after this is for the third group, and so on.
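The numbering rule can be seen with a pattern that has four groups, one of which contains two others. This is only a sketch: the date-like subject string and the pattern are hypothetical, and \d (a digit) with {4} (exactly four times) are assumed to be known from elsewhere.

```php
<?php

    // Hypothetical subject; the date format is only for illustration.
    $subject = "Published on 2019-01-18.";

    // Opening parentheses, counted from the left:
    //   1: ((\d{4})-(\d{2}))   year and month together
    //   2: (\d{4})             year
    //   3: (\d{2})             month
    //   4: (\d{2})             day
    $pattern = '/((\d{4})-(\d{2}))-(\d{2})/';

    if (preg_match($pattern, $subject, $matches)) {
        foreach ($matches as $index => $text) {
            echo $index, ": ", $text, "\n";
        }
    }

?>
```

The output should be “0: 2019-01-18”, “1: 2019-01”, “2: 2019”, “3: 01” and “4: 18”. Group 1 opens first, so it receives index 1 even though it closes after groups 2 and 3.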

Time to take a break. We continue in the next part of the series.

Chrys

