Broad Network


Regex SubPattern in PHP

PHP Regular Expressions – Part IV

Foreword: In this part of the series, I explain regex grouping (subpatterns) and capturing in PHP.

By: Chrysanthus Date Published: 11 Aug 2012

Introduction

In this part of the series, I explain regex Grouping (subpattern) and Capturing in PHP.

Groupings
We can use parentheses to group characters in a pattern. Consider the following pattern:

          /The (guitarist)/

“guitarist” is in parentheses. The parentheses form a group, or subpattern, which has the text “guitarist”. Consider the following:

         /The (guitarist is good)/

“guitarist is good” is in parentheses. The parentheses form a group, which has the text “guitarist is good”.

PHP treats a subpattern as a single unit. On its own, a subpattern serves little purpose; it becomes useful when combined with other pattern techniques, such as alternation. It also has a second use, capturing, which we shall see below.

Sub Strings with common Parts
Imagine that you have a bookshop where there is a bookkeeper and a bookshelf. Here, bookkeeper is the person who looks after the books. Also imagine that you have any of the following subject strings:

          $subject = "There is a bookshelf in my shop.";

          $subject = "I have a bookkeeper.";

          $subject = "The bookkeeper takes care of the bookshelf.";

In your code, you might not know which available string is present (the string might have been taken from somewhere and assigned to a variable); however, let us say your interest is to know whether there is a bookshelf or bookkeeper in the subject string. The regex for this can be:

          /bookshelf|bookkeeper/

Note that in the above regex, we have to type the word, “book” twice. We can avoid this double typing by using the following regex:

          /book(shelf|keeper)/

This second regex is more concise, because we do not have to type the word “book” twice. In the second case we have the subpattern (shelf|keeper). PHP treats a subpattern (group) as a single unit. Within this group, PHP has to choose “shelf” or “keeper”. In this way, PHP looks for “bookshelf” or “bookkeeper” in the subject string. The following expression produces a match:

           preg_match("/book(shelf|keeper)/", $subject)

Here, $subject can be any of the above strings.
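As a minimal sketch, the pattern can be tried against all three subject strings in a loop. The variable names $subjects and $results are my own choices for illustration.

```php
<?php
// The three subject strings from the text above.
$subjects = [
    "There is a bookshelf in my shop.",
    "I have a bookkeeper.",
    "The bookkeeper takes care of the bookshelf.",
];

$results = [];
foreach ($subjects as $subject) {
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    $results[] = preg_match("/book(shelf|keeper)/", $subject);
}

print_r($results); // each element is 1
```

Each of the three strings contains either “bookshelf” or “bookkeeper”, so every call reports a match.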

Here, the word “book” is common to both words and is the first part of both. The common part can also come at the end of two phrases; the same technique applies, but this time the group comes first, like this:

           /(non-common|non-common)common/

Patterns can become complex, and the same construct may appear in several places within a bigger pattern.
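To make the group-first form concrete, here is a small sketch. The words “football”, “basketball” and “volleyball” are hypothetical examples chosen only because they share the ending “ball”.

```php
<?php
// The group comes first; the common part "ball" follows it.
$pattern = "/(foot|basket)ball/";

$first  = preg_match($pattern, "He plays football.");    // 1
$second = preg_match($pattern, "She plays basketball."); // 1
// "volleyball" ends in "ball", but "volley" is not one of the alternatives.
$third  = preg_match($pattern, "They play volleyball."); // 0
```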

Sub Strings with Alternation at Beginning of Available String in subpattern
Let us look at the case of sub strings with common part where matching has to occur at the beginning of the available string.

Consider the following pattern:

              /(^x|y)z/

The anchor metacharacter ^ matches at the beginning of the subject string. Because the anchor is inside the first alternative only, the above pattern matches 'xz' at the start of the subject string, or 'yz' anywhere in the subject string. The following expressions match:

         preg_match("/(^x|y)z/", "xz 5678")

         preg_match("/(^x|y)z/", "34 yz 56 G")
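The sketch below repeats the two matching cases and adds a third subject, “34 xz 56 G”, which is my own counter-example: it contains 'xz', but not at the start, and it contains no 'yz', so it should not match.

```php
<?php
$pattern = "/(^x|y)z/";

$atStart  = preg_match($pattern, "xz 5678");    // 1: 'xz' at the start
$yzInside = preg_match($pattern, "34 yz 56 G"); // 1: 'yz' anywhere
$xzInside = preg_match($pattern, "34 xz 56 G"); // 0: 'xz' is not at the start
```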

More on Sub Strings with Common Parts
Imagine that you want to match “book” or “bookkeeper” or “bookkeepers”.  The sub string “book” occurs in the three phrases (sub strings). The sub string “bookkeeper” occurs in two of the phrases; and the sub string bookkeepers occurs in only one of the phrases. Our aim in this section is to develop an efficient pattern to match sub strings such as the three we saw before.

You can do this:

        /book|bookkeeper|bookkeepers/

The problem here is that you have to type “book” three times and “keeper” twice.

The following pattern is more concise:

         /book(keeper(s|)|)/

First of all, note here that you have a nested group; groups can be nested. There are also two alternation metacharacters, |; one inside an inner nested group; the other inside the outer group.

In the inner nested group, PHP has to choose between “s” and nothing. In the outer group, PHP has to choose between “keeper(s|)” and nothing. “book” is always required. In this way, PHP will match “book”, “bookkeeper” or “bookkeepers”. The situation is similar to the first one above, but more complex.
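The two patterns are not quite interchangeable once you look at what was actually matched. Alternatives are tried from left to right, so the flat pattern stops at “book” every time, while the nested pattern tries the longer alternative first. The sketch below shows the difference; the variable names are my own, and it uses the third argument of preg_match() to read back the matched text.

```php
<?php
$nested = "/book(keeper(s|)|)/";
$flat   = "/book|bookkeeper|bookkeepers/";
$words  = ["book", "bookkeeper", "bookkeepers"];

$nestedMatches = [];
$flatMatches   = [];
foreach ($words as $word) {
    // $m[0] holds the text matched by the whole pattern.
    preg_match($nested, $word, $m);
    $nestedMatches[] = $m[0];

    preg_match($flat, $word, $m);
    $flatMatches[] = $m[0];
}

print_r($nestedMatches); // book, bookkeeper, bookkeepers
print_r($flatMatches);   // book, book, book
```

If you only need to know whether there is a match, both patterns give the same answer. If you need the matched text, the order of the alternatives matters.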

Capturing
The grouping metacharacters, that is, the parentheses (), also serve another, completely different purpose: they allow the capture of the sub strings in the subject string that matched. A pattern is not usually an exact word or an exact phrase. After matching has occurred, can you know the exact word or phrase in the subject string that was matched? Yes, you can, thanks to grouping (parentheses).

Consider the following subject string:

"This is one and that is two."

The pattern /(one).*(two)/ matches the sub string “one and that is two” in the subject. The matched sub string can be captured in an array. The portions of the subject string matched by each subpattern (group) can also be captured in the same array. In this case, the “one” and “two” in the matched sub string can be captured.

The following code illustrates this:

<html>
<head>
</head>
<body>
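<?php
   // $matches receives the whole match and each captured group.
   if (preg_match("/(one).*(two)/", "This is one and that is two.", $matches))
    echo "Matched" . "<br />";
   else
    echo "Not Matched" . "<br />";

   // These indexes exist only because the match above succeeds.
   echo $matches[0] . "<br />"; // whole match
   echo $matches[1] . "<br />"; // first group
   echo $matches[2] . "<br />"; // second group
?>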
</body>
</html>

The output of the code is:

Matched

one and that is two
one
two

Let us look at the code first before we look at its output. The if-condition is:

preg_match("/(one).*(two)/", "This is one and that is two.", $matches)

The regex and the subject string are the ones we mentioned above. Look carefully at the arguments of the preg_match() function; notice that a new argument has been added. There are now three arguments. The third argument is optional; it is an array. The variable identifying the array, here, is $matches.

There are two subpatterns in the regex. After execution of the matching process (the if-condition), the sub string in the subject that matches the whole pattern goes into $matches[0]. The sub string in the subject that matches the first subpattern (group) goes into $matches[1]. The sub string in the subject that matches the second subpattern goes into $matches[2]. Note: The array acquires these sub strings only if there is a match; if there is no match, the array does not acquire any sub strings.

In the code above, if there is a match, the code echoes “Matched”. If there is no match, the code echoes “Not Matched”. The last code segment of the script echoes the first three elements of the array. From the explanation given above, there are exactly three elements in the array here: one for the whole match and one for each of the two groups. This explains the output.

So, to capture the sub string matched by the whole regex and the sub strings matched by its subpatterns, you need an array as the third argument of the preg_match() function. The whole regex gives rise to the value of the first element of the array. The first subpattern gives rise to the value of the second element. The second subpattern gives rise to the value of the third element. The rest of the subpatterns fill the array in that order. The length of the array is the number of sub strings captured: the sub string in the subject that corresponds to the whole regex, plus the sub strings that correspond to the subpatterns (in parentheses) in the regex. Remember, if there is no match, then there are no elements in the array.
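Since the array is empty when there is no match, it is safer to read its elements only after checking the return value. Here is a minimal sketch of that idea; describeMatch() is a hypothetical helper of my own, not a PHP function, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: returns the captured sub strings, or an empty
// array when the pattern does not match the subject.
function describeMatch(string $pattern, string $subject): array
{
    $matches = [];
    if (preg_match($pattern, $subject, $matches) === 1) {
        return $matches;
    }
    return [];
}

$found    = describeMatch("/(one).*(two)/", "This is one and that is two.");
$notFound = describeMatch("/(one).*(two)/", "Nothing to see here.");

echo count($found) . "\n";    // 3: whole match plus two groups
echo count($notFound) . "\n"; // 0: no match, so no elements
```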

What about Nested Groups
Consider the following code:

<html>
<head>
</head>
<body>
<?php
   // Nested groups: group 1 is (boy(s|)|), group 2 is (s|).
   preg_match("/school (boy(s|)|)/", "I like school boys.", $matches);

   echo $matches[0] . "<br />"; // whole match
   echo $matches[1] . "<br />"; // outer group
   echo $matches[2] . "<br />"; // inner group
?>
</body>
</html>

The output of the code is:

school boys
boys
s

The whole pattern would match “school boys”, “school boy” or “school ” (with the trailing space). We have two groups, one inside the other.

The outer group in the regex is (boy(s|)|) and the inner group is (s|). The outer group corresponds to “boys” in “school boys”. The inner group corresponds to “s” at the end of “school boys”.

Let me explain this capturing further. “(boy(s|)|)” means “boy(s|)” or nothing, and “boy(s|)” means “boys” or “boy”; so the “boys” next to “school ” is captured. “(s|)” is a group, and any group can be captured; it means “s” or nothing. Note that a group is not matched on its own; it is the whole pattern that is matched. The matched sub string that contains our “s” is “school boys”. As “school boys” is matched, our “s” is captured.

Capturing and matching are not the same thing. After matching occurs, the portion of the matched sub string that corresponds to each group is captured.

Now, you may ask: “With nested groups, how do we know which group is the first, which is the second, which is the third, and so on?” The answer is that groups are numbered by their opening parentheses, from left to right. The leftmost opening parenthesis “(” starts the first group; the second opening parenthesis from the left starts the second group; the next one starts the third group, and so on.
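The numbering rule can be checked with a small sketch. The date pattern and the subject “2012-08-11” are my own illustrative choices; the pattern is in single quotes so that PHP leaves the backslashes alone.

```php
<?php
// Opening parentheses, counted from the left:
//   1st "(" -> group 1: year and month together
//   2nd "(" -> group 2: year
//   3rd "(" -> group 3: month
//   4th "(" -> group 4: day
$pattern = '/((\d{4})-(\d{2}))-(\d{2})/';

preg_match($pattern, "2012-08-11", $matches);

print_r($matches);
// [0] => 2012-08-11
// [1] => 2012-08
// [2] => 2012
// [3] => 08
// [4] => 11
```

Group 1 starts before group 2 even though group 2 closes first; only the position of the opening parenthesis decides the number.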

Time to take a break. We continue in the next part of the series.

Chrys
