Broad Network


Extra Features in PHP Regular Expressions

PHP Regular Expressions – Part VIII

Foreword: What we have learned so far solves many problems, but sooner or later you will want to do more with regex. This last part of the series builds on what we have learned and introduces some extra regex features in PHP.

By: Chrysanthus Date Published: 11 Aug 2012

Introduction

We have learned a lot about regular expressions in PHP, and what we have learned solves many problems. However, there will come a time when you want to do more with regex. This last part of the series adds to what we have learned and introduces you to some extra features in PHP. I intend to cover further features in separate articles.

Internal Option Setting
You can embed modifiers in the regex itself (inside the pattern). I will use the case-insensitive modifier, i, to illustrate this. Remember, this modifier makes matching case insensitive. When you embed a modifier, it takes effect from the point of embedding to the end of the regex. The exception is when the modifier is inside a subpattern (see below). To embed a modifier, type its letter between “(?” and “)”, as in “(?i)”.

Consider the subject,

              "XYZ"

and the regex,

              "/(?i)xyz/"

Note the group, “(?i)”, which carries the i modifier. The above regex matches the whole of the above subject, since the modifier is the first element in the regex. The following expression produces a match:

        preg_match("/(?i)xyz/", "XYZ")

Consider the following regex:

              "/xy(?i)z/"

Here, the modifier has been put just before the last character, ‘z’. When the modifier is included in the regex, it has effect from the point of inclusion to the end of the regex. So the above regex would match, “xyZ” or “xyz”.

Putting the modifier at the end, just before the last forward slash has no effect. The following expression does not produce a match:

            preg_match("/xyz(?i)/", "XYZ")

Modifiers embedded in this way are called internal options.
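
To see how the position of the embedded option changes the result, here is a small sketch that runs the regexes above against a few subjects. The variable names are my own, chosen only for illustration.

```php
<?php
// preg_match() returns 1 on a match and 0 otherwise, so "=== 1" gives a boolean.
$start  = preg_match("/(?i)xyz/", "XYZ") === 1;  // option covers x, y and z
$middle = preg_match("/xy(?i)z/", "xyZ") === 1;  // option covers z only
$upperX = preg_match("/xy(?i)z/", "XYZ") === 1;  // x and y are still case sensitive
$end    = preg_match("/xyz(?i)/", "XYZ") === 1;  // option comes after every character

var_dump($start);   // bool(true)
var_dump($middle);  // bool(true)
var_dump($upperX);  // bool(false)
var_dump($end);     // bool(false)
```

Only the characters that come after “(?i)” are matched without regard to case.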

Now, the following two regexes are the same:

            /(?i)xyz/

and

             /xyz/i

For the second one, the whole regex is case insensitive; we have seen this. For the first one, the whole regex is case insensitive because the modifier is at the beginning of the regex (inside the pattern).

You can unset a modifier by preceding its letter inside the pattern with a hyphen. Let us now look for a regex that can match "XYz" or "Xyz" or "xYz" or "xyz". The regex for these subjects is:

            /xy(?-i)z/i

Note that at the end of the regex, you have the i modifier, which makes the whole regex case insensitive. So in the regex, x and y are case insensitive. However, the case insensitivity of z has been unset by the option (?-i) in front of it. So z in the regex is in lower case and only matches a lower case z in the subject.
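
The effect of unsetting can be checked with a short sketch like the one below. The list of subjects and the variable names are assumptions made for the illustration; the last two subjects are included to show what is rejected.

```php
<?php
$pattern  = "/xy(?-i)z/i";
$subjects = ["XYz", "Xyz", "xYz", "xyz", "xyZ", "XYZ"];

$results = [];
foreach ($subjects as $subject) {
    // x and y match in either case; z must be lower case.
    $results[$subject] = preg_match($pattern, $subject) === 1;
}

var_dump($results);
// "XYz", "Xyz", "xYz" and "xyz" give true; "xyZ" and "XYZ" give false.
```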

Internal options can be used with long subject strings as well. The following expression produces a match.

        preg_match("/the I(?i)nternet/", "I work with the Internet.")

The regex would match “the Internet” or “the INTERNET”, but not “the internet”, because the I comes before the internal option and must be in upper case.

Embedding Comments in a Regular Expression
You use the following tag to insert a comment into your regex:

             (?#Comment)

You start with ‘(?#’, type your comment, and then end with ‘)’. The regex, /the I(?i)nternet/ can be commented as follows:

/the I(?# the first part of the regex)(?i)nternet(?# I for Internet must be in upper case)/
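
A quick way to convince yourself that such comments do not change the matching is to run the plain and the commented regex side by side. The following is only a sketch; the subjects and variable names are my own.

```php
<?php
$plain     = "/the I(?i)nternet/";
$commented = "/the I(?# the first part of the regex)(?i)nternet(?# I for Internet must be in upper case)/";

$subjects = [
    "I work with the Internet.",
    "I work with the INTERNET.",
    "I work with the internet.",
];

$plainResults     = [];
$commentedResults = [];
foreach ($subjects as $subject) {
    $plainResults[]     = preg_match($plain, $subject);
    $commentedResults[] = preg_match($commented, $subject);
}

print_r($plainResults);      // 1, 1, 0
print_r($commentedResults);  // 1, 1, 0
```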

We saw in part VI how the x modifier lets you include comments in a regex. The tag “(?#Comment)” is good when your regex and its comments are on one line. If you want your regex and its comments to span more than one line, use the x modifier. With x, unescaped white space in the pattern is ignored and everything from an unescaped # to the end of the line is a comment, so any white space that must be matched literally has to be escaped. That is why the space in “the I” is escaped below:

     $re = "/the\ I# the first part of the regex
           nternet# I for Internet must be in upper case
           /x";

The above literal is assigned to a variable, and the variable is then passed to the preg_match() function as follows:

     preg_match($re, "I work with the Internet.")

You can also pass the literal directly:

preg_match("/the\ I# the first part of the regex
                   nternet# I for Internet must be in upper case
                  /x", "I work with the Internet.")

Note: the “(?#Comment)” tag cannot be nested. You cannot have “(?#Comment(?#Comment))” in a regex.
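
When you use the x modifier, escaping the white space matters. The sketch below compares an unescaped and an escaped space in a multi-line regex; the variable names are my own. In a double-quoted PHP string, a backslash followed by a space is left as it is, so the regex engine receives the escaped space.

```php
<?php
$subject = "I work with the Internet.";

// Unescaped space: ignored under x, so this is effectively /theInternet/.
$unescaped = "/the I   # the first part of the regex
              nternet  # I for Internet must be in upper case
             /x";

// Escaped space: kept as a literal space, so this is effectively /the Internet/.
$escaped = "/the\ I  # the first part of the regex
            nternet  # I for Internet must be in upper case
           /x";

$unescapedResult = preg_match($unescaped, $subject);
$escapedResult   = preg_match($escaped, $subject);

var_dump($unescapedResult);  // int(0)
var_dump($escapedResult);    // int(1)
```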

Non-capturing Subpatterns
A subpattern is a part of a regex enclosed in parentheses. By default, the text matched by each subpattern is captured into an array. This array is the variable you pass as the third argument of the preg_match() function. Consider the following code:

<html>
<head>
</head>
<body>
<?php
   if (preg_match("/(one).*(two)/", "This is one and that is two.", $matches))
    echo "Matched" . "<br />";
   else
    echo "Not Matched" . "<br />";

   echo "<br />";
   echo $matches[0] . "<br />";
   echo $matches[1] . "<br />";
   echo $matches[2] . "<br />";
?>
</body>
</html>

This is the output of the above code:

Matched

one and that is two
one
two

The first item in the output is “Matched”. This is displayed by the if-statement in the code when preg_match() finds a match. The next three lines in the output are the elements captured and stored in the array, $matches, by the preg_match() function. The first element in the array is the complete substring matched in the subject string. The next two elements are the substrings captured by the subpatterns. The two subpatterns are “(one)” and “(two)”. So “one” and “two” in the subject string are captured.

You may not want to capture every subpattern. If you do not want to capture a subpattern, precede its content with “?:”. To prevent the subpattern “(one)” above from being captured, write it as “(?:one)”. The parentheses still group their content as before, but the matched text is not captured. The following code illustrates this:

<html>
<head>
</head>
<body>
<?php
   if (preg_match("/(?:one).*(two)/", "This is one and that is two.", $matches))
    echo "Matched" . "<br />";
   else
    echo "Not Matched" . "<br />";
   
   echo "<br />";
   echo $matches[0] . "<br />";
   echo $matches[1] . "<br />";   // "two" is the only captured substring, so there is no $matches[2]
?>
</body>
</html>

The output of the code is:

Matched

one and that is two
two

The last two lines are the elements of the array. The first element of this array is the entire substring matched. The rest of the elements are the substrings captured. We prevented the first subpattern, “(one)”, from being captured by transforming it into “(?:one)”. From the output, we see that “one” of the subject string has not been captured, as we expected. “two” has been captured.

The tag for making a group non-capturing is

          (?:subpattern)
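
To compare the two cases directly, the following sketch fills one array with the capturing version and another with the non-capturing version. The array names are my own.

```php
<?php
$subject = "This is one and that is two.";

preg_match("/(one).*(two)/", $subject, $capturing);
preg_match("/(?:one).*(two)/", $subject, $nonCapturing);

print_r($capturing);
// [0] => one and that is two, [1] => one, [2] => two

print_r($nonCapturing);
// [0] => one and that is two, [1] => two
```

Note that “two” moves from index 2 to index 1 once “(one)” is no longer captured, so any code that reads the array by index has to be adjusted.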

Including Modifiers in Non-Capturing Subpatterns
We have seen how you can embed modifiers in a regex. You may want to include a modifier in a non-capturing subpattern. There are two ways of doing this. Let us say you want to include the modifier, i, in the non-capturing subpattern “(?:one)” above. You can do it like this:

          (?:(?i)one)

or like this

          (?i:one)

The first method above is the more obvious way (based on what we have learned). The second method is like a contraction of the first method. The following expression produces a match:

         preg_match("/(?:(?i)one).*(two)/", "This is ONE and that is two.")

The following expression also produces a match.

         preg_match("/(?i:one).*(two)/", "This is ONE and that is two.")
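
As a check that the two forms behave the same, the sketch below runs both and compares what they capture. The variable names are assumptions made for the illustration.

```php
<?php
$subject = "This is ONE and that is two.";

$longForm  = preg_match("/(?:(?i)one).*(two)/", $subject, $longMatches);
$shortForm = preg_match("/(?i:one).*(two)/", $subject, $shortMatches);

var_dump($longForm, $shortForm);           // int(1) and int(1)
var_dump($longMatches === $shortMatches);  // bool(true)
print_r($shortMatches);
// [0] => ONE and that is two, [1] => two
```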

Modifiers in Subpatterns
We said at the beginning of this part of the series, that a modifier embedded in a regex, has its effects from the point of inclusion to the end of the regex. The question you may have is this: “If the modifier is in a subpattern, would it have its effect only in the subpattern or right to the end of the regex out of the subpattern?”

Let us just write two short scripts to verify that. This is the first:

<html>
<head>
</head>
<body>
<?php
   if (preg_match("/(?i:one).*(two)/", "This is ONE and that is TWO."))
    echo "Matched" . "<br />";
   else
    echo "Not Matched" . "<br />";
?>
</body>
</html>

The above script does not produce a match. In the above script, the i modifier is inside a non-capturing subpattern. The word “TWO” inside the subject is in upper case. A match is not produced.

In the following script, we are not dealing with a non-capturing subpattern; we are dealing with a capturing subpattern.

<html>
<head>
</head>
<body>
<?php
   if (preg_match("/((?i)one).*(two)/", "This is ONE and that is TWO."))
    echo "Matched" . "<br />";
   else
    echo "Not Matched" . "<br />";
?>
</body>
</html>

The above script does not produce a match. In the above script, the i modifier is inside a capturing subpattern. The word “TWO” inside the subject is in upper case. A match is not produced.

We conclude for this section that if a modifier is in a subpattern, captured or non-captured, it has its effect only on that subpattern and not outside the subpattern. If the modifier is not inside a subpattern, it has its effect from its point of insertion to the end of the regex.
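
The conclusion can be summarized in one sketch that runs four variants against the same subject. The last two patterns and the array keys are my own additions for comparison; they show what happens when the modifier is not confined to a subpattern.

```php
<?php
$subject = "This is ONE and that is TWO.";

$patterns = [
    "in non-capturing group" => "/(?i:one).*(two)/",
    "in capturing group"     => "/((?i)one).*(two)/",
    "outside any group"      => "/(?i)one.*(two)/",
    "trailing modifier"      => "/(one).*(two)/i",
];

$results = [];
foreach ($patterns as $label => $pattern) {
    $results[$label] = preg_match($pattern, $subject);
}

print_r($results);
// in non-capturing group => 0
// in capturing group     => 0
// outside any group      => 1
// trailing modifier      => 1
```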

That is it for this section.

And, finally we have come to the end of the series. We saw so many things. There are still some extra features in PHP regexes to be seen. I intend to address the extra issues as independent articles. I hope you appreciated this series.

Chrys
