Broad Network


PHP Regular Expression Basics and Security Risks

PHP Regular Expressions with Security Considerations - Part 1

Foreword: In this part of the series I talk about the basics of what is known as PHP Regular Expression.

By: Chrysanthus Date Published: 18 Jan 2019

Introduction

This is part 1 of my series, PHP Regular Expressions with Security Considerations. In this part of the series I talk about the basics of what is known as PHP Regular Expression. I will give you samples of code that you can try. You should have read the previous series before coming here, as this is the continuation.

Motivation
Consider the string,
       
          “This is a man”.

Assume that you do not know the content of the string; the string might have been typed by the user and PHP code has assigned it to a variable. You may have the following two questions:

- Does the string have the word, “man”?
- If the string has the word, “man”, can you change it to “woman”?

There are many other questions similar to the above two, some of them rather complex. Handling such questions in code is the subject called Regular Expressions, abbreviated, Regex.

The Word, Regex
In the above example, “man” is a Regex. More generally, a Regex describes a sub string of characters whose existence in some subject string you want to check. This subject string might also have been assigned to a variable.

Matching
When the Regex is seen in the subject string, we say matching has occurred. That is, the Regex has matched the string. When matching occurs, replacement can follow. If the regex, “man” in the above example is seen, it can be replaced by the word “woman”.
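The two questions from the Motivation section can be sketched in code as follows. This is only an illustration: preg_match() is introduced in the next section, and preg_replace() is a standard PHP function that this part of the series does not cover. The variable names $found and $result are my own.

```php
<?php

// The subject from the Motivation section; in practice it might come from the user.
$subject = "This is a man";

// Question 1: does the subject have "man"? preg_match() returns 1 on a match.
$found = preg_match("/man/", $subject);

// Question 2: if it does, replace "man" with "woman".
// preg_replace() returns the new string and leaves $subject unchanged.
$result = ($found === 1) ? preg_replace("/man/", "woman", $subject) : $subject;

echo $found, "\n";   // 1
echo $result, "\n";  // This is a woman
```

Note that "/man/" would also be seen inside a longer word such as “human”; the word boundary, discussed later in this part, is one way to narrow the match.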

The preg_match() Function
In simple terms, this PHP function is:

    int preg_match ( string $regex , string $subject);

where regex, in general is,

    "/pattern/"

In the above case, you would have,

    "/man/"

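The quotes may be single or double. If single, the string behaves like a nowdoc: variables are not expanded and escape sequences (other than \\ and \') are not interpreted by PHP. If double, it behaves like a heredoc: variables are expanded and escape sequences such as \t are interpreted by PHP. $subject is the string where the search is to take place.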

Simple Word Matching
Consider the following code:

<?php

        $ret = preg_match("/World/", "Hello World");
        echo $ret;

?>

If you try the above code, you would have the output, 1.

The first statement uses the preg_match() function. The first argument of the preg_match() function is "/World/". The second argument is "Hello World"; this is a string literal; it is the subject string, where the search will be made.

The regex is

     "/World/"    

Here, the regex is made up of the word, “World”, preceded by a forward slash and terminated by another forward slash; all that in quotes.

The subject string is:

            "Hello World"

Now, if “World” is found in the subject string, the preg_match() function returns 1. If there is no matching, that is, if the sub string is not found in the subject, the preg_match() function returns 0.

In many cases, you would just want to know if matching occurs or not. For that, you can use the following code:

<?php

        if (preg_match("/World/", "Hello World") == 1)
            {
               echo 'Matched';
            }
        else
            {
               echo 'Not Matched';
            }

?>

Or

<?php

        if (preg_match("/World/", "Hello World"))
            {
               echo 'Matched';
            }
        else
            {
               echo 'Not Matched';
            }

?>

These two code samples behave the same way. In the first, the return value is compared with 1, so the condition is true when preg_match() returns 1 and false when it returns 0. In the second, PHP converts the return value to a boolean: 1 becomes true and 0 becomes false.

Note: Matching is case sensitive. So if we had “World” in the regex as “world” with the W in lower case, the if-condition would not hold, and our code would display, “Not Matched”.
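The following short sketch shows the case sensitivity in action. The last call uses the i modifier, placed after the closing forward slash, which asks the regex engine to ignore case; that modifier is not covered in this part of the series and is shown here only for contrast. The variable names are my own.

```php
<?php

$subject = "Hello World";

$exact  = preg_match("/World/", $subject);  // same case as in the subject
$lower  = preg_match("/world/", $subject);  // lower case w: no match
$nocase = preg_match("/world/i", $subject); // i modifier: case is ignored

echo $exact, $lower, $nocase, "\n"; // 101
```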

You can have the regex and the subject as string variables. The following code illustrates this:

<?php

        $re = "/World/";
        $subject = "Hello World!";

        if (preg_match($re, $subject))
            {
               echo 'Matched';
            }
        else
            {
               echo 'Not Matched';
            }

?>

In this code, you have the variables,

   $re = "/World/";
   $subject = "Hello World!";

The if-condition is now:

        if (preg_match($re, $subject))

The first argument for the preg_match() function is, $re, and the second argument is, $subject.

Meaning of Pattern
Consider the following string assigned to the variable, $subject.

   $subject = "Examples of creatures are the bat, the cat and the rat. ";

You may want to know if any of the words, “bat”, “cat” or “rat” exists in the string. Examining the string, we see that “bat”, “cat” and “rat” each end in “at”. The following regex will be used to determine if “bat”, or “cat” or “rat” exists in the string:

  $re = "/[bcr]at/";

Note the square brackets around “bcr”; b is the first letter in “bat”; c is the first letter in “cat” and r is the first letter in “rat”. These first letters are inside the square brackets. After the square brackets, you have the next two letters that are common in the three words and follow the different first letters.

The following script will produce a match at the browser:

<?php

        $re = "/[bcr]at/";
        $subject = "Examples of creatures are the bat, the cat and the rat.";

        if (preg_match($re, $subject))
            {
                echo 'Matched';
            }
        else
            {
                echo 'Not Matched';
            }

?>

Now, the regular expression content is

                   [bcr]at

The two forward slashes added to the ends (as shown below) make the above expression a regular expression.

                    /[bcr]at/

What you have inside the two forward slashes is a pattern that describes a set of words (bat, cat and rat).

In this subject (Regular Expressions) the content inside the two forward slashes is called a pattern. So far, you have seen two types of patterns in code, one of them, /[bcr]at/ that describes a set of words and another, /World/ that describes only one word. The two forward slashes are the delimiters of the pattern. I will show you many more patterns in this series. The pattern and its delimiters are together called the regex. Well, in some documents, distinction is not made between pattern and regex.
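To see which words the pattern actually describes, you can ask PHP for the matched text. The sketch below assumes two things not discussed above: the optional third argument of preg_match(), which receives the first match, and the function preg_match_all(), which collects every match. Both are standard PHP; the variable names $first, $all and $count are my own.

```php
<?php

$re = "/[bcr]at/";
$subject = "Examples of creatures are the bat, the cat and the rat.";

// The optional third argument receives the text of the first match.
preg_match($re, $subject, $first);

// preg_match_all() collects every match and returns how many it found.
$count = preg_match_all($re, $subject, $all);

echo $first[0], "\n";               // bat
echo $count, "\n";                  // 3
echo implode(", ", $all[0]), "\n";  // bat, cat, rat
```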

Some Special Characters
There are some ASCII characters, which don't have printable character equivalents and are instead represented by escape sequences. Common examples are \t for a horizontal tab, \n for a newline, \r for a carriage return and \a for a bell.

The Horizontal Tab
If you want a horizontal tab to appear in text, you should type “\t” in the text. Consider the following:

    $subject = "\tThis is a new section and it continues as a paragraph.";

Note the ‘\t’ for a horizontal tab at the beginning of the subject.

You might want to match the horizontal tab, \t. Your regular expression would be

            "/\\t/"

With the above, the following code produces a match

<?php

        $re = "/\\t/";
        $subject = "\tThis is a new section and it continues as a paragraph.";

        if (preg_match($re, $subject))
            {
                echo 'Matched';
            }
        else
            {
                echo 'Not Matched';
            }

?>

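So, to match a tab in the subject, just use \t in the pattern. In a double-quoted PHP string, that is typed as "\\t", so that the backslash and the t both reach the regex function.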
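The way the pattern is quoted decides who interprets the \t: PHP or the regex engine. The following sketch compares the three spellings; it is an illustration under the standard PHP quoting rules, and the variable names $a, $b and $c are my own.

```php
<?php

$subject = "\tThis is a new section and it continues as a paragraph.";

// Double quotes, doubled backslash: PHP passes the two characters \t to the
// regex engine, which reads them as a tab.
$a = preg_match("/\\t/", $subject);

// Single quotes: PHP leaves \t alone, so the regex engine again sees \t.
$b = preg_match('/\t/', $subject);

// Double quotes, single backslash: PHP itself turns \t into a real tab
// character, and the regex engine matches that character literally.
$c = preg_match("/\t/", $subject);

echo $a, $b, $c, "\n"; // 111

// A subject without a tab does not match.
$none = preg_match("/\\t/", "No tab here");
echo $none, "\n"; // 0
```

All three spellings match here. The difference matters for escape sequences that PHP and the regex engine do not share.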

Hexadecimal Numbers
Hexadecimal numbers can be written as:

             \xhh      e.g  \xBF


I will not give you explanation about hexadecimal numbers in this series; just know that you will find many examples like the above.

The notation for matching hexadecimal numbers is,

                      \xhh

where h is a hexadecimal digit.

If you only want to match the character that has a given hexadecimal code, the regex is:

            /\xhh/

Characters can be represented by escaped hexadecimal numbers. The following code produces a match:

<?php

        $re = "/\x61\x74/";
        $subject = "cat";

        if (preg_match($re, $subject))
            {
                echo 'Matched';
            }
        else
            {
                echo 'Not Matched';
            }

?>

A match is produced, because the hexadecimal number for the character, ‘a’ is \x61 and that for ‘t’ is \x74.
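It may help to see what PHP does with the hexadecimal escapes before the regex function receives the pattern. In the sketch below, the variable names $double and $single are my own; the behavior shown follows from the standard PHP quoting rules.

```php
<?php

$subject = "cat";

// Double quotes: PHP converts \x61 and \x74 itself, so the pattern is "/at/".
$double = "/\x61\x74/";

// Single quotes: the backslashes survive, and the regex engine decodes
// \x61 and \x74.
$single = '/\x61\x74/';

echo $double, "\n";                        // /at/
echo $single, "\n";                        // /\x61\x74/
echo preg_match($double, $subject), "\n";  // 1
echo preg_match($single, $subject), "\n";  // 1
```

Either way, the subject “cat” matches, because both patterns end up describing the characters ‘a’ and ‘t’.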

Word Boundary
A word boundary is the boundary between a word character and a non-word character.

Consider the following strings:

             “one two three four five”

             “one,two,three,four,five”

             “one, two, three, four, five”

             “one-two-three-four-five”

The following conditional will produce a match:

            if (preg_match("/\b/", "one two three four five"))

The notation, ‘\b’ is used to match a word boundary. In the above conditional, it is the boundary at the very beginning of the string, just before the word, “one”, that has been matched. If you want to match the boundary between the word “one” and the space that follows it, you have to modify the regex to:

              /one\b/

Here, you have the word ‘one’, followed by ‘\b’. The pattern, one\b is what is matched.

The following conditional will produce a match:

            if (preg_match("/one\b/", "one two three four five"))

“\b” indicates a word boundary. The following conditional will not produce a match:

            if (preg_match("/on\be/", "one two three four five"))

This is because the “\b”, at its position, does not correspond to a word boundary (it is inside the word, ‘one’).

Now, the following conditional will produce a match:

            if (preg_match("/two\b/", "one,two,three,four,five"))

Here the string portion ‘two’ is what has been matched; the “\b” itself does not consume any character. The “\b” corresponds to the boundary between the word “two” and the comma that follows it. The following conditional will also produce a match:

            if (preg_match("/two\b/", "one, two, three, four, five"))

Here, though there is a space between the comma and the word, “three”, the “\b” still corresponds to the boundary between the word, “two” and the comma that follows it; the comma is a non-word character and so there is a boundary between the word, “two” and the comma.

Now, the following conditional will produce a match:

            if (preg_match("/three\b/", "one-two-three-four-five"))

Here the string portion ‘three’ is what has been matched. The “\b” corresponds to the boundary between the word “three” and the character, “-” that follows it. The character, “-” is a word separator; it separates two words joined together, and it is not a word character.

The following conditional will produce a match:

            if (preg_match("/five\b/", "one two three four five"))

Here the “\b” corresponds to the boundary between the word, “five” and the end of the string. The closing double quotation mark is not part of the string.
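The conditionals of this section can be run together, as in the following sketch. The array $tests and the use of the PREG_OFFSET_CAPTURE flag (a standard preg_match() flag that reports where a match starts) are my own additions for illustration. The array unpacking in the foreach needs PHP 7.1 or later.

```php
<?php

// Single quotes keep \b intact for the regex engine.
$tests = [
    ['/one\b/',   'one two three four five'],  // boundary before the space
    ['/on\be/',   'one two three four five'],  // no boundary inside "one"
    ['/two\b/',   'one,two,three,four,five'],  // boundary before the comma
    ['/three\b/', 'one-two-three-four-five'],  // boundary before the hyphen
    ['/five\b/',  'one two three four five'],  // boundary at the end of the string
];

$results = [];
foreach ($tests as [$re, $subject]) {
    $results[] = preg_match($re, $subject);
}
echo implode(" ", $results), "\n"; // 1 0 1 1 1

// A bare \b matches an empty string at offset 0, just before the "o" of "one".
preg_match('/\b/', 'one two three four five', $m, PREG_OFFSET_CAPTURE);
echo $m[0][1], "\n"; // 0
```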

Combining with Other Characters
You can combine the special characters above with other characters as you have seen. The following expression will produce a match:

            if (preg_match("/five\b six/", "one two three four five six"))

This is similar to the last example we saw. You have the word, “five” followed by \b, a space and then “six” in the regex.

Security Consideration
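The conditions,

        if (preg_match("/World/", "Hello World") == 1)
and
        if (preg_match("/World/", "Hello World"))

behave the same way. The first does a loose comparison with ==; the second converts the return value to a boolean. Neither checks the type of the return value.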

The preg_match() function returns 1 if the pattern matches the given subject, 0 if it does not, or FALSE if an error occurred (for example, if the coding in the pattern does not make sense).

Solution: Since both conditions above treat 0 (no match) and false (error) alike, without any type check, henceforth use the strict comparison operator, ===, instead of

        if (preg_match("/World/", "Hello World") == 1)
Or
        if (preg_match("/World/", "Hello World"))

as in the following code:

<?php

        if (preg_match("/World/", "Hello World") === 1)
            {
             echo 'Matched';
            }
        else
            {
             echo 'Not Matched or Error occurred!';
            }

?>
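If you need to tell an error apart from an ordinary failure to match, you can test the return value against false with === as well. The following is a minimal sketch; the function name matchStatus() is hypothetical and not part of PHP, and the pattern "/[/" is simply an example of a pattern that does not make sense (the square bracket is never closed).

```php
<?php

// Hypothetical helper (not a PHP built-in): reports which of the three
// possible outcomes preg_match() produced.
function matchStatus(string $re, string $subject): string
{
    // @ hides the warning PHP raises for a bad pattern; the return value
    // is still false in that case.
    $ret = @preg_match($re, $subject);

    if ($ret === false) {
        return 'Error occurred';
    }
    return ($ret === 1) ? 'Matched' : 'Not Matched';
}

echo matchStatus("/World/", "Hello World"), "\n"; // Matched
echo matchStatus("/world/", "Hello World"), "\n"; // Not Matched
echo matchStatus("/[/", "Hello World"), "\n";     // Error occurred
```

In real code you would probably log the error rather than hide it; the @ is used here only to keep the output clean.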

Well, let us take a break at this point. We continue in the next part of the series.

Chrys

