Broad Network


PHP Assertions for Regular Expression

Advanced PHP Regular Expressions - Part 3

Foreword: In this part of the series, I talk about PHP Assertions for Regular Expression.

By: Chrysanthus Date Published: 11 Jul 2019

Introduction

This is part 3 of my series, Advanced PHP Regular Expressions. In this part of the series, I talk about PHP Assertions for Regular Expression. In this series, the word “subject” means the string in which the regular expression looks for a match. Note: the abbreviation for Regular Expression in this series is regex. You should have read the previous part of the series before reaching here, as this is a continuation.

Simple Assertions

The \b Assertion
This is an escape sequence used in the regex (pattern). It indicates a word boundary and consumes no character in the subject: it matches a position in the subject, and a position has no width. A word character is a letter, a digit or an underscore. A word boundary is a position that has a word character on one side and a non-word character (or nothing) on the other side. So, the boundary between a word character and a space is a word boundary, and so is the boundary between a word character and a hyphen. The start of a string is a word boundary if the first character is a word character. The end of a string is a word boundary if the last character is a word character. \b will also match the boundary between a word character and a punctuation mark, and the boundary between a word character and a special character like * . Read and test the following code:

<?php

        $subject = "This is a self-reliance project.";
        $regex = '/\bThis|\ba|self\b|project\b/';

        preg_match_all($regex, $subject, $matches);

        // $matches[0] holds every full match, in order of occurrence
        foreach ($matches[0] as $match)
            {
                echo $match, '<br>';
            }

?>

The output is:

    This
    a
    self
    project

Note: in order to match more than one occurrence in the string, I had to use the preg_match_all() function.
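To make the effect of \b easier to see, here is a minimal sketch that counts the matches for the same two letters with and without word boundaries. The variable names ($plainCount, $wordCount and so on) are my own choice for illustration; the subject is the one used above.

```php
<?php

        $subject = "This is a self-reliance project.";

        // Without \b, "is" is also found inside the word "This".
        $plainCount = preg_match_all('/is/', $subject, $plainMatches);

        // With \b on both sides, only the standalone word "is" is found.
        $wordCount = preg_match_all('/\bis\b/', $subject, $wordMatches);

        echo $plainCount, '<br>';   // 2
        echo $wordCount, '<br>';    // 1
```

The first regex finds “is” twice (inside “This” and as a word on its own); the second finds only the word on its own, because the “is” in “This” has a word character on its left.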

\B
This is the opposite of \b: it matches a position that is not a word boundary. This means it matches between two word characters of a word (and also between two non-word characters). Try the following code:

<?php

        $subject = "This is a self-reliance project.";
        $regex = '/proj\B/';

        preg_match($regex, $subject, $matches);

        echo $matches[0], '<br>';

?>

The output is:

    proj

I do not yet know a practical use for \B.
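To make the contrast with \b concrete, the following sketch looks for “is” once with \B in front and once with \b in front, and prints where each match starts. The PREG_OFFSET_CAPTURE flag is a standard flag of preg_match_all(); the variable names are assumptions made for this illustration only.

```php
<?php

        $subject = "This is a self-reliance project.";

        // \Bis: no word boundary just before "is", so "is" must sit inside a word.
        preg_match_all('/\Bis/', $subject, $inside, PREG_OFFSET_CAPTURE);

        // \bis: a word boundary just before "is", so "is" must start a word.
        preg_match_all('/\bis/', $subject, $atStart, PREG_OFFSET_CAPTURE);

        // Each entry is array(matched text, offset in the subject).
        echo $inside[0][0][1], '<br>';    // 2, the "is" inside "This"
        echo $atStart[0][0][1], '<br>';   // 5, the separate word "is"
```

Each regex finds exactly one match here, and the two offsets differ, which shows that \B and \b accept complementary positions.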

The ^ Assertion
An assertion is a pattern component that does not consume any character in the subject. The ^ assertion matches (asserts) the start of the subject string, or the start of each line in the subject if the subject has more than one line and the m modifier is used. By default, PCRE takes “\n” as the end of a line, so a line that ends with “\r\n” is recognized as well (the “\r” then counts as an ordinary character at the end of the line). Try the following code:

<?php

        $subject = "The first sentence.\r\nAnd the second one.";
        $regex = '/^.../m';

        preg_match_all($regex, $subject, $matches);

        echo $matches[0][0], '<br>';
        echo $matches[0][1], '<br>';

?>

The output is:

    The
    And

Note: when the m modifier is used, the regex (pattern) is said to be in multi-line mode. In the absence of the m modifier, only the start of the subject string is matched (whether or not the preg_match_all() function is used, for a subject with many lines).

The preg_match_all() function enables the different occurrences to be sent to the receiving array.
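The difference that the m modifier makes can be sketched by running the same pattern with and without it. This is only an illustration under the same subject as above; the variable names are my own.

```php
<?php

        $subject = "The first sentence.\r\nAnd the second one.";

        // Without m, ^ asserts only the start of the whole subject.
        $withoutM = preg_match_all('/^.../', $subject, $single);

        // With m, ^ also asserts the position just after each "\n".
        $withM = preg_match_all('/^.../m', $subject, $multi);

        echo $withoutM, '<br>';   // 1
        echo $withM, '<br>';      // 2
```

Without the modifier, only “The” is found, even though preg_match_all() is used; with it, both “The” and “And” are found.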

The $ Assertion
The $ assertion matches (asserts) the end of the subject string, or the end of each line in the subject if the subject has more than one line and the m modifier is used. Here, end of line means just before “\n” (or the end of the last line in a multi-line subject string). Try the following code:

<?php

        $subject = "The first sentence.\r\nAnd the second one.";
        $regex = '/.....$/m';

        preg_match_all($regex, $subject, $matches);

        echo $matches[0][0], '<br>';
        echo $matches[0][1], '<br>';

?>

The output is:

    nce.
    one.

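where \r is counted as one of the five characters of the first match (it is not visible in the browser). The second match actually begins with a space (“ one.”), which the browser does not show at the start of a line.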

In the absence of the m modifier, only the very end of the subject string is matched, whether or not the preg_match_all() function is used. If the subject string ends with “\r\n”, $ also matches just before that final “\n”.
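Since a browser hides the carriage return and the leading space, here is a small sketch that prints the matches through json_encode(), which writes control characters as visible escapes. Using json_encode() for this purpose is my own choice for illustration, not part of the regex API.

```php
<?php

        $subject = "The first sentence.\r\nAnd the second one.";

        // With m, $ asserts the position before "\n" and the very end.
        preg_match_all('/.....$/m', $subject, $multi);

        // Without m, only the very end of this subject is asserted.
        preg_match_all('/.....$/', $subject, $single);

        // json_encode() makes "\r" and the leading space visible.
        echo json_encode($multi[0]), '<br>';    // ["nce.\r"," one."]
        echo json_encode($single[0]), '<br>';   // [" one."]
```

Each match is exactly five characters long, as the five dots in the regex require.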

An Assertion
An assertion is a component in a regex that matches a position in the subject and does not consume any character. There are simple assertions and patterned assertions. A simple assertion is just one special character or an escape sequence, as the above assertions are. A patterned assertion is a subpattern (a group delimited by parentheses) that begins with a special character sequence; neither the special character sequence nor any text in the subpattern consumes any character in the subject.

Patterned Assertions
A patterned assertion is a subpattern (a group delimited by parentheses) that begins with a special character sequence; neither the special character sequence nor any text in the subpattern consumes any character in the subject (and nothing is captured). There are two kinds of patterned assertions: lookahead and lookbehind. A lookahead assertion is further divided into two categories: positive and negative assertions.

Lookahead Assertions
The syntax for a positive lookahead assertion is:

    /pattern(?=txt)/

where (?=txt) is the lookahead assertion, which does not consume any character in the subject. txt is text (in general, any pattern) of your choice, which works in conjunction with the special sequence, ?= . pattern is the rest of the regex. An example of such a regex is /\w+(?=;)/ , which matches a word that is followed by a semicolon. The whole lookahead assertion (in parentheses), including the semicolon (txt), does not consume any character in the subject (and is not captured). However, \w+ (the pattern), which is not inside the lookahead assertion, consumes characters. Try the following code:

<?php

        $subject = "We are the world; we are the children of the world.";
        $regex = '/\w+(?=;)/';

        preg_match($regex, $subject, $matches);

        echo $matches[0], '<br>';

?>

The output is:

    world

It is the first “world” that has been matched. As you can see, the assertion (?=;), including the semicolon inside the parentheses, does not consume any character in the subject (the semicolon has not been returned). \w+ for pattern in the regex consumes characters in the subject.
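One way to convince yourself that the semicolon is left unconsumed is to place a literal semicolon after the assertion: it can still match the same character. The sketch below also uses preg_match_all() with a character class inside the lookahead; the second and third regexes and all variable names are my own additions for illustration.

```php
<?php

        $subject = "We are the world; we are the children of the world.";

        // The lookahead checks for ";" but leaves it unconsumed...
        preg_match('/\w+(?=;)/', $subject, $lookahead);

        // ...so a literal ";" placed after the assertion still matches it.
        preg_match('/\w+(?=;);/', $subject, $consumed);

        // Every word that is directly followed by ";" or ".".
        preg_match_all('/\w+(?=[;.])/', $subject, $all);

        echo $lookahead[0], '<br>';        // world
        echo $consumed[0], '<br>';         // world;
        echo count($all[0]), '<br>';       // 2
```

In the second regex, the semicolon appears in the result only because the literal ; after the assertion consumes it, not because of the assertion itself.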

The syntax for a negative lookahead assertion is:

    /pattern(?!txt)/

where (?!txt) is the lookahead assertion, which does not consume any character in the subject. txt is text (in general, any pattern) of your choice, which works in conjunction with the special sequence, ?! . pattern is the rest of the regex. An example of such a regex is /foo(?!bar)/ , which matches foo that is not followed by bar. The whole lookahead assertion, including the bar (txt), does not consume any character in the subject. However, foo (the pattern), which is not inside the lookahead assertion, consumes characters. Try the following code:

<?php

        $subject = "You would see bar and foo in computer language documentation";
        $regex = '/foo(?!bar)/';

        preg_match($regex, $subject, $matches);

        echo $matches[0], '<br>';

?>

The output is:

    foo

As you can see, (?!bar), including the bar that forms the assertion, does not consume any character in the subject. foo for pattern in the regex consumes characters in the subject. There is only one foo in the subject, and it is not followed by bar; the bar in the subject comes before foo, so it plays no part in the assertion.
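The subject above never has foo followed by bar, so the assertion is not really put to the test. The following sketch uses a made-up subject, "foobar foobaz foo", in which the first foo is followed by bar and must be skipped. The subject and the variable names are assumptions chosen for illustration.

```php
<?php

        $subject = "foobar foobaz foo";

        // foo is matched only where "bar" does not come right after it.
        preg_match_all('/foo(?!bar)/', $subject, $matches, PREG_OFFSET_CAPTURE);

        // Each entry is array(matched text, offset in the subject).
        foreach ($matches[0] as $match)
            {
                echo $match[0], ' at offset ', $match[1], '<br>';
            }
```

The foo at offset 0 is rejected because bar follows it; the foo in foobaz (offset 7) and the last foo (offset 14) are accepted.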

Time to take a break. We stop here and continue in the next part of the series.

Chrys

