PHP Regular Expressions

PHP Regular Expressions – Part I

Foreword: In this part of the series, I introduce you to what are known as PHP regular expressions.

By: Chrysanthus Date Published: 11 Aug 2012

Introduction

Consider the string,

“This is a man”.

Assume that you do not know the content of the string; the string might have been typed by the user and the PHP code has assigned it to a variable. You may have the following two questions:

Does the string have the word, “man”?
If the string has the word, “man”, can you change it to “woman”?

There are many other questions similar to the above two, some of them rather complex. Handling such questions in code is the subject called Regular Expressions, abbreviated Regex.

This is an article series.

The Word, Regex
In the above example, “man” is a Regex. More generally, a Regex is a string of characters (usually a short one) whose presence you want to check for in some subject string. This subject string might have been assigned to a variable.

Matching
When the Regex is seen in the subject string, we say matching has occurred. That is, the Regex has matched the string. When matching occurs, replacement can follow. If the regex, “man” in the above example is seen in the subject string, it can be replaced by the word “woman”.
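
The two questions from the introduction can be sketched as follows. This is only a minimal illustration: the variable names are my own, and preg_replace() is one of the PCRE functions that this part of the series does not cover in detail.

```php
<?php
// The subject string; in practice it might come from user input.
$subject = "This is a man";

// Question 1: does the subject contain the word "man"?
$found = preg_match('/man/', $subject);   // 1 if found, 0 if not

// Question 2: if so, replace it with "woman".
$result = $found ? preg_replace('/man/', 'woman', $subject) : $subject;

echo $result;   // This is a woman
```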

Modern and Old-Fashioned Ways of Coding Regex
At first, to answer the above type of questions, you had to do the coding using programming basics (declaration of variables, conditions, loops, etc.). Know that questions such as the ones above can be classified (grouped). PHP came up with functions in a module to handle the above questions; this gives the programmer less work. The programmer uses these functions in special ways, without really being conscious of the work they do behind the scenes. The use of these inbuilt functions is made convenient with special symbols. In this series, we learn the special ways of answering questions of the above types.

Requirements
PHP supports what is called the Perl-Compatible Regular Expressions syntax. Older versions of PHP also supported what is called the POSIX-extended syntax, but the functions for it were deprecated in PHP 5.3.0 and removed in PHP 7.0.0. The Perl-Compatible Regular Expressions module, abbreviated PCRE, is enabled by default from PHP 4.2.0. You normally do not need to do anything to have it running in the computer after you have installed PHP. This series is based on the Perl-Compatible Regular Expressions (PCRE) syntax. So once you have your PHP installed, you can start using the functions.
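
If you want to confirm that PCRE is present in your installation, a quick check like the following should do. It is a small sketch using the standard extension_loaded() function and the PCRE_VERSION constant; the variable name is my own.

```php
<?php
// Check that the PCRE extension is loaded in this PHP installation.
$hasPcre = extension_loaded('pcre');

if ($hasPcre) {
    // PCRE_VERSION is a constant defined by the PCRE extension.
    echo "PCRE is available, library version: " . PCRE_VERSION;
} else {
    echo "PCRE is not available";
}
```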

I use an Apache server in conjunction with the PHP software for the code samples in this series. I use one computer, so my domain is localhost. The link is http://localhost/. I assume that you understand the information in this paragraph.

Simple Word Matching
Consider the following code:

<html>
<body>
<?php
if (preg_match("/World/", "Hello World"))
echo "Matched";
else
echo "Not Matched";
?>
</body>
</html>

This is a simple PHP/HTML file. There is a BODY element, whose only content is the PHP script. This file is in the root directory of the server, and it is called regex.php. I said above that my domain is localhost. So to obtain the result of this file at the browser, you have to type the following in the address bar:

http://localhost/regex.php

If you do that, you will see “Matched” as the only content of the web page in the browser.

Let us look at the PHP script.  This is the PHP script content:

if (preg_match("/World/", "Hello World"))
echo "Matched";
else
echo "Not Matched";

You have an if-statement. The if-statement will echo “Matched” if its condition returns true or “Not Matched” if its condition returns false. The if-condition is what is in the parentheses of the if-statement.

The if-condition is:

preg_match("/World/", "Hello World")

PHP Regular Expressions (as a topic) has a small set of PCRE functions, whose names begin with preg_; the exact number depends on your PHP version. The preg_match() function, which you see above, is the most important one. We shall treat the functions and their details towards the end of the series. For now, let us concentrate on Regular Expression basics.

The preg_match() above has two arguments, which are "/World/" and "Hello World". The first one is the regex and the second one is called the subject string or simply subject. The aim is to know if the regex is found in the subject string.

The regex is

/World/

Here, the regex is made up of the word, “World”, preceded by a forward slash and terminated by another forward slash.

The subject string is:

"Hello World"

Now if “World” is found in the subject string, preg_match() would return true, otherwise it would return false. Actually, it returns the integer 1 for a match and 0 for no match; in a condition, PHP interprets these as true and false respectively. Our code above will echo “Matched” because “World” is found in the subject.

Note: Matching is case sensitive. So if we had “World” in the regex as “world” with the W in lower case, the if-condition would return false, and our code would display, “Not Matched”.
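
The return values and the case sensitivity can be seen in a short sketch like the one below. The variable names are my own. The last line uses the i modifier of PCRE, which makes matching case-insensitive; it goes slightly beyond what this part of the series covers.

```php
<?php
$subject = "Hello World";

$exact = preg_match('/World/', $subject);     // same case as in the subject
$lower = preg_match('/world/', $subject);     // lower-case w: no match
$anyCase = preg_match('/world/i', $subject);  // i modifier: ignore case

var_dump($exact);    // int(1)
var_dump($lower);    // int(0)
var_dump($anyCase);  // int(1)
```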

Before the if-statement in the code, you can have the regex and the subject as string variables. The following code illustrates this:

<html>
<body>
<?php

$re = "/World/";
$subject = "Hello World";
if (preg_match($re,$subject))
echo "Matched";
else
echo "Not Matched";
?>
</body>
</html>

In the code, you have the variables,

$re = "/World/";
$subject = "Hello World";

The if-condition is now:

preg_match($re,$subject)

The arguments in the preg_match() function are now variables that correspond to the regex and the subject.

Meaning of Pattern
Consider the following string assigned to the variable, $subject:

$subject = "Examples of creatures are the bat, the cat and the rat.";

You may want to know if the word, “bat”, “cat” or “rat” exists in the string. Examining the string we see that “bat”, “cat” and “rat” each end in “at”. The following regex will be used to determine if “bat”, “cat” or “rat” exists in the string:

$re = "/[bcr]at/";

Note the square brackets around “bcr”; b is the first letter in “bat”; c is the first letter in “cat” and r is the first letter in “rat”. These first letters are inside the square brackets. After the square brackets, you have the next two letters that are common in the three words and follow the different first letters. The following script will echo a match at the browser:

<html>
<head>
</head>
<body>
<?php
$subject = "Examples of creatures are the bat, the cat and the rat.";
$re = "/[bcr]at/";
if (preg_match($re, $subject))
echo "Matched";
else
echo "Not Matched";
?>
</body>
</html>

Now, the regular expression content is

[bcr]at

The two forward slashes added to the ends (as shown below) make the above expression a regular expression.

/[bcr]at/

What you have inside the two forward slashes is a pattern that describes a set of words (bat, cat and rat). In this subject (Regular Expressions) the content inside the two forward slashes is called a pattern. So far, we have seen two types of patterns: one of them, /[bcr]at/, describes a set of words and another, /World/, describes only one word. The two forward slashes are the delimiters of the pattern. We shall see many more patterns in this series. The pattern and its delimiters are together called the regex or regex value. Well, in some documents, distinction is not made between the pattern and regex.

Some Special Characters
There are some ASCII characters, which don't have printable character equivalents and are instead represented by escape sequences. Common examples are \t for a tab, \n for a newline, \r for a carriage return and \a for a bell.

The horizontal tab
If you want a horizontal tab to appear in text you should type “\t” in the text. Consider the following:

$subject = "\tThis is a new section and it continues as a paragraph.";

Note the ‘\t’ for a horizontal tab at the beginning of the subject string.

You might want to match the horizontal tab, \t. Your regular expression would be

/\t/

With this regex assigned to $re, the following expression should return true (matched):

preg_match($re,$subject)

So, to match \t in the available string (subject string), just use \t in the pattern.
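
Put together, the tab example might look like the sketch below. The variable names $matched and $noTab are my own. Note that I write the regex in single quotes here, so that the two characters \t reach the regex engine unchanged; with double quotes PHP would convert \t to a tab first, and the result would be the same.

```php
<?php
// Double quotes: PHP itself turns \t into a real tab character in the subject.
$subject = "\tThis is a new section and it continues as a paragraph.";

// Single quotes: the two characters \t reach the regex engine, which reads them as a tab.
$re = '/\t/';

$matched = preg_match($re, $subject);
echo $matched ? "Matched" : "Not Matched";   // Matched

// A subject without a tab does not match.
$noTab = preg_match($re, "No tab here");
```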

The Control Characters
The notation in the pattern, for matching a control character is

\cX

where X is a letter from A to Z.

If you only want to match a control character (not associated with other characters), the literal text expression for the regex is:

/\cX/

The following expression produces a match:

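preg_match("/\cZ/", "\x1A That is it.")

So, just use the escaped control character in the pattern. Note that the subject must contain the actual control character. PHP double-quoted strings do not interpret \cZ, so in the subject above Control-Z is written with its hexadecimal code, \x1A.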
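
Here is a small sketch of matching Control-Z, with variable names of my own choosing. I build the subject with chr(26), since 26 is the character code of Control-Z, and I keep the regex in single quotes so that \cZ reaches the regex engine as typed.

```php
<?php
// Control-Z is the character with code 26 (hexadecimal 1A).
$controlZ = chr(26);
$subject = $controlZ . " That is it.";

// Single quotes keep \cZ intact for the regex engine.
$matched = preg_match('/\cZ/', $subject);

// A subject holding the literal characters \, c and Z has no control character in it.
$literal = preg_match('/\cZ/', '\cZ That is it.');

var_dump($matched);  // int(1)
var_dump($literal);  // int(0)
```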

Hexadecimal numbers can be written as:

\xhh, e.g. \xBF

I will not give you further explanation about hexadecimal numbers; just know that you will find many examples like the above.

The notation for matching hexadecimal numbers is

\xhh

where h is a hexadecimal digit.

If you only want to match a hexadecimal number, the regex is:

/\xhh/

Characters can be represented by escaped hexadecimal numbers. The following expression produces a match:

preg_match("/\x61\x74/", "cat")

A match is produced, because the hexadecimal number for the character, ‘a’ is \x61 and that for ‘t’ is \x74.
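
The hexadecimal codes can be checked with a short sketch such as the following. The variable names are my own; bin2hex() is a standard PHP function that returns the hexadecimal code of each byte in a string.

```php
<?php
// Single quotes: \x61 and \x74 reach the regex engine, which reads them as 'a' and 't'.
$matched = preg_match('/\x61\x74/', "cat");

// bin2hex() shows the hexadecimal code of each character.
$codeA = bin2hex("a");   // "61"
$codeT = bin2hex("t");   // "74"

echo $matched ? "Matched" : "Not Matched";   // Matched
```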

Word Boundary
A word boundary is the boundary between a word character and a non-word character.

Consider the following strings:

“one two three four five”

“one,two,three,four,five”

“one, two, three, four, five”

“one-two-three-four-five”

The following expression will return true (match):

preg_match("/\b/", "one two three four five")

The notation ‘\b’ is used to match a word boundary. In the above expression, it is the boundary at the very start of the string, just before the word, “one”, that has been matched. If you want to match the boundary between the word “one” and the space that follows it, you have to modify the regex to:

/one\b/

Here, you have the word ‘one’, followed by ‘\b’. The pattern, one\b is what is matched.

The following expression will return true:

preg_match("/one\b/", "one two three four five")

“\b” indicates a word boundary. The following expression will return false (not matched):

preg_match("/on\be/", "one two three four five")

This is because the “\b” at its position does not correspond to a word boundary (it is inside the word, ‘one’, between two word characters).
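
To see where the boundary is actually found, you can ask preg_match() for the offset of the match. The sketch below uses the optional third argument and the PREG_OFFSET_CAPTURE flag of preg_match(), which this part of the series has not introduced; the variable names are my own.

```php
<?php
$subject = "one two three four five";

// PREG_OFFSET_CAPTURE makes each match an array: [matched text, offset in subject].
preg_match('/\b/', $subject, $first, PREG_OFFSET_CAPTURE);

// \b matches a position, not a character, so the matched text is empty.
$firstText = $first[0][0];     // ""
$firstOffset = $first[0][1];   // 0, the start of the string

$afterOne = preg_match('/one\b/', $subject);   // 1: boundary between "one" and the space
$inside = preg_match('/on\be/', $subject);     // 0: no boundary between "n" and "e"
```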

Now, the following expression will return true:

preg_match("/two\b/", "one,two,three,four,five")

Here the string portion ‘two’ is what has been matched. The “\b” corresponds to the boundary between the word “two” and the comma that follows it. The following expression will also produce a match:

preg_match("/two\b/", "one, two, three, four, five")

Here, even though there is a space between the comma and the word, “three”, the “\b” still corresponds to the boundary between the word, “two” and the comma that follows it; the comma is a non-word character and so there is a boundary between the word, “two” and the comma.

Now, the following expression will return true:

preg_match("/three\b/", "one-two-three-four-five")

Here the string portion ‘three’ is what has been matched. The “\b” corresponds to the boundary between the word “three” and the character, “-” that follows it. The character, “-” is a word separator; it separates two words joined together and is not a word character.

The following expression will return true:

preg_match("/five\b/", "one two three four five")

Here the “\b” corresponds to the boundary between the word, “five” and the end of the string.
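
The four strings of this section can be tried in one go with a loop. This is just a sketch with variable names of my own; it checks the same regex against each string.

```php
<?php
// The four subject strings from this section.
$subjects = [
    "one two three four five",
    "one,two,three,four,five",
    "one, two, three, four, five",
    "one-two-three-four-five",
];

$results = [];
foreach ($subjects as $subject) {
    // Space, comma and hyphen are all non-word characters,
    // so there is a boundary after "two" in every string.
    $results[] = preg_match('/two\b/', $subject);
}

// \b also matches at the end of the string, after the last word character.
$atEnd = preg_match('/five\b/', $subjects[0]);

print_r($results);
```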

Combining with Other Characters
You can combine the special characters above with other characters as we have seen. The following expression will return true:

preg_match("/five\b six/", "one two three four five six")

This is similar to the last example we saw. You have the word, “five” followed by \b, a space and then “six” in the regex.
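
A small sketch of this combination follows. The second test, with the made-up subject "fivesix", is my own addition to show that \b cannot match between two word characters.

```php
<?php
$subject = "one two three four five six";

// "five", a word boundary, a space, then "six".
$spaced = preg_match('/five\b six/', $subject);   // 1

// Without a space in the subject there is no boundary between "e" and "s".
$joined = preg_match('/five\bsix/', "fivesix");   // 0
```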

Well, let us rest at this point. We continue in the next part of the series.

Chrys
