Perl Regular Expression Basics

Perl Regular Expressions – Part 1

Perl Course

Foreword: In this part of the series I talk about the basics of what is known as Perl Regular Expression

By: Chrysanthus Date Published: 5 Oct 2015

Introduction

This is part 1 of my series, Perl Regular Expressions. In this part of the series I talk about the basics of what is known as Perl regular expressions. I will give you samples of code that you can try. I am using ActivePerl on Windows; ActivePerl is a Perl distribution for Windows. All the code I give you in this series will work with Perl on other operating systems. ActivePerl does not need the following line at the beginning of the code, while Perl on other operating systems may need it (the exact path depends on where Perl is installed):

         #!/usr/local/bin/perl

I use the DOS Prompt (window) to run all the code samples you will have in this series. You can use a similar console of your operating system to try the samples.

Pre-Knowledge
This is part of the volume, Perl Course. At the bottom of this page, you have links to the different series you should have read before coming here, in order to understand this series well.

Motivation
Consider the string,

          “This is a man”.

Assume that you do not know the content of the string; the string might have been typed by the user and the Perl code has assigned it to a variable. You may have the following two questions:

- Does the string have the word, “man”?
- If the string has the word, “man”, can you change it to “woman”?

There are many other questions similar to the above two, some of them rather complex. Handling such questions in code is the subject called Regular Expressions, abbreviated Regex.

The Word, Regex
In the above example, “man” is a regex. More generally, a regex describes a substring of characters whose presence you want to check for in some subject string. This subject string might also have been assigned to a variable.

Matching
When the regex is seen in the subject string, we say matching has occurred; that is, the regex has matched the string. When matching occurs, replacement can follow. If the regex “man” in the above example is seen, it can be replaced by the word “woman”.
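
As a rough sketch of the two motivating questions, the code below first tests for “man” and then replaces it. It uses the binding operator and Perl's substitution operator, s///, neither of which has been explained at this point in the series; treat it as a preview rather than the author's own example.

```perl
use strict;
use warnings;

my $subject = "This is a man";

# Question 1: does the string contain "man"?
if ($subject =~ /man/)
  {
    # Question 2: replace the first occurrence of "man" with "woman".
    # s/// is Perl's substitution operator (not covered in this part).
    $subject =~ s/man/woman/;
  }

print "$subject\n";    # prints: This is a woman
```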

Modern and Old Fashion Ways of coding Regex
At first, to answer questions of the above type, you had to write the code yourself using programming basics (declaration of variables, conditions, loops, etc.). Questions such as the ones above fall into a small number of classes, so Perl provides operators and patterns to handle them; this gives the programmer less work. The programmer uses these operators and patterns in special ways.
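
To make the contrast concrete, here is a minimal sketch that answers the question “does the string have the word man?” twice: once with a hand-written loop and once with a regex. The variable names ($word, $found_by_loop, $found_by_regex) are only illustrative.

```perl
use strict;
use warnings;

my $subject = "This is a man";
my $word    = "man";

# Old-fashioned way: compare the word against every position in the subject.
my $found_by_loop = 0;
for my $pos (0 .. length($subject) - length($word))
  {
    if (substr($subject, $pos, length($word)) eq $word)
      {
        $found_by_loop = 1;
        last;
      }
  }

# Modern way: one expression with the binding operator and a pattern.
my $found_by_regex = ($subject =~ /man/) ? 1 : 0;

print "loop: $found_by_loop, regex: $found_by_regex\n";    # loop: 1, regex: 1
```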

The Binding Operator
Perl has an operator called the binding operator, which is,

    =~

This operator needs two operands, one on the left and another on the right. It is used to determine if the pattern on the right matches the string on the left. If matching occurs, it returns true. If there is no matching, it returns false.

Simple Word Matching
Consider the following:

            "Hello World" =~ /World/;

This is an expression. We can call the string on the left the subject string. =~ is the binding operator. It binds the subject string to what is on its right, /World/. Now /World/ is known as the regex literal. What is inside the two forward slashes is called the pattern; it can be more complex than the plain word, World, that you see here. The binding operator is said to have two operands: one ("Hello World") on its left and the other (/World/) on its right. The two operands and the binding operator form an expression.

This expression can be used in conditionals (if condition). If the pattern, in this case “World” is found in the subject string, then the expression returns true. If it is not found then the expression returns false. Matching is said to occur, if the pattern, (in this case, “World”) is found in the subject. The following Perl code, which you can try, illustrates this:

use strict;

if ("Hello World" =~ /World/)
  {
    print "Matched";
  }
else
  {
    print "Not Matched";
  }

If you try the above code, the console prints “Matched”.

Note: a variable can be used in place of the subject, "Hello World".
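
The short sketch below puts the subject in a variable and also shows one detail worth knowing when you try your own words: matching is case-sensitive by default, so /world/ does not match "Hello World". The variable names are only illustrative.

```perl
use strict;
use warnings;

my $subject = "Hello World";

# Same pattern as above, but the subject is now a variable.
my $exact = ($subject =~ /World/) ? "Matched" : "Not Matched";

# Lowercase "w": the pattern is compared case-sensitively.
my $lower = ($subject =~ /world/) ? "Matched" : "Not Matched";

print "/World/ gives $exact\n";    # /World/ gives Matched
print "/world/ gives $lower\n";    # /world/ gives Not Matched
```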

Pattern
Consider the following string assigned to the variable, subject.

   my $subject = "Examples of creatures are the bat, the cat and the rat.";

You may want to know if the word “bat” or “cat” or “rat” exists in the string. Examining the string, we see that “bat”, “cat” and “rat” each end in “at”. The following regex will be used to determine if “bat” or “cat” or “rat” exists in the string:

                      /[bcr]at/

Note the square brackets around “bcr”; b is the first letter in “bat”; c is the first letter in “cat” and r is the first letter in “rat”. These first letters are inside the square brackets. After the square brackets, you have the next two letters that are common in the three words.

The following script will produce a match:

use strict;

my $subject = "Examples of creatures are the bat, the cat and the rat.";

if ($subject  =~ /[bcr]at/)
  {
    print "Matched";
  }
else
  {
    print "Not Matched";
  }

The regular expression literal is:

                    /[bcr]at/

In this topic (Regular Expressions) the content inside the two forward slashes is called a pattern.  So far, we have seen two patterns, one, /[bcr]at/ that describes a set of words and another, /World/ that describes only one word. We shall see many more patterns in this series.

Now, [bcr] means b or c or r. So bat, cat and rat do not all have to be in the subject string; one of them is enough. So, the following will still produce a match:

my $subject = "An example of a creature is a cat.";

if ($subject  =~ /[bcr]at/)
  {
    print "Matched";
  }
else
  {
    print "Not Matched";
  }
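
To see which words the pattern accepts and which it rejects, the following sketch tries /[bcr]at/ against a few sample words. The word list is made up for illustration.

```perl
use strict;
use warnings;

my @results;

# "hat" and "at" are included to show words the pattern does not accept.
for my $word ("bat", "cat", "rat", "hat", "at")
  {
    my $verdict = ($word =~ /[bcr]at/) ? "Matched" : "Not Matched";
    push @results, "$word gives $verdict";
  }

print "$_\n" for @results;
# bat gives Matched
# cat gives Matched
# rat gives Matched
# hat gives Not Matched
# at gives Not Matched
```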

Some Special Characters
There are some ASCII characters, which don't have printable character equivalents and are instead represented by escape sequences. Common examples are \t for a tab, \n for a newline, \r for a carriage return and \a for a bell.

The horizontal tab
If you want a horizontal tab to appear in text you should type “\t” in the text. Consider the following:

my $subject = "\tThis is a new section and it continues as a paragraph.";

Note the ‘\t’ for a horizontal tab at the beginning of the subject string.

You might want to match the horizontal tab, \t. Your regular expression would be

            /\t/

With the above, the following expression should return true (matched)

             $subject =~ /\t/

So, to match \t in the subject string, just use \t in the pattern.
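
A small runnable sketch of the above, with a second string that has no tab for comparison (the second string and the variable names are added here for illustration):

```perl
use strict;
use warnings;

my $with_tab    = "\tThis is a new section and it continues as a paragraph.";
my $without_tab = "This is a new section and it continues as a paragraph.";

my $first  = ($with_tab    =~ /\t/) ? "Matched" : "Not Matched";
my $second = ($without_tab =~ /\t/) ? "Matched" : "Not Matched";

print "with tab: $first\n";        # with tab: Matched
print "without tab: $second\n";    # without tab: Not Matched
```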

The Control Characters
The notation in the pattern, for matching a control character is

             \cX

where X is a letter from A to Z.

If you only want to match a control character (not associated with other characters), the literal text expression for the regex is:

            /\cX/

The following expression produces a match:

            "\cZ That is it." =~ /\cZ/

So, just use the escaped control character in the pattern.
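
The sketch below runs the control-character example and also prints the character code of the first character. It assumes an ASCII-based platform, where Control-Z has code 26.

```perl
use strict;
use warnings;

my $subject = "\cZ That is it.";

my $verdict = ($subject =~ /\cZ/) ? "Matched" : "Not Matched";

# On ASCII-based platforms, \cZ is the character with code 26.
my $code = ord(substr($subject, 0, 1));

print "$verdict, first character code is $code\n";
```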

Hexadecimal Numbers
In programming, some hexadecimal numbers are written as:

             \xhh      e.g.  \xBF

In Perl, hexadecimal numbers with more than two digits are written inside curly braces:

           \x{hhhh}     e.g.    \x{AF7B}

Without the braces, Perl reads only the first two digits after \x; \xAF7B would mean the character \xAF followed by the text “7B”.

I will not give you further explanation about hexadecimal numbers; just know that you will find many examples like those above.

The notation for matching hexadecimal numbers is

                      \xhh     or     \x{hhhh}

where h is a hexadecimal digit.

If you only want to match a hexadecimal number, the regex literal is:

            /\xhh/      or      /\x{hhhh}/

Characters can be represented by escaped hexadecimal numbers. The following expression produces a match:

            "cat" =~ /\x61\x74/

This is because the hexadecimal code for the character ‘a’ is \x61 and that for ‘t’ is \x74.
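
Here is a short sketch you can run to check hexadecimal escapes. It also shows how Perl reads an escape with more than two digits, with and without curly braces; the sample values \x417 and \x{263A} are chosen only for illustration.

```perl
use strict;
use warnings;

# \x61 is 'a' and \x74 is 't', so the pattern is the same as /at/.
my $verdict = ("cat" =~ /\x61\x74/) ? "Matched" : "Not Matched";

# Without braces only two digits are used: \x41 is 'A', then a literal '7'.
my $split = "\x417";

# With braces all the digits form one character code.
my $single_length = length("\x{263A}");

print "$verdict\n";                   # Matched
print "$split\n";                     # A7
print "length: $single_length\n";     # length: 1
```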

Word Boundary
A word boundary is the boundary between a word character and a non-word character.

Consider the following strings:

             “one two three four five”

             “one,two,three,four,five”

             “one, two, three, four, five”

             “one-two-three-four-five”

The following expression will return true (match):

            "one two three four five" =~ /\b/

The notation ‘\b’ is used to match a word boundary. In the above expression, it is the boundary between the start of the string and the word, “one” that has been matched (the quotation marks are not part of the string). If you want to match the boundary between the word “one” and the space that follows it, you have to modify the regex to:

             /one\b/

Here, you have the word ‘one’, followed by ‘\b’. The text that is matched is “one”; the ‘\b’ matches a position and does not consume any character.

The following expression will return true:

        "one two three four five" =~ /one\b/

“\b” indicates a word boundary. The following expression will return false (not matched):

            "one two three four five" =~ /on\be /

This is because “\b” at this position does not correspond to a word boundary (it is inside the word, ‘one’).

Now, the following expression will return true:

            "one,two,three,four,five" =~ /two\b/

Here the string portion ‘two’ is what has been matched. The “\b” corresponds to the boundary between the word “two” and the comma that follows it. The following expression will also produce a match:

            "one, two, three, four, five" =~ /two\b/

Here, even though there is a space between the comma and the word, “three”, the “\b” still corresponds to the boundary between the word, “two” and the comma that follows it; the comma is a non-word character and so there is a boundary between the word, “two” and the comma.

Now, the following expression will return true:

            "one-two-three-four-five" =~ /three\b/

Here the string portion ‘three’ is what has been matched. The “\b” corresponds to the boundary between the word “three” and the character, “-” that follows it. The character, “-” is a word separator; it separates two words joined together; it is not a word character.

The following expression will return true:

            "one two three four five" =~ /five\b/

Here the “\b” corresponds to the boundary between the word, “five” and the end of the string.

Combining \b with Other Characters
You can combine the special characters above with other characters as we have seen. The following expression will return true:

            "one two three four five six" =~ /five\b six/

This is similar to the last example we saw. You have the word, “five” followed by \b and then a space and “six”, in the regex.
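
The word-boundary expressions above can be checked together with a sketch like the following. The last test, with an underscore, is an addition for illustration: the underscore counts as a word character in Perl, so there is no boundary between “one” and “_”.

```perl
use strict;
use warnings;

my $spaces = "one two three four five";

my $start  = ($spaces =~ /\b/)      ? "Matched" : "Not Matched";
my $after  = ($spaces =~ /one\b/)   ? "Matched" : "Not Matched";
my $inside = ($spaces =~ /on\be/)   ? "Matched" : "Not Matched";
my $comma  = ("one,two,three,four,five" =~ /two\b/)   ? "Matched" : "Not Matched";
my $hyphen = ("one-two-three-four-five" =~ /three\b/) ? "Matched" : "Not Matched";
my $end    = ($spaces =~ /five\b/)  ? "Matched" : "Not Matched";

# Underscore is a word character, so no boundary follows "one" here.
my $under  = ("one_two" =~ /one\b/) ? "Matched" : "Not Matched";

print "start: $start\n";      # start: Matched
print "after: $after\n";      # after: Matched
print "inside: $inside\n";    # inside: Not Matched
print "comma: $comma\n";      # comma: Matched
print "hyphen: $hyphen\n";    # hyphen: Matched
print "end: $end\n";          # end: Matched
print "under: $under\n";      # under: Not Matched
```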

The Empty Regex
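The empty regex is //. You may not like this, but the empty regex normally produces a match with the binding operator, independent of the content of the subject string: an empty pattern matches the empty string, which is found at the start of every subject. There is one exception in Perl: if an earlier pattern match in the program has succeeded, // reuses that last successful pattern instead. Try the following code and note that the output is “Matched”: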

use strict;

if ("Hello World" =~ //)
  {
    print "Matched";
  }
else
  {
    print "Not Matched";
  }
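
One caveat is worth checking on your own installation. Perl's documentation (perlop) states that an empty pattern reuses the last successfully matched pattern, if there is one. The sketch below illustrates this, assuming it runs as a fresh script with no earlier successful match; the variable names are illustrative.

```perl
use strict;
use warnings;

# No pattern has matched successfully yet, so // is a genuine empty pattern.
my $first = ("Hello World" =~ //) ? "Matched" : "Not Matched";

# A successful match; from here on, // stands for /b/.
my $seen_b = ("abc" =~ /b/);

# "Hello World" has no "b", so the reused pattern fails.
my $second = ("Hello World" =~ //) ? "Matched" : "Not Matched";

print "first: $first\n";      # first: Matched
print "second: $second\n";    # second: Not Matched
```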

The Negated Binding Operator
The negated binding operator is !~ . It does the opposite of the =~ operator: it returns true if the pattern does not match the subject, and false if it does.
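
A minimal sketch of the negated binding operator, using a made-up pattern /Mars/ that does not occur in the subject:

```perl
use strict;
use warnings;

my $subject = "Hello World";

# True, because "Mars" is not in the subject.
my $no_mars  = ($subject !~ /Mars/)  ? "true" : "false";

# False, because "World" is in the subject.
my $no_world = ($subject !~ /World/) ? "true" : "false";

print "no Mars: $no_mars\n";      # no Mars: true
print "no World: $no_world\n";    # no World: false
```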

Variable in Regex
The binding operator has two operands: one on the left and one on the right. The one on the left is the subject; and we already know that it can be a scalar variable. The one on the right is the regex. It can also have a variable within its pattern. In the following code that produces a match, the pattern has a variable:

use strict;

my $var = "am";

if ("I am the one." =~ /I $var/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

Here, we have the variable,

        my $var = "am";

The regex is

     /I $var/

which is

     /I am/

$var in the pattern is replaced by “am”.
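
One thing to watch for, sketched below: the content of the variable is treated as part of the pattern, so any special pattern characters in it keep their special meaning. The dot used here is a pattern character not yet covered in this series (it stands for any character); \Q and \E are Perl's way of quoting the interpolated text so that it is matched literally.

```perl
use strict;
use warnings;

my $var = "a.c";

# The dot in $var acts as "any character", so "abc" matches.
my $plain   = ("abc" =~ /$var/)     ? "Matched" : "Not Matched";

# \Q...\E quotes the content of $var, so the dot must be a real dot.
my $quoted  = ("abc" =~ /\Q$var\E/) ? "Matched" : "Not Matched";
my $literal = ("a.c" =~ /\Q$var\E/) ? "Matched" : "Not Matched";

print "plain: $plain\n";        # plain: Matched
print "quoted: $quoted\n";      # quoted: Not Matched
print "literal: $literal\n";    # literal: Matched
```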

Regex as a variable
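The whole regex can be assigned to a scalar variable. To do this, use the qr// operator, which builds a regex without matching it. A plain /World/ on the right of the assignment would be matched against $_ immediately, and only the result of that match would be stored. Read and try the following code:

use strict;

my $re = qr/World/;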

if ("Hello World" =~ $re)
  {
    print "Matched";
  }
else
  {
    print "Not Matched";
  }

Matching has occurred.

So, the whole regex can be a variable. You can have a variable in a regex. The whole subject string can be a variable.
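
As a sketch of these combinations, the code below stores a regex with Perl's qr// operator (which builds a regex object without matching it) and then uses it both on its own and inside a larger pattern. The second subject string is made up for illustration.

```perl
use strict;
use warnings;

my $re = qr/World/;

# The stored regex used as the whole pattern.
my $whole    = ("Hello World" =~ $re)         ? "Matched" : "Not Matched";

# The stored regex used as a variable inside a larger pattern.
my $embedded = ("Hello World" =~ /Hello $re/) ? "Matched" : "Not Matched";

# A subject that does not contain "World".
my $miss     = ("Hello Perl" =~ $re)          ? "Matched" : "Not Matched";

print "whole: $whole\n";          # whole: Matched
print "embedded: $embedded\n";    # embedded: Matched
print "miss: $miss\n";            # miss: Not Matched
```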

Well, let us rest at this point. We continue in the next part of the series.

Chrys

Broad Network
