Regular Expressions in Perl for the Novice

Regular Expressions in Perl for the Novice – Part 1

Foreword: This is the first part of my series, Regular Expressions in Perl for the Novice.

By: Chrysanthus Date Published: 13 Aug 2012

Introduction

This is the first part of my series, Regular Expressions in Perl for the Novice. Consider the string,

“This is a man”.

Assume that you do not know the content of the string; the string might have been typed by the user and the Perl code has assigned it to a variable. You may have the following two questions:

- Does the string have the word, “man”?
- If the string has the word, “man”, can you change it to “woman”?

There are many other questions similar to the above two, some of them rather complex. Handling such questions in code is the subject called Regular Expressions, abbreviated Regex.

This is an article series. Even though the title refers to the Novice, I cover a lot about Perl Regular Expressions. The word “Novice” simply refers to the simplified and progressive manner in which I have presented the information.

The Word, Regex
In the above example, “man” is a Regex. More generally, a Regex describes a substring of characters whose presence you want to check for in some available string. This available string might also have been assigned to a variable.

Matching
When the Regex is seen in the available string, we say matching has occurred; that is, the Regex has matched the string. When matching occurs, replacement can follow. If the regex “man” in the above example is seen, it can be replaced by the word “woman”.
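
The two questions from the introduction can be sketched in a few lines of Perl. This is only a minimal illustration: the variable names $text and $found are my own, and the substitution operator, s///, has not yet been explained in this series.

```perl
use strict;
use warnings;

# $text is an illustrative name; it stands for the available string.
my $text = "This is a man";

# Question 1: does the string have the word "man"?
my $found = ($text =~ /man/) ? 1 : 0;

# Question 2: if it does, replace the first "man" with "woman".
$text =~ s/man/woman/ if $found;

print "$text\n";   # prints: This is a woman
```

Note that the replacement is only attempted after the match has been confirmed, which mirrors the order of the two questions.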

Modern and Old-Fashioned Ways of Coding Regex
At first, to answer questions of the above type, you had to do the coding using programming basics (declaration of variables, conditions, loops, etc.). Questions such as the ones above can be classified, and Perl builds support for each class into the language itself; this gives the programmer less work. The programmer uses this built-in machinery in special ways without really being conscious that he is using it. Its use is made convenient with special operators and symbols. In this series, we learn the special ways of answering questions of the above types.

Requirements
I will give you samples of code that you can try. I am using ActivePerl and Windows XP. ActivePerl is Perl for Windows. You can use Perl for any other operating system, but its version should be 5.6 or higher. All the code I give you in this series will work with Perl for the different operating systems. Now, ActivePerl does not need the following line at the beginning of the code, while Perl for other operating systems may need the line (with the path to the Perl interpreter on that system):

         #!/usr/local/bin/perl

I use the DOS Prompt (window) of Windows XP to run all the samples you will have in this series. You can use a similar console in your operating system to try the samples of code.

Simple Word Matching
Consider the following:

            "Hello World" =~ /World/;

The above is an expression. We can call the string on the left the available string. =~ is called the binding operator. It binds the available string to what is on its right, /World/. Now /World/ is known as the regex literal. What is inside the two forward slashes is called the pattern. A pattern can be more complex than the plain word, World, that you see here. The binding operator is said to have two arguments: one ("Hello World") on its left and the other (/World/) on its right. The two arguments and the binding operator form an expression.

This expression can be used in conditionals (if condition). If the pattern, in this case “World” is found in the available string, then the expression returns true. If it is not found then the expression returns false. Matching is said to occur, if the pattern, (in this case, “World”) is found in the available string. The following Perl code, which you can try, illustrates this:

use strict;

if ("Hello World" =~ /World/)
  {
    print "Matched";
  }
else
  {
    print "Not Matched";
  }

If you try the above code, the console prints “Matched”.

Note: a variable can be used in place of the available string, "Hello World".

Pattern
Consider the following string assigned to the variable, $availableString.

   $availableString = "Examples of creatures are the bat, the cat and the rat.";

You may want to know if the word “bat”, “cat” or “rat” exists in the string. Examining the string, we see that “bat”, “cat” and “rat” each end in “at”. The following regex will be used to determine if “bat”, “cat” or “rat” exists in the string:

                      /[bcr]at/

Note the square brackets around “bcr”; b is the first letter in “bat”; c is the first letter in “cat” and r is the first letter in “rat”. These first letters are inside the square brackets. After the square brackets, you have the next two letters that are common in the three words and follow the different first letters.

The following script will produce a match:

use strict;

my $availableString = "Examples of creatures are the bat, the cat and the rat.";

if ($availableString =~ /[bcr]at/)
  {
    print "Matched";
  }
else
  {
    print "Not Matched";
  }

The regular expression literal is:

                    /[bcr]at/

In this subject (Regular Expressions) the content inside the two forward slashes is called a pattern. So far, we have seen two patterns: one, [bcr]at, that describes a set of words and another, World, that describes only one word. We shall see many more patterns in this series.
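
To see that the pattern [bcr]at describes only the three words and not every word ending in “at”, you can try it against a few short strings. The sample strings below are my own illustrations, with “my hat” included as a case that should not match.

```perl
use strict;
use warnings;

# Illustrative sample strings; "my hat" has no b, c or r before "at".
my @samples = ("the bat", "a cat sat", "one rat", "my hat");

# Record the outcome for each sample string.
my %outcome;

foreach my $sample (@samples) {
    if ($sample =~ /[bcr]at/) {
        $outcome{$sample} = "Matched";
    }
    else {
        $outcome{$sample} = "Not Matched";
    }
    print "$sample: $outcome{$sample}\n";
}
```

The first three strings produce “Matched” and the last produces “Not Matched”, because the letter before “at” must be one of the letters inside the square brackets.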

Some Special Characters
There are some ASCII characters, which don't have printable character equivalents and are instead represented by escape sequences. Common examples are \t for a tab, \n for a newline, \r for a carriage return and \a for a bell.

The horizontal tab
If you want a horizontal tab to appear in text you should type “\t” in the text. Consider the following:

my $availableString = "\tThis is a new section and it continues as a paragraph.";

Note the ‘\t’ for a horizontal tab at the beginning of the available string.

You might want to match the horizontal tab, \t. Your regular expression would be

            /\t/

With the above, the following expression should return true (matched)

             $availableString =~ /\t/

So, to match \t in the available string, just use \t in the pattern.
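
The following short script, which you can try, compares a string that begins with a tab against one that does not. The variable names are my own and serve only for illustration.

```perl
use strict;
use warnings;

# One string starts with a horizontal tab; the other has no tab at all.
my $withTab    = "\tThis is a new section.";
my $withoutTab = "This is a new section.";

my $first  = ($withTab    =~ /\t/) ? "Matched" : "Not Matched";
my $second = ($withoutTab =~ /\t/) ? "Matched" : "Not Matched";

print "With tab: $first\n";       # With tab: Matched
print "Without tab: $second\n";   # Without tab: Not Matched
```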

The Control Characters
The notation in the pattern, for matching a control character is

             \cX

where X is a letter from A to Z.

If you only want to match a control character (not associated with other characters), the literal text expression for the regex is:

            /\cX/

The following expression produces a match:

            "\cZ That is it." =~ /\cZ/

So, just use the escaped control character in the pattern.
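
As a small sketch of the above, the script below checks the same string for two different control characters. The use of ord() to show the numeric code of the first character is my own addition, and the value in the comment assumes an ASCII-based platform.

```perl
use strict;
use warnings;

# "\cZ" in a double-quoted string is the control character Control-Z.
my $availableString = "\cZ That is it.";

my $hasCtrlZ = ($availableString =~ /\cZ/) ? 1 : 0;   # 1: Control-Z is present
my $hasCtrlA = ($availableString =~ /\cA/) ? 1 : 0;   # 0: Control-A is absent

print "Control-Z: $hasCtrlZ, Control-A: $hasCtrlA\n";

# ord() returns the numeric code of the first character of the string.
my $code = ord($availableString);   # 26 on ASCII-based platforms
print "Code of first character: $code\n";
```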

Hexadecimal Numbers
In programming, a character code with one or two hexadecimal digits is written as:

             \xhh      e.g.  \xBF

In Perl, a code with more than two hexadecimal digits must be enclosed in braces:

           \x{hhhh}     e.g.    \x{AF7B}

Without the braces, Perl reads only the first two digits as the code, so \xAF7B would mean the character \xAF followed by the literal characters “7B”. I will not give you further explanation about hexadecimal numbers; just know that you will find many examples like those above.

The notation for matching hexadecimal numbers is

                      \xhh     or     \x{hhhh}

where h is a hexadecimal digit.

If you only want to match a hexadecimal number, the literal text expression for the regex is:

            /\xhh/      or      /\x{hhhh}/

Characters can be represented by escaped hexadecimal numbers. The following expression produces a match:

            "cat" =~ /\x61\x74/

This is because the hexadecimal code for the character ‘a’ is \x61 and that for ‘t’ is \x74, so the pattern finds the “at” inside “cat”.
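
You can check the codes yourself with the sketch below. It repeats the match above, adds a string of my own (“cot”) that should not match, and prints the hexadecimal code of each character of “cat”. The last match uses the braces form, \x{...}, which is the way Perl writes a code with more than two hexadecimal digits; the code 263A is just an illustrative value.

```perl
use strict;
use warnings;

# \x61 is 'a' and \x74 is 't' in ASCII.
my $catMatched = ("cat" =~ /\x61\x74/) ? 1 : 0;   # 1: "at" is in "cat"
my $cotMatched = ("cot" =~ /\x61\x74/) ? 1 : 0;   # 0: "cot" has no 'a'

print "cat: $catMatched, cot: $cotMatched\n";

# Print the hexadecimal code of each character in "cat".
foreach my $char (split //, "cat") {
    printf "%s = \\x%02X\n", $char, ord($char);
}

# A code with more than two hex digits needs the braces form.
my $wideMatched = ("\x{263A}" =~ /\x{263A}/) ? 1 : 0;   # 1
print "wide character: $wideMatched\n";
```

The loop prints \x63 for c, \x61 for a and \x74 for t.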

Word Boundary
A word boundary is the boundary between a word character and a non-word character.

Consider the following strings:

             “one two three four five”

             “one,two,three,four,five”

             “one, two, three, four, five”

             “one-two-three-four-five”

The following expression will return true (match):

            "one two three four five" =~ /\b/

The notation ‘\b’ is used to match a word boundary. In the above expression, it is the boundary at the very start of the string, just before the word “one”, that has been matched; the start of a string counts as a word boundary when the first character is a word character. If you want to match the boundary between the word “one” and the space that follows it, you have to modify the regex to:

             /one\b/

Here, you have the word ‘one’, followed by ‘\b’. The text that is matched is “one”; the “\b” itself matches a position and not a character.

The following expression will return true:

        "one two three four five" =~ /one\b/

“\b” indicates a word boundary. The following expression will return false (not matched):

            "one two three four five" =~ /on\be /

This is because the “\b” at this position does not correspond to a word boundary (it is inside the word ‘one’).

Now, the following expression will return true:

            "one,two,three,four,five" =~ /two\b/

Here the string portion ‘two’ is what has been matched. The “\b” corresponds to the boundary between the word “two” and the comma that follows it. The following expression will also produce a match:

            "one, two, three, four, five" =~ /two\b/

Here, even though there is a space between the comma and the word, “three”, the “\b” still corresponds to the boundary between the word, “two” and the comma that follows it; the comma is a non-word character and so there is a boundary between the word, “two” and the comma.

Now, the following expression will return true:

            "one-two-three-four-five" =~ /three\b/

Here the string portion ‘three’ is what has been matched. The “\b” corresponds to the boundary between the word “three” and the character, “-” that follows it. The character, “-” is a word separator; it separates two words joined together; it is not a word character.

The following expression will return true:

            "one two three four five" =~ /five\b/

Here the “\b” corresponds to the boundary at the very end of the string, just after the word “five”; the end of a string counts as a word boundary when the last character is a word character.
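
The word boundary examples of this section can be tried together in one script. This is only a sketch: it uses the qr// operator, which has not been covered in this series, to store each regex in a list, and the labels are my own.

```perl
use strict;
use warnings;

# Each case holds a display label, an available string and a regex
# compiled with qr// (used here only to keep the regexes in a list).
my @cases = (
    [ '/one\b/',   "one two three four five",     qr/one\b/   ],   # match
    [ '/on\be /',  "one two three four five",     qr/on\be /  ],   # no match
    [ '/two\b/',   "one,two,three,four,five",     qr/two\b/   ],   # match
    [ '/two\b/',   "one, two, three, four, five", qr/two\b/   ],   # match
    [ '/three\b/', "one-two-three-four-five",     qr/three\b/ ],   # match
    [ '/five\b/',  "one two three four five",     qr/five\b/  ],   # match
);

my @results;
foreach my $case (@cases) {
    my ($label, $string, $regex) = @$case;
    my $outcome = ($string =~ $regex) ? "Matched" : "Not Matched";
    push @results, $outcome;
    print "\"$string\" =~ $label : $outcome\n";
}
```

Only the second case fails, because there the “\b” sits between two word characters, ‘n’ and ‘e’.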

Combining with Other Characters
You can combine the special characters above with other characters as we have seen. The following expression will return true:

            "one two three four five six" =~ /five\b six/

This is similar to the last example we saw. You have the word, “five” followed by \b and then “six” in the regex.
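
The sketch below tries this regex and two variations of my own. The second regex leaves out the space, and the third combines “\b” with “\t” against a string in which a tab separates the two words.

```perl
use strict;
use warnings;

my $availableString = "one two three four five six";

# "five", then a word boundary, then a space, then "six".
my $withSpace = ($availableString =~ /five\b six/) ? 1 : 0;   # 1

# Without the space in the pattern, "six" cannot follow "five" directly.
my $noSpace = ($availableString =~ /five\bsix/) ? 1 : 0;      # 0

# A tab is a non-word character, so a boundary lies between "five" and it.
my $withTab = ("five\tsix" =~ /five\b\tsix/) ? 1 : 0;         # 1

print "with space: $withSpace, no space: $noSpace, with tab: $withTab\n";
```

The “\b” does not consume the space or the tab; the character after the boundary still has to be written in the pattern.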

Well, let us rest at this point. We continue in the next part of the series.

Chrys
