Using Regular Expressions in Perl

Regular Expressions in Perl for the Novice – Part 7

Foreword: In this part of the series, we shall learn two important features titled “Search and Replace” and “The Split Operation”.

By: Chrysanthus Date Published: 13 Aug 2012

Introduction

This is the seventh part of my series, Regular Expressions in Perl for the Novice. We have seen some uses of regex in Perl. We know how to verify if a regex is found in an available string. We know how to find the position of the matched regex in the available string. We have seen other uses. Note that the available string can be a whole page of text. In this part of the series, we shall learn two important features titled “Search and Replace” and “The Split Operation”. Before we leave this part, we shall talk about the regex delimiter.

Variable in Regex
Before we look at the two features, let us be aware that the regex pattern can have variables. The following code works:

use strict;

my $var = "am";

if ("I am the one." =~ /I $var/)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }

Here, we have the variable,

        my $var = "am";

The regex is

     /I $var/

which is

     /I am/

$var in the pattern is replaced by its value, “am”, before the matching takes place.
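
One thing to watch with variables in a pattern is that the value is treated as regex text, so any metacharacter in it keeps its special meaning. The following is a minimal sketch, assuming a hypothetical value "a.m" that is not part of the example above; it uses the \Q and \E escapes to make the value match literally:

```perl
use strict;

# Hypothetical value containing a regex metacharacter (the dot).
my $var = "a.m";

# Plain interpolation: the dot matches any character, so "axm" matches.
my $plain = ("I axm here" =~ /I $var/) ? "Matched" : "Not Matched";

# \Q...\E quotes the interpolated text, so the dot must be a literal dot.
my $quoted = ("I axm here" =~ /I \Q$var\E/) ? "Matched" : "Not Matched";

print "Plain:  $plain\n";
print "Quoted: $quoted\n";
```

Whether you want the plain or the quoted form depends on whether the variable is meant to hold a pattern or ordinary text.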

Search and Replace
You can search for a match in the available string and have the sub string matched replaced. The syntax is:

$availableString =~ s/regex/replacement/modifiers

You already know what regex means. The replacement is the text that will replace the sub string found. We have seen modifiers; an example is the g modifier. Modifiers are optional in the statement.

The following code illustrates this.

use strict;

my $availableString = "I am a man.";

$availableString =~ s/man/woman/;

print $availableString;

The output is:

I am a woman.

The available string is initially “I am a man.”. The Search and Replace statement is “$availableString =~ s/man/woman/;”. After the Search and Replace, the available string is “I am a woman.”. So the word “man” in the available string has been matched and replaced by “woman”. The pattern for matching is /man/; “woman” is the replacement sub string.
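
It can also be useful to know whether a replacement actually happened. The following sketch relies on the value returned by the s/// operator, which is the number of substitutions made (a false value if there were none). The variable names $changed and $missed, and the second pattern, are only for illustration:

```perl
use strict;

my $availableString = "I am a man.";

# s/// returns the number of substitutions made, or a false value if none.
my $changed = ($availableString =~ s/man/woman/);
my $missed  = ($availableString =~ s/child/adult/);

print "After: $availableString\n";
print "First replaced: ",  ($changed ? "yes" : "no"), "\n";
print "Second replaced: ", ($missed  ? "yes" : "no"), "\n";
```

The second statement finds nothing to replace, so the available string is left as it is.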

Using the g Modifier
If the regex would match more than one sub string in the available string, then without the g (global) modifier, only the first sub string would be matched and replaced. The following code illustrates this:

use strict;

my $availableString = "I am a man. You are a man.";

$availableString =~ s/man/woman/;

print $availableString;

The pattern for matching is “man”. The first sub string to be matched is “man”; the second sub string to be matched is still “man”. No g modifier has been used. The output is:

I am a woman. You are a man.

Without the global modifier, matching and replacement affect only the first occurrence. The second “man” in the available string has not been replaced. With the global (g) modifier, all the matched sub strings are replaced. The following code illustrates this:

use strict;

my $availableString = "I am a man. He is a man.";

$availableString =~ s/man/woman/g;

print $availableString;

The output is:

I am a woman. He is a woman.

In the output, all the instances of the word “man” have been replaced by “woman”, thanks to the g modifier.
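
A point of caution with the g modifier: the pattern /man/ also matches the letters “man” inside a longer word. The following sketch, using a sentence that is my own assumption and not from the examples above, shows the effect and one possible remedy with the word boundary \b:

```perl
use strict;

my $unbounded = "A man and a woman.";
my $bounded   = "A man and a woman.";

# Without boundaries, the "man" inside "woman" is replaced as well.
$unbounded =~ s/man/woman/g;

# \b marks a word boundary, so only the standalone word "man" matches.
$bounded =~ s/\bman\b/woman/g;

print "$unbounded\n";
print "$bounded\n";
```

The first print shows “wowoman” at the end of the sentence; the second does not.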

Internal Variables, $1, $2 … $9
Here, we want to see the values that the internal variables $1, $2, etc. take after the replacement. The following code illustrates this:

use strict;

my $availableString = "I am a man. You are a man.";

$availableString =~ s/(man)/woman/;

print '$1 is: ', $1, "\n";

The output is:

$1 is: man

There is one group (man) in the matching pattern. This corresponds to $1. After the replacement, $1 is “man” and not “woman”. So, after you search and replace, the internal variable holds what is matched and not what is replaced. I have not considered the case with the g modifier.
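
The internal variables can also be used inside the replacement itself. The following is a small sketch, with a made-up name string, that swaps two words by referring to the two groups:

```perl
use strict;

my $name = "John Smith";

# $1 and $2 hold the first and second groups while the replacement is built.
$name =~ s/(\w+) (\w+)/$2, $1/;

# Copy $1 straight away; a later successful match would overwrite it.
my $first = $1;

print "$name\n";
print "Group 1 was: $first\n";
```

After the statement, $name is “Smith, John”, while $1 still holds “John”, the text that was matched by the first group.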

The Split Operation
There is an operator called the Split Operator. The syntax is:

split /pattern/, string

The split operator splits a string into a list of sub strings and returns the list. The pattern is the separator, e.g. a comma. The separator is not part of the returned list, unless the pattern has groupings (a case we shall see below). Consider the following available string:

$availableString = "one two three";

If we know the regex pattern to identify the space between words, then we can split this string into a list made up of the words “one”, “two” and “three”. This list can be assigned to an array. “ ” is the character for space, and “ +” will match a space one or more times. The regex to separate the above words is

               / +/
We assume that space might be created by hitting the spacebar more than once. The following code illustrates the use of the split operator with the above pattern.

use strict;

my $availableString = "one two three";

my @words = split / +/, $availableString;

print "First Element is: ", $words[0], "\n";
print "Second Element is: ", $words[1], "\n";
print "Third Element is: ", $words[2], "\n";

In the available string the words are separated by spaces. The output of the above code is:

First Element is: one
Second Element is: two
Third Element is: three

The split operator has split the words in the available string using the space between the words, and put each word as an element in an array.
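
To see why the + matters, here is a sketch that assumes a string where the spacebar was hit more than once between words. It compares the pattern / / with the pattern / +/ by counting the elements returned:

```perl
use strict;

my $availableString = "one  two   three";   # two spaces, then three spaces

my @single = split / /,  $availableString;  # each space is a separator
my @run    = split / +/, $availableString;  # each run of spaces is a separator

print "With / /:  ", scalar(@single), " elements\n";
print "With / +/: ", scalar(@run), " elements\n";
```

With / /, the empty strings between neighbouring spaces are returned as elements too, so the array has six elements instead of three.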

It is possible to have words in a string separated by a comma and a space, like

my $availableString = "one, two, three";

The regex to separate these words is:

          /, +/

The following code illustrates this:

use strict;

my $availableString = "one, two, three";

my @words = split /, +/, $availableString;

print "First Element is: ", $words[0], "\n";
print "Second Element is: ", $words[1], "\n";
print "Third Element is: ", $words[2], "\n";

The output of the above code is:

First Element is: one
Second Element is: two
Third Element is: three

Now, if the regex has groupings, then the list produced contains the matched sub strings from the groupings as well. Consider the following available string:

my $availableString = "/dir1/dir2";

The available string is a path to a directory.

We can use the following regex to split the string:

/(\/)/

The forward slash in the pattern is escaped and is in a group. The following code illustrates this:

use strict;

my $availableString = "/dir1/dir2";

my @words = split /(\/)/, $availableString;

print "First Element is: ", $words[0], "\n";
print "Second Element is: ", $words[1], "\n";
print "Third Element is: ", $words[2], "\n";
print "Fourth Element is: ", $words[3], "\n";
print "Fifth Element is: ", $words[4], "\n";

The output of the above code is:

First Element is:
Second Element is: /
Third Element is: dir1
Fourth Element is: /
Fifth Element is: dir2

Now, this code and its output need explanation because of what we have as the value of the first array element. We said above that if the regex has groupings, then the list produced contains the matched sub strings from the groupings as well. So the array has the words and the matched sub strings for the group. Now, note that the separator begins the available string. So the split operator separates the beginning of the available string, which is nothing, from the first character of the available string. It returns an empty string as its first separated value.
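
If you only want the directory names, one possible approach is to leave out the grouping and then drop the empty values. The following sketch uses Perl's grep function for the filtering; the array names @fields and @dirs are my own:

```perl
use strict;

my $availableString = "/dir1/dir2";

# No group in the pattern, so the slashes themselves are not returned.
my @fields = split /\//, $availableString;

# Drop empty fields, such as the one produced by the leading slash.
my @dirs = grep { length } @fields;

print "Fields: ", scalar(@fields), "\n";
print "Directories: @dirs\n";
```

Here @fields still has the empty first element, while @dirs holds just “dir1” and “dir2”.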

An Interesting example
Consider the following available string:

my $availableString = "http://www.somewebsite.com/dir1/dir2/file.htm";

This is a URL. Let us split this URL into its components, that is, “http:”, “www.somewebsite.com”, “dir1”, “dir2” and “file.htm”. The separator here is either a forward slash or a double forward slash. The pattern for this separator is:

              /\/{1,2}/

The pattern wants one or two forward slashes. This will satisfy the single or double slashes. The following code illustrates this:

use strict;

my $availableString = "http://www.somewebsite.com/dir1/dir2/file.htm";

my @words = split /\/{1,2}/, $availableString;

print "First Element is: ", $words[0], "\n";
print "Second Element is: ", $words[1], "\n";
print "Third Element is: ", $words[2], "\n";
print "Fourth Element is: ", $words[3], "\n";
print "Fifth Element is: ", $words[4], "\n";

So “http:” becomes the first array element, “www.somewebsite.com”, becomes the second array element, “dir1” becomes the third array element, “dir2” becomes the fourth array element and “file.htm” becomes the fifth array element.
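
When you do not know in advance how many elements the split will return, you can loop over the array instead of printing each element by hand. A minimal sketch of the same URL example, written with a loop:

```perl
use strict;

my $availableString = "http://www.somewebsite.com/dir1/dir2/file.htm";

my @words = split /\/{1,2}/, $availableString;

# $#words is the index of the last element, however many there are.
for my $i (0 .. $#words) {
    print "Element $i is: $words[$i]\n";
}
```

Note that the loop counts from 0, which is the index of the first array element.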

The Delimiters
Must you always use the // delimiters for the regex? No. Perl gives you the possibility of using delimiters of your choice.

The following expressions each produce a match:

    "Hello World" =~ m!Hello!;
    "Hello World" =~ m{Hello};
    "/dir1/dir/perl.exe" =~ m"/perl.exe";

The // default delimiters for a match can be changed to arbitrary delimiters by putting an 'm' out front. In the first example, the delimiters are !!. In the second expression the delimiters are {}. In the third example, the delimiters are "". The first delimiter of whatever delimiter pair you choose, must be preceded by m.

The following code illustrates the first case:

use strict;

if ("Hello World" =~ m!Hello!)
  {
    print "Matched\n";
  }
else
  {
    print "Not Matched\n";
  }
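
Other delimiters are most helpful when the pattern itself has forward slashes, because the slashes then need no escaping. The following sketch, with a path of my own choosing, uses {} delimiters for a Search and Replace and m{} for a split:

```perl
use strict;

my $path = "/dir1/dir2/file.htm";

# Copy $path into $renamed, then replace in the copy.
# With {} as delimiters, the slashes in the pattern need no backslashes.
(my $renamed = $path) =~ s{/dir1/}{/dirA/};

# The same idea works for split: m{/} instead of /\//.
my @parts = split m{/}, $path;

print "$renamed\n";
print "Last part: $parts[-1]\n";
```

For Search and Replace, the s takes the place of the m in front of the delimiters.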

Wow, we have done a lot. We have just one more part of the series to see. All that we have done so far is good. You can do a lot with what we have done. I showed you in the previous part of the series how to handle problems that are involved. In the next part of the series, we shall cover features that you will want when you need more power in regex. These features are not always used, but you will need them occasionally. The next and last part of the series is titled, More Regular Expressions in Perl.

So, let us take a break here and continue in the next part of the series.

Chrys

Broad Network
