Broad Network


Introduction to ECMAScript String Regular Expressions

ECMAScript String Regular Expressions - Part 1

Foreword: This is part 1 of my series, ECMAScript String Regular Expressions.

By: Chrysanthus
Date Published: 25 Jul 2012

Introduction


Consider the string,
       
               “This is a man”.

Assume that you do not know the content of the string; the string might have been typed by the user and the ECMAScript code has assigned it to a variable. You may have the following two questions:

- Does the string have the word, “man”?
- If the string has the word, “man”, can you change it to “woman”?

There are many other questions that are similar to (and often more complex than) the above two questions. Handling such questions in code is the subject called Regular Expressions, abbreviated, Regex.

You need basic knowledge of HTML and ECMAScript to understand this series.

The Word, Regex
In the above example, “man” is a Regex. More generally, a Regex is a string of characters (usually a short one) that you want to look for in some subject string. This subject string might have been assigned to a variable.

Matching
When the Regex is seen in the subject string, we say matching has occurred. That is, the Regex has matched the string. When matching occurs, replacement can follow. If the regex, “man” in the above example is seen in the subject string, it can be replaced by the word “woman”.
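The two questions from the introduction can be sketched in a few lines. This is only an illustration: the search() method is explained later in this part, while replace() is a standard String method that this part does not cover, and the variable names are my own.

```javascript
// Assumed subject string; in real code it might have been typed by the user.
var subject = "This is a man";

// Question 1: does the string have the word "man"?
var hasMan = subject.search(/man/) != -1;

// Question 2: change the first "man" found to "woman".
// replace() returns a new string; subject itself is not modified.
var changed = subject.replace(/man/, "woman");

console.log(hasMan);
console.log(changed);
```

Here hasMan holds true and changed holds the string, “This is a woman”.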

Modern and Old-Fashioned Ways of Coding Regex
At first, to answer questions of the above type, you had to do the coding using programming basics (declaration of variables, conditions, loops, etc.). Questions such as the ones above can be classified, and ECMAScript provides built-in functions to handle each class; this gives the programmer less work. The use of these built-in functions is made convenient with special symbols, so the programmer often uses the functions without really being conscious of it. In this series, we learn the special ways of answering questions of the above types, using ECMAScript strings.

Simple Word Matching
Consider the following code:

<html>
    <head>
    </head>
    <body>
        <script type="text/ECMAScript">
            var pos = "Hello World!".search(/World/);
            alert(pos);
        </script>
     </body>
</html>

This is a simple HTML/ECMAScript file. There is a BODY element, which contains just one script. As soon as the web page is loaded, this script is executed.

If you try the above code, the alert box will display the number 6.

Let us look at the ECMAScript script.  This is the script content:

            var pos = "Hello World!".search(/World/);
            alert(pos);

The first statement uses the search() method of the String object. The argument of the search() method is /World/. The string object for the method is "Hello World!"; this is a literal string object; this is the subject string.

The regex is

     /World/      

Here, the regex is made up of the word, “World”, preceded by a forward slash and terminated by another forward slash.

The subject string is:

            "Hello World!"

Now, if “World” is found in the subject string, the string method, search() returns the position where the match occurred in the subject. Position counting in a string begins from zero. The position here is the position in the string where the sub string found begins. In our case it is 6: the sub string, “World” begins at position 6 in the subject. If there is no matching, that is, if no sub string that fits the regex is found in the subject string, the search() method returns -1.
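The two possible outcomes of search() can be seen side by side in the following short sketch. The regex /Earth/ is my own addition, chosen only because “Earth” does not occur in the subject.

```javascript
var subject = "Hello World!";

// "World" begins at position 6 (counting from zero).
var found = subject.search(/World/);

// "Earth" is not in the subject, so search() returns -1.
var notFound = subject.search(/Earth/);

console.log(found);     // 6
console.log(notFound);  // -1
```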

If you just want to know whether or not matching occurs, you can use the following code.

<html>
    <head>
    </head>
    <body>
        <script type="text/ECMAScript">
            if ("Hello World!".search(/World/) != -1)
                alert('Matched');
            else
                alert('Not Matched');
        </script>
     </body>
</html>

If matching occurs, the search() method returns the position in the string where the matching occurred. If matching does not occur it returns -1. This feature is used in the if-condition of the above code. If matching occurs, the code alerts “Matched”. If matching does not occur, the code alerts “Not Matched”.

Note: Matching is case sensitive. So if we had “World” in the regex as “world” with the W in lower case, the if-condition would not hold, and our code would display, “Not Matched”.
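The case sensitivity can be checked with the following sketch, which compares the two spellings against the same subject. It prints to the console instead of using alert(); the variable names are illustrative only.

```javascript
var subject = "Hello World!";

// Same case as in the subject: matching occurs.
var sameCase = subject.search(/World/);

// Lower case w: no matching, because "world" is not in the subject.
var lowerCase = subject.search(/world/);

console.log(sameCase != -1 ? 'Matched' : 'Not Matched');
console.log(lowerCase != -1 ? 'Matched' : 'Not Matched');
```

The first line printed is “Matched” and the second is “Not Matched”.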

Well, we shall use the if-else code above (and its derivatives) more often than the first code in this article series.

Before the if-statement in that code, you can store the regex and the subject in variables. The following code illustrates this:

<html>
    <head>
    </head>
    <body>
        <script type="text/ECMAScript">
            var re = /World/;
            var subject = "Hello World!";

            if (subject.search(re) != -1)
                alert('Matched');
            else
                alert('Not Matched');
        </script>
     </body>
</html>

In this code, you have the variables,

   re = /World/;
   subject = "Hello World!";

The if-condition is now:

        (subject.search(re) != -1)

The string object for the search() method is, subject, and the argument for the search() method is, re.

Meaning of Pattern
Consider the following string assigned to the variable, subject.

   subject = "Examples of creatures are the bat, the cat and the rat. ";

You may want to know if the word, “bat”, “cat” or “rat” exists in the string. Examining the string, we see that “bat”, “cat” and “rat” each end in “at”. The following regex will be used to determine if “bat”, “cat” or “rat” exists in the string:

  re = /[bcr]at/;

Note the square brackets around “bcr”; b is the first letter in “bat”; c is the first letter in “cat” and r is the first letter in “rat”. These first letters are inside the square brackets. After the square brackets, you have the next two letters that are common in the three words and follow the different first letters.

The following script will produce a match at the browser:

<html>
    <head>
    </head>
    <body>
        <script type="text/ECMAScript">
            var re = /[bcr]at/;
            var subject = "Examples of creatures are the bat, the cat and the rat.";

            if (subject.search(re) != -1)
                alert('Matched');
            else
                alert('Not Matched');
        </script>
     </body>
</html>

Now, the regular expression content is

                   [bcr]at

The two forward slashes added to the ends (as shown below) make the above expression a regular expression.

                    /[bcr]at/

What you have inside the two forward slashes is a pattern that describes a set of words (bat, cat and rat).

In this subject (Regular Expressions), the content inside the two forward slashes is called a pattern. So far, we have seen two types of patterns: one of them, [bcr]at, describes a set of words and the other, World, describes only one word. The two forward slashes are the delimiters of the pattern. We shall see many more patterns in this series. The pattern and its delimiters are together called the regex. In some documents, no distinction is made between the pattern and the regex.
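To see which words the pattern [bcr]at describes, you can try the regex against a few single words. The following is a small sketch; the words “hat” and “at” are my own additions, included to show cases that the pattern does not describe.

```javascript
var re = /[bcr]at/;

// The first three words are described by the pattern; the last two are not.
var words = ["bat", "cat", "rat", "hat", "at"];
var results = [];

for (var i = 0; i < words.length; i++) {
    // true if matching occurs, false otherwise
    results.push(words[i].search(re) != -1);
}

console.log(results);  // [ true, true, true, false, false ]
```

“hat” fails because h is not inside the square brackets, and “at” fails because the square brackets demand exactly one character before “at”.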

Some Special Characters
There are some ASCII characters that have no printable form and are instead represented by escape sequences. Common examples are \t for a horizontal tab, \n for a newline, \r for a carriage return and \f for a form feed.

The horizontal tab
If you want a horizontal tab to appear in text you should type “\t” in the text. Consider the following:

var subject = "\tThis is a new section and it continues as a paragraph.";

Note the “\t” for a horizontal tab at the beginning of the subject.

You might want to match the horizontal tab, \t. Your regular expression would be

            /\t/

With the above regex assigned to the variable, re, the following conditional produces a match:

            if (subject.search(re) != -1)

So, to match \t in the subject string, just use \t in the pattern.
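Putting the pieces of this section together gives the following sketch. The second subject, which has no tab, is my own addition for contrast.

```javascript
var re = /\t/;

// Subject that begins with a horizontal tab.
var subject = "\tThis is a new section and it continues as a paragraph.";

// Assumed second subject with no tab anywhere in it.
var noTab = "This line has no tab.";

var withTabPos = subject.search(re);  // the tab is the first character
var noTabPos = noTab.search(re);      // no tab, so no matching

console.log(withTabPos);  // 0
console.log(noTabPos);    // -1
```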

Hexadecimal Numbers
A character can be written as an escaped hexadecimal number:

             \xhh      e.g.  \xBF

where each h is a hexadecimal digit. I will not give you further explanation about hexadecimal numbers in this series; just know that you will find many examples like the above.

The same notation is used inside a pattern. There, \xhh matches the single character whose code is the hexadecimal number hh; it does not match the digits themselves. If you only want to match such a character, the regex is:

            /\xhh/

Characters can be represented by escaped hexadecimal numbers. The following conditional produces a match:

         if ("cat".search(/\x61\x74/) != -1)

A match is produced, because the character, ‘a’ has the hexadecimal code 61, written \x61, and ‘t’ has the hexadecimal code 74, written \x74.
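If you want to check the hexadecimal codes yourself, the following sketch may help. It uses charCodeAt() and toString(16), which are standard methods but are not part of this series; treat their use here as an aside.

```javascript
// The pattern \x61\x74 is the same as the pattern at.
var pos = "cat".search(/\x61\x74/);

// The same escapes work inside an ordinary string.
var sameString = ("\x61\x74" == "at");

// Character code of each letter, converted to hexadecimal text.
var hexOfA = "a".charCodeAt(0).toString(16);
var hexOfT = "t".charCodeAt(0).toString(16);

console.log(pos);         // 1
console.log(sameString);  // true
console.log(hexOfA);      // 61
console.log(hexOfT);      // 74
```

The position is 1 because “at” begins at the second character of “cat”.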

Word Boundary
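A word boundary is the boundary between a word character and a non-word character. A word character is a letter, a digit or the underscore. The beginning and the end of the string also count as word boundaries when the character next to them is a word character.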

Consider the following strings:

             “one two three four five”

             “one,two,three,four,five”

             “one, two, three, four, five”

             “one-two-three-four-five”

The following conditional will produce a match:

             if ("one two three four five".search(/\b/) != -1)

The notation “\b” is used to match a word boundary. In the above conditional, it is the boundary at the very beginning of the string, just before the word, “one”, that has been matched (the quotation marks are not part of the string). If you want to match the boundary between the word “one” and the space that follows it, you have to modify the regex to:

              /one\b/

Here, you have the word ‘one’, followed by ‘\b’. The pattern, one\b is what is matched.

The following conditional will produce a match:

            if ("one two three four five".search(/one\b/) != -1)

“\b” indicates a word boundary. The following conditional will not produce a match:

            if ("one two three four five".search(/on\be/) != -1)

This is because the “\b” at its position does not correspond to a word boundary (it is inside the word, ‘one’).

Now, the following conditional will produce a match:

            if ("one,two,three,four,five".search(/two\b/) != -1)

Here the string portion ‘two’ is what has been matched; the “\b” itself does not consume any character. The “\b” corresponds to the boundary between the word “two” and the comma that follows it. The following conditional will also produce a match:

            if ("one, two, three, four, five".search(/two\b/) != -1)

Here, even though there is a space between the comma and the word, “three”, the “\b” still corresponds to the boundary between the word, “two” and the comma that follows it; the comma is a non-word character and so there is a boundary between the word, “two” and the comma.

Now, the following conditional will produce a match:

            if ("one-two-three-four-five".search(/three\b/) != -1)

Here the string portion ‘three’ is what has been matched. The “\b” corresponds to the boundary between the word “three” and the character, “-” that follows it. The character, “-” is a word separator: it separates two words joined together, and it is not a word character.

The following conditional will produce a match:

            if ("one two three four five".search(/five\b/) != -1)

Here the “\b” corresponds to the boundary between the word, “five” and the end of the string.
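The word boundary conditionals of this section can be collected into one sketch that records the position returned by search() in each case. The variable names are illustrative only.

```javascript
var spaced = "one two three four five";

// Boundary at the very beginning of the string.
var atStart = spaced.search(/\b/);

// Boundary between "one" and the space that follows it.
var afterOne = spaced.search(/one\b/);

// No boundary inside the word "one", so no matching.
var insideOne = spaced.search(/on\be/);

// The comma and the hyphen are non-word characters.
var commaNoSpace = "one,two,three,four,five".search(/two\b/);
var commaSpace = "one, two, three, four, five".search(/two\b/);
var hyphen = "one-two-three-four-five".search(/three\b/);

// Boundary between "five" and the end of the string.
var atEnd = spaced.search(/five\b/);

console.log(atStart, afterOne, insideOne);     // 0 0 -1
console.log(commaNoSpace, commaSpace, hyphen); // 4 5 8
console.log(atEnd);                            // 19
```

Every case returns a position except /on\be/, which returns -1.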

Combining with Other Characters
You can combine the special characters above with other characters as we have seen. The following conditional will produce a match:

            if ("one two three four five six".search(/five\b six/) != -1)

This is similar to the last example we saw. You have the word, “five” followed by \b, a space and then “six” in the regex.

Well, let us take a break at this point. We continue in the next part of the series.

Chrys
