Broad Network


Grouping in ECMAScript String Regular Expressions

ECMAScript String Regular Expressions – Part 4

Foreword: You can group parts of a regular expression and then work with the sub strings those groups match. That is what this part of the series is about.

By: Chrysanthus Date Published: 26 Jul 2012

Introduction

This is the fourth part of my series, ECMAScript String Regular Expressions. You can group parts of a regular expression and then work with the sub strings those groups match. That is what this part of the series is about.

Groupings
We can use parentheses to group characters in a pattern. Consider the following pattern:

          /The (guitarist)/

“guitarist” is in parentheses. The parentheses form a group, which has the text: “guitarist”. Consider the following:

         /The (guitarist is good)/

“guitarist is good” is in parentheses. The parentheses form a group, which has the text: “guitarist is good”.

ECMAScript treats a group as a single entity. A group on its own serves no purpose. It becomes important when used in conjunction with other pattern techniques, such as alternation. There is another use, capturing, which we shall see below.

The String match() Method
The string object has a method called match(). When the regex has no groups and no g flag, as in this section, this method returns a one-element array whose element is the sub string found in the subject. If no sub string is found, the return value is null (not an array). Let us demonstrate this before we continue.

<html>
    <head>
    </head>
    <body>
        <script type="text/ECMAScript">
            var arr = "Hello World!".match(/World/);
            alert(arr);
        </script>
     </body>
</html>

The above script uses the match() method. The subject string is "Hello World!" and the regex is /World/. Matching occurs, so the return value, assigned to arr, is an array whose only element is “World” from the subject. The alert statement displays all the elements of the array; since there is only one, it displays “World”. So, when the regex has no groups, a successful match returns a one-element array.

Let us now look at the case where matching does not occur. Remember that matching is case sensitive. So if we use “world” in the pattern, where the “w” is in lower case, matching would not occur. Let us try it.

<html>
    <head>
    </head>
    <body>
        <script type="text/ECMAScript">
            var arr = "Hello World!".match(/world/);
            alert(arr);
        </script>
     </body>
</html>

Here, matching does not occur, so there is no array; the return value of the match() method is null. The second statement (alert) displays null. You should try the above programs.

Note: the subject, the regex, or both can be held in variables when you call the match() method.
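
Because match() returns null when nothing matches, code that goes on to index the result should check for null first. The following is a minimal sketch of that check, with the subject and the regex held in variables. The function name findFirst is hypothetical, and console.log is used in place of alert only so that the snippet can also run outside a browser.

```javascript
// Hypothetical helper: returns the matched sub string, or null if there is no match.
function findFirst(subject, regex) {
    var arr = subject.match(regex);   // an array on success, null on failure
    if (arr === null) {
        return null;                  // nothing to index
    }
    return arr[0];                    // the sub string that matched the whole regex
}

var subject = "Hello World!";
var regex = /World/;

console.log(findFirst(subject, regex));     // World
console.log(findFirst(subject, /world/));   // null (matching is case sensitive)
```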

Sub Strings with Common Parts
Imagine that you have a bookshop where there is a bookkeeper and a bookshelf. Here, bookkeeper is the person who looks after the books. Also imagine that you have any of the following subject strings:

          var subject = "There is a bookshelf in my shop.";

          var subject = "I have a bookkeeper.";

          var subject = "The bookkeeper takes care of the bookshelf.";

In your code, you might not know which subject string is present (the string might have been taken from somewhere and assigned to a variable); however, let us say your interest is to know whether there is a bookshelf or bookkeeper in the subject string. The regex for this can be:

          /bookshelf|bookkeeper/

Note that in the above regex, we have to type the word, “book” twice. We can avoid this double typing by using the following regex:

          /book(shelf|keeper)/

This second regex is more economical, because we do not have to type the word “book” twice. In the second case we have the group, (shelf|keeper). ECMAScript treats a group as a single unit. Within this group, ECMAScript has to choose “shelf” or “keeper”. In this way, ECMAScript looks for bookshelf or bookkeeper in the subject, even though “book” is typed only once in the pattern. The following code produces a match (returns an array). Because the regex contains a group, the array also holds the sub string matched by the group, so the alert box displays bookshelf,shelf; this capturing is explained later in this part.

           var subject = "There is a bookshelf in my shop.";
           var arr = subject.match(/book(shelf|keeper)/);
           alert(arr);

Here, the subject variable can be any of the above strings.
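
As a quick check, the sketch below runs the same regex against all three subject strings above. The fourth subject ("I have a notebook.") and the helper name firstBookWord are not from the article; they are added only to illustrate a subject that does not match.

```javascript
var subjects = [
    "There is a bookshelf in my shop.",
    "I have a bookkeeper.",
    "The bookkeeper takes care of the bookshelf.",
    "I have a notebook."              // assumed extra subject: no shelf or keeper after book
];
var regex = /book(shelf|keeper)/;

// Hypothetical helper: the first bookshelf/bookkeeper found, or null.
function firstBookWord(subject) {
    var arr = subject.match(regex);
    return arr === null ? null : arr[0];
}

var results = subjects.map(firstBookWord);
console.log(results[0]);   // bookshelf
console.log(results[1]);   // bookkeeper
console.log(results[2]);   // bookkeeper (the leftmost match in the subject wins)
console.log(results[3]);   // null
```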

Here, the word “book” is common to both words and is the first part of both words. The common part can instead be the last part of two words or phrases; the same technique applies, but this time the group comes first, like this:

           /(non-common|non-common)common/

The pattern can actually become complex and you would have the same group as above in different places of a bigger pattern.
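
To make the group-in-front form concrete, here is a small sketch. The words “bedroom” and “bathroom” and the helper name roomWord are illustrative choices, not taken from the article; the common part “room” comes last, so the group comes first.

```javascript
// The non-common parts are in the group; the common part "room" follows it.
var regex = /(bed|bath)room/;

// Hypothetical helper: the matched word, or null if there is no match.
function roomWord(subject) {
    var arr = subject.match(regex);
    return arr === null ? null : arr[0];
}

console.log(roomWord("The bathroom is upstairs."));   // bathroom
console.log(roomWord("The bedroom is small."));       // bedroom
console.log(roomWord("The room is empty."));          // null
```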

Sub Strings with Alternation at Beginning of Available String in Group
Let us look at the case of sub strings with a common part, where one of the alternatives has to match at the beginning of the subject string.

Consider the following pattern:

              /(^x|y)z/

The anchor metacharacter ^ matches at the beginning of the subject string. Here it is inside the group and belongs only to the first alternative, so the above pattern matches 'xz' at the start of the subject string or 'yz' anywhere in the subject string. The following expressions match:

         "xz 5678".match(/(^x|y)z/)

         "34 yz 56 G".match(/(^x|y)z/)

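The sketch below checks this behaviour, including a subject where 'xz' is present but not at the start. It also shows, for contrast, what changes when ^ is moved outside the group so that it applies to both alternatives. The helper name matchedText and the third subject are assumptions added for illustration.

```javascript
// Hypothetical helper: the matched sub string, or null if there is no match.
function matchedText(subject, regex) {
    var arr = subject.match(regex);
    return arr === null ? null : arr[0];
}

var inside = /(^x|y)z/;    // ^ applies to the x alternative only
var outside = /^(x|y)z/;   // ^ applies to both alternatives

console.log(matchedText("xz 5678", inside));       // xz
console.log(matchedText("34 yz 56 G", inside));    // yz
console.log(matchedText("34 xz 56 G", inside));    // null ('xz' is not at the start)
console.log(matchedText("34 yz 56 G", outside));   // null ('yz' must now be at the start)
```
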
More on Sub Strings with Common Parts
Imagine that you want to match “book” or “bookkeeper” or “bookkeepers”. The sub string “book” occurs in all three words. The sub string “bookkeeper” occurs in two of them, and the sub string “bookkeepers” occurs in only one. Our aim in this section is to develop an economical pattern to match sub strings such as these three.

You can do this:

        /book|bookkeeper|bookkeepers/

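There are two problems here. First, you have to type “book” three times and “keeper” two times. Second, ECMAScript tries the alternatives from left to right and stops at the first one that succeeds, so even where “bookkeeper” or “bookkeepers” occurs in the subject, this pattern matches only “book”.

The following pattern is more economical, and it matches the longest of the three words that is present: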

         /book(keeper(s|)|)/

First of all, note that you have a nested group; groups can be nested. There are also two alternation metacharacters, |: one inside the inner (nested) group and the other inside the outer group.

In the inner group, ECMAScript has to choose between “s” and nothing. In the outer group, ECMAScript has to choose between “keeper(s|)” and nothing. “book” is always required. In this way, ECMAScript will match “book”, “bookkeeper” or “bookkeepers”. The situation is similar to the first situation above, but more complex.
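
The following sketch compares the nested pattern with the plain alternation on the same subjects. The subject strings and the helper name matchedText are assumptions made for illustration. It relies on the fact that ECMAScript tries alternatives from left to right, so the empty alternative is only used when the longer one fails.

```javascript
// Hypothetical helper: the matched sub string, or null if there is no match.
function matchedText(subject, regex) {
    var arr = subject.match(regex);
    return arr === null ? null : arr[0];
}

var nested = /book(keeper(s|)|)/;
var flat = /book|bookkeeper|bookkeepers/;

console.log(matchedText("two bookkeepers", nested));   // bookkeepers
console.log(matchedText("one bookkeeper", nested));    // bookkeeper
console.log(matchedText("a book", nested));            // book

// With the flat alternation, the first alternative always succeeds first.
console.log(matchedText("two bookkeepers", flat));     // book
```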

Capturing
The grouping metacharacters, (), that is, the parentheses, also serve another, completely different purpose: they allow the capture of the sub strings in the subject that matched. A pattern (regex) is not usually an exact word or an exact phrase. After matching has occurred with the subject, can you know the exact word or phrase that was matched? Yes, you can, and it is thanks to grouping (parentheses).

Consider the following subject string:

"This is one and that is two."

The pattern /(one).*(two)/ matches the sub string “one and that is two” in the subject. When you use the string object match() method, the whole matched sub string is captured in an array. The portions of the subject matched by each group are also captured in the array. In this case, those portions are “one” and “two”.

The following code illustrates this:

<html>
    <head>
    </head>
    <body>
        <script type="text/ECMAScript">
            var arr = "This is one and that is two.".match(/(one).*(two)/);
            alert(arr[0]);
            alert(arr[1]);
            alert(arr[2]);
        </script>
     </body>
</html>

For the output, the first alert box displays:

         one and that is two

The second alert box displays:

         one

The third alert box displays:

         two

Let us now look at the code to see how this output comes about. The first statement is:

            var arr = "This is one and that is two.".match(/(one).*(two)/);

First of all, note that we have used the match() method. There are two groups in the regex. After the matching process, the sub string in the subject that matches the whole regex goes into arr[0]. The sub string that matches the first group goes into arr[1], and the sub string that matches the second group goes into arr[2]. Note: the array holds these sub strings only if matching occurs. If there is no matching, no array is created; match() returns null.

After the first statement in the script, the other three alert statements display the three array values, accordingly. From the explanation given above, there can only be three elements in the array. This explains what you have as the output.

So, to capture the sub string matched by the whole regex, together with the sub strings matched by its groups, you need the string object match() method and a variable to hold the returned array. The whole regex gives rise to the value of the first element of the array. The first group in the regex (from left to right) gives rise to the second element, the second group to the third element, and any further groups fill the array in that order. The length of the array is therefore one more than the number of groups in the regex: one element for the whole regex and one for each group. Remember, if there is no matching, there is no array and the return value of the match() method is null.
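
The sketch below gathers the captured sub strings into a plain array and guards against the null return value. The helper name captureAll and the second subject and regex are assumptions added for illustration. The second call shows one detail worth knowing: a group that takes no part in the match still has a slot in the array, and that slot holds undefined.

```javascript
// Hypothetical helper: returns the captured sub strings as a plain array,
// or null when there is no match.
function captureAll(subject, regex) {
    var arr = subject.match(regex);
    if (arr === null) {
        return null;
    }
    // Copy the elements: whole match first, then one element per group.
    return Array.prototype.slice.call(arr);
}

var both = captureAll("This is one and that is two.", /(one).*(two)/);
console.log(both[0]);          // one and that is two
console.log(both[1]);          // one
console.log(both[2]);          // two

var onlyTwo = captureAll("This is two.", /(one)|(two)/);
console.log(onlyTwo.length);   // 3
console.log(onlyTwo[1]);       // undefined (the first group did not take part)
console.log(onlyTwo[2]);       // two

console.log(captureAll("This is three.", /(one)|(two)/));   // null
```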

What about Nested Groups?
Consider the following code:

<html>
    <head>
    </head>
    <body>
        <script type="text/ECMAScript">
            var arr = "I like school boys.".match(/school (boy(s|)|)/);
            alert(arr[0]);
            alert(arr[1]);
            alert(arr[2]);
        </script>
    </body>
</html>

This is the output of the above code:

school boys
boys
s

The whole pattern would match “school boys”, “school boy” or “school”. We have two groups; one inside the other.

The outer group in the regex is (boy(s|)|) and the inner group is (s|). The outer group corresponds to “boys” in “school boys”. The inner group corresponds to the “s” at the end of “school boys”.

Let me explain this capturing further. “(boy(s|)|)” means “boy(s|)” or nothing, and “boy(s|)” means “boys” or “boy”; so the “boys” next to “school ” is captured. “(s|)” is also a group, and any group can be captured; it means “s” or nothing. Note that it is the whole pattern that is matched, not a group on its own. The matched sub string that contains our “s” is “school boys”. As “school boys” is matched, our “s” is captured.

Capturing and matching are not the same thing. After matching occurs, the part of the matched sub string that corresponds to each group in the regex is captured.

Now, you may ask, “With nested groups, how do we know which group is the first group, which is the second, which is the third, and so on?” This is the answer: groups are numbered by their opening parentheses, counting from the left. The leftmost opening parenthesis “(” starts the first group, the next opening parenthesis starts the second group, the one after that starts the third group, and so on.
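
The numbering rule can be checked with a regex that has more groups. The regex below is a variation written for this illustration, not one from the article: it has four opening parentheses, so the array has five elements.

```javascript
//            group 1   group 2 starts here
//               |      |group 3  group 4
//               v      vv        v
var regex = /(school) ((boy)(s|))/;
var arr = "I like school boys.".match(regex);

console.log(arr[0]);   // school boys  (whole regex)
console.log(arr[1]);   // school       (1st opening parenthesis)
console.log(arr[2]);   // boys         (2nd opening parenthesis, the outer group)
console.log(arr[3]);   // boy          (3rd opening parenthesis)
console.log(arr[4]);   // s            (4th opening parenthesis)
```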

Time to take a break. We continue in the next part of the series.

Chrys
