Broad Network


Building ECMAScript String Regular Expressions

ECMAScript String Regular Expressions – Part 6

Forward: In this section we look at two examples that are more demanding. Before we leave this part of the series, we shall talk about what is called Backtracking.

By: Chrysanthus Date Published: 26 Jul 2012

Introduction

This is part 6 of my series, ECMAScript String Regular Expressions. Many of the examples we have come across are simple examples. In this section we look at two examples that are more demanding. Before we leave this part of the series, we shall talk about what is called Backtracking.

Steps required to build a Regex
These are the steps required to build a regex:

- Specify the task in detail,

- Break down the problem into smaller parts,

- Translate the small parts into regexes,

- Combine the regexes,

- Optimize the final combined regexes.

Two Examples
Example 1
Hexadecimal Color Code Check
Specifying the Task in Detail
An example of a hexadecimal color code is #4C8. Another example is #44CC88.
- A hexadecimal code begins with a hash, followed by either 3 hexadecimal numbers or 6 hexadecimal numbers.
- Hexadecimal digits are: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F.
- The hexadecimal letters can be in lower or upper case.

Breaking Down the Problem into Smaller Parts
- It begins with a # anchored
- It is followed by 3 hexadecimal numbers or
- 6 hexadecimal numbers
- There is no character after the 3 or 6 hexadecimal digits.

Translating into regexes
There are three small parts above. The first part gives the regex:

                  /^#/

The second part gives the regex:

                 /[0-9a-fA-F]{3}/

The third part gives the regex:

                 /[0-9a-fA-F]{6}/

The last part gives the regex:

              /$/

Combining the Regexes
This is the combined regex:

               /^#([0-9a-fA-F]{3}$)|([0-9a-fA-F]{6}$)/

Note the alternate metacharacter, | for the three or six hexadecimal digits. Also note the parentheses that separate the alternate groups.

Optimizing the Combined Regex
This means shortening the combined regex. Note that 0-9 is abbreviated to d. So in the combined regex, we change the two occurrences of 0-9 to d. The optimized regex is:

               /^#([da-fA-F]{3}$)|([da-fA-F]{6}$)/

This expression is shorter than the above by two characters.

The following code illustrates its use:

        <script type="text/ECMAScript">
         subject = "#44CC88";
        
         if (subject.search(/^#([da-fA-F]{3}$)|([da-fA-F]{6}$)/) != -1)
               alert('Matched');
         else
               alert('Not Matched');
        </script>

Example 2
User Name Check
Specifying the Task in Detail
Assume that we have a site where users have to login. We can tell the user that his login name should contain letters in lower or upper case and/or digits from zero to 9 and/or the underscore, _. We also insist that the name must not be less than 3 characters or greater that 18 characters. In this example we have imposed the specification details.

Breaking Down the Problem into Smaller Parts
From above, a login name is made up of
- letters of the alphabet in lower or upper case between 3 to 18 letters, inclusive, and/or
- digits from 0 to 9 between 3 to 18 digits, inclusive, and/or
- the underscore between 3 to 18 digits, inclusive. This means, you can have up to 18 underscores for a name. Let us allow that for simplicity.
- We must limit the subject string to 3 or 6 characters.

Translating into regexes
The regex for the first part is:

               /^[a-zA-Z]{3,18}$/

The regex for the second part is:

               /^[0-9]{3,18}$/

The regex for the third part is:

               /^[_]{3,18}$/

The fourth part is inherent in the above regexes.

Combining the Regexes
In the break down section, the above three parts are combined with the phrase, “and/or” There is no direct way of doing this, so we have to deduce it. This is the combined regex:

               /^[a-zA-Z0-9_]{3,18}$/

Optimizing the Combined Regex
This means shortening the combined regex. Note that the class [a-zA-Z0-9_] is abbreviated to w. The optimized regex is:

               /^[w]{3,18}$/

Backtracking
We have seen how to match alternatives using the alternation metacharacter, |. When matching alternatives, ECMAScript uses a process known as backtracking. I will illustrate this with an example. Consider the following expression:

            "12345".search("/(124|123)(46|4|45)/")

I will explain backtracking by explaining the operation of the above expression. The following steps explain how ECMAScript resolves this expression.

- It starts with the first number in the subject string '1'.

- It tries the first alternative in the first group '124'.

- It sees the matching of ‘1’ followed by ‘2’. That is all right.

- It notices that '4' in the regex doesn't match '3' in the subject string – that is a dead end. So it backtracks two characters in the subject string and picks the second alternative in the first group '123'.

- It matches '1' followed by '2' followed by '3'. The first group is satisfied.

- It moves on to the second group and picks the first alternative '46'.

- It matches the '4' in the group string.

- However, '6' in the regex doesn't match '5' in the group string, so that is a dead end. It backtracks one character in the group string and picks the second alternative in the second group '4'.

- '4' matches. The second group is satisfied.

- We are at the end of the regex; we are done! We have matched '1234' out of the subject string "12345".

There are two things to note about this process. First, the third alternative in the second group '45' also allows a match, but the process stopped before it got to the third alternative - at a given character position, leftmost conquers. Secondly, the process was able to get a match at the first character position of the subject string '1'. If there were no matches at the first position, ECMAScript would move to the second character position '2' and attempt the match all over again. ECMAScript gives up and declares "12345" =~ /(124|123)(46|4|45)/, to be “false”, only when all possible paths at all possible character positions have been exhausted.

Let us take a break here and continue in the next part of the series.

Chrys

Related Links

Major in Website Design
Web Development Course
HTML Course
CSS Course
ECMAScript Course
NEXT

Comments

Become the Writer's Fan
Send the Writer a Message