Metacharacters when Searching a Social Network Site with PHP and MySQL

Searching a Social Network Site with PHP and MySQL - Part 3

Large Website of HTML Files

Forward: In this part of the series, I talk about, Metacharacters when Searching a Social Network Site with PHP and MySQL.

By: Chrysanthus Date Published: 13 Feb 2013

Introduction

This is part 3 of my series, Searching a Social Network Site with PHP and MySQL. In this part of the series, I talk about, Metacharacters when Searching a Social Network Site with PHP and MySQL. You should have read the previous part of the series before reaching here, as this is a continuation.

In the previous part of the series, PHP regular expression and MySQL regular expression were used. Sometimes keywords have metacharacters. A good example of a word with metacharacters is the name of the computer language, C++. It has two plus’s that are metacharacters. A keyword is a normal word, which is important in a passage. So, in some articles, “C++” is a keyword. The question here is, how do you type a metacharacter in a regular expression (regex)? Also, how do you find and replace a metacharacter that you do not know, in a string?

In this tutorial I use the “C++” keyword for illustration. After that I talk about the problem of finding and replacing a metacharacter that you do not know, in a string.

Metacharacters
Metacharacters in PHP regex that I know are, + * ? [ ^ ] $ ( ) : { } = ! < > | . Metacharacters in MySQL regex that I know are, ^ $ . * + ? | { } [ ] : = > < . Do not confuse between white space characters and metacharacters. In PHP white space characters are: ‘\t’, ‘\r’, ‘\n’, and ‘\f’. In MySQL, white space (escape sequences) characters are: '\b', '\t', '\n', '\v', '\f', and '\r'.

How do you type a metacharacter in a regex? Answer: in PHP, you precede the metacharacter with a back slash, as in /C\+\+/. In MySQL you precede the metacharacter with double back slashes as in "C\\+\\+"

Metacharacters and the PHP Search Engine Script
I talk about the PHP file of the previous part of the series, here. The PHP file used regular expression technique in three places. In two of these places PHP regular expression functions are use. In the first place, a function is used to remove the non-keywords from the search phrase. In the second place, the keywords of the new search phrase are placed in an array by a function. In the third place, PHP “wraps” a MySQL regular expression technique; this happened at the formation of the MySQL Select Query.

In the first place, the PHP function is:

    $newSearchStr = preg_replace("/\b$nonKeywords[$i]\b/", "", $searchStr);

In the second place, the PHP function is:

    preg_match_all("/\b\w+\b/i", $newSearchStr, &$searchStrArr);

In the third place, you have the code segment:

            //form the WHERE clause of SQL select statement
            $numberWords = count($searchStrArr[0]); //no. of keywords
            $firstKeyword = $searchStrArr[0][0];
            $whereStr = " WHERE (series.keywords rLike \"$firstKeyword\")";
            if ($numberWords > 1)
                {
                    for($j=1; $j<$numberWords; ++$j)
                        {
                            $temp = $searchStrArr[0][$j];
                            $whereStr .= " AND (series.keywords rLike \"$temp\")";
                        }
                }

Here, “rLike” is MySQL regular expression operator and the MySQL regex is in the PHP variable, $temp.

Search Phrase Having C++
If the search phrase has “C++” for the above code as it is, the “C++” word will not be selected into the array, $searchStrArr. This is because + is a metacharacter. To solve the problem, the second place of regular expression above has to be rewritten as:

    preg_match_all("/c\+\+|\b\w+\b/i", $newSearchStr, &$searchStrArr);

The regex is now /c\+\+|\b\w+\b/ instead of just, /\b\w+\b/. Note that the + signs have been escaped. So the function now searches for an ordinary word (\b\w+\b) or C++ (c\+\+) and place in the array. The search is case insensitive (before and now), so c or C mean the same thing. That is not all; as part of the solution, the third regular expression place above, has to be rewritten completely as follows:

            //form the WHERE clause of SQL select statement
            $numberWords = count($searchStrArr[0]); //no. of keywords
            $tempkey;
            for ($j=0; $j<$numberWords; ++$j)
                {
                    $tempkey = $searchStrArr[0][$j];
                    $newtempkey = preg_replace("/\+/", "\\\\\\\\+", $tempkey);
                    $searchStrArr[0][$j] = $newtempkey;
                }
            $firstKeyword = $searchStrArr[0][0];
            $whereStr = " WHERE (series.keywords rLike \"$firstKeyword\")";
            if ($numberWords > 1)
                {
                    for($k=1; $k<$numberWords; ++$k)
                        {
                            $temp = $searchStrArr[0][$k];
                            $whereStr .= " AND (series.keywords rLike \"$temp\")";
                        }
                }

The PHP preg_replace() function has been used again, but in the new code. It has been used in a for-loop that loops through the $searchStrArr array (the row of interest). It looks for any + in a keyword in the array and replaces it with, "\\\\\\\\+" effectively, "\\+". (In PHP, if you want \ at a destination, you type, \\; if you want \\, you type \\\\\\\\). You do this in order to achieve “C\\+\\+” for the MySQL regular expression action at the destination database. MySQL requires a regex metacharacter to be escaped by double back slashes.

The rest of the new code should be self-explanatory, if you have read the previous parts of the series.

Unknown Metacharacter in a Keyword
Above, we know that the peculiar word is “C++”, and the metacharacter is, +. In theory you can have situations when you are not sure of the peculiar word and you are not sure of the metacharacter. In this case, the PHP file has to be modified again.

One way to do this is to know the different possible types of peculiar words, and then adjust the file accordingly. Assuming that there are three possible peculiar words, which are, “C++”, “C**”, and “W^^H”, the PHP function in the second place for regular expression above would be:

    preg_match_all("/c\+\+|c\*\*|w\^\^h|\b\w+\b/i", $newSearchStr, &$searchStrArr);

For the third place, you will have to write three for-loops: one for “c++”, one for “c**” and one for “w^^h”.

The above approach is one way. The problem with that approach is that as the number of peculiar words increase, the number of corresponding for-loops increase; that is cumbersome. With many peculiar words, in the second place, the preg_match_all() function stays the same but with more peculiar words in the regex; however you would have to use the following PHP regular expression function in place of the for-loops:

    $newSearchStrArr = preg_replace($patternArr, $replacementArr, $searchStrArr[0]);

where $patternArr is

    $patternArr = array("/c\+\+/i", "/c\*\*/i", "/w\^\^h/i", . . .);

and $replacementArr is:

    $replacementArr = array("C\\\\\\\\+\\\\\\\\+", "C\\\\\\\\*\\\\\\\\*", "W\\\\\\\\^\\\\\\\\^h", . . . );

and $searchStrArr[0] copied from the previous code is an array with keywords in their natural forms: keywords such as “C++”, “man”, “C**”, “woman”, “W^^H” , “boy”, etc. The order in which the keywords including the peculiar words are placed in this array, does not matter.

Under this condition, any “C++” found in $searchStrArr[0] is replaced by “C\\+\\+”; any “C**” found in $searchStrArr[0] is replaced by “C\\*\\*”; any “W^^H” found in $searchStrArr[0] is replaced by “W\\^\\^H”. Remember, we are putting peculiar keywords in the form suitable for MySQL regular expression action.

The return array of the preg_replace() function, is not $searchStrArr[0]; it is a new array of new variable, say, $newSearchStrArr. While $searchStrArr in this series is a two dimensional array, $newSearchStrArr is a one-dimensional array. $newSearchStrArr is the array with the changed values. $searchStrArr remains unchanged with the old values.

The values in $patternArr and $replacementArr are corresponding: for example, if "/c\+\+/i" is in the first position of $patternArr, then "C\\\\\\\\+\\\\\\\\+" would be in the first position of $replacementArr; if "/c\*\*/i" is in the second position of $patternArr then "C\\\\\\\\*\\\\\\\\*" would be in the second position of $replacementArr; and so on.

The … in the above code means more cryptic words; for you to extend the arrays to have more than three peculiar words.

Do not forget, the knowledge gained in this series, can be used for a large site of HTML files.

Wow, I find the production of this series exciting; I hope you find it interesting. We have come to the end of this part of the series. We take a break here and continue in the next part.

Chrys

Broad Network

Related Articles

Metacharacters when Searching a Social Network Site with PHP and MySQL

Searching a Social Network Site with PHP and MySQL - Part 3

Large Website of HTML Files

Introduction

Related Links

Comments