Martynas Sateika

Simple multiple keyword search in PHP

Published 2014-11-29

Initial solution

I have developed a website for a local clinic a couple of years ago, and recently rebuilt it using FuelCMS. Google Analytics data showed that the vast majority of people enter the site to check out the employee timetable, and then exit. I thought my search form worked well, until I actually reviewed the pages that were tracked in GA yesterday. Quite a few people entered terms that did not work with my very simple database query method.
To be precise, the form on the site has a select field and a text input field. The select field allows to search for employees by their category (surgeon, gynaecologist and so on), while the text field queries the database by an employee's full name.
The dropdown works well and did not cause any concern here, so let's focus on the text field. The algorithm used until recently did nothing more than take the user's input via POST, and see if it matched any of the concatenated first_name, last_name fields in the databae.

Problems with the initial solution

The method seems like it would work but there's two major downsides:

The Lithuanian alphabet contains characters with accents, and users need to actually use them to find the employee. Obviously this is rarely the case, as people are used to writing "z" instead of "ž" all the time on the web.
Interestingly, MySQL does not seem to care whether you write "ė" or "e", and still finds the employees if your string only contains this character with an accent. It does not work like that with the other ones.

Moreover, users often type the last name first, followed by the first name. Or type the first letter of the first name, followed by the last name. Both cases break the algorithm, too.

With a database consisting of less than 50 employees, it is not hard to find them simply by looking at the different categories. People also usually try out different terms to find the employee so most likely they tried just the name or surname as well. However, I wanted to update the algorithm to make sure it worked in all the cases mentioned above. Here's a simple way to do it in PHP with the FuelCMS framework.

Solving issue 1: Symbols with accents

Enabling users to search for employees without forcing them to use special symbols of the Lithuanian alphabet was pretty straight-forward. I created an additional column in the database - 'slug' - and altered the on_before_clean method in FuelCMS to create the slug after the user modifies an employee's data in the CMS.
These were the changes:

Simply put, it takes a string, converts it to lowercase, swaps the symbols that have accents with ones that do not, and then if needed, uses another helper function from FuelCMS that slug'ifies the string further by adding dashes instead of spaces, and more. Note how I had to use mb_strtolower instead of the regular strtolower.

Solving issue 2: Keywords entered out of order

To solve this, I updated the aforementioned get_by_full_name method in my employees model. Simply put, rather than searching for the whole input in the employees table, we split the input up into separate keywords and see which rows match all of them.
Of course it does not have to be an exact match, hence the $this->db->like instead of $this->db->where. This makes sure that the case where a user types the first symbol of the first name and then the full last name works as expected.
The code below should be pretty self explanatory.

The new method does not care about people typing words out of order. If someone types the last name and then the first name, both keywords are still in the 'slug' column.

Going further: typos in words

Google Analytics data showed a couple of cases where people entered names with typos as well. With a small database, we could offer them search suggestions by checking which names have the smallest Levenshtein distance to their keyword (more information on this function can be found on PHP.net.
With a larger database, you could make use of several phonetic algorithms available out-of-the-box in PHP, i.e. soundex. The advantage of it over Levenshtein's distance, for example, is that you can precompute an encoded string and store that in the database. The algorithm encodes any string into a 4-character string, and so to see if two words are similar, all you need to do is compute the Soundexes of the words the user types in and query the database with these.
It is not fool-proof, however, as it encodes only the consonants (vowels are taken into account only if they're the first character of a word):

In some cases, using the Soundex might just be enough (in fact, I am currently using it in the clinic's website). If you're after the more expensive and robust algorithms, though, soundex might provide a good starting point when going through a big database of items: while comparing your input's Soundex with that of millions of rows in a database might not be reliable, you could look at the most similar Soundexes, and compute the Levenshtein distance only of those.