Handling accents and special characters in search

Users of countless software products use search often to locate what they need. Search, in it’s most simplest form, is to take a search keyword from the user, run down a list and find results which partially or fully (it’s up to you) match the keyword. In this blog, we’ll talk about two features that we feel should be provided with any user-friendly search implementation. These are related to handling search with data having accented and special characters. We’ll see how to implement these using MS SQL Server.

Take a table in the database which has a column word that contains an item “João”, basic search should fetch this item with keyword “Joao”. See the below screenshot from WordPress.org that uses Google Search to provide this functionality,

In a database table which has column Word that contains an item “e-commerce”, basic search should fetch this item with keywords like “ecommerce” or “e commerce”. See the below screenshot from WordPress.org that uses Google Search to provide this functionality,

We’re going to focus on implementing these features using SQL Server.

Handling accented characters

The reason why we’re not able to search for “João” using the keyword “Joao” is because of the collation used in the database.

When a new database is created in SQL Server, the collation of the database is set to
SQL_Latin1_General_CP1_CI_AS by default.

While every word in the name of the collation has a meaning, we’ll focus only on CI and AS.

CI (Determines case sensitivity) means that the data that will be stored in the database will be case insensitive. AS means that the data that will be stored in the database will be accent sensitive.

AS (Determines accent sensitivity) is the reason why we were not able to search for “João” using the keyword “Joao”.

For our intended functionality, it’s obvious that the collation needs to be changed to
SQL_Latin1_General_CP1_CI_AI. This collation will not differentiate between ‘ã’ and ‘a’.

Collations can be set on an entire server, to a database in a server or to specific columns in a table of a database. For our requirement we will only change the collation of the column that we wish to make case and accent insensitive. Do note that, there is a performance impact when modifying the collation. To find out more, check here.

For the list of collations in SQL Server, run this query,

MySQL

1

SELECT*FROMfn_helpcollations();

We have a database
search_table which looks like below.

The easiest way to find the collation of the table’s column is to check the properties of the column in SQL Server Management Studio’s Object Explorer.

You can alter the collation of the Name column using the following query.

MySQL

1

2

ALTERTABLEdbo.search_tableALTERCOLUMNNamevarchar(50)

COLLATESQL_Latin1_General_CP1_CI_AINOT NULL;

Now, on searching the table for the word “Joao”, the following will be the result,

Neat!

Handling hyphenated characters

Solving this problem is also fairly simple. We do it by following these steps on the same table
search_table.

First, we create an SQL function in the database. The function will perform the following steps sequentially,

Take a string as an input.

Remove all the special characters from the string and lower the case of the string.

Return the result string.

This function will look like below,

MySQL

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

CREATE FUNCTION[dbo].[RemoveSpecialChars](

@OriginalNameVARCHAR(256))

RETURNSVARCHAR(256)

WITHSCHEMABINDING

BEGIN

IF@OriginalNameisnull

returnnull

DECLARE@ResultTextvarchar(256)=''

,@Lengthint=len(@OriginalName)

,@Indexint=1;

WHILE@Index<=@Length

BEGIN

DECLARE@charINT

SET@char=ASCII(SUBSTRING(@OriginalName,@Index,1))

IF(@charBETWEEN48AND57OR

@charBETWEEN65AND90OR

@charBETWEEN97AND122OR

@charBETWEEN192AND255)

begin

SET@ResultText=@ResultText+CHAR(@char)

end

SET@Index=@Index+1

END

IFLEN(@ResultText)=0

RETURNnull

RETURN@ResultText

END

Next, we create a computed column named
NormalizedName in
search_table table whose value in the column will be derived from the
Name column. We will use the function created above and store result string in the computed column. Make sure the collation of this computed column is set to
SQL_Latin1_General_CP1_CI_AI.

MySQL

1

2

3

ALTERTABLEsearch_tableADD

[NormalizedName]AS(dbo.RemoveSpecialChars(Name))

COLLATESQL_Latin1_General_CP1_CI_AIPERSISTED

We then remove special characters including spaces from the search keyword before using it in the query. You can use the same function that was created in the first step. The searching will be done against
NormalizedName column. The query to do this looks like this,

MySQL

1

2

3

4

5

DECLARE@searchStrnvarchar(max)='E-Commerce';

SELECT

Name

FROMsearch_tableWHERENormalizedName

LIKE'%'+dbo.RemoveSpecialChars(@searchStr)+'%';

Following is the result,

The result “E-Commerce” will be fetched and returned for search keywords like “ECommerce”, “E commerce” and “e comm”.

Code implementation

If an API or external C# code is used to interact with the database, then C# itself can be used to remove special character in the search keyword before passing it to the query.

In the above code example,
RemoveSpecialChars method is used to remove special characters using regular expressions. We are preserving accented characters. We can pass accented characters since the collation of the column in the database table has been changed to
SQL_Latin1_General_CP1_CI_AI.