Introduction To Linguistic Indexes – Part I January 3, 2008

Characters are sorted by default based on numeric values defined by the default character encoding scheme (known as Binary Sorting). For us Australians, this is fine as we (generally) speak English and the English alphabet is nicely sorted in ascending order by ASCII and EBCDIC standards. However, many other languages are not so fortunate as the binary sort does not sort the data in many language’s alphabetic sort order. Oracle has many Globalization Support features to help users in other languages get over these issues (all very interesting and topics for many a Blog entry in the future).

However, even us Australians have issues when it comes to “case-insensitive” searches, where data may be stored in many different cases (eg. Ziggy, ZIGGY, ZiGgY, etc.) and we want to return all data that matches a character value, regardless of its case.

The issue of course is that by default, all text searches are case-sensitive. For example a search WHERE name=’ZIGGY’ will only return ‘ZIGGY’ but not ‘Ziggy’ or ‘ZiGgY’ etc.

The standard fix is for the application to convert the data to a consistent case when performing the search. For example a search WHERE UPPER(Name) = ‘ZIGGY’ will return all values of “ZIGGY” regardless of their case but this will negate the use of any standard index on the Name column.

Therefore, a Function-Based index is required, say based on UPPER(Name), to ensure an efficient index access is possible for case insensitive searches.

However, this often requires an additional index to be created and for the application to be explicitly written to make use of the function-based index defined function.

Now the best cure for this problem is simply to ensure all data is stored in a consistent case (ZIGGY, ZIGGY, ZIGGY) but this may not always be practical or even desirable in some cases.

Another possible solution is the use of a Linguistic Index. This is an index that is created based on a specific case insensitive linguistic language or multilingual option that ensures the index entries are sorted in the linguistic language order, not on the default binary order of the database encoding scheme.

However, before you rush out and start using Linguistic Indexes to possibly simplify the use of case insensitive searches, note there are various disadvantages to Linguistic Indexes, which can somewhat dampen their appeal. These will be covered in Part II of this series.

Case 1, used because it was the only one and Oracle found it cheaper than performing a full table scan. But it can’t be used for case-insensitive searches unless you actually list all possible permutations in an IN list or some such.

Case 3, INDEX 2 was skipped also because it’s a binary index and binary indexes are ignored if the NLS_SORT value doesn’t match, as with INDEX 1.

[…] at Nominet. We have been following the writings of Richard Foote and in particular an article on Linguistic Indexes. I thought the article interesting though somewhat obscure and filed it at the back of […]

Thanks a lot Richard for your rapid response! (When you are in trouble that is well worth).

Sorry cause some of my comments were covered in part II, and didn’t notice.

But there is a thing still a bit uncovered, that is: As you say from 11g on, LIKE works fine and linguistic indexes are considered by CBO. The question is that, this is true only when you have set NLS variables and you use LIKE directly on fields-params, but it doesn’t work when you apply NLSSORT function in the WHERE clause.

These are the cases (afer having a BINARY_AI index created on “name” column, in 11.1.0.7g and NLS_SORT=BINARY_AI and NLS_COM=LINGUISTIC in all cases):

1. WHERE name LIKE ‘ZiGGy%’;

Works fine (returns expected) and index is used as you stated.

2. Using ‘=’ with NLSSORT in WHERE:
WHERE NLSSORT(name, ‘NLS_SORT=BINARY_AI’) = NLSSORT(‘ZiGGy’, ‘NLS_SORT=BINARY_AI’);

Works fine (returns expected as there is a record with name=’Ziggy’) and INDEX IS USED!!!

Please note I’m now on my third glass of champagne in Qantas club so I might not be mentally on top form 😉

However I believe, 3.b doesn’t work because it’s looking for the CI string ‘ZiGG%’ which doesn’t exist because it’s treating the % as a literal, not as a wildcard. To perform a LIKE type search, you need to specify the LIKE condition, else Oracle can’t know what you trying to do.

Qantas, Champagne and browsing the web at the same time… charming!!! (far from here I suppose).

Sorry again, cause 3. was wrong (cut & paste for this post), and this is what should have been (LIKE):

3.a: Without wildcards:
WHERE NLSSORT(name, ‘NLS_SORT=BINARY_AI’) LIKE NLSSORT(‘ZiGGy’, ‘NLS_SORT=BINARY_AI’);

3.b: With ‘%’:
WHERE NLSSORT(name, ‘NLS_SORT=BINARY_AI’) LIKE NLSSORT(‘ZiGG%’, ‘NLS_SORT=BINARY_AI’);

So, about your question: In Oracle’s doc. I have read that instead of setting NLS_SORT to an AI collation and NLS_COMP to linguistic, what you can do is (always with an underlying linguistic index defined) use the NLSSORT() in the WHERE clause (as shown in 2. and 3. examples).

That sounded very interesting as I can maintain all the rest of char based indexes as no linguistic (think also of indexes defined by constraints that otherwise we should have to duplicate), and focus on the queries that need a linguistic search (using NLSSORT()).

In the 2. example (with ‘=’ instead of LIKE), CBO detects a linguistic predicate and searchs for a compatible linguistic index, that is found and used.

But when we issue the 3.a example, what I am not planning to use and just tried it to see how CBO was performing, we can see that the search works (without ‘%’) but the index is NOT being used.

3.b is the one I really would like to use as it eliminates the explained collateral problems on traditional indexes. In this case what is happening is that:
1. While with ‘=’ the linguistic index is being considered, with LIKE it is NOT (the same to 3.a). Why? Bug or feature?

2. The wildcard issue. I share your explanation. But it sounds to me a bug, cause if you set NLS parameters and issue a LIKE with wildcards CBO has no problems to separate de wildcard for the rest of the text (that CBO must convert to its linguistic form).

So the two big problems are:
-LIKE + NLSSORT() -> Index is not used (yes when ‘=’+NLSSORT())
-NLSSORT(‘%’): % is not working as wildcard so no matching records are get.

The short answer is I’ll need to check when I get back home. As I’ve done once before, I just discovered I didn’t bring my 3 pin power adaptor for my laptop here in the US so I’ve got about 3 hours of battery life life 😦