For german readers, this text is, strictly speaking, posted in the wrong blog, since it is
about Oracle Text. And for this topic, a colleague and me are maintaining an own blog (in german
language only): oracle-text-de.blogspot.com.
But the similarity search, which I will talk about, is interesting for virtually everyone developing
applications on top of Oracle and
not only for the "classic" Oracle TEXT users. So I will show, how one can implement search functionality
on a normal, relational table with features like searching over multiple columns, similarity search and
a special "name matching". All these features are not possible using "plain" SQL. And all this
is part of Oracle TEXT, which is available even in standard edition.

For those, who are new to Oracle TEXT, this posting will introduce a lot of new terms like
preference objects, section groups, datastore and others, without explaining
them in the very detail. The reason is that Oracle Text is "a universe of its own"; I'd like to
introduce it here and to show what is possible. For further reading, I'd recommend the
documentation or our (german language)
Oracle TEXT blog

When querying a table with a "plain" SQL SELECT, this is always an exact search; even if the
LIKE keyword is being used, it searches exactly the specified string or substring. But reality
often poses higher requirements - a callcenter agent, for instance, often does not know how
to spell the name of the customer "at the other" end. So they are typing a guess - and it would be
nice, when the system tries also to find entries similar to it. Oracle TEXT
provides special functions for similarity search like FUZZY or (beginning with 11.2.0.2)
the new Name Searching feature.

But how to use these functions? We need an Oracle TEXT Index created beforehand - but Oracle TEXT Indexes
are being created on "documents" in CLOB, BLOB or VARCHAR2 columns, whereas "our" data resides in
plain relational columns. But, as we'll see, there is of course a way to get an Oracle TEXT index
on that kind of data. The example, I'll use in this blog posting, is based on the table
CUSTOMERS in the SH sample schema. If you have an own table with names and addresses
in it, use your table, since the data in SH.CUSTOMERS was generated from
a very small set of distinct rows.

As said: We want to search in that data and we want to have similarity search as well. Oracle TEXT
provides that feature, so we need an Oracle TEXT index on that data. And here is the first of
many "special" characteristics of Oracle TEXT: Unlike the "relational" world, where we would have
one index on each table column, we create only one Oracle TEXT index on one
table column indexing all the data. This is a very important thing to know about Oracle TEXT
with respect to query performance: Only one Text Index is the optimum - multiple TEXT indexes will degrade
query performance significantly.

One Index on multiple columns will be achieved with a special Oracle TEXT feature:
User Datastore. Oracle Text allows us to put a PL/SQL procedure "between" the table and the
index. This PL/SQL procedure is being called by Oracle TEXT during the indexing process, it receives
a ROWID as input and it returns VARCHAR2, CLOB or BLOB data as output. So this procedure allows
us to prepare the data to be indexed ourselves: In the procedure code we can query any data we
want and pass it to the Oracle TEXT index. So this functionality is key to create one Oracle TEXT index
on a table containing any data we want. More details on User Datastores feature can be found in the
Oracle documentation.

So we'll start creating our own User Datastore function. We'll prepare the data on our table
KUNDEN_TAB in "XML" format. Note that we generate the customer name twice: We will have a
XML tag for the first name, one for the last name and one for the complete name. We'll need
this for the different search variants we want to provide.

XML experts might argue that this is "not really" XML - since there is no root tag. But Oracle
TEXT does not require well-formed XML, just "tagged" text is sufficient for our needs. And since
this procedure will be called for each row in the table to be indexed, we'll just generate what is
absolutely necessary. The next step is to "register" this procedure
in the Oracle TEXT dictionary. Note that you need execute provileges on CTX_DDL or
(better) the CTXAPP role in order to execute the following commands.

Next, we'll need to tell Oracle TEXT how to handle the data inside the different "XML" tags.
The VORNAME needs to be handled differently from the KREDITLIMIT. While we
need similarity search on the former one, we need between searches on the latter one. So we'll
now create a section group with sections as follows ...

In the next step we'll adjust a few indexing details. Note that there are several
paramaters to adjust each individual detail of the indexing process. For this example
we'll just reduce diacritic characters to their normal form, which is essential for
german (ä -> a, ü -> u), french (é -> e, ê -> e) or other languages. Beyond that we make sure
that the index is case-insensitive.

Now all preparations have been made - we can create the index. We have to choose one of the
table columns to create the index on - but note that the data will be providey by our
PL/SQL procedure. As an alternative we also could add an additional "search" column
on the table.

Now, since the Oracle TEXT index is present, we could start executing queries. But, before that, another little
thing: We now have created the Index on the table column
NACHNAME. What happens, when a SQL UPDATE statement just updates the VORNAME column ...?
Nothing. The index would not take any notice - since that column is not indexed. But it should
take notice, since the PL/SQL procedure grabs data also from VORNAME. So we need to
create a trigger on the table making sure that any change on the table also "touches" the
NACHNAME column.

But that's not the only special characteristic of our Oracle TEXT index with regard to
DML on the table: Oracle TEXT indexes are asynchronous - after changes on the table data,
they need to be synchronized. But more on this later on - now we want to see what
we get for all our efforts. Here are some queries using Oracle TEXT and its features.

The complete query is part of the Oracle TEXT contains function. Even the "relational" filter
on KREDITLIMIT has been added to CONTAINS and is not part of the "plain" SQL query. This is an
example for the most efficient usage of Oracle TEXT. All search logic is part of CONTAINS; all
relevant data is part of the Oracle TEXT index. Queries will always perform very well in such
a scenario: Mixed Queries (combining CONTAINS and "relational" filters) or queries
with multiple CONTAINS clauses will always perform worse.

CONTAINS allows to use boolean operators like AND, OR or NOT, we can use
brackets and make the query as complex as we want. Wildcards (%) are
allowed similar to SQL LIKE. And Oracle TEXT provides special functions like
FUZZY for similarity, NDATA for name matching or SDATA for numeric or DATE filters.
As the examples show, the FUZZY function is very useful, since it returns much more
matches than the exact search term. The keyword WITHIN specifies the section
to be search in - and here a section matches to an "XML" tag in our PL/SQL procedure and this
matches to a table column.

So, we now can query multiple columns with
just using CONTAINS in the WHERE clause - it's like using a search engine.
But
in reality ... we can imagine, that we not always need all this power. Some queries would
return enough matches even with the simple, exact search - other queries will need all
similarity and name matching power we have. So ...

first, we want to see all exact matches

then we'd like to see similar matches

finally we'll start the special name matching

And all this until we got (say: 20) returned rows. So if we get 20 rows with exact matches,
the subsequent queries should not be executed any more.
For that purpose, Oracle TEXT provides the query relaxation
feature. Using the technique, we'll give all queries in sequential order and we provide
the number of rows we'd like to have returned. Oracle TEXT will then execute all queries,
until the amount of rows has been reached. The "Score" which Oracle TEXT calculates for each
returned row, gives in indication about its quality.

This query will first search for an exact match of "Anne Baer", after that it uses
the FUZZY operator (generic similarity), and then it employs the "name matching" facility.
The latter one, for instance, also matches rows, in which the first and last name are
inverted. And this is being executed until either 15 rows are returned or all the
queries have been executed. In our example, we only get one row, which is due to
the data in KUNDEN_TAB - as stated in the beginning - this data has been generated, based
on about only 30 to 40 distinct names.

This shows, how powerful the Oracle database can be, when it is about finding data. And
keep in mind, that all this is part of the standard edition - no separate licenses
are needed.

I've already mentioned: There is one very important additional characteristic about our
Oracle TEXT index: It is asynchonous - so when the table data changes, this is not immediately
reflected by the Oracle TEXT index. Though Oracle TEXT keeps track about the changed rows,
we need to synchronize it after changes and, to maintain
query performance over time, we also need to optimize the index in regular intervals. But this
is far too much for just one blog posting - for more information about this I'd like to
reference the documentation chapters
about synchronizing and optimizing
the index.

I'd encourage everyone to play a bit with those features; you will see that Oracle TEXT is
"a universe of its own" with a lot of things one needs to know. But looking at the
query power, Oracle TEXT provides, it's abolutely worth discovering it.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle Corporation or Oracle Germany. The views and opinions expressed by visitors to this blog are theirs and do not necessarily reflect mine.