Searching documents with Oracle Text, part 1

As I have mentioned in passing, it is my firm belief that an essential part of GDPR compliance is documentation. If your documentation is sketchy, out-of-date, or vague (consisting, say, of emails and slideshows), how will you show that you have privacy by design, and that you enforce and verify this requirement at every stage in the development process?

Once you’ve created all these documents, how will you find what you’re looking for? Have you ever wasted time trying to find information among hundreds of files in a shared folder or Sharepoint? If so, you already know that these tools have limited ability to search inside of documents.

What I plan to do in this series of posts is to introduce Oracle Text (OT), an astonishingly powerful tool for searching not only text files, but also common business file formats such as MS Office and PDFs. Not only can it search files in your local filesystems, but also web pages and files at the end of URLs.

This search capability offers a variety of sophisticated abilities. In addition to more-common search modes (wildcards, logical connectives), OT offers capabilities that I have previously seen only on dedicated document-management systems. Some of the capabilities, which I illustrate in today’s post, include finding:

2 (or more) words occurring near each other

stemmed searching (stemming in linguistics refers to recognizing that a single word may take different grammatical forms, such as plurals of nouns, conjugation of verbs, etc.)

fuzzy searching, in which the search terms need not match the target’s terms precisely

the number of times the search term was matched within the target

plus combinations of all the above

This list is not complete, as I will be learning to use this tool as I go along, and will offer further capabilities as soon as I can put together small demonstrations to illustrate them.

An important thing to understand is that, although these examples use text that is stored in a database table, OT normally will index and search files in their filesystem or web location (the file pointer or web URL is one of the columns in the OT table). OT only accesses the files when it is creating or synchronizing its indexes..

What is searched in OT queries is not the documents themselves, but the indexes. This means that performance can be very good, even if the files being searched are very numerous or large (though initial index creation may take some time).

What follows is a minimal demo of some of the simpler capabilities of OT. Statements are separated by the string ‘========================================'

for easier reading. For the SQL queries, each query is followed by its output. Each query is preceded by a brief comment (prefixed by ‘rem’, for remark) to indicate what feature is demonstrated in that query. I have also bolded a few terms in the query and result to emphasize the key point of the query.

These statements were run on Oracle 11g Express, which is available for free download. If you decide to install Oracle 11g, a good IDE is SQL Developer, also free, from Oracle. (If you wish to install Oracle and try this demonstration, there are one or two configuration steps you’ll need to take to configure your user. These are well covered on the web.)

Now, you may be thinking something like “this is way to complicated for me to use”. and it’s true that the maximum power of OT can be gained only with ad-hoc queries by someone adept at both SQL and text-searching techniques. What these examples are intended to do is to show you what power is there; whether you choose to make some functionality available with a GUI, or to use an expert when you have an exceptional need to search documents, it’s good to know what is possible.

I’ll close this post by mentioning that this is far from everything that OT has to offer. I plan at least one more post on this topic, as soon as I can get a demo working.

Post navigation

Step 9 of the Belgian Privacy Commission’s guide to getting started with GDPR compliance concerns detecting, analyzing, and dealing with the fall-out of a data breach. The recent recall and re-issue of Estonia’s smartcard IDs brought home to me that public relations (PR) planning is an essential part of breach preparation, not to protect the […]