Optimizing document search using Machine Learning and Text Analytics

A few weeks ago, Eugene showed how the Azure Search Blob Indexer can enable full-text search over files such as Office documents, PDFs and HTML. Since then, we have learned that many of you have a large number of files, which results in large search indexes. From a cost perspective, it is important to optimize this content as much as possible. The goal of this post is to show you how Azure ML Text Analytics can both reduce the size of your Azure Search index and improve the search experience by limiting the indexed content to the terms that matter most to users.

Key phrase extraction

The Azure Machine Learning Text Analytics API can perform tasks such as sentiment analysis, key phrase extraction, language detection and topic detection. We will focus on key phrase extraction, which returns a list of strings denoting the key talking points of the provided text. Here is an example of some text and the associated key phrases:

Cool, right? We just reduced our content to only its key phrases. These phrases are also the terms our users are most likely to search for. Most importantly, storing only these phrases keeps the content in your Azure Search index small.
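To make the request and response handling concrete, here is a minimal Python sketch of a key phrase call. It is not taken from the sample discussed in this post. It assumes a documents-style JSON body (a list of documents with an id, language and text) and a response that returns a keyPhrases list per document id, which is the shape used by later versions of the service; check the API reference for the version you are calling. The endpoint URL, the subscription key header and all function names are assumptions for illustration.

```python
import json
import urllib.request


def build_key_phrase_request(texts, language="en"):
    """Build a request body with one numbered document per input text."""
    return {
        "documents": [
            {"id": str(i), "language": language, "text": text}
            for i, text in enumerate(texts, start=1)
        ]
    }


def parse_key_phrases(response_body):
    """Map each document id to its list of key phrases."""
    return {
        doc["id"]: doc.get("keyPhrases", [])
        for doc in response_body.get("documents", [])
    }


def _http_post_json(url, headers, body):
    """POST a JSON body and decode the JSON response (standard library only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_key_phrases(texts, endpoint, api_key, post=None):
    """Send texts to the (assumed) key phrase endpoint and return the phrases.

    `post` can be replaced with a stub so the function can be tried offline.
    """
    post = post or _http_post_json
    headers = {
        # Assumed header name; confirm it against your service's documentation.
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "application/json",
    }
    return parse_key_phrases(post(endpoint, headers, build_key_phrase_request(texts)))
```

Passing a stub for `post` lets you exercise the parsing logic without a subscription; in real use you would leave it out and supply your own endpoint and key.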

Now how do we bring this all together? Since key phrase extraction is not yet integrated into the Azure Search Blob Indexer, we will use the Azure Search Push API instead. For this purpose, I have extended the OCR sample used in the blog post on leveraging OCR to index content from image files to:

Pass the extracted text to Azure Machine Learning

Retrieve the key phrases from the OCR text and send these key phrases to Azure Search
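The second step above can be sketched in Python as follows. This is an illustration rather than the extended OCR sample itself. It builds an upload batch in the format the Azure Search Push API expects (a "value" array of documents, each with an "@search.action") and posts it to the index's docs/index endpoint. The field names id, fileName and keyPhrases are hypothetical and must match your own index schema, and the API version is left as a parameter because it depends on your service.

```python
import json
import urllib.request


def build_upload_batch(doc_id, file_name, key_phrases):
    """Build a Push API batch that uploads one document holding its key phrases."""
    # Trim whitespace, drop empty entries, and de-duplicate while keeping order.
    cleaned = (phrase.strip() for phrase in key_phrases)
    unique = list(dict.fromkeys(phrase for phrase in cleaned if phrase))
    return {
        "value": [
            {
                "@search.action": "upload",
                "id": doc_id,            # hypothetical key field
                "fileName": file_name,   # hypothetical field
                "keyPhrases": unique,    # hypothetical Collection(Edm.String) field
            }
        ]
    }


def push_batch(service_name, index_name, admin_key, api_version, batch):
    """POST the batch to the index; requires a real service and an admin key."""
    url = (
        f"https://{service_name}.search.windows.net"
        f"/indexes/{index_name}/docs/index?api-version={api_version}"
    )
    request = urllib.request.Request(
        url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"api-key": admin_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Only the key phrases are sent in this sketch, not the full OCR text, which is what keeps the index small.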

Other uses for text analytics and search

Text Analytics with Azure Search also lets your users search and filter results based on the phrases returned from the analysis phase. For example, let’s say a travel company ran all of its user comments through text analysis and stored the resulting phrases in a facetable Azure Search collection field. Using this field, the travel site could then search and filter hotel results based on phrases of interest to the user, such as “family friendly” or “helpful staff.” Furthermore, by including the sentiment of the comments returned from the text analysis in a field within the search index, magnitude scoring profiles can be leveraged to boost items higher in the search results if they had a positive sentiment.
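As a rough sketch of the travel example, the Python below builds an index definition with a facetable, filterable collection field for the phrases and a magnitude scoring profile on a sentiment field, plus an OData filter for the phrases a user selects. The index name, field names, profile name, boost value and the assumption that sentiment is stored as a number between 0 and 1 are all illustrative choices, not values from a real deployment.

```python
def build_hotel_index(name="hotels"):
    """Return an index definition for the hypothetical hotel review scenario."""
    return {
        "name": name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True},
            {"name": "hotelName", "type": "Edm.String", "searchable": True},
            {
                "name": "keyPhrases",
                "type": "Collection(Edm.String)",
                "searchable": True,
                "filterable": True,
                "facetable": True,
            },
            # Fields used by scoring functions must be filterable.
            {"name": "sentiment", "type": "Edm.Double", "filterable": True},
        ],
        "scoringProfiles": [
            {
                "name": "positiveSentiment",
                "functions": [
                    {
                        "type": "magnitude",
                        "fieldName": "sentiment",
                        "boost": 2,
                        "interpolation": "linear",
                        "magnitude": {
                            # Assumes sentiment scores fall between 0 and 1.
                            "boostingRangeStart": 0.5,
                            "boostingRangeEnd": 1.0,
                            "constantBoostBeyondRange": True,
                        },
                    }
                ],
            }
        ],
    }


def phrase_filter(phrases, field="keyPhrases"):
    """Build an OData filter matching documents that contain every phrase."""
    def quote(phrase):
        # OData escapes a single quote by doubling it.
        return "'" + phrase.replace("'", "''") + "'"

    return " and ".join(f"{field}/any(p: p eq {quote(p)})" for p in phrases)
```

For instance, `phrase_filter(["family friendly", "helpful staff"])` produces a filter that can be passed as `$filter` on a search request, with the scoring profile selected by name on the same request.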

In addition, text analytics could be used to display the most important key phrases in a word cloud or similar widget, using faceting.
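One way to feed such a widget, sketched here under assumptions, is to request facets on the phrase field (for example `facet=keyPhrases,count:20` on the search request) and scale each facet count to a font size. The field name keyPhrases and the size range are hypothetical; the function only reads the "@search.facets" section of a search response.

```python
def word_cloud_weights(search_response, field="keyPhrases", min_size=10, max_size=40):
    """Turn facet buckets into (phrase, font_size) pairs for a word cloud."""
    buckets = search_response.get("@search.facets", {}).get(field, [])
    if not buckets:
        return []

    counts = [bucket["count"] for bucket in buckets]
    lowest, highest = min(counts), max(counts)
    span = highest - lowest

    weights = []
    for bucket in buckets:
        # Linear scaling; if every count is equal, use the largest size.
        scale = (bucket["count"] - lowest) / span if span else 1.0
        size = round(min_size + scale * (max_size - min_size))
        weights.append((bucket["value"], size))
    return weights
```

Linear scaling is the simplest choice; a logarithmic scale may read better when a few phrases dominate the counts.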

If you want to see this as part of the Azure Search Indexer, please vote for it and let us know what you think in the comments below.