Use Azure Text Analytics Service to Automatically Tag SharePoint Documents

Automatic content classification, or metadata tagging, has long been on the wish list of knowledge workers across organizations. SharePoint has long offered Managed Metadata and manual tagging, and some third-party solutions claim to automatically extract keywords from the content of uploaded documents.

In this article, we will explore how to use Azure Text Analytics Service to automatically tag documents stored in SharePoint with keywords extracted from the content of those documents.

So, let’s get started.

Prerequisites

Before we can extract keywords from SharePoint documents, we need to set up a few things.

Term Store in SharePoint

It’s a good and already widely used practice to have an organization-wide taxonomy in SharePoint that can be used to tag documents. This helps ensure that documents are tagged with keywords already identified by the business. This is NOT about the values of specific metadata fields that end users need to fill in (like business unit or location) but about the keywords associated with a document that better identify its “content”.

So, for this article, we’ll create a term set named “Content Areas” and use it as the master list against which the keywords extracted from each document are filtered before they are associated with that document.

Document Library

We obviously need a document library containing the documents that we are going to read and extract keywords from, and whose “Content Areas” column we will update with those keywords.

For this article, I created a document library named “Auto Tagged Docs” and added a managed metadata column named “Content Areas” to it. This column allows multiple values and is mapped to the “Content Areas” term set created earlier.

I have uploaded two documents of about 2-3 MB each. The Content Areas field is empty to start with.

Azure Text Analytics Service

Now that we are all set on the SharePoint side, let’s create the Azure Text Analytics Service API resource. We need an Azure subscription, but we can use the “F0 Pricing Tier”, which provides 5000 free transactions of the Azure Text Analytics API per month.

Fill in the form, note down the location you select, and click Create. The API URL to call currently depends on the location, so we will need it later.

Once the Service gets created, browse to the resource and grab the keys. We’ll need these keys to call the API.

So, all set now.

Limitation

Before we jump into the solution, let’s take a quick look at a limitation of Azure Text Analytics Service.

Azure Text Analytics Service is meant for analyzing text inputs, such as user comments on articles or on a company’s products, and extracting keywords, language and sentiment (negative or positive). Currently a single document cannot have more than 5000 characters. However, a single call to the API can contain up to 1000 such documents, so we need to split each of our documents into multiple subdocuments of less than 5000 characters each and later combine the results to find the unique keywords.

Solution

For this article, I am going to explain the steps that generate keywords from the uploaded documents and associate those keywords with the corresponding documents. I will keep the scope to a single document library, but you can easily put a loop on top of it to update documents in other document libraries as well.

So, here is what the solution looks like in a few points:

Get a list of all Terms from the Term Set which we will use as master list of keywords

Get list of all Files from the selected Document Library

Loop through each document and

Read the content

Prepare subdocuments of less than 5000 characters each

Extract Keywords using Azure Text Analytics Service

Find Unique Keywords after combining the Subdocuments

Compare these Unique keywords with Terms extracted from Identified Term Set (“Content Areas” in this case) and find common keywords

Update the “Content Areas” column of the document with the identified common keywords

Let’s take a look at each of these steps in detail.

Get a list of all Terms from the Term Set

All we need is a list of all Terms from the identified Term Set in SharePoint. This list is later compared with the keywords extracted from the documents, and the common terms are added to the documents as metadata.

The function for this step accepts the ClientContext object and the GUID of the TermSet and returns a TermCollection containing all the Terms from that TermSet.

Get list of all Files from the selected Document Library

Let’s now connect to the identified Document Library and get the list of all Files. I have not considered files inside Folders in this case, but you could easily include those too. It’s just SharePoint after all 🙂

Now that we have the list of all documents inside the document library, we can loop through it and take the following actions for each document.

Read File from SharePoint

This one is going to be some work. There are various ways to read the documents, but the content could be any type of Office document (Word, Excel, PowerPoint), with different extensions like doc, docx, ppt and pptx. So, you would need a function that reads the content of each of those document types into a string variable.

For this article, I will just show how that can be done for a Word document. You can use similar functions to read other document types.

This is actually a set of two functions: one to read the content of the SharePoint file into a memory stream, and another to get the content in textual form from that memory stream using a “WordprocessingDocument” object.

The GetTextFromWordprocessingDocument function returns a string that contains all the content from the file.
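To make the idea concrete, here is a minimal Python sketch of the second half of that step, turning the bytes of a .docx file into plain text. It is not the WordprocessingDocument-based function described above: it is a stand-in that relies only on the fact that a .docx file is a ZIP archive whose main part is word/document.xml. The function name get_text_from_docx is hypothetical, and older .doc files are not covered.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def get_text_from_docx(data: bytes) -> str:
    """Return the plain text of a .docx file, one line per non-empty paragraph."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread across one or more w:t elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

The bytes passed in would be whatever was read from the SharePoint file into memory. Headers, footers and footnotes live in other parts of the archive and are ignored by this sketch.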

Prepare Sub Document

Now that we have all the content from the file available in a string variable, we need to find a workaround for the current limitation of 5000 characters which Azure Text Analytics Service imposes.

The Azure Text Analytics API is charged per call, and we can have up to 1000 documents (of less than 5000 characters each) per call. Also, there is a rate limit of 100 calls per minute. So, it is efficient and cost effective to combine multiple documents into one request.

To keep things simple, in this example we’ll make one API call for each SharePoint document. We’ll use a function that checks the content and puts it into a string array, with each index containing at most 4999 characters. There is a possibility that a valid word gets truncated with this logic, so you can implement other logic that doesn’t split words. But for this article, I will just split whenever the string contains 4999 or more characters.
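The splitting logic can be sketched in a few lines of Python. The first function is the simple fixed-size split described above; the second is one possible variant that tries not to cut a word in half. Both function names are hypothetical, and the 4999 default simply mirrors the number used in this article.

```python
def split_fixed(text: str, max_chars: int = 4999) -> list[str]:
    """Cut text into consecutive pieces of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def split_on_whitespace(text: str, max_chars: int = 4999) -> list[str]:
    """Like split_fixed, but prefer to cut after the last whitespace in each window."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Find the last space or newline inside the window to avoid cutting a word.
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1
        chunks.append(text[start:end])
        start = end
    return chunks
```

If a window contains no whitespace at all, split_on_whitespace falls back to a hard cut, so every piece still stays within the limit.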

Extract Keywords using Azure Text Analytics Service

Now we need to pass in the string array containing all the content read from the file. Each index of the array contains at most 4999 characters. We’ll make just one Azure Text Analytics Service API call to extract keywords from the text at all indexes combined.

Take note of the AzureRegion used for the API call. It must be the same location in which the Azure Text Analytics Service was created. Also, provide the subscription key recorded earlier from the Azure Text Analytics Service as the SubscriptionKey.

In this function, we batch the API call by sending a combined input. This means we create a list of MultiLanguageInput objects, each of which contains the text from one index of the file content.

In this example, I am setting the content language to English and the document ID to an incremental number. The function ExtractKeywordsUsingAzureTextAnalytics() reads the content at each index and returns keywords for the text at each index. So, if the file had 22000 characters, it would have been split into an array with 5 elements: indexes 0 to 3 containing 4999 characters each and index 4 containing the remaining 2004 characters. The Azure Text Analytics API treats each of these as a different document and returns an object of type KeyPhraseBatchResult, which contains the keywords for each such document (the content at one index, in our case) separately.
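For readers who prefer to see the batch as plain data, here is a hedged Python sketch of the same idea against the REST interface rather than the MultiLanguageInput and KeyPhraseBatchResult types mentioned above. The URL pattern and the JSON shape reflect the v2.0 key phrases endpoint as I understand it, so treat them as assumptions and check them against the current service documentation. The function names, region and key parameters are hypothetical.

```python
import json
import urllib.request

MAX_DOCUMENTS_PER_CALL = 1000  # per-call limit mentioned earlier in this article


def build_key_phrase_request(sections, language="en"):
    """Wrap each section as one 'document' with an incremental string id."""
    if len(sections) > MAX_DOCUMENTS_PER_CALL:
        raise ValueError("too many sections for a single API call")
    return {
        "documents": [
            {"id": str(index), "language": language, "text": section}
            for index, section in enumerate(sections, start=1)
        ]
    }


def extract_key_phrases(sections, region, subscription_key):
    """POST the batch to the key phrases endpoint (assumed v2.0 URL) and return the parsed JSON."""
    url = f"https://{region}.api.cognitive.microsoft.com/text/analytics/v2.0/keyPhrases"
    body = json.dumps(build_key_phrase_request(sections)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": subscription_key,
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping the payload builder separate from the network call makes it easy to inspect exactly what would be sent for a given file before spending any transactions.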

Extract Unique Keywords

The extraction step gives a complete list of extracted keywords for the content at each index, treating each index as a different document. But since we have split the content of a single document, there may be duplicate keywords across the content at different indexes, i.e. fileSection[0] and fileSection[2] may have some of the same keywords extracted.

But since we need a set of unique keywords for the entire document as a whole, all we’ll do is remove duplicate keywords from the list of keywords returned by the Azure Text Analytics API.

This function returns all the unique keywords found in the SharePoint document that was read.
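A minimal Python sketch of the de-duplication, assuming the result has been parsed into a dictionary with a "documents" list whose entries each carry a "keyPhrases" list (the shape I would expect from the REST response; treat it as an assumption). The function name is hypothetical, and comparing case-insensitively is my own choice rather than something the steps above require.

```python
def unique_key_phrases(batch_result: dict) -> list[str]:
    """Merge key phrases from all sub-documents, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for document in batch_result.get("documents", []):
        for phrase in document.get("keyPhrases", []):
            cleaned = phrase.strip()
            key = cleaned.casefold()  # compare without regard to letter case
            if key and key not in seen:
                seen.add(key)
                unique.append(cleaned)
    return unique
```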

Compare with Term Store and Find Common Terms

Now that we have the list of keywords extracted from the whole document, one approach is to just associate all these keywords with the document as metadata. But I have seen that for a large document the number of keywords can run into the hundreds, which is not very useful.

A better approach might be to find out if any of those extracted keywords exist in the Organizational Taxonomy (Managed Metadata) store and associate only those keywords which match.

For this article, I will go with this approach.

At the start of this article, we extracted the list of all Terms from the given term set. Now we also have the list of all keywords extracted from the document. So, let’s find out whether there are any common terms.
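The comparison itself is a simple intersection. The Python sketch below assumes that the term names and the extracted keywords are already available as plain lists of strings, and that an exact, case-insensitive match is good enough; both are my assumptions, and the function name is hypothetical.

```python
def find_common_terms(keywords, term_names):
    """Return the term-store names that also appear among the extracted keywords."""
    keyword_keys = {keyword.strip().casefold() for keyword in keywords}
    # Keep the spelling and order used in the term store, since those are the values to write back.
    return [name for name in term_names if name.strip().casefold() in keyword_keys]
```

For example, with the keywords ["risk management", "budget", "quarterly review"] and the terms ["Budget", "Risk Management", "Human Resources"], this returns ["Budget", "Risk Management"]. These are the values that would then be written to the “Content Areas” column of the document.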

Some More Thoughts

Now that we have put together a nice working example, some valid questions are how to use this in practice and when to trigger the update.

Well, I will just add some thoughts and leave the decision to your personal choice and your SharePoint environment.

For existing documents, this can be run as a batch job that iterates through all files and updates their metadata field with the identified keywords one by one. It can be run either from Windows Task Scheduler (for on-premises environments) or as a WebJob (for SharePoint Online).

For new files, you could use an Event Handler or a Web Hook attached to item added/updated events.

You could also look into Office 365 Management APIs and read SharePoint related events and take action if you find any documents uploaded.