LUIS: Notes from the Field of Natural Language Processing

I'm Anna Thomas, an Applied Data Scientist on Microsoft’s AI Engineering team. My focus for the past two years has been the “Applied AI” realm, which essentially means integrating pre-built AI services into applications to make them smarter and more effective.

The LearnAI team teaches at many events, and because of that I get exposed to many internal and external projects related to applying AI to existing and new applications. Over the past year, the team has had two popular courses in this area: Building Intelligent Apps and Agents, and Designing and Architecting Intelligent Agents. When I’m teaching these courses, or meeting with our Preferred Training Partners, I often serve as a sounding board for brainstorming ways to be successful and overcome challenges in current or new engagements with the suite of Applied AI products.

I digress. What I’ve learned is this: developing simple LUIS models is easy and intuitive, and those models work well. However, when the requirements for LUIS grow or become more complex, things get trickier. So, over the past year, I’ve been collecting input from members of the field as well as the Product Group, and I’d like to share what I’ve gathered to further democratize creating effective language understanding models.

So, enjoy, learn, and reach out if you want to share your own notes - and stay tuned!

LUIS: Notes from the Field

Originally, when I set out to write this blog, there were three points I wanted to make. But as I reviewed all of my notes, I realized I had more than three things to share. So I’ve organized my thoughts into three mega-notes, and I’ll provide references throughout so this can stay in the category of “blog” (instead of “book”).

Note 1: Generalize as much as possible

In the field, we notice that LUIS works well when you have about four intents, two entities, and 15 utterances per intent (including the None intent!). But as you start adding more entities (and different types of entities), more intents, and more utterances, somewhere down the line you start to see a drop in performance.

Alexei Robsky, a Senior Data Scientist at Microsoft, once told me, “It’s all about finding a balance between the number of intents, and the number of options or actions within an intent.” To explain, let’s take an example where a user wants to ask questions about a company and aggregation-style questions about it. Initially, the questions are simple, like “How many employees in [Company]?”, “How many buildings does [Company] have?”, or “Where is [Company] headquarters?”. But then you start expanding to allow more options – “How many cafeterias are in [Company] in [Location]?”, “Which gym for [Company] is open at [Time] in [Location]?”, or “Which building does [Company] sit in at [Location], where is the cafeteria, and how many people work in that building?” The point is that the requests and the options will inevitably get more complicated. Maintaining all of these different options/actions within one intent becomes cumbersome and difficult, and performance during training and testing dips. One mitigation I saw several teams use is to break up intents that have many actions into smaller intents or sub-intents.

On the flip side, having too many intents can be just as problematic as having one intent with many options/actions. Take a very simplified example: say I’ve created two sub-intents, “FindCafeteria” and “FindVendingMachine”. Both have optional entities of “Location” and “Hours”, and we can see how some of the utterances might overlap: “I’m hungry for a snack”, “Where can I find cheap food”, “I want cookies”, etc. So finding a balance between the number of intents, actions, and entities is important and may require some tweaking. That’s where the versioning and collaboration tools come in handy (which I’ll touch on in Note #3).

To expand on generalizing your models, I wanted to bring up the use of Patterns. Patterns were introduced into LUIS at Build (May 2018) and, at least in my experience, are underrated. Patterns are tools you can use to generalize common utterances, wording, or word ordering that signals a particular intent. If you’re familiar with Regex, patterns are like Regex for intent identification, but smarter: LUIS first recognizes entities, and then uses matching to identify patterns within the rest of the utterance. Because of this, patterns can help you increase the accuracy of an intent without providing a bunch of extra utterances. Once your LUIS model is published and the app deployed, review the endpoint utterances. If it turns out that many people are saying “Where is [Company] headquarters?” in that word order, then you can create a pattern that ignores punctuation (all patterns already ignore case, so no extra work for you there!). LUIS will then recognize the [Company] entity, see that the rest of the words match the pattern, and the resulting confidence should increase significantly.
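To illustrate the idea, here is a toy approximation of how a pattern can match once the entity has been recognized, ignoring case and punctuation. This is not LUIS’s actual matching algorithm, and “Contoso” is a made-up company name:

```python
import re

def matches_pattern(utterance, entity_text, entity_name, pattern):
    """Toy illustration only: LUIS's real matcher is more sophisticated.

    Substitute the already-recognized entity with its {EntityName}
    placeholder, then compare with the pattern template, ignoring
    case and punctuation.
    """
    def normalize(s):
        # Strip punctuation (keep braces for placeholders), lowercase, tokenize
        return re.sub(r"[^\w{}\s]", "", s).lower().split()

    templated = utterance.replace(entity_text, "{" + entity_name + "}")
    return normalize(templated) == normalize(pattern)

# The entity "Contoso" was found first; the remaining words match the pattern
print(matches_pattern("Where is Contoso headquarters?", "Contoso", "Company",
                      "where is {Company} headquarters"))  # True
```

Note how the match succeeds despite the differing capitalization and the trailing question mark, which is exactly the behavior the paragraph above describes.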

Note 2: Choose your entity types wisely

The LUIS team has developed several different entity types for us to take advantage of. But there are a lot of them, and I often find people (myself included) getting confused regarding which one should be used for what and why. So here’s a condensed version of what the different entities are and how they work:

Match (exact text match): Regular expression entity, List entity, Pattern¹

Machine-Learned (uses the utterances you provide to create a model): Simple entity, Composite entity, Hierarchical entity, Phrase list¹

Mix (uses a combination of entity detection methods): Pattern.any, Prebuilt entity, Role¹

¹ Not entities, but useful to note

I refer to this summary (and there’s a more detailed chart here) when designing LUIS models. From my experience working with various teams to improve model performance, here are a few other notes that are often useful when making decisions about entities:

The simple entity is best when the data to be extracted is a single concept, is not well-formatted (unlike a regular expression), and doesn’t match exactly to a list of words.

List entities work well for a clear, closed set of values. For example, company names are often better handled as lists, unless the variety becomes unmanageable (side note: I have seen this happen before, in which case the team did some post-processing with a simple hash look-up to confirm the correct entity name). Additionally, if there are conflicts or overlap between list entities, lists may not be the best route, as the exact-matching nature of list entities usually leads to unexpected results.
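As a sketch of that post-processing idea (the company names and the normalization rules here are entirely hypothetical), a simple hash look-up might look like this:

```python
# Hypothetical post-processing step: when the variety of company names is
# unmanageable as a list entity, resolve the raw extracted text against a
# canonical lookup table instead.
CANONICAL_COMPANIES = {          # keys are normalized forms (made-up data)
    "contoso": "Contoso Ltd.",
    "contoso ltd": "Contoso Ltd.",
    "fabrikam": "Fabrikam, Inc.",
}

def resolve_company(raw_entity_text):
    """Return the canonical company name, or None if unrecognized."""
    key = raw_entity_text.strip().lower().rstrip(".")
    return CANONICAL_COMPANIES.get(key)

print(resolve_company("Contoso"))    # Contoso Ltd.
print(resolve_company("Northwind"))  # None
```

The dictionary look-up is O(1), so this scales well even when the set of canonical names is large.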

Hierarchical entities provide the same contextual information as roles but only to utterances in intents. Similarly, roles provide the same contextual information as hierarchical entities but only in patterns. So which one to use will depend on whether you are using patterns.

A hidden gem I found recently was the “KeyPhrase” prebuilt entity, which can serve as a fallback in a bot when no other entities are found. It can be used to construct a follow-up question or even provide a contextual menu.

The pattern.any entity allows you to find free-form data where the wording makes it difficult to determine where the entity begins and/or ends. For example, “What is {CompanyName} address?” These can also be optional or have optional parameters (using square brackets in combination with curly brackets), which is useful, especially keeping Note #1 (generalize as much as possible) in mind.
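To show how the curly braces and square brackets compose, here is a tiny translator from the pattern surface syntax into a regular expression. This is a surface-level sketch only (LUIS’s matcher is richer), and the pattern and names are made up:

```python
import re

def pattern_to_regex(pattern):
    """Sketch: translate LUIS-style pattern syntax into a regex.

    {Entity}   -> a named capture group (free-form text, like pattern.any)
    [optional] -> an optional group
    This mimics only the surface syntax, not LUIS's actual matcher.
    """
    rx = re.escape(pattern)
    rx = re.sub(r"\\{(\w+)\\}", r"(?P<\1>.+?)", rx)  # {CompanyName} -> capture
    rx = re.sub(r"\\\[(.*?)\\\]", r"(?:\1)?", rx)    # [?] -> optional text
    return "^" + rx + "$"

# The trailing "?" is optional thanks to the square brackets
rx = pattern_to_regex("what is {CompanyName} address[?]")
m = re.match(rx, "what is Contoso Pharmaceuticals address", re.IGNORECASE)
print(m.group("CompanyName"))  # Contoso Pharmaceuticals
```

The same template matches both “what is Contoso address” and “what is Contoso address?”, which is the kind of generalization Note #1 is after.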

Phrase lists are used to improve entity detection, and ultimately improve pattern detection.

Remember, entities are found first, then the pattern is matched.

Note 3: This is still data science

I think it’s easy for those of us who work with LUIS, and who are more Developer/Architect than Data Scientist, to forget that creating LUIS models is still data science. As such, we should follow the Team Data Science Process (TDSP) and other data science protocols. I want to focus on the data science principle that you’ll typically have three separate datasets: one for training the model, one for choosing the best model (the validation set), and one for testing the performance of the selected model.
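To make the three-dataset principle concrete, here is one way to carve a pool of labeled utterances into training, validation, and test sets. The fractions and the sample data are illustrative, not a recommendation:

```python
import random

def split_utterances(utterances, seed=42, val_frac=0.15, test_frac=0.15):
    """Shuffle labeled utterances and split them three ways.

    The fractions are illustrative; pick sizes appropriate to your data.
    """
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = utterances[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_val, n_test = int(n * val_frac), int(n * test_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical labeled data: (utterance, intent) pairs
data = [("where is Contoso headquarters", "CompanyInfo")] * 20
train, val, test = split_utterances(data)
print(len(train), len(val), len(test))  # 14 3 3
```

The fixed seed matters for reproducibility: if you re-split with different randomness every iteration, utterances leak between sets and your test scores stop being trustworthy.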

When you start developing your model, you should have base training/validation/testing datasets (sample utterances) identified. Over time, you’ll add utterances to your training data by reviewing endpoint utterances, and it’s important to update your validation and testing sets to be representative of the changing utterances as well. This is where the batch testing and versioning features can come in handy.
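For reference, a LUIS batch test file is a JSON array of labeled utterances, where each entity label carries inclusive character offsets into the utterance text. The sketch below builds a small validation file in that shape; the utterances, intent names, and entity spans are made up, and you should double-check the current file format against the docs:

```python
import json

# Hypothetical labeled validation utterances in the batch-test file shape:
# "startPos"/"endPos" are inclusive character offsets into "text".
batch = [
    {
        "text": "where is Contoso headquarters",
        "intent": "CompanyInfo",
        "entities": [
            {"entity": "Company", "startPos": 9, "endPos": 15}
        ],
    },
    {
        "text": "how many employees does Contoso have",
        "intent": "CompanyStats",
        "entities": [
            {"entity": "Company", "startPos": 24, "endPos": 30}
        ],
    },
]

# Write the batch file that would be uploaded for a batch test
with open("batch_validation.json", "w") as f:
    json.dump(batch, f, indent=2)
```

Because the offsets are easy to get wrong by hand, it’s worth verifying that `text[startPos:endPos + 1]` really is the entity string before uploading.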

Let’s walk through a hypothetical example. Say we’ve created and deployed a baseline model, with all simple entities and only a few intents (no patterns). After a few weeks of being deployed, we want to see how it’s doing and make improvements. We can use our validation dataset to run a batch test, which provides us with confusion matrices for each of the intents. This gives us insight into what our model does well and what we might want to improve.

So we clone the model and use active learning to add utterances to the training set. At this point, we should also note which utterances are being added, and make sure we add some that are similar in form (but not identical!) to our validation and test datasets. We will also probably notice a few common phrases that users submit, so we create a few patterns. Next, we retrain our model and run batch testing again (with the validation dataset) to see whether we’ve addressed the errors we saw in the previous version. We might even try different combinations of updates (e.g. perhaps testing out hierarchical entities, or exploring pattern.any and roles) by creating additional clones of the previous version, before running batch tests to pick the best version.

Once we’ve picked the best model (using the confusion matrices), we might publish it to the Staging slot and perform a final test using the test dataset to confirm the model performs well on unseen data before publishing to the Production slot. Remembering that creating models is an iterative process, sometime later we’ll want to walk through these steps again, improving and updating our model as needed. As your model or application becomes more complicated or widespread, you may have multiple LUIS models working together, in which case you can leverage the Dispatch tool (in addition to batch testing) to determine model overlap, performance, and routing.
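As a rough sketch of the evaluation step, here is how you might tally a confusion matrix from labeled validation utterances and any prediction function. The predictor here is a toy stand-in, not a real LUIS endpoint call, and the intents and utterances are the made-up examples from Note #1:

```python
from collections import Counter

def confusion_counts(labeled, predict):
    """Tally (actual, predicted) intent pairs from a batch run.

    `labeled` is a list of (utterance, actual_intent) pairs; `predict` is
    any callable mapping an utterance to a predicted intent (e.g. a
    wrapper around your published endpoint -- hypothetical here).
    """
    counts = Counter()
    for text, actual in labeled:
        counts[(actual, predict(text))] += 1
    return counts

# Toy stand-in for a real prediction endpoint
def fake_predict(text):
    return "FindCafeteria" if "food" in text or "hungry" in text else "None"

labeled = [
    ("where can I find cheap food", "FindCafeteria"),
    ("I'm hungry for a snack", "FindCafeteria"),
    ("I want cookies", "FindCafeteria"),
]
print(confusion_counts(labeled, fake_predict))
# Counter({('FindCafeteria', 'FindCafeteria'): 2, ('FindCafeteria', 'None'): 1})
```

Off-diagonal cells like `('FindCafeteria', 'None')` are exactly the errors you would target in the next clone-and-retrain iteration.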

Acknowledgements

I want to thank you for reading this blog in its entirety. I also want to thank Conrad Wong, Alexei Robsky, Dina Berry, Nayer Wanas, the LearnAI team, and the field.