Building a Bot to Answer FAQs: Predicting Text Similarity

February 15, 2017 / Business, Developers, Tutorials

In our previous tutorial on customer support bots, we trained a bot using the Custom Collection API to direct customers to the team member who is best suited to assist them with their problem or query. The bot improved our team’s response times as we no longer had to rely on a human facilitator (who also plays many other roles in our company #startuplife) to do the job. However, we’re generally only able to respond during our office hours of 11am-7pm EST, so there’s still lag for inquiries outside of that period. How can we improve this? Build a bot to answer frequently asked questions, reducing lag time for more customers and ensuring our engineers don’t need to spend more time than necessary away from the products we’re building for you :).

The Task

We’ll conduct a nearest neighbour search in Python, comparing a user input question to a list of FAQs. To do this, we’ll use indico’s Text Features API to find the feature vectors for all the text data, and calculate the distance between each FAQ’s vector and that of the user’s input question in 300-dimensional space. Then we’ll return the appropriate answer based on the FAQ that the user’s question is most similar to (if it meets a certain confidence threshold).

Getting Started

First, get the skeleton code from our SuperCell GitHub repo.
You’ll need to install all necessary packages if you don’t have them — texttable and, of course, indicoio.
If you haven’t already set up your indico account, follow our Quickstart Guide. It will walk you through the process of getting your API key and installing the indicoio Python library. If you run into any problems, check the Installation section of the docs. If all else fails, you can also reach out to us through that little chat bubble. Assuming your account is all set up and you’ve installed everything, let’s get started!
Go to the top of your file and import indicoio. Don’t forget to set your API key. There are a number of ways you can do it; I like to put mine in a configuration file.

import indicoio
indicoio.config.api_key = 'YOUR_API_KEY'
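
If you would rather not hard-code the key in the script, one option is to read it from an environment variable. The sketch below is only an illustration: the helper `load_api_key` and the variable name `INDICO_API_KEY` are assumptions made for this example, and the indicoio lines are left commented out so the snippet runs without the package installed.

```python
import os


def load_api_key(env_var="INDICO_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError("Set the %s environment variable first" % env_var)
    return key


# Usage with the client (commented out so the sketch runs without the package):
# import indicoio
# indicoio.config.api_key = load_api_key()
```

Failing loudly when the variable is missing makes a forgotten key show up at startup rather than as an API error later on.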

Using indico’s Text Features API

You’ll need to store your FAQs and their respective answers in a dictionary. For simplicity’s sake, I’ve created a dictionary, faqs, of five questions and answers in the script itself. This will be our starting dataset. We only need to find the text features for the questions and not the answers, so we extract faqs.keys() and then feed that data into our make_feats() function.

def make_feats(data):
    """
    Send our text data through the indico API and return each text example's text vector representation
    """
    chunks = [data[x:x+100] for x in xrange(0, len(data), 100)]
    feats = []
    # just a progress bar to show us how much we have left
    for chunk in tqdm(chunks):
        feats.extend(indicoio.text_features(chunk))
    return feats
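
The listing above is written for Python 2. On Python 3, `xrange` no longer exists and `dict.keys()` returns a view that cannot be sliced, so the batching step needs small changes. Here is a rough sketch of the same idea that runs offline; `chunked`, `make_feats_py3`, and `fake_featurize` are hypothetical names introduced for this example, and the stand-in featurizer would be replaced by `indicoio.text_features` in real use.

```python
def chunked(items, size=100):
    """Split a sequence into consecutive lists of at most `size` items."""
    items = list(items)  # dict views are not sliceable in Python 3
    return [items[i:i + size] for i in range(0, len(items), size)]


def make_feats_py3(data, featurize, size=100):
    """Featurize `data` in batches; `featurize` maps a list of texts to a list of vectors."""
    feats = []
    for chunk in chunked(data, size):
        feats.extend(featurize(chunk))
    return feats


# Stand-in featurizer so the sketch runs without an API key (not the real API).
def fake_featurize(texts):
    return [[float(len(t))] for t in texts]


questions = ["question %d" % i for i in range(250)]
batches = chunked(questions)
feats = make_feats_py3(questions, fake_featurize)
print([len(b) for b in batches])  # [100, 100, 50]
print(len(feats))                 # 250
```

Passing the featurizer in as an argument keeps the batching logic testable without touching the network.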

Next, let’s update the run() function. Save out feats to a Pickle file so you don’t have to keep re-running the Text Features API on the static list of FAQs every time you want to compare a user’s question to it.
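
A minimal sketch of that caching step might look like the following. It is not the tutorial's exact run() code: the helper name `load_or_make_feats` and the file name `faq_feats.pkl` are assumptions made for this example, and `make_feats` is passed in as an argument so the snippet stays self-contained.

```python
import os
import pickle


def load_or_make_feats(questions, make_feats, path="faq_feats.pkl"):
    """Return cached features from `path` if present, otherwise compute and cache them."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    feats = make_feats(questions)
    with open(path, "wb") as f:
        pickle.dump(feats, f)
    return feats
```

Note that the cache is keyed only on the file path, so if you edit your FAQs you would need to delete the Pickle file to force the features to be recomputed.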

Comparing FAQs to User Input

Now that we’ve got the feature representations for the FAQ text data, let’s move on to the next phase: collecting and comparing user questions to our FAQs. So that everyone can run this script locally, no matter what customer support chat service you plan to hook this up to, we’ll just use raw_input(). You’ll need to set up your own webhook according to your messaging app’s docs.
First, let’s get an input and add it to the list of FAQs, then find the text features for the input and add them to the main feats list. This will simplify things later when we need to calculate the distances for all the feature representations. Update the input_question() function:

def input_question(data, feats):
    # input a question
    question = raw_input("What is your question? ")
    # add the user question and its vector representations to the corresponding lists, `data` and `feats`
    # insert them at index 0 so you know exactly where they are for later distance calculations
    if question is not None:
        data.insert(0, question)
        new_feats = indicoio.text_features(question)
        feats.insert(0, new_feats)
    return data, feats

Time to update the run() function again. This time, you can just load the Pickle file with the FAQ features you found earlier.

So now we’ve got a list of feature vectors for all the FAQs and the user’s question! How will this help us figure out which FAQ the input is most similar to? Similarity between pieces of text is measured by similarity between their corresponding feature vectors. We predict their similarity in the calculate_distances function, which calculates the cosine distance between these vectors. Cosine is generally the comparison metric of choice when you’re dealing with points in high-dimensional space. For n documents, calculate_distances produces an n by n matrix that stores the distance between document i and document j at distance_matrix[i][j].
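
To make the idea concrete, here is a small pure-Python sketch of a pairwise cosine distance matrix. It illustrates the technique and is not the skeleton code's actual calculate_distances implementation; the names `cosine_distance` and `calculate_distances_sketch` and the toy 3-dimensional vectors are assumptions made for this example.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def calculate_distances_sketch(feats):
    """Build a square matrix where entry [i][j] is the distance between vectors i and j."""
    return [[cosine_distance(u, v) for v in feats] for u in feats]


# Toy 3-dimensional vectors standing in for real 300-dimensional text features
feats = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
distance_matrix = calculate_distances_sketch(feats)
print(distance_matrix[0][1])  # 0.0: same direction, different length
print(distance_matrix[0][2])  # 1.0: orthogonal vectors
```

Because cosine distance depends only on the angle between vectors, the first two toy vectors count as identical even though one is twice as long as the other.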
Update run() once again:

Finally, let’s see how well our nearest neighbours search performs! The similarity_text() function will sort through the distance_matrix and order each piece of text according to level of similarity, and then print it out in a table. We don’t want our bot to give an answer if it’s not very confident that it’s found an FAQ match though, so we need to set a confidence threshold. Add the following code to similarity_text(), just below print t.draw():
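
A minimal sketch of such a threshold check, not the tutorial's exact listing, could look like this. The helper name `find_faq_match`, the default threshold of 0.3, and the toy distances are all placeholders chosen for illustration; you would tune the threshold against your own FAQs.

```python
def find_faq_match(distances_to_input, questions, threshold=0.3):
    """Return the closest FAQ if its distance is within `threshold`, else None.

    `distances_to_input[i]` is the distance between the user's question
    and `questions[i]`; smaller means more similar.
    """
    if not questions:
        return None
    best = min(range(len(questions)), key=lambda i: distances_to_input[i])
    if distances_to_input[best] <= threshold:
        return questions[best]
    return None


faq_questions = ["How do I reset my password?", "Where can I find my API key?"]
print(find_faq_match([0.62, 0.18], faq_questions))  # Where can I find my API key?
print(find_faq_match([0.62, 0.55], faq_questions))  # None: nothing is close enough
```

Since the user's question was inserted at index 0 earlier, the distances to pass in would be the first row of distance_matrix with its own entry dropped, that is distance_matrix[0][1:], alongside data[1:].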

If the bot’s confidence level meets the threshold, it should return the appropriate FAQ answer. Otherwise, it should notify your customer support manager (you’ll have to hook that up based on your messaging app’s docs):

# print the appropriate answer to the FAQ, or bring in a human to respond
if faq_match is not None:
    print "A: %r" % faqs[faq_match]
else:
    print sorry

How well did it perform? Here’s an example:
Woo! It performed quite well. Even though the input question’s word choice differed from the FAQ match’s and was more concise, our rich text features were still able to capture its meaning. So, what’s happening here? Why does this work?
indico’s Text Features API creates a rich feature vector representation for a given text input, learned using deep learning techniques. These feature vectors (numerical representations in multi-dimensional space) are a computer’s way of assigning meaning to a word.
When we fed the list of FAQs into the Text Features algorithm, it essentially identified certain words in each question as “significant”, and determined where they could be found in multi-dimensional space. Text Features then combined these word representations to produce a representation of the entire “document” (in this case, an FAQ — you could also pass in an entire article to find its feature representation). It did the same thing when we showed it our new input question. We then used these document representations to calculate the distance between them to determine how conceptually similar the two “documents” are. The closer the features in multi-dimensional space, the closer they are in meaning.
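
As a toy illustration of what “combining word representations” can mean, the sketch below averages per-word vectors into a single document vector. This is an assumption chosen for simplicity, not a description of how the Text Features model actually combines words, and the 2-dimensional vectors are made up purely for the example.

```python
def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = float(len(vectors))
    return [sum(component) / count for component in zip(*vectors)]


# Made-up 2-dimensional "word vectors", purely for illustration
toy_word_vectors = {
    "reset": [1.0, 0.0],
    "password": [0.0, 1.0],
}


def document_vector(text, word_vectors):
    """Average the vectors of the words we know; ignore the rest."""
    known = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not known:
        raise ValueError("no known words in text")
    return average_vectors(known)


print(document_vector("Reset my password", toy_word_vectors))  # [0.5, 0.5]
```

Here “my” has no vector and is skipped, so only the two known words contribute to the document's position in the space.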

Next Steps

At the heart of it, this was an exercise in predicting text similarity. What other applications can you imagine for this task? If you create something cool with our APIs, definitely let us know at contact@indico.io!