Links

A blog about Chat bots (mainly Watson), and life in general.

Python

So this really only helps if you are doing a large number of intents, and you have not used entities as your primary method of determining intent.

First lets talk about perceived accuracy, and what this is trying to solve. Perceived accuracy is where someone will type in a few questions they know the answer to. Then depending on their manual test they perceive the system to be working or failing.

It puts the person training the system into a false sense of how it is performing.

K-fold cross validation : You split your training set into random segments (K). Use one set to test and the rest to train. You then work your way through all of them. This method will test everything, but will be extremely time-consuming. Also you need to pick a good size for K so that you can test correctly.

K-Fold works well by itself if you have a large training set that has come from a real world representative users. You will find this rarely happens. So you should use in conjunction with a blind.

Previously I didn’t cover how you actually do the test. So with that, here is the notebook giving a demonstration:

While the complexity of building Conversation has reduced for non-developers, one of the areas that people can sometimes struggle is training intents.

When trying to determine how the system performs, it is important to use tried and true methods of validation. If you go by someone just interacting with the system you can end up with what is called “perceived accuracy”. A person may ask three to four questions, and three may fail. Their perception becomes that the system is broken.

Using cross validation allows you to give a better feel of how the system is working. As will a blind / test set. But knowing how the system performs and trying to trying to interpret the results is where it takes practise.

Take this example test report. Each line is where a question is asked, and it determines what the answer should be, versus what answer came back. The X denotes where an answer failed.

Unless you are used to doing this analysis full time, it is very hard to see the bigger picture. For example, is DOG_HEALTH the issue, or is it LITTER_HEALTH + BREEDER_GUARANTEE?

You have to manually analyse each of these clusters and determine what changes are required to be made.

Thankfully Sci-Kit makes your life easier with being able to create a confusion matrix. With this you can see how each intent performs against each other. So you end up with something like this:

So now you can quickly see what Intents are getting confused with others. So you focus on those to improve your accuracy better. In the example above, DOG_HEALTH and BREEDER_INFORMATION offer the best areas to investigate.

I’ve created a Sample Notebook which demonstrates the above, so you can modify to test your own conversations training.

Logging!

You can now download your logs from your conversation workspace into a JSON format. So I thought I’d take this moment to introduce Pandas. Some people love the “Improve” UI, but personally I like being able to easily mold the data to what I need.

First, if you are new to Python, I strongly recommend getting a Python Notebook like Jupyter set up or use IBM Data Science Experience. It makes learning so much easier, and you build your applications like actual documentation.

So we have the whole log now sitting in j but we want to make a dataframe. Before we do that however, let’s talk about log analysis and the fields you need. There are three areas we want to analyse in logs.

Quantitive – These are fixed metrics, like number of users, response times, common intents, etc.

Qualitative – This is analysing how the end user is speaking, and how the system interpreted and responded. Some examples would be where the answer returned may give the wrong impression to the end user, or users ask things out of expected areas.

Debugging – This is really looking for coding issues with your conversation tree.

So on to the fields that cover these areas. These are all contained in j['response'].

Field

Usage

Description

input.text

Qualitative

This is what the user or the application typed in.

intents[]

Qualitative

This tells you the primary intent for the users question. You should capture the intent and confidence into columns. If the value is [] then means it was irrelevant.

entities[]

Quantitive

The entities found in relation to the call. With this and intents though, it’s important to understand that the application can override these values.

output.text[]

Qualitative

This is the response shown to the user (or application).

output.log_messages

Debugging

Capturing this field is handy to look for coding issues within your conversation tree. SPEL errors show up here if they happen.

output.nodes_visited

Debugging
Qualitive

This can be used to see how a progression through a tree happens

context.conversation_id

All

Use this to group users conversation together. In some solutions however, one pass calls are sometimes done mid conversation. So if you do this, you need to factor that in.

context.system.branch_exited

Debugging

This tells you if your conversation left a branch and returned to root.

context.system.branch_exited_reason

Debugging

If branch.exited is true then this will tell the why. completed means that the branch found a matching node, and finished. fallback means that it could not find a matching node, so it jumps back to root to find the match.

context.???

All

You may have context variables you want to capture. You can either do these individually, or code to remove conversation objects and grab what remains

request_timestamp

Quantitive
Qualitative

When conversation received the users response.

response_timestamp

Quantitive
Qualitative

When conversation responded to the user. You can do a delta to see if there are conversation performance issues, but generally keep one of the timestamp fields for analysis.

So we create a row array, and fill it with dict objects of the columns we want to capture. For clarity of the blog post, the sample code below

One problem that is tricky to solve is if a user has asked two questions. Previously some solutions were to look for conjunctions (“and”) or question marks. Then try to guess if it is a question.

But you could end up with a question like “Has my dog been around other dogs and other people?”. This is clearly one question.

With the new conversation feature of “Absolute Confidences”, it is now possible to detect this. Earlier versions of conversation would have all intents would add up to 1.0.

Now each confidence has it’s own value. Taking the earlier example, if we map the confidences to a chart, we get:

Visually we can see that the first and second intent are not related. The next sentence “Has my dog been around other dogs and is it certified?” is two questions. When we chart this we see:

Very easy to see that there are two questions. So how to do it in your code?

You can use a clustering technique called K-means. This will cluster your data into sets of ‘K’. In this case we have “important intents” and “unimportant intents”. Two groups, means K = 2.

For this demonstration I am going to use Python, but K-means exists in a number of languages. I have a sample of the full code, and example conversation workspace. So for this I will only show code snippets.

Walkthrough

Conversation request needs to set alternate_intents to true. So that you can get access to the top 10 intents.

Once you get your response back, convert your confidence list into an array.

The first three lines will take the array of confidences and generate two centroids. A centroid is the mean of each cluster found. It will then group each of the confidences into one of the two centroids.

The first value is the first intent. So if the first value is 0 we invert the array and then add up all the values:

[ 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 ] => 2

If we get a value of 2, then the first two intents are related to the question that was entered. Any other value, then we only have one question, or potentially more than two important intents.

Example output from the code:

Has my dog been around other dogs and other people?
> Single intent: DOG_SOCIALISATION (0.9876400232315063)
Has my dog been around others dogs and is it certified?
> This might be a compound question. Intent 1: DOG_SOCIALISATION (0.7363447546958923). Intent 2: DOG_CERTIFICATION (0.6973928809165955).
Has my dog been around other dogs? Has it been around other people?
> Single intent: DOG_SOCIALISATION (0.992318868637085)
Do I need to get shots for the puppy and deworm it?
> This might be a compound question. Intent 1: DOG_VACCINATIONS (0.832768440246582). Intent 2: DOG_DEWORMING (0.49955931305885315).

Of course you still need to write code to take action on both intents, but this might make it a bit easier to handle compound questions.

Apologies in my long time updating, life has been a bit crazy busy at the moment. I have a few entries cached to go, but couldn’t get around to finishing. As this year is nearly at an end for me, I should have some spare time to catch up.

So this is a brief entry to talk about IBM Data Science experience. This is a new service which hooks into Bluemix. Using Spark it allows you build python/R/Scala notebooks.

For those not familiar with notebooks, they are a really cool way to create prototyping code as documentation. It also has a whole host of extras that you can hook into to visualise and manipulate your data. As well as loads of datasets to play with.

So let’s talk about intents. The documentation is not bad in explaining what an intent is, but doesn’t really go into its strengths, or the best means to collect them.

First the important thing to understand with intents. How Watson perceives the world is defined by their intents. If you ask Watson a question, it can only understand it in relation to the intents. It cannot answer a question where it has not been trained on the context.

So for example if I say “I want to get a fishing license” may work for what you trained, but “I want to get driving license” may give you the same response, simply because it closely matches and falls outside of what your application is intended for.

So it is just as important to understand what is out of scope, but you may need to give an answer to.

Getting your questions for training.

The strength of intents is the ability to map your customers language to your domain language. I can’t stress this enough. While Watson can be quite intelligent in understanding terms with its training, it is making those connections of language which does not directly related to your domain is important.

This is where you can get the best results. So it is important to collect questions in the voice of your end-user.

The “voice” can also mean where and how the question was asked. How someone asks the question on the phone can be different to instant messaging. Depending on how you plan to create your application, depends on how you should capture those questions.

When collecting, make sure you do not accidentally bias the results. For example, if you have a subject matter expert collecting, you will find they will unconsciously change the question when writing it. Likewise if you question collect from surveys, try to avoid asking questions which will bias the results. Take these two examples.

“Ask questions relating to school timetables”

“You just arrived on campus, and you don’t know where or what to do next.”

The first one will generate a very narrow scope of test questions related to your application, and not what a person ask when in a situation. The second question is broader, but you may still find that people will say things like “campus”, “where”, “what”.

Which comes first? Questions or Intents?

If you have defined the intents first, you need to get the questions for them. However there is a danger that you are creating more work for yourself than needed.

If you do straight question collection, when you start to cluster into intents you will start to see something like this:

Everything right of the orange line (long tail) does not have enough to train Conversation. Now you could go out and try and find questions for the long tail, but that is the wrong way to approach this.

Focus on the left side (fat head), this is the most common stuff people will ask. It will also allow you to work on a very well polished user experience which most users will hit.

The long tail still needs to be addressed, and if you have a full flat line then you need to look at a different solution. For example Retrieve & Rank. There is an example that uses both.

Manufacturing Intent

Now creating manufactured questions is always a bad thing. There may be instances where you need to do this. But it has to be done carefully. Watson is pretty intelligent when it comes to understanding the cluster of questions. But the user who creates those questions may not speak in the way of the customer (even if they believe they do).

Take these examples:

What is the status of my PMR?

Can you give me an update on my PMR?

What is happening with my PMR?

What is the latest update of my PMR?

I want to know the status of my PMR.

Straight away you can see “PMR” which is a common term for an SME, but may not be for the end-user. No where does it mention what a PMR is. You can also see “update” and “status” repeated, which is unlikely to be an issue for Watson but doesn’t really create much variance.

Test, Test, Test!

Just like a human that you teach, you need to test to make sure they understood the material.

Get real world data!

After you have clustered all your questions, take out a random 10%-20% (depending on how many you have). You set these aside and don’t look at the contents. This is normally called a “Blind Test”.

Run it against what you have trained on and get the results. These should give you an indicator of how it reacts in the real world*. Even if the results are bad, do not look as to why.

Instead you can create one or more of the following tests to see where things are going weird.

Test Set : Similar to the blind test, you remove 10%-20% and use that to test (don’t add back until you get more questions). You should get pretty close results to your blind test. You can examine the results to see why it’s not performing. The problem with the test set is that you are reducing the size of training set, so if you a low number of questions to begin with, then next two tests help.

K-fold cross validation : You split your training set into random segments (K). Use one set to test and the rest to train. You then work your way through all of them. This method will test everything, but will be extremely time-consuming. Also you need to pick a good size for K so that you can test correctly.

Monte Carlo cross validation : In this instance you take out a random 10%-20% (depending on train set size) and test against it. Normally run this test at least 3 times and take the average. Quicker to test. I have a sample python script which can help you here.

* If your questions were manufactured, then you are going to have a problem testing how well the system is going to perform in real life!

I got the results. Now what?

First check your results of your blind test vs whatever test you did above. They should fall within 5% of each other. If not then your system is not correctly trained.

If this is the case, you need to look at the wrong questions cluster, and also the clusters that got the wrong answer. You need to factor in the confidence of the system as well. You should look for patterns that explain why it picked the wrong answer.

I come from a Java development background, but since joining Watson I’ve started using Python and love it. 🙂 It’s like it was made for Conversation.

The Conversation test sidebar is handy, but sometimes you need to see the raw data, or certain parts that don’t show up in the side bar.

Creating a Bluemix application can be heavy if you just want to do some testing of your conversation. Python allows you to test with very little code. Here is some easy steps to get you started. I am making an assumption you have

1: If you are using a MAC you have python already installed. Otherwise you need to download from python.org.