
A blog about Chat bots (mainly Watson), and life in general.

Month: November 2018

“Tools are the subtlest of traps”

Most developers learn the dangers of evil wizards; for everyone else it might not be so obvious. The purpose of the Watson tooling is to democratize AI, that is, to abstract the AI layer away from your knowledge worker.

This allows you to utilize the power of AI, without having to search for the mythical person who understands your business and NLP, AI, etc.

While this can remove a lot of the complexity, it can also lead people into a false sense of security that their hand will be held through the whole process.

So I am taking a time out to talk about Watson Knowledge Studio (WKS). For those that don’t know what it is, the tool allows you to annotate your domain documents and surface structured insights from unstructured data. It is extremely powerful and easy to use compared to other solutions out there.

The downside is that it is extremely easy to use. So I have seen a number of different people/companies rush in and create a model that disappoints or, in some cases, infuriates. It’s not that this is unique to WKS, only that it is easy to overlook important steps in your workflow.

Now IBM does offer 3-4 days of training, and there are a number of videos (slightly out of date) that cover some of this. But to help people who are starting off, I am going to list the main pitfalls you need to watch out for when doing your first WKS project.

Understand what you need to surface from your data!

This happens so often with technical people. They look at the tooling, see how easy it is and run off and annotate the world in their documents.

What normally happens is a very poor model: it surfaces information correctly, but that information is meaningless to the business.

Get your business analysts/SMEs involved from the start.

You need someone who objectively understands the business problem you are trying to solve. You also need to look at your data sources and determine whether you can even surface that information (i.e. whether there are enough samples to train on).

Limit your Types and Relationships on your first pass.

After you have looked at what you want to surface, you need to focus on a small number of types and relationships. Your BA/SME might have picked 50-100, but generally you should pick around 20-40. There are reasons for this.

Each type/relationship adds more work for your human annotator.

Models can build faster.

As you work through documents you will find that your needs for types/relationships may change.

Don’t reinvent the wheel.

If you have existing annotators that will work as-is, don’t try to integrate them into your model. They may be part of your business requirement, but all you are doing is adding complexity to your model. You can run a second pass on your finished data to get that information.

Understand when to use rule based versus model based.

The purpose of the AI model is to have it train and understand content it has never seen before. To do this requires a lot of up front work on annotation and training.

Compare this to the rule based model. If you know that new terms/phrases are unlikely to come up, but the way they are written may vary, then rule based may solve your issue.

Personally, I think the AI model is the better choice if you plan to go with WKS. There is easier tooling available for rule based work, for example Watson Explorer Studio.

Inter-Annotator Agreement is King.

Two things to realize before you start annotating.

Just because you are an expert in the content, doesn’t mean you are an expert at annotating.

The more subject matter experts you have, the less agreement on topics will happen in the real world.

To that end, you need to clearly define your inter-annotator agreement (IAA) so there is no ambiguity or disagreement. Have examples, and also have a single SME as the deciding factor where further disagreements occur.

Not creating a proper IAA can lead to more work for your main SME, and can damage your model to the extent of hours/days of wasted work.
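One way to check whether the agreed guidelines are actually working is to measure how often two annotators label the same text the same way. The following is only a rough sketch, assuming you can export each annotator's labels as a list with one label per token; the function name and the label values are made up for illustration. It uses Cohen's kappa, which is a standard agreement statistic, and is not a description of how WKS calculates its own agreement scores.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same tokens.

    labels_a / labels_b: equal-length lists, one label per token
    (an assumed export format, not something WKS produces directly).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")

    n = len(labels_a)
    # Fraction of tokens where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


annotator_1 = ["PERSON", "ORG", "O", "O"]
annotator_2 = ["PERSON", "O", "O", "O"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

A value near 1 suggests the annotators are following the same rules; a low value on a particular type is a hint that its guideline needs a clearer definition or more examples.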

Data Wrangling is required.

Most of the work in formatting your data is to keep your human annotator sane. Annotating a document is a mentally exhausting process that normally follows these steps.

Read and understand the paragraph.

Annotate the paragraph with types and relationships.

Read and annotate the co-references.

Fix mistakes as you go.

You want to reduce the amount of time this takes for each document and each working set. So if there is information that doesn’t need to be annotated, remove it. If your documents are very small, then join them together (with some clear marker of where a new document starts).
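Joining small documents is easy to script before you upload anything. Here is a minimal sketch, assuming your documents are already loaded as (name, text) pairs; the marker text, the function name and the 5000 character limit are all hypothetical choices, not anything WKS requires.

```python
# Hypothetical marker format; pick anything your annotators will spot easily.
DOC_MARKER = "===== NEW DOCUMENT: {name} ====="


def join_small_documents(docs, max_chars=5000):
    """Group (name, text) pairs into larger documents for annotation.

    Each original document keeps a clear marker line above it. The size
    check is approximate (it ignores the blank lines between blocks), and
    a single document longer than max_chars simply gets a batch to itself.
    """
    batches, current, size = [], [], 0
    for name, text in docs:
        block = DOC_MARKER.format(name=name) + "\n" + text.strip()
        # Start a new batch if adding this block would exceed the limit.
        if current and size + len(block) > max_chars:
            batches.append("\n\n".join(current))
            current, size = [], 0
        current.append(block)
        size += len(block)
    if current:
        batches.append("\n\n".join(current))
    return batches


docs = [("a.txt", "x" * 10), ("b.txt", "y" * 10), ("c.txt", "z" * 10)]
for batch in join_small_documents(docs, max_chars=100):
    print(batch)
    print()
```

Keeping the original file name in the marker means you can still trace an annotation back to its source document later.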

You want each document to be annotated in 30 minutes or so, and each document set in a day. This will allow you to progress at a reasonable speed, and build frequent models.
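To turn those targets into a working set size, a quick back-of-the-envelope calculation helps. This sketch assumes about six productive annotation hours in a day, which is my own guess given how tiring annotating is; adjust the numbers to your team.

```python
def docs_per_working_set(minutes_per_doc=30, productive_hours_per_day=6, days=1):
    """Rough number of documents one annotator can finish in a working set.

    productive_hours_per_day is an assumption, not a WKS recommendation.
    """
    return int(days * productive_hours_per_day * 60 // minutes_per_doc)


print(docs_per_working_set())        # one day at 30 minutes per document
print(docs_per_working_set(days=2))  # a two day working set
```

With those assumptions a one day working set comes out at about a dozen documents per annotator.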

On top of this, you should also look at sourcing any dictionaries/terms that can be used to kickstart the annotation (which most people do).

Lastly, check how your documents are ingested into WKS. For example, I’ve seen instances of “word.word”, where the space after a full stop has been lost. WKS sees this as a single term, and fixing that annotation can be annoying. You may need to do some formatting beforehand to remove or at least limit these mistakes.
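A small clean-up pass before upload can catch many of these. The sketch below is a heuristic, not a complete fix: it assumes the lost space sits between a lowercase letter and a capitalised next sentence, so it leaves things like “example.com” and version numbers alone, but it will also miss an all-lowercase “word.word”. The function name is made up for illustration.

```python
import re

# A lowercase letter, a full stop, then an uppercase letter, e.g. "stable.Discharged".
_MISSING_SPACE = re.compile(r"(?<=[a-z])\.(?=[A-Z])")


def split_fused_sentences(text):
    """Re-insert the space after a full stop that was lost during extraction."""
    return _MISSING_SPACE.sub(". ", text)


print(split_fused_sentences("The patient was stable.Discharged on Monday."))
print(split_fused_sentences("See example.com for v1.2 details."))
```

Run it on a sample of your documents first and check the changes by eye, since names with a full stop in the middle (for example some product or file names) could be split by mistake.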

Build your model soon and often.

You can’t really see what you are doing wrong until 2-3 models in. So it is important to build these models as soon as you can.

To that end I would recommend building a model as soon as you have a working set completed. Try to have sets be annotated within 1-2 days max, at least at the start of the project.

You can quickly see where the IAA is lacking, and whether you need to change types/relationships or even data. Doing this sooner rather than later prevents the technical debt of fixing the model.

Let the model work for you.

Once you have gotten 3-4 models created, and you are comfortable with some of the scoring, have it pre-annotate future working sets. It will reduce the mental requirements for the human annotators.

However! If 1-2 of the three areas (types, relationships, co-references) are still performing badly, recommend that your human annotator just delete the poorly performing part and redo it. For example, if Co-Reference doesn’t work well, just delete all co-references and redo them. This is considerably faster than trying to manually fix every annotation error.

…

So I hope this helps those in their first journey into using WKS. Be aware that this is by no means a full tutorial.