A blog about one man's journey through code… and some pictures of the Peak District

Monthly Archives: December 2016

Recently, at DDD North, I saw a talk on MS Cognitive Services. It came back to mind, and sparked my interest, while I was looking at some TFS APIs (see later posts for why). In this post, though, I’m simply exploring what can be done with these services.

The Hype

Language: can detect the language that you pass

Topics: can determine the topic being discussed

Key Phrases: key points (which I believe may equate to nouns)

Sentiment: whether or not what you are saying is good or bad (I must admit, I don’t really understand that – but we can try some phrases to see what it comes up with)

For some reason that I can’t really understand, topics requires over 100 documents, and so I won’t be getting that to work, as I don’t have a text sample big enough. The examples that they give in marketing seem to relate to people booking and reviewing holidays; and it feels a lot like these services are overly skewed toward that particular purpose.

The subscription key is given when you register (in the screen under “Set-up”). Keep an eye on the requests, too: 5000 seems like a lot, but when you’re testing, you might find you get through them faster than you expect.
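As a rough illustration of what a request to these services carries, here is a minimal Python sketch that builds the headers and JSON body without sending anything. The `Ocp-Apim-Subscription-Key` header and the `documents` list with `id` and `text` fields are assumed to match the Text Analytics request format, so check them against the current documentation; the helper name `build_request` and the placeholder key are hypothetical.

```python
import json


def build_request(phrases, subscription_key):
    """Build headers and a JSON body for a batch of phrases (nothing is sent)."""
    headers = {
        # Assumed header name for the subscription key.
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/json",
    }
    # Assumed body shape: one entry per phrase, each with a string id.
    documents = [
        {"id": str(index), "text": phrase}
        for index, phrase in enumerate(phrases, start=1)
    ]
    body = json.dumps({"documents": documents})
    return headers, body


headers, body = build_request(
    ["The quick brown fox jumped over the hedge", "March is a green month"],
    "YOUR-SUBSCRIPTION-KEY",  # placeholder, not a real key
)
print(body)
```

Batching several phrases into one body, as here, is one way to make the request allowance go further while testing.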

Here’s the output:

Evaluation

So, the 5 phrases that I used were:

The quick brown fox jumped over the hedge

This is a basic sentence indicating an action.

The KeyPhrases API decided that the key points here were “hedge” and “quick brown fox”. It didn’t think that “jumped” was key to this sentence.

The Language API successfully worked out that it’s written in English.

The Sentiment API thought that this was a slightly negative statement.

March is a green month

This was a nonsense statement, but in a valid sentence structure.

The KeyPhrases API identified “green month” as being important, but not March.

The Language API successfully worked out that it’s written in English.

The Sentiment API thought this was a very positive statement.

When I press enter the program crashes

Again, a completely valid sentence, and chosen with a view to my ultimate idea for this API.

The KeyPhrases API spotted “program crashes”, but not why. I found this interesting because it seems to conflict with the other phrases, which seemed to identify nouns only.

Again, the Language API knew this was English.

The Sentiment API identified that this was a negative statement… which I think I agree with.

Pressing return – the program crashes

The idea here was, it’s basically the same sentence as above, but phrased differently.

The KeyPhrases API wasn’t fooled, and returned the same key phrase – this is good.

Still English, according to the Language API.

This is identified as a negative statement again, but oddly, not as negative as the previous one.

Los siento, no hablo Espanol

I threw in a Spanish phrase because I felt the Language API hadn’t had much of a run.

The KeyPhrases API pulled out “hablo Espanol”, which, based on my very rudimentary Spanish, means the opposite of what was said.

It was correctly identified as Spanish by the Language API.

The Sentiment API identified it as the most negative statement. Perhaps because it has the words “sorry” and “no” in it?

If you jump straight to the references, you will find a very similar set of information, and I strongly encourage people to do so. Additionally, this is probably not the most efficient way to achieve this.

Right, on with the show

Here’s the string that I’ll be parsing, and a little code stolen directly from the link at the bottom to show what it looks like:

It is messy, and it is error prone, and it would be better done by creating classes and serialising it; however, I’d never attempted to do this manually before, and it’s generally nice to do things the hard way, so that you can appreciate what you get from these tools.

Due to a series of blog posts that I’m writing on TFS and MS Cognitive Services, I came across a requirement to identify duplicate values in a dictionary. For example, imagine you had an actual physical dictionary, and you wanted to find all the words that meant the exact same thing. Here’s the set-up for the test:
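The idea can be sketched in a few lines of Python (not the language of the original post): invert the dictionary so that each value maps to the list of keys that share it, then keep only the values with more than one key. The function name `find_duplicate_values` and the sample words are made up for illustration, and the sketch assumes the values are hashable.

```python
from collections import defaultdict


def find_duplicate_values(dictionary):
    """Return {value: [keys...]} for every value shared by more than one key."""
    keys_by_value = defaultdict(list)
    for key, value in dictionary.items():
        keys_by_value[value].append(key)
    # Keep only the values that appear under two or more keys.
    return {value: keys for value, keys in keys_by_value.items() if len(keys) > 1}


words = {
    "big": "of considerable size",
    "large": "of considerable size",
    "small": "of limited size",
}
duplicates = find_duplicate_values(words)
print(duplicates)  # {'of considerable size': ['big', 'large']}
```

This makes a single pass over the dictionary, rather than comparing every entry against every other entry.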

Let’s assume that the built-in TFS standard templates are not sufficient for you. You’re in luck: TFS allows you to create your own custom work item. Let’s imagine that, for some reason, you want a work item type called “Defect”, rather than “Bug”. Here’s the process, based on the “Bug” work item type.

First thing is to open the command prompt in administrator mode, and navigate to a work directory; for example, the “Documents” folder.
Then export the template work item type, like so (you can export the entire definition of all work items, but it becomes unmanageable):

The field name attribute (“New Field” in this case) tells TFS how to refer to this field to the user. This is important, because if you forget that you’ve called it “New Field”, you might assume this hasn’t worked and start googling to find out why.

Now you’ve told TFS that you have a field to store; the next step is to add that field to the layout:

Having looked into this for some time, I came up with the following method of extracting team project tags. I’m not for a minute suggesting this is the best way of doing this – but it does work. My guess is that it’s not a very scalable solution, as it’s doing a LOT of work.

As it was, I couldn’t find a way to directly query the tags, so instead, I’m going through all the work items, and picking the tags. I couldn’t even find a way to filter the work items that actually have tags; so here’s the query that I ended up with:
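The query aside, the "walk every work item and pick out its tags" step can be sketched in Python as below. It assumes each work item arrives as a dictionary whose `System.Tags` field holds a semicolon-separated string; that shape, the function name `collect_tags` and the sample items are illustrative assumptions rather than the original code.

```python
def collect_tags(work_items):
    """Gather the distinct tags found across all work items."""
    tags = set()
    for item in work_items:
        # Assumed format: "Tag1; Tag2"; untagged items have an empty or missing field.
        raw = item.get("System.Tags") or ""
        for tag in raw.split(";"):
            tag = tag.strip()
            if tag:
                tags.add(tag)
    return sorted(tags)


work_items = [
    {"System.Id": 1, "System.Tags": "Tagtest1; Tagtest2"},
    {"System.Id": 2, "System.Tags": ""},
    {"System.Id": 3, "System.Tags": "Tagtest2"},
]
print(collect_tags(work_items))  # ['Tagtest1', 'Tagtest2']
```

Because untagged items cannot be filtered out up front, every item is visited; the set simply discards the repeats.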

A couple of points on tags: firstly, tags seem to exist in a kind of transient state; that is, while something is tagged, the tag exists, but once you remove all instances of a tag, TFS eventually removes the tag itself. For example, if I removed “Tagtest1” from all work items in my team project, TFS would eventually (I believe after a couple of days) just delete the tag for me. Obviously, in my example, as soon as I did this, I would no longer find it. This might leave you thinking that there is a more efficient way of removing tags (that is, you should be able to access the transient store in some way).

The existence of this Visual Studio plug-in lends support to that idea. It allows you to maintain the tags within your team project. If you’re using tags in any kind of serious way then I’d strongly recommend that you try it.

Performance

This is doing a lot of (IMO) unnecessary work, so I tried a little performance test; using this post as a template, I created a lot of bugs:

As you can see, I created a random set of tags. One other point that I’m going to put here is that a TFS database with ~30K work items and no code whatsoever increases the size of the default collection DB to around 2GB:

Next, I ran GetAllTags with some timing switched on:

19 seconds, which seems like quite a reasonable speed to me for 13.5k tags.
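For anyone wanting to repeat this kind of measurement, a minimal timing wrapper in Python might look like the following. The `timed` helper and the stand-in `get_all_tags` function are hypothetical; the stand-in only simulates work and does not talk to TFS.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def get_all_tags():
    """Stand-in for the real tag extraction; it just builds some fake tag names."""
    return sorted({f"Tag{number % 50}" for number in range(10_000)})


tags, seconds = timed(get_all_tags)
print(f"{len(tags)} distinct tags in {seconds:.3f} seconds")
```

Timing the whole call from the outside, rather than instrumenting the inside, keeps the measurement comparable if the extraction method is later swapped for something more direct.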