I have mixed feelings about cliffhangers, and I hope you do not hate me for leaving you with one in the last blog post. In this episode, we will look at the last piece of the puzzle: making predictions with the Cloud Machine Learning Engine REST API.
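As a quick preview, a prediction request boils down to a call to the projects.predict method of the REST API. Below is a minimal sketch using the google-api-python-client library; the project name, model name, and instance payload are placeholders you would replace with your own, and the exact shape of each instance depends entirely on the deployed model.

```python
from googleapiclient import discovery

# Placeholder project and model names; replace with your own.
PROJECT = "my-project"
MODEL = "my_model"

# Build a client for the Cloud ML Engine v1 REST API using
# application default credentials.
service = discovery.build("ml", "v1")
name = "projects/{}/models/{}".format(PROJECT, MODEL)

# The structure of each instance must match what the deployed model expects.
response = service.projects().predict(
    name=name,
    body={"instances": [{"input": [1.0, 2.0, 3.0]}]},
).execute()

if "error" in response:
    raise RuntimeError(response["error"])
print(response["predictions"])
```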

Foreword

If you think machine learning is a panacea for every business challenge and sell it as such, you’re doing it wrong.
The best way to jeopardize your business is to go all in on machine learning by following the five tips below.

When dealing with sequences, the Viterbi algorithm and Viterbi decoding pop up regularly. The algorithm is usually described in the context of Hidden Markov Models (HMMs), but its application is not limited to HMMs. Besides, HMMs have lately fallen out of fashion as better machine learning techniques have been developed.
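To make the discussion concrete, here is a minimal sketch of Viterbi decoding written over generic per-step label scores and transition scores (for an HMM these would be log emission and log transition probabilities); the function and variable names are my own.

```python
import numpy as np

def viterbi(emission_scores, transition_scores):
    """Find the highest-scoring label sequence.

    emission_scores:   (n_steps, n_labels) score of each label at each step
    transition_scores: (n_labels, n_labels) score of moving from label i to label j
    Returns the best label sequence as a list of label indices.
    """
    n_steps, n_labels = emission_scores.shape
    # best[t, j] = best total score of any path ending in label j at step t
    best = np.full((n_steps, n_labels), -np.inf)
    backptr = np.zeros((n_steps, n_labels), dtype=int)

    best[0] = emission_scores[0]
    for t in range(1, n_steps):
        # candidate[i, j] = score of ending in label i at t-1, then moving to j
        candidate = best[t - 1][:, None] + transition_scores + emission_scores[t]
        backptr[t] = candidate.argmax(axis=0)
        best[t] = candidate.max(axis=0)

    # Trace back from the best final label
    path = [int(best[-1].argmax())]
    for t in range(n_steps - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```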

Sequence labeling is one of the classic ML tasks and includes such well-studied problems as Part-of-Speech (POS) tagging,
Named Entity Recognition (NER), and address parsing. Here I want to discuss two related topics: tokenization, and satisfying
the constraints imposed by the structure of the input document.
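As a small illustration of the constraints point, structural rules such as "an I- tag must continue a B- or I- tag of the same entity type" can be encoded directly in the transition scores consumed by a decoder like the Viterbi sketch above, simply by setting forbidden transitions to minus infinity. The label set and helper below are hypothetical.

```python
import numpy as np

# Hypothetical BIO label set for a two-type NER task.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
idx = {lab: i for i, lab in enumerate(labels)}

def allowed(prev, cur):
    # An I- tag may only continue a B- or I- tag of the same entity type.
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], "I-" + cur[2:])
    return True

transition_scores = np.zeros((len(labels), len(labels)))
for p in labels:
    for c in labels:
        if not allowed(p, c):
            transition_scores[idx[p], idx[c]] = -np.inf

# Any decoder that respects these scores, such as the viterbi() sketch above,
# will never produce an invalid tag sequence.
```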

In the first of these posts, we covered the now conventional wisdom that having a bigger dataset is better for training machine learning algorithms. The second post in the series detailed a few rules of thumb for creating quality datasets. This time around, we’ll look at how to start building datasets.

In the first of these posts, we covered the now conventional wisdom that having a bigger dataset is better for training machine learning algorithms. But size is not the only metric for success; quality is also critical.

There was a time when working with big data was not technically possible because our compute resources couldn’t handle the amount of information involved. Beyond that, it took a while for use cases to develop around massive computing resources, so it wasn’t even considered a worthy pursuit. Fifteen years ago, I remember creating machine-learning algorithms using only a handful of data points and then tweaking feature representations for weeks. Back then, it was quite challenging to process the 20 Newsgroups dataset and its roughly 19,000 news items.
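For contrast, loading and vectorizing that same corpus is trivial today. Here is a quick sketch with scikit-learn, which is obviously not the toolchain we had back then, just a present-day illustration.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Downloads the corpus on first use.
newsgroups = fetch_20newsgroups(subset="train")
print(len(newsgroups.data))  # roughly 11,000 training documents

# What used to take weeks of feature tweaking is now a couple of lines.
features = TfidfVectorizer(max_features=50_000).fit_transform(newsgroups.data)
print(features.shape)
```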

Even as recently as five years ago, the situation hadn’t improved much. At that time, I worked on putting a learning system with a continuous feedback loop into production. To fit the budget, we could only train the Random Forest on 5,000 examples, only a few days’ worth of data. Using such a small dataset alone would not have produced the desired results, so we had to implement many tricks to keep ‘some’ past data alongside the continuous feed of new data to keep everything running smoothly.
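The post doesn’t spell out the exact tricks we used, but the general idea can be sketched as follows: keep a fixed training budget and reserve part of it for a random sample of older data. Everything here, names and ratios included, is a hypothetical illustration rather than the system we actually ran.

```python
import random

def build_training_set(new_examples, past_examples, budget=5000, past_fraction=0.3):
    """Hypothetical sketch: fill a fixed training budget with the freshest data,
    reserving a fraction of the slots for a random sample of past data."""
    n_past = min(int(budget * past_fraction), len(past_examples))
    n_new = min(budget - n_past, len(new_examples))
    kept_past = random.sample(past_examples, n_past)
    fresh = new_examples[-n_new:] if n_new else []
    training_set = kept_past + fresh
    random.shuffle(training_set)
    return training_set
```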