Programming, Machine Learning and Some Thoughts

Last year I upgraded a legacy Rails website from version 3.2 to 4.2. One daunting task was upgrading the ActiveRecord RubyGem, since the API changed drastically in the new release. Luckily there was a RubyGem built specifically to preserve the functionality of the deprecated API, at least for now. To prepare for the inevitable, I want to write down some of the changes made recently for future reference, so I don't have to search through the documentation and Stack Overflow again.

What’s the problem?

We have a method that returns a report aggregating some data. Seven years ago, when this application was built, there wasn't much data, so this was fine. Now the report won't load, for an obvious reason: the performance of the query that calculates col2 is terrible at the current data volume. Therefore, we have to optimize it.

SELECT
  column1 AS col1,
  GROUP_CONCAT(DISTINCT column2 ORDER BY column2 SEPARATOR ',') AS col2,
  SUM(column3) AS col3
FROM table1
WHERE lastUpdatedDate BETWEEN '20XX-01-XX' AND '20XX-02-XX'
  AND column3 <> 0
GROUP BY col1;

Rails 3 query in SQL

What’s the solution?

We decided to use a nested query: group by both col1 and column2 first, then select from that result:

SELECT
  col1,
  GROUP_CONCAT(DISTINCT column2 ORDER BY column2 SEPARATOR ',') AS col2,
  SUM(col3) AS col3
FROM (
  SELECT
    column1 AS col1,
    column2,
    SUM(column3) AS col3
  FROM table1
  WHERE lastUpdatedDate BETWEEN '20XX-01-XX' AND '20XX-02-XX'
    AND column3 <> 0
  GROUP BY col1, column2
) a
GROUP BY col1;

Optimized query

Works like a charm, except that I had trouble translating this back into Rails code, since I hadn't worked with this Rails codebase for almost a year. I did some research but couldn't find any straight answers to my question, so I switched to trial and error: composing different API combinations and comparing the logged query with the desired one.

Deep learning is hungry for training data, so it is tempting to put whatever data you can find into the training set. There are some best practices for dealing with situations where the training distribution and the dev/test distribution differ from each other. This post is also published on Steemit.

Training and testing on different distributions

Imagine that you have 10,000 mobile uploads, which are the images you care about. That's too few for your algorithm to learn from. You can find 200,000 high-resolution images on the web for training, but they differ from the production environment.
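A common recipe in this situation, sketched in Python: keep the dev and test sets drawn purely from the distribution you actually care about (the mobile uploads), and fold all the web images into training. The helper name, set sizes, and data stand-ins below are my own illustrative choices:

```python
import random

def split_mismatched(web_images, mobile_images, dev_n=2500, test_n=2500, seed=0):
    """Split data when web images differ from the mobile (target) distribution.

    Dev and test come only from the target distribution (mobile uploads);
    all web images plus the remaining mobile uploads go into training.
    """
    rng = random.Random(seed)
    mobile = mobile_images[:]        # don't mutate the caller's list
    rng.shuffle(mobile)
    dev = mobile[:dev_n]
    test = mobile[dev_n:dev_n + test_n]
    train = web_images + mobile[dev_n + test_n:]
    return train, dev, test

# Toy stand-ins for the 200,000 web images and 10,000 mobile uploads
web = [("web", i) for i in range(200_000)]
mobile = [("mobile", i) for i in range(10_000)]
train, dev, test = split_mismatched(web, mobile)
```

This way the metric you tune against reflects production data, even though most of the training volume comes from the web.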

Carrying out error analysis

Look at dev examples to evaluate ideas

Using the cat classifier as an example, say you have 90% accuracy and a 10% error rate. Your team looked at some of the errors and found dog pictures misclassified as cats. Should you spend time making the classifier do better on dogs? It depends: you need to figure out whether it's worthwhile to do so.

If we check 100 mislabeled dev set examples and find that only 5% of them are dogs, then the ceiling on the improvement from working on dog pictures is 5% of the errors; in this case, it may not be worth your time. However, if it's 50%, then you have found something that could potentially cut the error rate in half.
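The arithmetic behind that ceiling can be made explicit. A small sketch (the function name is mine, not from the course):

```python
def ceiling_after_fixing(overall_error, category_fraction):
    """Lowest achievable error rate if every error in one category were fixed.

    overall_error: current error rate (e.g. 0.10 for 10%)
    category_fraction: fraction of sampled mislabeled dev examples
                       that fall in the category (e.g. dog pictures)
    """
    return overall_error * (1.0 - category_fraction)

# 5% of mislabeled examples are dogs: fixing all of them only gets 10% -> 9.5%
low_value = ceiling_after_fixing(0.10, 0.05)

# 50% of mislabeled examples are dogs: error could drop all the way to 5%
high_value = ceiling_after_fixing(0.10, 0.50)
```

Counting categories over a sample of ~100 mislabeled examples is usually enough to decide where to spend your time.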

Observations

We have an Elasticsearch cluster that runs fine most of the time. Occasionally, however, our users accidentally submit heavy queries that put a lot of load on the cluster. These queries typically include aggregations (on scripted fields and/or analyzed fields) over a long time range, without any filters to narrow the search scope. While these heavy queries execute, the cluster usually shows high CPU utilization. Other users may find the cluster unresponsive to their queries, or see error messages about shard failures pop up in Kibana.

This week’s course, Machine Learning Strategy, mainly focuses on how to run a machine learning project and accelerate its iteration. Since the course covers quite a lot of small topics, I’ll break my notes into several shorter posts. Topics covered in this post are marked in bold in the Learning Objectives.

Fact

Applied ML is a highly iterative process: Ideas -> Code -> Experiment, and repeat. Accept it and deal with it, but we should certainly make this process easier. I guess that’s what the machine learning platforms out there are built for.

Before tuning the parameters

One question we have to answer first is: what data should we tune the parameters on? If you come from a machine learning background, you probably already know that the data should be split into three parts: training, dev (validation), and test.
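For reference, a minimal shuffle-and-split into training, dev, and test sets might look like this (the fractions and helper name are my own; large datasets often use much smaller dev/test fractions):

```python
import random

def train_dev_test_split(data, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle a dataset and split it into train/dev/test.

    The dev (validation) set is what we tune hyperparameters on;
    the test set is held out for the final, unbiased evaluation.
    """
    shuffled = data[:]                       # keep the caller's list intact
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_dev, n_test = int(n * dev_frac), int(n * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test

examples = list(range(1000))
train, dev, test = train_dev_test_split(examples)
```

The key point is that parameters are tuned against dev only; touching the test set during tuning leaks information into your final estimate.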

I recently signed up for the Deep Learning Specialization on Coursera and have just completed the first course, Neural Networks and Deep Learning. Although four weeks of study are recommended, with some background in Machine Learning and the help of 1.5x playback speed, finishing it in one week is achievable. In this post I want to summarize some of the takeaways for myself, and I hope it also helps whoever is reading it. If you are familiar with implementing a neural network from scratch, you can skip to the last section for the tips and best practices mentioned in the course. Note that this post is also posted on Medium and Steemit under the username @steelwings.

Scale of the problem

On a small training set, a neural network may not have a big advantage. If you can come up with good features, you can still achieve better results with traditional machine learning algorithms than with a neural network.

However, as the amount of data grows, traditional learning algorithms tend to hit a plateau while a neural network’s performance keeps improving; the larger the network, the larger the gain.

As a neural network grows, it takes more resources and time to compute. Many innovations in neural network algorithms were initially driven by performance requirements.
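As a refresher on the from-scratch implementation the course walks through, here is a minimal two-layer network trained with plain batch gradient descent on a toy XOR problem. The layer sizes, learning rate, and the XOR task itself are my own illustrative choices, not the course’s exact exercise:

```python
import numpy as np

# XOR: a problem a single linear layer cannot solve, but a 2-layer net can.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # shape (4, 2)
y = np.array([[0], [1], [1], [0]], dtype=float)              # shape (4, 1)

# Small random initialization; all-zero weights would leave every hidden
# unit computing the same thing (symmetry is never broken).
W1 = rng.normal(0.0, 1.0, size=(2, 8)); b1 = np.zeros((1, 8))
W2 = rng.normal(0.0, 1.0, size=(8, 1)); b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
losses = []
for step in range(2000):
    # Forward pass
    z1 = X @ W1 + b1
    a1 = np.tanh(z1)
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)

    # Binary cross-entropy loss
    loss = -np.mean(y * np.log(a2) + (1 - y) * np.log(1 - a2))
    losses.append(loss)

    # Backward pass (chain rule)
    dz2 = (a2 - y) / len(X)
    dW2 = a1.T @ dz2
    db2 = dz2.sum(axis=0, keepdims=True)
    da1 = dz2 @ W2.T
    dz1 = da1 * (1 - a1 ** 2)          # derivative of tanh
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0, keepdims=True)

    # Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

preds = (a2 > 0.5).astype(float)
```

Everything a framework like Keras does later is a generalization of this loop: forward pass, loss, backward pass, parameter update.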

I’ve recently finished my first pass of CS231N, Convolutional Neural Networks for Visual Recognition. Now it’s time to try out a library and get my hands dirty. Keras seems to be an easy-to-use high-level library that wraps three different backend engines: TensorFlow, CNTK, and Theano. Just perfect for a beginner in Deep Learning.

The tutorial I picked is the one on the MNIST dataset. I’m adding notes along the way to refresh my memory on what I’ve learned, plus some links so that I can quickly find the references in CS231N in the future.