If you’re looking for an executive summary of some machine learning strategies, then this post won’t punch your ticket.

I thought I’d pay back those who’ve been generous with information and scripts by outlining what I’m currently trying. It probably won’t be useful to you, but it may get some ideas going! I’m using C# and all my stuff is scratch-built and not open source. Sorry.

The goal of this competition is to predict which hotel cluster (out of 100) a user booked, given historical data and information about the search the user performed. The current go-to solution for this competition seems to be a counting solution: you basically count the users that booked during a similar search (say, same destination country and same user location city). To suggest the best hotel cluster, you pick the most common hotel cluster for a given combination of features.

So: you combine a number of features (out of more than 22) and pick whichever hotel cluster is most common for that combination of features.

But this doesn’t give you all the hotel clusters you need. A combination that does really well may still only cover a small percentage of the users. So you add the hotel clusters from another combination of features until you have a hotel cluster for each user.

Time is precious and it’s slipping away

There are several difficulties with this:

Loading the data to iterate takes a long time

It takes quite some time to evaluate a single given combination of features

Combining several combinations of features takes even longer, and it uses massive amounts of RAM. Some of my solutions combined 20+ combinations of several features each.

Figuring out which features to combine and keeping track of how well they did is really painful.

Loading the data

I don’t know about you, but the first thing I do in a Kaggle competition is try to make iterations quicker. For this one, I converted the CSV files to binary files. This made a world of difference. I have one file with my training data, one with my validation data and one with my testing data. Loading these files takes less than one minute, which cuts down on iteration time immensely.
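The conversion itself is nothing fancy. Something along these lines works; the three-column layout below is a made-up example rather than the real schema:

```csharp
using System;
using System.IO;
using System.Linq;

// Minimal sketch of the CSV-to-binary idea. The column layout is hypothetical.
public static class BinaryCache
{
    // One-time conversion: parse the text once, write fixed-size binary records.
    public static void ConvertCsv(string csvPath, string binPath)
    {
        using (var writer = new BinaryWriter(File.Create(binPath)))
        {
            foreach (var line in File.ReadLines(csvPath).Skip(1))   // skip the header row
            {
                var parts = line.Split(',');
                writer.Write(int.Parse(parts[0]));   // e.g. user_location_city
                writer.Write(int.Parse(parts[1]));   // e.g. srch_destination_id
                writer.Write(int.Parse(parts[2]));   // e.g. hotel_cluster
            }
        }
    }

    // Loading is then just a tight loop over fixed-size records: no text parsing.
    public static void ReadAll(string binPath, Action<int, int, int> onRow)
    {
        using (var reader = new BinaryReader(File.OpenRead(binPath)))
        {
            while (reader.BaseStream.Position < reader.BaseStream.Length)
                onRow(reader.ReadInt32(), reader.ReadInt32(), reader.ReadInt32());
        }
    }
}
```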

Computing the hotel clusters of one feature combination

I code in C#, and the Dictionary implementation is plenty fast; there isn’t much I can do to improve that. The same goes for the RAM requirements while building the dictionary. But there’s a trade-off between storage and CPU, and that’s pre-computing! So what I do is pre-compute the hotel clusters of any given combination of features and simply store them to disk. If I ever change which hotel clusters a given combination would generate, then I have to delete my files, but that’s rare. Pre-computing about 300 combinations of features takes a couple of hours with my C# code. But once the files are there, they’re quite fast to load from disk!

Here’s a picture of the files of hotel clusters:

As you can see, they’re quite compact; the largest ones (9 MB) are *tiny* compared to the CSV submissions you send to Kaggle (50 MB).

And here’s a snippet from the program that generates the pre-computed files:
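In spirit, it does something like this. This is a minimal sketch rather than the actual code; the string key, the field names and the tab-separated output format are assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Sketch: count hotel clusters per feature-combination key and write the top 5 per key to disk.
// The key is simply the chosen feature values joined into a string; field names are hypothetical.
public static class ClusterPrecomputer
{
    public static void Precompute(IEnumerable<(string Key, int HotelCluster)> rows, string outputPath)
    {
        // counts[key][cluster] = number of bookings for that cluster under that key
        var counts = new Dictionary<string, Dictionary<int, int>>();
        foreach (var (key, cluster) in rows)
        {
            if (!counts.TryGetValue(key, out var perCluster))
                counts[key] = perCluster = new Dictionary<int, int>();
            perCluster.TryGetValue(cluster, out var c);
            perCluster[cluster] = c + 1;
        }

        // Once the top clusters are known, the raw counts can be thrown away.
        using (var writer = new StreamWriter(outputPath))
        {
            foreach (var kvp in counts)
            {
                var top5 = kvp.Value.OrderByDescending(x => x.Value).Take(5).Select(x => x.Key);
                writer.WriteLine(kvp.Key + "\t" + string.Join(",", top5));
            }
        }
    }
}
```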

Now, if it seems like there would be a huge number of feature combinations, you’re right. It’s astronomical. So you have to limit the maximum number of features you’re going to combine (in my case I’m working with 4 at the moment) and which features to include (I’m working with 10 at the moment). This leaves me with 385 unique combinations: choosing 1, 2, 3 or 4 features out of 10 gives 10 + 45 + 120 + 210 = 385. That’s doable. Add one more feature? Well, that number grows much larger; we’re talking about binomial coefficients and factorials here.

Combining several combinations of features

To combine several feature combinations you would traditionally keep the dictionaries/counts in RAM, which quickly becomes prohibitive. But there’s no reason to do that: once you’ve determined which hotel clusters a particular combination of features would suggest given its counts, the counts are no longer useful and can be discarded. So to combine two feature combinations (say [hotel_country] and [srch_destination_id, user_id, user_location_city]) you simply concatenate the hotel clusters until you have a hotel cluster for each user. This becomes very fast, which is required for the next step…

Evolving a list of feature combinations

How do we figure out which feature combinations to use? And in which order? And each feature combination can return a total of 5 hotel cluster suggestions; should we use all of them?

Did it seem like there would be a lot of feature combinations to manage (turned out to be 385 when limits were added)? Well, that holds doubly^10 true here. If you allow 15 different feature combinations, picked from 385 different variants, that number comes out to gazillions. Puts the fac in factorial. Exhaustive searching goes right out the window. Anyone has the right to their own opinion, but not their own fact(orial)s.

Clever clever

A clever few of you are thinking “yeah, but once you’ve used a combination, you wouldn’t use it again”. Well, wrong, so wrong. In an effort to give me even more options to work with, I’m only adding the *first* hotel cluster from any given feature combination. Using the same feature combination once more will add the second, and so on. Simply put: I add the first suggested hotel cluster that doesn’t already exist for that user.
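The merge step for one feature combination then looks something like this. It’s a minimal sketch: the per-user suggestion lists are assumed to already be loaded from the pre-computed files, and five suggestions per user mirrors the five slots in a submission:

```csharp
using System.Collections.Generic;

// Sketch of merging one feature combination's suggestions into the per-user prediction lists.
// Each call adds at most one new cluster per user: the first suggestion not already present.
public static class PredictionMerger
{
    public const int MaxSuggestions = 5;

    public static void Merge(Dictionary<int, List<int>> predictions,
                             Dictionary<int, List<int>> combinationSuggestions)
    {
        foreach (var kvp in combinationSuggestions)
        {
            int userId = kvp.Key;
            if (!predictions.TryGetValue(userId, out var current))
                predictions[userId] = current = new List<int>();

            if (current.Count >= MaxSuggestions)
                continue;

            foreach (int cluster in kvp.Value)
            {
                // Add the first suggested hotel cluster that doesn't already exist for this user.
                if (!current.Contains(cluster))
                {
                    current.Add(cluster);
                    break;
                }
            }
        }
    }
}
```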

He used to do surgery for girls in the eighties, but gravity always wins out

So how do you figure out which frigging combinations to use? It’s hardly possible to even enumerate the possible ones! Well, evolution to the rescue! What I did was create a list evolver that uses trivial Genetic Algorithm techniques. It literally took me less than two hours to implement; the gist of it is sketched below.

It first creates 200 randomly generated cluster combination “lists”. The best ones are propagated into the next generation, where they get a chance to create some mutated versions of themselves. This goes on for quite a few generations – hopefully while better and better solutions come up. My best solution, at the moment, is at 0.50561, and that was created using the techniques described.
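Stripped down, the evolver looks roughly like this. The random-candidate, mutation and fitness logic are passed in as delegates; in the competition the fitness would be the validation score of the candidate list of feature combinations. Treat it as a simplified sketch rather than the exact code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified list evolver: keep the best lists, refill with mutated copies, repeat.
public class ListEvolver
{
    private readonly Random _random = new Random();
    private readonly int _populationSize;
    private readonly int _survivors;
    private readonly Func<List<int>> _createRandom;          // builds a random candidate list
    private readonly Func<List<int>, List<int>> _mutate;     // returns a mutated copy
    private readonly Func<List<int>, double> _fitness;       // higher is better (e.g. validation score)

    public ListEvolver(int populationSize, int survivors,
        Func<List<int>> createRandom, Func<List<int>, List<int>> mutate, Func<List<int>, double> fitness)
    {
        _populationSize = populationSize;
        _survivors = survivors;
        _createRandom = createRandom;
        _mutate = mutate;
        _fitness = fitness;
    }

    public List<int> Evolve(int generations)
    {
        var population = Enumerable.Range(0, _populationSize).Select(_ => _createRandom()).ToList();

        for (int g = 0; g < generations; g++)
        {
            // Keep the best candidates...
            var best = population.OrderByDescending(_fitness).Take(_survivors).ToList();

            // ...and refill the population with mutated copies of them.
            var next = new List<List<int>>(best);
            while (next.Count < _populationSize)
                next.Add(_mutate(best[_random.Next(best.Count)]));

            population = next;
        }

        return population.OrderByDescending(_fitness).First();
    }
}
```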

Combining techniques

My list evolver from the previous paragraph uses whatever solution gives the best results for the validation period. This means that other solutions can be worked into the process, even if they were generated by some other means than by the counting technique. Just as long as they generate appropriate results for the validation period. So in the end I need results for the validation period and the testing period. But that’s easy enough to generate. At that point, I can add XGBoost results (terrible!) with Dictionary results (pretty good!) and clustering techniques (remains to be seen!).

I don’t get it, how does evolution choose?

I’ve been using evolution (specifically Genetic Algorithms (GA), Genetic Programming (GP) and evolved neural networks) since about 1996 for different projects. GA isn’t an optimal solution to most (if any) problems. One problem is that it isn’t greedy, which means you can’t pick a nice trajectory and follow it. It’s inherently stochastic, which makes it slower than a greedy or numerical solution would be. But you can trivially use it anywhere, even when you don’t know of any greedy or numerical solution.

If you can measure which solution is better out of two different solutions (let them duel it out!), or how good a solution is in relation to all possible solutions (a score), you can make GA work. You just make more mutated copies of the currently best solutions and keep the best ones as you go along. In this specific case, you measure how well the solution would do in the competition. Keep the best ones, discard the worse ones.

A trivial example

Here’s a trivial example demonstrating how my list evolver works. A list of no more than 6 items must be filled with the numbers -1 to 10. Below you see the lists it comes up with during a couple of generations.
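With the evolver sketched above, the toy setup can be wired up like this. The fitness here is simply the sum of the list, chosen purely for illustration (an assumption, not the original premise), which makes the obvious best answer six 10s:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TrivialExample
{
    public static void Main()
    {
        var random = new Random();

        // Candidates: lists of at most 6 items, each item a number from -1 to 10.
        Func<List<int>> createRandom = () =>
            Enumerable.Range(0, random.Next(1, 7)).Select(_ => random.Next(-1, 11)).ToList();

        // Mutation: copy the list and either grow it (if short) or overwrite one random position.
        Func<List<int>, List<int>> mutate = list =>
        {
            var copy = new List<int>(list);
            if (copy.Count < 6 && random.Next(2) == 0)
                copy.Add(random.Next(-1, 11));
            else
                copy[random.Next(copy.Count)] = random.Next(-1, 11);
            return copy;
        };

        // Illustrative fitness: the sum of the items.
        Func<List<int>, double> fitness = list => list.Sum();

        var evolver = new ListEvolver(200, 20, createRandom, mutate, fitness);
        var best = evolver.Evolve(50);
        Console.WriteLine(string.Join(", ", best));   // should converge to six 10s
    }
}
```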

The solution is so obvious given the premises, and it takes milliseconds to find. But finding a good combination of feature combinations to use in this Kaggle competition? That takes about a day. And there’s no guarantee that it’s an optimal solution.

The goal of AXA Driver Telematics Analysis was to “Use telematics data to identify a driver signature”. In short, we were given a huge number of “drives” (described by GPS coordinates) that had been performed by a large number of drivers. Entries were scored on how well they were able to identify who was driving the car.

There were 1,528 participants in the competition and I was able, after a lot of work, to place at #17. Out of the 16 entries that scored higher, 7 were teams and the rest were by individuals. To my mind, being beaten by a team hurts less than being beaten by an individual. And scoring well as an individual is cooler than scoring well as a team. But maybe that’s just me.

I chose to develop all my code by myself, not using any Machine Learning packages in the process. If that seems impractical, then rest assured that it really really is. But that’s how I learn. Find a cool algorithm and implement it from some kind of paper or Wiki description. It’s often tricky, but I always learn stuff! For this competition, I was trying out some deep learning stuff before it all ended up in tears (see below). But I had to develop my own deep learning (Neural Network stuff) for that to work. Before I threw it all out, that is.

Anonymizing the data

The goal for AXA, who put up $30,000 in winnings (plus whatever money they paid Kaggle for administering the competition), was to try to identify when the “wrong” driver was driving a vehicle, for instance when the vehicle had been stolen or lent out.

Kaggle/AXA had mixed the drives from a given driver with random drives from other drivers (incorrect drives). The goal was to identify drives that didn’t belong to the driver. The incorrect drives, having been driven by someone random, were most likely performed in another city or part of the country than the actual drives. So for a lot of drives that were virtually guaranteed to be from the correct driver, you could find parts of the drives that looked exactly the same: when the driver goes to or from work or the local store, those parts of routes would turn up over and over again, making it easy to determine that they were “positives” (actual drives).

Kaggle/AXA predicted this and masked all the GPS coordinates as relative coordinates measured from the start of the drive. They also removed a short snippet from the start and the end of the drives. Then they mirrored 50% of the drives and rotated them all by a random angle between 0 and 360 degrees. All this to make it harder to a) match identical parts of trips and b) identify the actual address of any of the drivers. (b) was done to preserve the privacy of their customers.
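To get a feel for what that does to a trace: the transform is essentially a random 2D rotation plus an occasional mirroring of the relative coordinates. A sketch of that kind of transform (not Kaggle’s actual code):

```csharp
using System;

// A GPS point expressed relative to the start of the drive.
public struct Point
{
    public double X, Y;
    public Point(double x, double y) { X = x; Y = y; }
}

public static class DriveObfuscation
{
    private static readonly Random Rng = new Random();

    // Rotate every point of a drive by one random angle, and mirror roughly half of the drives.
    public static Point[] Obfuscate(Point[] drive)
    {
        double angle = Rng.NextDouble() * 2 * Math.PI;
        bool mirror = Rng.Next(2) == 0;
        double cos = Math.Cos(angle), sin = Math.Sin(angle);

        var result = new Point[drive.Length];
        for (int i = 0; i < drive.Length; i++)
        {
            double x = mirror ? -drive[i].X : drive[i].X;                  // mirror across the y-axis
            double y = drive[i].Y;
            result[i] = new Point(x * cos - y * sin, x * sin + y * cos);   // standard 2D rotation
        }
        return result;
    }
}
```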

Two identical drives (given that you could drive exactly the same way twice, including timing) would look radically different from each other. The image above shows all the drives from a specific driver. Note how they go off at different angles and never overlap. That’s of course not how it would look in reality: most drives originate from home and go to a specific spot (work), or the reverse (work -> home). Without mirroring and rotation, several routes would overlap.

A model that would be useful to AXA would base its predictions on other metrics, like the speed, how the speed tends to change (acceleration and braking), how the driver handles curves and so on. For that reason, AXA/Kaggle tried to force Kagglers to use that kind of data.

It all ended in tears

It all went terribly wrong, though. No models could place well using the metrics useful to AXA. It turned out that the only way to compete was to try to reverse the masking/anonymizing performed by AXA/Kaggle to figure out which drives were performed in areas typical to the main driver and which weren’t (called Trip Matching). If no one had used Trip Matching then the best strategies would possibly have been useful to AXA. But once one Kaggler used it, there was no going back. To compete, we all had to do it. And it turned out that Trip Matching was a fun problem to solve, just not very useful.

The picture below shows trips that have been matched up (by another Kaggler) and how they obviously were performed in the same area. Thus, they belonged to the true driver and could be given a high score.

My solution was very similar to what they describe; instead of comparing coordinates, each route was converted to turns (left and right) and these turn sequences were aligned against other drives to find matches. The Kaggler with the best trip matching algorithm would win! You could try to mix this result with some actual Machine Learning, but that typically wasn’t what decided the result.
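A simplified sketch of the turn-sequence idea: compute the heading change at every point and reduce the drive to a string of lefts, rights and straights. Heading changes are unaffected by the random rotation, and a mirrored drive just swaps lefts and rights, so you can compare against both orientations. The threshold below is a made-up number, and my actual matching was more involved:

```csharp
using System;
using System.Linq;
using System.Text;

public static class TurnSequence
{
    // Reduce a drive to a string of turns: 'L', 'R' or 'S' (straight) at every point.
    public static string FromDrive((double X, double Y)[] points, double turnThresholdRadians = 0.2)
    {
        var sb = new StringBuilder();
        for (int i = 2; i < points.Length; i++)
        {
            double h1 = Math.Atan2(points[i - 1].Y - points[i - 2].Y, points[i - 1].X - points[i - 2].X);
            double h2 = Math.Atan2(points[i].Y - points[i - 1].Y, points[i].X - points[i - 1].X);

            // Normalize the heading change to (-pi, pi].
            double delta = h2 - h1;
            while (delta > Math.PI) delta -= 2 * Math.PI;
            while (delta <= -Math.PI) delta += 2 * Math.PI;

            sb.Append(delta > turnThresholdRadians ? 'L' : delta < -turnThresholdRadians ? 'R' : 'S');
        }
        return sb.ToString();
    }

    // Swap lefts and rights, for comparing against mirrored drives.
    public static string Mirror(string turns) =>
        new string(turns.Select(c => c == 'L' ? 'R' : c == 'R' ? 'L' : c).ToArray());
}
```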

The reason this isn’t helpful to AXA is that they already knew which drives matched each other in world space; figuring that out didn’t help them. The winning solution came down to whoever was best at reversing the AXA/Kaggle obfuscation. Decidedly not helpful, but that’s what the competition became. I’m absolutely not saying this was wrong of the Kagglers – or that Kaggle or AXA did anything wrong. It’s just that we found a weakness and exploited it. This happens quite frequently and is something Kaggle/the competition owners try to prevent by masking the data. It’s not against the rules to exploit weaknesses though.

Processor power

This competition required huge amounts of processing power. The data ran into gigabytes and every hyperthread was sorely needed. Each new iteration of my algorithm took about 24 hours to run. So I would change one thing and run my tests. 24 hours later I would evaluate the result of that change and either revert it and make another change, or move forward with the new change, tweaking something else. Imagine playing tennis with one day between the hits – it’s hard to get a good creative flow.

Iterating quickly is actually extremely important, because it allows you to test many more variants – but in this case (at least for me) that was not to be. Since this competition I’ve bought a faster computer (6 cores, 12 hyperthreads and 32 GB of RAM), but already my new computer has too little RAM for the competitions I’m running.

At the end of this competition, I was renting three large cloud servers (Microsoft Azure) to be able to evaluate several solutions in parallel.

Why I placed where I did

There are three main ingredients to my recipe:

I was lucky when picking my method. There were thousands of routes I could have gone down, many of which would have been worse. Unfortunately, there’s rarely a clear sign of which way to go forward – as there often is in software development. In these ML competitions, I find I prototype solutions and continue with whichever route seems most fruitful. That very likely leaves you following a sub-optimal route and forces you to backtrack. In this case, I remember very vividly where I went wrong and how I would have placed higher had I gone a different route at that point. Oh, well.

I’ve been programming for 30 years and no algorithm is too difficult to implement. When I have an idea, I can implement and evaluate it. And I can do this very quickly, allowing me to iterate rapidly.

I worked very hard at this problem. I probably would have made more money working as a consultant for the hours I spent than I would have made if I’d won the competition… But the amount I learned was well worth the investment!

Conclusion

Find a method that allows you to iterate quickly. Automate as much as possible. Precompute as much as possible. Find a method that allows you to accurately validate your performance offline, so you don’t have to waste submissions to figure out if you’re doing well.

Right now, I’m competing in a Kaggle related to the travel industry – where I’m trying to evolve a solution. I could use 100× the CPU/disk power I currently have and still not have enough. But I’m learning every step of the way, and I’m enjoying it immensely! Currently I’m falling quickly in the ratings, but I have high hopes for this last run! Or the one after that…

Machinist-Fabrique is a game where you learn to code by solving puzzles. The tools at your disposal are different code concepts. It’s fun and easy and suited for kids 10+ (and adults). We believe that learning to code is and will be an essential skill for our children and for ourselves.

Whenever I work with Visual Studio doing MVC development, I find that starting the website after a recompile takes far too long. So I asked my pal Tobias why it’s so fast when he does it. The answer:

I don’t do a full rebuild and I don’t attach the debugger

Press Ctrl-F5 to start the website without the debugger

Use the browser

If you make changes,

hit Shift-F6 to compile and update only the project of the file you’re currently in.

if the solution has focus, the entire solution is rebuilt

hit F5 to update the browser with the changes.

If you need to debug, attach the debugger by pressing Ctrl+Alt+P.

Thanks, Tobias. I’m writing this blogpost so I won’t have to ask you again!