5 Ways to generate better data mining models

This implies that you can measure the quality of your models. And you know, only one quality measure really matters, whatever the lift adepts and AUC adepts may tell you : what’s in it for your business ($$$$$) ?
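To make the business measure concrete, here is a minimal sketch of a campaign profit calculation. The function name, its parameters and all figures below are hypothetical illustrations, not numbers from a real campaign.

```python
def campaign_profit(n_mailed, n_responders, margin_per_sale, cost_per_mail):
    """Profit of one mailing: margin earned on responders minus mailing costs."""
    return n_responders * margin_per_sale - n_mailed * cost_per_mail

# Made-up figures: the same 10,000 mail pieces, sent once at random
# and once to the customers selected by a model.
random_mailing = campaign_profit(10_000, 100, margin_per_sale=50, cost_per_mail=1)
model_mailing = campaign_profit(10_000, 300, margin_per_sale=50, cost_per_mail=1)

print(random_mailing, model_mailing)
```

Under these assumed figures the random mailing loses money and the model-selected mailing earns it, which is the kind of difference a lift or AUC value alone does not show.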

Disclaimer : this post is NOT a list of things you should do in order to avoid all the known data mining mistakes. On the contrary, I assume you know what you are doing as a data miner. There are just some possibilities you might have overlooked.

1. There is no data like more data (I) : observations

Push your data mining tool to the limits. The more data you use, the better your model.

– As you know, some of the best models are “ensembles” of weak learners, built with techniques like bagging. Instead of feeding one data file to the algorithm and letting it do the sampling, learning and averaging, I prefer to make the samples myself and feed them to the algorithm one at a time. That way it is possible to use a lot more data before the tool crashes : I push each individual model to the tool’s limits, and the averaging can be done afterwards.

– A second advantage of making the samples yourself is that you can choose to make them overlap as little as possible. That way the total number of different observations used in model building reaches a much higher level than when you feed only one file to the modeling tool.
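The two points above can be sketched roughly as follows. This is only an illustration under stated assumptions : the toy data, the one-feature "stump" learner and all function names are hypothetical stand-ins for your own tool and data. Note that it uses disjoint subsamples rather than the bootstrap samples of textbook bagging.

```python
import random
from statistics import mean

def disjoint_samples(n_rows, n_samples, seed=0):
    """Shuffle the row indices once, then deal them into non-overlapping samples."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_samples] for i in range(n_samples)]

def fit_stump(rows):
    """Hypothetical weak learner: one split on a single feature x.
    rows are (x, y) pairs with y in {0, 1}; returns a scoring function."""
    threshold = mean(x for x, _ in rows)
    overall = mean(y for _, y in rows)
    above = [y for x, y in rows if x > threshold]
    below = [y for x, y in rows if x <= threshold]
    rate_above = mean(above) if above else overall
    rate_below = mean(below) if below else overall
    return lambda x: rate_above if x > threshold else rate_below

def ensemble_score(models, x):
    """The averaging step, done afterwards over the individual models."""
    return mean(model(x) for model in models)

# Toy data (assumption): the chance of y == 1 grows with x.
rng = random.Random(42)
data = []
for _ in range(3000):
    x = rng.random()
    data.append((x, 1 if rng.random() < x else 0))

# One model per sample, each sample fed to the learner on its own.
samples = disjoint_samples(len(data), n_samples=5)
models = [fit_stump([data[i] for i in sample]) for sample in samples]

print(ensemble_score(models, 0.9), ensemble_score(models, 0.1))
```

In practice each sample would be as large as the tool can handle, so five samples means roughly five times as many distinct observations as a single file.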

2. There is no data like more data (II) : variables

– Calculate additional (derived) fields. This is fairly easy. You can multiply, subtract, divide and add numbers. OK, the result has to have some business meaning, otherwise how will you explain it afterwards ?

– Find additional information, inside or outside your company.
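As an illustration of derived fields, here is a small sketch. The field names (`revenue_12m`, `orders_12m` and so on) are hypothetical; the point is only that each derived field is a simple combination of existing numbers with an obvious business reading.

```python
def add_derived_fields(customer):
    """Return a copy of the customer record with a few derived fields added."""
    out = dict(customer)
    orders = customer["orders_12m"]
    # Division: average value of an order (0.0 when there were no orders).
    out["avg_order_value"] = customer["revenue_12m"] / orders if orders else 0.0
    # Subtraction: is this customer spending more or less than a year ago?
    out["revenue_change"] = customer["revenue_12m"] - customer["revenue_prev_12m"]
    # Division again: share of orders that were returned.
    out["return_rate"] = customer["returns_12m"] / orders if orders else 0.0
    return out

customer = {"revenue_12m": 600.0, "revenue_prev_12m": 450.0,
            "orders_12m": 12, "returns_12m": 3}
enriched = add_derived_fields(customer)
print(enriched["avg_order_value"], enriched["revenue_change"], enriched["return_rate"])
```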

3. Find the best algorithm (the very best actually is a combination of all: ensemble)

– It is tempting to state that for each problem there is probably one best algorithm. So all you have to do is try a handful of really different algorithms to find out which one is the best for the problem-data-data miner combination at hand. Surprised that the data miner plays a role in this ? Different data miners will use the same algorithm differently, according to their taste, experience, mood 🙂 …

So find out which algorithm works best for you and your problem.
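One way to run such a comparison is to fit every candidate on the same training data and rank them with one metric on the same holdout set. The sketch below assumes a top-decile response rate as the metric and uses two deliberately trivial single-feature "algorithms" on toy data; all names are hypothetical, and in a real project the candidates would be the algorithms your tool offers.

```python
import random

def top_decile_response(scorer, holdout):
    """Response rate among the 10% of holdout rows with the highest score."""
    ranked = sorted(holdout, key=lambda row: scorer(row[0]), reverse=True)
    top = ranked[:max(1, len(ranked) // 10)]
    return sum(y for _, y in top) / len(top)

def pick_best(candidates, train, holdout, metric=top_decile_response):
    """Fit each candidate on train, evaluate on holdout, return the winner."""
    results = {name: metric(fit(train), holdout) for name, fit in candidates.items()}
    return max(results, key=results.get), results

def make_single_feature_fit(i):
    """Hypothetical toy algorithm: score by feature i, with the sign learned from train."""
    def fit(train):
        mean_x = sum(f[i] for f, _ in train) / len(train)
        mean_y = sum(y for _, y in train) / len(train)
        cov = sum((f[i] - mean_x) * (y - mean_y) for f, y in train)
        sign = 1.0 if cov >= 0 else -1.0
        return lambda f: sign * f[i]
    return fit

# Toy data (assumption): only feature 0 drives the response.
rng = random.Random(1)
rows = []
for _ in range(2000):
    f = (rng.random(), rng.random())
    rows.append((f, 1 if rng.random() < f[0] else 0))
train, holdout = rows[:1000], rows[1000:]

candidates = {"feature_a": make_single_feature_fit(0),
              "feature_b": make_single_feature_fit(1)}
best, results = pick_best(candidates, train, holdout)
print(best, results)
```

The metric is a parameter on purpose : swapping in a profit calculation keeps the comparison aligned with what matters to the business.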

4. Zoom in on your targets

– When you want to use a data mining model to select the customers who are most likely to buy your outstanding product XYZ, it is reasonable to use your past buyers of XYZ as the positive targets in your model. You get a model with an excellent lift and use it for a mailing. Afterwards you proudly report to your executives that your model (you !) increased the mailing return by a factor of 3. Great. The logical thing is to move on to the next problem, product MNO …

Wait ! Zoom in on your targets ! When your mailing campaign is over, you now have all the data you need to create a new, better, model for product XYZ. Your targets : your past buyers of XYZ in response to your mailing. With this new model, you will not only take their “natural” propensity to buy into account, but also their willingness to respond to your mailing !
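Building the training set for this second model could look like the sketch below. Restricting the rows to the customers who were actually mailed is my assumption about how to set this up, and the record layout and names are hypothetical.

```python
def campaign_training_set(customers, mailed_ids, responder_ids):
    """Keep only mailed customers; the target is 1 if they bought in response."""
    mailed_ids, responder_ids = set(mailed_ids), set(responder_ids)
    return [(c, 1 if c["id"] in responder_ids else 0)
            for c in customers if c["id"] in mailed_ids]

customers = [{"id": 1, "age": 34}, {"id": 2, "age": 51},
             {"id": 3, "age": 45}, {"id": 4, "age": 29}]

# Customers 1, 2 and 3 received the mailing; only customer 2 bought in response.
training = campaign_training_set(customers, mailed_ids=[1, 2, 3], responder_ids=[2])
labels = [label for _, label in training]
print(labels)
```

Customer 4 was never mailed, so that record says nothing about willingness to respond and is left out.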

– If your databases contain far more observations than your data mining tool can handle, the only thing you can do is use samples. No problem. Build your model on a sample, and you can use it. But you can push it a bit further. Zoom in ! Use your model to score the entire customer base, and then zoom in on the customers with the best scores, let’s say the top 10%. Use them to build a new, second model, which will pick up the much subtler differences in customer information to find the really promising ones.
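The two-stage idea can be sketched as follows. The "model" here is a trivial response-rate lookup standing in for a real algorithm, and the field names, segment rates and sample size are all made-up assumptions.

```python
import random
from collections import defaultdict

def fit_rate_model(rows, key):
    """Hypothetical stand-in for a real algorithm: response rate per value of `key`."""
    hits, counts = defaultdict(int), defaultdict(int)
    for features, bought in rows:
        counts[features[key]] += 1
        hits[features[key]] += bought
    overall = sum(hits.values()) / len(rows)
    rates = {value: hits[value] / counts[value] for value in counts}
    return lambda features: rates.get(features[key], overall)

def zoom_in(base, sample_size, top_fraction=0.10, seed=0):
    """Stage 1: model on a sample, score everyone. Stage 2: model on the top scores."""
    sample = random.Random(seed).sample(base, sample_size)
    first = fit_rate_model(sample, key="segment")
    ranked = sorted(base, key=lambda row: first(row[0]), reverse=True)
    top = ranked[:max(1, int(len(ranked) * top_fraction))]
    second = fit_rate_model(top, key="recency_band")
    return first, second, top

# Toy customer base (assumption): segment sets the base rate, recency refines it.
rng = random.Random(7)
segment_rate = {"A": 0.05, "B": 0.10, "C": 0.30}
base = []
for _ in range(1000):
    features = {"segment": rng.choice("ABC"),
                "recency_band": rng.choice(["recent", "old"])}
    p = segment_rate[features["segment"]] * (1.5 if features["recency_band"] == "recent" else 0.5)
    base.append((features, 1 if rng.random() < p else 0))

first, second, top = zoom_in(base, sample_size=300)
print(len(top), second({"recency_band": "recent"}), second({"recency_band": "old"}))
```

The second model only sees customers the first one already likes, so it is free to spend its effort on the differences among them.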

5. Make it simple

I confess : the four previous points all went in the direction of making things more complicated. Nevertheless, you have to keep your data mining work as simple as possible, because the guy who pays your bills wants you to deliver good models, on time for his campaigns.

So try :

– to automate as much as possible

– not to try out every possible algorithm in each data mining project. If problem A was best solved with algorithm X, then problem B, which is very similar to A, should probably be tackled with algorithm X as well. No need to waste time checking out other algorithms.

– not to make a model when you know that for the next campaign for product XYZ your marketeer will mail each and every customer. Models are made for campaign selection. When marketeers don’t select, they do not need a model.