Information Retrieval and other interesting topics

In one of his talks at QCon, John Allspaw mentioned using Holt-Winter exponential smoothing on various monitoring instances. Wikipedia has a good entry on the subject, of course, but the basic idea is to take a noisy/spikey time series and smooth it out, so that unexpected changes will stand out even more. That's often initially done by taking a moving average, so say averaging the last 7 days of data and using that as the current day's value. More complicated schemes weight that average, so that the older data contributes less.

At the recent PHP UK Conference 2012 I had the opportunity to chat about machine learning and IR with a bunch of very smart people. One of the conversations included the always enlightening Rowan Merewood, and was around ranking Twitter friends. It's reasonably well known that Google used to use a variant of PageRank based on who-follows-who to rank it's Twitter search results (back when it had them). The question is, could the same kind of thing work over a much smaller set - say using it to rank the influence users I follow, in order, perhaps, to prioritise tweets?

I had a great time at the recent PHP Benelux Conference in Belgium. There was a real mix of very interesting people to talk to, and I came away from it buzzing with new ideas (and a ridiculously long todo list). Some of the conversations I had during the weekend were around technical presenting at conferences and usergroups, so I thought I'd collect a handful of the tips that were discussed into a post, and use a few of my favourite speakers at the event to illustrate them.

A lot of interesting techniques involve taking statistical samples, and using those to predict what we'll see in the future. Usually this works pretty well, but when we're dealing with a lot of options or if we have some options that are very rare that approach can go pretty wrong. If we go down the street and note down how many men and women we see, we'll probably be able to use that to predict the chance of the next person we see being male or female pretty well. However, if we were counting all the species of animals we encounter, and trying to use that to predict what we'll see in the future, we'd likely run in to a couple of problems.

In the last post we had a simple stepping algorithm, and a gradient descent implementation, for fitting a line to a set of points with one variable and one 'outcome'. As I mentioned though, it's fairly straightforward to extend that to multiple variables, and even to curves, rather than just straight lines.