Month: September 2015

There are many dimensions on which we might compare a machine learning or data mining algorithm. A few of the first that come to mind are:

1) Sample complexity, convergence

How much predictive power is the algorithm able to extract from a given number of examples?

All else being equal, if algorithm A with N examples behaves the same as algorithm B with 2N examples, we would prefer algorithm A. This can vary in importance depending on how scarce or expensive data is.

2) Speed

How quickly does the algorithm run?

Obviously, a tool is more useful if it is faster. However, if that faster algorithm comes at the expense of sample complexity, one would need to measure the expense of running longer against the expense of gathering more data.

3) Guarantees

Does the algorithm just have good performance (along whatever dimension) in practice, or is it proven? Is it always good, or only in certain situations? How predictable is the performance?

When SVMs showed up, it looked like good practical algorithms came from good theory. These days, it seems clear that powerful intuition is also an excellent source of practical algorithms (deep learning).

Perhaps theory will some day catch up, and the algorithm with the best bounds will coincide with the one with the best practical performance. However, this is not always the case today: we often face a trade-off between algorithms with good theoretical guarantees and algorithms with good empirical performance. Some examples are 1) Upper-Confidence Bounds versus Thompson sampling with bandit algorithms [Update June 2018: Thompson sampling has largely caught up!], 2) running a convex optimization algorithm with a theoretically derived Lipschitz constant versus a smaller one that still seems to work, and 3) doing model selection via VC-dimension generalization bounds versus using K-fold cross-validation.
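To make the first example concrete, here is a minimal sketch of the two bandit strategies on Bernoulli arms. It assumes the textbook UCB1 index and Thompson sampling with a Beta(1, 1) prior; the function names, the arm means, and the horizon are hypothetical choices for illustration, not a benchmark.

```python
import math
import random


def ucb1(means, horizon, rng):
    """UCB1: pull the arm with the highest optimistic estimate."""
    k = len(means)
    pulls = [0] * k
    wins = [0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialise the estimates
        else:
            arm = max(
                range(k),
                key=lambda a: wins[a] / pulls[a]
                + math.sqrt(2.0 * math.log(t) / pulls[a]),
            )
        reward = 1 if rng.random() < means[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
    return pulls, sum(wins)


def thompson(means, horizon, rng):
    """Thompson sampling: sample each arm's mean from its Beta posterior."""
    k = len(means)
    pulls = [0] * k
    wins = [0] * k
    for _ in range(horizon):
        samples = [
            rng.betavariate(wins[a] + 1, pulls[a] - wins[a] + 1)
            for a in range(k)
        ]
        arm = max(range(k), key=lambda a: samples[a])
        reward = 1 if rng.random() < means[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
    return pulls, sum(wins)


if __name__ == "__main__":
    arm_means = [0.2, 0.5, 0.8]  # hypothetical arms
    for strategy in (ucb1, thompson):
        pulls, total = strategy(arm_means, 2000, random.Random(0))
        print(strategy.__name__, pulls, total)
```

Comparing how often each strategy pulls the best arm over a fixed horizon is one simple way to see the gap between bounds and practice for yourself.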

4) Memory usage

How much space does the algorithm need?

I work with a lot of compute-heavy applications where this is almost a wall: we don’t care about memory usage until we run out of it, after which we care a great deal. With simpler algorithms and larger datasets, this is often more nuanced, with concerns about different cache sizes.

5) Handholding

Can it be used out of the box, or does it require an expert to “tune” it for best performance?

A classic example of this is stochastic gradient descent. In principle, for a wide range of inputs, convergence is guaranteed by iteratively setting $x_{t+1} = x_t - \lambda_t g_t$, where $g_t$ is a noisy unbiased estimate of the gradient at $x_t$ and $\lambda_t$ is some sequence of step-sizes that obeys the Robbins-Monro conditions [1]. However, in practice, the difference between step-size sequences that all obey these conditions but use different constants can be enormous, and finding these constants is a bit of a dark art.

[1] ($\sum_t \lambda_t = \infty$ and $\sum_t \lambda_t^2 < \infty$.)
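To see how much the constants matter, here is a minimal sketch rather than a recommended recipe. It runs stochastic gradient descent on the toy objective 0.5 * x**2 with step sizes of the form a / (b + t), which obey the Robbins-Monro conditions for any a > 0 and b >= 0. The objective, the schedule family, and the names `sgd`, `a`, `b`, and `noise_std` are all assumptions made for illustration.

```python
import random


def sgd(a, b, steps=1000, x0=10.0, noise_std=1.0, seed=0):
    """Run SGD on f(x) = 0.5 * x**2 with step size a / (b + t).

    The true gradient is x; Gaussian noise is added so that each
    gradient estimate is noisy but unbiased.
    """
    rng = random.Random(seed)
    x = x0
    for t in range(1, steps + 1):
        grad = x + rng.gauss(0.0, noise_std)  # noisy unbiased gradient
        x -= a / (b + t) * grad               # Robbins-Monro style step
    return x


if __name__ == "__main__":
    # Same algorithm, same guarantee, very different behaviour.
    for a in (0.01, 1.0, 100.0):
        print(f"a = {a:6.2f}  final x = {sgd(a, b=0.0):.4f}")
```

All three schedules converge in the limit, but after a fixed budget of iterations the distance from the optimum at x = 0 can differ by orders of magnitude.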

6) Implementability

How simple is the algorithm? How easy is it to implement?

This is quite complex and situational. These days, I’d consider an algorithm that consists of a few moderate-dimensional matrix multiplications or singular value decompositions “simple”. However, that’s due to the huge effort that’s been devoted to designing reliable matrix algorithms, and the ubiquity of easy-to-use libraries.

7) Amenability to parallelization

Does the algorithm lend itself to parallelization? (And what type of parallelization?)

And if it does, under what model? Map-reduce, GPU, MPI, and OpenMP all have different properties.

8) “Anytime-ness”

Can the algorithm be implemented in an anytime manner?

That is, does the algorithm continuously return a current “best-guess” of the answer that is refined over time in a sensible manner? This can help diagnose problems before running the full algorithm, enormously useful in debugging large systems.

Note this is distinct from being an online algorithm, which I’m not mentioning here, since it’s a mix of speed and memory properties.
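As a minimal sketch of what “anytime” can look like in code, the generator below yields its current best guess after every evaluation, so the caller can stop whenever it likes. The function name `anytime_argmin` and the toy objective are hypothetical and stand in for whatever expensive search is actually being run.

```python
def anytime_argmin(f, candidates):
    """Yield the best (x, f(x)) pair seen so far after each evaluation."""
    best_x, best_val = None, float("inf")
    for x in candidates:
        val = f(x)
        if val < best_val:
            best_x, best_val = x, val
        yield best_x, best_val  # a usable answer is available at every step


if __name__ == "__main__":
    def objective(x):
        return (x - 3) ** 2

    for step, (x, val) in enumerate(anytime_argmin(objective, [0, 5, 2, 3, 4])):
        print(step, x, val)
        if val == 0:  # the caller decides when the answer is good enough
            break
```

Watching the intermediate guesses is often enough to spot a bug long before the full run would have finished.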

9) Transparency, interpretability

Can the final result be understood? Does it give insight into how predictions are being made?

Galit Shmueli argues that “explanatory power” and “predictive power” are different dimensions. However, there are several ways in which interpretability is important even when prediction is the final goal. First, the insight might convince a decision maker that the machine learning system is reliable. Second, this can be vital in practice for generalization. The world is rarely independent and identically distributed. If a domain expert can understand the predictive mechanism, they may be able to assess whether it will still hold in the future, or whether it captures something true only in the training period. Third, understanding what the predictor is doing also often yields ways to improve it. For example, the adjacent visualization of the outputs of a decision tree in two dimensions suggests the need for non-axis-aligned splits.

10) Generality

What class of problems can the algorithm address?

All else being equal, we prefer an algorithm that can, say, optimize all convex losses over one that can only fit the logistic loss. However, this often comes at a cost: a more general-purpose algorithm cannot exploit the extra structure present in a specialized problem (or, at least, has more difficulty doing so). It’s instructive how many different methods are used by LIBLINEAR depending on the particular loss and regularization constant.

11) Extendability

Does the algorithm have lots of generalizations and variants that are likely to be interesting?

This is more of an issue when reviewing a paper than when deciding which algorithm is the best one to use once all the dust has settled.

12) Insight

Does the algorithm itself (as opposed to its results) convey insight into the problem?

Gradient descent for maximum likelihood learning of an exponential family is a good example, as it reveals the moment-matching conditions. Insight of this type, however, doesn’t suggest that one should actually run the algorithm.
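To spell out the moment-matching claim, here is the standard derivation, written in notation I am assuming for illustration: sufficient statistics $T(x)$, natural parameters $\theta$, log-partition function $A(\theta)$, and data $x_1, \dots, x_N$.

$$
p(x; \theta) = h(x) \exp\left( \theta^\top T(x) - A(\theta) \right),
\qquad
L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log p(x_i; \theta)
$$

$$
\nabla_\theta L(\theta)
= \frac{1}{N} \sum_{i=1}^{N} T(x_i) - \nabla_\theta A(\theta)
= \frac{1}{N} \sum_{i=1}^{N} T(x_i) - \mathbb{E}_{p(x; \theta)}\left[ T(x) \right]
$$

Each gradient step therefore moves $\theta$ in the direction of the gap between the empirical moments and the model's moments, and the gradient vanishes exactly when the two match.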

13) Model robustness

How does the performance of the algorithm hold up to violations of its modeling assumptions?

Suppose we are fitting a simple line to some 2-D data. Obviously, all else being equal, we would prefer an algorithm that still does something reasonable when the expected value of $y$ is not linear in $x$. The example I always harp on here is the pseudolikelihood. The original paper pointed out that this will have somewhat worse sample complexity than the full likelihood, and (much!) better time complexity. Many papers seem to attribute the bad performance of the pseudolikelihood in practice to this sample complexity, when the true cause is that the likelihood does something reasonable (minimizes KL divergence) when there is model mis-specification, but the pseudolikelihood does not.

—

One thing I often ponder is how much improvement in one dimension is enough to “matter”. Personally, I would generally consider even a small constant factor (say 5-10%) improvement in sample complexity quite important. Meanwhile, it would be rare to get excited about even, say, a factor of two improvement in running time.

What does this reflect? First, generalization is the fundamental goal of data analysis, and we are likely to be willing to compromise most things if we can really predict better. Second, we instinctively distrust running times. Theory offers few tools for understanding constant factors, as these are highly architecture- and implementation-dependent. In principle, if one could convincingly show that a factor of two improvement was truly there, I think this probably would be significant. (This might be true, say, if all algorithms are bottlenecked by a few matrix multiplications, and a new algorithm provably reduces the number needed.) However, this is rare.

In some places, I think constant factor skepticism can lead us astray. In reality, a factor of 30 improvement in speed is probably better than changing a $O(n \log n)$ complexity to $O(n)$. (Calculate $n$ such that $\log_2 n = 30$, i.e. $n \approx 10^9$.) This is particularly true when the lower-complexity algorithm has higher constant factors. As an example, I’ve always found the $O(n \log n)$ algorithm for projection onto the $\ell_1$ ball to be faster than the $O(n)$ algorithm in practice.
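For reference, here is a minimal sketch of the simpler of the two projection approaches, assuming the ball in question is the $\ell_1$ ball and that the higher-complexity method is the usual sort-and-threshold one. It is written in plain Python for readability rather than speed, and the name `project_l1_ball` is my own.

```python
import math


def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto {w : sum(|w_i|) <= radius}.

    Sort-based method: the sort dominates, giving O(n log n) time.
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    abs_v = [abs(x) for x in v]
    if sum(abs_v) <= radius:
        return list(v)  # already inside the ball
    u = sorted(abs_v, reverse=True)
    cumsum = 0.0
    theta = 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        candidate = (cumsum - radius) / j
        if uj - candidate > 0:
            theta = candidate  # coordinate j stays non-zero after shrinking
        else:
            break
    # Soft-threshold the magnitudes by theta and restore the signs.
    return [math.copysign(max(a - theta, 0.0), x) for a, x in zip(abs_v, v)]


if __name__ == "__main__":
    print(project_l1_ball([3.0, 0.0]))
    print(project_l1_ball([2.0, -2.0]))
```

The linear-time alternative replaces the sort with a randomized pivot search for the same threshold, which is where the extra constant factors tend to come from.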

—

Given that there are so many dimensions, can a new algorithm really improve on all simultaneously? It seems rare to even improve on a single dimension without a corresponding cost somewhere else. There will often be a range of techniques that are Pareto optimal, depending on one’s priorities. Understanding this range is what makes having an expert around so important.
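A minimal sketch of what Pareto optimality means here: an algorithm is dropped only if some other algorithm is at least as good on every dimension and strictly better on at least one. The algorithm names and cost numbers below are made up purely for illustration, and lower is assumed to be better on every axis.

```python
def dominates(a, b):
    """True if cost vector a is no worse than b everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(costs):
    """Return the names whose cost vectors are not dominated by any other."""
    return [
        name
        for name, c in costs.items()
        if not any(
            dominates(other, c)
            for other_name, other in costs.items()
            if other_name != name
        )
    ]


if __name__ == "__main__":
    # Hypothetical (test error, running time) pairs, lower is better.
    costs = {
        "A": (1.0, 5.0),
        "B": (2.0, 2.0),
        "C": (3.0, 3.0),
        "D": (5.0, 1.0),
    }
    print(pareto_front(costs))
```

Everything on the front is the “right” answer for some set of priorities, which is exactly why picking among them takes judgment.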

Ideally, as a community, we would be able to provide a “consumer” of machine learning an easy way to find the algorithm for them. Or, at least, we might be able to point them in the direction of possible algorithms. One admirable attempt along this line is the following table from The Elements of Statistical Learning:

(Incidentally, notice how many of the desiderata do not overlap with mine.)

Some other situations show a useful contrast. For example, take this decision tree for choosing an algorithm for unconstrained optimization, due to Dianne O’Leary:

Essentially, this amounts to the principle that one should use the least general algorithm available, so that it can exploit as much structure of the problem as possible. Though one can quibble with the details (pity the subgradient methods) it at least comes close to giving the “right” algorithm in each situation.

This doesn’t seem possible with machine learning, since there doesn’t exist a single hierarchy. Rather, ML problems are a tangle of model specification, computational and architecture requirements, implementation constraints, user risk-tolerances and so on. It won’t be easy to automate away the experts. (Even ignoring the possible misalignment of incentives in a field whose domain is automating itself.)