Last month, at our Alchemist Accelerator demo day, we showed off some of the great tech we’ve been building. We’ve gotten a ton of awesome press and recognition about our Machine Intelligence products, but also a fair amount of skepticism about whether our WiseRF benchmarks are as good as advertised.

This got me thinking about ML benchmarking in general and the skepticism that I typically have about other people’s benchmarks, particularly when their own algorithm is involved. How can we, as machine learning practitioners, guarantee objectivity in our benchmarks? How do we change the general perception that accuracy benchmarks simply show how much time you spent tuning and tweaking each implementation? Can we devise a set of guidelines to ensure accurate and fair representation of each algorithm?

It’s useful to take a step back and examine the purpose of benchmarks in machine learning. The primary goal of benchmarking is to inform practitioners of the best choice of algorithm for their task at hand. That is, effective benchmarking simplifies the life of the data scientist faced with an unwieldy number of options. Benchmarks also establish the relative strengths and weaknesses of different machine learning implementations when applied to distinct types of data. This can help identify shortcomings and bugs in existing codebases, and ultimately helps the community improve the available tools and increase their general applicability.

Others have written about the role of benchmarking in machine learning. One post that strikes me as particularly insightful and relevant is “Does off-the-shelf machine learning need a benchmark?” by Jay Kreps. Kreps argues that “The end goal should be to automate the statistician as completely as possible, not just to add more tricks to her bag” and insofar as the objective is to automate this process, the “best approach [to benchmarking] is something that just throws many, many data sets at the algorithm, and breaks down the results along various dimension[s].”

Kreps’ viewpoint dovetails with wise.io’s vision of democratized machine learning. However, instead of automating the statistician, our aim is to lower the barrier to entry for machine learning (effectively turning more people into statisticians) by arming experts and non-experts alike with easy-to-use tools that work well across a variety of industries and with different types of data. Benchmarking is of utmost importance for wise.io, as it allows us to gauge the effectiveness of our technology under an assortment of different situations, to diagnose any shortcomings in our tools, and to further develop our ML products to ensure that they work well, out of the box, on a large class of problems.

A set of guiding principles for benchmarking

Below is a set of (necessary, but probably not sufficient) guidelines for the objective comparison of ML implementations on a variety of metrics (e.g., accuracy, speed, memory usage). These guidelines have been put to use by our Principal Data Scientist, Erin, in her benchmarks of WiseRF (Part 1):

Make it Reproducible

Any ML practitioner should be able to duplicate your result.

For instance, make a publicly available Amazon Machine Image (AMI) with all data, code, and executable files that run the benchmarks.

If you are benchmarking your own implementation or algorithm, go out of your way to keep the comparison unbiased.

Don’t run any other applications while benchmarking, and don’t do it on your laptop!

Use the most recent version of each implementation that you are comparing, and run them all on the same machine. Or, ask the software maintainer which version they prefer you to use in the benchmark.

Try to use the same number of cores for each implementation. In many cases, maximizing CPU utilization means running more threads than there are cores. However, some approaches make large copies of data on a per-thread basis, so attaining the highest computational performance may make them look less favorable in terms of memory usage. This trade-off should be documented.

It is the nature of software development that bugs sometimes arise in the newest version. Contact the software maintainer if there are any surprises.
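One lightweight way to support the reproducibility points above is to publish a small manifest next to the results. The following is only a sketch under stated assumptions: the function names (`build_manifest`, `save_manifest`), the manifest fields, and the idea of passing version strings in by hand are all illustrative choices, not part of any WiseRF tooling.

```python
import hashlib
import json
import os
import platform
import sys


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_paths, implementations, n_threads, seed):
    """Collect the facts a reader needs to rerun the benchmark.

    implementations: dict mapping an implementation name to its version
    string (hypothetical input; fill it in from each package yourself).
    """
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "logical_cpus": os.cpu_count(),
        "n_threads": n_threads,  # same thread count for every implementation
        "seed": seed,
        "implementations": dict(implementations),
        # Hashes let readers confirm they are using identical data files.
        "data_sha256": {os.path.basename(p): file_sha256(p) for p in data_paths},
    }


def save_manifest(manifest, path):
    """Write the manifest as sorted, indented JSON for easy diffing."""
    with open(path, "w") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
```

Shipping a file like this alongside the benchmark image makes it easy for a reader to check that their rerun used the same data, versions, and thread count.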

Make it Transparent

Completely describe all modeling decisions that you make.

Be clear about why you selected the particular data sets you chose to analyze. Data sets should be representative of the kinds of problems that the algorithm is designed to solve.

Attempt to explain why different algorithms or implementations of the same algorithm do better or worse under each of the different scenarios.

Detail how you tuned each algorithm/implementation, and to the extent possible make the amount of tuning equitable between algorithms.

To the last point, hyperparameter tuning can be one of the most difficult aspects of ML benchmarking since methods have different numbers of free parameters and it is not always clear which parameters are the most important in driving performance. The amount of tuning should correspond to the use-case and target audience for the algorithm. For example, a few options are:

Off-the-shelf – How does the software perform right out of the box, with the default parameter settings? Like it or not, many users of your software are inexperienced at the art of machine learning, and will simply use the default settings. Good default settings are important, and testing the behavior of out-of-the-box algorithms is a critical and under-valued practice.

Moderate tuning – The typical data scientist will optimize the expected performance over a few parameters, often using a coarse grid search.

Extensive tuning – A few practitioners need the absolute optimal model, and will employ fine grid searches over all parameters.
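The three tuning tiers can be made explicit in the benchmark code itself, so that the tuning budget is reported alongside each score. Here is a minimal sketch, assuming you can wrap each implementation in a `score_fn` that takes hyperparameters as keyword arguments and returns a validation score where higher is better. The names `grid_search`, `run_tiers`, and `score_fn` are hypothetical.

```python
import itertools


def grid_search(score_fn, grid):
    """Evaluate score_fn on every combination in grid.

    grid: dict mapping parameter name to a list of candidate values.
    Returns (best_params, best_score, n_evals).
    """
    names = sorted(grid)
    best_params, best_score, n_evals = None, float("-inf"), 0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        n_evals += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evals


def run_tiers(score_fn, defaults, coarse_grid, fine_grid):
    """Report score and tuning budget for each of the three tiers."""
    report = {
        "off_the_shelf": {
            "params": dict(defaults),
            "score": score_fn(**defaults),
            "n_evals": 1,  # defaults only, no search
        }
    }
    for tier, grid in (("moderate", coarse_grid), ("extensive", fine_grid)):
        params, score, n_evals = grid_search(score_fn, grid)
        report[tier] = {"params": params, "score": score, "n_evals": n_evals}
    return report
```

Reporting `n_evals` for every implementation is one way to show readers whether the amount of tuning was roughly equitable across the methods being compared.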

Lastly, in addition to accuracy, benchmarks should always measure speed (of learning AND prediction) and memory usage. Each practitioner operates under a different set of requirements and constraints, so making all of this information available can clarify which algorithm she should choose. For instance, for some practitioners a 1% boost in accuracy will be worth a tenfold increase in training time, whereas for others it will not. Memory usage is particularly important since scalability is a pressing challenge in analyzing modern data sets. In a subsequent post, we will give a step-by-step guide to benchmarking memory usage in Linux (spoiler: it is hard to get an accurate memory benchmark, and in cases where instrumentation or CPU-intensive memory profiling is needed, you should benchmark speed separately).
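To illustrate the idea of measuring speed and memory in separate runs, here is a rough Python sketch. It assumes each implementation can be wrapped in a `fit(train_data)` function that returns a model and a `predict(model, test_data)` function; those wrappers and the helper names are hypothetical. It uses the standard library's `tracemalloc`, which only sees allocations made through Python's allocator and may miss memory allocated by native code, so treat it as a stand-in for a proper OS-level measurement.

```python
import time
import tracemalloc


def time_call(fn, *args, repeats=3):
    """Return (result, best wall-clock seconds) with no profiler attached."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


def peak_memory_call(fn, *args):
    """Return (result, peak traced bytes) from a separate, traced run."""
    tracemalloc.start()
    try:
        result = fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def benchmark(fit, predict, train_data, test_data):
    """Time learning and prediction first, then trace memory separately."""
    model, fit_seconds = time_call(fit, train_data)
    _, predict_seconds = time_call(predict, model, test_data)
    # Tracing slows execution, so these runs are never used for timing.
    _, fit_peak = peak_memory_call(fit, train_data)
    _, predict_peak = peak_memory_call(predict, model, test_data)
    return {
        "fit_seconds": fit_seconds,
        "predict_seconds": predict_seconds,
        "fit_peak_bytes": fit_peak,
        "predict_peak_bytes": predict_peak,
    }
```

Keeping the timed runs free of instrumentation means the overhead of memory tracing never leaks into the reported speed numbers.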

Applying these principles to WiseRF benchmarks

Over the next few weeks, we will be benchmarking WiseRF against a multitude of other ML algorithms on a variety of prediction problems. Posts about those benchmarks will be published on our blog and linked here. If you have a particular data set or ML code that you would like us to test against, please contact us.

MLcomp: a step in the right direction

I’ll conclude by saying that there are a number of people out there who are taking ML benchmarking seriously. In particular, I’m impressed by the efforts of MLcomp, which hosts an ML benchmarking service to ensure the reproducibility, transparency and accuracy of their benchmarks. Their stated mission is to help the practitioner make sense of a “dazzling array of methods” and to make it easy for him/her to choose the best method for the task at hand. They do this by performing benchmarks on accuracy, speed, and memory usage across a broad range of problems with user-contributed code and data sets. They allow anybody to download and run their benchmarks.

MLcomp is certainly a big step in the right direction, but it lacks the critical mass of popularity in the community (and appears to no longer be maintained?). For example, there is only one Random Forest implementation (Weka) in their system. My general sense is that many of the data sets and methods are too obscure and lack sufficient metadata, making it hard to make sense of their dazzling array of benchmarks (300+ ML methods have been tested on some 600+ data sets). It would be great to see more moderation of the info on the site as well as some higher-level analysis of the benchmarks (e.g., which types of algorithms achieve the highest AUC on imbalanced binary classification problems with more than 50 dense features?). Such a system would be a tremendous resource for data scientists.
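To make the kind of higher-level analysis described above concrete, here is a small sketch of how such a query might look if benchmark results carried data set metadata. Everything in it is an assumption for illustration: the record layout, the metadata keys (`n_classes`, `minority_fraction`, `dense`, `n_features`), and the 20% threshold for "imbalanced" are mine and do not reflect MLcomp's actual schema.

```python
from collections import defaultdict
from statistics import mean


def rank_methods(results, dataset_filter, metric="auc"):
    """Rank methods by mean metric over data sets matching dataset_filter.

    results: iterable of dicts with hypothetical keys "method", "dataset"
    (a dict of metadata), and the metric name.
    Returns a list of (method, mean_score, n_datasets), best first.
    """
    scores = defaultdict(list)
    for row in results:
        if metric in row and dataset_filter(row["dataset"]):
            scores[row["method"]].append(row[metric])
    ranked = [(method, mean(vals), len(vals)) for method, vals in scores.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


def imbalanced_binary_dense(meta, min_features=50, max_minority_fraction=0.2):
    """Filter for imbalanced binary problems with many dense features.

    The thresholds are illustrative assumptions, not established cutoffs.
    """
    return (
        meta["n_classes"] == 2
        and meta["minority_fraction"] <= max_minority_fraction
        and meta["dense"]
        and meta["n_features"] > min_features
    )
```

Calling `rank_methods(results, imbalanced_binary_dense)` would then answer the example question directly, provided the metadata is complete enough to support the filter, which is exactly the gap noted above.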
