How to evaluate machine learning? U of T research supports latest benchmark initiative

Assistant Professor Gennady Pekhimenko’s research in systems, computer architecture and machine learning is contributing to the next performance benchmark for ML (photo by Nina Haikara)

Machine learning, a popular subfield of artificial intelligence that is revolutionizing everything from legal research to medical diagnostics, depends on three major parts: a model, a dataset and the hardware it runs on.

So how do researchers, startups and companies evaluate its overall effectiveness?

Options were limited until the recent formation of MLPerf, a consortium of industry and academic partners including Google, Intel, Baidu, Harvard University, Stanford University and the University of Toronto, who are working together to offer a new benchmark suite to evaluate machine learning (ML) performance, from speed to system cost and power efficiency.

“Current benchmark suites give some basic numbers to say how well these benchmarks perform on certain hardware, but do not provide any insight into why these applications perform one way or another,” says Gennady Pekhimenko, an assistant professor in the department of computer and mathematical sciences at U of T Scarborough and the tri-campus graduate department of computer science.

“To know which design decision is bad or not for ML applications, you want to have some representative reference model,” he says. “Observations from a limited set of benchmarks can be misleading. We might get a good result for one specific workload but lower performance for everything else. MLPerf aims to avoid this issue by providing a small yet representative set of ML models that cover a very wide spectrum of applications.”
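The risk of judging a design by a single workload can be illustrated with a little arithmetic. The Python sketch below is only an illustration: the workload names and speedup figures are made up, and the geometric mean is one common way to summarize speedups across a suite, not necessarily the method MLPerf uses.

```python
import math


def geometric_mean(values):
    """Return the geometric mean of a non-empty list of positive numbers."""
    if not values:
        raise ValueError("need at least one value")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical speedups of a new design over a baseline.
# These numbers are illustrative assumptions, not measurements.
speedups = {
    "image_classification": 3.0,
    "machine_translation": 0.8,
    "reinforcement_learning": 0.7,
}

# Looking only at the best workload suggests a large win.
best_case = max(speedups.values())

# Summarizing across all workloads tells a more modest story.
overall = geometric_mean(list(speedups.values()))

print(f"best single workload: {best_case:.2f}x")
print(f"geometric mean across workloads: {overall:.2f}x")
```

With these invented numbers, the design looks three times faster on image classification alone, yet it is slower than the baseline on the other two workloads, and the suite-wide summary is far more modest.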

Pekhimenko, a systems and computer architecture researcher, was invited by Google and Stanford to take part in an initiative that later led to MLPerf. He says MLPerf is exploring two benchmarking categories. In the “open” category, any model can be applied to a fixed dataset, such as the ImageNet dataset for image classification. In the “closed” category, both the model and the dataset are fixed, which makes it very useful for hardware and software stack developers who want to evaluate the execution time, power and cost of their designs.
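The difference between the two categories can be sketched in a few lines of Python. This is a simplified illustration of the description above, not MLPerf's actual rules: the `Submission` class, the `is_valid` function and the reference names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical reference task; these names are illustrative assumptions.
REFERENCE_DATASET = "imagenet"
REFERENCE_MODEL = "reference-cnn"


@dataclass
class Submission:
    category: str  # "open" or "closed"
    model: str
    dataset: str


def is_valid(sub: Submission) -> bool:
    """Check a submission against the category rules described in the article."""
    # Both categories fix the dataset.
    if sub.dataset != REFERENCE_DATASET:
        return False
    if sub.category == "open":
        # Open: any model is allowed.
        return True
    if sub.category == "closed":
        # Closed: the model is fixed as well.
        return sub.model == REFERENCE_MODEL
    return False


print(is_valid(Submission("open", "my-new-network", "imagenet")))
print(is_valid(Submission("closed", "my-new-network", "imagenet")))
print(is_valid(Submission("closed", "reference-cnn", "imagenet")))
```

Because the closed category holds the model and the dataset constant, any difference in time, power or cost between two closed submissions can be attributed to the hardware and software stack underneath.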

MLPerf takes its inspiration from benchmarks for general-purpose computing and database systems, which have helped propel both fields since the 1980s and have led to multiple breakthroughs.

U of T computer science’s Alex Krizhevsky, Ilya Sutskever and University Professor Emeritus Geoffrey Hinton won the ImageNet competition in 2012, which advanced ML’s capability at image recognition tasks. Their deep convolutional neural network, called AlexNet, outperformed all previous results. Their top-5 error rate of roughly 15 per cent, well ahead of the runner-up’s roughly 26 per cent, has since been reduced in subsequent ImageNet contests to less than five per cent.
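ImageNet classification results are conventionally reported as a top-5 error rate: a prediction counts as correct if the true label appears among the model's five highest-ranked guesses. The Python sketch below shows how such a rate could be computed. The function name and the toy labels are illustrative assumptions, not part of any official evaluation code.

```python
def top_k_error(ranked_predictions, true_labels, k=5):
    """Fraction of examples whose true label is not among the first k guesses.

    ranked_predictions: one list of labels per example, best guess first.
    true_labels: the correct label for each example.
    """
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        raise ValueError("need at least one example")
    misses = sum(
        1
        for guesses, label in zip(ranked_predictions, true_labels)
        if label not in guesses[:k]
    )
    return misses / len(true_labels)


# Toy data with two guesses per image (illustrative only).
predictions = [
    ["cat", "dog"],
    ["dog", "cat"],
    ["car", "bus"],
    ["bus", "car"],
]
labels = ["cat", "cat", "tree", "car"]

print(top_k_error(predictions, labels, k=1))
print(top_k_error(predictions, labels, k=2))
```

Allowing more guesses per image can only lower the error rate, which is why top-5 figures are always at or below the corresponding top-1 figures.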

“The ImageNet competition provided us with one important benchmark – image classification,” says Pekhimenko, who credits Krizhevsky’s use of graphics processing units (GPUs) with making AlexNet training practical.

"But there are many other important applications that use very different models to solve problems. For example, machine translation, such as Google Translate, uses a very different model based on recurrent neural networks. Another example, reinforcement learning, used in the AlphaGo system that was able to beat the human Go world champion [Lee Sedol], uses very different networks as well,” he says, noting hardware, systems, and algorithmic developments have made these models practical and efficient.

“We would like ML researchers to contribute to our work and help us choose the best models for different important ML tasks. Systems and architecture communities will then make the best software and hardware designs for these models.”

For his part, Pekhimenko says his lab created an open-source benchmark suite called TBD, a training benchmark for deep neural networks. Released in March, it has had a significant impact on the benchmark choices and methodology of the MLPerf initiative.

“It’s called TBD – To Be Determined – because ML is changing so fast.”

“With TBD, we’re trying to go beyond pure benchmarking. We're interested in understanding how well available hardware and software perform, but we also look at both hardware and software efficiency. We then provide hints to the ML developers, so they can make their networks more efficient, and hence develop new algorithms and insights faster,” he says. MLPerf will continue to develop its own benchmark suite.

The new faculty member, who has worked at both IBM and Microsoft Research, is happy to be back at U of T, where he completed his master’s thesis prior to his doctoral studies at Carnegie Mellon University.

“I had many options to stay in the U.S.,” he says. “But for many reasons, I'm really happy to be here. Great schools are not always in great cities. In Toronto, you have a great school and an amazing city to live in. So it was an easy choice for me.”