Hey, Criterion.rs maintainer here. I think it’s great that you’re using Criterion.rs for this. I’m taking a bit of a break from active development for the moment but I’d be happy to assist with this work from the Criterion.rs side. Let me know if there’s anything I can help with.

integrate perf record into criterion runners (this may take a bit of effort) so that perf.rlo can display more granular metrics than just runtime

This seems like it would be useful for others as well. I don’t know much about perf specifically; can that be done in-process, or would it require running the benchmark in a sub-process?

sort out a stable JSON format (I don’t know if the criterion authors expect to keep the current files they write as a stable format)

The JSON formats are deliberately unspecified and open to change, unfortunately. They’ve already changed at least once since 0.1.0. It might be possible to support this use-case more cleanly by allowing the user to provide a custom report that would receive the data directly. There’s already something like that internally, but the API isn’t really ready for public use.

make criterion able to write data files and reports to a configurable directory

This one should be pretty easy; I think all of the necessary code is there for it, I just didn’t want to stabilize an API for it until I needed to.

ping - @dikaiosune, I’m quite lost as to the status. Should we make a kind of tracking issue here? We’re hitting “Yet Another” case where it’d be really nice to have some insight into the effects of various changes on runtime performance (in this case, it is @michaelwoerister’s clever PR to reuse generics from dependencies, which offers a tidy win on compilation time, but comes at the cost of being able to do less inlining – for now, we are limiting to debug builds, but …).

There’s a bit more work to do before I think it would be a good idea to start collecting metrics on perf.rlo (and I still don’t have a clear idea of how to make the data more navigable). In the meantime, it would be pretty straightforward to clone the repository onto a benchmark machine (bare metal preferred) and run it. I would recommend the following:

set a rustup override for the benchmark directory for the “base” toolchain (before the changes)

run cargo bench, save the results

set a rustup override for the benchmark directory with a custom toolchain from the PR

run cargo bench again, and criterion should tell you on the command line if there’s a measurable difference compared to the base run
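The four steps above could be scripted roughly as follows. This is only a sketch under a few assumptions: `comparison_steps` and `run_steps` are hypothetical helper names, the toolchain names are placeholders, `rustup` and `cargo` are on the PATH, and the PR toolchain has already been registered with rustup (for example with `rustup toolchain link`).

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Builds the command sequence for a base-vs-PR comparison.
/// Each step is a program followed by its arguments.
fn comparison_steps(base_toolchain: &str, pr_toolchain: &str) -> Vec<Vec<String>> {
    fn step(args: &[&str]) -> Vec<String> {
        args.iter().map(|s| s.to_string()).collect()
    }
    vec![
        // 1. Pin the benchmark directory to the "base" toolchain.
        step(&["rustup", "override", "set", base_toolchain]),
        // 2. Run the benchmarks with the base toolchain.
        step(&["cargo", "bench"]),
        // 3. Switch the override to the toolchain built from the PR.
        step(&["rustup", "override", "set", pr_toolchain]),
        // 4. Run again so the new results can be compared with the base run.
        step(&["cargo", "bench"]),
    ]
}

/// Runs each step inside `bench_dir`, stopping at the first failure.
fn run_steps(bench_dir: &Path, steps: &[Vec<String>]) -> io::Result<()> {
    for argv in steps {
        let status = Command::new(&argv[0])
            .args(&argv[1..])
            .current_dir(bench_dir)
            .status()?;
        if !status.success() {
            return Err(io::Error::new(
                io::ErrorKind::Other,
                format!("step failed: {}", argv.join(" ")),
            ));
        }
    }
    Ok(())
}
```

Something like `run_steps(Path::new("path/to/benchmarks"), &comparison_steps("nightly", "my-pr-toolchain"))` would then run the whole sequence. Building the list separately from running it makes it easy to print the steps first and check them before committing a machine to a multi-hour run.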

I haven’t run the entire suite yet myself so I don’t know how long it’ll take, but I’d estimate a couple of hours. Happy to help here or on IRC if someone wants to help set this up.

I don’t think we can afford a couple of hours of runtime on the current perf collector, so before we take that step we’d need to either find another dedicated server or narrow down the number of benchmarks we run.

I have a physical box we can use for lolbench for the time being. I’ve done a bunch of configuration to try to make microbenchmark performance more predictable.

I have a fork of criterion.rs (which I need to rebase after some changes from @bheisler) that records hardware PMU counters on Linux for each microbenchmark, so we should be able to describe benchmark performance much more granularly than just ns/iter.

I have a small patch in my criterion fork that writes results to CARGO_TARGET_DIR. Not having this was my main blocker to automating collection, since I definitely want to preserve the benchmark results alongside the built artifacts.
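For illustration, resolving an output directory from CARGO_TARGET_DIR might look like the sketch below. This is not the actual patch from the fork: `results_dir` is a hypothetical helper, and the `criterion` subdirectory and the fallback to `target` are assumptions modelled on Criterion.rs's default of writing under `target/criterion`.

```rust
use std::path::PathBuf;

/// Picks the directory for benchmark results.
/// `cargo_target_dir` is the value of CARGO_TARGET_DIR, if it is set.
fn results_dir(cargo_target_dir: Option<&str>) -> PathBuf {
    let base = match cargo_target_dir {
        // Use the configured target directory when one is given.
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        // Assumed fallback: cargo's conventional relative `target` directory.
        _ => PathBuf::from("target"),
    };
    // Assumed layout: keep results in a `criterion` subdirectory.
    base.join("criterion")
}
```

In a real run this would be called as `results_dir(std::env::var("CARGO_TARGET_DIR").ok().as_deref())`, so the environment lookup stays at the edge and the path logic is easy to check on its own.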

There are a few tasks left:

set up a task to run this on every nightly and push the JSON files somewhere

figure out how/whether to reformat the JSON files that come directly from criterion

maybe add more benchmarks

collect enough data that I can start figuring out how to display outliers on the various metrics

That sounds great. We should definitely talk about bringing some of those patches back into criterion.rs. I’m particularly curious about your changes to track the PMUs. Do you have a link to where I can take a look at your fork?

For further work outside lolbench, I also need to make criterion display PMU data when comparisons suggest that it regressed or improved. Right now there is no comparison between benchmark runs for PMU data as there is for benchmark times.

2018-02-02 is the first nightly date since 2018-01-01 where all of the currently assembled benchmarks compile successfully, and I’m currently running a script to backfill data from there for the last couple of months using a spare machine I have. Once more of these have run, I’m planning to start exploring a few strategies for how to present the data. At a minimum I think we need to identify some statistics which can allow us to sort graphs for the benchmarks by some sort of “interestingness” metric, surfacing the most interesting graphs to look at. Right now I am assuming we want to know about a) runtime performance regressions and b) runtime performance improvements, and to focus on recent (6 weeks old? 4?) changes like those.

I don’t have much experience dealing with this kind of data, so I’ve begun a bit of research on what kinds of analysis might be appropriate for finding interesting benchmarks, keeping a few notes at https://github.com/anp/lolbench/issues/7. So far, I’m pretty sure that:

we want to be very confident that something is a regression/improvement, not just noise

we don’t need to take any automated action based on the metric, other than making it easy for humans to know which benchmarks to look at

we don’t want to have to define lots of parameters up front for different benchmarks (there are too many individual benchmarks)

a solution should be as simple as possible so it doesn’t become a weird black box

I’ve collected a few ideas on that issue; it would be great to hear from anyone with more experience with statistics.
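To make the idea of an “interestingness” score concrete, here is one very simple candidate, offered as a sketch rather than a recommendation. It measures how far the mean of the most recent runs sits from the mean of the earlier runs, in units of the earlier runs’ standard deviation, and then sorts benchmarks by the size of that shift. The function names, the single `recent` window parameter, and the choice of statistic are all assumptions made for illustration; none of this is taken from lolbench or Criterion.rs.

```rust
use std::cmp::Ordering;

/// Signed shift of the last `recent` samples relative to the earlier ones,
/// in baseline standard deviations. Positive means the metric went up
/// (slower, if the metric is time). Returns None when there is too little
/// history or the baseline has no variance.
fn interestingness(series: &[f64], recent: usize) -> Option<f64> {
    fn mean(xs: &[f64]) -> f64 {
        xs.iter().sum::<f64>() / xs.len() as f64
    }
    if recent == 0 || series.len() < recent + 2 {
        return None;
    }
    let (base, tail) = series.split_at(series.len() - recent);
    let base_mean = mean(base);
    // Sample variance of the baseline window.
    let var = base.iter().map(|x| (x - base_mean).powi(2)).sum::<f64>()
        / (base.len() - 1) as f64;
    let sd = var.sqrt();
    if sd == 0.0 {
        return None;
    }
    Some((mean(tail) - base_mean) / sd)
}

/// Sorts benchmarks so the largest shifts (in either direction) come first.
fn rank_by_interest<'a>(
    benches: &[(&'a str, Vec<f64>)],
    recent: usize,
) -> Vec<(&'a str, f64)> {
    let mut scored: Vec<(&'a str, f64)> = benches
        .iter()
        .filter_map(|(name, series)| interestingness(series, recent).map(|s| (*name, s)))
        .collect();
    scored.sort_by(|a, b| {
        b.1.abs().partial_cmp(&a.1.abs()).unwrap_or(Ordering::Equal)
    });
    scored
}
```

This keeps to one global parameter instead of per-benchmark tuning and is easy to explain, but it is sensitive to outliers in the baseline and says nothing about statistical confidence, so it would only be a starting point for deciding which graphs a human should look at first.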

Another, slightly more exciting update: I have run enough of the benchmarks on some of my hardware to have a little bit of data from nightlies cut during February and March.

I haven’t done more than a cursory scan of a few of the benchmarks’ results, but I already found one fun performance improvement:

Talked to @eddyb briefly on IRC and it seems somewhat likely that something in 3bcda48…45fba43b caused the improvement here. That commit range includes the LLVM 6 upgrade, so that’s a pretty likely candidate.

Hopefully we can have more automated detection for these changes in the near future!

I finally have the automated collection working reliably and running on a couple of cheap dedicated servers. The data is currently summarized at https://blog.anp.lol/lolbench-data/ if anyone wants to check it out!

EDIT: I am in the process of writing a blog post with more detail, and should post that in a day or two.