The extra abstraction layer imposed by the virtual machine, the JIT compilation cycles, and the asynchronous garbage collection are the main reasons that benchmarking Java code is a delicate task. The primary weapon in battling these is replication: "billions and billions of runs" is a phrase sometimes used by practitioners. This paper describes a case study, which consumed hundreds of hours of CPU time, and tries to characterize the inconsistencies in the results we encountered.

The Canon

Let's build an "Experimental Evaluation in Software and Systems Canon", a list of readings on experimental evaluation and "good science" that have influenced us and that have the potential to influence the researchers coming after us.