Each program is run and measured at the smallest input value, with program output redirected to a file and compared to the expected output. As long as the output matches the expected output, the program is then run and measured at the next larger input value, until measurements have been made at every input value.

If the program gives the expected output within an arbitrary cutoff time (120 seconds) the program is measured again (5 more times) with output redirected to /dev/null.

If the program doesn't give the expected output within an arbitrary timeout (usually one hour) the program is forced to quit. If measurements at a smaller input value have been successful within an arbitrary cutoff time (120 seconds), the program is measured again (5 more times) at that smaller input value, with output redirected to /dev/null.

The measurements shown on the website are either

within the arbitrary cutoff - the lowest time and highest memory use from 6 measurements

outside the arbitrary cutoff - the sole time and memory use measurement
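The procedure above can be sketched roughly in Python. This is only an illustration, not the actual measurement script: the names `timed_run`, `measure` and `reported_time` are hypothetical, only elapsed time is recorded (memory use is left out), and the fallback to a smaller input value after a timeout is not shown.

```python
import subprocess
import time

CUTOFF_SECONDS = 120     # the "arbitrary cutoff" from the text
TIMEOUT_SECONDS = 3600   # the "usually one hour" timeout from the text
REPEATS = 5              # "5 more times"


def timed_run(cmd, timeout, stdout):
    """Run cmd once; return (elapsed_seconds, finished_ok)."""
    start = time.perf_counter()
    try:
        subprocess.run(cmd, stdout=stdout, stderr=subprocess.DEVNULL,
                       timeout=timeout, check=True)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return time.perf_counter() - start, False
    return time.perf_counter() - start, True


def measure(cmd, expected_output, out_path,
            cutoff=CUTOFF_SECONDS, timeout=TIMEOUT_SECONDS, repeats=REPEATS):
    """Return a list of elapsed times, or None if the output was wrong."""
    # First run: output goes to a file so it can be checked.
    with open(out_path, "wb") as out:
        elapsed, ok = timed_run(cmd, timeout, out)
    if not ok:
        return None
    with open(out_path, "rb") as f:
        if f.read() != expected_output:
            return None

    times = [elapsed]
    # Within the cutoff: measure again with output discarded.
    if elapsed <= cutoff:
        for _ in range(repeats):
            t, ok = timed_run(cmd, timeout, subprocess.DEVNULL)
            if ok:
                times.append(t)
    return times


def reported_time(times):
    """Lowest of up to 6 measurements, or the sole measurement."""
    return min(times)
```

A program that passes the output check within the cutoff yields 6 timings, and the lowest is the one reported; a program outside the cutoff yields a single timing.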

Memory use is measured by sampling GLIBTOP_PROC_MEM_RESIDENT for the program and its child processes every 0.2 seconds. Obviously those measurements are unlikely to be reliable for programs that run for less than 0.2 seconds.
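A minimal sketch of that kind of sampling loop follows. It is not the libgtop-based code: as a stand-in it reads the resident page count from `/proc/<pid>/statm`, which is Linux-only, and it samples only the program itself, not its child processes. The function names are hypothetical.

```python
import os
import subprocess
import time

SAMPLE_INTERVAL = 0.2  # seconds, as described in the text


def resident_bytes(pid):
    """Resident memory of pid in bytes (Linux /proc), or 0 if unavailable."""
    try:
        with open(f"/proc/{pid}/statm") as f:
            resident_pages = int(f.read().split()[1])
    except (OSError, IndexError, ValueError):
        return 0
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def peak_resident(cmd, interval=SAMPLE_INTERVAL, read=resident_bytes):
    """Run cmd and return the highest resident memory sample seen."""
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL)
    peak = 0
    while proc.poll() is None:        # still running
        peak = max(peak, read(proc.pid))
        time.sleep(interval)
    return peak
```

A program that finishes between two samples contributes at most one early sample, which is why very short runs give unreliable numbers.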

We start with the source-code markup you can see, remove comments, remove duplicate whitespace characters, and then apply minimum GZip compression. The Code-used measurement is the size in bytes of that GZip compressed source-code file.
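As a rough sketch of those steps in Python, under several assumptions: comment removal is language-specific, so this handles only `#`-to-end-of-line comments naively (it would also strip a `#` inside a string literal); "duplicate whitespace" is read as runs of the same whitespace character; and "minimum GZip compression" is read as compression level 1. The function names are hypothetical.

```python
import gzip
import re


def clean_source(text):
    """Naively strip '#' comments and collapse repeated whitespace."""
    # Assumption: '#' starts a comment (not true for every language).
    text = re.sub(r"#.*", "", text)
    # Assumption: collapse runs of the same whitespace character to one.
    return re.sub(r"([ \t\r\n])\1+", r"\1", text)


def code_used(text):
    """Size in bytes of the cleaned source after gzip at level 1."""
    data = clean_source(text).encode("utf-8")
    # Assumption: "minimum" compression means compresslevel=1.
    return len(gzip.compress(data, compresslevel=1))
```

For example, `code_used(open("program.py").read())` gives a single byte count that is largely insensitive to comment volume and indentation style.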

Thanks to Brian Hurt for the idea of using size of compressed source code instead of lines of code.

The GTop cpu idle and GTop cpu total are taken before forking the child-process and after the child-process exits. The percentages represent the proportion of cpu not-idle to cpu total for each core.
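The arithmetic can be sketched as follows. The snapshots are assumed to be lists of `(idle, total)` counter pairs, one pair per core; how they are obtained from GTop is not shown, and the function name is hypothetical.

```python
def cpu_load_percent(before, after):
    """Per-core percentage of not-idle time between two snapshots.

    before, after: lists of (idle, total) counters, one pair per core.
    """
    loads = []
    for (idle0, total0), (idle1, total1) in zip(before, after):
        total = total1 - total0
        idle = idle1 - idle0
        if total <= 0:
            loads.append(0)           # no time elapsed on this core
        else:
            loads.append(round(100 * (total - idle) / total))
    return loads
```

With `before = [(100, 200)]` and `after = [(173, 300)]`, the core was idle for 73 of 100 ticks, so the function reports 27.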

On win32 - GetSystemTimes UserTime and IdleTime are taken before forking the child-process and after the child-process exits. The percentage represents the proportion of TotalUserTime to UserTime+IdleTime (because that's like the percentage you'll see in Task Manager).

Because I know it will take more time than I choose to donate. Been there; done that.

4 or 5 years ago, someone complained that publishing their own measurements wouldn't show their favorite language implementation on a website highly ranked by search engines. By now - if they had actually made measurements, and published and promoted them - their website would be highly ranked. But they did nothing.

As far as I can tell, we all feel the same way about this: we all feel that we should sit on our hands and wait for someone else to do the chores we don't wish to do.

Measurements of proggit popular language implementations like Nim and Julia will attract attention and be the basis of yet another successful website (unlike more Fortran or Ada or Pascal or Lisp). So make those measurements, and publish them and promote them. If you're interested in something not shown on the benchmarks game website then please take the program source code and the measurement scripts and publish your own measurements.

The Python script "bencher does repeated measurements of program cpu time, elapsed time, resident memory usage, cpu load while a program is running, and summarizes those measurements" - download bencher and unzip into your ~ directory, check the requirements and recommendations, and read the license before use.

We are trying to show the performance of various programming language implementations - so we ask that contributed programs not only give the correct result, but also use the same algorithm to calculate that result.

We do show one contest where you can use different algorithms - meteor-contest.

In the second case (Warmed), we started the program once and repeated measurements again and again and again 66 times without restarting the JVM; and then discarded the first measurement leaving 65 data points.
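The same idea, sketched in Python instead of on the JVM: time a piece of work repeatedly inside one running process and drop the first timing. The function name `warmed_timings` and the `work` callable are hypothetical.

```python
import time


def warmed_timings(work, runs=66):
    """Time work() `runs` times in one process; discard the first timing."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        work()
        times.append(time.perf_counter() - start)
    return times[1:]   # 66 runs leave 65 data points
```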

N means the value passed to the program on the command-line (or the value used to create the data file passed to the program on stdin). Larger N causes the program to do more work - mostly measurements are shown for the largest N, the largest workload.

When the program was being measured: the first core was not-idle about 27% of the time, the second core was not-idle about 34% of the time, the third core was not-idle about 28% of the time, the fourth core was not-idle about 67% of the time.

When all the programs show ≈ CPU Load like this '0% 0% 0% 100%' you are probably looking at measurements of programs forced to use just one core - the fourth core (rather than being allowed to use any or all of the CPU cores).

Do design-iteration on your own computer, or in a language newsgroup. Only contribute programs which give correct results - diff the program output with the provided output file before you contribute the program.
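One way to do that check, sketched with Python's standard `difflib` (the command-line `diff` tool works just as well). The function name and the file paths passed to it are hypothetical.

```python
import difflib


def output_diff(actual_path, expected_path):
    """Return unified-diff lines; an empty list means the files match."""
    with open(actual_path) as a, open(expected_path) as e:
        actual = a.readlines()
        expected = e.readlines()
    return list(difflib.unified_diff(expected, actual,
                                     fromfile=expected_path,
                                     tofile=actual_path))
```

If the returned list is not empty, the program output differs from the provided output file and the program is not ready to contribute.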

Prefer plain vanilla programs - after all we're trying to compare language implementations not programmer effort and skill. We'd like your programs to be easily viewable - so please format your code to fit in less than 80 columns (we don't measure lines-of-code!).
