Sandbox Evaluation of Entries in
PhysioNet/Computing in Cardiology Challenges

This page describes how entries in PhysioNet/CinC Challenges are evaluated
and scored automatically. The method described below was developed to
support the 2014 Challenge. Entries for future challenges, and unofficial
late entries for previous challenges, will also be evaluated and scored using
this method.

This page also includes instructions for setting up a replica of the
Challenge test environment, which may be useful for debugging
Challenge entries.

Evaluation details

All of the processing needed to check, evaluate, and score challenge entries
is performed on dedicated 64-bit Linux servers, under control of the supervisor
script (evaluate). Each server runs several
virtual machines (VMs) using qemu and kvm hardware virtualization.

Newly uploaded entries are initially placed into a queue. The oldest entry in
the queue is loaded into an idle VM as soon as one is available, and stage 1
processing begins.

Stage 1 (prep): The entry is checked to
verify that it contains all of the components required by the
rules of the challenge; if so, its setup.sh script is run.
The evaluation ends if stage 1 fails for any of these
reasons:

the entry is unreadable or incomplete

setup.sh does not exit within five minutes

setup.sh fails (exits with non-zero status)

If stage 1 ends early, the diagnostic output of setup.sh (its standard
output and standard error output) is reported in the last case, or an
appropriate error message is reported otherwise. If setup.sh exits
successfully (with zero status), the entry is queued for stage 2 processing.
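The stage 1 checks above can be sketched roughly as follows. This is an illustration only, not the actual supervisor code; the required-file list, the stand-in entry, and the log file name are all assumptions made so the sketch is self-contained and runnable:

```shell
# Create a stand-in entry with trivial setup.sh and next.sh scripts so
# the sketch is self-contained (a real entry supplies its own).
mkdir -p entry
printf '#!/bin/sh\nexit 0\n' > entry/setup.sh
printf '#!/bin/sh\nexit 0\n' > entry/next.sh
cd entry

# Check for the required components; the real list is set by the
# challenge rules.
for f in setup.sh next.sh; do
    if [ ! -e "$f" ]; then
        echo "missing required file: $f"
        exit 1
    fi
done

# Run setup.sh under the five-minute limit, capturing its standard
# output and standard error for the diagnostic report.
timeout 300 sh setup.sh > setup.log 2>&1
status=$?
if [ "$status" -eq 124 ]; then
    echo "setup.sh did not exit within five minutes"
elif [ "$status" -ne 0 ]; then
    echo "setup.sh failed with status $status"
    cat setup.log
else
    echo "stage 1 passed; entry queued for stage 2"
fi
```

GNU timeout exits with status 124 when the time limit expires, which is how the five-minute case is distinguished from an ordinary failure here.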

Stage 2 (quiz): The training data set
is copied into the VM. The entry's next.sh script is run on
a randomly selected subset of the training records. The evaluation
ends if stage 2 fails for any of these reasons:

next.sh fails on any record

next.sh does not exit within the time limit
(10^11 CPU instructions for the 2015 Challenge)

next.sh's results do not match the expected results
(answers.txt) submitted with the entry

If stage 2 ends early, the diagnostic output of next.sh (its standard
output and standard error output) is reported in the first case, or an
appropriate error message is reported otherwise. If all of the training set
records are processed successfully, and all results match the expected results,
and the entry does not include a DRYRUN file (which forces a premature
exit after completion of stage 2), the entry is queued for stage 3 processing.

In order to be fair to all competitors, the limits on entry running
time are measured in CPU instructions rather than in seconds, since
the exact running time depends on many factors that are impractical
to control, such as hard disk access speeds. On a GNU/Linux system,
you can measure the number of instructions used by your program by
running:

    perf stat -e instructions:u ./next.sh a103l
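Putting stage 2 together, the checking loop might look roughly like the sketch below. The record names, the answers.txt format, and the use of a wall-clock timeout in place of the real instruction-count limit are all assumptions for illustration:

```shell
# Stand-in entry: next.sh prints a fixed answer for each record, and
# answers.txt holds the expected results submitted with the entry.
printf '#!/bin/sh\necho "$1,0"\n' > next.sh
printf 'a100,0\na101,0\n' > answers.txt

: > results.txt
# In the real stage 2 this iterates over a randomly selected subset of
# the training records, and the limit is 10^11 CPU instructions rather
# than a wall-clock timeout.
for rec in a100 a101; do
    if ! timeout 60 sh next.sh "$rec" >> results.txt; then
        echo "next.sh failed (or timed out) on $rec"
        exit 1
    fi
done

# The evaluation ends early if any result disagrees with answers.txt.
if cmp -s results.txt answers.txt; then
    echo "all results match; entry queued for stage 3"
else
    echo "results do not match the expected results"
fi
```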

Stage 3 (exam): The test data set is copied into the
VM. The entry's next.sh script is run once with each test record as
input. Unlike stage 2, however, errors do not cause premature termination of
stage 3, and next.sh's diagnostic output is not reported (to prevent
leakage of information about the test data). The numbers of failures and
timeouts, if any, are reported in lieu of detailed diagnostics. The results
are collected and transmitted from the VM to the dedicated host for stage 4
processing.
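A minimal sketch of the stage 3 loop, again with hypothetical record names and a wall-clock timeout standing in for the instruction-count limit:

```shell
# Stand-in entry script; a real entry's next.sh is used unchanged.
printf '#!/bin/sh\necho "$1,0"\n' > next.sh

failures=0
timeouts=0
: > results.txt
for rec in t100 t101 t102; do
    # Diagnostics are discarded: stage 3 never reports them, to avoid
    # leaking information about the test data.
    timeout 60 sh next.sh "$rec" >> results.txt 2>/dev/null
    status=$?
    if [ "$status" -eq 124 ]; then
        timeouts=$((timeouts + 1))
    elif [ "$status" -ne 0 ]; then
        failures=$((failures + 1))
    fi
done
# Only the counts are reported; the collected results go on to stage 4.
echo "failures=$failures timeouts=$timeouts"
```

Unlike stage 2, a failure or timeout here only increments a counter; the loop always runs to completion over the whole test set.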

Stage 4 (score): The collected results
are compared with the Challenge's reference results to determine
performance statistics and scores, which are reported to the user.
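As a toy illustration of stage 4 (the actual performance statistics and scoring formulas are challenge-specific, and the file names and record format here are made up):

```shell
# Hypothetical collected results and reference results, one record per line.
printf 'a100,1\na101,0\na102,1\n' > collected.txt
printf 'a100,1\na101,1\na102,1\n' > reference.txt

correct=0
total=0
# Compare the two files line by line and report a simple accuracy score;
# a real challenge would apply its own published scoring formula.
while IFS= read -r ref && IFS= read -r got <&3; do
    total=$((total + 1))
    if [ "$ref" = "$got" ]; then
        correct=$((correct + 1))
    fi
done < reference.txt 3< collected.txt
echo "score: $correct/$total"
```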

Replicating the Challenge test environment

You do not need to replicate the Challenge test hardware in order to
test your entry, but comparing its specifications with your own may
help you estimate your entry's running time. The dedicated Challenge
servers have two quad-core 2.6 GHz AMD Opteron 6212 CPUs and 32 GB of
RAM. The VMs are configured with a single-core amd64 CPU, 2 GB of
RAM, a 20 GB read-only root partition, a 2 GB read-write /home
partition, and a 500 MB read-write /tmp partition. A virtual CD-ROM
drive and serial port are used for transferring data to and from the
guest system. A virtual Ethernet interface is provided only when
running MATLAB entries, and only allows connections to the designated
MATLAB license server.
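For reference, a qemu-kvm invocation approximating that configuration might look like the sketch below; the exact flags and image names used on the Challenge servers are not published, so treat all of them as assumptions:

```shell
# Approximate VM configuration: single-core amd64 CPU, 2 GB RAM,
# root image run in snapshot mode (effectively read-only), separate
# /home and /tmp images, and a CD-ROM plus serial port for transferring
# data to and from the guest. Image and ISO file names are placeholders.
qemu-system-x86_64 \
    -enable-kvm \
    -smp 1 \
    -m 2048 \
    -drive file=root.img,snapshot=on \
    -drive file=home.img \
    -drive file=tmp.img \
    -cdrom entry.iso \
    -serial file:results.log
```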

You may install and use the test software environment on a spare computer if
you wish, or in a VM using whatever VM technology you prefer on your favorite
host OS. On the Challenge servers, we use qemu-kvm, hosted on Debian
GNU/Linux.