A new paper on timing errors in experimental software

A new paper just out in PLoS One (thanks to Neuroskeptic for pointing it out on Twitter) shows the results of some tests conducted on three common pieces of software used in psychology and neuroscience research: DMDX, E-Prime, and PsychoPy. The paper is open-access, and you can read it here. The aim was to test the timing accuracy of the software when presenting simple visual stimuli (alternating black and white screens). As I’ve written about before, accurate timing in experiments can often be of great importance and is by no means guaranteed, so it’s good to see some objective work on this, conducted using modern hardware and software.

The authors followed a pretty standard procedure for this kind of thing, which is to use a piece of external recording equipment (in this case the Black Box Tool Kit) connected to a photo-diode. The photodiode is placed on the screen of the computer being tested and detects every black-to-white or white-to-black transition, and the data is logged by the BBTK device. This provides objective information about when exactly the visual stimulus was displayed on the screen (as opposed to when the software running the display thinks it was displayed on the screen, which is not necessarily the same thing). In this paper the authors tested this flickering black/white stimulus at various speeds from 1000ms, down to 16.6667ms (one screen refresh or ‘tick’ on a standard 60Hz monitor).

It’s a nice little paper and I would urge anyone interested to read it in full; the introduction has a really nice review of some of the issues involved, particularly about different display hardware. At first glance the results are a little disappointing though, particularly if (like me) you’re a fan of PsychoPy. All three bits of software were highly accurate at the slower black/white switching times, but as the stimulus got faster (and was therefore more demanding on the hardware) errors began to creep in. At times less than 100ms, errors (in the form of dropped, or added frames) begin to creep in to E-Prime and PsychoPy; DMDX on the other hand just keeps on truckin’, and is highly accurate even at the fastest-switch conditions. PsychoPy is particularly poor under the most demanding conditions, with only about a third of ‘trials’ being presented ‘correctly’ under the fastest condition.

Why does this happen? The authors suggest that DMDX is so accurate because it uses Microsoft’s DirectX graphics libraries, which are highly optimised for accurate performance on Windows. Likewise, E-Prime uses other features of Windows to optimise its timing. PsychoPy on the other hand is platform-agnostic (it will run natively on Windows, Mac OS X, and various flavours of Linux) and therefore uses a fairly high-level language (Python). In simple terms, PsychoPy can’t get quite as close to the hardware as the others, because it’s designed to work on any operating system; there are more layers of software abstraction between a PsychoPy script and the hardware.

Is this a problem? Yes, and no. Because of the way that the PsychoPy ‘coder’ interface is designed, advanced users who require highly accurate timing have the opportunity to optimise their code, based on the hardware that they happen to be using. There’s no reason why a Python script couldn’t take advantage of the timing features in Windows that make E-Prime accurate too – they’re just not included in PsychoPy by default, because it’s designed to work on unix-based systems too. For most applications, dropping/adding a couple of frames in a 100ms stimulus presentation is nothing in particular to worry about anyway; certainly not for the applications I mostly use it for (e.g. fMRI experiments where, of course, timing is important in many ways, but the variability in the haemodynamic response function tends to render a lot of experimental precision somewhat moot). The authors of this paper agree, and conclude that all three systems are suitable for the majority of experimental paradigms used in cognitive research. For me, the benefits of PsychoPy (cross-platform, free licensing, user-friendly interface) far outweigh the (potential) compromise in accuracy. I haven’t noticed any timing issues with PsychoPy under general usage, but I’ve never had a need to push it as hard as these authors did for their testing purposes. It’s worth noting that all the testing in this paper was done using a single hardware platform; possibly other hardware would give very different results.

Those who are into doing very accurate experiments with very short display times (e.g. research on sub-conscious priming, or visual psychophysics) tend to use pretty specialised and highly-optimised hardware and software, anyway. If I ever had a need for such accuracy, I’d definitely undertake some extensive testing of the kind that these authors performed, no matter what hardware/software I ended up using. As always, the really important thing is to be aware of the potential issues with your experimental set-up, and do the required testing before collecting your data. Never take anything for granted; careful testing, piloting, examination of log files etc. is a potential life-saver.

Actually, I find the results doubtful. I don’t see anything like this on my machines in testing. It isn’t clear whether this is caused by incorrect creation of the test scripts (they haven’t yet been shared) or by the fact that the version they were using is several years out of date.

But, suffice to say, I’m pretty certain that the timing of frame presentations is a *lot* better than they claim in this study.

Thanks for the extra info Jon – I was pretty surprised by these results too, as it doesn’t really jive with the (admittedly, informal) testing I’ve done with PsychoPy. Hopefully if the authors make their scripts available we can get to the bottom of why they saw such poor performance.