Our paper on defect density analysis of climate models is now out for review at the journal Geoscientific Model Development (GMD). GMD is an open review / open access journal, which means the review process is publicly available (anyone can see the submitted paper, the reviews it receives during the process, and the authors’ response). If the paper is eventually accepted, the final version will also be freely available.

The way this works at GMD is that the paper is first published to Geoscientific Model Development Discussions (GMDD) as an un-reviewed manuscript. The interactive discussion is then open for a fixed period (in this case, 2 months). At that point the editors will make a final accept/reject decision, and, if accepted, the paper is then published to GMD itself. During the interactive discussion period, anyone can post comments on the paper, although in practice, discussion papers often only get comments from the expert reviewers commissioned by the editors.

One of the things I enjoy about the peer-review process is that a good, careful review can help improve the final paper immensely. As I've never before submitted to a journal that uses an open review process, I'm curious to see how open review will affect things – I suspect (and hope!) it will tend to make reviewers more constructive.

Anyway, here’s the paper. As it’s open review, anyone can read it and make comments (click the title to get to the review site):

J. Pipitone and S. Easterbrook
Department of Computer Science, University of Toronto, Canada

Abstract. A climate model is an executable theory of the climate; the model encapsulates climatological theories in software so that they can be simulated and their implications investigated. Thus, in order to trust a climate model one must trust that the software it is built from is built correctly. Our study explores the nature of software quality in the context of climate modelling. We performed an analysis of defect reports and defect fixes in several versions of leading global climate models by collecting defect data from bug tracking systems and version control repository comments. We found that the climate models all have very low defect densities compared to well-known, similarly sized open-source projects. We discuss the implications of our findings for the assessment of climate model software trustworthiness.


2 Comments

Not really surprised that the defect rates are so much lower in climate models than in the other projects. I'd be more interested to see a comparison with other sorts of large computational science software, particularly software that simulates complex real phenomena (e.g. molecular dynamics, astrophysical "hydro-codes", etc).

I think there are three distinct factors at work here which distinguish climate models from the other three systems in your study, and which all contribute to a lower “defect rate” for the climate models. I know you are aware of all these; I wonder what ideas you have for studies which can deconfound or tease out the different effects.

The first factor is that the context in which the software is used is quite different: the users are different, the use cases are different, and the work practices are different. For all these reasons, the criteria for a defect might be different. A few examples should clarify this. First, what were the internationalization requirements on the climate models you studied, and would a failure for some component to be readily localisable be considered a defect? Secondly, did the climate models include components to simplify installation or upgrading to a later release, or for testing compatibility with other systems – such as different compilers or third-party libraries – and would an installation, upgrade, or compatibility failure be considered a defect?

The second factor is that the computational tasks the software must perform are quite different, which may cause quite different classes of defect to arise at different rates. For instance, in a UI-intensive body of code such as Eclipse, many defects depend on the timing of particular interactions (essentially due to poor invariant maintenance, which is a bane of modern software development). Perhaps clicking some button on a toolbar when the class browser is active means that the keyboard focus is lost. Clearly a large computational code will not suffer from so many defects of this kind (and defects of this kind are likely to be rapidly fatal, and therefore soon fixed). A more general example of this factor might be that the climate models may be divided into fewer comparatively larger units (methods/functions/etc), for either domain or process reasons, and therefore be less prone to unit interface mismatch defects. Another example might concern typical object sizes, lifetimes, and scopes.

The third factor is that validation of the software is far harder for climate models, and that therefore validation defects are much harder to detect, or rather are harder to triage. Simulations always have these defects (“all models are wrong…”), and in some sense they are easy to detect (e.g. it is well known that some models have fewer extreme precipitation events than the real world), but for this exact reason many of them may not be considered to be ‘defects’ in the sense in which you use the word, because they are not thought to impair the accuracy of the model in other respects (“….but some are useful”). This, of course, is the angle of attack of anti-science critics: the meat of the line “it’s just a model”. We do know quite a lot about validation here, but I suspect that many validation defects simply won’t be counted by your analysis.

For the other systems you study, validation defects are much easier to triage, and many of them will certainly enter the bug databases and version control systems which you analyse.

Choice of language, etc., may form a fourth factor, as it certainly has some effect on the "lines of code" metrics which form the denominators of defect density calculations.
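To make the point about denominators concrete, here is a toy sketch (all numbers invented, not taken from the paper) of how defects per KLOC shifts with language verbosity even when the recorded defect count is held fixed:

```python
# Hypothetical illustration: defect density is usually reported as
# defects per thousand lines of code (KLOC), so the LOC denominator
# matters as much as the defect count. All figures below are made up.

def defect_density(defects, loc):
    """Defects per thousand lines of code (KLOC)."""
    return defects / (loc / 1000)

# The same notional functionality expressed in two languages can have
# very different line counts, shifting the density even though the
# number of recorded defects is identical.
terse_loc   = 100_000   # assumption: compact numerical code
verbose_loc = 250_000   # assumption: a more verbose equivalent
defects = 50

print(f"terse codebase:   {defect_density(defects, terse_loc):.2f} defects/KLOC")
print(f"verbose codebase: {defect_density(defects, verbose_loc):.2f} defects/KLOC")
```

The same 50 defects yield 0.50 defects/KLOC in the terse codebase but only 0.20 in the verbose one, which is why cross-language comparisons of defect density need care.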

I guess that a follow-up study using either code inspection or defect categorization could shine some more light on these questions.

@Nick Barnes : Thanks – all excellent thoughts. WRT your third point, these kinds of errors show up as "model improvements" eventually, but would not get documented as defects, as they're typically the things the modellers themselves understand as "acceptable imperfections". In that sense we could treat *all* code changes as defects, on the basis that pretty much every change is either a bug fix or an improvement to the fidelity of the model. But then we'd have to argue that every user enhancement in commercial software is also a defect – it's a place where the developers failed to understand or anticipate some user need. Actually, what I think we need to do is end the fiction that there's a meaningful distinction between "defects" and "enhancements" for any type of software. Hmmm. Time for some re-conceptualization of what it means to build software for a given purpose….

Anyway, I’ve been thinking for a while that a defect categorization might be a useful next step. We’d need this done before we could do any useful kinds of code inspection.