IUCr activities

Reasons for raw data archiving and reuse in chemical crystallography

Simon Coles and Amy Sarjeant, with a Foreword by John R. Helliwell

Foreword

Crystallographers have a long-standing tradition of linking the underpinning data to their publications. Chemical crystallography has led the way, harnessing new technologies for data storage, carefully defining its metadata via the world-renowned crystallographic information file (CIF) and developing an extensive checkCIF procedure for vetting these data. Acta Cryst. C has also led the way with its exemplary article submission procedure, which comprises the authors’ narrative, their underpinning coordinates and structure factors, and the checkCIF report. An editor and referees can thus undertake direct calculations and scrutinise the outcomes (alerts) highlighted in the standardised checkCIF process. Readers thereby enjoy articles with data vetted to the highest degree possible, and the databases can likewise harvest this perfect fruit.

In recent years there has been a major increase in digital storage capability, along with an expansion of generic data archives such as those provided by individual universities or administered centrally, such as the EU’s Zenodo. Naturally the question has now arisen as to the need to preserve raw, i.e. experimental, diffraction data sets across all of crystallography. That this is feasible and should be done was the top recommendation of the IUCr’s Diffraction Data Deposition Working Group (DDDWG) in its final report, presented to the IUCr in 2017 in Hyderabad. It would also keep the IUCr in line with the general exhortation for scientific data to be FAIR, i.e. Findable, Accessible, Interoperable and Reusable. The individual IUCr Commissions are now digesting the DDDWG Final Report, and the DDDWG has been incorporated into the IUCr’s new Committee on Data (CommDat). Two members of CommDat, Amy Sarjeant and Simon Coles, are now consulting the structural chemistry community via a questionnaire.

I encourage you to take part by answering the questionnaire.

Thank you.

John R. Helliwell
Chair of IUCr CommDat and Chair of the IUCr DDDWG (2011-2017).

It is now common to deposit structure factors when publishing journal articles, and this caters very well for routine small-molecule structures. However, this holds only if everything in a raw image is fully and properly accounted for and the model is correct or appropriate. In some cases raw data may no longer be required, but in others it may be needed to validate a result or to 'do better' in the future. Now is the time to explore raw data archival practice and to gather opinions on whether and how raw data should be used if it were made more widely available.

'Data' can generally be divided into raw data, processed data and derived data - in the crystallographic context these are diffraction images, structure factors and crystal structures, respectively. Some progress has recently been made: software now includes processed data (structure factors) in the CIF result, and validation processes and the Cambridge Structural Database (CSD) make use of and curate this data. But we can go further still - not only could raw data improve validation processes and provide valuable training sets for software developers to improve algorithms, etc., but there is a more interesting issue. A diffraction experiment records the average signal from the whole sample, which includes defects, impurities, etc., yet often only the data needed to obtain a perfect result is extracted. For materials engineering it can be crucial to understand these additional effects, yet it is never made public that they have been observed!

Raw data availability can therefore be very important; however, counter-arguments are often raised about the overhead of providing it. The diffraction experiment is relatively quick and cheap, so why not just do it again? The real cost of redoing a structure was assessed by the UK National Crystallography Service as part of the ‘Keeping Research Data Safe Project’. There are many nuances to such a costing, but once one factors in that the research expertise, group or laboratory that originally generated the material may no longer exist, or may no longer be set up (people, apparatus, etc.) to make such materials, the cost rapidly escalates. The replacement cost of the CSD is therefore almost unmeasurable!

Barriers to raw data archival include file size, file format interoperability, and a lack of perceived need. The macromolecular community has recognized the need for raw data archival and various workflows and deposition standards have arisen to meet this need. However, the small-molecule community lags behind.

Data transfer and storage problems are now being overcome, and for around 15 years there has been an ‘extension’ to CIF (imgCIF) that can cater for raw data, yet its uptake by the small-molecule community has been very slow indeed. So why aren’t we amassing more of our valuable raw data for the community to exploit widely? For the last five years a group known as the Diffraction Data Deposition Working Group has been looking into the issues surrounding this topic. As an outcome of this group’s activity, the IUCr has recently convened a Committee on Data, ‘CommDat’, as an advisory committee to the Executive.

It is generally assumed that small-molecule crystallographers do archive raw data, but not in the ‘best’ way, i.e. easily searchable and in a ‘structured’ environment, and that this community does not think about making raw data visible beyond its own use. We are therefore looking to find out the following:

The extent to which small-molecule crystallographers archive raw data (in a sharable way?) and what stands in their way.

How much 'educating' of crystallographers is required to illustrate the benefits of archiving (both for oneself and for others).

How could raw data archives be used in validation, e.g. would it be more justifiable to publish a ‘poorer’ result if raw data were made available?

What are the driver(s) for the community in terms of using the contents of a raw data archive?

We have created a survey to canvass the small-molecule community to determine the answers to these questions. Whether you currently actively archive your raw data or not, we encourage you to take part in order to help to better define the problems and barriers to this important endeavour. The survey explores the following two themes:

2. Different people in different roles might have different drivers/reasons for archiving and revisiting raw data. We suggest the following as some food for thought and want to canvass views around these and any other related matters:

Validation: a result provides a contribution to chemical knowledge, but is poor quality

Validation: to support a 'grand' claim

To back up modelling of disorder, twinning, incommensurate, modulated structures

To back up modelling of diffuse scattering

To make available e.g. disorder, twinning, incommensurate, modulated, diffuse scattering datasets so others can attempt to resolve them
