Quality control

The Archive is at the forefront of developing international standards for data processing, for both quantitative and qualitative data.

We use different levels of quality control depending on how much ‘additional value’ is to be added to the data.

We assign one of four levels of data processing (A*, A, B or C) to each incoming study, depending on its anticipated future usage. Processing activities are then carried out in accordance with the assigned level, as described below.

Quantitative data

Dataset dimension checks

Level A*: the number of cases and variables are checked against the documentation

Level A: as for A*

Level B: as for A*
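
As an illustration only, a dimension check of this kind could be sketched as follows. The function names and the assumption that the data arrive as tab-delimited text with a single header row are hypothetical and are not taken from the Archive's own tools.

```python
import csv
import io


def dataset_dimensions(tab_text):
    """Return (cases, variables) for tab-delimited text with one header row."""
    rows = list(csv.reader(io.StringIO(tab_text), delimiter="\t"))
    if not rows:
        return 0, 0
    header, cases = rows[0], rows[1:]
    return len(cases), len(header)


def check_dimensions(tab_text, documented_cases, documented_variables):
    """Compare the counts in the file with those stated in the documentation.

    Returns a list of problem descriptions; an empty list means the counts agree.
    """
    cases, variables = dataset_dimensions(tab_text)
    problems = []
    if cases != documented_cases:
        problems.append(f"cases: found {cases}, documented {documented_cases}")
    if variables != documented_variables:
        problems.append(
            f"variables: found {variables}, documented {documented_variables}"
        )
    return problems


sample = "id\tage\tsex\n1\t34\t1\n2\t51\t2\n"
print(check_dimensions(sample, documented_cases=2, documented_variables=3))
```

An empty list indicates that the file matches the documentation; any mismatch is reported so that it can be followed up with the depositor.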

Metadata checks

Level A*: the dataset must be comprehensible in itself, i.e. all variables should have variable labels and all categorical variables should have value labels

Level A: the dataset must be comprehensible in association with the documentation given to users; variable and value labels need not be present in the data file as long as they can be found in the documentation

Level B: visual checks on quality are undertaken; action is taken for systematic problems

Data validity checks

Level A*: all categorical variables checked for out-of-range values/wild codes; where possible, interval variables checked for improbable or impossible values

Level A: as for A*

Level B: visual checks on quality are undertaken; action is taken for systematic problems
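
The two validity checks named above can be sketched in a few lines. This is a minimal illustration, assuming the valid codes and plausible limits for each variable are known from the documentation; the function names and example values are hypothetical.

```python
def find_wild_codes(values, valid_codes):
    """Return the distinct codes in a categorical variable that are not defined."""
    return sorted(set(values) - set(valid_codes))


def find_out_of_range(values, minimum, maximum):
    """Return interval values outside the plausible range, skipping missing ones."""
    return [v for v in values if v is not None and not (minimum <= v <= maximum)]


# A categorical variable documented with codes 1, 2 and 9 (e.g. 9 = not answered)
print(find_wild_codes([1, 2, 2, 9, 7, 1], valid_codes={1, 2, 9}))

# An age variable, assuming 0 to 120 is the plausible range
print(find_out_of_range([34, 51, None, 212, -3], minimum=0, maximum=120))
```

What counts as improbable or impossible is a judgement made per variable, so the limits are passed in rather than fixed in the code.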

Confidentiality checks

Level A*, A and B: always undertaken

Metadata enhancements

Level A*: for online browsing, the following may be added: literal question text, routing information and interviewers' instructions, frequencies and summary statistics, variable groups; bookmarked PDF user guides are produced; additional related resources may be provided on a dedicated web page; additional notes to users are given in the 'Read file'

Level A: bookmarked PDF user guides are produced; additional related resources may be provided on a dedicated web page; additional notes to users are given in the 'Read file'

Level B: bookmarked PDF user guides are produced; additional notes to users are given in the 'Read file'

ReShare

Level B: sample of 30 + 10 per cent of the remaining categorical variables must be checked for out-of-range values/wild codes; sample of 30 + 10 per cent of the remaining suitable interval variables must be checked for improbable or impossible values
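
The sampling rule can be worked through as simple arithmetic. The sketch below reads "30 + 10 per cent of the remaining" as 30 variables plus 10 per cent of however many are left; rounding the 10 per cent up to a whole variable, and checking every variable when there are 30 or fewer, are assumptions not stated in the rule itself.

```python
def sample_size(n_variables, base=30, percent=10):
    """Number of variables to check under a 'base + percent of the remainder' rule.

    Assumptions: all variables are checked when there are no more than `base`,
    and the percentage of the remainder is rounded up to a whole variable.
    """
    if n_variables <= base:
        return n_variables
    remaining = n_variables - base
    # Integer ceiling division avoids floating-point rounding surprises
    extra = (remaining * percent + 99) // 100
    return base + extra


for total in (20, 30, 45, 130):
    print(total, "variables ->", sample_size(total), "to check")
```

For example, a dataset with 130 categorical variables would have 30 + 10 = 40 of them checked under this reading.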

Qualitative data

The same processing levels apply to qualitative data. Most qualitative data collections are processed to A standard, with a select few nominated for enhancement to A*. B and C are seldom used, but apply when handling older paper-based studies.

Level A*

data are fully digitised and anonymised

metadata and documentation are fully digitised and anonymised

for online browsing, data are marked up in XML

enhanced user guide is prepared for QualiBank

Level A

the dataset must be comprehensible in association with the documentation given to users

data are fully digitised and anonymised

metadata and documentation are fully digitised and anonymised

Level B

data are digitised at least to the level of scanned images and anonymised

metadata and documentation are digitised at least to the level of scanned images and anonymised

only major problems with data are resolved

Level C

no checks are made

data remain in the format in which they were received

non-digital collections are not anonymised or digitised and are transferred to another repository

only a basic catalogue record is created

For level C studies, a minimum of dataset dimension checks and confidentiality checks is carried out, with metadata enhancements as for B studies.

Format translation checks

Checks are carried out when converting from:

the ingested format to our preservation format (tagged or delimited text of defined character set)

the preservation format to the dissemination formats (Stata and tab-delimited text; or MS Excel, MS Access, SIR and SAS)

We use in-house programs to automate most data format conversions at all levels of processing. These ensure that no data or 'internal metadata' (variable and value labels, missing value definitions, variable format information, etc.) are lost, beyond any loss that is unavoidable because of the differing data handling limits of specific software formats.
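
To illustrate what checking for lost internal metadata might involve, the sketch below compares a description of each variable before and after conversion. It is not the Archive's in-house software: the dictionary layout, field names and function name are all hypothetical.

```python
def lost_metadata(before, after):
    """Return (variable, field) pairs that are missing or changed after conversion.

    `before` and `after` map each variable name to a dict of internal metadata,
    e.g. {"label": ..., "value_labels": {...}, "missing": [...]}.
    """
    lost = []
    for variable, fields in before.items():
        converted = after.get(variable, {})
        for field, value in fields.items():
            if converted.get(field) != value:
                lost.append((variable, field))
    return lost


before = {
    "sex": {
        "label": "Sex of respondent",
        "value_labels": {1: "Male", 2: "Female"},
        "missing": [-9],
    }
}
# Hypothetical result of a conversion that dropped the missing value definition
after = {
    "sex": {
        "label": "Sex of respondent",
        "value_labels": {1: "Male", 2: "Female"},
    }
}
print(lost_metadata(before, after))
```

A real comparison would also need to allow for known limits of the target format, such as maximum label lengths, so that expected truncation is not reported as a loss.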

The following checks are currently performed manually, but will be replaced by automated checking using the QAMyData tool.

Numbers of rows and cases the same

Level A* and A: Relevant checks made, problems corrected

Level B: Relevant checks made, problems corrected

Level C: Format conversion is not usually undertaken for C standard datasets. C standard is rare, but one reason for assigning it is that the data file cannot be converted from its original format, so normal processing cannot be undertaken. Where conversion is undertaken, relevant checks must be made and problems corrected.

Data download validation

For data available via the UK Data Service download system, the names of the zip files include an MD5 checksum. This 32-character string can be used to verify that the file we make available is identical to the one the user downloads.
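
A user could carry out this verification along the following lines. This is a sketch using Python's standard hashlib module; exactly where the checksum sits within the zip file name is not described here, so the code simply assumes it is the only standalone run of 32 hexadecimal characters in the name. The function names are hypothetical.

```python
import hashlib
import os
import re


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_from_name(filename):
    """Extract a standalone 32-character hexadecimal string from a file name.

    Assumption: the name contains exactly one such string, the MD5 checksum.
    """
    match = re.search(r"(?<![0-9a-fA-F])[0-9a-fA-F]{32}(?![0-9a-fA-F])", filename)
    return match.group(0).lower() if match else None


def download_is_intact(path):
    """Return True if the file's MD5 checksum matches the one in its name."""
    expected = checksum_from_name(os.path.basename(path))
    return expected is not None and md5_of_file(path) == expected
```

Calling download_is_intact with the path of a downloaded zip file returns False if the file was altered or truncated in transit, or if no checksum can be found in its name.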