Designing Databases for Historical Research

F. Problems facing the historian

F3. Standardisation, classification and coding

The principal way of accommodating data containing these
kinds of problems is to build (often quite liberally) a
standardisation layer into the design of the database (see
Section C4) through the use of standardisation,
classification and coding. These three activities are a step
removed from simply gathering and entering information derived
from the sources: this is where we add (or possibly
change) information in order to make retrieving information and
performing analysis easier. We use these techniques to overcome
the problem of information that means the same thing appearing
differently in the database, which prevents the database from
connecting like with like (the fundamental pre-requisite for
analysing data). For historians this is a more important step
than for other kinds of database users, because the variety of
forms and ambiguity of meaning of our sources do not sit well
with the exactitude required by the database (as with the example
of trying to find all of our records about John Smith,
Section F1), so that more of a standardisation layer needs to
be implemented.

Standardisation, classification and coding are three distinct
techniques which overlap, and most databases will use a
combination of the three when adding a standardisation layer into
the design:

Standardisation

This is the process of deciding upon one way of representing a
piece of information that appears in the source in a number of
different ways (e.g. one way of spelling place/personal names;
one way of recording dates and so on) and then entering that
standardised version into the table. Consider using
standardisation when dealing with values that appear slightly
different but mean the same thing: the database would treat
‘Ag Lab’ and ‘Agricultural Labour’ as two unrelated values, so
if you want them to be considered the same thing, you signal
this to the database by giving each record with a variant of
this occupation the same standardised value.
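
As a minimal sketch of how this might work in practice, the Python
snippet below maps variant forms onto a single standardised value.
The list of variants, the standardised form chosen and the function
name standardise_occupation are illustrative assumptions, not taken
from any particular project.

```python
# Hypothetical look-up of source variants -> one standardised value.
# The variants and the chosen standard form are illustrative assumptions.
OCCUPATION_STANDARD = {
    "ag lab": "Agricultural Labour",
    "ag. lab.": "Agricultural Labour",
    "agricultural labour": "Agricultural Labour",
}


def standardise_occupation(source_value):
    """Return the standardised form, or the tidied source value if unknown."""
    cleaned = " ".join(source_value.split())  # trim and collapse whitespace
    return OCCUPATION_STANDARD.get(cleaned.lower(), cleaned)


records = ["Ag Lab", "Agricultural Labour", "ag.  lab.", "Weaver"]
standardised = [standardise_occupation(r) for r in records]
```

Values that are not in the look-up (here ‘Weaver’) pass through
unchanged, so new variants can be spotted and added to the list as
they turn up during data entry.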

Classification

This is the process of grouping together information (‘strings’)
according to some theoretical, empirical or entirely arbitrary
scheme, often using a hierarchical system in order to improve
analytical potential. Classification is about allocating groups,
and then placing your data in those groups. These groups can be
hierarchical, and the hierarchy will let you perform your
analysis at a variety of levels. Classification is less about
capturing the information in your sources and is much more about
serving your research needs.

When using a classification system it is very important to
remember two things: firstly, since it is an arbitrary component
of your database’s standardisation layer designed to improve your
research analysis, the system has to be able to meet the
requirements you have for it. Secondly, therefore, the system
needs to have been devised before data entry begins, it needs to
be intellectually convincing (at least as far as your historical
methodologies are concerned) and it needs to be applied
consistently within your data.

It is also worth being aware of how other historians have
classified their information. There have been many classification
systems created by the good and the great of the historical
profession,[1] many of which have been used
subsequently by others for two reasons: they allow comparability
between the findings of different projects; and they
allow historians to turn different sources into continuous series
of information. That is, two projects investigating the same
thing at different periods may have to rely on different sources:
by classifying their (probably slightly different) information
into similar classification systems, a case can be made
(convincingly or otherwise) that the research is comparable. This
is not to say that you should necessarily try to adopt an
existing scheme rather than develop one that suits your research
better, but it is worth keeping in mind if you are interested in
comparing your analysis with that of another historian. In
addition, given that classification systems in practice really
only entail adding an extra field in a table into which the
classified value is added, there is nothing stopping you (other
than perhaps time) from employing more than one classification
system for the same information in the database.

A detailed example of a classification system can be seen in an
ongoing project which is investigating the material aspects of
early modern households, and which uses a database to record
minutely detailed information about material objects. One of the
many ways it treats the information about objects is to classify
objects by type, in order to be able to compare like objects
despite the often substantial differences in the ways they are
referred to in the sources. This works by adding a field,
ItemClass, to the table where item type data is recorded, into
which a code value can be added:

F3i – Data about material objects that have been classified and
coded

The ItemClass field here is populated with codes, and these codes
record precisely what type of item the record is about (you can
see what the source calls the item in the ItemDescr
field).[2] The fact that the code is a numeric
value, and the fact that the same numeric code is applied to the
same type of object regardless of how it is described in the
source, means that the ItemClass field acts as a standardised
value.

Additionally, however, the ItemClass field enables the use of a
hierarchical classification system (to examine a partial sample
of the classification system, download the Microsoft Excel file
Material Object Type Classification.xls). The hierarchy
operates by describing objects at three increasingly detailed
levels (Code I, Code II and Code III).

To illustrate this we can take the example of how the database
classifies objects that are used for seating:

F3ii – Classification system for objects in the category of ‘Seating’

You will notice from the Microsoft Excel spreadsheet that each
code level has a two- or three-digit numeric code, so Code I:
Seating has the numeric code 05, that for Code II: Chair is 02,
and that for Code III: Wicker Chair is 006. These individual
codes are joined into a single numeric code (in the case of the
wicker chair, 0502006), which is the value that gets entered
into the single relevant field (ItemClass) in the record for the
wicker chair in the database.
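
The way the three level codes combine into one value can be sketched
in a few lines of Python. The function names build_item_class and
split_item_class are hypothetical; the digit widths (two, two and
three) and the wicker chair codes follow the example above.

```python
def build_item_class(code_i, code_ii, code_iii):
    """Join the three level codes into the single seven-digit ItemClass value."""
    # Widths follow the example in the text: two, two and three digits,
    # padded with leading zeros.
    return f"{code_i:02d}{code_ii:02d}{code_iii:03d}"


def split_item_class(item_class):
    """Split a seven-character ItemClass value back into its three level codes."""
    return item_class[:2], item_class[2:4], item_class[4:]


# Code I: Seating = 05, Code II: Chair = 02, Code III: Wicker Chair = 006
wicker_chair = build_item_class(5, 2, 6)
levels = split_item_class(wicker_chair)
```

Note that the combined code is handled here as text rather than as a
number: stored as a number, the leading zero of 0502006 would be lost
and the levels could no longer be read off by position.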

This may sound complicated and slow to implement, but the benefit
of doing so is considerable. Firstly, the database can be created
so that the codes can be automatically selected rather than
memorised by the database creator, so that they do not have to
stop to remember or look up what code needs to be entered for any
given object. Secondly, and here is the principal reason for
employing a hierarchical system, once the data have been coded,
they can be analysed at three different semantic levels. The
historian could, if they wished, analyse all instances of wicker
chairs in the database by running queries on all records which
had the ItemClass value “0502006”. Alternatively, if they were
interested in analysing the properties of all the chairs in the
database, they could do so by running queries on all records with
an ItemClass value that begins “0502***”. Lastly, if the point of
the research was to look at all objects used for seating, a query
could be designed to retrieve all records with an ItemClass value
that began “05*****”. This is an incredibly powerful analytical
tool, and one that would be impossible to achieve without the use
of a hierarchical classification system: to run a query to find
all objects used for seating without a classification system
would require looking for each qualifying object that the
historian can anticipate or remember, by name and taking into
account the variant spellings that might apply.[3]
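
The three levels of query described above might look something like
the following sketch, which uses Python's built-in sqlite3 module.
The table name Item, the sample records and every code other than
0502006 are invented for illustration. In SQLite the wildcard is %,
which plays the role of the asterisks in the patterns quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Item (ItemDescr TEXT, ItemClass TEXT)")
conn.executemany(
    "INSERT INTO Item VALUES (?, ?)",
    [
        ("wicker chair", "0502006"),      # code taken from the text
        ("chaire of wicker", "0502006"),  # variant spelling, same code
        ("a chair", "0502001"),           # hypothetical code
        ("joined stool", "0501001"),      # hypothetical code
        ("fether bed", "0601001"),        # hypothetical code
    ],
)


def count_items(pattern):
    """Count the records whose ItemClass matches the given LIKE pattern."""
    row = conn.execute(
        "SELECT COUNT(*) FROM Item WHERE ItemClass LIKE ?", (pattern,)
    ).fetchone()
    return row[0]


wicker_chairs = count_items("0502006")  # Code III level: wicker chairs only
chairs = count_items("0502%")           # Code II level: all chairs
seating = count_items("05%")            # Code I level: all seating
```

The same ItemClass field answers all three questions; only the
pattern changes, and none of the queries depends on how the item is
spelled in ItemDescr.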

Hierarchical classification systems are very flexible things as
well. They can include as many levels as you require to analyse
your data, and they do not need to employ numeric codes when
simple standardised text would be easier to implement.[4]

Coding

Coding is the process of substituting (not necessarily literally)
one value for another, for the purpose of recording a complex and
variable piece of information through a short and consistent
value. Coding is often closely associated with classification,
and in addition to saving time in data entry (it is much quicker
to type a short code word than it is to type five or six words)
codes additionally act as standardisation (that is, the same form
[code] is entered for the same information no matter how the
latter appears in the source).

These techniques are implemented to make the data more readily
useable by the database: the codes, classifications and
standardised forms which are used are simple and often easier to
incorporate into a query design than the complicated and
incomplete original text strings that appear in the source; but
more importantly, they are consistent, making them much
easier to find. However, there are a number of things to bear in
mind when using them, the most important of which is that there
are two ways of applying these techniques:

By replacing original values in the table with
standardised/coded/classified forms

By adding standardised/coded/classified forms into the
table alongside the original values

Both of these approaches present a trade-off between maintaining
the integrity of the source and improving the efficiency of the
potential analysis, in much the same way as the choices offered
as part of the design process when selecting the Source- or
Method-oriented approach to the database (see
Section C3). The first approach to standardising, replacing
the original version of source information in any chosen field(s)
with standardised forms of data, speeds up data
entry at the expense of losing what the source says. It also
serves as a type of quality control, as entering standardised
data (especially if controlled with a ‘look-up list’) is less
prone to data entry errors than the original forms that appear in
the source.

The second approach, to enter standardised values in addition to
the original forms, allows for the best of both worlds: you
achieve the accuracy and efficiency benefits of standardisation
without losing the information as it is presented in the source.
Of course, this happens at the cost of extra data inputting time,
as you enter material twice.
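
A minimal sketch of this second approach, again using Python's
sqlite3 module, is given below. The table name Person, the field
names OccupationSource and OccupationStd, and the sample records are
all assumptions made for the purpose of illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Person ("
    "Surname TEXT, OccupationSource TEXT, OccupationStd TEXT)"
)
conn.executemany(
    "INSERT INTO Person VALUES (?, ?, ?)",
    [
        # original form from the source, then the standardised value
        ("Smith", "Ag Lab", "Agricultural Labour"),
        ("Smyth", "Agricultural Labour", "Agricultural Labour"),
        ("Jones", "ag. lab.", "Agricultural Labour"),
        ("Brown", "Weaver", "Weaver"),
    ],
)

# Analysis runs on the standardised field, so like is counted with like.
counts = dict(
    conn.execute(
        "SELECT OccupationStd, COUNT(*) FROM Person GROUP BY OccupationStd"
    )
)

# The wording of the source remains available alongside it.
originals = [
    row[0]
    for row in conn.execute(
        "SELECT OccupationSource FROM Person "
        "WHERE OccupationStd = ? ORDER BY Surname",
        ("Agricultural Labour",),
    )
]
```

Under the first approach only the OccupationStd field would exist,
and the three different source forms retrieved in the last query
would not be recoverable from the database.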

When considering both approaches, bear in mind that you will only
need to standardise some of the fields in your tables, not every
field in every table. The candidates for standardising,
classifying and coding are those fields that are likely to be
heavily used in your record-linkage or querying, where being able
to identify like with like values is important. Creators of
databases built around the Source-oriented principle should
exercise particular caution when employing these techniques.

[1] See for example that developed for
household types in The population history of England,
1541-1871: a reconstruction (1981) by E.A. Wrigley and
R.S. Schofield; or the ongoing HISCO project to develop an
international classification system for occupations,
available at http://historyofwork.iisg.nl/
(accessed 23/03/2011).

[2] Note in passing that many of the other
fields in this example contain codes as well – this table
contributes substantially to the database’s Standardisation
layer.

[3] It would, for example, need to look for
all stools, buffet stools, wicker chairs, forms, settles,
benches etc., leading to extremely complicated queries with
possibly more criteria than the database can handle. For
criteria in queries please sign up to one of our
face-to-face Database courses.

[4] Indeed, numeric codes are somewhat
old-fashioned in modern database usage, although they are no less
efficient for being outmoded.