A. Introduction

The purpose of this Historical Research Handbook is to
provide an introduction to designing databases for use in
historical research. It will provide an overview of important
concepts – both historical in nature and in terms of databases –
that the historian will need to consider before embarking upon
designing a database, and it will provide a number of starting
points for overcoming certain design problems that specifically
affect historians when they come to wrestle their sources into a
database. This Handbook does not cover the actual
construction of databases for historical research in any
practical manner.

A1. Database design concepts

After working through this Handbook it is hoped that you
will have a good understanding of the complex relationship
between historical sources, information and data, and will be
aware of the translation processes that are required when moving
from one to the others. The informational contents of historical
sources need to be converted – often in multiple ways – before
they can be used as data, and a number of methodological
decisions will need to be taken as this is done. Unlike the more
mechanical aspects of using databases in historical research,
such as building tables, linking records or running aggregate
analyses, this translation process is not only difficult to learn
other than through experience, it is also likely to be a
substantially different process for every historian doing it.
Each historian has different materials, different projects and
different research aims, and so the databases they build will (or
should) address these in the way that best fits their specific
purpose. This ‘modelling’ of historical data is a difficult
process, but happily the difficulties that arise are those that
by and large are faced by historians in their everyday,
non-database, work, meaning that you will be well equipped to
deal with them. The modelling of data is also possibly the most
interesting and enjoyable aspect of using databases in historical
research, although perhaps this is only a relative response to
the long hours of data entry that follow the initial design of
the database!

This Handbook does not require the use of database
software, although it will show the occasional screenshot of a
database for the purposes of illustration. Instead it will spend
most of its sections on discussing sources and research
questions, and how these need to be recast when interacting with
a database with its strictures and rules. The exercises that are
offered will not have right or wrong ‘answers’, just as there is
no right or wrong way to design a database of historical sources.
Or rather, it may be more accurate to suggest that while there is
no single right way to design a database, there are a number of
(if not exactly wrong) unhelpful ways to design one, and this
Handbook will focus as much on the latter as on the
former.

B. Sources, information and data

B1. Introduction

In this section we are going to address some of the issues that
historians face when it comes to thinking about building and
using a database for their research. Quite what ‘using a database
for their research’ actually means is a subject that we will
return to in
Section C of this Handbook, as it is a subject that
encompasses a range of issues which are likely to impact upon the
design of a historical database. Essentially what this section
will focus on is the difference between ‘information’ and ‘data’
– the former being what sources provide, the latter being what
databases need – and it will begin the process of considering how
to move from one to the other.

Unfortunately, the historian is faced with particular kinds of
problem when it comes to converting sources into a useful
database resource, problems which are not shared by most other
database users. This (as we shall see) boils down to two separate
inescapable realities of historical research:

The historian often does not know precisely what kinds of
analyses they want to conduct when starting out on their research

The extent and scope of the information contained within the
historian’s particular sources cannot usually be anticipated
fully

In other words the sheer unpredictability of many historical
research projects, the various tangents and new lines of inquiry
that open up as soon as you get to grips with the sources, as
well as the constant promise of unearthing a type of
information that you were not expecting, make designing databases
a difficult proposition for historians. Indeed, in many ways,
these two factors provide conditions which are entirely contrary
to the environment required by the structures and functions of a
database. The difficulty for the historian is that informal,
unstructured information must be translated and made to fit into
a rigidly formal and structured medium. Reconciling the two – the milieu of the
historian and the rules of databases – is the principal aim of
this Handbook.

Much of what is discussed here is about good practice and
ensuring that the most common and critical mistakes are avoided
at the most important stage of database creation. Errors at this
juncture will have an effect on how useful the database will be:
they will make data entry more laborious and more difficult; and
more seriously, they will have a significant impact upon the
database’s ability to retrieve data for analysis. It is very
important therefore to design the database as ‘correctly’ as
possible, initially, to minimise the need for retrospective
restructuring further down the line (although some of this will
inevitably be necessary).

B2. Information and data

The tricky part of the process of using databases in historical
research lies in the ‘shape’ of the information that is found in
our sources. Databases have very strict rules about what type of
information goes where, how it is represented and what can be
done with it (see
Section C), and if the information from our sources can be
made to obey these rules then it has become data. Of course the
problem facing historians is that information can take many forms
in our sources, even if the research is only considering a single
type of source and that source is a relatively simple one.
Sources that are irregular in ‘shape’, such as textual sources
with long narrative accounts written in paragraphs and chapters
and so on, or databases of image/sound/video collections, are
particularly problematic when it comes to converting their
information into data; but the problem will also arise in the
more structured sources (such as census listings or taxation
assessments), which are never quite as simple as they might
appear.

The concept of ‘shape’ here is one that is fundamental to the
understanding of how databases work and the efforts needed to
enter our sources into them. One of the database rules alluded to
above is that all data in the database sit in tables, regardless
of what kind of data they are. This means that information taken
from our sources will need to fit into a tabular structure – that
is, arranged by rows and columns – by the time it has been
entered into the database. Often this is not the ‘shape’ the
information is in when we open the pages of our sources, and
usually we will have to mould it into a more compliant shape. As
we shall see, this will cause us to accept something of a
compromise between maintaining the full richness and integrity of
our sources’ information on the one hand, and maximising the
analytical potential of the data we create on the other.

Information from our sources is what we are interested in. It is
what we will use to perform our historical analyses, and it is
the raw material of our research. Away from the database, as a
methodological necessity we extract information from our sources
and record that information as notes
(sometimes as transcripts) in a variety of forms. The
recording of information in this way allows us access to what we
need without having to consult the original source in the future,
but the form of our notes also allows us to accommodate the
vagaries in the types of information that we can obtain from the
source. In making notes we assimilate the variations in the type
and scope of the information being recorded without concern for
the shape of that information, something that is no longer
possible in a database environment.

For example, image B2i depicts an interesting historical source,
eminently useful for researching a variety of social, economic,
cultural or political subjects in the context of mid-nineteenth
century Chicago. The text of this pamphlet provides the historian
with the bulk of the source’s information – information about
places, dates, themes and events and so on – but from the point
of view of database design it is important to note that not all
of the information is contained in the source’s text. It is
important to identify these non-textual types of information
(such as page dimensions, layout, font types and sizes, language,
archival stamps, colours used etc.) because if they are important
to your research then they will need to be accommodated within
the database design, and in some cases this will involve extra
conversion processes. Descriptions of the source can be useful
information every bit as much as what the source actually says.

When considering this pamphlet as a candidate for inclusion into
a database, the most obvious aspect of this particular source is
that it does not look much like a table. It is not ‘rectangular’
in terms of its shape – the text is not organised into columns
and rows. This makes it difficult to ascertain the scope of the
information (what there is information about) without actually
reading the whole source, in a way that you might be able
to do at a glance with a source already arranged in rows and columns.

Immediately therefore it becomes apparent that if we wanted to
include this information in our database, we would need to think
carefully about how to enter the information we want into the
tabular structure required. How would it be possible to reorder
the information into columns and rows – what would our columns
be, how could the information be divided into instances of
something (rows)? Our sources, whilst they may be wonderfully
useful things, are not often actually suited for use in
databases.

On the other hand there are sources which are more promising at
first glance in terms of their suitability for inclusion in a
database. Take for example the returns of the census enumerators
(such as that for the 1850 US Census, image B2ii), a source which
is as ‘rectangular’ in shape as it is possible to be. Here the
information is conveniently arranged into columns and rows – each
column pertains to one particular type of information (name,
age, occupation and so on), and each row corresponds to
information about a single individual. This is a source which
will ‘fit’ into the database structure without the need for too
much conversion, as its inherent shape approximates that required
by the database quite closely.

However it is worth noticing that even here the translation
process between source information and database data will not
necessarily be an entirely problem-free one. Whilst the bulk of
the information is contained within the tabular structure of the
source, not all of it is. The information at the top of the page
for example, vital information about the place and date of the
listing, as well as the identity of the enumerator, is not
contained within the table of the individual returns. In a
database built from this page, this information would need to be
accommodated within a table somewhere, giving us some thinking to
do about how this should be managed. Similarly, there are a
number of pieces of information which might be useful to our
research which do not exist in the table of individual listings:
the arrow pointing to the Lincoln household, for example, or the
various ticks and crosses, emendations and marginalia, some of
which are not original to the source but which still constitute
information, might be desirable for inclusion in the database. As
we shall see in
Section C not all the information from a source need
necessarily be included in the database and significant decisions
about this will need to be made, but the information that is
required, no matter what its shape or where it is located in the
source, will need to be appropriately converted before it can be
used in the database.
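To make this concrete, one common solution is to hold the page-level information in its own table and link each individual return back to it. The following sketch uses Python’s built-in sqlite3 module; every table name, field name and value is invented for illustration, not transcribed from the census itself.

```python
import sqlite3

# A sketch of one way to accommodate the page-level information:
# it goes in its own table, and each individual return carries a
# reference back to the page it came from. All names and values
# here are illustrative placeholders, not transcriptions.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""CREATE TABLE pages (
    page_id    INTEGER PRIMARY KEY,
    place      TEXT,  -- place of the listing
    date_taken TEXT,  -- date of the listing
    enumerator TEXT   -- identity of the enumerator
)""")
cur.execute("""CREATE TABLE individuals (
    ind_id     INTEGER PRIMARY KEY,
    page_id    INTEGER REFERENCES pages(page_id),
    name       TEXT,
    age        INTEGER,
    occupation TEXT,
    annotation TEXT   -- ticks, crosses, marginalia, if wanted
)""")

cur.execute("INSERT INTO pages VALUES (1, 'Springfield, Illinois', '1850', 'unknown')")
cur.execute("INSERT INTO individuals VALUES "
            "(1, 1, 'Abraham Lincoln', 41, 'Lawyer', 'arrow in margin')")

# Any individual can now be traced back to the page-level information.
row = cur.execute("""SELECT i.name, p.place
                     FROM individuals i
                     JOIN pages p ON i.page_id = p.page_id""").fetchone()
```

The design choice here, splitting the page-level information into its own table, avoids repeating the place, date and enumerator on every individual row.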

The need to understand the differences between the shape that
information takes in our sources and the shape that data has to
adopt within databases, is something that this Handbook
will return to repeatedly, from a variety of angles. Squeezing
information into the right shape for use in a database is not the
only form of conversion that is required, however, as we shall
see in
Section F, but it is the most fundamental stage of the
process, and is the most important step in the design stages, as
we shall see in
Section E.

[1] Pamphlet calling for a strike at
the McCormick Reaper Works on the Haymarket Square in
Chicago, 1866. Available at
Wikimedia Commons (accessed 25/03/2011).

[2] The household of Abraham Lincoln as
described in the returns of the 1850 US Census.
Available at
Wikimedia Commons (accessed 25/03/2011).

C1. Introduction

This section of the Handbook provides an introduction to
the basic concepts that underpin the design and use of databases.
It will focus on aspects of the database which will need to be
considered at the beginning of the database design process,
alongside the identification of information in the sources (see
Sections
B and
E). The ideas and approaches discussed here are independent
of the sources being employed by the historian, and as such will
apply to every database used in a historical research project
(and indeed to most databases in general). This section will not
address technology or software, but rather it will examine
conceptual approaches to designing databases in the abstract, and
specifically how these concepts will affect databases to be
created and used by historians.

C2. The Purpose of the database

As we shall see in
Section E, the very first step in the formal process for
designing a database is to decide what purpose(s) the database is
to serve. This is something that is perhaps not as obvious or as
straightforward as one might expect, given that databases in the
abstract can indeed serve one or more of a number of different
kinds of function. In essence, however, there are three types of
function that the historian is likely to be interested in:

Data management

Record linkage

Pattern elucidation/aggregate analysis

Each of these functions is a goal that can be achieved through
shaping of the database in the design process, and each will
require some elements of the database design to be conducted in
specific ways, although they are by no means mutually exclusive.
And this latter point is an important one, given that most
historians will want to have access to the full range of
functionality offered by the database, and will likely engage in
research that will require all three of the listed types of
activity. Or, to put it another way, many historians are unlikely
to know precisely what it is they want to do with their database
at the very beginning of the design process, which is when these
decisions should be taken. This is why, as we shall see later in
this section, many historians are inclined to design databases
which maximise flexibility in what they can use them for later on
in the project (a goal which will come at the price of design
simplicity).

The data management aspect of the database is in many cases
almost a by-product of how the database works, and yet it is also
one of its most powerful and useful functions. Simply being able
to hold vast quantities of information from different sources as
data all in one place, in a form that makes it possible to find
any given piece of information and see it in relation to other
pieces of information, is a very important tool for the
historian. Many historians use a database for bibliographical
organisation, allowing them to connect notes from secondary
reading to information taken from primary sources and being able
to trace either back to its source. The simpler tools of database
software can be used to find information quickly and easily,
making the database a robust mechanism for holding information
for retrieval.

Record-linkage is where the database, and particularly where the
relational database (see Sections
D and
E), comes into its own. Connecting people, places, dates,
events and themes across sources, periods and geographical or
administrative boundaries is clearly an incredibly useful task to
perform, and whilst the database can do this, the efficiency and
accuracy of the linkages will be dictated by both the design of
the database structure and the nature of the data model (see
Section E).

Finally once the information from your sources has been converted
into data, the database software can be employed to group
information together. Once records can be aggregated, then it
becomes possible to count them, meaning that statistical analyses
can be performed and structural patterns can be identified within
the information. Again, however, the efficiency and accuracy of
this kind of function will depend on the design of the database
and the manner in which the information has been converted. In
particular, this kind of functionality will depend a great deal
upon the latter, and if the historian aims to perform this kind
of analysis extensively, then there will need to be a
considerable effort put into applying a ‘standardisation layer’
to the data (see
Section C4).
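The dependence of aggregate analysis on standardisation can be sketched briefly. In the following illustration (Python’s built-in sqlite3 module; all names and values are invented), grouping on the standardised occupation field counts the variant source spellings of ‘merchant’ as one category, which grouping on the raw source forms would not.

```python
import sqlite3

# Each record keeps the occupation as written (Source layer) and a
# standardised form (Standardisation layer); the values are invented.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE people (occupation_source TEXT, occupation_std TEXT)")
cur.executemany("INSERT INTO people VALUES (?, ?)",
                [("merchant", "merchant"),
                 ("mercht.",  "merchant"),
                 ("Merchant", "merchant"),
                 ("labourer", "labourer")])

# Aggregating on the standardised field groups the three variant
# spellings together; grouping on the raw source forms would split
# 'merchant' into three separate categories.
counts = dict(cur.execute("""SELECT occupation_std, COUNT(*)
                             FROM people
                             GROUP BY occupation_std""").fetchall())
```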

C3. Conceptual models of database design

Whilst it is true that every database ever built has been
designed specifically for a particular conjunction of purpose and
data, and is therefore to a greater or lesser extent distinctive,
it is also true that there are two principal overarching
approaches to designing databases. The two conceptual models are
known as:

The Source-oriented
approach (sometimes called the Object-oriented
approach)

and

The Method-oriented approach
(also known as the Model-oriented approach)

These two models should be viewed as polar opposites at the ends
of a sliding scale, where the design of a database is based on an
approach somewhere between the two extremes. Every database
design will be something of a compromise, and no database will
ever constitute the ‘perfect source-oriented database’, nor will
there ever be the ‘perfect method-oriented database’.

C3i – The two conceptual approaches to database design

The Source-oriented model of database design dictates that
everything about the design of the historical database is geared
towards recording every last piece of information from the
sources, omitting nothing, and in effect becoming a digital
surrogate for the original. The information contained within the
sources, and the shape of that information, completely ordains
how the database must be built.

The lifecycle of an ideal source-oriented database can be
represented thus:

C3ii – Lifecycle of the Source-oriented database

This approach to database design is very attractive to the
historian as it places the sources at the centre of the database
project. Entering data into a database is a very time consuming
activity, however, and this becomes much more so if you are
taking pains to record all of the information that exists in your
sources. Ultimately you will need to make choices about which
information you will exclude from the database, contrary to the
principles of the Source-oriented model, which will undermine the
database’s role as a digital surrogate for your sources but which
will at least allow you to perform your research within a
reasonable period.

The Source-oriented approach, if rigidly applied, can lead to a
design that quickly becomes unwieldy as you try to accommodate
every last piece of information from your source, some of which
may only occur once. But it does allow for wider analytical
approaches to be taken later, so that potential queries are not
reliant on the initial research agenda, meaning that the database
does not restrict the directions your research might take. It
also allows you the reassurance of not having to anticipate all
of your research questions in advance, as the Method-oriented
model demands. The Source-oriented model transfers the source (with
all its peculiarities and irregularities) in a reasonably
reliable way into the database with little loss of information –
‘everything’ is recorded (or at least anything excluded is
excluded by your conscious choice), and if later something becomes
interesting, you will not have to go back to the source to enter
information that you did not deem interesting enough to begin
with. The Source-oriented model also enables you to record
information from the source ‘as is’, and lets you take decisions
about meaning later – so ‘merc.’ can be recorded as
‘merc.’, and not expanded to ‘merchant’ or ‘mercer’ at the point
of entry into the database. [1]
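The deferral of such decisions can be sketched in a few lines of Python. Both the entries and the expansion table below are purely illustrative: the point is that the verbatim forms are kept at entry time, and the mapping from ‘merc.’ to a full form is only constructed later, once the whole body of data is available as context.

```python
# Verbatim forms are recorded at entry time; the expansion table is
# constructed later, once the full dataset is available as context.
# Both the entries and the mapping are purely illustrative.
entries = ["merc.", "merchant", "merc.", "mercer"]

# Decided after data entry, in the light of the whole body of data:
expansions = {"merc.": "merchant", "merchant": "merchant", "mercer": "mercer"}

standardised = [expansions[form] for form in entries]
```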

At the other end of the scale, the lifecycle of the
Method-oriented model database could be represented in a
different way:

C3iii – Lifecycle of the Method-oriented database

This approach to database design is based on what the database is
intended to do, rather than the nature of the information it is
intended to contain. Consequently, if adopting this model for
designing your database, it is absolutely vital that you know
before you begin precisely what you will want to be able to do
with the database – including what queries you will want to run.
The level of precision needed here should not be underestimated
either, given that the database requires a high degree of
granularity to perform analysis – the database will not be able to
‘analyse the demographic characteristics of the population’, for
example, whereas it will be able to ‘aggregate, count and link
the variables of age, gender, marital status, occupation,
taxation assessment, place of residence’ and so on. When
designing any database it will be necessary to think at this
latter level of detail, but if you are designing a
Method-oriented database then it becomes much more important.

Method-oriented databases are quicker to design, build and enter
data into, but it is very hard to deviate from the designed
function of the database, in order to (for example) pursue newly
discovered lines of enquiry.

Ultimately, historians will need to steer a middle course between
the two extreme models, perhaps with a slight tendency to lean
towards the Source-oriented approach. When making decisions about
what information you need from your sources to go into the
database, it is important to take into account that your needs
may change over the course of a project that might take a number
of years. If you want to be able to maintain the maximum
flexibility in your research agenda, then you will need to
accommodate more information in the database design than if you
are very clear on what it is you need to do (and what that is
will never change). If you do not know whether your research
needs will change, err on the side of accommodating more
information – do not exclude information about servants unless
you are absolutely sure that you will never want to treat
‘households with servants’ as a unit of analysis, because if you
have not entered that information, then it will not be there to
query later on.

However you should not dismiss the Method-oriented model out of
hand when considering the approach to your database design. If
you know your source(s) very well in advance, and you have
definite pre-determined research needs, and you know you will not
be attempting to recover all the information from the source, and
you know in advance exactly how you will treat your data and what
questions you will ask of it – if all this is true, you can use
the Method-oriented approach. Alternatively, if you are creating
a database which is not actually for historical research,
but is designed to be a resource with pre-defined functionality
and a limited set of tools that a user can use,[2] then a Method-oriented design is also appropriate.

[1] Leaving this kind of ‘normalisation’
until later in the project is beneficial as it allows you to
defer decisions about the meaning of data until you have the
full body of data to act as context.

[2] Such as an online database with fixed
search and retrieval functionality, for example Old Bailey
Online (http://www.oldbaileyonline.org/,
accessed 23/03/2011).

C4. Database ‘layers’

Databases often involve several stages of work before they can be
fully utilised for analysis. This is because well designed
databases arrange data into several layers. The ‘Three Layer’
model of database design serves to illustrate how the
organisation of different types of data within a database can
dramatically improve the analytical potential of that database.
The Standardisation Layer in particular is one that historians
should invest time and effort into developing, and practical
methods of doing this will be addressed in
Section F.

C4i – The ‘Three Layer’ model of database design

The basic premise of the Three Layer model is to distinguish different
kinds of data. The first kind is data that is derived from the
source and which is entered into the database in the form in
which it appears originally; the second kind is data that is
adapted to obey more closely the strictures imposed by the
database to enable the retrieval, querying and analysis processes
to be performed more efficiently and accurately. This second type
of data is standardised in one manner or another, where for
example variations of spelling of a particular entity are
codified into a single form to make it easier to find them in the
database.

Before we discuss the specific layers, it is important to make
the point that you will need to keep the different layers
separate: that is, you should always be able to tell whether a
piece of data is from the source or whether it has been
standardised in some way by you. In terms of database structure,
every field will always belong to one layer only, although a single
table can contain fields from more than one layer.

The Source layer

This layer comprises the tables and fields (see
Section C5) that contain information taken from the source
and the forms of the data which have been entered verbatim from
the original. No adaptation of the original information has taken
place in this layer. This allows us to retrieve information that
shows what the source actually said.

The Standardisation layer

This layer comprises the tables and fields which contain data
that you have adapted to make analysis easier, and will include
data where spelling has been standardised, where abbreviations
have been expanded, where a single dating system is used and so
on (see
Section F). There are two opportunities for creating this
layer, either at the database design stage, or after all the data
entry has been completed. If the former approach has been chosen,
then during data entry you must be rigorously consistent in the
way that you enter your standardised forms (e.g. always spelling
‘John’ in the same way), and you must document how you have
standardised. If done after data entry as post-processing, you
can create your standardised values globally across the whole
body of data, but this can be time consuming when dealing with
lots of information that needs to be standardised. If possible,
the former approach is almost always the better option to take.
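As an illustration of the post-processing route, the following sketch (Python’s built-in sqlite3 module; the variant spellings and field names are invented) fills a standardisation field globally after data entry is complete.

```python
import sqlite3

# After data entry, only the verbatim forename has been recorded;
# the standardised field is filled globally as post-processing.
# The variant spellings here are invented for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE people (forename_source TEXT, forename_std TEXT)")
cur.executemany("INSERT INTO people (forename_source) VALUES (?)",
                [("Jon",), ("Johne",), ("John",)])

# One documented rule, applied across the whole body of data at once:
for variant in ("Jon", "Johne", "John"):
    cur.execute("UPDATE people SET forename_std = 'John' "
                "WHERE forename_source = ?", (variant,))

distinct_forms = cur.execute("SELECT DISTINCT forename_std FROM people").fetchall()
```

Note that the verbatim forms survive untouched in their own field, so the standardisation can always be revisited.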

The Interpretation (also known as Enrichment) layer

This layer is in many ways an optional one, whereas the other two
are not. This layer comprises data and materials which have
been drawn into the database from outside your
sources, in order to make the database and the data it contains a
more useful and powerful resource. This can consist of
classification, interpolation, interpretation and derived
variables. It can also include making links between data: two
pieces of information when linked make a third new piece of
information. It can include reshaping the data in the database to
increase usability (for example drawing together dozens of
records about an individual in order to create a single
encyclopaedic record which contains everything there is to know
about that person), and it can include incorporating contextual
material drawn from entirely separate (perhaps secondary)
resources. Many databases used for personal research do not
include an Interpretation layer.

Describing these layers in this way might suggest that they each
exist in entirely separate ‘areas’ of the database as discrete
collections of data. In most cases this is not true, however. In
most cases data belonging to each field will co-exist within
tables, but within separate fields within the tables (see
Section C5): for example you might create two fields for
‘occupation’ in the same table that records information about
people, in one field (belonging to the Source layer) you can
record how the occupation is presented in the source, in the
second field (belonging to the Standardisation layer) you can
record a standardised version of the occupation. The standardised
version will be used for querying and analysis, because it will
be easier to find (by virtue of the fact it is always represented
in the same way).

When moving between layers it is important you are always aware
of what belongs to the Source layer and what belongs to the other
layers. This will usually be obvious (at least to you, if not to
others), but there should be explicit rules defined (and
recorded) explaining the demarcation between the layers. In other
words the layers should be managed so that it is always possible
to backtrack or undo a piece of interpretation that appears in
the database in the event of you changing your mind about
something. Similarly you should always be able to get from the
result of a query which draws upon standardised data back to the
original source data. Much of this layer management will occur
through the structure of the database and will not require active
intervention on your part, but it is worth remembering at all
times that any given piece of data belongs to only one or other
of your layers.

These are the principles of database layers, but as with
everything else about database design, there is a degree of
flexibility about how steadfastly you hold to them. For example,
employing a source layer and a standardisation layer can mean
entering lots of data twice in two different forms – the original
version of an occupation, and the standardised version, for
example – which will clearly slow down the data entry process.
There is a judgement to be made by every historian creating a
database as to how far these different layers should be used, and
specifically, whether or not some pieces of information need
both a source layer and a standardised layer. If you are
unlikely to use a piece of information for analysis, then perhaps
it does not need to have a standardised version; if you do not
need to have a piece of information in its original form, then
perhaps its source layer version is not required. These decisions
need to be made carefully, as they impact directly upon how long
your data entry will take and how easily you will be able to
perform your analysis.

C5. Database terms and definitions

This section of the Handbook will introduce a few
definitions and database terms which will act as a shorthand for
various concepts and processes that will be discussed in other
sections. Again, the issues discussed here will apply
generally to all databases designed for historical research, as
well as to most databases in general.

Harvey and Press provide a definition of a database:

“A database is a collection of
inter-related data organised in a pre-determined manner according
to a set of logical rules, and is structured to reflect the
natural relationships of the data and the uses to which they will
be put, rather than reflecting the demands of the hardware and
software.”

which is a useful if perhaps formal way of describing both the
content and environment of a database. Within the database
itself, however, are a number of different ‘things’, called
Objects,[1] which serve a variety of functions
(these include tables where the actual data is stored, queries,
user interfaces and so on). For the moment we will concentrate on
only the first of these objects, the tables, and we will look at
some of the terms connected with them.

There are four main elements to any database, and each
of these has (somewhat inevitably) a number of names:

In each database, data are stored in tables, and most databases
will have more than one table. These tables will have
relationships between them which connect the information they
contain and will thus be ‘relational’ (see
Section D). Tables are made up of fields (columns) and
records (rows). Each field contains one type of information, and
for each record the information in that field will be of the same
kind.

Database Rules

Returning to the ‘logical rules’ referred to by Harvey and Press,
these can be summarised as follows:

The ‘information rule’: all information in a relational
database is represented in one way – as a value in a field in a
table

Each table in a relational database will have a name which
uniquely identifies it and which will distinguish it from the
other tables; the table should contain information about only one
entity.

Further Information

After designing a table, it is
worth asking the question “What is this a table ‘of’?” If the
answer to the question involves using the word ‘and’ then it
is likely to be the case that your table does not represent a
single discrete entity.

Note that this means that it is
possible for two fields each in a separate table to have the
same name, but for a variety of reasons this is not a good
idea. Database software will prevent you from giving the same
field name to two fields within the same table.

The values in each field must have the same logical
properties, that is, must be of the same datatype: numerical or
text (we will be looking at datatypes shortly)

Records will contain closely related information about a
single instance of the entity (that is, the subject of the table
– for example, the forename and surname of a single individual in
a table about individuals)

The order of records in a table must not be significant

So, for example, you cannot
refer to other records in your data using terms such as ‘see
above’, as the order of the records in the table is very likely
to be altered during use of the database.

The order of fields in a table must not be significant

Each complete record must be unique

Each field should contain ‘atomic values’: that is,
indivisible data (e.g. first and last names of an individual
should always be held in separate fields)

In many cases the database software will prevent you from
breaking these rules, but in some cases it is possible to
circumvent them, whether intentionally or otherwise. With those
rules that can be broken, it is almost always unwise to do so, as
it will generally lead to confusion (for the database if not its
user) when it comes to performing analysis.
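Although this Handbook does not cover database construction in any practical manner, these rules can be briefly illustrated. The sketch below uses Python’s built-in sqlite3 module purely as an example; the table and field names are invented for the purpose. Note the atomic fields (forename and surname held separately) and the ID field that guarantees each complete record is unique:

```python
import sqlite3

# An in-memory database, purely for illustration
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One table for one discrete entity ('person'), with atomic fields:
# forename and surname are held in separate fields, never combined
cur.execute("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,  -- guarantees each record is unique
        forename  TEXT,
        surname   TEXT
    )
""")

cur.executemany(
    "INSERT INTO person (forename, surname) VALUES (?, ?)",
    [("Samuel", "Spade"), ("Philip", "Marlowe")],
)

# The 'information rule': everything is retrieved the same way,
# as a value in a field in a table
cur.execute("SELECT person_id, forename, surname FROM person ORDER BY surname")
rows = cur.fetchall()
```

Because the order of records is not significant, the query asks for an explicit ordering (here by surname) rather than relying on the order of entry.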

Datatypes

We have seen that databases are made up of tables, and that
tables follow certain rules. One of the rules to be considered is
that each field within a database must be of a certain datatype.
There are a number of different datatypes, and although the names
by which these datatypes are known will vary slightly from
database software to database software, what they do and how they
operate will be essentially the same.

In essence what datatypes do is to control what type of data is
entered into a field. Each field in each table is assigned a
datatype, usually ‘text’ or ‘numeric’, and this in turn dictates
what kind of data can be entered into that field. The purpose of
datatypes is partly to act as a kind of quality control
mechanism, to help prevent data being entered incorrectly; and
partly they are used to help the database understand the meaning
of what is typed in. For example, the piece of data ‘23/03/2011’
will not be recognised as a date by the database if it is entered
into a field that has the datatype ‘text’, whereas it will
understand the meaning of this value if it is entered in a field
with the ‘date/time’ datatype.

A brief description of the principal datatypes follows.

The basic datatypes: text

This is the default datatype for your fields which will be
assigned by the database whenever you add a new field to a table

This datatype will allow the entry of textual and numerical
values, although it will treat the latter differently from
numerical values entered into a ‘number’ datatype field

Fields with this datatype will generally allow a maximum of
255 characters to be entered

Further Information

Remember that spaces and
punctuation count as characters.

The basic datatypes: memo

Fields with this datatype are used for lengthy texts and
combinations of text and numbers

Up to 65,000 characters can be entered (the precise number
may change depending on the database software being used)

Data in these types of field cannot be sorted

Data in these types of field are difficult, although not
impossible, to query

The basic datatypes: number

This datatype allows the entry of numerical data that can be
used in arithmetical calculations

There are a number of variations of this datatype, which
control aspects of the numbers that can be entered, such as the
size of the numbers, the number of decimal places and so on:
Further Information

These properties of the number
datatype may differ in name or capacity slightly depending on
the software being used.

Single: stores very large positive and negative numbers
to about 7 digits of precision

Double: stores very large positive and negative numbers
to about 15 digits of precision

The basic datatypes: date/time

This datatype enables the entry of dates covering the years
100 through to 9999

This datatype can be customised in order to control the
format of dates that are entered

Warning: in Microsoft Access, the Office autocorrect feature
may well change some of your dates if it is active (e.g.
“02/12/04” will autocorrect to “02/12/2004” unless you enter the
year in full)

This datatype will allow the sorting of records
chronologically, which the same values entered into a text
datatype field would not (the latter would sort the records
alphabetically – alphabetically “01/01/2010” would come before
“31/12/1245”)
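The difference can be illustrated in miniature in Python (an illustrative sketch, not a prescription of any particular software), using the two dates mentioned above:

```python
from datetime import datetime

dates = ["31/12/1245", "01/01/2010"]

# Sorted as text: character-by-character comparison,
# so "01/01/2010" comes before "31/12/1245"
as_text = sorted(dates)

# Sorted as dates: each value is parsed first,
# so the year 1245 correctly precedes 2010
as_dates = sorted(dates, key=lambda d: datetime.strptime(d, "%d/%m/%Y"))
```

The same values thus produce opposite orderings depending on whether the database treats them as text or as dates.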

The basic datatypes: currency

This datatype allows the entry of numerical data used
in mathematical calculations involving one to four
decimal places, with the inclusion of currency symbols

The basic datatypes: autonumber

This datatype automatically generates a unique sequential or
random value whenever a new record is added to a table.
Further Information

This is a very important type
of field, and one we will be returning to in
Section D when discussing
Primary Keys.

AutoNumber fields cannot be updated, that is, you cannot
enter data into them manually

The basic datatypes: yes/no

A field with this type of datatype will only contain one of
two values (Yes/No, True/False, or On/Off)

Quite often database software will represent this type of
field as a checkbox in the table

The underlying stored value of a yes/no field will be -1 (for
yes) or 0 (for no)

The basic datatypes: OLE

A field with this datatype is one in which another file is
embedded, such as a Microsoft Excel spreadsheet, a Microsoft Word
document, an image, a sound or video file, an html link, or
indeed any other type of file

Every field in every table will need to have one of these
datatypes assigned, and the decision as to which type is chosen
should be factored into the database design process (see
Section E). For most fields the datatype chosen will be
either ‘text’ or ‘number’. Keep in mind how these two datatypes
treat numerical data differently, particularly in terms of how
they cause data to be sorted:

1,10,11,2,3,4,5,6,7,8,9 is how data will be sorted if the
datatype is ‘text’ (that is, alphabetically)

1,2,3,4,5,6,7,8,9,10,11 is how data will be sorted if the
datatype is ‘number’ (that is, numerically)
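A quick Python sketch (purely illustrative) reproduces these two sort orders:

```python
values = list(range(1, 12))  # the numbers 1 to 11

# As 'number' data: sorted numerically
numeric_sort = sorted(values)

# As 'text' data: sorted alphabetically, character by character,
# so "10" and "11" come immediately after "1"
text_sort = sorted(str(v) for v in values)
```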

[2] Note in this table the Forename,
Surname, Occupation, Residence and Comments fields
contain values taken directly from the source, and are part
of this database’s Source layer. The Stated_Sex field
contains a standardised value (in fact a code), and is part
of the Standardisation layer.

C6. Conclusion

All of the foregoing concepts and aspects of database
organisation need to be taken into consideration when designing
the database that will be used in your historical research, as
each of them will either have an implication for how some of your
historical information will need to be converted into suitably
shaped data (see
Section E), or else they will have an impact upon the types
of linking, counting and analysing it will be possible to perform
on your data. Designing the database will not only involve
modelling the information from your sources, but it will also
involve making decisions about the conceptual approach of your
database, the functions you want it to perform, and how you want
the tables and fields of the database to operate at the most
minute of levels.

D. Relationships

D1. Introduction

As mentioned in
Section C, most databases consist of data held in more than
one table, and this is especially true for databases where the
data is derived from historical sources. Relationships are
created between the tables to connect the data in one to the data
in the other: more precisely, relationships are used to connect
specific records in one table to specific records in another. In
many ways relationships, and the whole relational data model,
comprise the most difficult aspect of designing a database, and
not necessarily because they are difficult to actually create.
What is difficult about relationships is why we need them – the
reasons for using related data can seem obscure and unnecessary
at the start of a database project, especially if you have
limited experience of using databases. They are, however,
extremely important. In essence what relationships allow us to do
is two-fold: firstly they allow us to simplify very significantly
the process of data entry (and incidentally at the same time
enable us to protect the quality of the data we enter by limiting
data entry errors); and secondly they serve to ensure that the
results of our queries are accurate by making it clear precisely
what it is that is being queried.

D2. Functions of relationships

These dual functions of relationships are best illustrated with
an example.

Imagine a database which contained data about people and the
hats that they owned, comprising personal information about
name, gender, age and so on, as well as information about hat
type and the materials used in their manufacture. There are two
ways that this database could be designed:

A single (flat file) table data model, where all of the
information about people and hats was entered into the same
table

A relational data model where two tables are created, one
to contain information about people, and one to contain
information about hats

The two scenarios are both feasible and will allow you to
conduct a detailed analysis of people and their hats, but each
brings with it some very significant consequences if chosen.
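The contrast can be sketched in Python (an illustrative sketch with invented values): in the flat-file design a person’s details must be re-entered for every hat they own, whereas the relational design records each person once:

```python
# Flat-file design: one table, so a person's details are repeated
# for every hat they own (values invented for illustration)
flat_table = [
    {"name": "Philip Marlowe", "age": 35, "hat": "Bowler"},
    {"name": "Philip Marlowe", "age": 35, "hat": "Crash helmet"},
]

# Relational design: one table per entity, connected by a shared ID,
# so the person's details are entered only once
person_table = [{"person_id": 2, "name": "Philip Marlowe", "age": 35}]
hat_table = [
    {"person_id": 2, "hat": "Bowler"},
    {"person_id": 2, "hat": "Crash helmet"},
]

# In the flat design the name has been typed twice, inviting
# inconsistency and data entry errors...
names_entered = [row["name"] for row in flat_table]
# ...in the relational design it appears in exactly one record
person_records = len(person_table)
```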

D3. Types of relationship

Once the need for related tables arises, it is important to
understand that there are different kinds of relationship
that can exist between two tables. These differences are not
technical, in the sense of being dictated by or a product of the
mechanisms of the database; they are instead a function of the
logical, semantic connection between the information in the
two tables.

D3i – The three types of relationships

There are three types of relationship that can exist between two
tables in a database, not all of which are useful or desirable.

One-to-one relationships:

This relationship exists where a record in Table A can only
have one related record in Table B, and a record in Table B can
only have a single matching record in Table A

For example, an MP can have only one constituency, and a
constituency can have only one MP

This type of relationship is unusual in a database, as in many
cases where a one-to-one relationship exists the information in
the two tables could exist in a single table. This is not
to say that if, as a part of your Entity Relationship Modelling
process (see
Section E), you discover a one-to-one relationship between
two of your designed tables you should redesign the tables
into a single one, only that you can if you want to. For example,
if you wanted to, you could create a single table to enter
information about Members of Parliament and about their
constituency, as when it came to entering this data you would not
encounter the duplication of information that was problematic in
the example of people and hats in
Section D2. On the other hand, tables are supposed to be
discrete entities, and so logically speaking you might prefer to
conceive of ‘MPs’ and ‘Constituencies’ as two different entities,
and thus two different tables. The important thing to remember
with one-to-one relationships is that the database software that
you use to build your database will allow you to create this kind
of relationship, and that it will not create any problems when it
comes to running queries.

One-to-many relationships:

This relationship exists where a record in Table A can have
no, one or more matching records in Table B, but a record in Table
B can have only one matching record in Table A

For example, a mother can have more than one child, but a
child can have only one biological mother

This is the most common type of relationship found in
databases, and is usually the type that you want to build into
your designs. As illustrated in the people and hats scenario
(Section
D2) this type of relationship is used to overcome the kinds
of problems that arise within the database when the information
drawn from the sources would require the duplication of data if
entered into a single table.

Many-to-many relationships:

This relationship exists where a record in Table A can have
no, one or many matching records in Table B, and a record in
Table B can have no, one or more than one matching record in
Table A

For example, an author can write more than one book and a
book can be written by more than one author

If you discover this kind of relationship operating within your
database design at the end of the Entity Relationship Modelling
process, then you have a problem which will need to be addressed
before you can proceed to actually building the database.
Many-to-many relationships will not work in databases, as they
cannot be represented directly: any query run across two tables
related in this way will be unable to match records reliably,
generating duplicated or meaningless results.

Given how problematic this type of relationship is, it is
somewhat disheartening to see how frequently they crop up when
modelling historical information! The way of dealing with a
many-to-many relationship requires something of a conceptual
leap, as it requires the creation of a table, sometimes called a
Junction Table, to sit between the two related tables. This
Junction Table will act in an abstract fashion – the data it will
contain will not be information as such, but they will serve to
split the many-to-many relationship into two one-to-many
relationships.

D3ii – Many-to-many relationship between Author and Book tables

Take the database which contains a table about Authors and a
table about Books, which might be designed according to the
Entity Relationship Diagram depicted in image D3ii (for Entity
Relationship Diagrams see
Section E2). The arrowheads indicate the ‘many’ side of a
relationship, here indicating that both tables are on the ‘many’
side, clearly highlighting a problem. To overcome the
many-to-many relationship, we would insert a Junction Table to
split the relationship into two one-to-many relationships, as
indicated in image D3iii.

D3iii – Many-to-many relationship between Author and Book tables
split with a Junction Table

Note that each record in the Junction Table contains three
fields: a unique ID for each record (Junction ID), and then a
field for each of the Author IDs and Book IDs. Each record
therefore becomes a unique combination of Author and Book IDs,
which indicates which books were written by which authors:

The Junction Table here is effectively circumventing the
many-to-many relationship between books and authors, and each
record it contains acts as a statement linking one author
with one book. The first two records in the Junction
Table, for example, indicate that Author ID 1 was the writer of
Book IDs 1 and 2, whilst the last two records indicate that Book
ID 9 was co-authored by Author IDs 2 and 5. The relationship
between books and authors is managed by the Junction Table,
whilst the details about books and authors are kept in their
respective tables.
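As an illustration only (the table names follow the discussion above; the author and book names are invented), the arrangement can be sketched with Python’s sqlite3 module, showing how two one-to-many joins through the Junction Table recover the author–book pairings:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
    CREATE TABLE author (author_id INTEGER PRIMARY KEY, name  TEXT);
    CREATE TABLE book   (book_id   INTEGER PRIMARY KEY, title TEXT);
    -- Junction Table: each record is one Author ID / Book ID pairing
    CREATE TABLE junction (
        junction_id INTEGER PRIMARY KEY,
        author_id   INTEGER,
        book_id     INTEGER
    );
""")

cur.executemany("INSERT INTO author VALUES (?, ?)",
                [(1, "Author One"), (2, "Author Two"), (5, "Author Five")])
cur.executemany("INSERT INTO book VALUES (?, ?)",
                [(1, "Book One"), (2, "Book Two"), (9, "Book Nine")])
# The pairings described in the text: Author 1 wrote Books 1 and 2,
# and Book 9 was co-written by Authors 2 and 5
cur.executemany("INSERT INTO junction (author_id, book_id) VALUES (?, ?)",
                [(1, 1), (1, 2), (2, 9), (5, 9)])

# Two one-to-many joins through the Junction Table answer a question
# the many-to-many relationship could not: who wrote Book 9?
cur.execute("""
    SELECT author.name, book.title
    FROM author
    JOIN junction ON junction.author_id = author.author_id
    JOIN book     ON junction.book_id   = book.book_id
    WHERE book.book_id = 9
    ORDER BY author.author_id
""")
coauthors = cur.fetchall()
```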

This arrangement, whilst somewhat convoluted, will enable the
database to run queries that draw on information in both the Book
and Author tables when it would otherwise not be able to due to
the many-to-many relationship. It is therefore a very valuable
technique to bear in mind when identifying relationships between
tables as part of the database design process.

D4. Primary and Foreign Keys

Primary and Foreign Keys are absolutely crucial to the running of
a relational database because they serve as the ‘anchors’ that
keep relationships connected to the relevant tables. The keys are
fields that serve a particular purpose within a table: they are
not used to capture information drawn from the sources, but
instead they are used to keep track of the information that is
necessary for the database to know which records in one table are
connected to records in the related table.

One of the ‘rules’ of database design (Section
C5) states that ‘each complete record must be unique’,
meaning that when taken as a whole, each record in a table must
be unique. With historical sources, this can sometimes be
problematic, as quite often the same information can crop up
repeatedly over a period of time – such as, for example, when an
individual is named as a witness to a number of wills across a
decade. To make sure that the records of a table obey this
particular rule of databases, it is necessary to guarantee that
each record will be unique, and if the nature of our historical
information prevents us from being able to guarantee this, we are
forced to cheat. In doing so, we actually achieve a number of
useful effects in the design of our tables.

The way that we guarantee that each record in a table will be
unique is to add a field into which we can enter (or have the
database enter automatically) a unique identifier, a value which
will be different for every record that is added to the table.
Normally this is a sequential number, such as the values in the
various ID fields in the People and Hats database (Section
D2). A sequential number value applied to each record that
will be different for each record will guarantee by itself that
every record as a whole will be unique – because the ID value
will never be duplicated from one record to the next.

However this field which provides us with a unique value for
every record also serves to act as the Primary Key field for a
table, this being the field that acts as the anchor for the ‘one’
side of a one-to-many relationship. The field that acts as an
anchor on the other side of the relationship that exists within
the table that is on the ‘many’ side of the relationship is known
as the Foreign Key. The Foreign Key field will not contain a
unique value for every record: because it is on the ‘many’ side of
the relationship, the same ID value is likely to occur in more
than one record. Both the Primary and Foreign Key fields contain
the same information: the ID values on which the relationship is
anchored.

Consider the People and Hats database again:

Person table:

PersonID  Forename  Surname  Occupation  Residence  Age  Gender  HairColour  Income
1         Samuel    Spade    Builder     Peckham    23   M       Blonde      £25,000
2         Philip    Marlowe  Accountant  Dulwich    35   M       Dappled     £34,533
3         Hercule   Poirot   Plumber     Chelsea    26   M       Bald        £250,000
4         Miss      Marple   Doctor      Surbiton   58   F       Grey        £1.50

Hat table:

PersonID  HatID  HatType        HatColour      HatMaterial
1         1      Baseball cap   Black/red      Polyester
2         2      Bowler         Pink           Satin/fur
2         3      Crash helmet   Green          Kevlar/fur
3         4      Stetson        Brown          Leather
3         5      Captain’s hat  White/blue     Silk/braid
4         6      Nightcap       Grubby yellow  Rags

The relationship is one-to-many: a person can have many hats, and
the relationship is anchored on the PersonID field, which is
present in both tables. The PersonID field contains the unique ID
number that is assigned to every person entered in the
Person table – because every person is entered only once in the
Person table, the PersonID field in the Person table is unique,
and it is the Primary Key. In the Hat table, because a person can
own more than one hat, the PersonID value will not be unique –
the same ID number will crop up every time a record contains a
hat owned by that person (as is the case for persons 2 and 3).
The PersonID field in the Hat table is therefore the Foreign Key.
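By way of illustration (using Python’s sqlite3 module, which this Handbook does not prescribe), the anchoring of the relationship on the two PersonID Key fields can be sketched with the data above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,   -- Primary Key: unique per person
        forename  TEXT, surname TEXT
    );
    CREATE TABLE hat (
        hat_id    INTEGER PRIMARY KEY,   -- Primary Key of the Hat table
        person_id INTEGER,               -- Foreign Key: repeats per owner
        hat_type  TEXT
    );
""")
cur.executemany("INSERT INTO person VALUES (?, ?, ?)", [
    (1, "Samuel", "Spade"), (2, "Philip", "Marlowe"),
    (3, "Hercule", "Poirot"), (4, "Miss", "Marple"),
])
cur.executemany("INSERT INTO hat VALUES (?, ?, ?)", [
    (1, 1, "Baseball cap"), (2, 2, "Bowler"), (3, 2, "Crash helmet"),
    (4, 3, "Stetson"), (5, 3, "Captain's hat"), (6, 4, "Nightcap"),
])

# The one-to-many relationship is anchored on person_id in both
# tables, so the database can find all hats owned by one person
cur.execute("""
    SELECT hat.hat_type FROM person
    JOIN hat ON hat.person_id = person.person_id
    WHERE person.surname = 'Marlowe'
    ORDER BY hat.hat_id
""")
marlowe_hats = [row[0] for row in cur.fetchall()]
```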

When designing the database, there is a stage in the process
where you will need to think about what fields will appear in
your tables (see
Section E2), and it is very important when doing so that you
remember to identify the Primary and Foreign Keys for your
tables. Every table you design should have a Primary Key field –
a field with the datatype ‘autonumber’ which will generate a
unique value for every new record you add. Not every table will
have a Foreign Key field, only those that are on the ‘many’ side
of a one-to-many relationship. When you do add a Foreign Key
field to a table, remember that it will contain the same
information (that is, the ID numbers) drawn from the field that
is the Primary Key for that relationship. Without the Key fields,
the database software will not be able to correctly manage the
relationship, with the result that it will not be able to
identify which records from the one table are connected to
which from the other, a situation that will make performing
analysis or even simple retrieval of information all but
impossible.

E. Entity relationship modelling

E1. Introduction

Throughout this Handbook so far reference has been made to
the translation and conversion processes involved in taking
information from sources and turning it into data within the
database. This section describes precisely the tasks involved in
performing these processes, which are collectively known as
Entity Relationship Modelling (ERM). The mechanics of ERM are in
fact a lot less intimidating than the name implies, but it is
nevertheless a complex activity, and one that is likely to prove
challenging at the first few attempts. Luckily, however, the
various stages of ERM draw very heavily upon the skills and
experience that the historian utilises as a matter of course
during their research anyway, which, unlike most aspects of
database use, places the historical researcher at something of an
advantage. The difficulty of the ERM process is directly
proportional to the complexity of the source(s) being used in the
research, with some types of sources being (relatively) simpler
to model than others. Highly structured sources like census
returns, lists of inhabitants, poll books and so on will be
easier to model than ‘semi-structured’ sources such as probate
inventories, which in turn will present fewer problems than
completely unstructured material such as narrative texts and
interviews, and so on. However all will have their own particular
features and problems to complicate the modelling.

The process of ERM serves a number of purposes. Firstly, it makes
the historian decide upon what it is the database is to achieve
in terms of its functions. Secondly, it identifies the types of
information that can be obtained from the sources, and in
conjunction with the database’s chosen aims, aids the historian
in deciding upon which information from the sources should be
entered into the database, and which can be excluded.
Thirdly, ERM makes the historian think in detail about the
components of the database, its tables, fields, relationships,
datatypes, and so on, decisions on all of which are crucial to a
successful database design. Finally, it encourages the
consideration of the layers of the database: what information
needs to be entered into both the Source layer and the
Standardisation layer, what can be entered into only one of them,
and how extensive the Standardisation layer needs to be.
been conducted, the historian is left with a very precise idea of
what the database will look like, and, on a more practical note,
will be left with the design of their database on paper (an
Entity Relationship Diagram [ERD]).

E2. Entity Relationship Modelling step-by-step

Entity Relationship Modelling (ERM) is where you model the
information from your sources to fit into a structure that will
form the basis of your database design. The product of ERM is an
Entity Relationship Diagram (ERD), which will depict three
components: entities (tables), attributes (fields) and the
relationships between them.

Stage 1: Determine the purpose of the database

This stage is always the starting point of the ERM process, and
is especially important if you are adopting the Method-oriented
approach to your design, but it is a vital step in designing any
database. No database exists without a purpose: with a
Method-oriented database it is crucial that you know
exactly what you want to do with your data before you
design the database. At this stage you will need to make your
decisions about what information you want to keep and what you
want to discard, and you will need to be prepared to abide by the
consequences of this decision throughout the lifecycle of your
database project. This point deserves to be made emphatically:
although it is always possible to retrofit the design of your
database to include information that you had initially decided to
discard, it is not always a trivial matter to do so, particularly
if you have to enter another phase of data entry to collect the
new information.

Stage 2: List entities

Once you know what you want your database to do, you need to
divide your anticipated information into discrete subjects: each
subject, or entity, will evolve into a separate table.[1] We split the entities into separate tables for the
purposes of efficiency, for avoiding ambiguity, and because this
allows us the maximum flexibility for querying once the data are
in.

This stage of the ERM process sounds deceptively simple, but is
in fact probably the most difficult step of the whole process,
and it is certainly worth reading up on the subject (see the
Further
Reading section).

For example, consider a research project that was investigating
political elections in eighteenth-century Bristol, and the
sources consisted of a collection of poll books which recorded
the votes cast by the local electorates in the communities where
elections took place (image E2i):

With sources such as these we might pursue a research question
which was something like: ‘Analyse the geographic and economic
determinants of voting in early 18th century Bristol’.
With a question like this we would be interested in geography,
economic status and voting patterns in relation to a group of
individuals. In terms of entities, we might conclude that
there is only one: people. More precisely, we would be
considering voters, which would lead us to the position of
deciding that we would need a table into which we would enter our
information about voters.

Alternatively if we were using probate materials for our
research, and wanted to create a database into which we could
enter information extracted from wills, we would need to consider
the entities from this source. We might conceive of our
entities – our discrete subjects – as breaking down into
‘document’, with each will being treated as a different document;
‘person’, containing information about different types of people
and their roles – testator, recipient, executor etc.; ‘bequest’
with a range of types of associated information; and ‘object’,
being the object of a bequest. If our research was interested in
the material culture of a period or place, this latter entity
would be particularly important, whereas if the project was
concerned only with networks of people and social interrelations,
the ‘object’ entity might not be necessary.

In both these examples the possibility of these entities
being identified has been stressed, because the definition of
entities is as much a product of the purpose of the database as
it is of the information contained in the sources. As mentioned
at the beginning of this Handbook, no two databases will
be designed in the same way, as no two databases will be built
for the same purposes – different designers may well identify
different entities based upon their unique appreciation of their
research and their sources. And this is why to a large extent
this step of the ERM process is the most difficult!

However it is perhaps worth considering the inclusion of three
commonly chosen entities in the design of your database:

People - with a related entity of
‘role’ (being the reason why they are present in the source)

Document – where archival and bibliographical material can be
entered (and thus enabling the tracking of every piece of data in
the database to its source)

Event – a slightly more abstract entity, one which describes
an instance of whatever it is your source records (a trial, a
taxation assessment, an election etc.) and where information
about dates can be recorded.

Exercise

Assume you have a research project that is examining the
demographic profile of families in a particular parish, and your
principal source will be ecclesiastical registers of baptisms,
marriages and burials. List the entities from this source that
would need to be included in your database design.[3]

Stage 3: Identify the relationships between the entities

Bearing in mind the nature of relationships between different
elements of information (see
Section D), this step of the ERM process requires you to
identify which of your entities are related, and what type of
relationship exists between them. This is an exercise in abstract
logic, and will take considerable practice: in addition, quite
often, this stage will require revisiting stage 2 and redefining
the entities you originally chose.

If we return to the database of wills mentioned in stage 2 with
the entities ‘document’, ‘person’, ‘bequest’ and ‘object’, we
would need to unpick the nature of the relationships between
these entities. We might logically decide that the relationships
would look something like this (the arrowheads depict the ‘many’
side of a one-to-many relationship; see
Section D):

E2ii – Example of relationships identified between entities
(wills)

A single document (will) can contain information about more than
one person, and also about more than one bequest, whilst a
bequest can include information about more than one object, so
all of these relationships are one-to-many.
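A hypothetical sqlite3 sketch of this design might look as follows (all field names are invented for illustration; each ‘many’-side table carries a Foreign Key back to the ‘one’ side of its relationship):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# One table per entity; the Foreign Keys implement the three
# one-to-many relationships identified for the wills database
cur.executescript("""
    CREATE TABLE document (doc_id      INTEGER PRIMARY KEY,
                           archive_ref TEXT);
    CREATE TABLE person   (person_id   INTEGER PRIMARY KEY,
                           doc_id      INTEGER,  -- many persons per document
                           role        TEXT);
    CREATE TABLE bequest  (bequest_id  INTEGER PRIMARY KEY,
                           doc_id      INTEGER,  -- many bequests per document
                           detail      TEXT);
    CREATE TABLE object   (object_id   INTEGER PRIMARY KEY,
                           bequest_id  INTEGER,  -- many objects per bequest
                           description TEXT);
""")

# Confirm the four entity tables exist
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
tables = [row[0] for row in cur.fetchall()]
```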

Exercise

Identify the relationships between the entities chosen for the
parish register database in Stage 2.

Stage 4: Investigate and rectify problems

This stage is fairly self-explanatory. It is possible to spot
problems with the incipient design even at this relatively early
point in the ERM process, and if they exist it is better to do so
here than after investing work in the later stages.

Look out in particular for:

Relationships which do not appear to be one-to-many: remember
you cannot have entities related by a many-to-many relationship,
and whilst you can have them related through a one-to-one
relationship, it may be worth rethinking the two entities
involved (see
Section D)

Redundant relationships: if entities can be linked in more
than one way, you should work out which link should be kept and
which should be discarded – if Table A is related to Table B, and
Table B is related to Table C, then Tables A and C are already by
definition related, and do not need a ‘direct’ relationship to
exist between the two

Stage 5: List all attributes associated with each entity, and
identify keys

This stage involves listing the precise attributes of each entity
identified in the previous stages of the ERM process, by deciding
on the fields that should occur in each table. Each field should
contain one piece of information about the subject of the entity:
a single aspect of information that can be known about it (see
Section C5). Once you have listed the attributes for each
entity, you must then identify which fields will act as the
primary and foreign keys in each table, remembering that
relationships do not exist between tables as wholes, but between
a specific field in one table and a specific field in the
related table (see
Section D).
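By way of illustration, the wills design from the preceding stages might be realised as tables and keys along the following lines. This is a minimal sketch using SQL via Python's sqlite3 module; the table and field names (Document, Person, Bequest, Object, DocID and so on) are hypothetical, not prescriptive:

```python
import sqlite3

# Illustrative sketch only: table and field names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE Document (
    DocID INTEGER PRIMARY KEY,   -- primary key
    Date  TEXT,
    Notes TEXT                   -- generic 'Notes' field
);
CREATE TABLE Person (
    PersonID INTEGER PRIMARY KEY,
    DocID    INTEGER REFERENCES Document(DocID), -- foreign key: one document, many people
    Fname    TEXT,
    Sname    TEXT,
    Notes    TEXT
);
CREATE TABLE Bequest (
    BequestID INTEGER PRIMARY KEY,
    DocID     INTEGER REFERENCES Document(DocID), -- one document, many bequests
    Notes     TEXT
);
CREATE TABLE Object (
    ObjectID  INTEGER PRIMARY KEY,
    BequestID INTEGER REFERENCES Bequest(BequestID), -- one bequest, many objects
    ItemDescr TEXT,
    Notes     TEXT
);
""")
```

Note how each table on the ‘many’ side of a relationship carries a foreign key field pointing at the primary key of the ‘one’ side: this is where the relationships identified in stage 3 actually live.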

E2iii –Example of attributes and keys identified within entities
(wills)[4]

As a matter of course you should add a generic ‘Notes’ field to
each table, with the memo datatype, as it is likely to
prove invaluable at moments of crisis when entering data (see
Section F)!

Stage 6: Construct the Entity Relationship Diagram (ERD)

Once you have completed stage 5, you are in a position to create
the ERD for your database design, which will resemble the heavily
simplified example above (image E2iii).

Exercise

Construct an ERD for the parish register database, ensuring that
you have identified your entities and their attributes, the
relationships between the entities and the fields which act as
keys.

[1] In fact they may evolve into separate
groups of related tables, but for the moment this does not
need to be considered.

[2] An Exact List of the Votes of the
Freeholders and Freemen of the City and County of Bristol
taken at the Election of Members of Parliament (Bristol,
1722) p.19.

[3] If you are not familiar with parish
registers as sources, it is possible to find examples of
transcripts (of varying quality) online. One such set of
transcripts, for the county of Hampshire, can be found at
http://www.knightroots.co.uk/parishes.htm
(accessed 23/03/2011).

[4] Terms in black are entity (table
names), those in red are primary keys, those in green are
foreign keys and those in grey are the remaining
attributes/fields, some of which pertain to the database’s
Source layer, and some to its Standardisation layer. It is
very important to note that this diagram is entirely
illustrative, and is not intended to be prescriptive or
definitive in any way!

E3. Conclusion

The process of Entity Relationship Modelling (ERM) is
difficult, and rapidly becomes more difficult if you are
blessed with a number of different kinds of sources, each of
which contains rich information about a variety of subjects. If
you are using multiple sources, it is a good idea to avoid
creating entities that are source specific: for example, if you
are using census returns and taxation lists, both of which
contain information about people, do not create two tables for
people (one containing the information from one source, the other
from the second). Stick to the abstract logic of the information
– what is important to your research is people, so
accommodate all of the information about people in the same
place. Not only does this make sense from the point of view of
logic, but it will also make it much easier to find data about
specific individuals later on (either manually or via queries):
looking for a person is easier to do if everyone is located in
one table rather than several.[1]

No Entity Relationship Diagram (ERD) will ever be perfect: as
with so much else involved in database design, it will be a
matter of compromise. The success of an ERD can only be
determined in one way – by the database performing the tasks it
was intended and designed to do, and this is something that will
not become evident until after you have begun entering data and
using the database for analysis. This is why the creation of the
ERD is (or should be) swiftly followed by a period of intense
testing of the database ‘in action’, in order to quickly identify
where the design is impeding the database’s purpose (see
Section G).

[1] Ultimately of course this is a matter
of personal judgement: you may decide that your entity is not
‘people’, but is in fact two separate entities comprising
‘census return’ and ‘tax payer’, in which case you would be
able to argue for two separate tables. You would still face
the problem of having to look for individuals in more than
one table, however, should the need ever arise.

F. Problems facing the historian

F1. Introduction

This section of the Handbook is in some ways the most
crucial, as it addresses a range of issues that affect the
historian in particular during the course of designing their
database. Most of these issues arise through the variability and
ambiguity that inevitably accompanies the information present in
historical sources, and which is not something that
non-historical database creators encounter: usually such creators
will know precisely what kind of information they will have, and
can therefore comfortably predict how the database will need to
be structured to make best use of it. Historians will not be able
to predict the nature of their information so confidently, even
if they know their sources intimately, as there will almost
always be instances where the sources confound them – be it
through extra information cropping up in the material which
appears to be beyond the scope of the source, or through missing
or illegible text, or through marginalia, deletions and so on.

This variability or ambiguity in the type and shape of
information within the sources (whilst one of the more exciting
aspects of historical research) is not the only area in which
the materials and practices of the historian create difficulties
when a database is brought into the equation. Chronology,
topography, geography, orthography, and a range of other
historical contexts introduce an element of ‘fuzziness’ into the
data, and fuzziness is anathema to the strictures of the
relational database model. The irregularity of historical
information must be managed through the design of the database in
such a way as to maintain a balance between keeping the detail
and richness of the source to the extent required by the project,
whilst at the same time applying enough standardising and
reshaping of the information to allow the database tools to
operate with maximum efficiency and flexibility.

At the root of most of these ‘interfacing’ problems is the fact
that databases do not ascribe meaning to any piece of information
you enter into them. The database does not recognise words or
numbers or dates: when you enter data into a table, all the
database understands is that a field contains a value which
consists of a collection of characters organised in a particular
order (something that is known as a ‘string’). When two strings
are identical, such as might occur with the values appearing in
the Surname field of a ‘Person’ table across two records, then
the database understands that these two pieces of data are the
same and have the same meaning. If two strings are not identical,
then the database does not recognise them as meaning the same –
even if to the historian it is clear that ‘Jno’,
‘John’ and ‘Johan~’ are all ways of representing
the same piece of information.

For example (image F1i), when running a query in a database
containing the names of members of a London Livery Company from
the sixteenth to twentieth centuries, one would be able to search
for everyone named John Smith:[1]

F1i – Query results of a search for individuals called John Smith
(using standardised form of name)

As we see, there are 275 individuals called John Smith in the
database. We only know this, however, because we chose to run the
query using criteria in the Sname field – the field which
contains the surname of individuals – and the StdFname field –
the field which contains the standardised version of the
forename. If we ran the same query with the criteria “John” and
“Smith”, but applied to the non-standardised version of the
forename (in the field Fname), we would retrieve a smaller number
of records:

F1ii – Query results of a search for individuals called John
Smith (using non-standardised form of name)

This time only 169 records are retrieved. The difference between
the two sets of results arises from the variant spelling of the
forename ‘John’ (which can be seen in the first image). When we
searched for ‘John’ in the standardised field, everyone called
John had the same value; when we searched in the non-standardised
field where the original spelling of the name had been entered,
the record for anyone whose name was not spelt exactly as “JOHN”
was missed by the query. Whilst the historian knows that all the
variant spellings in effect mean the same thing, the database
cannot recognise this so long as the strings are not identical.
Thus, in order for our queries and analyses to be performed
accurately by the database, we have to make sure that data which
means the same thing can be identified by the database as meaning
the same thing.[2]
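The effect described above can be sketched in miniature. The following sketch uses Python's sqlite3 module; the field names follow the example above, but the data and counts are invented for demonstration:

```python
import sqlite3

# Illustrative sketch: a tiny 'Person' table holding both the
# original spelling (Fname) and a standardised form (StdFname).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (Fname TEXT, StdFname TEXT, Sname TEXT)")
conn.executemany("INSERT INTO Person VALUES (?, ?, ?)", [
    ("JOHN",   "JOHN", "SMITH"),
    ("JNO",    "JOHN", "SMITH"),
    ("JOHAN~", "JOHN", "SMITH"),
])

# Querying the standardised field finds every variant...
std = conn.execute(
    "SELECT COUNT(*) FROM Person WHERE StdFname='JOHN' AND Sname='SMITH'"
).fetchone()[0]
# ...but querying the original field misses any string that is not
# exactly identical to the criterion.
raw = conn.execute(
    "SELECT COUNT(*) FROM Person WHERE Fname='JOHN' AND Sname='SMITH'"
).fetchone()[0]
print(std, raw)  # 3 1
```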

[2] The problem of non-identical strings
not being recognised as meaning the same thing acting as an
impediment to the running of queries can be mitigated to some
extent through the use of various querying tools, but even
then the problem can still lead to inaccurate query results.

F2. Problematic information

There are certain categories of historical information which are
habitually problematic, and unfortunately these tend to be those
subjects that often constitute analytical units, namely
geography, chronology and orthography.

Geographical information

The problem with geographical information as it occurs in
historical sources is that the boundaries of administrative units
overlap and change over time, so that the same physical location
can belong to different counties/parishes/wards/precincts and so
on depending upon the date of the source being consulted.
Obviously this means that if your sources cover a long period of
time, you will need to be aware of what implications any boundary
changes in that period may have for your data. This is
especially true if you are recording data in a hierarchical
fashion: for example, if you have a field in a table for ‘Parish’
and another for ‘County’, with every record given a value
in each field. If the parish of St Harry Potter is
situated in the county of Hogwartshire at the beginning of
the 17th century, then records connected with this
parish would have these two values entered into the respective
fields in the table. If, however, administrative changes in the
18th century alter the county boundaries so that St
Harry Potter suddenly belongs to the county of
Elsewhereshire, then the records will have the values of
St Harry Potter in the parish field, and
Elsewhereshire in the county field. Whilst this is
accurate, it causes a problem for the database, in that you will
have a number of records with the same string in the ‘Parish’
field – which will therefore be recognised by the database as
meaning exactly the same thing – but which historically speaking
have different meanings at different points in time.

In this instance there are two ways of dealing with this issue.
Firstly, you simply stay aware of the problem, and when running
queries on parishes you take the ‘County’ field into account as
well as the ‘Parish’ field. This will enable you to specify which
version of the parish of St Harry Potter you are
analysing. Secondly, you could modify the Parish value to specify
which version it is, so instead of entering St Harry
Potter, you could enter St Harry Potter: Hogwartshire
or St Harry Potter: Elsewhereshire into the Parish field.
This would simplify the complication of running queries in this
situation, but it would technically break the database rule about
‘atomic values’ (see
Section C5, Rule no.9).
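The first approach might be sketched as follows, again using Python's sqlite3; the parish and county names follow the hypothetical example above, and the records themselves are invented:

```python
import sqlite3

# Sketch of the first approach: distinguish two historical versions
# of a parish by querying the 'County' field alongside 'Parish'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Record (Parish TEXT, County TEXT, Year INTEGER)")
conn.executemany("INSERT INTO Record VALUES (?, ?, ?)", [
    ("St Harry Potter", "Hogwartshire",   1610),
    ("St Harry Potter", "Hogwartshire",   1655),
    ("St Harry Potter", "Elsewhereshire", 1760),
])

# The Parish string alone conflates the two versions...
all_rows = conn.execute(
    "SELECT COUNT(*) FROM Record WHERE Parish='St Harry Potter'"
).fetchone()[0]
# ...adding the County criterion isolates the 17th-century parish.
early = conn.execute(
    "SELECT COUNT(*) FROM Record WHERE Parish='St Harry Potter' "
    "AND County='Hogwartshire'"
).fetchone()[0]
print(all_rows, early)  # 3 2
```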

This particular problem is even more significant when it is not
just the geographical boundaries that change, but when the actual
entities themselves change. For example, in the early part of
the 17th century London had over 100 parishes,
many of them absolutely tiny in terms of area and population.
After the Great Fire, the opportunity was taken to rationalise
the parishes, with the result that many were merged or united,
often with the newly created entity retaining the name of one of
the pre-Fire parishes, whilst each parish still maintained its
own existence for some administrative purposes (e.g. St Martin
Ironmonger Lane and St Olave Jewry). Here the problem is not one
of changing hierarchy (which parish belongs to which county), but
one of meaning (what area/population is the source referring to
at this date when referring to ‘St Martin Ironmonger’?). Various
approaches to solving this are used, including that for the
preceding example, but what is most important is to be clear in
the data at all times precisely what is meant by the geographical
terms you enter into the database.

Chronological/dating information

All of the possible problems created by shifting geographical
terminology apply to the identification of dates in historical
data. This is clearly a more serious issue the further back in
history your sources were generated, when calendars and dating
systems were more varied and plentiful, and record-keepers had a
wider choice between them. The
important thing to remember here, as with geography (and indeed
everything else entered into the database), is that the database
does not recognise meaning. The database will have no concept of
when the ‘Friday after the Feast of the Decollation of St
John the Baptist in the thirty-first year of Henry III’
was,[1] which means that this date, as a
value, cannot be treated chronologically by the database (that
is, sorted or queried by date). Regnal years, mayoral years,
feast days, the days of fairs and markets etc. when used to date
information in the sources will need to be converted into a value
that uses an actual modern date format. Alongside this there is
of course the issue of the shift from Julian to Gregorian
calendars, so that if your data spans 1752 you will need to
convert dates into either the Old Style or the New Style system.[2]

Do not forget the datatype of the field into which dating
information will be entered (see
Section C5), bearing in mind that ‘Text’ datatype fields will
sort dates alphabetically whereas ‘Date/Time’ datatype fields
will sort them chronologically.
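The difference between the two sort orders can be demonstrated in miniature with a Python sketch (the dates themselves are invented for illustration):

```python
from datetime import date

# The same three dates, once as genuine date values and once as the
# text strings that a 'Text' datatype field would hold.
dates = [date(1665, 9, 2), date(1247, 8, 30), date(1752, 1, 1)]
as_text = ["2/9/1665", "30/8/1247", "1/1/1752"]

# Alphabetical (string) sorting compares character by character, so
# "1/1/1752" comes first and "30/8/1247" comes last...
print(sorted(as_text))   # ['1/1/1752', '2/9/1665', '30/8/1247']
# ...whereas chronological sorting correctly puts 1247 first.
print(sorted(dates))
```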

Orthography/variant forms

This is the really big area in which historical sources provide
information that is problematic for the database: how do you deal
with information that appears with many different spellings or in
entirely different forms when in reality it means the same thing
(or at least you wish to treat it as the same thing)? How will
you deal with contractions and abbreviations, particularly when
they are not consistent in the source? How will you accommodate
information that is incomplete, or is difficult to read or
understand where you are uncertain about its meaning? All of
these issues are practically certain to crop up at some point in
the data entry process, and all of them will need to be addressed
to some extent to prevent problems and inaccuracies arising
during the analysis of your data (for the impact that these
issues have upon querying, for example, join one of our
face-to-face Database courses).

[1] The 30th of August 1247,
approximately.

[2] Note that this does not
necessarily literally mean ‘convert’: it would be
entirely reasonable, if your research required it, to have two
fields to enter date information, one that contained the date
verbatim from the source, and the second into which
the modern rendering could be entered. Querying and sorting
could then take place using the latter field.

F3. Standardisation, classification and coding

The principal way forward for accommodating data containing these
kinds of problems is to build (often quite liberally) a
standardisation layer into the design of the database (see
Section C4) through the use of standardisation,
classification and coding. These three activities are a step
removed from simply gathering and entering information derived
from the sources: this is where we add (or possibly
change) information in order to make retrieving information and
performing analysis easier. We use these techniques to overcome
the problem of information that means the same thing appearing
differently in the database, which prevents the database from
connecting like with like (the fundamental prerequisite for
analysing data). For historians this is a more important step
than for other kinds of database users, because the variety of
forms and ambiguity of meaning of our sources does not sit well
with the exactitude required by the database (as with the example
of trying to find all of our records about John Smith,
Section F1), so that more of a standardisation layer needs to
be implemented.

Standardisation, classification and coding are three distinct
techniques which overlap, and most databases will use a
combination of the three when adding a standardisation layer into
the design:

Standardisation

This is the process of deciding upon one way of representing a
piece of information that appears in the source in a number of
different ways (e.g. one way of spelling place/personal names;
one way of recording dates and so on) and then entering that
standardised version into the table. Consider using
standardisation when dealing with values that appear slightly
different, but mean the same thing - ‘Ag Lab’ and ‘Agricultural
Labour’ as values would be treated very differently by the
database, so if you wanted them to be considered as the same
thing, you would signal this to the database by giving each
record with a variant of this occupation the same standardised
value.
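A minimal sketch of this idea, in Python, might look like the following. The look-up table of variant spellings and the chosen standard form are hypothetical; in practice you would agree these forms before data entry begins:

```python
# Hypothetical look-up of variant occupation strings (based on the
# 'Ag Lab' example) mapped to one agreed standardised form.
STANDARD_FORMS = {
    "Ag Lab": "Agricultural Labourer",
    "Agricultural Labour": "Agricultural Labourer",
    "agr. labourer": "Agricultural Labourer",
}

def standardise(original: str) -> str:
    # Fall back to the original string when no standard form has
    # been agreed yet.
    return STANDARD_FORMS.get(original, original)

# Storing the standardised form alongside the original preserves
# what the source says while letting queries match like with like.
rows = [(occ, standardise(occ)) for occ in ("Ag Lab", "Agricultural Labour")]
print(rows)
```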

Classification

This is the process of grouping together information (‘strings’)
according to some theoretical, empirical or entirely arbitrary
scheme, often using a hierarchical system in order to improve
analytical potential. Classification is about allocating groups,
and then placing your data in those groups. These groups can be
hierarchical, and the hierarchy will let you perform your
analysis at a variety of levels. Classification is less about
capturing the information in your sources and is much more about
serving your research needs.

When using a classification system it is very important to
remember two things: firstly, since it is an arbitrary component
of your database’s Standardisation layer designed to improve your
research analysis, the system does actually need to meet the
requirements you have for it.
Secondly, therefore, the system needs to have been devised before
data entry begins, it needs to be intellectually convincing (at
least as far as your historical methodologies are concerned) and
it needs to be applied within your data consistently.

It is also worth being aware of how other historians have
classified their information. There have been many classification
systems created by the good and the great of the historical
profession,[1] many of which have been used
subsequently by others for two reasons: they allow comparability
between the findings of different projects; and they allow
historians to turn different sources into continuous series
of information. That is, two projects investigating the same
thing at different periods may have to rely on different sources:
by classifying their (probably slightly different) information
into similar classification systems, a case can be made
(convincingly or otherwise) that the research is comparable. This
is not to say that you should necessarily try to adopt an
existing scheme rather than develop one that suits your research
better, but it is worth keeping in mind if you are interested in
comparing your analysis with that of another historian. In
addition, given that classification systems in practice really
only entail adding an extra field in a table into which the
classified value is entered, there is nothing stopping you (other
than perhaps time) from employing more than one classification
system for the same information in the database.

A detailed example of a classification system can be seen in an
ongoing project which is investigating the material aspects of
early modern households, and which uses a database to record
minutely detailed information about material objects. One of the
many ways it treats the information about objects is to classify
objects by type, in order to be able to compare like objects
despite the often substantial differences in the ways they are
referred to in the sources. This works by adding a field in the
table where item type data is recorded into which an ItemClass
code value can be added:

F3i – Data about material objects that have been classified and
coded

The ItemClass field here is populated with codes, and these codes
record precisely what type of item the record is about (you can
see what the source calls the item in the ItemDescr
field).[2] The fact that the code is a numeric
value, and the fact that the same numeric code is applied to the
same type of object regardless of how it is described in the
source, means that the ItemClass field acts as a standardised
value.

Additionally, however, the ItemClass field enables the use of a
hierarchical classification system (to examine a partial sample
of the classification system, download the Microsoft Excel file
Material
Object Type Classification.xls). The hierarchy
operates by describing objects at three increasingly detailed
levels.

To illustrate this we can take the example of how the database
classifies objects that are used for seating:

F3ii–Classification
system for objects in the category of ‘Seating’

You will notice from the
Microsoft Excel spreadsheet that each code level has a two
or three digit numeric code, so Code I: Seating has the numeric
code 05, that for Code II: Chair is 02, and that for Code
III: Wicker Chair is 006. These individual codes are combined
into a single numeric code (in the case of the wicker chair –
0502006) which is the value that gets entered into the relevant
single field (ItemClass) in the record for the wicker chair in
the database.

This may sound complicated and slow to implement, but the benefit
of doing so is considerable. Firstly, the database can be created
so that the codes can be selected automatically rather than
memorised by the database creator, who therefore does not have to
stop to remember or look up what code needs to be entered for any
given object. Secondly, and here is the principal reason for
employing a hierarchical system, once the data have been coded,
they can be analysed at three different semantic levels. The
historian could, if they wished, analyse all instances of wicker
chairs in the database by running queries on all records which
had the ItemClass value “0502006”. Alternatively, if they were
interested in analysing the properties of all the chairs in the
database, they could do so by running queries on all records with
an ItemClass value that begins “0502***”. Lastly, if the point of
the research was to look at all objects used for seating, a query
could be designed to retrieve all records with an ItemClass value
that began “05*****”. This is an incredibly powerful analytical
tool, and one that would be impossible to achieve without the use
of a hierarchical classification system: to run a query to find
all objects used for seating without a classification system
would require looking, by name, for each qualifying object that
the historian can anticipate or remember, taking into account
the variant spellings that might apply.[3]
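The prefix-matching idea can be sketched as follows, once more using Python's sqlite3. Note that standard SQL uses % as the wildcard where Microsoft Access uses *, and that the ItemClass value is stored as text so that leading zeros survive; the codes for the wicker chair and for seating follow the Excel example, while the other codes and item descriptions are invented:

```python
import sqlite3

# Sketch of the three-level ItemClass idea: 05 = Seating,
# 0502 = Chair, 0502006 = Wicker Chair (per the Excel example).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Item (ItemDescr TEXT, ItemClass TEXT)")
conn.executemany("INSERT INTO Item VALUES (?, ?)", [
    ("wicker cheyre", "0502006"),
    ("greate chaire", "0502001"),  # invented code for another chair type
    ("joyned stoole", "0501003"),  # invented code for a stool type
])

# One field answers questions at three levels of detail:
wicker = conn.execute(
    "SELECT COUNT(*) FROM Item WHERE ItemClass='0502006'").fetchone()[0]
chairs = conn.execute(
    "SELECT COUNT(*) FROM Item WHERE ItemClass LIKE '0502%'").fetchone()[0]
seating = conn.execute(
    "SELECT COUNT(*) FROM Item WHERE ItemClass LIKE '05%'").fetchone()[0]
print(wicker, chairs, seating)  # 1 2 3
```

Storing the code as text rather than a number is a deliberate choice here: it preserves the leading zero and makes the prefix queries straightforward.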

Hierarchical classification systems are very flexible things as
well. They can include as many levels as you require to analyse
your data, and they do not need to employ numeric codes when
simple standardised text would be easier to implement.[4]

Coding

Coding is the process of substituting (not necessarily literally)
one value for another, for the purpose of recording a complex and
variable piece of information through a short and consistent
value. Coding is often closely associated with classification,
and in addition to saving time in data entry (it is much quicker
to type a short code word than it is to type five or six words)
codes additionally act as standardisation (that is, the same form
[code] is entered for the same information no matter how the
latter appears in the source).

These techniques are implemented to make the data more readily
useable by the database: the codes, classifications and
standardised forms which are used are simple and often easier to
incorporate into a query design than the complicated and
incomplete original text strings that appear in the source; but
more importantly, they are consistent, making them much
easier to find. However there are a number of things to bear in
mind when using them, the most important of which is that there are
two ways of applying these techniques:

By replacing original values in the table with
standardised/coded/classified forms

By adding standardised/coded/classified forms into the
table alongside the original values

Both of these approaches present a trade-off between maintaining
the integrity of the source and improving the efficiency of the
potential analysis, in much the same way as the choices offered
as part of the design process when selecting the Source- or
Method-oriented approach to the database (see
Section C3). The first approach to standardising, to replace
the original version of source information in any chosen field(s)
with standardised forms of data, enables the speeding up of data
entry at the expense of losing what the source says. It also
serves as a type of quality control, as entering standardised
data (especially if controlled with a ‘look-up list’) is less
prone to data entry errors than the original forms that appear in
the source.

The second approach, to enter standardised values in addition to
the original forms, allows for the best of both worlds: you
achieve the accuracy and efficiency benefits of standardisation
without losing the information as it is presented in the source.
Of course, this happens at the cost of extra data inputting time,
as you enter material twice.

When considering both approaches, bear in mind that you will only
need to standardise some of the fields in your tables, not every
field in every table. The candidates for standardising,
classifying and coding are those fields that are likely to be
heavily used in your record-linkage or querying, where being able
to identify like with like values is important. Creators of
databases built around the Source-oriented principle should
exercise particular caution when employing these techniques.

[1] See for example that developed for
household types in The population history of England,
1541-1871: a reconstruction (1981) by E.A. Wrigley and
R.S. Schofield; or the ongoing HISCO project to develop an
international classification system for occupations,
available at http://historyofwork.iisg.nl/
(accessed 23/03/2011).

[2] Note in passing that many of the other
fields in this example contain codes as well – this table
contributes substantially to the database’s Standardisation
layer.

[3] It would, for example, need to look for
all stools, buffet stools, wicker chairs, forms, settles,
benches etc., leading to extremely complicated queries with
possibly more criteria than the database can handle. For
criteria in queries please sign up to one of our
face-to-face Database courses.

[4] Indeed numeric codes are somewhat old
fashioned in modern database usage, although they are no less
efficient for being outmoded.

G. Conclusion

If you apply the principles and techniques discussed in the
design of your database, you may well find that you spend a
considerable amount of time in the process. Unfortunately there
is no getting around this: designing databases simply is a
time-consuming business, especially if you have adopted a
Source-oriented approach and you are working with a range of
different, rich and complex sources. However, the time you spend
working on the design will be more than repaid when it comes to
the data entry and data analysis stages of the database project –
and this cannot be overstated. Historical sources will give rise
to all manner of complications and problems, both intellectual
and in terms of the mechanics of databases, and the more you can
anticipate these and accommodate them in the design of the
database, the more efficient and less frustrating the subsequent
use of the database will be.

Before you begin the process of designing your database, and
producing your Entity Relationship Diagram, it is worth spending
a little time seeing how other historians have designed their
databases (see the resources listed in the Further
Reading section). You should also read through the other HRH
Handbooks on Databases by Mark Merry, as these describe in detail
the processes of building databases and performing analysis
respectively; it is entirely necessary to see what is required
in order for these processes to work smoothly, so that the design
can take these requirements into account from the very
beginning.

Finally, it is worth reiterating that designing databases is
difficult, and there is no substitute for practice. No database
is ever perfect, and the only indicator of quality, or success,
when it comes to database design is whether or not it serves the
various functions that you intended. If you can manage the
information from your sources in the way that you need, and if
you can perform the analysis that you require, and if you can be
as flexible as you need in both of these areas, then your design
is successful. But you do not have to wait until the latter
stages of your database use to find out how successful you have
been in the design – you can and should test the design of the
database very early on. After producing your Entity Relationship
Diagram, build a structural prototype of your database (that is,
with only the tables and relationships, without worrying too much
about the other tools that go into creating the database
application) and spend a week entering data. If you are using
multiple sources, enter material from each of the sources. As
soon as you start entering data you will very quickly begin to
see where any deficiencies in the design might be – look out for:

Information that you would like to analyse which appears
repeatedly, but for which you have nowhere specific to put it
(i.e. you will need to add new fields)

Information that repeats from record to record, which means you
will need to think about re-ordering your relationships to
prevent this (see
Section D)

Datatypes that prove unhelpful, which should be changed

Data that could be standardised or classified

Information that you had not anticipated when
designing the database

It is likely that you will find examples of all of these in a
very short space of time. Once you have spent some time entering
data, design and run some queries to test whether or not the
research questions you know you will want answers to can actually
be answered by the current design. Running
queries is the ultimate test of whether the database design works
or not, and it is likely that you will find yourself rearranging
fields in the light of what you learn. The queries will also
highlight (often starkly) how much standardising of information
you will need to engage in.

Once you have finished this testing, and moved on to design and
rebuild ‘Version 2’ of the database, you will be well on the way
to creating one of the most powerful research tools available to
the historian. It will be a struggle to begin with, but it will
be worth it in the end!