Designing Databases for Historical Research

F. Problems facing the historian

F2. Problematic information

There are certain categories of historical information which are
habitually problematic, and unfortunately these tend to be those
subjects that often constitute analytical units, namely
geography, chronology and orthography.

Geographical information

The problem with geographical information as it occurs in
historical sources is that the boundaries of administrative units
overlap and change over time, so that the same physical location
can belong to different counties/parishes/wards/precincts and so
on depending upon the date of the source being consulted.
Obviously this means that if your sources cover a long period of
time, you will need to be aware of the implications that any
boundary changes in that period may have for your data. This is
especially true if you are recording data in a hierarchical
fashion: for example, if you have a field in a table for ‘Parish’
and another for ‘County’, and every record is given a value
in each field. If the parish of St Harry Potter is
situated in the county of Hogwartshire at the beginning of
the 17th century, then records connected with this
parish would have these two values entered into the respective
fields in the table. If, however, administrative changes in the
18th century alter the county boundaries so that St
Harry Potter suddenly belongs to the county of
Elsewhereshire, then records made after the change will have the
value St Harry Potter in the parish field, and
Elsewhereshire in the county field. Whilst this is
accurate, it causes a problem for the database: you will have a
number of records with the same string in the ‘Parish’ field,
which the database will therefore treat as meaning exactly the
same thing, but which historically speaking have different
meanings at different points in time.

In this instance there are two ways of dealing with this issue.
Firstly, you simply stay aware of the problem, and when running
queries on parishes you take the ‘County’ field into account as
well as the ‘Parish’ field. This will enable you to specify which
version of the parish of St Harry Potter you are
analysing. Secondly, you could modify the Parish value to specify
which version it is, so instead of entering St Harry
Potter, you could enter St Harry Potter: Hogwartshire
or St Harry Potter: Elsewhereshire into the Parish field.
This would make running queries in this situation simpler, but it
would technically break the database rule about ‘atomic values’
(see Section C5, Rule no. 9).
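
To make the first approach concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, field names and years are hypothetical, and the parish and counties are the invented examples used above; the point is only that a query on ‘Parish’ alone merges both versions of the parish, whereas adding ‘County’ to the criteria separates them.

```python
import sqlite3

# Hypothetical table and field names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "id INTEGER PRIMARY KEY, year INTEGER, parish TEXT, county TEXT)"
)
conn.executemany(
    "INSERT INTO records (year, parish, county) VALUES (?, ?, ?)",
    [
        (1605, "St Harry Potter", "Hogwartshire"),
        (1642, "St Harry Potter", "Hogwartshire"),
        (1731, "St Harry Potter", "Elsewhereshire"),
    ],
)

# Querying on 'parish' alone treats all three records as the same place.
all_versions = conn.execute(
    "SELECT COUNT(*) FROM records WHERE parish = ?",
    ("St Harry Potter",),
).fetchone()[0]

# Adding 'county' to the criteria selects one version of the parish.
hogwartshire_only = conn.execute(
    "SELECT COUNT(*) FROM records WHERE parish = ? AND county = ?",
    ("St Harry Potter", "Hogwartshire"),
).fetchone()[0]

print(all_versions, hogwartshire_only)
```

Because the two fields stay separate, each value remains atomic; the cost is that you must remember to include the county every time you query by parish.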

This particular problem is even more significant when it is not
just the geographical boundaries that change, but when the actual
entities themselves change. For example, 17th-century
London had over 100 parishes in the early part of the century,
many of them absolutely tiny in terms of area and population.
After the Great Fire, the opportunity was taken to rationalise
the parishes, with the result that many were merged or united,
often with the newly created entity retaining the name of one of
the pre-Fire parishes, whilst each parish still maintained its
own existence for some administrative purposes (e.g. St Martin
Ironmonger Lane and St Olave Jewry). Here the problem is not one
of changing hierarchy (which parish belongs to which county), but
one of meaning (what area/population is the source referring to
at this date when referring to ‘St Martin Ironmonger’?). Various
approaches to solving this are used, including those described
for the preceding example, but what is most important is that the
data make clear at all times precisely what is meant by the
geographical terms you enter into the database.

Chronological/dating information

All of the possible problems created by shifting geographical
terminology apply to the identification of dates in historical
data. This is clearly a more serious issue the further back in
history your sources were generated, when calendars and dating
systems were more varied and plentiful, and record-keepers had
more choice in which dating system to use. The
important thing to remember here, as with geography (and indeed
everything else entered into the database), is that the database
does not recognise meaning. The database will have no concept of
when the ‘Friday after the Feast of the Decollation of St
John the Baptist in the thirty-first year of Henry III’
was,[1] which means that this date, as a
value, cannot be treated chronologically by the database (that
is, sorted or queried by date). Regnal years, mayoral years,
feast days, the days of fairs and markets etc., when used to date
information in the sources, will need to be converted into a value
that uses an actual modern date format. Alongside this there is
of course the issue of the shift from the Julian to the Gregorian
calendar, so that if your data spans 1752 you will need to
convert dates into either the Old Style or the New Style system.[2]
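
As one illustration of the year conversion, the sketch below assumes an English source that begins the year on 25 March (the usual reckoning in England before 1752) and returns the year number as it would be reckoned from 1 January. The function name is hypothetical, and only the year is adjusted: the difference in day count between the Julian and Gregorian calendars is deliberately left alone.

```python
def new_style_year(old_style_year: int, month: int, day: int) -> int:
    """Return the year number reckoned from 1 January.

    Assumption: the source starts each year on 25 March, so dates from
    1 January to 24 March carry the previous year's number. The last
    English civil year to begin on 25 March and run into the following
    spring was 1750; 1751 ended on 31 December.
    """
    before_25_march = (month, day) < (3, 25)
    if old_style_year <= 1750 and before_25_march:
        return old_style_year + 1
    return old_style_year


# A source dated 30 January 1648 falls in 1649 by modern reckoning.
print(new_style_year(1648, 1, 30))  # 1649
# A date after 25 March keeps its year number.
print(new_style_year(1648, 6, 1))   # 1648
```

Sources from other countries, or from institutions that began the year on a different day, would need a different rule, so check the convention your record-keepers actually followed before applying anything like this.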

Do not forget the datatype of the field into which dating
information will be entered (see
Section C5), bearing in mind that ‘Text’ datatype fields will
sort dates alphabetically whereas ‘Date/Time’ datatype fields
will sort them chronologically.
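
The difference is easy to see in a small sketch. Here Python strings stand in for a ‘Text’ field and Python date values stand in for a ‘Date/Time’ field; the three dates are hypothetical and are typed day/month/year in the text version.

```python
from datetime import date

# Hypothetical dates, written day/month/year as they might be typed in.
as_text = ["12/03/1650", "02/11/1649", "25/12/1648"]

# The same three dates as date values (year, month, day).
as_dates = [date(1650, 3, 12), date(1649, 11, 2), date(1648, 12, 25)]

# Text is compared character by character, so the day number decides.
text_order = sorted(as_text)

# Date values are compared by their position in time.
date_order = sorted(as_dates)

print(text_order)
# ['02/11/1649', '12/03/1650', '25/12/1648']
print([d.isoformat() for d in date_order])
# ['1648-12-25', '1649-11-02', '1650-03-12']
```

The text sort puts the earliest date (25 December 1648) last, simply because ‘25’ comes after ‘02’ and ‘12’ alphabetically.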

Orthography/variant forms

This is the really big area in which historical sources provide
information that is problematic for the database: how do you deal
with information that appears with many different spellings or in
entirely different forms when in reality it means the same thing
(or at least you wish to treat it as the same thing)? How will
you deal with contractions and abbreviations, particularly when
they are not consistent in the source? How will you accommodate
information that is incomplete, or that is difficult to read or
understand, leaving you uncertain about its meaning? All of
these issues are practically certain to crop up at some point in
the data entry process, and all of them will need to be addressed
to some extent to prevent problems and inaccuracies arising
during the analysis of your data (for the impact that these
issues have upon querying, for example, join one of our
face-to-face Database courses).
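
One possible way of handling variant spellings, by analogy with keeping a verbatim date alongside a modern one, is to keep the form found in the source and record a standardised form next to it. The sketch below is only an illustration of that idea: the surname variants, the standard form and the function name are all invented, and a real project would need to decide its own standard forms.

```python
from typing import Optional

# Hypothetical variant spellings mapped to an invented standard form.
STANDARD_FORMS = {
    "smith": "Smith",
    "smyth": "Smith",
    "smythe": "Smith",
}


def standardise(verbatim: str) -> Optional[str]:
    """Return the standard form for a spelling, or None if it is unknown.

    Unknown or doubtful readings are left as None so that they can be
    reviewed by hand rather than silently guessed at.
    """
    return STANDARD_FORMS.get(verbatim.strip().lower())


# Each record keeps the source spelling and gains a standardised value.
source_spellings = ["Smyth", "Smythe", "Smyt?"]
records = [(spelling, standardise(spelling)) for spelling in source_spellings]

print(records)
# [('Smyth', 'Smith'), ('Smythe', 'Smith'), ('Smyt?', None)]
```

Queries can then group or search on the standardised value, while the verbatim spelling remains available for checking against the source.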

[1] The 30th of August 1247,
approximately.

[2] Note that this does not
necessarily literally mean ‘convert’: it would be
entirely reasonable, if your research required it, to have two
fields for date information, one containing the date
verbatim from the source, and a second into which
the modern rendering could be entered. Querying and sorting
could then take place using the latter field.