Designing Databases for Historical Research

E. Entity relationship modelling

E2. Entity Relationship Modelling step-by-step

Entity Relationship Modelling (ERM) is where you model the
information from your sources to fit into a structure that will
form the basis of your database design. The product of ERM is an
Entity Relational Diagram (ERD), which will depict three
components: entities (tables), attributes (fields) and the
relationships between them.

Stage 1: Determine the purpose of the database

This stage is always the starting point of the ERM process, and
is especially important if you are adopting the Method-oriented
approach to your design, but it is a vital step in designing any
database. No database exists without a purpose: with a
Method-oriented database it is crucial that you know
exactly what you want to do with your data before you
design the database. At this stage you will need to make your
decisions about what information you want to keep and what you
want to discard, and you will need to be prepared to abide by the
consequences of this decision throughout the lifecycle of your
database project. This point deserves to be made emphatically:
although it is always possible to retrofit the design of your
database to include information that you had initially decided to
discard, it is not always a trivial matter to do so, particularly
if you have to enter another phase of data entry to collect the
new information.

Stage 2: List entities

Once you know what you want your database to do, you need to
divide your anticipated information into discrete subjects: each
subject, or entity, will evolve into a separate table.[1] We split the entities into separate tables for the
purposes of efficiency, for avoiding ambiguity, and because this
allows us the maximum flexibility for querying once the data are
in.

This stage of the ERM process sounds deceptively simple, but is
in fact probably the most difficult step of the whole process,
and it is certainly worth reading up on the subject (see the
Further
Reading section).

For example, consider a research project that was investigating
political elections in eighteenth-century Bristol, and the
sources consisted of a collection of poll books which recorded
the votes cast by the local electorates in the communities where
elections took place (image E2i):

With sources such as these we might pursue a research question
which was something like: ‘Analyse the geographic and economic
determinants of voting in early 18th century Bristol’.
With a question like this we would be interested in geography,
economic status and voting patterns in relation to a group of
individuals. In terms of entities, we might conclude that
there is only one: people, actually, more precisely, we would be
considering voters, which would lead us to the position of
deciding that we would need a table into which we would enter our
information about voters.

Alternatively if we were using probate materials for our
research, and wanted to create a database into which we could
enter information extracted from wills, we would need to consider
the entities from this source. We might conceive of our
entities – our discrete subjects – as breaking down into
‘document’, with each will being treated as a different document;
‘person’, containing information about different types of people
and their roles – testator, recipient, executor etc.; ‘bequest’
with a range of types of associated information; and ‘object’,
being the object of a bequest. If our research was interested in
the material culture of a period or place, this latter entity
would be particularly important, whereas if the project was
concerned only with networks of people and social interrelations,
the ‘object’ entity might not be necessary.

In both these examples the possibility of these entities
being identified has been stressed, because the definition of
entities is as much a product of the purpose of the database as
it is of the information contained in the sources. As mentioned
at the beginning of this Handbook, no two databases will
be designed in the same way, as no two databases will be built
for the same purposes – different designers may well identify
different entities based upon their unique appreciation of their
research and their sources. And this is why to a large extent
this step of the ERM process is the most difficult!

However it is perhaps worth considering the inclusion of three
commonly chosen entities in the design of your database:

People - with a related entity of
‘role’ (being the reason why they are present in the source)

Document – where archival and bibliographical material can be
entered (and thus enabling the tracking of every piece of data in
the database to its source)

Event – a slightly more abstract entity, one which describes
an instance of whatever it is your source records (a trial, a
taxation assessment, an election etc.) and where information
about dates can be recorded.

Exercise

Assume you have a research project that is examining the
demographic profile of families in a particular parish, and your
principal source will be ecclesiastical registers of baptisms,
marriages and burials. List the entities from this source that
would need to be included in your database design.[3]

Stage 3: Identify the relationships between the entities

Bearing in mind the nature of relationships between different
elements of information (see
Section D), this step of the ERM process requires you to
identify which of your entities are related, and what type of
relationship exists between them. This is an exercise in abstract
logic, and will take considerable practice: in addition, quite
often, this stage will require revisiting stage 2 and redefining
the entities you originally chose.

If we return to the database of wills mentioned in stage 2 with
the entities ‘document’, ‘person’, ‘bequest’ and ‘object’, we
would need to unpick the nature of the relationships between
these entities. We might logically decide that the relationships
would look something like this (the arrowheads depict the ‘many’
side of a one-to-many relationship (see
Section D):

E2ii –Example of relationships identified between entities
(wills)

A single document (will) can contain information about more than
one person, and also about more than one bequest, whilst a
bequest can include information about more than one object, so
all of these relationships are one-to-many.

Exercise

Identify the relationships between the entities chosen for the
parish register database in Stage 2.

Stage 4: Investigate and rectify problems

This stage is fairly self-explanatory. It is possible to spot
problems with the incipient design even at this relatively early
point in the ERM process, and if they exist it is better to do so
here than after investing work in the later stages.

Look out in particular for:

Relationships which do not appear to be one-to-many: remember
you cannot have entities related by a many-to-many relationship,
and whilst you can have them related through a one-to-one
relationship, it may be worth rethinking the two entities
involved (see
Section D)

Redundant relationships: if entities can be linked in more
than one way, you should work out which link should be kept and
which should be discarded – if Table A is related to Table B, and
Table B is related to Table C, then Tables A and C are already by
definition related, and do not need a ‘direct’ relationship to
exist between the two

Stage 5: List all attributes associated with each entity, and
identify keys

This stage involves listing the precise attributes of each entity
that has been identified in the previous stages of the ERM
process, by deciding on the fields that should occur in each
table. Each field is to contain one piece of information about
the subject of the entity, and that one piece will be a single
aspect of information that can be known about the entity (see
Section C5). Once you have listed the attributes for each
entity, you must then identify which fields will act as the
primary and foreign keys in each table, remembering that
relationships between tables do not exist between tables, but
between a specific field in one table and a specific field in the
related table (see
Section D).

E2iii –Example of attributes and keys identified within entities
(wills)[4]

You should always as a matter of course add a generic ‘Notes’
field into each table, with the memo datatype, as it is likely to
prove invaluable at moments of crisis when entering data (see
Section F)!

Stage 6: Construct the Entity Relationship Diagram (ERD)

Once you have completed stage 5, you are in a position to create
the ERD for your database design, which will resemble the heavily
simplified example above (image E2iii).

Exercise

Construct an ERD for the parish register database, ensuring that
you have identified your entities and their attributes, the
relationships between the entities and the fields which act as
keys.

[1] In fact they may evolve into separate
groups of related tables, but for the moment this does not
need to be considered.

[2] An Exact List of the Votes of the
Freeholders and Freemen of the City and County of Bristol
taken at the Election of Members of Parliament (Bristol,
1722) p.19.

[3] If you are not familiar with parish
registers as sources, it is possible to find examples of
transcripts (of varying quality) online. One such set of
transcripts, for the county of Hampshire, can be found at
http://www.knightroots.co.uk/parishes.htm
(accessed 23/03/2011).

[4] Terms in black are entity (table
names), those in red are primary keys, those in green are
foreign keys and those in grey are the remaining
attributes/fields, some of which pertain to the database’s
Source layer, and some to its Standardisation layer. It is
very important to note that this diagram is entirely
illustrative, and is not intended to be prescriptive or
definitive in any way!