Scenario A: all information in one table

In this scenario, you might design a table with the following
fields to capture all of the people- and hat-related information
available from your sources: Note that database design
principles say that you should not combine different entities
into a single table (in this case ‘people’ and ‘hats’ being the
entities), as this confuses the underlying ‘meaning’ of your
data. Good practice would say: hat owners are one entity, hats
are another, so create different tables for them. However, it is
possible to combine entities into single tables, and for our
purposes here it is useful to see the consequences of doing so.
For more on entities see
Section E2.

PersonID

Forename

Surname

Occupation

Residence

Age

Gender

HairColour

Income

HatID

HatType

HatColour

HatMaterial

Entering a number of records into this table would result in
data that resembled:

PersonID

Forename

Surname

Occupation

Residence

Age

Gender

HairColour

Income

HatID

HatType

HatColour

HatMaterial

1

Samuel

Spade

Builder

Peckham

23

M

Blonde

£25,000

1

Baseball cap

Black/red

Polyester

2

Philip

Marlowe

Accountant

Dulwich

35

M

Dappled

£34,533

2

Bowler

Pink

Satin/fur

3

Hercule

Poirot

Plumber

Chelsea

26

M

Bald

£250,000

3

Captain’s hat

White/blue

Silk/braid

4

Miss

Marple

Doctor

Surbiton

58

F

Grey

£1.50

4

Nightcap

Grubby yellow

Rags

With a table like this we would be able to perform some
sophisticated analysis on types of people, types of hat,
materials used, the correlation between income, age and hat
type, occupational and gender distribution of hat types and so
on, which would obviously be of enormous benefit to historians
interested in this kind of ressearch.

As we continue to examine our sources, we find that some people
own more than one hat. This is a one-to-many relationship: one
person may have more than one hat (for one-to-many and other
types of relationship see
Section D3). So, in this database design scenario we might
find records appearing in the table in the following manner:

PersonID

Forename

Surname

Occupation

Residence

Age

Gender

HairColour

Income

HatID

HatType

HatColour

HatMaterial

1

Samuel

Spade

Builder

Peckham

23

M

Blonde

£25,000

1

Baseball cap

Black/red

Polyester

2

Philip

Marlowe

Accountant

Dulwich

35

M

Dappled

£34,533

2

Bowler

Pink

Satin/fur

2

Philip

Marlowe

Accoumtant

Dulwich

35

M

Dappled

£34,533

3

Crash helmet

Green

Kevlar/fur

3

Hercule

Poirot

Plumber

Chelsea

26

M

Bald

£250,000

4

Stetson

Brown

Leather

3

Hercule

Poirot

Plumber

Chelsea

26

M

Bald

£250,000

5

Captain’s hat

White/blue

Silk/braid

4

Miss

Marple

Doctor

Surbiton

58

F

Grey

£1.50

6

Nightcap

Grubby yellow

Rags

As you can see the table now has six records rather than four,
and we have entered two of our people twice (Marlowe and Poirot).
The reason we have entered these two twice is because each of
them owns two hats, as you can see from the hat-related fields.
As a result of our lucky multiple-hat owning individuals, nine of
our fields have duplicated values in them, and this is both a
problem, and a clue to the fact that this table might be better
designed as part of a relational data model. Duplicating
information across records in this way should be avoided for a
number of reasons. Firstly, data entry is time consuming enough
without having to enter the same information on more than one
occasion. Secondly, the more times you enter the same piece of
information into the database the more scope there is for
entering something incorrectly, as we have done here with
Marlowe’s occupation value.

This particular error could have been avoided through the use
of a variety of tools within the database that are designed to
mitigate data entry errors.

This may see a trivial point, but given the way that databases
connect information, the different spelling of ‘accountant’ will
mean that the database will recognise the two records as
belonging to entirely different individuals.

But the most serious problem that this duplication of
information raises is a third problem – which is that this will
adversely affect some types of analysis by providing false
results to queries. If we wanted to we could design a query to
answer the question: ‘how many fur hats were owned by
accountants?’ The query would look at the data, identify which
records were about fur hats and which were about accountants, and
then count the number of records which were about both, and this
would be the answer to the question. In this case the query would
tell us that there were two fur hats owned by accountants, and
this would be correct. If the question we wanted an answer to
was: ‘how many accountants owned fur hats?’, then the query would
operate in exactly the same way – count how many records of fur
hats and accountants, and tell us how many records had both. Thus
the answer would again be ‘two’, but in this instance, however,
that would be inaccurate, as only one individual who was an
accountant owned fur hats: he appears twice in the records and is
therefore counted twice, erroneously, when he should only have
been counted once.