Chapter 3: Sources of Inaccurate Data

Before we can assess data
correctness we need to understand the various ways inaccurate values get into
databases. There are many sources of data inaccuracies, and each contributes
its own part to the total data quality problem. Understanding these sources
will demonstrate the need for a comprehensive program of assessment,
monitoring, and improvement. Having highly accurate data requires attention to
all sources of inaccuracies and appropriate responses and tools for
each.

Figure 3.1 shows the four general areas where inaccuracies occur. The first
three cause inaccuracies in data within the databases, whereas the fourth area
causes inaccuracies in the information products produced from the data. If you
roll up all potential sources of errors, the interesting conclusion is that the
most important use of the data (corporate decision making) is based on the
rendition of the data that has the most inaccuracies.

Figure 3.1: Areas where inaccuracies
occur.

3.1 Initial Data Entry

Most people assume that data
inaccuracies are always the result of entering the wrong data at the beginning.
This is certainly a major source of data inaccuracies but not the only source.
Inaccurate data creation can be the result of mistakes, can result from flawed
data entry processes, can be deliberate, or can be the result of system errors.
By looking at your systems through these four categories, you can gain insight
into whether they are designed to invite inaccurate data or to
promote accurate data.

Data Entry Mistakes

The most common source of a data inaccuracy is that the person
entering the data just plain makes a mistake. You intend to enter
blue but enter bleu instead; you hit
the wrong entry on a select list; you put a correct value in the wrong field.
Much operational data originates from a person. People make mistakes; we
make them all the time. It is doubtful that anyone could fill out a
hundred-field form without making at least one mistake.

A real-world example involves an automobile damage claims database
in which the COLOR field was entered as text. Examination of the content of
this field yielded 13 different spellings for the word
beige. Some of these mistakes were typos.
Others occurred because the entry person did not know how to spell the word. In
some of the latter cases, the person thought they knew the spelling, whereas
in others they were unable or unwilling to look it up.

Flawed Data Entry Processes

A lot of data entry begins with a form. A person completes a form
either on a piece of paper or on a computer screen. Form design has a lot to do
with the amount of inaccurate data that ends up in the database. Form design
should begin with a basic understanding of quality issues in order to avoid
many of the mistakes commonly seen. For example, having someone select from a
list of valid values instead of typing in a value can eliminate the
misspellings previously cited.
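
For data that has already been captured as free text, the same list of valid
values can be used to clean it up. The following is a minimal sketch in Python,
assuming a hypothetical list of valid colors and using the standard library's
difflib module to map a misspelled entry to its closest valid value. The list,
the function name normalize_color, and the 0.6 similarity cutoff are
illustrative assumptions, not part of the claims database described above.

```python
import difflib

# Hypothetical controlled vocabulary for a COLOR field (an assumption for
# illustration; a real list would come from the business definition).
VALID_COLORS = ["beige", "black", "blue", "green", "red", "silver", "white"]


def normalize_color(raw, valid=VALID_COLORS, cutoff=0.6):
    """Return the valid value closest to the raw entry, or None if none is close."""
    cleaned = raw.strip().lower()
    if cleaned in valid:
        return cleaned
    # get_close_matches ranks candidates by similarity and drops any whose
    # similarity ratio falls below the cutoff.
    matches = difflib.get_close_matches(cleaned, valid, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With this sketch, normalize_color("bleu") returns "blue", and an entry that
resembles nothing in the list returns None so that a person can review it
rather than have the program guess. A select list at entry time remains the
better fix, because it prevents the misspelling instead of repairing it.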

Another common problem is having fields on the form that are
confusing to the user, which often leads them to enter wrong information. The
confusion may lie in the field itself: if it is not
commonly understood, or if the database definition is unconventional, the form
needs to guide the user through entering values into
the field. Sometimes the confusion lies in the way the field is described in its
identifying text or in its positioning on
the form. Form design should always be subjected to rigorous quality testing to
find the fields for which a typical user would have difficulty knowing what to
enter.

Data entry windows should have instructions available as HELP
functions and should be user friendly in handling errors. Frustration in using
a form can lead to deliberate mistakes that corrupt the database.

Forms are better completed by a trained entry person than by a
one-time user. This is because the entry person can be taught how things should
be entered, can become proficient in using the form mechanisms, and can be
given feedback to improve the efficiency and accuracy of the data. A one-time
user is always uncertain about what they are supposed to do on the form.
Unfortunately, our society is moving by way of the Internet toward eliminating
the middle person in the process and having end users complete forms directly.
This places a much higher demand on quality form design.

The data entry process includes more than the forms that are filled
out. It also includes the process that surrounds them. Forms are completed at a
specific point or points in a process. Sometimes a form must be
completed before all of the information is known or easily obtained
at that point in the process. This will inevitably lead to quality
problems.

A data entry process I helped design a number of
years ago for military repair personnel illustrates the types of
problems that can occur in data collection. The U.S. Navy has a database that
collects detailed information on the repair and routine maintenance performed
on all aircraft and on all major components of every ship. This database is
intended to be used for a variety of reasons, from negotiating contracts with
suppliers, to validating warranties, to designing new aircraft and
ships.

When an aircraft carrier is in a combat situation, such as in
Kuwait and Afghanistan, repairs are being made frequently. The repair crews are
working around the clock and under a great deal of pressure to deal with a lot
of situations that come up unexpectedly. Completing forms is the least of their
concerns. They have a tendency to fix things and do the paperwork later. The
amount of undocumented work piles up during the day, to be completed when a
spare moment is available. By then the repair person has forgotten some of the
work done or the details of some of the work and certainly is in a hurry to get
it done and out of the way.

Another part of this problem comes in when the data is actually
entered from the forms. The forms are coming out of a hectic, very messy
environment. Some of the forms are torn; some have oil or other substances on
them. The writing is often difficult to decipher. The person who created it is
probably not available and probably would not remember much about it if
available.

A database built from this
system will have many inaccuracies in it. Many of the inaccuracies will be
missing information or valid but wrong information. An innovative solution that
involves wireless, handheld devices and employs voice recognition technology
would vastly improve the completeness and accuracy of this database. I hope the
U.S. Navy has made considerable improvements in the data collection processes
for this application since I left. I trust they have.

The Null Problem

A special problem occurs in data entry when the information called
for is not available. A data element has a value, an indicator that the value
is not known, or an indicator that no value exists (or is applicable) for this
element in this record. Have you ever seen an entry screen that had room for a
value and two indicator boxes you could use for the case where there is no
value? I haven't. Most form designs either mandate that a value be provided or
allow it to be left blank. If left blank, you do not know the difference
between value-not-known and no-value-applies.

When the form requires that an entry be available and the entry
person does not have the value, there is a strong tendency to "fake
it" by putting a wrong, but acceptable, value into the field. This is
even unintentionally encouraged by selection lists that start with a default
value in the field.

It would be better form design to introduce the notion of NOT KNOWN
or NOT APPLICABLE for data elements that are not crucial to the transaction
being processed. This would at least allow the entry people to enter accurately
what they know and the users of the data to understand what is going on in the
data.

It would make sense in some cases to allow the initial entry of
data to record NOT KNOWN values and have the system trigger subsequent
activities that would collect and update these fields after the fact. This is
far better than having people enter false information or leaving something
blank and not knowing if a value exists for the field or not.

An example of a data element that may be NOT KNOWN or NOT
APPLICABLE is a driver's license number. If the field is left blank, you cannot
tell if it was not known at the point of entry or whether the person it applies
to does not have a driver's license. Failure to handle the possibility of
information not being available at the time of entry and failure to allow for
options to express what you do know about a value leads to many inaccuracies in
data.
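
One way to carry these three states through a system is to store a status
alongside the value. The sketch below shows one possible representation in
Python, not a prescription; the names Status, FieldValue, and needs_follow_up
are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    KNOWN = "known"
    NOT_KNOWN = "not known"            # a value exists but was unavailable at entry
    NOT_APPLICABLE = "not applicable"  # no value exists for this record


@dataclass(frozen=True)
class FieldValue:
    status: Status
    value: Optional[str] = None

    def __post_init__(self):
        # Reject combinations that would blur the three states.
        if self.status is Status.KNOWN and not self.value:
            raise ValueError("a KNOWN field needs a value")
        if self.status is not Status.KNOWN and self.value is not None:
            raise ValueError("only a KNOWN field may carry a value")


def needs_follow_up(field):
    """True when a later activity should try to collect the missing value."""
    return field.status is Status.NOT_KNOWN
```

A driver's license field recorded as FieldValue(Status.NOT_KNOWN) can be
queued for later collection, whereas FieldValue(Status.NOT_APPLICABLE) tells
the user of the data that the person has no license and nothing is missing.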

Deliberate Errors

Deliberate errors are those
that occur when a person enters a wrong value on purpose. There are three
reasons people do this:

They do not know the correct information.

They do not want you to know the correct information.

They get a benefit from entering the wrong information.

Do Not Know Correct Information

Not knowing the correct information occurs when the form requires
a value for a field and the person wants or needs to complete the form but does
not know the value to use. The form will not be complete without a value. The
person does not believe the value is important to the transaction, at least not
relative to what they are trying to do. The result is that they make up a
value, enter the information, and go on.

Usually the information is not important to completing the
transaction but may be important to other database users later on. For example,
requiring the license plate number of your car when you
register at a hotel has no effect on getting registered. However, it may be
important when you leave your lights on and the staff needs to find out whose
car it is.

Do Not Wish To Give The Correct Information

The second source of deliberate errors is the person
providing the data not wanting to give the correct information. This is
becoming a more and more common occurrence with data coming off the Internet
and the emergence of CRM applications. Every company wants a database on all of
their customers in order to tailor marketing programs. However, they end up
with a lot of incorrect data in their databases because the information they
ask people for is more than people are willing to provide or is perceived to be
an invasion of privacy.

Examples of fields that people will lie about are age, height,
weight, driver's license number, home phone number, marital status, annual
income, and education level. People even lie about their name if it can get the
result they want from the form without putting in their correct name. A common
name appearing in many marketing databases is Mickey Mouse.

The problem with collecting data that is not directly required to
complete the transaction is that the quality of these data elements tends to be
low but is not immediately detected. It is
only later, when you try to employ this data, that the inaccuracies show up and
create problems.

Falsifying To Obtain A Benefit

The third case in which deliberate mistakes are made is where the
entry person obtains an advantage from entering wrong data. Some examples from
the real world illustrate this.

An automobile manufacturer receives claim forms for warranty
repairs performed by dealers. Claims for some procedures are paid immediately,
whereas claims for other procedures are paid in 60 days. The dealers figure out
this scheme and deliberately lie about the procedures performed in order to get
their money faster. The database incorrectly identifies the repairs made. Any
attempt to use this database to determine failure rates would be a total
failure. In fact, it was an attempt to use the data for this purpose that led
to the discovery of the practice. It had been going on for years.

A bank gives branch bank employees a bonus for all new corporate
accounts. A new division of a larger company opens an account with a local
branch. If the bank employee determines that this is a sub-account of a larger,
existing customer (the correct procedure), no bonus is paid upon opening the
account. If, however, the account is opened as a new corporate customer (the
wrong procedure), a bonus is paid.

An insurance company sells automobile insurance policies through
independent insurance writers. In a metropolitan area, the insurance rate is
determined by the ZIP code of the applicant. The agents figure out that if they
falsify the ZIP CODE field on the initial application for applicants in
high-cost ZIP codes, they can get the client on board at a lower rate. The
transaction completes, the agent gets the commission, and the customer corrects
the error when the renewal forms arrive a year later. The customer's rates
subsequently go up as a result.

Data entry people are rated based on the number of documents
entered per hour. They are not penalized for entering wrong information. This
leads to a practice of entering data too fast, not attempting to resolve issues
with input documents, and making up missing information. The operators who
enter the poorest-quality data get the highest performance ratings.

All of these examples demonstrate that company policy can
encourage people to deliberately falsify information in order to obtain a
personal benefit.

System Problems

Systems are too often blamed for mistakes when, after
investigation, the mistakes turn out to be the result of a human error. Our
computing systems have become enormously
reliable over the years. However, database errors do occur because of system
problems when the transaction systems are not properly designed.

Database systems have the notion of COMMIT. This means that changes
to a database system resulting from an external transaction either get
completely committed or completely rejected. Specific programming logic ensures
that a partial transaction never occurs. In application designs, the user is
generally made aware that a transaction has committed to the database.

In older systems, the transaction path from the person entering
data to the database was very short. It usually consisted of a terminal passing
information through a communications controller to a mainframe, where an
application program made the database calls, performed a COMMIT, and sent a
response back to the terminal. Terminals were either locally attached or
accessed through an internal network.

Today, the transaction path can be very long and very complex. It
is not unusual for a transaction to originate outside your corporation, on a PC
connected over the Internet. The transaction flows through ISPs to an application server
in your company. This server then passes messages to a database server, where
the database calls are made. It is not unusual for multiple application servers
to be in the path of the transaction. It is also not unusual for multiple
companies to house application servers in the path. For example, Amazon passes
transactions to other companies for "used book" orders.

The person entering the data is a nonprofessional, totally
unfamiliar with the system paths. The paths themselves involve many parts,
across many communication paths. If something goes wrong, such as a server
going down, the person entering the information may not have any idea of
whether the transaction occurred or not. If there is no procedure for them to
find out, they often reenter the transaction, thinking it is not there, when in
fact it is; or they do not reenter the transaction, thinking it happened, when
in fact it did not. In one case, you have duplicate data; in the other, you
have missing data.
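
A common defense against both outcomes is to have the client attach a unique
identifier to each transaction, so that a retry is recognized as a repeat and
the user has a way to ask whether the transaction arrived. The sketch below
illustrates the idea with an in-memory stand-in for the database. The
OrderStore class and its methods are hypothetical, and a real system would
persist the identifiers in the database itself, within the same COMMIT as the
order.

```python
import uuid


class OrderStore:
    """In-memory stand-in for the database at the end of a long transaction path."""

    def __init__(self):
        self._orders = {}

    def submit(self, txn_id, payload):
        """Record the order once; a retry with the same txn_id changes nothing."""
        if txn_id in self._orders:
            return "already recorded"
        self._orders[txn_id] = payload
        return "recorded"

    def status(self, txn_id):
        """Let the person entering the data find out whether it arrived."""
        return "recorded" if txn_id in self._orders else "not found"

    def count(self):
        return len(self._orders)


def new_txn_id():
    # Generated on the client before the first attempt and reused on every retry.
    return str(uuid.uuid4())
```

If the response to the first attempt is lost, the user can either call status
with the same identifier or simply resubmit; in both cases the store ends up
with exactly one copy of the order.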

More attention must be paid to transaction system design in this
new, complex world we have created. We came pretty close to cleaning up
transaction failures in older "short path" systems but are now
returning to this problem with the newer "long path"
systems.

In summary, there are plenty of ways data inaccuracies can occur
when data is initially created. Errors that occur innocently tend to be random
and are difficult to correct. Errors that are deliberate or are the result of
poorly constructed processes tend to leave clues that can be detected by
analytical techniques.
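
As one illustration of such a technique, values that appear far more often
than expected (a selection-list default, or a made-up name) can be surfaced
with a simple frequency count. This is a minimal sketch in Python; the
function name and the 20 percent threshold are assumptions chosen for
illustration, and a flagged value is only a clue to investigate, not proof of
an error.

```python
from collections import Counter


def overrepresented_values(values, max_share=0.2):
    """Return {value: share} for values that exceed max_share of all entries."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        value: count / total
        for value, count in counts.items()
        if count / total > max_share
    }


# Hypothetical sample of a NAME field from a marketing database.
sample_names = [
    "Mickey Mouse", "Ann Lee", "Mickey Mouse", "Bo Chen", "Carla Diaz",
    "Mickey Mouse", "Dev Patel", "Eve Stone", "Finn Oberg", "Gia Russo",
]
```

Running overrepresented_values on sample_names flags only "Mickey Mouse",
which accounts for three of the ten entries.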