Every
person more or less is familiar with data bases. Starting with heap of scratch
papers, notebooks (I mean the old ones on the paper :), and ending with
relational data base management systems storing zillions of records. Not so
many people think they are familiar with data wastes. But in reality most
probably they are! You ask – what is data waste? In short – this is data base
where there aren’t clearly defined algorithms how to get out rational data.

You
probably have (or had) an old notebook on paper with very important information
including phone numbers, addresses, peoples etc. randomly written in. The same
probably could be said about files and directories of your own work computer
(if you have a perfect system, then most probably you are an exception :).

So who
cares about your own notebook or computer? Nobody do, except you. But in case
of databases used by many people situation turns out far worse.

The first
big chapter “Characteristics
for data waste” describes how one can identify data waste. Second
chapter “Consequences of
data waste” outlines most important and worst consequences of
data waste.

This
chapter tries to formulate most common and important characteristics describing
a data waste. Most of the time none of them are true/false type features but
the more of them you have the more your data base is data waste with all
possible (negative) consequences.

What I
understand by documentation? Documentation in reality can be stored in many
ways. You can have it in Software design description according to IEEE standard,
separate document called something like Database design description or even in
comment format in database. The most important thing isn’t that you have
enumerated all the tables and columns, but that you have described what they
contains, what possible values they could have, what these values mean, what
relationships they have etc. One can easily generate table, column, column data
type, primary key, foreign key and list of other database objects from data
dictionary. The main problem is to understand what these objects contain and
what the purpose they serveis.

In conjunction
with “lack of documentation” (1) and “all logic in application / db has only
tables” (5) you are in a big trouble if you need to change something or get
reasonable data out. The only way is to analyze the data model, analyze the
data and analyze the application code (if source is available). This takes
time. This takes lot of time. And there is no guarantee that you have made
changes in a right way. Probably after some time you’ll find that your changes won’t
work for a very special case and worst of all have silently damaged your other
data. So take care of your only well-informed person and force him to correct
“lack of documentation” or at least deliver his knowledge to others.

Initial
design is crucial for almost everything that is at least a bit complex. You
cannot build a big house without accepted project. Unfortunately many times
because of false expectations that it would be cheaper information systems are
built without any initial design. Actually it will be far more expensive at
least in long term. The earlier the mistake is allowed the harder is to correct
it. If your data model is incorrect in the very root it will affect all the
system and even best programmers in the world wouldn’t be able to make the
application performance acceptable.

Usually
that goes hand in hand with “3.
Lack of initial design” It will raise the total incomprehension level about
your data base because every object has to be analyzed separately. You cannot
create patterns from already known information and use analogies that greatly reduce
necessary overall work.

Do you have
all business logic outside your Oracle database? Yes? Then why on earth you
have spent so many $$ for Oracle licence? You could easily live with MySQL! And
MySQL would certainly be faster! Why? Because Oracle has so much possible
functionality compared to MySQL, that it certainly adds some overhead. This is
just like you have rented a big house with tens of rooms and tons of equipment
and advanced features and you are using only living-room. You are not using WC.
Why WC? You can go out in the garden and do necessary things there! You are not
using kitchen. Why kitchen? You can go out and do the cooking on fire! So would
you pay rent for the entire house with all its features and wouldn’t use them?

The same
situation is with databases, especially advanced databases with many features.
You have paid so much money and are just deliberately robbing yourself.

If you are
seeking for a miracle called database independence then you have achieved the
wonderful result when your application isn’t performing at possible best on every
one of them. All the built in features are as close as possible to the data and
that’s the primary reason they are doing things better than any feature outside
database.

This is
just like you are living in several separate houses each with different architecture and using only
living-rooms there. I’m sure you wouldn’t be satisfied with your quality of
life.

With “3. Lack of
initial design” and “4.
Lack of naming conventions” all people developing your data base are like ten people decorating
your ten room house without any guidance. If you are lucky achieved result will
be quite acceptable, if you aren’t result will strike terror into everyone. The
same situation is with databases. There are many conventions, standards and
best practises to follow. If more than one of them are used in your database the
ambiguity level is increased and it becomes apparent most brightly when you
have to do something with object A made after convention one and object B made
after convention two.

This
usually is the most radical form of “5.
All logic in application / db has only tables” or result of simple incompetence. If you haven’t
enforced referential integrity in database then this is only matter of time
when you’ll have orphan records, duplicate values and unwanted values. Sooner
or later you’ll access your data by passing your application, sooner or later
someone will use a bug in your application and your data will be damaged.

Trying to
enforce referential integrity and for example unique constraints in application
is far far harder than doing that in database, especially if you have multi-user
application (and who hasn’t?).

Tables with
names tab1, tab2 or AKP1, AKP2 and columns with names col1, col2 etc. can turn
everyone’s life into a hell. I prefer then table names in unknown language; at
least there is a reason to learn some words in it. In a new database initially
it is hard to remember logical names to say nothing of col1 and col2. Each and
every usage of them results in looking into some document and/or sticker.

7.3. Many columns like notes without any classifiers. Each user enters data in
his own style

A human can
probably get along without classifiers because he understands that USA, the United
States, the U.S.,
the U.S.A., the U.S. of A,
the States, and America
is one and the same country. But as soon as you need to write a report or exchange
data with some other system, you have to classify
all this mess anyway. And there is no guarantee that someone wouldn’t invent a
new value tomorrow because imagination of users in this sphere is endless. Other
bad thing is to unite separate data in one column for example name and surname.
Sooner or later you’ll definitely have Plivna Gints instead of Gints Plivna and
even Gints <Secondname> Plivna, and you wouldn’t know what the
<Secondname> is – name or surname?

This is
another one “feature” of database independence because each data base treats its
data types differently. So instead of using DATE and NUMBER you use VARCHAR (10)
and VARCHAR (XX). As a result you have wonderful potential for a big mess. Ordering
for characters works like ‘1’, ‘10’, ‘2’ and not like you’d like 1, 2, 10,
nothing to speak of ordering of varchar dates. Date and number mathematic
becomes Sisyphean task and
there is always a threat that someone will enter either date in incorrect
format or number with for example character “O”.

This is
more physical, than logical aspect of database. In case the database has more
or less only necessary indexes one can found common data access patterns that
can tell something about valuable data and not so valuable data. In case the
database has chaotic indexes one can found only bad example of indexing scheme.

List of
consequences described below of course isn’t complete and these are for the
worst case scenario when database complies with all criteria described above.
Therefore I hope there are only a few data bases in the world fulfilling all
characteristics mentioned above and I hope even more that you don’t have to
deal with one of them. Even if you have, maybe you are lucky enough to find a
person that has researched your problem source before and can give you some
knowledge about it, maybe you are lucky enough and the database is so small
that even with all characteristics above fulfilled you can understand it in a
reasonable time, maybe you are lucky enough and can agree with customer to
start a new life without old data or at least only with data that satisfies
some minimal criteria.

OK, you’ve
understood that you have data waste, let’s create a data base and start things
over! The bad news is – that’s quite a big work or almost impossible for a real
data waste.

With “1. Lack of documentation” and “2.
The only well-informed person has left a while ago” you don’t know WHAT to migrate. You don’t know which tables and columns take
seriously, which tables and columns are simple waste, you even don’t know
what all this mess means!

With “5. All logic in application / db has only
tables” and “7.1. No constraints and
referential integrity” you have records without necessary information
(because of lack of NOT NULL constraints), you have orphan records and
don’t know where to put them and you have duplicate records with different
values.

With “7.3.
Many columns like notes without any classifiers. Each user enters data
in his own style” and “7.4. Character data type
instead of date, number” you haven’t clear algorithms how to make data
transformation and how to map old a la classifiers to new classifiers.

With “7.2.
Cryptic and uninformative column/table names” your task of writing migration code becomes
nightmare and every evening you have headache J

As a result
with completely inadequate work you have made partial migration and all users
complain that old system was better because had more information and was less
restrictive.

Data waste
is probably one of the biggest causes of viewpoint that the biggest value is
application, not the data. I’ve personally seen only the contrary. What is data
conversion from text files, various ancient DBMSes, previous data model
versions? All these data were entered via some application. They are gone forever;
nobody cares about them, but the great requirement to convert old data remains.
In case of data waste the overall quality of data and inability to convert them
leads to this false conclusion.

Avoiding
mistakes described in “Characteristics
for data waste” doesn’t guarantee that your database will be good. But
allowing them does guarantee you problems that you won’t be able to correct (except
“7.5. Chaotic indexes”) via tuning, caching, writing better SQL, using bind
variables, buying bigger box or whatever. These are logic mistakes and the only
ultimate solution is to redesign data base and most probably rewrite your
application from scratch. If you really think that you need to allow at least
one of them then think at least twice and try to formulate both minuses and
pluses for yourself. This will at least allow you to look deeper in possible
problems and gather new experience for decision making skills J

Gints
Plivna gints.plivna@gmail.com is
system analyst in Rix Technologies Ltd. (www.rixtech.lv). He has experience in
working with Oracle since 1997 and his interests mostly have been connected
with analyzing system requirements, design, development and SQL tuning.

This work
is licensed under the Creative Commons Attribution-ShareAlike 2.5 License. To
view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.5/
or send a letter to CreativeCommons, 543
Howard Street, 5th Floor, San Francisco, California,
94105, USA.