Designing Databases

Structuring tables can improve database performance-- here's how to do it.

Indexing

In addition to the column definitions and the
UNIQUE qualifier, our definition for the
StationLines table included three INDEX
lines—one for each of the station_id,
line_id and north_to_south
columns.

While it often helps to think of a relational database table
as a glorified spreadsheet with rows and columns, there are some
important differences. One is that a database table does not store
its rows in any particular order. If we are interested in
retrieving rows from the table in a certain order, we must specify
it with the ORDER BY clause in our query.

Because rows are not ordered in any particular way, a
SELECT query can often take quite a while to
fulfill. For example, take the following query:

SELECT id
FROM RailStations
WHERE name = "Tel Aviv Central";

This might not seem like a time-consuming query, given that
it involves a single table and a simple WHERE
clause. But since the rows of RailStations are not stored in any
particular order, finding the rows where the name is “Tel Aviv
Central” can take quite a while. This might be a negligible amount
of time in the case of a 100-row table, but when a table contains
1,000 or 10,000 rows, the time can become noticeable. In this
particular example, the database server is probably smart enough to
realize that RailStations.name has been declared
UNIQUE, meaning our query will return one row,
if it returns anything. This means the server will, on average,
have to search through only half of the rows—but that can still
take quite a while.

An index changes this picture by adding a pointer to each
column value. If RailStations.name is indexed, the MySQL server can
almost immediately find those rows containing a particular value.
It can also determine whether a value exists at all.

If indexes can increase query speeds so dramatically, why are
rows unindexed by default? The main answer is that indexes are
written and updated each time an INSERT or
UPDATE operation is performed on a table. Since
the majority of database queries are SELECTs, in
which the index can substantially improve performance, this is
normally an acceptable trade-off. However, certain applications
must INSERT and UPDATE at
maximum speed, in which case creating an index can cause
problems.

Since indexes are used in locating columns of a certain
value, they are necessary for only those columns that will be named
in WHERE clauses. There is no need to index a
column that is displayed, but rarely used as a search
criterion.

In some cases, it is enough to index the first part of each
column rather than the entire column. For example, if we are
indexing a column of type VARCHAR(50), then we
might be able to index only 10 of those characters. This will
retain most of the advantages of a full index (since the first ten
characters are rarely identical in such a text field), while
reducing the amount of information the index must store.

Entering Train Information

Now that we have thoroughly examined the tables describing
the train system, it is time to put some trains on those tracks.
The question of how to model this data is a tough one, since there
are a number of ways in which to accomplish it. I decided to split
this information into two tables, Trains and DepartureTimes.

Each row of Trains describes a particular train, indicating
the line on which it runs, the ID numbers of its origin and
destination stations, and the time it departs from its
origin:

The first column is a primary key, allowing us to describe
each train with a single number. The combination of a rail line,
origin, destination and hour should be unique, so we ask the
database server to enforce this condition with the
UNIQUE keyword.

Finally, we define the DepartureTimes table, which stores
information on when a train will leave from a particular
station:

Now we know two trains will arrive in Tel Aviv early enough for us
to catch a morning meeting. But which trains are those? It would be
nice to get more information than that. One possibility is to print
the name of the origin station and the hour at which the train
leaves:

Notice how SQL allows us to use < and > when handling dates
and times, for columns declared as DATE,
TIME or DATETIME. Given the
contortions one must use in order to compare dates and times in
nearly any programming language, this built-in date comparison is
still one of my favorites.

Assuming we want to take the first train of the day
(ID 1), we can print the list of when it will
arrive at each station: