2.4. Flat-File Databases

The
simplest type of database that we can
create and manipulate is the old standby, the flat-file
database. This database is essentially a file, or group of
files, that contains data in a known and standard format that a
program scans for the requested information. Modifications to the
data are usually done by updating an in-memory copy of the data held
in the file, or files, then writing the entire set of data back out
to disk. Flat-file databases are typically ASCII text files
containing one record of information per line. The line termination
serves as the record delimiter.

In this section we'll be examining the
two
main
types of flat-file database: files that separate fields with a
delimiter character, and files that allocate a fixed length to each
field. We'll discuss the pros and cons of each type of data
file and give you some example code for manipulating them.

The most common format used for flat-file databases is probably the
delimited file in which each field is separated
by a delimiting character. And possibly the most common of these
delimited formats is the comma-separated
values (CSV) file, in which fields are
separated from one another by commas. This format is understood by
many common programs, such as Microsoft Access and spreadsheet
programs. As such, it is an excellent base-level and portable format
useful for sharing data between applications.[8]

[8]More
excitingly, a DBI driver called DBD::CSV exists that allows you to
write SQL code to manipulate a flat file containing CSV data.

Other popular delimiting characters are the colon (
: ), the tab, and the pipe symbol (
| ). The Unix /etc/passwd
file is a good example of a delimited file with each record being
separated by a colon. Figure 2-1 shows a single
record from an /etc/passwd file.

Figure 2-1. The /etc/passwd file record format

2.4.1. Querying Data

Since
delimited
files are a very low-level form of
storage manager, any manipulations that we wish to perform on the
data must be done using operating system functions and low-level
query logic, such as basic string comparisons. The following program
illustrates how we can open a data file containing colon-separated
records of megalith data, search for a given site, and return the
data if found:

indicating that our brute-force scan and test for the correct site
has worked. As you can clearly see from the example program, we have
used Perl's own native file I/O functions for reading in the
data file, and Perl's own string handling functions to break up
the delimited data and test it for the correct record.

The downside to delimited file formats is that if any piece of data
contains the delimiting character, you need to be especially careful
not to break up the records in the wrong place. Using the Perl
split()
function with a simple regular expression, as used above, does not
take this into account and could produce wrong results. For example,
a record containing the following information would cause the
split() to happen in the wrong place:

Stonehenge:Wiltshire:SU 123 400:Stone Circle and Henge:Stonehenge: The most famous stone circle

The easiest quick-fix technique is to translate any delimiter
characters in the string into some other character that you're
sure won't appear in your data. Don't forget to do the
reverse translation when you fetch the records back.

Another common
way of storing data within flat
files is to use fixed-length records in which to
store the data. That is, each piece of data fits into an exactly
sized space in the data file. In this form of database, no delimiting
character is needed between the fields. There's also no need to
delimit each record, but we'll continue to use ASCII line
termination as a record delimiter in our examples because Perl makes
it very easy to work with files line by line.

Using fixed-width fields is similar to the way in which data is
organized in more powerful database systems such as an RDBMS. The
pre-allocation of space for record data allows the storage manager to
make assumptions about the layout of the data on disk and to optimize
accordingly. For our megalithic data purposes, we could settle on the
data sizes of:[10]

[10]The fact that these data sizes are all
powers of two has no significance other than to indicate that the
authors are old enough to remember when powers of two were
significant and useful sometimes. They generally aren't
anymore.

Storing the data in this format requires slightly different storage
manager logic to be used, although the standard Perl file I/O
functions are still applicable. To test this data for the correct
record, we need to implement a different way of extracting the fields
from within each record. For a fixed-length data file, the Perl
function unpack() is perfect. The following code
shows how the
unpack()
function replaces the split() used above:

Although fixed-length fields are always the same length, the data
that is being put into a particular field may not be as long as the
field. In this case, the extra space will be filled with a character
not normally encountered in the data or one that can be ignored.
Usually, this is a space character (ASCII 32) or a
nul (ASCII 0).

In the code above, we know that the data is space-packed, and so we
remove any trailing space from the name record so as not to confuse
the search. This can be simply done by using the uppercase
A format with unpack().

If you need to choose between delimited
fields and fixed-length fields, here are a
few guidelines:

The main limitations

The main limitation with delimited fields is the need to add special
handling to ensure that neither the field delimiter or the record
delimiter characters get added into a field value.

The main limitation with fixed-length fields is simply the fixed
length. You need to check for field values being too long to fit (or
just let them be silently truncated). If you need to increase a field
width, then you'll have to write a special utility to rewrite
your file in the new format and remember to track down and update
every script that manipulates the file directly.

Space

A delimited-field file often uses less space than a fixed-length
record file to store the same data, sometimes very
much less space. It depends on the number and size of any
empty or partially filled fields. For example, some field values,
like web URLs, are potentially very long but typically very short.
Storing them in a long fixed-length field would waste a lot of space.

While delimited-field files often use less space, they do
"waste" space due to all the field delimiter characters.
If you're storing a large number of very small fields then that
might tip the balance in favor of fixed-length records.

Speed

These days, computing power is rising faster than hard disk data
transfer rates. In other words, it's often worth using more
space-efficient storage even if that means spending more processor
time to use it.

Generally, delimited-field files are better for sequential access
than fixed-length record files because the reduced size more than
makes up for the increase in processing to extract the fields and
handle any escaped or translated delimiter characters.

However, fixed-length record files do have a trick up their sleeve:
direct access. If you want to fetch record 42,927 of a
delimited-field file, you have to read the whole
file and count records until you get to the one you want. With a
fixed-length record file, you can just multiply 42,927 by the total
record width and jump directly to the record using
seek().

Furthermore, once it's located, the record can be updated
in-place by overwriting it with new data.
Because the new record is the same length as the old, there's
no danger of corrupting the following record.

2.4.2. Inserting Data

Inserting data
into a flat-file database is very
straightforward and usually amounts to simply tacking the new data
onto the end of the data file. For example, inserting a new megalith
record into a colon-delimited file can be expressed as simply as:

#!/usr/bin/perl -w
#
# ch02/insertmegadata/insertmegadata: Inserts a new record into the
# given megalith data file as
# colon-separated data
#
### Check the user has supplied an argument to scan for
### 1) The name of the file containing the data
### 2) The name of the site to insert the data for
### 3) The location of the site
### 4) The map reference of the site
### 5) The type of site
### 6) The description of the site
die "Usage: insertmegadata"
." <data file> <site name> <location> <map reference> <type> <description>\n"
unless @ARGV == 6;
my $megalithFile = $ARGV[0];
my $siteName = $ARGV[1];
my $siteLocation = $ARGV[2];
my $siteMapRef = $ARGV[3];
my $siteType = $ARGV[4];
my $siteDescription = $ARGV[5];
### Open the data file for concatenation, and die upon failure
open MEGADATA, ">>$megalithFile"
or die "Can't open $megalithFile for appending: $!\n";
### Create a new record
my $record = join( ":", $siteName, $siteLocation, $siteMapRef,
$siteType, $siteDescription );
### Insert the new record into the file
print MEGADATA "$record\n"
or die "Error writing to $megalithFile: $!\n";
### Close the megalith data file
close MEGADATA
or die "Error closing $megalithFile: $!";
print "Inserted record for $siteName\n";
exit;

This example simply opens the data file in append mode and writes the
new record to the open file. Simple as this process is, there is a
potential drawback. This flat-file database does not detect the
insertion of multiple items of data with the same search key. That
is, if we wanted to insert a new record about Stonehenge into our
megalith database, then the software would happily do so, even though
a record for Stonehenge already exists.

This may be a problem from a data integrity point of view. A more
sophisticated test prior to appending the data might be worth
implementing to ensure that duplicate records do not exist. Combining
the insert program with the query program above is a straightforward
approach.

Another potential (and more important) drawback is that this system
will not safely handle occasions in which more than one user attempts
to add new data into the database. Since this subject also affects
updating and deleting data from the database, we'll cover it
more thoroughly in a later section of this chapter.

Inserting new
records into a fixed-length data
file is also simple. Instead of printing each field to the Perl
filehandle separated by the delimiting character, we can use the
pack() function to create a fixed-length record
out of the data.

2.4.3. Updating Data

Updating data
within a flat-file database is where
things begin to get a little more tricky. When querying records from
the database, we simply scanned sequentially through the database
until we found the correct record. Similarly, when inserting data, we
simply attached the new data without really knowing what was already
stored within the database.

The main problem with updating data is that we need to be able to
read in data from the data file, temporarily mess about with it, and
write the database back out to the file without losing any records.

One approach is to slurp the entire database into memory, make any
updates to the in-memory copy, and dump it all back out again. A
second approach is to read the database in record by record, make any
alterations to each individual record, and write the record
immediately back out to a temporary file. Once all the records have
been processed, the temporary file can replace the original data
file. Both techniques are viable, but we prefer the latter for
performance reasons. Slurping entire large databases into memory can
be very resource-hungry.

The following short program implements the latter of these strategies
to update the map reference in the database of delimited records:

#!/usr/bin/perl -w
#
# ch02/updatemegadata/updatemegadata: Updates the given megalith data file
# for a given site. Uses colon-separated
# data and updates the map reference field.
#
### Check the user has supplied an argument to scan for
### 1) The name of the file containing the data
### 2) The name of the site to search for
### 3) The new map reference
die "Usage: updatemegadata <data file> <site name> <new map reference>\n"
unless @ARGV == 3;
my $megalithFile = $ARGV[0];
my $siteName = $ARGV[1];
my $siteMapRef = $ARGV[2];
my $tempFile = "tmp.$$";
### Open the data file for reading, and die upon failure
open MEGADATA, "<$megalithFile"
or die "Can't open $megalithFile: $!\n";
### Open the temporary megalith data file for writing
open TMPMEGADATA, ">$tempFile"
or die "Can't open temporary file $tempFile: $!\n";
### Scan through all the records looking for the desired site
while ( <MEGADATA> ) {
### Quick pre-check for maximum performance:
### Skip the record if the site name doesn't appear as a field
next unless m/^\Q$siteName:/;
### Break up the record data into separate fields
### (we let $description carry the newline for us)
my ( $name, $location, $mapref, $type, $description ) =
split( /:/, $_ );
### Skip the record if the site name doesn't match. (Redundant after the
### reliable pre-check above but kept for consistency with other examples.)
next unless $siteName eq $name;
### We've found the record to update, so update the map ref value
$mapref = $siteMapRef;
### Construct an updated record
$_ = join( ":", $name, $location, $mapref, $type, $description );
}
continue {
### Write the record out to the temporary file
print TMPMEGADATA $_
or die "Error writing $tempFile: $!\n";
}
### Close the megalith input data file
close MEGADATA;
### Close the temporary megalith output data file
close TMPMEGADATA
or die "Error closing $tempFile: $!\n";
### We now "commit" the changes by deleting the old file...
unlink $megalithFile
or die "Can't delete old $megalithFile: $!\n";
### and renaming the new file to replace the old one.
rename $tempFile, $megalithFile
or die "Can't rename '$tempFile' to '$megalithFile': $!\n";
exit 0;

You can see we've flexed our Perl muscles on this example,
using a while ... continue loop
to simplify the logic and adding a pretest for increased speed.

An equivalent
program that can be applied to a
fixed-length file is very similar, except that we use a faster
in-place update to change the contents of the field. This principle
is similar to the in-place query described previously: we don't
need to unpack and repack all the fields stored within each record,
but can simply update the appropriate chunk of each record. For
example:

### Scan through all the records looking for the desired site
while ( <MEGADATA> ) {
### Quick pre-check for maximum performance:
### Skip the record if the site name doesn't appear at the start
next unless m/^\Q$siteName/;
### Skip the record if the extracted site name field doesn't match
next unless unpack( "A64", $_ ) eq $siteName;
### Perform in-place substitution to upate map reference field
substr( $_, 64+64, 16) = pack( "A16", $siteMapRef ) );
}

This technique is faster than packing and unpacking each record
stored within the file, since it carries out the minimum amount of
work needed to change the appropriate field values.

You may notice that the pretest in this example isn't 100%
reliable, but it doesn't have to be. It just needs to catch
most of the cases that won't match in
order to pay its way by reducing the number of times the more
expensive unpack and field test gets executed. Okay, this might not
be a very convincing application of the idea, but we'll revisit
it more seriously later in this chapter.

2.4.4. Deleting Data

The final
form of data manipulation that you can
apply to flat-file databases is the removal, or deletion, of records
from the database. We shall process the file a record at a time by
passing the data through a temporary file, just as we did for
updating, rather than slurping all the data into memory and dumping
it at the end.

With this technique, the action of removing a record from the
database is more an act of omission than any actual deletion. Each
record is read in from the file, tested, and written out to the file.
When the record to be deleted is encountered, it is simply
not written to the temporary file. This
effectively removes all trace of it from the database, albeit in a
rather unsophisticated way.

The following program can be used to remove the relevant record from
the delimited megalithic database when given an argument of the name
of the site to delete:

#!/usr/bin/perl -w
#
# ch02/deletemegadata/deletemegadata: Deletes the record for the given
# megalithic site. Uses
# colon-separated data
#
### Check the user has supplied an argument to scan for
### 1) The name of the file containing the data
### 2) The name of the site to delete
die "Usage: deletemegadata <data file> <site name>\n"
unless @ARGV == 2;
my $megalithFile = $ARGV[0];
my $siteName = $ARGV[1];
my $tempFile = "tmp.$$";
### Open the data file for reading, and die upon failure
open MEGADATA, "<$megalithFile"
or die "Can't open $megalithFile: $!\n";
### Open the temporary megalith data file for writing
open TMPMEGADATA, ">$tempFile"
or die "Can't open temporary file $tempFile: $!\n";
### Scan through all the entries for the desired site
while ( <MEGADATA> ) {
### Extract the site name (the first field) from the record
my ( $name ) = split( /:/, $_ );
### Test the sitename against the record's name
if ( $siteName eq $name ) {
### We've found the record to delete, so skip it and move to next record
next;
}
### Write the original record out to the temporary file
print TMPMEGADATA $_
or die "Error writing $tempFile: $!\n";
}
### Close the megalith input data file
close MEGADATA;
### Close the temporary megalith output data file
close TMPMEGADATA
or die "Error closing $tempFile: $!\n";
### We now "commit" the changes by deleting the old file ...
unlink $megalithFile
or die "Can't delete old $megalithFile: $!\n";
### and renaming the new file to replace the old one.
rename $tempFile, $megalithFile
or die "Can't rename '$tempFile' to '$megalithFile': $!\n";
exit 0;

The code
to remove records from a
fixed-length data file is almost identical. The only change is in the
code to extract the field value, as you'd expect:

### Extract the site name (the first field) from the record
my ( $name ) = unpack( "A64", $_ );

Like updating, deleting data may cause problems if multiple users are
attempting to make simultaneous changes to the data. We'll look
at how to deal with this problem a little later in this chapter.