Next to every DOS programmer's terminal must be a fortune cookie with a fortune that reads "640K is the mother of invention." And if you've ever had to manage large files of records, you've probably cracked this cookie more than once.

The goal of this article is to show how you can retrieve any record in a multimegabyte file with one disk access, and any record in any size file with a maximum of two accesses. KRAM.PAS (Listing One, page 116), written in Turbo Pascal 5.0, allows any record in a file of up to approximately 1.4 Mbytes of user data to be retrieved with one disk access. But before we look at the code, let's examine the basic algorithm.

Extensible hashing uses a record's key to compute a hash code. The first n bits of the hash code (n = 10 in KRAM.PAS) are used as an index into a table of block numbers. All records whose hash codes are the same in their first n bits are stored in the same block. For faster access, a possible optimization is to use the full hash code again to look up the record within the block. This is especially valuable when using large data blocks, as a significant amount of time can be consumed in the search for empty slots when adding records.
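In outline, the lookup works like this. (This is a Python sketch, not the article's Pascal; the hash function here is a stand-in for the real HashCode routine.)

```python
N_BITS = 10                  # n: number of hash-code bits used (as in KRAM.PAS)
INDEXCOUNT = 1 << N_BITS     # one index entry per possible n-bit prefix

def hash_code(key):
    """Stand-in 32-bit hash; KRAM.PAS has its own HashCode routine."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h

def index_entry(key):
    """The first n bits of the 32-bit hash code select the index slot."""
    return hash_code(key) >> (32 - N_BITS)

# Every entry starts out pointing at block 0; all keys whose hash codes
# share the same first n bits land in the same data block.
block_table = [0] * INDEXCOUNT
block_number = block_table[index_entry("SMITH")]
```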

When a block is full, it must be split into two new blocks. The distribution of records between the two new blocks is based on one more bit of the full hash code than was used before the split. For example, if the hash code used to look up the record in the old block was 4 bits long, with the value 1011, then the first new block would contain records with hash codes of 10110, and the second new block would contain records with hash codes of 10111.
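The rule can be checked with a couple of made-up 32-bit hash codes, taking the prefix from the high-order bits:

```python
def prefix(hash32, bits):
    """Top `bits` bits of a 32-bit hash code."""
    return hash32 >> (32 - bits)

# Two hash codes that agree in their first 4 bits (1011) but differ in
# the 5th: after the split they belong to different blocks.
low_code = 0b10110 << 27     # 5-bit prefix 10110 -> first new block
high_code = 0b10111 << 27    # 5-bit prefix 10111 -> second new block
```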

There are two ways of extending the capacity of a KRAM file. For maximum speed and simplicity of programming, you can keep the entire index table in memory, writing it out to the KRAM file only when it is modified as blocks are split, as is done in KRAM.PAS. This "index-in-memory" approach, however, limits the maximum size of the files that you can use: For example, if you are willing to set aside about 140K bytes for the index and the data buffers, you could retrieve any record from a file of about 128 Mbytes in one access.

The other possibility is to keep the index table mostly on disk in the KRAM file, and read portions of it into memory as needed. This requires two disk accesses, one for the index and one to retrieve the data, but only 16K bytes of memory would suffice for any size file. Of course, as a compromise, you could keep a few of the most recently used index records in memory. This would reduce the number of accesses to the index if your references tend to be repetitious.
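The compromise can be sketched with a small least-recently-used cache of index pages. (The page size, cache size, and read routine below are all assumptions for illustration.)

```python
from collections import OrderedDict

PAGE_SIZE = 512        # index entries per page (an assumption)
MAX_CACHED = 8         # index pages kept in memory

index_cache = OrderedDict()    # page number -> list of block numbers

def read_index_page(page):
    """Stand-in for one disk access that reads an index page."""
    return [0] * PAGE_SIZE

def index_lookup(entry):
    page, offset = divmod(entry, PAGE_SIZE)
    if page in index_cache:
        index_cache.move_to_end(page)         # mark most recently used
    else:
        if len(index_cache) >= MAX_CACHED:
            index_cache.popitem(last=False)   # evict least recently used
        index_cache[page] = read_index_page(page)
    return index_cache[page][offset]
```

Repeated references to the same region of the index then hit the cache and cost no extra disk access.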

While discussing the code, I will mention other possible extensions and optimizations that were not included in the code, primarily because the listings would have become too large.

First, let's examine the data structures used to keep track of the current status of a KRAM file. The sizes of these structures are parameterized so that you can adapt them to your particular application. This is important primarily when you use the "index-in-memory" approach, as the maximum file size is then limited by the number of index entries and the size of each data block. The maximum data storage available in this case can be approximated by the formula:

DATASIZE*INDEXCOUNT*.67;

DATASIZE refers to the number of bytes in a data block, and INDEXCOUNT refers to the number of entries in the index. (Note that INDEXCOUNT must be a power of two because it must have one entry for each possible hash code of the maximum length allowed.)

The constant .67 is an approximation of the packing factor, or the proportion of the file that is occupied by data records, assuming random keys. The exact value of the packing factor depends on the distribution of keys, but shouldn't vary much from this value.

For example, if you needed to store 100,000 100-byte records for a total data storage requirement of 10 Mbytes, you could set INDEXCOUNT to 8192 and DATASIZE to 2048, giving a maximum accessible file size of 16 Mbytes. This provides an approximate storage capacity of 11.2 Mbytes.
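The arithmetic checks out (the Mbytes here are decimal millions of bytes):

```python
DATASIZE = 2048        # bytes per data block
INDEXCOUNT = 8192      # index entries (a power of two)
PACKING = 0.67         # approximate packing factor for random keys

max_file = DATASIZE * INDEXCOUNT    # 16,777,216 bytes, about 16 Mbytes
capacity = max_file * PACKING       # about 11.2 Mbytes of user data
```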

The memory requirement for the data and index buffers in the "index-in-memory" approach is:

(2*INDEXCOUNT) + (3*DATASIZE);

as three data buffers are required when splitting a block. The example of 100,000 100-byte records would require 2*8192 + 3*2048, or 22K bytes of memory for the index and data buffers.
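As a quick check (each index entry is a 2-byte block number):

```python
INDEXCOUNT = 8192
DATASIZE = 2048

# 2 bytes per index entry, plus three data-block buffers.
buffer_bytes = 2 * INDEXCOUNT + 3 * DATASIZE    # 16384 + 6144 = 22528
```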

KramInit creates a new KRAM file. Once the file is created, KramInit initializes the first data block to all zeros, and sets all pointers in the index block to point to that first data block. KramInit also initializes the parameter block, which contains the data length, the key length, and the current high block number.

Once the file has been initialized, call KramOpen to open the file, read the parameter block and the index block into memory, and allocate space for the temporary data block. The declaration for the FileRec record type shows how the file information is associated with the file pointer.

Turbo Pascal 5.0 represents an untyped file as a record type, called "FileRec." The UserData field, however, is never accessed by Turbo Pascal, and is free for user-written routines to store data in. Therefore, I have redefined that field as:

UserData: array [1..4] of pointer;

Currently, only the first element is used to store the address of the KramParam block for the file. Therefore, any number of KRAM files may be open at once, subject to system limitations on file handles, with only the file handle itself being passed to the KRAM routines.
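In a language without Turbo Pascal's FileRec trick, the same association can be made with a table keyed by the file handle. (A Python sketch; the field names are mine, not the listing's.)

```python
# Per-file state, looked up by file handle, so callers only ever need
# to pass the handle itself to the KRAM routines.
kram_params = {}    # handle -> parameter block

def kram_open(handle, key_len, data_len):
    kram_params[handle] = {
        "key_len": key_len,      # bytes of key per record
        "data_len": data_len,    # bytes of data per record
        "high_block": 0,         # current highest block number
    }

def kram_info(handle):
    p = kram_params[handle]
    return p["key_len"], p["data_len"]
```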

KramAdd is used to write records to the KRAM file. KramAdd returns TRUE if the record was added successfully (the key was not a duplicate of one in the file), and FALSE otherwise.

KramAdd first calls HashCode with the key value to calculate which index entry corresponds to the record to be stored. It then retrieves the block number pointed to by that index entry. If that block is not the one currently in memory, it must be retrieved from disk. A possible optimization here is to keep more than one data block in memory, discarding the least recently used block when its buffer is needed for another block.

The next task is to discover whether the key is a duplicate of one already in the block, and if not, whether there is an empty slot in the block to store the new record. If an empty slot is found (and there is no duplication), then the data is moved from the function arguments KeyValue and DataValue to that slot.

If no empty slot is available, and there is no duplication of keys, we must split this block to make room for the new record. The first subtask here is to scan the index table to determine which index entries are affected by the split (that is, which index entries point to this block). Note that if only one index entry points to this block, the block cannot be split because one entry cannot point to more than one block. Splitting a block under these conditions would cause one of the new blocks to be inaccessible.

To prevent this from occurring, page the index from disk, allowing it to grow as needed. When using a paged index, you should also keep in each data block a record of the number of hash-code bits used to access that block; that way, you won't have to page through a large part of the index table to find out which entries are affected by a split. For example, if a data block contains all records whose hash codes start with the 5 bits 11101, splitting it affects all index entries that start with 11101.

After determining that at least two entries point to this block, allocate another block to accommodate the overflow, and update the higher-numbered half of the affected entries so they point to the new block. Next, allocate two temporary block buffers, LowDataPtr^ and HighDataPtr^ (so called because they will receive the records with lower and higher hash codes, respectively), and initialize them to zeros so they are ready to receive the records from the full block.

The distribution of records from the full block to the two temporary buffers depends on the updated entries in the index table. The key of each record is used to calculate the hash code for that record and the block number is looked up in the index table. If the block number in the table is the same as the block number of the full block, the record will end up in the old block when we are through because the old block number was kept for the first (lower) half of the updated index table entries.

For example, suppose that before block 12 became full, the relevant section of the index table looked like that shown in Figure 1.

All records with hash codes from 100 to 103 would remain in block 12, and records with hash codes from 104 to 107 would be moved to block 23.
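The redistribution step can be sketched as follows. (Python again, with a stand-in hash; the index table is assumed to have already been updated as described above.)

```python
def hash_code(key):
    """Stand-in 32-bit hash; KRAM.PAS has its own HashCode routine."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h

def split_block(records, full_block_no, index_table, n_bits):
    """Distribute the records of a full block between the old (low)
    and new (high) blocks by re-hashing each key and consulting the
    already-updated index table."""
    low, high = [], []
    for key, data in records:
        entry = hash_code(key) >> (32 - n_bits)
        if index_table[entry] == full_block_no:
            low.append((key, data))    # stays in the old block
        else:
            high.append((key, data))   # moves to the new block
    return low, high
```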

After the records have been moved to their new places in LowDataPtr^ and HighDataPtr^, the file is updated. First, the old block (now approximately half-empty) is written back to its old position. Next, the new block is added to the end of the file. The parameter and index buffers are then written back to the file to keep the high block number and index tables up-to-date. Finally, we return to the top of the WHILE loop to handle the insertion of the record, now that there is room in the block it belongs in.

KramRead is basically the same as KramAdd (except that it doesn't have to deal with the splitting of blocks) and requires little explanation. KramClose does what its name implies; it closes the file, after making sure that anything that has been changed is written out to the file. Be sure to call this routine if you have added any records to the file. KramInfo is a procedure that you can call to find out the key and data lengths of a KRAM file.

HashCode takes the key of a record and returns a 32-bit hash code, from which the calling program extracts the number of bits it needs to access the index table. SeekBlock takes a data block number and positions the file to enable that block to be read or written.
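A sketch of the two helpers follows; the header size and mixing function are my assumptions, since the real layout and hash are defined in KRAM.PAS.

```python
DATASIZE = 2048
HEADER_SIZE = 4096    # parameter block plus index table (assumed layout)

def hash_code(key):
    """Return a 32-bit hash code; the caller extracts as many of the
    top bits as it needs.  The mixing here is illustrative only."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h

def block_offset(block_no):
    """Byte position of a data block, for the seek before a read or
    write (SeekBlock's job in the listing)."""
    return HEADER_SIZE + block_no * DATASIZE
```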

The main program is a driver that illustrates how to call these routines. It reads an ASCII file of lines, each consisting of a string key and numeric data, optionally initializing and loading the KRAM file. Then it retrieves all the records, checking that they all exist and contain the same data as when they were created. In my test data, the data value stored with each key is simply that key's record number; the test records were created by a small test-data generating program, KRAMDATA.PAS (see Listing Two, page 121).

In many applications, it is necessary to be able to delete records from a KRAM file. The simplest way to do this is to zero out the key and data of the record you want to delete, so that KramAdd can reuse its slot. This would not work properly, however, if you decided to use hashing to speed up the lookup within a block. In that case, you could change the key to something that would not occur in your application, such as FFh. You would then change KramAdd to reuse a slot that had that value, and change the block-splitting function to ignore records that had that key. Thus, whenever a block was split, any deleted records that had not already been reused would disappear.
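A tombstone-based delete along those lines might look like this sketch (the FFh sentinel is the one suggested above; the slot layout and 8-byte keys are assumptions):

```python
TOMBSTONE = b"\xff" * 8   # sentinel key that no live record may use
EMPTY = b"\x00" * 8       # a never-used slot

def delete(block, key):
    """Mark the matching slot deleted so a later add can reuse it."""
    for i, (k, data) in enumerate(block):
        if k == key:
            block[i] = (TOMBSTONE, data)
            return True
    return False

def find_free_slot(block):
    """An add may reuse either an empty or a tombstoned slot."""
    for i, (k, _) in enumerate(block):
        if k in (EMPTY, TOMBSTONE):
            return i
    return None    # block is full: it must be split
```

The block-splitting routine would then simply skip tombstoned slots, so deleted records vanish at the next split.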

Lest you get the impression that KRAM files are the best possible organization for keyed access, I should mention two limitations of KRAM files, which make them unsuitable for certain applications. First, this is not an indexed sequential access method, but truly a keyed random access method. There is no convenient way of retrieving the records other than by supplying the exact key for the record you need.

The second limitation is that KRAM files are not very efficient in their use of disk storage. You can expect that a file containing 1 Mbyte of your data will occupy approximately 1.5 Mbytes on the disk. This is an unavoidable side effect of the dynamic storage allocation method that gives KRAM files their tremendous speed. If these limitations do not adversely affect your application, KRAM files are probably the best way of organizing data that requires random access by key.

By the way, a shareware version of KRAM is also available. For more information, contact Chrysalis Software Corporation, at the address at the beginning of this article. Registered users will receive an updated version of the program, incorporating all of the optimizations and extensions mentioned in this article.
