Description:

I've seen it mentioned time and time again that flat text databases become very inefficient as the size of the file approachs 1Mb.

Does anybody know why that is and, more importantly, any way to lessen that effect?

It can't really all be down to the file system, as an RDBMS uses files on the same system in many cases. For instance, I have Perl and MySQL installed on the same drive.

Is it the indexing, or is it just down to the queries being controlled by a specialised compiled program instead of Perl? Is that really enough to allow MySQL to handle tables 5000 times the size available for a flat text version?

Has anyone experimented with indexing flat file systems in Perl and using seek to query smaller blocks at a time?

If the flat file uses variable length records, updates to existing records that require the record length to change require a re-write of all records below the record being updated.

Fields in a flat file database can be indexed. Either the index can be stored in the flat file database, or in an adjacent file in a known location. However, this is not usually done, as the effort of writing routines to index a flat file database can likely be better spent using an existing binary database format, or toolset.

An additional issue with the requirement that much of the database be re-written if updates are necessary, is the problem of contention. What if multiple processes need to update the database? If process A needs to update the database, it must lock the database, re-write a large section of the database, and unlock the database. Process B needs to lock the database, re-read the database to ensure that it incorporates Process A's change, then re-write a large section of the database including Process B's change, and so on.

In general, flat file databases should only be used if they are very small (I would say less than 100Kbyte, not less than 1Mbyte), and if they only need to be updated by one process, and read by others. The 100Kbyte number is a question of fitting the database content into cache memory, as even having to fallback to RAM can be quite expensive.

I have had a great deal of success with one process writing common data to a text file, and many (500+) clients reading from the text file. The trick here is to ensure that the operation is atomic from the perspective of the client. The client must always see a complete file at the configured location. Also, the client should not rely on the data being fully up-to-date. From a network perspective, for a small amount of data, it is far more efficient to use a network protocol such as NFS to serve a static 'flat file database' to clients, than to serve the data from a generalized database server such as MySQL.

Scalability Unless you've got a very tightly scoped project, the inherent creep in data model will invariably lead to your data set growing. If you want to follow any type of normalisation of your data it means having multiple text files. This type of design is not going to scale very well.

Maintaince Adding records, removing records and maintaining the integrity of your record structure can be difficult. You've got two clients requesting access to the same record, both writing, which should go 1st? we're talking about locking. How do you manage backups? restore a complete file? what if its been updated? we're talking transaction logs..

Integrity What character do you use as a delimiter? and what if a user wishes to use it for a legit purpose? what about a data model which says only values 'a, b and c' are valid, and a user enters d? This kind of things can be handled very easily using foreign key constraints in an RDBMS.

Any serious designer will tell you an RDBMS is the best way to go in terms of building data models (assuming you're data model is relational). You've got all the access control, the transactions, the ability to fundamentally normalise your data, its all there, and, its all available freely.

Now all the negative stuff is out of the way, there are instances where a FF may be handy. Read only as MarkM pointed out is probably the best use of flatfiles.

"Classic" databases do their own data managment within a reserved space (usually called a "tablespace") on disk. Internally, this might not differ very much from how a filesystem works, where the data and directory listings (e.g. table rows and indices) are organized in some sort of tree. When you insert, update or delete data, only a small part of it needs to be rewritten.

Classic databases are usually developed over a number of years, get tweaked, tuned, optimized, critical loops rewritten to low level high performance code and such.

Last, but not least, many databases are also optimized to cache the most critical and most used data and indices in RAM.

Unless there are very specific project requirements (compatibility with other programs, customer requirements), you should take a look into using a "real" database.

While there are many "big" databases out there, like PostgreSQL, MySQL, Oracle, Microsoft SQL Server, there are alternatives that you might find useful, too, depending on many factors.

For example, if you want a light, portable system where you can just copy the data file to another computer, DBD::SQLite may be an alternative.

If you are doing mostly key/value stores (where the "values" can be complex datasets as well), you could also take a look into NoSQL databases like CouchDB.