Storing Binary files in the Database

Purpose: to cover the pluses and minuses of storing files in a database versus storing the path to the binary file and other meta data.

Reason: this topic has come up several times in the past on the PostgreSQL mailing lists. With new user asking, is it OK store pictures, word, spreadsheet, html, pdf, or etc. in the database. Of course like everything thing there is 2 prevailing schools of thought each method having pluses and minus detailed below.

Caveat: many people come to this page looking for PostgreSQL's equivalent of the MySLQ "BLOB*" data types. The equivalent in Postgres is BYTEA. For details on PostgreSQL's "binary large object" (which are quite different from MySQL BLOB's and provide random seeking, etc), see below.

Methods:

Storing the large binary* file aka unstructured data streams in a database.

Need to escape/encode binary data before sending to DB then do the reverse after retrieving the data

Memory requirements on the server can be steep even on a small record set.

Pluses overall

Security and access control is simplified

version controlling is easier.

ACID

Backups are easier no need to track external files

Minus overall

Performance hit storing files in database.

Memory requirements higher, for Database

Backups can take significantly longer

Access to files to external applications is complicated. Normally a temporary file is copied to the client to access and modify the file then needs to be copied back.

Storing Meta data and symbolic link in the database to where a Binary file is located

Pluses

Performance accessing binary file skips DB access layer.

Number of files limited only by file system

Size limited by file system

Minuses

Need to develop an interface to keep track of externally attached files

external files and database can get out of sync

binary files missing entries in the database

database having entries to external files that's been deleted or moved by users or other events out of control of the database.

Security settings between file system and database are independent from each other. The application may need to may need multiple security logins to access all resources. This is dependent on security needs of the application, and methods used for access control. There are several methods to create a single log on infrastructure.

Multi points of failure depending on complexity of the network.

When should files be stored in the database?

A normal/ordinary record in a database would consist of a Inspector, time stamp of inspection, process used to inspect the part, was the part accepted or rejected, and the parts serial number. In this scenario photos are also taken showing the part inspected. The photos are a integral piece of the inspection record keeping process.

The photos being a critical piece making up the record/inspection report should not to be stored separately. Consider if the photos being stored on file sever are moved to new location as part of upgrade, the link to the record is bad if it was not updated. So now the inspection record by itself is worth less as it lost the piece it is describing and the photos are worthless without the record describing them. So the two pieces must stay together to have any value as a inspection report.

Storing the photos outside the database, there is no easy way to guarantee atomicity, consistency, isolation and durability of the photos. Storing the photos in the database is the only practical solution where the programmer can guarantee rules are followed in the future.

What is the best way to store the files in the Database?

There are two methods to choose from in PostgreSQL BLOB or bytea/text data types. The authors personal preference is using text data type to store files in the database as it simplifies getting to the file and eases encoding the file for storage.

The BLOB type requires using a different interface and adds some complexity in programming. Although it does support a far larger file. This added complexity the author tries to avoid.

The author dislikes the bytea data type as it can be very difficult to escape and then unescape depending on programming tools used. There is nothing wrong with the bytea data type but it can be more difficult to use. Yet this has been made easier in some programming languages such as PHP, which includes a escape and unescape bytea functions. Yet this is not a common tool available in other programming languages and as a general rule/suggestion the database should be designed to work with as many programming languages and or frameworks as makes sense.

Storing files in the text data type requires the files to be encoded in some character to binary representation. There are several to chose. This is not exhausted list just commonly used ones.

Base64 probably today is the most common encoding method for binary files thanks to email. Many programing languages, and frameworks support base64. So the author encodes the files in base64 when storing files in the database. Postgresql also has base64 encoding and decoding "encode() & decode()"

Biggest draw back using text data type and base64 encoding is it inflates the data by 33%. This adds a significant increase to storage requirements and overhead that bytea and BLOB do not suffer from. So keep in mind when designing a database what appears to be a simple decision now, can have big impacts down the road.

This 33% hit in overhead can have a significant impact. Consider the process time and IO hit on a 10 gig table with binary files. Postgresql is having to process an additional 3.3 gigs of more data because of the encoding method.

What Meta Data should be stored in a database?

here is a basic list what should be recorded in the database along side the binary file

original file name

data type

this is needed to tell us what kinda of binary file is stored in the database

the author suggest following the Internet Media Type it makes it easier to use the same information for a website or attaching to email.

long description

date added

When is it bad idea to store binary files in the database?

Very large files (100meg+), where performance is critical to the application. The database layer adds a lot of overhead and complexity that may not be required.