I am currently creating a web application that allows users to store and share files, 1 MB - 10 MB in size.

It seems to me that storing the files in a database will significantly slow down database access.

Is this a valid concern? Is it better to store the files in the file system and save the file name and path in the database? Are there any best practices related to storing files when working with a database?

I am working in PHP and MySQL for this project, but is the issue the same for most environments (Ruby on Rails, PHP, .NET) and databases (MySQL, PostgreSQL)?


12 Answers

Reasons for storing files in the database:

ACID consistency, including rollback of an update, which is complicated when the files are stored outside the database. This isn't to be glossed over lightly. Having the files and database in sync and able to participate in transactions can be very useful.

Files go with the database and cannot be orphaned from it.

Backups automatically include the file binaries.

Reasons against storing files in the database:

The maximum size of a binary column differs among databases. On SQL Server, for example, it is 2 GB when not using the FILESTREAM object. If users need to store larger files (say, a movie), you have to jump through hoops to make that magic happen.

Increases the size of the database. One general concept to take to heart: the level of knowledge required to maintain a database goes up in proportion to the size of the database, i.e., large databases are more complicated to maintain than small ones. Storing the files in the database can make the database much larger. Even if a daily full backup would otherwise have sufficed, with a larger database you may no longer be able to do that. You may have to consider putting the files on a different filegroup (if the database supports that), tweaking the backups to separate the backup of the data from the backup of the files, etc. None of these things are impossible to learn, but they add complexity to maintenance, which means cost to the business. Larger databases also consume more memory as they try to stuff as much data into memory as possible.

Portability can be a concern if you use system specific features like SQL Server's FILESTREAM object and need to migrate to a different database system.

The code that writes the files to the database can be a problem. One company for which I consulted not so many moons ago at some point connected a Microsoft Access frontend to their database server and used Access' ability to upload "anything" using its OLE Object control. Later they changed to a different control which still relied on OLE. Much later someone changed the interface to store the raw binary. Extracting those OLE Objects was a new level of hell. When you store files on the file system, there isn't an additional layer involved to wrap/tweak/alter the source file.

It is more complicated to serve up the files to a website. In order to do it with binary columns, you have to write a handler to stream the file binary from the database. You can do this even if you store file paths, but you don't have to. Again, adding a handler is not impossible, but it adds complexity and is another point of failure.
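Such a handler can be sketched like this (Python with sqlite3 purely for illustration; the question's PHP/MySQL version would be analogous, and the table and names here are hypothetical):

```python
import sqlite3

def make_db():
    # In-memory database standing in for the real file store
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT, content BLOB)")
    conn.execute("INSERT INTO files (name, content) VALUES (?, ?)",
                 ("report.pdf", b"%PDF-1.4 fake content " * 100))
    conn.commit()
    return conn

def stream_file(conn, file_id, chunk_size=1024):
    """Yield the blob in chunks, as a download handler would write them to the socket."""
    row = conn.execute("SELECT content FROM files WHERE id = ?", (file_id,)).fetchone()
    if row is None:
        raise FileNotFoundError(file_id)
    blob = row[0]
    for i in range(0, len(blob), chunk_size):
        yield blob[i:i + chunk_size]

conn = make_db()
data = b"".join(stream_file(conn, 1))
```

In a real web app the handler would also set Content-Type and Content-Length headers before streaming; that is the extra moving part the answer is referring to.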

You cannot take advantage of cloud storage. Suppose one day you want to store your files in an Amazon S3 bucket. If what you store in the database are file paths, you are afforded the ability to change those to S3 paths. As far as I'm aware, that's not possible in any scenario with any DBMS.

IMO, deeming the storage of files in the database or not as "bad" requires more information about the circumstances and requirements. Are the size and/or number of files always going to be small? Are there no plans to use cloud storage? Will the files be served up on a website or a binary executable like a Windows application?

In general, my experience has been that storing paths is less expensive to the business, even accounting for the lack of ACID and the possibility of orphans. That does not mean the internet isn't legion with stories of missing ACID control going wrong with file storage, but it does mean that, in general, that solution is easier to build, understand, and maintain.

Why can't you use CDNs? This is a supported scenario with pretty much every CDN I've ever heard of.
– Billy ONeal, May 30 '12 at 17:55

@BillyONeal - You can't use a CDN and store the file in the database. Unless you are OK with duplication, you can't have both.
– Thomas, May 30 '12 at 20:09

Erm, the whole point of a CDN is duplication. CDNs merely cache the target of a web address -- the only requirement is that there's an HTTP host serving the content, and that the content changes rarely. (How on earth is the CDN supposed to tell where you pulled the image from anyway?)
– Billy ONeal, May 30 '12 at 20:22


@BillyONeal - However, I think this is bad choice of words on my part and I've adjusted my answer. Specifically, if you want to use cloud storage (and then perhaps use a CDN with your cloud storage), you can't do it natively with the database storage solution. You would have to write a synchronization routine to pull the files from the database and then send them to your cloud storage provider.
– Thomas, May 30 '12 at 21:17

In many cases, this is a bad idea. It will bloat the database files and cause several performance issues. If you stick the blobs in a table with a large number of columns it's even worse.

However! Some databases, like SQL Server, have a FILESTREAM column type. In this case, your data is actually stored in a separate file on the database server and only an ID to the file is saved in the table. In this case I don't see much of a reason not to keep the data in the SQL server. The files are automatically included as part of the server backup, and the database and the files are never out of sync. The problem with Tony's suggestion of storing file names is that the database and the filesystem can get out of sync. The database will claim a file exists when it's been deleted on disk. If a process is modifying the database and then crashes, the files and the database will not match (i.e. no ACID with files outside of a database).

I disagree with the statement `If a process is modifying the DB and then crashes, the files and the DB will not match.` If you wrap the entire process in a transaction (create file, validate file, update DB) and throw error messages when something goes wrong, it's quite easy to keep them in sync.
– briddums, May 29 '12 at 14:57


I'm with briddums on that; consider this scenario: store the file to the filesystem (without deleting the old one), update the DB, on success delete the old file, on rollback delete the new file. Worst case, if the process gets interrupted, you have an orphan file, but you always have the files referenced by the DB in the correct version.
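That scenario can be sketched as follows (Python/sqlite3 sketch with a hypothetical schema; a production version would also need fsync and crash-safe cleanup of orphans):

```python
import os
import sqlite3
import tempfile

def save_file(conn, record_id, data, storage_dir):
    """Write the new version first, update the DB, and only then delete the old file.
    On failure the DB rolls back and the new file is removed; worst case is an orphan file."""
    old_path = conn.execute("SELECT path FROM files WHERE id = ?",
                            (record_id,)).fetchone()[0]
    fd, new_path = tempfile.mkstemp(dir=storage_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        with conn:  # transaction: commits on success, rolls back on exception
            conn.execute("UPDATE files SET path = ? WHERE id = ?",
                         (new_path, record_id))
    except Exception:
        os.unlink(new_path)  # rollback: discard the new file, old one still referenced
        raise
    if old_path and os.path.exists(old_path):
        os.unlink(old_path)  # success: garbage-collect the old version
    return new_path

storage = tempfile.mkdtemp()
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT)")
conn.execute("INSERT INTO files (path) VALUES ('')")
p1 = save_file(conn, 1, b"v1", storage)
p2 = save_file(conn, 1, b"v2", storage)
```

The key ordering is that the DB row never points at a file that does not exist; an interruption can only leave an unreferenced file behind.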
– vartec, May 29 '12 at 15:01


Other potential problems with the file/DB method: 1) you have to do updates as copy-on-write. If your process crashes during an update, the DB state will be rolled back; the file will not. 2) Doing this then requires some sort of garbage collection of the old files. 3) Storing everything in the DB means that the versions of the DB and files are in sync after backups. Restore your DB to its state 2 weeks ago... now what were the contents of the files at that time?
– Timothy Baldridge, May 29 '12 at 16:54


@briddums - Nope, since SQL Server integrates directly into the file system and manages those files on behalf of the OS. I haven't used them myself, but the documentation makes it look like FILESTREAM and its descendant FileTables grant you the best of both worlds: files are bound tightly to the database and related data (allowing you to centrally manage your data) without bloating the database.
– Nick Chammas, May 29 '12 at 23:29


I agree with Nick. We've replaced our Disk+DB system with FILESTREAM columns and never looked back. It's really nice to be able to have files tie out to other tables via FKs. So you can actually say "each person must have one or more HR docs associated with them", or something else like that.
– Timothy Baldridge, May 30 '12 at 3:19

Disadvantages of storing files in the DB:

If you do a SELECT that includes any BLOB column, you will always do a disk access, while without BLOBs you have a chance of getting the data straight from RAM (a high-throughput DB will be optimized to fit its tables in RAM);

replication will be slow and replication delay high, as the DB will have to push BLOBs to the slaves. High replication delay will cause all kinds of race conditions and other synchronization problems unless you explicitly take that into account;

DB backups/restores will take a lot longer;

Speed advantage: none! While some older filesystems did not handle directories with millions of files well, most modern ones have no problem at all, and in fact they use the same kind of data structures as DBs (typically B-trees). For example, ext4 (the default Linux filesystem) uses an HTree.

Conclusion: it will hinder your DB performance and will not improve file retrieval performance.

Also, since you're talking about a web application: serving static files directly from the filesystem using a modern web server that can use the sendfile() syscall is a tremendous performance improvement. This is of course not possible if you're fetching files from the DB. Consider for example this benchmark, showing Nginx doing 25K req/s with 1000 concurrent connections on a low-end laptop. That kind of load would fry any kind of DB.

@BillyONeal: why do you assume that you have to have the same server for static and dynamic content? As for synchronizing files across servers, there are tools specifically designed for that, much more efficient than databases. Using a database as a file server is like trying to hammer a nail with a screwdriver.
– vartec, May 31 '12 at 8:59

Microsoft Research's paper "To BLOB or Not To BLOB" compared the NTFS file system and SQL Server 2005 and found that BLOBs smaller than 256 KB are handled more efficiently by SQL Server, while NTFS is more efficient for BLOBs larger than 1 MB.

I would recommend that you write some small tests for your particular use case. Bear in mind that you have to beware of caching effects. (I was amazed the first time I got save-to-disk speeds that seemed to have higher throughputs than was physically possible!)
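Such a test might start like this (Python/sqlite3 sketch with a hypothetical 1 MB payload; an in-memory DB and the OS page cache flatter both sides, so treat the timings as relative at best):

```python
import os
import sqlite3
import tempfile
import time

def time_db_roundtrip(blob):
    """Insert the blob into a table and read it back, timing the round trip."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE f (id INTEGER PRIMARY KEY, content BLOB)")
    start = time.perf_counter()
    conn.execute("INSERT INTO f (content) VALUES (?)", (blob,))
    conn.commit()
    out = conn.execute("SELECT content FROM f WHERE id = 1").fetchone()[0]
    return time.perf_counter() - start, out

def time_fs_roundtrip(blob):
    """Write the blob to a temp file and read it back, timing the round trip."""
    fd, path = tempfile.mkstemp()
    start = time.perf_counter()
    with os.fdopen(fd, "wb") as f:
        f.write(blob)
    with open(path, "rb") as f:
        out = f.read()
    elapsed = time.perf_counter() - start
    os.unlink(path)
    return elapsed, out

blob = os.urandom(1024 * 1024)  # 1 MB, the low end of the question's range
db_time, db_out = time_db_roundtrip(blob)
fs_time, fs_out = time_fs_roundtrip(blob)
```

A realistic version would use your actual DB server, on-disk files, and payloads across the whole 1-10 MB range, repeated enough times to average out caching effects.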

If you serve a file from your filesystem, your Web server can use kernel code like sendfile() on BSD or Linux to copy the file directly to the socket. It's very fast and very efficient.

Serving files out of the database means you have to copy data from the database server's disk to database server memory, then from db server's memory to the db server's network port, then in from the network to your Web server process, then out again to the outgoing network connection.

Unless you have a really good reason not to, it's always better to serve static files from the file system.

This is true, but I fail to see where the user states in the question that he will be serving static files from the database. These could very well be dynamic files or user-uploaded files, which, if stored on the filesystem separate from the database, now must be synced and have a separate backup/restore process.
– maple_shaft ♦, May 29 '12 at 19:13

My understanding is the question is about serving user-uploaded files. "I am currently creating a web application that allows users to store and share files [...] It seems to me that storing the files in a database [...]". I don't think it's really that convenient to do DB dumps with lots of multi-megabyte blobs in the database. Also: yes, it's hard to deal with files; synching, archiving, are all more difficult. However, it's not much more difficult, and sacrificing online performance to save a few lines in your nightly backup script is a big mistake.
– Evan P., May 30 '12 at 1:23

It's usually best to store large BLOBs in a separate table and just keep a foreign key reference to the BLOB in your main table. That way, you can still retrieve the file from the database (so you don't need any special code) and you avoid the problems surrounding external DB dependencies (keeping the DB and filesystem in sync, etc), but you only incur that overhead if you explicitly join to that table (or make a separate call). 10MB isn't terribly large, most modern commercial databases won't have a problem. The only reason I'd store a file in the filesystem is to cut down on database bandwidth. If your database is going to be shuffling a lot of these files, then you may need to split the workload and only store a file descriptor of some sort. Then you can have a separate call to load the file from another server, so you aren't tying up your database connections (and network connections on your database server) with all those file transfers.
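A sketch of that layout (Python/sqlite3 for illustration; the table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Narrow main table: metadata only, cheap to scan
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        blob_id INTEGER REFERENCES blobs(id)
    );
    -- Wide side table: the blob lives here, read only on demand
    CREATE TABLE blobs (
        id INTEGER PRIMARY KEY,
        content BLOB NOT NULL
    );
""")
conn.execute("INSERT INTO blobs (id, content) VALUES (1, ?)", (b"x" * 10_000,))
conn.execute("INSERT INTO documents (name, blob_id) VALUES ('cat.jpg', 1)")

# Listing documents touches only the narrow table, no blob I/O:
names = [r[0] for r in conn.execute("SELECT name FROM documents")]

# The blob is read only on an explicit join:
content, = conn.execute("""
    SELECT b.content FROM documents d
    JOIN blobs b ON b.id = d.blob_id
    WHERE d.name = 'cat.jpg'
""").fetchone()
```

The point is that everyday queries against `documents` never pay the blob cost; only code that explicitly joins to `blobs` does.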

I would be pragmatic about it, and follow the "don't optimize yet" principle. Make the solution that makes sense at the moment, and one that you have the development resources to properly implement. There are plenty of potential problems. But those do not necessarily become real problems. E.g. It would probably not be a problem if you have 100 users. It might be a problem if you have 100,000 or 10,000,000 users. But in the latter case, there should be a basis for more development resources to deal with all the issues.

But storing the data in the database does relieve you of other problems, e.g. where the files should be stored, how they should be backed up, etc. Since you are writing a web application, it would be a very good idea for security reasons to make sure that the process hosting the application does not have write access to the file system; with file storage, you would instead need to configure the server so that the process has read/write access to the folder where the data is stored.

I would personally choose to store the data in the database, but make sure that the BLOBs are not read until they are really needed, i.e. no "SELECT * FROM ..." executed on the tables containing blobs. And I would make sure that the design makes it easy to move the data out of the database and into the filesystem if you do get performance problems. For example, store the file information in a separate File table, keeping the file information away from other business entities.

Assuming you have a File class representing a file read from the database, the coding impact of later moving it out will be minimal.

Doing a SELECT * that involves a row with a large blob takes a long time, even if you don't need the blob. (Of course you should do a specific SELECT, but sometimes applications are written like this.)

Doing a backup can take much longer. Depending on your needs, you may have to lock your tables for the duration of the backup, so you will want to keep your backup time low.

Restoring will also take much more time.

If you run out of space, you have to think of some way (maybe moving the whole database to a new server) to solve the problem. Storing the files on the file system, you can always mount another hard drive and set up soft links.

Simply looking into a file for debugging or other information is not as easy. This also includes scripts which might not have access to the database but need some information from various files.

Of course you also get some benefits:

Backing up data and files means they are in sync

Removing the file without the database knowing is not possible

You don't have to read the file from disk but can do it in one SQL statement

You can download the database, include the dump into your development environment and have all dependencies right there

Personally I don't do it as I find the cons much heavier than the pros. But as stated above it totally depends on your use case and such.

The famous Tom Kyte has written that they (Oracle) use the Oracle database as a file server, and that it works perfectly fine, even faster than a normal filesystem, with full transactionality, no performance loss, and a single backup.

Yes, but note that they are the producer of the Oracle DB; for any other user there are cost issues. Using a commercial DB such as Oracle for file storage is simply cost-ineffective.

However, with PostgreSQL for example, you can simply run another DB instance just for blob storage. You then have full transactional support. But transactionality costs DB space: the database needs to keep multiple blob versions for multiple concurrent transactions. With PostgreSQL this is the most painful, since the duplicate blobs written by transactions remain on disk even when they are no longer needed, until the VACUUM process runs.

With filesystem storage, on the other hand, you must be very careful when someone modifies a file, because a transaction can be rolled back, and so a copy of the file must be kept until the old version is no longer visible.

In a system where files are only added and deleted, and transactional access to files is not an issue, filesystem storage will, IMHO, be the best choice.

Upsides:

All file contents are kept in sync with your table. As the comments above said, backing up data is totally convenient, as you don't need to keep the data synchronized with the file system.

From code, you can get the file content directly with a SQL SELECT.

From a query, you can even filter on file content or size directly in the SQL statement.
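For example (sqlite3 sketch with a hypothetical schema), a query can filter on blob size without pulling the blobs into the application:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT, content BLOB)")
conn.executemany("INSERT INTO files VALUES (?, ?)",
                 [("small.txt", b"a" * 100), ("big.bin", b"b" * 5000)])

# LENGTH() on a BLOB returns its size in bytes, so the filter runs in the DB:
big = [r[0] for r in conn.execute(
    "SELECT name FROM files WHERE LENGTH(content) > 1000")]
```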

Downsides:

Compared to a database whose structure is semantically the same but does not store file content, your database tends to consume radically more memory when running queries.

Automatic backups can cause performance problems. Imagine your database server backs everything up every 6 hours and your databases store a 10 MB file per record. That scenario is not what you want.
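Rough arithmetic with hypothetical numbers shows how quickly that scenario grows:

```python
# Hypothetical sizing: 100,000 records at 10 MB each, full backup every 6 hours
records = 100_000
file_mb = 10
backups_per_day = 24 // 6

backup_size_gb = records * file_mb / 1024   # one full backup, in GB
daily_backup_gb = backup_size_gb * backups_per_day  # backup volume per day
```

Under those assumptions, a single full backup is roughly 1 TB, and the daily backup volume approaches 4 TB; with path-only storage the database backup would be tiny.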

If you do a bit of research, you'll find that SiteCore is one of the most popular enterprise content management systems. SiteCore supports a large number of concurrent users and scales pretty well, so storing files inside a separate database is not a bad practice if you do it right.
– šljaker, Jul 28 '13 at 20:52