I'm currently building an application that will generate a large number of images (a few tens of thousands of images, possibly more, but not in the near future at least). I want to be able to determine whether a file exists and also send it to clients over HTTP (I'm using Apache as my web server).

What is the best way to do this? I thought about splitting the images across a few folders to reduce the number of files in each directory. For example, let's say I decide that each file name will begin with a lowercase letter of the alphabet. Then I create 26 directories, and when I want to look up a file I prepend the name of the directory. For example, a file called "funnyimage2.jpg" would be saved inside a directory called "f". I can add layers to that structure if required.

To be honest, I'm not even sure that saving all the files in one directory isn't just as good, so if you could add an explanation of why your solution is better, that would be very helpful.

P.S. My application is written in PHP, and I intend to use file_exists to check whether a file exists or not.
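To make it concrete, here's a sketch of the lookup I have in mind (the directory layout and the shardedPath() helper are just illustrative, not an existing API):

```php
<?php
// Sketch of the lookup I have in mind: derive the subdirectory from the
// first letter of the file name, then check with file_exists().
// shardedPath() and the base directory are illustrative names.

function shardedPath($baseDir, $filename)
{
    $first = strtolower($filename[0]);   // "funnyimage2.jpg" -> "f"
    return $baseDir . '/' . $first . '/' . $filename;
}

$path = shardedPath('/var/www/images', 'funnyimage2.jpg');

if (file_exists($path)) {
    // the file is there; Apache can serve it from the matching URL
    echo "found: $path\n";
} else {
    echo "missing: $path\n";
}
```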

Just wanted to note up here... everyone seems to advocate splitting files into multiple folders (I do it too). There are advantages in not having to worry as much about name conflicts, but I have not seen any compelling information indicating that this is required or that it gives any boost to direct by-file-name access. That directory scanning or wildcard matching is faster seems apparent and logical, but for simple access to a file... no data, just opinion. I got the idea that it "should" be done from other people telling me so, not from any documentation.
– Chris Baker, Sep 20 '11 at 18:54

Chris, I have real-life experience with this from a couple of years ago. These were decent machines at the time, although the disks could perhaps have been faster. In a directory with ~25,000 subdirectories, it would sometimes take seconds to scan and open a subdirectory. Today this effect will be smaller, but it still exists.
– Evert, Sep 20 '11 at 21:27

And by "a couple of years," I mean around 2007/2008.
– Evert, Sep 20 '11 at 21:28

6 Answers

Do it with a hash, such as md5 or sha1, and then use 2 characters of the hash for each segment of the path. If you go 4 levels deep you'll always be fine:

f4/a7/b4/66/funnyimage.jpg
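A rough PHP sketch of that scheme (hashedPath() is just an illustrative name; here the hash is taken over the filename):

```php
<?php
// Derive a 4-level path from the md5 of the file name,
// two hex characters per level. hashedPath() is an illustrative helper.

function hashedPath($filename, $levels = 4)
{
    $hash  = md5($filename);
    $parts = array();
    for ($i = 0; $i < $levels; $i++) {
        $parts[] = substr($hash, $i * 2, 2);   // e.g. "f4", "a7", "b4", "66"
    }
    $parts[] = $filename;
    return implode('/', $parts);
}

$path = hashedPath('funnyimage.jpg');
// When saving a new file, create the intermediate directories in one call:
// mkdir(dirname($path), 0755, true);
```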

Oh, and the reason it's slow to dump everything in one directory is that most filesystems don't store filenames in a B-tree or similar structure; they often have to scan the entire directory to find a file.

The reason a hash is great is that it has really good distribution. 26 directories may not cut it, especially if lots of images have filenames like "image0001.jpg".

Since ext3 aims to be backwards compatible with the earlier ext2, many of the on-disk structures are similar to those of ext2. Consequently, ext3 lacks recent features, such as extents, dynamic allocation of inodes, and block suballocation.[15] A directory can have at most 31998 subdirectories, because an inode can have at most 32000 links.[16]

A directory on a Unix filesystem is just a file that lists filenames and which inode contains the actual file data. As such, scanning a directory for a particular filename boils down to the equivalent of opening a text file and scanning for a line containing a particular piece of text.

At some point, the overhead of opening that directory "file" and scanning for your filename will outweigh the overhead of using multiple sub-directories. Generally, this won't happen until there are many thousands of files. You should benchmark your system/server to find where the crossover point is.
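A crude way to find that crossover on your own hardware (purely illustrative; adjust the paths and counts for your setup):

```php
<?php
// Crude benchmark sketch: fill one directory with $count empty files,
// then time random file_exists() lookups. The path and counts are
// illustrative; run this against your own disk and filesystem.

$dir   = '/tmp/flat-bench';
$count = 50000;

if (!is_dir($dir)) {
    mkdir($dir, 0755, true);
}
for ($i = 0; $i < $count; $i++) {
    touch("$dir/file$i.jpg");
}

$start = microtime(true);
for ($i = 0; $i < 1000; $i++) {
    file_exists("$dir/file" . mt_rand(0, $count - 1) . ".jpg");
    clearstatcache();   // keep PHP's stat cache from hiding the real cost
}
printf("1000 lookups: %.4f s\n", microtime(true) - $start);
```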

After that, it's a simple matter of deciding how to split your filenames into subdirectories. If you're allowing only alphanumeric characters, then a split based on the first 2 characters (1,296 possible subdirs) might make more sense than a single dir with 10,000 files; a sketch follows below.

Of course, for every additional level of splitting you add, you're forcing the system to open yet another directory "file" and scan for your filename, so don't go too deep on the splits.
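For example, that first-two-characters split could look like this (sketch only; prefixPath() is an illustrative name):

```php
<?php
// Sketch: one split level based on the first two characters of the
// name itself -- up to 36^2 = 1,296 buckets for lowercase alphanumerics.

function prefixPath($baseDir, $filename)
{
    $prefix = strtolower(substr($filename, 0, 2));
    return "$baseDir/$prefix/$filename";
}

echo prefixPath('/var/www/images', 'funnyimage2.jpg');
// -> /var/www/images/fu/funnyimage2.jpg
```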

+1 - Col. Shrapnel, do you have any references on the benefit/utility in spreading the files to multiple folders? This issue has also come up in a project of mine, and I have not been able to find compelling evidence for or against storing a ton of files in a single folder. I have found some vague references to maximum file size in NTFS, and references to file-in-folder limitations for FAT32, but nothing concrete and nothing concerning ext3 etc.
– Chris Baker, Sep 20 '11 at 18:40

That's just a very common issue. You can easily test it in your own environment if you're interested in the actual numbers.
– Your Common Sense, Sep 20 '11 at 18:47

I know I can test it, but what a clumsy and inaccurate way to determine something that surely, if it exists, must have been put in place purposely and thus be written down in documentation somewhere. If you don't have documentation to support your statement, you can just say so...
– Chris Baker, Sep 20 '11 at 18:50

Quite the contrary: with your own test results you can be sure about the actual circumstances of your current server, not some outdated info about a server with a completely different setup. What disks are you using, SAS or SATA?
– Your Common Sense, Sep 20 '11 at 18:53

SATA. If there are limitations, they would be a product of the filesystem, with some possible (though unlikely) variance added by disk size or type. FAT32, for instance, has very specific limits on file and folder sizes, regardless of the hardware or any other circumstance. Likewise, if the case for splitting files into multiple folders is as compelling as it seems (100% of the answers advocate doing so), it must rest on reasons as firm as those FAT32 limits. Or we're all doing it out of inertia, with no supportable reason.
– Chris Baker, Sep 20 '11 at 18:59

I think Linux has a limit on the number of files a directory can contain; it might be best to split them up.

With your method, you can have the exact same image under many different file names. Also, you'll have more images that start with "t" than with "q", so some directories would still get large. You might want to store them as MD5-HASH.jpg instead. This will eliminate duplicates and give a more even distribution over the 16 possible first-character directories (hex digits 0-f).

Edit: As Evert mentions, you can use a multi-level directory structure to keep each directory even smaller.
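A sketch of what that could look like when storing a new image (storeImage() and the paths are illustrative, not an existing API):

```php
<?php
// Sketch: name the stored file after the md5 of its *contents*, so two
// identical images collapse to one file. Names and paths are illustrative.

function storeImage($tmpPath, $baseDir)
{
    $hash   = md5_file($tmpPath);                    // hash of the image data
    $subdir = $baseDir . '/' . substr($hash, 0, 2);  // optional split level
    if (!is_dir($subdir)) {
        mkdir($subdir, 0755, true);
    }
    $dest = "$subdir/$hash.jpg";
    if (!file_exists($dest)) {        // identical content already stored?
        rename($tmpPath, $dest);      // use move_uploaded_file() for uploads
    }
    return $dest;
}
```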

Do you have any documentation of such limitations?
– Chris Baker, Sep 20 '11 at 18:43

There's no practical limit to how many files you can have in a directory. You'd run out of inodes (max # of files on the filesystem in general) long before you hit any per-directory limit.
– Marc B, Sep 20 '11 at 18:47

@Marc B - So, if you're not doing any directory scanning or ls-ing or what have you, but only referring directly to a precise file name... where are all these answers touting the performance benefit of splitting up the files getting their information? It seems like a fallacy.
– Chris Baker, Sep 20 '11 at 18:48

Even using just a filename requires the directory to be scanned to locate that filename's corresponding inode number. fopen() may accept a filename, but internally everything goes by inodes. libc will internally have to open the dir, scan for the filename, and get the inode number. Only after that can it say "ok, open file #3452342564 and get the first 1024 bytes". And THAT will require the root inode to be scanned to find where the inode data itself is stored, and from that retrieve the particular disk sectors where the file is actually located.
– Marc B, Sep 20 '11 at 18:52