I need to serve around 70,000 static files (jpg) using nginx. Should I dump them all in a single directory, or is there a better (more efficient) way? Since the filenames are numeric, I considered having a directory structure like:
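For instance, a six-digit name might be split on pairs of digits (hypothetical paths, one of many possible schemes, not necessarily the asker's):

```
images/12/34/123456.jpg
images/12/35/123501.jpg
images/99/00/990013.jpg
```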

How large are the image files? If they're all (quite) small then a Squid cache or just the filesystem caching will make a huge difference, as most (or all) of them could be cached in memory.
– David Gardner Nov 19 '09 at 12:15

12 Answers

Benchmark, benchmark, benchmark! You'll probably find no significant difference between the two options, meaning that your time is better spent on other problems. If you do benchmark and find no real difference, go with whichever scheme is easier -- what's easy to code if only programs have to access the files, or what's easy for humans to work with if people need to frequently work with the files.

As to which one is faster, directory lookup time is, I believe, proportional to the logarithm of the number of files in the directory. So each of the three lookups for the nested structure will be faster than one big lookup, but the total of all three will probably be larger.

But don't trust me, I don't have a clue what I'm doing! Measure performance when it matters!

You're absolutely correct about the need to measure, but you are incorrect on the lookup time. It's dependent on filesystem, and many filesystems start showing degraded performance at well below 70k files.
– Christopher Cashell Jul 12 '09 at 6:37

Sorry if this is a silly question but ... how do I benchmark this ?
– Ahsan Jul 12 '09 at 6:40

I'd imagine you just want to wrap a typical call to fopen() in a loop, then pound away opening (and quickly close()ing) a typical set of files by name. Make sure fopen() isn't lazy before you trust those results, though. @Christopher Cashell: Hence the big fat disclaimer :)
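That loop might look like the following in Python, using open() rather than C's fopen() (the layouts and file counts below are made up; point it at the real directory trees before trusting any numbers):

```python
import os
import time

def bench_open(paths):
    """Time how long it takes to open and immediately close each file."""
    start = time.perf_counter()
    for p in paths:
        with open(p, "rb") as f:
            f.read(1)  # touch the file so a lazy open can't fake the result
    return time.perf_counter() - start

# Hypothetical layouts to compare: flat vs. two-level nesting of numeric names.
flat = [os.path.join("flat", f"{i}.jpg") for i in range(70000)]
nested = [os.path.join("nested", f"{i:06d}"[:2], f"{i:06d}"[2:4], f"{i}.jpg")
          for i in range(70000)]
# print(bench_open(flat), bench_open(nested))
```

Run it several times and throw away the first pass, since the second pass will mostly measure the page cache rather than the directory lookups.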
– kquinn Jul 12 '09 at 7:55

I agree with this. One app I'm working with stores all uploaded files in a single directory, and navigating/deleting/copying any one of those files is a pain in the butt.
– Spencer Ruport Jul 12 '09 at 8:44


Urg, that joke is getting really old now. It was funny the first 50 times I heard it, but by the 300th it's starting to drone.
– Adam Gibbins Jul 13 '09 at 17:49

Doing some basic directory hashing is generally a good idea. Even if your file system deals well with 70k files, having, say, millions of files in a single directory would become unmanageable. Also consider how your backup software handles many files in one directory, etc.

That being said: To get replication (redundancy) and easier scalability consider storing the files in MogileFS instead of just in the file system. If the files are small-ish and some files are much more popular than others, consider using Varnish (varnish-cache.org) to serve them Very Quickly.

Another idea: Use a CDN -- they are surprisingly cheap. We use one that costs basically the same as we pay for "regular bandwidth"; even at low usage (10-20Mbit/sec).

You could put a Squid cache in front of your nginx server. Squid can either keep the popular images in memory, or use its own file layout for fast lookups.

For Squid, the default is 16 level-one directories and 256 level-two. These are reasonable defaults for most file systems.
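In squid.conf those defaults come from the cache_dir directive, where the last two numbers are the level-one and level-two directory counts (the path and the cache size in MB below are placeholders; adjust for your system):

```
# <scheme> <path> <size-MB> <L1 dirs> <L2 dirs>
cache_dir ufs /var/spool/squid 10000 16 256
```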

If you don't use a product like Squid, and create your own file structure, then you'll need to come up with a reasonable hashing algorithm for your files. If the file names are randomly generated this is easy, and you can use the file name itself to divide up into buckets. If all your files look like IMG_xxxx, then you'll either need to use the least significant digits, or hash the file name and divide up based on that hash number.
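Both options (taking the least significant digits, or hashing the whole name) can be sketched in a few lines of Python — the two-level layout, bucket width, and MD5 fallback here are my assumptions, not a standard:

```python
import hashlib
import os

def bucket_path(filename, levels=2, width=2):
    """Map a file name to a nested bucket path.

    Numeric names like IMG_123456.jpg are bucketed by their least
    significant digits; anything without enough digits falls back to
    an MD5 hash of the name so files still spread evenly.
    """
    stem = os.path.splitext(filename)[0]
    digits = "".join(ch for ch in stem if ch.isdigit())
    need = levels * width
    if len(digits) >= need:
        key = digits[-need:]  # least significant digits
    else:
        key = hashlib.md5(filename.encode()).hexdigest()[:need]
    parts = [key[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(*parts, filename)

# bucket_path("IMG_123456.jpg") -> "34/56/IMG_123456.jpg" on POSIX
```

Using the least significant digits (rather than the leading ones) matters for sequential IDs: new uploads then scatter across buckets instead of all landing in the newest directory.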

Anything that contains the words "in memory" gets my vote, though he didn't tell us the size of those files.
– gbarry Aug 4 '09 at 4:59

Linux will deliver the popular files from memory anyway, without touching the file system. The backend will probably need the hashing regardless, as files will still need to be backed up, published to, administered, etc.
– Matt Aug 23 '12 at 13:11

@mindthemonkey do you know where I could find more information on this? E.g. how to monitor what is in memory, how to adjust config etc.? Thanks
– UpTheCreek Oct 30 '12 at 8:31

@UpTheCreek here's a decent overview of the internals. Overall usage of the page cache can be seen with free -m, top, or nmon (buffers/cached). Specific usage for files can be interrogated with fincore (from linux-ftools). And you can poke about the cache with vmtouch.
– Matt Nov 5 '12 at 10:00

By all means benchmark and use that information to help you make a decision, but if it were my system I would also give some consideration to long-term maintenance. Depending on what you need to do, it may be easier to manage things if there is a directory structure instead of everything in one directory.

Splitting them into directories sounds like a good idea. Basically (as you may know) the reason for this approach is that having too many files in one directory makes the directory index huge and causes the OS to take a long time to search through it; conversely, having too many levels of (in)direction (sorry, bad pun) means doing a lot of disk lookups for every file.

I would suggest splitting the files into one or two levels of directories - run some trials to see what works best. If there are several images among the 70,000 that are significantly more popular than the others, try putting all those into one directory so that the OS can use a cached directory index for them. Or in fact, you could even put the popular images into the root directory, like this:

...hopefully you see the pattern. On Linux, you could use hard links for the popular images (but not symlinks, that decreases efficiency AFAIK).
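That "popular images in the root, everything else nested" layout can be sketched in Python — the POPULAR set, the digit split, and the directory names are all my assumptions for illustration:

```python
import os

POPULAR = {"123456.jpg", "234567.jpg"}  # hypothetical hot set

def serve_path(root, filename):
    """Popular images live in the root; the rest are nested by digits."""
    if filename in POPULAR:
        return os.path.join(root, filename)
    stem = os.path.splitext(filename)[0]
    return os.path.join(root, stem[:2], stem[2:4], filename)

def hardlink_popular(root):
    """Hard-link each popular image into the root so both paths work."""
    for name in POPULAR:
        stem = os.path.splitext(name)[0]
        nested = os.path.join(root, stem[:2], stem[2:4], name)
        top = os.path.join(root, name)
        if os.path.exists(nested) and not os.path.exists(top):
            os.link(nested, top)  # hard link, not symlink
```

A hard link costs nothing extra on disk and, unlike a symlink, doesn't add another path resolution step.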

Also think about how people are going to be downloading the images. Is any individual client going to be requesting only a few images, or the whole set? Because in the latter case, it makes sense to create a TAR or ZIP archive file (or possibly several archive files) with the images in them, since transferring a few large files is more efficient than a lot of smaller ones.
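For the whole-set case, the archive can be built ahead of time, e.g. with Python's tarfile module (the function and file names here are illustrative):

```python
import os
import tarfile

def build_archive(archive_path, image_paths):
    """Pack images into a single uncompressed tar.

    JPEGs are already compressed, so plain "w" mode avoids wasting
    CPU on gzip for little gain.
    """
    with tarfile.open(archive_path, "w") as tar:
        for p in image_paths:
            tar.add(p, arcname=os.path.basename(p))
```

Serving one pre-built tar also sidesteps the per-request overhead (connection setup, headers) of 70,000 individual downloads.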

P.S. I sort of got carried away in the theory, but kquinn is right: you really do need to run some experiments to see what works best for you, and it's very possible that the difference will be insignificant.