Sunday, September 19, 2010

Optimizing your file dialog

Recently, someone asked me why the custom file dialogs in some programs I wrote are able to load a file list so much faster than the so-called "native file dialogs" on those operating systems. The answer is that I optimized many areas of loading and displaying a file list. Today we'll look at some optimizations that can be performed, instead of using the most obvious approaches to each step of the process.

Know your system calls for reading directory entries

You'll want to create some sort of library around the native directory functions in order to use the most efficient path on each operating system. Operating systems vary widely in what they provide.

Sometimes you have multiple APIs, or a single API with many options you can turn on and off. Unless the provided API is itself doing everything as efficiently as possible (such as efficiently returning the file names in a sorted manner), you'll want to use the lowest level API, or turn off all the options it provides.

If bidirectional or scanning options are provided, turn them off, as you'll want forward only. The overhead for some of these features can significantly slow down the process of obtaining a file list. Same goes for anything else that adds overhead.

Be aware of what information is provided other than file names. Some OSs will return a structure which may also contain file size or file permissions alongside each file. If you're filtering by any of these, use the information directly there, instead of using an additional call to stat() or similar.

In addition to the above, if you need to filter by file size or file permissions, and the API does not provide that information directly, avoid the need for concatenating strings to get your information. A call to stat() would require concatenating a file name onto the directory name if it's not in the current directory. If your OS provides for it, use fstatat() or similar instead (see "OpenGroup Extended API Set Part 2").
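A sketch of that approach using POSIX.1-2008 fstatat() (the size filter and function name here are just illustrative):

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>

/* Sketch: filter by size without concatenating "dir/name" strings.
 * fstatat() resolves each name relative to the directory's own fd. */
long count_files_larger_than(const char *dir, off_t min_size)
{
    DIR *d = opendir(dir);
    if (!d) return -1;

    int fd = dirfd(d); /* file descriptor of the directory being scanned */
    long count = 0;
    struct dirent *e;
    struct stat st;
    while ((e = readdir(d)))
    {
        if (fstatat(fd, e->d_name, &st, 0) == 0 &&
            S_ISREG(st.st_mode) && st.st_size > min_size)
            ++count;
    }
    closedir(d);
    return count;
}
```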

Process file names in an optimized manner

Once you start getting back file names, the first thing to do is determine each name's length, as all other operations will need it. Pretty much every operating system / file system limits a file name to 255 bytes, so the length fits in a single byte. Compute the length (if need be) and keep reusing this cached value.

If you want to do extension filtering, now is the time to do so. Do not use strrchr(), or anything of the sort to find where the extension begins (searching for a period). Functions like strrchr() will first travel to the end then look back, effectively being O(2N). Since you already know the length of the string, use an O(N) function to look back from the end. Perhaps even using one which can scan multiple bytes (whole CPU words) at a time.
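As a sketch, a byte-at-a-time backwards scan using the cached length (find_extension is a hypothetical name; a word-at-a-time version would follow the same shape):

```c
#include <stddef.h>

/* Sketch: locate the extension by scanning backwards from the known end,
 * rather than strrchr() which must first walk to the end. Returns a
 * pointer just past the last '.', or NULL when there is no extension. */
const char *find_extension(const char *name, size_t len)
{
    const char *p = name + len;
    while (p-- > name)
    {
        if (*p == '.')
            return (p == name) ? NULL : p + 1; /* ".hidden" has none */
    }
    return NULL;
}
```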

For extension filtering, the most common method is generally one that breaks down to a series of compares. Put a little more thought into your algorithm usage! Either sort your list of extensions and binary search it for a match, or use perfect hashing. In either case, make sure the sorting or hashing phase is performed just once.
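A sketch with bsearch() over a pre-sorted list (the extension list here is purely illustrative):

```c
#include <stdlib.h>
#include <string.h>

/* Sorted once, up front -- never per file. */
static const char *wanted_exts[] = { "bmp", "gif", "jpg", "png" };

static int cmp_ext(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* Sketch: O(log N) membership test instead of a chain of strcmp()s. */
int extension_wanted(const char *ext)
{
    return bsearch(ext, wanted_exts,
                   sizeof(wanted_exts) / sizeof(*wanted_exts),
                   sizeof(*wanted_exts), cmp_ext) != NULL;
}
```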

For copying the file names you want elsewhere, use memcpy() instead of strcpy() and similar. Don't forget to pass around the length for other uses too.

Use efficient data structures efficiently

The next question is how to keep track of your list of files. The two most common ideas are to use a vector and just keep adding to it, or to use a linked list. Both of those solutions are wrong.

A vector, while great in theory, has some performance and memory issues. When a vector can no longer contain what is being added to it, it doubles in size. That doubling can cause up to 99% more allocation than is actually needed during the growing process, and also comes with a lot of copying around. This may suit many workloads, but it won't always suit the file list workload well.

A linked list, on the other hand, is one of the worst possible data structures you can use. Every entry has 100-200% extra overhead for pointers, and it often means a long series of allocations. Many small allocations lead to performance issues and a lot of memory fragmentation.

The solution I would recommend is either a tweaked vector, or a combination of a vector plus a linked list. Tweaking a vector starts off with reserving a set amount of space up front. Have your vector by default start with room for, say, 512 entries, and only then grow as needed. When you see your vector is full, instead of doubling in size, reserve another 512 entries. A smaller request is more likely to be fulfilled by realloc() in the same location than a doubling in size, and very large directories become increasingly rare past each new milestone. On top of all this, have your program remember how many files were in each of the past 10 or so directories it looked in (with filtering options of course). Then when loading a directory you recently loaded in the past, you can allocate an amount which is usually just what you need, or needs only one more chunk allocated onto it. Keep in mind that most users will keep using the same few directories over and over again with each application.
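A sketch of the fixed-chunk growth (the type and function names are mine; the chunk size of 512 follows the description above):

```c
#include <stdlib.h>

#define CHUNK 512 /* grow by a fixed amount instead of doubling */

typedef struct {
    const char **names; /* pointers into the name storage */
    size_t used, room;
} file_list;

/* Sketch: start empty, reserve CHUNK entries on first use, then extend
 * by CHUNK at a time. Modest realloc() requests are more likely to be
 * satisfied in place than a doubling. */
int file_list_add(file_list *l, const char *name)
{
    if (l->used == l->room)
    {
        size_t room = l->room + CHUNK;
        const char **p = realloc(l->names, room * sizeof(*p));
        if (!p) return -1; /* original block is still valid on failure */
        l->names = p;
        l->room = room;
    }
    l->names[l->used++] = name;
    return 0;
}
```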

The combination of the vector and linked list would instead build a structure out of the two ideas combined. Create your array of 512 entries, along with a fill pointer and linked list pointers. When your array is full, allocate another block of this structure as the next link in the chain. This ensures there is never a need to reallocate or copy data around during the growing phase. Of course, keep a global pointer to the current link being filled, instead of iterating to it each time you add a new entry.
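A sketch of that hybrid structure (names are mine; 512 entries per block as above):

```c
#include <stdlib.h>

#define BLOCK_ENTRIES 512

/* Sketch of the vector+linked-list hybrid: fixed-size blocks chained
 * together, so nothing is ever reallocated or copied while growing. */
typedef struct file_block {
    const char *names[BLOCK_ENTRIES];
    size_t used;              /* fill pointer for this block */
    struct file_block *next;
} file_block;

/* 'tail' tracks the block currently being filled, so we never iterate
 * the chain just to append a new entry. */
int block_list_add(file_block **tail, const char *name)
{
    file_block *b = *tail;
    if (b->used == BLOCK_ENTRIES)
    {
        file_block *n = calloc(1, sizeof(*n));
        if (!n) return -1;
        b->next = n;
        *tail = b = n;
    }
    b->names[b->used++] = name;
    return 0;
}
```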

Which of these two options you use depends on how you want to sort and display your files. Each data structure has pros and cons for different use cases. Before we get into that, know how to store your file names themselves.

The above description was for the pointers to the file names; the file names themselves should NOT each have a separate block of allocated memory. That's inefficient. Instead allocate large chunks of memory, say 16KB each, and store your file names in this memory pool. Forget the silly trailing null if you can; store the length before the name itself. The two data structures explained above for containing lists of files will point into these pools. If a pool can't contain another entry you're about to add, allocate another 16KB chunk, point to it from the previous one, and start allocating in this new one. If the need for a new chunk was caused by a rare overly long file name, say 230 bytes, when only 200 bytes were free, you can implement an algorithm to save some overhead by seeing if the next couple of files fit into the end of any previous pools. The pools in general will probably waste less than your OS's malloc() for small file names, and you'll certainly get better performance. Deallocation will also be a quick (backwards) traversal over large chunks instead of many small ones.
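A sketch of such a pool, assuming the 16KB chunk size and a one-byte length prefix (names are illustrative; the fit-into-older-pools optimization is left out):

```c
#include <stdlib.h>
#include <string.h>

#define POOL_SIZE 16384 /* the 16KB chunks described above */

typedef struct name_pool {
    struct name_pool *prev; /* for the quick backwards deallocation */
    size_t used;
    unsigned char data[POOL_SIZE];
} name_pool;

/* Sketch: store each name as a one-byte length followed by the bytes,
 * with no trailing null. Returns a pointer to the stored entry. */
const unsigned char *pool_store(name_pool **pool, const char *name,
                                unsigned char len)
{
    name_pool *p = *pool;
    if (!p || p->used + 1 + (size_t)len > POOL_SIZE)
    {
        name_pool *n = malloc(sizeof(*n));
        if (!n) return NULL;
        n->prev = p;
        n->used = 0;
        *pool = p = n;
    }
    unsigned char *entry = p->data + p->used;
    entry[0] = len;               /* length prefix */
    memcpy(entry + 1, name, len); /* no '\0' stored */
    p->used += 1 + (size_t)len;
    return entry;
}
```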

Sort efficiently

The first part of sorting is comparison. strcmp()/strncmp() are not the functions to use. You want something based on memcmp(), as that works faster thanks to having the length of the file name already cached. Of course that's only if you want standard sorting.
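As a sketch, with both lengths already cached:

```c
#include <stddef.h>
#include <string.h>

/* Sketch: memcmp() over the shorter cached length, then break ties by
 * length -- no scanning for a terminator as strcmp() must do. */
int name_compare(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    int diff = memcmp(a, b, n);
    if (diff) return diff;
    return (alen > blen) - (alen < blen); /* shorter name sorts first */
}
```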

If you want case insensitive sorting, other methods will have to be used. If you're only dealing with ASCII file names, and you're not worried about file names containing control characters, or about some symbols comparing as equal, you can compare 4 bytes at a time, or'ing each with 0x20202020 (0x2020202020202020 for 8 bytes at a time). You're probably out of luck otherwise for case insensitivity, unless someone knows of any good tricks to use.
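A sketch of the 4-bytes-at-a-time version (ASCII only, with the symbol caveats just mentioned; memcpy() is used to sidestep alignment issues):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the OR-with-0x20 trick: folds ASCII upper case to lower
 * case a word at a time. Some symbol pairs (e.g. '@' and '`') compare
 * as equal -- acceptable per the caveats above. Returns 1 if equal. */
int equal_nocase(const char *a, const char *b, size_t len)
{
    while (len >= 4)
    {
        uint32_t wa, wb;
        memcpy(&wa, a, 4); /* avoids unaligned word access */
        memcpy(&wb, b, 4);
        if ((wa | 0x20202020u) != (wb | 0x20202020u)) return 0;
        a += 4; b += 4; len -= 4;
    }
    while (len--) /* byte-at-a-time tail */
        if ((*a++ | 0x20) != (*b++ | 0x20)) return 0;
    return 1;
}
```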

If you need "natural sorting", where numbers are compared to each other, so file1 precedes file15, and file2 follows file1 rather than file15, there are a few implementations floating around online, and most are just copies of one another. Sadly, the most popular one I see, which is present in two operating systems, and is the native implementation in a particular programming language, happens to incorrectly handle numbers past a certain magnitude (did anyone even test the implementations, or just accept them at face value?), and has a worst case of O(3N). So beware of what you use or base your code off of.

Here is one way to implement such a comparison:

int natural_compare(const char *s1, const char *s2)
{
  int d1, d2;
  for (; *s1 && *s2; ++s1, ++s2)
  {
    if (isdigit(*s1) && isdigit(*s2)) //Both names have a run of digits starting here
    {
      for (d1 = 1; d1 && (*s1 == *s2); ) { d1 = isdigit(*++s1); ++s2; } //Skip matching digits

      d2 = isdigit(*s2); //If we broke the above loop because a single string ran out of digits
      if (d1 != d2) { return(d1-d2); } //Return that as the difference

      if (d1) //Otherwise, difference in the digits themselves, clarify magnitude
      {
        for (const char *p1 = s1, *p2 = s2; d1; )
        {
          d1 = isdigit(*++p1); d2 = isdigit(*++p2);
          //First string to run out of digits first
          if (d1 != d2) { return(d1-d2); } //Will lose right here
        }
        //If the loop breaks, both strings are out of digits,
        //the difference found above will be handled below
      }
      if (!*s1 || !*s2) { break; } //The digit scan may have reached a terminator
    }
    if (NOT_EQUAL(*s1, *s2)) { break; } //In all other cases, fall through with difference in current position - if any
  }
  return(*s1-*s2);
}

Modify as necessary for unicode, and replace NOT_EQUAL() with your function (or macro) of choice for comparing two letters (accounting for case differences if you want).

As for the sort functions themselves, it will depend on which data structure you chose. You'll probably want Quicksort for the vector. Make sure your Quicksort sorts the pointers to the data and not the data itself! Same for any sorting algorithm sorting this. Swapping pointers is quick and easy. For the latter structure, you might want Merge sort, or perhaps Smoothsort. What you choose can depend on several factors.

You may or may not want to utilize threading. Depending on how you filter, the OS may be giving you mostly sorted file names, which can also play a role in the algorithm you choose. In the case of the linked lists of arrays, you might want to sort each array with one algorithm, then merge them together. If you're utilizing threading, you can begin sorting the array(s) while you're still obtaining files from the operating system, as disk reads are slower than memory reads (when the current directory's contents are not (yet) cached by the OS).

Display directly

This point is rather simple. Have your file dialog display entries directly from whichever data structure you chose, without copying anything around. Know that most existing file dialog routines provided by various libraries will not be compatible with the data structures described above. So roll your own, or choose structures which can be passed directly to your library.

Another possibility is to copy entries as they fit in the dialog window. The dialog is unable to show an unlimited number of entries at once, so convert from your structure to its structure just for the files currently being viewed. Update as scrolling is performed.

Search efficiently

Many file dialogs provide the option to type in some characters, and the file dialog jumps to the first file that begins with those characters. Since your file dialog is sorted, use binary search. Some people don't realize this, but binary search can find more than just an exact match: it can also locate the transition from entries "less than" the query to the first match itself. Use this to find the first matching file in O(log(N)).
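A sketch of that lower-bound search over a sorted array of names (using strncmp() on the typed prefix for brevity, though a memcmp()-based compare with cached lengths fits the earlier advice better):

```c
#include <stddef.h>
#include <string.h>

/* Sketch: find the FIRST entry >= the typed prefix in a sorted array
 * of names -- a lower-bound search rather than a plain exact match. */
size_t first_with_prefix(const char **names, size_t count,
                         const char *prefix, size_t plen)
{
    size_t lo = 0, hi = count;
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        /* Compare only the typed prefix against the entry. */
        if (strncmp(names[mid], prefix, plen) < 0)
            lo = mid + 1; /* entry sorts before the prefix */
        else
            hi = mid;     /* entry is >= prefix: keep looking left */
    }
    return lo; /* index of the first match, or where one would be */
}
```

The caller still has to check that the entry at the returned index actually begins with the prefix, since the result is just the insertion point when no file matches.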

This basically wraps up this discussion on how to optimize file dialogs. I once was annoyed that loading up a directory with 20,000 files in it on a particular OS with a particular file browser took a good torturesome 20-30 seconds. Now using my own, I can load a directory with 100,000 files in 1-2 seconds on the same setup.

Thoughts, improvements, complaints, and every other kind of comment are welcome. Be sure to comment if you take any of these ideas to optimize your library or application. I'd love to hear about it.

4 comments:

Another important optimization: avoid reading a file if it is archived on tertiary storage. If secondary storage is slow, imagine something with a seek time measurable in tens of seconds. Similarly, avoid network access.

Other performance bottlenecks include loading icons for each file and running the normal shell system that allows tons of plugins to add more features (including virtual folders like the zip access in windows).

And don't be caught off guard by the fact that some OSes allow paths longer than 255 characters. Windows for example can use UNC style paths longer than MAX_PATH.

This article nowhere indicates one should ever read a file for displaying a file dialog. Although some do that for showing imagery.

>And don't be caught off guard by the fact that some OSes allow paths longer than 255 characters. Windows for example can use UNC style paths longer than MAX_PATH.

You shouldn't be storing a full path to display a dialog of file names. Windows, to my knowledge, even with UNC, does NOT support file names larger than 255 characters. All the NTFS documentation I can find indicates the file name limit for NTFS is 255 characters. So unless you know of some other file system Windows can work with that has a higher limit, the limit is 255 characters.

Generally good advice; however there are a couple of points that I would take issue with.

When using asymptotic `big-O' notation the leading constant is irrelevant and often misleading. For example, when comparing an algorithm with an O(n^2) runtime to one with an O(n) runtime I can be sure that as n tends to infinity the O(n) algorithm will result in a superior runtime. However, this is not the case with O(2n) and O(3n) — as each iteration may have a totally different run-time. Iterating through a list three times and doing nothing is almost certainly going to be faster than iterating through a list twice and doing an expensive comparison.

To make a meaningful statement about the constant you need to profile the algorithm and determine it empirically for a given system.

In addition, it is not worth worrying about the over-allocation associated with vectors. Thanks to virtual memory `realloc` almost never copies (even if it returns a different address) and with on-demand paging you only pay for what you use. Should your malloc implementation prove to be a bottleneck it is often worthwhile to switch implementations. (This is not uncommon: Safari uses TCMalloc and Firefox jemalloc.) A good malloc implementation can simplify your code and improve performance everywhere.

Memory pools are a good idea for lots of small strings, however. But 16 KiB is not a good size. Always try to avoid allocating a multiple of the system's page size (4 KiB on most x86 systems). This is because many malloc implementations store bookkeeping information at (allocaddr - sizeof(void *)*n). Knocking it down by 32 bytes or so should ensure you don't needlessly consume part of a page.

Looking at the NTFS documentation, the file name limit is 255 UTF-16 characters — or 510 bytes. (As an aside, if you're not using the Unicode API I highly recommend you switch to it immediately — it will save many needless memory allocations and character conversions.) This length of 510 bytes should not be a problem. Just use a size_t to store the length. (If you don't, the compiler is just going to pad it out to ensure alignment.) Or, better still, store the number of characters and use the system's locale-aware comparison function.

>When using asymptotic `big-O' notation the leading constant is irrelevant and often misleading.

In theory, yes. But this is practical advice here. Are you seriously arguing that immediately searching from the end of the string for the last dot is not faster than starting your search from the beginning (for either of the algorithms that can be used to do so)?

>To make a meaningful statement about the constant you need to profile the algorithm and determine it empirically for a given system.

The system in question makes no difference for such an algorithm. While one isn't orders of magnitude slower, and they have the same basic "theoretical" big-O simplification, one will always be faster than the other.

>In addition, it is not worth worrying about the over-allocation associated with vectors. Thanks to virtual memory `realloc` almost never copies (even if it returns a different address) and with on-demand paging you only pay for what you use. Should your malloc implementation prove to be a bottleneck it is often worthwhile to switch implementations.

I offered two possible methods of containment. Which works better really depends on many different factors. Your mileage may vary for the system you're on. I only propose the slimmer vector for systems where this is an issue. Dumb embedded systems are increasingly common these days, and some older systems are still around which have flaky memory managers. Use whichever works better for you. The main point of advice is that directories with files numbering in the hundreds are common, and larger ones become increasingly rare, so optimize your structures and algorithms accordingly.

>This is because many malloc implementations store bookkeeping information at (allocaddr - sizeof(void *)*n).

Depends on the OS. While that is a popular technique for many UNIX systems, many other systems have the OS store a table somewhere, and use the pointer as an index into it. Do whatever works best for your system.

>Looking at the NTFS documentation the file name limit is 255 UTF-16 characters — or 510 bytes. (As an aside if you're not using the Unicode API I highly recommend you switch to it immediately — it will save many needless memory allocations and character conversions.

This goes under using whichever API is the most lightweight for your OS.

>This length of 510 bytes should not be a problem. Just use a size_t to store the length.

Uh no, NTFS limits it to 255 characters. You can't use the 510 bytes as you please, encoding with UTF-8 or something else (unless you hack up NTFS drivers and go with breaking Windows compatibility). Therefore, only a single byte is needed to track the number of characters used in a particular file name.