I run a hosted continuous integration company, and we run our customers' code on Linux. Each time we run the code, we run it in a separate virtual machine. A frequent problem is that a customer's tests sometimes fail because of the order in which the files of their checked-out code are read from a directory on the VM.

Let me go into more detail. On OSX, the HFS+ file system ensures that directories are always traversed in the same order. Programmers who use OSX assume that if it works on their machine, it must work everywhere. But it often doesn't work on Linux, because Linux file systems do not offer ordering guarantees when traversing directories.

As an example, consider two files, a.rb and b.rb. a.rb defines MyObject, and b.rb uses MyObject. If a.rb is loaded first, everything works. If b.rb is loaded first, it tries to access MyObject before it has been defined, and fails.

But worse than this, it doesn't fail consistently. Because Linux does not guarantee any directory ordering, the order can differ from one machine to the next. So sometimes the tests pass and sometimes they fail, which is the worst possible result.

So my question is: is there a way to make file system ordering repeatable? Some flag to ext4, perhaps, that says it will always traverse directories in some order? Or maybe a different file system that has this guarantee?

Besides the really true answers - what is the "correct" order? Just alphanumerically sorted? Or by CTIME? Arbitrarily magically? How do the customers ensure this order on deployment? How should this magical order information be transferred to you?
– Michuelnik, Jul 10 '12 at 12:35

@Michuelnik There's no real correct order, but something repeatable would mean that we get the same result every time, which would be better than nothing. Ideally, we'd use the HFS+ ordering, which I think is alphabetical.
– Paul Biggar, Jul 10 '12 at 14:38

@Michuelnik This problem affects tests much more than deployment. Deployment mostly happens on Linux, but if something fails they'll fix it. Tests mostly run on OSX, so if something fails it must be our fault.
– Paul Biggar, Jul 10 '12 at 14:40

@PaulBiggar: I understand your problem and I can't offer a good solution (unless you can find a way to detect if the file order is the cause of the problem). But I don't agree that "repeatable success is better than inconsistent failure": If my development (and CI) environments have repeatable success but my deployment platform has the "unreliable failure" syndrome, then I'm truly in a bad spot. I'd rather see the unreliable failure as soon as possible (ideally on my development system but at least on my CI system).
– Joachim Sauer, Jul 11 '12 at 13:28

3 Answers
I know it's not the answer you're looking for, but I believe the correct solution is to avoid depending on the ordering of files in a directory. Maybe it's always consistent across all HFS+ filesystems, and maybe you could find a way to make it consistent in ext4 or some other filesystem as well, but it will cost you more trouble in the long run than it will save. Someone else using your application will run into a nasty surprise when they don't realize that it's compatible only with some types of filesystems and not others. The order may change if a filesystem is restored from backup. You'll likely run into compatibility problems because the HFS+ consistent order and the ext4 consistent order might not be the same.

Just read all of the directory entries and sort the list lexicographically before using it. Just like ls does.
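As a rough illustration of that advice, here is a minimal sketch in Python (the question's files are Ruby, but the idea carries over to any language). The helper name `sorted_entries` is made up for this example. Note that Python's `sorted` compares strings by code point, so uppercase names sort before lowercase ones, which is not necessarily the order `ls` prints under your locale.

```python
import os


def sorted_entries(directory):
    """Return the names in `directory` in a repeatable, lexicographic order.

    `sorted_entries` is a hypothetical helper name used for illustration.
    """
    # os.listdir() returns names in arbitrary order, so sort before use.
    return sorted(os.listdir(directory))
```

Code that loads every file in a directory can then iterate over `sorted_entries(path)` instead of the raw listing, so a.rb is always seen before b.rb regardless of the filesystem.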

You mention files a.rb and b.rb, but if we're talking about programming language source files, shouldn't each file already be responsible for ensuring that it imports all its dependencies?

The problem is that we did not write the code we're running. We run customers' code, and we have no control over how the code was written. So our problem is really that we're getting blamed for the problem, because it works on their machine but not ours. If we could force everyone to write correct code, we would, but that's not within our power :)
– Paul Biggar, Jul 10 '12 at 2:01

@PaulBiggar: but isn't "it runs here but not in production" exactly the problem that CI is supposed to fix? In other words: "Why does my code break in your system?" should be answered with "Because we're doing exactly what you're asking us for!" ;-)
– Joachim Sauer, Jul 10 '12 at 5:29

I don't know about anyone else, but when code works on my machine and then fails on a CI or colleague's checkout I immediately assume there is something platform- or environment-dependent that I need to fix...
– matt5784, Jul 10 '12 at 5:30

Surely developing the application on a platform you won't use in production is a bad idea? Get them to develop on the same platform they are writing for.
– Matthew Ife, Jul 10 '12 at 11:33

I disagree. I think it is a great idea. It makes many more faults show up during the move from development to test servers, and thus the code is much sturdier before it moves to the production servers. So in a correct or theoretical world it is much better. This is the same world where you can force everyone to write correct code, also known as dreamland.
– Hennes, Jul 10 '12 at 12:42

Educate your customer that there is an inherent order dependency that should be explicitly stated. Offer to help the customer express the dependency in such a way that a compile works on all systems and have the customer adopt the changed flow that captures the compilation order dependency.

If the customer wants to be able to compile on other machines then it would be churlish of them to think that it comes for free.

The POSIX readdir() call on Linux doesn't guarantee any consistent ordering. If you want ordered results, the application that handles the files is responsible for ordering them before presenting them to calling functions.
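To make "the application is responsible" concrete, here is a small Python sketch of a recursive traversal that imposes its own order on top of whatever the filesystem returns. The function name `walk_sorted` is hypothetical, and sorting by name is just one possible choice of repeatable order.

```python
import os


def walk_sorted(root):
    """Yield every file path under `root` in a repeatable order.

    `walk_sorted` is a hypothetical name used for illustration.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        # In top-down mode, os.walk descends into dirnames in list order,
        # so sorting the list in place fixes the traversal order.
        dirnames.sort()
        # Files within one directory are yielded in sorted order as well.
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)
```

With this, files directly in a directory come first, followed by each subdirectory in name order, on any filesystem.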

Now, since you said this was your customer's code and you can't fix it, you could possibly alter the linked libraries that are used to provide a consistent readdir() call. That would take some work and be worth its own question. For a quick reference to that, see http://www.ibm.com/developerworks/linux/library/l-glibc/index.html.

Altering this could spawn a whole other series of issues that I may not be able to foresee. You are strongly cautioned, but it may be a solution if your customer cannot be properly educated.