Unfortunately, this technique is not guaranteed to work. Even if you write, flush then rename, the OS and disk may decide to apply these operations in a different order, breaking your guarantee.

A few years ago, I had to fix this behavior on Firefox, because it was actually causing data loss. The only techniques I found that seem to work are journaling and rolling backups + transparent recovery (which can still lose data, just one order of magnitude less often).

Ah, thank you for fixing session restore. For years this was a really expensive (although useful) feature that caused huge disk write pressure. I used to have to move the Firefox directory to a ramdisk in order to have a usable system.

Other developers are currently working on fixing other aspects of Session Restore, including both performance issues and plugging other possible sources of data loss, but I'm not following this closely.

> Unfortunately, this technique is not guaranteed to work. Even if you write, flush then rename, the OS and disk may decide to apply these operations in a different order, breaking your guarantee.

I've heard about lying disks, but do you know of any that are still being sold today? My impression was this isn't really a thing anymore.

I don't think the OS lies if you're sufficiently careful: write the new file, fsync it, fsync the _directory_, then rename. [edit: on OS X, use F_FULLFSYNC. grr.] Now it's guaranteed that the filename has either the old or new contents. Another fsync on the directory to guarantee it's the new. I'd be interested in any evidence to the contrary.
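The sequence described above can be sketched in C roughly as follows (a minimal illustration, not production code: the function name and buffer sizes are my choices, the two directory fsyncs are condensed into one after the rename, and error handling is abbreviated):

```c
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write a temp file, fsync it, rename it over the target, then fsync the
 * containing directory so the rename itself is durable. "atomic_replace"
 * and the ".tmp" suffix are illustrative choices. */
int atomic_replace(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    /* On OS X, fcntl(fd, F_FULLFSYNC) would be needed here instead. */
    close(fd);

    if (rename(tmp, path) != 0)  /* atomic: old or new contents, never a mix */
        return -1;

    /* fsync the directory so the new directory entry reaches disk. */
    char dirbuf[4096];
    snprintf(dirbuf, sizeof dirbuf, "%s", path);
    int dfd = open(dirname(dirbuf), O_RDONLY);
    if (dfd < 0)
        return -1;
    int r = fsync(dfd);
    close(dfd);
    return r;
}
```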

I don't think the OS will apply the operations in a different order. I believe the fsync acts analogously to a memory barrier with regards to reordering. It may be the case that the disk will still apply the operations in a different order though.

Your technique potentially changes the owner, group, and protection bits of the file, which may come as an unwelcome surprise to anyone trying to use it as a drop-in replacement for directly overwriting the old file. (Worse, there seems to be no way to completely fix this in Unix-land as of the last time I looked, back in the 90’s, when we had to pull a feature in FrameMaker over this.)

Also, be aware that (many? some?) disk controllers routinely lie about whether their internal caches are actually flushed to physical media, and there’s no way for fsync to be 100% sure that the data has really made it all the way to the spinning platter of rust. Try testing your code by putting it in a loop and repeatedly pulling the power plug out of the wall, and see if you always end up with an uncorrupted file. (Yes, people do this sort of testing on highly fault-tolerant systems.)

> Worse, there seems to be no way to completely fix this in Unix-land as of the last time I looked back in the 90’s, when we had to pull a feature in FrameMaker over this.

I had no idea FrameMaker ran on Unix. Apparently it started life on a Sun-2 workstation, and at one point according to Wikipedia, “FrameMaker ran on more than thirteen UNIX platforms, including NeXT Computer's NeXTSTEP and IBM's AIX operating systems”.

> Try testing your code by putting it in a loop and repeatedly pulling the power plug out of the wall, and see if you always end up with an uncorrupted file. (Yes, people do this sort of testing on highly fault-tolerant systems

I've had "fun" with this on some Windows CE systems, building my own little log-structured database to address the problem. All writes are appends, one disk block at a time, flushing afterwards. Worked well when using TFAT ("transactional" FAT) as the underlying file system, but when tried on a normal PC we discovered a new exciting failure mode: the file ended up with a random chunk of recently-deleted data at the end of it.

The best you can do is assume that anyone who uses your application in production will have enterprise hardware; those (usually) don't lie about fsync, while consumer HDDs are somewhat likely to do so.

I've also not yet run across an SSD that lies about fsync.

The disappointing thing is that most filesystems also blindly assume this. Ext4 is somewhat robust by design, but I've heard of at least one case where a (very) cheap consumer hard disk trashed any filesystem on crash, including ZFS, to the point of becoming unmountable.

Your OS or filesystem may maintain a blacklist of lying devices (IIRC XFS does so), although such probably isn't exported to the user. Regardless, those blacklists are built via testing, and probably aren't comprehensive. If you absolutely positively must be assured your write is persisted, I recommend fsync, sync, waiting five minutes, fsync and sync again, then top it off with a prayer.

This is a shockingly hard problem, which the technique still doesn't get right even if the hardware cooperates.

1) If the filename provided is just a bare filename, then dirname (and the parent directory) may not work as expected. You ought to run it through realpath first.

2) What if the parent's parent hasn't been flushed? You need to recurse all the way up to "/".

3) Extending support for other POSIX systems will be tricky. The Austin group (keepers of POSIX development) had trouble with this. Some systems consider using open() on a directory a complete sin (you must use opendir()), some allow it read only but not with write, fsync() may require write permissions instead of just read, and opendir() may not produce an FD which you can ever convince fsync() to operate on (e.g. because no write permission). Indeed, the initial proposal from the Austin group was that all operations on directory entries should be atomic if the inode is also synced (which matches e.g. HPUX, one of the systems which strictly disallows any attempt at calling fsync() on a directory). Many other systems (Linux especially) obviously don't fit that model, it has some unpleasant performance implications, and overall no agreement was ever reached as to how to standardize it.

The ITS operating system (at the MIT AI lab) had a system call called .RENWO -- "rename while open". Basically you could open a file, write into it (or read/write) and then, in an atomic operation, give it a new name and close it. So it was actually "atomic close and supersede".

It's really unfortunate the POSIX link() system call takes a pair of paths. It should really take an inode. Then you could open a temp file and immediately unlink it, and keep writing to it until you are finished (well, you can do this today as well). Then, when you're done, you could give it a name. If you crashed, you'd leave no detritus behind.

While I think the grandparent is right about POSIX not supporting this, I think this should be possible in Linux:

1. Open the file with open(2), passing the O_TMPFILE flag. This creates a temporary, unnamed file. (As if you had opened it, then immediately unlinked it, but atomically.)

2. Write to the file.

3. Link the file into the filesystem with linkat(2), passing the AT_EMPTY_PATH flag (which tells linkat(2) to target a file descriptor that we pass it, not a path.)

(I'm not sure if step 3 helps in the case that you want to atomically replace the contents of an existing named file; I suspect that linkat(2) will error out because the file exists. Perhaps someday linkat(2) will grow a "atomic replace" flag.)
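On Linux, the three steps look roughly like this (a sketch; create_named is an invented name, and because linkat() with AT_EMPTY_PATH requires CAP_DAC_READ_SEARCH, this uses the /proc/self/fd workaround described in the linkat(2) man page instead):

```c
#define _GNU_SOURCE          /* for O_TMPFILE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* As noted above, this only creates a *new* name atomically: linkat()
 * fails with EEXIST if the target name already exists. */
int create_named(const char *dir, const char *target,
                 const char *data, size_t len)
{
    /* 1. Create an unnamed file in the directory that will hold it. */
    int fd = open(dir, O_TMPFILE | O_WRONLY, 0644);
    if (fd < 0)
        return -1;

    /* 2. Write and flush the contents before the file becomes visible. */
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }

    /* 3. Link the file into the filesystem under its final name, going
     * through /proc/self/fd since AT_EMPTY_PATH is privileged. */
    char fdpath[64];
    snprintf(fdpath, sizeof fdpath, "/proc/self/fd/%d", fd);
    int r = linkat(AT_FDCWD, fdpath, AT_FDCWD, target, AT_SYMLINK_FOLLOW);
    close(fd);
    return r;
}
```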

This library does not come even close to doing what it claims to.
This is excusable because it is actually a very difficult problem with no simple solution. Depending on your hardware and software, it may have no guaranteed solution at all.

Let's start with the fact that modern disc controllers cache megabytes of data, and there exists no way to force them to write it out. There exist a few ways to suggest this; some will listen, some will lie. SSDs are even worse in this respect: they cache hundreds of megabytes, and while they're doing garbage collection, even more data may be touched and modified. Depending on how good their algorithms are, you may lose the data that you are writing if they're powered off suddenly, or entirely unrelated other data, or perhaps a little of each. There is no standard way to tell an SSD to flush its data to actual flash either, of course. In fact, depending on the logic in the FTL in use, this may not even be possible to do quickly (say, you discover a few bad blocks during garbage collection and need to relocate a lot of data suddenly). This changes even across firmware updates.

And then we arrive at modern file systems. The default mount options generally only sync metadata to disc no matter how many times you call the sync system call. All the dirty pages containing actual file data will be written whenever the system damn well feels like it. And of course, as mentioned above, they're only written to the volatile RAM inside the disk controller.

This is actually an insanely complicated problem. If you want to see how it is solved professionally, take a look at databases. The people who work on sqlite have spent countless hours making stuff like this work reliably, or break in ways that are predictable and recoverable. Generally this involves extreme amounts of journaling, sometimes multiple levels of journals. Oftentimes there is a whole lot of OS-specific quirk handling to bridge the differences between what the OSes promise these functions do and the reality.

This is actually why, if you ever start needing to "write things reliably", or "store more than a few of something in a file, and update it", and you can spare the code size, just use sqlite for your storage. Because there's a whole lot of problems you're going to hit that they have already solved for you.

I usually don't defend Windows, but rename not overwriting by default (without an extra argument) seems safer as a low-level API, no? Given that it takes a two-line helper to achieve the same thing in C++ [1], it doesn't seem that bad.

On Windows even if you use native API to open a file in a write-through mode with no caching (FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING), you will still end up with zero-filled files if the machine is power cycled or blue-screened in the right moment. Disk controllers cache aggressively and there's not much an OS can do about it.

A reminder that Reiser4 exists and is still getting updated. Fast, stable, and most of all: atomic (only one core though).

You'll need to patch your kernel[0] and install the mkfs.reiser* util[1]. Take a look at this Gentoo-forums tweaks post[2] for troubleshooting and performance tweaks. If you find reiser is lagging, your filesystem was likely built incorrectly and you'll need to run a simple

fsck.reiser4 -y --fix --build-fs --build-sb

on your partition and it should be fixed. I only made the switch recently because Ext4 doesn't have intelligent inode allocation and drops the ball at 1mil files.

I’d be very interested in any way that would let me (quickly) write small changes to large files atomically, on Windows. The naive solution is to copy the original to a temp file, then proceed as in this library (write then atomic rename).

I’m thinking there may exist some lower-level magic API with e.g. CoW pages that lets the caller appear to copy the original without actually doing it? “Shadow copy” or similar?

That's a funny one. The kernel changes they needed to make in order to support transactional stuff were pervasive, and it's used extensively under the covers. The MSDN commentary makes it sound like they're going to completely pull the functionality one day, which is misleading.

That is probably a better approach for large files. My initial use case was an application that would write an encrypted file containing passwords to disk. The files are relatively small, so that method didn't seem worth the additional complexity of something like what you describe. I believe sqlite does something similar at least in some configurations.

The title is pretty misleading: as the author already clearly mentions, the library is all about atomically _creating_ a file with specified content on disk; it is not about how to atomically append to, update, or delete files on disk.