Memory mapped files are a feature of all modern operating systems. They require coordination between the memory manager and the I/O subsystem. Basically, you can tell the OS that some file is the backing store for a certain portion of the process memory. In order to understand that, we have to understand virtual memory.

Here is your physical memory, using 32 bits for ease of use:

Now, you have two processes that are running. Each of them gets its own 4 GB address space (actually, only 2 GB is available to the process in 32 bits, but that is good enough). What happens when both of those processes obtain a pointer to 0x0232194?

Well, what actually happens is that the pointer that looks like a pointer to physical memory is actually a virtual pointer, which the CPU and the OS work together to map to physical memory. It is obvious from the image that there is a problem here: what happens if two processes use 4 GB each? There isn’t enough physical memory to handle that. This is the point where the page file gets involved. So far, this is pretty basic OS 101 stuff.

The next stuff is where it gets interesting. So the OS already knows how to evict pages from memory and store them on the file system, because it needs to do that for the page file. The next step is to make use of that for more than just the page file. So you can map any file into your memory space. Once that is done, you can access the part of the memory you have mapped and the OS will load the relevant parts of the file to memory. Again, pretty basic stuff so far.

The reason you want to do this sort of thing is that it gives you the ability to work with files as if they were memory. And you don’t have to worry about all that pesky file I/O stuff. The OS will take care of that for you.
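To make that concrete, here is a minimal sketch in Python, whose mmap module wraps the OS mapping calls. The helper name read_via_mapping and the sample file are my own, made up for illustration.

```python
import mmap
import os
import tempfile


def read_via_mapping(path, offset, length):
    """Return `length` bytes starting at `offset`, read through a memory mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the view read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            # Slicing the view looks like a plain memory access. The OS pages
            # in the relevant part of the file on demand.
            return view[offset:offset + length]


# Hypothetical sample file, created only for this demonstration.
fd, sample_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(b"hello, memory mapped world")
    print(read_via_mapping(sample_path, 7, 6))  # b'memory'
finally:
    os.remove(sample_path)
```

Note that there is no explicit read call and no buffer management anywhere in the helper; the slice is the whole "I/O" code.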

And now, to Tobi’s questions:

What are the most important advantages and disadvantages?

Probably the most important is that you don’t have to do manual file I/O. That can drastically reduce the amount of work you have to do, and it can also give you a pretty significant perf boost. Because the I/O is actually being managed by the operating system, you benefit from a lot of accumulated experience in optimizing things. And the OS will do things like give you a page buffer, caching, preloading, etc. It also makes it drastically easier to do parallel I/O safely, since you can read/write from “memory” concurrently without having to deal with a complex API.

The disadvantages you need to be aware of are that things like the base address will change whenever you re-open the file, and that data structures that are good in memory might not result in good performance if they are stored on disk.
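One common way to cope with the changing base address is to store offsets from the start of the file instead of raw pointers. The sketch below is only an illustration under my own assumptions: the node layout, the helper names and the 4096-byte file size are all hypothetical.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical node layout: a 32-bit value plus the offset of the next node.
# Offset 0 means "end of list", so real nodes start after the first 8 bytes.
NODE = struct.Struct("<iI")


def write_list(view, values):
    """Store values as a linked list inside the view; return the head offset."""
    if not values:
        return 0
    head = NODE.size
    offset = head
    for i, value in enumerate(values):
        last = i == len(values) - 1
        next_offset = 0 if last else offset + NODE.size
        NODE.pack_into(view, offset, value, next_offset)
        offset += NODE.size
    return head


def read_list(view, head):
    """Follow the offsets from `head`, whatever address the view is mapped at."""
    values = []
    offset = head
    while offset != 0:
        value, offset = NODE.unpack_from(view, offset)
        values.append(value)
    return values


def roundtrip(values):
    """Write the list through one mapping, then read it back through a new one."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "r+b") as f:
            f.write(b"\0" * 4096)  # room for 511 nodes in this sketch
            f.flush()
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as view:
                head = write_list(view, values)
        # The second mapping may land at a different base address, but the
        # stored offsets are still valid.
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
                return read_list(view, head)
    finally:
        os.remove(path)


print(roundtrip([3, 1, 4]))  # [3, 1, 4]
```

The price is that every link has to be translated on each access, which is part of why a structure that is good in memory is not automatically good on disk.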

Another problem relates to how the changes actually are saved to the file. It is hard to make sure that the writes you do are coherently stored in the file system. For example, let us say that you made changes to two different memory locations, which reside on two different pages. The OS can decide, at any time, to take one of those pages away from you because it needs that memory for other things. When that happens, it will write that page to disk for you. So when you ask for it the next time, it can load it up again and the application won’t notice.

However, what would happen during a crash? You might have partially written data in that scenario. Usually, when writing to files using file I/O routines, you have a very predictable write pattern. When you use memory mapped files for writes, you don’t know for sure in what order the writes are going to happen. The OS is free to choose that. There are ways you can control that. For example, you might use FILE_MAP_COPY to avoid the OS writing stuff for you, but you would have to handle writes yourself now, and that is decidedly not trivial.
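Here is a small sketch of what a copy-on-write view looks like, using Python's mmap.ACCESS_COPY, which as far as I know is the closest analogue to FILE_MAP_COPY in that module. The helper name modify_privately is made up for this example.

```python
import mmap
import os
import tempfile


def modify_privately(path, offset, data):
    """Write into a copy-on-write view; return (bytes in view, bytes on disk)."""
    with open(path, "rb") as f:
        # ACCESS_COPY: writes change the in-memory pages only. The OS never
        # writes them back to the underlying file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as view:
            view[offset:offset + len(data)] = data
            seen_in_memory = view[:]
    with open(path, "rb") as f:
        seen_on_disk = f.read()
    return seen_in_memory, seen_on_disk


# Hypothetical sample file, created only for this demonstration.
fd, sample_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(b"AAAAAAAA")
    print(modify_privately(sample_path, 2, b"zz"))  # (b'AAzzAAAA', b'AAAAAAAA')
finally:
    os.remove(sample_path)
```

The file is untouched after the view is closed, so getting the modified pages to disk, in the order you want, is now entirely your job.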

You can use something like FlushViewOfFile() to make sure that a specific range is flushed, but that still doesn’t mean that the pages will be written in any order that you might expect. As a db writer, I care, quite deeply, about the exact way the data is written to file, because otherwise it is really hard to reason about how to read it back. Especially when you have to consider failures and corruptions.

Personally, I would not write through a memory mapped file for data that I consider critical.

Why are well known products like SQL Server using custom caches instead of memory mappings?

SQL Server is actually its own operating system. It manages memory, locks, threads, I/O and a lot more. That comes from the fact that for a long time, it had to do that sort of thing. But note that SQL Server probably uses memory mapped files quite a lot. They are also using their own custom caches because it makes sense for them. For example, the query plan cache. Even when you do have memory mapped files, you usually have multiple layers of caching.

In RavenDB, for example, we have the native page cache (managed by Esent), the documents cache (managed by RavenDB) and the client cache. The reason that we have multiple layers is that we cache different things. The client cache avoids having to call the server. The documents cache avoids having to parse documents, and the native page cache avoids having to go to disk.

Why are other products like LevelDB using mmap instead of custom caches?

Because they are drastically simpler products. They really want to just give you the data as quickly as possible, and they don’t need to do additional processing of the data. They can lean on the OS page cache to a much larger extent. Note that when used in real products, there are often higher level caches on top of them anyway, used for storing processed / parsed information.

Are they well suited for managed code?

Memory mapped files are usually harder to use from managed code, because we don’t do our own memory management. It does mean that you lose the benefit of just treating this as part of your memory space, because there is a very clear line between managed memory and native memory, and memory mapped files are on the other side of the hatch. That said, you can easily use them via UnmanagedMemoryStream, or read from them directly via a view accessor (MemoryMappedViewAccessor). The managed implementation of leveldb makes heavy use of memory mapped files.

How does performance compare with other techniques?

It is generally better, mostly because the OS is really good at managing paging, and because you rely directly on the low level I/O routines. Another thing that you have to remember is that if you use file I/O, you need to create a buffer, copy to/from that buffer, etc. Using memory mapped files saves all of that.

Is it possible to use them with mutable data structures and retain crash recoverability?

Yes, but… is probably the best answer. Yes, you can use them for mutable data, but you really want to be careful about how you do it. In particular, you need to make sure that you write in such a way that you can survive partial writes. A good way of doing that is to make writes to specific pages, and you “commit” by recording that those pages are now available on a metadata page, or something like that. This requires a lot of really careful work, to be honest. Probably more work than you would like it to be. LMDB does it this way, and even if the code wasn’t an eye-tearing pain, what it is doing is quite hard.

Note that in order to actually be sure that you are flushing to disk, you have to call both FlushViewOfFile and FlushFileBuffers on Windows.
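To give a rough feel for the "write the page, flush, then record it on a metadata page" idea, here is a heavily simplified sketch in Python. Everything in it is my own assumption for illustration: the three-page layout, the names create_store, commit and read_live, and the CRC check. It is not how LMDB actually does it. To my understanding, mmap.flush() and os.fsync() are the Python counterparts of FlushViewOfFile and FlushFileBuffers on Windows (msync and fsync on POSIX).

```python
import mmap
import os
import struct
import tempfile
import zlib

PAGE = 4096
# Hypothetical metadata record at the start of page 0:
# live data page number, payload length, CRC32 of the payload.
META = struct.Struct("<III")


def create_store(path):
    """Create an empty store: page 0 is metadata, pages 1 and 2 hold data."""
    with open(path, "wb") as f:
        f.write(b"\0" * (3 * PAGE))


def _sync(view, f):
    view.flush()          # push dirty pages of the view to the file
    os.fsync(f.fileno())  # ask the OS to push the file to stable storage


def commit(path, payload):
    """Write payload to the page that is not live, then flip the metadata."""
    if len(payload) > PAGE:
        raise ValueError("payload does not fit in one page")
    with open(path, "r+b") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as view:
        live, _, _ = META.unpack_from(view, 0)
        target = 2 if live == 1 else 1
        start = target * PAGE
        view[start:start + len(payload)] = payload
        _sync(view, f)  # the data page must be durable before we point at it
        META.pack_into(view, 0, target, len(payload), zlib.crc32(payload))
        _sync(view, f)  # the commit point: metadata now names the new page


def read_live(path):
    """Return the last committed payload, or None if nothing was committed."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        live, length, crc = META.unpack_from(view, 0)
        if live == 0:
            return None
        start = live * PAGE
        payload = view[start:start + length]
        if zlib.crc32(payload) != crc:
            raise ValueError("metadata and data page disagree")
        return payload


# Hypothetical store file, created only for this demonstration.
fd, store_path = tempfile.mkstemp()
os.close(fd)
try:
    create_store(store_path)
    commit(store_path, b"first version")
    commit(store_path, b"second version")
    print(read_live(store_path))  # b'second version'
finally:
    os.remove(store_path)
```

A crash before the second flush leaves the old metadata pointing at the old, untouched page. The sketch assumes the small metadata record itself is written atomically, which is exactly the kind of assumption a real engine has to work much harder to justify; the CRC only lets the reader detect a mismatch, not recover from it.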

What guarantees does the OS make regarding ordering and durability?

Pretty much none regarding ordering, as noted. Windows will guarantee that if both FlushViewOfFile and FlushFileBuffers have been called and successfully completed, you are golden. But there aren’t any promises in the API about what will happen for partway failures, or in what order this happens.

To summarize:

Memory mapped files are a great tool, and for reads, they are excellent. For writes, they are awesome, but since there is no way to ensure in what order dirty pages get written to disk, it is hard to build a reliable system using them.

I would rather use standard file I/O for writes, since that is far more predictable.

Another point to make about memory mapped files, and one that as a developer you probably deal with a lot, is exe and dll locking, along with questions about assembly loading and performance based on size.

Most OSes, including Windows, use memory mapped files to load executables and libraries (DLLs). This allows only the portions of the assembly that are actually used to be loaded from disk, as needed. This is also why they are locked once used, and why .NET will not unload an assembly until app domain unload: it remains memory mapped. So a large assembly where only small parts are referenced will have no performance or physical memory hit compared to a small assembly containing only the classes/methods used. Note the JIT stage in .NET is also on demand as methods are called, so if you never call a method, its IL may never be loaded from disk and it will never be JITed.

A big annoyance on Windows related to this is the lack of an inode concept for files, as on Linux/Unix. You can update a running exe or library on Linux even though it has been memory mapped and locked, because only the inode is locked. The directory entry (file name) can be removed and re-added independently, and the old inode will stay around until the last process using it is gone. So much less hassle than Windows, and no need for the crazy shadow copy system ASP.NET uses. This is why Linux requires fewer reboots than Windows for updates; even the OS can be updated while running in Linux.

Also, normal I/O copies the buffer into kernel space before a write, and when you do a read, it reads from disk into kernel space and then copies to user space. These copies can have a substantial performance penalty.

Some OS's have an option to avoid the copies (on Linux, I think it's O_DIRECT) but direct-memory reads and writes don't allow sharing, so every process that accesses the data has to go to disk. When you use mmap, it avoids that copy and lets you share the memory region between processes.
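A small sketch of the sharing part, in Python. For simplicity it opens two independent mappings of the same file inside one process, standing in for two separate processes; the helper name shared_views_demo is made up for this example.

```python
import mmap
import os
import tempfile


def shared_views_demo(path):
    """Write through one mapping of the file and read through another."""
    with open(path, "r+b") as f1, open(path, "rb") as f2:
        with mmap.mmap(f1.fileno(), 0, access=mmap.ACCESS_WRITE) as writer, \
                mmap.mmap(f2.fileno(), 0, access=mmap.ACCESS_READ) as reader:
            # No read() or write() call is involved. Both views are backed by
            # the same cached pages of the file, so the reader sees the change.
            writer[0:5] = b"HELLO"
            return reader[0:5]


# Hypothetical sample file, created only for this demonstration.
fd, sample_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(b"hello world, shared")
    print(shared_views_demo(sample_path))  # b'HELLO'
finally:
    os.remove(sample_path)
```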

Dude,
MemoryMappedFile gives you mmap in .NET, sure, but it is quite a different experience than the one you would get in C, for example.
That is a major difference and something that you need to take into account. In C, you can just access the memory directly. In C#, you usually copy it to managed code, unless you want to use unsafe code.

OracleDB makes heavy use of O_DIRECT on POSIX systems. BerkeleyDB has an option to use it; I wrote the patch for BerkeleyDB to support it on Linux. (Previously it only worked on a select few OSs like SGI Irix, Solaris, and a few others.) Databases use it because they are DMAing into their own explicitly managed caches, so that the result can still be shared between processes. It's fine if you're willing to replicate all of the functionality of the kernel's virtual memory manager in your own application code. In general, that's a waste of programmer time and system resources, but some people like to reinvent wheels.

Howard,
Linus appears to be dead set against O_DIRECT: https://lkml.org/lkml/2007/1/10/233
I agree with both you & him in that regard; it is a lot of work for very little benefit.

Then again, a lot of the time you had to do that, because the OS wasn't good enough at managing the memory for you, and you could do better with your own knowledge. I can't talk about Oracle, but SQL Server did a lot of stuff like that because Windows' behavior wasn't good enough for it.

Yes. Even SQLite maintains its own VFS layer and page cache because it gets deployed into small embedded systems that have little more than a process spawner and nothing else in terms of operating system services. I guess in those scenarios you do what you have to do. But nobody is running OracleDB or BerkeleyDB in those conditions.

Windows... I keep hearing that its virtual memory manager is "way better" in "the latest release" but have yet to see any actual improvement. This is a distinct liability for LMDB; on the same hardware with all else being identical, the OpenLDAP server runs significantly slower on Windows than on Linux. Application-level caching could insulate us a little bit more from such OS-dependent performance differences, but the easier answer is just to run Linux.

Howard,
I am guessing that a lot of that has to do with the different ways both systems are working. At a guess, you are heavily optimized on *nix, but you haven't done nearly as much work to get that working on Windows.
There are different behaviors that you do on both. For example, I see madvise() calls that are only applicable for *nix, and not Windows, and that can be it.
Other stuff can just be access patterns.