Key Takeaways

Web servers often have far more memory than the .NET GC can efficiently handle under normal circumstances.

The performance benefits of a caching server are often lost due to increased network costs.

Memory Mapped Files are often the fastest way to populate a cache after a restart.

The goal of server-side tuning is to reach the point where your outbound network connection is saturated. This is achieved by minimizing CPU, disk, and internal network usage.

By keeping object graphs in memory, you can obtain the performance benefits of a graph database without the complexity.

Continuing the Big Memory topic on the .NET platform (part 1, part 2), this article describes the benefits of working with large data sets in-process in managed CLR server environments using Agincore’s Big Memory Pile.

Overview

RAM is fast and affordable these days, yet it is ephemeral: every time the process restarts, memory is cleared and everything has to be reloaded from scratch. To address this, we have recently added Memory Mapped File support to our solution, NFX Pile. With memory mapped files, the data can be quickly fetched from disk after a restart.

Overall, the Big Memory approach is beneficial for developers and businesses as it shifts the paradigm of high-performance computing on the .NET platform. Traditionally, Big Memory systems were built in C/C++-style languages where you primarily dealt with strings and byte arrays. But it is hard to solve real-world business problems while focusing on low-level data structures. So instead we are going to concentrate on CLR objects. Memory Pile allows developers to think in terms of object instances, and to work with hundreds of millions of instances that have properties, code, inheritance, and other CLR-native functionality.

This is different from language-agnostic object models, as proposed by some vendors (i.e. ones that interoperate Java and .NET), which introduce extra transformations, and all of the out-of-process solutions that require extra traffic/context switching/serialization. Instead, we’re going to discuss in-process local heaps, or rather “Piles” of objects, which exist in managed code in large byte arrays. Individually, these objects are invisible to the GC.

Use Cases

Why would anyone use dozens or hundreds of gigabytes of RAM in the first place? Here are a few tested use cases of the Big Memory Pile technology.

The first thing that comes to mind is caching. In an e-commerce backend we store hundreds of thousands of products ready to be displayed as detailed catalog listings, and each product may have dozens of variations. When you build a catalog view listing 30+ products on a single screen, you’d better get those objects quickly, even for a single user scrolling a page with progressive loading. Why not use Redis or Memcached? Because we do the same thing in-process, saving on network traffic and serialization; transforming data into network packets and back into objects can be a surprisingly expensive operation. Wouldn’t you use a Dictionary<id, Product> (or IMemoryCache) if it could hold all several hundred thousand products and their variations? Caching alone provided enough motivation for using RAM, but there is much more...
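The pattern is essentially a dictionary of ids mapping to pile pointers rather than to live objects. Here is a language-neutral sketch of the idea (Python for brevity; TinyPile, its put/get methods, and the product fields are illustrative, not the actual Pile API):

```python
import pickle

class TinyPile:
    """Toy pile: objects live as serialized bytes inside one big buffer,
    so a tracing GC sees a single container instead of millions of objects."""
    def __init__(self):
        self._buf = bytearray()

    def put(self, obj):
        data = pickle.dumps(obj)
        # (offset, length) plays the role of a "PilePointer"
        ptr = (len(self._buf), len(data))
        self._buf += data
        return ptr

    def get(self, ptr):
        off, ln = ptr
        return pickle.loads(bytes(self._buf[off:off + ln]))

pile = TinyPile()
index = {}  # product id -> pointer: the only per-object state the GC tracks
for pid in range(3):
    index[pid] = pile.put({"id": pid, "name": "product-%d" % pid})

print(pile.get(index[1])["name"])  # -> product-1
```

The GC now traverses one bytearray and a small dictionary of tuples, regardless of how many products are stored, which is the essence of hiding objects from the collector.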

In another cache use case, a REST API server, we were able to pre-serialize around 50 million rarely changing JSON vectors as UTF8-encoded byte arrays. Each byte[], around 1024 bytes, could then be served directly into the HTTP stream, making the network the bottleneck at around 80,000 req/sec.
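The win here is doing the JSON encoding once instead of per request. A minimal sketch of the technique (Python for brevity; the record shapes and handle_request are illustrative assumptions, not the production code):

```python
import json

# Pre-serialize responses once; at request time only raw bytes are copied
# to the socket - no per-request JSON encoding or object serialization.
records = {i: {"id": i, "v": [i, i * 2]} for i in range(3)}
prebaked = {i: json.dumps(rec).encode("utf-8") for i, rec in records.items()}

def handle_request(rec_id):
    # in a real server this buffer would be written straight to the HTTP stream
    return prebaked[rec_id]

print(handle_request(2))
```

Because the stored value is already a wire-format byte buffer, serving it is a memory copy, which is why the network, not the CPU, becomes the bottleneck.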

Working with complex object graphs is another perfect case for Pile. In a social app, we needed to traverse conversation threads on Twitter. When tracing who said what and when on a social media site, the ability to hold hundreds of millions of small vectors in memory is invaluable. We might as well have used a graph DB; however, in our case we are the graph DB, right in the same process (it is a component hosted by our web MVC app). We’re now handling 100K+ REST API calls/sec, which is the limit of our network connection, while keeping CPU usage low.
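Graph edges can be stored as pile pointers inside the serialized nodes themselves, so traversal is pure in-process memory access. A hedged sketch of the idea (Python; the node layout and the put/get helpers are illustrative assumptions):

```python
import pickle

# Minimal in-process "graph DB": each node stores the pointer of its
# parent, so walking a reply chain never leaves the process.
buf = bytearray()

def put(obj):
    data = pickle.dumps(obj)
    ptr = (len(buf), len(data))
    buf.extend(data)
    return ptr

def get(ptr):
    off, ln = ptr
    return pickle.loads(bytes(buf[off:off + ln]))

root = put({"text": "original post", "parent": None})
reply = put({"text": "first reply", "parent": root})
leaf = put({"text": "reply to the reply", "parent": reply})

# Trace who-said-what by following parent pointers up to the root.
chain, ptr = [], leaf
while ptr is not None:
    node = get(ptr)
    chain.append(node["text"])
    ptr = node["parent"]

print(chain)  # leaf-to-root order
```

Since pointers are plain value tuples, hundreds of millions of edges cost the GC nothing beyond the buffer that holds them.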

In this, and other use cases, background workers asynchronously update the social graph as changes come in. In many cases, such as the product catalog we talked about earlier, this can be done preemptively. You couldn’t do that with a normal cache that only holds a subset of the interesting data.

How it Works

Big Memory Pile solves the GC problem by transparently serializing CLR object graphs into large byte arrays, effectively “hiding” the objects from the GC’s reach. Not all object types need to be fully serialized, though: string and byte[] objects are written into the Pile as raw buffers, bypassing the serialization mechanism entirely and yielding over 6M inserts/second for a 64-character string on a 6-core box.
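The fast path for byte payloads can be modeled with a one-bit tag on each allocation. A sketch under the same Python toy-pile assumptions as above (pickle stands in for NFX’s Slim serializer; the tag scheme is illustrative):

```python
import pickle

buf = bytearray()

def put(obj):
    # byte payloads bypass the serializer entirely - copied as-is -
    # while general objects go through full serialization.
    if isinstance(obj, (bytes, bytearray)):
        tag, data = 0, bytes(obj)
    else:
        tag, data = 1, pickle.dumps(obj)
    ptr = (len(buf), len(data), tag)
    buf.extend(data)
    return ptr

def get(ptr):
    off, ln, tag = ptr
    raw = bytes(buf[off:off + ln])
    return raw if tag == 0 else pickle.loads(raw)

p1 = put(b"raw payload")              # fast path, no serialization
p2 = put({"kind": "general object"})  # serialized path
print(get(p1), get(p2))
```

Skipping the serializer for buffers removes both the CPU cost and the intermediate allocations, which is where the headline insert rates come from.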

The key benefit of this approach is its practicality. Real-life cases have shown phenomenal overall performance while using the native CLR object model. This saves development time, because you don’t need to create special-purpose DTOs, and runs faster, because no extra in-between copies need to be made.

Overall, Pile has turned much of our I/O-bound code into CPU-bound code. What would normally have been a typical case for an async (I/O-bound) implementation became 100% synchronous linear code, which is simpler and performs better, as Tasks and other async/await goodies have a hidden cost (see here and here) when doing multiple hundreds of thousands of ops/sec on a single server.

Big Memory Mapped Files

In-memory processing is fast and easy to implement, however when the process restarts you lose the dataset, which is large by definition (tens to hundreds of gigabytes). Pulling all of that data from its original source can be very time consuming, time that you can’t afford just after a restart.

Writing to memory via MemoryMappedViewAccessor (the MMFMemory class) modifies virtual memory pages directly at the OS layer. The OS tries to fit those pages in physical RAM; if it can’t, it swaps them out to disk. A nice feature of writing the Pile into an MMF is that you don’t need to re-read everything from disk, even after the process restarts soon after shutdown. The OS keeps the pages that were mapped into the process address space around even after the process terminates. Upon start, the MMFPile can access the pages already in RAM much more quickly than reading from disk anew.
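The restart behavior can be demonstrated with any memory-mapped file API. A small sketch (Python’s mmap module here; the file name and sizes are arbitrary assumptions):

```python
import mmap, os, tempfile

# Write through a memory map, drop the mapping ("process exit"), then map
# the file again: the data comes back without an explicit load step, and
# recently touched pages are typically still warm in the OS page cache.
path = os.path.join(tempfile.mkdtemp(), "pile.dat")
with open(path, "wb") as f:
    f.truncate(4096)               # pre-size the backing file

with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as m:
    m[0:5] = b"hello"              # a plain memory write lands in the mapping

# "after restart": a fresh mapping sees the persisted pages immediately
with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as m:
    recovered = bytes(m[0:5])

print(recovered)
```

No explicit read or deserialize step runs between the two mappings; the OS page cache supplies the data.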

Do note that MMFPile yields slower performance than DefaultPile (based on byte[]) due to the unmanaged code context switch done in the MMFMemory class.

As you can see, the MMF solution does have an extra cost: throughput is lower due to the unmanaged MMF transition, and once you mount the Pile back from disk, warming the RAM with data takes time proportional to the amount of memory allocated. However, you do not need to wait for the whole working set to load back, as the MMFPile is available for writes and reads immediately after Pile.Start(). The full load of all data does take time: an 8.5 GB dataset takes about 48 seconds to warm up in RAM on a mid-grade SSD.

Other Improvements

Since our previous post on InfoQ we have made a number of improvements to the NFX.Pile:

Raw Allocator / Layered Design

The Pile implementation is now better layered, allowing us to treat string and byte[] as directly writeable/readable from the large contiguous blocks of RAM. The whole serialization mechanism is bypassed for byte[] completely, making it possible to use Pile as just a raw byte[] allocator.

Performance Boost

The segment allocation logic has been revised and yields 50%+ better performance during inserts from multiple threads, due to the introduction of a sliding-window optimization that avoids multi-threading contention. Also, strings and byte[] now bypass the serializer completely, yielding 5M+ inserts/sec in most cases (a 200%+ improvement).

Durable Cache

For performance reasons, the default mode for the cache is “Speculative”. In this mode hash code collisions may cause lower priority items to be ejected from the cache even when there is otherwise enough memory.

The cache server can now store data in a “Durable” mode, which works more like a normal dictionary. Because durable mode needs to do rehashing in the bucket, it is 5-10% slower than speculative mode. This is hardly noticeable for most applications, but you’ll need to test to see what is best for your particular situation.
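The difference between the two modes can be sketched with a fixed-size bucket table: speculative mode lets a collision evict the previous occupant, while durable mode chains colliding entries like an ordinary dictionary. A toy Python model (class names, slot count, and keys are illustrative; priorities and rehashing are simplified away):

```python
SLOTS = 8

def bucket(key):
    return hash(key) % SLOTS

class SpeculativeCache:
    """One item per bucket: a hash collision ejects the prior item."""
    def __init__(self):
        self.table = [None] * SLOTS
    def put(self, key, val):
        self.table[bucket(key)] = (key, val)   # may evict a colliding item
    def get(self, key):
        slot = self.table[bucket(key)]
        return slot[1] if slot and slot[0] == key else None

class DurableCache:
    """Colliding items are chained, like a normal dictionary."""
    def __init__(self):
        self.table = [[] for _ in range(SLOTS)]
    def put(self, key, val):
        chain = self.table[bucket(key)]
        chain[:] = [(k, v) for k, v in chain if k != key] + [(key, val)]
    def get(self, key):
        for k, v in self.table[bucket(key)]:
            if k == key:
                return v
        return None

spec, dur = SpeculativeCache(), DurableCache()
for c in (spec, dur):
    c.put(1, "a")
    c.put(9, "b")          # integer keys 1 and 9 share bucket 1 (9 % 8 == 1)
print(spec.get(1), dur.get(1))  # -> None a
```

The chain maintenance in durable mode is the extra work behind its modest slowdown, in exchange for never losing an item while memory remains.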

In-Place Object Mutation and Pre-allocation

It is now possible to alter objects at the existing PilePointer address. The new API Put(PilePointer...) allows for placing a different payload at the existing location. If the new payload does not fit in the existing block, then Pile creates an internal link to the new location (a la file system link in *nix systems) effectively making the original pointer an alias to the new location. Deleting the original pointer deletes the link and what it points to. The aliases are completely transparent and yield the target payload on read.

You can also pre-allocate more RAM for the future payload by specifying the preallocateBlockSize in the Put() call.
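The fit-or-link behavior can be modeled with a pointer record that either gets overwritten in place or redirected. A hedged Python sketch (the pointer layout, put/put_at/get helpers, and preallocate parameter are illustrative stand-ins for PilePointer, Put, and preallocateBlockSize):

```python
import pickle

buf = bytearray()

def put(obj, preallocate=0):
    data = pickle.dumps(obj)
    cap = max(len(data), preallocate)      # reserve room for future growth
    ptr = {"off": len(buf), "len": len(data), "cap": cap, "alias": None}
    buf.extend(data.ljust(cap, b"\x00"))
    return ptr

def put_at(ptr, obj):
    # in-place mutation: reuse the block if the payload fits, else alias
    while ptr["alias"] is not None:
        ptr = ptr["alias"]
    data = pickle.dumps(obj)
    if len(data) <= ptr["cap"]:
        buf[ptr["off"]:ptr["off"] + len(data)] = data
        ptr["len"] = len(data)
    else:
        ptr["alias"] = put(obj)            # original pointer now links onward

def get(ptr):
    while ptr["alias"] is not None:        # aliases are transparent on read
        ptr = ptr["alias"]
    return pickle.loads(bytes(buf[ptr["off"]:ptr["off"] + ptr["len"]]))

p = put({"v": 1}, preallocate=256)         # pre-sized block
put_at(p, {"v": 2, "note": "fits in the preallocated block"})

q = put(0)                                 # tiny block, no preallocation
put_at(q, {"big": "x" * 500})              # does not fit: q becomes an alias
print(get(p)["v"], len(get(q)["big"]))
```

Pre-allocating keeps hot, frequently mutated objects at a stable address; the alias path preserves the old pointer’s validity when growth is unavoidable.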

Questions and Answers

No, you do not need to wait for 500 gigabytes to load as a whole. What happens is: you Start() the Pile and it mounts the segments from disk in under a second; however, it does not know the full statistics yet. You can instantly read pointers into those segments, and you can instantly delete those pointers, but you cannot write into those segments until they get crawled - analyzed by an async thread. That thread may take minutes to load your data, and that’s OK.

Until it does, new writes go toward the end of the MMFPile.

To summarize: you may use the MMFPile one second after start. If you need the full statistics (which most likely you do not for normal operation), then you wait. Statistics = total object count, bytes used, etc.

The answer is this: the Pile is not empty - it gets “crawled” asynchronously by a separate worker, and the statistics (how many objects, how many bytes, and so on) get built as that thread reads the data into memory. But that does not mean that you cannot dereference data right away: the MMF files are handled by the OS, so a scattered read will work just fine right after the load. See PileForm, and run it to see how this works graphically using WinForms: github.com/aumcode/nfx/tree/master/Source/Testi...

With LocalCache it is tricky, as it is purposed for in-memory use and fast indexing. On shutdown you will lose the index, but you can keep the MMFPile intact. What you can do is reconstruct the cache by enumerating through the Pile after load, which will cause some delay. We have yet to release as open source our full cache server, which stores keys in a balanced index in the MMFPile using a version-tolerant serializer - that code is used in a proprietary system.

A nice feature of memory mapped files is that they are loaded on demand by the OS.

Unless you try to iterate through the entire collection, the OS is going to pick up your data from disk one page at a time until it runs out of RAM.

Now let’s say your application crashes and restarts. Since the OS already has the file mapped into memory, there is no delay. You’re not “copying” the file into your application’s memory; rather, your application is using the file and the file cache as memory.

Memory mapped files are often used for cross-process communication. If two applications map the same file to memory, they can see each other's changes. Again, this works because the file is kept in memory at the OS level. (I wonder how two Pile-based applications would handle this.)
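Two views over the same file are coherent because both are backed by the same OS pages. A small sketch of this (Python’s mmap; the two mappings stand in for two processes, and the file name is arbitrary):

```python
import mmap, os, tempfile

# Two independent mappings of the same file model two processes sharing
# memory: a write through one view is visible through the other, because
# both views resolve to the same pages in the OS file cache.
path = os.path.join(tempfile.mkdtemp(), "shared.dat")
with open(path, "wb") as f:
    f.truncate(64)

fa = open(path, "r+b"); view_a = mmap.mmap(fa.fileno(), 0)
fb = open(path, "r+b"); view_b = mmap.mmap(fb.fileno(), 0)

view_a[0:4] = b"ping"          # "process A" writes
seen = bytes(view_b[0:4])      # "process B" observes the change
print(seen)

view_a.close(); view_b.close(); fa.close(); fb.close()
```

As the next answer explains, the NFX MMFPile deliberately does not support this sharing, since its in-RAM metadata (freelists, locks) lives outside the mapped file.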

Regarding IPC using MMF as provided by Pile, the short answer: in the open-source NFX code, the MMFs mounted into the MMFPile are for the exclusive use of one process - this is purposely designed this way for simplicity and speed. Besides, IPC in NFX is done via Glue; there is no practical need to share memory via Pile for IPC.

The long answer: Pile is a memory manager - a thread-safe state machine. As such, it needs to synchronize access to the segment buffers and the free slot pool, which are not stored in the MMF. The MMF stores only the actual data kept in the Pile, not the freelists and other metadata. This is done on purpose, as syncing that state between processes would have been either prohibitive performance-wise or very complex to implement. Note that we are only talking about this particular implementation of the IPile interface as provided by NFX.

Internally we do have a distributed “huge pile” which spans multiple machines, but it is not open source as of yet, as it is part of the Agni cluster OS.

The Slim serializer uses the NFX.Serialization.Slim.TypeRegistry class to find the type for deserialization. To deserialize the data, the SlimSerializer first reads the type’s id from the memory stream and then uses that id to get the Type from the TypeRegistry. However, no such Type is registered at that time, so the exception is thrown. To avoid it, I have to put the Person object into the Pile first, so the TypeRegistry registers the Person Type (and if more types were stored previously, I need to do that in the exact same order so the TypeRegistry assigns each type the same id).

Is there a way to register the types before the TypeRegistry is used so I can be certain of the position?

Martin, you are not missing anything, and you did a fantastic job! This was me missing an improper merge, which I did not even realize was there.

The MMFPile writes its type registry to a file (near the data files) and reads it back on start. This code was absent on GitHub and NuGet (we use an internal company repository, and I incorrectly merged older code).

I have just synced the internal repo and GitHub and also released a new NuGet, so this problem is solved.

MMF is a great idea! Good job. You only have to be careful to have an integrity mechanism: validate each change with a single-byte write, and implement recovery code that invalidates untagged blocks. A few years ago I wrote a storage engine resilient to dirty shutdowns, based on writing a single byte to the hard drive, and it worked perfectly (you must take care with caching mechanisms - hard drives do not write blocks in logical order, but optimize for their cache and write-head movement, and on-board capacitors are there to finish the cache flush). If you don’t have such a mechanism, you can assume a 4 KB page is written either completely or not at all. But without a final validating write that confirms the modifications, you cannot guarantee the file will never be corrupted. Furthermore, the MMF block manager does not guarantee write-back order. You have to flush the validation data with a single write to be sure it is persisted after the content itself. You can read the source code of the extremely good LMDB library, which is based on MMF and incredibly robust against hard stops.
