64-bit vs 32-bit

As 64-bit machines become more common, the problems we need to solve also evolve. In this post I’d like to talk about what it means for the GC and the applications’ memory usage when we move from 32-bit to 64-bit.

One big limitation of 32-bit is the virtual memory address space – as a user mode process you get 2GB, and if you use large address aware you get 3GB. A few years ago these seemed like giant numbers, but I’ve seen that as more and more people start using the .NET Framework, the sizes of managed heaps go up at quite a high rate. I remember when I first started working on the GC (which was late 2004, I think) we were talking about hundreds of MBs of heap – 300MB seemed like a lot. Today I am seeing managed heaps that are easily GBs in size – and yes, some of them (and more and more of them) are on 64-bit – 2 or 3GB is just not enough anymore.

And along with this, we are shifting to solving a different set of problems. In CLR 2.0 we concentrated heavily on using the VM space efficiently. We tried very hard to reduce fragmentation on the managed heap, so that when you get hold of a chunk of virtual memory you can make very efficient use of it. This way people don’t run into problems where they have N managed heap segments and are running out of VM, yet many of those segments are quite empty (meaning they have a lot of free space on them).

Then you switch to 64-bit. Now suddenly you don’t need to worry about VM anymore – you get plenty there. Practically unlimited for many applications (of course it’s still limited – for example, if you are running out of physical memory to even allocate the data structures for virtual pages, then you still can’t reserve those pages). What kind of differences will you see in your managed memory usage?

First of all, your process consumes more memory – I am sure all of you are already aware of this – the pointer size is bigger – it’s doubled on 64-bit, so if you don’t change anything at all, your managed heap (which undoubtedly contains references) is now bigger. Of course, being able to manipulate memory in QWORDs instead of DWORDs can also be beneficial – our measurements show that the raw allocation speed is slightly higher on 64-bit than on 32-bit, which can be attributed to this.
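As a rough illustration of the pointer-size effect (`Node` is a made-up type, not anything from the framework), each reference field in an object occupies `IntPtr.Size` bytes – 4 in a 32-bit process and 8 in a 64-bit one – so reference-heavy objects roughly double their per-field cost on 64-bit:

```csharp
using System;

// Sketch only: a reference-heavy type whose per-object footprint grows
// noticeably on 64-bit because every reference field doubles in size.
class Node
{
    public Node Next;
    public Node Prev;
    public object Payload;
}

class PointerSizeDemo
{
    // Bytes spent on Node's three reference fields in the current process:
    // 3 x 4 = 12 on 32-bit, 3 x 8 = 24 on 64-bit (object headers add more).
    public static int ReferenceFieldBytes()
    {
        return 3 * IntPtr.Size;
    }
}
```

This ignores object headers and padding, which also grow on 64-bit; the point is just that the same object graph takes more bytes for the same payload.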

There are other factors that could make your process consume more memory – for example, the module size is bigger (mscorwks.dll is about 5MB on x86, 10MB on x64 and 20MB on ia64), instructions are bigger on 64-bit, and so on.

Another thing you may notice – if you have looked at the performance counters under .NET CLR Memory – is that you are now doing a lot fewer GCs on 64-bit than what you used to see on 32-bit.

The curious minds might have already noticed one thing – the managed heap segments are much bigger on 64-bit. If you do !SOS.eeheap -gc you will now see much bigger segments.

Why did we make the segment size so much bigger on 64-bit? Well, remember in Using GC Efficiently Part 2 we talked about how we have a budget for gen0, and when you’ve allocated more than this budget a GC is triggered. When you have a bigger budget you’ll need to do fewer GCs, which means your code gets more chance to run. From this perspective you should get a performance gain when you move to 64-bit – I want to emphasize the “this perspective” part because in general things tend to run slower on 64-bit. The perf benefit you get from the GC may very well be obscured by other perf degradations. In reality many people are not expecting a perf gain when they move to 64-bit – rather, they are happy with being able to use more memory to handle a bigger workload.

Of course we also don’t want to wait for too long before we collect – we strive for the right balance between memory (how much memory your app consumes) and CPU (how often user threads run).

Is it really possible to use 2GB of RAM with a .NET application on a 32-bit platform? From our experience we can’t really exceed a total of 1.4GB of RAM in a single managed .NET process even though we have plenty of free RAM (running on a dual-XEON HP with 4GB of RAM, processing a price feed – a lot of small data with frequent changes). It looks to us like once the process reaches around 1.4GB (~1GB is data only), .NET gives up and the process is essentially dead, with no real response to outside services. Can we expect this will never happen again on 64-bit?

MM: no, because there are other things that are slower/bigger on 64-bit. You don’t necessarily come out ahead as a net result.

Libor: did you enable large address aware? If not you are limited to 2GB of user mode space. With large address aware you get 3GB of user mode space (so kernel mode is limited to 1GB). This doesn’t have anything to do with managed code – it’s an OS feature.
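For reference, marking an executable as large address aware is a PE header flag set after the build; a typical sequence looks like this (`MyApp.exe` is a placeholder, and `editbin`/`dumpbin` ship with Visual Studio / the Windows SDK):

```shell
# Mark an already-built executable as large address aware (post-compile).
editbin /LARGEADDRESSAWARE MyApp.exe

# Verify the flag took effect: the header output should include the line
# "Application can handle large (>2GB) addresses".
dumpbin /HEADERS MyApp.exe
```

On a 32-bit OS you also need the /3GB switch in boot.ini for the extra user mode space to exist at all; on 64-bit Windows a large-address-aware 32-bit process gets the larger address space without any boot switch.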

1) I am also curious about what other things are slower in 64 bits :-))

2) I understand that for the 64-bit GC, you have to spend more time compacting free memory because the budget is now higher and therefore there is more area to sweep. Is the GC using any algorithm to strike the balance for when it needs to compact?

Bigger pointer size/bigger module size are the *main* factors that cause slow down on 64-bit (not the only factors). Of course they apply to both native and managed code.

nativecpp –

1) see above; 2) yes; 3) yes (and the size of the minimum large object is the same. Yes, I know you are going to say maybe we should make it bigger – maybe that’s the case – I haven’t gotten around to looking at this – it hasn’t been a problem.).

So is 85000 bytes still the right size for a large object on a 64-bit OS? In some cases what used to be a small object on x86 now becomes a large one on x64, assuming the object is not a buffer or some similar data structure. In addition, does this not also mean that since we have a 64-bit pathway, we can move 85000 bytes much faster than with a 32-bit pathway?


@roy: "one sad thing about programming in .net" ? Do you remember C++? Fragmented heaps at best? Exceptions due to dangling pointers at worst? Having to cope with 7 different smart pointer implementations using 9 heap managers since every library and developer had to get smart about that? And still playing tricks with MFC’s CString reference counting? And I did not even mention COM yet…

Garbage collection is the single best thing in .NET. If Microsoft shredded C# I would rather switch to VB.NET than back to C++, just because of the GC 😉

I’m sure I speak for everyone when I say your posts are truly awesome. This is quite a bit off topic but I didn’t know where to post or how to contact you. And hopefully the answer will have some benefit for others.

I’m writing a custom session manager for a suite of high-performance web apps. There’s a fair amount of complexity with the tracking, expiring and synchronizing of cache objects and it struck me that in a way, what I was making was actually a sort of simplified but specialized garbage collector.

The applications are “high-cache” and are designed to utilize the full system resources – using a large amount of RAM. But we need to have the memory freed as soon as possible after a cache object expires to make room for more incoming session objects. As the session manager “knows” when it wants to free up memory and it’s quite possible that many of these sessions are quite long lived before they expire – would you think that it’s reasonable to assume that managing our own memory assignments, etc. is a good idea? Or is the GC going to do a much better job of this than we ever could? And if we do rely on the GC for everything, is there anything we can do to “help” it – seeing as though we’re looking to aggressively reclaim much needed memory? Everything I’ve ever read has pretty much said to leave it alone.

I haven’t really been able to find much info on this so any advice you could give would be greatly appreciated.

Thomas, caching is a big topic. As of now, I would suggest you do most of the management of the cache yourself, since it sounds like you want rather strict behavior from your cache. Since you know when a cached item expires, you can set up your policy to trigger a full GC when you know a SUBSTANTIAL number of cached items have expired (you don’t want to trigger unproductive full GCs).
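A minimal sketch of that policy might look like the following (all names and the threshold are made up – tune them for your workload): count expirations and induce a full collection only after a substantial batch has expired, so each induced GC has plenty to reclaim.

```csharp
using System;
using System.Threading;

// Sketch: induce a full GC only after a SUBSTANTIAL number of cache items
// have expired, so the induced collections stay productive.
class CacheGcPolicy
{
    const int ExpiredThreshold = 10000;   // illustrative; tune for your app
    static int _expiredSinceLastGc;

    public static void OnItemExpired()
    {
        if (Interlocked.Increment(ref _expiredSinceLastGc) >= ExpiredThreshold)
        {
            Interlocked.Exchange(ref _expiredSinceLastGc, 0);
            GC.Collect();                 // full (all generations) collection
        }
    }
}
```

The cache manager calls `OnItemExpired` from wherever it retires an entry; everything else is left to the GC.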

Your blog posts are quite informative. Though my question is not related to 64-bit vs 32-bit, it is however about memory and garbage collector.

We are currently migrating from a machine which has a 2GB RAM to a machine which will have 4GB RAM. My question is what would be the maximum memory available for each windows process if we migrate to the new 4GB machine? Will it only be 3GB user mode space?

Will the GC not run as frequently as it did on our old machine, the reason being that the new one has more memory?

Will my new machine’s increased memory translate to increased size of managed heaps or will the managed heap size remain the same for both my old machine and new machine?

What advantage can I get by migrating to new machine which has more memory? I know that this is a dumb question but some of your pointers will be appreciated.

How much memory you use depends on how much data you allocate. A memory manager will add some overhead to the absolute amount of memory you allocate, for efficiency purposes. If you have more physical memory you can handle more requests concurrently ’cause you can use more memory.

Thanks for your response. I read your post on some fundamentals of how VM works. I have couple of follow up questions.

Firstly, on a 32-bit machine, user mode virtual space for a process can be as large as 3GB. What is the max size that a managed heap can be? i.e., after what point will we be getting OOM errors? somewhere close to 3 GB? I understand that in one of your posts you were not willing to reveal the size of the managed heap. You do not have to answer this question if you do not want to.

If I migrate from a 2GB RAM machine to a 4GB RAM machine, can I assume that GC collections on the 4GB RAM machine will not be as frequent as on the 2GB RAM machine, keeping in mind that we make the same number of allocations?

Sorry about bugging you with questions, but I feel that it is important for me to understand what is going on behind the scenes.

In one of your earlier responses you said that "If you have more physical memory you could handle more requests concurrently ’cause you can use more memory."

Why can we handle more requests concurrently? Wouldn’t there always be a constant number of threads in the thread pool that can handle only a certain number of requests, even if we move from one machine to another machine with more RAM?

As a simplified example – if one request takes 10MB of memory and you have 500MB of memory available, assuming you limit it so it doesn’t page, you get 50 concurrent requests; if you have 1GB of memory available, you get 100 concurrent requests.

The thread pool you are talking about is just a library you use. If I use a different thread pool which allows a max of 2 threads per CPU, I can handle more requests if other resources suffice. Now perhaps each request will be processed more slowly (because they have to share the CPU) but I can process more requests concurrently.

The amount of resources you use limits how many requests you can process concurrently. Memory is one of them. You could be running out of memory, or handles, or threads, or something else. These are all factors you need to consider. So assuming you have no other limiting factors besides memory, adding memory could allow for more concurrent requests.

Try using SOS.dll in windbg. You can get that for free with the Debugging Tools for Windows package. I don’t think VS is a 64-bit package yet.

@Na:

When your physical memory becomes full and you start paging out old data to disk (even though there is virtual space for it, it has to go to disk if there’s not enough physical memory), then you can’t keep concurrent actions alive, because the disk is a serial device (and it’s slow to boot). Basically, going to disk forces requests with paged-out data to wait in queue for the disk to re-read that data. Thus it is still a win for you to have more memory, but on a 32-bit OS, the practical limit for you to get any benefit from it is ~4GB. With a 64-bit version of Windows, you can have terabytes of memory and it will still improve your concurrency.

With our server app on x64 we are seeing increased managed heap fragmentation due to some evil pinning we are doing on the managed heaps. Running on 32-bit we still do our pinning, but the resultant heap fragmentation is much less. Is this so much more significant on x64 because of the increased allocation size? How does memory pressure on the system affect GC behavior? Has any thought gone into moving pinned objects to a separate heap in the future to help, or something similar? In earlier blog entries you indicated that with the pinned objects staying in gen0 we can allocate within the fragments – are there any caveats to that?

Dave, if your fragmentation is in gen0 then yes, it will be used to satisfy allocation requests. No caveats there. And if it’s in gen0, it’s most likely because on 64-bit, since the segment size is much bigger, we can allow much more allocation in gen0 before we trigger a GC.
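As an aside, the usual way to limit pinning damage is to pin for as short a time as possible and free the handle promptly, so the GC can move the object again afterwards. A sketch (`PassToNative` and the native call are placeholders):

```csharp
using System;
using System.Runtime.InteropServices;

// Sketch: pin a buffer only for the duration of the native call, then
// free the GCHandle immediately so the object becomes movable again.
class PinningSketch
{
    public static void PassToNative(byte[] buffer)
    {
        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            IntPtr p = handle.AddrOfPinnedObject();
            // ... call into native code with p here ...
        }
        finally
        {
            handle.Free();   // unpin as soon as possible
        }
    }
}
```

Long-lived pins scattered across the heap are what create fragmentation; short-lived pins around individual calls are far cheaper.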

It works similarly on 64-bit and on 32-bit wrt memory load on the machine. When the physical memory load gets really high we will trigger a GC if we get the low memory notification from the OS; and we are more aggressive at triggering full GCs (when we see them as productive).

As far as the notification from the OS about memory conditions – how does that work – is there a threshold value somewhere? Our app consists of native code hosting managed code; our native code also does significant processing and uses significant memory (committed) – maybe 300-400MB per instance of the process. Should we be doing something with AddMemoryPressure() to help the GC do a better job detecting when to do a collection? Other apps on these systems are competing for resources as well (SQL Server, other server apps)…
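For reference, the AddMemoryPressure pattern pairs an add with a matching remove around the lifetime of the native allocation, so the GC’s scheduling accounts for memory it cannot see (`NativeBuffer` and the elided allocation are illustrative, not from the post):

```csharp
using System;

// Sketch: a managed wrapper around a large native allocation that tells
// the GC about the memory it holds via Add/RemoveMemoryPressure.
class NativeBuffer : IDisposable
{
    private readonly long _size;

    public NativeBuffer(long size)
    {
        _size = size;
        // ... allocate 'size' bytes of native memory here ...
        GC.AddMemoryPressure(size);      // GC now weighs this in its decisions
    }

    public void Dispose()
    {
        // ... free the native memory here ...
        GC.RemoveMemoryPressure(_size);  // must mirror the earlier add exactly
        GC.SuppressFinalize(this);
    }
}
```

The adds and removes must balance over the object’s lifetime, or the GC’s view of native pressure drifts.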

I think there is a great misunderstanding within Microsoft of why customers use x64 – because I think many Microsoft folks think only of workstations or SQL Server. Our customer base sees x64 machines as terminal servers – and the increased GC heap kills them. They already run out of reasonable heap space at 50 .NET users on a 32-bit terminal server (running several .NET applications in each winstation for each user) – and the reason to go x64 is to use the memory to reduce the number of physical terminal servers, by using native x64 programs and increasing the number of users per physical server (in our tests x86 on x64 = 40% performance decrease over x86 on x86, so they want native x64 applications to regain x86-on-x86 performance). The large GC heaps just force tens of GB of memory and page file to be used, which contradicts the whole reason to use a native x64 terminal server in the first place. Why doesn’t Microsoft allow the GC to be configured to the same size as x86 for these terminal server cases, so the memory footprint is reasonable? These are just WPF/WinForms applications where the additional GC hit is minuscule compared to the GC memory load. Remember that 64GB of memory is still out of reach of many x64 servers, and using 24GB of page file (for max potential virtual memory) is considered extreme at, say, 400 × 64MB GC segments, even on x86.

I’m a bit confused by the Large Address Aware support with regard to .NET. There’s way too much conflicting information flying around on the net about this stuff and I can’t find one all-inclusive and trustworthy (Microsoft) answer about it.

I get that we need to add the /3GB parameter to the boot.ini but where the misinformation comes in is:

1. Do I have to do anything else in addition to the OS switch? Some sources I’ve read say the switch is all I need if I’m using .NET 1.1 or later (i.e., 2.0, 3.5, etc.). Some sources say no, I have to modify my executable (e.g., editbin /LARGEADDRESSAWARE MyApp.exe) – if this is true, is this the only way to add support to the executable (i.e., post-compile)? Is there an attribute or something I can add to the project?

2. My understanding is, if I do have to use editbin, it only has to be done on the executable (i.e., MyApp.exe). What if I’m dealing with an ASP.NET application where there are only DLLs? We have a WinForm application and an ASP.NET version.

3. Finally, (I’m not entirely sure about this point so forgive my ignorance) Aren’t AppDomains limited to 2GB? If so would my application really benefit from adding Large Address Aware support?

The reason I’m concerned is we have an application that deals with a large number of (small) objects and consumes a tremendous amount of memory. We’re trying to figure out if we should pursue Large Address Aware support as a short term solution while we refactor our application.

How much memory your processes are using is mainly determined by you, not by the GC. You should profile your application to see what is using so much memory.

Buuba,

Things may run slightly slower on 64-bit (or not; it depends on your scenario) but you get a much bigger benefit, which is that you now have access to much more memory. To most people this is extremely attractive.

Jeronimo,

I am not sure if there is a way to enable large address aware in VS projects, but if you are unsure you can always use dumpbin /HEADERS on the exe to check whether it’s already marked as LAA enabled.

I am not sure what you meant by “Aren’t AppDomains limited to 2GB?”. If you meant the managed heap limit, then appdomains don’t have such a limit – it’s limited per process and it’s by default 2GB. So using 3GB could give you some benefit – of course, keep in mind that the kernel then only gets 1GB, which is not necessarily desirable.

1) For the .NET 3.5 release, besides GCCollectionMode and GCSettings.LatencyMode, are there any new features?

2) I am curious about the DLR. Since it is dynamic, I would assume that the GC would be doing a lot of collections, possibly in gen0. I know that the GC is very good at collecting; are there any issues where the GC needs to be adjusted for the DLR?

At my firm we are currently working on migrating from the 32-bit 1.1 framework to the 64-bit 2.0 framework, precisely because we need the enlarged heap space.

We have built the application and are now in the stages of testing and benchmarking to see where we stand.

Of course we find it very hard to determine speeds and such, as our machines are completely different (the new 64-bit ones are also multi-CPU and a lot newer, so even if the application works slower, the machine itself is faster), but we did find some issues where stuff works slower on 64-bit.

Is there a place where I can find/read about such differences, to see if what I found is documented etc?

When we use remoting on DataTable objects, it seems what once took 0.5 seconds now takes 1.5 seconds. Now this may seem like nothing, but if you remote a large number of small DataTables, this could mean a 3x slowdown.

Another question:

When we were still using the 32-bit/1.1 framework, we used GC.Collect manually from time to time when the application would allocate a large amount of memory fast; it seemed the only way to prevent OOM exceptions. When migrating this to 64-bit (which was already done), would you consider removing those?

Do you think they might take a long time and decrease performance once OOM exceptions are practically out of reach?

>>>1) For the .NET 3.5 release, besides GCCollectionMode and GCSettings.LatencyMode, are there any new features?

No, 3.5 is a small release. There was some performance tuning work included in 3.5 as well, along with a few bug fixes.

>>>2) I am curious about the DLR. Since it is dynamic, I would assume that the GC would be doing a lot of collections, possibly in gen0. I know that the GC is very good at collecting; are there any issues where the GC needs to be adjusted for the DLR?

>>>Is there a place where I can find/read about such differences, to see if what I found is documented etc?

I would imagine if you search on msdn you should find some info.

>>>When we were still using the 32-bit/1.1 framework, we used GC.Collect manually from time to time when the application would allocate a large amount of memory fast; it seemed the only way to prevent OOM exceptions. When migrating this to 64-bit (which was already done), would you consider removing those?

1.1 had some premature OOM bugs that were fixed in 2.0. Are you using 2.0? If so, you should try removing the induced GCs – it’s very likely you will get better results.
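One low-risk way to stage that removal (a sketch; the flag and class names are made up) is to put the old induced collections behind a setting that ships disabled, so you can verify the 2.0/64-bit behavior while keeping a fallback:

```csharp
using System;

// Sketch: gate legacy GC.Collect calls behind a flag that defaults to off,
// instead of deleting them outright during the migration.
class InducedGc
{
    // Was effectively always-on in the 1.1/32-bit build.
    public static bool Enabled = false;

    public static void MaybeCollect()
    {
        if (Enabled)
        {
            GC.Collect();
        }
    }
}
```

Replace each bare `GC.Collect()` call site with `InducedGc.MaybeCollect()`; once the 2.0/64-bit build proves stable, delete the class entirely.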

I have an existing ASP.NET 2.0 application running on a 32-bit machine; now we want to move it to a 64-bit machine and an ASP.NET 3.0 environment. What benefits would I get by shifting to the new setup? Would there be any performance increase or decrease? What kinds of issues might I encounter in this migration?

How does the difference between 64-bit and 32-bit affect applications running under the WOW64 subsystem? Do such apps still get access to the increased VM space?

For example, we have a managed image processing app; however, it uses a number of 3rd party native COM components that have not been modified for 64-bit use, so we have to build our app as 32-bit even when running under a 64-bit OS.

We allocate a lot of objects on the LOH, mainly byte arrays up to about 80MB in size, and have recently been running into a number of OOM exceptions. Up till now our memory allocation pattern has been rather naive, and I have been given the task of coming up with something more efficient (such as a memory pool pattern). I was just wondering if fragmentation of VM is still likely under this scenario.

As I said, on 64-bit your pointers are bigger, so you consume more memory for the same payload. However, on 64-bit you are basically not limited by the virtual address space (physical memory can then become a limiting factor, but I presume you can just add more of that :-)). So let’s say you used to only be able to handle 100 concurrent requests due to the virtual address limitation; now you might be able to handle a lot more.

As far as migrating from ASP.NET 2.0 to 3.0, they did make perf improvements in 3.0. As with any kind of significant migration, I would recommend you try it out in a test environment first, of course.

dscaravaggi,

AWE is an OS feature, so you can enable it for your process and use it in your application. But the GC heap does not allocate with AWE. Usually people move to 64-bit to take advantage of the much larger virtual address space. Keep in mind that if you have a much larger heap (just because now you can allocate a lot more, assuming you are not limited by the amount of physical memory, and you are handling a lot more stuff), the latency from full GCs can also be larger (obviously, it takes a lot longer to collect a 12GB heap than an 800MB one).

planetmarshalluk,

If you enable large address support and are running on WOW64 instead of 32-bit, you get 4GB of user mode virtual address space instead of 3GB. So that’s an advantage if virtual address space is the limitation for you. Allocating large objects out of a pool is a good way to decrease fragmentation on the LOH – obviously you want to verify that LOH fragmentation is the problem first.
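A minimal version of such a pool (a sketch under the assumption that your buffers can share one fixed size; the class name is made up) hands out byte arrays and reuses them instead of repeatedly allocating and freeing on the LOH, which keeps the LOH from filling with differently-sized free holes:

```csharp
using System;
using System.Collections.Generic;

// Sketch: a fixed-size large-buffer pool. Reusing identical buffers means
// LOH free space is always exactly the right shape for the next request.
class LargeBufferPool
{
    private readonly int _bufferSize;
    private readonly Stack<byte[]> _free = new Stack<byte[]>();

    public LargeBufferPool(int bufferSize)
    {
        _bufferSize = bufferSize;   // e.g. 1MB: well above the LOH threshold
    }

    public byte[] Rent()
    {
        lock (_free)
        {
            if (_free.Count > 0) return _free.Pop();
        }
        return new byte[_bufferSize];
    }

    public void Return(byte[] buffer)
    {
        // Reject foreign or wrongly-sized buffers instead of polluting the pool.
        if (buffer == null || buffer.Length != _bufferSize) return;
        lock (_free)
        {
            _free.Push(buffer);
        }
    }
}
```

Callers `Rent()` a buffer, use whatever prefix of it they need, and `Return()` it; for genuinely variable sizes a few pools at power-of-two sizes is a common variation.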

We compiled the same .NET 2 app with the AnyCPU and x86 platform targets in VS2008, and ran them on 64-bit Win2008 and WOW64 respectively. The 64-bit version consumes 50% more memory and takes 60% longer than the 32-bit version. As you explained earlier, bigger pointer size/bigger module size are the *main* factors that cause the slowdown on 64-bit. Are there major improvements in .NET 4 on this 64-bit performance penalty?

Running on x64, it does not seem to be possible to allocate more data than can fit in physical memory. For example, if I have 16GB of RAM on an x64 server, I cannot allocate 64GB of data in a .NET program. If I try to allocate past the physical limit, the garbage collector consumes 100% of CPU, preventing me from allocating any more (virtual) memory. Why does .NET prevent usage of virtual memory?

You must be very busy indeed answering all of these questions. Can you explain why there is no LOH user compacting function? Why would it be a problem to block and wait while the LOH is compacted? I think you guys are doing a great job and I love .NET GC! But it would be nice to have just a little control back 😉 Anyway thanks for the great article these are certainly helpful and knowing what I am up against is always good!

Thanks for the good article. My question is about the behavior of the GC on a 64-bit machine. I am trying to understand: even though VM is practically unlimited, as GC collections become less frequent, won’t paging and thus the number of disk I/Os increase? Will this not be counterproductive?

Does GC collection frequency take into consideration the physical RAM vs VM ratio?