Thursday, May 24, 2012

At scale: gigabyte page sizes? (technical)

The question that drives us is 'scalability'. It's a systemic anomaly inherent to the programming of the matrix.

I'm too lazy to spend more time Googling the answer to this question, so I thought I'd just ask it in a blogpost. I can allocate memory with a larger pagesize than 4096 bytes using VirtualAlloc()/mmap(). The next step up is the 2-megabyte pagesize, but can I also allocate the next level pagesize of 1-gigabyte?

Explanation:

On 32-bit processors, there were two levels of virtual memory. A 32-bit address was divided into 10-bits for the first level lookup, with the next 10-bits being used for the next stage lookup, with the final 12-bits resulting in a 4096 pagesize (10 + 10 + 12 = 32-bits). But, you could skip the second level lookup, thereby having a 22-bit page (or 4-megabytes).

On 64-bit processors, the 64-bit address is broken up into 9-bit chunks, with the last chunk remaining at 12-bits, meaning that the default pagesize is still 4096. In today's systems, the upper bits in a virtual address aren't used, so addresses are only 48 bits, resulting in a four stage virtual-to-physical memory translation (9 + 9 + 9 + 9 + 12 == 48).
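To make the bit-slicing concrete, here is a minimal sketch that splits an address into its lookup indexes and page offset for both layouts described above. The function name `split_address` and the two layout constants are my own for illustration; they aren't taken from any operating system or library.

```python
# Field widths, most significant first; the last field is the page offset.
LAYOUT_32 = (10, 10, 12)       # two-level lookup on 32-bit x86
LAYOUT_48 = (9, 9, 9, 9, 12)   # four-level lookup on 48-bit x86-64 addresses

def split_address(addr, widths):
    """Split addr into one integer per field, most significant field first."""
    fields = []
    shift = sum(widths)
    for width in widths:
        shift -= width
        fields.append((addr >> shift) & ((1 << width) - 1))
    return fields

if __name__ == "__main__":
    # An arbitrary example address, chosen only for illustration.
    addr = 0x00007F1234567ABC
    *indexes, offset = split_address(addr, LAYOUT_48)
    print("table indexes:", indexes, "page offset:", hex(offset))
```

Each index selects one of 512 entries (2^9) in a table at that level, and only the final 12 bits pass through untranslated.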

Going up one level means you can get a 21-bit pagesize (2-megabytes instead of 4096-bytes). Going up two levels means a 30-bit pagesize, or 1-gigabyte. The x86 architecture supports this. And I assume it also supports going up further levels (39-bit pagesize of 0.5-terabytes), but that would be crazy.
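The arithmetic is just 12 offset bits plus 9 more for every level you skip. A quick sketch (the helper name `page_size_bits` is made up for this post):

```python
OFFSET_BITS = 12   # offset within a default 4096-byte page
INDEX_BITS = 9     # bits consumed by each level of lookup on x86-64

def page_size_bits(levels_skipped):
    """Bits of page offset when the last `levels_skipped` lookups are skipped."""
    return OFFSET_BITS + INDEX_BITS * levels_skipped

if __name__ == "__main__":
    for skipped in range(4):
        bits = page_size_bits(skipped)
        print(f"skip {skipped} level(s): {bits}-bit page = {2 ** bits:,} bytes")
```

Whether the hardware actually accepts a page at every one of those levels is a separate question; the sketch only shows where the sizes come from.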

The question I have is that even though x86 supports this, does the operating system?

The reason they wouldn't want to is because most such allocations would fail. As the system runs, it's allocating/freeing 4096-byte chunks of memory. While this may seem contiguous in virtual memory, it is in fact spread throughout physical memory. You might be able to allocate a contiguous 1-gigabyte chunk of memory at system startup, but as time goes on, you have less and less of a chance of that succeeding.

The reason they should support this is because it's something like 20 lines of code. It's extraordinarily easy to support. In addition, they could support it without telling you. To allocate memory for the existing 2-megabyte pagesize, you call VirtualAlloc() with the MEM_LARGE_PAGES flag. If you do this asking for 1-gigabyte, the operating system can first attempt a single 1-gigabyte page, and if that succeeds, return it to you, and if not, return you memory using the smaller pages.
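The fallback idea can be sketched in a few lines. This is not how any real kernel does it: `try_alloc` is a hypothetical callback standing in for whatever the operating system does internally, and `fragmented_system` is a fake allocator I made up to simulate a machine with no free contiguous gigabyte.

```python
GIB = 2 ** 30
MIB = 2 ** 20

def allocate_with_fallback(size, try_alloc, page_sizes=(GIB, 2 * MIB, 4096)):
    """Try the largest page size first; fall back to smaller ones on failure.

    try_alloc(size, page_size) is a hypothetical hook that returns a handle,
    or None when it cannot find suitable contiguous physical memory.
    """
    for page_size in page_sizes:
        if size % page_size:
            continue  # the request is not a whole number of these pages
        handle = try_alloc(size, page_size)
        if handle is not None:
            return page_size, handle
    raise MemoryError("no page size could satisfy the request")

def fragmented_system(size, page_size):
    """Fake allocator: pretends no contiguous 1-gigabyte region is free."""
    return None if page_size >= GIB else object()

if __name__ == "__main__":
    page_size, _ = allocate_with_fallback(GIB, fragmented_system)
    print("request satisfied with", page_size, "byte pages")
```

The caller never needs to know which page size it got, which is the point: the operating system could do this behind the existing flag.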

Let's say that you are trying to create a system supporting 8-million concurrent TCP connections. At 512-bytes per TCP control block (TCB), that means you need 4-gigabytes of RAM just to store the TCP information. This further means that you'll need 8-megabytes of RAM just for the last-level of the pagetable (and 16-kilobytes for the level above that). This 8-megabytes is as large as the L3 cache on processors.
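Here is that arithmetic spelled out. I'm assuming "8-million" means 8 × 2^20 connections (which is what makes the total come out to exactly 4 gigabytes) and 8-byte pagetable entries, as on x86-64.

```python
CONNECTIONS = 8 * 2 ** 20   # assumption: "8-million" read as 8 * 2^20
TCB_BYTES = 512             # per-connection TCP control block
PAGE_BYTES = 4096           # default page size
ENTRY_BYTES = 8             # one x86-64 pagetable entry

tcb_total = CONNECTIONS * TCB_BYTES                     # memory for all TCBs
last_level = tcb_total // PAGE_BYTES * ENTRY_BYTES      # one entry per 4096-byte page
level_above = last_level // PAGE_BYTES * ENTRY_BYTES    # one entry per last-level table

if __name__ == "__main__":
    print("TCBs:       ", tcb_total // 2 ** 30, "GiB")
    print("last level: ", last_level // 2 ** 20, "MiB")
    print("level above:", level_above // 2 ** 10, "KiB")
```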

The upshot is that when a packet arrives, the TCP control block will cause a cache-miss and go to main memory, which will cost 100 cycles. But, resolving the pagetable will also hit main memory, causing another 100 clock cycles. And that's just the TCB. The socket structure, and then application-layer memory will all likewise have a double-hit to main memory. You are up to 600 clock cycles just to get the memory into the CPU for the packet and you haven't done anything yet. Not to mention that you've consumed an enormous amount of memory bandwidth.
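As a back-of-the-envelope model (the 100-cycle figure is the rough number used above, not a measurement, and the function name is mine):

```python
MISS_CYCLES = 100  # rough cost of one trip to main memory, as assumed above

STRUCTURES = ("TCP control block", "socket structure", "application data")

def cold_packet_cycles(pagetable_misses):
    """Cycles spent waiting on memory before any real work is done.

    Each structure costs one miss for the data itself, plus one more
    if the pagetable entry that maps it also has to come from main memory.
    """
    misses_per_structure = 1 + (1 if pagetable_misses else 0)
    return len(STRUCTURES) * misses_per_structure * MISS_CYCLES

if __name__ == "__main__":
    print("with pagetable misses:   ", cold_packet_cycles(True))
    print("without pagetable misses:", cold_packet_cycles(False))
```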

But if you use large pagesizes, then you get rid of half the problem. This is especially true of the 1-gigabyte size. As applications grow, so does the cost of virtual-memory to physical-memory translation. On typical applications, that cost is already 10%, and at scale, it can reach 50%. That's why people see moving their code into the kernel as the only solution, because it works with physical memory and gets rid of this overhead. But, at gigabyte pagesizes, this cost completely disappears. The translation will be cached in the TLB (translation lookaside buffer), making virtual-memory accesses just as fast as physical memory.
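To see why the gigabyte size is the interesting one, count how many translations it takes to cover the 4-gigabyte TCB table from the example at each pagesize. TLB capacities vary by processor, so I'm not plugging in a specific number; the point is the difference in magnitude.

```python
WORKING_SET = 4 * 2 ** 30   # the 4-gigabyte TCB table from the example

PAGE_SIZES = {
    "4096-byte": 2 ** 12,
    "2-megabyte": 2 ** 21,
    "1-gigabyte": 2 ** 30,
}

def translations_needed(working_set, page_size):
    """Number of distinct page translations needed to cover working_set bytes."""
    return -(-working_set // page_size)  # ceiling division

if __name__ == "__main__":
    for name, size in PAGE_SIZES.items():
        print(f"{name} pages: {translations_needed(WORKING_SET, size):,} translations")
```

A million translations can't stay cached; a handful can.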

There are a lot of good reasons to have small 4096-byte pages, such as supporting Apache-style programming of forking processes, where you want all the processes to "copy-on-write" page entries. It would be a horrible performance loss moving to 2-megabyte pages. But on the other hand, Apache is stupid, and smart software like Nginx would prefer to have a few larger pages. As networks continue to scale beyond where even Nginx is today, we might want even larger pages.