Posted by Soulskill on Tuesday December 24, 2013 @07:20PM
from the try-a-bigger-sail dept.

jones_supa writes "The x32 ABI for Linux allows the OS to take full advantage of an x86-64 CPU while using 32-bit pointers, thus avoiding the overhead of 64-bit pointers. Though the x32 ABI limits the program to a virtual address space of 4GB, it also decreases the program's memory footprint and in some cases can allow it to run faster. The ABI has been talked about since 2011 and has had mainline kernel support since 2012. x32 support within other programs has also trickled in. Despite this, there still seems to be no widespread interest. x32 support landed in Ubuntu 13.04, but no software packages were released. In 2012 we also saw some x32 support out of Gentoo and some Debian x32 packages. Besides the kernel support, last year we also saw support for the x32 Linux ABI land in glibc 2.16 and GDB 7.5. The only Linux x32 ABI news Phoronix had to report on in 2013 was Google wanting mainline LLVM x32 support and other LLVM project x32 patches. The GCC 4.8.0 release this year also improved the x32 situation. Some people don't see the ABI as being worthwhile when it still requires 64-bit processors, and the performance benefits aren't convincing enough across workloads to justify maintaining an extra ABI. Would you find the x32 ABI useful?"

Well, I do find it extremely useful. Especially in Debian & Ubuntu, we have multi-arch support. For some specific workloads using interpreted languages, it just cuts the memory footprint in half: PHP and Perl, for example. If you ever ran Amavis and SpamAssassin, you certainly know what I mean: they take double the amount of RAM on 64 bits. Since most of our servers are running PHP, Amavis and SpamAssassin, this would be a huge benefit (from 800 MB down to 400 MB as the minimum server footprint), while still being able to run the rest of the workloads in 64 bits: for example, Apache itself and MySQL, which aren't taking much RAM anyway compared to these anti-spam dogs.

You need to run in 64-bit mode if you want to take advantage of the many instructions that increase IPC and reduce cache evictions. If you want that benefit while keeping your pointer size to a minimum, then you need x32 mode: i.e., 64-bit mode with truncated pointers. You can probably gain 10%-15% performance with few changes over true 32-bit mode. A lot of that is hidden when using 64-bit pointers because of the reduced data density for some workloads.

Yes and no. The larger your cache, the higher its latency. Can't get around this. L1 caches tend to be small to keep the execution units fed with typically 1 or 2 cycle latencies. L2 caches tend to be about 16x larger, but have about 10x the latency.

L2 cache may have high latency, but it still has decent bandwidth. To help hide the latency, modern CPUs have automatic prefetching and also async low-priority prefetch instructions that allow the programmer to tell the CPU to attempt to load data from memory ahead of time.

In answer to my question, no, it is not dirt cheap. For any size cache you will get fewer cache misses if your data structures are smaller than if they are larger. Until the cache is so big that everything fits in it, you always win if you can double what you can cram into it.

Until the cache is so big that everything fits in it, you always win if you can double what you can cram into it.

Which is all nice and good except this implies your data structure was mostly pointers to begin with, so if you want to increase cache efficiency forget about pointer size and redesign them for better locality.

I suspect this is the real reason why this ABI has not caught on: anyone who cares has already taken steps that render it pointless.

For some workloads, it's ~40% faster vs. amd64, and for some, even more than that vs. i386. In the typical case, though, you see ~7% speed and ~35% memory improvement over amd64.

As for memory being cheap, this might not matter on your home box where you use 2GB of 16GB you have installed, but vserver hosting tends to be memory-bound. And using bad old i386 means a severe speed loss due to ancient instructions and register shortage.

That seems a reasonable advantage. If it could take me from 60K tps to 100K tps per blade, it's a no-brainer. I doubt it's going to allow office/home applications to run noticeably quicker, but with a blade centre of 16 blades, I'll want to get my money's worth before needing to expand.

~35% memory boost is quite nice if you're running memory-bound multithreaded processes, each thread being relatively light on CPU% but using lots of memory. I run a webserver where one of the batch jobs is exactly that. A ~35% memory boost would be very close to a ~35% increase in throughput.

It's not just about "having enough RAM". While that certainly is a factor, it's not the only one. As you suggest, pretty much everyone has enough RAM to run just about any normal application with 64-bit pointers.

But if you want speed, you also have to pay attention to things like cache lines. 64-bit pointers often mean larger instruction encodings are needed to do the same work, and larger instructions mean more cache misses. This can be a large difference in performance.

He's right. If you mix x32 and amd64 binaries on the same system, then you need two copies of every shared library that they use to be mapped at the same time. And this means that every context switch between them is going to be pulling things into the i-cache that would already be present (assuming a physically-mapped cache, which is a pretty safe assumption these days) because the other process is using them.

This is why x32 doesn't make sense on a consumer platform like Ubuntu unless the entire system is compiled to use it, making the entire article a 'well, duh'. The real advantage of x32 is on custom deployments and embedded systems where you can build everything in x32 mode.

Oh, and on the subject of caches, x86 chips typically have 64 byte cache lines. If you make pointers 4 bytes instead of 8, then you can fit twice as many in a cache line, which is usually nice. It can be a problem for multithreaded applications though, because you may now end up with more contention in the cache coherency protocol.

ECC memory is artificially expensive. Were ECC standard as it ought to be, it would only cost about 12.5% more. (1 bit for every byte) That is a pittance when considering the cost of the machine and the value of one's data and time. It is disgusting that Intel uses this basic reliability feature to segment their products.

That's right. Unfortunately it's called the market. The same boneheads who say x32 isn't worth it are the same boneheads who have no idea how important ECC is, or how hard it is to properly code everything while worrying about cache hits. Probably people who never wrote a single line of C or assembly code.

But the Intel way of making the same physical hardware cost 50% more (with a simple on/off switch) will continue until ARM Cortex starts giving Intel some real competition (at least with the latest chips).

You've not understood this correctly. x32 is an enhancement and optimization, primarily for performance, for executables that do not require gigabytes of RAM. It has nothing to do with the availability or lack of RAM in the system, or how much RAM costs to buy at the computer store.

With x32 you get:
- 16 registers instead of 8. This allows much more efficient code to be generated, because register pressure is reduced and you don't have to dump/reload automatic variables to the stack.
- A crossover from the 64-bit ABI where the first 6 arguments are passed in registers instead of pushed/popped on the stack.
- If you need a 64-bit arithmetic op (e.g. long long), the compiler will generate a single 64-bit instruction (vs. using multiple 32-bit ops).
- The RIP-relative addressing mode, which works great when a lot of dynamic relocation of the program occurs (e.g. .so files).

You get all these things [and more] if you port your program to 64 bit. But, porting to 64 bit requires that you go through the entire code base and find all the places where you said:
    int x = ptr1 - ptr2;
instead of:
    long x = ptr1 - ptr2;
Or, you put a long into a struct that gets sent across a socket. You'd need to convert those to ints. Etc...

Granted, these should be cleaned up with abstract typedefs, but porting a large legacy 32-bit codebase to 64 bit may not be worth it [at least in the short term]. A port to x32 is pretty much just a recompile. You get [most of] the performance improvement for little hassle.

It also solves the 2038 problem, because time_t is now defined to be 64 bits, even in 32-bit mode. Likewise, in struct timeval, the tv_sec field is 64 bits.

The C standard does not guarantee that sizeof(long) is as big as sizeof(void*). The type that you want is intptr_t (or ptrdiff_t for differences between pointers). If you've gone through replacing everything with long, then good luck getting your code to run on win64 (where long is 4 bytes).

Having smaller data structures is much better for the small 64-byte cache lines of modern CPUs.

If your data structure includes pointers that you actually use, then you are randomly accessing memory anyway. If you aren't using those pointers, then I suggest 0-sized pointers, which are compatible with x64.

Some people don't see the ABI as being worthwhile when it still requires 64-bit processors

There's your answer. If I'm writing a program that won't need over 2GB, the decision is obvious: target x86. How many developers even know about x32? Of those, how many need what it offers? That little fraction will be the number of users.


Wait, what are you talking about? "target x86" Wat? Are you writing code in Assembly? How do you target C or higher-level code for x86 vs. x86-64, or ARM for that matter?

Ooooh, wait, you're one of those proprietary Linux software developers? Protip: 1's and 0's are in infinite supply, so Economics 101 says they have zero price regardless of cost to create. What's scarce is your ability to create new configurations of bits -- new source code -- not the bits. Just like a mechanic, home builder, burg

True. But for the vast majority of applications, that greater number of registers only translates into a small performance increase. I can potentially see x32 being useful for a rather small amount of heavily hand-optimized code (e.g. a massively optimized math or physics library), but for the vast majority of applications this performance benefit will be tiny.

To me, the real problem for the adoption of x32 is that so few programs on PCs need to worry that much about optimization. When it does become worthwhile…

I do not see many cases where this would be useful. If we have a 64-bit processor and a 64-bit operating system, then it seems the only benefit to running a 32-bit binary is that it uses a slightly smaller amount of memory. Chances are that is a very small difference in memory used. Maybe the program loads a little faster, but is it a measurable, consistent amount? For most practical use case scenarios it does not look like this technology would be useful enough to justify compiling a new package. Now, if the process worked with 64-bit binaries and could automatically (and safely) decrease pointer size on 64-bit binaries, then it might be worthwhile. But I'm not going to re-build an application just for smaller pointers.

You misunderstand the desired impact. "Loads a little faster" doesn't really enter into it. It's rather that system memory is _slow_, and you have to cram a lot of stuff into CPU cache for things to work quickly. That's where the smaller pointers help, with some workloads. Especially if you're doing a lot of pointer-heavy data structure computing, where you often compile your own stuff to run anyway.

Still not saying it's necessarily worth the maintenance hassle, but let's understand the issues first.

The main benefit is that it runs faster. 64-bit pointers take up twice the space in caches, and especially L1 cache is very space-limited. Loading and storing them also takes twice the bandwidth to main memory.

So for code with lots of complex data types (as opposed to big arrays of floating point data), that still has to run fast, it makes sense. I imagine the Linux kernel developers' No. 1 benchmark of compiling the kernel would run noticeably faster with gcc in x32.

So for code with lots of complex data types (as opposed to big arrays of floating point data), that still has to run fast, it makes sense.

Well, here's the problem. Code that is that performance-sensitive can often benefit a whole lot more from a better design that does not have so many pointers pointing to itty-bitty data bits. (For instance, instead of a binary tree, a B-tree with nodes that are at least a couple of cache lines, or maybe even a whole page, wide.) There are very, very few problems that actually require that a significant portion of data memory be occupied by pointers. There are lots and lots of them where the most convenient representation just happens to be pointer-heavy.

64-bit pointers take up twice the space in caches, and especially L1 cache is very space-limited.

L1 cache is typically 64KB, which is room for 8K 64-bit pointers or 16K 32-bit pointers. Now riddle me this: if you are following thousands or more pointers, what are the chances that your access pattern is at all cache-friendly?

The chance is virtually zero.

Of course, not all of the data is pointers, but that actually doesn't help the argument. The smaller the percentage of the cache that is pointers, the less important their size actually is; after all, when 0% are pointers, pointer size cannot matter at all.

Simple. It is just as fast. Takes less drive space. Uses less memory. As to rebuilding apps, it should be just a simple recompile, and yes, while memory is cheap, it is not always available even today. What about x86 tablets on Atom? I mean, really, does ls need to be 64-bit? What about more?

Any application that does heavy numerical computation should not be affected much by the ABI, if at all. All function calls are inlined inside the critical loop.

The ABI here also defines the size of all pointers. All pointers are 32-bit here. Any purely compute intensive application will not be affected much, but something including some complexity in data structures, with pointers, could possibly benefit a lot. On the other hand, if all your code does is traversing trees, you should seriously consider allocating them in one bunch and using internal indices (of smaller integer type) rather than native pointers anyway.

Number crunching rarely involves any pointers in the critical parts; the only exception I can think of is sparse matrices, which are actually usually done with fixed-size indexes rather than pointers. Game engines, however, probably have a lot of trees of pointers for their scene graph, so they could be affected. But if they're well-optimized, they're designed so that each level fits exactly inside a cache line, and changing the size of the pointers will mess that up.

The maintainer(s) find it interesting, and they're developing it on their own dime... so I don't get the hate in some of these first few posts. No one's forcing you to use it, or even to think about it when you're coding something else.

The company I work for compiles almost all programs with 32 bits on x86-64 CPUs. It's not only cheap RAM usage; it's also expensive cache which is wasted with 64-bit pointers and 64-bit ints. Since 3 GB is much more than our programs are using, x86-64 would be foolish. I'm eagerly waiting for an x32 SuSE version.

I don't get it. x86-64 doubles the general-purpose and SSE registers over x86. This alone makes a (usually quite big) difference even for programs that don't use 64-bit arithmetic. The point of the x32 ABI, as I understand it, is to keep that advantage without having 64-bit pointers. But you just compile with 32 bits, losing all the advantages of x86-64?

The idea makes sense in theory. Build binaries that are going to be smaller (32-bit binaries have smaller pointers compared with 64-bit) and faster (because the code is smaller, in theory cache should be used more efficiently and accesses to external memory should be reduced).

But I suspect the problem is that the benefits simply outweigh the inconvenience of having to run with an entirely separate ABI. I doubt the average significant C program spends a lot of time doing direct addressing, and as such I suspect the size benefits of using 32-bit pointers are overstated.

But I suspect the problem is that the benefits simply outweigh the inconvenience of having to run with an entirely separate ABI.

Well; if the benefits outweigh the inconvenience --- then it seems x32 should be catching on more than it is.

Personally I think it is a bad idea because of the 4GB program virtual address space limit, which applications will be frequently exceeding, especially the server applications that would otherwise benefit the most from optimization.

Personally I think it is a bad idea because of the 4GB program virtual address space limit, which applications will be frequently exceeding, especially the server applications that would otherwise benefit the most from optimization.

You're making an assumption that the 4GB limit is prohibitive. For some applications it could be: databases and scientific processing, and definitely games. But there are plenty of other applications that won't really benefit from the enlarged address space. Would a word processor ever need more than 4GB?

and faster (because the code is smaller, in theory cache should be used more efficiently

Your skill is not enough. When you blow registers onto the stack, the code crawls. x86-64 has more registers; code compiled for it is far faster than x86 because of the extra registers. The L1 cache is how big on your CPU? Is your binary MEGABYTES in size? If your code is jumping all over the digital universe generating cache misses, then you're purposefully doing something more idiotic than this universe should care about.

It depends on the delta. There are still many 32bit problems out there, and there are plenty of cases where having extra performance helps. If you have enough of the right size problems you could even reduce the number of systems that you would need.

It looks like it could allow packing a single system tighter with less wasted resources.

Reducing the footprint of individual programs could also have some benefits from system performance / management, especially in tight resource situations.

There's plenty of applications around still without a 64 bit binary. From what I understand this layer just allows 32 bit programs to utilize some performance enhancing features of 64 bit architecture. It seems a genuinely good idea.

There's plenty of applications around still without a 64 bit binary. From what I understand this layer just allows 32 bit programs to utilize some performance enhancing features of 64 bit architecture. It seems a genuinely good idea.

It allows 32-bit programs, which are *recompiled*, to benefit from those features. You still need the source and x32 builds of all dependencies. However, sometimes I guess there could be porting issues due to pointer size assumptions (but no other hard assumptions of x86 ABI behavior). Those codebases could not be recompiled for x64, but might port to x32 more easily.

x32 would have been nice as the first transition away from x86-32, but memory needs keep increasing, and we are far too used to full 64-bit spaces. In fact, it feels like we're finally over with the 32-64 bit transition, and people no longer worry about different kinds of x86 when buying new hardware. So introducing this alternative is a needless complication. As others have pointed out, it's too special a niche to warrant its own ABI.

It's not a complication, it's an enhancement. A majority of software does not need a 64-bit address space and can thus be streamlined while still getting the benefits of doing fast 64-bit integer math, among other things. Obviously you just select the target when compiling and that's that, it's like enabling an optimization, so what are you talking about?

The kernel needs to be an amd64 one for x32 to work, at least as things stand now. The most common situation would _probably_ be an amd64 system with some specialist x32 software doing performance intensive stuff. (Or possibly a hobbyist system running an all-x32 userspace for the hack value.)

Yeah, working with big data is unlikely to benefit, and data _is_ generally getting bigger.

Of course the OS is still 64-bit in that regard, it's just the address space of that particular application which is reduced to 32-bit to streamline it. The majority of all executable files do not require several gigabytes of RAM, hence it makes sense to streamline their address space.

That's why I mentioned the memory-hungry algorithms. Many applications are doing it these days. Needless to mention that Java these days is started almost exclusively with "-d64".

The market for a 4GB address space is really small, because modern general programming practices disregard resources in general, and RAM in particular. (The number of CPUs being the most disregarded of all.)

I do some alternative OS development. When I set up a program to run, there are 3 different 64-bit modes (programming models) for me to select to run the program under: ILP64, LLP64, and LP64. In ILP64 you get 64-bit ints, longs, long longs, and pointers. In LLP64 you get 32-bit longs and ints, and 64-bit long longs and pointers. In LP64 you get 32-bit ints, and 64-bit longs, long longs, and pointers. Note: all these pointers are 64 bit (but the hardware may have fewer bits than this; the OS will query it, and code must handle that).

Funny thing I notice in articles of this sort. There are always comments saying it's dumb because there is no point in optimising software for performance because hardware is so cheap. And there are comments like yours, complaining that having to do a recompile to achieve it is too big a burden.

Do you see the tension between the thoughts? Because if hardware is so cheap that it is more reasonable to tell the user to upgrade his computer rather than optimise your software, then does it not follow that the same reasoning makes a mere recompile a trivial burden?

Think Atom processors running Android, or High-performance computing applications. Neither of these require a huge external ecosystem, but if you get a 30-40% boost in some workload, they are worth it. It's my understanding that small-cache Atoms benefit from this more than huge Xeons.

This sure feels a lot like a throwback to the old 16-bit DOS days, where you had small/medium/large memory models depending on the size of your code and data address spaces. We've already got 32-bit mode for supporting pure 32-bit apps and 64-bit mode for pure 64-bit; supporting yet a third ABI is just going to result in more bloat as all the runtime libraries need to be duplicated for yet another combination of code/data pointer size.

I hate to say this since I'm sure a lot of smart people put significant effort into this, but it seems like a solution in search of a problem. RAM is cheap, and the performance advantage of using 32-bit pointers is typically small.

I understand it is the same beast as the COMPAT_NETBSD32 [netbsd.org] option that has been available in NetBSD for 15 years now. It works amazingly well: one can throw a 64 bit kernel on a 32 bit userland and it just works, except for a few binaries that rely on ioctl(2) on some special device to cooperate with the kernel.

NetBSD even had a COMPAT_LINUX32 [netbsd.org] option for 7 years, which enables running a 32-bit Linux binary on a 64-bit NetBSD kernel. Of course the Linux ABI is a fast-moving target, and one often misses the latest additions.

The idea is that you use the 32-bit pointer model, with 32-bit indirect instructions, but you're doing it all using the x86-64 instruction set. I.e., the task is in 64-bit mode. The 64-bit mode primarily includes more registers, so you can write / compile tighter code.

The stuff you described is for running 32-bit binaries that use the i386/i486/i586 instruction set, complete with the limited set of temporary registers. x86-64 has many more registers to use.

It's possible to have a system with 16GB that uses only x32 (the kernel is still x86_64 under x32, so the kernel can see the 16GB), for instance running thousands of tasks using up to 4GB each just fine. Plus, the page cache is a kernel thing, so the I/O cache can always use all memory.

On the other hand, there are workloads that run on a 4GB system but that need x86_64 (mmapping of huge files, for instance), and some boneheaded tasks reserve tons of never-used RAM: a task could actually use 1GB of RAM but reserve 8GB. The issue there really should be putting the coder in jail, but I digress.

But the vast majority of Linux workloads today that use even an 8GB system would run just fine under x32. Like 95-98%. And nobody is even suggesting a mainstream Linux distro without an x86_64 userland. I'm suggesting all standard tools use x32, but keeping the x86_64 shared libraries and compilers, so if you need to, you could use some apps with full 64-bit capability. Just use x32 by default.

Plus it's a good way to remind lazy developers that no matter how cheap RAM is, you should be serious about being efficient (especially the KDE developers)! KDE functionality is great, but they really have no clue about efficiency (RAM and CPU).

Humm, I'm running Firefox/Chrome with 3GB total system RAM just fine: dozens of tabs, Flash, Java, you name it, many pages with hundreds of JPEGs open. The maximum virtual memory space for those jobs doesn't even get to 1GB. I'm a MySQL/pgsql/Progress DBA, and the only case I've seen that would require x86_64 is a customer with 6 Progress databases, where a single local client attaches to all 6 DBs, requiring over 4GB of address space. All other cases don't come even close; every JVM I've ever seen maxed out at 1.3GB. Again…

So for me the answer is no. The whole thing reminds me of doing ARM assembler with Thumb code mixed in. If you have a very specific usage for it then yes, it would certainly be useful, but it's going to be up to the people who need it to actually use and improve it. Everyone else has no need to care, and the average developer shouldn't *need* to care or even be aware of it.

I would not go that far, since I'm sure a special case may exist, but that's exactly what it would be for. Hence 'no massive wide-scale adoption' and no 'applications written for this' become a (what should be) obvious outcome.

If I'm custom Joe and see a workload that benefits from 32- vs. 64-bit OS constraints, I load a 32-bit OS. The reason we went to larger memory, however, means those special cases are extremely rare today. They happen more because "we can't get new hardware" than by choice.

Not many NEGs are using 64 bit processors, and this ABI offers too little advantage to bother with. Most embedded systems run a single primary process. If that process fits in a 4GB address space (as is required to use this ABI), then the system would just use a native 32 bit ABI on a 32 bit CPU, not this 32 bit ABI on a more expensive 64 bit CPU.

I could get into specifics but I shan't, because what you're blathering about has zero relevance for x32. It's not a replacement-to-be for the usual amd64 ABI, nobody is going to break amd64 to make x32 run. It's mostly a specialist tool for specific workloads (aside from being a hacker's playground, as are many things). Whether thinking it's useful as such is misguided or not, you're more so.

I can recompile and run 20 year old SunOS apps no problem with OpenSolaris. Try that with Linux?

Depends on what it's looking for, but in theory it should work. 20 years? CLI- or GUI-based? It probably wants Tcl/Tk and/or Motif if it's GUI; make sure they're installed. I'm willing to try, if you have source code that old...

Hairyfeet mentioned he tried linux and people kept calling back angry that their printer stopped working after an Ubuntu update.

I did not even know it existed! I will keep Linux in a VM, I suppose, but only CentOS, as Red Hat at least makes ABIs that do not break after each freaking update!

If you need stability then you should go with a stable OS. Fedora, OpenSuSE, and Ubuntu change too fast for enterprise use - which is what makes RHEL great.

With that said, I don't seem to have issues running some older software I have laying around for Linux. Oracle Database 8 installs…

We went to x86-64 for three reasons: 64-bit integer registers, more integer registers, and 64-bit pointers. Some applications need only the first two of these three, which is why x32 is supposed to exist.

Eventually, I assume that all binaries which don't need 64-bit addressing (which will probably always be more than 90% of them) will switch to this ABI since having access to the extended register set without the overhead of all the bus bandwidth and cache space lost to store lots of zeroes is a HUGE win with zero cost.

Uh, no.

Really, no.

It's just not going to happen.

90+% of applications are not CPU-intensive, so they don't give a crap. 90% of the other applications that are CPU-intensive would benefit far more from removing pointer accesses than from making the pointers half the size. Only the remaining 1% are going to go through the hassle of dicking around with a complete second set of libraries on their system just so they can halve the size of their pointers.