256 Bit Arithmetic

Large integer types are very useful. They are used in cryptography, where larger types give exponentially more possibilities and thus more security. Another use is fixed-point arithmetic: floating-point numbers have intrinsic inaccuracy, whereas large bit-depth fixed-point arithmetic allows expressions to be evaluated without the risk of truncation or rounding errors. Large integers can also serve as the primitive base in arbitrary-precision arithmetic, where the larger the integer, the lower the overhead.

gcc has built-in 128 bit integer types on 64 bit machines. These can be accessed via __uint128_t for unsigned and __int128_t for signed 128 bit types. They are useful because their use exposes the 64×64 bit ⇒ 128 bit multiplication assembly instruction, which would otherwise have to be composed from a series of 32×32 bit ⇒ 64 bit multiplications.

Unfortunately, the gcc optimizer does not handle these types very efficiently. For example, consider two simple routines that add and multiply a pair of 128 bit values.
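The original listings are missing from this copy; a minimal sketch of what two such routines plausibly looked like (the names add128 and mul128 are assumptions) is:

```c
#include <stdint.h>

typedef __uint128_t u128;

/* Hypothetical reconstructions of the two simple routines.  Under the
 * x86-64 ABI a u128 argument or result lives in a register pair, with
 * rdx holding the high half - the source of gcc 4.3's troubles below. */
u128 add128(u128 x, u128 y)
{
	return x + y;
}

u128 mul128(u128 x, u128 y)
{
	return x * y;
}
```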

Compiling these with gcc version 4.3 produces absolutely horrible code. The problem here is that the rdx register is defined by the ABI as an input register for part of the y variable, and also as the high part of the output. This confuses the gcc register allocator, and a large number of extra register-register copies are generated. In addition, this causes stack spills, which makes the code even worse.

This code is noticeably better. However, it still contains obvious extra instructions. To get an optimal implementation, we will need to move to assembly language. Normally, this would best be done via gcc inline asm. Unfortunately, in this case that doesn't work as well as we'd like. The problem is that gcc defines the "A" register constraint to mean rax and rdx. We would like to use this combination as an output. However, we need rdx as an input to match the ABI, and the overlapping register description is an error. A possible workaround is to use the "A" constraint as an input/output register. However, this causes the compiler to want to add an extra initialization of rax. Another workaround is to specify rax and rdx separately, and construct the returned 128 bit type from the results. This doesn't work either: the compiler again creates the extra register-register moves we are trying to avoid. Thus moving to assembly is the only option. The result is:

For the data layout of our 256 bit unsigned integer we use four 64 bit words stored least-significant first. This matches the endianness of the machine's other integer types, and also allows easy access to the component 64 bit and 128 bit parts we will use in the following. Implementing the addition and multiplication operations is relatively easy using the built-in 128 bit type to handle the carries.
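The struct definition and C routines were lost from this copy; a plausible sketch (the names u256, add256 and mul256 are assumptions, and the original was presumably fully unrolled rather than looped) is:

```c
#include <stdint.h>

typedef __uint128_t u128;

/* Assumed layout: four 64 bit words, least-significant first. */
typedef struct {
	uint64_t v[4];
} u256;

/* 256 bit addition, using the 128 bit type to propagate carries. */
static u256 add256(u256 x, u256 y)
{
	u256 r;
	u128 s = 0;

	for (int i = 0; i < 4; i++) {
		s += (u128) x.v[i] + y.v[i];
		r.v[i] = (uint64_t) s;
		s >>= 64;
	}

	return r;
}

/* Dumb schoolbook n^2 multiply, keeping only the low 256 bits. */
static u256 mul256(u256 x, u256 y)
{
	u256 r = {{0, 0, 0, 0}};

	for (int i = 0; i < 4; i++) {
		u128 carry = 0;

		for (int j = 0; i + j < 4; j++) {
			u128 t = (u128) x.v[i] * y.v[j] + r.v[i + j] + carry;
			r.v[i + j] = (uint64_t) t;
			carry = t >> 64;
		}
	}

	return r;
}
```

Unrolled, the truncation to the low 256 bits is what gives the multiplication count quoted below: word pairs with i + j < 3 need the full 128 bit product, while those with i + j = 3 only need the low 64 bits.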

The multiply routine uses, as its comment describes, a dumb manual n^2 multiplication algorithm. Expanding the 128 bit multiplications, this contains a total of six double-precision and four single-precision multiplications. Can we do better?

There are several multiplication algorithms which are asymptotically better than brute-force n^2 multiplication. The first of these is Karatsuba multiplication. This replaces four multiplications by three via the trick:

(x1 + t x2) × (y1 + t y2) = A + t((x1 + x2)(y1 + y2) - A - B) + t^2 B,

where A = x1 y1, B = x2 y2, x and y are the two numbers to be multiplied, and t is the base in which they are expanded.
This can be done recursively, halving the base at each step, yielding a total of nine multiplications for our problem. That is one fewer than the ten counted above. However, these multiplications all need to be double precision. This slows the algorithm down, and the single saved multiplication doesn't help.
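The trick is easiest to see one level down. A sketch of a 64×64 bit ⇒ 128 bit multiply built from three 32×32 bit multiplies (karatsuba64 is a made-up name; the products are held in 128 bits because the sums x1 + x2 and y1 + y2 are 33 bit quantities) might be:

```c
#include <stdint.h>

typedef __uint128_t u128;

/* One Karatsuba level: three 32x32 bit multiplies replace four. */
static u128 karatsuba64(uint64_t x, uint64_t y)
{
	uint64_t x1 = (uint32_t) x, x2 = x >> 32;
	uint64_t y1 = (uint32_t) y, y2 = y >> 32;

	u128 A = (u128) x1 * y1;	/* Low partial product */
	u128 B = (u128) x2 * y2;	/* High partial product */

	/* Middle term from a single extra multiplication */
	u128 M = (u128) (x1 + x2) * (y1 + y2) - A - B;

	return A + (M << 32) + (B << 64);
}
```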

Another method of multiplying high-precision numbers is called Toom-Cook multiplication. This works by noticing that we are multiplying polynomials in t, the base. A polynomial of maximal order n can be described by its value at n+1 points; i.e. if you know the value of a quadratic at three points, then you can derive its functional form. This means we can use point-wise multiplication to evaluate the polynomial product. For a multiplication of n×n components (here we have four 64 bit components making up the 256 bit total) we require 2n-1 points, and thus 2n-1 multiplications. For our problem, this yields 7 multiplications. Unfortunately, this isn't the whole story. To go from the point values back to the polynomial expansion form, which is our wanted result, we need to do a few more mathematical operations. Unfortunately, for the case given here, we need to divide by 6 or 3. This is most efficiently done as yet another multiplication, giving a gross total of 8 double-precision multiplications. Again, this is slower than the dumb brute-force method.
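The point-value idea, and the troublesome division in the interpolation step, can be seen in miniature by multiplying two linear polynomials x1 + x2 t and y1 + y2 t via evaluations at t = 0, 1 and -1. This sketch (toom2_poly is a made-up name for the illustration) works on small signed coefficients only, with no carry handling attempted:

```c
#include <stdint.h>

/* Point-wise multiplication of two linear polynomials.  Even this
 * smallest case needs a division (by 2) in the interpolation step;
 * the larger Toom-Cook variants need divisions by 3 or 6. */
static void toom2_poly(int64_t x1, int64_t x2, int64_t y1, int64_t y2,
		       int64_t z[3])
{
	int64_t r0 = x1 * y1;			/* p(0) q(0)   */
	int64_t r1 = (x1 + x2) * (y1 + y2);	/* p(1) q(1)   */
	int64_t rm = (x1 - x2) * (y1 - y2);	/* p(-1) q(-1) */

	z[0] = r0;			/* x1 y1         */
	z[1] = (r1 - rm) / 2;		/* x1 y2 + x2 y1 */
	z[2] = (r1 + rm) / 2 - r0;	/* x2 y2         */
}
```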

Why are the advanced methods slower? The reason is that we are doing a 256 bit × 256 bit ⇒ 256 bit multiplication. The full result is 512 bits in length, and the more advanced methods calculate that full result. If the brute force method were used to compute all 512 bits, it would require 16 double-precision multiplications; only requiring the half-sized result saves us quite a bit. Can the advanced methods be modified to give just the lower bits we require? Unfortunately, there is no simple modification of Toom-Cook to do what we want. In the best case, we might save a single multiplication, but this isn't enough to help.

The Karatsuba algorithm is another story. Here it is indeed possible to modify it to lower the number of required multiplications significantly. Expanding the terms as

x = x1 + t x2 + t^2 x3 + t^3 x4,
y = y1 + t y2 + t^2 y3 + t^3 y4, and
z = x × y = z1 + t z2 + t^2 z3 + t^3 z4 (mod t^4),

we can rewrite the multiplication as

z1 = x1 y1
z2 = (x1 + x2)(y1 + y2) - x1 y1 - x2 y2
z3 = (x1 + x3)(y1 + y3) - x1 y1 - x3 y3 + x2 y2
z4 = (x2 + x3)(y2 + y3) - x2 y2 - x3 y3 + x1 y4 + x4 y1

If the values calculated multiple times are reused, we require a total of 8 multiplies: five double precision and three single precision. This is fewer than the brute force method, so it may just be faster. Unfortunately, we cannot use the algorithm shown above as-is. The problem is overflow: the additions force us to multiply 65 bit numbers together. This can be fixed by using subtractions instead. Of any pair of numbers, one must be greater than or equal to the other, so we can order each subtraction so that the result always fits in 64 bits, tracking the sign separately. For one set of orderings we might have:

z1 = x1 y1
z2 = (x1 - x2)(y2 - y1) + x1 y1 + x2 y2
z3 = (x1 - x3)(y3 - y1) + x1 y1 + x2 y2 + x3 y3
z4 = (x2 - x3)(y3 - y2) + x2 y2 + x3 y3 + x1 y4 + x4 y1

To do the 65 bit multiply, we need to check the orderings of the terms:
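The asm listing is missing from this copy; a C sketch of what such a routine computes (mul65 and its sign-magnitude interface are assumptions here — the real code packs the result differently for carry propagation) is:

```c
#include <stdint.h>

typedef __uint128_t u128;

/* Hypothetical C equivalent of the 65 bit multiply: computes
 * |x1 - x2| * |y2 - y1| and reports the sign of (x1 - x2)(y2 - y1)
 * separately.  The asm version described in the article is branchless;
 * the conditionals here typically compile down to cmov instructions. */
static u128 mul65(uint64_t x1, uint64_t x2, uint64_t y1, uint64_t y2,
		  int *neg)
{
	uint64_t dx = x1 >= x2 ? x1 - x2 : x2 - x1;
	uint64_t dy = y2 >= y1 ? y2 - y1 : y1 - y2;

	/* The product is negative exactly when one difference is. */
	*neg = (x1 < x2) ^ (y2 < y1);

	return (u128) dx * dy;
}
```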

The asm version uses a relation between signed and unsigned multiplication, together with the "sbb" trick, to make the function branchless. It can be further simplified in the case where the carry (borrow) isn't required:

The normal method takes 10.9 seconds to evaluate 2^28 multiplications, and the pseudo-Karatsuba algorithm takes 8.8 seconds (using gcc 4.4, which generates the better code). This is approximately the 8/10 speedup predicted. Now, how much faster can we make them go if we use full assembly versions? The pseudo-Karatsuba algorithm is rather complex, so it is rather long in machine code:

Running the multipliers again 2^28 times gives a time of 5.5 seconds for the assembly-optimized pseudo-Karatsuba algorithm. However, doing the same for the brute force method yields a time of 4.0 seconds. Thus the dumb brute force method is faster overall. Why is this? The reason is that the more complex algorithm requires many more additions and subtractions. The time taken for these adds up, and overwhelms the small advantage of fewer multiplies. The end result is code which is more than twice as fast as the original C code. The reason we can speed it up so much is that even in version 4.4, gcc does not generate very good code for 128 bit types. Hopefully version 4.5 will be better.

The horrible captcha has been updated. For posterity, the evil old one used to look like:

The new one should be a little easier for humans to read, and a bit harder for machines.

Comments

Jeff said...

Wow, hardest captcha ever.

I would expect that branch mispredictions on the conditionals would obliterate any benefit of the pseudo-Karatsuba version, if you used unpredictable inputs.

When you did your timings, did you go through integers sequentially (or just use the same inputs over and over), or were they somewhat random?

captchabot said...

your captcha sucks. any bot can apply a few image manipulation function to get the letters into matchable shapes. making it hard to read for humans != making it hard to read for computers.

a good way to think about this is photoshop. if you can make the letters readable by throwing a few photoshop filters at it, then so can a bot.

use a mixture of fonts (big, small, serif, non-serif, outline, italics, etc) instead of those lame dots. combine that with a little bit of random warping, and your captcha will be 100x harder to break, while still being human readable.

Mike said...

Jeff, the good news is that if you can pass the captcha, you know you're not color-blind!

Jaime said...

I was planning on commenting on the topic of the article until I saw the captcha. Then, I forgot what I was going to write. Now I'm just left with this: "WOWZA! WTH where you people thinking!"

The captcha totally stole the spotlight from your article! :)

YATroll said...

Good Lord, that captcha sucks balls. Shit, dude, where did you get this piece of crap? I mean, seriously, wasn't reCAPTCHA free and stuff?

timmey said...

LOL, how everyone is talking about the captcha!

sfuerst said...

I've updated the captcha. It should be a little better now. The old one was something thrown together in a few hours. The idea was to have something unique... the more common a captcha is, the more effort spammers spend to try to crack it. Of course security through stupidity is never a good idea.

Jeff:
The pseudo-Karatsuba code benchmark results use the branchless asm version of the 65 bit multiplier. The only "branch" is the cmove instruction.

The benchmark calculated x(n)=x(n-1)*x(n-2), starting with the initial x's being odd numbers. This tests a wide range of numbers, not just adjacent ones, and is faster than using a random number generator.

trayres@gmail.com said...

This article made me decide to go mess with assembly. Thanks for this - very very cool.

Anonymous said...

What about 256bit division?

said...

Really interesting article! How can you wrap the assembly code given in the examples, for example mul65d, such that it is callable from within the C code as shown in mul256c?

said...

Enter your comments here

Nor said...

I've been writing and reading a fair amount of AMD64 assembly code over the past 3 years, and I must say this zero extension to 64 bits is quite useful. If I remember correctly AMD chose to do it this way for performance and simplicity: compilers/developers don't have to explicitly reset the high 32 bits (because zeroing them out is what you want to do most of the time).

Tejas said...

Personally I think it should go into the trash. Because I am sick of this. Especially when<a href="http://ylpsvfym.com"> celibritees</a> and politicians go on about it. Good God, these are people who live in huge houses, has private jets and yachts, entourage of cars and they say the average citizen is to blame for their little global warming? It is all absurd especially if you truly look into it. Some of them know they lost and because of this now want to make it a crime to deny global warming. Look it up. It is all there. And it is all absurd. There is a reason why global warming enthusiasts never debate a real scientist but instead talk to people who either believe it or know nothing of Earth Science 101.

Jazlyn said...

It's great to find <a href="http://jjlgpg.com">sonomee</a> so on the ball

anon said...

Spam/junk comments above from "" (2nd), "Tejas", and "Jazlyn".

Maybe the captcha is too easy for bots? Ironic, since it's taken me 3 tries.