Together that would be 2232 operations, only 600 of which are
additions. But additions are the meat of cryptographic diffusion, and
they require many logic gates to implement.

On a CPU which does one operation per cycle, it takes 2232*32 =
71424 cycles to compute 32 SHA-256 outputs.

A logic gate implementation doesn't need to implement rotations and
shifts at all, as they merely reorder wires.
A full 32-bit adder requires 160 bit ops (a ripple-carry full adder
is 5 gates per bit: two XORs, two ANDs, one OR; 32 * 5 = 160), so
what's left is 600 * 160 + 960 * 32 = 126720.
(It can be reduced a bit further through constant propagation of the
zeroes that shifts feed in.)
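
For reference, here is where the 160 comes from -- a minimal
ripple-carry sketch built from bitwise gates only (my illustration,
not anyone's production code):

    # 32-bit addition from bitwise gates: 5 gate ops per bit, 160 total.
    # a and b are lists of 32 bits; in a bitsliced implementation each
    # "bit" would be a machine word carrying that bit of 32 instances.
    def add32(a, b):
        out, carry = [], 0
        for i in range(32):
            p = a[i] ^ b[i]          # XOR: propagate
            out.append(p ^ carry)    # XOR: sum bit
            g = a[i] & b[i]          # AND: generate
            t = p & carry            # AND
            carry = g | t            # OR: carry out
        return out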

So we have 71424 against 126720. That's worse, but not five times
worse, not even twice as bad.

Furthermore, some CPU architectures do not implement rightrotate
natively and have to emulate it through three operations (two shifts
and an OR). Those would require 3384*32 = 108288 operations, which is
pretty close.
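
That is the usual shift-shift-OR idiom; as a sketch (the mask is only
needed because Python integers are unbounded -- a C uint32_t wouldn't
need it):

    # Right rotation emulated with two shifts and an OR: three ops on
    # ISAs without a native rotate instruction.
    def rotr32(x, n):
        return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF
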
----

(This would be an NVIDIA GPU. One with a native ROTR is an AMD GPU,
which also implements a MUX instruction that can shave the cycle
count even further.)

So the number of operations is close, but the logic circuit
implementation allows additional optimizations. Even a 20% speedup
would make it competitive on an NVIDIA GPU, and a ~50% speedup -- on
an AMD GPU.

What optimizations are possible?

If we're doing a pre-image attack or bitcoin mining, we can fix the
values of some variables and vary only a few of them, say, 32 (the
size of the nonce to pick in bitcoin mining). Fixing 736 variables
out of 768 enables one to do a 'constant propagation' kind of
optimization.

This applies both to the logic circuit implementation and to a
traditional one, but the logic circuit one makes more use of it, as
it can optimize away parts of an XOR or ADD with a constant, while a
traditional implementation would just do the full operation.
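
A minimal constant-folding rule set (the gate representation here is
my own assumption, just to illustrate the idea):

    # Fold a 2-input gate when one input is a known constant 0/1;
    # otherwise keep it as a symbolic node. With 736 of 768 input bits
    # fixed, many early gates collapse to constants or wire renames.
    def fold(op, a, b):
        if op == 'AND':
            if a == 0 or b == 0: return 0
            if a == 1: return b
            if b == 1: return a
        elif op == 'OR':
            if a == 1 or b == 1: return 1
            if a == 0: return b
            if b == 0: return a
        elif op == 'XOR':
            if a == 0: return b
            if b == 0: return a
            if a == 1: return ('NOT', b)
            if b == 1: return ('NOT', a)
        return (op, a, b)   # both inputs unknown: emit the gate as-is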

Then, it's possible to use some symmetries to reduce the number of
computations required, say, via binary decision diagrams (BDDs). But
that works only for the first few rounds; after that the state is
fully diffused and BDDs will be basically equivalent to full truth
tables. You can't compress a random signal, you know. So the savings
aren't huge. (Although I don't yet know the point at which symmetry
is totally lost.)

I played with these and other ideas for a while, and it looks like
there's simply no way to compute SHA-256 significantly faster, i.e.
the optimizations are minuscule.

There is however one thing which bugged me... Let's say we're looking
for an input such that the output is all ones (it is easy to convert
any pre-image attack to this form: XNOR each output bit with the
corresponding bit of the target digest, so the result is all ones
exactly when the output matches the target), so we need to verify
that

O1 and O2 and O3 and ... and O32 = TRUE

Now imagine that the outputs O1, O2 .. O32 are represented in the
form of a BDD or a disjunctive normal form (DNF). As SHA-256 is
supposed to be indistinguishable from random, the DNF has about 2^31
members, and the BDD is huge. Or if you look at the truth table, it
is 50% ones and 50% zeros randomly mixed.

Now if you consider a conjunction (O1 and O2), it has only 25% ones,
and thus a smaller DNF (and BDD). Likewise, (O1 and O2 and O3) has
even fewer ones, and so on, until we are left with just a handful of
ones -- or none at all -- which are actually the solutions to our
problem.
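
The density drop is easy to see numerically (a throwaway check; the
2^20 sample size is arbitrary):

    # AND together k random 50%-dense bit vectors; the fraction of
    # surviving ones is about 2^-k.
    import random
    n = 1 << 20
    acc = (1 << n) - 1                       # start with all ones
    for k in range(1, 6):
        acc &= random.getrandbits(n)
        print(k, bin(acc).count('1') / n)    # ~0.5, 0.25, 0.125, ...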

So there is a great simplification at the last step; can we push it
down so the whole computation gets simpler?

If we are going to compute it precisely -- I don't think so.

But we do not need to be exact here. Just like in number theory
there are probabilistic primality tests, we only need to find inputs
where the output is POTENTIALLY true, with probability higher than
50% (higher = better), and then do a full check on them.

Alternatively, we can rule out inputs which are definitely false and
do a full check on the rest. If the first part can be done cheaply,
we'll have a speedup compared to the baseline.
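
The overall shape of that search, as a sketch (both predicates are
placeholders to be filled in):

    # One-sided filtering: cheap_filter may pass non-solutions through,
    # but must never reject a real solution (no false negatives).
    def search(candidates, cheap_filter, full_check):
        for x in candidates:
            if cheap_filter(x) and full_check(x):
                yield x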

One way to do this is to find some approximate truth table where a
zero is actually a zero, and a one is "maybe one, maybe zero" --
"unknown", in other words.

When the logic circuit is converted into a form with only AND and OR
gates (negations pushed down to the inputs), it is monotonic:
flipping one input to 1 might only flip output zeros to ones, but
never ones to zeros.

This means we can inject extra ones into the truth function at any
step (say, to make it more symmetric) and that would only inject some
fake ones into the output, but we can still trust the zeros.
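
Here is a minimal sketch of such an evaluation (the gate-list format
and the forced_ones hook are my own toy constructions; forced_ones
corresponds to the gate-replacement idea a few paragraphs below):

    # Evaluate a monotone AND/OR circuit where each wire holds a bitmask
    # over a batch of candidate inputs. Forcing a wire to all ones can
    # only add fake ones downstream; a 0 in the final mask is still a
    # guaranteed 0, so that candidate can be rejected without a full check.
    def eval_monotone(gates, literals, forced_ones=frozenset()):
        wires = dict(literals)        # name -> bitmask; includes both
                                      # inputs and their negations
        for name, op, a, b in gates:  # gates in topological order
            if name in forced_ones:
                wires[name] = -1      # all ones (Python's -1 is an
                                      # infinite string of 1 bits)
            elif op == 'AND':
                wires[name] = wires[a] & wires[b]
            else:                     # 'OR'
                wires[name] = wires[a] | wires[b]
        return wires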

My first attempt was to inject an extra one directly into an input
variable, i.e. to assume both X1 and NOT(X1) are true at once. But it
doesn't work very well -- the output is all ones. This happens
because this lossy representation injects 1/8 more ones on each AND
operation, so it quickly turns into all ones.

The next try was to replace one of the logic gates with a constant 1
without actually computing it. Then the output of the whole function
is correct on the input combinations where that gate's output was
indeed one, i.e. on 1/2 or 1/4 of all combinations. Otherwise it
might be true or false depending on luck.

So that's already some speedup, as there is no need to compute the
output of that gate (although a very small speedup), and if that
fudge doesn't propagate to the final output we still get lots of
zeros.

When we look at it from a truth-table point of view, essentially we
replace the truth table of one gate with all ones, leaving all others
intact. This is probably not the best way to do it. What if we inject
a few ones at different nodes as needed, i.e. to make the BDD more
symmetric? I.e. if we have a fork node in the BDD whose two branches
are

0101 0100

we can replace it with the single branch 0101, since 0101 = 0101 OR
0100, making both sides equal. And potentially we need to do fewer
operations when we have more equal nodes.
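
To make the merge concrete, here is a toy truth-table version (tables
as plain 0/1 lists; my sketch, not a real BDD package):

    # OR together the two cofactors on variable v. The result no longer
    # depends on v -- the BDD loses that level -- and it is an
    # over-approximation: it may gain ones but never lose them, so the
    # zeros stay trustworthy.
    def or_merge_cofactors(tt, v):
        n = len(tt)                  # n == 2**num_vars
        return [tt[i] | tt[i ^ (1 << v)] for i in range(n)]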

But this shit is fairly complex and I doubt it will work, so I'm
considering other options, say, working with DNFs.

In DNF, a conjunction with fewer variables produces more ones, so
maybe it makes sense to start with those to get at least some ones
cheaply?