Date: Sun, 14 Oct 2012 06:04:00 +0400
From: Solar Designer <solar@...nwall.com>
To: crypt-dev@...ts.openwall.com
Subject: ROM or memory port hardness
Colin, all -
Here's an idea that I am going to explore further and I'd appreciate
your input on.
First, let me define the problem:
The task is password hashing on servers dedicated to authentication (or
maybe even to password hashing alone), at a large enough org to go for
this approach anyway.
At low durations (e.g. the 1 ms to 100 ms range as needed for user
authentication at such orgs), we face two problems with scrypt:
1. Somewhat low memory usage: acceptable at 100 ms, way too low at 1 ms.
2. Poor scalability on multi-CPU (and multi-core) systems.
Moreover, there's a trade-off between these two: we can improve memory
usage slightly (up to ~2x) at the cost of even worse scalability, or we
can improve scalability (no hard limit) at the cost of even lower memory
usage. (We access this trade-off by hacking the Salsa20 rounds count in
scrypt, so we deviate from the official scrypt when we do this.)
Here's a possible solution (with its own drawbacks, though):
Pre-fill a large amount of RAM (would be tens of gigabytes currently)
with reproducible yet not too quick to recompute pseudo-random data.
This may take e.g. a minute upon system bootup. There should be no
trivial patterns like V[j] = f(V[j-1]), to defeat an efficient TMTO that
we'd otherwise have. As far as further processing is concerned, we have
a fairly large ROM filled with pseudo-random data.
At password hash computation, use a loop similar to scrypt smix's
second loop, up to a certain number of iterations (chosen for the
desired duration, which can then be arbitrary). This loop will access a
small percentage of the ROM array elements only, but as long as the set
of elements it will need to access for a given {password, salt} is not
predictable in advance (without going through the same sequential
computation), the entire ROM needs to be quickly accessible in order for
hash computation to complete quickly.
This allows for parallel computation of hashes for many {password, salt}
pairs with only one instance of the ROM. This is both good and bad.
It is good for authentication servers: we fully use our RAM size for
defense regardless of the current request rate (be it 1 or 1000 requests
in a given second). We have perfect scalability to multiple CPU cores:
we only need to increase the amount of processing between memory
accesses to be such that we stay just below saturating the memory
bandwidth when all CPU cores are in use. However, obviously an attacker
has the same advantages, and more: a custom or otherwise more suitable
hardware setup would have a larger number of ports to the same ROM size
(and more cores too). So this is not exactly a memory-hard algorithm;
it's more like a memory-port-hard one.
While asymptotically this is less secure than scrypt, I think that at
realistic parameter values for user authentication on current server
hardware and given likely attackers' hardware (when we're talking
password hashing and not KDF for data encryption), it is actually a lot
more secure.
scrypt at 100 ms is currently limited to using about 64 MB (either as
two instances using 32 MB each, or one hacked instance using 64 MB due
to lower Salsa20 rounds count). With the ROM approach above, we can
make it e.g. 64 GB. Yes, it allows for parallelism, including for
attackers, but if we consider ASICs, it'd take 1024 times more memory
ports to get at the same area*time/candidate as scrypt's, and that's not
counting area consumed by the memory ports themselves. OK, those don't
have to be ports to the same large ROM - instead, separate smaller ROM
banks may be used, assuming that bank conflicts will be fairly rare -
but nevertheless the area*time/candidate estimate vs. scrypt's holds.
An attack would be to use a smaller RAM holding only a portion of the
large ROM at a time (e.g., one half of it). When we have a large number
of candidate passwords to test, we may proceed with computation for each
as long as we have the needed data in RAM. When we don't, computation
for this candidate gets stalled (but others can proceed). Then we
switch to the next ROM window (fill our RAM with the next subset of
data, e.g. read from slower external storage) and proceed with
previously-stalled candidates. And so on. This is a trade-off: our
cost is in RAM needed to store the in-flight candidates (OK, they might
be quickly recomputable) and per-candidate state (not recomputable
without swapped-out portions of the ROM).
A similar attack may work with a botnet, if typical nodes don't possess
enough RAM to compute these hashes without trade-offs on their own. The
ROM table may be split across several nodes, but then the trade-off is
not only in extra RAM needed to store the per-candidate state, but also
in need for a lot of communication between the nodes.
Thus, I am suggesting to use this with ROM table sizes that are likely
to exceed a typical botnet node's RAM size for a few years to come -
e.g., currently this might be 48 GB to 256 GB RAM in a new server vs. up
to 16 GB RAM in a typical end-user's computer. (Supporting ROM table
sizes that are not powers of 2 might be desirable to get closer to the
full RAM size actually installed in a server, yet leave a little bit of
room for the OS/software/buffers/caches - e.g., use 46 out of 48 GB.)
I also have thoughts on making use of CPU caches for defense: we could
use L3 and L2 in scrypt-like manner, and L1 in bcrypt-like manner (take
advantage of being able to do narrow lookups efficiently), along with
the large ROM approach above (scrypt on L2 caches is too weak on its
own, as we know from Litecoin, yet it may be a reasonable addition if we
can have it for free). This will increase the cost for attackers who go
for parallelism levels beyond what we use for defense. Ideally, to
avoid excessive complexity, we'd come up with a way to use similar code
for all 4 levels - e.g., merely accessing them at different iterations
of the loop. This is tricky. If we fail to make this simple enough,
we should use only the ROM or maybe 2 layers (ROM and one cache level).
Any comments?
Alexander