Ideas for a New Elliptic Curve Cryptography Library

Why Create a New Elliptic Curve Cryptography Library?

Commonly used crypto libraries like NSS and OpenSSL implement ECDH and
ECDSA signing and verification for a large number of curves over prime and
non-prime fields. In order to support such a wide variety of curves, these libraries internally
have frameworks that abstract the general algorithms from the curve-specific
details. These frameworks make sense when a wide variety of curves and
algorithms need to be supported, but they add significant complexity that
would be unneeded for software that only needs to deal with a small number
of well-known curves, such as the NIST P-256 and P-384 curves and Curve25519.

OpenSSL exposes its internal framework as a public API
for applications to use, so changing the framework risks breaking
applications built on top of it. This makes it difficult to make
improvements to the framework. Consequently, the curve-specific optimized
implementations in OpenSSL generally bypass the lower-level parts of that
framework, so each of them carries its own very similar, but separate, copy of
the generic logic. Thus, for example, one set of optimized
implementations of the P-224, P-256, and P-521 curves in OpenSSL has ~1,700,
~2,300, and ~2,100 lines, respectively, of quite similar code.

With today’s C compilers, some key components need to be implemented
in assembly language to give reasonable performance and to prevent side channel leaks
of secret information. Assemblers usually have
macros and other high-level features that make assembly-language programming
convenient. However, some assemblers do not
have such features, and the ones that do all have different syntaxes.
OpenSSL has its own assembly language macro framework called perlasm to deal with this.
Every OpenSSL assembly language source file is actually a Perl program
that generates the assembly language file. The result is several large files of
interleaved Perl and assembly language code. The fastest P-256 implementation
in OpenSSL, ecp_nistz256,
consists of ~6,000 lines of x64 perlasm + ~1,700 lines of x86 perlasm + ~1,500 lines
of ARMv8 perlasm + ~1,700 lines of ARMv4 perlasm + ~1,500 lines of C code plus a
~150KB table of data.

Imagine that we must find a high-performance implementation of
P-256 and P-384 (the two curves generally used in TLS) ECDSA signature
verification and verify its correctness. We might not find a fast
enough P-384 implementation, so we might be tempted to adapt an existing
fast P-256 implementation for P-384. How long would the audit and adaptation
process take? How confident would we be in the result? How hard would it
be for somebody else to meaningfully confirm our findings and review any
changes we make? It seems likely that the process would take a lot of time,
that we would end up with something we could not be very confident in, and
that the result would be hard to improve later.

The fundamental problem isn't the sheer amount of code in
the optimized implementations of ECC, or that the techniques they use are bad.
Rather, once the simple ideas behind the implementations were translated into
low-level C and assembly language code, with multiple layers of repetition, it
became difficult to map the code back to the abstractions it was derived from.
Many individually reasonable trades of ease of understanding by people for
ease of processing by computers quickly accumulated into one large and
unreasonable such trade.

The specialized implementations of ECC in OpenSSL are optimized for
ECDH key agreement and ECDSA signing, so they use constant-time algorithms to avoid
timing attacks. However, ECDSA signature verification does not need to be
constant time, and the variable-time WNAF scalar elliptic curve point
multiplication algorithm is significantly more efficient than constant-time
algorithms. The constant-time ecp_nistz256 code in OpenSSL is much faster than
OpenSSL’s variable-time P-256 code, but a WNAF-based variable-time point
multiplication built on ecp_nistz256’s implementation of the P-256 field
operations should be significantly faster still (on average).

The IETF is currently standardizing new ECC cryptosystems based on two new curves,
Curve25519 and Ed448-Goldilocks. It seems likely that Curve25519 will be critically
important for a variety of reasons; for example, it seems like it will be a
key part of the bridge connecting the new Internet of Things (IoT) to the
normal Internet of Nothings. If we were to implement Curve25519 and
Ed448-Goldilocks using the same strategies that have been used for P-256 in
the past, we would end up compounding the maintenance problem we already
face.

Curve25519 and EdDSA (or whatever signature scheme ultimately gets
standardized) will take a while to become ubiquitous. This makes it tempting
to try to use P-256 and ECDSA as part of the bridge between IoT and
the rest of the Internet in the short term. To have even a chance of
succeeding at that, such an implementation would have to be very efficient in
terms of execution time and code size. It would also have to be easy to create
efficient ports of the code to highly-constrained (slow, low memory,
and power-constrained) architectures.

Goals

With those motivations in mind, it is worth exploring whether it makes
sense to create a new ECC implementation with the following goals:

We must be confident in the correctness of what we created,
preferably with the assistance of automated tools for correctness
checking.

In high-end smartphones, tablets, and computers, P-256 and
P-384 signature verification should take one millisecond or less,
so that all the signature verifications needed for a TLS handshake can complete
in under 10ms and the CPU cost of the handshake does not contribute noticeably
to latency. On more constrained devices, we will have to settle for “as
efficient as the code size budget allows,” for now.

The code and any precomputed static data should be as compact as possible.
In constrained applications the allowed code size is a parameter that can
vary considerably, so it is hard to put a specific figure on this. However,
informally we will aim to create something that is smaller than the ECC
portions of other implementations like BoringSSL, OpenSSL, and NSS.

It should be straightforward to reuse large amounts of the implementation of
P-256 and P-384 to implement Curve25519 and maybe other curves.

Without significantly sacrificing signature verification performance,
the code for the arithmetic should be easy to reuse (preferably) or adapt
(less preferably) for side-channel resistant implementation of signing and
ECDH key agreement.

An Implementation Strategy

ECDSA signature verification consists of parsing inputs, doing an inversion
and some multiplications modulo n (the group order), doing two scalar elliptic curve
point multiplications, and then verifying that the result of the multiplication
matches a value in the signature. We precompute a bunch of sums of the
curve’s group generator point G and a bunch of sums of the
public key, which requires multiplication, squaring, and inversion
modulo q (the field modulus). We compute the WNAF representations of the scalars of
the two points, and then perform a pattern of point doublings and point additions
based on the pattern in the WNAF representation. The point doublings and
additions are specific sequences of multiplications, additions, subtractions,
and squarings modulo q. Multiplication and squaring modulo q
are, by far, the most performance-sensitive operations because they are executed
many times in a scalar point multiplication. In order to do all these
arithmetic operations, we generally need to precompute some quantities that are
functions of n or q and the word size of the machine.
ECDSA signing and ECDH key agreement more-or-less involve the same kinds of
operations, except they need everything to be constant time.
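
Concretely, the ECDSA verification computation described above is: given a
message hash e, a signature (r, s) with r and s in [1, n-1], and a public
key Q,

    u1 = e * s^-1 mod n
    u2 = r * s^-1 mod n
    (x, y) = u1*G + u2*Q

and the signature is valid if and only if x mod n equals r.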

One way to minimize code size, at the cost of speed, is to implement all the
algorithms generically. For example, there would be one modular multiplication
function that takes two multiplicands and the modulus as inputs, one modular
inversion function that takes the element to invert and the modulus as input,
etc. We would probably find that such generic implementations work well enough
in many cases, especially for the operations modulo n that are not
frequently executed, so that we can avoid writing curve-specific code for many of them.
If we start with generic implementations of every algorithm, then we can use
them as the foundation for a data-driven code-profile-optimize development loop
that ensures that we are always optimizing the most impactful part of the
implementation, without guessing. Having the generic implementations of everything
available would also give us flexibility in making size vs. speed tradeoffs later
in the development process.
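
For example, a generic modular addition might look like the following sketch
(the names and conventions are illustrative, not from any existing library;
the code is variable-time, which is only appropriate for public values such
as the inputs to signature verification):

    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t Limb;

    /* 1 if a >= b as n-limb little-endian integers, else 0. */
    static int geq(const Limb *a, const Limb *b, size_t n) {
      while (n-- > 0) {
        if (a[n] != b[n]) {
          return a[n] > b[n];
        }
      }
      return 1;
    }

    /* r = (a + b) mod m, for a and b already reduced mod m. The modulus
       is a parameter, so this one function serves every curve, for
       operations modulo n as well as modulo q. */
    void mod_add(Limb *r, const Limb *a, const Limb *b,
                 const Limb *m, size_t n) {
      Limb carry = 0;
      for (size_t i = 0; i < n; i++) {
        Limb t = a[i] + carry;                /* add with carry in */
        Limb c = (Limb)(t < carry);
        r[i] = t + b[i];
        carry = c | (Limb)(r[i] < t);         /* carry out */
      }
      if (carry || geq(r, m, n)) {            /* reduce: subtract m once */
        Limb borrow = 0;
        for (size_t i = 0; i < n; i++) {
          Limb t = r[i] - borrow;
          Limb b2 = (Limb)(r[i] < borrow);
          borrow = b2 | (Limb)(t < m[i]);
          r[i] = t - m[i];
        }
      }
    }

mod_mul, mod_inv, and the rest would follow the same shape, taking the
modulus and limb count as parameters.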

Besides trying to share code across curves, we could also minimize object
code size by moving work from runtime to build time.
For example, instead of parsing the string representation of q into
an array of limbs at runtime, we can parse the value of q during the
build and generate a static const array definition containing the limbs.
This way, we save not only the cost of the call to the parsing
function, but we can also omit the code for the parsing function altogether.
More generally, we should try to do everything that can be done at build
time at build time.
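
For example, the generated code for the P-256 field modulus on a 64-bit
target would look something like this (the array name is illustrative, using
the Limb type from the earlier sketch; the limb values are the standard P-256
prime in little-endian order):

    /* Generated from q = 2^256 - 2^224 + 2^192 + 2^96 - 1 by the build
       system; do not edit. */
    static const Limb Q_LIMBS[4] = {
      0xffffffffffffffff, 0x00000000ffffffff,
      0x0000000000000000, 0xffffffff00000001,
    };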

Also, we should avoid doing things manually that can be done automatically by
the build system. For example, in order to do
Montgomery multiplication, we have to calculate and save the values of the
constants k and r2. Rather than
calculating these values manually and hard-coding them into the program or code
generator, the code generator should calculate them from the basic
inputs. This would have a much better documentation effect than
hard-coding the values because it guarantees that the formulas and the
results match. Further, adding this small extra complexity to the code
generator would make it much easier to expand the implementation to
support additional curves and platforms.
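
For example, the generator can derive the Montgomery constant
k = -q^-1 mod 2^64 with a few lines of code instead of a hard-coded magic
number (a sketch, assuming 64-bit limbs; r2 = (2^(64*num_limbs))^2 mod q
would be computed similarly, with bignum arithmetic):

    #include <stdint.h>

    /* k = -q^-1 mod 2^64, where q0 is the least-significant limb of the
       (odd) modulus q. Hensel/Newton lifting: q0 is its own inverse
       mod 2^3, and each iteration doubles the number of correct
       low-order bits (3 -> 6 -> 12 -> 24 -> 48 -> 96 >= 64). */
    uint64_t mont_k(uint64_t q0) {
      uint64_t inv = q0;
      for (int i = 0; i < 5; i++) {
        inv *= 2 - q0 * inv;  /* one Newton step, mod 2^64 */
      }
      return 0 - inv;         /* negate mod 2^64 */
    }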

An ECC Compiler

We can take the ideas in the preceding section further and think about the
code generator as a compiler for an ECC domain-specific language (DSL).
The terms of the language would be field elements and scalars,
its statements would be a series of assignments of the results of
field operations (additions, subtractions, multiplications, and squarings),
and its runtime library would consist of the
curve-agnostic implementations of the various algorithms plus any
curve-specific functions that we hand code.

There are already several programs written
in this language in the Explicit-Formulas
Database. For example,
here is one of its point-doubling programs (dbl-2001-b) which, given a point
(X1, Y1, Z1) in Jacobian coordinates,
calculates the double (X3, Y3, Z3) of that point, alongside a very simple
translation to a C function that calls other functions to perform
the actual field operations:
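
    delta = Z1^2
    gamma = Y1^2
    beta = X1*gamma
    alpha = 3*(X1-delta)*(X1+delta)
    X3 = alpha^2-8*beta
    Z3 = (Y1+Z1)^2-gamma-delta
    Y3 = alpha*(4*beta-X3)-8*gamma^2

The translation below is a sketch: the Elem type and the elem_* helpers are
illustrative names for things the runtime library would provide, and the
helpers are assumed to tolerate aliased operands.

    /* Point doubling, dbl-2001-b: (X3, Y3, Z3) = 2*(X1, Y1, Z1). */
    void point_double(Elem X3, Elem Y3, Elem Z3,
                      const Elem X1, const Elem Y1, const Elem Z1) {
      Elem delta, gamma, beta, alpha, t;

      elem_sqr(delta, Z1);           /* delta = Z1^2 */
      elem_sqr(gamma, Y1);           /* gamma = Y1^2 */
      elem_mul(beta, X1, gamma);     /* beta = X1*gamma */

      elem_sub(t, X1, delta);
      elem_add(alpha, X1, delta);
      elem_mul(alpha, t, alpha);
      elem_mul_by_3(alpha, alpha);   /* alpha = 3*(X1-delta)*(X1+delta) */

      elem_sqr(X3, alpha);
      elem_mul_by_8(t, beta);
      elem_sub(X3, X3, t);           /* X3 = alpha^2-8*beta */

      elem_add(Z3, Y1, Z1);
      elem_sqr(Z3, Z3);
      elem_sub(Z3, Z3, gamma);
      elem_sub(Z3, Z3, delta);       /* Z3 = (Y1+Z1)^2-gamma-delta */

      elem_mul_by_4(t, beta);
      elem_sub(t, t, X3);
      elem_mul(Y3, alpha, t);
      elem_sqr(gamma, gamma);        /* gamma is dead; reuse it */
      elem_mul_by_8(gamma, gamma);
      elem_sub(Y3, Y3, gamma);       /* Y3 = alpha*(4*beta-X3)-8*gamma^2 */
    }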

By implementing a compiler, all the classic optimization techniques
used in compilers would be available to us. For example, notice in the C code
above that all the multiplications by constants were translated to calls
of functions that are optimized for multiplication by that constant; this
is a basic form of classic strength reduction. As another example, the
compiler could reorder the statements in order to
optimize the program for the out-of-order execution behavior of a
particular processor. In particular, the compiler could put effort into
ensuring, whenever possible, that a statement that stores a value into
a variable x never directly precedes a statement that reads the
value of x. Or, if our
compiler were to generate assembly language code, then we might implement a
custom calling convention where the result of a calculation is always stored
in a specific set of registers and where the inputs to a function are
always stored in those same registers, to avoid otherwise-unnecessary
load/stores. Our compiler might even transform the program into
continuation-passing style (replacing ret + call
with jmp) or do other things that would be easy for a
compiler to do, but would be error-prone to do (or verify) correctly
by hand.
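
For example, the reordering optimization might transform a fragment like this
(a hypothetical case; t3 does not depend on t1, so it can be hoisted between
the write and the read of t1):

    Before:
      t1 = X1 - delta
      t2 = t1 * gamma
      t3 = Y1 + Z1

    After:
      t1 = X1 - delta
      t3 = Y1 + Z1
      t2 = t1 * gamma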

More interestingly, our compiler could go beyond classic
compiler techniques. It seems hard to justify the amount of effort that
would go into building a sophisticated optimizing compiler for our language.
Instead, we might implement a dumb randomized compiler that, given a
seed value, configures flags that control various simple optimizations
like reordering and various attempts at strength reduction. Then we could
operate our compiler in two modes: profiling mode and production build mode.
The profiling mode would generate a random seed, generate the code,
compile it, run a program that measures the code’s performance, and compare
the execution time to the previously-known best execution time. Then we would
just let it run until we run out of development time, take the best seed
the program output, update the build system so that
the new seed is used for production builds, and say we’re done. We
could call our ridiculous compiler the
“super dumb compiler”
and hope that people confuse it for a superoptimizing compiler. To scale this
incredibly inefficient optimization strategy, we could
mint a new cryptocurrency, Superdumbcompilercoin, where we trick
people into running the super dumb compiler on their computers,
cross-certifying each others’ results, and sending us increasingly better
optimization seeds in return for plausible delusions of becoming rich.
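
Concretely, the profiling driver might be as simple as this sketch (the
./ecc_compile code generator and the bench measurement program are
hypothetical):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
      srand((unsigned)time(NULL));
      double best = 1e300;  /* best (lowest) time seen so far, in ns */
      for (;;) {
        unsigned seed = (unsigned)rand();
        char cmd[160];
        snprintf(cmd, sizeof(cmd),
                 "./ecc_compile --seed=%u -o candidate.S && "
                 "cc -O2 bench.c candidate.S -o bench", seed);
        if (system(cmd) != 0) {
          continue;  /* skip seeds whose output fails to build */
        }
        FILE *p = popen("./bench", "r");  /* bench prints ns/op */
        if (p != NULL) {
          double ns;
          if (fscanf(p, "%lf", &ns) == 1 && ns < best) {
            best = ns;
            printf("new best: --seed=%u at %.1f ns/op\n", seed, ns);
          }
          pclose(p);
        }
      }
    }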

Besides generic compiler and superoptimization techniques that apply to
all languages, the compiler could also assist in the implementation of
ECC-specific optimizations. For example, it could discover, or at least
assist in the implementation of, cases where we can coalesce reductions
modulo q after each field operation. As another example,
the compiler could choose completely different formulas based on the
cost of multiplications vs. the cost of squarings and/or the cost
of multiplications vs. the cost of inversions.
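
For example (an illustrative fragment): after an addition, the conditional
subtraction of q can be skipped when the consumer of the sum tolerates
unreduced inputs:

    t = a + b    (result in [0, 2q); the subtraction of q is skipped)
    u = t * c    (a Montgomery multiplication with inputs < 2q still
                  produces a result < 2q whenever q < R/4, where
                  R = 2^(64 * num_limbs))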

The more complex the compiler becomes, the more important it would be to
automate the verification of the validity of the optimizations. We could do
this by splitting our compiler into two parts: the optimizer and the back-end.
The optimizer would take programs in the three-operand statement language
as input, and output an optimized form of the program in a similar,
lower-level, intermediate language. The back-end would then transform this
intermediate language into assembly language (and/or C) code. If we design
the intermediate language carefully, we should be able to make it easy for
proof checkers or computer algebra systems to process it and verify that
the intermediate program is correct, much more easily than if we were to
try to formally prove that equivalent C or assembly language code is
correct. It is possible that the back-end could have a flaw that
causes it to generate bad code even for good inputs, but we can minimize
the risk of that happening by keeping our back-end simple; eventually we
could try to formally prove the compiler’s correctness.
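
For example, the lowered form of the start of the doubling program might look
something like this (a hypothetical syntax); a computer algebra system can
symbolically evaluate each three-operand statement and check that the outputs
equal the polynomials of the source program:

    sqr   t0, Z1        # t0 = Z1^2                   (delta)
    sqr   t1, Y1        # t1 = Y1^2                   (gamma)
    mul   t2, X1, t1    # t2 = X1*Y1^2                (beta)
    sub   t3, X1, t0
    add   t4, X1, t0
    mul   t5, t3, t4
    mul3  t6, t5        # t6 = 3*(X1-Z1^2)*(X1+Z1^2)  (alpha)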

Optimized Implementation of Field Operations

The implementations of the P-256 field operations in OpenSSL’s ecp_nistz256
are very impressive. They are very fast and are based on smart ideas.
However, I have found the code—11,000 lines of Perl mixed with
assembly language—very difficult to audit. Although
it is a lot of code, it consists of a few basic constructs repeated many times.
It would be great to find a way to implement the same techniques as the
ecp_nistz256 code, but in a way that is adaptable to other curves and easier
to understand.

We could implement another code generator to generate optimized implementations
of field operations from more abstract, but still low-level, expressions of them,
like qhasm does. However, it seems
likely that we will find that optimal implementations of field operations
for different architectures and sub-architectures are going to be so different that
it would not be worthwhile to find a way to abstract away architecture-specific bits
into portable code. Instead, we might pick a single cross-platform assembler,
like Yasm, and try to express the abstract patterns primarily using its macro
language, hoping that the consistency of using the same macro language would
make it easier to maintain the various platform-specific implementations. Assembler
macros work quite similarly to how C++’s templates work
(particularly how C++ templates worked before C++11) and so are surprisingly
powerful. It seems like we would need to make at least a few implementations in
“native” assembler before we could understand the costs and benefits
of such abstractions.

There are aspects of the implementation of field operations that
cannot be handled by a macro assembler in a way that is easy to understand. For
example, when implementing Montgomery multiplication we need to multiply a single
limb by the field modulus, and the instruction sequence—not just
the operands of the operations, but the sequence of operations itself—depends on
the value of the modulus. Again, it will not be clear how automating the generation of
such code would help us until a few implementations have been made by hand.
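
For example, assuming 64-bit limbs, the P-256 field modulus has especially
convenient limbs:

    q = 2^256 - 2^224 + 2^192 + 2^96 - 1
      = { 0xffffffffffffffff, 0x00000000ffffffff,
          0x0000000000000000, 0xffffffff00000001 }    (little-endian)

In t * q, the partial product for the zero limb is skipped entirely, and
t * 0xffffffffffffffff can be computed as (t << 64) - t, a shift and a
subtraction instead of a multiplication. A different modulus calls for a
different sequence of instructions.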

One of the major design issues is the way the large integers are split into limbs.
For example, should a 32-bit implementation use full 32-bit limbs to minimize the
number of multiplications, partially-filled 29-bit limbs to minimize adds-with-carry,
or something else? I suspect it will be best to use full limbs in most
constrained implementations, because on constrained processors the cost of
multiplication is high relative to addition-with-carry and such processors do
not try to be so clever with respect to out-of-order execution. Again,
it is hard to say without implementing some prototypes.
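
As a concrete point of comparison (illustrative figures, for P-256 on a
32-bit machine):

    #include <stdint.h>

    /* 8 full 32-bit limbs: a schoolbook multiplication needs
       8*8 = 64 32x32->64 multiplications, and nearly every partial
       product requires an add-with-carry. */
    typedef uint32_t FullLimbElem[8];

    /* 9 partially-filled 29-bit limbs: 9*9 = 81 multiplications, but
       each 58-bit product leaves 6 spare bits in a 64-bit accumulator,
       so dozens of products can be summed before any carry handling
       is needed. */
    typedef uint32_t PackedLimbElem[9];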

Dynamic Compilation and Self-Modifying Code

On x86, the most efficient way to implement modular multiplication
depends on whether the processor implements SSE2, MULX, and/or the ADCX/ADOX
instruction set extensions and/or the processor family/generation (Atom vs.
Core i7, Sandy Bridge vs. Haswell, etc.). Similarly on ARM the best
implementation depends on what instructions (NEON, in particular) are
available and whether the chip was designed for in-order or out-of-order
execution. Shipping a binary with every curve optimized in each possible way
would result in a lot of bloat. It may be better, instead, to ship the compiler
as part of the library, and then have the compiler generate just the one
implementation of each curve that is optimized for the processor that it
is running on. It would likely be hard to make the compiler small enough for this
to be a win, though.

On a constrained device, one could generate the executable code
for a curve, run the code, and then throw the generated code away to make room
for other code (e.g. the generated implementation of another curve). Or, perhaps
we could ensure that the code generated for each different curve is similar enough
that we could implement an efficient patching mechanism that would allow us to
patch the implementation of one curve to morph it into the implementation of
another curve, and then use another patch to get back to the implementation of
the initial curve. Of course, this would be completely crazy unless it were
automated by the compiler (and maybe even then).

Smart Caching of Intermediate Values

Although the cost/benefit ratio of a runtime compiler is questionable,
there are many more straightforward optimizations we can do, based on the same
idea of being flexible about how we distribute work between the compiler and
the runtime system. Normally we need to precompute a bunch of
additions of the curve’s group generator. If we were to do that at build time,
then we would bloat the size of the executable by a significant amount. (In ecp_nistz256,
the size of the static table is about 150KB; in libsecp256k1 that table would be over
a megabyte.) If we were to do that at runtime, then
we would need to do the precomputation at a time when we can afford to compute
a multiplicative inverse modulo q, which is quite expensive. Instead of
doing either one of those, we could precompute just that inverse at build time and
then do the rest of the precomputation at runtime.

A similar consideration applies
to the precomputation of points related to public keys; if we have a public key cache,
we should consider having that cache store the result of the inversion that is
needed for the precomputation.

More generally, higher-level code that makes use of ECC primitives is often not very
efficient about caching intermediate results. For example, when verifying a TLS server
certificate chain, it is common to need to use each P-384 certificate public key twice:
once to verify the signature of a certificate, and again to verify the signature on
an OCSP response for that certificate. It may
be quite worthwhile to cache all the precomputed points for the last few public keys
used during certificate validation. In particular, if we know that we are
going to use a given public key twice then we should precompute more points than if
we expect to only use the public key once.

The Dangers of Error Checking

When verifying an ECDSA signature, we have to validate that the inputs are within
acceptable ranges. We could do this range checking when we parse them, or we
could do the range checking before the computations that depend on the values
being within range, or we could implement redundant checks. It is natural to
expect that more error checking is better, but OpenSSL demonstrates
that bugs frequently lie in error handling code paths. Also, when the same
error condition is checked multiple times, the first check usually makes the
redundant checks dead code, so we would have to leave the code implementing
the redundant checks untested or we would have to jump through hoops to write
tests that reach the secondary checks.

The special case of failure to allocate memory (e.g. malloc returning
NULL) is particularly terrible, because this kind of edge case is almost
never thoroughly tested. Frequently, developers on open source projects are
apathetic about handling and testing these cases because they have configured
their own applications to abort on memory allocation failure, so any bugs in
the error handling will not affect them. This apathy leaves everybody else who
does not use the same strategy at risk.

It seems it would be better, then, to do all range checking, memory allocation, and other
possibly-failing actions as early as possible, before we start calculating any
results. Then all of our low-level calculation functions would become infallible.
And then our mid-level calculation functions would become infallible, since they
would not have to worry about error cases from the low-level functions. This
pattern would result in a recursive simplification up to the top-level functions. It
should also make the compiler simpler because the compiler would not have to generate any
code to do error checking, and it should make our correctness proving tasks easier
since there would be many fewer code paths to prove correct.

Since all the low-level functions would have important preconditions that they do not
check, it would be critically important that those preconditions be clearly documented.
The C assert macro could be used for this. assert’s
primary advantage as a documentation tool is that it can be machine-checked to
a certain extent, so it is less likely to get out of sync with the code,
especially with good test coverage.
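
For example (a sketch with illustrative names and a stubbed-out body):

    #include <assert.h>
    #include <stdint.h>

    typedef struct { uint64_t limbs[4]; } Scalar;

    extern const Scalar N;  /* the group order n */
    int scalar_is_zero(const Scalar *a);
    int scalar_less_than(const Scalar *a, const Scalar *b);

    /* Infallible: the parsing layer has already rejected any s outside
       [1, n-1], so there are no error paths here. The asserts restate
       that precondition as machine-checkable documentation. */
    void scalar_inv(Scalar *r, const Scalar *s) {
      assert(!scalar_is_zero(s));
      assert(scalar_less_than(s, &N));
      /* ... e.g. Fermat inversion: r = s^(n-2) mod n ... */
    }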

What about the Algorithms?

It seems silly to build an implementation that uses a bunch of crazy optimization
techniques and hard-coded assembly language to implement obsolete algorithms. However,
a little experimentation has shown that ecp_nistz256 is implemented so efficiently that
it beats pretty much any implementation of better algorithms that is not very carefully
coded. Choosing and implementing better algorithms is much easier and has less risk of
failure than the process of figuring out all the optimization techniques that would be
needed regardless of what algorithms are chosen. The advantage of implementing the
low-level optimizations first is that we could benchmark against ecp_nistz256 in an
apples-vs-apples manner, so that we could measure the benefits of the low-level
optimizations distinctly from the benefits of the improved algorithms. The
compiler would help in deferring the choice of algorithms until later, because
it would make it easy to switch from commonly-used algorithms to new ones.

In any case, a description of the best algorithms to use, and why they are the best,
is worth considering separately. For now, the main points
to keep in mind with respect to algorithm choices are: (1) there are algorithms
that are not commonly used that seem to be significantly better than the ones used in
the most commonly-used open source software, and (2) in order to realize the gains
from those algorithmic improvements, the implementations of the better algorithms
have to be very good.

Conclusion

It is plausible that a DSL-based approach to implementing an ECC library would
have significant advantages with respect to attaining strong assurance of correctness,
ease of maintenance, and optimal performance. This is particularly likely to be
important for constrained platforms, where we are limited in our ability to trade
space for time and where we cannot rely on C compilers to generate efficient
enough code. It seems worth trying to build a prototype to see if such an
approach makes sense.