Benchmark for Various Wallet Crypto Modes

Variations

3-bit pre-computed table - pre-compute a small table for the ECDH step (tx pub shared secret). Only useful when scanning multiple wallets at a time (e.g. the MyMonero case), because the table is built once per tx pub and reused for each wallet.
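A toy sketch of the table idea, not the benchmark's code: exponentiation modulo a prime stands in for ed25519 scalar multiplication (modular multiplication plays the role of point addition), and every name below is hypothetical. The 8-entry table is built once per tx pub, and each wallet's view key is then processed 3 bits at a time.

```python
# Toy stand-in: exponentiation modulo a prime replaces ed25519 scalar
# multiplication. All names here are hypothetical, and the table lookup
# is not constant-time.
P = 2**255 - 19  # a convenient prime modulus for the toy group
WINDOW_BITS = 3


def build_table(base):
    """Pre-compute base**0 .. base**7 (one entry per 3-bit window value)."""
    table = [1]
    for _ in range((1 << WINDOW_BITS) - 1):
        table.append(table[-1] * base % P)
    return table


def windowed_pow(table, scalar, scalar_bits=256):
    """Compute base**scalar using 3-bit windows and the pre-computed table."""
    n_windows = -(-scalar_bits // WINDOW_BITS)  # ceiling division
    acc = 1
    for i in reversed(range(n_windows)):
        # Three squarings shift the accumulator left by one window.
        for _ in range(WINDOW_BITS):
            acc = acc * acc % P
        window = (scalar >> (i * WINDOW_BITS)) & 0b111
        acc = acc * table[window] % P
    return acc


def scan_wallets(tx_pub, view_keys):
    """Build the table once per tx pub, then reuse it for every wallet."""
    table = build_table(tx_pub)
    return [windowed_pow(table, k) for k in view_keys]
```

With a single wallet the table costs about as much as it saves, which is why the variation only pays off when many view keys share the same tx pub.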

spend public de-compression - de-compress the spend public key from (y) to (x,y,z,t) exactly once. Currently this is done for every output scanned.

curve25519 shared secret - use curve25519 for the ECDH step (tx pub shared secret) instead of the ed25519 curve. This is faster in some cases because there are many optimized Montgomery ladder implementations for this curve.

Optimizations

ref10 - default implementation currently in use by monero-wallet-cli.

amd64-51-30k - ed25519 implementation from supercop, currently proposed in a PR for wallet scanning (i.e. compatible with current protocol).

amd64-64-24k - ed25519 implementation from supercop, currently proposed in a PR for wallet scanning (i.e. compatible with current protocol).

Running

The source code is on GitHub in a branch. Clone my repo, switch to this branch, create a build directory (anywhere) and then do cmake /PATH_TO_SOURCE/ -DCMAKE_BUILD_TYPE=Release && make wallet-crypto-bench. This should automatically add the amd64 specializations if targeting that architecture. If you have a newish processor, adding -DWALLET_CRYPTO=auto -DWALLET_CRYPTO_BENCH="amd64-51-sandy2x;amd64-64-sandy2x" will add optimizations requiring instructions added by the Sandy Bridge line of processors.

Observations

ECDH Step

The sandy2x curve25519 ECDH is 30% faster than the amd64-51-30k ed25519 monero ECDH ("monero" means multiplying by the cofactor AFTER the ECDH, whereas curve25519 uses scalar clamping with several bits of security lost). A small 3-bit table reduces the time by 15%, and if users were willing to trade more memory when scanning many wallets, it's likely that it would beat the sandy2x implementation (there is a reason why ed25519 exists separately). But for standard wallets, using a sandy2x curve25519 will likely remain faster.
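For readers unfamiliar with the clamping mentioned above, here is a small sketch of the X25519 convention as described in RFC 7748. It is not taken from the benchmark, and the function name is hypothetical. Clamping fixes five bits of the 32-byte scalar: the low three are cleared, so the scalar is always a multiple of the cofactor 8, and the top two are forced to 0 and 1.

```python
def clamp(scalar_bytes):
    """X25519-style clamping of a 32-byte little-endian scalar (RFC 7748)."""
    k = bytearray(scalar_bytes)
    k[0] &= 248   # clear the low 3 bits: scalar becomes a multiple of 8
    k[31] &= 127  # clear bit 255
    k[31] |= 64   # set bit 254
    return int.from_bytes(k, "little")


# Every clamped scalar lies in [2**254, 2**255) and is divisible by 8.
lowest = clamp(bytes(32))
highest = clamp(b"\xff" * 32)
```

The monero variant instead uses the full scalar for the multiplication and multiplies the result by 8 afterwards, so the two approaches handle the cofactor at different points in the computation.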

Tx Scanning

The ECDH step is only done once per transaction, whereas the spend public de-compression is currently done once per output. So once transactions average 3 outputs, the de-compression optimization saves more time than the curve25519 protocol variant. Since the two can be combined, the total speedup will be quite large. The de-compression optimization is more likely to make it into mainline since it does not require a protocol change.
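The break-even reasoning can be sketched as a tiny cost model. The numbers used here are placeholders in arbitrary time units, chosen only so the result lands on 3 outputs; they are not measurements from the benchmark, and the function names are hypothetical.

```python
def per_tx_savings(outputs, ecdh_saving, per_output_saving):
    """Return (curve25519 saving, de-compression saving) for one tx.

    ecdh_saving is gained once per transaction; per_output_saving is
    gained once for every output scanned.
    """
    return ecdh_saving, outputs * per_output_saving


def break_even_outputs(ecdh_saving, per_output_saving):
    """Outputs per tx at which both optimizations save the same time."""
    return ecdh_saving / per_output_saving


# Placeholder values, NOT benchmark results.
ECDH_SAVING = 3.0        # saved once per tx by the curve25519 variant
PER_OUTPUT_SAVING = 1.0  # saved per output by skipping de-compression
```

With these placeholder values `break_even_outputs(ECDH_SAVING, PER_OUTPUT_SAVING)` is 3.0; above that output count the per-output saving dominates, and below it the per-transaction saving does.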