Date: Thu, 9 Feb 2012 11:18:01 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: cryptmd5 optimizations
Hi Lukas -
On Wed, Feb 08, 2012 at 01:19:36AM +0100, Lukas Odzioba wrote:
> I am trying to optimize opencl and cuda cryptmd5 code,
Thank you for working on this!
> and I got to
> dead end from my perspective. I want to understand what optimizations
> were done in MD5_std.c, but so many #ifdef's are distracting me. From
> Alex i know that MD5_X2 should be off. My question is what else
> "flags" should be on/off (and what they mean) to make code easier to
> understand?
When you say "should", you mean to make the code easier to understand,
right? If so, it's probably MD5_X2 off, MD5_IMM on, MD5_ASM off.
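For illustration only, those settings might look like the following. This assumes the three flags are plain 0/1 preprocessor macros; where exactly they get defined in your build is not covered here.

```c
/* Sketch of the "easiest to read" configuration: an assumption about
 * how the flags are spelled, not a copy of any real header. */
#define MD5_X2  0	/* one hash at a time, no interleaving */
#define MD5_IMM 1	/* constants as immediate values, no array lookups */
#define MD5_ASM 0	/* keep the C code, do not use the .S file */
```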
MD5_X2 means compute two hashes at a time with mixed instructions, for
greater instruction-level parallelism. You get two inter-mixed
implementations of MD5 (and of the higher-level logic as well) when you
enable this.
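As a rough sketch of what "mixed instructions" means, here is one round-1 MD5 step written for a single hash, then for two independent hashes with the operations interleaved. This is not the code from MD5_std.c; rol32, md5_f, md5_state, step_x1 and step_x2 are names made up for the example.

```c
#include <stdint.h>

/* Hypothetical helper: rotate a 32-bit value left by s bits (0 < s < 32). */
static uint32_t rol32(uint32_t x, unsigned s)
{
	return (x << s) | (x >> (32 - s));
}

/* MD5 round-1 function F(x, y, z) = (x & y) | (~x & z), in its usual
 * shorter form. */
static uint32_t md5_f(uint32_t x, uint32_t y, uint32_t z)
{
	return z ^ (x & (y ^ z));
}

/* Hypothetical state holder for one MD5 computation. */
typedef struct {
	uint32_t a, b, c, d;
} md5_state;

/* One step for one hash: returns b + rol(a + F(b, c, d) + m + k, s). */
static uint32_t step_x1(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
    uint32_t m, uint32_t k, unsigned s)
{
	a += md5_f(b, c, d) + m + k;
	return rol32(a, s) + b;
}

/* The same step for two independent hashes (s0 and s1 must be distinct).
 * Each operation on hash 0 is followed by the same operation on hash 1,
 * so neighbouring instructions do not depend on each other and the CPU
 * can overlap them. */
static void step_x2(md5_state *s0, md5_state *s1,
    uint32_t m0, uint32_t m1, uint32_t k, unsigned s)
{
	uint32_t t0, t1;

	t0 = s0->c ^ s0->d;
	t1 = s1->c ^ s1->d;
	t0 &= s0->b;
	t1 &= s1->b;
	t0 ^= s0->d;
	t1 ^= s1->d;
	s0->a += t0 + m0 + k;
	s1->a += t1 + m1 + k;
	s0->a = rol32(s0->a, s) + s0->b;
	s1->a = rol32(s1->a, s) + s1->b;
}
```

step_x2() produces the same two results as two separate step_x1() calls; only the order of operations differs.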
MD5_IMM means use immediate values for the 32-bit constants. When you
disable this, you instead get array lookups. Immediate values work
better on x86 and x86-64 where instruction size is variable and 32-bit
immediate operands may be encoded right in ALU instructions. Array
lookups work better on RISC architectures where instruction size is
fixed (usually 32-bit), load instructions are separate from ALU
instructions, and 32-bit immediate operands do not fit. On such RISC
architectures, it'd take two load instructions to put one 32-bit
immediate value in a register. With array lookups, we reduce this to
one load instruction (assuming that the array start address is already
loaded in a register and the constant offset to a specific element is
small enough that it fits in the immediate operand field of just one
instruction). I do not know which of these is more suitable for GPU
architectures; my (limited) understanding is that on one hand you have
fixed instruction size (RISC or VLIW), but on the other you have a very
limited amount of fast memory (and even loads from fast memory might be
slower than loads of immediate operands from the instruction stream,
even if you have to use twice as many load instructions for the latter).
You'll need to research this and experiment with it. I won't be
surprised if different approaches turn out to be required for different GPUs.
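To make the two styles concrete, here is a minimal sketch with the same constant addition written both ways. The names sketch_k, add_k_imm and add_k_tab are hypothetical and do not come from MD5_std.c; the values are the first four MD5 round constants. An optimizing compiler may well turn one form into the other, so treat this as an illustration of the source-level difference and check the generated code on your target.

```c
#include <stdint.h>

/* Array-lookup style: the constants live in memory. With the table's
 * base address already in a register, each constant costs one load
 * with a small offset. */
static const uint32_t sketch_k[4] = {
	0xd76aa478u, 0xe8c7b756u, 0x242070dbu, 0xc1bdceeeu
};

/* Immediate style: the constant is written into the expression, so on
 * x86/x86-64 it can be encoded inside the ADD instruction itself. On a
 * fixed-width RISC ISA it typically takes two instructions to build
 * the 32-bit value in a register first. */
static uint32_t add_k_imm(uint32_t a, uint32_t m)
{
	return a + m + 0xe8c7b756u;
}

/* Lookup style: the same computation, with the constant fetched from
 * the table passed in as k. */
static uint32_t add_k_tab(const uint32_t *k, uint32_t a, uint32_t m)
{
	return a + m + k[1];
}
```

Both functions compute the same value, so which one is faster on a given GPU is something only measurement will tell you.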
MD5_ASM is obvious - it excludes some C code in favor of assembly code
from a .S file.
Alexander