Subscribe to this blog

Follow by Email

Accelerating cryptocurrency Mining with Intel ISPC

Daniel Lemire and Maxime Chevalier just had a great exchange on Twitter about the state of compilers and being able to automatically vectorize code, such as a scalar product. Of course, hoping the compiler can correctly vectorize your code can be a bit fragile, and as Maxime points out, writing raw intrinsics results in an unreadable pain in the editor, and a GPU-style implementation in CUDA or OpenCL might be more scalable and maintainable.

A few years ago, some folks at Intel wrote a compiler called ISPC, the Intel SPMD Program Compiler. A possibly unfairly simple way to describe ISPC is that it's OpenCL-style programming for x86 vector units. You can write code that looks like this:

export uniform float scalar_product(uniform float a[],

uniform float b[],

uniform int count) {

float a_times_b = 0.0;

foreach (i = 0 ... count) {

a_times_b += a[i] * b[i];

}

return reduce_add(a_times_b);

}

After compiling this appropriately with ISPC and linking it in, you'd have an appropriately-vectorized implementation of scalar product for the architecture(s) you specified. Pretty neat!

(If you can't tell, I'm a fan of ISPC, and I don't know for the life of me why it hasn't received more attention.)

At the end of 2013, sick in bed with bronchitis, I got interested in playing around with proof-of-work functions for cryptocurrencies. One was Momentum, for which I wrote and released the first publicly-available GPU miner. As a part of playing with Momentum, I decided to also write an optimized CPU miner. At the time, the fastest available one was from yvg1900, a (I believe) Ukraine-based developer who writes some of the most astoundingly optimized code I've seen. The momentum miner was the fastest available, but didn't seem as fast as it could be on Intel Haswell series processors, which support AVX-2 (256 bit integer vector support), so I decided to try to improve upon it.

The Momentum proof of work relies upon finding partial collisions in a large (2^26) set of hashes. To improve upon the miner, I used the same fast duplicate detection mechanism I'd devised for the GPU version. After implementing that, it turned out that the vast majority of time was then spent running the SHA-512 hash function, even after linking against Intel's highly tuned version of it. The Intel optimized SHA-512 didn't scale too well to AVX2: Going to SSE-optimization (128 bit vectors) gave it a big speedup, but doubling the vector width to 256 bits didn't come close to doubling the speed.

Fortunately, there was a solution at hand. In most cryptocurrencies, the hash function can be trivially parallelized, as the input is of the form: Hash(block stuff, nonce), where the nonce is usually an integer that's incremented for each turn. So you can compute many hashes in parallel -- this is one of the reasons GPUs and ASICs work so well for cryptocurrency mining.

I turned to ISPC to write an 8-way parallel SHA-512 implementation. It turns out to have been approximately trivial. I started with some GPU-style unrolled, very-inlined code that implemented SHA-512 on a restricted sized input, and then wrapped that for ISPC:

In this code, the sha512_block code is not parallelized -- it just implements a single hash at a time. I've lost some of my original source code for this, but the results were fast:

dga-sha512_ispc: 1.239s

intel-avx2: 2.853

The output of using ispc like this is probably not as fast as expert-optimized code. I've looked at some of the Intel hand-optimized cryptographic functions, and they're the work of wizards -- they expertly juggle between both the vector units and the scalar units to get maximum throughput out of both. But for the game I was playing a few years ago, I didn't have to -- I just had to write a fast miner sooner than anyone else did. ISPC was a great tool for doing so.

(For the record and for fairness, of course, I should note that yvg1900 later released a version of their AVX2 Momentum miner that was faster than mine, also using ISPC for some of the code generation, and also that there are many things that make software good -- the yvg1900 miners are robust and well written, and mine was a quick hack.)

Popular posts from this blog

I was just indulging in my favorite bad habit (reading hacker news), when I came across a discussion about prestige in faculty hiring - namely, the fact that the majority of CS professors across the US come from the top 18 or so universities, and that even within that, there's bias. The HN discussion turned to Ph.D. admissions, with a commenter noting:

"I'm not sure, but I'd bet that the best institutions actually don't care where their applicants come from. They can afford to just choose the best. It's the so-so places that have to signal their quality by 'hiring the best'. I've seen this in graduate school applications - students with poor marks often stand a better chance of admittance in a top program than a so-so one."

I just finished chairing the CMU CS Ph.D. admissions committee, so I thought it might be worth writing down my experience in doing so. With two very important notes:

I woke up on May 28th, 2014, on vacation with my family in the middle of the desert, to find a copy of my private source code plastered across the bitcointalk message board. Announced as a "new optimized version" of the Monero currency miner, it was enthusiastically adopted by cryptocurrency miners across the world. And in the process of doing so, my daily profit from the Monero Mining Project dropped by over five thousand dollars per day.

But let's start at the beginning, when I started getting in to a loose collaboration with three people I've never met---one whose name I'm not even confident I really know---with hundreds of thousands of dollars worth of a nebulous new cryptocurrency at stake.

It started with a cryptic note from someone I'd met online, with a link to a bitcointalk.org message board discussion for a new currency called "bitmonero". His note said only:

A few days ago, I posted about getting the Ph.D. in computer science. As part of that post, I mentioned that I'd publish & discuss some of the reviews my papers have received. These aren't so nasty that they're funny, but I decided to post the full submission/review/submission/review/final as a way to also help show what the publication and revision process looks like. I probably owe a better explanation of the delta between the papers, but - hey, the SIGCOMM, USENIX ATC, and SEA (algorithms) deadlines are all this week. :)

I've deliberately picked two papers with very different initial reviews. Both were eventually accepted, and neither was accepted upon first submission. One received a best paper award. Both have over 100 citations. I've picked these not because they're representative -- I've written papers with low citation counts also -- but because they illustrate the wide disparity in review constructiveness and tone that even decent, pub…