I think with a real ASIC hardware implementation of sha-256 it should easily be possible to outrun GPUs by at least a factor of 100, because of better space efficiency and the simplicity of the logic involved.

Considering that -you can manufacture one of these for ~2M€ and then get 1000s of these chips -they will not consume much power-they can be put on cheap boards, because no heavy IO is needed (graphic cards are expensive due to the heavy IO with ram)

I am sure this will happen, if Bitcoins really establish as a currency and USD exchange rates stay in the 10$ range. If not, this would be a cheap option for aggressors like governments to take over the network. Much easier and cheaper then trying to shut it down by law.

IMO, an ASIC implementation is the way to go. We already have decent RTL (those who contributed to this know who they are and I thank you guys for this). With little modifications to the currently RTL, we could easily daisy chain many "cores" (easiest implementation with current state of project is a token ring over UART...only need to assign a specific address to each core).

Let's say each manufactured chip would yield 100 MHash/s. We daisy chain 20 per boards (a board with 20 chips on it is not a big deal) That's 2 GHash/s right there. PCB design and manufacturing would be pretty straight forward. I volunteer for that.

The big question: how to we finance an ASIC project? And even more importantly: how do we get it done?

1) Outsource FPGA2ASIC flow to http://www.icnexus.com.tw/product.php?id=25 (first company I found...there's gotta be many others). Get a chips ASAP and limit the risks. With this forum, I'm sure we could get a small EE team together and do all the Synopsis, BIST, test scan, pads design, routing, etc. crap ourselves but there are specialists out there that will do it for us...and chances of success will be much higher with that approach. Being a 100% digital chip (+ regulator and PLL obviously) the project couldn't be easier for these guys (or whatever company that would get the contract)...now to mention they are already in the business of FPGA2ASIC conversion.

2) Crowd funding with kickstarter.com -- If we can get 500 people to pre-order one 2 GHash/s board at 1000$ a piece (a truly good deal IMO), we get a 500k$ budget to do #1. We need 10,000 chips. I think the budget makes sense if we spend 250k$ on design, 100k$ on chips (10$ a piece), 50k$ for tape-out (might be included in design cost...we need to see with the contractor), 10k$ on PCBs and assembly + the rest for overhead. Once we get real quote from contractor, we can adjust the cost per board...I'll I'm putting here are ball park figure to show the potential of this approach.

So far in my career all I've done is deal with PCB, FPGA and ASIC designs...this project seem very realistic to me. But maybe I'm day dreaming...please bring me back to earth if I'm doing so.

Unless someone can come up with a working prototype, then I would say outsource it. A company looking for work might do it for a very cost effective price if we can come up with the basic design, software, and cash. They can make their profits up on repeat sales, and improved designs.

I now have a bitfile for the atlys board (spartan 6 - lx45) with depth:=2 and 50mhz

The only problem is, that miner.py refuses to communicate over the serial port.It detects the core, but when it starts "Measuring FPGA performance..." it produces and timeout: "Timed out waiting for FPGA to accept work"

@TheSeven: any idea how to debug or solve the problem? is the miner.py code working for all depths and frequencies?

You'll need to adjust the pin locations for clk_in, rx and tx in the UCF file, and adjust the clock divider for the serial port for the 50MHz frequency.Replace "10000010001" with "0110110010" and "11000011001" with "01010001011" in uart.vhd.And I should probably publish the new version of my miner, it now supports multiple pools, long polling, etc.

TheSeven, can you give some lines on how to calculate the deviders, any formula?

Hmmmm. It looks like Xilinx's synthesis tools aren't very good at inferring the right meaning from various constructs this uses. I'm now seeing if I can convince them to interpret it in more efficient ways, but to be honest it looks like they may just be slightly rubbish. It ought to be possible to cram a full hashing pipeline into a XC6SLX75 in theory, but the current code isn't even getting close.

I don't feel like doing this since it's not mine, but it would be really cool to put xilinx implementation on github too so that people can fork it for different boards optimizations and improve easily with pull requests. If, of course, they want to share their changes.

I am pretty sure they can do much more. If a single mid-range fpga can house an entire pipeline and get 50 MH/s, any ASIC must be able to overperform that at least with a factor of ten.

Quote

there are specialists out there that will do it for us...and chances of success will be much higher with that approach.

Considering that the design is very simple, and we don't need to push any limits (as we can simply use more chips instead) probably the manufacturer's team could do it relatively cheaply, it's mostly an automatized process anyway.

Quote

2) Crowd funding with kickstarter.com -- If we can get 500 people to pre-order one 2 GHash/s board at 1000$ a piece (a truly good deal IMO), we get a 500k$ budget to do #1. We need 10,000 chips. I think the budget makes sense if we spend 250k$ on design, 100k$ on chips (10$ a piece), 50k$ for tape-out (might be included in design cost...we need to see with the contractor), 10k$ on PCBs and assembly + the rest for overhead. Once we get real quote from contractor, we can adjust the cost per board...I'll I'm putting here are ball park figure to show the potential of this approach.

I think the one-time costs are higher. Unless we go for some structured ASIC, which indeed can be done from a few $100K. But the problem with structured ASIC is very similar to the fpgas: we have to pay for all the unused stuff (memory blocks, hardware multipliers, etc) that we don't need, increasing the price and lowering performance. And the projects return would still be threatened by others who starts making real asics.

I am also not sure that hunderds of people would commit the neccessary amount. Buying a video card is a much lower risk, as it can be sold anytime and has uses for other purposes.

Quote

So far in my career all I've done is deal with PCB, FPGA and ASIC designs...this project seem very realistic to me. But maybe I'm day dreaming...please bring me back to earth if I'm doing so.

Just coded a fully unrolled SHA256 in VHDL using two different approaches to maximize clock rate, a simple approach that involves precalculating H + K + W, and a more advanced approach that further pipelines each stage. Initial compiles targetted Cyclone IV using web edition quartus (which sucks), with the simple version achieving 110MHz and the advanced version 133MHz. Will be interested to see maximum clock rate that can be achieved on Stratix IV.

Hmmmm. It looks like Xilinx's synthesis tools aren't very good at inferring the right meaning from various constructs this uses. I'm now seeing if I can convince them to interpret it in more efficient ways, but to be honest it looks like they may just be slightly rubbish. It ought to be possible to cram a full hashing pipeline into a XC6SLX75 in theory, but the current code isn't even getting close.

Yeah, I'm still fighting with ISE. Using SmartXplorer I got ISE to P&R a fully unrolled mining core into my 6SLX150 chip at 50MHz. ~50% resource consumption all around, which is good. Two problems: A) results returned by the running core were erratic, and B) the chip ran very hot.

IMO, an ASIC implementation is the way to go. We already have decent RTL (those who contributed to this know who they are and I thank you guys for this). With little modifications to the currently RTL, we could easily daisy chain many "cores" (easiest implementation with current state of project is a token ring over UART...only need to assign a specific address to each core).

I fully agree that ASIC is the long-term way to go, but this UART token ring thing seems to be rubbish to me. There are well-suited protocols for this, like for example I²C.

There are two possibilities:- Build a PCIe mining accelerator card, with some PCIe to I²C (or whatever) bridge, possibly on a CPLD.- Slap an ARM SoC and an ethernet adapter on the board as well and make it run autonomously.

Let's say each manufactured chip would yield 100 MHash/s. We daisy chain 20 per boards (a board with 20 chips on it is not a big deal) That's 2 GHash/s right there. PCB design and manufacturing would be pretty straight forward. I volunteer for that.

Good to know, as I have never dealt with this area before. Could you provide an estimate for the non-ASIC cost? (PCB design, prototyping, manufacturing and assembly, voltage regulators, clock generation, ...)

The big question: how to we finance an ASIC project? And even more importantly: how do we get it done?

1) Outsource FPGA2ASIC flow to http://www.icnexus.com.tw/product.php?id=25 (first company I found...there's gotta be many others). Get a chips ASAP and limit the risks. With this forum, I'm sure we could get a small EE team together and do all the Synopsis, BIST, test scan, pads design, routing, etc. crap ourselves but there are specialists out there that will do it for us...and chances of success will be much higher with that approach. Being a 100% digital chip (+ regulator and PLL obviously) the project couldn't be easier for these guys (or whatever company that would get the contract)...now to mention they are already in the business of FPGA2ASIC conversion.

I've heard rumors that Altera would be doing there HardCopy process for as low as $150K for 1000 chips, which seems very low to me. No idea whether that's true though. We might want to request a quote.

I am pretty sure they can do much more. If a single mid-range fpga can house an entire pipeline and get 50 MH/s, any ASIC must be able to overperform that at least with a factor of ten.

I'd expect the chips to run at 200-300MHz, and one of my co-workers said that he tried synthesizing the hardcopy process for my VHDL design, and that 20 of those would fit on a single chip. That's 4-6GH/s per chip.

Just coded a fully unrolled SHA256 in VHDL using two different approaches to maximize clock rate, a simple approach that involves precalculating H + K + W, and a more advanced approach that further pipelines each stage. Initial compiles targetted Cyclone IV using web edition quartus (which sucks), with the simple version achieving 110MHz and the advanced version 133MHz. Will be interested to see maximum clock rate that can be achieved on Stratix IV.

How big was the 133MHz design? (How many KLEs?)Could you share this design?

I now have a bitfile for the atlys board (spartan 6 - lx45) with depth:=2 and 50mhz

The only problem is, that miner.py refuses to communicate over the serial port.It detects the core, but when it starts "Measuring FPGA performance..." it produces and timeout: "Timed out waiting for FPGA to accept work"

@TheSeven: any idea how to debug or solve the problem? is the miner.py code working for all depths and frequencies?

You'll need to adjust the pin locations for clk_in, rx and tx in the UCF file, and adjust the clock divider for the serial port for the 50MHz frequency.Replace "10000010001" with "0110110010" and "11000011001" with "01010001011" in uart.vhd.And I should probably publish the new version of my miner, it now supports multiple pools, long polling, etc.

TheSeven, can you give some lines on how to calculate the deviders, any formula?

the first one is (clock frequency / 115200), the second one is ((clock frequency / 115200) * 1.5)

(Edit: Altera Quartus II claims FMax = 90.16 MHz on EP4CE115 for the xilinx-shiftreg branch, with one or two tweaks to the build config that may not be necessary. Took over 3 hours to build and used pretty much all the FPGA though, so not terribly useful - you would be better off adding another mining core.)

Been messing around some more with fpgaminer's code. Users of largish Altera FPGAs might want to try this branch, which skips the last 3 rounds in the fully-unrolled version and allows optimisations based on the fact that part of data is constant. Xilinx users can additionally uncomment "`define USE_RAM_FOR_KS" and combine this with teknohog's serial miner, though this may not work too well. (There's also the xilinx-shiftreg branch which only works for fully-unrolled miners.)

Note that I don't have an actual FPGA to test any of this on, so be sure to double-check the thermal results to make sure you're not going to damage your expensive hardware, and make sure it's actually submitting blocks successfully. Also, these are more size improvements than speed improvements, and most people that could benefit have probably got their own better version already.

Brief explanation: with the original code, which is what you get with USE_RAM_FOR_KS disabled, Xilinx's xst was doing something daft involving shift registers for K[ s]. Without the xilinx-shiftreg changes, it also failed to use shift registers for W where it kinda made sense to do so; unfortunately with the changes Altera's Quartus tools no longer find the shift registers.

I fully agree that ASIC is the long-term way to go, but this UART token ring thing seems to be rubbish to me. There are well-suited protocols for this, like for example I²C.

There are two possibilities:- Build a PCIe mining accelerator card, with some PCIe to I²C (or whatever) bridge, possibly on a CPLD.- Slap an ARM SoC and an ethernet adapter on the board as well and make it run autonomously.

If you're doing HardCopy-style structured ASICs, in theory you could put a fastish 32-bit processor and Ethernet MAC on the ASIC itself. It'd probably only take up a smallish proportion of the chip and you'd just need boot flash and Ethernet PHY chips externally. Not sure how much sense this would make though.

How big was the 133MHz design? (How many KLEs?)Could you share this design?

Unfortunately, OrphanedGland and a lot of the other posters in this thread can't reply to it anymore because they're too new. The forum admins have blocked users with a small number of previous posts from posting anywhere except the Newbie forum. Try this thread.

June 12th, 2011 - Xilinx and VHDL Ports AddedWith many thanks to TheSeven and teknohog, their code has been added to the public repo. TheSeven did a re-implementation in VHDL, with support for Xilinx and ISE. teknohog did a straight port of the Verilog code to simply support Xilinx and ISE. Both include Python miner control scripts, and serial port communication with the FPGA board.

I made little to no modification to their code for this first commit. If you appreciate their hard work on this Open Source project, please send them your thanks and donations!

As it stands now, the project makes lots of references to the DE2-115 board, and it being the preferred mining platform. Obviously this isn't the case, nor is it meant to be an advertisement. It was simply the first device supported, and the one that currently has a binary release available. In the near future, I will merge in full Xilinx compatibility changes into the main Verilog code and try to steer the project towards supporting many devices and boards in a Plug-and-Mine fashion (like the DE2-115 currently is); or at least Compile-Plug-and-Mine

Also, the directory structure in the repo is not optimal, but that will improve with time as I settle on a structure that fits the project's many needs (multiple code variations, and multiple device specific implementations).

I have many other promising patches to merge, including a few of my own So keep watching this thread!

If you're doing HardCopy-style structured ASICs, in theory you could put a fastish 32-bit processor and Ethernet MAC on the ASIC itself. It'd probably only take up a smallish proportion of the chip and you'd just need boot flash and Ethernet PHY chips externally. Not sure how much sense this would make though.

Depending on expense, i'd look at adding a small XMOS processor to the board. It being transputer in essence was designed to talk other chips like ASICs and other XMOS. The ASIC is left doing its special magic, the XMOS handles everything else (including block submits etc.) and can be easily connected up into massive rigs as required.

Depending on expense, i'd look at adding a small XMOS processor to the board. It being transputer in essence was designed to talk other chips like ASICs and other XMOS. The ASIC is left doing its special magic, the XMOS handles everything else (including block submits etc.) and can be easily connected up into massive rigs as required.

They're kinda neat, fairly powerful, and the chips aren't massively expensive either. I actually have an XMOS XK-1 boards here flashing an LED at me accusingly. The caveats are that USB is a bit tricky, requiring external components and using up nearly all available I/O ports on that core (so you really need a more expensive dual core chip for that), the chip has some slightly interesting power sequencing and reset requirements, it has no internal Flash so you need a seperate SPI Flash chip for firmware, and it has no internal driver for a crystal oscillator. (Edit: oh, and you're limited to 64 kilobytes of RAM per core and your code needs to fit into that too.) An ARM microcontroller with integrated Ethernet MAC and USB might work out better.

Of course, there'll be some tricky board design and manufacture anyway for the ASIC, so that may not necessarily be a huge obstacle.

Edit 2: Also, a totally untested 90MHz/90 MHash/sec bitstream for the DE2-115 is now in a branch on my git repo. PowerPlay estimates 4.4W of heat, I think? Anyway, don't blame me if you blow up your expensive board. There's a reason I'm not including instructions here.

If you're doing HardCopy-style structured ASICs, in theory you could put a fastish 32-bit processor and Ethernet MAC on the ASIC itself. It'd probably only take up a smallish proportion of the chip and you'd just need boot flash and Ethernet PHY chips externally. Not sure how much sense this would make though.

Depending on expense, i'd look at adding a small XMOS processor to the board. It being transputer in essence was designed to talk other chips like ASICs and other XMOS. The ASIC is left doing its special magic, the XMOS handles everything else (including block submits etc.) and can be easily connected up into massive rigs as required.

I don't like this idea too much. I'd keep the ASIC/FPGA interface as simple as possible, but I'd prefer a bus for easy scalability. I²C springs to my mind.This way you can keep the ASIC simple and don't waste precious space, don't waste the control processor circuitry when chaining multiple ASICs, and can use the same ASIC for both PCIe-based accelerator cards and standalone mining boards. The latter would get a simple ARM SoC with updatable firmware (based on linux?) and an ethernet interface.

Edit 2: Also, a totally untested 90MHz/90 MHash/sec bitstream for the DE2-115 is now in a branch on my git repo. PowerPlay estimates 4.4W of heat, I think? Anyway, don't blame me if you blow up your expensive board. There's a reason I'm not including instructions here.

Also, the directory structure in the repo is not optimal, but that will improve with time as I settle on a structure that fits the project's many needs (multiple code variations, and multiple device specific implementations).

I've written a document that proposes a directory structure for FPGA designs that is version control friendly and scalable:

Maybe a stupid question, but why can't existing crypto hardware acceleration chips be used? Why does a new one have to be built?

Do you have a suggestion? The only IC I could find that do SHA256 are kinda slow when looking at the throughput needed for mining. Sure you could chain a ton of them (they're cheap in lots of 1000) but the control logic surrounding it could get expensive.The custom crypto-smashing machines so far have been FPGA or structured ASIC based. I think the one that cracked the RC4 challenge was like $250K worth or something.