Let me introduce an experimental SoC and a compiler toolchain built for exploring the design space of compute acceleration. Of course, the words "compute acceleration" do not sit well with an FPGA as small and simple as the ICE40, but it still provides an opportunity to explore some simple techniques before graduating to more complex FPGAs, while enjoying the most pleasant, fully open source FPGA flow - yosys/arachne-pnr/icestorm.

The BlackIce board is well suited to such experiments, with its fast SRAM and convenient PMODs.

The demo I'm presenting here consists of the following:

A simple, reasonably fast 6-stage RISC CPU (around 2500 LCs in total). It retires 1 instruction per clock cycle unless stalled by an extended instruction (no memory stalls, no interrupts, etc.). This CPU core is designed to be a minion CPU in an SoC controlled by another, more general purpose CPU, but obviously on the ICE40 8k we only have space for one CPU anyway.

A monochrome 640x480 VGA output

An infrastructure for adding extended instructions to the RISC CPU

An optional UART (not used in the demo)

A small 2-port RAM implemented on ICE40 block RAMs, used for both code and data

A very simple, extensible C-like language compiler, with an SSA-based optimisation middle layer and multiple CPU backends. This language allows inlining Verilog into C the same way one would inline assembly.

There is a demo program displaying a monochrome Mandelbrot set (computed in fixed point).
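To make the fixed-point part concrete, here is a minimal sketch of a Mandelbrot escape-time kernel in plain C. It is not the demo's actual source: the Q19.12 format, the names `FIX`, `fix_mul` and `mandel_iter`, and the use of a 64-bit intermediate are all my assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 12   /* assumed Q19.12 format, not necessarily the demo's */
#define FIX(x) ((int32_t)((x) * (1 << FRAC_BITS)))

/* Multiply two fixed-point values. The 64-bit intermediate keeps the full
   product; the right shift of a negative value is assumed to be arithmetic,
   as it is on all common compilers. */
static int32_t fix_mul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> FRAC_BITS);
}

/* Iterate z = z^2 + c and return the number of iterations before |z|^2 > 4. */
int mandel_iter(int32_t cr, int32_t ci, int max_iter)
{
    assert(max_iter > 0);
    int32_t zr = 0, zi = 0;
    for (int i = 0; i < max_iter; i++) {
        int32_t zr2 = fix_mul(zr, zr);
        int32_t zi2 = fix_mul(zi, zi);
        if (zr2 + zi2 > FIX(4))
            return i;              /* escaped: the point is outside the set */
        int32_t zri = fix_mul(zr, zi);
        zi = 2 * zri + ci;         /* Im(z^2 + c) */
        zr = zr2 - zi2 + cr;       /* Re(z^2 + c) */
    }
    return max_iter;               /* never escaped: treated as inside the set */
}
```

For a monochrome picture, a pixel would simply be lit when `mandel_iter(...)` returns `max_iter` and left dark otherwise. Note that each iteration needs exactly three products.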

One version of this program runs entirely on the CPU, including software-implemented multiplication and division. The second version is nearly the same, but it adds an "__hls" attribute to some functions (multiplication and division), immediately turning them into hardware instructions. The third version implements the entire Mandelbrot kernel in hardware, using three 32-bit multipliers in parallel.
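For readers who have not written software multiplication and division before, the functions in question look roughly like the textbook routines below. This is a generic sketch, not the code from the demo: `soft_mul` and `soft_div` are hypothetical names, and the "__hls" attribute itself is left out because its exact syntax is not shown in this post.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-add multiplication: one conditional add per set bit of b. */
uint32_t soft_mul(uint32_t a, uint32_t b)
{
    uint32_t acc = 0;
    while (b != 0) {
        if (b & 1u)
            acc += a;      /* add the current shifted copy of a */
        a <<= 1;
        b >>= 1;
    }
    return acc;            /* low 32 bits of the product */
}

/* Restoring division: produces one quotient bit per iteration, MSB first. */
uint32_t soft_div(uint32_t n, uint32_t d)
{
    assert(d != 0);
    uint32_t q = 0;
    uint64_t r = 0;        /* 64-bit so the shifted remainder cannot overflow */
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);
        if (r >= d) {
            r -= d;
            q |= (uint32_t)1 << i;
        }
    }
    return q;
}
```

Loops like these, with a fixed structure and only shifts, adds and compares, are the kind of code that maps naturally onto a hardware unit, which is presumably why multiplication and division are the first candidates for conversion.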

This toolchain allows HLS compute acceleration to be exploited even further, by utilising pipeline-level parallelism - the Mandelbrot kernel inner loop is a 9-stage pipeline, meaning one core can compute 9 threads in parallel instead of one. Unfortunately, this version is already a bit too big for the ICE40 8k (just a couple of hundred LCs over the limit, so I hope to cram it in later, with a smaller host CPU). If you want to see this aspect of the toolchain in action, there is a 4-core demo running on the Digilent NEXYS4DDR board.
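Why a 9-stage pipeline can serve 9 threads almost for free is easiest to see with a toy timing model. The sketch below is only that, a model: it assumes the pipeline accepts one issue per cycle, that each iteration of a thread must wait for that thread's previous result, and that threads are picked round-robin. The function name `pipeline_cycles` and the `MAX_THREADS` limit are mine, not the toolchain's.

```c
#include <assert.h>

#define MAX_THREADS 16

/* Cycles to finish `iters` dependent iterations on each of `threads` threads,
   on a pipeline of `depth` stages that can accept one issue per cycle. */
long pipeline_cycles(int depth, int threads, int iters)
{
    assert(depth > 0 && threads > 0 && threads <= MAX_THREADS && iters > 0);
    long ready_at[MAX_THREADS] = {0};  /* cycle at which a thread may issue again */
    int left[MAX_THREADS];             /* iterations still to issue per thread */
    for (int t = 0; t < threads; t++)
        left[t] = iters;

    long cycle = 0, finish = 0;
    int remaining = threads * iters, next = 0;
    while (remaining > 0) {
        /* Round-robin: find a thread whose previous result is available. */
        for (int k = 0; k < threads; k++) {
            int t = (next + k) % threads;
            if (left[t] > 0 && ready_at[t] <= cycle) {
                ready_at[t] = cycle + depth;  /* result emerges depth cycles later */
                finish = ready_at[t];
                left[t]--;
                remaining--;
                next = (t + 1) % threads;
                break;
            }
        }
        cycle++;  /* at most one issue per cycle */
    }
    return finish;
}
```

In this model, one thread doing 10 iterations on a 9-stage pipeline takes 90 cycles, while nine threads doing 10 iterations each take 98 cycles: nine times the work for 8 extra cycles, because the stages that would otherwise sit idle are filled by the other threads.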

I'm surprised at what you managed to do with the lowly ICE40 w/ open source tooling. Sure, it's the 8k variant; I played around with the 2k variant a while ago and it felt very restricted. Computing Mandelbrot 100% on the device itself (even for a small SPI display) was one of the things I couldn't manage.

An update: OK, I managed to cram the pipelined HLS-compiled module into the ICE40 8k. Now it computes and displays the Mandelbrot set in up to 114 clock cycles per pixel (i.e., 1.4 seconds per frame), running 7 threads in parallel. All in a tiny 8k FPGA which does not even have hard DSP slices. I'm really impressed by the Yosys-synthesised multipliers. Hope someone will find it useful. Feel free to play with the HLS engine.
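As a sanity check on those numbers, the arithmetic can be written out in a few lines of C. The clock frequency is not stated anywhere above, so the sketch works backwards from the quoted frame time; the helper names `frame_cycles` and `implied_clock_hz` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case cycles for one frame at a given per-pixel cycle budget. */
uint64_t frame_cycles(uint32_t width, uint32_t height, uint32_t cycles_per_pixel)
{
    return (uint64_t)width * height * cycles_per_pixel;
}

/* Clock frequency in Hz implied by a frame time given in milliseconds. */
uint64_t implied_clock_hz(uint64_t cycles, uint32_t frame_ms)
{
    assert(frame_ms > 0);
    return cycles * 1000u / frame_ms;
}
```

With 640x480 pixels at 114 cycles each, `frame_cycles(640, 480, 114)` gives 35,020,800 cycles, and `implied_clock_hz(35020800, 1400)` comes out at about 25 MHz. That is an inference rather than a figure from the post, but it is close to the standard 25.175 MHz pixel clock for 640x480 VGA.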