RISC V ?

A: Drop a RISC V processor core in there. There are fully functional RISC V compilers from GCC and Clang/LLVM. RISC V cores can be pretty small and there are a bunch of them available already, free for use, in Verilog and VHDL.

Well, I'm joking. Mostly. I don't expect such a crazy thing to happen and I certainly would not want such a thing holding up P2 progress.

However, I just did something that made my eyes pop. A little while ago Chip posted the Verilog for the new P2 PRNG. It seemed short and sweet so I was inspired to install the Icarus Verilog simulator and see if I could learn just enough Verilog to check its output was correct. Turned out to be pretty easy. I had just installed Quartus, waiting for the new P2 release with the PRNG, so of course I had to see if I could get the PRNG test running on a real FPGA. Soon I had randomly flashing LEDs on my DE0 Nano. I joked that I would now proceed to design my own CPU.

The prospect of getting a CPU core working looked pretty daunting, but after a while looking at what Clifford has there I realized it might not be impossible. I just cut and pasted his core into a Verilog project and started wrapping memory and peripherals around it. I did not want all of Clifford's peripherals and buses and stuff; too complicated for this humble beginner. Besides, the challenge is to learn some Verilog, so I wanted to make my own peripherals, as crude as they may be. It turns out that adding memory and GPIO to such a core is dead easy.

The end result is:

32 bit RISC V integer core with MUL and DIV running at 100MHz, about 25 MIPS.
12Kbytes RAM, using the memory on the Cyclone IV of the Nano.
GPIO port driving LEDs.
UART (Not quite done yet)
A PRNG port that serves up xoroshiro128+ random numbers.
It runs some "firmware" compiled with GCC for RISC V that just counts up on 8 LEDs. (The "Hello World" of embedded systems)

This all fits in 2600 Logic Elements, about 12% of the FPGA. Shrinks to 8% without MUL and DIV.

Where does this all lead?

I have no idea. Just having fun. I could stick 8 of those cores on there and make a RISC V "poor man's Propeller". There is the SDRAM to put to use. And peripherals like the DE0 Nano's ADC and accelerometer. Or perhaps what about replacing a COG from the open source P1 Verilog with a picorv32 core? Think I have a lot to learn for that one.

Anyway, if anyone is tempted to get their feet wet with FPGA and Verilog I highly recommend it. I suggest getting hold of the Icarus Verilog compiler/simulator. It makes experimenting very quick and easy, rather than waiting for the slow and ponderous Quartus to build anything. A bit like hacking code in Python or Javascript. Also it's easy to knock up quick test harnesses so you have some confidence your gadget will work. Without Icarus I would have given up in frustration ages ago. http://iverilog.icarus.com/

Yeah, I know this is all off topic for a P2 forum. I was just so amazed at what is possible to do relatively easily nowadays I had to tell someone. Besides, it's Chip's fault for kicking me down this Verilog road. Thanks, Chip. Oh, and it does include Chip's P2 PRNG so it is in very small part a P2!

32 bit RISC V integer core with MUL and DIV running at 100MHz, about 25 MIPS.
12Kbytes RAM, using the memory on the Cyclone IV of the Nano.
GPIO port driving LEDs.
UART (Not quite done yet)
A PRNG port that serves up xoroshiro128+ random numbers.
It runs some "firmware" compiled with GCC for RISC V that just counts up on 8 LEDs. (The "Hello World" of embedded systems)

This all fits in 2600 Logic Elements, about 12% of the FPGA. Shrinks to 8% without MUL and DIV.

Where does this all lead?

Sounds cool.
Did you try a build for the Lattice ICE40UP5K-SG48ITR50 ? (testable on ICE40UP5K-B-EVN)
This part has 128K Bytes SRAM, and 5280 LE, but I'm unclear on how Lattice LE map to Altera LE....

Anyway, if anyone is tempted to get their feet wet with FPGA and Verilog I highly recommend it. I suggest getting hold of the Icarus Verilog compiler/simulator. It makes experimenting very quick and easy, rather than waiting for the slow and ponderous Quartus to build anything. A bit like hacking code in Python or Javascript. Also it's easy to knock up quick test harnesses so you have some confidence your gadget will work. Without Icarus I would have given up in frustration ages ago. http://iverilog.icarus.com/

Did you run the above on icarus, and what speed does icarus simulate at ?
Can icarus read a ROM file, or does it need to recompile the verilog for every simulate ?

I should have done this years ago. Hmm...actually I did, I tried some experiments in VHDL running under the GHDL simulator. But VHDL is complicated and verbose. And FPGA boards were not so cheap and readily available then.

Might take a while to get one's head around the fact that Verilog is not like a regular programming language. Potentially every statement you write can be happening at the same time. But if you are used to juggling parallel things on the Propeller it's not so shocking.

Did you try a build for the Lattice ICE40UP5K-SG48ITR50 ? (testable on ICE40UP5K-B-EVN)

No. The nano is all I have.

But that is where things get interesting. Clifford Wolf runs that RISC V core on some Lattice FPGA. I forget which one but they are physically tiny and very cheap.

Not only that but Clifford and a few other guys have reverse engineered the Lattice bit streams you need to configure those devices and created synthesis tools. With that one can get an FPGA working with a totally open source tool chain. Those guys are serious turbo nerds!

So yeah, a Lattice FPGA dev board is now on my want list...

Did you run the above on icarus, and what speed does icarus simulate at ?

Yep, it all runs under Icarus and you can watch the RISC V core execute instructions, trace the memory accesses etc. I guess it's dead slow. Good enough to dump a few hundred or thousand RISC V instruction steps per second. Good enough to see whether something actually works or not.

What I have been doing mostly is using Icarus to develop and test the components. E.g. create a UART, create a test bench for it, play with it till it works, then integrate it into the project. Icarus may be slow but the edit/test cycle is fast. As I said, like hacking code in Python.

Can icarus read a ROM file, or does it need to recompile the verilog for every simulate ?

Icarus compiles code into some kind of byte code. For example:

$ iverilog -o uart_tb.vvp uart.v uart_tb.v

compiles the UART and its test bench into a uart_tb.vvp file, which can then be run:

$ vvp uart_tb.vvp

This is all very quick for a simple module test.

If you feel the need for speed, or have a huge design that is slow to simulate, then there is Verilator. That compiles Verilog into C++, which you can then compile and run. I have not managed to get that working yet.

If you feel the need for speed, or have a huge design that is slow to simulate, then there is Verilator. That compiles Verilog into C++, which you can then compile and run. I have not managed to get that working yet.

Ah, that is the path I was looking for.
Be interested if you do get that working, with speed stats on RISC V, as that seems a good way to get an exact P2 Simulator.
I believe the AVR simulator Atmel have, works this way - they feed it the chip design files, and get an EXE/DLL out.

But that is where things get interesting. Clifford Wolf runs that RISC V core on some Lattice FPGA. I forget which one but they are physically tiny and very cheap.

Could be the iCE40; that 128K RAM family member is very new (~$6). Eval boards in stock, but no disti silicon yet.
Do you have any links to his Lattice work, the link above mentions only Xilinx (but does hit some impressive MHz numbers)

Very cool, Heater. I've often thought that putting 8 or 16 RISC-V cores on a chip, with memory, a hub module and some custom Propeller like instructions (the RISC-V instruction set is extensible) would make for a very compelling Propeller3.

There are a ton of open source RISC-V implementations available now (e.g. the VectorBlox Orca which is made for FPGAs). RISC-V hardware is becoming available now from companies like SiFive. It'll be very interesting to see where it all ends up.

Oh yeah, 8 picorv32 cores and some kind of HUB memory was on my mind too. Should just about fit in the nano. I didn't realize a RISC V core could be so small.

As you say, RISC-V is extensible. With a few carefully crafted extensions to the instruction set and the smart pins it would make an excellent P3. I can't imagine Chip going for it though. The RISC-V instruction set is designed to be compiler friendly, not human assembler coder friendly. Consider this for example:

li a5,0xffff0006

The new ICE40UP5K-SG48ITR50 has slightly less logic, 5280 LC (vs 7680 LC on the HX8K), but it has an easier QFN48 package, and adds 128 KBytes SRAM (vs 128 Kbits). Based on those stats, it should be roughly half full.
Space for a P1V COG ?

As you say, RISC-V is extensible. With a few carefully crafted extensions to the instruction set and the smart pins it would make an excellent P3. I can't imagine Chip going for it though. The RISC V instruction set is designed to be compiler friendly not human assembler coder friendly. Consider this for example:

The assembler will produce two instructions for li, a LUI and an ADDI, to compose a 32-bit constant, just like the Prop2 assembler produces an AUGS and a MOV for every ## in the equivalent Prop2 code.

And one comment here: https://news.ycombinator.com/item?id=12193769

"There is no 4K die. The 4K chips are using 8K dies; the Lattice software limits the number of usable LUTs to 4K. IceStorm will give you access to all 8K LUTs in the device." -- cliffordvienna

Lattice may not like that information leaking out...

I downloaded the latest iCEcube2 tools, and did a dummy run on an iCE40UP5K-SG48.
All seems ok, Synth -> P&R with green ticks everywhere.

You are right. The assembler does deal with things like "li a5,0xffff0006" by generating two instructions.

My gut does not like the idea of an assembler producing extra instructions behind my back. That's what compilers are for.

But in cases like this it makes a lot of sense. Nobody wants to dick around figuring out how to split immediates up for loading. And I guess it's not much more of a worry than an Intel assembler producing huge sequences of instruction and operand bytes whose length depends on the actual values and addressing modes you use.

Next up I have to turn on the RISC V compressed code feature and see what space savings we get.

So I guess if you follow the installation instruction in that repo you end up with a RISC V SoC for iCE40.

I love that tidbit about getting around the 4K limit.

Honestly I think this whole IceStorm thing is huge. I mean, we can now develop for FPGA in Verilog using totally Open Source tools, even running on a Raspberry Pi. That is a monumental achievement. I'm surprised I have not seen any talk of it on the Raspi forums.

You are right. The assembler does deal with things like "li a5,0xffff0006" by generating two instructions.

My gut does not like the idea of an assembler producing extra instructions behind my back. That's what compilers are for.

Think of it as simply a 64b opcode, and that problem goes away.

Bull, the problem doesn't go away at all. It has just inserted a hidden instruction that takes more space and more time to execute. I'm not against the assembler doing this but you can't say there are no gotchas.


Certainly there can be gotchas in the assembler sneaking in extra instructions for you. I'm guessing it's only a problem if one is into timing things by counting up instructions and clock cycles so as to meet some strict timing constraints. As people do on the P1 and no doubt will do on the P2. Or perhaps when squeezing code into really small memory spaces, like a COG.

All in all not something the RISC-V designers or people writing assemblers for it worry about. RISC-V is intended as a general purpose instruction set architecture.

You just can't encode a 32-bit immediate value in a 32-bit wide instruction. Every processor architecture needs to handle that with two instructions or an additional immediate word after a load instruction.
The Propeller 2 also does that (with AUGx).

Load Immediate (LI) is not a native RISC-V instruction, it's an assembler pseudo instruction to simplify the load of constants, just like ## on the P2.

if you want to see every instruction.
The big difference between a Propeller (1 or 2) and RISC-V is in the tight integration of counters and ports with the instructions on the Propeller. This allows fast bitbanged software peripherals which are much harder on RISC-V.
On RISC-V the ports are normally memory mapped, which needs separate instructions to load, modify and store.

What is just an XOR OUTA,#1 on the Propeller, becomes:

li   a5,PORTA_ADDR   # load the port's memory-mapped address
lw   a4,0(a5)        # read the current port value
xori a4,a4,1         # toggle bit 0
sw   a4,0(a5)        # write it back

on RISC-V. And Load/Store are often one of the slower instructions.
Same for things like WAITCNT or WAITPNE.

So if you want a Propeller like multicore with RISC-V cores, you will need custom instructions that allow tight integration with ports and counters.

That's not the issue though. The issue is simply extra instructions being generated by the assembler that you did not explicitly write, which complicates simple-minded instruction counting when making tight bit banging loops and so on. Also, if you increase the size of a literal, all of a sudden your code gets bigger!

Anyway, I'm not inclined to worry about that much. My RISC-V will be in FPGA, unless someone starts selling actual RISC-V chips, so any such bit banging will be done in Verilog!

Certainly the tight integration of I/O into the COG instruction set is a wonderful thing.

I was pondering the idea of RISC-V extensions for such bit banging and timing. The picorv32 core has a coprocessor interface for exactly that purpose, currently only used for the optional MUL and DIV instructions. I was starting to wonder how easy it might be to add my own instructions to that interface for ports and counters etc.

On a Propeller, they are one and the same. It's a register when you want it to be, small, local memory when you want it to be.

As for need, a Prop is a memory to memory direct design at the COG level. Code and Data are unified, registers / memory, etc... This means avoiding the load / store cycle, which improves throughput and real-time response.

At the HUB level, a Prop is a load-store machine, just having a ton of registers.

One distinction is that the I/O is memory mapped, but in the same space as the COG memory, or it's dedicated and accessed by implied addressing.

This makes it a micro-controller, in my view, as that generally isn't the model used for general purpose computing. It also makes it very fast in terms of sense, process, response.

Finally, we have some shared resources, like CORDIC, PRNG. Most things are relative to a COG, cloned to maximize both throughput and real time.

Should we get it done this year, it's going to be distinctive. Capable of things at a process and clock speed that is hard to beat.

Yeah, I noticed the Yosys thing. It's what he uses to build his picorv32 SoC in the repository I got the picorv32 core module from. It's used with the IceStorm open source Verilog compiler we mentioned above. All mind blowing stuff.

Perhaps I'll get to looking at Yosys sometime. Just now I'm still feeling my way around Verilog itself.

Are you curious about how xoroshiro128+ can be implemented with simple TTL logic gates? You can use the xoroshiro128+ code as input for Yosys and it will show you the logic gates needed to implement it. There is even a command that will show you a graph of that.

Yosys is the synthesis engine of the IceStorm tools; it compiles your Verilog input into LUT definitions and netlists. But it has other use cases, like formal verification (I have no clue about that).

Arachne-pnr is the place-and-route tool of IceStorm, which decides which LUTs are used and routes them correctly according to the netlists. The output is the configuration bitstream.
It's mainly the place and route that takes so long in Quartus and other commercial tools. Arachne is blindingly fast in comparison, at the cost of a bigger LUT count.

But the supported iCE40 FPGAs are quite limited. Only the bitstreams of the older iCE40 types are known. They have at most 8K LUTs and no multipliers.
So don't expect to use it for a P1V bigger than 4 COGs with some custom peripherals.