RISC V ?

Comments

I'm going to give up on getting the RISC V development tools going on a Raspberry Pi for now. My 32 GB SD card gave up the ghost (coincidence?) and an 8 GB card is just too small even after nuking Mathematica, LibreOffice, Minecraft, and SonicPi. Seems like cross compiling is the way to go - seeing what goes on in the sausage factory makes me a bit nostalgic for TurboPascal and LightSpeed C. Maybe since Peter is watching there will be a small Forth ;-)

Let me know if any warnings are still there. Cool that it runs at 100 MHz and you get to play with the PLL.

Well don't let me stop you from trying to get the toolchain up. Maybe you'll have better luck - I was trying to build everything, but perhaps that was simply too ambitious. Are you an expert in cross compilation?

(I built OpenCV on that same SD card, and that was pretty large. I probably would have given up if I tried to do it all on my own. Fortunately Adrian Rosebrock has a set of nice instructions on his blog. Now he sells a prebuilt SD image as part of his OpenCV book bundles; probably because he realized that most people would get stuck or bored along the way.)

As you noticed, cross-compiling is easy enough; we do it all the time for Propeller, AVR, STM32, etc. But getting the toolchain built and installed can be a nightmare. It works well enough with a good build system, some good instructions, and perhaps a guru on hand to advise. Oftentimes I have given up in frustration.

I was curious whether it's doable for the average person with just a Pi and not much else, e.g. a high-school student.

Those digital design projects for schools look interesting.

I mentioned it here somewhere before recently but now we have Free and Open Source tools for synthesizing Verilog for those cheap Lattice devices. We have Free and Open Source RISC V (picorv32 and others) implementations that fit on those Lattice parts with space to spare for other useful logic. We have the Free and Open Source RISC V GCC tool chain. We have Yosys. All of which runs on the Raspberry Pi. As I said, soon we will have 10 year old kids turning out SoCs with their own digital designs.

So yes, this is all doable for the average person.

Perhaps it needs packaging up into an easy to use system. The Icoboard project is an attempt at this: http://icoboard.org/

It should be easy to `ifdef it away. Just have a SIMULATION define. You can create a 100 MHz clock pretty easily if you hard-code everything. Either just make a new clock, or use delays and xor. I have to run, and you seem excited to figure this sort of stuff out, so I'll leave it at that for now.

I played around with it a bit. Here's some advice that you can ignore or not.

First I noticed that most files didn't define a timescale. It's good to add them - at some point this will save you some hassle. One easy way to do this is to make a header file to define it. Then if you ever want to change it you can edit this one file. And if it's not suitable for a particular file, then you can do something else there. (E.g. perhaps a PLL needs a different timescale to simulate properly)

I had been carefully avoiding thinking about actual time scaling in simulations. It's just logic, right? Anyway, I hadn't got around to finding out what that mysterious "timescale" thing meant. But of course it's a good idea to get that right. I added your timespec.vh and pushed to the repo.

The manifest is a great idea too. I also added that.

That paper you linked to is a monster. TL;DR most of it! But these guidelines from there look good:

Heater - there are a lot of interesting papers there. For FPGA work you can sort of take it easy ;-)

I have a question about a statement that was made earlier in this thread by Andy.

The assembler will produce two instructions for li: a LUI and an ADDI to compose a 32-bit constant

Since I don't have all of the RISC V tools I've been playing with a somewhat broken assembler written in Python. (I don't recommend it, but it was quick for writing some really simple tests. https://github.com/wueric/riscv_assembler if anyone must look.)

The question is: since ADDI and ORI both sign extend, is the above statement true? I'm probably missing something really simple, but I'm getting a lot of 1s in the MSBs.
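For anyone puzzling over the same thing: the usual trick is for the assembler to add 0x800 to the constant before taking the upper 20 bits for LUI, so that ADDI's sign extension of the low 12 bits cancels out exactly. Here's a quick Python sketch of that idea (my own model, not the GNU assembler's actual code; function names are made up):

```python
def li_pair(value):
    """Split a 32-bit constant into LUI/ADDI immediates,
    compensating for ADDI's sign extension of its 12-bit field."""
    value &= 0xFFFFFFFF
    lo = value & 0xFFF
    # If bit 11 is set, ADDI will sign-extend, effectively
    # subtracting 0x1000 -- so bump the upper part by one.
    hi = ((value + 0x800) >> 12) & 0xFFFFF
    return hi, lo

def simulate(hi, lo):
    """Model what the CPU computes: LUI, then ADDI (sign-extended)."""
    reg = (hi << 12) & 0xFFFFFFFF              # LUI
    imm = lo - 0x1000 if lo & 0x800 else lo    # ADDI sign-extends
    return (reg + imm) & 0xFFFFFFFF

# Round-trips even when bit 11 of the constant is set
for v in (0x12345FFF, 123456789, 0xAAAAAAAA, 0x7FF, 0x800):
    assert simulate(*li_pair(v)) == v
```

So the statement is true, but only because the assembler pre-compensates the LUI immediate; a naive split of the upper and lower bits gives exactly the 1s-in-the-MSBs effect described above.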

Older, simpler CPU designs would just add cycles and perform more ops. The 6809 could end up with 14 cycles or more, if one did something crazy like program-counter relative, indexed by register, post decrement...

And there was LEA, load effective address, so an address like that could be computed once and the result captured for simpler, faster forms.

With a pipeline, breaking ops out as separate instructions does the same thing: a bit of speed traded for, perhaps, program size and complexity.

So does it become something different? In this case bit 11 is a one, so sign extension comes into play. Unless I'm really confused. I think I found that macro in tc-riscv.c but was just now studying it.

Edited to add: the relevant part seems to be load_const

And edited again: I should read your examples more closely. Got fooled by decimal versus hex. They are doing some transforming in there for 123456789.

Compiling with mingw seems to be a dead end too. It checks, for instance, that bcopy is available, says no, but uses it anyway and complains that it cannot be found. free also seems not to be available.
Ubuntu on windoze doesn't even go that far:

It is so dead that it doesn't even react to Ctrl-C... I haven't learned the words to describe that. (There are no words to describe how ugly that is; yes there are, but you don't know them yet.)

Ale, two common issues I have found before with github and mixed windows/unix source/compile:

1) Never use GitHub's "download as zip"; it is broken. You will not get the exact original files.
2) Most tools (iverilog, perl, ...) are sensitive to CRLF/LF; run dos2unix (or unix2dos) first and check if that solves the issue.

Quite interesting links on this thread, and thanks KeithE for your tips.

Also this thread mentions two of the great genius programmers of the 21st century: Clifford and Fabrice.

About cross compiling: the current king of cross compiling is Rob Landley's 'Aboriginal Linux'. A set of scripts: with one single command it will automagically download all sources, cross-compile Linux, and start the kernel in a QEMU virtual machine, all flawlessly.

Looking around, it seems that there's not much of an alternative to using riscv64-unknown-*-gcc (e.g. LLVM):

This backend currently only supports assembly generation and riscv64-unknown-*-gcc must be used to assemble and link the executable.

So I'll have to get around to building that on the Pi. But first I wanted to give those free and open synthesis tools a try.

It seems that XORI is easier for my dumb brain than ADDI. Part of this adventure is getting some exposure to RISC V assembly so I wanted to play around a bit. (In the code below I have the full 32-bit constants, but only a portion is used in each line) I guess it's a good idea to always sign extend in hardware and just deal with this in the tools. It should simplify the hardware.

# Loading a 32-bit constant with bit 11 set
# First load the complement
LUI x31,0x55555555
# Then XOR with the constant
XORI x31,x31,0xaaaaaaaa
#
# Loading a 32-bit constant with bit 11 clear
# Just load the constant
LUI x31,0x55555555
# And XOR with the constant
XORI x31,x31,0x55555555
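If anyone wants to sanity-check that recipe, here's a quick Python model of it (assuming LUI keeps the upper 20 bits and zeroes the rest, and XORI sign-extends its 12-bit immediate before XORing; not generated from any real tool):

```python
def lui_xori(lui_val, xori_val):
    """Model LUI followed by XORI: LUI keeps the upper 20 bits,
    XORI sign-extends its low 12 bits before the XOR."""
    reg = lui_val & 0xFFFFF000      # LUI: upper 20 bits, low 12 zeroed
    imm = xori_val & 0xFFF
    if imm & 0x800:                 # XORI sign-extends bit 11 upward
        imm |= 0xFFFFF000
    return (reg ^ imm) & 0xFFFFFFFF

# Bit 11 set: LUI the complement first, then XOR with the constant.
# The sign extension flips the upper bits back to the right value.
assert lui_xori(0x55555555, 0xAAAAAAAA) == 0xAAAAAAAA

# Bit 11 clear: LUI the constant itself, then XOR with it.
# The XOR leaves the upper bits alone and fills in the low 12.
assert lui_xori(0x55555555, 0x55555555) == 0x55555555
```

The nice property of XOR here is that the sign-extended 1s in the upper 20 bits simply invert whatever LUI put there, so loading the complement up front makes everything come out right.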

BTW - you can write to x0, but the value is effectively zeroed for reads. I think that's handled by this code. You can only see the value if you look at dbg_reg_x0 in waveforms. I've seen this style before in other processors. I was looking here before I understood the constant loading since I thought some code I wrote broke things.

For one thing it appears that he proves those parallel case statements. And you can see what he's doing in the FORMAL sections.

Edited to add: I find that using all of the Pi's cores with "make -j$(nproc)" can cause hangs when building IceStorm. This is with a fan blowing on the board as well. Not sure what's going on with that. Maybe it's due to the limited amount of RAM?

So I wrote a little SPI driver in Verilog with the intention of accessing the ADC on the DE0 nano. Basically it's intended to clock out 16 bits from a register to the device over MOSI on the falling edge of the SCLK and clock in 16 bits from the device over MISO on the rising edge of SCLK. Sounds simple enough.
(The ADC only needs 4 channel select bits out and only delivers 12 data bits in, I thought I'd handle that detail in whatever is using this module).
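For reference, the intended transaction can be modeled in a few lines of Python (purely behavioral, not Verilog; MSB-first ordering and the Loopback "device" are my assumptions for illustration):

```python
def spi_transfer(mosi_word, slave, bits=16):
    """Model one SPI transaction: shift mosi_word out MSB-first,
    updating MOSI on the falling SCLK edge and sampling MISO on
    the rising edge. `slave` is any object with this contract."""
    miso_word = 0
    for i in reversed(range(bits)):
        mosi = (mosi_word >> i) & 1   # falling edge: master drives MOSI
        miso = slave.falling(mosi)    # slave presents its MISO bit
        # rising edge: master samples MISO, slave samples MOSI
        miso_word = (miso_word << 1) | miso
        slave.rising()
    return miso_word

class Loopback:
    """Trivial stand-in 'device' that echoes MOSI straight back."""
    def falling(self, mosi):
        self.mosi = mosi
        return mosi
    def rising(self):
        pass

assert spi_transfer(0xA5C3, Loopback()) == 0xA5C3
```

A model like this at least pins down which edge does what before any RTL exists; a real ADC model would return conversion data in `falling` instead of echoing.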

Now, I'd like to have a test bench for this to run under icarus before going to Quartus and spending hours synthesizing and running things.

But it looks to me like I would then have to write a verilog simulation of the ADC SPI interface for my driver to talk to.

Now, I'd like to have a test bench for that verilog ADC simulation before I use it.

But that test bench would look like my ADC driver that I want to test....

Chicken and egg.

How do real verilog designers handle this?

At this point I'm inclined to write a test bench that just checks the right bits get clocked out on the right edges and forget about clocking bits in.

Just hope that works and then go to Quartus.

Unlike the UART transmitter I don't think I can get to the ADC pins with a scope to check what comes out.

Here is the SPI driver so far, do let me know if it is crappy, actually I have never bit banged a SPI device before so the whole idea might be wrong :

Typically you would have a design engineer and a verification engineer developing from the same specification. It's always dangerous when the same person does both. This helps to catch errors caused by mistakes or by misinterpreting the specification. Also, either of these parties might buy or reuse "silicon proven" IP. And the verification IP is typically written in a more behavioral style. Of course this is no guarantee of success. So everyone uses FPGAs or other emulation platforms to interface to real hardware before tapeout if at all possible ;-)

One question you might ask is how you would do this in software with GPIOs. Looking at that "always @ (SCLK) begin" makes me wonder if the synthesizer will deal with it. Maybe there could be a single flop to reclock output data on the falling edge? Or perhaps all of the logic could run from clk and take action on the edges of SCLK treating it like a data signal? Using clk for everything means you don't need to worry about different clock domains and timing constraints. Since you probably don't care about power maybe best to start simple?

Also I guess you could do this in software with GPIOs and then write the hardware ;-)

Edited to add: also, if you use clk for everything then hopefully you don't need to worry about synchronization. In this example it looks like you're sending a multi-bit value out on rdData, and software just reads that at any time to get the latest reading. If it's changing with some skew relative to clk, then software could get a bad reading. (And if you start using generated clocks in FPGAs you need to be careful, e.g. sending clocks through fabric intended for data.)

Edited again to add: this does bring up one interesting area. If you find yourself using multiple clocks, then you should make sure to understand CDC (clock domain crossing). Search for synchronizers and perhaps metastability. There are all sorts of structures for different situations. The classic is a FIFO, but that's a complex place to start; there is a paper on it below. The simplest situation would be sampling a single-bit asynchronous signal: a lot of people just put that through a couple of flip-flops. Whatever you do, I recommend making a library of CDC stuff and always instantiating from there. Often you'll see code with CDC structures written manually, and then it's harder to make changes. Or flag things - e.g. there are tools that look for CDC logic, and maybe you can give them hints.
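For what it's worth, the couple-of-flip-flops structure can be sketched as a little Python model (just illustrating the register behavior, not real RTL, and it can't show metastability itself - only why the second stage hides it from downstream logic):

```python
class TwoFlopSync:
    """Model of the classic two-flip-flop synchronizer: the async
    input is sampled through two registers, so the (possibly
    metastable) first stage never drives real logic directly."""
    def __init__(self):
        self.ff1 = 0
        self.ff2 = 0

    def clock(self, async_in):
        # Both flops update on the same clk edge, like nonblocking
        # assignments: ff2 captures the OLD ff1 value.
        self.ff2, self.ff1 = self.ff1, async_in
        return self.ff2

sync = TwoFlopSync()
# A 1 presented before the first edge reaches the output after
# the second edge - two flops means two clocks of latency.
outs = [sync.clock(x) for x in (1, 1, 1, 0, 0)]
assert outs == [0, 1, 1, 1, 0]
```

The price is two clocks of latency on the crossing signal, which is why this only works for signals that stay asserted longer than a couple of destination-clock periods.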

That is exactly the document I am working from. In fact the few comments I have in the code are pretty much cut and pasted from there, just to remind me what is expected.

Typically you would have a design engineer and verification engineer developing based on the same specification. It's always dangerous when the same person does both. This helps to catch errors that are caused by mistakes or by misinterpreting the specification.

Oh yeah. Had my fill of that either writing or testing software in the avionics flight control software business. I was on the team testing the code for the Boeing 777's primary flight computers. Happy days!

In this case there is only me. Well, and guys like you if you want to pitch in.

One question you might ask is how you would do this in software with GPIOs.... Also I guess you could do this in software with GPIOs and then write the hardware ;-)

Indeed, I'd do it much the same way as I presented in Verilog! Thing is, in software, on a Propeller say, you get a very quick turnaround time as you try and fail, edit, try and fail... I don't want to do that in the immensely slow and cumbersome Quartus!

Looking at that "always @ (SCLK) begin" makes me wonder if the synthesizer will deal with it. Maybe there could be a single flop to reclock output data on the falling edge? Or perhaps all of the logic could run from clk and take action on the edges of SCLK treating it like a data signal? Using clk for everything means you don't need to worry about different clock domains and timing constraints. Since you probably don't care about power maybe best to start simple?

Can we talk about "clock domains"?

My idea of different clock domains is dealing with two physically different clocks, say running from two different crystal oscillators, that have an unknown and varying phase relationship.

To my mind, everything I have done so far is in the same clock domain. Even if I divide that clock down the phase relations are constant.

Thing is, I have done a bit of digital design in the distant past. I have fixed problems with multi-processor systems running off different clocks getting into a mess when interacting with each other.

Problem is I don't yet have a good feel of how this Verilog thing maps to actual gates and transistors.

In this example it looks like you're sending a multi-bit value out on rdData and software just reads that at any time to get the latest reading. If it's changing with some skew relative to clk, then software could get a bad reading.

Good point.

Actually my idea was not that this module gets read by software directly. I was thinking to have a higher level module that triggered this process, waited for the transfer to complete, then read all the 16 bits of rData. That higher level process would cycle through all 8 channel addresses, and it would provide the bus interface to the processor. It would provide eight 16 bit registers from which software could get the latest ADC readings.

Perhaps I have not thought this through enough... For example, my proposed higher-level process has no idea when the transfer is complete!

Maybe I'm being paranoid, but the question is what would guarantee that:

always @ (posedge clk) begin
...
SCLK <= !SCLK;

is going to generate a zero skew clock where the edges line up with clk? It's not going to be asynchronous, but I've heard of some bad experiences. But it has worked for some devices in the past as well. I know some guys only use special clocking hardware (e.g. Xilinx MMCMs) to generate clocks. Maybe it's ok for Altera Cyclone IV parts. But somehow the data signal from the flop has to get back onto the clocking fabric. There's probably going to be delay. How well is it all characterized?

It just so happens that this caused me to think that random people reading the thread might like to know about how to solve the general problem where something is truly asynchronous. (A simple example would be a UART receiver)

Let's say that you choose to be paranoid and somewhat lazy. Then there are multiple ways you can go. One way is to detect edges on SCLK and react to those. Then you don't need to worry about timing any more than you already are - OK, you do need to make sure that SPI to the ADC is going to meet setup/hold times versus SCLK.
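The edge-detect idea is just: register the previous sample of SCLK in the clk domain, and flag a rising edge when old is 0 and new is 1. A plain Python sketch of that logic (behavioral model only; assumes clk is sampling much faster than SCLK toggles):

```python
def rising_edges(samples):
    """Model clk-domain edge detection on a sampled signal:
    keep the previous sample in a register and flag a rising
    edge when it goes 0 -> 1. Everything stays in one clock
    domain; SCLK is treated purely as data."""
    prev, edges = 0, []
    for s in samples:
        edges.append(prev == 0 and s == 1)  # one-cycle edge pulse
        prev = s                            # registered previous sample
    return edges

# SCLK as seen by clk-domain sampling
sclk = [0, 0, 1, 1, 1, 0, 0, 1, 1]
assert rising_edges(sclk) == [False, False, True, False, False,
                              False, False, True, False]
```

In RTL the same thing is one register plus an AND gate, and falling-edge detection is the mirror image; the one-cycle pulse is what the rest of the clk-domain state machine reacts to.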

Hopefully you can understand the above idea. Referencing how you would do it with GPIOs: the idea is that you would only be using positive edges of some high-frequency clock. It's not the most elegant or power-efficient way, but maybe not the worst. (The next step could be to use clock enables to make it more power efficient.)