Optional RV32 compressed instruction support in the reworkFetch branch for configurations without instruction cache (will be merged into master, WIP)

The hardware description of this CPU is done by using a very software oriented approach
(without any overhead in the generated hardware). Here is a list of software concepts used:

There are very few fixed things. Nearly everything is plugin based. The PC manager is a plugin, the register file is a plugin, the hazard controller is a plugin, ...

There is an automatic tool which allows plugins to insert data into the pipeline at a given stage, and allows other plugins to read it in another stage through automatic pipelining.

There is a service system which provides a very dynamic framework. For instance, a plugin could provide an exception service which can then be used by other plugins to emit exceptions from the pipeline.
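The service idea above can be illustrated with a toy model in plain Scala. This is only a sketch: `Plugin`, `Pipeline`, `ExceptionService`, `TrapPlugin` and `DecoderPlugin` are hypothetical names chosen for the example, not the VexRiscv classes. One plugin implements a service trait, and another plugin finds it through the pipeline without knowing which plugin provides it.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical names throughout: a toy model, not the VexRiscv API.
trait Plugin

// A service is simply a trait that some plugin chooses to implement.
trait ExceptionService {
  def emitException(cause: Int): Unit
}

class Pipeline {
  val plugins = ArrayBuffer[Plugin]()

  // Return the first plugin that implements the requested service, if any.
  def service[T](clazz: Class[T]): Option[T] =
    plugins.collectFirst { case p if clazz.isInstance(p) => clazz.cast(p) }
}

// Provider: implements the service and records the emitted causes.
class TrapPlugin extends Plugin with ExceptionService {
  val causes = ArrayBuffer[Int]()
  def emitException(cause: Int): Unit = causes += cause
}

// Consumer: looks the service up instead of naming the provider.
class DecoderPlugin extends Plugin {
  def reportIllegalInstruction(pipeline: Pipeline): Boolean =
    pipeline.service(classOf[ExceptionService]) match {
      case Some(s) => s.emitException(2); true // 2 = illegal instruction cause in RISC-V
      case None    => false                    // no provider in this configuration
    }
}
```

If no plugin in the list implements `ExceptionService`, the consumer simply gets `None`, which is how an optional feature can be left out of a configuration in this model.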

Area usage and maximal frequency

The following numbers were obtained by synthesizing the CPU as toplevel without any specific synthesis options to save area or to get better maximal frequency (neutral).
The clock constraint is set to an unattainable value, which tends to increase the design area.
The dhrystone benchmark was compiled with the -O3 -fno-inline option.
All the cached configurations suffer some cache thrashing during the dhrystone benchmark, except the VexRiscv full max perf one. This of course reduces the performance. It is possible to produce
dhrystone binaries which fit inside a 4KB I$ and 4KB D$ (I already had this case once) but currently it isn't the case.
The CPU configurations used below can be found in the src/main/scala/vexriscv/demo directory.

The VexRiscv project may need an unreleased master-head of the SpinalHDL repo. If it fails to compile, just get the SpinalHDL repository and
do a "sbt clean compile publish-local" in it as described in the dependencies chapter.

Regression tests

To run the tests (this requires the Verilator simulator), go to the src/test/cpp/regression folder and run:

# To test the GenFull CPU
# (Don't worry about the CSR test not passing: GenFull isn't the truly full version of the CPU, some CSR features are disabled in it)
make clean run
# To test the GenSmallest CPU
make clean run IBUS=SIMPLE DBUS=SIMPLE CSR=no MMU=no DEBUG_PLUGIN=no MUL=no DIV=no

# In the VexRiscv repository, run the simulation to which OpenOCD can connect =>
sbt "run-main vexriscv.demo.GenFull"
cd src/test/cpp/regression
make run DEBUG_PLUGIN_EXTERNAL=yes
# In the openocd git, after building it =>
src/openocd -c "set VEXRISCV_YAML PATH_TO_THE_GENERATED_CPU0_YAML_FILE" -f tcl/target/vexriscv_sim.cfg
# Run a GDB session with an elf RISCV executable (GenFull CPU)
YourRiscvToolsPath/bin/riscv32-unknown-elf-gdb VexRiscvRepo/src/test/resources/elf/uart.elf
target remote localhost:3333
monitor reset halt
load
continue
# Now it should print messages in the Verilator simulation of the CPU

Murax SoC

Murax is a very light SoC (it fits in an ICE40 FPGA) which can work without any external components:

VexRiscv RV32I[M]

JTAG debugger (Eclipse/GDB/openocd ready)

8 kB of on-chip ram

Interrupt support

APB bus for peripherals

32 GPIO pins

One 16-bit prescaler, two 16-bit timers

One UART with tx/rx FIFO

Depending on the CPU configuration, on the ICE40-hx8k FPGA with icestorm for synthesis, the full SoC has the following area/performance:

RV32I interlocked stages => 51 MHz, 2387 LC, 0.45 DMIPS/MHz

RV32I bypassed stages => 45 MHz, 2718 LC, 0.65 DMIPS/MHz

Its implementation can be found here: src/main/scala/vexriscv/demo/Murax.scala.

To generate the Murax SoC Hardware :

# To generate the SoC without any content in the ram
sbt "run-main vexriscv.demo.Murax"
# To generate the SoC with a demo program already in ram
sbt "run-main vexriscv.demo.MuraxWithRamInit"

The demo program included by default with MuraxWithRamInit will blink the
LEDs and echo characters received on the UART back to the user. To see this
when running the Verilator sim, type some text and press enter.

If you want to add this plugin to a given CPU, you just need to add it to its parameterized plugin list.

This example is a very simple one, but each plugin can really have access to the whole CPU:

Halt a given stage of the CPU

Unschedule instructions

Emit an exception

Introduce new instruction decoding specification

Ask to jump the PC somewhere

Read signals published by other plugins

Override published signal values

Provide an alternative implementation

...

As a demonstrator, this SimdAddPlugin was integrated in the src/main/scala/vexriscv/demo/GenCustomSimdAdd.scala CPU configuration
and is self-tested by the src/test/cpp/custom/simd_add application by running the following commands :

# Generate the CPU
sbt "run-main vexriscv.demo.GenCustomSimdAdd"
cd src/test/cpp/regression/
# Optionally add TRACE=yes if you want to get the VCD waveform from the simulation.
# Also note that, by default, the testbench introduces instruction/data bus stalls.
# Note the CUSTOM_SIMD_ADD flag is set to yes.
make clean run IBUS=SIMPLE DBUS=SIMPLE CSR=no MMU=no DEBUG_PLUGIN=no MUL=no DIV=no DHRYSTONE=no REDO=2 CUSTOM_SIMD_ADD=yes

To retrieve the plugin related signals in your waveform viewer, just filter with simd.

VexRiscv Architecture

VexRiscv is implemented via a 5 stage in-order pipeline on which many optional and complementary plugins add functionalities to provide a functional RISC-V CPU.
This approach is completely unconventional and only possible through meta hardware description languages (SpinalHDL in the current case) but has proven its advantages
via the VexRiscv implementation:

You can swap/turn on/turn off parts of the CPU directly via the plugin system

You can add new functionalities/instructions without having to modify any source code of the CPU

It allows the CPU configuration to cover a very large spectrum of implementation without cooking spaghetti code

It allows your code base to truly produce a parametrized CPU design

If you generate the CPU without any plugin, it will only contain the definition of the 5 pipeline stages and their basic arbitration, but nothing else,
as everything else, including the program counter, is added into the CPU via plugins.

PcManagerSimplePlugin

This plugin implements the program counter and a jump service to all plugins.

| Parameters | type | description |
| --- | --- | --- |
| resetVector | BigInt | Address of the program counter after the reset |
| relaxedPcCalculation | Boolean | By default, jumps have an asynchronous immediate effect on the program counter, which reduces the branch penalty by one cycle but could reduce the FMax, as it combinatorially drives the instruction bus address signal. To avoid this, set this parameter to true: jumps then affect the program counter sequentially, which cuts the combinatorial path but adds one cycle of penalty when a jump occurs. |

This plugin operates on the prefetch stage.

IBusSimplePlugin

This plugin implements the CPU frontend (instruction fetch) via a very simple and neutral memory interface going outside the CPU.

| Parameters | type | description |
| --- | --- | --- |
| catchAccessFault | Boolean | If the read response specifies a read error and this parameter is true, it will generate a CPU exception trap |
| resetVector | BigInt | Address of the program counter after the reset |
| relaxedPcCalculation | Boolean | By default, jumps have an asynchronous immediate effect on the program counter, which reduces the branch penalty by one cycle but could reduce the FMax, as it combinatorially drives the instruction bus address signal. To avoid this, set this parameter to true: jumps then affect the program counter sequentially, which cuts the combinatorial path but adds one cycle of penalty when a jump occurs. |
| relaxedBusCmdValid | Boolean | Same as relaxedPcCalculation, but for the iBus.cmd.valid pin. |
| compressedGen | Boolean | Enable RVC support |
| busLatencyMin | Int | Specifies the minimal latency between the iBus.cmd and the iBus.rsp, which adds the corresponding number of stages into the frontend to keep the IPC at 1. |
| injectorStage | Boolean | Add a stage between the frontend and the decode stage of the CPU to improve FMax. (busLatencyMin + injectorStage) should be at least two. |
| prediction | BranchPrediction | Can be set to NONE/STATIC/DYNAMIC/DYNAMIC_TARGET to specify the branch predictor implementation, see below for more descriptions |
| historyRamSizeLog2 | Int | Specifies the number of entries in the direct mapped prediction cache of the DYNAMIC/DYNAMIC_TARGET implementations: 2 pow historyRamSizeLog2 entries |

Important: There should be at least one cycle of latency between the cmd and the rsp. The iBus.cmd can remove a request when a CPU jump occurs or when the CPU is halted by something in the pipeline. As many arbitrations aren't made for this behaviour, it is important to add a buffer to the iBus.cmd to avoid this. Ex: iBus.cmd.s2mPipe, which adds a zero latency buffer and cuts the iBus.cmd.ready path.
You can also do iBus.cmd.s2mPipe.m2sPipe, which will cut all combinatorial paths of the bus but then has a latency of 1 cycle, which means you should probably set busLatencyMin to 2.

Note that bridges are implemented to convert this interface into AXI4 and Avalon

The jump interface implemented by this plugin allows all other plugins to request jumps. The stage argument specifies the stage from which the jump is requested, which allows the PcManagerSimplePlugin plugin to manage priorities between jump requests.

trait JumpService{
  def createJumpInterface(stage : Stage) : Flow[UInt]
}
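To make the priority idea concrete, here is a small behavioural sketch in plain Scala. It is not the plugin's implementation. It assumes that when several stages request a jump in the same cycle, the request from the latest pipeline stage wins, because it belongs to the oldest instruction. The `JumpArbiter` and `JumpRequest` names and the stage name strings are hypothetical.

```scala
object JumpArbiter {
  // Stages in pipeline order (assumed names); a higher index means an older instruction.
  val stages = List("decode", "execute", "memory", "writeBack")

  final case class JumpRequest(stage: String, target: Long)

  // Assumption: among simultaneous requests, the one from the latest stage wins.
  def select(requests: Seq[JumpRequest]): Option[JumpRequest] =
    if (requests.isEmpty) None
    else Some(requests.maxBy(r => stages.indexOf(r.stage)))
}
```

For example, if the decode stage requests a speculative jump in the same cycle as the memory stage corrects an older branch, this model keeps the memory stage request.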

IBusCachedPlugin

Simple and light multi-way instruction cache.

| Parameters | type | description |
| --- | --- | --- |
| cacheSize | Int | Total storage capacity of the cache |
| bytePerLine | Int | Number of bytes per cache line |
| wayCount | Int | Number of cache ways |
| twoCycleRam | Boolean | Check the tag values in the decode stage instead of the fetch stage to relax timings |
| asyncTagMemory | Boolean | Read the cache tags in an asynchronous manner instead of a synchronous one |
| addressWidth | Int | Address width, should be 32 |
| cpuDataWidth | Int | CPU data width, should be 32 |
| memDataWidth | Int | Memory data width, could potentially be something other than 32, but only 32 is currently tested |
| catchIllegalAccess | Boolean | Catch when a memory access is done to a non-valid memory address (MMU) |
| catchAccessFault | Boolean | Catch when the memory bus responds with an error |
| catchMemoryTranslationMiss | Boolean | Catch when the MMU misses a TLB |
| resetVector | BigInt | Address of the program counter after the reset |
| relaxedPcCalculation | Boolean | By default, jumps have an asynchronous immediate effect on the program counter, which reduces the branch penalty by one cycle but could reduce the FMax, as it combinatorially drives the instruction bus address signal. To avoid this, set this parameter to true: jumps then affect the program counter sequentially, which cuts the combinatorial path but adds one cycle of penalty when a jump occurs. |
| compressedGen | Boolean | Enable RVC support |
| prediction | BranchPrediction | Can be set to NONE/STATIC/DYNAMIC/DYNAMIC_TARGET to specify the branch predictor implementation, see below for more descriptions |
| historyRamSizeLog2 | Int | Specifies the number of entries in the direct mapped prediction cache of the DYNAMIC/DYNAMIC_TARGET implementations: 2 pow historyRamSizeLog2 entries |

Note: If you enable the twoCycleRam option and wayCount is bigger than one, then the register file plugin should be configured to read the regFile in an asynchronous manner.
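As a rough guide to how cacheSize, bytePerLine and wayCount interact, the sketch below computes the usual set-associative address split. It is a generic calculation rather than code from the plugin. It assumes cacheSize and bytePerLine are expressed in bytes and that all sizes are powers of two. `CacheGeometry` and `Layout` are names invented for the example.

```scala
object CacheGeometry {
  // Integer log2, only defined here for exact powers of two.
  def log2(n: Int): Int = {
    require(n > 0 && (n & (n - 1)) == 0, s"$n is not a power of two")
    Integer.numberOfTrailingZeros(n)
  }

  final case class Layout(offsetBits: Int, indexBits: Int, tagBits: Int, linesPerWay: Int)

  // Standard set-associative split; assumes cacheSize and bytePerLine are in bytes.
  def layout(cacheSize: Int, bytePerLine: Int, wayCount: Int, addressWidth: Int = 32): Layout = {
    val linesPerWay = cacheSize / bytePerLine / wayCount
    val offsetBits  = log2(bytePerLine) // byte position inside a line
    val indexBits   = log2(linesPerWay) // line selection inside a way
    Layout(offsetBits, indexBits, addressWidth - offsetBits - indexBits, linesPerWay)
  }
}
```

Under these assumptions, a 4096 byte cache with 32 byte lines and a single way has 128 lines, 5 offset bits, 7 index bits and 20 tag bits. Doubling wayCount halves the number of lines per way and moves one bit from the index to the tag.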

DecoderSimplePlugin

For instance, for a given instruction, the pipeline hazard plugin needs to know if it uses the register file source 1/2 in order to stall the pipeline until the hazard is gone.
To provide this kind of information, each plugin which implements an instruction documents this kind of information to the DecoderSimplePlugin plugin.

| Parameters | type | description |
| --- | --- | --- |
| catchIllegalInstruction | Boolean | If set to true, instructions which have no decoding specification will generate a trap exception |

Here is a usage example :

//Specify the instruction decoding which should be applied when the instruction matches the 'key' pattern
decoderService.add(
  //Bit pattern of the new instruction
  key = M"0000011----------000-----0110011",

  //Decoding specification when the 'key' pattern is recognized in the instruction
  List(
    IS_SIMD_ADD              -> True,
    REGFILE_WRITE_VALID      -> True, //Enable the register file write
    BYPASSABLE_EXECUTE_STAGE -> True, //Notify the hazard management unit that the instruction result is already accessible in the EXECUTE stage (Bypass ready)
    BYPASSABLE_MEMORY_STAGE  -> True, //Same as above but for the memory stage
    RS1_USE                  -> True, //Notify the hazard management unit that this instruction uses the RS1 value
    RS2_USE                  -> True  //Same as above but for RS2.
  )
)

This plugin operates in the Decode stage.
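The key in the usage example is a 32 character pattern where '-' marks a don't-care bit. The following plain Scala sketch shows one way such a pattern can be checked against an instruction word, by turning it into a mask and a value. It is a software model for illustration only. `MaskedPattern` is a hypothetical helper, not part of SpinalHDL or VexRiscv.

```scala
object MaskedPattern {
  // Turn a 32 character pattern of '0', '1' and '-' into (mask, value).
  // The leftmost character is bit 31, as in the usual RISC-V encoding diagrams.
  def compile(pattern: String): (Long, Long) = {
    require(pattern.length == 32, "expected a 32-bit pattern")
    pattern.foldLeft((0L, 0L)) { case ((mask, value), c) =>
      c match {
        case '0' => ((mask << 1) | 1, value << 1)
        case '1' => ((mask << 1) | 1, (value << 1) | 1)
        case '-' => (mask << 1, value << 1) // don't care: not part of the mask
        case _   => throw new IllegalArgumentException(s"unexpected character '$c'")
      }
    }
  }

  // An instruction matches when every bit covered by the mask equals the pattern.
  def matches(pattern: String, instruction: Long): Boolean = {
    val (mask, value) = compile(pattern)
    (instruction & mask) == value
  }
}
```

With the key above, only the funct7, funct3 and opcode fields are compared, so any combination of rd, rs1 and rs2 matches as long as those three fields are 0000011, 000 and 0110011.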

RegFilePlugin

This plugin implements the register file.

| Parameters | type | description |
| --- | --- | --- |
| regFileReadyKind | RegFileReadKind | Can be set to ASYNC or SYNC. Specifies the kind of memory read used to implement the register file. ASYNC means zero cycle latency memory read, while SYNC means one cycle latency memory read, which can be mapped into standard FPGA memory blocks |
| zeroBoot | Boolean | Load all registers with zeroes at the beginning of simulations to keep everything deterministic in logs/traces |

This register file uses a don't care read-during-write policy, so the bypassing/hazard plugin should take care of this.

HazardSimplePlugin

This plugin checks the pipeline instruction dependencies and, if necessary or possible, will stop the instruction in the decoding stage or bypass the instruction results
from the later stages to the decode stage.

Since the register file is implemented with a don't care read-during-write policy, this plugin also manages this kind of hazard.

| Parameters | type | description |
| --- | --- | --- |
| bypassExecute | Boolean | Enable the bypassing of instruction results coming from the Execute stage |
| bypassMemory | Boolean | Enable the bypassing of instruction results coming from the Memory stage |
| bypassWriteBack | Boolean | Enable the bypassing of instruction results coming from the WriteBack stage |
| bypassWriteBackBuffer | Boolean | Enable the bypassing of the previous cycle register file written value |

SrcPlugin

This plugin muxes different input values to produce SRC1/SRC2/SRC_ADD/SRC_SUB/SRC_LESS values which are common values used by many plugins in the execute stage (ALU/Branch/Load/Store).

| Parameters | type | description |
| --- | --- | --- |
| separatedAddSub | Boolean | By default SRC_ADD/SRC_SUB are generated from a single controllable adder/subtractor, but if this is set to true, it uses separate adders/subtractors |
| executeInsertion | Boolean | By default SRC1/SRC2 are generated in the Decode stage, but if this parameter is true, it is done in the Execute stage (it will relax the bypassing network) |

Except for SRC1/SRC2, this plugin does everything at the beginning of the Execute stage.

IntAluPlugin

This plugin implements all ADD/SUB/SLT/SLTU/XOR/OR/AND/LUI/AUIPC instructions in the execute stage by using the SrcPlugin outputs. It is a really simple plugin.

The result is injected into the pipeline directly at the end of the execute stage.

LightShifterPlugin

Implements SLL/SRL/SRA instructions by using an iterative shift register, which takes one cycle per bit shifted.

The result is injected into the pipeline directly at the end of the execute stage.
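The cost of the iterative approach can be seen in a small software model. This sketch only mirrors the "one cycle per bit" description for SLL. It is not the hardware description, and `IterativeShifter` is a name made up for the example.

```scala
object IterativeShifter {
  // Shift `value` left one position per iteration.
  // Returns the 32-bit result and the number of iterations ("cycles") used.
  def sll(value: Int, amount: Int): (Int, Int) = {
    val steps  = amount & 31 // RV32 shifts only use the low 5 bits of the amount
    var result = value
    var cycles = 0
    while (cycles < steps) {
      result <<= 1
      cycles += 1
    }
    (result, cycles)
  }
}
```

In this model, a shift by 31 takes 31 iterations while a shift by 1 takes a single one, which is the trade-off against the single cycle barrel shifter.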

FullBarrelShifterPlugin

Implements SLL/SRL/SRA instructions by using a full barrel shifter, so it executes all shifts in a single cycle.

| Parameters | type | description |
| --- | --- | --- |
| earlyInjection | Boolean | By default the result of the shift is injected into the pipeline in the Memory stage to relax timings, but if this option is true it will be done in the Execute stage |

BranchPlugin

This plugin implements all branch/jump instructions (JAL/JALR/BEQ/BNE/BLT/BGE/BLTU/BGEU) with primitives used by the CPU frontend plugins to implement branch prediction. The prediction implementation is set in the frontend plugins (IBusX).

| Parameters | type | description |
| --- | --- | --- |
| earlyBranch | Boolean | By default the branch is done in the Memory stage to relax timings, but if this option is set it's done in the Execute stage |
| catchAddressMisaligned | Boolean | If a jump/branch is done to an unaligned PC address, it will fire a trap exception |

Each mispredicted jump produces a penalty of between 2 and 4 cycles, depending on the earlyBranch and the PcManagerSimplePlugin.relaxedPcCalculation configurations.

Prediction NONE

No prediction: each PC change due to a jump/branch will produce a penalty.

Prediction STATIC

In the decode stage, a conditional branch pointing backwards or a JAL is branched speculatively. If the speculation is right, the branch penalty is reduced to a single cycle,
otherwise the standard penalty is applied.
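The STATIC rule described above can be written as a one-line predicate. The sketch below is a plain Scala restatement of that rule, with hypothetical parameter names, where `offset` is the signed branch offset taken from the instruction.

```scala
object StaticPrediction {
  // Predict "taken" for JAL and for conditional branches that point backwards.
  def predictTaken(isJal: Boolean, isConditionalBranch: Boolean, offset: Int): Boolean =
    isJal || (isConditionalBranch && offset < 0)
}
```

Backward branches are typically loop back-edges, which is why predicting them as taken tends to pay off.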

Prediction DYNAMIC

Same as the STATIC prediction, except that to do the prediction, it uses a direct mapped 2 bit history cache (BHT) which remembers if the branch is more likely to be taken or not.
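A direct mapped table of 2 bit counters can be modelled in a few lines of plain Scala. This is a textbook saturating counter scheme given as a sketch. The counter encoding, the initial value and the choice of PC bits used as index are assumptions, not details taken from the VexRiscv sources, and `BranchHistoryTable` is a hypothetical name.

```scala
class BranchHistoryTable(historyRamSizeLog2: Int) {
  private val entries = 1 << historyRamSizeLog2
  // Assumed encoding: 0-1 predict not taken, 2-3 predict taken.
  // Entries start at 1 (weakly not taken), which is also an assumption.
  private val counters = Array.fill(entries)(1)

  // Direct mapped: drop the two byte-offset bits of the PC, keep the low index bits.
  private def index(pc: Long): Int = ((pc >> 2) & (entries - 1)).toInt

  def predictTaken(pc: Long): Boolean = counters(index(pc)) >= 2

  // Saturating update once the real outcome of the branch is known.
  def update(pc: Long, taken: Boolean): Unit = {
    val i = index(pc)
    counters(i) = if (taken) math.min(3, counters(i) + 1) else math.max(0, counters(i) - 1)
  }
}
```

Because the table is direct mapped and untagged in this model, two branches whose PCs share the same index bits also share a counter. A larger historyRamSizeLog2 reduces that aliasing.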

Prediction DYNAMIC_TARGET

This predictor uses a direct mapped branch target buffer (BTB) in the Fetch stage which stores the PC of the instruction, the target PC of the instruction and a 2 bit history to remember
if the branch is more likely to be taken or not. This is the most efficient branch predictor currently implemented on VexRiscv: when the branch prediction is right, it produces no branch penalty.
The downside is that this predictor has a long combinatorial path coming from the prediction cache read port to the program counter by passing through the jump interface.

DBusSimplePlugin

This plugin implements the load and store instructions (LB/LH/LW/LBU/LHU/LWU/SB/SH/SW) via a simple and neutral memory bus going out of the CPU.

| Parameters | type | description |
| --- | --- | --- |
| catchAddressMisaligned | Boolean | If a memory access is done to an unaligned memory address, it will fire a trap exception |
| catchAccessFault | Boolean | If a memory read returns an error, it will fire a trap exception |
| earlyInjection | Boolean | By default, the memory read values are injected into the pipeline in the WriteBack stage to relax the timings. If this parameter is true, it's done in the Memory stage |

Note that bridges are available to convert this interface into AXI4 and Avalon

There is at least one cycle latency between a cmd and the corresponding rsp. The rsp.ready flag should be false after a read cmd until the rsp is present.

DBusCachedPlugin

Single way cache implementation with a victim buffer. (Documentation is WIP)

MulPlugin

Implements the multiplication instructions from the RISC-V M extension. The implementation is done in an FPGA friendly way by using four 17*17 bit multiplications.
The processing is fully pipelined across the Execute/Memory/WriteBack stages. The results of the instructions are always inserted in the WriteBack stage.
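The arithmetic behind splitting a 32 bit multiplication into four smaller ones can be checked with a short software model. The sketch below handles the signed case only, using 16 bit halves. The assumption here is that the 17th bit mentioned above exists to let each half be treated as signed or unsigned. `SplitMultiply` is a hypothetical name, and the code does not model the pipelining.

```scala
object SplitMultiply {
  // Signed 32x32 -> 64-bit product built from four partial products of 16-bit halves.
  def mul(a: Int, b: Int): Long = {
    val aLow  = (a & 0xFFFF).toLong // low half, treated as unsigned
    val aHigh = (a >> 16).toLong    // high half, keeps the sign
    val bLow  = (b & 0xFFFF).toLong
    val bHigh = (b >> 16).toLong
    // a = aHigh * 2^16 + aLow and b = bHigh * 2^16 + bLow, so the product expands to:
    (aLow * bLow) + ((aLow * bHigh + aHigh * bLow) << 16) + ((aHigh * bHigh) << 32)
  }
}
```

Each partial product fits comfortably in a small hardware multiplier, and the final result is obtained with shifts and additions only.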

DivPlugin

Implements the division/modulo instruction from the RISC-V M extension. It is done in a simple iterative way which always takes 34 cycles. The result is inserted into the
Memory stage.

This plugin is now based on the MulDivIterativePlugin one.

MulDivIterativePlugin

This plugin implements the multiplication, division and modulo of the RISC-V M extension in an iterative way, which is friendly for small FPGAs that don't have DSP blocks.

This plugin is able to unroll the iterative calculation process to reduce the number of cycles used to execute mul/div instructions.

| Parameters | type | description |
| --- | --- | --- |
| genMul | Boolean | Enables multiplication support. Can be set to false if you want to use the MulPlugin instead |
| genDiv | Boolean | Enables division support |
| mulUnrollFactor | Int | Number of combinatorial stages used to speed up the multiplication, should be > 0 |
| divUnrollFactor | Int | Number of combinatorial stages used to speed up the division, should be > 0 |

The number of cycles used to execute a multiplication is '32/mulUnrollFactor'
The number of cycles used to execute a division is '32/divUnrollFactor + 1'

Both mul/div are processed into the memory stage (late result).
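The two cycle-count formulas above are easy to tabulate. The helper below is a hypothetical convenience, not part of the plugin. It assumes integer division and therefore only accepts unroll factors that divide 32 evenly.

```scala
object IterativeMulDivCycles {
  // Formula from the text: 32 / mulUnrollFactor.
  def mulCycles(mulUnrollFactor: Int): Int = {
    require(mulUnrollFactor > 0 && 32 % mulUnrollFactor == 0, "assumed: factor divides 32")
    32 / mulUnrollFactor
  }

  // Formula from the text: 32 / divUnrollFactor + 1.
  def divCycles(divUnrollFactor: Int): Int = {
    require(divUnrollFactor > 0 && 32 % divUnrollFactor == 0, "assumed: factor divides 32")
    32 / divUnrollFactor + 1
  }
}
```

For instance, going from an unroll factor of 1 to 4 reduces a multiplication from 32 to 8 cycles in this model, at the cost of more combinatorial logic per cycle.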

CsrPlugin

Implements most of the Machine mode and a few of the User mode registers as specified in the RISC-V privileged spec.
The access mode of most of the CSR is parameterizable (NONE/READ_ONLY/WRITE_ONLY/READ_WRITE) to reduce the area usage of unneeded features.

(CsrAccess can be NONE/READ_ONLY/WRITE_ONLY/READ_WRITE)

| Parameters | type | description |
| --- | --- | --- |
| catchIllegalAccess | Boolean | |
| mvendorid | BigInt | |
| marchid | BigInt | |
| mimpid | BigInt | |
| mhartid | BigInt | |
| misaExtensionsInit | Int | |
| misaAccess | CsrAccess | |
| mtvecAccess | CsrAccess | |
| mtvecInit | BigInt | |
| mepcAccess | CsrAccess | |
| mscratchGen | Boolean | |
| mcauseAccess | CsrAccess | |
| mbadaddrAccess | CsrAccess | |
| mcycleAccess | CsrAccess | |
| minstretAccess | CsrAccess | |
| ucycleAccess | CsrAccess | |
| wfiGen | Boolean | |
| ecallGen | Boolean | |

If an interrupt occurs, before jumping to mtvec, the plugin will stop the Prefetch stage and wait for all the instructions in the later pipeline stages to complete their execution.

If an exception occurs, the plugin will kill the corresponding instruction, flush all previous instructions, and wait until the previously killed instructions reach the WriteBack
stage before jumping to mtvec.

StaticMemoryTranslatorPlugin

Static memory translator plugin which allows one to specify which range of the memory addresses is IO mapped and shouldn't be cached.
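The kind of address classification this plugin performs can be pictured as a simple predicate on the address. The sketch below uses a made-up memory map in which the top 256 MB are IO. Both the map and the `IoRange` name are assumptions for illustration, not the plugin's interface.

```scala
object IoRange {
  // Hypothetical memory map: 0xF0000000 to 0xFFFFFFFF is IO mapped.
  def isIo(address: Long): Boolean = ((address >> 28) & 0xF) == 0xF

  // Anything that is not IO may go through the cache in this model.
  def isCacheable(address: Long): Boolean = !isIo(address)
}
```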

MemoryTranslatorPlugin

Simple software refilled MMU implementation. Allows other plugins such as DBusCachedPlugin/IBusCachedPlugin to instantiate memory address translation ports. Each port has a small dedicated
fully associative TLB cache which is refilled from a larger software filled TLB cache via a query which looks up one entry per cycle.

DebugPlugin

This plugin implements enough CPU debug features to allow comfortable GDB/Eclipse debugging. To access those debug features, it provides a simple memory bus interface.
The JTAG interface is provided by another bridge, which makes it possible to efficiently connect multiple CPUs to the same JTAG.

| Parameters | type | description |
| --- | --- | --- |
| debugClockDomain | ClockDomain | As the debug unit is able to reset the CPU itself, it should use another clock domain to avoid killing itself (only the reset wire should differ) |

The internals of the debug plugin are done in a manner which reduces the area usage and the FMax impact of this plugin.

Here is the simple bus to access it; the rsp comes one cycle after the request:

YamlPlugin

This plugin offers a service to other plugins to generate a useful YAML file about the CPU configuration. It contains, for instance, the sequence of instructions required
to flush the data cache (information used by openocd).