There is a plan to publish a microarchitecture specification to make it easy
for others to implement an equivalent design in the language of their choice.

The Zscale is slightly larger than the Cortex-M0 due to having 32 vs 16
registers, 64-bit performance counters, and a fast multiply and divide. The
plan is to add an option to generate a Zscale implementing RV32E (i.e. only
having 16 registers).

Zscale is only 604 lines of Chisel: 274 for control, 267 for the
datapath, and 63 for the top level, combined with 983 lines borrowed from Rocket.

A Verilog implementation of Z-scale is also under way. It’s currently
1215 lines of code.

The repo is here, but Yunsup needs to
do a little more work to make it easily buildable. There will be a blog post
on the RISC-V site soon.

All future Rocket development will move to the public
rocket-chip repo!

Memory interfaces:

TileLink is the Berkeley cache-coherent interconnect

NASTI (Berkeley implementation of AXI4)

HASTI (implementation of AHB-lite)

POCI (implementation of APB)

The plan is to dump HTIF in Rocket, and add a standard JTAG debug interface.

Future work for Z-Scale includes a microarchitecture document, improving
performance, implementing the C (compressed) extension, adding an MMU option, and adding
more devices.

BOOM. The Berkeley Out-of-Order Machine: Chris Celio

BOOM is a (work in progress) superscalar, out-of-order RISC-V processor
written in Chisel.

Chris argues there’s been a general lack of effort in academia to build and
evaluate out-of-order designs. As he points out, much research relies on
software simulators with no area or power numbers.

Some of the difficult questions for BOOM are which benchmarks to use, and
how many cycles you need to run. He points out that mapped to an FPGA running
at 50MHz, running the SPEC benchmarks would take around a day on a cluster of
FPGAs.

The fact that rs1, rs2, rs3 and rd are always in the same position in the
RISC-V instruction encoding allows decode and rename to proceed in parallel.
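As a rough sketch of why this helps: the register specifier fields sit at fixed bit positions in every base 32-bit RISC-V instruction (rd in bits 11:7, rs1 in 19:15, rs2 in 24:20, rs3 in 31:27), so they can be extracted before the opcode is fully decoded. The Python below models that extraction; the example encoding is `add x3, x1, x2`:

```python
# Register specifier fields sit at fixed bit positions in every
# 32-bit RISC-V instruction, so hardware can extract them up front,
# in parallel with (or before) full opcode decode.

def register_fields(insn: int) -> dict:
    return {
        "rd":  (insn >> 7)  & 0x1F,   # bits 11:7
        "rs1": (insn >> 15) & 0x1F,   # bits 19:15
        "rs2": (insn >> 20) & 0x1F,   # bits 24:20
        "rs3": (insn >> 27) & 0x1F,   # bits 31:27 (R4-type, e.g. fused multiply-add)
    }

# add x3, x1, x2 encodes as 0x002081B3
fields = register_fields(0x002081B3)
print(fields["rd"], fields["rs1"], fields["rs2"])  # 3 1 2
```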

BOOM supports the full RV64G standard. It benefits from reusing Rocket as a
library of components.

The second RISC-V workshop is going
on today and tomorrow in Berkeley, California. I’ll be keeping a semi-live
blog of talks and announcements throughout the day.

Introductions and welcome: Krste Asanović

The beginning of Krste’s talk will be familiar to anyone who’s seen an
introduction to RISC-V before. Pleasingly, there are a lot of new faces here
at the workshop, so the introduction makes a lot of sense.

Although the core RISC-V effort is focused on the ISA specification, there
is interest in expanding it to standardise access to I/O etc.

RV32E is a “pre-emptive strike” at those who might be tempted to fragment
the ISA space for very small cores. It is a 16-register subset of RV32I.

The compressed instruction set has been released since the last workshop;
there will be a talk about it later today. It gives 25-30% code size reduction,
and surprisingly there’s still lots of 16-bit encode space for additional
extensions.

Krste makes the point that AArch64 has 8 addressing modes vs just 1 for
RISC-V. The comparison of the size of the GCC/LLVM backends is perhaps less
interesting given that the ARM backend actually has rather a lot more
optimisations.

“Simplicity breeds contempt”. “So far, no evidence more complex ISA is
justified for general code”

Will be talking about a Cray-style vector ISA extension later today (i.e.
not packed-SIMD ISA or GPU-style).

Rocket core is only about ~12kloc of Chisel in total. ~5kloc for the
processor, ~2kloc for floating-point units, ~4.6kloc for ‘uncore’ (coherence
hubs, L2, caches etc).

State of the RISC-V Nation: many companies are ‘kicking the tires’. If you were
thinking of designing your own RISC ISA for a project, then use RISC-V. If you
need a complete, working, supported core today then pay $M for an industry core.
If you need it in 6 months, then consider spending that $M on RISC-V
development.

Points out that Thumb2 is only a 32-bit address ISA. Although it is slightly
smaller than RV32C, the RISC-V compressed spec has the benefit of supporting
64-bit addressing.

Rather than adding the complexity of load multiple and store multiple,
experimented with adding calls to a function that does the same thing. This
hurts performance, but gives a large benefit for code size.

One question was on the power consumption impact. Don’t have numbers on that
yet.

Should we require the compressed instruction set? Don’t want to add it to
the minimal ‘I’ instruction set, but could add it to the standard expected by
Linux.

GoblinCore64. A RISC-V Extension for Data Intensive Computing: John Leidel

Building a processor design aimed at data intensive algorithms and
applications. Applications tend to be very cache unfriendly.

GC64 (GoblinCore) has a thread control unit: a very small microcoded unit
(e.g. implementing RV64C) performs the context-switching task.

The GKEY supervisor register contains a 64-bit key loaded by the kernel. It
determines whether a task may spawn and execute work on neighboring task
processors, providing a very rudimentary protection mechanism.

Making use of RV128I - it’s not just there for fun!

Supports various instruction extensions, e.g. IWAIT, SPAWN, JOIN, GETTASK,
SETTASK: the basic operations needed to write a thread management system (such
as pthreads), implemented as microcoded instructions in the RISC-V ISA.

Also attempting to define the data structures which contain task queue data.

Vector Extension Proposal: Krste Asanović

Goals: efficient and scalable to all reasonable design points. Be a good
compiler target, and to support implicit auto-vectorisation through OpenMP and
explicit SPMD (OpenCL) programming models. Want to work with virtualisation
layers, and fit into the standard 32-bit encoding space.

Krste is critical of GPUs for general compute. I won’t attempt to summarise
his arguments here, but the slides will be well worth a read. Krste has spent
decades working on vector machines.

With packed SIMD you tend to need completely new instructions for wider
SIMD. Traditional vector machines allow you to set the vector length register
to provide a more uniform programming model. This makes loop strip-mining more
straight-forward.
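The strip-mining idea can be sketched in software. In the model below, `setvl` stands in for a vector-length instruction that caps the requested length at the machine’s maximum; `MAXVL` is a hypothetical hardware limit, so the same loop works unchanged on any implementation:

```python
# Model of traditional vector strip-mining: a setvl-style instruction
# caps the requested vector length at the machine's maximum, so one
# loop handles any trip count on any implementation.
MAXVL = 8  # hypothetical hardware vector length

def setvl(requested: int) -> int:
    return min(requested, MAXVL)

def vector_add(dst, a, b):
    n = len(a)
    i = 0
    while i < n:
        vl = setvl(n - i)        # elements processed this iteration
        for j in range(vl):      # stands in for one vector instruction
            dst[i + j] = a[i + j] + b[i + j]
        i += vl

a = list(range(20))
b = [1] * 20
out = [0] * 20
vector_add(out, a, b)
print(out[:5])  # [1, 2, 3, 4, 5]
```

The point is that widening the hardware (raising `MAXVL`) needs no change to the loop, unlike packed SIMD where wider registers mean new instructions.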

Mixed-precision support allows you to subdivide a physical register into
multiple narrower architectural registers as requested.

The same binary code works regardless of the number of physical register bits
and the number of physical lanes.

Use a polymorphic instruction encoding, e.g. a single signed integer ADD
opcode that works on different size inputs and outputs.
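A toy model of the polymorphic idea: one add operation whose element width comes from configured register state rather than from a per-width opcode. The specific widths here are illustrative assumptions, not the proposal’s encoding:

```python
# Toy model of a polymorphic ADD: a single operation whose element
# width is taken from configured register state, not from the opcode.
# The 8/16-bit widths below are purely illustrative.

def poly_add(width_bits: int, a: int, b: int) -> int:
    mask = (1 << width_bits) - 1
    return (a + b) & mask

print(poly_add(8, 250, 10))   # 4   (wraps at 8 bits)
print(poly_add(16, 250, 10))  # 260 (fits in 16 bits)
```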

Have separate integer and floating-point loads and stores, where the size is
inherent in the destination register number.

All instructions are implicitly predicated under the first predicate
register by default.

What is the difference between V and Hwacha? Hwacha is a non-standard
Berkeley vector extension designed to push the state-of-the-art for
in-order/decoupled vector machines. There are similarities in the lane
microarchitecture. Current focus is bringing up OpenCL for Hwacha, with the V
extension to follow.

Restartable page faults are supported, similar to the DEC VAX vector extension.

Krste pleads with people not to implement a packed SIMD extension, pointing out
that a minimal V implementation would be very space efficient.

Privileged Architecture Proposal: Andrew Waterman

For a simple embedded system that only needs M-mode there is a low
implementation cost. Only 2^7 bits of architectural state in addition to the
user ISA, plus 2^7 more bits for timers and another 2^7 for basic performance
counters.

Defined the basic virtual memory architectures to support current Unix-style
operating systems. The design is fairly conventional, using 4KiB pages.

Why go with 4KiB pages rather than 8KiB as was the initial plan? Concerned
with porting software hard-coded to expect 4KiB pages. Also concerns about
internal fragmentation.

Physical memory attributes such as cacheability are not encoded in the page
table in RISC-V. Two major reasons Andrew disagrees with encoding them there
are that the required granularity may not be tied to the page size, and that
it is problematic for virtualisation. Potentially, coherent DMA will become
more common, meaning you needn’t worry about these attributes.

Want to support device interactions via a virtio-style interface.

The draft Supervisor Binary Interface will be released with the next
privileged ISA draft. It includes common functionality for TLB shootdowns,
reboot/shutdown, sending inter-processor interrupts, etc. This is a similar
idea to PALcode on the Alpha.

Hardware-accelerated virtualization (H-mode) is planned, but not yet
specified.

A draft version of v1.8 of the spec is expected this summer, with a frozen
v2.0 targeted for the fall.

There are more 10Gbps RapidIO ports on the planet than there are 10Gbps
Ethernet ports. This is primarily due to the 100% market penetration in 4G/LTE
and 60% global 3G.

The IIT Madras team are using RapidIO extensively for their RISC-V work.

Has been doing work in the data center and HPC space. Looking to use the AXI
ACE and connect that to RapidIO.

There is interesting work on an open source RapidIO stack.

CAVA. Cluster in a rack: Peter Hsu

Problem: designing a new computer is expensive. But 80% is the same every
time.

CAVA is not the same as the Oracle RAPID project.

Would like to build a 1024-node cluster in a rack. DDR4 3200 = 25.6GB/s per
64-bit channel. Each 1U card would be about 600W with 32 nodes.
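The bandwidth figure follows directly from the transfer rate: DDR4-3200 means 3200 million transfers per second over a 64-bit (8-byte) channel. A quick sanity check of the arithmetic:

```python
# DDR4-3200: 3200 MT/s over a 64-bit (8-byte) channel.
transfers_per_sec = 3200e6
bytes_per_transfer = 64 / 8
bandwidth_gb_s = transfers_per_sec * bytes_per_transfer / 1e9
print(bandwidth_gb_s)  # 25.6
```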

Looking at a 96-core 10nm chip (scaled from a previous 350nm project).
Suppose you have a 3-issue out-of-order core (600K gates) and 32KiB I+D
caches; that would be around 0.24mm^2 in 10nm.

Estimate a vector unit might be around the same area.

Peter has detailed estimates for per-chip power, but it’s probably best to
refer to the slides for these.

Research plan for the cluster involves a unified simulation environment,
running on generic clusters of x86 using open-source software. Everyone uses
the same simulator to perform “apples to apples” comparison. This allows easy
replication of published work.

lowRISC was fortunate enough to be chosen as a mentoring organisation in this
year’s Google Summer of Code, a program that funds students to work on open
source projects over the summer.
We had 52 applications across the range of project
ideas we’ve been advertising.
As you can see from the range of project ideas, lowRISC is taking part as an
umbrella organisation, working with a number of our friends in the wider open
source software and hardware community.
We were allocated three slots from Google, and given the volume of high
quality applications making the selection was tremendously difficult. We have
actually been able to fund an additional three applicants from other sources,
but even then there were many promising projects we couldn’t support. We are
extremely grateful to all the students who put so much time and effort into
their proposals, and to everyone who volunteered to mentor. The six ‘summer of
code’ projects for lowRISC are:

Baptiste will be working with an Emscripten-compiled version of
the Yosys logic synthesis tool, building an
online Verilog IDE on top
of it which would be particularly suitable for training and teaching
materials. A big chunk of the proposed work is related to visualisation of the
generated logic. Improving the accessibility of hardware design is essential for
growing the potential contributor base to open source hardware
projects like lowRISC, and this is just the start of our efforts in that
space.

seL4 is a formally verified microkernel, which
currently has ports
for x86 and ARM. Hesham will be performing a complete port to
RISC-V/lowRISC. Security and microkernels are of great interest to
many in the community. It’s also a good opportunity to expand RISC-V platform
support and to put the recently released RISC-V Privileged Architecture
Specification
through its paces. Hesham previously performed a port of RTEMS to
OpenRISC.

jor1k is by far the
fastest
Javascript-based full system
simulator. It also features a network device, filesystem support, and
a framebuffer. Prannoy will be adding support for RISC-V and looking at
supporting some of the features we offer on lowRISC, such as minion
cores or tagged
memory.
This will be great not only as a demo, but
will also have practical uses in tutorial or educational material.

The intention here is to get a rump kernel
(essentially a libified
NetBSD) running bare-metal on a simple RISC-V system and evaluate
exposing the TCP/IP stack for use by other cores. e.g. a TCP/IP
offload engine running on a minion core. TCP offload is a good
starting point, but of course the same concept could be applied
elsewhere. For example, running a USB mass storage driver (and filesystem
implementation) on a minion core and providing a simple high-level
interface to the application cores.

Tavor is a sophisticated fuzzing tool
implemented in Go. Yoann
will be extending it to more readily support specifying instruction
set features and generating a fuzzing suite targeting an ISA such as
RISC-V. Yoann has some really interesting ideas on how to go about
this, so I’m really interested in seeing where this one ends up.

Implement a Wishbone to TileLink bridge and extend TileLink
documentation. Thomas Repetti mentored by Wei Song

Wishbone is the
interconnect of choice for most existing open
source IP cores, including most devices on
opencores.org. The Berkeley
Rocket RISC-V implementation uses
their own ‘TileLink’ protocol (we provide a brief
overview). By providing a
reusable bridge, this project will allow the easy reuse of opencores devices
and leverage the many man-years of effort that have already gone into them.

The first 3 of the above projects are part of Google Summer of Code
and the remaining 3 are directly funded, operating over roughly the same timeline.
We’re also going to be having two local
students interning with us here at the University of Cambridge
Computer Lab starting towards the end of June, so it’s going to be a
busy and productive summer. It bears repeating just how much we appreciate the
support of everyone involved so far - Google through their Summer of Code
initiative, the students, and those who’ve offered to act as mentors. We’re
very excited about these projects, so please join us in welcoming the students
involved to our community. If you have any questions, suggestions, or guidance
please do leave them in the comments.

We’re pleased to announce the first lowRISC preview release, demonstrating support for tagged memory as
described in our memo. Our ambition with lowRISC is to provide an open-source System-on-Chip
platform for others to build on, along with low-cost development boards
featuring a reference implementation. Although there’s more work to be done on
the tagged memory implementation, now seemed a good time to document what
we’ve done in order for the wider community to take a look. Please see our
full tutorial which describes in some
detail the changes we’ve made to the Berkeley Rocket
core, as well as how you can build and try
it out for yourself (either in simulation, or on an FPGA). We’ve gone to some
effort to produce this documentation, both to document our work, and to share
our experiences building upon the Berkeley RISC-V code releases in the hopes
they’ll be useful to other groups.

The initial motivation for tagged memory was to prevent control-flow hijacking
attacks, though there are a range of other potential uses including
fine-grained memory synchronisation, garbage collection, and debug tools.
Please note that the instructions used to manipulate tagged memory in this
release (ltag and stag) are only temporary and chosen simply because they
require minimal changes to the core pipeline. Future work will include
exploring better ISA support, collecting performance numbers across a range of
tagged memory uses and tuning the tag cache. We are also working on developing
an ‘untethered’ version of the SoC with the necessary peripherals integrated
for standalone operation.
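As a rough illustration of the programming model (not the actual HDL, and with tag width and semantics that are assumptions for this sketch), a software model of tagged memory: every word carries a small tag alongside its data, with ltag/stag-style accessors that read and write the tag independently of normal loads and stores:

```python
# Toy model of tagged memory: every 64-bit word carries a small tag.
# load_tag/store_tag mirror the temporary ltag/stag instructions in
# spirit only; the 4-bit tag width and the behaviour of ordinary
# stores here are illustrative assumptions.

class TaggedMemory:
    def __init__(self):
        self.data = {}   # word address -> 64-bit value
        self.tags = {}   # word address -> tag bits

    def store(self, addr, value):          # ordinary store
        self.data[addr] = value & (2**64 - 1)

    def load(self, addr):                  # ordinary load
        return self.data.get(addr, 0)

    def store_tag(self, addr, tag):        # stag-style tag write
        self.tags[addr] = tag & 0xF

    def load_tag(self, addr):              # ltag-style tag read
        return self.tags.get(addr, 0)

mem = TaggedMemory()
mem.store(0x1000, 0xDEADBEEF)
mem.store_tag(0x1000, 0x1)   # e.g. mark the word as a protected return address
print(hex(mem.load(0x1000)), mem.load_tag(0x1000))  # 0xdeadbeef 1
```

A control-flow-hijacking defence would then have the pipeline trap when a jump consumes a word whose tag doesn’t match the expected value, which is the kind of use the memo describes.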

If you’ve visited lowrisc.org before, you’ll have noticed we’ve changed a few
things around. Keep an eye on this blog (and its RSS
feed) for news on developments - we
expect to be updating at least every couple of weeks. We’re very grateful to
the RISC-V team at Berkeley for all their support and guidance. A large
portion of the credit for this initial code release goes to Wei
Song, who’s been working tirelessly on the HDL
implementation.