RISC-V is doing disruption right

Technical introduction here (somewhat out of date; hardware support is broader and deeper now, and I have seen video of a full Linux port running Doom), but the technicalia is mostly not where I’m going with this post.

I’m seeing a setup for a potentially classic disruption from below here. And not mainly because this instruction set is well designed, though it is. Simple, clean, orthogonal – it makes my compiler-jock heart happy; writing a code generator for it would be fun, if I needed to. But there’s already an LLVM back end for it.

And that points at what’s really interesting about RISC-V: whoever is running their product strategy has absorbed the lessons of previous technology disruptions and is running this one like a boss.

By the time I became really aware of it, RISC-V already had 19 cores, 10 SOCs, LLVM support, a Linux port, dozens of corporate sponsors, and design wins as a microcontroller in major-label disk drives and graphic cards.

That tells me that RISC-V did a couple of critical things right. Most importantly, quietly lining up a whole boatload of sponsors and co-developers before generating any public hype. That means there was no point at which the notion that this is nothing but an overpromising/underdelivering academic toy project could take hold in people’s brains and stick there. Instead, by the time my ear-to-the-ground started picking up the hoofbeats, RISC-V was already thundering up a success-breeds-success power curve.

Of course, they could only do this because open source was a central part of the product strategy. The implied guarantee that nobody can corner the supply of RISC-V hardware and charge secrecy rent on it is huge if you’re a product designer trying to control supply chain risk. It’s 2019, people – why would anyone in their right mind not want that hedge if they could get it?

Yes, yes, present RISC-V hardware is underpowered compared to an Intel or ARM part. Right now it’s only good enough for embedded microcontrollers (which is indeed where its first design wins are). But we’ve seen this movie before. More investment capital drawn to its ecosystem will fix that pretty fast. If I were ARM I’d be starting to sweat right about now.

In fact, ARM is sweating. In July, before RISC-V was on my radar, they tried to countermarket against it. Their FUD site had to be taken down after ARM’s own staff revolted!

Top of my Christmas list is now a RISC-V SBC that’s connector compatible with a Raspberry Pi. Nothing like this exists yet, but given the existence of SOCs and a Linux port this is such an obvious move that I’d bet money on it happening before Q4. Even if the RISC-V foundation doesn’t push in this direction itself, there are so many Chinese RasPi clones now that some shop in Shenzhen almost certainly will.

Let me finish by pointing out the elephant in the room that the RISC-V foundation is not talking about. They’re just letting it loom there. That’s the “management engine” on Intel chips, and its analogs elsewhere. An increasingly important element of technology risk is layers hidden under your hardware that others can use to take control of your system. This goes beyond normal supply-chain risk; it’s about the extent to which your hardware can be trusted not to betray you.

The real hook of RISC-V – something its Foundation has been quietly working to arrange via multiple open-source hardware designs – is auditability right down to the silicon. Open-source software running on open-source hardware with social proof that the risk of effective subversion is very low and a community that would interpret subversive success as damage and swiftly route around it.

That is huge. It’s going to up the market pressure for transparency at all levels of the computing stack. I was the author of The Cathedral and the Bazaar, so I get to say this: “Bwahahaha! All is proceeding as I have foreseen…”

113 thoughts on “RISC-V is doing disruption right”

Look at the proposal for their version of SIMD-like instructions: it is clever, and it avoids having to recompile for different processor versions with every further extension in vector length. Basically the vector registers are variable length, so you write code along the lines of ‘I need 4 vector registers of 1000 doubles’, the hardware answers ‘this CPU can do 16’, and you loop until you get to the end. A newer processor can then say: sure, 1000, no problem.
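That “ask, get an answer, loop” pattern is usually called strip-mining, and it can be sketched in plain C. The `set_vector_length` helper below is a hypothetical stand-in for the RISC-V `vsetvli` instruction: the program asks for up to `requested` elements and the hardware answers with how many it will actually process this iteration.

```c
#include <stddef.h>

/* Hypothetical stand-in for vsetvli: the hardware grants at most
 * hw_max elements per iteration, whatever the program asked for. */
static size_t set_vector_length(size_t requested, size_t hw_max) {
    return requested < hw_max ? requested : hw_max;
}

/* Strip-mined vector add: the same code works whether the CPU can do
 * 4, 16, or 1000 doubles at a time -- only hw_max differs. */
void vec_add(double *dst, const double *a, const double *b,
             size_t n, size_t hw_max) {
    while (n > 0) {
        size_t vl = set_vector_length(n, hw_max); /* "this CPU can do vl" */
        for (size_t i = 0; i < vl; i++)           /* one vector operation */
            dst[i] = a[i] + b[i];
        dst += vl; a += vl; b += vl; n -= vl;     /* advance by the grant */
    }
}
```

Because the loop re-asks for the vector length on every trip, the same binary runs unmodified on any vector width; a wider machine simply takes fewer iterations.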

Looking at how the Mill proposes to do it, vector operations are generic and vectors are labeled with metadata by the load unit. That way the hardware can work properly on different vector lengths and widths. The code is then specialized for the hardware at load time (or you can write self-modifying code, which is harder).

I wish Ivan well (he’s even offered me a sweat-equity job working on the Mill in the past), and there are a *ton* of fascinating and clever ideas in it … but work started (according to Wikipedia) in 2003, and still today, February 2019, there is not even an FPGA implementation or a low-end incarnation of it.

RISC-V started in 2011 and academics had already taped out and used multiple working chips before they started publicising RISC-V and created the Foundation in 2014.

There is of course a huge difference in technical aspects. The Mill is trying to blaze a new technical trail, while RISC-V is content to follow the well-established trail pioneered by the IBM 801/ROMP, RISC I, MIPS, and Alpha, with only minor technical improvements (mostly avoiding pitfalls) but with a pioneering business model.

No provisions in the basic architecture for detecting integer arithmetic errors beyond divide by zero, excused by saying “the most popular languages like C and Java don’t care”; this also upsets the long-integer cryptography crowd. Very much a New Jersey, Worse-is-Better design. I’m wondering if there are any gotchas in the serious memory management design, which I haven’t examined.

Pure and total open-source designs are not exactly in the picture, because digital-analog interfacing (like to DRAM) is pretty much always IP you buy; it’s different for the various fab lines and their processes, and analog IC design is much harder than digital. But you can of course get a lot closer.

Although … aren’t there a number of ARM designs that don’t come with official built in subversion? A lot of entities like RISC-V because it’s somewhat cheaper (as of the last time I checked, perhaps a year ago, you don’t get the full package of verification etc. that you get with an ARM license) and because you don’t have to deal with ARM, which I’ve read is very painful. But that doesn’t mean lots of ARM cores are necessarily worse in this dimension.

>Although … aren’t there a number of ARM designs that don’t come with official built in subversion?

Yes, but how do you know it’s not there if you can’t audit?

How did I know someone would make this reply?

Unless you decap and examine the IC from the foundry layer by layer and see how well it matches up to the design you sent it, you can’t really be sure.

You also need source code chain hardening to make sure the design you started with hasn’t had something slipped into it, and that no one does so while you’re adding your special sauce to it.

The advantages I see to RISC-V here: I’d guess some fraction, maybe very large, of ARM licenses keep some of the secret sauce secret from the licensees (not sure how that would work while still allowing the latter to simulate and verify the whole chip), and the sauce is secret from the world at large, so third parties can’t even in theory keep an eye on it.

But auditing RISC-V designs isn’t going to be trivial, or something most hackers can do, and it has to be done for each revision of the chip. Perhaps some of that could be automated, though.

Actually, we should start with the bottom line: what’s your threat model?

It doesn’t have to be perfect. But if the Chinese Ministry of State Security and the Russian Federal Security Service say it’s secure enough for them, it would be good enough for me, if they’re signing off on the same toolchains and hardware.

That’s two organizations that have critical infrastructure security problems, and national-level funding to plug the holes.

Abstract. In recent years, hardware Trojans have drawn the attention of governments and industry as well as the scientific community. One of the main concerns is that integrated circuits, e.g., for military or critical-infrastructure applications, could be maliciously manipulated during the manufacturing process, which often takes place abroad. However, since there have been no reported hardware Trojans in practice yet, little is known about how such a Trojan would look like, and how difficult it would be in practice to implement one.

In this paper we propose an extremely stealthy approach for implementing hardware Trojans below the gate level, and we evaluate their impact on the security of the target device. Instead of adding additional circuitry to the target design, we insert our hardware Trojans by changing the dopant polarity of existing transistors. Since the modified circuit appears legitimate on all wiring layers (including all metal and polysilicon), our family of Trojans is resistant to most detection techniques, including fine-grain optical inspection and checking against “golden chips”. We demonstrate the effectiveness of our approach by inserting Trojans into two designs — a digital post-processing derived from Intel’s cryptographically secure RNG design used in the Ivy Bridge processors and a side-channel resistant SBox implementation — and by exploring their detectability and their effects on security.

I suspect that a) players in Silicon Valley are in bed with Chinese intelligence b) Chinese intelligence has compromised a number of US secrets and organizations c) we are looking at a serious war with the PRC in the future d) the potential deaths from this war are fairly significant. I’m also paranoid enough to fear that the EU and India may also be on an opposing side of the next world war.

I would suggest that the war has been ongoing for many years, but it’s been mostly cyber / espionage / proxy. It’s unlikely to go conventional (or nuclear) anytime soon unless the USA makes it so. Honestly, back in early 2001 during the spy plane fiasco, I was certain that the USA and China would be in a shooting war with each other within a year. But then 9/11 happened and our leaders’ military adventurism got redirected…

A friend in the recycling industry recently told me that about half of American recyclables (read: trash collected for recycling) had been sold to China… Until a year or so ago. Told me that China had cut off practically all imports of materials to be recycled while they “develop their own infrastructure”. Which leaves the US recycling industry in quite a pickle. Even more interesting if one looks at it in the context of preparing for an economic showdown.

Materials that are worthwhile to recycle have been commercially recycled since, in some cases, antiquity. There’s most definitely a scrap metal industry, for example, and probably has been ever since iron age families wore down their tools to the point of uselessness.

But the ‘recycling industry’ on the whole only exists because of compulsion and subsidy.

> No provisions in the basic architecture for detecting integer arithmetic errors

The user-ISA spec comes with “suggested instruction sequences” for detecting overflow. It wouldn’t take much to add in-hardware support conditional on using the “suggested” form – instruction fusion is a well-known hardware implementation technique.
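For unsigned addition the spec’s suggested sequence boils down to “add, then branch if the result is less than an operand”. A C rendering of that idiom (the function name is mine, not from the spec):

```c
#include <stdint.h>
#include <stdbool.h>

/* Unsigned 64-bit add with overflow detection. On RISC-V a compiler can
 * emit roughly the spec's suggested sequence for this check:
 *     add  t0, a0, a1
 *     bltu t0, a0, overflow
 * A fusing implementation could recognize that pair as one checked add. */
static bool add_u64_checked(uint64_t a, uint64_t b, uint64_t *out) {
    uint64_t sum = a + b;   /* wraps modulo 2^64; well-defined in C */
    *out = sum;
    return sum >= a;        /* false means the addition wrapped */
}
```

Signed overflow needs a slightly longer `slt`-based sequence, but the shape is the same: an ordinary add followed by a cheap comparison, with no flags register involved.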

> …digital-analog interfacing like to DRAM is pretty much always IP you buy,

Yes, digital-analog, low-level stuff is a pain. But even a chip where all the _purely-digital_ stuff is open would be a huge win, and combined with newer, open-hardware-oriented HDL tools like Chisel would greatly slash the NRE cost of designing a new chip. You’ll still need someone to help you with the challenging low-level and fab-specific stuff, but SiFive will happily do that, for the right price.

The ISA is open. Western Digital recently announced an open Verilog for their part. There are a few other places talking about open source HDL for it, too, mostly small fabs. There are rumors about TI, TSMC, Samsung, and Global Foundries looking into their own open source cores for the ISA. DRAM controllers and such are often separate paid IP, but you can find open source DRAM (up through at least DDR3), SD/MMC, SATA, Ethernet MAC and PHY, hardware TCP/IP stack, H.264 decoder, flash memory controllers, PCI, PCIe, and some basic video adapters pretty easily to get a basic prototype or first product out.

I do agree risc-v is doing disruption right… but it will be years more before the juggernaut makes headway.

I’ve been to two conferences thus far. One really noticeable thing is the age bands: you have the old greying hardware guys like me, then a 15-year gap before you hit software people, then you hit the hardware “kids”. It’s like there were two generations of hardware designer wiped out by the intel virus.

If nothing else risc-v’s ecosystem is helping train up that next generation. I hope they can take a stab at rewriting all this legacy IP nobody understands anymore.

* Hardware takes time. Lots of time
* Lot of hype, not a lot of products
** what you can get today is quite costly
* AMD shows how successful you are if you are only 10% slower. 2% of the market
* All the high end server arms have been failures
* Best risc design evah (dead end, IMHO)

# Unknowns

* Power management
* peripherals are mostly locked-up IP
(cpu is actually the smallest part of the die)

> It’s like there were two generations of hardware designer wiped out by the intel virus.

As I understand it, that’s exactly what happened, the “virus” being Intel’s fab line superiority prior to their catastrophic “10nm” blunder. The same held earlier while Dennard scaling lasted (ending ~2005-7 per Wikipedia), as everyone who remembers the NetBurst (e.g. Pentium 4) “marchitecture” hitting a brick wall will recall. Every time a design team came out with something clever, the next Intel node would wipe out their advantage.

> Strong university support – Berkeley, Stanford, ETH, Cambridge, etc.

I think this is also a con; I suspect this is the #1 reason the design skips out on a processor status register with pesky flags like integer overflow. Got to keep the basic design simple enough to do real things in a single semester.

> Hardware takes time. Lots of time

Indeed. I’m still waiting on lowRISC to get anywhere close to a tape-out. My primary interest in RISC-V is that this group is proposing to implement a tagged-memory version of it, although in a way with questionable performance (the tag bits are in their own section of memory vs. trying to e.g. steal a few from ECC DIMMs). Their progress is glacial, although they started before I think there was even a draft of the memory management scheme needed for seriously running Linux or other general-purpose kernels.

> AMD shows how successful you are if you are only 10% slower. 2% of the market

Doesn’t this also have something to do with their post-K8 general failures in producing buggy CPUs and chipsets?

(Note, before anyone says a word about Meltdown, realize this should result in you being automatically ignored, seeing as how AMD chips are fully vulnerable to a host of nastier Spectre bugs, and every other high-performance chip vendor had Meltdown bugs, specifically ARM, and IBM on both POWER and mainframe. I’m talking about non-side-channel bugs, the sort that screw up normal operation.)

> All the high end server arms have been failures

This brings up a question: as a general rule of thumb, if you want to displace an incumbent something, you need your thing to be 10 times better in some way. I don’t see any signs RISC-V will be that aside from the low cost embedded market. I.e. go with [forgive me for forgetting his name, I still can’t get a copy of his article] who divided CPUs into 4 types, which I follow with examples:

The status register is a major serialization problem for OOO designs (as is supporting the ancient IEEE standards for floating point).

There is a lot of cpu IP in the low end of the embedded market, things like microblaze, dsp designs, etc…. but getting the lawyers and licensing loop reduced will help. It still comes down (to me) to getting enough functionality onchip to be competitive in a given market segment.

As much as I too would like a risc-v in a raspi form factor, density is quite low, and early prices reflect the need to recoup dev costs. $999 for a dev board? yikes!

> Doesn’t this also have something to do with their post-K8 general failures in producing buggy CPUs and chipsets?

Having been at AMD at the time, I’d say the biggest issue is that when you’re competing against Intel you have to get everything right, every time, and the market is not tolerant of failure. AMD had near perfect execution on K8 at the same time that Intel was shooting itself in the foot. When Intel stopped screwing up, the market forgave it. When AMD stopped executing extremely well, the market dropped it like a hot potato.

It’s fairly hard to get a processor design out the door without too many mistakes and thereby be a challenge to Intel. It is extremely hard to do that repeatedly for the 10+ years you need to really chip away at Intel’s market dominance at this point.

None of the ARM server vendors seem to have what it takes. The few that had a clue have dropped out of the race, and I don’t see the survivors as threatening Intel.

I don’t know enough about RISC-V to hazard a guess, but ARM’s dominance in embedded markets has not translated to success in the server markets. In a lot of ways, the embedded-market experience has actually been something of a hindrance, since many of the ARM silicon companies are too used to doing embedded hack kernels and have difficulties when it comes time to standardize for the server market.

> AMD shows how successful you are if you are only 10% slower. 2% of the market

Most people aren’t buying the top-performing chips, so AMD should be able to grab greater market share.

My problem of late has been the thermal envelope. I much prefer quiet computing, so requiring less power and thus less air movement is a big issue for me. Also, a lot of big data centers care about power usage as power and cooling are major overhead issues.

A friend of mine runs a data center. We’ve talked about how in the winter he can provide heating to his own building and several adjacent buildings just from the waste heat. I don’t know if anyone has built a data center in a cold climate and sold their heat to neighbors like that, but it’s an interesting proposition.

Okay, this one in particular intrigues me. I just recently started reading up on the SPARC architecture for other reasons, and the concept of the register window struck me in particular. What a nifty way to pass data along from one function to the next while keeping everything in registers, not having to push/pop to/from the stack, and thus avoiding the cost of accessing main memory. But I’ve not meditated deeply on the register window idea yet.

Is there some horrible gotcha to them? Do they make writing a compiler for that target ISA harder than it needs to be or something?

First, you have to make them bigger than most functions need, which means that a lot of the silicon you have is wasted by most functions. When you eventually run out of windows and have to dump them to RAM you have to save a lot of data that is not actually being used. And then the number of registers in each window multiplied by the number of windows means you have a lot of registers, which has area, power, and access time implications.

It Sparc-ed my interest when I heard it ran Linux and was fully opensource, but my thought is to have it with a large FPGA (something I’m getting into now) which you could build the peripherals you need for the application or acceleration.

And if it runs fast enough, you can bit-bang. Go back and study the Apple II – it didn’t have a floppy controller. That was on a 6502.
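Bit-banging here just means driving a serial protocol with CPU-timed writes to a pin instead of dedicated controller hardware. A minimal sketch of a bit-banged UART transmit in C; `gpio_write` and `delay_one_baud` are hypothetical board-specific hooks, not any real API:

```c
#include <stdint.h>

/* Transmit one byte with 8N1 framing by toggling a single output pin.
 * gpio_write sets the pin level; delay_one_baud burns one bit-time.
 * Both are placeholders for whatever the target board provides. */
void uart_tx_byte(uint8_t byte,
                  void (*gpio_write)(int level),
                  void (*delay_one_baud)(void)) {
    gpio_write(0);                       /* start bit */
    delay_one_baud();
    for (int i = 0; i < 8; i++) {        /* data bits, LSB first */
        gpio_write((byte >> i) & 1);
        delay_one_baud();
    }
    gpio_write(1);                       /* stop bit */
    delay_one_baud();
}
```

Whether this is viable depends entirely on the CPU being fast enough to hold the bit timing, which is exactly the tradeoff the Apple II made: CPU cycles in place of controller silicon.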

And FPGA with blockchain should be able to help the audits. Adding hidden backdoors to complex designs is easy, less so with simpler designs, and if there are multiple sources, can you affect every supplier?

The more interesting thing might be the server farms if they can get the speed up and power down – like Amazon is moving to ARM.

> It Sparc-ed my interest when I heard it ran Linux and was fully opensource, but my thought is to have it with a large FPGA (something I’m getting into now) which you could build the peripherals you need for the application or acceleration.

That sounds like the sort of shenanigans the remnants of the Lost Amiga Civilization would get up to. Wonder if they’ll petition Hyperion Entertainment for a RISC-V port of AmigaOS?

Seriously, dude, where have you been? RISC-V has been the toast of the open-hardware movement for a couple years now at least, in no small part because of its successful adoption by corporations. My money isn’t on it expanding outside niche applications very soon, at least not until Apple starts producing chips based on it and using them as the basis for iPhones — and if they do that, then they may well hire the entire RISC-V team and bring control of the project in-house. But with all but the most basic of x86 instructions being patented IP — releasing a software emulation is grounds for an unpleasant conversation with Intel’s lawyers — RISC-V may be the hero the software world needs as well.

Oh, and by the way, that demo in the Linus Tech Tips video wasn’t Doom — it was Quake 2. The distinction is important because the Quake engines were way more CPU- and GPU-intensive (though Q2 also supported software rendering) than the Doom engine — meaning that this is no slouch of a part. At a minimum it has decent floating-point performance and can keep pace with a late-stage Pentium or Pentium II. (When I hear “new 32-bit embedded processor” I keep thinking it’ll be 486-class at best and is designed more for low power consumption and low cost than performance.)

So yeah, a major victory for open hardware — but it’s barely a blip on the radar when compared to hardware as a whole. Much like Linux, it’s going to have a long road ahead, and RISC-V proponents may have to content themselves with squeezing into niches underserved by the big boys and not expanding far beyond those.

That is, of course, until Tim Cook decides he doesn’t want to be beholden to ARM for his cash cow of a product line.

For those who don’t game, the difference between Doom (Doom 1, at least) and Quake is that the original Doom engine was essentially 2D disguised as 3D. It looked like 3D, but in Doom you could only move your weapon and/or player left and right, not up and down. Quake is REAL 3D, and you can essentially move your weapon/player in a 360 degree sphere. I’m not sure about program sizes, but the difference in processing power to do one rather than the other is significant.

Doom ran – well, slogged – on a 386. You needed a 486DX4 for it to be consistently smooth. That said, Quake barely ran on the 486DX4, even in lo-res mode – playable in low-action areas but it would degrade to a slideshow when things got heavy. Even a Pentium 90 was barely enough (though much better than even an Am5x86/133, because of massive improvements in FP performance). Quake 2, OTOH, struggled on a 350MHz K6-2.

Quake 2 had no ceiling, at the time. The more hardware you threw at it, the happier it was.

Quake 2 took opengl hardware acceleration out of the niche CAD market and into the consumer market, as retailers had a hard time keeping Voodoo and Voodoo 2 cards in stock. Prior to that “accelerator” in a video card meant that it had an API for drawing lines and boxes. It wouldn’t be long before even the most basic $10 no-name video cards had serious vector math and texture mapping hardware.

Quake 2 also turned overclocking into a semi-mainstream activity with every nerdy kid in the country trying to squeeze a few more frames-per-second out of their CPUs and GPUs.

>This brings up a question: as a general rule of thumb, if you want to displace an incumbent something, you need your thing to be 10 times better in some way.

I have the same rule of thumb ‘cept that it needs to be 5x better in multiple ways. Adds up the same, I guess. I outlined the risc-v wins in an earlier post, but most of those aren’t technical, and yet I do think those are sufficient for enough volume for (some variants of) it to be a viable arch. In the lwn.net article esr cited from last year ( https://lwn.net/Articles/743602/ ) though, I’d forked off a discussion of the mill (https://millcomputing.com/docs/) cpu, which (still) seems to be the only hope we have for major improvements in single threaded performance (and *security*) in a post spectre/meltdown world.

I’m still deeply disturbed by those classes of bugs, enough to have spent some time working with sel4 late last year… but I keep coming back to the fact that any decent micro-kernel design requires fine-grained security level management from the cpu, and that’s hard to retrofit into the risc-v (though one group is trying to add “capabilities”).

What I’ve seen of pretty much every capability architecture I’ve run across (and what I recall of what I’ve read about the Mill), is that they tend to try to implement capabilities at a per-object level, and the overhead of doing this chokes performance.

I think protected-mode segmentation on 32-bit Intel had about the right granularity, but it didn’t have what it needed to be a capability system. Legacy compatibility also meant segments were implemented as an offset and limit within a global paged address space, rather than each segment being its own paged address space, which made any use of segmentation beyond setting up a flat address space underperformant.

>What I’ve seen of pretty much every capability architecture I’ve run across (and what I recall of what I’ve read about the Mill), is that they tend to try to implement capabilities at a per-object level, and the overhead of doing this chokes performance.

Oh shit oh dear. I think I just heard you tell me that the Mill has i432 disease. What a damn dirty shame that is; I liked what I read about the belt architecture and the news that it has that kind of layer of thick cruft over it is very sad.

Mind you I had already regretfully reached the conclusion that the Mill is toast for completely different reasons. Ivan not having first silicon out yet when there are multiple RISC-V SOCs hitting the streets tells me he has run out of time – RISC-V may be inferior in significant ways (I don’t think dtaht is wrong about that) but that it’s going to worse-is-better the fuck out of the Mill is now utterly predictable.

“Mind you I had already regretfully reached the conclusion that the Mill is toast for completely different reasons.”

I don’t think it’s a done deal. A chip that can run a micro-kernel as well as traditional chips run a monolithic kernel is going to have definite demand, especially as IoT security liability becomes more of a thing.

Yes, RISC-V takes some of the space they wanted to address with custom chips, but if RISC-V becomes fragmented with incompatible and irregular extensions, people could be looking for a better way to do custom instructions.

Also, portals aren’t quite the same as i432 capability addressing, where pointers are replaced with an access-control object. They act more like a function call into a library with a mandatory hardware-managed protection context switch. The execution (with explicit arguments already on the belt) jumps across the protection boundary rather than the data passing between threads. These protection contexts are created hierarchically (allowing revocation and emulation of a ring system).

And Ivan claims function calls will be very cheap on the mill. Special hardware spills everything on the belt to the cache and a new stack frame is created. Execution can continue on the next cycle.

I don’t know, the Mill’s access controls don’t sound nearly as complicated as what the i432 did. There’s no type checking, or any nonsense about object-oriented programming. A process can grant another process access to an arbitrary range of bytes, provided it has access to those bytes itself. (Or perhaps it was page ranges? I might be misremembering.) Thus you can make that as simple or as complicated as you want; the BIOS starts up with access to all of the memory, and grants access to everything except itself to your bootloader. Your bootloader grants access to the kernel, and so on down the chain. Managing that doesn’t seem like it would be any more complex than managing virtual memory, and it should have similar hardware requirements.

I said “what I recall”. It’s been a while since I read up on the Mill’s protection architecture, it may be that I misrecall. Some of what other people are saying makes me think that I do misrecall, but the talk about byte level granularity makes me nervous.

You could shoot yourself in the foot if you’re not careful, though. E.g. you could easily thrash the cache if you’re pulling in a bunch of small references from non-contiguous space, as memory access seems to be a conventional cache line at a time. Or you could just have too many references floating around, so that parts of the PLB (protection lookup buffer) get flushed to memory, which would cause a cache miss if they are needed later.

However, it seems like a potentially nice way to do IPC, in cache lines that are likely to be hot anyway (like micro-kernel services or device drivers). It’s really not meant for every bit of code to be tagged like that. Part of their plan is to have a CPU that eliminates the vast majority of the micro-kernel performance penalty, without limiting themselves to that use-case.

The Mill actually provides at least byte-level granularity, but not every individual object is expected to have unique permissions. And everything inside the CPU is in virtual space, which simplifies the issue. (You can lay out protection space contiguously, and where the TLB is on traditional architectures, they put a PLB, which is multi-level and just has to do protection and not address translation.) And they make calls/changes to protection boundaries very cheap, in many cases a single cycle, and call and stack frames are hardware managed/protected.

Have you seen the Kestrel Computer Project? I have occasional conversations with the lead developer and he’s really onto something. The project aims “to build a full-stack, open source, and open hardware home computer” and … RISC-V is the chosen CPU architecture, for all of the right reasons.

FreeBSD is getting on board, too. According to https://www.freebsd.org/platforms/, RISC-V is a ‘Tier 3’ platform as of FreeBSD 12.0. That means it’s experimental, which in this context covers “architectures in the early stages of development”.

> And not mainly because this instruction set is well designed, though it is. Simple, clean, orthogonal – it makes my compiler-jock heart happy; writing a code generator for it would be fun. If I needed to, but there’s already an LLVM back end for it.

Write one anyway, if for no other reason than the fun of it or to exercise those compiler-writing chops. Maybe you’ll have some insight into how to best play with the architecture that would serve as a useful patch to LLVM’s backend for it. Or maybe it will just be fun.

Every product goes through a life cycle: from new hotness that everyone wants and can’t be made fast enough, to commodity with commodity pricing, paper-thin margins, and pennies on the dollar in profit, where you need to move as much volume as possible and the Lowest Cost Producer Wins.

Guess what tends to become open source?

Neither Intel nor ARM will feel any chill. What they turn out is not commodities with paper thin margins. It will be some time, if ever, before RISC-V or MIPS is capable of competing in the main markets Intel and ARM sell into.

They might be just the thing for IoT and other embedded work, but do you really see either becoming something you might build the Great Beast around? I’d call the time frame where that might occur measurable in years.

>It will be some time, if ever, before RISC-V or MIPS is capable of competing in the main markets Intel and ARM sell into.

That time is already within ARM’s planning horizon; we know this because they have already deployed FUD. Entertainingly, the site was taken down under pressure from their own employees.

I have long had a rather high opinion of ARM. Its C-level strategists have a pretty good track record; the company is smart and nimble, as good at identifying and exploiting market openings as it is at computer architecture. Accordingly, I credit them with having a longer planning horizon than most firms. Still, the fact that they’re worried enough to risk drawing attention to RISC-V tells me that they foresee a potential disruption crisis no more than five years out.

>They might be just the thing for IoT and other embedded work, but do you really see either becoming something you might build the Great Beast around? I’d call the time frame where that might occur measurable in years.

Again, the only way I see that happening is if Apple decides to get into the RISC-V chip business. We are single-digit years away from the first ARM part to best top-end x86 in raw performance, not just performance per watt — and that chip will come from Apple.

For anyone else, it will be a struggle, especially if you apply the “reasonable price” and “reasonable power consumption” constraints.

Someone should figure out which vintage pre-2000 instruction sets and CPU architectures suck the least, update them just enough for modern clock speeds and manufacturing methods, and then sell to hobbyists whatever seems most promising. Multi-core and 64 bit would be nice additions but not strictly necessary. Maybe not to make a huge profit, but to see what is possible.

I was a wee pup at the time, but I vaguely remember good things about MIPS, PA-RISC, and moto 68k.

>I was a wee pup at the time, but I vaguely remember good things about MIPS, PA-RISC, and moto 68k.

Those of us old enough to remember it are still nostalgic about the 68000. Much nicer architecture than Intel. Long ago I heard that the original IBM PC engineers at Boca Raton wanted to build around it, but Motorola could not ship some glue chips it needed and missed its window.

And that is a shame. The 68K would have scaled up a lot more gracefully than the 8088 did. Later, it became the basis for the earliest Unix workstations.

IIRC, NXP (which absorbed Freescale) is still selling the 683xx and ColdFire microcontroller lines, which are derived from the 68k family.

Personally, I’d love to hack together a 68k-based computer, if for no other reason than to see what I can teach myself about hardware and what I can make it do. The original 68k chips might not be in production anymore, but the more I read about the family, the more it seems those chips, or microcontroller derivatives thereof, went into damn near everything for a while. So there should be no shortage of either old stock or still-functioning recovered scrap to play with if I end up insisting on an original.

A few years ago I read about a project to develop open-source implementations of the Hitachi Super H family — best known as the difficult-to-program RISC chips powering Sega consoles in the mid-late 90s — because all the patents on these had expired. I don’t know if that effort got anywhere.

Looks like updates stopped back in late August of 2016. I don’t know if that means the project is dead or if they’ve been super heads down on getting things done and haven’t had time to update the site.

> the Hitachi Super H family — best known as the difficult-to-program RISC chips powering Sega consoles in the mid-late 90s

Thinking about this, I suspect the SuperH itself being difficult to program is an unfair characterization it inherited from the overcomplex design of the Sega Saturn. The Sega Dreamcast (the Saturn’s successor) also used an SH chip as its main CPU, and I never heard anything about the Dreamcast being difficult to program for.

I do, however, find the common accusation that the Saturn was hard to program for quite plausible. Let’s review its specs:
* 2x SH-2s serving as Main and Slave CPU, both of which share the same memory bus (hello, bus contention)
* Two VDP chips, one to control bit-mapped backgrounds, the other to control sprites and polygons. The latter’s output is an input bitmap to the former, which made effects other contemporary systems could do (like transparency) much trickier.[1]
* A custom 32-channel PCM & FM synth sound chip, driven by a Motorola 68k
* A SH-1 to drive the CD-ROM drive
* A bus controller which additionally sported its own DSP.

IMO, it’s probably the combination of the two CPUs and the two VDPs that made the system harder to program for than it needed to be. As our host has noted in other posts, some problems require inherently serial calculations. Adding an extra CPU with which you must contend for the main memory bus isn’t going to help at all. As I understand it, many modern videogames are *still* mostly single-threaded.

The SH was not difficult to program for; that was famously the excuse Electronic Arts management gave in an interview about their not supporting the Dreamcast (in reality EA wanted deep discounts on the licensing fees after the 32X and Saturn debacles, and Sega refused). At the time that interview was given, those of us in the sports mines already had Madden and FIFA playable at 60 FPS on the machine, because we had extensive SH experience from making previous games run decently on the Saturn. The Dreamcast architecture lived on in successor arcade boards until 2011, which is quite a run.

The SH’s 8 and 16-bit predecessors, collectively the “H8” family, were widely used in Japanese electronics in the 1990s; they’re found in many Yamaha and Roland synthesizers and sound modules of that era, for example, and I understand they were popular for car controller modules as well.

Both H8 and SuperH have nice clean architectures with 16 general-purpose registers and an orientation towards position-independent code. SuperH unfortunately added a delay slot, and GCC codegen for it was pessimistic enough early on that Sega hired someone full-time to fix it and submit patches upstream to mainline. C code compiled for SH settled in eventually at about 2/3rds the size of the same code compiled for Sony’s MIPS boxes, which was a nice savings. (SH instructions are 16 bits plus operands instead of fixed 32 like MIPS).

See, this is why there’ll never be a year of the Linux desktop. THIS IS NOT A PC. It’s pure hobbyist wankery. In order to give the SBC a peripheral interface you have to attach an FPGA board and *then* flash a bitstream to the FPGA. That’s not a PC — that’s a project. The problem of open source has always been that open source ships too many projects and too few products.

When Purism ships a clamshell RISC-V laptop with Ubuntu preinstalled we can talk about RISC-V PCs. Though it really won’t happen until Ken Burnside can use one without grumbling about how shit the software is (which will happen sometime around the 12th of Never).

Seriously … Linux has so utterly dominated everything else that the conventional desktop isn’t even an interesting space anymore. Let Microsoft keep it; eventually it’ll be about as relevant as the fact that no one ever beat IBM in mainframes.

Of course it’s not a PC! It’s “get something physical working as quickly as possible so that a few hundred developers around the world can get started on the software for the chips and boards that will come later”. The price is immaterial — $1k for the RISC-V board plus $2k for the Microsemi FPGA expansion board (or $3.5k for a Xilinx VC707) is trivial compared to the $10k+ a month for the developer using it.

By the way, the next generation of Microsemi PolarFire FPGAs will have the same penta-core 64-bit RISC-V processor built into it, in the same way Xilinx Zynq FPGAs have dual ARM A9s. And this is the dev kit for it.

These boards are also being used in build farms at Fedora and Debian.

Well, one that costs less than $1,000; lots more if you want an M.2 SSD, SATA, and a couple of PCIe slots (the expansion board has a honking big FPGA), basically anything more than a gigabit Ethernet port, a MicroSD card adapter, and lower-level connections. For early adopters there’s the HiFive Unleashed, so early it has spots for hookups to monitor DRAM timing etc. (the two rows of five spots).

This is like a weird trip back into the 1990s. I remember when everyone went on and on about how Intel was “inherently slow” because Hennessy and Patterson said so. Turns out those guys didn’t have a clue. They were still rambling on about “pipelines” when the Intel architects were saying “pipe-what-now? Oh, right, that old thing. Ha ha ha!”

Indeed, I see that Patterson is involved with RISC-V. I wonder if he’s learned anything.

Nobody who’s actually building this stuff much cares about instruction set architectures any more, and they haven’t for close to 20 years. Everything is trivially translatable into something that can be made to go fast.

I mean, if you’re just some dork who wants to grab a Verilog book and do a 5 stage pipelined throwback to 1992 to shove in an FPGA, sure, why not? It makes no sense to do this, but it’s probably an amusing exercise for a lot of people. Might get you an internship at some shop where they actually build things.

> Nobody who’s actually building this stuff much cares about instruction set architectures any more, and they haven’t for close to 20 years. Everything is trivially translatable into something that can be made to go fast.

As I understand it, that depends; for example, the VAX macroarchitecture had too many wild addressing modes for DEC to make it fast easily, or at all, whereas for whatever reason Intel’s didn’t have that fundamental restriction. But I’m not a real student of this subject.

Unfortunately, for someone at my level, this is massively muddied by Intel’s process-node superiority until their failed move to “10 nm”. But as for “they actually build things”, I guess you’ve failed to note WD adopting RISC-V for all their products’ microcontrollers. Since that’s a “Zero Cost” type of CPU, while perhaps having some “Zero Delay” flavor, the money they’re saving in licensing fees to ARM or whomever is not insignificant; but of course that’s not the area you or ESR are commenting on. It is, however, where most of the microprocessor design action is.

Me, I hope that lowRISC’s tagged version of RISC-V gets somewhere, while not liking at all the New Jersey design of ignoring the issue of integer arithmetic errors besides divide by zero, and wondering if there are other corner cutting gotchas in the whole set of designs including the MMU needed for general purpose operating systems.

Design wins in these degenerate times all too often don’t have anything to do with technological excellence, but, sure, RISC-V seems to be one of a whole giant whack of ISAs out there with various implementations and descriptions in various design languages. It is not clear that the world needs more of these things, and Eric has not made much of a case for this one being particularly special.

It is, like all of them, different in various ways from all of the other ones, and in some cases one or more of those differences might matter in a design. More often, it’s simply adequate, and whoever was in charge of selection thought it would look best on his resume.

DEC actually did make the VAX go pretty fast; the 9000 series was quite zippy for its era. I am also not an expert here, and what little I did know is nigh on 20 years out of date, so I haven’t the foggiest what would be necessary to make a fast VAX implementation in this era. Probably nobody does; the experts are steeped in other architectures and the various methods and approaches used for breaking those ISAs down for performance.

There’s a paper out there that covers this in detail, which I read but did not study deeply enough to make sure it’s right, since the factors you mention determine what CPUs I work on, not raw technical merits. I think I saved it, but if so I didn’t add tags like “VAX”, I’ll see if I can find it again locally or on-line.

The key detail I heard about the VAX was that it was hard to make fast. DEC, for example, took a very long time to ship anything faster, starting with the 11/785 seven years later.

Yes, that is fair. In the terms that were around in 1990, VAX was “hard” to make go fast. The problems inherent in making then-current ideas of how to build a CPU go, say, 200Mhz, were intractable with something like VAX.

To go much faster than that, though, with any ISA at all, a lot of stuff had to get re-thought. The solutions were, as it were, “refactored” into something substantially more complex, and it was at about that point that the ISA began to fade into irrelevance.

You stop pipelining in any meaningful way. The instruction stream that comes in is transcoded into much smaller operations intended to enable as much speculative execution as possible (these might be implementation-specific micro operations). These tiny atoms of computation are hurled into an extremely wide and deep network of redundant computational units. There are bewildering meshes, copies of the “user visible” register set all over the place, all with different results in them.

On the back end of this mess is a wad of logic continuously reconciling things, squashing results that should never have been calculated, committing results that should have been. Conditional branches are sorted out fantastically late in the game so that raw computation can proceed at insane pace.

This kind of architecture was enabled by the newer processes, Intel engineers developed a “gates are free” mindset, and invented ways to use astonishingly large numbers of them. A Pentium processor from this era has 3 orders of magnitude more transistors than an 8 bit micro.

This is how you get the performance levels we see in single cores today. This is wildly more complex than the most ambitious pipelined architectures ever made, and somewhere in there, the relative difficulties of ISA details kind of shrank to nothing.

So, in a way, Hennessy and Patterson were right. In their own relatively naive world, they got it right. They were just not looking at the right world. They managed to prove that you’ll never get that steam locomotive to fly.

Depending on your system architecture, the most desirable feature your user-visible ISA can have might well be that it most compactly represents the programmer’s intent, because this saves precious memory bandwidth. I have a vague notion that there was at any rate an interval in which x86 was actually pretty close to optimal, because of this exact property (but don’t quote me on it; I could just as well have hallucinated this).

The “gates are free” approach fails because the power needed to _run_ the gates is not free. It’s in fact an especially scarce resource, given the need for shedding the resulting heat while keeping the chip (as well as the package, and even the system itself to some extent) at a workable temperature. The only gates that are still relatively cheap are memory (though even that seems to have problematic power requirements – but keeping memory and compute close together is a huge win) and specialty compute engines that are going to be powered up only rarely, when actively needed.

Even the latest process nodes don’t really save you from this dynamic – if anything, higher absolute temperature at these feature sizes implies exponentially-higher incidence of permanent degradation that causes wear and tear on the chip itself and lowers its overall endurance. Additionally, the increase in leakage power makes it even more important to get the power-gating right.

I’ve said this already, but the RISC-V folks are not “ignoring” integer overflow. They provide suggested instruction sequences for overflow checking as part of the ISA spec, and these could be used for hardware support. It’s already common to fuse special instruction sequences as part of decode.
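For the curious, the spec’s suggested checks are short enough to sketch in C. This is my own illustration, not code from the spec; the comment notes the flavor of the RISC-V instruction sequence it mirrors:

```c
#include <stdint.h>

/* Sketch of flag-free signed-add overflow detection, in the style of the
   compare-and-branch sequences the RISC-V spec suggests: overflow occurred
   iff the sign of operand b disagrees with whether the wrapped sum came
   out below a. */
static int add_overflows(int32_t a, int32_t b, int32_t *sum) {
    int32_t s = (int32_t)((uint32_t)a + (uint32_t)b); /* wraps, like ADD */
    *sum = s;
    /* roughly: slti t3, b, 0 ; slt t4, s, a ; bne t3, t4, overflow */
    return (b < 0) != (s < a);
}
```

The point is that two compares and a branch replace a flags register, and a decoder is free to fuse that pattern into a single internal operation.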

Additionally, the MMU and other OS-level support is an entirely separate spec from the basic ISA. It could be replaced altogether without affecting the basic design of RISC-V supporting software.

Sorry, not buying it one bit. And you’re missing the bigger point: it’s explicitly a New Jersey, Worse-is-Better design, justifying the absence of a flags register by noting that the most popular languages, like C/C++ and Java, ignore integer overflow, etc. That’s a paraphrase of what they said in the base instruction set specification.

So I’m not in the least inclined to trust anything they produce. And, sure, the MMU design can be replaced, but such fragmentation would seriously hinder adoption for the sorts of niches ESR is most interested in, vs. its current success as a microcontroller core.

>And you’re missing the bigger point: it’s explicitly a New Jersey, Worse-is-Better design, justifying the absence of a flags register by noting that the most popular languages, like C/C++ and Java, ignore integer overflow, etc. That’s a paraphrase of what they said in the base instruction set specification.

It is. But I think your criticism is overblown. Handling overflow in explicit operations optimized by fused decode is a legitimate approach. I don’t think they can be faulted for optimizing for the common case.

Also, it’s a bit soon to speak of “replacing” the MMU specification, as there isn’t one to replace yet.

There is a *page* *table* specification for virtual memory. The data structure in RAM. And there is a specification for telling the processor that the page table has changed. But you can implement the actual MMU any way you want.

Sure, of course. Each page table entry has RWX bits and Accessed/Dirty bits. Standard pages are 4 KB, but 32 bit systems also have 4 MB superpages, so you can map and protect the whole address space in 4 MB chunks in a single 4 KB memory page and just subdivide the parts you want more finely.

You *could* build a conforming RISC-V system that walked the page table data structure on every memory access. Obviously, that would be slow, so in practice everyone uses some form of TLB to cache page table entries. You can use a hardware page table walker, or you can treat TLB misses like a page fault and use a software page table walker to reload your TLB in a machine-dependent way. Once you find the PTE for an access you can generate a page fault exception if you try to read a page with the Accessed bit not already set, or if you try to write a page with the Dirty bit not already set. If such accesses don’t trap then hardware should set those bits. If you trap on first Access or Dirtying then you could have either hardware or software set the bits. Hardware should never clear the A and D bits.
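As a concrete (and entirely unofficial) sketch of what such a walker does, here is a toy Sv32-style two-level lookup in C. `g_mem` stands in for physical memory, and the permission and A/D handling described above is omitted for brevity:

```c
#include <stdint.h>

/* Toy model of an Sv32-style two-level page-table walk (RISC-V 32-bit):
   VA = [vpn1:10][vpn0:10][offset:12]; a leaf at level 1 maps a 4 MB
   superpage. Permission and A/D checks are left out of this sketch. */
#define PTE_V (1u << 0)            /* valid */
#define PTE_R (1u << 1)            /* readable (a leaf has R or X set) */
#define PTE_X (1u << 3)            /* executable */

static uint32_t g_mem[0x2000];     /* fake physical memory, word-indexed */
static uint32_t phys_read32(uint32_t paddr) { return g_mem[paddr / 4]; }

/* returns 0 and fills *pa on success, -1 on (what would be) a page fault */
static int sv32_translate(uint32_t root_ppn, uint32_t va, uint32_t *pa) {
    uint32_t table = root_ppn << 12;
    for (int level = 1; level >= 0; level--) {
        uint32_t vpn = (va >> (12 + 10 * level)) & 0x3ff;
        uint32_t pte = phys_read32(table + vpn * 4);
        if (!(pte & PTE_V)) return -1;
        if (pte & (PTE_R | PTE_X)) {                    /* leaf PTE */
            uint32_t ppn = pte >> 10;
            uint32_t mask = (level == 1) ? 0x3fffffu : 0xfffu;
            if (level == 1 && (ppn & 0x3ff)) return -1; /* misaligned 4 MB page */
            *pa = ((ppn << 12) & ~mask) | (va & mask);
            return 0;
        }
        table = (pte >> 10) << 12;                      /* descend a level */
    }
    return -1;
}
```

A software TLB-reload handler would run something like this on a miss, then install the resulting PTE; a hardware walker does the equivalent in silicon. Either satisfies the spec.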

So there is a lot of flexibility in how you implement virtual memory. Whatever software is needed to make it work on your particular processor should be included in the M-mode software that makes up the Supervisor Binary Interface. The intention is that an identical Linux (for example) kernel should be able to run unmodified in S mode on any RISC-V processor that supports the SBI.

On the one hand, I was expecting to get a reply approximately like this – it’s pretty much inconceivable that anyone could build a processor that would be general-purpose competitive in 2019 without such primitives. On the other hand, you said “There never will be an MMU specification.”

How is what you describe not exactly specification for a hardware MMU? Has terminology shifted in some way I am unaware of?

Back when I was learning about these things, in the day of the 68000 and the VAX, we certainly would have described silicon-level support for virtual addresses and page faulting as hardware MMU, recognizing of course that there will be a software part that varies more by OS. What can have changed, I wonder?

Darnit I thought I’d replied to this last night. Must have gotten lost.

The difference is that RISC-V specifies *only* 1) the page table format in memory and a CSR that points to them and a CSR bit that says whether address translation is enabled, 2) some exceptions that can be generated, 3) the SFENCE.VMA instruction that an operating system executes after modifying the page tables so that following instructions use the new page tables.

*Everything* else is an unspecified black box.

x86 and ARM specify a lot more detail as part of the architecture, constraining implementations to work in particular ways.

One big difference is in the complexity of emulation or virtualisation. Virtualising all the CSRs and TLB and cache management instructions in other architectures — so that you can run an OS under a hypervisor but it thinks it is on real hardware — is a MAJOR exercise.

That is not just reasonable, it’s the Right Thing. Better a small, clean, orthogonal set of MMU primitives than a big elaborate one that mixes policy with mechanism and thus overconstrains what you can build on top of it. I’m sorry to hear that people now associate the term “MMU” with such overengineering, but not especially surprised.

Clean minimalism is a theme that resounds all through the RISC-V design. It makes my architect head happy.

> On the back end of this mess is a wad of logic continuously reconciling things, squashing results that should never have been calculated, committing results that should have been. Conditional branches are sorted out fantastically late in the game so that raw computation can proceed at insane pace.
> …
> This is how you get the performance levels we see in single cores today. This is wildly more complex than the most ambitious pipelined architectures ever made, and somewhere in there, the relative difficulties of ISA details kind of shrank to nothing.

IANAPPD*, but I would strongly suspect that this “wildly more complex” design is how you get to hardware bugs like Meltdown and Spectre. The logic verification tools we have aren’t good or fast enough, and it becomes just too damned hard to reason correctly about the state of the processor.

It may well be that having a more straight-forward ISA would simplify this issue.

>It may well be that having a more straight-forward ISA would simplify this issue.

Only to a very limited extent. The requirement for speculative execution to go faster is independent of ISA; you can make it more difficult with a CISC ISA like the VAX’s but you can’t simplify it away by going RISC. This is a large part of Andrew Molitor’s point.

He’s too dismissive of the benefits of an open ISA, though. It’s true that in a purely technical sense ISA design is no longer the interesting part, but RISC-V is not a purely technical move. It’s a nice clean RISC design, but the real hack going on here is against the economics of the processor market.

What RISC-V does is remove a choke-point where one party would normally squat collecting secrecy rent. This makes it tremendously attractive to anyone managing supply-chain costs and risks. This is why you’re seeing so much interest from Western Digital and Nvidia, and why ARM was worried enough to try to counter-FUD it.

At this level the technical details of the ISA almost don’t matter. All it has to achieve to be fit for purpose is (a) not be IP-encumbered, and (b) not utterly suck. RISC-V achieves both.

> Only to a very limited extent. The requirement for speculative execution to go faster is independent of ISA; you can make it more difficult with a CISC ISA like the VAX’s but you can’t simplify it away by going RISC.

VLIW and related approaches (such as the Mill) can be made to work without relying too much on speculation.

(They do have some drawbacks in comparison to RISC-like designs, such as surfacing way too many implementation-dependent design details as part of the ISA itself. This leads quite naturally to an approach where essentially all user code is provided in some higher-level, “technology-independent” form, and then compiled to the actual ISA by a “driver-like” (see GPUs) or “compiler-like” component. The Mill does this, for example. It strikes me as a potential pitfall for such chips, because compilers in general are far from the degree of reliable correctness that would be needed for such an application.

The Itanium tried to avoid this issue by specifying a sufficiently general ISA. It was also not very successful.)

I of course agree wrt. the economic impact of RISC-V. The “getting rid of a choke-point” part is especially important. Yes, ARM chips may have been fairly cheap already, but having something like RISC-V gets rid of a problematic incentive towards secrecy and rent-seeking, and that can be quite important in the long run.

Being that oft-lonely Mill fan… https://millcomputing.com/docs/switches/ goes into how a multi-way branch gets executed on one. The speculation is innate to how the instruction set works, not grafted on later as with OOO RISC or CISC. An OOO machine MIGHT be able to execute the 8-way branch in their example as fast as a Mill, but it is unlikely.

You can be sure that multiple organisations are working on Out-of-Order implementations comparable to ARM A57 or maybe A72 already. A75/76 competitors won’t be far away.

RISC-V has already got inside ARM’s reaction time. ARM announced SVE (Scalable Vector Extensions) in August 2016 and have yet to ship a core including it. They just this week announced a smaller version, MVE, which will take about two years to ship. The RISC-V Foundation is just putting the finishing touches on a vector instruction set that covers the entire range of MVE and SVE and more, and it will probably be implemented by multiple organisations within the year.

>You can be sure that multiple organisations are working on Out-of-Order implementations comparable to ARM A57 or maybe A72 already. A75/76 competitors won’t be far away.

Right, I’d be astonished if that weren’t true at this point. The economic logic driving towards it is…not just simple, it’s yelling in everybody’s face.

Amusingly, part of it is like the reason TCP/IP swamped proprietary networking; every forward planner can accurately project every other forward planner’s moves in what Douglas Hofstadter called “superrationality” back in the 1980s, a sort of virtuous opposite of the Prisoner’s Dilemma.

Out Of Order implementations are less attractive these days due to the Spectre side-channel issue. It’s a lot more feasible to close these nasty side channels if you’re using simple in-order cores, perhaps as part of a manycore processor. And in-order compute is more energy-efficient as well, which slashes cooling needs. The only real advantage of OOO is that it’s _slightly_ faster in inherently-serial workloads, but overall it’s just not worth it. Even the Mill processor design is (rightly, in my view) avoiding OOO and choosing a rather different approach.

Well, you could abandon the use of a flat address space. Take something like the Intel segmentation model, but, for performance, instead of making each segment an offset:limit into a single paged address space, make each segment its own paged address space.

Then you set your MMU up so that userspace doesn’t use raw segment selectors: each code and stack segment has a “virtual selector table” that translates program-visible selectors to raw selectors, so that a thread can’t even address segments it doesn’t have access to. At any given time, you have one code virtual selector table and one stack VST loaded. The code VST gives the currently running code segment access to any libraries it depends on, as well as any system-global data it holds (for example, the kernel data segment). The stack VST gives the currently running thread access to data specific to the task it’s doing.

A browser could maintain multiple stacks for different tabs and switch between them (with an attendant change in stack VST and visible segments) without having to call into the kernel for a complete context switch.
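A toy C model of just the selector-translation step in this idea (entirely hypothetical; the names are mine) shows the property doing the security work: a thread simply has no name for a segment outside its table.

```c
#include <stdint.h>

/* Hypothetical "virtual selector table": program-visible selectors are
   small indices into a per-thread table of raw selectors. Anything not
   in the table is unaddressable, not merely forbidden. */
#define VST_SIZE 8
typedef struct {
    uint16_t raw[VST_SIZE];    /* raw segment selectors this thread may use */
    uint8_t  valid[VST_SIZE];
} vst_t;

/* returns 0 and sets *out on success, -1 (a protection fault) otherwise */
static int vst_translate(const vst_t *vst, unsigned visible, uint16_t *out) {
    if (visible >= VST_SIZE || !vst->valid[visible]) return -1;
    *out = vst->raw[visible];
    return 0;
}
```

Switching the loaded VST changes the whole set of reachable segments at once, which is what makes the cheap tab-switch in the browser example possible.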

The hardest thing about dealing with Spectre and Meltdown is simply knowing that they exist *before* you design your silicon — and then design it so that *all* effects of speculated actions are undone (or buffered and not committed) if the speculation fails — not only the obvious updates of CPU registers, but also other resources such as branch prediction tables and caches.

People were *just* starting to design OoO RISC-V processors at the time that Spectre and Meltdown hit, so have had a chance to do better. (except for “BOOM”, which was a university project that no one other than the students concerned ever used)

Dual issue in-order CPUs are a sweet spot for many purposes, often improving performance 40% to 60% for maybe a 15% increase in processor size — a pretty good deal. If you’re only accessing SRAM, or programs that fit into caches then that can be good enough. But once you get a deep memory hierarchy and programs that are getting a lot of cache misses, OoO becomes a very big deal. If you care about throughput rather than single-thread performance then extreme hyperthreading, GPU-style task-switching on every instruction (or every memory access), or a barrel processor work well, but if you need single-thread performance then you need OoO. And a very good branch predictor — which we essentially perfected in Pentium MMX and Pentium Pro times. Branch prediction suddenly jumped from 70% or 80% accuracy to 97% or 98%, at which point you can predict through half a dozen or ten branches with the same accuracy that you previously could predict a single branch.

“Today I am happy to announce that we are now providing RISC-V support in the FreeRTOS kernel. The kernel supports the RISC-V I profile (RV32I and RV64I) and can be extended to support any RISC-V microcontroller. It includes preconfigured examples for the OpenISA VEGAboard, QEMU emulator for SiFive’s HiFive board, and Antmicro’s Renode emulator for the Microchip M2GL025 Creative Board.”

It’s a funny coincidence to read about RISC-V here, then a few days later it pops up in one of my professional-interest RSS feeds (I do a lot w/ AWS for work).