This is the tenth talk publicly presented on the Mill general-purpose CPU architecture. It covers only the tool chain used to generate executable binaries targeted at any member of the Mill CPU architecture family. The talk assumes familiarity with CPU architecture in general and with C++ programming in particular.

LLVM meets the truly alien:

The Mill CPU architecture in a multi-target tool chain

The Mill is a new general-purpose CPU architecture family that forms a uniquely challenging target for compilation – and also a uniquely easy target. This talk describes the Mill tool chain from language front end to binary executable.

The Mill tool chain is unusual in that it translates LLVM intermediate representation (IR) not into object code but into a different IR (genAsm), tailored for the Mill architecture family. Then a separate tool, the specializer, converts genAsm input into executable binary code (conAsm) for a particular Mill architecture family member. genAsm is a dataflow language, essentially a programmable representation of a single-assignment compiler IR.

The Mill has no general registers. Instead, intermediate results are placed on the Belt, a fixed-length queue, and these operands are accessed by temporal addressing. A Mill operation in effect says “add the third most recent value to drop on the belt to the fifth most recent, and drop the result at the front of the belt, and discard the oldest value from the other end of the belt”. The Mill is also a (very) wide issue machine, and many of these actions are taking place concurrently in each cycle. The tool chain, or rather the specializer component, must track the location of operands as they move along the belt, because their belt address changes as other operations are executed and drop results. In addition, the Mill is statically scheduled with an exposed pipeline, so an operation may produce its results several cycles after the operation was issued, possibly with intervening control flow.
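The temporal-addressing idea above can be sketched as a toy model. This is a minimal Python sketch, not the real hardware: the belt length, values, and access pattern are purely illustrative.

```python
from collections import deque

class Belt:
    """Toy model of the Mill belt: a fixed-length queue of operand values.

    Results "drop" at the front; operands are addressed temporally
    (position 0 is the most recent drop, 1 the next most recent, ...).
    """
    def __init__(self, length=8):
        self.slots = deque(maxlen=length)  # oldest values fall off the far end

    def drop(self, value):
        self.slots.appendleft(value)       # new result becomes position 0

    def get(self, pos):
        return self.slots[pos]             # temporal address, not a register name

belt = Belt()
for v in (10, 20, 30, 40):   # four results drop in order
    belt.drop(v)

# After the drops: position 0 = 40, 1 = 30, 2 = 20, 3 = 10.
# "Add the second most recent value to the fourth most recent":
result = belt.get(1) + belt.get(3)   # 30 + 10
belt.drop(result)                     # the sum lands at the front as new position 0
```

Note how every existing value's temporal address shifts by one when the result drops; that shifting is exactly what the specializer must track as operations execute.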

This belt structure leads to unique needs for operation scheduling and operand spilling. These needs are the rough equivalent of instruction selection, register coloring, and spill on a conventional machine. The talk concludes by explaining the algorithms used.

Nice to see another talk, thanks! I particularly liked when you touched on the spiller/filler operation — which suggests a topic for the next talk: some details re Mill context-switching; interrupt handling, thread management, OS kernel calls, etc.

Speaking of which, how do you envision handling real-world I/O? Task (and tax) a Gold Mill as a monoprocessor, or have (say) a few satellite Tins and DMA engines keeping the Gold fed and happy? Given the Mill’s modest transistor budget, it seems that a message-passing multicore approach is not unreasonable…

There are two different I/O questions: pin-twiddling I/O, and DMA. We expect all the twiddling to be memory-mapped, so (as with the rest of the Mill) there’s no need for privileged ops and all access control is via the existing turf-based protection mechanism.

As for DMA, the question is coherence and real-time constraints. We expect that a DMA device will appear to be just another core to the program model. That said, how inter-process communication (other than shared memory) works is still NYF.

Watching the Compiler talk it occurred to me that it might be a good idea to include “microcoded” instructions in genAsm. Linus has a tendency to wax poetic about x86, and one of his favorite things about it is the availability of instructions like rep mov. It seems like having Mill genAsm instructions that mirror the semantics of C primitives like strcpy, memcpy, memset, etc. might be useful. Of course no Mill member would implement them, but they could be specialized differently on different architectures in ways that might have important performance implications.

Microcode is a bad fit with any wide-issue machine. A micro-copy routine won’t saturate the width (it’s just a loop, and internal dependencies will keep you from using all the FUs). When the same loop is macrocode then other app code can be scheduled to overlap with at least some of the ops of the copy, giving better utilization.

Most people (Linus included I expect) who prefer micro over a function call or in-line macrocode do so because they expect that a call is expensive and the in-line code will consume architectural resources such as registers, while the micro is cheap to get into and runs in the micro state so it doesn’t collide with app state.

Guess what: Mill calls are cheap and give you a whole private state that doesn’t collide either.

We’re not opposed to putting complex behavior in the hardware, for which the implementation will look a whole lot like a small programmable device: the Spiller is a glaring example, and the memory controller and the exit prediction machinery are too, and there are more NYF. But these are free-standing engines that are truly asynchronous to the main core execution and do not contend for resources with the application. But classical microcode? No, our microcode is macrocode.

I think I wasn’t clear. I don’t mean literal microcode. I meant “microcode” that would be evaluated into a stream of regular instructions by the Specializer. This would let the compiler emit something like memcpy which would translate into some architecture optimized routine when run on that specific architecture.

Exactly. But instead of nice atomic ops that happen to be missing, such as floating point, these would be operations that no family member would ever implement. Unless you change your mind about actual microcode, but I wouldn’t advise that. 🙂

EDIT: Of course, since this is purely software there’s nothing preventing you from creating a “MILL ISA 1.1” at some later point with these extra pseudo-ops.

One of the nice things about the Mill is that they’ve got a really nice story for being able to add and remove instructions from hardware in response to changing demand without breaking software. So it’s mostly the programmers who have to worry about feature creep but don’t worry, we’re used to it.

Yes, function-scale builtins could be emitted by the compiler and then get member-specific substitution in the specializer. The difficulty is in the compiler and host language: how would a program ask for one of these builtins? In all compilers (including those from third parties) and all languages?

Generally we consider the host language to be off limits. If something is needed by a language’s user community then let the standards process add it to the language and we will support it. Some such additions already exist: std::memcpy and others, and for those we can do what you suggest. There are other possible candidates in libc and POSIX. However, it’s not clear that there would be much gain over simply having the compiler emit a function call and letting the specializer inline that.

Mind, it might be a good idea; the problem is that we can’t know until we measure. There is a lot of measurement and tuning for each member. It won’t get the integer factors that the basic architecture does, but a few percent here and there adds up too.

Keep in mind that when dealing with LLVM IR, you already have to consider lowering its intrinsics [1] (though I believe you get some defaults, like “llvm.memcpy” lowering to a call to “memcpy”, if your target doesn’t do smarter lowering).
Some intrinsics (“llvm.memcpy”, “llvm.memmove”, “llvm.memset”, and possibly more) are automatically detected from naive loops and are also generated by compilers, e.g. for shallow copies.

A problem, indeed. I work with embedded MCUs, and can say that the C compilers are rather hideous abattoirs of non-standard __intrinsics__, #pragmas, and asm() statements — not pretty.

When I was a younger man, I made the mistake of trying to anticipate all use-cases and implement The Kitchen Sink right off the bat… Now I’m a bit wiser, and only implement functionality that is actually needed by real-world programs.

At some point, Mill silicon (whether FPGA or not) will be self-hosting its own compiler, so I’d expect that strcmp() and its ilk will be obvious examples of use-cases that demand some attention…

If you have to specialize code for each chip, does that mean that different models of Mill will not be binary compatible with each other? How will software developers release binaries that target Mill?

The Mill members are load-module (.exe) compatible, but not bitwise encoding compatible. The load module contains member-independent genAsm and a cache of previous specializations for particular members. If you run that program on a member for which there is no binary in the cache then the loader specializes it to the target, as well as doing the dynamic linking, relocation, and the rest of what loaders do, and also writes the new specialized code back to the load module cache so it doesn’t have to be specialized again.

We expect that specialization will be a routine part of the install process for most software, but you can omit the install step and just get the specialization the first time you run it. Unless you are debugging the specializer you won’t need to look at the actual binary; this is essentially the same way that Java and other systems that distribute machine-independent code work.
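The specialize-on-first-run flow described above can be sketched as a toy caching scheme. Everything here (the dict layout, the stand-in `specialize` function, the member names) is invented for the illustration; only the cache-then-write-back behavior comes from the description.

```python
def load(module, member, specialize):
    """Toy model of Mill load-time specialization.

    `module` holds member-independent genAsm plus a cache of previous
    specializations; `specialize` stands in for the real specializer.
    """
    if member not in module["cache"]:
        # No binary for this member yet: specialize now and write the
        # result back, so later runs on this member skip the step.
        module["cache"][member] = specialize(module["genasm"], member)
    return module["cache"][member]

module = {"genasm": "<member-independent IR>", "cache": {}}
fake_specializer = lambda ir, member: f"conAsm for {member}"

first = load(module, "Gold", fake_specializer)   # specializes and caches
second = load(module, "Gold", fake_specializer)  # cache hit, no re-specialization
```

Running the specializer at install time just means calling `load` eagerly for the target member before first execution.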

Where does the specialization of vectorized loops occur (e.g., the strcpy presented in one of the talks)? It seems that genAsm should already have concrete instructions (the compiler has to know the iteration offset, right?), but how is that possible when each member can have a different vector width?

It is done in the specializer, based on extra info (not yet) produced by the compiler. The specializer wants to know the initiation interval and which variables will be vectorized and which are loop-scalar. It has proven complicated to get this info out of the compiler, and it doesn’t work yet.

In the (a-(b+c)) & ((b+c)*d) example ‘sub’ was issued on cycle 2 while ‘mul’ was issued on cycle 4. Would it be possible to issue both ‘sub’ and ‘mul’ on cycle 4, since both inputs to ‘sub’ should already be available?

If so, is there an advantage to issuing instructions late as opposed to issuing them as soon as their arguments are available? In this particular case one could, at least naïvely, think that a larger part of the logic could have been clock-gated in cycles 3 and 2, possibly leading to power savings if ‘sub’ and ‘mul’ had been issued together.

Furthermore, wouldn’t the ‘load’ example work automatically without need for special cases if all instructions (including loads) would be issued early (i.e., as soon as their inputs are available)?

Generally, close to the consumer is best because it reduces belt pressure; if you compute a value early, it may have fallen off the belt before it gets used, forcing spill/fill ops to retain it. The exception is loads, which should be issued as early as possible while still retiring as late as possible; maximizing load deferral gives the memory hierarchy the most time to come up with a value.

Because the inputs are themselves computed as late as possible, usually you don’t have the inputs waiting around on the belt either. There are exceptions: when the inputs are call arguments, or when an input is used more than once, with an early consumer and a late consumer. However, all of these cases are examples of a general pattern: inputs must be produced early for outside reasons, and outputs must be consumed late, also for outside reasons, so there is a range between production of input and consumption of output in which the op could be placed. It turns out that where you place the op makes no difference in this pattern; if you place it early then its outputs may have to be spilled, but if you place it late then its inputs may have to be spilled, so the impact on belt pressure is roughly the same, assuming that the number of inputs and outputs is roughly the same. Because we are placing as late as possible in general, we place this pattern late too, just for simplicity, even though it could have been placed earlier without affecting overall latency.
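The “placement makes no difference in this pattern” claim can be illustrated with a back-of-the-envelope model. Assume, purely for the sketch, that spill pressure is proportional to live value-cycles: each input stays live from when it is produced until the op issues, and each output stays live from the op until its consumer.

```python
def live_value_cycles(n_in, n_out, t_in, t_out, t_op):
    """Operand-lifetime cost of placing one op at cycle t_op, when its
    inputs are all ready at t_in and its output is consumed at t_out."""
    return n_in * (t_op - t_in) + n_out * (t_out - t_op)

# Same number of inputs and outputs: placement does not change the cost.
early = live_value_cycles(2, 2, t_in=0, t_out=10, t_op=1)   # 2*1 + 2*9 = 20
late  = live_value_cycles(2, 2, t_in=0, t_out=10, t_op=9)   # 2*9 + 2*1 = 20

# More inputs than outputs (the common binary-op case): late placement
# keeps the inputs live longer, so the symmetry breaks.
early2 = live_value_cycles(2, 1, t_in=0, t_out=10, t_op=1)  # 2*1 + 1*9 = 11
late2  = live_value_cycles(2, 1, t_in=0, t_out=10, t_op=9)  # 2*9 + 1*1 = 19
```

The symmetric case matches the pattern described in the post; the asymmetric case shows why the “roughly the same number of inputs and outputs” caveat matters.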

Incidentally, operation placement is equivalent to the NP-hard bin-packing problem, so it’s a game of heuristics because optimal placement is impractical.

If so, is there an advantage to issuing instructions late as opposed to issuing them as soon as their arguments are available? In this particular case one could, at least naïvely, think that a larger part of the logic could have been clock-gated in cycles 3 and 2, possibly leading to power savings if ‘sub’ and ‘mul’ had been issued together.

The TL;DR answer: there is no power difference as long as the number of belt value spills/fills remains the same.

As far as power savings due to clock gating is concerned, there are 2 major factors:

The power consumed in performing the operation itself (the add, sub, mul, etc.)

The power consumed in maintaining operands on the belt

The power consumed by the operation itself is independent of when the operation is performed. It is what it is, the same every time for a particular set of input values. When an operation is not being performed by a particular functional unit its clock is gated off. When the clock is gated off, the functional unit power consumption is only that due to static leakage current.

The power consumed in maintaining operands on the belt is roughly fixed from cycle to cycle; it depends mostly on the number of new result values arriving each clock times the number of potential destinations for each result.

The conclusion is that the biggest factor in reducing power is the number of belt spills/fills that must be performed. The fewer the spills/fills, the lower the power.

Ivan used the word “schedule” in a few places to talk about very different things.

For example, the schedule at the bottom of slides 16 to 21, in the “scheduling pass”. It is not really a schedule, nor a scheduling pass. I would call it a “prioritization pass”, and its result an “operation priority list”. This list is the input to the next pass, the creation of the tableau, which is the actual scheduling pass; its result is a schedule (i.e. the list of instructions and ops in their intended order). The intent of the prioritization pass is to schedule the ops in the longest dependence chain first, so that they get to use the FUs preferentially. This works because the other ops, the ones not in the longest chain, can be moved around (considered later during the scheduling pass, and placed earlier or later in the actual tableau) without changing the total execution time; otherwise they would be in the longest chain by definition.
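As a concrete sketch of that prioritization pass: compute, for each op, the length of the longest dependence chain from it to the end of the dataflow, and sort descending. The DAG below is the (a-(b+c)) & ((b+c)*d) expression from the talk, but the latencies are made up for the example.

```python
from functools import lru_cache

# Dataflow DAG for (a-(b+c)) & ((b+c)*d); illustrative latencies only.
latency = {"add": 1, "sub": 1, "mul": 3, "and": 1}
consumers = {"add": ["sub", "mul"], "sub": ["and"], "mul": ["and"], "and": []}

@lru_cache(maxsize=None)
def chain(op):
    """Length of the longest dependence chain from `op` to the end."""
    return latency[op] + max((chain(c) for c in consumers[op]), default=0)

# Prioritization pass: ops on longer chains get first pick of the FUs
# when the tableau is later filled in.
priority = sorted(latency, key=chain, reverse=True)
```

Here `add` heads the list (chain length 1 + 3 + 1 = 5 through `mul` and `and`), while `sub`, off the longest chain, can slide around without stretching total execution time.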

As for the number of in-flight loads, Ivan kind of answered it, but not very cleanly. The number of retire stations can be smaller or bigger than maxLatency or the size of the belt. I.e. you can have 2 load units, 32 belt positions, maxLatency 32, and yet have 10 retire stations, or 2, or 50, or 64. Having fewer than one per load unit would make no sense (and would probably be hard to implement), so 2 is the practical minimum. As for the maximum, in theory it can be even higher than 64 (2×32), in the range of 80-100, but that would be hard to actually use because of the way the delays are expressed. (From my benchmarks and tests on some Intel hardware, a Sandy Bridge era core can keep about 70 loads in flight even when issued by a single core; but some of the load queues are shared by all cores, in the uncore and the memory controllers, so there are limits, and you can’t do more loads in parallel just by using more cores. There are also limits on how many load requests you can actually send to the DRAM/DDR chips, but the extra parallelism still helps with accesses that hit in the caches.)

Scratchpad. The “static” part of the scratchpad size is that each function (frame) is analyzed by the compiler, and on entry to the function a known constant value (for that function) is supplied to the CPU core. That is, on entry to the function foo, the CPU is informed that foo needs a scratchpad of 17 bytes, i.e. it wants to be able to use up to 17 bytes of memory for its spill and fill operations (many functions will use 0 bytes of scratchpad). The byte-addressable part means that “spill b1, 3”, for example, will put the value from b1 (which can be more than one byte) at address 3 of this function’s scratchpad. The actual position in the scratchpad hardware buffer depends on the context plus the offset 3. I.e. the frame base can be 170 (determined on entry to the function by the hardware), with an upper bound of 187 (170 + 17), and offset 3 will put the data at 173 in the SRAM. This obviously ignores the fact that there are other facilities in the Mill to spill to DRAM, and to let the OS save and restore the scratchpad for context switching. The main point of the static size (like the 17 here, determined by the compiler and made as small as possible) is that it allows cheap execution of function calls and interrupts from within other functions: a nested call of a new function will get a new scratchpad starting at position 187. If we make deeply recursive calls, we might run into the end of the SRAM buffer (let’s say 8 kB); what happens then is that it acts like a circular buffer: the values at the start of the buffer are asynchronously copied to DRAM by a separate piece of hardware (even before we actually reach the end of the buffer), scratchpad addresses wrap around the SRAM addresses, and the core starts using the start of the SRAM for the new scratchpads.
On returns from those functions the opposite happens, and the hardware asynchronously loads the data back as we move toward the beginning of the SRAM. One can think of it as a two-level circular buffer, with in and out queues, and some logic to make these operations have essentially zero impact on the latency of spill and fill.
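A toy model of that frame-relative, wrap-around addressing. This is a sketch only: the asynchronous copy-out to DRAM and the fill-back on return are omitted, and all sizes are illustrative (a 32-byte SRAM instead of 8 kB, so the wraparound is visible immediately).

```python
class Scratchpad:
    """Toy scratchpad: a small circular SRAM with frame-relative addressing.

    Each call reserves a statically known frame size; spill/fill offsets
    are relative to the current frame's base, which wraps around the SRAM.
    """
    def __init__(self, sram_bytes=32):
        self.sram = bytearray(sram_bytes)
        self.frames = []                  # (base, size) per active frame

    @property
    def base(self):
        return self.frames[-1][0] if self.frames else 0

    def call(self, size):
        # A nested call's frame starts where the caller's frame ends,
        # modulo the SRAM size (the circular-buffer wraparound).
        new_base = ((self.base + self.frames[-1][1]) % len(self.sram)
                    if self.frames else 0)
        self.frames.append((new_base, size))

    def spill(self, offset, data):        # e.g. "spill b1, 3"
        for i, b in enumerate(data):
            self.sram[(self.base + offset + i) % len(self.sram)] = b

    def fill(self, offset, n):
        return bytes(self.sram[(self.base + offset + i) % len(self.sram)]
                     for i in range(n))

sp = Scratchpad(sram_bytes=32)
sp.call(17)                  # foo needs 17 bytes of scratchpad
sp.spill(3, b"\x2a")         # frame offset 3 -> SRAM address 3
sp.call(17)                  # nested call: new frame starts at SRAM 17
sp.call(17)                  # next frame wraps: (17 + 17) % 32 == 2
```

In the real hardware, that third frame overwriting the start of the SRAM is safe only because the oldest frames have already been copied out to DRAM in the background; the modulo arithmetic here just shows the addressing.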

For the question “are there optimizations on the tableau done by the specializer?”: well, no, and there is no need for that. The specializer, scheduling to a target machine, produces nearly optimal code, and an optimal schedule in a lot of cases, so by construction the tableau is already optimized. As for other optimizations, there is no need to do them. The specializer’s job is not to optimize the code (that is the job of the compiler and middle end) but to generate the code for the specific member. I.e. if the middle end creates some redundant operations, or operations whose results are never used, the specializer is not going to care, and will still schedule them, with all those operations still present in the final code. And this is fine, because it makes the specializer faster, and whatever is fed into the specializer will have been produced by a compiler that already got rid of such stuff in practice.

As for the other questions and answers: it is all obvious to Ivan, because he sits in this every day, so sometimes it might be hard for him to see that other people have trouble with the terminology or even with the most obvious answers. Usually giving more context is the best option, not just a bare yes/no answer.