Does Hardware Multi-threading Belong in Embedded CPU’s?

by Dick Selwood

Synthesizable embedded microprocessors surpassed stand-alone conventional CPU’s in unit volumes several years ago, and are beginning to catch up to them in performance and feature complexity. Hardware multi-threading (as distinct from software multi-threading) is one of those advanced features. But does hardware multi-threading (MT) have any applicability in a synthesizable embedded space where superscalar and multicore offerings appear to dominate? If yes, where and how does it help?

History of MT

Intel pioneered multi-threading at the beginning of the millennium with hyper-threading in the Xeon and Pentium 4, and has implemented this feature in subsequent devices, including the Atom and Core i7. Other standard processor vendors followed suit, with IBM releasing the Power5 in 2004 and Sun Microsystems offering the UltraSPARC T1 in 2005. Various SoC vendors also incorporated MT in their products, including the likes of PMC-Sierra and RMI (now part of NetLogic). One can see that MT as a CPU technology has a solid pedigree.

Why MT?

MT implementations replicate the CPU’s upper pipeline portions concerned with accessing and preparing instructions for processing (and thus capturing the processor’s ‘architectural state’ for each task or thread), while sharing the execution portion of the pipeline between them. Multi-threading CPU’s switch away from a priority or active thread to a ‘waiting’ thread once the priority thread’s processing has stalled. Typical stall conditions include:

(1) Cache misses – these typically take 50 or more cycles to correct, and are especially costly in the case of page faults, which require access all the way back to a large storage device such as an HDD.

(2) Branch mis-predicts – a latency penalty of typically five or more cycles.

(3) Data dependencies.

The commonality amongst all these conditions is that they force a pipeline flush. While pipeline flushes are a primary cause for reducing single threaded CPU pipeline architectures to 40%-50% execution efficiency in terms of instructions per cycle (IPC), MT ameliorates this by switching processing context to a waiting thread already prepared for execution, frequently improving IPC efficiency to around 80%.
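The efficiency argument above can be sketched with a toy switch-on-stall model. This is only an illustration under stated assumptions, not a description of any real pipeline: each thread issues one instruction per cycle for `run_cycles` cycles and then stalls for `stall_cycles` cycles, switching threads costs nothing, and the values 40 and 60 are hypothetical numbers chosen to fall in the ranges quoted in the text.

```python
def simulate(num_threads, run_cycles, stall_cycles, total_cycles=100_000):
    """Toy switch-on-stall model: fraction of cycles that issue an instruction."""
    ready_at = [0] * num_threads             # first cycle each thread may issue again
    remaining = [run_cycles] * num_threads   # instructions left before the next stall
    issued = 0
    current = 0
    for cycle in range(total_cycles):
        if ready_at[current] > cycle:
            # Current thread is stalled: look for any thread that is ready.
            for t in range(num_threads):
                if ready_at[t] <= cycle:
                    current = t
                    break
            else:
                continue  # every thread is stalled, so this cycle is a bubble
        issued += 1
        remaining[current] -= 1
        if remaining[current] == 0:
            # The thread hits a stall (e.g. a cache miss) after this instruction.
            ready_at[current] = cycle + 1 + stall_cycles
            remaining[current] = run_cycles
    return issued / total_cycles


def closed_form(num_threads, run_cycles, stall_cycles):
    """Same model in closed form: utilization saturates at 100%."""
    return min(1.0, num_threads * run_cycles / (run_cycles + stall_cycles))


# Assumed, illustrative workload: 40 useful cycles between 60-cycle stalls.
for n in (1, 2, 3):
    print(n, simulate(n, 40, 60), closed_form(n, 40, 60))
```

With these assumed numbers the model gives 40% efficiency for one thread and 80% for two, which is the kind of improvement the paragraph above describes. Real pipelines have non-zero switch costs and irregular stall patterns, so measured gains will differ.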

Advantages of MT

With MT, multi-tasking runs more smoothly, providing a system performance improvement of 20-30% over the long term. This is a direct result of greater IPC efficiency. An ancillary bonus is general system latency reduction. Furthermore, since task/process switching is no longer gated by L1 cache access, the system behaves in a more deterministic manner – a valuable factor for real-time applications.

There are further subtle salutary effects. The added logic functionality has minimal impact on silicon area – typically less than a 5% increase to CPU real estate. Furthermore, resistance to latency effects on performance contributes slack to the memory subsystem, permitting use of slower memory, and relaxing design requirements for critical path support – thus reducing the need for additional clock buffering, interconnect repeater insertion and larger, higher-drive standard cell logic.

But does an MT-capable embedded CPU instantly heal the earth, calm the waters, achieve world peace and solve all of your SoC design problems? Hardly.

MT implementation Issues

Despite its advantages in terms of IPC efficiency and task-switching latency reduction, there are potential issues with MT involving the CPU, system hardware and the software stack:

(1) Cache thrashing – a dramatic increase in system latencies can occur if more than one thread is competing for access to the same L1 memory space. This can be mitigated by a properly designed cache access and increasing segmentation/set associativity of the L1 cache.

(2) Bandwidth – though not unique to MT applications, a badly-designed local bus that results in instruction and/or data congestion defeats the advantages of MT.

(3) TLB and branch-predictor contention – thread competition over shared lookup structures, such as the translation lookaside buffer (TLB) or a branch history table, is a subtle source of problems for an MT-enabled processor.

(4) Threads running identical processes – one of the misconceptions regarding MT is that it provides immediate benefits to tasks such as packet processing. However, switching between threads running identical tasks does not improve execution pipeline efficiency. It is in context switching between very different processes, which would normally result in a pipeline flush, that the benefits of MT are most evident.

(5) Matrix operations – because of their heavy data dependencies, implementing matrix calculations in any RISC or CISC CPU architecture (with or without MT) can lead to cache thrashing. Such tasks are best handled by dedicated SIMD engines, as matrix operations are frequently encountered in graphics tasks, and are thus more relevant as a system level design issue.

(6) OS support – an operating system has to be capable of supporting MT and the kernel must be configured for symmetric multi-processing (SMP). Fortunately, there is an ever-increasing prevalence of OS’s that are SMP-capable, including Windows and multiple Linux distributions. Once an OS is SMP-enabled, it will treat an MT-capable physical processor as two virtual CPU’s, each capable of supporting multiple threads. (For further exploration of OS issues related to supporting MT such as scheduler load balancing or using POSIX to build multi-threading applications, please refer to the Tim Jones/IBM technical paper in the bibliography).
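Issue (1) above, cache thrashing, and its mitigation through higher set associativity can be illustrated with a small cache model. This is a minimal sketch under assumptions that are not in the article: an 8-line cache with LRU replacement, two threads whose four-block working sets (blocks 0-3 and 8-11, both hypothetical) map to the same sets, and strictly alternating accesses between the threads.

```python
from collections import OrderedDict


def miss_rate(accesses, num_lines, ways):
    """Miss rate of a set-associative cache with LRU replacement."""
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in accesses:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)        # mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)   # evict the least recently used line
            lines[block] = True
    return misses / len(accesses)


def interleaved(rounds):
    """Two threads alternate accesses; their blocks collide in the cache index."""
    thread_a = [0, 1, 2, 3]
    thread_b = [8, 9, 10, 11]
    sequence = []
    for _ in range(rounds):
        for a, b in zip(thread_a, thread_b):
            sequence += [a, b]
    return sequence


trace = interleaved(100)
print("direct-mapped:", miss_rate(trace, num_lines=8, ways=1))
print("2-way:        ", miss_rate(trace, num_lines=8, ways=2))
```

In this contrived trace the direct-mapped cache misses on every access, because each thread keeps evicting the other's line, while the 2-way cache of the same total size misses only on the first touch of each block. Real workloads are far less regular, so the sketch shows the mechanism rather than a realistic magnitude.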

MT “alternatives”

The MT approach exploits thread-level parallelism, a common phenomenon in multi-task computing. An alternative approach is to simply turn up the clock on single-threaded architectures and achieve greater throughput with raw speed. But this may exacerbate the situation. Latencies of 50 or 60 cycles created by a cache miss in a 500-MHz processor might turn into latencies of 100 cycles or more in a 1-GHz CPU.
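The arithmetic behind that last claim can be made explicit. The sketch below assumes that the memory subsystem's latency in nanoseconds stays the same when the core clock is raised, which is the premise implied by the paragraph; the function names are illustrative only.

```python
def latency_ns(cycles, clock_mhz):
    """Convert a stall measured in core cycles to nanoseconds."""
    return cycles * 1000.0 / clock_mhz


def stall_cycles(latency_in_ns, clock_mhz):
    """Convert a fixed memory latency in nanoseconds to core cycles."""
    return latency_in_ns * clock_mhz / 1000.0


for cycles_at_500 in (50, 60):
    ns = latency_ns(cycles_at_500, 500)   # latency seen by the 500-MHz core
    at_1ghz = stall_cycles(ns, 1000)      # same latency seen by a 1-GHz core
    print(cycles_at_500, "cycles at 500 MHz =", ns, "ns =", at_1ghz, "cycles at 1 GHz")
```

Under this assumption a 50-cycle miss at 500 MHz is 100 ns, which costs 100 cycles at 1 GHz, and a 60-cycle miss costs 120 cycles. Doubling the clock therefore doubles the number of issue slots lost to each miss.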

Another alternative to MT that has seen increasing use in embedded synthesizable CPU architectures is the implementation of superscalar designs – a technique that takes advantage of instruction-level parallelism (ILP). In fact, one can view superscalar as the flip side of MT, in that superscalar CPU’s eschew the replication of the upper end of the pipeline and instead iterate the execution stages. The result is a multi-issue pipeline that can theoretically achieve execution efficiency above 100%.

As long as the upper pipeline stages correctly and efficiently check for data dependencies and opportunities for ILP in the sequential instruction stream, the CPU has a greater chance of keeping its redundant execution pipelines efficiently utilized. However, dual-issue superscalar CPU’s are subject to the same system vulnerabilities that cause inefficiencies in single-threaded architectures – cache misses, branch mis-predicts and data dependencies (which, despite hardware checking early in the pipeline, are pernicious problems caused by inherent limitations in ILP for any given code and which can interfere with maximum instruction dispatch across the execution pipes). All of these provoke a pipe flush in a single-threaded architecture; a superscalar approach simply provides redundancy of execution resources to compensate.

One can thus view MT and superscalar as two sides of the same coin. The aim of both is to increase throughput. MT uses a minimalist approach by emphasizing the efficient utilization of a single execution pipe while being parsimonious with silicon area and lowering the overall power profile – in effect, trying to do more with less. Superscalar is a more brute force approach that trades off greater CPU size, cost and power profile for greater overall throughput.

MT and the future of embedded synthesizable CPU’s

The utility of multi-threading in improving CPU pipeline efficiency, as well as the parsimonious way that MT achieves this and the resulting benefits to silicon area, cost, static and dynamic power consumption, memory subsystem implementation, SoC design optimization and SW stack development are by this point clear. Yet MT-capable CPU’s have not seen widespread adoption in embedded applications. Why is that?

MIPS pioneered MT in the embedded space with the release of the MIPS 34K in 2005 and the multicore version of it – the 1004K – in 2008. ARM, by contrast, stayed with single-threaded approaches and pursued superscalar and multicore aggressively, starting with the ARM11 MPCore in 2005 and continuing in 2007 with the Cortex-A9.

The different technology directions pursued by MIPS and ARM in the last decade can be readily explained by the needs of their application segments. MIPS focused primarily on wired applications in the digital home market (including DVD, HDTV, STB and home gateway) and wireless infrastructure (particularly WiFi base stations). The acute need for efficient context switching between multimedia threads and media and communications processes in the control plane was what drove MIPS to pioneer MT in the synthesizable embedded space. ARM, by contrast, utterly dominated the wireless handset market, which did not experience the same level of multi-tasking pressures until the rise of the smartphone late in the 2000’s. Thus, ARM supported its customers’ primary goal of increasing bandwidth by emphasizing innovation in total throughput – hence the adoption of a superscalar approach.

Now, however, with ARM becoming dominant in the digital home and wireless infrastructure markets, as well as the continuing healthy growth of the smartphone user base, will ARM follow up with an MT offering in its product line? At first blush, there doesn’t appear to be a need for it – after all, ARM has had stunning success with its single-pipeline, superscalar and multicore offerings, so why change an already successful technology portfolio?

Upon further examination, though, one can see that MIPS did indeed achieve commercial success with its MT offerings, with the 34K and 1004K popping up in nearly three dozen different wired consumer electronics applications. With the wired consumer electronics segments having, by and large, converted from MIPS to ARM, this system OEM familiarity with the benefits of MT, combined with the multi-tasking complexity of the growing smartphone market and relentless performance, density/cost and power consumption pressures on consumer electronics silicon solutions, will likely force ARM’s hand and compel it to adopt MT.

Furthermore, if ARM wants to seriously contend with Intel for dominance, its embedded synthesizable CPU offerings can hope to close on the performance levels of their standard product counterparts only if the feature sets of synthesizable and standalone CPU offerings converge as well. Thus, future ARM cores will need to incorporate simultaneous multi-threading along with superscalar, multi-issue pipelines with out-of-order execution to become full hyper-threading cores, and these individual cores will need to be further integrated into multi-CPU architectures with built-in memory and I/O coherency.

The conclusion is thus inescapable: MT in the embedded world is here to stay. ARM is sure to implement it and proliferate it through its new products. A vastly larger embedded engineering community will become exposed to MT as a consequence and learn to utilize it effectively, with the end result that MT will be a key feature in driving further innovation and feature richness in wired and wireless consumer electronics.

Author:

Peter Gasperini – VP of Business Development, Markonix LLC; previously President and GM of Iunika North America, with 22+ years of experience in Silicon Valley with ASIC, FPGA, embedded microprocessors and engineering services.

Why is there no mention of XMOS / XCore in this article? Eight HW threads per core, each with its own register set. Round-robin HW scheduling per instruction. Inter-thread / inter-core communication channels supported in HW. Ten 32-bit HW timers per core.

[from their web site:] The XMOS architecture enables systems to be constructed from multiple XCore processors connected by communication links. Every XMOS device includes one or more XCores and a high-speed low-latency switch. The switches are used to route messages between XCores on a chip, and messages between chips, via the links.

@edschulz,
The article is a discussion of multithreading for embedded applications in general, Ed, and is not an advertisement for any particular company. Ads are boring; general discussions on technology trends are vastly more interesting and useful.
