DesignWare Technical Bulletin

Designing Processors for High-Performance Embedded Linux Applications

Designers of performance-intensive, embedded SoCs running Linux or other virtual-memory operating systems must address increasing performance requirements with power budgets that are often constant or shrinking. Available processors that offer the needed performance often draw too much power, while processors that fit within the power budget lack the necessary performance.

The traditional path of increasing performance by raising the processor clock speed has significant tradeoffs: dynamic power consumption rises at least linearly with clock frequency, and faster still when a higher supply voltage is needed to reach the target speed. Processors that enable dual- and quad-core designs with cache-coherent symmetric multiprocessing offer chip designers an alternative path to higher performance. Many applications, such as wearables that need good performance at very low power to maximize battery life, benefit from running on multiple CPU cores, especially when the software can efficiently distribute workloads across a multicore cluster. For some applications that run high-end operating systems, a single core delivers the required performance in most implementations. When more performance is needed, however, a dual- or quad-core processor can be implemented in a symmetric configuration, with the operating system distributing the load across the cores to achieve the required speed.
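The tradeoff between clock speed and core count can be illustrated with the commonly used CMOS dynamic-power approximation P = C x V^2 x f, which is linear in frequency at a fixed supply voltage. The sketch below is only a rough model: the capacitance and voltage values are hypothetical, chosen for illustration, and do not describe any particular processor.

```python
def dynamic_power(c_eff, vdd, freq_hz):
    """Approximate CMOS dynamic power in watts: P = C * V^2 * f."""
    return c_eff * vdd ** 2 * freq_hz

# Hypothetical values for illustration only (not taken from any datasheet).
C_EFF = 50e-12  # effective switched capacitance per core, in farads

# One core at 1 GHz and 0.9 V.
base = dynamic_power(C_EFF, 0.9, 1.0e9)

# One core at 2 GHz, assuming the higher speed needs a 1.1 V supply.
fast_single = dynamic_power(C_EFF, 1.1, 2.0e9)

# Two cores, each at 1 GHz and 0.9 V (ideal scaling, no shared-logic overhead).
dual_core = 2 * base

print(f"single core, 1 GHz: {base * 1e3:.1f} mW")
print(f"single core, 2 GHz: {fast_single * 1e3:.1f} mW")
print(f"dual core,   1 GHz: {dual_core * 1e3:.1f} mW")
```

Under these assumptions, doubling the clock costs roughly three times the power, while doubling the core count costs twice the power for the same nominal cycle throughput, provided the workload can actually be spread across both cores.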

Cache Features for Easier Multicore Implementation

A processor can offer a number of cache-related features to enable easy implementation of multicore systems. L1 cache coherence is critical for multicore Symmetric Multi-Processing (SMP). When two or more CPUs access the same memory, some mechanism must keep the cached data coherent so that the CPUs do not independently modify separate copies of the same data. Maintaining this coherence in software is difficult and consumes many clock cycles, so cache-coherent processors implement the mechanism in hardware. They use snooping to monitor all L1 caches for read and write operations and keep the data in each cache coherent with the data in the other caches.
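To make the idea of snooping concrete, the following is a minimal software sketch of a textbook MSI (Modified, Shared, Invalid) protocol for a single cache line. It is a teaching model only: the class and method names are invented for this example, and it is not a description of how any specific processor implements coherence in hardware.

```python
from enum import Enum

class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"

class Cache:
    """One CPU's L1 cache, reduced to a single cache line."""
    def __init__(self, name):
        self.name = name
        self.state = State.INVALID
        self.value = None

class Bus:
    """Main memory (one line) plus the caches that snoop on it."""
    def __init__(self, caches, memory_value=0):
        self.caches = caches
        self.memory = memory_value

    def read(self, requester):
        if requester.state is State.INVALID:
            # Snoop: a cache holding a modified copy writes it back first.
            for other in self.caches:
                if other is not requester and other.state is State.MODIFIED:
                    self.memory = other.value
                    other.state = State.SHARED
            requester.value = self.memory
            requester.state = State.SHARED
        return requester.value

    def write(self, requester, value):
        # Snoop: every other copy is invalidated before the write completes.
        for other in self.caches:
            if other is not requester:
                if other.state is State.MODIFIED:
                    self.memory = other.value
                other.state = State.INVALID
        requester.value = value
        requester.state = State.MODIFIED

cpu0, cpu1 = Cache("cpu0"), Cache("cpu1")
bus = Bus([cpu0, cpu1], memory_value=7)

bus.read(cpu0)          # both CPUs load the line and share it
bus.read(cpu1)
bus.write(cpu0, 42)     # cpu1's copy is invalidated
seen = bus.read(cpu1)   # cpu0 writes back; cpu1 observes the new value
print(seen, cpu0.state.value, cpu1.state.value)
```

The point of the sketch is that cpu1 never reads the stale value 7 after cpu0's write, and neither CPU's software had to do anything to make that happen.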

Additionally, an I/O coherency unit that keeps input/output traffic coherent with the L1 caches handles this complex bookkeeping automatically. For example, when an I/O device modifies data held in one core's L1 cache, the I/O coherency unit updates the other L1 caches, so the application programmer does not have to manage these details.

Another way to boost performance is to design a processor with a user-configurable level-two (L2) cache to reduce main-memory accesses. An L2 cache can include several features that ensure high performance while consuming minimal power. For example, an L2 cache designed to run at the same clock frequency as the processor and shared by all the cores in a multicore cluster ensures that the L2 cache can keep up with the CPUs. In addition, a cache tightly connected with the core through a separate low-latency bus avoids AXI bus traffic on the data paths between the CPU cores and the L2 cache, further improving performance.

Processors that offer a high degree of configurability can offer excellent power and performance benefits. Besides being able to configure the L2 cache's clock speed, memory size, and AXI interfaces, designers using configurable processors can customize the core and make use of different sleep modes to save power. When chip designers implement the L2 cache, they can also select higher-density SRAMs to reduce power consumption and die area, at the cost of somewhat lower performance.

Memory Management of Virtual Memory

A memory management unit (MMU) that enables a processor to run sophisticated embedded operating systems that support SMP is also important for high-end embedded applications. High-performance applications use virtual memory to work with conceptually more memory than is physically available, and an MMU eliminates the need for the application to manage the shared memory space. Virtual memory reduces memory requirements by allowing programs to execute without their entire address space residing in physical memory, and it makes application programming easier by hiding the fragmentation of physical memory. The MMU enables the Linux kernel or another high-end operating system to manage the memory hierarchy and lets each process run in its own address space.
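The translation an MMU performs can be sketched in a few lines. The example below is a simplified, single-level model written for illustration: the page size, the dictionary-based page tables, and the names translate and PageFault are assumptions of this sketch, not details of any real MMU or of the Linux kernel.

```python
PAGE_SIZE = 8 * 1024  # 8 KB pages, an illustrative choice

class PageFault(Exception):
    """Raised when a virtual page has no physical frame mapped."""

def translate(page_table, vaddr):
    """Map a virtual address to a physical address via a per-process table."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)  # virtual page number, byte offset
    if vpn not in page_table:
        # A real OS would handle the fault, e.g. by loading the page.
        raise PageFault(f"no mapping for virtual page {vpn}")
    return page_table[vpn] * PAGE_SIZE + offset

# Hypothetical page tables: virtual page number -> physical frame number.
proc_a = {0: 5, 1: 9}   # process A's pages live in scattered frames
proc_b = {0: 2}         # process B maps the same virtual page elsewhere

pa_a = translate(proc_a, 0x0010)
pa_b = translate(proc_b, 0x0010)
print(hex(pa_a), hex(pa_b))
```

Both processes use the same virtual address, yet each reaches a different physical location, which is what lets every process run in its own address space. Process A's two virtual pages are contiguous even though their physical frames are not, which illustrates how fragmentation is hidden from the application.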

ARC HS38 Processor - Designed for High Performance

Synopsys' ARC® HS38 processor, the latest addition to the ARC HS family, was designed for applications that require a virtual memory operating system, such as Linux. ARC HS38 has an MMU with 40 bits of configurable physical-address space, enough to directly address one terabyte (1 TB) of main memory. The MMU also supports variable-size memory pages and can handle two different page sizes simultaneously: ARC HS38 concurrently supports memory pages in the normal range (4 KB, 8 KB, or 16 KB) as well as very large pages (4 MB, 8 MB, or 16 MB). Larger pages let each translation-lookaside buffer (TLB) entry cover more memory, which reduces the number of TLB misses.
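The address-space and page-size figures above can be checked with simple arithmetic. In the sketch below, the 40-bit width and the page sizes come from the text, while the TLB entry count is a purely hypothetical number used to show how page size affects the amount of memory a TLB can cover.

```python
KB, MB, TB = 2 ** 10, 2 ** 20, 2 ** 40

PHYS_ADDR_BITS = 40
addressable_bytes = 2 ** PHYS_ADDR_BITS
print(f"addressable memory: {addressable_bytes // TB} TB")

# Hypothetical TLB size, for illustration only (not from the source).
TLB_ENTRIES = 1024

def tlb_reach(entries, page_size):
    """Bytes of memory covered when every TLB entry maps one page."""
    return entries * page_size

normal_reach = tlb_reach(TLB_ENTRIES, 8 * KB)
large_reach = tlb_reach(TLB_ENTRIES, 8 * MB)
print(f"reach with 8 KB pages: {normal_reach // MB} MB")
print(f"reach with 8 MB pages: {large_reach // MB} MB")
```

With the same number of entries, 8 MB pages cover 1,024 times as much memory as 8 KB pages, so a workload with a large working set is far less likely to miss in the TLB.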

Higher-end CPU designs tend to be more complex and pay the price in higher power consumption and transistor counts. ARC HS cores take a more streamlined approach that requires fewer transistors and less power, yet they still deliver high throughput with an unusually flexible CPU that SoC designers can customize extensively. The ARC HS38 processor has a 10-stage pipeline and reaches a clock frequency of up to 1.6 GHz under worst-case conditions in a 28-nm high-performance mobile CMOS process. A minimal ARC HS38 implementation consumes only 36 microwatts per megahertz (58 mW at 1.6 GHz) and occupies only 0.20 mm² of silicon.
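The quoted power figure follows directly from the per-megahertz number. The short calculation below uses only values stated above; the variable names are chosen for this example.

```python
UW_PER_MHZ = 36     # microwatts per megahertz, minimal configuration
FREQ_MHZ = 1600     # 1.6 GHz

power_mw = UW_PER_MHZ * FREQ_MHZ / 1000  # microwatts -> milliwatts
print(f"{power_mw:.1f} mW at {FREQ_MHZ} MHz")
```

The result is 57.6 mW, which rounds to the 58 mW cited in the text.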

Throughput for the ARC HS38 at 1.6 GHz (worst case in 28-nm) reaches more than 3,100 Dhrystone MIPS (DMIPS) or 5,600 CoreMarks (3.5 CoreMarks per megahertz) per core. At this speed, the processor offers the performance needed for today's embedded systems and plenty of headroom for future designs. For designs that require higher performance, dual-core and quad-core versions of the ARC HS38 are available with support for full L1 cache coherency and up to 8 MB of L2 cache. The low-power design of the ARC HS38 allows dual-core and even quad-core versions to be implemented at power consumption levels comparable to those of competing single-core implementations, while offering much higher performance. Multicore designs have always been possible using ARC cores; in fact, some customers' designs have incorporated hundreds of cores. The ARC HS38 processor enables designers to implement dual- and quad-core clusters supporting SMP with less effort than before, thereby cutting costs and accelerating time to market. The aggregate performance of a quad-core ARC HS38 design would be as much as 12,400 DMIPS or 22,400 CoreMarks.
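The per-core and aggregate figures in this section are related by straightforward multiplication. The sketch below reproduces them from the numbers given in the text, assuming ideal linear scaling across cores; real workloads will typically scale somewhat less than this, which is why the text says "as much as". The function name aggregate is introduced here for illustration.

```python
FREQ_MHZ = 1600            # 1.6 GHz
DMIPS_PER_CORE = 3100      # stated as "more than 3,100", so a lower bound
COREMARK_PER_MHZ = 3.5

coremarks_per_core = COREMARK_PER_MHZ * FREQ_MHZ
dmips_per_mhz = DMIPS_PER_CORE / FREQ_MHZ

def aggregate(per_core, cores):
    """Upper-bound cluster throughput, assuming perfect linear scaling."""
    return per_core * cores

quad_dmips = aggregate(DMIPS_PER_CORE, 4)
quad_coremarks = aggregate(coremarks_per_core, 4)

print(f"per core: {coremarks_per_core:.0f} CoreMarks, "
      f"{dmips_per_mhz:.2f} DMIPS/MHz")
print(f"quad core: {quad_dmips} DMIPS, {quad_coremarks:.0f} CoreMarks")
```

The quad-core totals match the 12,400 DMIPS and 22,400 CoreMarks quoted above.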

Figure 1: DesignWare ARC HS38 block diagram

Conclusion

Synopsys designed the ARC HS family to maximize performance efficiency for embedded applications, offering very high performance at size and power consumption levels that are less than half of what competing cores require. Designing high-performance processors is not difficult when the power and transistor budgets are lavish. What is much more difficult is designing small, efficient processors that offer enough performance today plus additional headroom for future growth. With the ARC HS family, Synopsys is expanding its DesignWare IP portfolio to meet the growing high-performance needs of SoC developers while avoiding unnecessary features that would compromise today's tightening cost and power budgets.