The Give and Take of Designing RISC/DSP Dual-Core SoCs

As streaming media and gigabit networking applications become commonplace, more SoC design teams are integrating RISC and DSP processor cores on the same ASIC. This is a daunting task to be sure, but much can be learned from a review of pioneering applications such as cell phones and advanced home-entertainment systems.

The leading-edge teams who have migrated designs from 2G to 2.5G to 3G cell phones, for example, have made architectural changes to accommodate new data types such as music, graphics and simple video. They have also added conventional computing power to run what is essentially a PC in a cell phone. At an even higher level of complexity, multiple high-bandwidth streaming media channels for home entertainment center applications are driving profound changes in both architecture and operating systems.

A typicaland very importanthigh-level design decision for a RISC/DSP SoC is partitioning the system tasks between the DSP and the RISC host. The fundamental rule is that the RISC core handles control and the DSP executes specific algorithms. Some exceptions will be mentioned later in this article.

As a starting point, let's look at the components in a dual-core SoC.

DSP Core Evolution

The newest generation of high-performance DSP processor cores differs markedly from classic DSPs. Most significant is the very-long-instruction-word (VLIW) architecture. It takes advantage of two characteristics of signal-processing applications: the predictability of the instruction sequence and the ability to process data in parallel. VLIW processors depend on simple instructions that encode a single operation. Their advantage comes from issuing and executing instructions in parallel groups rather than one at a time.

From a hardware perspective, parallel execution implies multiple, independent execution units and buses. TI's TMS320C62xx, for example, has eight independent execution units, which allow the processor to issue up to eight instructions per clock cycle, all encoded in a single very long instruction word.

VLIW DSP cores typically use 32-bit-wide instruction words, double the size of conventional DSPs'. The extra width lets designers specify larger register sets to enhance performance. The wider word is also necessary, however, because information about which execution unit will run each instruction must be included in the instruction word.

A downside of the long instruction word, however, is high program memory usage, which translates into additional cost for additional RAM or ROM. Power consumption is also high relative to conventional DSPs.

The most common technique used to implement VLIW DSP processing in hardware is SIMD (single instruction/multiple data), which executes the same instruction in parallel on different data. SIMD hardware units vary from vendor to vendor and even within a vendor's product line. The design choice mostly involves how the different types of hardware units (MACs, ALUs, and shifters) are grouped.
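The SIMD idea can be sketched in C. This is a minimal illustration, modeled with scalar loops, assuming an invented 4-lane vector type and function names; it is not any vendor's intrinsics. A real SIMD MAC unit would execute all four lanes in a single cycle under one instruction.

```c
#include <stdint.h>

/* Illustrative 4-lane vector; a hardware SIMD register in spirit. */
typedef struct { int32_t lane[4]; } vec4;

/* One SIMD multiply-accumulate: one instruction, four data elements.
 * The loop models what the hardware does in parallel. */
static vec4 vec4_mac(vec4 acc, vec4 a, vec4 b)
{
    for (int i = 0; i < 4; i++)
        acc.lane[i] += a.lane[i] * b.lane[i];
    return acc;
}

/* Dot product of two n-element arrays using the 4-wide unit. */
int32_t dot_simd(const int32_t *a, const int32_t *b, int n)
{
    vec4 acc = { {0, 0, 0, 0} };
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        vec4 va = { {a[i], a[i + 1], a[i + 2], a[i + 3]} };
        vec4 vb = { {b[i], b[i + 1], b[i + 2], b[i + 3]} };
        acc = vec4_mac(acc, va, vb);
    }
    int32_t sum = acc.lane[0] + acc.lane[1] + acc.lane[2] + acc.lane[3];
    for (; i < n; i++)          /* scalar tail for leftover elements */
        sum += a[i] * b[i];
    return sum;
}
```

The grouping question the text raises shows up here as a design choice: how many MAC lanes to build into `vec4_mac`, and whether ALU and shifter lanes share the same grouping.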

While the architecture and capabilities of the DSP core tend to be application specific, the RISC core's primary role as controller makes its desired capabilities a bit more generic. This does not mean the core can skimp on performance, however, because the peripherals are numerous and demanding. It runs the operating system, executes the application software, supervises system debugging and typically handles the graphical user interface (GUI).

Typically under host control are I/O blocks such as UARTs, USB cores, and FireWire cores, as well as functions such as memory management, debugging, expansion-bus cores, and application-specific cores such as rendering engines. A 32-bit core that clocks at 150 MHz or greater and has architectural features such as multiple pipelines and sizeable cache memories would typically be chosen to share CPU duties with the DSP on most SoCs.

On the software side, compatibility with existing instruction-set architectures is important, and so is support for a wide range of real-time operating systems, including OSE Systems' OSEck and Wind River Systems' VxWorks, as well as other operating systems such as Linux and Microsoft Windows CE if the end-user application requires them.

The RTOS

On-chip signal processing introduces new wrinkles into the operating-system considerations for a multi-core SoC. Except in streaming media applications, where the real-time demands are even more intense, DSP processing is interrupt-driven on a hard real-time basis. This makes short interrupt latency a much higher priority in such a SoC than in one with a single RISC core.

The obvious solution, specifying a real-time operating system, is not enough. The parallel-processing architectures of VLIW DSPs also demand that the OS simultaneously execute multiple algorithms, schedule tasks, and communicate with the RISC host CPU, all on an event-driven basis.

Inter-process communication requires a multi-threaded RTOS. Communication with the host requires a link handler that can create a logical channel between processes running on different processors and processor types. More important, message exchange between processes must follow the same protocols regardless of whether the communicating processes are located on the same processor, on multiple instances of the same type of processor, or on different types of processors.

For many applications, these requirements add up to an RTOS with a standard API, standardized inter-process communication and debugging environments, and an extensive portfolio of third-party products such as IP stacks. Table 1 shows a detailed list of RTOS enhancements required for dual-core RISC/DSP communication.
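The location-transparency requirement above, that message exchange follow the same protocols regardless of where the communicating processes run, can be sketched as a tiny link-handler API. Everything here (the `message` layout, `msg_send`, the per-CPU queues) is invented for illustration; real RTOS link handlers such as those the text alludes to differ in detail.

```c
#include <stdint.h>
#include <string.h>

/* A message addressed by (CPU, process id); the sender never needs to
 * know whether the destination is local, a peer RISC, or the DSP. */
typedef struct {
    uint16_t dest_cpu;            /* 0 = RISC host, 1 = DSP, ... */
    uint16_t dest_pid;            /* process id on that CPU */
    uint16_t len;
    uint8_t  payload[64];
} message;

/* These queues stand in for the physical transports (shared RAM,
 * mailbox registers) that a real link handler would drive per CPU. */
#define MAX_CPUS 2
static message queue[MAX_CPUS][8];
static int     q_head[MAX_CPUS];

/* Same call, same protocol, whatever the destination's location. */
int msg_send(uint16_t cpu, uint16_t pid, const void *buf, uint16_t len)
{
    if (cpu >= MAX_CPUS || len > sizeof(queue[0][0].payload))
        return -1;
    message *m = &queue[cpu][q_head[cpu]++ % 8];
    m->dest_cpu = cpu;
    m->dest_pid = pid;
    m->len = len;
    memcpy(m->payload, buf, len);
    return 0;
}
```

The point of the sketch is the signature: because `msg_send` names a logical destination rather than a bus or mailbox, processes can be repartitioned between the RISC and DSP cores without rewriting their communication code.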

Frequently, dual-processor designs simply add the signal-processing power of a DSP core to an existing RISC-based application. The relationship between the cores is a master/slave arrangement with the DSP executing host instructions. Advanced architectural features, such as a unified memory architecture, can, and should, be avoided.

Two critically important high-level design steps are to assign each system task to the appropriate core and to understand fully the conditions under which the cores communicate. Task partitioning varies from application to application, but the basic premise is to assign intense signal-processing tasks to the DSP, and control and user-interface tasks, such as access to external memory, storage media, and the incoming data stream, to the RISC host. Converting the incoming wire-stream data into a format such as PCM that is easily handled by the DSP core is usually best assigned to the host processor.

Communication issues can be considerably more complex but the basic methodology for addressing them is similar for most applications. Among the key considerations are communication bandwidth and each core's minimum and maximum latency. Data bandwidth is provided in the specifications for the cores and buses available for the SoC and is a fairly straightforward calculation. Latency, on the other hand, is not.

Typically, latency is application dependent and far more important than actual data-transfer time. DSP applications that involve voice, video, audio, or even graphics cannot tolerate interruptions in the data stream without annoying the user, so this is a critical issue. There are basically three options for solving latency-induced problems: raise the priority of the communication task between the RISC and DSP cores, change the buffer transfer sizes, and reduce the priorities of the control tasks causing the latency.
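The three remedies can be made concrete with a small sketch against a hypothetical RTOS task table. The structure, function names, and the minimum-burst clamp are all invented for illustration; real RTOS priority APIs and DMA constraints vary.

```c
#include <stdint.h>

/* Minimal task descriptor; higher number = higher priority. */
typedef struct {
    const char *name;
    int priority;
} task;

/* Remedy 1: raise the RISC<->DSP communication task above the
 * highest-priority control task so transfers preempt control work. */
void raise_comm_priority(task *comm, const task *highest_control)
{
    if (comm->priority <= highest_control->priority)
        comm->priority = highest_control->priority + 1;
}

/* Remedy 2: retune the buffer transfer size. Smaller buffers lower
 * per-transfer latency; larger ones amortize setup overhead. Here the
 * size is sized to the latency budget, clamped to a minimum burst. */
uint32_t pick_buffer_size(uint32_t latency_budget_us, uint32_t bytes_per_us)
{
    uint32_t size = latency_budget_us * bytes_per_us;
    return size < 64 ? 64 : size;
}

/* Remedy 3: demote a control task that is inducing the latency. */
void demote_control_task(task *t)
{
    if (t->priority > 1)
        t->priority--;
}
```

In practice these knobs interact: raising the communication task's priority and shrinking buffers both trade control-task throughput for datastream smoothness, which is exactly the trade the article describes.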

Cell Phone Generations

A 2G cell phone typifies the class of fairly straightforward DSP/RISC dual-core applications. The processors basically run their own show and communicate only when necessary. The cell-phone market leader in the DSP space is Texas Instruments, of course, and its most popular core in 2G phones is one of the company's 320C54xx devices. On the RISC side, ARM Ltd. typically licenses its ARM7TDMI for 2G phones, says John Rayfield, ARM's director of technical marketing in Los Gatos, Calif. The two cores use a mailbox to communicate. The DSP core handles voice data streams and channel communication over a datapath bus while the RISC core supervises a control bus for virtually everything else. Each processor has its own SRAM main memory. Moving to 2.5G raises data rates and makes call processing more intensive but does not require any dramatic architectural changes.
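The mailbox mentioned above can be sketched as a pair of handshake registers. This model uses plain variables so it is testable; on silicon the fields would be memory-mapped registers with interrupt lines to the peer core, and the names are illustrative rather than any vendor's register map.

```c
#include <stdint.h>

/* One-word mailbox: sender sets 'full', receiver clears it. */
typedef struct {
    volatile uint32_t data;
    volatile uint32_t full;
} mailbox;

/* Called by the sending core (RISC or DSP). */
int mbox_post(mailbox *mb, uint32_t word)
{
    if (mb->full)
        return -1;        /* peer has not drained the previous word */
    mb->data = word;
    mb->full = 1;         /* on hardware, this would also raise an
                             interrupt on the receiving core */
    return 0;
}

/* Called by the receiving core, typically from its interrupt handler. */
int mbox_fetch(mailbox *mb, uint32_t *word)
{
    if (!mb->full)
        return -1;        /* nothing pending */
    *word = mb->data;
    mb->full = 0;
    return 0;
}
```

The simplicity is the point: because the 2G cores "run their own show," a one-deep handshake suffices, and neither core blocks on the other except at these explicit exchange points.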

The same cannot be said for 3G cell phones. The introduction of multimedia data types results in still higher data rates and even more intensive call processing. But the real difference, says Rayfield, is the introduction of another processor block for application processing.

This "user compute" block makes sense because the functionality of a PDA or PC is being integrated into the phone. Figure 1 shows the addition of an application processor that accesses a dedicated 64 Mbytes of SDRAM.

Another significant increase in 3G design complexity is the introduction of an operating system for the application processor. This could be any of a variety of PC or PDA OSs including Palm, Symbian, WinCE, Linux and, in the Asian market, iTron. Figure 1 shows the bus architecture of the additional processor block.

ARM is a strong proponent of adding a user-compute RISC core to solve the 3G challenge, but its solution is not universal. Another option is to further offload the existing RISC core by adopting a DSP architecture that does not require the RISC core to issue long series of instructions to the DSP; in other words, make the DSP more autonomous.

In the ARM architecture, CPUs connect to peripherals mostly via point-to-point buses. As cell geometries decrease, says Rayfield, wiring has become almost free and overhead is trivial. This undermines most of the reasons for adopting a tristate shared bus.

Multiple Multimedia Datastreams

Although 2G and 2.5G cell phones process voice and graphics, DSP functionality can be interrupt-driven because the bit rates are not high. Multiple-multimedia-datastream applications, however, cannot be handled on an interrupt-driven basis. SoC platforms such as Philips Semiconductors' Nexperia Home Entertainment Engine, for example, receive, decode, convert, and display multiple data streams, each with a different data format, including MPEG2, NTSC, PAL, and audio.

Architectural innovations to handle streaming media include:

A unified memory architecture

Three data buses, each servicing a specific task domain

A software architecture to tie it all together.

In the Nexperia platform, the MIPS core handles the RTOS, graphics, and the application software. A TriMedia core handles most of the streaming-media processing and system-wide task scheduling. Unlike the simpler applications mentioned earlier, the two CPUs here form a single, integrated system. They share a unified memory that allows them to swap tasks to balance computing loads when necessary and also provides considerable savings in memory cost.

Each processor core can address virtually any of the peripherals, but every peripheral is assigned to the task domain of one of the cores. This scheme markedly reduces overall system cost and power consumption because resources such as main memory and disk and memory interfaces are shared.

The Nexperia bus architecture consists of three task domains (Figure 2). The backbone is a point-to-point memory bus that connects external SDRAM with the SoC's peripherals for high-throughput, low-latency DMA access.

Figure 2: Philips Semiconductors' Nexperia platform for multiple multimedia datastream home entertainment centers has three buses. The DVP memory bus provides DMA access to main memory and handles all streaming media on a non-interrupt basis. Each core has a Peripheral Interconnect bus that supports interrupts. Bus bridges make it possible for the MIPS and TriMedia cores to swap control tasks when necessary. D$ is data cache, I$ is instruction cache.

The two remaining domains are a MIPS PI (Peripheral Interconnect) bus that connects the MIPS core to peripherals in its domain and a TR32 PI bus that performs the same function for the TriMedia core. In addition, the bus architecture includes a crossover bridge joining the MIPS and TriMedia PI buses. This allows memory-mapped I/O access from each processor to control or observe the status of all the peripherals.
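The memory-mapped control-and-status access described above can be sketched in C. The register layout, names, and the example bus address are invented for illustration; only the mechanism, a plain load or store through the PI bus and crossover bridge, reflects the text.

```c
#include <stdint.h>

/* Hypothetical peripheral register block as seen through either
 * core's PI-bus window. */
typedef struct {
    volatile uint32_t ctrl;
    volatile uint32_t status;
} periph_regs;

/* In a real build this would be pinned to a bus address, e.g.
 *   #define UART0 ((periph_regs *)0xBC100000)   (invented address)
 * Here it is backed by RAM so the sketch can run anywhere. */
static periph_regs fake_uart;

/* Observe a peripheral's status: a single load, issued identically by
 * the MIPS core (directly) or the TriMedia core (via the bridge). */
uint32_t periph_status(periph_regs *p)
{
    return p->status;
}

/* Control a peripheral: a single store, again from either core. */
void periph_enable(periph_regs *p)
{
    p->ctrl |= 1u;
}
```

Because both cores see the same memory map, task-domain assignment becomes a software convention rather than a hardware limit, which is what lets the cores swap control tasks across the bridge.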

While the MIPS processor runs an RTOS, the TriMedia VLIW engine has its own software architecture. The TriMedia Streaming Software Architecture (TSSA) represents one strategy for handling separate datastreaming tasks. TSSA must communicate with the RTOS, of course, but its primary function is to support on-chip hardware with multimedia libraries. These libraries consist of components that perform most of the datastream processing including digitizing, processing, and rendering.

TSSA essentially configures processing components inside the TriMedia core according to instructions received from the host regarding what type of processing the engine requires. This is a departure from the usual process of executing algorithms in software. Here, dedicated signal processing engines are dynamically created, connected, configured, and destroyed depending on the type of data being processed at the moment.
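The create/connect/configure/destroy life cycle described above can be sketched with an invented component API; these are not the actual TSSA calls, just a minimal model of chaining datastream-processing stages at run time.

```c
#include <stdlib.h>

/* One datastream-processing stage, linked to its downstream stage. */
typedef struct component component;
struct component {
    int (*process)(int sample);   /* the stage's signal-processing step */
    component *next;
};

/* Stand-ins for library components (digitize, process, render). */
static int stage_gain(int s)   { return s * 2; }
static int stage_offset(int s) { return s + 3; }

/* Dynamically create a processing engine for the current data type. */
component *comp_create(int (*fn)(int))
{
    component *c = malloc(sizeof *c);
    if (c) { c->process = fn; c->next = NULL; }
    return c;
}

/* Connect two engines into a pipeline. */
void comp_connect(component *up, component *down) { up->next = down; }

/* Push one sample through the configured chain. */
int chain_run(component *head, int sample)
{
    for (component *c = head; c; c = c->next)
        sample = c->process(sample);
    return sample;
}

/* Destroy the pipeline when the data type changes. */
void chain_destroy(component *head)
{
    while (head) { component *n = head->next; free(head); head = n; }
}
```

The departure the text describes is visible here: the "program" is the shape of the chain, built and torn down per datastream, rather than a fixed algorithm compiled into the DSP.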

The Nexperia platform is an early entry in the race to the next generation of DSP/RISC SoCs, in which signal processing is farmed out to multiple DSP cores, each supported by dedicated hardware accelerators that are software-configured by the designer in C.

If this is the next-generation architecture for mixed-core SoCs, it will mark a return to the DSP being a slave to the RISC host. Oz Levia, chief technology officer of Improv Systems, a leading advocate and supplier of configurable DSPs, contends that a loosely coupled architecture is the most efficient way to handle high-bit-rate streaming media.

According to Levia, "The DSP must autonomously process information coming in at wire speed." Improv envisions a loosely coupled interface between the DSP and the host in which the host has a library of high-level commands for the DSP such as "decode a frame," "execute a DCT," and "dump a buffer." The DSP already has a library of assembly-language commands it uses to respond to high-level commands from the host.
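The loosely coupled command interface can be sketched as a dispatch table on the DSP side. The command strings come from the text; the table, routine bodies, and `dsp_dispatch` function are invented for illustration.

```c
#include <string.h>

typedef int (*dsp_routine)(void);

/* Stand-ins for the DSP's resident assembly-language routines. */
static int decode_frame(void) { return 1; }
static int execute_dct(void)  { return 2; }
static int dump_buffer(void)  { return 3; }

/* The DSP's library: high-level host commands mapped to routines. */
static const struct {
    const char *name;
    dsp_routine run;
} command_table[] = {
    { "decode a frame", decode_frame },
    { "execute a DCT",  execute_dct  },
    { "dump a buffer",  dump_buffer  },
};

/* The host sends only the command name; the DSP resolves and runs it
 * autonomously, with no instruction-level involvement from the host. */
int dsp_dispatch(const char *cmd)
{
    unsigned n = sizeof command_table / sizeof command_table[0];
    for (unsigned i = 0; i < n; i++)
        if (strcmp(cmd, command_table[i].name) == 0)
            return command_table[i].run();
    return -1;   /* unknown command: host and DSP libraries out of sync */
}
```

The design choice Levia argues for is embodied in the granularity: the host issues a handful of coarse commands per frame instead of a long instruction stream, leaving the DSP free to keep pace with wire-speed data.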

General Integration Issues

Architectures will come and go as they always have, but tools and test strategies will always be with us. David Baczewski is the strategic marketing manager for StarCore, a DSP IP joint venture of Agere, Infineon, and Motorola. Baczewski argues that the most efficient way to get working silicon is a development tool suite that comprehends both the RISC and DSP cores and has the basic relationships between the two built in.

This expertise can only come from the core vendors, he says, because only they know the intricacies (and eccentricities) of their products. Similarly, test strategies depend heavily on the ability of JTAG circuits, for example, to access both cores. Such a strategy requires intimate knowledge of the cores that is typically found only in the vendor's realm of expertise, Baczewski says.

If there are some organizing principles for understanding the integration issues for RISC/DSP dual-core SoCs, they are:

Partition tasks so that the DSP executes the signal-processing algorithms and the RISC host handles control and the user interface.

Understand the bandwidth and, especially, the latency of every path over which the cores communicate.

Choose an RTOS whose inter-process communication works identically across cores and processor types.

Rely on tool suites and test strategies that comprehend both cores, which generally means those of the core vendors.

Contributing writer Jack Shandle is a former chief editor of both Electronic Design magazine and ChipCenter.com. He holds a BSEE degree and, over his 15-year career in technical publishing, has written hundreds of articles on all aspects of the electronics OEM industry. Jack is president of eContentWorks, a consultancy that creates high-value content for publishers, EOEM corporations, and industry associations. His email address is jshandle@earthlink.net.