The accomplishment that has become accepted as Moore's Law, that the number of transistors we can fit on a chip will double every 18 months, has enabled us to manufacture exponentially more complex integrated circuits, but it has posed tough challenges for the design and design-automation communities. For example, the design of individual devices and their associated interconnect is becoming harder because of submicron effects, which increase interconnect delay and coupling.

At the same time, there are exponentially more devices to deal with. What's more, the situation is further exacerbated by the need to integrate heterogeneous elements (digital, analog and mixed-signal, RF and software) on the same piece of silicon. Finally, all this comes with competitive pressures to further reduce the time to market, resulting in a quadruple whammy for designers that has brought us the well-publicized gap between manufacturing capability and design productivity.

In addition to the intellectual design challenges, there are significant economic challenges associated with nonrecurring engineering (NRE) costs in manufacturing. The International Technology Roadmap for Semiconductors predicts that while manufacturing complex system-on-chip (SoC) designs will be practical, at least down to 50-nm minimum feature sizes, the production of practical masks and exposure systems will likely be a major bottleneck for the development of such chips. That is, the cost of masks will grow even more rapidly for these fine geometries, adding even more to the upfront NRE for a new design.

Reports indicate that a single mask set and probe card for a complex, state-of-the-art chip costs more than $500,000 today, up from less than $100,000 a decade ago (this does not include the design cost). At 0.15-micron technology, Sematech estimates we will be entering the regime of the million-dollar mask set.

These design and manufacturing hurdles clearly point to a reduced number of design starts, with only moderate- to high-volume devices being viable. Volume can be increased by providing flexibility through device programmability; however, the price paid is efficiency, in speed, cost and power, for a particular application implementation. A compromise here is to provide application-specific programmable systems. This is already happening in multiple domains, for example video, networking and communications processing.
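The volume argument can be made concrete with a little arithmetic. The sketch below spreads the mask-set figure quoted above over a production run; the per-unit manufacturing cost and the volumes are hypothetical values chosen only for illustration, and design cost is left out, as it is in the mask figure itself.

```python
def per_unit_cost(nre_dollars, unit_cost_dollars, volume):
    """Cost per shipped part: NRE amortized over the run plus the per-unit cost."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return nre_dollars / volume + unit_cost_dollars


MASK_SET_NRE = 500_000   # mask set and probe card figure quoted in the text
UNIT_COST = 10.0         # hypothetical manufacturing cost per part, in dollars

for volume in (10_000, 100_000, 1_000_000):  # hypothetical production volumes
    cost = per_unit_cost(MASK_SET_NRE, UNIT_COST, volume)
    print(f"{volume:>9,} units: ${cost:,.2f} per part")
```

Under these assumptions the mask set alone adds $50 to each part at 10,000 units but only 50 cents at a million units, which is why a programmable device that can serve several applications, and so reach higher volume, becomes attractive.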

However, the cost of developing a new device is extremely high (greater than $7 million to $25 million, based on informal surveys) and even then the actual design of these devices is mostly a work of art, utilizing the expertise of a few key designers. Even with significant development costs, these devices are very hard to program because of the lack of appropriate software environments. In addition to debuggers, simulators and visualization tools, synthesis tools such as compilers and custom runtime systems are needed that understand not only the specialized architectures, but also the silicon constraints of real-time and low power.

The Mescal project (for modern embedded systems, compilers, architectures and languages) is directed toward the development of methodologies, tools and appropriate algorithms to support the efficient development of fully programmable platform-based designs for specific application domains. The goal is to design devices that in their implementation approach the efficiency of ASICs and in their use approach the flexibility of general-purpose programmable devices.

A key aspect of closing the productivity gap is closing the gap between application problems and the ICs that solve them. We see that the essential problem to solve is finding a match between the concurrency in the application and the concurrency or parallelism in the target device. By application model we mean a natural expression of application concurrency. Examples are Matlab descriptions for signal processing and Statecharts for complex control.

By architectural model we mean the natural way of thinking about the architecture of the device, and in particular its concurrency. Linking these two is the programmer's model. The programmer's model is an abstraction of key aspects of the underlying architecture so that the programmer can map the application concurrency onto the target architecture. At its essence, the Mescal project is about efficient mapping of application concurrency onto architectural concurrency; that is, finding the right programming model for tomorrow's application-specific instruction processors. Formalizing this model is important: in the Mescal project, underlying the application, architecture and programmer's models is a formal model for reasoning about the concurrency, referred to as the model of computation.

Applications, drawn as they are from the real world, have always been full of concurrency, but the default architectural model has been the sequential von Neumann model and the default programming model has been the imperative C programming language. The translation of application concurrency onto serial architectures was left to the system architects and programmers. A key observation moving forward is that efficient implementations are possible only when there is a good match between the application concurrency and the architectural concurrency, and Mescal aims to rethink the architectural and programmer's models to bring them closer to the application model.

To accomplish this, we need to clearly understand the interface between the application and the architecture domains. So the application, architecture and software development environment must all be integrated, and the principal unifying feature is the programmer's model.

The goal of this part of the project is to provide an environment for the efficient exploration of concurrent architectures by means of a flexible interface provided with the heterogeneous simulation environment known as Ptolemy II. Concurrency in our architecture is provided at multiple levels: at the bit level through specialized functional units, at the instruction level through VLIW processors, at the thread level through multithreading and at the process level through multiple processors.

Processing elements specified

The exploration environment enables the designer to specify a given microarchitecture and architecture, and to automatically export an interface to these that is used for the retargetable synthesis of the software environment (simulators, compilers and custom runtime systems). In addition to the specification of the processing elements (PEs), the environment provides for the specification of the communication between the PEs. The communication specification provides for flexibility in the physical network (its topology as well as its switching type, circuit or packet) and for flexibility in the protocol for network usage.
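To give a feel for what such a specification has to capture, here is a minimal sketch of an architecture description with PEs and a communication network. It is not the Mescal or Ptolemy II format; every class, field and value name below is a hypothetical stand-in chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Switching(Enum):
    CIRCUIT = "circuit"
    PACKET = "packet"


@dataclass(frozen=True)
class ProcessingElement:
    name: str
    issue_width: int = 1          # instruction-level parallelism (e.g. VLIW slots)
    hardware_threads: int = 1     # thread-level parallelism
    functional_units: tuple = ()  # specialized, bit-level units


@dataclass
class Network:
    topology: str                 # e.g. "ring", "bus", "mesh"
    switching: Switching
    protocol: str                 # how PEs arbitrate for the network
    links: list = field(default_factory=list)  # (pe_name, pe_name) pairs


@dataclass
class Architecture:
    pes: list
    network: Network

    def validate(self):
        """Check that PE names are unique and every link joins known PEs."""
        names = [pe.name for pe in self.pes]
        if len(set(names)) != len(names):
            raise ValueError("PE names must be unique")
        for a, b in self.network.links:
            if a not in names or b not in names:
                raise ValueError(f"link ({a}, {b}) references an unknown PE")
        return True

    def export_interface(self):
        """Flatten the description into plain data for downstream tool generators."""
        self.validate()
        return {
            "pes": {
                pe.name: {
                    "issue_width": pe.issue_width,
                    "hardware_threads": pe.hardware_threads,
                    "functional_units": list(pe.functional_units),
                }
                for pe in self.pes
            },
            "network": {
                "topology": self.network.topology,
                "switching": self.network.switching.value,
                "protocol": self.network.protocol,
                "links": [list(link) for link in self.network.links],
            },
        }


# Hypothetical two-PE design: one wide VLIW engine, one multithreaded engine.
arch = Architecture(
    pes=[
        ProcessingElement("pe0", issue_width=4, functional_units=("crc",)),
        ProcessingElement("pe1", hardware_threads=4),
    ],
    network=Network("ring", Switching.PACKET, "round_robin", links=[("pe0", "pe1")]),
)
interface = arch.export_interface()
```

The point of the exported dictionary is that simulators, compilers and runtime generators could all read the same description, so changing a PE or the switching type in one place retargets every tool.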

The software environment provides for automatic generation of simulators, compilers and custom runtime environments from the architectural specification of the PEs and the communication network. This enables easy retargeting from architecture to architecture. At the core of this environment are an internal representation for the concurrent application (the Mescal Concurrent Representation, or MCR), which captures the concurrency, and the X-code extensions to the Impact compiler for efficient uniprocessor compilation. A fast multithreaded compiled simulator is generated from the application and architectural specification.

The compiler analyzes the application concurrency using the MCR and the architectural specification and provides for a resource binding (processing and memory resources) of the application onto the architecture.
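As a rough illustration of what a resource binding involves, the sketch below assigns tasks to PEs with a simple greedy heuristic: heaviest task first, onto the least-loaded PE that still has enough memory. This is a generic load-balancing technique, not the Mescal compiler's algorithm, and the task names, cycle counts and memory sizes are hypothetical.

```python
def bind_tasks(tasks, pes):
    """Greedy binding of tasks to processing elements.

    tasks: dict of task name -> (estimated cycles, memory bytes needed)
    pes:   dict of PE name -> memory bytes available
    Returns (binding, load): task -> PE, and PE -> total cycles assigned.
    """
    load = {pe: 0 for pe in pes}
    free_mem = dict(pes)
    binding = {}
    # Place the most expensive tasks first; break ties by name for determinism.
    ordered = sorted(tasks.items(), key=lambda kv: (-kv[1][0], kv[0]))
    for task, (cycles, mem) in ordered:
        candidates = [pe for pe in pes if free_mem[pe] >= mem]
        if not candidates:
            raise ValueError(f"no PE has {mem} bytes free for task {task!r}")
        target = min(candidates, key=lambda pe: (load[pe], pe))
        binding[task] = target
        load[target] += cycles
        free_mem[target] -= mem
    return binding, load


# Hypothetical tasks from a packet-processing application.
tasks = {
    "route_lookup": (900, 8192),
    "encrypt": (700, 4096),
    "classify": (400, 2048),
    "qos_queue": (300, 1024),
}
pes = {"pe0": 16384, "pe1": 16384}

binding, load = bind_tasks(tasks, pes)
print(binding)
print(load)
```

A real binding step would also weigh communication between tasks and the real-time and power constraints discussed here; the sketch only shows how processing and memory resources enter the same decision.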

Code is then generated for the individual PEs using the X compiler, as well as for the custom runtime system that consists of scheduling and synchronization code. This is key to reducing the overhead of using general-purpose real-time operating systems. The software environment can measure and optimize for hard real-time as well as power constraints. This is crucial in our ability to deal with embedded software as a VLSI design component.

The C language served as a successful example of a programmer's model for processors on several generations of minicomputers, workstations and personal computers. The "register" keyword in the C language allowed programmers to do their own register allocation without being saddled with the specifics of register implementation. Pointer arithmetic and bit-level operations provided for further software efficiency. In essence, the C programming language captured the 20 percent of assembly-language features that provided 80 percent of the final application performance.

Silicon integration is allowing high microarchitectural complexity on a die as evidenced by some recent devices such as network processors (Intel IXP1200) and communication processors (Chameleon's Reconfigurable Communication Processors). At the same time, applications have become more complex and the current practice of implementing them in assembly for unconventional architectures does not scale. It is necessary to raise the programmer's level of abstraction in dealing with architectures and microarchitectures, but without sacrificing design efficiency.

One key element of the programmer's model is opacity: hiding nonessential aspects of the architecture and its implementation. An equally key element of the programmer's model is transparency: allowing the programmer access to key hardware features, such as special-purpose execution units. The practical goal of the Mescal programmer's model is to present the programmer of the new generation of application-specific instruction processors with the essential 20 percent of the architectural and microarchitectural features that give 80 percent of the efficiency of writing in assembler.

Design drivers (applications or sets of applications that we wish to use as a model of how other applications may behave) serve both to focus our research efforts and as a tool to quantitatively measure the effectiveness of our proposed methodology. Given the importance of the driver to our research goals, we have devoted substantial effort to choosing an appropriate real-world application for Mescal, and have arrived at the general problem of silicon support for virtual private networks. Particular challenges are fast Internet Protocol routing, network security and quality-of-service management.

Performance source

In this evaluation, we are quantifying whether the performance is coming from microarchitectural improvements, software technology or both. We also aim to answer some basic questions about concurrency across a system:

Where is the concurrency available?

How much can be achieved by exploiting instruction-level parallelism vs. more general concurrency?

How much "rewriting" of the application is needed to extract it?

How well were we able to predict the performance achieved a priori?

The Mescal project is aimed at the definition of domain-specific platforms and at the development of tools for the mapping of complex functionality onto the programmable architecture for a given domain.

Currently, the design of such platforms requires significant engineering effort, and even then the software development tools for such platforms are typically found lacking.