Computer Architecture and Systems

Computer architecture is the engineering of a computer system through the careful design of its organization, using innovative mechanisms and integrating software techniques, to achieve a set of performance goals.

Research Showcase

New Tool Makes Programs More Efficient Without Sacrificing Safety Functions

Computer programs are incorporating more and more safety features to protect users, but those features can also slow the programs down by 1,000 percent or more. Researchers at North Carolina State University have developed a software tool that helps these programs run much more efficiently without sacrificing their safety features.

Embedded Computer Systems

An embedded system is a special-purpose system in which the computer is completely encapsulated by the device it controls. Unlike a general-purpose computer, such as a personal computer, an embedded system performs pre-defined tasks, usually with very specific requirements. Since the system is dedicated to a specific task, design engineers can optimize it, reducing the size and cost of the product. Embedded systems are often mass-produced, so the cost savings may be multiplied by millions of items.

Some examples of embedded systems include ATMs, cell phones, printers, thermostats, calculators, and videogame consoles. Handheld computers or PDAs are also considered embedded devices because of the nature of their hardware design, even though they are more expandable in software terms. This line of definition continues to blur as devices expand.

The field of embedded system research is rich with potential because it combines two factors. First, unlike in general-purpose computing, the system designer usually has control over both the hardware design and the software design. Second, embedded systems are built upon a wide range of disciplines, including computer architecture (processor architecture and microarchitecture, memory system design), compilers, schedulers and operating systems, and real-time systems. Combining these two factors means that barriers between these fields can be broken down, enabling synergy between multiple fields and resulting in optimizations which are greater than the sum of their parts.

One challenge with embedded systems is delivering predictably good performance. Many embedded systems (e.g. anti-lock brakes in a car) have real-time requirements; if computations are not completed before a deadline, the system will fail, possibly injuring the user. Unfortunately, many of the performance-enhancing features which make personal computers so fast also make it difficult to predict their performance accurately. Such features include pipelined and out-of-order instruction execution in the processor, and caches in the memory system. Hence the challenge for real-time system researchers is to develop approaches for designing fast systems with easily predicted performance, or for more accurately measuring existing complex but fast systems.
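The tension between speed and predictability can be illustrated with a small sketch. The function names and cycle counts below are hypothetical and chosen only for illustration. A task whose memory accesses all hit in the cache finishes far sooner than one whose accesses all miss, and a hard real-time design has to budget for the slow case unless a tighter bound can be established.

```python
def execution_time_bounds(compute_cycles, num_accesses, hit_cycles, miss_cycles):
    """Return (best_case, worst_case) cycle counts for one task.

    Assumption: each memory access costs either hit_cycles or miss_cycles
    and nothing else varies. Real processors have more sources of timing
    variation (pipelining, out-of-order execution, and so on).
    """
    best_case = compute_cycles + num_accesses * hit_cycles
    worst_case = compute_cycles + num_accesses * miss_cycles
    return best_case, worst_case


def meets_deadline(worst_case_cycles, deadline_cycles):
    """A hard real-time task is safe only if its worst case fits the deadline."""
    return worst_case_cycles <= deadline_cycles


# Hypothetical numbers for illustration only.
best, worst = execution_time_bounds(
    compute_cycles=10_000, num_accesses=2_000, hit_cycles=2, miss_cycles=100
)
print(best, worst)                     # 14000 210000
print(meets_deadline(best, 50_000))    # True
print(meets_deadline(worst, 50_000))   # False
```

In this made-up case the task would usually finish well within a 50,000-cycle deadline, yet the all-miss bound cannot guarantee it. That gap is what makes cache behavior hard to reason about in real-time systems.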

Memory Systems / Memory Management

The memory subsystem is an important component in uniprocessor and multiprocessor systems. It consists of temporary storage that is managed by hardware (cache) or software (scratchpad), as well as more permanent storage that is volatile (main memory) or non-volatile (Flash memory, disk, etc.). It spans both on-chip and off-chip storage.

The importance of the role of the memory subsystem on the overall system performance has been growing in the past few decades and is expected to grow even more in the future. Recognizing such growing importance, increased design focus on the memory subsystem has led to caches taking up a larger fraction of die area. Already, caches take up more than 50% of the die area of some high performance microprocessors. In future multicore and manycore processors, the die area allocated for caches will likely increase even more.

Designing the memory subsystem is challenging because it involves balancing multiple goals. The goals include fast access time, programmability, high bandwidth, low power, low cost, reliability, and security. Some of the goals may contradict one another, so it is important to strike a good balance given the target market of a system. For example, techniques to hide memory access latency, such as prefetching, tend to increase bandwidth consumption. Techniques to reduce bandwidth consumption, such as cache and link compression, tend to increase the cache access latency. Techniques for improving reliability, such as parity and error correcting codes, tend to increase the cost, access latency, and bandwidth consumption. Techniques for improving programmability, such as cache coherence, increase cost and power consumption. Therefore, in designing the memory subsystem of a computer system, extensive knowledge and expertise, as well as careful attention to the target market and practical design constraints, are crucial to success.
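One way to make such trade-offs concrete is the textbook average memory access time (AMAT) model, which is not part of the text above and is used here only as an illustration. The sketch below compares a hypothetical baseline cache with a hypothetical compressed cache that misses less often but takes longer on every hit. All numbers are invented.

```python
def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    """Textbook AMAT model: hit time plus the expected cost of a miss (cycles)."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time + miss_rate * miss_penalty


# Hypothetical baseline cache.
baseline = average_memory_access_time(hit_time=2, miss_rate=0.10, miss_penalty=100)

# Hypothetical compressed cache: it holds more data (lower miss rate),
# but decompression lengthens every hit.
compressed = average_memory_access_time(hit_time=4, miss_rate=0.07, miss_penalty=100)

print(f"baseline:   {baseline:.1f} cycles")
print(f"compressed: {compressed:.1f} cycles")
```

With these invented numbers compression comes out slightly ahead (about 11 cycles versus 12). Raising the compressed hit time to 6 cycles reverses the result, which is why the balance depends on the target system.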

At North Carolina State University, we provide extensive education and training in the area of memory subsystems. The memory subsystem is covered in courses ranging from the undergraduate level through the introductory graduate level to the advanced graduate level. Our research program is at the cutting edge of innovation in the memory subsystem, and is recognized internationally. We have a tradition of spearheading research activities in new areas, identifying new problems and proposing innovative solutions to such problems. Some examples of our past projects that have demonstrated our role in pioneering research efforts in the memory subsystem include:

Fair caching: a new concept in which, when a cache is shared by applications running on multiple processor cores, the effect of contention on performance is equalized among all cores.

Multicore Quality of Service: a framework for providing performance guarantees to applications running on a multicore chip. When multiple processor cores on a chip share a common memory hierarchy, it is important to provide performance isolation (better yet, performance guarantees) to different cores. It is also crucial that providing Quality of Service does not penalize the system throughput. The project focuses on both providing Quality of Service and optimizing throughput in such an environment.

Encrypted memory: a memory system that is encrypted and authenticated. Plaintext data and code stored in main memory and on disk are vulnerable to an attacker who tries to obtain secret information stored in the memory. To protect against such attacks, the memory can be encrypted and protected with authentication codes. The project achieves secure memory with low performance overheads and compatibility with modern system features such as virtual memory and inter-process communication.

Microprocessor Architecture

A high-performance microprocessor is at the heart of every general-purpose computer, from servers, to desktop and laptop PCs, to open cell-phone platforms such as the iPhone. Its job is to execute software programs correctly and as quickly as possible, within challenging cost and power constraints.

Research in microprocessor architecture investigates ways to increase the speed at which the microprocessor executes programs. All approaches have in common the goal of exposing and exploiting parallelism hidden within programs. A program consists of a long sequence of instructions. The microprocessor maintains the illusion of executing one instruction at a time, but under the covers it attempts to overlap the execution of hundreds of instructions at a time. Overlapping instructions is challenging due to interactions among them (data and control dependencies). A prevailing theme, speculation, encompasses a wide range of approaches for overcoming the performance-debilitating effects of instruction interactions. They include branch prediction and speculation for expanding the parallelism scope of the microprocessor to hundreds or thousands of instructions, dynamic scheduling for extracting instructions that may execute in parallel and overlapping their execution with long-latency memory accesses, caching and prefetching to collapse the latency of memory accesses, and value prediction and speculation for parallelizing the execution of data-dependent instructions, to mention a few.
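As a concrete instance of speculation, branch prediction can be sketched with a single 2-bit saturating counter, a common textbook scheme. The class and function names below are hypothetical, and real predictors keep many such counters indexed by branch address and history. This sketch only shows the basic mechanism.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, initial_state=1):
        self.state = initial_state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_correct(outcomes, predictor):
    """Replay a sequence of branch outcomes and count correct predictions."""
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct


# Hypothetical loop branch: taken nine times, then not taken at loop exit.
loop_branch = [True] * 9 + [False]
print(count_correct(loop_branch, TwoBitPredictor()), "of", len(loop_branch))  # 8 of 10
```

The predictor misses once while warming up and once at loop exit. Because the counter saturates, a single not-taken outcome does not flip its prediction for the next time the loop runs.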

Within this speculation framework, there is room for exposing and exploiting different styles of parallelism. Instruction-level parallelism (ILP) pertains to concurrency among individual instructions. Such fine-grained parallelism is the most flexible but not necessarily the most efficient. Data-level parallelism (DLP) pertains to performing the same operation on many data elements at once. This style of fine-grained parallelism is very efficient, but only applies when such regularity exists in the application. Thread-level parallelism (TLP) involves identifying large tasks within the program, each comprised of many instructions, that are conjectured to be independent or semi-independent and whose parallel execution may be attempted speculatively. Such coarse-grained parallelism is well-suited to emerging multi-core microprocessors (multiple processing cores on a single chip). With the advent of multi-core microprocessors, robust mixtures of ILP, DLP, and TLP are likely.

Microprocessor architecture research has always been shaped by underlying technology trends, making it a rapidly changing and vigorous field. As technology advances, previously discarded approaches are revisited with dramatic commercial success (e.g., superscalar processing became possible with ten-million transistor integration). By the same token, technology limitations cause a rethinking of the status quo (e.g., deeper pipelining seems unsustainable due to increasing power consumption).

The latest trend, multi-core microprocessors, challenges a new generation of researchers to accelerate sequential programs by harnessing multiple heterogeneous and homogeneous cores. Current NC State research projects along these lines include:

FabScalar Project. A promising way to increase performance and reduce power is to integrate multiple differently-designed cores on a chip, each customized to a different class of programs. Heterogeneity poses some unprecedented challenges: (1) designing, verifying, and fabricating many different cores with one design team, (2) architecting the optimal heterogeneous multi-core chip when faced with boundless design choices and imperfect knowledge of the workload space, (3) quickly evaluating core designs in a vast design space, and (4) automatically steering applications and phases of applications to the most suitable cores at run-time. The FabScalar Project comprehensively meets these challenges with a Verilog toolset for automatically assembling arbitrary superscalar cores, guiding principles for architecting heterogeneous multi-core chips, analytical methods for quickly customizing cores to workloads, and novel approaches for steering application phases to the most suitable cores.

MemoryFlow Project. This project explores a new microarchitecture that distributes a program’s data and corresponding computation among many cores.

Slipstream Project. This mature project pioneered the use of dual threads/cores in a leader-follower arrangement for improving performance and providing fault tolerance.

Parallel and Distributed Computer Architecture

Parallel computing is the simultaneous execution of the same task (split up and specially adapted) on multiple processors in order to obtain faster results. There are many different kinds of parallel computers (or “parallel processors”). Flynn’s taxonomy classifies parallel (and serial) computers according to whether all processors execute the same instructions at the same time (single instruction/multiple data — SIMD) or each processor executes different instructions (multiple instruction/multiple data — MIMD). They are also distinguished by the mode used to communicate values between processors. Distributed memory machines communicate by explicit message passing, while shared memory machines have a global memory address space, through which values can be read and written by the various processors.
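The two communication modes can be contrasted with a minimal sketch. It uses Python threads inside one process, so it models only the programming styles, not real distributed-memory hardware, and in CPython it is not expected to run faster than a sequential sum. The function names are hypothetical.

```python
import queue
import threading


def sum_with_shared_memory(chunks):
    """Workers add partial sums into one shared location, guarded by a lock."""
    total = [0]                      # shared, mutable state
    lock = threading.Lock()

    def worker(chunk):
        partial = sum(chunk)
        with lock:                   # synchronize access to the shared location
            total[0] += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def sum_with_message_passing(chunks):
    """Workers share no variables; each sends its partial sum as a message."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))      # explicit send

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get() for _ in chunks)   # explicit receives


data = list(range(1, 101))
chunks = [data[i:i + 25] for i in range(0, 100, 25)]
print(sum_with_shared_memory(chunks), sum_with_message_passing(chunks))  # 5050 5050
```

In the shared-memory version the programmer must protect the shared location with a lock. In the message-passing version the coordination is carried by the explicit send and receive operations.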

The fastest supercomputers are parallel computers that use hundreds or thousands of processors. In June 2008, the fastest computer in the world was a machine called “Roadrunner,” built by IBM for the Los Alamos National Laboratory. It has more than 100,000 processors, and can compute more than one quadrillion (10^15) floating point operations per second (one petaflop/s). Of course, only very large problems can take advantage of such a machine, and they require significant programming effort. One of the research challenges in parallel computing is how to make such programs easier to develop.

The challenge for parallel computer architects is to provide hardware and software mechanisms to extract and exploit parallelism for performance on a broad class of applications, not just the huge scientific applications used by supercomputers. Reaching this goal requires advances in processors, interconnection networks, memory systems, compilers, programming languages, and operating systems. Some mechanisms allow processors to share data, communicate, and synchronize more efficiently. Others make it easier for programmers to write correct programs. Still others enable the system to maximize performance while minimizing power consumption.

With the development of multicore processors, which integrate multiple processing cores on a single chip, parallel computing is increasingly important for more affordable machines: desktops, laptops, and embedded systems. Dual-core and quad-core chips are common today, and we expect to see tens or hundreds of cores in the near future. These chips require the same sorts of architectural advances discussed above for supercomputers, but with even more emphasis on low cost, low power, and low temperature.

Security and Reliable/Fault-Tolerant Computing

Reliable (fault-tolerant) computing deals with techniques that give a computer system the ability to maintain normal operation despite the occurrence of failures. A failure may be permanent, in which case a component cannot function properly after the failure, or transient, in which case a component suffers a temporary failure (such as loss of data) but remains functional afterward. Failures may occur in hardware components or in software components, the latter due to bugs in code.

The goal of fault-tolerant computing is to provide high availability, measured as the percentage of time the system is functioning. Availability is affected by the failure rate as well as by the time to recover from a failure. Designers of fault-tolerant computer systems must balance the target availability that is appropriate for the market of the systems, the cost of providing fault tolerance, and performance overheads.
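The dependence of availability on both failure rate and recovery time can be sketched with the standard steady-state model, availability = MTTF / (MTTF + MTTR). Mean time to failure (MTTF) and mean time to repair (MTTR) are terms introduced here for illustration and do not appear in the text above. The numbers are hypothetical.

```python
HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def availability(mttf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail):
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical system: fails on average every 999 hours, takes 1 hour to recover.
a = availability(mttf_hours=999, mttr_hours=1)
print(f"availability: {a:.3%}")
print(f"downtime:     {downtime_hours_per_year(a):.2f} hours/year")
```

Under this model, halving the recovery time improves availability about as much as doubling the time between failures, which is why fast recovery receives as much attention as failure prevention.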

Failures can be masked by using redundant execution, for example by having multiple components perform the same task and selecting the majority outcome as the correct outcome. Failures can be detected and corrected using error detection and correction coding. Failures can also be detected and recovered from using a roll-back recovery scheme, in which the state of the system is rolled back to a known good state, and computation is restarted from there.
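Masking by redundant execution can be sketched as a majority voter over replicated components. The function names and the faulty replica below are hypothetical. The sketch assumes replica results are hashable values that can be compared for equality.

```python
from collections import Counter


def majority_vote(results):
    """Return the value produced by a strict majority of replicas.

    Raises ValueError if no value has a strict majority, in which case the
    failure cannot be masked by voting alone.
    """
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority: failure cannot be masked")
    return value


def run_replicated(replicas, x):
    """Run every replica on the same input and vote on the outcomes."""
    return majority_vote([replica(x) for replica in replicas])


def square(x):
    return x * x


def faulty_square(x):
    return x * x + 1        # hypothetical faulty component


print(run_replicated([square, faulty_square, square], 7))  # 49
```

With three replicas, one faulty result is outvoted. If all three disagree, the voter reports the failure instead of guessing, and a scheme such as roll-back recovery would have to take over.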

Computer security deals with techniques to keep computers secure from attacks. With the increasing interconnectedness of computer systems, security attacks are of increasing concern. The goal of a security attack is to modify the behavior of the computer system in order to benefit the attacker, for example by leaking or destroying valuable information, or by making the system inoperable.

Attackers can attack different components of the software layer by exploiting vulnerabilities in application code or the operating system, or the hardware layer by exploiting unprotected hardware components in the system.

At North Carolina State University, we cover fault tolerance and computer security briefly in several courses at the undergraduate level and introductory graduate level, and cover them extensively in an advanced graduate level course. Our research program addresses fault tolerance and computer security concerns in various components of the computer system, such as at the processor microarchitecture level, the memory system architecture level, and the system software level. Some examples of our past projects that have demonstrated our role in pioneering research efforts in fault tolerance and security include:

Secure heap memory: a heap library that removes the link between vulnerabilities in the application code and the behavior of the heap library. The library is secured by the protection that comes naturally from placing the application and a new heap management process in separate address spaces. The library, known as the “Heap Server library,” has been released to the public and is an example of how research and innovation at NCSU provide an immediate benefit to the community.

Secure processor architecture : plain text of data and code stored in the main memory and disk is vulnerable to an attacker who tries to obtain secret information stored in the memory. To protect against the possibility of such attacks, the memory can be encrypted and protected with authentication code. Data is only accessed in plaintext form inside the processor chip. The project achieves secure memory with low performance overheads and compatibility with modern system features such as virtual memory and inter-process communication.

Software and Optimizing Compilers

Constructing efficient software for different applications (video game vs. a web browser) running on different hardware platforms (desktop vs. mobile phone) is extremely challenging. This is partly because software development is a challenging task and partly because our expectations of software steadily grow. With each new generation of desktop or mobile phone, for example, we expect higher performance, lower power/longer battery life, increased reliability, and greater security. One way to help meet these expectations is the development of automatic tools that can analyze source code, optimize it for a particular platform, and catch errors and other programming flaws. These analyses and optimizations can be implemented as part of a compiler.

A compiler performs three distinct but inter-related tasks: it translates from a high-level language to a language that the target hardware can interpret, it optimizes the code to improve both the quality of the translation and the programmer’s work, and it creates a schedule that uses the hardware’s resources efficiently when the program is ultimately run. These tasks typically involve multiple steps:

Compiler analysis – This is the process of gathering program information from the intermediate representation of the input source files. Typical analyses are variable define-use and use-define chains, data dependence analysis, alias analysis, etc. Accurate analysis is the basis for any compiler optimization. The call graph and control flow graph are usually also built during the analysis phase.

Code generation – The transformed intermediate language is translated into the output language, usually the native machine language of the system. This involves resource and storage decisions, such as deciding which variables to fit into registers and memory, and the selection and scheduling of appropriate machine instructions along with their associated addressing modes. Memory requirements and execution time can also be analyzed in this phase.
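The define-use chains mentioned in the analysis step can be sketched for the simplest case, straight-line code with no branches. The statement representation and function name below are hypothetical. A real compiler computes these chains over a control flow graph using reaching-definitions analysis.

```python
def def_use_chains(statements):
    """Map each definition (statement index) to the indices that use its value.

    statements: list of (dest, operands) tuples in program order, where
    operands are the variable names the statement reads. With no branches,
    the definition reaching a use is simply the most recent one.
    """
    last_def = {}                    # variable name -> index of latest definition
    chains = {}
    for index, (dest, operands) in enumerate(statements):
        for var in operands:
            if var in last_def:      # variables never defined here are inputs
                chains[last_def[var]].append(index)
        last_def[dest] = index       # this statement now defines dest
        chains[index] = []
    return chains


# Hypothetical program:
#   0: a = x + y
#   1: b = a * 2
#   2: a = b - x
#   3: c = a + b
program = [
    ("a", ["x", "y"]),
    ("b", ["a"]),
    ("a", ["b", "x"]),
    ("c", ["a", "b"]),
]
print(def_use_chains(program))  # {0: [1], 1: [2, 3], 2: [3], 3: []}
```

The empty chain for statement 3 shows that the value of c is never used within this fragment, which is the kind of fact an optimizer could rely on when deciding whether a computation can be removed.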

Students can learn more about Software and Compiler Optimizations in a sequence of courses spanning the undergraduate and graduate curriculum. At the 500 level and lower, students learn about techniques for software design (ECE 517) and generating efficient code for modern architectures (ECE 466/566). In the 700 level courses, students can explore advanced analyses and optimizations for parallelization and reliability (ECE 743, ECE 785). This coursework prepares students for jobs in industry and for Masters and Ph.D. level

VLSI and Computer Aided Design

Very-large-scale integration (VLSI) is the process of building miniaturized electronic circuits, consisting mainly of semiconductor devices called transistors, on the surface of a thin substrate of semiconductor material. These circuits, often referred to as electronic chips, are used in almost all electronic equipment in use today and have revolutionized the world of electronics.

The dimensions of transistors have been reduced to a fraction of a micrometer. With the aid of nanotechnology, the size of a transistor may approach several nanometers. Decreasing transistor sizes make VLSI circuits faster, as evidenced by the more than 1000-fold speedup of microprocessors. Large electronic chips today contain several hundred million transistors. A VLSI chip the size of a fingertip can provide functions that previously required thousands of printed circuit boards.

VLSI research has many aspects. All transistors must be properly placed and connected so that the entire circuit can operate at high frequencies. The power consumption of circuits needs to be reduced. Reliability and testability must be addressed. VLSI design is closely related to fabrication processes, and therefore research on solid-state materials is also performed.

The design of VLSI circuits is a major challenge. Consequently, it is impossible to rely solely on manual design approaches. Computer-aided design (CAD), also referred to as electronic design automation (EDA), is widely used. In EDA, computer programs are created to develop VLSI circuits. In CAD research, both software and hardware activities are conducted to improve design quality and reduce design time.