Gate-Level Simulation Resurgence: Is the Answer to Buy a Bigger Hammer?

In the mid-1990s, EDA providers started promoting synthesis and RTL simulation, and declared gate-level simulation (GLS) a thing of the past. After all, they said, we could now trust synthesis technology, equivalence checking, static timing analysis, and sophisticated design for test (DFT) tools to reduce, and possibly eliminate, our need for GLS. If that's all true, then why has the demand for GLS seen a resurgence in the past few years?

No one should be surprised by the answer: advanced node requirements. While these requirements point to the obvious need for higher speed and capacity, the demand can't be justified on that need alone or we would have seen more GLS as designs moved to finer process nodes throughout the 1990s and early 2000s. To understand the need for GLS, you actually have to look a bit deeper at two key design characteristics that were phased in around the 40nm node: multiple clock domains and low power.

Around 40nm, hardware performance began to depend on a sharp increase in the number of clock domains to enable parallel execution, instead of just increasing clock speeds. This was necessary because we could no longer afford the power budget associated with the relentless increase in clock speed that drove performance in previous nodes. The result was a dramatic rise in the number of simulator events needed to simulate multiple setup and hold checks for timed simulations, more DFT simulations to verify larger scan chains and built-in self-test (BIST) implementations, and new low-power simulations. In effect, every aspect of GLS surged.

The easiest response is to reach for a bigger hammer, a faster simulator, but is that enough? With run times reaching 21 days or more for a single test, a faster simulator is necessary but insufficient. What’s needed is a combination of tools and techniques: a faster simulator, acceleration for untimed DFT tests, and static tools that help engineers avoid simulating some problem classes altogether. This article will focus on techniques generally available to you regardless of your EDA supplier.

Effectively Use Static Tools to Reduce GLS

Static tools, including lint checking and static timing analysis, remove specific classes of bugs before GLS starts and reduce the number of cycles spent in GLS. Some of these bugs are simply easier to find with static tools, some would prevent simulation from proceeding at all, and some are found much faster statically than in simulation. Later in the cycle, static tools can be used to temporarily work around errors so that the simulator can continue without waiting for the design team to fix each error one by one.

Lint checking tools find errors in source code without simulating that code. These tools can look for design structures that could lead to bugs. One example is a loop that would operate properly with timing annotation but fails in the zero delay mode typically used for DFT tests. Simulation tools may have features that allow users to apply nominal delays to these loops without having to fully annotate the design, so the DFT tests can continue to run at the expected speed. Static tools can also detect design structures that can lead to race conditions. Race conditions are especially insidious because, unless they trigger setup and hold violations, they may require visual inspection in the waveform tools to detect. Lint checking can direct the engineer’s attention to suspicious code, thereby saving considerable project time.

Static timing analysis (STA) focuses debugging efforts on the design areas that have not met the timing requirements. Since the timing issues reported by STA tools require fixes to the design, the Standard Delay Format (SDF) files output by STA tools will show failures when applied to GLS. The engineer running GLS can temporarily fix these violations in SDF through the STA tools while the timing closure team fixes the root issue in parallel (see Figure 1). In this way, expensive full-timing GLS cycles are not spent simulating working code, which results in shorter overall debugging time.

Setup and hold violations are typical failures that can be noted and worked around to allow the GLS process to continue. Consider the simple example in Figure 2. A setup violation is reported at FF2 and the slack is -0.56. The sample combinational logic is also shown. The SDF file generated by the STA tool will allow the simulator to report the timing violations and may report a functionality mismatch if the data for D2 is not latched into FF2 at the next clock cycle. This error might propagate to other parts of the design as well.

Figure 2: Setup Violation

Since this is a known issue that requires a fix in the combinational logic from FF1 to FF2, the verification engineer can avoid waiting for a design fix by either temporarily ignoring this path or temporarily fixing the timing. The temporary fix can be made in the STA tool itself by setting a few gate delays on the path to zero until the negative slack is compensated. Once the slack is compensated, the SDF can be regenerated and used by the GLS.
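The arithmetic behind this workaround can be illustrated with a minimal Python sketch. It is not the procedure of any particular STA tool: the function name gates_to_zero, the gate names, and the delay values are all hypothetical, and the choice to zero the largest delays first is an assumption made only to keep the number of edited gates small. The only number taken from the example above is the slack of -0.56.

```python
def gates_to_zero(gate_delays, slack):
    """Pick gates whose delays, once zeroed, recover a negative setup slack.

    gate_delays: dict mapping a gate instance name to its delay
                 (same time unit as the slack).
    slack: setup slack reported by STA (negative means a violation).
    Returns (names of gates to zero, resulting slack). If the resulting
    slack is still negative, the path cannot be recovered this way.
    """
    chosen = []
    new_slack = slack
    # Assumption: zero the largest delays first so fewer gates are touched.
    ordered = sorted(gate_delays.items(), key=lambda kv: kv[1], reverse=True)
    for name, delay in ordered:
        if new_slack >= 0:
            break
        chosen.append(name)
        new_slack += delay  # removing this delay gives the path that much slack
    return chosen, new_slack


# Hypothetical delays for the combinational path between FF1 and FF2.
path = {"U1": 0.20, "U2": 0.35, "U3": 0.15, "U4": 0.30}
chosen, new_slack = gates_to_zero(path, slack=-0.56)
print(chosen, round(new_slack, 2))
```

Every gate zeroed this way makes the simulation more optimistic than the real silicon, so the list of edited gates should be recorded and the edits reverted once the design fix lands.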

This approach helps only when the violations reported by STA tools are few and confined to a small part of the design. It will not work when there are many violations spread across most of the design, because the edits would change the timing information in the SDF file so extensively that the simulation becomes very optimistic. Within those limits, the approach is quite effective near the end of the project, when the pressure to close verification is highest.

Speeding GLS for Design for Test (DFT) Verification

Gate-level DFT simulation is performed for the verification of test structures inserted by specialized DFT tools. As designs have exploded in size and complexity, great advancements have taken place in automated test tools over the last decade in the area of scan chain insertion, compression logic to reduce I/O and speed up testing, removal of hotspots, BIST logic, and so on. Tests that run on physical testers in seconds of real-time at GHz speeds can take hours to days in simulation. However, several approaches can speed these simulations.

Figure 3 shows a typical ATPG test flow. The red boxes highlighted in the graphic indicate qualitatively where the majority of simulation time is spent. There are several actions that can be taken to speed these simulations.

Figure 3: Typical ATPG Simulation Flow Using Serial Mode

Clock the scan patterns in parallel instead of serially. Serial patterns take n clock cycles to scan in and n cycles to scan out, where n is the scan chain length. Even with optimized scan chains, there can be thousands of clock cycles per pattern. Parallel load and unload techniques can be used to drastically reduce the simulation time in these cases.
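A rough cycle count shows why parallel load and unload pay off. The Python sketch below follows the description above (n cycles to scan in, n cycles to scan out per pattern). The capture cycle count, the fixed load/unload cost assumed for parallel mode, and the example chain length and pattern count are all assumptions for illustration, not figures from any tool or design.

```python
def serial_cycles(chain_length, patterns, capture_cycles=1):
    """Clock cycles to simulate patterns in serial mode.

    Each pattern is shifted in over chain_length cycles, captured,
    then shifted out over chain_length cycles.
    """
    return patterns * (2 * chain_length + capture_cycles)


def parallel_cycles(patterns, capture_cycles=1, load_unload_cycles=2):
    """Clock cycles to simulate patterns in parallel mode.

    Assumption: the testbench loads and observes every scan cell directly,
    so load plus unload cost a small fixed number of cycles per pattern.
    """
    return patterns * (load_unload_cycles + capture_cycles)


n = 2000  # hypothetical scan chain length
m = 100   # hypothetical number of patterns
print("serial:  ", serial_cycles(n, m))
print("parallel:", parallel_cycles(m))
print("ratio:   ", serial_cycles(n, m) / parallel_cycles(m))
```

The ratio grows with the chain length, which is why the benefit is largest on designs with long chains. Parallel mode bypasses the shift path, so at least one serial pattern is still needed to check the chains themselves.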

Run unit delay simulations before full timing. SDF-annotated timing simulations with advanced node libraries run for a very long time and uncover both timing and non-timing issues. Since timing simulations can take four to five times more wall-clock time and memory than unit delay simulations, debugging time lengthens accordingly, so it pays to find the non-timing issues in unit delay mode first.

Use pre-layout unit delay simulations to catch errors early. For example, a few functional tests should be simulated to build confidence in the test insertion. In addition, a single ATPG pattern that exercises all scan chains should be run in serial mode to verify scan chain integrity. If compute resources allow, a few additional top-ranking patterns should be selected for simulation, based on the coverage grading produced by the test tool. A hardware acceleration solution can be used to verify the functional integrity between RTL and pre-layout netlists because acceleration may run 10,000 to 100,000 times faster than GLS.

Figure 4: Typical ATPG Simulation Flow Using Parallel Mode

Figure 4 shows how ATPG simulations can be broken down to run in parallel on a simulation farm. Simulating a single pattern could take several hours in serial mode. If 10 to 12 patterns are chained together in a single simulation run, that run could take days. Using a compute farm to break long serial runs into several shorter runs that execute in parallel will shrink the overall regression time. Doing so trades compute efficiency for faster turnaround, which matters most when validating late design changes and during debug.

A calculation can determine the right tradeoff. First, measure the DUT initialization time, and call it TinitDUT. Next, measure the simulation times of a few patterns and average them to get the average time per pattern. Call this Tpattern_av. Note that pattern simulation time will vary because different patterns produce different event densities. Finally, measure the time it takes to start a simulation on the farm, and call it TSimStart. The total simulation time for running m patterns serially should be:

Total simulation time on a single machine = TSimStart + TinitDUT + m x Tpattern_av

By segmenting patterns into multiple parallel simulation runs, verification engineers can reduce turnaround time significantly. Constraint solving can be used to identify the optimal solution for the minimum number of machines required to achieve a target regression time, as shown in the table below (all numbers are in minutes).

Table 1 shows that partitioning long single-pattern simulations into multiple shorter simulations can achieve faster regression turnaround time. The partitioning must be planned up front so that it does not become the critical constraint right before tape-out.

Table 1: Effects of Partitioning Long Single-Pattern Simulations
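The tradeoff described above can be sketched in a few lines of Python. This is a simplified stand-in for the constraint solving mentioned earlier, not a reproduction of it. It assumes every parallel run pays TSimStart and TinitDUT once, that patterns are split as evenly as possible, and that the farm can start all runs at the same time. The function names and the example numbers (in minutes) are hypothetical and are not taken from Table 1.

```python
import math


def serial_time(m, t_sim_start, t_init_dut, t_pattern_av):
    """Total time for m patterns in one run: TSimStart + TinitDUT + m x Tpattern_av."""
    return t_sim_start + t_init_dut + m * t_pattern_av


def parallel_time(m, machines, t_sim_start, t_init_dut, t_pattern_av):
    """Turnaround time when m patterns are split evenly over several machines.

    The slowest run carries ceil(m / machines) patterns, and every run
    pays the start-up and DUT initialization overhead again.
    """
    patterns_per_run = math.ceil(m / machines)
    return t_sim_start + t_init_dut + patterns_per_run * t_pattern_av


def min_machines(m, target, t_sim_start, t_init_dut, t_pattern_av):
    """Smallest machine count that meets the target turnaround time.

    Returns None if even one pattern per machine misses the target.
    """
    overhead = t_sim_start + t_init_dut
    if target < overhead + t_pattern_av:
        return None
    max_patterns_per_run = int((target - overhead) // t_pattern_av)
    return math.ceil(m / max_patterns_per_run)


# Hypothetical measurements, all in minutes.
m, t_start, t_init, t_pat = 12, 5, 30, 120
k = min_machines(m, target=300, t_sim_start=t_start,
                 t_init_dut=t_init, t_pattern_av=t_pat)
print("serial run:", serial_time(m, t_start, t_init, t_pat))
print("machines needed:", k)
print("turnaround:", parallel_time(m, k, t_start, t_init, t_pat))
print("total compute:", k * parallel_time(m, k, t_start, t_init, t_pat))
```

With these made-up numbers, the total compute across all machines exceeds the single serial run because the overhead is paid once per machine. That extra compute is the efficiency being traded for turnaround time.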

Conclusions

GLS is seeing a resurgence due to the demands of advanced process nodes. At first, the EDA vendor reaction was to just build a bigger hammer in the form of a faster gate-level simulator. While this is necessary, it isn't sufficient. We need to apply new techniques to speed DFT and timing simulations as outlined in this article. We can go further with tool-oriented solutions, including save/restart, to focus the simulation on specific problem areas, and with hardware-based acceleration to speed up untimed DFT simulation by 100,000x or more. Whether you are deep into this space at 14nm, or just entering at 40nm, the good news is that many of these new techniques are available today. Yes, you do need a bigger hammer. But you also need a bigger toolbox and some new tools to go with it.

-----------------------

Adam markets the UVM and the multi-language verification simulator for Cadence, tapping 20 years of experience in verification and software engineering including roles in marketing, product management, applications engineering, and R&D. Adam is the secretary of the Accellera Verification IP Technical Subcommittee (VIPTSC) which has standardized the UVM. Adam blogs on verification subjects at http://www.cadence.com/community/fv/ and tweets on them @SeeAdamRun.

* MS EE from the University of Rochester, with research published in the IEEE Transactions on CAD
* BS EE and BA CS from SUNY Buffalo