‘More Than Moore’ Reality Check

The semiconductor industry is embracing multi-die packages as feature scaling hits the limits of physics, but how to get there with the least amount of pain and at the lowest cost is a work in progress. Gaps remain in tooling and methodologies, interconnect standards are still being developed, and there are so many implementations of packaging that the number of choices is often overwhelming.

Multi-die implementations today encompass a range of packaging technologies and approaches that have evolved over the past 40 years. It began with multi-chip modules in the 1980s. In the late 1990s, system-in-package approaches were introduced. That was followed by interposer-based implementations around 2008. Today, all of those still exist, along with fan-outs, true 3D-ICs, and some proprietary implementations of chiplets, which are sometimes referred to as disaggregated SoCs.

Much of this has been driven by a reduction in performance and power benefits from scaling below 10nm, along with the growing number of physics-related issues at the most advanced nodes, such as multiple types of noise, thermal effects and electromigration. Most companies working at those nodes already are utilizing some form of advanced packaging to help justify the huge cost of moving to the next node.

Three major changes are underway in this “More Than Moore” paradigm:

Heterogeneous integration using chiplets. Companies such as Intel, AMD and Marvell already are utilizing a chiplet approach for their own designs, but there are efforts underway to standardize the interfaces for chiplets and open this up to third-party chiplets.

Big improvements in multi-chip performance. Approaches such as fan-out wafer-level packaging originally were slated to be low-cost alternatives to 2.5D and 3D-IC, but increased density, pillars, high-bandwidth memory and faster interconnects have made these approaches much more attractive. 3D-ICs likewise are beginning to take shape at the high end of this market.

Shifts by all the major foundries into advanced packaging. TSMC, UMC, GlobalFoundries, Samsung and others offer advanced packaging options today. TSMC also is developing packaging at the front end of the line, where chiplets are etched directly into silicon using a direct bond approach.

“Part of the growth of MTM means potentially that Moore’s Law is really coming to an end, and some people think that it’s already ended,” said John Park, product management group director for IC packaging and cross-platform solutions at Cadence. “In fact, ever since finFET became an option, the price per transistor actually has gone up. That’s a big part of Moore’s Law, so you could argue that it ended in 2012 or 2013.”

Regardless, it absolutely will end at some point, at least for many components in an SoC. “We can’t manufacture some things due to the laws of physics,” said Park. “Meanwhile, designing chips at the latest nodes costs millions of dollars and requires big design teams. If the Department of Defense is building 1,000 nuclear submarines, they’ll never recoup the NRE of designing at 7nm or 5nm. As a result, the DoD, along with medium- and low-volume engineering teams, have already started looking at alternatives to simply scaling based on Moore’s Law because it just doesn’t make sense anymore.”
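Park's point about low-volume programs comes down to amortization arithmetic: a one-time design cost is spread over the units shipped. The sketch below is a minimal illustration of that. The NRE and per-unit figures are hypothetical placeholders chosen only to show the shape of the curve, not numbers from Cadence or the DoD.

```python
def cost_per_unit(nre: float, unit_cost: float, volume: int) -> float:
    """One-time NRE spread over the shipped volume, plus recurring cost per unit."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return nre / volume + unit_cost


# Hypothetical figures for illustration only (arbitrary currency units).
NRE = 50_000_000.0   # assumed one-time design cost at an advanced node
UNIT_COST = 100.0    # assumed recurring manufacturing cost per chip

for volume in (1_000, 100_000, 10_000_000):
    print(f"{volume:>10,} units -> {cost_per_unit(NRE, UNIT_COST, volume):,.2f} per unit")
```

At low volume the NRE term dominates the per-unit cost, which is why low- and medium-volume teams look for alternatives to designing at the newest node.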

Fig. 1: Evolution of multi-die solutions. Source: Cadence

Xilinx uncorked the first commercially available 2.5D chip in 2011, based on four chips connected through an interposer. The company said at the time that the main driver behind that decision was that smaller chips achieved better yield. Since then, the emphasis has shifted to the cost of designing a massive planar chip, as well as the difficulty of adding more RF and analog into an advanced-node design because analog does not benefit from scaling. In fact, many of the analog IP blocks in advanced chips are mixed signal, with an increasing emphasis on the digital portion.

“True monolithic 3D will add even more possibilities when it comes online in the next few years,” said Rob Aitken, fellow and director of technology for R&D at Arm. “There are two main drivers for the move to multi-die — cost and capability. Cost reduction occurs when yield on a large die is expected to be low, and the yield improvement resulting from multiple smaller die will more than cover the extra cost and complexity in assembly and packaging. In these cases, especially in adjacent die approaches, designers need to concentrate first on splitting a design between chips in a way that minimizes communication bandwidth between die. They also may choose to implement individual die in different processes, targeting high-speed digital logic to the bleeding edge while implementing analog or mixed signal circuits on an earlier node. Once the decision has been made to go multi-die, it then makes sense to look at capabilities that a multi-die solution can achieve that cannot be replicated in a single die. The simplest example is a design that is simply too large to fit in a single reticle. But other possibilities abound, especially for stacked die solutions with high inter-die bandwidth.”
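Aitken's cost argument can be made concrete with the simple Poisson yield model, Y = exp(-A * D0), where A is die area and D0 is defect density. The sketch below compares one large die with the same area split into four smaller die. The defect density, silicon cost and assembly overhead are assumed values for illustration, and the model assumes die are tested before assembly and that assembly itself loses no parts.

```python
import math


def poisson_yield(area_cm2: float, defects_per_cm2: float) -> float:
    """Fraction of good die under the simple Poisson model Y = exp(-A * D0)."""
    return math.exp(-area_cm2 * defects_per_cm2)


def cost_per_good_system(total_area_cm2: float, n_die: int,
                         defects_per_cm2: float, cost_per_cm2: float,
                         extra_assembly_per_die: float = 0.0) -> float:
    """Silicon cost of one working system built from n equal-size die.

    Assumes known-good-die testing before assembly and perfect assembly yield.
    """
    die_area = total_area_cm2 / n_die
    good_fraction = poisson_yield(die_area, defects_per_cm2)
    cost_per_good_die = die_area * cost_per_cm2 / good_fraction
    return n_die * (cost_per_good_die + extra_assembly_per_die)


# Assumed, illustrative inputs (not from the article).
TOTAL_AREA = 6.0   # cm^2 of silicon in the system
D0 = 0.2           # defects per cm^2
SILICON = 100.0    # cost units per cm^2 of processed wafer
ASSEMBLY = 20.0    # extra packaging cost per die versus a single-die package

monolithic = cost_per_good_system(TOTAL_AREA, 1, D0, SILICON)
four_die = cost_per_good_system(TOTAL_AREA, 4, D0, SILICON, ASSEMBLY)
print(f"monolithic: {monolithic:.0f}  four die: {four_die:.0f}")
```

With these inputs the yield gain from smaller die outweighs the added assembly cost. Raising the assembly overhead or lowering the defect density shifts the balance back toward a single die, which is the tradeoff Aitken describes.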

Stacked die adds another dimension to floor-planning, which is a big benefit as chips become larger and wires become thinner. That allows chipmakers to move cache closer to processors, for example. Because the distance that data needs to travel is reduced, and the interconnects can be sized as needed, it can provide a significant boost in performance. In some cases, this is equivalent to scaling to the next node. “Choosing the right function split in a multi-die system also enables different combinations of underlying logic, memory and I/O die, which enables multiple systems of differing complexity to be constructed from a few simple building blocks,” said Aitken.

Predicting performance
This isn’t always so straightforward, however. An important consideration in any design is the ability to predict performance. Estimations can vary, and implementing solutions isn’t as simple as adding LEGO blocks. Understanding how different blocks and implementations can affect performance and power is as critical as on a single die, and that starts with good characterization of the different components.

“With such performance indicators, the chip and system designer can compare different technology flavors, such as different metal stacks or threshold voltages or different technologies, in the very early design phase,” said Andy Heinig, group manager for system integration at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “Such metrics also can be used in the next phase to compare different system architectures against each other. That way the chip and system designers can get a feeling for what’s possible for system performance. But up to now, no such metrics have been available to the system designer for the package. Moreover, currently there are a lot of different package technologies available, and they all can’t be used together. Different balling technologies that fit one substrate technology don’t match with others. Such decisions only can be decided by a package technology expert, but they don’t have experience on the electrical side. And the electrical system experts don’t know the ins and outs of the package technology. So from that point, very good metrics or high-level exploration tools are necessary.”

Those tools need to hide the technology details while revealing only valid packaging options. “With such tools or metrics the system designer can compare different architectures, such as for the NoC or the number of interconnects between the chips, in an easy and fast way,” Heinig said.

One of the big advantages of advanced packaging is that heat can be spread across a package in modules, rather than packed onto a single die. With finFET designs at 7nm and below, leakage current, resistance and dynamic power density generate so much heat that complex power management schemes are necessary to avoid cooking the chip. But thermal management and power distribution in a package aren’t always so simple.

Multi-die implementations add a further layer of complexity with multiple such high-performance die, deeply embedded in 2.5D or 3D packages, observed Richard McPartland, technical marketing manager at Moortec. “Standard practice is to include a fabric of in chip monitors in each die, such as those from Moortec, to provide visibility of on-chip, real-time conditions in bring-up and mission mode. Typically, multiple tens of temperature sensors are used to monitor known and potential hotspots. Further, voltage monitors with multiple sense points are strongly recommended. These enable the supply voltage directly at critical circuit blocks, where speed is so dependent on supply voltage, to be monitored and controlled. On-chip process detectors are also an essential tool where processing performance and power efficiency are key. When used as part of a complete monitoring subsystem, they enable optimization schemes such as voltage scaling and compensation of aging.”

Why choose multi-die?
Despite these challenges and others, the industry has little choice but to press forward with multi-die implementations. At the same time, advanced packaging opens the door to some options that never existed in the past.

“[Multi-die approaches] are a great way to more specifically tailor the process technology to what that part of the system needs to do,” said Steven Woo, fellow and distinguished inventor at Rambus. “AMD has a great example of a multi-die solution, where the compute cores are built on one die, and you put in as many as you need. Then they’re all around another die, whose job is to connect to I/O and to memory. What’s really nice about that kind of implementation is you know all these technologies advance at different rates. So you may have something that is happy and talking very well to something like DDR4 or DDR5. But when it comes out, the rate of improvement of memory tends to be historically a little bit slower than the rate of improvement of processors, so when you go to build your next processor you don’t need to port that same memory interface to the next process node. You can leave it where it is, as long as you’re satisfied with the performance and the power efficiency of it. But what you get to do is ride the technology curve and build better processing cores. From that standpoint, it’s really nice because you can spend all your effort on the thing that needs to be improved, which is the processing core. And what you’ve done in the last round — the memory and I/O interfaces — they’re not changing very quickly, so you can use that die again.”

This also helps with yield. “Because the die yields depend a lot on the size of the die, if you’re always adding things like interfaces, it’s naturally going to make the die bigger,” said Woo. “So again, multi-die is a way to optimize the cost and then optimize where you’re spending your effort.”

Another consideration for multi-die implementations is that they spread the heat out across a larger area. “All these things are affected by heat,” he said. “What you have to make sure of is that the performance, the cost, and the physical size of doing this matches the criteria for being able to hit the performance targets as well as the cost targets. We can definitely see there are cases where that’s true. But then you need some way to connect these things, so now there is an opportunity for more I/Os. There’s a range of tradeoffs you can make in designing those I/Os to connect the chips.”

Multi-die use cases
Multi-die implementations today are the trailblazers of the chip world. They are being used for everything from high-performance AI training to inference, genomics, fluid dynamics, and advanced prediction applications.

“These are very complicated, sophisticated workloads,” said Suresh Andani, senior director for IP cores at Rambus. “If you think about a monolithic die, it needs to have all the I/Os to get the data in and out of the chip that is processing it. Then, there are a lot of compute elements within the chip itself that need to do the high-performance compute. And then you have to have memory access very close by with the lowest latency and the highest bandwidth, and you have to try to fit all of these things into one monolithic die.”

Multi-die implementations are a completely new opportunity, and the potential use cases are just beginning to emerge.

“The design considerations are very dependent on the use cases, which fall into two categories,” said Manmeet Walia, senior product marketing manager at Synopsys. “One is splitting the dies — breaking a large die into smaller pieces, because chips are approaching maximum reticle size limits. They’re getting to the point where it’s not economically feasible and technically feasible to build these large dies because yields go low. It becomes an economical and technical feasibility issue.”

At present, most of the advanced packages are being used for network switching, servers and AI training and inferencing. But as these approaches become more mainstream, they also are beginning to show up in other applications.

“Another use case along similar lines is that a lot of these compute chips would want to scale, depending on different applications,” said Walia. “One of the public examples is the AMD Ryzen chipset. They may want to use the same die going into a desktop, high-end desktop or server, so for the purpose of scaling the SoC they may build a base die and then possibly use one for a laptop, two for desktops, and four for a server application. That’s the other use case, which is scaling these SoCs.”
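The scaling use case Walia describes, one base die reused across product tiers, can be sketched as a small data model. The die counts follow his example (one for a laptop, two for a desktop, four for a server). The class names and the cores-per-die figure are hypothetical and do not describe any actual AMD product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseDie:
    name: str
    cores: int


@dataclass(frozen=True)
class Product:
    segment: str
    die: BaseDie
    die_count: int

    @property
    def total_cores(self) -> int:
        # Compute capacity scales with how many copies of the base die are packaged.
        return self.die.cores * self.die_count


# Hypothetical base die: the core count is an assumption for illustration.
BASE = BaseDie(name="compute", cores=8)

LINEUP = [
    Product("laptop", BASE, 1),
    Product("desktop", BASE, 2),
    Product("server", BASE, 4),
]

for product in LINEUP:
    print(f"{product.segment}: {product.die_count} x {product.die.name} die, "
          f"{product.total_cores} cores")
```

The design point is that only the package-level count changes between segments. The die itself is designed and validated once.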

Multi-die implementations also allow design teams to bring multiple functions together in an SoC. “They want to aggregate multiple functions. A good example of this is a 5G wireless base station, which may have an RF chip in which the antennas were developed in larger geometries, and the baseband chips, which are more digital and scaled down. This enables them to basically re-use RF chips.

“But then they keep optimizing, and bringing in multiple functions,” Walia said. “Some FPGA companies have done the same thing. This is happening in automotive, as well as consumer applications. For example, a TV may have many different types of connections, including cable connections or even wireless connections. So there may be different dies for one piece, but the digital signal processing, video processing, is happening in a big digital die that would keep scaling, and that will keep moving further down in the process geometries. Aggregating multiple functions or bringing different functions together is another use case.”

Choose your node
One of the earliest arguments for advanced packaging was the ability to mix and match IP developed at different process nodes. Initial implementations were largely homogeneous, but that has shifted over the last few years due to the slowdown in Moore’s Law and the splintering of end markets. That, in turn, has opened numerous opportunities for semi-customized solutions based on multiple process choices.

“Sometimes the solutions that we have to present are multi-chip solutions, so we may have a SiP where there are two die, and the die then is basically specific to the function it has to manage,” explained Darren Hobbs, vice president of marketing and business development at Adesto Technologies. “Typically RF and high speed RF is done in older geometries like 0.18, which is a pretty good geometry still for sub-6 Gbps. Above 6 Gbps, we probably go to 55nm. Those are the best nodes for RF. At the same time, if you’ve got a requirement for a lot of processing, you want to go on to deeper geometries like 28nm or maybe down into the finFET space. And then, if you want to get that data off that chip, it’s going to need a high-speed interface, and that in itself will determine what geometry you can use, as well. There are a lot of competing requirements, and everybody wants a monolithic die where everything’s on one die because that’s generally the cheapest thing. But inevitably, in a lot of cases we have to provide a two-chip solution or in some cases a three-chip solution. It comes down to the best tradeoff between process and between functions.”

SiP evolving to chiplets
Similar to the disaggregated/modularized SoC approach is the traditional system-in-package, which isn’t standing still, either.

“Instead of taking multiple chips, we’re now talking chiplets,” said Cadence’s Park. “We’ve always had hard and soft IP, which are the keys to driving SoCs. We now have this third version of IP called the chiplet, which has been built, manufactured and tested. It’s good to go, ready for you to plug on. Today, it’s only being done by vertically integrated companies that design the chiplet and the chip that they’re sitting on.”

But that’s expected to change as the industry begins to embrace multi-die implementations, with broad implications for the supply chain.

“This is now moving toward sensor cameras in automotive, among other applications,” said Vic Kulkarni, vice president of marketing and chief strategist for the semiconductor business unit at Ansys. “For multi-die integration, how do you do that? That’s becoming the go to market for many companies around the world. These are not the standard node-driven devices. These are use-case-driven devices. That’s what people are moving toward — not just standard technology evolution, which is Moore’s Law.”

One example is a 3D-IC developed by Sony, which has a CMOS sensor on the top, then an AI chip, and the CPU chip at the bottom, all connected with through-silicon vias (TSVs). “This is a true 3D-IC, not 2.5D, which is mostly common now. True 3D-IC structures are going to help make better decisions for autonomous driving, whether it be in the sense of fusion cameras, for almost all the cars. What is very interesting is that it brings multiple issues together: mechanical operation, thermal expansion, solder bumps getting loose with heat, and other thermal issues, because the heat generation is very high in autonomous vehicles. These are the identical issues with high-performance computing applications.”

Which packaging approach works best for high-performance computing remains to be seen. It may depend on a variety of factors, such as what is good enough for a particular application, and whether algorithms can be developed tightly enough with the hardware to make up for any inefficiencies.

“If you agree with this definition of heterogeneous integration and the chiplet-based approach being a disaggregated SoC, it’s going to be a big hit to PPA,” said Park. “These things are going to be built out of multiple blocks, not integrated in a single monolithic device. In applications like high-performance computing, I have question marks there. There’s going to be an impact. The only question is, is it within an acceptable range for that? There are obviously benefits, including lower costs. It’s easier to do, it requires smaller design teams, and in theory has lower risk. But in the area of PPA, which is where everyone in the world of SoC design has been focused for the last decade, there are a lot of unknowns. And standards don’t exist today. There is no kind of business model. Because of this, there is no general commercialization of chiplets. It’s where the industry wants to go, but there’s no business model for the IP providers, there’s no standards, and there’s no metrics on the PPA impact on using this type of disaggregated approach.”

While the chiplet approach continues its evolution, there is much happening today with high-performance computing. In fact, many of the new packaging approaches are being driven by HPC, which requires in-package memory, whether that is GDDR6 or HBM2/2E.

“This is compared to previous compute architectures where the memory was separate on the PCB motherboard,” said Keith Felton, product marketing manager at Mentor, a Siemens Business. “With today’s performance needs — such as bandwidth and low latency, along with minimizing power — the memory is moving into the package with the processor. This is a trend that will begin to extend down into more consumer high-performance devices such as laptops. User upgradable memory will become a thing of the past.”

HPC designs increasingly rely on homogeneous or heterogeneous multi-die integration rather than a monolithic SoC. “Most HPC CPUs no longer use a single monolithic SoC due to the challenges of yield and cost,” Felton said. “Instead, they often turn to homogeneous integration, literally breaking up the monolithic design into two or more die. With homogeneous, all the die must be integrated together to function. HPC also can employ the technique of heterogeneous integration, where die can operate individually or be combined to provide greater performance scaling.”

Typically, a full silicon interposer or one or more embedded silicon bridges are required to meet data-rate and latency requirements. When building an HPC CPU using a homogeneous or heterogeneous disaggregated approach, it’s essential to maximize data throughput and minimize latency, not just between the die that form the CPU but also to memory. The interposer or bridge provides silicon-level signal performance between the key inter-die functions.

All of the above require a 3D assembly-level model, both to define and understand the relationships between devices and supporting substrates and to act as a blueprint or golden reference model (digital twin) that drives implementation, verification, modeling and analysis. Also required is thermally induced stress analysis of chip-package interactions early in the design cycle to prevent early field failures. Chip-package interactions remain a major challenge due to dissimilar materials and their interactions. Effects such as warping and microbump cracking need to be factored in and mitigated before a design progresses into full electrical design, and a 3D assembly model is critical, Felton said.

And finally, 3D assembly verification, driven by a golden 3D virtual assembly model and system-level netlist, is a necessity.

“With any multi-die, multi-substrate device that has to undergo assembly after individual element fabrication, you need to verify that everything post-fabrication still aligns and electrically and mechanically performs as expected,” he said. “This is where the 3D virtual model, or digital twin, plays a crucial role. It provides the verification, analysis and modeling tools with the blueprint of how the items should interconnect, and it can then map that to the actual physical fabricated data to detect any changes such as die shrinking caused misalignment that may cause shorts or opens or eventual lifecycle failures.”
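The comparison Felton describes, mapping the blueprint to fabricated data to catch misalignment, can be illustrated with a toy check of connection-point positions. This is only a sketch of the idea, not how any commercial verification tool works. The pad names, coordinates and tolerance are invented for the example.

```python
from math import hypot


def find_misaligned(blueprint, measured, tolerance_um):
    """Compare measured pad positions against the reference model.

    Returns {pad: offset_um} for pads displaced beyond the tolerance, and
    {pad: None} for pads missing from the measured data.
    """
    issues = {}
    for pad, (x_ref, y_ref) in blueprint.items():
        if pad not in measured:
            issues[pad] = None
            continue
        x, y = measured[pad]
        offset = hypot(x - x_ref, y - y_ref)
        if offset > tolerance_um:
            issues[pad] = offset
    return issues


# Hypothetical reference positions (micrometers) from the 3D assembly model.
BLUEPRINT = {"A1": (0.0, 0.0), "A2": (40.0, 0.0), "A3": (80.0, 0.0)}
# Hypothetical post-fabrication measurements of the same pads.
MEASURED = {"A1": (0.0, 0.0), "A2": (39.5, 0.0), "A3": (77.0, 4.0)}

problems = find_misaligned(BLUEPRINT, MEASURED, tolerance_um=2.0)
print(problems)
```

Here pad A2 sits within tolerance, while A3 has moved far enough that it would be flagged as a potential open or short before assembly proceeds.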