USB 3.2 applications

Many common applications are already outgrowing the capabilities of USB 3.0. For example, USB 3.0 connections work well for mass-storage devices based on hard disk drives, but form a bottleneck for flash-based solid-state drives (SSDs). USB 3.2-based mass-storage devices, connected at 20Gbit/s, offer more than four times the throughput of USB 3.0 and can keep up with the latest SSDs.

Industrial vision systems are facing similar issues, especially since they usually can’t use data compression. In these systems the process of capturing images, processing them and taking appropriate action, such as removing an item from a high-speed conveyor belt, is time sensitive. USB 3.2 enables such time-sensitive systems to support higher resolutions and/or higher frame rates. Automotive systems do not normally support USB 3.1 Gen2 connections due to cable length and proprietary automotive connectors. However, automotive applications can take advantage of USB 3.2 Gen1x2 connections, which at 10Gbit/s offer twice the throughput of USB 3.1 Gen1.

Firmware engineers and software developers can use the increased bandwidth of USB 3.2 to replace dedicated trace and debug ports. USB 3.2 allows the use of an existing Type-C connector, standard USB cables, and PCs/laptops to capture high-bandwidth trace and debug data.

Defining USB 3.2

The USB 3.2 specification replaces the USB 3.1 specification and introduces a new nomenclature. USB 3.2 defines the following connection speeds:

General nomenclature: Gen X x Y – (Speed x Lanes)

Enhanced SuperSpeed Gen 1x1 – (5Gbit/s)

Enhanced SuperSpeed Gen 2x1 – (10Gbit/s)

Enhanced SuperSpeed Gen 1x2 – (5Gbit/s*2 =10Gbit/s)

Enhanced SuperSpeed Gen 2x2 – (10Gbit/s*2 =20Gbit/s)

Both USB 3.2 Gen2x1 and Gen1x2 provide a 10Gbit/s raw data rate. However, due to the more efficient line encoding for Gen2, throughput for Gen2x1 is approximately 1.2 times that of Gen1x2. Both 10Gbit/s connection speeds are needed, as they support different use cases.
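
As a rough sanity check on those figures, the following Python sketch computes effective throughput from the raw rate, the line-encoding efficiency (8b/10b for Gen1, 128b/132b for Gen2), and the lane count:

    # Effective USB 3.x throughput = raw rate x encoding efficiency x lanes.
    # Gen1 uses 8b/10b encoding (80% efficient); Gen2 uses 128b/132b (~97%).
    ENCODING_EFFICIENCY = {1: 8 / 10, 2: 128 / 132}
    RAW_GBPS = {1: 5.0, 2: 10.0}

    def effective_gbps(gen, lanes):
        return RAW_GBPS[gen] * ENCODING_EFFICIENCY[gen] * lanes

    print(effective_gbps(2, 1))                         # Gen2x1: ~9.7
    print(effective_gbps(1, 2))                         # Gen1x2: 8.0
    print(effective_gbps(2, 1) / effective_gbps(1, 2))  # ~1.2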

USB 3.2 implementation

USB 3.2 takes advantage of the four differential SuperSpeed/SuperSpeedPlus pairs present in the USB Type-C connector, unlike USB 3.1 and USB 3.0, which used one or the other TX/RX lane pair, depending on the orientation of the Type-C connector (Figure 1).

The required switching between USB TX or USB RX, DP TX, and Not Used pins (Figure 3) for each lane and each use case is best handled by a digital switch that is integrated in the PHY to preserve signal integrity. In Synopsys USB/DisplayPort PHYs, switching is handled by the Type-C Assist (TCA) function (Figure 2).

USB 3.2 software stacks

Just as the USB 3.1 programming model did not change from USB 3.0, the programming model for USB 3.2 host and device controllers does not change to support x2 connections. USB 3.0, USB 3.1, and USB 3.2 xHCI compliant host controllers all use the same xHCI host software stack. Similarly, Synopsys’ USB device controller uses the same device software stack for USB 3.0, USB 3.1, and USB 3.2.

However, 20Gbit/s throughput can reveal operating system and/or CPU and memory bottlenecks that were absent at 5Gbit/s or 10Gbit/s. Also, device class drivers and/or device functions such as mass storage, networking, and video may need to be optimized to take advantage of the new 20Gbit/s connection speed.

USB 3.2 and USB Type-C cables and connectors

USB Type-C is a small, robust connector suitable for PCs, laptops, tablets, phones etc. Type-C connectors can be plugged in either way up. USB Type-C is becoming the new standard USB connector for most consumer products. The USB Implementers Forum (USB-IF) is emphasizing the transition to Type-C by moving the USB cable and connector chapter to a separate document and renaming the standard-A, standard-B, and mini/micro connectors as legacy USB connectors.

All passive USB Type-C cables can be used for USB 3.2 GenXx2 connections since four SuperSpeed/SuperSpeedPlus differential pairs are mandatory per the USB Type-C specification. A passive cable designed for Gen2 (10Gbit/s) is limited to approximately 1m in length and can support the new 20Gbit/s connection speed. Passive cables 2m or 3m long, designed for Gen1 (5Gbit/s), can support the new 10Gbit/s (5Gbit/s x2) connection speed.

Active cables are used to extend USB Type-C cable lengths beyond 1m for Gen2, and up to 5m for Gen1. Existing active cables may not support four differential pairs, since this is not required for USB 3.0 or USB 3.1. Active cable specifications are being defined by USB-IF (USB) and VESA (DisplayPort) working groups to ensure that future active cables will work seamlessly with USB 3.2 connections, including DisplayPort alternate mode.

Conclusion

USB 3.2 offers increased bandwidth and USB Type-C connector and cable support, if users can overcome the challenges of implementing it. Synopsys is developing USB IP to the latest USB standards for use in high-performance SoCs, using knowledge derived from thousands of successful customer design wins. Learn more in Synopsys' USB 3.2 demonstration, showing a USB 3.2 Host and Device communicating at USB 3.2 speeds over a standard USB Type-C cable.

Author

Morten Christiansen is technical marketing manager for Synopsys’ DesignWare USB and DisplayPort IP. Prior to joining Synopsys, Christiansen was a principal system designer at ST-Ericsson and Ericsson, designing mobile phone and modem chipsets for 19 years. He was also a member of technical staff at ST-Ericsson.

Christiansen has contributed to more than 25 USB and MIPI standards, including USB 3.1, battery charging, Audio 3.0, HSIC, SSIC, gigabit debug for USB, high-speed trace interface as well as communication standards including WMC, EEM, NCM and MBIM, which are used in billions of USB products. In addition to the non-patented USB standards contributions, Christiansen holds six international patents for other USB inventions. Christiansen holds a Master of Science degree (1983) from The Norwegian Institute of Technology (NTNU).

8.8 billion miles to verify

This article has two themes. One looks at delivering the digital twin for automotive design today, and how that challenge will likely have wider implications. The other illustrates how Mentor’s EDA offering is being integrated into the wider Siemens systems-of-systems infrastructure for those digital twins.

Exhaustive verification

Toyota believes that a car will need to undergo the equivalent of 14.2B kilometers (8.8B miles) of test to verify all the safety, performance and functionality requirements for Level 5 autonomy, the point where no human input is needed to drive or control a vehicle.

Such a requirement is beyond any conceivable real-world solution. By the time you had completed a physical test that long, the car would be obsolete. Moreover, the requirement exceeds the capacity of traditional simulation. We know about simulation’s limitations when it comes to individual SoCs; a car takes us beyond even the billion-gate level in terms of complexity.

Over those 8.8B miles, you must be able to guarantee that all the vehicle’s components – hardware and software – will work in harmony, that all the ECUs within the vehicle will talk to one another over their different interfaces, that all the sensors will provide the data that they should – and that the right decisions will be taken as a result.

You can break down the necessary verification task into three data processing stages: Sense, Compute and Actuate (Figure 1).

You can then assign three tools from Siemens to each of those stages: the PreScan sensor and environmental simulation platform, the Mentor Veloce emulation platform, and the AmeSim mechatronic simulation platform.

These all existed before Mentor became a Siemens company. However, the way in which they have been – and continue to be – aligned illustrates how benefits cited when the deal was first announced are being realized. And that alignment answers Toyota’s challenge.

The Digital Twin

The Digital Twin is an important concept. Siemens defines it as “a virtual representation of a physical product or process, used to understand and predict the physical counterpart’s performance characteristics”.

You can apply this to various stages in a design flow. In this specific case, we want to apply the idea to the verification of advanced automotive systems of systems.

This is how each tool contributes toward forming that digital twin.

PreScan allows you to build the virtual environment. It has within it the models needed to create various scenarios. Its sensor models are building blocks that allow you to replicate the data (lane changes, pedestrians, traffic signals) that is coming into the car, or that should be coming in.

Figure 2. PreScan overview (Mentor)

Veloce is the compute horsepower. This is where the pre-prototype RTL model of your SoC sits, running many thousands of times faster than it would in traditional simulation. But acceleration is not the only reason why emulation is vital here.

It provides greater observability than, say, an FPGA prototype. The FPGA may run faster, but the prototyping board will not have the granularity needed to provide the absolute certainty, required under automotive standards, that what your test just exercised will actually work. Emulators also give you stop-start options to dig deeper into single-event interruptions (e.g., EMI, register flips, stuck-ats).

Mentor provides transactors and Veloce apps to recreate fault scenarios (systematic and random) within the emulator and make sure that the verification accounts for all of those automotive protocols (e.g., CAN, LIN, FlexRay). Where PreScan provides external context, Veloce builds internal context on top.

Figure 3. PreScan feeds Veloce (Mentor)

AmeSim helps to virtually implement the decisions made by the RTL within Veloce. When your SoC decides to, say, operate the brakes or turn the steering wheel, AmeSim has the models to simulate the behavior of the relevant physical systems. Once the computation is complete, how does the car react?

AmeSim draws on more than 5,000 models. These cover control systems and mechanical, fluid, and heat processes: just about everything you need to virtualize a mechanical system.

Figure 4. AmeSim libraries (Mentor)

These products have been around for a while. But with them all now within Siemens, they have been brought together in a coherent flow, and their interactions continue to be optimized.

They also sit within the broader automotive tool infrastructure provided by Mentor, all of which is qualified to the critical ISO 26262 functional safety standard. Its elements include Questa (there are areas you may wish to simulate, and you can also take advantage of formal verification to prune your test cases), Calypto (allowing you to leverage high-level synthesis for complex algorithm and SoC definition) and Tessent (bringing full test into the flow).

In addition, the Mentor Safe program helps users ensure compliance with ISO 26262, especially when it comes to the documentation of design processes, particularly those for verification and test.

Going back to PreScan, Veloce and AmeSim, it is worth noting that all three interface to designs through the same Functional Mock-up Interface (FMI) protocol. This allows users to encapsulate virtually anything they want to. Taking the example of an automotive project, it allows components and blocks from different suppliers to be easily integrated within a digital twin, as well as those developed internally.
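
As an illustration of the FMI idea, the open-source FMPy library can load and run any FMI-packaged model. The sketch below is not part of the Siemens flow itself, and the FMU file name and variable names are invented placeholders:

    # Run an FMI-encapsulated component with the open-source FMPy library.
    from fmpy import simulate_fmu

    # 'brake_actuator.fmu' and the variable names are hypothetical.
    result = simulate_fmu("brake_actuator.fmu", stop_time=2.0,
                          start_values={"pedal_force": 50.0})
    print(result["rotor_torque"][-1])  # final value of an assumed output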

Figure 5. The digital twin within automotive verification (Mentor)

Virtual solutions to very real problems

The flow described here already exists: Sense -> Compute -> Actuate on interconnected systems of different abstractions. We expect to make significant enhancements soon in areas such as security. The safety of a vehicle depends on its resistance to being corrupted or hacked. Automobiles are becoming mobile datacenters with multiple access points that need to be secured.

One way of looking at safety analysis for automotive is that OEMs must be satisfied with a vehicle’s operation over a million scenarios, all constructed from different events. To have confidence in the use of the digital twin concept, they therefore need to see that it can be implemented to handle those Sense, Compute and Actuate stages efficiently and accurately.

Then broadening our view to address wider systems-of-systems challenges, we can also reasonably argue that the complexities arising in the automotive market today will soon be seen in others. For example, how safe will drones have to become to make home deliveries? What might be coming in AI-driven surgery? The ability to build digital twins, many of which may be broader than that described here, is already becoming increasingly important to many more markets than the self-driving car.

Reliability verification: It’s all about the baseline

If you are an IC designer or verification engineer, you’re well-accustomed to using design rule checking (DRC), layout versus schematic (LVS), and parasitic extraction (PEX) rule decks provided by the foundry for automated verification. Using these decks just makes sense. They have been validated and qualified by the foundry. They leverage known good solutions for sign-off verification. Creating your own decks would have a tremendous cost in time and resources, with no guarantee the results would match your target foundry’s requirements.

Now there’s a new rule deck in town. The increased focus on reliability (both in performance and product lifetime) has broadened the need for context-aware reliability verification [1]. Foundries have responded by creating reliability rule decks. These deliver a wide range of reliability solutions, often starting with electrostatic discharge (ESD), latch-up (LUP), and interconnect reliability, but now including power management, electrical overstress (EOS), and other potential reliability impacts.

Foundry-qualified reliability rule decks make an excellent starting point for your reliability verification flows, and should be considered your baseline for them. EDA companies are providing verification tools that use these decks to automate and standardize the reliability verification process. And if you still need another reason, consider this: As reliability verification needs expand across the industry, customer demands are driving adherence to new and augmented reliability rules developed and qualified by foundries. They save design companies countless hours and resources that would otherwise be spent creating custom rule decks.

Foundry support for reliability verification

Taiwan Semiconductor Manufacturing Company (TSMC) was one of the first foundries to provide reliability rule decks. Its current ESD/LUP kit provides reliability verification for topology, point-to-point (P2P) resistance, current density (CD), and layout-based LUP rules [2]. As part of their TSMC9000 IP quality program – which has been designed to help customers improve intellectual property (IP) dependability – these rules help establish a consistent baseline for both design companies and IP providers across this ecosystem [3].

TowerJazz was the first commercial foundry to incorporate RESCAR-developed reliability checks into its standard reliability design kit offering, in the form of automotive reliability check templates. These checks enable designers to address the enhanced level of reliability compliance that standards, such as the functional safety standard ISO 26262, require from the entire automotive supply chain. Even though these reliability checks are targeted towards the analog portion of an SoC, they can be used to analyze and enhance the design’s overall reliability [4].

Design companies must be aware that each foundry’s offerings typically have a different reliability focus. ESD protection is a common thread but they diverge in other areas. For example, TSMC focuses on ESD, interconnect reliability, and LUP, at the IP and full-chip levels. The TowerJazz process design kit (PDK) supports checks for power management, ESD, and charge device model (CDM) protection, as well as a suite of analog design constraint checks that incorporate sensitive layout requirements, such as device alignment, symmetry, orientation/parameter matching, and more.

Establishing your reliability verification baseline

Most internal design and CAD support resources leverage the work of the foundries when creating or customizing rule decks for internal use. For DRC, LVS, and PEX rules, this means starting with the foundry-provided decks and applying relatively small additions and modifications as needed so that unique or proprietary verification requirements are satisfied. This is both more efficient in terms of time and resources, and ensures consistency across designs and nodes because the process always starts with the foundry rule deck as the source.

In the same way, companies should implement foundry-supplied reliability rule decks into verification flows with the assurance that their contents and requirements have been thoroughly vetted. From IP to full-chip reliability applications, the value of establishing a baseline for reliability acceptance throughout the design flow has been established [5]. Whether you have not done formal reliability verification before, or already have a customized in-house reliability checking process, foundry reliability rule decks provide the same benefits as DRC, LVS, and PEX decks—uniform, qualified requirements and consistent foundry maintenance across all projects and process nodes.

Many companies adopt foundry-provided reliability design rules incrementally, often by first supplementing any internal methodologies and rule checks already in place. A typical evolutionary path starts small, gaining trust in the process and the results, then transitions more of the design flow to foundry-provided reliability decks.

When design teams evaluate the usefulness of a foundry-led reliability solution, they first have to understand what is being verified by the appropriate rule deck. This is important because companies may be using different foundries for different projects. Understanding which distinct areas of reliability concern a particular foundry’s rule deck addresses should be part of the decision as to which foundry gets the work.

The initial step is as simple as downloading the foundry’s reliability rule deck for your current design process node, and reviewing the contents with your reliability/ESD team. Next, you need to understand how well these offerings align with your internal requirements, flows, and design practices, as well as the capabilities of the tools you are using for context-aware automated reliability verification. For example, the foundry-provided ESD/LUP rules are a great place to start for developing a reliability baseline, but depending on what your foundry provides, you may need to make these types of addition to your full-chip checklist:

Validation that all IPs are correctly implemented

Context/voltage-aware LUP protection verification [6]

Interconnect robustness analysis

Stacked devices analysis in the context of the whole chip

Verification that the correct power ties are used in wells

Reliability checking often requires ‘context awareness’, or the ability to consider both geometrical and electrical information together to determine the correct implementation. If your verification tools cannot perform this combined analysis automatically, you may find yourself spending a lot of time trying to implement these checks through manual annotation and custom code. Adopting tools that provide automated context-aware checks can ensure fast, accurate reliability checking and debugging.
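
As a toy illustration of combining electrical and geometrical context, the sketch below implements a voltage-aware spacing check. The rule values and net data are invented for illustration and do not correspond to any foundry deck:

    # Toy voltage-aware DRC: the required spacing between two nets grows
    # with the voltage difference between them. All numbers are invented.
    def min_spacing_um(delta_v):
        return 0.05 + 0.01 * delta_v  # base rule plus a voltage-dependent adder

    def spacing_ok(net_a, net_b, actual_spacing_um):
        required = min_spacing_um(abs(net_a["v"] - net_b["v"]))
        return actual_spacing_um >= required

    vdd_io = {"name": "VDD_IO", "v": 3.3}
    vdd_core = {"name": "VDD_CORE", "v": 0.8}
    print(spacing_ok(vdd_io, vdd_core, 0.06))  # False: 2.5V delta needs 0.075um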

Also, given today’s M&A environment, project teams in a merging company often continue to develop new versions and incremental updates of chips using the same foundries employed for their original products. Faced with the time and expense of transferring to a new foundry, they adopt the adage of ‘If it isn’t broken, don’t fix it.’ Whatever the reason, when sending different projects to different foundries, you must make sure you have a clear understanding of the reliability checks each provides, and confirm they align with your own internal requirements.

Beyond providing new design opportunities, new process nodes present new learning experiences for designers as they seek to understand the nuances of the flow and the potential reliability concerns around new devices and interconnects. Design starts, particularly those on incoming process nodes, benefit greatly from leveraging foundry-provided rule decks, as many of the lessons learned and acquired knowledge from previous nodes and foundries may not translate across.

Using a foundry reliability rule deck

IP validation—early and often

While full-chip sign-off is necessary for the completion of any design, getting there can be greatly enhanced by the verification of standalone IP blocks and larger blocks during chip assembly. Foundry rule decks usually provide internal switches that allow designers to run either full-chip checks or IP-based and block-specific checks.

IP-based checks allow the verification process to begin while design teams are still implementing and assembling IPs from internal groups and/or 3rd party IP vendors. Just as ensuring the IP being delivered to a design meets the baseline criteria in foundry-provided DRC rule decks is a given, the same should be true for reliability checks. As with DRC, validating reliability at each level as you select and build up the design provides a deterministic path to success as you consider these design elements in the context of the whole chip.

IP re-use, whether the blocks are developed internally or sourced externally, makes up a substantial portion of today’s designs. A significant challenge here lies in determining a block’s suitability and reliability for a new design. While the physical layout used in a previous design may remain unchanged, the reliability context of how that IP block is used in a new design must be validated. Figure 1 shows well-trusted IP placed in multiple power domains in a new design, with unified power format (UPF) power state tables (PST) controlling their activity. While each IP may work well in a standalone context or in their previous use, validation of how they all interact with (and are physically connected within) the new IC design as a whole must be rigorously performed, particularly when validating interactions between multiple power domains.

Figure 1: Trusted IP in a new design with multiple power domains must be validated (Mentor)

IP obtained from multiple sources is likely to contain different design styles and techniques. However, there are times where validation of consistent design styles and best practices can simplify long-term maintenance and reduce the cost of ownership. Identifying IP design differences early in the design process helps eliminate late-breaking issues during IP integration and assembly. For example, one design decision in which consistency is valuable across multiple IP is the choice of which common ESD techniques to use for I/O pin protection. Are all of your I/O IP blocks developed for distributed ESD protection (often common in ball grid array designs), or not?

Validation of existing IP becomes even harder when it also involves a process node or foundry change. Retargeting IP can be especially challenging when applying a process shrink, because special care must be taken with those parts of the design that should not shrink, such as interconnect robustness and device sizing for ESD protection. This is where foundry reliability rule decks are particularly helpful. While shrinking interconnect, transistor dimensions, and spacings across most of the design may be appropriate for the new node, maintaining correct geometrical dimensions where energy must be shunted (as is the case for ESD protection circuitry) is essential, and requires careful validation. Although new nodes may offer opportunities to improve device performance, they may also present new design considerations. For example, when transitioning from planar bulk transistors to FinFET or FD-SOI, designers must educate themselves on the differences in reliability characteristics between the old and new devices and processes.

Full-chip integration

The verification of individual IP blocks is the foundation for verification of your chip assembly, but standalone IP verification lacks the overall context of how the blocks will be incorporated into the larger whole. Comprehensive reliability verification at the full-chip level is an equally important requirement. As shown in Figure 2, overall chip context is important when validating critical reliability applications, such as ESD and electrical overstress (EOS) protection, voltage-aware DRC (VA-DRC), and interconnect robustness (particularly critical for avoiding charge device model (CDM) issues by ensuring low resistance between ESD clamps).

Some reliability checks must be performed both at the IP level and in the context of the full chip. Reliability rule decks used for both IP and full-chip runs often have settings or modes that define the verification level needed to generate the appropriate results. For device-level EOS, long-term reliability issues will arise if the bulk is tied to a higher voltage than the voltage at which the gate switches. This scenario creates gate-oxide stress that will cause failure over time. Such failures are hard to recognize because they are subtle design errors not easily identified in traditional SPICE simulations. To ensure that time-dependent dielectric breakdown (TDDB) does not lead to premature oxide breakdown of interconnect, VA-DRC spacing checks must be performed in a way that considers the voltages on these interconnects [7].

While most designers expect basic ESD checking from their automated tool flows, more complex reliability checks (e.g., interconnect robustness verification with point-to-point (P2P) and current density (CD) analysis) are also critical. CDM checking to protect gates that are directly connected to power/ground is required at advanced nodes because of the shrinking of gate oxide thickness and concerns across power domains. When active clamps are used, there is a need to validate resistance between global powers (of different domains) to avoid CDM issues.

Conclusion

For many IC design companies and IP suppliers, reliability verification is a new area, one with heightened visibility and different demands. Adoption of new process nodes affords a great opportunity for design companies to consider their entire ecosystem, from IP provider to final chip assembly. Foundry-qualified and foundry-maintained reliability rule decks enable design and IP companies alike to establish baseline robustness and reliability criteria without committing extensive time and resources to the creation and support of proprietary verification solutions. However, a thorough understanding of the coverage provided by a foundry’s reliability offering is essential to ensure that the baseline for your internal criteria is covered by each foundry’s rule deck, especially when multiple projects source different foundries. As with DRC and LVS rule decks, companies may need to work with their chosen foundries to expand rule deck coverage as new reliability needs arise.

Reliability verification tools provide a wide range of automated checking capabilities, and ensure consistent and accurate reliability checking based on a foundry rule deck. They are focused on finding and resolving reliability challenges from the block/IP level through full-chip sign-off verification.

Author

Matthew Hogan is a Product Marketing Director for Calibre Design Solutions at Mentor, a Siemens Business. He has more than two decades of design, field and product development experience. Matthew is an active member of the International Integrated Reliability Workshop (IIRW), is on the Board of Directors for the ESD Association (ESDA), contributes to multiple working groups for the ESDA and is a past general chair of the International Electrostatic Discharge Workshop (IEW). Matthew is also a Senior Member of IEEE, and a member of ACM. He holds a B. Eng. from the Royal Melbourne Institute of Technology, and an MBA from Marylhurst University.

Why AI needs security

Artificial intelligence (AI) is creating new waves of innovation and business models, powered by new technology for deep learning and a massive growth in investment. As AI becomes pervasive in computing applications, so too does the need for high-grade security in all levels of the system. Protecting AI systems, their data, and their communication is critical for users’ safety and privacy, and for protecting businesses’ investments.

Where and why AI security is needed

AI applications built around artificial neural networks operate in two basic stages – training and inference (Figure 1). During the training stage, a neural network ‘learns’ to do a job, such as recognizing faces or street signs. The resulting dataset of weights, representing the strength of interaction between the artificial neurons, is used to configure the neural net as a model. In the inference stage, this model is used by the end application to infer information about the data with which it is presented.

The algorithms used in neural net training often process data, such as facial images or fingerprints, which comes from public surveillance, face recognition and fingerprint biometrics, financial or medical applications. This kind of data is usually private and often contains personally identifiable information. Attackers, whether organized crime groups or business competitors, can take advantage of this information to gain economic or other benefits.

AI systems also face the risk of being sent rogue data to disrupt a neural network’s functionality, for example by encouraging the misclassification of facial-recognition images to allow attackers to escape detection. Companies that protect training algorithms and user data will be differentiated in their fields from companies that face the reputational and financial risks of not doing so. Hence, designers must ensure that data is received only from trusted sources and that it is protected during use.

The models themselves, represented by the neural-net weights that emerge during the training process, are expensive to create and form valuable intellectual property that must be protected against disclosure and misuse. The confidentiality of program code associated with the neural-network processing functions is less critical, although access to it could help an attacker reverse-engineer a product. More importantly, the ability to tamper with this code could result in the disclosure of any assets stored as plaintext inside the system’s security boundary.

Another strong driver for enforcing personal data privacy is the General Data Protection Regulation (GDPR) that came into effect within the European Union on 25 May 2018. This legal framework sets guidelines for the collection and processing of personal information. The GDPR sets out the principles for data-management protection and the rights of the individual, and large fines may be imposed on businesses that do not comply with the rules.

As data and models move between the network edge and the cloud, communications also need to be secured and authenticated. It is important to ensure that data and/or models are protected and can only be downloaded and communicated from authorized sources to authorized devices.

AI security solutions

Product security must be incorporated throughout the product lifecycle, from conceptualization to disposal. As new AI applications and use cases emerge, devices that run these applications must be able to adapt to an evolving threat landscape. To meet high-grade protection requirements, security needs to be multi-faceted and deeply embedded in everything from edge devices that use neural-network processing system-on-chips (SoCs), through the applications that run on them, to communications to the cloud and storage within it.

System designers adding security to their AI product must consider a few foundational functions for enabling security in AI products, to protect all phases of operation: offline, during power up, and at runtime, including during communication with other devices or the cloud. Establishing the integrity of the system is essential to creating trust that it is behaving as intended.

Secure bootstrap

Secure bootstrap, an example of a foundational security function, establishes that the software or firmware of the product is intact. This integrity ensures that when a product is coming out of reset, it does what its manufacturer intended, and not something that a hacker has altered it to do. Secure bootstrap systems use cryptographic signatures on the firmware to determine its authenticity. While predominantly firmware, secure bootstrap systems can use hardware features such as cryptographic accelerators and even hardware-based secure bootstrap engines to achieve greater security and faster boot times.

Secure boot schemes can be kept flexible by using public-key signing algorithms enabled by a chain of trust that is traceable to the firmware provider. Public-key signing algorithms also make it possible for a code-signing authority to be replaced by revoking and reissuing the signing keys, if those keys are ever compromised. Security in this case relies on the fact that the root public key is protected by the secure bootstrap system and so cannot be altered. Protecting the public key in hardware ensures that the root of trust identity can be established and cannot be forged.
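
A minimal sketch of that scheme follows, anchoring a chain of trust to a key hash assumed to be held in tamper-proof hardware. The hash value, the choice of Ed25519 via the Python 'cryptography' package, and the function names are all illustrative, not a production design:

    import hashlib
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

    # Placeholder for the root-key hash provisioned in ROM or eFuses.
    PINNED_ROOT_KEY_HASH = b"\x00" * 32

    def secure_boot_check(root_pub, firmware, signature):
        # The supplied root key must match the hash anchored in hardware,
        # so the root of trust cannot be forged or replaced.
        if hashlib.sha256(root_pub).digest() != PINNED_ROOT_KEY_HASH:
            return False
        try:
            # Verify the firmware image signature with the trusted root key.
            Ed25519PublicKey.from_public_bytes(root_pub).verify(signature, firmware)
            return True
        except InvalidSignature:
            return False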

Key management

The best encryption algorithms can be compromised if the keys are not protected with key management, which is another foundational security function. For high-grade protection, the secret key material should reside inside a hardware root of trust. Permissions and policies in the hardware root of trust enforce a requirement that application-layer clients can only manage the keys indirectly, through well-defined application programming interfaces (APIs). Continued protection of the secret keys relies on creating ways to authenticate the importation of keys and to wrap any exported keys with another layer of security. One common key management API for embedded hardware secure modules (HSM) is the PKCS#11 interface, which provides functions for managing policies, permissions, and the use of keys.

Secure updates

A third foundational function relates to secure updates. AI applications will continue to get more sophisticated and so data and models will need to be updated continuously. The process of distributing new models securely needs to be protected with end-to-end security. Hence it is essential that products can be updated in a trusted way to fix bugs, close vulnerabilities, and evolve product functionality. A flexible, secure update function can even be used to allow post-consumer enablement of optional features of hardware or firmware.

Protecting data and coefficients

After addressing foundational security issues, designers must consider how to protect the data and coefficients in their AI systems. Many neural network applications operate on audio, still images, video streams, and other real-time data. There are often serious privacy concerns with these large data sets and so it is essential to protect that data when it is in working memory, or stored locally on disk or flash memory systems. High-bandwidth memory encryption (usually based on AES algorithms) backed by strong key-management solutions is required. Similarly, models can be protected through encryption and authentication, backed by strong key-management systems enabled by hardware root of trust.
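
A minimal sketch of the idea, using AES-GCM authenticated encryption from the Python 'cryptography' package. In a real system the key would come from the hardware root of trust rather than being generated in software, and the payload and label are placeholders:

    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    key = AESGCM.generate_key(bit_length=256)  # stand-in for a root-of-trust key
    nonce = os.urandom(12)                     # must be unique per encryption
    weights = b"...model coefficients..."      # placeholder payload

    # Associated data binds the ciphertext to its context (e.g. model version).
    ciphertext = AESGCM(key).encrypt(nonce, weights, b"model-v1")
    assert AESGCM(key).decrypt(nonce, ciphertext, b"model-v1") == weights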

Securing communications

To ensure that communications between edge devices and the cloud are secured and authentic, designers use protocols that incorporate mutual identification and authentication, such as client-authenticated Transport Layer Protocol (TLS). The TLS session handshake performs identification and authentication, and if successful the result is a mutually agreed shared session key to allow secure, authenticated data communication between systems. A hardware root of trust can ensure the security of the credentials used to complete identification and authentication, as well as the confidentiality and authenticity of the data itself. Communication with the cloud will require high bandwidth in many instances. As AI processing moves to the edge, high-performance security requirements are expected to propagate there as well, including the need for additional authentication, to prevent tampering with the inputs to the neural network, or with any AI training models.
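
A minimal sketch of client-authenticated TLS using Python's standard ssl module; the host name and certificate file paths are placeholders:

    import socket
    import ssl

    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.load_verify_locations("cloud_ca.pem")                 # trust anchor
    ctx.load_cert_chain("device_cert.pem", "device_key.pem")  # device identity

    with socket.create_connection(("ai.example.com", 443)) as sock:
        with ctx.wrap_socket(sock, server_hostname="ai.example.com") as tls:
            tls.sendall(b"sensor frame")  # confidential, mutually authenticated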

Neural network processor SoC example

Building an AI system requires high performance with low-power, area-efficient processors, interfaces, and security. Figure 2 shows a high-level architectural view of a secure neural network processor SoC for AI applications. Neural network processor SoCs can be made more secure when implemented with proven IP, such as Synopsys DesignWare IP.

Embedded vision processor with CNN engine

Some of the most vigorous development in machine learning and AI at the moment focuses on enabling autonomous vehicles. The EV6x Embedded Vision Processors from Synopsys combine scalar, vector DSP and convolutional neural network (CNN) processing units for accurate and fast vision processing in this and other application areas. They are fully programmable and configurable, combining the flexibility of software solutions with the high performance and low power consumption of dedicated hardware. The CNN engine supports common neural-network configurations, including popular networks such as AlexNet, VGG16, GoogLeNet, YOLO, SqueezeNet, and ResNet.

Hardware secure module with root of trust

Synopsys also offers a highly secure tRoot hardware secure module with root of trust for integrating into SoCs. It provides a scalable platform to enable a variety of security functions in a trusted execution environment, working with one or more host processors. Such functions include secure identification and authentication, secure boot, secure updates, secure debug and key management. tRoot also secures AI devices using unique code-protection mechanisms that provide run-time tamper detection and response, and code-privacy protection, without the cost of dedicated secure memory. This feature reduces system complexity and cost by allowing tRoot’s firmware to reside in non-secure memory space. Commonly, tRoot programs reside in shared system DDR memory. The confidentiality and integrity provisions of tRoot’s secure instruction controller make this memory private to tRoot and impervious to attempts to modify it by other subsystems on or off chip.

Security protocol accelerator

The Synopsys DesignWare Security Protocol Accelerators (SPAccs) are highly integrated embedded-security solutions with efficient encryption and authentication capabilities to provide increased performance, ease-of-use, and advanced security features such as quality-of-service, virtualization, and secure command processing. SPAccs can be configured to address the security needs of major protocols such as IPsec, TLS/DTLS, WiFi, MACsec, and LTE/LTE-Advanced.

Conclusion

Providers of AI solutions are investing significantly in R&D, and so the neural network algorithms, and the models derived from training them, need to be properly protected. Concerns about the privacy of personal data, which are already being reflected in the introduction of regulations such as GDPR, also mean that it is increasingly important for companies providing AI solutions to secure them as well as possible.

Synopsys offers a broad range of hardware and software security and neural network processing IP to enable the development of secure, intelligent solutions that will power the applications of the new AI era.

Using threat models and risk assessments to define device security requirements

The proliferation of attacks against embedded systems is making designers realize that they need to do more to secure their products and ecosystems. This article describes the most common threats, offers guidance on which are most important, and suggests security measures to help prevent these attacks. Security implementations need to be designed into the system-on-chip (SoC) through a combination of software, hardware, and physical design.

Types of threat

There are several types of security threats, including remote or local non-invasive; physically invasive; and destructive attacks. Destructive attacks usually try to acquire specific assets or to interrupt access to a service, enabling the attacker to gain notoriety, valuable intelligence or simply money. Figure 1 shows the relationship between the cost and effort of making an attack and the cost to implement countermeasures.

Figure 1 Types of threats (Source: Synopsys)

Remote attacks

Device ecosystems, such as home automation systems, are example targets for remote attacks. Attackers also often use easily obtained scripts to attack vulnerabilities in specific implementations of protocols. Attackers can monitor non-encrypted traffic to gain information about the system, enabling them to escalate the attack to take control of the device. The most common way of getting control of a device is through buffer overflows or poor authentication.

Once an attacker has access to a device, they can steal assets such as user data, cryptographic keys, or intellectual property such as proprietary algorithms. The loss of such assets can be catastrophic for the ecosystem.

A more advanced attack replaces the firmware on the system. This is usually temporary until the next boot cycle, but it can become permanent if the regular firmware update procedure is used and the SoC does not have a secure boot authentication process. Secure boot is the first step in securing a system by checking that it came out of reset in an expected state and that its firmware is intact and has not been tampered with. It should be considered as a foundational security feature in any device.

Figure 2 Threat assessment of non-invasive attacks (Source: Synopsys)

Physical attacks

If attackers have physical access to a device, they can attempt non-invasive passive attacks, or invasive attacks that can range in severity up to the point at which the device is destroyed.

Simple non-invasive attacks include bus monitoring via JTAG or logic analyzer, port scanning, and GPIO manipulation. More complex non-invasive attacks include side-channel attacks through power or timing analysis. Fault injection can also be used to cause denial of service or even unintended behavior in a device, resulting in a change of its state.

Invasive attacks are more costly because they require specialized equipment. These include de-capsulation of a chip, etching away layers of the semiconductor substrate, and electron microscopy to reveal cryptographic keys or design logic.

Threat models

The first step in defining security requirements for an SoC is to understand what assets it is important to protect, how the SoC is going to be used, and how it could be threatened. This can be complex if SoCs have been designed to address multiple markets and be used in different ways.

Designers should do a threat and risk assessment for their SoC, device and ecosystem. There are published methodologies for performing this task, as well as consultancies that can help designers apply the process. Generally, there are four main steps:

Establish the scope of assessment and identify assets

Determine the threat to the assets and assess the impact and probability of occurrence

Assess vulnerabilities based on the implemented protection and calculate the risks

Implement additional protection to reduce the risks to acceptable levels

All assets are classified as having one or more values related to their confidentiality, availability and integrity. This process should continue throughout the usage lifetime of the device, to ensure that the protection mechanisms in place meet current security needs.

Evaluating risk

Designers need to take into account an attacker's motivation, the tools, equipment, skills, time, and money they need to break into a system, and the probability of success in order to judge risk. Generally, if the cost to break into a system is greater than the benefit of doing so, then no one attacks it because there are no incentives.

Sometimes, new attack models or tools are developed or become available, altering this calculus. This means that devices that were once rated as highly tamper resistant can suddenly become easily compromised when a new attack emerges.

Measuring or evaluating the damage impact of an attack from a system perspective can be extremely difficult. The damage can range from device failure, service interruption and monetary loss, to brand damage. The designer of the system must understand these impacts when assessing potential damage.
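
A toy scoring sketch of this calculus, ranking asset/threat pairs by impact times likelihood; the entries and 1-5 scales are invented for illustration:

    # Rank asset/threat pairs by risk = impact x likelihood (1-5 scales).
    risks = [
        ("crypto keys", "side-channel attack", 5, 2),
        ("user data",   "remote exploit",      4, 4),
        ("firmware",    "malicious update",    5, 3),
    ]
    for asset, threat, impact, likelihood in sorted(
            risks, key=lambda r: r[2] * r[3], reverse=True):
        print(f"{asset:12s} {threat:20s} risk={impact * likelihood}")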

Security implementations

To reduce the risks revealed during the threat assessment, the cost/benefit tradeoff for each response must be considered. There are three broad categories of possible response, which are, in order of cost and vulnerability to compromise: software, hardware, and enhanced hardware protection mechanisms.

Software can address many security threats. Using approved cryptographic algorithms and tested protocol implementations can provide basic protections. A software-based secure boot process, started from ROM, creates a trusted environment to execute validated application code. Software development processes, such as implementing secure coding guidelines and fuzz testing, can help prevent the introduction of vulnerabilities.

Hardware implementations of cryptographic algorithms increase performance and security, by creating an isolated hardware environment that protects keys, data and operations from observation. A complete hardware security module, based on a root of trust and implementing a ladder of keys, isolates critical assets from remote attacks. Hardware solutions take more silicon area to implement, but meet higher performance requirements.

Enhanced hardware security features help resist security threats such as differential power analysis and timing attacks. Other enhanced features counter specialized fault injection techniques. These techniques are more difficult to implement, and it is also difficult to validate the efficacy of the countermeasures.

SoC design example

Figure 3 shows an architectural view of an SoC targeted at simple IoT devices. The callouts relate to various threats and attacks, and are listed below along with suggested responses to the threat.

1 - Attack: Execution of tampered or unauthorized code

Response: Implement a secure boot process to validate executable code before it runs. The initial boot code and the key used to perform these checks must not be modifiable.

2 - Attack: Theft of software algorithms or other intellectual property from program memory

Response: Implement a secure boot process to encrypt stored code and only decrypt it into a local secure memory before execution.

3 - Attack: Theft of user data from memory

Response: Encrypt the user data or other assets with cryptographic algorithms implemented in software (or hardware for higher performance).

4 - Attack: Read keys from system bus

Response: Disable JTAG and other debug access ports by blowing fuses or programmatic control. Only re-enable them using a secure process (e.g. certificate based identification and authentication). Destroy all critical security parameters before granting access.

5- Attack: Interception or replay of external communication

Response: Encrypt network traffic with a validated protocol such as TLS (Transport Layer Security) or similar. Encryption prevents attackers from intercepting the data being transferred between endpoints. Generate random session keys with a true random number generator so that packets cannot be replayed on the network.

Conclusion

Security threats are evolving, threatening disrupted services and the loss of critical assets. Security and trust must be designed into an SoC and the ecosystem it serves, by including security features in the architecture of the controlling SoCs. Design engineers must manage the risk of the various attacks by selecting the appropriate levels of protection to build a cost effective and robust product.

Improving gate-level simulation throughput

Post-synthesis gate-level netlist timing verification, as well as DFT validation with ATPG, relies on simulation to ensure that designs meet their timing requirements. Additionally, despite the challenges seen in gate-level simulations, simulators remain unchallenged when it comes to ease of debug.

Because gate-level verification depends so heavily on simulation, any improvements in simulation methodologies and technologies can go a long way toward improving the overall verification efficiency. Let’s examine various methodology improvements that help boost gate-level simulation throughput for single-test modes and regression suites.

Parallel compilation into multiple libraries

Netlist compilation time is often a significant bottleneck that affects turn-around times to validate any changes. A very efficient compilation flow for gate-level tests includes the compilation of each IP/block to its own work library. This can be achieved quite easily by scripting individual compilation commands and executing them in parallel on several grid machines. You must ensure that the compilations do not depend on each other for the maximum gain.
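
A minimal sketch of such a flow, assuming a Questa-style vlib/vlog command line; the block names and file lists are placeholders, and a real deployment would dispatch each job to a grid machine rather than a local thread pool:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    # One independent work library per IP/block; contents are placeholders.
    blocks = {"cpu_ss": ["cpu_netlist.v"], "gpu_ss": ["gpu_netlist.v"]}

    def compile_block(lib, files):
        subprocess.run(["vlib", lib], check=True)                   # create library
        subprocess.run(["vlog", "-work", lib] + files, check=True)  # compile block

    with ThreadPoolExecutor() as pool:
        jobs = [pool.submit(compile_block, lib, f) for lib, f in blocks.items()]
        for job in jobs:
            job.result()  # surface any compile failure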

The implementation of such a parallel flow on a customer’s design helped reduce DUT compilation time from 3.5 hours to 45 minutes, a 4.6X improvement. The design that was targeted for initial deployment of this flow was a full-chip, multimedia SoC with approximately 50 million gates. The flow was easily ported to all of the customer’s future projects and these have had a larger number of both blocks/sub-systems and gates.

Leave the SDF at home

We need to start gate-level simulation earlier in the verification cycle. We do not want to wait until all the IP blocks or even the standard delay format (SDF) file are ready. Zero-delay gate-level simulations (netlist simulations with no SDF or delays) typically account for 90% of all the gate-level simulations run by verification engineers. These run much faster than those with SDF and can be used to verify basic netlist functionality and synthesized design liveness. These simulations are also easier to debug without the complexities of a full-timing gate-level simulation.

Say 'no' to zero-delay/feedback loops

Running zero-delay gate-level simulations can cause false negatives and failures due to zero-delay feedback loops. This is especially true of designs with tight timing. These loops are very common in cells with cascaded combinatorial gates and primitives, and can cross cell-module boundaries and propagate across data paths. All of that then complicates debug.

Figure 1: A zero-delay loop circuit (Mentor)

A common workaround is to add artificial delay values to the outputs of cells, in the assured knowledge that the complete SDF will have the necessary delays in place to prevent the feedback loops during the final gate-level simulation run. The addition of this delay can be achieved using the following methods:

Manually add delays to the cell source.

Automatically add delays to the outputs of UDPs.

Add delays through fake SDFs.

Do not simulate timing checks

A complete SDF with correct timing values typically will not be available until a late stage of verification. In such scenarios, early gate-level timing simulations using an SDF that is not timing-clean can be expensive to run with timing checks enabled, because violations must be waived or suppressed on a per-instance basis at runtime. In this case, it is ideal to compile the netlist with the option to remove all timing checks from the design. This improves gate-level simulation performance by avoiding both the simulation of timing checks and the printing of potential violation messages, which can clog a log file and/or create a performance bottleneck at runtime.

We have observed a typical performance boost of between 5% and 30% and up to 2X on various gate-level simulations by compiling a design without timing checks.

Most tools provide a run-time option to prevent the simulation of timing checks. However, we recommend compiling the design with the option to allow the simulator to apply any additional optimizations after removal of the timing check statements.

Other ways to improve gate-level simulation performance include:

Time precision. If possible, set up simulations to run at a coarser resolution (such as ps) rather than at a finer resolution (such as fs).

Do not log cell internals. If cell internals are logged, the simulation performance can be severely impacted (multiple-X slowdown), and the resulting debug database can end up being large and difficult to work with.

Replace verified gate-level simulation blocks with RTL or stubs. As a full-chip configuration is built through IP and sub-system integration, it is a good idea to replace verified netlist blocks with equivalent RTL blocks or even stubs with appropriate port connectivity. This process can provide a significant performance boost as you verify only what is required.

Switch timing corners during simulation. A very effective way to improve gate-level simulation throughput when simulating multiple timing corners is to switch the SDF at the start of simulation, as sketched below. This saves precious recompilation time in gate-level designs.
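
A sketch of the idea, assuming Questa-style -sdfmax annotation options so one compiled snapshot can be simulated against several corners; the instance path and SDF file names are placeholders:

    import subprocess

    # Re-use one compiled design; only the SDF annotation changes per run.
    for corner, sdf in [("slow", "slow.sdf"), ("fast", "fast.sdf")]:
        subprocess.run(["vsim", "-c", "-sdfmax", f"/tb/dut={sdf}",
                        "-do", "run -all; quit -f", "work.tb"], check=True)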

Validating ATPG and BIST tests

Because verification teams spend a considerable amount of time doing ATPG simulation, this presents another important opportunity to improve gate-level verification performance.

On large-scale designs, the simulation time to run ATPG tests can vary from a few hours to a few weeks. Duration will be based on the design size, scan-chain size, and number of patterns tested. This is true whether users are doing stuck-at-fault simulation or chain-integrity-tests, and also whether they are doing serial or parallel pattern testing. Verification time is enormous for some teams, and they are actively looking to improve or shrink the time they need for running simulation. And that’s before they even get to debugging the issues.

Fortunately, there are a few ways that throughput can be improved in this context. ATPG test regressions can often be categorized into different test configurations, where multiple tests share the same test configurations.

ATPG simulation can also be split into two phases: the test-setup phase and the pattern-simulation phase. Again, in an ATPG test regression, multiple tests share the same test-setup phase and then start the pattern-simulation phase. During the pattern-simulation phase, users often have multiple patterns in a single test that are being simulated serially. Simulating each pattern can take a few hours to a few days, based on design and/or test type.

Following a few steps using the Questa simulator’s checkpoint/restore feature, users can achieve significant improvement in throughput for ATPG simulations. Here is how that process works:

Start by identifying tests that share the same design configuration and/or test-setup phase.

Next, create a set of such tests.

Then create multiple such test-sets. For each test-set, run the simulation until the test-setup phase is done, and then checkpoint (save) the state of the simulation.

Now users can run multiple tests directly by restoring the saved state and resuming simulation, for example with Questa’s vsim -restore. All these tests can be run in parallel to make efficient use of the compute grid.

A single test with multiple patterns can be split to run those patterns in parallel. Based on the number of patterns each test has, our recommendation is to create sets of at least three or four patterns to get the best throughput efficiency.

When it is time to debug a failing pattern, the user would normally have to wait for the simulation to finish to find out which pattern failed, and only then start the debug process. Using the above flow, this wait time has already been reduced. The user begins the debug process by logging the waveforms and rerunning the simulation up to the failing pattern.

The above methodology can also improve the debug process itself. It is possible to enable waveform dumping for only the pattern-simulation phase, which saves a great deal more time and disk space by avoiding unnecessary waveform logging during the setup phase. The user can also start simulating the failing pattern directly, without rerunning the whole simulation from the setup phase, which can take a very long time.

Using the methodology improvements suggested in this article, verification teams can improve their gate-level simulation flows and achieve significant reductions in debug turnaround time.

About the author

Rohit K. Jain is a Principal Engineer at Mentor, a Siemens company, working on QuestaSim technology. He has 18 years of experience in the EDA industry in various engineering development roles. He is experienced in HDL simulation, parallel processing, compiler optimization, and parsing/synthesis engines. He has a Master’s degree from IIT Kanpur, India.

An open-source framework for greater flexibility in machine-learning development

Data scientists and engineers working in machine learning today have a wide variety of tools available to them. Many of them are open source, free and available from technology giants such as Facebook, Google and Microsoft. A key choice in any machine-learning project is which deep-learning framework to use. These frameworks are used to develop, train and deploy neural-network models and include Caffe, Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch and TensorFlow.

Each framework offers different features and approaches to developing neural networks and each user may prefer to work with a particular framework. However, it is important when choosing a framework to consider how the resultant models can be deployed. Most frameworks make it easy to support deployment to the cloud or to a farm of workstations. However, many of the neural networks being developed today are targeted at the network edge, for use on constrained devices that may not support a chosen framework. This is where a recent initiative called the Open Neural Network Exchange (ONNX) comes in.

ONNX was started by Facebook and Microsoft and has since gathered widespread support from other companies including Amazon, Intel, Qualcomm and others. It is an open interchange format for describing neural-network models. This means that a data scientist can develop and train a model in his or her favorite framework and then export it to the ONNX format (Figure 1). If the tools for the deployment target also support ONNX as an input format then that model can be deployed to the target. Thus, the choice of framework in which to develop and train a model becomes much less constrained by the final deployment target. For example, Synopsys recently announced support for ONNX in the MetaWare EV Development Toolkit for use with its EV6x Processor IP, which includes a CNN inference engine. This support gives users targeting that CNN engine much more freedom to choose which framework to start with on their project.

Figure 1 Develop and train a model in one framework and then move via ONNX to a new framework for inference (Source: Synopsys)

There may also be stages in a machine-learning project where one framework has more useful tools than another, e.g. when visualizing or profiling neural-network models. Users could switch to that framework by exporting the model in the ONNX format for that phase of the project and then switch back again as necessary.

ONNX is not the only emerging standard in this area. The Khronos Group has the Neural Network Exchange Format (NNEF). Many of the same companies supporting ONNX also participate in the NNEF standard. ONNX certainly seems to have the momentum at the moment, but both formats offer more flexibility for data scientists and engineers, and that has to be a good thing!

About the author

Allen is a senior product marketing manager for Synopsys’ ARC development tools and ecosystem, and brings to the role many years of experience in hardware and software product development. Prior to Synopsys, Allen held various marketing and business development positions at MIPS Technologies and Mentor Graphics. Allen holds a BSc in electronic and electrical engineering from the University of Surrey.

Flexible embedded vision processing architectures for machine-learning applications

Machine-learning strategies for vision processing are evolving rapidly, as cloud service providers such as Facebook and Google invest heavily in automating tasks such as object recognition, segmentation and classification. Many deep-learning algorithms, based on multiple-layer convolutional neural networks (CNNs), were developed to work on standard CPUs or GPUs housed in data centers. The utility of these algorithms has grown so quickly that they are now finding applications in embedded systems, such as in vision systems for autonomous vehicles.

The challenge for SoC designers is to specify the right processor for their vision-processing task, within the much tighter power, performance and cost constraints of an embedded system. Designers need to integrate the new machine-learning strategies with support for today’s algorithms and graphs, while being confident that the decisions they make today will scale appropriately to service more complex algorithms and the rapidly increasing computational requirements of higher quality video streams.

A heterogeneous architecture for heterogeneous tasks

An embedded vision processor now needs to support the established algorithms of machine vision, as well as emerging CNN-based algorithms. This demands a combination of traditional scalar, vector DSP and CNN engines, integrated on one device with synchronization units, a shared bus and shared memory, so that operands and results can be pipelined from processor to processor as efficiently as possible.

It is also important that an embedded processor architecture is flexible, so that it can adapt to serve today’s rapidly evolving algorithms, and scalable, so that a single architecture and code base can serve as the basis of multiple generations of embedded vision processing tasks.

The EV6x family of embedded vision processing IP offers multiple forms of flexibility and scalability, depending on how it is configured during design, and programmed in use.

For example, the heart of the EV6x architecture is the Vision CPU, which includes a 32bit scalar processor and a 512bit vector DSP. The scalar and vector processors can include configurable or optional floating-point units. The three-way, 512bit-wide vector DSP can be configured for single- or dual-MAC operation: in the dual-MAC configuration, two of the three lanes can perform multiply-accumulates, while in the single-MAC configuration only one of the three 512bit lanes can. The resulting MAC throughput is flexible (a quick arithmetic check follows the list):

32bit MACs (single config) => 16 MAC/cycle

16bit MACs (single config) => 32 MAC/cycle

16bit MACs (dual config) => 64 MAC/cycle

8bit MACs (dual config) => 128 MAC/cycle
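
These per-cycle figures follow directly from the lane width, since each 512bit lane simply packs as many operands as its width allows:

512bit / 32bit = 16 MAC/cycle per lane => 16 MAC/cycle with one active lane

512bit / 16bit = 32 MAC/cycle per lane => 64 MAC/cycle with two lanes

512bit / 8bit = 64 MAC/cycle per lane => 128 MAC/cycle with two lanes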

The vector DSP unit also has its own crossbar and vector memory. For greater throughput, users can specify up to four cores, made up of the scalar and vector processors, in the Vision CPU.

Figure 2 The Vision CPU of the EV6x architecture (Source: Synopsys)

If users specify multiple Vision CPUs, their work can be coordinated by a cache coherency unit, a streaming transfer unit for DMA, shared memory and an ARConnect synchronization block. Multiple EV6x processors can be connected to each other or to a host processor over an AXI bus for even greater scalability.

The scalability of the Vision CPUs gives the flexibility to handle sequential frames of vision data in different ways: with a single CPU, frames have to be sequenced through it one at a time, but with two cores, two frames can be handled in parallel.

Figure 3 The CNN engine of the EV6x processor IP (Source: Synopsys)

The EV6x’s CNN engine handles real-time image classification, object recognition, detection and semantic segmentation. The base configuration of the CNN engine has 880 MACs, but it can be configured with 1760 or 3520 MACs for more computationally demanding tasks. The engine handles 2D convolutions for the convolution and pooling layers of CNN algorithms, and 1D convolution for classification. The CNN engine receives data from an external memory via its built-in DMA, and can communicate with the other Vision CPUs in the system via an optional Clustered Shared Memory or the AXI bus. At 1.28GHz, on a 16nm finFET process under typical conditions, a fully configured CNN engine should deliver 4.5TMAC/s.
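
That headline figure is simply the MAC count multiplied by the clock frequency:

3520 MACs * 1.28GHz = 4506 GMAC/s ≈ 4.5TMAC/s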

While neural networks are often trained on computers using 32bit floating-point numbers, in embedded systems smaller bit resolutions reduce both power and area. For many CNN tasks, such as object detection, 8bit resolution appears to be enough, so long as the hardware supports effective quantization techniques. Some newer use cases, such as super-resolution and denoising (which are signal-processing tasks), benefit from 12bit CNN resolution.

Figure 4 (below) illustrates this point by showing how the same de-noising algorithm performs when using different number representations. It shows the visual difference between the version of the algorithm that used a 32bit floating-point representation and versions that used 12bit or 8bit fixed-point representations. There’s little difference between the floating-point and 12bit versions, but the 8bit version is visually degraded.

Figure 4 The effect of different number representations on a denoising algorithm

The EV6x CNN engine reflects all these insights by using a 12bit MAC, rather than 16bit, to strike a balance between accuracy and die area (a 12bit MAC takes about half the area of a 16bit MAC). There is closely coupled memory for both the convolution and classification engines, and it is designed to store coefficients in compressed form. The CNN engine is configured to support all popular CNN graphs, such as AlexNet, GoogLeNet, ResNet, SqueezeNet, TinyYolo, and SSD. A lot of research is being done on reducing the memory sizes and computation counts of these graphs to further improve area and power. It is also possible to ‘prune’ neural networks, that is, to ignore nodes whose weightings are relatively low, to save unnecessary calculations (see the sketch below). And both feature maps and weighting coefficients can be compressed to reduce the bandwidth needed to load them from memory.

Optimized vision hardware needs development tools that can take advantage of all its features. Development tools for the EV6x processor include support for OpenCV libraries, the OpenVX framework, an OpenCL C compiler, a C/C++ compiler and a CNN mapping tool. The CNN mapping tool is vital to exploring different configurations of the EV6x, since it can automatically partition graphs across multiple engines. This gives designers the opportunity to explore issues such as whether to run multiple graphs at once on one or more CNN engines, the hardware cost of reducing an algorithm’s latency, and the synchronization overhead of each approach.

Trading power and performance in embedded vision

Another key trade-off in embedded vision is computational performance versus power consumption. This is a complex area to explore. Some customers are interested in how much performance they can get within a power budget of less than 1W (even as low as 200mW). Others want to know what performance they can extract from a given die area (perhaps two or four square millimeters). In some application areas, such as automotive, customers want as much performance as possible within a multiple-watt power budget: some are looking for 50TMAC/s from a 10W budget.

Answering these kinds of questions means exploring system issues. For example, you can constrain the power consumption of the processor by reducing its operating frequency, at the cost of lower performance. However, if you’re prepared to dedicate more die area to local memory, this may reduce the system-level power consumed in going off-chip for coefficients and feature maps, and therefore provide scope for pushing the operating frequency back up.

Benchmarking

The large number of computations required to execute traditional vision-processing and/or CNN algorithms makes benchmarking very difficult. What ‘apples to apples’ comparisons can truly be made? The best way to compare benchmarks between processors is to make sure an FPGA development platform is used. If you aren’t running algorithms on actual hardware with real memory-bandwidth constraints, it is harder to trust simulated results.

Support for compressed coefficients and feature maps is another important variable to take into account in benchmarking. If a processor isn’t using compression, then the implication is that its performance could be improved by doing so.

The EV6x is a digital design, so it can readily be implemented on multiple process nodes. However, the process node chosen will affect area and power as well as maximum performance.

The future

Pressure from cloud service providers such as Facebook and Google to get better at finding objects in an image, segmenting them, and classifying them, has driven very rapid improvements in CNN architectures. Hardware flexibility will continue to be as important as performance, power and area as new neural networks are introduced. Synopsys’ DesignWare EV6x Embedded Vision Processors are fully programmable to address new graphs as they are developed, and offer high performance in a small area and with highly efficient power consumption.

How Starblaze combined simulation and emulation to design SSD controller firmware

Starblaze is a Beijing-based fabless start-up. It was established in 2015, and taped out the prototype of its first target design, an SSD controller, within six months. Starblaze went on to tape out its first production chip, the STAR1000, in January 2017. That silicon has already been incorporated in a consumer SSD drive, the T10 Plus, from LITEON, the third-largest SSD manufacturer.

In conversation with Lauro Rizzatti, Bruce Cheng, Starblaze’s chief ASIC architect, described some of the key choices and decisions the company made in developing the STAR1000.

Lauro Rizzatti: Bruce, can you start by describing some of the main challenges involved in the design of an SSD controller, and how you were able to overcome them?

Bruce Cheng: In an SSD controller, the firmware determines the major features of the controller. So, the primary design challenge is to develop firmware and hardware together and as soon as possible. To get the best performance and lowest power consumption, the firmware must be fine-tuned on well-optimized hardware.

Most hardware components, apart from a CPU, system bus and a few peripherals like UART, are designed from scratch and must be carefully optimized according to the firmware usage. Essentially, the SSD firmware is customized and optimized to fit the hardware.

The storage media driven by the SSD controller, whether it is NAND Flash or some other new emerging type of media, really determines the complexity of the controller.

To overcome the challenge, we adopted a software-driven chip-design flow in contrast to a traditional hardware/software design flow where hardware and firmware development are serialized, starting with the hardware and following with the software. In a software-driven design flow, the firmware development starts at the same time as the hardware design begins (Figure 1).

Figure 1. A software-driven IC design flow (Starblaze)

The design flow initiates with a definition of the product specs, and that involves simultaneously both the firmware team and the hardware team. When changes are required –– for example, if additional registers or some specific functions are necessary –– the firmware engineers can ask their hardware colleagues to implement those bits. Any bug or any optimization requirement –– for instance, a late design request or feature change –– can be implemented on-the-spot in the hardware.

This parallel hardware/firmware approach accelerates the development cycle and avoids delays in getting into production typically caused by late firmware. By the time the design is ready for tape out, hardware and firmware have been optimized and are virtually ready for mass production. Very little time is spent in chip bring-up after tape out.

LR: So what verification environment are you using to achieve this?

BC: Our design verification and validation environment requires a high-performance system, as close to the real chip environment as possible, with powerful debug capabilities and easy bring-up.

The simulator model of the design is created in C and C++. It is a register-accurate model that can be developed much faster than the hardware itself.

The emulator is Mentor’s Veloce. When it is deployed in virtual mode, all peripherals are modeled in software, providing full control and visibility of the emulation environment, including the design under test (DUT), peripherals and interfaces (Figure 3).

Figure 3. Veloce deployed in virtual mode (Starblaze/Mentor)

In the virtual mode, PCI traffic can be monitored, while a QEMU virtualizer runs on the host, providing complete control of the software. The content of the DDR memory, NAND Flash and SPI NOR Flash can be read and written, and their types and sizes modified.

We use Mentor’s NAND flash models, which are accurate. In fact, when we got the chip back from the foundry, none of the changes that typically arise from differences between the model and the actual physical NAND were necessary.

The virtual setup also added three unique capabilities not possible with a physical setup.

First we could get remote access 24/7 from anywhere in the world.

Second, the emulator was a shareable resource across a multitude of concurrent users.

Third, because the DUT and the modeled peripherals run at the same clock frequency, there was no need for the speed adapters that normally bridge fast physical peripherals to the slower clock of the emulator; this also enabled realistic performance evaluations.

When we emulate an embedded system-on-chip (SoC) design, we run the firmware on the actual CPUs mapped inside the emulator. The firmware accesses the SoC hardware components by writing and reading the registers mapped on the bus. Conversely, when we simulate the SoC design, we run the firmware on the x86 system compiled via GCC or Visual Studio. In this instance, the firmware accesses the SoC hardware components, written in C/C++ as behavioral models, through register variables that are mapped to hardware addresses in the SoC.

Basically, we compile the firmware for the ARC CPU core in the actual SSD controller, and compile that exact same firmware for an x86 CPU running on the host workstation (Figure 4). The firmware runs in either place without changes. In the behavioral simulation environment running on a PC, we can step through the firmware code just as if it were running in the emulator. The approach allows us to take the entire SoC through a register variable and map it to real hardware or to a behavioral model.

Figure 4. Taking the SSD through a register variable (Starblaze)

For example, consider a real hardware DMA controller (Figure 5). The firmware runs on the ARC processor included in the SoC, and it accesses the direct memory access (DMA) controller by writing and reading a register variable named ‘reg_dmac.’

The address of this variable is mapped to the hardware register address 0x2000300 through the link file. When we write or read ‘reg_dmac’ in C, the operation reaches the DMA controller’s internal registers. This is how it works in real hardware.

Figure 5. In the real SoC, the firmware accesses the DMA by writing/reading register variable named ‘reg_dmac’ (Starblaze).

In simulation, firmware and behavioral models communicate with each other through a shared global variable ‘reg_dmac’ (Figure 6).

Figure 6. In the simulated SoC, the firmware accesses the DMA behavioral model through a shared variable (Starblaze)

There is no code difference in the firmware file. We have two identical firmware files on the left; on the right, we have either the actual hardware (Figure 5) or the behavioral simulation model (Figure 6), which is ideal when some of the hardware blocks are still in development. As the hardware components created by the hardware team become available, we synthesize and map them onto the emulator, and run the simulator on the remaining behavioral models.
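
As a rough illustration of this dual mapping, the same firmware source can resolve ‘reg_dmac’ either to a memory-mapped hardware address or to an ordinary shared global, selected at build time. This is only a sketch: the build flag and macro scheme are our assumptions, with the register name and address taken from the example above.

#include <stdint.h>

#ifdef TARGET_SOC
/* Real SoC or emulated RTL: reg_dmac is the memory-mapped DMA register.
   0x2000300 is the address quoted above; in Starblaze's flow it comes
   from the link file. */
#define reg_dmac (*(volatile uint32_t *)0x2000300u)
#else
/* Behavioral simulation on x86: reg_dmac is a global shared with the
   C/C++ DMA model. */
volatile uint32_t reg_dmac;
#endif

/* Firmware code is identical for both targets. */
void dma_kick(void)
{
    reg_dmac = 1u;             /* set a hypothetical 'start' bit */
    while (reg_dmac & 1u) { }  /* wait until hardware or model clears it */
}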

Not only can we run the same firmware code, but we can run the same stimulus.

Instead of waiting for the whole SoC implemented by the hardware team, we can start verifying it on day one, mixing mature blocks in emulation and blocks that don’t exist yet in simulation.

For example, we have two behavioral models #1 and #2, and the register transfer level (RTL) code of a DMA Controller (Figure 7).

We compile the firmware on the x86 and run it with the two behavioral models in the simulator, and synthesize the DMA controller and map it onto the emulator. We then create a DMA stub, or wrapper, in the simulator that communicates with the DMA controller in the emulator via an inter-process communication (IPC) socket and a hardware verification language (HVL)/RTL adaptor (Figure 8).

Figure 8. Starblaze creates a DMA stub in the simulator through an IPC socket and a HVL/RTL adaptor to communicate to the DMA controller in the emulator (Starblaze)
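
To make the mechanism concrete, here is a minimal sketch of what such a simulator-side stub might look like, assuming a Unix-domain socket and a simple read/write message format. The message layout and function names are illustrative; the article does not describe Starblaze’s actual protocol or the Veloce adaptor interface.

#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Illustrative bus transaction forwarded to the emulator side. */
struct bus_msg {
    uint8_t  is_write;
    uint32_t addr;
    uint32_t data;
};

static int emu_fd = -1;

/* Connect to the HVL/RTL adaptor's IPC socket (path is an assumption). */
int stub_connect(const char *path)
{
    struct sockaddr_un sa;

    emu_fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (emu_fd < 0)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sun_family = AF_UNIX;
    strncpy(sa.sun_path, path, sizeof sa.sun_path - 1);
    return connect(emu_fd, (struct sockaddr *)&sa, sizeof sa);
}

/* The DMA stub calls these instead of touching a local behavioral model. */
void stub_write(uint32_t addr, uint32_t data)
{
    struct bus_msg m = { 1, addr, data };
    (void)write(emu_fd, &m, sizeof m);
}

uint32_t stub_read(uint32_t addr)
{
    struct bus_msg m = { 0, addr, 0 };
    uint32_t val = 0;

    (void)write(emu_fd, &m, sizeof m);
    (void)read(emu_fd, &val, sizeof val);
    return val;
}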

It is important to have a cohesive and homogeneous environment when switching from simulation to emulation. We cannot swap firmware when switching from one to the other; the firmware needs to be essentially the same. We made this a mandate from day one.

LR: Can you describe a bug you were able to pinpoint using this simulation/emulation environment?

BC: We had a boot problem. The NAND flash that contained the bootable code failed after several cycles. When testing a NAND or SSD device, it is important to test the code on a NAND device that has been aged since it has different properties than when it is fresh out of the box. Unfortunately, an aged NAND becomes flaky, and displays non-repeatable behavior. We run it once and get a failure. We run it again and that failure disappears.

Our simulation/emulation environment came to the rescue. We ran the firmware in simulation as fast as possible and once the boot failure occurred, we captured the exact same NAND data and moved it into emulation.

The scenario became completely repeatable. We could rerun that firmware as much as we wanted until we fixed the bug.

LR: Could you talk a little about the debugging capabilities of this environment?

BC: We added a couple of features to improve the firmware debug capability.

Debugging firmware is quite different from debugging RTL code. The firmware engineers can’t tell at which simulation cycle the bug occurs, but can tell in which function or C expression they see the bug.

We designed a firmware interface, based on accesses to certain bus addresses, that gives the firmware full control over the emulation process. The firmware calls application programming interface (API) functions to display the current simulation time; to start, stop or pause the run; or to start dumping waveforms. This allows us to set breakpoints and stop to see the whole picture. We can even change some registers (Figure 9).

Figure 9. The firmware calls API functions to display current SSD controller simulation time, to start, stop or pause the run, and to dump waveforms (Starblaze)
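
For illustration, the kind of emulation-control interface just described could boil down to writing command codes to a magic address that the verification environment decodes. The address, names and command codes below are hypothetical; the article does not give Starblaze’s actual API.

#include <stdint.h>

/* Hypothetical control register decoded by the simulation/emulation
   environment rather than by real hardware. */
#define EMU_CTRL (*(volatile uint32_t *)0x1FFF0000u)

enum emu_cmd {
    EMU_PRINT_TIME = 1,   /* print the current simulation time */
    EMU_PAUSE      = 2,   /* pause the run                     */
    EMU_RESUME     = 3,   /* resume a paused run               */
    EMU_WAVE_ON    = 4,   /* start dumping waveforms           */
    EMU_WAVE_OFF   = 5    /* stop dumping waveforms            */
};

static inline void emu_ctl(enum emu_cmd c) { EMU_CTRL = (uint32_t)c; }

/* Example: capture waveforms only around a suspect firmware region. */
void nand_boot_step(void)
{
    emu_ctl(EMU_WAVE_ON);
    /* ... exercise the suspect boot sequence ... */
    emu_ctl(EMU_WAVE_OFF);
}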

The other enhancement concerns the CPU trace. When the firmware runs into a bug or fires an assertion, it is important to know what sequence of functions the firmware ran before it hit the assertion. We designed a CPU trace module to continuously trace the program counter (PC) and map its value to the corresponding assembly code and current function. The result is a waveform that can simultaneously show the assembly code, function names and hardware signals.

Figures 10 and 11 show a good example of such a waveform dump.

Figure 10. The formal CPU trace (Starblaze)

Figure 11. The waveform of the CPU trace capability can show assembly code, function names and hardware signals (Starblaze)

LR: So how would you summarize your experience with this unified verification environment?

BC: Time to market (TTM) is everything in the storage industry. If we can cut a week off the TTM, we can save millions of dollars. Emulation gave us the ability to start developing hardware and firmware together from the outset. By implementing the innovative features noted above, including the hardware and software debug features inside the waveform, we have been able to find and fix many bugs in hours instead of weeks, if not months.

The unified simulator/emulator environment has opened up many possibilities. The symbiosis of the two allowed us to use the same stimulus patterns for hardware verification, software verification, and platform validation.

This has been a huge part of Starblaze’s success. In fact, Starblaze was able to shave four months off the SSD controller development cycle, easily justifying the purchase of a best-in-class emulation system.

EUV’s arrival demands a new resolution enhancement flow

The transition from optical to extreme ultraviolet (EUV) lithography in high-volume manufacturing is underway. Some issues are still being ironed out, but the resolution enhancement technology (RET) flow is ready. The computational lithography software has been in development for many years and has already been deployed at a number of leading-edge fabs.

EUV presents some unique challenges that today’s RET and optical proximity correction (OPC) tools need to resolve, such as accurately computing the density-dependent component of flare and eliminating the imaging impact of black-border effects. Moreover, the entire process, including the scanner, materials, resist, and process integration, is still evolving. This presents new challenges and opportunities to improve the RET flow.

Consider the question of whether sub-resolution assist features (SRAFs) are needed. Do they improve the process margin with EUV? If so, what is the ideal approach? EUV RET optimization for next-generation designs involves co-optimization and many complicated trade-offs. In a joint effort with GLOBALFOUNDRIES and IMEC [1], Mentor found that with powerful optimization tools, such as inverse lithography techniques, and a careful balancing of requirements, SRAFs are very useful.

Another challenge is the impact of aberrations on EUV lithography. We can adequately simulate and correct EUV scanner aberrations during OPC across the slit to deliver excellent edge-placement control. The problem is that the level of aberration variability from tool to tool is currently significant, and it leads to uncorrectable edge-placement errors if OPC is done using one tool’s aberration data while exposure happens on a different tool. This means that current and near-term anticipated aberration levels on EUV scanners imply very significant edge-control challenges.

There are a substantial number of possible combinations of the aberrations referenced in OPC and the aberrations referenced in verification, across two layers with critical inter-layer edge placements, for a fleet of EUV scanners in manufacturing. However, certain combinations yield better lithographic results than others, and computational lithography can be a powerful tool for assessing which combinations to use in manufacturing. Mentor has demonstrated the clear advantage of using dedicated OPC models with tool-specific aberration correction: without such models, uncorrectable relative edge-placement errors of up to 5nm can occur (Figure 3).

Building a fast and accurate OPC flow for EUV

High-volume tapeout flows that apply retargeting, SRAF insertion, and OPC typically exploit design hierarchy to minimize total flow time. This strategy has enabled leading-edge foundries to meet time-to-market requirements with the Calibre platform. Proper use of design hierarchy can demonstrably reduce tapeout flow runtime by between two and 10 times, depending on design type. Ideally, an EUV tapeout flow should try to use as much of the design hierarchy as possible to constrain runtime. But long-range flare and mask-shadowing effects complicate the use of design hierarchy in OPC.

Mentor has focused on creating a flow that preserves as much of the design hierarchy as possible without compromising accuracy or the process window. Our SRAF solution can, for example, safely use ‘local’ EUV models. That is, for small variations in flare, the SRAF placement does not need to consider flare across the chip; the variations can be approximated. This allows for the use of design hierarchy for SRAF placement. The OPC flow for EUV is designed to maintain hierarchy with no loss in accuracy when the full, global EUV models are used. Using Calibre’s advanced hierarchical flow instead of a fully flat flow accelerates runtime by 2.3X on average with no loss in accuracy.

EUV roadmap

It looks like the semiconductor industry will follow an ‘EUV-to-the-end-of-the-roadmap’ strategy. Starting at the 5nm node, EUV will likely find its way into high-volume production alongside multi-patterning. This will continue at 3nm. Beyond that, there may be help from the scanner hardware side with a planned increase in the numerical aperture (NA) to >= 0.5 (from today’s 0.33). This will provide an increase in resolution, but at a cost. The industry will probably adopt an optical system with anamorphic magnification – 4x in the x-direction, and 8x in the y-direction. This will require reticle layouts to be split in half along the x-axis, and each half ‘stretched’ along the y-axis and placed onto different reticle plates. Although no such full-field exposure tools exist today, Calibre’s modeling and mask synthesis solutions already support anamorphic optics from modeling through OPC and on to mask data-prep and mask process correction.
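
As a rough worked example of why the layout must be split, assume today’s standard 26mm x 33mm exposure field and a standard-size mask (figures that are industry convention, not from this article):

At uniform 4x magnification, a 26mm x 33mm wafer field uses 104mm x 132mm of mask pattern.

At 4x/8x anamorphic magnification, the same 132mm of mask in the y-direction covers only 132mm / 8 = 16.5mm on the wafer, which is half the 33mm field height.

Hence each full field needs two reticles, one per half of the layout, with each half stretched 2x along the y-axis.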

EUV lithography is nearly ready to support high-volume manufacturing at 7nm and beyond. Although it has presented many new challenges for OPC and RET in terms of accuracy and runtime, the tools are now ready. Production solutions for modeling and correction for flare, off-axis illumination, and aberration effects exist and have been integrated into a fast advanced hierarchical platform within Calibre.