Today ARM is announcing their partnership with Xilinx to deliver design solutions for their products on TSMC’s upcoming 7nm process node. ARM has previously partnered with Xilinx on other nodes including 28, 20, and 16nm. Their partnership extends into design considerations to improve the time to market of complex parts and to rapidly synthesize new designs for cutting edge process nodes.

Xilinx is licensing out the latest ARM Artisan Physical IP platform for TSMC’s 7nm. Artisan Physical IP is a set of tools to help rapidly roll out complex designs as compared to what previous generations of products faced. ARM has specialized libraries and tools to help implement these designs on a variety of processes and receive good results even on the shortest possible design times.

Design relies on two basic methodologies. There is custom cell and then standard cell designs. Custom cell design allows for a tremendous amount of flexibility in layout and electrical characteristics, but it requires a lot of man-hours to complete even the simplest logic. Custom cell designs typically draw less power and provide higher clockspeeds than standard cell design. Standard cells are like Legos in that the cells can be quickly laid out to create complex logic. Software called EDA (Electronic Design Automation) can quickly place and route these cells. GPUs lean heavily on standard cells and EDA software to get highly complex products out to market quickly.

These two basic methods have netted good results over the years, but during that time we have seen implementations of standard cells become more custom in how they behave. While not achieving full custom performance, we have seen semi-custom type endeavors achieve appreciable gains without requiring the man hours to achieve fully custom.

In this particular case ARM is achieving a solid performance in power and speed through automated design that improves upon standard cells, but without the downsides of a fully custom part. This provides positive power and speed benefits without the extra power draw of a traditional standard cell. ARM further improves upon this with the ARM Artisan Power Grid Architect (PGA) which simplifies the development of a complex power grid that services a large and complex chip.

We have seen these types of advancements in the GPU world that NVIDIA and AMD enjoy talking about. A better power grid allows the ASIC to perform at lower power envelopes due to less impedence. The GPU guys have also utilized High Density Libraries to pack in the transistors as tight as possible to utilize less space and increase spatial efficiency. A smaller chip, which requires less power is always a positive development over a larger chip of the same capabilities that requires more power. ARM looks to be doing their own version of these technologies and are applying them to TSMC’s upcoming 7nm FinFET process.

TSMC is not releasing this process to mass production until at least 2018. In 1H 2017 we will see some initial test and early production runs for a handful of partners. Full blown production of 7nm will be in 2018. Early runs and production are increasingly being used for companies working with low power devices. We can look back at 20/16/14 nm processes and see that they were initially used by designs that do not require a lot of power and will run at moderate clockspeeds. We have seen a shift in who uses these new processes with the introduction of sub-28nm process nodes. The complexity of the design, process steps, materials, and libraries have pushed the higher performance and power hungry parts to a secondary position as the foundries attempt to get these next generation nodes up to speed. It isn’t until after some many months of these low power parts are pushed through that we see adjustments and improvements in these next generation nodes to handle the higher power and clockspeed needs of products like desktop CPUs and GPUs.

ARM is certainly being much more aggressive in addressing next generation nodes and pushing their cutting edge products on them to allow for far more powerful mobile products that also exhibit improved battery life. This step with 7nm and Xilinx will provide a lot of data to ARM and its partners downstream when the time comes to implement new designs. Artisan will continue to evolve to allow partners to quickly and efficiently introduce new products on new nodes to the market at an accelerated rate as compared to years past.

Someone, who wasn’t Intel, seeded Tom’s Hardware an Intel Core i7-7700k, which is expected for release in the new year. This is the top end of the mainstream SKUs, bringing four cores (eight threads) to 4.2 GHz base, 4.5 GHz boost. Using a motherboard built around the Z170 chipset, they were able to clock the CPU up to 4.8 GHz, which is a little over 4% higher than the Skylake-based Core i7-6700k maximum overclock on the same board.

Before we continue, these results are based on a single sample. (Update: @7:01pm -- Also, the motherboard they used has some known overclock and stability issues. They mentioned it a bit in the post, like why their BCLK is 99.65MHz, but I forgot to highlight it here. Thankfully, Allyn caught it in the first ten minutes.) This sample has retail branding, but Intel would not confirm that it performs like they expect a retail SKU would. Normally, pre-release products are labeled as such, but there’s no way to tell if this one part is some exception. Beyond concerns that it might be slightly different from what consumers will eventually receive, there is also a huge variation in overclocking performance due to binning. With a sample size of one, we cannot tell whether this chip has an abnormally high, or an abnormally low, defect count, which affects both power and maximum frequency.

That aside, if this chip is representative of Kaby Lake performance, users should expect an increase in headroom for clock rates, but it will come at the cost of increased power consumption. In fact, Tom’s Hardware states that the chip “acts like an overclocked i7-6700K”. Based on this, it seems like, unless they want an extra 4 PCIe lanes on Z270, Kaby Lake’s performance might already be achievable for users with a lucky Skylake.

I should note that Tom’s Hardware didn’t benchmark the iGPU. I don’t really see it used for much more than video encoding anyway, but it would be nice to see if Intel improved in that area, seeing as how they incremented the model number. Then again, even users who are concerned about that will probably be better off just adding a second, discrete GPU anyway.

The sheet also states that none of these are supposed to contain integrated graphics, like we see on the current FX line. There is some merit to using integrated GPUs for specific tasks, like processing video while the main GPU is busy or doing a rapid, massively parallel calculation without the latency of memory copies, but AMD is probably right to not waste resources, such as TDP, fighting our current lack of compatible software and viable use cases for these SKUs.

The sheet also contains benchmarks for Cinebench R15. While pre-rendered video is a task that really should be done on GPUs at this point, especially with permissive, strong, open-source projects like Cycles, they do provide a good example of multi-core performance that scales. In this one test, the Summit Ridge 7 CPU ($350) roughly matches the Intel Core i7-6850K ($600), again, according to this one unconfirmed benchmark. It doesn’t list clock rates, but other rumors claim that the top-end chip will be around 3.2 GHz base, 3.5 GHz boost at stock, with manual overclocks exceeding 4 GHz.

These performance figures suggest that Zen will not beat Skylake on single-threaded performance, but it might be close. That might not matter, however. CPUs, these days, are kind-of converging around a certain level of per-thread performance, and are differentiating with core count, price, and features. Unfortunately, there doesn’t seem to have been many leaks regarding enthusiast-level chipsets for Zen, so we don’t know if there will be compelling use cases yet.

A Holiday Project

A couple of years ago, I performed an experiment around the GeForce GTX 750 Ti graphics card to see if we could upgrade basic OEM, off-the-shelf computers to become competent gaming PCs. The key to this potential upgrade was that the GTX 750 Ti offered a great amount of GPU horsepower (at the time) without the need for an external power connector. Lower power requirements on the GPU meant that even the most basic of OEM power supplies should be able to do the job.

That story was a success, both in terms of the result in gaming performance and the positive feedback it received. Today, I am attempting to do that same thing but with a new class of GPU and a new class of PC games.

The goal for today’s experiment remains pretty much the same: can a low-cost, low-power GeForce GTX 1050 Ti graphics card that also does not require any external power connector offer enough gaming horsepower to upgrade current shipping OEM PCs to "gaming PC" status?

Our target PCs for today come from Dell and ASUS. I went into my local Best Buy just before the Thanksgiving holiday and looked for two machines that varied in price and relative performance.

The specifications of these two machines are relatively modern for OEM computers. The Dell Inspiron 3650 uses a modest dual-core Core i3-6100 processor with a fixed clock speed of 3.7 GHz. It has a 1TB standard hard drive and a 240 watt power supply. The ASUS M32CD-B09 PC has a quad-core HyperThreaded processor with a 4.0 GHz maximum Turbo clock, a 1TB hybrid hard drive and a 350 watt power supply. Both of the CPUs share the same Intel brand of integrated graphics, the HD Graphics 520. You’ll see in our testing that not only is this integrated GPU unqualified for modern PC gaming, but it also performs quite differently based on the CPU it is paired with.

Introduction

In August at the company’s annual developer forum, Intel officially took the lid off its 7th generation of Core processor series, codenamed Kaby Lake. The build up to this release has been an interesting one as we saw the retirement of the “tick tock” cadence of processor releases and instead are moving into a market where Intel can spend more development time on a single architecture design to refine and tweak it as the engineers see fit. With that knowledge in tow, I believed, as I think many still do today, that Kaby Lake would be something along the lines of a simple rebrand of current shipping product. After all, since we know of no major architectural changes from Skylake other than improvements in the video and media side of the GPU, what is left for us to look forward to?

As it turns out, the advantages of the 7th Generation Core processor family and Kaby Lake are more substantial than I expected. I was able to get a hold of two different notebooks from the HP Spectre lineup, as near to identical as I could manage, with the primary difference being the move from the 6th Generation Skylake design to the 7th Generation Kaby Lake. After running both machines through a gamut of tests ranging from productivity to content creation and of course battery life, I can say with authority that Intel’s 7th Gen product deserves more accolades than it is getting.

Architectural Refresher

Before we get into the systems and to our results, I think it’s worth taking some time to quickly go over some of what we know about Kaby Lake from the processor perspective. Most of this content was published back in August just after the Intel Developer Forum, so if you are sure you are caught up, you can jump right along to a pictorial look at the two notebooks being tested today.

At its core, the microarchitecture of Kaby Lake is identical to that of Skylake. Instructions per clock (IPC) remain the same with the exception of dedicated hardware changes in the media engine, so you should not expect any performance differences with Kaby Lake except with improved clock speeds.

Also worth noting is that Intel is still building Kaby Lake on 14nm process technology, the same used on Skylake. The term “same” will be debated as well as Intel claims that improvements made in the process technology over the last 24 months have allowed them to expand clock speeds and improve on efficiency.

Dubbing this new revision of the process as “14nm+”, Intel tells me that they have improved the fin profile for the 3D transistors as well as channel strain while more tightly integrating the design process with manufacturing. The result is a 12% increase in process performance; that is a sizeable gain in a fairly tight time frame even for Intel.

That process improvement directly results in higher clock speeds for Kaby Lake when compared to Skylake when running at the same target TDPs. In general, we are looking at 300-400 MHz higher peak clock speeds in Turbo Boost situations when compared to similar TDP products in the 6th generation. Sustained clocks will very likely remain voltage / thermally limited but the ability spike up to higher clocks for even short bursts can improve performance and responsiveness of Kaby Lake when compared to Skylake.

Along with higher fixed clock speeds for Kaby Lake processors, tweaks to Speed Shift will allow these processors to get to peak clock speeds more quickly than previous designs. I extensively tested Speed Shift when the feature was first enabled in Windows 10 and found that the improvement in user experience was striking. Though the move from Skylake to Kaby Lake won’t be as big of a change, Intel was able to improve the behavior.

The graphics architecture and EU (execution unit) layout remains the same from Skylake, but Intel was able to integrate a new video decode unit to improve power efficiency. That new engine can work in parallel with the EUs to improve performance throughput as well, but obviously at the expensive of some power efficiency.

Specific additions to the codec lineup include decode support for 10-bit HEVC and 8/10-bit VP9 as well as encode support for 10-bit HEVC and 9-bit VP9. The video engine adds HDR support with tone mapping though it does require EU utilization. Wide Color Gamut (Rec. 2020) is prepped and ready to go according to Intel for when that standard starts rolling out to displays.

Performance levels for these new HEVC encode/decode blocks is set to allow for 4K 120mbps real-time on both the Y-series (4.5 watt) and U-series (15 watt) processors.

It’s obvious that the changes to Kaby Lake from Skylake are subtle and even I found myself overlooking the benefits that it might offer. While the capabilities it has will be tested on the desktop side at a later date in 2017, for thin and light notebooks, convertibles and even some tablets, the 7th Generation Core processors do in fact take advantage of the process improvements and higher clock speeds to offer an improved user experience.

Though we are still months away from shipping devices, Qualcomm has announced that it will be building its upcoming flagship Snapdragon 835 mobile SoC on Samsung’s 10nm 2nd generation FinFET process technology. Qualcomm tells us that integrating the 10nm node in 2017 will keep it “the technology leader in mobile platforms” and this makes the 835 the world's first 10nm production processor.

“Using the new 10nm process node is expected to allow our premium tier Snapdragon 835 processor to deliver greater power efficiency and increase performance while also allowing us to add a number of new capabilities that can improve the user experience of tomorrow’s mobile devices.”

Samsung announced its 10nm FinFET process technology in October of this year and it sports some impressive specifications and benefits to the Snapdragon 835 platform. Per Samsung, it offers “up to a 30% increase in area efficiency with 27% higher performance or up to 40% lower power consumption.” For Qualcomm and its partners, that means a smaller silicon footprint for innovative device designs, including thinner chassis or larger batteries (yes, please).

Other details on the Snapdragon 835 are still pending a future reveal, but Qualcomm says that 835 is in production now and will be shipping in commercial devices in the first half of 2017. We did hear that the new 10nm chip is built on "more than 3 billion transistors" - making it an incredibly complex design!

I am very curious to see how the market reacts to the release of the Snapdragon 835. We are still seeing new devices being released using the 820/821 SoCs, including Google’s own flagship Pixel phones this fall. Qualcomm wants to maintain leadership in the SoC market by innovating on both silicon and software but consumers are becoming more savvy to the actual usable benefits that new devices offer. Qualcomm promises features, performance and power benefits on SD 835 to make the case for your next upgrade.

It's been a hell of a 24 hours for NVIDIA and the Tegra processor. A platform that many considered dead in the water after the failure of it to find its way into smartphones or into an appreciable amount of consumer tablets, had two major design wins revealed. First, it was revealed that NVIDIA is powered the new fully autonomous driving system in the Autopilot 2.0 hardware implementation in Tesla's current Model S, X and upcoming Model 3 cars.

Now, we know that Nintendo's long rumored portable and dockable gaming system called Switch is also powered by a custom NVIDIA Tegra SoC.

We don't know much about the hardware that gives the Switch life, but NVIDIA did post a short blog with some basic information worth looking at. Based on it, we know that the Tegra processor powering this Nintendo system is completely custom and likely uses Pascal architecture GPU CUDA cores; though we don't know how many and how powerful it will be. It will likely exceed the performance of the Nintendo Wii U, which was only 0.35 TFLOPS and consisting of 320 AMD-based stream processors. How much faster we just don't know yet.

On the CPU side we assume that this is built using an ARM-based processor, most likely off-the-shelf core designs to keep things simple. Basing it on custom designs like Denver might not be necessary for this type of platform.

Nintendo has traditionally used custom operating systems for its consoles and that seems to be what is happening with the Switch as well. NVIDIA mentions a couple of times how much work the technology vendor put into custom APIs, custom physic engines, new libraries, etc.

The Nintendo Switch’s gaming experience is also supported by fully custom software, including a revamped physics engine, new libraries, advanced game tools and libraries. NVIDIA additionally created new gaming APIs to fully harness this performance. The newest API, NVN, was built specifically to bring lightweight, fast gaming to the masses.

We’ve optimized the full suite of hardware and software for gaming and mobile use cases. This includes custom operating system integration with the GPU to increase both performance and efficiency.

The system itself looks pretty damn interesting, with the ability to switch (get it?) between a docked to your TV configuration to a mobile one with attached or wireless controllers. Check out the video below for a preview.

I've asked both NVIDIA and Nintendo for more information on the hardware side but these guys tend to be tight lipped on the custom silicon going into console hardware. Hopefully one or the other is excited to tell us about the technology so we can some interesting specifications to discuss and debate!

UPDATE: A story on The Verge claims that Nintendo "took the chip from the Shield" and put it in the Switch. This is more than likely completely false; the Shield is a significantly dated product and that kind of statement could undersell the power and capability of the Switch and NVIDIA's custom SoC quite dramatically.

Qualcomm has announced new 400 and 600-series Snapdragon parts, and these new SoCs (Snapdragon 653, 626, and 427) inherit technology found previously on the 800-series parts, including fast LTE connectivity and dual-camera support.

The integrated LTE modem has been significantly for each of these SoCs, and Qualcomm lists these features for each of the new products:

In addition to the new X9 modem, all three SoCs offer faster CPU and GPU performance, with the Snapdragon 653 (which replaces the 652) now supporting up to 8GB of memory - up from a max of 4GB previously. Each of the new SoCs also feature Qualcomm's Quick Charge 3.0 for fast charging.

Availability of the new 600-series Snapdragon processors is set for the end of this year, so we could start seeing handsets with the faster parts soon; while the Snapdragon 427 is expected to ship in devices early in 2017.

Intel and recently acquired Altera have launched a new FPGA product based on Intel’s 14nm Tri-Gate process featuring an ARM CPU, 5.5 million logic element FPGA, and HBM2 memory in a single package. The Stratix 10 is aimed at data center, networking, and radar/imaging customers.

The Stratix 10 is an Altera-designed FPGA (field programmable gate array) with 5.5 million logic elements and a new HyperFlex architecture that optimizes registers, pipeline, and critical pathing (feed-forward designs) to increase core performance and increase the logic density by five times that of previous products. Further, the upcoming FPGA SoC reportedly can run at twice the core performance of Stratix V or use up to 70% less power than its predecessor at the same performance level.

The increases in logic density, clockspeed, and power efficiency are a combination of the improved architecture and Intel’s 14nm FinFET (Tri-Gate) manufacturing process.

Intel rates the FPGA at 10 TFLOPS of single precision floating point DSP performance and 80 GFLOPS/watt.

Interestingly, Intel is using an ARM processor to feed data to the FPGA chip rather than its own Quark or Atom processors. Specifically, the Stratix 10 uses an ARM CPU with four Cortex A53 cores as well as four stacks of on package HBM2 memory with 1TB/s of bandwidth to feed data to the FPGA. There is also a “secure device manager” to ensure data integrity and security.

The Stratix 10 is aimed at data centers and will be used with in specialized tasks that demand high throughput and low latency. According to Intel, the processor is a good candidate for co-processors to offload and accelerate encryption/decryption, compression/de-compression, or Hadoop tasks. It can also be used to power specialized storage controllers and networking equipment.

Intel has started sampling the new chip to potential customers.

In general, FPGAs are great at highly parallelized workloads and are able to efficiently take huge amounts of inputs and process the data in parallel through custom programmed logic gates. An FPGA is essentially a program in hardware that can be rewired in the field (though depending on the chip it is not necessarily a “fast” process and it can take hours or longer to switch things up heh). These processors are used in medical and imaging devices, high frequency trading hardware, networking equipment, signal intelligence (cell towers, radar, guidance, ect), bitcoin mining (though ASICs stole the show a few years ago), and even password cracking. They can be almost anything you want which gives them an advantage over traditional CPUs and graphics cards though cost and increased coding complexity are prohibitive.

The Stratix 10 stood out as interesting to me because of its claimed 10 TFLOPS of single precision performance which is reportedly the important metric when it comes to training neural networks. In fact, Microsoft recently began deploying FPGAs across its Azure cloud computing platform and plans to build the “world’s fastest AI supercomputer. The Redmond-based company’s Project Catapult saw the company deploy Stratix V FPGAs to nearly all of its Azure datacenters and is using the programmable silicon as part of an “acceleration fabric” in its “configurable cloud” architecture that will be used initially to accelerate the company’s Bing search and AI research efforts and later by independent customers for their own applications.

It is interesting to see Microsoft going with FPGAs especially as efforts to use GPUs for GPGPU and neural network training and inferencing duties have increased so dramatically over the years (with NVIDIA being the one pushing the latter). It may well be a good call on Microsoft’s part as it could enable better performance and researchers would be able to code their AI accelerator platforms down to the gate level to really optimize things. Using higher level languages and cheaper hardware with GPUs does have a lower barrier to entry though. I suppose ti will depend on just how much Microsoft is going to charge customers to use the FPGA-powered instances.

FPGAs are in kind of a weird middle ground and while they are definitely not a new technology, they do continue to get more complex and powerful!

Earlier this week at its first GTC Europe event in Amsterdam, NVIDIA CEO Jen-Hsun Huang teased a new SoC code-named Xavier that will be used in self-driving cars and feature the company's newest custom ARM CPU cores and Volta GPU. The new chip will begin sampling at the end of 2017 with product releases using the future Tegra (if they keep that name) processor as soon as 2018.

NVIDIA's Xavier is promised to be the successor to the company's Drive PX 2 system which uses two Tegra X2 SoCs and two discrete Pascal MXM GPUs on a single water cooled platform. These claims are even more impressive when considering that NVIDIA is not only promising to replace the four processors but it will reportedly do that at 20W – less than a tenth of the TDP!

The company has not revealed all the nitty-gritty details, but they did tease out a few bits of information. The new processor will feature 7 billion transistors and will be based on a refined 16nm FinFET process while consuming a mere 20W. It can process two 8k HDR video streams and can hit 20 TOPS (NVIDIA's own rating for deep learning int(8) operations).

Specifically, NVIDIA claims that the Xavier SoC will use eight custom ARMv8 (64-bit) CPU cores (it is unclear whether these cores will be a refined Denver architecture or something else) and a GPU based on its upcoming Volta architecture with 512 CUDA cores. Also, in an interesting twist, NVIDIA is including a "Computer Vision Accelerator" on the SoC as well though the company did not go into many details. This bit of silicon may explain how the ~300mm2 die with 7 billion transistors is able to match the 7.2 billion transistor Pascal-based Telsa P4 (2560 CUDA cores) graphics card at deep learning (tera-operations per second) tasks. Of course in addition to the incremental improvements by moving to Volta and a new ARMv8 CPU architectures on a refined 16nm FF+ process.

Drive PX

Drive PX 2

NVIDIA Xavier

Tesla P4

CPU

2 x Tegra X1 (8 x A57 total)

2 x Tegra X2 (8 x A57 + 4 x Denver total)

1 x Xavier SoC (8 x Custom ARM + 1 x CVA)

N/A

GPU

2 x Tegra X1 (Maxwell) (512 CUDA cores total

2 x Tegra X2 GPUs + 2 x Pascal GPUs

1 x Xavier SoC GPU (Volta) (512 CUDA Cores)

2560 CUDA Cores (Pascal)

TFLOPS

2.3 TFLOPS

8 TFLOPS

?

5.5 TFLOPS

DL TOPS

?

24 TOPS

20 TOPS

22 TOPS

TDP

~30W (2 x 15W)

250W

20W

up to 75W

Process Tech

20nm

16nm FinFET

16nm FinFET+

16nm FinFET

Transistors

?

?

7 billion

7.2 billion

For comparison, the currently available Tesla P4 based on its Pascal architecture has a TDP of up to 75W and is rated at 22 TOPs. This would suggest that Volta is a much more efficient architecture (at least for deep learning and half precision)! I am not sure how NVIDIA is able to match its GP104 with only 512 Volta CUDA cores though their definition of a "core" could have changed and/or the CVA processor may be responsible for closing that gap. Unfortunately, NVIDIA did not disclose what it rates the Xavier at in TFLOPS so it is difficult to compare and it may not match GP104 at higher precision workloads. It could be wholly optimized for int(8) operations rather than floating point performance. Beyond that I will let Scott dive into those particulars once we have more information!

Xavier is more of a teaser than anything and the chip could very well change dramatically and/or not hit the claimed performance targets. Still, it sounds promising and it is always nice to speculate over road maps. It is an intriguing chip and I am ready for more details, especially on the Volta GPU and just what exactly that Computer Vision Accelerator is (and will it be easy to program for?). I am a big fan of the "self-driving car" and I hope that it succeeds. It certainly looks to continue as Tesla, VW, BMW, and other automakers continue to push the envelope of what is possible and plan future cars that will include smart driving assists and even cars that can drive themselves. The more local computing power we can throw at automobiles the better and while massive datacenters can be used to train the neural networks, local hardware to run and make decisions are necessary (you don't want internet latency contributing to the decision of whether to brake or not!).

I hope that NVIDIA's self-proclaimed "AI Supercomputer" turns out to be at least close to the performance they claim! Stay tuned for more information as it gets closer to launch (hopefully more details will emerge at GTC 2017 in the US).

What are your thoughts on Xavier and the whole self-driving car future?