From smartphone to server room: Nvidia’s “Kayla” shows the future of Tegra

An 8×5.5" motherboard with next year's Tegra features will be available in May.

The "Kayla" motherboard is a way for software developers to begin porting their CUDA apps to the ARM architecture before it goes mainstream in 2014.

Andrew Cunningham

SAN JOSE, CA—Tegra 4 phones and tablets aren't quite here yet, but Nvidia is already giving us details about its successor, a chip codenamed "Logan." Due in early 2014, the new processor's primary innovation is that it brings the graphics architecture in Nvidia's mobile processors more or less up to date with the architecture in its GeForce and Quadro graphics products.

This updated GPU will obviously bring performance increases relative to present-day chips, but more important will be the level of API support that it makes possible in mobile devices. Anything supported by current Kepler-based graphics cards—including OpenGL 4.3, Direct3D 11, CUDA, OpenCL, and PhysX—will be possible in phones and tablets, albeit with less raw horsepower behind it than a high-end PC graphics card can provide.

To give developers a chance to start playing with these APIs, many of which aren't yet mainstream in ARM devices, Nvidia (with an Italian company named Seco) is showing off a new small form-factor motherboard called "Kayla" that's meant to provide roughly the same software features as Logan about a year before Logan is actually scheduled to come to market. We got a chance to take a look at the board here at Nvidia's GPU Technology Conference this week.

Tech specs

Kayla includes a fairly extensive list of ports and interfaces. The heatsink and fan on the left cool the Kepler GPU, while the passive heatsink on the right cools the Tegra 3 CPU.

Kayla combines three different components: a Seco motherboard, which offers three HDMI ports, one 100 megabit Ethernet port, one gigabit Ethernet port, two USB 2.0 ports, a headphone jack, a micro-USB port, a microSD card reader, and a SATA connector; a Seco daughterboard with a 1.3GHz Tegra 3 processor and 2GB of RAM; and an MXM daughterboard from Nvidia based on the Kepler architecture, which includes a GPU and 1GB of graphics RAM. Technically speaking, any card based on the MXM standard could be used with this board, though the MXM standard is rarely used outside of expensive, bulky gaming laptops, and cards can be difficult to come by.

The GPU itself, which will be available in May, is a Kepler-based GPU with 384 of Nvidia's CUDA cores. This exact GPU isn't quite the same as any of the company's currently shipping products, though at least in core count it's similar to some low-end and mid-range cards in the GeForce 600 series, most notably the GeForce GT 640 and GTX 650. Obviously, performance will vary based on clock speeds, memory speed and interface, and the number of PCI Express lanes available (the Seco daughterboard appears to offer four lanes, while most full-fledged PCs offer sixteen). The power requirements of an ARM chip are going to keep Logan from running as quickly as today's low-end desktop graphics cards, but the raw processing power is there.
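As a rough illustration of how core count and clock speed trade off, peak single-precision throughput scales linearly with both. The core count below comes from the article; the clock speed is an assumed value for illustration, not a published spec:

```python
# Back-of-the-envelope peak throughput for a Kepler-class GPU,
# counting 2 FLOPs (one fused multiply-add) per CUDA core per clock.
def peak_gflops(cuda_cores: int, clock_mhz: float) -> float:
    return cuda_cores * 2 * clock_mhz / 1000.0

# Kayla's GPU has 384 CUDA cores; the 900MHz clock here is a
# hypothetical figure chosen only to show the arithmetic.
print(peak_gflops(384, 900))  # 691.2 GFLOPS single-precision
```

Scaling the clock down to fit a mobile power budget reduces that figure proportionally, which is the point of the paragraph above: the architecture (and its API support) carries over even when the raw numbers shrink.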

On the graphics side, at least, this board will give developers a good idea of what the next generation of Tegra chips will be like with respect to supported APIs and performance.

"With CUDA 5 and OpenGL 4.3, this is a preview of Logan," Nvidia General Manager of GPU Computing Ian Buck told Ars. "It provides the same programming model, features, and approximate performance of Logan."

In other words, for developers who want to bring their CUDA- or PhysX-enabled workstation apps or games over to the ARM architecture, the Kayla board is the testing environment for you. Both in the exhibition hall and during CEO Jen-Hsun Huang's opening-day keynote, the board was shown running an ARM build of Ubuntu 12.04 as well as some Nvidia graphics demos—ray-tracing, smoke, and water simulations all appeared to be running with a reasonable degree of smoothness (as shown in Nvidia's official video of the demos).

Nvidia's CUDA technology hasn't been available to developers on ARM before; this is the first step in that direction, and Nvidia will continue to improve support with future hardware and software. "Our next version of CUDA will support ARM first-class," Buck said.

Of course, while the Kayla board approximates the features and performance of Logan on the GPU side, the CPU side is pretty far removed from what that product will be capable of. The Tegra 3 chip runs four 1.3GHz Cortex-A9 CPU cores, a far cry from even the 1.9GHz Cortex-A15 CPU cores that Nvidia has been showing off in its reference tablets for Tegra 4. We don't yet know much about Logan's CPU, but it stands to reason that it will be even faster than Tegra 4, putting quite a bit of distance between it and the Tegra 3 chip in Kayla. Nvidia told us that the decision to use Tegra 3 for this board rather than Tegra 4 came down to interface support—Tegra 3 supports the SATA interface and a PCI Express connection to an external GPU, and Tegra 4 does not.

Support for SATA is one reason why Kayla uses a Tegra 3 CPU rather than the newer, faster Tegra 4.

There's also no way that Kayla will fit in anything approaching the size of a modern tablet—it provides many of Logan's features, but it obviously lacks Logan's integration. “What’s amazing is that Logan will be the size of a dime, whereas Kayla is now the size of a tablet PC,” Jen-Hsun Huang said during Tuesday's keynote.

What’s next?

Kayla-esque boards paired with Tegra 3 CPUs and Nvidia GPUs are already being used to power high-density servers.

"A lot of people are interested in ARM for high-performance computing and next-generation supercomputers," Buck told Ars, "but they don't really have a development platform to port their code or experiment with their code on ARM. One of the reasons why we're doing this and bringing all the CUDA stuff to there is to help that software ecosystem get jumpstarted on ARM."

Some of Nvidia's competitors are also looking into this space, in which the (relatively) few powerful x86 CPUs in today's servers and supercomputers would theoretically be replaced by many slower but more power-efficient ARM processors. AMD has announced its intention to enter this market in 2014 with its own ARM chips, and Intel is fighting back by using its Atom processors to do the same thing without the switch in processor architectures.

While Buck told us that Nvidia intends to keep Tegra focused on the consumer market—that is, smartphones and tablets and related devices like the Shield gaming handheld—he didn't rule out the use of Nvidia's ARM chips in future servers. In particular, the follow-up to Logan (codenamed Parker) will support 64-bit ARM instructions, making it better suited for memory-intensive server applications than current Tegra chips.

There's nothing stopping Nvidia's partners from using Kayla (and other, similar boards) in servers like this now, though. A company called E4 already uses Kayla-esque motherboards, Tegra 3 CPUs, and Quadro GPUs in its ARKA Microcluster, which Seco was showing off in the exhibition hall. The Microcluster crams 24 CPUs and 24 GPUs into a server that consumes about 1,200 watts of power, a relative pittance if your workload needs to execute many smaller tasks rather than a few large ones. Products like these are something we'll only see more of as ARM chips (and their accompanying GPUs) become more capable.
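The density figure is easy to sanity-check: dividing the quoted power draw across the quoted node count gives each CPU+GPU pair a power budget well below that of a conventional x86 server node.

```python
# Per-node power share in the ARKA Microcluster described above:
# 24 Tegra 3 + GPU nodes drawing about 1,200 watts in total.
NODES = 24
TOTAL_WATTS = 1200

watts_per_node = TOTAL_WATTS / NODES
print(watts_per_node)  # 50.0 watts per CPU+GPU pair
```

Fifty watts per node is in the neighborhood of a single low-power x86 CPU alone, which is the appeal for workloads that parallelize into many small tasks.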

This right here is something I could have made use of a few weeks ago. I have several websites I host, and I previously had some relatively high-power, full-blown computers running backup and file-serving services for them. In order to drop that and save a fair chunk of change, I managed to replace those full-blown computers with Raspberry Pi units. Problem is, the Pi does not support SATA, so the drives are attached via USB.

It does work, and fairly well at that; both units combined (with drives) draw a maximum of 20 watts at the wall versus 275 watts for the other two computers. For something strictly for backup purposes and some basic file serving, it works just fine and gets the job done, but something with a bit more oomph that could also handle a straight SATA drive would be preferable.

Wonder how much one of those will go for. Looks like it would make for an interesting HTPC as well.

If 'a Pi with SATA' is the goal, something like the Cubieboard might be better than this for you.

...I look at this thing and think 'make that GPU sink bigger and passive and price it right and that's a pretty nifty ARM desktop.'

I can see how cloud-based services would want more of this, basically replacing VMs with actual compute cores, but it ONLY works when you're building your back-end on top of a service, not if you're trying to use these as part of a traditional stack. The other side of this is dual and quad CPUs on a single server being able to saturate a 10Gb connection (no need for external load balancing or complex worker threads based on which microserver has the data needed).

At least for the moment, your best bet might not be any of the bare-board products at all. Practically every cheap NAS on the market is built around a little ARM board with SATA support, sometimes even two or four bays' worth (Marvell's SoCs seem to be a pretty big presence in the area). These are more expensive than something like a Cubieboard, but you get the case, drive bays, PSU, etc. thrown in, unlike the dev boards, which are typically not standard motherboard sizes and can turn into a real rat's nest of wall warts and external HDD enclosures. Virtually all the cheap NASes also run Linux. The robustness of third-party support varies from 'you go first' to 'one of Debian's supported hardware targets,' but there are some good options.

This Nvidia part is probably the only game in town if you want to do CUDA on ARM, but I would be firmly shocked if it ends up being very competitive for anything else. If you want to do media center, the $99 Ouya has the same CPU and a lesser (but still 1920x1080-h.264-decode-in-hardware-capable) GPU. If you want to do NAS/server, you can buy a NAS (and, with the additional power that that MXM GPU is using, despite you not actually needing the graphical punch for server work, you might even be able to get an Atom or low-end APU in the same power budget and considerably cheaper).

This doesn't make it a bad product or anything, but unless Nvidia pulls quite a surprise out of their pricing hat, the appeal of repurposing one is going to be limited.

Nvidia was pretty firm in saying that Tesla is where most of their HPC ambitions still lie; if you need the most performance, that's still the way to go. This is more about bringing parity in API support across their entire product stack (GeForce, Quadro, Tegra, etc.) and filling in market niches.

It's there in Kepler, but Nvidia seems pretty ambivalent about it (and several tests have shown Kepler being worse at OpenCL than its predecessor, Fermi, was, though I haven't checked in on that situation in awhile). Sadly, Nvidia would probably prefer that you use the implementation that locks you into their hardware ecosystem.

I remember reading somewhere that Nvidia toned down OpenCL in Kepler to rein in power consumption compared to Fermi.

We've seen similar announcements/paper releases from Nvidia in the past; why does anyone think they'll get close to delivering on spec and power figures for once?

For a nice ARM desktop, a dual or quad A15 at 1.8GHz or higher should give decent performance. I've been running a few BeagleBoards and PandaBoards (both OMAP with PowerVR), and they're still a bit slow for "standard" desktop use but reasonable at video.

This looks really neat. If it's workstation Kepler and not gaming Kepler, it could be just the dev/test system I've been looking for for my research project (the difference between the two is the cool scheduler features, like being able to launch kernels from kernels).

If you want just media centre duties, you could do worse than the Pi, considering it has h.264 decoding out of the box and MPEG-2 with a £2.40 extension. And with XBMC extensions supported, you can get YouTube and iPlayer playing quite nicely.

If you want to do it on the super-cheap, and you're a hacking sort, RPi is the way to go, no doubt. For me, though, for a light server/NAS sort of thing, I built a box on a Sapphire E350 AMD board - the board was around $100, the Pico-PSU around $70. 4GB of RAM when it was dirt-cheap, and SATA drives when they were cheap - not counting the drives, it was around $200. It plays video decently, but I've only had it on VGA to a 15" LCD while I was building it. Now it runs headless and I RDP to it when needed.

I like the features of some commercial NAS boxes - ZFS filesystems and whatnot. I just can't bring myself to drop that kind of money on a dedicated NAS.

It's possible that these problems can be/have been resolved in software, but with Logan and Kayla Nvidia always mentions CUDA and never mentions OpenCL.

The Kepler architecture was cut down a bit from Fermi to focus on 3D rendering efficiency rather than compute. I'm sure they optimize heavily for CUDA over OpenCL, but there was a drop in performance with respect to both APIs unless you went up to the workstation/server cards.

And I imagine most of the uses for this will end up being server-based (even though they don't support it 'officially,' they're probably hedging their bets until a market develops) or as a phone/desktop hybrid, such as Ubuntu has been proposing.

Andrew Cunningham / Andrew has a B.A. in Classics from Kenyon College and has over five years of experience in IT. His work has appeared on Charge Shot!!! and AnandTech, and he records a weekly book podcast called Overdue.