AMD reveals new Radeon GPU architecture, codenamed Vega

Over a year ago, AMD laid out a two-pronged strategy for its 14nm GPU refreshes. First, it would refresh its midrange GPUs
with new 14nm hardware based on an updated Graphics Core Next GPU,
codenamed Polaris. These entry-level to midrange cards would be followed
by a full high-end refresh based on a new GPU design, codenamed Vega.
Polaris arrived on schedule and delivered a significant performance
boost in the areas where AMD needed it most, but details on Vega have been slow to materialize.

Today, that changes. We don’t have review
hardware in-hand yet, but AMD has finally pulled back the curtain and
shared some significant information on what Vega can do and what it
changes compared with GCN. Vega will use second-generation High
Bandwidth Memory (HBM2) rather than HBM or GDDR5. HBM2 delivers two
substantial improvements over HBM: it doubles the data rate per pin,
providing twice the memory bandwidth from the same number of
“stacks,” and it significantly increases how much RAM can fit into each
stack. HBM, if you recall, topped out at 4x1GB stacks, or 4GB of RAM
total. This was already a bit of a tight squeeze for AMD’s Fury X family
in June 2015, but HBM2 demolishes that limitation.
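The bandwidth math is simple to sketch. The HBM1 figures below match the Fury X (four stacks at 1 Gbps per pin); the two-stack HBM2 configuration is hypothetical, included purely to show the effect of the per-pin doubling:

```python
# Back-of-envelope HBM vs. HBM2 bandwidth. The 1024-bit bus per stack is
# common to both standards; the two-stack HBM2 part is a hypothetical
# configuration, not a confirmed Vega spec.

BUS_WIDTH_PER_STACK = 1024  # bits (pins) per stack

def bandwidth_gbs(stacks, gbps_per_pin):
    """Peak bandwidth in GB/s: pins * per-pin rate / 8 bits per byte."""
    return stacks * BUS_WIDTH_PER_STACK * gbps_per_pin / 8

print(bandwidth_gbs(4, 1.0))  # Fury X, HBM1: 512.0 GB/s
print(bandwidth_gbs(2, 2.0))  # hypothetical 2-stack HBM2: 512.0 GB/s
```

Doubling the per-pin rate means HBM2 can match the Fury X's bandwidth from half the stacks, or double it from the same four.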

Consumer cards with HBM2 will likely start at
8GB of RAM, with the standard capable of supporting at least 32GB. Any
cards with that much RAM that appear in 2017 will be workstation or
server-oriented, but the headroom is there when AMD eventually needs it
for consumer cards. Rumors that AMD
would release both an HBM2 and a GDDR5X version of Vega appear to be
wrong, much like the never-materialized rumors of an 8GB Fury X in the
run-up to that GPU’s launch.

AMD isn’t just relying on HBM2 for traditional
memory, however. Vega will also introduce two new, HBM2-related
features: a High Bandwidth Cache (HBC) and a High Bandwidth Cache Controller (HBCC).

The HBC and HBCC give Vega a large memory pool (large compared
with on-die caches, though the exact size isn’t known)
that it can use in a variety of ways. AMD isn’t giving out the exact
details of how this cache functions yet, but the goal is to enable
fine-grained data movement and keep important data local to the GPU
without having to pull it out of main memory. The cache can also be accessed
without stalling other workloads: normally the GPU will stall when pulling
texture data out of main memory, whereas AMD’s HBCC avoids this problem.

The High Bandwidth Cache Controller provides
512TB of virtual address space and uses relatively small pages to
ensure the GPU is fed the data it needs rather than a bunch of
information that ultimately won’t be used. There are also algorithms
in place to monitor the rate at which data is loaded into or evicted from the
cache.
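The page-based approach can be sketched with a toy model. AMD hasn't published the HBCC's actual algorithms, so everything here is illustrative: the page size, the capacity, and the simple LRU eviction policy are all made up.

```python
# Toy illustration of page-granular residency tracking, NOT AMD's design:
# only the small pages a workload actually touches get fetched into
# GPU-local memory, and stale pages are evicted LRU-first.
from collections import OrderedDict

PAGE_SIZE = 4096  # bytes; AMD says "relatively small pages," exact size unknown

class ToyHBCC:
    """Hypothetical page-residency tracker for illustration only."""
    def __init__(self, resident_pages):
        self.capacity = resident_pages
        self.resident = OrderedDict()   # page number -> True, kept in LRU order
        self.faults = 0                 # pages that had to be fetched

    def access(self, address):
        page = address // PAGE_SIZE
        if page in self.resident:
            self.resident.move_to_end(page)        # mark most recently used
        else:
            self.faults += 1                       # fetch page into local pool
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)  # evict least recently used
            self.resident[page] = True

cache = ToyHBCC(resident_pages=2)
for addr in (0, 100, 5000, 0, 9000, 5000):
    cache.access(addr)
print(cache.faults)  # → 4: hits on resident pages never trigger a fetch
```

The point of the sketch is the granularity: moving 4KB pages on demand wastes far less bandwidth than copying an entire multi-hundred-megabyte texture resource that the GPU may barely touch.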

One of the most common misconceptions about
GPU RAM allocation and the popular freeware utility GPU-Z is that GPU-Z
can tell you how much RAM the GPU is actually using. As we first covered in our tests of whether 4GB of RAM was enough for the Fury X, it cannot.
GPU-Z, like every utility that reports VRAM usage under DirectX 11,
has no way to know how much RAM the GPU is actively using, because
the DirectX 11 API doesn’t expose that information. Instead, these tools
report how much VRAM has been allocated, not whether the GPU
is making any use of it. As the slide above shows, the
gap between how much VRAM has been allocated and how much is
actually in use is quite significant, even in popular titles. The goal
of AMD’s HBC + HBCC combination is to allow the GPU to load and access data
more efficiently.


Meet the NCU:

From 2012 to the present day, AMD’s GPUs have
all been built around Graphics Core Next and its Compute Units. With
Vega, AMD is debuting its new compute units (NCUs). Each NCU contains
64 ALUs and, counting a fused multiply-add as two operations, can
execute 128 32-bit operations per clock, 256 16-bit operations, or 512
8-bit operations.

NCUs can pack multiple 8-bit or 16-bit
operations into the same execution slot, allowing the GPU to double or
quadruple its throughput depending on the workload. Our understanding
is that the ALU doesn’t dynamically reconfigure itself on the fly, but it can
execute variable-width instructions (1×32-bit, 2×16-bit, etc.). This
gives AMD a potent hand to play in emerging fields like AI and deep
learning by boosting throughput.
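Those throughput figures scale in a straightforward way. Here's a quick sketch using the per-NCU rates above; the NCU count and clock speed are hypothetical, since AMD hasn't announced either:

```python
# Per-NCU ops/clock at each precision, from AMD's disclosure.
OPS_PER_CLOCK = {32: 128, 16: 256, 8: 512}

def tops(ncus, clock_ghz, width_bits):
    """Theoretical tera-operations/second at the given precision.
    NCU count and clock are placeholders, not announced Vega specs."""
    return ncus * OPS_PER_CLOCK[width_bits] * clock_ghz / 1000

# Hypothetical 64-NCU part at 1.5 GHz:
for width in (32, 16, 8):
    print(f"{width}-bit: {tops(64, 1.5, width):.1f} TOPS")
# 32-bit: 12.3 TOPS, 16-bit: 24.6 TOPS, 8-bit: 49.2 TOPS
```

Halving the operand width doubles throughput, which is exactly why packed 16-bit and 8-bit math matters for inference-style deep learning workloads.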

One of the weak spots of GCN was that the core
dramatically favored width over clock speed. This worked well early in
its life, when it competed against Kepler and, to some extent, Maxwell,
but Pascal gave Nvidia an enormous amount of clock speed headroom that
AMD’s wider RX 480 design couldn’t counter effectively. AMD isn’t
releasing its target clock speeds or IPC figures just yet, but Vega is
designed to improve on both fronts, with higher clock rates and higher
IPC.

Finally, AMD is connecting Vega’s ROPs
directly to its L2 cache. This will boost performance in games that use
deferred rendering because it allows the GPU’s render backends to write
directly to L2 rather than moving data through main memory first.

Final thoughts

There’s a lot we still don’t know about Vega,
including TDP, core counts, prices, and performance figures. AMD has
played its cards close to the vest with both Vega and Ryzen,
disclosing information only bit by bit. We still don’t have a release
date for Vega; AMD has previously said the first half of 2017, but it’s
also possible the company is keeping that quiet as well. For now, these
aren’t areas where I’m willing to speculate.

As far as the GPU design itself goes, these look
like the right sorts of improvements for Vega to make. Nvidia’s Maxwell
was a huge efficiency leap over Kepler, and its tiled renderer is
thought to have been a large part of the reason. AMD adopting this
approach makes good sense, while the core’s High Bandwidth Cache and
cache controller offer capabilities we haven’t seen on a GPU before.

We know AMD needed both higher IPC and faster
clock speeds, and the company is promising Vega delivers both. While
throughput figures don’t tell us everything, being able to issue up to
11 polygons per clock instead of four is a substantial improvement to
geometry processing, even before we take relative efficiency into
account.

There are some other interesting parts of the
block diagram, like the “Network Storage” block. This could be a
reference to the SSD+GPU concept AMD unveiled at SIGGRAPH 2016, or even
an on-die low-latency storage pool that bypasses the need to pull data
in via PCI Express. Meanwhile, the variable-width ALU that supports
8-bit, 16-bit, and 32-bit data gives AMD the opportunity to duke it out
in the high-end HPC, AI, and deep learning markets where Nvidia has
dominated to date.

Paper specs can only tell us so much and I’m
not going to render a verdict on Team Green versus Red performance until
we’ve got hardware in-hand. But based on what we’ve seen, AMD has made
the right moves with Vega. After five years with GCN, AMD needed a
dramatically new approach. It looks like they’ve got one.