The "Vega 20" silicon will be significantly different from the "Vega 10" which powers the company's current Radeon RX Vega series. AMD CEO Dr. Lisa Su unveiled the "Vega 20" silicon at the company's 2018 Computex event, revealing that the multi-chip module's 7 nm GPU die is surrounded by not two, but four HBM2 memory stacks, making for up to 32 GB of memory. Another key specification is emerging thanks to the sharp eyes at ComputerBase.de: the system bus.

A close inspection of the latest AMDGPU Linux driver reveals link-speed definitions for PCI-Express gen 4.0, which offers 256 Gbps of bandwidth per direction at x16 bus width, double that of PCI-Express gen 3.0. "Vega 20" got its first PCIe gen 4.0 support confirmation from a leaked slide that surfaced around CES 2018. AMD "Vega" architecture slides from last year hinted at a Q3/Q4 launch of the first "Vega 20" based product. The same slide also hinted that the next-generation EPYC processor, which we know is "Zen 2" based and not "Zen+," could feature PCI-Express gen 4.0 root complexes. Since EPYC chips are multi-chip modules, it could also hint at the likelihood of PCIe gen 4.0 on "Zen 2" based 3rd generation Ryzen processor family.
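As a rough sanity check on those figures, raw per-direction link bandwidth scales linearly with per-lane transfer rate and lane count (this sketch ignores 128b/130b encoding overhead, so the numbers are slightly optimistic):

```python
# Raw per-direction PCIe link bandwidth, before 128b/130b encoding overhead.
# Per-lane transfer rates: gen 3.0 = 8 GT/s, gen 4.0 = 16 GT/s.
def raw_bandwidth_gbps(rate_gt_per_s, lanes):
    # At this granularity, 1 GT/s per lane carries ~1 Gbps.
    return rate_gt_per_s * lanes

print(raw_bandwidth_gbps(8, 16))   # gen 3.0 x16 -> 128
print(raw_bandwidth_gbps(16, 16))  # gen 4.0 x16 -> 256
print(raw_bandwidth_gbps(16, 8))   # gen 4.0 x8  -> 128, matches gen 3.0 x16
```

The last line is why a gen 4.0 GPU could run at x8 without losing bandwidth versus a gen 3.0 x16 link, freeing the remaining lanes for other devices.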

Comments on AMD Vega 20 GPU Could Implement PCI-Express gen 4.0

A checkbox feature for the time being. No CPU currently supports PCIe 4.0, and a GPU designed to actually use all that bandwidth would run crippled unless paired with upcoming CPUs.
No biggie, we've been there before with PCIe 2.0 and 3.0.

What would be really neat would be for the card to run PCIe 4.0 x8 and free up 8 more lanes for NVMe storage.

bug: A checkbox feature for the time being. No CPU currently supports PCIe 4.0, and a GPU designed to actually use all that bandwidth would run crippled unless paired with upcoming CPUs.
No biggie, we've been there before with PCIe 2.0 and 3.0.

What would be really neat would be for the card to run PCIe 4.0 x8 and free up 8 more lanes for NVMe storage.

I think that might be the point. Even with 5.0 coming up, full x16 bandwidth is not strictly necessary for most devices (aside from, say, HPC NICs), but the higher per-lane speed means fewer lanes go further, allowing a lot more devices per chassis. A GPU that needs x16 lanes at gen 3.0 only needs x8 at gen 4.0. Plus it's backwards compatible, so it doesn't hurt.

bug: A checkbox feature for the time being. No CPU currently supports PCIe 4.0, and a GPU designed to actually use all that bandwidth would run crippled unless paired with upcoming CPUs.

windwhirl: Okay. So it is likely that next-gen graphics cards (Vega 20) and whatever comes after Volta will switch to PCIe 4.0 for the high-end products.

This also makes me think that I should probably put any upgrade plans on hold and wait until motherboards and CPUs with support for PCIe 4.0 reach the consumer market, for better future-proofing...

Exactly my strategy as well. Wait until 7 nm CPUs and GPUs arrive with all the nice new features and make a quality upgrade. Should be there circa 2019 or 2020. In the meantime, just keep using the 6700K and 1080.

eidairaman1: Considering it seems like it is mainly network controllers and storage devices that need all the throughput, because graphics cards just plain don't utilize it.

This is absolutely incorrect.

The motivation behind the sudden improvement of PCI-E (3.0 dates back to 2010) is exactly the GPU, but not the gaming part; it is the compute part, which finds the PCI-E interface a horrible bottleneck.
PS: I am writing GPU code to accelerate data capture, and the PCI-E speed is basically what determines the shortest processing time.

eidairaman1: I thought PCIe 4.0 was relegated to servers and 5.0 would be the next major version for desktops...

That makes sense since this article is talking about a GPU that is intended for the data center market and not the consumer space. Of course, that doesn't stop anyone from fantasizing about the implications that this would theoretically have on the gaming market.

Consider how many TFLOPs even a cheap card can deliver today, how much memory bandwidth those cards have, and finally how many GB/s PCI-E 3.0 can pass.
This bandwidth limit does not kill everyone, as one can do a lot of operations on the GPU, and for various reasons performance does not scale purely with TFLOPs; however, in cases like digital down-conversion, where vectors are multiplied element-wise, the transfer is the limitation.
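The comparison above can be made concrete with a back-of-the-envelope sketch. The figures below (a modest 5 TFLOPS FP32 card, ~15.75 GB/s per direction for PCIe 3.0 x16 after encoding) are illustrative assumptions, not measurements:

```python
# Back-of-envelope: compute time vs. transfer time for an element-wise
# multiply, the digital down-conversion case mentioned above.
pcie3_x16_gbs = 15.75     # GB/s per direction, PCIe 3.0 x16 after 128b/130b
gpu_tflops = 5.0          # assumed FP32 throughput of a modest card
bytes_per_elem = 4        # float32

n = 100_000_000                           # 100M samples
flops = 2 * n                             # one multiply each for sin and cos
compute_s = flops / (gpu_tflops * 1e12)
# one input vector in, two product vectors out
transfer_s = (3 * n * bytes_per_elem) / (pcie3_x16_gbs * 1e9)

print(f"compute ~{compute_s*1e3:.2f} ms, transfer ~{transfer_s*1e3:.1f} ms")
```

Even with these generous assumptions the transfer takes orders of magnitude longer than the arithmetic, which is why such workloads are bus-limited.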

Well, it would be interesting, but still pointless until we get actual desktop support. However, I say we focus on getting something out that can actually best the GTX 1080 Ti by a decent margin before innovating more. We are still waiting on some competition so new cards come out...

GhostRyder: Well, it would be interesting, but still pointless until we get actual desktop support. However, I say we focus on getting something out that can actually best the GTX 1080 Ti by a decent margin before innovating more. We are still waiting on some competition so new cards come out...

Is your mind only on gaming?? Pointless?? There is IBM POWER9, which supports PCIe 4.0.

And 7 nm Vega with 32 GB HBM2 is designed ONLY for the datacenter and HPC. AMD is making this GPU for the datacenter, not the desktop. What's your point??

Sandbo: That's a mindset people have to renew; AI/DSP and a lot more new applications of GPUs will make up the majority of GPU sales in the future. The cryptomining said it all.

Actually, I think the sane thing to do would be to have a clear delimitation between GPUs (usually associated with desktops) and stuff that's used for compute (rarely associated with desktops). But since SKUs are very closely related, that's not going to happen anytime soon.
In the meantime, I think we should cut some slack for those that see GPU and don't automatically think mining, AI and whatnot. In exchange, they (myself included) should be more careful choosing their words.

JoniISkandar: Is your mind only on gaming?? Pointless?? There is IBM POWER9, which supports PCIe 4.0.

And 7 nm Vega with 32 GB HBM2 is designed ONLY for the datacenter and HPC. AMD is making this GPU for the datacenter, not the desktop. What's your point??

No, my mind is on what's available and what we can do with it. Yes, most of my reference was to the mainstream, but either way there is not even much professional support for PCIe 4.0. Congrats on the googling to find one of the few things that can support it. It's on the label, and far too early to celebrate support for something that's hard to find support for at the time. If that changes, then it will become something to look forward to.

bug: Actually, I think the sane thing to do would be to have a clear delimitation between GPUs (usually associated with desktops) and stuff that's used for compute (rarely associated with desktops). But since SKUs are very closely related, that's not going to happen anytime soon.
In the meantime, I think we should cut some slack for those that see GPU and don't automatically think mining, AI and whatnot. In exchange, they (myself included) should be more careful choosing their words.

The problem is more along the lines of what can even support it, and when. Having the feature even on a card for professional use is nice, but right now there ain't much that can do PCIe 4.0.

GhostRyder: The problem is more along the lines of what can even support it, and when. Having the feature even on a card for professional use is nice, but right now there ain't much that can do PCIe 4.0.

It's been pointed out above: only server chips for now. But I wouldn't make a fuss about it; it's the chicken-and-egg problem we get with every new generation. Gotta start somewhere with support ;)

Sandbo: The motivation behind the sudden improvement of PCI-E (3.0 dates back to 2010) is exactly the GPU, but not the gaming part; it is the compute part, which finds the PCI-E interface a horrible bottleneck.
PS: I am writing GPU code to accelerate data capture, and the PCI-E speed is basically what determines the shortest processing time.

That's why we have NVLink, IF, and IIRC Intel's also working on something.
Can you tell us how much data is transferred between the CPU/GPU through PCIe by certain applications in your line of work? I'm not talking about theoretical limits, but actual observed data transfers.

R0H1T: That's why we have NVLink, IF, and IIRC Intel's also working on something.
Can you tell us how much data is transferred between the CPU/GPU through PCIe by certain applications in your line of work? I'm not talking about theoretical limits, but actual observed data transfers.

I am currently working on a digitizer; it is moving ~800 MB per channel at a sampling frequency of 200 MHz.
Each input creates two outputs, so the output size will be 1600 MB.
I am still exploring and optimizing it; the best I got is still around a few times the theoretical limit; I believe there is some overhead.
We want to utilize all 4 channels, and possibly boost the sampling rate to 400 MHz (or higher), so it will be up to 8 times longer than that. As the operations are relatively simple, they are mostly bandwidth-limited in our case.
The above is the best-case scenario; otherwise we could also be limited by sharing of the PCI-E lanes and the like. The time seems short, but compared to the computation, the transfer dominates here.

A possible solution is to see if the transfer time (as well as the small VRAM size) can be addressed by using an APU (Raven Ridge), where the RAM is shared between the host and the on-die GPU. If this works, that would be a low-cost and efficient processor for our application. Also, using lower precision (half) and doing decimation on the GPU might help solve the problem.
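Taking the figures in the post at face value (~800 MB in, 1600 MB out per channel, PCIe 3.0 x16 at roughly 15.75 GB/s per direction), a lower bound on the per-channel transfer time can be estimated; the bandwidth figure is an assumed theoretical value, not a measurement:

```python
# Rough lower bound on PCIe transfer time for the capture described above.
# Assumes transfers are serialized; with full-duplex overlap the bound
# would be max(in, out) / bandwidth instead of their sum.
pcie3_x16_gbs = 15.75     # GB/s per direction, theoretical after encoding
in_mb, out_mb = 800, 1600 # input vector in, two product vectors out

t = (in_mb + out_mb) / 1024 / pcie3_x16_gbs
print(f"~{t*1e3:.0f} ms minimum per channel on PCIe 3.0 x16")
```

Around 150 ms per channel just for the bus, before any overhead, which is consistent with the poster's observation that the transfer, not the arithmetic, sets the floor on processing time.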

Sandbo: I am currently working on a digitizer; it is moving ~800 MB per channel at a sampling frequency of 200 MHz. Each input creates two outputs, so the output size will be 1600 MB.
I am still exploring and optimizing it; the best I got is still around a few times the theoretical limit; I believe there is some overhead.
We want to utilize all 4 channels, and possibly boost the sampling rate to 400 MHz (or higher), so it will be up to 8 times longer than that. As the operations are relatively simple, they are mostly bandwidth-limited in our case.
The above is the best-case scenario; otherwise we could also be limited by sharing of the PCI-E lanes and the like. The time seems short, but compared to the computation, the transfer dominates here.

A possible solution is to see if the transfer time (as well as the small VRAM size) can be addressed by using an APU (Raven Ridge), where the RAM is shared between the host and the on-die GPU. If this works, that would be a low-cost and efficient processor for our application. Also, using lower precision (half) and doing decimation on the GPU might help solve the problem.

To add to the diagram, a raw input vector is sent to the GPU (VRAM), then multiplied element-wise by sin and cos vectors with the same number of points.
The two product vectors are then sent back to the host (RAM).
In practice, I tried to use DMA transfers, which ideally do not read or write the data on the host until run-time; this should reduce overhead and allow reads and writes to happen concurrently.

For the DDC itself, more processing like decimation can be done to reduce the output vectors returned from the GPU to the host, but feeding the raw input vector cannot be avoided.
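The pipeline described above (raw vector in, element-wise multiply by sin and cos, two product vectors out, with decimation shrinking the return transfer) can be sketched on the CPU with NumPy; the function name, LO frequency, and decimation factor are illustrative, and a real DDC would low-pass filter before downsampling:

```python
import numpy as np

def ddc(raw, f_lo, fs, decim=1):
    """Digital down-conversion sketch: multiply the captured vector
    element-wise by cos/sin at the local-oscillator frequency, then
    decimate to shrink the vectors returned to the host."""
    n = np.arange(raw.size)
    i = raw * np.cos(2 * np.pi * f_lo / fs * n)  # in-phase product
    q = raw * np.sin(2 * np.pi * f_lo / fs * n)  # quadrature product
    # Decimation cuts the host-bound transfer by the same factor.
    return i[::decim], q[::decim]

fs, f_lo = 200e6, 10e6                 # 200 MHz sampling, assumed 10 MHz LO
raw = np.random.randn(1 << 20).astype(np.float32)
i, q = ddc(raw, f_lo, fs, decim=4)
print(i.size)  # a quarter of the input length to move back over the bus
```

Note how the input transfer is irreducible (every raw sample must reach the GPU), while decimation attacks only the output leg, matching the comment above.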