Quote from Robert Hallock, PR Lead for Gaming and Enthusiast Graphics at AMD:

“People often compare the hardware of the next-gen consoles and the PC, compare specs on paper, and conclude that these consoles must be “PCs in a box.” That is patently untrue. While there are many commonalities, there were platform architecture decisions made for the consoles that set them apart from the PC in a significant way: how developers access the hardware, the Xbox One’s ESRAM, and the PlayStation 4’s UMA are all powerful examples of such decisions." ~Source


AMD must be tired of playing Switzerland and must really want to tout the new tech they've developed.

It seems it could be pointing at Xbone not having a Unified Memory Architecture. People were still debating over whether Xbone could have UMA with the eSRAM setup. The way this was stated, it seems he may be implying that Xbone doesn't have UMA.

It seems it could be pointing at Xbone not having a Unified Memory Architecture.

The Xbox One doesn't have UMA; it had UMA last gen.

Last gen it was UMA + an eDRAM cache.

This gen the eSRAM is fully addressable, which means it is a hell of a lot more flexible.
UMA isn't anything particularly exciting; it just makes that memory flexible.
You're not strictly limiting which memory is used for which purpose.

e.g. the 512 MB on the PS3 is strictly 256 MB CPU, 256 MB VRAM (there is some sharing, but it has drawbacks).

The 360's 512 MB was for whatever you wanted, e.g. 128 MB CPU, 384 MB GPU.

People were still debating over whether Xbone could have UMA with the eSRAM setup. The way this was stated, it seems he may be implying that Xbone doesn't have UMA.

The Xbox One still has UMA on the main memory, plus it has 32 MB of non-UMA memory for the GPU.

You could still have hUMA on the 8 GB of memory, no problem at all.

I doubt it would be available on the eSRAM, though.

hUMA isn't really about the memory; it's about being able to pass a pointer to a particular memory location and have it addressable by both CPU and GPU.

It doesn't mean you can't have both hUMA and non-hUMA memory.
And neither is specifically linked to UMA.
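
To make that concrete, here's a minimal sketch of the idea in C++. The gpu_alloc_shared/gpu_dispatch names are made up (not any real console or AMD API), and the "GPU" side is just simulated on the host; the point is only that both sides dereference the same pointer value, with no copy or handle translation in between.

```cpp
// Minimal sketch of the hUMA idea: CPU and GPU use the SAME pointer value.
// gpu_alloc_shared/gpu_dispatch are hypothetical stand-ins, simulated on the host.
#include <cstddef>
#include <cstdio>

struct Particle { float x, y, z; };

Particle* gpu_alloc_shared(std::size_t n) {
    // On hUMA hardware this would return memory mapped into both the CPU and
    // GPU page tables at the same virtual address. Plain heap stands in here.
    return new Particle[n]();
}

void gpu_dispatch(void (*kernel)(Particle*, std::size_t), Particle* p, std::size_t n) {
    kernel(p, n);  // stand-in for a real kernel launch; same pointer, no upload
}

void advance(Particle* p, std::size_t n) {       // the "GPU kernel"
    for (std::size_t i = 0; i < n; ++i) p[i].x += 1.0f;
}

int main() {
    Particle* particles = gpu_alloc_shared(4);
    particles[0] = {1.0f, 2.0f, 3.0f};           // CPU writes through the pointer...
    gpu_dispatch(advance, particles, 4);         // ...GPU reads the same pointer
    std::printf("%f\n", particles[0].x);         // CPU sees the GPU's result: 2.000000
    delete[] particles;
}
```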

One unique aspect of the chip is a shared memory pool that can be accessed by CPUs, GPUs and other processors in the system. Typically, GPUs and CPUs have different memory systems, but the new features increase the overall addressable memory in the Xbox One. The GPUs and CPUs have also been modified to enable shared memory...
"An important aspect of that is letting CPU and GPU share memory spaces," Sell said.

hUMA isn't really about the memory; it's about being able to pass a pointer to a particular memory location and have it addressable by both CPU and GPU.

More specifically, hUMA includes the ability for both a CPU and GPU to use the same pointer value to access the same object, WHICH ITSELF MAY CONTAIN POINTERS. Simply moving the object will invalidate the internal pointers. A fast hardware copy from CPU space to GPU space (or vice versa) is not good enough.
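
A tiny host-only illustration of that point: after a raw copy, any internal pointers still reference the old location. Nothing platform-specific here, just plain C++.

```cpp
// Why a fast CPU->GPU copy is not enough when objects contain pointers:
// after the copy, the internal pointers still point into the OLD location.
#include <cstring>
#include <cstdio>

struct Node {
    int   value;
    Node* next;  // internal pointer: only valid at the original address
};

int main() {
    Node original[2];
    original[0] = {10, &original[1]};  // node 0 points at node 1
    original[1] = {20, nullptr};

    Node copy[2];
    std::memcpy(copy, original, sizeof(original));  // the "fast hardware copy"

    // copy[0].next still points into original[], not into copy[] -- exactly
    // the breakage hUMA avoids by letting CPU and GPU share one address space.
    std::printf("copy[0].next == &copy[1]? %s\n",
                copy[0].next == &copy[1] ? "yes" : "no");  // prints "no"
}
```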

I guess everyone missed the arrow running from the CPU to the eSRAM on the X1 slide.


"We don't provide the 'easy to program for' console that (developers) want, because 'easy to program for' means that anybody will be able to take advantage of pretty much what the hardware can do, so then the question is, what do you do for the rest of the nine-and-a-half years?"
--Kaz Hirai, CEO, Sony Computer Entertainment

The proof will be in the year-two games and onward. To be honest, I put more stock in the talents of SCE WWS than in any kind of wizard $#@! hummus... Looking at what ND, QD (not first party, I know) and SSM have accomplished on PS3, and knowing that info is shared within WWS, I'll big up the workers, not their tools. Though the PS4 is looking to be a badass tool.

hUMA, or more specifically HSA, is very much the way forward. It more tightly integrates the data in memory, and it's essential for compute to work efficiently.

Without it, Ghost, all those extra teraflops on the compute side of the PS4 would be very much hindered and inefficient.
It's one of the reasons Naughty Dog will be able to extend the work they did with the SPUs.

There has been a lot of controversy about this matter over the last few days, but we will try to clarify that the PlayStation 4 supports hUMA technology, or at least implements a first revision of it. We have to remember that AMD hasn't released products with hUMA technology yet, so it is difficult to compare with anything on the market. Besides, the final specifications are not settled yet, so the PS4 implementation may differ a bit from finished hUMA implementations.

But first of all, what is hUMA? hUMA is the acronym for heterogeneous Uniform Memory Access. With hUMA, the two processors no longer distinguish between CPU and GPU memory areas. Maybe this picture explains the concept in an easy way:

[Diagram: the hUMA concept]

If you want to learn more about this tech, this article explains how hUMA works.

The PS4 has enhancements in its memory architecture that no other “retail” product has, as Mark Cerny has pointed out in different interviews. We will try to show the new parts of the PS4's components on the next pages.

We need to bring in our diagram of the PS4 memory architecture to explain how it works.


Mapping of memory in Liverpool

- Addresses are 40-bit. This size allows pages of memory mapped on both the CPU and GPU to have the same virtual address

- Pages of memory are freely set up by the application

- Pages of memory do not need to be mapped on both the CPU and GPU

If only the CPU will use a page, the GPU does not need to have it mapped
If only the GPU will use a page, it will access it via Garlic

- If both the CPU and GPU will access the memory page, a determination needs to be made whether the GPU should access it via Onion or Garlic (see the sketch after this list)

If the GPU needs very high bandwidth, the page should be accessed via Garlic; the CPU will need to access it as uncached memory
If the CPU needs frequent access to the page, it should be mapped as cached memory on the CPU; the GPU will need to access it via Onion.
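
Here's the sketch promised above: the page-mapping rules from that list, folded into one decision function. The Bus/CpuMapping/choose_mapping names are invented for illustration; the rules themselves come straight from the bullets.

```cpp
// Toy encoding of the Onion/Garlic mapping decision described above.
#include <cstdio>

enum class Bus        { None, Onion, Garlic };
enum class CpuMapping { NotMapped, Uncached, Cached };
struct PageMapping    { Bus gpu_bus; CpuMapping cpu; };

PageMapping choose_mapping(bool cpu_uses, bool gpu_uses, bool gpu_needs_high_bw) {
    if (!gpu_uses)              // CPU-only page: no GPU mapping needed
        return { Bus::None, CpuMapping::Cached };
    if (!cpu_uses)              // GPU-only page: accessed via Garlic
        return { Bus::Garlic, CpuMapping::NotMapped };
    if (gpu_needs_high_bw)      // shared, GPU bandwidth matters most:
        return { Bus::Garlic, CpuMapping::Uncached };  // CPU must go uncached
    // shared, CPU touches it often: keep it CPU-cached, GPU uses Onion
    return { Bus::Onion, CpuMapping::Cached };
}

int main() {
    PageMapping m = choose_mapping(true, true, true);
    std::printf("garlic? %d\n", m.gpu_bus == Bus::Garlic);  // prints 1
}
```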

Five Types of Buffers

- System memory buffers that the GPU uses are tagged as one of five memory types

- These first three types have very limited CPU access; primary access is by the GPU

- Read Only (RO)

An “RO” buffer is memory that is read by the CUs but never written by them, e.g. a texture or vertex table
Access to RO buffers can never cause L1 caches to lose coherency with each other, as it is write operations that cause coherency problems.
- Private (PV)

A “PV” buffer is private memory read from and written to by a single threadgroup, e.g. a scratch buffer.
Access to PV buffers can never cause L1 caches to lose coherency, because it is writes to shared memory areas that cause the problems

- GPU coherent (GC)

A “GC” buffer is memory read from and written to by the CUs as a result of draw calls or dispatches, e.g. outputs from vertex shaders that are later read by geometry shaders. Depth buffers and render targets are not GC memory, as they are not written by the CUs but by dedicated hardware in the DBs and CBs.
As writes are permitted to GC buffers, access to them can cause L1 caches to lose coherency with each other

- The last two types are accessible by both CPU and GPU

- System coherent (SC)

An “SC” buffer is memory read from and written to by both CPU and GPU, e.g. structures written by the CPU and read by the GPU, or structures used for CPU-GPU communication
SC buffers present the largest coherency issues. Not only can L1 caches lose coherency with each other, but both L1 and L2 can lose coherency with system memory and the CPU caches.

- Uncached (UC)

A “UC” buffer is memory that is read from and written to by both CPU and GPU, just like SC memory
UC buffers are never cached in the GPU L1 or L2, so they present no coherency issues
UC accesses use the new Onion+ bus, a limited-bandwidth bus similar to the Onion bus
UC accesses may have significant inefficiencies due to repeated reads of the same line, or incremental updates of lines

- The first three types (RO, PV, GC) may also be accessed by the CPU, but care must be taken. For example, when copying a texture to a new location

The CPU can write the texture data in an uncached fashion, then manually flush the GPU caches. The GPU can then subsequently access the texture as RO memory through Garlic at high speed.
Two dangers are avoided here. As the CPU wrote the texture data using uncached writes, no data remains in the CPU caches and the GPU is free to use Garlic rather than Onion. As the CPU flushed the GPU caches after the texture setup, there is no possibility of stale data in the GPU L1 and L2.
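
As a quick summary of the five types above and their coherency implications, here's a hedged sketch. The numeric values and helper names are invented (the next section only says three bits are reserved for the type, not what the encodings are); the semantics in the comments come from the text.

```cpp
// The five buffer types as a 3-bit field, matching the V#/T# extension below.
#include <cstdint>

enum class MemType : std::uint8_t {
    RO = 0,  // read-only to the CUs: textures, vertex tables
    PV = 1,  // private to a single threadgroup: scratch buffers
    GC = 2,  // GPU-coherent: read/written by the CUs within the GPU
    SC = 3,  // system-coherent: shared CPU/GPU, cached in GPU L1/L2
    UC = 4,  // shared CPU/GPU but never cached on the GPU (Onion+ bus)
};

// Per the text, only writable-and-shared types threaten GPU L1 coherency...
constexpr bool can_break_l1_coherency(MemType t) {
    return t == MemType::GC || t == MemType::SC;
}
// ...and only SC can additionally go stale against system memory / CPU caches.
constexpr bool can_break_system_coherency(MemType t) {
    return t == MemType::SC;
}

static_assert(!can_break_l1_coherency(MemType::RO), "reads alone are safe");
static_assert(!can_break_system_coherency(MemType::UC), "UC bypasses GPU caches");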

Tracking of Type in Memory Accesses

- Memory accesses are made via V# and T# definitions that contain the base address and other parameters of the buffer or texture

- Three bits have been added to V# and T# to specify the memory type

- An extra bit has been added to the L1 tags

It is set if the line was loaded from either GC or SC memory (as opposed to RO or PV memory)
A new type of packet-based L1 invalidate has been added that only invalidates the GC and SC lines
A simple strategy is for application code to use this invalidate before any draw call or dispatch that accesses GC or SC buffers

- An extra bit has been added to the L2 tags

It indicates if the line was loaded from SC memory
A new L2 invalidate of just the SC lines has been added
A new L2 writeback of just the SC lines has been added; both operations are packet-based.
A simple strategy is for application code to use the L2 invalidate before any draw call or dispatch that uses SC buffers, and use the L2 writeback after any draw call or dispatch that uses SC buffers
The combination of these features allows for efficient acquisition and release of buffers by draw calls and dispatches
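
A toy model of those selective operations, to make the tag bits concrete: each cache line carries the extra bit(s), and the new packet-based operations walk only the lines whose bits are set. A real GPU does this in hardware; the data structures and names here are purely illustrative.

```cpp
// Toy model of the selective invalidates/writebacks described above.
#include <vector>

struct L1Line { bool valid = true; bool from_gc_or_sc = false; };
struct L2Line { bool valid = true; bool dirty = false; bool from_sc = false; };

void l1_invalidate_gc_sc(std::vector<L1Line>& l1) {
    for (auto& line : l1)
        if (line.from_gc_or_sc) line.valid = false;  // RO/PV lines survive
}

void l2_invalidate_sc(std::vector<L2Line>& l2) {
    for (auto& line : l2)
        if (line.from_sc) line.valid = false;        // RO/PV/GC lines survive
}

void l2_writeback_sc(std::vector<L2Line>& l2) {
    for (auto& line : l2)
        if (line.from_sc && line.dirty)
            line.dirty = false;  // dirty SC bytes go back to system memory
}
```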

Simple Example:

- Let’s take the case where most of the GPU is being used for graphics (vertex shaders, pixel shaders and so on)

- Additionally, let’s say that we have an asynchronous compute dispatch that uses a buffer of SC memory for:

Dispatch inputs, which are created by the CPU and read by the GPU
Dispatch outputs, which are created by the GPU and read by the CPU

- The GPU can:

1) Acquire the SC buffer by performing an L1 invalidate (GC and SC) and an L2 invalidate (SC lines only). This eliminates the possibility of stale data in the caches. Any SC address encountered will properly go off-chip (to either system memory or CPU caches) to fetch the data.

2) Run the compute shader

3) Release the SC buffer by performing an L2 writeback (SC lines only). This writes all dirty bytes back to system memory where the CPU can see them

- The graphics processing is much less impacted by this strategy

On the R10xx, the complete L2 was flushed, so any data in use by the graphics shaders (e.g. the current textures) would need to be reloaded
On Liverpool, that RO data stays in place – as does PV and GC data
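
Putting the whole example together as host-side pseudocode: the acquire/release steps would really be packets in the GPU command stream, so they appear here only as comments marking where they sit, and the compute shader is simulated on the host. Everything named here is an illustrative stand-in, not a real API.

```cpp
// Host-side shape of the CPU <-> GPU round trip in the example above.
#include <vector>
#include <numeric>
#include <cstdio>

int main() {
    std::vector<int> sc_buffer(256, 0);          // stands in for the SC buffer

    // CPU writes dispatch inputs into SC memory
    std::iota(sc_buffer.begin(), sc_buffer.end(), 0);

    // [GPU] 1) acquire: L1 invalidate (GC+SC) + L2 invalidate (SC lines only)
    // [GPU] 2) compute shader runs; simulated here on the host:
    for (int& v : sc_buffer) v *= 2;
    // [GPU] 3) release: L2 writeback (SC only) -> results reach system memory

    // CPU reads dispatch outputs; graphics RO/PV/GC lines were never flushed
    std::printf("first output: %d\n", sc_buffer[1]);  // prints 2
}
```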

This technical information can be a bit overwhelming and confusing, so we will share more information and examples of how this architecture is used in a new article this week.
