SPU and its little Atomic Cache Unit

Legend

Ok... I was searching for PS3 DLNA information and came across the article. Decided to post it here after a quick read.

Here's DeanoC hard at work on his blog...

Atomic Cache Unit
Never heard of it, right?

Well, it's one of my favorite things on the PS3, and it gets little love cos it's one of those tiny features that make life so much nicer.


Make life easier? On PS3? Hmppphhh!

P.S. Free beer from me next time you guys (or any of the pushing-the-envelope guys) stop by the Bay Area.

EDIT: Holy Sh*t! Why didn't anyone highlight this before? It will make a huge difference.

The ACU(s) are a part of each SPU that allow atomic updates to occur very quickly. It appears fairly simple: each SPU has 512 bytes of cache (yes, contrary to what you might have heard, SPUs do have a tiny bit of cache). The 512 bytes are divided into four 128-byte lines. The MFC (DMA) unit can bring a cache line in from memory and mark it reserved… if another processor writes to the same bit of memory, the reservation is lost and you know to repeat the read/modify/write cycle to guarantee atomicity.

All good, but what's really clever is how it's implemented. If another SPU asks for that bit of memory, it gets it from the other SPU's cache, if it's in there. So you effectively have a fast SHARED 512 bytes. When an SPU writes, the other SPUs only have to read from the writing SPU's cache rather than it being DMAed back to main memory and DMAed into LS. Cuts down a lot of memory traffic. I even abuse it and just use it as a conventional cache and communication channel between SPUs. You can push a lot of data around with a fast 128-byte path.
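For concreteness, here is a minimal sketch of the reserve / modify / conditional-write loop described above, written against the Cell SDK's spu_mfcio.h atomic intrinsics (mfc_getllar / mfc_putllc). The counter layout, the counter_ea parameter and the MFC_PUTLLC_STATUS check are assumptions from memory of the SDK, not anything taken from DeanoC's post:

    #include <stdint.h>
    #include <spu_mfcio.h>

    /* One 128-byte, 128-byte-aligned LS buffer matching a single ACU line. */
    static volatile uint32_t line[32] __attribute__((aligned(128)));

    /* Atomically add 'delta' to the first word of the 128-byte block at
       'counter_ea' (assumed 128-byte aligned in main memory). */
    static void atomic_add(uint64_t counter_ea, uint32_t delta)
    {
        do {
            /* Bring the line in and set the reservation. */
            mfc_getllar(line, counter_ea, 0, 0);
            (void)mfc_read_atomic_status();

            line[0] += delta;                  /* read/modify in LS */

            /* Conditional write-back; it fails if another processor wrote
               to the same line and the reservation was lost. */
            mfc_putllc(line, counter_ea, 0, 0);
        } while (mfc_read_atomic_status() & MFC_PUTLLC_STATUS);   /* retry */
    }

The whole 128-byte line is transferred on every attempt, which is exactly why keeping it resident in the ACU matters.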

Newcomer

I'm not sure it'll make a huge difference... in fact, I'm interested as to why you think it would! Even without keeping this data in the 4-entry cache, it's my understanding that full LS-to-LS DMAs stay on the EIB... they don't go via main memory.

So bearing that in mind, I'm not sure why Deano is describing a system where data goes out from LS, to main memory, and back to LS, as that simply doesn't happen in the case of LS->LS DMA.

And surely using the SPU cache in this way pretty much requires that, in order to run at full speed, the other SPUs you're communicating with are not evicting cache contents by performing other DMAs? So your system needs to be pretty static in terms of DMA usage to reap the full benefit of what is described.

Regular

DeanA said:

I'm not sure it'll make a huge difference... in fact, I'm interested as to why you think it would! Even without keeping this data in the 4-entry cache, it's my understanding that full LS-to-LS DMAs stay on the EIB... they don't go via main memory.

So bearing that in mind, I'm not sure why Deano is describing a system where data goes out from LS, to main memory, and back to LS, as that simply doesn't happen in the case of LS->LS DMA.

And surely using the SPU cache in this way pretty much requires that, in order to run at full speed, the other SPUs you're communicating with are not evicting cache contents by performing other DMAs? So your system needs to be pretty static in terms of DMA usage to reap the full benefit of what is described.

Cheers,
Dean


To my naive mind, LS->LS DMA still requires more programmer synchronization between SPUs than the ACU in some cases, so it seems more a matter of convenience than speed... though if, as you say, DMAs to and from main memory can evict lines in the cache, then that is a bit of a bummer. I assumed that the ACU locked the atomic lines in while other DMAs went directly between LS and main memory, though I guess I had no reason to think this.

Newcomer

I assumed that the ACU locked the atomic lines in while other DMAs went directly between LS and main memory, though I guess I had no reason to think this.


Hmm... I thought that the ACU shares some bits with the DMA subsystem, but hey, irrespective of this, if other SPUs are doing things (unrelated to the stats update), then it would be possible for entries to become evicted.

VeteranSubscriber

DeanA said:

I'm not sure it'll make a huge difference... in fact, I'm interested as to why you think it would! Even without keeping this data in the 4-entry cache, it's my understanding that full LS-to-LS DMAs stay on the EIB... they don't go via main memory.

So bearing that in mind, I'm not sure why Deano is describing a system where data goes out from LS, to main memory, and back to LS, as that simply doesn't happen in the case of LS->LS DMA.

And surely using the SPU cache in this way pretty much requires that, in order to run at full speed, the other SPUs you're communicating with are not evicting cache contents by performing other DMAs? So your system needs to be pretty static in terms of DMA usage to reap the full benefit of what is described.

Cheers,
Dean


Cos it's very hard to do LS->LS DMA in real-world usage (you need a static memory layout and synchronised tasks). In practice you do an LS->EA on one SPU and an EA->LS on another. If you're lucky this occurs at the same time, so it's shortcut; else it goes back into the main cache/memory system. Tho atomic put/get is higher priority, so it should be faster for 128 bytes than an LS->LS DMA anyway...

The ACU cache gives you a place to leave the data effectively on the ring bus for a while without knowing any details of the destination. It's partly LRU and AFAICT doesn't get evicted by a normal DMA get, tho a put does. It's also a high-speed ring bus op, faster than normal ring bus movement. So it should always be better than or the same as a normal get.

It's not perfect, but it does appear to be better than the alternatives 'most' of the time. Which is true of all caches really.
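As a rough illustration of the 'communication channel' abuse DeanoC describes, the sketch below has one SPU publish a 128-byte message with an unconditional atomic put (ATOMIC_SET in his terms) while another polls it with atomic gets. The sequence-number protocol and all names are invented for illustration, and the intrinsics (mfc_putlluc / mfc_getllar) follow my recollection of spu_mfcio.h, so treat it as a sketch rather than a recipe:

    #include <stdint.h>
    #include <spu_mfcio.h>

    typedef struct {                /* one ACU line used as a message slot     */
        uint32_t seq;               /* bumped by the producer for each message */
        uint32_t payload[31];
    } message_t;                    /* 32 * 4 = 128 bytes */

    static volatile message_t slot __attribute__((aligned(128)));

    /* Producer SPU: publish one message to the shared line at 'msg_ea'. */
    static void publish(uint64_t msg_ea, const uint32_t *data)
    {
        int i;
        for (i = 0; i < 31; i++)
            slot.payload[i] = data[i];
        slot.seq++;
        mfc_putlluc(&slot, msg_ea, 0, 0);   /* unconditional atomic put */
        (void)mfc_read_atomic_status();     /* wait for it to complete  */
    }

    /* Consumer SPU: spin on atomic gets until the sequence number advances. */
    static void wait_for_message(uint64_t msg_ea, uint32_t last_seq, uint32_t *out)
    {
        int i;
        do {
            mfc_getllar(&slot, msg_ea, 0, 0);
            (void)mfc_read_atomic_status();
        } while (slot.seq == last_seq);
        for (i = 0; i < 31; i++)
            out[i] = slot.payload[i];
    }

Since each atomic transfer moves the full 128-byte line in one go, the consumer should always see a consistent message without any extra locking.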

Legend

LOL. I vaguely remember your reply but I didn't quite grasp it last time.

DeanA said:

I'm not sure it'll make a huge difference... in fact, I'm interested as to why you think it would! Even without keeping this data in the 4-entry cache, it's my understanding that full LS-to-LS DMAs stay on the EIB... they don't go via main memory.

So bearing that in mind, I'm not sure why Deano is describing a system where data goes out from LS, to main memory, and back to LS, as that simply doesn't happen in the case of LS->LS DMA.


Ok cool... I have confirmation about efficient LS<->LS DMA (without PPE or other external subsystem involvement). The gain from the cache would be relatively smaller if so.

What is the time saved between an atomic cache write/read (cache hit) versus an LS atomic store/read (cache miss) for multiple SPUs?

DeanA said:

And surely using the SPU cache in this way pretty much requires that, in order to run at full speed, the other SPUs you're communicating with are not evicting cache contents by performing other DMAs? So your system needs to be pretty static in terms of DMA usage to reap the full benefit of what is described.


Yes, it seems so. The algorithm in question should be pretty regular/predictable (some globally shared data structure needs to be consulted/updated "every time"). In DeanoC's case it looks to be the death/alive counter.

Banned

Well, it's one of my favorite things on the PS3, and it gets little love cos it's one of those tiny features that make life so much nicer.

Atomic ops refer to the most important principle in multi-threading: a single processor must be able to read/modify/write without another thread interrupting (hence atomic). Without atomicity, multi-core systems are much harder (if not near impossible) to program.

The ACU(s) are a part of each SPU that allow atomic updates to occur very quickly. It appears fairly simple: each SPU has 512 bytes of cache (yes, contrary to what you might have heard, SPUs do have a tiny bit of cache). The 512 bytes are divided into four 128-byte lines. The MFC (DMA) unit can bring a cache line in from memory and mark it reserved… if another processor writes to the same bit of memory, the reservation is lost and you know to repeat the read/modify/write cycle to guarantee atomicity.

All good, but what's really clever is how it's implemented. If another SPU asks for that bit of memory, it gets it from the other SPU's cache, if it's in there. So you effectively have a fast SHARED 512 bytes. When an SPU writes, the other SPUs only have to read from the writing SPU's cache rather than it being DMAed back to main memory and DMAed into LS. Cuts down a lot of memory traffic. I even abuse it and just use it as a conventional cache and communication channel between SPUs. You can push a lot of data around with a fast 128-byte path.
And the nicest thing about it… It just works… All the cache snooping, routing etc. just happens magically inside the chip. You say ATOMIC_GET and ATOMIC_SET and treat yourself to a 512-byte shared cache.

So, for example, for some of the army stuff I need statistics to be kept (things like how many dead, KO'ed etc.).

These are 128-byte structures that each SPU reads/writes as necessary. When you first look at it, it seems it would be really slow if not for the ACU: all those 128-byte DMAs, every time I need to add a number I'm doing two 128-byte DMAs (one read/one write). But because it's sitting inside SPU cache most of the time, it ends up just being EIB ring traffic. And that's fast, really really fast.

I just have a shared counter statistics system that all works even tho I can be making 100's of atomic updates per frame.

Nice one… Whoever at STI added that bit of hardware deserves a beer from me.
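As a purely hypothetical sketch, the kind of 128-byte statistics block described here might look like the following; the field names are invented, and the update uses the same getllar/putllc retry loop sketched earlier in the thread:

    #include <stdint.h>
    #include <spu_mfcio.h>

    /* Exactly one ACU line: a handful of counters plus padding to 128 bytes. */
    typedef struct {
        uint32_t alive;
        uint32_t dead;
        uint32_t knocked_out;
        uint32_t pad[29];
    } army_stats_t;                 /* 32 * 4 = 128 bytes */

    static volatile army_stats_t stats __attribute__((aligned(128)));

    /* Each SPU reports a death with a reserve / modify / conditional-put loop. */
    static void report_death(uint64_t stats_ea)
    {
        do {
            mfc_getllar(&stats, stats_ea, 0, 0);
            (void)mfc_read_atomic_status();
            stats.dead  += 1;
            stats.alive -= 1;
            mfc_putllc(&stats, stats_ea, 0, 0);
        } while (mfc_read_atomic_status() & MFC_PUTLLC_STATUS);
    }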

Veteran

Since this has been posted in the Heavenly Sword thread already and there have been some really nice responses (thanks DeanA, DeanoC, etc.), I've decided to copy the posts over here, since some might miss them.

Regular

WOW! I learned something new today, thanks for the article, it's really a good read. And thanks for making it into a thread, coz I rarely go to the HS thread, coz there are too many posts to skim through just to get to the juicy parts.

Veteran

P.S. Free beer from me next time you guys (or any of the pushing-the-envelope guys) stop by the Bay Area.


This reminds me, any fellow game developers in the Bay Area may want to check out the SF game dev meetup. It used to be at Thirsty Bear, but now it's at the Metreon. The next meeting should be around mid-June (check out the site for details). There are lots of local developers in attendance. It's an informal get together, so please, no solicitors (i.e. people trying to sell middleware) or journalists.

Newcomer

This ACU - does it get used for successive non-128-byte-aligned DMAs... or just atomic ops?

e.g. let's say you're streaming through a list of 96-byte objects - do the crossover cache lines get buffered instead of adding main-memory accesses for the overlap?
Up until now I've been thinking in terms of manually buffering this sort of data with larger 128-byte-aligned loads (i.e. to get multiple misaligned objects in together, back to back).
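For what it's worth, the manual-buffering approach described in that last paragraph might look something like this: a single 128-byte-aligned, padded mfc_get pulls a whole batch of 96-byte objects in, and the caller indexes into the buffer at the original offset. The batch size, names and tag choice are assumptions for illustration:

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define OBJ_SIZE  96
    #define BATCH     16                 /* 16 * 96 = 1536 bytes per fetch */
    #define DMA_TAG   0

    /* Room for the batch plus up to 127 bytes of leading/trailing padding. */
    static uint8_t buf[BATCH * OBJ_SIZE + 256] __attribute__((aligned(128)));

    /* Fetch BATCH back-to-back 96-byte objects starting at effective address
       'ea' (which need not be 128-byte aligned) with a single aligned DMA. */
    static const uint8_t *fetch_batch(uint64_t ea)
    {
        uint64_t base   = ea & ~127ULL;                              /* round down */
        uint32_t offset = (uint32_t)(ea & 127);
        uint32_t size   = (BATCH * OBJ_SIZE + offset + 127) & ~127u; /* round up   */

        mfc_get(buf, base, size, DMA_TAG, 0, 0);                     /* one big DMA */
        mfc_write_tag_mask(1 << DMA_TAG);
        mfc_read_tag_status_all();                                   /* wait for it */

        return buf + offset;        /* object i lives at offset + i * OBJ_SIZE */
    }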

