Wednesday, October 9, 2013

You might remember an interview with Sony's CTO last year in which he talked about the future of PlayStation. He described the PS4 chip as one where most of the processing is still done by the CPU & GPU, helped out by DSPs & programmable logic.

Fast forward to today & we have an SoC pictured in AMD's Next Generation APU (Kaveri) & HSA presentation.

The SoC shown in the documents fits the description: it has a DSP & a 'Fixed Function Accelerator', which could be the Vector Processor / Compute Module that some people have reported seeing but that we haven't heard much about.

"However, there's a fair amount of "secret sauce" in Orbis and we can disclose details on one of the more interesting additions. Paired up with the eight AMD cores, we find a bespoke GPU-like "Compute" module, designed to ease the burden on certain operations - physics calculations are a good example of traditional CPU work that are often hived off to GPU cores. We're assured that this is bespoke hardware that is not a part of the main graphics pipeline but we remain rather mystified by its standalone inclusion, bearing in mind Compute functions could be run off the main graphics cores and that devs could have the option to utilise that power for additional graphical grunt, if they so chose."

-It's not stock x86; there are eight very wide vector engines and some other changes. It's not going to be completely trivial to retarget to it, but it should shut up the morons who were hyperventilating at "OMG! 1.6 JIGGAHURTZ!".

-The memory structure is unified, but weird; it's not like the GPU can just grab arbitrary memory like some people were thinking (rather, it can, but it's slow). They're incorporating another type of shader that can basically read from a ring buffer (supplied in a streaming fashion by the CPU) and write to an output buffer. I don't have all the details, but it seems interesting.

-As near as I'm aware, there's no OpenGL or GLES support on it at all; it's a lower-level library at present. I expect (no proof) this will change because I expect that they'll be trying to make a play for indie games, much as I'm pretty sure Microsoft will be, and trying to get indie developers to go turbo-nerd on low-level GPU programming does not strike me as a winner.
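
The ring-buffer arrangement mentioned in the second point (CPU streams data in, a shader-like stage reads it and writes to an output buffer) can be pictured with a small Python sketch. This is only an illustration of the general producer/consumer pattern, not Sony's or AMD's actual interface; the names RingBuffer and stream_through, and the capacity of 4, are made up for the example.

```python
from collections import deque


class RingBuffer:
    """Fixed-capacity FIFO standing in for the CPU-fed ring buffer (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._slots = deque()

    def __len__(self):
        return len(self._slots)

    def push(self, item):
        """Producer side: returns False (and stores nothing) when the ring is full."""
        if len(self._slots) >= self.capacity:
            return False
        self._slots.append(item)
        return True

    def pop(self):
        """Consumer side: removes and returns the oldest queued item."""
        return self._slots.popleft()


def stream_through(inputs, kernel, capacity=4):
    """Push inputs through the ring in bursts and collect kernel results in order."""
    ring = RingBuffer(capacity)
    output = []
    next_in = 0
    while next_in < len(inputs) or len(ring) > 0:
        # "CPU" step: top up the ring until it is full or the input runs out.
        while next_in < len(inputs) and ring.push(inputs[next_in]):
            next_in += 1
        # "Shader" step: drain everything queued into the output buffer.
        while len(ring) > 0:
            output.append(kernel(ring.pop()))
    return output
```

Calling stream_through(list_of_values, some_function) returns the processed values in their original order; the consumer only ever sees what the producer has queued, which is the point of the streaming setup as opposed to the GPU grabbing arbitrary memory.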

A truly innovative feature - the Playstation 4's world-famous vector co-processor. It's connected to a CPU core. It doesn't look like much on paper:

100 million transistors

300 MHz

$25 to produce

25 nm production process

16-way SIMD floating-point multiply-add instructions (512-bit)

1024-bit load-store instructions

512 Kbytes local memory

L2 cache of the X86 CPU

Write-to-memory feature - bypassing the CPU

Has a connection to the instruction dispatch of a regular x86 Jaguar core
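
To put the "16-way SIMD (512-bit)" line from the list above in concrete terms, here is a rough Python sketch. It assumes 32-bit single-precision lanes, which the list does not state outright; the function name fma_vector is just for illustration and the code emulates the lanes one by one rather than running anything in parallel.

```python
VECTOR_BITS = 512       # vector width quoted in the list above
LOAD_STORE_BITS = 1024  # load-store width quoted in the list above
FLOAT_BITS = 32         # assumption: single-precision floating-point lanes

# 512 bits / 32 bits per float = 16 lanes, matching "16-way SIMD".
LANES = VECTOR_BITS // FLOAT_BITS

# A 1024-bit load-store would move two full vectors at a time.
VECTORS_PER_LOAD_STORE = LOAD_STORE_BITS // VECTOR_BITS


def fma_vector(a, b, c):
    """Emulate one SIMD multiply-add: every lane computes a*b + c."""
    if not (len(a) == len(b) == len(c) == LANES):
        raise ValueError("each operand must have exactly %d lanes" % LANES)
    return [x * y + z for x, y, z in zip(a, b, c)]
```

On real hardware all 16 lanes would be computed by a single instruction; the list comprehension here only shows what each lane does.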

It doesn't look like much, except for the 16-way multiply-adder! Besides that though - it only costs $25.00. "How can $25.00 worth of CPU logic improve the graphics pipeline?" Thus - top-secret alien technology. The world-famous vector co-processor. With it - developers can pull off pretty amazing effects, in real-time. I've already seen:

Full-screen per-pixel lighting! Pretty amazing - I haven't seen any games do that.

Extremely realistic fire effects

Extremely realistic ice simulation

Massive degrees of normal/displacement mapping

And this is just the start.

Playstation 3 - Cell

512 Kbytes L2 cache connected to the 7 SPUs via a Token-Ring bus - which they also used to communicate with each other

200 GB/s EIB L2 cache bandwidth

3.2 GHz clock rate

230 million transistors

$400 million to develop, $100 to produce

4-way SIMD

Playstation 4 - Vector Co-Processor

Developed by AMD and me (just a joke)

Based on the X86 architecture

One 512-bit floating-point multiply-adder

1.6 GHz clock rate

512 Kbyte local buffer + the CPU core's main buffer

Connected to the bus directly via a bus-interface unit, and to a CPU core

About 300 million transistors

$0 to develop, $80 to produce

307.2 GB/s buffer to execution unit

16-way SIMD

8 X86-based cores, each with 128-bit FP unit, L2 cache, and shared local memory between pairs of CPUs and between all 8
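
For what it's worth, the 307.2 GB/s figure in the list above can be reproduced from the other numbers if you assume the multiply-adder reads three 512-bit operands (a, b and c in a*b + c) every cycle at 1.6 GHz. That is my guess at the arithmetic, not something the list spells out, and the function name below is hypothetical.

```python
CLOCK_HZ = 1.6e9          # 1.6 GHz clock rate from the list above
VECTOR_BYTES = 512 // 8   # one 512-bit vector is 64 bytes
OPERANDS_PER_CYCLE = 3    # assumption: a*b + c reads three vectors per cycle


def buffer_bandwidth_gb_per_s(clock_hz, vector_bytes, operands_per_cycle):
    """Bytes delivered to the execution unit per second, in GB/s (1 GB = 1e9 bytes)."""
    return clock_hz * vector_bytes * operands_per_cycle / 1e9


bandwidth = buffer_bandwidth_gb_per_s(CLOCK_HZ, VECTOR_BYTES, OPERANDS_PER_CYCLE)
```

Under those assumptions bandwidth works out to 307.2, i.e. 102.4 GB/s per operand stream times three.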

Doesn't seem as powerful as Cell, does it? In fact - it's way, way more powerful. I just found this an interesting comparison. This design is more powerful because the CPUs do not require direct developer intervention to get them to perform, as opposed to Cell, which only worked if the developers specifically coded for it.

16-way SIMD makes it 4 times faster, per clock. I.e. to do 16 FP multiply-adds, the Cell takes 8 cycles: 4 load-stores, followed by 4 multiply-adds. But for the Vector Co-Processor to do 16 FP multiply-adds, it takes 1 cycle. The Out-of-Order, 512-bit load-stores load the data while the Vector Co-Processor is processing the previous data. Thus - the Vector Co-Processor alone is equal to 8 SPUs.
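
The cycle counting above can be written out as a tiny Python model. It only encodes the assumptions stated in the paragraph (one vector op per cycle, one load-store cycle per vector op unless loads overlap with compute); it counts cycles only and ignores the different clock rates in the two lists. The function name cycles_for_fmas is made up for the sketch.

```python
import math


def cycles_for_fmas(n_fmas, simd_width, loads_overlap):
    """Cycles to issue n_fmas multiply-adds under the simple model described above."""
    # Each vector instruction covers simd_width multiply-adds.
    vector_ops = math.ceil(n_fmas / simd_width)
    # Without overlap, every vector op needs its own load-store cycle first.
    load_cycles = 0 if loads_overlap else vector_ops
    return vector_ops + load_cycles


# Cell SPU as described: 4-way SIMD, load-stores not overlapped with compute.
cell_cycles = cycles_for_fmas(16, simd_width=4, loads_overlap=False)

# Vector Co-Processor as described: 16-way SIMD, loads overlapped with compute.
vcp_cycles = cycles_for_fmas(16, simd_width=16, loads_overlap=True)

ratio = cell_cycles // vcp_cycles
```

With these inputs cell_cycles is 8 and vcp_cycles is 1, which is where the "equal to 8 SPUs" figure comes from.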

The PS4 specter vector? I'm not 100% sure what it is at this stage... I'm 45% leaning towards physics offloading or helping out within that department. The other 55% is screaming a modified component for helping PS3 games work within the G/Cloud environment.