harlock wrote on Jan 9, 2014, 18:57: AMD’s most powerful APUs ever, the AMD A10 7850K and 7700K (codenamed “Kaveri”), are now shipping and will be on shelves in desktops early next week, with pre-orders starting today from select system builders. “Kaveri” is the world’s first APU to include Heterogeneous System Architecture (HSA) features, the immersive sound of AMD TrueAudio Technology and the performance gaming experiences of Mantle API. “Kaveri”-based notebooks will be available in the first half of this year.

To fully exploit the capabilities of parallel execution units, it is essential for computer system designers to think differently. The designers must re-architect computer systems to tightly integrate the disparate compute elements on a platform into an evolved central processor while providing a programming path that does not require fundamental changes for software developers. This is the primary goal of the new HSA design.

HSA creates an improved processor design that exposes the benefits and capabilities of mainstream programmable compute elements, working together seamlessly. With HSA, applications can create data structures in a single unified address space and can initiate work items on the hardware most appropriate for a given task. Sharing data between compute elements is as simple as sending a pointer. Multiple compute tasks can work on the same coherent memory regions, utilizing barriers and atomic memory operations as needed to maintain data synchronization (just as multi-core CPUs do today).
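As a rough illustration of the "just as multi-core CPUs do today" comparison, here is a minimal CPU-only sketch in standard C++. It does not use any HSA API: the names `sum_slice` and `shared_sum` are hypothetical, and ordinary threads stand in for compute elements. Each worker gets only a pointer into one shared buffer, and an atomic add keeps the shared total consistent.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a "compute element": each worker receives only a
// pointer to the shared buffer, never a copy of it.
void sum_slice(const int* data, std::size_t begin, std::size_t end,
               std::atomic<long long>* total) {
    long long local = 0;
    for (std::size_t i = begin; i < end; ++i) {
        local += data[i];
    }
    // Atomic read-modify-write keeps the shared total consistent.
    total->fetch_add(local, std::memory_order_relaxed);
}

// Splits the buffer across `workers` threads that all read the same memory.
long long shared_sum(const std::vector<int>& buffer, unsigned workers) {
    assert(workers > 0);
    std::atomic<long long> total{0};
    std::vector<std::thread> pool;
    const std::size_t chunk = (buffer.size() + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(buffer.size(), w * chunk);
        const std::size_t end = std::min(buffer.size(), begin + chunk);
        pool.emplace_back(sum_slice, buffer.data(), begin, end, &total);
    }
    // Joining every worker is the synchronization point before the read.
    for (auto& t : pool) {
        t.join();
    }
    return total.load();
}
```

Calling `shared_sum(buffer, 4)` on a vector holding 1 through 1000 should give 500500. The point of the sketch is only that no worker copies the buffer; on HSA hardware the same idea is meant to extend to the GPU.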

The HSA team at AMD analyzed the performance of Haar Face Detect, a widely used multi-stage video analysis algorithm that identifies faces in a video stream. The team compared a CPU/GPU implementation in OpenCL™ against an HSA implementation. The HSA version seamlessly shares data between CPU and GPU without memory copies or cache flushes, so it can assign each part of the workload to the most appropriate processor with minimal dispatch overhead. The net result was a 2.3x relative performance gain at a 2.4x reduced power level*. This level of performance is not possible using a multicore CPU alone, a GPU alone, or even a combined CPU and GPU with today’s driver model. Just as important, it is done using simple extensions to C++, not a totally different programming model.
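To put the two quoted figures together, here is a small arithmetic sketch. It assumes that "2.4x reduced power" means the HSA version draws 1/2.4 of the baseline power, which the text does not spell out; the function name `perf_per_watt_gain` is made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Performance-per-watt gain relative to a baseline of 1.0.
// Assumption: "Nx reduced power" means power is 1/N of the baseline, so
// perf/watt = speedup / (1 / power_reduction) = speedup * power_reduction.
constexpr double perf_per_watt_gain(double speedup, double power_reduction) {
    return speedup * power_reduction;
}
```

Under that reading, `perf_per_watt_gain(2.3, 2.4)` works out to about 5.5 times the baseline performance per watt.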