Monday, 29 June 2015

Inside VideoCore

The GPU in the Raspberry Pi is remarkably powerful and energy-efficient. Tim Anderson talks to its creators…

“This is for me probably the finest bit of engineering I’ve ever been involved in,” says Eben Upton, founder of the Raspberry Pi Foundation. He is talking about VideoCore IV, the Broadcom GPU (graphics processing unit) in both the original Raspberry Pi and the newer Raspberry Pi 2.

We chatted to Eben, along with director of software Gordon Hollingworth and director of hardware engineering James Adams, about the history of VideoCore, what makes it so impressive, and how you can take advantage of it in your own projects.

The origins of VideoCore go back to a Cambridge-based firm called Alphamosaic. “The company was spun out of Cambridge Consultants in 2000. The guys who did the very first Orange videophone figured there was a market for a dedicated multimedia chip to do the video compression. VideoCore I was basically a two-dimensional DSP [digital signal processing] engine plus SRAM plus peripherals,” recalls James Adams, who worked on the project at Alphamosaic.

SRAM (static RAM) is memory that keeps data without having to be constantly refreshed, and is faster than dynamic RAM, though it also uses more power and takes more space. VideoCore I was used in a number of mobile phones.

The firm went on to develop VideoCore II. “It was a refinement with more power, a bit more memory, fixed all the issues, and dual issue on the scalar side,” says James. ‘Dual issue’ is the ability to execute two instructions per cycle. “It won the slot in a popular video media player of the time.”

Eben Upton did not work on VideoCore until later, but says that even these early versions had a cool architecture. “You have a 64×64 register file and a fairly conventional scalar CPU that’s got its own register file. But there is also this vector ALU (arithmetic logic unit) that is 16-way and can access strips of pixels anywhere in the 64×64 register file. It turns out this is incredibly useful for doing the kinds of operations you need to do for video, such as lots of the transform operations.

“On the front end you have the instruction fetch that then issues instructions for one of those two paths (scalar ALU or vector ALU). That turns out to be fantastically powerful.” Parts of this design still live on in the current VideoCore.
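As a rough illustration of the register file Eben describes, the toy model below treats it as a 64×64 grid from which a 16-way vector unit can read a strip either along a row or down a column. The class and method names are hypothetical and do not reflect the real VideoCore instruction set; the sketch only shows why two-dimensional access suits operations such as transforms, which typically work on rows and then on columns.

```python
# Toy model only: names and behaviour are illustrative assumptions,
# not the real VideoCore instruction set.

class RegisterFile2D:
    SIZE = 64    # the register file is 64x64 cells
    LANES = 16   # the vector ALU is 16-way

    def __init__(self):
        self.cells = [[0] * self.SIZE for _ in range(self.SIZE)]

    def _check(self, row, col, rows, cols):
        # Reject strips that would fall outside the register file.
        if row < 0 or col < 0 or row + rows > self.SIZE or col + cols > self.SIZE:
            raise IndexError("strip falls outside the 64x64 register file")

    def read_row_strip(self, row, col):
        """Return 16 horizontally adjacent values starting at (row, col)."""
        self._check(row, col, 1, self.LANES)
        return self.cells[row][col:col + self.LANES]

    def read_col_strip(self, row, col):
        """Return 16 vertically adjacent values starting at (row, col)."""
        self._check(row, col, self.LANES, 1)
        return [self.cells[row + i][col] for i in range(self.LANES)]


def vector_add(a, b):
    """One 16-way vector operation: lane-wise addition."""
    return [x + y for x, y in zip(a, b)]


rf = RegisterFile2D()
for r in range(RegisterFile2D.SIZE):
    for c in range(RegisterFile2D.SIZE):
        rf.cells[r][c] = r * 64 + c   # fill with recognisable values

row_strip = rf.read_row_strip(2, 8)   # 16 values along row 2
col_strip = rf.read_col_strip(2, 8)   # 16 values down column 8
result = vector_add(row_strip, col_strip)
```

Here the same starting cell is read once as a horizontal strip and once as a vertical strip, and the two are combined in a single lane-wise operation.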

Alphamosaic was acquired by Broadcom in 2004, when VideoCore II was beginning production and the team was getting started on VideoCore III, which was to support 720p resolution (1280×720).

“It was funny because at the time, I’m not convinced that most of the team believed in 720p on a phone. Everyone goes, I know we’re doing it but why? Why would you want to plug this into a TV? It’s a phone,” recalls Gordon Hollingworth.

The design of VideoCore III was influenced by what people were attempting to do with VideoCore I and II. “People were trying to do composited display output, stuff on the camera, decode and encode video, and even try and do some 3D graphics,” Eben tells us.

VideoCore III also saw the move to dedicated video hardware. “For higher resolutions you need a lot more RAM, so the SRAM was chucked. It’s a more conventional architecture with a lot of dedicated hardware blocks for 3D, video, camera processing, and scale display,” explains James.

On the table is a prototype board for VideoCore III, dating from 2005, and saved from the skip because it bears the number 3.14.

“I saw this on the day I came for interview [at Broadcom]”, remembers Eben. “This was then a fixture of my life for over a year until we got the silicon back.”

James led the 3D graphics team. “It’s a nice plug-in block, it’s got an interface, and it’s just a black box to the rest of the chip,” he says, though it does use the 2D vector processor to do shader processing, “which actually worked surprisingly well.”

Eben describes the 3D processing in VideoCore III as like a train track. “We have a buffer here, and a vector register file, and then four shadow vector register files which were 4K blocks, and we have a 3D pipeline. The 3D pipeline would prepare some pixels or some vertices and would deposit them, then push the shadow VRFs off round the train track. Then the VPU (vector processing unit) would sit there waiting to be signalled that something had arrived, it would read the vertex data, perform the vertex or pixel shading operation, and it would then write the shaded vertices or the colour back. The shadow VRFs would then amble off around the train track a bit more and then end up back in the 3D pipeline, so the 3D pipeline is both a front-end and a backend.” In its back-end role, the 3D pipeline pushes the pixels into the frame buffer. “That was a really nice architecture,” says Eben. “You were able to reuse quite a lot of logic over here which you could then use for other stuff, when you weren’t doing 3D.”

Another challenge in VideoCore III was to adapt it for general-purpose use. In earlier devices like media players, the chip would be used in one of several modes, such as to play audio or video, and you could optimise accordingly. When you have to support a general-purpose operating system, the usage is unpredictable and contention, where the chip is asked to do several different things simultaneously, is more likely. “What you end up with is a much more conventional architecture,” says Eben.

The work on VideoCore III resulted in support for H.264, MPEG2, MPEG4, and VC1 at 720p, plus the ability to encode video in MPEG4 or H.264.

Power optimisation – a special ingredient

There was also a huge focus on power optimisation. “It’s one of the really special ingredients,” says James.

What goes into power optimisation? The high-level architecture needs to avoid energy-expensive operations, while at the micro-architectural level, the design has to be amenable to things like clock gating, where the clock to parts of the circuit is switched off to save power. There is also work at the tools level, the tools being the design software (for synthesis and layout, for example) used to turn the design into an ASIC (application-specific integrated circuit). “If you’ve wrapped up some opportunities in your microarchitecture itself to do all sorts of clever stuff, you need to then drive the tools properly to make sure they actually exploit those opportunities,” explains Eben.

Imagine the chip gets an easy video clip to decode. Should you turn the voltage down to run the core slower, or run it fast and then turn it off so it is only running half the time? “The balance between those two is a non-obvious calculation,” advises Gordon Hollingworth.
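To see why the trade-off is not obvious, here is a minimal sketch using the textbook approximation that dynamic power scales with capacitance × voltage² × frequency, plus a fixed leakage power while the core is on. All of the numbers (effective capacitance, voltages, clock speeds, and leakage figures) are invented for illustration and are not VideoCore measurements; the core is assumed to draw nothing once it is switched off.

```python
# Illustrative model only: every constant below is an assumption.

def dynamic_power(c_eff, volts, hz):
    """Textbook approximation: P_dynamic = C * V^2 * f (watts)."""
    return c_eff * volts ** 2 * hz


def energy(c_eff, volts, hz, cycles, leakage_w):
    """Energy in joules to run `cycles` cycles, then power off completely."""
    active_seconds = cycles / hz
    return (dynamic_power(c_eff, volts, hz) + leakage_w) * active_seconds


C_EFF = 1e-9      # assumed effective switched capacitance (farads)
WORK = 200e6      # assumed cycles needed to decode the clip

# Case 1: low leakage. Running slowly at a low voltage wins.
slow_low = energy(C_EFF, 0.9, 200e6, WORK, leakage_w=0.05)
fast_low = energy(C_EFF, 1.2, 400e6, WORK, leakage_w=0.08)

# Case 2: high leakage. Running fast and switching off wins.
slow_high = energy(C_EFF, 0.9, 200e6, WORK, leakage_w=0.5)
fast_high = energy(C_EFF, 1.2, 400e6, WORK, leakage_w=0.6)

print(f"low leakage:  slow={slow_low:.3f} J  fast={fast_low:.3f} J")
print(f"high leakage: slow={slow_high:.3f} J  fast={fast_high:.3f} J")
```

With these made-up figures, the better strategy flips purely because of the leakage term, which is one reason the balance has to be worked out for each chip and workload rather than assumed.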

You might think that power optimisation is more important in a battery-powered device like a smartphone than in something like the Pi, which is more often mains-powered, but in fact it is important in all scenarios. “Power dissipation is about more than the actual power usage,” says Gordon. “It has all got to go somewhere and results in the same thing, which is heat.”

VideoCore III went into production in 2007. “The first tapeout was June 4th,” recalls Eben. “It was going to tapeout in the first half of the year, even if we had to add more days.”

‘Tapeout’ is the term for when the final circuit of a chip is sent for manufacture.

A little-known fact is that Eben built a prototype Raspberry Pi based on VideoCore III. “I did build a thing, but it never went into a shipping product,” he says. “I had a port of Python to VideoCore III and it was the first thing I showed to people. It had a PS/2 keyboard interface on a bit of Veroboard. So Raspberry Pi was actually founded as an organisation with a view to building something on VideoCore III. But because it’s a proprietary closed architecture, you end up building everything yourself. It could have made it into a product, but it wasn’t the thing.”

Introducing VideoCore IV

The thing, rather, was VideoCore IV, which is when the chip moved from an immediate mode to a tiled mode architecture, where the image is rendered in a series of small tiles rather than as a single large image. “James had been kicking round this idea of building a tile-mode architecture for fun. And then we just proposed it,” recalls Eben. “I was massively surprised when we were allowed to do [it]; we were just cut an enormous amount of slack.”

“What happened with III to IV was that the 3D team effectively went off on their own, got more resources, were allowed to rip up most of what we’d done for the current 3D and restart,” says James.

The advantage of tile mode is less memory traffic, which translates to less energy. “You trade up-front work in figuring out which bits of the screen a triangle will cover for less overall energy,” says Eben.
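The up-front work Eben mentions is usually called binning. The sketch below shows one simple, conservative way to do it: take each triangle’s bounding box and record every tile the box overlaps. The 64-pixel tile size, the function names, and the bounding-box approach are all assumptions for illustration; the article does not describe how VideoCore IV actually bins primitives.

```python
# Conservative bounding-box binning. Tile size and names are assumptions,
# not a description of the VideoCore IV hardware.
import math
from collections import defaultdict

TILE = 64  # assumed tile edge length in pixels


def tiles_for_triangle(tri, width, height, tile=TILE):
    """Return (tile_x, tile_y) for every tile the triangle's bounding box touches."""
    xs = [x for x, _ in tri]
    ys = [y for _, y in tri]
    # Clamp the bounding box to the screen.
    x0 = max(0, math.floor(min(xs)))
    y0 = max(0, math.floor(min(ys)))
    x1 = min(width - 1, math.floor(max(xs)))
    y1 = min(height - 1, math.floor(max(ys)))
    if x0 > x1 or y0 > y1:
        return []  # entirely off-screen
    return [(tx, ty)
            for ty in range(y0 // tile, y1 // tile + 1)
            for tx in range(x0 // tile, x1 // tile + 1)]


def build_bins(triangles, width, height):
    """Map each tile to the indices of the triangles that may cover it."""
    bins = defaultdict(list)
    for index, tri in enumerate(triangles):
        for t in tiles_for_triangle(tri, width, height):
            bins[t].append(index)
    return bins


triangles = [
    [(10, 10), (100, 20), (50, 130)],       # spans a 2x3 block of tiles
    [(-50, -50), (-10, -20), (-30, -5)],    # entirely off-screen
]
bins = build_bins(triangles, 1920, 1080)
```

Once the bins are built, each tile can be rendered using only the triangles in its list, with intermediate results kept in a small on-chip buffer instead of travelling to and from main memory for every pixel.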

Another part of this change was the introduction of a processor called the QPU (quad processor unit), a term which is unique to Broadcom. It is called ‘quad’ for two reasons. “One, because the hardware is four-way parallel, and two, because we have this notion of a quad, four adjacent pixels,” explains Eben. “You can’t really make a 3D core that pushes single pixels through because you can’t get derivative information to select mipmaps.”
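The point about derivatives can be made concrete with a small sketch. Given the texture coordinates of the four pixels in a quad, the differences between neighbours approximate how fast the texture is moving across the screen, and the base-2 logarithm of that rate picks the mipmap level. This is the common textbook formulation, written with hypothetical names; it is not a description of the exact calculation the QPU hardware performs.

```python
# Textbook mipmap level selection from a 2x2 pixel quad.
# Function and parameter names are illustrative assumptions.
import math


def mip_level(quad_uv, tex_size):
    """quad_uv holds (u, v) for top-left, top-right, bottom-left, bottom-right."""
    (u00, v00), (u10, v10), (u01, v01), _ = quad_uv
    # Finite differences between neighbouring pixels, in texel units.
    dudx, dvdx = (u10 - u00) * tex_size, (v10 - v00) * tex_size
    dudy, dvdy = (u01 - u00) * tex_size, (v01 - v00) * tex_size
    rho = max(math.hypot(dudx, dvdx), math.hypot(dudy, dvdy))
    if rho <= 1.0:
        return 0.0  # magnified or 1:1, so use the full-resolution level
    return math.log2(rho)


step = 1 / 256  # one texel of a 256-texel texture

# One texel per pixel: use the base level.
one_to_one = [(0.0, 0.0), (step, 0.0), (0.0, step), (step, step)]
# Four texels per pixel: two levels down.
minified = [(0.0, 0.0), (4 * step, 0.0), (0.0, 4 * step), (4 * step, 4 * step)]

level_a = mip_level(one_to_one, 256)
level_b = mip_level(minified, 256)
```

A single pixel on its own has no neighbours to difference against, which is why shading in groups of four adjacent pixels is the natural unit of work.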

The QPU then is a lightweight processor for doing shading, though the team gave it unusual flexibility. “Your conventional shader processor is a very siloed architecture, so you’ve got say 16 SIMD [single instruction, multiple data] channels and they proceed along in stately isolation from each other. A lot of what made VideoCore was the ability to get data between channels. You don’t need that for 3D, but we wanted to keep some element of that. For example, there is a rotate capability so you can take one of your 16 vectors which you’ve read out of the register file and you can rotate it by a certain amount; that turns out to be totally useless for 3D and super-useful for pretty much everything else.

“We also ended up with a thing which we called the VPM, the Vertex and Primitive Memory, which all of the QPUs can access, so when you need it you have some transposition capability.”
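One reason a rotate between lanes is useful for non-3D work is that it lets a vector be summed in a handful of steps. The sketch below models a 16-element vector as a Python list and adds it to rotated copies of itself four times, after which every lane holds the total. The rotate direction and the function names are assumptions; the article does not specify how the QPU’s rotate behaves.

```python
# Lane rotation used for a reduction. Names and rotate direction are
# assumptions, not the documented QPU behaviour.

def rotate(vec, n):
    """Rotate the lanes of vec by n positions (element n moves to lane 0)."""
    n %= len(vec)
    return vec[n:] + vec[:n]


def horizontal_sum(vec):
    """Sum all lanes using log2(len) rotate-and-add steps.

    Assumes the number of lanes is a power of two, such as 16.
    """
    acc = list(vec)
    shift = len(acc) // 2
    while shift >= 1:
        rotated = rotate(acc, shift)
        acc = [a + b for a, b in zip(acc, rotated)]
        shift //= 2
    return acc  # every lane now holds the full sum


lanes = list(range(16))
totals = horizontal_sum(lanes)
```

In a purely siloed design, where lanes cannot see each other’s data, this kind of reduction would need a round trip through memory instead.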

Eben wrote the QPU proposal at the end of 2007. The Broadcom team spent 2008 writing VideoCore IV, with the first samples arriving in 2009. VideoCore IV supports 1080p (1920×1080) encode and decode along with OpenGL ES 2.0, the embedded system version of the popular 3D graphics API.

Work is also under way to support H.265, the next-generation video compression standard, also known as High Efficiency Video Coding. “We’ve got somebody working on H.265 at the moment and he is in the process of trying to get 1080p 24 H.265 support up and running,” discloses Gordon.

“If you look at Raspberry Pi, the performance is pretty damn good. That’s a chip we taped out in 2009. It’s crushingly good performance,” says Eben.

“It’s also the first unified shader architecture in mobile. In fact, it beat some of the PC guys,” adds James. “It’s a well-understood architecture, it’s power-efficient, and it’s now very well supported. One of the great things about Pi is how stable the software is. We like our architecture and we also like the form factor of the Pi.”

Opening up VideoCore

Another key feature of VideoCore IV is how open it is. “One of the wonderful things is that Broadcom released the docs,” says Eben. “That is completely unheard of. Mali [ARM], you can’t get the docs. Imagination Technology, you can’t get the docs. Adreno [Qualcomm], you can’t get the docs. Nvidia has a nod in this direction. In the mobile world, nobody releases the docs and nobody releases the driver sources. In February 2014, Broadcom released the full architecture docs and a full driver implementation for one of the other chips that uses VideoCore.”

The extent of the Pi’s openness is increasing. In June 2014, as a result of Broadcom’s publication of VideoCore documentation, a Linux graphics driver developer called Eric Anholt joined the team, coming from Intel, where he had been working on Intel’s support for the Mesa open-source graphics stack. He is now working on an MIT-licensed (a permissive free software licence) Mesa and kernel DRM (direct rendering manager) driver for the Raspberry Pi. This will bypass most of the code which currently runs in VideoCore firmware as a closed-source ‘blob’ which is called by software drivers.

“This will be the first driver for a mobile-side modern PC graphics architecture which has been developed by open specification,” reveals Eben. “I expect we will ship at least a technical preview this year.”

What’s next? “There is a VideoCore V,” says Eben, who will not be pressed on further details. “It’s freaking awesome.”

Beyond video

What can you do with VideoCore IV? It is key to the appeal of the Raspberry Pi, especially in Raspberry Pi 2 with its faster CPU and increased RAM, bringing more balance to the capabilities of the GPU and CPU sides. Aside from the ability to use the Pi as a PC with surprisingly capable graphics, VideoCore enables media-centre and camera applications.

What if you are using the Raspberry Pi as a headless board, or want to exploit the GPU to accelerate non-graphical applications? Eben Upton is not an enthusiast for OpenCL, an open standard API for using the GPU for general-purpose computing, rather than just to drive a display.

“I’m a massive OpenCL skeptic,” reveals Eben Upton. He points out that you can do image processing, a common use case for OpenCL, with OpenGL instead. While there are examples of other uses, they are niche, he argues. “There are people using it for finance. But where’s the consumer killer app for this stuff? It’s image processing, right? You could do image processing in a much less expressive language.”

That said, it is possible to use VideoCore for general-purpose computing. It is ideal for fast Fourier transforms (FFT), for example, but you have to code in assembly language as well as understanding VideoCore at a low level. “We’ve helped out people who email us and say how do I do this, or ask about a strange feature. We’re keen to help people with that. Is it a mainstream thing? Not unless we come up with some sort of API,” says James Adams.