If you are used to high-level graphics libraries – such as CoreGraphics – you might be surprised by the amount of effort it takes to achieve the same goals using Metal. That is because Metal is a highly optimized framework for programming GPUs. As opposed to a high-level graphics library that provides easy-to-use APIs for custom drawing, Metal was designed to provide fine-grained, low-level control over the organization, processing, and submission of graphics and compute commands to the GPU, as well as the management of the associated data and resources. In exchange, you can achieve much higher performance.

As you might expect, that doesn’t come cheap. As Uncle Ben once said, “Remember, with great power, comes great responsibility.”

For example, drawing a simple line on screen is not as straightforward a task as it sounds. If you want some thickness and rounded caps or joints, you have to translate that information into a data format the GPU understands – meaning you tessellate the line yourself. Besides, if you want good performance you need to understand the rendering pipeline and be aware of how the hardware operates, so your data is represented in a way that can be processed efficiently. And the rules are not the same for CPU-bound and GPU-bound operations.

It is surprisingly easy to write crappy code, performance-wise.

And that is what I want to share with you today – more specifically, how to work with vertex buffers.

There are a few practices you should always keep in mind when designing or developing Metal applications:

Keep your vertex data small: A lot of what happens in your scene depends on computations made on top of the vertices you have. One very simple example is transform operations such as translation, scaling, or rotation. When you translate an object, the translation matrix is multiplied by each vertex. The fewer vertices you have, the fewer multiplications are required. It is true that the number of vertices is tied to the quality of the scene, but keep in mind that objects far from the viewer don’t need as many vertices as objects that are near. Similarly, if you are working on a game, textures can be used to emulate many of an object’s features.
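A quick sketch of why the cost scales with vertex count (plain C, illustrative only): a 4×4 transform applied to a vertex is 16 multiplies and 12 additions, and you pay that for every vertex, every frame.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z, w; } Vec4;

/* Column-major 4x4 matrix, the convention Metal's simd types use. */
typedef struct { float m[16]; } Mat4;

/* Apply one matrix to every vertex: 16 multiplies + 12 additions each,
 * so the total work is strictly linear in the vertex count. */
static void transform_all(const Mat4 *t, Vec4 *verts, size_t count) {
    for (size_t i = 0; i < count; i++) {
        Vec4 v = verts[i];
        verts[i] = (Vec4){
            t->m[0]*v.x + t->m[4]*v.y + t->m[8] *v.z + t->m[12]*v.w,
            t->m[1]*v.x + t->m[5]*v.y + t->m[9] *v.z + t->m[13]*v.w,
            t->m[2]*v.x + t->m[6]*v.y + t->m[10]*v.z + t->m[14]*v.w,
            t->m[3]*v.x + t->m[7]*v.y + t->m[11]*v.z + t->m[15]*v.w,
        };
    }
}
```

Halve the vertex count of a distant object and you have halved this work for it – no visual quality lost if the detail wasn’t visible anyway.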

Reduce the pre-processing that must occur before Metal can transfer the vertex data to the GPU: When designing your vertex data structure, align the beginning of each attribute to an offset that is either a multiple of its component size or 4 bytes, whichever is larger. When an attribute is misaligned, iOS must perform additional processing before passing the data to the graphics hardware.
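Here is what that alignment rule looks like on a hypothetical interleaved vertex struct (the layout is mine, chosen to satisfy the rule; the compile-time checks document the constraint):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A hypothetical interleaved vertex. Each attribute starts at an offset
 * that is a multiple of max(component size, 4 bytes). */
typedef struct {
    float   position[3];  /* offset 0:  4-byte components        */
    float   normal[3];    /* offset 12: multiple of 4            */
    uint8_t color[4];     /* offset 24: 1-byte components, but   */
                          /* still begins on a 4-byte boundary   */
    float   uv[2];        /* offset 28: multiple of 4            */
} Vertex;

/* Verify the alignment rule at compile time. */
_Static_assert(offsetof(Vertex, normal) % 4 == 0, "normal misaligned");
_Static_assert(offsetof(Vertex, color)  % 4 == 0, "color misaligned");
_Static_assert(offsetof(Vertex, uv)     % 4 == 0, "uv misaligned");
```

If `color` had three bytes instead of four, `uv` would land at offset 27 and the rule would be broken – padding the color to 4 bytes is cheaper than the extra processing a misaligned attribute triggers.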

Avoid – or reduce the time spent – copying vertex data to the GPU: Transferring data from CPU to GPU memory space is generally the biggest bottleneck in a graphics application. That is because the GPU needs to wait for the CPU data to be copied over to its memory space. Metal allows you to write zero-copy implementations by using a CPU/GPU shared buffer. A shared buffer allows the CPU to write data while the GPU is reading it, resulting in high frame rates. That dance needs to be efficiently – and manually – managed to avoid having the CPU overwrite data the GPU is still processing. Techniques such as Double Buffering can be very effective for that purpose.
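The core of double buffering fits in a few lines. This is a stripped-down sketch (plain C; in a real Metal app each slot would be a shared-storage MTLBuffer, and you would gate slot reuse with a semaphore signalled from the command buffer’s completion handler – omitted here):

```c
#include <assert.h>

#define BUFFER_COUNT 2      /* double buffering */
#define VERTEX_BYTES 1024

/* A ring of CPU/GPU shared buffers. While the GPU reads the slot used
 * by the previous frame, the CPU writes the other one. */
typedef struct {
    unsigned char slots[BUFFER_COUNT][VERTEX_BYTES];
    int frame;               /* frames submitted so far */
} FrameRing;

/* Returns the buffer the CPU may safely fill for this frame. */
static unsigned char *next_writable(FrameRing *r) {
    unsigned char *buf = r->slots[r->frame % BUFFER_COUNT];
    r->frame++;
    return buf;
}
```

With two slots the CPU is never writing the bytes the GPU is reading; if the CPU can get more than one frame ahead, you bump `BUFFER_COUNT` to 3 (triple buffering) at the cost of extra memory.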

Reduce computations performed for each vertex: Objects go through a process called tessellation before they can be rendered. Tessellation consists of representing the object as a series of triangles. As these triangles are laid out side by side, many of the vertices are shared among multiple triangles. You can reduce the number of vertices – avoiding vertex duplication – by using triangle strips instead. A triangle strip requires N + 2 vertices to represent N triangles, as opposed to 3 * N in the traditional representation. For best performance, your objects should be submitted as a single indexed triangle strip.
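The arithmetic is worth seeing side by side (trivial C helpers, just to make the counts concrete):

```c
#include <assert.h>

/* Vertices needed to draw n triangles as a disjoint triangle list:
 * every triangle carries its own 3 vertices, shared or not. */
static unsigned triangle_list_vertices(unsigned n) { return 3 * n; }

/* Vertices needed for the same n triangles as one triangle strip:
 * 3 for the first triangle, then 1 per additional triangle, because
 * each new triangle reuses the previous two vertices. */
static unsigned triangle_strip_vertices(unsigned n) { return n + 2; }
```

For a mesh of 100 triangles that is 102 vertices instead of 300 – roughly a third of the per-vertex work and a third of the data to transfer.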

The GPU operations are not the only tasks you can optimize: There is actually a lot to talk about here. We could go over very old techniques such as loop unrolling, using the smallest acceptable types you can, reducing the number of operations you perform, pre-computing expensive operations (such as trigonometric functions)…the list goes on and on. There are, however, two techniques I find worth mentioning:

Leverage CPU vector optimizations as much as you can: Since the ’80s, CPUs have been built with optimizations for operating on tuples. For example, if you need to multiply a vertex by a scalar, instead of computing (x * s, y * s, z * s, w * s) component by component, you can compute (x, y, z, w) * s. That kind of operation happens on the CPU as 4 parallel multiplications. You can do that type of thing by using the simd library.
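On Apple platforms you would reach for `simd_float4` from `<simd/simd.h>`; the portable sketch below uses the Clang/GCC vector extension instead, purely to show the shape of the whole-tuple operation:

```c
#include <assert.h>

/* A 4-lane float vector via the Clang/GCC vector extension; on Apple
 * platforms, simd_float4 from <simd/simd.h> plays the same role. */
typedef float vec4 __attribute__((vector_size(16)));

/* One vector multiply scales all four components at once, instead of
 * four separate scalar multiplications. */
static vec4 scale(vec4 v, float s) {
    return v * s;
}
```

Beyond the parallelism itself, writing the code this way tells the compiler your intent, so it can keep the whole vertex in a single SIMD register instead of shuffling scalars around.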

Use Interleaved Vertex Data: You can specify vertex data as a series of separate arrays or as a single array whose elements each include multiple attributes. The preferred format on iOS is the latter – an array of structs with a single interleaved vertex format. Interleaved data provides better memory locality for each vertex.
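The two layouts look like this (illustrative C, sizes arbitrary). In the interleaved form, everything the vertex shader needs for vertex i sits in one contiguous 32-byte chunk, instead of three reads scattered across distant arrays:

```c
#include <assert.h>
#include <stddef.h>

/* Separate arrays ("struct of arrays"): the attributes of one vertex
 * live far apart in memory. */
typedef struct {
    float positions[1024][3];
    float normals[1024][3];
    float uvs[1024][2];
} SeparateArrays;

/* Interleaved ("array of structs"), the format preferred on iOS:
 * all attributes of vertex i are adjacent. */
typedef struct {
    float position[3];   /* offset 0  */
    float normal[3];     /* offset 12 */
    float uv[2];         /* offset 24 */
} Vertex;

typedef struct {
    Vertex vertices[1024];
} Interleaved;
```

When you describe this layout to Metal, the stride is simply `sizeof(Vertex)` and each attribute’s offset is its `offsetof` within the struct.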