Writing a Modern Metal App from Scratch: Part 2

In the previous article, we wrote enough Metal code to get the spinning silhouette of a teapot on the screen, but that still leaves a lot to be desired as far as a “modern” app is concerned. In this article, we’ll further flesh out the app and introduce lighting, materials, texturing, and managing multiple objects with scene graphs.

A Little Bit of Housekeeping

Thanks to some gentle chiding by my pal Greg Heo, I decided to slightly refine the property initialization code in the Renderer class. Rather than use implicitly-unwrapped optionals for members that require long-winded initialization, I refactored the creation of the vertex descriptor and render pipeline state into their own static methods. John Sundell gives a good overview of alternatives here. I find closure-based lazy vars distasteful, so I took the “lightweight factory” approach.

Do a git checkout step2_1 to get a version of the project with the initialization refactorings applied.

And now, back to our regularly-scheduled programming.

Going Deep

Because we’re currently writing a solid color from our fragment function, the 3D nature of the model isn’t really apparent. Let’s make a small modification to the fragment shader so we can visualize the eye-space normal of the model 1.

If you run the app now, you’ll notice that portions of the model that shouldn’t be visible are. It’s as though the triangles that comprise the model are being drawn without regard to which should be nearer or farther to the camera. That is, in fact, what is happening. To fix this issue, we need to tell Metal to store the depth of each fragment as we process it, keeping the closest depth value and only replacing it if we see a fragment that is closer to the camera. This is called depth buffering, and fortunately, it’s not too hard to configure.

The Depth Pixel Format

Depth buffering requires the use of an additional texture called, naturally, the depth buffer. This texture is a lot like the color texture we’re already presenting to the screen when we’re done drawing, but instead of storing color, it stores depth, which is basically the distance from the camera to the surface.

First, we need to tell the MTKView which pixel format it should use for the depth buffer. This time, we need to select a depth format, so we choose .depth32Float, which uses one single-precision float per pixel to track the distance from the camera to the nearest fragment seen so far.

Adding Depth to the Pipeline State

Now the view knows that it needs to create a depth texture for us, but we still need to configure the rest of the pipeline to deal with depth. Specifically, we need to configure our pipeline descriptor with the view’s depth pixel format:

The depthCompareFunction is used to determine whether a fragment passes the so-called depth test. In our case, we want to keep the fragment that is closest to the camera for each pixel, so we use a compare function of .less, which replaces the depth value in the depth buffer whenever a fragment closer to the camera is processed.

We also set isDepthWriteEnabled to true, so that the depth values of passing fragments are actually written to the depth buffer. Without this flag set, no writes to the depth buffer would occur, and it would essentially be useless. There are circumstances in which we want to prevent depth buffer writes (such as particle rendering), but for opaque geometry, we almost always want it enabled.

We call our new method in the renderer’s initializer:

depthStencilState = Renderer.buildDepthStencilState(device: device)

Rendering with Depth

Finally, we need to ensure that our render command encoder is actually using our depth-stencil state and our depth buffer to track the depth of each fragment. To do this, we set the depth-stencil state on the render encoder before issuing our draw calls:

commandEncoder.setDepthStencilState(depthStencilState)

If you run the app after these changes, you should see a much more plausible rendition of the teapot. No more overlapping geometry! (Tea-ometry?)

git checkout step2_2 to view the code up to this point.

Lighting and Materials

The art of illuminating virtual 3D scenes is a deep and much-studied field. We will consider a very basic model here, just enough to create a passable shiny surface. Our basic model calculates the lighting at a particular point as a sum of terms, each of which represents one aspect of how surfaces reflect light. We’ll look at each of these terms in turn: ambient, diffuse, and specular.

Ambient Illumination

The ambient term in our lighting equation accounts for indirect illumination: light that arrives at a surface after it has “bounced” off another surface. Imagine a desk illuminated above by a fluorescent light: even if the light is the only source of illumination around, the floor beneath the desk will still receive some light that has bounced off the walls and floor. Since modeling indirect light is expensive, simplified lighting models such as ours include an ambient term as a “fudge factor,” a hack that makes the scene look more plausible despite being a gross simplification of reality.

To implement ambient illumination in our fragment shader, we declare the “base color” of the object’s material to be bright red, then select an arbitrary “ambient intensity” of 0.3, or 30% of full brightness. The color of the fragment is the product of these two factors as seen in our revised fragment function:

For the time being, we lift the definitions of ambient intensity and base color out of the function. We’ll eventually pass them in as uniforms, since we want them to vary from object to object (and maybe from frame to frame).

Ambient illumination produces a flat-looking appearance, since it assumes that indirect light arrives uniformly from all directions.

git checkout step2_3 to view the code up to this point.

Diffuse Illumination

Diffuse illumination represents the portion of light that arrives at a surface and then scatters uniformly in all directions. Think of a concrete surface or a matte wall. These surfaces scatter light instead of reflecting it like a mirror.

Since we thought ahead and already have the surface normal and position in world space2 available in our fragment shader, we can compute the diffuse lighting term on a per-pixel basis.

It turns out that there is a fairly simple relation between how much light is reflected diffusely from a surface and its orientation relative to the light source: the diffuse intensity is the dot product of the surface normal (N) and the light direction (L). Intuitively, this means that more light is reflected by a surface that is facing “toward” a light source than one to which the incident light is “glancing.” We can express this by modifying the fragment shader as follows:

We need to normalize the surface normal in the fragment shader even if each vertex supplies a unit-length normal, because the rasterizer uses linear interpolation to compute the intermediate normal values, which does not preserve length. We calculate the direction to the light, L, as the vector difference between the world-space position of the light and the world-space position of the surface. This vector also needs to be normalized, since it can have arbitrary length.

To compute the diffuse light intensity being reflected from the surface at this point, we take the dot product between N and L, which will be 1 when they are exactly aligned, and -1 when they are pointing in opposite directions (e.g., when the light is behind the surface). Because a negative light contribution would produce odd artifacts, we use the saturate function to clamp the result to the range of [0, 1].

We add the diffuse intensity to the ambient intensity to get the total proportion of light reflected by the surface. Then, as before, we multiply this factor by the base color to get the final color.

git checkout step2_4 to view the code up to this point.

Specular Illumination

There is another component to illumination that is important for surfaces that reflect light in a more coherent fashion: the specular term. Surfaces where specular illumination dominates include metals, polished marble, and glass.

In order to compute specular lighting, we need to know a couple of vector quantities in addition to the surface normal and light direction. We use the worldPosition member of the structure produced by our vertex shader to store the location of the current vertex in world space:

The reason we need the world position of the vertex is that the vector from the viewing position to the surface affects the shape and position of the specular highlight, the exceptionally bright point on the surface where the most light is reflected from.

Specifically, we’ll use an approximation to specular highlighting devised by Blinn that uses the so-called halfway vector, the vector that points midway between the direction to the light and the direction to the camera position. By taking the dot product of this halfway vector and the surface normal, we get an approximation to the shiny highlights that appear on reflective surfaces. Raising this dot product to a quantity called the specular power controls how “tight” the highlight is. A low specular power (near 1) creates very broad highlights, while a high specular power (say, 150) creates pinpoint highlights. If you’re interested in the theoretical grounding of the halfway vector, consult papers on the microfacet model of surfaces, first popularized by Cook and Torrance in 1982.

Consult the vertex function to see how we generate the position and normal in world space:

We can now update the fragment shader to compute the view vector (the vector that points from the surface to the camera) and the halfway vector. The dot product of these vectors indicates the “amount” of specular reflection the surface generates. We raise this value to the specular power as mentioned before to get the final specular factor, which is multiplied by the light color (not the base color 3) to get the specular term, which we add to the previously computed ambient-diffuse combination:

If you run the code at this point, you should see a red teapot with a new, shiny white highlight. Our virtual scene is starting to look a little bit truer to life. Of course, we still have a long way to go on the road to realism.

git checkout step2_5 to view the code up to this point.

Textures

Most surfaces in the real world are not uniformly smooth or rough, nor do they have a single uniform color. Painted surfaces chip and wear unevenly over time. Metal statues develop patina. Wood and marble have an identifiable and distinctive grain. Many of these variations can be captured with the use of texture maps, which provide a way to “wrap” an image over a surface in such a way as to supply color information in a much more detailed manner than supplying a per-surface or even per-vertex color.

A texture map is just an image, but in order to know how it should wrap and stretch over a surface, each vertex has a 2-D set of coordinates–called texture coordinates–that indicate which texel (the texture equivalent of a pixel) should be affixed to that vertex. The texture coordinates are interpolated between vertices (just like the position and normal) to determine each fragment’s texture coordinates, which are what we will actually use to “look up” each fragment’s base color.

You can read this article for more about texturing in Metal, but we’ll cover the essentials below.

Loading Textures with MetalKit

In Metal, textures are objects that conform to the MTLTexture protocol. We could ask our device to create a blank texture and then fill it with image data we load from a file, but there’s an easier way.

MetalKit provides a very nice class called MTKTextureLoader that allows us to very easily load an image file into a texture. We make a texture loader by initializing it with a Metal device. Add the following to the bottom of the existing loadResources() method:

let textureLoader = MTKTextureLoader(device: device)

To make textures at run-time, we need images to load. It’s advisable to use Asset Catalogs to store these images, both because it makes loading them with MetalKit easier, and because Asset Catalogs support App Thinning, which can greatly reduce the size of the app bundle your user needs to download.

To add images to the project’s asset catalog, select Assets.xcassets in the project navigator and drag images into the side pane. Here, for reference, is what the asset catalog of the sample looks like:

Textures are expensive objects, in terms of how much memory they occupy, and in terms of how long they take to create, so we want to hold references to them rather than creating them frequently. We can add a property to store the base color texture in our renderer:

var baseColorTexture: MTLTexture?

Then we can add the actual texture loading code to the makeResources() method, referencing the image by its name in the asset catalog:

Note that we can supply an options dictionary to affect some aspects of how the texture loader works. One of the options we’re supplying here, .generateMipmaps, is beyond the scope of this article series, but you can read all about mipmapping in Metal here.

Sampler States

Now that we have a texture, we need to tell Metal how to “sample” from it. The details of texture sampling are somewhat beyond the scope of this article, but essentially, we need to decide what happens when more than one texel is mapped to a single fragment (this process is called minification), or when one texel stretches across multiple fragments (called magnification). These choices allow us to make a trade-off between how jagged or blurry our textures appear when there isn’t a one-to-one mapping between pixels and texels.

To communicate these choices to Metal, we create a sampler state. We can use different sampler states to hold different sets of sampling parameters. In the sample project, we’ll only need one sampler for now, so we can add a property for it to the renderer:

let samplerState: MTLSamplerState

To create a sampler state, we fill out a MTLSamplerDescriptor object, then ask the device to create a sampler state for us. We can add a new initializer method for this:

And as has now become common, call it from the renderer’s initializer:

samplerState = Renderer.buildSamplerState(device: device)

We’ll discuss how to actually use a sampler state object in the next section.

Using Textures in a Fragment Function

Now that we have a texture and a sampler state, we need to actually use them inside our fragment function to determine the base color of each fragment. We can tell Metal which textures and sampler states to use by assigning each of them to an index on the render command encoder:

Note that the texture parameter is of type texture2d<float, access::sample>, which means that it is a 2-D texture that we intend to sample from, and when we do sample it, we get back a float4 vector containing the RGBA components of the sampled color. Using the access::sample template parameter requires us to have created our texture in such a way that it can be sampled from. Fortunately, MTKTextureLoader takes care of this for us by default.

If we run the sample now, we can see that instead of a solid red teapot, we have a checkered teapot, using the base color texture map to affect the color at every fragment:

git checkout step2_6 to view the code up to this point.

From One Object to Many

Up until this point, our little world has consisted of a single, central object, but this approach quickly grows limiting. What we really want, in any game or application of reasonable size, is to represent many objects, positioned relative to one another. One popular way of representing objects and their relationships is a scene graph, which is a hierarchy of objects, related by the transforms that move us from the coordinate system of one object to another. Additionally, we might imagine that we want these different objects to have different materials, expressed as different combinations of textures and parameters.

In this section, we will refactor our code to account for many different objects by adding data structures that represent each of the conceptual entities we’ve dealt with so far (lights, materials, and objects), turning them into concrete objects that can be configured and combined into larger scenes.

Refactoring Lights

Previously, we hard-coded the various parameters of our lighting environment (ambient intensity, light direction, and light color) in the shader, but what if we want more than one light in our scene? It makes sense to collect these parameters into their own structure, which we can write (in a new file I’m calling Scene.swift) as:

Refactoring Vertex Uniforms

Let’s revisit the Uniforms structure. Whereas before, we were only passing uniforms into the vertex function, we’ll now need to pass uniforms into the fragment function as well, so let’s rename the Uniforms struct to VertexUniforms in Renderer.swift:

Refactoring Materials

We also need to refactor our representation of materials. The base color of the object naturally comes from the base color texture map, and we can change that each time we draw, but we also need a way to vary the shininess (specular power) of materials, and we’ll also find it convenient to control the color of the specular highlight 4. To do this, we create a new Material type. In Swift, it looks like this:

That completes the refactoring of uniforms. Now we can get back to the business of writing more flexible shaders that take advantage of these configurable materials and lights.

Lighting Shader Revisited

We need to adapt our fragment function to take the new fragment uniforms, so we add a new parameter that indicates which buffer to read them from (buffer 0), similar to the previously-existing uniforms parameter in the vertex function:

Finally, we pack all of those parameters into an instance of FragmentUniforms and tell the command encoder to write it to a buffer at the same index as the one we previously specified in the shader (index 0):

What’s the reward for all of this hard work? Well, now our shiny teapot is even shinier, since it’s lit from three different directions:

This section laid the groundwork for adding multiple objects to the scene and controlling them independently. In the next section, we’ll actually add multiple objects to the scene and show how to position them relative to one another.

git checkout step2_7 to view the code up to this point.

Nodes and Scene Graphs

In this section we’ll develop an abstraction that will help us do two important things: associate materials and geometry with different objects, and express how objects are positioned and oriented in the world relative to other objects.

A node, in this context, will be an object that has a transform that specifies where it is located and how it is oriented (this is just another 4×4 matrix). A node may also have a reference to a mesh and material which should be used to draw the node. Crucially, nodes also have a reference to their parent and a list of children, which is what defines their place in the hierarchy. A scene graph, then, is a tree of such nodes 6.

Node and Scene Types

We’ll start by laying out our basic Node type (this can go in the Scene.swift file, where we’re keeping such things):

We’ll also find it useful to declare a Scene type, which holds the root node (the node that is the ancestor of all other nodes), as well as some lighting parameters. The intention here is to remove as much “teapot-specific” state from the renderer class itself, as we continue to abstract our way toward supporting multiple objects with different materials:

First, we create the utility objects we need (the buffer allocator and texture loader). Then we instantiate the scene and configure its lighting environment. Then, we instantiate our teapot node, configuring it with a mesh and various material properties, finally adding it to our scene’s root node and thus adding it to our scene graph.

Now, we can remove the call to makeResources and construct our scene instead:

We then write an update method that advances the time variable, updates the view and projection matrices (in case the camera has moved or the view has resized), and updates the transform of the root node to keep the teapot spinning:

We call this method right at the top of our draw method in order to update every time we render a frame.

Drawing the Scene

Finally, we can put some pixels on the screen. Since we’re still going to wind up with one spinning teapot, this might feel like an anticlimax, but we’ve actually created something very flexible here, and we’ll see the power of the scene graph very shortly.

We can greatly simplify our draw method by extracting the parts of it that relate to setting uniforms and issuing draw calls into a new method. Here’s the revised method:

At the top, we compute a local model matrix, which is the product of the parentTransform parameter and the model matrix of the current node. This combined model matrix will move us to world space through the model space of our parent node. At the end of the method, we pass this same transform recursively into this method so that the children of this node are additionally transformed by the transform of this node. By descending the scene graph (tree) and transforming each node by the product of its ancestors’ transforms, we effectively position those objects relative to one another.

The interior of this method is very similar to our previous draw method: build the uniforms structs, set the texture, and draw. Here’s drawNodeRecursive in its entirety:

Finding Nodes by Name

Each node in the scene graph has a name. If these names are unique, we can use them to locate a node anywhere in the scene graph by name. Adding this functionality to our Scene and Node classes is another straightforward exercise in recursion.

First, let’s implement nodeNamedRecursive(:) in the Node class. The gist of the algorithm is this: for each child of the current node, if the child has the name we’re looking for, return that child. Otherwise, ask each child in turn whether it has a descendant with the name and return the first match. If no match is found, return nil. Here it is in Swift:

Creating a More Interesting Scene

“Okay, enough with the teapots,” I hear you saying. Let’s create a scene with some action. I’m a big fan of Kennan Crane’s public domain models, so we’ll borrow a couple of them.

Meet Bob and Blub

Bob is a model of an inflatable duck float. We’ll load it in pretty much the same way we loaded the teapot model, and also load its accompanying texture. We combine the mesh and texture in a new node named “Bob” and add it to the root of our scene like so:

We’ll create several Blubs and add them to Bob so they can do some aquabatics together. We load Blub in exactly the same way we loaded Bob, but this time we create several nodes in a loop and give each one a unique name so we can reference them later when animating (“Blub 1”, “Blub 2”, etc.):

Animating Bob and Blub

It doesn’t make much sense to have a pool float named Bob that doesn’t actually bob. Let’s add some code to the update method to displace Bob by a small amount vertically according to a sine wave, which will make him appear like he’s bobbing up and down on gently lapping water:

Note that we used the nodeNamed method to look Bob up by name. We could also have kept a reference to the Bob node when we created it.

Now let’s animate our Blubs. Their behavior is somewhat more complex: each fish’s transform is composed of a number of basic transform matrices that scale, translate, and rotate each fish into the correct size, position, and orientation:

And here’s the result, a rather engrossing display of marine talent, if I may say so:

git checkout step2_9 to view the final version of the code for this article.

In Closing

We covered a ton of ground in this article. From basic materials and lighting, to texturing, all the way to basics of scene graphs. I hope you’ve enjoyed following along! Feel free to comment below with any questions.

I’ve been intentionally vague about the specifics of normals so far. If you’re curious (and especially if you’re wondering how a normal matrix is calculated), other people have written well about that here and here. ↩

I’m fudging a bit here. In the previous article, we computed the normal in eye space rather than world space, even though we never used it. Between then and now, I’ve patched the code to use world space instead, since it winds up being more intuitive to think of the lights as entities that can move around independently in world space rather than being “attached” to the camera. The uniforms struct and math utilities have been slightly tweaked to provide proper normals in world space. ↩

It turns out that for materials that are non-metallic, the color of the specular highlight is dependent on the color of the light, not the color of the surface. You can notice this if you look at a colored plastic or ceramic surface under a spot light or track lighting. Metals, on the other hand, do “tint” their specular reflections, which is one of the most significant visual signifiers of metals versus non-metals. ↩

So far, we’ve been assuming that all materials are dielectrics, with a plastic-like appearance whose specular highlights take on the color of the incident light. As mentioned previously, metals have colored specular highlights, so we will use this new specularColor parameter to select a sort of “specular tint” color: when rendering dielectrics, we’ll select white or a shade of gray, while metals (like gold) will be permitted to have colored highlights. This isn’t physically-accurate, but it will allow us to represent metals slightly better. ↩

In general, the view matrix is the inverse of the world transform of the camera. We’re exploiting the fact that the inverse of a translation matrix is simply a translation matrix that translates in the opposite direction. ↩

There’s some controversy as to whether scene graphs are the “right way” to model these relationships. Some people argue—correctly, I think—that the rigid spatial hierarchy that scene graphs entail is not always the correct choice of the “one true scene representation.” This only gets worse when you need to traverse the tree for different purposes. An oft-cited rant on this is Tom Forsyth’s. See also the discussion here. We’ll stick with scene graphs for now, because they’re a simple way to model spatial relationships. ↩

6 thoughts on “Writing a Modern Metal App from Scratch: Part 2”

Thank you for your most recent posts and on top of that in Swift. 🙂
One of the most asked questions about Metal is how to draw lines with different widths. Can you make a post how to achieve it in the most efficient way. Please also include an example that can be run on Macs.

I actually agree with João – “how to render lines” (straight lines, splines etc) is always a puzzle in different pipelines. I usually end up just building some mesh and so on. What’s the Way to go in Metal2? That would be a great article!

Integrating rendering and physics isn’t my area of expertise, but I strongly recommend taking a look at the qu3e project by Randy Gaul. The architecture is questionable, and the use of old-school OpenGL means it won’t translate directly to Metal, but the engine itself makes for interesting reading.

When it comes to rigid-body simulation, a basic implementation should be pretty easily decomposable into distinct steps of simulation and visualization. First, you gather the current state of bounding geometry and transformations in some format that’s easily digested by your physics engine, then you run your simulation, then you apply the resolved transformations back to the scene, and finally render.

In a well-designed system, these are entirely separate concerns. Every modern game engine keeps separate copies of state for physics and rendering, so that they can be processed concurrently.