Session 603, WWDC 2016

Building on the fundamentals, dive into the specifics of constructing games and graphics apps with Metal. Learn about scene management and understand how to manage and update Metal resources. Understand the rendering loop, command encoding, and multi-thread synchronization.

[ Music ]

[ Applause ]

Hi, everyone, and welcome to WWDC.

I hope you're all having a good time so far and you've seen some nice sessions.

And if you really want to dig in heavy, we'll have an awesome talk about advanced shader optimization: shader performance fundamentals, tuning shader code, more detail about how the hardware works.

It'll be great.

So if you're really interested in tuning your shaders to make them the best they can be, check that out tomorrow.

So this is Part 2 of Adopting Metal and we're going to build on what we learned in Part 1.

We figured out how to get up and running.

So let's take a look at the concepts that you need to get the most out of Metal in a real-world situation.

We've got a demo that will draw a ton of stuff in a simple scene and we'll use that demo for context during today's session as we discuss and learn a couple lessons from it.

We'll talk about the ideal organization flow of your data, how to manage large chunks of dynamic data, the importance of synchronization between the CPU and the GPU, and, like I said before, some multithreaded encoding.

So hopefully you're familiar with the fundamentals of Metal because we won't be going over them again.

So we expect that you understand how to create a Metal queue, a Metal command buffer, how to encode commands, and we'll build on that to go forward.

So let's start with the demo itself and see what we're aiming towards.

So right now we've got 10,000 cubes and they're all spinning around, floating in space.

It's an interesting scene.

Metal allows us to issue a ton of draw calls with very low overhead.

So here we have 10,000 cubes and 10,000 draw calls.

You can see on the bottom there's a little shadow.

We're using a shadow map on the bottom, some nice anti-aliased lines to give you some depth cues, and of course all of our cubes.

So what goes into rendering a scene like this?

As you can see, we've got a lot of objects and each of these objects has its own associated piece of unique data.

We need the position, rotation, and color.

And this has to update every frame because we're animating them.

So this is a bunch of data that we're constantly changing, and we constantly have to re-inform the GPU about what we're drawing.

We can also draw a few more objects, maybe a little more.

You can spin it around a little bit and see that we're actually floating in space.

So we have a draw call per cube and a bunch of data per cube, and we have to think about the best way to organize this data, how to manage it, and how to communicate it to the GPU.

So let's dive right in.

Thanks, Jared.

Managing Dynamic Data: This is a huge chunk of data that's changing every frame.

And as you can imagine, in a modern app like a game, you also have a bunch of data that needs to be updated every frame.

So our draw basically looks like this.

We want to go through all the objects we're interested in drawing and update them.

Then we want to encode draw calls for every object and then we have to submit all these GPU commands.

We have a lot of objects.

We started at 10,000 and we were cranking it up to 100,000, 200,000.

Each of these objects has its own set of data and we have to figure out the best way to update this.

Now in the past, you might've done something like this.

You push updated data to the GPU, maybe uniforms or something, you bind a shader, some buffers, some textures, and you draw.

And you push some more data up.

You bind shader, buffers, textures.

You draw your next object.

In our scene we repeat this 10,000, 20,000 times, but we really want to get away from this sort of paradigm and try something new.

What if we could just load all our data upfront and have every command that we issue reference the data that was already there?

The GPU is a massively powerful processor and it does not like to wait.

So if all our data is already in place, we can just point the GPU to it and it will happily crunch away and do all our rendering for us.

And each draw call we make then references the appropriate data that's already there.

In our sample, it's very straightforward.

We have one draw that references one chunk of data.

So the first draw call references the first chunk of data, the second, the second chunk, and so on.

But it doesn't have to be that way and we can actually reuse data.

We have some data, like at the front here, frame data, that we can reference from all our draw calls, or we could have a draw call that references two pieces of data in different places.

If you're familiar with instancing,it's a very similar idea.

All your data will be in place before you start rendering.

So how do we do this in Metal?

In our application, we create one single Metal buffer and this is our constant buffer.

It holds all the data that we need to render our frame.

We want to create this upfront, outside of the rendering loop,and reuse it every time we draw.

We don't duplicate any data.

Again, any draw call can reference any piece of data,so there's no need for duplication.

Each draw call will reference an offset into the buffer.

We'll do a little bit of tracking to know which draw corresponds to which offset.

And then you'll just draw, and everything will be in place.

Let's take a look at the code for this.

Here's the code from the app.

You can think of us as having two sets of data.

Like I mentioned before, there's a set of frame data that we'll update here, and there's a set of data that will change per object.

This is the unique rotation position, et cetera.

So we need to put both sets of data in place.

Now what do I mean by per-frame data?

Well, this is data that is consistent across every draw call we make.

For example, in our sample we have a ViewProjection matrix.

It's a 4 by 4 matrix, very straightforward,if you're familiar with graphics.

It represents the camera transform and the projection.

This is not going to change throughout our frame, so we only need one copy of it.

And we'd like to reuse data as much as we can, so we can create one copy and put it into our buffer.

Let's start filling this out.

So here, we have our constant buffer, which is just a Metal buffer we've created.

And with the contents() function, we have a pointer to it.

Our app has a helper function, which is GetFrameData, and this returns that main pass structure I just showed you that has the view transform in it, the ViewProjection transform.

Excuse me.

And then we simply just copy this into the start of our buffer and then we're in place.
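The demo does this in Swift against a real MTLBuffer; as a rough, language-agnostic sketch of the same idea (the buffer size, names, and matrix values here are made up for illustration), copying one per-frame structure to the front of a big constant buffer looks like this:

```python
import struct

CONSTANT_BUFFER_SIZE = 64 * 1024
constant_buffer = bytearray(CONSTANT_BUFFER_SIZE)  # stands in for MTLBuffer.contents()

# Per-frame data: one 4x4 ViewProjection matrix (16 floats), written once at
# offset 0 and shared by every draw call in the frame.
view_projection = [1.0, 0.0, 0.0, 0.0,
                   0.0, 1.0, 0.0, 0.0,
                   0.0, 0.0, 1.0, 0.0,
                   0.0, 0.0, 0.0, 1.0]  # identity, just for illustration
frame_data = struct.pack("16f", *view_projection)
constant_buffer[:len(frame_data)] = frame_data

frame_data_offset = 0               # every draw call references this offset
next_free_offset = len(frame_data)  # per-object data gets packed after it
```

Because the frame data sits at a known offset, no draw call needs its own copy; they all bind the same region of the one buffer.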

So our buffer will look like this.

We'll have a MainPass with the appropriate data for our frame and we'll put it at the start of our giant constant buffer.

So now we have all this empty space afterwards.

And like we saw, we need to do 10,000, 20,000 draw calls, so we need to start filling this out with a ton of information.

So then we have a set of per-object data, and this is the unique data we need to draw a single object.

In our case, we have a single LocalToWorld transform, which is the concatenation of the position and the rotation, and we have the color.

So this is the set of data we need per draw call.

So we'll walk through every object we want to render.

We'll keep track of the offset into the buffer.

We have our updateData utility function, which will do our little update for our rotation, and then we'll update the offset.

This will pack our data tightly and we'll fill it out as we go through.
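The loop just described — walk every object, copy its data at the current offset, remember which offset belongs to which draw, then advance — can be sketched like this (pack_object, the stride format, and the 10,000 count are illustrative stand-ins, not the demo's actual Swift code):

```python
import struct

constant_buffer = bytearray(1 << 20)      # one big constant buffer
OBJECT_STRIDE = struct.calcsize("16f4f")  # LocalToWorld (4x4 floats) + RGBA color

def pack_object(buffer, offset, local_to_world, color):
    # Copy one object's data at the current offset, then advance past it.
    struct.pack_into("16f4f", buffer, offset, *local_to_world, *color)
    return offset + OBJECT_STRIDE

identity = [1.0 if i % 5 == 0 else 0.0 for i in range(16)]  # identity matrix
offsets = []  # which offset each draw call should reference
offset = 0
for _ in range(10_000):
    offsets.append(offset)  # the tracking: draw N binds offsets[N]
    offset = pack_object(constant_buffer, offset, identity, (1.0, 0.0, 0.0, 1.0))
```

After the loop, all 10,000 objects are packed tightly and each draw call just binds the buffer at its recorded offset.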

Let's take a closer look at what updateData looks like.

It's quite simple.

Now, animation is kind of out of the scope of this talk, so I have a little helper function here that's updateAnimation with a deltaTime.

This could be whatever you want in your own application, depending on what sort of animation you need.

But in my case it returns an objectData object which has the LocalToWorld transform and the color.

And just as I did before, I copy it into my constant buffer.

So here's what that looks like.

I've got my frame data in place.

I have my other data, another piece, and another piece.

So all our data is in place and we're ready for rendering.

But are we missing anything?

Turns out that we are and I want to bring your attention to this.

We have one constant buffer.

I mentioned I created one Metal buffer and I was reusing it.

Now there's a problem with this.

The CPU and the GPU are actually two unique parallel processors.

They can read and write the same memory at the same time.

So what happens when you have something reading from a piece of memory while something else is writing to it?

Resource contention.

So it looks a little like this.

The CPU prepares a frame and writes it to a buffer.

The GPU starts working on this and reads from the buffer.

The CPU doesn't know anything about this, so it decides, I'm going to prepare the next frame, and it starts overwriting the same data.

And now our results are undefined.

We don't actually know what we're reading from or writing to, or what the state of the data will be.

So it's important to realize that in Metal, this is not handled for you implicitly.

The CPU and GPU can write the same data at the same time however they'd like.

You must synchronize access yourself.

It's just like writing CPU code that's multithreaded.

You have to ensure you're not stomping yourself.

And that brings us to CPU-GPU synchronization.

Let's start simple.

The easiest way to do this would just be to wait after you've submitted commands to the GPU.

Your CPU draw function does all of its work, submits the commands, and then just sits there until it's ensured the GPU is done working.

That way we know we won't ever overwrite it because the GPU will be idle by the time we try to generate our next frame.

This won't be fast but it's safe.
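A toy model of that safe-but-slow pattern, with a plain event callback standing in for the GPU completion notification (this is a sketch, not Metal code; submit_to_gpu is a made-up stand-in for committing a command buffer):

```python
import threading

def draw_frame_naive(submit_to_gpu):
    done = threading.Event()
    # ... update object data, encode draw calls ...
    submit_to_gpu(done.set)  # completion callback fires when the "GPU" retires the work
    done.wait()              # CPU just sits here: safe, but CPU and GPU never overlap
```

Every frame is fully serialized: the CPU only starts frame N+1 after the GPU has retired frame N.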

So we need some sort of mechanism for the GPU to let us know, hey, I'm done with this, go do your thing.

Metal provides this in the form of callbacks.

We call them handlers, and there are two of them that are interesting. There's addScheduledHandler, which executes when a command buffer has been scheduled to run on the GPU.

And for us, an even more interesting one is the completion handler, and this is called when the GPU has finished executing a command buffer.

The command buffer is completely retired, and we're ensured at this point that it's safe to modify whatever resources we were using there.

So this is perfect.

We just need some way to signal ourselves that, hey, we're done, we can go forward.

Now how many of you are familiar with the concept of a semaphore?

Anyone? Pretty good.

Quick background on semaphores.

They are a synchronization primitive and they're used to control access to a limited resource, and that fits us perfectly here.

We have one constant buffer and that's a limited resource, so we'll have a semaphore and we'll create it with a value of 1.

The count on a semaphore represents how many resources we're trying to protect.

So we'll create our semaphore.

And again, this is something that should be created outside of your render loop.

And the first thing we do once we start to draw is we wait on the semaphore.

Now in Apple's semaphore API, we call it waiting.

Some people call this taking.

Some people call it downing.

It doesn't really matter.

The idea is that you wait on it, and we set our timeout to distant future, which effectively means we'll wait forever.

Our thread will go to sleep if there's nothing available and wait for something to do.

When we're done, in our completion handler we will signal the semaphore.

That'll tell us that it's safe to modify the resources again.

We're completely done with it and we can go forward.
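That wait/signal pairing can be sketched with an ordinary counting semaphore. Here threading.Semaphore stands in for the GCD semaphore the demo uses, and submit_to_gpu and on_gpu_completed are made-up stand-ins for committing a command buffer and its addCompletedHandler block:

```python
import threading

# One constant buffer to protect, so the semaphore starts at 1.
inflight = threading.Semaphore(1)

def on_gpu_completed():
    # Stands in for the completion handler: the GPU has retired the command
    # buffer, so the constant buffer is safe to overwrite again.
    inflight.release()  # "signal"

def draw_frame(submit_to_gpu):
    inflight.acquire()  # "wait": sleep until the buffer is free
    # ... write constants into the one buffer, encode, commit ...
    submit_to_gpu(on_gpu_completed)
```

The acquire blocks whenever the previous frame's buffer is still in flight, which is exactly the naive serialization described next.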

So this is sort of a naive approach to synchronization, but it looks a little like this.

Frame 0 we'll write into the buffer.

And on the GPU, we'll read from the buffer.

The CPU will wait.

When the GPU is done processing frame 0, it will fire the completion handler, and the CPU will go to work and create frame 1.

And that will process on the GPU and so on.

So this works but, as you can see here, we have all these waits and both the CPU and GPU are actually idle half the time.

It doesn't seem like a good use of our computing resources.

What we'd like to do is overlap the CPU and the GPU work.

That way we can actually leverage the parallelism that's inherent in this system, but we still need to somehow avoid stomping our data.

So we'd like our ideal workload to look like this.

Frame 0 would be prepared on the CPU, pushed to the GPU.

While the GPU is processing it, the CPU then gets to work creating frame 1, and so on, and again.

So one thing to keep in mind here is that the CPU is actually getting a little ahead of the GPU.

If you notice, where frame 2 is on the CPU, frame 0 is the only thing that's done on the GPU.

So we're a little bit ahead, and I want you to keep that in mind for a little later.

But first let's talk about our solution in the demo and what we do here.

We'd like to overlap our CPU and GPU, but we know we can't do it with one constant buffer without waiting a lot.

So our solution is to create a pool of buffers.

So when we create a frame, we write into one buffer and then our CPU proceeds to create the next frame while writing into another buffer.

While it's doing this, the GPU is free to read from the buffer that was produced before.

Now we don't have an infinite number of buffers because we don't have infinite memory.

So our pool has to have a limit.

In our application, we've chosen three.

This is something that you need to decide for yourself.

We can't tell you what to do because there are a lot of things that go into it: latency considerations, how much memory you want to use.

So we recommend you experiment with your app to find what fits for you.

For this example, we've chosen three.

So here, you can see we've exhausted our pool.

We have three frames that have been prepared but only one is finished on the GPU.

So we need to wait a little bit.

But by now, frame 0 is done, so we can reuse the buffer from the pool, and so on.

At the start, we'll wait on the semaphore and sleep if nothing's available.

We've enforced our ordering with enqueue and we push it through.

Now we know that mainCommandBuffer is the final command buffer in our frame.

And we know that we want to signal that our frame is done.

So we should add our completion handler to the mainCommandBuffer, and you could do this from within the dispatch.

So the mainCommandBuffer is the final command buffer.

We add the completion handler to it to signal our semaphore, and we commit it from within the dispatch, just like we did before.

Now you may notice here that I'm referencing self.semaphore, and a second ago I just told you to watch out for that.

So what's going on?

Well, it turns out a semaphore is a synchronization primitive, and we do actually want to be looking at the same one as all of our other threads.

So we want the value of the semaphore at the time the thread is executing.

So in this case, we actually want self.semaphore, something to be aware of.

And here's the recipe for our rendering.

At the start of our render function, we wait on the semaphore.

We select the current constant buffer.

We write the data that represents all of our objects into our constant buffer.

We encode the commands into command buffers.

We can do this single-threaded or multithreaded, however you'd like.

We add a completion handler onto our final command buffer and we use it to signal the semaphore to let us know when we're done, and we commit our command buffers.

And the GPU takes all this and starts chugging away at our frame.

So let's look at the demo again and see what this got us.

So here you can see in the top left, this is single-threaded encode mode, and you can see how many draws we're issuing: 10,000.

And in the top right, you can see the time it takes us to encode a frame.

So here we've got 5 milliseconds, and we can crank the number of draws up and see that it starts costing more and more as we draw things.

Now this is single-threaded mode.

And when you think about it, we're drawing a shadow map, which means we have to issue 40,000 draws in the shadow map, and then we're drawing the main pass, which means we have to issue another 40,000 draws to reference that.

But again, we can do this in parallel, so we've added a parallel mode to this demo.

And you can see how it's faster to go through.

Now take a look at everything that's going on.

You can fly around a little bit.

So here we have 40,000 cubes, unique, independent.

They're all being updated.

We're using GCD to encode a bunch of stuff in parallel.

We have two command buffers: one to generate the shadow map on the ground and one to render all of the cubes in color.
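Encoding those two command buffers in parallel is the GCD part of the demo. As a language-neutral sketch, a thread pool plays the role of GCD here, and the encode functions and their fake draw lists are illustrative stand-ins for real command-buffer encoding:

```python
from concurrent.futures import ThreadPoolExecutor

def encode_shadow_pass():
    # Stands in for encoding the shadow-map draws into one command buffer.
    return ["shadow draw %d" % i for i in range(4)]

def encode_main_pass():
    # Stands in for encoding the color draws into a second command buffer.
    return ["main draw %d" % i for i in range(4)]

# The demo uses GCD; a two-worker pool models encoding both passes at once.
with ThreadPoolExecutor(max_workers=2) as pool:
    shadow = pool.submit(encode_shadow_pass)
    main = pool.submit(encode_main_pass)
    # Commit in a fixed order no matter which thread finishes first
    # (the demo enforces submission order with enqueue()).
    command_buffers = [shadow.result(), main.result()]
```

The key point is that encoding happens concurrently while submission order stays deterministic.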

The lighting is quite simple, Lambert shading, which is basically what Warren talked about earlier, the N·L lighting.

And that's our demo.

This will be available as sample code for you guys to take a look at.

Hopefully you can rip it apart, take some of the ideas and the thoughts in it, and apply them to your own code.

So what did we talk about today?

When you walked in here, hopefully you came to Warren's session earlier, and maybe you knew a little bit about graphics or had done some programming before, but we took you through everything in Metal.

The conceptual overview of Metal, and the reasoning behind it: to use an API that is close to the hardware and close to the driver.

We learned about the Metal device, which is the root object in Metal that everything comes from.

We talked a bit about loading data into Metal and the different resource types and how you use them; the Metal shading language, which is the C++ variant you use to write programs on the GPU.

We talked about building pipeline states, prevalidated objects that contain your two functions, vertex and fragment, or a compute function, and a bunch of other baked-in, prevalidated state to save you time at runtime.

Then we went into issuing GPU commands: creating a Metal queue, creating command buffers off that queue, creating encoders to fill the command buffer in, and then issuing that work and sending it over to the GPU.

We walked you through animation and texturing and using setVertexBytes to send small bits of data to do your animation in.

Then, when the small bits of data weren't enough, we talked about managing large chunks of dynamic data and using one big constant buffer and referencing it in multiple places to get some data reuse out of the system.

We talked about CPU-GPU synchronization, the importance of making sure your CPU and your GPU aren't overwriting each other and are playing nicely.

And then lastly, we talked a little bit about multithreaded encoding, how you can use GCD with Metal to encode multiple command buffers on your queues at the same time.

And that's adopting Metal.

Hopefully you enjoyed the talk and you can apply some of these to your apps and make your apps even better than they already are.

If you'd like some more information, you can check out this website: developer.apple.com/wwdc/603.

We have a few more sessions tomorrow that I recommend you go check out.

At 11:00 o'clock, we have What's New in Metal, Part 1, and then a little later at 1:40, we have What's New in Metal, Part 2.

That'll tell us everything that's new in the world of Metal, awesome stuff you can add to your applications to make them better.

And then for you hardcore shader heads out there, we have Advanced Metal Shader Optimization at 3:00.

So if you want to know how to get the best out of your shaders, I recommend you go check out that talk.

It's really great.

Thanks for coming to hear us talk.

Welcome to WWDC.

Have a good rest of the week.

Thanks again.


ASCIIwwdc

Searchable full-text transcripts of WWDC sessions.

An NSHipster Project
