Session 606, WWDC 2016

The Metal shading language is an easy-to-use programming language for writing graphics and compute functions which execute on the GPU. Dive deeper into understanding the design patterns, memory access models, and detailed shader coding best practices which reduce bottlenecks and hide latency. Intended for experienced shader authors with a solid understanding of GPU architecture and hoping to extract every possible cycle.

[ Music ]

[ Applause ]

So, hello everyone.

My name is Fiona and this is my colleague Alex.

And I work on the iOS GPU compiler team, and our job is to make your shaders run on the latest iOS devices, and to make them run as efficiently as possible.

And we work with the open source community to make LLVM more suitable for use on GPUs by everyone.

Here's a quick overview of the other Metal sessions, in case you missed them, and don't worry, you can watch the recordings online.

Yesterday we had parts one and two of Adopting Metal, and earlier today we had parts one and two of What's New in Metal, because there's quite a lot that's new in Metal.

And of course here's the last one, the one you're watching right now.

So in this presentation we're going to be going over a number of things you can do to work with the compiler to make your code faster.

And some of this stuff is going to be specific to A8 and later GPUs, including some information that has never been made public before.

And some of it will also be more general.

And we'll be noting that with the A8 icon you can see there, for slides that are more A8 specific.

And additionally, we'll be noting some potential pitfalls.

That is, things that may not come up as often as the kind of micro-optimizations you're used to looking for, but if you run into these, you're likely to lose so much performance that nothing is going to matter by comparison.

So it's always worth making sure you don't run into those.

And those will be marked with the triangle icon, as you can see there.

Before we go on, this is not the first step.

This is the last step.

There's no point in doing low-level shader optimization until you've done the high-level optimizations first, like watching the other Metal talks for optimizing your draw calls, the structure of your engine, and so forth.

Optimizing your shaders at this level should be roughly the last thing you do.

And, this presentation is primarily for experienced shader authors.

Perhaps you've worked with Metal a whole lot and you're looking to get more into optimizing your shaders, or perhaps you're new to Metal but you've done a lot of shader optimization on other platforms and you'd like to know how to optimize better for A8 and later GPUs; either way, this is the presentation for you.

So you may have seen this pipeline if you watched any of the previous Metal talks.

And we will be focusing, of course, on the programmable stages of this pipeline, as you can see there: the shader cores.

So since this functionality doesn't exist in all shading languages, I'll give a quick primer.

So, GPUs have multiple paths for getting data from memory.

And these paths are optimized for different use cases, and they have different performance characteristics.

In Metal, we expose control over which path is used to the developer, by requiring that they qualify all buffer arguments and pointers in the shading language with which address space they want to use.

So a couple of the address spaces specifically apply to getting information from memory.

The first of which is the device address space.

This is an address space with relatively few restrictions.

You can read and write data through this address space, you can pass as much data as you want, and the buffer offsets that you specify at the API level have relatively flexible alignment requirements.

On the other end of things, you have the constant address space.

As the name implies, this is a read-only address space, but there are a couple of additional restrictions.

There are limits on how much data you can pass through this address space, and additionally the buffer offsets that you specify at the API level have more stringent alignment requirements.

However, this is the address space that's optimized for cases with a lot of data reuse.

So you want to take advantage of this address space whenever it makes sense.

Figuring out whether or not the constant address space makes sense for your buffer argument is typically a matter of asking yourself two questions.

The first question is: do I know how much data I have?

And if you have a potentially variable amount of data, this is usually a sign that you need to be using the device address space.

Additionally, you want to look at how many times each item in your buffer is read.

And if these items can potentially be read many times, this is usually a sign that you want to put them into the constant address space.

So let's put this into practice with a couple of examples from some vertex shaders.

First, you have regular, old vertex data.

So as you can see, each vertex has its own piece of data.

And each vertex is the only one that reads that piece of data.

So there's essentially no reuse here.

This is the kind of thing that really needs to be in the device address space.

Next, you have projection matrices and other matrices like that.

Now, typically what you have here is that you have one of these objects, and it's read by every single vertex.

So with this kind of complete data reuse, you really want this to be in the constant address space.

Let's mix things up a little bit and take a look at skinning matrices.

So hopefully in this case you have some maximum number of bones that you're handling.

But if you look at each bone, that matrix may be read by every vertex that references that bone, and that also is a potential for a large amount of reuse.

And so this really ought to be in the constant address space as well.

Finally, let's look at per instance data.

As you can see, all vertices in the instance will read this particular piece of data, but on the other hand you have a potentially variable number of instances, so this actually needs to be in the device address space as well.
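As a rough sketch, here is how those four choices might look as Metal argument qualifiers; the type and argument names are hypothetical, and the skinning math is elided:

    #include <metal_stdlib>
    using namespace metal;

    struct VertexIn     { float3 position; };              // per-vertex, read once: device
    struct Uniforms     { float4x4 viewProjection; };      // read by every vertex: constant
    struct SkinningData { float4x4 boneTransforms[64]; };  // bounded and reused: constant
    struct InstanceData { float4x4 modelTransform; };      // variable count: device

    vertex float4 exampleVertex(uint vid [[vertex_id]],
                                uint iid [[instance_id]],
                                device const VertexIn *vertices      [[buffer(0)]],
                                constant Uniforms &uniforms          [[buffer(1)]],
                                constant SkinningData &bones         [[buffer(2)]],
                                device const InstanceData *instances [[buffer(3)]])
    {
        float4 p = float4(vertices[vid].position, 1.0f);
        p = bones.boneTransforms[0] * p;  // stand-in for real skinning
        return uniforms.viewProjection * (instances[iid].modelTransform * p);
    }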

So Fiona will spend some time talking about how to actually optimize loads and stores within your shaders, but for many cases the best thing that you can do is to actually offload this work to dedicated hardware.

So we can do this for you in two cases: constant buffers and vertex buffers.

But this relies on knowing things about the access patterns in your shaders and what address space you've placed your data into.

So let's start with constant buffer preloading.

So the idea here is that rather than loading through the constant address space, what we can actually do is take your data and put it into special constant registers that are even faster for the ALU to access.

So we can do this as long as we know exactly what data will be read.

If your offsets are known at compile time, this is straightforward.

But if your offsets aren't known until run time, then we need a little bit of extra information about how much data you're reading.

So indicating this to the compiler is usually a matter of two steps.

First, you need to make sure that this data is in the constant address space.

And additionally, you need to indicate that your accesses are statically bounded.

The best way to do this is to pass your arguments by reference rather than by pointer where possible.

If you're passing only a single item or a single struct, this is straightforward; you can just change your pointers to references and change your accesses accordingly.

This is a little different if you're passing an array that you know is bounded.

So what you do in this case is embed that fixed-size array in a struct, and pass that struct by reference rather than passing the original pointer.

So we can put this into practice with an example of a forward lighting fragment shader.

So as you can see, in sort of the original version what we have are a bunch of arguments that are passed as regular device pointers.

And this doesn't expose the information that we want.

So we can do better than this.

Instead, if we note that the number of lights is bounded, what we can do is put the light data and the count together into a single struct like this.

And we can pass that struct in the constant address space as a reference like this.

And so that gets us constant buffer preloading.
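Since the slide code isn't in the transcript, here is a rough reconstruction of that transformation; the Light type, the buffer indices, and the bound of 32 are hypothetical:

    struct Light { float4 position; half4 color; };

    // Before: device pointers hide how much data will actually be read.
    fragment half4 litFragmentSlow(device const Light *lights     [[buffer(0)]],
                                   device const uint  &lightCount [[buffer(1)]])
    {
        half4 result = half4(0.0h);
        for (uint i = 0; i < lightCount; ++i)
            result += lights[i].color;      // stand-in for real shading math
        return result;
    }

    // After: a statically bounded array embedded in a struct, passed by
    // constant reference, so the light data can be preloaded.
    #define MAX_LIGHTS 32
    struct AllLights {
        Light lights[MAX_LIGHTS];
        uint  lightCount;
    };

    fragment half4 litFragment(constant AllLights &all [[buffer(0)]])
    {
        half4 result = half4(0.0h);
        for (uint i = 0; i < all.lightCount; ++i)
            result += all.lights[i].color;
        return result;
    }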

Let's look at another example of how this can affect you in practice.

So, there are many ways to implement a deferred renderer, but what we find is that the actual implementation choices that you make can have a big impact on the performance that you achieve in practice.

One pattern that's common now is to use a single shader to accumulate the results of all lights.

And what you can see from the declaration of this function is that it can potentially read any or all lights in the scene, and that means that your input size is unbounded.

Now, on the other hand, if you're able to structure your rendering such that each light is handled in its own draw call, then each light's shader only needs to read that light's data, and that means that you can pass it in the constant address space and take advantage of constant buffer preloading.

In practice we see that on A8 and later GPUs this is a significant performance win.

Now let's talk about vertex buffer preloading.

The idea of vertex buffer preloading is to reuse the same dedicated hardware that we would use for fixed-function vertex fetching.

And we can do this for regular buffer loads, as long as the way that you access your buffer looks just like fixed-function vertex fetching.

So what that means is that you need to be indexing using the vertex or instance ID.

Now, we can handle a couple of additional modifications to the vertex or instance IDs, such as applying a divisor, and that's with or without any base vertex or instance offsets you might have applied at the API level.

Of course, the easiest way to take advantage of this is just to use the Metal vertex descriptor functionality wherever possible.

But if you are writing your own indexing code, we strongly suggest that you lay out your data so that vertices fetch linearly, to simplify buffer indexing.

Note that this doesn't preclude you from doing fancier things; like if you were rendering quads and you want to pass one value to all vertices in the quad, you can still do things like indexing by vertex ID divided by four, because this just looks like a divisor.
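A small sketch of what that quad case might look like; the buffer layout and names are hypothetical:

    // Indexing by vertex_id, or by vertex_id with a divisor, matches what
    // fixed-function vertex fetching can do, so these loads can be preloaded.
    vertex float4 quadVertex(uint vid [[vertex_id]],
                             device const float4 *positions [[buffer(0)]],
                             device const float4 *perQuad   [[buffer(1)]])
    {
        float4 p = positions[vid];    // linear fetch by vertex ID
        float4 q = perQuad[vid / 4];  // one value per quad: just a divisor of 4
        return p + q;
    }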

So now let's move on to a couple of shader-stage-specific concerns.

In iOS 10 we introduced the ability to do resource writes from within your fragment functions.

And this has interesting implications for hidden surface removal.

So prior to this, you might have been accustomed to the behavior that a fragment wouldn't need to be shaded as long as an opaque fragment came in and occluded it.

So this is no longer true, specifically if your fragment function is doing resource writes, because those resource writes still need to happen.

So instead your behavior really only depends on what's come before.

And specifically, what happens depends on whether or not you've enabled early fragment tests on your fragment function.

If you have enabled early fragment tests, then your fragment will be shaded once it's rasterized, as long as it also passes the early depth and stencil tests.

If you haven't specified early fragment tests, then your fragment will be shaded as long as it's rasterized.

So from the perspective of minimizing your shading, what you want to do is use early fragment tests wherever possible.

But there are a couple of additional things that you can do to improve the rejection that you get.

And most of these boil down to draw order.

You want to draw these objects, the objects where the fragment functions do resource writes, after opaque objects.

And if you're using these objects to update your depth and stencil buffers, we strongly suggest that you sort these objects from front to back.

Note that this guidance should sound fairly familiar if you've been dealing with fragment functions that do discard or modify your depth per pixel.
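For reference, a minimal sketch of opting in to early fragment tests on a fragment function that does a resource write; the atomic counter is a hypothetical example of such a write:

    struct FragmentIn { float4 position [[position]]; half4 color; };

    // With early fragment tests enabled, fragments that fail the existing
    // depth and stencil tests are rejected before the write ever runs.
    [[early_fragment_tests]]
    fragment half4 countingFragment(FragmentIn in [[stage_in]],
                                    device atomic_uint *visibleCount [[buffer(0)]])
    {
        atomic_fetch_add_explicit(visibleCount, 1u, memory_order_relaxed);
        return in.color;
    }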

Now let's talk about compute kernels.

The defining characteristic of a compute kernel is that you can structure your computation however you want.

So let's talk about what factors influence how you do this on iOS.

First we have compute thread launch overhead.

So on A8 and later GPUs, there's a certain amount of time that it takes to launch a group of compute threads.

So if you don't do enough work within a single compute thread, you can potentially leave the hardware underutilized and leave performance on the table.

And a good way to deal with this, and actually a good pattern for writing compute kernels on iOS in general, is to process multiple conceptual work items in a single compute thread.

And in particular, a pattern that we find works well is to reuse values not by passing them through threadgroup memory, but rather by reusing values loaded for one work item when you're processing the next work item in the same compute thread.

And it's best to illustrate this with an example.

So this is a Sobel filter kernel; this is sort of the most straightforward version of it. As you see, it reads a 3 by 3 region of its source and produces one output pixel.

So if instead we apply the pattern of processing multiple work items in a single compute thread, we get something that looks like this.

Notice now that we're striding by two pixels at a time.

So processing the first pixel looks much as it did before.

We read the 3 by 3 region.

We apply the filter and we write out the value.

But now let's look at how pixel 2 is handled.

So since we're striding by two pixels at a time, we need to make sure that there is a second pixel to process.

And now we read its data.

Note here that a 2 by 3 region of what this pixel wants was already loaded by the previous pixel.

So we don't need to load it again; we can reuse those old values.

All we need to load now is the 1 by 3 region that's new to this pixel.

After which, we can apply the filter and we're done.

Note that as a result we're now doing 12 texture reads instead of the old 9, but we're producing 2 pixels.

So this is a significant reduction in the number of texture reads per pixel.
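Here is a sketch of that two-pixels-per-thread pattern, assuming the grid is dispatched at half the image width; the filter itself is a hypothetical stand-in, and edge clamping is omitted for brevity:

    #include <metal_stdlib>
    using namespace metal;

    // Hypothetical 3 by 3 filter over three columns of samples.
    static half4 applyFilter(half3 left, half3 mid, half3 right)
    {
        half g = dot(right - left, half3(1.0h, 2.0h, 1.0h));
        return half4(g, g, g, 1.0h);
    }

    kernel void filterTwoPerThread(texture2d<half, access::read>  src [[texture(0)]],
                                   texture2d<half, access::write> dst [[texture(1)]],
                                   ushort2 tid [[thread_position_in_grid]])
    {
        uint x = tid.x * 2;   // each thread produces pixels x and x + 1
        uint y = tid.y;

        half3 c0, c1, c2;
        for (uint row = 0; row < 3; ++row) {   // 9 reads for the first pixel
            c0[row] = src.read(uint2(x - 1, y + row - 1)).r;
            c1[row] = src.read(uint2(x,     y + row - 1)).r;
            c2[row] = src.read(uint2(x + 1, y + row - 1)).r;
        }
        dst.write(applyFilter(c0, c1, c2), uint2(x, y));

        if (x + 1 < dst.get_width()) {         // make sure a second pixel exists
            half3 c3;                          // only 3 new reads; c1 and c2 are reused
            for (uint row = 0; row < 3; ++row)
                c3[row] = src.read(uint2(x + 2, y + row - 1)).r;
            dst.write(applyFilter(c1, c2, c3), uint2(x + 1, y));
        }
    }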

Of course this pattern doesn't work for all compute use cases.

Sometimes you do still need to pass data through threadgroup memory.

And in that case, when you're synchronizing between threads in a threadgroup, an important thing to keep in mind is that you want to use the barrier with the smallest possible scope for the threads that you need to synchronize.

In particular, if your threadgroup fits within a single SIMD, the regular threadgroup barrier function in Metal is unnecessary.

What you can use instead is the new SIMD group barrier function introduced in iOS 10.

And what we find is that actually targeting your threadgroup to fit within a single SIMD and using the SIMD group barrier is often faster than trying to use a larger threadgroup in order to squeeze out that additional reuse, but having to use the threadgroup barrier as a result.
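A minimal sketch of that, assuming a SIMD width of 32 and a threadgroup sized to match:

    // The threadgroup fits in one SIMD, so the cheaper simdgroup_barrier
    // (new in iOS 10) is enough to order the threadgroup-memory accesses.
    kernel void reduceInOneSIMD(device const float *input  [[buffer(0)]],
                                device float       *output [[buffer(1)]],
                                ushort tid [[thread_position_in_threadgroup]],
                                ushort gid [[threadgroup_position_in_grid]])
    {
        threadgroup float scratch[32];
        scratch[tid] = input[gid * 32 + tid];
        simdgroup_barrier(mem_flags::mem_threadgroup);  // not threadgroup_barrier
        if (tid == 0) {
            float sum = 0.0f;
            for (ushort i = 0; i < 32; ++i)
                sum += scratch[i];
            output[gid] = sum;
        }
    }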

So that wraps things up for me. In conclusion, make sure you're using the appropriate address space for each of your buffer arguments, according to the guidelines that we described.

Make sure you're using early fragment tests to reject as many fragments as possible when you're doing resource writes.

Put enough work in each compute thread so you're not being limited by your compute thread launch overhead.

And use the smallest barrier for the job when you need to synchronize between threads in a threadgroup.

And with that, I'd like to pass it back to Fiona to dive deeper into tuning shader code.

[ Applause ]

Thank you, Alex.

So, before jumping into the specifics here, I want to go over some general characteristics of GPUs and the bottlenecks you can encounter.

And all of you may be familiar with this, but I figure I should just do a quick review.

So with GPUs typically you have a set of resources.

And it's fairly common for a shader to be bottlenecked by one of those resources.

And so for example, if you're bottlenecked by memory bandwidth, improving other things in your shader will often not give any apparent performance improvement.

And while it is important to identify these bottlenecks and focus on them to improve performance, there is actually still benefit to improving things that aren't bottlenecks.

For example, if you are bottlenecked on memory bandwidth, but then you improve your arithmetic to be more efficient, you will still save power even if you are not improving your frame rate.

And of course, being on mobile, saving power is always important.

So it's not something to ignore, just because your frame rate doesn't go up in that case.

So there are four typical bottlenecks to keep in mind in shaders here.

The first is fairly straightforward, ALU bandwidth.

The amount of math that the GPU can do.

The second is memory bandwidth, again fairly straightforward, the amount of data that the GPU can load from system memory.

The other two are a little more subtle.

The first one is memory issue rate.

Which represents the number of memory operations that can be performed.

And this can come up in the case where you have smaller memory operations, or you're using a lot of threadgroup memory, and so forth.

And the last one, which I'll go into a bit more detail about later, is latency, occupancy, and register usage.

You may have heard about that, but I will save that until the end.

So to try to alleviate some of these bottlenecks, and improve overall shader performance and efficiency, we're going to look at four categories of optimization opportunity here.

And the first one is data types.

And the first thing to consider when optimizing your shader is choosing your data types.

And the most important thing to remember when you're choosing data types is that A8 and later GPUs have 16-bit register units, which means that, for example, if you're using a 32-bit data type, that's twice the register space, twice the bandwidth, potentially twice the power, and so forth; it's just twice as much stuff.

So, accordingly, you will save registers, you will get faster performance, and you'll get lower power by using smaller data types.

Use half and short for arithmetic wherever you can.

Energy-wise, half is cheaper than float.

And float is cheaper than integer, but even among integers, smaller integers are cheaper than bigger ones.

And the most effective thing you can do to save registers is to use half for texture reads and interpolants, because most of the time you really do not need float for these.

And note I do not mean your texture formats.

I mean the data types you're using to store the results of a texture sample or an interpolant.

And one aspect of A8 and later GPUs that is fairly convenient, and makes using smaller data types easier than on some other GPUs, is that data type conversions are typically free, even between float and half. Which means that you don't have to worry: am I introducing too many conversions by trying to use half here?

Is this going to cost too much?

Is it worth it or not?

No, it's probably fast, because the conversions are free; so you can use half wherever you want and not worry about that part of it.

The one thing to keep in mind here, though, is that half-precision numerics and limitations are different from float.

And a common bug that can come up here, for example, is people will write 65,535 as a half, but that is actually infinity.

Because that's bigger than the maximum half.

And so by being aware of what these limitations are, you'll better be able to know where you perhaps should and shouldn't use half.

And you'll be less likely to encounter unexpected bugs in your shaders.

So one example application for using smaller integer data types is thread IDs.

And as those of you who have worked on compute kernels will know, thread IDs are used all over your programs.

And so making them smaller can significantly increase the performance of arithmetic, and can save registers and so forth.

And so for local thread IDs, there's no reason to ever use uint for them as in this case, because a threadgroup can't have that many threads.

For global thread IDs, usually you can get away with a ushort, because most of the time you don't have that many global thread IDs.

Of course it depends on your program.

But in most cases, you won't go over 2^16 - 1, so it's safe to do this.

And this is going to be lower power, and it's going to be faster, because all of the arithmetic involving your thread ID is now going to be faster.

So I highly recommend this wherever possible.
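A tiny sketch of this; the kernel itself is hypothetical:

    // ushort thread IDs: local IDs always fit in 16 bits, and global IDs fit
    // whenever the grid dimension stays under 65,536.
    kernel void doubleValues(device half *data [[buffer(0)]],
                             ushort gid [[thread_position_in_grid]])  // ushort, not uint
    {
        data[gid] = data[gid] * 2.0h;  // all the indexing math stays 16-bit
    }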

Additionally, keep in mind that in C-like languages, which of course includes Metal, the precision of an operation is defined by the larger of the input types.

For example, if you're multiplying a float by a half, that's a float operation, not a half operation; it's promoted.

So accordingly, make sure not to use float literals when not necessary, because that will turn what appears here to be a half operation, one that takes a half and returns a half, into a float operation.

Because by the language semantics, that's actually a float operation, since at least one of the inputs is float.

And so you probably want to do this.

This will actually be a half operation.

This will actually be faster.

This is probably what you mean.

So be careful not to inadvertently introduce float-precision arithmetic into your code when that's not what you meant.
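The pair of versions being contrasted looks something like this (the function names are hypothetical):

    // A float literal silently promotes the whole expression to float;
    // the 'h' suffix keeps it a half operation.
    half brightenSlow(half x) { return x * 1.25f; }  // float multiply: promoted
    half brightenFast(half x) { return x * 1.25h; }  // half multiply: what you meant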

And while I did mention that smaller data types are better, there's one exception to this rule, and that is char.

Remember, as I said, the native data type size on A8 and later GPUs is 16-bit, not 8-bit.

And so char is not going to save you any space or power or anything like that, and furthermore, there's no native 8-bit arithmetic.

So next we have arithmetic optimizations, and pretty much everything in this category affects ALU bandwidth.

The first thing you can do is always use the Metal built-ins whenever possible.

They're optimized implementations for a variety of functions.

They're already optimized for the hardware.

It's generally better than implementing them yourself.

And in particular, there are some of these that are usually free in practice.

And this is because GPUs typically have modifiers.

Operations that can be performed for free on the input and output of instructions.

And for A8 and later GPUs, these typically include negate, absolute value, and saturate, as you can see here, these three operations in green.

So, there's no point in trying to "be clever" and speed up your code by avoiding those, because again, they're almost always free.

And because they're free, you can't do better than free.

There's no way to optimize better than free.
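For instance, a sketch like this compiles down to a single multiply, since the negate, abs, and saturate all fold into instruction modifiers (the function itself is hypothetical):

    half shade(half x, half y)
    {
        return saturate(-x * abs(y));  // -, abs, and saturate are free modifiers
    }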

A8 and later GPUs, like a lot of others nowadays, are scalar machines.

And while shaders are typically written with vectors, the compiler is going to split them all apart internally.

Of course, there's no downside to writing vector code; I mean, often it's clearer, often it's more maintainable, often it fits what you're trying to do. But it's also no better than writing scalar code, from the perspective of the compiler and the code you're going to get.

So there's no point in trying to vectorize code that doesn't really fit a vector format, because it's just going to end up the same thing in the end, and you're kind of wasting your time.

However, as a side note, which I'll go into in more detail later, A8 and later GPUs do have vector load and store, even though they do not have vector arithmetic.

So this only applies to arithmetic here.

Instruction-level parallelism is something that some of you may have optimized for, especially if you've done work on CPUs.

But on A8 and later GPUs, this is generally not a good thing to try to optimize for, because it typically works against register usage, and register usage typically matters more.

So a common pattern you may have seen is a kind of loop where you have multiple accumulators, in order to better deal with latency on a CPU.

But on A8 and later GPUs this is probably counterproductive.

You'd be better off just using one accumulator.

Of course, this applies to much more complex examples than the artificially simple ones here.

Just write what you mean; don't try to restructure your code to get more ILP out of it.

At best, it's probably not going to help you, and at worst, you just might get worse code.

So one fairly nice feature of A8 and later GPUs is that they have very fast select instructions, that is, the ternary operator.

And historically, it's been fairly common to use clever tricks like this to try to perform select operations without ternaries, to avoid those branches or whatever.

But on modern GPUs this is usually counterproductive, and especially on A8 and later GPUs, because the compiler can't see through this cleverness.

It's not going to figure out what you actually mean.

And really, this is really ugly.

You could just have written this.

And this is going to be faster, shorter, and it's actually going to show what you mean.

Like before, being overly clever will often obfuscate what you're trying to do and confuse the compiler.
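A sketch of the contrast; the step/mix trick here is one common example of that kind of cleverness, not necessarily the one on the slide:

    // The "branchless" trick does extra ALU work and hides your intent.
    float selectTricky(float a, float b, float c)
    {
        return mix(b, a, step(0.0f, c));  // don't do this
    }

    // The ternary compiles directly to a fast select instruction.
    float selectClear(float a, float b, float c)
    {
        return (c >= 0.0f) ? a : b;  // faster, shorter, says what you mean
    }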

Now, this is a potential major pitfall; hopefully this won't come up too much.

Most modern GPUs do not have integer division or modulus instructions; integer, not float.

So avoid division or modulus by denominators that are not literals or function constants, the new feature mentioned in some of the earlier talks.

So in this example, what we have over here, this first one where the denominator is a variable, that will be very, very slow.

Think hundreds of clock cycles.

But these other two examples, those will be very fast.

Those are fine.

So don't feel like you have to avoid that.
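Something like this, with a hypothetical function constant:

    constant uint bucketCount [[function_constant(0)]];

    uint slowMod(uint x, uint n) { return x % n; }           // runtime denominator: very, very slow
    uint fastModLiteral(uint x)  { return x % 16u; }         // literal denominator: fast
    uint fastModConstant(uint x) { return x % bucketCount; } // function constant: fast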

So, finally the topic of fast-math.

So in Metal, fast-math is on by default.

And this is because compiler fast-math optimizations are critical to the performance of Metal shaders.

They can give a 50% performance gain or more over having fast-math off.

So it's no wonder it's on by default.

And so what exactly do we do in fast-math mode?

Well, the first is that some of the Metal built-in functions have different precision guarantees between fast-math and non-fast-math.

And so some of them will have slightly lower precision in fast-math mode to get better performance.

The compiler may increase the intermediate precision of your operations, like by forming fused multiply-add instructions.

It will not decrease the intermediate precision.

So for example, if you write a float operation, you will get an operation that is at least a float operation.

Not a half operation.

So if you want half operations, you'd better write that; the compiler will not do that for you, because it's not allowed to.

It can't cut your precision like that.

We do ignore strict NaN (not a number), infinity, and signed zero semantics, which is fairly important, because without that you can't actually prove that x times zero is equal to zero.

But we will not introduce new NaNs, because in practice that's a really nice way to annoy developers and break their code, and we don't want to do that.

And the compiler will perform arithmetic reassociation, but it will not do arithmetic distribution.

And really, this just comes down to what doesn't break code and makes it faster, versus what does break code.

And we don't want to break code.

So if you absolutely cannot use fast-math for whatever reason, there are some ways to recover some of that performance.

Metal has a fused multiply-add built-in, which you can see here.

Which allows you to directly request fused multiply-add instructions.

And of course, if fast-math is off, the compiler is not even allowed to make those; it cannot change one bit of your rounding, it is prohibited.

So if you want to use fused multiply-add and fast-math is off, you're going to have to use the built-in.

And that will regain some of the performance; not all of it, but at least some.
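That built-in is fma(); a trivial sketch:

    // Requests a fused multiply-add explicitly; with fast-math off, the
    // compiler is not allowed to form one on its own.
    float madd(float a, float b, float c)
    {
        return fma(a, b, c);
    }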

So, on our third topic, control flow.

Predicated GPU control flow is not a new topic, and some of you may already be familiar with it.

But here's a quick review of what it means for you.

Control flow that is uniform across the SIMD, that is, every thread is doing the same thing, is generally fast.

And this is true even if the compiler can't see that.

So if your program doesn't appear uniform, but just happens to be uniform when it runs, that's still just as fast.

And similarly, the opposite of this is divergence, different lanes doing different things. Well, in that case, the GPU potentially may have to run all of the different paths, unlike a CPU, which only takes one path at a time.

And as a result it does more work, which of course means that inefficient control flow can affect any of the bottlenecks, because it just outright means the GPU is doing more stuff, whatever that stuff happens to be.

So, the one suggestion I'll make on the topic of control flow is to avoid switch fall-throughs.

And these are fairly common in CPU code.

But on GPUs they can potentially be somewhat inefficient, because the compiler has to do fairly nasty transformations to make them fit within the control flow model of GPUs.

And often this will involve duplicating code, and all sorts of nasty things you probably would rather not be happening.

So if you can find a nice way to avoid these switch fall-throughs in your code, you'll probably be better off.
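For example, a hypothetical fall-through and one way to restructure it:

    half applyMode(half x, int mode)
    {
        // A fall-through from case 0 into case 1 forces the compiler to
        // duplicate code to fit the GPU control flow model:
        //   case 0: x += 1.0h;  /* fall through */
        //   case 1: x *= 2.0h;  break;
        // Restructured so each case is self-contained:
        switch (mode) {
        case 0:  return (x + 1.0h) * 2.0h;
        case 1:  return x * 2.0h;
        default: return 0.0h;
        }
    }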

So now we're on to our final topic.

Memory access.

And we'll start with the biggest pitfall that people most commonly run into, and that is dynamically indexed, non-constant stack arrays.

Now that's quite a mouthful, but a lot of you are probably familiar with code that looks vaguely like this.

You have an array that consists of values that are defined at run time and vary between each thread or each function call.

And you index into the array with another value that is also a variable.

That is a dynamically indexed non-constant stack array.

Now before we go on, I'm not going to ask you to take for granted the idea that stacks are slow on GPUs.

I'm going to explain why.

So, on CPUs, typically you have like a couple of threads, maybe a dozen threads, and you have megabytes of cache split between those threads.

So every thread can have hundreds of kilobytes of stack space before they get really slow and have to head off to main memory.

On a GPU you often have tens of thousands of threads running.

And they're all sharing a much smaller cache too.

So when it comes down to it, each thread has very, very little space for a stack.

It's just not meant for that, it's not efficient, and so as a general rule, for most GPU programs, if you're using the stack, you've already lost.

It's so slow that almost anything else would have been better.

And an example from a real-world app: at the start of the program, it needed to select one of two float4 vectors, so it used a 32-byte array, an array of two float4s, and tried to select between them using this stack array.

And that caused a 30% performance loss in this program, even though it's only done once at the start.

It can be pretty significant.

And of course, every time we improve the compiler, we are going to try harder and harder to do anything we can to avoid generating these stack accesses, because it is that bad.

Now I'll show you two examples here that are okay.

This other one, you can see those are constants, not variables.

It's not a non-constant stack array, and that's fine, because the values don't vary per thread; they don't need to be duplicated per thread.

So that's okay.

And this one is also okay.

Wait, why?

It's still a dynamically indexed non-constant stack array.

But it's only dynamically indexed because of this loop.

And the compiler is going to unroll that loop.

In fact, the compiler aggressively unrolls any loop that is accessing the stack, to try to make it stop doing that.

So in this case, after it's unrolled, it will no longer be dynamically indexed, so it will be fast.

And this is worth mentioning, because this is a fairly common pattern in a lot of graphics code, and I don't want to scare you into not doing that when it's probably fine.
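Sketches of the three cases; the functions themselves are hypothetical:

    // Dynamically indexed, non-constant stack array: forces a per-thread stack.
    float pickSlow(float4 v, int i)
    {
        float tmp[4] = { v.x, v.y, v.z, v.w };  // runtime values, per thread
        return tmp[i];                          // dynamic index: very slow
    }

    // Constant contents: fine, nothing varies per thread.
    constant float weights[4] = { 0.1f, 0.2f, 0.3f, 0.4f };
    float pickConstant(int i) { return weights[i]; }

    // Loop index: fine, the compiler unrolls the loop and every index
    // becomes a compile-time constant.
    float sumAll(float4 v)
    {
        float tmp[4] = { v.x, v.y, v.z, v.w };
        float sum = 0.0f;
        for (int i = 0; i < 4; ++i)
            sum += tmp[i];
        return sum;
    }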

So now that we've gone over the topic of how not to do certain types of loads and stores, let's go on to making the loads and stores that we do actually fast.

Now, while A8 and later GPUs use scalar arithmetic, as I went over earlier, they do have vector memory units.

And one big vector load is of course faster than multiple smaller ones that sum up to the same size.

And so, as of iOS 10, one of our new compiler optimizations is that we will try to vectorize loads and stores that go to neighboring memory locations wherever we can, because again, it can give good performance improvements.

But nevertheless, this is one of the cases where working with the compiler can be very helpful, and I'll give an example.

So as you can see here, here's a simple loop that does some arithmetic and reads from an array of structures, but each iteration does two loads.

Now, we would want that to be one if we could, because one is better than two.

And the compiler wants that too.

It wants to try to vectorize this, but it can't, because a and c aren't next to each other in memory, so there's nothing it can do.

The compiler's not allowed to rearrange your structs, so we've got two loads.

There are two solutions to this.

Number one, of course: just make it a float2. Now it's a vector load, you're done.

One load instead of two; we're all good.

Also, as of iOS 10, this should be equally fast, because here we've reordered our struct to put the values next to each other, so the compiler can now vectorize the loads.

And this is an example, again, of working with the compiler: you've allowed the compiler to do something it couldn't before, because you understand what's going on.

You understand how the patterns need to be to make the compiler happy, and make it able to do a better job.
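Roughly, with hypothetical struct names:

    // Before: a and c are separated by b, so each iteration takes two scalar loads.
    struct Particle { float a; float b; float c; };

    // After: a and c are adjacent, so the compiler can merge them into one
    // vector load (making them a single float2 works equally well).
    struct ParticleFast { float a; float c; float b; };

    kernel void sumAC(device const ParticleFast *particles [[buffer(0)]],
                      device float *out [[buffer(1)]],
                      ushort gid [[thread_position_in_grid]])
    {
        out[gid] = particles[gid].a + particles[gid].c;  // one vectorized load
    }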

So, another thing to keep in mind with loads and stores is that A8 and later GPUs have dedicated hardware for device memory addressing, but this hardware has limits.

The offset for accessing device memory must fit within a signed integer.

Smaller types like short and ushort are also okay, in fact they're highly encouraged, because those do also fit within a signed integer.

However, of course, uint does not, because it can have values out of range of a signed integer.

And so if the compiler runs into a situation where the offset is a uint, and it cannot prove that it will safely fit within a signed integer, it has to manually calculate the address, rather than letting the dedicated hardware do it.

And that can waste power, it can waste ALU performance, and so forth.

It's not good.

So, change your offset to int, now the problem's solved.

And of course, taking advantage of this will typically save you ALU bandwidth.
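In sketch form:

    // A uint offset can exceed the signed range, so the compiler may have to
    // compute the address in ALU code instead of using the addressing hardware.
    float loadSlow(device const float *buf, uint offset) { return buf[offset]; }

    // int (and smaller types like short and ushort) fit the hardware's signed offset.
    float loadFast(device const float *buf, int offset)  { return buf[offset]; }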

So now, on to our final topic, which I sort of glossed over earlier: latency and occupancy.

So one of the core design tenets of modern GPUs is that they hide latency by using large-scale multithreading.

So when they're waiting for something slow to finish, like a texture read, they just go and run another thread, instead of sitting there doing nothing while waiting.

And this is fairly important, because texture reads typically take a couple hundred cycles to complete on average.

And so the more latency you have in a shader, the more threads you need to hide that latency. And how many threads can you have?

Well, it's limited by the fact that you have a fixed set of resources that are shared between threads in a threadgroup.

So clearly, depending on how much each thread uses, you have a limitation on the number of threads.

And the two things that are split are the number of registers and threadgroup memory.

So if you use more registers per thread, now you can't have as many threads.

Simple enough.

And if you use more threadgroup memory per thread, again you run into the same problem: more threadgroup memory per thread means fewer threads.

And you can actually check the occupancy of your shader by querying MTLComputePipelineState's maxTotalThreadsPerThreadgroup, which will tell you what the actual occupancy of your shader is, based on the register usage and the threadgroup memory usage.

And so when we say a shader is latency limited, it means you have too few threads to hide the latency of the shader.

And there are two things you can do there: you can either reduce the latency of your shader, or save registers or whatever else it is that is preventing you from having more threads.

So, since it's kind of hard to go over latency in a very large, complex shader, I'll go over a little pseudocode example that will hopefully give you a bit of an intuition of how to think about latency, and how to sort of mentally model it in your shaders.

So, here's an example of a real dependency.

We have a texture sample, and then we use the output of that texture sample in an if statement, and then we do another texture sample inside that if statement.

We have to wait twice.

Because we have to wait once before doing the if statement.

And we have to wait again before using the value from the second texture sample.

So that's two serial texture accesses, for a total of twice the latency.

Now here's an example of a false dependency.

It looks a lot like the other one, except we're not using a in the if statement.

But typically, we can't wait across control flow.

The if statement acts as an effective barrier in this case.

So, we automatically have to wait here anyway, even though there's no data dependency.

So we still get twice the latency.

As you noticed, the GPU does not actually care about your data dependencies.

It only cares about what the dependencies appear to be, and so the second one will have just as long a latency as the first one, even though there isn't a data dependency there.

And then finally, here's a simple one where you just have two texture reads at the top, and they can both be done in parallel, and then we can have a single wait.

So it's 1x instead of 2x for latency.
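Sketches of the three shapes; the sampling coordinates are arbitrary:

    // Real dependency: the branch uses a, so that's two serial waits.
    half4 serialReads(texture2d<half> t, sampler s, float2 uv)
    {
        half a = t.sample(s, uv).r;         // wait #1, before the branch
        if (a > 0.5h)
            return t.sample(s, uv * 2.0f);  // wait #2, before the value is used
        return half4(0.0h);
    }

    // False dependency: the branch doesn't use a, but waits can't cross
    // control flow, so the latency is the same as above.
    half4 falseDependency(texture2d<half> t, sampler s, float2 uv, float k)
    {
        half4 a = t.sample(s, uv);
        half4 b = half4(0.0h);
        if (k > 0.5f)
            b = t.sample(s, uv * 2.0f);
        return a + b;
    }

    // Independent reads issued together: both in flight at once, one wait.
    half4 parallelReads(texture2d<half> t, sampler s, float2 uv)
    {
        half4 a = t.sample(s, uv);
        half4 b = t.sample(s, uv * 2.0f);
        return a + b;                       // a single wait covers both
    }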

So, what are you going to do with this knowledge?

So in many real-world shaders, you have opportunities to trade off between latency and throughput.

And a common example of this might be that you have some code where, based on one texture read, you can decide: oh, we don't need to do anything in this shader, we're going to quit early.

And that can be very useful.

Because now, all that work that's being done in the cases where you don't need it to be done, you're saving all that work.

That's great.

But now you're increasing your throughput by reducing the amount of work you need to do.

But you're also increasing your latency, because now it has to do the first texture read, then wait for that texture read, then do your early termination check, and then do whatever other texture reads you have.

And well is it faster?

Is it not?

Often you just have to test.

Because which is faster is really going to depend on your shader. But it's a thing worth being aware of; there often is a real tradeoff, and you often have to experiment to see what's right.

Now, while there isn't a universal rule, there is one particular guideline I can give for A8 and later GPUs, and that is: typically the hardware needs at least two texture reads at a time to get the full ability to hide latency.

One is not enough.

If you have to do one, no problem.

But if you have some choice in how you arrange your texture reads in your shader, if you allow it to do at least two at a time, you may get better performance.

So, in summary.

Make sure you pick the correct address spaces, data structures, layouts, and so forth, because getting this wrong is going to hurt so much that often none of the other stuff in the presentation will matter.

Work with the compiler.

Write what you mean.

Don't try to be too clever, or the compiler won't know what you mean, will get lost, and won't be able to do its job.

Plus, it's easier to write what you mean.

Keep an eye out for the big pitfalls, not just the micro-optimizations.

They're often not as obvious, and they don't come up as often, but when they do, they hurt.

And they will hurt so much that no number of micro-optimizations will save you.

And feel free to experiment.

There are a number of real tradeoffs that happen, where there's simply no single rule.

And try them both, see what's faster.

So, if you want more information, go online.

The video of the talk will be up there.

Here are the other sessions, if you missed them earlier; again, the videos will be online.

Thank you.
