Session 605, WWDC 2016

Discover enhancements to the Metal shading language and how to use function specialization to improve performance while reducing the number of shader configurations in your app. Take advantage of resource read-writes to enable amazing new rendering techniques, understand how to support wide color, and accelerate your deep learning algorithms using the Metal Performance Shaders framework.

[ Music ]

[ Applause ]

Welcome.

This is Part 2 of our What's New in Metal session.

My name is Charles Brissart, and I'm a GPU Software Engineer, and together with my colleagues, Dan Omachi and Anna Tikhonova, I will be telling you about some of our new features.

But first, let's take a look at the other Metal sessions at WWDC.

The first two sessions, the Adopting Metal sessions, covered some of the basic concepts of Metal as well as some more advanced considerations.

The What's New in Metal sessions covered our new features.

Finally, the Advanced Metal Shader Optimization session will tell you how to get the best performance out of your shaders.

So this morning you were told about tessellation, resource heaps, and memoryless render targets, as well as some improvements to the GPU tools.

This afternoon we'll tell you about function specialization, function resource read-writes, wide color, and texture assets, as well as some additions to the Metal Performance Shaders.

So let's get started with function specialization.

It is a common pattern in a rendering engine to define a few complex master functions and then use those master functions to generate a number of specialized, simpler functions.

The idea is that the master functions allow you to avoid duplicating code, while the specialized functions are simpler and, as a result, give better performance.

So let's take an example.

If we are trying to write a material function, you could write a master function that implements every aspect of any material that you might need.

But then, if you are trying to implement a simple shiny material, you would probably not need reflections, but you would need a specular highlight.

If you implement a reflective material, on the other hand, you will need to add reflections and also the specular highlights.

A translucent material will need subsurface scattering, but probably no reflections, or maybe no specular highlights either, and so on.

You get the idea.

So this is typically implemented using preprocessor macros.

The master function is compiled with a set of values for the macros to create a specialized function.

This can be done at runtime, but this is expensive.

You can also try to precompile every single variant of the specialized functions and then store them in a Metal library, but this requires a lot of storage because you can have many, many variants, or maybe you don't know which ones you will need.

Another approach is to use runtime constants.

Runtime constants avoid the need to recompile your functions.

However, you need to evaluate the values of the constants at runtime.

That will impact the performance of your shaders.

So we are proposing a new way to create specialized functions using what we call function constants.

So function constants are constants that are defined directly in the Metal shading language and can be compiled into IR and stored in a Metal library.

Then at runtime you can provide the values of the constants to create a specialized function.

The advantage of this approach is that you can compile the master function offline and store it in the Metal library.

The storage requirement is small because you only store the master functions.

And since we run a quick optimization pass when we create the specialized function, you still get the best performance.

So let's look at an example.

This is what a master function could look like using preprocessor macros.

Of course, this is a simple example.

A real one would be much more complex.

As you can see, different parts of the code are surrounded by #if statements so that you can eliminate those sections of the code.

Here is what it would look like with function constants.

As you can see at the top, we are defining a number of constants, and then we use them in the code.

To define the constants, you use the constant keyword followed by the type, in this case Boolean, and finally the name of the constant and the function constant attribute.

The function constant attribute specifies that the value of the constant is not going to be provided at compile time but will be provided at runtime when we create the specialized function.

You should also note that we are passing an index.

That index can be used in addition to the name to identify the constant when we create the specialized function at runtime.

You can then use the constant anywhere in your code like a normal constant.

Here we have a simple if statement that is used to conditionalize part of the code.
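
As a rough sketch (not the exact slide code; the constant names and lighting math here are hypothetical), a master fragment function using function constants might look like this:

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical function constants; the indices are illustrative.
constant bool hasSpecular   [[function_constant(0)]];
constant bool hasReflection [[function_constant(1)]];

struct VertexOut {
    float4 position [[position]];
    float3 normal;
    float3 viewDir;
};

fragment float4 masterFragment(VertexOut in [[stage_in]])
{
    float3 n = normalize(in.normal);
    float3 v = normalize(in.viewDir);
    float3 color = float3(0.2f) + saturate(dot(n, v)) * float3(0.6f);

    if (hasSpecular)      // branch removed entirely when specialized with false
        color += pow(saturate(dot(n, v)), 32.0f) * float3(1.0f);

    if (hasReflection)    // likewise eliminated at specialization time
        color += float3(0.1f, 0.1f, 0.15f);

    return float4(color, 1.0f);
}
```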

So once you've created your master function, compiled it, and stored it in a Metal library, you need to create specialized functions at runtime.

So you need to provide the values of the constants.

To do that, we use an MTLFunctionConstantValues object that will hold the values of multiple constants.

Once we've created the object, we can then set the value of a constant either by name or by index.

Once we have filled in the values, we can then create the specialized function by simply calling newFunction(name:constantValues:) on the library, providing the name of the master function as well as the values we just filled in.

This will return a regular MTLFunction that can then be used to create a compute pipeline or a render pipeline, depending on the type of the function.
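
On the host side, a minimal Swift sketch of that flow might look like the following, assuming the master function above was compiled into the default library; the names and indices are hypothetical:

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()!

// Fill in the constant values, by index or by name.
let values = MTLFunctionConstantValues()
var hasSpecular = true
var hasReflection = false
values.setConstantValue(&hasSpecular, type: .bool, index: 0)
values.setConstantValue(&hasReflection, type: .bool, withName: "hasReflection")

// Create the specialized function; only a fast optimization pass runs here.
let specialized = try library.makeFunction(name: "masterFragment",
                                           constantValues: values)
```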

So to better understand how this works, let's look at the compilation pipeline.

So at build time, you take the source of your master function, compile it, and store it into a Metal library.

At runtime you load the Metal library and create a new function, using the MTLFunctionConstantValues to specialize the function.

At this point, we run some optimizations to eliminate any code that's not used anymore, and then we have a specialized function that we can use to create a render pipeline or a compute pipeline.

You can declare constants of any scalar or vector type that is supported in Metal, so float, half, int, uint, and so on.

Here we are defining a half4 color.

You can also create intermediate constants using the values of function constants.

Here we're defining a Boolean constant that has the opposite value of a function constant a.

Here we are calculating a value based on the value of another function constant.

We can also have optional constants.

Optional constants are constants for which you don't need to always provide the value when you specialize the function.

This is exactly the same thing as using an #ifdef in your code when using preprocessor macros.

To do this, you use the is_function_constant_defined built-in, which will return true if the value has been provided and false otherwise.
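
A small sketch of that pattern, with a hypothetical exposure constant:

```metal
#include <metal_stdlib>
using namespace metal;

// Optional constant: the host may or may not provide a value for it.
constant float exposure [[function_constant(2)]];

// Falls back to a default when the constant was not specified at specialization time.
constant float effectiveExposure = is_function_constant_defined(exposure) ? exposure : 1.0f;
```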

You can also use function constants to add or eliminate arguments from a function.

This is useful for making sure you don't have to bind a buffer or texture if you know it's not going to be used.

It's also useful to replace the type of an argument, and we'll talk more about this in the next couple of slides.

So here we have an example.

This is a vertex function that can implement skinning depending on the value of the doSkinning constant.

The first argument of the function is the matrices buffer that will exist depending on whether the doSkinning constant is true or false.

We use the function constant attribute to qualify that argument as being optional.

In the code, you still need to use the same function constant to protect the code that's using that argument.

So here we use doSkinning in the if statement, and then we can use the matrices safely in our code.
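
A sketch of what such a vertex function might look like; the vertex layout and buffer indices here are simplified and hypothetical:

```metal
#include <metal_stdlib>
using namespace metal;

constant bool doSkinning [[function_constant(0)]];

struct VertexIn {
    float3 position  [[attribute(0)]];
    uint   boneIndex [[attribute(1)]];
};

struct VertexOut {
    float4 position [[position]];
};

vertex VertexOut skinnableVertex(
    VertexIn in [[stage_in]],
    // This argument only exists when doSkinning is true.
    constant float4x4 *matrices       [[buffer(1), function_constant(doSkinning)]],
    constant float4x4 &viewProjection [[buffer(2)]])
{
    float4 pos = float4(in.position, 1.0f);
    if (doSkinning) {
        // Safe to use matrices here: the branch is removed when doSkinning is false.
        pos = matrices[in.boneIndex] * pos;
    }
    VertexOut out;
    out.position = viewProjection * pos;
    return out;
}
```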

You can also use function constants to eliminate arguments from the stage_in struct.

Here, we have two color arguments.

The first color argument has type float4 and uses attribute 1.

The second, lowp color, is a lower-precision half4 color, but it uses the same attribute index.

So you can have either one or the other.

These are used to specifically change the type of the color attribute in your code.

There are some limitations with function constants; namely, you cannot really change the layout of a struct in memory, and that can be a problem because you might want to have different constants for different shaders and so on.

But you can work around that by adding multiple arguments with different types.

So in this example, we have two buffer arguments that are using buffer index 1.

They are controlled by the function constants useConstantA and useConstantB.

So these are used to select one or the other.

Note that we use an intermediate constant that is the opposite of the first constant to make sure only one of the arguments will exist at a given time.
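
A sketch of that workaround, with hypothetical struct and constant names:

```metal
#include <metal_stdlib>
using namespace metal;

struct MaterialA { float4 tint; };
struct MaterialB { float4 tint; float roughness; };

constant bool useConstantA [[function_constant(0)]];
// Intermediate constant guarantees exactly one of the two arguments exists.
constant bool useConstantB = !useConstantA;

fragment float4 materialFragment(
    constant MaterialA &materialA [[buffer(1), function_constant(useConstantA)]],
    constant MaterialB &materialB [[buffer(1), function_constant(useConstantB)]])
{
    if (useConstantA)
        return materialA.tint;
    return materialB.tint * (1.0f - materialB.roughness);
}
```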

So in summary, you can use function constants to create specialized functions at runtime.

It avoids front-end compilation, because it only uses a fast optimization pass to eliminate unused code.

The storage is compact because you only need to store the master functions in your library.

You don't have to ship your source.

You can ship only the IR.

And finally, the unused code is eliminated, which gives you the best performance.

Function buffer read-writes is the ability to read and write to a buffer from any function type, and also the ability to use atomic operations on those buffers from any function type.

As you guessed, function texture read-writes is the ability to read and write to a texture from any function type.

Function buffer read-writes is available on iOS with an A9 processor and on macOS.

Function texture read-writes is available on macOS.

So let's talk about function buffer read-writes.

So what's new here?

What's new is the ability to write to a buffer from a fragment function, as well as using atomic operations in the vertex and fragment functions.

These can be used to implement such things as order-independent transparency, building lists of lights that affect a given tile, or simply to debug your shaders.

So let's look at a simple example.

Let's say we want to write the positions of the visible fragments we are rendering.

It could look like this.

So we have a fragment function to which we pass an output buffer.

The output buffer is where we are going to store the positions of the fragments.

Then we have a counter, another buffer, that we use to find the position in the first buffer to which we want to write.

We can then use an atomic operation to count the number of fragments that have already been written, to get an index into the buffer.

And then we can write the position of the fragment into the buffer.

So this looks pretty good, but there is a small problem.

The depth and stencil test, when you're writing to a buffer, is actually always executed after the fragment shader.

So this is a problem because we are still going to perform the writes to the buffer, which is not what we want.

We only want the visible fragments.

It's also something to be aware of because it will impact your performance.

That means we don't have any early-Z optimization here, so we are going to execute the fragment shader when we probably wouldn't want to.

Fortunately, we have a new function qualifier, early_fragment_tests, that can be used to force the depth and stencil test to happen before the fragment shader.

As a result, if the depth test fails, we will skip the execution of the fragment shader and thus not write to the buffer.

So this is what we need here: we tag the fragment function with the early_fragment_tests attribute, which ensures that we only execute the function when the fragments are visible.
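
Putting the pieces of this example together, a sketch might look like this; the buffer indices and names are hypothetical:

```metal
#include <metal_stdlib>
using namespace metal;

struct FragmentIn {
    float4 position [[position]];
};

// Force the depth/stencil test before the shader runs, so occluded
// fragments never execute and never write to the buffer.
[[early_fragment_tests]]
fragment float4 recordVisibleFragments(
    FragmentIn in [[stage_in]],
    device float2      *positions [[buffer(0)]],
    device atomic_uint *counter   [[buffer(1)]])
{
    // Atomically grab the next free slot in the output buffer.
    uint index = atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
    positions[index] = in.position.xy;
    return float4(1.0f);
}
```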

Now let's talk about function texture read-writes.

So what's new is the ability to write to a texture from the vertex and fragment functions, as well as the ability to read and write to a texture from a single function.

This can be used, for instance, to save memory when implementing post-processing effects by using the same texture as both input and output.

So writing to a texture is fairly simple.

You just define your texture with the access qualifier write, and then you can write to your texture.

A read-write texture is a texture that you can both read and write in your shader.

Only a limited number of formats is supported for those textures.

To use a read-write texture, you use the access qualifier read_write, and then you can read from the texture and write to it in your shader.

However, you have to be careful when you write to the texture if you want to read the result, that is, if you want to read the same pixel again in your shader.

In this case, you need to use a texture fence.

The texture fence will ensure that the writes have been committed to memory so that you can read the proper value.

Here, we write to a given pixel, then we use a texture fence to make sure we can read that value again, and then we can finally read the value.
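
A sketch of that pattern in a compute kernel; the in-place processing itself is just a placeholder:

```metal
#include <metal_stdlib>
using namespace metal;

kernel void adjustInPlace(
    texture2d<float, access::read_write> image [[texture(0)]],
    uint2 gid [[thread_position_in_grid]])
{
    if (gid.x >= image.get_width() || gid.y >= image.get_height())
        return;

    float4 color = image.read(gid);
    image.write(color * 0.5f, gid);

    // Make the write visible to subsequent reads from this same thread.
    image.fence();

    float4 readBack = image.read(gid);   // now sees the halved value
    image.write(readBack + 0.1f, gid);
}
```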

We should also be careful with texture fences because they only apply to a single SIMD thread, which means that if you have two threads that are writing to a texture and the second thread is trying to read the value that was written by the first thread, even after a texture fence, this will not work.

What will work is if each thread is reading the pixel values that it was writing to, but not the ones that were written by other threads.

So one note about reading: we talked a lot about writing to buffers and textures.

With vertex and fragment functions, you have to be careful.

In this example, a fragment function is writing to a buffer, and a vertex function is trying to read the results.

However, this is not going to work within the same RenderCommandEncoder.

To fix this, we need to use two RenderCommandEncoders.

The fragment function writes to the buffer in the first RenderCommandEncoder, while the vertex function in the second RenderCommandEncoder can then read the buffer and get proper results.

You should note that with compute shaders, this is not necessary.

It can be done within the same compute command encoder.

So in summary, we introduced two new features, function buffer read-writes and function texture read-writes.

You can use early fragment tests to make sure the depth and stencil test is done before the execution of the fragment shader.

You should use a texture fence if you are trying to read data from a read-write texture that you have been writing to.

And finally, when using vertex and fragment shaders to write to buffers, you need to make sure to use a different RenderCommandEncoder when you want to read the results.

So with this, I will hand the stage to Dan Omachi to talk to you about wide color.

[ Applause ]

Thank you, Charles.

Thank you.

As Charles mentioned, my name is Dan Omachi.

I work as an engineer on Apple's GPU Software Frameworks Team, and I'd like to start off talking to you about color management, which isn't a topic that all developers are actually familiar with.

So if you are an artist, either a texture artist creating assets for a game or a photographer editing photos for distribution, you would have a particular color scheme in mind, and you'd choose colors pretty carefully.

And you'd want consistency regardless of the display on which your content is viewed.

Now it's our responsibility as developers and software engineers to guarantee that consistency.

If you're using a high-level framework like SceneKit, SpriteKit, or Core Graphics, much of this work is done for you, and you as app developers don't need to think about it.

Metal, however, is a much lower level API.

This offers increased performance and some flexibility but also places some of this responsibility in your hands.

So why now?

You've been able to use different displays with different color spaces with Apple devices for many years now.

Well, late last year, Apple introduced a couple of iMacs with a display capable of rendering colors in the P3 color space.

And in April, we introduced the 9.7-inch iPad Pro, which also has a P3 display.

So what is the P3 color space?

Well, this is a chromaticity diagram, and conceptually this represents all of the colors in the visual spectrum, in other words, all the colors that the normal human eye can see.

Of that, within this triangle are the colors that a standard sRGB display can represent.

The P3 display is able to represent colors of a much broader variety.

So here's how it works on macOS.

We want you to be able to render in any color space, and as I mentioned, high-level frameworks take care of this job of color management for you by performing an operation called color matching, where your color in one color space is matched to the display color space so that the same intensity is shown on the display regardless of the color space that you're working in.

With Metal, by default you're ignoring the color profile of the display, and therefore the display will interpret colors in its own color space.

Now, this means that sRGB colors will be interpreted as P3 colors, and rendering will be inconsistent between the two.

So if this is your application with an sRGB drawable and this is the display, well, when you call presentDrawable, these colors become much more saturated.

So why does this happen?

Well, let's go back to our chromaticity diagram.

This is the most green color that you can represent in the sRGB color space, and in a fragment shader, you'd represent this as 0.0 in the red channel, 1.0 in the green channel, and 0.0 in the blue channel.

Well, the P3 display just takes that raw value and interprets it, and it basically thinks that it's a P3 color.

So you're getting the most green color of a P3 display, which happens to be a different green color.

Now, for content creation apps, it's pretty critical that you get this right, because artists have put careful consideration into their colors.

For games, the effect is more subtle, but if your designers and artists are looking for this dark and gritty theme, well, they're going to be disappointed when it looks much more cheerful and happy when you plug in a P3 display.

Also, this problem can get worse as the industry moves towards even wider gamut displays.

So, the solution is really quite simple.

You enable color management on the NSWindow or CAMetalLayer by setting the color space to your working color space, probably the sRGB color space.

This causes the OS to perform a color match as part of the window server's normal compositing pass.
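
On macOS, a minimal sketch of that setup, assuming you are driving a CAMetalLayer yourself, might look like this:

```swift
import Metal
import CoreGraphics
import QuartzCore

// Assumes metalLayer is the CAMetalLayer backing your view.
let metalLayer = CAMetalLayer()
metalLayer.pixelFormat = .bgra8Unorm_srgb

// Declare the working color space so the window server color matches
// the drawable to the display during compositing.
metalLayer.colorspace = CGColorSpace(name: CGColorSpace.sRGB)
```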

So here's your application with an sRGB drawable, and here's the display; the window server takes your drawable when you call present and performs the color match before slapping it on the glass.

All right, so now you've got that consistency.

What if you want to adopt wide color?

You want to purposefully render those more intense colors that only a wide gamut display is capable of rendering.

Well, first of all, you need to create some content.

You need your artists to create wider gamut content, and for that we recommend using the extended range sRGB color space.

This allows existing assets that aren't authored for wide color to continue working as they have, and your shader pipelines don't need to do anything different.

However, your artists can create new wide color assets that will provide much more intense colors.

So what exactly is the extended range sRGB?

Well here's the sRGB triangle and here's P3.

Extended range sRGB just goes out infinitely in all directions, meaning values outside of 0 to 1 in your shader represent colors that can only be viewed on a wider-than-sRGB display.

So I mentioned values outside of 0 to 1.

This means that you will need to use floating point pixel formats to express such values, and for source textures we recommend a couple of formats.

You can use the BC6H floating point format.

It's a compressed format offering high performance. You can also use the packed float and shared exponent formats.

For your render targets, you can use the packed float format or the RGBA half-float format, allowing you to specify these more intense colors.

Color management on iOS is a bit simpler.

You always render in the sRGB color space, even when targeting a P3 display.

Colors are automatically matched with no performance penalty.

And if you want to use wide colors, you can make use of some new pixel formats that are natively readable by the display.

There's no compositing operation that needs to happen.

They can be gamma encoded, offering better blacks and allowing you to do linear blending in your shaders, and they're efficient for use as source textures.

All right.

Here are the bit layouts of these new formats.

So, there is a 32-bit RGB format with 10 bits per channel, and also an RGBA format with 10 bits per channel spread across 64 bits.

Now, these 10 bits can express values outside of 0 to 1.

Values from 0 to 384 represent negative values; 384 to 894, the next 510 values, represent values between 0 and 1; and those greater than 894 represent the more intense values.

Now, note here that the RGBA pixel format is twice as large and therefore uses twice as much memory and twice as much bandwidth as the RGB format.

So, in general, we recommend that you use it in the CAMetalLayer only if you need destination alpha.

All right, so you've made the decision that you want to create some wide gamut content.

How can you do this?

Well, you have an artist author content using an image editor on macOS that supports the P3 color space, such as Adobe Photoshop.

You can save that image as a 16-bit per channel PNG or JPEG using the Display P3 color profile.

Now, once you've got this image, how do you create textures from it?

Well, you've got two solutions here.

The first is that you can create your own asset conditioning tool, and from that 16-bit per channel Display P3 image you can convert to the extended range sRGB floating point color space using either the ImageIO or vImage frameworks.

And then from that, on macOS you'd convert to one of those floating point pixel formats I mentioned earlier, and on iOS you'd convert to one of those extended range pixel formats I just mentioned.

All right, so that's option one, if you really want explicit control of how your textures are built.

The next option is to use Xcode's support for textures in asset catalogues.

So for a while now you've been able to put icons and images into an asset catalogue within your Xcode project.

Last year, we introduced app thinning, whereby you can create a specialized version for various devices based upon device capabilities such as the amount of memory, the graphics feature set, or the type of device, whether it be an iPad, Mac, TV, watch, or even phone, of course.

And when your app is downloaded, you download and install only the single version of that asset made for that device with the capabilities you specified.

The asset is compressed over the wire and on the device, saving a lot of storage on the user's device, and there are numerous APIs which offer efficient access to those assets.

So now we've added texture sets to these asset catalogues.

So what does this offer?

Well, storage for mipmap levels.

Textures are more than just 2D images.

You can perform offline mipmap generation within Xcode, and Xcode will automatically color match the texture.

So if it's a wide gamut texture in some different color space, Xcode will perform a color matching operation to the sRGB or extended range sRGB color space.

And I think the most important feature of this capability is that we can choose the most optimal pixel format for every device on which your app can run.

So on newer devices that support ASTC texture compression, we can use that format.

On older devices which don't support that, we can choose either a noncompressed format or some other compressed format.

Additionally, we can choose a wide color format for devices with a P3 display.

So here's the basic workflow.

You create texture sets within Xcode.

You assign a name to the set, a unique identifier.

You'll add an image and indicate basically how that texture will be used, whether it's a color texture or some other type of data like a normal map or a height map.

Then, you can create this texture.

Xcode will build this texture and deliver it to your application.

Now, you can create these texture sets via the Xcode UI or programmatically.

Once your texture is on the device, you can supply the name to MetalKit, and MetalKit will build a texture, a Metal texture, from that asset.

So I'd like to walk you through the Xcode workflow to introduce some of these concepts to you.

So, you'll first select the asset catalogue in your project navigator sidebar and then hit this plus button here, which brings up this menu.

If you have a number of textures that are called base texture, one for each object, you can create a folder for each object and put the base texture for that object in that folder, and your hierarchy can be as complex as you'd like.

You add your image, and then you set the interpretation.

Now there are three options here.

Color and Color NonPremultiplied perform this color match operation.

The NonPremultiplied option will multiply your R, G, and B channels by the alpha channel before building the texture.

The Data option here is used for normal maps, height maps, roughness maps, and other textures of a noncolor type.

Now, this is all you need to do.

Xcode will go off and build various versions of this texture, and it will pick the most optimal pixel format.

You can, however, have more explicit control.

You can select any number of these traits here, which will open up a number of buckets that you can select to customize.

You can add different images for each version.

You probably wouldn't use a different image, but maybe a different size of an image.

So on a device with lots of memory you can use a bigger texture, and on a device with less memory you would use a much smaller texture.

And then you can specify how or whether you want mipmaps.

The All option will generate mipmaps all the way down to the 1 by 1 level, and the Fixed option here will give you some more explicit control, such as whether you want to use a max level and also whether you want to have different images for each level.

And finally, you can override our automatic selection of pixel formats.

Now, I mentioned that you can programmatically create these texture sets.

You don't really want to go through the Xcode UI if you've got thousands of assets.

So there's a pretty simple directory structure, and within that directory structure are a number of JSON files.

Now, these files and the directory structure are fully documented in the asset catalogue reference.

So you can create your own asset conditioning tool to set up your texture sets.

So once you've got this asset on the device, how do you make use of it?

Well, you create a MetalKit texture loader, supplying your Metal device, and then you supply the name along with its hierarchy to the texture loader, and MetalKit will go off and build that texture.

You can supply a couple of other options here, such as the scale factor if you have different versions of the texture for different scale factors, or the bundle if the asset catalogue is in something other than the main bundle.

There are also a couple of options here that you can specify.
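
A sketch of that loading path in Swift; the asset name here is hypothetical:

```swift
import Metal
import MetalKit

let device = MTLCreateSystemDefaultDevice()!
let textureLoader = MTKTextureLoader(device: device)

// Load a texture set from the asset catalogue by name,
// including any folder hierarchy you created in Xcode.
let texture = try textureLoader.newTexture(name: "Rock/BaseColor",
                                           scaleFactor: 1.0,
                                           bundle: nil,   // nil means the main bundle
                                           options: nil)
```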

So I'd really like you to pay attention to color space and set your apps apart by creating content with wide color.

Asset catalogues can help you achieve that goal.

As well, they provide a number of other features which you can make use of, such as optimal pixel format selection.

I'd like to have my colleague Anna Tikhonova up here to talk about some exciting improvements to the Metal Performance Shaders framework.

[ Applause ]

Hi. Good afternoon.

Thank you, Dan, for the introduction.

As Dan said, my name is Anna.

I'm an engineer on the GPU Software Team.

So let's talk about some new additions to the Metal Performance Shaders.

We introduced the Metal Performance Shaders framework last year in the What's New in Metal Part 2 talk.

If you haven't seen that session, you should definitely check out the video.

But just to give you a quick recap, the Metal Performance Shaders framework is a framework of optimized, high-performance, data-parallel algorithms for the GPU, implemented in Metal.

The algorithms are optimized for iOS, and they have been available for you since iOS 9, for the A8 and now the A9 processors.

The framework is designed to integrate easily into your Metal applications and be very simple to use.

It should be as simple as calling a library function.

So last year, we talked about the following list of supported image operations, and you should watch the video for lots of details and examples.

Deep learning is a field of machine learning whose goal is to answer this question.

Can a machine do the same task that a human can do?

Well, what types of tasks am I talking about?

Each one of you has an iPhone in your pocket.

You probably took a few pictures today, and all of us are constantly exposed to images and videos on the Web every day, on news sites, on social media.

When you see an image, you know instantly what is depicted in it.

You can detect faces.

If you know these people, you can tag them.

You can annotate this image.

And this works well for a single image, but what if you have more images, and even more images?

Think about all of the images uploaded to the Web every day.

No human can hand annotate this many images.

So deep learning is a technique for solving these kinds of problems.

It can be used for sifting through large amounts of data and for answering questions such as, "Who's in this image?"

And "Where was it taken?"

But I'm using image-based examples in this talk because they are visual.

So they are a great fit for this type of presentation, but I just want to mention that deep learning algorithms can be used for other types of data.

For example, other types of signals, like audio to do speech recognition, and haptics to create the sense of touch.

Deep learning algorithms have two phases.

The first one is the training phase.

So let's talk about it and give a specific example.

So imagine that you want to train your system to categorize images into classes.

This is an image of a cat.

This is an image of a dog.

This is an image of a rabbit.

This is a labor-intensive task that requires a large number of hand-labeled, annotated images for each one of these categories.

So for example, if you want to train your system to recognize cats, you need to feed it a large number of images of cats, all labeled, and the same for your rabbits and all the other animals that you want your system to be able to recognize.

This is a one-time computationally expensive step.

It's usually done offline, and there are plenty of training packages available out there.

The result of the training phase is trained parameters.

So I will not talk about them right now, but we will get back to them later.

The trained parameters are required for the next phase, which is the inference phase.

This is the phase where your system is presented with a new image that it has never seen before, and it needs to classify it in real-time.

So in this example, the system correctly classified this image as an image of a cat.

We provide GPU acceleration for the inference phase.

Specifically, we give you the building blocks to build your inference networks for the GPU.

So let's now talk about what convolutional neural networks are and what building blocks we provide.

When our brain processes visual input, the first hierarchy of neurons that receive information in the visual cortex are sensitive to specific edges or blobs of color, while the brain regions further down the visual pipeline respond to more complex structures like faces or kinds of animals.

So in a very similar way, convolutional neural networks are organized into layers of neurons which are trained to recognize increasingly complex features.

So the first layers are trained to recognize low-level features like edges and blobs of color, while the subsequent layers are trained to recognize higher-level features.

So for example, if we are doing face detection, then we will have layers that recognize features like noses, eyes, and cheeks, and then combinations of these features, and then finally faces.

And then the final few layers combine all the generated information to produce the final output for the network, such as the probability that there is a face in the image.

And I keep mentioning features.

Think of a feature as a filter that filters the input for that feature, such as a nose, and if that feature is found, the information is passed along.

If that feature is found, this information is passed along to the subsequent layers.

And, of course, we need to look for many such features.

So if we're doing face detection, then looking for just noses is simply not enough.

We also need to look for other facial features like cheeks and eyes, and then combinations of such features.

So we need many of these feature filters.

So now that I've covered convolutional neural networks, let's talk about the building blocks we provide.

The first building block is your data.

We want you to use MPS images and MPS temporary images, which we added specifically to support convolutional networks.

They provide an optimized layout for your data, for your input and intermediate results.

Think of MPS temporary images as lightweight MPS images, which we want you to use for image data with a transient lifetime.

MPS temporary images are built using Metal resource heaps, which were described in Part 1 of these sessions.

They reuse cached memory, and they avoid expensive allocation and deallocation of texture resources.

So the goal is to save you lots of memory and to help you manage intermediate resources.

We also provide a collection of layers, which you can use to create your inference networks.

But you may be thinking right now, "How do I know which building blocks I actually need to build my own inference network?"

So the answer is trained parameters.

The trained parameters, I mentioned them previously when we talked about the training phase.

They tell you how many layers you will have, what kind they will be, and in which order they will appear, and you also get all those feature filters for every layer.

So we take care of everything under the hood to make sure that the networks you build using these building blocks have the best possible performance on all iOS GPUs.

All you have to do is to put your data into this optimized layout that we provide and to call library functions to create the layers that make up your network.

So now let's discuss all these building blocks in more detail, but let's do it in the context of a specific example.

So in this demo, I have a system that has been trained to detect smiles.

And what we'll have is, in real-time, the system will detect whether I am smiling or not.

So I will first smile, and then I will frown, and you will see the system report just that.

[ Laughter ]

All right.

So that concludes my demo.

[ Applause ]

Okay. So now let's take a look at the building blocks that I needed to build this kind of a network.

So the first building block we're going to talk about is the convolution layer.

It's the core building block of convolutional neural networks, and its goal is to recognize features in the input.

And it's called a convolution layer because it performs a convolution on the input.

So let's recall how regular convolution works.

You have your input and your output, and in this case a 5 by 5 pixel filter with some weights.

And in order to compute the value of this pixel in your output, you need to convolve the filter with the input.

Pretty easy.

The convolution layer is a generalization of regular convolution.

It allows you to have multiple filters.

The different filters are applied to the input separately, resulting in different output channels.

So say you have 16 filters.

That means you have 16 output channels.

So in order to get the value of this pixel in the first channel of the output, you need to take the first filter and convolve it with the input.

And in order to get the value of this pixel in the second channel of the output, you need to take the second filter and convolve it with your input.

Of course, in our example of smile detection, we are dealing with color images.

So that means that your input actually has three separate channels, and just because of how convolutional neural networks work, you need three sets of 16 filters, where you have one set for each input channel.

And then you apply the different filters to the separate input channels and combine the results to get a single output value.

So this is how you would create one of these convolution layers in our framework.

You first create a descriptor and specify such parameters as the width and height of the filters you're going to use and then the number of input and output channels.

And then you create a convolution layer from this descriptor and provide the actual data for the feature filters, which you get from the trained parameters.
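
A sketch of that API in Swift; the weights and biases would come from your trained parameters, and the zero-filled arrays here are placeholders:

```swift
import Metal
import MetalPerformanceShaders

let device = MTLCreateSystemDefaultDevice()!

// Describe a 5x5 convolution taking 3 input channels to 16 output channels.
let convDesc = MPSCNNConvolutionDescriptor(kernelWidth: 5,
                                           kernelHeight: 5,
                                           inputFeatureChannels: 3,
                                           outputFeatureChannels: 16,
                                           neuronFilter: nil)

// Trained parameters: 5 * 5 * 3 * 16 weights plus 16 bias terms.
let weights = [Float](repeating: 0.0, count: 5 * 5 * 3 * 16)
let biases  = [Float](repeating: 0.0, count: 16)

let convolution = MPSCNNConvolution(device: device,
                                    convolutionDescriptor: convDesc,
                                    kernelWeights: weights,
                                    biasTerms: biases,
                                    flags: .none)
```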

The next layer we are going to talk about is the pooling layer.

The function of the pooling layer is to progressively reduce the spatial size of the network, which reduces the amount of computation for the subsequent layers.

And it's common to insert a pooling layer in between successive convolution layers.

Another function of the pooling layer is to summarize or condense information in a region of the input, and we provide two pooling operations, maximum and average.

So in this example, we take a 2 by 2 pixel region of the input.

We take the maximum value and store it as our output.

And this is the API you need to use in the Metal Performance Shaders framework to create one of these pooling layers.

It's common to use the max operation with a filter size of 2 by 2.
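
A sketch of creating such a pooling layer, a 2 by 2 max pool with a stride of 2 in each dimension:

```swift
import Metal
import MetalPerformanceShaders

let device = MTLCreateSystemDefaultDevice()!

// 2x2 max pooling with a stride of 2, which halves the spatial resolution.
let pooling = MPSCNNPoolingMax(device: device,
                               kernelWidth: 2,
                               kernelHeight: 2,
                               strideInPixelsX: 2,
                               strideInPixelsY: 2)
```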

The fully connected layer is a layer where every neuron in the input is connected to every neuron in the output.

But think about it as a special type of convolution layer where the filter size is the same as your input size.

So in this example, we have a filter of the same size as the input, and we convolve them to get a single output value.

So in this architecture, the convolution and pooling layers operate on regions of the input, while the fully connected layer can be used to aggregate information from across the entire input.

It's usually one of the last layers in your network, and this is where your final decision-making takes place and you generate the output for the network, such as the probability that there's a smile in the image.

And this is how you would create one of these fully connected layers in the Metal Performance Shaders framework.

You create a convolution descriptor, because this is a special type of convolution layer, and then you create a fully connected layer from this descriptor.
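
A sketch of that, assuming a hypothetical 8 by 8, 32-channel input going to 2 output channels (smile or no smile):

```swift
import Metal
import MetalPerformanceShaders

let device = MTLCreateSystemDefaultDevice()!

// A fully connected layer is described like a convolution whose
// filter size matches the input size.
let fcDesc = MPSCNNConvolutionDescriptor(kernelWidth: 8,
                                         kernelHeight: 8,
                                         inputFeatureChannels: 32,
                                         outputFeatureChannels: 2,
                                         neuronFilter: nil)

let fcWeights = [Float](repeating: 0.0, count: 8 * 8 * 32 * 2)  // from training
let fcBiases  = [Float](repeating: 0.0, count: 2)

let fullyConnected = MPSCNNFullyConnected(device: device,
                                          convolutionDescriptor: fcDesc,
                                          kernelWeights: fcWeights,
                                          biasTerms: fcBiases,
                                          flags: .none)
```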

We also provide some additional layers, which I'm not going to cover in detail in this presentation, but they are described in our documentation.

We provide the neuron layer, which is usually used in conjunction with the convolution layer, and we also provide the softmax and normalization layers.

So now that we've covered all of the layers, let's talk about your data.

I mentioned that you should be using MPS images.

So what are they really?

Most of you are already familiar with Metal textures.

So this is a 2D Metal texture with multiple channels, where every channel corresponds to a color channel or alpha.

And I mentioned in my previous examples that we need to create images with multiple channels, for example, 32 channels.

If we have 32 feature filters, we need to create an output image that has 32 channels.

So how do we do this?

So an MPS image is really a Metal 2D array texture with multiple slices.

And when you're creating an MPS image, all you really should care about is that you are creating an image with 32 channels.

But sometimes you may need to read the MPS image data back to the CPU, or you may want to use an existing Metal 2D array texture as your MPS image.

So for those cases, you need to know that we use a special packed layout for your data.

So every pixel in a slice of the structure contains the data for four channels.

So a 32-channel image would really just have eight slices.

And this is the API you need to use to create one of these MPS images in our framework.

You first create a descriptor and specify such parameters as the channel data format, the width and height of the image, and the number of channels.

And then you create an MPS image from this descriptor, pretty simple.

Of course, if you have small input images, then you should batch them to better utilize the GPU, and we provide a simple mechanism for you to do this.

So in this example, we create an array of 100 MPS images.
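
A sketch of both steps, the descriptor plus the image, and a batch of 100, with placeholder dimensions:

```swift
import Metal
import MetalPerformanceShaders

let device = MTLCreateSystemDefaultDevice()!

// Describe a 64x64 image with 32 feature channels, stored as 16-bit floats.
let imageDesc = MPSImageDescriptor(channelFormat: .float16,
                                   width: 64,
                                   height: 64,
                                   featureChannels: 32)

let image = MPSImage(device: device, imageDescriptor: imageDesc)

// For small inputs, batch several images to better utilize the GPU.
let batch = (0..<100).map { _ in MPSImage(device: device, imageDescriptor: imageDesc) }
```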

Okay, so now that we've covered all the layers and we've covered data, let's take a look at the actual network you need to build to do smile detection.

So we start with our input, and now we're going to use the trained parameters that I keep mentioning to help us build this network.

So the trained parameters tell us that the first layer in this network is going to be a convolution layer, which takes a three-channel image as input and outputs a 16-channel image.

The trained parameters also give us the three sets of 16 filters for this layer, and these colorful blue images show you a visualization of the output channels after the filters have been applied to the input.

The next layer is a pooling layer, which reduces the spatial resolution of the output of the convolution layer by a factor of two in each dimension.

The trained parameters tell us that the next layer is another convolution layer, which takes a 16-channel image as input and outputs a 16-channel image, which is further reduced in size by the next pooling layer, and so on until we get to our output.

As you can see, this network has a series of convolution layers followed by pooling layers, and the last two layers are the fully connected layers, which generate the final output for your network.

So now we know what this network should look like, and this is very common for a convolutional neural network for inference, so let's write the code to create it in our framework.

So the first step is to create the layers.

Once again, the trained parameters tell us that we need to have four convolution layers in our network, and I'm showing the code to create just one of them for simplicity, but as you can see, I'm using exactly the same API that I showed you before.

Then we need to create our pooling layer.

We just need one because we're always going to be using the max operation with a filter size of 2 by 2.

And we also need to create two fully connected layers, and once again I'm only showing you the code for one for simplicity.

And now, we need to take care of our input and output.

In this particular example, I'm assuming that we have an existing Metal app and you have some textures that you would like to use for your input and output, and this is the API that you need to use to create MPS images from existing Metal textures.

And so the last step is to encode all your layers into an existing command buffer in the order prescribed by the trained parameters.

So we have our input and our output, and now we notice that we need one more thing to take care of.

We need to store the output of the first layer somewhere.

So let's use MPS temporary images for that.

This is how you would create an MPS temporary image.

As you can see, this is very similar to the way you would create a regular MPS image.

And now we immediately use it when we encode the first layer.

And the temporary image will go away as soon as the command buffer is submitted.

And then we continue.

We create another temporary image to store the output of the second layer, and so on until we get to our output.

That's it.

And just to tie it all back together, the order in which you encode the layers exactly matches the network diagram that I showed you earlier, starting from the input and going all the way to the output.
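
Putting that encoding step together, a minimal sketch might look like this; it assumes layers created as above and placeholder dimensions:

```swift
import Metal
import MetalPerformanceShaders

// Hypothetical sketch: encode convolution -> pooling using a temporary
// image for the intermediate result.
func encodeFirstLayers(commandBuffer: MTLCommandBuffer,
                       convolution: MPSCNNConvolution,
                       pooling: MPSCNNPoolingMax,
                       inputImage: MPSImage,
                       outputImage: MPSImage) {
    // The intermediate result lives only for the lifetime of this command buffer.
    let intermediateDesc = MPSImageDescriptor(channelFormat: .float16,
                                              width: 64, height: 64,
                                              featureChannels: 16)
    let intermediate = MPSTemporaryImage(commandBuffer: commandBuffer,
                                         imageDescriptor: intermediateDesc)

    convolution.encode(commandBuffer: commandBuffer,
                       sourceImage: inputImage,
                       destinationImage: intermediate)

    pooling.encode(commandBuffer: commandBuffer,
                   sourceImage: intermediate,
                   destinationImage: outputImage)
}
```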

So now we've worked through a pretty simple example.

Let's look at a more complex one.

We've ported the Inception inference network from TensorFlow to run using the Metal Performance Shaders framework.

This is a very commonly used inference network for object detection, and this is the full diagram for this network.

As you can see, this network is a lot more complex than the previous one I showed you.

It has over 100 layers.

But just to remind you, all you have to do is to call some library functions to create these layers.

And now, first, let's take a look at this network in action.

So here I have a collection of images of different objects, and as soon as I tap on an image, we will run the inference network in real-time, and it will report the top five guesses for what it thinks this object is.

So the top guess is that it's a zebra.

Then this is a pickup truck, and this is a volcano.

So that looks pretty good to me, but of course, let's do a real live demo right here on this stage.

And we'll take a picture of this water bottle, and let's use this image: water bottle.

[ Applause ]

So what I wanted to show you with this live demo is that even a large network with over 100 layers can run in real-time using the Metal Performance Shaders framework, but this is not all.

I also want to talk about the memory savings we got from using MPS temporary images in this demo.

So in the first version of this demo, we used MPS images to store intermediate results, and we ended up needing 74 MPS images, totaling in size over 80 megabytes for the entire network.

And of course, you don't have to use 74 images.

You can come up with your own clever scheme for how to reuse these images, but this means more stuff to manage in your code, and we want to make sure that our framework is as easy for you to use as possible.

So in the second version of the demo, we replaced all the MPS images with MPS temporary images, and this gave us several advantages.

The first one is reduced CPU cost in terms of time and energy, but also, creating 74 temporary images resulted in just 5 underlying memory allocations, totaling just over 20 megabytes, and this is a 76% memory savings.

That's pretty huge.

So what I showed you with these two live demos is that the Metal Performance Shaders framework provides complete support for building convolutional neural networks for inference, and it's optimized for the iOS GPUs.

So please use convolutional neural networks to build some cool apps.

So this is the end of the What's New in Metal talks, and if you haven't seen the first session, please check out the video so you can learn about such cool new features as tessellation, resource heaps, memoryless render targets, and improvements to our tools.

You can catch the video and get links to related documentation and sample code.

And here's some information on the related sessions.

You can always check out the videos of the past Metal sessions online, but you can also catch the Advanced Metal Shader Optimization talk later today, and just note the location of this talk has changed to Nob Hill.

Tomorrow, you have an opportunity to catch the Working with Wide Color talk and the Neural Networks and Accelerate talk, where you can learn how to create neural networks for the CPU using the Accelerate framework.

So thank you very much for coming, and I hope you have a great WWDC.

[ Applause ]
