Metal 2 Optimization and Debugging

Developing Metal 2-based apps is even easier with the redesigned tools for debugging and profiling in Xcode. Dive into the enhanced Metal Frame Debugger and explore techniques for fine-tuning graphics and compute workloads. Learn about accessing detailed GPU performance counters, and check out new support in Metal System Trace for optimizing VR apps.

WWDC 2017

And welcome to Metal 2
Optimization and Debugging.
As you know, we're talking a lot
about Metal 2 this year, with
some great new enhancements to
the platform including GP driven
rendering, machine learning
acceleration, and a macOS VR,
and external GPU support.

And not forgetting Advanced
Optimization Tools.
So, this afternoon, we're going
to talk about the current Metal
tools, give you a recap of
those, talk about some great new
enhancements to the Metal frame
debugger, and then finally cover
some major enhancements in terms
of GPU profiling.
But first, the frame debugger.
So hopefully you're all familiar
with this tool.
It is our fully-featured frame
debugger integrated into Xcode,
that lets you capture your Metal
2 work, be it computer graphics,
and then step through it into
the debugger, to inspect state
and resources, letting you debug
and optimize.

One of our focuses this year has
been on improving the
performance to the frame
debugger, and in particular, we
paid particular regard to
improving the speed with which
captures happen.
And I'm happy to say that
compared to Xcode 8, Xcode 9 now
captures up to 10 times as fast,
getting you from clicking the
Capture button to into the
debugger much, much more
quickly.
As you'd expect, we have full
support for all the new Metal 2
API, including Raster order
groups, sampler arrays, viewport
arrays, and the new pixel and
vertex array formats.
One particular part of Metal 2
that we've paid a lot of
attention to is the support for
the new argument buffers.

With this, in the buffer viewer,
you can now see all the
argument, buffer arguments,
displayed in line, and you can
click through whether it be to
attach your sampler, buffer, or
another argument buffer itself.
We've also added support for VR
captures this year, with
automatic support for SteamVR.
And we've added support for you
to view your submitted surfaces
in stereo.
So when you get to the Submit
call, you're sending your
surfaces to the VR compositer,
you will see left eye and right
eye alongside each other, to
quickly spot any discrepancies.
Another area of focus this year
has been on improving the
workflow for capturing more
complex workloads.
So now if you're doing
compute-only work in Metal, or
perhaps you're using multiple
Metal cues, it's much easier to
capture exactly the work that
you want.
We've added the lightweight
capture API, with some new Metal
capture scope objects, that you
create at startup and then reuse
every frame to surround the work
that you want to capture.
You'll see a demo of this later,
but it's great for being able to
say, group all your regular
rendering work in one scope, and
some asynchronous work like
regenerating your tessellation
factor buffers, in another
scope, and when you come to
capture, you get exactly what
you want.

We also have support for
triggering captures
programmatically from your app.

We use this a lot within our
test apps ourselves, so that we
can quickly do a gesture on the
device, and trigger a frame
capture, without having to
switch our focus back to Xcode.

Another new feature this year is
support for Xcode's Quick Looks
support.

So now, you have lightweight
Metal debugging in the CPU
debugger.

So if you hit a breakpoint, and
there is a Metal texture there,
we'll pull back the data from
the metal texture on the GPU and
let you view it there and then.
Similarly, with buffers and
samplers.
This is great for those cases
where a full frame capture might
be too invasive, for instance,
if you're debugging your
resource loading or some of the
setup code.
It is also great in cases where
you're debugging some compute
workloads as well.
So last year we introduced
support for rich filtering
throughout the frame debugger,
so that you could filter both on
things like resource properties,
but also within the frame
navigator, you can filter based
on contextual data, so what
resources you're using at a
given draw call will let that
draw call show up.

Well we've taken this to the
next level this year, with
support for data mining
throughout your capture tray.
So now, when you type, we'll
give you context aware, or
complete suggestions.
And we now allow compound terms.
So now if you search for a given
encoder, and then you search for
a texture, we'll only show you
auto-complete suggestions for
the textures that are actually
used within that encoder.
One of our most requested
features over the years has been
support for pixel inspection.
And we've finally caught up with
that.
So now you can do detailed
inspection of individual pixels
within your textures and your
render targets.
And if you have multiple
attachments to your render
targets, we'll show you the
pixel values for the same
location in each attachment at
the same time.
So it's really good if you're
trying to debug what the color
value is alongside that and
stencil and such like.

It's also very valuable for
debugging compute workloads if
you're working with images there
and you are, for instance,
halfway through your CNN and you
want to test watch to see what
the exact values in the buffers
are.
Another new feature we
introduced last year was our
vertex attribute viewer, where
you can see all the vertex data
as it goes into your vertex
shader, you know, shown on a per
vertex basis.
Well, this year, we've added
support for viewing the outputs
from your vertex shader as well,
and we will display this inline
with all the other input data,
so in this case, you can see
your position inputs, and your
position outputs, at the same
time.
Well, to show you all these
great new features in action,
I'd like to invite my colleague,
Max, to the stage, who is going
to give you a demo of all this
new stuff.
Hello. Great to see you here
today.
I hope you are doing fine and
you are as hyped about Metal as
we are.
Xcode GPU debugger helps you
debugging your GPU and the Metal
usage.
I am Max, and I am going to
maximize your debugging
experience, showing our new
features.
Yeah, let me run my demo app.
It is rendering a beautiful
scenery, reflects snowy
mountains, grass that is waving
in the wind, and to make it even
look nicer, I added some
particles that are glowing in
the air.
But as you can see, the
particles of the grass, there is
some kind of a problem.
So let's figure this out.

As a first step, let's check if
the texture is correct.
Let me set a breakpoint in the
rendering loop, where this
texture is being used.
Hovering over a variable gives
you access to Xcode's data tips,
and you can quick look into the
texture data.

This data is fetched live from
the GPU, and it helps you to
verify the resources you are
binding, and, of course, it
works with all Metal resources.
The texture in this case looks
correct.
So what else can we check?
Our next step is to capture a
frame.
Using the little camera icon in
the debug bar, let's you capture
a frame, but using a long press
gives you access to capture
scopes and command cues.

A capture scope is one path
through your rendering pipeline.
Like my environment map, I'm
only updating every couple of
frames.
In this case, however, we want
to capture rendering, this is
where the particles are being
drawn.

So let's capture this.
And already done.
For those who are not familiar
with our tool, I will give you a
quick run through of all the
views you are seeing here.

On the left side, we have the
debug navigator.
It is in [inaudible]
presentation of your frame, and
to help you, we automatically
group by command buffers and
command encoders.
But also your debugging groups
are shown here, giving you fine
grain control over the grouping.
You can select the draw call or
any other Metal call to inspect
further details.
The editor in the center is
showing the bound resources.

All the Metal objects you are
using in the selected API call.
Again, you can see labeling your
objects will greatly increase
readability.
So I suggest to do that.

The editor on the right side is
showing the attachments, the
output of the last issued draw
call.
So whenever you are, like,
navigating through your frame,
you instantly see where you are.
On the bottom, we have our
variables view, where you can
access all the states of each
Metal object.
Back to our problem with the
particles.
Here, we can make use of our new
super powerful filtering.

So I know the particles are
drawn somewhere in my forward
rendering.

So let me filter for this.
Filtering for command encoder
will only show API calls inside
this command encoder.
Like this.
But it is still a lot.

So let me add an additional
filter.
We know it is using our particle
texture.
Filtering for texture will only
show draw calls using this
texture.
And boom, this combination of
filters results in a single API
call we want to inspect further.
So let's go here.
Let's take a look at the bound
resources.
The vertex attributes combines
the data that is going into your
vertex function and leaving it.
Maybe we are doing something
wrong with our geometry here, so
let's open this by double
clicking.
Let me also hide the attachments
for a moment.
Last year, we started to show
you a nice layout for all the
buffers.
And this year we added something
more.

In the header, you can see the
direction where the data is
flowing.

And if we take a look, this is
the output data, the data that
is leaving the vertex function.

This is the output position of
every particle vertex, and as we
can see here, there is no
obvious error, like big numbers
or something like this, so I
assume this data is correct.

So what else can we check?
The debug navigator now gives
you quick access to all the
views related to this draw call.
Let's switch back to the
attachments again.

We are using two render targets
here.
Color, and depth.

Let's inspect some more pixel
values.
Using the inspect pixels button
in the lower right corner, we
will present a new tool.
A loop. This loop displays the
value like they are outputted by
the fragment function.
And you can move the loop around
all the render targets.
But you can also use your arrow
keys for pixel precise control,
even if you are not zoomed in.
Also, you notice, all the loops
are being synchronized between
all the render targets.
That helps you to relate values.
Let me find an interesting pixel
now.
Using a long press will
instantly move the cursor.

And here we can see something
strange.
The depth value inside and
outside a particle is different,
and let's opt, our particles
shouldn't write into the depths
buffer, of course.
That will be an easy fix.
And I'm also sure our new GPU
debugger will help you fixing
your issues with the GPU.
I hope we see each other in the
labs tomorrow morning, or at the
[inaudible] later today.
Back to my colleague, Seth.

So now, onto GPU profiling.
As you know, performance is
crucial to games and other
graphical applications, and
achieving a consistent, fast
framework is always necessary.
But on the flip side, you want
to get the most of the GPU for
the best looking game as well,
and at the same time, increase
efficiency for a longer game
experience.
Well, for all this, you need to
use the GPU Profiling tools.

The first tool I want to talk
about is Metal System Trace.
This is our tool for
investigating timing issues, by
which I mean investigating cases
where the CPU and the GPU might
not be running in parallel,
because you have some synching
operations by mistake, and
you're forcing them to work in
serial.
It's also great for
investigating those cases where
you're mostly achieving the
framework you want, but
occasionally you get a stutter,
and you need to figure out,
okay, what is going wrong in
that particular frame?
It lets you trace your Metal
workloads through the system,
from CPU to GPU to display.
This year, we've added support
for VR applications, with
specific VR trace points for
activities like when you query
the head set for post-data, when
you submit your surfaces to the
VR compositor, when it does its
work to do the compositing, and
finally, when it hits the glass
on the headset.
In effect, it lets you trace
from motion to photon.
We've also added support this
year for the new ProMotion
displays, as you'll find in the
new iPads, iPad Pros released
early this week, and also
support for external GPUs on
macOS.
It is also worth noting there
are some great improvements in
the instruments, to make it much
easier to view other instruments
alongside Metal System Trace in
a more integrated fashion.
Our next profiling tool is the
GPU Shader Profiler.
The tool for probing shader
performance.

It is integrated into the frame
debugger, and lets you view
shader time on a per draw call
and per pipeline basis.
And if you're on iOS or tvOS, it
also lets you view it on a per
line basis.
Well, our first new tool this
year is designed to work hand in
hand with the GPU Shader
Profiler.
We call that Metal Pipeline
Statistics.
Metal Pipeline Statistics gives
you a direct line to the GPU
compiler to find out about the
quality of the machine code the
compiler is generating from your
shader.
It gives you a rich set of
statistics with things such as
instruction count, instruction
mix, by which I mean the
relative ratio of operations
such as ALU or memory or control
flow, and on GPUs where it is
relevant, it will also show you
register usage and occupancy.
For GPUs such as that, these
measures are crucial in
understanding what is the
limitations on how many shaders
can be scheduled simultaneously,
by which shader instances can be
scheduled simultaneously.
But even better are the new
compiler remarks.
With this, the GPU compiler will
give you direct actual guidance
on the performance of your
shader, and things you can do to
avoid performance hits, from
things such as slow math usage,
register spills, and stack
usage.

It's like having a GPU compiler
engineer built into every Xcode.
For each remark, it will explain
what it means, what you can do
to reduce it, and give you a
link to where you need to go to
fix it.
Well, to demo this new feature,
I'd like to invite my colleague
Jose to the stage, to give you a
tour of Metal Pipeline
Statistics.

Hello everyone.
My name is Jose Enrique
[inaudible].
I am going to present you a new
feature of our GPU friendly
debugger that will have you
produce good quality.
As you can see, we are replaying
a capture of Metal [inaudible]
demo for iOS.
The first thing I'm going to do,
I'm going to change my debug
navigator view from view frame
by call to view frame by
performance.
What this gives, what this view
gives is all these pipelines you
capture, sorted by time.
Remember, in Metal, a shader is
always linked to a pipeline,
therefore, this is a list of all
initiator combinations that are
available in [inaudible]
capture.
I am going to go to a
[inaudible] view of the most
expensive pipeline, to see if we
can improve the shaders.
As you can see, this view has
three sections.
On top, we have remarks.
Remarks are a unique approach to
the compiler shader quality.
As the report, final compiler
co-generation issues.

Remember, the GPU will execute
your shader millions of times
per frame, therefore, the
[inaudible] you can get, the
better the performance it will
have.

Remarks also are sorted by
relevance, and if expanded, they
offer you reason on why we're
reporting it, and our
recommendation on how to prevent
it.

Below remarks, we have an
overview for each shader, where
you can see how the compiler
final assemblies compose
[inaudible] of instruction
ratios.

And finally, we have a list of
all the recalls using this
pipeline.

This will become very handy
whenever we are iterating over
our shaders.

So let me showcase an example of
workflow profiler statistics.
We go to our top remark,
Register Spill, we can see the
compiler is reporting a big
spill, 1,040 bytes.

Spills will cause the GPU to
access memory, which can stall
your shade execution.

Knowing that a compiler is
spilling and fixing it can
drastically improve your shader
performance, yet finding where
and why the compiler is spilling
tends to be a time consuming
manual [inaudible].
But notice the second and fourth
remarks.

Dynamic Stack Store, and Dynamic
Stack Load.
If expanded, they offer a reason
an expensive stack load is
emitted to a dynamic offset in a
local array.

As well as a recommendation, to
reduce the stack access,
eliminate dynamic access to
local arrays.
This is basically saying that we
have an array variable in our
shader code that is storing the
stack and where accessing it
using some other index variable.

This is a very common pattern
when providing for the CPU, but
GPUs are different, they suffer
when we rely on stack usage.
But note the languages below the
recommendation.

It has an exact line number.
This means that we option click
to it, we'll jump directly to
the [inaudible] line where the
compiler is actually loading
data from the stack array.

We just found the compiler
spill.
Also these align very well with
our shared performing data,
informing us of the high cost of
this line, which now we know
exactly why.
The shader is doing two passes.
The first pass, doing some light
computation.
And a second pass, doing the
light accumulation.

This speaks to the issue by
working with the compiler but
from a GPU perspective.

The first thing I'm going to do,
I'm going to remove the stack
array, I am going to remove it
in relation, and then I'm going
to compute the light
accommodation directly into the
first loop.
Then I am going to remove the
second loop, and just not do
that anymore.
Now, I click my update shader
button, and wait for the
results.
What this is going to do, it's
going to have the compiler
perform a whole loop
optimization and reuse the same
reducer over and over, instead
of relying on the stack.
Once the results are back, we
can see the instruction ratio
between the previous and current
[inaudible] has been reduced, as
well as the impact of this
change on every single draw call
used in the pipeline, giving us
also whole space performance
improvements.
And with this, I conclude my
example of [inaudible]
statistics.
I hand it back to my colleague,
Seth.

Thank you, Jose.
And finally, onto our last new
tool today.
GPU Counter Profiling.
As you know, GPU architecture is
complex, with a pipeline
consisting of multiple
programmable and fixed function
blocks, bottlenecks can occur at
any point within the pipeline,
and often, at multiple
simultaneous points.
Your job as Metal programmers is
to minimize fixed function
bottlenecks while efficiently
harnessing programmable blocks.
Well, to do that, our new GPU
Counter Profiling is the tool.
Instead of going directly into
the GPU Frame Debugger, it gives
you detailed GPU hardware
performance statistics on a per
draw call, on macOS and per
encoder on iOS and tvOS basis.
And instead of giving you an
arcane list of counters that
change for every GPU and are
hard to understand, and often
don't tell you what you need to
know, we've defined a high level
set of characters that mean the
same across each GPU.

So you don't have a per GPU
learning curve.
So here is the Counter
Profiling.
On the left, we have a graph
view, showing you detailed GPU
counter graphs, and on the
right, the detail window.
Let's talk about those in order.

In the graph view, we show you
counters across your frame,
where the x axis represents draw
calls, or encoders across time.
At the top, we show you GPU
time.

As that is inherent to all GPU
counter profiling.
And then below that, a set of
top level counters that
correspond to the stages in the
GPU pipeline, along with some
other top level counters that
correspond to shared execution
units, such as the shader core
and test units.
For each group, you can drill
down to more detailed counters,
exploring a lot more data for
each stage, and this is great
for work flow where you identify
it as your first cut, where you
think the performance issue is,
and then drill down to see why
it's going on.
In the detail view, we'll show
you all the same counters from
the counter graph view, but
displayed in full detail
numerically.

And to give it context, we will
show it alongside the median,
max, and total values for the
frame as well.
Now, both the graph view and the
detail views support the same
full, rich filtering options
that we support elsewhere in the
frame debugger, so if you want
to just view certain pixel stats
and certain memory stats at the
same time, you can put together
the search term, and view
everything you want alongside
each other.

But I'll highlight the GPU
counter profiling is our advance
to bottleneck analysis.

With this, we take all the
counters that have been pulled
for each draw call, or each
encoder, and perform rich
analysis on it, both on a cross
platform basis, and on a per GPU
specific basis to identify
potential bottlenecks at each
call.

Alongside this, we will give you
lots of data about, okay, what
is going on here?
What could cause it?
And then intuitive workflow to
navigate direct to the affected
area.
Now, both the, all the
bottlenecks and all the counters
will have rich detailed
documentation within Xcode docs,
explaining what each counter
means in detail, why it might be
particularly high or
particularly low, and what you
can do about it.
To give a demo of this great new
GPU Counter Profiling feature,
I'd like to invite my colleague,
Jose, back to the stage to give
you a demo of it in action.

Thank you, Seth.
And hello again everyone.

This time, I will demonstrate
GPU counters, a tool that will
help you analyze your GPU
performance.
First, I'm going to replay the
same demo that [inaudible] was
on, but this time, we'll focus
from a performance perspective.
The first thing to note is, the
new GPU gauge, just under the
FPS gauge.
By clicking on it, we'll
navigate to our GP counter view.
As you can see, there is a
wealth of data here.

Available for the first time.
With this view, you can now
deliver file any GPU performance
issue that you have in any of
your capture frames.
So let me demonstrate how to
find performance issues.
First, let's focus on the graph
view.

'
As you can see, there is a main
spike in GPU time at the very
beginning of a capture.
The first thing you want to do
is to zoom in to see a single
recall, there are more
offenders.
In order to do that, I can
simply pinch and zoom, just like
that.
Any default system behaviors
will work just as you expect
them to.
Now I will see that there is a
main spike.
You can click on this draw call
to highlight this impact across
all the pipeline, and hovering
over each row will give us
detailed information on how
relevant that stage is for this
particular draw call.
In this case, Vertex Omission,
Vertex Shader, and Primitives
did not seem to have relevant
impact.

On the contrary, Fragment Shader
and Pixels [inaudible] seemed to
be quite high.

Let's focus on the Fragment
Shader first.
If we expand this group we now
get access to a massive amount
of counter data that gives us
detailed information what is
going on with the Shader stage.
The last thing that this
counted, we can quickly see that
the stall time is unusually
high, over 76%.
This means that most of the time
we are spending on the Fragment
Shader is actually waiting for
some memory or data to be
available.
This is caused because you are
fetching from a buffer or from a
texture, but texture captures
should be here in this latency,
so let's go down to our Texture
Unit, to see what is the cache
rate.
And we can immediately see that
the textures cache rate is also
unusually high, almost at 60%.
This means that more than half
of the texture samples we are
doing are coming from video
memory and not from the texture
caches.
Now that we have a better
understanding of the issue at
hand, let's focus on the
assistant editor.
As you can see, the assistant
editors offer the same graph
inform-- counter information as
the graph view was offering, but
this time, displayed as a table
view.
But more important, look at the
top.
This is our bottleneck access
tool.

It will point out two relevant
issues that we consider when we
analyze all the counters within
the selected draw call, and
point out any relevant issues
that we cconsider important for
you to pay attention to.
In this case, highlighting the
same as we just found manually
by checking the graph, that the
texture cache miss rate is high.
When expanded, it also offers
recommendations on what to
check.
In this case, check if sampled
textures have [inaudible], and a
quick navigational name to
relevant views for this issue.

For example, boundary sources,
where we can immediately see
what's the issue at hand, we're
fetching a 4K by 4K RGBA32
Floating Point Texture with
[inaudible] in both our vertex
and our Fragment Shader.
This is a 256-megabyte texture
that is fetched all over the
pipeline.
No wonder we are trashing our
cache.

Just think for a moment what we
just did.
This was an incredibly detailed
view of how GPU internals work.
You finally have the data to
prove what [inaudible] telling
you, that fetching from the
textures is expensive, but now
you know exactly why.

Accessing this texture was a
star on the Fragment Shader,
because it had to fetch some
data from [inaudible] memory
that was not available in the
caches.

This level of detail is
typically not seen outside
consult tools.

Solving this issue now is a
matter of balancing performance,
quality, and correctness, but
you have demonstrated how you
can use GPU counters [inaudible]
the GPU Frame Debugger, to help
you investigate, analyze, and
verify any capture information
-- any performance information
that you have in your captures.
And now, I hand it back to my
colleague, Seth.

Thank you, Jose.
So that is GPU Counter
Profiling.
Like all the new features we
talked about today, it's the
ultimate joy in Xcode Beta 9,
and it's available for all Metal
capable GPUs.

You will find that more recent
GPUs have more counters
available due to the more modern
nature of the GPU, but there's
still a rich and very usable set
available for all GPUs.

However, we would love to hear
feedback from you if you feel
there's particular counters that
were unexposed that would be
particularly valuable, you know,
please drop by the lab, or
[inaudible] and we'll be happy
to investigate.
So what have we talked about
today?
We've talked about some great
enhancements to the Metal Frame
Debugger, with support for pixel
inspection, inspecting Vertex
Shader outputs, rich filtering,
better capture support, better
capture performance, and Xcode
Metal Quick Looks.

We've talked about support for
debugging and profiling VR
applications in Metal Tray
Debugger, and Metal System
Trays.
We've talked about Metal
Pipeline Statistics, giving you
a direct line to the GPU
compiler for performance
information.
And we've introduced GPU Counter
Profiling, giving you unheralded
access to GPU Performance
Counter Data in Metal.
For more information, check out
the website.
Code is 607.
I did want to call out a couple
of other sessions.
If you didn't catch either the
Introducing Metal 2, or VR With
Metal 2 sessions earlier on this
week, it's highly worth going
and checking out the video, even
if you did see them, it's still
worth checking out the videos.
And coming up later this
afternoon, there is a great
session on using Metal 2 for
Compute, in Grand Ballroom A, at
10 past 4.
And that's it.
Thanks for coming.

Have a great remainder of your
WWDC 17 and enjoy the bash.
Thank you.

Looking for something specific? Enter a topic above and jump straight to the good stuff.