
{{infobox feature
| core = 4.3
| core_extension = {{extref|compute_shader}}
}}


A '''Compute Shader''' is a [[Shader Stage]] that is used entirely for computing arbitrary information. While it can do rendering, it is generally used for tasks not ''directly'' related to drawing triangles and pixels.


== Overview ==


Compute shaders operate differently from other shader stages. All of the other shader stages have a well-defined set of input values, some built-in and some user-defined. They have a well-defined set of output values, some built-in and some user-defined. The frequency at which a shader stage executes is specified by the nature of that stage; vertex shaders execute once per input vertex, for example (though some executions can be skipped via caching). Fragment shader execution is defined by the fragments generated from the rasterization process.


Compute shaders work very differently. The "space" that a compute shader operates on is largely abstract; it is up to each compute shader to decide what the space means. The number of compute shader executions is defined by the function used to execute the compute operation. Most important of all, compute shaders have no user-defined inputs and no outputs at all. The built-in inputs only define where in the "space" of execution a particular compute shader invocation is.


Therefore, if a compute shader wants to take some values as input, it is up to the shader itself to fetch that data, via [[GLSL Sampler|texture access]], [[Image Load Store|arbitrary image load]], [[Shader Storage Buffer Object|shader storage blocks]], or other forms of interface. Similarly, if a compute shader is to actually compute anything, it must explicitly write to an image or shader storage block.


=== Compute space ===


The space that compute shaders operate within is abstract. There is the concept of a ''work group''; this is the smallest amount of compute operations that the user can execute. Or to put it another way, the user can execute some number of work groups.

The number of work groups that a compute operation is executed with is defined by the user when they invoke the compute operation. The space of these groups is three dimensional, so it has a number of "X", "Y", and "Z" groups. Any of these can be 1, so you can perform a two-dimensional or one-dimensional compute operation instead of a 3D one. This is useful for processing image data or linear arrays, such as the particles of a particle system.


When the system actually computes the work groups, it can do so in any order. So if it is given a work group set of (3, 1, 2), it could execute group (0, 0, 0) first, then skip to group (1, 0, 1), then jump to (2, 0, 0), etc. So your compute shader should not rely on the order in which individual groups are processed.


Do not think that a single work group is the same thing as a single compute shader invocation; there's a reason why it is called a "group". Within a single work group, there may be many compute shader invocations. How many is defined by the ''compute shader itself'', not by the call that executes it. This is known as the ''local size'' of the work group.


Every compute shader has a three-dimensional local size (again, sizes can be 1 to allow 2D or 1D local processing). This defines the number of invocations of the shader that will take place within each work group.


Therefore, if the local size of a compute shader is (128, 1, 1), and you execute it with a work group count of (16, 8, 64), then you will get 1,048,576 separate shader invocations. Each invocation will have a set of inputs that ''uniquely'' identifies that specific invocation.
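
The arithmetic behind that figure can be checked with a small host-side sketch in C. The function name and the array parameters are hypothetical, chosen only for illustration; they are not part of OpenGL.

```c
#include <stdint.h>

/* Total invocations for one dispatch: the product of the local size
   components times the product of the work group count components.
   64-bit arithmetic is used so that large dispatches do not overflow. */
uint64_t total_invocations(const uint32_t local_size[3],
                           const uint32_t group_count[3])
{
    uint64_t per_group = (uint64_t)local_size[0] * local_size[1] * local_size[2];
    uint64_t groups = (uint64_t)group_count[0] * group_count[1] * group_count[2];
    return per_group * groups;
}
```

With a local size of (128, 1, 1) and a work group count of (16, 8, 64), this gives 128 * 8192 = 1,048,576.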


This distinction is useful for doing various forms of image compression or decompression; the local size would be the size of a block of image data (8x8, for example), while the group count will be the image size divided by the block size. Each block is processed as a single work group.
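
As a rough sketch of that division, the following C helper computes the group count along one axis. Rounding up for images whose size is not a multiple of the block size is an assumption added here, not something the text above prescribes; a shader dispatched this way would have to skip the invocations that fall outside the image. The function name is hypothetical.

```c
#include <stdint.h>

/* Number of work groups needed to cover `extent` texels along one axis
   when each group handles `block` texels. Rounds up so that a partial
   block at the edge still gets a group. `block` must be greater than 0. */
uint32_t group_count_for(uint32_t extent, uint32_t block)
{
    return extent / block + (extent % block != 0u ? 1u : 0u);
}
```

For a 1920x1080 image and 8x8 blocks, this yields 240 groups in X and 135 in Y.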


The individual invocations within a work group will be executed "in parallel". The main purpose of the distinction between work group count and local size is that the different compute shader invocations ''within'' a work group can inter-communicate through a set of {{code|shared}} variables. Invocations between work groups can theoretically inter-communicate, but only through some form of global memory and this requires difficult synchronization.


== Dispatch ==


Compute shaders are not part of the regular [[Rendering Pipeline Overview|rendering pipeline]]. So the usual [[Vertex Rendering]] functions do not work on them.

A [[GLSL Object|program object]] can have a compute shader in it. A compute shader cannot be linked into the same program object as other [[Shader Stage]]s; when it sits alongside them in a program pipeline, it is effectively inert to rendering functions.

There are two functions to initiate compute operations. They will use whichever compute shader is currently active (via {{apifunc|glBindProgramPipeline}} or {{apifunc|glUseProgram}}, following the usual rules for determining the active program for a stage).

The first is {{apifunc|glDispatchCompute}}. Its {{param|num_groups_*}} parameters define the work group count, in three dimensions. If any of these numbers is zero, no work groups are dispatched. There are [[#Limitations|limitations]] on the number of work groups that can be dispatched.

It is possible to execute dispatch operations where the work group count comes from information stored in a [[Buffer Object]], using {{apifunc|glDispatchComputeIndirect}}. This is similar to [[Vertex_Rendering#Indirect_rendering|indirect rendering for vertex data]].

The {{param|indirect}} parameter is the byte offset into the buffer currently bound to the {{enum|GL_DISPATCH_INDIRECT_BUFFER}} target. Note that the same limitations on work group counts ([[#Limitations|see below]]) still apply; however, indirect dispatch bypasses OpenGL's usual error checking. As such, attempting to dispatch with out-of-bounds work group counts can cause a crash or even a GPU hard-lock.
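
Because the indirect path is not error-checked, it can help to validate the counts on the CPU before writing them into the buffer. The following is a minimal sketch in C, assuming the buffer holds three tightly packed unsigned integers (the X, Y and Z group counts). The struct and function names are hypothetical, and {{code|max_count}} is assumed to have been filled from the {{enum|GL_MAX_COMPUTE_WORK_GROUP_COUNT}} queries.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical CPU-side mirror of the data an indirect dispatch reads:
   the X, Y and Z work group counts, stored back to back. */
typedef struct {
    uint32_t num_groups_x;
    uint32_t num_groups_y;
    uint32_t num_groups_z;
} DispatchCommand;

/* Returns true when every count is within the per-axis maximum.
   max_count[] is assumed to hold the three queried limits. */
bool dispatch_command_in_range(const DispatchCommand *cmd,
                               const uint32_t max_count[3])
{
    return cmd->num_groups_x <= max_count[0] &&
           cmd->num_groups_y <= max_count[1] &&
           cmd->num_groups_z <= max_count[2];
}
```

A check like this only helps when the CPU writes the counts; if a shader generates them on the GPU, the clamping has to happen in that shader instead.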


== Inputs ==

Compute shaders cannot have any user-defined input variables. The stage only provides the following built-in input variables:

<source lang="glsl">
in uvec3 gl_NumWorkGroups;
in uvec3 gl_WorkGroupID;
in uvec3 gl_LocalInvocationID;
in uvec3 gl_GlobalInvocationID;
in uint  gl_LocalInvocationIndex;
</source>

; {{code|gl_NumWorkGroups}}
: This variable contains the number of work groups passed to the dispatch function.
; {{code|gl_WorkGroupID}}
: This is the current work group for this shader invocation. Each of the XYZ components will be on the half-open range [0, gl_NumWorkGroups.XYZ).
; {{code|gl_LocalInvocationID}}
: This is the current invocation of the shader ''within'' the work group. Each of the XYZ components will be on the half-open range [0, gl_WorkGroupSize.XYZ).
; {{code|gl_GlobalInvocationID}}
: This value uniquely identifies this particular invocation of the compute shader among ''all'' invocations of this compute dispatch call. It's a short-hand for the math computation:
<source lang="glsl">
gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID;
</source>
; {{code|gl_LocalInvocationIndex}}
: This is a 1D version of {{code|gl_LocalInvocationID}}. It identifies this invocation ''within'' the work group. It is short-hand for this math computation:
<source lang="glsl">
gl_LocalInvocationIndex =
    gl_LocalInvocationID.z * gl_WorkGroupSize.x * gl_WorkGroupSize.y +
    gl_LocalInvocationID.y * gl_WorkGroupSize.x +
    gl_LocalInvocationID.x;
</source>
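
The two derived values can be reproduced on the CPU, which is handy when checking what a given invocation should have seen. This is a minimal sketch in C; the {{code|UVec3}} type and both function names are hypothetical stand-ins, not OpenGL API.

```c
#include <stdint.h>

/* Hypothetical stand-in for GLSL's uvec3. */
typedef struct { uint32_t x, y, z; } UVec3;

/* Mirrors gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID. */
UVec3 global_invocation_id(UVec3 work_group_id, UVec3 work_group_size,
                           UVec3 local_id)
{
    UVec3 global_id = {
        work_group_id.x * work_group_size.x + local_id.x,
        work_group_id.y * work_group_size.y + local_id.y,
        work_group_id.z * work_group_size.z + local_id.z
    };
    return global_id;
}

/* Mirrors the flattening used for gl_LocalInvocationIndex. */
uint32_t local_invocation_index(UVec3 local_id, UVec3 work_group_size)
{
    return local_id.z * work_group_size.x * work_group_size.y +
           local_id.y * work_group_size.x +
           local_id.x;
}
```

For example, with a local size of (4, 5, 6), work group (2, 0, 1) and local ID (1, 2, 3), the global ID is (9, 2, 9) and the local index is 3*4*5 + 2*4 + 1 = 69.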


=== Local size ===


The local size of a compute shader is defined within the shader, using a special layout input declaration:

<source lang="glsl">
layout(local_size_x = X, local_size_y = Y, local_size_z = Z) in;
</source>

By default, the local sizes are 1, so if you only want a 1D or 2D work group space, you can specify just the {{param|X}} or the {{param|X}} and {{param|Y}} components. They must be integral constant expressions of value greater than 0. Their values must abide by the [[#Limitations|limitations imposed below]]; if they do not, a compiler or linker error occurs.


The local size is available to the shader as a compile-time constant variable, so you don't need to define it yourself:

<source lang="glsl">
const uvec3 gl_WorkGroupSize;
</source>


== Shared variables ==

Global variables in compute shaders can be declared with the {{code|shared}} storage qualifier. The values of such variables are shared between all invocations within a work group. You cannot declare any [[GLSL Type#Opaque types|opaque types]] as shared, but aggregates (arrays and structs) are fine.


At the beginning of a work group, these values are uninitialized. Also, the variable declaration cannot have initializers, so this is illegal:

<source lang="glsl">
shared uint foo = 0; //No initializers for shared variables.
</source>


=== Shared memory coherency ===


{{main|Memory Model#Incoherent memory access}}


Shared variable access uses the rules for incoherent memory access. This means that the user must perform certain synchronization in order to ensure that shared variables are visible.

Shared variables are all implicitly declared {{code|coherent}}, so you don't need to (and can't) use that qualifier. However, you still need to provide an appropriate memory barrier.

The [[Memory Model#Ensuring visibility|usual set of memory barriers]] is available to compute shaders, but they also have access to {{code|memoryBarrierShared()}}; this barrier is specifically for shared variable ordering. {{code|groupMemoryBarrier()}} acts like {{code|memoryBarrier()}}, ordering memory writes for all kinds of variables, but it only orders reads and writes for the current work group.


While all invocations within a work group are said to execute "in parallel", that doesn't mean that you can assume that all of them are executing in lock-step. If you need to ensure that an invocation has written to some variable so that you can read it, you need to synchronize ''execution'' with the invocations, not just issue a memory barrier (you still need the memory barrier though).

To synchronize reads and writes between invocations within a work group, you must employ the {{code|barrier()}} function. This forces an explicit synchronization between all invocations in the work group. Execution within the work group will not proceed until all other invocations have reached this barrier. Once past the {{code|barrier()}}, all shared variables previously written across all invocations in the group will be visible.


There are limitations on how you can call {{code|barrier()}}. However, compute shaders are not as limited as [[Tessellation Control Shader]]s in their use of this function. {{code|barrier()}} can be called from flow-control, but it can only be called from ''uniform'' flow control. All expressions that lead to the evaluation of a {{code|barrier()}} must be [[Dynamically Uniform Expression|dynamically uniform]].

In short, no matter how different the data they fetch is, every invocation in a work group must hit the ''exact'' same set of {{code|barrier()}} calls in the exact same order. Otherwise, the results are undefined.
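
To make the role of the barrier points concrete, here is a sequential emulation in C of one work group summing its values through a shared array. It is only a sketch: real invocations run concurrently, and the places where a shader would need a memory barrier followed by {{code|barrier()}} are marked in comments. {{code|LOCAL_SIZE}}, {{code|shared_data}} and the function name are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define LOCAL_SIZE 8u /* hypothetical local size; a power of two here */

/* Emulates a tree reduction over a "shared" array, one pass at a time.
   Each inner loop iteration stands for one invocation of the group. */
uint32_t emulate_group_sum(const uint32_t input[LOCAL_SIZE])
{
    uint32_t shared_data[LOCAL_SIZE];

    /* Pass 0: every invocation writes its own element. */
    for (size_t id = 0; id < LOCAL_SIZE; ++id)
        shared_data[id] = input[id];
    /* Shader equivalent: memoryBarrierShared(); barrier(); */

    /* The loop bounds do not depend on per-invocation data, so every
       invocation would reach the same barriers in the same order. */
    for (size_t stride = LOCAL_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t id = 0; id < stride; ++id)
            shared_data[id] += shared_data[id + stride];
        /* Shader equivalent: memoryBarrierShared(); barrier(); placed
           outside the "id < stride" test so all invocations reach it. */
    }
    return shared_data[0];
}
```

In a shader, the guard that limits the addition to the first {{code|stride}} invocations would be an {{code|if}} around the addition only; the barrier stays outside it, which keeps the flow control leading to the barrier uniform.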


=== Atomic operations ===


{{main|Shader Storage Buffer Object#Atomic operations}}


A number of atomic operations can be performed on shared variables of integral type (and vectors/arrays/structs of them). These functions are shared with [[Shader Storage Buffer Object]] atomics.

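All of the atomic functions return the original value. The term "nint" can be {{code|int}} or {{code|uint}}.

; {{code|nint atomicAdd(inout nint mem, nint data)}}
: Adds {{param|data}} to {{param|mem}}.
; {{code|nint atomicMin(inout nint mem, nint data)}}
: Sets {{param|mem}} to the smaller of {{param|mem}} and {{param|data}}, so its value ends up no greater than {{param|data}}.
; {{code|nint atomicMax(inout nint mem, nint data)}}
: Sets {{param|mem}} to the larger of {{param|mem}} and {{param|data}}, so its value ends up no lower than {{param|data}}.
; {{code|nint atomicAnd(inout nint mem, nint data)}}
: {{param|mem}} becomes the bitwise-and between {{param|mem}} and {{param|data}}.
; {{code|nint atomicOr(inout nint mem, nint data)}}
: {{param|mem}} becomes the bitwise-or between {{param|mem}} and {{param|data}}.
; {{code|nint atomicXor(inout nint mem, nint data)}}
: {{param|mem}} becomes the bitwise-xor between {{param|mem}} and {{param|data}}.
; {{code|nint atomicExchange(inout nint mem, nint data)}}
: Sets {{param|mem}}'s value to {{param|data}}.
; {{code|nint atomicCompSwap(inout nint mem, nint compare, nint data)}}
: If the current value of {{param|mem}} is equal to {{param|compare}}, then {{param|mem}} is set to {{param|data}}. Otherwise it is left unchanged.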
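
The GLSL atomic functions return the value the variable held before the operation. C11 atomics follow the same convention, so they make a reasonable CPU-side analogy. The sketch below is an analogy only, not how a GPU implements these functions, and the two wrapper names are hypothetical.

```c
#include <stdatomic.h>

/* Analogue of atomicAdd: adds data and returns the previous value. */
unsigned int emulated_atomic_add(atomic_uint *mem, unsigned int data)
{
    return atomic_fetch_add(mem, data);
}

/* Analogue of atomicCompSwap: stores data only if *mem equals compare,
   and returns the value *mem held before the call in either case. */
unsigned int emulated_atomic_comp_swap(atomic_uint *mem,
                                       unsigned int compare,
                                       unsigned int data)
{
    unsigned int expected = compare;
    /* On failure, C11 writes the current value of *mem into expected;
       on success, expected still equals the old value. */
    atomic_compare_exchange_strong(mem, &expected, data);
    return expected;
}
```

Comparing the returned value against {{code|compare}} tells the caller whether the swap actually took place.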


== Limitations ==

The number of work groups that can be dispatched in a single dispatch call is defined by {{enum|GL_MAX_COMPUTE_WORK_GROUP_COUNT}}. This must be queried with {{apifunc|glGet|Integeri_v}}, with the index being on the closed range [0, 2], representing the X, Y and Z components of the maximum work group count. Attempting to call {{apifunc|glDispatchCompute}} with values that exceed this range is an error. Attempting to call {{apifunc|glDispatchComputeIndirect}} with such values is much worse; it may result in program termination or a GPU hard-lock.


Note that the ''minimum'' these values must be is 65535 in all three axes. So you've probably got a lot of room to work with.


There are limits on the local size as well; indeed, there are two sets of limitations. There is a general limitation on the local size dimensions, queried with {{enum|GL_MAX_COMPUTE_WORK_GROUP_SIZE}} in the same way as above. Note that the minimum requirements here are much smaller: 1024 for X and Y, and a mere 64 for Z.

There is another limitation: the total number of invocations within a work group. That is, the product of the X, Y and Z components of the local size must be less than or equal to {{enum|GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS}} (a single value, queried with {{apifunc|glGet|Integerv}}).
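
Both sets of local size limits can be checked together before compiling a shader whose local size is generated at run time. This is a minimal sketch in C, assuming the limits have already been queried into {{code|max_size}} and {{code|max_invocations}}. It treats the invocation limit as inclusive, which is an assumption based on a reading of the specification. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when a local size respects both the per-axis limits
   and the limit on total invocations per work group. */
bool local_size_is_valid(const uint32_t local_size[3],
                         const uint32_t max_size[3],
                         uint32_t max_invocations)
{
    uint64_t product = 1;
    for (int axis = 0; axis < 3; ++axis) {
        /* Each component must be at least 1 and within its axis limit. */
        if (local_size[axis] == 0u || local_size[axis] > max_size[axis])
            return false;
        product *= local_size[axis];
        /* Checking after every multiply keeps product within 64 bits. */
        if (product > max_invocations)
            return false;
    }
    return true;
}
```

With per-axis limits of (1024, 1024, 64) and an invocation limit of 1024, a local size of (32, 32, 1) passes, while (32, 32, 2) fails on the total even though each axis is in range.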


There is also a limit on the total storage size for all shared variables in a compute shader. This is {{enum|GL_MAX_COMPUTE_SHARED_MEMORY_SIZE}}, which is in bytes. The OpenGL-required minimum is 32KB. OpenGL does not specify the exact mapping between GL types and shared variable storage, though you could use the std140 layout rules and UBO/SSBO sizes as a general guideline.
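
A rough budget check for a single shared array might look like the following C sketch. Since the real mapping from GLSL types to shared storage is not specified, the per-element byte size passed in is only an estimate supplied by the caller, and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Estimates whether element_count elements of element_bytes each fit
   within limit_bytes (for example, the queried shared memory limit). */
bool fits_in_shared_memory(size_t element_bytes, size_t element_count,
                           size_t limit_bytes)
{
    if (element_bytes == 0)
        return true;
    /* Dividing avoids overflow in element_bytes * element_count. */
    return element_count <= limit_bytes / element_bytes;
}
```

Under this estimate, a 32KB limit leaves room for 8192 four-byte values.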

[[Category:OpenGL Shading Language]]


[[Category:Shaders]]

Revision as of 22:02, 3 January 2013


