2 Agenda
- Vertex Shaders
  - vs_2_0 and extended bits
  - vs_3_0
  - Vertex Shaders with DirectX HLSL compiler
- Pixel Shaders
  - ps_2_0 and extended bits
  - ps_3_0
The HLSL compiler will be improved by GDC, and Microsoft will be talking about it. For example, you will be able to get at vs_2_0 control flow through HLSL in the new revision.

3 Legacy 1.x Shaders
- No discussion of legacy shaders today
- There is plenty of material on these models from Microsoft and IHVs

4 What’s with all the asm?
- We’ll show a lot of asm in this section to introduce the virtual machine
- Later sections will focus on HLSL since that is what we expect most developers will prefer to use
- Familiarity with the asm is useful, however
- Also, the currently available HLSL compiler does not support flow control, though that is coming in a future revision. Microsoft will discuss more details of this in their event tomorrow.

9 Changes to the VS numbers
- 256 instructions of stored program (was 128)
- 256 constants (was minimum of 96)
- Address register is now vector (was scalar)
- New registers
  - 64 iteration control registers (as 16 vectors)
  - 1 scalar loop register (only readable within the loop)
  - 16 1-bit Boolean registers
- Max number of instructions executed per shader is now tentatively 1024 (max was always 128 before jumps)

10 What does this mean for your shaders?
- Some examples:
  - Paletted skinning can now cover ~50 xforms
    - Mostly that will mean one draw call for a whole animated character
  - An instancing shader can handle ~200 instances in a single draw call
  - Enabling and disabling lights doesn’t have to mean multiple shaders, just change a loop register
- So this is big for ‘ease of use’

11 And some smaller details
- Loading a0 now rounds-to-nearest rather than truncating
  - Not hard to cope with but could surprise you...
- Loading i0 to i63 rounds-to-nearest too (which is consistent)
- No jumping into or out of loops or subroutines
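The a0 change matters most when an index computed in floating point lands just below an integer. A minimal Python sketch of the difference, not shader code: the function names are hypothetical, and the half-up tie rule is an assumption, since the slide only says round-to-nearest.

```python
import math

def load_a0_truncate(value):
    """Older behaviour described on the slide: drop the fractional part."""
    return math.trunc(value)

def load_a0_round(value):
    """Newer behaviour: round to nearest (half-up tie rule is assumed)."""
    return math.floor(value + 0.5)

# A palette index that should be 3 but picked up a small float error,
# for example after being scaled by a constant.
index = 2.9999

old_index = load_a0_truncate(index)
new_index = load_a0_round(index)
```

Code that added a small bias before the load to work around truncation may now overshoot by one entry, which is the kind of surprise the slide hints at.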

13 Setting Vertex Shader Registers
- Because of the new types of constant registers in DirectX 9, there are new APIs for setting them from the app:
  - SetVertexShaderConstantF()
  - SetVertexShaderConstantI()
  - SetVertexShaderConstantB()

21 HLSL and Conditionals
- The initial revision of the HLSL compiler, which shipped with the DirectX 9.0 SDK, handles conditionals by executing all codepaths and lerping to select outputs
- Future SDK releases will provide a compiler which can compile to asm control flow instructions
- Microsoft will discuss this in more detail in their event tomorrow
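The "execute all codepaths and lerp" strategy can be sketched in a few lines. This is an illustrative Python model with hypothetical names, not actual compiler output; it only shows that both results are computed and a 0/1 mask picks one.

```python
def lerp(a, b, t):
    """Linear interpolation, a + t * (b - a), as in the HLSL lerp intrinsic."""
    return a + t * (b - a)

def select_without_branch(n_dot_l, total_light, light_color):
    # Both "code paths" are evaluated unconditionally.
    lit = total_light + light_color   # value if the condition holds
    unlit = total_light               # value if it does not
    # The comparison becomes a 0.0 / 1.0 mask instead of a jump.
    mask = float(n_dot_l > 0.0)
    return lerp(unlit, lit, mask)
```

The cost is the sum of both paths, which is why real control flow instructions become attractive once the conditional blocks get long.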

26 Extended shaders
- What are the extended shader models?
  - A shader is considered 2_x level if it requires at least one capability beyond stock ps.2.0 or vs.2.0
  - Use the ‘vs_2_x’ or ‘ps_2_x’ label in your assembly or as your compilation target

27 Caps for vs_2_x
- New D3DVSHADERCAPS2_0 structure
- D3DCAPS9.VS20Caps

  D3DCAPS9.VS20Caps field      CineFX support
  Caps                         D3DVS20CAPS_PREDICATION
  NumTemps                     16
  StaticFlowControlDepth       4
  DynamicFlowControlDepth      24

These are the vertex shader caps, which are accessed through the D3DVSHADERCAPS2_0 structure. The first one is DynamicFlowControlDepth, which basically controls whether or not you can dynamically branch inside a shader based on a value that’s computed within the shader. An architecture that exposes 0 for DynamicFlowControlDepth (which is the minimum for vs.2.0 support) will only be able to branch based on constants sent in from the application; essentially, the registers that control branching and looping become read-only within the shader itself. Values greater than 0 designate the number of nested levels of control flow you can have within the shader; think of it as the number of nested if-statements allowed in C code. The CineFX architecture supports 24, which is the maximum. This can be useful for many effects: you could imagine breaking out of a light loop when a certain condition is met (for example, the lighting components saturate to some value), or perhaps stopping the skinning computation once all weights sum to 1.0.
The next cap is NumTemps, which covers the number of temporary registers available for use within the vertex shader; the CineFX architecture supports 16.
The final cap covers predication, which is the ability to execute and write the result of an instruction based on a predicate, a similar idea to the CMOV instruction in modern x86 CPUs. This is fully supported by CineFX as well.

28 Extended Vertex Shader Caps
- D3DVS20CAPS_PREDICATION
  - Predication is a method of conditionally executing code on a per-component basis
  - For instance, you could test if a value was > 0.0f before performing a RSQ
  - Faster than executing a branch for short code sequences
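A rough Python model of what "per-component" predication means, using the RSQ case from the slide. The function name is hypothetical, and the four-wide form is only meant to illustrate the write masking, not to model the exact asm instruction.

```python
import math

def predicated_rsq(src, dest):
    """Write 1/sqrt(x) only to components whose predicate is true.

    src and dest are 4-component lists; components that fail the
    test keep their previous dest value.
    """
    predicate = [x > 0.0 for x in src]   # per-component test
    result = list(dest)
    for i, (enabled, x) in enumerate(zip(predicate, src)):
        if enabled:                      # write enabled for this component
            result[i] = 1.0 / math.sqrt(x)
    return result

guarded = predicated_rsq([4.0, 0.0, -1.0, 0.25], [9.0, 9.0, 9.0, 9.0])
```

Only the first and last components are overwritten here; the zero and negative inputs never reach the reciprocal square root.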

29 Vertex Shader Predication – HLSL
- This example adds in specular when the self-shadowing term is greater than zero.

    if ( dot(light_vector, vertex_normal) > 0.0f )
    {
        total_light += light_color;
    }

- If targeting a vs_2_x profile, the compiler should use predication for a short conditional block

33 Nested Static Flow Control
- Static flow control is a standard part of vs_2_0
  - CALL / CALLNZ-RET, LOOP-ENDLOOP
- Most useful for looping over light counts
  - As long as they aren’t shadowed

34 Dynamic Flow Control
- Branching information is derived per vertex
- Can be based on arbitrary calculation, not just constants
- Best used to skip a large number of instructions
- Some architectures may pay a perf penalty when vertices take disparate branches

36 Dynamic Flow Control
- Very useful to improve batching for matrix palette skinning
  - Do a dynamic loop over the # of non-zero bones in the vertex
  - Sort per-vertex indices & weights by priority
  - If a weight is zero, break out of the skinning loop
- Can do automatic shader LOD
  - Distance to light or viewer large enough
  - Or when fully fogged, etc.
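The skinning early-out can be sketched as follows. This is a CPU-side Python illustration with hypothetical names and data layout (3x4 row-major bone matrices, four influence slots), not the shader itself; it assumes the weights were pre-sorted so zeros come last.

```python
def skin_position(position, bones, indices, weights):
    """Blend a 3D position by the non-zero bone influences only.

    bones: list of 3x4 row-major matrices (rotation | translation).
    indices/weights: per-vertex influences, sorted so the largest
    weights come first and unused slots hold weight 0.0.
    Returns (skinned_position, number_of_bones_evaluated).
    """
    x, y, z = position
    out = [0.0, 0.0, 0.0]
    evaluated = 0
    for bone_index, weight in zip(indices, weights):
        if weight == 0.0:
            break                      # remaining influences are zero too
        m = bones[bone_index]
        for row in range(3):
            out[row] += weight * (m[row][0] * x + m[row][1] * y
                                  + m[row][2] * z + m[row][3])
        evaluated += 1
    return out, evaluated

IDENTITY = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0]]
SHIFT_X = [[1.0, 0.0, 0.0, 2.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0]]

skinned, count = skin_position(
    (1.0, 0.0, 0.0), [IDENTITY, SHIFT_X],
    indices=[1, 0, 0, 0], weights=[0.5, 0.5, 0.0, 0.0])
```

With two non-zero weights the loop does two matrix transforms instead of four, so one shader can serve meshes with different influence counts in the same batch.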

41 Argument Swizzles
- .r, .rrrr, .xxxx or .x
- .g, .gggg, .yyyy or .y
- .b, .bbbb, .zzzz or .z
- .a, .aaaa, .wwww or .w
- .xyzw or .rgba (no swizzle) or nothing
- .yzxw or .gbra (can be used to perform a cross product operation in 2 clocks)
- .zxyw or .brga (can be used to perform a cross product operation in 2 clocks)
- .wzyx or .abgr (can be used to reverse the order of any number of components)
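The two-clock cross product mentioned for .yzxw and .zxyw works out as one multiply plus one multiply-add. A small Python sketch of the arithmetic (the helper names are hypothetical; the mul/mad pairing is the usual way to write it and is assumed here rather than taken from the slide):

```python
def swizzle(v, pattern):
    """Return components of a 4-vector reordered by a pattern such as 'yzxw'."""
    lanes = {"x": 0, "y": 1, "z": 2, "w": 3}
    return [v[lanes[c]] for c in pattern]

def cross_via_swizzles(a, b):
    # Instruction 1 (mul): r0 = a.yzxw * b.zxyw
    r0 = [p * q for p, q in zip(swizzle(a, "yzxw"), swizzle(b, "zxyw"))]
    # Instruction 2 (mad): r0 = -a.zxyw * b.yzxw + r0
    r0 = [-p * q + r for p, q, r in
          zip(swizzle(a, "zxyw"), swizzle(b, "yzxw"), r0)]
    return r0[:3]  # w is not meaningful for a cross product

z_axis = cross_via_swizzles([1, 0, 0, 0], [0, 1, 0, 0])
```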

45 Caps for Pixel Shader 2.x

  D3DCAPS9 field                     CineFX support
  MaxPShaderInstructionsExecuted     1024
  PS20Caps.NumInstructionSlots       512
  PS20Caps.NumTemps                  28
  PS20Caps.StaticFlowControlDepth    0?
  PS20Caps.DynamicFlowControlDepth
  PS20Caps.Caps                      ARBITRARYSWIZZLE, GRADIENTINSTRUCTIONS, PREDICATION, NODEPENDENTREADLIMIT, NOTEXINSTRUCTIONLIMIT

Two of the caps are DynamicFlowControlDepth and StaticFlowControlDepth. These are basically the same as the vertex shader equivalents, and define the ability to do branching. Branching facilities are more limited at the pixel level than they were at the vertex level.
The NumTemps value defines the number of temporary registers; in the CineFX architecture this is 32 floating point registers. NumInstructionSlots defines the number of instructions that can be executed in the pixel shader. Basic ps.2.0 only requires that 96 instructions, 32 texture + 64 math, be available, but CineFX allows a total of 1024 instructions to be executed.
The next 5 caps cover a number of different features that hardware may implement that go above and beyond the ps.2.0 spec. First, the PREDICATION cap is the same as the vertex shader equivalent: it defines whether or not predicated instructions are possible. The ARBITRARYSWIZZLE cap defines whether or not general input swizzling is allowed in ps.2.0 instructions. By default, not all swizzles are supported, but architectures that set this cap bit allow you to do arbitrary swizzling, just like in the vertex shader. The next cap, GRADIENTINSTRUCTIONS, defines whether or not the partial derivative instructions dsx and dsy are available for use, and whether or not you can use the texldd (load with partial derivatives) instruction. These instructions, you’ll recall, allow you to take the partial derivatives in screen space for an arbitrary value in the pixel shader. The final two caps, NODEPENDENTREADLIMIT and NOTEXINSTRUCTIONLIMIT, allow two of the limitations in ps.2.0 to be relaxed. By default, ps.2.0 only allows you to chain together four dependent reads and mandates a separate limit on texture instructions as compared to math instructions (32 vs. 64 in ps.2.0). Architectures that set these two caps, such as CineFX, remove these two limits.

46 Pixel Shader 2.x
- Static flow control
  - IF-ELSE-ENDIF, CALL / CALLNZ-RET, REP-ENDREP
  - As in vertex shader (less LOOP-ENDLOOP)
- Gradient instructions and texture fetch
  - DSX, DSY, TEXLDD
- Predication
  - As in vertex shader 2.x
- Arbitrary swizzling
  - Use them to pack registers
DSX and DSY are significant because they allow you to take the derivative of an arbitrary value in the screenspace x and y directions, which is a very powerful operation.

48 Static flow control
- One way of controlling number of lights
  - Usually 4 active lights is enough on any one triangle
- Benefits of single-pass lighting
  - Greater speed than multi-pass if vertex bound
    - Allows more complex vertex shaders
  - Better precision
    - fp32 shader precision vs. 8-bit FB blender
  - More flexible combination of lights
    - Combine lights in ways FB blender doesn’t allow
- Shadows are tricky
  - Can use pre-baked occlusion information, either per-vertex or with textures (similar to lightmaps)
Using the stencil buffer for shadow volumes will not work because the stencil buffer information is not available in the pixel shader, and you need this to decide which pixels are lit by which light. However, you could use the same stencil trick rendering to a texture, and then input it to the pixel shader.

49 Single Pass Lighting?
- Sometimes it does make sense to collapse multiple vertex lights in one pass
  - Hence the fixed-function pipeline
- This works because the fixed-function pipeline doesn’t handle shadows
- With vertex shaders, one can do per-vertex shadowing
  - Paletted Lighting & Shadowing

51 Single-Pass Lighting?
- Detailed per-pixel lighting can typically be performed on DirectX8 cards in 1-3 passes per light
  - One pass for shadow creation
  - One pass for attenuation & shadow testing
  - One pass for lighting
- DirectX9 cards can do all the math for a light in one pass
  - Longer shaders mean attenuation, shadow testing & lighting can be performed in one pass
  - More lighting per pass means fewer, larger batches
- Does it make sense to do all lighting in one pass?

52 Single Pass Lighting?
- Shadow Volume approaches require each shadowed light to be handled separately
  - Because they either use the stencil or dest alpha to store occlusion info
  - Which are single resources
- Shadow Maps allow multiple lights to be performed in one pass
  - Once the shadow maps are created, each light can perform shadow test & lighting calcs in one pass

53 Single Pass Lighting?
- Putting multiple lights in one pass leads to some issues
  - There is still a limit to the # of lights per pass that can be handled
    - Due to 16 samplers or 8 interpolators
    - Or overly long shaders
- What to do when light count exceeded?
  - Fall back to multi-pass
  - Drop or merge ‘unimportant’ lights
    - Careful of popping and aliasing

54 Single Pass Lighting?
- It makes sense to collapse a single light into a single render pass if possible
  - Less vertex work
  - Fewer draw calls
  - Less bandwidth
    - Not really an issue if shader bound
- Perf scales linearly with # of lights

55 Single Pass Lighting?
- It doesn’t necessarily make sense to try to fit multiple shadowed lights in one pass
  - Shadow volumes
    - You can’t really do this anyway
  - Shadow maps
    - Still a hard limit on the # that can be handled per pass
    - Multiple code paths
    - Non-linear perf falloff with more lights
- As # of light 0/1/2/3 / material combinations go up, batch size goes down anyway
- Probably not worth the hassle factor

60 Lighting Render Loop
- On DirectX8 HW
  - Passes are as listed
- On DirectX9 HW
  - Passes L + 1 and L + 2 can be combined
  - Fog Pass can be collapsed into Pass 0
    - Due to not needing dest alpha for attenuation & mask
    - Perform Lighting passes with black fog
  - Pass L + 2 might have better lighting
    - Due to per-pixel calculation of H or R and real exponents for specular
    - Various more complex lighting models possible

61 Lighting Render Loop Summary
- Good solution for D3D8 / D3D9 scalability
- For shadowed scenes, use a single light at a time
  - Easier than packing multiple lights into one pass
  - Scales linearly
- DirectX9 lets you do all lighting in one pass
  - But for shadows and similarity to the DirectX8 path, use one pass per light

62 Greater number of instructions
- Longer shader programs make multi-sample AA really “free”
  - Sweet spot for low-end DirectX9 cards is w/ 4x AA
  - 640x480x4x is way faster than 1280x960x1x on a pixel shader-bound app
- Long shaders more practical with HLSL
One interesting observation is that longer pixel shader programs essentially make multi-sample anti-aliasing free. It used to be impossible to write long shader programs since there were so many limits on the number of instructions, but with these limitations lifted it’s quite easy to get into a situation where you’re bound by shading speed. And since multi-sample AA only runs the pixel shader once for each block of multi-samples, it incurs no cost if you’re bound by shading.
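The 640x480x4x versus 1280x960x1x comparison is simple arithmetic once you count pixel shader runs rather than samples. A back-of-the-envelope Python sketch (the function name is hypothetical, and triangle edges, overdraw and culling are deliberately ignored):

```python
def pixel_shader_invocations(width, height):
    """Approximate shader runs per frame.

    Multi-sample AA shades once per pixel regardless of sample count,
    so the sample count does not appear in the estimate.
    """
    return width * height

aa_cost = pixel_shader_invocations(640, 480)       # 4x multi-sampled
no_aa_cost = pixel_shader_invocations(1280, 960)   # 1x, same number of samples
ratio = no_aa_cost / aa_cost
```

Both modes store the same number of samples, but the non-AA mode runs the shader about four times as often, which is what dominates on a shader-bound app.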

65 Caveats for Conditional Instructions
- Conditional instructions can easily cause visual discontinuities
- Useful for shaders that sometimes need a sharp edge
- Or effects where the edge is faded out anyway
  - Like light attenuation
- A filtered texture fetch into a sharp gradient texture can give smoother results

66 Gradient Instructions
- Useful for shader anti-aliasing
- DSX, DSY compute derivative of a vector wrt screen space position
- Use derivatives with TEXLDD
Then there’s this texldd instruction, which is basically a flexible generalization of the texldb instruction. Instead of specifying a hardcoded bias, you can actually specify the partial derivatives of the texture coordinates in the x and y screenspace directions directly and the hardware will then use these partials to determine LOD. This allows you to perform more general tweaking of LOD, and even allows you to force the use of anisotropic filtering.
Note that edge pixels will calculate the derivative outside of the triangle.

68 Shader Texture LOD
- If you do a texture fetch in a pixel shader, LOD is calculated automatically
  - Based on a 2x2 pixel quad
- So, you only need the gradient instructions if you want a non-default LOD
- Or if you want to band-limit your shading function to avoid aliasing
  - Switch shading models or drop high-frequency terms
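How a 2x2 quad yields an LOD can be sketched with finite differences. The Python below uses a common textbook approximation (log2 of the larger texel-space gradient length); the exact formula used by any given hardware is an assumption here, and the function name is hypothetical.

```python
import math

def lod_from_quad(uv00, uv10, uv01, texture_size):
    """Estimate a mip level from three corners of a 2x2 pixel quad.

    uv00 is the pixel itself, uv10 its neighbour in +x, uv01 its
    neighbour in +y; texture_size is the width/height in texels.
    """
    # Finite differences stand in for what dsx / dsy would return.
    ddx = ((uv10[0] - uv00[0]) * texture_size,
           (uv10[1] - uv00[1]) * texture_size)
    ddy = ((uv01[0] - uv00[0]) * texture_size,
           (uv01[1] - uv00[1]) * texture_size)
    # Texels stepped over per pixel, taking the larger axis.
    rho = max(math.hypot(*ddx), math.hypot(*ddy))
    return max(0.0, math.log2(rho)) if rho > 0.0 else 0.0

# Coordinates advance two texels per pixel on a 256-texel texture.
lod = lod_from_quad((0.0, 0.0), (2 / 256, 0.0), (0.0, 2 / 256), 256)
```

Supplying your own derivatives to texldd amounts to replacing ddx and ddy in this calculation with values of your choosing.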

72 Why use fp16 precision?
- SPEED
  - On some HW, using fewer registers leads to faster performance
  - fp16 takes half the register space of fp32, so can be 2x faster
- That said, the first rule of optimization is: DON’T
  - If your shaders are fast enough at full precision, great
  - But make sure you test on low-end DirectX9 cards, too
- Otherwise, here is how to optimize your shaders for precision

73 How to use fp16
- In HLSL or Cg, use the half keyword instead of float
- Because the spec requires high precision when mixing high & low precision, you may have to use extra casts or temporaries at times
- Assuming s0 maps to a 32-bit fp texture:
  - Original:  float3 a = tex2D( s0, tex0 );
  - Optimized: half3 a = tex2D( s0, (half3)tex0 );

75 Precision Pitfalls
- An IEEE-like fp16 is 1 sign bit, 5 bits of exponent and 10 bits of mantissa
- Think of the mantissa as tick marks on a ruler
- The exponent is the length of the ruler
- +/-1024 ticks, no matter what, so 0.1% precision across whatever range you have
[Figure: rulers labeled Inches and Feet]
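The ruler analogy can be checked numerically. A small Python sketch using the standard library's half-precision pack format, which is IEEE binary16 and assumed here to be close enough to the "IEEE-like" fp16 on the slide; the helper names are hypothetical.

```python
import struct

def to_half(value):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def half_spacing(value):
    """Gap between adjacent fp16 values near a positive normal value."""
    bits = struct.unpack("<H", struct.pack("<e", value))[0]
    next_up = struct.unpack("<e", struct.pack("<H", bits + 1))[0]
    return next_up - to_half(value)

spacing_near_1 = half_spacing(1.0)        # 2**-10, about 0.1% of 1.0
spacing_near_1000 = half_spacing(1000.0)  # 0.5
spacing_near_2000 = half_spacing(2000.0)  # 1.0
```

The tick count per power of two stays fixed at 1024, so the absolute gap between ticks grows as the values move away from zero.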

76 Precision Pitfalls
- If you are lighting in world space, your distances are typically far away from the origin, and most of the 16 bits are allocated to the left of the decimal point to maximize range, at the expense of precision.
[Figure labels: World Space Vectors, World Space Origin]

    ps.2.0
    sub r0, t0, t1    // << precision loss here
    dp3 r0, r0, r0    // this will band heavily
    mov r0, 1- r0     // ( 1 – d^2 ) attenuation

77 Precision Pitfalls
- Most precision problems we’ve seen at fp16 are due to developers doing world-space computations per-pixel
- Classic example is per-pixel attenuation
- Easy solution: change to light space in the vertex shader instead and pass down the result
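A one-dimensional Python sketch of why moving the subtraction to light space helps. The positions are made-up numbers, interpolation across the triangle is not modelled, and the standard library's binary16 format stands in for the hardware's fp16.

```python
import struct

def to_half(value):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

# Hypothetical 1D scene far from the world origin.
light_pos = 1000.3
surface_pos = 1001.1
true_offset = surface_pos - light_pos   # about 0.8

# Per-pixel, world space: both positions are quantized to fp16 first.
world_offset = to_half(to_half(surface_pos) - to_half(light_pos))

# Light space: subtract at full precision per vertex, then pass the
# small result down and store it at fp16.
light_offset = to_half(surface_pos - light_pos)

world_error = abs(world_offset - true_offset)
light_error = abs(light_offset - true_offset)
```

Near 1000 units the fp16 tick spacing is 0.5, so the world-space difference collapses to 0.5, while the light-space value keeps roughly three decimal digits.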

78 Precision Pitfalls
- The bad news is that it’s a bit inconvenient not to be able to write the entire shader at the fragment level
  - And get great speed
- The good news is that you really didn’t want that anyway
  - Except for prototyping

79 Fully Fragment Shading?
- Doing the entire shading equation at the maximum computation frequency doesn’t make sense
- Some of it is constant
  - Light colors, Material colors
- Much of it is linear
  - Positions
- The HW can interpolate linear components for free
- Why recompute something linear per-pixel?
  - Only if you run out of interpolators

81 Precision Pitfalls
- Avoid precision issues with Normalization Cubemaps
  - Use a normalization cubemap to normalize vectors derived from position
  - Texture coordinates are fp24+
    - Unless marked _pp
  - The resulting value from the cubemap is low-precision but derived at high precision

82 Texcoords and Precision
- 10 bits of mantissa is enough to exactly address a 1024-wide texture
  - With no filtering
- 10 bits can do a 512-wide texture with 2 filtering levels
- So, large textures may require high precision to sample with good filtering
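The bit budget behind these numbers can be written out directly. A simplified Python sketch, assuming a power-of-two texture width and treating the mantissa as a fixed pool of bits shared between texel address and sub-texel position; the function name is hypothetical.

```python
import math

def filtering_levels(mantissa_bits, texture_width):
    """Sub-texel positions left over after addressing every texel.

    Returns 0 when the mantissa cannot even address each texel.
    """
    texel_bits = int(math.log2(texture_width))
    spare_bits = mantissa_bits - texel_bits
    return 2 ** spare_bits if spare_bits >= 0 else 0

fp16_on_1024 = filtering_levels(10, 1024)   # exact addressing, no filtering
fp16_on_512 = filtering_levels(10, 512)     # two filtering levels
fp32_on_1024 = filtering_levels(23, 1024)   # fp32 has 23 mantissa bits
```

Each halving of the texture width frees one mantissa bit for filtering, which is why small textures tolerate fp16 coordinates better than large ones.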

88 vs_3_0 Outputs
- 12 generic output (oN) registers
- Must declare their semantics up-front like the input registers
- Can be used for any interpolated quantity (plus point size)
- There must be one output with the positiont semantic

89 vs_3_0 Output example

    vs_3_0
    dcl_color4    o3.x     // color4 is a semantic name
    dcl_texcoord3 o3.yz    // Different semantics can be packed into one register
    dcl_fog       o3.w
    dcl_tangent   o4.xyz
    dcl_positiont o7.xyzw  // positiont must be declared to some unique register
                           // in a vertex shader, with all 4 components
    dcl_psize     o        // Pointsize cannot have a mask

92 Conclusion
- Vertex Shaders
  - vs_2_0 and extended bits
  - vs_3_0
  - Vertex Shaders with DirectX HLSL compiler
- Pixel Shaders
  - ps_2_0 and extended bits
  - ps_3_0
The HLSL compiler will be improved by GDC, and Microsoft will be talking about it. For example, you will be able to get at vs_2_0 control flow through HLSL in the new revision.

93 Lunch Break
- We will start back up again at 2pm