Adventures in Deferred Rendering: Tiled Deferred Shading and Compute Shaders (2024)

Adventures in Deferred Rendering: Tiled Deferred Shading and Compute Shaders (1)

In basic deferred shading, we evaluate each light separately and add the results to the output buffer, and then use that as the starting point from which we begin the post-processing.

Shading a single quad is oftentimes much more scalable than iterating through each geometry to apply lighting information. However, there is one bottleneck that needs to be addressed: when we’re dealing with thousands of lights, rendering becomes expensive because we need to process all lights for each pixel, and each evaluation requires either a separate shader invocation or an additional loop iteration inside a fragment shader.

Enter tiled shading, which first appeared in the game Uncharted: Drake’s Fortune. The basic idea of tiled shading is to map light sources to tiles of pixels, thereby capping the number of lights that need to be evaluated at each pixel.

There are several principles to keep in mind regarding tiled shading. First, we are storing the lighting data on a per tile basis instead of by pixel. This is somewhat conservative because a given light source may not affect every single pixel in the work group (i.e., tile). However, this simple heuristic is sufficient enough to greatly optimize the rendering.

Second, it’s important to pay attention to the asynchronous determination of light sources inside of a compute shader. We’ll want to determine the list of light sources relevant to the compute work group and drive shader evaluation for each pixel once we finalize that list of relevant lights.

Compute Shaders

Compute shaders operate on a per work group basis, which lends itself nicely to a tile-based abstraction. Concretely, any compute shader invocation can understand things in terms of the space in which it operates, which is defined by the number or position of its work group, the position of the specific shader invocation (i.e., pixel) within that work group, and the position of that invocation in the compute dispatch call, which we can define to mean the pixel’s position on a texture.

There are several mechanisms that lend this concept of a “work group” to our solution. Each work group can possess a set of shared variables that can be operated on by a number of atomic operations. Additionally, when we talk about the existence of global memory that is filled with our lighting data via shader storage buffer objects (unordered access views in DirectX), it becomes clear that compute shaders exist to minimize the latency of memory access on data relevant to the work group’s context.

Physical distances matter, and one of the challenges of hardware design is to empower impressive processing throughput via intelligent placement of processing and memory units so that there is a sensible distance between a given execution context and some memory allocation on the hierarchy. The smaller the distance, the faster a thread or context can access the memory, and the faster the overall execution (provided that the other processing units are also optimally placed and in reasonable numbers).

The name of the game is simple: Have the execution context operate with respect to the more favorable shared memory as much as possible and only dive intermittently into global memory when we need to do something important, such as image storing the result of a computation. Concretely, compute shaders operate on different ranges of the dataset that we explicitly define and that we can visualize as different parts or tiles of an image (with some caveats, of course, because another important aspect of writing compute shaders is defining a sensible distribution of work across a given dataset). The ranges specify a thread group that operates using the same shared memory, in addition to the globally accessible memory that’s coherent to all the work groups.

Adventures in Deferred Rendering: Tiled Deferred Shading and Compute Shaders (2)

Essentially, each work group processes its own, say pixel, range, and at the end of a dispatch call, we’ll have essentially combined the results from each work group to give us a complete image.

The important thing to keep in mind is the relationship between compute work group sizing from a programmer’s perspective and the actual hardware manifestation in terms of how streaming multiprocessors (SMs) consist of some 2 to the power number of cores. In analyzing the figure above, it would make sense that we would constrain the work group size to something that’s distributed well over the GPU architecture (i.e., in 16×16 dimensions) because shared memory is really only coherent across a multiprocessor (e.g., consisting of 64 INT units and 64 FP units in the case of Turing).

That said, the GPU scheduler normally does a good job of mapping the work groups to the SMs. Programmers should focus on minimizing memory accesses and conditional branches. Though, it never hurts to emulate the underlying hardware in the most faithful manner possible.

The fact that compute shaders really lend themselves to sections of rendered quads and, therefore, post-processing really got me experimenting with transitioning some of my fragment shader-based implementations to more compute shader-oriented workflows. I was able to redo my variance shadow mapper’s Gaussian filtering to utilize compute shaders. The setup looks like this:

blurShader.use();blurShader.setIVec2(G_DIR, glm::ivec2(1, 0));glBindImageTexture(0, shadowMaps[0], 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA32F);glBindImageTexture(1, shadowMaps[1], 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F);glDispatchCompute(256, 256, 1);blurShader.setIVec2(G_DIR, glm::ivec2(0, 1));glBindImageTexture(0, shadowMaps[1], 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA32F);glBindImageTexture(1, shadowMaps[0], 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F);glDispatchCompute(256, 256, 1);

Here’s the corresponding shader:

#version 460 corelayout (local_size_x = 16, local_size_y = 16) in;const float WEIGHTS[9] = float[] (0.0162162162f, 0.0540540541f, 0.1216216216f, 0.1945945946f, 0.2270270270f, 0.1945945946f, 0.1216216216f, 0.0540540541f, 0.0162162162f);const int NUM_THREADS = 256;const int NUM_TAPS = 8; // # taps minus the sample in the middleconst int RADIUS = 3;layout (binding = 0, rgba32f) uniform readonly image2D gInput;layout (binding = 1, rgba32f) uniform writeonly image2D gOutput;uniform ivec2 gDir;shared vec4 sTexels[NUM_THREADS + NUM_TAPS * RADIUS];// 9-tap Gaussian filtervoid main(){ uint texelIndex = gl_LocalInvocationIndex; ivec2 tc = ivec2(gl_GlobalInvocationID.xy); // The leftmost or bottommost texel at which we begin to sample ivec2 start = tc - NUM_TAPS / 2 * RADIUS * gDir; sTexels[texelIndex] = imageLoad(gInput, start); // Since we want the last threads in the workgroup to have coherent sample data // (i.e. its range of texels to sample from fully accounted for), we'll get // the first 8 * RADIUS (i.e. twice the number of samples extending from each // side of any given texel) of our threads to deal with those trailing edge cases if (texelIndex < NUM_TAPS * RADIUS) { sTexels[texelIndex + NUM_THREADS] = imageLoad(gInput, start + NUM_THREADS * gDir); } // Fundamentally, we are extracting all the relevant texel data for our workgroup, // and then getting each texel to operate with respect to the data in a highly // parallelized fashion barrier(); vec4 sum = vec4(0.0f); for (int i = 0; i <= NUM_TAPS; ++i) { sum += sTexels[texelIndex + i * RADIUS] * WEIGHTS[i]; } imageStore(gOutput, tc, vec4(sum.rgb, 1.0f));}

I won’t go into the details of the setup or shader, but I will instead defer explanation of the major ideas to the next section.

Tiled Deferred Shading

Now that we’ve established some understanding of compute shaders, all that remains is piecing together the various mechanisms that lend themselves to a coherent interaction between compute dispatch calls and the graphics pipeline.

First, we allocate space on the GPU for our lighting data through a shader storage buffer. The shader storage buffer object (SSBO) differs from uniforms in various ways, but most notably, it can store a whole lot more data than what we normally update constant buffers with. SSBOs are ideal for things such as…thousands of light sources and their data.

Whereas vertex buffers (i.e., GL_ARRAY_BUFFER) are more about vertex data and the arrangement of specific binding points to that data in the vertex shader, shader storage buffers make no assumptions about the semantics of your data. You have to lend coherence to an explicitly allocated chunk on the GPU, like so:

void copyLightDataToGPU(){ glGenBuffers(1, &lightsBuffer); glBindBuffer(GL_SHADER_STORAGE_BUFFER, lightsBuffer); glBufferData(GL_SHADER_STORAGE_BUFFER, pointLights.size() * sizeof(PointLight), nullptr, GL_STATIC_DRAW); glBindBuffer(GL_ARRAY_BUFFER, lightsBuffer); PointLight* ptr = reinterpret_cast<PointLight*>(glMapBufferRange(GL_ARRAY_BUFFER, 0, pointLights.size() * sizeof(PointLight), GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT)); for (int i = 0; i < pointLights.size(); ++i) { ptr[i]._color = pointLights[i]->_color; ptr[i]._radius = pointLights[i]->_radius; ptr[i]._position = pointLights[i]->_position; } glUnmapBuffer(GL_ARRAY_BUFFER);}

For reference, I set up my PointLight structure to use perfect memory alignments:

struct PointLight{ glm::vec3 _color; float _radius; glm::vec3 _position; unsigned int : 32;};

Again, for the sake of memory access performance, it’s never a bad idea to allocate your memory in a way that lends itself to optimal memory access patterns.

Next, let’s assume that we’ve fully executed our G-buffer pass and have generated a bunch of textures that we’d like to access in our compute shader. We bind the first level of each texture unit as an image (which is a distinct concept from a texture), like so:

cullLightsShader.use();cullLightsShader.setMat4(phoenix::G_PROJECTION_MATRIX, utils->_projection);cullLightsShader.setMat4(phoenix::G_VIEW_MATRIX, utils->_view);cullLightsShader.setMat4("gInverseVP", glm::inverse(utils->_projection * utils->_view));cullLightsShader.setVec3(phoenix::G_VIEW_POS, camera->_position);glBindImageTexture(0, gBuffer->_FBO, 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA32F);glBindImageTexture(1, albedoSpecularMap, 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA32F);glBindImageTexture(2, output, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F);glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, lightsBuffer);glDispatchCompute(160, 90, 1);

Note that we also needed to specify an explicit binding point for our previously generated lighting data that’s been placed on the GPU. Once that’s taken care of, we can set up an arrangement of 160×90 work groups that map cleanly to the resolutions of our G-buffer textures, as well as my screen’s resolution. That’s because given a 16×16 sized thread group or work group or tile, 160 work groups in the x-direction result in a processing of 2,560 pixels, and 90 work groups in the y-direction translate to 1,440 pixels.

Over on the GPU side and looking at things on a per shader invocation (i.e., pixel) basis, we see some familiar corresponding variables:

#version 460 corelayout (local_size_x = 16, local_size_y = 16) in;const int NUM_LIGHTS = 4096;const int MAX_LIGHTS_PER_TILE = 160;const int WORK_GROUP_SIZE = 16;const float SPECULAR_FACTOR = 16.0f;const float SCREEN_WIDTH = 2560.0f;const float SCREEN_HEIGHT = 1440.0f;struct PointLight{ vec3 _color; float _radius; vec3 _position;};layout (binding = 0, rgba32f) uniform readonly image2D gNormalMap;layout (binding = 1, rgba32f) uniform readonly image2D gAlbedoSpecularMap;layout (binding = 2, rgba32f) uniform writeonly image2D gOutput;layout (std430, binding = 3) buffer LightsBuffer{ PointLight gPointLights[];};uniform mat4 gProjectionMatrix;uniform mat4 gViewMatrix;uniform mat4 gInverseVP;uniform vec3 gViewPos;shared uint sVisibleLightIndices[NUM_LIGHTS];shared int sNumVisibleLights = 0;

On the second line of our shader is a specification on the size of our work group, which consists of 16×16 or 256 threads. Across the entire space of dispatched work groups, this is going to map to a full quad of resolution 2560×1440.

std430 defines a specific memory layout for our lights buffer. You’ll also notice a couple of shared variables that are coherent only in the space of the current work group. Thus, no other work group has any idea of the exact light sources that we will have culled for our given work group because work groups can’t communicate with one another. The idea is to constrain each context to a specific, predefined tile so that the dispatch can process tiles in a highly parallelized fashion (which themselves will have threads execute in a highly parallelized fashion).

Inside the main function, we have quite a few interesting points to discuss:

void main(){ vec2 center = vec2(SCREEN_WIDTH, SCREEN_HEIGHT) / float(2 * WORK_GROUP_SIZE); // Location of the middle work group vec2 offset = center - vec2(gl_WorkGroupID.xy); // Extract the viewing frustum planes (normals) // https://gamedev.stackexchange.com/questions/156743/finding-the-normals-of-the-planes-of-a-view-frustum vec4 column0 = vec4(-gProjectionMatrix[0][0] * center.x, gProjectionMatrix[0][1], offset.x, gProjectionMatrix[0][3]); // (-2 * n' / (r - l) * 80, 0, offset.x, 0) vec4 column1 = vec4(gProjectionMatrix[1][0], -gProjectionMatrix[1][1] * center.y, offset.y, gProjectionMatrix[1][3]); // (0, -2 * n' / (t - b) * 45, offset.y, 0) vec4 column3 = vec4(gProjectionMatrix[3][0], gProjectionMatrix[3][1], -1.0f, gProjectionMatrix[3][3]); // (0, 0, -1, 0) vec4 frustumPlanes[4]; // Left frustumPlanes[0] = column3 + column0; // Right frustumPlanes[1] = column3 - column0; // Top frustumPlanes[2] = column3 - column1; // Bottom frustumPlanes[3] = column3 + column1; for (int i = 0; i < 4; ++i) { frustumPlanes[i] /= length(frustumPlanes[i].xyz); } int numThreads = WORK_GROUP_SIZE * WORK_GROUP_SIZE; // 16x16x1 sized thread group; 16x16 tile int numPasses = (NUM_LIGHTS + numThreads - 1) / numThreads; // (4096 + 16^2 - 1) / 16^2 = 16 for (int i = 0; i < numPasses; ++i) { // [0-15] * 16^2 + gl_LocalInvocationIndex // Each thread/shader processes using the same relative offset (that is unique to it inside the work group) in the tile, // so that all of the threads in the work group collectively process all of the lights for their relevance to the tile uint lightIndex = min(i * numThreads + gl_LocalInvocationIndex, NUM_LIGHTS - 1); PointLight pointLight = gPointLights[lightIndex]; if (sNumVisibleLights < MAX_LIGHTS_PER_TILE) { bool inFrustum = true; for (int j = 3; j >= 0 && inFrustum; --j) { float distance = dot(frustumPlanes[j], gViewMatrix * vec4(pointLight._position, 1.0f)); // Distance of the point from the plane // https://gamedev.stackexchange.com/questions/79172/checking-if-a-vector-is-contained-inside-a-viewing-frustum inFrustum = -pointLight._radius <= distance; } if (inFrustum) { int mem = atomicAdd(sNumVisibleLights, 1); sVisibleLightIndices[mem] = lightIndex; } } } barrier(); ivec2 texelSpaceTexCoords = ivec2(gl_GlobalInvocationID.xy); vec2 texCoords = vec2(texelSpaceTexCoords.x / SCREEN_WIDTH, texelSpaceTexCoords.y / SCREEN_HEIGHT); vec4 N = imageLoad(gNormalMap, texelSpaceTexCoords); vec3 P = calcWorldPos(texCoords, N.w); vec3 V = normalize(gViewPos - P); vec4 albedoSpecular = imageLoad(gAlbedoSpecularMap, texelSpaceTexCoords); vec4 fragColor = vec4(0.1f * albedoSpecular.rgb, 1.0f); for (int i = 0; i < sNumVisibleLights; i++) { fragColor += calcLighting(gPointLights[sVisibleLightIndices[i]], P, N.xyz, V, albedoSpecular); } imageStore(gOutput, texelSpaceTexCoords, fragColor);}

The first thing that occurs is a fast extraction of the viewing frustum planes, or rather the normals of the view frustum planes that define the boundaries of our given tile. I won’t go too deep into the mathematics behind the method, but you can take a look at the link posted in the comments, as well as resources such as Mathematics for 3D Game Programming and Computer Graphics just to get a sense of what’s going on.

Once we have the normals, the next thing that we’ll want to do is essentially distribute the work group threads in a manner that most efficiently determines the light sources with respect to our tile. You are free to use any number of heuristics to achieve that end, but I decided to map each thread to its respective offset in the work group. Then, for every multiple of the number of threads that we have in our work group, we process the light source at the index that’s at the same offset with respect to the multiple number. A somewhat brief subsequent calculation after that determines whether or not the light source intersects the inside of the frustum associated with our current tile, and if so, we add its index onto our per tile list of lights.

Once all the lights have been processed and we have a complete list of lights that are relevant to the pixels on our tile, we can then calculate the amount of lighting on our pixel (i.e., with respect to our shader invocation). We do that by iterating through the list of lights, which are represented as indices into the large buffer of lights that we had to process in a parallelized way earlier. The important thing to note here is that we have to wait until all threads finish processing the lights buffer. Otherwise, we risk using incomplete data in our calculations due to the non-deterministic nature of our parallel execution. We also have to image load/store our texture data in terms of texel space coordinates (i.e., 2560×1440) instead of the usual normalized device coordinates that we frequently use for fragment shaders (to be fair, we still do use NDC to extract the world position from our stored depth information).

That just about does it for this technique! Generally speaking, the main principles behind driving tiled shading with compute shaders are setting up a somewhat evenly distributed workload across the threads in the work group and defining the peculiar interactions with the shared memory to reduce the hits on the L2 cache.

There are several reasons to apply light culling and utilize lists of lights for our pixel evaluations, but perhaps the most important is that each fragment in a work group evaluates the same list, ensuring some coherency across the GPU warp, which has performance ramifications. Other reasons include the potential reusability of the light lists, which might be useful for subsequent shading passes, and the ability to lower framebuffer precision due to the setup not having to keep track of the results from a large number of lights over numerous iterations.

Potential Optimizations

It’s always worth considering improvements to the culling routine, especially because our naive sphere/frustum test lends itself to false positives in certain situations. This is because our frustum emanates from a screen-space tile, so it’s often long, thin, and asymmetrical. To address the problem, we could add a sphere/box test after the initial test against the frustum planes, which would still be vulnerable to false positives but to a much lesser extent.

We could also consider clustered shading, which divides the view frustum into a set of three-dimensional cells or clusters. Whereas tiled shading subdivides a screen space quad, clustered shading also divides by z-depth slices. The difference between this approach and specific tiled shading algorithms that consider the minimum and maximum z extents is that clustered shading does not need rendered geometry to execute the culling because the z-depth subdivision is performed on the entire view frustum.

There are definitely a lot of options to move this implementation forward (perhaps even to forward+)!

Resources

Mathematics for 3D Game Programming and Computer Graphics
Real-Time Rendering
Andrew Lauritzen’s Deferred Rendering for Current and Future Rendering Pipelines
Johan Andersson on DirectX 11 Rendering in Battlefield 3
NVIDIA Turing GPU Architecture Whitepaper

Adventures in Deferred Rendering: Tiled Deferred Shading and Compute Shaders (2024)

References