Blog – RasterGrid | Software Consultancy

Vulkan memory barriers and image layouts explained

Daniel Rákos — Tue, 03 Mar 2026 16:05:29 +0000

There are numerous misconceptions out there about Vulkan memory barriers and image layout transitions to the extent that even the most seasoned developers get certain details wrong about them. This article aims to clear up these misconceptions by delving into the details of the motivation and behavior behind these operations while also covering when and how to use them in practice.

Since the advent of the new generation of low-level / explicit graphics APIs such as Vulkan, developers have struggled to comprehend and correctly use pipeline barriers, which are often cited as the most difficult element of these new APIs. At the time of writing, these APIs have already been with us for a decade, yet there are still many aspects of them that are misinterpreted by developers. Misusing them can result in rendering corruption, performance issues, or both, and what makes them difficult to get right is that the symptoms of incorrect use may vary across hardware, even between GPU generations or models of the same vendor.

Over the years, these APIs also evolved a bit: some graphics APIs took ideas from others to leverage performance optimization opportunities unavailable before, specifications provided more accurate and concise descriptions, new versions of the APIs attempted to make them simpler and easier to understand. There are even initiatives that seek to sacrifice certain aspects of the control available to the developers on the altar of simplicity, such as the VK_KHR_unified_image_layouts extension that aims to (more or less) “deprecate” the need for image layout transitions.

These developments also reflect the fact that hardware implementations have converged and evolved since the conception of this new generation of graphics APIs, but, as we will see as we explore the problem space in this article, the fundamental principles and motivation behind these seemingly complex API concepts remain relevant today and are unlikely to ever go away completely.

Throughout the article, we will use AMD’s original GCN architecture as an example, as it is a well known and thoroughly documented GPU architecture that also happens to be able to take advantage of pretty much all aspects of the pipeline barrier design of Vulkan. While nowadays these GPUs can be thought of as archaic, the conceptual model still applies today to a wide range of hardware, even outside the world of GPUs.

Having made significant contributions to the original design of the Vulkan pipeline barrier APIs, we hope that the article also provides additional insight into the motivation behind certain design choices.

What is a pipeline barrier?

When we talk about pipeline barriers we mean the operations issued by command buffer commands such as vkCmdPipelineBarrier[2], and their split variants consisting of pairs of vkCmdSetEvent[2] and vkCmdWaitEvents[2] calls. These operations enable expressing dependencies between subsequently issued action commands such as draws, compute dispatches, ray tracing, video coding operations, etc.

Illustration of the behavior of regular (left) and split (right) barriers.

The need for the application developer to specify such dependencies is more or less a new requirement of the new generation of graphics APIs, but the need for them from a hardware perspective has existed for much longer. Due to the quasi-synchronous execution model of legacy APIs, such as OpenGL, such dependency information could not be specified, therefore it was the driver’s responsibility to automatically insert them behind the scenes. This is referred to as “implicit synchronization”, although in some cases even legacy APIs employed explicit synchronization operations in newer API features (see glMemoryBarrier as an example).

The problem with implicit synchronization is that the driver does not always have all the contextual knowledge to be able to identify the dependencies or minimize the necessary synchronization cost. This often results in the driver having to make conservative decisions about when and what type of pipeline barrier to insert and that typically results in suboptimal performance. This performance cost can range from less than 1% to over 50%, depending on the application, hardware, and driver. Furthermore, in an attempt to optimize these cases, drivers of legacy APIs usually employ heuristics that often lead to unexpected performance cliffs. Explicit synchronization therefore provides additional optimization tools to the developer and delivers more predictable performance characteristics at the cost of some application complexity.

It is also worth noting the effects of modern “bindless” and GPU-driven techniques from the perspective of synchronization. Implicit synchronization was only an option in legacy APIs because the APIs had exact, trackable points where a resource was bound to a specific use, therefore the driver could reason about when two uses of the same resource may need synchronization. With the advent of “bindless” resources and GPU-driven workload submission, the driver doesn’t even stand a chance and may have to resort to even more conservative synchronization choices with even higher unnecessary performance overhead.

A pipeline barrier can be thought of as the combination of the following:

Execution control – defines the set of previously issued work that needs to be waited on to complete before subsequently issued work can start
Cache control – provides a way to flush and/or invalidate caches in order to make sure that subsequently issued work has a coherent view of shared data
Image layout transitions – we will discuss these later in detail, in short, they allow transforming the internal representation of images (textures) for specific use

These are more or less orthogonal aspects of a pipeline barrier and, depending on the hardware implementation, it may be possible to issue them completely independently, but bundling them into a single operation enables better portability across hardware architectures, as hardware often performs them in tandem (particularly execution and cache control) while also enabling driver optimization opportunities (by coalescing operations).

Since VK_KHR_synchronization2 and Vulkan 1.3, there is no longer such a clear separation between execution control and cache control, as the new APIs express execution dependencies as part of the memory barriers, but we keep our discussion focused on the way the pipeline barrier design was originally conceived and discuss them separately. While the new, combined view of the two aspects of pipeline barriers makes it somewhat easier to reason about formally, the separate view of the two better showcases what actually happens on the underlying hardware.

Execution control

While the focus of this article is the topic of memory barriers and the related image layout transitions, it is worth saying a few words about execution control, i.e. the so-called “happens-before” and “happens-after” relationships.

Execution control is the main moving force of pipeline barriers. As the purpose of pipeline barriers is to define a dependency between two sets of work items, this aspect plays a key role by providing information about those.

In Vulkan, execution control is specified in the form of two sets of pipeline stage masks. The execution dependency will cause the hardware to wait for all previous workloads to complete all their work in the specified set of source pipeline stages before the processing of subsequent workloads starts in the destination pipeline stages. Vulkan defines a fine grained set of pipeline stage flags enabling the application to specify the exact point within the graphics (or other type of) pipeline that has to be waited on and where the wait has to happen at the latest.

Comparison between a “full” pipeline barrier (top) and a “partial” pipeline barrier (bottom).
The “full” pipeline barrier synchronizes all previous pipeline stages with all subsequent ones.
The “partial” pipeline barrier only synchronizes fragment shader stages.
Note how the latter allows for overlapping execution of shader stage workloads from different draw calls therefore resulting in a shorter overall execution time.
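To make the effect of the two stage masks more concrete, here is a minimal sketch that models a simplified, strictly linear four-stage graphics pipeline. The `STAGE_*` constants and the function names are hypothetical stand-ins rather than Vulkan identifiers, chosen so that the example compiles without the Vulkan headers. It relies on the rule that the source scope also covers logically earlier stages and the destination scope also covers logically later ones.

```c
#include <assert.h>

/* Hypothetical stand-ins for a linear graphics pipeline, in execution order.
   Real code would use VkPipelineStageFlags values instead. */
enum {
    STAGE_VERTEX_INPUT    = 1u << 0,
    STAGE_VERTEX_SHADER   = 1u << 1,
    STAGE_FRAGMENT_SHADER = 1u << 2,
    STAGE_COLOR_OUTPUT    = 1u << 3,
    STAGE_ALL             = (1u << 4) - 1u
};

/* Stages of earlier commands that must finish: the latest stage named in the
   source mask plus all logically earlier stages. */
unsigned first_sync_scope(unsigned src_mask)
{
    assert((src_mask & ~(unsigned)STAGE_ALL) == 0);
    for (unsigned bit = STAGE_COLOR_OUTPUT; bit != 0; bit >>= 1) {
        if (src_mask & bit)
            return (bit << 1) - 1u; /* this bit and every bit below it */
    }
    return 0;
}

/* Stages of later commands that must wait: the earliest stage named in the
   destination mask plus all logically later stages. */
unsigned second_sync_scope(unsigned dst_mask)
{
    assert((dst_mask & ~(unsigned)STAGE_ALL) == 0);
    for (unsigned bit = STAGE_VERTEX_INPUT; bit <= STAGE_COLOR_OUTPUT; bit <<= 1) {
        if (dst_mask & bit)
            return STAGE_ALL & ~(bit - 1u); /* this bit and every bit above it */
    }
    return 0;
}

/* Stages of later commands that may overlap with the earlier work. */
unsigned stages_free_to_overlap(unsigned dst_mask)
{
    return STAGE_ALL & ~second_sync_scope(dst_mask);
}
```

In this model, a fragment-to-fragment barrier leaves `STAGE_VERTEX_INPUT` and `STAGE_VERTEX_SHADER` of the next draw free to start early, whereas naming `STAGE_VERTEX_INPUT` as the destination blocks everything, which corresponds to the “full” barrier case.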

In practice, hardware does not necessarily have the same level of granularity. For example, some hardware may not be able to wait only on the completion of fragment shaders without also having to wait on the completion of color attachment outputs. Also, on the waiting side, some hardware may not be able to start the vertex processing and rasterization of subsequent work even if the wait is supposed to only block fragment shaders. Nonetheless, the application should specify fine grained information whenever possible as even if those do not provide any performance benefits on one GPU they may provide a significant performance boost on another. The benefits can grow as hardware and drivers evolve.

In case of the original GCN architecture, as an example, waiting was only possible inside the command processor, therefore typically (with some exceptions) all pipeline stages of all subsequent commands were blocked by the pipeline barrier, no matter the specified set of destination pipeline stage flags, but work completion signaling had some level of granularity, such as waiting on all vertex processing work and waiting on fragment shader completion. Newer hardware, however, supports finer granularity for execution control.

While this may sound overly verbose, this granularity for execution control enables significant optimization opportunities compared to legacy APIs where the implementation often had to resort to fully draining the pipeline before proceeding with subsequent work, resulting in the so-called “pipeline bubbles”. In fact, there are cases where even this fine granularity is not sufficient, for example, waiting on or before transfer operations may still result in unexpected dependencies as some transfer operations may not be implemented using dedicated DMA operations but through issuing driver-internal graphics or compute pipelines. This could have been solved by the API exposing what method of execution a particular transfer operation uses, but that would further increase the complexity of the API that many developers already deem too complex.

Cache control

We already discussed in great detail in our earlier article (Understanding GPU caches) that GPUs usually employ non-coherent cache hierarchies to boost performance. CPUs, in contrast, use coherent cache hierarchies that provide a convenient programming model but incur inherent die area and performance costs, despite having a relatively small number of threads stomping on shared data compared to GPUs. Therefore, execution control alone is rarely sufficient and cache control operations are necessary in almost all cases in order to ensure a coherent view of data in memory across subsequently issued operations.

In a nutshell, while GPUs usually feature at least one device-wide cache that is shared by, and therefore provides coherent access for, all or at least most processing units of the GPU, most processing units (shader cores, ROPs, etc.) have their own private, non-coherent caches that may need to be flushed and/or invalidated in order to share a consistent view of the underlying data across those units. These are the operations formally referred to as making the results of memory writes “available” and “visible”, respectively.

We generally distinguish between three types of cache coherency related data hazards:

Read-after-write (RAW) – when writes of previous operations may not have completed before subsequent operations attempt to read their results, potentially causing subsequent operations to read stale data and misbehave
Write-after-read (WAR) – when writes of subsequent operations may overwrite data that is yet to be read by previous operations, therefore potentially causing previous operations to misbehave
Write-after-write (WAW) – when writes of previous operations may not have completed before writes of subsequent operations, potentially producing unexpected results in the end

It is worth noting that, depending on the application, memory access hazards do not necessarily cause problems. For example, if the relative order of multiple writes does not matter, a WAW hazard could be ignored, and if it’s okay for reads to read stale (not the most up-to-date) data, then RAW hazards could be ignored. However, these are rare, niche cases, and typically applications expect consistent behavior and coherent accesses.
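The three hazard types follow directly from the kinds of the two accesses involved. As a minimal sketch (the type and function names are made up for illustration), the classification can be written as:

```c
#include <assert.h>

typedef enum { ACCESS_READ, ACCESS_WRITE } AccessKind;
typedef enum { HAZARD_NONE, HAZARD_RAW, HAZARD_WAR, HAZARD_WAW } Hazard;

/* Classifies the hazard between an earlier and a later access to the same
   memory location when nothing orders the two. */
Hazard classify_hazard(AccessKind earlier, AccessKind later)
{
    assert(earlier == ACCESS_READ || earlier == ACCESS_WRITE);
    assert(later == ACCESS_READ || later == ACCESS_WRITE);

    if (earlier == ACCESS_WRITE && later == ACCESS_READ)
        return HAZARD_RAW; /* the later read may see stale data */
    if (earlier == ACCESS_READ && later == ACCESS_WRITE)
        return HAZARD_WAR; /* the later write may clobber data still being read */
    if (earlier == ACCESS_WRITE && later == ACCESS_WRITE)
        return HAZARD_WAW; /* the final contents depend on completion order */
    return HAZARD_NONE;    /* two reads never conflict */
}
```

Whether a given hazard actually matters is still up to the application, as described above; the function only names the situation.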

Cache control is expressed in Vulkan pipeline barriers in the form of memory barriers specifying two sets of access masks. These specify the source access types performed by previous operations that need to be synchronized with the destination access types performed by subsequent operations. In the second version of the synchronization APIs introduced by the VK_KHR_synchronization2 extension and promoted to core in Vulkan 1.3, the access masks are always relative to the pipeline stages specified as part of the execution control, therefore more tightly coupling the two. This means that you must always include the pipeline stage flags corresponding to any specified access flag.
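The coupling between access flags and stage flags can be illustrated with a small validity check. This is only a sketch: the `STAGE_*` and `ACCESS_*` constants are hypothetical stand-ins for a handful of Vulkan flags (so the code compiles without the Vulkan headers), and the table covers just the few combinations listed, not the full mapping defined by the specification.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a few pipeline stage flags. */
enum {
    STAGE_FRAGMENT_SHADER = 1u << 0,
    STAGE_COMPUTE_SHADER  = 1u << 1,
    STAGE_COLOR_OUTPUT    = 1u << 2, /* mirrors COLOR_ATTACHMENT_OUTPUT */
    STAGE_TRANSFER        = 1u << 3
};

/* Hypothetical stand-ins for a few access flags. */
enum {
    ACCESS_SHADER_READ    = 1u << 0,
    ACCESS_SHADER_WRITE   = 1u << 1,
    ACCESS_COLOR_WRITE    = 1u << 2, /* mirrors COLOR_ATTACHMENT_WRITE */
    ACCESS_TRANSFER_WRITE = 1u << 3,
    ACCESS_ALL            = (1u << 4) - 1u
};

/* Stages that are able to perform the given single access type. */
static unsigned stages_supporting(unsigned access_bit)
{
    switch (access_bit) {
    case ACCESS_SHADER_READ:
    case ACCESS_SHADER_WRITE:   return STAGE_FRAGMENT_SHADER | STAGE_COMPUTE_SHADER;
    case ACCESS_COLOR_WRITE:    return STAGE_COLOR_OUTPUT;
    case ACCESS_TRANSFER_WRITE: return STAGE_TRANSFER;
    default:                    return 0;
    }
}

/* True if every access flag has at least one matching stage in the mask. */
bool access_mask_matches_stages(unsigned stage_mask, unsigned access_mask)
{
    assert((access_mask & ~(unsigned)ACCESS_ALL) == 0);
    for (unsigned bit = 1u; bit <= ACCESS_ALL; bit <<= 1) {
        if ((access_mask & bit) && (stages_supporting(bit) & stage_mask) == 0)
            return false;
    }
    return true;
}
```

For example, pairing a shader read access with only the transfer stage fails the check, while an empty access mask is accepted with any stage mask.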

The effect of specifying a set of source and destination access masks in a memory barrier therefore can be described as follows:

the source access mask is mapped to a set of corresponding cache paths (paths within the cache hierarchy serving the corresponding type or types of access)
the destination access mask is also mapped to a set of corresponding cache paths
the two sets of cache paths overlap up to some point (up to the most local cache that is shared and therefore coherent across the paths) or otherwise they are completely disjoint, therefore the only coherent view is through memory – this will be the target
all cache paths corresponding to the source access mask are flushed all the way up to the target cache level or memory
all cache paths corresponding to the destination access mask are invalidated all the way up to the target cache level or memory

Behavior of render-to-texture cache synchronization on GCN Gen1-4 (top) and Gen5 (bottom).
The memory barrier shown synchronizes color attachment output to shader read.
These diagrams illustrate how the same memory barrier can result in significantly different behavior and performance characteristics on different GPUs.
Note: RDNA is also similar to GCN Gen5 barring the extra cache levels. We included this example to showcase the shift in behavior across GPU generations.
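The flush/invalidate procedure described above can be sketched as a small model. Everything here is illustrative: the cache names, the `CachePath` type and `plan_cache_control` are hypothetical. The two example paths used in the note below assume that color writes on the older generations reach memory without passing through the L2 cache, while on the newer generation they share the L2 cache with shader reads.

```c
#include <assert.h>

/* Illustrative cache identifiers; real hierarchies have more levels. */
typedef enum { MEMORY, L2_CACHE, SHADER_L1, COLOR_CACHE } CacheLevel;

#define MAX_PATH_LEVELS 4

/* The levels an access type passes through, most local first, ending in MEMORY. */
typedef struct {
    CacheLevel levels[MAX_PATH_LEVELS];
    int        count;
} CachePath;

typedef struct {
    CacheLevel target;             /* most local level shared by both paths */
    int        flushed_levels;     /* source levels that get written back   */
    int        invalidated_levels; /* destination levels that get discarded */
} BarrierPlan;

static int index_of(const CachePath *path, CacheLevel level)
{
    for (int i = 0; i < path->count; ++i) {
        if (path->levels[i] == level)
            return i;
    }
    return -1;
}

/* Finds the target level and counts the levels in front of it on each path. */
BarrierPlan plan_cache_control(const CachePath *src, const CachePath *dst)
{
    BarrierPlan plan = { MEMORY, 0, 0 };
    assert(src->count > 0 && src->levels[src->count - 1] == MEMORY);
    assert(dst->count > 0 && dst->levels[dst->count - 1] == MEMORY);

    for (int i = 0; i < src->count; ++i) {
        int j = index_of(dst, src->levels[i]);
        if (j >= 0) {
            plan.target = src->levels[i];
            plan.flushed_levels = i;
            plan.invalidated_levels = j;
            break;
        }
    }
    return plan;
}
```

With a source path of `{COLOR_CACHE, MEMORY}` and a destination path of `{SHADER_L1, L2_CACHE, MEMORY}`, the target is memory and two destination levels have to be invalidated. If the source path is `{COLOR_CACHE, L2_CACHE, MEMORY}` instead, the target becomes the L2 cache and only the shader L1 has to be invalidated, which is the kind of difference the diagrams describe.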

Of course, this description may or may not fully reflect what actually happens on any specific hardware, as implementations may implement all sorts of cache coherency protocols. Furthermore, complete invalidation may be avoidable if the hardware can selectively invalidate only the cache lines that contain data from memory addresses actually updated by the flush. In other words, the effects of a full flush and invalidation can be achieved through applying appropriate coherency protocols directly between the two disjoint cache paths. Although, as with many other details, GPU architectures vary and will use a solution that they deem the best for the purpose, potentially using different policies for different types of caches within the hierarchy, and these details do not have any visible effect on the application other than somewhat varying performance characteristics observed on different implementations, or different types of corruptions when using incorrect pipeline barriers. In the end, everything is a die area, power/energy usage, and performance trade-off.

Another thing to note is that even pipeline barriers can only synchronize access across these non-coherent cache paths at the granularity of entire operations such as two draw calls. Within a single draw, compute dispatch, or other operation, there may still be data hazards. There are various tools to avoid those too in specific cases, such as the OpMemoryBarrier SPIR-V instruction, the Coherent SPIR-V decoration (pre-Vulkan Memory Model), and the Make*Available/Make*Visible SPIR-V operand bits (post-Vulkan Memory Model). While those features are orthogonal to pipeline barriers, the latter ones deserve an example as they have important interactions with pipeline barriers.

It is quite common in modern workloads to have a set of back-to-back compute dispatches that share data, e.g. earlier compute dispatches writing outputs that are consumed by subsequent ones. A natural way to synchronize such compute dispatches would be with a pipeline barrier that specifies COMPUTE_SHADER_BIT in both the source and destination pipeline stage masks, and SHADER_WRITE_BIT and SHADER_READ_BIT in the source and destination access masks, respectively. However, if the shared resource is marked with the Coherent SPIR-V decoration, then the memory operations themselves will, by definition, be coherent (i.e. they bypass the shader core’s private cache). Similar behavior can be achieved when using the Vulkan Memory Model and the Make*Available/Make*Visible SPIR-V operand bits used with the Device scope.

In this case the source and destination access masks can be left blank in the pipeline barrier and proper synchronization is still ensured while still achieving similar performance, as the shader core private caches rarely show traditional cache reuse behavior due to their small size (as explained in our article about Understanding GPU caches). This approach can provide great optimization opportunities when combined with split barriers consisting of pairs of vkCmdSetEvent[2] and vkCmdWaitEvents[2] calls, as when there is sufficient workload submitted between the dependent compute shaders there may not be a need for any wait to happen (the previous compute shader may already be done) and the entire (execution-control-only) pipeline barrier may turn into a zero-cost operation.

Comparison of using a split barrier with cache control to synchronize dependent compute shader writes with subsequent reads (top) and using coherent accesses with an execution-control-only barrier (bottom).
Note how this enables execution to overlap between independent workloads and therefore reduce total execution time.
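The two variants of the compute-to-compute barrier, and the reason the wait half of a split barrier can become free, can be sketched as follows. The `Barrier` struct, the flag constants and both functions are hypothetical stand-ins that only mirror the Vulkan flags named in the text; the timing function is a deliberately simple model using abstract time units.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins mirroring the Vulkan flags named in the text. */
enum { STAGE_COMPUTE_SHADER = 1u << 0 };
enum { ACCESS_SHADER_READ = 1u << 0, ACCESS_SHADER_WRITE = 1u << 1 };

typedef struct {
    unsigned src_stages, dst_stages; /* execution control */
    unsigned src_access, dst_access; /* cache control     */
} Barrier;

/* Barrier between two dependent compute dispatches. When the shaders already
   use device-coherent accesses, the access masks can stay empty. */
Barrier compute_to_compute_barrier(bool shader_accesses_are_coherent)
{
    Barrier barrier = { STAGE_COMPUTE_SHADER, STAGE_COMPUTE_SHADER, 0, 0 };
    if (!shader_accesses_are_coherent) {
        barrier.src_access = ACCESS_SHADER_WRITE;
        barrier.dst_access = ACCESS_SHADER_READ;
    }
    /* The execution dependency itself is needed in both variants. */
    assert(barrier.src_stages != 0 && barrier.dst_stages != 0);
    return barrier;
}

/* Time the consumer stalls at the wait half of a split barrier: zero if the
   producer has already finished by the time the wait is reached. */
unsigned wait_stall(unsigned producer_done_at, unsigned wait_reached_at)
{
    return producer_done_at > wait_reached_at
         ? producer_done_at - wait_reached_at
         : 0u;
}
```

If the producer finishes at time 100 and enough independent work was recorded for the wait to be reached only at time 250, `wait_stall` returns zero, and with empty access masks there is no cache work left to do either.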

While driver implementations could implicitly detect in the example above that all the previously executed compute shaders used the Coherent SPIR-V decoration or the Make*Available/Make*Visible SPIR-V operand bits, therefore the SHADER_WRITE_BIT and SHADER_READ_BIT cache control flags could be ignored, there is no guarantee on whether actual driver implementations will do so. Even if they do, they can only make decisions based on static code analysis that inherently limits what the driver may be able to assume.

This use case alone is a great example of why execution control and cache control are separate and why having explicit control over each can be beneficial, but there are other, less common cases where one could take advantage of controlling these two aspects of synchronization separately, particularly, performing execution control without cache control. On a related note, unlike the original synchronization APIs in Vulkan, the variants introduced by VK_KHR_synchronization2 no longer support the other direction: performing cache control without execution control, though this does not limit any practical use cases.

One more interesting aspect of VK_KHR_synchronization2 is that the split version of pipeline barriers now includes the memory dependencies as part of the information passed to vkCmdSetEvent2. In order to understand why the original design did not include that information in the first part of the split barrier, it is worth looking into how they were implemented on the GCN architecture originally (and likely on many other GPU architectures). There, execution control was as simple as signaling the completion of earlier work by asking the hardware to write a value to a memory address (the VkEvent object) once the work is complete that the GPU’s command processor can then wait on later (as discussed earlier, the original GCN could only wait at the “top of the pipe”). The cache control operation itself (both cache flushing and invalidation) was a completely orthogonal operation issued by the command processor afterwards. Therefore, specifying the memory dependencies as part of the signal operation (vkCmdSetEvent) was unnecessary. However, on an implementation that can combine execution control and cache control operations or one that can issue cache flushes separately from cache invalidations, having the information up-front, at the time of signaling the event, enables issuing any necessary cache flushes earlier or otherwise optimizing the process. Therefore the new model isn’t just better from the perspective of the formal model, but can actually provide tangible benefits on a sufficiently sophisticated hardware implementation.

Buffer and image memory barriers

Memory barriers in Vulkan come in three forms:

Global memory barriers – describe memory barriers that apply to all memory locations
Buffer memory barriers – describe memory barriers that apply only to memory locations corresponding to the specified buffer range
Image memory barriers – describe memory barriers that apply only to memory locations corresponding to the specified image subresource range

There are various reasons for the existence of these three separate categories. For example, buffer and image memory barriers can also include queue family ownership transfers and image memory barriers can also include image layout transitions. These will be discussed later in detail. For now we will only focus on the cache control aspect of these.

As the brief descriptions above suggest, the key difference between the three is the range of memory locations the memory barrier applies to. This enables exposing the ability to perform partial/selective cache flush/invalidation operations on hardware implementations that support them. Unlike a full cache flush/invalidation operation, such cache coherency operations can selectively flush/invalidate only a subset of the cache lines in the affected caches that fall into a specific address range while leaving all other cache lines untouched. In our case, these address ranges are provided indirectly through the specified buffer ranges and/or image subresource ranges.
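The difference between a full and a partial flush can be shown with a toy cache model. This is a sketch under simplifying assumptions: a fixed 64-byte line size, a flat array of lines, and hypothetical names throughout; real hardware tracks and walks its caches very differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64u /* assumed line size for this model */

typedef struct {
    uint64_t address; /* start address of the cached line */
    bool     dirty;   /* holds data not yet written back  */
} CacheLine;

/* Writes back only the dirty lines overlapping [base, base + size) and
   returns how many lines were flushed. */
size_t flush_range(CacheLine *lines, size_t line_count, uint64_t base, uint64_t size)
{
    size_t flushed = 0;
    assert(lines != NULL || line_count == 0);
    for (size_t i = 0; i < line_count; ++i) {
        uint64_t start = lines[i].address;
        uint64_t end   = start + LINE_SIZE;
        bool overlaps  = start < base + size && end > base;
        if (lines[i].dirty && overlaps) {
            lines[i].dirty = false;
            ++flushed;
        }
    }
    return flushed;
}

/* Writes back every dirty line regardless of its address. */
size_t flush_all(CacheLine *lines, size_t line_count)
{
    size_t flushed = 0;
    for (size_t i = 0; i < line_count; ++i) {
        if (lines[i].dirty) {
            lines[i].dirty = false;
            ++flushed;
        }
    }
    return flushed;
}

/* Number of lines still holding data that has not been written back. */
size_t count_dirty(const CacheLine *lines, size_t line_count)
{
    size_t dirty = 0;
    for (size_t i = 0; i < line_count; ++i)
        dirty += lines[i].dirty ? 1u : 0u;
    return dirty;
}
```

A buffer or image memory barrier corresponds to calling `flush_range` with the address range of the resource, leaving lines that belong to other resources in place, whereas a global memory barrier corresponds to `flush_all`.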

This can have a positive performance effect for accesses of other resources that the memory barriers do not affect as their data is not evicted from the lower-level caches unnecessarily. Unfortunately, not all hardware supports this, and even those that do don’t necessarily have this particular optimization opportunity leveraged in their driver stack. Furthermore, the performance benefits of such optimizations are usually only visible when the cache control operation results in having to flush/invalidate larger caches (like the L2 cache flush/invalidation that was often needed on the original GCN architecture to synchronize framebuffer operations with texturing) as the most local caches are rarely sufficiently large to still have relevant data to be reused from the cache at any given time.

Nonetheless, even though it is an opportunistic optimization that may or may not show any benefits in practice on any given workload, application developers should prefer to use buffer and/or image memory barriers when they can potentially benefit from partial cache flush/invalidation support. That latter point, however, is important…

It is quite common that Vulkan applications use buffer and image memory barriers in all cases, often submitting pipeline barriers with a long list of buffer and image memory barriers. While that may sound like a good idea, in practice it is a pessimization for the following reasons:

If the list of buffer and image memory barriers is long, it likely covers all (or at least most) memory locations currently in the cache(s), nullifying the benefits of partial cache flush/invalidation operations and potentially even resulting in higher overhead if the driver actually submits those as separate partial cache flush/invalidation operations to the hardware
Even if the driver detects “abuse” and reduces all those partial cache flush/invalidation requests to a single full cache flush/invalidation request, there is a CPU overhead to producing and parsing those long lists of memory barriers

Therefore, we strongly recommend that applications use global memory barriers by default and only use buffer and image memory barriers in one of the following cases:

If a queue family ownership transfer needs to be performed on the resource
If an image layout transition needs to be performed on the subresources of the image
If only a single or only a few resources need to be synchronized and there is a chance to benefit from the possibility of using partial cache flush/invalidation operations
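The recommendation above can be condensed into a small decision helper. The function name, the enum and especially the `FEW_RESOURCES` cut-off are assumptions made for this sketch; what counts as “a few” depends on the application and the hardware.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { BARRIER_GLOBAL, BARRIER_PER_RESOURCE } BarrierKind;

/* Arbitrary cut-off for "only a few resources"; tune per application. */
#define FEW_RESOURCES 2

BarrierKind choose_barrier_kind(bool needs_ownership_transfer,
                                bool needs_layout_transition,
                                int  resources_to_sync)
{
    assert(resources_to_sync >= 0);

    /* Only buffer/image memory barriers can express these operations. */
    if (needs_ownership_transfer || needs_layout_transition)
        return BARRIER_PER_RESOURCE;

    /* A partial cache flush/invalidation may pay off for a short list. */
    if (resources_to_sync > 0 && resources_to_sync <= FEW_RESOURCES)
        return BARRIER_PER_RESOURCE;

    /* Default: one global memory barrier covering everything. */
    return BARRIER_GLOBAL;
}
```

A pass that touches dozens of resources without any layout transition or ownership transfer ends up with `BARRIER_GLOBAL`, which also removes the need to build and parse a long barrier list.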

Following this advice not only can result in better performance overall, but can also eliminate complex and unnecessary tracking in the application code. As we will see also in other cases, it’s a good rule of thumb that if the application does extensive tracking to provide fine grained information to the API, then it probably uses the wrong approach. Sure, in many cases some tracking might be justified to handle non-local (in terms of source code) command dependencies when recording to a command buffer, but even those typically can be limited to accumulating pipeline stage flags and access flags to aggregate information about dependency on earlier workloads and memory operations.
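The kind of lightweight tracking mentioned above, accumulating stage and access flags instead of tracking individual resources, might look like the following sketch. The `PendingBarrier` type, its functions and the flag constants are hypothetical; in a real application the accumulated masks would feed a single global memory barrier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a few stage and access flags. */
enum { STAGE_COMPUTE = 1u << 0, STAGE_FRAGMENT = 1u << 1, STAGE_TRANSFER = 1u << 2 };
enum { ACCESS_SHADER_READ = 1u << 0, ACCESS_SHADER_WRITE = 1u << 1, ACCESS_TRANSFER_WRITE = 1u << 2 };

/* Masks collected while recording, to be emitted as one global barrier. */
typedef struct {
    unsigned src_stages, src_access;
    unsigned dst_stages, dst_access;
} PendingBarrier;

/* Merges one more dependency into the pending barrier. */
void pending_barrier_add(PendingBarrier *pending,
                         unsigned src_stages, unsigned src_access,
                         unsigned dst_stages, unsigned dst_access)
{
    assert(pending != NULL);
    pending->src_stages |= src_stages;
    pending->src_access |= src_access;
    pending->dst_stages |= dst_stages;
    pending->dst_access |= dst_access;
}

bool pending_barrier_is_empty(const PendingBarrier *pending)
{
    return pending->src_stages == 0 && pending->dst_stages == 0;
}

/* Returns the accumulated masks and resets the accumulator. */
PendingBarrier pending_barrier_take(PendingBarrier *pending)
{
    PendingBarrier result = *pending;
    *pending = (PendingBarrier){ 0, 0, 0, 0 };
    return result;
}
```

The recording code calls `pending_barrier_add` wherever it learns about a dependency and, right before the dependent work, emits one barrier from `pending_barrier_take` if the accumulator is not empty.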

Pipeline barrier dependency flags

Another noteworthy aspect of pipeline barriers is that execution and memory dependencies resolved by the barrier can be restricted or widened to apply to specific scopes by setting the appropriate dependency flags. For example, VK_DEPENDENCY_DEVICE_GROUP_BIT can be used to express that the barrier should be applied across all devices in the device group (for multi-GPU applications).

The most interesting and commonly used dependency flag remains VK_DEPENDENCY_BY_REGION_BIT, which has existed since Vulkan 1.0. As discussed in detail in our popular article GPU architectures explained, unlike immediate-mode rendering (IMR) GPUs, tile-based rendering (TBR) GPUs do not implement the graphics pipeline in the traditional order it is typically described in. Instead, they first perform the so-called “binning” phase that encompasses most of the geometry processing stages and then perform the fragment processing stages for each tile, one after another (of course, with as much parallelism as the number of shader cores in the GPU allows).

Simplified illustration of the tile-based rendering pipeline.

The VK_DEPENDENCY_BY_REGION_BIT provides the unique ability to leverage that by minimizing the scope and therefore the performance impact of pipeline barriers in specific scenarios. In particular, one common example is a fragment-shader-to-fragment-shader dependency, or any other dependency where only fragment processing stages are involved, and the dependency itself is local in framebuffer space (e.g. because fragment operations in subsequent draws only depend on the results of fragment operations in previous draws on a pixel-local basis). In such cases VK_DEPENDENCY_BY_REGION_BIT avoids having to wait for the completion of all previous per-tile fragment processing steps.

Comparison of the behavior of a fragment-to-fragment pipeline barrier on a tile-based rendering GPU without (top) and with (bottom) VK_DEPENDENCY_BY_REGION_BIT.

While it can already be seen that this could allow the same workload to be completed faster if there are sufficient processing units available on the GPU, it provides an even more important benefit. TBR GPUs operate on a per-tile basis to optimize framebuffer accesses. This is achieved by keeping framebuffer data in fast on-chip tile memory for the duration of the processing of a tile. However, getting that data on chip (tile load) and then writing back the final results (tile store), when needed, is still very expensive, therefore TBR GPUs can benefit significantly whenever more workload can be crammed in the same per-tile pass, which also explains why Vulkan originally introduced the concept of render passes.

In case of our specific example of a fragment-to-fragment dependency, it is easy to demonstrate how VK_DEPENDENCY_BY_REGION_BIT can avoid redundant tile load/store operations even on a small TBR GPU with a single shader core.

Comparison of the behavior of a fragment-to-fragment pipeline barrier on a single-core tile-based rendering GPU without (left) and with (right) VK_DEPENDENCY_BY_REGION_BIT.

In summary, VK_DEPENDENCY_BY_REGION_BIT allows inserting local fragment-to-fragment pipeline barriers while still allowing TBR GPUs to merge the draws into a single hardware render pass.
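The tile load/store savings can be estimated with a back-of-the-envelope model. It is a toy sketch with a hypothetical function name, and it assumes that every hardware pass loads and stores each tile exactly once, that a barrier without the by-region flag ends the current per-tile pass, and that one with the flag lets the draws stay in the same pass.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts tile loads plus tile stores for a chain of draws where each draw
   depends on the fragment results of the previous one. */
unsigned tile_memory_transfers(unsigned tile_count,
                               unsigned dependent_draws,
                               bool     by_region)
{
    assert(dependent_draws >= 1);

    /* Without by-region, each barrier between draws splits the work into a
       separate hardware pass over all tiles. */
    unsigned hardware_passes = by_region ? 1u : dependent_draws;

    return hardware_passes * tile_count * 2u; /* one load + one store per tile */
}
```

For two dependent draws over four tiles, the model yields 16 transfers without the flag and 8 with it, and the gap grows linearly with the number of dependent draws.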

Image layouts

We have arrived at the most important topic that motivated the creation of this article. Image layouts are one of the most misunderstood features of Vulkan among developers. Developer frustration stemming from this misunderstanding culminated in the attempt to effectively “deprecate” image layouts through the VK_KHR_unified_image_layouts extension, at least to the extent that simple rendering applications can pretty much avoid having to worry about image layouts and image layout transitions in most cases. Our goal is to better explain the rationale behind Vulkan image layouts, why they remain relevant, particularly for advanced use cases crossing device IP block boundaries, and how to use them effectively.

The number one misconception about image layouts is that the application has to track them, therefore it comes with inherent and unavoidable complexity and run-time cost. We will see that, except in a few specific scenarios, applications never have to track image layouts. But let’s step back a bit first and discuss what image layouts are and what they represent…

Image layouts were introduced in Vulkan to enable the application to tell the driver which type of usage the physical representation of an image subresource should be optimized for. This typically translates to the driver deciding what types of image compression schemes it should enable or whether it needs to perform some sort of decompression or compression/resummarization operation when transitioning the image layout of image subresources. We will discuss some types of compression schemes where image layouts have an effect, but it is worth noting that image layouts are not necessarily always about compression: there are other ways image layout transitions may transform the physical in-memory representation of an image subresource or any metadata associated with it.

Possible values for the API image layout include the following:

VK_IMAGE_LAYOUT_UNDEFINED – indicates that the image subresource is not in any specific image layout (yet) therefore the driver should make no assumptions about its current contents or physical representation (as we will see, this image layout is particularly interesting and enables certain optimization opportunities)
VK_IMAGE_LAYOUT_GENERAL – indicates that the image subresource is in an effective hardware image layout that allows using the image in any fashion as otherwise allowed by the operation itself and the image creation parameters
VK_IMAGE_LAYOUT_*_OPTIMAL – image layouts with the OPTIMAL suffix indicate that the image subresource is in an effective hardware image layout optimized for a specific usage (e.g. READ_ONLY_OPTIMAL) and cannot be used in any other fashion while being in this layout
VK_IMAGE_LAYOUT_* – image layouts specifying a particular use without the OPTIMAL suffix indicate that the image subresource is in an effective hardware image layout compatible with that specific use (e.g. PRESENT_SRC) and is the naming convention used by image layouts that are specific to a particular use that is not covered by the VK_IMAGE_LAYOUT_GENERAL layout

Over time, all sorts of use case specific and special purpose image layouts have been added to the API whenever a new type of image usage came along that could potentially benefit from optimized or otherwise specific in-memory representation.

Note that we intentionally use the term “image subresource” here, as it’s not really the image, as a whole, that has a corresponding current image layout, but individual mip levels and layers of an image, i.e. its subresources. Therefore at any given time every single mip level and every single layer within that mip level can have its own effective hardware image layout. The addition of the “effective” qualifier and using the term “hardware image layout” is also intentional, as there may not be a one-to-one correspondence between VkImageLayout values and corresponding physical representations, i.e. effective hardware image layouts. In practice, different GPUs, even ones from the same vendor, but from different hardware generations, may end up mapping the same VkImageLayout value to different effective hardware image layouts, and the set of VkImageLayout values mapping to the same effective hardware image layout may also vary across implementations. As an example, VK_IMAGE_LAYOUT_GENERAL and VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL both map to the same fully uncompressed effective hardware image layout on AMD’s first GCN architecture for all non-multisampled color images, but the same is not true for later GCN generations, or for depth and multisampled images.

Example showing an image with 3 mip levels and 4 array layers with each image subresource being in a specific image layout.
Note: while such variation of image layouts across individual image subresources is rare, it demonstrates that image layout state can be changed for each image subresource individually.

The mapping between API image layouts and effective hardware image layouts therefore is subject not just to the specific GPU architecture, but also to the creation parameters of the image. The available set of effective hardware image layouts, for example, can vary based on the used image tiling (linear or optimal), whether the image is multisampled, what usage flags were specified, etc.

As a trivial example, if an image is created with only the VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT usage flag specified, then the driver has the freedom to map both VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL and VK_IMAGE_LAYOUT_GENERAL to the same effective hardware image layout optimized for color attachment use, as the image cannot be used for anything else. However, the same may not be true if VK_IMAGE_CREATE_ALIAS_BIT is also specified when creating the image.

The VkSharingMode can also affect the mapping, as it controls whether the driver can assume exclusive access to the image from queues of the queue family currently owning it, or whether it may be simultaneously accessed by queues from different queue families with potentially different hardware capabilities, as discussed later in this article. Hardware capabilities are the key here, as different queue families, or even different hardware units serving the functionalities provided by a given queue family, such as the texturing units and ROPs of the GPU, may not have the same capabilities when it comes to interpreting the data in memory.

Everything sounds simple if one thinks about images as simple 1/2/3-dimensional arrays of individual raw texel values, but the physical reality is often far from the good old pitch-linear image representation. Even the most basic GPU feature, texture tiling/swizzling, breaks that assumption: the texels are not stored in pitch-linear fashion but are often assembled into recursive hierarchies of tiles arranged in memory following the Morton order or a similar order optimized for spatial locality. It’s easy to see that even choosing the optimal image tiling mode for a given GPU and image may come with compromises, as one layout may be optimal for texturing but not ideal for color attachment use. It’s also possible for some image tiling modes to be entirely incompatible with one or more hardware blocks within the GPU, so the driver often has to make compromises in the physical representation already because of the set of image usages that the application requested.
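To make the difference between pitch-linear and tiled addressing more tangible, here is a minimal sketch of 2D Morton (Z-order) indexing next to plain pitch-linear indexing. The function names are made up for illustration, and actual hardware swizzle patterns are vendor specific and usually more elaborate than a pure Morton order.

```cpp
#include <cassert>
#include <cstdint>

// Spreads the low 16 bits of v so that a zero bit sits between every
// two original bits (bit i moves to bit 2*i).
inline std::uint32_t spreadBits16(std::uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Morton index of texel (x, y): x bits land in even positions,
// y bits in odd positions, so nearby texels get nearby indices.
inline std::uint32_t mortonIndex(std::uint32_t x, std::uint32_t y) {
    assert(x < 65536u && y < 65536u);
    return spreadBits16(x) | (spreadBits16(y) << 1);
}

// Pitch-linear index of texel (x, y), for comparison.
inline std::uint32_t linearIndex(std::uint32_t x, std::uint32_t y, std::uint32_t width) {
    return y * width + x;
}
```

With these definitions, the four texels of any aligned 2×2 quad occupy four consecutive Morton indices, while in a pitch-linear image two vertically adjacent texels are a full row apart.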

But it’s not just the order of the individual texels within an image subresource that can vary. Images often have additional meta-surface planes that store information about the compression state or other metadata about the image subresources. These meta-surface planes are typically reduced resolution images whose pixels contain information about the state of the corresponding block/tile (e.g. 4×4, 8×8, or 16×16 pixels) within the actual image subresource. Often, as indicated by such meta-surface planes, the main image plane does not even contain raw texel values but rather something else (such as compressed data) that can only be interpreted together with the corresponding meta-surface plane entries.

In an ideal world, all hardware blocks on the device would be able to interpret and update all such data, both in the main image plane and in any meta-surface planes. However, in practice that is not always feasible due to added die area or performance cost, and there can also be other architectural limitations such as those imposed by cache hierarchy structure and even cache line sizes.

On many recent GPUs, sticking with traditional graphics and compute only workloads may allow using hardware units that share the “same language” therefore can all interpret and update the same data formats (i.e. what VK_KHR_unified_image_layouts promises) without any (or at least not too many) compromises. The same may not be true everywhere, especially when you add other hardware components to the picture such as DMA engines, video codec engines (both present on contemporary graphics cards), or external device interop. The fact that VK_KHR_unified_image_layouts still kept the VK_IMAGE_LAYOUT_PRESENT_SRC_KHR image layout shows that we can have non-orthogonality just trying to “speak” with the display engine.

So why bother with such meta-surface planes at all? Well, performance. The compression schemes and other metadata information that these meta-surface planes enable can often provide double-digit percentage performance improvements on real-world applications, so obviously GPU architectures take advantage of them to the extent possible. Then why not support them everywhere? Sometimes it would simply not be feasible from an architecture or die area perspective, as alluded to earlier. Sometimes it just doesn’t make sense as the compression scheme or metadata is only relevant to specific operations. Sometimes, like in the case of interop with an external device (such as a camera, capture card, etc.) coming from a different vendor, it’s just not possible at all.

Image layout transitions, therefore, enable performing (typically in-place) transformation of the data in the main plane and the meta-surface planes of the image (such as decompression) to make the image subresource’s in-memory representation compatible with a different use.

Common image compression schemes

Fast clears

The history of “fast clears” goes a long way back, so many are probably familiar with the concept that framebuffer attachment clears are fast and preferred to be done, even if the entire framebuffer is intended to be overwritten, as they can also speed up subsequent rendering. But how exactly is that achieved?

The simplest solution is to use a small meta-surface plane that has exactly one bit for each block/tile of the actual image data that marks whether the block is “fast cleared” (1) or not (0). This enables clearing a framebuffer attachment by simply setting all bits of this meta-surface plane, without even having to touch the main image plane, therefore saving a significant amount of memory bandwidth. Furthermore, when rendering to the attachment, the ROP can just look at the meta-surface plane and avoid having to load any image blocks into the cache if they are in the “fast cleared” state, providing additional memory bandwidth saving even during rendering.

All of this requires, though, that the ROP is able to load and interpret the meta-surface plane and that it knows the clear color/value. Adding the same capability to the texture units may be too much complexity: it may involve additional indirections and increase cache storage requirements. Therefore, many GPUs did not even attempt to support reading such “fast cleared” images through the texture units. Instead, the appropriate image layout transition (e.g. from VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL) “decompresses” (eliminates) the fast clear by actually writing out the clear color/value to the texel blocks/tiles of the main image plane that are still marked as “fast cleared”. This may still be much faster than not using fast clear at all, as the performance benefits are still there during the rendering to the image as an attachment, and at the point of the image layout transition there may not even be any more texel blocks/tiles in the “fast cleared” state.

Illustration showing a rendering of the Stanford Dragon into a fast cleared color attachment.
The example shows a 64×64 8-bit RGBA color attachment image using a fast clear meta-surface with 1 bit per 8×8 block/tile.
Note that the main image plane (top left, total size: 16 kilobytes) may still have uninitialized data in the 8×8 blocks/tiles that were not rendered to after the fast clear, as indicated by the 1 values in the fast clear meta-surface plane (bottom left, total size: 8 bytes), but combined with the known fast clear value (transparent black in the example) the effective content of the image (right) looks as expected.
The diagram also demonstrates how an image layout transition can decompress fast clears by combining the two planes (left) and the fast clear color to produce the uncompressed image (right).
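The mechanism can be mimicked on the CPU with a toy model. The sketch below assumes the same parameters as the illustration (8×8 tiles, one meta-surface bit per tile, 32-bit RGBA texels). All type and function names are hypothetical, and expanding a tile on its first write is a simplification of what a real ROP and its caches would do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of a color image with a 1-bit-per-tile fast clear meta-surface.
struct FastClearImage {
    static constexpr int kTile = 8;      // 8x8 texel tiles, as in the illustration
    int width;
    int height;
    std::vector<std::uint32_t> texels;   // main image plane, one RGBA8 texel per entry
    std::vector<std::uint8_t> meta;      // meta-surface plane, one bit per tile
    std::uint32_t clearValue = 0;

    FastClearImage(int w, int h)
        : width(w), height(h),
          texels(static_cast<std::size_t>(w) * h, 0xDEADBEEFu),  // stands for uninitialized data
          meta((static_cast<std::size_t>(w / kTile) * (h / kTile) + 7) / 8, 0) {
        assert(w % kTile == 0 && h % kTile == 0);
    }

    int tileIndex(int x, int y) const { return (y / kTile) * (width / kTile) + x / kTile; }
    bool isFastCleared(int tile) const { return ((meta[tile / 8] >> (tile % 8)) & 1u) != 0; }

    // Fast clear: only the meta-surface is written, the main plane is untouched.
    void fastClear(std::uint32_t value) {
        clearValue = value;
        for (auto& byte : meta) byte = 0xFF;
    }

    // Writes the clear value into every texel of one tile and drops its flag.
    void expandTile(int tile) {
        const int tilesX = width / kTile;
        const int x0 = (tile % tilesX) * kTile;
        const int y0 = (tile / tilesX) * kTile;
        for (int y = y0; y < y0 + kTile; ++y)
            for (int x = x0; x < x0 + kTile; ++x)
                texels[static_cast<std::size_t>(y) * width + x] = clearValue;
        meta[tile / 8] &= static_cast<std::uint8_t>(~(1u << (tile % 8)));
    }

    // Attachment-style write: a fast cleared tile is expanded before its first write.
    void writeTexel(int x, int y, std::uint32_t value) {
        const int tile = tileIndex(x, y);
        if (isFastCleared(tile)) expandTile(tile);
        texels[static_cast<std::size_t>(y) * width + x] = value;
    }

    // Effective content: main plane combined with meta-surface and clear value.
    std::uint32_t readTexel(int x, int y) const {
        return isFastCleared(tileIndex(x, y))
                   ? clearValue
                   : texels[static_cast<std::size_t>(y) * width + x];
    }

    // What a transition to a layout without fast clear support would do.
    // Returns the number of tiles that had to be written out.
    int decompress() {
        int expanded = 0;
        const int tileCount = (width / kTile) * (height / kTile);
        for (int tile = 0; tile < tileCount; ++tile)
            if (isFastCleared(tile)) { expandTile(tile); ++expanded; }
        return expanded;
    }
};
```

For a 64×64 image this model ends up with a 16 kilobyte main plane and an 8 byte meta-surface, matching the numbers in the illustration. After a fast clear followed by a write to a single texel, decompress() only has to write out the 63 tiles that were never rendered to.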

This technique still exists in various forms on contemporary GPUs, although it is, to some extent, subsumed by more advanced color compression schemes (such as delta color compression) or more specific ones (such as hierarchical-depth, depth plane compression, or specialized multisampled image compression schemes).

Delta color compression

Delta color compression (DCC) is a technique typically used for non-multisampled color images, as depth and multisampled images often use more specialized schemes. It works by applying lossless compression to the texels within a block/tile. While the compressed data is stored within the main image plane, this compression scheme (like most) typically still uses a separate meta-surface plane to store information about the compression state of each block/tile.

On AMD GPUs, it first appeared with the GCN Gen3 architecture, but went through a set of iterations over subsequent GPU architecture generations. While reading DCC compressed images from shaders was supported from the beginning, without the need to decompress (i.e. doing an image layout transition from VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL was effectively a no-op), the same was not true for shader writes, so transitioning to the VK_IMAGE_LAYOUT_GENERAL layout still required decompression if the image was created with VK_IMAGE_USAGE_STORAGE_BIT, for example.

Current image layout	Image usage flags specified at create time	GCN Gen3	RDNA
GENERAL	COLOR_ATTACHMENT_BIT | SAMPLED_BIT
GENERAL	COLOR_ATTACHMENT_BIT | SAMPLED_BIT | STORAGE_BIT
COLOR_ATTACHMENT_OPTIMAL	COLOR_ATTACHMENT_BIT | SAMPLED_BIT | STORAGE_BIT
SHADER_READ_ONLY_OPTIMAL	COLOR_ATTACHMENT_BIT | SAMPLED_BIT | STORAGE_BIT

Examples of image layouts, image usage flags, and corresponding DCC behavior on the GCN Gen3 and RDNA GPU architectures.
GCN Gen3 does not support shader writes to DCC compressed surfaces, therefore it cannot keep the image compressed in the VK_IMAGE_LAYOUT_GENERAL layout if the image was created with VK_IMAGE_USAGE_STORAGE_BIT.
Note that the examples show best case scenarios as actual driver behavior may be different due to interactions with other image creation parameters such as sharing mode, image create flags, etc.

It is also worth looking into what happens when transitioning from an API image layout that maps to an effective hardware image layout not supporting DCC (or, for that matter, any compression scheme) to an API image layout that maps to an effective hardware image layout that does. A natural thought process would be that such an image layout transition would compress the image subresource(s). In practice, that is rarely the case, and it is more conventional for such image layout transitions to be no-ops, as compression will be added back to the data anyway (e.g. by rendering new content to the color attachment). Performing a compression pass over the image subresource to potentially save some memory read bandwidth on the first read of each block/tile would likely be a net loss anyway.

As time passed and new hardware generations arrived, DCC came with fewer and fewer compromises and limitations. Nowadays, even shader writes to DCC compressed images are supported, which allows most image layouts, even VK_IMAGE_LAYOUT_GENERAL, to retain full DCC without image layout transitions having to decompress anything. At least, in theory. In practice, DCC and similar techniques come in many shapes and forms across different GPU vendors and architecture generations, each with their own quirks and compromises, so your mileage may vary.

Hierarchical depth buffer

Depth images can also benefit from plain-old “fast clear” compression, using DCC, or other lossless compression schemes tailored specifically for depth images, but the most interesting depth compression technique is hierarchical depth (or Hi-Z). This is a technique that is almost as old as fast clears, yet it is still very much prevalent. The core idea is that we use a separate meta-surface plane to store for each texel block/tile the minimum and/or maximum depth value within that texel block/tile. This is done by the ROP during the depth write process, therefore it comes with minimal added complexity or memory bandwidth requirements.

Having information about the per block/tile minimum and/or maximum depth values allows the ROP to trivially reject entire blocks/tiles during depth testing if the current depth comparison mode and the incoming triangle’s plane equation indicate so. E.g. if the depth comparison mode is “less than”, but the incoming triangle’s plane equation indicates that for a given block/tile the triangle’s pixels will all have greater depth values than the current per block/tile maximum stored in the Hi-Z meta-surface plane, then no pixels of the triangle will pass the depth test, so the entire block/tile of pixels can be dropped without having to perform any per-pixel operation.

Terrain rendered into a 128×96 16-bit depth buffer (left, total size: 24 kilobytes) with a Hi-Z meta-surface storing 16-bit minimum depth values for each 8×8 block/tile (right, total size 384 bytes).
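The size figures in the caption and the coarse rejection test for the “less than” comparison mode can be written down in a few lines. This is only a sketch with made-up function names. It assumes the Hi-Z plane stores a per-tile maximum (as in the “less than” example in the text, whereas the figure stores minimums) and that the smallest depth the triangle can produce inside the tile has already been derived from its plane equation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Size in bytes of the main depth plane.
inline std::size_t depthPlaneBytes(int width, int height, int bytesPerTexel) {
    return static_cast<std::size_t>(width) * height * bytesPerTexel;
}

// Size in bytes of a Hi-Z meta-surface with one entry per tile x tile block.
inline std::size_t hiZPlaneBytes(int width, int height, int tile, int bytesPerEntry) {
    assert(width % tile == 0 && height % tile == 0);
    return static_cast<std::size_t>(width / tile) * (height / tile) * bytesPerEntry;
}

// Coarse test for the "less than" depth comparison: a fragment passes only if
// its depth is smaller than the stored depth. If even the smallest depth the
// triangle can have inside the tile is not below the tile's stored maximum,
// no fragment of the triangle can pass anywhere in the tile.
inline bool canRejectTileLess(std::uint16_t tileMaxDepth,
                              std::uint16_t triangleMinDepthInTile) {
    return triangleMinDepthInTile >= tileMaxDepth;
}
```

For the 128×96 16-bit depth buffer with 8×8 tiles and 16-bit entries, the two size helpers return 24576 bytes (24 kilobytes) and 384 bytes respectively.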

The interesting thing about Hi-Z is that, while it can be considered a compression scheme, it doesn’t actually require decompression as the main image plane always contains the full, uncompressed representation of the depth image, unless other compression schemes are also applied. Therefore, an image layout transition from VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL can be a no-op, although, in practice, it will rarely be a no-op, as usually other depth compression schemes are used in conjunction.

However, what happens when an image layout transition is performed e.g. from VK_IMAGE_LAYOUT_GENERAL to VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL? In this case it is possible that some other, non-attachment use of the depth image updated some of the depth values, but not the values in the Hi-Z meta-surface plane (as Hi-Z is not relevant anywhere else but during depth testing, it’s unlikely that shader or DMA writes, for example, will also update the Hi-Z values). This is a problem, because the stale values in the Hi-Z meta-surface plane could cause misbehaviors during depth testing.

While it’s also a possibility to disable Hi-Z altogether if the depth image may be written through other means, what usually happens is that the layout transition back to VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL “resummarizes” the Hi-Z meta-surface plane’s content, effectively bringing it up-to-date with the main image plane data.
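Conceptually, resummarization is just a reduction over the main image plane. The following sketch recomputes a per-tile maximum plane on the CPU. The function name and the choice of storing only the maximum are assumptions made for illustration, as real implementations run this on the GPU and in vendor-specific formats.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Rebuilds the per-tile maximum depth from the main depth plane.
inline std::vector<std::uint16_t> resummarizeMax(const std::vector<std::uint16_t>& depth,
                                                 int width, int height, int tile) {
    assert(width % tile == 0 && height % tile == 0);
    assert(depth.size() == static_cast<std::size_t>(width) * height);
    const int tilesX = width / tile;
    const int tilesY = height / tile;
    std::vector<std::uint16_t> hiZ(static_cast<std::size_t>(tilesX) * tilesY, 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Each texel contributes to the entry of the tile that contains it.
            std::uint16_t& entry = hiZ[static_cast<std::size_t>(y / tile) * tilesX + x / tile];
            entry = std::max(entry, depth[static_cast<std::size_t>(y) * width + x]);
        }
    }
    return hiZ;
}
```

If some other access, such as a shader write, raises a depth value in the main plane without touching the Hi-Z plane, the stored maximum for that tile becomes too small and a “less than” coarse test could reject fragments that should have passed. Calling resummarizeMax again brings the plane back in sync.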

In summary, it can be observed that the effect of an image layout transition may be any of the following, depending on the image itself, the old and new image layouts specified, and the target GPU architecture:

No-op, if the effective hardware image layout corresponding to the old and the new API image layouts match on the specific GPU and driver
No-op, if the new effective hardware image layout does not support the compression scheme and the old one does, but the data in the main image plane itself is not compressed (e.g. Hi-Z)
Decompression, if the new effective hardware image layout does not support the compression scheme but the old one does, and the data in the main image plane itself is compressed (e.g. DCC)
No-op, if the new effective hardware image layout supports the compression scheme and the old one doesn’t, and the opposite transition would result in a decompression (e.g. DCC)
Resummarization, if the new effective hardware image layout supports the compression scheme and the old one doesn’t, and the opposite transition would result in a no-op (e.g. Hi-Z)
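The cases above can be condensed into a small decision function for a single dual-state compression scheme. This is merely a restatement of the list in code, with hypothetical names, and it simplifies “the effective hardware image layouts match” to “both layouts agree on whether the scheme is supported”.

```cpp
#include <cassert>

enum class TransitionEffect { NoOp, Decompress, Resummarize };

// oldSupportsScheme / newSupportsScheme: whether the effective hardware image
// layout before / after the transition supports the compression scheme.
// mainPlaneCompressed: true for schemes that store compressed data in the main
// image plane (e.g. DCC), false for schemes that only keep metadata next to an
// uncompressed main plane (e.g. Hi-Z).
inline TransitionEffect classifyTransition(bool oldSupportsScheme,
                                           bool newSupportsScheme,
                                           bool mainPlaneCompressed) {
    if (oldSupportsScheme == newSupportsScheme) {
        return TransitionEffect::NoOp;  // nothing changes for this scheme
    }
    if (oldSupportsScheme) {
        // Leaving a layout that supports the scheme.
        return mainPlaneCompressed ? TransitionEffect::Decompress : TransitionEffect::NoOp;
    }
    // Entering a layout that supports the scheme.
    return mainPlaneCompressed ? TransitionEffect::NoOp : TransitionEffect::Resummarize;
}
```

For example, classifyTransition(true, false, true) models leaving a DCC-capable layout and yields Decompress, while classifyTransition(false, true, false) models returning to a Hi-Z-capable layout and yields Resummarize.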

Nonetheless, this only considers dual-state compression schemes (the surface is either compressed or decompressed), whereas some compression schemes can have multiple states/levels (e.g. some multisampled image compression schemes are tri-state), and multiple compression schemes may be used in conjunction. So, in practice, image layout transitions can result in combinations of any of the operations listed above.

When you consider all of that, and that the set of effective hardware image layouts and the types of accesses they support can vary across GPUs, even across ones from the same vendor, it is easy to see how much of the actual complexity and implementation divergence is abstracted away by image layouts such that the application does not have to worry about these differences, while also having explicit control over when and what type of image layout transitions should be performed.

Undefined is good

VK_IMAGE_LAYOUT_UNDEFINED deserves its own section, because it enables so many optimization opportunities that even VK_KHR_unified_image_layouts did not attempt to eliminate. This image layout effectively says “I don’t know what layout the image subresource is in but I also don’t care”. This opens up the door for driver implementations to transition the layout of the image subresource without any care for what the main image plane or meta-surface planes may contain.

For example, transitioning from VK_IMAGE_LAYOUT_UNDEFINED to VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL may be implemented by simply performing a fast clear that effectively moves the image subresource into a well-defined physical layout at a fraction of the memory bandwidth cost of having to touch the main image plane. The same is true in case of transitioning a depth/stencil image to VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL. In this case, having specified an old image layout of VK_IMAGE_LAYOUT_UNDEFINED, the driver also doesn’t have to worry about resummarizing the Hi-Z meta-surface, avoiding further performance overhead.

VK_IMAGE_LAYOUT_UNDEFINED can also be beneficial for TBR GPUs, or other architectures utilizing various types of on-chip storage, as it is sort of the analog of the VK_ATTACHMENT_LOAD_OP_DONT_CARE value used to indicate that a TBR GPU does not have to load the current data of an attachment image, as the current (old) content in it will not be used. In fact, it might even have a positive effect on the cost of cache control operations in certain circumstances on a sufficiently sophisticated GPU and driver combination.

Tracking image layouts

Before going into why Vulkan applications don’t necessarily have to track the image layouts of individual image subresources, which is often cited as the most annoying and complex problem that the existence of image layouts in Vulkan imposes, let’s first ask the more obvious question…

Why do Vulkan image layouts even exist when earlier APIs like OpenGL did not have them, even though the underlying GPUs still had the same problem with compression schemes to deal with?

The short answer is fairly simple: it was all hidden and managed by the drivers. Drivers effectively had to do that dreaded image layout tracking and automatically insert the appropriate image layout transitions (decompression, resummarization, etc.) when necessary. This was not trivial to do, all the more so because the driver did not have the contextual information about the application that its author had: it only saw the incoming raw API call stream.

This didn’t only involve a lot of complex driver logic that everybody had to pay the run-time cost of, but also came with lost optimization opportunities because the driver could not always guess what would be the best thing to do with limited information. For example, one could think that when a texture is bound in OpenGL, the driver could simply perform the equivalent of a Vulkan image layout transition on the texture to a shader readable effective hardware image layout. But in practice most applications bound textures and other objects only to then bind a different one later before doing any draw calls. This could not only result in unnecessary (and quite costly) decompression operations, but the effects of an unnecessary decompression could also impose a significant performance overhead if subsequent operations could have taken advantage of the memory bandwidth savings of the compression.

Therefore most drivers actually delayed the layout transition up until the last moment, e.g. when a draw was actually issued that is now known to need the image to be decompressed. While also contributing significantly to the so-called “draw-time overhead” the old graphics APIs were famous for, delaying these operations up until the last moment also limited concurrency potential, effectively resulting in longer overall run time.

But even if that were a cost that developers would be willing to pay in exchange for not having to deal with image layouts, the reality is that with today’s GPU-driven and “bindless” workloads it is not even possible for the driver to figure out which resources are going to be used by a draw or dispatch. Therefore image layouts in the API are not only an opportunity to better control the compression schemes and schedule image layout transitions, but a necessity. That is, if we accept our assertion that there will always be hardware units out there that cannot understand all the compression schemes used across the entire GPU or the entire system.

Some Vulkan developers will not be able to avoid tracking image layouts. For example, developers working on drivers emulating legacy APIs on top of Vulkan, such as Zink or ANGLE, will inevitably have to deal with this, at least to some extent, as it is their duty to add back the “driver magic” of a legacy API driver on top of Vulkan, with all the internal complexity and performance cost that comes with it. But if we exclude these types of middleware, there is rarely ever a good reason for a Vulkan application to track the image layout of image subresources, hence the premise that Vulkan image layouts are “the problem” is moot. Let’s see why…

Not tracking image layouts

Tracking the current image layout of subresources is really unnecessary in usual application code, except maybe in very niche situations. An application uses images for specific purposes, therefore it has high-level knowledge about what image layout its subresources should be in at any given time, not through tracking, but inherently from their purpose and expected usage pattern.

In particular, in most cases there is a clear “default image layout” that the application can assume the image subresources are in. Having that knowledge, the application can simply transition the layout of specific image subresources to other image layouts than the default one temporarily, as needed, without any tracking. We will go through a couple of examples to show how this can work in practice.
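One possible way to express the “default layout plus temporary excursion” idea in code is a small scope guard. The sketch below does not use the Vulkan API at all: Layout and Transition are hypothetical stand-ins for VkImageLayout and for the oldLayout/newLayout pair of an image memory barrier, and pushing into a vector stands in for recording a pipeline barrier into a command buffer.

```cpp
#include <cassert>
#include <vector>

// Stand-in for VkImageLayout, reduced to what the sketch needs.
enum class Layout { Undefined, General, ShaderReadOnly, ColorAttachment, DepthStencilAttachment };

// Stand-in for the oldLayout/newLayout pair of an image memory barrier.
struct Transition {
    Layout oldLayout;
    Layout newLayout;
};

// Records a transition away from the default layout on construction and
// the transition back to the default layout on destruction. Nothing is
// tracked: both layouts are known from the purpose of the image.
class TemporaryLayout {
public:
    TemporaryLayout(std::vector<Transition>& recorded, Layout defaultLayout, Layout temporaryLayout)
        : recorded_(recorded), defaultLayout_(defaultLayout), temporaryLayout_(temporaryLayout) {
        assert(defaultLayout != temporaryLayout);
        recorded_.push_back({defaultLayout_, temporaryLayout_});
    }
    ~TemporaryLayout() { recorded_.push_back({temporaryLayout_, defaultLayout_}); }

    TemporaryLayout(const TemporaryLayout&) = delete;
    TemporaryLayout& operator=(const TemporaryLayout&) = delete;

private:
    std::vector<Transition>& recorded_;
    Layout defaultLayout_;
    Layout temporaryLayout_;
};
```

Code that needs a color attachment as a texture for a while would create a TemporaryLayout with ColorAttachment as the default and ShaderReadOnly as the temporary layout, record its passes, and let the guard go out of scope. In a real renderer the two transitions would of course also carry stage and access masks, which are omitted here.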

Plain-old textures

When an image is used for plain-old texturing, it is reasonable to assume everywhere in the application code that the current image layout of all its subresources is VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL. Sure, that won’t be the case initially, as all images are created with the VK_IMAGE_LAYOUT_UNDEFINED layout (we will not consider the VK_IMAGE_LAYOUT_PREINITIALIZED layout, as that is an uncommon case, but the same applies there). Before being able to texture from the image, the application will have to upload the texture contents to it anyway, which typically involves an image layout transition to VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, a copy to it, then a transition to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL. Once there, there really isn’t a need to perform further image layout transitions on the image, therefore it’s reasonable to assume this default image layout anywhere in the code.

The situation is somewhat more complicated if per-subresource texture streaming is used, or if full-on virtual texturing is used through sparse images. However, even in that case, the application only needs to track whether the specific subresource and/or virtual texture tile is loaded or not, which it has to do anyway, but the image layouts can still be assumed without tracking: if the subresource is not loaded yet (or has been evicted), then its image layout is assumed to be VK_IMAGE_LAYOUT_UNDEFINED. When it is loaded, the same process happens as in case of regular texture uploading. If already loaded, the layout can be assumed to be VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL.
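In code, this means the residency flag the streaming system keeps anyway is enough to derive the layout. A minimal sketch, with hypothetical types standing in for the Vulkan ones and one flag per mip level for simplicity:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-ins for the three VkImageLayout values involved in texture streaming.
enum class Layout { Undefined, TransferDst, ShaderReadOnly };

struct Transition {
    Layout oldLayout;
    Layout newLayout;
};

class StreamedTexture {
public:
    explicit StreamedTexture(std::size_t mipCount) : loaded_(mipCount, false) {}

    // The layout is derived from residency; it is never stored on its own.
    Layout assumedLayout(std::size_t mip) const {
        assert(mip < loaded_.size());
        return loaded_[mip] ? Layout::ShaderReadOnly : Layout::Undefined;
    }

    // Marks the mip level as loaded and returns the two transitions that
    // bracket the upload copy, exactly as for a regular texture upload.
    std::vector<Transition> load(std::size_t mip) {
        assert(mip < loaded_.size());
        assert(!loaded_[mip]);
        loaded_[mip] = true;
        return {{Layout::Undefined, Layout::TransferDst},
                {Layout::TransferDst, Layout::ShaderReadOnly}};
    }

    // After eviction the contents are gone, so the layout is assumed undefined again.
    void evict(std::size_t mip) {
        assert(mip < loaded_.size());
        loaded_[mip] = false;
    }

private:
    std::vector<bool> loaded_;  // residency per mip level, needed for streaming anyway
};
```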

Framebuffer attachments

For simple color attachments used during rendering, once again, a default image layout can be assumed: VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL. If its content does not need to be preserved (quite typical), you’d anyway start your rendering with an image layout transition from VK_IMAGE_LAYOUT_UNDEFINED to VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL (no matter whether you use dynamic rendering or render passes), but even if you need to preserve the content, you can assume it was left off in the VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL image layout. No image layout tracking is necessary. The same applies to depth/stencil attachments, except that the default image layout would be VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL.

Things get somewhat more complicated when post-processing is involved and you need to temporarily use the framebuffer attachments as textures (or input attachments) in subsequent rendering passes, but tracking should not be necessary here either. Simply transition the attachment image subresources to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL for the duration of the post-processing passes needing the image as input. Whether you have to transition the image layout back also depends on whether the attachment contents have to be preserved across frames. In most cases it won’t be necessary as the next frame will anyway start with an image layout transition from VK_IMAGE_LAYOUT_UNDEFINED. The application should already have knowledge about which post-processing steps need any attachment as input, and any other policy on where and how long the attachment contents can be accessed.

Sure, things may be less trivial in some cases, such as when a depth/stencil attachment is used as an input texture for post-processing while also being used for depth testing, and therefore one has to use the more exotic image layouts such as VK_IMAGE_LAYOUT_DEPTH_STENCIL_READ_ONLY_OPTIMAL, but the principle remains valid: the default image layout remains the attachment optimal one, except for the duration of operations intending to use the attachment image otherwise.

Shadow maps, reflection maps, etc.

These types of images are not that different from plain-old textures either. By default, the application will assume it can use them when applying shadows/reflections/etc. therefore it’s reasonable to assume them to always be in the VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL default image layout. This only needs to change for the duration of the render-to-texture pass initializing or updating them, which is a trivial temporary and local transition to one of the attachment optimal layouts for the duration of rendering to them.

These types of use cases also highlight the real power of Vulkan image layouts. Thanks to the explicit control, the application can choose which image subresources (individual mip levels and layers) will be updated and only transition those. In traditional APIs, drivers would often have to do a lot of guesswork in this process that usually manifested in unexpected performance pitfalls.

Bonus use case

We can see that some developers will still have a skeptical view of the examples we have shown here due to their simplicity, so let’s look at a final, more complicated use case: depth buffer used for hierarchical-Z map based occlusion culling while also performing additional rendering with the depth attachment, both read-only and read-write.

In this case we, once again, work with an image used as depth attachment, but one that has mip levels. The base level is generally used as a depth attachment, therefore its default image layout remains VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL, but all other mip levels, which are normally used only as input in the occlusion culling passes, will have a default image layout of VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL.

After the initial render pass, we have to re-build the mip chain. This can be done by transitioning the base level temporarily to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL, and mip level 1 to VK_IMAGE_LAYOUT_GENERAL, in order to update the latter. If there are further rendering passes that want to use the depth attachment for depth testing, then VK_IMAGE_LAYOUT_DEPTH_STENCIL_READ_ONLY_OPTIMAL can be used for the base level instead, and such rendering passes can even be performed concurrently with the downsampling.

If subsequent rendering passes need to use the depth attachment with depth writes enabled, then obviously that can only happen before or after the downsampling into mip level 1, as otherwise one might encounter data race issues. But such rendering passes can still execute concurrently with the updates of subsequent mip levels of the hierarchical-Z map.

After mip level 1 is populated, it can be transitioned back to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL, and each subsequent mip level can be updated by following the steps below:

Transition the mip level from VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL to VK_IMAGE_LAYOUT_GENERAL
Update the mip level
Transition the mip level from VK_IMAGE_LAYOUT_GENERAL back to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
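The whole mip chain update can be written as a fixed plan that needs no knowledge of any tracked state. The sketch below only produces the ordered list of steps, using hypothetical stand-in types instead of the Vulkan API. Returning the base level to the attachment layout right after mip level 1 is populated is an assumption of this sketch: the text leaves the exact point open, and an application with read-only depth passes would use VK_IMAGE_LAYOUT_DEPTH_STENCIL_READ_ONLY_OPTIMAL for the base level instead.

```cpp
#include <cassert>
#include <vector>

// Stand-ins for the VkImageLayout values used by the mip chain update.
enum class Layout { DepthStencilAttachment, ShaderReadOnly, General };
enum class StepKind { Transition, Downsample };

struct Step {
    StepKind kind;
    int mip;           // mip level transitioned, or destination mip of the downsample
    Layout oldLayout;  // only meaningful for Transition steps
    Layout newLayout;  // only meaningful for Transition steps
};

// Builds the step list from the default layouts alone: base level in the
// attachment layout, all other levels in the shader read-only layout.
inline std::vector<Step> planMipChainUpdate(int mipCount) {
    assert(mipCount >= 2);
    std::vector<Step> steps;
    // Make the base level readable for the first downsampling pass.
    steps.push_back({StepKind::Transition, 0, Layout::DepthStencilAttachment, Layout::ShaderReadOnly});
    for (int mip = 1; mip < mipCount; ++mip) {
        steps.push_back({StepKind::Transition, mip, Layout::ShaderReadOnly, Layout::General});
        steps.push_back({StepKind::Downsample, mip, Layout::General, Layout::General});
        steps.push_back({StepKind::Transition, mip, Layout::General, Layout::ShaderReadOnly});
        if (mip == 1) {
            // The base level is no longer needed as input: back to its default layout.
            steps.push_back({StepKind::Transition, 0, Layout::ShaderReadOnly, Layout::DepthStencilAttachment});
        }
    }
    return steps;
}
```

For a depth image with three mip levels this yields eight steps. Every oldLayout in the plan comes from the known default layout of the subresource, not from a lookup in some per-subresource state table.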

Of course, there are further optimization opportunities for such mip chain update processes which fall outside the scope of this article, but the bottom line is that all of this can be done without ever needing to track image layouts anywhere in the application: all current image layouts are simply assumed, and follow from the high-level application algorithm itself.

Schematic flow of the hierarchical depth map use case described in the section.

It might be a great exercise for the reader to implement the above with Vulkan image layouts and compare its performance with that of an implementation using a legacy API or an implementation relying on the promises of VK_KHR_unified_image_layouts and using VK_IMAGE_LAYOUT_GENERAL everywhere. If that is still not convincing enough, then we are not sure what would be.

Sharing modes and queue family ownership transfers

We left sharing modes to the end because they have an overarching effect on everything we discussed so far. Vulkan provides two sharing modes that the application can choose from when creating image or buffer resources:

VK_SHARING_MODE_EXCLUSIVE – ranges of buffers or subresources of images created with this sharing mode can only be used by queues of the same queue family at any given time, and queue family ownership transfer barriers need to be used to transfer exclusive ownership of a buffer range or image subresource from one queue family to another
VK_SHARING_MODE_CONCURRENT – resources created with this sharing mode can be used by queues of any queue family that the resource was created with, requiring no queue family ownership transfers

After a quick look at these modes, developers may ask themselves why they would ever not use the concurrent sharing mode, considering that it sounds simpler. The answer, as usual, is performance. To understand the rationale and behavior, note that sharing modes are yet another acknowledgement of the fact that queues from different queue families can have different hardware capabilities, for example in whether they are able to interpret compressed image data or in which cache hierarchies they use.

In practice, this means that whenever VK_SHARING_MODE_CONCURRENT is used to create a buffer or image, the driver has to assume that any queue from any of the specified queue families may access the resource at any given time. This often results in the driver having to make more conservative compression scheme and cache flushing/invalidation choices, potentially incurring significant performance overhead for all accesses to the resource.

Using VK_SHARING_MODE_CONCURRENT may not have much of an impact if the only queue families the resource is shared across are the compute and graphics queue families often found on devices. These usually have the same (or very similar) capabilities, barring the lack of graphics-related functionality available on compute-only queues. However, the same may not be true when a transfer-only queue family is included that may be implemented using some sort of DMA engine. Such DMA engines often do not understand any of the graphics-specific image compression schemes and they may not use the same cache hierarchy, if any at all. The same is often true for other non-graphics/compute queue families such as the video decode and encode queue families.

For example, we saw earlier that, subject to the used image usage flags and GPU generation, AMD’s DCC image compression scheme can be in effect even if an image subresource is in the VK_IMAGE_LAYOUT_GENERAL layout. However, that may not be the case if the image was created with VK_SHARING_MODE_CONCURRENT and the specified queue families include non-graphics/compute queue families. The situation may even be worse for more graphics-centric compression schemes such as fast clears or Hi-Z, as those are typically not relevant or usable even on a compute-only queue.

This also reveals that even for a given GPU the mapping between API image layouts and effective hardware image layouts is subject to the sharing mode the image was created with, the set of queue families the image is shared across when VK_SHARING_MODE_CONCURRENT is used, or the current queue family having ownership of the image subresource when VK_SHARING_MODE_EXCLUSIVE is used. This is why image layouts do not even attempt to refer to specific effective hardware image layouts.

Therefore, it may not come as a surprise that we advise developers to always use VK_SHARING_MODE_EXCLUSIVE if they seek ultimate performance, although in very specific scenarios even VK_SHARING_MODE_CONCURRENT may come without compromises. The only thing one has to worry about when using VK_SHARING_MODE_EXCLUSIVE is to make sure that the right queue family ownership transfers are performed at the boundaries of sharing resources between queue families.

Of course, it’s possible that on some implementations, with some combination of image creation parameters and some specific image layouts, using VK_SHARING_MODE_CONCURRENT may not have performance consequences. Even in such cases, however, it should be fine to use VK_SHARING_MODE_EXCLUSIVE, as the only downside is the minimal CPU cost of issuing queue family ownership transfers which, in such cases, would result in a no-op anyway. Therefore the only good reason to use VK_SHARING_MODE_CONCURRENT is when the application actually needs concurrent access to the same buffer range or image subresource in queues created from different queue families (concurrent access from queues of the same queue family is allowed by VK_SHARING_MODE_EXCLUSIVE).

A queue family ownership transfer can be performed as part of a pipeline barrier by specifying different values in the srcQueueFamilyIndex and dstQueueFamilyIndex members of buffer and/or image memory barrier structures. The former specifies the index of the queue family to release ownership from, while the latter specifies the index of the queue family to acquire ownership for.

Interestingly, pipeline barriers containing queue family ownership transfers need to be issued on queues of both the source and the destination queue family. These correspond to the “release” and “acquire” parts of the ownership transfer operation. While the ownership transfer is issued on both sides, typically at least one of them ends up being a no-op. Why?

Let’s take the example of a DCC compressed image subresource. If ownership is transferred from e.g. the graphics queue family to a queue family that does not “understand” DCC, the “release” part of the ownership transfer operation will result in a DCC decompress operation to ensure that the target queue family will be able to read the data in the image subresource and the “acquire” part will be a no-op. It’s easy to see that the necessary decompression could not be performed on the “acquire” side of the pipeline barrier pair, as the destination queue family is not able to interpret DCC compressed data, let alone decompress it.

Illustration of a queue family ownership transfer from a “more capable” queue family to a “less capable” one.
In this example, an image with DCC is handed off to a queue family that is not able to interpret DCC data, therefore the “release” side of the pipeline barrier pair performing the queue family ownership transfer is responsible for performing a DCC decompress.

Another example that is worth looking at is a depth image subresource with a Hi-Z meta-surface plane. Imagine the depth image subresource was updated on the transfer-only queue and now ownership has to be transferred to the graphics queue to use the image as a depth attachment. As we discussed earlier, this may require resummarizing the data in the Hi-Z meta-surface plane, as the main image plane data may no longer be in sync with the Hi-Z data. In this case this most likely can only be performed on the “acquire” side of the pipeline barrier pair, as the transfer-only queue is unlikely to be able to perform such a Hi-Z resummarization pass, while the “release” side will end up being a no-op.

Illustration of a queue family ownership transfer from a “less capable” queue family to a “more capable” one.
In this example, a depth image updated by a transfer-only queue is handed off to the graphics queue family that expects to use Hi-Z data, therefore the “acquire” side of the pipeline barrier pair performing the queue family ownership transfer is responsible for performing a Hi-Z resummarization.

This is why queue family ownership transfers always come in pairs, as it enables the driver to perform the necessary data transformations on the queue family that can actually perform them, without the application developer having to be aware of which queue family is the suitable one.
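From the application's point of view, the pair can be sketched as follows. Note the assumptions: ImageBarrierSketch is a simplified stand-in that mirrors only the srcQueueFamilyIndex, dstQueueFamilyIndex, oldLayout and newLayout members of the real image memory barrier structure, and ImageLayout, RecordedBarrier, OwnershipTransfer and makeOwnershipTransfer are hypothetical names introduced for illustration. Access masks, pipeline stages and the subresource range are left out entirely.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the VkImageLayout values used below (not the real Vulkan enum).
enum class ImageLayout { General, TransferDstOptimal, DepthStencilAttachmentOptimal };

// Simplified stand-in mirroring only the members of VkImageMemoryBarrier that
// describe an ownership transfer; real code fills the actual Vulkan structure.
struct ImageBarrierSketch {
    uint32_t srcQueueFamilyIndex;
    uint32_t dstQueueFamilyIndex;
    ImageLayout oldLayout;
    ImageLayout newLayout;
};

// One half of the pair, together with the queue family it has to be recorded on.
struct RecordedBarrier {
    uint32_t recordOnQueueFamily;
    ImageBarrierSketch barrier;
};

struct OwnershipTransfer {
    RecordedBarrier release;  // recorded on a queue of the source family
    RecordedBarrier acquire;  // recorded on a queue of the destination family
};

OwnershipTransfer makeOwnershipTransfer(uint32_t srcFamily, uint32_t dstFamily,
                                        ImageLayout oldLayout, ImageLayout newLayout) {
    assert(srcFamily != dstFamily);  // equal indices would not transfer ownership
    const ImageBarrierSketch barrier{srcFamily, dstFamily, oldLayout, newLayout};
    // Both halves carry the same queue family indices and the same layouts;
    // only the queue they are recorded on differs. The application must also
    // order the acquire after the release, typically with a semaphore.
    return OwnershipTransfer{{srcFamily, barrier}, {dstFamily, barrier}};
}
```

The application describes the same transfer twice and never states which side does the actual work. That decision stays with the driver, as the two examples above illustrate.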

Of course, while it is relatively rare, it’s also possible for the two queue families to both support some sort of compression or optimized layout but distinct ones. In such cases it is possible for both the “release” and “acquire” sides of the queue family ownership transfer barrier to have to perform appropriate transformations.

These examples reveal one more thing: queue family ownership transfers may result in an effective hardware image layout transition even if the API image layout remains unchanged. In particular, in the earlier DCC example, the queue family ownership transfer will result in a DCC decompression (an effective hardware image layout transition from “DCC-compressed” to “DCC-uncompressed”) even if both the old and new image layouts were VK_IMAGE_LAYOUT_GENERAL.

Beyond basic sharing mode related use cases, queue family ownership transfers play an important role in advanced applications that share resources across APIs, processes, and devices. In these cases the srcQueueFamilyIndex of the “acquire” side and the dstQueueFamilyIndex of the “release” side of the pipeline barrier pair performing the queue family ownership transfer take one of the following values:

VK_QUEUE_FAMILY_EXTERNAL – refers to a queue that is external to the current Vulkan instance (a queue of another Vulkan instance or of a non-Vulkan API, in the same or a different process) but that belongs to the same physical device or device group
VK_QUEUE_FAMILY_FOREIGN_EXT – refers to a queue belonging to another device, let that be another GPU or some other device, such as a capture card, camera, network adapter, or anything else

Conclusion

Throughout the article, we went through the key aspects of Vulkan pipeline barriers, provided numerous examples showcasing specific benefits of the Vulkan synchronization model, and attempted to disprove some of the common misconceptions about it. Admittedly, despite the extent of the article, we still couldn’t cover everything in the level of detail it deserves.

Among those misconceptions we tackled, we believe that we also managed to clarify why some of the bold claims implied by the introduction of VK_KHR_unified_image_layouts should be taken with a grain of salt. After all, that extension is a wishful promise that using VK_IMAGE_LAYOUT_GENERAL everywhere (barring the exceptions) will not come with performance compromises. While GPUs have evolved over time, image layouts remain relevant when you look at Vulkan not just as an API for graphics workloads, but an API that exposes functionalities beyond graphics, including access to other hardware components often present on graphics cards, such as video codec engines and DMA engines, and an API that can interact with other external hardware and software stacks.

Our hope is that this article not only managed to explain the underlying behavior of pipeline barriers, but also revealed the rationale behind all of their apparent complexity, enabling the reader to create a better mental model of their function. We believe that this mental model will enable developers to implement more efficient yet simpler Vulkan applications by understanding how they can leverage the high level knowledge they have about their application use cases to issue the appropriate pipeline barriers, and the image layout transitions included in them, without adding much, if any, complexity to their existing code base, or even while eliminating existing complexity from it.

The Vulkan SC Emulation driver stack

Daniel Rákos — Wed, 12 Feb 2025 13:38:24 +0000

One of the most exciting projects we had the chance to work on in 2024 was the creation of the Vulkan SC Emulation driver stack. This software stack enables all developers to prototype and develop Vulkan SC applications on consumer PC hardware running Windows or Linux without needing specialized hardware and software components that are typically required to target safety-critical systems.

The Vulkan SC 1.0 API specification has been out there since 2022 and compliant implementations are available from CoreAVI and NVIDIA on various platforms expected to be used in safety-critical environments. However, access for the general public to hardware and software platforms having native Vulkan SC support remains limited. This is why, in 2023, we proposed to the Vulkan SC Working Group the idea of creating a Vulkan SC Emulation layer that enables running Vulkan SC applications on top of regular Vulkan implementations in order to significantly improve the accessibility of Vulkan SC as an API to developers around the globe.

Admittedly, NVIDIA did provide a public Vulkan SC implementation in their JetPack SDK available for Jetson-based devices and they also extended this in 2024 to cover desktop systems by including Vulkan SC implementations in all their desktop drivers. That, however, still only covers support for a single vendor’s hardware, so the Vulkan SC Emulation driver stack remained an important next step in the evolution of the Vulkan SC Ecosystem toolchain because it widens the coverage to all PCs (or any other compatible computer) running Windows or Linux. Furthermore, the Vulkan SC Emulation driver stack enables a whole set of additional possibilities to further aid the development of Vulkan SC applications, as covered later in this article.

Nomenclature

You can see the Vulkan SC Emulation driver stack being referred to as the Vulkan SC Emulation layer here and there, but it is worth clarifying that it is not an actual layer in the traditional Vulkan (SC) sense. Rather it is a standalone ICD (installable client driver) similar to actual native Vulkan and Vulkan SC drivers, more analogous to e.g. the Mesa Zink driver that enables the emulation of OpenGL on top of Vulkan.

It is also important to make it clear that the Vulkan SC Emulation driver stack is not an officially conformant and safety certifiable Vulkan SC solution on its own, irrespective of being able to pass the Vulkan SC conformance test. Even if it were, which was never a goal, it would still rely on an underlying Vulkan implementation that is, by definition, not designed for safety-critical use. Rather, the Vulkan SC Emulation driver stack is a developer tool that aims to enable developers to prototype and develop Vulkan SC applications on consumer hardware and therefore eliminate the high barrier of entry for developers interested in learning and using the Vulkan SC API.

Like any other implementation of the Vulkan SC API, the driver stack comprises two components:

An installable client driver – the Vulkan SC Emulation ICD (vksconvk.dll / libvksconvk.so)
An offline pipeline cache compiler – the Vulkan SC Emulation PCC (pcconvk)

For those interested, the origin of the binary names is that they provide access to the Vulkan SC API (vksc) and a corresponding Pipeline Cache Compiler (pcc) on top of the Vulkan API (vk).

How it works

The Vulkan SC Emulation ICD implements all the logic necessary to execute Vulkan SC applications on top of any compatible Vulkan implementation available on the system. First, this needs the system to have a GPU and corresponding Vulkan driver that fulfills the minimum requirements to emulate Vulkan SC on them. Vulkan SC 1.0 is based on Vulkan 1.2 and requires support for the Vulkan Memory Model, so any Vulkan physical device available on the system that meets these requirements will be exposed as Vulkan SC physical devices by the Emulation ICD. As otherwise it is just another Vulkan SC ICD with a set of physical devices, applications are able to gain access to all emulated devices as well as native Vulkan SC devices (such as the ones exposed by the NVIDIA desktop Vulkan SC driver) through the Vulkan SC loader.

Vulkan SC Loader and Vulkan SC Emulation ICD behavior in the presence of multiple Vulkan ICDs and other “real” Vulkan SC ICDs.
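The eligibility rule described above (Vulkan 1.2 plus Vulkan Memory Model support) can be written down as a simple predicate. This is a sketch of the rule as stated in this article, not the Emulation ICD's actual code: DeviceSummary and canExposeAsVulkanSC are hypothetical names, and the two fields merely stand for values an application would read from VkPhysicalDeviceProperties::apiVersion and the vulkanMemoryModel feature bit. The version helpers use the same bit layout as Vulkan's packed API version numbers, and the variant field is ignored here.

```cpp
#include <cassert>
#include <cstdint>

// Same bit layout as Vulkan's packed API version: variant | major | minor | patch.
uint32_t makeApiVersion(uint32_t variant, uint32_t major, uint32_t minor, uint32_t patch) {
    assert(variant < 8u && major < 128u && minor < 1024u && patch < 4096u);
    return (variant << 29) | (major << 22) | (minor << 12) | patch;
}
uint32_t apiVersionMajor(uint32_t version) { return (version >> 22) & 0x7Fu; }
uint32_t apiVersionMinor(uint32_t version) { return (version >> 12) & 0x3FFu; }

// Hypothetical summary of the two properties that matter for the rule.
struct DeviceSummary {
    uint32_t apiVersion;     // as reported in VkPhysicalDeviceProperties::apiVersion
    bool vulkanMemoryModel;  // as reported by the vulkanMemoryModel feature bit
};

// True if the device meets the minimum requirements named in the text.
bool canExposeAsVulkanSC(const DeviceSummary& device) {
    const uint32_t major = apiVersionMajor(device.apiVersion);
    const uint32_t minor = apiVersionMinor(device.apiVersion);
    const bool atLeastVulkan12 = major > 1u || (major == 1u && minor >= 2u);
    return atLeastVulkan12 && device.vulkanMemoryModel;
}
```

A Vulkan 1.3 device with the memory model feature passes this check, while a Vulkan 1.1 device, or a Vulkan 1.2 device without the feature, does not.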

Of course, the secret sauce is the efficient and transparent translation of the incoming Vulkan SC API command stream to a corresponding set of Vulkan API commands. Fortunately, the Vulkan SC API is very close to the Vulkan API, as one of the key design goals of Vulkan SC was to be able to reuse most existing software, source code, and developer mindshare with minimal deviations required to meet safety-critical requirements, so the Emulation ICD does not need to jump through flaming hoops to make that translation happen.

Beyond just doing the bare minimum needed for correct API translation, however, the Emulation ICD provides additional Vulkan SC specific and general functionalities that further aid developers in creating robust Vulkan SC applications, some of which go beyond what one may be able to achieve even when using a native Vulkan SC driver implementation. We’ll talk about these later in this article.

The emulation pipeline cache compiler

Like any other Vulkan SC implementation, the Vulkan SC Emulation driver stack comes with an “offline” pipeline cache compiler. The role of the pipeline cache compiler in Vulkan SC is to offload pipeline cache compilation (and inherently shader compilation) into a preprocessing stage and ship Vulkan SC applications with pre-built pipeline cache binaries in order to eliminate the need for online pipeline and shader compilation, which is a key aspect of the special requirements of safety-critical systems to ensure predictability (although it may often be preferred even in case of regular Vulkan applications). This happens by invoking the pipeline cache compiler to produce the binaries from a set of shader files and pipeline JSON files, as described in detail in the Khronos Vulkan SC Overview and our last article about the Vulkan SC Ecosystem components.

Vulkan SC uses offline pipeline compilation to produce pipeline cache binaries from a set of input JSON pipeline descriptions and shader SPIR-V binaries.

The Vulkan SC Emulation PCC “cheats” a bit here in order to keep things simple and flexible…

An actual production pipeline cache compiler produces the final shader and pipeline binaries (built for a specific target hardware) as output in order to completely eliminate the need for online shader compilation. The Emulation PCC, however, currently only packages the pipeline and shader SPIR-V binaries into the pipeline cache binaries, practically only containing the “debug information”. This means that online pipeline compilation will still happen when running a Vulkan SC application against the Emulation ICD, but exactly reproducing how a Vulkan SC application would behave in a real safety-critical environment in terms of performance predictability is not entirely achievable when using consumer hardware and software anyway. There is room for more options on this front, as it is certainly possible to add support to the Emulation PCC to build and emit Vulkan pipeline cache binaries targeting specific devices exposed by the Emulation ICD, although even that may not be able to provide 100% guarantees on no online pipeline compilation due to the inherent behavior of the underlying Vulkan API used for the emulation.

Barring the drawbacks above, using this simple model for the output of the Emulation PCC has its own set of benefits. The most important one is that the produced pipeline cache binary is “portable”, i.e. it can be used with any physical device exposed by the Emulation ICD, even on a different system, because the produced binaries do not contain any hardware-specific or Vulkan implementation-specific data.

Advanced emulation features

The Emulation ICD comes with additional features that aim to help with specific aspects of Vulkan SC application development. In this section we will go through the most important ones.

Command pool memory tracking

The Emulation ICD tracks command pool memory consumption similarly to a native Vulkan SC driver and therefore reacts to the parameters specified through VkCommandPoolMemoryReservationCreateInfo. The actual command pool memory usage of individual command buffers and commands is emulated (unsurprisingly) based on calculations customizable in the source code generators of the ICD, as no such information can be extracted from the underlying Vulkan implementation.

Still, this can come in handy for testing how Vulkan SC applications handle command pool memory exhaustion, as it even triggers the corresponding faults like a native Vulkan SC driver would. This leads us to the next interesting feature…

Fault handling

The Emulation ICD is able to trigger Vulkan SC faults, an area that is typically difficult to test on native Vulkan SC implementations, even though Vulkan SC applications are expected to handle faults gracefully when possible. The Emulation ICD can and will trigger faults in various situations, some of which can be used by Vulkan SC application developers to deterministically test fault handling behavior, e.g.:

VK_FAULT_TYPE_COMMAND_BUFFER_FULL faults are reported when the application reserved command pool memory is exhausted
VK_FAULT_TYPE_PHYSICAL_DEVICE faults are reported in case any applicable command detects a VK_ERROR_DEVICE_LOST error
In some cases the ICD detects invalid API usage conditions and will report VK_FAULT_TYPE_INVALID_API_USAGE faults, although the ICD remains largely reliant on valid API usage, per the usual Vulkan (SC) expectations, and the Vulkan SC Validation Layers should be used to test for valid API usage in the application

We particularly recommend that developers test their Vulkan SC applications early with command pool memory consumption related faults, because it enables basic test coverage for both fault handling and resiliency to command pool memory exhaustion.
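To make the mechanism more tangible, here is a toy model of a command pool with a fixed reservation. Everything in it is an assumption made for illustration: FaultType and CommandPoolBudget are hypothetical names, the per-command costs are made-up numbers, and the Emulation ICD computes its own estimates internally. The only thing the sketch borrows from the API is the idea that the reservation is fixed up front (as with the commandPoolReservedSize member of VkCommandPoolMemoryReservationCreateInfo) and that running out of it is reported as a command buffer full fault.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for two of the fault types named above (not the real VkFaultType enum).
enum class FaultType { CommandBufferFull, PhysicalDevice };

// Toy model of a command pool whose memory is reserved up front and never grows.
class CommandPoolBudget {
public:
    explicit CommandPoolBudget(uint64_t reservedSize) : reserved_(reservedSize) {}

    // Charges `cost` bytes for one recorded command. If the reservation would be
    // exceeded, nothing is charged, a fault is logged, and false is returned.
    bool record(uint64_t cost) {
        assert(used_ <= reserved_);
        if (cost > reserved_ - used_) {
            faults_.push_back(FaultType::CommandBufferFull);
            return false;
        }
        used_ += cost;
        return true;
    }

    uint64_t used() const { return used_; }
    const std::vector<FaultType>& faults() const { return faults_; }

private:
    uint64_t reserved_;
    uint64_t used_ = 0;
    std::vector<FaultType> faults_;
};
```

An application test can follow the same pattern against the Emulation ICD: reserve deliberately little command pool memory, record commands until recording fails, and check that the fault handling path of the application was exercised.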

Display emulation

At the time of writing, this is the most recent feature addition to the Vulkan SC Emulation driver stack. One particular difficulty when dealing with graphical Vulkan SC applications is that the window / display system support is limited and may even be esoteric on some safety-critical platforms and Vulkan SC does not support the traditional desktop window systems (the corresponding Vulkan extensions are not available in Vulkan SC). Our best bet for on-screen rendering is therefore the VK_KHR_display extension that enables direct-to-display access in both Vulkan and Vulkan SC. While support for this extension is available on some existing driver implementations, they often require additional system configuration and operations in order to perform direct-to-display presentation on systems with a running window system / compositor.

Our goal was to maximize portability, flexibility, and ease of use, so the Emulation ICD exposes support for VK_KHR_display by emulating displays as plain Win32/X11 windows. This only requires the underlying Vulkan implementation to support the VK_KHR_win32_surface (on Windows) / VK_KHR_xcb_surface (on Linux X11 or Wayland through XWayland) extension.

The display emulation support included in the Vulkan SC Emulation driver stack is also highly configurable. The number of emulated displays (the default is 4 displays) can be changed using the VKSC_EMULATION_DISPLAYS environment variable and even the names, geometry, etc. of the emulated displays can be customized by providing a simple JSON configuration file describing them through the VKSC_EMULATION_DISPLAY_CONFIG environment variable. A sample JSON display configuration file is available in the repository.

Of course, even with all these capabilities, the display emulation support has its limitations that may not enable Vulkan SC application developers to completely implement their display / window system interactions using the Emulation driver stack, as the target safety-critical platforms may have different capabilities / requirements and may not even expose display support through the VK_KHR_display extension but rather provide other extensions and APIs for the purpose. Nonetheless, having display emulation included in the stack should still enable Vulkan SC application developers to get pictures on screen and to cover most of their window system integration implementation on readily available consumer systems.

Test drive

The Vulkan SC Emulation driver stack is open-source and available on GitHub. The project can be easily built using CMake, as described in the build instructions. While CMake can be used to perform system-wide installs on Linux, it is more typical to use a custom install prefix when building from source, e.g.:

# Configure
cmake -S  -B  -D CMAKE_INSTALL_PREFIX= -D CMAKE_BUILD_TYPE=Release -D UPDATE_DEPS=ON

# Build & Install
cmake --build  --config Release --target install

Once built, in order to enable the Vulkan SC Loader to pick up the Vulkan SC Emulation driver stack, the driver location has to be included in the library loading path of the platform (PATH environment variable on Windows, LD_LIBRARY_PATH on Linux) and the VK_DRIVER_FILES or VK_ADD_DRIVER_FILES loader environment variable has to be set to point to the location of the driver’s JSON manifest file (this should be familiar to people messing around with custom Vulkan / Vulkan SC driver implementations):

# Windows
set PATH=\bin;%PATH%
set VK_DRIVER_FILES=\bin\vksconvk.json

# Linux
export LD_LIBRARY_PATH=/lib
export VK_DRIVER_FILES=/share/vulkansc/icd.d/vksconvk.json

This configuration process and additional instructions related to the use of the Vulkan SC Emulation ICD can be found in the documentation.

The Vulkan SC version of the famous Cube Demo is now also available in the VulkanSC-Tools repository on GitHub. This builds out of the box with a portable pipeline cache binary embedded in the binary, compiled using the Emulation PCC, so executing the resulting vksccube program with the Vulkan SC Emulation driver stack configured should present you with a familiar sight with a Vulkan SC twist to it:

Vulkan SC Cube Demo running on the Linux version of the Vulkan SC Emulation driver stack.
Note: emulated display presented as an X11 window in a Wayland session.

Summary

The release of the Vulkan SC Emulation driver stack at the end of 2024 marked another milestone in the evolution of the Vulkan SC Ecosystem toolchain. The most important thing it brings to the table is that Vulkan SC application development is now accessible to developers on a wide range of consumer software and hardware environments. With the recent addition of display emulation, it is now also possible to get pretty pictures on screen without the need for special equipment or system configuration.

This stack can be of great value even for developers already using a vendor-specific Vulkan SC implementation and developer toolchain, as the nature and customizability of an emulation environment enables additional ways to exercise the API that can aid both in the development and the testing of Vulkan SC applications.

Another major benefit of the Vulkan SC Emulation driver stack is that, thanks to clever code generation, it can make new extensions promoted from Vulkan to Vulkan SC immediately available on compatible Vulkan implementations, therefore streamlining early adoption.

You can find the relevant components and related documentation in the following repositories:

VulkanSC-Emulation – home of the Vulkan SC Emulation ICD (installable client driver) and PCC (pipeline cache compiler)
VulkanSC-Tools – now also including the Vulkan SC Cube Demo

For further reading and materials, please check out our article Behind the scenes: the Vulkan SC Ecosystem, and the new official homepage of Vulkan SC.

Behind the scenes: the Vulkan SC Ecosystem

Daniel Rákos — Mon, 08 Jul 2024 15:14:57 +0000

It’s been over a year since we started working on the Vulkan SC Ecosystem. Now that the component stack has reached a high level of maturity, it seemed appropriate to write an article about the secret sauce behind the Vulkan SC Ecosystem components that enabled us to leverage the industry-proven Vulkan Ecosystem components to provide corresponding developer tooling for the safety-critical variant of the API.

Vulkan SC was released by the Khronos Group in 2022 as the first of the new generation of explicit APIs to target safety-critical systems. The Vulkan SC 1.0 specification is based on the Vulkan 1.2 API and aims to give safety-critical application developers access to, and detailed control of, the graphics and compute capabilities of modern GPUs. In order to accomplish that, Vulkan SC removes functionality from Vulkan 1.2 that is not applicable, not relevant, or otherwise not essential for safety-critical markets, and tweaks the APIs to achieve even more deterministic and robust behavior to meet safety certification standards.

The Vulkan SC Ecosystem components, such as the ICD Loader and Validation Layers, are not safety certified software components themselves, rather, they are developer tools intended to be used by application developers writing safety-critical applications using the Vulkan SC API. Building on the success of the corresponding ecosystem components available for the Vulkan API, the goal for the Vulkan SC Ecosystem is to leverage the tremendous engineering effort that went (and still goes) into those in order to create a comparably comprehensive suite of developer tools for the safety-critical variant of the API, amended with additional features specific to Vulkan SC. Reaching that goal, however, came with its own set of challenges…

The Challenges

While the Vulkan SC API is derived from the Vulkan API, it has some significant differences in order to fulfill the special requirements of safety-critical applications including, but not limited to, the following:

Device child objects are allocated from a static pool reserved at device creation time
Shader compilation is handled offline using a separate vendor-specific pipeline cache compiler (PCC)
Command pool memory is reserved up-front
Fault callbacks are added to the API to notify the application about critical faults
Features that are not needed (e.g. shader modules) or that are not a good fit for a safety-critical environment (e.g. sparse resources) were removed

Vulkan SC uses offline pipeline compilation and static memory allocation to avoid runtime memory allocation whenever possible.

Overall, it can be said that Vulkan SC is neither a subset, nor a superset of Vulkan, but rather an API that has a large overlap with Vulkan, and many of the API differences have a significant impact on how applications and developer tools are expected to behave. Still, due to the large overlap, it is of the utmost importance from a feasibility point of view for the Vulkan SC Ecosystem components to reuse as much of the Vulkan Ecosystem efforts as possible.

Another key part of the challenge, beyond the API and behavior differences, is that the Vulkan and Vulkan SC APIs are developed to maintain as much alignment as possible but do need to diverge to address market-specific requirements, and therefore the Vulkan SC Ecosystem components also need to exist as separate variants of the corresponding Vulkan Ecosystem components with little to no impact on the latter, while also retaining the ability to leverage ongoing general improvements made on the Vulkan side. The first prototypes of the Vulkan SC Ecosystem components, built before we joined the project, therefore were created by forking the corresponding Vulkan Ecosystem components and patching them with Vulkan SC specific changes (mostly through appropriate #ifdef VULKANSC preprocessor magic).

Unsurprisingly, as it usually happens with permanent forks, this approach turned out to be difficult to maintain, particularly for components enjoying a fast evolution such as the Validation Layers, due to the sheer number of places the upstream code needed to be patched and the unmanageable number of merge conflicts those cause during downstreaming.

The goal for us was clear: we needed to find a maintainable architecture that enables us to apply Vulkan SC related modifications to the upstream components without any intrusive changes to the latter. The tricky part was that we had to find a way to significantly modify behavior while touching as little as possible of the code we inherit from the Vulkan Ecosystem components, in order to avoid merge conflict churn, all while also reusing as much of the upstream code as possible. The great thing about working in a very restrictive environment, where your hands are pretty much tied behind your back, is that it stimulates creativity.

Getting Things to Build

In order to build a new, maintainable solution for the problem at hand, we decided to start fresh and work out the details from first principles. Most of the #ifdef modifications in the original prototype implementation were needed because the Vulkan API, and thus its headers, contain some differences in the set of function and type definitions compared to Vulkan SC, primarily due to the different set of extensions supported by the two APIs. Therefore being able to compile a common code base against the latter immediately required eliminating all code that depended on Vulkan definitions that do not exist in Vulkan SC using the traditional #ifdef approach.

These headers are generated from an XML registry which is logically separate for the two APIs, but since the Vulkan SC specification regularly downstreams Vulkan specification changes, and upstreams its own definitions, they now do coexist in a single XML registry file. The definitions in this common XML registry include appropriate annotations when the effective registries of the two APIs differ (e.g. for Vulkan-only definitions, Vulkan SC-only definitions, or common definitions that are used differently in the two APIs). The header generation tooling uses these annotations to decide which definitions to include (and how) in the generated headers based on whether the target API variant is Vulkan or Vulkan SC.

To solve the compilation problem against different variants of the headers, we added a new capability to the header generation tooling that enables generating combined headers for Vulkan SC that also contain Vulkan definitions that are otherwise not available for use in Vulkan SC applications. These combined headers are not official, their sole purpose is to be used by the ecosystem components, so they are automatically generated as part of the build process.

Illustration of different headers that can be generated from the vk.xml.

Using these combined headers allowed us to avoid making intrusive changes to the ecosystem component code we inherited from upstream and eliminated the need for most of the #ifdefs that existed in the original prototypes, leaving us with fairly minor and manageable deviations from upstream for most components, with the notable exception of the Validation Layers where this was only the first step to reach our goal.

Generated Code

The ecosystem components contain a fair amount of code that is automatically generated using Python scripts based on the XML registry, such as:

Boilerplate code for dispatching and, in general, dealing with API entry points
Reflection utilities to enable pretty printing the names and details of API constructs
Utilities dealing with extensions, features, structure chains, etc.

Some components go even further. In particular, as the XML registry also contains metadata related to the expected use of various API constructs, a good chunk of actual validation code of the Validation Layers (for example, the validation of so-called implicit valid usage clauses) is also generated using such scripts.

For the most part, these just work, regardless of the target API variant, as they use the common XML registry tooling to extract the definitions corresponding to the API variant in question. However, some of the fundamental differences between the Vulkan and Vulkan SC APIs also require alternative code generation for the two based on conditions that are not expressed in the XML registry. In order to deal with these differences, the code generation Python scripts have been extended with additional hooks and tools that enable modifying the code generation process depending on the target API variant. Also, as a side effect of using the combined headers, some generated code similarly needs to work based on the combined set of definitions of the two APIs (e.g. reflection utilities).

Adding Vulkan SC Validation

Building the Validation Layers for Vulkan SC is one thing, being able to maintain a fork of the Vulkan Validation Layers with Vulkan SC specific validation code added to it is another. The main challenge is to be able to add validation code specific to Vulkan SC or modify the behavior of validation code inherited from the upstream Vulkan Validation Layers without continuous maintenance burden caused by conflicting changes made in the upstream and downstream repositories.

From a maintenance point of view, the key thing to get right is how the core validation (CoreChecks class) and the state tracking (ValidationStateTracker class) code is organized relative to the baseline Vulkan Validation Layers, as that is where the majority of change conflicts could happen. The Vulkan Validation Layers use an architecture based on class inheritance, where all stateful validation layers consist of state tracking and a derived validation class specific to the particular use case. We extended this by introducing additional derived classes with Vulkan SC specific state tracking and validation, as depicted below:

Class inheritance hierarchy of the Vulkan SC Validation Layers.

NOTE: The class hierarchy depicted above has been changed in the Vulkan Validation Layers since the original implementation and will continue to evolve, but the principle remains.

With this architecture, all Vulkan SC specific validation and state tracking can be maintained completely separately (in their own classes and files), which avoids the painful troubles that come with maintaining such a downstream repository.
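To make the idea more tangible, here is a minimal sketch of how downstream-only derived classes can extend state tracking and validation without touching the upstream sources. All class and function names below are hypothetical stand-ins (the real ValidationStateTracker and CoreChecks classes are far larger and their interfaces differ), so treat it as an illustration of the principle rather than the actual layer code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for the upstream state tracker (hypothetical, heavily simplified).
class StateTrackerSketch {
  public:
    virtual ~StateTrackerSketch() = default;
    virtual void RecordCreateDevice() { log.push_back("upstream: track device state"); }
    std::vector<std::string> log;  // records which code ran, in order
};

// Stand-in for the upstream core validation class (hypothetical).
class CoreChecksSketch : public StateTrackerSketch {
  public:
    // Returns true if a validation error was reported.
    virtual bool ValidateCreateDevice() {
        log.push_back("upstream: Vulkan checks");
        return false;
    }
};

// Downstream-only class: lives in its own files, so upstream merges never touch it.
class SCCoreChecksSketch : public CoreChecksSketch {
  public:
    void RecordCreateDevice() override {
        CoreChecksSketch::RecordCreateDevice();  // reuse upstream state tracking
        log.push_back("downstream: track object reservations");
    }
    bool ValidateCreateDevice() override {
        const bool skip = CoreChecksSketch::ValidateCreateDevice();  // reuse upstream checks
        log.push_back("downstream: Vulkan SC checks");
        return skip;
    }
};

// Upstream dispatch code only ever sees the base type.
std::vector<std::string> RunDeviceCreationSketch() {
    SCCoreChecksSketch layer;
    CoreChecksSketch& upstream_view = layer;
    upstream_view.RecordCreateDevice();
    const bool skip = upstream_view.ValidateCreateDevice();
    assert(!skip);
    (void)skip;
    return layer.log;
}
```

Calling RunDeviceCreationSketch() yields a log in which each upstream step is followed by its downstream extension, even though the caller only holds a reference to the upstream type.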

In this architecture the validation layers will still perform all Vulkan-specific validation checks and some of them may not apply to Vulkan SC. Aside from the negligible performance cost, this does not cause any practical issues in most cases, as all those checks are typically behind corresponding API, extension, or feature checks that would simply not pass on Vulkan SC. However, there are a few exceptions where Vulkan SC deviates in some validation rules compared to Vulkan in subtle ways. These corner cases are handled by explicitly checking the Vulkan SC specific rules in the Vulkan SC specific validation code and using the built-in VUID filtering tools of the Validation Layers to ignore any validation errors that do not apply in Vulkan SC.

API Versions, Extensions, and Features

The Validation Layers contain a lot of code that checks the used API version, and the enabled extensions and features. The API version checks are already an issue from the perspective of Vulkan SC compatibility, as the uint32_t value used to represent an API version is encoded differently in Vulkan and Vulkan SC as the latter uses a non-zero API variant ID at bit offset 29. Furthermore, Vulkan SC 1.0 is more or less equivalent to Vulkan 1.2, barring the API variant specific differences. Modifying every piece of code in the Validation Layers to check for Vulkan SC API versions instead of Vulkan API versions is neither a small task nor is it going to be maintainable due to constant merge conflicts and the introduction of new version checks.

Instead, our solution to the problem was to replace the existing raw storage of the uint32_t values representing API versions with an APIVersion class that is customized for Vulkan SC to automatically understand and handle the mapping between Vulkan and Vulkan SC API versions in its comparison operators, making the existing checks of the API version against specific Vulkan API versions work as expected even when run in a Vulkan SC environment.
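The following is a minimal sketch of that idea, not the actual APIVersion class. The bit layout matches the one used by VK_MAKE_API_VERSION and the variant value matches VKSC_API_VARIANT, but the class name, its members, and the simplification that every Vulkan SC version maps to Vulkan 1.2 are our assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Same bit layout as VK_MAKE_API_VERSION: variant at bit 29, major at bit 22, minor at bit 12.
constexpr uint32_t MakeApiVersion(uint32_t variant, uint32_t major, uint32_t minor, uint32_t patch) {
    return (variant << 29) | (major << 22) | (minor << 12) | patch;
}

constexpr uint32_t kVulkanSCVariant = 1;  // value of VKSC_API_VARIANT
constexpr uint32_t kVulkan_1_2 = MakeApiVersion(0, 1, 2, 0);
constexpr uint32_t kVulkan_1_3 = MakeApiVersion(0, 1, 3, 0);
constexpr uint32_t kVulkanSC_1_0 = MakeApiVersion(kVulkanSCVariant, 1, 0, 0);

// Hypothetical wrapper class, not the actual APIVersion implementation.
class ApiVersionSketch {
  public:
    constexpr explicit ApiVersionSketch(uint32_t raw) : raw_(raw) {}

    constexpr uint32_t Variant() const { return raw_ >> 29; }

    // Assumption of this sketch: every Vulkan SC version maps to Vulkan 1.2,
    // which is only accurate for Vulkan SC 1.0.
    constexpr uint32_t VulkanEquivalent() const {
        return Variant() == kVulkanSCVariant ? kVulkan_1_2 : raw_;
    }

    // Lets existing checks of the form "api_version >= VK_API_VERSION_1_x" keep working.
    friend constexpr bool operator>=(ApiVersionSketch lhs, uint32_t vulkan_version) {
        return lhs.VulkanEquivalent() >= vulkan_version;
    }
    friend constexpr bool operator<(ApiVersionSketch lhs, uint32_t vulkan_version) {
        return !(lhs >= vulkan_version);
    }

  private:
    uint32_t raw_;
};

void ApiVersionSketchExample() {
    // A raw integer comparison is misleading because of the non-zero variant bits...
    assert(kVulkanSC_1_0 >= kVulkan_1_3);
    // ...while the wrapper compares the mapped Vulkan version instead.
    assert(!(ApiVersionSketch(kVulkanSC_1_0) >= kVulkan_1_3));
    assert(ApiVersionSketch(kVulkanSC_1_0) >= kVulkan_1_2);
}
```

The point of the design is that the call sites stay untouched: only the type of the stored version changes, and the comparison operators absorb the API variant differences.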

Similarly, extensions and features promoted to a core version, including implicit promotions resulting from Vulkan SC 1.0 being based on Vulkan 1.2, are all handled transparently, without any modifications to the upstream code, thanks to some clever code generation and infrastructure.

Validation Layer Tests

Even if we solve all problems on the implementation side, we cannot be sure about the correct behavior of the Validation Layers without proper test coverage. The Vulkan Validation Layers have thousands of test cases that we need to leverage in addition to adding our own Vulkan SC specific test cases. The former is a must have, otherwise we would have to redo years of test coverage effort already developed upstream and continue to do so as the API development moves forward. The original prototype of the Vulkan SC Ecosystem did not have a solution for this, so we had to come up with one.

The tricky part is that even though the test suite would build just fine using the combined headers, executing the tests against a Vulkan SC implementation would not work due to the API differences. While some of these differences, like the Vulkan SC requirement to provide object reservation information at device creation time, are fairly trivial to handle at the test framework level, there are many other API and behavior differences that would normally require non-trivial and intrusive changes to the test cases in order to make them compatible with a Vulkan SC implementation. The mere fact that Vulkan SC requires the use of an offline pipeline cache compiler, with the built pipeline cache data specified at device creation time, shows that there are fundamental issues to solve or work around before we can even think about running the validation layer tests against a Vulkan SC implementation.

We could have chosen to handle this in a similar fashion to the Vulkan SC CTS, i.e. by running the tests in a two pass process where first we capture pipeline creation parameters (including the SPIR-V shader modules), build a pipeline cache binary from those using the pipeline cache compiler, and then re-run the tests, but that approach has its own set of problems.

The more interesting question to ask ourselves first is whether it is necessary at all to run the validation layer tests against an actual Vulkan SC driver. If we think about it, the Validation Layers mostly test negative cases, i.e. when the behavior of the API is undefined because we violated some API usage rules, and even the positive tests only care about not generating validation errors when we shouldn’t. After all, we test the Validation Layers, not the Vulkan SC implementation, so it shouldn’t make a difference whether we run that against a real driver or some placeholder one. In fact, this is the approach the upstream Vulkan Validation Layers use to run the tests in GitHub Actions, as no real GPU hardware is available there. Instead, the tests run against the Mock ICD, which is a placeholder driver that (for the most part) does not do anything.

There is one particular exception when having a real driver in the stack is relevant: GPU-assisted validation (or GPU-AV). This new type of validation is an additional component available in the Vulkan Validation Layers that uses shader instrumentation, by patching the incoming SPIR-V shader modules, to perform fine-grained device-side validation, but it is not directly applicable to Vulkan SC, as Vulkan SC uses offline shader compilation.

Based on this, the choice was clear for Vulkan SC: we should just rely on Mock ICD based testing for the validation layer tests, because it should be sufficient to achieve full coverage. Furthermore, as we did not have to run the tests against real Vulkan SC implementations, we could also get away without any real pipeline cache data, so we could avoid the whole offline pipeline cache compilation problem.

Still, even if we deal with all of the basic API differences at the test framework level (such as additional input structures for certain APIs, differences in API version handling, promoted and available extensions and features), a lot of the test cases written for the Vulkan Validation Layers would still not run fine against the Vulkan SC Validation Layers. For example:

The test case may not even be applicable to Vulkan SC, as it tests a Vulkan-specific validation rule that simply does not exist in Vulkan SC (e.g. unsupported feature or pre-Vulkan 1.2 rule)
The test case may not apply to Vulkan SC because the respective validation rule has been changed in Vulkan SC compared to Vulkan
The test case may rely on some functionality that is removed in Vulkan SC (e.g. being able to free/destroy certain object types that are not destructible in a safety-critical environment)
The test case may depend on shader source code (SPIR-V) availability
The test case may be relevant, but triggers an undesired Vulkan SC-specific explicit or implicit valid usage clause by chance

Simple test case filtering would have worked for most cases, but we clearly saw the need for something more powerful that enables us to mark and potentially patch individual test cases. In order to achieve that, we introduced a new (manually triggered) test case converter tool that “transpiles” the upstream test cases written for the Vulkan Validation Layers to a patched version that, among other things, allows the following control over individual test cases:

Disable test cases that are not applicable to Vulkan SC
Disable test cases that are replaced with Vulkan SC equivalents
Disable test cases on specific platforms (such as QNX) due to compatibility issues
Mark test cases that depend on SPIR-V shader module data availability
Add custom object reservation to test cases that may use unusually large number of objects/resources
Add additional feature (or other) dependencies to test cases that have additional prerequisites in Vulkan SC

The test conversion tool is used to filter and patch upstream test cases. These test cases are augmented with Vulkan SC specific test cases

This test converter tool (which is nothing but a fairly straightforward Python script, see vksc_convert_tests.py), together with some test framework customizations, enabled us to reuse nearly all of the upstream test cases in one form or another (although, unsurprisingly, a large number of test cases remain skipped on Vulkan SC due to unsupported extension and/or feature dependencies).

This, however, isn’t the end of the story of the validation layer tests for Vulkan SC, as Vulkan SC introduced some physical device properties that have fundamental effects on the overall behavior of the API and therefore on its validation rules. That means it’s not enough to add tests for all additional Vulkan SC specific validation rules; we also need a runtime environment where we can test against “devices” with or without support for specific physical device capabilities/attributes.

As a result, we needed a way to simulate different device properties in order to achieve full coverage. Fortunately, this is not an entirely new problem. Vulkan already has a tool that allows achieving exactly that: the Vulkan Profiles Layer. This layer can take a JSON profile file describing a set of device capabilities and report those to an application instead of the actual device capabilities. We created an analogous component for Vulkan SC called the Vulkan SC Device Simulation Layer, which we use to run the validation layer tests with different device capability sets. This approach has also been adopted upstream in the meantime, so the Vulkan Validation Layers are now also tested using the Vulkan Profiles Layer with different device profiles.

Illustration of the software stack used for Vulkan SC Validation Layer testing.

It is important to note that there is nothing specific to our testing needs in the Vulkan SC Device Simulation Layer. That means application developers can use it in a similar fashion to test their Vulkan SC applications against different device capabilities as they would use the Vulkan Profiles Layer in case of Vulkan applications.

SPIR-V Dependent Validation

We’ve already alluded to the fact that some validation rules depend on the availability of SPIR-V shader module data (just think about all the rules that cross-validate shader code against API state). Support for these SPIR-V dependent validation rules is one of the more recent additions to the Vulkan SC Validation Layers.

Of course, the key to that is that we actually need to have SPIR-V shader module data available in the Validation Layers in order to be able to validate the corresponding rules. This is not the case by default, as the pipeline cache data produced by the offline pipeline cache compilers usually contain only the final vendor-specific ISA (and any other implementation-specific state related to the pipelines). However, these offline pipeline cache compilers typically also allow embedding debug information into the generated pipeline cache data. This debug information includes the JSON description of the pipeline, as well as the SPIR-V binaries of the individual shader stages.

Illustration of the typical layout of Vulkan SC 1.0 pipeline cache data.

As the Vulkan SC pipeline caches have a well-defined internal representation (aside from the implementation-specific data), as depicted above, the Vulkan SC Validation Layers are able to parse these, when available, and use them during validation, enabling the same level of SPIR-V dependent validation as in Vulkan. Just make sure you configure your offline pipeline cache compiler to emit the debug information.
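As a rough illustration of what parsing such an offset-based layout involves, the sketch below walks a pipeline index and counts the pipelines that carry embedded debug information. The structures are simplified, hypothetical stand-ins: the actual layout is defined by the Vulkan SC specification (see e.g. VkPipelineCacheHeaderVersionSafetyCriticalOne and VkPipelineCacheSafetyCriticalIndexEntry) and contains more fields, including a per-stage index. The convention that a size of zero means "not embedded" is likewise an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Simplified, hypothetical stand-ins for the real pipeline cache structures.
struct HeaderSketch {
    uint32_t pipelineIndexCount;   // number of pipeline index entries
    uint32_t pipelineIndexStride;  // bytes between consecutive entries
    uint64_t pipelineIndexOffset;  // byte offset of the first entry
};
struct PipelineEntrySketch {
    uint64_t jsonSize;     // size of the pipeline JSON, 0 if not embedded (assumption)
    uint64_t jsonOffset;
    uint64_t spirvSize;    // size of one SPIR-V binary, 0 if not embedded (assumption)
    uint64_t spirvOffset;
};

bool InRange(const std::vector<uint8_t>& data, uint64_t offset, uint64_t size) {
    return offset <= data.size() && size <= data.size() - offset;
}

// Copies a plain struct out of the byte buffer; returns false when it does not fit.
template <typename T>
bool ReadAt(const std::vector<uint8_t>& data, uint64_t offset, T& out) {
    if (!InRange(data, offset, sizeof(T))) return false;
    std::memcpy(&out, data.data() + offset, sizeof(T));
    return true;
}

// Returns the number of pipelines with both JSON and SPIR-V embedded, or -1 if malformed.
int CountPipelinesWithDebugInfo(const std::vector<uint8_t>& data) {
    HeaderSketch header{};
    if (!ReadAt(data, 0, header)) return -1;
    if (header.pipelineIndexStride < sizeof(PipelineEntrySketch)) return -1;
    int count = 0;
    for (uint32_t i = 0; i < header.pipelineIndexCount; ++i) {
        PipelineEntrySketch entry{};
        const uint64_t offset =
            header.pipelineIndexOffset + static_cast<uint64_t>(i) * header.pipelineIndexStride;
        if (!ReadAt(data, offset, entry)) return -1;
        if (entry.jsonSize == 0 || entry.spirvSize == 0) continue;  // no debug information
        if (!InRange(data, entry.jsonOffset, entry.jsonSize)) return -1;
        if (!InRange(data, entry.spirvOffset, entry.spirvSize)) return -1;
        ++count;
    }
    return count;
}

// Builds a tiny two-pipeline cache blob; only the first pipeline may carry debug information.
std::vector<uint8_t> BuildCacheSketch(bool with_debug_info) {
    const HeaderSketch header{2, static_cast<uint32_t>(sizeof(PipelineEntrySketch)),
                              sizeof(HeaderSketch)};
    PipelineEntrySketch entries[2] = {};
    const uint64_t payload_offset = sizeof(header) + sizeof(entries);
    if (with_debug_info) entries[0] = PipelineEntrySketch{2, payload_offset, 4, payload_offset + 2};
    std::vector<uint8_t> data(payload_offset + 6, 0);  // 2 bytes of "JSON" + 4 bytes of "SPIR-V"
    assert(InRange(data, payload_offset, 6));
    std::memcpy(data.data(), &header, sizeof(header));
    std::memcpy(data.data() + sizeof(header), entries, sizeof(entries));
    return data;
}
```

With this toy data, CountPipelinesWithDebugInfo(BuildCacheSketch(true)) reports one pipeline with debug information, while a blob built without it reports none, which is the situation where SPIR-V dependent validation has to be skipped.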

With this feature included, the testing vector also had to change as we needed to make sure that the Vulkan SC Validation Layers operate correctly both with and without SPIR-V debug information. That was fairly straightforward to accomplish thanks to the test converter tool which we could use to mark test cases that depend on SPIR-V debug information to make sure they are only executed when SPIR-V debug information is actually available.

There is one more trick we applied in the test framework to be able to run the upstream SPIR-V dependent test cases against the Vulkan SC Validation Layers. The upstream test cases were written for Vulkan, therefore they use shader modules and on-demand pipeline creation. In order to translate these into Vulkan SC API calls, we had to create internal container objects for the shader modules, and create pipeline caches built from those, on demand.

Shader module emulation used to run upstream SPIR-V dependent validation layer test cases.

As shown in the diagram above, this required us to ignore valid usage clauses disallowing the creation of pipeline caches with pipeline cache data that was not specified at the time the device was created. However, we can safely do that here, because we are only testing with the Mock ICD, and the tests are not supposed to work on a real Vulkan SC driver implementation. After all, we are testing the Vulkan SC Validation Layers here, not drivers, and this trick allowed us to further increase our test coverage without having to replicate hundreds of existing upstream test cases downstream.

Summary

The Vulkan SC Ecosystem has come a long way since the first prototypes released in 2022. The Vulkan SC Validation Layers, in particular, transformed from a limited functionality prototype into a fully featured and thoroughly tested component that is ready for prime time use. Most importantly, we now have a set of efficiently maintainable ecosystem components that can be extended with additional Vulkan SC specific capabilities while also leveraging all the upstream efforts going into the Vulkan Ecosystem.

You can find the individual ecosystem components in the following repositories:

VulkanSC-Headers – official API headers and combined header generation tooling
VulkanSC-Utility-Libraries – utilities used by the ecosystem components that are also available for application use
VulkanSC-Loader – ICD loader
VulkanSC-Tools – vulkanscinfo command line tool, the mock ICD, and the device simulation layer
VulkanSC-ValidationLayers – validation layers

We are very proud that the Vulkan SC Working Group trusted us with taking on this project and gave us the opportunity to rebuild the Vulkan SC Ecosystem on a new, solid foundation, and we’re looking forward to sharing more news about the ongoing evolution of the Vulkan SC Ecosystem in the future.

SIMD in the GPU world

Daniel Rákos — Fri, 04 Feb 2022 13:55:08 +0000

Today’s high computational throughput probably would not be attainable without the application of the SIMD paradigm in modern processors in increasingly clever ways. It’s no coincidence that GPUs also gain most of their performance, die area, and efficiency benefits thanks to this instruction issue scheme. In this article we will explore a couple of examples of how GPUs may take advantage of SIMD and the implications of those on the programming model.

Before proceeding, it’s worth noting that we will not discuss processor hardware design, thus we will not delve into details of individual components within a processor core, superscalar processor architecture, issue ports, instruction-level parallelism, register files, bank conflicts, etc. Our focus will be on aspects of the various uses of the SIMD paradigm that have a direct effect on the way developers should write efficient code for such processors, and will only touch marginally on subjects beyond that. That is not to say those hardware details and many other nuances of a specific target architecture have no significant impact on the way code should be written for such devices in order to achieve optimal performance; however, such a discussion is well beyond the scope of this article.

What is SIMD?

The term comes from Flynn’s classification of computer architectures. SIMD stands for single instruction, multiple data, as opposed to SISD, i.e. single instruction, single data corresponding to the traditional von Neumann architecture. It is a parallel processing technique exploiting data-level parallelism by performing a single operation across multiple data elements simultaneously.

Illustration of a 4-lane SIMD block.

Looking at it from a different perspective, SIMD enables reusing a single instruction scheduler across multiple processing units. That allows processor designers to save significant die area and hence achieve greater computational throughput with the same number of transistors compared to traditional scalar processing cores having a one-to-one mapping between instruction schedulers and processing units.

The SIMD model is not unique to massively parallel processors like GPUs. In fact, CPUs have a long history of employing SIMD instruction sets like MMX, SSE, NEON, and AVX that can be used in addition to the traditional scalar operations provided by the CPU. While our focus will be on GPUs, we will also see a couple of examples of those.

Vector SIMD

Traditionally, 3D graphics workloads were all about vector operations and to some extent they still are:

Rendering 3D scenes requires certain linear transformations of geometric attributes like position, normal, and texture coordinates. These involve vector-matrix multiplications, which themselves consist of multiple vector-vector operations (dot products), often performed on 4-component vectors representing homogeneous coordinates
Determining the color of individual vertices and/or pixels usually involves complex lighting calculations which themselves usually consist of 3- or 4-component vector operations where the vectors represent colors (in RGB or RGBA format) or directions like the surface normal, incoming light direction, reflection direction, etc.

It is thus no surprise that GPUs used SIMD units since the early days to implement vector instructions. It is also not a coincidence that the first programmable shaders used assembly-like shading languages providing instructions operating on 4-component vectors.

The atomic unit of data in this model is a 4-component vector with floating point components. Assuming standard IEEE 754 32-bit floating point values, we get vector registers with a total width of 128 bits. This form of SIMD operating on registers with multiple components is hence often also referred to as packed-SIMD, or SWAR (SIMD within a register).

There are two instructions (or families of instructions) that are worth calling out in particular.

The first is MAD (multiply-add) or MAC (multiply-accumulate) which is available on practically all GPUs as a single instruction, as graphics and multimedia workloads are full of scale-and-bias operations. This means that on traditional 4-component vector based GPUs it takes only a single instruction to calculate 4 floating point multiplications and 4 additions, and floating point MAD/MAC is still often used as the unit for measuring the instruction throughput of GPUs.

The second is the various flavours of dot product instructions (e.g. DP4 or DP3) that calculate the scalar (or dot) product of two vectors. These, more or less, consist of MAD/MAC operations themselves, hence they are similarly “cheap” operations to perform on a vector SIMD processor. As most of the transformation and lighting calculations directly or indirectly consist of dot products, vector SIMD processors greatly benefit from single-instruction dot products from both a throughput and a latency perspective.

Illustration of a possible implementation of 4-component multiply-add (MAD, left) and dot product (DP4, right) pipes.
As processing time is dominated by the multiplications, both operations can be completed with comparable latency. Also note that the scalar result of the DP4 instruction is usually replicated across the channels of the destination vector register, by default (unless requested otherwise as we’ll see later).
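As a plain software analogy of the two instructions (not a model of any particular hardware), the sketch below spells out what a 4-component MAD and a DP4 compute, and how the same dot product decomposes into a chain of scalar multiply-adds. The type and function names are ours, chosen purely for illustration.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;

// MAD: d = a * b + c, computed independently for each of the four components.
Vec4 Mad4(const Vec4& a, const Vec4& b, const Vec4& c) {
    Vec4 d{};
    for (int i = 0; i < 4; ++i) d[i] = a[i] * b[i] + c[i];
    return d;
}

// DP4: four multiplications followed by a sum; the scalar result is
// replicated across all components of the destination by default.
Vec4 Dp4(const Vec4& a, const Vec4& b) {
    const float s = (a[0] * b[0] + a[1] * b[1]) + (a[2] * b[2] + a[3] * b[3]);
    return Vec4{{s, s, s, s}};
}

// The same dot product expressed as one multiply and three dependent scalar MADs.
float Dp4AsScalarMads(const Vec4& a, const Vec4& b) {
    float s = a[0] * b[0];
    s = a[1] * b[1] + s;
    s = a[2] * b[2] + s;
    s = a[3] * b[3] + s;
    return s;
}

void MadAndDotExample() {
    const Vec4 a{{1.0f, 2.0f, 3.0f, 4.0f}};
    const Vec4 b{{5.0f, 6.0f, 7.0f, 8.0f}};
    const Vec4 c{{1.0f, 1.0f, 1.0f, 1.0f}};
    assert((Mad4(a, b, c) == Vec4{{6.0f, 13.0f, 22.0f, 33.0f}}));
    assert(Dp4(a, b)[0] == Dp4AsScalarMads(a, b));
}
```

Note that the scalar variant forms a dependency chain, whereas the vector version can perform all four multiplications at once, which is why a dedicated dot product pipe can complete with latency comparable to a MAD.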

CPU SIMD instruction sets also use packed-SIMD technology. As an example, on x86 the SSE instruction set also enables performing operations across multiple data elements in a single instruction by interpreting the XMM registers as packed vectors with multiple components.

When it comes to vector SIMD processors, it’s worth noting two key techniques popularized by them:

Component swizzling – the ability to redirect individual components of source operands to individual processing units, and similarly redirect individual output components to destination components
Component masking – the ability to discard individual output components (or, analogously, disable individual processing units)

These enabled expressing more complex variations of the same operation by reducing the number of components to process, replicating an input or output across components, etc. Implementations typically also support special constant swizzles where the corresponding component of the operand is replaced with one of the commonly used constant values like 0.0, 1.0, 2.0, and 0.5 (potentially even more). Making all (or at least most) instructions accept custom swizzling and masking can significantly reduce the number of instructions for a given workload as it eliminates the need for most move instructions.

Illustration of swizzling (both) and masking (right) in vector instructions.
Note that in the example on the right the 3rd (Z) component of the output is masked but in effect it’s the 4th channel in the SIMD that is unused. The latter is really arbitrary and in fact different SIMD instruction sets use different ways to express masking (as we will see later).
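In software terms, swizzling and masking can be sketched as follows. This is only an illustrative model with names of our own choosing; real instruction sets encode the swizzle and the write mask as part of the instruction itself rather than as separate function calls.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec4 = std::array<float, 4>;

enum Component : uint8_t { X = 0, Y = 1, Z = 2, W = 3 };

// Swizzle: every output lane selects which source component it reads.
Vec4 Swizzle(const Vec4& src, Component c0, Component c1, Component c2, Component c3) {
    return Vec4{{src[c0], src[c1], src[c2], src[c3]}};
}

Vec4 Add4(const Vec4& a, const Vec4& b) {
    return Vec4{{a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]}};
}

// Masked write: only components whose bit is set in write_mask are updated
// (bit 0 = X, bit 1 = Y, bit 2 = Z, bit 3 = W); the rest keep their old value.
void WriteMasked(Vec4& dst, const Vec4& value, uint8_t write_mask) {
    assert(write_mask < 16);
    for (int i = 0; i < 4; ++i)
        if (write_mask & (1u << i)) dst[i] = value[i];
}

// Equivalent of the pseudo-instruction "ADD dst.xy, a.zwxy, b.xxxx".
Vec4 SwizzleMaskExample() {
    const Vec4 a{{1.0f, 2.0f, 3.0f, 4.0f}};
    const Vec4 b{{10.0f, 20.0f, 30.0f, 40.0f}};
    Vec4 dst{{0.0f, 0.0f, 0.0f, 0.0f}};
    WriteMasked(dst, Add4(Swizzle(a, Z, W, X, Y), Swizzle(b, X, X, X, X)), 0x3);
    return dst;
}
```

Here a single "instruction" broadcasts b.x to all lanes, reorders the components of a, and writes only two results, all without any separate move instructions.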

While traditional vector-based GPUs are less prevalent nowadays, packed-SIMD technology is still in use in other forms, as we will see later.

From Vector to Scalar

As GPU workloads evolved, more and more scalar operations crept into shaders, making it increasingly difficult to reach the theoretical computational throughput of traditional vector-based GPUs. As these processors were vector-oriented by design, performing scalar operations usually meant the execution of a vector instruction with all but one component masked out.

While sometimes it’s possible to combine multiple scalar operations into a single vector instruction (e.g. four independent scalar additions can be trivially merged into a single vector addition, thus utilizing all processing units), it is usually difficult to find enough independent scalar operations of the same kind. Nonetheless, when targeting vector-based GPUs or other packed-SIMD instruction sets, it’s generally highly advisable to try to vectorize calculations, as the application developer can often do a better job at that than even an optimizing compiler.

Some GPU architectures thus moved from a traditional vector-based architecture to a VLIW one. VLIW stands for very long instruction word, and processors using such an instruction set utilize complex instructions that consist of multiple operations executed in parallel.

Some VLIW based GPUs used a 3+1 structure where a single instruction encoded one operation to perform on the first three components of the vector register, and another to perform on the fourth component, acknowledging the fact that RGBA values often required separate operations to be performed on the RGB color channels compared to the alpha channel, and that for many calculations 3-component vector operations were sufficient (color-only, direction vector, or affine transformation operations) leaving the fourth processing unit available to execute e.g. a completely independent scalar operation.

Illustration of a hypothetical 3+1 VLIW processing core (left) and a sample instruction (right) combining a 3-component vector and a scalar operation.

Transcendental operations (e.g. trigonometric and logarithmic operations) and other non-trivial operations (division, square root, etc.) typically only used with scalar operands were often implemented only on the fourth processing unit, often called the transcendental unit, aligning the processor design better to the expected workload while saving precious die area.

Over time more and more complex VLIW based GPUs appeared with various widths and ever more flexible ways to specify multiple operations within a single instruction. In their most sophisticated incarnations the VLIW instruction sets allow scheduling practically any operation separately for each component.

VLIW based GPUs, hence, have an edge over traditional vector-based ones in that almost any set of operations can be merged into a single VLIW instruction covering the entire width of the processing block, as the operation itself can vary per component (or groups of components) in each instruction, not just the data.

However, those operations generally still have to be independent, i.e. no source operand of any operation may depend on the result of another operation within the same instruction. Hence, despite the best optimization efforts from the application developer and the compiler, multiple processing units may still idle from time to time over the course of a shader invocation’s execution due to data dependencies.
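The independence requirement can be illustrated with a toy model of a 4-slot VLIW instruction in which every slot reads the register state as it was before the instruction started. The instruction format, register file, and names below are entirely hypothetical and only serve to show why dependent operations end up in separate instructions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

enum class Op { Nop, Add, Mul };

// One operation slot of a hypothetical VLIW instruction.
struct Slot {
    Op op;
    std::size_t dst, src0, src1;  // scalar register indices
};

using Bundle = std::array<Slot, 4>;      // one VLIW instruction with four slots
using Registers = std::array<float, 8>;  // tiny scalar register file

// All slots read the pre-instruction register state, so a slot cannot
// observe the result produced by another slot of the same instruction.
void Execute(const Bundle& bundle, Registers& regs) {
    const Registers before = regs;
    for (const Slot& s : bundle) {
        assert(s.dst < regs.size() && s.src0 < regs.size() && s.src1 < regs.size());
        switch (s.op) {
            case Op::Add: regs[s.dst] = before[s.src0] + before[s.src1]; break;
            case Op::Mul: regs[s.dst] = before[s.src0] * before[s.src1]; break;
            case Op::Nop: break;
        }
    }
}

// Computes (r0 + r1) * r2: the multiplication depends on the addition,
// so it has to be issued in a second instruction, leaving slots idle.
float DependentOpsExample(float r0, float r1, float r2) {
    Registers regs{{r0, r1, r2, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f}};
    const Slot nop{Op::Nop, 0, 0, 0};
    Execute(Bundle{{Slot{Op::Add, 3, 0, 1}, nop, nop, nop}}, regs);
    Execute(Bundle{{Slot{Op::Mul, 4, 3, 2}, nop, nop, nop}}, regs);
    return regs[4];
}
```

Packing the Add and the Mul into the same bundle would make the Mul read the stale value of the intermediate register, which is exactly the kind of dependency a VLIW compiler has to schedule around.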

In addition, unless the particular instruction set supports addressing different registers as source or destination operands across the different operations within a VLIW instruction, additional move operations may be necessary to comply with the operand reference limitations, just like in case of traditional vector-based GPUs.

Nonetheless, one appeal of such architectures is that they can sort of operate in a mixed mode where vector and scalar operations can both be expressed in a single instruction, thus even inter-component vector math like dot products and cross products may be performed using a single instruction (although the actual time an instruction takes to complete may vary depending on the operations used).

Still, the heterogeneous instruction set of such processors means that the instruction decoder and scheduler are likely to be similarly complicated, thus limiting the die area benefits of using a single instruction scheduler across multiple processing units.

One way to alleviate this complexity is to use a simple scalar instruction set instead which is what AMD did, for example, with the introduction of the GCN instruction set architecture, that is likely the most well-known GPU ISA in the developer community to date. Of course, this also comes with some sacrifices, as in a completely scalar instruction set even a basic dot product requires multiple MAD/MAC instructions (although, once again, we ignore important details here, like how long each instruction actually takes to complete).
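To make that trade-off concrete, here is a minimal C sketch of how a 3-component dot product decomposes on a purely scalar instruction set. The function name and the instruction mnemonics in the comments are hypothetical, and the sketch says nothing about how long each instruction takes on real hardware.

```c
/* 3-component dot product as a scalar ISA would see it:
   one multiply followed by two dependent multiply-adds. */
static float dot3_scalar(const float a[3], const float b[3])
{
    float acc = a[0] * b[0];   /* MUL acc, a.x, b.x      */
    acc = a[1] * b[1] + acc;   /* MAD acc, a.y, b.y, acc */
    acc = a[2] * b[2] + acc;   /* MAD acc, a.z, b.z, acc */
    return acc;
}
```

Note that each step depends on the previous result, so the three operations could not be packed into a single VLIW instruction either.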

Throughout this paradigm shift it became gradually more important for application developers to use scalar operations in their shaders whenever possible and only keep vector math where that’s the natural granularity of computation.

But does all this mean we’re done with SIMD? Of course not, in fact we are just getting started…

SIMT

Vector processing is just one way to leverage the benefits of the SIMD paradigm. Another common way to utilize SIMD instructions, as it’s often done even on the CPU, is to perform array processing (as opposed to vector processing), as demonstrated in the example below:

#include <stddef.h>     // size_t
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_mul_ps

// SISD code to perform element-wise multiplication of two arrays
void array_mul_sisd(float* C, float* A, float* B, size_t size)
{
    for (size_t i = 0; i < size; ++i)
        C[i] = A[i] * B[i];
}

// Same algorithm using 128-bit (4-wide) SIMD array processing (x86 SSE)
// (for simplicity, we assume the alignment and size of the arrays is appropriate)
#define FLT4(X) *((__m128*)(&(X)))
void array_mul_simd4(float* C, float* A, float* B, size_t size)
{
    for (size_t i = 0; i < size; i += 4)
        FLT4(C[i]) = _mm_mul_ps(FLT4(A[i]), FLT4(B[i]));
}

As can be seen above, even though the individual work that needs to be performed on the array elements is scalar by nature, SIMD instructions can be used to process multiple array elements in parallel. This subtype of the SIMD paradigm is often called SIMT, i.e. single instruction, multiple threads. It is a misnomer, to some extent, as the “threads” we talk about here are not the independently schedulable threads of execution that we all know, but rather the threads we know from NVIDIA’s CUDA API, i.e. the individual lanes of a wave. But let’s not get ahead of ourselves…

This is really just the other side of the same coin, as we can call the above example vectorization as well, if we really want to. However, when this vectorization isn’t explicit, but rather an artifact of the programming model, then the distinction between array and vector processing becomes clear-cut.

So far we only talked about leveraging internal parallelism within a single shader invocation, utilizing the fact that many shader computations operate on vectors of various widths and even scalar operations are ofttimes independent from each other. However, the massive parallelism of GPU workloads actually stems from having to execute the same shader code across a large number of data elements (vertices, primitives, fragments, etc.).

Thus, ignoring control flow for now, which is anyway something that wasn’t available on early programmable GPUs, it is trivial to process multiple shader invocations in parallel by scaling up the width of the SIMD unit. Of course, there are practical and technological limits to how wide it’s possible or worth to make a SIMD unit, but in theory it could go as wide as the entire processor. This allows reusing a single instruction scheduler across even more processing units than in a basic vector or VLIW processor.

Illustration of a hypothetical GPU with a 4-component vector-based vertex processor and a 3+1 VLIW based fragment processor.

Going back to our GPUs with scalar instruction sets, it’s now trivial to see that the scalar nature of the instructions themselves does not prevent us from utilizing SIMD technology: just like with their vector-based or VLIW based counterparts, the instruction stream can be issued across multiple shader invocations simultaneously, or, loosely speaking, executed in lock-step.

In practice, modern GPUs usually comprise multiple sets of such SIMD processing blocks hierarchically aggregated into clusters sharing different types of caches and auxiliary hardware blocks performing fixed-function operations of the graphics pipeline.

In this model, shader invocations that are scheduled simultaneously across the processing units of one or more SIMD blocks form a subgroup, often also called a wave, wavefront, or warp, while the individual shader invocations within those are referred to as the lanes or threads of the wave.

Taking AMD’s GCN architecture as an example, while the instructions are scalar from the perspective of a single shader invocation, the instruction set actually refers to these as vector instructions, as practically they perform operations on entire waves as wide vector operations where each component belongs to a particular shader invocation. Thus the scalar nature of the instruction set should not be confused with the scalar unit available on GCN GPUs (or in some recent NVIDIA GPUs) which actually behaves more as a SISD execution unit shared across the entire wave.
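The distinction between per-lane "vector" registers and wave-wide scalar registers can be sketched in C as follows. This is only an illustrative model: the type and function names are made up for this example and only loosely echo GCN terminology, and the loop over lanes stands in for what the hardware does in parallel.

```c
#define WAVE_SIZE 64

typedef struct { float lane[WAVE_SIZE]; } vreg_t; /* one value per shader invocation */
typedef float sreg_t;                              /* one value shared by the whole wave */

/* "Vector" multiply: scalar from the point of view of a single invocation,
   but executed as one wide operation across all lanes of the wave. */
static void v_mul(vreg_t *dst, const vreg_t *a, const vreg_t *b)
{
    for (int l = 0; l < WAVE_SIZE; ++l)
        dst->lane[l] = a->lane[l] * b->lane[l];
}

/* Same operation with a wave-uniform operand coming from the scalar unit. */
static void v_mul_s(vreg_t *dst, const vreg_t *a, sreg_t s)
{
    for (int l = 0; l < WAVE_SIZE; ++l)
        dst->lane[l] = a->lane[l] * s;
}
```

A value that is known to be identical for every invocation (a constant, a uniform) only needs to be held once per wave, which is the kind of work a shared scalar unit is suited for.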

SIMD Control Flow

Over time GPUs gained more and more sophisticated support for shader control flow. However, in case of a SIMD unit that may process multiple shader invocations (or threads, if we must) it is less intuitive how control flow can be implemented. This is where another SIMD paradigm comes in handy, one that is called an associative processor in Flynn’s taxonomy.

The technique expands on the idea we already covered to some extent in our discussion about vector-based GPUs, whereby individual components of the vector operation could be masked out. There is no reason why we couldn’t apply the same principle for SIMD units that process multiple shader invocations in parallel.

Early incarnations of GPU control flow support did not have true branching support in the processor, more specifically, there were no jump instructions or anything similar available. Thus loops of any sort would get unrolled by the compiler and the shader authors had to be wary of the instruction limit of the target GPU as at this point instructions were not streamed from memory but were stored in a limited size on-chip buffer. Support for conditionals, however, arrived fairly early in the form of predicated/conditional instructions.

In a naive implementation this means that in case of an if-else block the GPU would execute both branches and then a conditional instruction (e.g. some form of CMOV) would select the results of the appropriate branch based on the value of the condition. This technique makes it possible to continue utilizing SIMD technology to execute multiple shader invocations in parallel while still allowing for the individual invocations to virtually take different branches across the control flow.

// GLSL-style high-level pseudo-code
...
if (!inShadow) {
    light = max(0.0, dot(L, N));
    color *= light;
}
...

// Corresponding hypothetical instructions
...
DP3 R1.x, L.xyz, N.xyz;
MAX R1.x, R1.x, R0.0;
MUL R1.xyz, COL.xyz, R1.xxx;
CMOV COL.xyz, COL.xyz, R1.xyz, SHD.x;
...

Obviously, this comes at a hefty cost as all branches within the shader actually need to be executed for all shader invocations, and it’s the origin of the old advice of avoiding conditionals in shaders whenever possible that far outlived the actual GPUs without true branching capabilities. Nonetheless, even in these times, the cost of control flow was still acceptable when the computation in the actual branches was fairly limited, as in the example above.

One drawback of using the naive approach above is that we are not only paying the performance cost of both branches, but also their power cost, as all shader invocations across the SIMD unit perform both sets of calculations even though each will only use the results of one of them in the end. Thus, in practice, GPUs support predicating of pretty much every instruction through some special register(s) similar to the mask registers used by CPU instruction sets like AVX-512. Hence even the early assembly-like shading languages used such an approach and thus made it possible to at least save the power cost for shader invocations not taking a particular branch.

Illustration of a hypothetical condition (predicate) mask based 16-wide SIMT GPU’s active lanes over the course of running 16 shader invocations that take different paths through the branches.
Lanes in green are active while grayed out lanes are masked out by the corresponding conditions.
Note that this is an implementation using a stack to handle nested conditionals. Stackless implementations that flatten nested conditionals are often possible, with other trade-offs. More commonly, as the nesting depth is typically known for shaders, a fixed number of backup registers can be used as the stack.
Without support for branching instructions it is not possible to skip over a branch even if none of the shader invocations in the wave would take it, as it can be seen in the case of the branch on cnd2 above.
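The stack-based handling of nested conditionals described above can be sketched in C with a 16-bit execution mask, one bit per lane. This is a simplified model under our own assumptions: the `wave_t` structure, the fixed `MAX_DEPTH`, and the function names are hypothetical and do not correspond to any particular instruction set.

```c
#include <stdint.h>

#define MAX_DEPTH 8  /* assumed maximum nesting depth */

typedef struct {
    uint16_t exec;             /* active lanes, 1 bit per lane */
    uint16_t stack[MAX_DEPTH]; /* saved masks of the enclosing blocks */
    int      depth;
} wave_t;

/* if (cond): save the current mask, keep only lanes where cond is true */
static void wave_if(wave_t *w, uint16_t cond)
{
    w->stack[w->depth++] = w->exec;
    w->exec &= cond;
}

/* else: enable the lanes that were active outside the block but failed cond */
static void wave_else(wave_t *w, uint16_t cond)
{
    w->exec = w->stack[w->depth - 1] & (uint16_t)~cond;
}

/* endif: restore the mask of the enclosing block */
static void wave_endif(wave_t *w)
{
    w->exec = w->stack[--w->depth];
}
```

Every instruction between these markers would then only commit results for lanes whose bit is set in `exec`. Note that nothing here skips instructions: even when `exec` becomes zero, a GPU without branching still has to step through the block.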

Newer GPUs then introduced actual branching instructions (jump-like or structured) that work in a similar fashion to their SISD versions. However, we must not forget that GPUs schedule and execute entire waves of shader invocations in lock-step, thus skipping over code using branching instructions is only possible if all shader invocations within the wave take the same path.

When that’s not the case we are talking about divergent waves and in such cases GPUs continue to operate like their predecessors by executing both branches of an if-else block, or worse, in case of loops it means that each shader invocation within the wave will take as many iterations as the one that takes the most. Hence, even though control flow is inexpensive on today’s GPUs, dynamically uniform control flow (as opposed to divergent) is still strongly preferred to avoid having to pay the cost of executing the instructions of multiple, shader-invocation-wise, mutually exclusive branches.

Illustration of a hypothetical branching capable 16-wide SIMT GPU’s active lanes over the course of running 16 shader invocations.
Lanes in green are active while grayed out lanes are masked out by the corresponding conditions.
Note the following:
cnd1 is a compile-time known uniform expression, i.e. it’s known from the shader code that the condition will not vary across lanes of a wave, hence only branching instructions are needed.
cnd2 is a dynamically uniform expression, i.e. it happens to be that all lanes of the wave evaluated it to the same value, hence the wave could skip the untaken branch. However, condition mask code was still necessary to be added by the compiler as dynamic uniformity is not known at compile-time.
cnd3 is a divergent expression, hence the wave will execute both sides of the branch with the appropriate predication.
If the shading language syntax allows expressing it, generation of branching code for known divergent branches, or predication code for known dynamically uniform branches can be avoided.
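As a rough way to reason about the cost of divergence, the following C sketch counts how many sides of an if-else block a 16-wide wave has to execute, and how many loop iterations the wave as a whole goes through. It is a back-of-the-envelope model with hypothetical function names, not a description of any specific hardware.

```c
#include <stdint.h>

/* Number of sides of an if-else block the wave must step through:
   1 if all active lanes agree on the condition, 2 if they diverge. */
static int branch_sides_executed(uint16_t exec, uint16_t cond)
{
    int then_needed = (exec & cond) != 0;
    int else_needed = (exec & (uint16_t)~cond) != 0;
    return then_needed + else_needed;
}

/* A wave keeps iterating a loop until its last active lane is done,
   so the wave-level trip count is the maximum over the active lanes. */
static int wave_loop_iterations(const int trip_count[16], uint16_t exec)
{
    int max = 0;
    for (int l = 0; l < 16; ++l)
        if (((exec >> l) & 1) && trip_count[l] > max)
            max = trip_count[l];
    return max;
}
```

With a single lane needing 40 iterations and the other fifteen needing one, the whole wave still occupies the SIMD unit for 40 iterations.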

Cross-Lane Operations

Analogously to how the idea of component masking expands to the SIMT model in the form of instruction predication, component swizzling also has its corresponding counterpart in the form of cross-lane operations. This time instead of swizzling the components of a vector when using them as instruction operands on a vector-based GPU, data is swizzled across the shader invocations within a wave.

This technique underlies one of the hottest shading language features of the last couple of years, as it enables significantly higher performance data sharing across shader invocations within a subgroup compared to the wider (workgroup) scope but slower data exchange through shared memory, as cross-lane operations allow shader invocations to directly reference data in the registers of other shader invocations within the wave. Implementing this seems fairly trivial, considering that the registers of all shader invocations executing on a particular SIMD unit are located in the same register file.
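A typical use of cross-lane operations is a wave-wide reduction. The C sketch below emulates a "shuffle xor" style operation, where every lane reads the register of the lane whose index differs by a given bit mask, and uses it to build a butterfly sum. The 16-lane wave size and the function names are assumptions made for this example.

```c
#define WAVE_SIZE 16

/* Emulated cross-lane read: lane l receives the value held by lane (l ^ mask). */
static void shuffle_xor(float dst[WAVE_SIZE], const float src[WAVE_SIZE], int mask)
{
    for (int l = 0; l < WAVE_SIZE; ++l)
        dst[l] = src[l ^ mask];
}

/* Butterfly reduction: after log2(WAVE_SIZE) steps every lane holds the
   sum of the values of all lanes, without touching shared memory. */
static void wave_sum(float v[WAVE_SIZE])
{
    float other[WAVE_SIZE];
    for (int mask = WAVE_SIZE / 2; mask > 0; mask >>= 1) {
        shuffle_xor(other, v, mask);
        for (int l = 0; l < WAVE_SIZE; ++l)
            v[l] += other[l];
    }
}
```

For 16 lanes this takes four exchange steps, compared to writing all values to shared memory, synchronizing, and reading them back.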

There’s More!

Processors often employ another method to increase instruction-level parallelism without actually increasing the width of the underlying SIMD block. In this scheme the instruction scheduler issues each instruction multiple times but for different sets of shader invocations and it’s often referred to as temporal SIMT, or, when using wide SIMD blocks, spatio-temporal SIMT, as instruction issue is spread both in the spatial domain (over individual lanes of a SIMD block) and the temporal domain due to the multi-cycle reissue.

Loosely speaking, this is similar to string operations using the REP prefix on an x86 CPU, although that is far from an accurate analogy. In practice, temporal SIMT on GPUs is often a bit more rigid than that, as generally instructions are reissued a hardwired number of times. As an example, AMD’s GCN architecture issues an instruction across a complete 64-wide wave of shader invocations to a 16-wide SIMD block over 4 cycles, 16 lanes each cycle. This instruction issue technique enables the possibility to hide the latency of instruction decoding and/or execution and thus provide greater instruction-level parallelism.
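Using the GCN numbers quoted above (64-wide wave, 16-wide SIMD block), the multi-cycle issue can be modeled in C as an outer loop over time and an inner loop over space. This is a simplified sketch: the function name is hypothetical, and the returned count only reflects issue cycles, not execution latency.

```c
#define WAVE_SIZE  64
#define SIMD_WIDTH 16

/* Issues one multiply-add for a whole wave and returns the number of
   issue cycles consumed. */
static int issue_mad(float d[WAVE_SIZE], const float a[WAVE_SIZE],
                     const float b[WAVE_SIZE], const float c[WAVE_SIZE])
{
    int cycles = 0;
    /* temporal dimension: one group of SIMD_WIDTH lanes per cycle */
    for (int base = 0; base < WAVE_SIZE; base += SIMD_WIDTH) {
        /* spatial dimension: these lanes run in parallel in hardware */
        for (int l = 0; l < SIMD_WIDTH; ++l)
            d[base + l] = a[base + l] * b[base + l] + c[base + l];
        ++cycles;
    }
    return cycles;
}
```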

Note, however, that this technique should not be confused with the simultaneous multithreading (SMT) technology often used by GPUs, whereby the processor schedules instructions of other waves while a wave waits for a long-latency operation like a memory read.

Temporal SIMT can also be implemented in its pure form, i.e. without an actual wide execution unit. In this case each shader invocation within a wave is issued in separate cycles which may even enable the scheduler to skip issuing the instructions of inactive shader invocations (due to predication). Such an approach could potentially eliminate the cost of divergent shader invocations, but only up to a certain extent, as it also limits the chance of hiding operation execution latency, hence also limiting effective instruction-level parallelism, and the overall time to complete the execution of a wave would increase due to removing one dimension of parallelism.

Yet another technique is to share a single instruction scheduler across multiple SIMD blocks. While this may not be self-evident, there’s a difference between issuing instructions to multiple SIMD blocks compared to a single, wider SIMD block. For example, separate SIMD blocks have separate register files, hence simple cross-lane operations cannot be used to share data across shader invocations running on separate SIMD blocks, even if they are both fed by the same instruction scheduler.

Combining temporal SIMT with a single instruction scheduler feeding multiple SIMD blocks allows a single scheduler to handle an even larger number of instructions executing in parallel. However, sometimes this may also imply that the issue granularity may be higher than the size of a single wave. On AMD’s GCN architecture, for example, this is not the case, as a single instruction of a 64-wide wave takes 4 cycles to start on a single 16-wide SIMD block and, while there are four SIMD blocks per scheduler, the scheduler can actually issue an instruction from a separate wave each cycle, hence able to send a new instruction to all four SIMD blocks across those 4 cycles until it has to go back to the first one.

Illustration of how a spatio-temporal SIMT GPU issues instructions from 32-wide waves using a single scheduler to four 8-wide SIMD execution units.
The dark colored blocks indicate the first instruction issue instances (covering the first 8 shader invocations within the wave), while the light colored blocks indicate the reissue of the same instruction for the subsequent groups of shader invocations within the same wave.
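The round-robin behavior described for GCN can be expressed as two small C helpers. This is a steady-state toy model built only from the numbers in the text (four 16-wide SIMD blocks per scheduler, 64-wide waves); the names are hypothetical and real schedulers have many more constraints.

```c
enum { NUM_SIMDS = 4, SIMD_WIDTH = 16, WAVE_SIZE = 64 };
enum { ISSUE_CYCLES = WAVE_SIZE / SIMD_WIDTH }; /* cycles to issue one wave instruction */

/* SIMD block that receives a new instruction from the scheduler on `cycle`. */
static int simd_issued_on(int cycle)
{
    return cycle % NUM_SIMDS;
}

/* First lane of the 16-lane group that `simd` processes during `cycle`,
   assuming it received an instruction on its most recent turn. */
static int lane_group_base(int simd, int cycle)
{
    int since_issue = (cycle - simd + NUM_SIMDS) % NUM_SIMDS;
    return (since_issue % ISSUE_CYCLES) * SIMD_WIDTH;
}
```

Because the number of SIMD blocks equals the number of issue cycles per wave instruction, the scheduler returns to a SIMD block exactly when that block has finished issuing its previous instruction.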

While it may have seemed that some of the techniques presented that take advantage of the SIMD paradigm are mutually exclusive, all of them can be combined. As an example, there’s no reason why a vector-based or VLIW GPU could not take advantage of the SIMT model, in fact they do, as going e.g. 4-wide with SIMD wouldn’t be sufficient to achieve the scales of computational throughput that we have seen on GPUs over the last couple of decades.

Another example of a multi-paradigm use of SIMD processing can be noted in certain SIMT based GPUs that also support multiple operand precisions (e.g. both 16-bit and 32-bit floating point operands) as this may mean that even a GPU that otherwise uses a scalar instruction set may implement lower-precision operations following the packed-SIMD paradigm, or use wider vector widths for lower-precision operations, as the register width is typically fixed by the architecture.

Illustration of mixing scalar and vector (packed-SIMD) pipes on a GPU with 32-bit registers.
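The packed-SIMD idea can be illustrated in C with two 16-bit lanes sharing one 32-bit register. To keep the sketch short and portable it uses 16-bit unsigned integers as a stand-in for 16-bit floats, which is an assumption of this example; the point is only that one register-wide operation yields two independent lower-precision results.

```c
#include <stdint.h>

/* Adds two pairs of 16-bit lanes packed into 32-bit registers.
   Each lane wraps around on its own; no carry crosses the lane boundary. */
static uint32_t packed_add_u16x2(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;          /* low lane, carry discarded */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;  /* high lane, overflow shifted out */
    return hi | lo;
}
```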

Conclusion

As we saw, GPUs can leverage the benefits of the SIMD paradigm in many interesting ways. All of those techniques, however, have various levels of effects on the way code should be written for them in order to maximize performance. Hence it’s important for developers to familiarize themselves with them. Fortunately, there is a high degree of commonality across the way how the various types of beasts prefer to be fed, so all is not lost. Nonetheless, there is always some extra performance to be found when targeting a particular hardware architecture.

It is also worth noting that, to some extent, the instruction set can be fairly orthogonal to the actual way the processor schedules individual operations for execution, let alone the actual execution itself. We presented a very simplistic view of how hardware may execute operations in a SIMD fashion. Modern superscalar processors with deep instruction pipelines are far more complex than that.

Also, the instruction set and the way how platform-independent shader code is mapped to it is generally hidden behind the compiler infrastructure provided by the hardware vendors. Hence sometimes one may come to incorrect conclusions about how a particular target hardware should be coded for simply based on some limited high-level information about the architecture.

Considering the shader programming model offered by the various graphics and compute APIs, it seems that at least the SIMT execution model is common across all GPU architectures on the market today (unsurprisingly), while other design choices like using a vector-based, VLIW, or scalar instruction set (or a combination of those) vary more across individual implementations.

In fact, it may not be uncommon for modern GPUs to be able to switch between various issue models (e.g. vector vs scalar, or different wave widths as we see on some recent architectures). It is thus possible that only the level of flexibility and granularity at which these issue models can be switched is what defines each unique architecture.

If GPU architectures were to go in such a multi-paradigm direction, then that would likely be good news for application developers, as it may allow reaching closer to optimal performance and efficiency for a wider range of algorithms.


Multisampling primer

Daniel Rákos — Tue, 19 Oct 2021 17:17:29 +0000

Multisampling is a well-understood technique used in computer graphics that enables applications to efficiently reduce geometry aliasing, yet not everybody is familiar with the entire toolset offered by modern GPU hardware to control multisampling behavior. In this article we present the behavior of basic multisampling and explore a set of controls that enable us to tune performance/quality trade-offs and open doors for more advanced rendering techniques.

While it’s quite common nowadays for renderers to use other, lower cost and often less intrusive techniques to reduce screen-space aliasing like temporal anti-aliasing or various morphological techniques, multisampling remains prevalent as a standard go-to technique to perform anti-aliasing, is often used in conjunction with the aforementioned alternatives, and sometimes (ab)used to aid more complex rendering algorithms like checkerboard rendering to achieve better perceived resolution, quality, and/or performance. Hence familiarity with the behavior and configuration parameters available in modern multisampling hardware implementations comes in handy in a wide set of rendering problems.

Naive Supersampling

In general, any sort of screen-space aliasing is the result of the very nature of evaluating rendering functions at discrete units (typically per-pixel, at pixel centers) hence producing noticeable under-sampling induced artifacts.

Supersampling improves this by increasing the sampling resolution of the entire rendering pipeline. The simplest way to implement supersampling is to render at a higher resolution than the final target (e.g. twice the width and height to achieve 4x supersampling) and then downsample the produced image to the target resolution, usually with a simple box filter. This downsampling process is called a resolve operation.

Illustration of single-sampled rendering (left) and 4x supersampled rendering with resolve (right).
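A box-filter resolve for the 4x case (twice the width and height) can be sketched in C as follows. The sketch assumes a single-channel floating point image stored row by row, and the function name is made up for this example.

```c
#include <stddef.h>

/* Resolves a supersampled image of size (2 * dst_w) x (2 * dst_h) into a
   dst_w x dst_h target by averaging each 2x2 block (box filter). */
static void resolve_box_2x2(float *dst, const float *src,
                            size_t dst_w, size_t dst_h)
{
    const size_t src_w = dst_w * 2;
    for (size_t y = 0; y < dst_h; ++y) {
        for (size_t x = 0; x < dst_w; ++x) {
            /* top-left sample of the 2x2 block belonging to pixel (x, y) */
            const float *s = src + (2 * y) * src_w + 2 * x;
            dst[y * dst_w + x] =
                0.25f * (s[0] + s[1] + s[src_w] + s[src_w + 1]);
        }
    }
}
```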

From a terminology perspective this leads us to the notion of a sample: we refer to the collection of values in the supersampled image contributing to a particular value in the final target image as a pixel, and we call the individual values within those collections the samples of the pixel.

Obviously, supersampling doesn’t completely eliminate aliasing, really nothing can in a discrete processing pipeline, however it can significantly reduce its effect. The quality improvement is roughly proportional to the number of samples evaluated per pixel, i.e. the ratio of the resolution at which the rendering pipeline was evaluated and the resolution at which the results are displayed. The main problem with naive supersampling is that the performance and memory usage cost is also increased by the same ratio.

Basic Multisampling

Multisampling is essentially a performance optimization to supersampling where certain operations within the graphics pipeline that don’t have significant (or sometimes any) effect on aliasing are allowed to operate at a reduced rate. This enables multisampling to achieve better overall performance-to-quality ratio than naive supersampling.

In order to understand where we can reduce processing rate, let’s look at the relevant graphics pipeline stages and resources:

Graphics pipeline stages relevant in the context of multisampling.
Note: fragments may correspond to whole pixels or to individual samples within a pixel.

Clearly, rasterization, depth/stencil testing, color buffer operations, and generally all fixed-function per-fragment operations, have to happen at full rate in order to actually produce an effectively higher resolution image with multiple samples per pixel that can be later resolved to the display resolution. This also implies that our depth/stencil buffer and color buffer(s) also need to be able to store such a higher resolution image, or, in practice, multiple samples per pixel. We call these multisampled images (or textures) and their effective resolution is specified through a width, height, and number of samples (with the addition of a depth and/or layer count parameter for 3D/layered textures, if applicable).

We can note here though that evaluating the fragment shader for each sample of a pixel separately doesn’t seem to provide much benefit as it’s a reasonable assumption that evaluating color values for each sample separately and then averaging those when resolving the multisampled image(s) should result in the same color value that we get if we only evaluate a single color value per pixel. Obviously, this ignores certain details that make this assumption incorrect in general, as we will see later, but often there’s little to no perceived difference between the two.

Multisampling takes advantage of this by executing only a single fragment shader invocation for each pixel a primitive overlaps with, no matter how many samples within that pixel the primitive actually covers. As fragment shading is typically the most expensive stage in the graphics pipeline back-end, this can save significant processing time compared to supersampling. In cases where the additional bandwidth cost of multisampled color and depth/stencil buffers, and the additional load on the fixed-function stages responsible for rasterization, depth/stencil testing and color buffer operations don’t saturate the corresponding hardware resources, basic multisampling induces a relatively small overhead compared to single-sampled rendering, unlike supersampling.
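The difference in fragment shading work can be estimated with a few lines of C. Given the per-pixel coverage of a primitive as one bit mask per pixel, supersampling shades every covered sample while basic multisampling shades every touched pixel once. The function names and the bit mask input format are assumptions of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Counts the set bits of a coverage mask. */
static unsigned count_bits(uint32_t v)
{
    unsigned n = 0;
    for (; v != 0; v &= v - 1)
        ++n;
    return n;
}

/* Fragment shader invocations needed for one primitive.
   per_sample != 0 models supersampling, per_sample == 0 basic multisampling. */
static unsigned shader_invocations(const uint32_t *coverage, size_t pixel_count,
                                   int per_sample)
{
    unsigned total = 0;
    for (size_t i = 0; i < pixel_count; ++i)
        total += per_sample ? count_bits(coverage[i]) : (coverage[i] != 0u);
    return total;
}
```

The gap between the two counts grows with the share of fully covered pixels, which is typically the vast majority for anything but very small primitives.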

Integrating basic multisampling into a renderer is quite straightforward as it requires little to no modifications aside from the additional resolve operation needed before presenting the final rendering. However, in practice, things get complicated when considering certain special cases discussed later, and as modern rendering workloads often involve passes rendering to intermediate framebuffers used for post-processing, shadow mapping, and other purposes, developers have to choose carefully which passes they should apply multisampling to.

Coverage

This is a new term that needs to be introduced in the context of multisampling. The coverage of a pixel with respect to a particular incoming primitive provides information about which samples within a pixel actually overlap the primitive in question. This coverage information is generally represented as a binary number called the coverage mask where the ith digit (bit) of the number contains 1 if the ith sample of the pixel is inside (is covered by) the primitive, and 0 otherwise. When a sample’s corresponding coverage mask bit is set, it’s often referred to as the sample being lit.
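As a minimal illustration, the C sketch below computes the coverage mask of a pixel for a primitive reduced to a single edge (a half-plane), which is enough to show how the bits are formed. The sample positions are assumed to be the standard 4x sample locations defined by Vulkan; the function name and the half-plane representation are hypothetical simplifications, as a real rasterizer evaluates all edges of the primitive and applies fill rules.

```c
#include <stdint.h>

/* Assumed 4x sample positions within a pixel, in [0, 1) pixel units. */
static const float kSamplePos4x[4][2] = {
    {0.375f, 0.125f}, {0.875f, 0.375f}, {0.125f, 0.625f}, {0.625f, 0.875f}
};

/* Coverage mask of pixel (px, py) for the half-plane a*x + b*y + c >= 0.
   Bit i is set if sample i lies inside the half-plane. */
static uint32_t coverage_mask_4x(float a, float b, float c, int px, int py)
{
    uint32_t mask = 0;
    for (int i = 0; i < 4; ++i) {
        const float x = (float)px + kSamplePos4x[i][0];
        const float y = (float)py + kSamplePos4x[i][1];
        if (a * x + b * y + c >= 0.0f)
            mask |= 1u << i;
    }
    return mask;
}
```

For example, the half-plane x <= 0.5 covers samples 0 and 2 of pixel (0, 0), giving the mask 0101 in binary.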

The coverage mask needs to be available across the post-rasterization graphics pipeline stages, as it affects their operation:

The original coverage mask determined by rasterization (often called pre-depth coverage mask) needs to be available to depth/stencil testing to know which samples need to be tested
Depth/stencil testing will set each bit of the coverage mask whose corresponding samples failed the test to 0, producing the post-depth coverage mask
Color buffer operations then use this post-depth coverage mask to limit writing the output color(s) of the fragment shader only for the samples whose corresponding bits are set in the mask
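The three steps above can be modeled in C for a single 4x multisampled pixel. This is a sketch under simplifying assumptions: a less-than depth test with depth writes enabled, no stencil, no blending, and hypothetical type and function names.

```c
#include <stdint.h>

#define NUM_SAMPLES 4

typedef struct {
    float    depth[NUM_SAMPLES]; /* per-sample depth buffer values */
    uint32_t color[NUM_SAMPLES]; /* per-sample color buffer values */
} pixel_samples_t;

/* Tests only the samples set in the pre-depth coverage mask and returns
   the post-depth coverage mask (bits of failed samples cleared). */
static uint32_t depth_test(pixel_samples_t *p, uint32_t pre_depth_mask,
                           const float frag_depth[NUM_SAMPLES])
{
    uint32_t post_depth_mask = 0;
    for (int i = 0; i < NUM_SAMPLES; ++i) {
        const int covered = (int)((pre_depth_mask >> i) & 1u);
        if (covered && frag_depth[i] < p->depth[i]) {
            p->depth[i] = frag_depth[i];
            post_depth_mask |= 1u << i;
        }
    }
    return post_depth_mask;
}

/* Writes the single shaded color only to samples that survived the test. */
static void color_write(pixel_samples_t *p, uint32_t post_depth_mask,
                        uint32_t color)
{
    for (int i = 0; i < NUM_SAMPLES; ++i)
        if ((post_depth_mask >> i) & 1u)
            p->color[i] = color;
}
```

Note that depth is still evaluated per sample, while the color passed to `color_write` comes from a single fragment shader invocation.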

The pre-depth or post-depth coverage information can also be made available to the fragment shader as input, enabling more advanced rendering techniques, and the fragment shader may even output its own coverage mask which is then combined with the incoming coverage mask by subsequent pipeline stages using a bitwise AND operator (i.e. fragment shaders can’t add new covered samples, but can discard them), although there are some API extensions out there that also allow growing the set of covered samples using the fragment shader.

Examples of the use of the coverage mask as fragment shader input and output.

This notion of coverage will come in handy when discussing more advanced multisampling features.

Additional Benefits

The processing scheme established by basic multisampling also allows for a set of techniques that provide performance and/or quality benefits that couldn’t be applied in a naive supersampling environment. In this section we will mention a few of those.

Bandwidth Optimizations

One thing to note about the effects of basic multisampling is that often the majority of samples within a pixel of the rendered image have the same color, as pixels covered entirely by the front-most primitive will get each of their samples’ color values coming from the same single fragment shader invocation.

Some GPUs take advantage of this through a novel scheme where color values of such pixels are only written to memory once, instead of being replicated for each sample of the pixel. This is achieved by decoupling the actual color value’s storage from the sample index it belongs to by introducing additional metadata that maps sample indices to storage locations. In such schemes, individual color values at separate storage locations within a pixel are often referred to as color fragments.

Illustration of separate color coverage and color fragment storage.
Note: illustration-only as support, representation, behavior, and the flexibility of mapping between samples and color fragments depends on the hardware implementation.
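To give a feel for how such an indirection might work, here is a purely hypothetical C model of a 4x pixel where a small per-sample index table maps samples to stored color fragments. The layout, the names, and the decompress-modify-recompress update strategy are all assumptions of this sketch; actual hardware formats are vendor specific and generally not documented at this level.

```c
#include <stdint.h>

#define NUM_SAMPLES 4

typedef struct {
    uint8_t  frag_index[NUM_SAMPLES]; /* metadata: sample -> color fragment slot */
    uint8_t  frag_count;              /* number of slots actually in use */
    uint32_t frag_color[NUM_SAMPLES]; /* color fragments (worst case: one per sample) */
} compressed_pixel_t;

/* Reads the color of a sample through the indirection table. */
static uint32_t read_sample(const compressed_pixel_t *p, int sample)
{
    return p->frag_color[p->frag_index[sample]];
}

/* Writes `color` to the samples set in `mask`, then rebuilds the fragment
   list so that samples with identical colors share one slot. */
static void write_masked(compressed_pixel_t *p, uint32_t mask, uint32_t color)
{
    uint32_t expanded[NUM_SAMPLES];
    for (int i = 0; i < NUM_SAMPLES; ++i)
        expanded[i] = ((mask >> i) & 1u) ? color : read_sample(p, i);

    p->frag_count = 0;
    for (int i = 0; i < NUM_SAMPLES; ++i) {
        int slot = 0;
        while (slot < p->frag_count && p->frag_color[slot] != expanded[i])
            ++slot;
        if (slot == p->frag_count)
            p->frag_color[p->frag_count++] = expanded[i];
        p->frag_index[i] = (uint8_t)slot;
    }
}
```

A fully covered pixel ends up with `frag_count == 1`, so only one color value would need to travel to memory along with the small index table.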

This provides us with a lossless color image compression scheme that is specifically tailored to reduce the bandwidth requirements of multisampled rendering, further reducing the performance gap compared to conventional single-sampled rendering.

For depth/stencil data the usual compression schemes work just as well in case of multisampling with a few caveats, as explained later.

Some GPUs go even further. In particular, as we’ve seen in our previous article, TBR GPUs can avoid storing multisampled data in off-chip memory altogether by performing multisampled rendering entirely on-chip and performing the resolve as part of the tile store operation responsible for committing on-chip results to RAM. This approach completely eliminates the additional external bandwidth costs of multisampling.

Sample Locations

Both single-sampled rendering and naive supersampling (implemented by doing single-sampled rendering at a higher resolution) evaluate per-fragment operations across a regular grid of sample points dictated by the center location of each pixel. Multisampling introduces additional flexibility here by not requiring the locations of samples within a pixel to form a uniform axis-aligned grid. In fact, hardware implementations historically used special empirically established sample location patterns that increase the perceived quality of anti-aliasing by breaking the uniform grid and resulting aliasing patterns.

Standard sample locations used in multisampling for (left-to-right) 1, 2, 4, 8, and 16 samples, respectively.

Nowadays, GPUs allow the application developer to specify the locations of samples within every 2×2 group of pixels (quad) enabling further customization of multisampling behavior. This also opens the door for more advanced tricks where the sample locations are altered across subsequent frames and can be used as the basis for rendering techniques like checkerboard rendering or hybrid anti-aliasing algorithms combining multisampling and temporal anti-aliasing.

Comparison of 4x MSAA sample patterns: standard constant pattern across quad (left) and an example of custom varied pattern across quad (right).

One thing to note here is that the location of the samples within a pixel has an effect on some of the compression schemes typically used for depth buffers. This means that multisampled depth buffer data can only be interpreted with respect to a particular set of sample locations. It’s thus no coincidence that graphics APIs like Vulkan rely on the specification of the sample locations used when performing any operations on a multisampled depth buffer that may require decompressing it, or otherwise interpreting its compressed contents.

Complications

As briefly mentioned earlier, even though basic multisampling can be easily deployed into existing rendering code, there is a whole set of cases that require special handling in order to produce the expected rendering results. Handling these cases appropriately can be fairly intrusive because it often involves the need for specialized shaders for the multisampling case. In this section we will take a look at a few of these special cases that all renderers employing multisampling should be aware of.

Texture Sampling

We talked about the shortcut multisampling takes compared to supersampling, whereby the fragment shader is evaluated only once, even if multiple samples within the pixel are covered by a primitive. Although it may not be immediately obvious, this shortcut can have significant effects on the final rendering when we look at the texture samples used by the fragment shader.

First, basic multisampling will sample textures only once per pixel and only at the center of the pixel (typically, at least), as texture samples are read by the fragment shader which itself is executed only once per pixel in the basic case. This already can have subtle side-effects due to the limited precision available in the various hardware components involved.

However, a more critical situation arises when the pixel center isn’t actually covered by the primitive for which fragment shading executes. This results in sampling the input textures at a location that is outside of the primitive, and can produce rendering artifacts where texture data belonging to another primitive can “leak” into the current one.

Sampling at pixel center (left) versus centroid sampling (right).

In order to avoid this, the coordinates at which textures are sampled may need to be altered to correspond to a screen-space location that lies inside the actual primitive’s footprint. Shading languages provide appropriate syntax to mark varyings (fragment shader inputs) to use centroid-based interpolation which guarantees that the corresponding values are interpolated at a location within the pixel that is actually covered by the primitive. Centroid-sampling is not without caveats though, as derivatives, and consequently the sampled LOD, may be affected by the unaligned sub-pixel sampling locations across the 2×2 pixel area (quad) used to calculate them.
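The difference between the two interpolation locations can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the 4x sample offsets are assumed to follow the usual standard pattern, and averaging the covered samples is just one possible way to pick a covered location, as implementations are free to choose differently.

```python
# Assumed 4x sample offsets within a pixel (illustrative values)
SAMPLE_OFFSETS = [(0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875)]

def edge(a, b, p):
    """Edge function: its sign tells on which side of edge a->b the point p lies."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def inside(tri, p):
    """True if p is inside (or on the boundary of) the triangle, for either winding."""
    w = [edge(tri[i], tri[(i + 1) % 3], p) for i in range(3)]
    return all(v >= 0 for v in w) or all(v <= 0 for v in w)

def interpolation_location(tri, px, py, centroid):
    """Location at which varyings are interpolated for pixel (px, py), or None."""
    center = (px + 0.5, py + 0.5)
    samples = [(px + ox, py + oy) for ox, oy in SAMPLE_OFFSETS]
    covered = [s for s in samples if inside(tri, s)]
    if not covered:
        return None  # no sample covered, so no fragment is generated
    if not centroid or inside(tri, center):
        return center  # default behavior: interpolate at the pixel center
    # Centroid interpolation (one possible choice): average of the covered
    # samples, which is guaranteed to lie inside the convex triangle
    n = len(covered)
    return (sum(x for x, _ in covered) / n, sum(y for _, y in covered) / n)
```

For a thin triangle such as `[(0.0, 0.0), (1.0, 0.0), (1.0, 0.45)]`, two samples of pixel (0, 0) are covered while the pixel center is not, so only the centroid variant yields a location inside the primitive.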

Alpha Testing

Alpha testing has long ceased to exist as a dedicated fixed-function pipeline stage in GPUs (in most of them, at least) yet the term alpha testing stuck. What we really talk about here is the ability to cut out parts of a primitive based on some per-fragment data typically coming from a texture (often stored in the alpha channel). This is done by the fragment shader discarding certain sets of fragments.

As in the basic case of multisampling the fragment shader is only executed once per pixel, it follows that such cut-outs also happen per-pixel with a fragment shader designed for single-sampled rendering, hence the cut-out edges remain aliased, despite using multisampling.

Such shaders usually have to be altered to produce the expected results when used in conjunction with multisampling. There are multiple possible solutions to the problem with various levels of performance, quality, and intrusiveness.

The simplest solution is to just render such alpha-tested geometry separately, using supersampling. This guarantees correct rendering, however, it also eliminates the performance benefits of multisampling.

Another solution could be to manually loop over individual samples in the fragment shader (through the use of the input coverage mask), evaluate the condition of discarding for each, and adjust the output coverage mask according to the outcome of the test. This way we practically discard individual samples using the output coverage mask.

Clearly, this achieves identical results to supersampling, but in the process we also gave up most of the performance benefits of multisampling by having to execute at least part of the fragment shader per sample. This may or may not be any faster than supersampling, as on one hand we still kept the rest of the fragment shader code to execute per pixel instead of per sample, but on the other hand we made an otherwise parallelizable workload serial through the in-shader loop and, in general, made our fragment shader significantly more expensive.
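The per-sample loop described above can be sketched as follows. This is a Python model rather than shader code, and the function name, the threshold, and the way per-sample alpha values are passed in are all assumptions made for illustration; in a real fragment shader the alpha values would come from evaluating the texture lookup at each sample location.

```python
def per_sample_alpha_test(input_coverage, sample_alphas, threshold=0.5):
    """Return the output coverage mask after alpha testing each covered sample.

    input_coverage: bit i is set if sample i is covered by the primitive
    sample_alphas:  alpha value evaluated at each sample location
    """
    output_coverage = 0
    for i, alpha in enumerate(sample_alphas):  # serial loop over the samples
        covered = (input_coverage >> i) & 1
        if covered and alpha >= threshold:
            output_coverage |= 1 << i  # keep the sample
    return output_coverage  # samples failing the test are effectively discarded
```

For example, with an input coverage mask of `0b1011` and per-sample alpha values `[0.9, 0.1, 0.8, 0.7]`, only samples 0 and 3 survive, as sample 1 fails the test and sample 2 was not covered to begin with.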

A third option is to use a feature called alpha-to-coverage. This feature instructs the GPU to use the alpha channel of the first color output of the fragment shader to derive a corresponding coverage mask in some implementation-dependent way where an alpha value of 0.0 results in a coverage mask with all bits set to 0, an alpha value of 1.0 results in a coverage mask with all bits set to 1, and for alpha values between those the produced coverage mask has a roughly proportionate number of bits set, potentially also applying some screen-space dithering. The derived coverage mask is then used as if it were the output coverage mask of the fragment shader. Using this feature enables a fairly non-intrusive way to keep the performance benefits of multisampling while still producing adequate anti-aliasing quality for alpha-tested geometry.
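As the exact conversion is implementation-dependent, the following Python sketch only shows one plausible mapping: it rounds the alpha value to the nearest number of samples and sets that many of the lowest bits, without any dithering. Both function names are hypothetical, and real hardware is free to choose which bits are set.

```python
def alpha_to_coverage(alpha, sample_count):
    """Derive a coverage mask with a number of set bits proportional to alpha."""
    alpha = min(max(alpha, 0.0), 1.0)       # clamp to the [0, 1] range
    lit = int(alpha * sample_count + 0.5)   # round to the nearest sample count
    return (1 << lit) - 1                   # set the lowest `lit` bits (assumption)

def apply_alpha_to_coverage(fragment_coverage, alpha, sample_count):
    """Combine the derived mask with the coverage of the fragment."""
    return fragment_coverage & alpha_to_coverage(alpha, sample_count)
```

With 4 samples, an alpha value of 0.5 produces the mask `0b0011` in this sketch, so a fully covered fragment ends up contributing to half of the samples of the pixel.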

Advanced Multisampling

So far we only talked about the basic case of multisampling, but modern GPUs provide a wide set of parameters to control its behavior. Unfortunately, not all graphics APIs expose all of these. Truth be told, some of them are not universally or identically supported across all GPU vendors, while others are entirely vendor-specific, though many of them are at least exposed through cross-vendor and/or vendor-specific API extensions.

In order to have a structured look at the various configuration options, we have to introduce some new terminology…

In contrast to the single global sample count that is used in traditional supersampling and multisampling, we can define multiple different kinds of sample counts and collections of them used at different stages of the graphics pipeline:

Rasterized samples (RSS) – the sample count at which rasterization and most per-fragment operations take place
Depth/stencil samples (DSS) – the sample count at which depth/stencil testing takes place (the number of samples per pixel in the depth buffer)
Shaded samples (SHS) – the number of fragment shader invocations per pixel
Color coverage samples (CCS) – the sample count at which color buffer operations take place (the number of coverage samples per pixel in the color buffers)
Color storage samples (CSS) – the number of unique color values per pixel the color buffers can store (the number of color storage samples per pixel in the color buffers)

The above terminology is sort of a combination of the terminology used across various GPU vendors. NVIDIA uses the term color samples to denote the actual samples with unique storage locations in the color buffers, and uses the term coverage samples for the samples which at least maintain color coverage information, while AMD sometimes uses the color samples term for the latter, and color storage samples or color fragments for the former. Little literature distinguishes between rasterized samples and (color) coverage samples, but for the purposes of the contents of this article it seemed appropriate.

The way the various sets of samples are mapped to one another seems to vary across implementations to some extent: some may use a fixed mapping while others may allow a more dynamic mapping between certain types of sample counts. We will note some of these nuances when presenting the corresponding sample counts in detail.

Each sample count is typically a power-of-two value and, with a few exceptions, the following inequalities hold for them (usual implementation supported values are enumerated on the right):

RSS >= DSS >= CSS >= SHS         RSS, CCS ∈ { 1, 2, 4, 8, 16 }
RSS >= CCS >= CSS >= SHS        DSS, CSS ∈ { 1, 2, 4, 8 }

These sets of inequalities tell a story about what happens with samples across the pipeline stages and are behind the special multisampled anti-aliasing schemes like coverage sampled anti-aliasing (CSAA), enhanced quality anti-aliasing (EQAA), and other, more complex ones.

In general, the sample count only reduces throughout the pipe, although the shaded sample count is sort of special because it doesn’t actually change the number of samples in the pipe, rather it determines the set(s) of samples that are shaded with a single fragment shader invocation. Also, the shaded sample count typically cannot be larger than the storage sample count, as otherwise we couldn’t even store the unique colors corresponding to a single primitive (more on storage vs coverage samples later). There are certain exceptions, though, when special coverage reduction steps are introduced which can convert relative coverage to a modulation factor, practically assigning fractional opacity values to color outputs in proportion to the number of covered samples. In these cases the inequalities related to the shaded sample count don’t necessarily have to hold.

For reference, the following table summarizes some common multisampling and supersampling modes, and corresponding sample count values:

Mode	RSS	DSS	SHS	CCS	CSS
2x SSAA	2	2	2	2	2
4x SSAA	4	4	4	4	4
8x SSAA	8	8	8	8	8
2x MSAA	2	2	1	2	2
4x MSAA	4	4	1	4	4
8x MSAA	8	8	1	8	8
8x CSAA	8	4	1	8	4
8xQ CSAA	8	8	1	8	8
16x CSAA	16	4	1	16	4
16xQ CSAA	16	8	1	16	8
2f4x EQAA	4	2	1	4	2
4f8x EQAA	8	4	1	8	4
4f16x EQAA	16	4	1	16	4
8f16x EQAA	16	8	1	16	8

Example sample count values used in various supersampling (SSAA), multisampling (MSAA), coverage sampling (CSAA) and enhanced quality (EQAA) anti-aliasing schemes.
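As a quick sanity check, the modes listed in the table can be verified against the two chains of inequalities presented earlier. The Python sketch below simply transcribes the table, while the dictionary and function names are made up for this illustration.

```python
# (RSS, DSS, SHS, CCS, CSS) for each mode, transcribed from the table above
MODES = {
    "2x SSAA":   (2, 2, 2, 2, 2),
    "4x SSAA":   (4, 4, 4, 4, 4),
    "8x SSAA":   (8, 8, 8, 8, 8),
    "2x MSAA":   (2, 2, 1, 2, 2),
    "4x MSAA":   (4, 4, 1, 4, 4),
    "8x MSAA":   (8, 8, 1, 8, 8),
    "8x CSAA":   (8, 4, 1, 8, 4),
    "8xQ CSAA":  (8, 8, 1, 8, 8),
    "16x CSAA":  (16, 4, 1, 16, 4),
    "16xQ CSAA": (16, 8, 1, 16, 8),
    "2f4x EQAA": (4, 2, 1, 4, 2),
    "4f8x EQAA": (8, 4, 1, 8, 4),
    "4f16x EQAA": (16, 4, 1, 16, 4),
    "8f16x EQAA": (16, 8, 1, 16, 8),
}

def satisfies_inequalities(rss, dss, shs, ccs, css):
    """Check RSS >= DSS >= CSS >= SHS and RSS >= CCS >= CSS >= SHS."""
    return rss >= dss >= css >= shs and rss >= ccs >= css >= shs

for name, counts in MODES.items():
    assert satisfies_inequalities(*counts), name
```

A configuration such as `(4, 4, 1, 2, 4)` fails the check, as it would require more color storage samples than color coverage samples.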

These assume default, uncustomized behavior that you can typically also achieve through simple control panel settings in compatible applications. As we will see, a whole set of more complex modes are achievable through explicit use of the advanced multisampling features available on modern GPU hardware.

Sample Shading Rate

So far we only gave some vague definitions of the various types of sample counts used throughout the graphics pipeline. In particular, the shaded sample count values used across the examples shown until now have all been 1, except for supersampling where they matched the other sample counts. This is actually not unexpected, considering that the key difference between supersampling and multisampling that we’ve discussed so far has been the granularity at which samples are fed to fragment shading: supersampling shades each sample individually, while basic multisampling only runs a single fragment shader invocation for all samples within a pixel.

Modern multisampling implementations actually enable the application to control the shaded sample count, although not entirely directly. Instead, graphics APIs provide a way to specify a minimum rate at which samples are fragment shaded.

When the rate is set to 0.0, regular multisampling is used, where all samples of a pixel are processed by a single fragment shader invocation, while if the rate is set to 1.0, each sample is shaded by a separate fragment shader invocation, practically achieving supersampling.

This already enables mixing basic multisampled and supersampled rendering operations using the same set of multisampled framebuffer attachments, thus enabling renderers to take advantage of the performance benefits of multisampling for regular geometry while still allowing the use of supersampling in cases when basic multisampling would produce inadequate results (e.g. alpha-tested geometry).

The graphics APIs are intentionally vague about the mapping of minimum shading rate values (let’s call it MSR) and corresponding shaded sample counts, as some implementations only support two modes (multisampling vs supersampling, also known as per-sample shading), however, many GPUs offer additional support for values in between. More formally the relation between the configured minimum shading rate and the shaded sample count is as follows:

SHS >= CCS * MSR

The way coverage samples are assigned corresponding fragment shader invocations may be implementation-specific though, as it may either be a fixed mapping based on the sample indices, or could be a more dynamic scheme.
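The lower bound given by the relation above can be computed as in the following Python sketch. Rounding up and clamping the result to at least one invocation are assumptions here, chosen to match how we understand Vulkan to define its minimum sample shading factor; the function name is hypothetical, and implementations are allowed to shade more samples than this minimum.

```python
import math

def min_shaded_samples(ccs, msr):
    """Smallest shaded sample count (SHS) satisfying SHS >= CCS * MSR.

    ccs: color coverage sample count
    msr: minimum shading rate in the [0.0, 1.0] range
    """
    # Assumption: round up, and always run at least one invocation per pixel
    return max(math.ceil(ccs * msr), 1)
```

For example, with 8 coverage samples a minimum shading rate of 0.5 requires at least 4 fragment shader invocations per pixel, while a rate of 0.0 falls back to the single invocation of basic multisampling.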

Supporting different shaded sample counts does come with additional complexity as handling them may require specialized fragment shaders for various cases:

Fragment shaders designed to do per-sample shading (i.e. supersampling), and
Fragment shaders designed to shade groups of samples (in extreme cases even separate ones for the SHS=1 and specific SHS>1 cases)

It still remains manageable, though, as typically only certain cases like alpha-tested geometry need special handling.

Overrasterization

Let’s take a step back and now look at the various sample counts in the order of the pipeline stages. First we look at a feature sometimes referred to as overrasterization, whereby the rasterization sample count is larger than the other sample counts (RSS > DSS/CSS).

These modes are often used with a single-sampled color buffer, and without a depth/stencil buffer, compositing related use cases being the most common, although in certain cases and mode combinations it can be useful even with a depth/stencil buffer.

By rasterizing primitives at a higher sample count, our fragment shader is able to receive more fine-grained coverage information which then can be used, for example, to calculate blending factors to composite geometry onto an existing image, effectively producing an anti-aliased image even if the target color buffer is single-sampled.
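A minimal sketch of this idea is shown below in Python: the fraction of covered raster samples is turned into a blend factor that is then used to composite the incoming color over the existing contents of a single-sampled color buffer. The function names and the simple linear blend are assumptions made for illustration.

```python
def coverage_to_alpha(coverage_mask, rss):
    """Fraction of the rss rasterized samples that are covered by the primitive."""
    valid_bits = coverage_mask & ((1 << rss) - 1)  # ignore bits beyond rss
    return bin(valid_bits).count("1") / rss

def composite(dst, src, coverage_mask, rss):
    """Blend the src color over the dst color, weighted by the covered fraction."""
    a = coverage_to_alpha(coverage_mask, rss)
    return tuple(s * a + d * (1.0 - a) for s, d in zip(src, dst))
```

For example, a white fragment covering 4 out of 8 raster samples composited over a black pixel results in a mid-gray value, even though the color buffer itself stores only one value per pixel.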

Naturally, this only produces the expected results if objects are rendered in back-to-front order (typical compositing scenario), as we don’t maintain coverage information for all samples (neither from the depth/stencil perspective, due to having no depth/stencil buffer or one with fewer samples, nor from the color perspective, as we don’t store either color values or color coverage at the full rate).

Mixed Samples Rendering

Rasterizing a higher number of samples than the number of samples in the depth/stencil buffer, if one is present, introduces some complications. How are we supposed to perform depth/stencil testing on these? Should we group samples together in pairs/quadruplets/etc. and associate them with corresponding depth/stencil buffer samples?

While that can be easily done, the stored values in the depth/stencil buffer were evaluated at specific sample locations, so using those for depth/stencil testing samples at different locations within the pixel can inherently result in subtle rendering artifacts.

Some GPU implementations employ special depth compression schemes that store plane equations or gradients instead of raw depth values, when possible, and these may be suitable for evaluating depth values even at sample locations that weren’t used to construct the plane equations or gradients in response to earlier depth writes. However, in the general case this cannot be relied upon for various reasons:

Some hardware implementations may not support such compression schemes at all
The depth buffer may have been decompressed for various reasons
The compression scheme may not be applicable in case of many small triangles or around triangle edges

Thus in order to guarantee sample accurate depth/stencil testing results the depth/stencil sample count has to match the rasterization sample count, even if it’s otherwise larger than the color sample counts (RSS = DSS > CCS/CSS). These modes are sometimes referred to as mixed samples modes.

This guarantee comes at a cost, though, as increasing the depth/stencil sample count proportionally increases the memory consumption of depth/stencil buffers, as they now have to be able to hold more unique values per pixel. Consequently, the memory bandwidth requirements scale as well.

Still, compared to supersampling, or even basic multisampling where a growth in depth/stencil sample count needs the color sample count to grow as well, mixed samples multisampling modes can provide performance and memory usage benefits when there’s no need to maintain color data at the same granularity.
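To put some numbers behind this, the Python sketch below estimates the raw attachment sizes of a basic multisampling mode and a mixed samples mode. It deliberately ignores compression, metadata, and alignment padding, and the resolution, formats, and function name are arbitrary choices made for this example.

```python
def attachment_bytes(width, height, bytes_per_sample, samples):
    """Raw, uncompressed size of an attachment (no padding or metadata)."""
    return width * height * bytes_per_sample * samples

WIDTH, HEIGHT = 1920, 1080  # example resolution
COLOR_BYTES = 4             # e.g. a 32-bit color format
DEPTH_BYTES = 4             # e.g. a 32-bit depth/stencil format

# 8x MSAA: both depth/stencil and color hold 8 samples per pixel
msaa_total = (attachment_bytes(WIDTH, HEIGHT, DEPTH_BYTES, 8)
              + attachment_bytes(WIDTH, HEIGHT, COLOR_BYTES, 8))

# Mixed samples: 8 depth/stencil samples, single-sampled color
mixed_total = (attachment_bytes(WIDTH, HEIGHT, DEPTH_BYTES, 8)
               + attachment_bytes(WIDTH, HEIGHT, COLOR_BYTES, 1))

print(msaa_total, mixed_total)
```

Under these assumptions the mixed samples configuration needs a bit more than half of the attachment memory of 8x MSAA, with the entire saving coming from the color buffer.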

Separate Color Coverage

We touched a few times already in various contexts upon the topic of the separation of color coverage data from actual color values, or, using different terminology, color samples from color fragments. It’s time to explain the distinction between the two concepts in more detail.

The basic idea behind separate color coverage is that in a basic multisampling environment all samples within a pixel covered by the incoming primitive are assigned a single color value due to fragment shading happening per-pixel. This means that most pixels will only ever need to store a single value, except at the edge of primitives or at the intersection of primitives.

Taking advantage of this, we can often maintain color coverage information for each sample separately, while getting away with color value storage only for a smaller number of samples (CCS > CSS). This reduces the memory consumption of color buffers, as well as the memory bandwidth requirements of accessing them.

Obviously, this comes with a caveat as in cases where there would be a need to store a larger number of unique color values for a given pixel than the color storage sample count, we don’t have any choice but to discard one of the color values.

Illustration of lost color values due to limited color storage samples (CCS=4, CSS=2).

The replacement policy used in these cases may vary across hardware implementations, partly depending on how flexible the mapping supported by the hardware is between color coverage samples and color fragments.

Nonetheless, despite its limitations, keeping the color coverage sample count at full rate while using a reduced number of color storage samples usually results in comparable overall anti-aliasing quality to regular multisampling, while saving precious memory storage and bandwidth, hence the corresponding CSAA and EQAA modes remain popular choices for a wide range of applications.

Bonus: Variable Rate Shading

While technically not a multisampling feature, we need to also mention variable rate shading here.

This feature allows controlling the rate at which fragment shading happens on a per-draw, per-tile, per-viewport, and/or per-primitive basis. Even though the key differentiator of this feature is that it enables shading multiple pixels by a single fragment shader invocation, it also allows controlling the sample shading rate in multisampling use cases.

As such, it provides a way to control the shaded sample count at a granularity that isn’t achievable with traditional sample shading rate parameters. This additional flexibility makes it possible to mix alpha-tested geometry with fully opaque geometry in the same draw call where alpha-tested primitives can use supersampling while the rest of the primitives can still take advantage of the performance benefits of basic multisampling.

Conclusion

In this article we took a look at how increasing spatial sampling frequency at various points in the graphics pipeline enables us to greatly reduce screen-space aliasing and inherently also certain temporal aliasing artifacts (e.g. “travelling” pixels). Supersampling is the most straightforward way to achieve this by uniformly increasing sampling frequency over the entire back-end of the graphics pipeline. We’ve also seen how multisampling attempts to reduce the overhead of supersampling while maintaining most of its quality benefits.

Throughout the article we also presented numerous multisampling features present on modern GPUs that further close the gap between the quality of multisampling and supersampling, including mechanisms that allow us to selectively apply supersampling to certain parts of a scene.

While subject to hardware support, the various advanced multisampling modes can even be combined with one another, enabling fine-grained control over memory/performance/quality trade-offs.

Each individual multisampling feature deserves a much lengthier, in-depth, and more formal description, so this article remains a mere introduction to the topic. Graphics API standard specifications like OpenGL and Vulkan, as well as related extension specifications are likely the best reference out there for rigorous definitions of the behavior of these features. Please, consider the following list of references for further reading:

OpenGL 4.6 Core Profile Specification
- GL_ARB_multisample (basic multisampling, alpha-to-coverage)
- GL_ARB_sample_shading (sample shading rate)
- GL_ARB_sample_locations (custom sample locations)
- GL_ARB_post_depth_coverage (support for post-depth coverage mask in the fragment shader)
- GL_EXT_raster_multisample (overrasterization)
- GL_AMD_framebuffer_multisample_advanced (separate color coverage and mixed samples rendering)
- GL_NV_framebuffer_multisample_coverage (separate color coverage)
- GL_NV_fragment_coverage_to_color (support to output final coverage mask to a color buffer)
- GL_NV_framebuffer_mixed_samples (mixed samples rendering)
- GL_NV_sample_mask_override_coverage (support to add new lit samples in the fragment shader)
- GL_NV_alpha_to_coverage_dither_control (alpha-to-coverage dithering)
- GL_NV_shading_rate_image (per-tile/viewport variable rate shading)
- GL_NV_primitive_shading_rate (per-primitive variable rate shading)
Vulkan 1.2 Specification with all published extensions
- VK_KHR_fragment_shading_rate (per-tile/primitive variable rate shading)
- VK_EXT_sample_locations (custom sample locations)
- VK_EXT_post_depth_coverage (support for post-depth coverage mask in the fragment shader)
- VK_EXT_fragment_density_map (per-tile variable rate shading)
- VK_EXT_fragment_density_map2 (per-tile variable rate shading)
- VK_AMD_mixed_attachment_samples (mixed samples rendering)
- VK_AMD_shader_fragment_mask (direct access to color coverage information)
- VK_NV_coverage_reduction_mode (control coverage reduction behavior)
- VK_NV_fragment_coverage_to_color (support to output final coverage mask to a color buffer)
- VK_NV_framebuffer_mixed_samples (mixed samples rendering)
- VK_NV_sample_mask_override_coverage (support to add new lit samples in the fragment shader)
- VK_NV_shading_rate_image (per-tile/viewport variable rate shading)

GPU architecture types explained

Daniel Rákos — Mon, 19 Jul 2021 17:09:29 +0000

The behavior of the graphics pipeline is practically standard across platforms and APIs, yet GPU vendors come up with unique solutions to accelerate it, the two major architecture types being tile-based and immediate-mode rendering GPUs. In this article we explore how they work, present their strengths/weaknesses, and discuss some of the implications the underlying GPU architecture may have on the efficiency of certain rendering algorithms.

Let’s start with the basics first by taking a look at how the two major GPU architecture types implement the graphics pipeline stages.

Immediate-Mode Rendering

We can really call this GPU architecture the “traditional” one, as immediate-mode rendering (IMR) GPUs implement the logical graphics pipeline, as described by the various graphics APIs, quite literally:

Simplified illustration of the immediate-mode rendering pipeline.

Incoming draws trigger the generation of geometry workload with a corresponding set of vertices to be processed (with appropriate primitive connectivity information according to the primitive type)
Vertices/primitives are then fed to the various geometry processing stages (vertex shading and any other additional processing stages like tessellation or geometry shading, if enabled, or mesh shading in the latest architectures)
The resulting primitives are then culled (and potentially clipped), transformed to framebuffer space, and sent to the rasterizer
Fragments generated by the rasterizer then go through the various per-fragment operations and fragment processing (potentially discarded by the fragment shader)
Finally the remaining fragments’ color values get written to the corresponding color attachments (potentially in non-trivial ways in case of multisampling, as an example)

The important takeaway is that entire draw commands are processed to completion on the GPU in a single pass and all resources are accessed through traditional (cache assisted) memory transactions.

One prerequisite of this is the existence of some sort of buffer between the front-end stages (ending with primitive assembly) and back-end stages (starting with rasterization) to be able to handle the non-uniform ratio between geometry and fragment workload that is inherent from the fact that a single incoming primitive may cover an arbitrarily large area in framebuffer space. This remains an active area of research for immediate-mode rendering architectures despite the fact that actual programmable shader processing happens across a group of unified shader cores in modern GPUs, enabling almost any form of distribution of geometry and fragment processing workloads across them.

Tile-Based Rendering

As the name suggests, tile-based rendering (TBR) GPUs execute the graphics pipeline on a per-tile basis. What this means is that the framebuffer space (also referred to as the render area) is split into equally sized rectangular areas called tiles, and rasterization, as well as all other back-end stages, are executed separately for each individual tile after the front-end stages have completed.

Simplified illustration of the tile-based rendering pipeline.

The traditional primitive assembly stage is replaced with a primitive binning stage whereby primitives that survive culling are accumulated into one or more bins, depending on which tiles within the render area they overlap. After the front-end stages complete, a separate per-tile pipeline is launched that:

Starts with the tile load operation, responsible for loading the framebuffer attachment data corresponding to the tile into dedicated on-chip storage
Then primitives in the bin corresponding to the tile are rasterized and go through the usual per-fragment operation and fragment processing stages, but when those need to read/write framebuffer data they access the on-chip storage instead of the in-memory attachment resources
Finally the tile store operation is responsible for storing any modified framebuffer attachment data from the on-chip storage back to memory

It can be seen that the key difference between a TBR and an IMR GPU is the way they communicate primitive data between the front-end and back-end stages and the way they access framebuffer data. These dissimilarities have profound implications though, as explained later.

Algorithm Comparison

The pseudocode below presents the algorithmic differences between the IMR and TBR architectures:

# IMMEDIATE-MODE RENDERING ALGORITHM

# Single phase:
for framebuffer/subpass in pass:
  for draw in subpass:
    for primitive in draw:
      process(primitive/vertices)
      rasterize(primitive)
      for fragment in primitive:
        process(fragment)

# TILE-BASED RENDERING ALGORITHM

# Phase 1:
for framebuffer/subpass in pass:
  for draw in subpass:
    for primitive in draw:
      process(primitive/vertices)
      bin(primitive)

# Phase 2:
for tile in renderarea:
  for framebuffer/subpass in pass:
    load(framebuffer tile)
    for primitive in bin:
      rasterize(primitive)
      for fragment in primitive:
        process(fragment)
    store(framebuffer tile)
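To see the two algorithms side by side in runnable form, here is a heavily simplified Python model. Everything in it is an assumption made for illustration: primitives are axis-aligned rectangles of the form (x0, y0, x1, y1, color) that lie within the framebuffer, there is a single subpass, fragment processing simply overwrites the color, and framebuffer traffic is counted per written pixel without modeling any caches.

```python
TILE = 4  # hypothetical tile edge length in pixels

def rasterize(prim, x_range, y_range):
    """Yield the pixels of a rectangle primitive that fall inside a region."""
    x0, y0, x1, y1, _ = prim
    for y in range(max(y0, y_range[0]), min(y1, y_range[1])):
        for x in range(max(x0, x_range[0]), min(x1, x_range[1])):
            yield x, y

def render_imr(prims, width, height):
    """Single phase: every fragment writes straight to the framebuffer."""
    fb = [[0] * width for _ in range(height)]
    fb_writes = 0
    for prim in prims:
        for x, y in rasterize(prim, (0, width), (0, height)):
            fb[y][x] = prim[4]
            fb_writes += 1
    return fb, fb_writes

def render_tbr(prims, width, height):
    """Two phases: bin primitives per tile, then render tile by tile."""
    # Phase 1: append each primitive to the bin of every tile it overlaps
    bins = {}
    for prim in prims:
        x0, y0, x1, y1, _ = prim
        for ty in range(y0 // TILE, (y1 - 1) // TILE + 1):
            for tx in range(x0 // TILE, (x1 - 1) // TILE + 1):
                bins.setdefault((tx, ty), []).append(prim)
    # Phase 2: load tile, process its bin using on-chip storage, store tile
    fb = [[0] * width for _ in range(height)]
    fb_writes = 0
    for ty in range((height + TILE - 1) // TILE):
        for tx in range((width + TILE - 1) // TILE):
            xr = (tx * TILE, min((tx + 1) * TILE, width))
            yr = (ty * TILE, min((ty + 1) * TILE, height))
            tile = {(x, y): fb[y][x] for y in range(*yr) for x in range(*xr)}
            for prim in bins.get((tx, ty), []):
                for x, y in rasterize(prim, xr, yr):
                    tile[(x, y)] = prim[4]  # stays in on-chip tile memory
            for (x, y), value in tile.items():
                fb[y][x] = value            # a single store per pixel
                fb_writes += 1
    return fb, fb_writes
```

Rendering three overlapping rectangles such as `[(0, 0, 8, 8, 1), (2, 2, 6, 6, 2), (1, 1, 3, 3, 3)]` into an 8×8 framebuffer produces identical images with both functions, but the immediate-mode variant issues one framebuffer write per fragment, while the tile-based variant writes each pixel exactly once regardless of overdraw.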

For those familiar with compute shaders or other GPU compute APIs, a reasonable analogy to the behavior of TBR GPUs would be to imagine implementing the two phases of the TBR algorithm with two compute shader dispatches.

The first compute shader is run across the domain of primitives within the whole render pass, each invocation processing a primitive and its vertices, and then appending it to the end of the per-tile buffers for each tile the primitive overlaps.

The second compute shader is run across the domain of pixels within the render area, each workgroup covering a single tile. Invocations of the workgroup go through the list of framebuffers/subpasses and perform the following operations:

Load the pixel corresponding to the invocation from each framebuffer attachment into shared memory
Go through the primitives from the bin of the current subpass and determine whether the pixel corresponding to the invocation is inside the primitive, if yes, perform the appropriate per-fragment operations and fragment processing, utilizing the framebuffer data loaded in shared memory when needed
Store the pixel data in shared memory corresponding to the invocation back into the corresponding framebuffer attachment

Of course, this is an oversimplified example of how compute shaders could mimic the behavior of TBR GPUs, and, while reasonably efficient, such a software-based approach would still be significantly slower than hardware implementations, as the latter have dedicated hardware for specific fixed-function parts of the process (e.g. rasterization). Nonetheless, the analogy is surprisingly accurate if we consider that the on-chip tile memory used by TBR GPUs can reasonably be expected to be backed by the same physical on-chip memory that is used to implement compute shared memory.

One question that always pops up in the context of tile-based rendering is that “okay, okay, tile-based, but what tile sizes are used?”, and while there’s usually no uniform answer even in case of a single vendor, sometimes even for a single ASIC, the general figure is that TBR GPUs typically use very small tiles, e.g. 16×16 or 32×32. Implementations may even choose the tile size dynamically, depending on the pixel size of the framebuffer attachments used within a subpass, or across a series of subpasses. This is not surprising if we consider that the tile size and the total per-pixel size of all framebuffer attachments that need to be accessed by a given subpass (or preserved for later subpasses) are what determine the total on-chip tile memory required.
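The relationship between tile size, per-pixel attachment size, and on-chip memory can be sketched in Python as follows. The 16 KiB on-chip budget, the candidate tile sizes, and both function names are purely hypothetical, as actual figures vary between implementations.

```python
def tile_memory_bytes(tile_w, tile_h, bytes_per_pixel):
    """On-chip memory needed to keep all attachments of one tile resident."""
    return tile_w * tile_h * bytes_per_pixel

def largest_square_tile(on_chip_bytes, bytes_per_pixel, candidates=(64, 32, 16, 8)):
    """Pick the largest candidate tile size that fits the on-chip budget."""
    for size in candidates:  # candidates are listed from largest to smallest
        if tile_memory_bytes(size, size, bytes_per_pixel) <= on_chip_bytes:
            return size
    return None  # not even the smallest candidate fits

BUDGET = 16 * 1024  # hypothetical on-chip tile memory budget in bytes

# 32-bit color + 32-bit depth/stencil: 8 bytes per pixel
print(largest_square_tile(BUDGET, 8))
# A fatter set of attachments, e.g. 24 bytes per pixel in total
print(largest_square_tile(BUDGET, 24))
```

Under these assumptions the first configuration fits 32×32 tiles, while the second has to fall back to 16×16 tiles, which illustrates why an implementation may want to choose the tile size based on the attachments in use.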

So Which Architecture Is Better?

It’s quite easy to find articles and discussions on the internet making bold claims about one architecture being better than the other, or being the future. In reality, the question itself is moot, as both architectures make specific trade-offs to increase the performance and/or efficiency of certain operations at the expense of others.

If we just look at the core architectures themselves across a wide set of content types then both architectures look pretty efficient and scalable, yet it’s well known that typically desktop/discrete GPUs use IMR, while mobile/embedded GPUs use TBR.

In order to better understand the trade-offs made by the two approaches we first need to look at how the two architectures access/exchange data throughout the pipeline.

Key data paths of immediate-mode GPUs.

In case of IMR GPUs all application provided data is accessed through different types of caches (for more details read our earlier article on caches). Note that framebuffer attachments are accessed through the RB cache which consists of a set of color- and depth/stencil caches private to each ROP/RB (raster operation unit or render backend) of the GPU.

The entire pipeline is run in a single phase, as discussed earlier, which is enabled by the on-chip primitive buffer that allows the fixed-function primitive assembly stage to push primitive data to the rasterizer.

Key data paths of tile-based GPUs.

In contrast, TBR GPUs write the primitive data off-chip into per-tile primitive bins which then are consumed by the subsequent per-tile operations issued in the second phase of the pipeline.

However, as the back-end stages of the TBR GPU operate on a per-tile basis, all framebuffer data, including color, depth, and stencil data, is loaded and remains resident in the on-chip tile memory until all primitives overlapping the tile are completely processed, thus all fragment processing operations, including the fragment shader and the fixed-function per-fragment operations, can read and write these data without ever going off-chip.

Shader reads/writes go through a standard GPU shader data cache hierarchy on both IMR and TBR GPUs.

As the amount of memory traffic is often the main bottleneck of GPU workloads both from performance and energy efficiency perspective, it follows that the better suited GPU architecture for any given workload is the one that requires less external memory bandwidth. More specifically, what matters is how the additional memory traffic on TBR GPUs incurred by writing the primitive data to off-chip per-tile primitive bins compares to the recurring memory traffic to/from the framebuffer attachments on IMR GPUs, and how both compare overall to the rest of the memory traffic generated by index and vertex buffer loads, texture reads, or other shader read/write operations.

Taking the simplest practical example, we’re talking about 10 bytes per vertex (16-bit integer framebuffer-space X, Y, and Z, plus 2x 16-bit floating point texture coordinates), at best 10 bytes per triangle (if all triangle vertices are shared with neighboring ones), while at the framebuffer side we’re talking about 6 bytes per pixel (32-bit LDR or 10-bit HDR color buffer, and 16-bit depth buffer), and these per-vertex and per-pixel sizes need to be taken into account, together with the primitive to pixel ratio, in order to estimate which GPU architecture type could be more efficient for a specific workload.

However, the equation is not that simple, as even TBR GPUs typically need to load and/or store each touched tile once (load can be avoided if the framebuffer contents are discarded or cleared before rendering while stores may be avoidable in a few specific scenarios as well), thus TBR GPUs need some level of overdraw to actually be able to amortize the cost of tile load/store. In addition, the information about which primitive overlaps with which tile needs additional storage and thus bandwidth in the per-tile primitive bin.
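To make the back-of-the-envelope comparison above more tangible, here is a minimal sketch of such an estimate. It is a toy model under loudly stated assumptions: the function names and parameters are made up for illustration, every bin entry is written once and read once, every rasterized fragment on the IMR side costs one framebuffer access unless it hits the RB cache, and compression, vertex fetch, and texture traffic are ignored entirely.

```python
def tbr_traffic_bytes(vertices, pixels, bytes_per_vertex=10, bytes_per_pixel=6,
                      load_tiles=False, store_tiles=True):
    """External memory traffic of a tile-based renderer in this toy model."""
    # Assumption: binned vertex data is written once and read back once.
    bin_traffic = 2 * vertices * bytes_per_vertex
    # Each touched pixel is loaded and/or stored at most once per render pass.
    tile_traffic = pixels * bytes_per_pixel * (int(load_tiles) + int(store_tiles))
    return bin_traffic + tile_traffic


def imr_traffic_bytes(pixels, overdraw, bytes_per_pixel=6, rb_cache_hit_rate=0.0):
    """External memory traffic of an immediate-mode renderer in this toy model."""
    # Assumption: every fragment costs one framebuffer access unless it hits
    # the RB cache, in which case it stays on-chip.
    fragments = pixels * overdraw
    return fragments * bytes_per_pixel * (1.0 - rb_cache_hit_rate)


full_hd = 1920 * 1080
low_poly_tbr = tbr_traffic_bytes(vertices=100_000, pixels=full_hd)
high_poly_tbr = tbr_traffic_bytes(vertices=2_000_000, pixels=full_hd)
imr = imr_traffic_bytes(pixels=full_hd, overdraw=3)
```

With these made-up inputs the low-polygon scene favors the tile-based model while the high-polygon scene favors the immediate-mode one, which is the trade-off the per-vertex and per-pixel figures hint at. Real hardware behaves far less linearly than this.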

On the IMR side, it’s also important to consider the efficiency of the RB cache, because as long as multiple subsequent primitives get rasterized on a given ROP unit in the same framebuffer area, the RB cache can amortize the cost of framebuffer reads/writes similarly to the on-chip tile memory on TBR GPUs. In fact, the larger the GPU, and thus the higher the number of ROP units, the more likely framebuffer accesses are to hit in the RB cache.

Furthermore, modern TBR GPU implementations use aggressive lossless compression schemes for both the per-tile primitive bin storage and the off-chip storage of framebuffer attachments, which further skews the naive figures. Though that’s nothing new to IMR GPUs either, as modern implementations also employ lossless framebuffer attachment compression algorithms to save bandwidth.

Another technique available to TBR GPUs to reduce the bandwidth requirements of the per-tile primitive bins, often referred to as vertex shader splitting or deferred attribute shading, takes advantage of the fact that the binning stage only depends on the position of vertices. In such a scheme the geometry processing done in the first phase of the tile-based rendering only executes the part of the geometry processing stages (e.g. vertex shader) that are needed to determine the framebuffer-space position of the vertices (and possibly clip distances or other clipping/culling related data), as that’s the only information needed to decide which per-tile bin the primitive needs to be emitted to. In this case only index data (with or without position data) needs to be written to the per-tile primitive bins, the rest of the workload determining per-vertex and per-primitive outputs is instead performed in the second phase.

In the simplest case this may simply mean that the vertex shader work is moved into the fragment shader (similar to how the visibility buffer algorithm works), but more sophisticated implementations may employ a separate geometry processing stage (or stages) executed on the primitives in the per-tile bin. While this approach may potentially save a significant amount of external memory bandwidth in case of shaders using many interpolants, it usually results in redundant processing of the same vertices across the individual tiles, and the vertex buffer access pattern may be less cache-friendly due to gaps between the data of the subset of primitives overlapping the tile in question.

In order to allow better vertex buffer access performance on such hardware implementations it is advised to keep position-affecting vertex attributes in separate buffers from the rest, even if otherwise using an interleaved scheme within the vertex buffers. In fact, it’s generally good to follow this practice, as geometry-only passes like a depth pre-pass and shadow map generation passes can also benefit from it.
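As a rough illustration of the advice above, the following sketch packs the same vertices into a position-only stream and a separate stream holding the remaining attributes. The attribute set (normal plus one texture coordinate) and all names are hypothetical, and the byte estimate assumes that reading a position from an interleaved buffer effectively drags the whole vertex through the cache.

```python
import struct

POSITION = struct.Struct("<3f")   # x, y, z: 12 bytes
SHADING = struct.Struct("<3f2f")  # normal + texture coordinate: 20 bytes


def build_split_streams(vertices):
    """Pack positions and the remaining attributes into two separate buffers."""
    positions = b"".join(POSITION.pack(*v["position"]) for v in vertices)
    shading = b"".join(SHADING.pack(*v["normal"], *v["uv"]) for v in vertices)
    return positions, shading


def position_pass_bytes(vertex_count, interleaved):
    """Bytes touched by a position-only pass (binning, depth pre-pass, shadows)."""
    # Assumption: with an interleaved layout the full vertex stride is fetched.
    stride = POSITION.size + SHADING.size if interleaved else POSITION.size
    return vertex_count * stride


triangle = [
    {"position": (0.0, 0.0, 0.0), "normal": (0.0, 0.0, 1.0), "uv": (0.0, 0.0)},
    {"position": (1.0, 0.0, 0.0), "normal": (0.0, 0.0, 1.0), "uv": (1.0, 0.0)},
    {"position": (0.0, 1.0, 0.0), "normal": (0.0, 0.0, 1.0), "uv": (0.0, 1.0)},
]
positions, shading = build_split_streams(triangle)
```

In Vulkan terms the two buffers would simply be bound as two vertex input bindings, so a geometry-only pipeline can leave the second one unbound.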

The possibility of using late attribute shading is not exclusive to TBR GPUs though, as IMR GPUs may similarly benefit (or not). Reducing the amount of per-primitive data transferred from the front-end stages to the back-end stages, and delaying geometry processing workload until after culling, can be a performance win on any architecture. However, once again, whether the benefits of this approach outweigh its drawbacks is highly dependent on the particular workload at hand.

Nonetheless, despite many parameters being at play, it’s a good rule of thumb that the higher the geometry complexity the more likely IMR GPUs will outperform their tile-based counterparts. Thus it comes as no surprise that despite mobile devices being able to drive rendering at very high resolutions with pretty pixel effects, geometric detail is usually significantly lower compared to similar renderings encountered on desktop, and features like tessellation that significantly increase the number of primitives to be rasterized continue to remain practical primarily only on desktop systems.

It’s also worth noting here that if the external memory traffic is dominated by fragment shader memory accesses, e.g. due to complex materials needing many high-resolution textures as input, which is quite common in modern workloads, the differences between the two architectures diminish, even if we acknowledge that TBR GPUs may experience better spatial locality of memory accesses and thus may employ more sophisticated optimizations in order to accelerate these accesses, thanks to the stricter processing order inherent in tile-based rasterization.

Use Cases

Instead of trying to answer the impossible and declare a winner here, let’s take a look at a couple of use cases and see how the two architectures tackle these problems, which can both give an idea of where each one shines and lead to a better understanding of their behavior and the rationale behind their design.

Hidden Surface Removal

During the rendering of a typical 3D scene it’s pretty much inevitable to have a certain level of overdraw, and spending time on processing fragments/pixels of primitives that will later be covered by subsequent primitives in front of them can have a great impact on overall performance and efficiency.

IMR GPUs tackle this problem by relying on depth test to reject hidden pixels as early as possible. This enables even the most naive implementation to avoid having to generate framebuffer read/write traffic, but modern GPUs go well beyond that.

First, in most cases depth testing (and depth writes as well) can execute as part of the early per-fragment operations, thus fragment shading can be avoided altogether for hidden surfaces. Exceptions to this are fragment shaders modifying the fragment depth value or discarding the fragment, which both delay depth writes to the late per-fragment operations (as the depth value to write may change, or the write shouldn’t happen in the first place). Furthermore, if the fragment shader modifies the fragment depth in a way that could cause false negatives during early depth testing (i.e. the fragment would fail depth testing before the fragment shader but would pass it afterwards) then the entire depth test needs to be delayed, and the fragment shader cost cannot be avoided even for fragments that end up being part of a hidden surface.

Simplified illustration of early/late depth testing on IMR GPUs.

It’s worth noting though that even fragment shaders modifying the fragment depth may benefit from early depth testing if the modification doesn’t affect the depth test results, although this requires the application developers to explicitly declare their intent on how they plan to alter the fragment depth in the shader (see conservative depth).
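The rules described above can be condensed into a small decision function. This is only a sketch of the reasoning in the text, not a description of how any particular driver decides; the enum, the function, and the `conservative_depth_compatible` flag (meaning the declared depth modification cannot turn an early failure into a late pass) are names invented for the example.

```python
from enum import Enum


class DepthMode(Enum):
    EARLY_TEST_EARLY_WRITE = "early test, early write"
    EARLY_TEST_LATE_WRITE = "early test, late write"
    LATE_TEST_LATE_WRITE = "late test, late write"


def choose_depth_mode(writes_depth, uses_discard, conservative_depth_compatible=False):
    """Pick where depth test and depth write can run for a fragment shader."""
    if writes_depth and not conservative_depth_compatible:
        # Early testing could wrongly reject a fragment that would pass later.
        return DepthMode.LATE_TEST_LATE_WRITE
    if writes_depth or uses_discard:
        # The test result is safe to use early, but the value to write (or
        # whether to write at all) is only known after shading.
        return DepthMode.EARLY_TEST_LATE_WRITE
    return DepthMode.EARLY_TEST_EARLY_WRITE
```

For example, a plain opaque shader ends up fully early, while a shader using discard keeps the early test but postpones the write.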

In addition to early testing, IMR GPUs use Hi-Z (hierarchical depth) buffers and special depth-specific compression schemes that allow rejecting large groups of fragments (entire “tiles”) immediately, before even doing fine-grained rasterization. Discussing the details of this, however, is beyond the scope of this article.

Regardless of the individual optimizations in the hardware, taking advantage of this type of hidden surface removal always relies on the presence of a depth buffer and it’s best taken advantage of if the geometry is submitted to the GPU in front-to-back order. While the latter is fairly straightforward to achieve at the granularity of individual draw calls (although in case of overlapping geometry some overdraw may happen anyway), it usually cannot be ensured for primitives within a single draw call, in fact early depth test efficiency will often depend on the orientation of the geometry with respect to the view point.
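A tiny simulation shows why submission order matters. The sketch below looks at a single pixel with a "less than" depth test and counts how many fragments survive early depth testing and get shaded; the draw call representation (a dictionary with a `center` entry) is an assumption made purely for the example.

```python
def count_shaded_fragments(depths):
    """Fragments shaded at one pixel, given depths in submission order."""
    nearest = float("inf")  # cleared depth buffer value
    shaded = 0
    for depth in depths:
        if depth < nearest:  # passes the early depth test
            shaded += 1
            nearest = depth
    return shaded


def sort_front_to_back(draws, camera_position):
    """Order draw calls by the distance of their (hypothetical) center point."""
    def squared_distance(draw):
        return sum((c - p) ** 2 for c, p in zip(draw["center"], camera_position))
    return sorted(draws, key=squared_distance)


back_to_front = count_shaded_fragments([0.9, 0.5, 0.1])
front_to_back = count_shaded_fragments([0.1, 0.5, 0.9])
```

Three overlapping layers cost three shader invocations when drawn back-to-front, but only one when drawn front-to-back.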

Switching to TBR GPUs, avoiding framebuffer read/write traffic is less of an issue due to the data being in the fast on-chip tile memory, but avoiding unnecessary fragment shader invocations is just as important (if not more important) due to the external memory traffic they generate to read and interpolate incoming varyings/attributes (texture coordinates, normals, etc.) and fetch texture data.

Early depth testing and related optimizations found on IMR GPUs can be similarly employed by TBR GPUs, although some of the more sophisticated techniques may make less sense on them (at least in their traditional form). Nonetheless, they have other tools at their disposal for effective hidden surface removal due to all the geometry affecting the pixels of the tile in question being available at the time rasterization starts.

Before focusing on the back-end stages of the pipeline, it already can be noted that the primitive binning stage may be able to employ a similar technique to Hi-Z to avoid having to emit primitives to a tile’s bin if a previously emitted primitive is guaranteed to occlude it due to depth testing later on. Such an approach wouldn’t just avoid having to perform unnecessary fragment shading on the primitive, but can also save memory bandwidth consumed by the per-tile primitive bin writes/reads.

As part of primitive binning or during rasterization, TBR GPUs may also be able to partially or fully sort primitives front-to-back, potentially achieving better hidden surface removal than their immediate-mode counterparts. Some TBR GPUs go even further than that and perform perfect per-pixel hidden surface removal and thus can guarantee that every single pixel gets shaded exactly once throughout the whole subpass. This can be achieved even in the absence of a depth buffer, which comes in handy in other use cases as we’ll see later.
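Conceptually, per-pixel hidden surface removal boils down to resolving visibility for the whole tile first and only then shading. The sketch below illustrates that idea for opaque geometry only; it is not how any specific hardware implements it, and the fragment tuple layout is an assumption made for the example.

```python
def resolve_visibility(tile_fragments):
    """Map each pixel to the primitive owning its nearest opaque fragment.

    tile_fragments: iterable of (pixel, depth, primitive_id) tuples produced
    by rasterizing every primitive binned to the tile.
    """
    nearest = {}
    for pixel, depth, primitive_id in tile_fragments:
        if pixel not in nearest or depth < nearest[pixel][0]:
            nearest[pixel] = (depth, primitive_id)
    return {pixel: primitive_id for pixel, (_, primitive_id) in nearest.items()}


fragments = [
    ((0, 0), 0.9, "background"),
    ((0, 0), 0.2, "character"),
    ((1, 0), 0.5, "background"),
]
visible = resolve_visibility(fragments)
shader_invocations = len(visible)  # one per covered pixel, not one per fragment
```

Note that the depth values only need to live on-chip while the tile is being processed, which is why no depth attachment in memory is required for this to work.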

Of course, these techniques may fall short in the presence of fragment shader discard, fragment shader depth modification, blending, and other scenarios that don’t satisfy the prerequisites of applying the corresponding optimization, as we’ve seen in case of IMR GPUs.

2D Compositing

While we all appreciate our GPUs primarily for the beautiful 3D images they are able to produce, we have to admit that the most common workload we give to them in our daily use of computers is 2D composition, let that be the desktop UI composition itself or the compositing done to present web content in a browser.

In case of IMR GPUs there aren’t a whole lot of features helping 2D compositing use cases in particular. While depth testing and front-to-back rendering can be used for 2D compositing as well, it requires opaque-only geometry, and the often cheap fragment processing used by compositors rarely justifies using a depth buffer. Thus basic graphics based 2D compositing is usually simply done using back-to-front rendering and using blending where applicable, with appropriate hidden surface removal happening in software, using scissors and discard rectangles, potentially using the stencil buffer for complex cases, or with combinations of those.

In contrast, it seems like 2D compositing cases play to the strengths of TBR GPUs due to the low geometry-to-pixel ratio and guaranteed on-chip blending. In fact this partly explains why tile-based architectures are favored in low-power and energy-constrained environments.

Overdraw debug overlay enabled on an Android device.

Talking about blending, IMR GPUs use dedicated fixed-function hardware for the purpose. Beyond being responsible for applying the appropriate blend composition itself, this component also has to make sure that pixels are blended on top of each other according to the incoming primitive order. This isn’t trivial when you can have shaders processing the fragments of different primitives that may overlap in framebuffer space scattered across the shader cores.

TBR GPUs don’t really have this problem, as they process all primitives covering a particular tile at the “same place”, seemingly in order (disregarding any potential reordering done for optimization purposes as explained earlier). This also enables typical TBR GPUs to allow fragment shaders to directly read or write their corresponding framebuffer pixels, opening the door for programmable blending and other compositing tricks unavailable to their IMR counterparts. In fact it’s not uncommon for the standard graphics API blending to be implemented on TBR GPUs through the driver patching the application provided fragment shader with blending code.

To be fair, similar benefits can be achieved using a compute shader on an IMR GPU mimicking tile-based composition, although while the on-chip tile memory contents are preserved when switching shaders on a TBR GPU the same isn’t true about data in shared memory across compute shader dispatches.

Loosely speaking, post-processing effects applied on top of a 3D scene also fall into this workload category, it’s thus no surprise that modern desktop applications usually use compute shaders to implement them.

Multisampling

Multisampled rendering enables faster antialiasing with similar quality to naive supersampling by only performing rasterization and certain per-fragment operations at a higher rate while continuing to do fragment shading only once per fragment/pixel. There are certain controls to alter that, potentially going all the way to full supersampling (shading each sample within a pixel individually), but the highest performance gains over supersampling are achieved in the basic case.

The behavior of multisampled rasterization is pretty much standardized, yet hardware implementation is quite different on IMR and TBR GPUs.

Multisampled framebuffer attachments have to be able to store a unique value for each sample in the edge case, however, it’s quite common for all samples to have the same color value (e.g. all opaque pixels lying entirely inside the primitive). IMR GPUs usually take advantage of this and instead of storing the same color value for each covered sample individually, they only store it once and add special metadata to the framebuffer attachment that tells which sample within the pixel has this particular stored value. This can reduce the consumed memory bandwidth considerably.
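The idea can be sketched as storing each distinct color once per pixel plus a small per-sample index. Actual hardware formats are proprietary and differ in detail, so treat the following only as an illustration of the principle; the function names are made up.

```python
def compress_pixel(samples):
    """Store each distinct color once, plus per-sample indices (the metadata)."""
    palette = []
    indices = []
    for color in samples:
        if color not in palette:
            palette.append(color)
        indices.append(palette.index(color))
    return palette, indices


def decompress_pixel(palette, indices):
    """Reconstruct the full list of per-sample colors."""
    return [palette[i] for i in indices]


RED, BLUE = (255, 0, 0), (0, 0, 255)
interior = compress_pixel([RED, RED, RED, RED])  # pixel fully inside a primitive
edge = compress_pixel([RED, RED, BLUE, BLUE])    # pixel on a primitive edge
```

An interior pixel of a 4x multisampled image then needs a single color value instead of four, and only edge pixels pay for more.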

Although similar compression techniques can also be employed on TBR GPUs to use less bandwidth when loading/storing framebuffer attachment tiles, such optimizations are less interesting while the data is in the on-chip tile memory. TBR implementations instead go one step further and try to avoid storing multisampled images in memory altogether when possible.

Multisampled rasterization on tile-based GPU architectures.
Note that multisampled data only exists on-chip.

In most cases image data lives in a multisampled format only temporarily, as after the frame is rendered the multisampled framebuffer attachments get resolved into single-sampled counterparts that are directly presentable to the screen. In the ideal case this can be leveraged by TBR GPUs to the extent that multisampled data only ever has to live in on-chip tile memory. In such a setup the multisampled framebuffer attachments never get loaded from or stored to memory, and thus don’t even consume external memory (see transient attachments). They only exist for the duration of the multisampled rendering, and once the tile is completely rendered the tile store operation simply resolves the multisampled data on-chip into a single-sampled image in memory. This can save a tremendous amount of memory bandwidth, but only if the multisampled data doesn’t need to be preserved across render passes/frames.
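What the resolving tile store does can be illustrated with a simple box filter that averages the samples of each pixel. This is a sketch assuming a plain average and floating-point color tuples; real resolve filters and data formats may differ.

```python
def resolve_tile(tile_samples):
    """Average the samples of every pixel in a tile into one color per pixel.

    tile_samples: list of pixels, each a list of (r, g, b) sample colors.
    """
    resolved = []
    for samples in tile_samples:
        count = len(samples)
        resolved.append(tuple(sum(channel) / count for channel in zip(*samples)))
    return resolved


edge_pixel = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
stored = resolve_tile([edge_pixel])
```

Only the single resolved color per pixel is written to memory here, so with 4x multisampling the store traffic is a quarter of what writing every sample would cost.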

Deferred rendering

There’s a long-standing myth (that is luckily slowly disappearing) that deferred rendering techniques are not suitable for TBR GPUs. The origin of this myth is probably the poor performance observed on naive ports of deferred rendering applications to these platforms.

Ironically, the opposite is true if both the geometry pass and the deferred pass are part of the same render pass, as TBR GPUs don’t need to actually consume any external memory bandwidth for writing out and then reading (typically multiple times) the G-Buffer contents, which is usually the main bottleneck of such renderers. Instead, for each tile, the geometry pass will write G-Buffer data into the on-chip tile memory that is then later consumed by the deferred pass(es) and may never actually need to leave the GPU die. Actual mileage may vary though, as we will soon see.

Deferred rendering on tile-based GPU architectures.
Note that G-Buffer data only exists on-chip.

Shadow and Reflection Maps

There are many situations when the outputs of a rendering pass need to be mapped one way or another to some geometry, i.e. framebuffer attachments are reused as textures. Shadow map or reflection map rendering are perfect examples.

While IMR GPUs don’t particularly “like” these types of workloads, especially if the results of the rendering are immediately needed in the next subpass, which is often the cause of pipeline bubbles, they handle them just as well as any other type of workload. In fact, shadow map rendering often doesn’t require any fragment processing (except for partially transparent primitives, etc.), thus populating the depth attachment can be performed extremely fast by the fixed-function ROP units.

The same isn’t true for TBR GPUs, as even depth-only rendering requires the geometry data to be written out to memory and consumed by the rasterizer. However, the larger issue is the fact that the produced framebuffer data needs to be flushed from the on-chip tile memory, storing the data to memory, before the render pass that would like to map that image onto some geometry can start, practically eliminating many benefits of tile-based rendering.

Besides that, generating a shadow or reflection map requires pushing the scene geometry (even if just partially) down the graphics pipeline once more for each map, further stressing the main bottleneck of tile-based architectures. This is why we usually see fewer effects depending on these types of render-to-texture scenarios in applications primarily targeting TBR GPUs.

Other Considerations

Except for the last use case presented, the examples above mostly showcased benefits of tile-based architectures over IMRs, the only drawback seeming to be the additional cost of having to write processed geometry back to memory before rasterization is started. However, this seemingly innocent trade-off has a lot of implications.

First, we have to note that besides the obvious memory bandwidth cost of going off-chip with geometry data that usually makes highly detailed geometry (even if tessellated within the graphics pipeline) impractical on TBR GPUs, there is also the question of how much memory this transient buffer needs.

Practically, TBR GPU drivers need to determine how much geometry storage will be necessary during the entire render pass. This is fairly trivial for simple cases when the final post-transform primitive count matches the incoming count, or is less than that (e.g. due to culling): it simply requires book-keeping the primitive/vertex counts of draw calls and delaying their submission until data is collected from the entire render pass and appropriate primitive bin storage is reserved. However, that’s not always an option, as in-pipeline geometry amplification (e.g. in case of tessellation or geometry shading) or indirect draws may break the correlation between input and output primitive count in ways that cannot be predicted.
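The book-keeping described above could look roughly like the sketch below. The draw call records, their field names, and the per-primitive byte cost are all hypothetical; a real driver tracks far more state than this.

```python
def estimate_bin_bytes(draws, bytes_per_primitive=10):
    """Upper bound of primitive bin storage for a render pass, or None.

    Returns None when the output primitive count cannot be predicted up front,
    e.g. because of indirect draws or in-pipeline geometry amplification.
    """
    total = 0
    for draw in draws:
        if draw.get("indirect") or draw.get("amplifies_geometry"):
            return None
        # Culling can only reduce the count, so the input count is an upper bound.
        total += draw["primitive_count"] * bytes_per_primitive
    return total


predictable = estimate_bin_bytes([{"primitive_count": 1000}, {"primitive_count": 500}])
unpredictable = estimate_bin_bytes([{"primitive_count": 1000},
                                    {"primitive_count": 0, "indirect": True}])
```

When the estimate is unavailable, the implementation has to fall back to some other strategy, such as the render pass splitting discussed next.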

In case the determined primitive bin storage size exceeds practical limits, or cannot be estimated, driver implementations will be forced to split the rendering workload into multiple render passes, potentially eliminating the advantages of tile-based rendering. This can happen even in Vulkan, despite having explicit render pass constructs at the API level, as a single render pass object can be turned into multiple render passes internally by the driver, if necessary, let alone traditional APIs where render pass boundaries are determined entirely by the driver often involving complex guess-work leading to unpredictable performance cliffs from the developer’s perspective.

Besides the primitive bin storage limitations, there are many other circumstances where render passes need to be split. As an example, a split may be necessary due to too many or too big (in terms of bytes-per-pixel) framebuffer attachments being used, or needed to be preserved in a subpass. Even if such cases aren’t handled by doing a full render pass split, they may result in needing to perform in-render-pass partial tile stores and reloads.

Other examples include workloads that contain feedback loops like transform feedback or occlusion queries (even indirect draws and render-to-texture scenarios can be included in this category), especially when they contain back-end to front-end dependencies. Such feedback loops can have significant overheads on IMR GPUs as well, but not nearly as devastating as on their TBR counterparts.

Thus, while TBR architectures offer many potential benefits, they are more content sensitive and require careful attention from software developers to hit their sweet spot, while IMR architectures typically “work just fine” (though it should go without saying that efficient API usage makes a huge difference on both architectures, nonetheless).

Hybrid Architectures

Not all GPUs fall strictly in the IMR or TBR category, in fact many architectures employ some hybrid approach. The simplest hybrid architectures are GPUs that are capable of operating both in IMR and TBR mode thus allowing the underlying driver to use the best mode for the particular workload at hand.

Another class of hybrid architecture is one that is often referred to as tile-based immediate-mode rendering. As dissected in this article, this hybrid architecture has been used since NVIDIA’s Maxwell GPUs. Does that mean that this architecture is like a TBR one, or that it shares all benefits of both worlds? Well, not really…

What the article and the video fail to show is what happens when you increase the primitive count. Guillemot’s test application doesn’t support large primitive counts, but the effect is already visible if we crank up both the primitive and attribute count. After a certain threshold it can be noted that not all primitives are rasterized within a tile before the GPU starts rasterizing the next tile, thus we’re clearly not talking about a traditional TBR architecture.

To understand what is happening here, we have to look at how rasterization workload is typically distributed across the multiple ROP units present on discrete GPUs. Loosely speaking, IMR GPUs also sort of work in a per-tile fashion in the sense that workload is usually sent to specific ROP units based on their framebuffer-space location to maximize RB cache hit rates. If the ROP unit count is sufficiently large that no two tiles are served by the same unit, then practically we can achieve framebuffer access performance similar to that of TBR GPUs, although obviously that’s rarely ever the case. That means the ROP units need to “context-switch” between tiles through the course of rendering the frame, resulting in decreased RB cache hit rates.

Thus what likely happens here isn’t tile-based rendering in its “standard” form, but rather on-chip buffering of incoming primitives and intelligently dispatching them to individual ROP units in an order that depletes all buffered primitives corresponding to a specific tile handled by a particular ROP unit before sending other buffered primitives that would require the ROP unit to switch to another tile.
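The suspected behavior can be modeled with a few lines of code. To be clear, this is a speculative sketch of the buffering scheme hypothesized above, not documented hardware behavior; the function name and the (primitive, tile) pair representation are assumptions, and a primitive overlapping several tiles would simply appear once per tile.

```python
def dispatch_order(primitives, window_size):
    """Order in which (primitive_id, tile_id) pairs would reach the ROP units.

    primitives: list of (primitive_id, tile_id) pairs in submission order.
    window_size: how many entries fit into the on-chip primitive buffer.
    """
    order = []
    for start in range(0, len(primitives), window_size):
        window = primitives[start:start + window_size]
        tiles = []
        for _, tile in window:
            if tile not in tiles:
                tiles.append(tile)
        # Drain the buffered primitives tile by tile, keeping submission
        # order within each tile.
        for tile in tiles:
            order.extend(entry for entry in window if entry[1] == tile)
    return order


submitted = [(0, "A"), (1, "B"), (2, "A"), (3, "B")]
large_window = dispatch_order(submitted, window_size=4)
small_window = dispatch_order(submitted, window_size=2)
```

With a window large enough to hold everything, tile A is finished before tile B starts, just like on a TBR GPU. With the smaller window the rasterizer bounces back to tile A after already touching tile B, which matches the behavior observed at higher primitive counts.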

While this approach can definitely achieve similar benefits to TBR GPUs in very low primitive count use cases like 2D composition, post-processing, decals, and certain particle effects, and should provide performance boost in general, it doesn’t offer the same whole-render-pass level benefits of TBR architectures, like the possibility of pixel-perfect hidden surface removal, on-chip multisampling, or on-chip deferred rendering, as the tile-based rasterization only happens across a small window of primitives (depending on how many of them fit in the on-chip primitive buffer) while TBRs use large off-chip buffers for the same purpose.

Conclusion

We presented the two most prevalent GPU architecture types, the immediate-mode rendering (IMR) architecture using a traditional implementation of the rendering pipeline, and the tile-based rendering (TBR) architecture that takes a different approach to achieve the same goals.

Along the discussion we also explored a set of use cases that highlight key strengths and weaknesses of each.

While both architectures have their merits and each clearly has an advantage over the other in specific scenarios, there is no free lunch in the context of GPU architectures, just like anywhere else in the world of algorithm optimization, as both architectures make specific compromises to sacrifice performance in one part of the pipeline in order to gain performance elsewhere.

Thus it is doubtful whether the development of all GPUs will converge towards either one. Instead, each architecture will continue to be used in systems that get the most benefits from the specific trade-offs characterizing it.

It is not unlikely though that we will see more hybrid architectures on systems where the additional hardware cost is justified by being able to handle a wider range of content efficiently, and that graphics APIs will continue to introduce explicit mechanisms, like Vulkan’s render pass API, to enable more direct and effective use of quirkier GPU features like the two-phase per-tile rendering, as these reduce the reliance on driver guess-work and thus lead to more predictable rendering performance.

Video compression basics

Daniel Rákos — Sun, 09 May 2021 16:38:39 +0000

The Khronos Group recently released a set of provisional extensions adding video encoding and decoding capabilities to the Vulkan API, collectively referred to as Vulkan Video. This thus seemed like the perfect opportunity to provide an introduction to video compression from the perspective of a graphics programmer, and discuss why having integrated support for video encoding and decoding as part of the Vulkan API is an important step forward for the industry.

DISCLAIMER: This article aims to provide only a rudimentary presentation of the basic concepts one needs to be familiar with when working with video codec APIs, targeting developers coming from the graphics rendering world without prior experience with these, and as such some of the simplistic descriptions may cause experts to raise their eyebrows, rightfully so. Nonetheless, the article hopefully covers the foundations in sufficient depth for readers to be able to get started with video encoding and decoding.

Support for compressed video capture and playback became such a basic feature of any interactive computer system that nowadays one hardly finds one that has no support for it. It’s also clear that such systems need dedicated acceleration hardware to be able to meet the performance requirements imposed by the ever increasing resolution and frame rate needs. Thus contemporary dedicated GPUs, many CPUs, and most mobile and embedded SoCs come with such dedicated video codec hardware.

Throughout the last couple of decades a large number of video and media codec APIs emerged (then most disappeared) that aimed to provide access to these acceleration engines at different abstraction levels. However, most of these APIs were or still are either hardware vendor or platform dependent, and a few that came with wider portability promises couldn’t really live up to those due to lack of adoption by the industry players. While Vulkan Video is yet to prove that it could overcome this “curse” haunting these APIs, the momentum of the Vulkan API in general should certainly help in this respect.

Arguably, another aspect of many of the existing video codec API standards and libraries that concerns developers targeting high performance interoperability between video encode/decode and rendering is that many lack copy-less sharing of frame data and/or efficient synchronization at sharing boundaries, as they were designed around a CPU-centric model similar to that of legacy graphics APIs, which often results in additional transfer or synchronization round-trips between the GPU and the CPU. The few APIs that do provide efficient sharing are often limited to particular hardware vendors, platforms, or to specific use cases, e.g. efficient presentation to screen. Taking advantage of the explicit memory management and synchronization infrastructure offered by the core Vulkan API, Vulkan Video should be able to overcome these obstacles.

In order to accentuate the importance of efficient video and rendering interop, it’s worth mentioning a couple of typical use cases:

Video texturing – i.e. being able to use compressed videos as a streaming source for texture data mapped onto 3D geometry (quite common in video games, but desktop compositors can similarly benefit, and use cases generally have high efficiency requirements)
Streaming renderers – i.e. being able to efficiently stream the output of rendering applications to other applications or remote devices (for interactive use cases latency is of key importance)
Video post-processing – e.g. being able to apply GPU based filters on a series of compressed video frames
Video peripheral support – e.g. using IP camera input

The anatomy of compressed video

Generally speaking, a compressed video stream consists of a bitstream that contains a (typically lossy) compressed representation of each image frame of the video sequence with some additional codec specific metadata (e.g. parameter sets) that is necessary for a decoder to be able to interpret the compressed data and display the decompressed images.

In the simplest case each frame of the video is compressed separately, e.g. one can create a compressed video stream out of a series of JPEG compressed images. Such a compressed frame of a complete image is called an I-frame (intra-coded frame), sometimes referred to as a keyframe, although the application of the latter term is rather context specific.

NOTE: While the terms "frame" and "picture" are often used interchangeably, video compression algorithms usually work in terms of pictures which can represent either a frame or a field. Fields are used in case of interlaced video where pairs of subsequent fields comprise a full frame, one holding the odd-numbered scanlines and the other holding the even-numbered scanlines. We will use the term "frame" throughout the article because it is the more common terminology used in literature, but usually we actually mean a "picture".

Of course, the quality/bitrate metrics of modern video codecs wouldn’t be achievable if all images were individually compressed into I-frames, as those only take advantage of the spatial coherency of pixel data. Instead, temporal coherency is leveraged by usually only storing compressed differences between separate image frames of the video.

A P-frame (predicted frame), sometimes referred to as a delta-frame or inter-frame, encodes only changes compared to an earlier frame. The frame that is used to calculate the delta-frame from is called the reference frame (or reference picture). While commonly this difference is calculated with respect to the immediately preceding frame, P-frames can refer to any earlier frame in the general case. In fact some codecs may also allow using multiple reference frames together with the image frame to construct a P-frame.

Even higher compression rates can be achieved when compressing an image frame as a B-frame (bidirectional predicted frame) as this frame type allows a delta-frame to refer not only to earlier frames in the sequence, but also later ones. Once again, typically B-frames are encoded using the immediately preceding and the subsequent frames as references, but this isn’t always the case.

It’s also worth noting that the reference frames used by a P- or B-frame themselves may be delta/predicted frames, they don’t have to be I-frames.

Subsequent video frames (left-to-right) with I-, P- and B-frames.
Note how the B-frame can take advantage of image information from both the preceding and subsequent frames used as reference frames.

The order in which I-, P-, and B-frames follow each other in the video sequence is called the GOP (group of pictures), and some decoder/encoder implementations may require the GOP to have a specific size (number of pictures/frames) or structure (order in which different picture/frame types follow each other). The GOP structure is often described by two numbers, telling the distance between subsequent P and I frames, respectively, e.g. a GOP with M=3, N=9 means that its structure is IBBPBBPBBI, where each letter refers to a corresponding frame type in the sequence.

Illustration of a group of pictures in a video bitstream following an IBBPBBI pattern (M=3, N=6).
NOTE: actual physical order of frames in the bitstream is usually different, as discussed later.
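
To make the M/N notation concrete, here is a minimal sketch that expands the two numbers into a frame type pattern in display order. It assumes M is the distance between consecutive anchor frames (I or P) and N is the distance between consecutive I-frames; the function name `gop_pattern` is made up for this illustration, and real encoders are free to deviate from such a regular structure.

```python
def gop_pattern(m: int, n: int) -> str:
    """Frame types of one GOP in display order, plus the I-frame opening the next GOP.

    m: assumed distance between consecutive anchor frames (I or P)
    n: assumed distance between consecutive I-frames
    """
    if m < 1 or n < 1:
        raise ValueError("m and n must be positive")
    types = []
    for i in range(n + 1):
        if i % n == 0:
            types.append("I")   # start of this GOP or of the next one
        elif i % m == 0:
            types.append("P")   # anchor frame predicted from earlier frames
        else:
            types.append("B")   # frames between anchors
    return "".join(types)


print(gop_pattern(3, 9))
print(gop_pattern(3, 6))
```

The two calls reproduce the patterns mentioned in the text and in the figure caption, respectively.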

NOTE: Some codecs also support the compression scheme to be controlled at a finer granularity. These are typically rectangular sub-regions splitting the picture horizontally or vertically. E.g. H.264/AVC calls these slices, and a single frame may contain different types of slices (I-slices, P-slices, B-slices).

The basic processing unit of video codecs varies, however, it’s quite common to use some sort of block-based coding. E.g. H.264/AVC uses 16×16 pixel sized macroblocks (MB) that can be further partitioned into blocks as small as 4×4 pixels, and H.265/HEVC uses 16×16, 32×32, or 64×64 pixel sized coding tree units (CTU). The encoding process usually uses techniques like motion estimation to select picture segments to produce delta information from and the image data (or image difference data) is compressed using discrete cosine transform, Huffman coding, run-length encoding, or other algorithms specific to the particular codec, including appropriate quantization at certain levels of the process to achieve the desired quality-to-size ratio.

Inputs and outputs

The actual uncompressed frame format used as the source for video encoding and the destination of video decoding varies between codecs and implementations, but typically they use images in a YCbCr color space, i.e. a format with a luminance (luma) and two chrominance (chroma) channels.

NOTE: Going forward we will use the terms YCbCr and YUV interchangeably. Strictly speaking, this is inaccurate as YUV refers to an analog encoding scheme, nonetheless it's quite common for literature to refer to YCbCr formats as YUV formats, the U and V channels being equivalent to the Cb (blue chrominance) and Cr (red chrominance) channels, respectively.

Often chrominance is stored at a lower resolution. This practice is called chroma subsampling, and it means that chrominance samples may be shared among several luminance samples. The particular chroma subsampling scheme used is commonly expressed as a three-part ratio J:a:b, where:

J is the horizontal sampling reference (i.e. the horizontal region width in pixels in terms of which the chroma to luma ratio is expressed) and usually has the value 4
a is the number of chrominance samples, including both Cr and Cb samples, across the horizontal sampling reference (i.e. per rows of J pixels)
b is the row stride in terms of chrominance samples (i.e. the index of the first chrominance sample pair in the second row of J pixels), which is zero for formats with only a single row of chrominance samples

Some YUV formats include an alpha channel as well in which case a fourth value is included in the ratio, although the alpha sampling rate always matches the luma sampling rate (J) in these cases.

Demonstration of common chroma subsampling schemes.
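
As a rough sketch of how the J:a:b ratio translates into numbers, the snippet below derives the chroma plane dimensions and the average bits per pixel. It assumes the common reading in which a and b count the Cb/Cr sample pairs in the first and second row of a J×2 pixel block, and that the image dimensions divide evenly; both function names are hypothetical.

```python
from fractions import Fraction


def chroma_plane_size(width, height, j, a, b):
    """Size of each chroma plane (assumes width and height divide evenly)."""
    horizontal_factor = j // a              # luma columns covered by one chroma sample
    vertical_factor = 2 if b == 0 else 1    # b == 0: both rows share one row of chroma
    return width // horizontal_factor, height // vertical_factor


def bits_per_pixel(bits_per_sample, j, a, b):
    """Average bits per pixel, computed over a J x 2 reference block."""
    luma_samples = 2 * j
    chroma_samples = 2 * (a + b)            # (a + b) Cb/Cr pairs, two samples per pair
    return Fraction(bits_per_sample * (luma_samples + chroma_samples), luma_samples)


print(chroma_plane_size(1920, 1080, 4, 2, 0))   # 4:2:0
print(bits_per_pixel(8, 4, 2, 0))               # 4:2:0 with 8 bits per channel
```

For 8-bit channels this yields 24, 16, and 12 bits per pixel for 4:4:4, 4:2:2, and 4:2:0, respectively.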

Another thing to keep in mind when working with YUV formats using chroma subsampling is that the mapping between luma and chroma values, i.e. the points at which luma and chroma samples were “measured” on the original full resolution image, can be ambiguous without further contextual information. In other words, the image-space coordinates of luma samples and corresponding chroma samples may not coincide. Thus the interpretation of chroma subsampled YUV formats requires knowing the relative location of luma and chroma samples.

Examples of commonly used chroma cositing schemes.
NOTE: the coordinate space convention used has a lower-left origin.

From the perspective of in-memory representation, we distinguish between packed and planar YUV formats. Packed formats interleave luminance and chrominance samples, while planar formats use separate planes for luminance and chrominance, sometimes even separate planes for red and blue chrominance. Some literature calls the latter fully planar formats, as opposed to semi-planar formats that include both Cr and Cb in a single plane. In addition, certain YUV format definitions also require specific placement of the individual planes of planar YUV formats in memory.

Examples of 4×4 pixel images with packed (top row), fully planar (middle row), and semi-planar (bottom row) YUV formats and different chroma subsampling schemes.
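
The difference between the semi-planar and fully planar layouts can be sketched by computing where each plane would start in memory. This is only an illustration under the assumption of an 8-bit 4:2:0 image with even dimensions, tightly packed rows, and planes placed back to back with Cb before Cr; actual format definitions and implementations may order, pad, or place planes differently. The function name is hypothetical.

```python
def plane_layout_420(width, height, semi_planar):
    """Return {plane name: (byte offset, byte size)} for an 8-bit 4:2:0 image.

    Assumes even dimensions and tightly packed, contiguous planes.
    """
    luma_size = width * height
    chroma_size = (width // 2) * (height // 2)   # per chroma channel
    if semi_planar:
        # One luma plane followed by one plane of interleaved Cb/Cr pairs.
        return {"Y": (0, luma_size), "CbCr": (luma_size, 2 * chroma_size)}
    # One luma plane followed by separate Cb and Cr planes.
    return {
        "Y": (0, luma_size),
        "Cb": (luma_size, chroma_size),
        "Cr": (luma_size + chroma_size, chroma_size),
    }


print(plane_layout_420(4, 4, semi_planar=True))
print(plane_layout_420(4, 4, semi_planar=False))
```

Either way a 4×4 image occupies 24 bytes in total, which matches the 12 bits per pixel expected for 8-bit 4:2:0 data.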

GPUs usually come with support for accessing YUV formatted images/textures, however, the level and type of support can vary.

Some GPUs and corresponding driver implementations limit support to being able to access the raw YUV format itself (i.e. access the luminance and chrominance values), allowing creating textures with YUV formats, and being able to read from, write to and sample from (with filtering) these images/textures.

In case of packed 4:4:4 formats this works similarly to RGB/RGBA formats, except that the semantics of the individual channels are different. For some subsampled packed formats (e.g. UYVY) a single texel value can actually include more than one luma sample, each in a different channel.

Planar YUV formatted textures work similarly to planar depth/stencil textures in that the actual texture consists of multiple planes, sometimes having different resolutions as well due to chroma subsampling, and shader access (or other type of GPU access) to them is allowed on a per-plane basis.

Other GPU and/or driver implementations, instead or as an alternative, allow addressing and interpreting YUV formatted textures in other ways. E.g. some implementations may allow sampling from planar YUV formatted textures through a single sampler resource variable in the shader, and some may even support a fixed set of YCbCr-to-RGB conversion modes, meaning that the texel values get converted between the corresponding color spaces and sampling instructions thus produce RGB result values. This may be made possible either through dedicated hardware support for multi-plane access and color-space conversion, or through software emulation which enables the driver to mimic hardware support by manually converting YUV texture sample operations into per-plane native access operations complemented with in-shader filtering and color space conversion.

Vulkan exposes support for these functionalities through the VK_KHR_sampler_ycbcr_conversion extension that is part of the core API in Vulkan 1.1. The name itself is a bit misleading when one considers that native hardware support varies across implementations, however, Vulkan provides the necessary tools to perform either raw per-plane access (through the creation of per-plane image views), or multi-plane access with or without color space conversion (through the use of samplers with YCbCr conversion enabled), although support for each is not uniform across driver implementations for the reasons mentioned above.

Vulkan format	Known YUV format names	Bits / channel	Chroma subsampling	Layout type	Bits / pixel
R8G8B8A8_UNORM	AYUV^[1]	8	4:4:4	Packed	32
G8_B8_R8_3PLANE_444_UNORM	YV42^[2]/YV24^[1,2]	8	4:4:4	Planar	24
B8G8R8G8_422_UNORM G8B8G8R8_422_UNORM	UYVY/VYUY^[1] YUY2/YVYU^[1]	8 8	4:2:2	Packed	16^[3]
G8_B8R8_2PLANE_422_UNORM	NV16^[2]/NV61^[1,2]	8	4:2:2	Semi-planar	16
G8_B8_R8_3PLANE_422_UNORM	YV61^[2]/YV16^[1,2]	8	4:2:2	Planar	16
G8_B8R8_2PLANE_420_UNORM	NV12^[2]/NV21^[1,2]	8	4:2:0	Semi-planar	12
G8_B8_R8_3PLANE_420_UNORM	YV21^[2]/YV12^[1,2]	8	4:2:0	Planar	12
G10X6_B10X6_R10X6_3PLANE_444_UNORM_3PACK16 G12X4_B12X4_R12X4_3PLANE_444_UNORM_3PACK16 G16_B16_R16_3PLANE_444_UNORM	?^[1] ?^[1] ?^[1]	10 12 16	4:4:4	Planar	48
R10X6G10X6B10X6A10X6_UNORM_4PACK16 R12X4G12X4B12X4A12X4_UNORM_4PACK16 R16G16B16A16_UNORM	? ? ?	10 12 16	4:4:4	Packed	64
B10X6G10X6R10X6G10X6_422_UNORM_4PACK16 G10X6B10X6G10X6R10X6_422_UNORM_4PACK16 B12X4G12X4R12X4G12X4_422_UNORM_4PACK16 G12X4B12X4G12X4R12X4_422_UNORM_4PACK16 B16G16R16G16_422_UNORM G16B16G16R16_422_UNORM	UYVP/VYUP^[1] YUYP/YVYP^[1] ? ? ? ?	10 10 12 12 16 16	4:2:2	Packed	32^[3]
G10X6_B10X6R10X6_2PLANE_422_UNORM_3PACK16 G12X4_B12X4R12X4_2PLANE_422_UNORM_3PACK16 G16_B16R16_2PLANE_422_UNORM	? ? ?	10 12 16	4:2:2	Semi-planar	32
G10X6_B10X6_R10X6_3PLANE_422_UNORM_3PACK16 G12X4_B12X4_R12X4_3PLANE_422_UNORM_3PACK16 G16_B16_R16_3PLANE_422_UNORM	? ? ?	10 12 16	4:2:2	Planar	32
G10X6_B10X6R10X6_2PLANE_420_UNORM_3PACK16 G12X4_B12X4R12X4_2PLANE_420_UNORM_3PACK16 G16_B16R16_2PLANE_420_UNORM	? ? ?	10 12 16	4:2:0	Semi-planar	24
G10X6_B10X6_R10X6_3PLANE_420_UNORM_3PACK16 G12X4_B12X4_R12X4_3PLANE_420_UNORM_3PACK16 G16_B16_R16_3PLANE_420_UNORM	? ? ?	10 12 16	4:2:0	Planar	24

List of available Vulkan YUV texture formats.
[1] Requires appropriate channel swizzle
[2] Depends on whether the implementation uses a contiguous representation
[3] Contains two pixels per texel, i.e. two luma samples per texel

Due to the limited set of color space conversion presets supported by the hardware/driver/API, it’s quite likely that any Vulkan based library or application that needs to handle arbitrary YCbCr data will need to use raw per-plane YUV texture access in the general case, as that allows supporting arbitrary types of YCbCr and RGB color spaces. Nonetheless, developers should consider adding fast-paths to their products using the built-in YCbCr conversion samplers where support is available, especially in power or performance constrained systems like mobile devices.
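
To give an idea of what the manual conversion after raw per-plane access involves, here is a minimal sketch of one particular YCbCr-to-RGB conversion. It assumes 8-bit full-range data with the BT.601 coefficients used by JPEG/JFIF; other color spaces and ranges use different constants, and in practice this logic would live in a shader rather than in Python. The function name is hypothetical.

```python
def ycbcr_to_rgb(y, cb, cr):
    """Convert one 8-bit full-range BT.601 (JFIF style) YCbCr sample to RGB."""

    def clamp(value):
        # Round to the nearest integer and clamp to the 8-bit range.
        return max(0, min(255, int(round(value))))

    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp(r), clamp(g), clamp(b)


print(ycbcr_to_rgb(128, 128, 128))   # neutral chroma yields a gray value
```

Supporting another color space in this scheme only means swapping the coefficients (and the range handling), which is exactly the flexibility the fixed conversion presets lack.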

Encoding and decoding process

The toolset of a complete video codec is usually extremely wide, both in terms of the set of supported input/output frame formats, the method used to encode/decode those, etc. As such, usually video codec standards define multiple profiles that mandate certain capabilities, targeting specific classes of applications. In addition, video codec standards usually also define a set of levels that define corresponding decoder performance requirements (e.g. sample rate or bit rate). Video decoder implementations need to support the specific profile a compressed video was encoded with in order to be able to decode it, and need to support a specific level to be able to decode the video in real-time.

The encoder side of things is less constrained and uniform. While the specific codec, profile, and level limit the set of compression and encoding schemes available, individual encoder implementations, be they software or hardware encoders, may use different techniques and algorithms to produce a conforming bitstream. In particular, fitting within a particular bit rate budget either constrained by the target codec profile level or by the storage/transmission limits is typically controlled using vendor or implementation specific configuration commands and parameters. These mechanisms are collectively referred to as rate control and have a direct effect on the achieved compression rate, visual quality, and performance on a given encoder implementation.

Both the encoding and decoding processes employ a special resource pool called the decoded picture buffer (DPB). This pool stores the reference pictures (usually frames) used during the encoding and decoding of P-frames and B-frames. As discussed earlier, depending on the codec and implementation, individual P-frames and B-frames may refer to more than one reference frame, and they may each use different frames as their reference, hence in the general case the DPB needs to be able to hold more than just a single reference picture. Codecs (and individual profiles within them) generally define an upper bound on the DPB capacity.

A frame/picture is added to the DPB if it is planned to be used as a reference frame of a subsequently encoded or decoded frame, and is later removed from the DPB when no further frames need to refer to it anymore. Thus the DPB is a container for transient image data needed during the encoding and decoding process.

Encoder and decoder operation with respect to DPB management.
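
The bookkeeping side of the DPB can be modeled with a few lines of code. The class below is a deliberately simplified sketch: it assumes pictures are identified by an arbitrary id, that the capacity is a fixed number of slots, and that the caller decides when a reference is no longer needed. The class and method names are made up for this illustration and do not correspond to any codec or API.

```python
class DecodedPictureBuffer:
    """Toy model of a DPB: a bounded pool of reference pictures."""

    def __init__(self, capacity):
        self.capacity = capacity   # upper bound defined by the codec/profile
        self.slots = {}            # picture id -> picture data

    def add_reference(self, picture_id, picture):
        """Store a picture that later pictures will refer to."""
        if len(self.slots) >= self.capacity:
            raise RuntimeError("DPB is full")
        self.slots[picture_id] = picture

    def get_references(self, picture_ids):
        """Fetch the references needed to code a P- or B-picture."""
        missing = [pid for pid in picture_ids if pid not in self.slots]
        if missing:
            raise KeyError(f"missing reference pictures: {missing}")
        return [self.slots[pid] for pid in picture_ids]

    def remove_reference(self, picture_id):
        """Drop a picture once no further picture refers to it."""
        self.slots.pop(picture_id, None)

    def clear(self):
        """Drop all references at once."""
        self.slots.clear()


dpb = DecodedPictureBuffer(capacity=2)
dpb.add_reference(0, "I0")
dpb.add_reference(3, "P3")
print(dpb.get_references([0, 3]))
```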

Adding reference frames to the DPB is trivial during decoding: when decoding a frame that is marked as a reference frame the decoder simply adds the output of the frame decoding to the DPB, either by reference or by copy. During encoding, however, we cannot simply use the original incoming image frame data as the reference frame, at least not in case of lossy compression schemes (which is usually the case with video codecs). If we did so, we would construct the delta-frame based on the original image frame, which is not known to the decoder, thus decompressing the delta-frame would produce different (incorrect) results.

Instead, when encoding a frame that is marked as a reference frame the encoder will reconstruct the frame from the output of the frame encoding and add this reconstructed frame to the DPB. This is also the reason behind why the DPB is often also referred to as the reconstructed picture buffer in the context of video encoding.

It’s worth noting here though that this frame reconstruction process does not necessarily require decompressing the bitstream resulting from the frame encoding, as usually encoders can simply generate the reconstructed frame from the original input image frame and internal data collected during the encoding process. Nonetheless, due to the reconstructed frame being generally different compared to the input image frame, adding a reference frame to the DPB during encoding requires a copy.
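
Why the encoder has to predict from the reconstructed frame can be shown with a toy model. The sketch below treats each "frame" as a single integer, uses plain quantization of the delta as the only lossy step, and treats the first frame as a lossless I-frame; all of this is a simplifying assumption for illustration and not how any real codec works.

```python
def quantize(delta, step):
    """Lossy step: round the delta to the nearest multiple of step."""
    return step * ((delta + step // 2) // step)


def encode(frames, step, use_reconstructed_reference):
    """Return the quantized deltas for a sequence of scalar 'frames'."""
    deltas = []
    reference = frames[0]                  # first frame: lossless I-frame
    for frame in frames[1:]:
        delta = quantize(frame - reference, step)
        deltas.append(delta)
        if use_reconstructed_reference:
            reference = reference + delta  # exactly what the decoder will have
        else:
            reference = frame              # original input, unknown to the decoder
    return deltas


def decode(first_frame, deltas):
    """The decoder only ever sees its own reconstructed frames."""
    frames = [first_frame]
    for delta in deltas:
        frames.append(frames[-1] + delta)
    return frames


frames = [0, 3, 6, 9, 12]
print(decode(frames[0], encode(frames, 4, use_reconstructed_reference=True)))
print(decode(frames[0], encode(frames, 4, use_reconstructed_reference=False)))
```

With the reconstructed reference the decoded values stay within half a quantization step of the input, whereas predicting from the original input lets the error accumulate from frame to frame.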

As the reference frames need to be available at the time a P-frame or B-frame referring to them is encoded or decoded, frames used as reference frames always need to be encoded/decoded before any dependent frames. This is a given in case of a simple encoder or decoder that simply processes subsequent frames in the video sequence one after another, however, due to the nature of B-frames which may use “future” frames as references, such a scheme cannot work in the general case.

The consequence of that is that we need to distinguish between coding order, i.e. the order in which we encode/decode individual frames/pictures/slices to/from the bitstream, and display order, i.e. the order in which the frames are expected to be presented during video playback. In the simple cases the two orders match, but when using B-frames the two diverge.

Encoder and decoder operation in the presence of B-frames.
Note how frames used as backward (future) references are encoded and decoded before the dependent B-frames.
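
The reordering can be sketched as follows, assuming the simple case where every B-frame refers only to the nearest preceding and following anchor frame (I or P). Real encoders may use more elaborate reference structures, so this is an illustration of the principle rather than a general algorithm; the function name is hypothetical.

```python
def coding_order(display_types):
    """Map frame types in display order to display indices in coding order.

    Assumes each B-frame only depends on its nearest surrounding anchors.
    """
    order = []
    pending_b = []                      # B-frames waiting for their future anchor
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append(index)
        else:
            order.append(index)         # anchor frame is coded first...
            order.extend(pending_b)     # ...then the B-frames that precede it
            pending_b.clear()
    order.extend(pending_b)             # trailing B-frames without a future anchor
    return order


print(coding_order("IBBPBBI"))
```

For the IBBPBBI pattern the P-frame at display index 3 is coded right after the first I-frame, ahead of the two B-frames that are displayed before it.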

As the presence of B-frames may have an effect on the latency between decoding and presentation of video frames, often codec profiles also define restrictions on the use of them, and decoder implementations have to be designed with this potential latency in mind.

Quite often events like data corruption or transmission errors may result in the inability to process certain parts of the incoming bitstream which may leave the DPB in an inconsistent state. This can cause decompression failures/artifacts in case of P- and B-frames that refer to reference frames that were “lost” due to such events. Eventually the effect of such failure events goes away once no new incoming frames refer to these lost reference frames and the DPB’s reference frame slots get replaced with new references.

Some codecs also define a special I-frame type called an IDR-frame (instantaneous decoder refresh frame) that, in addition to being intra-coded, also clears the contents of the DPB. That means all subsequent frames/pictures/slices are guaranteed to not refer to any frame decoded prior to the IDR-frame. This feature also makes seeking video streams easier and more responsive as the player can know for sure that it can safely start processing a stream at IDR-frame boundaries without caring about earlier bitstream data.

The new kid on the block

When it comes to hardware acceleration, most implementations focus strictly on image data processing with limited or no handling of metadata (codec parameter sets, etc.), as the actual image data compression/decompression is the most computationally intensive part of the process. Accordingly, producing a complete conformant bitstream out of the encoded frames, packaging them in any particular storage format, and, similarly, consuming those typically necessitates additional code around the use of the hardware accelerator or the use of 3rd party libraries designed for this purpose.

Vulkan Video exposes video encoding and decoding functionality through a series of extensions:

VK_KHR_video_queue provides the common infrastructure for video codec support
VK_KHR_video_encode_queue provides the encode specific APIs
VK_KHR_video_decode_queue provides the decode specific APIs
Additional encode and decode extensions provide the codec specific parts

These extensions provide direct access to the actually hardware accelerated parts of the video encode/decode process, offering explicit control over the following:

Issuing individual encode/decode operations on pictures with specific picture/slice types
Controlling the contents of the DPB, i.e. marking a picture as reference and specifying the references used by predicted pictures
Managing the rate control parameters used during video encoding
Video session objects providing a context for the often stateful encoding/decoding process, that enables handling multiple video streams in parallel or in an interleaved fashion

The programming model itself should be sufficiently familiar to developers accustomed to the rest of the Vulkan API. Video codec functionality is exposed through a set of new queue families with video encode and/or decode capabilities. Individual picture encode and decode operations are recorded in coding order into command buffers enqueueable to a queue of the appropriate queue family, with memory backing, resource management (e.g. image layout transitions), and synchronization being handled with the usual Vulkan mechanisms.

The provisional extension specifications still have some rough edges and certain under-specified behaviors, but considering their status and the fact that this is a major feature from a quite different domain than what graphics API design usually concerns itself with, it’s big news to have them available for public evaluation.

For those interested, NVIDIA already offers beta drivers with support for the provisional Vulkan Video extensions, and they even published an open-source video decode sample that, while still being largely work-in-progress as of the time of writing this article, should provide sufficient material for developers to get started with the API.

Custom memory allocators

Daniel Rákos — Wed, 03 Mar 2021 19:03:27 +0000

Most systems programmers are not new to creating and applying custom memory allocators in performance or memory constrained projects. However, the benefits of purpose-built memory allocators are often underestimated or overlooked by many programmers. This article aims to provide an overview of the motivation and advantages of deploying custom memory allocation schemes and presents a few common allocation strategies.

Just as with any other custom component that ought to replace a readily available general component, implementing custom memory allocators is often viewed as a symptom of NIH syndrome. That may be true in some cases, but such skepticism commonly stems from the false premise that the goal of writing a custom memory allocator is to create an objectively better general-purpose allocator. While that’s an admirable goal, especially for academia, the real goals of using custom memory allocators in a software project are usually far more modest yet better motivated, as we will see when we start exploring the objectives. But first, let’s start with clarifying what we really talk about here.

Nomenclature

When we talk about custom memory allocators in the context of application development what we really mean is a component that enables partitioning an operating system or runtime environment provided memory allocation into multiple smaller pieces, sometimes referring to such components as sub-allocators. This distinction between allocators and sub-allocators is irrelevant in practice, as technically even the operating system kernel memory allocator is actually a sub-allocator as it sub-allocates the physical memory and/or virtual address space available on the system.

What this means from a practical perspective is that a memory allocation request made by the application software to the runtime is actually served by a hierarchy of memory allocators and adding a custom memory allocator to the software means inserting yet another into this hierarchy, as depicted below:

Illustration of a typical memory allocator hierarchy in a system.
User code may use runtime provided heap allocator, OS kernel provided allocator, or custom memory allocators built on top of either of these.

Each allocator in the hierarchy is responsible for sub-partitioning a partition reserved by an allocator below it, thus in the worst case, if there’s no more room for a new allocation in any of the already reserved partitions at any level of the hierarchy, each allocator needs to request a new partition from the allocator below it, potentially all the way down to the page allocators provided by the operating system.

In addition, most modern allocators provided by the runtime and operating system themselves are usually comprised of multiple different allocation algorithms operating in tandem to be able to serve different types of allocations (think of e.g. different allocation sizes) in an efficient manner.

It is also worth noting at this point that the use of custom allocators isn’t limited to sub-partitioning system memory allocations: one can use the same components to sub-partition other types of memories, persistent storage, or really any contiguous addressable range.

Generally speaking, an allocator usually provides the following set of operations:

Allocate – reserves a sub-partition of the address range
Deallocate – frees a previously reserved sub-partition of the address range
Resize (optional) – grows or shrinks a previously reserved sub-partition of the address range

The last operation is not common as in many cases it is difficult or impossible to implement it (at least efficiently), but we mention it here as some use cases may rely on support for it as we will see later.

Objectives

The most common goals of implementing custom memory allocators are the following:

Improved performance of the operations provided by the allocator
Improved memory efficiency by reducing internal and/or external fragmentation
Improved performance through better spatial locality of related data and thus more efficient use of processor caches

However, recall that we’re not trying to beat the runtime or kernel provided allocators at their own game. Rather, it’s about knowing more about the allocation pattern of our software and thus being able to create a purpose-built allocator that may have functional limitations, may have worse performance or memory efficiency in the general case, but has better characteristics in the specific usage pattern our software employs.

General-purpose allocators provided by the runtime have to fulfill all sorts of requirements, including but not limited to the following:

They need to be able to allocate sub-partitions of arbitrary size, ranging from a single byte up to gigabytes or more
They need to be able to keep track of an arbitrarily large number of sub-partitions
They need to be able to serve allocation and deallocation requests in virtually any sequence
They need to be able to simultaneously serve requests from different threads
They need to be memory efficient, thus they often cannot pre-allocate large partitions to speed up subsequent sub-partition allocations

This is why they work reasonably well in all scenarios, but complying with all of these requirements comes at a cost.

Contrarily, in case of memory allocations needed by a specific software component we may have additional contextual information that can enable us to do better, e.g.:

We may know that we only need sub-partitions of a specific size or a specific set of sizes
We may know that there is an upper bound on the number of sub-partitions to ever exist at a given time
We may know that allocations and deallocations follow a specific pattern/sequence
We may know that we will only allocate sub-partitions on a specific thread or sets of threads
We may know the upper bound on the memory needed by all sub-partitions and we’re happy to reserve all of it up-front

Implementing custom allocators thus allows us to pay only for what we actually need/use, and this is why a custom memory allocator can often effortlessly beat even the best general-purpose allocators in the context of a particular piece of code.

Now that we have the right premise, let’s see a couple of common memory allocation strategies and techniques that aim to increase allocation/deallocation performance.

LIFO allocator

Yes, we’re talking about a stack. While almost all programming languages provide a built-in stack to store local variables, function/method parameters, etc., they usually limit such allocations to objects whose size is known at compile-time. One notable exception is the C99 standard that supports variable-length arrays that can serve as backing store for dynamically sized objects on the stack. There are also other mechanisms provided by runtimes to dynamically allocate memory on the stack (e.g. alloca and _malloca) but usually they have their own set of quirks, thus usually it’s better to implement a LIFO allocator with a backing store allocated from the heap.

It’s no surprise that a LIFO allocator comes in handy in similar situations as a stack, i.e. when an allocation matches some scope within the call hierarchy, and is often used to bypass the programming language’s limitations on allocating memory for dynamically sized scope-local objects.

Implementing such an allocator is extremely simple: only a backing memory range and a stack pointer are necessary. Allocating and deallocating are thus constant time and trivial, as they only involve incrementing and decrementing the stack pointer, respectively.
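
A minimal sketch of such an allocator is shown below. To keep it short it hands out offsets into an abstract address range of a fixed capacity instead of raw pointers, ignores alignment, and trusts the caller to deallocate in strict LIFO order; the class and its method names are assumptions made for this illustration.

```python
class LifoAllocator:
    """Stack-like allocator over an address range of fixed capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.top = 0                     # the stack pointer

    def allocate(self, size):
        """Reserve size units at the top of the stack and return their offset."""
        if size <= 0:
            raise ValueError("size must be positive")
        if self.top + size > self.capacity:
            raise MemoryError("LIFO allocator is out of space")
        offset = self.top
        self.top += size                 # bump the stack pointer
        return offset

    def deallocate(self, offset):
        """Release the most recent allocation by rewinding the stack pointer.

        No check is made that offset really belongs to the latest allocation.
        """
        if not 0 <= offset < self.top:
            raise ValueError("offset is not inside the live range")
        self.top = offset


stack = LifoAllocator(capacity=64)
a = stack.allocate(16)
b = stack.allocate(8)
stack.deallocate(b)                      # must be released before a
stack.deallocate(a)
print(stack.top)
```

Rewinding to the offset of the released sub-partition avoids having to store its size, at the cost of silently accepting out-of-order deallocations.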

It is easy to note that the basic implementation of this allocator requires all the memory that is available for sub-partitioning to be reserved up-front and to remain allocated for the entire lifetime of the allocator, and that the total size of the sub-partitions reserved at any given time cannot surpass this capacity. This, however, does not have to be the case. There are two common strategies to relax these restrictions, if needed. The first one is to follow the same solution that the built-in stack uses.

Nowadays application developers rarely have to mess around with the stack size configured for the application or its threads, but back in the days before virtual memory was a thing one had to pay attention to it, as making it too big consumed too much physical memory while making it too small left the door open for stack overflow errors. While memory is still something that we don’t want to overcommit on, default stack sizes are usually extremely large yet they only consume as many physical memory pages as needed, thanks to runtimes only allocating virtual address space for the stack and backing it with physical pages on demand. The same strategy can be employed by a custom LIFO allocator.

This still requires knowing the absolute upper limit for the size of the partition though. One could think that a runtime/kernel allocator supporting resize operations would be able to solve the problem of needing to increase the size of the partition, but that’s not the case in practice. While resize operations may as well be implemented with the same mechanism, i.e. allocate and commit further memory at the end of the partition, there’s no guarantee that the address space is actually available at the time of the resize, as other partitions may be using it. In such cases resize operations typically reallocate the virtual address range corresponding to the partition at a different location with sufficient contiguous address space, however, that’s usually unacceptable from the perspective of a custom allocator as it would invalidate all previous sub-partition addresses, unless the old address space is preserved as well which isn’t usually the case and can easily proliferate into significant virtual address space waste after multiple resize operations.

Another approach is to simply maintain a chain of partitions in the LIFO allocator and allocate a new partition every time a new allocation request arrives that doesn’t fit in the existing one. This can be thought of as a two-level stack (stack of stacks), where the first-level stack holds the backing partitions and the second-level stacks hold the sub-partitions. One drawback of this approach is the possibility of internal fragmentation, thus the size of the individual allocations made by the allocator should be selected appropriately to minimize that.

LIFO allocators: basic (left), virtual memory managed (middle), two-level (right).
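
The two-level variant could be sketched along the following lines. As before, the sketch works with offsets instead of pointers, assumes fixed-size backing partitions, and relies on the caller following strict LIFO order; appending to the list of stack pointers stands in for requesting a new partition from the underlying allocator. All names are hypothetical.

```python
class TwoLevelLifoAllocator:
    """LIFO allocator backed by a chain of fixed-size partitions."""

    def __init__(self, partition_size):
        self.partition_size = partition_size
        self.tops = [0]        # one stack pointer per backing partition
        self.active = 0        # index of the partition currently allocated from

    def allocate(self, size):
        """Return a (partition index, offset) handle."""
        if not 0 < size <= self.partition_size:
            raise ValueError("size must fit in a single partition")
        if self.tops[self.active] + size > self.partition_size:
            # The rest of the active partition is left unused.
            self.active += 1
            if self.active == len(self.tops):
                self.tops.append(0)      # stands in for allocating a new partition
        offset = self.tops[self.active]
        self.tops[self.active] = offset + size
        return self.active, offset

    def deallocate(self, handle):
        """Release the most recent allocation (LIFO order is not verified)."""
        partition, offset = handle
        for index in range(partition + 1, self.active + 1):
            self.tops[index] = 0         # later partitions become empty
        self.tops[partition] = offset
        self.active = partition
        if self.active > 0 and self.tops[self.active] == 0:
            self.active -= 1             # step back, but keep the spare partition


stack = TwoLevelLifoAllocator(partition_size=16)
first = stack.allocate(10)
second = stack.allocate(10)              # does not fit, goes to a second partition
print(first, second)
```

Note that emptied partitions are kept around as spares rather than being released immediately, which trades some memory for not having to allocate a partition again if the stack grows back.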

Obviously, there are additional non-trivial costs involved in these cases to commit physical pages to the pre-reserved virtual address range or to allocate additional partitions in a two-level stack, so not all allocations will get away with the trivial constant time cost of updating the stack pointer. Thus it’s always a balance between memory usage, performance, and flexibility and the appropriate variant should be chosen based on the needs of the application code in question.

The same applies if we want to decommit pages or deallocate partitions when the stack shrinks sufficiently and we want to free up memory. It is generally advised to not do so unless necessary, as it can cause continuous grow/shrink costs if the stack pointer keeps moving around the boundary. Therefore it’s better to always keep around some amount of additional free space (committed pages or additional partitions) to avoid such pathological cases.

FIFO allocator

This type of allocator comes in handy when partitions are allocated in the same order as they are deallocated. This enables treating the backing partition as a queue where an allocation request is treated as data being pushed at the front and deallocation is treated as data being pulled at the back. Thus all the tracking that is necessary is a head and a tail pointer, therefore allocation and deallocation are constant time and as trivial as incrementing the head and tail pointers, respectively.

FIFO allocators are very useful for allocating transient objects, a common example being software components operating as non-branching pipelines where the sub-partitions provide the storage for the input and output data of the pipeline stages.

FIFO allocator used to allocate inputs and outputs for the stages of a processing pipeline.
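
A bare-bones version of such an allocator might look like the sketch below. It hands out offsets into a fixed-capacity range, does not wrap around, and expects the caller to pass the size of the sub-partition on deallocation so that no per-allocation bookkeeping is needed. Resetting both pointers when the queue becomes empty is an extra convenience added here, and the names are made up for this illustration.

```python
class FifoAllocator:
    """Queue-like allocator: allocate at the head, deallocate at the tail."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.head = 0          # where the next allocation is placed
        self.tail = 0          # start of the oldest live allocation

    def allocate(self, size):
        if size <= 0:
            raise ValueError("size must be positive")
        if self.head + size > self.capacity:
            raise MemoryError("FIFO allocator is out of space")
        offset = self.head
        self.head += size
        return offset

    def deallocate(self, offset, size):
        """Release the oldest live allocation."""
        if offset != self.tail:
            raise ValueError("only the oldest allocation can be released")
        self.tail += size
        if self.tail == self.head:
            # Queue is empty: start over from the beginning of the range.
            self.head = self.tail = 0


queue = FifoAllocator(capacity=32)
a = queue.allocate(8)
b = queue.allocate(8)
queue.deallocate(a, 8)
queue.deallocate(b, 8)
print(queue.head, queue.tail)
```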

If the maximum required memory isn’t known up-front then the same approaches presented for the LIFO allocator can be used to reserve additional pages or partitions for more memory, though the same caveats apply as well.

In addition, in case of FIFO allocators there’s sort of a third option as the back of the partition becomes progressively free as sub-partitions are deallocated, thus allocation can be performed in a round-robin fashion by treating the partition as a circular buffer. In its simplest form this may cause fragmentation at the end of the partition if the next allocation no longer fits there and thus has to be placed at the beginning. This, however, can be avoided by allocating twice the amount of virtual address space for the partition and mapping the physical memory to both halves of this address space. In such a setup allocating a contiguous sub-partition is always possible as we bridge the gap between the end and the beginning of the partition’s physical memory.

By using the mechanism above the partition sub-allocated by the FIFO allocator needs to be only as big as the maximum memory needed by any single stage of the pipeline.
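As a minimal sketch of the simplest circular buffer variant (without the double mapping of virtual memory, so an allocation that doesn’t fit at the end is placed at the beginning and leaves a gap behind), the tracking could look like the following. The class name `FifoAllocator`, the use of offsets instead of pointers, and the convention that the caller passes the size back on deallocation are all assumptions made for illustration, and alignment handling is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical FIFO allocator handing out offsets into a backing partition
// of `capacity` bytes. Alignment handling is omitted for brevity.
class FifoAllocator {
public:
    explicit FifoAllocator(std::size_t capacity) : capacity_(capacity) {}

    // Returns the offset of a contiguous sub-partition, or nullopt if the
    // request doesn't fit at the moment.
    std::optional<std::size_t> allocate(std::size_t size) {
        if (size == 0 || size > capacity_) return std::nullopt;
        if (empty()) head_ = tail_ = 0;  // nothing live: start over
        if (!wrapped_) {
            if (capacity_ - head_ < size) {
                // Doesn't fit at the end: retry at the beginning, leaving
                // the bytes in [head_, capacity_) unused for now.
                if (tail_ < size) return std::nullopt;
                wrap_ = head_;
                head_ = 0;
                wrapped_ = true;
            }
        } else if (tail_ - head_ < size) {
            return std::nullopt;  // would run into the oldest live allocation
        }
        const std::size_t offset = head_;
        head_ += size;
        return offset;
    }

    // Releases the oldest live allocation. The caller passes its size back
    // because no per-allocation header is stored.
    void deallocate(std::size_t size) {
        assert(!empty());
        tail_ += size;
        if (wrapped_ && tail_ == wrap_) {  // consumed everything before the gap
            tail_ = 0;
            wrapped_ = false;
        }
    }

    bool empty() const { return !wrapped_ && head_ == tail_; }

private:
    std::size_t capacity_;
    std::size_t head_ = 0;  // where the next allocation goes
    std::size_t tail_ = 0;  // start of the oldest live allocation
    std::size_t wrap_ = 0;  // end of the live data in front of the gap
    bool wrapped_ = false;  // true while head_ has wrapped but tail_ hasn't
};
```

With the double-mapped variant the `wrapped_`/`wrap_` bookkeeping would disappear entirely, as the head and tail could simply be advanced modulo the partition size.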

FIFO allocators can also be used to serve branching pipelines by using a separate FIFO allocator for each new branch forked. If we would like to avoid that, the FIFO allocator can be extended to support allocations and deallocations issued in arbitrary sequence, thus enabling it to serve parallel branches simultaneously. That, however, has the following implications:

Deallocation of a sub-partition that isn’t the backmost live allocation will be delayed until it is, thus deallocation may not happen immediately
Additional tracking is necessary to accomplish that which comes at a performance cost
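The additional tracking mentioned above can be sketched as a small queue of records that sits next to the FIFO allocator. The names `DeferredFifoTracker`, `push`, and `release` are hypothetical, and the sketch only does the bookkeeping: it reports how many bytes became reclaimable at the back, which the owning allocator would then use to advance its tail pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical bookkeeping for a FIFO allocator that accepts deallocations
// in arbitrary order. Records are kept oldest-first.
class DeferredFifoTracker {
public:
    using Ticket = std::size_t;

    // Records a new allocation of `size` bytes and returns a ticket that
    // identifies it on release.
    Ticket push(std::size_t size) {
        records_.push_back({size, false});
        return first_ticket_ + records_.size() - 1;
    }

    // Marks an allocation as released. Returns the number of bytes that
    // became reclaimable at the back of the queue, which is zero while an
    // older allocation is still live.
    std::size_t release(Ticket ticket) {
        assert(ticket >= first_ticket_ &&
               ticket - first_ticket_ < records_.size());
        records_[ticket - first_ticket_].released = true;
        std::size_t reclaimed = 0;
        while (!records_.empty() && records_.front().released) {
            reclaimed += records_.front().size;
            records_.pop_front();
            ++first_ticket_;
        }
        return reclaimed;
    }

private:
    struct Record {
        std::size_t size;
        bool released;
    };
    std::deque<Record> records_;
    Ticket first_ticket_ = 0;  // ticket of the oldest record still tracked
};
```

Each record is pushed and popped exactly once, so the cost of the delayed reclamation is constant per allocation when averaged over time, even though a single `release` call may retire several records.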

Again, it’s always a trade-off between memory usage, performance, and flexibility.

Size constrained allocators

As you have probably noticed the pattern already, the performance benefits of custom allocators usually come from taking advantage of certain constraints we impose on the flexibility of the allocator compared to a general-purpose one. The examples presented so far limited the order in which allocation and deallocation requests can follow each other. This time we look at custom allocators that limit the size of individual sub-partitions instead.

It’s quite common for certain software modules to frequently allocate/deallocate instances of the same type of object that have a specific size. An allocator specialized for this usage pattern is typically referred to as a pool allocator or memory pool.

It is surprisingly simple to construct a custom allocator that is able to serve allocation/deallocation requests of such objects in constant time: we pre-allocate a memory partition suitable to hold a certain number of such objects and maintain a singly linked list of available entries that we pop from on allocation and push to on deallocation.
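A minimal sketch of such a fixed-capacity pool could look like the following. The class name `PoolAllocator` and its interface are assumptions for illustration, and the free list is stored as an array of slot indices next to the slots rather than inside them. Slots are only as aligned as the chosen slot size allows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-capacity pool of equally sized slots. The singly
// linked free list is stored as slot indices in a separate array.
class PoolAllocator {
public:
    PoolAllocator(std::size_t slot_size, std::uint32_t slot_count)
        : slot_size_(slot_size),
          slots_(slot_size * slot_count),
          next_free_(slot_count) {
        assert(slot_size > 0);
        // Initially every slot is free and links to its successor; the
        // value slot_count acts as the end-of-list marker.
        for (std::uint32_t i = 0; i < slot_count; ++i) next_free_[i] = i + 1;
    }

    // Pops a slot from the free list, or returns nullptr if none is left.
    void* allocate() {
        if (free_head_ == next_free_.size()) return nullptr;
        const std::uint32_t index = free_head_;
        free_head_ = next_free_[index];
        return slots_.data() + index * slot_size_;
    }

    // Pushes the slot back to the free list. The slot index is derived
    // from the address because all slots have the same size.
    void deallocate(void* ptr) {
        const auto offset = static_cast<std::size_t>(
            static_cast<std::byte*>(ptr) - slots_.data());
        assert(offset % slot_size_ == 0);
        assert(offset / slot_size_ < next_free_.size());
        const auto index = static_cast<std::uint32_t>(offset / slot_size_);
        next_free_[index] = free_head_;
        free_head_ = index;
    }

private:
    std::size_t slot_size_;
    std::vector<std::byte> slots_;          // contiguous object storage
    std::vector<std::uint32_t> next_free_;  // free list links, one per slot
    std::uint32_t free_head_ = 0;           // index of the first free slot
};
```

Both operations touch only the list head and a single link, which is what makes them constant time.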

Such a custom allocator can be trivially extended to support an arbitrary number of such objects by having a two-level singly linked list where the first-level list contains the non-full partitions and the second-level list contains the individual slots available. Of course, the usual caveat applies that if we want to allow the number of partitions to grow or shrink dynamically then we still have to issue partition allocation/deallocation requests to the underlying memory allocator, but most requests can still be served in constant time.

Dynamic capacity pool allocator with two-level singly linked list tracking.

One thing to keep in mind when implementing a pool allocator is that it is generally better to keep the linked list control structure separate from the actual object slots to keep the slots uniformly aligned and contiguous in memory for better processor cache utilization.

That also brings us to the next thing that may come up when using object pools: user code often needs to allocate multiple slots at once and preferably they should be contiguous in memory; in fact, sometimes that’s a must-have. Supporting such additional requirements complicates the allocation scheme and generally allocations can no longer be served in constant time, as practically we’re pretty much back to the problem of a general-purpose allocator, just one that has a specific granularity. However, as in such cases you’re actually allocating memory for multiple objects at once, the additional cost of going with a more sophisticated allocation scheme is technically amortized across the number of objects instantiated.

It’s also worth taking the opportunity here to talk about multi-threading. Pool allocators implemented with such singly linked lists can be extended to support allocation/deallocation requests coming from multiple threads by using lock-free linked lists instead, and usually there are lock-free thread-safe solutions for most allocation schemes using atomic operations. Still, the usual principle of paying for only what you really need applies here as well, thus such solutions should only be employed when concurrent allocation/deallocation is actually a requirement.

Another typical option to support serving multiple threads is to use separate allocators per thread, though obviously that trades additional memory usage for improved performance, as each thread’s allocator now maintains its own partition or set of partitions it sub-allocates from.

The linked list based approach isn’t the only common implementation for memory pools. Often a simple bit-array based tracking works just as well, sometimes even better. While the time complexity of allocations is linear in the number of slots in this case, crunching through a contiguous range of memory is extremely efficient on modern processors due to the optimal use of the caches, and with a single 64-bit read we can easily check the availability of 64 slots at once. This is often the implementation of choice when support for reserving multiple slots in contiguous memory is desired or required, as finding contiguous slots and reserving them can be quickly done using bit masks.
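The bit-array approach can be sketched as follows. `BitmapPool` is a hypothetical name, slots are identified by index rather than by address, and, as a simplifying assumption, a contiguous run is never allowed to span two 64-bit words (a production implementation would likely lift that restriction and use a count-trailing-zeros instruction instead of the inner loop).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical bit-array tracked pool: one bit per slot, 1 = in use.
class BitmapPool {
public:
    explicit BitmapPool(std::size_t slot_count)
        : used_((slot_count + 63) / 64) {
        // Mark the padding bits past slot_count as used so that they are
        // never handed out.
        if (slot_count % 64 != 0) used_.back() = kAll << (slot_count % 64);
    }

    // Reserves `count` (1..64) contiguous slots and returns the index of
    // the first one, or nullopt if no suitable run is available.
    std::optional<std::size_t> allocate(unsigned count = 1) {
        if (count == 0 || count > 64) return std::nullopt;
        const std::uint64_t run = mask(count);
        for (std::size_t word = 0; word < used_.size(); ++word) {
            if (used_[word] == kAll) continue;  // 64 slots ruled out at once
            for (unsigned bit = 0; bit + count <= 64; ++bit) {
                if ((used_[word] & (run << bit)) == 0) {
                    used_[word] |= run << bit;
                    return word * 64 + bit;
                }
            }
        }
        return std::nullopt;
    }

    // Releases `count` slots starting at index `first`.
    void deallocate(std::size_t first, unsigned count = 1) {
        assert(count >= 1 && count <= 64 && first % 64 + count <= 64);
        used_[first / 64] &= ~(mask(count) << (first % 64));
    }

private:
    static constexpr std::uint64_t kAll = ~std::uint64_t{0};

    // Bit mask with the lowest `count` bits set.
    static std::uint64_t mask(unsigned count) {
        return count == 64 ? kAll : (std::uint64_t{1} << count) - 1;
    }

    std::vector<std::uint64_t> used_;
};
```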

It is rarely the case that a single pool allocator supporting objects of a single size is enough. In this case one can use a separate pool allocator for each object size needed, although that also comes at the cost of additional memory usage, just like using separate allocators per thread.

Instead, it’s possible to create a pool allocator that allows serving requests with more than just one size from the same partition(s). This clearly must have a performance cost, although, with a well constructed solution the cost can be limited to the logarithm of the number of different sizes needed to be supported. This leads us to another common allocation strategy that is also used by operating system kernels for large-scale sub-allocation: the buddy allocator.

The buddy memory allocation algorithm is an allocation scheme where usually a power-of-two sized partition is successively split into halves to try to give a best fit. The control structure is practically a binary tree where each subsequent level contains nodes representing the first and second half of the memory region of their parents. In the standard implementation allocations and deallocations are served in logarithmic time, where the upper bound is the depth of the tree, i.e. the number of times the partition can be successively split into two, which is the binary logarithm of the ratio between the partition size and the minimum block size.

Buddy allocator maintaining three 32 KB and one 64 KB allocations (allocated leaf nodes) in a 256 KB partition.

Strictly speaking, buddy allocators can only allocate/deallocate chunks in power-of-two multiples of the minimum block size. However, they can also be used as general-purpose allocators of arbitrarily sized chunks, with the caveat that in such uses this allocation scheme exhibits significant internal fragmentation.

Implementations usually use an array representation of the binary tree for efficiency which can be statically allocated knowing the partition size and the minimum allocation size that we plan to support. Similarly to earlier allocators, dynamic capacity can be supported by maintaining a linked list of such trees, though the use of nested trees isn’t uncommon either.
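One possible array-based formulation is sketched below. It is an assumption of this example, not necessarily what any particular implementation does, that each tree node stores the size of the largest free block in its subtree, that the allocator hands out offsets rather than pointers, and that both the partition size and the minimum block size are powers of two. The name `BuddyAllocator` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical buddy allocator over an implicit binary tree stored in an
// array (children of node i are 2*i+1 and 2*i+2).
class BuddyAllocator {
public:
    // Both sizes are assumed to be powers of two with total >= min_block.
    BuddyAllocator(std::size_t total, std::size_t min_block)
        : total_(total), min_block_(min_block),
          largest_free_(2 * (total / min_block) - 1) {
        std::size_t node_size = 2 * total;
        for (std::size_t i = 0; i < largest_free_.size(); ++i) {
            if (((i + 1) & i) == 0) node_size /= 2;  // first node of a level
            largest_free_[i] = node_size;
        }
    }

    std::optional<std::size_t> allocate(std::size_t size) {
        if (size == 0 || size > total_) return std::nullopt;
        std::size_t wanted = min_block_;
        while (wanted < size) wanted *= 2;  // round up to a block size
        if (largest_free_[0] < wanted) return std::nullopt;

        // Walk down, preferring the left child, until a node of the wanted
        // size is reached.
        std::size_t index = 0, node_size = total_;
        while (node_size != wanted) {
            const std::size_t left = 2 * index + 1;
            index = largest_free_[left] >= wanted ? left : left + 1;
            node_size /= 2;
        }
        largest_free_[index] = 0;  // mark the node as allocated
        const std::size_t offset = (index + 1) * node_size - total_;
        update_ancestors(index, node_size);
        return offset;
    }

    void deallocate(std::size_t offset) {
        // Start at the leaf covering the offset and walk up to the node
        // that was marked as allocated.
        std::size_t index = offset / min_block_ + total_ / min_block_ - 1;
        std::size_t node_size = min_block_;
        while (largest_free_[index] != 0) {
            if (index == 0) return;  // not a live allocation
            index = (index - 1) / 2;
            node_size *= 2;
        }
        largest_free_[index] = node_size;
        update_ancestors(index, node_size);
    }

private:
    // Recomputes the ancestors of a node; two fully free buddies merge
    // back into their parent.
    void update_ancestors(std::size_t index, std::size_t node_size) {
        while (index != 0) {
            index = (index - 1) / 2;
            node_size *= 2;
            const std::size_t l = largest_free_[2 * index + 1];
            const std::size_t r = largest_free_[2 * index + 2];
            largest_free_[index] =
                (l + r == node_size) ? node_size : std::max(l, r);
        }
    }

    std::size_t total_, min_block_;
    std::vector<std::size_t> largest_free_;
};
```

Both operations walk a single root-to-node path, so their cost is bounded by the depth of the tree.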

Count constrained allocators

We’ve already seen in all of our previously presented allocation algorithms that knowing the total size of allocations enables room for certain optimizations like allocating the backing partition up-front, and that knowing the size or set of sizes of allocations allows us to employ simpler and more efficient allocation schemes than what a general-purpose allocator has to be prepared for.

Another constraint on the usage pattern that can enable writing more efficient memory allocators is when we know the total number of allocations that can exist at any given time, as this grants us the ability to lay out our control structures in a fixed-size container, possibly in ways that are more processor cache friendly.

We’ve seen a few examples of this already, as the single partition variants of the pool allocator and the buddy allocator are similarly count constrained allocators while also being bound by size constraints. But here we will take a look at an example that focuses strictly on constraining the maximum number of allocations.

As an example, if we would like to support N allocation blocks in a partition where the free and allocated blocks are maintained using linked lists, then we know that there will be at most a total of 2*N+1 nodes in these lists at any given time (maximum N allocated blocks with free holes between them, and two more free blocks at the beginning and end of the partition), and the available node entry indices (nodes not being linked to any of the bookkeeping lists) can be maintained on a simple stack implemented using a fixed-size array.

This, contrary to traditional linked list based general-purpose allocators, provides the benefit that there isn’t a need to add headers/footers to the allocated sub-partitions. Thus accessing allocations may allow for better processor cache utilization and potentially lower fragmentation in case allocations have to conform to specific alignment requirements that would otherwise cause the blocks to straddle due to the headers/footers. As a bonus, such allocators are less prone to control structure corruption in case of buffer overflows/underflows.

Comparison of general-purpose header/footer based linked list allocator (top) and a count constrained linked list allocator example (bottom).
The count constrained variant uses separate control structure (block node array and unused node stack).
Notice how headers/footers can cause alignment imposed straddle (unavoidable free block between the two aligned allocations), and that the presence of headers/footers may result in sub-optimal use of cache lines for small allocations.

In case of a simple first-fit allocation policy, this allocation scheme allows for the typical allocation time complexity to be constant time with a linear worst case, and constant time deallocation, just like its unconstrained variant. However, due to the control structure being separate, the linear worst case is usually still faster in absolute terms due to better chance of cache hits while traversing the list of free blocks.

Just like unconstrained linked list based allocators, their count constrained variant can also employ binning free blocks based on their size to support best-fit or approximate best-fit allocation policies, but the advantage is, once again, that the corresponding control structures have a strict upper bound on their size, so we can statically allocate them and lay them out in a cache-friendly fashion, which fully generic solutions wouldn’t be able to do.

Still, before jumping to early conclusions that this is objectively a better allocator, don’t forget that we made compromises to achieve these benefits. First, the number of allocations is limited. Second, because there’s no mapping from the allocation’s address to the corresponding control node, we have to keep around the block node index along with the allocation’s address in order to be able to free a block in constant time. In fact this is typically the case with count constrained custom allocators, and the only reason we didn’t need to do so in case of pool allocators and buddy allocators is that those use fixed block sizes, thus the block index can be directly derived from the allocation’s address.
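To make the count constrained scheme more concrete, here is a minimal sketch under a few simplifying assumptions: instead of separate free and allocated lists it keeps a single address-ordered doubly linked list of all blocks (so first-fit allocation always walks the list), it hands out offsets instead of pointers, and it ignores alignment. The names `CountConstrainedAllocator` and `Allocation` are hypothetical. Note how the returned handle carries the node index along with the offset so that deallocation can find the control node in constant time.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical first-fit allocator supporting at most N live allocations.
// All bookkeeping lives in fixed-size arrays, separate from the partition.
template <std::uint32_t N>
class CountConstrainedAllocator {
    static constexpr std::uint32_t kNone = ~0u;
    static constexpr std::uint32_t kNodes = 2 * N + 1;
    struct Node {
        std::size_t offset, size;
        std::uint32_t prev, next;
        bool free;
    };

public:
    struct Allocation {
        std::size_t offset;
        std::uint32_t node;  // needed to free the block in constant time
    };

    explicit CountConstrainedAllocator(std::size_t capacity) {
        for (std::uint32_t i = 0; i < kNodes; ++i) unused_[i] = i;
        unused_count_ = kNodes;
        head_ = acquire();
        nodes_[head_] = Node{0, capacity, kNone, kNone, true};
    }

    std::optional<Allocation> allocate(std::size_t size) {
        if (size == 0 || live_ == N) return std::nullopt;
        for (std::uint32_t i = head_; i != kNone; i = nodes_[i].next) {
            Node& n = nodes_[i];
            if (!n.free || n.size < size) continue;
            if (n.size > size) {  // split off the remainder as a free node
                const std::uint32_t r = acquire();
                nodes_[r] = Node{n.offset + size, n.size - size, i, n.next, true};
                if (n.next != kNone) nodes_[n.next].prev = r;
                n.next = r;
                n.size = size;
            }
            n.free = false;
            ++live_;
            return Allocation{n.offset, i};
        }
        return std::nullopt;
    }

    void deallocate(const Allocation& a) {
        Node& n = nodes_[a.node];
        assert(!n.free && n.offset == a.offset);
        n.free = true;
        --live_;
        merge_with_next(a.node);  // absorb a free successor, if any
        if (n.prev != kNone && nodes_[n.prev].free) merge_with_next(n.prev);
    }

private:
    void merge_with_next(std::uint32_t i) {
        const std::uint32_t next = nodes_[i].next;
        if (next == kNone || !nodes_[next].free) return;
        nodes_[i].size += nodes_[next].size;
        nodes_[i].next = nodes_[next].next;
        if (nodes_[next].next != kNone) nodes_[nodes_[next].next].prev = i;
        release(next);
    }
    std::uint32_t acquire() {
        assert(unused_count_ > 0);  // guaranteed by the 2*N+1 bound
        return unused_[--unused_count_];
    }
    void release(std::uint32_t i) { unused_[unused_count_++] = i; }

    std::array<Node, kNodes> nodes_{};
    std::array<std::uint32_t, kNodes> unused_{};  // stack of unused node indices
    std::uint32_t unused_count_ = 0;
    std::uint32_t head_ = kNone;  // node of the block at offset 0
    std::uint32_t live_ = 0;      // number of live allocations
};
```

Because adjacent free blocks are always merged on deallocation, the number of nodes in use never exceeds 2*N+1, which is why the unused node stack can never run empty.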

Conclusion

Implementing custom memory allocators is no silver bullet, but if applied in the right scenarios it can be a great item on the tool belt of any software engineer. This article mainly focused on improving performance through the use of custom memory allocators, but some of the presented ideas can also be used to reduce memory usage.

We’ve seen that the benefits of using custom memory allocators stem from the implementation freedom enabled by constraining the problem space at hand compared to the general-purpose allocators provided by runtimes and operating systems that are mostly one-size-fits-all solutions (though usually using multiple different specific algorithms internally). As is generally the case in programming, everything is a trade-off between run-time performance, memory usage, and flexibility, and usually the means through which custom memory allocators achieve their goals is by sacrificing the last of these.

We’ve presented some of the most commonly used custom memory allocators, and a few ways to customize them. It is important to recognize, however, that this article in no way tried to provide a complete list of common custom allocators (if such a thing is even possible), or to suggest that any of the presented custom allocation strategies would improve the performance of a given piece of software.

As always, optimizations should be applied where there’s value in doing so. Only once it is known with certainty that memory allocation/deallocation is a performance bottleneck, and the code (or design) is sufficiently analyzed to understand the usage pattern of dynamic allocations and their rationale, is it time to consider using custom memory allocators.

Even then, it’s quite possible that a simple “cookie-cutter” custom allocator, such as some of the ones presented here, is not the right or the best solution for the problem at hand. Often the best approach is to derive a set of constraints from the analysis of the dynamic allocation usage pattern of the code in question, and construct a tailor-made allocator for it. That’s where, hopefully, the ideas and examples showcased in this article should be able to help.

Finally, we must remember that the best kind of dynamic allocation is no dynamic allocation at all, so if the storage for some data can be statically allocated without unreasonable memory usage or flexibility consequences then that’s probably the best thing to do.