
Introducing the r600/NIR back-end


Gert Wollny
July 07, 2022

Even though the hardware served by the r600 driver is ageing, it is still in wide use, and high-end cards from that generation still deliver good performance for mid-range gaming. When the drivers were originally implemented, TGSI was the dominant intermediate representation (IR) used by the shader compilers in Mesa. Several years back, NIR (new intermediate representation) was introduced, and it has since been adopted by most drivers in Mesa. Among other things, NIR allows adding hardware-specific opcodes, which makes it easy to transform the shader code into a form that translates directly to hardware-specific assembly. (To learn more about the features of NIR, take a look at Faith Ekstrand's excellent blog post.)

With that in mind, and the general sentiment that I should learn something about NIR, I got the idea to implement a NIR back-end for the r600 hardware while I was at XDC 2018. At that time, the driver created non-optimized assembly from the TGSI, which was then optimized by SB, an optimizer added to the r600 driver in 2013. This optimizer has quite a few quirks: it does not work for compute or tessellation shaders, or for shaders that use images or atomic operations. On top of that, it has some bugs that are difficult to fix, because the code base is not well documented and hard to understand.

The first NIR implementation

When I started this project, I did not have any idea how NIR was actually implemented or meant to be used. My only experience with compilers was implementing an improved register allocation pass for TGSI, so obviously I would make a lot of errors.

As someone who likes test-driven development, my approach for bringing up the back-end (that is, getting basic vertex and fragment shaders working) was to implement a function that creates a NIR shader from its printout. Then I would write a test with the expected assembly and implement the code to actually create that assembly. (The code to create a NIR shader from a printout can be found in a development branch; however, the way NIR is printed has changed since, so some re-design would be needed to make it useful again.)

Thanks to the working TGSI back-end, test expectations were easy to obtain. So I happily coded away, first getting the shaders to draw a simple triangle, then adding texturing, matrix operations, and so on.

Once the basic shaders were running and glxgears did what it was supposed to do, it was easy to move forward: run a set of piglit tests and, for those that crash or fail, see what the TGSI-generated assembly does and fill in the gaps.

With that it was simple to get vertex and fragment shaders working. The most challenging part was not getting the assembly right, but keeping the shader info in sync with what the TGSI back-end produced, because that is what the state code expects.

Up to this point, the difference between the assembler output of the TGSI code path and that of the NIR code path was not substantial. Granted, NIR was far better optimized from the start, so the assembly created from the IR was usually better, but SB would level the playing field (apart from the bugs).

By the end of 2019, when support for fragment and vertex shaders had been implemented, the back-end was upstreamed, and development continued there.

Transforming NIR

To support r600 properly, a few instructions, like nir_op_cube_r600, had already been added, and while these made some things easier, they did not really contribute to more optimized code. Only with the use of local data storage (LDS) in tessellation shaders did NIR really begin to shine: with the TGSI code path, the memory address is evaluated from scratch for every instruction that accesses LDS, resulting in considerable code duplication. Because LDS handling was only added after SB had landed, support for optimizing the generated code was initially not available, and because LDS reads actually require two dependent instructions (a fetch into a queue and a read from that queue), implementing this support is not trivial. Dave Airlie merged some code into SB to do this optimization, but it is still disabled.
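
The two-step nature of an LDS read can be pictured with a tiny queue model. This is a purely illustrative C sketch; the names and semantics are simplified stand-ins, not the real r600 ISA. A fetch pushes the value into a return queue, and a separate later instruction pops it, so the two halves are order-dependent and cannot be moved independently by an optimizer:

```c
#include <assert.h>

#define QSIZE 16

/* toy return queue for LDS fetches */
struct lds_queue { int data[QSIZE]; int head, tail; };

/* first half of a read: fetch an LDS word into the return queue */
static void lds_fetch(struct lds_queue *q, const int *lds, int addr)
{
    q->data[q->tail++ % QSIZE] = lds[addr];
}

/* second half: pop the oldest fetched value; results arrive strictly
   in fetch order, so reordering fetches also reorders the reads */
static int lds_read_ret(struct lds_queue *q)
{
    return q->data[q->head++ % QSIZE];
}
```

An optimizer that wants to move one half of such a pair has to track the queue state in between, which illustrates why adding this support to SB after the fact was not trivial.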

With NIR, things became simple: just add back-end-specific intrinsics for accessing LDS and lower the shared memory access, with all its address calculations, to these intrinsics, which can be translated directly to r600 assembly. Then let the NIR passes take care of removing the code duplication and optimizing the address calculation. With that, TessMark performance (with factor 32) improved from 32 FPS to 52 FPS, and a few rendering bugs were fixed too.
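
The payoff from lowering can be sketched as ordinary common-subexpression elimination. The address formula below is hypothetical, invented for illustration rather than taken from the actual r600 LDS layout; the point is that once the address math is visible to NIR as ordinary ALU code, identical base computations collapse into one:

```c
#include <assert.h>

/* hypothetical LDS address formula for a per-patch attribute */
static int lds_base(int rel_patch_id, int patch_stride, int attr_offset)
{
    return rel_patch_id * patch_stride + attr_offset;
}

/* before lowering: every access re-derives the full address */
static int read_naive(const int *lds, int patch, int stride, int off, int comp)
{
    return lds[lds_base(patch, stride, off) + comp];
}

/* after lowering to back-end intrinsics, NIR can hoist the base out
   of the accesses, so each read is just base plus component */
static int read_lowered(const int *lds, int base, int comp)
{
    return lds[base + comp];
}
```

Both variants read the same word; the lowered form simply does the multiply-add once per attribute instead of once per access.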


A somewhat dead end

From this point on, implementing further functionality was, again, straightforward. At the beginning of 2021 the NIR back-end had been brought to parity with the TGSI back-end for Evergreen-class hardware, and soft-fp64 had been tied in, so that support for OpenGL 4.5 could be advertised. By mid-2021 Cayman-class hardware was also supported, although without hardware fp64 support.

However, since I had jumped into the project without much knowledge about how to write a compiler, my initial design of the intermediate representation used in the back-end did not really plan for optimization or scheduling. Hence, the limitations that were true for the TGSI back-end in that regard were still true.

In addition, NIR itself is a constantly moving target. For instance, initially it was not possible to consume lowered IO in the r600 back-end, because some information about the semantics was lost. Later, when this data was added to the IO intrinsics, I changed the code to lower IO, because it makes things a lot easier, but this left a fair amount of dead code lying around. Also, the better I understood NIR, the more code became obsolete, yet it was still partially used and difficult to rip out. Hence, I decided that the back-end should be rewritten, taking into account the lessons learned, and this time some optimization and better scheduling would be built in.

The re-write

Because the functionality was already there, rewriting the back-end was quite easy, mostly copying and pasting the existing code and adjusting the interfaces. The new back-end implements some copy propagation (still a bit conservative) and a pre-scheduler. The final code arrangement is still made by the old assembler code, but it barely changes the pre-scheduled code: it mostly takes care of emitting additional instructions for indirect addressing, and it validates the created assembly.
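
The kind of copy propagation involved can be illustrated with a toy pass over a minimal three-address IR. Everything here is invented for the example; the real back-end IR and pass are considerably richer, and this sketch assumes straight-line code:

```c
#include <assert.h>

#define NREGS 8

/* toy three-address IR */
enum op { OP_MOV, OP_ADD };
struct inst { enum op op; int dst, src0, src1; };

/* forward copy propagation over straight-line code: rewrite uses of
   plain copies to their source, invalidating the mapping whenever
   the register involved is overwritten */
static void copy_propagate(struct inst *prog, int n)
{
    int copy_of[NREGS];
    for (int r = 0; r < NREGS; ++r)
        copy_of[r] = r;

    for (int i = 0; i < n; ++i) {
        prog[i].src0 = copy_of[prog[i].src0];
        if (prog[i].op == OP_ADD)
            prog[i].src1 = copy_of[prog[i].src1];

        /* a write to dst invalidates any copy relation involving it */
        for (int r = 0; r < NREGS; ++r)
            if (copy_of[r] == prog[i].dst)
                copy_of[r] = r;
        copy_of[prog[i].dst] =
            (prog[i].op == OP_MOV) ? prog[i].src0 : prog[i].dst;
    }
}
```

Running the pass on `r1 = r0; r2 = r1 + r1` makes the add read `r0` directly, leaving the copy dead and ready for removal by a subsequent dead-code pass.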

Thanks to the work done by Emma Anholt, the glsl-to-tgsi code path has been replaced by glsl-to-nir and nir-to-tgsi. With that, the TGSI the driver sees is already much better optimized than before, but a few problems remain: the per-access LDS address calculation is still done, instruction groups are not filled if a TGSI instruction does not use all four slots, and if a shader does not allow SB to be used, this is the code that is executed by the hardware.

With that in mind, adding a native NIR back-end still has its virtues.

Current state

As of now, the NIR back-end supports Evergreen- and Northern Islands-based hardware. It is, again, on par with the TGSI back-end, although a few piglit regressions remain. For some test results, I ran piglit on a Cayman PRO (Radeon HD 6950) and a Cedar (Radeon HD 5000). Because the GPU soft-reset sometimes crashes the graphics hardware in a way that makes a reboot necessary, I excluded a number of tests from the piglit runs.

Cayman

On Cayman, piglit was run as follows:

./piglit run gpu -x conditional \
                 -x glx \
                 -x tex3d-maxsize \
                 -x atomicity \
                 -x ssbo-atomiccompswap-int \
                 -x image_load_store \
                 -x gs-max-output \
                 -x spec@arb_compute_shader@execution@min-dvec4-double-large-group-size \
                 -j1 --dmesg -v --timeout 100

The NIR code provides quite a number of fixes, and it was possible to enable a few more features, so that the driver now advertises OpenGL 4.5.

              TGSI    NIR
pass:        42052  42400
fail:          590    468
crash:         190      0
skip:         2307   2295
changes:         0    370
fixes:           0    319
regressions:     0      7
total:       45175  45200

Cedar

On Cedar, piglit was run similarly to Cayman. Since the TGSI back-end doesn't support fp64 here, piglit was run on NIR once with the fp64 tests skipped, for a direct comparison with TGSI, and once including them:

SKIP_FP64="-x dmat -x fp64 -x double -x dvec"
./piglit run gpu -x conditional \
                 -x glx \
                 -x tex3d-maxsize \
                 -x atomicity \
                 $SKIP_FP64 \
                 -x ssbo-atomiccompswap-int -j1 --dmesg -v --timeout 100

Here we see a similar picture as with Cayman: the number of fixes outweighs the few regressions, and many more tests run because OpenGL 4.5 can be exposed with the NIR back-end.

              TGSI    NIR  NIR (fp64 included)
pass:        33629  36988  42951
fail:          612    555   1103
crash:           1      2      1
skip:         4858   1582   2322
changes:         0   3382   9900
fixes:           0     65     64
regressions:     0      8     10
total:       39135  39163  46416

Performance on Cayman

Performance-wise, the NIR back-end is mostly a win. A number of benchmarks were run using the Phoronix Test Suite, comparing TGSI and NIR, each with SB disabled and enabled.

Benchmark            TGSI    NIR  TGSI + SB  NIR + SB
OpenArena 0.8.8       108    108        114       114
Unigine Heaven       13.3   19.0       14.9      19.9
Unigine Sanctuary    79.7    110        114       126
Unigine Tropics      79.4   96.1       96.0       100
Unigine Valley       30.1   37.6       37.3      38.6
GLmark2 2021.08.30   2450   2484       2555      2561
Furmark              1600   1726       1750      1792
Tessmark              412    535        405       535
Xonotic 0.8.2         105     39        129       128


As can be seen in the table above, all but two test cases show a performance improvement, i.e. NIR standalone performs better than TGSI standalone, and NIR+SB performs better than TGSI+SB. In addition, even though SB is usually capable of improving the code produced by the NIR back-end, the performance win is generally smaller than when optimizing the code created by the TGSI back-end. There are two exceptions, though: for OpenArena no performance improvement can be seen, and Xonotic sees a significant performance regression with the NIR back-end compared to TGSI. The poor performance here can mostly be attributed to lost opportunities for copy propagation and for vectorizing gradient evaluations. SB can level the playing field, but since it also doesn't vectorize the gradient evaluation, a slight performance regression remains.

The detailed results can be found on openbenchmarking.org.

Where to go from here?

A number of improvements can still be applied to the NIR back-end:

  • The (pre-)scheduler should be changed to emit the instructions related to indirect addressing. So far, a whole instruction group is wasted for address loading, and a better scheduler could make use of the empty slots.
  • The register allocation should make use of the clause-local registers. On one hand, this would reduce register pressure and on the other hand, it would make shader execution faster.
  • The scheduler should be run a few times with different parameters, to be able to pick the best result and to work around failed register allocation.
  • Copy-propagation could be more aggressive in some cases; specifically, registers that need to be grouped together with the same register ID are pinned to the channel, but many instructions that require these register groups support swizzling. Since the x and y channels are used very often in texture instructions, pinning the registers to the channel increases register pressure; applying some swizzling could be used to relax this.
  • Currently, a forward scheduling algorithm is used. This makes it easy to fill the ALU slots, but has the disadvantage that instructions that just load a constant for later use may be scheduled early, increasing the register pressure. Some heuristic is used to reduce this effect, but it may hinder a more optimal scheduling.
  • Enabling tessellation in Tomb Raider 2013 may lead to a GPU hang; this should be fixed too.
  • Support R600- and R700-based graphics cards: currently some instructions are emitted that are not supported on these older graphics cards.
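
The pressure effect of scheduling constant loads too early (the next-to-last point above) can be quantified with a toy live-range model. The encoding below is invented for this sketch and assumes each value is defined exactly once:

```c
#include <assert.h>

#define NVALS 8

/* one scheduled instruction: the value it defines (-1 for none) and
   up to two values it uses (-1 for unused source slots) */
struct slot { int def, use0, use1; };

/* peak number of simultaneously live values for a given schedule */
static int peak_pressure(const struct slot *s, int n)
{
    int def_at[NVALS], last_use[NVALS];
    for (int v = 0; v < NVALS; ++v) {
        def_at[v] = -1;
        last_use[v] = -1;
    }
    for (int i = 0; i < n; ++i) {
        if (s[i].def >= 0)  def_at[s[i].def] = i;
        if (s[i].use0 >= 0) last_use[s[i].use0] = i;
        if (s[i].use1 >= 0) last_use[s[i].use1] = i;
    }

    int peak = 0;
    for (int i = 0; i < n; ++i) {
        int live = 0;
        for (int v = 0; v < NVALS; ++v)
            if (def_at[v] >= 0 && def_at[v] <= i && i <= last_use[v])
                ++live;
        if (live > peak)
            peak = live;
    }
    return peak;
}
```

Loading a constant at the top of the shader keeps it live across everything below, while deferring the load until just before its use lowers the peak; that trade-off is exactly what the scheduling heuristic tries to balance.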

Finally, for NIR to become the default back-end, all piglit regressions and the big performance regression with Xonotic must be fixed.

The new NIR code is available with the merge request. If you want to help test this code, the back-end is enabled with R600_DEBUG=nir. SB is enabled by default, but you can use R600_DEBUG=nir,nosb to run NIR with SB disabled. Play your favorite games with the back-end enabled and report bugs at https://gitlab.freedesktop.org/mesa/mesa/-/issues. If you are a developer and have the hardware, just pick a task from the list above and start fixing.

