What Changed When REX Was Re-Evaluated Against LLVM 22
bfs and hotspot tightened, nn moved from an LLVM 21 near-tie into a clearer REX win, and the srad rows remained close. The main conclusion is not that LLVM 22 was irrelevant; it is that the REX performance fixes survived the toolchain migration without requiring benchmark-specific repairs.

The previous post closed the LLVM 21 performance story. That mattered, but it was not the end of the engineering question.
The next question was more uncomfortable:
| |
That is the question this post answers.
This is not another optimization post. No new benchmark-specific rewrite was added for LLVM 22. The point was to test whether the current REX design remained valid when the underlying OpenMP offloading stack moved from LLVM 21 to a source-built LLVM 22 toolchain.
That distinction matters. If the result had changed, there would have been three possible explanations:
| |
The reevaluation found a little of the second case, some of the third case, and no confirmed instance of the first case.
Figure 1. The LLVM 22 pass was a controlled toolchain swap, not another compiler rewrite.
What Was Held Fixed
The most important methodological choice was to reuse the current regenerated benchmark tree under:
| |
That made the experiment narrower. We were not asking:
| |
We were asking:
| |
Native LLVM 22 binaries were rebuilt with the source-built clang 22.1.2 from:
| |
using the same OpenMP GPU-offloading shape as the LLVM 21 comparison:
| |
The REX binaries were rebuilt against the same LLVM 22 libomp and libomptarget by pointing the benchmark builds at the same install tree:
| |
The full-suite artifact was saved under:
| |
Focused reruns for the benchmarks whose margins moved were saved under:
| |
This setup is the reason the result is meaningful. REX did not get to regenerate a special LLVM 22 version of the benchmarks. Native LLVM did not get a different input program. Both variants moved to the same offloading runtime layer, and then the current generated artifacts had to stand on their own.
The First Pass Found Harness Bugs
The first useful result was not a performance number. It was that the harness still had benchmark-specific timing assumptions that were too easy to misuse.
Two mistakes showed up immediately.
The first was pathfinder. The benchmark prints a line named timer, which looks like a natural timing source. But the source tells a different story:
| |
That value is a cycle count, not elapsed seconds. Treating it as microseconds manufactures false precision. For the LLVM 22 report, pathfinder stayed on the same wall-clock proxy used in the fair LLVM 21 table.
The second mistake was b+tree. External wall-clock timing makes the run look like a much larger program because it includes file input, tree construction, command parsing, and other host work. That is not the offloading comparison point used in the fair LLVM 21 report. The correct benchmark signal remains the kernel section’s own Total time: line:
| |
Both corrections were applied before accepting the final LLVM 22 numbers.
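The two harness rules can also be enforced mechanically, so the next toolchain pass cannot repeat the same mistakes. A minimal sketch, assuming hypothetical log-line shapes for the two benchmarks (the real output formats may differ):

```python
import re

# Per-benchmark timing sources, mirroring the fair-table rules: parse a
# benchmark-owned timing line where one exists, otherwise fall back to the
# externally measured wall-clock proxy. The regexes are illustrative.
KERNEL_LINE = {
    "b+tree": re.compile(r"Total time:\s*([0-9.]+)"),
}
# Benchmarks whose printed "timer" value is a cycle count, never a duration.
CYCLE_COUNT_ONLY = {"pathfinder"}

def extract_seconds(benchmark: str, output: str, wall_clock_proxy: float) -> float:
    if benchmark in CYCLE_COUNT_ONLY:
        # pathfinder's 'timer' line is a cycle count, not elapsed seconds:
        # deliberately ignore it and keep the wall-clock proxy.
        return wall_clock_proxy
    pattern = KERNEL_LINE.get(benchmark)
    if pattern is not None:
        m = pattern.search(output)
        if m is None:
            raise ValueError(f"{benchmark}: expected kernel timing line missing")
        return float(m.group(1))
    return wall_clock_proxy
```

The point of the explicit `CYCLE_COUNT_ONLY` set is that the refusal to parse a tempting number is recorded in code, not just in a post-mortem.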
This is a small but important lesson from the whole journey: performance comparisons are only as good as the timing source. A compiler change can be correct and fast, and a report can still be wrong if it changes the metric halfway through the suite.
Figure 2. The LLVM 22 pass first fixed measurement consistency before interpreting margin changes.
The LLVM 22 Scoreboard
After the timing corrections, the high-level result was clean:
| |
The base table is:
| Benchmark | Timing Source | Samples | LLVM 22 Mean | REX Mean | Result |
|---|---|---|---|---|---|
| b+tree | kernel Total time | 10 | 0.013950 s | 0.009950 s | REX by 28.7% |
| bfs | benchmark compute | 5 | 0.026081 s | 0.023674 s | REX by 9.2% |
| gaussian | benchmark total incl. transfers | 3 | 0.253465 s | 0.131447 s | REX by 48.1% |
| heartwall | wall-clock proxy | 1 | 72.420105 s | 49.394615 s | REX by 31.8% |
| hotspot | wall-clock proxy | 3 | 1.470017 s | 1.425833 s | REX by 3.0% |
| nn | benchmark total | 5 | 0.302465 s | 0.291088 s | REX by 3.8% |
| pathfinder | wall-clock proxy | 3 | 9.588426 s | 9.399827 s | REX by 2.0% |
| srad_v1 | benchmark compute stage | 3 | 1.027374 s | 1.027008 s | effective tie |
| srad_v2 | wall-clock proxy | 3 | 1.042898 s | 1.027593 s | REX by 1.5% |
Figure 3. LLVM 22 moved margins, but it did not flip the suite into a native LLVM win.
The most important part of this table is not that every row has a REX-friendly base-table direction. Some rows are too close to treat as meaningful wins. srad_v1 is an effective tie, and srad_v2, hotspot, and pathfinder still use wall-clock proxy timing rather than a benchmark-owned GPU-total line.
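For reproducibility, the per-row margins are simple ratios of the two means. Assuming the table's "REX by X%" figures are relative to the native mean (which matches the numbers shown), a quick check against two rows:

```python
def rex_margin_pct(native_mean: float, rex_mean: float) -> float:
    """REX advantage as a percentage of the native LLVM mean."""
    return (native_mean - rex_mean) / native_mean * 100.0

# Base-table rows from above.
btree_margin = rex_margin_pct(0.013950, 0.009950)     # ~28.7
gaussian_margin = rex_margin_pct(0.253465, 0.131447)  # ~48.1
```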
The important part is narrower and stronger:
| |
That was the regression question we needed answered.
What Moved From LLVM 21
The LLVM 22 run did change the margins. That is expected. A new compiler and runtime stack can change host launch overhead, runtime initialization behavior, device image handling, optimization choices, and noise characteristics.
The useful question is whether those movements point to a REX flaw.
For b+tree, the answer is no. Native LLVM moved from about 0.014066 s under the fair LLVM 21 run to 0.013950 s under LLVM 22. REX stayed essentially flat at about 0.009950 s. The REX lead moved from 29.3% to 28.7%.
That is the pattern we wanted to see. The recovered read-only load path and direct search kernel still explain the result. LLVM 22 did not erase the benefit of restoring __ldg-style read-only behavior in the REX-generated irregular tree traversal.
For bfs, the margin tightened. The fair LLVM 21 table had REX ahead by 13.7%; the LLVM 22 base table had REX ahead by 9.2%. Native improved slightly, while the REX base-suite sample regressed modestly. The 10-run focus rerun still favored REX by 10.2%.
That is a real movement, but not a winner flip. It should be treated as a margin to keep watching, not as evidence that the direct-kernel design failed under LLVM 22.
For gaussian, both variants slowed by about 5%, and the relative result barely moved: 48.4% under LLVM 21 versus 48.1% under LLVM 22. The 5-run focus rerun still put the REX lead near 48.9%.
That is the cleanest shared-toolchain-shift pattern in the suite. If both variants move together and the relative gap stays stable, the likely explanation is a common runtime or compiler cost, not a REX-specific regression.
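That reasoning generalizes: if the relative gap is stable while both variants move, suspect a shared toolchain cost; if the gap changes materially, look at which variant moved. A hypothetical triage helper along those lines (the 1% gap tolerance is an illustrative threshold, not a rule from the report):

```python
def classify_shift(native_old: float, native_new: float,
                   rex_old: float, rex_new: float,
                   gap_tolerance_pct: float = 1.0) -> str:
    """Label a toolchain migration's effect on one benchmark row.

    Compares the relative gap (REX lead as a percentage of the native
    mean) before and after the migration. A stable gap suggests a
    common runtime or compiler cost rather than a variant-specific
    regression.
    """
    gap_old = (native_old - rex_old) / native_old * 100.0
    gap_new = (native_new - rex_new) / native_new * 100.0
    if abs(gap_new - gap_old) <= gap_tolerance_pct:
        return "common toolchain shift"  # gap stable: shared cost/benefit
    if gap_new < gap_old:
        return "margin tightened"        # native gained relative to REX
    return "margin widened"              # REX gained relative to native
```

Run on gaussian-shaped numbers (both variants ~5% slower, gap near 48%), it reports a common shift; run on bfs-shaped numbers, it reports a tightened margin.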
For heartwall, the direction stayed the same. Native improved slightly, REX slowed slightly, and the REX lead moved from 33.0% to 31.8%. Because the long full-suite capture still uses one sample, the exact percentage is less important than the direction. LLVM 22 did not change the story: the direct-kernel path still avoids enough generic OpenMP device scaffolding to matter.
For hotspot, the base-table REX lead narrowed from 4.9% to 3.0%. The 10-run focus rerun still favored REX by 7.1%, with much higher native variance. This is one of the rows where a single base table can be misleading. LLVM 22 changed the visible noise profile more than it changed the underlying conclusion.
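This is why the focus reruns report spread, not just means: with high native variance, a 3-sample base table can land almost anywhere inside the noise band. A sketch of the summary step, using illustrative made-up samples (not the report's raw data), where the native side is deliberately noisier:

```python
import statistics

# Hypothetical 10-run focus samples: native is noisier than REX.
native = [1.58, 1.39, 1.71, 1.44, 1.62, 1.37, 1.69, 1.41, 1.55, 1.48]
rex    = [1.42, 1.43, 1.41, 1.44, 1.42, 1.43, 1.42, 1.41, 1.43, 1.42]

def summarize(samples):
    """Mean and sample standard deviation for one variant's runs."""
    return statistics.mean(samples), statistics.stdev(samples)

n_mean, n_sd = summarize(native)
r_mean, r_sd = summarize(rex)
# Any 3-sample subset of the noisy native runs can fall well above or
# below n_mean, which is how a small base table can over- or understate
# a margin that a 10-run rerun then corrects.
```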
For nn, LLVM 22 made the result clearer in REX’s favor. Under LLVM 21, fair clause preservation had reduced nn to a near tie. Under LLVM 22, native slowed by about 2.1%, while REX improved by about 0.4%, so the base suite moved to a 3.8% REX win. The 10-run focus rerun was noisier because the machine was under heavier load, but it widened in the same direction.
This was the only benchmark where LLVM 22 materially changed the competitive interpretation. It did not reveal a REX regression. It moved the former near tie toward REX.
For pathfinder, both variants slowed by about 2.6% to 2.8%, and the REX lead moved from 1.7% to 2.0%. The focus rerun was noisier, especially on native LLVM, but the direction stayed the same. The bigger lesson is still the timing-source correction: future automation must not treat the printed cycle count as seconds.
For srad_v1, LLVM 22 collapsed a tiny LLVM 21 native edge into an effective tie with a trivial REX edge in the base table. The 10-run rerun favored REX by 0.68%, still too close to overstate. The correct classification is effective tie.
For srad_v2, both variants improved by about 8.7%, and the relative gap stayed nearly unchanged: about 1.4% under LLVM 21 and 1.5% under LLVM 22. That means LLVM 22 did not reopen the lifecycle or timing ambiguity that earlier posts separated from real GPU-total behavior.
The Validation Footnote Still Matters
One issue remained, but it was not a performance regression in the measured binaries.
The reduced -DOUTPUT validation helpers for hotspot still required temporary patched source copies for both native LLVM and REX. The inactive output path in the current benchmark source needs a local declaration for loop variable i when that path is forced on for reduced-output checking.
That is easy to misunderstand. Earlier work fixed the REX unparser path that could drop inactive #ifdef OUTPUT bodies. This LLVM 22 note is different. Here, the output body is present, but the preserved inactive code still needs a local declaration to compile when the validation mode turns it on.
The main benchmark binaries were not changed by that helper patch. The purpose was correctness validation, not performance measurement.
That distinction is worth keeping because validation-only patches can contaminate benchmark stories if they are not labeled clearly. In this case:
| |
What LLVM 22 Taught Us
The useful conclusion from the LLVM 22 reevaluation is not “LLVM 22 changed nothing.” It changed enough to matter at the margins. bfs tightened. hotspot became visibly noisy in the base table. nn moved from a near tie to a clearer REX win. srad_v1 stayed close enough that calling it a win would be less honest than calling it a tie.
The useful conclusion is this:
| |
The direct __tgt_target_kernel launch path still worked. The generated CUBIN registration path still worked. The modern launch argument ABI still worked. The fair launch-geometry policy still held. The b+tree read-only load recovery still paid off. The process-lifetime cleanup did not become invalid under the newer runtime.
LLVM 22 also confirmed something about methodology. The suite result was only trustworthy because the comparison was strict about what changed. If we had regenerated the REX tree, changed helper files, changed runtime libraries, and changed timing extraction all in one step, any result would have been ambiguous. Instead, the experiment made one primary change:
| |
Everything else was either held fixed or explicitly corrected as a measurement bug.
That is the engineering rule this series should end on. REX can beat native LLVM in important cases, but the claim only remains credible when the measurement discipline is as careful as the compiler work.
Where This Leaves REX
The LLVM 22 reevaluation leaves the REX GPU offloading path in a good but not finished state.
The good part is concrete:
| |
The unfinished part is also concrete:
| |
That is the right ending for the performance arc. The goal was never to manufacture a perfect scoreboard. The goal was to make each row explainable, fair, and reproducible enough that a regression would have somewhere specific to land.
After LLVM 22, the result still holds: REX’s direct-kernel path, source-informed launch lowering, read-only load recovery, and runtime lifecycle cleanup remain competitive across the suite. More importantly, the comparison now survives a major offloading-toolchain migration without needing a new round of benchmark-specific fixes.