What Changed When REX Was Re-Evaluated Against LLVM 22
bfs and hotspot tightened, nn moved from an LLVM 21 near-tie into a clearer REX win, and the srad rows remained close. The main conclusion is not that LLVM 22 was irrelevant; it is that the REX performance fixes survived the toolchain migration without requiring benchmark-specific repairs.

The previous post closed the LLVM 21 performance story. That mattered, but it was not the end of the engineering question.
The next question was more uncomfortable:
| |
That is the question this post answers.
This is not another optimization post. No new benchmark-specific rewrite was added for LLVM 22. The point was to test whether the current REX design remained valid when the underlying OpenMP offloading stack moved from LLVM 21 to a source-built LLVM 22 toolchain.
That distinction matters. If the result had changed, there would have been three possible explanations:
| |
The reevaluation found a little of the second case, some of the third case, and no confirmed instance of the first case.
Figure 1. The LLVM 22 pass was a controlled toolchain swap, not another compiler rewrite.
What Was Held Fixed
The most important methodological choice was to reuse the current regenerated benchmark tree under:
| |
That made the experiment narrower. We were not asking:
| |
We were asking:
| |
Native LLVM 22 binaries were rebuilt with the source-built clang 22.1.2 from:
| |
using the same OpenMP GPU-offloading shape as the LLVM 21 comparison:
| |
The REX binaries were rebuilt against the same LLVM 22 libomp and libomptarget by pointing the benchmark builds at the same install tree:
| |
The full-suite artifact was saved under:
| |
Focused reruns for the benchmarks whose margins moved were saved under:
| |
This setup is the reason the result is meaningful. REX did not get to regenerate a special LLVM 22 version of the benchmarks. Native LLVM did not get a different input program. Both variants moved to the same offloading runtime layer, and then the current generated artifacts had to stand on their own.
The First Pass Found Harness Bugs
The first useful result was not a performance number. It was that the harness still had benchmark-specific timing assumptions that were too easy to misuse.
Two mistakes showed up immediately.
The first was pathfinder. The benchmark prints a line named timer, which looks like a natural timing source. But the source tells a different story:
| |
That value is a cycle count, not elapsed seconds. Treating it as microseconds manufactures false precision. For the LLVM 22 report, pathfinder stayed on the same wall-clock proxy used in the fair LLVM 21 table.
The second mistake was b+tree. External wall-clock timing makes the run look like a much larger program because it includes file input, tree construction, command parsing, and other host work. That is not the offloading comparison point used in the fair LLVM 21 report. The correct benchmark signal remains the kernel section’s own Total time: line:
| |
Both corrections were applied before accepting the final LLVM 22 numbers.
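The two harness rules can also be enforced mechanically, so the next toolchain pass cannot repeat the same mistakes. A minimal sketch, assuming hypothetical log-line shapes for the two benchmarks (the real output formats may differ):

```python
import re

# Per-benchmark timing sources, mirroring the fair-table rules: parse a
# benchmark-owned timing line where one exists, otherwise fall back to the
# externally measured wall-clock proxy. The regexes are illustrative.
KERNEL_LINE = {
    "b+tree": re.compile(r"Total time:\s*([0-9.]+)"),
}
# Benchmarks whose printed "timer" value is a cycle count, never a duration.
CYCLE_COUNT_ONLY = {"pathfinder"}

def extract_seconds(benchmark: str, output: str, wall_clock_proxy: float) -> float:
    if benchmark in CYCLE_COUNT_ONLY:
        # pathfinder's 'timer' line is a cycle count, not elapsed seconds:
        # deliberately ignore it and keep the wall-clock proxy.
        return wall_clock_proxy
    pattern = KERNEL_LINE.get(benchmark)
    if pattern is not None:
        m = pattern.search(output)
        if m is None:
            raise ValueError(f"{benchmark}: expected kernel timing line missing")
        return float(m.group(1))
    return wall_clock_proxy
```

The point of the explicit `CYCLE_COUNT_ONLY` set is that the refusal to parse a tempting number is recorded in code, not just in a post-mortem.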
This is a small but important lesson from the whole journey: performance comparisons are only as good as the timing source. A compiler change can be correct and fast, and a report can still be wrong if it changes the metric halfway through the suite.
Figure 2. The LLVM 22 pass first fixed measurement consistency before interpreting margin changes.
The LLVM 22 Scoreboard
After the timing corrections, the high-level result was clean:
| |
The base table is:
| Benchmark | Timing Source | Samples | LLVM 22 Mean | REX Mean | Result |
|---|---|---|---|---|---|
| b+tree | kernel Total time | 10 | 0.013950 s | 0.009950 s | REX by 28.7% |
| bfs | benchmark compute | 5 | 0.026081 s | 0.023674 s | REX by 9.2% |
| gaussian | benchmark total incl. transfers | 3 | 0.253465 s | 0.131447 s | REX by 48.1% |
| heartwall | wall-clock proxy | 1 | 72.420105 s | 49.394615 s | REX by 31.8% |
| hotspot | wall-clock proxy | 3 | 1.470017 s | 1.425833 s | REX by 3.0% |
| nn | benchmark total | 5 | 0.302465 s | 0.291088 s | REX by 3.8% |
| pathfinder | wall-clock proxy | 3 | 9.588426 s | 9.399827 s | REX by 2.0% |
| srad_v1 | benchmark compute stage | 3 | 1.027374 s | 1.027008 s | effective tie |
| srad_v2 | wall-clock proxy | 3 | 1.042898 s | 1.027593 s | REX by 1.5% |
Figure 3. LLVM 22 moved margins, but it did not flip the suite into a native LLVM win.
The most important part of this table is not that every row has a REX-friendly base-table direction. Some rows are too close to treat as meaningful wins. srad_v1 is an effective tie, and srad_v2, hotspot, and pathfinder still use wall-clock proxy timing rather than a benchmark-owned GPU-total line.
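For reproducibility, the per-row margins are simple ratios of the two means. Assuming the table's "REX by X%" figures are relative to the native mean (which matches the numbers shown), a quick check against two rows:

```python
def rex_margin_pct(native_mean: float, rex_mean: float) -> float:
    """REX advantage as a percentage of the native LLVM mean."""
    return (native_mean - rex_mean) / native_mean * 100.0

# Base-table rows from above.
btree_margin = rex_margin_pct(0.013950, 0.009950)     # ~28.7
gaussian_margin = rex_margin_pct(0.253465, 0.131447)  # ~48.1
```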
The important part is narrower and stronger:
| |
That was the regression question we needed answered.
What Moved From LLVM 21
The LLVM 22 run did change the margins. That is expected. A new compiler and runtime stack can change host launch overhead, runtime initialization behavior, device image handling, optimization choices, and noise characteristics.
The useful question is whether those movements point to a REX flaw.
For b+tree, the answer is no. Native LLVM moved from about 0.014066 s under the fair LLVM 21 run to 0.013950 s under LLVM 22. REX stayed essentially flat at about 0.009950 s. The REX lead moved from 29.3% to 28.7%.
That is the pattern we wanted to see. The recovered read-only load path and direct search kernel still explain the result. LLVM 22 did not erase the benefit of restoring __ldg-style read-only behavior in the REX-generated irregular tree traversal.
For bfs, the margin tightened. The fair LLVM 21 table had REX ahead by 13.7%; the LLVM 22 base table had REX ahead by 9.2%. Native improved slightly, while the REX base-suite sample regressed modestly. The 10-run focus rerun still favored REX by 10.2%.
That is a real movement, but not a winner flip. It should be treated as a margin to keep watching, not as evidence that the direct-kernel design failed under LLVM 22.
For gaussian, both variants slowed by about 5%, and the relative result barely moved: 48.4% under LLVM 21 versus 48.1% under LLVM 22. The 5-run focus rerun still put the REX lead near 48.9%.
That is the cleanest shared-toolchain-shift pattern in the suite. If both variants move together and the relative gap stays stable, the likely explanation is a common runtime or compiler cost, not a REX-specific regression.
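That reasoning generalizes: if the relative gap is stable while both variants move, suspect a shared toolchain cost; if the gap changes materially, look at which variant moved. A hypothetical triage helper along those lines (the 1% gap tolerance is an illustrative threshold, not a rule from the report):

```python
def classify_shift(native_old: float, native_new: float,
                   rex_old: float, rex_new: float,
                   gap_tolerance_pct: float = 1.0) -> str:
    """Label a toolchain migration's effect on one benchmark row.

    Compares the relative gap (REX lead as a percentage of the native
    mean) before and after the migration. A stable gap suggests a
    common runtime or compiler cost rather than a variant-specific
    regression.
    """
    gap_old = (native_old - rex_old) / native_old * 100.0
    gap_new = (native_new - rex_new) / native_new * 100.0
    if abs(gap_new - gap_old) <= gap_tolerance_pct:
        return "common toolchain shift"  # gap stable: shared cost/benefit
    if gap_new < gap_old:
        return "margin tightened"        # native gained relative to REX
    return "margin widened"              # REX gained relative to native
```

Run on gaussian-shaped numbers (both variants ~5% slower, gap near 48%), it reports a common shift; run on bfs-shaped numbers, it reports a tightened margin.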
For heartwall, the direction stayed the same. Native improved slightly, REX slowed slightly, and the REX lead moved from 33.0% to 31.8%. Because the long full-suite capture still uses one sample, the exact percentage is less important than the direction. LLVM 22 did not change the story: the direct-kernel path still avoids enough generic OpenMP device scaffolding to matter.
For hotspot, the base-table REX lead narrowed from 4.9% to 3.0%. The 10-run focus rerun still favored REX by 7.1%, with much higher native variance. This is one of the rows where a single base table can be misleading. LLVM 22 changed the visible noise profile more than it changed the underlying conclusion.
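This is why the focus reruns report spread, not just means: with high native variance, a 3-sample base table can land almost anywhere inside the noise band. A sketch of the summary step, using illustrative made-up samples (not the report's raw data), where the native side is deliberately noisier:

```python
import statistics

# Hypothetical 10-run focus samples: native is noisier than REX.
native = [1.58, 1.39, 1.71, 1.44, 1.62, 1.37, 1.69, 1.41, 1.55, 1.48]
rex    = [1.42, 1.43, 1.41, 1.44, 1.42, 1.43, 1.42, 1.41, 1.43, 1.42]

def summarize(samples):
    """Mean and sample standard deviation for one variant's runs."""
    return statistics.mean(samples), statistics.stdev(samples)

n_mean, n_sd = summarize(native)
r_mean, r_sd = summarize(rex)
# Any 3-sample subset of the noisy native runs can fall well above or
# below n_mean, which is how a small base table can over- or understate
# a margin that a 10-run rerun then corrects.
```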
For nn, LLVM 22 made the result clearer in REX’s favor. Under LLVM 21, fair clause preservation had reduced nn to a near tie. Under LLVM 22, native slowed by about 2.1%, while REX improved by about 0.4%, so the base suite moved to a 3.8% REX win. The 10-run focus rerun was noisier because the machine was under heavier load, but it widened in the same direction.
This was the only benchmark where LLVM 22 materially changed the competitive interpretation. It did not reveal a REX regression. It moved the former near tie toward REX.
For pathfinder, both variants slowed by about 2.6% to 2.8%, and the REX lead moved from 1.7% to 2.0%. The focus rerun was noisier, especially on native LLVM, but the direction stayed the same. The bigger lesson is still the timing-source correction: future automation must not treat the printed cycle count as seconds.
For srad_v1, LLVM 22 collapsed a tiny LLVM 21 native edge into an effective tie with a trivial REX edge in the base table. The 10-run rerun favored REX by 0.68%, still too close to overstate. The correct classification is effective tie.
For srad_v2, both variants improved by about 8.7%, and the relative gap stayed nearly unchanged: about 1.4% under LLVM 21 and 1.5% under LLVM 22. That means LLVM 22 did not reopen the lifecycle or timing ambiguity that earlier posts separated from real GPU-total behavior.
The Validation Footnote Still Matters
One issue remained, but it was not a performance regression in the measured binaries.
The reduced -DOUTPUT validation helpers for hotspot still required temporary patched source copies for both native LLVM and REX. The inactive output path in the current benchmark source needs a local declaration for loop variable i when that path is forced on for reduced-output checking.
That is easy to misunderstand. Earlier work fixed the REX unparser path that could drop inactive #ifdef OUTPUT bodies. This LLVM 22 note is different. Here, the output body is present, but the preserved inactive code still needs a local declaration to compile when the validation mode turns it on.
The main benchmark binaries were not changed by that helper patch. The purpose was correctness validation, not performance measurement.
That distinction is worth keeping because validation-only patches can contaminate benchmark stories if they are not labeled clearly. In this case:
| |
What LLVM 22 Taught Us
The useful conclusion from the LLVM 22 reevaluation is not “LLVM 22 changed nothing.” It changed enough to matter at the margins. bfs tightened. hotspot became visibly noisy in the base table. nn moved from a near tie to a clearer REX win. srad_v1 stayed close enough that calling it a win would be less honest than calling it a tie.
The useful conclusion is this:
| |
The direct __tgt_target_kernel launch path still worked. The generated CUBIN registration path still worked. The modern launch argument ABI still worked. The fair launch-geometry policy still held. The b+tree read-only load recovery still paid off. The process-lifetime cleanup did not become invalid under the newer runtime.
LLVM 22 also confirmed something about methodology. The suite result was only trustworthy because the comparison was strict about what changed. If we had regenerated the REX tree, changed helper files, changed runtime libraries, and changed timing extraction all in one step, any result would have been ambiguous. Instead, the experiment made one primary change:
| |
Everything else was either held fixed or explicitly corrected as a measurement bug.
That is the engineering rule this series should end on. REX can beat native LLVM in important cases, but the claim only remains credible when the measurement discipline is as careful as the compiler work.
Where This Leaves REX
The LLVM 22 reevaluation leaves the REX GPU offloading path in a good but not finished state.
The good part is concrete:
| |
The unfinished part is also concrete:
| |
That is the right ending for the performance arc. The goal was never to manufacture a perfect scoreboard. The goal was to make each row explainable, fair, and reproducible enough that a regression would have somewhere specific to land.
After LLVM 22, the result still holds: REX’s direct-kernel path, source-informed launch lowering, read-only load recovery, and runtime lifecycle cleanup remain competitive across the suite. More importantly, the comparison now survives a major offloading-toolchain migration without needing a new round of benchmark-specific fixes.