How REX Finished The LLVM 21 GPU Benchmark Suite

After the LLVM 21 runtime migration, launch-fairness cleanup, direct __tgt_target_kernel lowering, read-only load recovery, GPU-total timing correction, and process-lifecycle fix, the fair suite no longer had a confirmed native LLVM GPU-offloading win. REX had clear or stable wins on b+tree, bfs, gaussian, heartwall, hotspot, and pathfinder, and effective ties on nn, srad_v1, and srad_v2. The wins came from three compiler mechanisms: direct CUDA-style kernels, source-informed launch shaping only when user intent allowed it, and selective read-only load recovery; the final classifications also depended on methodology and runtime fixes for GPU-total timing and process lifetime. LLVM still remained stronger in generality, semantic uniformity, and robustness against REX-specific source-to-source information loss.

The previous post removed the last confusing process-lifetime tax from the LLVM 21 comparison. GPU-total profiling had already shown that the suspicious pathfinder and srad rows were not device-side native LLVM wins. The process-exit teardown investigation then explained why some short wall-clock runs still looked worse for REX even when kernel and transfer totals were clean.

At that point, the LLVM 21 performance work finally had a coherent scoreboard.

This post is the wrap-up for that state. It is not another chronological debugging log. The chronological work is covered by the previous posts. This one answers a different question:

For each benchmark, why does REX win or tie now,
what exactly does REX do better,
what does native LLVM still do better,
and why does that not change the final result?

The comparison here is the fair LLVM 21 state after the direct-kernel migration, launch-fairness cleanup, b+tree __ldg recovery, GPU-total timing correction, and process-exit teardown fix.

The fairness rules stayed the same through the end:

preserve valid explicit user launch clauses;
shape launch defaults only when REX owns the choice;
compare outputs before trusting timing;
use GPU-total profiling when wall-clock proxy timing is ambiguous.

With those rules, the final LLVM 21 suite had:

clear or stable REX wins:
  b+tree, bfs, gaussian, heartwall, hotspot, pathfinder

effective ties:
  nn, srad_v1, srad_v2

confirmed native LLVM GPU-offloading wins:
  none

Figure 1 (scoreboard diagram). The final LLVM 21 fair suite is not nine unconditional wins. It is six wins, three honest ties, and no confirmed native LLVM GPU-offloading win.

The Three REX Mechanisms

Most of the final table is explained by three compiler mechanisms.

First, REX now lowers suitable target loops into direct CUDA-style kernels:

__global__ void rex_loop_kernel(long tripcount /* plus mapped args */) {
  // Grid-stride indexing: thread tid handles i = tid, tid + stride, ...
  long tid = blockIdx.x * (long)blockDim.x + threadIdx.x;
  long stride = (long)gridDim.x * blockDim.x;

  for (long i = tid; i < tripcount; i += stride) {
    body(i);  // the outlined loop body
  }
}

That matters because many native LLVM kernels still carry a more generic OpenMP device execution frame: target initialization, team state, worksharing control flow, and barriers around the actual computation. That machinery is semantically useful, but it is not free.

Second, REX shapes launch defaults from recovered source facts when it is allowed to do so:

if the user explicitly requested the thread or team shape:
    preserve it unless invalid
else:
    use tripcount and loop shape to avoid oversized launches

This is why the suite is fair. REX is not allowed to win nn by rewriting an explicit user request, but it can win heartwall or pathfinder when the source leaves a launch dimension open and the tripcount is clear.
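
A minimal host-side sketch makes that policy concrete. The names here (shape_launch, kDefaultBlock, the -1 sentinel) are illustrative, not REX's actual interface:

#include <algorithm>

struct Launch { long teams; long threads; };

// user_teams / user_threads are -1 when the source left that clause open.
Launch shape_launch(long tripcount, long user_teams, long user_threads) {
  const long kDefaultBlock = 128;  // assumed default block size
  long threads = (user_threads > 0) ? user_threads : kDefaultBlock;
  long teams;
  if (user_teams > 0) {
    teams = user_teams;            // explicit user intent: preserve it
  } else {
    // Cover the live tripcount exactly. A grid-stride kernel is correct
    // under any grid size, so this only removes oversubscription.
    teams = std::max(1L, (tripcount + threads - 1) / threads);
  }
  return {teams, threads};
}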

Third, REX now preserves enough read-only provenance to recover selective __ldg(...) loads in irregular kernels:

const int key = __ldg(&_dev_keys[bid]);      // mapped array proven read-only
const long next = __ldg(&node->indices[t]);  // field traced to read-only device storage

That is the difference between a broad cache flag and a compiler optimization. REX only emits the read-only path when it can trace the load back to read-only device storage.

Figure 2 (mechanism diagram). The final REX wins are explainable by generic compiler mechanisms, not benchmark-name conditionals.

b+tree: REX Wins Inside The Kernel Body

Final fair status: REX wins.

The important detail in b+tree is that the explicit launch clauses remain explicit. REX is not winning by shrinking num_threads(1024) or taking back num_teams(256). That would be an unfair comparison against native LLVM, which preserves the same source intent.

The win comes from the kernel body.

b+tree is an irregular read-only tree walk. The hot path repeatedly loads keys, indices, and records. Earlier REX output had already moved to the direct kernel path, but it still lost const provenance through its own source-to-source normalization. That made the hottest loads ordinary global loads. Once REX fixed mapped-array read-only analysis and local const-shadow recognition, the generated CUDA recovered the profitable __ldg(...) path.
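
The pattern at stake looks roughly like this (a schematic, not the benchmark's literal source): the mapped array is only read inside the target region, and the hot value is a local const shadow of one element. REX has to carry both facts through its source-to-source rewrite, or the load degrades to an ordinary global load:

void lookup(const long *keys, long *out, long n) {
  // keys is mapped to(...) and never written in the region: read-only storage.
  #pragma omp target teams distribute parallel for \
      map(to: keys[0:n]) map(from: out[0:n])
  for (long i = 0; i < n; i++) {
    const long key = keys[i];  // local const shadow: eligible for __ldg
    out[i] = key;              // stand-in for the real tree-walk body
  }
}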

What REX does better:

direct search kernel;
preserved const provenance;
selective read-only loads for irregular tree traversal.

What LLVM still does better:

it is not exposed to this REX-specific provenance-loss failure mode;
its integrated pipeline needs fewer source-to-source recovery passes.

LLVM’s robustness was real. It is why native LLVM stayed ahead until the REX provenance bug was fixed. But once REX recovered the read-only path, the direct kernel became the better vehicle for this workload.

bfs: REX Wins Without Depending On Launch Shrinking

Final fair status: REX wins.

bfs is a useful sanity check because it is not primarily a launch-shaping story. The tested graph has enough work that trimming launch dimensions is not the main effect. REX still wins.

That means the win is mostly device-body structure. The REX kernels go directly into the graph traversal work over masks, costs, and nodes. Native LLVM still carries the generic OpenMP device runtime frame. For a benchmark with enough parallel work already available, removing unnecessary device-side control structure matters.
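
A sketch of that direct traversal shape, with illustrative field names rather than the benchmark's exact ones: a grid-stride walk over the frontier with no team-state or worksharing bookkeeping around it:

__global__ void bfs_step(const int *offsets, const int *edges,
                         bool *mask, bool *next_mask, int *cost, int n) {
  long tid = blockIdx.x * (long)blockDim.x + threadIdx.x;
  long stride = (long)gridDim.x * blockDim.x;
  for (long v = tid; v < n; v += stride) {
    if (!mask[v]) continue;       // only frontier vertices do work
    mask[v] = false;
    for (int e = offsets[v]; e < offsets[v + 1]; e++) {
      int u = edges[e];
      if (cost[u] < 0) {          // unvisited neighbor
        cost[u] = cost[v] + 1;
        next_mask[u] = true;
      }
    }
  }
}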

What REX does better:

simpler graph kernels;
less device-side OpenMP worksharing scaffolding;
direct work over the graph arrays.

What LLVM still does better:

it has a more uniform path for arbitrary OpenMP target shapes;
it is less dependent on REX proving that the loop fits the direct fast path.

Those LLVM advantages are compiler-engineering strengths, but this benchmark stays inside the simple loop shapes REX handles well. The generic runtime structure does not buy enough to offset its cost.

gaussian: REX Wins Because Repetition Amplifies Everything

Final fair status: REX wins decisively.

gaussian is the benchmark where multiple REX mechanisms stack. The elimination loop launches kernels repeatedly, and the live iteration space shrinks as t advances. A static wide launch becomes increasingly oversized later in the run.

REX recovers the live tripcount for the direct path and shapes the launch for the current iteration. It also emits direct device kernels rather than OpenMP runtime-mediated kernels. Both savings repeat many times.
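
A sketch of the shaping effect, with Rodinia-style names assumed rather than taken from REX's output: the grid is recomputed from the shrinking live tripcount on every iteration instead of reusing one static wide launch:

// Hypothetical elimination kernel: one multiplier per row still below t.
__global__ void fan1(float *m, const float *a, int size, int t) {
  long tid = blockIdx.x * (long)blockDim.x + threadIdx.x;
  long stride = (long)gridDim.x * blockDim.x;
  for (long r = tid; r < size - 1 - t; r += stride)
    m[size * (r + t + 1) + t] = a[size * (r + t + 1) + t] / a[size * t + t];
}

void eliminate(float *m, float *a, int size) {
  for (int t = 0; t < size - 1; t++) {
    long live = size - 1 - t;                        // shrinking live work
    dim3 block(128);
    dim3 grid((unsigned)((live + block.x - 1) / block.x));
    if (live > 0) fan1<<<grid, block>>>(m, a, size, t);
  }
}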

What REX does better:

recomputes launch shape for shrinking live work;
uses direct kernels for repeated short launches;
avoids repeated device runtime scaffolding.

What LLVM still does better:

it keeps a more general device execution model;
it is more tolerant of source shapes that REX cannot canonicalize.

That generality is useful, but gaussian is canonical enough for REX to analyze. Once the ABI was repaired and the hidden launch-environment slot matched the modern runtime contract, the direct path became a large win.

heartwall: REX Wins By Preserving One User Choice And Shaping Another

Final fair status: REX wins decisively.

heartwall is the clean example of fair launch shaping. The tested source fixes the thread count at 4, and REX preserves that. The number of teams is not fixed the same way for the tested configuration, and the live work is only about 51 points.

So REX keeps the explicit thread choice and reduces the unfixed block count to match the real tripcount. Native LLVM’s more conservative path keeps a much larger launch shape and pays the generic device OpenMP cost.
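
Plugging this benchmark's numbers into the hypothetical shape_launch sketch from earlier shows the asymmetry:

// num_threads(4) is explicit; the team count is left open; ~51 live points.
Launch l = shape_launch(/*tripcount=*/51, /*user_teams=*/-1, /*user_threads=*/4);
// l.threads == 4  : the explicit user choice is preserved
// l.teams   == 13 : ceil(51 / 4), instead of a conservative wide grid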

What REX does better:

honors explicit thread count;
shapes the unfixed team dimension;
uses direct per-point kernels over repeated frames.

What LLVM still does better:

it is more input-agnostic;
it does not depend on front-end tripcount recovery to choose a safe launch.

LLVM’s conservatism would be useful for very different input sizes. On this measured input, it is mostly oversubscription. REX wins because it changes only the launch dimension that the user left available.

hotspot: REX Wins Even Without Tripcount Shaping

Final fair status: REX wins.

hotspot is important because it proves direct-kernel lowering has value by itself. The hot kernels are not primarily winning through tripcount-derived launch changes. REX still wins because the generated device bodies are simpler.

Native LLVM still carries target init, team-state setup, and synchronization/control structure around the stencil-like body. REX emits direct kernels that enter the computation with less runtime ceremony.

What REX does better:

plain lowered stencil kernels;
less device-side runtime scaffolding;
no need for launch shaping to show a win.

What LLVM still does better:

it preserves a more general OpenMP execution model;
it needs fewer REX-specific source analyses to stay semantically correct.

That is a real LLVM strength. It just does not offset the direct-device-body advantage on this workload.

pathfinder: REX Wins, But Only After Measuring The Right Thing

Final fair status: REX wins narrowly.

pathfinder looked ambiguous for a long time because wall-clock timing was noisy and the benchmark includes significant non-device work. GPU-total profiling changed the interpretation: the REX offloaded work was already better even when the whole-process table looked like a tie.

The live domain is close to one useful block, not a full 256-block default grid. REX can shape the unfixed launch to the actual work and run a direct grid-stride kernel. Native LLVM keeps the conservative generic path.

What REX does better:

launches the live domain instead of a large default grid;
uses direct kernel execution;
wins on GPU-total profiling.

What LLVM still does better:

1
2
its conservative launch behavior is less tied to one input size;
its end-to-end wall-clock behavior can look more stable when host work dominates.

That is why pathfinder needed the methodology post before this wrap-up. The fair conclusion depends on measuring the GPU work, not just the full process.

nn: The Correct Result Is A Tie

Final fair status: effective tie.

nn started as the most severe REX regression. The original gap was mostly not arithmetic: offload initialization was misplaced, the old XOMP scheduler path was too heavy, and launch/runtime details were not aligned with the modern OpenMP path.

Those regressions are gone. But the benchmark also contains the fairness trap: the source explicitly requests a very wide launch for a tiny live work window. REX can no longer shrink that explicit choice just to win.
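
Schematically, the trap looks like this (clause values and names are illustrative, not nn's exact source). Once the clauses are explicit and valid, both compilers must emit this shape even if only a handful of records are live:

#include <math.h>

void nn_distances(const float *lat, const float *lng,
                  float tlat, float tlng, float *dist, long n) {
  // Explicit num_teams/thread_limit pin an over-wide launch by design.
  #pragma omp target teams distribute parallel for \
      num_teams(1024) thread_limit(256) \
      map(to: lat[0:n], lng[0:n]) map(from: dist[0:n])
  for (long i = 0; i < n; i++) {
    float dx = lat[i] - tlat, dy = lng[i] - tlng;
    dist[i] = sqrtf(dx * dx + dy * dy);
  }
}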

What REX does better:

direct distance kernel;
fixed offload-init placement;
no old scheduler tax;
no mandatory process-exit teardown.

What LLVM still does better:

it is naturally conservative about preserving the user's launch request;
under the same over-wide explicit launch, REX has less room to separate.

The tie is the honest result. REX removed the real regressions without cheating on user intent. If the source explicitly asks for a launch that is too wide for the active work, both compilers are constrained by that source decision.

srad_v1: The Correct Result Is Also A Tie

Final fair status: effective tie.

srad_v1 sits below the level where a meaningful compiler win can be claimed from the current samples. GPU-total profiling slightly favored REX, while the benchmark’s coarse timing could move within noise.

What REX does better:

direct kernels;
tripcount-aware shaping for unfixed dimensions;
GPU-total samples at least competitive and slightly ahead.

What LLVM still does better:

slightly steadier coarse timing in some samples;
less dependence on REX-specific direct-loop canonicalization.

Those LLVM advantages do not justify calling it a native win. The device totals are effectively equal, and the wall-clock numbers are too close to rank confidently.

srad_v2: A Tie After The Lifecycle Bug Was Removed

Final fair status: effective tie.

srad_v2 looked like one of the last stubborn native LLVM wins until we separated GPU-total timing from wall-clock and then removed forced process-exit teardown. After that, the native-looking wall-clock edge no longer supported a GPU-offloading conclusion.

What REX does better:

direct kernel path;
tripcount-aware shaping for unfixed dimensions;
no forced offload teardown at normal process exit;
GPU-total profiling slightly favors REX in current samples.

What LLVM still does better:

some coarse wall-clock samples remain tighter;
native image lifecycle is simpler because the device image is embedded.

That last point matters. REX still has an architectural difference: it manages an external cubin image. The teardown fix removed an unnecessary tax, but it did not make every cold-start behavior identical to native LLVM. The fair classification is tie, not loss.
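
For concreteness, a minimal driver-API sketch of that external-image lifecycle, with the file and kernel names invented for illustration (error handling omitted):

#include <cuda.h>

// Native LLVM reads its device image out of the executable; REX instead
// loads a separate cubin file at run time, which is where the remaining
// cold-start difference lives.
CUfunction load_offload_kernel() {
  cuInit(0);
  CUdevice dev;   cuDeviceGet(&dev, 0);
  CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
  CUmodule mod;
  cuModuleLoad(&mod, "rex_kernels.cubin");        // external image: file I/O
  CUfunction fn;
  cuModuleGetFunction(&fn, mod, "srad_kernel");   // hypothetical kernel name
  return fn;
}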

Figure 3 (benchmark-to-mechanism matrix). Different benchmarks exercise different REX advantages. The final story is not one universal trick.

What LLVM Still Teaches REX

The final LLVM 21 suite is strong for REX, but it is not a reason to dismiss native LLVM’s design.

LLVM still has three durable strengths.

First, LLVM’s device execution model is more uniform. It can carry a broad range of OpenMP semantics through one integrated path. REX is faster when the source matches its direct fast path, but it needs careful fallback behavior when it does not.

Second, LLVM is less exposed to source-to-source information loss. The b+tree read-only-load bug was a REX bug because REX had to preserve const provenance through its own generated CUDA and ROSE AST transformations. LLVM’s integrated pipeline did not have that exact failure mode.

Third, LLVM’s conservative launch behavior is a useful baseline for fairness. The nn investigation only became rigorous after we accepted that valid explicit user launch choices must be preserved, even when they are slow.

These strengths define the next engineering direction for REX:

make the direct fast path cover more source shapes;
make analyses like tripcount and read-only provenance more robust;
keep the fallback path correct;
do not win by rewriting user intent.

That is the important outcome of this wrap-up. REX now wins or ties the fair LLVM 21 suite, but the path forward is not benchmark hacking. It is analysis robustness.

The next post asks a different question: if the underlying toolchain moves from LLVM 21 to LLVM 22, does this result survive?