How REX Makes Fair GPU Offloading Comparisons Against Native LLVM
The case studies here are nn, b+tree, pathfinder, and srad_v2. The previous post in this series explained how REX emits omp_offloading_entries and keeps host and device kernel identity aligned.
This one moves back up to the benchmark layer, but it stays much narrower than the general benchmark-validation post.
The question here is not:
how do we run a benchmark suite at all?
The question is:
when REX and native LLVM are close, noisy, or trading wins across different measurement modes, what makes a comparison fair enough to trust?
That needed its own post because the answer turned out to be stricter than “build both and compare times.”
During the REX GPU campaign, several early conclusions changed after the methodology was tightened:
- nn stopped being an apparent easy REX win once explicit user launch clauses were honored.
- b+tree stopped looking ambiguous once the timing source switched from coarse wall-clock to the benchmark's own kernel Total time: line.
- pathfinder stopped looking like a near-tie once GPU-total profiling replaced misleading proxy numbers.
- srad_v2 stopped looking like a native LLVM holdout once wall-clock noise was separated from actual GPU work.
That is why fairness deserves its own engineering discussion. A compiler comparison is only as honest as the contract behind it.
Figure 1. A trustworthy benchmark verdict rests on three contracts at once: compare the same software stack, preserve the same user intent, and measure the same meaning of time.
Why “Fair” Needed To Be Defined Explicitly
At first glance, native LLVM versus REX sounds simple.
Compile the original OpenMP program with Clang offloading. Lower the same source through REX. Run both. Compare outputs. Compare timings.
That is necessary, but it is not sufficient.
There are at least four ways such a comparison can still be misleading:
- the two binaries are not actually using the same runtime stack,
- one compiler quietly changed the launch policy the user wrote,
- the benchmark timing source is measuring mostly host noise instead of GPU work,
- or the output diff is flagging formatting and timing lines instead of a real semantic mismatch.
Those are not hypothetical mistakes. They all happened in the campaign in one form or another.
So the comparison rules had to become explicit.
The final methodology was built around a narrow principle:
compare lowering decisions, not accidental differences in ecosystem, measurement source, or interpretation.
That sounds abstract. In practice it meant four very concrete rules.
Rule 1: Compare Lowering, Not Different Software Stacks
The first rule is the easiest to say and one of the easiest to violate.
If native LLVM is linked against one OpenMP offloading runtime stack and the REX-generated binary is linked against some other runtime stack, then the result is no longer “native LLVM lowering versus REX lowering.” It is “two different software systems.”
The campaign avoided that by keeping the runtime family aligned.
Native LLVM binaries were built with Clang OpenMP offloading.
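As a sketch of what that native build looked like (the flag set, file names, and GPU architecture below are assumptions for illustration, not the campaign's recorded invocation), the build side of the harness can be modeled as a small command builder:

```python
# Hypothetical helper that assembles a Clang OpenMP-offloading build command.
# Flags and default arch are illustrative assumptions.
def clang_offload_cmd(src: str, out: str, arch: str = "sm_70") -> list[str]:
    return [
        "clang",
        "-fopenmp",                # enable OpenMP
        f"--offload-arch={arch}",  # target GPU architecture
        "-O3",
        src,
        "-o", out,
    ]

cmd = clang_offload_cmd("nn.c", "nn_native")
```

The point of modeling it this way is that the REX side reuses the same runtime family at link time, so the only variable left is the lowering itself.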
The REX side was not treated as a separate runtime universe. REX generated host, device, and helper files, but the final binaries were still linked against the same LLVM libomp and libomptarget family used by the native LLVM side.
That matters because it keeps the comparison focused on code shape:
- native LLVM lowering plus LLVM runtime,
- versus REX lowering plus the same LLVM runtime family.
The LLVM 22 reevaluation reinforced why this rule matters. That pass intentionally reused the already-regenerated benchmark tree and only rebuilt both sides against the new source-built LLVM 22 toolchain. That kept the question narrow: did any benchmark move because of LLVM 22 itself, with everything else held fixed?
Without that discipline, it would have been impossible to tell whether a benchmark movement came from LLVM 22 itself or from some unrelated regeneration difference.
This same rule also applies to input handling. A fair comparison is not “run vaguely similar workloads.” It is “run the same workload contract.” That is why each benchmark kept small explicit run commands instead of relying on memory or manual improvisation.
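One way to sketch that workload contract (benchmark names match the post; the argument values are purely illustrative placeholders, not the campaign's real inputs) is a shared table that both compilers' binaries are launched from:

```python
# Hypothetical per-benchmark run contracts: both binaries must be invoked
# with identical arguments. The values here are illustrative only.
RUN_CONTRACTS = {
    "nn":         ["filelist.txt", "5", "30", "90"],
    "b+tree":     ["file", "input.txt", "command", "command.txt"],
    "pathfinder": ["100000", "100", "20"],
    "srad_v2":    ["2048", "2048", "0.5", "2"],
}

def run_cmd(binary: str, bench: str) -> list[str]:
    # Same contract regardless of which compiler produced the binary.
    return [binary, *RUN_CONTRACTS[bench]]
```

Because both sides read from the same table, a workload drift between the native and REX runs becomes structurally impossible rather than merely unlikely.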
Rule 2: Preserve Explicit User Launch Intent
This was the most important fairness correction in the whole campaign.
A source-to-source compiler can often see that a user requested a poor launch shape. It may be able to prove that launching fewer threads or fewer teams would run faster. The temptation is obvious: silently rewrite the clause to the faster shape and take the win.
That is not always fair.
The rule the campaign converged on is simple:
If the user explicitly requested num_threads or num_teams, the compiler should honor that request unless it is invalid.
That still leaves room for optimization, but only in the right places.
Allowed:
- choose defaults when the source did not specify a value,
- derive unfixed launch dimensions from tripcount,
- clamp values that exceed hardware or ABI limits,
- keep explicit dimensions and optimize only the dimensions the source left open.
Not allowed:
- shrink a legal explicit num_threads or num_teams request just because the benchmark runs faster that way,
- reinterpret user clauses only for a benchmark that happens to be under investigation,
- claim a compiler performance win that depends on overriding a valid source-level policy.
Figure 2. Fairness depends on ownership. Explicit launch clauses belong to the user. Unspecified launch dimensions belong to the compiler.
The benchmark that forced this rule into the open was nn.
At one stage of the work, REX looked much better because it had learned how to avoid launching a huge block for a very small active work window. That was a real optimization, and it explained a real speedup.
But the source also carried an explicit user launch request. Once fairness was defined properly, REX was not allowed to shrink that explicit request anymore. The earlier “win” therefore stopped being a fair win.
That did not make the compiler work useless. It made the interpretation honest.
nn is the canonical example of the fairness distinction:
- unfair conclusion: REX wins because it ignored the user’s oversized launch request and replaced it with a better one,
- fair conclusion: once both compilers honor the same explicit request, the benchmark becomes an effective tie, which is still a major improvement over the old REX regression.
heartwall shows the other side of the same rule. There, the benchmark used an explicit small thread count, but the block count was still open enough that REX could legally shape it from the tripcount. So REX could preserve the user’s explicit policy and still optimize the unfixed dimension.
That is the core fairness distinction:
- preserve what the user fixed,
- optimize what the user left open.
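The ownership rule above can be sketched as a small policy function (a sketch under stated assumptions: the 1024-thread clamp matches a typical CUDA per-block limit, and the tripcount heuristic is an illustrative default, not REX's actual one):

```python
# Sketch of the launch-ownership rule: explicit clauses are honored, clamped
# only to hardware limits; unspecified dimensions belong to the compiler.
MAX_THREADS = 1024  # assumed per-block hardware limit

def pick_launch(explicit_threads, explicit_teams, tripcount):
    if explicit_threads is not None:
        # User-fixed: clamp to a legal value, never shrink for speed.
        threads = min(explicit_threads, MAX_THREADS)
    else:
        # Compiler-owned default, derived from the tripcount.
        threads = min(tripcount, 256)
    if explicit_teams is not None:
        teams = explicit_teams  # user-fixed, passed through untouched
    else:
        teams = max(1, (tripcount + threads - 1) // threads)
    return teams, threads
```

Under this rule, nn's oversized explicit request survives intact, while heartwall's open block count can still be shaped from the tripcount, which is exactly the asymmetry the two case studies show.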
This is not only benchmarking etiquette. It is compiler design discipline.
Rule 3: Make “Time” Mean The Right Thing For Each Benchmark
Another major fairness problem is pretending every benchmark prints one equally trustworthy timing number.
They do not.
Some benchmarks print total runtime. Some print a compute stage. Some print a kernel-oriented line that is clearly the right measurement. Some print something named timer that is not comparable wall-clock seconds at all. Some print nothing directly useful for GPU work, which means a profiler-based total is better than any benchmark-owned number.
So the campaign used a metric-quality ladder.
Best:
- benchmark-owned timing that directly measures the GPU work of interest,
- or benchmark-owned timing that clearly includes transfers plus compute for the offloaded region.
Next best:
- profiler-derived GPU total, typically summed kernel time plus host-to-device and device-to-host copy time.
Last resort:
- coarse wall-clock proxy around a larger workflow.
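The ladder itself is mechanical once each benchmark's available timing sources are known. A minimal sketch (the metric names are our labels, not the harness's real schema):

```python
# Sketch of the metric-quality ladder: prefer benchmark-owned GPU timing,
# then a profiler-derived GPU total, then coarse wall-clock as last resort.
LADDER = ["benchmark_gpu", "profiler_gpu_total", "wall_clock"]

def best_metric(available: dict):
    """Pick the highest-quality timing source a benchmark actually exposes."""
    for name in LADDER:
        if name in available:
            return name, available[name]
    raise ValueError("no usable timing source for this benchmark")
```

The key design choice is that the ladder is fixed per benchmark before any numbers are compared, so the metric cannot be chosen after the fact to favor one side.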
This rule changed several benchmark stories.
b+tree: the benchmark already knew the right timing line
Early on, b+tree was measured with external wall-clock. That was a poor comparison because it included file parsing, tree transformation, and command handling around the kernel.
The benchmark itself already exposed the better metric: its own kernel-focused Total time: output line.
Once the comparison switched to that kernel Total time: line, the performance story became much clearer. That was not changing the result to favor one side. It was finally measuring the work the benchmark row was supposed to represent.
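Extracting that benchmark-owned number is a one-line parse. A sketch, assuming the line has the shape `Total time: <seconds>` (the exact output format is an assumption based on the label the post quotes):

```python
import re

# Sketch: pull b+tree's kernel-owned timing out of its stdout.
# The "Total time: <number>" shape is an assumed format.
def kernel_total_seconds(stdout: str):
    for line in stdout.splitlines():
        m = re.search(r"Total time:\s*([0-9.]+)", line)
        if m:
            return float(m.group(1))
    return None  # not found: fall back to the next rung of the ladder
```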
pathfinder: a line named timer was not what it looked like
The pathfinder benchmark was even more misleading. It printed a line named timer, but the source showed that this value was based on cycle counting rather than directly comparable elapsed seconds.
So the first-pass harness that treated that printed number like normal runtime was wrong.
For that benchmark, the fair comparison had to fall back to wall-clock proxy timing for the benchmark row, and then to GPU-total profiling when the row stayed close enough to be ambiguous.
GPU-total profiling: a more honest metric for noisy benchmarks
For benchmarks that did not expose a clean benchmark-owned GPU total, the campaign profiled:
- kernel time,
- host-to-device copies,
- device-to-host copies.
That was closer to the original performance question:
how much time did the actual GPU offloading work take?
In practice, this meant profiler runs whose per-kernel and per-copy statistics could be summed into a single GPU total,
or equivalent profiler-based GPU totals during the earlier passes.
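The aggregation step behind those totals is simple. A sketch (the `(category, duration_ms)` row shape is an assumed simplification of real profiler stats output, not a specific tool's format):

```python
# Sketch: sum only GPU-relevant activity into one total, mirroring the
# kernel / host-to-device / device-to-host split described above.
GPU_CATEGORIES = {"kernel", "memcpy_h2d", "memcpy_d2h"}

def gpu_total_ms(rows):
    """rows: iterable of (category, duration_ms) profiler records."""
    return sum(ms for cat, ms in rows if cat in GPU_CATEGORIES)
```

Note what this total deliberately excludes: host-side API overhead and everything else wall-clock would have swept in, which is exactly why it answered the offloading question more honestly for pathfinder and srad_v2.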
This mattered especially for pathfinder and srad_v2.
pathfinder looked close or ambiguous under wall-clock. Under summed GPU activity time, it clearly favored REX. That tells us the wall-clock number was measuring too much unrelated host behavior.
srad_v2 looked like a possible native LLVM edge in some wall-clock reruns. GPU-total checks showed that the device work itself was effectively tied and in some samples slightly favored REX. Again, the right conclusion was not “LLVM is faster on the GPU.” The right conclusion was “the remaining wall-clock delta is dominated by host-side noise.”
That is a fairness issue, not a statistical footnote. If the metric mostly measures the wrong thing, then a “winner” based on that metric is not trustworthy.
Figure 3. Fairness corrections changed real benchmark conclusions. The methodology was not cosmetic; it determined whether a row meant what it claimed to mean.
Rule 4: Compare Semantics, Not Log Formatting
Correctness comparison also needed fairness rules of its own.
The most basic one is:
do not treat timing lines, banners, or intentionally different reporting text as computational mismatches.
Examples from the campaign:
- pathfinder output comparisons ignored the timer: line.
- b+tree comparisons ignored timing-only lines and the Tree transformation took diagnostic line.
- hotspot and srad_v2 used reduced OUTPUT runs because their default benchmark modes did not expose the numerical result clearly enough for a real correctness check.
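That filtering rule can be sketched as a prefix-based line filter applied to both outputs before the diff (the ignored prefixes come from the examples above; the helper name is ours):

```python
# Sketch of the semantic diff: strip timing/diagnostic lines before
# comparing, so formatting noise cannot masquerade as a real mismatch.
IGNORED_PREFIXES = ("timer:", "Total time:", "Tree transformation took")

def semantically_equal(out_a: str, out_b: str) -> bool:
    def keep(text):
        return [line for line in text.splitlines()
                if not line.strip().startswith(IGNORED_PREFIXES)]
    return keep(out_a) == keep(out_b)
```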
There was a second, more subtle rule:
current native LLVM versus current REX matters more than stale saved references when the benchmark is floating-point heavy.
That mattered for hotspot and srad_v2.
In reduced-output correctness mode, current native LLVM and current REX matched each other, but both differed slightly from older saved reference files. That is not evidence of a current REX regression. It is evidence that saved baselines can drift while the current two-way comparison still agrees.
So the campaign separated three different questions:
- does current native LLVM still work?
- does current REX still match current native LLVM?
- do both still match an older saved reference exactly?
Those are not the same question, and fair reporting should not collapse them into one blunt yes-or-no diff.
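Keeping the three questions separate is easiest when the report structure forces it. A sketch (the field names are ours; `same` stands in for whatever semantic comparison the harness applies):

```python
# Sketch: answer the three correctness questions separately instead of
# collapsing them into one blunt yes-or-no diff.
def correctness_report(native_out, rex_out, saved_ref, same):
    return {
        "native_runs":        native_out is not None,
        "rex_matches_native": same(native_out, rex_out),
        "both_match_saved":   same(native_out, saved_ref)
                              and same(rex_out, saved_ref),
    }
```

With this shape, the hotspot/srad_v2 situation reads correctly: `rex_matches_native` is true while `both_match_saved` is false, which points at baseline drift rather than a REX regression.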
The Case Studies That Forced The Rules
By the end of the campaign, four benchmarks had become reference examples for why fairness methodology matters. Figure 3 highlights the three most prominent cases; srad_v2 is excluded because it is primarily a lifecycle noise case rather than a distinct methodology pattern.
nn: fairness can remove an apparent REX win and still leave the compiler in a better place
nn is the clearest example of why honoring explicit launch clauses matters more than chasing a prettier benchmark table.
Once REX stopped charging offload initialization to the timed region and switched to a cleaner direct-kernel path, the benchmark improved dramatically. That was real progress.
But one apparent later win depended on shrinking a legal explicit user launch request. Once that was disallowed, the row became an effective tie.
That was the correct outcome. Fairness did not erase the engineering progress. It prevented overclaiming.
b+tree: the right timing source was already in the benchmark
b+tree showed the opposite problem. The comparison was initially noisy because the harness used too broad a timing source. Once the row switched to the benchmark’s own kernel Total time: line, the result became cleaner and more interpretable.
That benchmark also reinforced another fairness point: after the read-only-load fix restored __ldg(...) generation in the hot search kernel, the remaining performance story was about real compiler code generation, not about timing noise or reporting artifacts.
pathfinder: wall-clock was not answering the GPU question
pathfinder is the benchmark that best demonstrates why profiler-derived GPU totals can be fairer than benchmark-owned proxy numbers. Its printed timer value was not directly comparable runtime, and coarse wall-clock was noisy enough to blur the result.
Once GPU-total profiling was added, the real offloading picture became much clearer and favored REX more strongly than the proxy number suggested.
srad_v2: a supposed remaining gap turned out to be lifecycle noise
srad_v2 mattered because it looked like a stubborn native LLVM holdout for a while.
Later investigation showed that part of the apparent gap came from lifecycle behavior outside the useful GPU work. After the offload teardown policy was corrected and GPU-total profiling was checked, the benchmark no longer supported the story that native LLVM had a real remaining GPU advantage there.
That is exactly the kind of false conclusion a fairness framework is supposed to prevent.
What LLVM Still Did Better Even In A Fair Comparison
Fairness is not only about protecting REX from misleading losses. It is also about not erasing LLVM’s genuine strengths.
Even when REX won or tied fairly, native LLVM still retained several advantages that the campaign had to acknowledge plainly.
LLVM’s device execution model was more uniform and semantically conservative. It depended less on REX-specific front-end recovery of canonical loop shape and launch opportunities.
LLVM was also less exposed to some categories of source-to-source information loss. The earlier b+tree const-provenance issue was a good example: REX had to relearn enough read-only provenance to recover __ldg(...), while LLVM’s integrated pipeline never suffered from that exact source-to-source blind spot.
And in benchmarks where the source explicitly demanded a poor launch geometry, LLVM lost little from simply honoring that request. That mattered in nn. Once fairness forbade clause rewriting, REX’s strongest launch-shaping advantage on that benchmark was intentionally unavailable.
Those LLVM strengths do not undo REX’s wins. They explain why a fair report should describe ties as ties and should avoid turning every close benchmark into a narrative of total compiler superiority.
LLVM 22 Was A Good Stress Test For The Fairness Rules
The later LLVM 22 reevaluation was useful precisely because it reapplied the same fairness rules after a toolchain change.
The important methodological outcomes were:
- the benchmark tree was reused rather than regenerated, so the toolchain effect stayed isolated,
- native LLVM and REX were both rebuilt against the LLVM 22 runtime family,
- the timing-source corrections for pathfinder and b+tree were re-applied instead of forgotten,
- the interpretation of close cases still distinguished wall-clock proxy results from GPU-total evidence.
That reevaluation did not reopen a native LLVM performance lead anywhere in the fair suite. More importantly for this post, it showed that the methodology itself was stable enough to survive a toolchain migration.
That is the real standard for a fair comparison framework. It should not only explain one table after the fact. It should continue to produce trustworthy conclusions when the surrounding environment changes.
The Fairness Contract In One Sentence
The benchmark comparison between REX and native LLVM is fair only when all of the following are true at the same time:
- both sides use the same runtime family and workload contract,
- explicit user launch policy is preserved unless it is invalid,
- each benchmark is timed with the best metric it actually exposes,
- and correctness is judged on computational meaning rather than on timing text or stale baseline formatting.
That sounds stricter than a normal benchmark spreadsheet.
It is supposed to be.
Without those rules, the comparison starts drifting toward whichever number is easiest to collect or whichever compiler is willing to reinterpret more of the program on the user’s behalf. With those rules, the results become much harder to game and much easier to defend.
That is why the fairness layer ended up being just as important as the benchmark runs themselves. It is the reason the final REX results mean what they claim to mean.