How REX Validates GPU Offloading With Real Benchmarks

The top validation layer in REX is a benchmark campaign comparing native LLVM and REX-generated binaries for the same application. By running the same workloads and using benchmark-specific normalization and timing sources, this layer catches integration bugs, runtime-lifecycle issues, and performance regressions that earlier test layers miss.

The previous post in this series covered the semantic checkpoint before GPU execution: lowering_cpu runs the original OpenMP program and the REX-lowered program on the same CPU runtime and asks whether the transformation still preserves meaning.

That is a strong test layer.

It is still not the last one.

At some point the compiler has to survive the real thing:

  • a real application,
  • real offloading,
  • real runtime glue,
  • real device launches,
  • real numerical output,
  • and real performance scrutiny.

That is the benchmark-validation layer.

In the REX workflow that motivated this post, that layer was a side-by-side campaign over a NeoRodinia-derived benchmark tree. For each benchmark, the campaign:

  • built a native LLVM OpenMP offloading binary,
  • built a REX-generated binary linked against the same LLVM OpenMP runtime stack,
  • and then compared correctness and performance carefully instead of trusting a single wall-clock number.

This post is about that layer alone: why it exists, how the comparison is made fair, why correctness requires benchmark-specific normalization rules, why performance needs a metric ladder instead of a one-size-fits-all stopwatch, and what this layer caught that none of the cheaper suites could have revealed.

A layered validation stack showing parser, frontend, lowering invariants, CPU equivalence, and finally full benchmark validation at the top, where runtime integration, output drift, and performance regressions are caught.

Figure 1. Full benchmark validation belongs at the top of the stack. It is the slowest and noisiest layer, but it is the only place where real runtime integration and performance problems become visible.

Why Real Benchmarks Still Matter After All The Earlier Tests

Once a project has parser tests, frontend corpus tests, semantic-analysis checks, lowering-structure checks, and CPU equivalence runs, it is tempting to think the hard part is over.

It is not.

Those earlier layers answer valuable but narrower questions:

  • did the directive parse?
  • did the OpenMP AST get built correctly?
  • did semantic analysis synthesize the right facts?
  • did lowering emit the right structural artifacts?
  • did the lowered host program preserve OpenMP semantics on CPU?

But a real GPU benchmark asks a different question:

when the entire offloading path runs for a real application, does the final system still behave correctly and competitively?

That question includes failure modes the earlier layers simply cannot see:

  • cubin registration or offload initialization happening at the wrong time,
  • helper/runtime ABI mismatches that only appear during actual device launch,
  • target-data lifetime mistakes that only surface across repeated kernels,
  • correctness checks complicated by benchmark-specific output formats,
  • performance regressions hidden by bad timing sources,
  • and fairness mistakes where a compiler “win” depends on silently overriding user launch clauses.

This is why benchmark validation must exist, and also why it must come last.

If you start here, everything is noisy. If you end here, the noise is at least interpretable.

What The Benchmark Campaign Actually Compared

The current campaign used nine GPU benchmarks under a common tree:

  • b+tree
  • bfs
  • gaussian
  • heartwall
  • hotspot
  • nn
  • pathfinder
  • srad_v1
  • srad_v2

For each benchmark, the comparison was deliberately symmetric.

The native LLVM side was built with Clang OpenMP target offloading using the same LLVM runtime family that the REX-generated code would later link against:

clang -O3 -fopenmp \
  -fopenmp-targets=nvptx64-nvidia-cuda \
  --offload-arch=sm_80 \
  ...

The REX side was not treated as a separate ecosystem. The compiler generated host, device, and helper files, but the resulting binary was still linked against the same libomp and libomptarget stack from LLVM.

That matters because this layer is not trying to prove that two entirely different software stacks happen to run similar applications. It is trying to isolate the effect of the compiler-generated code shape:

  • LLVM native lowering plus LLVM runtime,
  • versus REX lowering plus the same LLVM runtime family.

Representative benchmark invocations were kept in small run files next to each app:

./bfs.out ../../data/bfs/graph1MW_6.txt
./gaussian.out ../../data/gaussian/matrix1024.txt
./heartwall.out ../../data/heartwall/test.avi 20 4
./nn.out filelist_4 5 30 90
./pf.out 100000 1000

That looks mundane, but it is a useful discipline. The benchmark layer should not quietly drift into “who remembers the right command today?” territory. The comparison is only meaningful when both variants are run with the same workload contract.
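That workload contract can be enforced mechanically. The sketch below is a minimal illustration, not the campaign's actual harness: it reuses the argument list from a benchmark's run file and swaps in whichever binary is under test, so both variants are guaranteed to see identical inputs (the file layout and function name are hypothetical).

```python
import shlex
import subprocess

def run_variant(binary, run_line, workdir):
    """Run one benchmark variant under the shared workload contract.

    `run_line` is the invocation string stored in the benchmark's run file;
    only the leading program token is swapped for the variant under test,
    so the arguments are identical for native LLVM and REX builds.
    """
    args = shlex.split(run_line)
    cmd = [binary] + args[1:]          # keep arguments, replace the program
    result = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
    return result.stdout

# Both variants consume the exact same argument list, e.g.:
# run_variant("./bfs_native.out", line, "apps/bfs")
# run_variant("./bfs_rex.out",    line, "apps/bfs")
```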

A side-by-side benchmark workflow: build native LLVM binary, build REX binary, run the same input set, normalize outputs, then compare correctness and performance with explicit rules.

Figure 2. The benchmark layer is a controlled comparison loop, not an informal spot check. Both variants are built, run, normalized, and judged under an explicit contract.

Correctness Is Not One Diff

One of the fastest ways to get misleading benchmark conclusions is to treat output comparison as a naive file diff.

That does not work well for real applications.

Some benchmarks print timing lines that are obviously expected to differ between runs. Some print progress or banner lines that are irrelevant to numerical correctness. Some do not print the full numerical result in their normal performance configuration at all. Some have saved “golden” outputs that are themselves stale because floating-point behavior drifted slightly across toolchains while the current native LLVM and current REX outputs still match each other.

The benchmark layer therefore used benchmark-specific normalization rules.

Example 1: ignore timing-only lines

pathfinder is the simplest example. Its main output includes a timer: line. That line is useful to a user, but it is not a correctness signal. So the comparison strips it before diffing results.

Likewise, b+tree comparisons ignore timing-only lines such as the reported Total time: and the Tree transformation took diagnostic line before checking whether the meaningful output matches.

This is the right kind of normalization because it removes data that is supposed to differ while leaving the computational result intact.
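A minimal sketch of that normalization, using the timing prefixes named above (`timer:`, `Total time:`, `Tree transformation took`); a real harness would keep one prefix list per benchmark rather than the single combined list shown here.

```python
# Timing-only prefixes that are expected to differ between runs.
# These come from pathfinder and b+tree as described in the text.
TIMING_PREFIXES = ("timer:", "Total time:", "Tree transformation took")

def normalize(output: str) -> str:
    """Drop timing-only lines so the diff judges the computed result."""
    kept = [line for line in output.splitlines()
            if not line.strip().startswith(TIMING_PREFIXES)]
    return "\n".join(kept)

def outputs_match(native: str, rex: str) -> bool:
    """Compare the two variants' meaningful output, ignoring timers."""
    return normalize(native) == normalize(rex)
```

The point of keeping this as a tiny, explicit rule is that the diff stays strict about everything the rule does not name: any non-timing divergence still fails.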

Example 2: reduced OUTPUT runs for benchmarks that do not normally emit data

hotspot and srad_v2 exposed a different problem. Their default benchmark paths are optimized for throughput measurement, not for result inspection. If you want a correctness comparison, you need a smaller run mode that actually emits the computed data.

So the campaign used reduced OUTPUT builds for those cases:

  • smaller problem sizes,
  • output enabled,
  • then exact or tolerance-aware comparison on the emitted numerical results.

That is not cheating. It is simply recognizing that “benchmark mode” and “correctness visibility mode” are sometimes different operational configurations for the same application.

Example 3: current native versus current REX matters more than stale saved baselines

Another important lesson from the campaign was that saved reference outputs are useful but not absolute.

For hotspot and srad_v2, the current native LLVM output and the current REX output matched each other exactly in the reduced correctness mode, but both differed slightly from older saved references. That is a very different situation from “REX is wrong.”

It means the validation surface has to distinguish:

  • current native versus current REX mismatch,
  • from current outputs versus historical baseline drift.

This is one reason the benchmark layer should be treated as an investigation harness rather than as a single blunt pass/fail oracle. A saved reference can go stale. The current side-by-side comparison still tells you whether REX diverged from the behavior native LLVM produces today.
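The three-way distinction above can be made explicit in code. This is an illustrative classifier, not the campaign's tooling; the verdict strings are made up for the sketch.

```python
def classify(native_out: str, rex_out: str, saved_ref: str) -> str:
    """Separate a real REX divergence from stale-baseline drift.

    Inputs are assumed to be already-normalized outputs. The current
    native-vs-REX comparison is checked first because it is the signal
    that actually implicates the compiler.
    """
    if native_out != rex_out:
        return "rex-diverged"      # investigate the compiler first
    if native_out != saved_ref:
        return "baseline-stale"    # both current outputs agree;
                                   # the saved reference drifted
    return "match"
```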

Floating-point equality is not always bitwise equality

This also matters for GPU floating-point code in general. The benchmark layer treated “same output” pragmatically:

  • exact equality where the benchmark output is deterministic and integer-like,
  • normalized equality where timing or banners must be stripped,
  • and tolerance-aware reasoning where the application is floating-point heavy and the important question is whether the current native LLVM and current REX results agree within expected numerical drift.

That is a much healthier contract than pretending every benchmark should produce byte-for-byte identical raw logs forever.
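As a sketch, the exact and tolerance-aware ends of that contract might look like the following (the normalized mode is just the line-stripping rule composed with exact comparison, so it is omitted here; the mode names and tolerance are illustrative).

```python
import math

def compare(native: str, rex: str, mode: str, rel_tol: float = 1e-6) -> bool:
    """Per-benchmark equality contract.

    mode:
      "exact"     - deterministic, integer-like output
      "tolerance" - floating-point heavy output, compared value by value
    """
    if mode == "exact":
        return native == rex
    if mode == "tolerance":
        a = [float(tok) for tok in native.split()]
        b = [float(tok) for tok in rex.split()]
        return len(a) == len(b) and all(
            math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b))
    raise ValueError(f"unknown mode: {mode}")
```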

Performance Is Not One Number Either

Correctness comparison is only half the top-layer job. The other half is performance.

And performance can be even easier to mis-measure than correctness.

A benchmark may print:

  • total benchmark time,
  • only kernel compute time,
  • a host wall-clock proxy around a large workflow,
  • or nothing directly useful for GPU work at all.

If the validation layer collapses all of those into “one timing number per app,” the conclusions will be sloppy. The current campaign ended up with a metric ladder instead.

A timing-source ladder ranking benchmark-owned GPU totals highest, then benchmark compute-stage timings, then profiler-derived GPU totals, and wall-clock proxy last.

Figure 3. Benchmark performance comparison needs a metric ladder. The right question is not only “what number do we have?” but also “what does that number actually measure?”

The Metric Ladder

The best timing source was always the benchmark’s own directly relevant metric when one existed.

That usually meant one of these:

  • benchmark total time when it clearly represented the offloaded workload of interest,
  • benchmark compute-stage time when the app separated setup and computation cleanly,
  • or an internal kernel-oriented total that already reflected transfer plus compute.

Examples from the campaign:

  • nn used the benchmark’s own total time,
  • gaussian used the benchmark total that already included transfers,
  • bfs used the benchmark compute-stage timing,
  • b+tree eventually used the kernel’s own Total time: line rather than an external stopwatch.

That last detail matters. One of the early comparison mistakes was using an external wall-clock proxy for b+tree even though the benchmark already exposed a more relevant internal timing line. Once the comparison switched to the correct metric, the performance story became clearer.

When a benchmark did not expose a trustworthy GPU-total metric itself, the next best option was profiler-derived GPU activity time. In practice that meant summing:

  • kernel execution time,
  • host-to-device copies,
  • and device-to-host copies

from nsys profile --stats=true ./benchmark.

This was important for cases like pathfinder, hotspot, heartwall, and srad_v2, where plain wall-clock can mix GPU work with a lot of unrelated host-side behavior.
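The summation itself is trivial once the profiler output has been parsed. The sketch below assumes pre-parsed rows rather than raw nsys text, since the exact stats format varies by nsys version; the category names are illustrative placeholders.

```python
def gpu_activity_total(stats_rows):
    """Sum the GPU-side work that wall-clock timing can hide.

    `stats_rows` is assumed to be pre-parsed profiler output: a list of
    (category, duration_ns) pairs extracted from something like
    `nsys profile --stats=true`. Only kernel time and the two copy
    directions count toward the GPU total; host overhead is excluded.
    """
    gpu_categories = {"kernel", "memcpy_h2d", "memcpy_d2h"}
    return sum(ns for cat, ns in stats_rows if cat in gpu_categories)

rows = [("kernel",        1_200_000),
        ("memcpy_h2d",      300_000),
        ("memcpy_d2h",      150_000),
        ("host_overhead", 5_000_000)]   # excluded on purpose
# gpu_activity_total(rows) -> 1_650_000 ns
```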

The campaign showed exactly why this distinction matters. pathfinder is the clearest example: under wall-clock proxy timing, it looked close enough to be ambiguous. Under summed GPU activity time, the offloading result was much clearer and strongly favored REX. That means wall-clock was not measuring the question we actually cared about.

Wall-clock is therefore the last resort in the metric ladder, not the default.
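Treating the ladder as data makes the last-resort rule hard to violate by accident. A minimal sketch, with illustrative metric names (these keys are not real benchmark output fields):

```python
# Timing sources, best first, per the ladder described above.
LADDER = ("benchmark_gpu_total", "benchmark_compute_stage",
          "profiler_gpu_total", "wall_clock")

def pick_timing_source(available: dict):
    """Return the highest-quality timing metric a benchmark exposes."""
    for source in LADDER:
        if source in available:
            return source, available[source]
    raise LookupError("no usable timing source")

# A benchmark exposing only a compute-stage time and a wall clock
# resolves to the compute-stage time, never the wall clock:
# pick_timing_source({"wall_clock": 4.2, "benchmark_compute_stage": 1.1})
```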

Fairness Rules Matter As Much As The Timing Source

A performance comparison can still be misleading even when the timing metric is good if the compiler silently changes the program’s launch intent.

This became a major concern during the campaign, especially after some optimizations improved results by trimming oversized launches.

There is a principled rule here:

if the user explicitly requests a launch configuration such as num_threads or num_teams, the compiler should honor it unless the request is invalid.

A compiler is free to choose a better default when the source does not specify one. It is not free to quietly rewrite an explicit user request just because that request performs badly on a benchmark.

That led to a fairness re-evaluation in the campaign:

  • explicit launch clauses had to be preserved in measured binaries,
  • benchmark-specific “fixups” that overrode legal user input were rejected,
  • and any optimization that remained had to be generic default shaping when the user left the choice open, or safety clamping when the request exceeded hardware or ABI limits.

This distinction was especially important because some benchmarks want opposite things. A tiny loop like nn can benefit dramatically from avoiding a comically oversized launch. A benchmark like heartwall can still benefit from preserving a small user thread count while trimming obviously unnecessary block count around it. The right conclusion is not “always shrink everything.” The right conclusion is “respect explicit user policy, and only optimize inside the space the source actually leaves open.”
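The policy boils down to a small decision function. This is a sketch of the rule as stated above, not REX's actual implementation, and the numbers in the examples are invented:

```python
def resolve_threads(user_request, default, hw_limit):
    """Honor an explicit user launch clause; only shape open choices.

    - user_request is None when the source left the choice open:
      the compiler may apply generic default shaping.
    - an explicit request is preserved, and clamped only when it
      exceeds a hard hardware/ABI limit (i.e. the request is invalid).
    """
    if user_request is None:
        return default        # generic default shaping
    if user_request > hw_limit:
        return hw_limit       # safety clamp, not a silent rewrite
    return user_request       # explicit user policy is respected

# resolve_threads(None, 128, 1024) -> 128   (compiler's choice)
# resolve_threads(32,   128, 1024) -> 32    (user wins, even if slower)
# resolve_threads(4096, 128, 1024) -> 1024  (invalid request clamped)
```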

That is a compiler design rule, not just a benchmarking etiquette rule.

What Only The Benchmark Layer Caught

The value of this layer becomes clearest when you look at the kinds of bugs it actually exposed.

Misplaced offload initialization in the timed path

One of the early nn regressions had nothing to do with device arithmetic. The generated program was paying one-time offload setup in the measured region. Parser tests, AST tests, lowering structural checks, and even CPU equivalence would not have told you that the benchmark’s own timer was accidentally now measuring cubin registration and offload startup.

The benchmark run made it obvious because the application-level total time was catastrophically wrong relative to native LLVM for a workload that should have been small.

That is a classic top-layer bug:

  • the transformed program is still structurally valid,
  • it may even be semantically correct,
  • but the runtime lifecycle is placed badly enough that the application-level performance result is false.

Misleading timing proxies

Another class of bug was not in the compiler at all. It was in the comparison methodology.

The campaign discovered that:

  • pathfinder prints a TSC-style timing value rather than a directly comparable GPU-total metric,
  • b+tree should use its own kernel Total time: line,
  • and wall-clock-heavy benchmarks can change apparent winner depending on how much non-GPU work their run mode includes.

That is exactly why the benchmark layer has to stay explicit about timing-source quality. Otherwise people will draw conclusions about GPU lowering from numbers that are mostly measuring something else.

Saved baseline drift

hotspot and srad_v2 also demonstrated another point that cheaper test suites rarely hit: the saved “reference output” itself can become the unstable part. The current native LLVM and current REX results matched, yet both differed slightly from older saved outputs.

That is not a compiler failure by itself. It is a validation-data maintenance issue, and you only discover it at the benchmark layer because that is where real floating-point output and real application configurations are being compared.

True end-to-end integration

Finally, this layer is where the whole stack has to cooperate:

  • generated host launch blocks,
  • runtime helper files,
  • cubin registration,
  • __tgt_target_kernel transport,
  • device-side kernel identity,
  • map-array construction,
  • target-data lifetime,
  • and application-owned I/O and timing code.

No cheaper suite composes all of that at once.

That is why full benchmarks are expensive but irreplaceable.

Why This Layer Must Stay Narrow In Purpose

The benchmark layer is important, but it should not try to do every testing job badly.

It is the wrong place to first detect:

  • parser drift,
  • clause spelling mistakes,
  • semantic-analysis normalization bugs,
  • or simple lowering-structure regressions.

Those all have cheaper and cleaner homes earlier in the stack.

The benchmark layer works best when it asks only the questions that genuinely require a real application:

  • does the full offloading path still run?
  • do current native LLVM and current REX outputs still agree?
  • do the measured performance conclusions still hold under a fair timing source?
  • and if they do not, is the difference caused by the compiler, the runtime lifecycle, the measurement method, or the benchmark’s own output conventions?

That is a valuable scope. It is also a narrow one.

The Real Contract Of The Top Layer

If you compress the whole benchmark-validation story down, the contract is this:

  1. build native LLVM and REX variants of the same application against the same LLVM runtime family,
  2. run the same inputs,
  3. compare outputs with benchmark-specific normalization rules,
  4. compare performance using the best metric each benchmark actually exposes,
  5. respect explicit user launch clauses when judging fairness,
  6. and treat the results as an investigation surface, not as a one-line scoreboard.
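The whole contract can be compressed into one verdict function. This is a deliberately tiny sketch, assuming the benchmark-specific `normalize` rule and the ladder-selected metrics are supplied from outside; note that it returns facts to investigate, not a single pass/fail bit.

```python
def judge(native_raw, rex_raw, normalize, metric_native, metric_rex):
    """One benchmark verdict under the contract above.

    `normalize` is the benchmark-specific output rule; the metrics are
    whatever the timing-source ladder selected for this benchmark.
    """
    return {
        "outputs_agree": normalize(native_raw) == normalize(rex_raw),
        "speedup_vs_native": metric_native / metric_rex,
    }

verdict = judge(
    "r 1\ntimer: 3\n", "r 1\ntimer: 5\n",
    lambda s: "\n".join(l for l in s.splitlines()
                        if not l.startswith("timer:")),
    metric_native=2.0, metric_rex=1.0)
# verdict -> {"outputs_agree": True, "speedup_vs_native": 2.0}
```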

That last point matters.

The value of the benchmark layer is not only that it can say “REX is faster” or “LLVM is faster.” Its real value is that it can explain why a difference is believable or not:

  • because the output still matches after stripping timing-only lines,
  • because a wall-clock difference disappeared when compared with GPU-total profiling,
  • because a saved reference was stale while current native and current REX still agreed,
  • or because a real compiler/runtime-lifecycle regression was charging cold-start work to the application timer.

That is much more useful than a simplistic benchmark table.

Why This Post Comes After The Earlier Test Posts

The order of this series has been intentional.

You can only understand the benchmark layer properly after seeing the layers beneath it:

  • parser tests,
  • frontend AST construction tests,
  • semantic-analysis checks,
  • lowering invariant tests,
  • CPU equivalence tests.

Once those exist, the benchmark layer stops looking like a catch-all proof of correctness. It becomes what it should be:

the final reality check.

It is the place where compiler-generated code meets a real application and has to survive everything the earlier layers abstracted away.

That is exactly why it deserves its own post, and exactly why it belongs at the top of the stack rather than at the bottom.