What Only Real GPU Benchmarks Still Catch In REX

REX already has parser tests, Frontend AST tests, semantic-analysis checks, lowering invariant tests, and CPU equivalence tests. None of those can replace real GPU benchmark runs. The benchmark layer was the only place that exposed cold-start offload initialization landing inside the timed region for nn, misleading timing proxies for pathfinder and b+tree, stale saved baselines in hotspot and srad_v2, and the full interaction of generated host code, helper files, CUBIN registration, launch packets, kernel identity, and application-owned timing/I/O. Real benchmarks are expensive, but they are still the final reality check.

The previous two posts in this series narrowed the benchmark layer into two specific contracts:

  • fairness in performance comparison,
  • and correctness comparison that does not trust naive raw diffs.

This post steps back one level.

It asks a simpler question:

after REX already has parser tests, Frontend AST tests, semantic checks, lowering invariant tests, and CPU equivalence tests, why are real GPU benchmarks still necessary at all?

The short answer is that those earlier layers prove narrower things.

They prove that the compiler understands the directive language, that it builds the right internal representation, that the lowerer emits recognizable structures, and that the transformed host code still behaves like OpenMP on CPU.

Those are all valuable.

They still do not answer the final question:

when the whole GPU offloading path runs inside a real application, does the final system still behave correctly and competitively?

That question is exactly what the benchmark layer exists to answer.

A layered stack showing parser, AST construction, semantic checks, lowering invariants, CPU equivalence, and finally real GPU benchmarks at the top, where the real application and runtime stack meet.

Figure 1. The benchmark layer is not redundant with the earlier test layers. It is the first place where the generated program meets the actual GPU runtime, application timing code, and full benchmark workflow at once.

What The Earlier Layers Already Prove

REX does not arrive at benchmarks empty-handed.

By the time a benchmark runs, several cheaper layers have already done important work.

Parser-level tests answer:

  • did the OpenMP directive syntax parse?
  • did the ompparser callback stream still build the intended OpenMPIR?

Frontend AST-construction tests answer:

  • did that OpenMPIR become the right SgOmp* nodes?
  • did the frontend survive real directive combinations and clause variants?

Semantic-analysis checks answer:

  • were data-sharing and clause semantics understood correctly enough for later phases?

Lowering invariant tests answer:

  • did the compiler emit the expected helper calls, outlined kernels, map arrays, and launch-side structures?
  • did specific regression-sensitive artifacts such as rex_offload_init() placement or offload-entry counts stay intact?

CPU equivalence tests answer:

  • when the lowered host code runs through the CPU runtime instead of the GPU path, does it still compute the same answer as the original OpenMP program?

That is already a strong stack.

It is strong precisely because each layer asks a narrow question and asks it cheaply.

So the benchmark layer is not there because the earlier layers are weak.

It is there because each of them deliberately stops before the final environment is fully assembled.

What Those Layers Still Cannot Prove

Even if every earlier layer passes, there are still classes of failure that remain invisible until a real benchmark runs.

The benchmark layer is the first place where all of these meet at once:

  • generated host launch code,
  • generated device code,
  • helper files such as register_cubin.cpp and rex_kmp.h,
  • linked libomp and libomptarget,
  • the real benchmark’s own I/O path,
  • the real benchmark’s own timer placement,
  • the real benchmark’s own output conventions,
  • and the real workload size that determines whether a launch policy is actually sane.

That composition is the point.

Any one earlier layer can tell you that a piece is locally correct. None of them can tell you that the entire assembled program still makes sense under the benchmark’s own execution model.

That is why the benchmark layer caught problems the cheaper suites never would have.

Four benchmark-only failure classes: misplaced offload init in timed code, misleading timing proxies, saved baseline drift, and true end-to-end integration failures.

Figure 2. The benchmark layer exposed four kinds of failure that the earlier layers were never designed to detect: wrong runtime lifecycle placement, bad measurement sources, stale validation baselines, and full-stack integration mistakes.

Failure Class 1: Misplaced Offload Initialization In The Timed Path

The clearest example was the early nn regression.

The problem had nothing to do with device arithmetic.

The generated program was paying one-time GPU offload setup inside the benchmark’s measured region. That meant the benchmark’s own timer was no longer measuring just the application work. It was also measuring CUBIN registration, image construction, and offload startup.

None of the earlier layers would have called that out.

Parser tests would still pass.

The AST would still look fine.

Lowering invariant tests could even confirm that rex_offload_init() existed somewhere in the host file.

CPU equivalence would still say the transformed program computed the same answer.

And yet the benchmark result would still be false as a performance signal.

That is exactly what a real benchmark exposed:

  • the transformed program was structurally valid,
  • the transformed program was semantically valid,
  • but its runtime lifecycle was placed badly enough that the measured total time was meaningless.

That is a top-layer bug by definition.

Only a benchmark with its own real timer around the workload can reveal that the compiler has accidentally moved one-time cost into the application’s hot path.

The later lowering fix and regression checks mattered, but the bug only became obvious because a real benchmark made the timing failure impossible to ignore.

Failure Class 2: Misleading Timing Proxies

Another problem class was not in the compiler at all. It was in the measurement method.

Real benchmarks taught the campaign that a timing line is not automatically the right performance metric.

pathfinder was the cleanest case.

It printed a line named timer, which looked at first like a natural result to compare.

It turned out not to be a directly comparable wall-clock metric for the GPU work the campaign actually cared about.

b+tree had a different version of the same mistake.

A coarse external wall-clock measurement made the benchmark look like a much larger end-to-end program, because it included file parsing, tree transformation, and command handling. But the benchmark already printed a better kernel-oriented Total time: line.

Earlier layers could not possibly detect that mistake, because it was not a compiler-structure problem.

It was a problem of interpretation:

  • which line in a real benchmark actually measures the offloaded work?

Only a top-layer benchmark harness can answer that, because only that layer is dealing with application-owned timing output in the first place.

This is also why the benchmark layer is not just a scoreboard. It is an investigation surface.

Its job is not merely to emit a number. Its job is to decide whether the number means what people think it means.

Failure Class 3: Saved Baseline Drift

The benchmark layer also exposed a different kind of correctness problem:

sometimes the unstable thing is not the compiler output. It is the validation data.

hotspot and srad_v2 demonstrated this clearly.

In reduced-output correctness mode:

  • current native LLVM and current REX matched each other,
  • but both differed slightly from older saved reference outputs.

That is not a failure mode the cheaper suites are built to distinguish.

A lowering invariant test does not know anything about floating-point drift in an application output corpus.

A CPU equivalence test does not tell you whether the saved benchmark baseline on disk is now stale.

A raw benchmark diff by itself also does not tell you enough, because it collapses three different questions:

  1. did current native LLVM still work?
  2. did current REX still match current native LLVM?
  3. did either of them still match an old saved baseline exactly?

The benchmark layer was the only place where that distinction became visible and necessary.

That matters because the engineering conclusion changes completely depending on which answer failed.

If current native LLVM and current REX still agree, but both drifted from the historical baseline, the likely issue is:

  • saved-baseline maintenance,
  • tolerance policy,
  • or floating-point drift across toolchain/runtime evolution.

That is a very different conclusion from:

  • current REX diverged from current native LLVM.

Only the benchmark layer had enough real application output to expose that distinction.

Failure Class 4: Full End-To-End Integration

The most important benchmark-only value is still the broadest one:

real benchmarks are where the full offloading stack has to cooperate.

That means all of these have to line up in one executable:

  • generated host launch blocks,
  • generated device kernels,
  • offload-entry emission,
  • CUBIN loading and registration,
  • runtime ABI wrappers,
  • map-array construction,
  • __tgt_target_kernel transport,
  • target-data lifetimes,
  • device-kernel identity,
  • benchmark-owned I/O,
  • benchmark-owned timing code.

No cheaper suite composes all of that at once.

A structural lowering test can prove that a __tgt_offload_entry was emitted.

A runtime-helper post can explain how register_cubin.cpp builds a __tgt_bin_desc.

A launch-block test can prove that __tgt_kernel_arguments was assembled.

None of those by themselves prove that the fully linked benchmark executable still launches the right kernel, with the right mappings, at the right moment, under the benchmark’s own control flow.

That is why end-to-end runs are expensive but irreplaceable.

They are not there because the compiler team forgot how to write smaller tests.

They are there because some truths only exist once the whole system is assembled.

A diagram showing generated host code, helper files, cubin registration, runtime ABI, map arrays, kernel identity, and benchmark-owned timing/I/O all converging into one full benchmark run.

Figure 3. The benchmark layer is the first time every moving part of GPU offloading is forced to cooperate inside one real application. That is why it still catches failures none of the smaller layers can see.

Why The Benchmark Layer Must Still Stay Narrow

Saying the benchmark layer is indispensable does not mean it should become the default place to detect every bug.

That would make it slow, noisy, and hard to interpret.

The benchmark layer is still the wrong place to first discover:

  • parser drift,
  • clause spelling mistakes,
  • semantic-analysis normalization issues,
  • or simple lowering-structure regressions.

Those belong in the earlier layers precisely because they are cheaper and cleaner there.

The benchmark layer works best when it asks only the questions that require a real application:

  • does the full GPU offloading path still run?
  • do current native LLVM and current REX still agree?
  • does the chosen timing source still mean what the metric claims to measure?
  • if performance moved, is the reason in the compiler, the runtime lifecycle, the measurement method, or the benchmark’s own structure?

That is a narrow purpose.

It is also the right purpose.

When people try to use benchmarks as a catch-all oracle, the result is usually confusion:

  • a performance number is blamed on parsing,
  • a stale baseline is mistaken for a compiler regression,
  • or a lifecycle bug is hidden inside what looks like ordinary runtime variance.

The discipline of the earlier layers is what makes the top layer interpretable.

Why This Layer Comes Last In The Series

The ordering of the REX OpenMP Journey posts has been deliberate.

The benchmark layer only makes sense after the reader already understands the layers below it:

  • how pragmas are carried,
  • how OpenMPIR becomes SgOmp*,
  • how GPU lowering emits host, device, and helper artifacts,
  • how runtime glue, CUBIN registration, and offload-entry identity work,
  • how lowering invariant tests and CPU equivalence protect cheaper parts of the stack.

Once those exist, the benchmark layer stops looking like a generic proof of everything.

It becomes what it really is:

the final reality check.

That is why it deserves its own post.

It is not the first line of defense in REX. It is the last one.

The Design Rule In One Sentence

Real GPU benchmarks still matter in REX because they are the only place where the fully assembled compiler output meets a real application’s timing, I/O, runtime lifecycle, and workload shape all at once.

That is exactly why they caught:

  • misplaced offload initialization in nn,
  • misleading timing proxies for pathfinder and b+tree,
  • saved-baseline drift in hotspot and srad_v2,
  • and the full interaction of host launch code, helper files, CUBIN registration, and device execution.

They are expensive.

They are noisy.

They are also still the only honest final test.