Why REX Treats GPU Benchmark Results As An Investigation Surface, Not A Scoreboard

REX does not treat GPU benchmark results as a bare winner/loser table. The benchmark layer has a six-part contract: build native LLVM and REX against the same LLVM runtime family, run the same inputs, compare outputs with benchmark-specific normalization, compare performance with the best timing source each benchmark actually exposes, respect explicit user launch clauses, and interpret the result as evidence rather than a scoreboard. That approach is what made it possible to explain whether a moved row came from a compiler regression, a runtime-lifecycle mistake, a bad timing proxy, a fairness problem, or a stale saved baseline.

The previous post argued that the GPU benchmark layer in REX must stay narrow instead of becoming a catch-all test suite.

This post narrows the same idea one step further.

It is about the style of interpretation at the top layer.

The benchmark layer is not most useful when it says:

REX faster
LLVM faster
pass
fail

It is most useful when it says:

this row moved, and here is why the movement is believable

That difference is the reason REX treats the benchmark layer as an investigation surface rather than a scoreboard.

The distinction matters because real benchmark rows can move for very different reasons:

  • a real compiler regression,
  • a runtime-lifecycle placement bug,
  • a timing-source mistake,
  • a saved-baseline drift issue,
  • or a fairness problem where one side quietly rewrote the user’s launch policy.

If all of those collapse into one winner/loser table, the benchmark layer becomes noisy and easy to overclaim from.

If they are interpreted under a contract, the same benchmark layer becomes much more valuable.

A diagram showing six benchmark-layer contract steps: same runtime family, same inputs, normalized correctness comparison, benchmark-appropriate timing, clause-preserving fairness, and investigation-first interpretation.

Figure 1. The benchmark layer only becomes trustworthy once it is constrained by a contract. Without that contract, a result table is just a pile of numbers and diffs.

The Six-Part Contract Of The Top Layer

By the end of the benchmark campaign, the top-layer contract in REX had become fairly crisp.

It had six parts.

1. Build native LLVM and REX against the same runtime family

This keeps the comparison about lowering and generated code shape instead of about two unrelated software stacks.

If native LLVM and REX are linked against different runtime ecosystems, then the row no longer isolates the question the campaign actually cares about.

It becomes:

system A versus system B

instead of:

native LLVM lowering versus REX lowering
under the same LLVM offloading runtime family

2. Run the same inputs

This sounds trivial, but it is part of the contract for a reason.

Benchmark rows are only meaningful when both sides are exposed to the same workload, not to vaguely similar runs that happen to share a benchmark name.

3. Compare outputs with benchmark-specific normalization

Real applications do not print pure mathematical answers. They print timers, banners, and benchmark-specific diagnostics.

So correctness at the top layer requires narrow normalization rules:

  • strip timing-only lines when they are not computational output,
  • use reduced OUTPUT modes when benchmark mode hides the result,
  • compare current native LLVM and current REX separately from old saved baselines.

Without that, a raw diff confuses formatting noise with semantic drift.
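As a sketch, those normalization rules might look like the following Python. The rule names and the timing-line regex are illustrative assumptions, not REX's actual implementation:

```python
import re

# Illustrative pattern for timing-only lines; a real harness would use
# benchmark-specific rules rather than one shared regex.
TIMING_LINE = re.compile(r"^(Time|Total time|Elapsed)[:=]", re.IGNORECASE)

def normalize(output: str, strip_timing: bool = True) -> list[str]:
    """Reduce raw benchmark output to its comparable computational lines."""
    lines = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue                      # blank lines are formatting noise
        if strip_timing and TIMING_LINE.match(line):
            continue                      # timers are not computational output
        lines.append(line)
    return lines

def outputs_match(native: str, rex: str) -> bool:
    """Compare current native LLVM output against current REX output."""
    return normalize(native) == normalize(rex)
```

The point of the sketch is the separation: the diff runs on normalized computational output, and the timing signal is handled elsewhere, by the metric-selection step.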

4. Compare performance using the best metric each benchmark actually exposes

The benchmark layer must not pretend every benchmark prints one equally meaningful time.

Some rows should use:

  • a benchmark-owned total,
  • a compute-stage time,
  • a kernel-specific Total time: line,
  • a profiler-derived GPU total,

instead of a convenient wall-clock proxy that mostly measures something else.
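A minimal sketch of that per-benchmark metric choice follows; the mapping below is purely illustrative, not REX's real configuration:

```python
# Hypothetical per-benchmark declaration of the trusted timing signal.
TIMING_SOURCE = {
    "b+tree": "kernel_total",     # wall clock includes too much non-kernel work
    "pathfinder": "gpu_profile",  # printed timing is not a comparable runtime
    "hotspot": "benchmark_total", # benchmark-owned total is meaningful here
}

def pick_time(benchmark: str, measurements: dict[str, float]) -> float:
    """Select the timing value the benchmark actually exposes as meaningful,
    refusing to fall back silently to a wall-clock proxy."""
    source = TIMING_SOURCE.get(benchmark)
    if source is None or source not in measurements:
        raise ValueError(f"no trusted timing source for {benchmark!r}")
    return measurements[source]
```

The design choice worth noticing is the hard failure: an unknown benchmark raises instead of quietly measuring whatever number is convenient.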

5. Respect explicit user launch clauses

This is the fairness boundary.

If the user explicitly requested num_threads or num_teams, the compiler should not quietly rewrite that legal request just because the benchmark would look better.

Defaults can be shaped. Invalid requests can be clamped. Open choices can be optimized.

But explicit user policy must stay visible in the measured binary.

6. Treat the result as an investigation surface, not a one-line scoreboard

This is the part that turns the benchmark layer from “a table of outcomes” into an engineering tool.

The top layer should help explain:

  • why a row moved,
  • whether the movement is measuring the intended thing,
  • and what class of issue the movement belongs to.

That is a stricter and more useful contract than a scoreboard.

What A Scoreboard Gets Wrong

A scoreboard compresses rich evidence into very little language.

That is sometimes useful for summary, but it is dangerous as the primary interpretation.

Consider how different these statements are:

LLVM faster on benchmark X

versus:

benchmark X only looked better under wall-clock proxy timing;
GPU-total profiling erased the apparent gap

Or:

REX faster on benchmark Y

versus:

REX only won because it rewrote an explicit user launch request,
so the result is not a fair compiler win

Or:

REX output mismatched baseline

versus:

current native LLVM and current REX still matched each other exactly,
but both drifted from an old saved floating-point baseline

Those are not cosmetic differences in wording. They are different engineering conclusions.

A scoreboard pushes them toward the same shape. An investigation surface keeps them separate.

A side-by-side contrast showing a simple winner-loser scoreboard on one side and a richer investigation-oriented benchmark interpretation on the other.

Figure 2. A scoreboard compresses different causes into the same output shape. An investigation surface preserves why a result changed and whether the change is even trustworthy.

The Benchmark Layer’s Real Output Is Explanation

The benchmark layer earns its cost when it can explain believable movement.

That means the output of a benchmark campaign is not only a table. It is also a classification of what the table means.

In the REX campaign, that classification often looked like one of these:

A real compiler or runtime-lifecycle issue

The early nn regression is the clearest example.

The problem was not arithmetic throughput. The generated program was charging one-time offload setup to the timed region.

A scoreboard would only show a bad number.

The investigation view showed:

  • the transformed program was still structurally valid,
  • the transformed program was still semantically valid,
  • but the runtime lifecycle was placed badly enough to invalidate the benchmark timing.

That is much more actionable than “REX slower on nn.”

A measurement-method problem

pathfinder and b+tree both exposed this class.

For pathfinder, the printed timing signal was not the simple comparable runtime it first appeared to be.

For b+tree, external wall-clock included too much non-kernel work even though the benchmark already had a better kernel-specific timing line.

Again, a scoreboard would merely say a row looked close or ambiguous.

The investigation surface could say:

  • this row is being judged by the wrong metric,
  • so any winner/loser story here is premature.

A fairness problem

nn also exposed the fairness side.

At one stage of the work, REX looked better because it had learned how to avoid a comically oversized launch for a tiny active work window.

That was a real optimization. It was not always a fair benchmark win.

Once the campaign enforced the rule that explicit user launch clauses must be preserved, the interpretation changed.

That kind of correction is exactly why the benchmark layer should not be read as a scoreboard first.

A baseline-maintenance problem

hotspot and srad_v2 showed the correctness version of the same idea.

Current native LLVM and current REX still matched each other. Both drifted from older saved baselines.

That is not the same thing as a current REX regression.

A scoreboard is bad at expressing that distinction. An investigation surface is built around expressing it.
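The three-way comparison that expresses it is easy to sketch; the labels below are illustrative, not REX's actual classification code:

```python
def classify_mismatch(native_out: str, rex_out: str, saved_baseline: str) -> str:
    """Classify a correctness result by comparing three artifacts, not two."""
    if rex_out == native_out:
        if rex_out == saved_baseline:
            return "all agree"
        return "baseline drift"     # the hotspot / srad_v2 shape: both current
                                    # compilers match, the saved baseline is stale
    return "rex divergence"         # only now is a current REX issue on the table
```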

Why This Is Stronger Than Saying Less

It may sound as if the investigation-surface mindset weakens the clarity of benchmark results by refusing to summarize quickly.

It actually makes the results stronger.

A benchmark conclusion is more defensible when the reader knows:

  • the two sides used the same runtime family,
  • the workload was the same,
  • the output comparison was normalized correctly,
  • the timing source actually measured the intended work,
  • explicit user clauses were preserved,
  • and the explanation for row movement has already been classified.

At that point, even a simple summary line such as:

REX win
LLVM win
effective tie

means more, because it sits on top of a real argument instead of replacing one.

That is the important inversion:

the scoreboard is acceptable as a summary only after the investigation contract has already been satisfied.

It is not acceptable as the whole interpretation.

What This Looks Like In Practice

In practice, the investigation-surface mindset changes how benchmark results are written down and how they are read.

A disciplined benchmark note should not stop at:

  • timing numbers,
  • correctness flags,
  • or a winner column.

It should also preserve:

  • what timing source the row used,
  • what normalization rules were needed,
  • whether explicit launch clauses were preserved,
  • whether current native LLVM and current REX still matched,
  • and what class of issue explains any disagreement or apparent movement.

That is why the REX benchmark work ended up with more than one document and more than one pass over the same suite.

The later fair reruns, GPU-total checks, and baseline-drift interpretations were not distractions from the benchmark result.

They were the benchmark result becoming trustworthy.

A flow diagram classifying a moved benchmark row into compiler/runtime issue, timing-source issue, fairness issue, or baseline-maintenance issue.

Figure 3. The useful question after a benchmark row moves is not only ‘who won?’ but ‘what class of reason explains the movement?’

The Top Layer Is A Reality Check, Not A Verdict Machine

This is also why the benchmark layer belongs at the top of the REX testing stack.

It comes after:

  • parser tests,
  • frontend AST construction tests,
  • semantic-analysis checks,
  • lowering invariant tests,
  • and CPU equivalence.

By the time the benchmark layer runs, the cheaper and narrower explanations should already have been ruled out.

That is exactly what gives the top layer its investigative power.

If a row still moves after all that, the remaining explanations are already more interesting:

  • real runtime-lifecycle issues,
  • real measurement-method issues,
  • real fairness questions,
  • real end-to-end integration problems,
  • or real performance differences.

That is the right role for the benchmark layer.

It should not behave like a verdict machine that converts a messy real application into one simplistic label.

It should behave like the final place where the compiler team learns what kind of top-layer reality it is actually dealing with.

The Design Rule In One Sentence

REX treats GPU benchmark results as an investigation surface rather than a scoreboard because the top layer is only trustworthy when each row can explain why it moved, not merely which side happened to look better in one unqualified table.