How REX Validates Benchmark Correctness Without Trusting Naive Diffs
In short: the campaign's correctness workflow ignored pathfinder's timer: output and b+tree's Total time: report, and used reduced OUTPUT configurations for benchmarks like hotspot and srad_v2 that do not normally expose their computed data. It also compared current native LLVM against current REX separately from old saved baselines. That framework made it possible to distinguish real regressions from formatting noise and floating-point baseline drift.
The previous post in this series focused on fairness in performance comparison: same runtime stack, same user intent, and the right meaning of time.
This post covers the correctness half of the same problem.
The question is:
when a benchmark is supposed to prove that native LLVM and REX still compute the same thing, what exactly counts as “the same thing”?
The answer turned out to require more care than a raw diff.
That is because real benchmarks do not only print computational results. They also print:
- timers,
- status text,
- banners,
- benchmark-specific diagnostics,
- and, in some cases, nothing directly useful for result checking unless they are rebuilt in a smaller output-enabled mode.
So the benchmark campaign had to learn a narrower rule:
compare computational meaning, not arbitrary log text.
That rule seems obvious after the fact. It was not obvious in practice until several misleading mismatches had already shown up.
Figure 1. Benchmark correctness is a funnel, not a raw diff. The campaign first chose the right output mode, then removed non-semantic text, then compared the remaining computational payload.
Why A Raw Diff Was The Wrong Starting Point
The simplest possible correctness rule would be:
run both binaries, capture their full logs, and pass only if a raw diff of the two logs is empty.
That rule is appealing because it looks objective.
It is also wrong for many real benchmarks.
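This naive rule amounts to a byte comparison of the two full logs. A minimal sketch, with hypothetical file names:

```python
# Naive correctness rule: pass iff the two logs are byte-identical.
# File paths are hypothetical placeholders for the two captured logs.
def naive_pass(native_log: str, rex_log: str) -> bool:
    with open(native_log, "rb") as a, open(rex_log, "rb") as b:
        return a.read() == b.read()
```

Any timing line or banner that differs between the two runs fails this check, which is exactly the problem described next.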
The problem is not that strict comparison is bad. The problem is that a benchmark log is usually a mixture of different kinds of information:
- computational results,
- timing reports,
- progress output,
- benchmark-driver diagnostics,
- and sometimes source-version-specific formatting.
If those categories are mixed together, then a raw diff is not measuring correctness. It is measuring everything at once.
That leads to two opposite failure modes.
The first failure mode is a false regression:
- the logs differ,
- but the only difference is a timing line or an expected banner.
The second failure mode is a false sense of safety:
- the benchmark prints very little actual computed data,
- so a matching log says almost nothing about the real result.
The campaign therefore treated benchmark correctness as a three-step process:
- choose an output mode that actually exposes the computation,
- remove log text that is not part of the computation,
- compare current native LLVM and current REX first, then interpret any saved-baseline mismatch separately.
That sounds procedural, but each step came from a real benchmark problem.
Step 1: Strip Timing-Only And Diagnostic Lines
The easiest benchmark-output bug class was also the one most likely to waste time: lines that are expected to differ and should never count as correctness failures.
Two benchmarks made this painfully clear.
pathfinder: ignore the timer: line
pathfinder prints a timer: line in its main output.
That line is useful to a human reading a benchmark run. It is not part of the numerical result.
So the correctness comparison deliberately ignored it.
Conceptually, the normalization was as simple as:
delete every line beginning with timer: from both logs before diffing them.
The important idea is not the exact tool. The important idea is that the line is removed because it is semantically irrelevant to the correctness question.
Keeping it in the diff would make the test noisier without making it stricter in any meaningful way.
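A minimal sketch of that normalization (the timer: prefix comes from the post; the function name is mine):

```python
def normalize_pathfinder(log_text: str) -> str:
    # Drop timer: lines: they are expected to differ between runs,
    # and they are not part of the numerical result pathfinder computes.
    kept = [line for line in log_text.splitlines()
            if not line.startswith("timer:")]
    return "\n".join(kept)
```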
b+tree: ignore timing and tree-transformation diagnostics
b+tree exposed the same pattern in a different form.
Its logs contain lines such as:
- the Total time: numeric timing line,
- and the Tree transformation took diagnostic.
Those lines are useful for runtime interpretation. They are not part of the query result that the benchmark is supposed to compute.
So the comparison stripped them before deciding whether native LLVM and REX agreed.
This is what good normalization looks like:
- it removes lines that are supposed to differ,
- but it leaves the actual algorithmic output intact.
That distinction matters. Benchmark normalization should be narrow and explainable. It should not become a bag of excuses for whatever the compiler happened to print differently.
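The same idea generalizes to a small, per-benchmark table of non-semantic line prefixes. A sketch, where the prefix spellings follow the post and everything else is an assumed illustration:

```python
# Per-benchmark prefixes of lines that are expected to differ
# between runs and are not part of the computed result.
NON_SEMANTIC_PREFIXES = {
    "pathfinder": ("timer:",),
    "b+tree": ("Total time:", "Tree transformation took"),
}

def normalize(benchmark: str, log_text: str) -> str:
    prefixes = NON_SEMANTIC_PREFIXES.get(benchmark, ())
    # str.startswith accepts a tuple; an empty tuple matches nothing,
    # so unknown benchmarks pass through unmodified.
    kept = [line for line in log_text.splitlines()
            if not line.startswith(prefixes)]
    return "\n".join(kept)
```

Keeping the table explicit and benchmark-specific is what keeps the normalization narrow and explainable.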
Figure 2. Narrow normalization removes only what was never part of correctness in the first place. pathfinder and b+tree both needed this kind of cleanup before diffing their real outputs.
Step 2: Use A Correctness Mode When Benchmark Mode Hides The Result
The second problem was more subtle.
Some benchmarks do not print the useful result at all in their default performance configuration.
That is not a bug in the benchmark. It is a design choice. Benchmark mode is often optimized for throughput measurement, not for result inspection.
If correctness validation insists on using only the default benchmark mode anyway, then it can end up “verifying” almost nothing.
Two benchmarks forced the campaign to confront this directly: hotspot and srad_v2.
hotspot: reduced OUTPUT mode was the real correctness surface
For hotspot, the main benchmark configuration is good for timing but not good for inspecting the computed field in a meaningful way.
So the campaign used a smaller reduced-output mode:
- enable OUTPUT,
- use a smaller 64x64 problem,
- use a short simulation interval,
- then compare the emitted numerical grid.
In the campaign notes, that meant a reduced configuration such as 64x64 with sim_time=2.
That is not cheating and it is not changing the algorithm. It is simply choosing a mode where the benchmark exposes its computed state in a form that can actually be checked.
srad_v2: same principle, different benchmark
srad_v2 had the same structural issue.
Its default benchmark run is useful for throughput measurement, but not for comparing a large numerical output stream in a practical way.
So it also used a reduced 64x64 OUTPUT run for correctness.
The important design principle here is easy to miss:
performance mode and correctness-visibility mode do not always have to be the same run mode.
That is acceptable as long as both native LLVM and REX use the same reduced correctness configuration and the mode change is only about exposing the computed result, not changing the semantics being compared.
This is one reason the benchmark layer should be described as an investigation harness rather than as a single command that mechanically emits pass or fail. Real applications rarely line up so neatly.
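That discipline can be sketched as a runner that feeds both toolchain variants one shared reduced configuration; the binary paths and argument list below are hypothetical placeholders:

```python
import subprocess

# Hypothetical reduced correctness configuration, e.g. a 64x64
# problem with a short simulation interval (sim_time=2).
REDUCED_ARGS = ["64", "64", "2"]

def run_reduced(binary: str) -> str:
    # Both the native LLVM build and the REX build receive
    # byte-identical arguments, so any difference in their output
    # reflects the toolchain, not the run configuration.
    result = subprocess.run([binary, *REDUCED_ARGS],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Correctness question: run_reduced(native_bin) vs run_reduced(rex_bin)
```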
Step 3: Separate “Current Agreement” From “Historical Baseline Agreement”
This was the most important correctness lesson in the whole campaign.
A saved reference file is useful, but it is not absolute law.
If current native LLVM and current REX match each other exactly, but both differ slightly from an older saved baseline, that is a very different situation from “REX is wrong.”
The campaign therefore split correctness into three distinct questions:
- does current native LLVM still run and produce sensible output?
- does current REX match current native LLVM?
- do both still match an older saved reference exactly?
Those questions are related, but they are not interchangeable.
hotspot: current native and current REX agreed, saved baseline drifted
In reduced-output mode, the current native LLVM and current REX hotspot outputs matched each other exactly.
But both differed from the saved reference corpus.
The campaign notes recorded this very concretely:
- token count stayed the same: 8199,
- the first output line in the saved reference was 0 323.833,
- the current native and current REX run produced 0 323.829,
- the maximum absolute delta across the reduced output was about 20.757,
- the maximum relative delta was about 6.0%.
Those facts matter because they show why the result should not be summarized as “REX failed correctness.”
A better summary is:
- current native and current REX agree,
- both differ from an older saved baseline,
- so the likely issue is saved-baseline drift or overly strict baseline policy, not a current REX-only regression.
srad_v2: same pattern, smaller numerical drift
srad_v2 showed the same shape.
Again:
- current native LLVM and current REX matched each other exactly in reduced-output mode,
- both differed slightly from the saved reference corpus,
- token counts still matched: 4107,
- the maximum absolute delta was about 0.0364,
- the maximum relative delta was about 2.17%.
This is exactly the sort of evidence that a naive pass/fail diff would misread.
A raw baseline diff would only tell you:
output does not match the saved reference: fail.
The full interpretation is richer:
- current two-way agreement is intact,
- historical exact match drifted,
- the benchmark is floating-point heavy,
- so the right next step is to classify the baseline mismatch, not to accuse the current compiler variant immediately.
Figure 3. There are three correctness comparisons, not one. Current native LLVM versus current REX answers a different question from current output versus historical baseline.
Exact Equality, Normalized Equality, And Tolerance-Aware Equality
Once the campaign stopped trusting raw diffs blindly, it still needed a positive rule for what kind of equality to require.
The resulting policy was layered rather than universal.
Exact equality where the benchmark truly exposes deterministic output
If the benchmark output is deterministic and not polluted by timing or banners, exact equality is ideal.
That is the simplest and strongest case.
Many of the integer-like or structurally simple outputs can and should be judged this way.
Normalized exact equality where the log contains non-semantic lines
If the benchmark log mixes result data with timing or reporting lines, exact equality still makes sense after narrow normalization.
That was the right rule for pathfinder and b+tree.
The key is that the normalization must be:
- small,
- benchmark-specific,
- and easy to explain.
If a normalization rule starts removing arbitrary lines or rewriting numeric content, it is no longer a trustworthy correctness filter.
Tolerance-aware reasoning for floating-point-heavy cases
Floating-point-heavy GPU applications need one more layer of interpretation.
That does not mean giving up on correctness. It means recognizing that bitwise equality across toolchain generations is not always the only meaningful signal.
The campaign therefore used tolerance-aware reasoning where appropriate, especially when interpreting saved-baseline drift.
But there was also an important discipline here:
do not hide a current native-versus-REX mismatch behind a vague tolerance policy.
In the cases that mattered most, such as reduced-output hotspot and srad_v2, current native LLVM and current REX were actually byte-identical. The tolerance discussion only entered when interpreting their shared drift relative to older saved references.
That is the right ordering:
- first ask whether current native and current REX still agree,
- then interpret how that shared result relates to an older floating-point baseline.
Why This Is Better Than A Single Golden-File Oracle
It would be possible to insist on a simpler rule:
every run must match the saved golden file byte for byte, forever.
That rule sounds clean. It breaks down in practice for at least three reasons.
First, benchmarks evolve. Build flags, runtime libraries, output formatting, and floating-point behavior can all shift slightly over time.
Second, current native LLVM is itself part of the validation surface. If current native and current REX still agree, that is strong evidence about the current compiler path even when an older file no longer matches exactly.
Third, a benchmark suite is supposed to help investigation, not replace it. A good top-layer harness should tell you what kind of mismatch you are looking at:
- non-semantic text mismatch,
- hidden-output benchmark mode,
- current two-way mismatch,
- or historical baseline drift.
That diagnostic value is much more useful than one undifferentiated red flag.
The Correctness Contract In Practice
By the end of the campaign, the correctness contract for benchmark validation had become fairly crisp.
For each benchmark:
- choose a run mode that actually exposes the computation when needed,
- normalize away only the lines that are clearly non-semantic,
- compare current native LLVM and current REX first,
- then compare both against any saved historical baseline,
- interpret floating-point drift as drift unless it creates a current native-versus-REX disagreement.
This is why the benchmark report could say something precise instead of something vague.
It could say:
- current native LLVM and current REX matched on all 9 benchmarks,
- hotspot and srad_v2 disagreed with older references,
- those disagreements looked like saved-baseline drift rather than current REX regressions.
That is a much better engineering result than a binary “golden diff failed.”
The Design Rule In One Sentence
The design rule for benchmark correctness in REX is simple:
validate the computation the benchmark is meant to expose, not whatever text the benchmark happened to print around it.
That is why the correctness workflow strips timing-only lines, uses reduced-output modes when the default run hides the real result, and treats current native-versus-REX agreement as a first-class signal distinct from historical baseline drift.
Without those rules, the benchmark layer becomes noisy and fragile.
With them, it becomes what it should be: the final place where the current REX compiler proves that it still computes the same answer as the current native LLVM path, even when real benchmark logs are messy.