How REX Separated GPU-Total From Wall-Clock Noise In pathfinder And srad

Posted on Apr 28, 2026 (Updated on May 4, 2026)

After the b+tree read-only-load fix, the remaining suspicious rows were pathfinder, srad_v1, and srad_v2. Wall-clock or benchmark-proxy numbers made them look like possible native LLVM wins or near losses. But the project metric was total GPU execution time: kernel time plus host-to-device and device-to-host copies. Once those cases were profiled with nvprof, pathfinder became a clear REX GPU-total win, while srad_v1 and srad_v2 became effective ties with no confirmed native LLVM device-side advantage. The lesson was not to add another launch or kernel heuristic. The lesson was to stop optimizing from a timing signal that was measuring process lifetime more than GPU work.

The previous post closed the last obvious fair b+tree kernel-body gap. REX was no longer relying on an unfair launch-shape rewrite, and it no longer needed a global cache flag. It recovered read-only provenance in the generated device kernel and emitted selective __ldg(...) loads where the proof was strong enough.

That left a strange-looking benchmark table.

Some rows were clearly resolved. b+tree had moved from a fair loss into a clear REX win. Several other benchmarks already had stable REX advantages. But three rows still looked suspicious if we looked only at the broad comparison table:

1
2
3
pathfinder: near tie, direction changed across short reruns
srad_v1:    near tie, sometimes a tiny native edge
srad_v2:    small native-looking wall-clock lead in one table

At this point, the wrong move would have been to keep changing the compiler until every top-line number moved in the desired direction.

The right question was narrower:

Are these still GPU offloading gaps, or are they timing-source artifacts?

That distinction mattered because the project metric was already defined: compare total GPU execution time, meaning kernel compute plus host-to-device and device-to-host transfer time. Whole-process wall-clock is not the same metric. It includes file parsing, benchmark setup, host-side loops, output formatting, runtime initialization, process teardown, and OS scheduling noise.

Early in an optimization project, those differences can be hidden by large regressions. Once the real compiler bugs are gone, they become decisive.

Figure 1. Wall-clock time includes the GPU work, but it also includes much more. Once gaps are small, that extra process-lifetime work can dominate the conclusion.

Why The Existing Table Was Not Enough

The earlier strict fair rerun was useful, but it mixed timing sources. That was acceptable for broad triage because it let us keep iteration fast across a messy benchmark suite. It became insufficient once the remaining differences were small.

pathfinder was the first warning sign. Its broad table result was effectively a tie. Depending on sample count and machine state, the direction could flip. That did not line up with the code shape we had already recovered. The REX kernel was a direct CUDA-style grid-stride loop over the useful stencil domain. The old XOMP scheduler path was gone. Launch fairness rules were in place. If native LLVM was truly winning on device time, the evidence needed to be stronger than a one-percent wall-clock wobble.

srad_v1 and srad_v2 created a different ambiguity. They are related workloads, but not identical benchmark programs. Both use repeated offload regions and large arrays. Both have meaningful host work around the device regions. If one looked tied and the other looked like a native win, there might have been a real remaining code-generation issue. It also might have been a mismatch between what each benchmark printed and what the comparison harness was measuring.

There was a third REX-specific reason to be careful. REX-generated standalone programs do not have exactly the same process lifecycle as native LLVM binaries. REX registers a separately generated cubin image through helper code, while native LLVM embeds the device image in the ordinary OpenMP offload binary path. Even after registration was moved out of the wrong timed region, that architectural difference can still affect wall-clock timing.

None of that proves REX is faster. It proves that wall-clock alone is not a good enough signal for close cases.

The next step was therefore not another compiler heuristic. It was profiling.

The Measurement Rule

The profiling rule was simple:

1
2
3
if a benchmark has a small apparent gap
and the benchmark does not print a trustworthy GPU-total number,
then use profiler GPU activity totals before changing the compiler.

The command shape was:

1
nvprof --profile-from-start on --print-gpu-summary --csv <benchmark-command>

Then we summed:

1
2
3
all CUDA kernel time
all host-to-device copies
all device-to-host copies

That is not a perfect measurement environment. nvprof itself has overhead, and profiler runs can still vary. But it answers the relevant question much better than whole-process timing:

1
is the generated GPU offloading path slower, or is the process around it noisy?

If a real GPU gap exists, it should appear in kernel time, transfer time, or both. If the wall-clock gap disappears when we look at GPU activity only, then changing the generated kernel would be optimizing against the wrong signal.

The focus cases were:

1
2
3
pathfinder
srad_v1
srad_v2

hotspot was also profiled as a control because it already had a stable REX win. A useful measurement method should preserve that known direction instead of flattening everything into an artificial tie.

pathfinder: A Wall-Clock Near Tie Became A GPU-Total REX Win

The pathfinder profiler result was the clearest case.

The wall-clock table had made pathfinder look unresolved. The GPU-total profile did not:

1
2
REX total GPU activity:         149.118490 ms
native LLVM total GPU activity: 216.707807 ms

That is not a tie. It is roughly a 31 percent REX GPU-total win in that profiled run.

The breakdown explained why:

1
2
3
4
5
REX kernel total:         17.317271 ms over 999 kernel calls
native LLVM kernel total: 32.290071 ms over 999 kernel calls

REX HtoD total:           131.693797 ms
native LLVM HtoD total:   184.315209 ms

The broad wall-clock result had hidden the actual offload behavior because pathfinder spends enough time outside the GPU activity window. Once we measured the device work and transfers directly, there was no remaining native LLVM GPU advantage to fix.

That changed the interpretation completely. A compiler engineer looking only at the proxy table might be tempted to tune launch shape, memory access, or lowering structure. The profiler said not to. The generated GPU path was already better on the intended metric.

Figure 2. The suspicious wall-clock rows did not remain native LLVM wins under GPU-total profiling. pathfinder became a clear REX win, and both srad cases became ties or near ties.

srad_v1: The Honest Answer Was Tie

srad_v1 was useful because it did not turn into a dramatic story.

The GPU-total profile showed:

1
2
REX total GPU activity:         857.467848 ms
native LLVM total GPU activity: 858.037765 ms

That is below a tenth of a percent. Calling that a meaningful win for either compiler would be dishonest. The correct conclusion is:

1
srad_v1 is tied on the intended GPU-total metric.

This was important because it kept the profiling method honest. If every profiler run had transformed every close benchmark into a big REX win, we would have needed to question the measurement setup. Instead, the profiler preserved the most boring possible answer for srad_v1: the two generated offload paths are effectively equal at this scale.

That does not mean the binaries are identical. It means the remaining differences are smaller than the measurement signal we should use to justify compiler changes.

srad_v2: The Native-Looking Wall-Clock Lead Did Not Survive Profiling

srad_v2 needed extra caution because the broad table had shown a native-looking wall-clock lead.

The first GPU-total profile already contradicted that:

1
2
REX total GPU activity:         189.084268 ms
native LLVM total GPU activity: 189.775687 ms

That is an effective tie with a tiny REX edge in the sample.

Because the earlier wall-clock result had been more concerning, the profiler run was repeated. The three-profile sample produced:

1
2
REX mean GPU total:         189.630837 ms
native LLVM mean GPU total: 194.719622 ms

One native sample had visibly higher variance, and the medians were much closer than the means. So the rigorous conclusion is not “REX destroyed native LLVM on srad_v2.” The rigorous conclusion is:

1
the earlier wall-clock native lead is not evidence of a remaining native GPU advantage.

That is the result that mattered for compiler work. The data did not justify adding another launch heuristic, a new read-only pass, or a backend flag for srad_v2.

The Control Case: hotspot

The hotspot control behaved as expected:

1
2
REX total GPU activity:         144.144442 ms
native LLVM total GPU activity: 193.316538 ms

This confirmed that the profiler method was not simply washing away real differences. It preserved a known REX win. That made the close-case interpretation more credible.

The measurement method was doing what we needed:

1
2
large real GPU differences remain visible;
small wall-clock-only differences stop driving compiler changes.

A Tempting Experiment That Did Not Justify A Pass

Before closing the investigation, we still tested one attractive generated-code idea on srad_v2.

The idea was to manually convert obvious read-only kernel formals to const in the generated CUDA file. This was a reasonable question after the b+tree __ldg work. If srad_v2 was still hiding a read-only provenance problem, a manual const experiment might expose it quickly.

It did not.

The profiled totals stayed essentially where they were, and one copy-back component even moved slightly in the wrong direction in that sample. The generated file was restored to the compiler-produced baseline.

That negative result is worth recording. It prevented a plausible but unsupported compiler pass:

1
do not add a broad "make more formals const" transform just because it sounds similar to b+tree.

b+tree needed selective read-only load recovery because profiler and generated-file experiments showed a real kernel-body issue. srad_v2 did not show the same evidence. Similar-looking optimization ideas still need their own proof.

What This Changed In The Optimization Process

Before this profiling pass, the temptation was:

1
keep tuning until pathfinder and srad_v2 move in the wall-clock table.

After profiling, that target was wrong.

On the intended GPU-total metric:

1
2
3
pathfinder is already a substantial REX win;
srad_v1 is an effective tie;
srad_v2 is an effective tie with no confirmed native GPU advantage.

That does not mean wall-clock differences are imaginary. It means they belong to a different layer of the runtime story. Startup cost, cubin registration, process teardown, benchmark file I/O, and host computation can still affect end-to-end process time. Those are real engineering topics, but they are not evidence that the generated GPU kernel or transfer path is slower.

This distinction prevented a bad optimization loop. If we had kept tuning against the wall-clock proxy, we could have added compiler complexity to fix a number that was not measuring the thing we claimed to optimize.

Figure 3. Close wall-clock rows are not automatic compiler bugs. They become compiler bugs only if GPU-total profiling confirms a device-side or transfer-side loss.

The Rule REX Keeps

The practical rule from this post is:

1
2
3
4
use benchmark-owned GPU timing when it is available and trustworthy;
use profiler GPU-total timing for close cases without a trustworthy GPU timer;
do not add code-generation heuristics for wall-clock-only gaps;
separate GPU throughput questions from process-lifecycle questions.

This is a different kind of performance work than the earlier posts. It did not produce a compiler patch. It produced a cleaner scoreboard.

That is still progress. A compiler optimization project needs to know when not to change the compiler.

By this point in the REX work, the largest regressions had already been removed. The remaining differences were small enough that metric quality became part of the engineering problem. pathfinder, srad_v1, and srad_v2 were the point where we had to raise the measurement standard.

The conclusion was precise:

1
there was no confirmed remaining native LLVM GPU-total advantage in these suspicious rows.

The next post picks up the other side of that conclusion. If GPU totals were already tied or better, why did wall-clock still look worse in some short runs? That led away from kernel lowering and into process-lifetime behavior: explicit offload teardown at program exit.