How REX Separated GPU-Total From Wall-Clock Noise In pathfinder And srad

Tue, 28 Apr 2026 00:00:00 +0000

The previous post closed the last obvious fair b+tree kernel-body gap. REX was no longer relying on an unfair launch-shape rewrite, and it no longer needed a global cache flag. It recovered read-only provenance in the generated device kernel and emitted selective __ldg(...) loads where the proof was strong enough.

That left a strange-looking benchmark table.

Some rows were clearly resolved. b+tree had moved from a fair loss into a clear REX win. Several other benchmarks already had stable REX advantages. But three rows still looked suspicious when judged only by the broad comparison table:

How REX Kept b+tree Launch Geometry Fair

Sun, 26 Apr 2026 00:00:00 +0000

The previous post established the general fairness principles for GPU launch geometry. This post applies those rules to a specific case: the b+tree benchmark, which exposed a performance gap that remained even after the direct __tgt_target_kernel and ABI migration work.

At that point, the easy failure modes were mostly gone. The b+tree benchmark built. It registered its cubin. It launched its kernels. Its output matched.

That left the uncomfortable kind of performance gap: a small one.

