How REX Recovered b+tree Read-Only Loads With __ldg

Mon, 27 Apr 2026 00:00:00 +0000

The previous post ended with an important constraint: REX was not allowed to win b+tree by silently shrinking a valid user-requested launch shape. The manual thread-width sweep had found a faster shape, but the source explicitly requested the launch geometry and native LLVM preserved it. That made the optimization useful as a diagnostic, not as a fair compiler rewrite.

So the remaining b+tree problem became sharper.

The launch contract had to stay fair. The direct __tgt_target_kernel path was already in place. Literal scalar target parameters were already repaired. Cubin registration was no longer the issue. The output matched native LLVM. Yet native LLVM still held a small advantage on b+tree in fair runs.

How REX Kept b+tree Launch Geometry Fair

Sun, 26 Apr 2026 00:00:00 +0000

The previous post established the general fairness principles for GPU launch geometry. This post applies those rules to a specific case: the b+tree benchmark, which exposed a performance gap that remained even after the direct __tgt_target_kernel and ABI migration work.

At that point, the easy failure modes were mostly gone. The b+tree benchmark built. It registered its cubin. It launched its kernels. Its output matched.

That left the uncomfortable kind of performance gap: a small one.
