Posts, notes, and experiments.

How REX Removed Process-Exit Offload Teardown From Generated GPU Programs


The next REX GPU performance post: how `srad_v2` with `niter=0` exposed a fixed process-lifecycle tax, why explicit offload teardown at process exit was unnecessary for short-lived generated programs, and how REX kept `rex_offload_fini` as an explicit API without forcing it into every `main`.

How REX Separated GPU-Total From Wall-Clock Noise In pathfinder And srad


The next REX GPU performance post: why pathfinder and srad looked like remaining regressions under wall-clock timing, how nvprof GPU-total measurements changed the conclusion, and why the right fix was methodology discipline rather than another compiler heuristic.

How REX Recovered b+tree Read-Only Loads With `__ldg`


The next REX GPU performance post: how the remaining fair b+tree gap moved from launch geometry into read-only memory access, why a global cache flag was the wrong fix, and how REX repaired const provenance so generated kernels could recover `__ldg` loads safely.

How REX Kept b+tree Launch Geometry Fair


The next REX GPU performance post: how b+tree exposed a real launch-geometry problem, why shrinking explicit user launch clauses would be an unfair benchmark win, and how REX turned the experiment into a generic default-only heuristic.

How REX Completed Direct `__tgt_target_kernel` Lowering And Repaired The Device ABI


The next REX GPU performance post: why switching from legacy `__tgt_target_teams` to direct `__tgt_target_kernel` was a full host/device ABI migration, how gaussian exposed the mismatch, and how REX repaired the generated kernel signature, scalar transport slots, and runtime argument packet.

How REX Made Literal Scalar Target Parameters Match The Modern OpenMP Launch ABI


The next REX GPU performance post: why scalar target parameters should not be lowered like address-based mapped objects, how REX identifies safe literal parameters, and how host packing plus device unpacking moved the generated code closer to LLVM's modern OpenMP launch ABI.

How REX Handles Launch Geometry Fairly: What It May Optimize And What It Must Preserve


The next REX GPU performance post: why launch geometry is a compiler policy problem, how REX separates explicit user launch clauses from compiler-owned defaults, and why fair optimization means shrinking only the parts of the launch contract that the user did not freeze.

How REX Replaced the Old XOMP Scheduler Path With Direct Grid-Stride GPU Lowering


The second performance-root-cause post in the REX GPU campaign: how the remaining `nn` gap moved from host lifecycle cost to device-loop shape, why the old XOMP static scheduler path was too expensive for canonical target loops, and how REX replaced it with generic direct grid-stride lowering.

How REX Fixed the Early `nn` GPU Regression By Moving Offload Initialization Out Of The Timed Path


The first performance-root-cause post in the REX GPU campaign: how a large early `nn` slowdown turned out to be misplaced one-time offload registration work, why the generated source had to place `rex_offload_init()` before benchmark timing, and how lowering tests now protect that invariant.

How REX Builds Native LLVM And Generated Variants For Side-By-Side GPU Benchmarking


A focused walkthrough of the benchmark build contract in REX: how the same application becomes a native LLVM OpenMP-offload binary and a REX-generated binary, why both must share the same LLVM runtime family, and how identical run commands keep the comparison honest.