How REX Removed Process-Exit Offload Teardown From Generated GPU Programs

Wed, 29 Apr 2026 00:00:00 +0000

The previous post separated GPU-total performance from wall-clock noise. That was necessary because pathfinder, srad_v1, and srad_v2 looked suspicious in broad timing tables, but profiler totals did not show a remaining native LLVM advantage in the actual GPU work.

That did not make the wall-clock discrepancy imaginary.

It meant the discrepancy lived somewhere else.

If REX matches or beats native LLVM on kernel time and copy time, but whole-process timing still sometimes looks worse, the remaining cost is probably not inside the generated device kernel. It lies somewhere in the process lifetime: setup, registration, host work, teardown, or the interaction between those pieces.
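The arithmetic behind that reasoning can be sketched in a few lines. This is a minimal illustration, not the measurement tooling used for these benchmarks: the helper name `non_gpu_residual` and every number below are hypothetical, chosen only to show how a program can tie on GPU work and still lose on wall-clock time.

```python
def non_gpu_residual(wall_s, kernel_s, copy_s):
    """Return the wall-clock time not explained by kernel and copy time.

    Whatever is left over has to come from the rest of the process
    lifetime: setup, registration, host work, or teardown.
    """
    gpu_total = kernel_s + copy_s
    return wall_s - gpu_total


# Hypothetical numbers for illustration only; NOT measurements from this post.
rex_residual = non_gpu_residual(wall_s=1.30, kernel_s=0.40, copy_s=0.20)
native_residual = non_gpu_residual(wall_s=1.10, kernel_s=0.42, copy_s=0.20)

# In this made-up case REX has the smaller GPU total but the larger residual,
# so the wall-clock gap has to be explained outside the device kernel.
print(f"REX residual:    {rex_residual:.2f} s")
print(f"native residual: {native_residual:.2f} s")
```

The subtraction alone cannot say which phase of the process lifetime is responsible. It only shows that the gap is not in the kernels or the copies.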
