How REX Removed Process-Exit Offload Teardown From Generated GPU Programs
The pathfinder and srad issues were not device-side regressions, but some wall-clock runs still showed a fixed REX tax. Running srad_v2 with niter = 0 isolated that tax by keeping offload setup and mapping while removing the iterative kernels. CUDA API profiling and strace ruled out kernel work, driver calls, and cubin file I/O as the main cause. The real issue was lifecycle policy: generated programs were forcing rex_offload_fini() and unregister/unload work before process exit. REX now initializes the cubin image early and keeps rex_offload_fini() available for long-lived callers, but no longer auto-inserts teardown into normal standalone program exit.

The previous post separated GPU-total performance from wall-clock noise. That was necessary because pathfinder, srad_v1, and srad_v2 looked suspicious in broad timing tables, even though profiler totals showed no remaining native LLVM advantage in the actual GPU work.
That did not make the wall-clock discrepancy imaginary.
It meant the discrepancy lived somewhere else.
If kernel time and copy time are tied or better for REX, but whole-process timing still sometimes looks worse, the remaining cost is probably not inside the generated device kernel. It is in process lifetime: setup, registration, host work, teardown, or the interaction between those pieces.
That distinction matters. If the problem is process lifecycle, adding another kernel-body optimization is not just ineffective. It can make the compiler more complicated while leaving the real cost untouched.
This post is about the lifecycle bug that appeared after the GPU work was already mostly clean: REX-generated programs were paying for explicit offload teardown at process exit.
Figure 1. The niter = 0 probe removed algorithmic work but kept the offload lifecycle, making fixed process costs visible.
Why srad_v2 niter = 0 Was The Right Probe
The investigation needed a benchmark mode that removed as much useful computation as possible while still exercising the offload runtime path.
srad_v2 was useful because it has a target-data region around the main computation and accepts a configurable iteration count. Running it with:
| |
keeps the setup and mapping path alive but removes the iterative kernel body. That creates a cleaner signal:
| |
The first repeated wall-clock comparison was immediately informative:
| |
The exact numbers were noisy, but the scale was too large to ignore. A gap around a tenth of a second with zero algorithmic iterations is not a stencil-performance issue. It is fixed runtime cost.
The next step was to split that fixed cost into layers instead of guessing.
Layer 1: CUDA API Time Was Not Large Enough
The first check was CUDA-driver-visible work:
| |
Representative API summaries showed both variants paying similar driver-level costs:
| |
The exact mix varied by sample. The important point was the scale. The CUDA-visible difference was not large enough to explain the full whole-process gap.
That matched the previous post’s profiler conclusion: REX was not losing in the useful GPU work. If the wall-clock gap was still real, much of it had to be outside the direct kernel/copy totals and outside the obvious CUDA API delta.
Layer 2: The Cubin File Was Not The Main Cost
The next obvious REX-only difference was image delivery.
Native LLVM embeds the device image in the offloading binary. REX-generated programs carry a separate helper and a separate cubin file:
| |
So the question was straightforward:
| |
strace -T answered that quickly. The REX process did open the cubin:
| |
but the file was tiny, about 12.9 KB, and the open/read timings were in the microsecond range. The external cubin model is an architectural difference, but the file read itself was not enough to explain a tenth of a second.
That left the lifecycle around the image.
Figure 2. The investigation ruled out kernel time, transfer time, CUDA API deltas, and cubin file I/O before changing the teardown policy.
Layer 3: Generated Programs Were Forcing Teardown
At that point, the generated host code and helper had to be read together.
The generated main still had the lifecycle shape:
| |
and the helper also had automatic process-exit cleanup through atexit(rex_offload_fini).
That meant standalone generated programs were doing two explicit things:
| |
The first one is required in the current REX architecture. REX delivers the cubin separately, so the generated program must register the image before runtime target calls can find the device entries. That is why rex_offload_init() belongs near the start of main.
The second one is a different policy question.
For a long-lived process that wants to unload offload state and keep running, explicit teardown is useful. For a normal benchmark-style executable that is about to exit, forcing teardown is mostly extra work. The operating system will reclaim the process address space. The driver and runtime will clean up process-owned device state as the process terminates. Forcing __tgt_unregister_lib, module unload, context release, and related runtime bookkeeping into the user-visible lifetime adds cost without improving the common standalone-program case.
That is the core distinction:
| |
The Fast Experiment
Before changing REX, we tested the hypothesis directly.
The experiment did not change kernels. It did not change map lists. It did not change launch geometry. It only changed teardown behavior:
| |
That kept the test narrow. If performance moved, it had to move because process-exit teardown was the cost.
The result was decisive.
For srad_v2, after skipping forced teardown:
| |
The CUDA API summary changed the way the hypothesis predicted. The REX niter = 0 summary no longer showed process-end teardown calls such as:
| |
That was the missing explanation. The fixed wall-clock tax was not a mysterious native LLVM startup advantage. It was REX-generated programs forcing offload cleanup before they exited.
The Compiler/Runtime Boundary
The proper fix was not to delete rex_offload_fini().
That would be the wrong abstraction. Some callers may embed generated code in a longer-running process. They may genuinely need deterministic teardown before process exit. REX should support that.
The correct fix was to stop auto-inserting teardown into every generated standalone main.
The final policy is:
| |
The relevant lowering code now has this shape:
| |
The helper still keeps storage safely managed during process lifetime:
| |
and explicit teardown remains available. The state reset happens inside the internal unregister helper, so a caller that explicitly tears down can reinitialize later in the same process:
| |
That boundary is important. REX did not replace RAII with careless global leaks. The helper still uses RAII-owned storage while the process is alive. The change is that normal process exit is allowed to be process exit. REX no longer forces expensive offload unregister/unload work into the last user-visible milliseconds of every generated command-line program.
Figure 3. REX still owns required image registration, but teardown is no longer mandatory generated boilerplate for normal standalone exit.
Why This Is Not A Leak Bug
This change can sound suspicious if it is described too loosely:
| |
That sentence is true but incomplete.
There are two different cases:
| |
The first case needs deterministic cleanup. That is why rex_offload_fini() still exists.
The second case does not need to pay explicit teardown cost before the process dies. The process is already terminating. Host memory, the process address space, file descriptors, and process-owned device state are reclaimed by the system and runtime. Calling into libomptarget and the driver to unregister, unload, and release before that happens mostly moves cleanup work into wall-clock time that the benchmark user sees.
So the real rule is:
| |
That is not a hack. It is the same kind of boundary ordinary native toolchains use all the time. Native LLVM-generated binaries do not emit source-level teardown boilerplate into every main just to make process exit more explicit.
Validation
After the real fix moved into the compiler/runtime tree, the benchmark suite was regenerated.
The generated host files had the intended lifecycle shape:
| |
The performance effect showed up where it should:
| |
The last point matters. Removing process-exit teardown did not magically turn every wall-clock sample into a perfectly stable ranking. Some cold-start variability remains. The first process in a series still initializes device state and registers an offload image. REX still has the architectural difference of a separately delivered cubin file.
What changed is narrower and defensible:
| |
That is enough. The fix removed a generic fixed tax from generated standalone GPU programs without changing kernel code, launch clauses, or correctness behavior.
The Lesson
This post is the other half of the previous one.
The previous post said:
| |
This post says:
| |
That is exactly what happened. The “missing performance” was not in a stencil kernel. It was not in a transfer list. It was not in a launch-shape heuristic. It was in lifecycle policy: REX had made explicit offload teardown mandatory for every generated standalone program.
The final design is cleaner:
| |
That is a compiler/runtime policy fix, not a benchmark-specific trick. It removes a fixed wall-clock tax from short-running generated GPU programs while preserving an explicit cleanup API for the cases that need one.