How REX Removed Process-Exit Offload Teardown From Generated GPU Programs

GPU-total profiling showed that the remaining pathfinder and srad issues were not device-side regressions. But some wall-clock runs still showed a fixed REX tax. srad_v2 with niter = 0 isolated that tax by keeping offload setup and mapping while removing the iterative kernels. CUDA API profiling and strace ruled out kernel work, driver calls, and cubin file I/O as the main cause. The real issue was lifecycle policy: generated programs were forcing rex_offload_fini() and unregister/unload work before process exit. REX now registers the cubin image early, keeps rex_offload_fini() available for long-lived callers, and no longer auto-inserts teardown into normal standalone program exit.

The previous post separated GPU-total performance from wall-clock noise. That was necessary because pathfinder, srad_v1, and srad_v2 looked suspicious in broad timing tables, but profiler totals did not show a remaining native LLVM advantage in the actual GPU work.

That did not make the wall-clock discrepancy imaginary.

It meant the discrepancy lived somewhere else.

If kernel time and copy time are tied or better for REX, but whole-process timing still sometimes looks worse, the remaining cost is probably not inside the generated device kernel. It is in process lifetime: setup, registration, host work, teardown, or the interaction between those pieces.

That distinction matters. If the problem is process lifecycle, adding another kernel-body optimization is not just ineffective. It can make the compiler more complicated while leaving the real cost untouched.

This post is about the lifecycle bug that appeared after the GPU work was already mostly clean: REX-generated programs were paying for explicit offload teardown at process exit.

Figure 1. The niter = 0 probe removed algorithmic work but kept the offload lifecycle, making fixed process costs visible.

Why srad_v2 niter = 0 Was The Right Probe

The investigation needed a benchmark mode that removed as much useful computation as possible while still exercising the offload runtime path.

srad_v2 was useful because it has a target-data region around the main computation and accepts a configurable iteration count. Running it with:

niter = 0

keeps the setup and mapping path alive but removes the iterative kernel body. That creates a cleaner signal:

if a large gap remains at niter = 0,
the gap cannot be explained by kernel math.
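
To make the probe concrete, the lifecycle shape in question looks roughly like this. This is a minimal sketch with illustrative names and map clauses, not the actual Rodinia source:

// Minimal sketch of the srad_v2 lifecycle shape (illustrative, not the
// actual Rodinia source). With niter = 0 the target-data region still
// maps J in and out, but no iterative kernel is ever launched.
#include <cstdlib>

int main(int argc, char **argv) {
  const int n = 2048 * 2048;
  const int niter = (argc > 1) ? std::atoi(argv[1]) : 0;
  float *J = new float[n]();

#pragma omp target data map(tofrom : J[0:n]) // setup + mapping still happen
  {
    for (int iter = 0; iter < niter; ++iter) { // niter = 0: body never runs
#pragma omp target teams distribute parallel for
      for (int i = 0; i < n; ++i)
        J[i] *= 0.5f; // stand-in for the real stencil update
    }
  }

  delete[] J;
  return 0;
}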

The first repeated wall-clock comparison was immediately informative:

REX mean:         about 0.59 s
native LLVM mean: about 0.45 s

The exact numbers were noisy, but the scale was too large to ignore. A gap of roughly 0.14 s with zero algorithmic iterations is not a stencil-performance issue. It is fixed runtime cost.

The next step was to split that fixed cost into layers instead of guessing.

Layer 1: CUDA API Time Was Not Large Enough

The first check was CUDA-driver-visible work:

nvprof --print-api-summary --csv <srad_v2 niter=0 command>

Representative API summaries showed both variants paying similar driver-level costs:

REX cuDevicePrimaryCtxRetain:         about 102.5 ms
LLVM cuDevicePrimaryCtxRetain:        about  99.2 ms

REX cuMemcpyHtoDAsync:                about  38.0 ms
LLVM cuMemcpyHtoDAsync:               about  29.1 ms

REX cuDevicePrimaryCtxRelease:        about  25.4 ms
LLVM cuDevicePrimaryCtxRelease:       about  37.1 ms

The exact mix varied by sample. The important point was the scale. The CUDA-visible difference was not large enough to explain the full whole-process gap.
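
To put faces on the two largest rows, the retain/release pair brackets the primary-context lifetime that the offload runtime relies on. A standalone CUDA driver API sketch of just those calls (illustrative, not REX code; error checks omitted):

// Standalone CUDA driver API sketch (not REX code): the context retain /
// release pair that dominates the API summaries above.
#include <cuda.h>

int main() {
  cuInit(0);
  CUdevice dev;
  cuDeviceGet(&dev, 0);

  CUcontext ctx;
  cuDevicePrimaryCtxRetain(&ctx, dev); // startup cost both compilers pay
  // ... kernel launches and copies would go here ...
  cuDevicePrimaryCtxRelease(dev);      // teardown-side cost; a process that
                                       // is about to exit can skip this
  return 0;
}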

That matched the previous post’s profiler conclusion: REX was not losing in the useful GPU work. If the wall-clock gap was still real, much of it had to be outside the direct kernel/copy totals and outside the obvious CUDA API delta.

Layer 2: The Cubin File Was Not The Main Cost

The next obvious REX-only difference was image delivery.

Native LLVM embeds the device image in the offloading binary. REX-generated programs carry a separate helper and a separate cubin file:

register_cubin.cpp
rex_lib_nvidia.cubin

So the question was straightforward:

is reading rex_lib_nvidia.cubin the fixed tax?

strace -T answered that quickly. The REX process did open the cubin:

openat(AT_FDCWD, "rex_lib_nvidia.cubin", O_RDONLY) = 3

but the file was tiny, about 12.9 KB, and the open/read timings were in the microsecond range. The external cubin model is an architectural difference, but the file read itself was not enough to explain a tenth of a second.
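
As an order-of-magnitude check: even at a conservative 1 GB/s page-cache read rate, 12.9 KB costs on the order of 13 microseconds, consistent with the strace timings and roughly four orders of magnitude short of the observed gap.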

That left the lifecycle around the image.

Figure 2. The investigation ruled out kernel time, transfer time, CUDA API deltas, and cubin file I/O before changing the teardown policy.

Layer 3: Generated Programs Were Forcing Teardown

At that point, the generated host code and helper had to be read together.

The generated main still had the lifecycle shape:

rex_offload_init();
...
rex_offload_fini();
return 0;

and the helper also had automatic process-exit cleanup through atexit(rex_offload_fini).
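
Spelled out, the old helper policy amounted to something like this. This is a reconstructed sketch based on the description above, not the verbatim REX helper:

// Reconstructed sketch of the OLD helper policy (illustrative, not the
// verbatim REX source): initialization also armed mandatory exit teardown.
#include <cstdlib>

extern "C" void rex_offload_fini(void);

extern "C" void rex_offload_init(void) {
  // ... read rex_lib_nvidia.cubin and register the image, as before ...
  std::atexit(rex_offload_fini); // every normal process exit now pays
                                 // unregister/unload/context-release work
}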

That meant standalone generated programs were doing two explicit things:

register the offload image before user work;
unregister and unload the offload image before process exit.

The first one is required in the current REX architecture. REX delivers the cubin separately, so the generated program must register the image before runtime target calls can find the device entries. That is why rex_offload_init() belongs near the start of main.
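
A minimal sketch of what that registration step has to do, assuming libomptarget-style descriptors. The struct shapes below are abbreviated to the fields the sketch touches, and the real register_cubin.cpp also wires up offload entry tables and thread-safety guards:

// Sketch of the registration path under stated assumptions: descriptor
// shapes abbreviated to the fields this sketch actually uses.
#include <fstream>
#include <iterator>
#include <vector>

struct __tgt_device_image { void *ImageStart; void *ImageEnd; /* entries */ };
struct __tgt_bin_desc { int NumDeviceImages; __tgt_device_image *DeviceImages; /* ... */ };
extern "C" void __tgt_register_lib(__tgt_bin_desc *desc);

static std::vector<unsigned char> image;
static __tgt_device_image device_image;
static __tgt_bin_desc bin_desc;

extern "C" void rex_offload_init(void) {
  // Load the externally delivered cubin into process memory.
  std::ifstream in("rex_lib_nvidia.cubin", std::ios::binary);
  image.assign(std::istreambuf_iterator<char>(in),
               std::istreambuf_iterator<char>());

  // Describe the in-memory image and hand it to the offload runtime so
  // later target calls can resolve device entries against it.
  device_image.ImageStart = image.data();
  device_image.ImageEnd = image.data() + image.size();
  bin_desc.NumDeviceImages = 1;
  bin_desc.DeviceImages = &device_image;
  __tgt_register_lib(&bin_desc);
}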

The second one is a different policy question.

For a long-lived process that wants to unload offload state and keep running, explicit teardown is useful. For a normal benchmark-style executable that is about to exit, forcing teardown is mostly extra work. The operating system will reclaim the process address space. The driver and runtime will clean up process-owned device state as the process terminates. Forcing __tgt_unregister_lib, module unload, context release, and related runtime bookkeeping into the user-visible lifetime adds cost without improving the common standalone-program case.

That is the core distinction:

explicit cleanup API: good capability
mandatory process-exit cleanup boilerplate: bad default for short-lived programs

The Fast Experiment

Before changing REX, we tested the hypothesis directly.

The experiment did not change kernels. It did not change map lists. It did not change launch geometry. It only changed teardown behavior:

remove automatic atexit(rex_offload_fini);
skip generated-program unregister/unload at normal process end.

That kept the test narrow. If performance moved, it had to move because process-exit teardown was the cost.

The result was decisive.

For srad_v2, after skipping forced teardown:

niter = 0 wall-clock gap disappeared and slightly flipped toward REX
niter = 2 wall-clock also flipped slightly toward REX in the sample set

The CUDA API summary changed the way the hypothesis predicted. The REX niter = 0 summary no longer showed process-end teardown calls such as:

cuDevicePrimaryCtxRelease
cuModuleUnload

That was the missing explanation. The fixed wall-clock tax was not a mysterious native LLVM startup advantage. It was REX-generated programs forcing offload cleanup before they exited.

The Compiler/Runtime Boundary

The proper fix was not to delete rex_offload_fini().

That would be the wrong abstraction. Some callers may embed generated code in a longer-running process. They may genuinely need deterministic teardown before process exit. REX should support that.

The correct fix was to stop auto-inserting teardown into every generated standalone main.

The final policy is:

insert rex_offload_init() near the start of main;
do not auto-insert rex_offload_fini() at every main return;
do not register rex_offload_fini() with atexit by default;
keep rex_offload_fini() as an explicit API.

The relevant lowering code now has this shape:

SgExprStatement *expStmt = buildFunctionCallStmt(
    SgName("rex_offload_init"), buildVoidType(), buildExprListExp(),
    currentscope);
prependStatement(expStmt, currentscope);

// Do not auto-insert rex_offload_fini() at end of main. For standalone
// processes the OS reclaims the registered image and device-side state on
// exit, and forcing teardown into user-visible process lifetime adds a
// measurable fixed cost to short-running GPU programs.
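
In ROSE terms, prependStatement() inserts the new statement at the beginning of the given scope, so the registration call lands ahead of the first user statement in main.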

The helper still keeps storage safely managed during process lifetime:

struct CubinStorage {
  std::vector<unsigned char> image;   // raw bytes of rex_lib_nvidia.cubin
  __tgt_device_image device_image{};  // offload-runtime descriptor for the image
  __tgt_bin_desc bin_desc{};          // descriptor passed to (un)register calls
};

std::unique_ptr<CubinStorage> cubin_storage;  // RAII-owned for process lifetime

and explicit teardown remains available. The state reset happens inside the internal unregister helper, so a caller that explicitly tears down can reinitialize later in the same process:

void unregister_cubin_internal(void) {
  // Claim the kRegistered -> kBusy transition. If another thread already
  // unregistered (or teardown is in flight), there is nothing to do.
  int state = kRegistered;
  if (!__atomic_compare_exchange_n(&registration_state, &state, kBusy, false,
                                   __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE)) {
    return;
  }

  if (cubin_storage != nullptr) {
    __rex_real___tgt_unregister_lib(&cubin_storage->bin_desc);
    cubin_storage.reset();  // release the RAII-owned image storage
  }
  // Return to kUnregistered so a later rex_offload_init() can start over.
  __atomic_store_n(&registration_state, kUnregistered, __ATOMIC_RELEASE);
}

void rex_offload_fini(void) {
  __atomic_store_n(&__cubin_desc, nullptr, __ATOMIC_RELEASE);
  unregister_cubin_internal();
}
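
For a long-lived embedder, the intended usage is symmetric and explicit. A hypothetical host program linked against the REX helper, not generated code, might look like this:

// Hypothetical long-lived embedder (illustrative, not generated code):
// deterministic teardown, then re-initialization in the same process.
extern "C" void rex_offload_init(void);
extern "C" void rex_offload_fini(void);

void run_gpu_phase(void) {
  // assumed: the offloaded target regions live here
}

int main() {
  rex_offload_init();  // register the external cubin image
  run_gpu_phase();
  rex_offload_fini();  // unload GPU state; the process keeps running

  // ... long CPU-only phase ...

  rex_offload_init();  // legal again: the internal unregister helper
  run_gpu_phase();     // reset registration_state to kUnregistered
  return 0;            // normal exit: no forced teardown required
}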

That boundary is important. REX did not replace RAII with careless global leaks. The helper still uses RAII-owned storage while the process is alive. The change is that normal process exit is allowed to be process exit. REX no longer forces expensive offload unregister/unload work into the last user-visible milliseconds of every generated command-line program.

Figure 3. REX still owns required image registration, but teardown is no longer mandatory generated boilerplate for normal standalone exit.

Why This Is Not A Leak Bug

This change can sound suspicious if it is described too loosely:

REX no longer unregisters the image at process exit.

That sentence is true but incomplete.

There are two different cases:

long-lived process continues running after unloading GPU code;
short-lived standalone process is already terminating.

The first case needs deterministic cleanup. That is why rex_offload_fini() still exists.

The second case does not need to pay explicit teardown cost before the process dies. The process is already terminating. Host memory, the process address space, file descriptors, and process-owned device state are reclaimed by the system and runtime. Calling into libomptarget and the driver to unregister, unload, and release before that happens mostly moves cleanup work into wall-clock time that the benchmark user sees.

So the real rule is:

do not rely on process exit for cleanup inside a long-running process;
do rely on process exit for cleanup when the whole process is exiting.

That is not a hack. It is the same kind of boundary ordinary native toolchains use all the time. Native LLVM-generated binaries do not emit source-level teardown boilerplate into every main just to make process exit more explicit.

Validation

After the real fix moved into the compiler/runtime tree, the benchmark suite was regenerated.

The generated host files had the intended lifecycle shape:

one rex_offload_init() near the top of main;
no auto-generated rex_offload_fini() at the end of main.
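
Concretely, a regenerated benchmark main now has this skeleton (illustrative):

// Illustrative skeleton of a regenerated benchmark main after the fix.
extern "C" void rex_offload_init(void);

int main(int argc, char **argv) {
  rex_offload_init(); // auto-inserted: required image registration
  // ... user code, target regions, timing ...
  return 0;           // no rex_offload_fini(), no atexit hook; process
                      // exit reclaims the image and device state
}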

The performance effect showed up where it should:

pathfinder wall-clock returned to a stable REX lead;
srad_v2 wall-clock collapsed from suspicious native-looking lead to effective tie;
srad_v2 GPU-total profiling still showed no confirmed native LLVM advantage.

The last point matters. Removing process-exit teardown did not magically turn every wall-clock sample into a perfectly stable ranking. Some cold-start variability remains. The first process in a series still initializes device state and registers an offload image. REX still has the architectural difference of a separately delivered cubin file.

What changed is narrower and defensible:

REX stopped adding unnecessary teardown work to the normal exit path.

That is enough. The fix removed a generic fixed tax from generated standalone GPU programs without changing kernel code, launch clauses, or correctness behavior.

The Lesson

This post is the other half of the previous one.

The previous post said:

do not blame the compiler kernel path for wall-clock-only gaps
when GPU totals do not confirm a device-side loss.

This post says:

when the wall-clock gap is real but GPU totals are clean,
look at process lifecycle before changing code generation.

That is exactly what happened. The “missing performance” was not in a stencil kernel. It was not in a transfer list. It was not in a launch-shape heuristic. It was in lifecycle policy: REX had made explicit offload teardown mandatory for every generated standalone program.

The final design is cleaner:

initialize automatically because REX must register its external cubin image;
tear down explicitly only when the caller actually needs deterministic cleanup;
let normal process exit be normal process exit.

That is a compiler/runtime policy fix, not a benchmark-specific trick. It removes a fixed wall-clock tax from short-running generated GPU programs while preserving an explicit cleanup API for the cases that need one.