How REX Places `rex_offload_init()` And Why It Avoids Automatic `rex_offload_fini()`

Posted on Apr 12, 2026 (Updated on May 3, 2026)

REX supports both eager and lazy registration, but normal generated GPU programs are intended to run with eager initialization. The lowerer inserts rex_offload_init() at the beginning of main so that one-time CUBIN registration happens before user timing starts and before any offload site runs. It deliberately does not auto-insert rex_offload_fini() at process exit, because standalone GPU benchmark programs gain little from explicit teardown yet still pay a measurable fixed cost for it. The current lowering invariants check both sides of this policy: init appears exactly once and early, and auto-inserted fini never appears.

The previous two posts in this series covered the two helper layers that make REX GPU offloading runnable:

register_cubin.cpp, which registers the device image with libomptarget,
and rex_kmp.h, which rewrites generated __tgt_* calls into REX-controlled wrappers.

Those posts explain how registration and runtime calls work.

This post focuses on a narrower policy question:

when should registration happen, and when should teardown happen?

In REX, the answer is intentionally asymmetric.

For normal standalone generated GPU programs:

rex_offload_init() is inserted eagerly near the top of main,
but rex_offload_fini() is not auto-inserted at the end.

That asymmetry is deliberate. It is not unfinished cleanup logic or a missing feature. It is part of the runtime policy that REX wants generated programs to follow.

This post stays tightly focused on that policy. It explains:

where omp_lowering.cpp inserts rex_offload_init(),
why REX prefers eager init over purely lazy registration for generated programs,
why that placement matters for timing-sensitive benchmark code,
why the lowerer intentionally avoids adding rex_offload_fini() at process exit,
when explicit teardown is still useful,
and what the current lowering invariants verify about both choices.

A timeline showing main entry, rex_offload_init inserted before user statements and timing declarations, followed by repeated kernel launches and then process exit without auto-inserted rex_offload_fini. — Figure 1. REX wants one-time registration to happen before timed user work begins, but it does not want to add mandatory teardown cost to the visible end of every standalone process.

Why Init/Fini Is A Compiler Policy Question

It is easy to think of initialization and cleanup as runtime details that the compiler can ignore.

For REX, that would be the wrong model.

The compiler is the component that knows:

which files are generated,
where main is,
which programs are normal standalone outputs,
and how much of the launch path should be visible to the user as ordinary host code.

So the lowerer has to decide what kind of runtime lifecycle it wants generated programs to have.

That choice matters because REX has two valid ways to make registration happen:

eagerly, by inserting rex_offload_init() at the start of main,
or lazily, by relying on the safe wrappers in register_cubin.cpp to call register_cubin(...) the first time an offload API is reached.

Both are correct in the narrow sense that they can make the image available before the first kernel launch.

But they are not equivalent in user-visible behavior.

If registration happens lazily at the first offload site, the program pays for file I/O and runtime registration at whatever point that first site executes. In short-running GPU benchmarks, that cost can land inside the very timing region the user is trying to measure.

If teardown happens automatically at process exit, the program pays one more fixed-cost runtime operation at the visible end of execution, even though most standalone GPU benchmark programs do not need explicit cleanup before process exit.

So REX chooses:

eager init,
no automatic fini,

and lets explicit teardown remain available for callers that truly need it.

That is a compiler policy, not just a helper detail.

Step 1: The Lowerer Inserts `rex_offload_init()` At The Start Of `main`

The actual insertion happens in insertAcceleratorInit(...) in omp_lowering.cpp.

The function first finds the program entry point:

SgFunctionDeclaration *mainDecl = findMain(sgfile);
if (mainDecl != NULL) {
  mainDef = mainDecl->get_definition();
  hasMain = true;
}
...
if (!hasMain)
  return;

Then it builds the call statement:

SgExprStatement *expStmt = buildFunctionCallStmt(
    SgName("rex_offload_init"), buildVoidType(), NULL, currentscope);
setSourcePositionForTransformation(expStmt);

And the placement policy is explicit:

prependStatement(expStmt, currentscope);

That is the key line.

REX is not inserting the init call “somewhere before the first kernel launch” in a vague sense. It is prepending the statement before the body’s user statements.

The nearby comment in the implementation says exactly why:

// Insert before all user statements so one-time cubin registration is not
// counted inside declaration initializers such as `long long time0 =
// clock();`.

That is an unusually concrete example, and it matters.

The compiler is not only trying to be correct. It is trying to keep one-time runtime setup out of the user’s measured region.

That is why prependStatement(...) is the right primitive here. It puts the policy into the generated source in the simplest possible way:

no special runtime startup framework,
no hidden constructor trick,
just an ordinary host-side call inserted before user code begins.
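To make the placement concrete, here is a minimal sketch of the shape a lowered main takes. It is an illustration, not actual REX output: rex_offload_init() is replaced by a stub that is assumed to take no arguments and return void (consistent with the buildVoidType() call above), and the names events, offload_site_stub, and lowered_main are hypothetical.

```cpp
#include <cassert>
#include <ctime>
#include <string>
#include <vector>

// Hypothetical event log, used only to make statement ordering observable.
static std::vector<std::string> events;

// Stand-in for the real helper; assumed signature: void rex_offload_init().
void rex_offload_init() { events.push_back("init"); }

// Stand-in for a generated offload site.
void offload_site_stub() { events.push_back("offload"); }

// Models the body of a generated main after lowering.
int lowered_main() {
  rex_offload_init();              // prepended by the lowerer
  long long time0 = std::clock();  // user's timer starts after registration
  events.push_back("timer_start");
  offload_site_stub();
  long long time1 = std::clock();
  events.push_back("timer_stop");
  assert(events.front() == "init");  // init precedes every user statement
  return time1 >= time0 ? 0 : 1;
}
```

Because the call is the first statement in the body, even a timer started in a declaration initializer cannot observe the registration cost.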

Why Eager Init Wins For Normal Generated Programs

REX still keeps the lazy path available. The safe wrappers in register_cubin.cpp do registration checks before dispatching to the real runtime calls.

That means correctness does not depend on eager init.

But REX still prefers eager init in normal standalone programs, and there are good reasons for that.

It keeps the hot path cleaner

Once rex_offload_init() has run, the intended generated-program path is:

the image is already registered,
direct wrappers from rex_kmp.h can call through to the runtime immediately,
and there is no need to perform registration checks at every generated offload site.

That keeps the steady-state path closer to “just launch the kernel” instead of “launch the kernel and maybe also bootstrap the whole runtime image right now.”
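The difference between the two wrapper styles can be sketched as follows. This is a simplified model under stated assumptions, not the code in rex_kmp.h or register_cubin.cpp: safe_launch, direct_launch, sketch_offload_init, and the counters are hypothetical names, and the real wrappers forward to __tgt_* entry points rather than a stub.

```cpp
#include <cassert>

static bool g_registered = false;
static int g_registration_checks = 0;  // how often the hot path had to check
static int g_runtime_calls = 0;

// Stand-in for the real runtime entry point.
static void runtime_launch_stub() { ++g_runtime_calls; }

// Models eager initialization: registration happens once, up front.
void sketch_offload_init() { g_registered = true; }

// Safe wrapper: checks on every call and registers lazily if needed.
void safe_launch() {
  ++g_registration_checks;
  if (!g_registered) sketch_offload_init();
  runtime_launch_stub();
}

// Direct wrapper: assumes eager init already ran, so it only forwards.
void direct_launch() {
  assert(g_registered && "direct path requires prior init");
  runtime_launch_stub();
}
```

After sketch_offload_init() has run, any number of direct_launch() calls leave g_registration_checks at zero, which is the steady-state behavior the eager path is aiming for.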

It keeps startup cost out of the first measured region

This is the more visible reason.

Registration does real work:

open the CUBIN file,
read the bytes,
build descriptors,
call __tgt_register_lib(...).

If the first offload site also triggers that work, then a user who times the region around that first offload is measuring both steady-state offloading and one-time image setup together.

For benchmark code, that is usually the wrong result.

Eager init fixes that by moving the one-time cost to the start of main, where it becomes part of normal process startup rather than part of the first measured offload region.
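The effect on a measured region can be shown with a small deterministic model. The sketch below uses abstract work units instead of wall-clock time so the result does not depend on the machine; the cost constants, the Runtime struct, and first_region_cost are all hypothetical and only stand in for the real registration and launch paths.

```cpp
#include <cassert>

// Assumed costs in abstract work units (illustrative values only).
constexpr long kRegisterCost = 1000;  // open file, read bytes, register image
constexpr long kLaunchCost = 10;      // one steady-state offload

struct Runtime {
  bool registered = false;
  long work_done = 0;  // total work units spent so far

  // Eager path: models rex_offload_init().
  void init() { ensure_registered(); }

  // Models a safe wrapper that registers lazily when nobody has yet.
  void launch() {
    ensure_registered();
    work_done += kLaunchCost;
  }

 private:
  void ensure_registered() {
    if (!registered) {
      work_done += kRegisterCost;
      registered = true;
    }
  }
};

// Returns the work observed inside a timed region around the first launch.
long first_region_cost(bool eager) {
  Runtime rt;
  if (eager) rt.init();       // happens before the timer starts
  long start = rt.work_done;  // timer start
  rt.launch();
  assert(rt.registered);      // the image is registered on either path
  return rt.work_done - start;  // timer stop
}
```

With these assumed constants, first_region_cost(true) reports only the launch cost, while first_region_cost(false) reports the launch cost plus the one-time registration cost.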

It makes generated source easier to reason about

When contributors inspect lowered host code, it is much easier to reason about the execution model if one explicit init call appears near the top of main.

That is a better source-to-source story than “the first offload wrapper you happen to reach will secretly register the image if it has not already been done.”

The lazy path is still necessary as a safety net. It is just not the path REX wants to present as the normal generated-program model.

A comparison diagram showing eager init performing registration before user timing starts, while lazy init defers registration until the first offload call and therefore pulls file I/O and registration cost into the first measured region. — Figure 2. Both eager and lazy registration can be correct, but they have different user-visible timing behavior. REX treats lazy registration as a safety path and eager init as the intended generated-program path.

Step 2: The Lowerer Deliberately Does Not Insert `rex_offload_fini()`

The same helper function that inserts init also documents the opposite choice for teardown:

// Do not auto-insert rex_offload_fini() at end of main. For standalone
// processes the OS reclaims the registered image and device-side state on
// exit, and forcing teardown into user-visible process lifetime adds a
// measurable fixed cost to short-running GPU programs. Explicit teardown
// remains available through rex_offload_fini() for callers that need it.

That comment is the whole policy in one place.

The lowerer could, in principle, search for return paths in main and append cleanup code.

In fact, the older comments around insertRTLinitAndCleanCode(...) still describe this general style of transformation:

find the main entry,
add runtime init code at the beginning,
find return points and append cleanup code.

For GPU offloading, REX intentionally does not follow that pattern all the way through to automatic fini insertion.

That is a pragmatic choice, and it is better than a dogmatic “every init must have a compiler-inserted cleanup” rule.

Why REX Avoids Automatic Fini For Standalone Programs

There are two main reasons.

Process exit already tears down the environment

In the normal standalone-program case, process exit already destroys:

the host process itself,
the registered helper state,
and the driver-managed device-side resources associated with that process.

That means explicit teardown at the end of main is often redundant from the user’s point of view.

Explicit unregister adds visible fixed cost

By contrast, calling rex_offload_fini() is not free.

It performs real runtime work:

transition the registration state,
call __tgt_unregister_lib(...),
destroy owned state,
and publish that the cached descriptor is gone.

That cost lands at the very end of program execution, which is exactly where short-running benchmark programs are most sensitive to extra fixed overhead.

So the tradeoff is asymmetric:

explicit init provides a clear timing and predictability benefit (correctness is already covered by the lazy path),
explicit fini often provides little benefit in standalone benchmark runs while still adding visible cost.

That is why REX chooses one but not the other.
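The teardown steps listed above can be sketched as a small state machine. This is an assumption-laden illustration, not the implementation in register_cubin.cpp: the state names, the ImageState struct, and unregister_lib_stub (standing in for __tgt_unregister_lib(...)) are hypothetical, and the sketch does not claim to reproduce the real helper's thread-safety guarantees.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical registration states; the real helper may use different ones.
enum class RegState { Unregistered, Registered, Finalizing };

// Hypothetical owned state: the image bytes read from the CUBIN file.
struct ImageState {
  std::vector<char> bytes;
};

static std::atomic<RegState> g_state{RegState::Unregistered};
static std::unique_ptr<ImageState> g_image;
static int g_unregister_calls = 0;

// Stand-in for the runtime call __tgt_unregister_lib(...).
static void unregister_lib_stub() { ++g_unregister_calls; }

// Models registration: allocate owned state, then publish "registered".
bool sketch_init() {
  if (g_state.load() != RegState::Unregistered) return false;
  g_image.reset(new ImageState());
  g_image->bytes.assign(16, 0);
  g_state.store(RegState::Registered);
  return true;
}

// Models teardown in the order described in the text.
bool sketch_fini() {
  RegState expected = RegState::Registered;
  // 1. Transition the registration state (fails if nothing is registered).
  if (!g_state.compare_exchange_strong(expected, RegState::Finalizing))
    return false;
  unregister_lib_stub();                  // 2. Unregister with the runtime.
  g_image.reset();                        // 3. Destroy owned state.
  g_state.store(RegState::Unregistered);  // 4. Publish that it is gone.
  return true;
}
```

Even in this reduced form, teardown is several distinct steps, which is why it shows up as a fixed cost when it is forced onto the end of a short program.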

When Explicit `rex_offload_fini()` Still Matters

Not auto-inserting fini does not mean teardown is never useful.

It means teardown is useful in a narrower set of scenarios than startup registration.

The most obvious cases are:

embedding REX-lowered code inside a longer-lived process,
reusing the same process for multiple program phases with different offload lifetimes,
or running in an environment where explicit runtime cleanup is part of the host application’s resource-management contract.

In those cases, rex_offload_fini() is still available as an explicit API.

That is the important compromise in the design:

REX does not force teardown cost into every standalone program,
but it still provides the hook for callers that genuinely need lifecycle control inside a process that outlives one benchmark-style run.

This is the same pragmatic pattern that shows up elsewhere in the helper layer:

keep the default generated-program path optimized for the common case,
keep explicit control available for the less common but still real cases.
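For the longer-lived embedding case, one way a host application might scope the lifecycle is an RAII guard. The sketch below is hypothetical: OffloadSession and run_phase are invented names, both REX entry points are replaced by counting stubs assumed to take no arguments, and whether the real helper supports re-initialization after rex_offload_fini() is not something this post establishes.

```cpp
#include <cassert>

static int g_init_calls = 0;
static int g_fini_calls = 0;

// Stand-ins for the real helpers; assumed signatures: void f().
void rex_offload_init() { ++g_init_calls; }
void rex_offload_fini() { ++g_fini_calls; }

// Hypothetical guard that ties offload lifetime to a C++ scope.
class OffloadSession {
 public:
  OffloadSession() { rex_offload_init(); }
  ~OffloadSession() { rex_offload_fini(); }
  OffloadSession(const OffloadSession&) = delete;
  OffloadSession& operator=(const OffloadSession&) = delete;
};

// Models one offload phase inside a long-lived host process.
void run_phase() {
  OffloadSession session;
  // ... offload work for this phase would go here ...
}  // teardown runs here, while the host process keeps running
```

This keeps teardown an explicit decision of the embedding application rather than something the lowerer imposes on every generated main.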

A policy matrix contrasting standalone generated benchmark programs, where eager init is inserted and auto-fini is omitted, with longer-lived embedding scenarios, where explicit rex_offload_fini remains available and may be appropriate. — Figure 3. The init/fini policy is intentionally asymmetric because the common standalone-program case and the longer-lived embedded-process case have different cost models.

What The Current Tests Actually Verify

The current lowering invariants do not treat this policy as folklore. They check it directly.

The reduced Rodinia-style verifier enforces:

exactly one rex_offload_init() in the generated host file,
exactly zero rex_offload_fini() insertions,
and, in timing-sensitive cases, the ordering relation that init appears before the timer declaration.

The relevant checks in verify_outputs.sh are straightforward:

expect_count "${rose_file}" 'rex_offload_init[[:space:]]*\(' 1 "host offload init count"
expect_count "${rose_file}" 'rex_offload_fini[[:space:]]*\(' 0 "unexpected host offload fini count"

And the rodinia_nn_like case goes further by checking placement relative to the timing variable:

time0_line="$(first_line "${rose_file}" 'long[[:space:]]+long[[:space:]]+time0[[:space:]]*=[[:space:]]*clock[[:space:]]*\(')"
init_line="$(first_line "${rose_file}" 'rex_offload_init[[:space:]]*\(')"
(( init_line < time0_line )) || die "rex_offload_init moved after timer declaration"

That is exactly the kind of invariant a lowerer should own.

The test suite is not merely checking that some init call exists. It is checking that the call is in the right place to preserve the intended execution policy.

The suite README even describes the rodinia_nn_like case in those terms:

init ordering before timed declarations,
without automatic rex_offload_fini() insertion at process exit.

So this policy is not an informal convention. It is part of the tested lowering contract.

Why This Policy Fits REX Well

The broader design pattern here is the same one that shows up throughout REX’s GPU path:

make the generated program explicit,
keep common-case performance behavior intentional,
and avoid hiding important lifecycle work in places that are hard for users to inspect.

Eager rex_offload_init() satisfies all three:

it is visible in the generated source,
it happens at a predictable point,
and it keeps one-time registration cost out of the main measured offload path.

Avoiding automatic rex_offload_fini() follows from the same priorities:

it avoids forcing redundant-looking teardown into every generated main,
it avoids measurable exit-time fixed cost in short programs,
and it still leaves explicit teardown available when a longer-lived host process actually needs it.

That is a good fit for a source-to-source compiler that is often used on benchmark-style GPU programs, where startup and steady-state behavior both matter and generated source should remain easy to inspect.

Closing

The init/fini policy in REX is simple once you state it directly:

initialize eagerly,
do not tear down automatically,
keep explicit teardown available.

That simplicity is the result of a very deliberate tradeoff.

REX wants generated programs to pay one-time registration cost at a predictable point near the top of main, not inside the first measured offload region.

At the same time, it does not want every standalone benchmark program to pay explicit unregister cost at the visible end of execution just to satisfy a cleanup symmetry that usually does not buy the user much.

That is why the lowerer inserts one call and omits the other, and why the test suite treats both choices as part of the real lowering contract.