Surviving LLVM Offload ABI Drift in REX

Posted on Mar 19, 2026 (Updated on Mar 21, 2026)

LLVM offloading migrations can break GPU execution without breaking compilation. REX stays resilient by treating the offload runtime ABI as a first-class contract. It does this by centralizing compatibility shims in helper headers, making device-image registration explicit and thread-safe, inserting one-time init deterministically, and encoding runtime-facing invariants in lowering tests.

We hit a class of regression that every compiler engineer recognizes, and every user hates:

the project still builds,
the lowered code still looks plausible,
but the program now fails at runtime with opaque CUDA or offloading errors.

When you work on OpenMP GPU offloading inside a source-to-source pipeline, this is the expected shape of many failures. The compiler proper can do everything “right” at the AST level and still emit code that no longer matches the runtime’s ABI expectations.

This post is about how we made REX survive LLVM offload ABI drift without turning the compiler into a pile of benchmark-specific hacks.

It has three goals:

explain where ABI drift enters an OpenMP offloading pipeline,
describe a debugging workflow that turns “CUDA error” into a concrete failing invariant,
document the design choices that make REX upgrades predictable rather than traumatic.

Figure 1. Layered view of REX offloading showing where LLVM ABI drift can break execution. ABI drift rarely breaks everything at once. It usually breaks one boundary: a struct layout, a symbol signature, an image-registration assumption, or an ordering constraint. REX keeps those boundaries explicit so failures become diagnosable.

The Reality: OpenMP Directives Are Not The Contract

The user writes:

#pragma omp target teams distribute parallel for map(tofrom: y[0:n])
for (int i = 0; i < n; i++) {
  y[i] += 1.0f;
}

But the runtime does not execute pragmas. The runtime executes a protocol:

an offload entry table exists in the host binary,
a device image is registered against that table,
the host calls __tgt_* entry points with mapping arrays and a launch packet.

That protocol is the contract.

LLVM toolchain migrations tend to change that contract in ways that are subtle at the source level:

a previously-visible type becomes opaque or moves headers,
a runtime entry point gains a parameter (or changes the meaning of an existing one),
image registration semantics tighten,
plugin behavior changes for what used to be “undefined but worked.”

If you only test at “compiles” granularity, you miss these until runtime.

Where ABI Drift Actually Hits

In REX, GPU offloading crosses multiple boundaries. The boundaries that most often drift across LLVM versions are:

Offload entry table layout and discovery: The host binary uses an omp_offloading_entries section with __start_... and __stop_... symbols. If those entries are missing or malformed, the runtime cannot match host keys to device symbols.
Device image registration: libomptarget expects a __tgt_bin_desc whose pointers remain valid for the lifetime of the registration. If the image bytes are freed too early, you get non-deterministic failures.
Launch ABI: Kernel launch happens through __tgt_target_kernel with a __tgt_kernel_arguments packet. Small ABI changes here can turn into “everything runs but gives wrong answers” or “the runtime returns failure codes.”
Call-site conventions and ordering: One-time initialization and registration must happen before the first offload call. If an upgrade changes when the runtime touches the image or when it performs symbol resolution, ordering mistakes that used to be harmless become failures.
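To make the first boundary concrete, here is a minimal sketch of how a named section plus linker-provided start and stop symbols can be walked like an array. It assumes an ELF target with a GNU-compatible linker. The demo_entry struct, the demo_offload_entries section name, and the helper functions are hypothetical stand-ins, not the real runtime types.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified entry record. The real offload entry layout
   depends on the LLVM version and is deliberately not reproduced here. */
struct demo_entry {
  void *addr;       /* host-side key for the kernel */
  const char *name; /* device symbol name */
  size_t size;      /* 0 for kernels in this sketch */
};

/* Emit one record per kernel into a named section. The explicit alignment
   keeps the compiler from over-aligning records, so they pack like an array. */
#define DEMO_ENTRY(key, sym)                                        \
  static char key;                                                  \
  static const struct demo_entry key##_entry                        \
      __attribute__((used, section("demo_offload_entries"),         \
                     aligned(__alignof__(struct demo_entry)))) = {  \
          &key, sym, 0}

DEMO_ENTRY(kernel_a_id, "kernel_a");
DEMO_ENTRY(kernel_b_id, "kernel_b");

/* GNU-compatible linkers synthesize these symbols for sections whose names
   are valid C identifiers. */
extern const struct demo_entry __start_demo_offload_entries[];
extern const struct demo_entry __stop_demo_offload_entries[];

/* Number of records between the start and stop symbols. */
size_t demo_entry_count(void) {
  uintptr_t begin = (uintptr_t)__start_demo_offload_entries;
  uintptr_t end = (uintptr_t)__stop_demo_offload_entries;
  return (size_t)(end - begin) / sizeof(struct demo_entry);
}

/* Look up a record by device symbol name, or return NULL if it is missing. */
const struct demo_entry *demo_find_entry(const char *name) {
  size_t n = demo_entry_count();
  for (size_t i = 0; i < n; ++i) {
    const struct demo_entry *e = &__start_demo_offload_entries[i];
    if (strcmp(e->name, name) == 0) return e;
  }
  return NULL;
}
```

A missing or malformed record shows up here as a wrong count or a failed lookup, which is the same kind of mismatch the runtime hits when it cannot match host keys to device symbols.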

REX’s approach is to make those boundaries explicit and testable, rather than letting them remain “implicit linker magic” that only fails in production.

The Failure Modes We Care About

From the user’s point of view, ABI drift often looks like one of these:

build-time breakage: compilation fails with incomplete types or missing declarations
link-time breakage: undefined references to __tgt_* symbols or mismatched signatures
runtime breakage: offload calls return errors; the runtime reports missing kernels; CUDA reports invalid device function
silent wrong answers: mapping flags or sizes drifted, and the kernel reads garbage
performance regressions: everything works, but fixed overhead suddenly dominates short runs

Not all of these are caused by LLVM changes, but LLVM migrations tend to surface them because they tighten assumptions.

The key idea is: treat each failure mode as a missing invariant you can encode in the test suite.

The Debugging Workflow: Reduce, Isolate, Invariant-ize

When an offload regression appears, “debugging” is mostly about moving from an unhelpful symptom to a precise failing invariant.

The workflow we converged on looks like this:

Reduce: Reproduce the issue on the smallest input that still exercises the failing contract. In REX, the Rodinia-derived lowering tests are designed to be that reduced corpus.
Isolate the layer: Decide whether the failure is in:
- entry-table emission,
- image registration,
- call rewriting / ABI mismatch,
- mapping arrays / sizes / flags,
- ordering (init/data begin/end).
Inspect generated artifacts: Because REX is source-to-source, you can open the generated rose_*.c, the generated device file, and the helper files. This is one of the biggest practical advantages of the architecture: you can debug using normal code inspection tools.
Encode a test invariant: Once you know what went wrong, add a structural invariant check so the regression cannot reappear silently later.

Figure 2. A triage loop from runtime symptom to reduced case, invariant, fix, and suite re-run. Treat offload regressions like protocol regressions. The goal is not to stare at a CUDA error message longer. The goal is to isolate the missing invariant and enforce it with a targeted test.

Why the Rodinia-Derived Lowering Suite Matters

REX already has parser tests and frontend compile tests, but ABI drift is most visible after lowering. This is why the lowering_rodinia suite exists: it validates lowering-specific behavior using reduced Rodinia-like inputs and invariant checks, not brittle golden dumps.

Examples of the kinds of invariants that matter during LLVM migrations:

multi-kernel lowering shape is preserved (three kernels remain three kernels)
repeated calls to the same lowered helper still work
the omp_offloading_entries section contains entries for each kernel
rex_offload_init() appears before declarations used by timing instrumentation
no automatic teardown is inserted at process exit

Those checks are “boring” by design. They encode the contracts that toolchain upgrades tend to break.
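A structural invariant of this kind can be very small. The sketch below counts a caller-supplied marker in the lowered host source and checks that no teardown call was inserted. It is an illustration only: the function names are hypothetical, and the marker string that identifies one offload entry in real generated code is an assumption the caller has to supply.

```c
#include <stddef.h>
#include <string.h>

/* Count non-overlapping occurrences of needle in text. */
size_t count_occurrences(const char *text, const char *needle) {
  size_t count = 0;
  size_t len = strlen(needle);
  if (len == 0) return 0;
  for (const char *p = strstr(text, needle); p != NULL;
       p = strstr(p + len, needle)) {
    ++count;
  }
  return count;
}

/* Returns 1 when the lowered source has exactly expected_kernels entry
   markers and contains no call to rex_offload_fini, 0 otherwise.
   entry_marker is whatever text identifies one offload entry in the
   generated code (an assumption of this sketch). */
int lowering_shape_ok(const char *lowered_src, const char *entry_marker,
                      size_t expected_kernels) {
  if (count_occurrences(lowered_src, entry_marker) != expected_kernels) {
    return 0;
  }
  return strstr(lowered_src, "rex_offload_fini(") == NULL;
}
```

Because the check looks only at counts and presence, it is insensitive to whitespace and formatting changes in the generated file.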

The Compatibility Strategy: Own The Boundary, Not Every Call Site

A naive response to ABI drift is to start changing code generation everywhere:

rewrite every __tgt_target_kernel(...) call to include a new parameter,
rewrite every include list to point at new runtime headers,
special-case behavior per LLVM version.

That is the fastest way to accumulate technical debt and make the next upgrade even worse.

REX uses a different strategy:

keep the lowerer’s emitted call shapes stable and simple,
centralize ABI and symbol mapping in helper headers and helper sources,
make registration explicit and fast on the hot path,
encode the resulting contracts in lowering tests.

This is why rex_kmp.h and register_cubin.cpp exist. They are not “extra layers for fun.” They are the compatibility surface.

Figure 3. The shim strategy: keep codegen stable, adapt ABI details in the helper layer. When the runtime ABI changes, you want one place to fix it. REX keeps the lowerer output stable and adapts ABI details in a single helper layer.

Case Study: When A Runtime Type Stops Being A Public Header Type

One class of drift is that a type you used to get via some system header becomes internal.

In offloading, __tgt_offload_entry is a good example. The lowerer emits variables of that type and places them in omp_offloading_entries. But you cannot assume that system headers will always provide a complete type definition for it.

REX’s solution is simple and robust:

vendor the ABI struct definitions in rex_kmp.h,
include rex_kmp.h in lowered host sources,
treat that header as part of the compiler output contract.

This keeps code generation predictable. The lowerer never has to guess which system header version is installed. It always includes the same compatibility header.
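One way to keep a vendored definition honest is to pin its layout with compile-time assertions, so a mismatch fails the build instead of the kernel launch. The sketch below is hypothetical and is not the actual contents of rex_kmp.h. The field order follows the older libomptarget entry layout as an assumption and must be checked against the LLVM release you target.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative vendored copy of an offload entry record. The field order is
   an assumption based on the older libomptarget layout. */
struct rex_demo_offload_entry {
  void *addr;       /* host address used as the lookup key */
  char *name;       /* device symbol name */
  size_t size;      /* size in bytes, 0 for kernels */
  int32_t flags;    /* entry flags */
  int32_t reserved; /* reserved by the runtime */
};

/* Layout contract: any drift in these offsets breaks the build. */
_Static_assert(offsetof(struct rex_demo_offload_entry, addr) == 0,
               "addr must be the first field");
_Static_assert(offsetof(struct rex_demo_offload_entry, name) == sizeof(void *),
               "name must follow addr");
_Static_assert(offsetof(struct rex_demo_offload_entry, size) ==
                   2 * sizeof(void *),
               "size must follow name");
_Static_assert(offsetof(struct rex_demo_offload_entry, flags) ==
                   2 * sizeof(void *) + sizeof(size_t),
               "flags must follow size");
_Static_assert(sizeof(struct rex_demo_offload_entry) ==
                   2 * sizeof(void *) + sizeof(size_t) + 2 * sizeof(int32_t),
               "no unexpected padding");

/* Runtime-visible size, handy for logging or for cross-checking in tests. */
size_t rex_demo_entry_size(void) {
  return sizeof(struct rex_demo_offload_entry);
}
```

When a new LLVM release changes the record, the edit is confined to the vendored struct and its assertions, and the lowerer output does not move.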

Case Study: When A Runtime Entry Point Gains Parameters

Another common drift is a runtime entry point signature change.

Kernel launch is a high-value example because the lowerer emits the __tgt_target_kernel call directly. If the runtime changes its signature (for example, by requiring an additional “location/ident” parameter), you have two options:

update code generation everywhere, or
add a compatibility shim that supplies the extra parameter while keeping call sites stable.

REX takes the second approach.

In rex_kmp.h, we declare the true runtime symbol using an asm alias:

int __rex_real___tgt_target_kernel(ident_t *loc,
                                   int64_t device_id,
                                   int32_t num_teams,
                                   int32_t thread_limit,
                                   void *host_ptr,
                                   struct __tgt_kernel_arguments *kernel_args)
    __asm__("__tgt_target_kernel");

Then we provide a stable wrapper that keeps the call site small and injects the missing ABI detail:

static inline int rex_direct___tgt_target_kernel(int64_t device_id,
                                                 int32_t num_teams,
                                                 int32_t thread_limit,
                                                 void *host_ptr,
                                                 struct __tgt_kernel_arguments *kernel_args) {
  return __rex_real___tgt_target_kernel(&rex_target_kernel_ident,
                                        device_id, num_teams, thread_limit,
                                        host_ptr, kernel_args);
}

Finally, we use a macro so generated code can still call __tgt_target_kernel(...) and end up in the wrapper:

#define __tgt_target_kernel rex_direct___tgt_target_kernel

This does two things that matter for migrations:

ABI changes become a single-header fix, not a codegen rewrite across every lowered file.
The lowerer stays focused on transformation, not on tracking per-version runtime signatures.

This is a general pattern you can reuse for other __tgt_* drift as well.

Case Study: Registration Semantics And Lifetime Bugs

Device image registration is another place where “worked before” can stop working after an upgrade.

The runtime expects that:

__tgt_register_lib(&desc) is given pointers that remain valid,
the host entry table range is correct,
registration happens before launch,
repeated registration is either harmless or avoided.

REX makes registration explicit in register_cubin.cpp:

load the CUBIN bytes into a std::vector<unsigned char>,
build a __tgt_device_image pointing at those bytes,
build a __tgt_bin_desc pointing at the image and entry range,
call __tgt_register_lib(&desc),
keep the std::vector alive for as long as the registration is live.

Two migration hardening details are important here:

thread-safe one-time registration: an atomic state machine elects one registering thread and ensures subsequent calls have low overhead
explicit teardown is optional: rex_offload_fini() exists for embedded use cases, but standalone generated programs do not pay teardown cost by default
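The one-time registration and the lifetime rule can be sketched together in plain C11. This is not the code in register_cubin.cpp, which uses a std::vector. Every name below is hypothetical, and demo_register_image is a stub standing in for the real registration call.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { DEMO_UNINIT = 0, DEMO_BUSY = 1, DEMO_READY = 2 };

static atomic_int demo_state = DEMO_UNINIT;
static unsigned char *demo_image; /* owned copy, alive while registered */
static size_t demo_image_size;
static int demo_register_calls;   /* how often the stub actually ran */

/* Hypothetical stand-in for the real registration call. */
static void demo_register_image(const unsigned char *bytes, size_t size) {
  (void)bytes;
  (void)size;
  ++demo_register_calls;
}

int demo_register_count(void) { return demo_register_calls; }

void demo_offload_init(const unsigned char *src, size_t size) {
  /* Fast path: one acquire load once registration has completed. */
  if (atomic_load_explicit(&demo_state, memory_order_acquire) == DEMO_READY) {
    return;
  }
  int expected = DEMO_UNINIT;
  if (atomic_compare_exchange_strong_explicit(
          &demo_state, &expected, DEMO_BUSY, memory_order_acq_rel,
          memory_order_acquire)) {
    /* This thread won the election: copy the image into storage that
       outlives the call, then register it. */
    demo_image = malloc(size ? size : 1);
    if (demo_image == NULL) abort();
    memcpy(demo_image, src, size);
    demo_image_size = size;
    demo_register_image(demo_image, demo_image_size);
    atomic_store_explicit(&demo_state, DEMO_READY, memory_order_release);
    return;
  }
  /* Another thread is registering: wait until it publishes DEMO_READY. */
  while (atomic_load_explicit(&demo_state, memory_order_acquire) !=
         DEMO_READY) {
  }
}

/* Optional teardown. Call only after all offload work has finished. */
void demo_offload_fini(void) {
  int expected = DEMO_READY;
  if (atomic_compare_exchange_strong(&demo_state, &expected, DEMO_BUSY)) {
    free(demo_image);
    demo_image = NULL;
    demo_image_size = 0;
    atomic_store(&demo_state, DEMO_UNINIT);
  }
}
```

Calling demo_offload_init repeatedly, from one thread or several, runs the stub once. The busy-wait keeps the sketch short; a production version would likely yield or block instead of spinning.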

This is not only a performance choice. It also reduces nondeterminism. Registration that happens in one clear place (at the start of main) is easier to reason about than registration that happens “somewhere during the first offload call.”

Ordering Fixes: Init Before Timing And Before First Offload

LLVM migrations often expose ordering assumptions. Something that used to be lazy may become eager. Something that used to happen at first launch may move earlier.

REX addresses ordering by inserting rex_offload_init() explicitly at the beginning of main in the lowering phase. The insertion is intentionally before user statements, so one-time registration:

cannot be counted by timing that starts in a declaration (for example, time0 = clock()),
cannot accidentally happen after the first generated __tgt_* call,
does not require per-call registration checks on the hot path.

This is a great example of where source-to-source transparency helps: you can open the generated rose_*.c and see the init call in the right place.

Encoding The Migration Contracts In Tests

The single most important thing we did to make migrations manageable was to encode the runtime-facing contracts as explicit tests.

When you upgrade the toolchain, you want to catch:

missing offload entries,
duplicate or corrupted entry ranges,
broken init ordering,
unwanted teardown insertion,
wrong call shapes or missing wrappers.

Those are not “unit tests” in the classic sense. They are protocol tests. They assert that the generated output still satisfies the runtime’s expectations.

This is also why we prefer invariant-based checks over golden output dumps. The output formatting can change without meaningfully changing semantics. The invariants are what matter for ABI drift.

Practical Guidelines For The Next LLVM Upgrade

If you are about to migrate REX across another LLVM major version, the post-mortem advice is straightforward:

Assume runtime ABI drift is the default, not the exception.
Do not rewrite the entire lowerer to match the new runtime. Keep call shapes stable where possible.
Fix drift in one place: headers and helper sources.
Make registration explicit, idempotent, and thread-safe.
Insert init deterministically and keep it out of timed regions.
Add invariants to the lowering test suite the moment you understand the failure.

This approach is not fancy. It is the opposite. It is deliberate boringness: keep the compatibility surface small and obvious.

What Comes Next

Once the offloading pipeline is correct across LLVM toolchains, performance becomes the next reality check. The next post in this series should cover the performance work:

where REX was slower than native LLVM offloading and why,
where REX was faster and what exactly caused the advantage,
and how we closed gaps without violating fairness (honor user-specified launch configuration unless invalid).

If this migration post has one takeaway, it is this:

Toolchains evolve. Your compiler survives the evolution only if you treat the runtime ABI as a first-class contract and test it as such.