How `rex_kmp.h` Rewrites Offloading Runtime Calls In REX

Posted on Apr 11, 2026 (Updated on Apr 17, 2026)

REX uses rex_kmp.h to centralize offloading runtime policy. The header vendors ABI structs, exposes REX-controlled wrappers, and rewrites __tgt_* calls through macros unless REX_KMP_INTERNAL is defined. That keeps code generation simple while the helper layer handles the real symbol bindings.

The previous post in this series focused on register_cubin.cpp: how REX loads a standalone CUBIN file, builds __tgt_device_image and __tgt_bin_desc, and registers the result with libomptarget.

That still leaves one more runtime-boundary layer to explain: how do the generated host sources actually call the runtime API without hardcoding all of the ABI quirks themselves?

That is the job of src/midend/programTransformation/ompLowering/rex_kmp.h.

This header looks deceptively simple. It is only a header, and most of it is declarations. But architecturally it is doing several jobs at once:

it vendors the key OpenMP offloading ABI structs that lowered code needs to compile against,
it binds stable REX names to the real libomptarget symbols with asm aliases,
it exposes direct wrappers for the hot path,
it declares safe wrappers that can register the CUBIN lazily,
and it rewrites generated __tgt_* calls so the lowerer can keep emitting simple names instead of runtime-specific plumbing.

This post stays tightly focused on that wrapper layer. It explains:

why rex_kmp.h exists even though libomptarget already has runtime symbols,
how the header vendors ABI structs and protects itself from macro collisions,
how __rex_real___tgt_* names bind to the real runtime entry points,
why there are both rex_direct___tgt_* and rex___tgt_* wrappers,
how the macro layer rewrites generated __tgt_* calls,
why REX_KMP_INTERNAL disables that rewriting inside helper implementations,
and how rex_target_kernel_ident bridges the hidden ident_t * parameter required by __tgt_target_kernel.

Figure 1. The lowerer emits simple runtime-looking names. `rex_kmp.h` rewrites those names into REX-controlled wrappers and only then reaches the real `libomptarget` symbols. (Diagram: a layered call path showing generated host code emitting __tgt_target_kernel, the preprocessor in rex_kmp.h rewriting it to rex_direct___tgt_target_kernel, and that wrapper finally calling the real __tgt_target_kernel symbol through an asm alias.)

Why This Header Exists At All

At first glance, rex_kmp.h can look redundant.

Why not just let generated files include a system runtime header and call the runtime symbols directly?

Because REX wants two things at the same time:

simple generated source
centralized control over runtime policy and ABI quirks

If generated host code had to spell every detail explicitly, the lowerer would need to know far too much about the exact runtime interface in every place it emits a call.

For example, the lowerer wants to be able to generate straightforward host code like:

__tgt_target_kernel(device_id, num_teams, thread_limit, host_ptr, &kernel_args);

That shape is easy to generate and easy to inspect.

But the real runtime ABI is slightly more awkward than that:

__tgt_target_kernel actually takes an ident_t *loc parameter in front,
image registration policy should be centralized instead of reimplemented at every call site,
and generated sources should not depend on the exact header arrangement of whatever LLVM install happens to be on the build machine.

So rex_kmp.h becomes the contract boundary.

The lowerer can keep emitting source that looks like ordinary runtime calls. The header then translates that simple surface into the exact symbol names, argument shapes, and wrapper behavior that REX wants.

That is not unnecessary indirection. It is how the compiler keeps code generation simple without scattering ABI knowledge everywhere.

Step 1: Vendor The Runtime ABI Structs Directly

The first thing rex_kmp.h does is define the data structures that lowered code must see:

ident_t
__tgt_offload_entry
__tgt_device_image
__tgt_bin_desc
__tgt_kernel_arguments

This is a deliberately conservative choice.

The generated sources need these layouts in order to compile. REX could try to include them from some system header, but that would make generated code depend on the exact organization of the installed runtime headers. That is brittle and unnecessary for a source-to-source compiler.

So the header vendors the layouts directly.

This is especially useful for __tgt_kernel_arguments, which is one of the most ABI-sensitive structures in the whole offloading flow. The generated host code builds that object explicitly, and the wrapper layer passes it straight into the runtime.

Macro hygiene inside the struct definition

One detail here is easy to miss and worth calling out.

Before defining __tgt_kernel_arguments, the header does:

#pragma push_macro("Version")
#pragma push_macro("NumArgs")
...
#undef Version
#undef NumArgs
...
#pragma pop_macro("NumArgs")
#pragma pop_macro("Version")

The pop_macro lines at the end restore whatever definitions those names had before the struct was declared.

That is not decorative. It is protecting the struct field names from accidental macro collisions coming from other headers.

Because REX-generated code often includes multiple helper headers and user headers, it cannot assume that names like Version, Flags, or NumArgs are safe as raw tokens. The push/undef/pop sequence makes the vendored ABI struct resilient in the presence of unrelated macro pollution.

This is a small but very REX-like design choice:

keep the ABI local,
keep it explicit,
and harden it against the kinds of header interactions a source-to-source compiler really encounters.
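The push/undef/pop sequence can be sketched in isolation. This is a minimal illustration, assuming GCC, Clang, or MSVC (all of which support push_macro and pop_macro). The struct demo_kernel_arguments and its accessors are hypothetical stand-ins, not the real __tgt_kernel_arguments layout.

```c
#include <stdint.h>

/* Simulated pollution: pretend an unrelated header defined these macros. */
#define Version 42
#define NumArgs 7

/* Save the outside definitions, then clear them for the struct body. */
#pragma push_macro("Version")
#pragma push_macro("NumArgs")
#undef Version
#undef NumArgs

/* Hypothetical stand-in for a vendored ABI struct (not the real layout).
   Without the #undef lines above, "uint32_t Version;" would expand to
   "uint32_t 42;" and fail to compile. */
struct demo_kernel_arguments {
  uint32_t Version;
  uint32_t NumArgs;
};

/* Helpers defined while the field names are still plain identifiers. */
static inline struct demo_kernel_arguments demo_make_args(uint32_t version,
                                                          uint32_t num_args) {
  struct demo_kernel_arguments a;
  a.Version = version;
  a.NumArgs = num_args;
  return a;
}

static inline uint32_t demo_get_version(const struct demo_kernel_arguments *a) {
  return a->Version;
}

static inline uint32_t demo_get_num_args(const struct demo_kernel_arguments *a) {
  return a->NumArgs;
}

/* Restore the outside definitions in reverse order. */
#pragma pop_macro("NumArgs")
#pragma pop_macro("Version")
```

After the pops, Version and NumArgs expand to 42 and 7 again, so code that relied on those macros is undisturbed, while the struct fields keep their intended names.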

Step 2: Separate Real Runtime Symbols From REX-Controlled Names

The next layer is the asm-alias declarations.

Instead of calling the system runtime symbols directly everywhere, rex_kmp.h introduces names such as:

int __rex_real___tgt_target(/* args */) __asm__("__tgt_target");
int __rex_real___tgt_target_kernel(/* args */) __asm__("__tgt_target_kernel");
void __rex_real___tgt_register_lib(/* args */) __asm__("__tgt_register_lib");

These declarations matter because they give REX a stable internal vocabulary:

the generated and helper code can refer to __rex_real___tgt_*,
and the linker still resolves those names to the actual libomptarget symbols.

This is cleaner than mixing “real runtime symbols” and “wrapper entry points” under the same spelling inside helper code.

It also prevents the macro layer from becoming confusing.

Once the header later starts rewriting __tgt_target_kernel to rex_direct___tgt_target_kernel, the helper implementation still needs a way to say “no, I really mean the actual runtime function now.” The asm-alias names provide exactly that escape hatch.

So the alias layer is not just a naming trick. It is what lets the header separate:

what generated code looks like it is calling,
what REX wrappers want to expose,
and what symbol the final executable must really bind to.
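The asm-label mechanism itself can be tried without libomptarget. The sketch below binds a hypothetical name, demo_real_atoi, to the C library symbol atoi. It assumes GCC or Clang on an ELF platform such as Linux, where C symbol names carry no leading underscore; other platforms may need a different label string.

```c
/* Assumption: GCC or Clang on an ELF target (for example Linux). */

/* demo_real_atoi is a hypothetical REX-style internal name. The asm label
   makes every reference to it resolve to the C library symbol "atoi" at
   link time. The label is a string literal, so macro rewriting of the
   identifier atoi elsewhere would not affect it. */
int demo_real_atoi(const char *text) __asm__("atoi");

/* Hypothetical wrapper: it reaches the real symbol through the alias and
   never spells the identifier atoi itself. */
static inline int demo_parse_device_id(const char *text) {
  return demo_real_atoi(text);
}
```

The wrapper and the alias have different C spellings, yet the final executable binds to a single library symbol, which is the same separation the header relies on.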

Step 3: Expose Two Wrapper Families, Not One

The header and helper layer together expose two different wrapper families:

rex_direct___tgt_*
rex___tgt_*

That split is one of the most important design choices in this header.

The direct wrappers

The rex_direct___tgt_* wrappers are defined as static inline functions directly in rex_kmp.h:

static inline int rex_direct___tgt_target(int64_t device_id, void *host_ptr,
                                          int32_t arg_num, void **args_base,
                                          void **args, int64_t *arg_sizes,
                                          int64_t *arg_types) {
  return __rex_real___tgt_target(device_id, host_ptr, arg_num, args_base,
                                 args, arg_sizes, arg_types);
}

The same pattern exists for:

rex_direct___tgt_target_teams
rex_direct___tgt_target_kernel
rex_direct___tgt_target_data_begin
rex_direct___tgt_target_data_end
rex_direct___tgt_target_data_update

These are the hot-path wrappers.

They do not check whether the CUBIN has been registered. They do not perform one-time initialization. They simply bridge from the lowerer’s simple call shape to the exact runtime call REX wants.

That makes them suitable for normal lowered programs after rex_offload_init() has already run.

The safe wrappers

The second family, rex___tgt_*, is declared in the header but implemented in register_cubin.cpp.

Those wrappers do perform registration checks:

if (register_cubin(REX_CUBIN_NAME) == nullptr) {
  return -1;
}

or, for void-returning routines, simply return early on failure.

These are not intended to be the steady-state fast path for normal generated host code. They are the safety net:

useful when a caller needs lazy registration behavior,
useful inside helper-controlled code paths,
and useful as the correctness-preserving fallback interface.

This split is why REX gets both properties it wants:

explicit eager initialization for performance-sensitive generated programs,
and safe on-demand behavior for the cases where eager init is not guaranteed.
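The difference between the two families can be sketched with a mock runtime. Everything here is hypothetical: demo_real_launch stands in for a real runtime entry point, demo_register_image for the registration helper, and the counters exist only so the behavior is observable.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mock state; the real helper talks to libomptarget instead. */
static int demo_register_calls = 0;
static int demo_launch_calls = 0;
static bool demo_image_available = true;
static bool demo_registered = false;

/* Mock of a real runtime entry point. */
static int demo_real_launch(int device_id) {
  (void)device_id;
  ++demo_launch_calls;
  return 0;
}

/* Mock of lazy registration: returns NULL on failure and performs the
   expensive work at most once on success. */
static const void *demo_register_image(void) {
  static const int image_token = 0;
  if (!demo_image_available) {
    return NULL;
  }
  if (!demo_registered) {
    ++demo_register_calls;
    demo_registered = true;
  }
  return &image_token;
}

/* Direct family: no checks, assumes eager initialization already ran. */
static inline int demo_direct_launch(int device_id) {
  return demo_real_launch(device_id);
}

/* Safe family: registers on demand and reports failure as -1. */
static int demo_safe_launch(int device_id) {
  if (demo_register_image() == NULL) {
    return -1;
  }
  return demo_real_launch(device_id);
}
```

In this sketch the safe path pays for a registration check on every call, while the direct path is a plain forward. That cost difference is the reason to keep the two families separate.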

Step 4: Hide The `ident_t *` ABI Quirk Inside The Wrapper

The most visible ABI mismatch is __tgt_target_kernel.

The lowerer wants to emit a five-argument call:

__tgt_target_kernel(device_id, num_teams, thread_limit, host_ptr, &kernel_args);

But the real runtime entry point is declared as:

int __tgt_target_kernel(ident_t *loc, int64_t device_id, int32_t num_teams,
                        int32_t thread_limit, void *host_ptr,
                        struct __tgt_kernel_arguments *kernel_args);

So where does the missing first argument come from?

From the direct wrapper:

static inline int
rex_direct___tgt_target_kernel(int64_t device_id, int32_t num_teams,
                               int32_t thread_limit, void *host_ptr,
                               struct __tgt_kernel_arguments *kernel_args) {
  return __rex_real___tgt_target_kernel(&rex_target_kernel_ident, device_id,
                                        num_teams, thread_limit, host_ptr,
                                        kernel_args);
}

That is the exact kind of quirk the wrapper layer should absorb.

The lowerer does not need to thread an ident_t * through every generated call site. The header centralizes that ABI detail once.

The shared rex_target_kernel_ident object itself lives in register_cubin.cpp, which is another reason the wrapper/header split is useful:

the header can promise that a stable location object exists,
and the helper implementation can provide the actual definition.

This is also what keeps the generated host code readable. Contributors inspecting lowered files see the logical launch call shape, not a runtime-specific location token whose origin is otherwise mysterious.
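The argument-injection pattern can be checked with mocks. In this sketch demo_ident_t, demo_launch_args, and demo_real_target_kernel are hypothetical stand-ins; the real ident_t and __tgt_kernel_arguments layouts are not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the vendored types. */
typedef struct demo_ident {
  const char *source_info;
} demo_ident_t;

struct demo_launch_args {
  uint32_t num_args;
};

/* One shared location object, analogous in role to rex_target_kernel_ident. */
static demo_ident_t demo_target_kernel_ident = {"demo location"};

/* Records the location pointer the mock runtime last received. */
static const demo_ident_t *demo_last_loc = NULL;

/* Mock of a six-argument runtime entry point. */
static int demo_real_target_kernel(demo_ident_t *loc, int64_t device_id,
                                   int32_t num_teams, int32_t thread_limit,
                                   void *host_ptr,
                                   struct demo_launch_args *kernel_args) {
  (void)device_id;
  (void)num_teams;
  (void)thread_limit;
  (void)host_ptr;
  demo_last_loc = loc;
  return (loc != NULL && kernel_args != NULL) ? 0 : -1;
}

/* Five-argument wrapper: callers never see the location parameter. */
static inline int demo_direct_target_kernel(int64_t device_id,
                                            int32_t num_teams,
                                            int32_t thread_limit,
                                            void *host_ptr,
                                            struct demo_launch_args *kernel_args) {
  return demo_real_target_kernel(&demo_target_kernel_ident, device_id,
                                 num_teams, thread_limit, host_ptr,
                                 kernel_args);
}
```

A caller passes five arguments and the mock runtime still observes a non-null location pointer, which is the property the real wrapper has to preserve.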

Figure 2. The wrapper layer absorbs the hidden `ident_t *` requirement for `__tgt_target_kernel`. The lowerer stays simple, while the actual runtime ABI still gets the location object it expects. (Diagram: a before-and-after view showing the lowerer emitting a five-argument __tgt_target_kernel call, the macro layer rewriting it to rex_direct___tgt_target_kernel, and the wrapper injecting &rex_target_kernel_ident before calling the real six-argument runtime symbol.)

Step 5: Rewrite Generated `__tgt_*` Calls With Macros

Once the wrappers exist, the header still needs one more mechanism: generated source files have to end up using them without forcing the lowerer to emit wrapper names explicitly.

That is what the macro layer does:

#ifndef REX_KMP_INTERNAL
#define __tgt_target rex_direct___tgt_target
#define __tgt_target_teams rex_direct___tgt_target_teams
#define __tgt_target_kernel rex_direct___tgt_target_kernel
#define __tgt_target_data_begin rex_direct___tgt_target_data_begin
#define __tgt_target_data_end rex_direct___tgt_target_data_end
#define __tgt_target_data_update rex_direct___tgt_target_data_update
#endif

This is the part that makes the whole design cohere.

The lowerer can keep producing host code that spells the canonical runtime names:

__tgt_target_data_begin(...);
__tgt_target_kernel(...);

But once rex_kmp.h is included, those call sites are rewritten by the preprocessor into the direct wrappers.

That gives REX a very pragmatic division of labor:

the lowerer emits a stable conceptual API,
the header chooses what that API means inside generated code,
and the helper implementation keeps control of the true runtime bindings.

This is much cleaner than teaching the lowerer to emit:

one set of names for generated files,
another set for helper files,
and still more special handling for the kernel-location parameter.

The macro layer is not there to be clever. It is there to keep the compiler simpler.
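The rewrite can be observed with a small self-contained sketch. The names demo_tgt_data_begin and demo_direct_tgt_data_begin are hypothetical; no function called demo_tgt_data_begin exists, so the code links only because the preprocessor renames the call.

```c
#include <string.h>

/* Hypothetical instrumentation: counts how often the wrapper ran. */
static int demo_wrapper_hits = 0;

/* Stand-in for a direct wrapper that a header would define. */
static inline int demo_direct_tgt_data_begin(int device_id) {
  ++demo_wrapper_hits;
  return device_id;
}

/* Header-style rewrite: an object-like macro renames the identifier. */
#define demo_tgt_data_begin demo_direct_tgt_data_begin

/* "Generated" code spells only the canonical name. */
static int demo_generated_region(void) {
  return demo_tgt_data_begin(3);
}

/* Two-level stringification exposes what the name expands to. */
#define DEMO_STR(x) #x
#define DEMO_XSTR(x) DEMO_STR(x)

static const char *demo_resolved_name(void) {
  return DEMO_XSTR(demo_tgt_data_begin);
}
```

Because the macros are object-like rather than function-like, they rename the identifier wherever it appears, including places where the function is referenced without being called.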

Step 6: `REX_KMP_INTERNAL` Prevents The Header From Rewriting Helper Code

Of course, once the header starts rewriting __tgt_* names, the helper implementation itself has to be protected from that rewrite.

That is why register_cubin.cpp starts with:

#define REX_KMP_INTERNAL
#include "rex_kmp.h"

With REX_KMP_INTERNAL defined before the include, the #ifndef guard skips the macro rewrites.

That matters because the helper implementation needs to do things that generated host code does not:

refer to the real runtime symbols intentionally,
define the safe wrappers,
and include the header without having the preprocessor silently reroute its own internal calls.

Without this guard, any plain __tgt_* name inside the helper layer would be silently rewritten into the direct wrappers, which are exactly what the helper is trying to sit beneath or bypass.

This is one of the sharper signs that rex_kmp.h is not “just a convenience header.” It is a policy-bearing header, so it also needs an explicit internal mode that turns that policy off when the implementation itself is being compiled.
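The opt-out can be sketched in a single file by inlining a hypothetical header. DEMO_KMP_INTERNAL, demo_tgt_target, and demo_direct_tgt_target are illustrative names only. If the guard macro were not defined, the later definition of demo_tgt_target would be renamed and would collide with the wrapper.

```c
#include <string.h>

/* A helper translation unit opts out before pulling in the header. */
#define DEMO_KMP_INTERNAL

/* ---- begin: contents of a hypothetical demo header ---- */
static inline int demo_direct_tgt_target(int device_id) {
  return device_id + 100;
}

#ifndef DEMO_KMP_INTERNAL
#define demo_tgt_target demo_direct_tgt_target
#endif
/* ---- end: contents of a hypothetical demo header ---- */

/* With the rewrite disabled, this really defines demo_tgt_target. With the
   rewrite enabled, it would redefine demo_direct_tgt_target and fail. */
static int demo_tgt_target(int device_id) {
  return device_id;
}

/* Two-level stringification shows that the name was left untouched. */
#define DEMO_STR(x) #x
#define DEMO_XSTR(x) DEMO_STR(x)

static const char *demo_seen_name(void) {
  return DEMO_XSTR(demo_tgt_target);
}
```

The define has to appear before the header contents are processed, which is why the helper source places it on the line directly above the include.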

Figure 3. `rex_kmp.h` behaves differently depending on who includes it. Generated host files get the macro rewrite layer; helper implementation files opt out with `REX_KMP_INTERNAL` so they can define and call the real support routines safely. (Diagram: a split view showing ordinary lowered host code including rex_kmp.h in generated mode, where __tgt_* names are macro-rewritten, and helper implementation files defining REX_KMP_INTERNAL before including the header, where the macro layer is disabled.)

What This Buys The Lowerer

From the lowerer’s perspective, this wrapper layer buys several concrete simplifications.

It keeps generated call sites uniform

The lowerer can keep building host calls with one conceptual vocabulary:

__tgt_target_kernel
__tgt_target_data_begin
__tgt_target_data_end
__tgt_target_data_update

That is much easier to generate than a mixture of wrapper-specific spellings and ABI-specific exceptions.

It keeps runtime policy out of AST rewriting

The lowerer does not need to decide at every offload site:

whether lazy registration should run,
how to reach the real symbol,
or how to inject the kernel location object.

Those are helper-layer decisions, not AST-lowering decisions.

It keeps toolchain drift localized

When runtime ABI details shift, REX can often adjust rex_kmp.h and the helper implementation instead of changing every place the lowerer builds offload calls.

That is one reason this header belongs in the helper boundary rather than inside the lowerer proper.

What Current Tests Actually Prove

There is not a dedicated test named “wrapper layer works” that isolates rex_kmp.h by itself.

But the current test coverage still exercises the important parts of this design.

The GPU lowering invariant suite checks host-side facts such as:

exactly one #include "rex_kmp.h" in the lowered host file,
the expected number of __tgt_target_kernel(...) call sites,
and the ordering relation between that include and the generated offload entries.

That matters because the wrapper layer only exists if the generated host file actually includes the header and keeps using the canonical __tgt_* names that the macro layer rewrites.
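Checks of this kind amount to simple text queries over the lowered host file. The sketch below is not the actual invariant suite; demo_count_occurrences and demo_appears_before are hypothetical helpers that show how an include count, a call-site count, and an ordering relation could be checked.

```c
#include <stddef.h>
#include <string.h>

/* Counts non-overlapping occurrences of needle in text. */
static size_t demo_count_occurrences(const char *text, const char *needle) {
  size_t count = 0;
  size_t step = strlen(needle);
  if (step == 0) {
    return 0;
  }
  for (const char *p = strstr(text, needle); p != NULL;
       p = strstr(p + step, needle)) {
    ++count;
  }
  return count;
}

/* Returns 1 when the first occurrence of `first` precedes the first
   occurrence of `second`, and 0 otherwise (including when either is
   missing). */
static int demo_appears_before(const char *text, const char *first,
                               const char *second) {
  const char *a = strstr(text, first);
  const char *b = strstr(text, second);
  return a != NULL && b != NULL && a < b;
}
```

A check built this way inspects the generated source before preprocessing, so it sees the canonical __tgt_* spellings rather than the wrapper names they are later rewritten to.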

Then end-to-end GPU execution covers the rest implicitly:

direct wrappers must bind to the real runtime symbols correctly,
the hidden ident_t * bridge for kernel launch must work,
and the safe wrappers plus registration path must still function when they are used.

So the coverage story here is similar to the registration-helper story:

structural lowering tests validate the generated-source contract,
and real offload execution validates the runtime behavior.

That is a reasonable fit for a wrapper layer whose whole purpose is to sit between generated source and the real runtime.

Closing

rex_kmp.h is where REX turns a messy set of runtime concerns into one coherent contract:

vendored ABI structs,
real-symbol aliases,
hot-path wrappers,
safe wrappers,
macro rewrite policy,
and the kernel-location bridge.

That is why the lowerer can keep emitting code that looks simple.

The simplicity is real at the source level, but it is achieved by moving runtime-specific complexity into the header and helper layer that are designed to own it.

Without that layer, the lowerer would need to know too much about runtime details, and every generated host file would become harder to read, harder to evolve, and easier to break when the surrounding toolchain changes.