How `rex_kmp.h` Rewrites Offloading Runtime Calls In REX
REX uses rex_kmp.h to centralize offloading runtime policy. The header vendors ABI structs, exposes REX-controlled wrappers, and rewrites __tgt_* calls through macros unless REX_KMP_INTERNAL is set. That keeps code generation simple while the helper layer handles the real symbol bindings.
The previous post in this series focused on register_cubin.cpp: how REX loads a standalone CUBIN file, builds __tgt_device_image and __tgt_bin_desc, and registers the result with libomptarget.
That still leaves one more runtime-boundary layer to explain:
how do the generated host sources actually call the runtime API without hardcoding all of the ABI quirks themselves?
That is the job of src/midend/programTransformation/ompLowering/rex_kmp.h.
This header looks deceptively simple. It is only a header, and most of it is declarations. But architecturally it is doing several jobs at once:
- it vendors the key OpenMP offloading ABI structs that lowered code needs to compile against,
- it binds stable REX names to the real libomptarget symbols with asm aliases,
- it exposes direct wrappers for the hot path,
- it declares safe wrappers that can register the CUBIN lazily,
- and it rewrites generated __tgt_* calls so the lowerer can keep emitting simple names instead of runtime-specific plumbing.
This post stays tightly focused on that wrapper layer. It explains:
- why rex_kmp.h exists even though libomptarget already has runtime symbols,
- how the header vendors ABI structs and protects itself from macro collisions,
- how __rex_real___tgt_* names bind to the real runtime entry points,
- why there are both rex_direct___tgt_* and rex___tgt_* wrappers,
- how the macro layer rewrites generated __tgt_* calls,
- why REX_KMP_INTERNAL disables that rewriting inside helper implementations,
- and how rex_target_kernel_ident bridges the hidden ident_t * parameter required by __tgt_target_kernel.
Figure 1. The lowerer emits simple runtime-looking names. rex_kmp.h rewrites those names into REX-controlled wrappers and only then reaches the real libomptarget symbols.
Why This Header Exists At All
At first glance, rex_kmp.h can look redundant.
Why not just let generated files include a system runtime header and call the runtime symbols directly?
Because REX wants two things at the same time:
- simple generated source
- centralized control over runtime policy and ABI quirks
If generated host code had to spell every detail explicitly, the lowerer would need to know far too much about the exact runtime interface in every place it emits a call.
For example, the lowerer wants to be able to generate straightforward host code that calls __tgt_target_kernel by its plain name, without a location argument.
That shape is easy to generate and easy to inspect.
But the real runtime ABI is slightly more awkward than that:
- __tgt_target_kernel actually takes an ident_t *loc parameter in front,
- image registration policy should be centralized instead of reimplemented at every call site,
- and generated sources should not depend on the exact header arrangement of whatever LLVM install happens to be on the build machine.
So rex_kmp.h becomes the contract boundary.
The lowerer can keep emitting source that looks like ordinary runtime calls. The header then translates that simple surface into the exact symbol names, argument shapes, and wrapper behavior that REX wants.
That is not unnecessary indirection. It is how the compiler keeps code generation simple without scattering ABI knowledge everywhere.
Step 1: Vendor The Runtime ABI Structs Directly
The first thing rex_kmp.h does is define the data structures that lowered code must see:
- ident_t
- __tgt_offload_entry
- __tgt_device_image
- __tgt_bin_desc
- __tgt_kernel_arguments
This is a deliberately conservative choice.
The generated sources need these layouts in order to compile. REX could try to include them from some system header, but that would make generated code depend on the exact organization of the installed runtime headers. That is brittle and unnecessary for a source-to-source compiler.
So the header vendors the layouts directly.
This is especially useful for __tgt_kernel_arguments, which is one of the most ABI-sensitive structures in the whole offloading flow. The generated host code builds that object explicitly, and the wrapper layer passes it straight into the runtime.
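As a rough illustration of what vendoring layouts means in practice, the sketch below declares descriptor structs locally and uses them without any runtime header. Every name here is a hypothetical `sketch_` stand-in with illustrative fields; none of it is copied from rex_kmp.h, and real layouts have to match the installed libomptarget exactly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins: names and fields are illustrative, not the real ABI. */
typedef struct sketch_offload_entry {
  void *addr;       /* host address of the kernel or global */
  const char *name; /* symbol name used to match host and device sides */
  size_t size;      /* 0 for functions, byte size for globals */
} sketch_offload_entry;

typedef struct sketch_device_image {
  void *image_start; /* first byte of the device binary */
  void *image_end;   /* one past the last byte */
  sketch_offload_entry *entries_begin;
  sketch_offload_entry *entries_end;
} sketch_device_image;

typedef struct sketch_bin_desc {
  int32_t num_device_images;
  sketch_device_image *device_images;
  sketch_offload_entry *host_entries_begin;
  sketch_offload_entry *host_entries_end;
} sketch_bin_desc;

/* The layouts are declared locally, so no runtime header is needed here. */
static int32_t sketch_count_entries(const sketch_bin_desc *desc) {
  return (int32_t)(desc->host_entries_end - desc->host_entries_begin);
}

static void sketch_vendored_structs_demo(void) {
  static char fake_image[16];
  static int kernel_anchor; /* stands in for a host-side kernel handle */
  sketch_offload_entry entries[1] = {{&kernel_anchor, "sketch_kernel", 0}};
  sketch_device_image image = {fake_image, fake_image + sizeof fake_image,
                               entries, entries + 1};
  sketch_bin_desc desc = {1, &image, entries, entries + 1};
  assert(desc.num_device_images == 1);
  assert(sketch_count_entries(&desc) == 1);
}
```

The point of the sketch is only the dependency shape: code that builds these objects compiles against the local declarations, not against whatever headers the build machine provides.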
Macro hygiene inside the struct definition
One detail here is easy to miss and worth calling out.
Before defining __tgt_kernel_arguments, the header pushes and undefines any macros whose names collide with the field names of the struct, and then restores those macros afterward.
That is not decorative. It is protecting the struct field names from accidental macro collisions coming from other headers.
Because REX-generated code often includes multiple helper headers and user headers, it cannot assume that names like Version, Flags, or NumArgs are safe as raw tokens. The push/undef/pop sequence makes the vendored ABI struct resilient in the presence of unrelated macro pollution.
This is a small but very REX-like design choice:
- keep the ABI local,
- keep it explicit,
- and harden it against the kinds of header interactions a source-to-source compiler really encounters.
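The push/undef/pop idea can be sketched as follows. This assumes the `#pragma push_macro` / `#pragma pop_macro` mechanism supported by GCC, Clang, and MSVC; whether rex_kmp.h uses exactly these pragmas is an assumption, and the struct, accessors, and polluting macro values are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated pollution from an unrelated header (hypothetical values). */
#define Version 42
#define NumArgs 7

/* Save the outside definitions, then clear them for the struct body. */
#pragma push_macro("Version")
#pragma push_macro("NumArgs")
#undef Version
#undef NumArgs

typedef struct sketch_kernel_args {
  uint32_t Version; /* would expand to `uint32_t 42;` without the #undef */
  uint32_t NumArgs;
} sketch_kernel_args;

/* Accessors defined while the field names are still clean. */
static uint32_t sketch_get_version(const sketch_kernel_args *a) {
  return a->Version;
}
static uint32_t sketch_get_num_args(const sketch_kernel_args *a) {
  return a->NumArgs;
}

/* Restore whatever the including file had defined. */
#pragma pop_macro("NumArgs")
#pragma pop_macro("Version")

static void sketch_macro_hygiene_demo(void) {
  /* Positional initializer: the field names are macros again here. */
  sketch_kernel_args args = {3u, 5u};
  assert(sketch_get_version(&args) == 3u);
  assert(sketch_get_num_args(&args) == 5u);
  assert(Version == 42 && NumArgs == 7); /* outside macros survived */
}
```

Note that the protection covers the struct definition itself. Code that names the fields after the pop still sees the outside macros, which is why the demo initializes positionally.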
Step 2: Separate Real Runtime Symbols From REX-Controlled Names
The next layer is the asm-alias declarations.
Instead of calling the system runtime symbols directly everywhere, rex_kmp.h introduces asm-aliased declarations with names of the form __rex_real___tgt_*.
These declarations matter because they give REX a stable internal vocabulary:
- the generated and helper code can refer to __rex_real___tgt_*,
- and the linker still resolves those names to the actual libomptarget symbols.
This is cleaner than mixing “real runtime symbols” and “wrapper entry points” under the same spelling inside helper code.
It also prevents the macro layer from becoming confusing.
Once the header later starts rewriting __tgt_target_kernel to rex_direct___tgt_target_kernel, the helper implementation still needs a way to say “no, I really mean the actual runtime function now.” The asm-alias names provide exactly that escape hatch.
So the alias layer is not just a naming trick. It is what lets the header separate:
- what generated code looks like it is calling,
- what REX wrappers want to expose,
- and what symbol the final executable must really bind to.
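The separation above relies on asm labels, a GCC/Clang extension that lets a C declaration carry one name while binding to a differently named linker symbol. A minimal sketch, using a locally defined stand-in instead of a real libomptarget entry point (all names are hypothetical, and it assumes a platform without symbol prefixes, such as ELF/Linux):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a runtime entry point that would normally live in libomptarget. */
int32_t sketch_runtime_data_begin(int64_t device_id, int32_t arg_num) {
  return (device_id >= 0 && arg_num >= 0) ? 0 : -1;
}

/* A differently spelled C name bound to that symbol through an asm label. */
extern int32_t sketch_real_data_begin(int64_t device_id, int32_t arg_num)
    __asm__("sketch_runtime_data_begin");

static void sketch_alias_demo(void) {
  /* Calls through the alias name resolve to the stand-in symbol. */
  assert(sketch_real_data_begin(0, 2) == 0);
  assert(sketch_real_data_begin(-1, 2) == -1);
}
```

Because the alias name is an ordinary identifier, a later macro that rewrites the canonical spelling leaves it untouched.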
Step 3: Expose Two Wrapper Families, Not One
The header and helper layer together expose two different wrapper families:
- rex_direct___tgt_*
- rex___tgt_*
That split is one of the most important design choices in this header.
The direct wrappers
The rex_direct___tgt_* wrappers are defined as static inline functions directly in rex_kmp.h.
The family includes:
- rex_direct___tgt_target_teams
- rex_direct___tgt_target_kernel
- rex_direct___tgt_target_data_begin
- rex_direct___tgt_target_data_end
- rex_direct___tgt_target_data_update
These are the hot-path wrappers.
They do not check whether the CUBIN has been registered. They do not perform one-time initialization. They simply bridge from the lowerer’s simple call shape to the exact runtime call REX wants.
That makes them suitable for normal lowered programs after rex_offload_init() has already run.
The safe wrappers
The second family, rex___tgt_*, is declared in the header but implemented in register_cubin.cpp.
Those wrappers do perform registration checks before they forward to the runtime. On failure, value-returning routines report the failure to the caller, and void-returning routines simply return early.
These are not intended to be the steady-state fast path for normal generated host code. They are the safety net:
- useful when a caller needs lazy registration behavior,
- useful inside helper-controlled code paths,
- and useful as the correctness-preserving fallback interface.
This split is why REX gets both properties it wants:
- explicit eager initialization for performance-sensitive generated programs,
- and safe on-demand behavior for the cases where eager init is not guaranteed.
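The difference between the two families can be sketched as below. The registration flag, the helper names, and the `-1` failure code are all assumptions made for illustration; the real safe wrappers live in register_cubin.cpp and may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool sketch_registered = false; /* has the device image been registered? */
static int sketch_register_calls = 0;  /* how many times registration ran */

/* Stand-in for the real runtime entry point. */
static int32_t sketch_real_data_begin(int64_t device_id) {
  return device_id >= 0 ? 0 : -1;
}

/* Stand-in for one-time registration; always succeeds in this sketch. */
static bool sketch_ensure_registered(void) {
  if (!sketch_registered) {
    ++sketch_register_calls;
    sketch_registered = true;
  }
  return sketch_registered;
}

/* Direct family: no checks, just forward. Assumes init already ran. */
static inline int32_t sketch_direct_data_begin(int64_t device_id) {
  return sketch_real_data_begin(device_id);
}

/* Safe family: register lazily, report failure instead of calling through. */
static int32_t sketch_safe_data_begin(int64_t device_id) {
  if (!sketch_ensure_registered())
    return -1; /* hypothetical failure code */
  return sketch_real_data_begin(device_id);
}

static void sketch_wrapper_families_demo(void) {
  assert(sketch_direct_data_begin(0) == 0);
  assert(sketch_register_calls == 0); /* direct path never registers */
  assert(sketch_safe_data_begin(0) == 0);
  assert(sketch_safe_data_begin(0) == 0);
  assert(sketch_register_calls == 1); /* safe path registers exactly once */
}
```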
Step 4: Hide The ident_t * ABI Quirk Inside The Wrapper
The most visible ABI mismatch is __tgt_target_kernel.
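The lowerer wants to emit a five-argument call, but the real runtime entry point is declared with an additional ident_t * location parameter in front.
So where does the missing first argument come from?
From the direct wrapper, which accepts the five-argument form and supplies the shared rex_target_kernel_ident object as the leading location argument.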
That is the exact kind of quirk the wrapper layer should absorb.
The lowerer does not need to thread an ident_t * through every generated call site. The header centralizes that ABI detail once.
The shared rex_target_kernel_ident object itself lives in register_cubin.cpp, which is another reason the wrapper/header split is useful:
- the header can promise that a stable location object exists,
- and the helper implementation can provide the actual definition.
This is also what keeps the generated host code readable. Contributors inspecting lowered files see the logical launch call shape, not a runtime-specific location token whose origin is otherwise mysterious.
Figure 2. The wrapper layer absorbs the hidden ident_t * requirement for __tgt_target_kernel. The lowerer stays simple, while the actual runtime ABI still gets the location object it expects.
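A minimal sketch of that bridge, with stand-in types and a locally defined stand-in for the runtime entry point. The parameter names and the contents of the location object are assumptions; only the shape (five arguments in, six arguments out) follows the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for ident_t and __tgt_kernel_arguments (fields are illustrative). */
typedef struct sketch_ident { const char *psource; } sketch_ident;
typedef struct sketch_kernel_args { uint32_t num_args; } sketch_kernel_args;

/* One shared location object, defined once by the helper layer. */
static sketch_ident sketch_target_kernel_ident = {"sketch location"};
static const sketch_ident *sketch_last_loc = NULL;

/* Stand-in for the six-parameter runtime entry point. */
static int32_t sketch_real_target_kernel(sketch_ident *loc, int64_t device_id,
                                         int32_t num_teams, int32_t thread_limit,
                                         void *host_ptr,
                                         sketch_kernel_args *args) {
  (void)device_id; (void)num_teams; (void)thread_limit;
  (void)host_ptr; (void)args;
  sketch_last_loc = loc; /* remember which location object arrived */
  return loc != NULL ? 0 : -1;
}

/* Five-parameter wrapper: injects the shared location as the first argument. */
static inline int32_t sketch_direct_target_kernel(int64_t device_id,
                                                  int32_t num_teams,
                                                  int32_t thread_limit,
                                                  void *host_ptr,
                                                  sketch_kernel_args *args) {
  return sketch_real_target_kernel(&sketch_target_kernel_ident, device_id,
                                   num_teams, thread_limit, host_ptr, args);
}

static void sketch_ident_bridge_demo(void) {
  static int kernel_anchor; /* stands in for the host kernel handle */
  sketch_kernel_args args = {0u};
  assert(sketch_direct_target_kernel(0, 1, 128, &kernel_anchor, &args) == 0);
  assert(sketch_last_loc == &sketch_target_kernel_ident);
}
```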
Step 5: Rewrite Generated __tgt_* Calls With Macros
Once the wrappers exist, the header still needs one more mechanism: generated source files have to end up using them without forcing the lowerer to emit wrapper names explicitly.
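That is what the macro layer does: it defines each canonical __tgt_* name as a macro that expands to the matching rex_direct___tgt_* wrapper.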
This is the part that makes the whole design cohere.
The lowerer can keep producing host code that spells the canonical runtime names, such as __tgt_target_kernel and __tgt_target_data_begin.
But once rex_kmp.h is included, those call sites are rewritten by the preprocessor into the direct wrappers.
That gives REX a very pragmatic division of labor:
- the lowerer emits a stable conceptual API,
- the header chooses what that API means inside generated code,
- and the helper implementation keeps control of the true runtime bindings.
This is much cleaner than teaching the lowerer to emit:
- one set of names for generated files,
- another set for helper files,
- and still more special handling for the kernel-location parameter.
The macro layer is not there to be clever. It is there to keep the compiler simpler.
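The rewrite itself needs nothing more than an object-like macro. The sketch below uses hypothetical `sketch_` names in place of the reserved __tgt_* spellings and a counter to show that a call written against the canonical-looking name actually lands in the wrapper.

```c
#include <assert.h>
#include <stdint.h>

static int sketch_direct_hits = 0; /* counts calls that reach the wrapper */

/* Stand-in for the real runtime symbol. */
static int32_t sketch_real_data_end(int64_t device_id) {
  return device_id >= 0 ? 0 : -1;
}

/* Wrapper the header wants generated code to use. */
static inline int32_t sketch_direct_data_end(int64_t device_id) {
  ++sketch_direct_hits;
  return sketch_real_data_end(device_id);
}

/* Rewrite layer: the canonical-looking name now means the wrapper. */
#define sketch_tgt_data_end sketch_direct_data_end

/* "Generated" code spells only the canonical-looking name. */
static int32_t sketch_generated_region(void) {
  return sketch_tgt_data_end(0);
}

static void sketch_macro_rewrite_demo(void) {
  assert(sketch_generated_region() == 0);
  assert(sketch_direct_hits == 1); /* the call went through the wrapper */
}
```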
Step 6: REX_KMP_INTERNAL Prevents The Header From Rewriting Helper Code
Of course, once the header starts rewriting __tgt_* names, the helper implementation itself has to be protected from that rewrite.
That is why register_cubin.cpp defines REX_KMP_INTERNAL before it includes rex_kmp.h.
With REX_KMP_INTERNAL set, the macro rewrites are disabled.
That matters because the helper implementation needs to do things that generated host code does not:
- refer to the real runtime symbols intentionally,
- define the safe wrappers,
- and include the header without having the preprocessor silently reroute its own internal calls.
Without this guard, the helper layer would risk rewriting itself into the direct wrappers it is trying to implement or bypass.
This is one of the sharper signs that rex_kmp.h is not “just a convenience header.” It is a policy-bearing header, so it also needs an explicit internal mode that turns that policy off when the implementation itself is being compiled.
Figure 3. rex_kmp.h behaves differently depending on who includes it. Generated host files get the macro rewrite layer; helper implementation files opt out with REX_KMP_INTERNAL so they can define and call the real support routines safely.
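The effect of an internal-mode guard can be sketched in a single file. The guard name `SKETCH_KMP_INTERNAL`, the two-parameter stand-in entry point, and the section markers are hypothetical; the sketch only shows why a helper that calls the full-arity function needs the rewrite switched off.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What a helper source would do before including the header. */
#define SKETCH_KMP_INTERNAL 1

/* Stand-in runtime entry: canonical-looking name, two parameters. */
static int32_t sketch_tgt_launch(const char *loc, int64_t device_id) {
  return (loc != NULL && device_id >= 0) ? 0 : -1;
}

/* ---- header portion (hypothetical) ---- */
static inline int32_t sketch_direct_launch(int64_t device_id) {
  return sketch_tgt_launch("sketch location", device_id);
}
#ifndef SKETCH_KMP_INTERNAL
#define sketch_tgt_launch sketch_direct_launch /* generated code gets this */
#endif
/* ---- end header portion ---- */

/* Helper code: this two-argument call compiles only because the rewrite is
   off. With it on, the call would expand to sketch_direct_launch("helper",
   device_id), which has the wrong number of arguments. */
static int32_t sketch_helper_launch(int64_t device_id) {
  return sketch_tgt_launch("helper", device_id);
}

static void sketch_internal_guard_demo(void) {
  assert(sketch_helper_launch(0) == 0);
  assert(sketch_helper_launch(-1) == -1);
#ifdef sketch_tgt_launch
  assert(0 && "rewrite should be disabled in internal mode");
#endif
}
```

Removing the `#define SKETCH_KMP_INTERNAL 1` line turns the helper call into a compile error, which is the failure mode the guard exists to prevent.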
What This Buys The Lowerer
From the lowerer’s perspective, this wrapper layer buys several concrete simplifications.
It keeps generated call sites uniform
The lowerer can keep building host calls with one conceptual vocabulary:
- __tgt_target_kernel
- __tgt_target_data_begin
- __tgt_target_data_end
- __tgt_target_data_update
That is much easier to generate than a mixture of wrapper-specific spellings and ABI-specific exceptions.
It keeps runtime policy out of AST rewriting
The lowerer does not need to decide at every offload site:
- whether lazy registration should run,
- how to reach the real symbol,
- or how to inject the kernel location object.
Those are helper-layer decisions, not AST-lowering decisions.
It keeps toolchain drift localized
When runtime ABI details shift, REX can often adjust rex_kmp.h and the helper implementation instead of changing every place the lowerer builds offload calls.
That is one reason this header belongs in the helper boundary rather than inside the lowerer proper.
What Current Tests Actually Prove
There is not a dedicated test named “wrapper layer works” that isolates rex_kmp.h by itself.
But the current test coverage still exercises the important parts of this design.
The GPU lowering invariant suite checks host-side facts such as:
- exactly one #include "rex_kmp.h" in the lowered host file,
- the expected number of __tgt_target_kernel(...) call sites,
- and the ordering relation between that include and the generated offload entries.
That matters because the wrapper layer only exists if the generated host file actually includes the header and keeps using the canonical __tgt_* names that the macro layer rewrites.
Then end-to-end GPU execution covers the rest implicitly:
- direct wrappers must bind to the real runtime symbols correctly,
- the hidden ident_t * bridge for kernel launch must work,
- and the safe wrappers plus registration path must still function when they are used.
So the coverage story here is similar to the registration-helper story:
- structural lowering tests validate the generated-source contract,
- and real offload execution validates the runtime behavior.
That is a reasonable fit for a wrapper layer whose whole purpose is to sit between generated source and the real runtime.
Closing
rex_kmp.h is where REX turns a messy set of runtime concerns into one coherent contract:
- vendored ABI structs,
- real-symbol aliases,
- hot-path wrappers,
- safe wrappers,
- macro rewrite policy,
- and the kernel-location bridge.
That is why the lowerer can keep emitting code that looks simple.
The simplicity is real at the source level, but it is achieved by moving runtime-specific complexity into the header and helper layer that are designed to own it.
Without that layer, the lowerer would need to know too much about runtime details, and every generated host file would become harder to read, harder to evolve, and easier to break when the surrounding toolchain changes.