How Offloading Runtime Glue Works in REX

REX lowers GPU offloading into explicit __tgt_* runtime calls, but those calls only work after a device image is registered. This post explains how the lowerer emits omp_offloading_entries, how register_cubin.cpp loads and registers a CUBIN with __tgt_register_lib, how rex_kmp.h rewrites runtime calls into REX wrappers, and how init/fini choices affect performance and correctness.

The previous post focused on what the GPU lowerer generates: outlined kernels, map arrays, __tgt_kernel_arguments, and the final launch call. This post focuses on what makes that generated code runnable.

If you look at the lowered host code, it is full of runtime-looking identifiers:

  • __tgt_target_kernel
  • __tgt_target_data_begin / __tgt_target_data_end
  • __tgt_register_lib
  • __tgt_offload_entry

Those names are not abstract concepts. They are part of the LLVM OpenMP offloading runtime ABI (as implemented by libomptarget and its device plugins). REX does not replace that ABI. REX targets it.

So the main question becomes: how does a source-to-source compiler that emits ordinary host and device source files connect those files to the offloading runtime in a way that is:

  1. correct for multiple kernels and repeated launches,
  2. debuggable by inspecting generated artifacts,
  3. fast enough that “helper glue” is not a measurable fixed cost.

The short answer is: REX ships a small helper layer (most importantly rex_kmp.h and register_cubin.cpp) that makes the runtime contract explicit and repeatable.

Runtime lifecycle: register the device image once, then launch kernels many times.

Figure 1. Offload execution has two phases. First, the device image is registered with libomptarget (once per process). Later, kernel launches and target-data regions reuse that registration.

The Three Pieces of the Runtime Boundary

You can understand the helper/runtime boundary in REX as three cooperating pieces:

  1. An entry table: the host program contains a table of offload entries describing what kernels exist.
  2. A device image registration step: the GPU binary is registered with the runtime alongside that entry table.
  3. Call-site rewriting: the lowered host code calls what looks like the runtime API, but rex_kmp.h rewrites those calls into REX-controlled wrappers that are stable across toolchains and fast on the hot path.

Each of these pieces is small, but the interaction is subtle. If any one of them is off, you get failures that feel “mysterious”: kernels compile but do not run, images exist but are not found, or launches happen with wrong argument mapping.

The rest of this post explains each piece and shows how they connect.

Piece 1: Offload Entries In omp_offloading_entries

The most important thing the lowerer emits (besides the kernel body) is the identity of the kernel as far as the runtime is concerned.

In omp_lowering.cpp, the GPU worksharing lowering path explicitly materializes a __tgt_offload_entry object and places it into a special ELF/COFF section called omp_offloading_entries. Conceptually, each kernel gets one entry with:

  • an address used as a host-side key (addr),
  • the kernel name string (name),
  • bookkeeping fields like size and flags.

The lowerer code looks like this (simplified for readability):

SgClassDeclaration *tgt_offload_entry =
    buildStructDeclaration("__tgt_offload_entry", getGlobalScope(target));

SgVariableDeclaration *outlined_kernel_id_decl =
    buildVariableDeclaration(func_name + "id__", buildCharType(),
                             buildAssignInitializer(buildIntVal(0)), g_scope);

SgExprListExp *offload_entry_parameters = buildExprListExp(
    buildCastExp(buildAddressOfOp(buildVarRefExp(outlined_kernel_id_decl)),
                 buildPointerType(buildVoidType())),
    buildStringVal(func_name + "kernel__"),
    buildIntVal(0), buildIntVal(0), buildIntVal(0));

SgVariableDeclaration *offload_entry_decl = buildVariableDeclaration(
    func_name + "omp_offload_entry__", tgt_offload_entry->get_type(),
    buildAssignInitializer(buildBracedInitializer(offload_entry_parameters)),
    g_scope);

offload_entry_decl->get_decl_item(SgName(func_name + "omp_offload_entry__"))
    ->set_gnu_attribute_section_name("omp_offloading_entries");

Two details matter here.

First, the entry addr is not the kernel function pointer. Instead, it uses the address of a synthetic char symbol (func_name + "id__"). That address becomes the stable host-side key that identifies the kernel at runtime.

Second, putting the entry into omp_offloading_entries is what makes it discoverable later. Toolchains that support this section convention also provide boundary symbols:

  • __start_omp_offloading_entries
  • __stop_omp_offloading_entries

Those symbols are not “magic REX variables”. They are produced by the linker for that section. REX uses them as the authoritative range of offload entries present in the binary.

This is what makes multi-kernel programs work cleanly: each kernel contributes one entry, and the section boundary range covers the full set.

Piece 2: Registering A CUBIN With __tgt_register_lib

The second piece is device-image registration. In native Clang offloading, device images are usually embedded into the host binary as a bundled blob. REX uses a different strategy that fits a source-to-source pipeline better:

  • compile the generated device translation unit into a standalone CUBIN file,
  • load that CUBIN at runtime,
  • register it with libomptarget using a __tgt_bin_desc.

That is the job of src/midend/programTransformation/ompLowering/register_cubin.cpp.

At a high level, register_cubin.cpp does the following:

  1. reads rex_lib_nvidia.cubin into memory,
  2. builds a __tgt_device_image that points at the image bytes and the entry table,
  3. builds a __tgt_bin_desc that describes the image list and host entry list,
  4. calls __tgt_register_lib(&bin_desc),
  5. keeps the image bytes alive for the lifetime of the registration.

The structure setup is deliberately explicit:

storage->device_image.ImageStart = storage->image.data();
storage->device_image.ImageEnd = storage->image.data() + storage->image.size();
storage->device_image.EntriesBegin = &__start_omp_offloading_entries;
storage->device_image.EntriesEnd = &__stop_omp_offloading_entries;

storage->bin_desc.NumDeviceImages = 1;
storage->bin_desc.DeviceImages = &storage->device_image;
storage->bin_desc.HostEntriesBegin = &__start_omp_offloading_entries;
storage->bin_desc.HostEntriesEnd = &__stop_omp_offloading_entries;

__rex_real___tgt_register_lib(&storage->bin_desc);

There is an important lifetime constraint hidden in that code: ImageStart and ImageEnd point into the std::vector<unsigned char> buffer holding the CUBIN bytes. That buffer must stay alive as long as the runtime might use the image. That is why register_cubin.cpp stores the buffer inside a long-lived CubinStorage object and keeps it in a global std::unique_ptr.

This is also why the helper does not just do:

__tgt_register_lib(&desc);
free(image_bytes);

That would turn ImageStart into a dangling pointer and produce failures that are extremely hard to diagnose later.

ABI structures: __tgt_bin_desc contains __tgt_device_image and the offload entry table range.

Figure 2. The runtime registration contract is a small set of C structs. REX vendors these struct definitions and builds them explicitly so generated code can remain stable across compiler and runtime versions.

Thread Safety And One-Time Registration

Registration is done once per process, but the code must handle two realities:

  1. multiple kernels may be launched from multiple call sites,
  2. those call sites can appear in code paths where offloading is triggered from multiple threads.

register_cubin.cpp addresses this by making registration idempotent and thread-safe. The helper uses an atomic state machine:

  • kUnregistered: no image has been registered yet
  • kBusy: one thread is currently registering
  • kRegistered: registration completed, storage is live

The fast path reads the state and immediately returns the cached bin_desc when already registered. The slow path uses an atomic compare-exchange to elect one thread to do the registration work, while other threads spin until the state becomes kRegistered.

This is one of the reasons REX prefers to do registration before the first timed region in main: the slow path includes file I/O and runtime calls and would otherwise show up as “mysterious fixed overhead” in short-running benchmarks.

Why A File-Based CUBIN?

REX is a source-to-source compiler. That means the build system for a lowered application looks like a normal build: compile host sources, compile device sources, link a normal executable.

A file-based device image fits that model well:

  • the CUBIN is an inspectable artifact,
  • it can be swapped without relinking the host binary,
  • it keeps the host binary free of embedded blobs during debugging,
  • it keeps the helper runtime small and explicit.

There are tradeoffs (the CUBIN must be present at runtime), but in practice this is an excellent fit for a source-to-source workflow where “inspect the generated artifacts” is a primary debugging tool.

Piece 3: rex_kmp.h Rewrites Runtime Calls

The third piece is what makes the lowered host code pleasant to generate and fast to run.

REX wants the lowerer to emit “the obvious runtime calls”:

  • __tgt_target_data_begin(...)
  • __tgt_target_kernel(...)

But REX also wants control over:

  • which symbol names actually resolve at link time,
  • how to handle ABI details (especially __tgt_target_kernel’s location argument),
  • how to switch between a fast hot path and a safe lazy-registration path.

That is what rex_kmp.h provides.

Vendoring The ABI Structs

At the top, rex_kmp.h defines the key ABI types directly:

  • struct __tgt_offload_entry
  • struct __tgt_device_image
  • struct __tgt_bin_desc
  • struct __tgt_kernel_arguments

This is intentionally boring code, and that is the point. It lets the generated sources include a single header and not depend on a particular system header layout for the runtime ABI.
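For orientation, here is roughly what the first three vendored structs look like. Field names are the ones used by the registration code earlier in this post; the exact layouts follow upstream libomptarget, and rex_kmp.h's copies are the authority (__tgt_kernel_arguments is elided because its layout varies more across runtime versions):

```cpp
#include <cstddef>
#include <cstdint>

// One entry per kernel, collected in the omp_offloading_entries section.
struct __tgt_offload_entry {
  void *addr;        // host-side key (address of the synthetic id__ char)
  char *name;        // kernel name string
  size_t size;       // 0 for function entries
  int32_t flags;
  int32_t reserved;
};

// One device image: the CUBIN bytes plus the entry range it serves.
struct __tgt_device_image {
  void *ImageStart;                   // first byte of the CUBIN buffer
  void *ImageEnd;                     // one past the last byte
  __tgt_offload_entry *EntriesBegin;  // &__start_omp_offloading_entries
  __tgt_offload_entry *EntriesEnd;    // &__stop_omp_offloading_entries
};

// The descriptor handed to __tgt_register_lib.
struct __tgt_bin_desc {
  int32_t NumDeviceImages;
  __tgt_device_image *DeviceImages;
  __tgt_offload_entry *HostEntriesBegin;
  __tgt_offload_entry *HostEntriesEnd;
};
```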

Separating “Real Runtime Symbols” From “What Generated Code Calls”

REX uses a naming trick to keep the helper layer in control without forcing the lowerer to emit different function names.

The header declares “real runtime functions” using asm aliases, for example:

int __rex_real___tgt_target_kernel(...) __asm__("__tgt_target_kernel");
void __rex_real___tgt_register_lib(...) __asm__("__tgt_register_lib");

Those declarations always refer to the true libomptarget symbols, regardless of any macro rewriting that might happen later in the same translation unit.

Then it declares REX-controlled wrappers such as:

static inline int rex_direct___tgt_target_kernel(...);
int rex___tgt_target_kernel(...);

The rex_direct_... wrappers are the fast path: they assume the image is already registered and call the real runtime entry points directly.

The rex___... wrappers are the safe path: they call rex_offload_init() on entry and only then delegate to the runtime.

The Macro Layer

Finally, and this is the key piece, rex_kmp.h uses macros so generated code can call __tgt_* and still end up in the right wrapper:

#ifndef REX_KMP_INTERNAL
#define __tgt_target_kernel rex_direct___tgt_target_kernel
#define __tgt_target_data_begin rex_direct___tgt_target_data_begin
...
#endif

So when lowered host sources include rex_kmp.h, the call sites produced by the lowerer:

__tgt_target_kernel(device_id, num_teams, thread_limit, host_ptr, &kernel_args);

are rewritten by the preprocessor into:

rex_direct___tgt_target_kernel(device_id, num_teams, thread_limit, host_ptr, &kernel_args);

and that wrapper adds whatever ABI details are required (for example, passing a stable ident_t * to the real runtime function).

register_cubin.cpp itself defines REX_KMP_INTERNAL before including rex_kmp.h, which disables these macro rewrites inside the helper implementation. That prevents the helper from accidentally calling itself or shadowing the actual runtime API while it is implementing registration.

Header layering: generated code calls __tgt_* but rex_kmp.h rewrites those calls into rex wrappers, while register_cubin.cpp disables the macros internally.

Figure 3. The macro layer is not a gimmick. It keeps code generation simple, keeps ABI details centralized, and makes the hot path a direct call into the runtime after one-time registration.

Init/Fini: Why rex_offload_init() Is Explicit

Registration has to happen before the first offload call that expects a registered image. REX has two complementary mechanisms:

  1. Explicit init in main: the lowerer inserts a call to rex_offload_init() at the beginning of main.
  2. Safe wrappers exist: rex___tgt_target_kernel and friends can register lazily if a caller uses them.

In normal generated programs, the intended flow is:

  • rex_offload_init() runs once and registers the CUBIN,
  • the rest of the program uses the fast macro-rewritten rex_direct___tgt_* calls,
  • there is no per-call registration check on the hot path.

This is why omp_lowering.cpp prepends rex_offload_init() before all user statements in main: it keeps one-time image registration out of user timing initializers and out of the steady-state path.

Why Not Auto-Insert rex_offload_fini()?

You might expect the helper to also unregister at process exit; the conventional approach is a global destructor or an atexit() handler. REX intentionally avoids auto-teardown in the default standalone generated-program flow for two reasons:

  1. for short-running GPU benchmarks, explicit unregister can add a measurable fixed cost at program exit;
  2. for standalone processes, the OS and driver tear down device state on exit anyway.

The helper still provides rex_offload_fini() as an explicit opt-in teardown for callers that embed REX-lowered code inside a longer-lived process and need to reclaim resources before exit.

This is a pragmatic tradeoff: make the default path fast and predictable, while keeping explicit resource management available for the cases where it truly matters.

A Debugging Checklist When Offloading “Builds But Does Not Run”

Most runtime-glue failures fall into a few buckets. When an application compiles but offloading fails at runtime, check these invariants in order.

  1. Is the CUBIN present? The default name is rex_lib_nvidia.cubin (configurable via REX_CUBIN_NAME). If the file is missing at runtime, registration fails and offload calls typically return an error code.

  2. Does the host binary actually contain omp_offloading_entries? If the lowerer did not emit the offload entry table correctly (or the section got stripped), the start/stop symbols may point to an empty range.

  3. Are kernel names consistent? The offload entry uses the outlined kernel name string. If the device compilation changes name mangling unexpectedly, the runtime/plugin may fail to find the expected symbols in the image.

  4. Does init run before the first offload? Generated code relies on rex_offload_init() being inserted early. If a custom build drops that call or reorders it after timed initializers, the first __tgt_* call may see an unregistered image.

  5. Are map arrays sane? If registration is correct but execution is wrong, inspect the generated map arrays and __tgt_kernel_arguments packet. Many “CUDA error at runtime” reports are actually “wrong mapping kinds/sizes” failures that only surface when the runtime touches device memory.

These checks are also reflected in the lowering test suite. Several Rodinia-derived cases exist specifically to catch regressions in entry integrity and init ordering.

The Helper Boundary In One Sentence

If the previous post reduced the GPU lowerer to “turn OpenMP structure into a runtime protocol,” then this post reduces the helper boundary to:

Make that protocol linkable, registerable, and fast on the hot path by centralizing ABI details in helper files instead of scattering them across code generation.

The next post in the series is the natural continuation: how toolchain migrations and performance work forced us to tighten these boundaries, add invariants, and evolve the helper glue without requiring users to hand-edit generated code.