How REX Registers CUBIN Images With `libomptarget`

Posted on Apr 10, 2026 (Updated on Apr 17, 2026)

REX does not embed GPU images into the host executable the way native Clang offloading typically does. It compiles the generated device translation unit into a standalone CUBIN file, loads that file at runtime in register_cubin.cpp, binds it to the host offload-entry table via __tgt_device_image and __tgt_bin_desc, and registers it with libomptarget through __tgt_register_lib. A small atomic state machine makes this one-time registration safe for repeated launches, while explicit rex_offload_init() keeps the slow path out of benchmark timing.

The previous post in this series focused on mapper expansion: how one source-level clause item can turn into many runtime map entries when declare mapper and array sections are involved.

This post stays at the runtime boundary instead.

Once REX has already:

outlined GPU kernels,
emitted host-side offload entries,
and generated a device translation unit such as rex_lib_<input>.cu,

there is still one more non-optional step before the program can actually offload:

the device image has to be registered with libomptarget.

That is the job of src/midend/programTransformation/ompLowering/register_cubin.cpp.

This helper is small, but it sits on an important fault line in the design.

If registration is wrong, the failure mode is usually not a nice compiler error. Instead, you get some version of:

the program builds but kernels do not launch,
the runtime cannot match a host entry to a device image,
or the first offload call pays a mysterious fixed cost at the wrong point in the benchmark.

So this post stays tightly focused on the registration path itself. It explains:

why REX uses a standalone CUBIN instead of an embedded device bundle,
how register_cubin.cpp reads the CUBIN and builds the ABI structs,
why the host offload-entry section and the device image must be registered together,
how the one-time registration state machine works,
why the helper keeps image bytes alive after calling __tgt_register_lib,
and why rex_offload_init() is inserted explicitly before timed regions instead of relying only on lazy wrappers.

Figure 1. Registration is the point where REX’s generated artifacts become one runtime-visible offload image. The CUBIN bytes and the host entry table have to be packaged together before `libomptarget` can launch anything. (Diagram: a generated rex_lib_nvidia.cubin file and the host offloading entry section feed register_cubin.cpp, which constructs __tgt_device_image and __tgt_bin_desc and then calls __tgt_register_lib in libomptarget.)

Why REX Uses A Standalone CUBIN At All

Native Clang offloading usually works by bundling device images into the host binary. That is a sensible design for an integrated compiler stack.

REX is a source-to-source compiler, and that changes the tradeoff.

By the time REX has finished lowering, the compiler has not produced one finished executable. It has produced a set of source artifacts:

a rewritten host file such as rose_<input>.c,
a generated device file such as rex_lib_<input>.cu,
and helper/runtime files such as register_cubin.cpp and rex_kmp.h.

That artifact model makes a standalone CUBIN a good fit.

Instead of trying to mimic Clang’s embedded bundle flow, REX can do something much simpler and more inspectable:

compile the generated device file into a CUBIN,
ship that CUBIN alongside the executable,
load it at runtime,
register it with the offloading runtime.

That design has several practical benefits.

First, the device image stays visible as a normal build artifact. A contributor can inspect it, replace it, or regenerate it without relinking the host binary.

Second, it fits the source-to-source debugging model better. When something goes wrong, you can inspect:

the lowered host file,
the generated device file,
and the compiled device image

as separate artifacts instead of peeling an embedded bundle back out of a binary.

Third, it keeps the helper layer explicit. register_cubin.cpp has to construct the runtime ABI in plain source code, which makes the registration contract readable instead of hidden behind toolchain magic.

The tradeoff is obvious: the CUBIN file must be present at runtime. But for REX, that is usually a worthwhile trade. The whole system already assumes an inspectable artifact pipeline.

Step 1: Read The CUBIN Into Long-Lived Storage

The first thing register_cubin.cpp does is read the CUBIN file from disk.

That starts with a small helper:

bool readFile(const char *filename, std::vector<unsigned char> &buffer) {
  FILE *file = fopen(filename, "rb");
  if (file == nullptr) {
    return false;
  }
  ...
  buffer.resize(static_cast<size_t>(file_size));
  size_t bytes_read =
      fread(buffer.data(), 1, static_cast<size_t>(file_size), file);
  fclose(file);
  return bytes_read == buffer.size();
}

This looks boring, and that is the right tone for it. Registration needs reliable bytes, not a clever abstraction.
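For readers who want the elided middle of that helper spelled out, here is a minimal sketch of a whole-file read. It is not the code from register_cubin.cpp: the name read_whole_file and the fseek/ftell sizing strategy are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the helper shown above. Reads the whole file into
// `buffer` and returns false on any I/O error so the caller can skip
// registration instead of handing the runtime a truncated image.
bool read_whole_file(const char *filename, std::vector<unsigned char> &buffer) {
  assert(filename != nullptr);
  std::FILE *file = std::fopen(filename, "rb");
  if (file == nullptr) {
    return false;
  }
  // Assumption: the size is found by seeking to the end and asking for the
  // offset. ftell reports a negative value on failure.
  if (std::fseek(file, 0, SEEK_END) != 0) {
    std::fclose(file);
    return false;
  }
  const long file_size = std::ftell(file);
  if (file_size < 0 || std::fseek(file, 0, SEEK_SET) != 0) {
    std::fclose(file);
    return false;
  }
  buffer.resize(static_cast<std::size_t>(file_size));
  std::size_t bytes_read = 0;
  if (!buffer.empty()) {
    bytes_read = std::fread(buffer.data(), 1, buffer.size(), file);
  }
  std::fclose(file);
  // A short read is treated as failure, same as in the original helper.
  return bytes_read == buffer.size();
}
```

A caller would treat a false return as "no image available" and refuse to register. Whether an empty file should also count as failure is a policy choice this sketch leaves open.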

But the more important design point is where those bytes live afterward.

The helper does not read the image into a short-lived local buffer, register it, and let the buffer go out of scope. Instead it stores the image inside a long-lived CubinStorage object:

struct CubinStorage {
  std::vector<unsigned char> image;
  __tgt_device_image device_image{};
  __tgt_bin_desc bin_desc{};
};

and the file-level state keeps one owning instance:

std::unique_ptr<CubinStorage> cubin_storage;

That ownership model is essential because the descriptors hold pointers, not copies. Registration does not copy the image bytes out of the buffer; the helper hands the runtime addresses inside that stored image.

So the CUBIN bytes are not just input to registration. They become part of the runtime-visible state that must remain valid after __tgt_register_lib(...) returns.

Step 2: Build `__tgt_device_image` From The CUBIN And Entry Section

Once the bytes exist, register_cubin_internal(...) turns them into a runtime device-image descriptor:

storage->device_image.ImageStart = storage->image.data();
storage->device_image.ImageEnd =
    storage->image.data() + storage->image.size();
storage->device_image.EntriesBegin = &__start_omp_offloading_entries;
storage->device_image.EntriesEnd = &__stop_omp_offloading_entries;

This is the point where two independent generated artifacts are joined:

the raw device code bytes from rex_lib_nvidia.cubin,
and the host offload-entry table emitted into omp_offloading_entries.

That second part is easy to miss if you only think in terms of “load the CUBIN and register it.” The runtime needs more than image bytes. It also needs the entry table that says which host-visible offload entries belong to that image.

The boundary symbols:

__start_omp_offloading_entries
__stop_omp_offloading_entries

come from the linked host binary and delimit that table.

So __tgt_device_image in REX is not “just the image.” It is the image plus the entry range that tells libomptarget how to match host launch identities to device code inside that image.
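Where do boundary symbols like these come from? On ELF targets, GNU ld and lld define `__start_<name>` and `__stop_<name>` for any section whose name is a valid C identifier. The sketch below reproduces that mechanism with a made-up section called demo_entries and a made-up DemoEntry type. Both are assumptions for illustration; the real offload-entry layout is owned by libomptarget and is not reproduced here.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical entry type, not the libomptarget layout.
struct DemoEntry {
  const char *name;
  void *addr;
};

// Each entry is placed in the same named section. `used` keeps the compiler
// from discarding objects that nothing references directly.
static DemoEntry demo_entry_a __attribute__((used, section("demo_entries"))) =
    {"kernel_a", nullptr};
static DemoEntry demo_entry_b __attribute__((used, section("demo_entries"))) =
    {"kernel_b", nullptr};

// The linker defines these two symbols at the start and end of the section.
extern "C" {
extern DemoEntry __start_demo_entries[];
extern DemoEntry __stop_demo_entries[];
}

// Number of entries the linker collected into the section.
std::size_t count_demo_entries() {
  return static_cast<std::size_t>(__stop_demo_entries - __start_demo_entries);
}

// Walk the table the way a runtime might when matching a host identity.
const DemoEntry *find_demo_entry(const char *name) {
  for (const DemoEntry *e = __start_demo_entries; e != __stop_demo_entries; ++e) {
    if (std::strcmp(e->name, name) == 0) {
      return e;
    }
  }
  return nullptr;
}
```

The point of the sketch is that the table has no length field and no registry object. Its extent is defined entirely by the two symbols, which is why the same pair has to be passed wherever the runtime needs the entry range.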

This is why registration is not interchangeable with just cuModuleLoad() or some CUDA-only loader. REX is not registering a bare CUDA module for its own private use. It is registering an OpenMP offload image inside the LLVM runtime’s ABI model.

Step 3: Wrap The Image In `__tgt_bin_desc`

After __tgt_device_image is filled in, the helper wraps it in a __tgt_bin_desc:

storage->bin_desc.NumDeviceImages = 1;
storage->bin_desc.DeviceImages = &storage->device_image;
storage->bin_desc.HostEntriesBegin = &__start_omp_offloading_entries;
storage->bin_desc.HostEntriesEnd = &__stop_omp_offloading_entries;

The shape is deliberately explicit.

REX is telling the runtime:

this registration contributes exactly one device image,
here is the address range of that image,
and here is the host entry range that belongs to it.

That explicitness is one of the strengths of the helper layer. The runtime ABI is small enough that REX can express it directly in generated-compatible code instead of depending on toolchain-specific helper headers or hidden bundling steps.

It also makes the invariants obvious.

For REX’s current flow, these invariants are:

one translated program registers one image object per process,
that image object points at one CUBIN buffer,
and the host entry range exposed in the binary is the same range used in both __tgt_device_image and __tgt_bin_desc.

If those two entry ranges ever diverged, host/device matching would stop being trustworthy even if the image bytes themselves were fine.
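Those invariants are easy to state as a check. The following is a sketch only: the Demo* structs borrow the field names used in this post, but their exact types are assumptions, and descriptors_are_consistent is a hypothetical helper that does not exist in register_cubin.cpp.

```cpp
#include <cstdint>

// Stand-in types. Field names follow the post; types are assumed.
struct DemoEntry {
  const char *name;
  void *addr;
};
struct DemoDeviceImage {
  void *ImageStart;
  void *ImageEnd;
  DemoEntry *EntriesBegin;
  DemoEntry *EntriesEnd;
};
struct DemoBinDesc {
  std::int32_t NumDeviceImages;
  DemoDeviceImage *DeviceImages;
  DemoEntry *HostEntriesBegin;
  DemoEntry *HostEntriesEnd;
};

// Returns true when the descriptor satisfies the three invariants above:
// exactly one image, a non-empty byte range, and matching entry ranges.
bool descriptors_are_consistent(const DemoBinDesc &desc) {
  if (desc.NumDeviceImages != 1 || desc.DeviceImages == nullptr) {
    return false;
  }
  const DemoDeviceImage &image = desc.DeviceImages[0];
  const auto *start = static_cast<const unsigned char *>(image.ImageStart);
  const auto *end = static_cast<const unsigned char *>(image.ImageEnd);
  if (start == nullptr || start >= end) {
    return false;
  }
  // The device image and the bin descriptor must name the same entry table.
  return image.EntriesBegin == desc.HostEntriesBegin &&
         image.EntriesEnd == desc.HostEntriesEnd;
}
```

A check like this would only make sense as a debug assertion before the registration call. It cannot prove the image is valid, only that the descriptors agree with each other.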

Figure 2. `register_cubin.cpp` does not build temporary descriptors and throw them away. `CubinStorage` owns both the raw image bytes and the descriptors that point into those bytes, so the runtime never sees dangling pointers. (Diagram: CubinStorage owns a vector of image bytes, a __tgt_device_image pointing into that byte buffer and to the start and stop offloading entry symbols, and a __tgt_bin_desc pointing at the device image. The whole structure remains alive after __tgt_register_lib returns.)

Step 4: Register Once, Then Keep The Storage Alive

Once the descriptors are built, register_cubin_internal(...) calls the real runtime function:

__rex_real___tgt_register_lib(&storage->bin_desc);
cubin_storage = std::move(storage);
return &cubin_storage->bin_desc;

The ordering here matters.

The helper first prepares a fully populated temporary storage object. Then it calls the real registration function. Then it transfers ownership into the long-lived cubin_storage.

That final move is what keeps the image and descriptors alive for the rest of the program’s offloading lifetime.

This is the place where a lower-quality implementation would usually go wrong.

It would be tempting to do something like:

read file into temporary buffer
build local descriptors
__tgt_register_lib(&desc)
return

But that would leave ImageStart and ImageEnd, along with the descriptor pointers handed to the runtime, pointing into storage that no longer exists once the function returns.

REX avoids that bug by making the ownership model explicit. The descriptors are not ephemeral call arguments. They are part of the helper’s registered state.

That is also why the helper keeps unregister_cubin_internal() around. If the code explicitly unregisters later, it can then safely destroy cubin_storage. Before unregister, it cannot.
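The register-then-move ordering can be sketched in isolation. Everything named Demo* or demo_* here is a hypothetical stand-in, including the fake runtime that simply remembers the pointer it was given. What the sketch relies on is a standard C++ guarantee: moving a std::unique_ptr transfers ownership without relocating the heap object, so addresses taken before the move stay valid after it.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical descriptor: just a byte range.
struct DemoImage {
  const unsigned char *start = nullptr;
  const unsigned char *end = nullptr;
};

// Owns the bytes and the descriptor that points into them.
struct DemoStorage {
  std::vector<unsigned char> bytes;
  DemoImage image;
};

// Stand-in for the runtime: it keeps the pointer and copies nothing.
static const DemoImage *g_registered = nullptr;
void demo_register(const DemoImage *image) { g_registered = image; }

// Long-lived owner, analogous in role to the file-level cubin_storage.
static std::unique_ptr<DemoStorage> g_storage;

const DemoImage *register_and_keep(std::vector<unsigned char> bytes) {
  auto storage = std::make_unique<DemoStorage>();
  storage->bytes = std::move(bytes);
  storage->image.start = storage->bytes.data();
  storage->image.end = storage->bytes.data() + storage->bytes.size();
  demo_register(&storage->image);
  // Ownership moves to the long-lived slot. The DemoStorage object itself
  // does not move, so the pointer the fake runtime holds is still valid.
  g_storage = std::move(storage);
  return &g_storage->image;
}
```

Had DemoStorage been a local value instead of a heap object behind a unique_ptr, the same function would return with the fake runtime holding an address inside a destroyed object.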

Step 5: Use An Atomic State Machine For One-Time Registration

Registration should happen once per process, but the helper still has to behave correctly if multiple offload call sites or threads reach it.

That is what ensure_cubin_registered(...) handles.

The implementation uses a small atomic state machine:

enum RegistrationState {
  kUnregistered = 0,
  kBusy = 1,
  kRegistered = 2,
};

The logic has two layers.

The fast path is trivial:

int state = __atomic_load_n(&registration_state, __ATOMIC_ACQUIRE);
if (state == kRegistered) {
  return cubin_storage == nullptr ? nullptr : &cubin_storage->bin_desc;
}

If registration already finished, callers immediately reuse the cached descriptor.

The slow path tries to elect one registering thread:

while (true) {
  int state = __atomic_load_n(&registration_state, __ATOMIC_ACQUIRE);
  if (state == kRegistered) {
    return cubin_storage == nullptr ? nullptr : &cubin_storage->bin_desc;
  }
  if (state == kUnregistered &&
      __atomic_compare_exchange_n(&registration_state, &state, kBusy, false,
                                  __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE)) {
    struct __tgt_bin_desc *desc = register_cubin_internal(filename);
    if (desc != nullptr) {
      __atomic_store_n(&registration_state, kRegistered, __ATOMIC_RELEASE);
    } else {
      __atomic_store_n(&registration_state, kUnregistered, __ATOMIC_RELEASE);
      return nullptr;
    }
  } else {
    sched_yield();
  }
}

One thread flips the state from kUnregistered to kBusy, performs the file read and runtime registration, and then publishes kRegistered.

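Threads that lose the election call sched_yield() and retry. They normally leave the loop once they observe kRegistered. If the winning thread fails and resets the state to kUnregistered, one of the waiters can win the next election and attempt registration itself.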

This is a simple design, but it is exactly the kind of simplicity that works well in helper code:

no duplicate registrations,
no partially initialized shared state becoming visible,
and no need for heavier-weight locking just to serialize a one-time action.

The important thing is not that it uses atomics for style points. The important thing is that the helper has a clear answer to a real concurrency problem.
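To see the election behave as described, here is a self-contained sketch of the same three-state pattern. It swaps the GCC __atomic builtins for std::atomic and sched_yield() for std::this_thread::yield(), and it replaces file loading and runtime registration with a counter. All demo_* names are hypothetical.

```cpp
#include <atomic>
#include <thread>
#include <vector>

enum DemoState : int { kDemoUnregistered = 0, kDemoBusy = 1, kDemoRegistered = 2 };

static std::atomic<int> demo_state{kDemoUnregistered};
static std::atomic<int> demo_slow_path_runs{0};

// Stand-in for "read the file and call the runtime". Always succeeds here.
static bool demo_register_once() {
  demo_slow_path_runs.fetch_add(1, std::memory_order_relaxed);
  return true;
}

bool demo_ensure_registered() {
  while (true) {
    int state = demo_state.load(std::memory_order_acquire);
    if (state == kDemoRegistered) {
      return true;  // fast path: someone already finished
    }
    if (state == kDemoUnregistered &&
        demo_state.compare_exchange_strong(state, kDemoBusy,
                                           std::memory_order_acq_rel,
                                           std::memory_order_acquire)) {
      // This thread won the election and owns the slow path.
      if (!demo_register_once()) {
        demo_state.store(kDemoUnregistered, std::memory_order_release);
        return false;
      }
      demo_state.store(kDemoRegistered, std::memory_order_release);
      return true;
    }
    std::this_thread::yield();  // lost the election or saw kDemoBusy
  }
}

// Launches `thread_count` racing callers and reports how many times the
// slow path actually ran.
int race_demo(int thread_count) {
  std::vector<std::thread> threads;
  for (int i = 0; i < thread_count; ++i) {
    threads.emplace_back([] { demo_ensure_registered(); });
  }
  for (std::thread &t : threads) {
    t.join();
  }
  return demo_slow_path_runs.load();
}
```

However many threads race, only one compare-exchange can move the state out of kDemoUnregistered, so the slow-path counter ends at one. The release store of kDemoRegistered paired with the acquire load on the fast path is what makes the winner's writes visible to everyone else.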

Figure 3. Registration is not performed optimistically at every call site. A small atomic state machine makes one-time image registration safe while keeping the common case fast. (Diagram: a state machine with kUnregistered, kBusy, and kRegistered. One thread moves from unregistered to busy, performs file load and __tgt_register_lib, then publishes registered. Other threads either see registered immediately or spin while busy until registration completes.)

Step 6: Public Entry Points Layer Policy On Top Of The Core Helper

The helper’s public surface is intentionally small:

register_cubin(...)
rex_offload_init()
rex_offload_fini()

plus the safe runtime-call wrappers that first make sure registration exists.

The public register_cubin(...) mostly adds default filename handling and publishes the returned descriptor into __cubin_desc:

const char *cubin_name = filename == nullptr ? REX_CUBIN_NAME : filename;
struct __tgt_bin_desc *desc = ensure_cubin_registered(cubin_name);
__atomic_store_n(&__cubin_desc, desc, __ATOMIC_RELEASE);
return desc;

rex_offload_init() is intentionally tiny:

void rex_offload_init(void) { (void)register_cubin(REX_CUBIN_NAME); }

That tells you something important about the design. The helper does not want several competing initialization systems. It wants one central registration path, and rex_offload_init() is just the explicit “do it now” front door to that path.

rex_offload_fini() is equally small:

unregister_cubin_internal();
__atomic_store_n(&__cubin_desc, nullptr, __ATOMIC_RELEASE);
__atomic_store_n(&registration_state, kUnregistered, __ATOMIC_RELEASE);

The comment above it is pragmatic and worth taking seriously: standalone generated programs usually rely on process exit for teardown, and explicit unregister exists mostly for longer-lived embedding scenarios.

That choice matches the benchmark-oriented environment REX often runs in. The common case wants predictable startup and minimal exit overhead. The helper still preserves explicit cleanup when someone truly needs it.

Why `rex_offload_init()` Is Explicit Instead Of Purely Lazy

You could imagine relying entirely on the safe wrappers such as rex___tgt_target_kernel(...) and friends, each of which does:

if (register_cubin(REX_CUBIN_NAME) == nullptr) {
  return -1;
}

That would work for correctness. It is not the preferred generated-program path.

REX inserts rex_offload_init() near the start of main so registration happens before timed benchmark code begins running.

That matters because the slow path includes:

file I/O,
descriptor construction,
and the call into libomptarget.

All of that is real work. If the first kernel launch were also the first registration point, short-running programs would pay that startup cost inside whatever region they happen to measure first.

The lowerer and the test suite both treat this as important behavior rather than as an incidental detail. The reduced Rodinia-style lowering invariants explicitly check that rex_offload_init() appears exactly once and, in cases with timing variables, appears before the timed declaration that should remain outside registration overhead.

So the design is:

keep lazy registration in the safe wrappers as a correctness backstop,
but generate explicit eager init in normal standalone programs so timing stays fair and the hot path stays clean.

That is a pragmatic runtime policy, not just a coding preference.
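The effect of that policy on a measurement can be shown with a toy model. DemoRuntime, TimedResult, and run_timed_region are hypothetical names, and the slow path is reduced to a counter, so this sketch demonstrates where the one-time work lands rather than how long it takes.

```cpp
#include <chrono>

// Hypothetical stand-in for the helper layer: one slow path, two ways in.
struct DemoRuntime {
  bool registered = false;
  int slow_path_runs = 0;

  void ensure_registered() {
    if (!registered) {
      ++slow_path_runs;  // file I/O and runtime registration would go here
      registered = true;
    }
  }
  void init() { ensure_registered(); }    // explicit eager entry point
  void launch() { ensure_registered(); }  // lazy backstop before a launch
};

struct TimedResult {
  int slow_runs_in_timed_region;
  std::chrono::steady_clock::duration elapsed;
};

TimedResult run_timed_region(bool eager_init) {
  DemoRuntime runtime;
  if (eager_init) {
    runtime.init();  // pay the one-time cost before the clock starts
  }
  const int runs_before = runtime.slow_path_runs;
  const auto start = std::chrono::steady_clock::now();
  runtime.launch();
  runtime.launch();
  const auto stop = std::chrono::steady_clock::now();
  return {runtime.slow_path_runs - runs_before, stop - start};
}
```

With eager_init set, no slow-path run falls between the two clock reads. Without it, the first launch carries the registration cost into the measured interval, which is exactly the distortion the generated rex_offload_init() call is there to prevent.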

What Current Tests Prove About This Path

There is not a single isolated unit test that only exercises register_cubin.cpp in a vacuum.

That is honest, and it is fine.

The registration path is currently covered in two broader but meaningful ways.

First, lowering invariant tests check the generated source structure around initialization:

rex_offload_init() appears exactly once,
and in timing-sensitive cases it appears before timed declarations instead of being dropped into the measured region.

Second, every real GPU benchmark run is implicitly exercising the registration contract end to end:

the host entry table exists,
the CUBIN is found,
the image is registered,
and kernels launch through libomptarget.

That means the helper is already covered by the test layers that matter most for this kind of code:

structural lowering checks for init placement,
and end-to-end offload execution for actual runtime behavior.

There is still room for tighter direct coverage in the future, especially around explicit failure modes such as missing CUBIN files or explicit unregister/re-register flows. But the current validation story is not “untested helper magic.” It is integrated runtime-path coverage, which is often the more relevant thing here.
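As a rough idea of what such direct coverage might look like, here is a single-threaded sketch of a missing-file and re-register check. It does not call the real helper. demo_try_register and demo_unregister are hypothetical stand-ins that model only the state transitions described earlier, with "can the file be opened" standing in for the whole registration.

```cpp
#include <cassert>
#include <cstdio>

enum DemoState { kDemoUnregistered, kDemoBusy, kDemoRegistered };
static DemoState demo_state = kDemoUnregistered;

// Hypothetical stand-in: registration succeeds only if the file opens.
bool demo_try_register(const char *filename) {
  if (demo_state == kDemoRegistered) {
    return true;
  }
  demo_state = kDemoBusy;
  std::FILE *file = std::fopen(filename, "rb");
  if (file == nullptr) {
    demo_state = kDemoUnregistered;  // roll back so a later call can retry
    return false;
  }
  std::fclose(file);
  demo_state = kDemoRegistered;
  return true;
}

void demo_unregister() { demo_state = kDemoUnregistered; }

// Exercises: missing file fails cleanly, a later attempt succeeds once the
// file exists, and unregister followed by register works again.
bool check_missing_then_present(const char *path) {
  assert(path != nullptr);
  std::remove(path);  // make sure the file starts out missing
  if (demo_try_register(path) || demo_state != kDemoUnregistered) {
    return false;
  }
  std::FILE *file = std::fopen(path, "wb");
  if (file == nullptr) {
    return false;
  }
  std::fputc(0, file);
  std::fclose(file);
  bool ok = demo_try_register(path) && demo_state == kDemoRegistered;
  demo_unregister();
  ok = ok && demo_state == kDemoUnregistered && demo_try_register(path);
  demo_unregister();
  std::remove(path);
  return ok;
}
```

A real test would link against the helper and a runtime, which is a much heavier setup. The sketch only pins down the observable contract such a test would assert: failure leaves the state retryable, and unregister returns it to the starting point.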

Why This Registration Design Fits REX

The best way to understand this helper is not as a workaround for missing toolchain features. It is an architectural fit for a source-to-source compiler.

REX already chooses to keep artifacts explicit:

host file,
device file,
helper layer,
downstream build.

A standalone CUBIN plus explicit registration follows the same philosophy.

It keeps the runtime boundary visible, makes the ABI structs explicit, and lets REX control when registration happens without forcing the lowerer to encode runtime details at every call site.

The helper also draws a clean ownership line:

the lowerer emits offload entries and inserts eager init,
the device compiler produces the CUBIN,
register_cubin.cpp turns those artifacts into runtime-visible state,
and libomptarget consumes that state for the actual launches.

That separation is exactly why the system remains debuggable even when LLVM’s offload ABI changes or performance work forces the helper layer to evolve.

Closing

register_cubin.cpp is only a few pieces of code:

read the file,
build the descriptors,
register once,
keep the storage alive,
tear down only when explicitly asked.

But those few pieces connect almost every artifact REX emits on the GPU path.

Without them, the generated host file and generated device file would still exist, but they would not yet form one runtime offload image.

That is why this helper matters. It is the moment where REX’s source-level lowering artifacts stop being separate files on disk and become a launchable image inside libomptarget.