How REX Emits `omp_offloading_entries` And Keeps Kernel Identity Aligned

Posted on (Updated on )
REX’s GPU runtime path depends on an identity table before any kernel can launch. For every outlined device kernel, the lowerer emits a synthetic host-side char symbol and a __tgt_offload_entry object in the omp_offloading_entries section. The entry stores the address of that host key plus the device kernel name string. Later, register_cubin.cpp uses __start_omp_offloading_entries and __stop_omp_offloading_entries to bind that entry range to the standalone CUBIN image before calling __tgt_register_lib. The key invariant is simple: the host id symbol, offload entry name string, device kernel symbol, and launch host pointer must all describe the same outlined kernel.

The previous post in this series was a practical debugging guide for cases where REX GPU offloading builds but does not run. One of the recurring checks in that guide was:

prove that the host offload-entry table exists and that its kernel names match the generated device kernels.

This post explains that part directly.

The topic is narrow, but it is central to the whole runtime story. A REX-lowered program does not launch a device kernel merely because the generated host file contains a call to __tgt_target_kernel(...). That launch call needs a stable host-side identity, the runtime needs an entry table describing that identity, and the registered device image needs to carry matching device symbols.

That identity layer is the omp_offloading_entries table.

At a high level, REX does this for each generated GPU kernel:

  1. outline the device kernel with a generated name,
  2. create a small host-side key symbol for that kernel,
  3. create a __tgt_offload_entry object that connects the key to the kernel name,
  4. place that object into the omp_offloading_entries section,
  5. pass the section range to libomptarget when the CUBIN is registered,
  6. launch later using the same host key as the host_ptr.

If those pieces line up, multiple kernels and repeated launches work naturally. If they drift apart, the program can compile, link, and even register a CUBIN while still failing at runtime.

A diagram showing generated host code emitting synthetic id symbols and __tgt_offload_entry objects into omp_offloading_entries, with the linker producing start and stop symbols around the section.

Figure 1. REX emits one offload entry per outlined kernel. The entries live in omp_offloading_entries, and the linker-provided start and stop symbols turn that section into a runtime-visible table.

Why The Entry Table Exists

It is tempting to think of GPU offloading as a direct call:

1
launch this device function with these arguments

That is not how the LLVM OpenMP offloading runtime sees the world.

The host program passes a host-side pointer to the runtime. The runtime then uses the registered image descriptors and their entry tables to resolve that host identity to device code. The device image registration step says, in effect:

1
for this image, these host entries are the things you may launch

So the table is not optional glue. It is how the runtime learns which host identities correspond to device entries inside the registered image.

This distinction matters for REX because REX is a source-to-source compiler. It does not ask Clang to build the full native OpenMP offload bundle. It generates ordinary host source, generated device source, and helper files. That gives REX a lot of control, but it also means REX has to materialize the runtime contract explicitly.

The entry table is one of the places where that explicitness shows up.

The helper header rex_kmp.h defines the ABI shape REX needs:

1
2
3
4
5
6
7
struct __tgt_offload_entry {
  void *addr;
  char *name;
  size_t size;
  int32_t flags;
  int32_t reserved;
};

For a function entry, the important fields are addr and name.

The addr field is the host-side key. It is the pointer the runtime will later see at a launch site. The name field is the device-side symbol name that the runtime/plugin can use when resolving code in the registered image.

The remaining fields matter for the full ABI, but for the kernel identity story the core relationship is:

1
host pointer key -> offload-entry row -> device kernel name

That relationship is why REX cannot treat offload entries as incidental metadata. They are the bridge between generated host code and generated device code.

The Host Key Is A Synthetic Symbol, Not The Kernel Function

In native-looking source code, you might expect the host pointer to be the address of some outlined host stub function. REX instead uses a small synthetic char object as the key.

The lowerer creates a declaration like this conceptually:

1
char OUT__kernel_cpu__70__id__ = 0;

Then the generated offload entry stores its address:

1
2
3
4
5
6
7
8
9
struct __tgt_offload_entry
OUT__kernel_cpu__70__omp_offload_entry__
__attribute__((section("omp_offloading_entries"))) = {
  (void *)&OUT__kernel_cpu__70__id__,
  "OUT__kernel_cpu__70__kernel__",
  0,
  0,
  0
};

The exact generated stem varies by source location and lowering path, but the pattern is stable:

  • ...id__ is the host key,
  • ...omp_offload_entry__ is the table row,
  • ...kernel__ is the device kernel name.

Using a synthetic key has two practical advantages.

First, the host key is simple and stable. It does not depend on emitting a real host function body just to obtain an address. The key object exists only to give the runtime a unique address for one outlined kernel.

Second, it keeps identity separate from implementation. The host launch block can pass the key pointer while the actual implementation lives in the generated CUDA device translation unit. That is exactly the split REX wants: host code remains normal C/C++ after lowering, while the GPU body lives in rex_lib_<input>.cu.

Inside omp_lowering.cpp, the simplified construction looks like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
SgVariableDeclaration *outlined_kernel_id_decl =
    buildVariableDeclaration(func_name + "id__", buildCharType(),
                             buildAssignInitializer(buildIntVal(0)), g_scope);

SgExprListExp *offload_entry_parameters = buildExprListExp(
    buildCastExp(buildAddressOfOp(buildVarRefExp(outlined_kernel_id_decl)),
                 buildPointerType(buildVoidType())),
    buildStringVal(func_name + "kernel__"),
    buildIntVal(0),
    buildIntVal(0),
    buildIntVal(0));

That snippet captures the main design: the generated func_name is the seed for both sides of the identity. REX does not separately invent a host key name and a device name and hope they remain related. It derives both from the same lowering stem.

The Device Kernel Uses The Same Name Stem

The host entry is only half of the contract. The generated device file must contain a kernel with the name stored in the entry.

The lowerer calls the outliner with the same stem:

1
Outliner::generateFunction(body_block, func_name + "kernel__", all_syms, ...);

The generated device file then has the corresponding kernel symbol shape:

__global__ void OUT__kernel_cpu__70__kernel__(...) {
  ...
}

Again, the exact prefix differs by source and transformation context. The important invariant is that the string stored in the host offload entry matches the device kernel symbol emitted by the outliner.

That invariant is stricter than it may look.

It is not enough for the host and device sides to agree on “the first kernel” or “the kernel generated from this target region.” The runtime/plugin performs symbol resolution using concrete names and pointers. If a refactor changes one side of the naming scheme without changing the other, the code can still pass many earlier checks:

  • the lowered host source exists,
  • the CUBIN exists,
  • the offload-entry section is non-empty,
  • registration may appear to succeed.

The failure only shows up when a launch needs to resolve a host pointer to a device symbol.

That is why entry-name matching is one of the first runtime debugging checks. It is also why the entry emission code should stay close to the outliner naming convention. The more places that independently construct kernel names, the easier it is to introduce a split-brain identity bug.

A diagram showing one generated kernel stem flowing into a host id symbol, an offload entry name string, a device kernel symbol, and a launch host pointer.

Figure 2. The safest mental model is one generated kernel stem flowing into four runtime-visible places. If any one of these names or pointers is produced by a different rule, host/device identity can drift.

Why The Section Attribute Matters

Creating a __tgt_offload_entry object is not enough. The runtime helper does not search all global variables looking for entries. It receives a contiguous entry range.

REX makes that range by placing each entry object into the omp_offloading_entries section:

1
2
3
4
5
6
7
SgVariableDeclaration *offload_entry_decl = buildVariableDeclaration(
    func_name + "omp_offload_entry__", tgt_offload_entry->get_type(),
    buildAssignInitializer(buildBracedInitializer(offload_entry_parameters)),
    g_scope);

offload_entry_decl->get_decl_item(SgName(func_name + "omp_offload_entry__"))
    ->set_gnu_attribute_section_name("omp_offloading_entries");

At the generated C level, this becomes the familiar attribute form:

1
__attribute__((section("omp_offloading_entries")))

This is the part that turns many independent global objects into one table.

For a single-kernel test, it may feel like a detail. For real programs, it is the mechanism that lets the runtime see all kernels together. A program with two target regions should not require two separate registration paths. It should emit two entries into the same section, and the section range should cover both.

This also keeps repeated launches simple. The table describes what kernels exist; each launch only needs to pass the right host key. Nothing about the table has to be rebuilt for every call.

There is a useful separation of responsibilities here:

  • the lowerer emits one entry per kernel,
  • the linker groups those entries into a section,
  • register_cubin.cpp uses the section boundaries,
  • libomptarget records the mapping during registration,
  • launch calls later reuse the host key.

That separation is why the entry table is both static and dynamic. It is static because the generated entries are ordinary global data. It is dynamic because the runtime only learns about them when the CUBIN and entry range are registered together.

The Linker Boundary Symbols Are The Table API

The entry table becomes useful to REX through two boundary symbols:

1
2
extern struct __tgt_offload_entry __start_omp_offloading_entries;
extern struct __tgt_offload_entry __stop_omp_offloading_entries;

These symbols are not declared in any header or generated file. They are provided automatically by the linker whenever a section named omp_offloading_entries exists in the link. The linker creates __start_<section> and __stop_<section> symbols as the first and last addresses of that section’s contents. REX does not generate these declarations; it simply references them.

register_cubin.cpp does not need to know how many target regions the original program had. It does not need a compiler-generated count variable. It does not need a generated array initializer. It only needs the start and stop addresses for the section.

That keeps the helper generic.

For one kernel:

1
2
3
__start_omp_offloading_entries
  entry 0
__stop_omp_offloading_entries

For many kernels:

1
2
3
4
5
6
__start_omp_offloading_entries
  entry 0
  entry 1
  entry 2
  ...
__stop_omp_offloading_entries

The same registration code handles both cases.

That is important for REX because the helper file is shipped by the compiler and should not be manually tweaked by users. The compiler-generated host code may have one target region or many; the helper should not care.

This is also the reason a missing struct __tgt_offload_entry definition can break builds immediately. The generated host file emits complete objects of that struct type. A forward declaration is not enough. The header has to provide the full ABI struct definition before those globals are declared.

That sounds mundane, but it is the kind of issue that matters when moving across LLVM versions or refactoring helper files. If generated code contains:

1
struct __tgt_offload_entry OUT__...omp_offload_entry__ = { ... };

then the translation unit needs a complete definition of struct __tgt_offload_entry at that point. Otherwise the compiler is correct to reject it as an incomplete type.

Registration Joins The Entry Table To The CUBIN

The entry table by itself does not launch anything. The CUBIN by itself is also not enough for the OpenMP runtime path. Registration is where they are joined.

In register_cubin.cpp, REX builds a __tgt_device_image:

1
2
3
4
5
storage->device_image.ImageStart = storage->image.data();
storage->device_image.ImageEnd =
    storage->image.data() + storage->image.size();
storage->device_image.EntriesBegin = &__start_omp_offloading_entries;
storage->device_image.EntriesEnd = &__stop_omp_offloading_entries;

Then it builds a __tgt_bin_desc around that image:

1
2
3
4
storage->bin_desc.NumDeviceImages = 1;
storage->bin_desc.DeviceImages = &storage->device_image;
storage->bin_desc.HostEntriesBegin = &__start_omp_offloading_entries;
storage->bin_desc.HostEntriesEnd = &__stop_omp_offloading_entries;

Finally, it registers the descriptor:

1
__rex_real___tgt_register_lib(&storage->bin_desc);

This is the full relationship:

1
CUBIN bytes + offload-entry range -> __tgt_device_image -> __tgt_bin_desc

Notice that the entry range appears twice: once inside __tgt_device_image and once at the bin descriptor level. REX intentionally uses the same start and stop symbols in both places. That keeps the image-level and host-level view aligned.

A diagram showing a standalone CUBIN and the omp_offloading_entries section feeding into __tgt_device_image and __tgt_bin_desc, then into __tgt_register_lib.

Figure 3. Registration is where the standalone CUBIN becomes an OpenMP offload image. The image bytes and the host entry range must be packaged together before launches can resolve host keys to device kernels.

This is also why the helper keeps CUBIN storage alive after registration. The descriptor fields point into long-lived storage, and the runtime can consult that registered state later. The entry table itself lives in the host binary, but the image bytes live in helper-managed storage.

That division of lifetime is intentional:

  • host entry globals live for the process lifetime,
  • CUBIN bytes are owned by CubinStorage,
  • the descriptor points at both,
  • explicit teardown is available through rex_offload_fini() when needed.

The entry table does not replace CUBIN lifetime management, and CUBIN lifetime management does not replace the entry table. Both are required.

How Launch Sites Reuse The Same Identity

Once registration has happened, an actual launch still needs to pass the same host key used in the offload entry.

In the generated host launch block, the host_ptr argument to __tgt_target_kernel(...) is the address associated with the outlined kernel’s synthetic id symbol. The lowerer already used that same address in the entry table.

Conceptually:

1
2
3
4
5
6
7
void *host_ptr = (void *)&OUT__kernel_cpu__70__id__;

__tgt_target_kernel(device_id,
                    num_teams,
                    thread_limit,
                    host_ptr,
                    &kernel_args);

rex_kmp.h may rewrite the apparent call name so the generated code goes through REX’s wrapper or fast path, but the identity value remains the important part. The runtime sees a host pointer and consults the registered entry mapping.

This explains why repeated launches are cheap from an identity perspective. The table is registered once. Later calls pass the same pointer key again and again.

For a loop that launches the same kernel many times, REX should not generate new offload entries per call. It should generate one offload entry for the outlined kernel and reuse that identity at each launch site.

For a program with multiple kernels, REX should emit multiple entries and make each launch pass the key for the correct one.

Those two cases are exactly why tests should cover:

  • multiple generated kernels in one lowered program,
  • repeated calls to the same generated kernel,
  • target-data regions where mappings outlive individual launches,
  • and build/run cases where the CUBIN and host binary are regenerated together.

The entry-table design supports all of those as long as the compiler keeps the identity invariant.

What Can Go Wrong

Most offload-entry bugs fall into a small set of categories.

The first category is missing entry emission. The generated host file does not contain __tgt_offload_entry globals at all, or they are not placed into omp_offloading_entries. In that case, registration has no useful host entries to bind to the image.

The second category is incomplete ABI definitions. If struct __tgt_offload_entry is only forward-declared, the generated host file cannot define complete entry objects. That is a compile-time failure, not a runtime failure, but it often appears during helper/header refactors.

The third category is name drift. The host entry stores "OUT__...kernel__", but the generated device file contains a different symbol. This is a runtime identity failure.

The fourth category is host-pointer drift. The offload entry stores the address of one synthetic id symbol, but the launch call passes a different pointer. That is just as bad as a name mismatch, because the runtime lookup starts from the pointer it receives at launch.

The fifth category is stale artifacts. The host binary and CUBIN were built from different lowering outputs. The entry table may describe kernels from one generated program while the CUBIN contains kernels from another.

The practical checks are direct:

1
2
3
4
rg -n 'omp_offload_entry__|omp_offloading_entries|__id__' rose_*.c
rg -n 'kernel__|omp_offload_entry__' rose_*.c rex_lib_*.cu
readelf -sW ./your_binary | rg 'omp_offloading_entries|OUT__.*omp_offload_entry__'
readelf -S ./your_binary | rg 'omp_offloading_entries'

Those commands do not prove every map-array field is correct, but they quickly prove whether the identity layer is plausible.

How The Test Layer Should Guard This Contract

The best tests for this layer are structural invariants, not golden whole-file comparisons.

The exact generated symbol stem may change when source locations, outline counters, or formatting change. A test that depends on every character of a lowered file is brittle. But the actual contract can be checked with more stable questions:

  • does the generated host file contain exactly one offload entry per expected kernel?
  • does each entry live in omp_offloading_entries?
  • does each entry refer to a generated id symbol?
  • does the entry name string match a device kernel name?
  • does a repeated launch reuse the existing identity instead of creating a fresh entry?
  • does a multi-kernel program produce multiple entries in one section?

That is the style REX uses in its reduced Rodinia lowering tests. The verification script checks facts such as the count of host offload entries and the presence of expected generated structures rather than treating the entire lowered file as a golden artifact.

That choice matches the nature of this layer. Offload identity is an invariant. Formatting is not.

The test suite should also keep a runtime-oriented case in the loop. A structural check can prove the entry table exists, but only an execution test proves that the table, CUBIN, wrapper layer, and runtime plugin all agree.

That is why the entry-table story should be validated at three levels:

  • source invariant checks for generated host/device artifacts,
  • binary symbol checks when toolchain behavior changes,
  • end-to-end benchmark runs for actual runtime resolution.

No single layer catches every failure. Together they make the identity path much harder to break silently.

The Design Rule

The design rule for this part of REX is simple:

one outlined kernel gets one stable identity, and every generated artifact must use that same identity.

The host entry table is where that identity becomes runtime-visible. The CUBIN registration helper is where the table is attached to device code. The launch block is where the same host key is reused.

That makes omp_offloading_entries more than a section attribute. It is the connecting tissue between the source-to-source lowerer and the LLVM OpenMP offloading runtime.

When the entry table is right, REX can support multiple kernels, repeated launches, explicit CUBIN registration, and direct runtime calls without relying on Clang’s native offload bundling pipeline.

When it is wrong, everything after it becomes suspicious.

That is why this layer deserves its own post: it is small enough to overlook, but important enough that the entire GPU path depends on it.