How REX Completed Direct __tgt_target_kernel Lowering And Repaired The Device ABI

Posted on Apr 25, 2026 (Updated on May 4, 2026)

Moving REX from legacy __tgt_target_teams lowering to direct __tgt_target_kernel was not a one-line host-call replacement. LLVM’s modern OpenMP kernel-launch path changes the whole contract: the host builds a __tgt_kernel_arguments packet, the runtime prepares the device argument list, and the device entry receives a hidden launch-environment slot before user arguments. REX initially built and launched but produced wrong gaussian output because the host and device disagreed about that argument layout. The fix was to complete the migration end to end: emit the kernel-args struct, call __tgt_target_kernel, prepend __rex_kernel_launch_env to generated CUDA kernels, transport literal scalars through pointer-sized slots, and reconstruct typed locals inside the kernel body.

The previous post isolated one part of the modern OpenMP launch ABI: literal scalar target parameters. Eligible scalar inputs should not be described like address-based mapped storage. They should be carried as literal value slots, with matching host packing, map types, and device unpacking.

That scalar work was necessary, but it was not the whole launch migration.

The larger change was moving REX from the old __tgt_target_teams style to direct __tgt_target_kernel lowering.

At first glance, that sounds like a host-side runtime call change:

old: __tgt_target_teams(...)
new: __tgt_target_kernel(...)

That interpretation is incomplete.

__tgt_target_kernel is not just a differently named entry point. In the LLVM offload runtime path REX was targeting, it comes with a different launch packet and a different device-entry argument layout. If the generated host code moves to the new API but the generated CUDA kernel still expects the old parameter list, the program can compile, register the cubin, launch the kernel, and still compute wrong answers.

That is exactly what happened during the migration.

gaussian exposed it.

Figure 1. The direct kernel API is a host/device contract. Replacing the host call without repairing the device signature only moves the mismatch later. (Diagram: replacing the host runtime call is only one part of the ABI; the host kernel argument packet and the device kernel signature must also change.)

Why Gaussian Was The Right Probe

gaussian was a useful failure case because it has several properties that make ABI mistakes obvious:

it has multiple kernels;
the kernels are launched repeatedly;
the hot kernels carry scalar loop-control values;
the same kernels also carry device pointers;
wrong scalar values corrupt matrix updates visibly.

The key scalar parameters were values such as Size and t. They are not optional metadata. They control loop bounds and indexing. If the device kernel reads the wrong argument slot for Size, it does not merely slow down. It computes the wrong matrix.

That made the failure mode sharp. The REX binary built successfully. The runtime launch did not immediately crash. But the output differed from native LLVM because the device entry interpreted its parameters under the wrong ABI.

That distinction matters. A compiler migration can fail in several ways:

compile-time failure: generated C/CUDA does not compile;
registration failure: the runtime cannot find or register the image;
launch failure: the runtime cannot launch the entry;
ABI failure: the launch succeeds but the kernel reads the wrong values.

This was the fourth case. It is more dangerous than the first three because it can look like an algorithm or floating-point issue until the generated kernel signatures are compared directly.

The Old Shape And The New Shape

The old REX path was still conceptually close to a direct CUDA-style call shape. The generated host code built arrays of arguments and then called a legacy target-teams entry point:

__tgt_target_teams(__device_id, __host_ptr,
                   __arg_num, __args_base, __args,
                   __arg_sizes, __arg_types,
                   _num_blocks_, _threads_per_block_);

That call shape carries argument arrays directly through the old entry point.

The direct kernel path carries the same logical information through a structured packet:

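struct __tgt_kernel_arguments __kernel_args = {
  3,                           /* Version */
  __arg_num,                   /* NumArgs */
  __args_base,                 /* ArgsBase */
  __args,                      /* Args */
  __arg_sizes,                 /* ArgSizes */
  __arg_types,                 /* ArgTypes */
  (void **)0,                  /* ArgNames: none */
  (void **)0,                  /* ArgMappers: none */
  (int64_t)__rex_tripcount,    /* Tripcount */
  0LL,                         /* Flags */
  {_num_blocks_, 1, 1},        /* Teams: one-dimensional grid */
  {_threads_per_block_, 1, 1}, /* Threads: one-dimensional block */
  0                            /* DynCGroupMem */
};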

__tgt_target_kernel(__device_id, _num_blocks_, _threads_per_block_,
                    __host_ptr, &__kernel_args);

That is the host-visible part of the migration. REX now builds that packet through buildTargetKernelArgsDeclaration(...) and emits the direct call from the target lowering path.

The struct layout in rex_kmp.h mirrors the runtime contract:

struct __tgt_kernel_arguments {
  int32_t Version;
  int32_t NumArgs;
  void **ArgsBase;
  void **Args;
  int64_t *ArgSizes;
  int64_t *ArgTypes;
  void **ArgNames;
  void **ArgMappers;
  int64_t Tripcount;
  int64_t Flags;
  int32_t Teams[3];
  int32_t Threads[3];
  int32_t DynCGroupMem;
};

The important field for this post is Version. REX emits version 3, matching the modern kernel-argument form used by the LLVM path under discussion. That version tells the runtime how to prepare the kernel argument list.

And that preparation changes what the device sees.
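The packet described in this section can be sketched as a small host-only C fragment. It is an illustration under assumptions, not REX's generated code: the struct is renamed `kernel_args_sketch` to avoid clashing with the real header, the helper `make_kernel_args` and its parameters are hypothetical, and the field values simply follow the initializer quoted earlier (version 3, no names or mappers, a one-dimensional launch).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order as the rex_kmp.h layout quoted in this post.
   The type name is hypothetical so it cannot clash with the real header. */
struct kernel_args_sketch {
  int32_t Version;
  int32_t NumArgs;
  void **ArgsBase;
  void **Args;
  int64_t *ArgSizes;
  int64_t *ArgTypes;
  void **ArgNames;
  void **ArgMappers;
  int64_t Tripcount;
  int64_t Flags;
  int32_t Teams[3];
  int32_t Threads[3];
  int32_t DynCGroupMem;
};

/* The two 32-bit header fields are followed directly by the first pointer. */
static_assert(offsetof(struct kernel_args_sketch, ArgsBase) == 8,
              "Version and NumArgs must occupy the first eight bytes");

/* Hypothetical helper: fill the packet for a one-dimensional launch. */
struct kernel_args_sketch make_kernel_args(int32_t num_args, void **args_base,
                                           void **args, int64_t *arg_sizes,
                                           int64_t *arg_types,
                                           int64_t tripcount,
                                           int32_t num_blocks,
                                           int32_t threads_per_block) {
  struct kernel_args_sketch ka = {0}; /* ArgNames, ArgMappers, Flags stay 0 */
  ka.Version = 3;                     /* the version this post says REX emits */
  ka.NumArgs = num_args;
  ka.ArgsBase = args_base;
  ka.Args = args;
  ka.ArgSizes = arg_sizes;
  ka.ArgTypes = arg_types;
  ka.Tripcount = tripcount;
  ka.Teams[0] = num_blocks;
  ka.Teams[1] = 1;
  ka.Teams[2] = 1;
  ka.Threads[0] = threads_per_block;
  ka.Threads[1] = 1;
  ka.Threads[2] = 1;
  return ka;
}
```

Keeping the packet construction in one place like this makes it easier to check that the version number and the field order change together.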

The Hidden Launch-Environment Slot

The key observation came from comparing native LLVM PTX with REX PTX.

Native LLVM’s device entry for a gaussian kernel did not start with the user scalar parameters. It started with a hidden launch-environment pointer:

.entry __omp_offloading_..._Fan1_l177(
  .param .u64 .ptr .align 1 param_0,
  .param .u64 param_1,
  .param .u64 param_2,
  .param .u64 .ptr .align 1 param_3,
  .param .u64 .ptr .align 1 param_4
)

The first slot is not Size. It is the launch-environment slot. The next two slots are pointer-sized scalar literal transports. The final slots are device pointers.

The broken intermediate REX kernel had the old shape:

.entry OUT__...Fan1__kernel__(
  .param .u32 param_0,
  .param .u32 param_1,
  .param .u64 .ptr .align 1 param_2,
  .param .u64 .ptr .align 1 param_3
)

That is a slot shift. The host and runtime are preparing:

slot 0: hidden launch environment
slot 1: Size bits
slot 2: t bits
slot 3: m pointer
slot 4: a pointer

but the REX device kernel is reading:

slot 0: Size
slot 1: t
slot 2: m pointer
slot 3: a pointer

Once written down, the wrong-output symptom is no longer mysterious. Size and t are read from the wrong slots, so loop bounds and memory indexing are wrong.

Figure 2. The bad migration was a classic ABI slot-shift bug: the host prepared five slots while the device entry expected four. (Diagram: the runtime argument slots, including the hidden launch-environment slot, next to the old REX device signature that is missing that slot.)
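The slot shift can be reproduced on the host with a small model. This is a simplified sketch, not the actual CUDA parameter buffer: it treats every argument as one 64-bit slot and models slot positions only. All names here (`prepare_slots`, `decode_old`, `decode_repaired`, `decoded_args`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { NUM_SLOTS = 5 };

static_assert(sizeof(int) <= sizeof(uint64_t), "int must fit in one slot");

/* What a kernel believes its arguments are after decoding the slots. */
struct decoded_args {
  int size;
  int t;
  uint64_t m_bits;
  uint64_t a_bits;
};

/* Store an int's raw bytes at the start of a zeroed slot. */
uint64_t int_as_slot(int v) {
  uint64_t slot = 0;
  memcpy(&slot, &v, sizeof v);
  return slot;
}

/* Read an int back from the start of a slot. */
int slot_as_int(uint64_t slot) {
  int v;
  memcpy(&v, &slot, sizeof v);
  return v;
}

/* Runtime-prepared list: hidden launch environment first, then user args. */
void prepare_slots(uint64_t slots[NUM_SLOTS], uint64_t env, int size, int t,
                   uint64_t m, uint64_t a) {
  slots[0] = env;
  slots[1] = int_as_slot(size);
  slots[2] = int_as_slot(t);
  slots[3] = m;
  slots[4] = a;
}

/* Old device signature: no hidden slot, so decoding starts at slot 0. */
struct decoded_args decode_old(const uint64_t slots[NUM_SLOTS]) {
  struct decoded_args d;
  d.size = slot_as_int(slots[0]); /* actually the launch environment */
  d.t = slot_as_int(slots[1]);    /* actually Size */
  d.m_bits = slots[2];            /* actually t */
  d.a_bits = slots[3];            /* actually the m pointer */
  return d;
}

/* Repaired device signature: slot 0 is skipped as the hidden parameter. */
struct decoded_args decode_repaired(const uint64_t slots[NUM_SLOTS]) {
  struct decoded_args d;
  d.size = slot_as_int(slots[1]);
  d.t = slot_as_int(slots[2]);
  d.m_bits = slots[3];
  d.a_bits = slots[4];
  return d;
}
```

With the old decoder, the value read as `t` is really `Size`, and the value used as the `a` pointer is really the `m` pointer, which is the kind of silent corruption gaussian exposed.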

Repairing The Device Entry

The fix had to repair the generated CUDA kernel signature, not just the host call.

REX now runs the outlined CUDA kernel through a device-ABI repair step. The first part prepends the hidden launch-environment parameter:

SgInitializedName *kernel_launch_env_param =
    SageBuilder::buildInitializedName("__rex_kernel_launch_env",
                                      buildPointerType(buildVoidType()));
prependArg(params, kernel_launch_env_param);

This gives the generated CUDA kernel the first slot the runtime is going to pass. Even if the current kernel body does not use that launch environment, the parameter has to exist so every later argument lands in the correct position.

The generated CUDA entry now starts like this:

__global__ void OUT__...Fan1__kernel__(
    void *__rex_kernel_launch_env,
    unsigned long long Size,
    unsigned long long t,
    float *_dev_m,
    float *_dev_a)

The name __rex_kernel_launch_env is intentionally explicit. It is not a user argument. It is an ABI slot inserted by the compiler so the generated device entry matches the runtime-prepared argument sequence.

This part of the fix is independent of scalar literal packing. Even a kernel with no literal scalars still needs the hidden slot when launched through this direct kernel path.

Repairing Scalar Transport Width

The second part of the fix handles literal scalar parameters.

As the previous post explained, the host packs eligible scalar values into pointer-sized runtime slots. That means the device entry should receive a transport slot, not the original source type.

For gaussian, native LLVM used .u64 slots for the scalar literals on the tested host/runtime path. REX had been generating .u32 entries for int scalars. That was not the right ABI for the direct kernel path because the runtime was transporting pointer-sized values.

The device ABI repair therefore rewrites literal scalar parameters to a pointer-sized integer transport type:

SgType *transport_type =
    get_host_pointer_size_bytes(body) <= 4
        ? buildUnsignedIntType()
        : buildUnsignedLongLongType();
param->set_type(transport_type);

This is not a numeric conversion. It is a transport representation. The scalar value is represented as raw bytes in a slot large enough for the ABI.

That distinction is important for floats. If the host packs a float bit pattern into a slot, the device must not recover it through a numeric cast from an integer. It must recover the original bytes.

REX handles that by creating a typed shadow local and copying the bytes into it:

SgVariableDeclaration *shadow_decl =
    buildVariableDeclaration(shadow_name, original_type, NULL, body);

SgExprStatement *memcpy_stmt = buildFunctionCallStmt(
    "__builtin_memcpy", buildPointerType(buildVoidType()),
    buildExprListExp(buildAddressOfOp(buildVarRefExp(shadow_sym)),
                     buildAddressOfOp(buildVarRefExp(param_sym)),
                     buildSizeOfOp(original_type)),
    body);

Then the compiler rewrites uses of the original parameter in the kernel body to the shadow symbol. Conceptually, the generated device code becomes:

int t__rex_value;
__builtin_memcpy(&t__rex_value, &t, sizeof(int));

int Size__rex_value;
__builtin_memcpy(&Size__rex_value, &Size, sizeof(int));

for (i = tid; i <= Size__rex_value - 1 - t__rex_value - 1; i += stride) {
  ...
}

This is what makes the ABI repair type-safe. The runtime gets the transport width it expects, and the kernel body gets the source-level type it was written against.

Figure 3. The repaired path aligns every layer: host packet, runtime slots, device signature, and typed kernel-body values. (Diagram: the repaired direct target kernel contract, showing the host kernel-args packet, runtime-prepared slots, hidden launch environment, literal scalar transport, and typed device locals.)
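The difference between a byte copy and a numeric cast is easy to see in a host-only sketch. It assumes a 64-bit transport slot and uses hypothetical helper names (`pack_float_slot`, `unpack_float_bytes`, `unpack_float_numeric`); the real host packing is the subject of the previous post and may differ in detail.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) <= sizeof(uint64_t),
              "float must fit in one transport slot");

/* Host side: place the float's raw bytes at the start of a zeroed slot. */
uint64_t pack_float_slot(float value) {
  uint64_t slot = 0;
  memcpy(&slot, &value, sizeof value);
  return slot;
}

/* Device side, byte-copy recovery: mirrors the typed shadow local that is
   filled with __builtin_memcpy in the generated kernel body. */
float unpack_float_bytes(uint64_t slot) {
  float shadow;
  memcpy(&shadow, &slot, sizeof shadow);
  return shadow;
}

/* Device side, wrong recovery: a numeric conversion of the slot's integer
   value, which does not reproduce the original float. */
float unpack_float_numeric(uint64_t slot) {
  return (float)slot;
}
```

Because packing and unpacking both copy from the start of the slot, the round trip in this sketch does not depend on byte order; the numeric cast, by contrast, interprets the bit pattern as a large integer.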

Why This Was Not Just A Wrapper Problem

Earlier in the work, there was a design concern about adding REX wrapper layers instead of using the LLVM API directly. That concern was valid. A compiler should not hide design confusion behind wrappers.

This fix moves toward the direct API, not away from it.

The generated code now calls the direct kernel API:

__tgt_target_kernel(__device_id, _num_blocks_, _threads_per_block_,
                    __host_ptr, &__kernel_args);

The REX header still declares the runtime types and symbol aliases needed so generated source can compile cleanly against LLVM’s offload runtime. But the design point is not “wrap everything.” The design point is:

emit the direct API shape;
build the kernel-args packet the runtime expects;
repair the device entry so the runtime-prepared argument list matches it.

That is the opposite of staying on a legacy compatibility path. It is a real migration to the modern runtime contract.

Correctness Came Before Timing

The most important result of this fix was correctness.

Before the device ABI repair, a direct-kernel REX binary could build and launch while producing wrong gaussian output. After the repair, regenerated REX output matched native LLVM output once timing-only lines were removed:

diff -u <(grep -v '^Time' /tmp/rex_gaussian_full.txt) \
        <(grep -v '^Time' /tmp/native_gaussian_full.txt)

The diff was empty.

Only then did timing become meaningful. In the run recorded in the optimization log, the repaired REX gaussian path reported roughly:

Time total including memory transfers: 0.121591 sec
Time for kernels:                      0.116513 sec

The native LLVM run on the same input reported roughly:

Time total including memory transfers: 0.263611 sec
Time for kernels:                      0.257208 sec

Those numbers should be read carefully. They show that, once its output was correct, the repaired path was competitive with or faster than native LLVM on this workload: about 0.12 s total against about 0.26 s. They are not the final full-suite conclusion. Later posts will cover the benchmark-by-benchmark wrap-up and the LLVM 22 reevaluation.

The point here is narrower: once the host and device ABI matched, gaussian stopped being an invalid comparison.

The Regression Tests Had To Change

A migration like this needs generated-code tests, not only benchmark runs.

The Rodinia-derived lowering verification now checks host and device invariants:

expect_count "${rose_file}" '__tgt_target_kernel[[:space:]]*\(' \
  "${kernel_count}" "host target kernel call count"

expect_count "${rose_file}" '__tgt_target_teams[[:space:]]*\(' \
  0 "unexpected host target teams call count"

expect_count "${cu_file}" \
  '__global__[[:space:]]+void[[:space:]]+OUT__.*\(void[[:space:]]*[*][[:space:]]*__rex_kernel_launch_env' \
  "${kernel_count}" "device hidden launch env parameter count"

Those checks encode the real invariant:

if the host uses direct __tgt_target_kernel,
the generated device entry must expose the hidden launch-environment slot.

The tests also check that the hidden slot is not duplicated:

expect_count "${cu_file}" \
  'void[[:space:]]*[*][[:space:]]*__rex_kernel_launch_env,[[:space:]]*void[[:space:]]*[*][[:space:]]*__rex_kernel_launch_env' \
  0 "duplicate hidden launch env parameter count"

That matters because AST transformations often run through multiple lowering paths. The pass must be idempotent: running it again on an already repaired kernel must not prepend the ABI slot a second time.
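The idempotence requirement can be illustrated with a toy model of a parameter list. This is not ROSE or SageBuilder code: the `param_list` type and the `ensure_launch_env_first` helper are hypothetical, and parameters are represented only by their names.

```c
#include <assert.h>
#include <string.h>

#define HIDDEN_PARAM "__rex_kernel_launch_env"
enum { MAX_PARAMS = 16 };

/* Toy stand-in for a kernel's parameter list: names only. */
struct param_list {
  const char *names[MAX_PARAMS];
  int count;
};

/* Make sure the hidden launch-environment parameter is first.
   Returns 1 if it was inserted, 0 if it was already there,
   and -1 if the list has no room left. */
int ensure_launch_env_first(struct param_list *params) {
  assert(params != NULL && params->count >= 0);
  if (params->count > 0 && strcmp(params->names[0], HIDDEN_PARAM) == 0)
    return 0; /* already repaired: do nothing on a second run */
  if (params->count >= MAX_PARAMS)
    return -1;
  /* Shift every existing parameter one position to the right. */
  memmove(&params->names[1], &params->names[0],
          (size_t)params->count * sizeof params->names[0]);
  params->names[0] = HIDDEN_PARAM;
  params->count++;
  return 1;
}
```

Checking for the parameter before inserting it is what lets the same repair step run safely from more than one lowering path.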

The reduced Rodinia suite includes rodinia_gaussian_like specifically because it exercises the multi-kernel direct-launch shape with scalar loop-control values. That is the kind of small structural test that catches a future ABI regression faster than waiting for a full benchmark run to produce a wrong matrix.

The Design Rule To Keep

The durable lesson from this phase is:

a runtime API migration is complete only when host packet, runtime entry point,
device signature, and device-body value reconstruction agree.

For REX, that means the direct kernel path must keep these pieces aligned:

__tgt_kernel_arguments Version and fields;
__tgt_target_kernel host call;
offload-entry identity through __host_ptr;
hidden __rex_kernel_launch_env device parameter;
pointer-sized literal scalar transport slots;
typed shadow locals reconstructed with __builtin_memcpy;
generated-code tests that verify the shape.

The mistake to avoid is treating __tgt_target_kernel as a mechanical replacement for __tgt_target_teams. The host call is only the visible tip. The device signature is part of the API.

That is why this post follows the literal-scalar post. Literal scalar packing explained how individual scalar values should be represented. This post explains how the whole launch frame had to be repaired so those values land in the right device parameters.

The next post moves to b+tree, where the issue is no longer an ABI mismatch. The remaining gap there came from launch-geometry policy and fairness: what REX is allowed to optimize, what it must preserve, and how to improve performance without changing explicit user intent.