How REX Builds `__tgt_kernel_arguments` and Runtime Map Arrays

Posted on Mar 28, 2026 (Updated on Apr 15, 2026)

After REX builds the host launch block, it still has to package mapped arguments into the exact shape expected by LLVM’s offloading runtime. It materializes __args_base, __args, __arg_sizes, and __arg_types, chooses either a static braced-initializer path or a dynamic heap-backed expansion path, then assembles __tgt_kernel_arguments with the correct field order and launch dimensions. This packet is the real ABI contract: if it is wrong, the kernel may launch with the wrong data even when outlining and host-side control flow look correct.

The previous post in this series focused on the host launch block: the explicit host-side code that replaces a lowered omp target-family construct and eventually calls __tgt_target_kernel(...).

That still leaves one layer unexplained: where do the runtime-facing argument arrays and the __tgt_kernel_arguments object actually come from?

This is the place where GPU lowering stops being “explicit host code” in the general sense and becomes an ABI contract in a very literal sense.

By the time this stage runs, REX already has:

an outlined device kernel,
a host block with kernel identity and launch-dimension variables,
and a set of mapping expressions computed from map(...) clauses and lowering-generated state.

What it still needs is a runtime packet in the exact layout that LLVM’s offloading runtime expects.

This post zooms into that packet-building stage inside src/midend/programTransformation/ompLowering/omp_lowering.cpp. The focus is intentionally narrow:

how __args_base, __args, __arg_sizes, and __arg_types are built,
how the lowerer chooses between static and dynamic map-array construction,
how packed literal arguments are stabilized before launch,
how buildTargetKernelArgsDeclaration(...) fills the __tgt_kernel_arguments struct,
and why getting this structure wrong causes failures that look like runtime bugs rather than syntax or AST bugs.

Figure 1. The runtime packet layer sits between the host launch block and the final offloading call. Different map-expansion paths still converge on one packet shape. (Diagram: map expressions and launch variables feed either the static or the dynamic array builder, then converge into a __tgt_kernel_arguments object passed to __tgt_target_kernel.)

Why This Packet Layer Deserves Its Own Stage

It is easy to treat __tgt_kernel_arguments as a boring aggregate and focus on the final __tgt_target_kernel(...) call instead. That is the wrong mental model.

The launch call itself is small. Most of the offloading contract lives in the data structures passed into it.

That matters because failures at this layer are deceptive. If the lowerer gets the packet wrong:

the kernel may still be outlined correctly,
the host launch block may still look structurally reasonable,
the runtime call may still execute,
and the failure may only show up as wrong values on the device, partial mapping, or unexplained runtime misbehavior.

So this stage is not just “some declarations before the call.” It is where REX turns compiler-owned mapping facts into the exact memory layout that libomptarget expects to consume.

There are two big design ideas to keep in mind while reading this code:

First, the runtime packet is built from ordinary lowered host declarations: REX turns launch geometry and map expressions into explicit local state, and the packet builder then consumes that state.

Second, static and dynamic map cases use different construction strategies but converge on the same ABI shape. The runtime never sees “the static path” or “the dynamic path.” It sees one __tgt_kernel_arguments layout.

That convergence is the core of the design.

Step 1: Start From Four Parallel Map Lists

Long before the final packet object exists, the lowerer has already accumulated four synchronized expression lists:

map_variable_base_list
map_variable_list
map_variable_size_list
map_variable_type_list

These are the conceptual inputs to the runtime packet.

Each position across the lists represents one mapped argument slot:

base address,
effective mapped address,
size,
and map-type bits.

That is why the lowerer has a utility like getMapArgumentListCount(...) that asserts all four lists are the same length:

const size_t arg_count = map_variable_list->get_expressions().size();
ROSE_ASSERT(map_variable_base_list->get_expressions().size() == arg_count);
ROSE_ASSERT(map_variable_size_list->get_expressions().size() == arg_count);
ROSE_ASSERT(map_variable_type_list->get_expressions().size() == arg_count);

That assertion is more than defensive programming. It expresses the intended model: these are not four unrelated accumulators. They are one logical table being built in column form.

This column-oriented representation is useful inside the lowerer because different helper paths can append to the size and type expressions independently from the base and address expressions. But the runtime ABI still expects them to line up entry-for-entry.
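The column-versus-row relationship can be sketched outside the compiler. The following is a minimal standalone model, not REX code: `MapColumns`, `MapRow`, `appendRow`, and `rowAt` are hypothetical names, and plain values stand in for the AST expressions that the real lists hold.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the four synchronized lists. The real lowerer
// stores AST expressions; plain values are used here for illustration only.
struct MapColumns {
  std::vector<void *> base;   // models map_variable_base_list
  std::vector<void *> arg;    // models map_variable_list
  std::vector<int64_t> size;  // models map_variable_size_list
  std::vector<int64_t> type;  // models map_variable_type_list
};

// One logical row of the table: a single mapped argument slot.
struct MapRow {
  void *base;
  void *arg;
  int64_t size;
  int64_t type;
};

// Same intent as the length check quoted above: all columns must agree.
std::size_t argumentCount(const MapColumns &c) {
  const std::size_t n = c.arg.size();
  assert(c.base.size() == n);
  assert(c.size.size() == n);
  assert(c.type.size() == n);
  return n;
}

// Appends a whole slot at once. The real lowerer may append to columns
// separately, which is exactly why it asserts the lengths afterwards.
void appendRow(MapColumns &c, const MapRow &r) {
  c.base.push_back(r.base);
  c.arg.push_back(r.arg);
  c.size.push_back(r.size);
  c.type.push_back(r.type);
}

// Reads slot i back as a row, which is how the runtime pairs the entries.
MapRow rowAt(const MapColumns &c, std::size_t i) {
  assert(i < argumentCount(c));
  return MapRow{c.base[i], c.arg[i], c.size[i], c.type[i]};
}
```

Reading a slot through `rowAt` makes the invariant concrete: index `i` must mean the same mapped argument in every column.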

Step 2: Stabilize Packed Literal Arguments Before Building Arrays

Some mapped scalars can use literal transport instead of normal address-based mapping. Earlier lowering stages may therefore place helper expressions such as rex_pack_literal_arg_bytes(...) directly into the map lists.

That is not the form the final host launch block should use.

Before the lowerer emits the static map arrays, it normalizes those expressions with materializeLiteralTargetArgExpressions(...).

The logic is straightforward:

if (isLiteralTargetParamPackCall(arg_exprs[i])) {
  packed_expr = arg_exprs[i];
} else if (isLiteralTargetParamPackCall(base_exprs[i])) {
  packed_expr = base_exprs[i];
}

When it finds such a packed expression, it creates a new local variable:

SgVariableDeclaration *packed_decl = buildVariableDeclaration(
    packed_name, buildPointerType(buildVoidType()),
    buildAssignInitializer(copyExpression(packed_expr)), scope);
outlined_driver_body->append_statement(packed_decl);

Then it rewrites both the argument and base list entries to refer to that same local:

arg_exprs[i] = buildVarRefExp(packed_sym);
base_exprs[i] = buildVarRefExp(packed_sym);

Why does this matter?

It avoids duplicating nontrivial expressions

If both ArgsBase and Args were initialized with independent copies of the same pack call, the lowered source would evaluate the pack expression twice. That is pointless at best and dangerous if the helper ever gains observable behavior.

It keeps the two lists aligned semantically as well as structurally

For a packed literal argument, the correct base and effective mapping expressions are intentionally the same stable value. Hoisting them into one local variable makes that relationship explicit in the generated source.

It makes the static and dynamic paths conceptually parallel

The dynamic path performs a similar stabilization step later with temporary names like __rex_packed_literal_arg_dyn_*. That is not accidental duplication. It is the same semantic rule applied in a different construction strategy.

This is a good example of REX preferring explicit lowered artifacts over clever implicit sharing.
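To make the effect of the hoist concrete, here is a small sketch of the shape the lowered code takes after stabilization. It is an illustration under assumptions: `packLiteralBytes` is a hypothetical stand-in for the real pack helper (whose signature this post does not show), and the call counter exists only to demonstrate that the expression is evaluated once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Counts evaluations of the pack helper, for demonstration only.
static int g_pack_calls = 0;

// Hypothetical pack helper: copies the bytes of a small scalar into a
// pointer-sized value so it can travel in a void* argument slot.
void *packLiteralBytes(const void *src, std::size_t n) {
  ++g_pack_calls;
  assert(n <= sizeof(std::uintptr_t));
  std::uintptr_t bits = 0;
  std::memcpy(&bits, src, n);
  return reinterpret_cast<void *>(bits);
}

// The base and argument entries for one packed literal slot.
struct StabilizedSlot {
  void *base;
  void *arg;
};

// Shape after stabilization: the pack call is hoisted into one local, and
// both list entries refer to that same local.
StabilizedSlot stabilizedSlot(int value) {
  void *packed = packLiteralBytes(&value, sizeof value);
  return StabilizedSlot{packed, packed};
}

// Inverse of the hypothetical pack helper, used to check the round trip.
int unpackInt(void *packed) {
  const std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(packed);
  int value = 0;
  std::memcpy(&value, &bits, sizeof value);
  return value;
}
```

Without the hoist, each of the two entries would carry its own copy of the call, and the counter would read two after building a single slot.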

Step 3: Choose Between Static and Dynamic Map-Array Construction

Once the map expressions exist, the lowerer branches on whether any dynamic map entries are present.

The high-level decision looks like this:

if (!dynamic_map_entries.empty()) {
  dynamic_arrays = buildDynamicRuntimeMapArgumentArrays(...);
  ...
} else {
  materializeLiteralTargetArgExpressions(...);
  ...
}

That is one of the most important control points in the packet-building stage.

The static path

If there are no dynamic entries, the lowerer emits ordinary local arrays with braced initializers:

args_base_decl = buildVariableDeclaration(
    "__args_base", buildArrayType(buildPointerType(buildVoidType())),
    buildAssignInitializer(offloading_variables_base), p_scope);

args_decl = buildVariableDeclaration(
    "__args", buildArrayType(buildPointerType(buildVoidType())),
    buildAssignInitializer(offloading_variables), p_scope);

arg_sizes = buildVariableDeclaration(
    "__arg_sizes", buildArrayType(buildOpaqueType("int64_t", p_scope)),
    buildAssignInitializer(map_variable_sizes), p_scope);

arg_types = buildVariableDeclaration(
    "__arg_types", buildArrayType(buildOpaqueType("int64_t", p_scope)),
    buildAssignInitializer(map_variable_types), p_scope);

Then it emits:

arg_number_decl = buildVariableDeclaration(
    "__arg_num", buildOpaqueType("int32_t", p_scope),
    buildAssignInitializer(buildIntVal(kernel_arg_num)), p_scope);

This is the simpler and more readable shape. In generated code, it looks like an ordinary bundle of local arrays whose length is fixed at compile time, each initialized from a braced expression list.
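For orientation, the generated host code on the static path might look roughly like the following. This is a hand-written sketch, not actual REX output: the two mapped arrays, the `StaticPacketView` return type, and the map-type constants are assumptions (the constant values follow libomptarget's `tgt_map_type` flags), and the locals drop the leading underscores used by the emitted names.

```cpp
#include <cassert>
#include <cstdint>

// Assumed map-type bits, following libomptarget's tgt_map_type values.
constexpr int64_t kMapTo = 0x001;
constexpr int64_t kMapFrom = 0x002;
constexpr int64_t kMapTargetParam = 0x020;

// Hypothetical summary type so the sketch can return what it built.
struct StaticPacketView {
  int32_t arg_num;
  void *first_base;
  void *second_arg;
  int64_t total_bytes;
  int64_t second_type;
};

// Models a region that maps a[0:n] to the device and b[0:n] back from it.
StaticPacketView staticPathShape(double *a, double *b, int n) {
  const int64_t bytes =
      static_cast<int64_t>(n) * static_cast<int64_t>(sizeof(double));

  // Four local arrays with braced initializers, one entry per mapped slot.
  void *args_base[] = {a, b};                          // emitted as __args_base
  void *args[] = {a, b};                               // emitted as __args
  int64_t arg_sizes[] = {bytes, bytes};                // emitted as __arg_sizes
  int64_t arg_types[] = {kMapTo | kMapTargetParam,     // emitted as __arg_types
                         kMapFrom | kMapTargetParam};
  int32_t arg_num = 2;                                 // emitted as __arg_num

  return StaticPacketView{arg_num, args_base[0], args[1],
                          arg_sizes[0] + arg_sizes[1], arg_types[1]};
}
```

The values inside the initializers can still depend on run-time state such as `n`. What is static is the number of slots.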

The dynamic path

If dynamic entries exist, the lowerer cannot just dump a fixed initializer list into a local array. It has to compute how many runtime argument slots will exist and then populate them procedurally.

That is the job of buildDynamicRuntimeMapArgumentArrays(...).

The function starts by counting the entries in any static prefix and suffix lists, then initializes __arg_num with that fixed part:

result.arg_number_decl = buildVariableDeclaration(
    "__arg_num", buildOpaqueType("int32_t", scope),
    buildAssignInitializer(
        buildIntVal(static_cast<int>(prefix_count + suffix_count))),
    block);

Then it runs a count-only pass over the expanded dynamic entries:

appendExpandedMapEntriesDynamicPass(
    dynamic_entries, DynamicMapExpansionPass::count_only, ...);

During this pass, direct items increment __arg_num, and mapper-expanded sections recursively count how many final leaf entries they will contribute.

Only after the total size is known does the lowerer allocate heap-backed arrays:

result.args_base_decl = buildVariableDeclaration(
    "__args_base", void_ptr_ptr_type,
    buildAssignInitializer(
        buildMallocArrayInitializer(void_ptr_type, arg_count_expr, scope)),
    block);

The same pattern is used for __args, __arg_sizes, and __arg_types.

This is why the dynamic path is not just “the static path with loops.” It is a different construction strategy whose first job is to discover the final arity.
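The count-then-allocate order can be modeled in a few lines. The sketch below is a simplified model rather than the lowerer's implementation: `DynEntry`, `countLeafSlots`, and `countThenAllocate` are hypothetical names, a section's extent is an ordinary integer instead of a run-time expression, and only two of the four arrays are allocated because the others follow the same pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <utility>
#include <vector>

// Hypothetical model of one dynamic map entry. An entry with no children is
// a direct item (one slot). Otherwise it is a mapper-expanded section whose
// children are contributed once per element.
struct DynEntry {
  std::size_t extent;
  std::vector<DynEntry> per_element;
};

DynEntry directItem() { return DynEntry{1, {}}; }

DynEntry section(std::size_t extent, std::vector<DynEntry> per_element) {
  return DynEntry{extent, std::move(per_element)};
}

// Count-only pass: recursively count the leaf slots an entry will produce.
std::size_t countLeafSlots(const DynEntry &e) {
  if (e.per_element.empty()) return 1;
  std::size_t per_elem = 0;
  for (const DynEntry &child : e.per_element) per_elem += countLeafSlots(child);
  return e.extent * per_elem;
}

struct CountedArrays {
  int32_t arg_num;
  void **args_base;
  int64_t *arg_sizes;
};

// Start from the fixed prefix/suffix part, add the dynamic leaves, and only
// then allocate. The caller owns the returned arrays and must free them.
CountedArrays countThenAllocate(std::size_t prefix_count,
                                std::size_t suffix_count,
                                const std::vector<DynEntry> &dynamic_entries) {
  std::size_t arg_num = prefix_count + suffix_count;
  for (const DynEntry &e : dynamic_entries) arg_num += countLeafSlots(e);

  CountedArrays r;
  r.arg_num = static_cast<int32_t>(arg_num);
  r.args_base = static_cast<void **>(std::malloc(arg_num * sizeof(void *)));
  r.arg_sizes = static_cast<int64_t *>(std::malloc(arg_num * sizeof(int64_t)));
  return r;
}
```

With one prefix entry, one suffix entry, one direct item, and a three-element section that contributes two leaves per element, the count is 1 + 1 + 1 + 3 × 2 = 9, and that is the size every array receives.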

Figure 2. Static and dynamic mapping use different construction strategies, but both ultimately produce the same four runtime arrays and one argument-count declaration. (Diagram: the static path creates braced local arrays, while the dynamic path first counts entries, then mallocs arrays, and finally populates them with an arg-index cursor.)

Step 4: Populate Dynamic Arrays With a Cursor, Not With Initializers

After allocation, the dynamic path creates an __arg_index cursor:

SgVariableDeclaration *arg_index_decl =
    buildVariableDeclaration("__arg_index", buildOpaqueType("int32_t", scope),
                             buildAssignInitializer(buildIntVal(0)), block);

That cursor is how the lowerer turns a potentially recursive expansion process into a flat runtime array layout.

Prefix lists first

If there are ordinary non-dynamic entries that should appear before the dynamic expansion, the lowerer copies them into the heap arrays with appendRawMapArgumentListsToDynamicArrays(...).

That helper performs the same four writes per entry:

appendMapArgumentArrayAssignment(block, scope, args_base_decl,
                                 arg_index_decl, bases[i], void_ptr_type);
appendMapArgumentArrayAssignment(block, scope, args_decl, arg_index_decl,
                                 args[i], void_ptr_type);
appendMapArgumentArrayAssignment(block, scope, arg_sizes_decl,
                                 arg_index_decl, sizes[i], int64_type);
appendMapArgumentArrayAssignment(block, scope, arg_types_decl,
                                 arg_index_decl, types[i], int64_type);

and then increments __arg_index.

Dynamic entries next

The lowerer then runs the populate pass:

appendExpandedMapEntriesDynamicPass(
    dynamic_entries, DynamicMapExpansionPass::populate, ...);

For direct dynamic items, it builds a MapArgumentExpressions bundle and writes the four fields into the heap arrays. If the mapping expression is a literal-pack call, it stabilizes that into a temporary __rex_packed_literal_arg_dyn_* variable before storing it.

For mapper-expanded array sections, the helper recursively builds loop nests, computes each element expression with buildArraySectionElementExpression(...), and expands the mapper on each leaf element before continuing the populate walk.

This is the core reason the dynamic path must be procedural. The final runtime array is flat, but the source-level mapping description may imply a recursive or nested expansion shape.

Suffix lists last

Finally, if there are fixed entries that must come after the expanded dynamic ones, the lowerer appends them with the same raw-list helper.

The important architectural point is that the runtime still receives one flat array per column.

The recursion, counting, temporary index variables, and heap storage all exist only so the compiler can flatten a richer mapping description into that flat ABI form.
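The prefix, dynamic, suffix ordering with a single cursor can be sketched as the kind of procedural code the dynamic path emits. This is an illustration under assumptions, not REX output: `Node`, `FlatArrays`, `writeSlot`, and `populateFlat` are hypothetical, each element contributes exactly one leaf, and the type values are placeholders rather than real map-type bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

struct FlatArrays {
  int32_t arg_num;
  void **args_base;
  void **args;
  int64_t *arg_sizes;
  int64_t *arg_types;
};

// Writes one slot into all four columns, then advances the cursor.
void writeSlot(FlatArrays &fa, int32_t &arg_index, void *base, void *arg,
               int64_t size, int64_t type) {
  assert(arg_index < fa.arg_num);
  fa.args_base[arg_index] = base;
  fa.args[arg_index] = arg;
  fa.arg_sizes[arg_index] = size;
  fa.arg_types[arg_index] = type;
  ++arg_index;
}

// Hypothetical element type whose mapper contributes one leaf per element.
struct Node {
  double *data;
  int64_t len;
};

// One fixed prefix slot, n expanded slots, one fixed suffix slot.
// The caller owns the four arrays and must free them.
FlatArrays populateFlat(int *prefix_scalar, Node *nodes, int32_t n,
                        int *suffix_scalar) {
  FlatArrays fa;
  fa.arg_num = 1 + n + 1;  // the total a count-only pass would have produced
  const std::size_t count = static_cast<std::size_t>(fa.arg_num);
  fa.args_base = static_cast<void **>(std::malloc(count * sizeof(void *)));
  fa.args = static_cast<void **>(std::malloc(count * sizeof(void *)));
  fa.arg_sizes = static_cast<int64_t *>(std::malloc(count * sizeof(int64_t)));
  fa.arg_types = static_cast<int64_t *>(std::malloc(count * sizeof(int64_t)));

  int32_t arg_index = 0;  // plays the role of __arg_index
  const int64_t int_bytes = static_cast<int64_t>(sizeof(int));
  const int64_t dbl_bytes = static_cast<int64_t>(sizeof(double));

  writeSlot(fa, arg_index, prefix_scalar, prefix_scalar, int_bytes, 0x1);
  for (int32_t i = 0; i < n; ++i)  // loop stands in for the mapper expansion
    writeSlot(fa, arg_index, nodes[i].data, nodes[i].data,
              nodes[i].len * dbl_bytes, 0x3);
  writeSlot(fa, arg_index, suffix_scalar, suffix_scalar, int_bytes, 0x1);

  assert(arg_index == fa.arg_num);  // populate must agree with the count
  return fa;
}
```

The final assertion is the property that ties the two passes together: the populate pass must write exactly as many slots as the count pass reserved.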

Step 5: Assemble `__tgt_kernel_arguments` in the Exact ABI Field Order

Once the map arrays exist, the lowerer finally calls:

buildTargetKernelArgsDeclaration(
    g_scope, p_scope, arg_number_decl, args_base_decl, args_decl, arg_sizes,
    arg_types, num_blocks_decl, threads_per_block_decl, tripcount_expr);

This helper is the point where compiler-local declarations become a concrete ABI struct.

To understand why the field order matters, it helps to look at the vendored definition in rex_kmp.h:

struct __tgt_kernel_arguments {
  int32_t Version;
  int32_t NumArgs;
  void **ArgsBase;
  void **Args;
  int64_t *ArgSizes;
  int64_t *ArgTypes;
  void **ArgNames;
  void **ArgMappers;
  int64_t Tripcount;
  int64_t Flags;
  int32_t Teams[3];
  int32_t Threads[3];
  int32_t DynCGroupMem;
};

The builder then fills those fields positionally:

kernel_args_exprs.push_back(buildIntVal(3));
kernel_args_exprs.push_back(buildVarRefExp(arg_number_decl));
kernel_args_exprs.push_back(buildVarRefExp(args_base_decl));
kernel_args_exprs.push_back(buildVarRefExp(args_decl));
kernel_args_exprs.push_back(buildVarRefExp(arg_sizes));
kernel_args_exprs.push_back(buildVarRefExp(arg_types));
kernel_args_exprs.push_back(buildKernelArgNullPtrExpr());
kernel_args_exprs.push_back(buildKernelArgNullPtrExpr());
kernel_args_exprs.push_back(tripcount_expr);
kernel_args_exprs.push_back(buildLongLongIntVal(0));
kernel_args_exprs.push_back(
    buildKernelLaunchDimInitializer(buildVarRefExp(num_blocks_decl)));
kernel_args_exprs.push_back(
    buildKernelLaunchDimInitializer(buildVarRefExp(threads_per_block_decl)));
kernel_args_exprs.push_back(buildIntVal(0));

That expands to the following concrete meaning.

`Version = 3`

REX is intentionally targeting the current struct layout it vendors in rex_kmp.h. This is one of the reasons the helper/runtime layer exists: the generated source should not depend on the system headers accidentally agreeing with a remembered layout.

`NumArgs`, `ArgsBase`, `Args`, `ArgSizes`, `ArgTypes`

These are the four columns plus the count that the earlier lowering logic produced. They are the center of the packet.

`ArgNames = nullptr`, `ArgMappers = nullptr`

REX explicitly sets the optional name and mapper-pointer arrays to null via buildKernelArgNullPtrExpr(). The packet layout still includes these slots even when the current lowering does not populate them.

`Tripcount`

If the worksharing path computed a tripcount declaration, it is cast to int64_t and inserted here. Otherwise the helper stores 0.

`Flags = 0`

This post does not need to over-interpret that value. The important point is that REX writes the field intentionally; it is not left uninitialized or omitted.

`Teams = {num_blocks, 1, 1}` and `Threads = {threads_per_block, 1, 1}`

The helper uses buildKernelLaunchDimInitializer(...) to build a 3D launch tuple with only the X dimension varying:

return buildAggregateInitializer(buildExprListExp(
    copyExpression(x_dim_expr), buildIntVal(1), buildIntVal(1)));

That is how the host-side scalar launch variables become the 3-component launch arrays required by the ABI.

`DynCGroupMem = 0`

Again, the point is explicitness. If REX does not request dynamic cgroup memory (on GPU targets, extra per-team shared memory sized at launch), it writes a concrete zero into the correct final field.
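Putting the field-by-field description together, the emitted declaration amounts to one positional braced initializer. The sketch below mirrors the field order quoted from rex_kmp.h under a hypothetical name, `KernelArgsSketch`, so that it stands alone. `makeKernelArgs` is likewise an illustrative helper, not part of REX.

```cpp
#include <cassert>
#include <cstdint>

// Mirror of the field order quoted above, renamed so the sketch does not
// depend on the vendored header.
struct KernelArgsSketch {
  int32_t Version;
  int32_t NumArgs;
  void **ArgsBase;
  void **Args;
  int64_t *ArgSizes;
  int64_t *ArgTypes;
  void **ArgNames;
  void **ArgMappers;
  int64_t Tripcount;
  int64_t Flags;
  int32_t Teams[3];
  int32_t Threads[3];
  int32_t DynCGroupMem;
};

// Builds the packet positionally, in the same order as the builder's
// push_back sequence.
KernelArgsSketch makeKernelArgs(int32_t arg_num, void **args_base, void **args,
                                int64_t *arg_sizes, int64_t *arg_types,
                                int32_t num_blocks, int32_t threads_per_block,
                                int64_t tripcount) {
  KernelArgsSketch kernel_args = {
      3,                          // Version
      arg_num,                    // NumArgs
      args_base,                  // ArgsBase
      args,                       // Args
      arg_sizes,                  // ArgSizes
      arg_types,                  // ArgTypes
      nullptr,                    // ArgNames
      nullptr,                    // ArgMappers
      tripcount,                  // Tripcount
      0,                          // Flags
      {num_blocks, 1, 1},         // Teams: only X varies
      {threads_per_block, 1, 1},  // Threads: only X varies
      0};                         // DynCGroupMem
  return kernel_args;
}
```

Nothing in the initializer names a field. Each value reaches the right member purely because of where it sits in the list.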

Figure 3. `buildTargetKernelArgsDeclaration(...)` is a positional ABI builder. The order of fields matters just as much as the values placed into them. (Diagram: the struct layout of __tgt_kernel_arguments with the value REX stores in each field, including Version=3, null ArgNames/ArgMappers, Teams={num_blocks,1,1}, and Threads={threads_per_block,1,1}.)

Why The Field Order Matters So Much

This builder is not constructing a semantic object through named setters. It is constructing a braced initializer in the order the ABI struct expects.

That means a seemingly small bug here can become catastrophic in ways that are hard to diagnose:

shift one field and the runtime may interpret ArgTypes as ArgNames,
write launch dimensions in the wrong slot and the kernel may launch with nonsense geometry,
forget to cast Tripcount correctly and the runtime may see garbage or truncation,
mismatch the struct layout against the helper header and the entire packet becomes undefined from the runtime’s perspective.

This is why REX vendors the ABI struct definition and why this helper deserves more attention than “just aggregate initialization.”

The generated source may still compile cleanly even when the packet is semantically wrong. The compiler front end does not know what libomptarget intends to do with each slot. That is why bugs here often look like runtime failures rather than compile-time failures.
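One way to catch part of this class of bug at compile time is to pin the expected field order with `offsetof` checks. The sketch below is a possible guard, not something the post says REX does: `KernelArgsLayout` is a hypothetical mirror of the quoted struct, and `buildWithSwappedDims` is a deliberately wrong builder included to show that the compiler accepts a semantically broken packet.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical mirror of the quoted struct, the type being checked.
struct KernelArgsLayout {
  int32_t Version;
  int32_t NumArgs;
  void **ArgsBase;
  void **Args;
  int64_t *ArgSizes;
  int64_t *ArgTypes;
  void **ArgNames;
  void **ArgMappers;
  int64_t Tripcount;
  int64_t Flags;
  int32_t Teams[3];
  int32_t Threads[3];
  int32_t DynCGroupMem;
};

// Offsets listed in the order the ABI expects the fields to appear.
constexpr std::size_t kOffsets[] = {
    offsetof(KernelArgsLayout, Version),   offsetof(KernelArgsLayout, NumArgs),
    offsetof(KernelArgsLayout, ArgsBase),  offsetof(KernelArgsLayout, Args),
    offsetof(KernelArgsLayout, ArgSizes),  offsetof(KernelArgsLayout, ArgTypes),
    offsetof(KernelArgsLayout, ArgNames),  offsetof(KernelArgsLayout, ArgMappers),
    offsetof(KernelArgsLayout, Tripcount), offsetof(KernelArgsLayout, Flags),
    offsetof(KernelArgsLayout, Teams),     offsetof(KernelArgsLayout, Threads),
    offsetof(KernelArgsLayout, DynCGroupMem)};
constexpr std::size_t kFieldCount = sizeof(kOffsets) / sizeof(kOffsets[0]);

// True when the first n offsets are strictly increasing.
constexpr bool increasing(const std::size_t *p, std::size_t n) {
  return n < 2 || (p[0] < p[1] && increasing(p + 1, n - 1));
}

constexpr bool fieldOrderMatches() {
  return kOffsets[0] == 0 && increasing(kOffsets, kFieldCount);
}

static_assert(std::is_standard_layout<KernelArgsLayout>::value,
              "offsetof requires a standard-layout type");
static_assert(fieldOrderMatches(), "struct fields drifted from the ABI order");

// Deliberate bug: Teams and Threads are swapped. Both are int32_t[3], so the
// initializer still compiles and no layout check can notice.
KernelArgsLayout buildWithSwappedDims(int32_t num_blocks,
                                      int32_t threads_per_block) {
  KernelArgsLayout k = {3, 0, nullptr, nullptr, nullptr, nullptr, nullptr,
                        nullptr, 0, 0,
                        {threads_per_block, 1, 1}, {num_blocks, 1, 1}, 0};
  return k;
}
```

The guard protects the struct declaration, but the swapped builder shows its limit: when two adjacent slots share a type, only a test that inspects the built packet (or a wrong launch) reveals the mistake.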

Step 6: Clean Up Dynamic Storage Without Changing The Packet Shape

If the dynamic path was used, the lowerer appends cleanup immediately after the launch:

appendDynamicRuntimeMapArgumentArrayCleanup(dynamic_arrays,
                                            outlined_driver_body, p_scope);

That helper simply calls free(...) on the four heap-backed arrays, in the reverse of the order in which they were declared:

__arg_types
__arg_sizes
__args
__args_base

The important point is what does not change:

the runtime packet layout is identical,
the __kernel_args object is built the same way,
the launch call looks the same,
and only the storage lifetime of the backing arrays differs.

That is exactly the design REX wants. Dynamic mapping changes how the compiler constructs the arrays, not what the runtime expects to receive.
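The cleanup itself is small enough to sketch in full. The names below (`DynamicArrays`, `allocateDynamicArrays`, `cleanupDynamicArrays`) are hypothetical, and nulling the pointers after `free` is an addition made for this illustration so the result can be checked. The post only states that the emitted helper frees the four arrays in the listed order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical bundle of the four heap-backed arrays.
struct DynamicArrays {
  void **args_base;
  void **args;
  int64_t *arg_sizes;
  int64_t *arg_types;
};

// Allocates the arrays in declaration order, as the dynamic path does once
// the final argument count is known.
DynamicArrays allocateDynamicArrays(int32_t arg_num) {
  const std::size_t n = static_cast<std::size_t>(arg_num);
  DynamicArrays d;
  d.args_base = static_cast<void **>(std::malloc(n * sizeof(void *)));
  d.args = static_cast<void **>(std::malloc(n * sizeof(void *)));
  d.arg_sizes = static_cast<int64_t *>(std::malloc(n * sizeof(int64_t)));
  d.arg_types = static_cast<int64_t *>(std::malloc(n * sizeof(int64_t)));
  return d;
}

// Frees in the order listed above: types, sizes, args, then bases.
void cleanupDynamicArrays(DynamicArrays &d) {
  std::free(d.arg_types);
  d.arg_types = nullptr;
  std::free(d.arg_sizes);
  d.arg_sizes = nullptr;
  std::free(d.args);
  d.args = nullptr;
  std::free(d.args_base);
  d.args_base = nullptr;
}
```

Because the packet only stores pointers to these arrays, the cleanup has to run after the launch call has consumed them, which matches where the lowerer appends it.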

What This Packet-Building Design Buys REX

Looking at the map-array and packet layer directly makes a few design strengths much clearer.

The runtime contract is isolated in one place

The lowerer can reason about mapping, outlining, and host launch structure at higher levels. When it is time to satisfy the runtime ABI, there is a narrow layer that does just that.

This is the biggest architectural win. Complexity from dynamic mapper expansion stays in the construction path, not in the packet format.

Literal scalar transport is integrated cleanly

Packed literal arguments are normalized into explicit lowered variables before being exposed to the runtime packet. That keeps the final arrays stable and readable.

The generated source remains inspectable

Even though this is an ABI-heavy stage, the output is still ordinary declarations:

__args_base
__args
__arg_sizes
__arg_types
__arg_num
__kernel_args

That is exactly the kind of explicit lowered code that makes source-to-source debugging viable.

The next post can now move to the final major section from the original lowering article: how REX lowers target data regions and how the tests exercise multi-kernel and repeated-launch lifetimes without relying on brittle full-file golden outputs.