How REX Builds the GPU Host Launch Block
Every lowered target region ultimately becomes an ordinary host block that ends in a single explicit call to __tgt_target_kernel. The worksharing path adds tripcount-aware launch shaping and reduction cleanup, but the overall design stays the same: generate explicit host code rather than rely on implicit runtime interpretation.

The previous post in this series focused on the outlining boundary: how a target region becomes a real device kernel with a stable name and an explicit capture set.
That still leaves an important practical question:
once the kernel exists, what exactly replaces the original OpenMP statement in the host code?
The answer is not “the runtime figures it out.” REX emits an ordinary host-side block of declarations and statements that performs the launch explicitly.
This post zooms in on that block. It covers the host-side artifact created after outlining, not the device kernel itself. In src/midend/programTransformation/ompLowering/omp_lowering.cpp, this is the code that:
- identifies the outlined kernel on the host side,
- creates explicit launch-dimension variables,
- packages mapped arguments,
- emits the final __tgt_target_kernel(...) call,
- and replaces the original omp target-family statement with ordinary host statements.
It is a useful stage to isolate because it sits at a very specific boundary in the lowerer:
- the kernel already exists,
- the runtime packet is not yet fully explained,
- and the original OpenMP statement is about to disappear from the host AST.
Figure 1. The host launch block is the host-side counterpart to outlining. Registration happens once near program startup, then each lowered target region becomes an ordinary host block that launches a specific kernel.
Why The Host Launch Block Exists At All
Once REX has created an outlined kernel, it still has not finished lowering the original target construct. The host translation unit still contains an OpenMP statement node such as:
- SgOmpTargetStatement
- SgOmpTargetTeamsStatement
- SgOmpTargetParallelStatement
- SgOmpTargetParallelForStatement
- SgOmpTargetTeamsDistributeParallelForStatement
That node cannot remain in place if the goal is to emit ordinary host source that can be compiled and linked against LLVM’s offloading runtime.
So the lowerer produces a replacement artifact: a basic block that contains the host-side steps required to launch the already-outlined kernel. That block is not a conceptual runtime description. It is literal generated source.
The overall host-side story therefore has two layers:
- One-time process initialization. REX inserts rex_offload_init() near the top of main so CUBIN registration and runtime setup happen once, before user timing declarations.
- Per-target launch blocks. Each lowered target region becomes a local host block that computes launch state and issues a single explicit offload call.
That separation matters. The launch block is hot-path code that may run many times. Image registration is not repeated there.
The initialization insertion is handled elsewhere in omp_lowering.cpp.
By the time a launch block runs, the runtime helper layer is expected to be ready already. That is why the per-region host block can focus entirely on the actual launch and associated data mapping state.
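As a minimal runnable sketch of that two-layer split (the helper body and the launch-block contents here are stand-ins, not REX's actual output):

```c
#include <assert.h>

/* Stand-in for REX's one-time runtime setup helper; the real body performs
 * CUBIN registration and lives in generated support code, not here. */
static int rex_runtime_ready = 0;
void rex_offload_init(void) { rex_runtime_ready = 1; }

/* Shape of a per-region launch block: it assumes initialization already
 * happened and owns only launch-local state. */
int launch_block(void) {
  assert(rex_runtime_ready);   /* init is a precondition, not repeated here */
  int _num_blocks_ = 256;      /* illustrative launch-local values */
  int _threads_per_block_ = 128;
  return _num_blocks_ * _threads_per_block_;
}
```

The point of the split is visible even in this toy form: the launch block can run many times, and none of the one-time registration work appears inside it.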
The Launch Block Is The Real Replacement Artifact
One of the most important details in the lowering code is that the original target statement does not remain as a wrapper around generated statements. It is replaced.
In both the simpler SPMD path and the worksharing path, the lowering ends with the same kind of action: a single statement-replacement call that swaps the original OpenMP node for the generated host block.
That call is easy to skim past, but it is the concrete moment when “OpenMP directive” becomes “ordinary host code.”
This is one of the reasons source-to-source lowering in REX is relatively inspectable. If a contributor opens the lowered host file, they do not have to imagine an invisible runtime interpretation layer. They can read the generated block and see:
- where the kernel identity came from,
- how many teams and threads were chosen,
- what argument arrays were materialized,
- and where the final offload call happens.
The exact construction of outlined_driver_body differs slightly by lowering path:
- the SPMD path constructs a fresh basic block;
- the worksharing path reuses the normalized target body block as the scaffold for the replacement.
But that implementation detail does not change the externally visible result. In both cases, the user-level OpenMP region is turned into explicit host statements.
Step 1: Give The Host Side A Stable Kernel Identity
The first job of the host launch block is not to pick a launch size. It is to answer a more basic question:
which exact kernel is this host block trying to launch?
REX answers that with a small but important pattern built around a synthetic global symbol and a matching offload entry.
A synthetic host-side key
After the kernel has been outlined and named, the lowerer creates a single-byte global declaration that serves as a host-side key.
That declaration is not the kernel body. It is a host-side identity token.
REX then builds a __tgt_offload_entry object whose addr field points at that synthetic symbol.
The entry itself is emitted into the omp_offloading_entries section.
This is the same offload-entry table that the helper/runtime layer later uses when registering the CUBIN with libomptarget. So the host launch block is not pointing directly at a CUDA function pointer in the obvious C sense. It is participating in the runtime’s host-entry identity model.
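A hedged sketch of this identity pattern, using the classic libomptarget entry layout; the kernel name __rex_kernel_saxpy is invented for illustration, and in real generated code the entry would additionally be placed in the omp_offloading_entries section via a section attribute:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Classic libomptarget offload-entry layout, defined locally here so the
 * sketch is self-contained; this is not the project's actual header. */
struct __tgt_offload_entry {
  void    *addr;     /* points at the host-side identity symbol */
  char    *name;     /* must match the device-side kernel name */
  size_t   size;     /* 0 for functions, nonzero for globals */
  int32_t  flags;
  int32_t  reserved;
};

/* Synthetic one-byte key standing in for the kernel on the host side.
 * The name is illustrative; REX derives it from the outlined kernel name. */
char __rex_kernel_saxpy_id = 0;

/* The entry ties the synthetic symbol to the kernel name the runtime will
 * look up in the registered device image. */
struct __tgt_offload_entry saxpy_entry = {
  &__rex_kernel_saxpy_id, "__rex_kernel_saxpy", 0, 0, 0
};
```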
The host pointer used for launch
Inside the actual replacement block, the lowerer materializes a local __host_ptr variable derived from that synthetic global symbol.
That __host_ptr value is what later flows into __tgt_target_kernel(...).
This is a good example of why it is worth separating the host launch block from the kernel body conceptually. The host block is not “calling a CUDA function” in the direct source-language sense. It is packaging a host-visible identity handle that the offloading runtime can match back to the registered device image and its entry table.
Preserving the relation between source and generated artifacts
REX prepends both the synthetic id variable and the offload entry as global declarations while preserving leading preprocessing information.
This is not only about correctness. It also keeps generated files inspectable. The host-side identity declarations live near the generated kernel and remain stable enough for structural tests and debugging.
Figure 2. The host launch block identifies kernels through the runtime’s offload-entry model. The local __host_ptr is derived from a synthetic global key, not from a direct source-level CUDA call.
Step 2: Materialize Launch Geometry As Ordinary Variables
Once the host side knows which kernel it is referring to, it needs executable launch dimensions.
REX does not leave this as abstract clause data. It emits normal local declarations such as:
- _threads_per_block_
- _num_blocks_
- and, in the worksharing path, optionally __rex_tripcount

In the common shape, the first two are plain integer locals, initialized from the user’s clause expressions when present or from compiler defaults otherwise.
That means clause-derived launch geometry is no longer living only in the OpenMP node or in a side table. It has become explicit executable state in the host block.
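A minimal sketch of how clause-derived geometry might become ordinary locals; the defaults of 256 teams and 128 threads, and the user_* parameters standing in for clause expressions (0 meaning the clause was absent), are assumptions, not REX's actual numbers:

```c
#include <assert.h>

/* Hedged sketch of clause-derived launch geometry becoming ordinary local
 * state that later shaping logic can refine with plain statements. */
void make_geometry(int user_teams, int user_threads,
                   int *num_blocks, int *threads_per_block) {
  /* Clause expression if the user gave one, otherwise a compiler default. */
  int _num_blocks_        = user_teams   > 0 ? user_teams   : 256;
  int _threads_per_block_ = user_threads > 0 ? user_threads : 128;
  *num_blocks = _num_blocks_;
  *threads_per_block = _threads_per_block_;
}
```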
This design has several advantages.
It gives later host-side shaping logic a place to operate
Because teams and threads exist as ordinary variables, the worksharing path can refine them with ordinary statements and control flow instead of having to encode everything inside one monolithic runtime-call builder.
It keeps user intent visible
If the user specified num_threads(...) or num_teams(...), those expressions are materialized directly into the launch block. Contributors inspecting the lowered source can still see where the launch shape came from.
It keeps launch policy separate from argument packaging
The compiler does not need to decide the final __tgt_kernel_arguments structure first. It can compute launch dimensions locally, then feed the resulting declarations into the later packet builder.
That separation is one reason this stage deserves its own post.
The Worksharing Path Adds Tripcount-Aware Shaping
The worksharing launch block extends the same core structure with loop-aware logic. If the host loop was analyzable earlier in the lowering path, the compiler materializes a __rex_tripcount declaration that captures the analyzed iteration count.
That declaration is not only for bookkeeping. It enables guarded launch shaping when the user did not explicitly specify thread geometry.
The code is careful about that condition: the shaping is guarded so that it only fires when the user did not explicitly request a thread count.
That policy boundary matters. If the user explicitly asked for a particular thread count, REX should preserve it unless it is invalid. If the user did not specify one, the compiler is free to avoid obviously poor choices.
Rounding by a launch granularity
When the tripcount is smaller than the current thread count, the worksharing path computes a rounded thread count based on a launch granularity. The current logic starts from a warp-oriented default of 32, but uses the current block size itself when that block size is smaller than a warp.
Then it rounds the tripcount up to that granularity, clamps it to the existing thread limit, and stores the result back into _threads_per_block_.
This is a good example of the host launch block doing real work rather than merely copying clause expressions into a final call. The launch block is where source-level intent, analyzed loop structure, and safe defaults all meet.
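The shaping described above can be sketched as a small pure function; REX manipulates the generated _threads_per_block_ variable in place rather than calling a helper, but the arithmetic follows the text:

```c
#include <assert.h>
#include <stdint.h>

/* Hedged sketch of the guarded, granularity-based shaping: only shrink the
 * block when the tripcount is smaller than the current thread count, round
 * the tripcount up to a warp-oriented granularity, and never exceed the
 * existing thread limit. */
int64_t shape_threads(int64_t tripcount, int64_t threads_per_block) {
  if (tripcount >= threads_per_block)
    return threads_per_block;                 /* enough work: leave it alone */
  int64_t granularity =
      threads_per_block < 32 ? threads_per_block : 32;  /* warp default */
  int64_t rounded =
      ((tripcount + granularity - 1) / granularity) * granularity;
  return rounded < threads_per_block ? rounded : threads_per_block;
}
```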
Direct fast-path capping for nested loops
The worksharing path also has a second cap.
Earlier in the lowering path, the compiler may set direct_launch_thread_cap based on nested loop depth when a direct target-loop fast path is available. The launch block then applies that cap with an explicit conditional assignment.
Again, the important point is not the specific numbers 128 or 256. The important point is where the policy lives. The host launch block is the place where launch policy becomes executable host code.
Figure 3. Launch shaping belongs in the host launch block because that is where user clauses, analyzed tripcount, and runtime-facing launch variables coexist.
Step 3: Materialize The Launch Payload Around Those Variables
Once kernel identity and launch geometry exist, the host block still needs to package the mapped arguments and related metadata that the runtime launch call will consume.
This is where the code starts to touch:
- __args_base
- __args
- __arg_sizes
- __arg_types
- __arg_num
- and finally __kernel_args
There are two broad cases.
Static map-array materialization
If there are no dynamic map entries, the host launch block materializes literal argument expressions first.
Then it emits normal array declarations with braced initializers.
The same pattern applies to sizes, types, and argument count.
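A hedged sketch of what the statically materialized arrays might look like for a region mapping one array and one scalar; the variable names follow the post, while the mapped data, sizes, and map-type constants are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative mapped data: one array and one scalar. */
double a[1024];
int n = 1024;

/* Static map-array materialization with braced initializers. The type
 * constants are examples (e.g. TO|FROM|TARGET_PARAM = 0x23 in libomptarget's
 * encoding); REX computes them from the map clauses. */
void   *__args_base[2] = { a, &n };
void   *__args[2]      = { a, &n };
int64_t __arg_sizes[2] = { sizeof(a), sizeof(n) };
int64_t __arg_types[2] = { 0x23, 0x21 };
int32_t __arg_num      = 2;
```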
Dynamic map-array materialization
If dynamic entries are present, the lowerer takes a different path: the map arrays are built up at run time rather than emitted as literal braced initializers, and they are cleaned up after the launch.
The result is still a set of declarations that the host block can feed into the runtime packet builder. The difference is only how those declarations are produced and later cleaned up.
The kernel-arguments packet is the next layer down
Both paths converge on buildTargetKernelArgsDeclaration(...), which constructs the local __kernel_args declaration.
This post is intentionally not a deep dive into that helper. The important point here is architectural:
the host launch block owns the ordinary declarations and local variables that feed the runtime packet builder.
The next post can zoom in one level deeper and explain exactly how __tgt_kernel_arguments is assembled and why its field order matters.
Step 4: Emit The Actual Launch Call
After the local launch state exists, the lowerer emits the real runtime call to __tgt_target_kernel(...).
A few details are worth calling out here.
The device id is explicit
REX materializes a __device_id declaration initialized to -1, which is the runtime’s default-device sentinel in this lowering path.
That means the offload call is not relying on some hidden global default inside generated host code. The chosen device selector is explicit in the replacement block.
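A runnable sketch of the call shape; the real __tgt_target_kernel lives in LLVM's offloading runtime and also takes a source-location argument, so a clearly labeled local stub stands in for it here and simply records what the generated block passes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the runtime entry point so the sketch is runnable;
 * it records its arguments instead of launching anything. */
struct recorded { int64_t dev; int32_t teams, threads; void *host; } last;

int rex_target_kernel_stub(int64_t device_id, int32_t num_teams,
                           int32_t thread_limit, void *host_ptr,
                           void *kernel_args) {
  (void)kernel_args;
  last.dev = device_id; last.teams = num_teams;
  last.threads = thread_limit; last.host = host_ptr;
  return 0;  /* success */
}

char __rex_kernel_id = 0;  /* synthetic host-side identity key */

/* Shape of the generated launch: explicit device id, explicit geometry,
 * host-pointer identity, and a kernel-arguments packet (elided here). */
int launch(void) {
  int64_t __device_id = -1;            /* default-device sentinel */
  int _num_blocks_ = 256;              /* illustrative geometry */
  int _threads_per_block_ = 128;
  void *__host_ptr = (void *)&__rex_kernel_id;
  return rex_target_kernel_stub(__device_id, _num_blocks_,
                                _threads_per_block_, __host_ptr, NULL);
}
```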
The launch call is not a CUDA syntax emission
Even though the device artifact is a CUDA kernel, the host-side emission here is not a <<<grid, block>>> call. It is a call into LLVM’s offloading runtime ABI:
- device id
- number of blocks
- threads per block
- host pointer identity
- kernel arguments packet
That distinction is one of the key reasons this stage sits naturally between outlining and the helper/runtime layer posts in the series.
Cleanup stays local to the same block
If dynamic map arrays were created, the lowerer appends cleanup right after the runtime call.
That keeps launch-local allocation and launch-local cleanup in one place rather than scattering cleanup policy elsewhere in the translation unit.
Step 5: Finish The Host-Side Epilogue
The worksharing path may still need a short host-side epilogue after the launch call.
CPU-side final reduction for per-block buffers
If the kernel used per-block reduction buffers, the host block iterates over those generated symbols, emits a CPU-side final reduction helper call, and then frees the temporary reduction buffer.
This is another example of why the host launch block is more than a single runtime call. It is the whole host-side execution envelope around the launch.
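A hedged sketch of that epilogue; final_reduce_sum_int is a hypothetical stand-in for the generated per-variable reduction helper, and the buffer contents are faked since no kernel runs here:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical CPU-side final-combine helper: REX emits a generated helper
 * per reduction variable; this name and signature are illustrative. */
int final_reduce_sum_int(const int *per_block, int num_blocks) {
  int result = 0;
  for (int i = 0; i < num_blocks; ++i)
    result += per_block[i];            /* combine one partial per block */
  return result;
}

/* Shape of the host-side epilogue: reduce the per-block buffer, then free
 * the launch-local temporary in the same block. */
int epilogue_demo(void) {
  int num_blocks = 4;
  int *red_buf = (int *)calloc((size_t)num_blocks, sizeof(int));
  for (int i = 0; i < num_blocks; ++i)
    red_buf[i] = i + 1;                /* pretend the kernel wrote these */
  int sum = final_reduce_sum_int(red_buf, num_blocks);
  free(red_buf);                       /* epilogue frees the temp buffer */
  return sum;
}
```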
Fixups before final replacement
The lowering also performs scope/reference fixups before replacing the original statement. For example, the worksharing path explicitly repairs references to num_blocks_decl because later insertion can otherwise leave unresolved symbols.
The SPMD path uses SageInterface::fixStatement(outlined_driver_body, p_scope); for the same broad reason: once the new host block has been assembled, the compiler needs to ensure it is a consistent AST fragment before it becomes the visible replacement.
What The Host Launch Block Design Buys REX
Looking at the launch block in isolation makes several design choices much clearer.
The lowerer stays explicit
REX does not leave launch decisions trapped in OpenMP clauses or implicit runtime conventions. It emits:
- a concrete host-side kernel identity,
- concrete launch-dimension variables,
- concrete map-array declarations,
- and a concrete runtime call.
That makes the generated source easier to inspect and reason about.
Launch policy stays local
Tripcount-aware shaping, default-thread caps, and reduction-buffer epilogues live in the host block because that is where the compiler has all the relevant information at once:
- user clause explicitness,
- analyzed loop structure,
- runtime-facing launch variables,
- and any launch-local temporaries.
That is a much cleaner design than trying to hide all of this inside a single helper call builder.
The runtime packet is isolated as a deeper layer
The host launch block owns the ordinary source-level state. The packet builder then consumes that state. This layering keeps the compiler readable:
- outlining creates the kernel,
- host launch lowering creates the local launch block,
- runtime-packet lowering turns those locals into ABI-structured data.
That is the right order of explanation, and it is also the right order in the compiler.
Lowered host code becomes testable
Because the target statement is replaced by an ordinary block, lowering tests can look for stable structural invariants:
- whether __host_ptr exists,
- whether launch variables are emitted,
- whether __tgt_target_kernel is present,
- whether tripcount shaping appears only for worksharing constructs,
- and whether the original OpenMP node is gone from the host-side result.
That is much easier to validate than a design where the host side remains partly implicit.
The next post can now zoom one level deeper and stay narrowly focused on the runtime packet itself: how REX builds __args_base, __args, sizes, types, and finally __tgt_kernel_arguments in the exact shape that LLVM’s offloading runtime expects.