How REX Builds the GPU Host Launch Block
Every lowered target region ultimately becomes an ordinary host block that ends in a single explicit call to __tgt_target_kernel. The worksharing path adds tripcount-aware launch shaping and reduction cleanup, but the overall design stays the same: generate explicit host code rather than rely on implicit runtime interpretation.

The previous post in this series focused on the outlining boundary: how a target region becomes a real device kernel with a stable name and an explicit capture set.
That still leaves an important practical question:
once the kernel exists, what exactly replaces the original OpenMP statement in the host code?
The answer is not “the runtime figures it out.” REX emits an ordinary host-side block of declarations and statements that performs the launch explicitly.
This post zooms in on that block. It covers the host-side artifact created after outlining, not the device kernel itself. In src/midend/programTransformation/ompLowering/omp_lowering.cpp, this is the code that:
- identifies the outlined kernel on the host side,
- creates explicit launch-dimension variables,
- packages mapped arguments,
- emits the final __tgt_target_kernel(...) call,
- and replaces the original omp target-family statement with ordinary host statements.
It is a useful stage to isolate because it sits at a very specific boundary in the lowerer:
- the kernel already exists,
- the runtime packet is not yet fully explained,
- and the original OpenMP statement is about to disappear from the host AST.
Figure 1. The host launch block is the host-side counterpart to outlining. Registration happens once near program startup, then each lowered target region becomes an ordinary host block that launches a specific kernel.
Why The Host Launch Block Exists At All
Once REX has created an outlined kernel, it still has not finished lowering the original target construct. The host translation unit still contains an OpenMP statement node such as:
- SgOmpTargetStatement
- SgOmpTargetTeamsStatement
- SgOmpTargetParallelStatement
- SgOmpTargetParallelForStatement
- SgOmpTargetTeamsDistributeParallelForStatement
That node cannot remain in place if the goal is to emit ordinary host source that can be compiled and linked against LLVM’s offloading runtime.
So the lowerer produces a replacement artifact: a basic block that contains the host-side steps required to launch the already-outlined kernel. That block is not a conceptual runtime description. It is literal generated source.
The overall host-side story therefore has two layers:
- One-time process initialization. REX inserts rex_offload_init() near the top of main so CUBIN registration and runtime setup happen once, before user timing declarations.
- Per-target launch blocks. Each lowered target region becomes a local host block that computes launch state and issues a single explicit offload call.
That separation matters. The launch block is hot-path code that may run many times. Image registration is not repeated there.
The initialization insertion is handled elsewhere in omp_lowering.cpp.
By the time a launch block runs, the runtime helper layer is expected to be ready already. That is why the per-region host block can focus entirely on the actual launch and associated data mapping state.
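As a minimal runnable sketch of that two-layer split (the helper body and the launch-block contents here are stand-ins, not REX's actual output):

```c
#include <assert.h>

/* Stand-in for REX's one-time runtime setup helper; the real body performs
 * CUBIN registration and lives in generated support code, not here. */
static int rex_runtime_ready = 0;
void rex_offload_init(void) { rex_runtime_ready = 1; }

/* Shape of a per-region launch block: it assumes initialization already
 * happened and owns only launch-local state. */
int launch_block(void) {
  assert(rex_runtime_ready);   /* init is a precondition, not repeated here */
  int _num_blocks_ = 256;      /* illustrative launch-local values */
  int _threads_per_block_ = 128;
  return _num_blocks_ * _threads_per_block_;
}
```

The point of the split is visible even in this toy form: the launch block can run many times, and none of the one-time registration work appears inside it.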
The Launch Block Is The Real Replacement Artifact
One of the most important details in the lowering code is that the original target statement does not remain as a wrapper around generated statements. It is replaced.
In both the simpler SPMD path and the worksharing path, the lowering ends with the same kind of action: a single statement-replacement call that swaps the original OpenMP node for the generated host block.
That call is easy to skim past, but it is the concrete moment when “OpenMP directive” becomes “ordinary host code.”
This is one of the reasons source-to-source lowering in REX is relatively inspectable. If a contributor opens the lowered host file, they do not have to imagine an invisible runtime interpretation layer. They can read the generated block and see:
- where the kernel identity came from,
- how many teams and threads were chosen,
- what argument arrays were materialized,
- and where the final offload call happens.
The exact construction of outlined_driver_body differs slightly by lowering path:
- the SPMD path constructs a fresh basic block;
- the worksharing path reuses the normalized target body block as the scaffold for the replacement.
But that implementation detail does not change the externally visible result. In both cases, the user-level OpenMP region is turned into explicit host statements.
Step 1: Give The Host Side A Stable Kernel Identity
The first job of the host launch block is not to pick a launch size. It is to answer a more basic question:
which exact kernel is this host block trying to launch?
REX answers that with a small but important pattern built around a synthetic global symbol and a matching offload entry.
A synthetic host-side key
After the kernel has been outlined and named, the lowerer creates a single-byte global declaration that serves as a host-side key.
That declaration is not the kernel body. It is a host-side identity token.
REX then builds a __tgt_offload_entry object whose addr field points at that synthetic symbol.
The entry itself is emitted into the omp_offloading_entries section.
This is the same offload-entry table that the helper/runtime layer later uses when registering the CUBIN with libomptarget. So the host launch block is not pointing directly at a CUDA function pointer in the obvious C sense. It is participating in the runtime’s host-entry identity model.
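A hedged sketch of this identity pattern, using the classic libomptarget entry layout; the kernel name __rex_kernel_saxpy is invented for illustration, and in real generated code the entry would additionally be placed in the omp_offloading_entries section via a section attribute:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Classic libomptarget offload-entry layout, defined locally here so the
 * sketch is self-contained; this is not the project's actual header. */
struct __tgt_offload_entry {
  void    *addr;     /* points at the host-side identity symbol */
  char    *name;     /* must match the device-side kernel name */
  size_t   size;     /* 0 for functions, nonzero for globals */
  int32_t  flags;
  int32_t  reserved;
};

/* Synthetic one-byte key standing in for the kernel on the host side.
 * The name is illustrative; REX derives it from the outlined kernel name. */
char __rex_kernel_saxpy_id = 0;

/* The entry ties the synthetic symbol to the kernel name the runtime will
 * look up in the registered device image. */
struct __tgt_offload_entry saxpy_entry = {
  &__rex_kernel_saxpy_id, "__rex_kernel_saxpy", 0, 0, 0
};
```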
The host pointer used for launch
Inside the actual replacement block, the lowerer materializes a local __host_ptr variable derived from that synthetic global symbol.
That __host_ptr value is what later flows into __tgt_target_kernel(...).
This is a good example of why it is worth separating the host launch block from the kernel body conceptually. The host block is not “calling a CUDA function” in the direct source-language sense. It is packaging a host-visible identity handle that the offloading runtime can match back to the registered device image and its entry table.
Preserving the relation between source and generated artifacts
REX prepends both the synthetic id variable and the offload entry as global declarations while preserving leading preprocessing information.
This is not only about correctness. It also keeps generated files inspectable. The host-side identity declarations live near the generated kernel and remain stable enough for structural tests and debugging.
Figure 2. The host launch block identifies kernels through the runtime’s offload-entry model. The local __host_ptr is derived from a synthetic global key, not from a direct source-level CUDA call.
Step 2: Materialize Launch Geometry As Ordinary Variables
Once the host side knows which kernel it is referring to, it needs executable launch dimensions.
REX does not leave this as abstract clause data. It emits normal local declarations such as:
- _threads_per_block_
- _num_blocks_
- and, in the worksharing path, optionally __rex_tripcount

In the common shape, the first two are plain integer locals, initialized from the user’s clause expressions when present or from compiler defaults otherwise.
That means clause-derived launch geometry is no longer living only in the OpenMP node or in a side table. It has become explicit executable state in the host block.
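A minimal sketch of how clause-derived geometry might become ordinary locals; the defaults of 256 teams and 128 threads, and the user_* parameters standing in for clause expressions (0 meaning the clause was absent), are assumptions, not REX's actual numbers:

```c
#include <assert.h>

/* Hedged sketch of clause-derived launch geometry becoming ordinary local
 * state that later shaping logic can refine with plain statements. */
void make_geometry(int user_teams, int user_threads,
                   int *num_blocks, int *threads_per_block) {
  /* Clause expression if the user gave one, otherwise a compiler default. */
  int _num_blocks_        = user_teams   > 0 ? user_teams   : 256;
  int _threads_per_block_ = user_threads > 0 ? user_threads : 128;
  *num_blocks = _num_blocks_;
  *threads_per_block = _threads_per_block_;
}
```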
This design has several advantages.
It gives later host-side shaping logic a place to operate
Because teams and threads exist as ordinary variables, the worksharing path can refine them with ordinary statements and control flow instead of having to encode everything inside one monolithic runtime-call builder.
It keeps user intent visible
If the user specified num_threads(...) or num_teams(...), those expressions are materialized directly into the launch block. Contributors inspecting the lowered source can still see where the launch shape came from.
It keeps launch policy separate from argument packaging
The compiler does not need to decide the final __tgt_kernel_arguments structure first. It can compute launch dimensions locally, then feed the resulting declarations into the later packet builder.
That separation is one reason this stage deserves its own post.
The Worksharing Path Adds Tripcount-Aware Shaping
The worksharing launch block extends the same core structure with loop-aware logic. If the host loop was analyzable earlier in the lowering path, the compiler materializes a __rex_tripcount declaration that captures the analyzed iteration count.
That declaration is not only for bookkeeping. It enables guarded launch shaping when the user did not explicitly specify thread geometry.
The code is careful about that condition: the shaping is guarded so that it only fires when the user did not explicitly request a thread count.
That policy boundary matters. If the user explicitly asked for a particular thread count, REX should preserve it unless it is invalid. If the user did not specify one, the compiler is free to avoid obviously poor choices.
Rounding by a launch granularity
When the tripcount is smaller than the current thread count, the worksharing path computes a rounded thread count based on a launch granularity. The current logic starts from a warp-oriented default of 32, but uses the current block size itself when that block size is smaller than a warp.
Then it rounds the tripcount up to that granularity, clamps it to the existing thread limit, and stores the result back into _threads_per_block_.
This is a good example of the host launch block doing real work rather than merely copying clause expressions into a final call. The launch block is where source-level intent, analyzed loop structure, and safe defaults all meet.
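The shaping described above can be sketched as a small pure function; REX manipulates the generated _threads_per_block_ variable in place rather than calling a helper, but the arithmetic follows the text:

```c
#include <assert.h>
#include <stdint.h>

/* Hedged sketch of the guarded, granularity-based shaping: only shrink the
 * block when the tripcount is smaller than the current thread count, round
 * the tripcount up to a warp-oriented granularity, and never exceed the
 * existing thread limit. */
int64_t shape_threads(int64_t tripcount, int64_t threads_per_block) {
  if (tripcount >= threads_per_block)
    return threads_per_block;                 /* enough work: leave it alone */
  int64_t granularity =
      threads_per_block < 32 ? threads_per_block : 32;  /* warp default */
  int64_t rounded =
      ((tripcount + granularity - 1) / granularity) * granularity;
  return rounded < threads_per_block ? rounded : threads_per_block;
}
```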
Direct fast-path capping for nested loops
The worksharing path also has a second cap.
Earlier in the lowering path, the compiler may set direct_launch_thread_cap based on nested loop depth when a direct target-loop fast path is available. The launch block then applies that cap with an explicit conditional assignment.
Again, the important point is not the specific numbers 128 or 256. The important point is where the policy lives. The host launch block is the place where launch policy becomes executable host code.
Figure 3. Launch shaping belongs in the host launch block because that is where user clauses, analyzed tripcount, and runtime-facing launch variables coexist.
Step 3: Materialize The Launch Payload Around Those Variables
Once kernel identity and launch geometry exist, the host block still needs to package the mapped arguments and related metadata that the runtime launch call will consume.
This is where the code starts to touch:
- __args_base
- __args
- __arg_sizes
- __arg_types
- __arg_num
- and finally __kernel_args
There are two broad cases.
Static map-array materialization
If there are no dynamic map entries, the host launch block materializes literal argument expressions first.
Then it emits normal array declarations with braced initializers.
The same pattern applies to sizes, types, and argument count.
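A hedged sketch of what the statically materialized arrays might look like for a region mapping one array and one scalar; the variable names follow the post, while the mapped data, sizes, and map-type constants are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative mapped data: one array and one scalar. */
double a[1024];
int n = 1024;

/* Static map-array materialization with braced initializers. The type
 * constants are examples (e.g. TO|FROM|TARGET_PARAM = 0x23 in libomptarget's
 * encoding); REX computes them from the map clauses. */
void   *__args_base[2] = { a, &n };
void   *__args[2]      = { a, &n };
int64_t __arg_sizes[2] = { sizeof(a), sizeof(n) };
int64_t __arg_types[2] = { 0x23, 0x21 };
int32_t __arg_num      = 2;
```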
Dynamic map-array materialization
If dynamic entries are present, the lowerer takes a different path: the map arrays are built up at run time rather than emitted as literal braced initializers, and they are cleaned up after the launch.
The result is still a set of declarations that the host block can feed into the runtime packet builder. The difference is only how those declarations are produced and later cleaned up.
The kernel-arguments packet is the next layer down
Both paths converge on buildTargetKernelArgsDeclaration(...), which constructs the local __kernel_args declaration.
This post is intentionally not a deep dive into that helper. The important point here is architectural:
the host launch block owns the ordinary declarations and local variables that feed the runtime packet builder.
The next post can zoom in one level deeper and explain exactly how __tgt_kernel_arguments is assembled and why its field order matters.
Step 4: Emit The Actual Launch Call
After the local launch state exists, the lowerer emits the real runtime call to __tgt_target_kernel(...).
A few details are worth calling out here.
The device id is explicit
REX materializes a __device_id declaration initialized to -1, which is the runtime’s default-device sentinel in this lowering path.
That means the offload call is not relying on some hidden global default inside generated host code. The chosen device selector is explicit in the replacement block.
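A runnable sketch of the call shape; the real __tgt_target_kernel lives in LLVM's offloading runtime and also takes a source-location argument, so a clearly labeled local stub stands in for it here and simply records what the generated block passes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the runtime entry point so the sketch is runnable;
 * it records its arguments instead of launching anything. */
struct recorded { int64_t dev; int32_t teams, threads; void *host; } last;

int rex_target_kernel_stub(int64_t device_id, int32_t num_teams,
                           int32_t thread_limit, void *host_ptr,
                           void *kernel_args) {
  (void)kernel_args;
  last.dev = device_id; last.teams = num_teams;
  last.threads = thread_limit; last.host = host_ptr;
  return 0;  /* success */
}

char __rex_kernel_id = 0;  /* synthetic host-side identity key */

/* Shape of the generated launch: explicit device id, explicit geometry,
 * host-pointer identity, and a kernel-arguments packet (elided here). */
int launch(void) {
  int64_t __device_id = -1;            /* default-device sentinel */
  int _num_blocks_ = 256;              /* illustrative geometry */
  int _threads_per_block_ = 128;
  void *__host_ptr = (void *)&__rex_kernel_id;
  return rex_target_kernel_stub(__device_id, _num_blocks_,
                                _threads_per_block_, __host_ptr, NULL);
}
```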
The launch call is not a CUDA syntax emission
Even though the device artifact is a CUDA kernel, the host-side emission here is not a <<<grid, block>>> call. It is a call into LLVM’s offloading runtime ABI:
- device id
- number of blocks
- threads per block
- host pointer identity
- kernel arguments packet
That distinction is one of the key reasons this stage sits naturally between outlining and the helper/runtime layer posts in the series.
Cleanup stays local to the same block
If dynamic map arrays were created, the lowerer appends cleanup right after the runtime call.
That keeps launch-local allocation and launch-local cleanup in one place rather than scattering cleanup policy elsewhere in the translation unit.
Step 5: Finish The Host-Side Epilogue
The worksharing path may still need a short host-side epilogue after the launch call.
CPU-side final reduction for per-block buffers
If the kernel used per-block reduction buffers, the host block iterates over those generated symbols, emits a CPU-side final reduction helper call, and then frees the temporary reduction buffer.
This is another example of why the host launch block is more than a single runtime call. It is the whole host-side execution envelope around the launch.
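A hedged sketch of that epilogue; final_reduce_sum_int is a hypothetical stand-in for the generated per-variable reduction helper, and the buffer contents are faked since no kernel runs here:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical CPU-side final-combine helper: REX emits a generated helper
 * per reduction variable; this name and signature are illustrative. */
int final_reduce_sum_int(const int *per_block, int num_blocks) {
  int result = 0;
  for (int i = 0; i < num_blocks; ++i)
    result += per_block[i];            /* combine one partial per block */
  return result;
}

/* Shape of the host-side epilogue: reduce the per-block buffer, then free
 * the launch-local temporary in the same block. */
int epilogue_demo(void) {
  int num_blocks = 4;
  int *red_buf = (int *)calloc((size_t)num_blocks, sizeof(int));
  for (int i = 0; i < num_blocks; ++i)
    red_buf[i] = i + 1;                /* pretend the kernel wrote these */
  int sum = final_reduce_sum_int(red_buf, num_blocks);
  free(red_buf);                       /* epilogue frees the temp buffer */
  return sum;
}
```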
Fixups before final replacement
The lowering also performs scope/reference fixups before replacing the original statement. For example, the worksharing path explicitly repairs references to num_blocks_decl because later insertion can otherwise leave unresolved symbols.
The SPMD path uses SageInterface::fixStatement(outlined_driver_body, p_scope); for the same broad reason: once the new host block has been assembled, the compiler needs to ensure it is a consistent AST fragment before it becomes the visible replacement.
What The Host Launch Block Design Buys REX
Looking at the launch block in isolation makes several design choices much clearer.
The lowerer stays explicit
REX does not leave launch decisions trapped in OpenMP clauses or implicit runtime conventions. It emits:
- a concrete host-side kernel identity,
- concrete launch-dimension variables,
- concrete map-array declarations,
- and a concrete runtime call.
That makes the generated source easier to inspect and reason about.
Launch policy stays local
Tripcount-aware shaping, default-thread caps, and reduction-buffer epilogues live in the host block because that is where the compiler has all the relevant information at once:
- user clause explicitness,
- analyzed loop structure,
- runtime-facing launch variables,
- and any launch-local temporaries.
That is a much cleaner design than trying to hide all of this inside a single helper call builder.
The runtime packet is isolated as a deeper layer
The host launch block owns the ordinary source-level state. The packet builder then consumes that state. This layering keeps the compiler readable:
- outlining creates the kernel,
- host launch lowering creates the local launch block,
- runtime-packet lowering turns those locals into ABI-structured data.
That is the right order of explanation, and it is also the right order in the compiler.
Lowered host code becomes testable
Because the target statement is replaced by an ordinary block, lowering tests can look for stable structural invariants:
- whether __host_ptr exists,
- whether launch variables are emitted,
- whether __tgt_target_kernel is present,
- whether tripcount shaping appears only for worksharing constructs,
- and whether the original OpenMP node is gone from the host-side result.
That is much easier to validate than a design where the host side remains partly implicit.
The next post can now zoom one level deeper and stay narrowly focused on the runtime packet itself: how REX builds __args_base, __args, sizes, types, and finally __tgt_kernel_arguments in the exact shape that LLVM’s offloading runtime expects.