How GPU Offloading Lowering Works in REX

Posted on Mar 17, 2025 (Updated on Mar 20, 2026)

Once REX has built SgOmp* nodes, omp_lowering.cpp turns them into runnable GPU offload code. The lowerer dispatches combined directives into a small number of transformation paths, outlines kernels, builds OpenMP runtime map arrays, constructs __tgt_kernel_arguments, emits host launch code and device code, and validates multi-kernel behavior with lowering-specific tests.

The first post in this series explained why REX insists on owning the OpenMP model all the way from pragma text to SgOmp*. The next practical question is what happens after that point. Once the frontend has already parsed directives and built the OpenMP AST, where does GPU execution actually come from?

The short answer is: the lowerer is where OpenMP stops being a directive tree and becomes a set of ordinary source files that can run on the LLVM offloading runtime.

That sounds simple, but the lowerer is doing several different jobs at once:

it decides which lowering path applies to each OpenMP offloading construct;
it outlines device code from the original region body;
it rewrites the host side into explicit runtime calls and data-mapping arrays;
it preserves enough source structure that generated files are still readable and testable;
it keeps multi-kernel and repeated-call behavior correct for real applications.

This post focuses on those jobs inside src/midend/programTransformation/ompLowering/omp_lowering.cpp. The goal is not to enumerate every helper function in the file. The goal is to give a contributor a working mental model to answer key questions about changing the GPU lowerer:

Which transformations are central?
Which runtime contracts matter?
Which tests are supposed to catch a regression?

OpenMP AST nodes flow through wrapper functions into either the SPMD or worksharing lowering path, then into outlined kernels and host runtime launch code. — Figure 1. The lowerer does not implement a separate path for every surface spelling. Most GPU offloading constructs are funneled into either `transOmpTargetSpmd()` or `transOmpTargetSpmdWorksharing()`, which then generate kernels, launch code, and mapping state.

The Lowerer’s Real Job

It is tempting to describe the GPU lowerer as “the phase that emits CUDA.” That is not wrong, but it is incomplete.

The lowerer is really the phase that converts compiler-owned OpenMP structure into an explicit runtime protocol. Before lowering, an offloading construct is still something like:

SgOmpTargetStatement
SgOmpTargetTeamsStatement
SgOmpTargetParallelStatement
SgOmpTargetTeamsDistributeStatement
SgOmpTargetParallelForStatement
SgOmpTargetTeamsDistributeParallelForStatement

After lowering, that same construct has been replaced by ordinary statements and declarations:

an outlined kernel declaration inserted into the surrounding translation unit;
a host-side block that computes launch dimensions;
arrays that describe mapped arguments, sizes, and map kinds;
a __tgt_kernel_arguments object;
a call to __tgt_target_kernel;
and, when needed, data-region calls such as __tgt_target_data_begin and __tgt_target_data_end.

That distinction matters because REX is not lowering into a private execution world. The generated host code speaks the existing LLVM offloading ABI. The lowerer owns the transformation, but the runtime contract on the far side of that transformation is shared with libomptarget.

This is also why omp_lowering.cpp sits at a particularly important boundary in the compiler. The frontend is still mostly about understanding source structure. The helper/runtime layer is mostly about making that structure executable with the target runtime. The lowerer is the bridge between those two worlds.

Stage 1: Dispatching OpenMP Nodes Into A Small Number Of Paths

One of the easiest ways to get lost in omp_lowering.cpp is to think of it as a giant pile of unrelated transformations. It is large, but the offloading part is more regular than it first appears.

The main lowering walk eventually reaches a switch on the OpenMP node kind and routes target constructs into a compact set of entry points:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
case V_SgOmpTargetStatement:
  transOmpTarget(node);
  break;
case V_SgOmpTargetTeamsStatement:
  transOmpTargetTeams(node);
  break;
case V_SgOmpTargetParallelStatement:
  transOmpTargetParallel(node);
  break;
case V_SgOmpTargetTeamsDistributeStatement:
  transOmpTargetTeamsDistribute(node);
  break;
case V_SgOmpTargetParallelForStatement:
  transOmpTargetParallelFor(node);
  break;
case V_SgOmpTargetTeamsDistributeParallelForStatement:
  transOmpTargetTeamsDistributeParallelFor(node);
  break;

The interesting part is what those wrappers do next. They do not each implement their own completely independent lowering pipeline. Instead, they mostly normalize directive-specific clause information and then funnel into one of two core paths:

transOmpTargetSpmd()
transOmpTargetSpmdWorksharing()

That split is the first key mental model for the lowerer.

transOmpTargetSpmd() is the simpler path. It is the route for region-like constructs such as target, target teams, and target parallel, where the body is handled as an SPMD kernel body without an explicit worksharing loop transformation dominating the structure.

transOmpTargetSpmdWorksharing() is the more involved path. It handles constructs where loop worksharing is central:

target teams distribute
target parallel for
target teams distribute parallel for

Those forms still end up as GPU launches, but the lowerer must additionally reason about loop tripcount, worksharing structure, per-block reduction buffers, and when clause-derived launch dimensions should or should not be overridden.

The wrapper functions make that choice explicit. For example, in the current lowering, transOmpTargetTeamsDistributeParallelFor() extracts the clause expressions REX uses for team and thread dimensions and then calls:

1
2
3
transOmpTargetSpmdWorksharing(target, omp_num_teams, omp_num_threads,
                              /*has_explicit_num_teams=*/true,
                              /*has_explicit_num_threads=*/true);

That signature says a lot. The lowerer is not merely receiving raw expressions for launch dimensions. It is also told whether those values came from explicit user clauses, which determines how much launch shaping the compiler is allowed to do later.

This is the same pattern we ended up relying on heavily during the performance work. If a user explicitly asks for a launch configuration, the compiler should preserve it unless it is invalid. If the user did not specify one, the lowerer has room to apply safer defaults or clamp obviously poor launch geometry.

Stage 2: Outlining The Device Kernel

Once the lowerer has picked the right transformation path, the next major step is outlining. This is the point where the original OpenMP region body becomes a callable kernel body in compiler-generated device code.

The offloading lowerer does not handwave this step. It explicitly collects the symbols that need to cross the host/device boundary, decides which parameters are passed by address, and then asks the ROSE outliner to materialize a new function from the original region body.

In the worksharing path, that looks roughly like this:

1
2
3
4
5
6
7
8
SgFunctionDeclaration *result =
    Outliner::generateFunction(body_block, func_name + "kernel__", all_syms,
                               addressOf_syms, restoreVars, nullptr, g_scope);

lowerLiteralTargetKernelParameters(result,
                                   offload_ctx.literal_target_param_syms);
maybeRecordTargetKernelLaunchBounds(result, omp_num_threads);
result->get_functionModifier().setCudaKernel();

A few details are worth paying attention to.

First, the lowerer gives the outlined function a stable, meaningful naming pattern. The generated name is not just a hash or an opaque counter. It incorporates the enclosing function name and source line number, which makes generated files much easier to inspect during debugging and test failures.

Second, the outlined function is not treated as a generic helper. The lowerer marks it as a CUDA kernel by setting the function modifier accordingly. In other words, outlining is not only about moving statements into a new function. It is also about declaring what kind of function that new artifact actually is in the generated device translation unit.

Third, the lowerer performs custom insertion of the outlined declaration into the surrounding translation unit rather than using a generic “append it somewhere later” policy. That insertion discipline is part of why the generated code remains inspectable and deterministic enough for structural tests.

Fourth, loop-based target constructs do one more thing after outlining: they immediately revisit the loop inside the outlined function with transOmpTargetLoopBlock(). That is the point where the loop body stops being merely “the original loop moved into a new function” and becomes a GPU-friendly worksharing kernel body.

This layered approach is important:

outlining decides the kernel boundary;
loop lowering decides how the loop executes inside that kernel boundary.

That separation is what keeps the lowerer extensible. The compiler can reuse the same broad outlining machinery for multiple target forms while still specializing loop execution where necessary.

Stage 3: Building The Host Launch Side

Once the device kernel exists, the lowerer still has to produce the host-side launch block that replaces the original OpenMP directive in the host translation unit.

This part of the generated code is easy to underestimate because it is not as visually dramatic as the kernel itself. In practice, it is one of the most important products of the lowerer. If the launch block is wrong, the kernel may still compile perfectly and yet never receive the right arguments, dimensions, or data mappings.

In the worksharing path, the lowerer creates a host-side block that contains at least the following pieces:

a device id initialized to the runtime’s default-device sentinel;
launch dimensions for threads per block and number of blocks;
optionally a tripcount declaration when the construct carries an analyzable loop iteration count;
host pointers and offload-entry state that identify the generated kernel;
argument arrays for addresses, base addresses, sizes, and map types;
a __tgt_kernel_arguments object;
the final __tgt_target_kernel call.

The launch-shaping part is especially instructive. The lowerer materializes declarations such as rex_threads_per_block, rex_num_blocks, and, for loop offloads, rex_tripcount. If the source did not explicitly request num_threads, the lowerer may cap or round the thread count using the tripcount so it does not launch a wildly oversized block for a tiny loop. That logic lives in the host launch block because this is the first point where source clauses, normalized loop structure, and runtime launch ABI all exist in the same place.

The code examples in this post use schematic local names like rex_device_id rather than implementation-reserved names such as __device_id. The real generated code still uses the runtime-facing ABI where appropriate, but the example reads more clearly if the local variables themselves are not written in implementation namespace style.

A simplified schematic of the host-side shape looks like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
rex_offload_init();

int64_t rex_device_id = -1;
int rex_threads_per_block = thread_limit_from_directive_or_default;
int rex_num_blocks = num_teams_from_directive_or_default;
int64_t rex_tripcount = n;
void *rex_host_ptr = (void *)&outlined_kernel_name;

void *rex_args_base[] = { ... };
void *rex_args[] = { ... };
int64_t rex_arg_sizes[] = { ... };
int64_t rex_arg_types[] = { ... };
struct __tgt_kernel_arguments rex_kernel_args = { ... };

__tgt_target_kernel(rex_device_id,
                    rex_num_blocks,
                    rex_threads_per_block,
                    rex_host_ptr,
                    &rex_kernel_args);

That last call is the decisive handoff from compiler-generated source to runtime execution. The lowerer has done all the work necessary to make the offload explicit. From that point onward, the offloading runtime is executing a protocol, not interpreting a directive.

The GPU lowerer builds mapping arrays and packages them into __tgt_kernel_arguments before launching the kernel. — Figure 2. The launch packet is the heart of host-side lowering. Map clauses become arrays, launch dimensions become explicit declarations, and everything is packed into `__tgt_kernel_arguments` before the final `__tgt_target_kernel` call.

Stage 4: The Runtime Packet Matters More Than The Launch Call

It is easy to focus on the final __tgt_target_kernel call and miss the fact that most of the real work happened just before it.

The lowerer has a dedicated helper, buildTargetKernelArgsDeclaration(), whose entire purpose is to construct the runtime’s launch packet in the exact shape the ABI expects. In the current lowering, the builder receives:

the argument count declaration;
the args_base array;
the args array;
the argument sizes array;
the argument types array;
the launch-dimension declarations;
and optionally the tripcount expression.

It then produces a declaration of type __tgt_kernel_arguments initialized with a braced aggregate. The structure contains, among other fields:

the number of dimensions;
the number of mapped arguments;
pointers to the mapping arrays;
the tripcount;
the grid dimensions;
the block dimensions.

That may sound like a trivial packing step, but it is actually a major compiler contract. If the lowerer gets these fields wrong, the bug will not look like a parsing issue or a source transformation issue. It will look like a runtime failure or an execution mismatch. That is one reason this topic warrants a detailed explanation rather than being treated as a black box between the AST and the GPU.

The map arrays are equally important. For non-dynamic mappings, the lowerer directly materializes arrays such as:

args_base
args
arg_sizes
arg_types

For dynamic mappings, the lowerer switches to a different path and calls helpers such as buildDynamicRuntimeMapArgumentArrays() and later appends cleanup for the dynamic arrays. This separation is crucial because array sections with runtime-defined extents require dynamic handling of the mapping arrays, which adds a layer of complexity not needed for fixed-size mappings.

What matters architecturally is that both cases still feed the same runtime packet. The lowerer does not invent separate launch ABIs for “simple maps” and “dynamic maps.” It always converges on the runtime’s expected argument model.

Stage 5: Target Data Regions And Multi-Kernel Lifetimes

Kernel launch is only half of GPU offloading. The other half is deciding when data should exist on the device and how long the mapping should remain live.

That is where target data and related constructs matter. REX lowers those constructs explicitly as well. transOmpTargetData() builds the same style of map arrays used by kernel launches, then emits:

__tgt_target_data_begin
the original body block
__tgt_target_data_end

When dynamic map entries are involved, it uses the dynamic-array builder and cleanup path here too. That design is one of the cleaner aspects of the current lowerer: data-region lowering and kernel-launch lowering share a common mapping vocabulary instead of each growing their own representation.

This matters for real applications because many benchmark kernels do not appear as isolated single launches. They appear as repeated kernels inside a larger target-data lifetime, or as several different kernels that operate over the same mapped data.

The reduced Rodinia-style lowering tests are especially useful here. One of the clearest examples is rodinia_axpy_multi_like.c, which contains three distinct offloaded loop shapes:

scale_like
axpy_like
bias_like

and then calls axpy_like twice from main.

That case is valuable because it exercises two compiler properties at the same time:

multiple distinct kernels must each produce correct outlined device entries and host launch code;
repeated calls to the same source-level function must keep reusing the correct lowered helper and kernel entry rather than accidentally duplicating or corrupting state.

The lowerer test suite describes that case explicitly as a “three-kernel lowering shape with repeated calls to the same lowered offload helper from main.” That is exactly the sort of behavior a source-to-source compiler must preserve automatically. It is not a user-level workaround. It is compiler-generated structure that has to remain stable across refactors.

A multi-kernel source file lowers into rewritten host code, generated device code, and shared helper files, while tests validate repeated calls and offload-entry integrity. — Figure 3. Lowering a realistic input rarely produces one output artifact. The host translation unit, device kernel file, and shared helper files all participate in the final build, and multi-kernel tests verify that repeated host calls still map to the right generated kernels.

Stage 6: What The Generated Files Actually Represent

A contributor usually understands the lowerer much faster once they stop thinking in terms of “the compiler emitted some code” and start thinking in terms of concrete output files.

For GPU offloading, the lowered output is typically split across several artifacts:

a rewritten host-side rose_*.c file;
a device-side rex_lib_*.cu file containing the generated kernels;
helper sources and headers such as rex_kmp.h, register_cubin.cpp, and xomp_cuda_lib_inlined.cu.

Each artifact has a specific role.

The rewritten host file contains the control flow that replaced the original OpenMP directive. That is where you will see rex_offload_init() inserted near the start of main, map-array declarations, launch-dimension logic, and calls into the offloading runtime.

The device-side file contains the outlined kernel bodies. Those kernels are not generic library kernels. They are compiler-generated versions of the user’s original loop or region body, with the necessary parameter and worksharing structure already baked in.

The helper files are the shared interface layer between generated code and the runtime/toolchain. They are not the main subject of this post, but it is important to see that the lowerer deliberately emits code that expects those helpers to exist. The lowerer is not a self-contained code generator that bypasses the rest of the build.

This is also why the build still feels like a normal compilation pipeline after lowering. REX is a source-to-source compiler, so it emits ordinary source files and lets the downstream compiler and runtime perform their normal roles. That makes debugging much more practical than a monolithic “compile straight to opaque binary blobs” model.

The insertAcceleratorInit() path is a small but telling example. It prepends a call to rex_offload_init() at the beginning of main so one-time offload setup does not affect user timing instrumentation later in the function. That is not a runtime behavior hidden somewhere else in the stack. It is an explicit source transformation that shows up in generated host code and therefore belongs in the lowerer story.

Stage 7: How The Tests Keep The Lowerer Honest

The first post argued that the OpenMP test strategy should mirror the pipeline. The lowerer is where that design becomes especially important.

Lowering bugs are often not obvious syntax bugs. They are shape bugs:

a kernel entry was emitted but not registered correctly;
one mapping array is out of sync with another;
launch geometry changed when it should not have;
target data bookkeeping was inserted in the wrong order;
a repeated host call accidentally refers to the wrong generated helper;
a multi-kernel input now emits only two offload entries instead of three.

Those are not good candidates for fragile golden-file tests. The exact formatting of generated code can change for harmless reasons. What matters are the invariants.

That is why the lowering_rodinia suite is such a good fit for the offloading lowerer. Its README states the intent clearly: validate lowering-specific behavior using reduced Rodinia-like inputs and invariant checks, not unstable reference dumps.

The current cases cover exactly the kinds of lowerer behavior that are easy to regress:

multi-kernel lowering shape;
repeated calls to the same lowered offload helper;
duplicate preamble/include handling;
direct __tgt_target_kernel launch shape;
collapse(2) lowering;
target data and private-clause behavior;
placement of rex_offload_init() before declarations used by timing instrumentation.

That last detail is especially revealing. It shows that the lowerer is not only responsible for broad semantic correctness. It is also responsible for source-level ordering details that can affect benchmark behavior and measurement quality.

For contributors, this yields a practical debugging rule:

if the issue is “the directive parsed wrong,” start before lowering;
if the issue is “the generated host/device structure is wrong,” start in the lowerer and its structural tests;
if the issue is “the launch runs but behaves badly,” inspect both the lowerer output and the runtime helper boundary.

The Lowerer In One Sentence

If the first post reduced the architecture to “REX owns the OpenMP model,” then the second post reduces the offloading implementation to this:

The GPU lowerer turns OpenMP structure into an explicit runtime protocol, while still leaving behind readable source files and testable invariants.

That is why the lowerer has to do more than just emit a kernel. It has to:

preserve directive intent,
route combined constructs into the right transformation path,
outline device code,
build mapping arrays,
construct __tgt_kernel_arguments,
sequence data regions, and
keep multi-kernel behavior stable enough that real benchmark programs still work after regeneration.

The next post in the series should look at the helper/runtime boundary directly: register_cubin.cpp, rex_kmp.h, device-image registration, and how the generated code plugs into the LLVM offloading runtime without giving up compiler ownership of the overall transformation.