How REX Outlines Device Kernels

Posted on Mar 25, 2026

After dispatch chooses the SPMD or worksharing path, REX outlines the target body into a named device kernel, classifies captured symbols, and marks it as CUDA. Loop-specific rewriting happens after outlining, keeping the lowerer reusable across target constructs.

The previous post in this series explained how many SgOmpTarget* surface constructs are funneled into a small number of lowering paths. The next question is what those core paths do first.

The answer is: they outline the target region into a callable device function.

That sentence sounds smaller than it is. In REX, outlining is not a mechanical “move these statements into a helper” step. It is the point where the lowerer decides:

what the kernel boundary is,
which variables must cross that boundary,
which ones travel as pointers versus packed literal values,
what the generated kernel is called,
where that kernel declaration is inserted,
and when loop-specific GPU rewriting is allowed to start.

This post is about that outlining boundary. It stays deliberately focused on one slice of src/midend/programTransformation/ompLowering/omp_lowering.cpp: the part that turns a normalized target body into a real device kernel declaration plus a matching call boundary.

A target region body on the left becomes two artifacts after outlining: a host launch block and a device kernel with a stable name and explicit captured parameters. — Figure 1. Outlining is the boundary where one OpenMP region becomes two explicit compiler artifacts: host launch code and a device kernel. The lowerer has to define that boundary before it can lower execution details inside it.

Why Outlining Is Its Own Stage

Once dispatch has selected either transOmpTargetSpmd() or transOmpTargetSpmdWorksharing(), the lowerer already knows what kind of offloading construct it is handling. But that still does not mean it is ready to rewrite loop execution or emit a runtime launch packet.

There is a missing step in between:

the compiler needs a real function boundary that represents “the device code for this region.”

That is what outlining gives it.

Before outlining, the target body is still just a basic block attached to an OpenMP statement in the host AST. After outlining, REX has a concrete function declaration that can be:

marked as a CUDA kernel,
placed into the surrounding translation unit,
given a stable name,
paired with an offload entry,
and targeted by the host-side runtime launch block.

This is why outlining sits after dispatch but before the final launch-building logic. Dispatch tells the compiler which lowering family applies. Outlining creates the device-side artifact that the rest of the lowering pipeline can refer to explicitly.

That separation is also why the same broad outlining machinery can be shared by both major GPU paths:

region-like target constructs use it to carve out an SPMD kernel body;
loop-dominated target constructs use it to carve out a kernel first and then lower the loop inside that outlined kernel.

If REX skipped that boundary and tried to lower directly from an OpenMP statement into a finished runtime launch plus a fully rewritten GPU loop body in one step, omp_lowering.cpp would be much harder to reason about than it already is.

Step 1: Build A Semantic Capture Set

The first outlining problem is not naming or insertion. It is capture discovery.

REX does not simply walk the region, collect every SgVarRefExp, and pass that list to the outliner. That would be too naive for GPU offloading, because the real boundary is defined by mapping semantics, synthesized device variables, and transport rules, not by raw syntax alone.

The core helper is transOmpMapVariables(...). Both the SPMD and worksharing paths call it before outlining:

1
2
3
all_syms = transOmpMapVariables(
    target, map_variable_list, map_variable_base_list, map_variable_size_list,
    map_variable_type_list, &offload_ctx, &dynamic_map_entries);

That call does several jobs at once.

It resolves mapped arrays into device-side pointer variables

For mapped arrays, the lowerer linearizes them and materializes temporary device pointer names such as _dev_<orig>. Those new symbols are what the outlined kernel should actually receive, not the original host array declarations.

Inside transOmpMapVariables(...), the lowerer:

analyzes map(...) clauses,
categorizes mapped items into array-like versus scalar-like variables,
creates device pointer declarations for mapped arrays,
rewrites body references to use those device variables,
and records the corresponding runtime map arrays for the eventual host launch.

So by the time outlining happens, the kernel boundary is already defined in device-facing terms, not just source-facing terms.

It filters the set to variables that are actually used in the region

The lowerer builds a variable_map for the region scope and only inserts symbols into all_syms when they are actually referenced in the normalized target body. This matters because a map(...) clause can mention more than the final outlined region genuinely needs after normalization and rewriting.

It records special scalar cases that can use literal transport

Some mapped scalar values do not need the full “treat this like addressable storage” path. If a scalar can be passed as a literal target parameter, transOmpMapVariables(...) records it in offload_ctx.literal_target_param_syms.

That decision is important enough that the helper does not keep it as an incidental local fact. It stores it in the offload context so the later outlining step can rewrite the kernel parameter list accordingly.

It adds synthesized reduction storage

The capture set is not limited to symbols that existed in the user’s original source. The lowerer also scans the region body for compiler-generated reduction buffers and inserts them into all_syms.

The exact name pattern depends on the path:

the SPMD path looks for _dev_per_block_*,
the worksharing path looks for __reduction_buffer_*.

That behavior is deliberate. Once the compiler has synthesized temporary storage that the kernel body relies on, those symbols are just as much part of the outlining boundary as a user-written mapped array.

In the worksharing path, the relevant shape looks like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
Rose_STL_Container<SgNode *> nodeList =
    NodeQuery::querySubTree(body_block, V_SgVarRefExp);
for (Rose_STL_Container<SgNode *>::iterator i = nodeList.begin();
     i != nodeList.end(); i++) {
  SgVarRefExp *vRef = isSgVarRefExp((*i));
  std::string var_name_str = vRef->get_symbol()->get_name().getString();
  if (var_name_str.find("__reduction_buffer_", 0) == 0) {
    all_syms.insert(vRef->get_symbol());
    per_block_reduction_syms.insert(vRef->get_symbol());
  }
}

This is one of the easiest places to misunderstand the design. These buffers are not a post-outlining implementation detail. They have to be captured when the outlined function is created, otherwise the kernel signature no longer matches the body the compiler just generated.

A capture-classification diagram showing transOmpMapVariables building all_syms, then splitting them into by-address captures, literal target parameters, device pointers, and synthesized reduction buffers. — Figure 2. The outlining boundary is semantic, not purely syntactic. `all_syms` contains mapped and synthesized values, and later logic splits that set into transport classes rather than treating every capture the same way.

Step 2: Split The Capture Set Into Transport Classes

Once all_syms exists, the lowerer still has another decision to make:

for each captured symbol, should the generic outliner treat it as “use original form” or “use address form”?

That is what addressOf_syms is for.

The classification is conservative by default:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
for (std::set<const SgVariableSymbol *>::iterator iter = all_syms.begin();
     iter != all_syms.end(); iter++) {
  if (!isPointerType((*iter)->get_type()) &&
      !isSgArrayType((*iter)->get_type()) &&
      offload_ctx.literal_target_param_syms.find(
          const_cast<SgVariableSymbol *>(*iter)) ==
          offload_ctx.literal_target_param_syms.end()) {
    addressOf_syms.insert(*iter);
  }
}

That rule says:

pointers and arrays stay in original form,
literal target parameters stay out of the by-address bucket,
ordinary non-array, non-pointer scalars default to address-based transport.

Why by-address classification exists at all

This is where people often get confused, because CUDA kernels do not literally take C++ references in the source-level sense.

The reason is that the ROSE outliner is a generic outlining facility. Its API is expressed in terms of which captured variables should use original form and which should use address form. REX feeds GPU target regions through that generic mechanism instead of inventing a second completely separate outlining system.

So addressOf_syms is not best understood as “the final CUDA ABI.” It is better understood as:

the transport classification the generic outliner should use when materializing the function boundary and the matching call arguments.

That distinction matters. Later code in the GPU path adjusts how those parameters are interpreted or rewritten for device execution.

How the call side uses the classification

After the function has been generated, REX builds the actual call argument list by selecting which symbols should remain in original form:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
std::set<SgInitializedName *> varsUsingOriginalForm;
for (ASTtools::VarSymSet_t::const_iterator iter = all_syms.begin();
     iter != all_syms.end(); iter++) {
  const SgVariableSymbol *current_symbol = *iter;
  if (addressOf_syms.find(current_symbol) == addressOf_syms.end())
    varsUsingOriginalForm.insert(current_symbol->get_declaration());
}

Outliner::appendIndividualFunctionCallArgs(all_syms, varsUsingOriginalForm,
                                           exp_list_exp);

Inside the generic outliner call builder, “original form” means pass the symbol itself. Otherwise, the call builder emits &symbol.

That is why this classification cannot be dismissed as incidental. It directly controls the boundary between the original region and the generated kernel call.

Literal target parameters are a third category, not just an exception

The most interesting case is the set excluded from addressOf_syms: literal_target_param_syms.

These are scalar values that REX has decided can travel as packed literal arguments. After the outlined function is created, lowerLiteralTargetKernelParameters(...) revisits the parameter list and rewrites those parameters into the transport form expected by the offloading path.

That helper does two important things.

First, it prepends a hidden launch-environment parameter:

1
2
3
4
SgInitializedName *kernel_launch_env_param =
    SageBuilder::buildInitializedName("__rex_kernel_launch_env",
                                      buildPointerType(buildVoidType()));
prependArg(params, kernel_launch_env_param);

This exists because LLVM’s __tgt_target_kernel launch ABI passes an additional environment slot even for ordinary kernels. REX makes that parameter explicit so the generated CUDA signature matches the runtime’s real calling convention.

Second, for each literal symbol, it changes the parameter type to a pointer-sized integer transport type, creates a shadow local with the original type, and reconstructs the value with __builtin_memcpy:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
SgType *transport_type =
    get_host_pointer_size_bytes(body) <= 4
        ? static_cast<SgType *>(buildUnsignedIntType())
        : static_cast<SgType *>(buildUnsignedLongLongType());
param->set_type(transport_type);

SgVariableDeclaration *shadow_decl =
    buildVariableDeclaration(shadow_name, original_type, NULL, body);
SgExprStatement *memcpy_stmt = buildFunctionCallStmt(
    "__builtin_memcpy", buildPointerType(buildVoidType()), memcpy_args, body);

This is a useful illustration of the general design:

the generic outliner creates the function boundary,
then GPU-specific lowering repairs or specializes parts of that boundary.

REX is not stuck in one abstraction layer. It deliberately composes them.

Step 3: Materialize A Stable Kernel Artifact

Once the capture sets are ready, the lowerer can finally ask the outliner to generate the kernel function itself:

1
2
3
SgFunctionDeclaration *result =
    Outliner::generateFunction(body_block, func_name + "kernel__", all_syms,
                               addressOf_syms, restoreVars, NULL, g_scope);

Three parts of this step are especially important: naming, kernel marking, and insertion.

Stable names are part of the design

The generated name starts from Outliner::generateFuncName(target) and then appends a human-meaningful suffix:

1
2
3
4
5
6
7
8
9
const Sg_File_Info *info = target->get_startOfConstruct();
SgFunctionDeclaration *enclosing_function =
    getEnclosingFunctionDeclaration(target);
std::string enclosing_function_name =
    enclosing_function->get_name().getString();
std::stringstream statement_line_number;
statement_line_number << info->get_line();
func_name +=
    enclosing_function_name + "__" + statement_line_number.str() + "__";

That suffix encodes:

the enclosing host function name,
and the source line where the original target statement started.

This is not cosmetic. It makes generated files inspectable and testable. When a lowering regression happens, a contributor can often jump from a generated kernel name directly back to the original source location that created it. That is much better than debugging with opaque numbered helpers only.

It also helps multi-kernel programs. If one source function contains several target regions, the generated names remain distinct without becoming unreadable.

The outlined function is explicitly marked as a CUDA kernel

After outlining, the lowerer updates both the defining declaration and the first nondefining declaration:

1
2
3
4
5
lowerLiteralTargetKernelParameters(result,
                                   offload_ctx.literal_target_param_syms);
maybeRecordTargetKernelLaunchBounds(result, omp_num_threads);
result_decl->get_functionModifier().setCudaKernel();
result->get_functionModifier().setCudaKernel();

setCudaKernel() is the point where the outlined function stops being “just another helper” and becomes a real device entry point in the generated CUDA-facing source.

maybeRecordTargetKernelLaunchBounds(...) is worth noticing too. If num_threads is a constant, or at least a safe expression to preserve, REX records launch-bounds information on the outlined kernel. That keeps source-level OpenMP launch information connected to the emitted device function instead of discarding it once the host launch block is built.

Insertion is custom, not delegated to a generic default

One of the clearest signals that REX treats generated code readability as a real engineering concern is that it does not accept the generic outliner insertion policy here.

The lowerer comments on this explicitly and performs a custom insertion:

1
2
3
SgFunctionDeclaration *target_func = const_cast<SgFunctionDeclaration *>(
    SageInterface::getEnclosingFunctionDeclaration(target));
insertStatementAfter(target_func, result);

The reason is simple. The generic insertion path would place the function at the end of the translation unit and prepend a prototype. That is legal, but it is not what REX wants for offloading-generated code.

Instead, REX inserts the outlined kernel next to the enclosing function that produced it. That gives several benefits:

generated code is easier to read in source order,
the relation between host function and kernel stays local,
structural lowering tests remain more deterministic,
and debugging does not require scrolling through an unrelated tail section of helpers.

This is a good example of REX using shared infrastructure without surrendering code-generation quality to the defaults of that infrastructure.

Step 4: In The Worksharing Path, Outline First And Rewrite The Loop Second

The biggest architectural payoff of this design shows up in the loop-dominated path.

In transOmpTargetSpmdWorksharing(), REX does not fully lower the loop before outlining. Instead, it first outlines the region, inserts the kernel, and then revisits the first for loop inside the outlined function:

1
2
3
4
5
6
7
8
9
SgFunctionDeclaration *result =
    Outliner::generateFunction(body_block, func_name + "kernel__", all_syms,
                               addressOf_syms, restoreVars, NULL, g_scope);
...
insertStatementAfter(target_func, result);

Rose_STL_Container<SgNode *> for_loops =
    NodeQuery::querySubTree(result, V_SgForStatement);
transOmpTargetLoopBlock(for_loops[0], NULL, &offload_ctx);

This is one of the most important design decisions in the whole lowerer.

Why not lower the loop first?

Because the loop rewriter needs the kernel boundary to exist already.

Once the loop has been moved into the outlined function, the compiler knows:

what the final captured parameters are,
which mapped variables have become device-facing parameters,
which synthesized reduction buffers are part of the kernel signature,
and where the kernel declaration lives in the translation unit.

Only at that point is the compiler in the right place to perform the loop-specific GPU rewrite.

If REX tried to fuse these two steps, it would have to mix together:

outlining concerns,
parameter transport concerns,
kernel declaration concerns,
and loop worksharing concerns

inside a single transformation block.

That would make it much harder to reuse the outlining stage for non-loop target constructs and much harder to test the boundary between “kernel creation” and “loop execution rewriting.”

Host-side loop analysis can still happen earlier

This does not mean the worksharing path ignores loop information until after outlining. It still performs host-side analysis beforehand when it needs tripcount or launch-shaping information:

inspect the first host for loop,
compute a tripcount expression when possible,
detect a direct target-loop fast path,
choose a thread cap when the user did not provide explicit geometry.

But that analysis is different from the final loop rewrite.

The clean mental model is:

analyze the host loop early if launch shaping needs it,
outline the region into a device kernel,
lower the loop inside that outlined kernel.

Those are three related tasks, but they are not the same task.

A sequence diagram showing wrapper dispatch, map-variable collection, outlining, CUDA-kernel marking, then a second pass over the loop inside the outlined function. — Figure 3. In the worksharing path, loop analysis and loop lowering are intentionally separated. REX may inspect the host loop early, but it lowers the loop body only after the kernel has been outlined and inserted.

What This Buys REX

Once you look at the outlining stage directly, a few broader design choices become much clearer.

It keeps the lowerer generic

The same outline-first structure works for:

plain target,
target teams,
target parallel,
target parallel for,
target teams distribute,
target teams distribute parallel for.

Those constructs do not all lower the same way internally, but they do share a common need for a stable device function boundary. Outlining provides that common layer.

It makes generated code debuggable

Stable names, explicit kernel marking, and custom insertion are not luxuries. They are what make source-to-source compilation workable for real debugging. Contributors can inspect generated files and still understand where a kernel came from and why it is located where it is.

It supports multi-kernel and repeated-call correctness

Because every target region becomes a distinct kernel artifact with a stable identity, REX can handle programs that:

generate multiple kernels in one translation unit,
call the same lowered function many times,
and mix different offloading constructs in the same benchmark.

That is exactly the kind of behavior later lowering tests and benchmark validations need to exercise.

It preserves a clean boundary for future changes

If a future optimization wants to improve:

capture classification,
literal scalar transport,
launch-bounds attachment,
reduction-buffer handling,
or loop rewriting,

it can usually do that without redefining the entire lowerer. The outlining stage already gives those optimizations a stable place to attach.

That is the real value of this design. It is not merely “we call the outliner here.” It is:

REX deliberately uses outlining as the boundary that turns OpenMP structure into a device-kernel object model.

Once that object model exists, the rest of GPU lowering becomes much easier to reason about.

The next post can move one step further down the pipeline and look at the host launch side in the same focused way: how the lowerer packages map arrays, launch dimensions, and runtime metadata once the device kernel already exists.