How REX Lowers `target`, `target teams`, and `target parallel` Through The SPMD Path

Posted on Apr 7, 2026 (Updated on Apr 17, 2026)

Not every GPU offloading construct in REX goes through the worksharing loop path. Region-like constructs such as target, target teams, and target parallel are normalized by small wrappers and then lowered through transOmpTargetSpmd(). That branch still performs full map-variable analysis, outlining, literal-parameter lowering, offload-entry creation, and __tgt_kernel_arguments packing, but it does not compute a loop tripcount or call transOmpTargetLoopBlock(). The kernel body is launched as a simpler SPMD region, with teams and threads coming directly from the wrapper-selected expressions.

The previous post in this series covered the worksharing side of GPU lowering: once REX outlines a target loop kernel, transOmpTargetLoopBlock() can rewrite the loop into a direct grid-stride form or fall back to the older XOMP scheduler path.

That is only half of the lowering story.

The other major branch is the one that does not treat the region as a canonical worksharing loop at all.

This is the path for:

#pragma omp target
#pragma omp target teams
#pragma omp target parallel

All three eventually funnel into the same helper:

1
2
void transOmpTargetSpmd(SgNode *node, SgExpression *omp_num_teams,
                        SgExpression *omp_num_threads)

That branch still does real GPU lowering:

it outlines the target body,
collects and classifies mapped symbols,
builds the host launch block,
creates offload entries,
packages the runtime argument arrays,
and calls __tgt_target_kernel.

But it does not do the extra worksharing-specific stages:

no host-side loop tripcount analysis,
no __rex_tripcount,
no direct grid-stride loop rewrite,
no transOmpTargetLoopBlock() at all.

This post is about that simpler but still important branch.

Three surface constructs, target, target teams, and target parallel, each normalize launch expressions and then flow into the shared transOmpTargetSpmd lowering path. — Figure 1. The SPMD path is the non-loop branch of GPU lowering. Several surface OpenMP forms normalize into the same helper, which then handles outlining, mapping, offload-entry generation, and runtime launch building.

Which Constructs Actually Use This Path

The surface wrappers are small, and that is part of their value. They make the launch-policy normalization explicit before the shared lowering code starts.

Plain `target`

The plain target form forces the narrowest launch default:

1
2
3
4
5
6
void transOmpTarget(SgNode *node) {
  SgOmpTargetStatement *target = isSgOmpTargetStatement(node);
  SgExpression *omp_num_teams = buildIntVal(1);
  SgExpression *omp_num_threads = buildIntVal(1);
  transOmpTargetSpmd(target, omp_num_teams, omp_num_threads);
}

This says a lot about how REX interprets the construct in the current offloading model. A plain target region becomes a single-team, single-thread launch unless a richer construct changed the execution model earlier.

`target teams`

The target teams wrapper preserves the num_teams clause but keeps thread count at 1:

1
2
3
4
5
6
7
8
void transOmpTargetTeams(SgNode *node) {
  SgOmpTargetTeamsStatement *target = isSgOmpTargetTeamsStatement(node);
  ...
  SgExpression *omp_num_teams =
      copyExpression(num_teams_clause->get_expression());
  SgExpression *omp_num_threads = buildIntVal(1);
  transOmpTargetSpmd(target, omp_num_teams, omp_num_threads);
}

That is the first clue that the SPMD path is about launch geometry, not only about directive spelling. target teams and target parallel do not become two unrelated lowering implementations. They become the same lowered shape with a different teams-versus-threads choice at the wrapper boundary.

`target parallel`

The target parallel wrapper makes the opposite choice:

1
2
3
4
5
6
7
void transOmpTargetParallel(SgNode *node) {
  SgOmpTargetParallelStatement *target = isSgOmpTargetParallelStatement(node);
  SgExpression *omp_num_teams = buildIntVal(1);
  SgExpression *omp_num_threads =
      copyExpression(num_threads_clause->get_expression());
  transOmpTargetSpmd(target, omp_num_teams, omp_num_threads);
}

Now the launch stays at one team and uses the source’s num_threads clause for block width.

This wrapper trio is one of the cleanest parts of the branch. The shared lowering function does not need to rediscover which surface construct it came from. The wrapper has already converted the source spelling into the exact launch expressions that matter to the host launch block.

Why This Is Called The SPMD Path

The name can be misleading if you assume it means “there is no loop in the body.”

That is not what it means.

The branch is SPMD in the sense that the region body becomes the kernel body. The lowerer does not identify one dominant worksharing loop and then transform that loop into a special execution model. It outlines the region and launches it as a CUDA kernel more directly.

So a construct like:

1
2
3
#pragma omp target parallel
for (i = 0; i < len - 1; i++)
  a[i] = a[i + 1] + 1;

still has a for loop in the region body. The difference is that this branch does not hand that loop to transOmpTargetLoopBlock(). It stays an ordinary loop inside the outlined kernel body rather than becoming the explicit worksharing loop object that the direct grid-stride path specializes.

That is the central architectural distinction:

transOmpTargetSpmdWorksharing(...) assumes loop worksharing is the dominant structural fact.
transOmpTargetSpmd(...) assumes the region itself is the dominant structural fact.

Step 1: Stabilize The Region Body Before Outlining

The SPMD helper begins by ensuring the target statement has its own basic-block body:

1
2
3
SgBasicBlock *omp_target_stmt_body_block =
    ensureBasicBlockAsBodyOfOmpBodyStmt(target);
ROSE_ASSERT(isSgBasicBlock(target->get_body()));

This is the same general discipline used elsewhere in the lowerer: create a clean insertion scope before rewriting begins. It matters here because the helper is about to introduce:

temporary mapping declarations,
launch-dimension declarations,
host pointer declarations,
and, later, the whole outlined driver block that replaces the original directive.

The function also cuts preprocessing information off the original directive early:

1
2
3
cutPreprocessingInfo(target, PreprocessingInfo::before, save_buf1);
cutPreprocessingInfo(target, PreprocessingInfo::after, save_buf2);
cutPreprocessingInfo(target, PreprocessingInfo::inside, save_buf_inside);

That is not just housekeeping. The lowerer already learned the hard way that outlining and replacement can disturb conditional guards and attached preprocessing fragments. This branch therefore preserves them explicitly and pastes them back after the final replacement.

Step 2: Preprocess The Body And Discover Captures

The next part will look familiar if you have read the outlining post, and that is the point. The SPMD branch shares the same broad outlining and mapping vocabulary as the worksharing branch.

1
2
3
Outliner::enable_classic = true;
SgBasicBlock *body_block = Outliner::preprocess(body);
transOmpVariables(target, body_block);

Then the helper builds the capture set and the runtime mapping lists:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
ASTtools::VarSymSet_t all_syms;
ASTtools::VarSymSet_t addressOf_syms;

SgExprListExp *map_variable_list = buildExprListExp();
SgExprListExp *map_variable_base_list = buildExprListExp();
SgExprListExp *map_variable_size_list = buildExprListExp();
SgExprListExp *map_variable_type_list = buildExprListExp();
std::vector<ExpandedMapEntry> dynamic_map_entries;

all_syms = transOmpMapVariables(
    target, map_variable_list, map_variable_base_list, map_variable_size_list,
    map_variable_type_list, &offload_ctx, &dynamic_map_entries);

This is an important point in the architecture.

The SPMD path is simpler than the worksharing path, but it is not a second-class path. It still gets:

full map-clause analysis,
dynamic map-entry expansion when required,
literal target parameter tracking,
and the same classification between original-form and address-form captures.

That shared structure is good design. If each branch invented its own capture and map-array logic, the lowerer would be much harder to reason about.

The capture classification stays the same

Once all_syms exists, the SPMD branch uses the same conservative rule for addressOf_syms:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
for (std::set<const SgVariableSymbol *>::iterator iter = all_syms.begin();
     iter != all_syms.end(); iter++) {
  if (!isPointerType((*iter)->get_type()) &&
      !isSgArrayType((*iter)->get_type()) &&
      offload_ctx.literal_target_param_syms.find(
          const_cast<SgVariableSymbol *>(*iter)) ==
          offload_ctx.literal_target_param_syms.end()) {
    addressOf_syms.insert(*iter);
  }
}

That means:

pointers and arrays stay in original form,
literal target parameters are kept out of the by-address bucket,
and ordinary scalar captures default to address-based transport through the generic outliner API.

So although this post is not about literal parameter packing specifically, it is worth noticing that the SPMD path still participates in that machinery. lowerLiteralTargetKernelParameters(...) runs here just as it does in the worksharing branch.

Synthesized reduction storage still matters

The SPMD helper also scans for compiler-generated per-block reduction buffers with the _dev_per_block_ name pattern and adds them to all_syms.

That is a useful reminder that “non-loop branch” does not mean “trivial branch.” Even when the dominant structure is a region rather than a canonical worksharing loop, the outlined kernel may still rely on compiler-generated state that has to cross the outlining boundary.

Step 3: Outline The Region And Mark It As A CUDA Kernel

Once the capture sets are ready, the branch outlines the target body:

1
2
3
SgFunctionDeclaration *result =
    Outliner::generateFunction(body_block, func_name + "kernel__", all_syms,
                               addressOf_syms, restoreVars, NULL, g_scope);

Then it applies the GPU-specific adjustments:

1
2
3
4
lowerLiteralTargetKernelParameters(result,
                                   offload_ctx.literal_target_param_syms);
maybeRecordTargetKernelLaunchBounds(result, omp_num_threads);
result->get_functionModifier().setCudaKernel();

This is one of the best ways to understand what the SPMD branch is not doing.

It is not:

descending into the outlined function and rewriting the first for loop,
computing a host-visible tripcount from that loop,
or introducing any of the direct grid-stride or XOMP scheduler machinery.

Instead, once the region body has been outlined and marked as a CUDA kernel, the body mostly stays in the region-shaped form it already had.

That is why this branch is a useful contrast to the worksharing post. There, outlining was just the boundary before a loop-specific rewrite. Here, outlining is much closer to the final device-shape boundary itself.

Custom insertion still matters

The SPMD path also uses the same custom insertion strategy:

1
2
3
SgFunctionDeclaration *target_func = const_cast<SgFunctionDeclaration *>(
    SageInterface::getEnclosingFunctionDeclaration(target));
insertStatementAfter(target_func, result);

This keeps the generated kernel next to the enclosing function rather than letting a generic insertion policy append it somewhere farther away with a separate prototype. That is consistent with the rest of the lowering design: generated code should stay inspectable and structurally local enough that debugging and invariant-based tests remain practical.

Step 4: Build The Simpler Host Launch Block

The host side is where the difference from the worksharing branch becomes most visible.

The SPMD launch block starts by materializing the launch dimensions directly from the wrapper-selected expressions:

1
2
3
4
5
6
7
SgVariableDeclaration *threads_per_block_decl = buildVariableDeclaration(
    "_threads_per_block_", buildIntType(),
    buildAssignInitializer(omp_num_threads), p_scope);

SgVariableDeclaration *num_blocks_decl =
    buildVariableDeclaration("_num_blocks_", buildIntType(),
                             buildAssignInitializer(omp_num_teams), p_scope);

That is it. No loop tripcount. No launch granularity rounding. No nested-loop-based default cap. No distinction between explicit and inferred launch clauses.

Those extra stages are absent because there is no worksharing-loop analysis driving them here.

This is the cleanest way to phrase the difference:

in the SPMD path, the wrappers decide the launch dimensions; in the worksharing path, the lowerer may additionally reshape them around a canonical loop.

The host block then builds the same general runtime-facing state as the other branch:

__device_id initialized to the OpenMP default-device sentinel -1,
kernel ID globals,
one __tgt_offload_entry,
__host_ptr pointing at the kernel ID,
map arrays or dynamic map-array declarations,
one __tgt_kernel_arguments object,
and one __tgt_target_kernel(...) call.

The crucial call site is straightforward:

1
2
3
4
5
6
7
parameters = buildExprListExp(
    buildVarRefExp(device_id_decl), buildVarRefExp(num_blocks_decl),
    buildVarRefExp(threads_per_block_decl), buildVarRefExp(host_ptr_decl),
    buildAddressOfOp(buildVarRefExp(kernel_args_decl)));

SgExprStatement *func_offloading_stmt = buildFunctionCallStmt(
    "__tgt_target_kernel", buildIntType(), parameters, p_scope);

That should look familiar by now, and that consistency is the point. The SPMD and worksharing branches differ in how they prepare the kernel body and the launch dimensions, but they converge on the same runtime ABI.

No tripcount means the runtime packet gets `0`

The SPMD branch calls the packet builder with NULL for the tripcount:

1
2
3
SgVariableDeclaration *kernel_args_decl = buildTargetKernelArgsDeclaration(
    g_scope, p_scope, arg_number_decl, args_base_decl, args_decl, arg_sizes,
    arg_types, num_blocks_decl, threads_per_block_decl, NULL);

That means the __tgt_kernel_arguments aggregate still gets built in the same layout, but its tripcount slot is zeroed rather than carrying __rex_tripcount.

This is a subtle but important distinction. The absence of worksharing-loop analysis is reflected in the runtime packet too, not only in the host launch declarations.

The SPMD host launch block takes omp_num_teams and omp_num_threads directly from the surface wrapper, builds offload entries and map arrays, then calls __tgt_target_kernel without a tripcount field. — Figure 2. The SPMD launch block is simpler because no canonical worksharing loop is feeding it. Teams and threads come directly from the wrapper, the runtime packet still exists, but the tripcount path stays unused.

Step 5: Static And Dynamic Map Arrays Still Use The Same Shared Machinery

One easy misconception about this branch is that because it is structurally simpler, it must also have a simpler mapping path. That is not really true.

The SPMD launch block still has the same split between:

static materialization of __args_base, __args, __arg_sizes, __arg_types,
and dynamic runtime map-array construction when dynamic_map_entries is non-empty.

The branch does exactly the same thing the worksharing path does here:

1
2
3
4
5
6
7
if (!dynamic_map_entries.empty()) {
  dynamic_arrays = buildDynamicRuntimeMapArgumentArrays(...);
  ...
} else {
  materializeLiteralTargetArgExpressions(...);
  ...
}

That is worth emphasizing because it keeps the mental model clean:

loop specialization differs by branch,
map-list lowering does not.

This is also why the SPMD path still matters for generic lowering correctness. Region-like offloads can still contain:

array sections,
scalar literals,
pointer captures,
and dynamic mapping shapes.

The absence of a loop rewrite does not make those problems disappear.

Step 6: Replace The Original Directive And Restore Preprocessing Info

Once the outlined driver body is fully built, the helper replaces the original target statement:

1
replaceStatement(target, outlined_driver_body, true);

Then it restores the saved preprocessing information:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
if (!save_buf_inside.empty()) {
  for (PreprocessingInfo *info : save_buf_inside) {
    if (info != nullptr) {
      info->setRelativePosition(PreprocessingInfo::after);
    }
  }
  SageInterface::pastePreprocessingInfo(
      outlined_driver_body, PreprocessingInfo::after, save_buf_inside);
}
SageInterface::pastePreprocessingInfo(outlined_driver_body,
                                      PreprocessingInfo::before, save_buf1);
SageInterface::pastePreprocessingInfo(outlined_driver_body,
                                      PreprocessingInfo::after, save_buf2);

This is the kind of detail that can look like backend housekeeping until you have debugged enough transformed code. Once you do, it becomes obvious that preserving attached preprocessing fragments and conditional guards is part of code-generation quality, not an optional extra.

The branch then records the outlined function in target_outlined_function_list, the same way the rest of the GPU lowerer does.

How This Branch Differs From The Worksharing Path

It helps to summarize the contrast directly.

What is the same

The SPMD branch and the worksharing branch both do all of these:

ensure a stable basic-block body,
preprocess and normalize the region for outlining,
analyze map clauses through transOmpMapVariables(...),
build all_syms and addressOf_syms,
lower literal target parameters,
mark the outlined function as a CUDA kernel,
create kernel IDs and __tgt_offload_entry objects,
build static or dynamic map arrays,
pack __tgt_kernel_arguments,
launch with __tgt_target_kernel,
and replace the original directive with a generated host driver block.

What is different

The SPMD branch does not do these worksharing-specific things:

no transOmpTargetLoopBlock(),
no canonical target-loop analysis for host launch shaping,
no __rex_tripcount,
no direct grid-stride rewrite,
no XOMP round-robin fallback path,
no nested-loop-based direct-launch thread cap.

That is the cleanest way to remember the branch:

same outlining and runtime ABI, simpler kernel-body contract.

A comparison between the SPMD path and the worksharing path, showing shared machinery and branch-specific differences. — Figure 3. The SPMD and worksharing branches share the core outlining and runtime ABI machinery, but differ in how they specialize the kernel body and shape the host launch.

What Tests Actually Protect This Path Today

The testing story here is worth stating carefully.

The broad OpenMP_tests frontend corpus clearly exercises the surface spellings that enter this branch. Its CMake list includes files such as:

target.c
targetparallel.c
targetteams.c

Those are useful because they protect the frontend side of the contract:

does the directive parse,
does the AST get built,
does the construct survive earlier pipeline stages?

On the lowering side, the common reduced Rodinia verifier defines the host/device artifact contract that this branch shares with the rest of GPU lowering:

exactly one #include "rex_kmp.h" in the host file,
exactly one rex_offload_init(),
no rex_offload_fini() inserted in host output,
the expected __tgt_offload_entry count,
the expected __tgt_target_kernel(...) count,
exactly one extern "C" in the device file,
exactly one hidden __rex_kernel_launch_env parameter per kernel.

That verifier currently focuses mostly on worksharing-shaped reduced cases, because those were the most regression-prone during the recent offloading work. So the honest status today is:

the shared artifact invariants are well defined,
the surface spellings for SPMD constructs are clearly present in the broad corpus,
but the reduced lowering suite is still more loop-heavy than SPMD-heavy.

That is not a reason to doubt the branch. It is just the current shape of the tests. If anything, it points to an obvious future improvement: add one reduced Rodinia-style case whose dominant structure is a non-loop SPMD region rather than a worksharing loop.

Why This Branch Still Matters

It would be easy to look at the current performance work and assume the SPMD path is less important because the loop-heavy worksharing branch got more attention.

That would be the wrong conclusion.

This branch matters because it defines how REX handles region-like offloads when there is no canonical loop to specialize:

simple target regions,
target teams regions,
target parallel regions,
and target bodies where the region itself is the meaningful lowering unit.

Architecturally, it also matters because it shows how much of the lowerer is truly shared:

the mapping model,
the outlining boundary,
the offload-entry protocol,
the runtime packet ABI,
and the host-side replacement pattern.

What changes is not the whole offloading design. What changes is the body-shape specialization step.

That is a good compiler architecture:

share as much machinery as possible,
specialize only the part that actually differs.

The Main Design In One Sentence

If you compress the whole branch down, the core idea is:

transOmpTargetSpmd() lowers region-like offloads by keeping the region body as the kernel-shaped unit, then reusing the same outlining, mapping, offload-entry, and runtime-packet machinery as the rest of GPU lowering without invoking the extra worksharing loop-rewrite pipeline.

That makes it the natural sibling of the worksharing branch from the previous post.

The worksharing path answers:

how does REX lower a target loop?

The SPMD path answers:

how does REX lower a target region when the loop is not the central abstraction?