How REX Dispatches GPU Offloading Constructs
The GPU lowerer routes SgOmpTarget* nodes through thin wrappers that extract clause information, choose defaults, and record whether launch dimensions came from explicit user clauses. Those wrappers then funnel most kernel-launching constructs into one of two core paths: transOmpTargetSpmd() for region-like constructs and transOmpTargetSpmdWorksharing() for loop-dominated constructs. That normalization boundary is what keeps the lowerer understandable and benchmark-independent.

The previous OpenMP posts in this series explained why REX owns the OpenMP model, how directives survive long enough to be parsed, how OpenMPIR becomes SgOmp*, and how the GPU lowerer and helper/runtime boundary work at a higher level.
This post zooms into the first important decision inside the GPU lowerer: when the lowerer sees many different SgOmpTarget* node kinds, how does it avoid becoming a pile of unrelated special cases?
The answer is: it dispatches many surface constructs into a small number of core lowering paths.
That sounds like a minor implementation detail. It is not. This dispatch boundary is one of the main reasons the GPU lowerer stays generic instead of turning into benchmark-shaped code.
Figure 1. The GPU lowerer does not give every surface spelling its own full pipeline. Most kernel-launching target constructs collapse into either the SPMD path or the worksharing path.
Why Dispatch Is The First Real Lowering Decision
Once Stage 2 has produced SgOmp*, the compiler no longer has to worry about raw pragma spelling or directive parsing. But it still has to deal with a large surface area of OpenMP constructs.
On the GPU side, the important node kinds include:
- SgOmpTargetStatement
- SgOmpTargetTeamsStatement
- SgOmpTargetParallelStatement
- SgOmpTargetTeamsDistributeStatement
- SgOmpTargetParallelForStatement
- SgOmpTargetTeamsDistributeParallelForStatement
- SgOmpTargetDataStatement
- SgOmpTargetUpdateStatement
If each of these owned a separate lowering pipeline, omp_lowering.cpp would be much harder to reason about than it already is. Every new optimization or correctness fix would have to be ported across multiple similar but slightly divergent implementations.
REX deliberately avoids that.
The lowerer’s first job is not “generate kernels immediately.” Its first job is to classify the construct and decide which kind of GPU lowering problem this node really represents.
The Main Switch Is Big, But The Outcome Is Regular
The lowering walk eventually reaches a switch over OpenMP node kinds; in the offloading part, each SgOmpTarget* case routes the node to its wrapper.
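The exact case labels live in omp_lowering.cpp; as a self-contained sketch, with an enum standing in for the real SgOmp* node classes and names that are illustrative rather than verbatim, the dispatch shape looks like this:

```cpp
#include <cassert>

// Hypothetical stand-ins for the SgOmp* node kinds; the real lowerer
// switches over ROSE AST node variants, not a plain enum.
enum class OmpNodeKind {
  Target, TargetTeams, TargetParallel,
  TargetTeamsDistribute, TargetParallelFor,
  TargetTeamsDistributeParallelFor,
  TargetData, TargetUpdate
};

// Which core lowering path a construct funnels into.
enum class CorePath { Spmd, SpmdWorksharing, DataManagement };

CorePath dispatch(OmpNodeKind kind) {
  switch (kind) {
    // Region-like kernel launches -> transOmpTargetSpmd()
    case OmpNodeKind::Target:
    case OmpNodeKind::TargetTeams:
    case OmpNodeKind::TargetParallel:
      return CorePath::Spmd;
    // Loop-dominated kernel launches -> transOmpTargetSpmdWorksharing()
    case OmpNodeKind::TargetTeamsDistribute:
    case OmpNodeKind::TargetParallelFor:
    case OmpNodeKind::TargetTeamsDistributeParallelFor:
      return CorePath::SpmdWorksharing;
    // Data-management constructs stay outside the kernel-launch funnel.
    case OmpNodeKind::TargetData:
    case OmpNodeKind::TargetUpdate:
      return CorePath::DataManagement;
  }
  return CorePath::Spmd; // unreachable; silences compiler warnings
}
```

The point of the sketch is the funnel: eight case labels, but only three destinations.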
The important part is what happens next.
These entry points are not eight unrelated lowerers. Most of them are wrappers whose real job is:
- identify which launch dimensions matter for this construct,
- extract the relevant clause expressions,
- fill in defaults for what this construct does not specify,
- record whether the launch dimensions were explicitly user-specified,
- and then funnel into a smaller number of core lowering paths.
That is the real normalization boundary.
Not Every target-Family Construct Is The Same Problem
Before getting to the two core paths, it helps to separate three categories.
1. Region-like kernel launch constructs
These are offloading constructs whose body is lowered as a region-like SPMD kernel body:
- target
- target teams
- target parallel
These differ in what they say about teams and threads, but they are not loop-dominated worksharing constructs.
2. Loop-dominated kernel launch constructs
These are the forms where the loop itself is central to the lowering strategy:
- target teams distribute
- target parallel for
- target teams distribute parallel for
These still become GPU launches, but they need extra reasoning about:
- loop tripcount,
- loop collapse,
- worksharing shape,
- launch capping when the user did not explicitly specify thread geometry.
3. Data-management constructs
These are related to offloading but are not kernel-launch forms:
- target data
- target update
These stay outside the main kernel-launch funnel because they solve a different lowering problem: data region and data movement management rather than kernel outlining and launch shaping.
That separation is already a sign of good design. The lowerer does not pretend every target-prefixed directive belongs to one giant “GPU path.”
Figure 2. The dispatch design starts by recognizing that not every target-family directive represents the same lowering problem. Kernel launches and data-management constructs should not be forced through the same path.
The Wrappers Are Thin On Purpose
The wrapper functions are short enough that their design intent is easy to see.
Take the simplest case, plain target.
This says something very concrete:
- plain target uses default launch dimensions,
- and lowering should go through the simple SPMD path.
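A minimal sketch of that wrapper behavior, using a hypothetical LaunchDesc struct in place of the real SgExpression*-based parameters (the default of 128 threads is an assumed value for illustration only):

```cpp
#include <cassert>

// Hypothetical normalized launch description; the real wrappers pass
// SgExpression* clause values and explicitness flags into transOmpTargetSpmd().
struct LaunchDesc {
  int num_teams = 1;               // default: one team
  int num_threads = 128;           // assumed default, for illustration only
  bool has_explicit_num_teams = false;
  bool has_explicit_num_threads = false;
};

// Plain `target`: no clause extraction at all -- the defaults go
// straight into the SPMD core path.
LaunchDesc normalizePlainTarget() {
  return LaunchDesc{};             // everything stays at its defaults
}
```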
Now look at target teams.
This construct still chooses the same core path. The only real difference is:
- num_teams comes from a clause,
- num_threads stays at its default.
target parallel mirrors that in the other direction:
- default one team,
- explicit num_threads from the clause,
- the same SPMD core path.
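Sketched in the same hypothetical LaunchDesc style as above (illustrative names, not the real signatures), the two mirrored wrappers differ only in which dimension they fill in:

```cpp
#include <cassert>

// Same hypothetical struct as before, repeated so this sketch stands alone.
struct LaunchDesc {
  int num_teams = 1;
  int num_threads = 128;           // assumed default, for illustration only
  bool has_explicit_num_teams = false;
  bool has_explicit_num_threads = false;
};

// `target teams`: num_teams comes from the clause; threads stay at default.
LaunchDesc normalizeTargetTeams(int clause_num_teams) {
  LaunchDesc d;
  d.num_teams = clause_num_teams;
  d.has_explicit_num_teams = true;
  return d;
}

// `target parallel` mirrors it: one team, explicit num_threads.
LaunchDesc normalizeTargetParallel(int clause_num_threads) {
  LaunchDesc d;
  d.num_threads = clause_num_threads;
  d.has_explicit_num_threads = true;
  return d;
}
```

Both still hand the result to the same SPMD core path; only the normalized description differs.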
This is the key pattern. The wrappers are not where the real lowering lives. They are where construct-specific surface information is converted into a normalized launch description.
The Two Core Paths
After normalization, most GPU launch constructs end up in one of two functions:
- transOmpTargetSpmd()
- transOmpTargetSpmdWorksharing()
That split is the real high-level mental model for the lowerer.
transOmpTargetSpmd()
This is the path for region-like target constructs where the body is handled as a more direct SPMD kernel region.
At the beginning of that function, the lowerer does the common setup you would expect:
- resolve the device(...) clause or default to device 0,
- ensure the target body sits in a basic block so declarations can be inserted safely,
- save preprocessing information before the outliner starts rewriting structure,
- preprocess the body for outlining,
- translate variables that need to cross the host/device boundary.
In other words, the SPMD path says:
“we already know this construct is fundamentally a region-like kernel body; now do the common work to outline and launch it.”
transOmpTargetSpmdWorksharing()
This path starts from the same broad responsibilities but adds loop-centric logic immediately.
Very early in the function, you can see extra work that does not exist in the simple SPMD path:
- handle collapse(...) before outlining,
- analyze the first host for loop,
- compute a possible tripcount expression,
- detect when a direct loop fast path is allowed,
- and, if the user did not explicitly request thread geometry, cap thread count based on loop shape and nesting.
Later in the same function, after outlining, the lowerer revisits the moved loop through transOmpTargetLoopBlock(...).
That extra step is the concrete meaning of “worksharing path.” The loop is not just cargo moved into a kernel. The loop body is itself the main lowering object.
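The target shape of that rewrite can be illustrated with a host-side simulation (the index names and strided schedule here are a conventional GPU worksharing pattern, sketched as an assumption; the generated kernel uses the GPU's real team/thread IDs):

```cpp
#include <cassert>
#include <vector>

// Sketch of the worksharing shape the loop rewrite aims for: each
// (team, thread) pair strides across the iteration space so the whole
// range is covered exactly once.
void workshareLoop(std::vector<int>& hits, int n,
                   int team_id, int num_teams,
                   int thread_id, int num_threads) {
  int start  = team_id * num_threads + thread_id;
  int stride = num_teams * num_threads;
  for (int i = start; i < n; i += stride)
    hits[i] += 1;                  // stand-in for the original loop body
}
```

Simulating 2 teams of 4 threads over a 10-iteration loop shows each index visited exactly once, which is the invariant the rewrite must preserve.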
Figure 3. The wrapper layer is where surface syntax becomes a normalized launch description. The lowerer records not only expressions, but also whether those expressions came from explicit user clauses.
Why The Explicitness Flags Matter
One of the most important details in the worksharing wrappers is easy to overlook. They do not only pass expressions like omp_num_teams and omp_num_threads. They also pass booleans such as:
- has_explicit_num_teams
- has_explicit_num_threads
For example, a worksharing wrapper forwards omp_num_teams and omp_num_threads together with these booleans into the core path.
This is not bookkeeping for its own sake. It encodes a policy boundary.
The lowerer needs to know not only what value it is starting from, but also where that value came from:
- if the user explicitly asked for num_threads(...), the compiler should generally honor it unless it is invalid,
- if the user did not specify it, the compiler is free to apply a better default or cap a poor launch geometry.
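A hedged sketch of that policy boundary (the default value and the tripcount-based cap are illustrative assumptions, not REX's exact numbers):

```cpp
#include <cassert>

// Hypothetical launch policy: explicit user geometry is honored; an
// unspecified thread count may be capped by the loop tripcount.
int chooseNumThreads(int requested, bool has_explicit_num_threads,
                     long tripcount) {
  if (has_explicit_num_threads && requested > 0)
    return requested;              // honor a valid explicit request
  const int kDefault = 128;        // assumed default thread count
  if (tripcount > 0 && tripcount < kDefault)
    return static_cast<int>(tripcount); // avoid obviously oversized launches
  return kDefault;
}
```

Without the boolean, the first branch is impossible to write: a requested value of 128 is indistinguishable from a default of 128.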
That distinction became especially important during the performance work, but it is fundamentally a lowering-design issue, not just an optimization issue. If you lose this information too early, every later launch-policy decision becomes ambiguous.
What The Worksharing Path Adds Beyond SPMD
The worksharing path does not just “handle loops too.” It adds a few specific capabilities:
- Collapse-aware preprocessing: if the target construct has collapse(...), that has to be reflected before the outlining and loop-lowering logic proceeds.
- Host-loop analysis before outlining is fully committed: the lowerer inspects the loop nest to see whether direct target-loop fast paths are possible and whether nested structure suggests a different thread cap.
- Tripcount-aware thread shaping: if thread count was not explicitly specified, the lowerer can cap or round launch geometry based on the loop tripcount so obviously oversized launches are avoided.
- Explicit loop lowering after outlining: once the loop is inside the outlined kernel, transOmpTargetLoopBlock(...) rewrites the loop body into the GPU worksharing shape the runtime packet and generated kernel expect.
None of that belongs in the simple SPMD path because those concerns do not dominate region-like constructs the same way they dominate loop-driven constructs.
Why This Design Keeps The Lowerer Generic
The payoff from the dispatch design is not only smaller code. It is a cleaner compiler.
If every benchmark-shaped construct had its own end-to-end lowerer, then fixes and optimizations would be hard to generalize:
- a launch-policy fix might land in one path and be forgotten in another,
- a preprocessing-info fix might handle comments correctly for one construct family but not another,
- a new helper/runtime API migration would require touching many duplicate lowering paths.
By contrast, the current design says:
- wrappers own surface-construct normalization,
- two core paths own the main kernel-launch lowering strategies,
- data-management constructs keep their own separate, non-kernel path.
That means changes land at the right level:
- construct-specific clause extraction stays in wrappers,
- generic SPMD logic stays in transOmpTargetSpmd(),
- generic loop/worksharing logic stays in transOmpTargetSpmdWorksharing().
This is the difference between a compiler that accumulates benchmark-shaped patches and a compiler that keeps a stable internal geometry.
Why The Dispatch Layer Is Also A Testing Boundary
The dispatch boundary is not only good for implementation structure. It is also good for tests.
Because the wrappers are thin and the core paths are few, test cases can validate meaningful invariants such as:
- which constructs should produce one kernel vs multiple kernels,
- whether repeated host calls still route through the same lowered helper shape,
- whether the right launch API path was selected,
- whether explicit user launch clauses were preserved,
- whether non-launch constructs like target data stayed outside the kernel-launch funnel.
This is one reason the lowering_rodinia suite works so well. It is not testing every possible spelling independently. It is testing the stable contracts of the normalized lowering paths.
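One such contract check can be sketched as a plain string scan over lowered output (the "__global__ void" marker is an assumption about CUDA-style generated source, not REX's exact output format, and countKernels is a hypothetical helper):

```cpp
#include <cassert>
#include <string>

// Hypothetical invariant check in the spirit of those tests: count kernel
// definitions in lowered source text, so a test can assert "this construct
// produces exactly one kernel".
int countKernels(const std::string& lowered) {
  int count = 0;
  const std::string marker = "__global__ void";
  for (std::string::size_type pos = lowered.find(marker);
       pos != std::string::npos;
       pos = lowered.find(marker, pos + marker.size()))
    ++count;
  return count;
}
```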
What Comes After Dispatch
Once dispatch has chosen the core path, the next major steps are:
- outlining the device kernel,
- building the host launch side,
- constructing the runtime packet,
- and managing target-data lifetimes around multi-kernel regions.
Those later stages are where the lowerer turns the normalized construct into explicit host/device source. But those stages are easier to understand only after the dispatch boundary is clear.
That is why dispatch deserves its own post. If you skip this step mentally, omp_lowering.cpp looks like a pile of unrelated transformations. Once you see the funnel, the file becomes much more regular.
The Design In One Sentence
REX keeps the GPU lowerer generic by treating construct dispatch as a normalization step, not as a separate lowering pipeline for every OpenMP surface spelling.
The wrappers extract clause-driven launch information, preserve whether it was explicitly user-specified, and then funnel most kernel-launch constructs into either the SPMD path or the worksharing path. That small number of core paths is what makes the lowerer understandable enough to evolve without turning benchmark-specific.