How REX Lowers Target Loops Into Direct GPU Kernels

REX recovers canonical loop facts and rewrites eligible OpenMP target loops into direct CUDA-style grid-stride kernels. If the loop shape is not suitable, it falls back to the older XOMP round-robin scheduler. The same loop analysis feeds both device rewriting and host-side tripcount shaping.

The previous posts in this series covered two neighboring stages of GPU lowering:

  • how REX outlines a target region into a real device kernel,
  • and how the host side builds launch packets and runtime map arrays for that kernel.

There is still one very important step in between those two stories:

what happens to the actual for loop inside the outlined kernel body?

That step matters more than it might first appear.

Outlining gives REX a function boundary. Host launch generation gives REX a runtime call boundary. But neither of those answers the core execution question:

  • which CUDA lane owns which iteration?
  • how does the loop advance after the first assignment?
  • what happens if the loop shape is simple enough for a direct grid-stride mapping?
  • and what happens if it is not?

In the current lowerer, that decision lives in transOmpTargetLoopBlock() inside src/midend/programTransformation/ompLowering/omp_lowering.cpp.

This post stays focused on that slice alone:

  • the canonical loop facts REX reconstructs,
  • the direct grid-stride rewrite,
  • the older XOMP scheduler fallback that still remains,
  • and the host-side tripcount handshake that lets launch shaping use the same loop information without lowering the loop twice.

A normalized target loop is analyzed once, then used in two ways: host-side tripcount shaping and device-side loop rewriting after outlining.

Figure 1. Loop lowering sits between outlining and host launch construction. The same canonical loop facts feed both the device rewrite and the host launch-shaping logic, but they do so in different ways.

Why Loop Lowering Is Its Own Stage

The easiest mistake to make when reading the lowerer is to assume that outlining already “lowered the loop.”

It did not.

After outlining, the worksharing target path still contains a normal SgForStatement inside the new kernel function. REX then immediately revisits that loop:

Rose_STL_Container<SgNode *> for_loops =
    NodeQuery::querySubTree(result, V_SgForStatement);
transOmpTargetLoopBlock(for_loops[0], NULL, &offload_ctx);

That split is intentional.

Outlining decides:

  • where the kernel boundary is,
  • which values cross it,
  • and what the outlined function is called.

Loop lowering decides:

  • how the loop body maps onto CUDA execution,
  • whether the kernel can use a direct grid-stride mapping,
  • and whether post-rewrite cleanups such as read-only load hoisting should run.

Keeping those two jobs separate is what lets the same worksharing path handle more than one surface OpenMP spelling. A target parallel for and a target teams distribute parallel for may need different launch-clause treatment on the host side, but once REX has outlined the loop body, the device loop rewrite problem becomes much more uniform.

Step 1: Recover A Canonical Loop Model

The direct path starts with a small record:

struct TargetLoopLoweringInfo {
  SgInitializedName *orig_index = nullptr;
  SgExpression *orig_lower = nullptr;
  SgExpression *orig_upper = nullptr;
  SgExpression *orig_stride = nullptr;
  bool is_incremental = true;
  bool is_inclusive_bound = true;
};

This tells you exactly what the lowerer thinks it needs in order to reason about a GPU loop:

  • the induction variable,
  • the lower bound,
  • the upper bound,
  • the stride,
  • the direction,
  • and whether the bound is inclusive.

That is already a clue that the direct path is not driven by raw syntax. It is driven by a normalized loop model.

The device-side analyzer mutates the loop into a canonical form

The device-facing helper is analyzeTargetLoopForGpu():

static TargetLoopLoweringInfo
analyzeTargetLoopForGpu(SgForStatement *for_loop) {
  rewritePointerBasedForIndices(for_loop);
  SageInterface::forLoopNormalization(for_loop);

  TargetLoopLoweringInfo info;
  bool is_canonical = isCanonicalForLoop(
      for_loop, &info.orig_index, &info.orig_lower, &info.orig_upper,
      &info.orig_stride, NULL, &info.is_incremental);
  info.is_inclusive_bound = true;
  return info;
}

There are three important details here.

First, REX rewrites pointer-based loop indices before it does anything else. That is not cosmetic. In outlined target kernels, loop indices can appear as pointer dereferences after earlier transformations. If the direct path tried to read those literally, canonical-loop detection would become much less reliable.

Second, forLoopNormalization() aggressively regularizes the loop shape:

  • for (int i = 0; ...) becomes a hoisted declaration plus for (i = 0; ...),
  • < and > tests become inclusive forms such as <= and >= with adjusted bounds,
  • i++ and i-- become explicit additive stride updates.

Third, after those rewrites, the loop is no longer being understood as “whatever source spelling the user happened to write.” It is being understood as a canonical control model that later code can safely rewrite.

That is the right level of abstraction for a lowerer.
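As a concrete illustration of what that normalization buys, here is a hypothetical before/after pair (ordinary host C++ written for this post, not REX output) that computes the same sum in both spellings:

```cpp
#include <cassert>

// The way a user might write the loop: header-local declaration,
// exclusive "<" bound, "i++" increment.
int sum_user_spelling(int n) {
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += i;
  return sum;
}

// The shape forLoopNormalization() leaves behind: hoisted declaration,
// inclusive "<=" bound with an adjusted limit, explicit additive stride.
int sum_normalized_spelling(int n) {
  int sum = 0;
  int i;                           // declaration hoisted out of the header
  for (i = 0; i <= n - 1; i += 1)  // "< n" became inclusive "<= n - 1"
    sum += i;
  return sum;
}
```

Both spellings traverse the identical iteration space; only the second exposes the lower bound, inclusive upper bound, and stride as plain expressions the lowerer can read off directly.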

The host-side analyzer reads without mutating

REX also has a second helper:

static bool analyzeTargetLoopForGpuReadOnly(SgForStatement *for_loop,
                                            TargetLoopLoweringInfo *info)

This one is used on the host side before the kernel body itself is rewritten. It tries isCanonicalForLoop(...) first, but if the loop is not already in the fully normalized form it also calls recoverCanonicalForLoopControl(...) and reconstructs whether the bound is inclusive from the test operator.

That distinction matters because the host side often wants the loop facts without rewriting the original body yet. The device path is allowed to normalize the loop because it is about to replace it. The host-side launch-shaping logic only needs the tripcount and related facts, so it uses the read-only route.

Step 2: Replace The Loop With A Loop-Local Lowering Block

Once REX has a TargetLoopLoweringInfo, it does not rewrite the loop in place blindly. It first creates a local basic block that becomes the new lowering scope:

SgVariableDeclaration *hoisted_index_decl =
    findHoistedTargetLoopIndexDeclaration(for_loop, info);
SgBasicBlock *bb1 = SageBuilder::buildBasicBlock();
replaceStatement(for_loop, bb1, true);
appendStatement(for_loop, bb1);
if (hoisted_index_decl != nullptr) {
  SageInterface::removeStatement(hoisted_index_decl, false);
  appendStatement(hoisted_index_decl, bb1);
}

This block matters for two reasons.

First, the loop-lowering path is going to introduce new declarations such as:

  • _dev_thread_num
  • _dev_thread_id
  • or, on the fallback path, scheduler temporaries like _dev_lower and _dev_upper

Those values need a clean local scope rather than being spilled into whatever surrounding statement structure the outlined function happened to have.

Second, forLoopNormalization() may have hoisted the loop index declaration immediately before the loop. If REX replaced the loop without moving that declaration along, the transformed code would be structurally awkward or even incorrect. findHoistedTargetLoopIndexDeclaration() is therefore doing a very specific cleanup: it detects the tightly-coupled hoisted declaration and moves it into the new lowering block before the final loop rewrite happens.

This is a good example of the lowerer treating normalization as a first-class transformation rather than as a magical black box. If an earlier normalization changes the tree shape, the later stage takes responsibility for carrying the now-hoisted declaration to the place where the rewritten loop still makes sense.

Step 3: Choose Between The Direct Path And The Fallback

The dispatch point inside transOmpTargetLoopBlock() is tiny:

bool use_direct_grid_stride = canUseDirectTargetLoopFastPath(info);
if (use_direct_grid_stride) {
  lowerTargetLoopDirectGridStride(for_loop, bb1, info);
} else {
  lowerTargetLoopRoundRobin(for_loop, bb1, info);
}

But the architectural meaning is important.

The lowerer does not assume that every target loop can or should use the direct CUDA-style path. It checks whether the canonical facts are present and then makes the choice explicitly.

Right now, the gate is intentionally simple:

static bool canUseDirectTargetLoopFastPath(const TargetLoopLoweringInfo &info) {
  return info.orig_index != nullptr && info.orig_lower != nullptr &&
         info.orig_upper != nullptr && info.orig_stride != nullptr;
}

This is not a heroic profitability model. It is a correctness gate. If the lowerer cannot recover the basic loop-control model, it does not guess. It falls back.

That fallback matters because this is a compiler, not a source-to-source benchmark script. A fast path that silently assumes too much is not a real optimization.

The direct path maps CUDA global thread ids directly to the loop induction variable, while the fallback path uses XOMP static scheduler helpers to hand out loop chunks.

Figure 2. The worksharing lowerer has two execution models. The preferred one is a direct grid-stride rewrite. The older one is a round-robin scheduler path that survives as a correctness fallback.

Step 4: Rewrite The Loop Into A Direct Grid-Stride Form

The direct path is compact and very revealing:

static void
lowerTargetLoopDirectGridStride(SgForStatement *for_loop, SgBasicBlock *bb1,
                                const TargetLoopLoweringInfo &info) {
  SgVariableDeclaration *dev_thread_num_decl = buildVariableDeclaration(
      "_dev_thread_num", buildLongLongType(),
      buildAssignInitializer(buildCudaGlobalThreadCountXExpr(bb1),
                             buildLongLongType()),
      bb1);

  SgVariableDeclaration *dev_thread_id_decl = buildVariableDeclaration(
      "_dev_thread_id", buildLongLongType(),
      buildAssignInitializer(buildCudaGlobalThreadIdXExpr(bb1),
                             buildLongLongType()),
      bb1);

  SgVariableSymbol *dev_thread_num_symbol =
      SageInterface::getFirstVarSym(dev_thread_num_decl);
  SgVariableSymbol *dev_thread_id_symbol =
      SageInterface::getFirstVarSym(dev_thread_id_decl);

  setLoopLowerBound(
      for_loop, buildAddOp(deepCopy(info.orig_lower),
                           buildMultiplyOp(buildVarRefExp(dev_thread_id_symbol),
                                           deepCopy(info.orig_stride))));
  setLoopUpperBound(for_loop, deepCopy(info.orig_upper));
  setLoopStride(for_loop, buildMultiplyOp(buildVarRefExp(dev_thread_num_symbol),
                                          deepCopy(info.orig_stride)));
}

Conceptually, REX is turning a canonical source loop into the usual CUDA-style global-thread mapping:

  • _dev_thread_id = blockDim.x * blockIdx.x + threadIdx.x
  • _dev_thread_num = gridDim.x * blockDim.x
  • first iteration for this lane = orig_lower + _dev_thread_id * orig_stride
  • subsequent iterations = add _dev_thread_num * orig_stride

That is exactly the grid-stride structure you would expect in a hand-written CUDA kernel.

The helper expressions make that explicit:

static SgExpression *buildCudaGlobalThreadIdXExpr(SgScopeStatement *scope) {
  return buildAddOp(buildMultiplyOp(buildCudaDimXRef("blockDim.x", scope),
                                    buildCudaDimXRef("blockIdx.x", scope)),
                    buildCudaDimXRef("threadIdx.x", scope));
}

static SgExpression *buildCudaGlobalThreadCountXExpr(SgScopeStatement *scope) {
  return buildMultiplyOp(buildCudaDimXRef("gridDim.x", scope),
                         buildCudaDimXRef("blockDim.x", scope));
}

That is an important design point. The direct path is not using a scheduler helper to compute lane ownership indirectly. It is lowering to the actual CUDA execution model directly in the generated kernel body.
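To see why that formula-based mapping is a complete partition of the iteration space, here is a small CPU-side sketch (hypothetical code written for this post, not generated output) that replays the ownership rule: each lane starts at orig_lower + lane_id * stride and advances by num_lanes * stride against the unchanged inclusive upper bound:

```cpp
#include <cassert>
#include <vector>

// The grid-stride ownership rule from the rewrite, replayed on the CPU:
// lane `lane_id` out of `num_lanes` executes iterations
//   lower + lane_id * stride, then += num_lanes * stride, while <= upper.
std::vector<int> iterations_owned_by(int lane_id, int num_lanes,
                                     int lower, int upper, int stride) {
  std::vector<int> owned;
  for (int i = lower + lane_id * stride; i <= upper; i += num_lanes * stride)
    owned.push_back(i);
  return owned;
}

// Count how many lanes claim a given iteration value. For every value in
// the canonical iteration space, this should be exactly 1.
int count_claims(int num_lanes, int lower, int upper, int stride, int value) {
  int claims = 0;
  for (int lane = 0; lane < num_lanes; ++lane)
    for (int i : iterations_owned_by(lane, num_lanes, lower, upper, stride))
      if (i == value)
        ++claims;
  return claims;
}
```

Every iteration in the canonical space is claimed by exactly one lane, and values outside the stride pattern are claimed by none, which is exactly the invariant the rewritten lower bound and stride encode.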

Why only the lower bound and stride change

Notice what REX does not change in the direct path:

  • it keeps the normalized loop upper bound,
  • it keeps the original induction variable itself,
  • and it lets the canonical loop structure survive.

Only the starting point and the increment are rewritten.

That is a smart choice because it preserves much of the source loop’s recognizable structure while still mapping iterations across the grid. The generated code remains inspectable as “the original loop, but now lane-partitioned by CUDA thread geometry.”

Step 5: Clean Up The Direct Path So The Kernel Stays Efficient

The direct grid-stride rewrite is not the end of the story. After appending the rewritten loop into bb1, REX runs several targeted cleanups:

SgInitializedName *outer_index = getLoopIndexVariable(for_loop);
SgVariableSymbol *outer_index_sym =
    outer_index != nullptr
        ? isSgVariableSymbol(outer_index->get_symbol_from_symbol_table())
        : nullptr;
scalarizeDirectGridStrideOuterIndexAccesses(for_loop, outer_index_sym);
hoistReadOnlyInvariantAggregateRefsBeforeLoop(for_loop);
hoistReadOnlyInvariantFieldAccessesBeforeLoop(for_loop);
rewriteReadOnlyDeviceLoadsWithLdg(for_loop);

These helpers are worth understanding because they show that REX is not satisfied with merely assigning iterations correctly. It also wants the resulting kernel body to avoid obvious structural inefficiencies.

Scalarize repeated outer-index accesses

scalarizeDirectGridStrideOuterIndexAccesses(...) looks for repeated accesses such as:

  • a[i]
  • x[i]
  • other array references indexed by the outer grid-stride induction variable

When the pattern is safe, it introduces a cached scalar local, replaces the repeated indexed expressions with that local, and writes back once at the end if the element was written.

That is a very local optimization, but it makes sense in the direct path because grid-stride loops often re-use the same outer element many times within the loop body. Once lane ownership is fixed, that element access becomes a good candidate for scalar caching.
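A hypothetical before/after pair (written for this post, not REX output) shows the shape of that rewrite: the repeated a[i] accesses collapse into one load, a scalar accumulation, and one write-back:

```cpp
#include <cassert>

// Before: the body re-reads and re-writes a[i] on every trip of the
// inner loop, even though i is fixed for this lane.
int body_repeated_access(int *a, int i, int iters) {
  for (int k = 0; k < iters; ++k)
    a[i] = a[i] + k;        // repeated indexed access
  return a[i];
}

// After: the element is cached in a scalar local, updated there, and
// written back exactly once at the end.
int body_scalarized(int *a, int i, int iters) {
  int cached = a[i];        // load once
  for (int k = 0; k < iters; ++k)
    cached = cached + k;
  a[i] = cached;            // single write-back
  return a[i];
}
```

The observable result, including the final array contents, is identical; only the number of indexed memory accesses changes.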

Hoist read-only aggregate and field references

The two hoisting helpers do something similar for invariant data used inside nested loops:

  • hoistReadOnlyInvariantAggregateRefsBeforeLoop(...)
  • hoistReadOnlyInvariantFieldAccessesBeforeLoop(...)

These helpers look for cases where an inner loop repeatedly re-reads the same outer invariant aggregate element or the same field access through a read-only base object. When safe, REX materializes a cached pointer or cached field value before the inner loop rather than letting the repeated expression stay inside the nested body.

This is one reason the direct path matters as a lowering stage rather than as a final codegen flourish. Once the compiler knows the loop body is now a straightforward grid-stride kernel, it can do loop-local cleanups that are much easier to justify than they would have been in a more opaque scheduler-driven execution model.
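The same idea in miniature (again hypothetical, not REX output): the inner loop below re-evaluates the outer-invariant aggregate reference m[row] on every trip, and the hoisted version caches that invariant base once before the loop:

```cpp
#include <cassert>

// Before: m[row] is re-read through the base pointer on every inner trip,
// even though row does not change inside the loop.
int dot_unhoisted(const int *const *m, const int *v, int row, int n) {
  int acc = 0;
  for (int j = 0; j < n; ++j)
    acc += m[row][j] * v[j];
  return acc;
}

// After: the invariant aggregate reference is materialized as a cached
// pointer before the inner loop, the pattern the hoisting helpers target.
int dot_hoisted(const int *const *m, const int *v, int row, int n) {
  const int *row_ptr = m[row];  // hoisted invariant base
  int acc = 0;
  for (int j = 0; j < n; ++j)
    acc += row_ptr[j] * v[j];
  return acc;
}
```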

Rewrite eligible read-only loads to __ldg

Finally, rewriteReadOnlyDeviceLoadsWithLdg(...) scans the loop body for read-only device load candidates and rewrites them into:

__ldg(&expr)

The implementation is careful:

  • it gathers candidate expressions first,
  • sorts them by AST depth,
  • deduplicates them,
  • then replaces only expressions still inside the current subtree.

That discipline matters because AST-rewrite passes that mutate the tree while traversing it can easily become fragile. Here the helper deliberately stages the candidates before it starts replacing them.

The broader point for the post is simple: the direct path is not only about assigning iterations to lanes. It is also the gateway to a set of local device-side rewrites that assume a relatively plain kernel body and try to keep it that way.

Step 6: Keep The Old Round-Robin Path As A Correctness Fallback

The fallback path is older and much heavier:

SgStatement *call_stmt = buildFunctionCallStmt(
    "XOMP_static_sched_init", buildVoidType(), parameters, bb1);

SgExpression *func_call_exp = buildFunctionCallExp(
    "XOMP_static_sched_next", buildBoolType(), parameters, bb1);

SgWhileStmt *w_stmt = buildWhileStmt(func_call_exp, for_loop);

Instead of rewriting the loop into a direct global-thread formula, the fallback path:

  • declares scheduler state,
  • computes _dev_thread_num and _dev_thread_id through helper calls,
  • initializes the static scheduler,
  • repeatedly asks for the next chunk,
  • then rewrites the original loop bounds to _dev_lower and _dev_upper.

This is clearly a more general execution model. It is also clearly more machinery.

That is why the direct path is preferable whenever REX can recover the canonical loop model safely.

Still, the old path remains important. It gives the lowerer a correctness escape hatch when the direct path does not have enough structural information to proceed confidently. That is good compiler engineering:

  • prefer the simpler and more direct generated kernel when the loop shape supports it,
  • but keep a more general path for cases that do not.

The existence of the fallback also makes the design easier to evolve. New canonical-loop support can move more loops into the direct path over time without requiring the compiler to treat unsupported cases as fatal.
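Assuming the round-robin chunk handout that the helper names suggest, the fallback's execution model can be sketched on the CPU like this (illustrative only, not the XOMP runtime source):

```cpp
#include <cassert>
#include <vector>

// Round-robin static scheduling in miniature: thread `tid` out of
// `nthreads` takes chunk indices tid, tid + nthreads, tid + 2*nthreads,
// ..., and runs the original loop over each [_dev_lower, _dev_upper]
// chunk it receives.
std::vector<int> round_robin_iterations(int tid, int nthreads, int chunk,
                                        int lower, int upper) {
  std::vector<int> executed;
  for (long long lo = (long long)lower + (long long)tid * chunk; lo <= upper;
       lo += (long long)nthreads * chunk) {
    long long hi = lo + chunk - 1;
    if (hi > upper) hi = upper;           // final partial chunk
    for (long long i = lo; i <= hi; ++i)  // rewritten loop bounds
      executed.push_back((int)i);
  }
  return executed;
}

// Coverage check: how many threads execute a given iteration value.
int claims(int nthreads, int chunk, int lower, int upper, int value) {
  int c = 0;
  for (int t = 0; t < nthreads; ++t)
    for (int i : round_robin_iterations(t, nthreads, chunk, lower, upper))
      if (i == value)
        ++c;
  return c;
}
```

The coverage guarantee is the same as the direct path's, but it is reached through scheduler state and a chunk-request loop rather than a single closed-form ownership formula, which is exactly the extra machinery the direct path avoids.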

Step 7: Share Loop Facts With The Host Launch Block

One of the nicest design choices in the current path is that host launch shaping does not invent a second loop model.

Inside transOmpTargetSpmdWorksharing(...), before the loop is rewritten in the outlined kernel, REX analyzes the host-side loop read-only:

TargetLoopLoweringInfo host_loop_info;
if (analyzeTargetLoopForGpuReadOnly(host_for_loop, &host_loop_info) &&
    canUseDirectTargetLoopFastPath(host_loop_info)) {
  host_loop_iter_count_expr =
      buildTargetLoopTripCountExpr(host_loop_info);
  ...
}

If the loop is suitable, the host launch block later receives:

1
2
3
4
tripcount_decl = buildVariableDeclaration(
    "__rex_tripcount", buildOpaqueType("int64_t", p_scope),
    buildAssignInitializer(copyExpression(host_loop_iter_count_expr)),
    p_scope);

That __rex_tripcount value is then used to shape _threads_per_block_ conservatively when the user did not explicitly request num_threads.

The logic is more careful than “set threads to tripcount”:

  • it only runs when tripcount > 0,
  • it rounds using a launch granularity,
  • the launch granularity defaults to 32 lanes,
  • but if the current block size is already smaller than 32, the block size becomes the granularity instead,
  • and the rounded value is clamped so it never exceeds the current thread limit.

In schematic form, the generated host-side shape is:

int _threads_per_block_ = omp_num_threads;
int _num_blocks_ = omp_num_teams;
int64_t __rex_tripcount = recovered_loop_tripcount;
int direct_launch_thread_cap = 0;

if (!has_explicit_num_threads) {
  const int nested_loop_depth =
      computeMaxNestedForDepth(host_for_loop->get_loop_body());
  if (nested_loop_depth >= 2) {
    direct_launch_thread_cap = 128;
  } else if (nested_loop_depth >= 1) {
    direct_launch_thread_cap = 256;
  }
}

if (!has_explicit_num_threads && __rex_tripcount > 0) {
  int64_t __rex_launch_granularity = 32;
  if (_threads_per_block_ < 32) {
    __rex_launch_granularity = _threads_per_block_;
  }
  int64_t __rex_rounded_threads =
      round_up(__rex_tripcount, __rex_launch_granularity);
  if (__rex_rounded_threads > _threads_per_block_) {
    __rex_rounded_threads = _threads_per_block_;
  }
  if (direct_launch_thread_cap > 0 &&
      __rex_rounded_threads > direct_launch_thread_cap) {
    __rex_rounded_threads = direct_launch_thread_cap;
  }
  _threads_per_block_ = (int)__rex_rounded_threads;
  _num_blocks_ = (_threads_per_block_ > 0) ? (int)(1 + (__rex_tripcount - 1) / _threads_per_block_) : 1;
}

This is exactly the kind of host/device handshake that a good lowerer should have:

  • the host side recovers loop facts without rewriting the loop body,
  • the device side uses the same loop facts to rewrite the loop itself,
  • and explicit user launch clauses remain the boundary the compiler should not silently cross.

Host-side read-only loop analysis computes tripcount and optional launch caps, while the device side rewrites the same canonical loop into a grid-stride form.

Figure 3. The host and device sides do not duplicate the same transformation. They share loop facts. The host side uses them for launch shaping, and the device side uses them for the actual loop rewrite.

What This Buys REX

This design gives REX several things at once.

A device loop that looks like the execution model it targets

The direct path lowers straight to the CUDA grid model:

  • global lane id,
  • global lane count,
  • loop lower bound shifted by lane id,
  • loop stride multiplied by total lanes.

That is easier to inspect than a more opaque scheduler-driven loop body, and it lets later cleanups reason about the kernel in a more direct way.

A correctness fallback that does not block progress

The older XOMP round-robin path is still there when the canonical facts are not recoverable. That means the compiler can keep broad correctness coverage while still preferring the cleaner direct path when it is safe.

One loop model used in two places

The same TargetLoopLoweringInfo idea feeds both:

  • the device loop rewrite,
  • and the host-side tripcount computation.

That reduces the chance that host launch shaping and device execution silently drift apart.

A clean boundary for further device-side cleanup

Once the loop is a direct grid-stride kernel, helpers like:

  • scalarizeDirectGridStrideOuterIndexAccesses(...)
  • hoistReadOnlyInvariantAggregateRefsBeforeLoop(...)
  • hoistReadOnlyInvariantFieldAccessesBeforeLoop(...)
  • rewriteReadOnlyDeviceLoadsWithLdg(...)

become much easier to justify and test than they would be inside a heavier scheduler wrapper path.

The Main Design In One Sentence

If you compress this whole stage down, the core idea is:

REX does not lower target loops by treating them as anonymous work items for a generic scheduler unless it has to. It first tries to recover a canonical loop model and then rewrites the loop directly into the CUDA grid-stride execution model.

That is what makes this stage such an important bridge between the high-level OpenMP lowerer and the later performance work.

Outlining gave REX a kernel.

Host launch generation gave REX a runtime packet.

transOmpTargetLoopBlock() is the stage that makes the kernel body actually behave like a GPU kernel.