How REX Replaced the Old XOMP Scheduler Path With Direct Grid-Stride GPU Lowering

Posted on Apr 22, 2026 (Updated on May 3, 2026)

After REX moved CUBIN registration out of the nn timed path, the remaining gap was no longer a cold-start problem. It was a generated-kernel-shape problem. REX was still routing simple canonical target loops through old XOMP static-scheduler helpers, while native LLVM used a direct SPMD-style grid-stride kernel body. The fix was generic: recover canonical loop facts, rewrite eligible loops into direct grid-stride form, keep the scheduler fallback for unsafe shapes, and test that nn-like lowering no longer emits scheduler helpers.

The previous performance post explained why the first huge nn regression was not a GPU math problem. REX was charging one-time offload registration to the benchmark’s measured path. Moving rex_offload_init() before user timing removed that cold-start contamination and made the comparison meaningful.

But it did not finish the nn story.

Once the host lifecycle bug was removed, the remaining gap pointed at a different layer:

the shape of the generated device loop.

That is the right order of investigation. If first-use registration is still inside the timer, device-side profiling is polluted. Once that fixed cost is out of the way, the profiler and generated source can tell us something useful about the hot path.

For nn, the hot-path problem was the old XOMP scheduler path.

REX was still using generic static-scheduler machinery for a simple canonical one-dimensional target loop. Native LLVM was not. Native LLVM lowered the same kind of loop into a direct SPMD-style kernel body where each GPU lane owns iterations through a grid-stride mapping.

That difference matters because nn is small enough that extra helper state, extra control flow, and extra scheduler calls are visible. The old path was not semantically wrong. It was too general for the loop shape in front of it.

A diagram comparing the old XOMP scheduler path with a direct grid-stride path. The scheduler path goes through helper state, scheduler init, scheduler next, and a while loop. The direct path computes global thread id and global thread count, then rewrites the loop bounds and stride. — Figure 1. The old XOMP scheduler path solved a more general work-distribution problem than `nn` had. The direct path maps CUDA lanes to loop iterations without helper-driven chunk scheduling.

The Symptom After Cold Start Was Removed

The earliest comparison artifact had nn at roughly:

1
2
native LLVM: about 0.3031 s
REX:         about 0.9986 s

That result was dominated by the misplaced host/runtime lifecycle cost described in the previous post.

After the host-side cleanup and the first direct-lowering work, a later comparison showed:

1
2
native LLVM: about 0.2609 s
REX:         about 0.3056 s

That is a very different kind of gap.

The old result said:

something fundamental is being measured in the wrong place.

The later result said:

now we are close enough that generated kernel structure matters.

That shift is important. Compiler performance work is not just about making a number smaller. It is about making the number explainable. Once nn moved from almost 1.0 s to roughly 0.3 s, the remaining overhead was no longer a mysterious runtime bootstrap problem. It was in the steady-state path.

The generated device source made the next issue visible.

What The Old XOMP Path Did

The old lowering path normalized a canonical loop, then wrapped it in scheduler state:

1
2
3
4
5
6
SgVariableDeclaration *dev_loop_chunk_size_decl = buildVariableDeclaration(
    "_dev_loop_chunk_size", buildIntType(), NULL, bb1);
SgVariableDeclaration *dev_loop_sched_index_decl = buildVariableDeclaration(
    "_dev_loop_sched_index", buildIntType(), NULL, bb1);
SgVariableDeclaration *dev_loop_stride_decl =
    buildVariableDeclaration("_dev_loop_stride", buildIntType(), NULL, bb1);

Then it computed helper-based CUDA thread information:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
SgAssignInitializer *init_block_count = buildAssignInitializer(
    buildFunctionCallExp(SgName("getCUDABlockThreadCount"), buildIntType(),
                         buildExprListExp(buildIntVal(1)), bb1),
    buildIntType());

SgAssignInitializer *init_thread_idx = buildAssignInitializer(
    buildFunctionCallExp(SgName("getLoopIndexFromCUDAVariables"),
                         buildIntType(), buildExprListExp(buildIntVal(1)),
                         bb1),
    buildIntType());

Then it called scheduler helpers:

1
2
3
4
5
6
SgStatement *call_stmt = buildFunctionCallStmt(
    "XOMP_static_sched_init", buildVoidType(), parameters, bb1);

SgExpression *func_call_exp = buildFunctionCallExp(
    "XOMP_static_sched_next", buildBoolType(), parameters, bb1);
SgWhileStmt *w_stmt = buildWhileStmt(func_call_exp, for_loop);

That is a reasonable fallback design. It gives the compiler a way to distribute work even when it does not want to rewrite the loop into a simple CUDA lane mapping.

But for nn, this machinery was solving the wrong problem.

The loop was canonical. It was one-dimensional. It did not need a helper to repeatedly hand out scheduler chunks. A CUDA global lane id and a CUDA global lane count were enough.

The cost of the old path came from several places at once:

more generated local variables for scheduler state,
helper calls to compute CUDA thread count and lane id,
a scheduler initialization call before the loop,
a scheduler-next call in the loop guard,
extra address-taken state for lower, upper, stride, and chunk information,
and a while wrapper around the original loop body.

None of those costs is individually shocking. Together they are a legacy abstraction tax.

That tax is easy to miss in a huge compute-heavy kernel. It is not easy to miss in nn.

What Native LLVM Was Doing Differently

Native LLVM was the right reference for this part of the investigation because it was compiling the same OpenMP source with an integrated GPU offload pipeline.

For the nn-like shape, native LLVM did not need a generic scheduler loop. It emitted a target kernel whose device-side structure followed the usual SPMD idea:

compute a logical GPU lane id,
compute a total number of participating lanes,
start each lane at a different iteration,
advance by the global lane count.

That pattern is the standard grid-stride loop:

1
2
3
4
5
for (i = lower + global_tid * stride;
     i <= upper;
     i += global_thread_count * stride) {
  body(i);
}

That is not a benchmark trick. It is the natural lowering for a canonical one-dimensional GPU target loop.

So the design question became:

can REX recognize that general loop shape and generate the direct form without losing the safety of the old fallback?

The answer had to be yes, but not by special-casing nn.

The Loop Facts REX Needed

The direct path starts with a small record:

1
2
3
4
5
6
7
8
struct TargetLoopLoweringInfo {
  SgInitializedName *orig_index = nullptr;
  SgExpression *orig_lower = nullptr;
  SgExpression *orig_upper = nullptr;
  SgExpression *orig_stride = nullptr;
  bool is_incremental = true;
  bool is_inclusive_bound = true;
};

This record is small on purpose.

To rewrite a loop into a grid-stride form, the lowerer needs to know:

which variable is the induction variable,
where the original iteration space starts,
where it stops,
how large each original step is,
whether the loop increments or decrements,
and whether the bound is inclusive after normalization.

That separation matters because the old scheduler path mixed several jobs together. It recovered loop shape, computed scheduler state, and generated work distribution in one conceptual flow.

The direct path splits those concerns:

first recover canonical loop facts,
then decide whether the loop is safe for direct lowering,
then generate either the direct grid-stride path or the scheduler fallback.

A decision diagram showing REX rewriting pointer-based loop indices, normalizing the loop, recovering canonical loop facts, checking direct-path eligibility, and then choosing either direct grid-stride lowering or the old XOMP fallback. — Figure 2. The optimization is not keyed to the benchmark name. It is keyed to recoverable loop facts. If those facts are missing, REX keeps the fallback path.

Why Pointer-Based Indices Mattered

One subtle problem was specific to REX’s source-to-source lowering pipeline.

After outlining, loop indices can appear as pointer dereferences rather than clean scalar induction variables. That can happen because the outlining path has to preserve host/device parameter passing semantics, and loop-local variables may be represented indirectly at intermediate stages.

If the direct fast path only worked on already-clean loops, it would be too fragile. It might pass a reduced nn experiment and then fail on real generated code.

That is why the current analysis begins by rewriting pointer-based loop indices before canonical-loop analysis:

1
2
rewritePointerBasedForIndices(for_loop);
SageInterface::forLoopNormalization(for_loop);

Only after that does REX ask whether the loop is canonical:

1
2
3
4
bool is_canonical = isCanonicalForLoop(
    for_loop, &info.orig_index, &info.orig_lower, &info.orig_upper,
    &info.orig_stride, NULL, &info.is_incremental);
ROSE_ASSERT(is_canonical);

This is a compiler fix, not a benchmark edit.

The fast path has to work on the loop forms REX actually generates, not just on the prettiest loop form one would write by hand.

The Direct Rewrite

The direct lowering computes two values:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
static SgExpression *buildCudaDimXLongLongRef(const std::string &name,
                                              SgScopeStatement *scope) {
  return buildCastExp(buildCudaDimXRef(name, scope), buildLongLongType());
}

static SgExpression *buildCudaGlobalThreadIdXExpr(SgScopeStatement *scope) {
  return buildAddOp(buildMultiplyOp(
                        buildCudaDimXLongLongRef("blockDim.x", scope),
                        buildCudaDimXRef("blockIdx.x", scope)),
                    buildCudaDimXRef("threadIdx.x", scope));
}

static SgExpression *buildCudaGlobalThreadCountXExpr(SgScopeStatement *scope) {
  return buildMultiplyOp(buildCudaDimXLongLongRef("gridDim.x", scope),
                         buildCudaDimXRef("blockDim.x", scope));
}

Those expressions are ordinary CUDA execution facts:

blockIdx.x * blockDim.x + threadIdx.x is the global lane id,
gridDim.x * blockDim.x is the total number of lanes in the launch.

The lowerer then rewrites the loop:

1
2
3
4
5
6
7
setLoopLowerBound(
    for_loop, buildAddOp(deepCopy(info.orig_lower),
                         buildMultiplyOp(buildVarRefExp(dev_thread_id_symbol),
                                         deepCopy(info.orig_stride))));
setLoopUpperBound(for_loop, deepCopy(info.orig_upper));
setLoopStride(for_loop, buildMultiplyOp(buildVarRefExp(dev_thread_num_symbol),
                                        deepCopy(info.orig_stride)));

The implementation materializes those two generated temporaries with buildLongLongType() in lowerTargetLoopDirectGridStride() so the launch-width products are not truncated on large grids. The emitted shape is:

1
2
3
4
5
6
7
8
long long _dev_thread_num = (long long)gridDim.x * blockDim.x;
long long _dev_thread_id = (long long)blockDim.x * blockIdx.x + threadIdx.x;

for (i = lower + _dev_thread_id * stride;
     i <= upper;
     i += _dev_thread_num * stride) {
  ...
}

No scheduler initialization is needed.

No scheduler-next call is needed.

No helper has to hand out chunks.

Each lane can decide its own first iteration and next iteration directly from the launch geometry.

Why The Fallback Stayed

It would be wrong to delete the XOMP path completely.

The direct path is valid for a class of loops, not every possible loop the frontend may encounter. If REX cannot recover the loop facts it needs, or if a future loop shape needs more complicated scheduling semantics, the compiler should preserve correctness rather than force a fast path.

That is why the lowerer still has two paths:

1
2
3
4
5
6
7
8
if (canUseDirectTargetLoopFastPath(info)) {
  lowerTargetLoopDirectGridStride(for_loop, bb1, info);
  if (used_direct_grid_stride != NULL) {
    *used_direct_grid_stride = true;
  }
} else {
  lowerTargetLoopRoundRobin(for_loop, bb1, info);
}

The important design point is the default for eligible canonical loops.

Before this work, a simple loop could still pay for the scheduler abstraction. After this work, the scheduler is a fallback for cases that need it, not the default for loops that do not.

That is the difference between a generic compiler optimization and a benchmark hack.

The Scope Bug This Could Have Introduced

Loop normalization can hoist an induction-variable declaration to the statement immediately before the loop.

That creates a mundane but dangerous code-generation problem. If REX replaces the loop with a new basic block but leaves the hoisted declaration outside the block, generated code can become invalid or subtly wrong.

The direct path handles that with a narrow helper:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
static SgVariableDeclaration *
findHoistedTargetLoopIndexDeclaration(SgForStatement *for_loop,
                                      const TargetLoopLoweringInfo &info) {
  ...
  SgVariableDeclaration *index_decl =
      isSgVariableDeclaration(info.orig_index->get_declaration());
  if (index_decl == nullptr) {
    return nullptr;
  }

  if (SageInterface::getPreviousStatement(for_loop, false) != index_decl) {
    return nullptr;
  }

  const SgInitializedNamePtrList &decl_vars = index_decl->get_variables();
  if (decl_vars.size() != 1 || decl_vars.front() != info.orig_index) {
    return nullptr;
  }

  return index_decl;
}

This helper is intentionally conservative.

It only moves the declaration when the pattern is exactly the one produced by loop normalization or pointer-index rewriting: a single declaration tightly coupled to the loop being transformed.

That matters because AST rewrites are allowed to be aggressive only when ownership is clear. Moving a declaration that belongs to a wider scope would be a correctness bug.

The Tests Guard The Shape, Not The Whole File

The reduced Rodinia lowering suite protects this optimization with structural checks.

For the nn-like case, the verifier rejects the old scheduler machinery:

1
2
3
4
5
6
7
8
expect_count "${cu_file}" 'XOMP_static_sched_init[[:space:]]*\(' 0 \
  "unexpected XOMP scheduler init in nn-like device code"
expect_count "${cu_file}" 'XOMP_static_sched_next[[:space:]]*\(' 0 \
  "unexpected XOMP scheduler next in nn-like device code"
expect_count "${cu_file}" 'getCUDABlockThreadCount[[:space:]]*\(' 0 \
  "unexpected helper-based CUDA thread count in nn-like device code"
expect_count "${cu_file}" 'getLoopIndexFromCUDAVariables[[:space:]]*\(' 0 \
  "unexpected helper-based CUDA thread id in nn-like device code"

That is the right kind of test for this bug.

It does not require the entire generated CUDA file to match a golden output. It checks the meaningful invariant:

an nn-like canonical target loop should not go through the old scheduler helper path.

The same suite also protects the broader source-to-source requirement that the generated program still works for more than one kernel and more than one call.

For example, the rodinia_axpy_multi_like case expects three generated kernels and repeated host calls:

1
2
3
4
verify_common_cuda_lowering "${rose_file}" "${cu_file}" 3
expect_count "${rose_file}" \
  'axpy_like[[:space:]]*\([[:space:]]*x,[[:space:]]*y,[[:space:]]*a,[[:space:]]*n[[:space:]]*\)[[:space:]]*;' \
  2 "repeated host call count"

And the rodinia_btree_kernel_like case expects two kernels with repeated calls to both:

1
2
3
4
5
verify_common_cuda_lowering "${rose_file}" "${cu_file}" 2
expect_count "${rose_file}" 'kernel_cpu_like[[:space:]]*\(' 2 \
  "first kernel repeated host call count"
expect_count "${rose_file}" 'kernel_cpu_2_like[[:space:]]*\(' 2 \
  "second kernel repeated host call count"

Those tests exist because this optimization lives inside a compiler, not a hand-edited benchmark file.

REX has to regenerate correct code for arbitrary user programs. A fast path that only works for a single kernel in a single call site is not acceptable.

A matrix showing that the nn-like case rejects scheduler helpers, the axpy multi-like case checks multiple kernels and repeated calls, and the btree-like case checks multiple kernels with repeated host calls. — Figure 3. The test suite checks both sides of the change: `nn`-like kernels avoid the scheduler path, while multi-kernel and repeated-call cases still preserve generated program structure.

What This Changed In The Performance Story

This attempt did not happen in a pristine one-change microbenchmark harness. It landed as part of a sequence that also included the host lifecycle cleanup described in the previous post.

So the correct claim is not:

scheduler removal alone explains every millisecond of improvement.

The defensible claim is:

after cold-start setup was moved out of the timed path, the old scheduler path was the next structural overhead visible in nn, and replacing it was necessary for REX to get close to native LLVM.

The chronology supports that.

The earliest artifact showed REX nn near 0.9986 s. After the init cleanup and the direct grid-stride path had landed, the later comparison showed REX near 0.3056 s.

That improvement is too large to attribute only to timer placement. The host fix removed the pathological first-use cost. The direct lowering removed a hot-path abstraction tax.

Together they changed the nature of the remaining problem.

Before:

REX was measuring runtime setup,
and then running a scheduler-heavy kernel path.

After:

REX registered before timing,
and eligible canonical loops used direct CUDA-style work distribution.

At that point, the remaining nn gap was small enough to investigate launch geometry and ABI details instead of arguing about whether the compiler was measuring the right thing.

What REX Did Better After This Change

The most important improvement was not the raw number. It was that the generated device code became structurally appropriate for the source loop.

For canonical one-dimensional target loops, REX now does the thing a GPU compiler should do:

compute global lane identity directly from CUDA launch variables,
compute total launch width directly,
assign each lane its first iteration from the original lower bound and stride,
advance by the launch width,
and avoid scheduler helper calls in the loop guard.

That is exactly the kind of code a human would write in CUDA for this loop shape.

The second improvement was maintainability.

The lowerer now has an explicit TargetLoopLoweringInfo model. That makes the decision visible. Future work can ask concrete questions:

did canonical-loop recovery succeed?
did the direct-path eligibility test pass?
did the fallback path still exist?
did host launch shaping use the same loop facts?

That is much easier to reason about than a single generic scheduler path that hides every loop behind helper calls.

The third improvement was testability.

The reduced Rodinia verifier can now state the invariant directly: scheduler helpers must not appear in nn-like device code. That is a stronger regression guard than simply checking that the benchmark runs once.

What LLVM Still Did Better

Even after this fix, native LLVM still had advantages.

LLVM’s offload pipeline is integrated end to end. It does not have to recover as much structure after source-to-source outlining. It owns the original OpenMP lowering, the device IR, and the runtime ABI generation in one pipeline.

REX has to reconstruct that shape from generated source and ROSE AST transformations. That gives REX excellent inspectability, but it also means REX must explicitly recover loop facts, move declarations safely, and preserve fallback behavior.

LLVM also had a more mature launch ABI path at this stage. After the scheduler tax was removed, the remaining gaps were increasingly about:

launch geometry policy,
scalar target-parameter representation,
and the transition from legacy __tgt_target_teams style calls toward direct __tgt_target_kernel lowering.

Those are not loop-scheduler problems. They are the next layers in the performance stack.

That is why this post should not be read as “direct grid-stride lowering made REX done with nn.” It made REX’s kernel body credible. The next bottlenecks moved elsewhere.

The Rule From This Attempt

The design rule from this attempt is:

use the generic XOMP scheduler only when the compiler cannot safely lower the target loop directly.

For canonical GPU target loops, direct grid-stride lowering should be the default.

That rule is generic. It is not tied to nn. It applies wherever REX can recover the same loop facts and preserve semantics.

It also keeps the right failure mode. If the loop is not safe for direct lowering, REX falls back to the scheduler rather than generating fast but wrong code.

That is the pattern the rest of the performance campaign keeps following:

remove accidental overhead,
encode the generic compiler reason,
preserve user intent,
keep a correctness fallback,
and test the generated structure directly.

The next post moves to launch geometry:

what REX may optimize, what it must preserve, and why nn and heartwall made that question more complicated than “use the smallest launch that fits the trip count.”