How REX Handles Launch Geometry Fairly: What It May Optimize And What It Must Preserve
The previous performance post ended after REX replaced the old XOMP scheduler path with direct grid-stride lowering for simple target loops. That removed a real device-side tax: the generated kernel no longer needed helper state, scheduler-next calls, or a loop around a software worksharing protocol when a plain global-thread-id mapping was enough.
That still left a second question:
What launch shape should the generated host code ask the OpenMP offload runtime to execute?
This sounds like a smaller problem than scheduler removal. It is not. Launch geometry sits exactly on the boundary between source semantics, compiler policy, runtime overhead, and GPU occupancy. A source-to-source compiler can easily make a benchmark faster by silently changing a launch shape. It can also make the comparison unfair by overriding a choice the user explicitly wrote in the OpenMP source.
That is why this post is about fairness as much as performance.
The useful result from the REX work was not “use a smaller launch.” The useful result was a rule for deciding which parts of the launch are compiler-owned and which parts are user-owned.
REX may optimize compiler-owned defaults.
REX must preserve explicit user launch choices unless they are invalid for the target.
Everything else in this post follows from that boundary.
Figure 1. Launch geometry is not one object. Some of it belongs to the source program, and some of it belongs to the compiler’s default policy.
Why Launch Geometry Became The Next Problem
Once direct grid-stride lowering was in place, REX and native LLVM were no longer separated by the old scheduler path on the simplest target loops. That made smaller effects visible.
The generated REX host code still had to choose values equivalent to CUDA blocks and threads:
a team count that maps to CUDA blocks (carried in a _num_blocks_-style variable), and
a per-team thread width that maps to CUDA threads per block (carried in an omp_num_threads-style variable).
Those values eventually feed the OpenMP offload launch call. In the direct path, the device loop uses the combined grid as the iteration space:
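A minimal sketch of that device-side shape (the kernel body and names here are illustrative, not REX's exact generated code):

```cpp
// Grid-stride device loop (sketch): the combined launch grid is the
// iteration space. Each lane starts at its global thread id and strides
// by the total number of launched lanes.
__global__ void axpy_kernel(int n, float a, const float *x, float *y) {
  int stride = gridDim.x * blockDim.x;                 // total launched lanes
  for (int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
       i < n; i += stride) {
    y[i] = a * x[i] + y[i];
  }
}
```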
That shape is simple and fast when the launch matches the live work. It is wasteful when the launch is much wider than the loop.
The nn benchmark made this obvious. Its hot offloaded loop is small. If the live tripcount is only a handful of iterations but the generated launch asks for hundreds or thousands of threads, most GPU lanes get no useful iteration. On a short benchmark, even a small amount of launch and scheduling waste is visible.
The first instinct is therefore natural:
if the loop only has a handful of iterations, shrink the whole launch until it just covers them.
That can be a valid optimization when the compiler chose both values. But as a general lowering rule it is too aggressive.
OpenMP source may contain launch clauses. The user may write a team count, a thread limit, or a combined construct whose lowering carries specific launch intent into the generated host code. Even if that intent is not optimal for one input, it is still program input. A compiler cannot fairly claim a win against native LLVM by silently changing an explicit choice that native LLVM preserves.
This was the key correction in the REX work. Launch shaping is allowed, but only inside the part of the launch contract that REX owns.
The Fairness Boundary
The practical rule we adopted is:
REX may optimize launch values it chose itself, the compiler-owned defaults. REX must preserve launch values the user wrote explicitly, unless they are invalid for the target.
That rule is not just about benchmark ethics. It is also a correctness and maintainability rule.
If source code says:
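(a representative example; the specific clause values here are illustrative, not drawn from a particular benchmark)

```cpp
// The user has pinned both the team count and the per-team thread limit.
// These launch values are user-owned, not compiler defaults.
#pragma omp target teams distribute parallel for num_teams(64) thread_limit(256)
for (int i = 0; i < n; i++)
  out[i] = in[i] * scale;
```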
then REX should not replace that with num_teams(1) and thread_limit(32) just because one input happened to have n = 10. The user may have chosen that shape because other inputs are larger, because occupancy is being controlled manually, or because the code was tuned with a specific runtime behavior in mind.
It is also possible that the choice is simply bad. That still does not make it compiler-owned.
This matters in performance comparisons with native LLVM. If native LLVM honors an explicit launch clause and REX ignores it, then a REX speedup is not proving that the lowering is better. It is proving that REX ran a different launch contract.
Fair comparison requires the same rule for both sides:
an explicit launch clause in the source is part of the program contract, and both compilers must lower it as written rather than trade it away for a benchmark win.
The compiler can still protect itself from invalid requests. If a requested block size exceeds a hardware limit, it must be capped or rejected according to the target rules. But “suboptimal for this benchmark input” is not the same as “invalid.”
What REX Actually Does Now
The current direct-launch policy can be described with two launch-ownership facts:
The team count (_num_blocks_) is derived from the OpenMP source and defaults, and the tripcount cap never shrinks it.
The per-team thread width (carried through an omp_num_threads-style variable) may be capped, but only when it is a compiler-owned default rather than an explicit source choice.
The second name is deliberate in this post. REX’s internal SPMD driver ultimately needs one thread-width value for the launch call, so implementation code may carry that value through an omp_num_threads-style variable. The policy question is broader than the variable name: did the source explicitly constrain the thread width, either through num_threads on a parallel form or through thread_limit on a teams form? If yes, the tripcount cap must not treat that width as a compiler-owned default.
The generated host driver still starts from the OpenMP-derived launch values:
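Schematically, and with illustrative names rather than REX's exact generated identifiers, the starting point might be carried like this:

```cpp
// Launch values as derived from the OpenMP source, before any shaping.
// A clause value of 0 means "no clause written"; the default then applies.
struct LaunchValues {
  int  num_blocks;            // ~ CUDA grid size  (OpenMP teams)
  int  threads_per_block;     // ~ CUDA block size (OpenMP thread width)
  bool explicit_thread_width; // true if num_threads/thread_limit appeared in source
};

LaunchValues initial_launch(int default_teams, int default_threads,
                            int clause_teams, int clause_threads) {
  LaunchValues lv;
  lv.num_blocks            = clause_teams   > 0 ? clause_teams   : default_teams;
  lv.threads_per_block     = clause_threads > 0 ? clause_threads : default_threads;
  lv.explicit_thread_width = clause_threads > 0;
  return lv;
}
```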
Then it applies tripcount-aware shaping only where the compiler owns the thread-width choice. The snippet below is the simplified core of the policy: it shows the fairness-sensitive tripcount cap. The fuller schematic in the previous direct-grid-stride post also includes the nested-loop direct_launch_thread_cap guard that belongs to the direct-kernel fast path, but that extra cap does not change the ownership rule described here.
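Here is a hedged sketch of that core, assuming the driver already knows whether the thread width was explicit and whether a loop tripcount was recovered (the identifiers are illustrative, not REX's exact generated names):

```cpp
// Fairness-sensitive tripcount cap (sketch).
// Fires only when BOTH hold:
//   1. the thread width is a compiler-owned default (no num_threads or
//      thread_limit in the source), and
//   2. the recovered tripcount is smaller than that default width.
// It never touches the team count (_num_blocks_-equivalent).
int cap_thread_width(int threads_per_block, bool explicit_thread_width,
                     long long tripcount, bool tripcount_known) {
  if (explicit_thread_width)                // user-owned: never override
    return threads_per_block;
  if (!tripcount_known || tripcount <= 0)   // nothing proven: keep the default
    return threads_per_block;
  if (tripcount >= threads_per_block)       // grid-stride handles multiple passes
    return threads_per_block;

  // Round the cap up to warp granularity (32) so a tiny tripcount does not
  // produce an odd sub-warp block, then make sure the cap never grows the
  // original compiler-owned request.
  int capped = (int)((tripcount + 31) / 32) * 32;
  if (capped > threads_per_block)
    capped = threads_per_block;
  return capped;
}
```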
This is a small block of code, but every condition is doing real policy work.
First, the optimization is disabled when the source explicitly constrained thread width. That includes num_threads on forms that use that clause directly and thread_limit on teams-based forms that lower into the launch’s thread-width value. That protects user intent and keeps the comparison fair.
Second, it only fires when the requested block is wider than the loop tripcount. If the launch already has fewer threads than iterations, the grid-stride loop can use multiple passes per thread and there is no obvious empty-lane waste to remove.
Third, the cap is rounded to a granularity of 32 when possible. That avoids replacing a bad launch with a different bad launch such as 7 threads per block. The policy is not “always use exactly tripcount threads.” It is “avoid launching a block that is much wider than the useful work, without creating a strange sub-warp launch when a warp-sized launch is more sensible.”
Fourth, the code preserves _num_blocks_. That part is as important as the thread cap.
Figure 2. The current REX policy is deliberately asymmetric: it may reduce compiler-owned thread width, but it does not shrink the team count as a side effect of seeing a small tripcount.
Why REX Does Not Shrink Everything
The early experiments showed why shrinking both dimensions is dangerous.
If a loop has only a small number of iterations, it is tempting to collapse the entire launch to the smallest shape that covers the tripcount. For example:
a 10-iteration loop becomes a single team with just enough threads, or a single warp, to cover the 10 iterations.
For a tiny loop with almost no work per iteration, that may be exactly what we want. But it is not a universal rule.
Teams map naturally to CUDA blocks. Blocks are the unit the GPU scheduler distributes across streaming multiprocessors. Preserving more teams can expose more block-level parallelism, more scheduling flexibility, and more latency hiding. If each outer iteration contains a meaningful amount of nested work, memory traffic, or divergent control flow, collapsing the launch to one block can underfill the GPU even when the source-level tripcount looks small.
This is where nn and heartwall initially created confusion.
nn rewards avoiding oversized thread width because its active work window is small. A huge block for a tiny loop is mostly idle lanes.
heartwall is different. Its offloaded computation is repeated many times and has different memory behavior. Depending on the input and experiment state, preserving block-level parallelism can matter more than minimizing the number of launched lanes. In early measurements, REX looked dramatically better than native LLVM on heartwall. Later comparison artifacts no longer showed the same simple picture. That did not mean the first observation was useless. It meant the launch-shape story was more complicated than one benchmark table could prove.
The lesson was:
a launch rule that only makes sense for nn, or only for heartwall, is not a compiler rule at all; one benchmark table cannot decide the policy.
A source-to-source compiler needs structural policies. The structural policy here is:
cap compiler-owned thread width when the tripcount proves it is oversized, but do not shrink the team count as a side effect of seeing a small tripcount.
That is why the REX implementation handles thread capping and team preservation separately.
The nn Versus heartwall Lesson
The user asked a direct question during the performance work: what happens if we try “team first” instead of “thread first”? For example, instead of 1 team x 10 threads, would 10 teams x 1 thread use more compute units in parallel?
That question is exactly the right way to think about launch geometry. Threads and teams are not interchangeable.
On a CUDA-like target, more threads in one block and more blocks with fewer threads create different scheduling behavior. A block consumes scheduling resources as a unit. A warp is the execution granularity inside a block. A grid-stride loop then determines whether each lane gets zero iterations, one iteration, or many iterations.
So these two shapes are not equivalent:
1 team x 10 threads versus 10 teams x 1 thread.
They may cover the same number of logical lanes, but they do not expose the same block-level scheduling. They do not have the same occupancy behavior. They do not have the same warp efficiency. They can also interact differently with memory latency and per-block resource use.
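To make the difference concrete, here is a small illustrative comparison of what each shape costs on a CUDA-like target (warp size 32 assumed; the lane counts are arithmetic, not measurements):

```cpp
#include <cstdio>

// Compare two launch shapes that cover the same 10 logical lanes.
// Assumes a warp size of 32; numbers are illustrative, not measured.
int main() {
  const int warp = 32;
  struct Shape { const char *name; int blocks; int threads; };
  Shape shapes[] = {
    {"1 team  x 10 threads", 1, 10},
    {"10 teams x 1 thread ", 10, 1},
  };
  for (const Shape &s : shapes) {
    int warps_per_block = (s.threads + warp - 1) / warp;  // warps the HW allocates
    int total_warps     = warps_per_block * s.blocks;
    int hw_lanes        = total_warps * warp;             // lanes actually scheduled
    int useful_lanes    = s.blocks * s.threads;
    printf("%s : %2d block(s), %2d warp(s), %3d hw lanes for %2d useful lanes\n",
           s.name, s.blocks, total_warps, hw_lanes, useful_lanes);
  }
  return 0;
}
```

The second shape gives the scheduler ten blocks to spread across multiprocessors but wastes 31 of every 32 lanes in each warp; the first fills one warp but leaves the scheduler a single block. Which trade-off wins depends on the work inside each iteration, which is exactly why tripcount alone cannot decide the team count.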
For nn, the old problem was too much thread width for too little useful work. Shrinking compiler-owned thread width was enough to remove some waste.
For heartwall and gaussian, preserving enough teams can matter because repeated kernels and memory-heavy loops need block-level parallelism. The right strategy is not “always shrink teams” or “always preserve teams.” The right strategy is to decide whether the team count is user-owned, compiler-owned, or invalid.
The current post is intentionally about the conservative part of that strategy. REX now avoids the unfair optimization: it does not use tripcount to override explicit user launch clauses. It also avoids the dangerous optimization: it does not collapse teams just because the tripcount is small.
Figure 3. nn and heartwall exposed opposite failure modes. The compiler rule has to be based on ownership and structure, not benchmark names.
Why The Warp-Rounded Cap Matters
The thread cap is not simply:
cap the thread width to exactly the tripcount: threads_per_block = min(threads_per_block, tripcount).
That would be easy to explain, but it would produce poor launch shapes for tiny tripcounts.
Suppose the compiler-owned default is 1024 threads and the recovered tripcount is 10. Launching 10 threads per block is legal in the abstract, but it is usually not a good GPU shape. A warp-sized cap is more sensible:
round the tripcount up to the next multiple of 32, so a 10-iteration loop gets one 32-wide block instead of a 1024-wide one.
That still removes the worst waste. It avoids launching 1024 lanes for 10 iterations. But it keeps the block size aligned to the hardware execution model.
The implementation also handles the case where the original request was already below 32:
the warp-rounded value is clamped so it never exceeds the original compiler-owned request.
That detail matters because the cap should not increase the thread count. If the incoming compiler-owned thread count is 16, the lowering should not round a tripcount up to 32 and accidentally ask for more threads than the original default. The cap is a cap, not a hidden expansion.
So the shape is:
new thread width = min(original compiler-owned width, tripcount rounded up to the next multiple of 32).
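As a quick check on that formula (an illustrative helper, not REX's generated code):

```cpp
#include <cassert>

// Warp-rounded cap that never exceeds the original compiler-owned width.
int warp_rounded_cap(int original_width, long long tripcount) {
  int capped = (int)((tripcount + 31) / 32) * 32;  // round up to warp granularity
  return capped < original_width ? capped : original_width;
}

int main() {
  assert(warp_rounded_cap(1024, 10) == 32);  // big default, tiny loop: one warp
  assert(warp_rounded_cap(16, 10) == 16);    // small default: the cap never expands it
  return 0;
}
```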
This gives REX a predictable launch shape for tiny loops without violating the original upper bound.
What This Fixed And What It Did Not Fix
This work fixed one important class of performance mistake:
launching a block far wider than the loop's useful tripcount when that width was nothing more than a compiler-owned default.
That matters for short kernels and repeated kernels. A direct grid-stride device loop is only as good as the launch that feeds it. If the host side launches thousands of lanes for a handful of iterations, REX can still lose even though the kernel body is cleaner than the old scheduler path.
But launch geometry was not the whole performance story.
The measurement history showed that geometry changes could move benchmarks in both directions. Some runs improved nn but hurt other benchmarks. Some temporary generated-file experiments made heartwall look like a decisive REX win, then later full-suite rebuilds changed the picture. gaussian still needed deeper ABI work. b+tree later required a more careful distinction between fair launch preservation and kernel-body optimization.
That is the reason the final policy is deliberately modest.
It does not claim that REX has a perfect occupancy model.
It does not claim to infer every best block count.
It does not rewrite explicit source choices because they look suboptimal.
It only takes the safe, generic win: when REX itself owns the thread default and source analysis proves the tripcount is smaller than that width, cap the thread count to a warp-rounded useful size.
That sounds less exciting than “auto-tune every launch.” It is also the correct compiler engineering choice.
Why This Is Still A Real Optimization
Being conservative does not mean being passive.
Native LLVM’s OpenMP offload path is built to handle a wide range of target regions through a general runtime model. That generality is a strength. It means LLVM can preserve semantics across many constructs without relying on a source-to-source compiler’s ability to recover every source-level fact.
REX has a different advantage. Because it owns the source-to-source lowering, it can keep certain source facts alive in ordinary generated host code. The loop tripcount is one of those facts.
When REX can prove:
that the thread width in the launch is a compiler-owned default, and that the loop tripcount is recoverable and smaller than that width,
then it can generate a better host launch without changing source intent.
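For example, in a shape like the following (illustrative, not a specific benchmark), the tripcount is an ordinary source expression the generated host driver can compare against the default thread width before launching:

```cpp
// No launch clauses: both the team count and the thread width are
// compiler-owned defaults, and the loop tripcount is simply `n`.
void scale(float *out, const float *in, float s, int n) {
#pragma omp target teams distribute parallel for map(to: in[0:n]) map(from: out[0:n])
  for (int i = 0; i < n; i++)
    out[i] = in[i] * s;
}
```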
That is the important point. This is not a benchmark-specific hack. The generated code does not check for nn, heartwall, or any other application name. It checks structural facts:
whether the thread width came from an explicit clause or from a compiler default, whether a loop tripcount can be recovered, and whether that tripcount is smaller than the requested width.
Those are compiler conditions. They apply to any program that matches the same shape.
How To Test This Kind Of Change
Launch geometry changes need more than one benchmark run.
The basic generated-code checks should confirm three things:
Explicit num_threads and thread_limit values from the source survive unchanged into the generated launch call.
Compiler-owned thread defaults are capped, with warp rounding, when the recovered tripcount is smaller than the default width.
The team count is never reduced by the tripcount cap.
A useful structural check is to inspect the generated host driver and verify that the tripcount guard sits behind the explicit-thread check, so it can only fire when the thread width is a compiler-owned default:
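In the generated driver, the shape to look for is roughly this nesting (identifiers illustrative; warp_rounded_cap is the helper sketched above):

```cpp
// The tripcount cap is guarded by the explicit-thread check, so it can
// only fire on a compiler-owned default width.
if (!explicit_thread_width) {
  if (tripcount_known && tripcount < omp_num_threads) {
    omp_num_threads = warp_rounded_cap(omp_num_threads, tripcount);
  }
}
// _num_blocks_ is never modified by this block.
```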
The benchmark checks then need to cover at least two opposite shapes:
a loop whose launch width is a compiler-owned default and whose tripcount is tiny, where the cap should fire and help, and
a loop whose launch is pinned by explicit source clauses, where the cap must not fire even if the explicit shape looks suboptimal for the input.
That second case is the one that prevents cheating. A compiler can always look better by overriding a bad user choice. The interesting compiler result is whether it can improve the code while still respecting the same contract native LLVM is respecting.
The later fair comparison work used exactly this principle. Some apparent REX wins had to be reclassified once we noticed that the generated launch had changed an explicit source choice. Those wins were not useful evidence. After the fairness correction, the remaining performance story became stronger because the benchmark outcomes could be explained without relying on hidden launch-contract changes.
The Design Rule To Keep
The durable lesson from this phase is:
optimize what the compiler owns; preserve what the user wrote, unless it is invalid for the target.
For REX, that means:
Compiler-owned thread defaults may be capped using recovered tripcounts, with warp rounding.
Explicit num_teams, num_threads, and thread_limit choices are lowered as written.
The team count is never shrunk just because a tripcount looks small.
That rule is what made the later performance work credible. Without it, every benchmark win would need a footnote: “REX was faster, but it may have silently changed the launch the user asked for.” With it, the comparison becomes cleaner. If REX wins, the win is coming from a legitimate compiler advantage: source-level tripcount knowledge, direct-kernel lowering, cheaper device control flow, better argument ABI, or later kernel-body improvements.
Launch geometry was therefore not the final fix. It was the fairness boundary that made the rest of the fixes meaningful.
The next post moves from launch shape to launch ABI: how REX started representing literal scalar target parameters in the modern OpenMP offload path instead of treating every scalar like an address-based mapped object.