How REX Kept b+tree Launch Geometry Fair

The b+tree benchmark exposed a launch-geometry problem that looked easy to fix and easy to get wrong. A manual sweep showed that the original 1024-thread block shape left performance on the table, and smaller blocks such as 128 or 256 threads could run faster on the tested GPU. But the source program had explicitly requested its launch shape, and native LLVM honored that request. REX could not fairly claim a compiler win by silently rewriting valid user intent. The correct fix was to keep the heuristic as a generic default policy: use nested-loop structure and tripcount evidence to cap compiler-owned thread width, but preserve explicit user launch clauses unless they are invalid for the target.

The previous post established the general fairness principles for GPU launch geometry. This post applies those rules to a specific case: the b+tree benchmark, which exposed a performance gap that remained even after the direct __tgt_target_kernel and ABI migration work.

At that point, the easy failure modes were mostly gone. The b+tree benchmark built. It registered its cubin. It launched its kernels. Its output matched.

That left the uncomfortable kind of performance gap: a small one.

Small gaps matter in compiler work because they often point to a real missed invariant. A one or two percent loss in a reduced benchmark can become a much larger loss in a real application if the underlying pattern repeats thousands of times, appears inside an iterative solver, or interacts with memory traffic on a larger input.

For b+tree, the first fair comparison after the direct launch and ABI fixes still showed REX slightly slower than native LLVM in the GPU-total measurement. The numbers moved around slightly across runs, but the profile pointed to the same area:

REX total GPU time:    about 0.0143 s
LLVM total GPU time:   about 0.0140 s
REX main kernel time:  about 14.1 ms
LLVM main kernel time: about 13.7 ms

That is not a dramatic failure. It is also not noise that a compiler engineer should ignore without checking.

The important part of this investigation was not the first faster number we found. The important part was the distinction between a useful diagnostic experiment and a valid compiler optimization.

A diagram showing the b+tree performance investigation narrowing from ABI and runtime issues to launch bounds, backend flags, generated source tuning, and finally launch geometry.

Figure 1. Once ABI and registration issues were gone, the remaining b+tree gap had to be narrowed like a kernel-performance problem, not like a runtime-bringup problem.

Why b+tree Was A Good Trap

b+tree is a good performance trap because its offloaded loop shape is not a clean one-operation-per-thread map.

At the outer level, the generated kernel can use a direct grid-stride mapping:

long tid = blockIdx.x * blockDim.x + threadIdx.x;
long stride = gridDim.x * blockDim.x;

for (long bid = tid; bid < count; bid += stride) {
  search_one_query(bid);
}

That shape looks similar to the direct lowering used in simpler benchmarks. It is not the whole story.

Inside search_one_query, each logical work item performs nested serial work. It traverses tree levels, scans node keys, follows indices, and repeats those operations for the query assigned to that outer bid. The outer grid-stride loop distributes queries, but each GPU lane that receives a query performs more than one trivial arithmetic operation.
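The shape of that per-query work can be sketched on the CPU. This is a hypothetical miniature, not the benchmark's real code: the node layout, `ORDER`, `LEVELS`, and the `demo_nodes` data are all illustrative stand-ins for whatever the generated kernel actually traverses.

```cpp
#include <cassert>

// Hypothetical node layout and sizes; the real b+tree kernel's data
// structures are not reproduced here.
constexpr int ORDER  = 4;  // keys per node (illustrative)
constexpr int LEVELS = 2;  // tree height (illustrative)

struct Node {
  long keys[ORDER];
  int  child[ORDER];  // child (or payload) index per key slot
};

// One logical work item: descend LEVELS levels, scanning keys at each.
// Inside the outer grid-stride loop, a single GPU lane therefore does
// up to LEVELS * ORDER serial comparisons, not one trivial operation.
int search_one_query(const Node *nodes, long key) {
  int node = 0;  // start at the root
  for (int level = 0; level < LEVELS; ++level) {
    int next = nodes[node].child[ORDER - 1];  // default: right-most child
    for (int k = 0; k < ORDER; ++k) {
      if (key < nodes[node].keys[k]) { next = nodes[node].child[k]; break; }
    }
    node = next;
  }
  return node;
}

// Tiny demo tree: node 0 is the root, nodes 1-4 are leaves whose child
// slots hold payload ids.
const Node demo_nodes[5] = {
  {{10, 20, 30, 40}, {1, 2, 3, 4}},
  {{2, 4, 6, 8},     {10, 11, 12, 13}},
  {{12, 14, 16, 18}, {20, 21, 22, 23}},
  {{22, 24, 26, 28}, {30, 31, 32, 33}},
  {{35, 45, 55, 65}, {40, 41, 42, 43}},
};
```

The point is not the search logic itself; it is that each outer `bid` carries a nested, data-dependent loop body, which is exactly what makes block width part of the performance model.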

That changes the launch-geometry tradeoff.

For a tiny flat loop, launching too many threads mainly creates idle lanes. For b+tree, launching very wide blocks can also reduce scheduler flexibility. A 1024-thread block is the maximum CTA width on most CUDA targets, so each one occupies a large share of an SM's resident-thread budget. If each thread may do irregular nested work, the GPU has fewer independent CTAs available to schedule around long memory paths and control-flow variation.
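The residency effect can be put in rough numbers. The per-SM limits below are hypothetical round figures (many NVIDIA parts allow on the order of 2048 resident threads and 16 resident CTAs per SM); real occupancy also depends on register and shared-memory usage, which this sketch ignores.

```cpp
#include <algorithm>

// Back-of-envelope CTA residency per SM, thread-count limited only.
// max_threads_per_sm and max_ctas_per_sm are assumed round numbers,
// not measurements of the tested GPU.
int resident_ctas(int threads_per_block,
                  int max_threads_per_sm = 2048,
                  int max_ctas_per_sm = 16) {
  return std::min(max_threads_per_sm / threads_per_block, max_ctas_per_sm);
}
```

Under these assumptions, 1024-thread blocks leave only two independent CTAs per SM, while 128-thread blocks allow sixteen, giving the scheduler far more work to hide latency behind.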

That does not mean “smaller is always better.” It means the launch shape is now part of the performance model.

The source had a shape that was easy to question:

256 teams x 1024 threads

For this benchmark and input, that is a very wide block shape. It is plausible that smaller blocks could improve scheduling behavior. But plausibility is not enough. We needed to rule out other causes first.

What We Ruled Out First

The first suspect was launch metadata.

Native LLVM’s generated PTX carried a visible maximum-thread annotation:

.maxntid 1024, 1, 1

The REX-generated kernel initially did not. That made __launch_bounds__ a reasonable first experiment. If native LLVM gave the backend a clearer block-size contract, it might allow better register allocation or scheduling decisions.

So REX generated a CUDA kernel annotation equivalent to:

__global__ __launch_bounds__(NUM_THREADS)
void OUT__...__kernel__(__rex_kernel_launch_env *env, ...) {
  ...
}

The PTX then contained the expected .maxntid 1024 annotation.

The performance gap remained.

The next suspect was backend code generation. We tried variants that changed the PTX compilation path and optimization level. The results did not move in the right direction:

baseline current: about 0.01441 s
llc -O3:          about 0.01466 s
llc -O3 ptx83:    about 0.01461 s
clang backend:    about 0.01464 s

Those experiments were useful because they removed a tempting story. The remaining b+tree gap was probably not caused by missing .maxntid, and it was probably not caused by a simple backend-flag mismatch.

We also checked the old XOMP-style round-robin scheduler path again as a control. It was worse:

round-robin scheduler path: about 0.01653 s

That confirmed that reverting to the old scheduler was not the answer. The direct-grid-stride path was still the right baseline.

At that point, launch geometry became the main suspect.

The Diagnostic Sweep

The fastest way to test the launch-geometry hypothesis was not to change the compiler first. It was to patch the generated host file and run controlled experiments.

That is a valid way to investigate a compiler optimization. Generated files are a quick laboratory. If a manual edit cannot produce a better outcome, there is no reason to spend time teaching the compiler to produce that edit.

The experiment kept the same device kernel and cubin but changed the host-side thread width:

int _num_blocks_ = 256;
int _threads_per_block_ = 1024;  // original

Then we swept the thread count while keeping the rest of the benchmark path stable:

1024 threads: about 0.0143 s
512 threads:  about 0.0143 s
256 threads:  about 0.0139 s
192 threads:  about 0.0131 s
128 threads:  about 0.0132 s
64 threads:   about 0.0132 s

The exact order shifted slightly from run to run, but the conclusion did not: the original 1024-thread block shape was not ideal for this b+tree kernel on the tested GPU.

That was a real finding.

It was also not yet a valid compiler fix.

A bar chart showing b+tree runtime improving when diagnostic experiments reduce thread width from 1024 toward 128 or 256.

Figure 2. The thread sweep proved that launch shape mattered. It did not prove that REX was allowed to rewrite the source-requested launch shape.

The Fairness Problem

The dangerous fix would have been:

if (looks_like_btree_or_nested_search_kernel) {
  _threads_per_block_ = 128;
}

That is obviously benchmark-specific and invalid.

The less obvious dangerous fix would have been:

if (nested_loop_depth >= 2) {
  _threads_per_block_ = min(_threads_per_block_, 128);
}

This looks generic. It uses structure, not benchmark names. It would help this case. It might help other kernels that have heavy serial work inside each distributed outer iteration.

But it is still wrong if it fires on a value the user explicitly requested.

That was the key fairness concern. The b+tree source carried explicit launch intent. Native LLVM preserved that launch intent. If REX silently changed it from 1024 threads to 128 threads, then the comparison was not REX lowering versus LLVM lowering under the same program. It was REX running a different launch contract.

The user may be wrong. The user may choose too many threads. The user may write thread_limit(1024) for a tiny loop. A compiler can warn, and it can reject or cap invalid hardware requests, but it should not silently replace a valid explicit performance choice just because one benchmark input runs faster with another value.

The boundary is:

invalid user request: compiler may reject or legalize
valid explicit user request: compiler must preserve
compiler default: compiler may optimize

This is not merely about benchmark ethics. It is a source-to-source compiler design rule. REX is used to generate code from user programs. The generated code should be a faithful transformation, not an autotuner that overwrites source-level performance intent without a contract saying it may do so.

The Heuristic REX Can Keep

The diagnostic sweep still produced something useful: it identified a structural default heuristic for cases where REX owns the launch width.

The compiler can ask whether the distributed outer loop contains nested loops. The rough proxy is not perfect, but it is generic:

static size_t computeMaxNestedForDepth(SgStatement *stmt) {
  // Walk the lowered target-loop body and find the deepest nested loop shape.
  // This is a structural signal, not a benchmark-name check.
}
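Since `SgStatement` is a ROSE IR type whose walk is not reproduced here, a hypothetical ROSE-free version of the same structural signal, over a toy statement tree, might look like this. The `Stmt` type and the demo instances are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// Toy statement tree: only loop nesting matters for this signal.
struct Stmt {
  bool is_loop;
  std::vector<Stmt> children;
};

// Deepest chain of nested loops at or under s, mirroring the intent
// (not the implementation) of computeMaxNestedForDepth: a purely
// structural measure, with no benchmark-name checks.
size_t max_nested_loop_depth(const Stmt &s) {
  size_t deepest = 0;
  for (const Stmt &c : s.children) {
    deepest = std::max(deepest, max_nested_loop_depth(c));
  }
  return deepest + (s.is_loop ? 1 : 0);
}

// Demo shape: a target body containing a loop that contains a loop,
// like the distributed query loop wrapping the per-query key scan.
const Stmt inner_loop{true, {}};
const Stmt outer_loop{true, {inner_loop}};
const Stmt target_body{false, {outer_loop}};
```

A depth of 2 for `target_body` is the situation b+tree presents: serial looping inside each distributed iteration.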

Then the launch lowering can derive a default cap only if the thread width was not explicit:

int direct_launch_thread_cap = 0;

if (!has_explicit_thread_width) {
  size_t nested_depth = computeMaxNestedForDepth(loop_body);

  if (nested_depth >= 2) {
    direct_launch_thread_cap = 128;
  } else if (nested_depth >= 1) {
    direct_launch_thread_cap = 256;
  }
}

Later, when the generated host launch variables exist, REX may apply that cap:

if (!has_explicit_thread_width && direct_launch_thread_cap > 0) {
  if (_threads_per_block_ > direct_launch_thread_cap) {
    _threads_per_block_ = direct_launch_thread_cap;
  }
}

The !has_explicit_thread_width guard is the important part. The nested-loop heuristic is not allowed to steal a valid user-specified num_threads or thread_limit value.

There is a second default-only cap based on tripcount. It handles the simpler case where a compiler-owned block is wider than the recovered loop tripcount:

if (!has_explicit_thread_width && tripcount > 0) {
  if (_threads_per_block_ > tripcount) {
    long granularity = 32;
    long rounded =
        ((tripcount + granularity - 1) / granularity) * granularity;

    if (rounded > _threads_per_block_) {
      rounded = _threads_per_block_;
    }

    _threads_per_block_ = (int)rounded;
  }
}
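Worked through on small numbers, the rounding behaves as follows. The helper below is a hypothetical isolation of just the rounding logic; the `!has_explicit_thread_width` guard from the surrounding code is assumed to have already passed.

```cpp
// Tripcount cap, isolated for a worked example. Applies only to
// compiler-owned defaults; the explicit-width guard is assumed.
int cap_to_tripcount(int threads_per_block, long tripcount) {
  if (tripcount > 0 && threads_per_block > tripcount) {
    long granularity = 32;  // warp-sized rounding
    long rounded =
        ((tripcount + granularity - 1) / granularity) * granularity;
    if (rounded > threads_per_block) {
      rounded = threads_per_block;
    }
    threads_per_block = (int)rounded;
  }
  return threads_per_block;
}
```

A default 1024-thread block over a 100-iteration loop shrinks to 128 threads (100 rounded up to the next warp multiple of 32 past 96), instead of launching 924 permanently idle lanes.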

This cap also belongs only to compiler-owned defaults. If the user explicitly requests a valid block width larger than the tripcount, REX must not change it silently.

That may leave performance on the table. That is acceptable. Preserving source intent is part of correctness for this class of transformation.

A diagram showing explicit thread-limit and num_threads clauses flowing to preservation, while compiler default launch width flows to nested-loop and tripcount caps.

Figure 3. The b+tree sweep became a fair compiler heuristic only after REX separated explicit launch intent from compiler-owned defaults.

Why This Is Not Just Conservative

Preserving explicit launch clauses might sound conservative, but it prevents a worse compiler behavior: hidden policy changes that users cannot reason about.

Suppose a user writes:

#pragma omp target teams distribute parallel for \
    num_teams(256) thread_limit(1024)
for (int q = 0; q < query_count; ++q) {
  search(q);
}

If REX changes thread_limit(1024) to 128 for one input, what should happen when query_count changes? What should happen on a newer GPU with different block scheduling behavior? What should happen if the user wrote that value because another target platform needed it?

Without an explicit autotuning mode, the compiler has no authority to answer those questions by rewriting the launch choice.

The default-only heuristic is different. When the source does not specify the thread width, REX already has to choose something. In that situation, using source structure is better than using a fixed global default. A nested-loop signal gives the compiler a reason to avoid maximum-width blocks for kernels where each distributed outer iteration may be heavy. A tripcount signal gives the compiler a reason to avoid launching many empty lanes for small flat loops.

That is a real generic optimization. It applies to programs, not benchmark names.

The rule can be summarized as:

When REX owns the choice, make a smarter choice.
When the user owns the choice, preserve it.

What The Tests Need To Prove

This kind of optimization needs tests that check policy, not just speed.

A performance number alone cannot prove the compiler is correct. A faster b+tree run could mean REX found a better generic lowering. It could also mean REX ignored the source launch clauses. Those are different outcomes.

The lowering tests should therefore check at least four invariants.

First, explicit thread-width clauses must survive:

source has thread_limit(1024)
lowered host launch keeps the explicit 1024 value unless illegal

Second, compiler-owned defaults may be capped:

source has no explicit thread-width clause
lowered direct path may apply nested-loop or tripcount caps

Third, team count must not be shrunk as an accidental side effect of thread capping:

thread cap changes _threads_per_block_
team preservation keeps _num_blocks_ under its own ownership rules

Fourth, correctness must remain separate from timing:

strip timing lines
compare semantic output
then compare GPU-total timing

That order matters. If output mismatches, the timing result is irrelevant.

The b+tree investigation reinforced a broader test design point from the REX benchmark layer: a performance test should explain why a speedup is legitimate. It is not enough to record that one binary is faster.

What This Fixed And What It Did Not

The launch-geometry work fixed a compiler-policy problem. REX gained a generic default heuristic for direct GPU lowering:

nested serial work inside a distributed iteration can cap compiler-owned thread width;
small recovered tripcounts can cap compiler-owned oversized thread width;
explicit user launch choices are preserved unless invalid.

That is the clean long-term result.

It did not fully close every fair b+tree gap by itself, because the explicit launch shape in b+tree could not be silently rewritten. Once the fairness boundary was enforced, the manual 128-thread result had to be reclassified as a diagnostic result, not a fair compiler result.

That is a useful outcome, not a failure. It narrowed the remaining problem:

ABI: fixed
runtime registration: fixed
modern launch packet: fixed
old scheduler path: removed
backend flags: not the cause
launch bounds: not the cause
explicit launch rewrite: unfair
remaining gap: likely inside generated kernel body

That led directly to the next investigation: why native LLVM’s b+tree kernel still had a small advantage when the launch contract was held fair. The answer was no longer launch geometry. It was read-only memory access. Native LLVM recovered load information that REX had lost in generated CUDA, and that changed the device instructions selected for the hot search path.

That is the next post: how REX recovered read-only loads and __ldg behavior without adding a benchmark-specific hack.