How REX Recovered b+tree Read-Only Loads With __ldg

After REX restored fair launch geometry for b+tree, native LLVM still had a small device-side advantage. The tempting fix was a global cache flag such as ptxas -dlcm=ca, but that helped b+tree and hurt nn, so it was not a clean compiler policy. The real issue was the shape of the generated CUDA source: REX had lost enough const provenance that hot read-only irregular loads still compiled as ordinary global loads. A generated-file experiment proved that __ldg(...) closed the gap. The compiler fix was to recover read-only mapped-array and local-shadow provenance generically, then rewrite only safe scalar loads from const device inputs. That moved b+tree from the last fair loss into a clear REX win without changing user launch clauses.

The previous post ended with an important constraint: REX was not allowed to win b+tree by silently shrinking a valid user-requested launch shape. The manual thread-width sweep had found a faster shape, but the source explicitly requested the launch geometry and native LLVM preserved it. That made the optimization useful as a diagnostic, not as a fair compiler rewrite.

So the remaining b+tree problem became sharper.

The launch contract had to stay fair. The direct __tgt_target_kernel path was already in place. Literal scalar target parameters were already repaired. Cubin registration was no longer the issue. The output matched native LLVM. Yet b+tree still had a small native advantage in fair runs:

native LLVM total GPU time: about 0.0143 s
REX total GPU time:         about 0.0144 s

That is not a huge number, but it was stable enough to investigate. More importantly, profiling said the loss was in the device kernel, not in transfer size or host setup. That meant the next bug was probably not another offload-runtime lifecycle mistake. It was inside the generated search kernel.

That is where the investigation moved.

[Diagram: the b+tree investigation rejecting a global backend cache flag and moving toward selective source-level read-only load recovery.]

Figure 1. The remaining b+tree gap looked like a memory-access problem, but the fix had to be selective source lowering, not a global cache-policy flag.

Why b+tree Is Sensitive To Read-Only Loads

b+tree is an irregular search workload. Each query walks tree nodes, compares keys, follows indices, and eventually reads records. The hot path repeatedly touches arrays that are read by the kernel but not modified by it:

knodes
keys
indices
records

That pattern matters on NVIDIA GPUs, especially on the tested sm_52 target. A normal global load and a read-only load can take different code-generation paths. The CUDA __ldg(...) intrinsic asserts that the addressed data is read-only for the duration of the kernel, which lets the load be issued through the target’s read-only data path.

The useful generated shape is not mysterious:

const int query_key = __ldg(&_dev_keys[bid]);

if (__ldg(&node_keys[thid]) <= query_key &&
    __ldg(&node_keys[thid + 1]) > query_key) {
  next = __ldg(&node_indices[thid]);
}

The original REX output after the fairness cleanup was still closer to this:

const int query_key = _dev_keys[bid];

if (node_keys[thid] <= query_key &&
    node_keys[thid + 1] > query_key) {
  next = node_indices[thid];
}

Both are semantically equivalent if the loaded storage is genuinely read-only in the target region. They are not necessarily equivalent for performance. On this benchmark, the read-only path was the missing device-side lever.
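
At the instruction level the difference is one load variant. A rough PTX sketch (illustrative instructions, not captured from this build):

ld.global.u32     %r1, [%rd1];    // ordinary global load
ld.global.nc.u32  %r1, [%rd1];    // __ldg: non-coherent, read-only data path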

The Flag We Did Not Keep

Before changing REX lowering, we checked whether native LLVM was simply passing a backend flag that REX was missing.

The native offload pipeline was inspected with clang -###. The relevant device assembler line looked like this:

ptxas -m64 -O3 --gpu-name sm_52 --output-file ... kernel_cpu-sm_52.s -c

There was no global -dlcm=ca flag and no equivalent cache-forcing switch in that path.

It was still worth trying the flag manually because it tested the hypothesis directly. Forcing ptxas -dlcm=ca on the generated b+tree cubin improved the benchmark. That initially looked promising.
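
In this experiment the flag slots into the same assembler line (illustrative; paths elided as in the native line above). -dlcm=ca sets the default load cache modifier to cache-at-all-levels:

ptxas -m64 -O3 --gpu-name sm_52 -dlcm=ca --output-file ... kernel_cpu-sm_52.s -c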

Then the same global switch hurt nn.

That result disqualified the flag as a REX compiler fix. A global cache policy that helps one irregular search benchmark and hurts another is not a generic optimization. It is a workload-dependent backend override. It also says nothing about what the source program means: it tells the backend to treat a broad class of loads differently, even when the compiler has not proved those loads are read-only.

The better rule was narrower:

do not change cache policy for the whole cubin;
recover read-only intent for the individual loads that are safe.

That moved the problem from backend flags into source-to-source lowering, where REX has the right information if it preserves it carefully.

The Generated-File Experiment

The fastest way to verify the real lever was to edit the generated CUDA file directly.

That is a useful compiler-debugging technique. Generated code is a laboratory. If a manual generated-file edit does not help, teaching the compiler to emit that edit is wasted work. If the manual edit helps, the compiler still needs a generic proof before the edit can become policy.
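
The experiment loop is short: edit the generated device file by hand, recompile only that file, and rerun the benchmark. A hypothetical one-liner, assuming a ROSE-style generated file name:

# Recompile the hand-edited generated device file (file name is illustrative).
nvcc -arch=sm_52 -O3 -c rose_kernel_cpu.cu -o kernel_cpu.o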

The profitable generated-file edits wrapped obvious read-only loads:

const int ____rex_cached__dev_keys_2__ = __ldg(&_dev_keys[_p_bid]);

if (__ldg(&____rex_field_keys_5__[_p_thid]) <=
        ____rex_cached__dev_keys_2__ &&
    __ldg(&____rex_field_keys_5__[_p_thid + 1]) >
        ____rex_cached__dev_keys_2__) {
  const long next_offset =
      __ldg(&_dev_knodes[____rex_cached__dev_offset_1__].indices[_p_thid]);
  if (next_offset < knodes_elem__rex_value) {
    ____rex_cached__dev_offset_1__ = next_offset;
  }
}

That experiment improved b+tree substantially. The conclusion was now concrete:

the remaining fair gap is not launch policy;
the remaining fair gap is not hidden host overhead;
the remaining fair gap is not a mandatory global ptxas flag;
the remaining fair gap is ordinary loads that REX should have rewritten as read-only loads.

The compiler task was to make that happen without recognizing the benchmark name and without rewriting unsafe loads.

Where The Pass Belongs

The clean insertion point was the direct grid-stride fast path in omp_lowering.cpp.

By that point REX has already converted the target loop into the CUDA-like structure used by the direct path. It has also run normalization passes that make the memory expressions easier to reason about:

scalarizeDirectGridStrideOuterIndexAccesses(for_loop, outer_index_sym);
hoistReadOnlyInvariantAggregateRefsBeforeLoop(for_loop);
hoistReadOnlyInvariantFieldAccessesBeforeLoop(for_loop);
rewriteReadOnlyDeviceLoadsWithLdg(for_loop);

That ordering is important.

REX is not trying to prove arbitrary source-level aliasing facts from the original OpenMP program in one giant pass. It is looking at its own normalized device kernel shape. After scalarization and hoisting, repeated aggregate and field references often become local const pointer temporaries. That makes the later __ldg pass both simpler and more precise.
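
As a sketch of what that buys, using temporary names that appear in the regenerated kernel (the before/after shapes are reconstructed, not copied from REX's intermediate output):

// Before hoisting: the invariant aggregate reference is re-derived
// at every use inside the hot comparison.
if (_dev_knodes[____rex_cached__dev_offset_1__].keys[_p_thid] <= query_key) ...

// After hoisting: the reference becomes a local const pointer temporary
// that the later __ldg pass can trace back to the const device input.
const struct knode *____rex_ref__dev_knodes_4__ =
    &_dev_knodes[____rex_cached__dev_offset_1__];
if ((*____rex_ref__dev_knodes_4__).keys[_p_thid] <= query_key) ...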

The pass uses a conservative candidate rule:

the expression must produce a scalar value;
the expression must not be written through;
the expression's address must not be used in a way that would change meaning;
the base storage must resolve to a const-qualified device pointer;
derived local pointers must recursively trace back to a const device input.

That rule covers the useful b+tree shapes:

_dev_keys[bid]
node->keys[thid]
node->indices[thid]
records[idx].value

It does not blindly wrap every array access in __ldg. Mutable arrays, mutable pointer parameters, and stack locals are rejected.
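
In pseudocode, the filter looks something like this (helper names are hypothetical; the real pass answers these questions with ROSE AST queries):

// Hypothetical sketch of the conservative candidate filter.
bool isLdgCandidate(SgExpression *load) {
  if (!yieldsScalarValue(load))   return false;  // scalar results only
  if (isWrittenThrough(load))     return false;  // never an assignment target
  if (addressEscapes(load))       return false;  // taking &expr must stay benign
  SgSymbol *base = traceBasePointer(load);       // follows derived local pointers
  return base != nullptr && isConstDevicePointer(base);
}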

[Diagram: const provenance flowing from mapped read-only arrays to const device pointers, derived field pointers, candidate scalar loads, and finally __ldg calls.]

Figure 2. The optimization is only safe when REX can trace the scalar load back to read-only device storage.

Why The First Compiler Version Still Missed It

At this point the story should have been over. The manual edit worked. The compiler had a generic rewriteReadOnlyDeviceLoadsWithLdg(for_loop) pass. The direct fast path called it in the right location.

But regenerated b+tree output still missed the hot __ldg(...) calls.

The compiler-generated file had started to preserve one piece of the intended shape:

const int *_dev_keys

but the load itself still looked ordinary:

const int ____rex_cached__dev_keys_2__ = _dev_keys[_p_bid];

Temporary instrumentation made the bug precise:

[REX_DEBUG_LDG] total candidates: 0

That ruled out the wrong explanations. The unparser was not deleting __ldg. The CUDA build was not lowering it away. The pass was simply not recognizing the actual AST shapes that REX had generated.

There were two recognition bugs.

Bug 1: The Read-Only Map Analysis Climbed Too Far

The first bug affected whether mapped aggregates became const-qualified device pointers.

For _dev_knodes to become:

const struct knode *_dev_knodes

REX has to prove that the mapped knodes object is only read inside the target body.

The failing source pattern was nested inside a larger expression:

records[knodes[currKnode[bid]].indices[thid]].value

The analysis started from the knodes reference. Then it walked upward through parent expressions to find the access root. The bug was that it kept walking too far. It climbed past the access rooted at knodes and reached the outer records[...].value expression. At that point, the root no longer belonged to the knodes base symbol.

The debug trace said exactly that:

[REX_DEBUG_LDG] map-readonly reject knodes reason=root-mismatch
expr=knodes
root=records[knodes[currKnode[_p_bid]].indices[_p_thid]].value

That was a false rejection. The outer expression consumes a value derived from knodes, but the mapped-array proof for knodes should stop when the parent expression no longer derives from the same base.

The fix was to carry the base symbol during the walk and stop climbing when the parent expression changes base:

if (base_sym != nullptr && extractClauseVariableSymbol(parent) != base_sym) {
  break;
}

After that, knodes could be recognized as read-only in the target body, and the regenerated kernel used a const-qualified device pointer for it.
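
In context, the corrected climb looks roughly like this (the surrounding loop is reconstructed; only the base-symbol guard above is from the actual fix):

SgNode *root = knodes_ref;  // start at the knodes reference
SgSymbol *base_sym = extractClauseVariableSymbol(knodes_ref);
while (SgExpression *parent = isSgExpression(root->get_parent())) {
  // Stop once the parent no longer derives from the same base:
  // records[knodes[...].indices[...]].value is rooted at records, not knodes.
  if (base_sym != nullptr && extractClauseVariableSymbol(parent) != base_sym) {
    break;
  }
  root = parent;  // still part of the access rooted at knodes
}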

Bug 2: The __ldg Matcher Trusted The Wrong Symbol Shape

The second bug was more subtle.

Even _dev_keys[_p_bid] was not rewritten, although _dev_keys already had a const pointer type. The recognizer looked for two cases:

an actual function parameter with pointer-to-const type;
a local const pointer initialized from another read-only pointer.

That sounds reasonable, but the outlined kernel body did not always bind hot expressions to the formal parameter symbol directly. REX had same-named local shadows inside the kernel body. In the AST, _dev_keys[_p_bid] could resolve to a local const pointer declaration inside a basic block rather than the formal parameter node.
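
The shape in question looks roughly like this (reconstructed to illustrate the shadowing, not copied from the generated file):

__global__ void kernel(..., const int *_dev_keys, ...) {
  {
    // A same-named local const pointer shadow inside a basic block:
    // the AST resolves _dev_keys[_p_bid] below to this declaration,
    // not to the formal parameter node.
    const int *_dev_keys;
    ...
    const int ____rex_cached__dev_keys_2__ = _dev_keys[_p_bid];
  }
}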

The debug trace showed the mismatch:

[REX_DEBUG_LDG] readonly-base reject symbol=_dev_keys
decl-parent=SgVariableDeclaration
decl-scope=SgBasicBlock

The base was const, but the recognizer rejected it because it was neither a formal parameter nor an initializer-derived local.

The fix was deliberately narrow:

if a local const pointer shadow has the same name as an enclosing const formal,
    treat it as derived from that formal;
if a no-initializer const pointer local is never written in the kernel body,
    allow it as a read-only base in this normalized direct-path shape.

That recovered the actual outlined-kernel forms without turning the pass into “every pointer-to-const local is always safe.” The pass still rejects symbols written inside the kernel body, and it still requires pointer-to-const type.
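
A sketch of the widened base check, with hypothetical helper names (the real checks are ROSE AST queries over the kernel body):

bool isReadOnlyBase(SgVariableSymbol *sym, SgFunctionDefinition *kernel) {
  if (!isPointerToConstType(sym))            return false;  // required in all cases
  if (isFormalParameterOf(sym, kernel))      return true;   // original case 1
  if (initializedFromReadOnlyPointer(sym))   return true;   // original case 2
  // New: a same-named const shadow of a const formal is treated as derived.
  if (shadowsConstFormalByName(sym, kernel)) return true;
  // New: a no-initializer const pointer local is accepted only if it is
  // never written anywhere in the kernel body.
  if (!hasInitializer(sym) && !isWrittenIn(sym, kernel)) return true;
  return false;
}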

Once both recognizers were repaired, the pass saw the right candidates:

_dev_keys[_p_bid]
____rex_field_keys_5__[_p_thid]
____rex_field_keys_5__[_p_thid + 1]
_dev_knodes[...].indices[_p_thid]
(*____rex_ref__dev_knodes_4__).keys[_p_thid]
(*____rex_ref__dev_knodes_4__).indices[_p_thid]

[Diagram: two REX recognition failures: the mapped-array read-only proof climbing past the right base, and the __ldg matcher rejecting local const shadows.]

Figure 3. The optimization was present, but two recognition bugs prevented it from seeing REX’s own lowered kernel shapes.

What The Regenerated Kernel Looks Like

The regenerated first search kernel now has the intended shape:

__global__ void OUT__...__kernel__(
    void *__rex_kernel_launch_env,
    const struct record *records,
    unsigned long long knodes_elem__rex_value,
    const struct knode *_dev_knodes,
    long *_dev_currKnode,
    long *_dev_offset,
    const int *_dev_keys,
    struct record *_dev_ans) {
  ...
  const int ____rex_cached__dev_keys_2__ = __ldg(&_dev_keys[_p_bid]);

  if (__ldg(&____rex_field_keys_5__[_p_thid]) <=
          ____rex_cached__dev_keys_2__ &&
      __ldg(&____rex_field_keys_5__[_p_thid + 1]) >
          ____rex_cached__dev_keys_2__) {
    const long next_offset =
        __ldg(&_dev_knodes[____rex_cached__dev_offset_1__].indices[_p_thid]);
    if (next_offset < knodes_elem__rex_value) {
      ____rex_cached__dev_offset_1__ = next_offset;
    }
  }
}

This is the right kind of compiler output for this workload. It preserves the explicit launch shape. It preserves semantics. It does not require a global cache flag. It exposes read-only intent exactly where the generated direct kernel can prove it.

The validation matched that expectation.

Correctness remained clean:

command_k.txt: native LLVM and REX outputs matched after timing-line normalization
command_j.txt: native LLVM and REX outputs matched after the same normalization

The benchmark moved from the last fair loss to a clear REX win:

command_k.txt 10-run benchmark-total mean
REX:         0.010941000003 s
native LLVM: 0.015615500137 s

The profiler showed that the win was in the kernel body:

REX kernel:         9.654465 ms
native LLVM kernel: 13.748185 ms
HtoD and DtoH copy totals: effectively tied

That matters. This was not a cheaper registration path, a timing artifact, or a launch-contract shortcut. The search kernel itself became faster.

What LLVM Still Did Better

This fix also says something useful about native LLVM.

Native LLVM did not suffer from this exact REX bug because it does not pass through the same source-to-source canonicalization pipeline. REX creates CUDA-like source, normalizes it through ROSE AST transformations, introduces local aliases, emits helper temporaries, and then asks the CUDA toolchain to compile the generated device file. That gives REX a lot of control, but it also creates opportunities to lose source-level information.

LLVM’s integrated pipeline was more robust to this category of information loss. It did not need a REX-specific pass to rediscover that _dev_keys or knodes were read-only after source-to-source rewriting. That robustness was a real advantage, and it is why LLVM stayed ahead until REX repaired const provenance properly.

The final REX result is faster because once the provenance is recovered, the generated kernel is simpler and more explicit for this workload. But the lesson is not “LLVM was wrong.” The lesson is that REX’s advantage depends on preserving the semantic facts it needs for later generated-code optimizations.

The Design Rule

The clean rule from this post is:

prefer selective source-level proof over global backend policy;
recover read-only intent only when provenance is clear;
emit __ldg only for scalar loads from read-only device storage;
keep the optimization in the direct fast path until the fallback path has the same proof strength.

That rule keeps the fix generic.

It is not a b+tree special case. It is not a cache flag that happens to help one benchmark. It is not a launch-geometry rewrite. It is a compiler analysis repair: after REX lowers OpenMP into a direct CUDA-like kernel, it must still recognize read-only device storage through the AST shapes it created itself.

That is the broader performance lesson. Once the large runtime and ABI issues were fixed, the remaining wins came from preserving small semantic facts. In b+tree, the important fact was simple:

these irregular loads are reads from device storage that the kernel never writes.

When REX preserved that fact, the final fair b+tree gap closed cleanly.