How REX Fixed the Early `nn` GPU Regression By Moving Offload Initialization Out Of The Timed Path

The first large nn slowdown was not a GPU arithmetic problem. REX was charging one-time offload setup to the benchmark’s measured region when registration first happened through a lazy launch wrapper. The fix was generic: generated programs now call rex_offload_init() before user timing starts, while the runtime wrappers remain lazy-safe through an atomic one-time registration path. That removed the pathological cold-start cost from nn and gave the later device-kernel optimizations a clean performance baseline.

The previous posts in this series established the comparison method:

  • build native LLVM and REX variants side by side,
  • compare correctness before trusting performance,
  • preserve user launch intent,
  • and treat benchmark results as an investigation surface instead of a scoreboard.

This post starts the performance arc.

The first concrete regression was nn.

At the beginning of the campaign, nn looked much worse for REX than for native LLVM. The early comparison artifact showed native LLVM finishing the benchmark in roughly 0.3031 s, while the REX variant needed roughly 0.9986 s.

That is more than a 3x slowdown.

For a GPU compiler investigation, that number is useful because it is too large to explain away with normal kernel-code drift. A small benchmark like nn can certainly expose device-side inefficiency, but a jump from 0.30 s to almost 1.00 s usually means a fixed cost has moved into the wrong timing window.

That is exactly what happened.

The first nn regression was a host-runtime lifecycle bug:

REX was paying one-time CUBIN registration and libomptarget setup on the benchmark’s measured path.

That sounds mundane, but it is the kind of mistake source-to-source compilers are especially exposed to. Native Clang offloading owns the whole compile/link/offload image lifecycle. REX generates ordinary host source, ordinary device source, and helper files, then relies on the downstream build to link and run those artifacts. That is a useful and inspectable design, but it also means lifecycle placement becomes visible source code.

If REX emits that source code in the wrong order, the user’s timer sees it.

A before-and-after timeline. Before, the benchmark starts timing before the first offload launch, so lazy CUBIN registration is counted. After, rex_offload_init runs before the timer and the measured launch path is warm.

Figure 1. The bug was not that registration existed. The bug was that first-use registration could happen after nn started its own timer.

Why nn Exposed This So Clearly

nn is not a massive kernel-throughput benchmark. It has enough GPU work to matter, but its measured region is small enough that host launch setup and first-use runtime work are visible.

That makes it useful as a lifecycle test.

The benchmark owns its timer. Conceptually, the source shape is:

int main(int argc, char **argv) {
  long long time0 = clock();

  /* input setup, target data movement, target region, result handling */

  long long time1 = clock();
  printf("total time : %15.12f s\n",
         (double)(time1 - time0) / 1000000.0);
}

That timer is not a CUDA event around one kernel. It is a benchmark-level timer around the whole run. So any work REX inserts after time0 = clock() becomes part of the reported number.

Some of that work is fair. Data movement and kernel launch overhead are part of the benchmark’s measured offload execution model.

One-time runtime image registration is different.

REX must register the generated CUBIN with libomptarget, but that registration is not part of the user algorithm. It is a compiler/runtime lifecycle operation needed to make the generated image visible to the OpenMP runtime. If REX charges that work to the first measured kernel launch, it is not measuring the same thing native LLVM measures.

This distinction matters most for small measured regions.

If a benchmark runs a huge kernel for seconds, a few milliseconds of first-use registration may disappear into noise. If a benchmark runs a small target region and prints an application-level total, that same fixed cost can dominate the result.

That is why the early nn gap was not interpreted as “REX generated a 3x slower nearest-neighbor kernel.” The magnitude pointed to a broader timing-path problem.

The Generated Source Was The First Place To Look

The first instinct in GPU performance work is often to open a profiler.

Profilers are useful, but here the generated host source was the faster source of truth.

The relevant question was simple:

Does rex_offload_init() appear before nn starts timing?

The healthy shape is:

int main(int argc, char *argv[])
{
  rex_offload_init();
  long long time0 = clock();

  ...

  __tgt_target_teams(__device_id, __host_ptr, __arg_num,
                     __args_base, __args, __arg_sizes, __arg_types,
                     _num_blocks_, _threads_per_block_);

  ...

  long long time1 = clock();
  printf("total time : %15.12f s\n",
         (double)(time1 - time0) / 1000000.0);
  return 0;
}

The exact spacing and symbol names are not important. The ordering is.

rex_offload_init() has to run before time0 = clock(). That call loads the generated CUBIN and registers the OpenMP offload image. Once it has run, the later target call is no longer responsible for paying the full cold registration path.

The compiler-side implementation is intentionally direct. In omp_lowering.cpp, insertAcceleratorInit(...) finds main, builds a call to rex_offload_init, and prepends it to the body:

SgExprStatement *expStmt = buildFunctionCallStmt(
    SgName("rex_offload_init"), buildVoidType(), NULL, currentscope);
setSourcePositionForTransformation(expStmt);

// Insert before all user statements so one-time cubin registration is not
// counted inside declaration initializers such as `long long time0 =
// clock();`.
prependStatement(expStmt, currentscope);

That prependStatement(...) is the important part.

It does not try to infer which later statement might be the first timer. It does not scan for every possible clock(), gettimeofday(), omp_get_wtime(), or benchmark-specific timer wrapper. It applies a stronger and simpler policy:

generated standalone offload programs should register their device image before any user statement runs.

That policy is more robust than chasing timers.

It also keeps the compiler honest. The lowerer does not need a special case for nn. It emits the same early initialization for generated offload programs in general. If a future benchmark declares a timer differently, or initializes a timing object through a constructor, the init still appears before that user code.

Why Lazy Registration Alone Was Not Enough

The runtime helper still needs lazy registration.

Generated code should normally call rex_offload_init() early, but helper wrappers cannot rely on every path being perfect forever. They also need to be safe if a user builds a nonstandard driver, calls into a generated offload function from another entry point, or reaches a target data operation before the explicit init call.

That is why the public wrappers defensively call registration before forwarding to the real LLVM runtime API:

int rex___tgt_target_teams(int64_t device_id, void *host_ptr,
                           int32_t arg_num, void **args_base, void **args,
                           int64_t *arg_sizes, int64_t *arg_types,
                           int32_t num_teams, int32_t thread_limit) {
  if (register_cubin(REX_CUBIN_NAME) == nullptr) {
    return -1;
  }
  return __rex_real___tgt_target_teams(device_id, host_ptr, arg_num,
                                       args_base, args, arg_sizes, arg_types,
                                       num_teams, thread_limit);
}

That wrapper is correct as a safety net.

It is not sufficient as the primary lifecycle policy.

If the first call to rex___tgt_target_teams(...) happens after time0 = clock(), then lazy registration is paid inside the measured region. The wrapper did its job from a correctness perspective, but the compiler failed from a performance-lifecycle perspective.

This is the central distinction:

  • lazy registration protects correctness,
  • early explicit initialization protects the benchmark timing model.

Both are needed.

REX cannot remove the lazy check entirely. Doing so would make generated artifacts brittle. But REX also cannot rely only on lazy registration, because that silently moves runtime setup to the first offload API call.

The nn regression was the practical proof of that rule.

What Registration Actually Does

The cost is not magic. The helper has real work to perform.

REX uses a standalone CUBIN artifact, so registration starts by reading that file:

bool readFile(const char *filename, std::vector<unsigned char> &buffer) {
  FILE *file = fopen(filename, "rb");
  if (file == nullptr) {
    return false;
  }

  ...

  buffer.resize(static_cast<size_t>(file_size));
  size_t bytes_read =
      fread(buffer.data(), 1, static_cast<size_t>(file_size), file);
  fclose(file);
  return bytes_read == buffer.size();
}

Then the helper builds the OpenMP runtime descriptors:

struct CubinStorage {
  std::vector<unsigned char> image;
  __tgt_device_image device_image{};
  __tgt_bin_desc bin_desc{};
};

The image descriptor points at the CUBIN bytes and at the host offload-entry range:

storage->device_image.ImageStart = storage->image.data();
storage->device_image.ImageEnd =
    storage->image.data() + storage->image.size();
storage->device_image.EntriesBegin = &__start_omp_offloading_entries;
storage->device_image.EntriesEnd = &__stop_omp_offloading_entries;

The binary descriptor wraps that image and the host entry range:

storage->bin_desc.NumDeviceImages = 1;
storage->bin_desc.DeviceImages = &storage->device_image;
storage->bin_desc.HostEntriesBegin = &__start_omp_offloading_entries;
storage->bin_desc.HostEntriesEnd = &__stop_omp_offloading_entries;

Finally, the helper registers the binary descriptor with libomptarget:

__rex_real___tgt_register_lib(&storage->bin_desc);

None of that belongs in the first measured kernel launch if the benchmark is trying to measure algorithm execution rather than runtime cold start.

It is also not enough to make the registration fast in the average case. The first case matters. nn starts its measured region before the offload path, so first-use placement directly affects the printed result.

A vertical stack showing the cold registration path: read CUBIN, build device image, bind host offload entries, call __tgt_register_lib, then launch.

Figure 2. Lazy first use can include file I/O, descriptor construction, host-entry binding, and libomptarget registration before the kernel launch itself.

The Runtime Still Has To Be Safe

Moving init earlier does not mean the helper becomes careless.

The current register_cubin.cpp keeps CUBIN state in RAII-managed storage:

std::unique_ptr<CubinStorage> cubin_storage;

and protects one-time registration through an atomic state machine:

enum RegistrationState {
  kUnregistered = 0,
  kBusy = 1,
  kRegistered = 2,
};

The registration gate loads the current state, lets exactly one thread perform cold registration, and returns the stored descriptor on the hot path:

struct __tgt_bin_desc *ensure_cubin_registered(const char *filename) {
  for (;;) {
    int state = __atomic_load_n(&registration_state, __ATOMIC_ACQUIRE);
    if (state == kRegistered) {
      return cubin_storage == nullptr ? nullptr : &cubin_storage->bin_desc;
    }
    if (state == kUnregistered &&
        __atomic_compare_exchange_n(&registration_state, &state, kBusy,
                                    false, __ATOMIC_ACQ_REL,
                                    __ATOMIC_ACQUIRE)) {
      struct __tgt_bin_desc *desc = register_cubin_internal(filename);
      if (desc == nullptr) {
        __atomic_store_n(&registration_state, kUnregistered,
                         __ATOMIC_RELEASE);
        return nullptr;
      }
      __atomic_store_n(&registration_state, kRegistered, __ATOMIC_RELEASE);
      return desc;
    } else {
      sched_yield();
    }
  }
}

That code matters for two reasons.

First, repeated launches should not repeatedly read the CUBIN or re-register the image. Once the state is kRegistered, the wrapper returns the existing descriptor.

Second, the lazy safety path has to be thread-safe. A generated program may be mostly single-threaded at the host level, but runtime helpers should not be designed around that assumption. If two host threads reach offload wrappers at the same time, only one should perform cold registration. The others should yield and retry instead of burning CPU in a tight busy-wait loop while the registering thread is doing file I/O and runtime setup.

So the final design is not “replace lazy registration with explicit init.” It is:

  • keep lazy registration as a synchronized fallback,
  • emit eager init so normal generated programs avoid cold first use in user timing,
  • preserve RAII ownership for CUBIN bytes and descriptors,
  • and keep explicit rex_offload_fini() available for longer-lived callers that need controlled teardown.

That combination is what made the fix generic instead of benchmark-specific.

Why This Is Different From Native LLVM

Native LLVM does not have to solve this exact problem in the same way.

A native Clang offload binary typically carries its device image through the compiler and linker pipeline as part of the integrated OpenMP offload model. The host binary, runtime descriptors, and device image are produced together. The application does not see a generated helper file that reads rex_lib_nvidia.cubin from disk and manually constructs a __tgt_bin_desc.

REX deliberately chooses a different artifact model.

It emits source and helper files that users can inspect, build, and debug. That is a major advantage for a source-to-source compiler. It means a generated host file can be read directly. A generated CUDA file can be compiled separately. A CUBIN can be inspected as a normal artifact.

But the price of that explicitness is that REX has to recreate enough of the offload-image lifecycle in generated source and helper code.

That is why rex_offload_init() is not just a convenience wrapper. It is the point where the generated artifact model reconnects to the OpenMP runtime’s image lifecycle.

When native LLVM is measured, that lifecycle is mostly hidden inside the native offload binary and runtime startup path. When REX is measured, the lifecycle is visible as ordinary source. If the source places it after a benchmark timer, the comparison is unfair.

The fix was not to hide REX’s artifact model. The fix was to make the artifact model obey the same timing boundary.

The Regression Test Is The Contract

This bug needed a test because it is easy to reintroduce.

The lowering output can change for many legitimate reasons:

  • helper names can move,
  • generated declarations can be reordered,
  • launch APIs can change,
  • and formatting can drift as ROSE unparsing evolves.

A golden-file test would be too brittle for this layer.

The right test is an invariant:

in an nn-like lowered file, rex_offload_init() must appear before the benchmark timer declaration.

The reduced Rodinia lowering suite checks exactly that:

time0_line="$(first_line "${rose_file}" 'long[[:space:]]+long[[:space:]]+time0[[:space:]]*=[[:space:]]*clock[[:space:]]*\(')"
init_line="$(first_line "${rose_file}" 'rex_offload_init[[:space:]]*\(')"

[[ -n "${time0_line}" ]] || die "missing timer declaration marker"
[[ -n "${init_line}" ]] || die "missing rex_offload_init call"
(( init_line < time0_line )) || die "rex_offload_init moved after timer declaration"

That is a small test, but it encodes the real bug.

It does not care about every line in the lowered file. It cares about the ordering that protects the benchmark timing boundary.

The same suite also checks that the nn-like device code no longer contains old scheduler helpers such as XOMP_static_sched_init and XOMP_static_sched_next. That belongs to the next performance post, but it is useful context here: the nn work split into a host-lifecycle fix first, then a device-loop-shape fix second.

The init-ordering check keeps those concerns separate.

If nn regresses again, the test can tell whether the cold-start cost has leaked back into the timed path before we waste time debugging device kernel throughput.

A triage flow for the nn regression: observe large slowdown, inspect generated host timing, inspect lazy registration path, add lowering invariant, then continue to device-side scheduler investigation.

Figure 3. The useful split was host lifecycle first, device-loop shape second. nn needed both investigations, but the cold-start bug had to be removed before kernel comparisons were meaningful.

What The Fix Changed In The Numbers

The earliest comparison captured the pathological state:

native LLVM nn: about 0.3031 s
REX nn:         about 0.9986 s

That result was not a fair measurement of generated kernel quality. It included cold setup work in the REX timed region.

After the host-side lifecycle cleanup and subsequent compiler work, a later comparison showed:

native LLVM nn: about 0.2609 s
REX nn:         about 0.3056 s

This does not mean the init-placement fix alone made REX reach 0.3056 s. The later number also includes additional cleanup that will be covered in the next posts.

The important conclusion is narrower and stronger:

once cold-start registration was moved out of the measured path, the obviously pathological 3x regression disappeared, and the remaining gap became a real kernel/lowering problem.

That was the correct engineering outcome for Attempt 1.

Do not overclaim the fix. It did not prove that REX’s nn kernel was optimal. It did not make REX faster than native LLVM by itself. It did not remove the need to inspect the generated device loop.

It made the comparison meaningful.

That is often the first real performance fix in a compiler campaign. Before optimizing the kernel, you have to remove measurement contamination.

What We Should Not Have Done

There were several tempting wrong fixes.

The first wrong fix would have been to manually edit the generated nn file and move the init call only there.

That would make one benchmark look better, but it would not fix REX. Any user program with the same timing shape would still be exposed to the bug. Since REX is the source-to-source compiler generating these files, the ordering policy belongs in lowering.

The second wrong fix would have been to remove lazy registration from wrappers and assume every generated program always calls rex_offload_init().

That would make the hot path look cleaner, but it would weaken correctness. Generated code evolves. Users can link helpers in nonstandard ways. Runtime wrappers are the last safety gate before calling into libomptarget; they should remain defensive.

The third wrong fix would have been to special-case timer names.

For example, REX could scan for time0, start, begin, or calls to clock() and insert init before those. That is fragile. It makes the compiler depend on benchmark naming conventions and misses timers hidden behind helper functions or object initialization.

Prepending rex_offload_init() before user statements is simpler and more correct for standalone generated offload programs.

The fourth wrong fix would have been to treat the entire 0.9986 s result as a device-code problem.

That would have sent the investigation directly into block sizes, memory access, and loop scheduling. Those did matter later, but they were not the first-order explanation for a 3x slowdown. The host lifecycle bug had to be removed first.

The General Rule

The rule that came out of this attempt is now part of the REX performance model:

compiler-emitted one-time offload setup must run before benchmark-owned timing starts, unless the program explicitly chooses to time runtime setup.

That rule is not specific to nn.

It applies to any generated offload program where:

  • the compiler has to register a runtime-visible device image,
  • the user program owns its own timing boundary,
  • and the first offload call would otherwise trigger cold runtime work.

It also explains why the fix sits in two places:

  • the lowerer controls placement by emitting rex_offload_init() at the start of main,
  • the helper controls safety by keeping synchronized lazy registration in every public offload wrapper.

One part without the other is incomplete.

Early init without lazy safety makes the runtime brittle.

Lazy safety without early init makes benchmark timing unpredictable.

Together they make the generated program both robust and measurable.

What Came Next

After this fix, nn no longer looked like a lifecycle disaster. It looked like a normal compiler performance problem.

That was progress.

The next question became:

why was REX still behind native LLVM after cold-start work was removed?

That led to the old XOMP scheduler path.

For a simple canonical target loop, native LLVM was using a direct SPMD-style grid-stride kernel body. REX still had cases where it routed device loop execution through generic scheduler helpers. That extra structure mattered for nn, because once the cold-start cost was gone, the measured region was dominated by repeated launch and loop-body overhead.

So the next post will move from host lifecycle to device loop shape:

how REX replaced the old XOMP scheduler path with direct grid-stride GPU lowering.

The important point is that this post and the next one are separate on purpose.

The nn story had two different root causes:

  • first, one-time runtime setup was charged to the benchmark timer;
  • second, the generated device loop shape still carried unnecessary scheduler overhead.

Conflating them would make the fix harder to trust.

The lifecycle fix made the measurement honest. The scheduler fix made the kernel path faster.