How REX Builds Native LLVM And Generated Variants For Side-By-Side GPU Benchmarking

The REX benchmark campaign does not compare two unrelated GPU software stacks. For each benchmark, it builds two comparable binaries for the same application: a native Clang OpenMP-offloading binary and a REX-generated binary whose host, device, and helper files are still linked against the same LLVM libomp and libomptarget family. Both variants are then run with the same benchmark-owned invocation contract. That is what makes the result a comparison of lowering decisions rather than a comparison of unrelated ecosystems.

The previous post explained why the GPU benchmark layer in REX should be treated as an investigation surface rather than a scoreboard.

This post steps one level earlier in that same top-layer contract.

Before you can compare correctness or performance meaningfully, you need to answer a much more basic question:

what exactly are the two binaries being compared?

That sounds like bookkeeping.

It is not.

If the native LLVM side and the REX side are not built under a disciplined contract, then the benchmark result stops meaning what it claims to mean.

Instead of:

native LLVM lowering versus REX lowering

you get:

one software stack versus some other software stack
That is why the build story itself deserves its own post.

The benchmark campaign was not just “compile both somehow and run them.” It was a side-by-side build contract with three core properties:

  • the same application is used on both sides,
  • the same LLVM offloading runtime family is used on both sides,
  • and the same run command is used on both sides.

Once those are true, the comparison becomes much more defensible.

A diagram showing one benchmark source splitting into a native LLVM build path and a REX-generated build path, then rejoining at the same LLVM runtime family and the same benchmark invocation.

Figure 1. The benchmark campaign is not comparing arbitrary binaries. It starts from one application, builds two variants under one runtime family, then rejoins them at the same workload contract.

One Application, Two Build Paths

For each benchmark in the NeoRodinia-derived tree, the campaign built two binaries.

The first was the native LLVM variant.

That path is conceptually straightforward:

  1. take the benchmark’s original source,
  2. compile it with Clang OpenMP target offloading,
  3. link it against the LLVM OpenMP runtime stack.

The second was the REX variant.

That path is different in structure, but not in ultimate runtime target:

  1. take the same benchmark source,
  2. lower it through REX,
  3. produce generated host, device, and helper artifacts,
  4. compile those artifacts,
  5. link the resulting binary against the same LLVM OpenMP runtime family.
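Under stated assumptions, those five steps might look like the following sketch. The `rex-lower` driver name and the exact flags are hypothetical, not the actual REX CLI; the generated file names follow the rose_<input>.c / rex_lib_<input>.cu pattern the campaign produced.

```shell
# Hypothetical sketch of the REX build path for one benchmark.
# "rex-lower" stands in for whatever driver performs the
# source-to-source lowering; its name and flags are assumptions.
rex-lower bfs.c    # emits rose_bfs.c, rex_lib_bfs.cu, helper files

# Compile the generated device file into a standalone device image.
nvcc -arch=sm_80 -cubin rex_lib_bfs.cu -o rex_lib_bfs.cubin

# Compile the rewritten host file and the helper glue separately,
# then link against the same LLVM OpenMP runtime family the
# native variant uses.
clang   -O3 -fopenmp -c rose_bfs.c
clang++ -O3 -fopenmp -c register_cubin.cpp
clang++ rose_bfs.o register_cubin.o -lomp -lomptarget -o bfs_rex.out
```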

That distinction matters.

The native side is one integrated compiler pipeline.

The REX side is a source-to-source pipeline that materializes intermediate artifacts such as:

  • a rewritten host file like rose_<input>.c,
  • a generated device file like rex_lib_<input>.cu,
  • and helper files such as register_cubin.cpp and rex_kmp.h.

Those two paths are structurally different on purpose.

What matters for benchmarking is that they rejoin at a shared runtime boundary rather than diverging into separate runtime ecosystems.

The Native LLVM Side

The native side was built with Clang OpenMP target offloading in the ordinary LLVM style.

Representative commands looked like:

clang -O3 -fopenmp \
  -fopenmp-targets=nvptx64-nvidia-cuda \
  --offload-arch=sm_80 \
  ...

The exact architecture flag varies with the machine and toolchain used during the campaign, but the important structure is stable:

  • Clang compiles the original OpenMP source,
  • Clang lowers the target regions natively,
  • the final binary uses LLVM’s OpenMP offloading runtime stack.

This is the reference path because it tells us what the same source looks like when the OpenMP compiler and offloading runtime are both owned by the LLVM toolchain directly.

That makes it the right baseline for the REX comparison.

Not because native LLVM is magically the only correct implementation, but because the REX GPU path is deliberately targeting the same runtime ABI family.

The REX Side

The REX side is more explicit.

It does not ask Clang to own the OpenMP transformation model. Instead, it lowers first and compiles the result afterward.

That produces a different set of build artifacts:

  • host-side generated source,
  • device-side generated source,
  • helper/runtime glue files,
  • and usually a standalone CUBIN artifact built from the generated device file.

In simplified form, the REX side looks like this:

original benchmark source
  -> REX lowering
  -> rose_<input>.c
  -> rex_lib_<input>.cu
  -> helper files
  -> compiled binary + device image

That pipeline looks further removed from native LLVM, but the important point is where the two paths meet.

REX still targets the LLVM offloading runtime ABI:

  • __tgt_* runtime calls,
  • libomp,
  • libomptarget,
  • and the same general device-runtime boundary that the native LLVM path eventually uses.

So the REX path is not a different benchmark universe. It is a different lowering path aimed at the same runtime family.

That is exactly what makes the comparison meaningful.
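One cheap way to spot-check the shared-runtime property after the fact on a Linux box is to compare what the two binaries actually resolve at load time. The binary names below are assumptions; the check itself is generic.

```shell
# Compare the OpenMP runtime libraries two binaries link against.
# If the build contract held, both variants resolve the same
# libomp/libomptarget family rather than two unrelated stacks.
same_runtime_family() {
  a=$(ldd "$1" | grep -oE 'libomp(target)?\.so[^ ]*' | sort -u)
  b=$(ldd "$2" | grep -oE 'libomp(target)?\.so[^ ]*' | sort -u)
  [ "$a" = "$b" ]
}

# Usage (hypothetical binary names):
#   same_runtime_family ./bfs_native.out ./bfs_rex.out && echo "same family"
```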

Why The Shared Runtime Family Matters So Much

The benchmark campaign was never trying to prove that “system A” and “system B” happened to run similar applications.

It was trying to isolate the effect of compiler-generated code shape.

That only works if the runtime family is aligned.

If native LLVM used one offloading runtime and REX used some unrelated custom runtime, then a measured difference could come from:

  • the compiler,
  • the runtime,
  • or the interaction between a special runtime and a special compiler path.

That would be a valid systems comparison. It would not be the comparison this campaign wanted.

The campaign wanted something narrower:

  • native LLVM lowering plus the LLVM offloading runtime family,
  • versus REX lowering plus the same LLVM offloading runtime family.

That way, when a row moved, the explanation space stayed much more focused:

  • generated kernel shape,
  • launch geometry,
  • runtime-lifecycle placement,
  • mapping behavior,
  • helper glue,
  • or measurement and fairness issues.

Not “maybe the whole runtime stack was just different.”

A diagram showing native LLVM lowering and REX lowering entering the same LLVM libomp and libomptarget runtime family before reaching the GPU.

Figure 2. The benchmark campaign isolates lowering differences by forcing both paths back through the same LLVM runtime family. That is what keeps the comparison about the compiler rather than about two unrelated ecosystems.

This also explains why the LLVM 22 reevaluation was done the way it was.

That pass deliberately rebuilt both native LLVM and REX benchmark binaries against the new source-built LLVM 22 toolchain while reusing the same benchmark tree. The point was to isolate the effect of the runtime/toolchain upgrade itself, not to mix a toolchain change with a fresh regeneration and then guess which part moved the result.

That discipline only makes sense if the shared runtime boundary is considered part of the benchmark contract.
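In practice that kind of reevaluation can be pinned to a single changed variable through environment discipline. The install prefix below is a hypothetical example path, not the campaign's actual layout.

```shell
# Sketch: point both builds at one source-built LLVM 22 and
# change nothing else. The prefix is an illustrative assumption.
LLVM22=/opt/llvm-22
export PATH="$LLVM22/bin:$PATH"
export LD_LIBRARY_PATH="$LLVM22/lib:${LD_LIBRARY_PATH:-}"

# Same benchmark tree, same build commands, new toolchain, e.g.:
#   clang --version   # should now report the source-built LLVM 22
#   (rerun the unchanged per-benchmark build steps)
```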

Same Inputs, Same Invocation Contract

The build contract is only half the story.

A side-by-side comparison still becomes meaningless if the two binaries are run differently.

So the benchmark campaign also standardized the invocation contract.

Representative run commands were kept in small benchmark-owned files or scripts, for example:

./bfs.out ../../data/bfs/graph1MW_6.txt
./gaussian.out ../../data/gaussian/matrix1024.txt
./heartwall.out ../../data/heartwall/test.avi 20 4
./nn.out filelist_4 5 30 90
./pf.out 100000 1000

That may look mundane.

It is actually a major part of reproducibility.

Without explicit benchmark-owned invocation contracts, a comparison can drift into:

  • stale remembered commands,
  • slightly different problem sizes,
  • different flags or argument order,
  • or accidental use of a correctness-oriented reduced mode on one side and a performance mode on the other.

The run contract prevents that.

It says:

these are the exact inputs and arguments both variants must use

That keeps the campaign about compiler behavior rather than about operator memory.
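A minimal benchmark-owned run contract can be as small as a shell function that fixes everything except which variant binary is under test. The function name and layout here are illustrative, not the campaign's actual scripts.

```shell
# Sketch of a benchmark-owned run contract for the bfs row.
# Only the variant binary varies; the input path and arguments
# are owned by the benchmark, so neither side can drift.
run_bfs() {
  "$1" ../../data/bfs/graph1MW_6.txt
}

# Both variants are judged against the exact same command line:
#   run_bfs ./bfs_native.out
#   run_bfs ./bfs_rex.out
```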

A diagram contrasting a good run contract (explicit commands and inputs) with a bad one (relying on memory or ad hoc commands).

Figure 3. The run contract is the final piece of the benchmark contract. It ensures that both variants are judged against the exact same workload.

Why This Is A Lowering Comparison, Not A Stack Comparison

Once the build and run contracts are aligned, the benchmark layer can legitimately ask the question it cares about:

given the same application, the same inputs, and the same LLVM runtime family, how do native LLVM lowering and REX lowering behave?

That is the whole point.

The interesting differences then become:

  • how the device kernel body was generated,
  • how launch geometry was chosen,
  • how argument packing and map arrays were built,
  • where runtime lifecycle work was placed,
  • how helper files reconstructed the offloading ABI,
  • and whether any benchmark result moved because of compiler design or because of methodology mistakes.

Those are exactly the differences worth investigating.

They only become visible cleanly because the campaign worked to eliminate the larger structural mismatches first.

What This Build Contract Prevented

A disciplined side-by-side build contract is valuable not only for what it enables, but also for what it rules out.

It prevents at least four bad comparisons.

1. Comparing different runtime ecosystems

This is the most obvious failure mode.

If the two binaries do not share the same LLVM runtime family, then a result no longer isolates lowering behavior.

2. Comparing different workloads by accident

If one side runs one input and the other side runs a slightly different command, the row is already compromised.

3. Mistaking generated-artifact complexity for a different semantic target

The REX path naturally has more visible build artifacts:

  • host source,
  • device source,
  • helper files,
  • device image.

That complexity can make it look like the benchmark is comparing two incompatible systems.

The shared runtime boundary is what prevents that misreading.

4. Losing reproducibility across reruns

The benchmark campaign had multiple phases:

  • initial comparisons,
  • fairness reruns,
  • correctness-mode reruns,
  • GPU-total profiling,
  • and toolchain reevaluations.

Those would have been much harder to interpret if the build and run contracts were not already explicit.

Why This Build Contract Is The Foundation For Performance Analysis

This post matters as a prerequisite for the deeper optimization stories.

When analyzing:

  • the early nn regression,
  • the old XOMP scheduler path,
  • launch-geometry policy,
  • read-only-load recovery in b+tree,
  • or the LLVM 22 reevaluation,

the reader needs to trust that the campaign was actually comparing like with like.

That trust starts here:

  • same application,
  • same runtime family,
  • same inputs,
  • two different lowering paths.

Once that is clear, the later performance story can focus on the real engineering reasons the rows moved instead of having to keep defending the basic validity of the comparison.

The Design Rule In One Sentence

REX builds native LLVM and generated variants side by side for GPU benchmarking by making the two paths differ in lowering but converge again at the same LLVM offloading runtime family and the same benchmark invocation contract.

That is what turns the campaign into a meaningful compiler comparison rather than a comparison of unrelated software stacks.