How REX Tests GPU Lowering Invariants With Reduced Rodinia Cases

Posted on Mar 31, 2026 (Updated on May 3, 2026)

The lowering_rodinia suite in REX tests GPU offloading the way a lowerer should be tested: not with brittle full-file golden outputs, but with stable invariants over the generated host and device artifacts. Each reduced Rodinia-like case targets a concrete failure class such as wrong kernel count, repeated-call breakage, misplaced rex_offload_init(), broken target data lifetime structure, hidden device-parameter duplication, pragma/comment relocation, or incorrect map-array construction. That makes the suite durable across harmless formatting churn while still catching real semantic regressions.

The previous post in this series focused on the generated artifact model for GPU offloading: the rewritten host file, the synthesized rex_lib_*.cu device file, and the shared helper layer that makes the final build work.

Once that file split is understood, the next practical question is obvious:

how do we test that those generated files are still correct when the lowerer changes?

That question matters because GPU lowering is one of the easiest places for a source-to-source compiler to become brittle. A harmless helper reorder can change a file diff. A symbol hash can change. Formatting can move. A new internal variable name can appear. If the test strategy is “compare the whole generated file byte for byte,” developers quickly end up debugging noise instead of regressions.

REX’s answer is the lowering_rodinia suite under tests/nonsmoke/functional/CompileTests/OpenMP_tests/lowering_rodinia.

This suite uses reduced Rodinia-like inputs and checks invariants over the generated files rather than exact-file snapshots. That design is not just nicer testing style. It is what makes the lowering pipeline maintainable while the compiler keeps changing.

A comparison between brittle whole-file golden diffs and stable invariant-based lowering checks. — Figure 1. A lowerer should be tested on stable semantic facts, not on every generated token. Invariant checks keep the suite sensitive to regressions without making it fragile to harmless churn.

Why Lowering Needs Invariants Instead Of Full-File Goldens

Lowering is not like parser testing.

If a parser round-trip changes unexpectedly, the textual spelling itself is often the right comparison surface. But a lowerer manufactures large amounts of code that the user never wrote:

synthesized helper arrays,
generated kernel identifiers,
offload-entry objects,
runtime wrapper calls,
hidden parameters,
device-only declarations,
comments moved to new structural positions,
and sometimes whole support blocks inserted into main.

Many of those things are semantically stable while remaining textually unstable.

That is exactly why the lowering_rodinia README defines its goals this way:

no dependence on legacy output reference files,
catch semantic regressions observed during the LLVM-22 migration,
remain robust against unstable symbol hash IDs and format churn.

That is the right mindset for a lowerer. A good lowering test asks questions such as:

how many kernels were emitted?
did the host file include the runtime header exactly once?
did rex_offload_init() appear exactly once, and in the right place?
did the host still call the same lowered helper multiple times?
did the device file get the hidden launch environment parameter once, not twice?
did a target data region still enclose the right multi-kernel lifetime pattern?
did a commented-out pragma stay attached to the code it documents?

Those are semantic facts. They are the contract the lowerer is supposed to preserve.

By contrast, these are usually poor test anchors:

the exact symbol suffix chosen for OUT__...,
line-for-line formatting,
declaration ordering that carries no meaning,
or the precise spellings of intermediate temporaries.

When a suite is tied too tightly to those unstable details, the failures become low-signal. Developers start distrusting the tests, and once that happens the suite stops serving its real job.

What `lowering_rodinia` Actually Runs

The suite is small in source size but precise in intent. The CMake registration is direct:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
set(RODINIA_LOWERING_CASES
    rodinia_axpy_multi_like
    rodinia_bfs_like
    rodinia_gaussian_like
    rodinia_hotspot_like
    rodinia_inactive_output_like
    rodinia_nn_like
    rodinia_pathfinder_like
    rodinia_srad_comments_like
    rodinia_srad_v2_like
    rodinia_btree_kernel_like
)

Each case is a reduced, Rodinia-shaped source input that targets one or more concrete lowering failure classes. The suite is not trying to be a benchmark harness. It is trying to preserve structural truths about generated output that real benchmarks depend on.

The test driver itself is deliberately thin:

1
2
3
4
5
6
(
  cd "${workdir}"
  "${parse_omp}" --rex-omp-lowering -w -rose:verbose 0 -c "${input_file}" > run.log 2>&1
)

bash "$(dirname "$0")/verify_outputs.sh" "${case_name}" "${workdir}"

That split matters.

Step one runs the translator exactly through the lowering path that matters here. Step two does not diff the whole output tree. It inspects the generated rose_*.c and rex_lib_*.cu files and asks whether the expected invariants still hold.

This is a good compiler-testing pattern because it separates:

“did lowering run at all?”
from “did lowering produce the right structural result?”

The first question is mostly a driver question. The second is the one that protects correctness.

The Core Design: Host Checks And Device Checks Are Different

The most important function in the verifier is verify_common_cuda_lowering(). It is worth studying because it tells you what the suite believes the lowering contract actually is.

On the host side, it checks facts such as:

exactly one #include "rex_kmp.h",
exactly one rex_offload_init(),
zero rex_offload_fini() insertions,
the expected number of __tgt_offload_entry objects,
the expected number of kernel ID symbols,
the expected number of __tgt_target_kernel(...) launch sites,
and the ordering relation that the runtime include must appear before the first offload entry.

On the device side, it checks a different contract:

exactly one #include "rex_nvidia.h",
exactly one extern "C" wrapper,
the expected number of __global__ void OUT__... kernels,
the expected number of hidden __rex_kernel_launch_env parameters,
and the absence of duplicate hidden launch-environment parameters.

That host/device split is one of the most mature things about the suite. It recognizes that a lowerer does not generate “one output.” It generates multiple artifacts with different responsibilities, so the invariants must be stated per artifact.

Host-side checks over the generated rose file and device-side checks over the generated rex_lib file. — Figure 2. The verifier does not treat the lowered output as one flat blob. Host and device files carry different contracts, so they are checked differently.

You can see that design directly in the shell code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
expect_count "${rose_file}" '#include "rex_kmp.h"' 1 "host runtime include count"
expect_count "${rose_file}" 'rex_offload_init[[:space:]]*\(' 1 "host offload init count"
expect_count "${rose_file}" 'struct[[:space:]]+__tgt_offload_entry[[:space:]]+OUT__' \
  "${kernel_count}" "host offload entry count"
expect_count "${rose_file}" '__tgt_target_kernel[[:space:]]*\(' \
  "${kernel_count}" "host target kernel call count"

expect_count "${cu_file}" '#include "rex_nvidia.h"' 1 "device runtime include count"
expect_count "${cu_file}" '__global__[[:space:]]+void[[:space:]]+OUT__' \
  "${kernel_count}" "device kernel count"
expect_count "${cu_file}" '__global__[[:space:]]+void[[:space:]]+OUT__.*\(void[[:space:]]*[*][[:space:]]*__rex_kernel_launch_env' \
  "${kernel_count}" "device hidden launch env parameter count"

What is notable here is not the use of grep. It is the choice of what to grep for.

The verifier is intentionally indifferent to unstable details, but it is very strict about the facts that matter for correctness and runtime ABI conformance.

Multi-Kernel And Repeated-Call Cases

The strongest tests in the suite are the ones that mirror the shapes that repeatedly caused bugs during real GPU-offloading work: multiple kernels in one source file, repeated host calls to the same lowered helper, and target data regions whose lifetime spans more than one kernel.

`rodinia_axpy_multi_like`: same helper called more than once

This case is compact, but it checks a surprisingly rich contract:

1
2
3
4
scale_like(x, scale, n);
axpy_like(x, y, a, n);
bias_like(y, bias, n);
axpy_like(x, y, a, n);

There are three kernels in the input, but the important twist is that axpy_like(...) is called twice from main.

That matters because a broken lowerer can accidentally get this wrong in several ways:

emitting the kernel only once but failing to preserve both host launch sites,
reusing mutable launch-side state incorrectly across calls,
or outlining in a way that makes repeated calls refer to the wrong helper or wrong mapping shape.

The verifier protects against that by checking both the global kernel count and the repeated-call count on the host side:

1
2
expect_count "${rose_file}" 'axpy_like[[:space:]]*\([[:space:]]*x,[[:space:]]*y,[[:space:]]*a,[[:space:]]*n[[:space:]]*\)[[:space:]]*;' \
  2 "repeated host call count"

That is a good example of an invariant that a whole-file diff would hide in noise. The meaningful fact is not “the file matches exactly.” The meaningful fact is “the call graph shape survived lowering.”

`rodinia_hotspot_like`: two kernels under one `target data` lifetime

This case checks a different shape. The input contains one target data region and two collapsed kernels inside it:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
#pragma omp target data map(from : result[0 : RODINIA_HOTSPOT_SIZE]) \
    map(to : power[0 : RODINIA_HOTSPOT_SIZE]) \
    map(tofrom : temp[0 : RODINIA_HOTSPOT_SIZE])
{
#pragma omp target teams distribute parallel for collapse(2)
  ...

#pragma omp target teams distribute parallel for collapse(2)
  ...
}

The real invariant here is lifetime structure, not only kernel count. The suite wants to know that lowering still preserves the fact that both kernels are siblings inside one mapping region rather than separate regions with duplicated transfers or broken sharing semantics.

That is why reduced Rodinia cases are so useful. They retain the structural property that matters without carrying benchmark-scale noise.

`rodinia_pathfinder_like`: a host loop launching a kernel repeatedly inside one region

This case adds one more lifetime pattern:

1
2
3
4
5
6
7
8
9
#pragma omp target data ...
{
  for (t = 0; t < RODINIA_PATH_ROWS - 1; t++) {
#pragma omp target teams distribute parallel for private(n)
    for (n = 0; n < RODINIA_PATH_COLS; n++) {
      ...
    }
  }
}

Now the issue is not two distinct kernels in one region. It is one lowered kernel shape launched repeatedly inside a long-lived region.

That catches a different class of bugs:

map lifetimes accidentally shortened to one iteration,
state or helper declarations emitted in the wrong scope,
or host launch generation that does not survive repeated use under one enclosing offload lifetime.

Together, axpy_multi_like, hotspot_like, and pathfinder_like form a small matrix of the multi-kernel and repeated-call shapes that matter most in real applications.

Non-Obvious Regression Classes

A good lowering suite does not only check the obvious cases. It also protects the weird regressions that show up when a compiler is under real development pressure.

A matrix mapping reduced Rodinia cases to the kinds of lowering regressions they target. — Figure 3. Each reduced Rodinia case exists because a different structural failure mode matters. Together they cover repeated calls, `target data` lifetimes, comment relocation, launch ordering, and map-array integrity.

`rodinia_nn_like`: `rex_offload_init()` placement and scheduler cleanup

This case became especially important during the migration away from older CUDA helper patterns. The source starts with a timer declaration:

1
2
3
4
int main(void) {
  long long time0 = clock();
  int i;
  ...

The suite explicitly checks that lowering inserts rex_offload_init() before that timer declaration, not somewhere later in the function:

1
2
3
time0_line="$(first_line "${rose_file}" 'long[[:space:]]+long[[:space:]]+time0[[:space:]]*=[[:space:]]*clock[[:space:]]*\(')"
init_line="$(first_line "${rose_file}" 'rex_offload_init[[:space:]]*\(')"
(( init_line < time0_line )) || die "rex_offload_init moved after timer declaration"

That is a good example of a structural invariant with real consequences. The concern is not style. The concern is that initialization must happen before the host starts the timed region and before the launch path can be reached.

The same test also forbids old device-side scheduler helpers:

XOMP_static_sched_init(...)
XOMP_static_sched_next(...)
getCUDABlockThreadCount(...)
getLoopIndexFromCUDAVariables(...)

That is how the suite encodes the migration away from an outdated lowering shape without hard-coding every line of the resulting device file.

`rodinia_srad_comments_like`: commented pragmas still need stable attachment

Comment handling may look cosmetic until it breaks source fidelity in generated output. This case uses marker comments such as:

1
2
3
4
5
// RODINIA_SRAD_ROWS
// #pragma omp parallel
for (i = 0; i < RODINIA_NR; i++) {
  ...
}

The verifier does not merely count those comments. It checks ordering and distance:

the pragma comment must stay between the right marker comments,
and it must remain close enough to the code it documents.

That is an unusually thoughtful test. It acknowledges that source-to-source compilation is partly about preserving code that humans will read. A lowerer that preserves semantics while scattering user comments and commented directives into confusing positions is still degrading the generated artifact.

`rodinia_btree_kernel_like`: map arrays, implicit pointers, and trailing-comment stability

This case is probably the richest single case in the suite.

It has:

two kernels,
repeated host calls to both,
multiple map(to) clauses,
implicit pointer captures,
and a trailing // main comment attached to the second helper.

The host-side checks are correspondingly specific:

repeated call counts for both helpers,
zero-sized argument sizes for implicit pointer entries,
specific runtime map-type patterns,
no stray “target parameter only” map types,
and one preserved trailing // main comment.

This is exactly the sort of case that explains why a lowerer needs invariant tests. A full-file diff would bury these details in generated noise. The current suite turns them into named structural contracts.

Why Reduced Rodinia Inputs Work So Well

It is worth pausing on the “reduced Rodinia-like” part of the design.

These cases are not meant to replace full benchmark runs. They exist because benchmark-scale inputs are too expensive and too noisy for day-to-day lowering development. But toy loops alone are often too weak: they do not reproduce the lifetime nesting, repeated-call shapes, or mapping patterns that real applications use.

Reduced Rodinia cases hit the middle ground:

small enough to inspect quickly,
stable enough to live in a compile-test suite,
but rich enough to encode real structural bugs taken from GPU-offloading work.

That middle ground is what gives the suite its leverage. It lets developers debug the lowerer with cases that still look like real programs.

What This Layer Catches That Other Test Layers Do Not

The broader OpenMP test stack in REX already has parser tests, frontend compile tests, and larger end-to-end checks. The question, then, is why this layer deserves to exist separately.

The answer is that lowering_rodinia catches failures that naturally fall through the cracks of the other layers.

Parser and AST tests can tell you that:

the directive was recognized,
the clauses were attached,
and the frontend did not fail while constructing SgOmp*.

But they do not tell you whether lowering emitted:

one or three kernels,
a duplicated hidden device parameter,
launch code in the right host scope,
the correct repeated host call pattern,
stable target data lifetime structure,
or the right runtime map-array shape for implicit pointers.

Full benchmark runs can reveal some of those problems eventually, but only after the failure has become expensive to debug. By then you are staring at a large application and asking which compiler stage drifted.

lowering_rodinia stops that debugging chain earlier.

It is the suite that says:

the parser was fine,
the AST was fine,
but the generated host/device artifacts no longer satisfy the lowerer’s contract.

That is a much better failure report.

The Real Value Of The Suite

The deepest value of lowering_rodinia is not that it uses shell scripts or reduced benchmark cases. It is that it encodes a mature view of what lowerer correctness actually means.

Lowering correctness is not only:

“did a file get produced?”
or “did the benchmark still run once on one machine?”

It is also:

did host and device artifacts agree on kernel structure?
did repeated host launches survive outlining?
did runtime include and registration scaffolding appear exactly once?
did offload initialization land in the correct place?
did the device signatures stay normalized?
did target data lifetime relationships survive translation?
did user-visible comments stay attached to the right code?

Once you frame the problem that way, the design of the suite makes sense immediately.

This is not a golden-file museum. It is a set of executable assertions about the lowerer’s real responsibilities.

That is why the suite has held up well while the GPU-offloading path changed underneath it. And it is why this layer is worth understanding separately from the broader OpenMP test story.

The next time a GPU-lowering patch changes generated output, the useful question is not “did every line stay identical?” It is “did the structural promises of lowering remain true?”

lowering_rodinia is where REX answers that question.

Why Lowering Needs Invariants Instead Of Full-File Goldens

What lowering_rodinia Actually Runs

The Core Design: Host Checks And Device Checks Are Different

Multi-Kernel And Repeated-Call Cases

rodinia_axpy_multi_like: same helper called more than once

rodinia_hotspot_like: two kernels under one target data lifetime

rodinia_pathfinder_like: a host loop launching a kernel repeatedly inside one region