How REX Proves Lowered OpenMP Semantics With CPU Equivalence Tests

REX’s lowering_cpu suite is the semantic checkpoint between structural lowering tests and full GPU benchmark runs. For each case, it compiles the original OpenMP source and the REX-lowered source with the same Clang toolchain and the same LLVM OpenMP runtime, runs both repeatedly with OMP_NUM_THREADS=2 and OMP_NUM_THREADS=4, and compares stdout and stderr in either exact or sorted mode. That isolates lowering bugs from GPU-specific noise and catches semantic regressions such as wrong reduction results, broken single or barrier semantics, clause-precedence drift, or incorrect loop scheduling behavior much earlier than a benchmark-only workflow can.

The previous post in this series focused on lowering_rodinia: reduced Rodinia-like inputs, stable invariants, and why GPU lowering should be tested on structural facts instead of brittle full-file snapshots.

That leaves one obvious gap.

Structural correctness is necessary, but it is not enough.

A lowerer can still emit the right number of kernels, the right runtime include, the right helper ordering, and the right launch API shape while changing the program’s meaning. A broken reduction, a misplaced barrier, or an incorrect if plus num_threads interaction will not necessarily show up in a structural verifier. It will show up when the transformed program actually runs.

REX’s answer is the lowering_cpu suite under tests/nonsmoke/functional/CompileTests/OpenMP_tests/lowering_cpu.

This suite asks a much sharper question than a GPU benchmark does:

if we compile the original OpenMP program and the REX-lowered program with the same CPU OpenMP runtime, do they still behave the same?

That is exactly the right next layer after structural lowering checks.

A three-layer testing ladder with structural lowering tests, CPU equivalence tests, and full GPU benchmark validation.

Figure 1. CPU equivalence sits in the middle of the OpenMP test stack for a reason. It is the layer that checks semantics without dragging in GPU-specific runtime, launch, transfer, and scheduling noise.

Why CPU Equivalence Deserves Its Own Test Layer

The key idea behind lowering_cpu is not complicated, but it is very disciplined:

  • keep the original OpenMP source,
  • lower the same source with REX,
  • compile both programs with the same Clang toolchain,
  • link both programs against the same LLVM OpenMP runtime,
  • run both with the same thread counts,
  • and compare what they print.
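The final compare step is just a diff over captured output. A minimal sketch of that step, with shell functions standing in for the two compiled binaries (the real harness compiles orig.exe and lowered.exe first; the printed line here is only illustrative):

```shell
#!/bin/sh
# Stand-in "binaries": in the real harness these are orig.exe and
# lowered.exe, compiled from the original and REX-lowered sources.
orig()    { printf 'thread num=2 sum =102\n'; }
lowered() { printf 'thread num=2 sum =102\n'; }

workdir=$(mktemp -d)
orig    > "$workdir/orig.out"
lowered > "$workdir/lowered.out"

# Exact-mode comparison: any byte difference is a failure.
if diff -u "$workdir/orig.out" "$workdir/lowered.out"; then
  echo "PASS"
else
  echo "FAIL"
fi
rm -rf "$workdir"
```

If lowering drifts, the divergence shows up as an ordinary diff between the two capture files, which is about as debuggable as a failure can get.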

That looks almost too simple. In practice it is one of the most valuable layers in the whole test stack.

The reason is that it removes the wrong variables from the experiment.

When a GPU test fails, the failure could come from many places:

  • target lowering,
  • map construction,
  • host launch generation,
  • helper/runtime glue,
  • device code generation,
  • offload runtime behavior,
  • device scheduling,
  • transfer timing,
  • or even a toolchain drift issue unrelated to lowering itself.

That breadth is useful for final validation, but it makes a GPU run a terrible first debugging surface for a semantic question.

By contrast, lowering_cpu keeps the execution model much tighter. It tests whether REX’s source-to-source lowering of OpenMP still preserves OpenMP semantics when the result is executed in a controlled CPU environment. If this suite fails, the suspicion immediately narrows:

  • the parser may have been fine,
  • the AST may have been fine,
  • the structural lowering shape may have looked fine,
  • but the transformed host program no longer behaves like the source it came from.

That is a much more actionable failure.

What The Suite Actually Compares

The suite README states the contract plainly:

  1. compile the original OpenMP source and run it with LLVM OpenMP runtime,
  2. compile the REX-lowered source and run it with the same runtime,
  3. compare outputs under controlled thread-count settings.

The CMake file makes the suite concrete by splitting cases into two comparison modes:

set(OMP_LOWERING_CPU_EXACT_CASES
    barrier.c
    parallel.c
    parallel-if.c
    parallel-if-numthreads.c
    parallel-numthreads.c
    parallel-reduction.c
    parallel-reduction2.c
    parallelfor2.c
    single.c
    omp_version.c)

set(OMP_LOWERING_CPU_SORTED_CASES
    ompfor.c
    ompfor_c99.c
    ompfor3.c
    ompfor4.c
    ompfor-default.c
    ompfor-decremental.c
    ompfor-static.c)

That case list already tells you something about the suite’s intent.

This is not a random sample of OpenMP examples. It is a focused semantic surface:

  • barriers,
  • single,
  • if clauses,
  • num_threads,
  • reductions,
  • parallel for,
  • schedule behavior,
  • and even the _OPENMP macro contract.

In other words, the suite is trying to answer exactly the question a lowerer must answer: did the program still mean the same thing after transformation?

The Apples-To-Apples Contract

The most important engineering decision in the harness is that both variants are compiled in the same environment.

The original source is built like this:

"$compiler" "${compile_flags[@]}" "$input_file" -o "$workdir/orig.exe"

Then the lowered source is generated and built in the same script:

(
  cd "$workdir"
  "$parse_omp" --rex-omp-lowering -w -rose:verbose 0 \
    -rose:skipfinalCompileStep -c "$input_file" > lower.log 2>&1
)

"$compiler" "${compile_flags[@]}" -I"$lowering_inc" "${lowered_sources[@]}" \
  -o "$workdir/lowered.exe"

That is a very clean contract.

The same compiler is used for both binaries. The same runtime library directory is passed to both. The same -fopenmp=libiomp5 setting is used for both. The same _OPENMP version macro is injected into both. Even the dynamic loader path is aligned before execution:

export LD_LIBRARY_PATH="${omp_runtime_dir}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

That means the test is not asking whether “REX plus some different downstream environment” behaves like the original. It is asking whether the transformation itself preserved semantics when all downstream variables are held constant.
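The `${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}` expansion is worth unpacking, because it is what keeps the exported path well formed. A small demonstration of the idiom, using a hypothetical runtime directory:

```shell
#!/bin/sh
# ${old:+:${old}} expands to ":<old value>" only when old is set and
# non-empty, so the result never ends in a stray ':'.
omp_runtime_dir=/opt/llvm-openmp/lib   # hypothetical runtime location

old=""
echo "${omp_runtime_dir}${old:+:${old}}"   # /opt/llvm-openmp/lib

old="/usr/lib:/usr/local/lib"
echo "${omp_runtime_dir}${old:+:${old}}"   # /opt/llvm-openmp/lib:/usr/lib:/usr/local/lib
```

The runtime directory always lands first in the search order, and an empty inherited path never produces a dangling separator.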

A diagram showing the original OpenMP source and the REX-lowered source both compiled by the same Clang compiler and linked against the same LLVM OpenMP runtime before output comparison.

Figure 2. The harness is deliberately narrow. The original and lowered programs go through the same compiler and the same runtime so the comparison isolates the effect of lowering rather than conflating it with downstream environment changes.

Why -rose:skipfinalCompileStep matters

The harness tells parseOmp to stop after producing the transformed source:

-rose:skipfinalCompileStep

That is the right choice for this layer. The test does not want the translator to hide compile behavior inside an opaque internal step. It wants the lowered source files on disk so it can compile them in exactly the same way as the original. That preserves the apples-to-apples comparison.

It also makes failures easier to debug. If lowering fails, the harness can show lower.log. If compilation fails, the compiler invocation is visible. If execution diverges, the script can diff the resulting outputs directly.

Why the suite stages only omp.h

One subtle but important detail is in the CMake setup:

# Stage only omp.h so include lookup still uses the active compiler's std headers.
set(_omp_header_stage_dir "${CMAKE_CURRENT_BINARY_DIR}/llvm_openmp_include")
configure_file("${_llvm_omp_header}" "${_omp_header_stage_dir}/omp.h" COPYONLY)

This is more thoughtful than it looks.

The suite wants the OpenMP header from the LLVM runtime stack it is testing, but it does not want to redirect all include resolution through a foreign sysroot-like directory. That would risk dragging in nonstandard header combinations and turning the comparison into an environment experiment.

So the harness stages only omp.h, then lets the active compiler continue resolving the ordinary C standard headers normally.

That is exactly the right boundary for a semantic equivalence test.

Exact Mode And Sorted Mode

The suite does not use one blunt comparison rule for every case. It uses two.

In exact mode, stdout and stderr must match exactly.

In sort mode, the first stdout line is preserved and the remaining lines are sorted before comparison:

canonicalize() {
  local in_file="$1"
  local out_file="$2"
  if [[ "$mode" == "sort" ]]; then
    if [[ -s "$in_file" ]]; then
      head -n 1 "$in_file" > "$out_file"
      tail -n +2 "$in_file" | sort >> "$out_file"
    else
      : > "$out_file"
    fi
  else
    cp "$in_file" "$out_file"
  fi
}

This is a very pragmatic compromise, and it shows a good understanding of OpenMP behavior.

Some outputs are inherently deterministic enough that exact comparison is the right bar. If a single region or a reduction test prints the wrong thing, the suite should fail precisely.

But some loop-scheduling cases are expected to produce the same set of iteration reports while allowing harmless line-order differences across executions. In those cases, requiring exact line order would make the suite artificially brittle. It would be testing interleaving luck instead of semantic equivalence.

So the suite preserves the first line, which often contains a stable banner such as the thread count, and then sorts the remaining per-iteration lines before comparison.

That is not a hack. It is a precise statement about what the suite believes is semantically meaningful for those tests.
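A quick way to see what sort mode buys: run the harness's canonicalization logic (reproduced here with the mode branch hardcoded to sort, and with made-up iteration lines) over two interleavings of the same iteration set.

```shell
#!/bin/sh
# Reproduction of the harness's sort-mode canonicalization: keep the
# first line (the stable banner), sort everything after it.
canonicalize() {
  in_file=$1 out_file=$2
  if [ -s "$in_file" ]; then
    head -n 1 "$in_file" > "$out_file"
    tail -n +2 "$in_file" | sort >> "$out_file"
  else
    : > "$out_file"
  fi
}

# Two runs that printed the same iterations in different orders.
printf 'threads=2\niter 1 thread 1\niter 0 thread 0\n' > run_a.txt
printf 'threads=2\niter 0 thread 0\niter 1 thread 1\n' > run_b.txt

canonicalize run_a.txt run_a.canon
canonicalize run_b.txt run_b.canon
diff -q run_a.canon run_b.canon && echo "equivalent"
rm -f run_a.txt run_b.txt run_a.canon run_b.canon
```

Both canonical files start with the banner and list the iterations in sorted order, so the comparison passes even though the raw interleavings differed.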

A comparison between exact output matching and sorted matching for OpenMP CPU equivalence tests.

Figure 3. Different OpenMP constructs deserve different comparison surfaces. Exact mode is right for deterministic outputs. Sorted mode is right when the set of observations matters more than the incidental print interleaving.

What Kinds Of Semantics The Cases Cover

The suite is small, but it is not shallow. A few representative cases make that clear.

Reductions: value semantics, not just structure

parallel-reduction.c is a good example:

int sum = 100, i = 1, thread_num;
#pragma omp parallel reduction(+ : sum)
{
#pragma omp single
  {
    thread_num = omp_get_num_threads();
  }
  sum += i;
}
printf("thread num=%d sum =%d\n", thread_num, sum);
assert(sum == (i * thread_num + 100));

A structural lowering test can prove that a reduction-shaped construct existed in the source and that the generated output still looks like a reduction site. But it cannot prove that the transformed program still computes the correct final value.

This CPU equivalence case can.

If lowering accidentally drifts on:

  • the reduction variable’s scope,
  • initialization,
  • update placement,
  • synchronization,
  • or the relationship between single and the reduction body,

the mismatch appears immediately in either the printed value or the assertion outcome.

single and barrier: control-flow semantics

single.c and barrier.c cover a different category. They are not about arithmetic; they are about ordering and control flow.

single.c checks that only one thread executes the single body and that the surrounding updates still mean the same thing:

int i = 100, num_threads;
#pragma omp parallel
{
#pragma omp single
  {
#pragma omp atomic
    i += 100;
  }

#pragma omp single nowait
  {
    num_threads = omp_get_num_threads();
  }
}
assert(i == 200);

barrier.c checks that work before and after the barrier still respects the directive boundary.

These are exactly the kinds of bugs a lowerer can introduce while leaving the program looking superficially reasonable. If block restructuring, statement motion, or clause handling drifts, the transformed code may still compile and even pass a structural test while no longer respecting OpenMP control-flow semantics.

if and num_threads: clause precedence

parallel-if-numthreads.c is especially useful because it checks an interaction, not just one clause in isolation:

int i = 0;
#pragma omp parallel if (i != 0) num_threads(3)
{
#pragma omp single
  {
    assert(omp_get_num_threads() == 1);
  }
  printf("Mutual exclusive output 2.\n");
}

The intended meaning is clear: if the if clause disables parallel execution, the num_threads(3) request should not force a three-thread team anyway.

That is the sort of semantic detail a parser test will never prove and a structural lowerer test will only gesture at indirectly. The CPU equivalence suite can check it directly under execution.

ompfor*: loop scheduling and iteration ownership

The ompfor* cases are where sorted comparison becomes important.

ompfor-static.c is a good example:

#pragma omp for schedule(static, 3)
for (i = lower; i < upper; i += stride) {
  printf("Iteration %2d is carried out by thread %2d\n", i,
         omp_get_thread_num());
}

The semantic goal is not “output line 7 must print before output line 8.” The semantic goal is that the transformed program still reports the same iteration ownership pattern under the same schedule semantics.

Sorting the per-iteration lines keeps the test focused on that fact instead of on incidental interleaving.

_OPENMP: compilation contract, not only runtime behavior

omp_version.c serves a different purpose. It checks that the compile environment presented to the lowered program still reflects the intended OpenMP version:

#if (_OPENMP<201511)
#error "An OpenMP 4.5 compiler is needed to compile this test."
#endif

Then it prints the recognized _OPENMP value at runtime.

That is a useful reminder that semantic preservation is not only about runtime execution. It is also about the compile-time contract the transformed source sees.

Why The Harness Repeats Runs

The script does not run each binary once and call it a day. It uses:

thread_counts=(2 4)
repeats=5

and then executes both variants repeatedly under both thread counts.

That is another disciplined design choice.

A single OpenMP run can pass by luck, especially for cases whose visible output depends on scheduling or where a latent bug only manifests under certain team sizes. Repeating the tests across multiple thread counts reduces the chance that a fragile lowering happens to sneak through on one lucky run.
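The repeat structure itself is just a nested loop over thread counts and repetitions. A minimal sketch with the same shape as the harness settings, where run_case is a hypothetical stand-in for one run-and-compare cycle:

```shell
#!/bin/sh
# thread_counts (2 4) times repeats (5) means each case executes
# 2 * 5 = 10 times. run_case here is a counting stand-in.
count=0
run_case() { count=$((count + 1)); }

repeats=5
for threads in 2 4; do
  export OMP_NUM_THREADS=$threads
  i=1
  while [ "$i" -le "$repeats" ]; do
    run_case || exit 1   # a real harness would stop on first mismatch
    i=$((i + 1))
  done
done
echo "total runs: $count"   # total runs: 10
```

Ten executions per case is cheap on the CPU, and it is exactly the kind of repetition that is too expensive to do casually at the GPU benchmark layer.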

The timeout protection is also important:

timeout 30s env OMP_NUM_THREADS="$threads" "$workdir/orig.exe" ...
timeout 30s env OMP_NUM_THREADS="$threads" "$workdir/lowered.exe" ...

This keeps a bad lowering from hanging the entire suite indefinitely. Again, that is a small implementation detail with a big effect on maintainability.
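The property the harness relies on is that GNU timeout kills the child and reports a distinctive exit status, 124, when the limit is hit. A minimal demonstration:

```shell
#!/bin/sh
# timeout exits with status 124 when the command is killed for
# exceeding its time limit, so a hung binary becomes a clean,
# checkable failure instead of a stuck test run.
timeout 1s sleep 10
echo "exit status: $?"   # exit status: 124
```

A harness can branch on that status to report "hung" separately from "printed the wrong thing", which is useful triage information on its own.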

What This Layer Catches That lowering_rodinia Does Not

The relationship between this post and the previous one is worth stating directly.

lowering_rodinia and lowering_cpu are not competing strategies. They answer different questions.

lowering_rodinia asks:

  • did the generated host and device artifacts have the right shape?
  • did the right number of kernels appear?
  • did rex_offload_init() land in the right place?
  • did repeated host call sites survive?
  • did comment and map-array invariants remain intact?

lowering_cpu asks:

  • did the transformed program still behave the same under execution?
  • did the reduction still reduce correctly?
  • did single, barrier, and nowait semantics survive?
  • did clause interactions such as if plus num_threads remain correct?
  • did loop scheduling still assign work the same way?

Those are not interchangeable questions.

A compiler needs both layers because structure can be right while semantics drift, and semantics can fail long before a full GPU benchmark tells you where to look.

What This Layer Catches Earlier Than GPU Benchmarks

It is also important to understand what lowering_cpu buys compared with the full benchmark layer.

Full benchmark runs are essential. They test real applications, offloading correctness, and end-to-end performance. But they are also slow, noisy, and expensive to debug.

If a benchmark fails, the immediate question is often broad and unpleasant:

is this a real lowering bug, a runtime glue issue, a target mapping issue, a GPU-only scheduling issue, or just a benchmark-specific environment problem?

The CPU equivalence suite answers a much more surgical question first:

does the lowered source still match the meaning of the original when GPU-specific variables are removed from the experiment?

If the answer is no, the lowerer has a semantic bug and the debugging target is much smaller.

If the answer is yes, then the later GPU investigation can focus on what is genuinely GPU-specific.

That is a huge reduction in debugging cost.

Why This Is A Particularly Good Fit For REX

This kind of suite works especially well for REX because REX is a source-to-source compiler.

The lowered source is not hidden inside a private binary IR. The harness can:

  • ask parseOmp to emit rose_*.c,
  • compile that source directly,
  • diff execution results against the original source,
  • and treat the transformed output as a first-class program artifact.

That is a real advantage of the architecture.

It means the semantic checkpoint is not theoretical. It is built around the same visible source artifacts that developers already inspect when debugging lowering.

In a sense, lowering_cpu is where the source-to-source design pays back some of its complexity. Because the lowered program exists as ordinary source, the suite can test it like an ordinary program while still isolating the transformation that created it.

The Real Value Of lowering_cpu

The deepest value of the suite is not that it runs a handful of OpenMP examples on the CPU. It is that it encodes a mature rule for compiler testing:

after you prove structural shape, prove semantics in the smallest environment that still exercises the transformed program honestly.

That is what lowering_cpu does.

It does not try to replace structural lowering tests. It does not try to replace benchmark validation. It sits between them and makes both more useful:

  • if structural invariants break, you know the lowerer’s shape is wrong;
  • if CPU equivalence breaks, you know semantics drifted before GPU-specific complexity entered the picture;
  • if both pass and a benchmark still fails, the remaining search space is much smaller.

That is why this layer deserves to exist on its own and to be understood on its own.

It is the point in the OpenMP test stack where REX stops asking “did we generate something plausible?” and starts asking “did we preserve the program?”