How REX Tests OpenMP End to End
ompparser tests validate directive parsing and location tracking; the OpenMP_tests corpus exercises frontend AST construction and analysis;
lowering_rodinia checks stable lowering invariants rather than brittle golden dumps; lowering_cpu compares original and lowered executables on the same CPU runtime;
and full benchmark runs validate GPU correctness and performance. Each layer catches a different class of regression before it can leak into the next one.

The previous posts in this series focused on architecture:
- why REX keeps OpenMP under its own ownership,
- how Stage 1 preserves directives before parsing,
- how Stage 2 converts OpenMPIR into SgOmp*.
But none of that architecture matters if the compiler becomes too fragile to change.
This is where the test system comes in.
REX’s OpenMP tests are most understandable when you stop thinking of them as one giant suite and instead treat them as a stack. Each layer matches a stage in the compiler pipeline and answers a narrower question than the layer above it.
That structure is deliberate. Parser bugs, AST-construction bugs, analyzer bugs, lowering-shape bugs, semantic regressions, and benchmark-level runtime or performance regressions are different failure classes. They should not all be debugged through the same slow end-to-end path.
Figure 1. The OpenMP test strategy mirrors the compiler architecture. Each layer catches a different kind of failure before it contaminates the next stage.
Why The Test Stack Mirrors The Compiler Stack
REX’s OpenMP pipeline already has explicit checkpoints:
- parse-only,
- AST-only,
- analysis,
- lowering,
- and then downstream execution.
That is visible in the command-line layer itself. The frontend exposes -rose:openmp:parse_only, -rose:openmp:ast_only, and -rose:openmp:lowering as distinct actions, not just as internal implementation details. That means the compiler already thinks in stages.
The tests should do the same thing.
If you skip that discipline, you get the worst possible debugging loop:
- run a large benchmark,
- observe that “something is wrong,”
- search manually through parsing, AST construction, analysis, lowering, runtime glue, and runtime behavior all at once.
That is not a serious way to maintain a compiler.
A layered test system gives you something much better:
- parser failures are caught where parser failures belong,
- AST-shape regressions are caught before lowering starts,
- lowering invariants are checked without needing a whole application,
- semantic equivalence is tested without GPU noise,
- and full benchmarks stay focused on what only they can reveal: real execution correctness and performance.
Figure 2. Each pipeline stage has a corresponding test layer. The point is not redundancy for its own sake. The point is to stop failures near the boundary where they originate.
Layer 1: Parser Tests For ompparser
At the bottom of the stack is the standalone OpenMP parser under src/frontend/SageIII/ompparser/tests.
This layer asks the narrowest possible question:
did the directive language parse into a correct OpenMPIR object at all?
That sounds small, but it is essential, because every later layer assumes the answer is yes.
What The Parser Suite Actually Covers
The parser tests are not just one binary and a couple of text files. The CMake setup builds multiple test executables:
- tester
- omp_roundtrip
- test_locations
Those names matter because they reveal three different contracts.
- Basic parse success: the parser has to accept real directives and build OpenMPIR.
- Round-trip stability: a directive parsed into OpenMPIR should be able to regenerate its pragma spelling in a stable way.
- Location fidelity: line and column tracking on directives and clauses must remain correct.
This is already better than the common “did the parser crash?” test philosophy.
Why Round-Trip Tests Matter
omp_roundtrip.cpp preprocesses input files into individual pragmas, calls parseOpenMP(...), then emits each resulting directive with generatePragmaString().
The shell harness test_single_pragma.sh uses that behavior as an invariant:
- extract original directives from the input file,
- parse and round-trip them,
- normalize formatting where needed,
- compare original and round-tripped results.
That is exactly the right contract for a standalone directive parser. If the round-trip drifts unexpectedly, the parser may still “parse,” but the structure it built is probably no longer faithful.
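The normalize-and-compare step of that contract can be sketched in a few lines of shell. The normalization rules below (collapse whitespace runs, trim the ends) are illustrative, not the exact rules test_single_pragma.sh applies, and the pragma strings are made-up examples:

```shell
#!/bin/sh
# Minimal sketch of a round-trip comparison: normalize both the original
# directive and the regenerated one, then compare. The normalization here
# is a simplification of what a real harness would do.

normalize() {
  # Squeeze runs of whitespace into single spaces and trim both ends.
  printf '%s' "$1" | tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//'
}

original='#pragma omp parallel  for   schedule(static, 4)'
roundtrip='#pragma omp parallel for schedule(static, 4)'

if [ "$(normalize "$original")" = "$(normalize "$roundtrip")" ]; then
  echo "round-trip stable"
else
  echo "round-trip drifted" >&2
  exit 1
fi
```

The point of the normalization step is to be tolerant of formatting-only differences while still failing loudly when the regenerated directive changes in substance.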
Why Location Tests Matter
test_locations.cpp does something that many parser suites neglect: it checks line and column positions on directives and clauses, and it also checks details such as the first expression in a map(...) clause or the number of dist_data policy items.
This is a strong signal about the intended quality bar. REX is not treating OpenMP parsing as token recognition only. It cares about source fidelity, clause order, and payload structure because later stages depend on those details.
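To make the idea concrete: test_locations.cpp asserts the positions the parser itself reports, but the kind of line/column fact being checked can be illustrated with plain text tools over a made-up input file (the file contents and names below are hypothetical):

```shell
#!/bin/sh
# Illustration only: derive a 1-based "line:column" position for each
# OpenMP directive in a source file. The real test checks the positions
# recorded on the parsed OpenMPIR nodes, not grep output.

cat > demo.c <<'EOF'
int main(void) {
  #pragma omp target map(to: a[0:n])
  for (int i = 0; i < n; i++) { }
  return 0;
}
EOF

# For every directive line, print "line:column" of the '#'.
grep -n '#pragma omp' demo.c | awk -F: '{
  col = index($2, "#");
  print $1 ":" col;
}'
rm -f demo.c
```

A location regression shows up here as a shifted line or column, which is exactly the kind of drift later stages (diagnostics, unparsing) would silently inherit.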
Large Corpora, Not Only Tiny Handwritten Cases
The parser CMake also registers corpora extracted from:
- built-in test files,
- OpenMP Validation and Verification material,
- OpenMP examples.
That matters because tiny handwritten parser inputs are excellent for localizing bugs, but they are not enough to keep a parser honest as the directive space grows.
The parser layer therefore gives REX both:
- targeted unit-like checks,
- and broad corpus coverage.
Layer 2: Frontend Compile And Analyze Tests
The next layer lives under tests/nonsmoke/functional/CompileTests/OpenMP_tests.
Historically, the top-level README in that directory still carries the old story about the early parser work. The current reality of the suite is more interesting and much larger. The real structure is in the CMake and sub-suite layout:
- a broad C and C++ OpenMP corpus,
- Fortran coverage,
- OpenMP + OpenACC combined cases,
- AST-output reference checks,
- focused analyzer checks,
- and dedicated subdirectories for lowering-specific tests.
This layer asks a broader question than the parser layer:
can the frontend parse, construct, and analyze real OpenMP programs without losing structural correctness?
The Broad Corpus Is Intentional
The CMake enumerates a large list of test programs:
- loop constructs,
- reductions,
- tasks,
- teams and target constructs,
- metadirectives,
- requires and declare_mapper,
- combined OpenMP/OpenACC cases,
- and more.
That breadth matters because frontend failures often show up not in exotic corner cases, but in interactions:
- a directive plus a particular clause,
- a construct inside an if,
- a Fortran variant with line continuation,
- a combined construct that exercises both directive parsing and AST attachment.
This layer is where that combinatorial reality gets exercised.
AST-Only Output Diffs Are Still Useful Here
The compile tests are run with -rose:openmp:ast_only so the compiler stops after building SgOmp* nodes. Then helper tests grep the resulting rose_*.c files for OpenMP directives and diff them against reference outputs.
That is a very pragmatic choice for frontend regression coverage:
- it is later than pure parser tests,
- but earlier than lowering,
- and it lets the suite verify that the compiler still emits recognizable directive structure after AST construction.
This is also where cases like axpy_ompacc_parseonly.c are useful. They validate earlier checkpoints of the OpenMP path without requiring the whole lowering/runtime story to be involved.
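The grep-and-diff contract is simple enough to sketch directly. The file names and contents below are mocked stand-ins; the real suite greps the rose_*.c outputs produced with -rose:openmp:ast_only and diffs them against checked-in references:

```shell
#!/bin/sh
# Sketch of a directive-extraction diff: pull the OpenMP directive lines
# out of a generated file and compare them to a reference. Both files
# here are fabricated for illustration.

cat > rose_demo.c <<'EOF'
#pragma omp parallel for
for (i = 0; i < n; i++) y[i] = a * x[i] + y[i];
EOF

cat > demo.reference <<'EOF'
#pragma omp parallel for
EOF

grep '#pragma omp' rose_demo.c > demo.extracted
if diff -u demo.reference demo.extracted; then
  echo "directive structure unchanged"
fi
rm -f rose_demo.c demo.reference demo.extracted
```

Because only the directive lines are compared, harmless changes elsewhere in the unparsed output do not trip the test, but a dropped or rewritten directive does.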
Focused Analyzer Regressions
The same CMake also wires focused analyzer tests such as:
- default schedule handling,
- dynamic schedule handling,
- implicit target-map behavior.
That is a good sign of maturity. Analyzer behavior is neither pure parsing nor full lowering. It deserves its own checks instead of being treated as “whatever the lowering pass happens to notice later.”
Layer 3: Invariant-Based Lowering Structural Tests
The most instructive modern layer in the suite is tests/nonsmoke/functional/CompileTests/OpenMP_tests/lowering_rodinia.
Its README states the design very clearly:
- validate lowering-specific behavior,
- use reduced Rodinia-like inputs,
- avoid dependence on fragile legacy output reference files,
- and check invariants rather than unstable identifiers or formatting.
This is exactly the right way to test a lowerer.
Figure 3. Lowering tests should not depend on every generated identifier or formatting detail. They should check the stable facts that define correct lowering behavior.
Why Golden Files Are Not Enough
Lowered source changes a lot for reasons that are not semantic regressions:
- symbol hashes can change,
- helper names can move,
- formatting can change,
- declarations can be reordered harmlessly.
If a lowering suite depends on full-file exact matches, developers quickly stop trusting the failures.
REX’s lowering_rodinia suite instead runs the translator and then checks meaningful facts in the generated output tree. The harness is explicit about this split: one script drives the translator, and a second checks the output.
The second script is where the real value is. It checks things like:
- how many kernels were emitted,
- whether the host file includes rex_kmp.h exactly once,
- whether rex_offload_init() appears exactly once,
- whether rex_offload_fini() is absent from host output,
- whether __tgt_target_kernel(...) is used,
- whether the right number of offload entries and kernel IDs exist,
- whether repeated host calls to the same lowered helper remain present,
- whether comment relocation and inactive conditional bodies are preserved.
That is a much healthier lowering contract than “the file still looks character-for-character the same.”
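Checks of that kind reduce to counting occurrences rather than matching whole files. The sketch below uses the invariant names from the list above (rex_kmp.h, rex_offload_init, rex_offload_fini), but the sample host file it inspects is fabricated for illustration:

```shell
#!/bin/sh
# Sketch of invariant-style checks over a lowered host file: count stable
# facts instead of diffing the whole file. The file contents are a mock.

cat > host_lowered.c <<'EOF'
#include "rex_kmp.h"
int main(void) {
  rex_offload_init();
  /* ... kernel launches ... */
  return 0;
}
EOF

fail=0
# The runtime header must be included exactly once.
[ "$(grep -c '#include "rex_kmp.h"' host_lowered.c)" -eq 1 ] || fail=1
# Offload init must appear exactly once in the host file.
[ "$(grep -c 'rex_offload_init()' host_lowered.c)" -eq 1 ] || fail=1
# Offload fini must not appear in host output at all.
! grep -q 'rex_offload_fini()' host_lowered.c || fail=1

rm -f host_lowered.c
[ "$fail" -eq 0 ] && echo "lowering invariants hold"
```

A renamed helper hash or reordered declaration leaves every one of these counts unchanged, which is precisely why the suite survives churn that would shred a golden-file diff.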
Why Rodinia-Like Cases Work So Well
The inputs are reduced, benchmark-shaped kernels:
- rodinia_axpy_multi_like
- rodinia_nn_like
- rodinia_srad_comments_like
- rodinia_btree_kernel_like
- and others
This is a sweet spot for compiler testing:
- small enough to inspect,
- realistic enough to exercise multi-kernel and offloading patterns,
- stable enough that the test intent stays obvious.
That design directly reflects real regressions observed during the LLVM 22 migration work. The suite is synthetic not for its own sake, but to isolate real failure modes cleanly.
Layer 4: CPU Equivalence For Lowering Semantics
The lowering_cpu suite is one of the smartest layers in the whole stack because it asks a very practical question:
if REX lowers OpenMP to explicit code, does that lowered code still behave like the original OpenMP program when both run against the same CPU OpenMP runtime?
This is a better semantic test than jumping straight to GPU runs.
The README states the model clearly:
- compile and execute the original OpenMP source with LLVM’s CPU runtime,
- lower the source with REX,
- compile and execute the lowered source with the same runtime,
- compare behavior.
Why This Layer Is So Valuable
GPU offloading adds a lot of noise:
- data-mapping behavior,
- runtime-plugin behavior,
- launch configuration,
- numerical differences,
- performance artifacts.
If the lowered source is semantically wrong on CPU already, none of that GPU detail is helpful.
So lowering_cpu strips the question back down to semantics.
The harness is careful:
- it stages only omp.h from LLVM into a local include directory, so the active compiler still resolves its normal standard headers,
- it uses Clang with -fopenmp=libiomp5,
- it runs both binaries repeatedly,
- it tests multiple OMP_NUM_THREADS settings,
- and it supports both exact and sort comparison modes.
That last detail is especially good. Some outputs are expected to differ only in harmless line interleaving, so the suite preserves the first line and sorts the rest when needed. That means the suite is strict where it should be strict and tolerant where exact textual order is not actually the semantic contract.
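The sort mode can be sketched in a few lines; the file names and output contents below are illustrative, not taken from the actual harness:

```shell
#!/bin/sh
# Sketch of the "sort" comparison mode: keep the first line fixed and sort
# the remaining lines, so harmless thread interleaving differences do not
# fail the comparison while a changed result still does.

cat > run_original.txt <<'EOF'
result: 42
thread 0 done
thread 1 done
EOF

cat > run_lowered.txt <<'EOF'
result: 42
thread 1 done
thread 0 done
EOF

canonical() {
  # First line stays in place; the rest are sorted.
  { head -n 1 "$1"; tail -n +2 "$1" | sort; }
}

if [ "$(canonical run_original.txt)" = "$(canonical run_lowered.txt)" ]; then
  echo "outputs equivalent under sort mode"
fi
rm -f run_original.txt run_lowered.txt
```

Note that a wrong `result:` line still fails under this mode, because the first line is never reordered; only the interleaving-sensitive tail is normalized.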
Repetition Matters
The harness runs each binary multiple times for each thread count. That is not overkill. It reduces the chance that a schedule-sensitive bug or a lucky interleaving slips through as a false negative.
This is the kind of careful engineering that separates “we have a test” from “the test is worth trusting.”
Layer 5: Full Benchmark And Performance Validation
The top layer is the one people are often most tempted to start with: real applications, real offloading, real outputs, real performance.
This layer is necessary. It is just not sufficient on its own.
In practice, this means benchmark suites outside the core compiler tree validate things the internal suites cannot:
- GPU output correctness,
- output matching between REX and native LLVM offloading,
- runtime/helper integration in realistic applications,
- and end-to-end performance.
This is where questions like these get answered:
- does the benchmark still run at all?
- does REX’s GPU result match the native LLVM result?
- did a helper or launch-policy change make a benchmark substantially slower?
These are real compiler questions. They just do not belong at the bottom of the test stack.
Why The Benchmark Layer Must Be Last
Benchmarks are slower, noisier, and harder to debug than the layers below them.
That is precisely why they should be last:
- parser tests localize directive grammar problems fast,
- frontend tests localize AST or analysis problems,
- lowering structural tests localize code-shape regressions,
- CPU equivalence localizes semantic drift,
- then benchmark runs answer the remaining real-world questions.
This ordering saves engineering time because it stops failures near the boundary that produced them.
The Real Shape Of The Test Pyramid
If you compress the whole strategy down, REX’s OpenMP test system looks like this:
1. Parser tests: can the directive language parse into correct OpenMPIR, round-trip, and preserve locations?
2. Frontend compile/analyze tests: can real OpenMP programs survive parsing, AST construction, and targeted analyzer checks?
3. Lowering structural tests: does the lowerer emit the right stable shapes and helper/runtime invariants?
4. CPU equivalence tests: does lowered code still behave like the original OpenMP program on the same CPU runtime?
5. Benchmark and performance validation: does the full system still run correctly and competitively in real applications?
That is not duplication. It is a division of responsibility.
Why This Matters For Anyone Touching The Compiler
The biggest benefit of a layered suite is psychological as much as technical: it makes the compiler safe to change.
When someone edits:
- the parser,
- Stage 1 directive collection,
- Stage 2 AST construction,
- analyzer logic,
- lowering,
- helper generation,
they should not have to wait for a giant benchmark campaign to learn whether they broke a basic invariant.
The test stack gives developers a faster and much more intelligible path:
- fail early,
- fail near the boundary that changed,
- and make the benchmark layer confirm the final result instead of serving as the first detector.
That is how a research-oriented source-to-source compiler stays maintainable as it grows into a more production-grade toolchain.
The Design In One Sentence
The REX OpenMP test system works because it mirrors the compiler architecture.
That layering is what keeps the OpenMP pipeline understandable enough to evolve without fear:
- Parser contracts are tested at the parser.
- AST and analyzer contracts are tested in the frontend corpus.
- Lowering contracts are tested with invariant-based structural checks.
- Semantics are tested with CPU equivalence.
- Real applications are used last for end-to-end correctness and performance.