How OpenMP Works in REX: From Pragmas to GPU Kernels

REX does not hand OpenMP parsing to Clang. It keeps directives alive inside the ROSE/REX pipeline, parses them with ompparser, converts them into SgOmp* nodes, lowers them into ordinary source files plus helper runtime glue, and validates the whole stack with parser, lowering, equivalence, and benchmark tests.

This post starts a series about the REX OpenMP offloading work. The goal of the series is not to dump a changelog. The goal is to make the architecture understandable enough that a new contributor can answer three practical questions quickly:

  1. Where does an OpenMP directive first become a structured object in REX?
  2. Who owns the transformation from source-level directive to runnable GPU code?
  3. How do we know a change is still correct after we touch parsing, AST construction, lowering, or runtime glue?

The short answer is: REX owns the OpenMP pipeline end to end.

That design choice is the key to everything else in this post. REX does use Clang and LLVM in the toolchain, but it does not let Clang become the source of truth for OpenMP semantics. Instead, REX preserves directive text in its own frontend, parses OpenMP with its own parser, constructs SgOmp* nodes in its own AST, lowers those nodes in its own midend, emits ordinary source files and helper files, and only then hands the result to the downstream compiler and runtime stack.

That sounds like extra work, and it is. But it buys something important. A source-to-source compiler can only stay in control if it owns the structure it wants to transform. If the compiler only sees OpenMP after some other frontend has already swallowed it into another AST model, then source-preserving transformation, language-unified handling, custom lowering, and stable debugging all become much harder.

[Figure: Overview of the REX OpenMP pipeline from source pragmas to GPU execution.]

Figure 1. The REX OpenMP pipeline is staged on purpose: keep directive text alive, parse it into OpenMPIR, convert it into SgOmp*, lower it, emit regular source files, then let the normal compiler and runtime do their job.

Why REX does not let Clang create the OpenMP AST

The most important architectural decision is the one that often surprises people first: in REX, OpenMP is not modeled as “whatever Clang says it is.”

That is not because Clang’s OpenMP support is bad. It is because REX is solving a different problem.

REX is a source-to-source compiler built on top of the ROSE/Sage AST world. Its job is not only to compile OpenMP programs. Its job is to analyze, rewrite, lower, and unparse programs while preserving enough source-level structure that transformations remain inspectable and debuggable. In that environment, OpenMP has to become a first-class citizen of the same AST universe as the rest of the program.

That requirement leads directly to the REX route:

  • the generic frontend still builds the regular Sage AST for the language itself;
  • OpenMP directives remain visible as pragma-like syntax attached to that AST;
  • ompparser parses only the OpenMP language and builds a dedicated intermediate representation, OpenMPIR;
  • ompAstConstruction.cpp converts that OpenMPIR into SgOmp* nodes inside the Sage AST;
  • the midend lowering works entirely on REX-owned nodes rather than on a foreign frontend’s OpenMP model.

There are several concrete benefits to this design.

First, it gives REX one OpenMP pipeline for both C/C++ and Fortran. The standalone ompparser under src/frontend/SageIII/ompparser explicitly supports both language families and exposes a direct API:

extern OpenMPDirective *parseOpenMP(const char *,
                                    OpenMPExprParseCallback,
                                    void *);

That matters because REX is not only a C/C++ translator. The same frontend pipeline needs to cope with Fortran OpenMP comments, C/C++ pragmas, combined directives, paired begin/end directives, and later source-to-source rewriting.

Second, it keeps directive ownership inside REX. The frontend can preserve original spelling, source locations, and directive text even when preprocessing changes whitespace or macro-expanded forms. In processOpenMP(), REX deliberately prefers raw source directive text when it can recover it, instead of blindly trusting the preprocessed pragma string.

Third, it keeps OpenMP clause expressions connected to the surrounding Sage context. ompparser parses the directive language, but REX still parses or reconstructs the embedded expressions with knowledge of the current AST and scope. That is the right split of responsibilities for a source-to-source compiler: the directive grammar is handled by a dedicated parser, but name lookup and AST insertion still happen in the host compiler’s own world.

Fourth, it makes debugging simpler. REX can stop in parse-only mode, stop in AST-only mode, or continue into lowering. Those checkpoints are extremely useful for distinguishing “the directive parsed wrong” from “the AST conversion attached it to the wrong statement” from “the lowerer generated the wrong runtime call.”

[Figure: Comparison of a Clang-centric route and the REX-owned OpenMP route.]

Figure 2. Clang still exists in the toolchain, but REX does not outsource OpenMP ownership to it. The OpenMP model lives inside REX so transformation and unparsing stay under REX’s control.

Stage 1: carrying pragmas until REX is ready to parse them

The entry point for the frontend side is OmpSupport::processOpenMP() in src/frontend/SageIII/ompAstConstruction.cpp.

Its structure says a lot about the design:

void processOpenMP(SgSourceFile *sageFilePtr) {
  ...
  if (wantsOpenMP) {
    setNormalizeClauses(false);
  }
  ...
  // Stage 1: parse OpenMP directives using ompparser
  ...
  OpenMPIRToSageAST(sageFilePtr);
  ...
}

Two details are worth calling out immediately.

The first is setNormalizeClauses(false). REX does not want the parser to “simplify away” the original directive shape too early. That is a recurring theme in source-to-source compilers: normalization is useful, but it can also destroy distinctions that the lowerer, unparser, or debugger still cares about. Here the parser is asked to preserve original directive and clause structure rather than merging everything into a canonical form too early.

The second is that processOpenMP() explicitly describes a staged pipeline in comments and code. For C/C++, it finds SgPragmaDeclaration nodes and parses only those that start with omp. For Fortran, it handles directive comments, optional paired begin/end forms, and a temporary conversion to pragma-like nodes so the later AST construction path can reuse the same machinery.

This is where the “carry the pragma” part of the architecture matters. At this point, REX has a normal Sage AST for the program and a collection of OpenMP directive spellings that are still explicit objects. It has not yet committed to the final SgOmp* form. That delay is intentional.

The parser used at this stage is the standalone ompparser. It lives under src/frontend/SageIII/ompparser, uses Flex/Bison grammar files (omplexer.ll and ompparser.yy), and builds OpenMPIR nodes such as OpenMPDirective and OpenMPClause. It also supports round-trip unparsing and DOT output, which is a strong hint that it is treated as a proper compiler component, not just a temporary or ad-hoc parsing solution.

The parser tests reflect that design. The ompparser test driver omp_roundtrip.cpp preprocesses source files, parses directives into OpenMPDirective*, then prints them back with generatePragmaString(). The shell harness test_single_pragma.sh treats “parse and round-trip the same directive spelling” as a regression contract. That is exactly the kind of contract you want for a standalone directive parser.

Stage 2: converting OpenMPIR into SgOmp*

Once OpenMPIR exists, REX still has one more crucial translation step before lowering can begin: OpenMPIRToSageAST().

This function is the bridge between the standalone OpenMP parser and the main compiler IR. It takes the collected pairs of:

  • SgPragmaDeclaration *
  • OpenMPDirective *

and converts them into the corresponding SgOmp* statements and clauses.

The implementation does the conversion bottom-up:

//! Convert omp_pragma_list to SgOmpxxx nodes
void OpenMPIRToSageAST(SgSourceFile *sageFilePtr) {
  ...
  for (iter = omp_pragma_list.rbegin(); iter != omp_pragma_list.rend(); iter++) {
    ...
    convertDirective(std::make_pair(decl, omp_it->second));
    ...
  }
}

That reverse walk is not an accident. Nested directives are easier to attach correctly when inner structures are converted before outer ones try to claim bodies and clauses.

The conversion code in ompAstConstruction.h and ompAstConstruction.cpp is deliberately fine-grained. There is not just one giant “build OpenMP node” function. There are families of helpers:

  • convertDirective(...)
  • convertBodyDirective(...)
  • convertCombinedBodyDirective(...)
  • clause builders such as convertClause(...), convertDependClause(...), convertExpressionClause(...)
  • expression helpers such as parseOmpExpression(...) and parseOmpArraySection(...)

This is another place where the “REX owns OpenMP” decision pays off. REX is not merely importing a finished AST from some external frontend. It is constructing Sage nodes in a way that matches Sage scopes, Sage expressions, Sage symbol tables, and Sage source positions.

For Fortran the story is even more interesting. REX uses the same general AST construction path, but it first normalizes Fortran directive comments into fake pragma declarations so the C/C++-style conversion path can be reused. The comments in convert_Fortran_OMP_Comments_to_Pragmas() say this openly: the point is to temporarily introduce a C/C++-like AST shape because it is much easier to work with than floating directive comments.

That is a pragmatic compiler-engineering move. Instead of writing two unrelated lowering pipelines, REX builds one internal representation that can serve both.

There is also a subtle but important source-preservation aspect here. The backend unparser already knows how to emit OpenMP AST nodes. Once the frontend has converted pragma text into SgOmp*, the compiler can analyze or rewrite those nodes and still unparse a readable directive-based program later. That is exactly what you want in a research and transformation compiler.

The parse-only and AST-only checkpoints fit naturally here:

  • parse-only: test whether directives are recognized and converted into OpenMPIR;
  • AST-only: test whether SgOmp* construction is correct before any lowering or runtime-specific transformation begins.

That separation is one reason the frontend pipeline remains debuggable even as the lowering gets more aggressive.

Stage 3: lowering from OpenMP AST to runnable offload code

Once the OpenMP AST exists, the midend takes over. The center of gravity here is src/midend/programTransformation/ompLowering/omp_lowering.cpp.

This is where the source-level OpenMP world stops being a declarative directive tree and becomes executable host/device code.

For GPU offloading, the modern lowering path does a few high-value things:

  1. it outlines the target region or loop body into a callable kernel body;
  2. it computes mapping information for target arguments;
  3. it builds the OpenMP offloading runtime argument arrays;
  4. it generates the runtime launch call;
  5. it emits helper code alongside the transformed host source so the result can be compiled with ordinary tools.

The code that builds the kernel launch is explicit. In the current lowering, REX creates a __tgt_kernel_arguments object and then emits a call to __tgt_target_kernel:

SgVariableDeclaration *kernel_args_decl = buildTargetKernelArgsDeclaration(
    g_scope, p_scope, arg_number_decl, args_base_decl, args_decl, arg_sizes,
    arg_types, num_blocks_decl, threads_per_block_decl,
    tripcount_decl != NULL ? buildVarRefExp(tripcount_decl) : NULL);

...

string func_offloading_name = "__tgt_target_kernel";
SgExprStatement *func_offloading_stmt = buildFunctionCallStmt(
    func_offloading_name, buildIntType(), parameters, p_scope);

That is an important point in the architecture. REX is not trying to invent an entirely separate execution stack. It lowers into code that speaks the LLVM offloading runtime’s language. The ownership boundary is: REX owns the transformation, but it interoperates with the existing offload runtime ABI.

The lowering also carries source-derived launch information into the generated code. One example is the tripcount-aware thread capping logic that creates __rex_tripcount, adjusts threads_per_block, and avoids obviously oversized launches when the source does not explicitly force a bad configuration. That logic lives in the lowerer because this is the stage where source structure and runtime launch policy finally meet.

A simplified schematic of the host-side shape looks like this:

rex_offload_init();

int64_t rex_device_id = -1;
int rex_threads_per_block = thread_limit_from_directive_or_default;
int rex_num_blocks = num_teams_from_directive_or_default;
int64_t rex_tripcount = n;

void *rex_args_base[] = { ... };
void *rex_args[] = { ... };
int64_t rex_arg_sizes[] = { ... };
int64_t rex_arg_types[] = { ... };
struct __tgt_kernel_arguments rex_kernel_args = { ... };

__tgt_target_kernel(rex_device_id,
                    rex_num_blocks,
                    rex_threads_per_block,
                    rex_host_ptr,
                    &rex_kernel_args);

The device side is also compiler-generated source, typically in a sibling rex_lib_*.cu file. Conceptually, the kernel body is the flattened form of the original OpenMP loop nest or target region, not a hand-written CUDA library. That matters because it keeps REX generic: the helper runtime files are shared, but the actual kernel logic is still generated from the user’s program.

In other words, REX does not “call a GPU library that happens to implement OpenMP.” It generates GPU-facing code for the user’s loop body and then supplies the runtime glue needed to launch it correctly.

That source-to-source lowering model is why the helper files are part of the design rather than an afterthought. Under src/midend/programTransformation/ompLowering/ you can see the pieces REX emits or copies into lowered applications:

  • register_cubin.cpp
  • rex_kmp.h
  • rex_nvidia.cu
  • rex_nvidia.h
  • xomp_cuda_lib.cu
  • xomp_cuda_lib_inlined.cu
  • libxomp.h

These files are the contract boundary between REX-generated host code, REX-generated device code, and the external offloading runtime.

Stage 4: build and runtime flow after lowering

Because REX is a source-to-source compiler, the result of lowering is not “machine code dropped directly into the final executable.” The result is a set of regular source files that go through an ordinary build.

That is why benchmark directories produced by the workflow contain files with names like:

  • rose_*.c
  • rex_lib_*.cu
  • register_cubin.cpp

The build step then compiles the device file to NVPTX/CUBIN, links the host program, and uses register_cubin.cpp to register the device image with libomptarget before the first launch.

This is one of the cleanest parts of the design. The compiler proper stays focused on transformation. The downstream compiler and runtime still do normal compiler and runtime jobs:

  • Clang compiles generated host and device sources;
  • the offload runtime handles image registration and target launch;
  • the executable remains a normal binary rather than a special-purpose research runtime environment.

That separation also makes debugging more practical. If something goes wrong, you can inspect the generated rose_*.c and rex_lib_*.cu files directly. You can compile them manually. You can swap in a new LLVM runtime. You can diff outputs across toolchain versions. That is exactly the kind of workflow this architecture supports, and it is especially valuable during LLVM toolchain migrations, where generated files, runtime behavior, and launch glue all need to be inspected separately.

For a tiny example, the reduced Rodinia-style lowering tests use the same public translator path the real benchmarks use:

parseOmp --rex-omp-lowering -w -rose:verbose 0 -c input.c

The test harness does not call private compiler internals. It runs the translator, checks the generated files, and treats the result like a real source-to-source tool output.

Stage 5: the test system that keeps the whole pipeline honest

A pipeline like this only stays healthy if the tests are layered the same way the architecture is layered.

REX’s OpenMP test strategy is strongest when viewed as a stack.

[Figure: Layered test stack for the REX OpenMP pipeline.]

Figure 3. The test strategy mirrors the architecture. Parser bugs, AST-construction bugs, lowering-shape bugs, semantic bugs, and full benchmark regressions are caught at different layers rather than by one giant end-to-end test.

1. Parser-level tests

The standalone ompparser has its own tests for:

  • single-pragma parsing;
  • round-trip parse/unparse;
  • location handling;
  • larger corpora such as OpenMP examples and OpenMP Validation and Verification inputs.

This layer answers a narrow question: did the directive language parse into the expected OpenMPIR at all?

2. Frontend compile tests

Under tests/nonsmoke/functional/CompileTests/OpenMP_tests, there is a broad corpus of OpenMP specimens. This layer ensures that the frontend and AST construction path can ingest real directives, combined constructs, clauses, and language variants without failing.

This is where cases like axpy_ompacc_parseonly.c are useful: they let REX validate earlier checkpoints of the OpenMP pipeline without needing to reach full code generation.

3. Lowering structural tests

The newer lowering_rodinia suite is especially important for offloading work. Its README states the intent clearly: these tests validate lowering-specific behavior with reduced Rodinia-like inputs and invariant checks, not brittle golden files.

That is the right design for a lowerer. Symbol hashes, helper ordering, and formatting may change, but invariants such as:

  • how many kernels are created,
  • whether repeated helper calls still exist,
  • whether rex_offload_init() appears before declarations used by timing instrumentation,
  • whether the right launch API is emitted,
  • whether include/preamble structure stays valid,

are stable and meaningful.

The harness literally runs the public translator and then verifies the generated output tree:

"${parse_omp}" --rex-omp-lowering -w -rose:verbose 0 -c "${input_file}"
bash verify_outputs.sh "${case_name}" "${workdir}"

4. CPU equivalence tests

The lowering_cpu suite compares:

  1. the original OpenMP source compiled with LLVM’s OpenMP runtime,
  2. the REX-lowered source compiled with the same runtime.

That is an excellent sanity layer because it removes GPU noise from the question. If the lowered source is semantically incorrect, the mismatch is caught here before offloading details can complicate the issue. The harness runs both binaries repeatedly with multiple OMP_NUM_THREADS settings and compares stdout/stderr in either exact or sorted mode depending on whether interleaving is expected.

5. Full benchmark and performance validation

Finally, outside the core compiler tree, the full benchmark suites validate what the earlier layers cannot: real end-to-end behavior, offloading correctness, output matching, and performance against native LLVM offloading.

This layer is slower and noisier, which is exactly why it should not be the only test layer. But it is still essential, because parser and AST tests cannot tell you whether a new runtime helper or launch-policy change just made a benchmark 3x slower.

The design in one sentence

If I had to summarize the REX OpenMP architecture in one sentence, it would be this:

REX treats OpenMP as compiler-owned structure, not as frontend residue.

That decision explains why pragmas are preserved long enough to be parsed deliberately, why ompparser exists as a standalone component, why OpenMPIRToSageAST() is a distinct bridge, why lowering emits ordinary host/device source plus helpers, and why the test system is layered around those exact boundaries.

It also explains why the system is debuggable. You can inspect the directive text, the OpenMPIR, the SgOmp* nodes, the lowered host code, the generated device code, and the benchmark output as separate artifacts. In a source-to-source compiler, that transparency is not a luxury. It is the difference between a pipeline you can evolve and a pipeline you are afraid to touch.

The next posts in this series will zoom in on the interesting parts of that pipeline: the GPU offloading lowerer, the helper runtime boundary, LLVM version migrations, and the performance work that brought the full benchmark suite to correctness and competitive performance.