How `ompparser` Turns OpenMP Pragmas Into `OpenMPIR` In REX
ompparser is the middle frontend stage that turns preserved OpenMP directive text into a structured, location-aware intermediate representation called OpenMPIR. It is a standalone Flex/Bison parser with its own lexer, grammar, and IR classes. The parser owns OpenMP directive syntax, combined-construct spelling, clause structure, and source locations, but it deliberately does not try to become the host-language expression parser. Instead, clause payloads are stored as normalized strings and may also be passed through a callback that returns host-compiler-specific expression nodes. That split is what lets REX keep one reusable OpenMP parser while still feeding OpenMPIRToSageAST() with the right information for later AST construction and lowering.
The previous posts in this series covered two frontend boundaries around this stage:
- how REX preserves OpenMP pragmas instead of letting Clang swallow them,
- and how Stage 2 converts OpenMPIR into SgOmp* nodes in the Sage AST.
What was still missing was the parser itself.
That missing step matters because it is where directive text stops being “a pragma string we carried forward” and becomes a compiler data structure that later stages can reason about.
In REX, that stage is handled by src/frontend/SageIII/ompparser.
The design is very deliberate:
- ompparser is a standalone OpenMP parser,
- it uses Flex and Bison,
- it supports both C/C++ and Fortran directive syntax,
- it builds a reusable intermediate representation called OpenMPIR,
- and it can be used either inside REX or on its own.
This post is about that parser boundary: what it parses, what it builds, what it refuses to do, and how the tests prove it is working before AST construction begins.
Figure 1. ompparser is the frontend middle layer. It does not build Sage AST directly. It turns directive text into OpenMPIR, which the next stage then translates into SgOmp* nodes.
Why REX Uses A Standalone OpenMP Parser
At first glance, writing a standalone OpenMP parser looks redundant. Clang already parses OpenMP. So why does REX insist on owning this stage?
The answer is the same reason REX owns the rest of the OpenMP pipeline: it is a source-to-source compiler, not merely a frontend wrapper around Clang.
REX needs:
- one OpenMP representation it controls,
- one parser path that works across C/C++ and Fortran,
- a parse-only checkpoint before AST construction,
- round-trip and source-location validation on directives as directives,
- and a way to keep directive grammar separate from host-language expression parsing.
That last point is the most important one.
An OpenMP directive contains two very different kinds of structure:
OpenMP grammar
- directive names,
- combined constructs,
- clause kinds,
- clause modifiers,
- begin/end forms,
- and directive-specific syntax rules.
embedded host-language payloads
- if(cond),
- num_threads(expr),
- map(tofrom: a[0:n]),
- depend(inout: a[i]),
- and so on.
ompparser is designed to own the first category completely while only partially owning the second. That is a very strong architectural choice, because it keeps the OpenMP grammar unified and reusable without forcing the parser to become a full C/C++/Fortran expression parser.
The result is a stage boundary that is narrow, testable, and reusable.
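To make the two kinds of structure concrete, here is a minimal sketch of where the dividing line falls for a single clause. It is not ompparser code: splitClause and SplitClause are hypothetical names invented for this illustration, and the real parser does this work through its lexer and grammar rather than a hand-written scan. The sketch only shows the idea that the clause keyword belongs to the OpenMP side, while the text inside the outer parentheses can be carried forward as a string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration; this is not ompparser's API.
struct SplitClause {
    std::string name;     // OpenMP side: the clause keyword, e.g. "num_threads"
    std::string payload;  // text inside the outer parentheses, kept as a string
};

// Splits "name(payload)" at the outermost parentheses. Returns false when the
// parentheses are unbalanced or text follows the closing parenthesis.
bool splitClause(const std::string &text, SplitClause &out) {
    const std::size_t open = text.find('(');
    if (open == std::string::npos) {
        out.name = text;  // clause without a payload, e.g. "nowait"
        out.payload.clear();
        return true;
    }
    int depth = 0;
    for (std::size_t i = open; i < text.size(); ++i) {
        if (text[i] == '(') {
            ++depth;
        } else if (text[i] == ')' && --depth == 0) {
            if (i + 1 != text.size()) return false;  // trailing text
            out.name = text.substr(0, open);
            out.payload = text.substr(open + 1, i - open - 1);
            return true;
        }
    }
    return false;  // ran out of input before the outer ')' closed
}
```

For num_threads(f(n) + 1), the sketch yields the name num_threads and the payload f(n) + 1, nested call included. A real parser would go one step further for clauses like map and peel off OpenMP modifiers such as tofrom: before treating the remainder as host-language text.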
The Parser As A Real Library, Not A Hidden Internal
The parser is packaged as a proper library, not a private implementation detail buried inside the REX frontend.
Its CMake setup is explicit:
| |
That should shape how you think about it.
This is not “a few parsing routines inside the compiler.” It is a distinct parser subsystem with:
- its own public headers,
- its own IR classes,
- its own tests,
- and even a WebAssembly build for standalone use.
The README reflects that same intent. It describes ompparser as “a standalone and unified OpenMP parser” that builds an IR supporting normalization and round-trip unparsing.
That standalone quality matters inside REX too. It means Stage 1 can be reasoned about independently:
- feed directive text into parseOpenMP(...),
- get OpenMPIR back,
- then hand that IR to the Stage 2 Sage bridge.
That is a much cleaner boundary than “hope the whole frontend still works.”
The Entry Point: parseOpenMP(...)
The public entry point is small:
| |
That signature tells you almost everything that matters about the design.
It takes:
- raw directive text,
- an optional expression parse callback,
- and opaque user data passed back to that callback.
It returns a single OpenMPDirective *, which is the root of the parsed OpenMPIR tree for that directive.
The implementation in ompparser.yy makes the parser’s behavior clear:
| |
There are four important points here.
1. The parser owns the whole directive, not just a clause fragment
The result is an OpenMPDirective, not a loose bag of tokens. That means the parser always builds a construct-centered IR with directive kind, clauses, location data, and any directive-specific payloads.
2. Language matters, but only at the directive-syntax level
The parser works with OpenMPBaseLang values such as Lang_C, Lang_Cplusplus, and Lang_Fortran. It can auto-detect Fortran-style input from prefixes like !$omp, and it validates explicit language settings against the input shape.
That is exactly what a standalone OpenMP parser should do:
- understand OpenMP syntax in the context of the base language,
- but stay narrowly focused on directives instead of replacing the host-language compiler.
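A rough sketch of what sentinel-based detection and validation can look like follows. Everything in it is an assumption made for illustration: ToyLang, detectLang, and langMatchesInput are hypothetical names, the real parser works with OpenMPBaseLang values, and its actual detection rules may differ (for example in which prefixes it accepts).

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Toy stand-in for a base-language enum; the names are hypothetical.
enum class ToyLang { Unknown, C, Fortran };

// True if `text`, starting at `pos`, begins with `prefix` (ASCII case ignored).
static bool hasPrefixNoCase(const std::string &text, std::size_t pos,
                            const std::string &prefix) {
    if (text.size() - pos < prefix.size()) return false;
    for (std::size_t i = 0; i < prefix.size(); ++i) {
        const int a = std::tolower(static_cast<unsigned char>(text[pos + i]));
        const int b = std::tolower(static_cast<unsigned char>(prefix[i]));
        if (a != b) return false;
    }
    return true;
}

// Guesses the base language from the leading sentinel, skipping whitespace.
ToyLang detectLang(const std::string &text) {
    std::size_t pos = 0;
    while (pos < text.size() &&
           std::isspace(static_cast<unsigned char>(text[pos]))) {
        ++pos;
    }
    if (hasPrefixNoCase(text, pos, "!$omp")) return ToyLang::Fortran;
    if (hasPrefixNoCase(text, pos, "#pragma")) return ToyLang::C;
    return ToyLang::Unknown;  // no sentinel seen; the caller must decide
}

// Checks an explicit language setting against the shape of the input.
bool langMatchesInput(ToyLang requested, const std::string &text) {
    const ToyLang seen = detectLang(text);
    return seen == ToyLang::Unknown || seen == requested;
}
```

The point of the sketch is the narrowness of the check: it looks only at the directive's own prefix and never at the surrounding host-language code.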
3. The expression callback is optional and external
The callback is installed before parsing starts, not hard-wired into the parser implementation. That is how ompparser stays reusable. REX can pass a callback that knows how to build Sage expression nodes; some other embedding could pass a different callback or none at all.
4. The parse result carries base-language identity
After parsing, the directive gets its base language attached. That becomes important for unparsing, especially for Fortran sentinels such as !$omp and !$ompx.
Lexer, Grammar, And Location Tracking
The parser is split the way you would expect from a serious Flex/Bison design:
- src/omplexer.ll for lexical analysis,
- src/ompparser.yy for grammar and semantic actions,
- src/OpenMPIR.h and src/OpenMPIR.cpp for the constructed IR.
That sounds ordinary, but two details are worth calling out.
The lexer has a lot of clause-specific states
The lexer is not a one-state token machine. It has many explicit states:
%x AFFINITY_EXPR_STATE
%x MAP_STATE
%x NUM_THREADS_STATE
%x SCHEDULE_STATE
%x REDUCTION_STATE
%x IF_STATE
...
This is one of the reasons the parser can handle OpenMP well without pretending that every clause payload is parsed with one universal rule. Different directive and clause contexts need different tokenization behavior, especially around nested parentheses, array-section syntax, mapper syntax, and implementation-defined payloads.
The parser uses Bison locations deliberately
The grammar enables %locations, and the semantic helpers explicitly stamp line and column information onto directives and clauses:
| |
and similarly for clauses:
| |
This is not cosmetic. Location-aware IR is one of the core contracts of this stage. Later AST construction and debugging need to know where the directive and clause actually came from.
The lexer itself does nontrivial line and column bookkeeping, including a small position history to keep unput() handling consistent. That is exactly the kind of detail that disappears if you think of the parser as “just a grammar file.” In reality, source fidelity is an active design goal here.
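The bookkeeping problem is easy to see in a small sketch. When a lexer pushes a character back, the current line and column must move backwards too, and stepping back over a newline requires remembering the column the previous line ended on. The class below is a hypothetical illustration of that idea with a bounded history; PositionTracker and its members are invented names, not the lexer's actual data structures.

```cpp
#include <cstddef>
#include <deque>

struct Position {
    int line;
    int column;
};

// Tracks the position of the next character to be read, with a small undo
// history so that pushing characters back restores earlier positions.
class PositionTracker {
public:
    // Remembers the current position, then moves past `c`.
    void advance(char c) {
        history_.push_back(current_);
        if (history_.size() > kMaxHistory) history_.pop_front();
        if (c == '\n') {
            ++current_.line;
            current_.column = 1;
        } else {
            ++current_.column;
        }
    }

    // Undoes the most recent advance(), as needed after pushing a character
    // back. Returns false if the bounded history is already empty.
    bool retreat() {
        if (history_.empty()) return false;
        current_ = history_.back();
        history_.pop_back();
        return true;
    }

    Position current() const { return current_; }

private:
    static const std::size_t kMaxHistory = 8;  // arbitrary bound for the sketch
    Position current_{1, 1};
    std::deque<Position> history_;
};
```

After advancing over "ab", a newline, and "c", two calls to retreat() land back on line 1, column 3, which a simple "decrement the column" rule could not recover.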
Figure 2. ompparser owns OpenMP syntax, but it does not force itself to own the full host-language expression problem. Clause payloads can stay as normalized strings and can also be handed through an embedding callback.
What OpenMPIR Actually Stores
The parser does not build Sage nodes directly. It builds OpenMPIR.
The core types live in OpenMPIR.h:
| |
That already says something important: both directives and clauses inherit source-location behavior.
OpenMPDirective stores at least these key concepts:
- directive kind (OpenMPDirectiveKind),
- base language,
- original-order clause list,
- clause map grouped by kind,
- normalization behavior,
- directive-specific flags such as Fortran sentinel style,
- and directive-specific subclass state where needed.
The “original order” point matters:
| |
That means the parser is not only interested in semantic grouping by clause kind. It also preserves the input ordering for round-trip output. A directive like:
| |
should round-trip with its clauses in the original order, not in some arbitrary canonicalized order, unless normalization explicitly chooses otherwise.
That is why OpenMPDirective exposes both:
- grouped clause lookup by kind,
- and getClausesInOriginalOrder().
This dual view is one of the strongest design choices in the IR. It serves both:
- compiler consumers that want semantic lookup by clause kind,
- and unparser/test consumers that care about source-faithful output.
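One simple way to offer both views is to keep a single ordered list and index into it by kind. The sketch below assumes that layout purely for illustration; ToyDirective, ToyClause, and the method names are hypothetical, and OpenMPDirective's real storage may be organized differently.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct ToyClause {
    std::string kind;     // e.g. "private"
    std::string payload;  // e.g. "a"
};

// Hypothetical container: one ordered list plus a by-kind index into it.
class ToyDirective {
public:
    void addClause(const std::string &kind, const std::string &payload) {
        order_.push_back(ToyClause{kind, payload});
        byKind_[kind].push_back(order_.size() - 1);
    }

    // View for unparsing and round-trip tests: clauses exactly as written.
    const std::vector<ToyClause> &clausesInOriginalOrder() const {
        return order_;
    }

    // View for compiler consumers: all clauses of one kind, in source order.
    std::vector<ToyClause> clausesOfKind(const std::string &kind) const {
        std::vector<ToyClause> out;
        const auto it = byKind_.find(kind);
        if (it != byKind_.end()) {
            for (std::size_t index : it->second) out.push_back(order_[index]);
        }
        return out;
    }

private:
    std::vector<ToyClause> order_;
    std::map<std::string, std::vector<std::size_t>> byKind_;
};
```

With private(a) shared(b) private(c) as input, the ordered view still reports shared in the middle, while the lookup for private returns both private clauses together.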
Clause payloads are strings first, optional nodes second
OpenMPClause stores expression payloads as strings, separators, optional callback-produced nodes, and per-expression locations:
| |
That combination is exactly the point of this stage.
The parser keeps enough source-level information to:
- reconstruct the directive text,
- preserve clause spelling and ordering,
- and later hand richer semantic work to the host compiler.
But it does not force ompparser itself to become the place where all host-language expression semantics must live.
The Callback Boundary: OpenMP Grammar Here, Host Expressions There
This is the most important subtlety in the parser design.
OpenMPIR.h defines the callback type:
| |
and the parser-side trampoline:
| |
Then OpenMPClause::addLangExpr(...) uses it:
| |
This is the critical architectural split:
- ompparser parses OpenMP directive syntax,
- normalizes and stores clause payload strings,
- tags them with a parse mode such as expression, variable list, or array section,
- and optionally asks an embedding callback to interpret that payload in a host-compiler-specific way.
The parse-mode enum shows how intentional this is:
| |
That is a clean boundary.
It lets the standalone parser stay generic while still providing exactly the information the embedding compiler needs to do the next stage correctly.
In REX, this is what makes the Stage 1 to Stage 2 handoff sane. ompparser does not need to know how to build Sage expressions. It only needs to tell Stage 2 what kind of payload it saw and preserve the text and locations faithfully enough that the Sage-aware code can rebuild the right AST nodes.
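The shape of that boundary can be sketched in a few lines. The code below is a hypothetical model, not the declarations from OpenMPIR.h: SketchClause, SketchParseMode, SketchExprCallback, and recordPayload are invented for this illustration, and the real callback signature may carry different parameters. What it tries to show is the contract described above: the text and parse mode are always stored, and a node is produced only when an embedding installs a callback.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy parse modes mirroring the payload kinds named in the text.
enum class SketchParseMode { Expression, VariableList, ArraySection };

// Hypothetical callback shape: payload text and opaque user data in, an
// opaque host-compiler node (or nullptr) out.
typedef void *(*SketchExprCallback)(SketchParseMode mode, const char *text,
                                    void *userData);

class SketchClause {
public:
    SketchClause(SketchExprCallback callback, void *userData)
        : callback_(callback), userData_(userData) {}

    // Always stores text and mode; asks for a node only if a callback exists.
    void addExpr(const std::string &text, SketchParseMode mode) {
        texts_.push_back(text);
        modes_.push_back(mode);
        nodes_.push_back(callback_ ? callback_(mode, text.c_str(), userData_)
                                   : nullptr);
    }

    const std::string &text(std::size_t i) const { return texts_[i]; }
    SketchParseMode mode(std::size_t i) const { return modes_[i]; }
    void *node(std::size_t i) const { return nodes_[i]; }

private:
    SketchExprCallback callback_;
    void *userData_;
    std::vector<std::string> texts_;
    std::vector<SketchParseMode> modes_;
    std::vector<void *> nodes_;
};

// Example embedding callback: logs each payload into a vector passed through
// userData and returns that vector as a stand-in node handle.
void *recordPayload(SketchParseMode, const char *text, void *userData) {
    std::vector<std::string> *log =
        static_cast<std::vector<std::string> *>(userData);
    log->push_back(text);
    return log;
}
```

An embedding that passes recordPayload gets a non-null handle back for every payload, while one that passes no callback still gets the payload string and its mode, which is enough for unparsing and for a later stage to reparse the text.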
Clause Normalization: Enabled By Default, Explicitly Disabled For Round-Trip Tests
The parser supports clause normalization, and it does so on purpose.
The public API exposes a global toggle:
| |
and each OpenMPDirective captures the current policy when it is created:
| |
The README even calls this out as a feature: multiple compatible clauses may be merged into a normalized representation.
That is useful for compiler consumers, because it reduces meaningless multiplicity in the IR.
But it is not always what the tests want.
The round-trip harness deliberately disables normalization:
| |
That is the right testing discipline. If the goal is to validate that the parser can recover and re-emit the directive spelling accurately, normalization would blur the signal. So the harness explicitly asks the parser to preserve the original clause shape for those tests.
That tells you something healthy about the design: normalization is treated as a policy choice, not as an unavoidable side effect of parsing.
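A toy model makes the "policy captured at creation" behavior easier to picture. The sketch assumes that normalization merges same-kind list clauses by concatenating their variable lists, which is only one plausible reading of the README's description; setNormalizeClauses and PolicyDirective are hypothetical names, not the parser's API.

```cpp
#include <string>
#include <utility>
#include <vector>

// Global policy toggle (hypothetical stand-in for the parser's setting).
static bool g_normalizeClauses = true;
void setNormalizeClauses(bool enabled) { g_normalizeClauses = enabled; }

class PolicyDirective {
public:
    // The policy in force at construction time is captured and kept.
    PolicyDirective() : normalize_(g_normalizeClauses) {}

    void addClause(const std::string &kind, const std::string &vars) {
        if (normalize_) {
            // Merge into an existing clause of the same kind, if any.
            for (std::pair<std::string, std::string> &clause : clauses_) {
                if (clause.first == kind) {
                    clause.second += ", " + vars;
                    return;
                }
            }
        }
        clauses_.push_back(std::make_pair(kind, vars));
    }

    std::string unparse() const {
        std::string out;
        for (const std::pair<std::string, std::string> &clause : clauses_) {
            if (!out.empty()) out += ' ';
            out += clause.first + "(" + clause.second + ")";
        }
        return out;
    }

private:
    bool normalize_;
    std::vector<std::pair<std::string, std::string> > clauses_;
};
```

With normalization on, shared(x) private(y) shared(z) unparses as shared(x, z) private(y); with it off, the three clauses come back exactly as written, which is the behavior a round-trip test needs.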
Combined Constructs And Directive Kinds
One reason ompparser is valuable is that it handles the large OpenMP directive vocabulary as a parser problem instead of scattering that logic throughout the compiler.
The grammar repeatedly creates directives with line and column provenance:
| |
and it does that for the broad range of OpenMP constructs, including combined constructs such as:
- parallel for,
- target teams,
- target teams distribute parallel for,
- and many newer forms.
That is exactly where this logic belongs.
If combined-construct recognition leaked into later AST conversion or lowering code, the whole frontend would be harder to reason about. Instead, Stage 1 returns a directive kind that already knows what construct was parsed. Stage 2 can then focus on AST construction rather than rediscovering directive spelling.
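The benefit for later stages can be illustrated with a deliberately simplified model. In ompparser the recognition happens in the grammar rules, not in a lookup table, so the table below is a stand-in chosen only because it fits in a few lines; SketchDirectiveKind and classifyDirective are hypothetical names, and the enum lists only a handful of constructs.

```cpp
#include <sstream>
#include <string>

// Tiny, hypothetical subset of directive kinds for illustration.
enum class SketchDirectiveKind {
    Unknown,
    Parallel,
    ParallelFor,
    TargetTeams,
    TargetTeamsDistributeParallelFor
};

// Collapses runs of whitespace so that spelling differences do not matter.
static std::string collapseSpaces(const std::string &text) {
    std::istringstream in(text);
    std::string word, out;
    while (in >> word) {
        if (!out.empty()) out += ' ';
        out += word;
    }
    return out;
}

// Maps a full directive spelling to one kind. Later stages switch on the
// kind and never need to look at the spelling again.
SketchDirectiveKind classifyDirective(const std::string &spelling) {
    struct Entry {
        const char *spelling;
        SketchDirectiveKind kind;
    };
    static const Entry table[] = {
        {"parallel", SketchDirectiveKind::Parallel},
        {"parallel for", SketchDirectiveKind::ParallelFor},
        {"target teams", SketchDirectiveKind::TargetTeams},
        {"target teams distribute parallel for",
         SketchDirectiveKind::TargetTeamsDistributeParallelFor},
    };
    const std::string key = collapseSpaces(spelling);
    for (const Entry &entry : table) {
        if (key == entry.spelling) return entry.kind;
    }
    return SketchDirectiveKind::Unknown;
}
```

Note that parallel and parallel for map to different kinds even though one spelling is a prefix of the other. That is the property a consumer relies on: the whole construct is identified once, up front.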
Round-Trip And Location Tests
The parser tests are unusually informative because they reveal what the project considers the real contracts of this stage.
The CMake test registration builds three main executables:
| |
That is already a strong clue. The parser is not tested only on “did it crash?”
It is tested on:
- parse coverage,
- round-trip behavior,
- and location fidelity.
omp_roundtrip: parse, then unparse
omp_roundtrip.cpp preprocesses input files into extracted directive strings, parses each one with parseOpenMP(...), and prints the result with:
| |
The shell harness test_single_pragma.sh then compares the original pragmas against the round-tripped output. It normalizes whitespace carefully, and it distinguishes C/C++ from Fortran formatting rules.
That is exactly the right contract for a standalone directive parser:
- if it parses,
- and if the parse is faithful,
- then it should be able to reconstruct a stable directive spelling.
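The comparison step of such a harness can be approximated as follows. The real check lives in a shell script whose exact rules are not reproduced here, so this is only a sketch under stated assumptions: canonicalize and roundTripMatches are hypothetical names, whitespace is collapsed between tokens only, and the case-insensitive mode is an assumed stand-in for "Fortran formatting rules".

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Collapses whitespace runs and, optionally, lowercases the text.
std::string canonicalize(const std::string &text, bool ignoreCase) {
    std::istringstream in(text);
    std::string word, out;
    while (in >> word) {
        if (!out.empty()) out += ' ';
        out += word;
    }
    if (ignoreCase) {
        for (char &c : out) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
    }
    return out;
}

// True if the emitted directive matches the original up to whitespace (and
// up to letter case when ignoreCase is set, as assumed here for Fortran).
bool roundTripMatches(const std::string &original, const std::string &emitted,
                      bool ignoreCase) {
    return canonicalize(original, ignoreCase) ==
           canonicalize(emitted, ignoreCase);
}
```

This version does not touch spacing inside a token, so private(a,b) and private(a, b) would still compare as different; a fuller harness would have to decide how much of that to forgive.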
test_locations: line and column fields are part of the API
test_locations.cpp is even more revealing.
It checks:
- directive kind,
- directive line and column,
- clause line and column,
- invalid-parse behavior,
- recovery after invalid input,
- and some clause-specific content expectations such as map(...) expression handling.
That tells you location is not a debugging afterthought. It is a first-class parser contract.
Corpus-based tests, not only microcases
The test CMake also registers extracted pragma files from:
- built-in .txt suites,
- OpenMP Validation and Verification inputs,
- and OpenMP examples.
That matters because a directive parser can look good on a handful of handcrafted cases while still drifting on real corpora. The parser suite avoids that trap.
Figure 3. The parser is tested on the contracts that matter for this stage: parse success, faithful round-trip spelling, correct location tracking, and robustness on real corpora rather than only microcases.
Why This Stage Feeds Stage 2 So Cleanly
At this point, the relationship to the next post in the pipeline should be clearer.
ompparser gives Stage 2 exactly what it needs:
- a directive kind that already reflects OpenMP grammar,
- clause objects grouped semantically and preserved in original order,
- normalized or non-normalized clause payload text,
- optional expression-node hooks,
- and location data on both directives and clauses.
That is why OpenMPIRToSageAST() can stay focused on Sage AST construction instead of trying to parse OpenMP syntax itself.
If you think of the OpenMP frontend as one monolithic parser, this separation can feel like extra complexity.
If you think of it as a compiler pipeline, it is the opposite. ompparser removes complexity from the later stages by solving the directive-language problem once and solving it well.
The Real Value Of ompparser
The deepest value of ompparser is not only that it parses a lot of OpenMP directives.
It is that it defines a disciplined contract for the REX frontend:
- preserve the directive text,
- parse OpenMP syntax into a reusable IR,
- keep source locations and clause order intact,
- allow normalization as a policy rather than forcing it,
- and stop short of taking over the full host-language expression problem.
That last point is what makes the design hold together.
If ompparser tried to become the full host-language compiler, it would lose the modularity that makes it reusable.
If it stored everything as raw strings with no structure, Stage 2 and later stages would have to rediscover too much.
Instead, it chooses the right middle ground:
structured OpenMP IR with optional host-expression hooks.
That is why it fits REX so well, and it is why this stage deserves to be understood separately from both pragma preservation before it and AST construction after it.