How REX Carries OpenMP Pragmas Before AST Construction

Posted on Mar 21, 2026 (Updated on Mar 24, 2026)

Stage 1 in REX’s OpenMP pipeline (processOpenMP()) preserves raw directive text and avoids early normalization, producing a clean (pragma site, OpenMPIR) stream for AST construction.

The previous post in this series covered Stage 2, where REX converts OpenMPIR into SgOmp*. That stage only works because an earlier stage has already done something subtle and important: it has preserved the directive long enough that REX can still decide how to parse it.

This post is about that earlier stage.

In the end-to-end OpenMP pipeline, Stage 1 is not yet about building the final AST. It is about keeping the directive recoverable:

keep its spelling close to the original source,
keep its attachment to the surrounding Sage AST,
keep enough language context to parse clause payloads later,
and stop cleanly at parse-only if the user wants that checkpoint.

That is the work done by OmpSupport::processOpenMP() in src/frontend/SageIII/ompAstConstruction.cpp.

Figure 1 (control-flow diagram for processOpenMP showing normalization control, C/C++ and Fortran collection paths, OpenMPIR creation, parse-only exit, and handoff to Stage 2). Stage 1 is a frontend traffic controller. Its job is not to lower anything. Its job is to preserve directive shape, select the right language-specific collection path, produce `OpenMPIR`, and hand a clean stream of directive sites into Stage 2.

Why Stage 1 Exists

The easy answer is “because ompparser needs input.” The real answer is more specific.

If you let the generic frontend normalize or erase too much information too early, three things become harder immediately:

Source-preserving debugging is hindered.
When users report “the directive parsed wrong,” you want to inspect the directive text that REX actually saw, not a heavily normalized approximation that has already dropped details.
Language-unified handling becomes more complex.
C/C++ and Fortran spell OpenMP differently. Stage 1 is where those surface forms are collected and normalized just enough that they can enter the same parser and later the same AST-construction path.
Downstream transformations become dependent on front-end implementation details.
If later stages only see whatever the generic pragma representation happened to keep, then subtle preprocessing or parser changes can break OpenMP behavior in ways that are difficult to explain.

So Stage 1 is a boundary of responsibility. The generic frontend constructs the ordinary Sage AST. Stage 1 then says: “for OpenMP, do not trust the default representation blindly; preserve the directive deliberately, then parse it deliberately.”

`processOpenMP()` Is The Gatekeeper

The control point is processOpenMP(). Its structure makes the staged design explicit:

void processOpenMP(SgSourceFile *sageFilePtr) {
  ...
  const bool wantsOpenMP = sageFilePtr->get_openmp();
  const bool wantsOpenACC = sageFilePtr->get_openacc();

  if (wantsOpenMP) {
    setNormalizeClauses(false);
  }

  if (isFortran) {
    ...
  } else {
    ...
  }

  if (sageFilePtr->get_openmp_parse_only()) {
    ...
    return;
  }

  OpenMPIRToSageAST(sageFilePtr);
  ...
}

That small amount of control flow hides several deliberate choices:

OpenMP processing is opt-in per source file.
Clause normalization is disabled before parsing begins.
C/C++ pragmas and Fortran directives take different collection paths.
Parse-only is treated as a real checkpoint, not as a partial failure mode.
AST construction is a later step, not something mixed into initial directive collection.

This matters because the function is not only doing work; it is defining when certain kinds of work are allowed to happen.

Why `setNormalizeClauses(false)` Matters

One of the most important lines in the whole frontend is also one of the shortest:

setNormalizeClauses(false);

The comment above it in processOpenMP() is blunt: preserve original directive and clause structure instead of normalizing and merging clauses in the parser.

That is exactly the right instinct for a source-to-source compiler.

Normalization is attractive because it simplifies later logic. But if it happens too early, it destroys distinctions that REX still cares about:

the original clause order,
whether the user wrote a particular combined spelling,
source details that are only visible in the raw directive text,
extension payloads that may be semantically important even if the generic pragma representation does not model them fully.

Stage 1 deliberately pays the cost of preserving this structure. That makes Stage 2 and Stage 3 slightly more explicit, but it prevents a much worse outcome: having to reverse-engineer lost intent from an already-collapsed pragma representation.
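
To make the loss concrete, here is a minimal sketch of what an early clause merge does to a directive. The `Clause` struct and both helpers are hypothetical illustrations, not REX's or ompparser's real types; the merge rule (fold clauses of the same kind into the first occurrence) is an assumption chosen only to show the effect.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified clause model (not an ompparser type).
struct Clause {
    std::string kind;                // e.g. "private", "shared"
    std::vector<std::string> items;  // e.g. {"a", "b"}
};

// Toy normalizer: folds every clause into the first clause of the same kind.
// This is the sort of rewrite that Stage 1 avoids performing early.
std::vector<Clause> mergeSameKind(const std::vector<Clause>& clauses) {
    std::vector<Clause> merged;
    for (const Clause& c : clauses) {
        bool found = false;
        for (Clause& m : merged) {
            if (m.kind == c.kind) {
                m.items.insert(m.items.end(), c.items.begin(), c.items.end());
                found = true;
                break;
            }
        }
        if (!found) merged.push_back(c);
    }
    return merged;
}

// Renders clauses back to text so the two forms can be compared.
std::string render(const std::vector<Clause>& clauses) {
    std::string out;
    for (const Clause& c : clauses) {
        if (!out.empty()) out += ' ';
        out += c.kind + "(";
        for (std::size_t i = 0; i < c.items.size(); ++i) {
            if (i > 0) out += ", ";
            out += c.items[i];
        }
        out += ")";
    }
    return out;
}
```

Given a directive written as `private(a) shared(x) private(b)`, the merged form renders as `private(a, b) shared(x)`. The meaning may be the same, but the fact that the user wrote two separate `private` clauses, and where the second one sat, can no longer be recovered from the merged list.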

Figure 2 (comparison of a raw C/C++ OpenMP directive in source and a normalized or preprocessed pragma string, showing why REX prefers the raw spelling when possible). The normalized pragma text stored in the AST is not always the most faithful copy of what the user wrote. Stage 1 therefore tries to recover raw directive text from source lines and prefers it when possible.

The C/C++ Path: Start From `SgPragmaDeclaration`, But Don’t Stop There

For C and C++, Stage 1 begins with a straightforward query:

walk the file’s SgPragmaDeclaration nodes,
filter those whose first token is omp,
parse them with ompparser.

If that were the entire story, Stage 1 would be a very small pass. But the actual implementation is more careful than that.
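
The first-token filter from the list above can be sketched in a few lines. The function name is hypothetical, and the sketch assumes the input is the pragma text with the leading `#pragma` already removed; it is not the check REX itself performs.

```cpp
#include <sstream>
#include <string>

// Returns true when the first whitespace-delimited token is exactly "omp".
// Assumption: `pragmaText` is the text that follows "#pragma".
bool isOpenMPPragmaText(const std::string& pragmaText) {
    std::istringstream in(pragmaText);
    std::string first;
    return (in >> first) && first == "omp";
}
```

Comparing whole tokens rather than a string prefix keeps a pragma such as `ompx foo` from being mistaken for an OpenMP directive.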

The First Copy Of The Directive May Already Be Degraded

Each pragma declaration has a string in the AST:

const std::string preprocessedPragmaString =
    pragmaDeclaration->get_pragma()->get_pragma();

That is useful, but REX does not automatically trust it as the best available source text.

Why not? Because the AST-normalized or preprocessed pragma string can legitimately drop details. The comments in the implementation call this out directly: extension-only fragments can disappear, and strict signature matching against normalized text can therefore miss the user’s real directive.

So Stage 1 calls getRawOpenMPCppDirectiveText(...) and tries to recover the directive from the actual source lines near the pragma site.

The logic is intentionally layered:

try the exact source location first;
if that fails, search within a bounded line radius nearby;
if needed, fall back to signature-based matching over nearby lines;
if raw recovery still fails, fall back to the AST/preprocessed pragma string.

That is a pragmatic design. It does not pretend raw source recovery is always perfect, but it strongly prefers the highest-fidelity directive text when it is available.
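
The layering can be illustrated with a simplified sketch. This is not the body of getRawOpenMPCppDirectiveText(...): every name below is hypothetical, the "signature" is assumed to be just the first two tokens of the directive, the nearby search and the signature match are collapsed into one loop, and backslash-continued pragmas are ignored.

```cpp
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// If `line` is a "#pragma omp ..." line, returns the text after "#pragma".
std::optional<std::string> ompBody(const std::string& line) {
    static const std::string kPragma = "#pragma";
    const std::size_t hash = line.find_first_not_of(" \t");
    if (hash == std::string::npos) return std::nullopt;
    if (line.compare(hash, kPragma.size(), kPragma) != 0) return std::nullopt;
    const std::size_t afterPragma = hash + kPragma.size();
    const std::size_t start = line.find_first_not_of(" \t", afterPragma);
    if (start == std::string::npos || start == afterPragma) return std::nullopt;
    std::istringstream in(line.substr(start));
    std::string first;
    in >> first;
    if (first != "omp") return std::nullopt;
    return line.substr(start);
}

// Assumed signature: the first two tokens, e.g. "omp parallel".
std::string signature(const std::string& body) {
    std::istringstream in(body);
    std::string a, b;
    in >> a >> b;
    return a + " " + b;
}

// `site` is the 0-based line index reported for the pragma node.
std::string recoverDirectiveText(const std::vector<std::string>& lines,
                                 std::size_t site, std::size_t radius,
                                 const std::string& preprocessed) {
    // Layer 1: trust the exact source location if it holds an OpenMP pragma.
    if (site < lines.size()) {
        if (const auto body = ompBody(lines[site])) return *body;
    }
    // Layers 2 and 3: nearest lines first, accepted only on a signature match.
    const std::string wanted = signature(preprocessed);
    for (std::size_t d = 1; d <= radius; ++d) {
        if (d <= site && site - d < lines.size()) {
            const auto body = ompBody(lines[site - d]);
            if (body && signature(*body) == wanted) return *body;
        }
        if (site + d < lines.size()) {
            const auto body = ompBody(lines[site + d]);
            if (body && signature(*body) == wanted) return *body;
        }
    }
    // Layer 4: raw recovery failed, so keep the AST/preprocessed string.
    return preprocessed;
}
```

The bounded radius is the important design choice: an unbounded search could attach the spelling of a different directive with the same signature from elsewhere in the file.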

The Raw Text Is Not Only For Parsing

When raw recovery succeeds, Stage 1 does two important things with it.

First, it uses the recovered text as the preferred input to parseOpenMP(...).

Second, it stores the preferred source spelling in a side map:

g_omp_directive_source_text_by_pragma[pragmaDeclaration] =
    preferredPragmaString;

That second step is easy to miss, but it matters for the rest of the pipeline. Stage 2 and clause-expression helpers can later consult the original directive text when they need to reconstruct source spelling or recover details that a normalized pragma node no longer exposes cleanly.

In other words, Stage 1 does not merely “parse and forget.” It preserves a directive-to-source-text association that later stages can reuse.
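
A minimal sketch of how a later stage might consult such a side map follows. `PragmaSite` is a hypothetical stand-in for the pragma declaration node, and the lookup helper is an illustration rather than a REX function; the only behavior taken from the text above is "prefer the preserved spelling, otherwise use what the node holds".

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a pragma declaration node.
struct PragmaSite {
    std::string normalizedText;  // what the AST node itself stores
};

// Side map keyed by node identity, in the spirit of
// g_omp_directive_source_text_by_pragma.
using SourceTextMap = std::unordered_map<const PragmaSite*, std::string>;

// Returns the preserved source spelling when one was recorded,
// otherwise falls back to the node's own normalized text.
const std::string& directiveTextFor(const SourceTextMap& preserved,
                                    const PragmaSite& site) {
    const auto it = preserved.find(&site);
    return it != preserved.end() ? it->second : site.normalizedText;
}
```

Keying by node pointer means the association survives as long as the node does, which is presumably why entries need to be released together with the rest of the parse state.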

Parsing Into `OpenMPIR`

Once Stage 1 has the best directive string it can get, it invokes ompparser:

ompparser_OpenMPIR = parseOpenMP(pragmaString.c_str(), nullptr, nullptr);

There is also a fallback in the other direction: if parsing the recovered raw text fails and that text differs from the preprocessed string, REX retries with the preprocessed pragma text. This is another pragmatic choice: prefer fidelity, but do not turn a recoverable parsing problem into a hard failure when the normalized form still parses.
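
The retry rule is small enough to sketch directly. The parser is passed in as a callback so the example stays self-contained; `ParsedDirective`, `Parser`, and `parsePreferRaw` are hypothetical names, and the real code calls parseOpenMP(...) instead.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical stand-in for a successfully parsed directive.
struct ParsedDirective {
    std::string text;
};

// Any callable that either parses the text or reports failure.
using Parser =
    std::function<std::optional<ParsedDirective>(const std::string&)>;

// Tries the raw source spelling first. Retries with the preprocessed
// string only when the raw attempt failed and the two texts differ.
std::optional<ParsedDirective> parsePreferRaw(const std::string& raw,
                                              const std::string& preprocessed,
                                              const Parser& parse) {
    if (auto result = parse(raw)) return result;
    if (raw != preprocessed) return parse(preprocessed);
    return std::nullopt;
}
```

The `raw != preprocessed` guard avoids parsing the same failing string twice, so a genuinely malformed directive still fails exactly once.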

If parsing succeeds, Stage 1 validates the resulting OpenMPIR, filters end-marker-only directives that are not supposed to become standalone AST nodes, and records the successful parse in two core data structures:

omp_pragma_list
OpenMPIR_list

Conceptually, those structures are building the stream that Stage 2 will consume:

a Sage anchor site (SgPragmaDeclaration*)
paired with its parsed directive (OpenMPDirective*)

That is the real output of Stage 1.
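
As a rough sketch of that output, the two lists can be modeled as parallel vectors that are only ever appended to together. The types and the `record` helper are hypothetical; whether REX keeps the real lists in strict lockstep is an assumption made here for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for SgPragmaDeclaration and OpenMPDirective.
struct PragmaAnchor {
    int line;
};
struct Directive {
    std::string name;
    bool endMarkerOnly;  // true for directives that only close a construct
};

struct Stage1Stream {
    std::vector<const PragmaAnchor*> sites;    // role of omp_pragma_list
    std::vector<const Directive*> directives;  // role of OpenMPIR_list

    // Appends one (site, directive) pair. End-marker-only directives and
    // null inputs are rejected so the two vectors always stay the same length.
    bool record(const PragmaAnchor* site, const Directive* directive) {
        if (site == nullptr || directive == nullptr) return false;
        if (directive->endMarkerOnly) return false;
        sites.push_back(site);
        directives.push_back(directive);
        return true;
    }
};
```

Funneling every append through one function is what makes index `i` in one list reliably describe index `i` in the other.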

Stage 1 Also Prepares Clause-Structure Side Data

Another easy detail to overlook is that Stage 1 does some clause-level preparation before AST construction begins.

For directives that are not just end markers, the pass builds a parsed clause-node view through parseClauseNodesForDirective(...). That means Stage 2 does not start from a completely cold directive representation; it already has structured clause information associated with the parsed directive.

This is a good example of how the stages in REX are separated, but not artificially isolated. Stage 1 is still “the parsing stage,” but it is allowed to prepare auxiliary information that makes Stage 2 more deterministic.

The Fortran Path: Different Surface Syntax, Same Goal

Fortran is where the Stage 1 design becomes obviously necessary.

OpenMP in Fortran does not start life as C/C++-style pragmas. It usually appears as comment directives such as:

!$omp parallel do &
!$omp& private(i)
do i = 1, n
  call work(a(i))
end do
!$omp end parallel do

There may also be:

continuation lines,
begin/end directive pairs,
directives attached through preprocessing info rather than through ordinary pragma nodes.

REX therefore gives Fortran its own collection path inside Stage 1.

Figure 3 (Fortran Stage 1 flow showing OpenMP comment directives, continuation stitching, pairing of begin and end directives, and conversion into a pragma-anchored stream for later AST construction). Fortran needs a different front door. Stage 1 collects OpenMP comments, stitches continuation lines, pairs explicit `END` directives when needed, and converts the result into the same kind of parsed directive stream that later stages expect.

`parseOpenMPFortranPragmas()` First, Then Fallback

The main Stage 1 function tries a dedicated Fortran pragma parser path first:

parsed_fortran_pragmas = parseOpenMPFortranPragmas(sageFilePtr);
if (!parsed_fortran_pragmas) {
  parseOpenMPFortran(sageFilePtr);
}

That already tells you something about the architecture: REX expects Fortran directive collection to be messy enough that a fallback path is worthwhile.

Inside parseOpenMPFortranPragmas(), the logic is careful about real source behavior:

gather candidate pragma nodes in the current source file,
extract only those that are genuine Fortran OpenMP directives,
stitch together continuation lines,
preserve the directive source text,
parse the accumulated directive with parseOpenMP(...),
and match END directives back to their begin directives when explicit pairing is required.
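
The continuation-stitching step can be sketched on plain source lines. This is an illustration under stated assumptions, not the logic inside parseOpenMPFortranPragmas(): all function names are hypothetical, only the free-form `!$omp` sentinel is handled, and fixed-form sentinels are ignored.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

std::string trim(const std::string& s) {
    const std::size_t b = s.find_first_not_of(" \t");
    if (b == std::string::npos) return "";
    const std::size_t e = s.find_last_not_of(" \t");
    return s.substr(b, e - b + 1);
}

// If `line` starts with "!$omp" (case-insensitive), stores the remainder.
bool stripSentinel(const std::string& line, std::string& rest) {
    static const std::string sentinel = "!$omp";
    const std::string t = trim(line);
    if (t.size() < sentinel.size()) return false;
    for (std::size_t i = 0; i < sentinel.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(t[i])) != sentinel[i]) {
            return false;
        }
    }
    rest = t.substr(sentinel.size());
    return true;
}

// Joins "&"-continued directive lines into one string per directive.
std::vector<std::string> stitchDirectives(const std::vector<std::string>& lines) {
    std::vector<std::string> out;
    std::string current;
    bool continuing = false;
    for (const std::string& line : lines) {
        std::string rest;
        if (!stripSentinel(line, rest)) {
            // Ordinary code: flush a dangling continuation, then skip the line.
            if (continuing) {
                out.push_back(current);
                current.clear();
                continuing = false;
            }
            continue;
        }
        // A continuation line may repeat "&" right after the sentinel.
        if (continuing && !rest.empty() && rest[0] == '&') rest.erase(0, 1);
        rest = trim(rest);
        const bool more = !rest.empty() && rest.back() == '&';
        if (more) {
            rest.pop_back();
            rest = trim(rest);
        }
        if (continuing && !current.empty() && !rest.empty()) current += ' ';
        current += rest;
        continuing = more;
        if (!continuing) {
            out.push_back(current);
            current.clear();
        }
    }
    if (continuing) out.push_back(current);
    return out;
}
```

Applied to the Fortran example earlier in this section, the sketch yields two directive strings: `parallel do private(i)` and `end parallel do`.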

The pairing logic is especially important. Some directives need explicit END handling; some do not. Stage 1 therefore maintains a pairing list and merges end-clause information back into the begin directive when it finds the correct match.

That work belongs in Stage 1 because it is still fundamentally about recovering the directive structure from source spelling. By the time Stage 2 runs, the directive stream should already reflect the correct logical construct.
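
One way to picture the pairing list is as a stack of still-open begin directives. The sketch below is a hypothetical illustration, not REX's matching code: it takes directive names only (lowercase, clauses already stripped), assumes an `end X` closes the most recent open `X`, and treats anything opened after that `X` as a construct whose END was optional. Merging end-clause information back into the begin directive is left out.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Returns (beginIndex, endIndex) pairs for every "end X" that finds an open "X".
std::vector<std::pair<std::size_t, std::size_t>>
pairEndDirectives(const std::vector<std::string>& names) {
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    std::vector<std::size_t> open;  // indices of begin directives not yet closed
    const std::string endPrefix = "end ";
    for (std::size_t i = 0; i < names.size(); ++i) {
        if (names[i].compare(0, endPrefix.size(), endPrefix) != 0) {
            open.push_back(i);  // a begin (or standalone) directive
            continue;
        }
        const std::string target = names[i].substr(endPrefix.size());
        // Search from the innermost open directive outwards.
        for (std::size_t k = open.size(); k-- > 0;) {
            if (names[open[k]] == target) {
                pairs.emplace_back(open[k], i);
                // Directives opened after the match are treated as closed
                // implicitly (assumption: their END was optional).
                open.resize(k);
                break;
            }
        }
    }
    return pairs;
}
```

For the sequence `parallel`, `do`, `end parallel`, `parallel do`, `end parallel do`, the sketch pairs index 0 with index 2 and index 3 with index 4, while the inner `do` is left without an explicit END.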

Why Fortran Still Converts Comments Into Pragma-Like Anchors

Later in processOpenMP(), if the dedicated Fortran pragma path did not already leave the AST in the needed shape, REX can still call convert_Fortran_OMP_Comments_to_Pragmas(...).

This is not a contradiction. It is a design compromise that keeps later stages simple.

Floating comment directives are awkward transformation anchors. A pragma-like node attached to a statement is much easier to pair with a parsed directive and much easier to feed into Stage 2. So REX temporarily creates a more C/C++-like AST shape for Fortran OpenMP, precisely so that the same later machinery can be reused.

That is a recurring REX pattern:

preserve language-specific syntax early,
normalize it into a common internal form as soon as it is safe,
then keep the rest of the pipeline single-path.

Parse-Only Is A First-Class Checkpoint

One of the reasons REX’s OpenMP pipeline stays debuggable is that Stage 1 has a clean exit:

if (sageFilePtr->get_openmp_parse_only()) {
  releaseOpenMPParseStateForSourceFile(sageFilePtr);
  mark_processed(sageFilePtr);
  return;
}

That checkpoint matters more than it first appears to.

Without it, every “the parser got this directive wrong” investigation would be entangled with AST construction or lowering. With it, REX can stop immediately after Stage 1 and let developers inspect:

which directives were collected,
what source spelling was preserved,
which OpenMPIR nodes were created,
whether end markers and continuation lines were handled correctly.

That is a much better debugging story than forcing every failure to travel downstream before it becomes visible.

What Stage 1 Hands To Stage 2

By the time processOpenMP() calls OpenMPIRToSageAST(sageFilePtr), the Stage 1 contract is:

OpenMP directives in the current file have been found.
Their best available source spelling has been preserved when possible.
C/C++ and Fortran surface syntax has been collected through the right path.
OpenMPIR objects have been created and validated.
End-marker-only directives have been filtered out from standalone AST conversion.
A stable (pragma site, parsed directive) stream has been recorded.

That handoff contract is what lets Stage 2 focus on AST construction instead of re-solving frontend collection problems.

This is the architectural point worth emphasizing: Stage 1 is not “just scanning pragmas.” It is constructing a reliable boundary object between the generic frontend and the OpenMP-specific AST builder.

If You Change Stage 1

The easiest way to create fragile OpenMP behavior is to let Stage 1 silently lose source structure.

When modifying this stage, the safe design rules are:

Preserve directive spelling whenever practical;
Do not normalize clause structure earlier than necessary;
Keep the parse-only checkpoint working;
Hand Stage 2 structured pairs, not partially-recovered text blobs;
Treat Fortran as a first-class path, not as an afterthought bolted onto the C/C++ flow.

If those rules hold, the next stages remain understandable. If they do not, every later phase inherits ambiguity it cannot fix cleanly.

The Stage 1 Philosophy In One Sentence

Stage 1 exists to make sure OpenMP is still recoverable before it becomes transformable.

That is why REX keeps directive collection explicit, prefers raw source text when it can recover it, avoids premature normalization, and only then hands parsed directives into the OpenMP AST-construction pass.