How REX Fixed the Cxx_Grammar Timeout by Owning the Frontend Traversal Boundary
The Cxx_Grammar.C timeout was not solved by raising CTest limits or adding unparser special cases. The root cause was that the Clang frontend was materializing too much non-main-file/header AST for ordinary compile tests, especially namespace and template-heavy declarations from the C++ library and REX headers. The fix was to make frontend traversal policy explicit: normal compile tests translate the source-backed/application surface they need, while token/source-position workflows retain the extra header surface they actually inspect. The frozen CTest loop kept the fix honest by proving the timeout disappeared without bouncing previously fixed failures.
Some compiler failures are loud in the right way. An assertion points at a broken parent pointer. A generated file fails to compile and the diagnostic tells you which declaration pair disagrees. A symbol lookup fails in a scope that should have owned it.
Timeouts are worse.
They say only that the compiler disappeared into work it could not finish before the test harness gave up. For a source-to-source compiler, that can mean many different things:
- a real infinite recursion,
- a quadratic traversal,
- repeated reparsing of the same structure,
- accidental materialization of a whole header universe,
- or a correctness bug that prevents later pruning from happening.
During the REX Clang frontend migration, rose_example_src_frontend_SageIII_Cxx_Grammar_C became one of those failures. In the full suite it timed out at 1500 seconds. The specimen was not a small user program. It was REX’s own generated Cxx_Grammar.C, the implementation file produced by ROSETTA for the SAGE IR.
That made the failure important. If the new frontend could not translate the compiler’s own generated IR implementation, it was not just slow. It was failing on one of the densest stress tests for declaration ownership, template type construction, file information, and frontend/backend agreement.
This post documents what was actually wrong and how the fix avoided the tempting wrong answers: no timeout bump, no test suppression, and no unparser string workaround.
Figure 1. The timeout was not caused by one exotic statement in Cxx_Grammar.C. The file forced the frontend to cross a boundary where ordinary source translation accidentally became eager header-universe translation.
Why This Test Is Special
Cxx_Grammar.C is generated code, but it is not disposable generated code. It defines a huge part of the ROSE/REX IR implementation:
- generated accessors,
- memory-pool helpers,
- symbol and declaration utilities,
- traversal support,
- node replacement helpers,
- type and expression classes,
- and many of the declarations every other REX subsystem touches.
The CTest case runs the normal translator over that file with a large include path and then checks that the generated output still compiles. In practice, the test asks:
Can the Clang frontend build a coherent SAGE AST for REX’s own generated C++ implementation, and can the backend emit code that agrees with the original declarations?
That is a very different test from a small parser specimen. It exercises the frontend at compiler scale.
It also includes enough REX headers and C++ standard library surface to expose traversal-policy mistakes. If the frontend decides that “reachable from the AST” means “must eagerly translate everything under this namespace and template declaration,” this test becomes enormous very quickly.
That is what happened.
The First Symptom Was A Timeout, Not A Nice Assertion
The baseline full run showed:
| |
There were nearby assertion failures in other tests, and the logs around this family showed familiar structural problems:
- declarations whose parent pointers were missing,
- declaration symbols not found in the expected scope,
- frontend-generated nodes with inconsistent file information,
- template-heavy constructs being materialized from headers that the test did not need to unparse.
But the Cxx_Grammar.C test itself did not begin as a clean “fix this line” failure. It simply consumed the full timeout budget.
That shaped the debugging strategy. A timeout should not be treated as a reason to relax the harness. It is usually evidence that the compiler is doing work it should not be doing, or doing necessary work in a pathologically expensive way.
For this case, the key question was:
Why is an ordinary compile-only translation of one source file pulling in enough non-main-file AST to run for more than 1500 seconds?
The Bad Mental Model: Translate Everything Clang Knows
Clang’s AST contains far more than the user’s source file. Once headers enter the translation unit, Clang can expose a massive declaration graph:
- namespaces from the standard library,
- class templates and their member declarations,
- typedefs and aliases,
- implicit declarations,
- instantiated declarations,
- declarations whose canonical home is outside the main file,
- declarations that exist only because a header implementation needed them.
That is expected. Clang is a compiler frontend; it needs a semantically complete view of the translation unit.
REX, however, is a source-to-source compiler. It needs to build a SAGE representation that is correct for the current workflow. That does not mean every Clang declaration reachable through headers should be eagerly materialized as a full SAGE declaration tree.
The old traversal policy blurred that boundary. In practice, it kept too much non-main-file surface alive, including namespaces and template hierarchies that ordinary compile tests did not need. That created two problems at once.
First, it was expensive. The frontend spent time translating suppressed or non-output header declarations whose bodies and nested declarations were irrelevant to the generated output.
Second, it increased correctness pressure. Every extra declaration had to have coherent parent pointers, scope ownership, defining/nondefining declaration links, symbol insertion, file info, and template metadata. If the compiler imports a declaration universe it does not actually need, it also imports every invariant that universe requires.
The timeout was therefore a performance symptom of a design bug.
The Boundary REX Needed
The fix started by separating frontend workflows.
Not every C/C++ translation in REX has the same preservation requirements.
Some workflows need token or source-position preservation. They may later inspect original header locations, token streams, or preprocessing attachment. For those workflows, keeping more written header surface can be justified.
Ordinary compile tests are different. They need a correct AST for the source being translated, declarations needed for semantic references, and application-header declarations that are part of the user-visible program surface. They do not need eager traversal through every namespace and class-template hierarchy from system headers.
That distinction became an explicit frontend rule:
| |
That check appears in the non-main-file eager traversal path. Its meaning is intentionally narrow:
do not pay for eager non-main-file/header leaf declaration traversal unless this source file is in a workflow that preserves and later relies on original token/source surface.
For the preserving workflows, the frontend still keeps a stable surface for inexpensive declarations. But even there, it avoids recursively descending through heavyweight namespace and template hierarchies. The policy keeps leaf declarations visible when useful, without making a namespace declaration an invitation to translate the whole header world.
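The policy described above can be pictured as a small predicate over a declaration and the current workflow. The following is only a sketch under stated assumptions: `Decl`, `Workflow`, `should_eagerly_translate`, and the other names are hypothetical stand-ins, not REX or Clang APIs, and on-demand materialization of semantic dependencies is deliberately not modeled.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins; none of these names come from REX or Clang.
enum class DeclKind { Function, Variable, Typedef, Namespace, ClassTemplate };

struct Decl {
    DeclKind kind;
    bool in_main_file;
    bool in_application_header;  // header that is part of the user-visible program surface
    std::vector<Decl> children;
};

struct Workflow {
    bool preserves_tokens_or_positions;  // token / source-position workflows
};

// Namespaces and class templates are the heavyweight containers in this sketch.
inline bool is_heavy_container(DeclKind k) {
    return k == DeclKind::Namespace || k == DeclKind::ClassTemplate;
}

// Decide whether a declaration is translated eagerly.
inline bool should_eagerly_translate(const Decl& d, const Workflow& w) {
    if (d.in_main_file) return true;           // source-backed surface
    if (d.in_application_header) return true;  // program boundary
    if (is_heavy_container(d.kind)) return false;  // never recurse into these
    return w.preserves_tokens_or_positions;    // system-header leaf declarations
}

// Count the declarations an eager walk would materialize under this policy.
// Children of a skipped declaration are skipped with it.
inline int count_materialized(const Decl& d, const Workflow& w) {
    if (!should_eagerly_translate(d, w)) return 0;
    int n = 1;
    for (const Decl& c : d.children) n += count_materialized(c, w);
    return n;
}

// Entry point for a whole translation unit.
inline int count_for_translation_unit(const Decl& root, const Workflow& w) {
    assert(root.in_main_file && "the walk starts at the main-file root");
    return count_materialized(root, w);
}
```

In this model a system-header namespace contributes nothing in either workflow, while a system-header typedef that sits directly under the root is counted only when the workflow preserves tokens or source positions.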
Figure 2. The important fix was a traversal boundary. Ordinary compile translation keeps the semantic surface it needs; token/source-position workflows are allowed to preserve more written header surface, but still avoid recursive namespace/template expansion.
Why This Is Not A Test Hack
There are several easy ways to make a timeout disappear:
- increase the CTest timeout,
- remove the test from the frozen failure set,
- mark the test as expected failure,
- skip generated files,
- special-case Cxx_Grammar.C,
- hide the generated output problem in the unparser.
None of those fix a compiler.
The actual change did not mention Cxx_Grammar.C as a special case. It changed the frontend’s ownership of a general question:
when should a non-main-file declaration be eagerly translated?
That is a real compiler policy question. The answer depends on source ownership and workflow requirements, not on a test name.
It also respects the source-to-source compiler boundary. REX should not pretend that Clang’s full internal declaration graph is the exact output surface REX needs to reproduce. REX should translate the program surface and the semantic dependencies required to keep that surface correct.
The fix made that policy explicit.
The Supporting Performance Fixes
The traversal boundary was the main fix, but large translation units also exposed a few supporting inefficiencies.
One was lookup structure choice. The frontend translation maps for Clang declarations, statements, and types were hot paths during large AST construction. Using ordered maps there meant paying tree lookup costs repeatedly in a workload that mostly needs pointer-key identity lookup.
Those maps were moved to hash maps:
| |
That is not as conceptually important as the traversal boundary, but it matters at this scale. Once the frontend stops importing unnecessary header subtrees, the remaining legitimate work should not pay avoidable lookup overhead.
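As a rough illustration of the lookup-structure point, a pointer-keyed translation cache only needs identity lookup, which `std::unordered_map` gives in average constant time where `std::map` pays a logarithmic tree walk. This is a minimal sketch; `ClangDecl`, `SageNode`, and `TranslationCache` are hypothetical names and not the actual REX maps.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical stand-ins for a Clang declaration and its translated SAGE node.
struct ClangDecl { int id; };
struct SageNode  { int id; };

// Pointer-keyed cache: keys are compared by identity, ordering is never needed.
class TranslationCache {
public:
    // Reserving up front avoids repeated rehashing on large translation units.
    explicit TranslationCache(std::size_t expected) { map_.reserve(expected); }

    // Return the node already built for this declaration, or nullptr.
    SageNode* lookup(const ClangDecl* d) const {
        auto it = map_.find(d);
        return it == map_.end() ? nullptr : it->second;
    }

    // Record a translation; if one already exists, the first entry is kept.
    SageNode* remember(const ClangDecl* d, SageNode* n) {
        assert(d != nullptr && n != nullptr);
        return map_.emplace(d, n).first->second;
    }

    std::size_t size() const { return map_.size(); }

private:
    std::unordered_map<const ClangDecl*, SageNode*> map_;
};
```

The trade-off is that iteration order becomes unspecified, so this shape only fits maps whose consumers never depend on ordered traversal.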
Another supporting fix was in preprocessing-record handling. The preprocessor recorder owns a sorted list of directives to attach into the SAGE tree. The old path repeatedly considered already-consumed front entries. The repair added a cursor so consumption advances through the list instead of continually treating the front of the list as live work.
The important shape is that sorting is a preparation step, and cursor access stays cheap:
| |
That fix also reflects the same principle: once a source artifact is consumed, the frontend should not keep rediscovering it as if it were new.
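The cursor idea can be sketched as follows: sort once, then advance an index instead of re-examining the front of the list. The `Directive` and `DirectiveQueue` types and the `take_before` method are assumptions made for illustration; the real recorder's interface is not shown in this post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical directive record: a source offset plus the directive text.
struct Directive {
    unsigned offset;
    std::string text;
};

class DirectiveQueue {
public:
    // Sorting is a one-time preparation step.
    explicit DirectiveQueue(std::vector<Directive> ds) : items_(std::move(ds)) {
        std::stable_sort(items_.begin(), items_.end(),
                         [](const Directive& a, const Directive& b) {
                             return a.offset < b.offset;
                         });
    }

    // Hand out every unconsumed directive located before `limit`.
    // The cursor only moves forward, so consumed entries are never revisited.
    std::vector<Directive> take_before(unsigned limit) {
        std::vector<Directive> out;
        while (cursor_ < items_.size() && items_[cursor_].offset < limit) {
            out.push_back(items_[cursor_]);
            ++cursor_;
        }
        return out;
    }

    std::size_t remaining() const {
        assert(cursor_ <= items_.size());
        return items_.size() - cursor_;
    }

private:
    std::vector<Directive> items_;
    std::size_t cursor_ = 0;
};
```

Across all calls the total work is linear in the number of directives, because each entry is touched once when it is consumed.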
The Correctness Problems Did Not Disappear
It is important to be honest about what this fix did and did not solve.
After the timeout disappeared, Cxx_Grammar.C became an ordinary long-running test. It passed repeatedly in frozen runs, usually around 1000 to 1060 seconds. That is still expensive, but it is no longer an unbounded failure.
Later changes exposed a different Cxx_Grammar.C failure: generated output where nested template and typedef-heavy types disagreed with the original header. For example, a type that should preserve nested map/set/vector-pair structure could collapse into a simpler std::string-shaped template argument in the generated implementation.
That is not the timeout coming back. It is a separate CFE type construction bug. In fact, making the timeout go away helped reveal it. Once the test finishes, it can report the next concrete failure:
| |
That is progress. A timeout gives you almost no localized information. A compile error in rose_Cxx_Grammar.C gives you a root-cause trail into template argument and typedef AST construction.
This is why performance fixes in a compiler test suite must be treated carefully. The goal is not to make the red line disappear. The goal is to move the failure from “the compiler got lost” to “this invariant is wrong.”
The Anti-Bounce Rule
The most important process detail was the frozen failure set.
At the start of the campaign, the full CTest failure list was frozen into a file. After fixes began, the workflow did not rely on the moving --rerun-failed set. Every accepted patch had to survive two kinds of checks:
- the focused test or family that motivated the patch,
- the original frozen failure set and the core non-regression gate.
That matters because large frontend fixes can easily trade one failure for another. A patch that makes Cxx_Grammar.C pass but reintroduces OpenMP, Fortran, token-stream, or earlier CFE failures is not an accepted fix.
For this case, the evidence was simple:
- the original full run had rose_example_src_frontend_SageIII_Cxx_Grammar_C as a timeout;
- after the traversal-boundary work, frozen reruns showed the test completing and passing;
- later authoritative frozen runs did not list the test as failing;
- the frozen set still tracked all other unresolved failures instead of letting the suite’s moving failure file hide bounces.
Figure 3. The fix was accepted only because the original frozen failure set stayed stable. The test no longer timed out, and the campaign could still detect if some other already-fixed case bounced back.
Why The Fix Belongs In The Frontend
It is tempting to think of this as an unparser or backend issue because the visible test is a compile of generated rose_Cxx_Grammar.C.
That is the wrong layer.
The unparser should not be asked to guess which declarations were accidentally imported, stringify around malformed template structure, or hide missing scope ownership. Once the SAGE AST has an invalid declaration graph, the backend can only make the failure harder to see.
The frontend owns the decision to materialize Clang declarations into SAGE declarations. It also owns the invariant that a materialized declaration has a valid scope, symbol, parent, type, source location, and declaration pairing.
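To make that invariant list concrete, here is a small sketch of a checker that reports which properties a materialized declaration is missing. `MaterializedDecl` and its fields are hypothetical simplifications; real SAGE nodes carry this information in a different and much richer form.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, flattened view of a materialized declaration.
struct MaterializedDecl {
    std::string name;
    const MaterializedDecl* parent = nullptr;            // enclosing node
    const void* scope = nullptr;                         // owning scope (opaque here)
    bool has_symbol_in_scope = false;                    // symbol inserted in that scope
    bool has_type = false;
    bool has_file_info = false;                          // source location
    const MaterializedDecl* first_nondefining = nullptr; // declaration pairing
};

// Return the names of the invariants this declaration violates, in a fixed order.
inline std::vector<std::string> violated_invariants(const MaterializedDecl& d) {
    std::vector<std::string> out;
    if (d.parent == nullptr)            out.push_back("parent");
    if (d.scope == nullptr)             out.push_back("scope");
    if (!d.has_symbol_in_scope)         out.push_back("symbol");
    if (!d.has_type)                    out.push_back("type");
    if (!d.has_file_info)               out.push_back("file info");
    if (d.first_nondefining == nullptr) out.push_back("declaration pairing");
    return out;
}

// Debug-build guard a frontend could run right after materialization.
inline void require_valid(const MaterializedDecl& d) {
    assert(violated_invariants(d).empty());
}
```

Every declaration the frontend chooses to materialize has to pass a check of this kind, which is why importing fewer irrelevant declarations also means fewer invariants to keep true.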
So the timeout fix belongs in the frontend for two reasons:
- the frontend was doing too much work by crossing the wrong traversal boundary;
- reducing that work also reduced the number of irrelevant AST invariants later passes had to maintain.
That is not a workaround. It is the right ownership line.
What This Changed In Practice
Before the fix, translating a large file like Cxx_Grammar.C could turn into:
- parse the main file,
- see a large header graph,
- eagerly recurse into non-main-file namespaces and templates,
- build SAGE declarations for a huge amount of suppressed header surface,
- run postprocessing and backend checks over that enlarged AST,
- run out the CTest clock.
After the fix, the same class of translation looks more like:
- parse the main file,
- materialize the source-backed declarations and semantic dependencies needed for correctness,
- preserve application-header surface when it is part of the program boundary,
- preserve extra written surface only for token/source-position workflows,
- avoid recursive system namespace/template expansion for ordinary compile tests,
- finish in bounded time and expose any remaining structural bugs as ordinary failures.
That last item is the payoff. A bounded failure is debuggable. An unbounded traversal is not.
Lessons For Compiler Test Campaigns
This bug reinforced a few rules that are now central to the REX cleanup work.
A Timeout Is Usually A Design Smell
Sometimes a test is just too large for the default timeout. But a compiler timeout during a migration is more often a sign that a boundary is missing:
- a traversal does not know when to stop,
- a cache key is too weak,
- a pass revisits work it already completed,
- or a frontend imports an AST surface larger than the compiler contract requires.
Treat the timeout as a profiling clue, not as a testing inconvenience.
Do Not Fix Generated Output By Lying In The Unparser
If generated C++ has the wrong declaration order, wrong type, wrong scope, or wrong qualification, the backend may be where the symptom appears. That does not mean the backend is where the fix belongs.
For Cxx_Grammar.C, the important fixes were in the Clang frontend and AST invariants. The backend should emit the AST it is given. If the AST is wrong, fix the AST.
Preserve More Source Only When A Workflow Needs It
Source-to-source compilers have multiple modes of correctness. Token preservation is real. Source-position validation is real. Ordinary compile-only translation is also real.
A single “always preserve every reachable header declaration” policy is too blunt. It makes small tests expensive and large tests pathological.
The better rule is to make preservation requirements explicit.
Keep The Frozen Set Frozen
The anti-bounce discipline was as important as the code change. Without a frozen failure set, it is too easy to fix a timeout and accidentally stop noticing an earlier failure moving back into red.
The frozen set turns progress into an accounting problem:
- which original failures are resolved,
- which remain,
- which newly failing tests are outside the original set,
- and whether a patch reduced the unresolved count without reintroducing old failures.
That is tedious, but it is the only way to make a thousand-failure cleanup campaign honest.
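The accounting above amounts to a few set operations over test names. The sketch below assumes the frozen list and the current failure list have already been read into sets; the `classify` and `bounced` helpers and the test names used with them are illustrative, not part of CTest or the REX tooling.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

using TestSet = std::set<std::string>;

struct Accounting {
    TestSet resolved;   // in the frozen set, no longer failing
    TestSet remaining;  // in the frozen set, still failing
    TestSet outside;    // failing now, but not part of the frozen set
};

// Split the current failures against the frozen baseline.
inline Accounting classify(const TestSet& frozen, const TestSet& failing_now) {
    Accounting a;
    std::set_difference(frozen.begin(), frozen.end(),
                        failing_now.begin(), failing_now.end(),
                        std::inserter(a.resolved, a.resolved.end()));
    std::set_intersection(frozen.begin(), frozen.end(),
                          failing_now.begin(), failing_now.end(),
                          std::inserter(a.remaining, a.remaining.end()));
    std::set_difference(failing_now.begin(), failing_now.end(),
                        frozen.begin(), frozen.end(),
                        std::inserter(a.outside, a.outside.end()));
    // Every frozen test is either resolved or still failing.
    assert(a.resolved.size() + a.remaining.size() == frozen.size());
    return a;
}

// A bounce is a test that was resolved before a patch and fails again after it.
inline TestSet bounced(const TestSet& resolved_before, const TestSet& failing_after) {
    TestSet out;
    std::set_intersection(resolved_before.begin(), resolved_before.end(),
                          failing_after.begin(), failing_after.end(),
                          std::inserter(out, out.end()));
    return out;
}
```

Under this scheme a patch is acceptable only when `bounced` comes back empty and the `remaining` set did not grow.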
The State After The Fix
After the traversal-policy repair, the Cxx_Grammar.C test stopped being a timeout. It remained expensive, but it completed under the original CTest limit and passed in repeated frozen runs.
That changed the nature of the campaign. Instead of waiting 1500 seconds for one opaque timeout, the suite could continue through the remaining failures and classify them:
- ELSA-derived frontend specimens,
- move-declaration transformations,
- token stream and source-position tests,
- C/C++ compile-diff tests,
- callgraph and midend analysis failures,
- and later concrete Cxx_Grammar.C type-construction regressions.
That is exactly what a good compiler fix should do. It should not hide the rest of the work. It should make the rest of the work visible.
Closing
The Cxx_Grammar.C timeout was a useful failure because it forced REX to answer a frontend design question directly:
What part of Clang’s AST does a REX translation actually own?
The answer is not “everything Clang can see.” The answer is the source-backed and semantic surface required by the current workflow, with explicit preservation paths for token and source-position-sensitive modes.
Once that boundary became explicit, the timeout disappeared without weakening the tests. More importantly, the compiler became easier to debug. The next failures were no longer long silences. They were ordinary, localized invariants.
That is the kind of progress that matters in a frontend migration.