How REX Cleaned Up A Thousand Historical Test Failures Without Bounce

REX did not clear roughly one thousand historical CTest failures simply because an agent guessed at tests quickly. The important part was the no-bounce workflow: freeze the original failure set, fix one small frontier at a time, build, run the core gate, rerun the complete original failure set, and reject any patch that brought an old failure back. Codex with GPT-5.5 supplied persistence and implementation speed, but the test loop supplied authority. The campaign ran from the afternoon of May 12 to the afternoon of May 24, and the merged PR finished with 955/955 frozen failures fixed, 6420/6420 core-gate tests passing, and a final full local CTest result of 31165/31165.

The REX Clang frontend cleanup did not look like one heroic fix.

It looked like a long sequence of small decisions that could easily have gone wrong. The original full-suite run had roughly one thousand failures. Many were C and C++ frontend failures, but the failure surface was not limited to parsing. Once a source-to-source compiler builds an inconsistent AST, every later layer becomes a possible reporter:

  • generated source fails to compile,
  • token streams stop matching nodes,
  • source-position checks report drift,
  • name qualification chooses the wrong spelling,
  • dataflow and callgraph passes assert on unexpected AST shapes,
  • OpenMP and Fortran gates become collateral damage if a broad change moves shared infrastructure.

That is why the campaign was not organized around “make the next red test green.” It was organized around a stricter rule:

No accepted patch may make an already-fixed original failure fail again.

That rule sounds simple. In practice, it mattered more than raw coding speed.

A timeline showing the REX cleanup moving from roughly one thousand original failures to the frozen set passing and then a full CTest pass.

Figure 1. The important movement was not just the failure count going down. The important part was keeping earlier fixes fixed while the frontier moved.

The Starting Point

The work started from an uncomfortable but useful baseline.

The REX fork had completed the move to the LLVM 22 Clang frontend for C and C++, but the test suite still carried a large historical failure set from the transition. The frontend could parse and translate many programs, but it was not yet mature enough to survive the whole ROSE/REX regression surface.

The early failure count was large enough that individual test names were not a good plan. A list of one thousand tests is not a diagnosis. It is a symptom inventory.

The failures fell into broad families:

  • Clang-to-Sage declaration construction,
  • defining and nondefining declaration pairing,
  • scope and symbol insertion,
  • tag, typedef, enum, lambda, and template representation,
  • source locations and token mapping,
  • generated-source compile failures,
  • unparser ordering and name qualification,
  • midend assumptions about AST shape,
  • long-running tests and stale harness behavior,
  • reference outputs that had drifted from now-correct compiler output.

The old EDG frontend history gave useful design clues, but it could not be copied. The current frontend was Clang-based, the project was pinned to LLVM 22, and the objective was to make the current architecture correct, not recreate an old one.

The campaign therefore started with a frozen baseline.

Freezing The Failure Set

The first rule was to stop chasing the moving list of failures that CTest's --rerun-failed option reads.

That list is useful for a developer who wants to rerun the tests that failed in the last CTest invocation. It is dangerous for a long repair campaign. Once a fix lands, a new run rewrites the list. If the next patch reintroduces an earlier failure, the moving list may no longer contain the test that would have caught it.

So the original failed-test names were frozen into a stable list. After that, every patch was judged against the same original failure set, not the latest failure residue.
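As an illustration of the freezing step, here is a minimal Python sketch. It is not the script used in the campaign. It assumes the failed-test log contains one `index:name` entry per line, which is how CTest's last-failed log is commonly laid out; the function name `freeze_failed_tests` and the sample test names are hypothetical.

```python
from typing import Iterable, List


def freeze_failed_tests(log_lines: Iterable[str]) -> List[str]:
    """Return a sorted, de-duplicated list of test names from a failed-test log.

    Assumption: each non-empty line looks like "<index>:<test name>".
    Lines without a numeric index are treated as bare test names.
    """
    names = set()
    for raw in log_lines:
        line = raw.strip()
        if not line:
            continue
        # Split only on the first colon so names that contain ':' survive.
        index, sep, name = line.partition(":")
        if sep and index.strip().isdigit():
            names.add(name.strip())
        else:
            names.add(line)
    return sorted(names)


if __name__ == "__main__":
    # Hypothetical test names, for illustration only.
    sample = [
        "12:frontend_decl_pairing",
        "",
        "873:unparse_template_qualified",
        "12:frontend_decl_pairing",
    ]
    for test_name in freeze_failed_tests(sample):
        print(test_name)
```

The point of the sketch is that the output is written once and never regenerated. Later runs are compared against it instead of replacing it.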

The acceptance loop became:

make a focused root-cause change
build REX
run the relevant frontier tests
run the core non-regression gate
run the complete frozen original failure set
accept only if resolved tests stay resolved

That is a slow loop compared with running one test. It is also the reason the cleanup could compound instead of oscillating.
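The ordering of the loop can be sketched as a short-circuiting gate runner. This is a hedged illustration in Python, not the project's tooling: the `Gate` type, the `run_gates` function, and the placeholder checks are all hypothetical. In a real setup each check would invoke the build or the relevant CTest subset.

```python
from typing import Callable, Optional, Sequence, Tuple

# A gate is a (name, check) pair; the check returns True when the gate passes.
Gate = Tuple[str, Callable[[], bool]]


def run_gates(gates: Sequence[Gate]) -> Optional[str]:
    """Run gates in order and return the name of the first failing gate.

    Returns None when every gate passes. Later gates are not run after a
    failure, so the cheap checks protect the expensive ones.
    """
    for name, check in gates:
        if not check():
            return name
    return None


if __name__ == "__main__":
    # Placeholder checks standing in for real build and test invocations.
    gates = [
        ("build", lambda: True),
        ("frontier tests", lambda: True),
        ("core gate", lambda: False),
        ("frozen failure set", lambda: True),
    ]
    failed = run_gates(gates)
    print("accepted" if failed is None else f"rejected at: {failed}")
```

Putting the frozen set last keeps the most expensive check behind the cheaper ones, but a patch is accepted only when the function returns None.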

A loop showing focused patch, build, core gate, frozen failure set, ledger comparison, and either accept or rework.

Figure 2. The no-bounce loop made each patch prove that it solved a local problem without reintroducing an older one.

Why Compiler Test Suites Bounce

Compiler tests bounce because the layers are coupled.

A frontend declaration fix may make a generated source file compile, but it can change which declarations the unparser sees. A token-stream fix may repair a source-position test while changing how comments attach to nearby statements. A name-qualification change may fix one template test and expose that another type was built with the wrong underlying declaration. A midend traversal fix may quiet an assertion but hide a missing symbol until a later pass asks for it.

That coupling is why naive repair is so risky. A local pass is not enough evidence.

In REX, the main danger was especially clear: a string workaround in the unparser could hide a malformed AST. That might turn one generated file green, but it would make the next bug harder to diagnose. The rule was therefore strict:

fix the AST, enum, type, declaration, scope, or symbol invariant;
do not make the unparser lie about malformed state.

The same principle applied to tests. Stale references could be updated only when the output was semantically equivalent and the change was not masking a compiler bug. Comment placement needed special care. A comment that belonged before a statement could not silently move after it. A standalone comment could not be normalized into a trailing comment just because that made a diff pass.

This made the work slower. It also made the final result meaningful.

Why One Giant Patch Would Have Failed

The tempting way to attack a thousand failures is to make broad changes until a large group disappears. That can work for a mechanical rename. It is a poor fit for frontend stabilization.

A broad patch hides causality. If fifty tests start passing, ten tests start failing, and three new assertions appear, it is difficult to know which part of the patch changed the invariant. Worse, some of the new failures may not be new at all. They may be old failures that were temporarily hidden behind an earlier abort.

The cleanup therefore favored small frontiers:

one failure signature,
one ownership invariant,
one generated-source family,
one token/source-position behavior,
one midend assumption.

That did not mean each commit fixed only one test. Some root causes had high fan-out. A declaration-pair fix could clear dozens of generated-source failures. A traversal-boundary fix could turn a timeout family into normal long-running tests. A target-option fix could unblock many x86 or ABI-sensitive specimens.

The difference was that each patch still had a coherent reason. If the reason could not be stated, the patch was not ready.

This is important for agent-assisted work. Agents can produce large diffs quickly. A large diff is not progress unless the project can explain why it is correct. The no-bounce loop rewarded patches that were small enough to reason about but deep enough to remove a real root cause.

The Twelve-Day Shape

The focused run began on the afternoon of May 12 and finished on the afternoon of May 24. The PR was merged on May 25.

The work did not proceed in a perfect linear order, but the shape was clear.

The early phase was about making the Clang frontend preserve enough target and option state to run the right compiler path, then repairing obvious frontend aborts. After that, most progress came from AST invariants: declarations needed correct parents, scopes, symbols, defining/nondefining links, typedef and tag relationships, template parameter mappings, and lambda/class ownership.

The next phase stabilized generated C and C++ output. That did not mean adding string hacks. It meant making declaration ordering, type construction, name qualification, and AST state coherent enough that normal unparsing produced compilable code.

Then the token, source-position, header, and comment tests became useful. They were no longer reporting only structural frontend collapse. They could point at narrower source-preservation issues.

Midend failures came later. Dataflow, callgraph, CFG, outlining, inlining, move-declaration, and normalization tests were not unrelated. They were consumers of the AST. Once the frontend generated more Clang-built Sage nodes, those consumers needed to stop assuming only the older frontend shapes existed.

The timeout work fit into the same loop. The Cxx_Grammar.C timeout, already documented in the previous case study, was fixed by owning the frontend traversal boundary rather than raising the timeout or special-casing the file.

By the end, the original frozen failure set was clean:

955/955 frozen failures passed
6420/6420 core-gate tests passed
31165/31165 full CTest tests passed

The final local full run reported:

100% tests passed, 0 tests failed out of 31165

Why The Final Full Run Still Mattered

Even after the frozen set and core gate were green, the final full run mattered.

The frozen set proves that the original known failures were addressed. The core gate proves that the most sensitive REX, OpenMP, Fortran, and representative C++ areas did not regress. A full run asks a broader question:

did the accumulated cleanup disturb anything outside the original failure map?

That question is important because compiler changes can affect tests that were never red during the campaign. A source-location fix might affect a passing token test. A declaration-ordering change might affect a passing generated-source specimen. A hook or test-output-path change might affect a passing infrastructure test.

The final 31165/31165 result is therefore not redundant. It is the step that turns “the known red list is gone” into “the local suite is now a clean baseline again.”

What Codex Changed

Codex with GPT-5.5 mattered, but not in the way a shallow story would frame it.

The value was not that the agent could write a patch quickly. Fast local patches are easy. The value was that it could stay inside the workflow for a long time:

  • read failing logs,
  • infer root causes,
  • make scoped changes,
  • avoid unrelated reversions,
  • rerun the required gates,
  • compare against the frozen set,
  • respond to review comments,
  • keep going after context compactions,
  • and avoid “green by hiding the issue” shortcuts.

Earlier models could be persistent for a while, but long compiler repair tends to punish shallow persistence. If the model keeps pushing local fixes after the global invariant has shifted, it cycles. If it forgets the no-bounce rule after a context transition, it loses the campaign.

GPT-5.5 crossed a practical threshold in this run. It still needed hard gates. It still needed review. It still needed exact logs. But its reasoning and persistence were good enough that the loop compounded instead of constantly resetting.

A diagram showing Codex producing candidate fixes, while frozen tests, full CTest, review comments, and hook checks decide whether the fixes are trusted.

Figure 3. The agent was not the oracle. The workflow made the agent useful by letting tests, review, and frozen evidence decide acceptance.

Concrete Acceptance Examples

Several moments show why the workflow mattered.

The Cxx_Grammar.C timeout could have been handled by raising a timeout. That would have made the dashboard less red, but it would not have answered why ordinary translation was walking too much header and template surface. The accepted fix changed traversal policy and then proved that the test completed without reintroducing frozen failures.

Reference-output updates could have been applied mechanically. Instead, they were treated as compiler evidence. Stable whitespace changes were acceptable. A questionable comment movement was not automatically accepted, because a source-to-source compiler must preserve user-facing source structure with care.

The live/dead lattice review comment near the end is another useful example. The implementation was doing a deep copy, but the raw pointer interface made ownership ambiguous. The fix was not a comment explaining the convention after the fact. The API changed to a non-owned reference, so the signature and behavior agreed.

Those examples are small compared with the whole PR, but they capture the operating rule:

when a patch looks locally good but weakens future reasoning,
make the invariant explicit instead.

The Agent Was Not The Oracle

It is important to state this plainly.

The agent did not prove REX correct. The test suite did not prove REX correct either. A compiler can pass a large suite and still contain bugs.

What the campaign proved is narrower and still significant:

the original frozen failure set no longer fails;
the core regression gate remains green;
the final local full CTest run is green;
the accepted patches survived review and hook checks;
the fixes did not rely on test masking or unparser string hacks.

That is the right kind of claim for a compiler cleanup. It is evidence, not mythology.

Why No-Bounce Mattered More Than Speed

The no-bounce rule changed the economics of the work.

Without it, the fastest path would be to chase whatever failure is visible now. That creates an illusion of progress. A patch fixes ten tests, breaks five older ones, and the visible failure list changes shape. The campaign looks active, but the project may not be getting more stable.

With the frozen set, every accepted patch had to preserve accumulated progress. That made the work monotonic in the only way that mattered: the original failures could disappear, but they could not be allowed to return.
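The monotonic property can be stated as a set comparison. The Python sketch below is an illustration under stated assumptions, not the ledger format the campaign used: the function names `bounced_tests` and `next_resolved` and the sample test names are hypothetical.

```python
from typing import Iterable, Set


def bounced_tests(
    frozen: Iterable[str],
    resolved_before: Iterable[str],
    failing_now: Iterable[str],
) -> Set[str]:
    """Return frozen tests that were resolved earlier but fail again now."""
    return set(frozen) & set(resolved_before) & set(failing_now)


def next_resolved(frozen: Iterable[str], failing_now: Iterable[str]) -> Set[str]:
    """Return the frozen tests that pass after the current patch."""
    return set(frozen) - set(failing_now)


if __name__ == "__main__":
    # Hypothetical test names, for illustration only.
    frozen = {"decl_pair", "token_map", "name_qual", "cfg_build"}
    resolved_before = {"decl_pair", "token_map"}
    failing_now = {"token_map", "cfg_build"}

    bounced = bounced_tests(frozen, resolved_before, failing_now)
    if bounced:
        print("reject, bounced:", sorted(bounced))
    else:
        print("accept, resolved:", sorted(next_resolved(frozen, failing_now)))
```

A patch is acceptable only when the bounced set is empty. Under that rule the resolved set can only grow from one accepted patch to the next, even if the patch fixes fewer tests than a bolder change would have.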

The difference is especially important for automated repair. An agent can generate a lot of plausible code. A compiler project does not need more plausible code. It needs changes that survive the whole system.

The no-bounce loop turned agent output into candidate patches and gave the project a way to reject candidates that were locally attractive but globally wrong.

What This Means For REX

The merged PR does not mean the Clang frontend is finished forever. It means REX crossed a stability threshold.

The project now has a fully green local suite on the LLVM 22 Clang frontend path. That changes what future work feels like. New failures can be treated as regressions, not as noise in a large historical pile. Reviewers can ask sharper questions. Future frontend work has a clean baseline. Long tests like Cxx_Grammar.C are now expensive tests, not unresolved timeouts.

The most useful lesson is procedural:

large compiler cleanups should optimize for monotonic trust, not local speed.

Codex with GPT-5.5 made the twelve-day run possible in practice. The frozen failure set made it trustworthy.