What A Twelve-Day Codex Session Revealed About Persistent Engineering Agents

Posted on May 28, 2026

The REX frozen CTest cleanup was a practical test of persistent engineering agents. Earlier Codex models could work for a while, but long compiler repair often ended in local cycling or brittle fixes. GPT-5.5 crossed a threshold where persistence, global judgment, and verification discipline compounded across many context transitions. The important lesson is not that an agent can write code quickly. It is that a strong enough agent can stay inside a strict workflow long enough for small verified repairs to accumulate into a large compiler cleanup.

The REX frozen-failure cleanup was a compiler project, but it was also an agent project.

For twelve days, Codex worked through a large historical CTest failure set in a real codebase. The task was not a toy benchmark. It involved a Clang frontend, a source-to-source AST, an unparser, token and source-position preservation, OpenMP and Fortran guardrails, midend analyses, generated files, reference outputs, review comments, hooks, and full-suite CTest runs.

That is the kind of task where many agent demos stop being useful.

It is easy for an agent to make one failing test pass. It is harder for an agent to keep a thousand-test repair campaign moving without losing earlier progress, inventing brittle shortcuts, or forgetting the workflow after context compaction.

This run made one thing clear:

persistence by itself is not enough.

Persistent engineering agents need stamina, but they also need enough global judgment to avoid converging on local fixes that poison the system.

Figure 1. The threshold was not raw speed. It was the point where persistence and judgment became strong enough for verified repairs to accumulate. (Diagram: earlier agents make local progress but then cycle, while a stronger agent crosses a threshold where verified repairs compound.)

Earlier Persistence Was Not Enough

Earlier Codex models already felt different from one-shot chat.

They could inspect a codebase, run commands, patch files, build, test, and keep a task alive across multiple tool calls. That was already useful. But long compiler work exposed a ceiling.

The pattern was familiar:

the model would make good progress for a while,
it would fix local failures,
then it would reach a failure that required a broader invariant,
and instead of stepping back, it would keep pushing local patches.

That is dangerous in a compiler.

If the frontend creates a malformed declaration, an unparser workaround may look attractive. If a test times out, raising the timeout may look attractive. If a reference diff changes, updating the reference may look attractive. Each local step can be made to sound plausible. The project gets worse if the agent cannot tell when a plausible local fix violates the global contract.

The difference with GPT-5.5 was not that it never made mistakes. It was that it more often recovered into the right workflow:

find the invariant,
make a scoped fix,
run the required gates,
reject bounces,
continue.

That behavior is what allowed the long run to compound.

Why Compiler Work Is A Hard Agent Test

Compiler work is a hard agent test because failures are indirect.

The failing test is often not the broken layer. A generated-source compile error might be caused by frontend type construction. A token-stream failure might be caused by source-location ownership. A callgraph failure might be caused by a valid AST shape that an older traversal never handled. A timeout might be a traversal-policy bug, not a need for a bigger timeout.

An agent that treats the last error message as the whole problem will write brittle fixes.

REX made this especially clear because the project is a source-to-source compiler. The output code is readable, but that is a trap. If generated code looks wrong, the fix is usually not to print a different string. The fix is to ask why the AST state led to that output.

That requires a global model:

Clang AST in,
Sage AST construction,
declaration and type invariants,
source-preservation metadata,
unparse behavior,
downstream analyses,
CTest layers,
review constraints.

The agent needed to keep those layers in mind while still editing concrete C++ files.

What Changed With GPT-5.5

The practical change was that the agent could stay aligned with the repair protocol for longer.

It could carry forward constraints like:

no unparser hacks;
do not mask tests;
do not soften assertions casually;
all dirty changes are intentional work on the failure set;
run the frozen set after accepted patches;
fix full CTest failures before pushing;
respond to review comments with root-cause changes.

Those constraints are not glamorous. They are the difference between a helpful agent and a dangerous one.

The model also handled context compaction better. Long sessions inevitably compress. Important context can disappear from the active prompt. A weaker agent may restart the task mentally, repeat old investigations, or forget the acceptance criteria. In this run, the important workflow survived compaction often enough that progress continued.

Figure 2. Long agent work needs durable external state. The frozen set, logs, branch history, and PR comments helped the workflow survive context transitions. (Diagram: context compaction bridged by frozen failure files, logs, PR review comments, branch state, and CTest evidence.)

What Would Have Broken The Session

Several failure modes could have ended the run even if the model kept working.

The first is silent scope drift. If the task had gradually changed from “fix the frozen CTest failures” into “make whatever test is currently red pass,” the no-bounce guarantee would have disappeared. Long agent sessions need a stable objective that survives many turns.

The second is tool-output amnesia. Compiler work generates logs, failure lists, patches, review comments, and branch state. If those artifacts are not treated as durable evidence, the agent can repeat old investigations or accept a patch that was already rejected in another form.

The third is permission to hack around symptoms. An agent that is allowed to edit tests, hide assertions, or add unparser string fallbacks will often find a short path to green output. That path is especially dangerous because it can look productive in the moment.

The fourth is weak human feedback. The strongest interventions in this run were not vague. They were direct constraints: no unparser hacks, use enum and AST state, keep test artifacts in the build tree, verify comment placement, fix full CTest failures. Those constraints gave the agent less room to rationalize a shortcut.

The fifth is insufficient model judgment. A workflow can reject bad patches, but if the model produces mostly bad local patches, the work will stall. GPT-5.5 was useful because it produced enough globally plausible fixes that the verification loop could keep accepting progress.

The Agent Was Not The Source Of Truth

The important lesson is not “trust the agent.”

The lesson is:

make the agent work inside a system that can reject it.

During the REX cleanup, the source of truth was not the model’s confidence. It was the combination of:

compiler assertions,
focused tests,
the core gate,
the frozen original failure set,
full CTest,
review comments,
hook logs,
clean git state.

The agent proposed code. The repository judged it.

That distinction matters because agents are good at producing plausible explanations. Plausibility is useful for exploration. It is not sufficient for accepting a compiler change.

The no-bounce rule made this concrete. A patch was not accepted because it fixed the test that motivated it. It was accepted only if it preserved previously fixed original failures.

That turned the agent from an oracle into a worker inside a verification loop.
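The no-bounce rule can be stated as a small set computation. The sketch below is illustrative only, not the tooling used in the REX run: the function name `judge_patch`, the `Verdict` type, and the test names in the usage lines are all hypothetical. It treats "no bounce" as a necessary condition for acceptance, on the assumption that the other gates (build, core gate, full CTest) are checked separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    accepted: bool          # True only if no previously fixed test regressed
    newly_fixed: frozenset  # frozen tests that this patch moved to passing
    bounced: frozenset      # previously fixed frozen tests that fail again


def judge_patch(frozen_failures, fixed_so_far, failing_now):
    """Apply a no-bounce check to one candidate patch (hypothetical helper).

    frozen_failures: test names in the original frozen failure set
    fixed_so_far:    frozen tests that earlier accepted patches made pass
    failing_now:     tests that fail with the candidate patch applied
    """
    frozen = frozenset(frozen_failures)
    fixed = frozenset(fixed_so_far) & frozen
    failing = frozenset(failing_now)

    # A bounce is a previously fixed frozen test that is failing again.
    bounced = fixed & failing
    # Progress is a frozen test that was still open and no longer fails.
    newly_fixed = (frozen - fixed) - failing

    return Verdict(accepted=not bounced, newly_fixed=newly_fixed, bounced=bounced)


# Hypothetical usage: the patch fixes "t2" but breaks "t1" again,
# so it is rejected even though it made progress on its target.
verdict = judge_patch(
    frozen_failures={"t1", "t2", "t3"},
    fixed_so_far={"t1"},
    failing_now={"t1", "t3"},
)
```

The point of returning `bounced` rather than a bare boolean is that a rejection should name the tests that regressed, so the next attempt starts from evidence instead of a guess.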

Raw Speed Was Not The Key Metric

Raw coding speed is easy to overvalue.

If the task were simply to write many lines, an agent’s advantage would be obvious but shallow. The REX cleanup was not line-count work. It was decision work under constraints.

The hard questions were:

Is this failure frontend, unparser, midend, test harness, or environment?
Is this reference update semantically safe?
Is this timeout hiding a traversal bug?
Is this pointer API transferring ownership or copying?
Is this x86-specific failure a platform test issue or a target propagation bug?
Is this EDG-era behavior a useful design clue or an unsuitable legacy assumption?
Does this patch reduce the frozen failure set without bounce?

These are not solved by typing faster.

GPT-5.5 was useful because it could repeatedly connect local evidence to those broader questions. It still needed correction. It still needed review. But it crossed the threshold where global judgment was good enough to keep the campaign moving.

The Surprising Part Was Recovery

The most surprising part of the session was not that the agent could make progress. It was that it could recover.

Long engineering work always has wrong turns. A diagnosis can be incomplete. A test can expose a second bug after the first one is fixed. A review comment can reveal that an implementation is correct in behavior but unclear in ownership. A full CTest run can fail late after hours of apparent success.

In those moments, a weaker agent often becomes defensive or repetitive. It explains why the current patch should have worked, reruns the same command, or narrows the problem until the global constraint disappears.

The useful behavior is different:

accept the new evidence,
locate the exact failure,
revise the root-cause model,
make the smaller correct change,
and rerun the gate that matters.

That recovery pattern appeared repeatedly in the REX cleanup. Review comments were not treated as obstacles to answer with prose. They became evidence for tightening APIs, clarifying ownership, or removing ambiguity. Full CTest failures were not treated as noise. They became blockers to investigate before push.

That is a more important capability than first-try correctness. In a large codebase, first-try correctness is rare. Durable recovery is what lets a session continue.

The Role Of Human Direction

The human role did not disappear.

The user enforced important constraints:

no unparser hacks,
no fragile string-based attributes,
fix enum and AST roots,
do not treat EDG as a source to copy,
keep test artifacts out of the source tree,
verify reference-output updates carefully,
fix full CTest failures before push,
address review comments by root cause.

Those interventions mattered. They sharpened the workflow and prevented the agent from choosing easy exits.

This is probably the right shape for serious agent-assisted engineering today. The agent handles a large amount of investigation and implementation. The human sets boundaries, catches suspicious patterns, and raises the quality bar when the agent might drift toward a brittle local solution.

What This Does Not Prove

This session does not prove that agents can replace maintainers.

It does not prove that all large compiler work can be automated. It does not prove that GPT-5.5 will succeed on every long-running task. It does not remove the need for review, tests, local expertise, or hard project constraints.

The result is narrower:

given a mature codebase,
a clear objective,
a strong enough model,
durable external task state,
and strict accept/reject gates,
an engineering agent can carry a large repair campaign to completion.

That is still a meaningful threshold.

The distinction matters because the wrong lesson would be to make agent workflows more permissive. The right lesson is the opposite. The more capable the agent becomes, the more valuable strict boundaries become, because they turn capability into reliable progress instead of high-speed drift.

What Persistent Agents Need

This run suggests a practical checklist for persistent engineering agents.

They need durable task state outside the model context. In this case, the frozen failure set, CTest logs, branch history, review comments, and hook logs were essential.
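One way to keep that state outside the model context is a small ledger file that every turn reads and updates. This is a minimal sketch under assumptions: the JSON layout, the key names (`frozen_failures`, `fixed`, `rejected_patches`), and the helper names are invented for illustration and do not describe the files used in the REX run.

```python
import json
from pathlib import Path


def load_ledger(path):
    """Read the ledger, or start an empty one if the file does not exist yet."""
    p = Path(path)
    if not p.exists():
        return {"frozen_failures": [], "fixed": [], "rejected_patches": []}
    return json.loads(p.read_text(encoding="utf-8"))


def save_ledger(path, ledger):
    """Write to a temporary file first, then rename it over the ledger,
    so an interrupted run does not leave a truncated file behind."""
    p = Path(path)
    tmp = p.with_name(p.name + ".tmp")
    tmp.write_text(json.dumps(ledger, indent=2, sort_keys=True), encoding="utf-8")
    tmp.replace(p)


def record_fixed(ledger, test_name):
    """Mark a frozen test as fixed; refuse names outside the frozen set,
    which guards against silent scope drift."""
    if test_name not in ledger["frozen_failures"]:
        raise ValueError(f"{test_name!r} is not in the frozen failure set")
    if test_name not in ledger["fixed"]:
        ledger["fixed"].append(test_name)


def record_rejection(ledger, summary, reason):
    """Keep a note about a rejected patch so it is not retried in another form."""
    ledger["rejected_patches"].append({"summary": summary, "reason": reason})


def remaining(ledger):
    """Frozen failures that no accepted patch has fixed yet."""
    return sorted(set(ledger["frozen_failures"]) - set(ledger["fixed"]))
```

Because the ledger lives on disk, a session that has lost context can rebuild its position from `remaining(...)` and the rejection notes rather than from memory.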

They need a clear accept/reject loop. “Try to fix tests” is too vague. “Fix a small frontier, build, run the core gate, run the frozen set, reject bounces” is actionable.
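The actionable version of that loop is an ordered list of gates that stops at the first failure. The sketch below assumes each gate is a command whose exit status is zero on success. The gate names mirror the ones in this post, but the example commands, the `build` directory, and the `core` and `frozen` CTest labels are hypothetical placeholders, not the commands from the REX run.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class GateResult:
    name: str
    passed: bool
    output: str  # combined stdout and stderr, kept as evidence


def run_gates(gates):
    """Run (name, command) gates in order and stop at the first failure.

    Later gates are skipped after a failure, so an expensive full-suite
    run is never spent on a patch that already failed a cheaper gate.
    """
    results = []
    for name, command in gates:
        proc = subprocess.run(command, capture_output=True, text=True)
        passed = proc.returncode == 0
        results.append(GateResult(name, passed, proc.stdout + proc.stderr))
        if not passed:
            break
    return results


def patch_accepted(gates, results):
    """A patch is accepted only if every gate ran and every gate passed."""
    return len(results) == len(gates) and all(r.passed for r in results)


# Hypothetical gate list; the build directory and label names are assumptions.
EXAMPLE_GATES = [
    ("build", ["cmake", "--build", "build"]),
    ("core gate", ["ctest", "--test-dir", "build", "-L", "core", "--output-on-failure"]),
    ("frozen set", ["ctest", "--test-dir", "build", "-L", "frozen", "--output-on-failure"]),
]
```

Keeping each gate's output in the result matters as much as the pass/fail bit: a rejection without a log invites the agent to guess at the cause.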

They need enough global reasoning to identify the layer that owns a bug. A source-to-source compiler should not fix AST bugs in the unparser. A timeout should not be treated as permission to hide work behind a bigger timeout. A test reference should not be updated until the semantic change is understood.

They need humility in the face of evidence. When a full CTest failure appears, the agent should not guess. It should find the exact failure, inspect logs, reproduce if needed, and fix the actual root cause.
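Finding the exact failure starts with reading the run's own summary instead of guessing. The sketch below extracts failed test names and their reported status from CTest console output. It assumes the summary layout that CTest prints after a failing run, a `The following tests FAILED:` header followed by lines of the form `N - name (Status)`; the helper name and the test names in the sample are hypothetical.

```python
import re

# One summary line, for example: "\t  3 - callgraph_large (Timeout)"
_FAILED_LINE = re.compile(r"^\s*(\d+)\s+-\s+(.+?)\s+\(([^()]+)\)\s*$")


def parse_failed_tests(ctest_output):
    """Return {test_name: status} from the failure summary of a CTest run.

    Only lines after the "The following tests FAILED:" header are read,
    and parsing stops at the first non-blank line that is not a summary entry.
    """
    failures = {}
    in_summary = False
    for line in ctest_output.splitlines():
        if line.strip() == "The following tests FAILED:":
            in_summary = True
            continue
        if not in_summary or not line.strip():
            continue
        match = _FAILED_LINE.match(line)
        if match is None:
            break
        failures[match.group(2)] = match.group(3)
    return failures


# Hypothetical sample output with made-up test names.
SAMPLE = (
    "33% tests passed, 2 tests failed out of 3\n"
    "\n"
    "The following tests FAILED:\n"
    "\t  2 - unparse_roundtrip (Failed)\n"
    "\t  3 - callgraph_large (Timeout)\n"
    "Errors while running CTest\n"
)
failed = parse_failed_tests(SAMPLE)
```

Separating `Timeout` from `Failed` is deliberate: a timeout is a prompt to look for a traversal-policy bug, not a reason to raise the limit.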

Figure 3. The best agent behavior was not constant forward motion. It was repeatedly choosing the slower path when that was the only path that preserved trust. (Diagram: a brittle local-fix loop contrasted with a global judgment loop that checks root cause, gates, frozen failures, and review feedback.)

The Bar Should Get Higher

The right response to a stronger agent is not to lower the review bar. It is to raise the kind of work the agent is allowed to attempt while keeping the acceptance bar strict.

That means future sessions should ask for better failure ledgers, clearer patch boundaries, better final reports, and more explicit evidence trails. If an agent can handle more context and longer tasks, the project should use that capacity to demand more traceability, not less.

The REX run worked because the workflow stayed strict while the agent became more capable. That is the combination worth preserving.

What This Changed For Me

The session changed my expectation of what an engineering agent can do.

Before this, I expected agents to be useful for scoped tasks and occasionally impressive on larger ones, but I also expected them to stall at the hard boundary where local fixes stopped working. This run crossed that boundary.

That does not mean future compiler work can be delegated blindly. It means a strong enough model, placed inside a strict enough workflow, can carry a large repair campaign further than I previously expected.

The important phrase is “inside a strict workflow.”

Without the frozen failure set, the full CTest runs, the review comments, and the no-hack constraints, the same speed could have created a mess. With those constraints, the speed became useful.

Closing

The REX cleanup is a good case study because it is concrete. It ended in a merged PR, a green full suite, and a cleaner compiler baseline. It also produced a clearer lesson about agents:

persistent agents become valuable when their work can compound under verification.

GPT-5.5 made that compounding possible in this session. The no-bounce workflow made it trustworthy.