Why Codex + GPT-5.2 Felt Like a Different Kind of Agent
The first time I used Codex with GPT-5.2, the change did not feel incremental.
It felt like the category changed.
Before that, most agent experiences had the same general pattern. They looked impressive for a while, sometimes for an hour, sometimes for two. Then they started to soften. They stopped pushing into the hard parts. They began narrating instead of doing. They asked for meaningless confirmation. They proposed options that were all obviously just more delay. They did not explicitly refuse, but they stopped behaving like something you could actually hand a difficult task to and leave alone.
Codex with GPT-5.2 was the first combination where that pattern clearly broke for me.
It was not perfect. It still got stuck, still needed correction, still made wrong turns, still benefited from supervision. But the important thing was different: it kept going. It could stay on task through failure, context pressure, long tool runs, partial progress, and messy intermediate states in a way I had not seen before.
That is why the surprise was bigger than what I later felt from GPT-5.3 or GPT-5.4.
Those later models are better models. That part is obvious. But the first real shock came from realizing that Codex + GPT-5.2 had crossed a threshold from “sometimes convincing” into “actually persistent.” Once that threshold had been crossed, later improvements felt more like gains on top of an already new baseline.
This post is about what I think actually happened there.
This write-up is based on reading the parts of Codex that own task execution, history compaction, rollout recording, and rollout reconstruction. I then compared those with Crush’s session and agent flow and Zeroclaw’s agent loop, reliability layer, channel runtime, and SQLite-backed session state. I am deliberately focusing on runtime behavior rather than undocumented model-training differences.
My conclusion after reading the Codex source and comparing it with other agent tools is simple:
the step change was not just raw model quality, and it was not just prompt quality either. It was the interaction between a stronger model and a runtime that was built to preserve momentum.
That distinction matters because it explains three things at once:
- why older models often felt fragile,
- why other tools using GPT-5.2 still did not feel like Codex,
- and what projects like Crush or Zeroclaw would need to change if they want the same kind of long-run persistence.
What “Persistence” Actually Means
People often talk about agent persistence as if it were only a personality trait.
It is not.
What users call persistence is really a bundle of capabilities:
- the model keeps trying after tool failures,
- the runtime keeps the turn alive instead of aborting too early,
- context pressure does not instantly kill the task,
- new input can be incorporated without throwing away progress,
- retries happen outside the model instead of forcing the model to rediscover the same next step,
- a crash or disconnect does not necessarily erase the active task,
- and when things do fail, the failure is turned into useful state rather than a dead end.
If even one or two of those are weak, the whole experience starts feeling flaky.
That is why so many agents appear competent at the beginning of a task and then degrade later. The easy part of an agentic run is getting started. The hard part is surviving the middle.
The middle is where real work lives:
- partial file changes,
- bad tool outputs,
- auth issues,
- context overflows,
- repeated false starts,
- a need to retry with a different route,
- or a new user instruction arriving while the model is still in motion.
The tools that feel persistent are the ones that keep task continuity through the middle.
Why Older Models So Often Felt Like They Gave Up
Before GPT-5.2, there were already strong models. Some could reason well, some could code well, some could use tools fairly well, some could write plausible multi-step plans. But long unattended work often broke down in the same recognizable ways.
They were weaker at self-correction under noisy state
A hard agent run is not a clean benchmark prompt. It is a polluted state machine:
- some previous tool output was truncated,
- the last shell command partly succeeded,
- the repo is dirty,
- the model’s own earlier idea was wrong,
- a tool was denied,
- or an intermediate summary is lossy.
Weaker models can still look good on the first few turns because the problem is still clean enough. But once the state becomes noisy, they are more likely to collapse into generic planning language, repetitive retries, or over-cautious handoff back to the user.
They were more likely to misinterpret host feedback
If a runtime tells the model “tool failed,” “context was compacted,” “that call was denied,” or “this duplicate call was skipped,” the model has to actually use that information.
That sounds trivial, but it is not. The model has to infer:
- what just happened,
- whether the failure is terminal or recoverable,
- whether it should try a different tool,
- whether it should summarize progress,
- and whether the same general strategy still makes sense.
That is exactly the sort of control-loop behavior that looks easy in theory and is brittle in practice.
They had less reserve for long-run execution
A model that is merely good enough for the first two hours may still not be good enough for the ninth failure or the third compaction event.
Long agent runs are cumulative. Small weaknesses do not stay small. They compound.
That is why “it seemed smart for a while” is not the right standard. For long tasks, the question is whether the model can remain coherent after repeated partial failures, not whether it looks clever during the clean opening.
Why Codex Did Not Feel Like Just “A Better Model”
If Codex had only changed the model, I would expect other tools using GPT-5.2 to feel much closer to it than they did.
They did not.
That strongly suggests the runtime matters, and after reading the code I think that reading is correct.
Codex has a set of host-side behaviors that are easy to underestimate if you only look at the model name.
1. Codex treats a turn like an active process, not a single request
This is one of the biggest differences.
In Codex, a normal task is not just “send one request to the model and wait.” The runtime treats a turn as an active thing with its own lifecycle. New user input can be inspected, queued, blocked, accepted, or steered into the live turn. The system does not have to pretend that a turn is a sealed box.
That matters a lot for persistence because it avoids one of the classic agent failures: the runtime gets new information, but the only thing it knows how to do is kill the current request and start a fresh one from a worse state.
Codex can do better than that because the host owns turn state.
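To make the idea concrete, here is a minimal sketch of host-owned turn state in Python. The names (Turn, InputDecision, the phase strings) are my own illustrative assumptions, not Codex's actual API; the point is only that the host, not the model, decides how new input meets a live turn.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class InputDecision(Enum):
    STEER = auto()   # inject into the live turn's next model request
    QUEUE = auto()   # hold until the current step finishes
    BLOCK = auto()   # reject while the turn is in a critical phase

@dataclass
class Turn:
    turn_id: str
    phase: str = "idle"  # e.g. "idle", "streaming", "running_tool"
    pending_inputs: list = field(default_factory=list)
    queued_inputs: list = field(default_factory=list)

    def submit(self, text: str) -> InputDecision:
        # Host-owned decision: the turn is a live process, not a sealed request.
        if self.phase == "running_tool":
            # Don't yank a tool mid-flight; fold the input in afterwards.
            self.queued_inputs.append(text)
            return InputDecision.QUEUE
        self.pending_inputs.append(text)
        return InputDecision.STEER
```

The key property is that submitting new input never kills the turn; the worst case is a short queue, not a restart from a worse state.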
2. Tool failures become same-turn state, not immediate dead ends
In persistent systems, the model should not have to guess that a tool failed because the whole run stopped. It should see the failure and keep working inside the same turn.
Codex is good at this pattern. The runtime converts errors, interruptions, and tool outputs into model-visible history, so the model gets another chance to adapt without losing the work it has already done.
This is more important than it sounds. Many agents fail not because the model is too weak to solve the task, but because the runtime turns every medium-sized failure into a turn-ending event.
Codex is much more aggressive about continuation semantics.
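A minimal sketch of that continuation pattern, with hypothetical names (run_tool_call, the history record shape); Codex's real implementation differs, but the shape of the idea is this: a failed tool call becomes a history entry the model can read, not an exception that ends the turn.

```python
def run_tool_call(history, call, tools):
    """Execute one tool call; on failure, append a model-visible error
    record instead of ending the turn. Illustrative sketch only."""
    try:
        output = tools[call["name"]](**call["args"])
        history.append({"role": "tool", "call": call["name"],
                        "ok": True, "output": output})
    except Exception as exc:
        # The failure becomes same-turn state the model can react to.
        history.append({"role": "tool", "call": call["name"],
                        "ok": False, "error": f"{type(exc).__name__}: {exc}"})
    return history
```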
3. Retries and transport recovery are host-owned
Persistent systems should not make the model rediscover obvious infrastructure reactions.
If the issue is:
- a transient stream error,
- a reconnect,
- a retryable provider failure,
- a background task delay,
- or a stale transport state,
then the host should own that recovery loop as much as possible.
Codex does.
That reduces the amount of “fake cognitive work” the model has to spend on operational issues, leaving the model to focus on the actual problem rather than repeatedly deciding to do the same retry the runtime could have performed automatically.
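The host-owned recovery loop can be sketched as a plain exponential-backoff retry. TransientError and with_retries are illustrative stand-ins, not Codex's code; what matters is that this loop lives entirely below the model.

```python
import time

class TransientError(Exception):
    """Stands in for retryable transport failures (resets, 5xx, stream drops)."""

def with_retries(request_fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    # The host absorbs retryable failures so the model never has to
    # "decide" to re-issue the same request it already chose to make.
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```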
4. Compaction is not just summarization, it is checkpointing
This is one of the deepest differences.
A lot of agent systems say they “handle long context” because they summarize old history.
That is not enough.
Summarization helps, but it is lossy. Once you have replaced a long chain of actual turn history with a paragraph of prose, you have weakened the model’s future ability to reconstruct exactly what happened. Sometimes that is fine. Sometimes it is fatal.
Codex goes further. Its compaction path is tied to replayable history and replacement-history checkpoints. That means compaction is not just “make the transcript shorter.” It is “make the transcript shorter in a way that still preserves enough structural meaning for later reconstruction and resume.”
That is a much stronger notion of continuity.
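A toy illustration of the difference: compaction that emits a checkpoint carrying structured facts alongside the prose summary, so later turns can still answer "which files did I touch?" and "which calls failed?" exactly. The event shape and the compact function are assumptions for the sketch, not Codex's format.

```python
def compact(events, keep_last=4):
    """Replace older history with a checkpoint that keeps structured
    facts (files touched, failed calls), not just prose."""
    old, recent = events[:-keep_last], events[-keep_last:]
    checkpoint = {
        "type": "checkpoint",
        "summary": f"Compacted {len(old)} earlier events.",
        # Structured state survives compaction verbatim:
        "files_touched": sorted({e["file"] for e in old if e.get("file")}),
        "failed_calls": [e["call"] for e in old if e.get("ok") is False],
    }
    return [checkpoint] + recent
```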
5. Sessions are not just chat logs, they have rollout state
This is the part that, in my view, most directly explains the difference in feel.
Codex persists rollouts. It records events. It reconstructs history from rollouts. It can reason about what the current thread is and how it got here. That gives it a much stronger foundation for replay, resume, and inspection than a plain message transcript ever can.
A plain transcript tells you what messages existed.
A rollout tells you what happened.
That difference becomes decisive once you care about long autonomous execution.
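The distinction can be shown with a tiny rollout log: an append-only event stream from which both the transcript and the execution state can be rebuilt. record and reconstruct are hypothetical helpers, not Codex's recorder, but they show what a transcript alone cannot capture: that a tool call was still in flight.

```python
import json

def record(log_lines, event):
    """Append one rollout event as a JSON line (in-memory stand-in for a file)."""
    log_lines.append(json.dumps(event))

def reconstruct(log_lines):
    """Rebuild conversation state *and* execution state from the rollout."""
    messages, in_flight = [], []
    for line in log_lines:
        ev = json.loads(line)
        if ev["type"] == "message":
            messages.append(ev["content"])
        elif ev["type"] == "tool_begin":
            in_flight.append(ev["call"])
        elif ev["type"] == "tool_end":
            in_flight.remove(ev["call"])
    # A plain transcript could not tell us a call never completed.
    return {"messages": messages, "in_flight": in_flight}
```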
Why GPT-5.2 Was the Inflection Point
If Codex already had runtime machinery that helped persistence, why did the big subjective change show up so strongly with GPT-5.2?
Because model quality and runtime quality multiply. They do not substitute cleanly for one another.
The runtime was not enough by itself
A strong runtime can prevent many avoidable failures:
- pointless restarts,
- dropped intermediate state,
- retries that should have been automatic,
- context overflow ending the whole task,
- or a crash wiping out the thread.
But the runtime still depends on the model for the hard part:
- choosing a better strategy,
- interpreting feedback correctly,
- knowing when to abandon a failing route,
- continuing through messy intermediate state,
- and remaining coherent after many tool-use iterations.
Older models could benefit from the runtime, but not enough to cross the qualitative threshold.
GPT-5.2 was strong enough to cash in the runtime advantage
This is the core of the whole story.
The Codex runtime created a better environment for persistence:
- better continuation semantics,
- better turn ownership,
- better compaction behavior,
- better recovery behavior,
- better tool loop structure.
GPT-5.2 was, in practice, the first model in that environment that seemed strong enough to fully exploit it.
In other words, Codex likely had some of the right runtime ideas already, but GPT-5.2 was the first model that turned those ideas into a dramatic user-visible effect.
That also explains why GPT-5.3 and GPT-5.4 did not surprise me as much. Once the underlying runtime-model combination had crossed the threshold, later model improvements felt more like “better inside the same category” than “the category itself changed.”
What Crush Does Well
Crush is not a toy wrapper. It is much more serious than that.
It has several qualities that already move it beyond the simplest agent designs.
Crush is session-based and stateful
Crush persists sessions and messages in SQLite. It keeps track of usage, titles, files, session metadata, and message history. It is not stateless terminal sugar around a single API call.
That matters because it gives it a real memory of the conversation and the project session.
Crush has tool integration, permissions, and a queue
Crush supports tools, permission checks, MCP integration, LSP assistance, file tracking, and prompt queueing when a session is busy. It is aware of concurrency at the session level, and it tries to preserve order rather than allowing overlapping chaos.
That is already much better than the “run whatever the model says immediately” class of agents.
Crush has summarization to keep sessions alive
Crush also has automatic summarization once context pressure reaches a threshold. This is important because it keeps long sessions from simply blowing up when they become too large.
But this is exactly where the gap with Codex becomes visible.
Where Crush Still Falls Short of Codex-Style Persistence
Crush’s persistence is real, but it is mostly session persistence, not active-turn persistence.
That distinction is the heart of the problem.
1. Crush queues prompts, but it does not really steer a live turn
If a session is busy, Crush queues the next prompt. That is sensible, and it is better than dropping it.
But queueing is not the same as Codex-style active-turn steering.
Queueing says:
“wait until the current thing finishes, then start the next thing.”
Steering says:
“I have a live turn, I know it is live, and I can decide how new input should interact with it.”
That is a much stronger control model.
2. Crush persists messages, not a replayable execution rollout
Crush stores messages and session state. That is useful, but it is not equivalent to a durable event log of the running turn.
If you want twelve-hour autonomy, the relevant object is not just the conversation transcript. It is the execution process:
- what tool calls were requested,
- which were already completed,
- what partial output was emitted,
- what failures occurred,
- what retry state exists,
- and what compacted state can be reconstructed exactly.
Crush does not appear to persist that as a first-class root-turn rollout in the way Codex does.
3. Summarization is continuity-preserving, but not reconstruction-grade
Crush’s auto-summarization helps session longevity. But it is still fundamentally summarization. It is not the same thing as checkpointed replacement history plus later rollout reconstruction.
That means Crush can remain usable across long sessions while still being weaker at precise task continuity across difficult interruptions, crashes, or long-running in-progress work.
4. The core execution still leans on an external agent abstraction
Crush builds a session agent and streams the turn through an external agent abstraction (fantasy). That is a perfectly reasonable architecture, but it usually means the host has less direct ownership over the full internal turn-state machine than Codex does.
That matters because the more the host owns the state machine, the more precisely it can enforce persistence semantics.
Another way to say it is that Crush is currently much better at preserving conversational continuity than execution continuity. Codex feels more persistent because it treats long-running work as something the host must actively preserve, not just something the transcript should describe afterward.
What Crush Would Need to Catch Up
If the goal is not “be better” in general, but specifically “behave more like Codex on hard, long unattended tasks,” then I think Crush would need changes in four major areas.
1. A durable turn-state machine
Crush needs a first-class concept of an active root turn whose state is persisted independently of chat messages.
That means:
- active turn id,
- current execution phase,
- pending tool calls,
- completed tool calls,
- retry state,
- compaction checkpoints,
- partial assistant output,
- and finalization state.
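As a sketch, the persisted object might look like the dataclass below; every field name here is my assumption about what such a state machine would need, not Crush's design. The important property is that snapshot() is written to durable storage on every state transition, independently of the message table.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class RootTurnState:
    """Persisted independently of chat messages; names are illustrative."""
    turn_id: str
    phase: str = "starting"  # starting | calling_model | running_tools | finalizing
    pending_tool_calls: list = field(default_factory=list)
    completed_tool_calls: list = field(default_factory=list)
    retry_count: int = 0
    compaction_checkpoints: list = field(default_factory=list)
    partial_output: str = ""
    finalized: bool = False

    def snapshot(self) -> dict:
        # What gets written to durable storage on every state transition.
        return asdict(self)
```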
2. Rollout recording and reconstruction
Crush needs something much closer to Codex’s rollout recorder model.
Not because event logs are fashionable, but because that is what lets a system resume meaningfully after disruption. Without it, the system mostly has to re-enter from a transcript, which is a weaker foundation.
3. Better same-turn handling of new input
Prompt queueing is good, but it is not enough for the strongest persistence story.
Crush would need explicit policies for:
- steer into current turn,
- queue for later,
- block,
- or convert into interruption.
That decision should be runtime-owned, not improvised every time through user-visible friction.
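A minimal version of that runtime-owned policy might look like this. The phase names and command strings are assumptions for illustration; the point is that the decision is a small, explicit function, not ad-hoc behavior.

```python
def classify_input(turn_phase, text):
    """Runtime-owned policy: decide how new input meets a live turn.
    Phases and commands are illustrative, not Crush's actual design."""
    if text.strip().lower() in ("/stop", "/interrupt"):
        return "interrupt"
    if turn_phase is None:
        return "start"             # no live turn: begin a new one
    if turn_phase == "running_tools":
        return "queue"             # don't interrupt a tool mid-flight
    return "steer"                 # fold into the live turn's context
```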
4. Compaction as a checkpoint, not only a summary
This is easy to miss and hard to replace later.
If Crush wants Codex-like long-task behavior, it needs compaction that preserves reconstructible state rather than just reducing token load. Otherwise long autonomy will always get weaker as the session ages.
Why Zeroclaw Is Closer to Codex Than Crush Is
Of the two alternatives, Zeroclaw is clearly closer to Codex on the runtime side.
That is because Zeroclaw already owns much more of the agent loop itself.
Zeroclaw owns its own tool loop
This is a big deal.
Zeroclaw does not just hand everything to a generic external agent SDK and hope for the best. It has its own provider abstraction, its own tool loop, its own tool execution helpers, its own context compressor, and its own loop detector.
That already puts it in a different class from more wrapper-like designs.
Zeroclaw already has host-side reliability features
It already includes:
- provider/model retry and fallback,
- context overflow recovery,
- tool result truncation and history trimming,
- loop detection,
- per-session channel history,
- background delegate tasks,
- and nontrivial channel orchestration with cancellation and debouncing.
That is real infrastructure. It is not superficial.
Zeroclaw already persists more than a bare transcript
It has SQLite-backed memory, SQLite-backed channel session storage, per-session metadata, and session state markers such as running, idle, and error, along with fields like turn_id and turn_started_at.
That means the project already has some of the pieces you would expect to need for real long-run autonomy.
But not yet the one that matters most.
Where Zeroclaw Still Stops Short
Zeroclaw is closer to Codex than Crush is, but it still is not Codex-like in the specific “send a tricky Telegram task and let it keep working for twelve hours” sense.
1. It persists session history, not a durable root-task execution record
This is the key gap.
Zeroclaw stores chat history and session metadata, but on restart it hydrates the conversation and closes orphaned turns rather than reconstructing and resuming the interrupted active turn.
That is a useful safety behavior, but it is not execution persistence.
2. The channel runtime is still fundamentally an inline request path
Telegram messages go into the channel runtime, which builds the system prompt, recalls memory, runs the tool loop, and emits a bounded reply.
That is strong for normal chatbot behavior.
It is not the same thing as a durable background job model for root tasks.
If the process dies in the middle, there is no strong evidence that Zeroclaw can restart, inspect the active root task, rebuild its exact execution state, and continue.
3. Its operational limits are far below a real 12-hour autonomy target
By default, Zeroclaw is configured more like a practical safe agent than a marathon worker:
- max_tool_iterations defaults to 10,
- channel message timeout defaults to 300 seconds,
- timeout scaling is capped,
- and non-interactive autonomy defaults remain conservative.
Those defaults are sensible.
They are just not twelve-hour-agent defaults.
4. Non-interactive approval behavior is still a blocker
In channel mode, tools requiring approval are generally auto-denied rather than routed through a richer unattended policy engine.
That makes sense for safety, but it means many long autonomous tasks will hit policy friction before they hit model limits.
5. Session state is observed, not resumed
Zeroclaw can tell you that something is running or stuck. That is useful operationally. But I did not find the mechanism that turns that information into a resumed root task.
That is the difference between visibility and continuity.
What Zeroclaw Would Need for the Telegram Use Case
If the goal is:
“send a hard task from Telegram and let Zeroclaw work for up to twelve hours nonstop, even with a weaker model,”
then the path is much clearer for Zeroclaw than for Crush, because more of the necessary runtime is already there.
But it still needs major additions.
1. A durable root-task manager
Telegram messages should create persisted root jobs, not just inline channel requests.
That job should have:
- a task id,
- conversation identity,
- current state,
- an event log,
- progress metadata,
- and a resumable lifecycle independent of the original Telegram update.
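One way to sketch such a store, using SQLite since Zeroclaw already keeps session state there. The table and column names are hypothetical, not Zeroclaw's schema; the shape matters more than the names: one row per root task, plus a replayable per-task event log.

```python
import sqlite3

# Hypothetical schema for durable root tasks; names are illustrative.
SCHEMA = """
CREATE TABLE IF NOT EXISTS root_tasks (
    task_id     TEXT PRIMARY KEY,
    chat_id     TEXT NOT NULL,    -- conversation identity
    state       TEXT NOT NULL,    -- pending | running | done | failed
    started_at  REAL NOT NULL,
    progress    TEXT DEFAULT ''   -- short human-readable progress note
);
CREATE TABLE IF NOT EXISTS root_task_events (
    task_id     TEXT NOT NULL,
    seq         INTEGER NOT NULL,
    event_json  TEXT NOT NULL,    -- replayable rollout event
    PRIMARY KEY (task_id, seq)
);
"""

def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```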
2. Replayable rollout logging
Zeroclaw needs to record root-turn execution in a reconstructible form:
- prompt/input events,
- tool requests,
- tool outputs,
- provider retries,
- compaction events,
- cancellations,
- partial assistant output,
- and final status.
Without that, recovery after crash or restart will always be weaker than Codex.
3. Startup recovery and resume
At daemon startup, Zeroclaw should scan running root tasks and decide:
- resume,
- retry,
- mark failed,
- or ask for help.
Right now it has enough metadata to notice that something was running. It needs the next step: actual reconstruction.
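That startup pass could be sketched as a small triage function. The decision table below is my assumption of a reasonable policy, not Zeroclaw's current behavior: resume when a rollout exists, retry from the last checkpoint while budget remains, and surface the task to the user instead of looping forever.

```python
def recover_on_startup(tasks, max_retries=3):
    """Decide, per interrupted root task, what to do at daemon startup.
    Illustrative policy sketch only."""
    decisions = {}
    for task in tasks:
        if task["state"] != "running":
            continue  # only tasks interrupted mid-flight need a decision
        if task.get("has_rollout"):
            decisions[task["task_id"]] = "resume"    # reconstruct and continue
        elif task.get("retries", 0) < max_retries:
            decisions[task["task_id"]] = "retry"     # restart from last checkpoint
        else:
            decisions[task["task_id"]] = "ask_user"  # surface, don't loop
    return decisions
```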
4. Stronger compaction semantics
Its current context compression is helpful but mostly summary-based. For long-lived background jobs, it should preserve checkpointed state that later turns can resume from with minimal ambiguity.
5. A real per-conversation execution model for channels
For long Telegram work, the runtime should behave more like:
- one active root task per conversation,
- explicit queue of follow-up inputs,
- explicit /stop or /interrupt commands,
- and explicit policies for whether new messages steer, queue, or cancel.
That is more robust than treating each inbound message mainly as a fresh bounded processing request.
6. An unattended autonomy profile
If the user truly wants a 12-hour Telegram worker, Zeroclaw needs an autonomy mode designed for that:
- broader permitted operations,
- higher action and cost ceilings,
- a clear risk policy,
- and logging strong enough that the user can trust the system afterward.
In other words, the runtime needs to stop assuming that every important action will be supervised live.
Can a Weaker Model Become “Codex-Like” Just by Porting the Runtime?
Not fully.
This point matters because it is easy to overcorrect after seeing how important the runtime is.
The runtime can do a lot:
- preserve state,
- improve failure recovery,
- reduce pointless resets,
- keep tasks alive through infrastructure problems,
- and stop the model from throwing away progress.
That can make a weaker model dramatically more usable.
But it cannot manufacture deep planning ability, strategic flexibility, or long-horizon self-correction out of thin air.
So the honest answer is:
- yes, a strong runtime can make weaker models much more persistent than they would otherwise be,
- but no, it cannot make a clearly weaker model truly equivalent to a stronger one on hard agentic work.
The best way to think about it is multiplicative:
Persistence = model capability × runtime continuity × tool semantics × policy design.
If any one factor is too weak, the product is weak.
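With toy numbers (all factors scored on a made-up 0-to-1 scale), the multiplicative framing shows why a single weak factor dominates the outcome:

```python
def persistence(model, runtime, tools, policy):
    # All factors in [0, 1]; the product punishes any single weak link.
    return model * runtime * tools * policy

strong_everywhere = persistence(0.9, 0.9, 0.9, 0.9)  # ~0.66
one_weak_link = persistence(0.9, 0.3, 0.9, 0.9)      # ~0.22
```

Dropping one factor from 0.9 to 0.3 cuts the product by two thirds, even though three of the four factors never changed.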
Codex + GPT-5.2 was surprising because all of those factors finally lined up strongly enough at once.
The Main Lesson
The big lesson for me is that agent persistence is not just a model property.
It is a systems property.
The model matters enormously. GPT-5.2 was clearly strong enough to change the outcome. But the model alone does not explain why Codex felt so different from other GPT-5.2 tools.
The missing piece is that Codex is designed to preserve momentum:
- it owns the turn lifecycle,
- it keeps failures inside the working state instead of turning them into immediate dead ends,
- it treats long-context handling as a continuity problem, not just a summarization problem,
- and it persists enough of execution to make replay and reconstruction real concepts rather than vague aspirations.
That is what made the experience feel qualitatively different.
And that is also why the right question for Crush or Zeroclaw is not merely:
“How do we get a better model?”
It is:
“What do we need to preserve task continuity when the model, the tools, the network, and the process all stop being clean?”
Codex answered that question more seriously than most tools did.
GPT-5.2 was the first time I felt the answer become user-visible.
That is why it felt like day and night.