Agentic Reflexion: Teach Your AI to Doubt Its Own Plan

Your AI wrote a great plan. It's probably wrong. Here's how to catch that before you build.

        ┌──────────────────────────────────────────────┐
        │                                              │
   You  │                                         refine
    │   │                                              │
    ▼   ▼                                              │
┌─────────────────────┐                               │
│      Plan Mode      │  ← Model A (e.g., Sonnet)     │
│  (Read · Reason ·   │                               │
│    Propose)         │                               │
└─────────────────────┘                               │
           │                                          │
           │  Draft Plan                              │
           ▼                                          │
┌─────────────────────┐                               │
│    Plan Reviewer    │  ← Model B (e.g., Opus)       │
│  (Critique · Flag · │    Fresh context              │
│    Rank by Risk)    │    Adversarial prompt         │
└─────────────────────┘                               │
           │                                          │
    ┌──────┴──────┐                                   │
    │  Blockers?  │ ──── yes ─────────────────────────┘
    └──────┬──────┘
           │ no
           ▼
┌─────────────────────┐
│      Implement      │
└─────────────────────┘

There's a failure mode buried inside plan mode that nobody talks about.

The model that wrote the plan is also the most likely to think the plan is good. It built the reasoning. It owns the assumptions. When you ask the same model to review its own output — same context, same priors — it's not reviewing. It's confirming.

This is a well-known problem in AI research. The paper that formalised it — Reflexion (Shinn et al., 2023) — showed that giving agents a verbal feedback loop dramatically improved task performance. The core insight: critique before committing to an action is structurally better than acting and recovering. The catch: the critique has to be genuine. A self-reinforcing loop is worse than no critique at all, because it gives you false confidence.

The fix is surprisingly simple. You use a second agent — different model, fresh context, adversarial instructions — whose only job is to find what's wrong with the plan before you build it.

In this post you'll learn:

  • What the Reflexion pattern is and why it matters for agentic coding
  • How plan mode fits into the loop — and where it falls short on its own
  • How to build a plan reviewer sub-agent that actually finds problems
  • Why the model family you pick for the reviewer changes what it catches
  • How GitHub Copilot CLI validated this exact pattern with their Rubber Duck feature — and the numbers behind it

💡 This is Part 7 in my Context Engineering series. Part 1 covered the four pillars. Part 2 introduced agent skills. Part 3 argued for lean over Scrum. Part 4 put skills to work with git worktrees. Part 5 showed why Markdown is the native format of agentic context. Part 6 explained why spec-driven development is waterfall in a trench coat.


🧠 The Reflexion Pattern

The original Reflexion paper introduced a deceptively simple idea: rather than asking an agent to act directly on a task, give it a loop.

  1. Generate — the agent produces an output (a plan, a solution, an action)
  2. Evaluate — a critic assesses that output against some standard
  3. Refine — the agent incorporates the critique and produces a better output

Repeat until good enough or resource limits are hit.

What made the results striking wasn't the loop itself — iteration is obvious. It was the quality of the critic. When the evaluator had a fresh context and adversarial instructions, agents caught mistakes they repeatedly missed on direct attempts. The same model could find what it wouldn't have generated.

The lesson for agentic coding isn't that you need a research paper's setup — it's that a second perspective, properly framed, is structurally different from self-review. The critique has to come from somewhere the original reasoning didn't.


📋 Plan Mode: The Generator Half

If you've used Claude Code's plan mode, you've seen the first half of the Reflexion loop already.

You describe the task. Claude Code reads your codebase — your CLAUDE.md, the relevant files, the existing patterns — and produces a structured plan before touching anything. It identifies which files change, what functions are affected, what the implementation sequence looks like, and flags anything it's uncertain about.

This is genuinely valuable. It surfaces the reasoning before you're committed to it. You can redirect, push back, or add constraints before a single line of code is written. That's the right gate.

But plan mode has a blind spot. The model that wrote the plan has been primed to think it's correct. It read your codebase through the lens of solving your task. Its context window is full of reasoning that supports the approach it just proposed. Asking that same context to critique the plan is like asking someone to proofread their own essay five minutes after writing it — the eye slides over what the brain already decided was fine.

What you need is a second reader who wasn't in the room when the plan was written.


🔴 Enter the Skeptic: The Plan Reviewer Sub-Agent

The plan reviewer is a sub-agent whose only job is to find problems.

Not to validate the approach. Not to suggest alternatives. Not to summarise what the plan does well. To find what will break it.

Here's what the output looks like in practice:

## Plan Review

### Blockers
- Step 3 calls `migrate_schema()` but this function does not exist in the codebase.
  Grep confirms no match. Plan assumes it was already implemented in a prior ticket.

### Significant Risks  
- The migration has no rollback step. If the schema change fails mid-run on the
  50M-row table, there is no documented recovery path.

### Minor Concerns
- Step 7 says "update accordingly" — underspecified. Which callers are affected
  and in what order?

That's the output you want before implementation, not after. The blocker in that example — calling a function that doesn't exist — would have caused an immediate runtime failure. The plan reviewer caught it in thirty seconds by grepping the codebase. The planning model didn't, because it assumed the function was there.

In practice, a single review pass catches the high-severity issues. If the plan changes substantially after refinement, run the reviewer again — a revised plan can introduce new assumptions just as easily as the original.


🎯 Why the Model Family Matters

The single most important decision in setting up a plan reviewer is which model it uses.

This isn't about one model being "better" than another in some general sense. It's about cognitive diversity. Different model families have different training, different blind spots, different tendencies toward optimism or caution. A plan generated by Claude Sonnet reviewed by Claude Sonnet in the same session is likely to surface the same risk profile the plan itself assumed.

A plan generated by Sonnet, then passed to Opus in a fresh context window with adversarial instructions, is different. Opus has different priors about what constitutes a risk. It hasn't been warmed up by the same planning conversation. It's reading the plan cold, the same way a senior engineer would if you dropped it in their inbox.

The principle generalises beyond Claude. If your planning agent uses GPT-4o, review with Claude. If it uses Claude Sonnet, review with Opus. The point isn't that one is smarter — it's that the review is structurally independent. You're not asking the same chain of reasoning to audit itself.

💡 Model diversity in the review step is the difference between a rubber stamp and a genuine gate.


🦆 The Industry Agrees: Copilot's Rubber Duck

This isn't a niche pattern. GitHub just shipped exactly this in GitHub Copilot CLI under the name Rubber Duck — and the numbers make the case better than theory does.

When you select Claude as the orchestrator in Copilot CLI, Rubber Duck automatically brings in GPT-5.4 as the reviewer. Cross-model by design. The system fires at three checkpoints:

  1. After planning — catches suboptimal early decisions before they compound
  2. After complex implementation — identifies edge cases in intricate code
  3. After test writing — flags coverage gaps before execution

The performance result is striking. Testing on SWE-Bench Pro found that Claude Sonnet 4.6 paired with Rubber Duck running GPT-5.4 closed 74.7% of the performance gap between Sonnet and Opus — approaching Opus-level resolution rates at Sonnet cost. On the hardest multi-file problems, the pairing scored 4.8% higher than the baseline.

The bugs Rubber Duck caught in real-world testing are instructive too: a scheduler that immediately exited due to an architectural flaw, silent data loss from overwritten dictionary keys, cross-file conflicts breaking Redis references. These aren't hypothetical risks — they're the exact category of problem that compiles, passes linting, and only breaks in production.

💡 You don't get near-Opus performance by using a better model. You get it by adding a cross-model reviewer to your loop.

The name is a deliberate nod to the classic rubber duck debugging technique — explaining your code to an inanimate object to surface problems you couldn't see while writing. The insight is the same: a second perspective that wasn't involved in producing the output finds what you normalised away.

GitHub built it into the tool. You can build it into your workflow today with a single agent definition.


⚙️ Building the Plan Reviewer

You don't need a complex setup to implement this. In Claude Code, a plan reviewer is a standard agent definition — a markdown file in your .agents/agents/ directory with a system prompt and a model assignment.

The key elements:

1. Make it adversarial by default. The reviewer's job is to find problems, not to acknowledge what the plan does right. Any praise is wasted context.

2. Give it verification tools. The most common plan failure is an assumption about the codebase that turns out to be false. The reviewer needs Read, Grep, and Glob access to check those assumptions against actual code — not just against what the plan claims.

3. Order findings by severity. A list of concerns is useful. A list ordered by blast radius is actionable. Blockers (will cause failure) first, then significant risks (likely to cause production issues), then minor concerns (underspecified steps, optimistic assumptions).

4. Make it trigger proactively. The description field determines when Claude Code invokes it automatically. Phrase it to trigger after any plan is produced, not just when explicitly asked — because the cases where you forget to ask are exactly the cases that go wrong.

5. Set color: red. It's a small touch, but it visually marks the reviewer as the adversarial voice in your agent list — a useful reminder that its job is to find fault, not to help.

Here's the full definition from my langchain-anthropic-pdf-support project:

---
name: plan-reviewer  
description: |  
  Use this agent when a plan has been written and needs critical review before
  implementation begins. Trigger proactively after producing an implementation plan,
  migration plan, refactor proposal, or architectural design. Also trigger when
  the user asks to "review the plan", "critique this approach", or "find problems
  with this".
model: opus  
color: red  
tools: ["Read", "Grep", "Glob"]  
---

You are a principal engineer reviewing a plan. Your job is to find problems,  
not validate assumptions.

Systematically identify:  
1. Edge cases not accounted for  
2. Performance implications (N+1, blocking, scale issues)  
3. Backwards compatibility risks  
4. Missing rollback steps  
5. Anything underspecified or optimistic

Verify plan assumptions against actual code using Read/Grep/Glob.  
Order findings by severity: blockers first, then significant risks,  
then minor concerns.  
Do not praise or summarise what the plan does well.  

Note on model: opus: The repo uses model: inherit by default, which picks up whatever model the parent session is running. For the reviewer to provide genuine cognitive diversity, set it explicitly — opus if you're running Sonnet day-to-day, or swap model families entirely if you're on a non-Claude tool.

If you're not on Claude Code, the same principle applies. In Cursor, add a rule in .cursorrules that instructs a second agent pass to critique the plan before acting. In GitHub Copilot CLI, Rubber Duck does this automatically. The pattern is tool-agnostic — what matters is the structural separation: a different model, a fresh context, and an adversarial framing.


The Bottom Line

Plan mode is valuable. But a plan reviewed only by the model that wrote it is a plan that's checked its own homework.

The Reflexion pattern gives you a genuine quality gate: a second model, fresh context, explicitly tasked with finding what the plan got wrong before you build it. The reviewer reads cold. It verifies assumptions against real code. It ranks what it finds by severity so you know what to fix before you act.

  • Use plan mode to generate a structured plan before any implementation
  • Spawn a plan reviewer agent with a different model family and adversarial instructions
  • Give the reviewer code access so it can verify assumptions, not just critique the document
  • Trigger it proactively — the cases where you forget to ask are exactly the cases that go wrong
  • Re-run after significant refinement — a revised plan can introduce new assumptions just as easily as the original
  • Don't use the same model and context for both planning and review — you're not getting a second opinion, you're getting a confirmation

The plan reviewer doesn't replace judgment. You still decide which findings to act on. But it gives you something plan mode can't: a perspective that wasn't in the room when the plan was written.

That's the difference between a plan you reviewed and a plan that's been stress-tested.