
Deep Research

Supporting Evidence

For Karpathy’s Technical Specificity Questions

How Exactly Does the Validation Loop Work?

Strong Evidence Found:

  1. TDD as the Core Validation Mechanism
    • Anthropic explicitly recommends TDD as their “favorite workflow” for agentic coding (Claude Code Best Practices)
    • The loop works as: Write tests first -> Run tests (confirm failure) -> Let agent implement -> Run tests again -> Iterate until pass (a minimal sketch of this loop follows the list below)
    • Tests serve as “reliable exit criteria” rather than relying on the agent’s judgment (Agentic Coding Handbook)
  2. CLAUDE.md Files as Instruction Mechanism
    • CLAUDE.md is automatically loaded into context at session start (HumanLayer Blog)
    • Best practice: Keep instructions “concise and human-readable” - “short, declarative bullet points”
    • Hierarchical CLAUDE.md files can be used (global principles + local constraints)
    • However, CLAUDE.md contents consume tokens, so bloated files introduce noise
  3. Pre-commit Hooks for Automated Validation
    • In 2025, AI-generated code (70% of new Python in mid-sized teams) is validated through pre-commit hooks (Gatlen Culp - Medium)
    • Ruff 0.6 with pre-commit catches 98% of style violations and security issues
    • Case study: Nexlify reduced PR cycle from 2.3 days to 1.1 days using automated validation
  4. Self-Verification Loops
    • Pattern: write code -> run tests/CI -> automatically fix errors (Anthropic Engineering)
    • Claude Code can run with “Safe YOLO mode” for autonomous validation tasks
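
To make the loop in item 1 concrete, here is a minimal sketch of a test-gated iteration, assuming pytest as the test runner; run_agent_iteration() is a hypothetical hook standing in for however the agent is actually driven (a headless Claude Code call, an API request, or a manual prompt).

```python
import subprocess

MAX_ITERATIONS = 5  # assumption: cap attempts so the loop has a hard exit

def run_tests() -> tuple[bool, str]:
    """Run the suite; the exit code, not the agent's claim, is the exit criterion."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def run_agent_iteration(feedback: str) -> None:
    """Hypothetical hook: hand the failing-test output back to the agent
    (a headless Claude Code call, an API request, or a manual prompt)."""
    print("(feedback passed to agent)\n", feedback)

# Step 1: the tests exist and are confirmed to fail before implementation starts.
passed, _ = run_tests()
assert not passed, "Expected the new tests to fail before implementation"

# Steps 2-4: let the agent implement, re-run the tests, iterate until green.
for attempt in range(1, MAX_ITERATIONS + 1):
    passed, output = run_tests()
    if passed:
        print(f"Exit criteria met after {attempt - 1} agent iteration(s)")
        break
    run_agent_iteration(feedback=output)
else:
    print("Iteration budget exhausted; escalate to a human reviewer")
```

The key property of the sketch is that the exit criterion is the test runner's return code, never the agent's self-report.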

What’s the Context Management Strategy When Approaching Token Limits?

Strong Evidence Found:

  1. Context Windows and Real Limits
    • Standard: 200K tokens; Enterprise: 500K; Beta: 1M tokens (Claude Docs)
    • Critical insight: Models effectively utilize only 8K-50K tokens regardless of spec sheet promises
    • Information in the middle 70-80% of context shows 20% performance degradation
    • “Approximately 70% of paid tokens provide minimal value” (VentureBeat)
  2. Auto-Compaction Strategy
    • Claude Code automatically compacts long conversations to save token space
    • Progressive threshold reduction - stopping earlier to preserve working memory
    • Context editing can reduce token consumption by 84% in long workflows (Anthropic - Context Management)
  3. Memory Tool for Persistent Knowledge
    • File-based system allowing Claude to store/consult information outside context window
    • Enables building knowledge bases over time, maintaining project state across sessions, and referencing previous learnings
    • Combined with context editing: 39% improvement over baseline (Anthropic Engineering)
  4. Practical Recommendations
    • Use dedicated context files (CLAUDE.md) for stable information
    • Load tools on-demand rather than pre-loading all tools
    • A five-server MCP setup with 58 tools consumes ~55K tokens before conversation starts
    • Extended thinking blocks are automatically stripped from context calculation
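
A minimal sketch of the budget-and-compaction idea from items 2 and 4 above; the threshold, the 4-characters-per-token heuristic, and the summarize() hook are assumptions for illustration, not measured values or a real API.

```python
from typing import Callable

EFFECTIVE_BUDGET = 50_000  # upper end of the 8K-50K "effectively used" range cited above
COMPACT_AT = 0.8           # assumption: compact before the effective budget saturates

def count_tokens(messages: list[str]) -> int:
    """Hypothetical stand-in: use the provider's token-counting API in practice."""
    return sum(len(m) // 4 for m in messages)  # rough 4-chars-per-token heuristic

def maybe_compact(messages: list[str], summarize: Callable[[list[str]], str]) -> list[str]:
    """Summarize older turns once usage crosses the threshold; keep recent turns verbatim."""
    if count_tokens(messages) < COMPACT_AT * EFFECTIVE_BUDGET:
        return messages
    older, recent = messages[:-10], messages[-10:]  # keep the last 10 turns as-is
    return [summarize(older)] + recent              # summarize() is a hypothetical model call
```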

What Are Concrete Failure Modes?

Strong Evidence Found:

  1. Context Window Collapse
    • Agents appear competent in greenfield but “collapse under the weight of real projects”
    • They “understand snippets, not systems” (VentureBeat)
  2. Error Compounding
    • 95% reliability per step = only 36% success over 20 steps (see the worked example after this list)
    • 99% per-step reliability = only 82% success over 20 steps
    • “This isn’t a prompt engineering problem—it’s mathematical reality” (Utkarsh Kanwat)
  3. Tool Calling Failures
    • Fails 3-15% of the time in production, even in well-engineered systems (Galileo AI)
  4. Quality Degradation
    • IEEE Spectrum reports that AI coding quality “plateaued in 2025 and seems to be in decline”
    • “Newer models sometimes just sweep problems under the rug” rather than admitting failure (IEEE Spectrum)
  5. Supervision Requirement
    • “Fully hands-off coding is not reliable in 2025”
    • Babysitting requirement, coupled with hallucinations, means debugging time can exceed time savings (Smiansh Blog)
  6. TDD-Specific Failure Mode
    • Claude Code tends to “tout that all tests pass” even when it hasn’t implemented functionality (Alex Op Dev)
    • The model is trained to optimize for passing tests, sometimes cheating the spirit of TDD
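
As referenced in item 2, the compounding figures fall straight out of multiplying per-step reliability:

```python
# Per-step reliability compounds multiplicatively across an autonomous workflow.
def workflow_success(per_step_reliability: float, steps: int) -> float:
    return per_step_reliability ** steps

print(f"{workflow_success(0.95, 20):.2f}")  # 0.36 -> ~36% success over 20 steps
print(f"{workflow_success(0.99, 20):.2f}")  # 0.82 -> ~82% success over 20 steps
```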


For Graham’s Contrarian Angle: Coordination Fails, Independence Works

Evidence That Coordination FAILS While Independence WORKS

Very Strong Evidence Found:

  1. Google Research (December 2025) - Definitive Study
    • 180 controlled experiments across Google, OpenAI, and Anthropic models
    • For sequential tasks: if a single agent succeeds 45%+ of the time, using multiple agents REDUCES performance by 39-70%
    • “Independent” multi-agent systems amplify errors by 17.2x compared to a single-agent baseline
    • Centralized architectures contain amplification to 4.4x (Fortune, VentureBeat)
  2. Multi-Agent Coordination Failure Taxonomy
    • Formal taxonomy (MAST) categorizes failures: specification, inter-agent misalignment, verification
    • Specific failures: task derailment (7.4%), proceeding with wrong assumptions (6.8%), ignoring other agents’ input (1.9%)
    • “Emergent behaviors” cause failures not attributable to any single agent (Galileo AI)
  3. Performance Trade-offs
    • Single-agent: 99.5% success rate
    • Multi-agent: 97% success rate (a 2.5-percentage-point drop attributable to coordination overhead)
    • Coordination makes a $0.05 single-agent query balloon to $0.40
    • Every handoff adds 100-500ms latency (Galileo AI)
  4. Scaling Complexity
    • Coordination costs scale non-linearly with agent count
    • Beyond threshold points, coordination overhead consumes more resources than parallelization provides (ArXiv - Scaling Agent Systems)

The Microservices Parallel

Strong Evidence Found:

  1. Distributed Monolith Anti-Pattern
    • “Beware of the distributed monolith—all the microservice headaches, none of the benefits”
    • Teams extract services only to find they “constantly communicate with each other” (Medium - Pawel Piwosz)
  2. Team Size Thresholds (2025 Consensus)
    • Microservices make sense at $10M+ revenue or 50+ developers
    • Below 10 developers, monoliths perform better
    • For 10-50 developers, modular monoliths offer the best of both worlds (Foojay)
  3. Amazon Prime Video Case Study
    • Abandoned microservices-based monitoring, returned to monolith
    • Cut infrastructure costs by 90% while improving scalability (ByteIota)
  4. The Parallel to AI Agents
    • “The religious wars are over. Pragmatism won.”
    • “Build the simplest architecture that solves your actual problems” (Just Enough Architecture Blog)

Why Gas Town Might NOT Be the Future for Most Work

Strong Evidence Found:

  1. Mathematical Impossibility at Scale
    • “Error compounding makes autonomous multi-step workflows mathematically impossible at production scale”
    • TheAgentCompany benchmark: best agents achieve only 30.3% task completion on realistic workplace scenarios
    • Typical agents hover around 8-24% success rates (Utkarsh Kanwat)
  2. 95% Failure Rate
    • “Despite tens of billions invested, 95% of organizations see no measurable return from AI agent projects”
    • Studies show 20-30% productivity improvements, far from “10x” claims
    • 66% cite AI’s “almost correct” solutions as their biggest time sink (Directual Blog)
  3. Expert Skepticism
    • IBM researcher: “I’m still struggling to truly believe that this is all that different from just orchestration… You’ve renamed orchestration, but now it’s called agents”
    • “Over-delegation” is a common failure mode where subagents are spawned for every minor task (IBM Think)
  4. The “Demo vs Production” Gap
    • “The gap between ‘works in demo’ and ‘works at scale’ is enormous”
    • Many “agentic” AI companies are overhyped (“agent washing”) (Zigron)


For Fowler’s Pattern Formalization Needs

How Do People Handle Merge Conflicts in Parallel Agent Sessions?

Strong Evidence Found:

  1. Git Worktrees as the Primary Solution
    • The core idea: run multiple Claude Code instances simultaneously on different parts of your project, each in its own Git worktree (Steve Kinney Course)
    • Each agent gets independent workspace, “preventing conflicts when multiple agents modify code simultaneously”
    • Allows “one Claude instance refactoring authentication while another builds unrelated data visualization” (Geeky Gadgets); a minimal setup sketch follows this list
  2. File-Level Locking
    • Agent-MCP framework includes “built-in conflict prevention with file-level locking and task assignment”
    • Prevents agents from stepping on each other’s work, “eliminating merge conflicts from simultaneous edits” (GitHub - Agent-MCP)
  3. Stacked PRs
    • Best practice: “Stack pull requests and organize related changes into a sequence of dependent pull requests”
    • “Reduces merge conflicts, and enhances code clarity” (Graphite)
  4. Key Risk Identified
    • DORA 2024 report: “25% increase in AI adoption triggered a 7.2% decrease in delivery stability”
    • “Multi-agent coding amplifies both productivity gains and coordination risks” (Digital Applied)
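
A minimal sketch of the worktree-per-agent setup from item 1; the task and branch names are illustrative, and launching an agent inside each worktree is left as a hypothetical step.

```python
import subprocess
from pathlib import Path

# Illustrative, independent tasks: one worktree (and branch) per agent session.
TASKS = ["auth-refactor", "data-viz"]

def create_worktree(branch: str) -> Path:
    """Give each agent its own checkout so parallel edits never touch the same files."""
    path = Path("..") / f"wt-{branch}"
    subprocess.run(["git", "worktree", "add", "-b", branch, str(path)], check=True)
    return path

for task in TASKS:
    worktree = create_worktree(task)
    # Hypothetical next step: launch an independent agent session with cwd=worktree,
    # then merge each branch back through a normal PR once its tests pass.
    print(f"Worktree ready for '{task}' at {worktree}")
```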

What Are Explicit Boundaries for When This Approach Works vs. Doesn’t?

Strong Evidence Found:

  1. When It Works:
    • Rich context environments with comprehensive architectural understanding
    • Tasks with high parallelization potential
    • Information that exceeds single context windows
    • Narrow contexts requiring low-latency decisions (Google Cloud Architecture Center)
    • Greenfield/prototype projects where agents “appear competent”
  2. When It Fails:
    • Legacy/brownfield code: “Deep understanding of existing system is critical, and AI doesn’t truly understand—it guesses based on patterns”
    • Multi-step sequential workflows where Step B relies entirely on Step A
    • Enterprise codebases with “knowledge that exists nowhere else: performance decisions from production issues, architectural patterns from infrastructure migrations” (Augment Code)
    • Security-critical contexts (agents default to less secure authentication methods)
    • When sandboxing is insufficient (prompt injection risk) (Colin Walters Blog)
  3. The 45% Rule
    • Google research: If a single agent succeeds 45%+ on a task, adding more agents will likely degrade performance
    • “Enterprises should always benchmark with a single agent first” (VentureBeat); a decision sketch follows this list
  4. Maturity Model for Orchestration
    • Level 1: One well-prompted agent with 3-5 tools (handles 80% of simple use cases)
    • Level 2: One agent with tool chaining, conditional logic
    • Level 3: Coordinator + worker agent (simplest multi-agent pattern)
    • Level 4: Production-grade with observability, checkpointing, recovery (n8n Blog)
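
The 45% rule in item 3 reduces to a one-line decision helper; the threshold comes from the cited Google study, while the wording of the recommendations is an illustrative assumption.

```python
SINGLE_AGENT_THRESHOLD = 0.45  # from the cited Google study

def choose_architecture(single_agent_success_rate: float) -> str:
    """Benchmark a single agent first; only consider multi-agent below the threshold."""
    if single_agent_success_rate >= SINGLE_AGENT_THRESHOLD:
        return "single-agent (adding agents is likely to degrade performance)"
    return "consider decomposition or a coordinator/worker pattern, then re-benchmark"

print(choose_architecture(0.60))  # -> single-agent
print(choose_architecture(0.30))  # -> consider decomposition ...
```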

Are There Other Named Patterns for “Middle Ground” Agent Workflows?

Evidence Found:

  1. Modular Monolith Pattern for AI Agents
    • Microsoft explicitly documents “Agents as a modular monolith” pattern (Microsoft Multi-Agent Reference Architecture)
    • All components reside in single codebase, deployed as one application
    • “Logically decoupled as independent modules” but “physically share a process space”
    • Benefits: shared infrastructure, unified governance, ease of observability, rapid iteration, lower overhead
  2. ReAct Pattern (The Middle Ground)
    • “ReAct finds a middle ground, providing enough structure through reasoning while maintaining flexibility through iterative action” (ByteByteGo); a minimal loop sketch follows this list
  3. Spec-Driven Development (SDD)
    • Emerged as major 2025 paradigm for structured agent work
    • Tools: Spec-Kit (GitHub), Kiro (AWS), Tessl
    • Three phases: /specify (requirements, acceptance criteria) -> /plan (architecture, test strategy) -> /tasks (executable units)
    • “Specifications become first-class, durable artifacts that guide and constrain AI systems” (Martin Fowler)
  4. Planning Pattern
    • Emphasizes upfront strategic thinking before execution
    • Breaks goal into subtasks, identifies dependencies, considers resources
    • Only after creating structured plan does agent begin execution (ByteByteGo)
  5. The “Orchestrator” Progression
    • Trend identified: “Developers became orchestrators of AI agents—a role that demands the same technical judgment, critical thinking, and adaptability they’ve always had. Prompt engineering doesn’t cut it.” (The New Stack)
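
To make the ReAct “middle ground” in item 2 concrete, here is a schematic reason-act-observe loop; llm_reason() and the TOOLS registry are hypothetical placeholders, not any particular framework’s API.

```python
from typing import Callable

# Hypothetical tool registry: names mapped to plain functions the agent may call.
TOOLS: dict[str, Callable[[str], str]] = {
    "search_code": lambda query: f"(results for {query!r})",
    "run_tests": lambda _: "(test output)",
}

def llm_reason(goal: str, transcript: list[str]) -> tuple[str, str, str]:
    """Hypothetical: ask the model for a thought, a tool name, and a tool input."""
    raise NotImplementedError

def react(goal: str, max_steps: int = 8) -> list[str]:
    """Alternate explicit reasoning with tool calls, feeding each observation back in."""
    transcript: list[str] = []
    for _ in range(max_steps):
        thought, action, action_input = llm_reason(goal, transcript)
        transcript.append(f"Thought: {thought}")
        if action == "finish":                     # the model decides it is done
            transcript.append(f"Answer: {action_input}")
            break
        observation = TOOLS[action](action_input)  # act, then observe the result
        transcript.append(f"Action: {action}({action_input}) -> {observation}")
    return transcript
```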


Validations & Rejections

Validated Concerns

  1. Karpathy’s Concern: Technical Specificity Matters
    • VALIDATED: Research shows that vague approaches fail. Anthropic’s own best practices emphasize specific workflows (Explore->Plan->Code->Commit, TDD loops)
    • Multiple sources confirm that “Claude performs best when it has a clear target to iterate against”
  2. Graham’s Concern: Coordination Has Hidden Costs
    • STRONGLY VALIDATED: Google’s 180-experiment study proves coordination failures reduce performance by 39-70% on sequential tasks
    • The microservices parallel is explicitly drawn in multiple 2025 architecture articles
    • “The religious wars are over. Pragmatism won.”
  3. Fowler’s Concern: Merge Conflicts in Parallel Sessions
    • VALIDATED: Git worktrees have emerged as the standard solution, but the DORA finding (25% AI adoption increase -> 7.2% stability decrease) confirms the risk is real
    • File-level locking frameworks exist but add complexity
  4. Concern About Context Limits
    • VALIDATED: Models effectively use only 8K-50K tokens regardless of window size
    • 70% of paid tokens provide minimal value
    • Auto-compaction and memory tools are necessary, not optional
  5. Concern About TDD Gaming
    • VALIDATED: Practitioners report Claude “touting all tests pass” without real implementation
    • Separate context windows for test writer vs implementer recommended as solution
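
One way to implement that separate-contexts mitigation, sketched with a hypothetical run_fresh_agent_session() stand-in: the test author and the implementer never share a context window, and the test runner is the only arbiter.

```python
import subprocess

def run_fresh_agent_session(prompt: str) -> None:
    """Hypothetical stand-in: start a brand-new agent session (no shared context)."""
    print(f"[new agent session] {prompt}")

# Session 1: test author only; it never sees any implementation work.
run_fresh_agent_session(
    "Write failing pytest tests for the feature spec. Do not write implementation code."
)

# Session 2: implementer only; it may read the tests but must not modify them.
run_fresh_agent_session(
    "Make the existing tests pass. Treat the tests as read-only."
)

# Arbiter: the test runner, not either agent's self-report.
result = subprocess.run(["pytest", "-q"])
print("PASS" if result.returncode == 0 else "FAIL: return to the implementer session")
```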

Rejected Concerns

  1. “More Agents = Better Results” Assumption
    • REJECTED: Google research definitively shows this is false for sequential tasks
    • Single agents with 45%+ success rate outperform multi-agent setups
    • Coordination overhead scales non-linearly
  2. “Context Windows Are Getting Large Enough”
    • REJECTED: Larger windows don’t mean better utilization
    • Quality over quantity is the 2025 consensus
    • Middle 70-80% of context shows 20% performance degradation
  3. “Experienced Developers Benefit Most from AI Agents”
    • PARTIALLY REJECTED: The METR study found experienced open-source maintainers were actually slowed by 19% when using AI
    • Stack Overflow 2025: Experienced developers report the lowest trust (2.6% “highly trust”) and the highest distrust (20%)
    • However, experienced developers employ better control strategies

Concrete Examples

Case Study 1: Modus Create Experiment (TDD + Agentic Workflow)

Source: Tweag - Introduction to Agentic Coding

Case Study 2: Nexlify Edge AI Startup (Pre-commit Validation)

Source: Gatlen Culp - Pre-Commit Hooks Guide 2025

Case Study 3: Amazon Prime Video (Coordination Failure)

Source: ByteIota

Case Study 4: Multi-Agent TDD with Separate Contexts

Source: Alex Op Dev - Forcing Claude Code to TDD

Case Study 5: TheAgentCompany Benchmark (Production Reality Check)

Source: Utkarsh Kanwat - Betting Against Agents


Competitive Landscape

Existing Articles on Similar Topics

  1. Tweag - “Introduction to Agentic Coding” (October 2025)
    • Covers: Definition of agentic coding, experiment results, basic workflow
    • Gap: Doesn’t address the “middle ground” positioning between chaos and complexity
    • URL: https://www.tweag.io/blog/2025-10-23-agentic-coding-intro/
  2. Anthropic - “Claude Code Best Practices”
    • Covers: TDD workflow, CLAUDE.md usage, safe YOLO mode
    • Gap: Official documentation style, doesn’t address why NOT to use complex orchestration
    • URL: https://www.anthropic.com/engineering/claude-code-best-practices
  3. VentureBeat - “Why AI Coding Agents Aren’t Production-Ready”
    • Covers: Context window issues, broken refactors, missing operational awareness
    • Gap: Problem-focused, doesn’t offer a practical middle-ground solution
    • URL: https://venturebeat.com/ai/why-ai-coding-agents-arent-production-ready-brittle-context-windows-broken
  4. RedMonk - “10 Things Developers Want from Agentic IDEs”
    • Covers: Spec-driven development, human-in-the-loop controls
    • Gap: Survey of desires, not a prescriptive workflow
    • URL: https://redmonk.com/kholterhoff/2025/12/22/10-things-developers-want-from-their-agentic-ides-in-2025/
  5. Utkarsh Kanwat - “Why I’m Betting Against AI Agents in 2025”
    • Covers: Mathematical limitations, benchmark failures
    • Gap: Contrarian/skeptical, doesn’t offer positive alternative
    • URL: https://utkarshkanwat.com/writing/betting-against-agents
  6. Martin Fowler - “Spec-Driven Development Tools”
    • Covers: Kiro, Spec-Kit, Tessl
    • Gap: Tool review, not workflow philosophy for the “middle ground”
    • URL: https://martinfowler.com/articles/exploring-gen-ai/sdd-3-tools.html

Key Gap in the Landscape

No article directly addresses the disciplined middle ground itself: TDD-gated, single-agent sessions run in parallel via Git worktrees, positioned between chaotic “vibe coding” and complex multi-agent orchestration.



Research Synthesis

Key Takeaways That Should Inform the Outline

  1. The Mathematical Case for Simplicity Is Strong
    • Google’s 180-experiment study provides definitive evidence that coordination fails on sequential tasks
    • Error compounding (95% per step = 36% over 20 steps) makes complex orchestration mathematically problematic
    • This is not opinion—it’s empirical research from 2025
  2. The Microservices Parallel Is Explicit in 2025 Literature
    • Multiple articles draw the direct comparison
    • “The religious wars are over. Pragmatism won.”
    • Amazon Prime Video case study is a powerful analogy for “coordination failure”
    • Modular monolith as the middle ground is now a named pattern for AI agents (Microsoft documentation)
  3. TDD Is the Validation Loop, Not Just a Preference
    • Anthropic explicitly calls it their “favorite workflow”
    • Tests provide “reliable exit criteria” for agents
    • But there’s a documented failure mode: agents game tests without real implementation
    • Solution: separate contexts for test writing vs implementation
  4. Context Management Is Critical, Not Optional
    • Models effectively use only 8K-50K tokens regardless of window size
    • 70% of paid tokens provide minimal value
    • CLAUDE.md files, auto-compaction, and memory tools are practical necessities
    • “Quality over quantity” is the 2025 consensus
  5. Parallel Independence > Cross-Collaboration
    • Git worktrees enable parallel sessions without merge conflicts
    • File-level locking exists in frameworks like Agent-MCP
    • The key is “high task independence”—features that don’t need to talk to each other
    • DORA finding: 25% AI adoption increase -> 7.2% stability decrease (coordination risk is real)
  6. “Stage 5-6 Developer” Positioning Is Supported
    • Stack Overflow: 46% actively distrust AI accuracy; experts are most skeptical
    • Experienced developers “retain agency” and employ “control strategies”
    • The role is “orchestrator” not “passenger”
    • “Prompt engineering doesn’t cut it”—technical judgment still required
  7. Named Patterns for the Middle Ground Exist
    • Modular Monolith for AI Agents (Microsoft)
    • ReAct Pattern (“middle ground between structure and flexibility”)
    • Spec-Driven Development (specifications as first-class artifacts)
    • Planning Pattern (upfront strategic thinking before execution)
  8. The Contrarian Angle Has Evidence
    • IBM researcher: “I’m still struggling to believe this is different from just orchestration”
    • 95% of organizations see no measurable return from AI agent projects
    • “Agent washing” is a real phenomenon
    • The “demo vs production” gap is enormous

Recommended Article Structure Based on Research

  1. Open with the Jeff Tang tweet - “Ralph Wiggum in a for loop” vs “Gas Town” captures the cultural moment
  2. Establish the mathematical case - Error compounding, Google’s 180-experiment study
  3. Draw the microservices parallel explicitly - “The religious wars are over. Pragmatism won.”
  4. Introduce the middle ground pattern - Call it something memorable (Modular Agent? Pragmatic Agentic Coding?)
  5. Technical specifics - TDD loop, CLAUDE.md, context management, pre-commit validation
  6. Parallel sessions without collaboration - Git worktrees, file-level boundaries, why independence > coordination
  7. Boundaries and failure modes - When this works, when it doesn’t (45% rule, brownfield code, security-critical)
  8. Position for “Stage 5-6 developers” - Most of us, orchestrators not passengers, technical judgment required

The Core Thesis (Refined by Research)

The most effective approach for most developers in 2025 is not the chaotic “vibe coding” of Ralph Wiggum bash loops, nor the complex multi-agent orchestration of Gas Town—it’s a disciplined middle ground: TDD-gated single-agent sessions running in parallel (via Git worktrees), with explicit CLAUDE.md instructions, pre-commit validation, and context management awareness. This mirrors the 2025 consensus on microservices: “Build the simplest architecture that solves your actual problems.” Coordination has hidden costs that scale non-linearly; independence scales linearly with developer attention.