
Today, I am introducing the concept of verification-driven agentic workflows. The approach: define (and continually refine) verification criteria across the entire lifecycle of a workflow, then let the agent iterate until those criteria are met, checking in as the human at the points you deem reasonable.
The insight underneath is simple: verifiable tasks are automatable tasks. This distills Andrej Karpathy’s idea of Software 2.0. It also inherits from RLVR (Reinforcement Learning from Verifiable Rewards), the approach that’s driven the recent wave of reasoning models. Tests are resettable, efficient, and provide non-gameable reward signals. Give the model a verification target and let it practice.
The rest of this article walks through an example of putting this into practice: using an agentic coding harness to write software features. I’ll show this through a set of workflows I’ve designed and refined over recent months.
The approach uses two workflows: one that transforms a plan into a structured, verifiable spec, and one that reviews that spec against your standards, then implements it phase by phase using TDD. Both workflows use verification loops to ensure the practices I find important are applied throughout.
I’ve packaged these as a Claude Code plugin with two main commands: /spec creates the specification with phases, acceptance criteria, and a validation strategy; /execute-wf runs the review and implementation loops. The plugin, the spec file structure, and the verification criteria are all tunable—what follows is how I’ve configured mine.
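To make the shape of these commands concrete, here is a minimal sketch of what a command definition might look like, assuming Claude Code’s markdown-based slash command format. The file path, prompt text, and output location are illustrative; the real plugin’s prompts are considerably longer.

```markdown
<!-- commands/spec.md: a simplified, illustrative command definition -->
Create a specification for: $ARGUMENTS

1. Analyze the codebase: existing patterns, current state vs. what is needed.
2. Draft the spec: overview and objectives, architecture design, and
   implementation phases, each with acceptance criteria.
3. Survey available tools (CLIs, MCPs, Playwright) and design a final
   validation phase that confirms the feature works end to end.

Write the result to specs/<feature-name>.md and stop for human review.
```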
The sections that follow break down each piece.
The verification criteria are tuned to my engineering practices: simplicity, architectural consistency, test coverage. Yours might emphasize different things. See the appendix for more on what I verify and how a different team might configure theirs.
The spec file is where your definition of “right” becomes structure. It’s the artifact that bridges planning and execution—objectives, phases, acceptance criteria, validation strategy. When the context window refreshes or the agent restarts, it picks up where it left off.
But the spec file isn’t just input to verification—it’s subject to it. The workflows I’ll present include a review phase that verifies the plan itself: Does it align with codebase patterns? Is it over-engineered? Does the test coverage match your standards? The agent iterates on the spec until these criteria are met, before implementation begins.
Here’s how mine are structured:
```mermaid
block-beta
columns 1
A["OVERVIEW & OBJECTIVES<br/>— Problem statement and goals —"]
B["CURRENT STATE ANALYSIS<br/>— What exists vs. what's needed —"]
C["ARCHITECTURE DESIGN<br/>— Conceptual description, diagrams (no code) —"]
D["IMPLEMENTATION PHASES<br/>— Phase 1, 2, ... N with acceptance criteria —"]
E["FINAL PHASE: CLEAN THE HOUSE<br/>— Dead code removal, docs update —"]
F["VALIDATION PHASE<br/>— Tools & methods to confirm feature works E2E —"]
style A fill:#0C5DF2,color:#ffffff,stroke:#0945b5
style B fill:#0C5DF2,color:#ffffff,stroke:#0945b5
style C fill:#0C5DF2,color:#ffffff,stroke:#0945b5
style D fill:#0C5DF2,color:#ffffff,stroke:#0945b5
style E fill:#0C5DF2,color:#ffffff,stroke:#0945b5
style F fill:#0C5DF2,color:#ffffff,stroke:#0945b5
```
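As a concrete (and deliberately skeletal) example, a spec file following this structure might look like the sketch below. The feature, criteria, and filenames are hypothetical placeholders, not output from the plugin.

```markdown
# Spec: Draft autosave

## Overview & Objectives
Users lose work when the editor crashes; autosave drafts every 10 seconds.

## Current State Analysis
The editor holds state in memory only; no persistence layer for drafts.

## Architecture Design
Conceptual description and diagrams only (no code).

## Implementation Phases
### Phase 1: Draft persistence
- [ ] Drafts survive a page reload
- [ ] No save occurs when content is unchanged

### Final Phase: Clean the House
- [ ] Dead code removed, docs updated

## Validation Phase
Playwright: type into the editor, kill the tab, reopen, assert the draft loads.
```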
Two workflows encode the verification principle: one for planning, one for implementation.
Spec Planning — Takes a plan and produces a structured spec file with phases, acceptance criteria, and a validation strategy. The output is verifiable structure.
Review and Implementation — Verifies the spec against your standards (simplicity, patterns, test coverage), then executes it phase by phase using TDD. Each phase has its own verification gate.
Let’s walk through each.
Three steps:
Plan — Develop the plan conversationally with your agent. The goal is a coherent plan in the context window before formalizing it.
Structure — Transform the plan into a spec file with discrete phases, each with acceptance criteria. This is where your definition of “right” becomes verifiable structure.
Design Validation — Define how you’ll verify the feature works end-to-end. The agent examines available tools (CLIs, MCPs, Playwright, etc.) and designs a validation phase that programmatically confirms the job is done.
The output is a draft spec—verifiable structure ready for review.
```mermaid
flowchart LR
Start(["/spec"]) --> Analyze["Analyze Codebase"]
Analyze --> SpecGen
subgraph SpecGen["Generate Spec File"]
S["Overview & Objectives<br/>Current State<br/>Architecture Design<br/>Implementation Phases"]
end
SpecGen --> ValDesign
subgraph ValDesign["Design Validation"]
V["Research available tools<br/>Plan E2E verification<br/>Add validation phase"]
end
ValDesign --> Done(["Draft Spec"])
style SpecGen fill:#F5F7FA,color:#2A2F36,stroke:#0C5DF2
style ValDesign fill:#0C5DF2,color:#ffffff,stroke:#0945b5
```
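The output of the Design Validation step is itself a spec section. Here is a sketch of what that might look like for a hypothetical web feature; the specific tool choices are assumptions for illustration, not prescriptions.

```markdown
## Validation Phase
Available tooling found: Playwright (UI), curl (API), gh run watch (CI).

- [ ] Playwright suite in validation/ passes against a local build
- [ ] POST /api/drafts returns 201 with the persisted draft id
- [ ] CI pipeline is green end to end
```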
The review and implementation workflow has two sub-workflows: spec review, then phased implementation.
The review sub-workflow verifies the spec itself. The agent iterates on the plan, checking it against the criteria I’ve defined as important to my workflow: simplicity, test coverage, design alignment with existing patterns, and implementation completeness.
Each check is a verification loop—the agent reviews, suggests changes, and refines until criteria are met. A PATTERNS.md document encodes your standards, giving the agent explicit rules to verify against; an illustrative excerpt follows the diagram below.
```mermaid
flowchart LR
A["Simplify<br/>Remove over-engineering"] --> B["Generate Tests<br/>Create test spec file"] --> C["Review Design<br/>Align to patterns"] --> D["Review Implementation<br/>Fill gaps, clarify"] --> Done(["Final Spec"])
style A fill:#e8ecf1,color:#2A2F36,stroke:#6C7A89
style B fill:#F5F7FA,color:#2A2F36,stroke:#dddddd
style C fill:#e8ecf1,color:#2A2F36,stroke:#6C7A89
style D fill:#e8ecf1,color:#2A2F36,stroke:#6C7A89
```
PATTERNS.md guides the Simplify, Review Design, and Review Implementation steps.
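Here is an illustrative excerpt of what a PATTERNS.md might contain. These particular rules are examples of the kind of explicit, checkable standards the review loop needs, not the plugin’s defaults.

```markdown
<!-- PATTERNS.md (excerpt): illustrative rules; yours will differ -->
## Simplicity
- Prefer the smallest change that satisfies the acceptance criteria.
- No new abstractions until a pattern repeats three times.

## Architecture
- Services talk to the database through the repository layer only.

## Testing
- Every public function has unit tests on human-confirmed cases.
```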
The output is your final spec, verified against your standards before implementation begins.
Implementation is a loop over phases, using TDD as the verification signal.
For each phase, the agent writes the tests first and confirms they fail, implements until the tests pass, then commits and marks the phase complete in the spec, recording the git hash.
The git hash serves as a checkpoint—if context resets, the agent knows where to resume.
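In practice this looks like status annotations in the spec itself. A hypothetical snapshot mid-implementation, with placeholder phase names and commit hash:

```markdown
### Phase 1: Draft persistence (COMPLETE, commit a1b2c3d)
- [x] Drafts survive a page reload
- [x] No save occurs when content is unchanged

### Phase 2: Conflict resolution (IN PROGRESS)
- [ ] Concurrent edits surface a merge prompt
```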
The final validation phase is another verification loop. The agent uses whatever tools match the task—Playwright for UI, CLI calls for APIs, job monitors for CI—and iterates until validation criteria are met.
```mermaid
flowchart LR
Start(["Final Spec"]) --> Loop
subgraph Loop["For Each Phase"]
direction LR
T1["Write Tests"] --> T2["Implement"] --> T3["Commit"]
end
T3 -->|"Mark complete<br/>+ git hash"| Check{"More<br/>phases?"}
Check -->|Yes| Loop
Check -->|No| Val
subgraph Val["Validation"]
V["Run check-work<br/>Verify acceptance criteria"]
end
Val --> Done(["Feature Complete"])
style Loop fill:#F5F7FA,color:#2A2F36,stroke:#0C5DF2
style Val fill:#0C5DF2,color:#ffffff,stroke:#0945b5
style T1 fill:#e8ecf1,color:#2A2F36,stroke:#6C7A89
style T2 fill:#e8ecf1,color:#2A2F36,stroke:#6C7A89
style T3 fill:#e8ecf1,color:#2A2F36,stroke:#6C7A89
```
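For a UI feature, a validation test derived from the spec might look like the following. This is a minimal Playwright sketch; the route, selectors, and autosave interval are assumptions carried over from the hypothetical spec above.

```typescript
// validation/draft-autosave.spec.ts: E2E check derived from the spec's
// validation phase. URL and selectors are illustrative assumptions.
import { test, expect } from '@playwright/test';

test('draft survives a reload', async ({ page }) => {
  await page.goto('http://localhost:3000/editor'); // assumed local dev server
  await page.getByRole('textbox').fill('hello, autosave');
  // Wait past the assumed 10s autosave interval; fine for a sketch,
  // though a real suite would await a save indicator instead.
  await page.waitForTimeout(11_000);
  await page.reload();
  await expect(page.getByRole('textbox')).toHaveValue('hello, autosave');
});
```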
I’ve used this workflow to build a few personal applications of varying complexity (future posts on these, perhaps!). These codebases have extremely high unit test coverage on human-confirmed test cases, plus suites of validation tests derived from the specs that created them. Those validation tests run via Playwright and can be executed at any time to verify the respective feature sets still work.
Bringing it back to the original concept: this is RLVR (Reinforcement Learning from Verifiable Rewards) for your agentic workflow. The workflows I’ve presented define the reward signal—acceptance criteria, passing tests, validation checks—and let the agent iterate until it meets them.
The underlying insight remains: verifiable tasks are automatable tasks. The more precisely you can define “done,” the more confidently you can hand the loop to an agent.
In the agentic context, TDD serves a crucial additional purpose: it provides an objective, programmatic signal for whether a phase is complete. The agent writes tests first, confirms they fail, implements until they pass, then moves on. No ambiguity, no drift—the tests are the contract. This grounds the loop in something verifiable rather than relying on the agent’s self-assessment.
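As a small illustration of that contract, here is the kind of test file the agent would write before any implementation exists, using Vitest and a hypothetical slugify helper:

```typescript
// slugify.test.ts: written first, so it fails (red) until the phase
// implements slugify and makes it pass (green).
import { describe, it, expect } from 'vitest';
import { slugify } from './slugify'; // does not exist yet when tests are written

describe('slugify', () => {
  it('lowercases and hyphenates', () => {
    expect(slugify('Hello World')).toBe('hello-world');
  });

  it('strips punctuation', () => {
    expect(slugify("Don't Panic!")).toBe('dont-panic');
  });
});
```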
The workflows in this article verify against criteria I’ve found valuable over 20 years of engineering practice:
What I verify: simplicity (no over-engineering), architectural consistency with existing codebase patterns, and test coverage, both unit tests on human-confirmed cases and end-to-end validation.
What a different team might emphasize: performance budgets, security review, accessibility, or documentation standards, whatever their definition of “done” demands.
The point isn’t that my criteria are right; it’s that you define yours explicitly, encode them into your verification loop (in this example, that’s PATTERNS.md), and let the agent verify against them. The workflow is the same; the verification targets are yours to choose.