Spec-driven development took over the conversation in under a year. Spec Kit from GitHub, Kiro from AWS, Tessl, BMAD, Agent OS: a dozen tools arrived to make you write down what you want, in detail, up front, so the agent stops guessing. I have been skeptical of it for a while, but skepticism is cheap. What changed my mind about the shape of the fix is a two-part argument by Kapil Viren Ahuja, who has been running the alternative on his own team and writing down where it breaks. This is my reading of it: the parts I think are right, and the place it ends up, which the talk around SDD never reached: the client's invoice.

The argument in five lines

  • A spec is a prediction, written at your least-informed moment. The agent fills the holes you left.
  • Proof: OpenAI's best spec (Symphony, 2,169 lines) was written last, distilled from software that already ran.
  • The fix is ICE: Intent (goal + constraints + failure conditions), Context (harness-owned), Expectations (the definition of done).
  • Separate the invariants from the unknowns. Pin only the part that won't change; let the how be discovered.
  • Tests don't disappear, they relocate. Outcome checks and runtime guardrails are where the rigor lives now.

SDD exists because vibe coding wasn't working. You described a goal, you got a block of code back, it looked right, and it didn't quite work. GitHub said the quiet part themselves when they shipped Spec Kit in September 2025: models are "exceptional at pattern completion, but not at mind reading." The reach was for the spec. Write it all down, and the agent stops inventing.

One correction, because careful readers will check: spec-driven development is not new. Larman and Basili documented in IEEE Computer back in 2003 that iterative development goes back to the 1950s, and that the single-pass, document-driven ideal was doubted from the start, even by Royce, the man usually blamed for waterfall. What's new is handing the spec to a goal-seeking machine and expecting the spec to hold it.

The spec leaves holes, and the agent fills them

The whole problem with SDD is that people write the spec however they want. It holds the what. It holds the how. It holds whatever was in the engineer's head that morning. Then the agent, Claude Code or Codex or whatever you run, reads it and makes decisions, because there are gaping holes in it.

The holes aren't a defect in any one spec. They are two human gaps:

It is worse than "the spec leaves holes." Open Spec Kit's own files. Its manifesto describes the templates' purpose plainly: they "transform the LLM from a creative writer into a disciplined specification engineer," under a section literally titled "How Structure Constrains LLMs for Better Outcomes."

Then the spec command, when the description is incomplete, tells the model to "make informed guesses," use "common patterns" to "fill gaps," and caps it at a "Maximum 3 [NEEDS CLARIFICATION] markers." Three. The tool does not merely leave holes for the agent to fill. It instructs the agent to fill them, and limits how often it has to admit it's guessing.

The same toolkit argues with itself. The manifesto calls test-first non-negotiable. A task template shipped in the same repo the same week says tests are optional. The file that runs the work says follow TDD. Three orders for one decision, shipped together. Hand a goal-seeking model three contradictory rules and it can't reason toward an outcome. It picks a dogma, obeys it, and improvises the rest: the exact behavior the rigid method was adopted to prevent.

What OpenAI's Symphony spec actually proves

Look at what OpenAI published with Symphony in April 2026: an open spec for orchestrating agents, whose core is a single file of 2,169 lines, eighteen sections, formal must-and-should language. The depth shows exactly what we are asking people to do, and exactly what they cannot.

If anyone could write all of that into one SPEC.md up front, then yes, it works. That is not sarcasm; it is the literal truth. A complete, unambiguous, two-thousand-line spec produces good software.

But look at the order OpenAI did it in. They spent about six months building an internal tool under one rule: no human-written code, every line generated by Codex. They got it working. Only after it worked did they distill the spec out of the running system and have Codex reimplement it in five more languages to shake the ambiguities loose.

The deep, RFC-grade spec is retrospective documentation of software that already ran. The one organization that produced a spec that good produced it last, by reverse-engineering software that was already alive. The industry is selling the output of that process as if it were the method.

That is the trap nobody names: you cannot write the spec up front, because the spec that works is the one you can only write after the thing exists.

ICE: intent, context, expectations

Here is the frame that replaces it. ICE, for Intent, Context, Expectations: three artifacts, each with a clear owner, none of them a single bloated document.

The fix is owning each one as its own craft. The moment the definition of done drifts away from the person who wanted the outcome, the agent starts deciding "done" for them.

| | Spec-driven (SDD) | Intent-driven (ICE) |
|---|---|---|
| Artifact | One SPEC.md: what + how | Intent, context, expectations, split apart |
| Written up front | Everything, in detail | Only the part that won't change |
| The "how" | Pinned in the spec | Discovered by the agent |
| Ambiguity | Filled silently by the agent | Closed by the human, or caught by evals |
| Owns "done" | Whoever held the keyboard | The person who wanted the outcome |

[Figure: The human owns two things and never leaves them: INTENT (goal + constraints + failures) and EXPECTATIONS (what "done" means). The agentic loop (harness): pull context, code it, validate vs expectations; not met? again; met: merge. The harness owns the loop and is never asked to invent what the human wanted.]

The human gives intent and expectations. The harness pulls context, codes, and validates against the expectations; if they aren't met it goes again, and keeps going until they are. SDD breaks because it asks one document, written one way, to be all four things at once (intent, definition of done, workflow, and context) with the gaps left to the agent's discretion.
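
The loop in that paragraph is small enough to sketch. What follows is an illustration under stated assumptions, not the API of any real harness: `Intent`, `run_loop`, `pull_context`, `build`, and the `max_rounds` budget are all hypothetical names. The cap on rounds is my addition, since the prose only says the loop keeps going until the expectations are met.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Intent:
    goal: str
    constraints: tuple[str, ...]


# An expectation is a named, binary check on the built artifact.
Expectation = Callable[[str], bool]


def run_loop(
    intent: Intent,
    expectations: dict[str, Expectation],
    pull_context: Callable[[Intent], str],
    build: Callable[[Intent, str, int], str],
    max_rounds: int = 5,  # assumption: a real harness needs some budget cap
) -> Optional[str]:
    """Pull context, build, validate; repeat until every expectation holds."""
    for attempt in range(max_rounds):
        context = pull_context(intent)            # harness-owned, not human-written
        artifact = build(intent, context, attempt)  # the agent discovers the "how"
        unmet = [name for name, check in expectations.items() if not check(artifact)]
        if not unmet:
            return artifact                        # every expectation met: merge
    return None                                    # budget spent: back to the human
```

Note what the builder is handed: the intent, the context, and an attempt counter. It is never handed the expectations themselves, and the human never appears inside the loop except as the author of the two inputs.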

The anatomy of intent

Intent is the root, and if the root is a spec in disguise, nothing built on it can be anything else. It is three things: a goal, the constraints, and the failure conditions.

The goal

The goal is the outcome you want, in one sentence. There's a clean test: can two completely different implementations both satisfy this? If yes, you wrote a goal. If only one implementation could possibly satisfy it, you wrote a spec and called it a goal.

And no "and." When you need a conjunction you almost certainly have two goals pretending to be one, and the fix is to split, never to add detail. The method scales by adding more intents, not by making one heavier.

"Build a microservice that handles the user-facing product catalog" is a goal. A dozen engineers build it a dozen ways and every one could be right. "Build a Go microservice using gRPC, with PostgreSQL for catalog storage and Redis for cart state, behind an Envoy sidecar" is a spec wearing a goal's clothes. The design choices are already made, in a hurry, possibly wrong, and the agent is demoted from decision-maker to typist. If you wanted a typist you needed a keyboard and an afternoon, not an agent.
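
The two-implementations test needs a human, but the cheaper symptoms can be caught mechanically. Here is a rough sketch of a goal linter. `lint_goal` and the `IMPLEMENTATION_TERMS` list are hypothetical and deliberately crude: a word list like this will throw false positives (the verb "go", for one), so treat a warning as a prompt to re-read the sentence, not as a verdict.

```python
import re

# Hypothetical, deliberately short list: names that pin an implementation.
IMPLEMENTATION_TERMS = {"go", "grpc", "postgresql", "redis", "envoy", "kubernetes"}


def lint_goal(goal: str) -> list[str]:
    """Return warnings for a goal that looks like a spec in disguise."""
    warnings = []
    words = re.findall(r"[a-z0-9]+", goal.lower())
    if "and" in words:
        warnings.append('contains "and": probably two goals, split it')
    pinned = sorted(IMPLEMENTATION_TERMS.intersection(words))
    if pinned:
        warnings.append("names an implementation: " + ", ".join(pinned))
    return warnings
```

Run against the two sentences above, the first comes back clean and the second trips both checks.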

The constraints

Constraints are the qualities the outcome must carry: the non-functional requirements. Each is directional (it points where the outcome should land) and unconditional (miss it and the output fails, however clever the implementation). They sit in business language, not implementation patterns.

"99.99% uptime against the SLO" is a constraint. "Use the existing Kubernetes setup with three replicas and pod disruption budgets" is not. It picks a pattern, so it belongs in Context. The discipline is five to seven lines, not fifty. The waterfall trap waits exactly here: the urge to write every requirement you know, in the name of completeness, is how 1,300-line specs are born.

The failure conditions

Failure conditions are the binary checks the validator runs against the output. Each is observable (a script or eval decides true/false without a human opinion) and post-output (it catches what's wrong after the work is done, rather than shaping the builder as it goes). The decision rule between a constraint and a failure condition is one sentence:

Would knowing this change how the builder writes code? Yes → it's a constraint. No → it's a failure condition the validator catches after.

| | Constraint | Failure condition |
|---|---|---|
| The test | Shapes how the builder codes | Checked after, true or false |
| When | Pre-build, guides design | Post-output, binary eval |
| Who sees it | The builder | The validator, hidden from the builder |
| Example | "No new runtime dependency" | "Coverage stays above 90%" |

Goal

Build a microservice that handles the user-facing product catalog.

Constraints: qualities the outcome must carry

  • Support 1,000 concurrent users at peak.
  • p99 latency under 200 ms on read endpoints.
  • 99.99% uptime against the published SLO.
  • WCAG 2.1 AA on any user-facing surface.
  • OWASP ASVS Level 2 on every endpoint.

Failure conditions: binary checks the validator runs as evals

  • Build fails.
  • Unit-test coverage drops below 90%.
  • Lint reports any errors.
  • Sonar quality gate fails on coverage, duplication, or security hotspots.
  • A secret appears in the source.
  • API contract changes without a version bump.

That's the whole intent. No file paths, no class names, no library choices, no "use the Go service template." The Go template lives in Context, where the harness assembles it from the org's standards and passes it to the builder alongside the intent. The intent carries only what is irreducible to this work.

There's a second, sharper reason the constraint/failure split matters: compartmentation. The builder receives the goal and the constraints. The validator receives the failure conditions, compiled into encrypted evals. The builder cannot teach to a test it cannot see. LLMs reward-hack; if you hand the builder the same scenarios the validator will check, the builder games them. Keeping the two on opposite sides of a wall is the only structural defense, and it only works if every requirement is on the right side of the line.
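
The wall is easiest to see as types. This is a sketch under my own assumptions, not Kapil's implementation: `Intent`, `BuilderBrief`, `ValidatorBrief`, and `compartment` are hypothetical names, and the encryption step is left out entirely. The point is only that the two briefs share no fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    goal: str
    constraints: tuple[str, ...]
    failure_conditions: tuple[str, ...]


@dataclass(frozen=True)
class BuilderBrief:
    """Everything the builder is allowed to see."""
    goal: str
    constraints: tuple[str, ...]


@dataclass(frozen=True)
class ValidatorBrief:
    """Everything the validator is allowed to see."""
    failure_conditions: tuple[str, ...]


def compartment(intent: Intent) -> tuple[BuilderBrief, ValidatorBrief]:
    """Split one intent into two briefs that share no fields."""
    return (
        BuilderBrief(intent.goal, intent.constraints),
        ValidatorBrief(intent.failure_conditions),
    )


catalog = Intent(
    goal="Build a microservice that handles the user-facing product catalog.",
    constraints=("p99 latency under 200 ms on read endpoints.",),
    failure_conditions=("Build fails.", "A secret appears in the source."),
)
builder_brief, validator_brief = compartment(catalog)
```

A requirement filed in the wrong slot ends up on the wrong side of `compartment`, which is why the one-sentence decision rule matters before anything is compiled.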

The same anatomy, outside code

The shape isn't unique to software. A consumer outcome carries the same three slots, in plain language a shopper would recognize.

Goal

The user wants to buy a red shoe for under thirty dollars.

Constraints: qualities the experience must carry

  • The price shown is the total they pay, with shipping and tax included.
  • The inventory shown is real, and shippable to the user.
  • Checkout is three clicks or fewer.
  • The seller never sees card details.
  • It works for screen-reader and keyboard-only users.
  • The return policy is visible before purchase, not after.

Failure conditions: binary checks on what the user sees

  • A shoe priced over thirty dollars appears in the results.
  • A non-red shoe appears in the results.
  • A search returns nothing and shows no empty-state message.
  • Adding to the cart fails silently.
  • The total at checkout differs from the total on the listing.

Same anatomy, different domain. The implementation (the recommendation engine, the storefront, the payment provider) is nowhere in the intent. If one model handles a piece of software and a piece of commerce with the same three slots and nothing left over, it's the right model.
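
"Observable" means a script can decide, so here is what three of those failure conditions could look like as a script. It is a minimal sketch: `Listing`, `failed_conditions`, and its parameters are hypothetical, and the price ceiling and color are passed in rather than hard-coded so the check says nothing about how the storefront is built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    name: str
    color: str
    price: float


def failed_conditions(
    results: list[Listing],
    empty_state_shown: bool,
    wanted_color: str,
    price_ceiling: float,
) -> list[str]:
    """Return the name of every failure condition the search results trip."""
    failed = []
    if any(item.price > price_ceiling for item in results):
        failed.append("over-ceiling shoe in results")
    if any(item.color != wanted_color for item in results):
        failed.append("wrong-color shoe in results")
    if not results and not empty_state_shown:
        failed.append("empty results without an empty-state message")
    return failed
```

An empty list means the output passes. Anything else is a binary fail with a name attached, and no human opinion was needed to reach it.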

Harness is not method. IDSD is.

This is the distinction the download buttons obscure. Spec Kit, BMAD, the prompt-and-workflow crowd: these are harnesses, and useful ones. But a harness is the part with a download button, and the default failure is adopting the harness without the method. ICE is the method that decides what the work is before any harness handles it.

Kapil calls the practice IDSD, intent-driven software development, where declaring outcomes and letting the machine determine implementation is the normal way of working, not the aspiration.

Some will say this is still spec by another name, and they're right about the files and wrong about what changed. It's all still markdown; the format was never the problem. What changed is what each file is and who owns it. There is no spec craft anymore. One file written however the keyboard-holder felt that morning is the thing that broke. Intent plus an explicit set of expectations, handed to a harness that owns the build, is a different thing entirely.

When SDD is still the right tool

I don't want to oversell this, and neither does Kapil. There is a real case for constraining the model. If your team is new to agents, if the codebase is fragile, if the stakes are regulatory and a wrong build costs you a license, then a leash is rational. You want the model boring, predictable, the same twice. SDD was built for that case and it does the job. Use it every time you're starting out; it's the second step, immediately after vibe coding. Do not skip it.

What doesn't survive is taking it out of that case and running it everywhere. A spec with holes hands the model the very freedom the leash was meant to remove, now without the model's instincts to balance it. You get the constraint without the protection. That is the case that has spread, and it's the one worth replacing.

Who actually pays

In the end this is an economics story, and the people at the bottom of it never read a spec in their lives. There are two ways to build:

| | Hand-written code | Agent-built |
|---|---|---|
| Authoring | Slow | Fast |
| Cost to run | Cheap; computing is commoditized | Expensive |
| The catch | | Gaps get filled by tokens, many of them confidently wrong |

Tokens didn't get expensive; the per-token price keeps falling. The cost climbs because an agent left to fill the gaps burns far more of them per finished outcome. When METR ran a controlled trial in 2025, experienced developers were measurably slower with AI and walked away certain it had made them faster. Being wrong while feeling fast is the whole failure in one sentence, and the inflation it produces isn't absorbed by the vendor or the influencer. It moves down the line, to the client who never read a spec and the people that client serves.
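
The arithmetic behind the first two sentences of that paragraph is worth one worked example. The numbers below are invented purely to show the direction of the effect; they are not measurements from METR or anyone else, and `cost_per_outcome` is a hypothetical helper.

```python
def cost_per_outcome(price_per_million_tokens: float, tokens_per_outcome: int) -> float:
    """Cost of one finished outcome, in the same currency as the token price."""
    return price_per_million_tokens * tokens_per_outcome / 1_000_000


# Hypothetical numbers: the token price halves, but gap-filling and rework
# triple the tokens burned before the outcome is actually finished.
before = cost_per_outcome(price_per_million_tokens=10.0, tokens_per_outcome=200_000)
after = cost_per_outcome(price_per_million_tokens=5.0, tokens_per_outcome=600_000)
```

In this made-up case the price per token fell by half and the cost per finished outcome still rose by half, from 2.0 to 3.0. The unit to watch is the outcome, not the token.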

The question is not whether your agents can do the work. They can. The question is whether you have the discipline to own the intent and the expectations yourself, and the nerve to stay in the loop while the machine runs, on the day it would be so much easier to step out and bless the diff. That day is the whole game.

What I'm taking from it

What persuades me here is not the framework diagram. It's the order-of-operations argument. The Symphony spec being written last is the tell. The honest version of "spec-driven development" is retrospective: the unambiguous document is something you extract from running software, not something you author into existence before it. Everything that goes wrong with SDD-in-the-large is the result of pretending otherwise.

ICE doesn't escape that. Taken together it's still a spec. But it's a spec split along the lines of who can actually own each piece, and it lets the model do what it was built to do instead of fighting it. Intent is the part only you can write. The harness earns the discipline you spend writing it.

If you do one thing this week, take one real outcome you're about to ship (not the system, one outcome) and write the three things for it: the goal in one sentence with no "and," three to five constraints in business language, three to five binary failure conditions. Then hand it to someone who wasn't in your head and ask where the agent would still have to guess. Every place they point is a hole you were about to let it fill.

What the framework underplays

Let me push the argument one step further than either essay does. The deep reason a spec fails is not that it is a spec. It is that a spec is a prediction, and the things worth building live where prediction doesn't hold: the requirements are discovered through the act of building, not knowable before it. The spec written before that learning is a record of your least-informed moment, and keeping close to it anchors the finished thing to the understanding you had when you knew the least.

A spec is a good rear-view mirror and a bad crystal ball. The Symphony story is that sentence with a company attached: the two thousand good lines exist because the exploration ran first.

But the blanket version overreaches, because the opposite mistake costs just as much. It depends on the domain, and the error runs both ways:

| The mistake | What it looks like | What it costs |
|---|---|---|
| Novel treated as known | Specifying something you can't yet understand | Anchored to your worst guess |
| Known treated as novel | "Exploring" a fixed API contract or the hundredth CRUD form | Rediscovering, expensively, what a one-page spec nailed |

Most real systems are a mix, which is the deeper reason one document fails: it applies the same ceremony to both.

The move was never spec-or-no-spec. It is separating the invariants from the unknowns: write down only the part that won't change, and let the volatile part be found.

That kernel is what intent is quietly doing. What you want (the outcome) and the boundaries you can't cross (the constraints, the failure conditions) are usually stable, so they are cheap and safe to commit to early. The how is what evolves, and the how is exactly what a spec should never pin.

Two things the framing skips. First, specs survive their own bad reviews because many aren't engineering artifacts at all: the spec is load-bearing for the invoice, for fixed-bid contracts, sign-off, and blame. You can't argue an organization out of a commercial instrument with an engineering argument.

Second, AI only helps half of this. Generation got cheap, so explore-and-discard is cheaper than ever, but cheap generation without equally cheap invalidation just means you reach the wrong answer faster (felt fast, was slow). Discovery only beats specification if you can kill bad branches quickly, which is why the thing that must stay rigorous is not the spec but the test of whether what you built is right.

Where the tests actually go

This is the question that decides whether the method is engineering or just vibe coding with a better vocabulary, so it is worth being plain about it. Tests are not the enemy of discovery. They are what lets you afford it. The whole case for exploring instead of specifying depends on being able to throw a wrong implementation away fast, and the thing that tells you fast that it is wrong is the test suite. Pull the tests out and you do not get more freedom. You get the METR result again: wrong at speed, sure you were quick.

The trouble is that one word, "tests," is doing three jobs, and the method only holds if you keep them apart.

| Job | What it actually is | The rule |
|---|---|---|
| Define "done" | Acceptance / behavioral checks: the expectations and failure conditions in executable form. Human-owned, hidden from the builder. | Non-negotiable |
| Pin the implementation | Unit tests asserting a function's internal shape. A spec by another door. | Optional, and harmful if written before the code is found |
| Guard at runtime | Validation, types, rate limits, assertions, circuit breakers, flags, canary-with-rollback. | Matters more as agents write more |

That table settles the Spec Kit contradiction I flagged earlier. "Tests are non-negotiable" and "tests are optional" are both right, about different rows. The guardrail row is the one that grows in an agent world: when the diff runs to ten thousand lines that look right and nobody can honestly read all of it, you stop trusting that someone read the code and start trusting that the code cannot cross a line even when it is wrong. That is Nyra's worry answered with machinery instead of willpower.
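
A runtime guardrail can be very small. The sketch below borrows one invariant from the red-shoe example (the total charged must equal the total shown on the listing) and enforces it around whatever pricing code the agent wrote. `GuardrailViolation` and `guarded_charge` are hypothetical names, and a real system would add logging and a rollback path around them.

```python
from decimal import Decimal
from typing import Callable


class GuardrailViolation(Exception):
    """Raised when an invariant is about to be crossed at runtime."""


def guarded_charge(listing_total: Decimal, compute_charge: Callable[[], Decimal]) -> Decimal:
    """Run agent-written pricing code, but refuse any charge that breaks the invariant."""
    charge = compute_charge()
    if charge != listing_total:
        raise GuardrailViolation(
            f"charge {charge} differs from listing total {listing_total}"
        )
    return charge
```

Nobody has to have read `compute_charge` for this to hold. The guard does not make the code right; it makes one specific way of being wrong impossible to ship to a customer.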

Two things follow, and both make the engineering harder, not softer. The first is that tests the builder writes for itself are scaffolding, not proof. A model left to check its own work writes the checks that pass. It mocks away the thing under test. It asserts that true is true. The tests that actually validate have to live somewhere the builder cannot see or reach. That tilts the whole pyramid. Integration, contract, and end-to-end tests carry more weight than unit tests now, because they face the outcome and cross boundaries the builder never sees all at once, which is exactly what makes them hard to fake.
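
The difference between a check that faces the outcome and a check that passes by construction fits in a few lines. Everything here is illustrative and hypothetical: two different implementations of "cheapest first", one broken one, and two checks.

```python
def cheapest_first_a(prices: list[float]) -> list[float]:
    return sorted(prices)


def cheapest_first_b(prices: list[float]) -> list[float]:
    # A different "how": insertion sort.
    out: list[float] = []
    for p in prices:
        i = 0
        while i < len(out) and out[i] <= p:
            i += 1
        out.insert(i, p)
    return out


def broken(prices: list[float]) -> list[float]:
    return list(prices)  # forgets to sort at all


def outcome_check(impl) -> bool:
    """Behavioral: asserts only the observable result, never the internals."""
    return impl([30, 10, 20]) == [10, 20, 30] and impl([]) == []


def vacuous_check(impl) -> bool:
    """The kind of check a self-grading builder writes: true is true."""
    return True
```

`outcome_check` accepts both working implementations and rejects the broken one, without caring how either was written. `vacuous_check` accepts all three, which is exactly why it proves nothing.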

The second is that guardrails are never finished, and I will not pretend otherwise. You cannot foresee every way a system breaks, so you cannot write every guardrail up front any more than you can write the spec up front. You write the ones you can see coming, and the incident hands you the next one. It is the same loop as "monitoring is not understanding": the guardrail you are missing is usually sitting in the postmortem of the thing that already broke once.

So proper engineering does not walk out when the agent starts writing the code. It moves. The work stops being the typing and turns into the deciding: which invariants have to hold always, and what machinery proves they hold on every run. The agent can write the implementation. It cannot decide for you what must always be true, and that decision is the job now. It is a harder one than the typing ever was.


A reading of two essays by Kapil Viren Ahuja: "The Method That Replaces Spec-Driven Development (IDSD)" and "The Anatomy of Intent (ICE in IDSD)." The ICE framing, the IDSD name, the Symphony and Spec Kit observations, and the red-shoe and microservice examples are his.