Spec-driven development took over the conversation in under a year. Spec Kit from GitHub, Kiro from AWS, Tessl, BMAD, Agent OS: a dozen tools arrived to make you write down what you want, in detail, up front, so the agent stops guessing. I have been skeptical of it for a while, but skepticism is cheap. What changed my mind about the shape of the fix is a two-part argument by Kapil Viren Ahuja, who has been running the alternative on his own team and writing down where it breaks. This is my reading of it. I cover the parts I think are right, and the place the argument ends up that skepticism alone never reached: the client's invoice.
The argument in five lines
- A spec is a prediction, written at your least-informed moment. The agent fills the holes you left.
- Proof: OpenAI's best spec (Symphony, 2,169 lines) was written last, distilled from software that already ran.
- The fix is ICE: Intent (goal + constraints + failure conditions), Context (harness-owned), Expectations (the definition of done).
- Separate the invariants from the unknowns. Pin only the part that won't change; let the how be discovered.
- Tests don't disappear, they relocate. Outcome checks and runtime guardrails are where the rigor lives now.
SDD exists because vibe coding wasn't working. You described a goal, you got a block of code back, it looked right, and it didn't quite work. GitHub said the quiet part themselves when they shipped Spec Kit in September 2025: models are "exceptional at pattern completion, but not at mind reading." The reach was for the spec. Write it all down, and the agent stops inventing.
One correction, because careful readers will check: spec-driven development is not new. Larman and Basili documented in IEEE Computer back in 2003 that iterative development goes back to the 1950s, and the single-pass, document-driven ideal was doubted from the start, even by Royce, the man usually blamed for waterfall. What's new is handing the spec to a goal-seeking machine and expecting the spec to hold it.
The spec leaves holes, and the agent fills them
The whole problem with SDD is that people write the spec however they want. It holds the what. It holds the how. It holds whatever was in the engineer's head that morning. Then the agent, Claude Code or Codex or whatever you run, reads it and makes decisions, because there are gaping holes in it.
The holes aren't a defect in any one spec. They are two human gaps:
- Engineers don't take the time to learn what a spec actually is. We assume we know, because we've written specs for twenty years. We have not written this kind.
- The methods that went viral were built for a forty-minute YouTube demo on a greenfield app, not for a real enterprise codebase with twelve teams, a decade of baked-in decisions, and a compliance reviewer who needs to sleep at night. The marketing layer is selling demos as methods. They are not the same thing, and the gap between them is paid for downstream.
It is worse than "the spec leaves holes." Open Spec Kit's own files. Its manifesto describes the templates' purpose plainly: they "transform the LLM from a creative writer into a disciplined specification engineer," under a section literally titled "How Structure Constrains LLMs for Better Outcomes."
Then the spec command, when the description is incomplete, tells the model to "make informed guesses," use "common patterns" to "fill gaps," and caps it at a "Maximum 3 [NEEDS CLARIFICATION] markers." Three. The tool does not merely leave holes for the agent to fill. It instructs the agent to fill them, and limits how often it has to admit it's guessing.
The same toolkit argues with itself. The manifesto calls test-first non-negotiable. A task template shipped in the same repo the same week says tests are optional. The file that runs the work says follow TDD. Three orders for one decision, shipped together. Hand a goal-seeking model three contradictory rules and it can't reason toward an outcome. It picks a dogma, obeys it, and improvises the rest: the exact behavior the rigid method was adopted to prevent.
What OpenAI's Symphony spec actually proves
Look at what OpenAI published with Symphony in April 2026: an open spec for orchestrating agents, whose core is a single file of 2,169 lines, eighteen sections, formal must-and-should language. The depth shows exactly what we are asking people to do, and exactly what they cannot.
If anyone could write all of that into one SPEC.md up front, then yes, it works. That is not sarcasm; it is the literal truth. A complete, unambiguous, two-thousand-line spec produces good software.
But look at the order OpenAI did it in. They spent about six months building an internal tool under one rule: no human-written code, every line generated by Codex. They got it working. Only after it worked did they distill the spec out of the running system and then have Codex reimplement it in five more languages to shake the ambiguities loose.
The deep, RFC-grade spec is retrospective documentation of software that already ran. The one organization that produced a spec that good produced it last, by reverse-engineering software that was already alive. The industry is selling the output of that process as if it were the method.
That is the trap nobody names: you cannot write the spec up front, because the spec that works is the one you can only write after the thing exists.
ICE: intent, context, expectations
Here is the frame that replaces it. ICE, for Intent, Context, Expectations: three artifacts, each with its own owner, none of them a single bloated document.
- Intent is what you want: the goal, the constraints that bound it, and the failure conditions that guard it.
- Context is the how: the stack, the existing system, the constraints of the codebase the thing gets built in. It comes from your harness and is fed progressively as needed, not dumped into one wall at the start.
- Expectations are what counts as done: the success scenarios and the limits the result must stay inside, written in terms the user would recognize, generated from the intent and the context together and vetted at a checkpoint.
The fix is owning each one as its own craft. The moment the definition of done drifts away from the person who wanted the outcome, the agent starts deciding "done" for them.
| | Spec-driven (SDD) | Intent-driven (ICE) |
|---|---|---|
| Artifact | One SPEC.md: what + how | Intent, context, expectations, split apart |
| Written up front | Everything, in detail | Only the part that won't change |
| The "how" | Pinned in the spec | Discovered by the agent |
| Ambiguity | Filled silently by the agent | Closed by the human, or caught by evals |
| Owns "done" | Whoever held the keyboard | The person who wanted the outcome |
The anatomy of intent
Intent is the root, and if the root is a spec in disguise, nothing built on it can be anything else. It is three things: a goal, the constraints, and the failure conditions.
The goal
The goal is the outcome you want, in one sentence. There's a clean test: can two completely different implementations both satisfy this? If yes, you wrote a goal. If only one implementation could possibly satisfy it, you wrote a spec and called it a goal.
And no "and." When you need a conjunction you almost certainly have two goals pretending to be one, and the fix is to split, never to add detail. The method scales by adding more intents, not by making one heavier.
"Build a microservice that handles the user-facing product catalog" is a goal. A dozen engineers build it a dozen ways and every one could be right. "Build a Go microservice using gRPC, with PostgreSQL for catalog storage and Redis for cart state, behind an Envoy sidecar" is a spec wearing a goal's clothes. The design choices are already made, in a hurry, possibly wrong, and the agent is demoted from decision-maker to typist. If you wanted a typist you needed a keyboard and an afternoon, not an agent.
The constraints
Constraints are the qualities the outcome must carry: the non-functional requirements. Each is directional (it points where the outcome should land) and unconditional (miss it and the output fails, however clever the implementation). They sit in business language, not implementation patterns.
"99.99% uptime against the SLO" is a constraint. "Use the existing Kubernetes setup with three replicas and pod disruption budgets" is not. It picks a pattern, so it belongs in Context. The discipline is five to seven lines, not fifty. The waterfall trap waits exactly here: the urge to write every requirement you know, in the name of completeness, is how 1,300-line specs are born.
The failure conditions
Failure conditions are the binary checks the validator runs against the output. Each is observable (a script or eval decides true/false without a human opinion) and post-output (it catches what's wrong after the work is done, rather than shaping the builder as it goes). The decision rule between a constraint and a failure condition is one sentence:
Would knowing this change how the builder writes code? Yes → it's a constraint. No → it's a failure condition the validator catches after.
| | Constraint | Failure condition |
|---|---|---|
| The test | Shapes how the builder codes | Checked after, true or false |
| When | Pre-build, guides design | Post-output, binary eval |
| Who sees it | The builder | The validator, hidden from the builder |
| Example | "No new runtime dependency" | "Coverage drops below 90%" |
Goal
Build a microservice that handles the user-facing product catalog.
Constraints: qualities the outcome must carry
- Support 1,000 concurrent users at peak.
- p99 latency under 200 ms on read endpoints.
- 99.99% uptime against the published SLO.
- WCAG 2.1 AA on any user-facing surface.
- OWASP ASVS Level 2 on every endpoint.
Failure conditions: binary checks the validator runs as evals
- Build fails.
- Unit-test coverage drops below 90%.
- Lint reports any errors.
- Sonar quality gate fails on coverage, duplication, or security hotspots.
- A secret appears in the source.
- API contract changes without a version bump.
That's the whole intent. No file paths, no class names, no library choices, no "use the Go service template." The Go template lives in Context, where the harness assembles it from the org's standards and passes it to the builder alongside the intent. The intent carries only what is irreducible to this work.
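The three slots, and the cheap parts of the discipline around them, can be sketched in a few lines of Python. This is a minimal illustration, not part of ICE or of any harness: the `Intent` class, the `lint_intent` function, and the default cap of seven constraints are all hypothetical names and values chosen for the example. The "two different implementations" test still needs a human; only the mechanical rules (one sentence, no "and", a short constraint list) are checked here.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Intent:
    """Hypothetical container for the three slots of an intent."""
    goal: str
    constraints: list[str] = field(default_factory=list)
    failure_conditions: list[str] = field(default_factory=list)


def lint_intent(intent: Intent, max_constraints: int = 7) -> list[str]:
    """Return heuristic warnings; an empty list means no rule fired."""
    warnings = []
    # No "and": a conjunction usually means two goals sharing one sentence.
    # Word boundaries keep words such as "handles" from matching.
    if re.search(r"\band\b", intent.goal, flags=re.IGNORECASE):
        warnings.append("goal contains 'and': consider splitting into two intents")
    # One sentence: count sentence-ending punctuation followed by space or end.
    if len(re.findall(r"[.!?](?:\s|$)", intent.goal.strip())) > 1:
        warnings.append("goal is longer than one sentence")
    # A short list of constraints, not fifty.
    if len(intent.constraints) > max_constraints:
        warnings.append(
            f"{len(intent.constraints)} constraints exceeds {max_constraints}"
        )
    if not intent.failure_conditions:
        warnings.append("no failure conditions: the validator has nothing to check")
    return warnings


catalog = Intent(
    goal="Build a microservice that handles the user-facing product catalog.",
    constraints=[
        "p99 latency under 200 ms on read endpoints.",
        "99.99% uptime against the published SLO.",
    ],
    failure_conditions=["Build fails.", "A secret appears in the source."],
)
print(lint_intent(catalog))
```

A lint like this only catches the obvious smells. It cannot tell whether a goal is secretly a spec; it can only make the cheap mistakes harder to commit by accident.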
There's a second, sharper reason the constraint/failure split matters: compartmentation. The builder receives the goal and the constraints. The validator receives the failure conditions, compiled into encrypted evals. The builder cannot teach to a test it cannot see. LLMs reward-hack; if you hand the builder the same scenarios the validator will check, the builder games them. Keeping the two on opposite sides of a wall is the only structural defense, and it only works if every requirement is on the right side of the line.
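One way to picture the wall is as two projections of the same intent. The sketch below is an assumption-laden illustration: `Intent`, `builder_brief`, and `validator_brief` are hypothetical names, and it shows only the split itself, not the encryption or the eval compilation the essay describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """Hypothetical intent record; frozen so neither side can edit it."""
    goal: str
    constraints: tuple[str, ...]
    failure_conditions: tuple[str, ...]


def builder_brief(intent: Intent) -> dict:
    """What the builder receives: the goal and the constraints only."""
    return {"goal": intent.goal, "constraints": list(intent.constraints)}


def validator_brief(intent: Intent) -> dict:
    """What the validator receives: the failure conditions and nothing else."""
    return {"failure_conditions": list(intent.failure_conditions)}


intent = Intent(
    goal="Build a microservice that handles the user-facing product catalog.",
    constraints=("p99 latency under 200 ms on read endpoints.",),
    failure_conditions=("A secret appears in the source.",),
)
print(builder_brief(intent))
print(validator_brief(intent))
```

The point of the design is that the builder's input is constructed without ever touching the failure conditions, so there is no field to leak. A requirement filed in the wrong slot defeats the split no matter how the projection is written.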
The same anatomy, outside code
The shape isn't unique to software. A consumer outcome carries the same three slots, in plain language a shopper would recognize.
Goal
The user wants to buy a red shoe for under thirty dollars.
Constraints: qualities the experience must carry
- The price shown is the total they pay, with shipping and tax included.
- The inventory shown is real, and shippable to the user.
- Checkout is three clicks or fewer.
- The seller never sees card details.
- It works for screen-reader and keyboard-only users.
- The return policy is visible before purchase, not after.
Failure conditions: binary checks on what the user sees
- A shoe priced over thirty dollars appears in the results.
- A non-red shoe appears in the results.
- A search returns nothing and shows no empty-state message.
- Adding to the cart fails silently.
- The total at checkout differs from the total on the listing.
Same anatomy, different domain. The implementation (the recommendation engine, the storefront, the payment provider) is nowhere in the intent. If one model handles a piece of software and a piece of commerce with the same three slots and nothing left over, it's the right model.
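Failure conditions like these are meant to be decided by a script, not an opinion. As a rough sketch, three of the checks on the search results could look like the following. The `Listing` type, the `violated_conditions` function, and the shape of the observed results are hypothetical, and the price ceiling is passed in as a parameter so it can be set to whatever figure the failure condition names.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    """Hypothetical record of one search result as the user sees it."""
    name: str
    color: str
    price: float  # total in dollars, shipping and tax included


def violated_conditions(
    results: list[Listing], empty_state_shown: bool, price_limit: float
) -> list[str]:
    """Return the failure conditions tripped by the observed results."""
    violated = []
    if any(item.price > price_limit for item in results):
        violated.append("a shoe priced over the limit appears in the results")
    if any(item.color.lower() != "red" for item in results):
        violated.append("a non-red shoe appears in the results")
    if not results and not empty_state_shown:
        violated.append("empty results with no empty-state message")
    return violated


results = [Listing("Runner", "red", 24.99), Listing("Court", "blue", 27.50)]
print(violated_conditions(results, empty_state_shown=False, price_limit=30.0))
```

Nothing in the check knows how the results were produced. Any recommendation engine or storefront that yields the same visible results gets the same verdict, which is what keeps the check on the outcome side of the line.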
Harness is not method. IDSD is.
This is the distinction the download buttons obscure. Spec Kit, BMAD, the prompt-and-workflow crowd: these are harnesses, and useful ones. But a harness is the part with a download button, and the default failure is adopting the harness without the method. ICE is the method that decides what the work is before any harness handles it.
Kapil calls the practice IDSD, intent-driven software development, where declaring outcomes and letting the machine determine implementation is the normal way of working, not the aspiration.
Some will say this is still spec by another name, and they're right about the files and wrong about what changed. It's all still markdown; the format was never the problem. What changed is what each file is and who owns it. There is no spec craft anymore. One file written however the keyboard-holder felt that morning is the thing that broke. Intent plus an explicit set of expectations, handed to a harness that owns the build, is a different thing entirely.
When SDD is still the right tool
I don't want to oversell this, and neither does Kapil. There is a real case for constraining the model. If your team is new to agents, if the codebase is fragile, if the stakes are regulatory and a wrong build costs you a license, then a leash is rational. You want the model boring, predictable, the same twice. SDD was built for that case and it does the job. Use it every time you're starting out; it's the second step, immediately after vibe coding. Do not skip it.
What doesn't survive is taking it out of that case and running it everywhere. A spec with holes hands the model the very freedom the leash was meant to remove, now without the model's instincts to balance it. You get the constraint without the protection. That is the case that has spread, and it's the one worth replacing.
Who actually pays
In the end this is an economics story, and the people at the bottom of it never read a spec in their lives. There are two ways to build:
| | Hand-written code | Agent-built |
|---|---|---|
| Authoring | Slow | Fast |
| Cost to run | Cheap; computing is commoditized | Expensive |
| The catch | | Gaps get filled by tokens, many of them confidently wrong |
Tokens didn't get expensive; the per-token price keeps falling. The cost climbs because an agent left to fill the gaps burns far more of them per finished outcome. When METR ran a controlled trial in 2025, experienced developers were measurably slower with AI and walked away certain it had made them faster. Being wrong while feeling fast is the whole failure in one sentence, and the inflation it produces isn't absorbed by the vendor or the influencer. It moves down the line, to the client who never read a spec and the people that client serves.
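The arithmetic behind that claim is simple: cost per finished outcome is the per-token price multiplied by the tokens spent to get there. The figures below are invented purely to show the shape of the effect and are not measurements from any vendor or study; `cost_per_outcome` is a hypothetical helper.

```python
def cost_per_outcome(price_per_million_tokens: float, tokens_per_outcome: int) -> float:
    """Dollars spent per finished outcome at a given price and token usage."""
    return price_per_million_tokens * tokens_per_outcome / 1_000_000


# Invented numbers: the price halves while token usage grows fivefold.
before = cost_per_outcome(price_per_million_tokens=10.0, tokens_per_outcome=200_000)
after = cost_per_outcome(price_per_million_tokens=5.0, tokens_per_outcome=1_000_000)

print(before, after, after / before)
```

In this made-up case the price per token falls by half and the bill per outcome still rises 2.5 times, because the token count grew faster than the price fell.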
The question is not whether your agents can do the work. They can. The question is whether you have the discipline to own the intent and the expectations yourself, and the nerve to stay in the loop while the machine runs, on the day it would be so much easier to step out and bless the diff. That day is the whole game.
What I'm taking from it
What persuades me here is not the framework diagram. It's the order-of-operations argument. The Symphony spec being written last is the tell. The honest version of "spec-driven development" is retrospective: the unambiguous document is something you extract from running software, not something you author into existence before it. Everything that goes wrong with SDD-in-the-large is the result of pretending otherwise.
ICE doesn't escape that. Taken together it's still a spec. But it's a spec split along the lines of who can actually own each piece, and it lets the model do what it was built to do instead of fighting it. Intent is the part only you can write. The harness earns the discipline you spend writing it.
If you do one thing this week, take one real outcome you're about to ship (not the system, one outcome) and write the three things for it: the goal in one sentence with no "and," three to five constraints in business language, three to five binary failure conditions. Then hand it to someone who wasn't in your head and ask where the agent would still have to guess. Every place they point is a hole you were about to let it fill.
What the framework underplays
Let me push the argument one step further than either essay does. The deep reason a spec fails is not that it is a spec. It is that a spec is a prediction, and the things worth building live where prediction doesn't hold: the requirements are discovered through the act of building, not knowable before it. The spec written before that learning is a record of your least-informed moment, and keeping close to it anchors the finished thing to the understanding you had when you knew the least.
A spec is a good rear-view mirror and a bad crystal ball. The Symphony story is that sentence with a company attached: the two thousand good lines exist because the exploration ran first.
But the blanket version overreaches, because the opposite mistake costs just as much. It depends on the domain, and the error runs both ways:
| The mistake | What it looks like | What it costs |
|---|---|---|
| Novel treated as known | Specifying something you can't yet understand | Anchored to your worst guess |
| Known treated as novel | "Exploring" a fixed API contract or the hundredth CRUD form | Rediscovering, expensively, what a one-page spec nailed |
Most real systems are a mix, which is the deeper reason one document fails: it applies the same ceremony to both.
The move was never spec-or-no-spec. It is separating the invariants from the unknowns: write down only the part that won't change, and let the volatile part be found.
That kernel is what intent is quietly doing. What you want (the outcome) and the boundaries you can't cross (the constraints, the failure conditions) are usually stable, so they are cheap and safe to commit to early. The how is what evolves, and the how is exactly what a spec should never pin.
Two things the framing skips. First, specs survive their own bad reviews because many aren't engineering artifacts at all: the spec is load-bearing for the invoice, for fixed-bid contracts, sign-off, and blame. You can't argue an organization out of a commercial instrument with an engineering argument.
Second, AI only helps half of this. Generation got cheap, so explore-and-discard is cheaper than ever, but cheap generation without equally cheap invalidation just means you reach the wrong answer faster (felt fast, was slow). Discovery only beats specification if you can kill bad branches quickly, which is why the thing that must stay rigorous is not the spec but the test of whether what you built is right.
Where the tests actually go
This is the question that decides whether the method is engineering or just vibe coding with a better vocabulary, so it is worth being plain about it. Tests are not the enemy of discovery. They are what lets you afford it. The whole case for exploring instead of specifying depends on being able to throw a wrong implementation away fast, and the thing that tells you fast that it is wrong is the test suite. Pull the tests out and you do not get more freedom. You get the METR result again: wrong at speed, sure you were quick.
The trouble is that one word, "tests," is doing three jobs, and the method only holds if you keep them apart.
| Job | What it actually is | The rule |
|---|---|---|
| Define "done" | Acceptance / behavioral checks: the expectations and failure conditions in executable form. Human-owned, hidden from the builder. | Non-negotiable |
| Pin the implementation | Unit tests asserting a function's internal shape. A spec by another door. | Optional, and harmful if written before the code is found |
| Guard at runtime | Validation, types, rate limits, assertions, circuit breakers, flags, canary-with-rollback. | Matters more as agents write more |
That table settles the Spec Kit contradiction I flagged earlier. "Tests are non-negotiable" and "tests are optional" are both right, about different rows. The guardrail row is the one that grows in an agent world: when the diff runs to ten thousand lines that look right and nobody can honestly read all of it, you stop trusting that someone read the code and start trusting that the code cannot cross a line even when it is wrong. That is Nyra's worry answered with machinery instead of willpower.
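A guardrail in this sense is a check that runs on every call, whatever the implementation behind it looks like. Here is a minimal sketch, assuming a plain decorator is enough for the example; `guard`, `GuardrailViolation`, and `checkout_total` are hypothetical names, and a production system would more likely lean on validation layers, types, or circuit breakers than on a hand-rolled wrapper.

```python
from functools import wraps


class GuardrailViolation(Exception):
    """Raised when a result crosses a line the guardrail enforces."""


def guard(invariant, description):
    """Wrap a function so its result is checked against an invariant on every call."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if not invariant(result):
                raise GuardrailViolation(f"{func.__name__}: {description}")
            return result
        return wrapper
    return decorator


@guard(lambda total: total >= 0, "checkout total must never be negative")
def checkout_total(prices, discount):
    # Stand-in for agent-written code that nobody has fully read.
    return sum(prices) - discount


print(checkout_total([10, 5], 3))
try:
    checkout_total([10], 25)
except GuardrailViolation as err:
    print("blocked:", err)
```

The wrapped function is wrong for oversized discounts, and the guard does not fix that. It only guarantees the wrong answer never leaves the function, which is the property the guardrail row is asking for.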
Two things follow, and both make the engineering harder, not softer. The first is that tests the builder writes for itself are scaffolding, not proof. A model left to check its own work writes the checks that pass. It mocks away the thing under test. It asserts that true is true. The tests that actually validate have to live somewhere the builder cannot see or reach. That tilts the whole pyramid. Integration, contract, and end-to-end tests carry more weight than unit tests now, because they face the outcome and cross boundaries the builder never sees all at once, which is exactly what makes them hard to fake.
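The difference between scaffolding and proof shows up even in a toy case. In this hypothetical sketch, `apply_discount` contains a deliberate bug, `builder_style_check` is the kind of check that passes no matter what the code does, and `outcome_check` compares against what the person who wanted the outcome would expect.

```python
def apply_discount(price: float, percent: float) -> float:
    """Deliberately buggy stand-in: the discount is never applied."""
    return price


def builder_style_check() -> bool:
    # Compares the code with itself, so it passes whatever the code does.
    return apply_discount(100.0, 10.0) == apply_discount(100.0, 10.0)


def outcome_check() -> bool:
    # Compares the code with the outcome a user would recognize.
    return apply_discount(100.0, 10.0) == 90.0


print(builder_style_check(), outcome_check())
```

Both functions count as tests in a coverage report. Only the second can fail, and a check that cannot fail proves nothing about the outcome.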
The second is that guardrails are never finished, and I will not pretend otherwise. You cannot foresee every way a system breaks, so you cannot write every guardrail up front any more than you can write the spec up front. You write the ones you can see coming, and the incident hands you the next one. It is the same loop as "monitoring is not understanding": the guardrail you are missing is usually sitting in the postmortem of the thing that already broke once.
So proper engineering does not walk out when the agent starts writing the code. It moves. The work stops being the typing and turns into the deciding: which invariants have to hold always, and what machinery proves they hold on every run. The agent can write the implementation. It cannot decide for you what must always be true, and that decision is the job now. It is a harder one than the typing ever was.
A reading of two essays by Kapil Viren Ahuja: "The Method That Replaces Spec-Driven Development (IDSD)" and "The Anatomy of Intent (ICE in IDSD)." The ICE framing, the IDSD name, the Symphony and Spec Kit observations, and the red-shoe and microservice examples are his.