ARC // Why Spec-Driven

Why we built ARC around the spec.

Spec-driven development is how serious teams will ship software in the GenAI era. Here is the argument, from the top.

ARCitect Beta — Open

The Thesis

Code without coordination is meaningless.

Specs are not paperwork you do before the real work. They are the real work.

A spec is the only place in a software project where humans, agents, and systems can agree on what is being built, how it behaves, and how it fits with everything else. Every other artifact — handlers, tests, infrastructure, documentation, client SDKs — is a downstream consequence of that agreement. When the agreement is clear, the consequences are cheap to produce and easy to change. When the agreement is vague, the consequences are where all the work ends up, and the work is almost always the wrong work.

For most of software history, teams got away with treating the spec as a planning document. You wrote it before you started coding. You referred to it when you needed to remember what you decided. You abandoned it when it drifted from reality — which it always did, because keeping two sources of truth in sync is a tax nobody wants to pay. The code became the spec by default, and the spec became a misleading artifact that new engineers learned not to trust.

That arrangement was survivable when humans wrote all the code. It is not survivable now.

In a world where any agent can generate plausible code in seconds, the bottleneck is no longer typing. The bottleneck is making sure the code that gets written is the code you actually wanted — that it integrates with the rest of the system, and that it doesn't quietly break the thing you shipped last week. An AI coding agent produces code that plausibly satisfies a prompt; a spec is what tells you whether the code satisfies the contract.

Plausible and correct are not the same thing.

◆

LLMs may commoditize code generation. Code coordination and contract governance will be the durable layer.

◆

The teams that ship the most valuable software in the next decade will treat the spec as the source of truth and generate all other artifacts from it.

The Category

A spec is an input to development, not an output of it.

Spec-driven development inverts the relationship between specification and implementation. In the traditional model, you build the system and the spec describes what you built — the spec is an artifact of the code, created after the fact and maintained (or not) as the code evolves. In a spec-driven model, you write the spec and the implementation is derived from it — the code is an artifact of the spec, produced in response to it and regenerated when the spec changes.
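The idea that the code is an artifact of the spec can be shown with a small example. What follows is a minimal sketch and not ARC's generator. A hypothetical `generate_dataclasses` function reads an OpenAPI-style `components.schemas` entry and emits Python source. The `Order` schema, the function name and the type mapping are all illustrative assumptions. Because the output depends only on the spec, updating the code means editing the spec and rerunning the function.

```python
# Minimal sketch of spec-to-code generation (hypothetical; not ARC's generator).
# Input: an OpenAPI-style components.schemas entry. Output: Python source text.
SPEC = {
    "components": {
        "schemas": {
            "Order": {
                "type": "object",
                "required": ["id", "quantity"],
                "properties": {
                    "id": {"type": "string"},
                    "quantity": {"type": "integer"},
                    "note": {"type": "string"},
                },
            }
        }
    }
}

# Assumed mapping from JSON Schema primitive types to Python annotations.
PY_TYPES = {"string": "str", "integer": "int", "number": "float", "boolean": "bool"}


def generate_dataclasses(spec: dict) -> str:
    """Render one dataclass per schema; the output depends only on the spec."""
    lines = ["from dataclasses import dataclass", "from typing import Optional", ""]
    for name, schema in sorted(spec["components"]["schemas"].items()):
        required = set(schema.get("required", []))
        props = schema.get("properties", {})
        lines += ["@dataclass", f"class {name}:"]
        if not props:
            lines.append("    pass")
        # Required fields first: dataclass fields without defaults must precede defaulted ones.
        for prop in sorted(props, key=lambda p: (p not in required, p)):
            py_type = PY_TYPES[props[prop]["type"]]
            if prop in required:
                lines.append(f"    {prop}: {py_type}")
            else:
                lines.append(f"    {prop}: Optional[{py_type}] = None")
        lines.append("")
    return "\n".join(lines)


print(generate_dataclasses(SPEC))
```

To change the model, edit the spec and rerun the generator. For example, you could add `note` to `required`. Nobody edits the emitted class by hand.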

Important distinction

"API-first" and "design-first" are spec-driven in name only. Both describe workflows where the spec comes first in time but still loses authority to the code over the life of the project. Spec-driven development, properly understood, is about where authority lives — not about scheduling. The spec has authority if and only if changes to the system happen by editing the spec. Any other arrangement degrades to documentation.

OpenAPI is the right starting point for spec-driven development on HTTP APIs. It is the most widely adopted specification format that captures what you actually need to generate code from: request and response shapes, validation rules, authentication, error contracts, and operation semantics. It has a mature tooling ecosystem and open governance through the OpenAPI Initiative. It is also expressive enough to describe most of what backend APIs actually do.

It is imperfect in specific ways that matter for generation — which is a problem the tooling layer can solve — but it is the format everyone already knows, and that is worth more than any theoretically cleaner alternative. The remaining question is how far you take it. That is what the levels are for.
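Here is what "validation rules" can mean in practice. This minimal sketch checks a request body against an OpenAPI 3.1 schema object, but it supports only a handful of JSON Schema keywords: `type`, `required`, `properties`, `minimum`, `maxLength` and `enum`. The `ORDER_REQUEST` schema and the `validate` function are hypothetical. A real system would rely on generated validation code or on a complete JSON Schema validator, such as the `jsonschema` package.

```python
# A deliberately tiny subset of OpenAPI 3.1 / JSON Schema validation (hypothetical sketch).
ORDER_REQUEST = {
    "type": "object",
    "required": ["sku", "quantity"],
    "properties": {
        "sku": {"type": "string", "maxLength": 12},
        "quantity": {"type": "integer", "minimum": 1},
        "priority": {"type": "string", "enum": ["standard", "express"]},
    },
}

JSON_TYPES = {"object": dict, "string": str, "integer": int, "boolean": bool}


def validate(value, schema: dict, path: str = "$") -> list[str]:
    """Return human-readable violations; an empty list means the value conforms."""
    expected = JSON_TYPES[schema["type"]]
    # bool is a subclass of int in Python, so reject it explicitly for "integer".
    if not isinstance(value, expected) or (schema["type"] == "integer" and isinstance(value, bool)):
        return [f"{path}: expected {schema['type']}"]
    errors = []
    if schema["type"] == "object":
        for name in schema.get("required", []):
            if name not in value:
                errors.append(f"{path}.{name}: required")
        for name, sub in schema.get("properties", {}).items():
            if name in value:
                errors.extend(validate(value[name], sub, f"{path}.{name}"))
    if "minimum" in schema and value < schema["minimum"]:
        errors.append(f"{path}: below minimum {schema['minimum']}")
    if "maxLength" in schema and len(value) > schema["maxLength"]:
        errors.append(f"{path}: longer than {schema['maxLength']}")
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: not one of {schema['enum']}")
    return errors


print(validate({"sku": "A-1", "quantity": 0}, ORDER_REQUEST))  # ['$.quantity: below minimum 1']
```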

The Maturity Model

Where is your team today?

Three levels, each closing a specific gap that the previous level left open.

The compounding bet

The teams that get to Level 3 first will have a compounding advantage. Every month of Level 3 operation is a month of institutional trust in the pipeline, a month of accumulated spec coverage, and a month in which their competitors are still reviewing ten thousand lines of generated code by hand. That is why now.

The spec is the only file a human edits. Handlers, validators, infrastructure configuration, tests, documentation, client SDKs — all derived from the spec by a generation pipeline. The code is a build output. The workspace is the contract.
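A CI check on each changeset is one way to hold a team to "the spec is the only file a human edits". The sketch below assumes a hypothetical layout: the spec lives at `spec/openapi.yaml` and every build output lives under `generated/`. It flags generated files that change in a commit that leaves the spec untouched.

```python
from fnmatch import fnmatch

# Hypothetical repository layout; adjust to wherever your spec and build outputs live.
SPEC_PATHS = {"spec/openapi.yaml"}
GENERATED_GLOBS = ["generated/*"]  # fnmatch's "*" also matches "/", so subdirectories count


def hand_edited_outputs(changed_files: list[str]) -> list[str]:
    """Return generated files changed in a changeset that does not also change the spec."""
    if any(path in SPEC_PATHS for path in changed_files):
        return []  # the spec moved, so regenerated outputs are expected to move too
    return [
        path
        for path in changed_files
        if any(fnmatch(path, pattern) for pattern in GENERATED_GLOBS)
    ]


changed = ["generated/handlers/orders.py", "README.md"]
for path in hand_edited_outputs(changed):
    print(f"{path}: edit the spec and regenerate instead")
```

This check looks only at which paths changed. It cannot tell whether a changed output matches what the spec actually produces, so on its own it is a guardrail, not a proof.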

What you get

Drift becomes architecturally impossible, not just detectable. There is no path by which a handler can disagree with the spec, because the handler is the spec, materialized.

The human design surface shrinks to the contract layer. Teams spend time on API design, business logic, and data modeling — not on boilerplate, validation wiring, or documentation maintenance.

An entire class of production bug disappears: the kind where the implementation and the documented contract disagree.
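"Architecturally impossible" still needs an enforcement point in the build. Here is a minimal sketch, assuming generated files are committed alongside the spec. CI reruns generation and fails on any difference, so a hand-edited handler cannot survive a build. The `generate` function is a hypothetical stand-in for a real pipeline that maps a spec to `{path: content}`.

```python
# Hypothetical stand-in for a generation pipeline: spec in, {path: file content} out.
def generate(spec: dict) -> dict[str, str]:
    return {
        f"generated/handlers/{op_id}.py": f"# generated from {method.upper()} {path}\n"
        for op_id, (method, path) in sorted(spec["operations"].items())
    }


def drifted_files(committed: dict[str, str], regenerated: dict[str, str]) -> list[str]:
    """Paths added, removed, or edited relative to what the spec currently produces."""
    paths = committed.keys() | regenerated.keys()
    return sorted(p for p in paths if committed.get(p) != regenerated.get(p))


spec = {"operations": {"getOrder": ("get", "/orders/{id}")}}
committed = generate(spec)
committed["generated/handlers/getOrder.py"] += "# quick fix applied by hand\n"

drift = drifted_files(committed, generate(spec))
if drift:
    print("build fails, regenerate from the spec:", drift)
```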

The tradeoff

You have to trust the generation pipeline. If the pipeline has a bug, the bug propagates to every generated artifact simultaneously.

Brownfield adoption is a migration, not a flip of a switch.

Our position

We think the tradeoff is worth it, and obviously so once a team has seen it work. That is what ARC is for.

The Moment

AI coding agents make drift worse, not better.

Source: CodeRabbit analysis of 470 PRs, AI-authored vs. human-authored

AI-generated PRs have more critical & major issues — and 75% more logic errors.

A study by CodeRabbit, an AI-powered code review company, shows that AI accelerates output but also amplifies certain categories of mistakes.

1.4× more critical issues (AI: 341, human: 240)

1.7× more major issues (AI: 447, human: 257)

1.7× more issues per PR overall: 10.83 issues per PR for AI vs. 6.45 for human-only. High-issue outliers were significantly more common, creating unpredictable review workloads.

75% more logic and correctness issues: business logic mistakes, incorrect dependencies, flawed control flow, and misconfigurations. This is the most expensive class of bug to find and fix.

Bottom line

AI coding tools are powerful accelerators, but acceleration without guardrails increases risk. The analysis shows that AI-generated code is consistently more variable, more error-prone, and more likely to introduce high-severity issues without the right protections in place.

Using AI-generated code successfully requires mechanisms that ensure the code is not just plausible but spec-compliant. Without a contract to measure against, you have no verifiable criterion for correctness.
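A contract turns "plausible" into something you can check. In this minimal sketch, a hypothetical, simplified contract for a `createOrder` operation declares three things: an example request, the expected status code, and the response fields the contract promises. The function `make_contract_tests` (also hypothetical) derives one test per operation. The handler runs and returns sensible-looking data, yet it does not comply with the contract.

```python
# Hypothetical, simplified contract per operation (a real spec would be OpenAPI).
CONTRACT = {
    "createOrder": {
        "example_request": {"sku": "A-1", "quantity": 2},
        "status": 201,
        "response_required": ["id", "sku", "quantity"],
    },
}


def make_contract_tests(contract: dict, handlers: dict) -> dict:
    """Derive one test per operation; each test returns a list of violations."""
    tests = {}
    for op_id, terms in contract.items():
        # Default arguments bind the current loop values into each closure.
        def test(op_id=op_id, terms=terms):
            status, body = handlers[op_id](terms["example_request"])
            failures = []
            if status != terms["status"]:
                failures.append(f"{op_id}: status {status}, expected {terms['status']}")
            missing = [f for f in terms["response_required"] if f not in body]
            if missing:
                failures.append(f"{op_id}: response missing {missing}")
            return failures
        tests[op_id] = test
    return tests


def plausible_create_order(request: dict):
    # Looks reasonable, but returns 200 instead of 201 and never assigns an id.
    return 200, {"sku": request["sku"], "quantity": request["quantity"]}


tests = make_contract_tests(CONTRACT, {"createOrder": plausible_create_order})
print(tests["createOrder"]())
# ['createOrder: status 200, expected 201', "createOrder: response missing ['id']"]
```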

The Problem

We have a coordination problem, not a speed problem.

The rise of AI-integrated IDEs and agentic coding tools has turned spec-driven development from "a good idea most teams should adopt" into the only sustainable way to ship at AI-assisted velocity. A model can produce a plausible handler in seconds; a team deciding whether that handler matches the contract takes hours. The faster the generation, the more important the contract.

Code generation has become cheap, but code coordination has not. Without a single source of truth for the entire system, an unknown number of agents and engineers make diverging decisions. Those decisions turn into merge conflicts, rework, and outages instead of deployable code.

The Solutions

Adopting AI effectively

CodeRabbit identified seven practices that reduce AI-introduced defects. These are not mere process suggestions: they are structural interventions, each closing a specific failure mode.

01

Give AI the context it needs

AI makes more mistakes when it lacks business rules, configuration patterns, or architectural constraints. Provide prompt snippets, repo-specific instruction capsules, and configuration schemas to reduce misconfigurations and logic drift.

ARCitect provides the context at the source: your OpenAPI spec carries the business rules, schema constraints, and architectural decisions that agents need.

ARC Code Reactor uses deterministic code generation to eliminate the possibility of invalid code output.

02

Use policy-as-code to enforce style

Readability and formatting are among the biggest AI-driven gaps. CI-enforced formatters, linters, and style guides eliminate entire categories of issues before they reach review.

ARC Code Reactor generates systems whose standard formatting is configured and enforced through Git hooks, enabled by default.

03

Add correctness safety rails

Given the rise in logic and error-handling issues: require tests for non-trivial control flow, mandate nullability and type assertions, standardize exception-handling rules, and explicitly prompt for guardrails where needed.

ARC Code Reactor goes beyond generation-time validation: it also generates comprehensive automated contract test suites.

04

Strengthen security defaults

Mitigate elevated vulnerability rates by centralizing credential handling, blocking ad-hoc password usage, and running SAST and security linters automatically on every commit.

ARC Code Reactor systems follow the AWS Well-Architected Framework, with automated security posture analysis on the roadmap.

05

Nudge the model toward efficient patterns

Offer guidelines for batching I/O, choosing appropriate data structures, and embedding performance hints in prompts before generation begins.

ARC Code Reactor excels at this: the entire deterministic system is composed of consistent, efficient patterns. Nudges are unnecessary when the code output is deterministic.

06

Adopt AI-aware PR checklists

Reviewers should explicitly ask: Are error paths covered? Are concurrency primitives correct? Are configuration values validated? Are credentials handled via the approved helper? These questions target the areas where AI is most error-prone.

We are actively developing the next level of ARC Code Reactor test suite generation, which will provide automatic, comprehensive system testing.

07

Get help reviewing and testing AI code

Code review pipelines weren't built to handle the volume teams are now shipping with AI. Reviewer fatigue leads to more missed bugs. An AI code review tool like CodeRabbit standardizes quality across different AI tools, acts as a third-party source of truth, and reduces the cognitive load of review — letting developers concentrate on the complex parts.

ARC Code Reactor provides deterministic, validated code along with a comprehensive test suite for additional validation — but we recommend leveraging any additional tools that help the team collaborate with AI most effectively.
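Practice 01 argues that the spec can carry the context agents need. Here is a minimal sketch of that idea using a hypothetical `context_capsule` helper and `Order` schema. It renders one schema's constraints as plain-text rules that can be pasted into an agent's instructions. The agent then works from the contract rather than from guesses.

```python
# Hypothetical spec fragment; constraints use standard JSON Schema keywords.
SPEC = {
    "components": {
        "schemas": {
            "Order": {
                "type": "object",
                "required": ["sku", "quantity"],
                "properties": {
                    "sku": {"type": "string", "maxLength": 12},
                    "quantity": {"type": "integer", "minimum": 1},
                    "priority": {"type": "string", "enum": ["standard", "express"]},
                },
            }
        }
    }
}

CONSTRAINT_KEYWORDS = ("minimum", "maximum", "maxLength", "enum")


def context_capsule(spec: dict, schema_name: str) -> str:
    """Render a schema's fields and constraints as plain-text rules for an agent."""
    schema = spec["components"]["schemas"][schema_name]
    required = set(schema.get("required", []))
    rules = [f"{schema_name} fields:"]
    for name, prop in schema.get("properties", {}).items():
        parts = [prop["type"], "required" if name in required else "optional"]
        parts += [f"{kw}={prop[kw]}" for kw in CONSTRAINT_KEYWORDS if kw in prop]
        rules.append(f"- {name}: {', '.join(parts)}")
    return "\n".join(rules)


print(context_capsule(SPEC, "Order"))
# Order fields:
# - sku: string, required, maxLength=12
# - quantity: integer, required, minimum=1
# - priority: string, optional, enum=['standard', 'express']
```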

The Next Project

What Level 3 looks like in practice.

Two versions of a feature request. Same team, same project, very different outcomes.

Project Timeline

PM files a feature request.

Engineer checks the docs and Slack, then starts working based on outdated info.

Engineer writes the handler based on a previous API version.

Test suite fails: it expects the current API version's input for the database operation.

Engineer disables the failing test and pings Slack.

Engineer implements remaining business logic, forgetting about the disabled test.

Engineer updates the thread in Slack, but not the spec or the team's Postman collection. There are now at least four different sources of "truth."

Ships. On Wednesday, the consumer team sends a message about a field that changed shape.

Project Timeline — Level 3

PM files a feature request.

Engineer opens the spec, adds a new endpoint, and adds two fields to an existing response shape.

Spec validator runs locally. Change is internally consistent. Branch pushed.

Generation pipeline runs. Handlers, validators, contract tests, CDK diff — all regenerated.

PR opens automatically: spec diff, generated-code diff, change summary.

Team reviews the spec diff. Nobody reviews ten thousand lines of generated code — because nobody needs to.

Ships. Users love the new stable features.
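Two spec-side steps in that timeline are small enough to sketch: the local consistency check and the PR change summary. Both functions below are hypothetical and cover only a few cases.

- `consistency_errors` checks two things. Local `#/components/schemas/...` references must resolve, and `operationId` values must be unique, which the OpenAPI Specification requires.
- `summarize_spec_diff` lists added or removed endpoints and schema fields.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}
SCHEMA_PREFIX = "#/components/schemas/"


def operations(spec: dict):
    """Yield (METHOD, path, operation) for every operation in the spec."""
    for path, item in spec.get("paths", {}).items():
        for method, op in item.items():
            if method in HTTP_METHODS:  # skip path-level keys such as "parameters"
                yield method.upper(), path, op


def find_refs(node) -> list[str]:
    """Collect every "$ref" value anywhere in the spec tree."""
    refs = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref":
                refs.append(value)
            else:
                refs.extend(find_refs(value))
    elif isinstance(node, list):
        for item in node:
            refs.extend(find_refs(item))
    return refs


def consistency_errors(spec: dict) -> list[str]:
    errors = []
    schemas = spec.get("components", {}).get("schemas", {})
    for ref in find_refs(spec):
        # Only local schema references are checked in this sketch.
        if ref.startswith(SCHEMA_PREFIX) and ref[len(SCHEMA_PREFIX):] not in schemas:
            errors.append(f"unresolved $ref {ref}")
    seen = set()
    for method, path, op in operations(spec):
        op_id = op.get("operationId")
        if op_id is not None and op_id in seen:
            errors.append(f"duplicate operationId {op_id} at {method} {path}")
        seen.add(op_id)
    return errors


def summarize_spec_diff(old: dict, new: dict) -> list[str]:
    lines = []
    old_ops = {(m, p) for m, p, _ in operations(old)}
    new_ops = {(m, p) for m, p, _ in operations(new)}
    lines += [f"added endpoint {m} {p}" for m, p in sorted(new_ops - old_ops)]
    lines += [f"removed endpoint {m} {p}" for m, p in sorted(old_ops - new_ops)]
    old_schemas = old.get("components", {}).get("schemas", {})
    new_schemas = new.get("components", {}).get("schemas", {})
    for name in sorted(old_schemas.keys() & new_schemas.keys()):
        before = set(old_schemas[name].get("properties", {}))
        after = set(new_schemas[name].get("properties", {}))
        lines += [f"{name}: added field {f}" for f in sorted(after - before)]
        lines += [f"{name}: removed field {f} (possibly breaking)" for f in sorted(before - after)]
    return lines


# Mirrors the timeline: one new endpoint, two new fields on an existing response shape.
old = {
    "paths": {"/orders/{id}": {"get": {"operationId": "getOrder"}}},
    "components": {"schemas": {"Order": {"properties": {"id": {}, "sku": {}}}}},
}
new = {
    "paths": {
        "/orders/{id}": {"get": {"operationId": "getOrder"}},
        "/orders/{id}/events": {"get": {"operationId": "listOrderEvents"}},
    },
    "components": {"schemas": {"Order": {"properties": {"id": {}, "sku": {}, "status": {}, "eta": {}}}}},
}
print(consistency_errors(new))
for line in summarize_spec_diff(old, new):
    print(line)
```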

The Level 3 project is not an impossible utopia. It is what happens when the tool is right and the discipline is architectural instead of cultural. The code is a build output, and trusting it is the same kind of trust you already extend to your TypeScript compiler.

Level 3 means you can compile systems from specs.

Ready to start?

ARCitect is the first piece of spec-as-source we're putting in your hands: the editor where the spec lives, and from which everything downstream is derived.

Join the ARCitect Beta, or see how ARC works.
