From Spec to Production in Days, Not Months
ARC Pilot Project #1 is a multi-tenant, TCPA-compliant SMS loan notification platform — and a proof of what ARC Code Reactor delivers at production scale.
The Short Version
This case study analyzes ARC Pilot Project #1, an application we built for a design partner. ARC Pilot Project #1 is a production-ready multi-tenant SaaS platform that automates TCPA-compliant SMS payment reminders for lenders — handling everything from borrower consent and timezone-aware scheduling to exactly-once delivery guarantees.
Using Code Reactor v0 in late 2025, 85% of the code for the finished system was generated in a single shot on the first day of the project.
We recently released Code Reactor v1, which improved code and test generation across the board. In particular, v1 raised unit and contract test coverage to 100% across all services.
Five purpose-built microservices form a cohesive pipeline from tenant onboarding through to compliant, time-aware borrower communication. ARC Code Reactor generated the overwhelming majority of that codebase (source code, test suites, and infrastructure-as-code) from a validated OpenAPI spec.
The ARC workflow used to build ARC Pilot Project #1 can accelerate your next complex software project from months to days — without sacrificing architecture quality or compliance rigor.
Two Things We Set Out to Prove
This case study measures ARC Code Reactor against two distinct questions:
- How much of a production system can be generated from a spec?
- How well does the generated code hold up under automated testing?
Both matter — generation rate without coverage is untested code at scale, and coverage without generation rate is just a better test harness for code you still had to write by hand.
Code Generation Rate: The percentage of the final deployed codebase that was produced by ARC Code Reactor rather than written by a developer. Measured per service as LOC Generated ÷ Total LOC Deployed.
Test Code Coverage: The percentage of the deployed codebase exercised by ARC-generated test suites. Measured across unit, contract, and functional test layers using Vitest branch coverage as the primary signal.
Code Generation Rate
During the pilot project, ARC produced more than 85% of all deployed code in a single, deterministic pass. That is more than 3x the cited average share of AI-assisted code in deployed systems.
Source: Laura Tacho (CTO, DX) — “Measuring Developer Productivity & AI Impact” (2026)
Test Code Coverage
ARC didn't just generate most of the system code. It also generated the test suites that automatically validate the system. Those suites reached 88% code coverage, giving an extremely high degree of confidence in the system.
Source: Kochhar, Thung, Lo & Lawall, APSEC 2014 — “An Empirical Study on the Adequacy of Testing in Open Source Projects”
Five Services, One Cohesive Platform
Each service owns a distinct domain. Together they handle the full lifecycle — from tenant provisioning to compliant borrower communication.
| Service | Admin | Auth | Batch | Core | Messaging |
|---|---|---|---|---|---|
| Endpoints | 3 | 1 | 1 | 22 | 1 |
| AWS Resources | 87 | 39 | 32 | 451 | 54 |
| Lines of Code | 2,846 | 749 | 661 | 21,245 | 1,226 |
Admin Service
Handles provisioning of lender accounts, campaign configuration, and system-wide settings. Ensures each tenant operates in a fully isolated context within the multi-tenant architecture.
Auth Service
Integrates with Auth0 to handle authentication and authorization across all services. Enforces role-based access control and tenant-scoped request isolation via JWT validation.
Core Service
The operational hub for lenders. Manages loans, borrowers, and campaign lifecycle via a RESTful API, with business logic covering campaign stages, due date tracking, and contact records.
Messaging Service
Owns all SMS communication via Twilio — outbound notifications, inbound webhook handling, conversation history, and opt-out consent management.
Batch Service
The notification engine. Runs on a daily EventBridge schedule, evaluating each loan against campaign stage, due date, consent status, and notification history to trigger compliant outreach.
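The per-loan evaluation described for the notification engine can be pictured as a single eligibility check. The following is a minimal sketch, not the pilot's implementation: the `LoanSnapshot` shape, the three-day reminder window, and the 8:00 to 21:00 local sending window are all assumptions chosen for illustration.

```typescript
// Hypothetical loan shape: every field name here is an assumption for illustration.
interface LoanSnapshot {
  campaignStageActive: boolean; // campaign stage currently allows outreach
  daysUntilDue: number; // due date tracking, in whole days
  hasSmsConsent: boolean; // borrower has not opted out
  notifiedToday: boolean; // notification history for the current day
  borrowerTimeZone: string; // IANA zone, e.g. "America/New_York"
}

// Assumed policy values, not taken from the platform.
const REMINDER_WINDOW_DAYS = 3;
const EARLIEST_LOCAL_HOUR = 8; // inclusive
const LATEST_LOCAL_HOUR = 21; // exclusive

// Hour of day (0-23) at `instant` in the given IANA time zone.
function localHour(instant: Date, timeZone: string): number {
  const text = new Intl.DateTimeFormat("en-US", {
    timeZone: timeZone,
    hour: "numeric",
    hour12: false,
  }).format(instant);
  // Some engines render midnight as "24" when hour12 is false.
  return parseInt(text, 10) % 24;
}

// True only when every condition (stage, consent, history, due date, local time) passes.
function shouldNotify(loan: LoanSnapshot, now: Date): boolean {
  const hour = localHour(now, loan.borrowerTimeZone);
  return (
    loan.campaignStageActive &&
    loan.hasSmsConsent &&
    !loan.notifiedToday &&
    loan.daysUntilDue >= 0 &&
    loan.daysUntilDue <= REMINDER_WINDOW_DAYS &&
    hour >= EARLIEST_LOCAL_HOUR &&
    hour < LATEST_LOCAL_HOUR
  );
}
```

Because the hour is computed in the borrower's own zone, the same scheduled run can send to a borrower in Los Angeles while holding a message for a borrower in New York, where it is already past the assumed cutoff.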
Metric 1: Code Generation Rate
ARC Code Reactor produces three categories of output from a validated OpenAPI spec: source code, comprehensive test suites, and infrastructure-as-code. Together, these outputs represent the complete deployable artifact — not a scaffold to iterate from, but a production-ready baseline.
The table below breaks down generated versus manually written lines of code across all five services. The LOC Delta row reflects custom business logic, environment-specific wiring, and developer-authored overrides beyond the spec-defined surface area.
| Metric | Admin | Auth | Batch | Core | Messaging |
|---|---|---|---|---|---|
| LOC Generated (ARC) | 2,846 | 749 | 661 | 21,245 | 1,226 |
| LOC Delta (Manual) | 291 | 31 | 3,379 | 4,079 | 77 |
| % Generated | 90.7% | 96.0% | 16.4% | 83.9% | 94.1% |
Note on Batch Service: The Batch service's lower generation rate reflects its role as the custom notification engine — containing a high concentration of bespoke business logic for TCPA compliance, timezone-aware scheduling, and campaign-stage evaluation that falls outside standard CRUD patterns. Code Reactor's generation rate improves as more of this logic is expressed through the spec.
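The % Generated row can be reproduced from the two LOC rows. The sketch below assumes that Total LOC Deployed for a service equals LOC Generated plus LOC Delta, an assumption that matches every percentage in the table. The type and function names are illustrative.

```typescript
// LOC figures copied from the table above.
interface ServiceLoc {
  service: string;
  generated: number; // LOC Generated (ARC)
  manual: number; // LOC Delta (Manual)
}

const services: ServiceLoc[] = [
  { service: "Admin", generated: 2846, manual: 291 },
  { service: "Auth", generated: 749, manual: 31 },
  { service: "Batch", generated: 661, manual: 3379 },
  { service: "Core", generated: 21245, manual: 4079 },
  { service: "Messaging", generated: 1226, manual: 77 },
];

// Generation rate as a percentage, rounded to one decimal place.
function generationRate(s: ServiceLoc): number {
  // Assumption: total deployed LOC = generated + manually written.
  const total = s.generated + s.manual;
  return Math.round((s.generated / total) * 1000) / 10;
}

// One rate per service, in table order.
const rates = services.map(function (s) {
  return generationRate(s);
});
```

For Batch, for example, this is 661 ÷ (661 + 3,379) = 16.4%, which is why that service stands out from the other four.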
Estimating Dev Velocity Improvements
We used ARCitect's project estimation workflow to produce agent-based effort estimates. Measured against those estimates, the pilot was completed 3.5x faster than manual development and 50% faster than AI-assisted development without ARC.
ARC Velocity: Building a production-ready system with “exemplary” test coverage within a single sprint is impressive and would allow any team to ship faster without sacrificing quality. The only thing better is being able to do it whenever the spec changes.
What Does Good Coverage Actually Look Like?
Before reviewing the numbers ARC generated, it helps to have an industry baseline. Code coverage is not a perfect measure of test quality — a codebase can have high coverage and still ship bugs if tests assert the wrong things. But a low coverage number does make one thing certain.
“A low code coverage number does guarantee that large areas of the product are going completely untested by automation on every single deployment.”
That guaranteed blind spot increases the risk of pushing bad code to production on every release. Coverage gives teams an objective, actionable signal: not a perfect one, but a meaningful floor. Google's engineering team offers the most widely cited industry benchmark. Their guidelines draw on experience at scale across thousands of services:
60% (Acceptable): A functional floor. Large areas of code still go untested on every deployment.
75% (Commendable): A solid baseline. Most critical paths are exercised, edge cases still present risk.
90% (Exemplary): The standard ARC targets. High confidence that automation catches regressions before production.
Source: Google Testing Blog — “Code Coverage Best Practices” (2020). Although there is no ideal coverage number, Google offers 60% as acceptable, 75% as commendable, and 90% as exemplary across thousands of production services.
Code coverage provides significant benefits to the developer workflow. It is not a perfect measure of test quality, but it does offer a reasonable, objective, industry-standard metric with actionable data. The sections below show where ARC-generated test suites land against these benchmarks — on day one, before a developer writes a single manual test.
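To make the benchmark concrete, the three Google thresholds can be expressed as a small classifier. This is a sketch only: treating each threshold as an inclusive lower bound is an assumption, and the `coverageTier` name and the "below acceptable" label are illustrative.

```typescript
type CoverageTier = "below acceptable" | "acceptable" | "commendable" | "exemplary";

// Maps a coverage percentage to the tier names used in the Google guideline.
// Assumption: each threshold is an inclusive lower bound.
function coverageTier(percent: number): CoverageTier {
  if (percent >= 90) return "exemplary";
  if (percent >= 75) return "commendable";
  if (percent >= 60) return "acceptable";
  return "below acceptable";
}
```

A check like this could gate a CI job, for instance by failing the build when a service drops to a lower tier than it held on the previous run.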
Code Coverage Metrics Overview
Test coverage is measured across multiple dimensions — and the numbers can tell very different stories depending on which metric you look at. The table below presents a summary of key coverage types per service. What matters most is the direction: high coverage across all three categories means the generated code is being exercised broadly, from individual functions through to complete request paths.
Primer: How to read coverage metrics
Vitest reports four distinct coverage metrics for every test run. Each measures a different aspect of how thoroughly the test suite exercises the codebase. They are related but not equivalent — a service can have high line coverage and low branch coverage, which would mean conditional logic is largely untested.
Statements: The percentage of individual executable statements that were run at least once. Similar to line coverage but more granular: multiple statements on one line are counted separately.
Branches: The percentage of decision branches exercised, covering every if, else, switch, and ternary path. This is the strictest signal: a line can be “covered” while one of its branches remains completely untested.
Functions: The percentage of declared functions that were called at least once. A useful coarse-grained signal for identifying dead code or untested handlers, but it says nothing about whether the logic inside those functions was fully exercised.
Lines: The percentage of source lines executed during the test run. The most intuitive metric and the one closest to what most engineers picture when they hear “code coverage.” Used as the headline figure in the table below.
Why branches? Branch coverage is the stricter and more meaningful signal — complete coverage numbers are included in the appendix for readers who want the full picture.
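The gap between line and branch coverage is easiest to see on a one-line conditional. The helper below is hypothetical and not part of the pilot codebase. It only illustrates how a single call can execute every line of a function while leaving one branch untouched.

```typescript
// Hypothetical helper used only to illustrate the metrics.
function reminderText(daysUntilDue: number): string {
  const when = daysUntilDue === 0 ? "today" : "in " + daysUntilDue + " day(s)";
  return "Your payment is due " + when + ".";
}

// A suite containing only this call runs every line of reminderText,
// so its line coverage is 100%, but only the "today" side of the ternary
// executes, so one of its two branches is never exercised.
const dueToday = reminderText(0);

// A second call that takes the other side of the ternary closes the gap.
const dueLater = reminderText(3);
```

With only the first call in a test suite, a line-based report would show this function as fully covered, while a branch-based report would flag the untested path.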
Test Pyramid Overview
The Testing Pyramid is a strategic framework that prioritizes a large base of fast, automated checks to catch errors early, ensuring software remains reliable and cost-effective as it grows.
Functional Testing: Checks the software against specific business requirements to confirm that the system actually does what the user expects it to do.
Contract Testing: Verifies that two different systems (like a frontend and a backend) agree on how to communicate, ensuring they can exchange data without breaking.
Unit Testing: Focuses on the smallest parts of the application, like individual functions or components, to ensure each piece of logic works correctly in isolation.
Primer: ARC Testing Pyramid
Functional Testing
ARC understands your code at the structural level — going far beyond a simple Markdown context file. This understanding enables functional tests that exercise each code path created by business logic, verifying that exact business rules and constraints were implemented correctly.
Contract Testing
ARC automatically generates contract test suites that verify API contract compliance — confirming schema conformance, endpoint existence, and status code correctness against the authoritative OpenAPI spec. These tests make spec drift immediately visible in CI.
Unit Testing
ARC generates unit tests for low-level functions related to data models. Code Reactor v1 extends this to include test coverage for utility functions defined within ARCitect — reducing the surface area that requires manual test authorship.
Metric 2: Test Code Coverage
ARC doesn't just generate application code. It generates comprehensive test suites across the conventional testing pyramid — from low-level unit tests to API contract verification. All generated tests are passing.
| Coverage Type | Admin | Auth | Batch | Core | Messaging |
|---|---|---|---|---|---|
| Unit Test | 91.3% | 100% | 100% | 95.6% | 100% |
| Contract Test | 80% | 100% | 78.8% | 98.4% | 75% |
| Functional Test | 82.4% | 66.7% | 75.5% | 78.4% | 60% |
| Total | 83.6% | 80% | 76.3% | 89.8% | 63.15% |
Unit Test Coverage
Coverage of generated unit tests per service, comparing v0 output to the v1 rerun. All tests passing.
| Version | Admin | Auth | Batch | Core | Messaging |
|---|---|---|---|---|---|
| v0.0 | 91.3% | 100% | 100% | 95.6% | 100% |
| v1.0 | 100% | 100% | 100% | 100% | 100% |
Contract Test Coverage
ARC generates contract tests that verify API shape conformance — schema correctness, endpoint existence, and expected status codes — against the authoritative OpenAPI spec. These run in CI and make spec drift immediately visible before it reaches production.
| Version | Admin | Auth | Batch | Core | Messaging |
|---|---|---|---|---|---|
| v0.0 | 80.0% | 100% | 78.8% | 98.4% | 75% |
| v1.0 | 100% | 100% | 100% | 100% | 100% |
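As a rough illustration of what a contract check verifies, the sketch below compares an observed response with a simplified description of one operation. It is not ARC's generated output: the `OperationContract` shape is a stand-in for an OpenAPI operation, and the `/loans/{loanId}` path and field names are hypothetical.

```typescript
// Hypothetical, heavily simplified stand-in for one operation in an OpenAPI spec.
interface OperationContract {
  method: string;
  path: string;
  expectedStatus: number;
  requiredFields: string[];
}

interface ObservedResponse {
  status: number;
  body: { [key: string]: unknown };
}

// Returns a list of contract violations; an empty list means the response conforms.
function checkContract(contract: OperationContract, res: ObservedResponse): string[] {
  const violations: string[] = [];
  if (res.status !== contract.expectedStatus) {
    violations.push("expected status " + contract.expectedStatus + ", got " + res.status);
  }
  for (let i = 0; i < contract.requiredFields.length; i++) {
    const field = contract.requiredFields[i];
    if (!(field in res.body)) {
      violations.push("missing required field: " + field);
    }
  }
  return violations;
}

// Illustrative contract for a hypothetical endpoint.
const getLoan: OperationContract = {
  method: "GET",
  path: "/loans/{loanId}",
  expectedStatus: 200,
  requiredFields: ["id", "dueDate"],
};
```

If an implementation stops returning `dueDate`, or the route disappears and starts answering 404, the returned list is non-empty and a CI assertion on it fails, which is how spec drift becomes visible.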
Functional Test Coverage
ARC understands your code at the structural level — going far beyond a simple Markdown context file. This understanding enables functional tests that exercise each code path created by business logic, verifying that exact business rules and constraints were implemented correctly.
Using this in-development functional test generation capability, we achieved comprehensive coverage across the ARC Pilot Project #1 system in one sprint of developer effort. The next Code Reactor release will automatically generate functional tests with 90%+ coverage.
| Version | Admin | Auth | Batch | Core | Messaging |
|---|---|---|---|---|---|
| v0.0 | 83.6% | 66.7% | 75.8% | 89.8% | 63.15% |
Every Resource, Generated
ARC Code Reactor doesn't stop at application code. For every service in the ARC Pilot Project #1 platform, ARC generated the complete AWS infrastructure definition — API Gateway configurations, Lambda functions, DynamoDB tables, SQS queues, EventBridge rules, and IAM roles — as production-ready Infrastructure-as-Code.
Across the five ARC Pilot Project #1 services, ARC generated definitions for 663 AWS resources. These are not generic templates — they are generated directly from the validated OpenAPI spec and reflect the exact shape of each service's API surface and operational requirements.
| Service | Admin | Auth | Batch | Core | Messaging | Total |
|---|---|---|---|---|---|---|
| AWS Resources Generated | 87 | 39 | 32 | 451 | 54 | 663 |