For the last two months I've been studying how the teams who actually do this (Anthropic, OpenAI, and a handful of others) get to 100% AI-written code. Code that ships, written end-to-end by agents. I've also built my first working versions.
The first thing I learned: it's not about a better model. Opus 4.7 vs. Opus 4.6 vs. GPT-5.5 doesn't matter.
What matters is the system you build around the model. The teams getting to 100% are building something more like a software factory than a coding tool. The model is one part of the system. Most use more than one.
If you're trying to figure out how your team gets here, this post is a high-level map of what others are doing.
TL;DR: Getting to 100% AI-written code requires three things.
- Context engineering so agents can reason about your codebase without getting overwhelmed.
- Specialized agents so each agent gets one job and does it well.
- An outer harness so the whole system runs without you babysitting it.
A quick grounding for anyone newer to this.
An agent is an AI model that can use tools, hold memory, and run in a loop until it finishes a task. You give it a goal. It figures out how to get there. That's it.
Two things follow that matter for everything below:
- The context window is the agent's entire world. If something isn't in the window, or can't fit, the agent can't reason about it.
- The agent chooses its own path to the goal. Anything you want enforced along the way has to be enforced by the system around it, not by you watching each step.
Get those two ideas and the rest of this post follows.
Why can't you just point a single agent at a large codebase and ask it to ship a feature? Three reasons. Each one shapes one of the solutions below.
Context-window degradation. Models advertise 1M token windows, but effective performance collapses somewhere around 80–120K tokens — and has been stuck there for two years, per Blitzy CTO Sid Pardeshi on the Cognitive Revolution podcast. Past that, the model gets confused, loses thread, and quality nosedives. The advertised window is a marketing number. As of May 2026, the working window is smaller.
Intent drift. Left alone, agents optimize locally. They make the test pass, but they comment out the assertion. They fix the symptom but break the contract. Without explicit gates and reviews, an agent will quietly drift away from what you actually wanted.
Compound reliability decay. This is the math one. Suppose each step in your process is 95% reliable (generous, even for humans). Stack 10 of them and your end-to-end reliability is 59.9%. Multi-agent systems collapse fast unless you actively counter this.
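A quick back-of-the-envelope version of that math (the 95% per-step rate and the 80% gate catch rate below are illustrative numbers, not measurements):

```python
# Compound reliability: per-step success rate raised to the number of steps.
per_step = 0.95
steps = 10
print(per_step ** steps)                    # ~0.599 -> 59.9% end-to-end

# Now put a review gate after each step. If the gate catches, say, 80% of
# the failures a step produces, the effective per-step reliability rises.
gate_catch_rate = 0.80
effective = per_step + (1 - per_step) * gate_catch_rate    # 0.99
print(effective ** steps)                   # ~0.904 -> 90.4% end-to-end
```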
These three are the reason a single agent can't ship production code. Each pillar below addresses at least one.
The teams getting to 100% AI code solve these failure modes with the same three pillars: context engineering, specialized agents, and an outer harness. The pillars don't eliminate the failure modes; they bound them.
Skip any one of them and the others won't carry the weight.
A single agent cannot absorb and reason over most real-world codebases. There is too much information for the working window, and most of the code isn't relevant to any given task.
The solution is context engineering: deliberately controlling what an agent sees in its working window. For codebases, the emerging pattern is a documentation layer inside the repo, designed for agents.
Here's how I'm doing it right now. I add a new folder next to my code to contain specialized documentation just for agents.
```
_dev/
├── current-state/
│   ├── architecture.md       # structural invariants and boundaries
│   ├── project-context.md    # product intent and domain vocabulary
│   ├── data/                 # schemas, tables, procedures (DB-backed only)
│   ├── how-to/               # how we build it, team standards by stack layer
│   ├── requirements/         # what the system does, one file per capability
│   └── workflows/            # user, data, and system flows worth documenting
└── changesets/               # in-flight units of work (archived after merge)
    ├── 001-frontend-asset-normalization/
    ├── 002-add-expert-design-polish/
    └── ...
src/
```
current-state/ files and folders are a living description of how the system works right now, written for agents to read, not humans.
changesets/ holds in-flight units of work. Once a changeset merges, it's archived out of the repo so it doesn't bloat context or confuse agents reading the codebase. The git history takes over from there.
The key move is slices: small, self-contained pieces of context that get auto-injected into an agent's window based on what it's working on. Touching the auth code pulls in the auth slice. Working on ecommerce reporting pulls in the ecommerce data and reporting slices.
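A minimal sketch of what slice selection can look like, assuming a manifest that maps each slice to the paths it covers (the file names, patterns, and function here are hypothetical, not my actual tooling):

```python
from fnmatch import fnmatch

# Hypothetical slice manifest: each slice doc (living under _dev/current-state/)
# lists the source paths it describes.
SLICE_MANIFEST = {
    "auth.md":                ["src/auth/**", "src/middleware/session*"],
    "ecommerce-data.md":      ["src/ecommerce/models/**", "db/ecommerce/**"],
    "ecommerce-reporting.md": ["src/ecommerce/reports/**"],
}

def slices_for(touched_paths: list[str]) -> list[str]:
    """Return the slice docs to inject for the files a task will touch."""
    return [
        slice_doc
        for slice_doc, patterns in SLICE_MANIFEST.items()
        if any(fnmatch(path, pattern) for path in touched_paths for pattern in patterns)
    ]

# Touching the auth code pulls in the auth slice; a reporting task pulls in
# the ecommerce data and reporting slices.
print(slices_for(["src/auth/login.py"]))
print(slices_for(["src/ecommerce/reports/weekly.py", "db/ecommerce/orders.sql"]))
```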
A real example: I built a tool to generate the agent documentation set for any codebase. One codebase I'm working on is 328K lines across 1,423 files. The tool ran for two hours and produced 216 slices. Now any agent working in that codebase starts with a few relevant slices instead of trying to find what matters across 1,400+ files.
This is what fixes context-window degradation. Instead of forcing an agent to scan and load large parts of a codebase into its window, the system pulls in just what's relevant for the immediate task.
Most teams are using files. Some are moving to databases and other tools. The tooling is evolving, but the idea is the same.
Skip this layer and every other pillar is built on sand. Agents that can't discover and reason about your codebase well will produce confidently wrong code.
A single agent can't be great at writing specs, writing code, reviewing code, AND running tests. So you don't ask one to. You build a small team of specialists.
This shows up at two scales: 1) across the pipeline, and 2) within a single step.
A changeset moves through a sequence of specialist agents, with gates between them.
flowchart TB
CB["change-brief.md<br/><i>Developer's input</i>"]:::box-2
S1["<b>Step 1<br />/changeset-writing</b>"]:::box-1
PG["<b>(optional)<br />/changeset-review</b>"]:::box-3
S2["<b>Step 2<br />/story-writing</b><br />code + docs"]:::box-1
S3["<b>Step 3<br />/story-review</b>"]:::box-1
S4["<b>Step 4<br />/dev-exec</b><br />code + docs"]:::box-1
S5["<b>Step 5<br />/dev-review</b>"]:::box-1
S6["<b>(optional)<br />Step 6<br />/test-exec</b>"]:::box-3
S7["<b>Step 7<br />/changeset-report</b>"]:::box-1
S8["<b>Step 8<br />Human regression</b>"]:::box-1
SC["<b>Step 9<br />/changeset-close</b>"]:::box-1
SR["<b>/changeset-rework</b>"]:::box-1
DONE["merged code + updated docs"]:::box-4
CB --> S1
S1 -.optional.-> PG
PG -.rework.-> S1
S1 --> S2 --> S3 --> S4 --> S5 --> S7 --> S8
S3 -.rework.-> S2
S5 -.rework.-> S4
S5 -.optional.-> S6
S6 -.optional.-> S7
S8 --> SC
S8 -.issues.-> SR
SR -.rework cycle.-> S2
SC --> DONE
Each step has one job. The story-writer turns a changeset into stories. The dev-exec implements them. The reviewers check. Failed reviews route back to the previous step. Code and current-state docs are first-class outputs, updated in the same flow.
This pipeline is what fixes compound reliability decay. With explicit reviews and rework loops between steps, you bound the failures. Each gate catches errors before they compound.
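One way to keep a gate deterministic is to have each reviewer end its run by writing a structured verdict that plain code routes on, so nothing downstream has to trust the reviewer's prose. A sketch, with the file name and fields invented for illustration:

```python
import json

def route_after_review(verdict_path: str, on_pass: str, on_fail: str) -> str:
    """Read a reviewer's structured verdict and decide where the pipeline goes next."""
    with open(verdict_path) as f:
        verdict = json.load(f)
    blocking = [x for x in verdict["findings"] if x["severity"] in ("critical", "major")]
    return on_fail if blocking else on_pass

# e.g. after Step 3: advance to dev-exec on a clean review, or send the
# changeset back to story-writing if there are blocking findings.
# route_after_review("changesets/001-frontend-asset-normalization/story-review.json",
#                    on_pass="dev-exec", on_fail="story-writing")
```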
Sometimes a single step is itself too big for one agent: too much context, too many things to do, too many parallel items. That's where sub-agents come in.
%% wide
flowchart TB
SKILL["<b>Skill orchestrator</b><br/><i>holds the plan,<br/>drives the gates</i>"]:::box-1
PLAN["<b>Planner</b><br/><i>ordered work list</i>"]:::box-3
W["<b>Worker</b><br/><i>does the work<br/>(parallel, one per item)</i>"]:::box-3
R["<b>Per-Item Reviewer</b><br/><i>local correctness</i>"]:::box-3
HR["<b>Holistic Reviewer</b><br/><i>cross-item coherence</i>"]:::box-3
CONS["<b>Consolidator</b><br/><i>collapse findings,<br/>route by severity</i>"]:::box-3
OUT["Artifacts + findings"]:::box-4
SKILL -->|1 plan| PLAN
SKILL -->|2 dispatch workers| W
SKILL -->|3 dispatch reviewers| R
SKILL -->|4 dispatch holistic| HR
SKILL -->|5 dispatch consolidator| CONS
CONS --> OUT
The pattern: a skill orchestrator holds the plan and drives the gates. A planner produces an ordered work list. Workers do the work in parallel, one per item. A per-item reviewer checks each output for local correctness. A holistic reviewer checks for cross-item coherence. A consolidator collapses findings and routes by severity.
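A skeletal version of that fan-out, with the actual agent invocation stubbed out (run_agent stands in for however you dispatch a sub-agent; it doesn't map to any real SDK):

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(role: str, payload: dict) -> dict:
    """Stand-in for invoking a sub-agent (CLI call, API call, whatever you use)."""
    raise NotImplementedError

def run_step(task: dict) -> dict:
    # 1. Planner produces an ordered work list.
    items = run_agent("planner", task)["items"]

    # 2. Workers run in parallel, one per item.
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(lambda item: run_agent("worker", item), items))

    # 3. Per-item reviewers check each output for local correctness.
    with ThreadPoolExecutor() as pool:
        reviews = list(pool.map(lambda out: run_agent("item-reviewer", out), outputs))

    # 4. One holistic reviewer looks across all outputs for cross-item coherence.
    holistic = run_agent("holistic-reviewer", {"outputs": outputs})

    # 5. Consolidator collapses findings and routes them by severity.
    return run_agent("consolidator", {"reviews": reviews, "holistic": holistic})
```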
You don't need this everywhere. You need it where a step would otherwise overwhelm a single agent's window or attention.
The principle: each agent is a specialist with a tight job. Quality compounds across specialists in a way it doesn't inside a single generalist agent.
A model on its own is a text generator. You call an API, you get a completion back. That's the whole interface. To make a model useful in a real workflow (running in a loop, using tools, recovering from errors, managing what's in its context), you need software wrapped around it. That software is called an agent harness.
You've probably already used several. Claude Code, Cursor's agent, Codex, and OpenAI's reasoning agents are all harnesses wrapping models. Anthropic itself frames Claude Code as the harness around Claude. The harness is what turns "I can call the API" into "I can hand the agent a task and walk away."
So when you run a custom agent like /story-writing, you're already inside a harness. Call this the inner harness: it manages the tool loop, error handling, and the back-and-forth with the model for a single agent.
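Stripped to its core, an inner harness is just a loop like this (a minimal sketch; call_model and the message format are placeholders, not any particular vendor's API):

```python
def run_inner_harness(goal: str, tools: dict, call_model) -> str:
    """Minimal agent loop: call the model, execute the tool calls it asks for,
    feed results back, and repeat until it stops asking for tools."""
    messages = [{"role": "user", "content": goal}]
    while True:
        reply = call_model(messages)              # one model completion
        if reply.get("tool_call") is None:
            return reply["content"]               # no tool requested: we're done
        name, args = reply["tool_call"]
        try:
            result = tools[name](**args)          # run the requested tool
        except Exception as exc:                  # error handling: report, don't crash
            result = f"tool {name} failed: {exc}"
        messages.append({"role": "tool", "name": name, "content": str(result)})
```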
The outer harness is the equivalent layer for your system. It's the code that wraps your specialized agents and orchestrates them across a pipeline.
Without an outer harness, I needed to stay in the loop to:
- kick off each step's agent myself,
- carry outputs from one step into the next,
- watch for failures and decide when to route work back for rework.
With a harness, I work with a conversational agent to create the changeset, then hand it off. The harness does the rest. I get to walk away and check the PR when it's done.
Same pipeline from Pillar 2. The harness now wraps the autonomous middle:
flowchart TB
CB["change-brief.md<br/><i>Developer's input</i>"]:::box-2
S1["<b>Step 1<br />/changeset-writing</b>"]:::box-1
PG["<b>(optional)<br />/changeset-review</b>"]:::box-3
subgraph HARNESS["Outer Harness"]
direction TB
S2["<b>Step 2<br />/story-writing</b><br />code + docs"]:::box-1
S3["<b>Step 3<br />/story-review</b>"]:::box-1
S4["<b>Step 4<br />/dev-exec</b><br />code + docs"]:::box-1
S5["<b>Step 5<br />/dev-review</b>"]:::box-1
S6["<b>(optional)<br />Step 6<br />/test-exec</b>"]:::box-3
S7["<b>Step 7<br />/changeset-report</b>"]:::box-1
end
S8["<b>Step 8<br />Human PR Review</b>"]:::box-1
SC["<b>Step 9<br />/changeset-close</b>"]:::box-1
SR["<b>/changeset-rework</b>"]:::box-1
DONE["merged code + updated docs"]:::box-4
CB --> S1
S1 -.optional.-> PG
PG -.rework.-> S1
S1 --> HARNESS
S2 --> S3 --> S4 --> S5 --> S7 --> S8
S3 -.rework.-> S2
S5 -.rework.-> S4
S5 -.optional.-> S6
S6 -.optional.-> S7
S8 --> SC
S8 -.issues.-> SR
SR -.rework cycle.-> HARNESS
SC --> DONE
Your harness can be built in anything that can call a CLI or API. PowerShell, C#, bash, Node, Go. The choice doesn't matter much. What matters is that the harness exists and that it owns all the deterministic glue.
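For a sense of how thin this layer can be, here is a sketch of a runner in Python. The CLI command, step names, and exit-code gate are placeholders for whatever your agents actually run behind; a real gate would more likely parse a verdict file like the one sketched in Pillar 2:

```python
import subprocess

# Ordered steps, and where a failed gate routes back to. The CLI command is a
# placeholder; each step could just as easily be an API call.
STEPS = ["story-writing", "story-review", "dev-exec", "dev-review", "changeset-report"]
REWORK = {"story-review": "story-writing", "dev-review": "dev-exec"}

def run_agent_step(step: str, changeset: str) -> bool:
    """Kick off one specialist agent; use a deterministic signal as the gate.
    Here that's the exit code; in practice I'd parse a findings file instead."""
    result = subprocess.run(["my-agent-cli", f"/{step}", "--changeset", changeset])
    return result.returncode == 0

def run_pipeline(changeset: str, max_rework: int = 3) -> bool:
    i, rework_cycles = 0, 0
    while i < len(STEPS):
        step = STEPS[i]
        if run_agent_step(step, changeset):
            i += 1                                # gate passed: advance
        elif step in REWORK and rework_cycles < max_rework:
            rework_cycles += 1
            i = STEPS.index(REWORK[step])         # gate failed: route back
        else:
            return False                          # stuck: stop and flag for a human
    return True
```

The point isn't this particular loop. The point is that everything deterministic (sequencing, gates, retries, artifact handoff) lives in plain code instead of in a prompt.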
This is the pillar that turns "I can run an agent" into "the system runs itself." Without an outer harness, you're driving the pipeline by hand: kicking off agents, copying outputs around, watching for failures. That's fine for experimentation. It doesn't get you to 100%.
The three pillars feel right and fit my research into how other teams are doing this. None of them are finished for me.
I'm at about 95% autonomous development across complex tasks, not 100%. There are still situations where it's easier to tweak the code than explain what I want clearly enough for the agents. My pipelines take longer to run and cost more than I'd like. This might be the realistic ceiling for me and most teams. Getting from 95% to 100% feels like tuning the pillars, not adding new ones.
Here's where each one still needs work, plus a cross-cutting concern I'm just starting.
Context Engineering. My current tooling does the job, but it leans on agents for parts that traditional scripts written in Python, bash, or PowerShell could handle faster and cheaper. Code Intelligence tools like GitNexus expose codebase search and discovery as MCP tools any agent can call, which could speed up my documentation process.
Specialized Agents. Two open questions. First, when to split a single agent into sub-agents — the threshold isn't obvious yet. Second, which agents can move off frontier models onto faster, cheaper alternatives without losing quality.
Outer Harness. Others are playing here: LangGraph, Cursor's TypeScript SDK, Anthropic's Managed Agents, GitHub's Copilot SDK. None of them yet let me declare a pipeline that mixes CLI calls, API calls, and deterministic gates the way I need. This is likely to become standard infrastructure in the next 12 months. I'm watching closely as I build my custom tools in parallel.
Evals. I don't yet have a clean way to measure whether changes to the pipeline (new prompts, different gates, swapped models) actually improve outcomes. Right now every adjustment is vibes-based. Building real evals (quality monitoring over time and a way to A/B test system changes) is the next big piece I'm exploring.
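The rough shape I have in mind: log every pipeline run with the configuration variant that produced it, then compare variants on a few outcome metrics over enough runs to see a signal. Everything below is a placeholder sketch, not something I've built:

```python
import json
import statistics
from datetime import datetime, timezone

def log_run(path: str, variant: str, metrics: dict) -> None:
    """Append one pipeline run: which configuration variant ran, and how it did."""
    record = {"ts": datetime.now(timezone.utc).isoformat(), "variant": variant, **metrics}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def compare(path: str, metric: str = "rework_cycles") -> dict:
    """Average one outcome metric per variant, e.g. to A/B test a prompt change."""
    with open(path) as f:
        runs = [json.loads(line) for line in f]
    return {
        variant: statistics.mean(r[metric] for r in runs if r["variant"] == variant)
        for variant in {r["variant"] for r in runs}
    }

# log_run("evals.jsonl", "baseline",   {"rework_cycles": 2, "tokens": 410_000})
# log_run("evals.jsonl", "new-prompt", {"rework_cycles": 1, "tokens": 355_000})
# compare("evals.jsonl")
```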
If you're trying to figure out how your team gets to 100% AI-written code, I'd love to compare notes. Reach out.