The Journey

How does AI adoption actually work?

Not everyone is at the same level. Not every level delivers what it promises. This is the complete journey — from a developer writing every line by hand to a factory that builds itself. Twelve levels. Three eras. One question: where does productivity actually shift?

Tools and data collected as of 18 March 2026

Era I: Individual AI

This is where everyone starts. Every line written by a human, every test run by hand. The baseline.

Era I: Individual AI
L1

No AI

The developer is the sole author of every line of code.

Traditional development. Code is written by hand, reviewed by humans, tested manually or through human-authored CI pipelines.
Still present in air-gapped environments, classified government work, and teams that have not yet adopted AI tooling.

Cost

$0 AI cost. Human labour only.

The first taste of AI. It finishes your sentences. It feels like magic. But look closer — you're still the one thinking, deciding, structuring. The AI is accelerating your fingers, not your mind.

Developers code 52 minutes a day. Autocomplete makes those 52 minutes faster. It does nothing about the other 7 hours.

Era I: Individual AI
L2

Autocomplete

AI accelerates typing. The developer's mental model, decisions, and workflow remain unchanged.

AI predicts the next tokens or completes a line of code inline. The developer accepts or rejects suggestions without leaving their editing flow. No dialogue. No multi-file awareness. The AI operates at the token and line level only.

Autocomplete tools

Product | Company | Detail
Copilot (inline) | GitHub / Microsoft | ~15M developers. $10/month individual, $19/month business [1].
Tabnine | Tabnine | On-premises deployment. SOC 2 Type II, GDPR, HIPAA. Trains on customer code only. $9/month [1].
Supermaven | Anysphere (now part of Cursor) | Fastest autocomplete measured. 300K-token context window. $10/month. Acquired by Cursor parent company [34].
Cody (completions) | Sourcegraph | Strong codebase search and indexing, particularly for large legacy systems [94].
AI Assistant | JetBrains | Native to IntelliJ family IDEs.
Q Developer (inline) | Amazon | AWS ecosystem. Formerly CodeWhisperer. Free tier with limits [130].
Gemini Code Assist (inline) | Google | Google Cloud ecosystem. Free tier available [131].
Double.bot | Double (YC W23) | VS Code extension focused on completion quality: bracket handling, auto-imports, multi-cursor, mid-line completions. ~28K installs [83].

Productivity Evidence

GitHub internal [1][2]
Individual: +55% on simple function generation. Typing phase only; devs code 52 min/day.
GitHub/Accenture RCT [84], n=800+
Individual: +8.7% PRs. +84% successful builds. ~30% suggestion acceptance rate.
Copilot longitudinal [146], n=225
Individual: +55.8% on simple tasks. One year of data.

Cost

$0.30-0.65/day ($10-20/mo). Copilot $10/mo, Tabnine $9/mo, Supermaven $10/mo.

Now you can ask questions. Describe a problem in plain language, get code back. But every conversation starts from zero — no memory, no project context. You are the integration layer. You copy. You paste.

Era I: Individual AI
L3

AI Chat Assistant

An always-available consultant, but every interaction starts from zero project context. No awareness of the codebase, no memory between sessions.

The developer opens a separate chat interface, describes a problem in natural language, receives code blocks, and pastes them into the codebase. The AI has no project context. Context must be manually copied into the prompt window. The integration layer is copy-paste.
No controlled studies isolate L3 from L4–L6.
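The copy-paste integration layer is easy to make concrete. A minimal sketch, assuming nothing beyond the standard library: the developer manually packs file contents into every prompt, because the assistant holds no project state between sessions. `build_prompt` is a hypothetical helper; nothing here calls a real model.

```python
# L3 in miniature: the developer, not the tool, is the integration layer.
# Context must be pasted into every fresh, zero-memory prompt by hand.

from pathlib import Path

def build_prompt(question: str, files: list[Path]) -> str:
    """Manually pack project files into a context-free chat prompt."""
    parts = [question]
    for f in files:
        parts.append(f"--- {f.name} ---\n{f.read_text()}")
    return "\n\n".join(parts)

# Every session starts from zero: the whole 'memory' is what you paste.
Path("demo.py").write_text("def add(a, b):\n    return a + b\n")
prompt = build_prompt("Why does add() fail on strings?", [Path("demo.py")])
```

The return trip is the same in reverse: code blocks come back as text, and the developer pastes them into the codebase by hand.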

AI Chat Assistants

Product | Company | Detail
Claude.ai / Claude Pro | Anthropic | Strong reasoning. 200K context window. $20/month Pro.
ChatGPT / GPT-4o | OpenAI | Largest general user base.
Gemini | Google | Long context, multimodal.
Grok | xAI | Integrated with X/Twitter. Free tier available.

Cost

$0.65-1.30/day ($20-40/mo). Claude Pro $20/mo, ChatGPT Plus $20/mo.

A significant leap. The AI moves inside your editor, sees your files, understands your imports. It proposes multi-file changes. But here's the catch: you review every diff, line by line. You are still the bottleneck.

This is the most studied level — and the most contested. One RCT found developers 19% slower. Another found them 26% faster. The difference? Experience, task complexity, and whether the study measured typing speed or thinking speed.

Era I: Individual AI
L4

AI in IDE, Supervised

The AI gains real project context. But the human approval loop on every change limits speed. The developer is still the bottleneck.

The AI is inside the editor with access to the project's files. It proposes multi-file edits, generates code from natural language, and understands project structure — file tree, imports, dependencies. Every proposed change requires explicit developer approval before it is applied. The developer reviews diffs line by line.
The most intensely studied level. Results diverge sharply depending on methodology.
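The L4 control flow can be sketched with the standard library: every agent-proposed edit becomes a unified diff, and nothing lands without an explicit human yes. `propose_change` and `apply_if_approved` are illustrative names — real IDEs render this loop in the editor, but the gate is the same.

```python
# L4's bottleneck as code: the AI proposes, the human approves, and only
# then is the change applied. The diff is the unit of review.

import difflib

def propose_change(original: str, proposed: str, path: str) -> str:
    """Render the agent's edit as a reviewable unified diff."""
    return "".join(difflib.unified_diff(
        original.splitlines(keepends=True),
        proposed.splitlines(keepends=True),
        fromfile=f"a/{path}", tofile=f"b/{path}",
    ))

def apply_if_approved(original: str, proposed: str, approved: bool) -> str:
    """The human approval loop: reject leaves the file untouched."""
    return proposed if approved else original

diff = propose_change("x = 1\n", "x = 1\nassert x == 1\n", "demo.py")
result = apply_if_approved("x = 1\n", "x = 1\nassert x == 1\n", approved=True)
```

Scale this gate across every multi-file edit and the divergent study results become less surprising: throughput is capped by human reading speed, not model speed.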

Commercial IDE tools

Product | Company | Detail
Cursor (standard mode) | Anysphere | VS Code fork. 1M+ users, 360K+ paying. ~$500M+ ARR. $20/month hobby, $40/month pro [34].
Windsurf (standard mode) | Cognition / Codeium | Cascade agent. $15/month. $82M ARR. 350+ enterprise customers [42].
Copilot Workspace | GitHub / Microsoft | Works from issues and PRs. Feb 2026 added Claude and Codex model access alongside native models.
Augment Code | Augment Code | Enterprise-focused. Context Engine indexes 400K+ files, full dependency graph. $50/dev/month. $252M Series B [34].
Trae | ByteDance | AI-native IDE (VS Code fork). 6M registered users, 1.6M MAUs across ~200 countries. Free Claude 3.7 Sonnet + GPT-4o initially; paid from ~$7.50/month. Builder Mode (SOLO) for full-stack generation: 44% penetration among international users. Formerly MarsCode. Privacy concerns over telemetry to ByteDance servers [77][78].
Kiro | AWS | Spec-driven IDE on Code OSS, using Claude Sonnet. Enforces Requirements (EARS syntax) → Design → Task list methodology. Automated "hooks" trigger background tasks on file saves. Autonomous agent (announced re:Invent Dec 2025) works independently for days with persistent memory. Free during preview [79].
Kilo Code | Kilo (GitLab co-founder) | Fork of Cline + Roo Code. 1.5M+ users, 25T+ tokens processed. Orchestrator mode routes subtasks to specialised sub-agents (Architect, Coder, Debugger). 500+ models at exact provider rates, zero markup. VS Code, JetBrains, CLI. $8M seed Dec 2025 [80].
Zed AI | Zed Industries | Rust-built, GPU-rendered editor (120 FPS). From Atom's creator. $32M from Sequoia. Co-developed Agent Client Protocol (ACP) with JetBrains (Jan 2026) as an open standard for editor-agnostic AI agents. Open-source Zeta model for fast completion [81].
Sweep AI | Sweep | JetBrains-focused AI assistant with 40K+ installs, filling a gap since most competitors prioritise VS Code [86].
Blackbox AI | Blackbox | Claims 30M+ developers. 300+ models. "Chairman LLM" dispatches to multiple specialist agents. From ~$3/month [87].
Pieces | Pieces for Developers | "Long-term memory" tool capturing developer activity across IDEs, browsers, and terminals for 9+ months. Works alongside other tools via MCP. Free tier; Pro $18.99/month [89].

Open-source IDE extensions

Product | Detail
Cline | VS Code extension. Autonomous agent, BYOK model. 59K GitHub stars.
Roo Code | Fork of Cline with additional features and community plugins.
Continue.dev | Model-agnostic, privacy-focused IDE extension.
Aider | Terminal pair-programmer. 42K+ GitHub stars. Git-native. Strong on pair-programming workflows.

Productivity Evidence

METR RCT v1 [3], n=16
Individual: -19% (CI: -39% to -2%). 39-point perception gap; experienced OSS devs.
METR RCT v2 [4], n=57
Individual: -4% (CI: -15% to +9%). 800+ tasks. 30-50% refused without AI.
MS/MIT/Accenture RCT [5][88], n=4,867
Individual: +26% completed tasks. Largest multi-company study. Accepted at Management Science 2026.
Anthropic skills RCT [90], n=52
Individual: Faster completion, but -17% on knowledge quizzes. First evidence of skill inhibition.
NAV IT Norway [91], n=26,317 commits
Individual: No significant change. 2 years longitudinal. Selection effect: adopters already more active.
Denmark study [92], n=100,000
Individual: ~3% time saved. Across 11 occupations. Broadest population study.
Stack Overflow 2025 [30]
Individual: 16.3% report significant gains; 41.4% little/no effect. Favourable views dropped from 70% to 60%.
Anthropic 100K convos [141], n=100K convos
Individual: ~80% reduction per task. Software dev = largest category. Task-level, not day-level.
Anthropic 132 engineers [142], n=132
Individual: +67% merged PRs/day. Internal study. AI used in 59% of work.
Faros/DORA telemetry [19], n=10,000+
Individual: +21% tasks. +98% PRs. +91% review time. +47% context switching. +154% PR size.
Bain 2025 [20]
Individual: Coding = 25-35% of idea-to-launch. Speed-up does little for time to market.
DORA 2024 [31], n=39,000+
Largest DevOps study.
DORA 2025 [32]
Individual: Higher individual effectiveness. Year-over-year improvement.
CodeRabbit analysis [104]
AI-generated PRs: ~1.7x more issues flagged.

Cost

$0.65-3.30/day ($20-100/mo). Cursor $20-40/mo + overages. Windsurf $15/mo. Trae from $7.50/mo. Augment $50/mo. Kilo Code: zero markup on tokens.

Something shifts. You stop reviewing every change and start reviewing results. "Add authentication to this endpoint" — and you check if authentication works, not whether the code looks like you would have written it. Trust replaces control.

Era I: Individual AI
L5

AI Co-pilot, Autonomous in IDE

The relationship shifts from "show me every change" to "show me the result." The developer operates at a higher level of abstraction.

The developer stops reviewing every diff. Trust is established through repeated successful interactions. The developer steers at the intent level — "add authentication to this endpoint" — rather than approving individual line changes.
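The shift from diff review to result review can be sketched as follows. The "agent" here is a stub returning finished code; the developer's contribution is the behavioural checks, not a line-by-line read. All names are hypothetical, and the `exec` call assumes a trusted sandbox.

```python
# The L4→L5 inversion: state an intent, then judge the *behaviour* of
# what comes back. No diff is read; the acceptance tests are the review.

def stub_agent(intent: str) -> str:
    # Stands in for an autonomous IDE agent producing finished code.
    return (
        "def handler(request):\n"
        "    if not request.get('token'):\n"
        "        return 401\n"
        "    return 200\n"
    )

def accept_by_result(code: str, checks) -> bool:
    """L5 review: run behavioural checks instead of reading every line."""
    namespace: dict = {}
    exec(code, namespace)  # assumes a trusted sandbox; never on untrusted code
    return all(check(namespace["handler"]) for check in checks)

ok = accept_by_result(stub_agent("add auth to this endpoint"), [
    lambda h: h({"token": "t"}) == 200,   # authenticated request passes
    lambda h: h({}) == 401,               # unauthenticated request rejected
])
```

The checks encode the intent "add authentication to this endpoint" directly — trust is earned when they pass repeatedly, which is exactly the transition this level describes.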

Autonomous IDE agents

Product | Company | Detail
Cursor (Agent / Composer mode) | Anysphere | Feb 2026: up to 8 parallel agents via git worktrees [34].
Windsurf (Cascade mode) | Cognition / Codeium | Flows model retains session context across interactions. Wave 13 update: parallel Cascade + Cascade Hooks [42].
Copilot (Agent mode) | GitHub / Microsoft | Expanding toward autonomous multi-file operation within VS Code.
Google Antigravity | Google | Google's AI IDE, launched Nov 2025 with Gemini 3. Built after the $2.4B acqui-hire of the Windsurf team (July 2025). Manager View orchestrates multiple AI agents simultaneously; an "Artifacts" system generates structured deliverables for verification. Currently free preview; paid tiers expected ~$20-60/month [82].

Productivity Evidence

Stack Overflow 2025 [30]
Individual: 16.3% report significant gains; 41.4% little/no effect. Favourable views dropped from 70% to 60%.
Anthropic 100K convos [141], n=100K convos
Individual: ~80% reduction per task. Software dev = largest category. Task-level, not day-level.
Anthropic 132 engineers [142], n=132
Individual: +67% merged PRs/day. Internal study. AI used in 59% of work.
Faros/DORA telemetry [19], n=10,000+
Individual: +21% tasks. +98% PRs. +91% review time. +47% context switching. +154% PR size.
Bain 2025 [20]
Individual: Coding = 25-35% of idea-to-launch. Speed-up does little for time to market.
DORA 2024 [31], n=39,000+
Largest DevOps study.
DORA 2025 [32]
Individual: Higher individual effectiveness. Year-over-year improvement.
CodeRabbit analysis [104]
AI-generated PRs: ~1.7x more issues flagged.

Cost

$0.65-3.30/day ($20-100/mo). Cursor $20-40/mo + overages. Windsurf $15/mo. Trae from $7.50/mo. Augment $50/mo. Kilo Code: zero markup on tokens.

The inversion is complete. The agent writes. You direct. The IDE becomes a reading surface, not a writing surface. For the first time, the developer's primary skill is judgement, not implementation.

This is the most crowded and fastest-moving space in all of AI. Claude Code, Devin, Codex, Replit — billions of dollars chasing the same bet: that the developer's value is in knowing what to build, not in building it.

Era I: Individual AI
L6

Agent-First Development

The agent is the primary author. The developer specifies intent and evaluates output. The editor is for reading, not writing.

The developer's primary interface is no longer the editor. It is a terminal session or conversation with an autonomous agent. The developer describes architectural intent — "build an OAuth flow with PKCE, store tokens in Redis, add refresh logic" — and the agent plans, implements, tests, debugs, and delivers working code. The IDE becomes a review surface for inspecting output, not a writing surface.
This is the most crowded and fastest-moving level. It divides into four sub-categories: developer-facing autonomous agents, foundation models with native agentic capabilities, full-stack app generators, and emerging infrastructure players.

Full-stack app generators serve a different audience than the developer tools. Their primary users are product managers, designers, and entrepreneurs — people who want working software without writing code. They represent AI-first alternatives to the no-code/low-code movement.

The foundation-model sub-category marks a structural shift: models trained from the ground up for agentic tool use, multi-step reasoning within tool environments, or model-native multi-agent orchestration. The agentic behaviour lives in the model weights rather than being bolted on through external harnesses, which moves coordination from the application layer to the model layer.

Developer tools (terminal and autonomous agents)

Product | Company | Detail
Claude Code | Anthropic | Terminal-native agent. 1M-token context window. Agentic tool use. Estimated ~$2.5B ARR as of early 2026 [6].
Devin | Cognition | Fully sandboxed environment: shell, editor, browser. Asynchronous: takes tickets, works independently, submits PRs. $20/month + ACU (agent compute units). Goldman Sachs pilot alongside 12,000 developers. $4B valuation [7][26].
Codex | OpenAI | Cloud-based asynchronous coding agent. 1.6M weekly active users. Runs in sandboxed environments. Works from natural language task descriptions [6].
Google Jules | Google | Asynchronous agent. Clones repos into Cloud VMs, works independently, submits PRs via GitHub. 140,000+ code improvements shared during beta. CLI companion (Oct 2025), public API, proactive "Suggested Tasks" that scan repos for improvements. Free (15 tasks/day) to $124.99/month (300 tasks/day) [93].
Amp | Sourcegraph | Replaced Cody for individual users (Nov 2025). No artificial token constraints: models run to completion. Team-first: shared threads, leaderboards, reusable workflows. Leverages Sourcegraph's deep code graph (code intelligence over billions of lines). Credit-based pricing, $10 free at signup [94].
Grok Build / grok-code-fast-1 | xAI | Local-first CLI agent. Up to 8 parallel agents with Arena Mode (algorithmic ranking selects the best output across agents). 314B parameter MoE model. 256K context window. $0.20/1M input tokens. Waitlist as of March 2026 [8][9].
Grok CLI | xAI / open source | Open-source terminal agent. MCP support. Morph Fast Apply engine for high-speed multi-file edits [10].
Warp | Warp | Rust-built terminal evolved into an "Agentic Development Environment." Warp 2.0 (June 2025) added AI-native workflows. Runs hundreds of agents in parallel. 3.2 billion lines of code edited in 2025. 120,000+ codebases indexed. TIME Best Invention 2025. Free tier; Build $20/month; Business $50/month [95].
Goose | Block (open source) | 29,400+ GitHub stars. Open-source AI agent created at Block (Square). ~5,000 Block employees use it weekly. Contributed to the Linux Foundation's Agentic AI Foundation alongside MCP and AGENTS.md. Stripe forked Goose for its Minions agent harness [96][97].
Cosine / Genie | Cosine (YC) | Proprietary Genie 2 model. 50+ languages including legacy (COBOL, Fortran, Ada). Enterprise compliance: FINRA, HIPAA, ITAR certifiable. Multi-agent mode decomposes backlog items across autonomous specialist agents. 72% on SWE-Lancer benchmark [98].
OpenCode | Open source | 122K GitHub stars. 75+ LLM providers. LSP integration for language-aware editing. Privacy-first: runs locally [11].
Pi (by badlogicgames) | Open source | Minimal terminal coding harness. 4 core tools. Powers the OpenClaw ecosystem. Created by libGDX founder Mario Zechner [12][45].
OpenHands | Open source | 69K stars. Formerly OpenDevin. CLI and web entrypoints. Strong on autonomous task completion [11].
Gemini CLI | Google (open source) | 98K GitHub stars. Terminal agent for repository work and research. Direct Gemini model access [11].
Codex CLI | OpenAI (open source) | 65K GitHub stars. Interactive TUI with tool execution. Local-first [11].

Foundation models with native agentic capabilities

Model | Company | Detail
Kimi K2.5 | Moonshot AI | 1T parameter MoE (32B activated). Open-source. Agent Swarm: orchestrates up to 100 sub-agents with 1,500+ tool calls per session. Trained via PARL (Parallel Agentic Reinforcement Learning) to prevent "serial collapse", where sequential agents degrade quality. 4.5x speedup over serial execution. $0.60/1M input tokens, $0.10/1M cached. Released January 2026 [48][49][50].
DeepSeek V3.2 | DeepSeek | Open-source. Introduces "thinking in tool-use": models reason while executing tool-calling flows rather than separating reasoning and execution. Agentic synthesis pipeline: 1,800+ curated environments, 85,000+ synthetic instructions. 128K context. RL budget exceeds 10% of pre-training cost. Free web access [51][52][53].
MiniMax M2.5 | MiniMax | 230B/10B MoE architecture. Open-source. Parallel tool calling. Trained via Forge (a scaffold-agnostic agent RL framework that works across any orchestration layer). $1/hour at 100 tokens/second. Self-reported production data: 80% of committed code at MiniMax is M2.5-generated; 30% of company-wide tasks handled autonomously. Not independently verified [54][55][56].
MiniMax M1 | MiniMax | Open-source (Apache 2.0). 1M token context. Hybrid attention mechanism. Full RL training cost: $534,700 on 512 H800 GPUs in 3 weeks [57][58].
Malibu, Point | Poolside | Founded by Jason Warner (ex-GitHub CTO) and Eiso Kant. ~$626M through Series B; up to $1B from Nvidia (Oct 2025) at $12B valuation. Models trained with RLCEF (RL from Code Execution Feedback): executing code and learning from results across 350M+ repos (50T+ tokens). Deploys entirely within customer infrastructure. Project Horizon: planned 2GW AI campus [99][100].
LTM-2-Mini | Magic | $515M raised across multiple rounds. Building 100M-token context models with ~23 engineers and 8,000 H100 GPUs. Custom CUDA training stack (no PyTorch autograd). LTM-2-Mini claims 1,000x efficiency over traditional attention mechanisms. Pre-product as of March 2026: no shipped product, all research [101][102].

Full-stack app generators (prompt to deployed application)

Product | Company | Detail
Replit (Agent 4) | Replit | Cloud-native platform. Agent writes, tests, deploys, and iterates in a managed environment. Built-in auth, database (Postgres), hosting, monitoring. Agent 4 (March 2026): canvas interface, kanban project management, parallel agents. $400M raise at $9B valuation. Targeting $1B ARR. Fastest growth from non-technical users [13][14].
Lovable | Lovable | Natural language to React/TypeScript + Supabase backend. Clean, exportable code — not locked in. $25/month pro [15].
Bolt.new | StackBlitz | In-browser full-stack IDE. Diffs-based updates allow iterating on generated code. ~$20/month [15].
Base44 | Base44 (acquired by Wix) | Prompt to full-stack app with managed backend. Zero-setup database, auth, hosting. Closed ecosystem — harder to export [16].
v0 | Vercel | UI generation from prompts. React/Next.js focused. Vercel deployment integration. Strong on frontend components.

Productivity Evidence

Faros/DORA telemetry [19] n=10,000+
Individual: +21% tasks. +98% PRs. +91% review time. +47% context switching. +154% PR size.
Bain 2025 [20]
Individual: Coding = 25-35% of idea-to-launch. Speed-up does little for time to market.
DORA 2024 [31], n=39,000+
Largest DevOps study.
DORA 2025 [32]
Individual: Higher individual effectiveness. Year-over-year improvement.
CodeRabbit analysis [104]
AI-generated PRs: ~1.7x more issues flagged.
Goldman Sachs/Devin [26], n=12K devs
Individual: Targeting 20% efficiency gains. Weeks of config and dedicated staff required.
Nubank/Devin [33]
Individual: 20x cost savings (migration), 12x (ETL). Not org-wide.

Cost

Developer tools: $3-10/day ($100-300/mo). Claude Code ~$100-200/mo. Devin $20/mo + ACU. Grok Build $0.20/1M input. Jules free-$124.99/mo.
App generators: $0.65-5/day ($20-150/mo). Lovable $25/mo. Bolt ~$20/mo. Base44 ~$19/mo. Replit effort-based.
Foundation models: variable. Kimi K2.5 $0.60/1M input, $0.10/1M cached. DeepSeek V3.2 free web. MiniMax M2.5 $1/hr at 100 tok/s.

Some developers go further. They build personal infrastructure — Slack bots, Telegram triggers, custom scripts wrapping foundation models. They've automated their own workflow. But it's personal. The developer next to them has a completely different setup. There is no shared context, no organisational standard.

Era I: Individual AI
L7

Custom Trigger Interfaces

The developer has started building personal infrastructure around AI agents. But there is no shared context, no organisational standards, no coordination with other developers' agents.

The developer builds lightweight automation around AI: a Slack bot, a Telegram bot, or a custom script that makes API calls to a foundation model, passing the full codebase as context. No persistent memory. No institutional knowledge. Every interaction starts from a fresh codebase rescan.
Coordinated individual agent swarms: the Agentic Coding Flywheel methodology (Jeffrey Emanuel) is the most documented approach to running multiple agents as a single developer. The system uses three primitives:
1. Beads: self-contained work units with dependency graphs, each including context, acceptance criteria, and test requirements.
2. Agent Mail: an advisory file-reservation system with TTL expiry that prevents two agents from editing the same file simultaneously.
3. Graph-theory routing: PageRank and betweenness centrality used to sequence beads optimally across agents.
Documented example: a 5,500-line architectural plan decomposed into 347 beads produced 11,000 lines of working code in ~5 hours with 25 agents running in parallel. Full-scale infrastructure: 22 Claude Max subscriptions, 22 GPT Pro subscriptions, and 7 Gemini Ultra accounts running simultaneously [18].

The open-source personal agent ecosystem (OpenClaw)

Product | Stars | Detail
OpenClaw | 310K | Personal AI assistant. Multi-channel: WhatsApp, Slack, Discord, Telegram, iMessage, Signal. Uses Pi as SDK. Created by Peter Steinberger. 60K+ stars in 72 hours at launch. OpenAI acqui-hired Steinberger Dec 2025; the project is now managed by a vendor-neutral foundation [17][103].
Nanobot | 33K | Ultra-lightweight Python rewrite of OpenClaw. ~4,000 lines of code [11].
ZeroClaw | 27K | Autonomous agent runtime in Rust. <5MB RAM footprint [11].
PicoClaw | 25K | Personal assistant in Go. Runs on $10 hardware, <10MB RAM [11].
NanoClaw | 22K | Security-first variant. Apple container and Docker sandboxed execution [11].
IronClaw | 9.9K | Rust rewrite by NEAR AI. WASM sandbox with capability-based permissions [11].
NullClaw | 6K | Smallest OpenClaw-compatible implementation. Zig. 678KB binary, ~1MB RAM, <2ms startup [11].

Productivity Evidence

Faros/DORA telemetry [19] n=10,000+
Individual: +21% tasks. +98% PRs. +91% review time. +47% context switching. +154% PR size.
Bain 2025 [20]
Individual: Coding = 25-35% of idea-to-launch. Speed-up does little for time to market.
DORA 2024 [31], n=39,000+
Largest DevOps study.
DORA 2025 [32]
Individual: Higher individual effectiveness. Year-over-year improvement.
CodeRabbit analysis [104]
AI-generated PRs: ~1.7x more issues flagged.

Cost

$5-30/day ($150-900/mo). Flywheel: ~20M input + ~3.5M output tokens per intensive session. Full scale: 22 Claude Max + 22 GPT Pro + 7 Gemini Ultra accounts.

And then the trap.

Multiple agents running in parallel. Cursor in five windows. Devin on three branches. Output metrics explode: +98% PRs merged. But look at what happens to the organisation: +91% review time. +47% context switching. +154% average PR size. Delivery metrics? Flat. The agents produce more code. The humans can't review it fast enough. The pipeline clogs. This is where most advanced teams are stuck right now.

Era I: Individual AI
L8

Agent Multiplexing (Uncoordinated)

Individual output metrics explode. Organisational coherence collapses. This is the trap level.

Multiple agents running in parallel without the coordination infrastructure described at L7. The developer spawns several agent sessions — Cursor in multiple windows, Devin on separate branches, parallel Claude Code sessions — but there is no shared understanding of what each agent is doing.
The L8 → L9 transition is the critical organisational challenge. Most teams that adopt multiple AI agents land here. Escaping requires institutional investment in shared infrastructure (L9), not more or faster agents.
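The uncoordinated failure mode is mechanical: two agents snapshot the same state, edit independently, and the last write wins. A toy simulation, not any tool's behaviour:

```python
# L8 in miniature: parallel agent sessions with no reservation layer.
# Each agent reads a snapshot, edits it, and writes the whole file back;
# the second write silently discards the first agent's work.

store = {"app.py": "import os\n"}
base = store["app.py"]                    # both agents branch from here

edit_a = base + "def login(): ...\n"      # agent A's change
edit_b = base + "def logout(): ...\n"     # agent B's change, unaware of A

store["app.py"] = edit_a                  # A lands first
store["app.py"] = edit_b                  # B overwrites: A's function vanishes

lost_update = "def login()" not in store["app.py"]
```

Git turns this silent overwrite into a merge conflict rather than preventing it — which is why the fix at L7/L9 is a coordination layer (reservations, shared context), not version control alone.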

Productivity Evidence

Faros/DORA telemetry [19] n=10,000+
Individual: +21% tasks. +98% PRs. +91% review time. +47% context switching. +154% PR size.
Bain 2025 [20]
Individual: Coding = 25-35% of idea-to-launch. Speed-up does little for time to market.
DORA 2024 [31], n=39,000+
Largest DevOps study.
DORA 2025 [32]
Individual: Higher individual effectiveness. Year-over-year improvement.
CodeRabbit analysis [104]
AI-generated PRs: ~1.7x more issues flagged.

Cost

$15-100+/day ($450-3K+/mo). Worst cost-to-org-output ratio: costs multiply while delivery stays flat.
Era II: Transitional

Escaping the trap requires a fundamentally different investment. Not better agents — shared infrastructure. The organisation builds what individuals cannot: RAG over internal docs, custom MCP servers, org-specific prompt libraries, coding standards enforced by AI. The tooling becomes institutional. The workflow is still individual. This is the bridge between two worlds.

Era II: Transitional
L9

Organisation-Tuned Tooling

The tooling is institutional. The workflow is still individual. This is the last level where a developer personally drives every work item.

The organisation invests in shared AI infrastructure: RAG over internal documentation, custom MCP servers, org-specific prompt libraries, coding standards enforced by AI, shared context databases, curated tool registries, and CI/CD integration.
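As a sketch of the "RAG over internal documentation" piece, the toy below scores org documents by word overlap and prepends the best match to a prompt. Production systems use embeddings and vector stores; the document names and the `retrieve` helper are invented for illustration.

```python
# Deliberately tiny stand-in for L9 retrieval: rank shared org docs by
# keyword overlap with the query, then inject the winner as context.
# The institutional point — shared context, not personal setup — is the
# same whether retrieval is word overlap or embeddings.

def retrieve(query: str, docs: dict[str, str], k: int = 1) -> list[str]:
    """Rank org documents by word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda name: len(q & set(docs[name].lower().split())),
        reverse=True,
    )
    return scored[:k]

ORG_DOCS = {
    "auth-standard.md": "all services must validate oauth tokens in middleware",
    "deploy-runbook.md": "deploys go through the staging gate before production",
}

best = retrieve("how should a service validate oauth tokens", ORG_DOCS)
prompt = f"Context: {ORG_DOCS[best[0]]}\n\nTask: add token validation."
```

Because `ORG_DOCS` is institutional rather than personal, every developer's agent answers from the same standards — the defining difference from L7's per-developer setups.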
Enterprise security and runtime (NemoClaw): NVIDIA NemoClaw, announced at GTC on March 16, 2026, deserves separate analysis because of its scope and positioning. NemoClaw is an enterprise security and runtime layer for OpenClaw, with two components: Nemotron models (local inference for privacy-sensitive workloads) and OpenShell (a sandboxed runtime with YAML-configurable, hot-swappable, policy-based guardrails). It installs with a single command on any OpenClaw instance. Partners: Adobe, Salesforce, SAP, CrowdStrike, ServiceNow, Atlassian, Palantir, and Dell (first to ship GB300 hardware with NemoClaw preinstalled). Nemotron Coalition: Mistral, Perplexity, Cursor, LangChain.
Framework placement: L9, not L11–L12. NemoClaw is OpenClaw with enterprise guardrails — personal agent infrastructure with enterprise-grade security. It does not provide institutional intake (no shared ticket queue), multi-agent coordination (agents do not know about each other), or system-initiated triggers (no cron, no monitoring-driven spawning). Futurum Group's assessment: it "addresses the deployment end of the trust chain but is not a complete governance solution" [106][107]. NVIDIA is positioning itself as infrastructure beneath agents, not as the factory itself [103][106][107][150].

AI-native development platforms

Product | Company | Detail
Droids (enterprise) | Factory AI | Unifies context from GitHub, Notion, Linear, Slack, Sentry. Enterprise memory across tools. Model-agnostic, IDE-agnostic. $50M Series B (NEA, Sequoia, Nvidia, J.P. Morgan). EY deploying to 5,000+ engineers [21][22].
Codegen (in ClickUp) | ClickUp / Codegen | Agents with full business context: task descriptions, specs, acceptance criteria from the project management layer. Acquired by ClickUp Dec 2025 [23].
Open SWE | LangChain | Open-source framework capturing architectural patterns from Stripe, Ramp, and Coinbase agent deployments. Built on Deep Agents and LangGraph. Sandboxes (Modal, Daytona, Runloop, LangSmith), curated toolsets, AGENTS.md context engineering, subagent orchestration. MIT-licensed [63].
Fabro | Qlty Software (Bryan Helmkamp) | Open-source "dark software factory." Deterministic workflow graphs in Graphviz DOT: branching, loops, parallelism, human approval gates. CSS-like model stylesheets route each node to different models/providers with fallback chains. Cloud sandboxes (Daytona, Sprites, exe.dev) with SSH/VNC. Git checkpointing after every stage. Automatic retrospectives with cost/duration/narrative. REST API + SSE for 24/7 server mode. Single Rust binary, zero dependencies. MIT-licensed. 37 stars, v0.174.0. Generalises Stripe's Blueprints pattern as a reusable product [155][156].

AI code review and quality

Product | Company | Detail
Qodo Merge + IDE | Qodo | PR review with org-specific coding standards. Integrates Jira, Linear, Monday context. RAG across repos [24].
CodeRabbit | CodeRabbit | AI PR reviewer that learns from team feedback over time. 2M+ reviews/month. $24-30/user/month [25][104].
Graphite | Graphite | AI code review with <3% unhelpful comment rate. Developers change code 55% of the time when flagged (vs 49% for human reviewers). Median PR merge time: 24 hours → 90 minutes. $40/user/month [108].
BugBot | Cursor | Launched July 2025. Reviews 2M+ PRs monthly with 8 parallel review passes per PR. Free for Cursor users [109].
Greptile | Greptile | API for building custom AI tools with full codebase understanding. Powers internal tools.

AI security and testing

Product | Company | Detail
Snyk AI | Snyk | Suite: Snyk Studio (IDE plugin), Agent Fix (80% auto-remediation of known vulnerabilities), Evo (agentic security orchestration). 2026 report: 48% of AI-generated code contains vulnerabilities. Gartner Magic Quadrant Leader for Application Security Testing [110].
Diffblue Cover | Diffblue | Autonomous JUnit test generation using reinforcement learning (not LLMs). Up to 250x faster than manual test writing. Deterministic, reproducible output [116].
Trunk.io | Trunk | CI reliability: merge queues batching up to 100 PRs, AI-powered flaky-test management. Critical infrastructure when AI-generated code volume overwhelms existing CI pipelines [115].

AI DevOps and modernisation

Product | Company | Detail
Harness AI | Harness | DevOps agent addressing the "AI velocity paradox": code generates faster but CI/CD hasn't kept up. Uses Claude Opus 4.5. Customers: United Airlines, Citibank. Claims 75% faster releases [113].
Moderne | Moderne | Enterprise code modernisation built on OpenRewrite (5,000+ deterministic refactoring recipes). Expanded from Java to JavaScript, TypeScript, and Python (late 2025). Positioned as deterministic infrastructure that AI agents depend on for reliable refactoring [114].
Rovo Dev | Atlassian | GA late 2025. Takes Jira issues, generates technical plans, creates code, submits PRs. 75% time savings on simpler tasks. Statsig: 45% reduction in PR cycle times. $20/user/month [111].
GitLab Duo | GitLab | Agent Platform GA January 2026. Specialised agents: Planner, Security Analyst, Data Analyst. 64.5% on SWE-bench Verified. Gartner Magic Quadrant Leader [112].

Cost

$5-50/day ($150-1,500/mo). Factory AI: free BYOK, $20/mo Pro. Qodo/CodeRabbit: $24-30/user/mo. Rovo Dev: $20/user/mo. Graphite: $40/user/mo.
Era III: Institutional AI

Now the line is crossed. A human sends a Slack message, creates a ticket, assigns a task. An agent picks it up — and everything after that is agent-driven. It plans, implements, tests, creates the PR, routes it for review. Not in a sandbox. Inside the organisation's actual infrastructure: the repos, the CI/CD, the monitoring, the feature flags, the compliance systems.

Stripe ships 1,300 PRs per week this way. Zero human-written code. Ramp's Inspect produces 50%+ of all merged pull requests. Uber's system handles 70% of committed code. All three built custom. All three required years of prior engineering investment. The prerequisite for L10 is not better AI. It is mature engineering practice.

Era III: Institutional AI
L10

Human-Triggered Institutional Agents

The trigger is human. Everything after the trigger is agent-driven within institutional bounds. The developer becomes a reviewer and strategic director rather than an implementer.

A human sends a message on Slack, creates a ticket, or assigns a task. An agent picks it up, plans, implements, tests, creates the merge request, and routes it for review — operating within institutional infrastructure: the organisation's repos, CI/CD, monitoring, feature flags, and compliance systems.
Convergent architectural patterns at L10: LangChain's Open SWE project analysed the agent architectures of Stripe, Ramp, and Coinbase and identified six patterns that all three independently converged on [63]:

  1. Isolated execution environments — dedicated cloud sandboxes with full permissions inside strict boundaries.
  2. Curated toolsets — maintained and pruned rather than accumulated; Stripe reports ~500 tools.
  3. Slack-first invocation — meeting developers in existing messaging workflows.
  4. Rich context at startup — full issue/thread/PR/codebase context loaded before the agent begins work.
  5. Subagent orchestration — complex tasks decomposed to specialised child agents.
  6. Human review as final gate — all three maintain mandatory human review on every merge.

All three built custom rather than adopting commercial tools. The reasons are consistent: codebase scale (hundreds of millions of lines), proprietary libraries and frameworks, compliance requirements, and the need for deep integration with internal systems [63][64][65][66].
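The six patterns can be sketched as a single pipeline. This is a hypothetical illustration, not any company's code: every name here (`TaskContext`, `handle_slack_trigger`, the tool names) is invented for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class TaskContext:
    issue: str                                  # pattern 4: rich context loaded up front
    thread: list[str] = field(default_factory=list)
    codebase_summary: str = ""

CURATED_TOOLS = {"run_tests", "edit_file", "search_code"}  # pattern 2: pruned, not accumulated

def run_subagent(role: str, ctx: TaskContext) -> str:
    # pattern 5: complex work decomposed to specialised child agents (stubbed here)
    return f"{role} completed for: {ctx.issue}"

def handle_slack_trigger(message: str) -> dict:
    """Pattern 3: invocation starts from a human message in chat."""
    ctx = TaskContext(issue=message)
    sandbox = {"isolated": True, "full_permissions_inside": True}  # pattern 1
    results = [run_subagent(role, ctx) for role in ("planner", "implementer", "tester")]
    return {
        "sandbox": sandbox,
        "tools": sorted(CURATED_TOOLS),
        "results": results,
        "merged": False,                        # pattern 6: the agent never merges itself
        "awaiting_human_review": True,
    }

pr = handle_slack_trigger("Fix flaky payment webhook test")
print(pr["awaiting_human_review"])              # human review remains the final gate
```

The point of the sketch is the shape, not the internals: human trigger in, structured context assembled, subagents fan out, and the output is a review request, never a merge.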

Other L10 entities

Other L10 entities
EntitySystemDetail
Takes tickets to PRs via Slack. Goldman Sachs pilot alongside 12,000 developers. Weeks of configuration required [7][26].
Auto-triggers from issue assignment. Ticket-to-code traceability through entire pipeline [21].
"Select everything in backlog, assign to Codegen" — bulk assignment to agents [23].

Productivity Evidence

Stripe Minions [64] [67]
Individual: 1,300+ PRs/week, zero human code. 2-round CI limit; human review mandatory.
Ramp Inspect [68]
Individual: 50%+ of merged PRs; 80% of Inspect self-written.
Coinbase [66]
Individual: 25+ hours/week saved per automation; new agent builds cut from 12+ weeks to <1 week.
Uber [117] [118]
Individual: ~70% machine-generated code; AI agent grew from <1% to ~8% of changes in months.
MiniMax (self-reported) [54]
Individual: 80% of committed code M2.5-generated. Unverified.
Case Study

Stripe Minions

  • Produces 1,300+ pull requests per week with zero human-written code [64][67].
  • Agent harness: Modified fork of Block's Goose, optimised for unattended one-shot operation with no human in the loop during execution [64][96].
  • Blueprints: Structured workflows alternating deterministic nodes (linting, formatting, test running) with agent loops (code generation, debugging). This ensures critical operations are reliable while flexible work is agent-driven [65].
  • Toolset: ~500 curated MCP tools, maintained rather than accumulated. Tools are versioned, documented, and gated by the same review process as production code [63].
  • Execution: Sandboxed devboxes identical to human development environments. Full access to Stripe's hundreds of millions of lines of Ruby code [64].
  • CI constraint: Two-round CI limit — if tests don't pass after two attempts, the agent escalates to a human rather than spiralling [64].
  • Consistency: Rule files synced across Minions, Cursor, and Claude Code to ensure agents and human developers follow the same standards [64].
  • System-initiated triggers: Automated flaky-test detectors spawn Minion runs without human intervention — an L11 element within a predominantly L10 system [64].
  • Human review: Mandatory on every PR. No exceptions [67].
  • Stripe's Minions could operate because Stripe had already invested in the prerequisites: a decade of standardised devboxes, a CI suite of 3 million tests, well-documented APIs, and a culture of code review [64].
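Two of the mechanisms above, the blueprint pattern and the two-round CI limit, can be combined into one small control loop. This is an illustrative sketch assuming stubbed stand-ins for the real steps; the function names are hypothetical, not Stripe's.

```python
MAX_CI_ROUNDS = 2  # the documented constraint: escalate rather than spiral [64]

def deterministic_node(name: str, code: str) -> str:
    # Critical operations (lint, format, test setup) run as reliable non-AI steps;
    # stubbed here as a pass-through transform.
    return code

def agent_loop(task: str, code: str) -> str:
    # Stand-in for an LLM-driven generate/debug step.
    return code + f"\n# agent attempted: {task}"

def ci_passes(code: str, round_no: int) -> bool:
    # Stubbed CI: pretend the suite passes on the second attempt.
    return round_no == 2

def run_blueprint(task: str) -> dict:
    code = agent_loop(task, "")                          # flexible work: agent-driven
    for round_no in range(1, MAX_CI_ROUNDS + 1):
        code = deterministic_node("lint+format", code)   # critical ops: deterministic
        if ci_passes(code, round_no):
            return {"status": "pr_created", "rounds": round_no}
        code = agent_loop("fix failing tests", code)
    return {"status": "escalated_to_human", "rounds": MAX_CI_ROUNDS}

result = run_blueprint("add retry to webhook handler")
print(result)
```

The alternation is the design choice worth noting: the agent is never trusted with the steps that must be reproducible, and the loop has a hard exit to a human rather than unbounded retries.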
Case Study

Ramp Inspect

  • Ramp's internal coding agent produces 50%+ of all merged pull requests [68].
  • Execution: OpenCode running on Modal sandboxes. Each sandbox includes a full development environment: Postgres, Redis, Temporal, Sentry, Datadog, LaunchDarkly [68][69].
  • Verification loop: Beyond tests — the agent monitors runtime behaviour, checks feature flags, and performs visual verification via VNC + headless Chromium [68].
  • Interfaces: Slack bot (primary), web VS Code (for inspection), Chrome extension enabling non-engineers to trigger coding tasks [68].
  • Multiplayer sessions: Multiple people can observe and redirect an agent session in real time [68].
  • Self-bootstrapping: 80% of Inspect's own codebase was written by Inspect [68].
  • Adoption: Organic, without mandates. Engineers adopted because it was faster than writing code themselves [68].
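The multi-signal verification loop described above can be sketched as a simple aggregation of independent checks. Every check below is a hypothetical stub invented for this illustration (the real system consults Sentry, Datadog, LaunchDarkly, and a headless browser [68]); this is not Ramp's code.

```python
def tests_pass() -> bool:
    return True                 # stub: unit/integration suite

def runtime_healthy() -> bool:
    return True                 # stub: e.g. no new error-tracker or metrics alerts

def feature_flags_consistent() -> bool:
    return True                 # stub: e.g. flag state matches the intended rollout

def visual_diff_ok() -> bool:
    return True                 # stub: e.g. headless-browser screenshot comparison

def verify_change() -> dict:
    checks = {
        "tests": tests_pass(),
        "runtime": runtime_healthy(),
        "feature_flags": feature_flags_consistent(),
        "visual": visual_diff_ok(),
    }
    # A change is only surfaced for review if every signal agrees.
    return {"checks": checks, "ready_for_review": all(checks.values())}

print(verify_change()["ready_for_review"])
```

The design point is that "tests pass" is one vote among several, not the verdict.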
Case Study

Coinbase Agent Ecosystem

  • Coinbase operates multiple AI agent systems across different functions [63][66].
  • Cloudbot: Coding agent integrated with Slack, Linear, and GitHub. Takes tickets, produces PRs [63].
  • RAPID-D: Multi-agent decision support system with four specialised roles — Analyst (data), Seeker (research), Devil's Advocate (challenges), Synthesizer (recommendations). Used for strategic and operational decisions [70].
  • qa-ai-agent: Visual browser testing agent that replaces week-long manual QA testing cycles with minutes-long automated runs [71].
  • Enterprise automation: Production agents built on LangGraph for internal workflows [66].
  • Paved roads: Standardised agent development patterns that cut new agent build time from 12+ weeks to less than 1 week [66].
  • x402 protocol: HTTP-native stablecoin micropayment protocol designed for AI-agent-to-AI-agent transactions [72].
Case Study

Uber AI Coding System

  • Uber has deployed one of the most comprehensive enterprise AI coding systems documented publicly [117][118].
  • Adoption: ~95% of engineers use AI monthly. ~70% of committed code is machine-generated [117].
  • Multi-agent architecture: Validator (IDE best-practice enforcement), AutoCover (domain-specific test generation), uReview (AI-augmented review of 90%+ of ~65,000 weekly code diffs) [118].
  • Scaling: AI coding agent scaled from <1% to ~8% of all code changes within months [117].
  • Efficiency: 21,000 developer hours saved through automated test generation alone [118].
  • Key finding: Specialised agents consistently outperform general-purpose agents across all tasks measured [117].

Cost

$30-1,000+ ($900-30K+/mo) StrongDM: $1,000+/day per engineer. Scale depends on agent volume and task complexity.

The trigger shifts from human to system. Cron jobs run code quality sweeps. Monitoring detects anomalies and spawns agents to investigate. Intake is structured and programmatic. Agents coordinate with each other. The factory is a multi-agent operating system with human governance.

Almost no one is here. StrongDM operates with 3 people, no human code, no human review. Their agents built their own test infrastructure. The cost: $1,000+ per day per engineer in API tokens. Stanford Law asks: "Built by agents, tested by agents, trusted by whom?"

Era III: Institutional AI
L11

Agent Factory

Work initiation shifts from human to system. The factory is a coordinated multi-agent operating system with human governance.

Agents operate on system-level triggers. Cron jobs run code quality sweeps. Monitoring systems detect anomalies and spawn agents to investigate and fix. Intake is structured and programmatic. Agents coordinate with each other. Deployment follows staged rollout with automated rollback.
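System-initiated intake can be pictured as a priority queue that both scheduled sweeps and monitoring alerts feed into, with no human in the submission path. A minimal sketch, with all task names and thresholds invented for illustration:

```python
import heapq

counter = 0  # tie-breaker so equal-priority tasks stay in arrival order

def enqueue(queue: list, priority: int, task: str) -> None:
    global counter
    heapq.heappush(queue, (priority, counter, task))  # lower number = more urgent
    counter += 1

def cron_quality_sweep(queue: list) -> None:
    # Scheduled, programmatic intake: routine code-quality work.
    enqueue(queue, 5, "refactor: modules exceeding complexity budget")

def monitoring_anomaly(queue: list, metric: str, value: float, threshold: float) -> None:
    # Monitoring detects an anomaly and spawns an investigation task.
    if value > threshold:
        enqueue(queue, 1, f"investigate: {metric} at {value} (threshold {threshold})")

queue: list = []
cron_quality_sweep(queue)
monitoring_anomaly(queue, "p99_latency_ms", 870.0, 500.0)

priority, _, task = heapq.heappop(queue)
print(task)  # the anomaly investigation outranks the routine sweep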

Other L11 entities

Other L11 entities
EntityApproachDetail
"Script and parallelise Droids at massive scale for CI/CD, migrations, maintenance. Self-healing builds" [21].
Automated flaky-test detectors trigger Minion runs without human intervention — L11 element within a predominantly L10 system [64].
Claims 30% of company-wide tasks handled autonomously by M2.5, 80% of committed code AI-generated. Not independently verified [54].

Productivity Evidence

MiniMax (self-reported) [54]
Individual: 80% of committed code M2.5-generated. Unverified.
StrongDM [27] n=3
Individual: No comparison possible — humans neither write nor review code. $1,000+/day in API tokens per engineer.
Case Study

StrongDM Software Factory

  • The most radical documented implementation. A 3-person team founded July 2025, operating under a charter: "Code must not be written by humans. Code must not be reviewed by humans" [27][28].
  • Validation: Scenario-based testing with holdout sets (tests not visible to agents during generation). The factory does not test its own code — separate validation agents test with scenarios the writing agents never saw [28].
  • Digital Twin Universe: Agent-built clones of Okta, Slack, Google Workspace used for integration testing. The agents built their own test infrastructure [27].
  • Attractor repo: 6,000–7,000 lines of specification with zero code. Agents generate everything from spec [28].
  • Cost: $1,000+/day in API tokens per engineer [27].
  • Open source: Attractor repository and cxdb (contextual database) open-sourced [28].
  • Criticism: Stanford Law CodeX raised accountability questions: "Built by Agents, Tested by Agents, Trusted by Whom?" [29].
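The holdout discipline described above has a simple shape: split the scenario set before any code is written, and let the writer agents see only one side of the split. A toy illustration, with scenario names and the "agents" invented for this sketch:

```python
import random

ALL_SCENARIOS = [f"scenario_{i}" for i in range(10)]

def split_holdout(scenarios, holdout_frac: float = 0.3, seed: int = 7):
    """Partition scenarios into (visible to writer agents, holdout for validators)."""
    rng = random.Random(seed)
    shuffled = scenarios[:]
    rng.shuffle(shuffled)
    k = int(len(shuffled) * holdout_frac)
    return shuffled[k:], shuffled[:k]

def writer_agent(visible):
    # The writing agent only ever sees the visible set.
    return set(visible)

visible, holdout = split_holdout(ALL_SCENARIOS)
seen_by_writer = writer_agent(visible)

# Validation agents test exclusively against scenarios the writer never saw.
leaked = seen_by_writer.intersection(holdout)
print(len(holdout), len(leaked))
```

The invariant being enforced is `leaked == set()`: if a validation scenario was ever visible during generation, it no longer measures anything.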
Case Study

Google DeepMind AlphaEvolve

  • An evolutionary coding agent (May 2025) using Gemini Flash + Pro with automated evaluation [119][120].
  • Discovered a new 48-multiplication algorithm for 4×4 complex matrices — the first improvement over Strassen's 1969 algorithm in 56 years [119].
  • Recovered 0.7% of Google's global compute resources through optimised data centre scheduling [120].
  • Achieved 23% speedup in FlashAttention kernels [119].
  • 1% reduction in Gemini model training time [119].
  • Represents AI improving the infrastructure that powers AI itself — a self-referential capability loop.
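The evolutionary loop behind a system like this reduces to propose, evaluate, select. A toy sketch under heavy simplification: the "program" is a single numeric parameter and the automated evaluator a fixed function, standing in for compiling and benchmarking real candidates; none of this is AlphaEvolve's actual machinery.

```python
import random

def evaluate(candidate: float) -> float:
    # Automated evaluator: higher is better; optimum at candidate = 3.0.
    return -(candidate - 3.0) ** 2

def mutate(parent: float, rng: random.Random) -> float:
    # Propose a small variation of the current best candidate.
    return parent + rng.uniform(-0.5, 0.5)

def evolve(generations: int = 200, seed: int = 0) -> float:
    rng = random.Random(seed)
    best = rng.uniform(-10, 10)
    for _ in range(generations):
        child = mutate(best, rng)
        if evaluate(child) > evaluate(best):  # selection: keep improvements only
            best = child
    return best

best = evolve()
print(round(best, 2))
```

The loop needs no gradient and no human scoring; everything hinges on the evaluator being automated and trustworthy, which is exactly what made the matrix-multiplication and scheduling domains tractable.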

Cost

$30-1,000+ ($900-30K+/mo) StrongDM: $1,000+/day per engineer. Scale depends on agent volume and task complexity.

The factory identifies gaps in its own capabilities. It proposes new agent pipelines. It iterates on its own processes. It improves its own infrastructure. Humans hold strategic authority and approval gates. The system's scope of autonomous action expands over time, based on demonstrated reliability.

No company operates here. This is the north star — the logical endpoint of everything that came before. The question is not whether we get here. The question is what governance, trust, and accountability look like when we do.

Era III: Institutional AI
L12

Self-Evolving Factory

The factory builds itself. Humans hold strategic authority and approval gates. The system's scope of autonomous action expands over time based on demonstrated reliability.

The factory identifies gaps in its own capabilities, proposes new agent pipelines, iterates on its own processes, and improves its own infrastructure — subject to human approval at strategic decision points.
No company operates at L12. The closest precedents:
  • StrongDM's Digital Twin Universe is built by agents — the testing infrastructure was not designed by humans [27].
  • StrongDM's Attractor repo is 6,000–7,000 lines of specification with zero code — agents generate everything [28].
  • AlphaEvolve demonstrates AI improving its own compute infrastructure, creating a self-referential improvement loop [119].
  • Live-SWE-agent (University of Illinois, Nov 2025) demonstrated a coding agent that self-evolves at runtime, creating custom tools on the fly while solving problems. 45.8% on SWE-bench Pro [121].
End of the Journey

The factory is not a tool.
It's an operating system.

Most organisations are somewhere between L4 and L8. The path to Institutional AI requires more than better tools — it requires rethinking the unit of productivity from the individual to the organisation.