How does AI adoption actually work?
Not everyone is at the same level. Not every level delivers what it promises. This is the complete journey — from a developer writing every line by hand to a factory that builds itself. Twelve levels. Three eras. One question: where does productivity actually shift?
Tools and data collected as of 18 March 2026
This is where everyone starts. Every line written by a human, every test run by hand. The baseline.
No AI
The developer is the sole author of every line of code.
The first taste of AI. It finishes your sentences. It feels like magic. But look closer — you're still the one thinking, deciding, structuring. The AI is accelerating your fingers, not your mind.
Developers spend about 52 minutes a day actually writing code. Autocomplete makes those 52 minutes faster. It does nothing about the other seven hours.
Autocomplete
AI accelerates typing. The developer's mental model, decisions, and workflow remain unchanged.
Autocomplete tools
| Product | Company | Detail |
|---|---|---|
| Copilot (inline) | GitHub / Microsoft | ~15M developers. $10/month individual, $19/month business [1]. |
| Tabnine | Tabnine | On-premises deployment. SOC 2 Type II, GDPR, HIPAA. Trains on customer code only. $9/month [1]. |
| Supermaven | Anysphere (now part of Cursor) | Fastest autocomplete measured. 300K-token context window. $10/month. Acquired by Cursor parent company [34]. |
| Cody (completions) | Sourcegraph | Strong codebase search and indexing, particularly for large legacy systems [94]. |
| AI Assistant | JetBrains | Native to IntelliJ family IDEs. |
| Q Developer (inline) | Amazon | AWS ecosystem. Formerly CodeWhisperer. Free tier with limits [130]. |
| Gemini Code Assist (inline) | Google | Cloud ecosystem. Free tier available [131]. |
| Double.bot | Double (YC W23) | VS Code extension focused on completion quality: bracket handling, auto-imports, multi-cursor, mid-line completions. ~28K installs [83]. |
Now you can ask questions. Describe a problem in plain language, get code back. But every conversation starts from zero — no memory, no project context. You are the integration layer. You copy. You paste.
AI Chat Assistant
An always-available consultant, but every interaction starts from zero project context. No awareness of the codebase, no memory between sessions.
AI Chat Assistants
| Product | Company | Detail |
|---|---|---|
| Claude.ai / Claude Pro | Anthropic | Strong reasoning. 200K context window. $20/month Pro. |
| ChatGPT / GPT-4o | OpenAI | Largest general user base. |
| Gemini | Google | Long context, multimodal. |
| Grok | xAI | Integrated with X/Twitter. Free tier available. |
A significant leap. The AI moves inside your editor, sees your files, understands your imports. It proposes multi-file changes. But here's the catch: you review every diff, line by line. You are still the bottleneck.
This is the most studied level — and the most contested. One RCT found developers 19% slower. Another found them 26% faster. The difference? Experience, task complexity, and whether the study measured typing speed or thinking speed.
AI in IDE, Supervised
The AI gains real project context. But the human approval loop on every change limits speed. The developer is still the bottleneck.
Commercial IDE tools
| Product | Company | Detail |
|---|---|---|
| Cursor (standard mode) | Anysphere | VS Code fork. 1M+ users, 360K+ paying. ~$500M+ ARR. $20/month hobby, $40/month pro [34]. |
| Windsurf (standard mode) | Cognition / Codeium | Cascade agent. $15/month. $82M ARR. 350+ enterprise customers [42]. |
| Copilot Workspace | GitHub / Microsoft | Works from issues and PRs. Feb 2026 added Claude and Codex model access alongside native models. |
| Augment Code | Augment Code | Enterprise-focused. Context Engine indexes 400K+ files, full dependency graph. $50/dev/month. $252M Series B [34]. |
| Trae | ByteDance | AI-native IDE (VS Code fork). 6M registered users, 1.6M MAUs across ~200 countries. Free Claude 3.7 Sonnet + GPT-4o initially; paid from ~$7.50/month. Builder Mode (SOLO) for full-stack generation: 44% penetration among international users. Formerly MarsCode. Privacy concerns re telemetry to ByteDance servers [77][78]. |
| Kiro | AWS | Spec-driven IDE on Code OSS, using Claude Sonnet. Enforces Requirements (EARS syntax) → Design → Task list methodology. Automated "hooks" trigger background tasks on file saves. Autonomous agent (announced re:Invent Dec 2025) works independently for days with persistent memory. Free during preview [79]. |
| Kilo Code | Kilo (GitLab co-founder) | Fork of Cline + Roo Code. 1.5M+ users, 25T+ tokens processed. Orchestrator mode routes subtasks to specialised sub-agents (Architect, Coder, Debugger). 500+ models at exact provider rates, zero markup. VS Code, JetBrains, CLI. $8M seed Dec 2025 [80]. |
| Zed AI | Zed Industries | Rust-built, GPU-rendered editor (120 FPS). From Atom's creator. $32M from Sequoia. Co-developed Agent Client Protocol (ACP) with JetBrains (Jan 2026) as open standard for editor-agnostic AI agents. Open-source Zeta model for fast completion [81]. |
| Sweep AI | Sweep | JetBrains-focused AI assistant with 40K+ installs, filling a gap since most competitors prioritise VS Code [86]. |
| Blackbox AI | Blackbox | Claims 30M+ developers. 300+ models. "Chairman LLM" dispatches to multiple specialist agents. From ~$3/month [87]. |
| Pieces | Pieces for Developers | "Long-term memory" tool capturing developer activity across IDEs, browsers, terminals for 9+ months. Works alongside other tools via MCP. Free tier; Pro $18.99/month [89]. |
Open-source IDE extensions
| Product | Detail |
|---|---|
| Cline | VS Code extension. Autonomous agent, BYOK model. 59K GitHub stars. |
| Roo Code | Fork of Cline with additional features and community plugins. |
| Continue.dev | Model-agnostic, privacy-focused IDE extension. |
| Aider | Terminal pair-programmer. 42K+ GitHub stars. Git-native. Strong on pair-programming workflows. |
Something shifts. You stop reviewing every change and start reviewing results. "Add authentication to this endpoint" — and you check if authentication works, not whether the code looks like you would have written it. Trust replaces control.
AI Co-pilot, Autonomous in IDE
The relationship shifts from "show me every change" to "show me the result." The developer operates at a higher level of abstraction.
Autonomous IDE agents
| Product | Company | Detail |
|---|---|---|
| Cursor (Agent / Composer mode) | Anysphere | Feb 2026: up to 8 parallel agents via git worktrees [34]. |
| Windsurf (Cascade mode) | Cognition / Codeium | Flows model retains session context across interactions. Wave 13 update: parallel Cascade + Cascade Hooks [42]. |
| Copilot (Agent mode) | GitHub / Microsoft | Expanding toward autonomous multi-file operation within VS Code. |
| Antigravity | Google | AI-native IDE, launched Nov 2025 with Gemini 3. Built after the $2.4B acqui-hire of the Windsurf team (July 2025). Manager View orchestrates multiple AI agents simultaneously; "Artifacts" system generates structured deliverables for verification. Currently free preview; paid tiers expected ~$20–60/month [82]. |
The inversion is complete. The agent writes. You direct. The IDE becomes a reading surface, not a writing surface. For the first time, the developer's primary skill is judgement, not implementation.
This is the most crowded and fastest-moving space in all of AI. Claude Code, Devin, Codex, Replit — billions of dollars chasing the same bet: that the developer's value is in knowing what to build, not in building it.
Agent-First Development
The agent is the primary author. The developer specifies intent and evaluates output. The editor is for reading, not writing.
Developer tools (terminal and autonomous agents)
| Product | Company | Detail |
|---|---|---|
| Claude Code | Anthropic | Terminal-native agent. 1M-token context window. Agentic tool use. Estimated ~$2.5B ARR as of early 2026 [6]. |
| Devin | Cognition | Fully sandboxed environment: shell, editor, browser. Asynchronous — takes tickets, works independently, submits PRs. $20/month + ACU (agent compute units). Goldman Sachs pilot alongside 12,000 developers. $4B valuation [7][26]. |
| Codex | OpenAI | Cloud-based asynchronous coding agent. 1.6M weekly active users. Runs in sandboxed environments. Works from natural language task descriptions [6]. |
| Jules | Google | Asynchronous agent. Clones repos into Cloud VMs, works independently, submits PRs via GitHub. 140,000+ code improvements shared during beta. CLI companion (Oct 2025), public API, proactive "Suggested Tasks" that scan repos for improvements. Free (15 tasks/day) to $124.99/month (300 tasks/day) [93]. |
| Amp | Sourcegraph | Replaced Cody for individual users (Nov 2025). No artificial token constraints — models run to completion. Team-first: shared threads, leaderboards, reusable workflows. Leverages Sourcegraph's deep code graph (code intelligence over billions of lines). Credit-based pricing, $10 free at signup [94]. |
| Grok Build / grok-code-fast-1 | xAI | Local-first CLI agent. Up to 8 parallel agents with Arena Mode (algorithmic ranking selects best output across agents). 314B parameter MoE model. 256K context window. $0.20/1M input tokens. Waitlist as of March 2026 [8][9]. |
| Grok CLI | xAI / open source | Open-source terminal agent. MCP support. Morph Fast Apply engine for high-speed multi-file edits [10]. |
| Warp | Warp | Rust-built terminal evolved into "Agentic Development Environment." Warp 2.0 (June 2025) added AI-native workflows. Runs hundreds of agents in parallel. 3.2 billion lines of code edited in 2025. 120,000+ codebases indexed. TIME Best Invention 2025. Free tier; Build $20/month; Business $50/month [95]. |
| Goose | Block (open source) | 29,400+ GitHub stars. Open-source AI agent created at Block (Square). ~5,000 Block employees use weekly. Contributed to Linux Foundation's Agentic AI Foundation alongside MCP and AGENTS.md. Stripe forked Goose for their Minions agent harness [96][97]. |
| Cosine / Genie | Cosine (YC) | Proprietary Genie 2 model. 50+ languages including legacy (COBOL, Fortran, Ada). Enterprise compliance: FINRA, HIPAA, ITAR certifiable. Multi-agent mode decomposes backlog items across autonomous specialist agents. 72% on SWE-Lancer benchmark [98]. |
| OpenCode | Open source | 122K GitHub stars. 75+ LLM providers. LSP integration for language-aware editing. Privacy-first: runs locally [11]. |
| Pi (by badlogicgames) | Open source | Minimal terminal coding harness. 4 core tools. Powers the OpenClaw ecosystem. Created by libGDX founder Mario Zechner [12][45]. |
| OpenHands | Open source | 69K stars. Formerly OpenDevin. CLI and web entrypoints. Strong on autonomous task completion [11]. |
| Gemini CLI | Google (open source) | 98K GitHub stars. Terminal agent for repository work and research. Direct Gemini model access [11]. |
| Codex CLI | OpenAI (open source) | 65K GitHub stars. Interactive TUI with tool execution. Local-first [11]. |
Foundation models with native agentic capabilities
| Model | Company | Detail |
|---|---|---|
| | Moonshot AI | 1T parameter MoE (32B activated). Open-source. Agent Swarm: orchestrates up to 100 sub-agents with 1,500+ tool calls per session. Trained via PARL (Parallel Agentic Reinforcement Learning) to prevent "serial collapse", where sequential agents degrade quality. 4.5x speedup over serial execution. $0.60/1M input tokens, $0.10/1M cached. Released January 2026 [48][49][50]. |
| | DeepSeek | Open-source. Introduces "thinking in tool-use": the model reasons while executing tool-calling flows rather than separating reasoning from execution. Agentic synthesis pipeline: 1,800+ curated environments, 85,000+ synthetic instructions. 128K context. RL budget exceeds 10% of pre-training cost. Free web access [51][52][53]. |
| M2.5 | MiniMax | 230B/10B MoE architecture. Open-source. Parallel tool calling. Trained via Forge (scaffold-agnostic agent RL framework that works across any orchestration layer). $1/hour at 100 tokens/second. Self-reported production data: 80% of committed code at MiniMax is M2.5-generated; 30% of company-wide tasks handled autonomously. Not independently verified [54][55][56]. |
| M1 | MiniMax | Open-source (Apache 2.0). 1M-token context. Hybrid attention mechanism. Full RL training cost: $534,700 on 512 H800 GPUs over 3 weeks [57][58]. |
| Malibu / Point | Poolside | Founded by Jason Warner (ex-GitHub CTO) and Eiso Kant. ~$626M raised through Series B; up to $1B from Nvidia (Oct 2025) at a $12B valuation. Malibu and Point are trained with RLCEF (RL from Code Execution Feedback): executing code and learning from the results across 350M+ repos (50T+ tokens). Deploys entirely within customer infrastructure. Project Horizon: planned 2GW AI campus [99][100]. |
| LTM-2-Mini | Magic | $515M raised across multiple rounds. Building 100M-token context models with ~23 engineers and 8,000 H100 GPUs. Custom CUDA training stack (no PyTorch autograd). LTM-2-Mini claims 1,000x efficiency over traditional attention mechanisms. Pre-product as of March 2026: no shipped product, all research [101][102]. |
Full-stack app generators (prompt to deployed application)
| Product | Company | Detail |
|---|---|---|
| Replit (Agent 4) | Replit | Cloud-native platform. Agent writes, tests, deploys, iterates in a managed environment. Built-in auth, database (Postgres), hosting, monitoring. Agent 4 (March 2026): canvas interface, kanban project management, parallel agents. $400M raise at $9B valuation. Targeting $1B ARR. Fastest growth from non-technical users [13][14]. |
| Lovable | Lovable | Natural language to React/TypeScript + Supabase backend. Clean, exportable code — not locked in. $25/month pro [15]. |
| Bolt.new | StackBlitz | In-browser full-stack IDE. Diffs-based updates allow iterating on generated code. ~$20/month [15]. |
| Base44 | Base44 (acquired by Wix) | Prompt to full-stack app with managed backend. Zero-setup database, auth, hosting. Closed ecosystem — harder to export [16]. |
| v0 | Vercel | UI generation from prompts. React/Next.js focused. Vercel deployment integration. Strong on frontend components. |
Some developers go further. They build personal infrastructure — Slack bots, Telegram triggers, custom scripts wrapping foundation models. They've automated their own workflow. But it's personal. The developer next to them has a completely different setup. There is no shared context, no organisational standard.
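The pattern behind these personal setups is small: a chat webhook that routes messages to an agent backend. A minimal sketch, assuming nothing about any specific bot API; `run_agent`, `handle_message`, and the command names are all invented stand-ins for whatever the developer has wired up.

```python
# Sketch of a personal trigger interface: chat messages dispatched to an
# agent backend. `run_agent` is a stub standing in for a real model call.

def run_agent(prompt: str) -> str:
    # Placeholder for a foundation-model API call (hypothetical).
    return f"[agent] acknowledged: {prompt}"

COMMANDS = {}

def command(name):
    """Register a chat command like /review or /status."""
    def register(fn):
        COMMANDS[name] = fn
        return fn
    return register

@command("/review")
def review(arg: str) -> str:
    return run_agent(f"Review the diff in {arg} against my style notes")

@command("/status")
def status(arg: str) -> str:
    return "bot online"

def handle_message(text: str) -> str:
    """Dispatch an incoming message; free-form text goes straight to the agent."""
    name, _, arg = text.partition(" ")
    handler = COMMANDS.get(name)
    return handler(arg) if handler else run_agent(text)
```

The point of the sketch is the limitation the text describes: everything here, from command names to routing, is one developer's private convention, invisible to the next desk over.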
Custom Trigger Interfaces
The developer has started building personal infrastructure around AI agents. But there is no shared context, no organisational standards, no coordination with other developers' agents.
The open-source personal agent ecosystem (OpenClaw)
| Product | Stars | Detail |
|---|---|---|
| OpenClaw | 310K | Personal AI assistant. Multi-channel: WhatsApp, Slack, Discord, Telegram, iMessage, Signal. Uses Pi as SDK. Created by Peter Steinberger. 60K+ stars in 72 hours at launch. OpenAI acqui-hired Steinberger Dec 2025; project now managed by vendor-neutral foundation [17][103]. |
| Nanobot | 33K | Ultra-lightweight Python rewrite of OpenClaw. ~4,000 lines of code [11]. |
| ZeroClaw | 27K | Autonomous agent runtime in Rust. <5MB RAM footprint [11]. |
| PicoClaw | 25K | Personal assistant in Go. Runs on $10 hardware, <10MB RAM [11]. |
| NanoClaw | 22K | Security-first variant. Apple container and Docker sandboxed execution [11]. |
| IronClaw | 9.9K | Rust rewrite by NEAR AI. WASM sandbox with capability-based permissions [11]. |
| NullClaw | 6K | Smallest OpenClaw-compatible implementation. Zig. 678KB binary, ~1MB RAM, <2ms startup [11]. |
And then the trap.
Multiple agents running in parallel. Cursor in five windows. Devin on three branches. Output metrics explode: +98% PRs merged. But look at what happens to the organisation: +91% review time. +47% context switching. +154% average PR size. Delivery metrics? Flat. The agents produce more code. The humans can't review it fast enough. The pipeline clogs. This is where most advanced teams are stuck right now.
Agent Multiplexing (Uncoordinated)
Individual output metrics explode. Organisational coherence collapses. This is the trap level.
Escaping the trap requires a fundamentally different investment. Not better agents — shared infrastructure. The organisation builds what individuals cannot: RAG over internal docs, custom MCP servers, org-specific prompt libraries, coding standards enforced by AI. The tooling becomes institutional. The workflow is still individual. This is the bridge between two worlds.
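The shape of that shared infrastructure can be sketched: internal capabilities exposed as named, documented tools that any agent can discover and call. This is an illustration of the pattern that MCP servers formalise, not the MCP wire protocol itself; every tool name and behaviour here is invented.

```python
# Sketch of organisation-tuned tooling: internal capabilities registered as
# discoverable tools. Illustrative pattern only, not the MCP protocol.
import json

TOOLS = {}

def tool(name: str, description: str):
    """Register an org-internal capability as a discoverable tool."""
    def register(fn):
        TOOLS[name] = {"description": description, "fn": fn}
        return fn
    return register

@tool("search_docs", "RAG-style lookup over internal documentation")
def search_docs(query: str) -> str:
    # Stub: a real server would query the org's vector index here.
    return f"top hit for {query!r}"

@tool("lint_standards", "Check a snippet against org coding standards")
def lint_standards(code: str) -> str:
    return "ok" if "TODO" not in code else "violation: unresolved TODO"

def list_tools() -> str:
    """What an agent sees when it asks the server what it can do."""
    return json.dumps({n: t["description"] for n, t in TOOLS.items()})

def call_tool(name: str, **kwargs) -> str:
    return TOOLS[name]["fn"](**kwargs)
```

The institutional shift is in the registry itself: once tools live in one shared catalogue with enforced descriptions, every developer's agent sees the same context.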
Organisation-Tuned Tooling
The tooling is institutional. The workflow is still individual. This is the last level where a developer personally drives every work item.
AI-native development platforms
| Product | Company | Detail |
|---|---|---|
| Droids (enterprise) | Factory AI | Unifies context from GitHub, Notion, Linear, Slack, Sentry. Enterprise memory across tools. Model-agnostic, IDE-agnostic. $50M Series B (NEA, Sequoia, Nvidia, J.P. Morgan). EY deploying to 5,000+ engineers [21][22]. |
| Codegen (in ClickUp) | ClickUp / Codegen | Agents with full business context: task descriptions, specs, acceptance criteria from the project management layer. Acquired by ClickUp Dec 2025 [23]. |
| Open SWE | LangChain | Open-source framework capturing architectural patterns from Stripe, Ramp, Coinbase agent deployments. Built on Deep Agents and LangGraph. Sandboxes (Modal, Daytona, Runloop, LangSmith), curated toolsets, AGENTS.md context engineering, subagent orchestration. MIT-licensed [63]. |
| Fabro | Qlty Software (Bryan Helmkamp) | Open-source "dark software factory." Deterministic workflow graphs in Graphviz DOT: branching, loops, parallelism, human approval gates. CSS-like model stylesheets route each node to different models/providers with fallback chains. Cloud sandboxes (Daytona, Sprites, exe.dev) with SSH/VNC. Git checkpointing after every stage. Automatic retrospectives with cost/duration/narrative. REST API + SSE for 24/7 server mode. Single Rust binary, zero dependencies. MIT-licensed. 37 stars, v0.174.0. Generalises Stripe's Blueprints pattern as a reusable product [155][156]. |
AI code review and quality
| Product | Company | Detail |
|---|---|---|
| Qodo Merge + IDE | Qodo | PR review with org-specific coding standards. Integrates Jira, Linear, Monday context. RAG across repos [24]. |
| CodeRabbit | CodeRabbit | AI PR reviewer that learns from team feedback over time. 2M+ reviews/month. $24–30/user/month [25][104]. |
| Graphite | Graphite | AI code review with <3% unhelpful comment rate. Developers change code 55% of the time when flagged (vs 49% for human reviewers). Median PR merge time: 24 hours → 90 minutes. $40/user/month [108]. |
| BugBot | Cursor | Launched July 2025. Reviews 2M+ PRs monthly with 8 parallel review passes per PR. Free for Cursor users [109]. |
| Greptile | Greptile | API for building custom AI tools with full codebase understanding. Powers internal tools. |
AI security and testing
| Product | Company | Detail |
|---|---|---|
| Snyk AI | Snyk | Suite: Snyk Studio (IDE plugin), Agent Fix (80% auto-remediation of known vulnerabilities), Evo (agentic security orchestration). 2026 report: 48% of AI-generated code contains vulnerabilities. Gartner Magic Quadrant Leader for Application Security Testing [110]. |
| Diffblue Cover | Diffblue | Autonomous JUnit test generation using reinforcement learning (not LLMs). Up to 250x faster than manual test writing. Deterministic, reproducible output [116]. |
| Trunk.io | Trunk | CI reliability: merge queues batching up to 100 PRs, AI-powered flaky test management. Critical infrastructure when AI-generated code volume overwhelms existing CI pipelines [115]. |
AI DevOps and modernisation
| Product | Company | Detail |
|---|---|---|
| Harness AI | Harness | DevOps agent addressing the "AI velocity paradox" — code generates faster but CI/CD hasn't kept up. Uses Claude Opus 4.5. Customers: United Airlines, Citibank. Claims 75% faster releases [113]. |
| Moderne | Moderne | Enterprise code modernisation built on OpenRewrite (5,000+ deterministic refactoring recipes). Expanded from Java to JavaScript, TypeScript, Python (late 2025). Positioned as deterministic infrastructure that AI agents depend on for reliable refactoring [114]. |
| Rovo Dev | Atlassian | GA late 2025. Takes Jira issues, generates technical plans, creates code, submits PRs. 75% time savings on simpler tasks. Statsig: 45% reduction in PR cycle times. $20/user/month [111]. |
| GitLab Duo | GitLab | Agent Platform GA January 2026. Specialised agents: Planner, Security Analyst, Data Analyst. 64.5% on SWE-bench Verified. Gartner Magic Quadrant Leader [112]. |
Now the line is crossed. A human sends a Slack message, creates a ticket, assigns a task. An agent picks it up — and everything after that is agent-driven. It plans, implements, tests, creates the PR, routes it for review. Not in a sandbox. Inside the organisation's actual infrastructure: the repos, the CI/CD, the monitoring, the feature flags, the compliance systems.
Stripe ships 1,300 PRs per week this way. Zero human-written code. Ramp's Inspect produces 50%+ of all merged pull requests. Uber's system handles 70% of committed code. All three built custom. All three required years of prior engineering investment. The prerequisite for L10 is not better AI. It is mature engineering practice.
Human-Triggered Institutional Agents
The trigger is human. Everything after the trigger is agent-driven within institutional bounds. The developer becomes a reviewer and strategic director rather than an implementer.
Other L10 entities
| Entity | System | Detail |
|---|---|---|
| Cognition | Devin | Takes tickets to PRs via Slack. Goldman Sachs pilot alongside 12,000 developers. Weeks of configuration required [7][26]. |
| Factory AI | Droids | Auto-triggers from issue assignment. Ticket-to-code traceability through the entire pipeline [21]. |
| ClickUp / Codegen | Codegen | "Select everything in backlog, assign to Codegen": bulk assignment to agents [23]. |
Productivity Evidence
Stripe Minions
- Produces 1,300+ pull requests per week with zero human-written code [64][67].
- Agent harness: Modified fork of Block's Goose, optimised for unattended one-shot operation where there is no human in the loop during execution [64][96].
- Blueprints: Structured workflows alternating deterministic nodes (linting, formatting, test running) with agent loops (code generation, debugging). This ensures critical operations are reliable while flexible work is agent-driven [65].
- Toolset: ~500 curated MCP tools, maintained rather than accumulated. Tools are versioned, documented, and gated by the same review process as production code [63].
- Execution: Sandboxed devboxes identical to human development environments. Full access to Stripe's hundreds of millions of lines of Ruby code [64].
- CI constraint: Two-round CI limit — if tests don't pass after two attempts, the agent escalates to a human rather than spiralling [64].
- Consistency: Rule files synced across Minions, Cursor, and Claude Code to ensure agents and human developers follow the same standards [64].
- System-initiated triggers: Automated flaky-test detectors spawn Minion runs without human intervention — an L11 element within a predominantly L10 system [64].
- Human review: Mandatory on every PR. No exceptions [67].
- Stripe's Minions could operate because Stripe had already invested in the prerequisites: a decade of standardised devboxes, a CI suite of 3 million tests, well-documented APIs, and a culture of code review [64].
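The Blueprint shape described above, deterministic nodes wrapped around agent loops with a hard CI limit, can be sketched in a few lines. This is an illustration of the pattern, not Stripe's implementation; the function names and toy stand-ins are invented.

```python
# Sketch of the Blueprint pattern: an agent loop generates code, a
# deterministic CI node checks it, and a two-round limit escalates to a
# human instead of letting the agent spiral. Names are illustrative.

def run_blueprint(task: str, agent_step, run_ci, max_ci_rounds: int = 2) -> str:
    code = agent_step(task, feedback=None)            # agent loop: generate
    for _ in range(max_ci_rounds):                    # deterministic node: CI
        ok, feedback = run_ci(code)
        if ok:
            return "merged-pending-human-review"      # review stays mandatory
        code = agent_step(task, feedback=feedback)    # agent loop: debug
    return "escalated-to-human"                       # CI limit reached

# Toy stand-ins: this agent fixes the code on its first retry.
def toy_agent(task, feedback):
    return "fixed" if feedback else "buggy"

def toy_ci(code):
    return (code == "fixed", "tests failed")

assert run_blueprint("add endpoint", toy_agent, toy_ci) == "merged-pending-human-review"
```

The design choice worth noting: the loop never decides to merge. Success only ever routes to human review, and repeated failure routes to a human too.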
Ramp Inspect
- Ramp's internal coding agent produces 50%+ of all merged pull requests [68].
- Execution: OpenCode running on Modal sandboxes. Each sandbox includes a full development environment: Postgres, Redis, Temporal, Sentry, Datadog, LaunchDarkly [68][69].
- Verification loop: Beyond tests — the agent monitors runtime behaviour, checks feature flags, and performs visual verification via VNC + headless Chromium [68].
- Interfaces: Slack bot (primary), web VS Code (for inspection), Chrome extension enabling non-engineers to trigger coding tasks [68].
- Multiplayer sessions: Multiple people can observe and redirect an agent session in real time [68].
- Self-bootstrapping: 80% of Inspect's own codebase was written by Inspect [68].
- Adoption: Organic, without mandates. Engineers adopted because it was faster than writing code themselves [68].
Coinbase Agent Ecosystem
- Coinbase operates multiple AI agent systems across different functions [63][66].
- Cloudbot: Coding agent integrated with Slack, Linear, and GitHub. Takes tickets, produces PRs [63].
- RAPID-D: Multi-agent decision support system with four specialised roles — Analyst (data), Seeker (research), Devil's Advocate (challenges), Synthesizer (recommendations). Used for strategic and operational decisions [70].
- qa-ai-agent: Visual browser testing agent that replaces week-long manual QA testing cycles with minutes-long automated runs [71].
- Enterprise automation: Production agents built on LangGraph for internal workflows [66].
- Paved roads: Standardised agent development patterns that cut new agent build time from 12+ weeks to less than 1 week [66].
- x402 protocol: HTTP-native stablecoin micropayment protocol designed for AI-agent-to-AI-agent transactions [72].
Uber AI Coding System
- Uber has deployed one of the most comprehensive enterprise AI coding systems documented publicly [117][118].
- Adoption: ~95% of engineers use AI monthly. ~70% of committed code is machine-generated [117].
- Multi-agent architecture: Validator (IDE best-practice enforcement), AutoCover (domain-specific test generation), uReview (AI-augmented review of 90%+ of ~65,000 weekly code diffs) [118].
- Scaling: AI coding agent scaled from <1% to ~8% of all code changes within months [117].
- Efficiency: 21,000 developer hours saved through automated test generation alone [118].
- Key finding: Specialised agents consistently outperform general-purpose agents across all tasks measured [117].
The trigger shifts from human to system. Cron jobs run code quality sweeps. Monitoring detects anomalies and spawns agents to investigate. Intake is structured and programmatic. Agents coordinate with each other. The factory is a multi-agent operating system with human governance.
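Programmatic intake can be sketched as an event-to-pipeline routing table: each trigger source emits typed events, and each event type maps to an agent pipeline, with unknown types escalating to humans. All names here are illustrative assumptions, not any company's actual system.

```python
# Sketch of system-initiated intake: triggers are events (cron ticks,
# monitoring alerts), not humans. Event kinds and pipelines are invented.
from dataclasses import dataclass

@dataclass
class Event:
    source: str   # "cron", "monitoring", "flaky-test-detector", ...
    kind: str
    payload: str

PIPELINES = {
    "quality-sweep": lambda e: f"spawned quality agents on {e.payload}",
    "anomaly":       lambda e: f"spawned investigation agent for {e.payload}",
    "flaky-test":    lambda e: f"spawned fix agent for {e.payload}",
}

def dispatch(event: Event) -> str:
    """Route a system event to its agent pipeline; unknown kinds go to humans."""
    pipeline = PIPELINES.get(event.kind)
    if pipeline is None:
        return f"escalated to human: {event.kind}"
    return pipeline(event)
```

The governance lives in the table: humans decide which event kinds exist and what each one is allowed to spawn, even though no human sits in the trigger path.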
Almost no one is here. StrongDM operates with 3 people, no human code, no human review. Their agents built their own test infrastructure. The cost: $1,000+ per day per engineer in API tokens. Stanford Law asks: "Built by agents, tested by agents, trusted by whom?"
Agent Factory
Work initiation shifts from human to system. The factory is a coordinated multi-agent operating system with human governance.
Other L11 entities
| Entity | Approach | Detail |
|---|---|---|
| Factory AI | Scripted Droid fleets | "Script and parallelise Droids at massive scale for CI/CD, migrations, maintenance. Self-healing builds" [21]. |
| Stripe | System-initiated Minions | Automated flaky-test detectors trigger Minion runs without human intervention: an L11 element within a predominantly L10 system [64]. |
| MiniMax | Autonomous task handling | Claims 30% of company-wide tasks handled autonomously by M2.5, 80% of committed code AI-generated. Not independently verified [54]. |
Productivity Evidence
StrongDM Software Factory
- The most radical documented implementation. A 3-person team founded July 2025, operating under a charter: "Code must not be written by humans. Code must not be reviewed by humans" [27][28].
- Validation: Scenario-based testing with holdout sets (tests not visible to agents during generation). The factory does not test its own code — separate validation agents test with scenarios the writing agents never saw [28].
- Digital Twin Universe: Agent-built clones of Okta, Slack, Google Workspace used for integration testing. The agents built their own test infrastructure [27].
- Attractor repo: 6,000–7,000 lines of specification with zero code. Agents generate everything from spec [28].
- Cost: $1,000+/day in API tokens per engineer [27].
- Open source: Attractor repository and cxdb (contextual database) open-sourced [28].
- Criticism: Stanford Law CodeX raised accountability questions: "Built by Agents, Tested by Agents, Trusted by Whom?" [29].
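The holdout discipline above has a simple skeleton: partition scenarios into a visible set the writing agents optimise against and a hidden set only the validator ever runs. A minimal sketch under invented names, not StrongDM's actual harness:

```python
# Sketch of holdout validation: a candidate that merely memorises the
# visible scenarios fails the hidden holdout set. Illustrative only.
import random

def split_scenarios(scenarios, holdout_fraction=0.5, seed=0):
    """Partition scenarios into a visible set and a hidden holdout set."""
    rng = random.Random(seed)
    shuffled = scenarios[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]

def validate(candidate, visible, holdout) -> str:
    """Separate validation agents run scenarios the writer never saw."""
    if not all(candidate(s) for s in visible):
        return "rejected: fails visible scenarios"
    if not all(candidate(s) for s in holdout):
        return "rejected: overfit to visible scenarios"
    return "accepted"

scenarios = list(range(10))
visible, holdout = split_scenarios(scenarios)
robust = lambda s: True
overfit = lambda s: s in visible   # "memorised" only what it could see
assert validate(robust, visible, holdout) == "accepted"
assert validate(overfit, visible, holdout) == "rejected: overfit to visible scenarios"
```

The same reason holdout sets guard against overfitting in ML evaluation applies here: an agent that shapes code to pass the tests it can see proves nothing about the behaviours it cannot.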
Google DeepMind AlphaEvolve
- An evolutionary coding agent (May 2025) using Gemini Flash + Pro with automated evaluation [119][120].
- Discovered a new 48-multiplication algorithm for 4×4 complex matrices — the first improvement over Strassen's 1969 algorithm in 56 years [119].
- Recovered 0.7% of Google's global compute resources through optimised data centre scheduling [120].
- Achieved 23% speedup in FlashAttention kernels [119].
- 1% reduction in Gemini model training time [119].
- Represents AI improving the infrastructure that powers AI itself — a self-referential capability loop.
The factory identifies gaps in its own capabilities. It proposes new agent pipelines. It iterates on its own processes. It improves its own infrastructure. Humans hold strategic authority and approval gates. The system's scope of autonomous action expands over time, based on demonstrated reliability.
No company operates here. This is the north star — the logical endpoint of everything that came before. The question is not whether we get here. The question is what governance, trust, and accountability look like when we do.
Self-Evolving Factory
The factory builds itself. Humans hold strategic authority and approval gates. The system's scope of autonomous action expands over time based on demonstrated reliability.
The factory is not a tool.
It's an operating system.
Most organisations are somewhere between L4 and L8. The path to Institutional AI requires more than better tools — it requires rethinking the unit of productivity from the individual to the organisation.