# Research & Benchmarks

## SWE-bench Evolution
SWE-bench Verified is the primary benchmark for autonomous coding agents solving real GitHub issues. Score progression:
| Date | Model / Agent | Score | Note |
|---|---|---|---|
| Early 2024 | Best available | ~33% | Baseline year |
| Mid-2025 | Multiple | ~55-60% | Rapid improvement |
| Oct 2025 | Claude 3.5 Sonnet (new) | 63% | With open-source agent harness |
| Jan 2026 | Various | ~70% | Chinese models in top 10 |
| Feb 2026 | Claude 4.5 Opus | 76.8% | Current best |
## Key Research Findings

### AlphaEvolve (Google DeepMind, May 2025)

Evolutionary coding agent pairing Gemini Flash (fast idea generation) with Gemini Pro (evaluation), with automated evaluation pipelines grading every candidate. Key results:

- A 48-multiplication algorithm for 4x4 complex-valued matrix multiplication, the first improvement over Strassen's 1969 result.
- 0.7% of Google's global compute recovered.
- 23% speedup in FlashAttention kernels.
- 1% reduction in Gemini training time.

Together these demonstrate AI improving the infrastructure that powers AI.
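The generate-and-evaluate split can be sketched as a minimal evolutionary loop. This is a toy illustration, not AlphaEvolve's implementation: `generate_variant` stands in for the fast idea-generation model, `evaluate` stands in for the automated grading pipeline, and the "program" being evolved is just a list of coefficients.

```python
import random

def generate_variant(parent: list[float], temperature: float) -> list[float]:
    """Stand-in for the fast idea-generation model: propose a mutated
    candidate from a parent solution."""
    return [x + random.gauss(0, temperature) for x in parent]

def evaluate(candidate: list[float]) -> float:
    """Stand-in for the automated evaluation pipeline: a deterministic,
    machine-checkable score (here, negative distance from a toy target)."""
    target = [1.0, -2.0, 3.0]
    return -sum((a - b) ** 2 for a, b in zip(candidate, target))

def evolve(generations: int = 200, population_size: int = 16) -> list[float]:
    population = [[0.0, 0.0, 0.0]]
    for _ in range(generations):
        # Generate many cheap variants, then keep only the top scorers.
        variants = [generate_variant(random.choice(population), 0.3)
                    for _ in range(population_size)]
        population = sorted(population + variants, key=evaluate, reverse=True)[:4]
    return population[0]

best = evolve()
print(evaluate(best))  # a score near 0 means the candidate is near the target
```

The essential design point survives the simplification: the generator is cheap and prolific, while selection pressure comes entirely from an automated, objective evaluator.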
### Live-SWE-agent (UIUC, Nov 2025)

The first coding agent that self-evolves at runtime: it recognises missing capabilities while solving a problem and builds custom tools for them mid-task. Scores 45.8% on SWE-bench Pro. Points toward L12 capability.
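The gap-detection-plus-registration pattern behind runtime tool creation can be sketched as follows. Everything here is a hypothetical stand-in: the `SelfExtendingAgent` class, the tool names, and the canned `synthesize_tool` factory (which in the real system would be an LLM writing actual tool code); only the pattern itself is the point.

```python
from typing import Callable

class SelfExtendingAgent:
    """Toy sketch of runtime tool creation: when a step needs a tool the
    agent lacks, it builds one mid-task and registers it for reuse."""

    def __init__(self) -> None:
        # Initial tool registry; real agents start with file/shell/search tools.
        self.tools: dict[str, Callable[[str], str]] = {
            "read_file": lambda path: f"<contents of {path}>",
        }

    def synthesize_tool(self, name: str) -> Callable[[str], str]:
        # Stand-in for the LLM writing new tool code on the fly.
        def tool(arg: str) -> str:
            return f"[{name} applied to {arg}]"
        return tool

    def run_step(self, tool_name: str, arg: str) -> str:
        if tool_name not in self.tools:  # capability gap detected mid-task
            self.tools[tool_name] = self.synthesize_tool(tool_name)
        return self.tools[tool_name](arg)

agent = SelfExtendingAgent()
agent.run_step("read_file", "setup.py")
agent.run_step("grep_ast", "def parse")  # missing tool, created mid-task
print(sorted(agent.tools))               # registry now includes 'grep_ast'
```

Because synthesized tools land in the same registry as built-in ones, later steps (and later tasks) reuse them at no extra cost, which is what makes the approach "self-evolving" rather than one-off code generation.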
### Self-healing code vulnerability study

Research on iterative AI refinement loops found a 37.6% increase in critical security vulnerabilities after five iterations: each "fix" pass introduces new flaws while addressing old ones, creating a vulnerability treadmill.
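To see how small per-pass regressions compound into the headline figure, the model below assumes a constant net growth rate per refinement pass and back-solves it from the reported +37.6% over 5 iterations. This is an illustrative assumption, not the study's methodology.

```python
# Illustrative model: each refinement pass multiplies the critical-vulnerability
# count by (1 + r). Back-solve r from the reported +37.6% after 5 passes.
iterations = 5
total_increase = 0.376
r = (1 + total_increase) ** (1 / iterations) - 1
print(f"net growth per pass: {r:.1%}")  # ≈ 6.6%

count = 100.0  # vulnerabilities per arbitrary baseline unit of code
for i in range(1, iterations + 1):
    count *= 1 + r
    print(f"after pass {i}: {count:.1f}")
# after pass 5: 137.6; each pass looks small, but the loop never converges
```

The treadmill intuition falls out directly: a per-pass regression of roughly 6.6% is easy to miss in review, yet the loop has no fixed point below it.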
### Snyk 2026 AI Security Report
48% of AI-generated code contains security vulnerabilities. The rate is higher for infrastructure code (Terraform, Kubernetes manifests) than application code.
### Anthropic skill formation RCT

A randomised controlled trial with 52 engineers learning a new codebase: those using AI completed tasks faster but scored 17% lower on knowledge quizzes about the codebase they had worked on. This is the first controlled evidence that AI tools may create a dependency that inhibits deep learning.
### Stanford employment data

Employment of software developers aged 22-25 fell nearly 20% between 2022 and 2025. Correlation, not causation, but the timing aligns with L4-L6 adoption.
### SWE-bench Multilingual

Extends the benchmark beyond Python. Claude's score drops from 63% on Python to 43% on non-Python tasks, revealing that language-specific training creates brittle generalisation.
### SWE-smith (Princeton, NeurIPS 2025 Spotlight)

50,000+ task instances from 128 repositories, 10x larger than the original SWE-bench. Open-source models trained on SWE-smith reach 40.2% on SWE-bench Verified, approaching closed-model performance.
### MIT CSAIL study (ICML 2025)
Mapped systematic roadblocks to autonomous software engineering. Found that while AI agents excel at well-scoped, test-driven tasks, they struggle with ambiguous requirements, cross-system dependencies, and tasks requiring domain knowledge not present in the codebase.