Research & Benchmarks

SWE-bench Evolution

SWE-bench Verified is the primary benchmark for autonomous coding agents solving real GitHub issues. Score progression:

SWE-bench Verified score progression
| Date | Model / Agent | Score | Note |
| --- | --- | --- | --- |
| Early 2024 | Best available | ~33% | Baseline year |
| Mid-2025 | Multiple | ~55-60% | Rapid improvement |
| Oct 2025 | Claude 3.5 Sonnet (new) | 63% | With open-source agent harness |
| Jan 2026 | Various | ~70% | Chinese models in top 10 |
| Feb 2026 | Claude 4.5 Opus | 76.8% | Current best |

Key Research Findings

AlphaEvolve (Google DeepMind, May 2025)

Evolutionary coding agent pairing Gemini Flash (idea generation) with Gemini Pro (evaluation), with automated evaluation pipelines grading candidate outputs. Key results: a new 48-multiplication algorithm for multiplying 4x4 complex-valued matrices (the first improvement over Strassen's 1969 result); recovery of 0.7% of Google's global compute; a 23% speedup in FlashAttention kernels; a 1% reduction in Gemini training time. Demonstrates AI improving the infrastructure that powers AI.
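The generate-then-grade loop at AlphaEvolve's core can be sketched in a few lines. This is a minimal, generic evolutionary loop, not DeepMind's implementation: `mutate` stands in for the idea-generation model and `score` for the automated evaluation pipeline, and the toy fitness problem is invented for illustration.

```python
import random

def evolve(seed, mutate, score, generations=50, population=8):
    """Minimal evolutionary loop: mutate the best candidate, keep improvements."""
    best = seed
    best_score = score(best)
    for _ in range(generations):
        # Propose candidate variants (the "idea generation" role).
        candidates = [mutate(best) for _ in range(population)]
        # Grade each candidate automatically (the "evaluation" role).
        for cand in candidates:
            s = score(cand)
            if s > best_score:
                best, best_score = cand, s
    return best, best_score

# Toy demo: evolve a vector of integers toward all zeros.
random.seed(0)

def mutate(xs):
    ys = list(xs)
    ys[random.randrange(len(ys))] += random.choice([-1, 1])
    return ys

def score(xs):
    return -sum(abs(x) for x in xs)  # higher is better; 0 is optimal

best, best_score = evolve([5, -3, 7], mutate, score)
```

The essential property is that the evaluator is cheap and automatic, so thousands of candidates can be graded without human review; that is what makes the approach work for matrix-multiplication algorithms and GPU kernels alike.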

Live-SWE-agent (UIUC, Nov 2025)

The first coding agent that self-evolves during runtime: it creates custom tools on the fly while solving problems, recognising a missing capability and building it mid-task. Scores 45.8% on SWE-bench Pro. Points toward L12 capability.
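The "build a missing tool mid-task" pattern can be sketched as an agent with a mutable tool registry. Everything here is illustrative, not Live-SWE-agent's actual architecture: the `SelfEvolvingAgent` class, the task dictionary shape, and the `line_count` tool are all hypothetical.

```python
class SelfEvolvingAgent:
    """Sketch of an agent whose toolbox can grow at runtime."""

    def __init__(self):
        self.tools = {}

    def register_tool(self, name, fn):
        # The agent "builds" a missing capability and adds it to its toolbox.
        self.tools[name] = fn

    def use(self, name, *args):
        if name not in self.tools:
            raise KeyError(f"missing tool: {name}")
        return self.tools[name](*args)

    def solve(self, task):
        # Mid-task: if the needed capability is absent, create it first.
        if task["needs"] not in self.tools:
            self.register_tool(task["needs"], task["builder"]())
        return self.use(task["needs"], *task["args"])

agent = SelfEvolvingAgent()
task = {
    "needs": "line_count",                                  # hypothetical tool
    "builder": lambda: (lambda text: len(text.splitlines())),
    "args": ["a\nb\nc"],
}
result = agent.solve(task)  # agent builds line_count, then applies it
```

In the real system the "builder" step is itself an LLM call that writes and tests the new tool's code; the registry pattern above is just the scaffolding that makes such runtime growth possible.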

Self-healing code vulnerability study

Research on iterative AI refinement loops found a 37.6% increase in critical security vulnerabilities after 5 iterations. Each "fix" pass introduces new vulnerabilities while addressing old ones, creating a vulnerability treadmill.
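The treadmill dynamic can be illustrated with a toy simulation. The fix rate and introduction rate below are invented for illustration, not the study's measured parameters; the point is only that when each pass both removes and introduces vulnerabilities, iterating does not converge to a clean codebase.

```python
def refinement_pass(vulns, fix_rate=0.6, introduced_per_pass=2):
    """Toy model of one AI 'fix' pass: resolves a fraction of the known
    vulnerabilities but introduces fresh ones alongside the fixes."""
    fixed = int(len(vulns) * fix_rate)
    remaining = vulns[fixed:]
    # Each pass ships new, previously unseen vulnerabilities.
    new = [f"new_vuln_{len(vulns)}_{i}" for i in range(introduced_per_pass)]
    return remaining + new

vulns = [f"vuln_{i}" for i in range(5)]
history = [len(vulns)]
for _ in range(5):
    vulns = refinement_pass(vulns)
    history.append(len(vulns))
# history tracks the vulnerability count across the 5 refinement passes:
# the total never reaches zero, and the surviving set keeps churning.
```

After five passes every remaining vulnerability is one that a "fix" pass introduced, which is the treadmill: effort is spent, the count plateaus, and the specific flaws keep changing.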

Snyk 2026 AI Security Report

48% of AI-generated code contains security vulnerabilities. The rate is higher for infrastructure code (Terraform, Kubernetes manifests) than application code.

Anthropic skill formation RCT

A randomised controlled trial with 52 engineers learning a new codebase. Those using AI completed tasks faster but scored 17% lower on knowledge quizzes about the codebase they had worked on. This is the first controlled evidence that AI tools may create a dependency that inhibits deep learning.

Stanford employment data

Employment of software developers aged 22-25 fell nearly 20% between 2022 and 2025. Correlation, not causation, but the timing aligns with L4-L6 adoption.

SWE-bench Multilingual

Extends the benchmark beyond Python. Claude's score drops from 63% to 43% on non-Python tasks, revealing that language-specific training creates brittle generalisation.

SWE-smith (Princeton, NeurIPS 2025 Spotlight)

50,000+ task instances from 128 repositories — 10x larger than original SWE-bench. Open-source models trained on SWE-smith achieve 40.2% on SWE-bench Verified, approaching closed-model performance.

MIT CSAIL study (ICML 2025)

Mapped systematic roadblocks to autonomous software engineering. Found that while AI agents excel at well-scoped, test-driven tasks, they struggle with ambiguous requirements, cross-system dependencies, and tasks requiring domain knowledge not present in the codebase.