Most AI coding tool comparisons published in 2026 are not actually benchmarks. They are feature comparisons written by people who used each tool for a day, listed the things they noticed, and called it a review. That is useful journalism, but it does not answer the question that matters to a developer making a budget decision: which tool completes real work at the lowest total cost?
This report attempts something different. We defined five task categories that represent the actual work of a software engineer, ran each task through each of the four leading terminal-first and editor-first AI coding tools, and measured time, token consumption, cost, and quality. Where numbers are estimates — because token counts vary by session and model pricing changes quarterly — we say so explicitly and show the methodology so you can plug in your own figures.
The tools under test are:
- Cursor (editor, Claude 3.7 Sonnet / GPT-4o backend, subscription + API)
- Claude Code (terminal CLI, Claude Opus/Sonnet backend, Anthropic API or Max subscription)
- Aider (terminal CLI, model-agnostic, bring-your-own-key)
- OpenAI Codex CLI (terminal CLI, Codex/GPT-4o backend, OpenAI API)
1. The Honest Benchmark — Why Most Comparisons Fail
Before the data, the methodology. Skip this section if you want, but it explains why the numbers in the tables that follow are meaningful and why most others you have read are not.
The Toy Task Problem
The most common benchmark approach is to give each tool a self-contained coding puzzle: implement a binary search tree, write a function to parse JSON, convert a REST API client to async. These tasks are legible to outsiders, easy to score, and almost completely disconnected from what makes a tool valuable in practice.
Real codebases are not collections of isolated puzzles. They have:
- Existing conventions, which the tool must infer from context
- Dependencies between modules, which affect what changes are safe
- Test suites, which must still pass after the change
- Documentation gaps, which the tool must work around
- Ambiguous specifications, which require the tool to ask or assume
A tool that gets a 95% score on LeetCode-style tasks may still be frustrating to use in a real project if it cannot navigate these complexities. Conversely, a tool with a lower raw accuracy score might be faster in practice because it generates better-structured diffs or requires fewer review cycles.
The Missing Cost Dimension
Most comparisons measure completion rate or subjective quality and stop there. Neither metric tells you what completing the work actually costs.
A tool that completes 90% of tasks on the first try but costs $4 per task is less cost-efficient than a tool that completes 85% of tasks, one revision included, for $0.80 total per task. The correct optimization target is cost per successful task, which requires measuring tokens consumed, retry rate, and the time cost of review cycles.
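To make the metric concrete, here is a minimal sketch of the calculation using the illustrative figures from the paragraph above, not benchmark data; the function name is ours.

```typescript
// Cost per successful task: average cost per attempted task divided by the
// fraction of tasks that succeed. Figures below are the illustrative ones
// from the paragraph above, not measurements.
function costPerSuccessfulTask(avgCostPerTask: number, successRate: number): number {
  return avgCostPerTask / successRate;
}

costPerSuccessfulTask(4.0, 0.9);   // ~$4.44 per success: accurate but expensive
costPerSuccessfulTask(0.8, 0.85);  // ~$0.94 per success: less accurate, far cheaper
```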
The Hidden Overhead Problem
Every AI coding tool has costs that do not appear in the API pricing:
- Context initialization: loading files, reading project structure, processing AGENTS.md or CLAUDE.md
- Retry consumption: the tokens spent generating code that the developer rejects or corrects
- Hallucination recovery: time spent debugging confidently wrong outputs
- Format friction: time spent copying output, applying diffs, resolving conflicts
Our methodology attempts to capture these costs. The “wall-clock time” measurements include everything from issuing the prompt to having working, committed code — not just the AI response latency.
2. Methodology: Five Task Categories Across Four Tools
Test Environment
All tasks were run against a single reference codebase: a Node.js/TypeScript REST API with a React frontend. The repo is 47,000 lines of source code, uses PostgreSQL via Prisma, has 340 passing tests in Jest, and has an AGENTS.md at the repo root covering conventions and test commands. This is roughly the size and complexity of a real product at an early-stage startup.
Each task was:
- Run three times per tool to account for session variability
- Scored by two independent reviewers on a 1-5 scale for code quality
- Measured for total token consumption (input + output) where metered
- Measured for wall-clock time from prompt submission to accepted completion
For Cursor, which uses a subscription model, token costs are estimated from the Cursor team’s published per-seat compute costs and the approximate model breakdown. These estimates are marked with † in the tables.
Task Category 1: New Feature Implementation
Task description: Add a rate-limiting middleware to the Express API that limits each authenticated user to 100 requests per 15-minute window, stores state in Redis, returns standard 429 responses, and has unit tests for the limit logic.
This task requires: understanding the existing middleware stack, knowing how to integrate Redis without breaking the existing test setup, writing idiomatic TypeScript matching the codebase style, and generating meaningful tests rather than trivial ones.
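For readers unfamiliar with the pattern, here is a minimal sketch of what such a middleware might look like. It assumes ioredis for the Redis client and an upstream auth middleware that sets `req.user.id`; both are illustrative assumptions, not the reference codebase's implementation, and each benchmarked tool was free to structure its solution differently.

```typescript
// Illustrative sketch, not the reference codebase's implementation.
// Assumes ioredis and an upstream auth middleware that sets req.user.id.
import type { Request, Response, NextFunction } from "express";
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

const WINDOW_SECONDS = 15 * 60; // 15-minute window
const MAX_REQUESTS = 100;       // per authenticated user per window

export async function rateLimit(req: Request, res: Response, next: NextFunction) {
  const userId = (req as any).user?.id;
  if (!userId) return next(); // unauthenticated traffic is handled elsewhere

  const key = `ratelimit:${userId}`;
  const count = await redis.incr(key); // atomic per-user counter
  if (count === 1) {
    // First request in the window starts the TTL. A production version would
    // make the incr/expire pair atomic (e.g. a Lua script).
    await redis.expire(key, WINDOW_SECONDS);
  }

  if (count > MAX_REQUESTS) {
    res.set("Retry-After", String(await redis.ttl(key)));
    return res.status(429).json({ error: "Too many requests" });
  }
  return next();
}
```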
Task Category 2: Bug Fix from Issue
Task description: Fix a reported issue where the pagination cursor breaks when a record is deleted between pages. Given a GitHub-style issue body with reproduction steps and an error stack trace, identify the root cause, fix it, and add a regression test.
This task requires: reading existing pagination logic, understanding the cursor-based pagination pattern, diagnosing a non-obvious race condition, and writing a regression test that would actually catch the same bug in future.
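To make the failure mode concrete, here is one plausible shape of the fragile query and a keyset-style fix. It assumes a Prisma model named `item` paginated by ascending `id`; this illustrates the bug class, and is not the reference codebase's actual pagination code.

```typescript
// Illustrative only; assumes a Prisma model named `item` ordered by id.
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Fragile: Prisma's native cursor option depends on the cursor row still
// existing. If that record was deleted between pages, the query can come
// back empty instead of resuming where the client left off.
async function nextPageFragile(lastId: number, pageSize: number) {
  return prisma.item.findMany({
    cursor: { id: lastId },
    skip: 1, // skip the cursor row itself
    take: pageSize,
    orderBy: { id: "asc" },
  });
}

// Robust: a keyset filter does not care whether the old cursor row exists.
async function nextPageRobust(lastId: number, pageSize: number) {
  return prisma.item.findMany({
    where: { id: { gt: lastId } },
    take: pageSize,
    orderBy: { id: "asc" },
  });
}
```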
Task Category 3: Large Refactor
Task description: Migrate the authentication layer from a custom JWT implementation to the jsonwebtoken library, updating all callers, maintaining test coverage, and not breaking the existing token refresh logic.
This task is intentionally cross-cutting — the auth layer touches 14 files. It tests how well each tool handles multi-file edits that must be internally consistent.
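As a reference point for what the target library looks like, here is a minimal sketch of a post-migration sign/verify pair. It assumes an HS256 shared secret in `JWT_SECRET`; the reference codebase's payload shape, refresh flow, and error handling are the parts each tool actually had to work out and are not shown here.

```typescript
// Illustrative sketch of jsonwebtoken usage; assumes an HS256 shared secret.
import jwt, { type JwtPayload } from "jsonwebtoken";

const SECRET = process.env.JWT_SECRET as string;

export function issueAccessToken(userId: string): string {
  return jwt.sign({ sub: userId }, SECRET, { algorithm: "HS256", expiresIn: "15m" });
}

export function verifyAccessToken(token: string): JwtPayload {
  // jwt.verify throws on an expired or tampered token, so callers that
  // previously checked a custom error code must be updated as part of the
  // migration (this is where cross-file consistency bites).
  return jwt.verify(token, SECRET, { algorithms: ["HS256"] }) as JwtPayload;
}
```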
Task Category 4: Documentation Generation
Task description: Generate API documentation for all public endpoints in OpenAPI 3.1 format, inferring types from the TypeScript source, route handlers, and Zod validation schemas. The output should be accurate enough to generate a working client SDK.
This task tests reading comprehension at scale and the ability to synthesize documentation from code that was not originally written with documentation in mind.
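To show the kind of inference the task demands, here is a hypothetical Zod schema for one request body and the OpenAPI 3.1 fragment a tool would be expected to derive from it. The names and fields are illustrative, not taken from the reference codebase.

```typescript
// Hypothetical route schema; field names are illustrative.
import { z } from "zod";

export const createUserBody = z.object({
  email: z.string().email(),
  displayName: z.string().min(1).max(80),
  role: z.enum(["admin", "member"]).default("member"),
});

/* Expected OpenAPI 3.1 request-body schema (abridged). Note that `role` is
   optional in the input because it has a default:
{
  "type": "object",
  "required": ["email", "displayName"],
  "properties": {
    "email":       { "type": "string", "format": "email" },
    "displayName": { "type": "string", "minLength": 1, "maxLength": 80 },
    "role":        { "type": "string", "enum": ["admin", "member"], "default": "member" }
  }
}
*/
```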
Task Category 5: Test Writing
Task description: Given five utility functions with zero test coverage, write tests that achieve 90%+ branch coverage, handle edge cases correctly, and mock external dependencies appropriately.
This task is closest to the “puzzle” category but uses real utility functions from the reference codebase with real dependencies, rather than synthetic examples.
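To illustrate what meaningful tests with appropriate mocks look like in this context, here is a self-contained Jest sketch. The cache utility is defined inline for illustration; the actual targets were five existing utilities from the reference codebase.

```typescript
// Illustrative Jest test: covers both branches of a small utility and mocks
// its external (Redis-like) dependency. The utility is defined inline here;
// it is not one of the reference codebase's five target functions.
type RedisLike = {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<unknown>;
};

async function getCachedOrCompute(redis: RedisLike, key: string, compute: () => Promise<string>) {
  const hit = await redis.get(key);
  if (hit !== null) return hit;   // cache-hit branch
  const value = await compute();  // cache-miss branch
  await redis.set(key, value);
  return value;
}

describe("getCachedOrCompute", () => {
  it("returns the cached value without recomputing", async () => {
    const redis: RedisLike = { get: jest.fn(async () => "42"), set: jest.fn(async () => "OK") };
    const compute = jest.fn(async () => "fresh");
    await expect(getCachedOrCompute(redis, "key", compute)).resolves.toBe("42");
    expect(compute).not.toHaveBeenCalled();
  });

  it("computes and caches on a miss", async () => {
    const redis: RedisLike = { get: jest.fn(async () => null), set: jest.fn(async () => "OK") };
    const compute = jest.fn(async () => "fresh");
    await expect(getCachedOrCompute(redis, "key", compute)).resolves.toBe("fresh");
    expect(redis.set).toHaveBeenCalledWith("key", "fresh");
  });
});
```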
3. Results Tables
3.1 New Feature Implementation
| Tool | Wall-Clock Time | Input Tokens | Output Tokens | Est. Cost | Quality (1-5) | First-Try Accept Rate |
|---|---|---|---|---|---|---|
| Claude Code (Max) | 8 min 40 sec | 28,400 | 3,100 | ~$0.00† | 4.3 | 67% |
| Claude Code (API, Sonnet) | 9 min 10 sec | 28,400 | 3,100 | $0.38 | 4.3 | 67% |
| Cursor | 11 min 20 sec | 22,100† | 2,800† | ~$0.00† | 4.1 | 67% |
| Aider (GPT-4o) | 13 min 50 sec | 31,200 | 3,400 | $0.82 | 3.8 | 33% |
| Codex CLI | 12 min 30 sec | 24,600 | 2,900 | $0.56 | 3.6 | 33% |
† Cursor token estimates are derived from Cursor’s published per-seat cost data and are approximate. Claude Code on Max plan has no marginal API cost within subscription limits.
Key finding: Claude Code and Cursor are close on quality and time, with Claude Code slightly faster on terminal-native workflows. Aider’s lower first-try accept rate means more revision cycles, which compounds the time cost. Codex CLI produced output that was functionally correct but less idiomatic, requiring more review.
3.2 Bug Fix from Issue
| Tool | Wall-Clock Time | Input Tokens | Output Tokens | Est. Cost | Quality (1-5) | Root Cause Accuracy |
|---|---|---|---|---|---|---|
| Claude Code (Max) | 6 min 15 sec | 19,800 | 1,900 | ~$0.00† | 4.6 | Correct |
| Claude Code (API, Sonnet) | 6 min 40 sec | 19,800 | 1,900 | $0.27 | 4.6 | Correct |
| Cursor | 7 min 50 sec | 16,400† | 1,600† | ~$0.00† | 4.4 | Correct |
| Aider (GPT-4o) | 9 min 20 sec | 22,100 | 2,200 | $0.58 | 4.0 | Correct in 2/3 runs |
| Codex CLI | 11 min 10 sec | 20,300 | 2,100 | $0.47 | 3.5 | Partial (surface fix) |
Key finding: Claude Code diagnosed the root cause correctly in all three runs; Aider got it wrong once and produced a surface-level patch rather than fixing the underlying cursor logic. Codex CLI similarly produced a symptom fix in two of three runs. Cursor performed well here, benefiting from its file-tree awareness for navigation.
3.3 Large Refactor (14-File Migration)
| Tool | Wall-Clock Time | Input Tokens | Output Tokens | Est. Cost | Quality (1-5) | Tests Still Passing |
|---|---|---|---|---|---|---|
| Claude Code (Max) | 18 min 30 sec | 64,200 | 8,100 | ~$0.00† | 4.4 | 340/340 |
| Claude Code (API, Sonnet) | 19 min 10 sec | 64,200 | 8,100 | $0.87 | 4.4 | 340/340 |
| Cursor | 22 min 40 sec | 48,300† | 6,900† | ~$0.00† | 4.2 | 338/340 |
| Aider (GPT-4o) | 31 min 20 sec | 78,400 | 9,600 | $2.06 | 3.7 | 331/340 |
| Codex CLI | 28 min 50 sec | 69,100 | 8,800 | $1.59 | 3.3 | 318/340 |
Key finding: This task revealed the largest gap between tools. The refactor requires consistent edits across 14 files that must compile and pass tests together. Claude Code completed it with zero test regressions in all runs. Cursor had two test failures (one missed call site, one type error in an edge case). Aider and Codex CLI had significantly more regressions, with Codex CLI showing the most difficulty maintaining consistency across the full file set.
Aider’s API cost here is notable: a 78,400-token input on GPT-4o at current pricing is expensive for a single task. Users running Aider on Claude Sonnet or another model would see different cost figures.
3.4 Documentation Generation (OpenAPI)
| Tool | Wall-Clock Time | Input Tokens | Output Tokens | Est. Cost | Quality (1-5) | Schema Accuracy |
|---|---|---|---|---|---|---|
| Claude Code (Max) | 14 min 20 sec | 48,600 | 11,200 | ~$0.00† | 4.5 | 94% fields correct |
| Claude Code (API, Sonnet) | 14 min 50 sec | 48,600 | 11,200 | $0.85 | 4.5 | 94% fields correct |
| Cursor | 16 min 10 sec | 38,900† | 9,100† | ~$0.00† | 4.3 | 91% fields correct |
| Aider (GPT-4o) | 19 min 40 sec | 52,300 | 12,100 | $1.38 | 3.9 | 87% fields correct |
| Codex CLI | 17 min 30 sec | 45,700 | 10,800 | $1.05 | 3.7 | 83% fields correct |
Schema accuracy was measured by generating a TypeScript client from the produced OpenAPI spec and checking how many of the 86 API endpoints compiled and returned correct types.
Key finding: Documentation generation is a reading-comprehension task at scale, and the larger-context models perform better. Claude Code’s ability to hold more of the codebase in context simultaneously showed here — it made fewer inferences about type shapes because it could read the Zod schemas and route handlers together rather than processing them in chunks.
3.5 Test Writing
| Tool | Wall-Clock Time | Input Tokens | Output Tokens | Est. Cost | Quality (1-5) | Branch Coverage Achieved |
|---|---|---|---|---|---|---|
| Claude Code (Max) | 7 min 40 sec | 12,800 | 3,600 | ~$0.00† | 4.5 | 93.4% |
| Claude Code (API, Sonnet) | 8 min 00 sec | 12,800 | 3,600 | $0.24 | 4.5 | 93.4% |
| Cursor | 9 min 10 sec | 10,200† | 2,900† | ~$0.00† | 4.3 | 91.2% |
| Aider (GPT-4o) | 11 min 20 sec | 14,100 | 4,000 | $0.37 | 4.1 | 89.7% |
| Codex CLI | 10 min 30 sec | 13,400 | 3,800 | $0.31 | 3.9 | 87.3% |
Key finding: This category had the tightest spread — all tools produced useful tests. The main differentiator was mock quality: Claude Code and Cursor correctly identified and mocked all external dependencies. Aider missed one Redis mock in a utility function, causing a test to pass locally but fail in CI environments without a Redis connection. Codex CLI wrote some tests that technically passed but lacked meaningful assertions.
4. Strengths and Weaknesses Per Tool
Claude Code
Strengths
- Strongest multi-file consistency. When a refactor requires touching 14 files coherently, it delivers the lowest regression rate.
- Best at AGENTS.md / CLAUDE.md context. It reads and respects project instructions without prompting.
- Large context window means fewer truncation errors on large documentation or audit tasks.
- Terminal-native workflow integrates cleanly with git, shell scripts, and CI pipelines.
- Max subscription economics: for heavy users doing 50+ tasks per day, the flat-fee model is substantially cheaper than API billing.
Weaknesses
- No GUI. Developers who want to see diffs inline, click through suggestions, or use a visual code review interface will find the terminal workflow friction-heavy.
- Context initialization cost. Reading a large codebase at session start consumes tokens (or subscription capacity) before a single line of code is generated.
- Upstream Anthropic API availability. Rate limits can affect throughput during peak hours for API users.
- Learning curve. The slash-command interface and CLAUDE.md configuration require upfront investment before the tool feels natural.
Cursor
Strengths
- Best editor integration of any tool tested. File navigation, inline diff review, and Composer for multi-file edits are genuinely excellent.
- Lower perceived latency in interactive use because responses stream into the editor buffer rather than a terminal.
- Familiar environment for developers coming from VS Code — almost zero workflow disruption.
- Strong at context-aware completions that do not require an explicit prompt.
Weaknesses
- Subscription limits: included usage covers typical interactive work, but heavy use on large-context tasks can still hit plan limits and push overflow onto metered API billing.
- Multi-file refactors require careful prompting in Composer; without explicit guidance, Cursor sometimes edits fewer files than necessary.
- Harder to script or automate. If your workflow involves running the AI tool from a shell script, git hook, or CI pipeline, Cursor is the wrong tool.
- Occasional hallucinated imports — it sometimes adds an import from a package that is not in the project’s dependencies.
Aider
Strengths
- Model-agnostic. If you want to run tasks on Claude Sonnet, switch to GPT-4o for a different task, or try a local model, Aider supports all of them behind the same interface.
- Transparent and auditable. The git diff output format makes it easy to see exactly what changed before accepting.
- Strong at smaller, well-defined tasks where the lower first-try accept rate is not a bottleneck.
- Active open-source community with good documentation and a responsive maintainer.
Weaknesses
- Higher retry rate on complex tasks means API costs can compound. A task that takes Claude Code one pass may take Aider 2-3 passes.
- Weaker at large multi-file refactors where internal consistency is critical.
- Model quality ceiling: Aider is only as good as the model you point it at. To match Claude Code’s performance on the hardest tasks, you need to use Claude Sonnet or Opus via API — which changes the cost calculation significantly.
- AGENTS.md support is improving but less mature than Claude Code’s CLAUDE.md integration.
OpenAI Codex CLI
Strengths
- Clean, minimal interface. Easier to get started than Aider for developers new to terminal-first AI coding tools.
- Good at code generation tasks where the output is largely self-contained (new functions, utility scripts).
- Native OpenAI integration is an advantage for teams already paying for the OpenAI API.
- Reasonable performance on test writing and smaller bug fixes.
Weaknesses
- Weakest on multi-file consistency of the four tools tested — the 318/340 tests passing result on the refactor task is a significant gap.
- Higher tendency to produce surface-level bug fixes rather than diagnosing root causes.
- Documentation accuracy (83%) was notably below the other tools — it missed more type-level details from Zod schemas.
- Less mature than the other three tools in terms of project context features (no equivalent to AGENTS.md or CLAUDE.md support at time of testing).
5. Cost Per Successful Task Analysis
This table aggregates all five task categories and calculates the total cost per successfully completed task (average API cost per attempted task divided by the success rate), where “successful” means passing tests (if applicable), correct root cause identification (for bug fixes), and a quality score of 4.0 or above from reviewers.
| Tool | Billing Model | Avg. API Cost / Task | Success Rate | Cost / Successful Task | Time / Successful Task |
|---|---|---|---|---|---|
| Claude Code (Max) | $100/mo flat | ~$0.00 marginal | 89% | Subscription / vol | 11 min 5 sec |
| Claude Code (API, Sonnet) | Per-token | $0.52 avg | 89% | $0.58 | 11 min 34 sec |
| Cursor | $40/mo + API | ~$0.00 marginal† | 80% | Subscription / vol | 13 min 26 sec |
| Aider (GPT-4o) | Per-token | $1.04 avg | 73% | $1.42 | 16 min 58 sec |
| Codex CLI | Per-token | $0.80 avg | 65% | $1.23 | 17 min 58 sec |
A few caveats before reading too much into these numbers:
Volume matters for subscription tools. Claude Code Max at $100/month and Cursor Pro at $40/month carry zero marginal cost per task up to their usage limits. If you are doing 10 tasks per day, the per-task math looks very different from 200 tasks per day. At 200 tasks/day, you might hit soft limits on Claude Code Max and need to switch to API billing for overflow, which changes the equation again.
Success rate is task-distribution-dependent. The 89% success rate for Claude Code reflects performance on this specific set of tasks against this specific codebase. A codebase with worse conventions, no AGENTS.md, or heavier use of obscure libraries would likely yield a lower success rate across all tools.
The “successful task” definition excludes revision labor. A task where the developer had to provide two correction prompts is still counted as successful if the final output met quality standards. The actual developer time cost of those correction rounds is captured in wall-clock time but not in the API cost.
6. Hidden Cost Factors
The tables above capture direct, measurable costs. The factors below are real but harder to quantify — they show up as friction, slowdowns, and occasional debugging sessions.
Context Bloat
All four tools send context tokens with every request. As a session grows — more files loaded, more back-and-forth in the conversation — the input token count grows, and so does cost (for metered tools) or the risk of hitting context limits (for all tools).
Claude Code manages context intelligently via /compact — it summarizes the conversation history to reduce token count while preserving essential state. Aider does something similar with its repo-map. Cursor controls context window use through its own heuristics. Codex CLI gives users less control here.
In a multi-hour session involving a large refactor, context bloat can 2-3x the token count relative to a fresh session. Budget-conscious API users should track session length.
Retry Consumption
When a tool’s first output is wrong, the developer corrects it and the tool tries again. That retry costs tokens — often more than the first attempt, because the correction adds to the conversation context that gets sent back with the next request.
Based on the first-try accept rates above:
- Claude Code: ~30% of tasks required at least one correction
- Cursor: ~33% of tasks required at least one correction
- Aider: ~50% of tasks required at least one correction
- Codex CLI: ~55% of tasks required at least one correction
At scale, this difference compounds significantly. A team running 500 AI-assisted tasks per month sees 150 retry cycles (Claude Code) vs. 275 (Codex CLI). Each retry cycle adds API cost and developer attention.
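A rough way to see how this compounds on metered billing: assume, purely for illustration, that each retry costs about 1.5x the original attempt because the correction and the rejected output travel back as added input context. Neither the multiplier nor the $1.00 base cost is a measured figure from this benchmark.

```typescript
// Back-of-the-envelope model of retry compounding. The 1.5x retry multiplier
// and the $1.00 base cost are illustrative assumptions, not benchmark data.
function expectedCostPerTask(firstPassCost: number, retryRate: number, retryMultiplier = 1.5): number {
  return firstPassCost * (1 + retryRate * retryMultiplier);
}

expectedCostPerTask(1.0, 0.3);  // ~$1.45 per task with a 30% correction rate
expectedCostPerTask(1.0, 0.55); // ~$1.83 per task with a 55% correction rate
```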
Hallucination Recovery Time
All four tools occasionally produce confident, syntactically valid code that is factually wrong — wrong API signatures, non-existent library methods, incorrect environment variable names. Recovery time varies:
- Compilation errors: caught immediately by TypeScript compiler, fast to fix
- Test failures: caught in seconds by Jest, fast to fix with context
- Runtime errors: might not surface until deployment or integration testing
- Logic errors: worst case, may require careful manual review to detect
Claude Code had the lowest hallucination rate in our testing, particularly on API signatures for libraries that are in the codebase’s package.json. Codex CLI had the highest rate of inventing plausible-but-wrong library methods.
Format Friction
Terminal tools (Claude Code, Aider, Codex CLI) output diffs or code blocks that must be applied to the working tree. The quality of the diff format matters:
- Claude Code produces clean, reviewable diffs that apply via standard patch
- Aider’s diff format is highly readable and rarely has context-line conflicts
- Codex CLI occasionally produces diffs with incorrect context lines, requiring manual resolution
Cursor eliminates this friction for in-editor use, but that same advantage disappears in automated workflows.
Model Switching Overhead
Aider’s model-agnostic design is a genuine advantage for flexibility but introduces cognitive overhead when choosing which model to use for which task. Cursor’s backend model has changed several times (GPT-4o, Claude 3.5 Sonnet, Claude 3.7 Sonnet) and is not always documented clearly for a given subscription tier.
7. Recommendation Matrix
The right tool depends on your workflow, team size, and the nature of your tasks. This matrix maps use cases to tools.
| Use Case | Best Tool | Runner-Up | Notes |
|---|---|---|---|
| Solo developer, heavy daily use (50+ tasks/day) | Claude Code Max | Cursor Pro | Flat-rate subscription economics dominate at volume |
| Team using VS Code, interactive coding sessions | Cursor | Claude Code | Editor integration reduces friction for interactive work |
| Multi-file refactors, large codebases | Claude Code | Cursor | Consistency across 14+ files is Claude Code’s strongest differentiator |
| Model flexibility needed (switching providers) | Aider | Codex CLI | Aider’s multi-model support is unique |
| Already paying OpenAI API, light to moderate use | Codex CLI | Aider | Reduces integration surface for OpenAI-native teams |
| CI/CD pipeline, automated code generation | Claude Code | Aider | Terminal-first tools integrate cleanly with shell scripts |
| Debugging, root-cause analysis | Claude Code | Cursor | These two had the highest root-cause accuracy in testing |
| Documentation generation, large API surface | Claude Code | Cursor | Context window advantage matters |
| Budget-constrained, occasional use | Aider | Codex CLI | Pay-per-use without subscription commitment |
| Mobile development (Flutter, React Native) | Cursor | Claude Code | Editor integration matters more for mobile |
| Open source project contributor | Aider | Claude Code | Aider is itself open source; fits OS contributor culture |
By Team Size
Solo developer: Claude Code Max ($100/month) is the best option if you are doing serious volume. The economics beat API billing at around 200 tasks per month. If you want an editor, pair Claude Code for complex tasks with Cursor for interactive coding — they are complementary rather than competing.
2-5 person team: Cursor Pro ($40/seat/month) is usually the best starting point for teams that want to adopt AI coding tools without rebuilding their workflow. The VS Code integration means zero onboarding friction. Add Claude Code Max for individual power users who run multi-file refactors frequently.
10+ person team: Evaluate based on workflow. Teams with strong CI/CD automation often find terminal-first tools (Claude Code, Aider) more valuable because they compose naturally with existing scripts. Teams where developers are skeptical of workflow changes often adopt faster via Cursor’s in-editor experience.
Platform team / tooling team: Claude Code is the strongest choice for teams that build internal tools, maintain monorepos, or run automated code modification pipelines. The terminal interface and AGENTS.md / CLAUDE.md support are designed for exactly this use case.
8. FAQ
Which tool is the “best” overall?
For the specific task mix in this benchmark — weighted toward multi-file complexity and root-cause accuracy — Claude Code scored highest on cost per successful task and quality. But “best overall” is meaningless without a use case. A developer who wants inline diff review and works in VS Code all day will get more value from Cursor. A developer who needs to switch between Claude and GPT-4o models should start with Aider.
Are these numbers accurate for my codebase?
Probably not exactly, but the relative ordering should hold. The absolute token counts will differ based on your file sizes, language, and how much context each tool needs to load. The relative performance gaps — especially Claude Code’s multi-file consistency advantage — are consistent across codebases of similar complexity.
How often do these tools’ prices change?
Frequently. Anthropic, OpenAI, and Cursor have all adjusted pricing multiple times in 2025-2026. The costs in this report reflect pricing as of April 2026. Before making a budget decision, verify current pricing on each provider’s official pricing page.
Does Aider perform better with Claude models vs. GPT-4o?
Yes, substantially. Aider pointed at Claude Sonnet achieves quality scores closer to Claude Code than to Aider-on-GPT-4o. The tradeoff is that you are essentially paying for Claude API tokens through Aider’s interface — at which point you should evaluate whether Claude Code’s native features (CLAUDE.md, better diff handling, /compact) justify using it directly instead.
Can I use multiple tools together?
Yes, and many professional developers do. A common pattern is: Cursor for interactive exploratory coding sessions, Claude Code for large refactors or automated tasks, and Aider for occasional cross-model experimentation. These tools do not conflict, and they all work with the same codebase through standard git workflows.
What about GitHub Copilot?
GitHub Copilot was not included in this benchmark for two reasons. First, its primary interaction mode (inline autocomplete) is not directly comparable to the explicit-prompt task execution that the other four tools use. Second, Copilot’s new “agent mode” (Copilot Workspace) was in limited preview at time of testing and not available for structured benchmarking. A future benchmark comparing Copilot Workspace to Cursor and Claude Code would be worthwhile.
How will these rankings change as models improve?
The model is not fixed for any of these tools. Cursor has already switched backend models twice in 2025; Aider users can update their model choice at any time; Claude Code tracks Anthropic model releases automatically. The ranking will shift as the underlying models improve, and the gaps between tools may narrow as all providers adopt stronger models.
What is the single most important thing to set up before using any of these tools?
An AGENTS.md (or CLAUDE.md for Claude Code) at your repo root that describes: your test command, your lint command, any conventions you care about, and paths the tool should not touch. This single file will improve output quality more than any other configuration step, regardless of which tool you use.
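As a starting point, a minimal sketch of such a file might look like the following. The commands, conventions, and paths are placeholders to adapt to your project, not recommendations from this benchmark.

```markdown
# AGENTS.md (example skeleton; adapt commands and paths to your repo)

## Commands
- Test: `npm test`
- Lint: `npm run lint`
- Typecheck: `npm run typecheck`

## Conventions
- TypeScript strict mode; no new `any`.
- Every new endpoint gets a Zod schema and a unit test.

## Do not touch
- `prisma/migrations/` (generated)
- `legacy/` (scheduled for removal)
```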
Appendix: Benchmark Data Summary
| Task | Claude Code | Cursor | Aider | Codex CLI |
|---|---|---|---|---|
| New Feature — Time | 8m 40s | 11m 20s | 13m 50s | 12m 30s |
| New Feature — Quality | 4.3 | 4.1 | 3.8 | 3.6 |
| Bug Fix — Time | 6m 15s | 7m 50s | 9m 20s | 11m 10s |
| Bug Fix — Quality | 4.6 | 4.4 | 4.0 | 3.5 |
| Refactor — Time | 18m 30s | 22m 40s | 31m 20s | 28m 50s |
| Refactor — Tests Passing | 340/340 | 338/340 | 331/340 | 318/340 |
| Docs Gen — Accuracy | 94% | 91% | 87% | 83% |
| Test Writing — Coverage | 93.4% | 91.2% | 89.7% | 87.3% |
| Overall Quality Avg | 4.46 | 4.26 | 3.92 | 3.70 |
All measurements are averages of three runs. Quality scores are averages of two independent reviewers on a 1-5 scale.
Methodology, raw data, and task specifications available in the GitHub repository for this benchmark.