Composite score
Correctness-first overall ranking across the full benchmark set.
ORPT-Bench measures task completion, efficiency, time, and cost across a fixed repair-oriented suite. This view leads with the benchmark's core answers: who finishes the work, who delivers value, and what you pay for it.
The first row answers the three questions that matter most: whether the model can finish the work, whether it does so efficiently, and how the blended ranking shakes out.
Correctness-first overall ranking across the full benchmark set.
How often each model actually closes tasks, independent of efficiency.
Lower is better: fewer requests per successful task.
Completion versus total benchmark cost shows who actually clears the suite and what it costs to get there.
Who clears the suite most completely for the least total benchmark spend.
Upper-left is the best observed frontier: higher quality for less total benchmark spend.
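If you want to recompute that frontier offline, here is a minimal sketch of the upper-left dominance test on (total cost, success rate) pairs. The values are transcribed from the leaderboard below; the dominance rule (no other model is both cheaper and more successful) is an assumption about how the chart is drawn, not a published formula.

```python
# Minimal sketch: upper-left Pareto frontier over (total cost, success rate).
# Values transcribed from the leaderboard below; the dominance rule is an
# assumption about how the chart is drawn, not a published formula.
runs = [
    ("opencode/gpt-5.4-nano", 0.4215, 0.85),
    ("opencode/kimi-k2.5", 0.9122, 0.89),
    ("opencode/claude-opus-4-6", 21.8757, 0.89),
    ("opencode/glm-5", 6.4339, 0.78),
    ("opencode/big-pickle", 0.0000, 0.67),
]

def on_frontier(candidate, field):
    cost, success = candidate[1], candidate[2]
    # Dominated if another run is at least as cheap and at least as
    # successful, with a strict improvement on one axis.
    return not any(
        other[1] <= cost and other[2] >= success
        and (other[1] < cost or other[2] > success)
        for other in field if other is not candidate
    )

print([r[0] for r in runs if on_frontier(r, runs)])
# -> gpt-5.4-nano, kimi-k2.5, and big-pickle survive on this subset
```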
The top-ranked cohort with the columns you actually need for a quick decision.
| # | Model | Composite | Success | ORPT | Cost | Wall time | Status |
|---|---|---|---|---|---|---|---|
| 1 | opencode/gpt-5.4-nano | 0.789 | 85% | 15.17 | $0.4215 | 27m 33s | Comparable |
| 2 | opencode/kimi-k2.5 | 0.785 | 89% | 14.25 | $0.9122 | 41m 05s | Comparable |
| 3 | opencode/claude-opus-4-6 | 0.670 | 89% | 14.88 | $21.8757 | 40m 04s | Comparable |
| 4 | opencode/glm-5 | 0.623 | 78% | 11.57 | $6.4339 | 20m 10s | Comparable |
| 5 | opencode/big-pickle | 0.615 | 67% | 15.39 | $0.0000 | 36m 28s | Comparable |
| 6 | opencode/gpt-5.4 | 0.609 | 78% | 11.00 | $8.9827 | 32m 47s | Comparable |
| 7 | opencode/claude-sonnet-4-6 | 0.593 | 78% | 16.43 | $11.8406 | 42m 31s | Comparable |
| 8 | opencode/glm-5.1 | 0.547 | 67% | 12.06 | $1.8816 | 64m 39s | Comparable |
| 9 | opencode/minimax-m2.5 | 0.481 | 56% | 18.87 | $0.6413 | 32m 15s | Comparable |
| 10 | opencode/gpt-5.4-mini | 0.425 | 48% | 9.54 | $1.0606 | 21m 48s | Comparable |
Sortable tables keep the current benchmark state inspectable without forcing users through raw markdown or buried artifacts.
Use this table for the main leaderboard view across models in the latest run.
| Model | Composite | Success | DNF | ORPT | Requests | Wall time | Cost | Avg task cost | Status |
|---|---|---|---|---|---|---|---|---|---|
| opencode/gpt-5.4-nano | 0.789 | 85% | 0 | 15.17 | 413 | 27m 33s | $0.4215 | $0.0153 | Comparable |
| opencode/kimi-k2.5 | 0.785 | 89% | 2 | 14.25 | 375 | 41m 05s | $0.9122 | $0.0372 | Comparable |
| opencode/claude-opus-4-6 | 0.670 | 89% | 0 | 14.88 | 418 | 40m 04s | $21.8757 | $0.7991 | Comparable |
| opencode/glm-5 | 0.623 | 78% | 0 | 11.57 | 280 | 20m 10s | $6.4339 | $0.2653 | Comparable |
| opencode/big-pickle | 0.615 | 67% | 1 | 15.39 | 414 | 36m 28s | $0.0000 | n/a | Comparable |
| opencode/gpt-5.4 | 0.609 | 78% | 0 | 11.00 | 280 | 32m 47s | $8.9827 | $0.3307 | Comparable |
| opencode/claude-sonnet-4-6 | 0.593 | 78% | 1 | 16.43 | 403 | 42m 31s | $11.8406 | $0.4739 | Comparable |
| opencode/glm-5.1 | 0.547 | 67% | 4 | 12.06 | 286 | 64m 39s | $1.8816 | $0.0909 | Comparable |
| opencode/minimax-m2.5 | 0.481 | 56% | 1 | 18.87 | 417 | 32m 15s | $0.6413 | $0.0302 | Comparable |
| opencode/gpt-5.4-mini | 0.425 | 48% | 0 | 9.54 | 264 | 21m 48s | $1.0606 | $0.0386 | Comparable |
| opencode/minimax-m2.5-free | 0.415 | 59% | 4 | 16.19 | 475 | 41m 34s | $0.0000 | n/a | Limited |
| opencode/gemini-3-flash | 0.415 | 59% | 4 | 21.81 | 508 | 62m 52s | $2.4307 | $0.1217 | Limited |
| opencode/gemini-3.1-pro | 0.291 | 37% | 3 | 12.70 | 307 | 51m 25s | $5.8536 | $0.2683 | Comparable |
| opencode/nemotron-3-super-free | 0.181 | 26% | 12 | 19.43 | 502 | 109m 00s | $0.0000 | n/a | Limited |
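As a quick value check, the sketch below derives dollars per solved task from the columns above, assuming the 27-task suite listed in the per-task table. It will not exactly reproduce the published Avg task cost column, whose denominator (attempts versus solves, DNF handling) is not documented on this page.

```python
# Hedged sketch: dollars per solved task, assuming a 27-task suite (the
# per-task table below lists 27 tasks). The published "Avg task cost"
# column may use a different denominator (attempts vs. solves, DNFs).
TASKS = 27

rows = [
    ("opencode/gpt-5.4-nano", 0.4215, 0.85),
    ("opencode/kimi-k2.5", 0.9122, 0.89),
    ("opencode/claude-opus-4-6", 21.8757, 0.89),
    ("opencode/gpt-5.4-mini", 1.0606, 0.48),
]

for model, total_cost, success_rate in rows:
    solved = round(success_rate * TASKS)
    print(f"{model}: {solved} solved, ${total_cost / solved:.4f} per solved task")
```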
One row per task. See who won, how wide the gap was, whether the field agreed, and what the fastest successful path actually cost.
| Task | Category | Winner (composite) | Runner-up (composite) | Gap | Completion | Cheapest win | Fastest win | Best ORPT | Field read |
|---|---|---|---|---|---|---|---|---|---|
| SELinux registry volume label repair | linux-hardening | opencode/kimi-k2.5 (1.0) | opencode/glm-5.1 (0.807) | 0.193 | 6/14 (43%) | $0.0254 | 42s | 9.00 | Clear separation |
| Kubernetes rollout repair | iac | opencode/gpt-5.4-mini (1.0) | opencode/kimi-k2.5 (0.823) | 0.177 | 6/14 (43%) | $0.0292 | 33s | 8.00 | Clear separation |
| RHEL k3s node preparation repair | kubernetes | opencode/gpt-5.4-nano (1.0) | opencode/big-pickle (0.857) | 0.143 | 5/14 (36%) | $0.0120 | 38s | 13.00 | Competitive split |
| Docker Compose observability fix | docker-compose | opencode/gpt-5.4-nano (0.975) | opencode/kimi-k2.5 (0.836) | 0.139 | 3/14 (21%) | $0.0286 | 2m 08s | 19.00 | Competitive split |
| nftables router ingress repair | networking | opencode/gpt-5.4-nano (0.98) | opencode/minimax-m2.5 (0.849) | 0.131 | 6/14 (43%) | $0.0084 | 27s | 6.00 | Competitive split |
| Terraform static site repair | terraform | opencode/kimi-k2.5 (0.978) | opencode/minimax-m2.5 (0.862) | 0.116 | 14/14 (100%) | $0.0152 | 28s | 5.00 | Competitive split |
| K3s registry mirror trust repair | kubernetes | opencode/big-pickle (1.0) | opencode/minimax-m2.5 (0.906) | 0.094 | 14/14 (100%) | $0.0089 | 17s | 5.00 | Competitive split |
| Bootstrap phase validation repair | platform-bootstrap | opencode/kimi-k2.5 (0.993) | opencode/minimax-m2.5 (0.9) | 0.093 | 6/14 (43%) | $0.0380 | 1m 20s | 12.00 | Competitive split |
| Event status shell summary | scripting | opencode/big-pickle (1.0) | opencode/gpt-5.4-mini (0.911) | 0.089 | 8/14 (57%) | $0.0089 | 17s | 6.00 | Competitive split |
| ExternalDNS RFC2136 repair | kubernetes | opencode/kimi-k2.5 (0.982) | opencode/big-pickle (0.893) | 0.089 | 7/14 (50%) | $0.0299 | 34s | 11.00 | Competitive split |
| RHEL NetworkManager bridge VLAN repair | networking | opencode/gpt-5.4-nano (0.951) | opencode/kimi-k2.5 (0.872) | 0.08 | 11/14 (79%) | $0.0090 | 18s | 8.00 | Competitive split |
| Build workspace plane convergence | gitops | opencode/gpt-5.4-nano (0.942) | opencode/gpt-5.4-mini (0.863) | 0.079 | 13/14 (93%) | $0.0149 | 28s | 10.00 | Competitive split |
| Kubernetes OIDC RBAC repair | kubernetes | opencode/gpt-5.4-nano (0.95) | opencode/kimi-k2.5 (0.885) | 0.066 | 9/14 (64%) | $0.0172 | 56s | 12.00 | Competitive split |
| Workspace transplant bundle repair | scripting | opencode/big-pickle (0.985) | opencode/gpt-5.4-nano (0.924) | 0.06 | 13/14 (93%) | $0.0127 | 25s | 8.00 | Competitive split |
| RHEL edge firewalld router repair | networking | opencode/gpt-5.4-nano (0.953) | opencode/minimax-m2.5 (0.904) | 0.049 | 7/14 (50%) | $0.0210 | 1m 10s | 10.00 | Competitive split |
| CNPG restore manifest repair | kubernetes | opencode/big-pickle (0.964) | opencode/minimax-m2.5 (0.915) | 0.049 | 12/14 (86%) | $0.0128 | 25s | 8.00 | Competitive split |
| Workspace runtime access convergence | gitops | opencode/gpt-5.4-nano (0.932) | opencode/kimi-k2.5 (0.884) | 0.047 | 8/14 (57%) | $0.0301 | 1m 00s | 16.00 | Competitive split |
| Ansible nginx role completion | ansible | opencode/big-pickle (0.963) | opencode/kimi-k2.5 (0.918) | 0.045 | 12/14 (86%) | $0.0131 | 22s | 7.00 | Competitive split |
| Log level rollup shell script | scripting | opencode/big-pickle (0.965) | opencode/gpt-5.4-nano (0.921) | 0.044 | 11/14 (79%) | $0.0113 | 29s | 7.00 | Competitive split |
| Pre-ArgoCD bootstrap sequencing | platform-bootstrap | opencode/gpt-5.4-nano (0.967) | opencode/big-pickle (0.932) | 0.035 | 7/14 (50%) | $0.0202 | 1m 13s | 17.00 | Competitive split |
| AppArmor dnsmasq profile repair | linux-hardening | opencode/gpt-5.4-nano (0.918) | opencode/minimax-m2.5 (0.897) | 0.021 | 10/14 (71%) | $0.0145 | 1m 01s | 9.00 | Competitive split |
| GitOps workspace render validation | gitops | opencode/big-pickle (0.941) | opencode/minimax-m2.5 (0.921) | 0.02 | 13/14 (93%) | $0.0159 | 29s | 9.00 | Competitive split |
| Log audit shell script | scripting | opencode/gpt-5.4-nano (0.935) | opencode/gpt-5.4-mini (0.915) | 0.02 | 6/14 (43%) | $0.0107 | 31s | 7.00 | Competitive split |
| MCP OpenBao contract repair | gitops | opencode/big-pickle (0.954) | opencode/minimax-m2.5 (0.939) | 0.015 | 13/14 (93%) | $0.0156 | 33s | 9.00 | Competitive split |
| Traefik forwarded header trust repair | kubernetes | opencode/kimi-k2.5 (0.913) | opencode/gpt-5.4-nano (0.904) | 0.009 | 9/14 (64%) | $0.0176 | 48s | 11.00 | Competitive split |
| Wildcard TLS route coverage | gitops | opencode/kimi-k2.5 (0.929) | opencode/big-pickle (0.925) | 0.004 | 10/14 (71%) | $0.0159 | 37s | 9.00 | Competitive split |
| MetalLB ingress address pool repair | kubernetes | opencode/gpt-5.4-nano (0.928) | opencode/big-pickle (0.927) | 0.001 | 8/14 (57%) | $0.0152 | 53s | 10.00 | Competitive split |
Each task row shows exactly how the published models compare on composite score, success, requests, cost, and time for that benchmark target.
Use this when you need to know not just who won overall, but where they paid for it and where they failed.
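The Gap and Field read columns are simple derivations from the two leading composites. Below is a sketch of one consistent reading; the 0.15 cutoff is an assumption inferred from the published rows, where gaps of 0.177 and above are labeled Clear separation and gaps of 0.143 and below are labeled Competitive split.

```python
# Sketch of the Gap / Field read derivation. The 0.15 cutoff is an
# assumption: published rows put "Clear separation" at gaps >= 0.177 and
# "Competitive split" at gaps <= 0.143, so the real threshold lies between.
GAP_CUTOFF = 0.15

def field_read(winner: float, runner_up: float) -> str:
    gap = winner - runner_up
    return "Clear separation" if gap >= GAP_CUTOFF else "Competitive split"

print(field_read(1.0, 0.807))  # SELinux task, gap 0.193 -> Clear separation
print(field_read(1.0, 0.857))  # RHEL k3s task, gap 0.143 -> Competitive split
```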
After the top-line results, these sections answer the next useful questions: who leads each tradeoff, where models separate, and what the benchmark composition looks like.
Task-level pairwise wins show whether the leaderboard leader actually dominates the suite or just edges ahead on aggregate.
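To make that concrete, a minimal sketch that counts, for each ordered model pair, the tasks where one outscored the other on composite. The `scores` layout and the second task's kimi-k2.5 value are toy inputs invented for illustration, not this page's published data format.

```python
from collections import defaultdict

# Toy sketch of a task-level pairwise win matrix. The `scores` layout
# (task -> model -> composite) and some values are invented inputs for
# illustration, not this page's published data format.
scores = {
    "terraform-static-site": {"opencode/kimi-k2.5": 0.978, "opencode/minimax-m2.5": 0.862},
    "k3s-registry-mirror":   {"opencode/kimi-k2.5": 0.850, "opencode/minimax-m2.5": 0.906},
}

wins = defaultdict(int)
for per_task in scores.values():
    for a in per_task:
        for b in per_task:
            if a != b and per_task[a] > per_task[b]:
                wins[(a, b)] += 1

print(dict(wins))  # each model takes one task in this toy split
```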
A benchmark is only meaningful if you can see what kinds of tasks dominate the signal.
Observed token mix across the suite helps explain whether a model is spending heavily on reasoning, cached context, or generation.
| Model | Input | Output | Reasoning | Cache Read | Cache Write |
|---|---|---|---|---|---|
| opencode/gpt-5.4-nano | 565,422 | 87,985 | 69,555 | 5,576,704 | 0 |
| opencode/kimi-k2.5 | 464,664 | 85,749 | 0 | 4,702,208 | 0 |
| opencode/claude-opus-4-6 | 27 | 8,130 | 0 | 536,949 | 13,015 |
| opencode/glm-5 | 1,000,890 | 92,693 | 0 | 5,136,416 | 0 |
| opencode/big-pickle | 540,092 | 86,989 | 0 | 5,246,048 | 0 |
| opencode/gpt-5.4 | 29,395 | 4,788 | 1,290 | 419,200 | 0 |
| opencode/claude-sonnet-4-6 | 24 | 5,574 | 0 | 464,850 | 10,992 |
| opencode/glm-5.1 | 577,858 | 63,642 | 0 | 3,048,416 | 0 |
| opencode/minimax-m2.5 | 678,183 | 82,592 | 0 | 5,644,934 | 0 |
| opencode/gpt-5.4-mini | 450,597 | 63,992 | 44,364 | 3,133,952 | 0 |
| opencode/minimax-m2.5-free | 471,764 | 77,044 | 0 | 4,840,912 | 0 |
| opencode/gemini-3-flash | 2,966,158 | 47,085 | 182,101 | 5,200,440 | 0 |
| opencode/gemini-3.1-pro | 1,363,843 | 33,058 | 174,127 | 3,198,507 | 0 |
| opencode/nemotron-3-super-free | 3,000,485 | 20,944 | 17,835 | 0 | 0 |
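To read the mix quantitatively, the sketch below computes each model's cache-read share of total observed tokens from rows transcribed above. The share formula is a plain arithmetic choice, not a published ORPT-Bench metric.

```python
# Cache-read share of total observed tokens, from rows transcribed above.
# share = cache_read / (input + output + reasoning + cache_read + cache_write)
# -- a plain arithmetic reading, not a published ORPT-Bench metric.
mix = {
    "opencode/gpt-5.4-nano": (565_422, 87_985, 69_555, 5_576_704, 0),
    "opencode/gemini-3-flash": (2_966_158, 47_085, 182_101, 5_200_440, 0),
    "opencode/nemotron-3-super-free": (3_000_485, 20_944, 17_835, 0, 0),
}

for model, cols in mix.items():
    print(f"{model}: {cols[3] / sum(cols):.0%} of observed tokens read from cache")
```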
The rest of the visual analysis sits below the primary answers.
How often each model actually closes tasks, independent of efficiency.
Lower is better: fewer requests per successful task.
Total spend per model across the published full run.
Observed ORPT-Bench quality against catalog blended price per 1M tokens.
This compares observed benchmark quality with catalog pricing priors rather than run spend alone.
Lower is better: total elapsed model runtime across benchmark tasks.
Total request units consumed across the published run.
How much benchmark spend went to solved tasks versus failed attempts.
The faded segment is spend burned on failed tasks; the solid segment is spend attached to solved tasks.
Input, output, reasoning, and cache token mix by model across the suite.
Use this to see whether a model spends proportionally on reasoning, generation, or cached context.
Average composite score by benchmark category and model.
Rows expose category strengths and blind spots that disappear in a single top-line score.
Average success rate by task difficulty and model.
This isolates whether a model falls apart as task difficulty rises.
Per-task comparative quality. Higher is better.
This is the most detailed quality comparison on the page: every task, every model, one glance.
Per-task average cost by model. Lower is better.
Use this to find where a model is overpaying for equivalent or worse outcomes.
Per-task average duration by model. Lower is better.
Use this to spot slow tasks and whether the delay tracks with better outcomes or wasted effort.
Published benchmark results paired with catalog metadata make it easier to understand whether a model is cheap, fast, stable, or just happened to land higher this run.
Primary blended price derived automatically from OpenRouter listing openai/gpt-5.4-nano using a 3:1 input:output blend.
Primary blended price derived automatically from OpenRouter listing moonshotai/kimi-k2.5 using a 3:1 input:output blend.
OpenRouter reference blend for anthropic/claude-opus-4.6-fast is 60 USD per 1M tokens using a 3:1 input:output mix.
OpenRouter reference blend for z-ai/glm-5 is 1.115 USD per 1M tokens using a 3:1 input:output mix.
No trustworthy automatic pricing reference found yet, so cost is currently unknown.
OpenRouter reference blend for openai/gpt-5.4 is 5.625 USD per 1M tokens using a 3:1 input:output mix.
OpenRouter reference blend for anthropic/claude-opus-4.6-fast is 60 USD per 1M tokens using a 3:1 input:output mix.
Primary blended price derived automatically from OpenRouter listing z-ai/glm-5.1 using a 3:1 input:output blend.
Primary blended price derived automatically from OpenRouter listing minimax/minimax-m2.5 using a 3:1 input:output blend.
Primary blended price derived automatically from OpenRouter listing openai/gpt-5.4-mini using a 3:1 input:output blend.
Observed to complete ORPT-Bench scripting smoke runs cleanly and is the current preferred headless dev baseline.
Primary blended price derived automatically from OpenRouter listing minimax/minimax-m2.5:free using a 3:1 input:output blend. Reference price uses minimax/minimax-m2.5 at 0.336 USD per 1M tokens from the same OpenRouter family.
Observed to trigger external_directory permission prompts in headless runs.
OpenRouter reference blend for google/gemini-3-flash-preview is 1.125 USD per 1M tokens using a 3:1 input:output mix.
Observed to loop and hit the benchmark process deadline on the task-05 scripting smoke run, so it should not be in the default headless dev matrix.
OpenRouter reference blend for google/gemini-3.1-pro-preview is 4.5 USD per 1M tokens using a 3:1 input:output mix.
OpenRouter reference blend for nvidia/nemotron-3-super-120b-a12b:free is 0 USD per 1M tokens using a 3:1 input:output mix. Reference price uses nvidia/nemotron-3-super-120b-a12b at 0.2 USD per 1M tokens from the same OpenRouter family.
Observed to take a slow, tool-heavy path on the scripting smoke task.
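The 3:1 blend used throughout these notes weights the input price three times the output price. A worked sketch follows; the $2.50 / $15.00 pair is a hypothetical illustration, not a quoted listing.

```python
# 3:1 input:output blended price, as used in the pricing notes above.
# The per-token prices here are hypothetical, not quoted listings.
def blended_price(input_per_1m: float, output_per_1m: float) -> float:
    return (3 * input_per_1m + output_per_1m) / 4

print(blended_price(2.50, 15.00))  # -> 5.625 USD per 1M tokens
```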
Less frequently used detail stays available, but out of the critical path.
Average quality, speed, and cost by benchmark category.
How performance changes as the suite moves from control to expert tasks.
These runs are not mixed into the main full-run leaderboard, but they are intentionally published for transparency, including provider-side failures, timeout behavior, and raw verifier evidence.
| Model | Outcome | Control Tasks | Requests | Cost | Recommendation |
|---|---|---|---|---|---|
| opencode/minimax-m2.5-free | failed smoke | 1/3 | 24 | unknown | retry |
| opencode/qwen3.6-plus-free | provider-model-not-found smoke | 0/3 | 0 | unknown | wait for provider |
| opencode/trinity-large-preview-free | provider-model-not-found smoke | 0/3 | 0 | unknown | wait for provider |
| opencode/gpt-5.4-nano | failed smoke | 2/3 | 31 | $0.0019 | block |
| opencode/gpt-5.1-codex-mini | failed smoke | 0/3 | 19 | unknown | retry |
| opencode/gemini-3-flash | failed smoke | 0/3 | 13 | unknown | retry |
| opencode/big-pickle | passed smoke | 3/3 | 28 | $0.0000 | promote |
| opencode/kimi-k2.5 | passed smoke | 3/3 | 28 | $0.0058 | promote |
| opencode/glm-5 | failed smoke | 0/3 | 14 | unknown | retry |
| opencode/glm-5.1 | failed smoke | 1/3 | 22 | $0.0038 | retry |
| opencode/minimax-m2.5 | failed smoke | 2/3 | 32 | $0.0126 | block |
| opencode/gemini-3-flash | failed smoke | 1/3 | 20 | $0.0023 | retry |
| opencode/glm-5.1 | failed smoke | 0/1 | 0 | unknown | retry |
| opencode/minimax-m2.5 | passed smoke | 1/1 | 8 | $0.0039 | promote |
| opencode/glm-5.1 | failed smoke | 0/2 | 17 | unknown | retry |
| opencode/minimax-m2.5 | passed smoke | 1/1 | 8 | $0.0039 | promote |
| opencode/glm-5 | failed smoke | 0/3 | 4 | unknown | retry |
| opencode/glm-5.1 | failed smoke | 0/3 | 17 | unknown | retry |
| opencode/minimax-m2.5 | failed smoke | 2/3 | 32 | $0.0088 | retry |
| opencode/claude-haiku-4-5 | failed smoke | 2/3 | 62 | $0.0077 | retry |
| opencode/glm-5.1 | failed smoke | 0/1 | 3 | unknown | retry |
| opencode/glm-5.1 | failed smoke | 0/3 | 18 | unknown | retry |
| opencode/minimax-m2.5 | passed smoke | 3/3 | 27 | $0.0124 | promote |
| opencode/claude-haiku-4-5 | failed smoke | 2/3 | 48 | $0.0082 | retry |
| opencode/big-pickle | failed smoke | 1/3 | 23 | unknown | retry |
| opencode/claude-haiku-4-5 | failed smoke | 2/3 | 55 | $0.0080 | retry |
| opencode/glm-5.1 | failed smoke | 0/3 | 7 | unknown | retry |
| opencode/glm-5.1 | failed smoke | 0/3 | 10 | unknown | retry |
| opencode/claude-3-5-haiku | provider-http-error smoke | 0/3 | 3 | unknown | wait for provider |
| opencode/claude-haiku-4-5 | failed smoke | 2/3 | 44 | $0.0071 | retry |
| opencode/big-pickle | failed smoke | 2/3 | 21 | unknown | retry |
| opencode/gemini-3.1-pro | failed smoke | 0/3 | 3 | unknown | retry |
| opencode/minimax-m2.5 | failed smoke | 0/3 | 30 | $0.0043 | retry |
| opencode/qwen3.6-plus-free | provider-model-not-found smoke | 0/3 | 0 | unknown | wait for provider |
| opencode/trinity-large-preview-free | provider-model-not-found smoke | 0/3 | 0 | unknown | wait for provider |
| opencode/gemini-3.1-pro | failed smoke | 0/3 | 4 | unknown | retry |
| opencode/minimax-m2.5 | failed smoke | 2/3 | 22 | $0.0082 | retry |
| opencode/qwen3.6-plus-free | provider-model-not-found smoke | 0/3 | 0 | unknown | wait for provider |
| opencode/trinity-large-preview-free | provider-model-not-found smoke | 0/3 | 0 | unknown | wait for provider |
| opencode/claude-haiku-4-5 | failed smoke | 0/1 | 26 | unknown | retry |
| opencode/claude-3-5-haiku | provider-http-error smoke | 0/1 | 1 | unknown | wait for provider |
| opencode/trinity-large-preview-free | provider-model-not-found smoke | 0/1 | 0 | unknown | wait for provider |
| opencode/qwen3.6-plus-free | provider-model-not-found smoke | 0/1 | 0 | unknown | wait for provider |
| opencode/minimax-m2.5 | failed smoke | 0/1 | 17 | unknown | retry |
| opencode/minimax-m2.1 | provider-model-not-found smoke | 0/1 | 0 | unknown | wait for provider |
| opencode/gpt-5.4-nano | failed smoke | 0/1 | 10 | unknown | retry |
| opencode/gpt-5.1-codex-mini | failed smoke | 0/1 | 6 | unknown | retry |
| opencode/gpt-5-nano | provider-limited smoke | 0/1 | 5 | unknown | wait for provider |
| opencode/gemini-3-pro | provider-model-not-found smoke | 0/1 | 0 | unknown | wait for provider |
| opencode/gemini-3.1-pro | failed smoke | 0/1 | 0 | unknown | retry |
| opencode/glm-5 | passed smoke | 1/1 | 20 | $0.0160 | promote |
| opencode/nemotron-3-super-free | provider-limited smoke | 0/1 | 1 | unknown | wait for provider |
| opencode/nemotron-3-super-free | provider-limited smoke | 0/1 | 1 | unknown | wait for provider |
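The recommendation column appears to follow a simple triage. The sketch below is one mapping consistent with most rows; the block-versus-retry split is an assumption, since the published rows suggest block also weighs repeated failures with real spend across snapshots.

```python
# Hedged sketch of the smoke-run triage visible in the table above.
# Provider-side outcomes map cleanly; the block/retry split for failed
# smokes is assumed to weigh repeat failures across snapshots.
def recommend(outcome: str, passed: int, total: int,
              repeat_offender: bool = False) -> str:
    if outcome.startswith("provider-"):
        return "wait for provider"
    if outcome == "passed smoke" and passed == total:
        return "promote"
    return "block" if repeat_offender else "retry"

print(recommend("passed smoke", 3, 3))                    # -> promote
print(recommend("provider-model-not-found smoke", 0, 3))  # -> wait for provider
print(recommend("failed smoke", 2, 3))                    # -> retry
```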
Snapshots are sorted by their top composite score by default. Use the raw JSON links for detailed offline analysis.