ORPT-Bench
OpenCode benchmark publication

Results first. Context when you need it.

ORPT-Bench measures task completion, efficiency, time, and cost across a fixed repair-oriented suite. This view now leads with the benchmark answers: who finishes the work, who delivers value, and what you pay for it.

Hint: every comparison table is sortable and starts in composite-score order.

What changed

Top-line model results and the cost frontier appear immediately. Reference material stays available below in expandable sections.

How to read the scores

Completion score measures raw task completion. Value score captures efficiency once a task is solved. Composite score blends the two with the published weighting.
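As a sketch of how the blend works (the 70/30 split below is an illustrative placeholder, not the published ORPT-Bench weighting):

```python
def composite(completion: float, value: float,
              w_completion: float = 0.7, w_value: float = 0.3) -> float:
    """Blend raw completion with post-solve efficiency.

    The weights here are illustrative placeholders; the benchmark
    publishes its own weighting.
    """
    return w_completion * completion + w_value * value

# A model completing 85% of tasks with 0.65 efficiency:
print(composite(0.85, 0.65))
```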
Executive Summary

The first row answers the three questions that matter most: can the model finish the work, does it do so efficiently, and how does the blended ranking shake out?

Cost Frontier

Completion versus total benchmark cost shows who actually clears the suite and what it costs to get there.
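The frontier itself is a Pareto filter over (completion, cost): a model stays on the frontier only if no other model completes at least as much of the suite for less money. A minimal sketch, using illustrative figures from the summary table:

```python
def cost_frontier(models):
    """Return the models not dominated on (success, cost).

    A model is dominated if another finishes at least as many tasks
    for strictly less money, or strictly more tasks for no more money.
    """
    frontier = []
    for name, success, cost in models:
        dominated = any(
            (s2 >= success and c2 < cost) or (s2 > success and c2 <= cost)
            for n2, s2, c2 in models if n2 != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Success rates and total run costs from the summary table:
runs = [
    ("opencode/gpt-5.4-nano", 0.85, 0.4215),
    ("opencode/kimi-k2.5", 0.89, 0.9122),
    ("opencode/claude-opus-4-6", 0.89, 21.8757),
    ("opencode/gpt-5.4-mini", 0.48, 1.0606),
]
print(cost_frontier(runs))
```

On these four rows, claude-opus-4-6 is dominated by kimi-k2.5 (equal success, far lower cost) and gpt-5.4-mini is dominated by gpt-5.4-nano, leaving nano and kimi on the frontier.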

Fast read leaderboard

Top 10 models

The top-ranked cohort with the columns you actually need for a quick decision.

# Model Composite Success ORPT Cost Wall time Status
1 opencode/gpt-5.4-nano 0.789 85% 15.17 $0.4215 27m 33s Comparable
2 opencode/kimi-k2.5 0.785 89% 14.25 $0.9122 41m 05s Comparable
3 opencode/claude-opus-4-6 0.67 89% 14.88 $21.8757 40m 04s Comparable
4 opencode/glm-5 0.623 78% 11.57 $6.4339 20m 10s Comparable
5 opencode/big-pickle 0.615 67% 15.39 $0.0000 36m 28s Comparable
6 opencode/gpt-5.4 0.609 78% 11.00 $8.9827 32m 47s Comparable
7 opencode/claude-sonnet-4-6 0.593 78% 16.43 $11.8406 42m 31s Comparable
8 opencode/glm-5.1 0.547 67% 12.06 $1.8816 64m 39s Comparable
9 opencode/minimax-m2.5 0.481 56% 18.87 $0.6413 32m 15s Comparable
10 opencode/gpt-5.4-mini 0.425 48% 9.54 $1.0606 21m 48s Comparable
Comparison Tables

Sortable tables keep the current benchmark state inspectable without forcing users through raw markdown or buried artifacts.

Hint: click any column header to re-sort. Composite score is the default.

Comparable cohort overview

Model summary

Use this table for the main leaderboard view across models in the latest run.

Model Composite Success DNF ORPT Requests Wall time Cost Avg task cost Status
opencode/gpt-5.4-nano 0.789 85% 0 15.17 413 27m 33s $0.4215 $0.0153 Comparable
opencode/kimi-k2.5 0.785 89% 2 14.25 375 41m 05s $0.9122 $0.0372 Comparable
opencode/claude-opus-4-6 0.67 89% 0 14.88 418 40m 04s $21.8757 $0.7991 Comparable
opencode/glm-5 0.623 78% 0 11.57 280 20m 10s $6.4339 $0.2653 Comparable
opencode/big-pickle 0.615 67% 1 15.39 414 36m 28s $0.0000 n/a Comparable
opencode/gpt-5.4 0.609 78% 0 11.00 280 32m 47s $8.9827 $0.3307 Comparable
opencode/claude-sonnet-4-6 0.593 78% 1 16.43 403 42m 31s $11.8406 $0.4739 Comparable
opencode/glm-5.1 0.547 67% 4 12.06 286 64m 39s $1.8816 $0.0909 Comparable
opencode/minimax-m2.5 0.481 56% 1 18.87 417 32m 15s $0.6413 $0.0302 Comparable
opencode/gpt-5.4-mini 0.425 48% 0 9.54 264 21m 48s $1.0606 $0.0386 Comparable
opencode/minimax-m2.5-free 0.415 59% 4 16.19 475 41m 34s $0.0000 n/a Limited
opencode/gemini-3-flash 0.415 59% 4 21.81 508 62m 52s $2.4307 $0.1217 Limited
opencode/gemini-3.1-pro 0.291 37% 3 12.70 307 51m 25s $5.8536 $0.2683 Comparable
opencode/nemotron-3-super-free 0.181 26% 12 19.43 502 109m 00s $0.0000 n/a Limited
Comparative task readout

Task insights

One row per task. See who won, how wide the gap was, whether the field agreed, and what the fastest successful path actually cost.

Task | Category | Winner | Runner-up | Gap | Completion | Cheapest win | Fastest win | Best ORPT | Field read
SELinux registry volume label repair | linux-hardening | opencode/kimi-k2.5 (1.0) | opencode/glm-5.1 (0.807) | 0.193 | 6/14 (43%) | $0.0254 | 42s | 9.00 | Clear separation
Kubernetes rollout repair | iac | opencode/gpt-5.4-mini (1.0) | opencode/kimi-k2.5 (0.823) | 0.177 | 6/14 (43%) | $0.0292 | 33s | 8.00 | Clear separation
RHEL k3s node preparation repair | kubernetes | opencode/gpt-5.4-nano (1.0) | opencode/big-pickle (0.857) | 0.143 | 5/14 (36%) | $0.0120 | 38s | 13.00 | Competitive split
Docker Compose observability fix | docker-compose | opencode/gpt-5.4-nano (0.975) | opencode/kimi-k2.5 (0.836) | 0.139 | 3/14 (21%) | $0.0286 | 2m 08s | 19.00 | Competitive split
nftables router ingress repair | networking | opencode/gpt-5.4-nano (0.98) | opencode/minimax-m2.5 (0.849) | 0.131 | 6/14 (43%) | $0.0084 | 27s | 6.00 | Competitive split
Terraform static site repair | terraform | opencode/kimi-k2.5 (0.978) | opencode/minimax-m2.5 (0.862) | 0.116 | 14/14 (100%) | $0.0152 | 28s | 5.00 | Competitive split
K3s registry mirror trust repair | kubernetes | opencode/big-pickle (1.0) | opencode/minimax-m2.5 (0.906) | 0.094 | 14/14 (100%) | $0.0089 | 17s | 5.00 | Competitive split
Bootstrap phase validation repair | platform-bootstrap | opencode/kimi-k2.5 (0.993) | opencode/minimax-m2.5 (0.9) | 0.093 | 6/14 (43%) | $0.0380 | 1m 20s | 12.00 | Competitive split
Event status shell summary | scripting | opencode/big-pickle (1.0) | opencode/gpt-5.4-mini (0.911) | 0.089 | 8/14 (57%) | $0.0089 | 17s | 6.00 | Competitive split
ExternalDNS RFC2136 repair | kubernetes | opencode/kimi-k2.5 (0.982) | opencode/big-pickle (0.893) | 0.089 | 7/14 (50%) | $0.0299 | 34s | 11.00 | Competitive split
RHEL NetworkManager bridge VLAN repair | networking | opencode/gpt-5.4-nano (0.951) | opencode/kimi-k2.5 (0.872) | 0.08 | 11/14 (79%) | $0.0090 | 18s | 8.00 | Competitive split
Build workspace plane convergence | gitops | opencode/gpt-5.4-nano (0.942) | opencode/gpt-5.4-mini (0.863) | 0.079 | 13/14 (93%) | $0.0149 | 28s | 10.00 | Competitive split
Kubernetes OIDC RBAC repair | kubernetes | opencode/gpt-5.4-nano (0.95) | opencode/kimi-k2.5 (0.885) | 0.066 | 9/14 (64%) | $0.0172 | 56s | 12.00 | Competitive split
Workspace transplant bundle repair | scripting | opencode/big-pickle (0.985) | opencode/gpt-5.4-nano (0.924) | 0.06 | 13/14 (93%) | $0.0127 | 25s | 8.00 | Competitive split
RHEL edge firewalld router repair | networking | opencode/gpt-5.4-nano (0.953) | opencode/minimax-m2.5 (0.904) | 0.049 | 7/14 (50%) | $0.0210 | 1m 10s | 10.00 | Competitive split
CNPG restore manifest repair | kubernetes | opencode/big-pickle (0.964) | opencode/minimax-m2.5 (0.915) | 0.049 | 12/14 (86%) | $0.0128 | 25s | 8.00 | Competitive split
Workspace runtime access convergence | gitops | opencode/gpt-5.4-nano (0.932) | opencode/kimi-k2.5 (0.884) | 0.047 | 8/14 (57%) | $0.0301 | 1m 00s | 16.00 | Competitive split
Ansible nginx role completion | ansible | opencode/big-pickle (0.963) | opencode/kimi-k2.5 (0.918) | 0.045 | 12/14 (86%) | $0.0131 | 22s | 7.00 | Competitive split
Log level rollup shell script | scripting | opencode/big-pickle (0.965) | opencode/gpt-5.4-nano (0.921) | 0.044 | 11/14 (79%) | $0.0113 | 29s | 7.00 | Competitive split
Pre-ArgoCD bootstrap sequencing | platform-bootstrap | opencode/gpt-5.4-nano (0.967) | opencode/big-pickle (0.932) | 0.035 | 7/14 (50%) | $0.0202 | 1m 13s | 17.00 | Competitive split
AppArmor dnsmasq profile repair | linux-hardening | opencode/gpt-5.4-nano (0.918) | opencode/minimax-m2.5 (0.897) | 0.021 | 10/14 (71%) | $0.0145 | 1m 01s | 9.00 | Competitive split
GitOps workspace render validation | gitops | opencode/big-pickle (0.941) | opencode/minimax-m2.5 (0.921) | 0.02 | 13/14 (93%) | $0.0159 | 29s | 9.00 | Competitive split
Log audit shell script | scripting | opencode/gpt-5.4-nano (0.935) | opencode/gpt-5.4-mini (0.915) | 0.02 | 6/14 (43%) | $0.0107 | 31s | 7.00 | Competitive split
MCP OpenBao contract repair | gitops | opencode/big-pickle (0.954) | opencode/minimax-m2.5 (0.939) | 0.015 | 13/14 (93%) | $0.0156 | 33s | 9.00 | Competitive split
Traefik forwarded header trust repair | kubernetes | opencode/kimi-k2.5 (0.913) | opencode/gpt-5.4-nano (0.904) | 0.009 | 9/14 (64%) | $0.0176 | 48s | 11.00 | Competitive split
Wildcard TLS route coverage | gitops | opencode/kimi-k2.5 (0.929) | opencode/big-pickle (0.925) | 0.004 | 10/14 (71%) | $0.0159 | 37s | 9.00 | Competitive split
MetalLB ingress address pool repair | kubernetes | opencode/gpt-5.4-nano (0.928) | opencode/big-pickle (0.927) | 0.001 | 8/14 (57%) | $0.0152 | 53s | 10.00 | Competitive split
Task Comparison Matrix

Each task row shows exactly how the published models compare on composite score, success, requests, cost, and time for that benchmark target.

Per-task model deltas

Task-by-task model matrix

Use this when you need to know not just who won overall, but where they paid for it and where they failed.

Models appear in this order throughout the matrix (suite composite | success):

opencode/gpt-5.4-nano: 0.789 | 85%
opencode/kimi-k2.5: 0.785 | 89%
opencode/claude-opus-4-6: 0.67 | 89%
opencode/glm-5: 0.623 | 78%
opencode/big-pickle: 0.615 | 67%
opencode/gpt-5.4: 0.609 | 78%
opencode/claude-sonnet-4-6: 0.593 | 78%
opencode/glm-5.1: 0.547 | 67%
opencode/minimax-m2.5: 0.481 | 56%
opencode/gpt-5.4-mini: 0.425 | 48%
opencode/minimax-m2.5-free: 0.415 | 59%
opencode/gemini-3-flash: 0.415 | 59%
opencode/gemini-3.1-pro: 0.291 | 37%
opencode/nemotron-3-super-free: 0.181 | 26%
Kubernetes rollout repair
01-iac-kubernetes-rollout | medium
opencode/gpt-5.4-nano: 0.0 | 0% success | 20 req | $0.0199 | 1m 24s | unscored
opencode/kimi-k2.5: 0.823 | 100% success | 21 req | $0.0495 | 2m 12s | scored
opencode/claude-opus-4-6: 0.0 | 0% success | 9 req | $0.7268 | 56s | unscored
opencode/glm-5: 0.819 | 100% success | 11 req | $0.2735 | 33s | scored
opencode/big-pickle: 0.0 | 0% success | 19 req | cost n/a | 59s | unscored
opencode/gpt-5.4: 0.791 | 100% success | 14 req | $0.3143 | 59s | scored
opencode/claude-sonnet-4-6: 0.0 | 0% success | 11 req | $0.3864 | 42s | unscored
opencode/glm-5.1: 0.798 | 100% success | 14 req | $0.1142 | 3m 58s | scored
opencode/minimax-m2.5: 0.0 | 0% success | 17 req | $0.0232 | 1m 01s | unscored
opencode/gpt-5.4-mini: 1.0 | 100% success | 8 req | $0.0292 | 34s | scored
opencode/minimax-m2.5-free: 0.7 | 100% success | 17 req | cost n/a | 1m 35s | scored
opencode/gemini-3-flash: 0.0 | 0% success | 9 req | $0.0523 | 1m 22s | unscored
opencode/gemini-3.1-pro: 0.0 | 0% success | 11 req | $0.1657 | 1m 35s | unscored
opencode/nemotron-3-super-free: 0.0 | 0% success | 24 req | cost n/a | 5m 00s | unscored
Terraform static site repair
02-terraform-static-site | high
opencode/gpt-5.4-nano: 0.822 | 100% success | 18 req | $0.0175 | 1m 50s | scored
opencode/kimi-k2.5: 0.978 | 100% success | 5 req | $0.0152 | 41s | scored
opencode/claude-opus-4-6: 0.755 | 100% success | 9 req | $0.6775 | 46s | scored
opencode/glm-5: 0.776 | 100% success | 12 req | $0.2463 | 28s | scored
opencode/big-pickle: 0.806 | 100% success | 23 req | cost n/a | 2m 45s | scored
opencode/gpt-5.4: 0.775 | 100% success | 10 req | $0.2631 | 43s | scored
opencode/claude-sonnet-4-6: 0.771 | 100% success | 9 req | $0.4026 | 34s | scored
opencode/glm-5.1: 0.816 | 100% success | 9 req | $0.0482 | 1m 55s | scored
opencode/minimax-m2.5: 0.862 | 100% success | 12 req | $0.0170 | 1m 11s | scored
opencode/gpt-5.4-mini: 0.789 | 100% success | 14 req | $0.0530 | 2m 17s | scored
opencode/minimax-m2.5-free: 0.7 | 100% success | 17 req | cost n/a | 1m 20s | scored
opencode/gemini-3-flash: 0.7 | 100% success | 9 req | $0.0568 | 1m 38s | scored
opencode/gemini-3.1-pro: 0.832 | 100% success | 5 req | $0.0872 | 1m 20s | scored
opencode/nemotron-3-super-free: 0.7 | 100% success | 14 req | cost n/a | 3m 58s | scored
Ansible nginx role completion
03-ansible-nginx-role | high
opencode/gpt-5.4-nano: 0.889 | 100% success | 14 req | $0.0131 | 48s | scored
opencode/kimi-k2.5: 0.918 | 100% success | 7 req | $0.0200 | 52s | scored
opencode/claude-opus-4-6: 0.747 | 100% success | 12 req | $0.7302 | 1m 02s | scored
opencode/glm-5: 0.832 | 100% success | 7 req | $0.1366 | 22s | scored
opencode/big-pickle: 0.963 | 100% success | 7 req | cost n/a | 43s | scored
opencode/gpt-5.4: 0.792 | 100% success | 7 req | $0.2740 | 41s | scored
opencode/claude-sonnet-4-6: 0.76 | 100% success | 13 req | $0.3992 | 42s | scored
opencode/glm-5.1: 0.841 | 100% success | 9 req | $0.0511 | 51s | scored
opencode/minimax-m2.5: 0.0 | 0% success | 14 req | $0.0196 | 51s | unscored
opencode/gpt-5.4-mini: 0.918 | 100% success | 7 req | $0.0285 | 28s | scored
opencode/minimax-m2.5-free: 0.7 | 100% success | 9 req | cost n/a | 1m 11s | scored
opencode/gemini-3-flash: 0.0 | 0% success | 11 req | cost n/a | 5m 00s | unscored
opencode/gemini-3.1-pro: 0.778 | 100% success | 10 req | $0.1680 | 1m 34s | scored
opencode/nemotron-3-super-free: 0.7 | 100% success | 25 req | cost n/a | 4m 41s | scored
Docker Compose observability fix
04-docker-compose-observability | high
opencode/gpt-5.4-nano: 0.975 | 100% success | 23 req | $0.0286 | 2m 08s | scored
opencode/kimi-k2.5: 0.836 | 100% success | 31 req | $0.0897 | 5m 00s | scored
opencode/claude-opus-4-6: 0.0 | 0% success | 42 req | $1.2297 | 4m 16s | unscored
opencode/glm-5: 0.0 | 0% success | 5 req | $0.0964 | 16s | unscored
opencode/big-pickle: 0.0 | 0% success | 9 req | cost n/a | 45s | unscored
opencode/gpt-5.4: 0.0 | 0% success | 7 req | $0.2825 | 46s | unscored
opencode/claude-sonnet-4-6: 0.0 | 0% success | 14 req | $0.4126 | 53s | unscored
opencode/glm-5.1: 0.0 | 0% success | 12 req | cost n/a | 5m 00s | unscored
opencode/minimax-m2.5: 0.0 | 0% success | 10 req | $0.0124 | 37s | unscored
opencode/gpt-5.4-mini: 0.0 | 0% success | 9 req | $0.0288 | 33s | unscored
opencode/minimax-m2.5-free: 0.0 | 0% success | 31 req | cost n/a | 5m 00s | unscored
opencode/gemini-3-flash: 0.0 | 0% success | 13 req | $0.0786 | 2m 26s | unscored
opencode/gemini-3.1-pro: 0.813 | 100% success | 19 req | $0.3295 | 3m 57s | scored
opencode/nemotron-3-super-free: 0.0 | 0% success | 17 req | cost n/a | 3m 26s | unscored
Log audit shell script
05-log-audit-script | control
opencode/gpt-5.4-nano: 0.935 | 100% success | 10 req | $0.0107 | 48s | scored
opencode/kimi-k2.5: 0.0 | 0% success | req n/a | cost n/a | 1m 15s | unscored
opencode/claude-opus-4-6: 0.763 | 100% success | 9 req | $0.6065 | 39s | scored
opencode/glm-5: 0.762 | 100% success | 15 req | $0.3671 | 31s | scored
opencode/big-pickle: 0.0 | 0% success | 12 req | cost n/a | 1m 15s | unscored
opencode/gpt-5.4: 0.768 | 100% success | 11 req | $0.2761 | 1m 06s | scored
opencode/claude-sonnet-4-6: 0.766 | 100% success | 11 req | $0.3782 | 45s | scored
opencode/glm-5.1: 0.0 | 0% success | 7 req | cost n/a | 1m 15s | unscored
opencode/minimax-m2.5: 0.0 | 0% success | 9 req | $0.0139 | 44s | unscored
opencode/gpt-5.4-mini: 0.915 | 100% success | 7 req | $0.0269 | 33s | scored
opencode/minimax-m2.5-free: 0.0 | 0% success | 12 req | cost n/a | 1m 15s | unscored
opencode/gemini-3-flash: 0.0 | 0% success | 10 req | cost n/a | 1m 15s | unscored
opencode/gemini-3.1-pro: 0.0 | 0% success | 8 req | cost n/a | 1m 15s | unscored
opencode/nemotron-3-super-free: 0.0 | 0% success | 8 req | cost n/a | 1m 15s | unscored
Kubernetes OIDC RBAC repair
06-kubernetes-oidc-rbac-repair | high
opencode/gpt-5.4-nano: 0.95 | 100% success | 16 req | $0.0172 | 1m 13s | scored
opencode/kimi-k2.5: 0.885 | 100% success | 17 req | $0.0380 | 1m 12s | scored
opencode/claude-opus-4-6: 0.765 | 100% success | 16 req | $0.8019 | 1m 13s | scored
opencode/glm-5: 0.0 | 0% success | 4 req | $0.1987 | 4m 43s | unscored
opencode/big-pickle: 0.0 | 0% success | 14 req | cost n/a | 1m 36s | unscored
opencode/gpt-5.4: 0.804 | 100% success | 12 req | $0.3245 | 1m 05s | scored
opencode/claude-sonnet-4-6: 0.769 | 100% success | 20 req | $0.5007 | 1m 15s | scored
opencode/glm-5.1: 0.0 | 0% success | 7 req | $0.0561 | 2m 01s | unscored
opencode/minimax-m2.5: 0.0 | 0% success | 9 req | $0.0135 | 57s | unscored
opencode/gpt-5.4-mini: 0.873 | 100% success | 15 req | $0.0626 | 56s | scored
opencode/minimax-m2.5-free: 0.7 | 100% success | 22 req | cost n/a | 2m 13s | scored
opencode/gemini-3-flash: 0.7 | 100% success | 25 req | $0.1416 | 2m 42s | scored
opencode/gemini-3.1-pro: 0.803 | 100% success | 14 req | $0.2299 | 1m 29s | scored
opencode/nemotron-3-super-free: 0.0 | 0% success | 27 req | cost n/a | 5m 00s | unscored
CNPG restore manifest repair
07-cnpg-restore-manifest-repair | high
opencode/gpt-5.4-nano: 0.915 | 100% success | 13 req | $0.0128 | 45s | scored
opencode/kimi-k2.5: 0.874 | 100% success | 11 req | $0.0261 | 54s | scored
opencode/claude-opus-4-6: 0.736 | 100% success | 18 req | $0.9118 | 1m 29s | scored
opencode/glm-5: 0.786 | 100% success | 12 req | $0.2697 | 25s | scored
opencode/big-pickle: 0.964 | 100% success | 9 req | cost n/a | 37s | scored
opencode/gpt-5.4: 0.773 | 100% success | 9 req | $0.3379 | 1m 12s | scored
opencode/claude-sonnet-4-6: 0.0 | 0% success | 9 req | $0.2644 | 5m 00s | unscored
opencode/glm-5.1: 0.855 | 100% success | 8 req | $0.0472 | 1m 11s | scored
opencode/minimax-m2.5: 0.915 | 100% success | 9 req | $0.0168 | 1m 03s | scored
opencode/gpt-5.4-mini: 0.833 | 100% success | 13 req | $0.0542 | 39s | scored
opencode/minimax-m2.5-free: 0.7 | 100% success | 15 req | cost n/a | 1m 27s | scored
opencode/gemini-3-flash: 0.7 | 100% success | 14 req | $0.0716 | 1m 09s | scored
opencode/gemini-3.1-pro: 0.766 | 100% success | 12 req | $0.2791 | 1m 33s | scored
opencode/nemotron-3-super-free: 0.0 | 0% success | 21 req | cost n/a | 5m 00s | unscored
Workspace transplant bundle repair
08-workspace-transplant-bundle-repair | high
opencode/gpt-5.4-nano: 0.924 | 100% success | 12 req | $0.0127 | 44s | scored
opencode/kimi-k2.5: 0.0 | 0% success | 26 req | cost n/a | 5m 00s | unscored
opencode/claude-opus-4-6: 0.744 | 100% success | 15 req | $0.7076 | 1m 24s | scored
opencode/glm-5: 0.823 | 100% success | 8 req | $0.1595 | 27s | scored
opencode/big-pickle: 0.985 | 100% success | 9 req | cost n/a | 25s | scored
opencode/gpt-5.4: 0.747 | 100% success | 16 req | $0.4056 | 2m 07s | scored
opencode/claude-sonnet-4-6: 0.748 | 100% success | 18 req | $0.4336 | 1m 24s | scored
opencode/glm-5.1: 0.823 | 100% success | 8 req | $0.0594 | 2m 26s | scored
opencode/minimax-m2.5: 0.848 | 100% success | 17 req | $0.0244 | 51s | scored
opencode/gpt-5.4-mini: 0.876 | 100% success | 10 req | $0.0344 | 38s | scored
opencode/minimax-m2.5-free: 0.7 | 100% success | 10 req | cost n/a | 51s | scored
opencode/gemini-3-flash: 0.7 | 100% success | 35 req | $0.1858 | 4m 10s | scored
opencode/gemini-3.1-pro: 0.799 | 100% success | 9 req | $0.1576 | 1m 00s | scored
opencode/nemotron-3-super-free: 0.7 | 100% success | 14 req | cost n/a | 2m 40s | scored
GitOps workspace render validation
09-gitops-workspace-render-validation | high
opencode/gpt-5.4-nano: 0.895 | 100% success | 17 req | $0.0159 | 1m 01s | scored
opencode/kimi-k2.5: 0.863 | 100% success | 13 req | $0.0347 | 1m 09s | scored
opencode/claude-opus-4-6: 0.747 | 100% success | 15 req | $0.9304 | 1m 17s | scored
opencode/glm-5: 0.824 | 100% success | 9 req | $0.1990 | 29s | scored
opencode/big-pickle: 0.941 | 100% success | 10 req | cost n/a | 1m 09s | scored
opencode/gpt-5.4: 0.78 | 100% success | 9 req | $0.3886 | 1m 20s | scored
opencode/claude-sonnet-4-6: 0.74 | 100% success | 22 req | $0.5677 | 3m 09s | scored
opencode/glm-5.1: 0.775 | 100% success | 18 req | $0.1239 | 2m 56s | scored
opencode/minimax-m2.5: 0.921 | 100% success | 12 req | $0.0193 | 50s | scored
opencode/gpt-5.4-mini: 0.0 | 0% success | 11 req | $0.0597 | 1m 02s | unscored
opencode/minimax-m2.5-free: 0.7 | 100% success | 15 req | cost n/a | 1m 22s | scored
opencode/gemini-3-flash: 0.7 | 100% success | 17 req | $0.0882 | 1m 28s | scored
opencode/gemini-3.1-pro: 0.808 | 100% success | 9 req | $0.1641 | 1m 21s | scored
opencode/nemotron-3-super-free: 0.7 | 100% success | 28 req | cost n/a | 4m 05s | scored
Bootstrap phase validation repair
10-bootstrap-phase-validation-repair | expert
opencode/gpt-5.4-nano: 0.0 | 0% success | 21 req | $0.0231 | 1m 21s | unscored
opencode/kimi-k2.5: 0.993 | 100% success | 12 req | $0.0380 | 1m 30s | scored
opencode/claude-opus-4-6: 0.793 | 100% success | 14 req | $0.8939 | 1m 20s | scored
opencode/glm-5: 0.0 | 0% success | 9 req | $0.1869 | 38s | unscored
opencode/big-pickle: 0.0 | 0% success | 22 req | cost n/a | 2m 08s | unscored
opencode/gpt-5.4: 0.0 | 0% success | 12 req | $0.4329 | 1m 27s | unscored
opencode/claude-sonnet-4-6: 0.787 | 100% success | 22 req | $0.5790 | 1m 24s | scored
opencode/glm-5.1: 0.831 | 100% success | 19 req | $0.1260 | 3m 36s | scored
opencode/minimax-m2.5: 0.9 | 100% success | 23 req | $0.0404 | 2m 06s | scored
opencode/gpt-5.4-mini: 0.0 | 0% success | 10 req | $0.0498 | 45s | unscored
opencode/minimax-m2.5-free: 0.0 | 0% success | 13 req | cost n/a | 1m 00s | unscored
opencode/gemini-3-flash: 0.7 | 100% success | 17 req | $0.0782 | 3m 18s | scored
opencode/gemini-3.1-pro: 0.0 | 0% success | 16 req | $0.3025 | 2m 19s | unscored
opencode/nemotron-3-super-free: 0.0 | 0% success | 25 req | cost n/a | 5m 00s | unscored
MCP OpenBao contract repair
11-mcp-openbao-contract-repair | expert
opencode/gpt-5.4-nano: 0.901 | 100% success | 17 req | $0.0156 | 58s | scored
opencode/kimi-k2.5: 0.874 | 100% success | 14 req | $0.0328 | 50s | scored
opencode/claude-opus-4-6: 0.747 | 100% success | 18 req | $0.7456 | 1m 22s | scored
opencode/glm-5: 0.811 | 100% success | 10 req | $0.2235 | 36s | scored
opencode/big-pickle: 0.954 | 100% success | 13 req | cost n/a | 33s | scored
opencode/gpt-5.4: 0.782 | 100% success | 10 req | $0.3493 | 1m 11s | scored
opencode/claude-sonnet-4-6: 0.759 | 100% success | 17 req | $0.5186 | 56s | scored
opencode/glm-5.1: 0.818 | 100% success | 9 req | $0.1051 | 2m 02s | scored
opencode/minimax-m2.5: 0.939 | 100% success | 11 req | $0.0187 | 47s | scored
opencode/gpt-5.4-mini: 0.916 | 100% success | 9 req | $0.0346 | 42s | scored
opencode/minimax-m2.5-free: 0.7 | 100% success | 22 req | cost n/a | 1m 50s | scored
opencode/gemini-3-flash: 0.7 | 100% success | 26 req | $0.1430 | 2m 20s | scored
opencode/gemini-3.1-pro: 0.746 | 100% success | 16 req | $0.4583 | 4m 54s | scored
opencode/nemotron-3-super-free: 0.0 | 0% success | 26 req | cost n/a | 5m 00s | unscored
Pre-ArgoCD bootstrap sequencing
12-pre-argocd-bootstrap-sequencing | expert
opencode/gpt-5.4-nano: 0.967 | 100% success | 22 req | $0.0202 | 1m 13s | scored
opencode/kimi-k2.5: 0.894 | 100% success | 17 req | $0.0420 | 2m 59s | scored
opencode/claude-opus-4-6: 0.762 | 100% success | 22 req | $0.8700 | 2m 28s | scored
opencode/glm-5: 0.0 | 0% success | 8 req | $0.1649 | 42s | unscored
opencode/big-pickle: 0.932 | 100% success | 17 req | cost n/a | 4m 23s | scored
opencode/gpt-5.4: 0.0 | 0% success | 9 req | $0.4099 | 1m 21s | unscored
opencode/claude-sonnet-4-6: 0.792 | 100% success | 20 req | $0.4631 | 1m 16s | scored
opencode/glm-5.1: 0.83 | 100% success | 17 req | $0.1335 | 2m 58s | scored
opencode/minimax-m2.5: 0.0 | 0% success | 9 req | $0.0171 | 57s | unscored
opencode/gpt-5.4-mini: 0.0 | 0% success | 11 req | $0.0416 | 47s | unscored
opencode/minimax-m2.5-free: 0.7 | 100% success | 18 req | cost n/a | 1m 32s | scored
opencode/gemini-3-flash: 0.0 | 0% success | 19 req | $0.1001 | 2m 03s | unscored
opencode/gemini-3.1-pro: 0.0 | 0% success | 17 req | $0.3456 | 2m 43s | unscored
opencode/nemotron-3-super-free: 0.0 | 0% success | 17 req | cost n/a | 5m 00s | unscored
Wildcard TLS route coverage
13-wildcard-tls-route-coverage | expert
opencode/gpt-5.4-nano: 0.916 | 100% success | 16 req | $0.0159 | 53s | scored
opencode/kimi-k2.5: 0.929 | 100% success | 9 req | $0.0262 | 1m 00s | scored
opencode/claude-opus-4-6: 0.761 | 100% success | 12 req | $0.7877 | 1m 00s | scored
opencode/glm-5: 0.813 | 100% success | 10 req | $0.2252 | 37s | scored
opencode/big-pickle: 0.925 | 100% success | 13 req | cost n/a | 1m 08s | scored
opencode/gpt-5.4: 0.796 | 100% success | 10 req | $0.3123 | 47s | scored
opencode/claude-sonnet-4-6: 0.757 | 100% success | 19 req | $0.4915 | 1m 08s | scored
opencode/glm-5.1: 0.81 | 100% success | 13 req | $0.0840 | 2m 13s | scored
opencode/minimax-m2.5: 0.761 | 100% success | 51 req | $0.0719 | 2m 36s | scored
opencode/gpt-5.4-mini: 0.0 | 0% success | 9 req | $0.0350 | 40s | unscored
opencode/minimax-m2.5-free: 0.0 | 0% success | 75 req | cost n/a | 5m 00s | unscored
opencode/gemini-3-flash: 0.7 | 100% success | 29 req | $0.1207 | 4m 06s | scored
opencode/gemini-3.1-pro: 0.0 | 0% success | 12 req | $0.1606 | 1m 42s | unscored
opencode/nemotron-3-super-free: 0.0 | 0% success | 29 req | cost n/a | 5m 00s | unscored
Build workspace plane convergence
14-build-workspace-plane-convergence | expert
opencode/gpt-5.4-nano: 0.942 | 100% success | 13 req | $0.0149 | 46s | scored
opencode/kimi-k2.5: 0.819 | 100% success | 18 req | $0.0519 | 1m 29s | scored
opencode/claude-opus-4-6: 0.736 | 100% success | 18 req | $1.1061 | 2m 42s | scored
opencode/glm-5: 0.801 | 100% success | 11 req | $0.2917 | 28s | scored
opencode/big-pickle: 0.852 | 100% success | 22 req | cost n/a | 2m 27s | scored
opencode/gpt-5.4: 0.77 | 100% success | 12 req | $0.4417 | 1m 14s | scored
opencode/claude-sonnet-4-6: 0.745 | 100% success | 21 req | $0.5629 | 2m 07s | scored
opencode/glm-5.1: 0.79 | 100% success | 12 req | $0.1291 | 3m 01s | scored
opencode/minimax-m2.5: 0.809 | 100% success | 25 req | $0.0418 | 1m 34s | scored
opencode/gpt-5.4-mini: 0.863 | 100% success | 10 req | $0.0574 | 57s | scored
opencode/minimax-m2.5-free: 0.7 | 100% success | 19 req | cost n/a | 1m 14s | scored
opencode/gemini-3-flash: 0.0 | 0% success | 40 req | cost n/a | 5m 01s | unscored
opencode/gemini-3.1-pro: 0.735 | 100% success | 25 req | $0.6600 | 3m 55s | scored
opencode/nemotron-3-super-free: 0.7 | 100% success | 20 req | cost n/a | 3m 23s | scored
Workspace runtime access convergence
15-workspace-runtime-access-convergence | expert
opencode/gpt-5.4-nano: 0.932 | 100% success | 22 req | $0.0301 | 1m 47s | scored
opencode/kimi-k2.5: 0.884 | 100% success | 21 req | $0.0633 | 1m 42s | scored
opencode/claude-opus-4-6: 0.754 | 100% success | 26 req | $1.1706 | 3m 01s | scored
opencode/glm-5: 0.819 | 100% success | 16 req | $0.4184 | 1m 00s | scored
opencode/big-pickle: 0.0 | 0% success | 30 req | cost n/a | 2m 26s | unscored
opencode/gpt-5.4: 0.773 | 100% success | 21 req | $0.5689 | 3m 35s | scored
opencode/claude-sonnet-4-6: 0.0 | 0% success | 6 req | $0.1763 | 5m 01s | unscored
opencode/glm-5.1: 0.819 | 100% success | 20 req | $0.1503 | 3m 41s | scored
opencode/minimax-m2.5: 0.0 | 0% success | 17 req | $0.0306 | 1m 32s | unscored
opencode/gpt-5.4-mini: 0.0 | 0% success | 16 req | $0.0655 | 1m 15s | unscored
opencode/minimax-m2.5-free: 0.7 | 100% success | 28 req | cost n/a | 1m 20s | scored
opencode/gemini-3-flash: 0.7 | 100% success | 27 req | $0.1889 | 2m 41s | scored
opencode/gemini-3.1-pro: 0.0 | 0% success | 23 req | $0.5298 | 3m 12s | unscored
opencode/nemotron-3-super-free: 0.0 | 0% success | 18 req | cost n/a | 5m 00s | unscored
Event status shell summary
16-event-status-shell | control
opencode/gpt-5.4-nano: 0.908 | 100% success | 10 req | $0.0089 | 34s | scored
opencode/kimi-k2.5: 0.86 | 100% success | 9 req | $0.0205 | 37s | scored
opencode/claude-opus-4-6: 0.749 | 100% success | 9 req | $0.5796 | 38s | scored
opencode/glm-5: 0.812 | 100% success | 6 req | $0.0954 | 37s | scored
opencode/big-pickle: 1.0 | 100% success | 6 req | cost n/a | 17s | scored
opencode/gpt-5.4: 0.783 | 100% success | 6 req | $0.2351 | 33s | scored
opencode/claude-sonnet-4-6: 0.761 | 100% success | 9 req | $0.3380 | 33s | scored
opencode/glm-5.1: 0.0 | 0% success | 5 req | cost n/a | 45s | unscored
opencode/minimax-m2.5: 0.0 | 0% success | 12 req | cost n/a | 45s | unscored
opencode/gpt-5.4-mini: 0.911 | 100% success | 6 req | $0.0196 | 25s | scored
opencode/minimax-m2.5-free: 0.0 | 0% success | 7 req | cost n/a | 45s | unscored
opencode/gemini-3-flash: 0.0 | 0% success | 9 req | cost n/a | 45s | unscored
opencode/gemini-3.1-pro: 0.0 | 0% success | 7 req | cost n/a | 45s | unscored
opencode/nemotron-3-super-free: 0.0 | 0% success | 4 req | cost n/a | 45s | unscored
Log level rollup shell script
17-log-level-rollup | control
opencode/gpt-5.4-nano: 0.921 | 100% success | 11 req | $0.0113 | 49s | scored
opencode/kimi-k2.5: 0.893 | 100% success | 10 req | $0.0223 | 36s | scored
opencode/claude-opus-4-6: 0.76 | 100% success | 9 req | $0.5953 | 50s | scored
opencode/glm-5: 0.807 | 100% success | 9 req | $0.1567 | 29s | scored
opencode/big-pickle: 0.965 | 100% success | 7 req | cost n/a | 54s | scored
opencode/gpt-5.4: 0.792 | 100% success | 7 req | $0.2457 | 48s | scored
opencode/claude-sonnet-4-6: 0.778 | 100% success | 9 req | $0.3619 | 32s | scored
opencode/glm-5.1: 0.836 | 100% success | 10 req | $0.0450 | 1m 01s | scored
opencode/minimax-m2.5: 0.853 | 100% success | 14 req | $0.0225 | 54s | scored
opencode/gpt-5.4-mini: 0.829 | 100% success | 11 req | $0.0486 | 56s | scored
opencode/minimax-m2.5-free: 0.7 | 100% success | 12 req | cost n/a | 49s | scored
opencode/gemini-3-flash: 0.0 | 0% success | 14 req | cost n/a | 1m 00s | unscored
opencode/gemini-3.1-pro: 0.0 | 0% success | 8 req | cost n/a | 1m 00s | unscored
opencode/nemotron-3-super-free: 0.0 | 0% success | 7 req | cost n/a | 1m 00s | unscored
RHEL edge firewalld router repair
18-rhel-edge-firewalld-router-repair | medium
opencode/gpt-5.4-nano: 0.953 | 100% success | 19 req | $0.0210 | 1m 10s | scored
opencode/kimi-k2.5: 0.0 | 0% success | 7 req | $0.0198 | 43s | unscored
opencode/claude-opus-4-6: 0.776 | 100% success | 16 req | $0.7708 | 1m 17s | scored
opencode/glm-5: 0.821 | 100% success | 13 req | $0.2802 | 1m 11s | scored
opencode/big-pickle: 0.888 | 100% success | 31 req | cost n/a | 1m 43s | scored
opencode/gpt-5.4: 0.0 | 0% success | 6 req | $0.2606 | 41s | unscored
opencode/claude-sonnet-4-6: 0.785 | 100% success | 17 req | $0.4893 | 1m 27s | scored
opencode/glm-5.1: 0.0 | 0% success | 6 req | $0.0359 | 54s | unscored
opencode/minimax-m2.5: 0.904 | 100% success | 19 req | $0.0302 | 1m 48s | scored
opencode/gpt-5.4-mini: 0.0 | 0% success | 9 req | $0.0277 | 33s | unscored
opencode/minimax-m2.5-free: 0.0 | 0% success | 6 req | cost n/a | 37s | unscored
opencode/gemini-3-flash: 0.7 | 100% success | 10 req | $0.1025 | 1m 19s | scored
opencode/gemini-3.1-pro: 0.0 | 0% success | 9 req | $0.1203 | 1m 09s | unscored
opencode/nemotron-3-super-free: 0.0 | 0% success | 19 req | cost n/a | 4m 00s | unscored
SELinux registry volume label repair
19-selinux-registry-volume-label-repair | high
opencode/gpt-5.4-nano: 0.0 | 0% success | 10 req | $0.0125 | 59s | unscored
opencode/kimi-k2.5: 1.0 | 100% success | 10 req | $0.0254 | 42s | scored
opencode/claude-opus-4-6: 0.759 | 100% success | 16 req | $0.7973 | 1m 57s | scored
opencode/glm-5: 0.0 | 0% success | 6 req | $0.1133 | 32s | unscored
opencode/big-pickle: 0.0 | 0% success | 8 req | cost n/a | 1m 07s | unscored
opencode/gpt-5.4: 0.802 | 100% success | 12 req | $0.3297 | 1m 09s | scored
opencode/claude-sonnet-4-6: 0.0 | 0% success | 11 req | $0.4442 | 54s | unscored
opencode/glm-5.1: 0.807 | 100% success | 13 req | $0.1312 | 3m 51s | scored
opencode/minimax-m2.5: 0.0 | 0% success | 7 req | $0.0098 | 33s | unscored
opencode/gpt-5.4-mini: 0.0 | 0% success | 12 req | $0.0326 | 1m 02s | unscored
opencode/minimax-m2.5-free: 0.7 | 100% success | 9 req | cost n/a | 1m 13s | scored
opencode/gemini-3-flash: 0.7 | 100% success | 15 req | $0.0799 | 1m 15s | scored
opencode/gemini-3.1-pro: 0.0 | 0% success | 3 req | $0.0385 | 19s | unscored
opencode/nemotron-3-super-free: 0.0 | 0% success | 20 req | cost n/a | 5m 00s | unscored
AppArmor dnsmasq profile repair
20-apparmor-dnsmasq-profile-repair | high
opencode/gpt-5.4-nano: 0.918 | 100% success | 18 req | $0.0145 | 1m 03s | scored
opencode/kimi-k2.5: 0.861 | 100% success | 14 req | $0.0377 | 1m 37s | scored
opencode/claude-opus-4-6: 0.751 | 100% success | 18 req | $0.8037 | 1m 21s | scored
opencode/glm-5: 0.795 | 100% success | 12 req | $0.2683 | 1m 01s | scored
opencode/big-pickle: 0.885 | 100% success | 23 req | cost n/a | 1m 22s | scored
opencode/gpt-5.4: 0.792 | 100% success | 9 req | $0.3399 | 1m 29s | scored
opencode/claude-sonnet-4-6: 0.755 | 100% success | 22 req | $0.5205 | 1m 18s | scored
opencode/glm-5.1: 0.0 | 0% success | 9 req | cost n/a | 5m 00s | unscored
opencode/minimax-m2.5: 0.897 | 100% success | 14 req | $0.0232 | 1m 21s | scored
opencode/gpt-5.4-mini: 0.0 | 0% success | 10 req | $0.0337 | 42s | unscored
opencode/minimax-m2.5-free: 0.7 | 100% success | 15 req | cost n/a | 1m 21s | scored
opencode/gemini-3-flash: 0.7 | 100% success | 28 req | $0.1621 | 3m 29s | scored
opencode/gemini-3.1-pro: 0.0 | 0% success | 13 req | $0.2495 | 2m 08s | unscored
opencode/nemotron-3-super-free: 0.0 | 0% success | 20 req | cost n/a | 2m 50s | unscored
RHEL k3s node preparation repair
21-rhel-k3s-node-prep-repair | expert
opencode/gpt-5.4-nano: 1.0 | 100% success | 13 req | $0.0120 | 38s | scored
opencode/kimi-k2.5: 0.798 | 100% success | 24 req | $0.0612 | 2m 26s | scored
opencode/claude-opus-4-6: 0.745 | 100% success | 18 req | $0.9392 | 2m 03s | scored
opencode/glm-5: 0.0 | 0% success | 5 req | $0.1018 | 21s | unscored
opencode/big-pickle: 0.857 | 100% success | 31 req | cost n/a | 2m 19s | scored
opencode/gpt-5.4: 0.0 | 0% success | 9 req | $0.3012 | 51s | unscored
opencode/claude-sonnet-4-6: 0.748 | 100% success | 24 req | $0.6228 | 1m 31s | scored
opencode/glm-5.1: 0.0 | 0% success | 9 req | $0.0527 | 2m 28s | unscored
opencode/minimax-m2.5: 0.0 | 0% success | 9 req | $0.0161 | 1m 06s | unscored
opencode/gpt-5.4-mini: 0.0 | 0% success | 11 req | $0.0441 | 1m 03s | unscored
opencode/minimax-m2.5-free: 0.0 | 0% success | 13 req | cost n/a | 1m 17s | unscored
opencode/gemini-3-flash: 0.0 | 0% success | 7 req | $0.0468 | 49s | unscored
opencode/gemini-3.1-pro: 0.0 | 0% success | 10 req | $0.1624 | 1m 44s | unscored
opencode/nemotron-3-super-free: 0.0 | 0% success | 22 req | cost n/a | 5m 01s | unscored
nftables router ingress repair
22-nftables-router-ingress-repair | expert
opencode/gpt-5.4-nano: 0.98 | 100% success | 7 req | $0.0084 | 27s | scored
opencode/kimi-k2.5: 0.793 | 100% success | 16 req | $0.0369 | 1m 21s | scored
opencode/claude-opus-4-6: 0.735 | 100% success | 14 req | $0.7916 | 1m 07s | scored
opencode/glm-5: 0.817 | 100% success | 6 req | $0.1201 | 29s | scored
opencode/big-pickle: 0.0 | 0% success | 17 req | cost n/a | 52s | unscored
opencode/gpt-5.4: 0.0 | 0% success | 6 req | $0.3503 | 1m 13s | unscored
opencode/claude-sonnet-4-6: 0.742 | 100% success | 14 req | $0.4760 | 1m 06s | scored
opencode/glm-5.1: 0.0 | 0% success | 6 req | $0.0540 | 1m 48s | unscored
opencode/minimax-m2.5: 0.849 | 100% success | 11 req | $0.0181 | 1m 00s | scored
opencode/gpt-5.4-mini: 0.0 | 0% success | 6 req | $0.0310 | 43s | unscored
opencode/minimax-m2.5-free: 0.0 | 0% success | 30 req | cost n/a | 1m 44s | unscored
opencode/gemini-3-flash: 0.0 | 0% success | 11 req | $0.0821 | 1m 41s | unscored
opencode/gemini-3.1-pro: 0.0 | 0% success | 9 req | $0.2268 | 1m 50s | unscored
opencode/nemotron-3-super-free: 0.0 | 0% success | 24 req | cost n/a | 4m 52s | unscored
RHEL NetworkManager bridge VLAN repair
23-rhel-networkmanager-bridge-vlan-repair | high
opencode/gpt-5.4-nano: 0.951 | 100% success | 9 req | $0.0090 | 34s | scored
opencode/kimi-k2.5: 0.872 | 100% success | 8 req | $0.0248 | 50s | scored
opencode/claude-opus-4-6: 0.741 | 100% success | 12 req | $0.7664 | 1m 01s | scored
opencode/glm-5: 0.802 | 100% success | 8 req | $0.1978 | 18s | scored
opencode/big-pickle: 0.87 | 100% success | 16 req | cost n/a | 1m 06s | scored
opencode/gpt-5.4: 0.751 | 100% success | 15 req | $0.3071 | 1m 02s | scored
opencode/claude-sonnet-4-6: 0.746 | 100% success | 13 req | $0.5330 | 57s | scored
opencode/glm-5.1: 0.793 | 100% success | 10 req | $0.0689 | 1m 47s | scored
opencode/minimax-m2.5: 0.745 | 100% success | 41 req | $0.0676 | 2m 55s | scored
opencode/gpt-5.4-mini: 0.852 | 100% success | 8 req | $0.0339 | 54s | scored
opencode/minimax-m2.5-free: 0.0 | 0% success | 7 req | cost n/a | 53s | unscored
opencode/gemini-3-flash: 0.0 | 0% success | 16 req | $0.1243 | 2m 35s | unscored
opencode/gemini-3.1-pro: 0.0 | 0% success | 9 req | $0.1941 | 1m 33s | unscored
opencode/nemotron-3-super-free: 0.7 | 100% success | 22 req | cost n/a | 4m 59s | scored
K3s registry mirror trust repair
24-k3s-registry-mirror-trust-repair | expert
opencode/gpt-5.4-nano: 0.886 | 100% success | 11 req | $0.0089 | 32s | scored
opencode/kimi-k2.5: 0.792 | 100% success | 15 req | $0.0306 | 1m 04s | scored
opencode/claude-opus-4-6: 0.737 | 100% success | 11 req | $0.6582 | 59s | scored
opencode/glm-5: 0.753 | 100% success | 13 req | $0.2926 | 25s | scored
opencode/big-pickle: 1.0 | 100% success | 5 req | cost n/a | 17s | scored
opencode/gpt-5.4: 0.758 | 100% success | 7 req | $0.2694 | 1m 16s | scored
opencode/claude-sonnet-4-6: 0.766 | 100% success | 7 req | $0.3419 | 26s | scored
opencode/glm-5.1: 0.852 | 100% success | 5 req | $0.0395 | 39s | scored
opencode/minimax-m2.5: 0.906 | 100% success | 7 req | $0.0118 | 33s | scored
opencode/gpt-5.4-mini: 0.896 | 100% success | 6 req | $0.0195 | 25s | scored
opencode/minimax-m2.5-free: 0.7 | 100% success | 6 req | cost n/a | 45s | scored
opencode/gemini-3-flash: 0.7 | 100% success | 40 req | $0.2351 | 3m 40s | scored
opencode/gemini-3.1-pro: 0.773 | 100% success | 8 req | $0.1489 | 52s | scored
opencode/nemotron-3-super-free: 0.7 | 100% success | 13 req | cost n/a | 2m 01s | scored
MetalLB ingress address pool repair
25-metallb-ingress-address-pool-repair | expert
opencode/gpt-5.4-nano: 0.928 | 100% success | 17 req | $0.0152 | 1m 03s | scored
opencode/kimi-k2.5: 0.85 | 100% success | 16 req | $0.0428 | 1m 38s | scored
opencode/claude-opus-4-6: 0.0 | 0% success | 10 req | $0.7403 | 1m 04s | unscored
opencode/glm-5: 0.76 | 100% success | 20 req | $0.5282 | 1m 08s | scored
opencode/big-pickle: 0.927 | 100% success | 18 req | cost n/a | 57s | scored
opencode/gpt-5.4: 0.803 | 100% success | 10 req | $0.3259 | 53s | scored
opencode/claude-sonnet-4-6: 0.759 | 100% success | 20 req | $0.5138 | 1m 20s | scored
opencode/glm-5.1: 0.0 | 0% success | 8 req | $0.0468 | 1m 27s | unscored
opencode/minimax-m2.5: 0.0 | 0% success | 13 req | $0.0183 | 1m 04s | unscored
opencode/gpt-5.4-mini: 0.0 | 0% success | 10 req | $0.0336 | 45s | unscored
opencode/minimax-m2.5-free: 0.7 | 100% success | 25 req | cost n/a | 1m 48s | scored
opencode/gemini-3-flash: 0.7 | 100% success | 23 req | $0.1348 | 2m 14s | scored
opencode/gemini-3.1-pro: 0.0 | 0% success | 3 req | $0.0619 | 34s | unscored
opencode/nemotron-3-super-free: 0.0 | 0% success | 19 req | cost n/a | 5m 01s | unscored
Traefik forwarded header trust repair
26-traefik-forwarded-header-trust-repair | expert
opencode/gpt-5.4-nano: 0.904 | 100% success | 21 req | $0.0176 | 1m 17s | scored
opencode/kimi-k2.5: 0.913 | 100% success | 13 req | $0.0331 | 1m 01s | scored
opencode/claude-opus-4-6: 0.766 | 100% success | 15 req | $0.7472 | 1m 08s | scored
opencode/glm-5: 0.771 | 100% success | 20 req | $0.5014 | 48s | scored
opencode/big-pickle: 0.0 | 0% success | 6 req | cost n/a | 21s | unscored
opencode/gpt-5.4: 0.783 | 100% success | 13 req | $0.3067 | 2m 18s | scored
opencode/claude-sonnet-4-6: 0.771 | 100% success | 18 req | $0.4565 | 1m 10s | scored
opencode/glm-5.1: 0.835 | 100% success | 11 req | $0.0774 | 3m 13s | scored
opencode/minimax-m2.5: 0.876 | 100% success | 17 req | $0.0286 | 1m 53s | scored
opencode/gpt-5.4-mini: 0.0 | 0% success | 8 req | $0.0428 | 54s | unscored
opencode/minimax-m2.5-free: 0.0 | 0% success | 14 req | cost n/a | 1m 10s | unscored
opencode/gemini-3-flash: 0.7 | 100% success | 15 req | $0.0754 | 1m 38s | scored
opencode/gemini-3.1-pro: 0.0 | 0% success | 15 req | $0.4734 | 4m 42s | unscored
opencode/nemotron-3-super-free: 0.0 | 0% success | 18 req | cost n/a | 5m 01s | unscored
ExternalDNS RFC2136 repair
27-external-dns-rfc2136-repair | expert
opencode/gpt-5.4-nano: 0.0 | 0% success | 13 req | $0.0138 | 49s | unscored
opencode/kimi-k2.5: 0.982 | 100% success | 11 req | $0.0299 | 46s | scored
opencode/claude-opus-4-6: 0.766 | 100% success | 15 req | $0.7899 | 1m 44s | scored
opencode/glm-5: 0.814 | 100% success | 15 req | $0.3206 | 34s | scored
opencode/big-pickle: 0.893 | 100% success | 17 req | cost n/a | 1m 54s | scored
opencode/gpt-5.4: 0.815 | 100% success | 11 req | $0.3297 | 1m 02s | scored
opencode/claude-sonnet-4-6: 0.0 | 0% success | 7 req | $0.2057 | 5m 01s | unscored
opencode/glm-5.1: 0.837 | 100% success | 12 req | $0.1022 | 2m 42s | scored
opencode/minimax-m2.5: 0.0 | 0% success | 8 req | $0.0144 | 45s | unscored
opencode/gpt-5.4-mini: 0.0 | 0% success | 8 req | $0.0321 | 39s | unscored
opencode/minimax-m2.5-free: 0.0 | 0% success | 8 req | cost n/a | 1m 01s | unscored
opencode/gemini-3-flash: 0.7 | 100% success | 19 req | $0.0818 | 1m 48s | scored
opencode/gemini-3.1-pro: 0.0 | 0% success | 7 req | $0.1398 | 1m 02s | unscored
opencode/nemotron-3-super-free: 0.0 | 0% success | 1 req | cost n/a | 5m 01s | unscored
Highlights

After the top-line results, these sections answer the next useful questions: who leads each tradeoff, where models separate, and what the benchmark composition looks like.

Cheapest full run: opencode/big-pickle at $0.0000 (0.615 composite)
Fastest full run: opencode/glm-5 in 20m 10s (78% success)
Best ORPT: opencode/gpt-5.4-mini at 9.54 requests per solved task
Most reliable: opencode/kimi-k2.5 at 89% (24 solved tasks)
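A sketch of the metric behind the ORPT highlight, assuming a plain reading of the "requests per solved task" label (the published definition may normalize differently):

```python
def orpt(task_results):
    """Requests per solved task.

    task_results: (request_count, solved) pairs, one per task.
    Failed tasks are excluded, so a model is not rewarded for
    bailing out of tasks cheaply. Assumed reading of the label,
    not the published formula.
    """
    solved = [reqs for reqs, ok in task_results if ok]
    return sum(solved) / len(solved) if solved else float("inf")

# Two solved tasks (5 and 14 requests) plus one failure:
print(orpt([(5, True), (14, True), (40, False)]))  # -> 9.5
```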
Head-to-head

Pairwise outcomes

Task-level pairwise wins show whether the leaderboard leader actually dominates the suite or just edges ahead on aggregate.
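The records below reduce to a per-task comparison of composite scores. A minimal sketch (treating an exact score match as a tie, which is a simplifying assumption; the published tie rule may differ):

```python
def head_to_head(scores_a, scores_b):
    """Pairwise win/loss/tie record from per-task composite scores.

    scores_a, scores_b: dicts mapping task name -> composite score,
    covering the same task set.
    """
    wins = losses = ties = 0
    for task in scores_a:
        a, b = scores_a[task], scores_b[task]
        if a > b:
            wins += 1
        elif a < b:
            losses += 1
        else:
            ties += 1
    return wins, losses, ties

# Model A wins t3's opposite? No: A wins t1, loses t3, ties t2.
print(head_to_head({"t1": 1.0, "t2": 0.0, "t3": 0.7},
                   {"t1": 0.8, "t2": 0.0, "t3": 0.9}))  # -> (1, 1, 1)
```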

opencode/claude-opus-4-6 vs opencode/nemotron-3-super-free
27 tasks compared, 3 ties
24-0
opencode/gpt-5.4-nano vs opencode/nemotron-3-super-free
27 tasks compared, 4 ties
23-0
opencode/kimi-k2.5 vs opencode/minimax-m2.5-free
27 tasks compared, 2 ties
24-1
opencode/kimi-k2.5 vs opencode/gemini-3.1-pro
27 tasks compared, 2 ties
24-1
opencode/kimi-k2.5 vs opencode/nemotron-3-super-free
27 tasks compared, 2 ties
24-1
opencode/claude-opus-4-6 vs opencode/gemini-3-flash
27 tasks compared, 2 ties
24-1
opencode/gpt-5.4-nano vs opencode/claude-sonnet-4-6
27 tasks compared, 3 ties
23-1
opencode/kimi-k2.5 vs opencode/gpt-5.4
27 tasks compared, 1 tie
24-2
opencode/kimi-k2.5 vs opencode/gemini-3-flash
27 tasks compared, 1 tie
24-2
opencode/claude-opus-4-6 vs opencode/minimax-m2.5-free
27 tasks compared, 1 tie
24-2
opencode/gpt-5.4-nano vs opencode/glm-5
27 tasks compared, 2 ties
23-2
opencode/gpt-5.4-nano vs opencode/minimax-m2.5-free
27 tasks compared, 2 ties
23-2
opencode/gpt-5.4-nano vs opencode/gemini-3.1-pro
27 tasks compared, 4 ties
22-1
opencode/kimi-k2.5 vs opencode/claude-opus-4-6
27 tasks compared, 0 ties
24-3
opencode/kimi-k2.5 vs opencode/claude-sonnet-4-6
27 tasks compared, 0 ties
24-3
opencode/kimi-k2.5 vs opencode/glm-5.1
27 tasks compared, 2 ties
23-2
opencode/glm-5 vs opencode/nemotron-3-super-free
27 tasks compared, 6 ties
21-0
opencode/gpt-5.4 vs opencode/nemotron-3-super-free
27 tasks compared, 6 ties
21-0
opencode/claude-sonnet-4-6 vs opencode/nemotron-3-super-free
27 tasks compared, 6 ties
21-0
opencode/gpt-5.4-nano vs opencode/claude-opus-4-6
27 tasks compared, 1 tie
23-3
opencode/gpt-5.4-nano vs opencode/gpt-5.4
27 tasks compared, 1 tie
23-3
opencode/gpt-5.4-nano vs opencode/gemini-3-flash
27 tasks compared, 1 tie
23-3
opencode/gpt-5.4 vs opencode/minimax-m2.5-free
27 tasks compared, 5 ties
21-1
opencode/gpt-5.4-nano vs opencode/glm-5.1
27 tasks compared, 0 ties
23-4
opencode/kimi-k2.5 vs opencode/glm-5
27 tasks compared, 0 ties
23-4
opencode/gpt-5.4 vs opencode/gemini-3-flash
27 tasks compared, 4 ties
21-2
opencode/glm-5 vs opencode/minimax-m2.5-free
27 tasks compared, 3 ties
21-3
opencode/glm-5 vs opencode/gemini-3-flash
27 tasks compared, 3 ties
21-3
opencode/big-pickle vs opencode/nemotron-3-super-free
27 tasks compared, 9 ties
18-0
opencode/glm-5.1 vs opencode/nemotron-3-super-free
27 tasks compared, 9 ties
18-0
opencode/claude-sonnet-4-6 vs opencode/minimax-m2.5-free
27 tasks compared, 2 ties
21-4
opencode/claude-sonnet-4-6 vs opencode/gemini-3-flash
27 tasks compared, 2 ties
21-4
opencode/claude-opus-4-6 vs opencode/gpt-5.4
27 tasks compared, 1 tie
5-21
opencode/glm-5 vs opencode/gemini-3.1-pro
27 tasks compared, 4 ties
19-4
opencode/glm-5.1 vs opencode/minimax-m2.5-free
27 tasks compared, 6 ties
18-3
opencode/gpt-5.4-nano vs opencode/gpt-5.4-mini
27 tasks compared, 3 ties
19-5
opencode/claude-opus-4-6 vs opencode/glm-5
27 tasks compared, 1 tie
6-20
opencode/big-pickle vs opencode/minimax-m2.5-free
27 tasks compared, 5 ties
18-4
opencode/big-pickle vs opencode/gemini-3.1-pro
27 tasks compared, 7 ties
17-3
opencode/glm-5.1 vs opencode/gemini-3-flash
27 tasks compared, 5 ties
18-4
opencode/minimax-m2.5 vs opencode/nemotron-3-super-free
27 tasks compared, 11 ties
15-1
opencode/glm-5 vs opencode/big-pickle
27 tasks compared, 4 ties
5-18
opencode/big-pickle vs opencode/claude-sonnet-4-6
27 tasks compared, 4 ties
18-5
opencode/big-pickle vs opencode/gpt-5.4-mini
27 tasks compared, 6 ties
17-4
opencode/big-pickle vs opencode/gemini-3-flash
27 tasks compared, 4 ties
18-5
opencode/gpt-5.4-nano vs opencode/minimax-m2.5
27 tasks compared, 3 ties
18-6
opencode/kimi-k2.5 vs opencode/gpt-5.4-mini
27 tasks compared, 1 tie
19-7
opencode/big-pickle vs opencode/gpt-5.4
27 tasks compared, 3 ties
18-6
opencode/gpt-5.4 vs opencode/claude-sonnet-4-6
27 tasks compared, 1 tie
19-7
opencode/gpt-5.4 vs opencode/gemini-3.1-pro
27 tasks compared, 5 ties
17-5
opencode/glm-5.1 vs opencode/gemini-3.1-pro
27 tasks compared, 7 ties
16-4
opencode/minimax-m2.5 vs opencode/gemini-3.1-pro
27 tasks compared, 9 ties
15-3
opencode/gpt-5.4-mini vs opencode/nemotron-3-super-free
27 tasks compared, 13 ties
13-1
opencode/gpt-5.4-nano vs opencode/kimi-k2.5
27 tasks compared, 0 ties
19-8
opencode/claude-opus-4-6 vs opencode/big-pickle
27 tasks compared, 2 ties
7-18
opencode/claude-opus-4-6 vs opencode/claude-sonnet-4-6
27 tasks compared, 2 ties
7-18
opencode/claude-opus-4-6 vs opencode/glm-5.1
27 tasks compared, 2 ties
7-18
opencode/glm-5 vs opencode/claude-sonnet-4-6
27 tasks compared, 2 ties
18-7
opencode/big-pickle vs opencode/glm-5.1
27 tasks compared, 4 ties
17-6
opencode/gpt-5.4 vs opencode/glm-5.1
27 tasks compared, 4 ties
6-17
opencode/kimi-k2.5 vs opencode/minimax-m2.5
27 tasks compared, 1 tie
18-8
opencode/claude-sonnet-4-6 vs opencode/glm-5.1
27 tasks compared, 1 tie
8-18
opencode/minimax-m2.5 vs opencode/gemini-3-flash
27 tasks compared, 7 ties
15-5
opencode/claude-opus-4-6 vs opencode/gemini-3.1-pro
27 tasks compared, 2 ties
17-8
opencode/glm-5 vs opencode/gpt-5.4
27 tasks compared, 4 ties
16-7
opencode/big-pickle vs opencode/minimax-m2.5
27 tasks compared, 6 ties
15-6
opencode/gpt-5.4-mini vs opencode/gemini-3.1-pro
27 tasks compared, 12 ties
12-3
opencode/minimax-m2.5-free vs opencode/nemotron-3-super-free
27 tasks compared, 16 ties
10-1
opencode/gemini-3-flash vs opencode/nemotron-3-super-free
27 tasks compared, 12 ties
12-3
opencode/gemini-3.1-pro vs opencode/nemotron-3-super-free
27 tasks compared, 16 ties
10-1
opencode/minimax-m2.5 vs opencode/minimax-m2.5-free
27 tasks compared, 5 ties
15-7
opencode/claude-sonnet-4-6 vs opencode/gemini-3.1-pro
27 tasks compared, 4 ties
15-8
opencode/gpt-5.4-mini vs opencode/minimax-m2.5-free
27 tasks compared, 8 ties
13-6
opencode/claude-sonnet-4-6 vs opencode/minimax-m2.5
27 tasks compared, 5 ties
8-14
opencode/gpt-5.4 vs opencode/gpt-5.4-mini
27 tasks compared, 6 ties
8-13
opencode/glm-5.1 vs opencode/minimax-m2.5
27 tasks compared, 6 ties
8-13
opencode/gpt-5.4-nano vs opencode/big-pickle
27 tasks compared, 3 ties
14-10
opencode/claude-opus-4-6 vs opencode/minimax-m2.5
27 tasks compared, 3 ties
10-14
opencode/glm-5 vs opencode/minimax-m2.5
27 tasks compared, 5 ties
9-13
opencode/glm-5 vs opencode/gpt-5.4-mini
27 tasks compared, 5 ties
9-13
opencode/minimax-m2.5 vs opencode/gpt-5.4-mini
27 tasks compared, 7 ties
12-8
opencode/claude-sonnet-4-6 vs opencode/gpt-5.4-mini
27 tasks compared, 4 ties
10-13
opencode/gpt-5.4-mini vs opencode/gemini-3-flash
27 tasks compared, 4 ties
13-10
opencode/minimax-m2.5-free vs opencode/gemini-3.1-pro
27 tasks compared, 10 ties
7-10
opencode/kimi-k2.5 vs opencode/big-pickle
27 tasks compared, 1 tie
12-14
opencode/gpt-5.4 vs opencode/minimax-m2.5
27 tasks compared, 3 ties
11-13
opencode/claude-opus-4-6 vs opencode/gpt-5.4-mini
27 tasks compared, 2 ties
12-13
opencode/glm-5.1 vs opencode/gpt-5.4-mini
27 tasks compared, 6 ties
11-10
opencode/gemini-3-flash vs opencode/gemini-3.1-pro
27 tasks compared, 8 ties
9-10
opencode/glm-5 vs opencode/glm-5.1
27 tasks compared, 3 ties
12-12
opencode/minimax-m2.5-free vs opencode/gemini-3-flash
27 tasks compared, 17 ties
5-5
Benchmark composition

Suite composition

A benchmark is only meaningful if you can see what kinds of tasks dominate the signal.

Task inventory
10 categories, 4 difficulty bands
27 tasks
Observed suite cost
Combined spend across all model-task runs in the publication
$62.3343
Observed request volume
Request units recorded by the benchmark proxy
5,342
Observed wall time
Aggregate elapsed model runtime across all runs
624m 11s
Required capabilities
unattendedBenchmarkRuns
1
Execution profile

Token and execution profile

Observed token mix across the suite helps explain whether a model is spending heavily on reasoning, cached context, or generation.

Model Input Output Reasoning Cache Read Cache Write
opencode/gpt-5.4-nano 565,422 87,985 69,555 5,576,704 0
opencode/kimi-k2.5 464,664 85,749 0 4,702,208 0
opencode/claude-opus-4-6 27 8,130 0 536,949 13,015
opencode/glm-5 1,000,890 92,693 0 5,136,416 0
opencode/big-pickle 540,092 86,989 0 5,246,048 0
opencode/gpt-5.4 29,395 4,788 1,290 419,200 0
opencode/claude-sonnet-4-6 24 5,574 0 464,850 10,992
opencode/glm-5.1 577,858 63,642 0 3,048,416 0
opencode/minimax-m2.5 678,183 82,592 0 5,644,934 0
opencode/gpt-5.4-mini 450,597 63,992 44,364 3,133,952 0
opencode/minimax-m2.5-free 471,764 77,044 0 4,840,912 0
opencode/gemini-3-flash 2,966,158 47,085 182,101 5,200,440 0
opencode/gemini-3.1-pro 1,363,843 33,058 174,127 3,198,507 0
opencode/nemotron-3-super-free 3,000,485 20,944 17,835 0 0
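The token-profile table is easiest to interpret as shares of a model's total token volume. A minimal sketch of that share math, using the published gpt-5.4-nano row as input; the function name and dictionary layout are illustrative, not part of the benchmark tooling:

```python
# Hedged sketch: turn a model's raw token counts into mix fractions,
# the view the token-profile table is meant to support.

def token_mix(input_t, output_t, reasoning_t, cache_read, cache_write):
    total = input_t + output_t + reasoning_t + cache_read + cache_write
    return {
        "input": input_t / total,
        "output": output_t / total,
        "reasoning": reasoning_t / total,
        "cache_read": cache_read / total,
        "cache_write": cache_write / total,
    }

# Published gpt-5.4-nano counts from the table above.
mix = token_mix(565_422, 87_985, 69_555, 5_576_704, 0)
print(f"cache_read share: {mix['cache_read']:.1%}")
```

For this row, cached context dominates the mix, which is typical of agentic runs that repeatedly re-read the same repository state.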
Deep Dive Charts

The rest of the visual analysis is still here, just moved below the primary answers.

Benchmark speed

Total wall time

Lower is better: total elapsed model runtime across benchmark tasks.

Failure burn

Spend split by outcome

How much benchmark spend went to solved tasks versus failed attempts.

The faded segment is spend burned on failed tasks; the solid segment is spend attached to solved tasks.
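The split itself is simple bookkeeping: bucket each task's cost by whether the task was solved. A minimal sketch, with hypothetical task records (the real pipeline's per-task cost attribution may differ):

```python
# Hedged sketch of the spend-split idea: bucket per-task cost by outcome
# so failed-attempt burn is visible next to productive spend.

def spend_split(tasks):
    """tasks: list of (solved: bool, cost: float) per task."""
    solved = sum(cost for ok, cost in tasks if ok)
    failed = sum(cost for ok, cost in tasks if not ok)
    return solved, failed

# Hypothetical three-task run: one expensive failure, two cheap solves.
solved_cost, failed_cost = spend_split([(True, 0.03), (False, 0.21), (True, 0.08)])
print(solved_cost, failed_cost)  # roughly 0.11 solved vs 0.21 failed
```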

Execution profile

Token breakdown

Input, output, reasoning, and cache token mix by model across the suite.

Use this to see whether a model spends proportionally on reasoning, generation, or cached context.

Benchmark composition

Category composite heatmap

Average composite score by benchmark category and model.

Rows expose category strengths and blind spots that disappear in a single top-line score.

Benchmark composition

Difficulty success heatmap

Average success rate by task difficulty and model.

This isolates whether a model falls apart as task difficulty rises.

Task breakdown

Task composite heatmap

Per-task comparative quality. Higher is better.

This is the most detailed quality comparison on the page: every task, every model, one glance.

Task economics

Task cost heatmap

Per-task average cost by model. Lower is better.

Use this to find where a model is overpaying for equivalent or worse outcomes.

Task speed

Task duration heatmap

Per-task average duration by model. Lower is better.

Use this to spot slow tasks and whether the delay tracks with better outcomes or wasted effort.

Benchmarked Model Cards

Published benchmark results paired with catalog metadata make it easier to understand whether a model is cheap, fast, stable, or just happened to land higher this run.

Benchmarked model

opencode/gpt-5.4-nano

openai low price tier dev-cheap standard
Composite: 0.789
Success: 85%
Requests: 413
Wall time: 27m 33s
Total cost: $0.4215
Catalog speed: n/a
Catalog blended price: $0.4625 / 1M tok
Intelligence: n/a
Agentic: n/a
Benchmark support: unknown

Primary blended price derived automatically from OpenRouter listing openai/gpt-5.4-nano using a 3:1 input:output blend.
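The 3:1 blend used throughout the catalog columns weighs input price three times as heavily as output price. A minimal sketch of that arithmetic; the example prices are hypothetical, not a real OpenRouter listing:

```python
# Hedged sketch of the catalog's 3:1 blended price:
# three parts input price to one part output price, per 1M tokens.

def blended_price(input_per_m, output_per_m, ratio=3):
    """input_per_m / output_per_m: USD per 1M tokens for each direction."""
    return (ratio * input_per_m + output_per_m) / (ratio + 1)

# Hypothetical listing: $0.25/1M input, $2.00/1M output.
print(blended_price(0.25, 2.00))  # (3*0.25 + 2.00) / 4 = 0.6875
```

The 3:1 weighting reflects that agentic workloads typically consume far more input (context) tokens than output tokens.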

Benchmarked model

opencode/kimi-k2.5

moonshot low price tier standard standard
Composite: 0.785
Success: 89%
Requests: 375
Wall time: 41m 05s
Total cost: $0.9122
Catalog speed: n/a
Catalog blended price: $0.7170 / 1M tok
Intelligence: n/a
Agentic: n/a
Benchmark support: unknown

Primary blended price derived automatically from OpenRouter listing moonshotai/kimi-k2.5 using a 3:1 input:output blend.

Benchmarked model

opencode/claude-opus-4-6

anthropic high price tier expensive balanced-general
Composite: 0.67
Success: 89%
Requests: 418
Wall time: 40m 04s
Total cost: $21.8757
Catalog speed: 49 tok/s
Catalog blended price: $10.0000 / 1M tok
Intelligence: 53
Agentic: n/a
Benchmark support: unknown

OpenRouter reference blend for anthropic/claude-opus-4.6-fast is 60 USD per 1M tokens using a 3:1 input:output mix.

Benchmarked model

opencode/glm-5

z-ai low price tier standard balanced-general
Composite: 0.623
Success: 78%
Requests: 280
Wall time: 20m 10s
Total cost: $6.4339
Catalog speed: 70 tok/s
Catalog blended price: $1.6000 / 1M tok
Intelligence: 50
Agentic: n/a
Benchmark support: unknown

OpenRouter reference blend for z-ai/glm-5 is 1.115 USD per 1M tokens using a 3:1 input:output mix.

Benchmarked model

opencode/big-pickle

unknown unknown price tier standard standard
Composite: 0.615
Success: 67%
Requests: 414
Wall time: 36m 28s
Total cost: $0.0000
Catalog speed: n/a
Catalog blended price: n/a
Intelligence: n/a
Agentic: n/a
Benchmark support: unknown

No trustworthy automatic pricing reference found yet, so cost is currently unknown.

Benchmarked model

opencode/gpt-5.4

openai medium price tier standard release-frontier
Composite: 0.609
Success: 78%
Requests: 280
Wall time: 32m 47s
Total cost: $8.9827
Catalog speed: 74 tok/s
Catalog blended price: $5.6000 / 1M tok
Intelligence: 57
Agentic: n/a
Benchmark support: unknown

OpenRouter reference blend for openai/gpt-5.4 is 5.625 USD per 1M tokens using a 3:1 input:output mix.

Benchmarked model

opencode/claude-sonnet-4-6

anthropic medium price tier standard balanced-general
Composite: 0.593
Success: 78%
Requests: 403
Wall time: 42m 31s
Total cost: $11.8406
Catalog speed: 67 tok/s
Catalog blended price: $6.0000 / 1M tok
Intelligence: 52
Agentic: n/a
Benchmark support: unknown

OpenRouter reference blend for anthropic/claude-opus-4.6-fast is 60 USD per 1M tokens using a 3:1 input:output mix.

Benchmarked model

opencode/glm-5.1

z-ai medium price tier standard standard
Composite: 0.547
Success: 67%
Requests: 286
Wall time: 64m 39s
Total cost: $1.8816
Catalog speed: n/a
Catalog blended price: $2.1463 / 1M tok
Intelligence: n/a
Agentic: n/a
Benchmark support: unknown

Primary blended price derived automatically from OpenRouter listing z-ai/glm-5.1 using a 3:1 input:output blend.

Benchmarked model

opencode/minimax-m2.5

minimax low price tier dev-cheap dev-general
Composite: 0.481
Success: 56%
Requests: 417
Wall time: 32m 15s
Total cost: $0.6413
Catalog speed: n/a
Catalog blended price: $0.3360 / 1M tok
Intelligence: n/a
Agentic: n/a
Benchmark support: unknown

Primary blended price derived automatically from OpenRouter listing minimax/minimax-m2.5 using a 3:1 input:output blend.

Benchmarked model

opencode/gpt-5.4-mini

openai low price tier dev-cheap dev-general headless friendly
Composite: 0.425
Success: 48%
Requests: 264
Wall time: 21m 48s
Total cost: $1.0606
Catalog speed: n/a
Catalog blended price: $1.6875 / 1M tok
Intelligence: n/a
Agentic: n/a
Benchmark support: supported

Primary blended price derived automatically from OpenRouter listing openai/gpt-5.4-mini using a 3:1 input:output blend.

Observed to complete ORPT-Bench scripting smoke runs cleanly and is the current preferred headless dev baseline.

Benchmarked model

opencode/minimax-m2.5-free

minimax free price tier dev-cheap dev-smoke headless caution
Composite: 0.415
Success: 59%
Requests: 475
Wall time: 41m 34s
Total cost: $0.0000
Catalog speed: n/a
Catalog blended price: $0.0000 / 1M tok
Intelligence: n/a
Agentic: n/a
Benchmark support: unsupported

Primary blended price derived automatically from OpenRouter listing minimax/minimax-m2.5:free using a 3:1 input:output blend. Reference price uses minimax/minimax-m2.5 at 0.336 USD per 1M tokens from the same OpenRouter family.

Observed to trigger external_directory permission prompts in headless runs.

Benchmarked model

opencode/gemini-3-flash

google low price tier dev-cheap dev-general headless caution
Composite: 0.415
Success: 59%
Requests: 508
Wall time: 62m 52s
Total cost: $2.4307
Catalog speed: 179 tok/s
Catalog blended price: $1.1000 / 1M tok
Intelligence: 46
Agentic: n/a
Benchmark support: limited

OpenRouter reference blend for google/gemini-3-flash-preview is 1.125 USD per 1M tokens using a 3:1 input:output mix.

Observed to loop and hit the benchmark process deadline on the task-05 scripting smoke run, so it should not be in the default headless dev matrix.

Benchmarked model

opencode/gemini-3.1-pro

google medium price tier dev-cheap dev-general
Composite: 0.291
Success: 37%
Requests: 307
Wall time: 51m 25s
Total cost: $5.8536
Catalog speed: 127 tok/s
Catalog blended price: $4.5000 / 1M tok
Intelligence: 57
Agentic: n/a
Benchmark support: unknown

OpenRouter reference blend for google/gemini-3.1-pro-preview is 4.5 USD per 1M tokens using a 3:1 input:output mix.

Benchmarked model

opencode/nemotron-3-super-free

nvidia free price tier dev-cheap dev-smoke headless caution
Composite: 0.181
Success: 26%
Requests: 502
Wall time: 109m 00s
Total cost: $0.0000
Catalog speed: 155 tok/s
Catalog blended price: $0.0000 / 1M tok
Intelligence: 36
Agentic: n/a
Benchmark support: limited

OpenRouter reference blend for nvidia/nemotron-3-super-120b-a12b:free is 0 USD per 1M tokens using a 3:1 input:output mix. Reference price uses nvidia/nemotron-3-super-120b-a12b at 0.2 USD per 1M tokens from the same OpenRouter family.

Observed to take a slow, tool-heavy path on the scripting smoke task.

Reference

Less frequently used detail stays available, but out of the critical path.

Benchmark context and splits
Category and difficulty summaries, plus benchmark framing for deeper reading.
Category splits

Category summary

Average quality, speed, and cost by benchmark category.

kubernetes
7 tasks
opencode/claude-opus-4-6: 0.645 comp | 86% success
opencode/claude-sonnet-4-6: 0.545 comp | 71% success
opencode/gemini-3-flash: 0.6 comp | 86% success
opencode/gemini-3.1-pro: 0.335 comp | 43% success
opencode/gpt-5.4: 0.677 comp | 86% success
opencode/gpt-5.4-mini: 0.372 comp | 43% success
opencode/gpt-5.4-nano: 0.798 comp | 86% success
opencode/kimi-k2.5: 0.87 comp | 100% success
opencode/minimax-m2.5-free: 0.4 comp | 57% success
opencode/big-pickle: 0.663 comp | 71% success
opencode/glm-5: 0.555 comp | 71% success
opencode/glm-5.1: 0.483 comp | 57% success
opencode/minimax-m2.5: 0.385 comp | 43% success
opencode/nemotron-3-super-free: 0.1 comp | 14% success
gitops
5 tasks
opencode/big-pickle: 0.734 comp | 80% success
opencode/claude-opus-4-6: 0.749 comp | 100% success
opencode/claude-sonnet-4-6: 0.6 comp | 80% success
opencode/gemini-3-flash: 0.56 comp | 80% success
opencode/gemini-3.1-pro: 0.458 comp | 60% success
opencode/glm-5: 0.814 comp | 100% success
opencode/glm-5.1: 0.802 comp | 100% success
opencode/gpt-5.4: 0.78 comp | 100% success
opencode/gpt-5.4-nano: 0.917 comp | 100% success
opencode/kimi-k2.5: 0.874 comp | 100% success
opencode/minimax-m2.5: 0.686 comp | 80% success
opencode/minimax-m2.5-free: 0.56 comp | 80% success
opencode/nemotron-3-super-free: 0.28 comp | 40% success
opencode/gpt-5.4-mini: 0.356 comp | 40% success
scripting
4 tasks
opencode/claude-opus-4-6: 0.754 comp | 100% success
opencode/claude-sonnet-4-6: 0.763 comp | 100% success
opencode/glm-5: 0.801 comp | 100% success
opencode/gpt-5.4: 0.773 comp | 100% success
opencode/gpt-5.4-mini: 0.883 comp | 100% success
opencode/gpt-5.4-nano: 0.922 comp | 100% success
opencode/big-pickle: 0.737 comp | 75% success
opencode/gemini-3-flash: 0.175 comp | 25% success
opencode/gemini-3.1-pro: 0.2 comp | 25% success
opencode/glm-5.1: 0.415 comp | 50% success
opencode/kimi-k2.5: 0.438 comp | 50% success
opencode/minimax-m2.5: 0.425 comp | 50% success
opencode/minimax-m2.5-free: 0.35 comp | 50% success
opencode/nemotron-3-super-free: 0.175 comp | 25% success
networking
3 tasks
opencode/big-pickle: 0.586 comp | 67% success
opencode/claude-opus-4-6: 0.751 comp | 100% success
opencode/claude-sonnet-4-6: 0.757 comp | 100% success
opencode/gemini-3-flash: 0.233 comp | 33% success
opencode/glm-5: 0.813 comp | 100% success
opencode/gpt-5.4-nano: 0.961 comp | 100% success
opencode/minimax-m2.5: 0.833 comp | 100% success
opencode/gemini-3.1-pro: 0.0 comp | 0% success
opencode/glm-5.1: 0.264 comp | 33% success
opencode/gpt-5.4: 0.25 comp | 33% success
opencode/gpt-5.4-mini: 0.284 comp | 33% success
opencode/kimi-k2.5: 0.555 comp | 67% success
opencode/minimax-m2.5-free: 0.0 comp | 0% success
opencode/nemotron-3-super-free: 0.233 comp | 33% success
linux-hardening
2 tasks
opencode/claude-opus-4-6: 0.755 comp | 100% success
opencode/gemini-3-flash: 0.7 comp | 100% success
opencode/glm-5.1: 0.403 comp | 50% success
opencode/gpt-5.4: 0.797 comp | 100% success
opencode/kimi-k2.5: 0.93 comp | 100% success
opencode/minimax-m2.5-free: 0.7 comp | 100% success
opencode/big-pickle: 0.443 comp | 50% success
opencode/claude-sonnet-4-6: 0.377 comp | 50% success
opencode/gemini-3.1-pro: 0.0 comp | 0% success
opencode/glm-5: 0.397 comp | 50% success
opencode/gpt-5.4-mini: 0.0 comp | 0% success
opencode/gpt-5.4-nano: 0.459 comp | 50% success
opencode/minimax-m2.5: 0.449 comp | 50% success
opencode/nemotron-3-super-free: 0.0 comp | 0% success
platform-bootstrap
2 tasks
opencode/claude-opus-4-6: 0.777 comp | 100% success
opencode/claude-sonnet-4-6: 0.79 comp | 100% success
opencode/gemini-3-flash: 0.35 comp | 50% success
opencode/glm-5.1: 0.83 comp | 100% success
opencode/kimi-k2.5: 0.944 comp | 100% success
opencode/minimax-m2.5: 0.45 comp | 50% success
opencode/big-pickle: 0.466 comp | 50% success
opencode/gemini-3.1-pro: 0.0 comp | 0% success
opencode/glm-5: 0.0 comp | 0% success
opencode/gpt-5.4: 0.0 comp | 0% success
opencode/gpt-5.4-mini: 0.0 comp | 0% success
opencode/gpt-5.4-nano: 0.484 comp | 50% success
opencode/minimax-m2.5-free: 0.35 comp | 50% success
opencode/nemotron-3-super-free: 0.0 comp | 0% success
ansible
1 task
opencode/big-pickle: 0.963 comp | 100% success
opencode/claude-opus-4-6: 0.747 comp | 100% success
opencode/claude-sonnet-4-6: 0.76 comp | 100% success
opencode/gemini-3.1-pro: 0.778 comp | 100% success
opencode/glm-5: 0.832 comp | 100% success
opencode/glm-5.1: 0.841 comp | 100% success
opencode/gpt-5.4: 0.792 comp | 100% success
opencode/gpt-5.4-mini: 0.918 comp | 100% success
opencode/gpt-5.4-nano: 0.889 comp | 100% success
opencode/kimi-k2.5: 0.918 comp | 100% success
opencode/minimax-m2.5-free: 0.7 comp | 100% success
opencode/nemotron-3-super-free: 0.7 comp | 100% success
opencode/gemini-3-flash: 0.0 comp | 0% success
opencode/minimax-m2.5: 0.0 comp | 0% success
docker-compose
1 task
opencode/gemini-3.1-pro: 0.813 comp | 100% success
opencode/gpt-5.4-nano: 0.975 comp | 100% success
opencode/kimi-k2.5: 0.836 comp | 100% success
opencode/big-pickle: 0.0 comp | 0% success
opencode/claude-opus-4-6: 0.0 comp | 0% success
opencode/claude-sonnet-4-6: 0.0 comp | 0% success
opencode/gemini-3-flash: 0.0 comp | 0% success
opencode/glm-5: 0.0 comp | 0% success
opencode/glm-5.1: 0.0 comp | 0% success
opencode/gpt-5.4: 0.0 comp | 0% success
opencode/gpt-5.4-mini: 0.0 comp | 0% success
opencode/minimax-m2.5: 0.0 comp | 0% success
opencode/minimax-m2.5-free: 0.0 comp | 0% success
opencode/nemotron-3-super-free: 0.0 comp | 0% success
iac
1 task
opencode/glm-5: 0.819 comp | 100% success
opencode/glm-5.1: 0.798 comp | 100% success
opencode/gpt-5.4: 0.791 comp | 100% success
opencode/gpt-5.4-mini: 1.0 comp | 100% success
opencode/kimi-k2.5: 0.823 comp | 100% success
opencode/minimax-m2.5-free: 0.7 comp | 100% success
opencode/big-pickle: 0.0 comp | 0% success
opencode/claude-opus-4-6: 0.0 comp | 0% success
opencode/claude-sonnet-4-6: 0.0 comp | 0% success
opencode/gemini-3-flash: 0.0 comp | 0% success
opencode/gemini-3.1-pro: 0.0 comp | 0% success
opencode/gpt-5.4-nano: 0.0 comp | 0% success
opencode/minimax-m2.5: 0.0 comp | 0% success
opencode/nemotron-3-super-free: 0.0 comp | 0% success
terraform
1 task
opencode/big-pickle: 0.806 comp | 100% success
opencode/claude-opus-4-6: 0.755 comp | 100% success
opencode/claude-sonnet-4-6: 0.771 comp | 100% success
opencode/gemini-3-flash: 0.7 comp | 100% success
opencode/gemini-3.1-pro: 0.832 comp | 100% success
opencode/glm-5: 0.776 comp | 100% success
opencode/glm-5.1: 0.816 comp | 100% success
opencode/gpt-5.4: 0.775 comp | 100% success
opencode/gpt-5.4-mini: 0.789 comp | 100% success
opencode/gpt-5.4-nano: 0.822 comp | 100% success
opencode/kimi-k2.5: 0.978 comp | 100% success
opencode/minimax-m2.5: 0.862 comp | 100% success
opencode/minimax-m2.5-free: 0.7 comp | 100% success
opencode/nemotron-3-super-free: 0.7 comp | 100% success
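The category summaries above reduce to a per-category mean of per-task composite scores. A minimal sketch of that aggregation; the task records in the example are hypothetical:

```python
# Hedged sketch of the category-summary math: group per-task composite
# scores by category and take the mean of each bucket.
from collections import defaultdict

def category_means(task_scores):
    """task_scores: list of (category, composite) pairs for one model."""
    buckets = defaultdict(list)
    for cat, score in task_scores:
        buckets[cat].append(score)
    return {cat: sum(v) / len(v) for cat, v in buckets.items()}

# Hypothetical per-task results for a single model.
scores = [("gitops", 0.8), ("gitops", 0.6), ("terraform", 0.9)]
print(category_means(scores))
```

Note that single-task categories (ansible, docker-compose, iac, terraform) make the "average" identical to one run's score, so those rows carry much less signal than kubernetes or gitops.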
Difficulty splits

Difficulty summary

How performance changes as the suite moves from control to expert tasks.

control
3 tasks
opencode/claude-opus-4-6: 0.757 comp | 100% success
opencode/claude-sonnet-4-6: 0.768 comp | 100% success
opencode/glm-5: 0.794 comp | 100% success
opencode/gpt-5.4: 0.781 comp | 100% success
opencode/gpt-5.4-mini: 0.885 comp | 100% success
opencode/gpt-5.4-nano: 0.921 comp | 100% success
opencode/big-pickle: 0.655 comp | 67% success
opencode/gemini-3-flash: 0.0 comp | 0% success
opencode/gemini-3.1-pro: 0.0 comp | 0% success
opencode/glm-5.1: 0.279 comp | 33% success
opencode/kimi-k2.5: 0.584 comp | 67% success
opencode/minimax-m2.5: 0.284 comp | 33% success
opencode/minimax-m2.5-free: 0.233 comp | 33% success
opencode/nemotron-3-super-free: 0.0 comp | 0% success
expert
12 tasks
opencode/claude-opus-4-6: 0.692 comp | 92% success
opencode/claude-sonnet-4-6: 0.636 comp | 83% success
opencode/gemini-3-flash: 0.467 comp | 67% success
opencode/glm-5.1: 0.619 comp | 75% success
opencode/kimi-k2.5: 0.877 comp | 100% success
opencode/minimax-m2.5: 0.503 comp | 58% success
opencode/big-pickle: 0.612 comp | 67% success
opencode/gemini-3.1-pro: 0.188 comp | 25% success
opencode/glm-5: 0.597 comp | 75% success
opencode/gpt-5.4: 0.523 comp | 67% success
opencode/gpt-5.4-mini: 0.223 comp | 25% success
opencode/gpt-5.4-nano: 0.78 comp | 83% success
opencode/minimax-m2.5-free: 0.35 comp | 50% success
opencode/nemotron-3-super-free: 0.117 comp | 17% success
high
10 tasks
opencode/big-pickle: 0.641 comp | 70% success
opencode/claude-opus-4-6: 0.675 comp | 90% success
opencode/claude-sonnet-4-6: 0.529 comp | 70% success
opencode/gemini-3-flash: 0.49 comp | 70% success
opencode/gemini-3.1-pro: 0.56 comp | 70% success
opencode/glm-5: 0.564 comp | 70% success
opencode/glm-5.1: 0.571 comp | 70% success
opencode/gpt-5.4: 0.702 comp | 90% success
opencode/gpt-5.4-mini: 0.514 comp | 60% success
opencode/gpt-5.4-nano: 0.824 comp | 90% success
opencode/kimi-k2.5: 0.809 comp | 90% success
opencode/minimax-m2.5: 0.519 comp | 60% success
opencode/minimax-m2.5-free: 0.56 comp | 80% success
opencode/nemotron-3-super-free: 0.35 comp | 50% success
medium
2 tasks
opencode/glm-5: 0.82 comp | 100% success
opencode/glm-5.1: 0.399 comp | 50% success
opencode/gpt-5.4: 0.395 comp | 50% success
opencode/gpt-5.4-mini: 0.5 comp | 50% success
opencode/kimi-k2.5: 0.411 comp | 50% success
opencode/minimax-m2.5-free: 0.35 comp | 50% success
opencode/big-pickle: 0.444 comp | 50% success
opencode/claude-opus-4-6: 0.388 comp | 50% success
opencode/claude-sonnet-4-6: 0.392 comp | 50% success
opencode/gemini-3-flash: 0.35 comp | 50% success
opencode/gemini-3.1-pro: 0.0 comp | 0% success
opencode/gpt-5.4-nano: 0.476 comp | 50% success
opencode/minimax-m2.5: 0.452 comp | 50% success
opencode/nemotron-3-super-free: 0.0 comp | 0% success
Smoke failures and non-mainline evidence
Failed smoke runs are still published for transparency, without occupying prime dashboard space.
Published control-task evidence

Candidate smoke outcomes

These runs are not mixed into the main full-run leaderboard, but they are intentionally published for transparency, including provider-side failures, timeout behavior, and raw verifier evidence.

Decision support
Model Outcome Control Tasks Requests Cost Recommendation
opencode/minimax-m2.5-free failed smoke 1/3 24 unknown retry
opencode/qwen3.6-plus-free provider-model-not-found smoke 0/3 0 unknown wait for provider
opencode/trinity-large-preview-free provider-model-not-found smoke 0/3 0 unknown wait for provider
opencode/gpt-5.4-nano failed smoke 2/3 31 $0.0019 block
opencode/gpt-5.1-codex-mini failed smoke 0/3 19 unknown retry
opencode/gemini-3-flash failed smoke 0/3 13 unknown retry
opencode/big-pickle passed smoke 3/3 28 $0.0000 promote
opencode/kimi-k2.5 passed smoke 3/3 28 $0.0058 promote
opencode/glm-5 failed smoke 0/3 14 unknown retry
opencode/glm-5.1 failed smoke 1/3 22 $0.0038 retry
opencode/minimax-m2.5 failed smoke 2/3 32 $0.0126 block
opencode/gemini-3-flash failed smoke 1/3 20 $0.0023 retry
opencode/glm-5.1 failed smoke 0/1 0 unknown retry
opencode/minimax-m2.5 passed smoke 1/1 8 $0.0039 promote
opencode/glm-5.1 failed smoke 0/2 17 unknown retry
opencode/minimax-m2.5 passed smoke 1/1 8 $0.0039 promote
opencode/glm-5 failed smoke 0/3 4 unknown retry
opencode/glm-5.1 failed smoke 0/3 17 unknown retry
opencode/minimax-m2.5 failed smoke 2/3 32 $0.0088 retry
opencode/claude-haiku-4-5 failed smoke 2/3 62 $0.0077 retry
opencode/glm-5.1 failed smoke 0/1 3 unknown retry
opencode/glm-5.1 failed smoke 0/3 18 unknown retry
opencode/minimax-m2.5 passed smoke 3/3 27 $0.0124 promote
opencode/claude-haiku-4-5 failed smoke 2/3 48 $0.0082 retry
opencode/big-pickle failed smoke 1/3 23 unknown retry
opencode/claude-haiku-4-5 failed smoke 2/3 55 $0.0080 retry
opencode/glm-5.1 failed smoke 0/3 7 unknown retry
opencode/glm-5.1 failed smoke 0/3 10 unknown retry
opencode/claude-3-5-haiku provider-http-error smoke 0/3 3 unknown wait for provider
opencode/claude-haiku-4-5 failed smoke 2/3 44 $0.0071 retry
opencode/big-pickle failed smoke 2/3 21 unknown retry
opencode/gemini-3.1-pro failed smoke 0/3 3 unknown retry
opencode/minimax-m2.5 failed smoke 0/3 30 $0.0043 retry
opencode/qwen3.6-plus-free provider-model-not-found smoke 0/3 0 unknown wait for provider
opencode/trinity-large-preview-free provider-model-not-found smoke 0/3 0 unknown wait for provider
opencode/gemini-3.1-pro failed smoke 0/3 4 unknown retry
opencode/minimax-m2.5 failed smoke 2/3 22 $0.0082 retry
opencode/qwen3.6-plus-free provider-model-not-found smoke 0/3 0 unknown wait for provider
opencode/trinity-large-preview-free provider-model-not-found smoke 0/3 0 unknown wait for provider
opencode/claude-haiku-4-5 failed smoke 0/1 26 unknown retry
opencode/claude-3-5-haiku provider-http-error smoke 0/1 1 unknown wait for provider
opencode/trinity-large-preview-free provider-model-not-found smoke 0/1 0 unknown wait for provider
opencode/qwen3.6-plus-free provider-model-not-found smoke 0/1 0 unknown wait for provider
opencode/minimax-m2.5 failed smoke 0/1 17 unknown retry
opencode/minimax-m2.1 provider-model-not-found smoke 0/1 0 unknown wait for provider
opencode/gpt-5.4-nano failed smoke 0/1 10 unknown retry
opencode/gpt-5.1-codex-mini failed smoke 0/1 6 unknown retry
opencode/gpt-5-nano provider-limited smoke 0/1 5 unknown wait for provider
opencode/gemini-3-pro provider-model-not-found smoke 0/1 0 unknown wait for provider
opencode/gemini-3.1-pro failed smoke 0/1 0 unknown retry
opencode/glm-5 passed smoke 1/1 20 $0.0160 promote
opencode/nemotron-3-super-free provider-limited smoke 0/1 1 unknown wait for provider
opencode/nemotron-3-super-free provider-limited smoke 0/1 1 unknown wait for provider
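The recommendation column above can be approximated by a simple policy, sketched below under stated assumptions: provider-side outcomes always wait, clean passes promote, and the retry/block split depends on run history. The rows above show the real pipeline also weighs repeated attempts, so the thresholds here are illustrative, not the published logic:

```python
# Hedged sketch of a smoke-outcome -> recommendation policy.
# Assumption: block requires repeated failure evidence; the actual
# pipeline's thresholds are not published.

def recommend(outcome, passed, total, prior_failures=0):
    """outcome: smoke outcome string; passed/total: control tasks cleared."""
    if outcome.startswith("provider-"):
        return "wait for provider"  # provider-side failure, not the model's
    if passed == total:
        return "promote"            # clean pass: eligible for full runs
    if prior_failures > 0:
        return "block"              # repeated failures: stop spending on it
    return "retry"                  # single noisy result: rerun first

print(recommend("passed", 3, 3))               # promote
print(recommend("provider-http-error", 0, 3))  # wait for provider
```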
opencode/minimax-m2.5-free
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-09T02-43-07-636Z
failed smoke
success=33% | dnf=1 | requests=24 | calls=24 | steps=27 | cost=unknown | wall=2m 42s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=7 | request_count=7 | steps=8
Timed out after 60000ms | Task exceeded 60s hard timeout
opencode/qwen3.6-plus-free
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-09T02-43-07-636Z
provider-model-not-found smoke
success=0% | dnf=0 | requests=0 | calls=0 | steps=3 | cost=unknown | wall=1s
verifier_exit=1 | request_count=0 | steps=1
ProviderModelNotFoundError
opencode/trinity-large-preview-free
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-09T02-43-07-636Z
provider-model-not-found smoke
success=0% | dnf=0 | requests=0 | calls=0 | steps=3 | cost=unknown | wall=1s
verifier_exit=1 | request_count=0 | steps=1
ProviderModelNotFoundError
opencode/gpt-5.4-nano
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-09T00-25-29-045Z
failed smoke
success=67% | dnf=0 | requests=31 | calls=31 | steps=34 | cost=$0.0019 | wall=1m 53s
provider_http=200 | verifier_exit=1 | request_units=10 | request_count=10 | steps=11
opencode/gpt-5.1-codex-mini
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-09T00-25-29-045Z
failed smoke
success=0% | dnf=3 | requests=19 | calls=19 | steps=21 | cost=unknown | wall=3m 00s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=3 | request_count=3 | steps=4
Timed out after 45000ms | Task exceeded 45s hard timeout
opencode/gemini-3-flash
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T23-10-22-104Z
failed smoke
success=0% | dnf=3 | requests=13 | calls=13 | steps=14 | cost=unknown | wall=3m 00s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=2 | request_count=2 | steps=3
Timed out after 45000ms | Task exceeded 45s hard timeout
opencode/big-pickle
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T23-10-22-104Z
passed smoke
success=100% | dnf=0 | requests=28 | calls=28 | steps=31 | cost=$0.0000 | wall=1m 23s
opencode/kimi-k2.5
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T23-10-22-104Z
passed smoke
success=100% | dnf=0 | requests=28 | calls=28 | steps=31 | cost=$0.0058 | wall=55s
opencode/glm-5
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T21-09-27-413Z
failed smoke
success=0% | dnf=3 | requests=14 | calls=14 | steps=15 | cost=unknown | wall=3m 00s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=1 | request_count=1 | steps=2
Timed out after 45000ms | Task exceeded 45s hard timeout
opencode/glm-5.1
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T21-09-27-413Z
failed smoke
success=33% | dnf=2 | requests=22 | calls=22 | steps=25 | cost=$0.0038 | wall=2m 58s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=5 | request_count=5 | steps=6
Timed out after 45000ms | Task exceeded 45s hard timeout
opencode/minimax-m2.5
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T21-09-27-413Z
failed smoke
success=67% | dnf=0 | requests=32 | calls=32 | steps=35 | cost=$0.0126 | wall=1m 33s
provider_http=200 | verifier_exit=1 | request_units=15 | request_count=15 | steps=16
opencode/gemini-3-flash
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T21-09-27-413Z
failed smoke
success=33% | dnf=2 | requests=20 | calls=20 | steps=22 | cost=$0.0023 | wall=2m 43s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=3 | request_count=3 | steps=4
Timed out after 60000ms | Task exceeded 60s hard timeout
opencode/glm-5.1
16-event-status-shell | 2026-04-08T20-57-47-380Z
failed smoke
success=0% | dnf=1 | requests=0 | calls=0 | steps=0 | cost=unknown | wall=31s
dnf_reason=process-timeout | verifier_exit=124 | request_units=0 | request_count=0 | steps=0
Model run timed out after 31s
opencode/minimax-m2.5
16-event-status-shell | 2026-04-08T20-57-47-380Z
passed smoke
success=100% | dnf=0 | requests=8 | calls=8 | steps=9 | cost=$0.0039 | wall=25s
opencode/glm-5.1
16-event-status-shell | 2026-04-08T20-55-31-047Z
failed smoke
success=0% | dnf=2 | requests=17 | calls=17 | steps=19 | cost=unknown | wall=1m 10s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=9 | request_count=9 | steps=10
Timed out after 30000ms | Task exceeded 30s hard timeout
opencode/minimax-m2.5
16-event-status-shell | 2026-04-08T20-55-31-047Z
passed smoke
success=100% | dnf=0 | requests=8 | calls=8 | steps=9 | cost=$0.0039 | wall=29s
opencode/glm-5
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T20-25-30-874Z
failed smoke
success=0% | dnf=3 | requests=4 | calls=4 | steps=5 | cost=unknown | wall=3m 23s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=2 | request_count=2 | steps=3
Timed out after 30000ms | Task exceeded 30s hard timeout
opencode/glm-5.1
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T20-25-30-874Z
failed smoke
success=0% | dnf=3 | requests=17 | calls=17 | steps=19 | cost=unknown | wall=3m 23s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=9 | request_count=9 | steps=10
Timed out after 30000ms | Task exceeded 30s hard timeout
opencode/minimax-m2.5
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T20-25-30-874Z
failed smoke
success=67% | dnf=1 | requests=32 | calls=32 | steps=34 | cost=$0.0088 | wall=1m 42s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=9 | request_count=9 | steps=10
Timed out after 30000ms | Task exceeded 30s hard timeout
opencode/claude-haiku-4-5
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T20-25-30-874Z
failed smoke
success=67% | dnf=1 | requests=62 | calls=62 | steps=65 | cost=$0.0077 | wall=1m 55s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=16 | request_count=16 | steps=17
Timed out after 30000ms | Task exceeded 30s hard timeout
opencode/glm-5.1
16-event-status-shell | 2026-04-08T20-22-14-719Z
failed smoke
success=0% | dnf=1 | requests=3 | calls=3 | steps=4 | cost=unknown | wall=30s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=3 | request_count=3 | steps=4
Timed out after 30000ms | Task exceeded 30s hard timeout
opencode/glm-5.1
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T19-53-23-979Z
failed smoke
success=0% | dnf=3 | requests=18 | calls=18 | steps=20 | cost=unknown | wall=3m 23s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=10 | request_count=10 | steps=11
Timed out after 30000ms | Task exceeded 30s hard timeout
opencode/minimax-m2.5
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T19-53-23-979Z
passed smoke
success=100% | dnf=0 | requests=27 | calls=27 | steps=30 | cost=$0.0124 | wall=1m 22s
opencode/claude-haiku-4-5
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T19-53-23-979Z
failed smoke
success=67% | dnf=1 | requests=48 | calls=48 | steps=51 | cost=$0.0082 | wall=1m 30s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=17 | request_count=17 | steps=18
Timed out after 30000ms | Task exceeded 30s hard timeout
opencode/big-pickle
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T19-44-33-994Z
failed smoke
success=33% | dnf=2 | requests=23 | calls=23 | steps=26 | cost=unknown | wall=1m 29s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=4 | request_count=4 | steps=5
Timed out after 20000ms | Task exceeded 20s hard timeout
opencode/claude-haiku-4-5
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T19-44-33-994Z
failed smoke
success=67% | dnf=1 | requests=55 | calls=55 | steps=57 | cost=$0.0080 | wall=1m 48s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=11 | request_count=11 | steps=12
Timed out after 20000ms | Task exceeded 20s hard timeout
opencode/glm-5.1
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T19-44-33-994Z
failed smoke
success=0% | dnf=3 | requests=7 | calls=7 | steps=8 | cost=unknown | wall=3m 03s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=3 | request_count=3 | steps=4
Timed out after 20000ms | Task exceeded 20s hard timeout
opencode/glm-5.1
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T19-16-03-842Z
failed smoke
success=0% | dnf=3 | requests=10 | calls=10 | steps=11 | cost=unknown | wall=3m 03s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=4 | request_count=4 | steps=5
Timed out after 20000ms | Task exceeded 20s hard timeout
opencode/claude-3-5-haiku
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T19-16-03-842Z
provider-http-error smoke
success=0% | dnf=0 | requests=3 | calls=3 | steps=3 | cost=unknown | wall=1s
provider_http=400 | verifier_exit=1 | request_units=1 | request_count=1 | steps=1
AI_APICallError
opencode/claude-haiku-4-5
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T19-16-03-842Z
failed smoke
success=67% | dnf=1 | requests=44 | calls=44 | steps=47 | cost=$0.0071 | wall=1m 32s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=9 | request_count=9 | steps=10
Timed out after 20000ms | Task exceeded 20s hard timeout
opencode/big-pickle
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T19-16-03-842Z
failed smoke
success=67% | dnf=1 | requests=21 | calls=21 | steps=24 | cost=unknown | wall=1m 21s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=5 | request_count=5 | steps=6
Timed out after 20000ms | Task exceeded 20s hard timeout
opencode/gemini-3.1-pro
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T19-03-13-728Z
failed smoke
success=0% | dnf=3 | requests=3 | calls=3 | steps=4 | cost=unknown | wall=3m 03s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=1 | request_count=1 | steps=2
Timed out after 20000ms | Task exceeded 20s hard timeout
opencode/minimax-m2.5
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T19-03-13-728Z
failed smoke
success=0% | dnf=2 | requests=30 | calls=30 | steps=32 | cost=$0.0043 | wall=1m 35s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=7 | request_count=7 | steps=8
Timed out after 20000ms | Task exceeded 20s hard timeout
opencode/qwen3.6-plus-free
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T19-03-13-728Z
provider-model-not-found smoke
success=0% | dnf=0 | requests=0 | calls=0 | steps=3 | cost=unknown | wall=0s
verifier_exit=1 | request_count=0 | steps=1
ProviderModelNotFoundError
opencode/trinity-large-preview-free
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T19-03-13-728Z
provider-model-not-found smoke
success=0% | dnf=0 | requests=0 | calls=0 | steps=3 | cost=unknown | wall=0s
verifier_exit=1 | request_count=0 | steps=1
ProviderModelNotFoundError
opencode/gemini-3.1-pro
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T18-40-11-241Z
failed smoke
success=0% | dnf=3 | requests=4 | calls=4 | steps=5 | cost=unknown | wall=3m 03s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=1 | request_count=1 | steps=2
Timed out after 20000ms | Task exceeded 20s hard timeout
opencode/minimax-m2.5
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T18-40-11-241Z
failed smoke
success=67% | dnf=1 | requests=22 | calls=22 | steps=25 | cost=$0.0082 | wall=1m 16s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=5 | request_count=5 | steps=6
Timed out after 20000ms | Task exceeded 20s hard timeout
opencode/qwen3.6-plus-free
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T18-40-11-241Z
provider-model-not-found smoke
success=0% | dnf=0 | requests=0 | calls=0 | steps=3 | cost=unknown | wall=0s
verifier_exit=1 | request_count=0 | steps=1
ProviderModelNotFoundError
opencode/trinity-large-preview-free
16-event-status-shell, 17-log-level-rollup, 05-log-audit-script | 2026-04-08T18-40-11-241Z
provider-model-not-found smoke
success=0% | dnf=0 | requests=0 | calls=0 | steps=3 | cost=unknown | wall=0s
verifier_exit=1 | request_count=0 | steps=1
ProviderModelNotFoundError
opencode/claude-haiku-4-5
05-log-audit-script | 2026-04-08T07-54-37-316Z
failed smoke
success=0% | dnf=1 | requests=26 | calls=26 | steps=27 | cost=unknown | wall=1m 00s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=26 | request_count=26 | steps=27
Timed out after 59641ms | Task exceeded 60s hard timeout
opencode/claude-3-5-haiku
05-log-audit-script | 2026-04-08T07-54-34-702Z
provider-http-error smoke
success=0% | dnf=0 | requests=1 | calls=1 | steps=1 | cost=unknown | wall=1s
provider_http=400 | verifier_exit=1 | request_units=1 | request_count=1 | steps=1
AI_APICallError
opencode/trinity-large-preview-free
05-log-audit-script | 2026-04-08T07-54-32-421Z
provider-model-not-found smoke
success=0% | dnf=0 | requests=0 | calls=0 | steps=1 | cost=unknown | wall=0s
verifier_exit=1 | request_count=0 | steps=1
ProviderModelNotFoundError
opencode/qwen3.6-plus-free
05-log-audit-script | 2026-04-08T07-54-30-099Z
provider-model-not-found smoke
success=0% | dnf=0 | requests=0 | calls=0 | steps=1 | cost=unknown | wall=0s
verifier_exit=1 | request_count=0 | steps=1
ProviderModelNotFoundError
opencode/minimax-m2.5
05-log-audit-script | 2026-04-08T07-53-26-631Z
failed smoke
success=0% | dnf=1 | requests=17 | calls=17 | steps=18 | cost=unknown | wall=1m 00s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=17 | request_count=17 | steps=18
Timed out after 59640ms | Task exceeded 60s hard timeout
opencode/minimax-m2.1
05-log-audit-script | 2026-04-08T07-53-24-307Z
provider-model-not-found smoke
success=0% | dnf=0 | requests=0 | calls=0 | steps=1 | cost=unknown | wall=0s
verifier_exit=1 | request_count=0 | steps=1
ProviderModelNotFoundError
opencode/gpt-5.4-nano
05-log-audit-script | 2026-04-08T07-52-20-137Z
failed smoke
success=0% | dnf=1 | requests=10 | calls=10 | steps=11 | cost=unknown | wall=1m 00s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=10 | request_count=10 | steps=11
Timed out after 59588ms | Task exceeded 60s hard timeout
opencode/gpt-5.1-codex-mini
05-log-audit-script | 2026-04-08T07-51-04-317Z
failed smoke
success=0% | dnf=1 | requests=6 | calls=6 | steps=7 | cost=unknown | wall=1m 00s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=6 | request_count=6 | steps=7
Timed out after 59641ms | Task exceeded 60s hard timeout
opencode/gpt-5-nano
05-log-audit-script | 2026-04-08T07-50-01-063Z
provider-limited smoke
success=0% | dnf=1 | requests=5 | calls=5 | steps=1 | cost=unknown | wall=1m 00s
provider_http=200 | dnf_reason=task-timeout | verifier_exit=124 | request_units=5 | request_count=5 | steps=1
Timed out after 59690ms | Task exceeded 60s hard timeout
opencode/gemini-3-pro
05-log-audit-script | 2026-04-08T07-49-58-687Z
provider-model-not-found smoke
success=0% | dnf=0 | requests=0 | calls=0 | steps=1 | cost=unknown | wall=0s
verifier_exit=1 | request_count=0 | steps=1
ProviderModelNotFoundError | suggested_model=gemini-3.1-pro
opencode/gemini-3.1-pro
05-log-audit-script | 2026-04-08T07-47-51-207Z
failed smoke
success=0% | dnf=1 | requests=0 | calls=0 | steps=1 | cost=unknown | wall=1m 00s
dnf_reason=task-timeout | verifier_exit=124 | request_count=0 | steps=1
Timed out after 59642ms | Task exceeded 60s hard timeout
opencode/glm-5
05-log-audit-script | 2026-04-08T07-47-09-306Z
passed smoke
success=100% | dnf=0 | requests=20 | calls=20 | steps=21 | cost=$0.0160 | wall=39s
opencode/nemotron-3-super-free
05-log-audit-script | 2026-04-08T06-34-02-017Z
provider-limited smoke
success=0% | dnf=1 | requests=1 | calls=1 | steps=1 | cost=unknown | wall=1m 00s
provider_http=429 | retry_after=62757 | dnf_reason=task-timeout | verifier_exit=124 | request_units=1 | request_count=1 | steps=1
AI_APICallError | Timed out after 59539ms | Task exceeded 60s hard timeout
opencode/nemotron-3-super-free
05-log-audit-script | 2026-04-08T06-09-00-496Z
provider-limited smoke
success=0% | dnf=1 | requests=1 | calls=1 | steps=1 | cost=unknown | wall=1m 00s
provider_http=429 | retry_after=64258 | dnf_reason=task-timeout | verifier_exit=124 | request_units=1 | request_count=1 | steps=1
AI_APICallError | Timed out after 59588ms | Task exceeded 60s hard timeout
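The pipe-delimited stats lines in the run log above can be split into fields with a short helper. This is a minimal sketch assuming only the `key=value | key=value` layout visible in the records; it is not an official parser, and it keeps every value as a raw string rather than guessing at types.

```python
def parse_stats(line: str) -> dict:
    """Split a run-log stats line such as
    'success=33% | dnf=1 | cost=unknown | wall=2m 42s'
    into a {key: value} dict, keeping values as raw strings."""
    fields = {}
    for part in line.split("|"):
        key, _, value = part.strip().partition("=")
        fields[key] = value
    return fields

stats = parse_stats("success=33% | dnf=1 | requests=24 | cost=unknown | wall=2m 42s")
print(stats["success"])  # 33%
print(stats["wall"])     # 2m 42s
```

Keeping values as strings is deliberate: fields like `cost` can be either a dollar amount or the literal `unknown`, so any numeric conversion is better done per field by downstream analysis.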
Docs and raw artifacts
Raw JSON, schemas, benchmark design, model catalog, and repository links.
Historical snapshots
Archived benchmark snapshots remain available for longitudinal comparisons.
Older benchmark publications appear here when available.

Run history

Snapshots are sorted by their top composite score by default. Use the raw JSON links for detailed offline analysis.
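For offline analysis, the wall-time strings used throughout the log and leaderboard (`55s`, `2m 42s`, `64m 39s`) need converting to a single unit before they can be compared or summed. A hedged sketch, assuming only the `Xm Ys` / `Ys` formats that actually appear above:

```python
import re

def wall_seconds(wall: str) -> int:
    """Convert a wall-time string like '2m 42s', '55s', or '1m 00s'
    (the format used in the run log) to total seconds."""
    m = re.fullmatch(r"(?:(\d+)m\s*)?(\d+)s", wall.strip())
    if m is None:
        raise ValueError(f"unrecognized wall time: {wall!r}")
    minutes = int(m.group(1) or 0)
    return minutes * 60 + int(m.group(2))

print(wall_seconds("2m 42s"))   # 162
print(wall_seconds("64m 39s"))  # 3879
```

The strict `fullmatch` with an explicit error is intentional: if a future snapshot introduces an hours component, the converter fails loudly instead of silently miscounting.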