This page uses opencode/gemini-3-flash as the comparison baseline. Every chart and table below answers the same question: where this model leads, where it lags, and what that costs in quality, time, and request pressure (the ORPT column in the tables).
Charts treat the baseline as zero: positive bars mean a model sits above the baseline on that metric; negative bars mean it trails.
Use this to decide whether another model beats opencode/gemini-3-flash by enough to justify switching.
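For readers who want to reproduce the convention, here is a minimal sketch of the baseline-as-zero delta computation. The metric names and dict shape are assumptions, not the benchmark's internal schema; the values are copied from the leaderboard below, and the last digit of a delta can differ from the table because the table rounds after computing deltas on unrounded inputs.

```python
# Minimal sketch of the baseline-as-zero convention.
# Metric names are assumptions; values come from the leaderboard below.
BASELINE = {"composite": 0.415, "success": 0.59, "cost_usd": 2.4307}

def deltas_vs_baseline(model: dict) -> dict:
    """Signed delta per metric: positive means above the baseline."""
    return {k: round(model[k] - BASELINE[k], 4) for k in BASELINE}

gpt_54_nano = {"composite": 0.789, "success": 0.85, "cost_usd": 0.4215}
print(deltas_vs_baseline(gpt_54_nano))
# {'composite': 0.374, 'success': 0.26, 'cost_usd': -2.0092}
# (table shows +0.375 / +26% / -$2.0091 because it rounds last)
```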
| Model | Composite | Delta vs baseline | Success | Success delta | ORPT | ORPT delta | Cost | Cost delta | Wall time |
|---|---|---|---|---|---|---|---|---|---|
| opencode/gpt-5.4-nano | 0.789 | +0.375 | 85% | +26% | 15.17 | -6.64 | $0.4215 | -$2.0091 | 27m 33s |
| opencode/kimi-k2.5 | 0.785 | +0.37 | 89% | +30% | 14.25 | -7.56 | $0.9122 | -$1.5184 | 41m 05s |
| opencode/claude-opus-4-6 | 0.67 | +0.255 | 89% | +30% | 14.88 | -6.94 | $21.8757 | +$19.4450 | 40m 04s |
| opencode/glm-5 | 0.623 | +0.208 | 78% | +19% | 11.57 | -10.24 | $6.4339 | +$4.0033 | 20m 10s |
| opencode/big-pickle | 0.615 | +0.2 | 67% | +7% | 15.39 | -6.42 | $0.0000 | -$2.4307 | 36m 28s |
| opencode/gpt-5.4 | 0.609 | +0.194 | 78% | +19% | 11.00 | -10.81 | $8.9827 | +$6.5520 | 32m 47s |
| opencode/claude-sonnet-4-6 | 0.593 | +0.178 | 78% | +19% | 16.43 | -5.38 | $11.8406 | +$9.4099 | 42m 31s |
| opencode/glm-5.1 | 0.547 | +0.132 | 67% | +7% | 12.06 | -9.76 | $1.8816 | -$0.5490 | 64m 39s |
| opencode/minimax-m2.5 | 0.481 | +0.066 | 56% | -4% | 18.87 | -2.95 | $0.6413 | -$1.7894 | 32m 15s |
| opencode/gpt-5.4-mini | 0.425 | +0.01 | 48% | -11% | 9.54 | -12.27 | $1.0606 | -$1.3701 | 21m 48s |
| opencode/minimax-m2.5-free | 0.415 | +0.0 | 59% | +0% | 16.19 | -5.63 | $0.0000 | -$2.4307 | 41m 34s |
| opencode/gemini-3-flash (baseline) | 0.415 | +0.0 | 59% | +0% | 21.81 | +0.00 | $2.4307 | +$0.0000 | 62m 52s |
| opencode/gemini-3.1-pro | 0.291 | -0.124 | 37% | -22% | 12.70 | -9.11 | $5.8536 | +$3.4229 | 51m 25s |
| opencode/nemotron-3-super-free | 0.181 | -0.233 | 26% | -33% | 19.43 | -2.38 | $0.0000 | -$2.4307 | 109m 00s |
This table puts the most revealing tasks first: unsolved tasks, single-solver tasks, then tasks where the baseline trails the winner by a meaningful margin. A "dnf" result marks a run that did not finish; a sketch of the ordering rule follows the table.
| Task | Field read | Baseline result | Winner | Gap to winner | Baseline cost | Baseline time |
|---|---|---|---|---|---|---|
| RHEL k3s node preparation repair | Competitive split | failed | opencode/gpt-5.4-nano (1.0) | 1.0 | $0.0468 | 49s |
| Event status shell summary | Competitive split | dnf | opencode/big-pickle (1.0) | 1.0 | n/a | 45s |
| Kubernetes rollout repair | Clear separation | failed | opencode/gpt-5.4-mini (1.0) | 1.0 | $0.0523 | 1m 22s |
| nftables router ingress repair | Competitive split | failed | opencode/gpt-5.4-nano (0.98) | 0.98 | $0.0821 | 1m 41s |
| Docker Compose observability fix | Competitive split | failed | opencode/gpt-5.4-nano (0.975) | 0.975 | $0.0786 | 2m 26s |
| Pre-ArgoCD bootstrap sequencing | Competitive split | failed | opencode/gpt-5.4-nano (0.967) | 0.967 | $0.1001 | 2m 03s |
| Log level rollup shell script | Competitive split | dnf | opencode/big-pickle (0.965) | 0.965 | n/a | 1m 00s |
| Ansible nginx role completion | Competitive split | dnf | opencode/big-pickle (0.963) | 0.963 | n/a | 5m 00s |
| RHEL NetworkManager bridge VLAN repair | Competitive split | failed | opencode/gpt-5.4-nano (0.951) | 0.951 | $0.1243 | 2m 35s |
| Build workspace plane convergence | Competitive split | failed | opencode/gpt-5.4-nano (0.942) | 0.942 | n/a | 5m 01s |
| Log audit shell script | Competitive split | dnf | opencode/gpt-5.4-nano (0.935) | 0.935 | n/a | 1m 15s |
| SELinux registry volume label repair | Clear separation | passed | opencode/kimi-k2.5 (1.0) | 0.3 | $0.0799 | 1m 15s |
| K3s registry mirror trust repair | Competitive split | passed | opencode/big-pickle (1.0) | 0.3 | $0.2351 | 3m 40s |
| Bootstrap phase validation repair | Competitive split | passed | opencode/kimi-k2.5 (0.993) | 0.293 | $0.0782 | 3m 18s |
| Workspace transplant bundle repair | Competitive split | passed | opencode/big-pickle (0.985) | 0.285 | $0.1858 | 4m 10s |
| ExternalDNS RFC2136 repair | Competitive split | passed | opencode/kimi-k2.5 (0.982) | 0.282 | $0.0818 | 1m 48s |
| Terraform static site repair | Competitive split | passed | opencode/kimi-k2.5 (0.978) | 0.278 | $0.0568 | 1m 38s |
| CNPG restore manifest repair | Competitive split | passed | opencode/big-pickle (0.964) | 0.264 | $0.0716 | 1m 09s |
| MCP OpenBao contract repair | Competitive split | passed | opencode/big-pickle (0.954) | 0.254 | $0.1430 | 2m 20s |
| RHEL edge firewalld router repair | Competitive split | passed | opencode/gpt-5.4-nano (0.953) | 0.253 | $0.1025 | 1m 19s |
| Kubernetes OIDC RBAC repair | Competitive split | passed | opencode/gpt-5.4-nano (0.95) | 0.25 | $0.1416 | 2m 42s |
| GitOps workspace render validation | Competitive split | passed | opencode/big-pickle (0.941) | 0.241 | $0.0882 | 1m 28s |
| Workspace runtime access convergence | Competitive split | passed | opencode/gpt-5.4-nano (0.932) | 0.232 | $0.1889 | 2m 41s |
| Wildcard TLS route coverage | Competitive split | passed | opencode/kimi-k2.5 (0.929) | 0.229 | $0.1207 | 4m 06s |
| MetalLB ingress address pool repair | Competitive split | passed | opencode/gpt-5.4-nano (0.928) | 0.228 | $0.1348 | 2m 14s |
| AppArmor dnsmasq profile repair | Competitive split | passed | opencode/gpt-5.4-nano (0.918) | 0.218 | $0.1621 | 3m 29s |
| Traefik forwarded header trust repair | Competitive split | passed | opencode/kimi-k2.5 (0.913) | 0.213 | $0.0754 | 1m 38s |
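As referenced above, here is a sketch of the ordering rule. It is an illustration under assumptions, not the report's own code: the row shape and solver counts are hypothetical, and the baseline scores are derived from the table (winner score minus gap).

```python
# Sketch of the "most revealing first" ordering. Row shape and solver
# counts are hypothetical; scores are derived from the table above.
rows = [
    # (task, models_that_passed, baseline_score, winner_score)
    ("Traefik forwarded header trust repair", 11, 0.7, 0.913),
    ("RHEL k3s node preparation repair", 1, 0.0, 1.0),
    ("SELinux registry volume label repair", 9, 0.7, 1.0),
]

def reveal_key(row):
    _task, solvers, baseline, winner = row
    unsolved = solvers == 0        # nobody passed: most revealing
    single_solver = solvers == 1   # exactly one model passed
    gap = winner - baseline        # margin the baseline trails the winner by
    # False sorts before True, so unsolved and single-solver rows lead,
    # then remaining rows are ordered by the largest gap to the winner.
    return (not unsolved, not single_solver, -gap)

for task, *_ in sorted(rows, key=reveal_key):
    print(task)
# RHEL k3s node preparation repair
# SELinux registry volume label repair
# Traefik forwarded header trust repair
```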
Pairwise task wins and top-line deltas show whether a challenger truly beats the baseline or just looks cheaper or faster in isolation. Everything here reads from the baseline's side: the task record counts baseline wins-losses plus ties, and each edge is baseline minus challenger, so a negative composite or success edge means the challenger leads, while a positive cost or time edge means the baseline spends or takes more (see the sketch after the table).
| Challenger | Task record | Composite edge | Success edge | Cost edge | Time edge | ORPT edge |
|---|---|---|---|---|---|---|
| opencode/gpt-5.4-nano | 3-23, 1 tie | -0.375 | -26% | +$2.0091 | +35m 19s | +6.64 |
| opencode/kimi-k2.5 | 2-24, 1 tie | -0.37 | -30% | +$1.5184 | +21m 47s | +7.56 |
| opencode/claude-opus-4-6 | 1-24, 2 ties | -0.255 | -30% | -$19.4450 | +22m 47s | +6.94 |
| opencode/nemotron-3-super-free | 12-3, 12 ties | +0.233 | +33% | +$2.4307 | -46m 09s | +2.38 |
| opencode/glm-5 | 3-21, 3 ties | -0.208 | -19% | -$4.0033 | +42m 41s | +10.24 |
| opencode/big-pickle | 5-18, 4 ties | -0.2 | -7% | +$2.4307 | +26m 23s | +6.42 |
| opencode/gpt-5.4 | 2-21, 4 ties | -0.194 | -19% | -$6.5520 | +30m 04s | +10.81 |
| opencode/claude-sonnet-4-6 | 4-21, 2 ties | -0.178 | -19% | -$9.4099 | +20m 21s | +5.38 |
| opencode/glm-5.1 | 4-18, 5 ties | -0.132 | -7% | +$0.5490 | -1m 48s | +9.76 |
| opencode/gemini-3.1-pro | 9-10, 8 ties | +0.124 | +22% | -$3.4229 | +11m 27s | +9.11 |
| opencode/minimax-m2.5 | 5-15, 7 ties | -0.066 | +4% | +$1.7894 | +30m 37s | +2.95 |
| opencode/gpt-5.4-mini | 10-13, 4 ties | -0.01 | +11% | +$1.3701 | +41m 03s | +12.27 |
| opencode/minimax-m2.5-free | 5-5, 17 ties | +0.0 | +0% | +$2.4307 | +21m 18s | +5.63 |
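The sketch below shows that baseline-first convention. The per-task score pairs are hypothetical; the aggregates in the second half are from the leaderboard above, and the last digit of each edge can differ from the table because the table rounds after computing edges on unrounded inputs.

```python
# Sketch of the baseline-first convention used in this table.
# Per-task (baseline_score, challenger_score) pairs are hypothetical.
pairs = [(1.0, 0.3), (0.0, 1.0), (0.7, 0.7), (0.2, 0.95)]
wins = sum(b > c for b, c in pairs)
losses = sum(b < c for b, c in pairs)
ties = sum(b == c for b, c in pairs)
print(f"record: {wins}-{losses}, {ties} tie(s)")  # record: 1-2, 1 tie(s)

# Edges are baseline minus challenger; aggregates from the leaderboard.
baseline = {"composite": 0.415, "success": 0.59, "cost_usd": 2.4307}
gpt_54_nano = {"composite": 0.789, "success": 0.85, "cost_usd": 0.4215}
edges = {k: round(baseline[k] - gpt_54_nano[k], 4) for k in baseline}
print(edges)
# {'composite': -0.374, 'success': -0.26, 'cost_usd': 2.0092}
# (table shows -0.375 / -26% / +$2.0091 because it rounds last)
```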
The benchmark result only matters in context: this section pairs the observed benchmark outcome with the catalog metadata and operating characteristics behind it.
OpenRouter reference blend for google/gemini-3-flash-preview is 1.125 USD per 1M tokens using a 3:1 input:output mix.
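That 3:1 blend is a weighted average of the per-direction prices. The sketch below reproduces the quoted figure; the input and output prices are hypothetical placeholders chosen only so the arithmetic lands on the published $1.125, not OpenRouter's actual rates.

```python
# Sketch: 3:1 input:output blended price per 1M tokens.
# HYPOTHETICAL per-direction prices; only the $1.125 blend is from this page.
input_usd_per_m = 0.50    # assumed input price per 1M tokens (placeholder)
output_usd_per_m = 3.00   # assumed output price per 1M tokens (placeholder)

# Weight inputs 3x and outputs 1x, then divide by the total weight.
blend = (3 * input_usd_per_m + 1 * output_usd_per_m) / 4
print(f"${blend:.3f} per 1M tokens")  # $1.125 per 1M tokens
```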
The baseline model was observed to loop and hit the benchmark process deadline on the task-05 scripting smoke run, so it should not be part of the default headless dev matrix.