ORPT-Bench model detail
Model benchmark profile

opencode/gpt-5.4-mini

This page uses opencode/gpt-5.4-mini as the comparison baseline. Every chart and table below answers the same questions: where this model leads, where it lags, and what it costs in quality, time, and request pressure.

Tags: openai, low price tier, dev-cheap, dev-general
Composite: 0.425 (correctness-weighted overall standing)
Success: 48% (tasks completed successfully)
ORPT: 9.54 (requests per solved task)
Total cost: $1.0606 (observed benchmark spend)
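
The headline cards are related by simple arithmetic. The sketch below shows one plausible way to derive success, ORPT, and total cost from per-task records; the task tuples are invented, ORPT is assumed to mean the average request count over solved tasks, and the composite score is omitted because its correctness weighting is not published on this page.

```python
from statistics import mean

# Invented per-task records of the form (solved, requests_used, cost_usd);
# the real ORPT-Bench task schema is not shown on this page.
tasks = [
    (True, 8, 0.0292),
    (True, 11, 0.0530),
    (False, 9, 0.0326),
]

success = sum(1 for solved, _, _ in tasks if solved) / len(tasks)
orpt = mean(requests for solved, requests, _ in tasks if solved)  # assumed definition
total_cost = sum(cost for _, _, cost in tasks)

print(f"Success {success:.0%}  ORPT {orpt:.2f}  Total cost ${total_cost:.4f}")
```
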
Baseline comparison

How the field moves relative to opencode/gpt-5.4-mini

These charts use opencode/gpt-5.4-mini as zero. Positive bars mean other models are above the baseline on that metric; negative bars mean they trail it.

Charts: composite delta, success delta, cost delta, and wall time delta vs baseline.
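
To read the bars: each delta is simply the challenger's value minus the baseline's. A minimal sketch, using figures copied from the decision table below:

```python
# Baseline figures for opencode/gpt-5.4-mini, taken from this page.
BASELINE = {"composite": 0.425, "success": 0.48, "cost": 1.0606}

def delta(model: dict, metric: str) -> float:
    """Signed gap above (+) or below (-) the baseline on one metric."""
    return model[metric] - BASELINE[metric]

# opencode/kimi-k2.5, from the decision table.
kimi = {"composite": 0.785, "success": 0.89, "cost": 0.9122}
print(round(delta(kimi, "composite"), 3))  # 0.36
print(round(delta(kimi, "cost"), 4))       # -0.1484 (cheaper than the baseline)
```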

Decision table

Field comparison against the baseline

Use this table to decide whether another model beats opencode/gpt-5.4-mini by enough to justify switching.

Model Composite Composite delta Success Success delta ORPT ORPT delta Cost Cost delta Wall time
opencode/gpt-5.4-nano 0.789 +0.365 85% +37% 15.17 +5.64 $0.4215 -$0.6391 27m 33s
opencode/kimi-k2.5 0.785 +0.36 89% +41% 14.25 +4.71 $0.9122 -$0.1484 41m 05s
opencode/claude-opus-4-6 0.67 +0.245 89% +41% 14.88 +5.34 $21.8757 +$20.8151 40m 04s
opencode/glm-5 0.623 +0.198 78% +30% 11.57 +2.03 $6.4339 +$5.3733 20m 10s
opencode/big-pickle 0.615 +0.19 67% +19% 15.39 +5.85 $0.0000 -$1.0606 36m 28s
opencode/gpt-5.4 0.609 +0.184 78% +30% 11.00 +1.46 $8.9827 +$7.9221 32m 47s
opencode/claude-sonnet-4-6 0.593 +0.168 78% +30% 16.43 +6.89 $11.8406 +$10.7800 42m 31s
opencode/glm-5.1 0.547 +0.122 67% +19% 12.06 +2.52 $1.8816 +$0.8210 64m 39s
opencode/minimax-m2.5 0.481 +0.056 56% +7% 18.87 +9.33 $0.6413 -$0.4193 32m 15s
opencode/gpt-5.4-mini (baseline) 0.425 +0.0 48% +0% 9.54 +0.00 $1.0606 +$0.0000 21m 48s
opencode/minimax-m2.5-free 0.415 -0.01 59% +11% 16.19 +6.65 $0.0000 -$1.0606 41m 34s
opencode/gemini-3-flash 0.415 -0.01 59% +11% 21.81 +12.27 $2.4307 +$1.3701 62m 52s
opencode/gemini-3.1-pro 0.291 -0.134 37% -11% 12.70 +3.16 $5.8536 +$4.7930 51m 25s
opencode/nemotron-3-super-free 0.181 -0.243 26% -22% 19.43 +9.89 $0.0000 -$1.0606 109m 00s
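
One way to operationalize that decision is to set explicit thresholds on composite gain and added cost and filter the table. A minimal sketch; the thresholds and the three sample rows are illustrative, not ORPT-Bench recommendations:

```python
# Illustrative filter over the decision table: keep challengers that improve
# the composite by a chosen margin without exceeding a cost ceiling.
rows = [
    # (model, composite_delta, cost_delta_usd) copied from the table above
    ("opencode/gpt-5.4-nano",    +0.365, -0.6391),
    ("opencode/kimi-k2.5",       +0.360, -0.1484),
    ("opencode/claude-opus-4-6", +0.245, +20.8151),
]

MIN_COMPOSITE_GAIN = 0.10   # arbitrary example threshold
MAX_EXTRA_COST = 1.00       # dollars over the baseline run, also arbitrary

candidates = [
    model for model, dc, dcost in rows
    if dc >= MIN_COMPOSITE_GAIN and dcost <= MAX_EXTRA_COST
]
print(candidates)  # ['opencode/gpt-5.4-nano', 'opencode/kimi-k2.5']
```
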
Task story

Where opencode/gpt-5.4-mini separates

This table puts the most revealing tasks first: unsolved tasks, single-solver tasks, and tasks where the baseline trails the winner by a meaningful margin.

Task Field read Baseline result Winner Winner score Gap to winner Baseline cost Baseline time
SELinux registry volume label repair Clear separation failed opencode/kimi-k2.5 1.0 1.0 $0.0326 1m 02s
RHEL k3s node preparation repair Competitive split failed opencode/gpt-5.4-nano 1.0 1.0 $0.0441 1m 03s
Bootstrap phase validation repair Competitive split failed opencode/kimi-k2.5 0.993 0.993 $0.0498 45s
ExternalDNS RFC2136 repair Competitive split failed opencode/kimi-k2.5 0.982 0.982 $0.0321 39s
nftables router ingress repair Competitive split failed opencode/gpt-5.4-nano 0.98 0.98 $0.0310 43s
Docker Compose observability fix Competitive split failed opencode/gpt-5.4-nano 0.975 0.975 $0.0288 33s
Pre-ArgoCD bootstrap sequencing Competitive split failed opencode/gpt-5.4-nano 0.967 0.967 $0.0416 47s
RHEL edge firewalld router repair Competitive split failed opencode/gpt-5.4-nano 0.953 0.953 $0.0277 33s
GitOps workspace render validation Competitive split failed opencode/big-pickle 0.941 0.941 $0.0597 1m 02s
Workspace runtime access convergence Competitive split failed opencode/gpt-5.4-nano 0.932 0.932 $0.0655 1m 15s
Wildcard TLS route coverage Competitive split failed opencode/kimi-k2.5 0.929 0.929 $0.0350 40s
MetalLB ingress address pool repair Competitive split failed opencode/gpt-5.4-nano 0.928 0.928 $0.0336 45s
AppArmor dnsmasq profile repair Competitive split failed opencode/gpt-5.4-nano 0.918 0.918 $0.0337 42s
Traefik forwarded header trust repair Competitive split failed opencode/kimi-k2.5 0.913 0.913 $0.0428 54s
Terraform static site repair Competitive split passed opencode/kimi-k2.5 0.978 0.189 $0.0530 2m 17s
Log level rollup shell script Competitive split passed opencode/big-pickle 0.965 0.136 $0.0486 56s
CNPG restore manifest repair Competitive split passed opencode/big-pickle 0.964 0.131 $0.0542 39s
Workspace transplant bundle repair Competitive split passed opencode/big-pickle 0.985 0.108 $0.0344 38s
K3s registry mirror trust repair Competitive split passed opencode/big-pickle 1.0 0.104 $0.0195 25s
RHEL NetworkManager bridge VLAN repair Competitive split passed opencode/gpt-5.4-nano 0.951 0.1 $0.0339 54s
Event status shell summary Competitive split passed opencode/big-pickle 1.0 0.089 $0.0196 25s
Build workspace plane convergence Competitive split passed opencode/gpt-5.4-nano 0.942 0.079 $0.0574 57s
Kubernetes OIDC RBAC repair Competitive split passed opencode/gpt-5.4-nano 0.95 0.078 $0.0626 56s
Ansible nginx role completion Competitive split passed opencode/big-pickle 0.963 0.045 $0.0285 28s
MCP OpenBao contract repair Competitive split passed opencode/big-pickle 0.954 0.038 $0.0346 42s
Log audit shell script Competitive split passed opencode/gpt-5.4-nano 0.935 0.02 $0.0269 33s
Kubernetes rollout repair Clear separation passed opencode/gpt-5.4-mini 1.0 0.0 $0.0292 34s
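
The ordering described above (failed tasks first, then the widest gaps to the winner) can be captured with a simple sort key. A sketch of one plausible key; it mirrors the visible ordering but is not the page's documented ranking logic:

```python
# Failed tasks sort before passed ones (False < True), then wider gaps first.
tasks = [
    {"name": "Terraform static site repair", "passed": True, "gap": 0.189},
    {"name": "SELinux registry volume label repair", "passed": False, "gap": 1.0},
    {"name": "Kubernetes rollout repair", "passed": True, "gap": 0.0},
]

tasks.sort(key=lambda t: (t["passed"], -t["gap"]))
print([t["name"] for t in tasks])
```
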
Head to head

Direct matchups

Pairwise task wins and top-line deltas show whether a challenger truly beats the baseline or just looks cheaper or faster in isolation. Records list the baseline's wins first, and each edge is the baseline's value minus the challenger's, so a negative composite or success edge means the challenger is ahead.

Challenger Task record Composite edge Success edge Cost edge Time edge ORPT edge
opencode/gpt-5.4-nano 5-19 (3 ties) -0.365 -37% +$0.6391 -5m 44s -5.64
opencode/kimi-k2.5 7-19 (1 tie) -0.36 -41% +$0.1484 -19m 16s -4.71
opencode/claude-opus-4-6 13-12 (2 ties) -0.245 -41% -$20.8151 -18m 16s -5.34
opencode/nemotron-3-super-free 13-1 (13 ties) +0.243 +22% +$1.0606 -87m 12s -9.89
opencode/glm-5 13-9 (5 ties) -0.198 -30% -$5.3733 +1m 38s -2.03
opencode/big-pickle 4-17 (6 ties) -0.19 -19% +$1.0606 -14m 40s -5.85
opencode/gpt-5.4 13-8 (6 ties) -0.184 -30% -$7.9221 -10m 59s -1.46
opencode/claude-sonnet-4-6 13-10 (4 ties) -0.168 -30% -$10.7800 -20m 42s -6.89
opencode/gemini-3.1-pro 12-3 (12 ties) +0.134 +11% -$4.7930 -29m 37s -3.16
opencode/glm-5.1 10-11 (6 ties) -0.122 -19% -$0.8210 -42m 51s -2.52
opencode/minimax-m2.5 8-12 (7 ties) -0.056 -7% +$0.4193 -10m 27s -9.33
opencode/minimax-m2.5-free 13-6 (8 ties) +0.01 -11% +$1.0606 -19m 45s -6.65
opencode/gemini-3-flash 13-10 (4 ties) +0.01 -11% -$1.3701 -41m 03s -12.27
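
A head-to-head record like these can be tallied by comparing per-task scores for the baseline and one challenger. A minimal sketch; the scores and the tie tolerance are invented, since the page does not state ORPT-Bench's exact win/tie criterion:

```python
# Compare per-task scores and count wins, losses, and ties from the
# baseline's side, matching the table's convention.
def head_to_head(baseline_scores, challenger_scores, tol=1e-6):
    wins = losses = ties = 0
    for b, c in zip(baseline_scores, challenger_scores):
        if abs(b - c) <= tol:
            ties += 1
        elif b > c:
            wins += 1
        else:
            losses += 1
    return wins, losses, ties

print(head_to_head([1.0, 0.2, 0.8], [1.0, 0.9, 0.4]))  # (1, 1, 1)
```
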
Model context

Benchmark and catalog detail

The benchmark result only matters in context: this section pairs the observed benchmark outcome with the catalog metadata and operating characteristics behind it.

Requests: 264
Wall time: 21m 48s
Average task cost: $0.0386
Benchmark support: supported
Catalog blended price: $1.6875 / 1M tok
Catalog speed: n/a
Intelligence: n/a
Agentic: n/a

The primary blended price is derived automatically from the OpenRouter listing openai/gpt-5.4-mini using a 3:1 input:output blend.
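
The 3:1 blend is read here as a weighted average that counts the input price three times as heavily as the output price. A sketch under that assumption; the per-million-token prices are placeholders chosen only to reproduce the $1.6875 figure, not the actual listing values:

```python
# Assumed blend: (3 * input_price + 1 * output_price) / 4, per 1M tokens.
def blended_price(input_per_mtok: float, output_per_mtok: float) -> float:
    return (3 * input_per_mtok + 1 * output_per_mtok) / 4

# Placeholder prices that happen to reproduce the catalog figure.
print(blended_price(0.75, 4.50))  # 1.6875
```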

The model has been observed to complete ORPT-Bench scripting smoke runs cleanly and is the current preferred headless dev baseline.