AI Model Rankings May 2026: Top LLMs Ranked by Coding, Reasoning & Cost
TL;DR. As of May 2026, no single model wins every axis. GPT-5.5 leads SWE-bench Verified coding at 88.7%, Claude Opus 4.7 and Gemini 3.1 Pro fight to a near-tie for top reasoning on GPQA Diamond (94.2% vs 94.3%), and DeepSeek V4 Pro dominates cost-quality at $0.43/$0.87 per million tokens with an 80.6% SWE score. The interesting story this month isn’t who’s #1 — it’s that the gap between the most expensive flagship and the cheapest open-weight model is now four task-points and 30x in price.
How this ranking was built
We pulled May 2026 numbers from three sources and refused to mix them: SWE-bench Verified (coding agentic tasks), GPQA Diamond (PhD-level science MCQ for reasoning), and published per-million-token API pricing as of this writing. Every score below was reconfirmed against the public leaderboard the week of May 19, 2026 — not pulled from memory, not extrapolated from older posts.
Models included: GPT-5.5, GPT-5.3-Codex, Claude Opus 4.7, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 3.1 Pro, Gemini 3.1 Flash, DeepSeek V4 Pro Max, Kimi K2.6, Qwen 3.6 Plus, Grok 4.1, Llama 4 Maverick. Models excluded: anything not generally API-available, anything still labeled “preview” without a public price (Claude Mythos Preview is mentioned because it shows up on leaderboards, but you can’t buy tokens yet).
If you only need the takeaway, read the next three sections. If you’re picking a model for a specific job, skip to Cross-axis: pick by use case.
Coding rankings (SWE-bench Verified)
GPT-5.5 leads coding at 88.7%, but by less than one task-point over Claude Opus 4.7. OpenAI’s April 23 release of GPT-5.5 was the only real shake-up since the April 2026 LLM leaderboard — every other model on this list was already in the top tier last month.
| Rank | Model | SWE-bench Verified | Notes |
|---|---|---|---|
| 1 | GPT-5.5 | 88.7% | Released 2026-04-23, $5/$30 |
| 2 | Claude Opus 4.7 | 87.6% | $5/$25, new 2026 tokenizer adds ~35% tokens vs 4.6 |
| 3 | GPT-5.3-Codex | 85.0% | Specialized for code, lower price than 5.5 |
| 4 | Claude Opus 4.5 | 80.9% | Older flagship, still useful for cost-sensitive coding |
| 5 | Claude Opus 4.6 | 80.8% | Pre-4.7 flagship, prompt-caching mature |
| 6 | DeepSeek V4 Pro Max | 80.6% | First open-weight in this tier |
| 7 | Gemini 3.1 Pro | 80.6% | Strong on long-context refactors |
| 8 | Kimi K2.6 | ~72% (Tier A coding bench) | 384 routed experts, strong Chinese-domain code |
| 9 | Qwen 3.6 Plus | ~71% | Tier B coding bench, open-weight |
The top four sit within ~8 points and any of them will ship working code on standard tasks. Real-world differentiation now lives in tasks SWE-bench doesn’t capture — multi-file refactors, long debugging loops, and how gracefully the model recovers from a bad initial diff. We covered the lived-experience side in Best LLM for coding in 2026: ranked by real use, which doesn’t always agree with the leaderboard order.
One tokenizer footnote worth knowing: Opus 4.7 ships a new 2026 tokenizer that produces up to 35% more tokens for the same raw English text vs 4.6. The headline $5/$25 rate card is closer to $6.75/$33.75 in effective spend for English-heavy workloads. Anthropic kept the sticker price flat; the actual bill went up. We hit this the week 4.7 went live and a routing pipeline that had been fine on 4.6 quietly blew its budget.
Reasoning rankings (GPQA Diamond)
The top of the reasoning leaderboard is a statistical tie — the top four span 0.5 percentage points on a 198-question benchmark. That’s literally one question. Anyone telling you “Gemini is the smartest” or “Claude reasons best” based on May 2026 GPQA numbers is overclaiming.
| Rank | Model | GPQA Diamond | Caveat |
|---|---|---|---|
| 1 | Gemini 3.1 Pro | 94.3% | Strongest on multi-step science |
| 2 | Claude Opus 4.7 | 94.2% | Best on prompts that need self-correction |
| 3 | GPT-5.5 (xhigh effort) | 93.5% | Higher with extended thinking budget |
| 4 | GPT-5.5 (high effort) | 93.2% | Cheaper effort tier, slight quality drop |
| 5 | Claude Opus 4.6 | ~92% | Still excellent, half the latency of Mythos |
| 6 | DeepSeek V4 Pro Max | ~89% | Best open-weight reasoning |
| 7 | Grok 4.1 | ~87% | Improved sharply from 4.0 |
In practice this means the effort-budget knob now matters more than the model name. GPT-5.5 with reasoning_effort: high behaves like a different product than default GPT-5.5. Same story for Claude’s extended-thinking mode. “Which model is smartest” has quietly collapsed into “how much thinking budget can I afford.”
We dug into the practical reasoning differences on three real-world tasks in Claude Opus 4.6 vs GPT-5.5 vs Gemini 3.1 Pro: reasoning benchmarks — the leaderboard tie hides large differences in how each model fails when it does fail.
Cost rankings (price per million tokens)
DeepSeek V4 Pro is the cost-quality king — roughly 1/12th the price of GPT-5.5 for an 80.6% SWE score. No close competition.
| Model | Input | Output | Effective cost-per-quality (output $ ÷ SWE score) |
|---|---|---|---|
| DeepSeek V4 Pro | $0.43 | $0.87 | $0.0108/point |
| Gemini 3.1 Flash | $0.15 | $0.60 | ~$0.0086/point (lower score) |
| Gemini 3.1 Pro | $2.00 | $12.00 | $0.149/point |
| Claude Opus 4.7 | $5.00 | $25.00 | $0.285/point |
| GPT-5.5 | $5.00 | $30.00 | $0.338/point |
| Kimi K2.6 | ~$0.60 | ~$2.50 | ~$0.035/point |
| Qwen 3.6 Plus | ~$0.50 | ~$2.00 | ~$0.028/point |
A few things to flag:
- DeepSeek’s 75% promo ends May 31, 2026. The cache-miss rate above already reflects the promo. Post-promo it goes back to roughly $1.72/$3.48 — still cheap, but no longer obscene-value.
- Prompt caching changes everything. Claude’s 5-minute cache discount is 90% on cached input. If your workload re-sends the same system prompt or context, Opus 4.7 can be cheaper than its rate card implies. We broke this down in detail in Claude API pricing: complete breakdown.
- Effective cost ≠ headline cost. The tokenizer change on Opus 4.7, batch-API discounts, and provider-specific deals all distort the simple rate-card comparison.
For a deeper cost framework that doesn’t just look at the rate card, see How to reduce AI API costs in 2026.
Cross-axis: pick by use case
The rankings above only matter if your workload looks like the benchmark. It usually doesn’t. Here’s the model we’d actually pick for common 2026 jobs:
Production coding agent with a tight budget: DeepSeek V4 Pro for the heavy lifting, GPT-5.5 only when you hit a hard task DeepSeek can’t close. The hybrid pattern is documented in Claude Code hybrid routing pattern — same routing logic works with DeepSeek as the cheap tier.
Research / hard reasoning: Claude Opus 4.7 or Gemini 3.1 Pro — pick on latency tolerance. Opus is slower; Gemini Pro is faster and slightly cheaper. Within 0.1 GPQA points, they’re interchangeable.
Long-context refactors (>200K tokens): Gemini 3.1 Pro wins. It’s the only model in this list with a 1M-token native context that doesn’t degrade badly past 500K. See Gemini 3.1 Pro API guide.
Cheap multi-step agent loops: Gemini 3.1 Flash-Lite or DeepSeek V4 Flash for the inner loop, with one Opus or GPT-5.5 call at the end for synthesis. Detailed in Gemini 3.1 Flash Lite vs DeepSeek V4 Flash for budget agents.
Bring-your-own infra, no API spend: Qwen 3.6 27B locally if you have a 24GB GPU, DeepSeek V4 Pro via API for everything that exceeds local capacity. The break-even is honest about local-vs-hosted trade-offs in Qwen 3.6 27B vs Claude Opus 4.6 for coding.
Just give me one default: Claude Opus 4.7 if budget isn’t tight, DeepSeek V4 Pro if it is. The middle (Gemini 3.1 Pro at $2/$12) is genuinely good but ends up being a “neither one nor the other” pick for most teams.
How to access these models without nine API keys
Every model in this ranking is callable through ofox.ai’s unified gateway using one OpenAI-compatible endpoint and one key. We host Claude, GPT, Gemini, DeepSeek, Kimi, Qwen, and Llama via OpenAI-compat — you change the model string, not the SDK. That matters when a ranking like this shifts month to month: rerouting to GPT-5.5 from Opus 4.7 is a one-line change, not a multi-week SDK swap.
If you’re shopping gateways more broadly, Why an LLM API gateway covers the criteria honestly — including when a gateway is the wrong answer.
What’s about to change
Three things are visible on the horizon that will reshape the next ranking:
- Anthropic’s Mythos Preview. Mythos isn’t generally available, but it’s already topping GPQA. When it lands as a real product (analyst whispers suggest June-July 2026), the reasoning ranking moves.
- DeepSeek’s promo expires May 31. Post-promo pricing is still excellent but the “30x cheaper than GPT-5.5” gap narrows to “10x cheaper.”
- GPT-5.5-Codex. OpenAI is widely expected to ship a Codex-specialized 5.5 variant in Q3, which would likely take the SWE-bench crown back from the general 5.5.
The leaderboard is moving roughly once a month right now. If you’re locking in a model for the next 90 days, build the routing layer first and the model choice second. The fact that swapping flagships used to be a quarter-long project and is now a one-line config change is the actual story of 2026 in LLM infra — and most teams haven’t realized it yet.
For the broader picking framework that doesn’t expire when GPT-5.6 ships, see Claude vs GPT vs Gemini: how to pick the right AI model and the LLM API selection decision matrix.


