DeepSeek V4 Pro Real Cost: 120x Cache Gap Behind Sticker

DeepSeek V4 Pro Real Cost: 120x Cache Gap Behind Sticker

TL;DR — DeepSeek V4 Pro’s “$0.28/M” reputation is wrong on two counts. First, $0.28/M is the output price of V4 Flash, not Pro — Pro is $0.87/M output. Second, the published $0.435/M input price assumes a cache miss. Cache hits cost $0.003625/M, a 120x spread that quietly determines whether your bill matches the headline. To show why the published price isn’t the bill, we ran V4 Pro, GPT-5.5, and Claude Sonnet 4.6 against the same refactoring task on api.ofox.ai. The mechanism is the lesson, not the leaderboard: V4 Pro finished in ~1,500 output tokens, GPT-5.5 ran into our 8,192-token cap on 2 of 3 runs, Sonnet 4.6 used ~3,800. Output verbosity × output sticker price compounds into an order-of-magnitude gap on a reasoning-heavy task like this one. The exact “Nx cheaper” multiplier depends on caps, quality gates, and sample size we explicitly didn’t lock down — see the methodology limits below. Pricing verified at api-docs.deepseek.com/quick_start/pricing on 2026-06-12.

Try it yourself, with the 75% permanent discount built in. Run V4 Pro through ofox.ai on the same API key that already covers Claude, GPT, and Gemini — same V4 Pro price as the official endpoint, no Chinese phone number required, no separate account.

The three variables that decide your bill

A model’s sticker price is the smallest input to your monthly invoice. Three variables matter more:

  1. Cache miss rate. Cache-miss input on V4 Pro is $0.435/M. Cache-hit is $0.003625/M. That is not a 90% discount, that is a 120x discount. A workload that runs at 95% cache hit costs a different order of magnitude than one running at 60% hit. Most “DeepSeek is 50x cheaper than Claude” comparisons silently assume the high-hit case.
  2. Thinking-mode output inflation. V4 Pro’s thinking mode is on by default. On a single refactoring prompt in our benchmark, V4 Pro emitted 1,556 output tokens; GPT-5.5 emitted 8,192 (it hit our max_tokens cap) for the same prompt. Output is also where the real spend lives — $0.87/M on V4 Pro vs $0.435/M for cache-miss input. Verbose models burn the more expensive token.
  3. Tokenizer drift. Different providers count the same English text slightly differently. For our 1,400-character prompt, DeepSeek counted 339 input tokens, OpenAI counted 331, Anthropic counted 396 — a ~20% spread between the lightest and heaviest counter. Small. But it’s measurable and asymmetric: Anthropic tends to count more.

These three variables compound. A 60% cache hit rate with a verbose model and a heavy-counting tokenizer can turn a “12x cheaper” provider into “3x cheaper” on the actual invoice. None of that is misrepresentation by DeepSeek — it’s what cache pricing always means, just at an unusually large spread.

What “$0.28/M” actually refers to

The reason $0.28/M shows up in conversations about V4 Pro pricing is that V4 Flash, the smaller sibling, lists its output at exactly that number. Once a price gets repeated on social media, it stops carrying the model name attached. Here’s the full table, verified against the official DeepSeek docs on 2026-06-12:

TierModelInput (cache-miss)Input (cache-hit)Output
Premiumdeepseek-v4-pro$0.435/M$0.003625/M$0.87/M
Budgetdeepseek-v4-flash$0.14/M$0.0028/M$0.28/M

Source: api-docs.deepseek.com/quick_start/pricing, accessed 2026-06-12. Both models advertise a 1M-token context window and a 384K max output. Pricing is in USD per 1M tokens.

The original V4 Pro launch listed regular prices of $1.74/M input cache-miss and $3.48/M output, with a 75% promotional discount running through May 31, 2026. In late May, the 75% discount was reportedly made permanent — DeepSeek’s official docs no longer show a “promo until” footnote, and the community announcement thread on r/GithubCopilot (340 upvotes, 71 comments) summarizes the change:

DeepSeek just made the 1/4 discounted price for v4 Pro permanent. […] It increases the gap with the frontier (Sonnet/GPT 5.4) models to a 12 to 17x difference. And we are not even talking about the cache hit, where the difference is easily 60 to 80x cheaper.

The 75% discount being permanent matters because it locks in the cache-hit/miss spread at 120x. At the original sticker price the spread was the same ratio but in absolute terms looked different — at the new “permanent” price the spread is large enough that hit-rate behavior dominates your bill outright.

A one-task mechanism probe (not a leaderboard)

We ran one workload to make the mechanism visible — not to publish a Nx ranking. Three models, one task, three runs each, same single user message to all three. Same prompt body, temperature=0.2, max_tokens=8192. All three were called through api.ofox.ai/v1, an OpenAI-compatible router that exposes V4 Pro, GPT-5.5, and Sonnet 4.6 on the same key. Verbatim prompt:

You are a senior TypeScript engineer.

Refactor the following module into idiomatic TypeScript with these constraints:
- Replace any "any" types with precise generics.
- Extract pure helper functions.
- Add Vitest unit tests covering: happy path, empty input, malformed JSON, and a network timeout.
- Use AbortController for the timeout.
- Return a single fenced ```ts code block with the refactored module followed by a separate fenced ```ts block with the tests.
- Do not include prose commentary outside the code blocks.

```ts
// fetchAndAggregate.ts — original
export async function fetchAndAggregate(urls, opts) {
  const results = [];
  for (const u of urls) {
    try {
      const r = await fetch(u, opts);
      const j = await r.json();
      if (j && j.records) {
        for (const rec of j.records) {
          if (!rec.skip) results.push(rec);
        }
      }
    } catch (e) {
      console.log('failed', u, e);
    }
  }
  // group by category
  const out = {};
  for (const r of results) {
    const k = r.category || 'misc';
    if (!out[k]) out[k] = [];
    out[k].push(r);
  }
  // sort each group by ts desc
  for (const k of Object.keys(out)) {
    out[k].sort((a, b) => (b.ts || 0) - (a.ts || 0));
  }
  return out;
}
```

Paste that into any OpenAI-compatible client (point baseURL at https://api.ofox.ai/v1 or your provider of choice) with the parameters above and you’ll re-run exactly what we ran.

Median of 3 runs, on 2026-06-12. The right way to read this table is output verbosity, not the dollar column — the dollar column is illustrative because GPT-5.5’s outputs are capped at 8,192 (see methodology limits):

ModelMedian wall-clockInput tokensOutput tokensfinish_reasonIllustrative cost per run*
deepseek/deepseek-v4-pro18.8 s3391,556stop (3/3)$0.0015
openai/gpt-5.5117.4 s3318,192length (2/3), stop at 7,961 (1/3)$0.2474 (truncated)
anthropic/claude-sonnet-4.638.9 s3963,880stop (3/3)$0.0594

*Computed using each provider’s published sticker price (DeepSeek $0.435/$0.87, OpenAI $5/$30, Anthropic $3/$15, USD per 1M tokens). The GPT-5.5 dollar number is a lower bound — its outputs were truncated by our cap on 2 of 3 runs, and a complete run would cost more.

What the data does say with confidence:

  1. Verbosity is doing more of the work than per-token price. V4 Pro emitted 1,556 output tokens; Sonnet emitted 3,880; GPT-5.5 emitted 8,192+ before our cap stopped it. The model-behavior gap (5x verbosity between V4 Pro and GPT-5.5) compounds with the sticker gap (34x output price between V4 Pro and GPT-5.5). That’s a mechanism statement and it doesn’t depend on whether the GPT-5.5 cost number is exact.
  2. None of the three runs reported cached tokens. Each call was a cold prompt. On a real Claude Code or OpenCode session, the system prompt and project context stay stable across turns and the cache fires. Our cold-start numbers are an upper bound on V4 Pro’s per-call cost. If you sustain a 70% input cache hit, V4 Pro’s effective input cost drops from $0.435/M to ~$0.13/M.
  3. Tokenizer differences exist but are small here. DeepSeek counted 339 input tokens, OpenAI 331, Anthropic 396 for the same English prompt. Anthropic counts heaviest, but not by enough to change the conclusion. On longer prompts with code, the spread widens — worth eyeballing if you’re doing nine-figure-token workloads.

Per-run breakdown (3 runs × 3 models): V4 Pro finished at 1,494 / 1,556 / 2,875 output tokens (stop × 3). Sonnet 4.6 at 3,463 / 3,880 / 4,009 (stop × 3). GPT-5.5 at 7,961 (stop, hugging the cap) / 8,192 (length, truncated) / 8,192 (length, truncated). If you want the same probe on your own key, the prompt is verbatim above and the parameters are temperature=0.2, max_tokens=8192 — copy them straight into your OpenAI-compatible client.

Methodology limits (read this before you quote the numbers)

The numbers above are honest about what they are: a single mechanism probe, not a benchmark you can quote a clean “Nx cheaper” multiplier from. Four limits, in order of how much they would change the headline:

  1. GPT-5.5 was truncated on 2 of 3 runs. Our max_tokens=8192 cap is the reason 2 of 3 GPT-5.5 runs report finish_reason: "length" — those outputs are unfinished. A fair version of this benchmark would set max_tokens high enough that no model is artificially cut off, or discard length-terminated runs from the cost average. Until that’s redone, the GPT-5.5 cost column is a lower bound and the V4-Pro-vs-GPT-5.5 cost ratio is best stated as “an order of magnitude on this kind of task,” not “Nx.”
  2. No quality gate. We did not run the generated code or its tests through Vitest. “V4 Pro is cheaper” only matters if the output is also correct — a model that ships syntactically valid code but with broken tests isn’t actually cheaper. A proper rerun would extract each model’s \“tsblocks, write them to disk, runvitest run, and only count finish_reason=stop` + tests-pass runs into the cost comparison.
  3. n=3 with no variance reported. Three runs are enough to spot a mechanism (cap hits, verbosity, latency) but not enough to claim a stable rate. A defensible rerun would be n≥10 with median and min/max.
  4. One task. This is a TypeScript refactor with test scaffolding. The conclusion does not generalize to “V4 Pro is an order of magnitude cheaper on every task.” On a one-shot prose generation, on tool-heavy agent loops, or on a Flash-suitable scaffold, the ranking changes — sometimes inverts. The honest framing: V4 Pro is cheaper on reasoning-heavy refactors where its smaller output footprint is what matters.

We’re doing the proper rerun (max_tokens lifted, Vitest gate applied, n≥10, full per-run usage and finish_reason published) as a follow-up — link will land in this section when it ships. Until then: trust the mechanism, treat the multipliers as illustrative.

What real bills look like in the wild

Our benchmark is one task across three models. The community has been posting real monthly bills for weeks. Sampled from r/DeepSeek, r/opencode, r/Anthropic, r/LLMDevs, and r/GithubCopilot — bills are real, screenshots posted by users on their own accounts:

“I just spent $2 over two days with DS on OpenCode. I’d be so upset if I’d spent $265 with Claude for the same thing.” — u/pepeperezcanyear on r/DeepSeek “Pricing is crazy” (1,027 upvotes, 149 comments).

“200M tokens total. roughly 70/30 split on prompt vs completion. came out under 35 bucks all in. […] for context, when we were on claude pro for similar workload the per-seat math was 6x that and we had to babysit context limits. when we tested gpt-5.5-codex on the same kind of work the per-token was 8-10x and the wall time was worse.” — u/Fun_Walk_4965, r/DeepSeek, 189↑/132💬.

“With $3.88 & 690,003,591 tokens and 5 hours, Deepseek Pro & Flash combined, managed to reverse engineer Teamspeak’s Licensing System […] In 5 hours of trial and error, debugging with Ghidra and x64dbg.” — original post on r/DeepSeek, 365↑/48💬. Note the top reply: “Less than 1% of those tokens are output tokens” (u/—Spaci—, +23) — meaning the cache was doing real work on this run.

“Deepseek V4 current cost: 78.2m tokens for $1.14. What’s yours?” — r/opencode, 361↑/55💬. A reply from u/Still-Notice8155 (+5): “mine roughly 85m/1$ using pro, it’s insane with opus 4.6 like quality.”

“65 million tokens for 7 dollars lol” — r/DeepSeek, 170↑/93💬. Top reply, with a screenshot: “You spent too much lmfao. 680 million for 14” (u/deleted-account69420, +55).

The thread on r/Anthropic — where you’d expect skepticism — is 232 upvotes and 123 comments deep, and the highest-rated reply (u/KaMaFour, +74) gets the warning exactly right:

PRICE PER MILLION TOKENS IS NOT A GOOD MEASURE OF THE COST BECAUSE IT DOESN’T TAKE INTO ACCOUNT THE VERBOSITY OF THE MODEL. 5x cheaper model per token can be as expensive if it uses 5x more tokens per task.

That’s the cleanest version of the GPT-5.5 result above. Per-token price is half the story.

The counter-narrative: when V4 Pro isn’t cheap

It’s not all “$2 for two days.” Worth giving the dissent a real airing. From r/LLMDevs “Token costs are actually unsustainable for multi-project work” (32↑/94💬), top reply from u/look (+11):

“I primarily use a mix of Mimo V2.5 Pro, GLM-5.1, Qwen 3.6 Plus, and Deepseek V4 Flash (don’t waste your time with Pro — it’s as expensive as US models in actual use) and my average blended token costs are under 5 cents per Mtok and still dropping.”

The “Pro is as expensive as US models in actual use” claim is the inverse of our benchmark result. Both can be true. Pro’s pricing assumes you don’t blow through cache, and assumes the task benefits from thinking output enough to justify the higher per-token cost. If you’re driving Pro on prompts where Flash would have answered correctly, you are paying the Pro tax — 3x more input, 3x more output — for output you didn’t need. Our benchmark task was deliberately reasoning-heavy (refactor + write tests + reason about edge cases). On a one-shot “write a function that sums an array” prompt, the conclusion flips and Flash wins decisively. We covered that tradeoff in detail in our earlier piece on DeepSeek V4 Pro vs Flash: real cost-quality tradeoff.

The honest answer: on the reasoning-heavy refactor we tested, V4 Pro produced output an order of magnitude cheaper than GPT-5.5 and Sonnet 4.6 — most of that gap coming from output verbosity, not per-token price. On tasks Flash can handle, Pro is not cheap relative to Flash. And the exact multiplier we publish above is one mechanism probe, not a benchmark you can A/B procurement on (see methodology limits). Route by task type, not by default.

What to do with this

Three concrete moves, in order of how much they actually affect your bill:

  1. Measure your cache hit rate before you trust any “Nx cheaper” claim. Pull a week of usage from your API dashboard. DeepSeek reports prompt_cache_hit_tokens and prompt_cache_miss_tokens per call. If your hit rate is 80%+, the published savings numbers apply roughly as-is. If you’re at 60%, multiply the cache-miss cost by 0.4 and the cache-hit cost by 0.6 and recompute. The math in the cache-hit math primer walks through this step by step.
  2. Route reasoning to Pro, scaffolding to Flash. On a one-shot CRUD scaffold, Flash produces output that’s hard to distinguish from Pro on a blind read, and it’s 3x cheaper at every tier. On a multi-file refactor with implicit invariants, Pro holds the constraints across the rewrite; Flash drifts. The DeepSeek API pricing guide has a per-task decision table.
  3. Run cold-start prompts through a router that lets you A/B. Routing through a unified endpoint like ofox.ai’s API means you can ship the same prompt to V4 Pro, GPT-5.5, and Sonnet 4.6 with a one-character model-ID change. We use it ourselves — the benchmark in this post took three runs per model on one key, all three models reachable through the same endpoint. If you’re still budgeting against someone else’s sticker price, that’s the cheapest experiment to run before you commit.

The sticker price is a number on a tag. The bill is a receipt that unspools downward. Both are real; only one of them gets paid.


Methodology footnote. Benchmark task: TypeScript refactor + add Vitest tests. 3 runs per model, median reported, no Vitest gate applied. Routed through https://api.ofox.ai/v1 on 2026-06-12, temperature=0.2, max_tokens=8192. Finish reasons from raw runs: V4 Pro stop × 3; Sonnet 4.6 stop × 3; GPT-5.5 length × 2 (truncated at 8,192) + stop × 1 (at 7,961 tokens, hugging the cap). Cost computed using each provider’s published sticker price (USD per 1M tokens): V4 Pro $0.435 input cache-miss / $0.87 output (verified at api-docs.deepseek.com/quick_start/pricing 2026-06-12); GPT-5.5 $5 input / $30 output (verified at ofox.ai/models/openai 2026-06-12); Sonnet 4.6 $3 input / $15 output (verified at ofox.ai/models/anthropic 2026-06-12). Cached-token columns were 0 across all runs (cold prompts). The prompt body and per-run breakdown are in the post so you can repeat the same probe on your own key. Known limits (see in-post section): GPT-5.5 cap hit, no quality gate, n=3, single task.