Kimi K2.7 Code: Does a 30% Token Cut Lower Your Bill? (2026)
Kimi K2.7 Code costs the same per token as K2.6 ($0.95/$4.00). Its 30% thinking-token cut trims ~13% off a reasoning-heavy bill, under 1% input-heavy.
TL;DR. Kimi K2.7 Code costs the exact same per token as K2.6 ($0.95/M in, $4.00/M out), and its cache reads are slightly worse ($0.19/M vs $0.16/M). So a lower bill rides entirely on Moonshot’s claim that the model burns ~30% fewer thinking tokens. That cut is real money only where reasoning dominates your spend. On a reasoning-heavy job the bill drops about 13%, not 30. On an input-heavy job it drops under 1%. Pick K2.7 Code (moonshotai/kimi-k2.7-code) for text-only reasoning work, stay on K2.6 (moonshotai/kimi-k2.6) for images or short-output jobs. The benchmarks behind the hype are all vendor-reported and unverified, so the only number you should trust is your own bill.
TL;DR: Which One Should You Pick?
One-line verdict: if your coding traffic is text-only and reasoning-heavy, K2.7 Code shaves real dollars; everywhere else the “30% cut” mostly evaporates by the time it hits your invoice.
The trap is reading “30% fewer tokens” as “30% cheaper.” It isn’t. Same per-token price, same context window, marginally worse cache. The savings live in one place only, and you have to qualify for them.
| Your workload | Pick | Why |
|---|---|---|
| Text-only, reasoning-heavy coding (long thinking traces) | K2.7 Code | Thinking tokens are most of your output spend, so the 30% cut lands hard |
| Agentic loops with long autonomous runs | K2.7 Code | Reasoning token reduction compounds across many turns |
| Vision / screenshot / image input | K2.6 | K2.7 Code is text-only; an image_url block fails on it |
| Input-heavy, short outputs (RAG, summarize, classify) | K2.6 | Output is a sliver of your bill, so a 30% output cut saves under 1% |
| Heavy cache reuse on repeated context | K2.6 | K2.7 Code’s cache read is $0.19/M vs K2.6’s $0.16/M, so K2.6 is cheaper on cached input |
| You haven’t measured your own thinking/output ratio yet | measure first | The whole decision turns on that ratio; the A/B loop below gives it to you in 10 lines |
If you do nothing else, take that last row seriously. Every dollar figure in this post depends on the share of your output tokens that go to reasoning, and that number is specific to your traffic. Vendor benchmarks won’t tell you. Your own logs will.
Quick Specs Comparison
Verified against the ofox model catalog on June 26, 2026. Prices are per million tokens.
| Spec | Kimi K2.7 Code | Kimi K2.6 |
|---|---|---|
| ofox model ID | moonshotai/kimi-k2.7-code | moonshotai/kimi-k2.6 |
| Context window | 262,144 | 262,144 |
| Max output | 262,144 | 262,144 |
| Input $/M | $0.95 | $0.95 |
| Output $/M | $4.00 | $4.00 |
| Cache read $/M | $0.19 | $0.16 |
| Modality | text only | text + image |
| Architecture | 1T MoE / 32B active | MoE |
| Built-in thinking | yes | yes (thinking / non-thinking modes) |
| Released | 2026-06-12 | 2026-04-21 |
| License | Modified MIT (open weights) | open weights |
Three facts decide everything below:
- Per-token price is identical. $0.95 in, $4.00 out, on both. If you priced K2.6 already, you’ve priced K2.7 Code’s per-token rate.
- Cache reads are worse on K2.7 Code. $0.19/M versus $0.16/M. If your pipeline reuses a lot of cached context, K2.7 Code is the more expensive model on that line item. Small, but it cuts against the “cheaper” framing.
- K2.7 Code is text-only. The detail people miss: the Code variant on ofox does not take images. K2.6 does. There’s also a moonshotai/kimi-k2.7-code-highspeed variant at the same price, still text-only.
So price parity plus a worse cache rate means there is exactly one lever that can lower your bill, and it’s the thinking-token reduction. The rest of this post is about whether that lever moves your specific invoice.
Coding Benchmark: What Moonshot Reports (and What’s Unverified)
Moonshot’s launch numbers for K2.7 Code over K2.6 look strong. Here they are, with the caveat attached to every row.
| Benchmark | K2.6 | K2.7 Code | Reported gain | Verified by a third party? |
|---|---|---|---|---|
| Kimi Code Bench v2 | 50.9 | 62.0 | +21.8% | No |
| Program Bench | 48.3 | 53.6 | +11.0% | No |
| MLS Bench Lite | 26.7 | 35.1 | +31.5% | No |
Read that last column twice. All three are Moonshot’s own proprietary benchmarks. There is no independent reproduction, and as of the June 12 release there were no public results on SWE-bench Verified, LiveCodeBench, or GPQA, the benchmarks the rest of the field actually compares against.
VentureBeat covered the release under a headline saying practitioners find the benchmarks don’t check out. Researcher Elliot Arledge ran K2.7 Code against K2.6 on KernelBench-Hard, a public GPU-kernel benchmark, and K2.7 Code’s MoE-kernel score regressed to 0.157 from K2.6’s 0.222 on worse tuning. So the picture from outside Moonshot is, at best, mixed, and at worst points the other way on at least one public test.
There’s a structural reason to discount these numbers beyond “they’re first-party.” A vendor benchmark with a narrow score spread can show a big percentage gain off a small absolute move, and a proprietary harness can be tuned, intentionally or not, to the model that ships with it. The benchmark that would actually settle the question for a routing decision is one with a wide spread across models and a public methodology, where a real capability gap shows up as a large gap in score. K2.7 Code wasn’t submitted to that kind of test at launch. So you have three impressive percentages and no way to place them against the models you might route to instead.
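To see how much the percentage framing flatters a low baseline, it helps to put the absolute point gains next to the relative ones. The snippet below is a small sketch using only the Moonshot-reported scores from the table above; the `gains` helper is my own name, not part of any benchmark tooling.

```python
# Scores copied from the table above (Moonshot-reported, not independently verified).
scores = {
    "Kimi Code Bench v2": (50.9, 62.0),
    "Program Bench": (48.3, 53.6),
    "MLS Bench Lite": (26.7, 35.1),
}


def gains(old: float, new: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    return new - old, (new - old) / old * 100


for name, (k26, k27) in scores.items():
    points, pct = gains(k26, k27)
    print(f"{name}: +{points:.1f} points, +{pct:.1f}% relative")
```

MLS Bench Lite shows the largest relative gain of the three, yet its absolute move is smaller than the one on Kimi Code Bench v2. The low starting score does the rest.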
This matters for cost work specifically. If you’re switching to K2.7 Code partly because you expect better output quality (fewer retries, fewer correction rounds), the vendor benchmarks are not evidence you can bank on. Fewer retries would be a real cost saving, since every failed attempt is tokens you paid for, but you can’t claim that saving off numbers nobody outside Moonshot has reproduced. The honest position: treat K2.7 Code as roughly K2.6-class on quality until your own evals say otherwise, and justify the switch on the token math alone, not on the benchmark deltas. For the K2.6 baseline numbers that have at least been public longer, see the Kimi K2.6 release guide and the Kimi K2.6 vs Claude Opus 4.6 coding benchmark.
The Token Math: Where the 30% Actually Lands
Here’s the part the marketing skips. The 30% reduction is on thinking/reasoning tokens, and thinking tokens bill as output (completion) tokens. Your input tokens don’t move at all.
So the structure of a Kimi bill is:
bill = input_tokens × $0.95/M + output_tokens × $4.00/M
where output_tokens = thinking_tokens + visible_tokens
K2.7 Code’s claim cuts only the thinking_tokens piece, by ~30%. Everything else stays put. That gives a clean formula for the real saving:
bill reduction ≈ 0.30 × (thinking spend / total spend)
If thinking is your whole bill, you get close to 30%. If thinking is a sliver, you get a sliver. The variable that decides your outcome is the share of your spend that goes to reasoning, and it ranges from near-total (agentic, multi-step coding) to near-zero (long input, one-line answer).
Moonshot’s own framing makes this concrete with an agentic example: a 12-hour run dropping from ~2M reasoning tokens to ~1.4M, which is the 30% figure. That’s a vendor example, not a measured result on your traffic, but it shows the shape: reasoning-token-dominated work is exactly where the cut is designed to pay off.
The mistake is generalizing that 12-hour agent run to every job. A summarization call that reads 200K tokens and writes 200 is the opposite profile, and it will see almost nothing. The next section puts dollars on both ends.
You don’t have to guess your thinking-spend share; the API gets you most of the way there. Every response carries a usage object with prompt_tokens and completion_tokens. Thinking tokens are folded into completion tokens, so the number you can compute directly is output spend: completion_tokens × $4.00/M divided by the whole bill. That is an upper bound on your thinking-spend share, because it also counts the visible answer. Log it across a representative week of real traffic and you’ll know roughly where on the 1%-to-26% range you sit before you change a single model string. That measured ratio, not Moonshot’s example, is what decides whether the switch pays.
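Here is a minimal sketch of that log analysis. It assumes you have saved each response’s usage object as a dict with prompt_tokens and completion_tokens; the function names and the sample records are hypothetical, and the rates are the $0.95/$4.00 figures from the spec table. Because completion_tokens includes the visible answer as well as the thinking, the result is a ceiling on what the 30% cut can save, not a forecast.

```python
INPUT_PRICE = 0.95e-6   # dollars per input token
OUTPUT_PRICE = 4.00e-6  # dollars per output token (thinking + visible)


def output_spend_share(usage_log: list[dict]) -> float:
    """Share of total spend that goes to completion tokens, from 0.0 to 1.0."""
    input_cost = sum(u["prompt_tokens"] for u in usage_log) * INPUT_PRICE
    output_cost = sum(u["completion_tokens"] for u in usage_log) * OUTPUT_PRICE
    total = input_cost + output_cost
    return output_cost / total if total else 0.0


def max_saving(usage_log: list[dict], cut: float = 0.30) -> float:
    """Upper bound on the bill reduction if every output token were thinking."""
    return cut * output_spend_share(usage_log)


# Hypothetical records standing in for a week of logged usage objects.
log = [
    {"prompt_tokens": 50_000, "completion_tokens": 20_000},
    {"prompt_tokens": 200_000, "completion_tokens": 4_000},
]
print(f"output share: {output_spend_share(log):.1%}, ceiling on saving: {max_saving(log):.1%}")
```

If the ceiling is already in the low single digits, you can stop there: no realistic thinking ratio will make the switch pay.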
Pricing Math: Real Monthly Bill
Two worked examples, recomputed from the $0.95/$4.00 rates. No cache hits assumed, so this isolates the thinking-token effect. You can rerun the arithmetic; it’s deliberately simple.
Example 1: reasoning-heavy coding job
Profile: 50,000 input tokens, 20,000 output tokens, of which 70% (14,000) is thinking and 30% (6,000) is visible answer. This is the shape of agentic coding: plan, reason, revise.
| Line | K2.6 | K2.7 Code |
|---|---|---|
| Input (50,000 × $0.95/M) | $0.0475 | $0.0475 |
| Thinking tokens | 14,000 | 9,800 (−30%) |
| Visible tokens | 6,000 | 6,000 |
| Output tokens total | 20,000 | 15,800 |
| Output cost (× $4.00/M) | $0.0800 | $0.0632 |
| Per-job total | $0.1275 | $0.1107 |
Bill reduction: ($0.1275 − $0.1107) / $0.1275 = 13.2%.
Note what happened. Thinking tokens fell 30% (14,000 → 9,800). Total output tokens fell only 21% (20,000 → 15,800), because the visible answer didn’t shrink. And the bill fell only 13.2%, because input tokens, 37% of the cost here, didn’t move at all. The “30%” headline became 13% by the time it reached the invoice. That tracks the formula: 0.30 × (thinking spend $0.0560 / total $0.1275) = 13.2%.
Scale that to a real workload of 1,000 of these jobs a day for 30 days:
| Model | Monthly bill |
|---|---|
| K2.6 | $3,825.00 |
| K2.7 Code | $3,321.00 |
| Saving | $504.00/mo (−13.2%) |
$504 a month is worth having. Just don’t budget for the $1,147 a naive “30% off $3,825” would have promised.
Example 2: input-heavy job (the cut barely shows)
Profile: 200,000 input tokens, 4,000 output tokens, of which 40% (1,600) is thinking. This is RAG, long-document Q&A, or summarization: big read, short write.
| Line | K2.6 | K2.7 Code |
|---|---|---|
| Input (200,000 × $0.95/M) | $0.1900 | $0.1900 |
| Output tokens total | 4,000 | 3,520 (thinking 1,600 → 1,120) |
| Output cost (× $4.00/M) | $0.0160 | $0.0141 |
| Per-job total | $0.2060 | $0.2041 |
Bill reduction: ($0.2060 − $0.2041) / $0.2060 = 0.93%.
Under one percent. The output is a rounding error against the input, so a 30% cut on part of the output is invisible on the invoice. For this load profile, switching to K2.7 Code for cost reasons is pointless, and if you lean on cached input, K2.6’s cheaper cache read ($0.16 vs $0.19) makes it the cheaper model outright.
Example 3: the 12-hour agentic run (the high end)
Moonshot’s headline example is a 12-hour agentic run where reasoning tokens drop from ~2M to ~1.4M. That’s their number, not mine, but it’s worth costing because it’s the profile that gets closest to the 30% headline. Assume the run also reads about 500K of input over its life and emits ~200K of visible output (tool calls, file edits, final summaries).
| Line | K2.6 | K2.7 Code |
|---|---|---|
| Input (500,000 × $0.95/M) | $0.475 | $0.475 |
| Reasoning tokens | 2,000,000 | 1,400,000 (−30%) |
| Visible output | 200,000 | 200,000 |
| Output cost (× $4.00/M) | $8.800 | $6.400 |
| Per-run total | $9.275 | $6.875 |
Bill reduction: ($9.275 − $6.875) / $9.275 = 25.9%.
This is as good as it gets. Reasoning is the overwhelming share of the bill here, so the cut almost fully passes through. Even so, it’s 26%, not 30%, because the input and visible output don’t move. Run 20 of these a day for a month and the gap is real:
| Model | Monthly bill (20 runs/day × 30 days) |
|---|---|
| K2.6 | $5,565 |
| K2.7 Code | $4,125 |
| Saving | $1,440/mo (−25.9%) |
If your traffic genuinely looks like long autonomous agent runs, K2.7 Code earns its keep. The further your load drifts from that profile toward Example 2, the less it does.
The three examples bracket the real world. Your bill reduction lands somewhere between ~1% and ~26% depending on how reasoning-heavy your traffic is, and a reasoning-heavy coding job like Example 1 sits around 13%. The closer your output is to all-thinking, the closer you get to the headline; the more your bill is input, the less you save. If you want to route a mix of these job shapes across cheaper models entirely, that’s a different lever, covered in routing multiple models through one API.
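If you want to rerun the three examples, or drop in your own token counts, a small calculator is enough. Treat this as a sketch rather than an official pricing tool: the function names are mine, the rates are the $0.95/$4.00 figures from the spec table, cache hits are ignored as in the examples, and the 30% cut is Moonshot’s claim taken at face value.

```python
INPUT_PRICE = 0.95e-6   # dollars per input token
OUTPUT_PRICE = 4.00e-6  # dollars per output token (thinking + visible)
THINKING_CUT = 0.30     # Moonshot's claimed reduction, not a measured value


def job_cost(input_tokens: float, thinking_tokens: float, visible_tokens: float) -> float:
    """Cost in dollars of one job with no cache hits."""
    return input_tokens * INPUT_PRICE + (thinking_tokens + visible_tokens) * OUTPUT_PRICE


def reduction(input_tokens: float, thinking_tokens: float, visible_tokens: float,
              cut: float = THINKING_CUT) -> float:
    """Fractional bill reduction when only the thinking tokens shrink by `cut`."""
    before = job_cost(input_tokens, thinking_tokens, visible_tokens)
    after = job_cost(input_tokens, thinking_tokens * (1 - cut), visible_tokens)
    return (before - after) / before


# (input, thinking, visible) token counts from the three worked examples.
examples = {
    "Example 1 (reasoning-heavy)": (50_000, 14_000, 6_000),
    "Example 2 (input-heavy)": (200_000, 1_600, 2_400),
    "Example 3 (12-hour agent run)": (500_000, 2_000_000, 200_000),
}

for name, profile in examples.items():
    print(f"{name}: {reduction(*profile):.1%}")
# Example 1 (reasoning-heavy): 13.2%
# Example 2 (input-heavy): 0.9%
# Example 3 (12-hour agent run): 25.9%
```

Swap in your own measured input, thinking, and visible counts to see where your traffic falls between the two ends.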
The Cache Line Item Cuts Against K2.7 Code
One more number the “30% cheaper” story ignores: cache reads. K2.7 Code bills cached input at $0.19/M; K2.6 bills it at $0.16/M. That’s a 19% premium on every cached token, on the one model that’s supposed to be the cheaper choice.
It matters whenever you reuse context. Code-review loops over the same repo, multi-turn agent sessions that re-send a system prompt and codebase, RAG over a stable corpus, all of these hit cache on most of their input. Take a 300K-input job at 80% cache hit, output held equal between the two models so we isolate the cache effect:
| Line | K2.6 | K2.7 Code |
|---|---|---|
| Fresh input (60,000 × $0.95/M) | $0.0570 | $0.0570 |
| Cached input (240,000) | × $0.16/M = $0.0384 | × $0.19/M = $0.0456 |
| Input cost | $0.0954 | $0.1026 |
K2.7 Code costs $0.0072 more per job on input alone. Over 1,000 cache-heavy jobs a day for a month, that’s about $216/mo extra that the thinking-token savings have to overcome before you break even. On a job profile that’s heavy on cached reads and light on reasoning output (the Example 2 shape with caching added), K2.7 Code can end up the more expensive model. Worth checking against your own cache-hit rate before you assume “newer = cheaper.”
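A quick way to check this against your own traffic is to compute the cache premium per job and ask how many thinking tokens the job would need for the 30% cut to pay it back. The sketch below assumes the catalog prices quoted in this post and Moonshot’s 30% claim; the function names and the short model labels used as dict keys are illustrative, not ofox identifiers.

```python
INPUT_PRICE = 0.95e-6   # dollars per fresh input token (same on both models)
OUTPUT_PRICE = 4.00e-6  # dollars per output token (same on both models)
CACHE_READ = {"k2.6": 0.16e-6, "k2.7-code": 0.19e-6}  # dollars per cached token


def input_cost(input_tokens: float, cache_hit_rate: float, model: str) -> float:
    """Input cost in dollars when `cache_hit_rate` of the input is read from cache."""
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    return fresh * INPUT_PRICE + cached * CACHE_READ[model]


def breakeven_thinking_tokens(input_tokens: float, cache_hit_rate: float,
                              cut: float = 0.30) -> float:
    """Thinking tokens (as counted on K2.6) a job needs before K2.7 Code's
    thinking-token saving covers its higher cache-read price."""
    premium = (input_cost(input_tokens, cache_hit_rate, "k2.7-code")
               - input_cost(input_tokens, cache_hit_rate, "k2.6"))
    return premium / (cut * OUTPUT_PRICE)


# The 300K-input, 80%-cache-hit job from the table above.
tokens = breakeven_thinking_tokens(300_000, 0.80)
print(f"break-even at about {tokens:,.0f} thinking tokens per job")
```

For the 300K-input job above, the break-even comes out at roughly 6,000 thinking tokens per job. Jobs that reason less than that are cheaper on K2.6; jobs that reason more start to favor K2.7 Code.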
When to Pick K2.7 Code
Pick moonshotai/kimi-k2.7-code when all of these hold:
- Your work is text-only. No images in the loop.
- Your jobs are reasoning-heavy, meaning long thinking traces relative to the visible answer. Agentic coding, multi-step debugging, planning-heavy tasks.
- You’re not leaning hard on cache reuse (if you are, K2.7 Code’s $0.19/M cache read costs more than K2.6’s $0.16/M).
That’s the profile where the 30% thinking-token cut translates into a double-digit bill reduction. It’s a genuine win for that exact shape of work. Use moonshotai/kimi-k2.7-code-highspeed if you want more throughput at the same price; the token math is unchanged.
When to Stick with K2.6
Stay on moonshotai/kimi-k2.6 when any of these hold:
- You need image input. K2.7 Code can’t do it, full stop.
- Your jobs are input-heavy with short outputs. The savings round to nothing (Example 2), and the cheaper cache read makes K2.6 the lower bill.
- You rely on non-thinking mode for fast, direct answers. If you’re not generating thinking tokens, there’s nothing for the 30% cut to reduce.
- You’ve already validated K2.6 quality in production and have no measured reason to expect K2.7 Code does the job better, since the benchmarks supporting that are unverified.
K2.6 is the conservative default. It does everything K2.7 Code does except the reasoning-token diet, plus it takes images and has the cheaper cache. For the K2.6 pricing and access details, see the Kimi K2.5 API pricing and access guide, which carries the same per-token structure forward.
When NOT to Use Either (and What to Use Instead)
Both Kimi models sit at $0.95/$4.00. That’s mid-pack, not cheap. If your driving constraint is raw cost-per-token and the task doesn’t need Kimi-class reasoning, neither is the right answer.
- For budget, high-volume batch work (classification, extraction, bulk summarization), route to a cheaper tier. DeepSeek V4 Flash lists at $0.14/$0.28, roughly 7x cheaper than Kimi on input and 14x cheaper on output. See the DeepSeek V4 release guide.
- For hard reasoning where you want a different model family’s strengths, GLM-5.2 is the reasoning-tier alternative on ofox. See the GLM-5.2 access guide.
- Mixed traffic across all of the above? Don’t pick one model. Route each job class to the cheapest model that clears its quality bar; that beats any single-model choice on cost. The multi-model router walkthrough has the worked routing table.
The point of K2.7 Code is a narrow efficiency gain on reasoning-heavy text. If that’s not your bottleneck, spend your optimization effort on routing, not on this one model swap. A team paying Kimi’s $4.00/M output on bulk classification work is leaving far more on the table than the 26% K2.7 Code returns in its best case, because the right fix there is a cheaper model entirely, not a leaner version of an expensive one. Match the model tier to the job first; optimize within a tier second.
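The "cheapest model that clears the quality bar" rule is easy to sketch once you have per-model prices and your own eval results. Everything about quality below is assumed: the sets of passing models are placeholders for whatever your evals say, the token profiles are made up, and "deepseek-v4-flash" is a label for this example rather than a confirmed ofox model ID. Only the prices come from this post.

```python
# Dollars per million tokens as (input, output), taken from this post.
PRICES = {
    "moonshotai/kimi-k2.6": (0.95, 4.00),
    "moonshotai/kimi-k2.7-code": (0.95, 4.00),
    "deepseek-v4-flash": (0.14, 0.28),  # placeholder label, not a verified model ID
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars of one job on `model`, ignoring cache hits."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000


def cheapest(passing_models: list[str], input_tokens: int, output_tokens: int) -> str:
    """Pick the lowest-cost model among those that passed your own evals."""
    return min(passing_models, key=lambda m: job_cost(m, input_tokens, output_tokens))


# Hypothetical job classes: (models that cleared the bar, typical input, typical output).
job_classes = {
    "classification": (["moonshotai/kimi-k2.6", "deepseek-v4-flash"], 2_000, 50),
    "agentic refactor": (["moonshotai/kimi-k2.6", "moonshotai/kimi-k2.7-code"], 50_000, 20_000),
}

for name, (passing, n_in, n_out) in job_classes.items():
    print(f"{name}: {cheapest(passing, n_in, n_out)}")
```

Note that price alone cannot separate K2.6 from K2.7 Code, since they list at the same rates. That tie is exactly where the measured thinking-token difference has to decide.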
Try Both via ofox: A/B in 10 Lines
Every number above depends on your own thinking-to-output ratio, and you can measure it directly. Both models share one OpenAI-compatible endpoint and one ofox key, so an A/B is a loop over two model strings. Run your real prompt through both, log the token counts the API returns, and compute the bill on your traffic instead of trusting an estimate.
Python, A/B both models in one loop
```python
from openai import OpenAI

client = OpenAI(base_url="https://api.ofox.ai/v1", api_key="YOUR_OFOX_KEY")
prompt = "Refactor this 200-line module into composable functions: <paste code>"

# Same prompt through both models; only the model string changes.
for model in ["moonshotai/kimi-k2.6", "moonshotai/kimi-k2.7-code"]:
    r = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    u = r.usage
    # Thinking tokens are billed inside completion_tokens.
    bill = u.prompt_tokens * 0.95e-6 + u.completion_tokens * 4.00e-6
    print(f"{model}: in={u.prompt_tokens} out={u.completion_tokens} bill=${bill:.4f}")
```
Node, same shape
```javascript
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "https://api.ofox.ai/v1", apiKey: process.env.OFOX_KEY });
const prompt = "Refactor this 200-line module into composable functions: <paste code>";

// Top-level await: run this as an ES module.
for (const model of ["moonshotai/kimi-k2.6", "moonshotai/kimi-k2.7-code"]) {
  const r = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
  });
  const u = r.usage;
  // Thinking tokens are billed inside completion_tokens.
  const bill = u.prompt_tokens * 0.95e-6 + u.completion_tokens * 4.0e-6;
  console.log(`${model}: in=${u.prompt_tokens} out=${u.completion_tokens} bill=$${bill.toFixed(4)}`);
}
```
Swap is one string. Run the loop over your top 20 real prompts, sum the bills, and you have your actual reduction, not the brochure’s.
One gotcha: K2.7 Code is text-only
K2.6 takes images. K2.7 Code does not. The same image_url content block that works on moonshotai/kimi-k2.6 will fail on moonshotai/kimi-k2.7-code:
```python
# Works on K2.6, fails on K2.7 Code (text-only)
client.chat.completions.create(
    model="moonshotai/kimi-k2.6",  # swap to kimi-k2.7-code -> error
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this screenshot?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,<...>"}},
        ],
    }],
)
```
If a job in your A/B set sends an image, keep it on K2.6 and don’t route it to K2.7 Code at all.
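One way to enforce that rule is to check the messages before choosing a model string. The guard below is a minimal sketch, assuming your requests use the OpenAI-style content-part format shown above; `has_image` and `pick_model` are hypothetical helper names, not part of the openai package or the ofox API.

```python
TEXT_MODEL = "moonshotai/kimi-k2.7-code"  # text-only
VISION_MODEL = "moonshotai/kimi-k2.6"     # text + image


def has_image(messages: list[dict]) -> bool:
    """True if any message carries an image_url content part."""
    for message in messages:
        content = message.get("content")
        # Plain-string content cannot hold an image; only part lists can.
        if isinstance(content, list):
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    return True
    return False


def pick_model(messages: list[dict]) -> str:
    """Send image-bearing requests to K2.6 and everything else to K2.7 Code."""
    return VISION_MODEL if has_image(messages) else TEXT_MODEL


text_job = [{"role": "user", "content": "Refactor this module."}]
image_job = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What's in this screenshot?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,<...>"}},
    ],
}]
print(pick_model(text_job))   # moonshotai/kimi-k2.7-code
print(pick_model(image_job))  # moonshotai/kimi-k2.6
```

The check only looks at content parts, so it will not catch images referenced by URL inside plain text. That is fine here, because a URL in plain text is sent as ordinary text rather than as an image_url block.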
FAQ
Is Kimi K2.7 Code cheaper than K2.6? No. Per-token prices are identical ($0.95/M input, $4.00/M output). Cache reads are more expensive on K2.7 Code ($0.19/M vs $0.16/M). The only path to a lower bill is the ~30% thinking-token reduction, and only on reasoning-heavy work.
Does a 30% token cut mean a 30% lower bill? No. The cut applies to thinking tokens, which bill as output; input tokens don’t change. Real reduction is about 30% times your thinking-spend share. Reasoning-heavy job: ~13%. Input-heavy job: under 1%.
What is the ofox model ID for Kimi K2.7 Code? moonshotai/kimi-k2.7-code on the endpoint https://api.ofox.ai/v1. There’s also moonshotai/kimi-k2.7-code-highspeed at the same price. K2.6 is moonshotai/kimi-k2.6.
Does Kimi K2.7 Code accept images? No. The K2.7 Code variant is text-to-text only; an image_url block fails. Route vision tasks to moonshotai/kimi-k2.6, which takes text plus image.
Are Kimi K2.7 Code’s benchmark numbers verified? Not independently. The +21.8% / +11.0% / +31.5% gains are all Moonshot’s proprietary benchmarks with no third-party reproduction. VentureBeat reported practitioners say the benchmarks don’t check out, and a public KernelBench-Hard run showed a regression. Treat them as vendor-reported.
What is the context window on Kimi K2.7 Code? 262,144 tokens (256K) for both context and max output, same as K2.6. It’s a 1T-total / 32B-active MoE with built-in thinking, released June 12 2026 under a Modified MIT open-weight license.
When should I switch from K2.6 to K2.7 Code? For text-only, reasoning-heavy coding where thinking dominates output spend. Stay on K2.6 for image input or input-heavy short-output jobs, where the savings round to nothing.
Is there a faster version? Yes, moonshotai/kimi-k2.7-code-highspeed, same $0.95/$4.00 pricing, higher throughput. It doesn’t change the token math here.
Sources Checked for This Refresh
- ofox model page, Kimi K2.7 Code: https://ofox.io/models/moonshotai/kimi-k2.7-code (verified 200, June 26 2026)
- ofox model page, Kimi K2.6: https://ofox.io/models/moonshotai/kimi-k2.6 (verified 200, June 26 2026)
- Moonshot AI, Kimi K2.7 Code overview: https://www.kimi.com/resources/kimi-k2-7-code (verified 200, June 26 2026)
- Hugging Face model card, moonshotai/Kimi-K2.7-Code: https://huggingface.co/moonshotai/Kimi-K2.7-Code (verified 200, June 26 2026)
- VentureBeat coverage on K2.7 Code’s 30% thinking-token cut and practitioner skepticism that the benchmarks check out: https://venturebeat.com/technology/kimi-k2-7-code-cuts-thinking-tokens-30-practitioners-say-benchmarks-dont-check-out (article confirmed live; host returns a bot challenge to automated curl)
- Elliot Arledge, independent KernelBench-Hard run (K2.7 Code MoE-kernel 0.157 vs K2.6 0.222): https://x.com/elliotarledge/status/2065443474560946615 and the public board at https://kernelbench.com/hard (both verified 200, June 26 2026)


