Qwen 3.7 Plus vs Max 2026: 6× Cheaper, Adds Vision

On June 1, 2026, Alibaba quietly shipped Qwen 3.7 Plus, eleven days after Qwen 3.7 Max landed. Same 1M context, same 35-hour autonomous ceiling. But the pricing is the headline: Plus is $0.40/M input vs Max’s $2.50/M — roughly 6× cheaper — and it sees images and video on top. Vision Arena already has it at rank #16. So the real question this week isn’t “do I pay for eyes,” it’s “can Max justify charging 6× more for a 2-point benchmark edge?”

TL;DR: Which One Should You Pick? (30-Second Answer)

Qwen 3.7 Max is the premium text flagship; Qwen 3.7 Plus is the value-tier sibling: ~6× cheaper on input, ~4.7× on output, and ~3.1× on cached tokens, plus vision. Both share the 1M context window and the 35-hour autonomous run ceiling. Pick by workload:

| Scenario | Pick |
| --- | --- |
| Default workload (most teams) | Qwen 3.7 Plus (~6× cheaper, same ceiling) |
| Need the SWE-Bench Pro 60.6% edge | Qwen 3.7 Max |
| Agent reads UI screenshots or design mockups | Qwen 3.7 Plus (Max can’t) |
| Tight budget, output-heavy generation | Qwen 3.7 Plus ($1.60/M output vs Max’s $7.50) |
| Video transcription + reasoning | Qwen 3.7 Plus |
| Lowest text-only latency | Qwen 3.7 Max (~7-15% faster cold path) |
| Cheapest cached refresh prompts | Qwen 3.7 Plus ($0.08/M cached vs Max’s $0.25) |
| 35-hour autonomous CLI agent | Either, same ceiling |

If you have to commit to one for the next quarter, the default is Plus. Max only earns its 6× premium when you can show a measurable quality win on your specific task mix that justifies the cost — and for most coding, doc, and agent workloads, that win is hard to find.

Quick Specs Comparison

Both models ship through Alibaba’s Bailian platform and through ofox’s OpenAI-compatible endpoint. The table is what your procurement spreadsheet actually needs:

| Field | Qwen 3.7 Plus | Qwen 3.7 Max |
| --- | --- | --- |
| Released | 2026-06-01 | 2026-05-21 |
| Modality | Text + Image + Video | Text only |
| Context window | 1,000,000 tokens | 1,000,000 tokens |
| Input price (text) | $0.40 / M tokens | $2.50 / M tokens |
| Output price | $1.60 / M tokens | $7.50 / M tokens |
| Cached input | $0.08 / M tokens | $0.25 / M tokens |
| Cache write | $0.50 / M tokens | (not separately listed) |
| Image input | Same $0.40/M as text | Not supported |
| Autonomous run ceiling | 35 hours | 35 hours |
| Sequential tool calls | 1000+ | 1000+ |
| LM Arena (text) rank | #15 | #13 |
| LM Arena (coding) rank | #12 | #10 |
| Vision Arena rank | #16 | n/a |
| SWE-Bench Pro | ~60% (text path) | 60.6% |
| MCP-Atlas | 76.4 | 76.4 |
| Availability | Bailian + ofox | Bailian + ofox |

Two things most spec sheets bury. First, the price gap is the real story: Plus is roughly 6× cheaper than Max on input, ~4.7× cheaper on output, and ~3.1× cheaper on cached reads — for the same context window and the same agentic ceiling. Second, Vision Arena #16 at launch, for a model only days old, already beats several established multimodal flagships — and that capability is bundled at no extra cost over Plus’s text rate.
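The per-token-type ratios quoted above fall straight out of the list prices in the spec table. A minimal sketch of that arithmetic; the dictionary and function names here are illustrative, not part of any API:

```python
# List prices in USD per million tokens, copied from the spec table above.
PLUS = {"input": 0.40, "output": 1.60, "cached": 0.08}
MAX = {"input": 2.50, "output": 7.50, "cached": 0.25}


def price_ratios(cheap: dict, premium: dict) -> dict:
    """Return how many times more the premium model costs, per token type."""
    return {kind: premium[kind] / cheap[kind] for kind in cheap}


ratios = price_ratios(PLUS, MAX)
for kind, ratio in ratios.items():
    print(f"{kind}: {ratio:.2f}x")
```

The gap is widest on input and narrowest on cached reads, so the effective multiplier for a given workload depends on its mix of the three.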

Coding Benchmark: Real Tasks

The model that wins benchmarks is rarely the model that wins your sprint. We ran three real engineering tasks on both models using the same prompts via ofox’s API, recording token usage, wall-clock time, and a 1-5 quality rating from a senior reviewer. Methodology: 5 runs per task, median reported, temperature 0.2.
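For anyone reproducing the timing side of this methodology, a sketch of a "5 runs, report the median" harness might look like the following. `median_seconds` is a hypothetical helper, and `call` stands for any zero-argument function that sends one request; neither comes from the original test setup.

```python
import statistics
import time


def median_seconds(call, runs: int = 5) -> float:
    """Time `call()` `runs` times and return the median wall-clock duration."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    # The median is less sensitive than the mean to a single slow cold start.
    return statistics.median(samples)
```

In practice `call` would wrap the API request for one model, so the same harness can time both models on an identical prompt.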

Task 1: Refactor a 1,200-line Python service into async

Refactor a synchronous FastAPI service (requests + blocking DB calls) into httpx + asyncpg, preserve all endpoints, add proper cancellation, return a unified diff.

| Metric | Qwen 3.7 Plus | Qwen 3.7 Max |
| --- | --- | --- |
| Input tokens | 12,840 | 12,840 |
| Output tokens | 4,210 | 3,980 |
| Time (median) | 47 sec | 41 sec |
| Quality (1-5) | 4 | 4 |
| Diff applied cleanly | Yes | Yes |

Verdict: tied on quality, Max is roughly 14% faster on text-only tasks (multimodal stack on Plus adds cold-start overhead even when you send no images). But cost flips it the other way: at Plus’s $0.40/M input + $1.60/M output, this same task costs roughly $0.012 on Plus vs $0.062 on Max — Plus is ~5× cheaper for the same diff.
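The per-task cost figures can be checked from the token counts in the table and the list prices. A small sketch, assuming no cache hits; `task_cost` is an illustrative helper name:

```python
# (input, output) list prices in USD per million tokens, from the spec table.
PRICES = {
    "plus": (0.40, 1.60),
    "max": (2.50, 7.50),
}


def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request, ignoring cache discounts."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


plus_cost = task_cost("plus", 12_840, 4_210)
max_cost = task_cost("max", 12_840, 3_980)
print(f"Plus ${plus_cost:.3f}  Max ${max_cost:.3f}  ratio {max_cost / plus_cost:.1f}x")
```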

Task 2: Debug a flaky test from a screenshot + stack trace

Given a screenshot of a Jest test report showing two failing assertions and a 60-line stack trace as text, identify the root cause and propose a fix.

| Metric | Qwen 3.7 Plus | Qwen 3.7 Max |
| --- | --- | --- |
| Input tokens | 8,420 + 1 image | 8,420 (image dropped) |
| Output tokens | 1,830 | 2,140 |
| Time | 12 sec | 9 sec |
| Quality (1-5) | 5 | 2 |
| Identified the real cause | Yes | No (guessed wrong line) |

Verdict: this is the whole Plus thesis. Max sees the text but loses the visual signal that the test report highlighted a parent component, not the child being tested. Plus reads the highlight and fixes the right line on the first try. If your debugging loop ever involves a pasted screenshot, the model that can actually see it wins.

Task 3: 1,000-step autonomous CLI agent, Postgres 14 to 16 migration

Run a goal-oriented agent that plans the migration, runs pg_dump, validates schemas, executes the upgrade, and writes a rollback script. We let it run unattended for 4 hours each (well under the 35-hour ceiling).

| Metric | Qwen 3.7 Plus | Qwen 3.7 Max |
| --- | --- | --- |
| Tool calls executed | 342 | 351 |
| Errors recovered | 4 of 5 | 5 of 5 |
| Completion (% of plan) | 96% | 100% |
| Total cost | $0.34 | $1.71 |

Verdict: Max wins by a hair on completion quality (100% vs 96%, 5 of 5 errors recovered vs 4 of 5). Plus is 5× cheaper for that 4-point quality gap. Whether the gap is worth 5× depends entirely on what failure costs you — for an irreversible production migration the answer is probably “pay for Max”; for a staging-environment rehearsal or a recoverable batch job it’s almost always “take the savings.” Neither model came close to the autonomous ceiling; both still had 30+ hours of runway when they finished.

The pattern across all three tasks is the same. Plus delivers comparable quality at ~5× lower cost; Max buys you a small benchmark edge and ~7-15% lower latency in exchange for ~6× the token bill. On visual signal in the input, Max can’t compete — it doesn’t see the image at all. This isn’t a benchmark artifact. Alibaba positions Plus as the cost-efficient multimodal sibling, not a downgrade.

Multimodal & Vision Capabilities (Plus’s Home Turf)

Qwen 3.7 Plus is the only model in this comparison that ingests pixels, so the section has no Max column; it’s about what Plus actually unlocks. Three capability tiers, in order of how often we see them in production:

Tier 1: UI debugging and design QA. Plus reads a screenshot of a broken layout, finds the offending CSS rule, and proposes a fix. We ran 20 production tickets through this loop. Plus resolved 14 from the screenshot alone. Max resolved 0; it can only react to whatever text someone manually transcribed.

Tier 2: PDF and document reasoning. Plus takes a multi-page PDF (invoices, contracts, research papers) and reasons over both the text and the visual layout: table cells, figure callouts, footnote positions. This kills the “pdf-to-markdown then prompt” pipeline that most teams glue together with pdfplumber and prayer.

Tier 3: Video summarization with timestamp grounding. Plus accepts video input up to a duration ceiling that Bailian gates per tier. Practical use: feed in a 15-minute recorded standup, get back a timestamped action-item list. We tested this on three recorded engineering reviews. The action items it surfaced were accurate enough that we stopped taking manual notes.

Vision Arena rank #16 at launch is the headline number, and it understates the practical lift. Vision Arena weights generic image-understanding tasks. What makes Plus useful in practice is that the vision capability sits on the same reasoning and tool-call substrate as Max. Other multimodal models (we’ll name no names) can describe an image well but can’t then call a tool with the result. Plus chains “look at screenshot → identify error → run pytest -k foo → report” inside a single agentic loop. That chaining is the moat.

The hard NO for Plus: it does not generate images or video, only ingests them. If you need text-to-image, you still need a separate generation model.

Tool Invocation & Agentic Tasks

Both models share the same agentic numbers, among the most aggressive in the industry: 35-hour continuous autonomous runs and 1000+ sequential tool calls in a single session. Those numbers come from Alibaba’s launch material; we independently reproduced multi-hour runs (4+ hours unattended) without hitting any ceiling.

Why these numbers matter. Most “agent” frameworks die around the 100-tool-call mark because the model loses context coherence. Once an agent has burned through 80% of its window on planning and tool I/O, every subsequent action degrades. 1M context plus the state-management heuristics Alibaba tuned for long agent traces is what lets Qwen 3.7 hold the line where smaller-window models start hallucinating their own prior tool outputs.

Tool-call patterns we observed across both models:

  • Self-correcting tool errors. When a curl call returns 500, both models log the failure, wait, retry with backoff. Neither model loops infinitely.
  • Multi-step planning before execution. Both decompose “deploy to staging” into 14-18 ordered sub-tasks before running anything. Plans are visible in the trace, so you can interrupt before things get expensive.
  • Stateful memory across hours. A migration script written at hour 1 is still correctly referenced at hour 3. The 1M context is the engineering reason this works.
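The "log, wait, retry with backoff, never loop forever" behavior in the first bullet is something you may also want in your own tool harness rather than relying on the model alone. A hedged sketch of that pattern follows; `call_with_backoff` and its parameters are hypothetical names, and this is a harness-side analogue, not a description of how the models implement it internally.

```python
import time


def call_with_backoff(tool, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run `tool()`; on failure, log it and retry with exponential backoff.

    Returns (result, failure_log). Raises after `max_attempts` failures so a
    permanently broken tool cannot cause an infinite loop.
    """
    failures = []
    for attempt in range(max_attempts):
        try:
            return tool(), failures
        except Exception as exc:
            failures.append(f"attempt {attempt + 1}: {exc}")
            if attempt < max_attempts - 1:
                # Delays grow as base_delay * 1, 2, 4, ...
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError("tool failed after retries: " + "; ".join(failures))
```

Passing `sleep` as a parameter keeps the helper testable: a test can substitute a stub and assert on the delays without actually waiting.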

Where Plus extends Max: visually grounded tool calls. Examples from production traces:

  • “Look at the Datadog dashboard screenshot → identify the metric in red → query Datadog API for the corresponding service → write a runbook.”
  • “Read the design Figma export → generate the JSX → screenshot the rendered result → compare against the original.”

These loops simply don’t run on Max, because Max can’t ingest the screenshot or the Figma export. You can fake it with a stack of (OCR service + vision-to-text model + Max), but the cost, latency, and failure surface of that stack is materially worse than running Plus end-to-end.

MCP-Atlas (the multi-step tool-use benchmark) shows both models at 76.4; they share the same tool-invocation engine. So picking between them comes down to two axes: pricing (Plus is ~6× cheaper) and whether your tools speak pixels (only Plus does). For pure-text agent workloads, the question becomes “is Max’s ~2-point benchmark edge and ~10% latency advantage worth 6× the token bill?” — and for most teams the honest answer is no.

Pricing Math: Real Monthly Bill

Spec sheets quote $/M tokens. Procurement quotes monthly bills. Here are two scenarios with real numbers, built from anonymized usage of three teams that have been running both models since launch.

Scenario A: 5-developer team, text-only coding agent

  • 50 coding tasks per developer per day, 21 working days per month
  • Median task: 6,000 input + 1,800 output tokens
  • 30% of inputs hit cache (refreshed prompt templates)

Monthly token volume per developer:

  • Input: 50 × 21 × 6,000 = 6.30M tokens; cached 1.89M, uncached 4.41M
  • Output: 50 × 21 × 1,800 = 1.89M tokens

Qwen 3.7 Plus ($0.40/M input, $1.60/M output, $0.08/M cached):

  • Cached input: 1.89M × $0.08 = $0.15
  • Uncached input: 4.41M × $0.40 = $1.76
  • Output: 1.89M × $1.60 = $3.02
  • Per developer: $4.93 → Team of 5: $24.65 / month

Qwen 3.7 Max ($2.50/M input, $7.50/M output, $0.25/M cached):

  • Cached input: 1.89M × $0.25 = $0.47
  • Uncached input: 4.41M × $2.50 = $11.03
  • Output: 1.89M × $7.50 = $14.18
  • Per developer: $25.68 → Team of 5: $128.40 / month

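Same workload, 5.2× cheaper on Plus, a token saving of about $104/month for the team. The latency tradeoff (Plus is ~14% slower cold-path) costs you roughly 6 seconds per task. At $80/hour loaded engineering cost, those 6 seconds × 50 tasks × 21 days × 5 devs = ~$700/month in dev-time. Net: if developers block on every response, that waiting time outweighs the token saving by ~$600/month and Max comes out ahead. Plus keeps its cost lead whenever tasks run in the background or the extra 6 seconds overlap with other work.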
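To rerun Scenario A with your own volumes, the arithmetic can be sketched as below. All constants come from the scenario; the function and variable names are illustrative. Totals land a few cents away from the figures in the text because the text sums line items that were already rounded.

```python
TASKS_PER_MONTH = 50 * 21            # tasks per developer per month
INPUT_TOKENS, OUTPUT_TOKENS = 6_000, 1_800
CACHE_HIT_RATE = 0.30
DEVS = 5

# USD per million tokens, from the spec table.
PRICES = {
    "plus": {"input": 0.40, "output": 1.60, "cached": 0.08},
    "max": {"input": 2.50, "output": 7.50, "cached": 0.25},
}


def monthly_bill_per_dev(price: dict) -> float:
    """Token bill in USD for one developer for one month."""
    input_m = TASKS_PER_MONTH * INPUT_TOKENS / 1_000_000
    output_m = TASKS_PER_MONTH * OUTPUT_TOKENS / 1_000_000
    cached_m = input_m * CACHE_HIT_RATE
    uncached_m = input_m - cached_m
    return (cached_m * price["cached"]
            + uncached_m * price["input"]
            + output_m * price["output"])


plus_team = DEVS * monthly_bill_per_dev(PRICES["plus"])
max_team = DEVS * monthly_bill_per_dev(PRICES["max"])

# Developer time spent on the ~6 s per-task latency gap, at $80/hour.
latency_cost = 6 * TASKS_PER_MONTH * DEVS / 3600 * 80

print(f"Plus team: ${plus_team:.2f}  Max team: ${max_team:.2f}")
print(f"Token saving: ${max_team - plus_team:.2f}  Latency cost: ${latency_cost:.2f}")
```

Printing the token saving next to the latency cost makes it easy to see which term dominates once you plug in your own task counts and hourly rate.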

Scenario B: 5-developer team, visual debugging agent

  • Same 50 tasks/day/dev, same 21 working days
  • 60% of tasks include 1 screenshot (Plus only; Max drops the image)
  • Median image: ≈ 1,280 image tokens at the same $0.40/M rate as text input
  • Median text payload unchanged

Plus monthly cost per developer:

  • Text input + output: $4.93 (same as Scenario A)
  • Image: 50 × 21 × 0.6 × 1,280 tokens × $0.40/M ≈ $0.32
  • Per developer: ≈ $5.25 → Team of 5: $26.25 / month

Same workload on Max. Max can’t read the screenshots, so the team replaces the visual signal with manual transcription. Manual screenshot triage adds about 4 minutes per task at $80/hour loaded cost, or $5.33 per task in human time. With 60% of tasks including screenshots: 50 × 21 × 0.6 × $5.33 = $3,358 / developer / month in lost engineering time. Team of 5: $16,790 / month in shadow labor cost on Max (plus the $128.40 token bill).

Vision-per-dollar index for the visual debugging workload: Plus wins by roughly 640×. That’s the math that makes Max indefensible for any agent that touches pixels.
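The Scenario B comparison can be reproduced the same way. This sketch reuses the $4.93 per-developer text bill and the $128.40 Max token bill from Scenario A; variable names are illustrative, and the labor total comes out a few dollars higher than in the text because the per-task cost is not rounded to $5.33 first.

```python
TASKS_PER_MONTH = 50 * 21                          # per developer
SCREENSHOT_TASKS = TASKS_PER_MONTH * 60 // 100     # 60% of tasks carry one image
DEVS = 5

# Plus: ~1,280 image tokens per screenshot, billed at the $0.40/M input rate.
image_cost_per_dev = SCREENSHOT_TASKS * 1_280 * 0.40 / 1_000_000
plus_team = DEVS * (4.93 + image_cost_per_dev)

# Max: 4 minutes of manual transcription per screenshot task at $80/hour.
labor_per_dev = SCREENSHOT_TASKS * (80 * 4 / 60)
max_team = DEVS * labor_per_dev + 128.40           # labor plus the token bill

ratio = max_team / plus_team
print(f"Plus ${plus_team:.2f}/mo  Max ${max_team:.2f}/mo  ratio {ratio:.0f}x")
```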

The rule of thumb. Default to Plus. It wins on text-only cost (~5× cheaper), bundles vision for ~6% extra at most, and matches Max’s context window and autonomous ceiling. Only pick Max when you can point to a specific quality-driven reason — a benchmark you’re optimizing for, a latency budget that can’t tolerate 14% overhead, or a stakeholder demand for “the top-tier flagship.”

When to Pick Qwen 3.7 Plus

Pick Qwen 3.7 Plus as your default. It’s ~6× cheaper than Max on input (~4.7× on output, ~3.1× on cached reads) while keeping the same 1M context and 35-hour autonomous ceiling, and it adds vision at the same per-token rate as text. Concrete pick signals:

  • Most coding and agent workloads. Cost-per-resolved-task is roughly 5× better than Max with only a 2-4 point quality gap on benchmarks. Worth it unless that gap matters for your specific use case.
  • Visual debugging loops. Screenshots, stack traces in image form, layout bugs, design-vs-implementation diffs.
  • Document intelligence. PDFs with non-trivial layout (multi-column papers, financial filings, contracts). Plus reads the layout, not just the text.
  • Video summarization. Standup recordings, lecture content, internal demos. Plus surfaces timestamped takeaways.
  • Visually grounded agents. Agents that need to “look then act”: UI testers, design QA bots, screenshot-driven CI.
  • Cost-sensitive output-heavy generation. $1.60/M output vs Max’s $7.50/M is the biggest single savings line.

Also pick Plus if you want the option to add visual capability later without re-plumbing your endpoint. Plus is API-compatible with Max for text-only requests, so you can start text-only today and start attaching images the day your product demands it — no migration cost.

When to Pick Qwen 3.7 Max

Pick Qwen 3.7 Max only when you can name a specific reason the ~6× cost premium pays for itself. Concrete pick signals:

  • You’re optimizing for SWE-Bench Pro. Max’s 60.6% is the current proprietary high-water mark — a 2-point edge over GPT-5.5’s 58.6%. If your roadmap or RFP mentions SWE-Bench Pro explicitly, Max is the play.
  • Latency-critical text pipelines. Max is ~7-15% faster on text-only cold paths. For high-volume real-time generation where every second compounds, Max can pay for itself in dev-time savings (see the Scenario A math above: ~$700/month of waiting time at $80/hr across 5 devs exceeds the ~$104/month token saving whenever developers block on each response).
  • Benchmark-driven stakeholder decisions. Procurement or technical evaluation explicitly weighs benchmark headlines. Max’s LM Arena coding #10 and SWE-Bench Pro 60.6% beat Plus on both.
  • Pure-text CLI coding agents where the quality gap matters. See Qwen 3.7 Max coding arena benchmarks and Qwen 3.7 Max developer guide for the deep-dive integration patterns where Max’s edge shows up.

Also pick Max when you’re benchmarking against GPT-5.5 or Claude Opus 4.8 on pure coding tasks. Max’s SWE-Bench Pro 60.6% lead is specific to that benchmark, though: GPT-5.5 pulls ahead on SWE-Bench Verified, so weight whichever benchmark’s task mix looks most like your codebase.

For the prior-generation comparison logic behind both decisions, see Qwen 3.6 Plus vs DeepSeek V4 Pro on coding: same decision framework, different model pair.

Try Both via ofox: A/B in 10 Lines of Code

The single-key advantage matters more for this pair than any other Qwen comparison. Plus and Max share modality at the text layer, so the cleanest way to A/B them is to send the same prompt to both endpoints and diff the outputs. ofox hosts both on its OpenAI-compatible API at ofox.ai/models/bailian/qwen3.7-plus and ofox.ai/models/bailian/qwen3.7-max. The API model IDs are bailian/qwen3.7-plus and bailian/qwen3.7-max. One API key, one base URL, swap one string.

Python — A/B both models in one loop

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.ofox.ai/v1",
    api_key=os.environ["OFOX_API_KEY"],  # same variable the Node example reads
)

prompt = "Refactor this FastAPI handler from sync to async, return a unified diff."

# Same prompt, two models — only the model string changes.
for model in ("bailian/qwen3.7-max", "bailian/qwen3.7-plus"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
        max_tokens=2048,
    )
    print(f"\n=== {model} ===\n{resp.choices[0].message.content}")

Node — same shape

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.ofox.ai/v1",
  apiKey: process.env.OFOX_API_KEY,
});

const prompt = "Refactor this FastAPI handler from sync to async, return a unified diff.";

for (const model of ["bailian/qwen3.7-max", "bailian/qwen3.7-plus"]) {
  const resp = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
    temperature: 0.2,
    max_tokens: 2048,
  });
  console.log(`\n=== ${model} ===\n${resp.choices[0].message.content}`);
}

Plus only: attach a screenshot

This is the call Max physically can’t run — Plus reads the image, returns a fix grounded in what it sees. Same client, same key, just an image_url content block:

import base64

with open("error.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="bailian/qwen3.7-plus",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Which assertion failed and why? Return the offending line."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=1024,
)
print(resp.choices[0].message.content)

The pattern we’d actually run in production: default to Plus for everything, and route to Max only when a request explicitly opts in (e.g. a model=premium flag set by code paths that need Max’s benchmark edge). One-line router, ~6× cheaper baseline, vision capability available the moment you start attaching image_url blocks.
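A sketch of what that one-line router could look like. The `tier` and `has_images` parameters are hypothetical names for whatever opt-in flag and request inspection your code already has; only the two model IDs come from the section above.

```python
from typing import Optional

PLUS_MODEL = "bailian/qwen3.7-plus"
MAX_MODEL = "bailian/qwen3.7-max"


def pick_model(tier: Optional[str] = None, has_images: bool = False) -> str:
    """Default to Plus; route to Max only on explicit opt-in for text-only requests."""
    if has_images:
        # Max is text-only, so any request carrying an image must go to Plus.
        return PLUS_MODEL
    return MAX_MODEL if tier == "premium" else PLUS_MODEL
```

The return value drops straight into the `model=` argument of the chat completion call, so the rest of the request code stays identical for both models.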

FAQ

Does Qwen 3.7 Plus support 1M context like Qwen 3.7 Max? Yes. Both share the same 1M-token context window. Plus shares that window with image and video tokens (≈ 1,280 tokens per 1080p frame), so effective text headroom shrinks proportionally to your visual payload.
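As a rough planning aid, the remaining text budget can be estimated from the frame count. This sketch takes the ≈1,280 tokens-per-frame figure above at face value; actual tokenization may vary with resolution, and `text_headroom` is an illustrative helper, not an API call.

```python
CONTEXT_WINDOW = 1_000_000
TOKENS_PER_FRAME = 1_280  # approximate figure for a 1080p frame, as quoted above


def text_headroom(frames: int) -> int:
    """Tokens left for text after `frames` images or video frames are attached."""
    return max(0, CONTEXT_WINDOW - frames * TOKENS_PER_FRAME)


print(text_headroom(100))
```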

Is Qwen 3.7 Plus better than Qwen 3.7 Max for coding? On raw quality, marginally worse on pure text-only coding (Max #10 vs Plus #12 on LM Arena coding, ~2-point SWE-Bench Pro gap). On cost-per-resolved-task, ~5× better, since Plus is $0.40/$1.60 vs Max’s $2.50/$7.50. On visual coding (screenshot debugging, design mockup interpretation), Plus is the only choice — Max can’t see the image.

How much does Qwen 3.7 Plus cost compared to Qwen 3.7 Max? Plus is $0.40/M input, $1.60/M output, $0.08/M cached. Max is $2.50/M input, $7.50/M output, $0.25/M cached. That makes Plus roughly 6× cheaper on input, ~4.7× on output, and ~3.1× on cached reads. Image input on Plus is priced at the same $0.40/M rate as text input.

Can Qwen 3.7 Plus run for 35 hours autonomously? Yes. Alibaba’s launch material lists autonomous iteration and tool invocation as core capabilities of Plus. We have validated 4-hour unattended runs; we have not personally hit the 35-hour ceiling.

How does Qwen 3.7 Max compare to GPT-5.5 on SWE-Bench Pro? Qwen 3.7 Max scores 60.6% versus GPT-5.5 at 58.6%, a 2-point lead and the current proprietary high-water mark on that benchmark.

Should I migrate from Qwen 3.7 Max to Qwen 3.7 Plus? For most workloads, yes: Plus is ~6× cheaper on input tokens (~4.7× on output) and adds vision at no premium over its text rate. Stay on Max only if you’ve validated a quality gap on your specific tasks that’s worth the premium, or if Max’s 7-15% latency edge actually moves a business metric for you.

Does Qwen 3.7 Plus generate images? No. Plus ingests images and video but does not generate them. You still need a separate generation model for text-to-image workloads.

Where can I try both models in one place? ofox lists both at ofox.ai/models/bailian/qwen3.7-plus and ofox.ai/models/bailian/qwen3.7-max, OpenAI-compatible API, single key.

Sources Checked for This Refresh

The honest summary you can send your tech lead in one Slack message: “Plus is ~6× cheaper than Max on input (~4.7× on output, ~3.1× on cached reads), has the same 1M context and 35-hour autonomous ceiling, and bundles vision at the text rate. Max wins SWE-Bench Pro by 2 points and is ~10% faster on text-only. That’s the entire case for paying up to 6× more. Default to Plus; reserve Max for the specific cases where its benchmark edge is worth $25/dev/mo over $5.”