Qwen 3.7 Max vs Plus (2026): Plus Wins at 1/6 the Price

On June 1, 2026, Alibaba quietly shipped Qwen 3.7 Plus, eleven days after Qwen 3.7 Max landed. Same 1M context, same 35-hour autonomous ceiling. But the pricing is the headline: Plus is $0.40/M input vs Max’s $2.50/M, roughly 6× cheaper, and it sees images and video on top. Vision Arena already has it at rank #16. So the real question this week isn’t “do I pay for eyes,” it’s “can Max justify charging 6× more for a sub-1-point SWE-Bench Pro edge (60.6% vs ~60%)?”

TL;DR: Which One Should You Pick? (30-Second Answer)

Qwen 3.7 Max is the premium text flagship; Qwen 3.7 Plus is the value-tier sibling: roughly 6× cheaper on input, ~4.7× on output, and ~3.1× on cached reads, plus vision. Both share the 1M context window and the 35-hour autonomous run ceiling. Pick by workload:

Scenario	Pick
Default workload (most teams)	Qwen 3.7 Plus (~6× cheaper, same ceiling)
Need the SWE-Bench Pro 60.6% edge	Qwen 3.7 Max
Agent reads UI screenshots or design mockups	Qwen 3.7 Plus (Max can’t)
Tight budget, output-heavy generation	Qwen 3.7 Plus ($1.60/M output vs Max’s $7.50)
Video transcription + reasoning	Qwen 3.7 Plus
Lowest text-only latency	Qwen 3.7 Max (~7-15% faster cold path)
Cheapest cached refresh prompts	Qwen 3.7 Plus ($0.08/M cached vs Max’s $0.25)
35-hour autonomous CLI agent	Either, same ceiling

If you have to commit to one for the next quarter, the default is Plus. Max only earns its 6× premium when you can show a measurable quality win on your specific task mix that justifies the cost — and for most coding, doc, and agent workloads, that win is hard to find.

Quick Specs Comparison

Both models ship through Alibaba’s Bailian platform and through ofox’s OpenAI-compatible endpoint. The table is what your procurement spreadsheet actually needs:

Field	Qwen 3.7 Plus	Qwen 3.7 Max
Released	2026-06-01	2026-05-21
Modality	Text + Image + Video	Text only
Context window	1,000,000 tokens	1,000,000 tokens
Input price (text)	$0.40 / M tokens	$2.50 / M tokens
Output price	$1.60 / M tokens	$7.50 / M tokens
Cached input	$0.08 / M tokens	$0.25 / M tokens
Cache write	$0.50 / M tokens	(not separately listed)
Image input	Same $0.40/M as text	Not supported
Autonomous run ceiling	35 hours	35 hours
Sequential tool calls	1000+	1000+
LM Arena (text) rank	#15	#13
LM Arena (coding) rank	#12	#10
Vision Arena rank	#16	n/a
SWE-Bench Pro	~60% (text path)	60.6%
MCP-Atlas	76.4	76.4
Availability	Bailian + ofox	Bailian + ofox

Two things most spec sheets bury. First, the price gap is the real story: Plus is roughly 6× cheaper than Max on input, ~4.7× cheaper on output, and ~3.1× cheaper on cached reads — for the same context window and the same agentic ceiling. Second, Vision Arena #16 at launch, for a model only days old, already beats several established multimodal flagships — and that capability is bundled at no extra cost over Plus’s text rate.
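The three ratios follow directly from the list prices in the table. A minimal sketch that recomputes them; the `PRICES` layout and the `price_ratios` helper are illustrative names, not part of any API:

```python
# List prices in USD per million tokens, copied from the spec table.
PRICES = {
    "plus": {"input": 0.40, "output": 1.60, "cached": 0.08},
    "max": {"input": 2.50, "output": 7.50, "cached": 0.25},
}


def price_ratios(cheap: str = "plus", premium: str = "max") -> dict[str, float]:
    """Return how many times more the premium model costs, per token type."""
    return {
        kind: PRICES[premium][kind] / PRICES[cheap][kind]
        for kind in PRICES[cheap]
    }


if __name__ == "__main__":
    for kind, ratio in price_ratios().items():
        print(f"{kind}: {ratio:.2f}x")
```

The spread matters when you estimate a bill: an output-heavy workload sees closer to the 4.7× figure than the 6× headline.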

Coding Benchmark: Real Tasks

The model that wins benchmarks is rarely the model that wins your sprint. We ran three real engineering tasks on both models using the same prompts via ofox’s API, recording token usage, wall-clock time, and a 1-5 quality rating from a senior reviewer. Methodology: 5 runs per task, median reported, temperature 0.2.

Task 1: Refactor a 1,200-line Python service into async

Refactor a synchronous FastAPI service (requests + blocking DB calls) into httpx + asyncpg, preserve all endpoints, add proper cancellation, return a unified diff.

Metric	Qwen 3.7 Plus	Qwen 3.7 Max
Input tokens	12,840	12,840
Output tokens	4,210	3,980
Time (median)	47 sec	41 sec
Quality (1-5)	4	4
Diff applied cleanly	Yes	Yes

Verdict: tied on quality, Max is roughly 14% faster on text-only tasks (multimodal stack on Plus adds cold-start overhead even when you send no images). But cost flips it the other way: at Plus’s $0.40/M input + $1.60/M output, this same task costs roughly $0.012 on Plus vs $0.062 on Max — Plus is ~5× cheaper for the same diff.
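The per-task cost is plain arithmetic on the token counts in the table. A small sketch, assuming no cache hits on this request; `task_cost` is a hypothetical helper name:

```python
def task_cost(input_tokens: int, output_tokens: int,
              input_price: float, output_price: float) -> float:
    """Cost in USD for one request; prices are USD per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Token counts from the Task 1 table, list prices from the spec table.
plus_cost = task_cost(12_840, 4_210, input_price=0.40, output_price=1.60)
max_cost = task_cost(12_840, 3_980, input_price=2.50, output_price=7.50)

print(f"Plus: ${plus_cost:.3f}  Max: ${max_cost:.3f}  "
      f"ratio: {max_cost / plus_cost:.1f}x")
```

Swap in your own median token counts to see how the ratio moves; it drifts toward 4.7× as the output share grows.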

Task 2: Debug a flaky test from a screenshot + stack trace

Given a screenshot of a Jest test report showing two failing assertions and a 60-line stack trace as text, identify the root cause and propose a fix.

Metric	Qwen 3.7 Plus	Qwen 3.7 Max
Input tokens	8,420 + 1 image	8,420 (image dropped)
Output tokens	1,830	2,140
Time	12 sec	9 sec
Quality (1-5)	5	2
Identified the real cause	Yes	No (guessed wrong line)

Verdict: this is the whole Plus thesis. Max sees the text but loses the visual signal that the test report highlighted a parent component, not the child being tested. Plus reads the highlight and fixes the right line on the first try. If your debugging loop ever involves a pasted screenshot, the model that can actually see it wins.

Task 3: 1,000-step autonomous CLI agent, Postgres 14 to 16 migration

Run a goal-oriented agent that plans the migration, runs pg_dump, validates schemas, executes the upgrade, and writes a rollback script. We let it run unattended for 4 hours each (well under the 35-hour ceiling).

Metric	Qwen 3.7 Plus	Qwen 3.7 Max
Tool calls executed	342	351
Errors recovered	4 of 5	5 of 5
Completion (% of plan)	96%	100%
Total cost	$0.34	$1.71

Verdict: Max wins by a hair on completion quality (100% vs 96%, 5 of 5 errors recovered vs 4 of 5). Plus is 5× cheaper for that 4-point quality gap. Whether the gap is worth 5× depends entirely on what failure costs you — for an irreversible production migration the answer is probably “pay for Max”; for a staging-environment rehearsal or a recoverable batch job it’s almost always “take the savings.” Neither model came close to the autonomous ceiling; both still had 30+ hours of runway when they finished.
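One way to frame “what failure costs you” is a break-even check: Max’s extra token spend is justified once the cleanup cost you expect from Plus’s unfinished work exceeds it. The sketch below is only an illustration. The function name and the cleanup estimates are hypothetical, and it assumes you can put a single dollar figure on that cleanup yourself.

```python
def max_is_worth_it(plus_cost: float, max_cost: float,
                    cleanup_cost_if_plus_falls_short: float) -> bool:
    """True when the estimated cleanup cost exceeds Max's extra token spend."""
    premium = max_cost - plus_cost
    return cleanup_cost_if_plus_falls_short > premium


# Run costs are the Task 3 figures; the cleanup estimates are made up.
print(max_is_worth_it(0.34, 1.71, cleanup_cost_if_plus_falls_short=0.50))   # staging rehearsal
print(max_is_worth_it(0.34, 1.71, cleanup_cost_if_plus_falls_short=400.0))  # production incident
```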

The pattern across the two text-only tasks (1 and 3) is the same. Plus delivers comparable quality at ~5× lower cost; Max buys you a small quality edge and ~7-15% lower latency in exchange for ~5× the token bill. When the input carries visual signal, as in Task 2, Max can’t compete: it doesn’t see the image at all. This isn’t a benchmark artifact. Alibaba positions Plus as the cost-efficient multimodal sibling, not a downgrade.

Multimodal & Vision Capabilities (Plus’s Home Turf)

Qwen 3.7 Plus is the only model in this comparison that ingests pixels, so the section has no Max column; it’s about what Plus actually unlocks. Three capability tiers, in order of how often we see them in production:

Tier 1: UI debugging and design QA. Plus reads a screenshot of a broken layout, finds the offending CSS rule, and proposes a fix. We ran 20 production tickets through this loop. Plus resolved 14 from the screenshot alone. Max resolved 0; it can only react to whatever text someone manually transcribed.

Tier 2: PDF and document reasoning. Plus takes a multi-page PDF (invoices, contracts, research papers) and reasons over both the text and the visual layout: table cells, figure callouts, footnote positions. This kills the “pdf-to-markdown then prompt” pipeline that most teams glue together with pdfplumber and prayer.

Tier 3: Video summarization with timestamp grounding. Plus accepts video input up to a duration ceiling that Bailian gates per tier. Practical use: feed in a 15-minute recorded standup, get back a timestamped action-item list. We tested this on three recorded engineering reviews. The action items it surfaced were accurate enough that we stopped taking manual notes.

Vision Arena rank #16 at launch is the headline number, and it understates the practical lift. Vision Arena weights generic image-understanding tasks. What makes Plus useful in practice is that the vision capability sits on the same reasoning and tool-call substrate as Max. Other multimodal models (we’ll name no names) can describe an image well but can’t then call a tool with the result. Plus chains “look at screenshot → identify error → run pytest -k foo → report” inside a single agentic loop. That chaining is the moat.

The hard NO for Plus: it does not generate images or video, only ingests them. If you need text-to-image, you still need a separate generation model.

Tool Invocation & Agentic Tasks

Both models share the same agentic numbers, among the most aggressive in the industry: 35-hour continuous autonomous runs and 1000+ sequential tool calls in a single session. Those numbers come from Alibaba’s launch material; we independently reproduced multi-hour runs (4+ hours unattended) without hitting any ceiling.

Why these numbers matter. Most “agent” frameworks die around the 100-tool-call mark because the model loses context coherence. Once an agent has burned through 80% of its window on planning and tool I/O, every subsequent action degrades. 1M context plus the state-management heuristics Alibaba tuned for long agent traces is what lets Qwen 3.7 hold the line where smaller-window models start hallucinating their own prior tool outputs.

Tool-call patterns we observed across both models:

Self-correcting tool errors. When a curl call returns 500, both models log the failure, wait, retry with backoff. Neither model loops infinitely.
Multi-step planning before execution. Both decompose “deploy to staging” into 14-18 ordered sub-tasks before running anything. Plans are visible in the trace, so you can interrupt before things get expensive.
Stateful memory across hours. A migration script written at hour 1 is still correctly referenced at hour 3. The 1M context is the engineering reason this works.
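The first pattern, retry with backoff and a hard stop, is also worth enforcing in your own harness rather than trusting the model alone. A minimal sketch of that guard, not a description of how either model implements it; `call_with_backoff` and its parameters are hypothetical names:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(tool: Callable[[], T], max_attempts: int = 4,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `tool`, retrying on failure with exponential backoff.

    Gives up after `max_attempts`, so a persistent failure cannot loop forever.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of retries: surface the last error to the caller
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the wrapper can be unit-tested without real delays.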

Where Plus extends Max: visually grounded tool calls. Examples from production traces:

“Look at the Datadog dashboard screenshot → identify the metric in red → query Datadog API for the corresponding service → write a runbook.”
“Read the design Figma export → generate the JSX → screenshot the rendered result → compare against the original.”

These loops simply don’t run on Max, because Max can’t ingest the screenshot or the Figma export. You can fake it with a stack of (OCR service + vision-to-text model + Max), but the cost, latency, and failure surface of that stack is materially worse than running Plus end-to-end.

MCP-Atlas (the multi-step tool-use benchmark) shows both models at 76.4; they share the same tool-invocation engine. So picking between them comes down to two axes: pricing (Plus is ~6× cheaper on input) and whether your tools speak pixels (only Plus does). For pure-text agent workloads, the question becomes “is Max’s sub-1-point SWE-Bench Pro edge (60.6% vs ~60%) and ~10% latency advantage worth 6× the input-token bill?” For most teams the honest answer is no.

Pricing Math: Real Monthly Bill

Spec sheets quote $/M tokens. Procurement quotes monthly bills. Here are two scenarios with real numbers, built from anonymized usage of three teams that have been running both models since launch.

Scenario A: 5-developer team, text-only coding agent

50 coding tasks per developer per day, 21 working days per month
Median task: 6,000 input + 1,800 output tokens
30% of inputs hit cache (refreshed prompt templates)

Monthly token volume per developer:

Input: 50 × 21 × 6,000 = 6.30M tokens; cached 1.89M, uncached 4.41M
Output: 50 × 21 × 1,800 = 1.89M tokens

Qwen 3.7 Plus ($0.40/M input, $1.60/M output, $0.08/M cached):

Cached input: 1.89M × $0.08 = $0.15
Uncached input: 4.41M × $0.40 = $1.76
Output: 1.89M × $1.60 = $3.02
Per developer: $4.93 → Team of 5: $24.65 / month

Qwen 3.7 Max ($2.50/M input, $7.50/M output, $0.25/M cached):

Cached input: 1.89M × $0.25 = $0.47
Uncached input: 4.41M × $2.50 = $11.03
Output: 1.89M × $7.50 = $14.18
Per developer: $25.68 → Team of 5: $128.40 / month

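Same workload, 5.2× cheaper on Plus: a token saving of about $104/month for the team. The latency tradeoff (Plus is ~14% slower cold-path) costs roughly 6 seconds per task. If a developer actually sits idle for those 6 seconds, then at $80/hour loaded engineering cost, 6 seconds × 50 tasks × 21 days × 5 devs = 8.75 hours, or ~$700/month in dev-time, which outweighs the token saving by roughly $600/month. Plus keeps its cost lead only when that wait is not blocking: background agents, batched jobs, or tasks a developer fires off and comes back to later.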
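The Scenario A token bill can be recomputed with a short script. This is a sketch under the assumptions listed above; the `Pricing` dataclass and `monthly_bill` function are illustrative names, not part of any SDK. It sums unrounded line items, so the totals land a few cents away from the rounded figures in the text (about $24.70 and $128.36).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """List prices in USD per million tokens."""
    input: float
    output: float
    cached: float


PLUS = Pricing(input=0.40, output=1.60, cached=0.08)
MAX = Pricing(input=2.50, output=7.50, cached=0.25)


def monthly_bill(p: Pricing, devs: int = 5, tasks_per_day: int = 50,
                 days: int = 21, input_tokens: int = 6_000,
                 output_tokens: int = 1_800,
                 cache_hit_rate: float = 0.30) -> float:
    """Team token bill in USD for one month under the Scenario A assumptions."""
    tasks = devs * tasks_per_day * days
    input_m = tasks * input_tokens / 1_000_000    # million input tokens
    output_m = tasks * output_tokens / 1_000_000  # million output tokens
    cached_m = input_m * cache_hit_rate
    uncached_m = input_m - cached_m
    return cached_m * p.cached + uncached_m * p.input + output_m * p.output


if __name__ == "__main__":
    plus, mx = monthly_bill(PLUS), monthly_bill(MAX)
    print(f"Plus: ${plus:.2f}  Max: ${mx:.2f}  ratio: {mx / plus:.1f}x")
```

Changing `cache_hit_rate` or the token medians is the quickest way to adapt the estimate to your own traffic.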

Scenario B: 5-developer team, visual debugging agent

Same 50 tasks/day/dev, same 21 working days
60% of tasks include 1 screenshot (Plus only; Max drops the image)
Median image: ≈ 1,280 image tokens at the same $0.40/M rate as text input
Median text payload unchanged

Plus monthly cost per developer:

Text input + output: $4.93 (same as Scenario A)
Image: 50 × 21 × 0.6 × 1,280 tokens × $0.40/M ≈ $0.32
Per developer: ≈ $5.25 → Team of 5: $26.25 / month

Same workload on Max. Max can’t read the screenshots, so the team replaces the visual signal with manual transcription. Manual screenshot triage adds about 4 minutes per task at $80/hour loaded cost, or $5.33 per task in human time. With 60% of tasks including screenshots: 50 × 21 × 0.6 × $5.33 = $3,358 / developer / month in lost engineering time. Team of 5: $16,790 / month in shadow labor cost on Max (plus the $128.40 token bill).

Vision-per-dollar index for the visual debugging workload: Plus wins by roughly 640×. That’s the math that makes Max indefensible for any agent that touches pixels.
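The Scenario B figures come from two multiplications, one for image tokens and one for human time. A sketch that reproduces them; all constant names are illustrative, and the per-developer text cost ($4.93) and Max team token bill ($128.40) are taken from Scenario A as stated. Using the unrounded $5.333 per task gives $3,360 per developer rather than $3,358, which does not change the conclusion.

```python
TASKS_PER_DEV = 50 * 21      # tasks per developer per month
SCREENSHOT_SHARE = 0.60      # fraction of tasks that include one screenshot
IMAGE_TOKENS = 1_280         # approximate tokens per screenshot
PLUS_INPUT_PRICE = 0.40      # USD per million tokens (images billed as input)
LOADED_RATE = 80.0           # USD per engineering hour
TRIAGE_MINUTES = 4           # manual transcription time per screenshot task
DEVS = 5

screenshot_tasks = TASKS_PER_DEV * SCREENSHOT_SHARE
image_cost = screenshot_tasks * IMAGE_TOKENS * PLUS_INPUT_PRICE / 1_000_000
labor_cost = screenshot_tasks * (TRIAGE_MINUTES / 60) * LOADED_RATE

# Team totals: Plus pays text + image tokens; Max pays tokens + manual triage.
plus_team = DEVS * (4.93 + image_cost)
max_team = DEVS * labor_cost + 128.40
ratio = max_team / plus_team

print(f"Plus image cost per dev: ${image_cost:.2f}")
print(f"Manual triage per dev on Max: ${labor_cost:,.0f}")
print(f"Max / Plus team cost ratio: {ratio:.0f}x")
```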

The rule of thumb. Default to Plus. It wins on text-only cost (~5× cheaper), bundles vision for ~6% extra at most, and matches Max’s context window and autonomous ceiling. Only pick Max when you can point to a specific quality-driven reason — a benchmark you’re optimizing for, a latency budget that can’t tolerate 14% overhead, or a stakeholder demand for “the top-tier flagship.”

When to Pick Qwen 3.7 Plus

Pick Qwen 3.7 Plus as your default. It’s roughly 6× cheaper than Max on input, ~4.7× on output, and ~3.1× on cached reads, while keeping the same 1M context and 35-hour autonomous ceiling, and it adds vision capability at the same per-token rate. Concrete pick signals:

Most coding and agent workloads. Cost-per-resolved-task is roughly 5× better than Max with only a small quality gap (under 1 point on SWE-Bench Pro, 4 points of plan completion in our agent run). Worth it unless that gap matters for your specific use case.
Visual debugging loops. Screenshots, stack traces in image form, layout bugs, design-vs-implementation diffs.
Document intelligence. PDFs with non-trivial layout (multi-column papers, financial filings, contracts). Plus reads the layout, not just the text.
Video summarization. Standup recordings, lecture content, internal demos. Plus surfaces timestamped takeaways.
Visually grounded agents. Agents that need to “look then act”: UI testers, design QA bots, screenshot-driven CI.
Cost-sensitive output-heavy generation. $1.60/M output vs Max’s $7.50/M is the biggest single savings line.

Also pick Plus if you want the option to add visual capability later without re-plumbing your endpoint. Plus is API-compatible with Max for text-only requests, so you can start text-only today and start attaching images the day your product demands it — no migration cost.

When to Pick Qwen 3.7 Max

Pick Qwen 3.7 Max only when you can name a specific reason the ~6× cost premium pays for itself. Concrete pick signals:

You’re optimizing for SWE-Bench Pro. Max’s 60.6% is the current proprietary high-water mark — a 2-point edge over GPT-5.5’s 58.6%. If your roadmap or RFP mentions SWE-Bench Pro explicitly, Max is the play.
Latency-critical text pipelines. Max is ~7-15% faster on text-only cold paths. For high-volume real-time generation where someone is blocked on every response, Max can pay for itself in dev-time savings: in the Scenario A math above, the break-even is where the waiting time lost on Plus, priced at $80/hr, exceeds the ~$104/month token saving for 5 devs.
Benchmark-driven stakeholder decisions. Procurement or technical evaluation explicitly weighs benchmark headlines. Max’s LM Arena coding #10 and SWE-Bench Pro 60.6% beat Plus on both.
Pure-text CLI coding agents where the quality gap matters. See Qwen 3.7 Max coding arena benchmarks and Qwen 3.7 Max developer guide for the deep-dive integration patterns where Max’s edge shows up.

Also pick Max when you’re benchmarking against GPT-5.5 or Claude Opus 4.8 on pure coding tasks. Max’s SWE-Bench Pro 60.6% lead is specific to that benchmark, though: GPT-5.5 pulls ahead on SWE-Bench Verified, so weight whichever benchmark’s task mix looks most like your codebase.

For the prior-generation comparison logic behind both decisions, see Qwen 3.6 Plus vs DeepSeek V4 Pro on coding: same decision framework, different model pair.

Try Both via ofox: A/B in 10 Lines of Code

The single-key advantage matters more for this pair than any other Qwen comparison. Plus and Max share modality at the text layer, so the cleanest way to A/B them is to send the same prompt to both endpoints and diff the outputs. ofox hosts both on its OpenAI-compatible API at ofox.ai/models/bailian/qwen3.7-plus and ofox.ai/models/bailian/qwen3.7-max. The API model IDs are bailian/qwen3.7-plus and bailian/qwen3.7-max. One API key, one base URL, swap one string.

Python — A/B both models in one loop

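import os

from openai import OpenAI

# Read the key from the environment (same variable as the Node example)
# instead of hard-coding it in source.
client = OpenAI(
    base_url="https://api.ofox.io/v1",
    api_key=os.environ["OFOX_API_KEY"],
)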

prompt = "Refactor this FastAPI handler from sync to async, return a unified diff."

# Same prompt, two models — only the model string changes.
for model in ("bailian/qwen3.7-max", "bailian/qwen3.7-plus"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
        max_tokens=2048,
    )
    print(f"\n=== {model} ===\n{resp.choices[0].message.content}")

Node — same shape

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.ofox.io/v1",
  apiKey: process.env.OFOX_API_KEY,
});

const prompt = "Refactor this FastAPI handler from sync to async, return a unified diff.";

for (const model of ["bailian/qwen3.7-max", "bailian/qwen3.7-plus"]) {
  const resp = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
    temperature: 0.2,
    max_tokens: 2048,
  });
  console.log(`\n=== ${model} ===\n${resp.choices[0].message.content}`);
}

Plus only: attach a screenshot

This is the call Max physically can’t run — Plus reads the image, returns a fix grounded in what it sees. Same client, same key, just an image_url content block:

import base64

with open("error.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="bailian/qwen3.7-plus",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Which assertion failed and why? Return the offending line."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=1024,
)
print(resp.choices[0].message.content)

The pattern we’d actually run in production: default to Plus for everything, and route to Max only when a request explicitly opts in (e.g. a model=premium flag set by code paths that need Max’s benchmark edge). One-line router, ~6× cheaper baseline, vision capability available the moment you start attaching image_url blocks.
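A minimal sketch of that router, assuming the two model IDs quoted earlier and OpenAI-style message dicts. `pick_model` and its `premium` flag are hypothetical names for illustration:

```python
PLUS_MODEL = "bailian/qwen3.7-plus"
MAX_MODEL = "bailian/qwen3.7-max"


def pick_model(messages: list[dict], premium: bool = False) -> str:
    """Default to Plus; use Max only on explicit opt-in for text-only requests."""
    has_image = any(
        isinstance(m.get("content"), list)
        and any(part.get("type") == "image_url" for part in m["content"])
        for m in messages
    )
    # Max is text-only, so image-bearing requests always go to Plus.
    if premium and not has_image:
        return MAX_MODEL
    return PLUS_MODEL
```

In the A/B loop above you would pass `model=pick_model(messages, premium=True)` from the code paths that opt in, and leave every other call on the default.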

FAQ

Does Qwen 3.7 Plus support 1M context like Qwen 3.7 Max? Yes. Both share the same 1M-token context window. Plus shares that window with image and video tokens (≈ 1,280 tokens per 1080p frame), so effective text headroom shrinks proportionally to your visual payload.

Is Qwen 3.7 Plus better than Qwen 3.7 Max for coding? On raw quality, marginally worse on pure text-only coding (Max #10 vs Plus #12 on LM Arena coding, 60.6% vs ~60% on SWE-Bench Pro). On cost-per-resolved-task, ~5× better, since Plus is $0.40/$1.60 vs Max’s $2.50/$7.50. On visual coding (screenshot debugging, design mockup interpretation), Plus is the only choice, because Max can’t see the image.

How much does Qwen 3.7 Plus cost compared to Qwen 3.7 Max? Plus is $0.40/M input, $1.60/M output, $0.08/M cached. Max is $2.50/M input, $7.50/M output, $0.25/M cached. Plus is roughly 6.25× cheaper on input, ~4.7× on output, and ~3.1× on cached reads. Image input on Plus is priced at the same $0.40/M rate as text input.

Can Qwen 3.7 Plus run for 35 hours autonomously? Yes. Alibaba’s launch material lists autonomous iteration and tool invocation as core capabilities of Plus. We have validated 4-hour unattended runs; we have not personally hit the 35-hour ceiling.

How does Qwen 3.7 Max compare to GPT-5.5 on SWE-Bench Pro? Qwen 3.7 Max scores 60.6% versus GPT-5.5 at 58.6%, a 2-point lead and the current proprietary high-water mark on that benchmark.

Should I migrate from Qwen 3.7 Max to Qwen 3.7 Plus? For most workloads, yes — Plus is ~6× cheaper on text tokens alone and adds vision for free. Stay on Max only if you’ve validated a quality gap on your specific tasks that’s worth the 6× premium, or if Max’s 7-15% latency edge actually moves a business metric for you.

Does Qwen 3.7 Plus generate images? No. Plus ingests images and video but does not generate them. You still need a separate generation model for text-to-image workloads.

Where can I try both models in one place? ofox lists both at ofox.ai/models/bailian/qwen3.7-plus and ofox.ai/models/bailian/qwen3.7-max, OpenAI-compatible API, single key.

References

MarkTechPost coverage of the Qwen 3.7 Plus launch on Bailian, June 2, 2026: https://www.marktechpost.com/2026/06/02/alibabas-qwen-team-launches-qwen3-7-plus-adding-vision-deep-reasoning-tool-invocation-and-autonomous-iteration-on-the-bailian-platform/
Qwen 3.7 Max benchmark report on OpenRouter: https://openrouter.ai/qwen/qwen3.7-max/benchmarks
Qwen Research page: https://qwen.ai/research
VentureBeat coverage of Qwen 3.7 Max 35-hour autonomous runs: https://venturebeat.com/technology/alibabas-proprietary-qwen3-7-max-can-run-for-35-hours-autonomously-and-supports-external-harnesses-like-anthropics-claude-code
ofox model catalog snapshot, 2026-06-03: Qwen 3.7 Plus listed 2026-06-01 at $0.40/M input / $1.60/M output / $0.08/M cached; Qwen 3.7 Max listed 2026-05-21 at $2.50/M input / $7.50/M output / $0.25/M cached
LM Arena leaderboard snapshot, 2026-06-02

The honest summary you can send your tech lead in one Slack message: “Plus is ~6× cheaper than Max on input (~4.7× on output, ~3.1× on cached reads), has the same 1M context and 35-hour autonomous ceiling, and bundles vision for free. Max leads SWE-Bench Pro by under a point (60.6% vs ~60%) and is ~10% faster on text-only. That’s the entire case for paying 6× more. Default to Plus; reserve Max for the specific cases where its benchmark edge is worth $25/dev/mo over $5.”