
Qwen 35B Beats Claude Opus on Image Generation Tasks

A real test shows local Qwen 3.6 35B matched or exceeded Claude Opus 4.7 on image generation, suggesting open-source models now handle specific tasks better than frontier AI at a fraction of the cost.

April 19, 2026


You probably believe that paying for the best AI model guarantees better results across the board. Simon Willison just proved you wrong on at least one task that matters.

He ran Qwen 3.6 35B locally against Claude Opus 4.7 with a simple prompt: draw a pelican. The local open-source model produced a better image. Not a marginal improvement. A genuinely better pelican. This isn't a benchmark anomaly or a cherry-picked example. It's a direct challenge to the assumption that frontier models dominate at everything.

TL;DR

Qwen 3.6 35B running locally beat Claude Opus at image generation in a real-world test, suggesting open-source models now match frontier capabilities on specific tasks while costing orders of magnitude less.

[Image: comparison of image outputs between the local Qwen model and Claude Opus, a local-vs-frontier capability test]
The AI industry has spent years marketing a false equivalence: frontier model equals best for everything. It doesn't. Frontier models excel at specific, well-optimized domains. Everything else is a waste of your budget.

Simple Tasks Don't Need Frontier Capability

Watch how AI comparisons actually work. Benchmark scores. Token throughput. Latency tables. What almost nobody measures is whether the output is actually usable for real work. Willison asked that question and got an uncomfortable answer.

Claude has become increasingly defensive on image generation. Refusals pile up. Watermarks appear. Safety layers multiply. Meanwhile Qwen iterates without those liability concerns constraining every design decision, and it improves faster for it.

A pelican isn't trivial in the way that matters. It maps directly to real workflows: reference images for design work, visual mockups, asset generation, prototyping. These tasks don't need photorealism. They don't need reasoning across 100,000 tokens. They need a model that generates usable output quickly. Qwen does that. Claude still costs money to do it worse.

The Economics Are Actually Brutal

Claude Opus charges per token. Every request accumulates. Qwen 35B running locally costs electricity and hardware you already own. The comparison isn't close.

If your team generates dozens of images weekly through the API, your bill becomes real. If those images don't require frontier reasoning, you're subsidizing capabilities sitting unused. The pelican test proves the math: you're overpaying for generic competence.
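The math above is easy to sanity-check yourself. Here is a back-of-envelope sketch; every number in it (API price per token, tokens per request, electricity figures) is a hypothetical placeholder, not a quoted rate, so substitute your own pricing and usage data.

```python
# Back-of-envelope cost comparison. All constants below are
# hypothetical placeholders -- replace them with your own API
# pricing and measured usage before drawing conclusions.
API_COST_PER_1K_TOKENS = 0.075    # assumed frontier-model price (USD)
TOKENS_PER_IMAGE_REQUEST = 2_000  # assumed prompt + output tokens
IMAGES_PER_WEEK = 50

def weekly_api_cost(images: int) -> float:
    """API spend for a week of image-generation requests."""
    tokens = images * TOKENS_PER_IMAGE_REQUEST
    return tokens / 1_000 * API_COST_PER_1K_TOKENS

def weekly_local_cost(images: int, kwh_per_image: float = 0.05,
                      usd_per_kwh: float = 0.15) -> float:
    """Marginal electricity cost for the same work on local hardware."""
    return images * kwh_per_image * usd_per_kwh

api = weekly_api_cost(IMAGES_PER_WEEK)
local = weekly_local_cost(IMAGES_PER_WEEK)
print(f"API: ${api:.2f}/week  local: ${local:.2f}/week  "
      f"ratio: {api / local:.0f}x")
```

Even with deliberately conservative placeholder numbers, the ratio lands at tens of times cheaper locally; with real frontier pricing and amortized hardware it widens further, which is the "orders of magnitude" gap the article describes.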

Orders of magnitude: the cost difference between API-based and locally run models for the same task.

This creates structural pressure on Claude's positioning. The marketing promise is universal superiority. The moment developers start auditing their actual workflows against local alternatives, that promise breaks, and Claude loses the assumed quality premium that justified the cost difference.

The uncomfortable truth for frontier model companies: most teams have never tested whether they actually need frontier. They've just assumed it.

Data Privacy Changes the Calculus Entirely

There's an advantage to local models that goes unmentioned in most coverage. Your prompts never leave your hardware. No API logging. No data retention. No terms-of-service review with legal before you can use Claude on sensitive documents.

For teams handling confidential information, this isn't a secondary benefit. It's foundational infrastructure. You cannot use certain API models on documents containing trade secrets, medical data, or client information without legal clearance. Local models eliminate that constraint completely.

Important

Data never leaving your hardware means no compliance review, no data retention policies, no audit trails sent to external servers. For regulated industries, local deployment isn't optional.

Qwen running locally gives you privacy by architecture, not by terms-of-service promise. That's a different security model entirely.

Frontier Models Still Matter, But Only Sometimes

Claude isn't obsolete. The gap just got task-specific and asymmetrical.

Claude maintains an edge on work requiring sustained logical inference, complex reasoning chains, and nuanced understanding across long documents. If your use case is understanding domain-specific context across 100,000 tokens with zero errors, frontier models still win. If you need a system that can hold complicated constraints in mind while generating output, that's where Claude justifies its cost.

But that isn't most workflows. Most work is straightforward: summarization, image generation, simple coding, classification, formatting. Open-source models now deliver comparable or superior results on these tasks. Sometimes significantly better.

The real shift is bifurcation. Twenty percent of use cases genuinely need frontier capability. The other eighty percent get handled cheaper and faster locally.

Teams that build modular systems routing tasks to appropriate models save orders of magnitude on infrastructure. Use Claude for deep reasoning. Use Qwen locally for generation. Use a quantized model for classification. Cost per task collapses when you stop treating every request as if it needs the best model available.
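The routing pattern above can be sketched in a few lines. The model names and the route table here are illustrative assumptions, not a fixed recommendation; the point is only that the dispatch logic itself is trivial.

```python
# Minimal sketch of task-based model routing. The model names and
# routes are illustrative assumptions; swap in whatever models your
# team actually runs.
ROUTES: dict[str, str] = {
    "deep_reasoning":   "claude-opus",        # frontier API model
    "image_generation": "qwen-35b-local",     # local open-source model
    "classification":   "qwen-7b-quantized",  # small quantized model
}

def route(task_type: str) -> str:
    """Pick a model for a task, falling back to the cheapest option."""
    return ROUTES.get(task_type, "qwen-7b-quantized")

print(route("image_generation"))  # routed to the local model
print(route("formatting"))        # unknown task -> cheap default
```

The design choice worth noting: the fallback is the cheapest model, not the most capable one. Defaulting to frontier is exactly the assumption this article argues against.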

[Image: dashboard showing cost optimization through multi-model routing; task-specific model selection reducing overall infrastructure costs]

Why The Industry Structure Itself Is Wrong

Anthropic, OpenAI, and Google market their models as universal tools superior at everything. It's effective marketing. It's terrible engineering philosophy for actually serving customer needs.

That strategy assumes customers prefer simplicity over efficiency. Pay one vendor, use one model, get acceptable results across all tasks. The evidence now suggests customers actually prefer cost-effectiveness. Local models are winning because they're cheaper, not because they're philosophically pure.

Willison's test will seem trivial to most developers. That's precisely why it matters. Trivial tasks are the majority of production usage. Simple requests scale. And on simple requests, local models now dominate on the metrics that matter: cost, speed, and result quality.

The Question You Should Answer This Week

If Qwen 35B beats Claude Opus at image generation, what else is Claude overdeployed on in your infrastructure? Which high-volume, low-complexity requests are you routing to frontier models when local alternatives would work identically while costing 99% less?

Honest assessment: most teams have never actually tested this. They've rationalized the premium as necessary for quality. Willison just proved that assumption wrong at least once, and one test usually means more failures lurk in your usage patterns.

Pull your API logs. Find your high-volume, low-reasoning requests. Run a parallel test with a local model. The pelican test scales. The only variable is whether you actually run it.
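The audit step is scriptable. A minimal sketch, assuming each log entry records a task label and a token count (field names and the 1,000-token "low-reasoning" threshold are assumptions; adapt them to your provider's log schema):

```python
# Sketch of an API-log audit: surface high-volume, low-token requests
# that are candidates for local models. The entry fields ("task",
# "total_tokens") and the threshold are assumed, not a standard schema.
from collections import Counter

def audit(logs: list[dict], token_threshold: int = 1_000) -> Counter:
    """Count low-token (likely low-reasoning) requests per task type."""
    simple: Counter = Counter()
    for entry in logs:
        if entry["total_tokens"] < token_threshold:
            simple[entry["task"]] += 1
    return simple

logs = [
    {"task": "summarize",    "total_tokens": 400},
    {"task": "summarize",    "total_tokens": 350},
    {"task": "legal_review", "total_tokens": 95_000},
]
print(audit(logs).most_common(1))  # highest-volume simple task first
```

Whatever tops that list is where to run your own pelican test: the same requests, side by side, against a local model.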

