Qwen3.5 Medium Is Here and It Just Ran Frontier-Level AI on a Gaming PC

Qwen3.5 Medium Is Here and It Just Ran Frontier-Level AI on a Gaming PC
Qwen3.5 Medium Is Here and It Just Ran Frontier-Level AI on a Gaming PC

Qwen3.5 Medium Is Here and It Just Ran Frontier-Level AI on a Gaming PC

A few days ago, Alibaba's Qwen team quietly dropped something that most people in the AI world did not see coming. They released a new set of open source language models called the Qwen3.5 Medium Model Series, and the numbers attached to them are genuinely wild.

One of these models, the Qwen3.5-35B-A3B, beats both OpenAI's GPT-5 mini and Anthropic's Claude Sonnet 4.5 on several standard AI benchmarks. Claude Sonnet 4.5 only came out five months ago. And the Qwen team did it with a model that can run on a consumer desktop GPU with 32 gigabytes of VRAM.

That is not supposed to happen. Top-tier AI performance was supposed to require massive server farms, billion-dollar infrastructure, and a monthly subscription to some cloud platform. Alibaba just made that assumption look shaky.

You can grab all three open source models right now on Hugging Face and ModelScope. They are free to use and modify under the Apache 2.0 license, which means commercial use is totally fair game.

Let's break down what is actually going on here, why it matters, and what it means for anyone who builds with AI or just pays attention to where this technology is heading.


What Alibaba Just Released

The Four Models in the Qwen3.5 Medium Series

The Qwen3.5 Medium series is a family of four models. Three of them are fully open source, and one is a hosted proprietary option through Alibaba Cloud:

  • Qwen3.5-35B-A3B โ€” The headline model. 35 billion total parameters, but only activates 3 billion at a time. Runs on consumer hardware. Beats GPT-5 mini and Claude Sonnet 4.5 on multiple benchmarks.
  • Qwen3.5-27B โ€” Efficient and fast, handles over 800,000 tokens of context. Good fit for mid-range setups.
  • Qwen3.5-122B-A10B โ€” Designed for server-grade GPUs with 80GB of VRAM. Handles over 1 million tokens and starts closing the gap with the very largest AI models in the world.
  • Qwen3.5-Flash โ€” The hosted cloud version. Proprietary, available only through Alibaba Cloud's Model Studio API, but priced more affordably than nearly any comparable Western model.

The three open source models are available right now on Hugging Face and ModelScope. The base model for Qwen3.5-35B-A3B has also been released separately for researchers who want to fine-tune from scratch.

What โ€œOpen Sourceโ€ Actually Means Here

When something is released under Apache 2.0, it means you can download the model weights, run them yourself, modify them, and build products on top of them without paying a licensing fee. You own your deployment. No monthly API bill, no usage limits, no sending your data to someone else's servers.

For a solo developer or a small company, that is a genuinely big deal. The alternative is paying $3 per million input tokens for Claude Sonnet 4.5, or $1.75 per million for GPT-5.2, every time your app processes text. At scale, those costs add up fast.


The Technical Part Made Simple

What Is a Mixture-of-Experts Model?

Here is the thing about the Qwen3.5-35B-A3B that makes it special: it has 35 billion parameters total, but it only uses 3 billion of them to process any single piece of text.

Think of it like a company with 256 specialized employees. When a question comes in, a manager (the router) picks the 8 or 9 employees best suited to answer it. The rest stay idle. You get the collective knowledge of the whole team, but the actual work is done by a much smaller group.

That system is called a Mixture-of-Experts (MoE) architecture. It is why this model can deliver performance that competes with much larger models while using far less computing power at any given moment.

The full breakdown for Qwen3.5-35B-A3B looks like this:

SpecificationValue
Total Parameters35 billion
Active Parameters per Token3 billion
Number of Experts256
Routed Experts (per token)8
Shared Experts1
Max Context Length1 million+ tokens (on 32GB VRAM GPU)
Quantization Support4-bit (near-lossless)
LicenseApache 2.0

The 27B model supports over 800,000 tokens of context. The 122B model handles 1 million plus. Both are significant numbers. For reference, a typical novel is around 100,000 tokens. One million tokens means you could theoretically feed an entire codebase, a year of meeting transcripts, or a full research archive into a single conversation.

Gated Delta Networks: The Other Technical Trick

Beyond the MoE setup, Qwen3.5 also uses something called Gated Delta Networks layered into its standard Transformer architecture. Without going too deep into the math, Delta Networks help the model track and update information more precisely over long sequences. Combined with MoE, this is part of why the model handles massive context windows without its accuracy falling apart.

Most models get noticeably worse as conversations get longer. They start forgetting earlier details or confusing them. Qwen3.5 was specifically engineered to hold up better across those long contexts.

Quantization and Why It Matters for Running AI Locally

Running a full precision AI model on your own machine requires a lot of memory. A 35 billion parameter model stored in 16-bit floats would need roughly 70 gigabytes of GPU memory. Most people do not have that at home.

Quantization compresses the model by storing its weights using fewer bits. Going from 16-bit to 4-bit cuts the memory requirement by about 75%. The risk is that you lose accuracy in the compression.

The Qwen team says Qwen3.5-35B-A3B is near-lossless at 4-bit quantization. That means the compressed version performs almost identically to the full precision version. With 4-bit compression and the 3 billion active parameter design, the model can run the full 1 million token context window on a consumer GPU with 32GB of VRAM.

That is a gaming-class GPU. The NVIDIA RTX 5090, released earlier this year, has 32GB of VRAM. So does the RTX 4090. This is hardware that individual developers and serious enthusiasts actually own.


How It Performs Against the Competition

Benchmark Results That Turn Heads

Benchmark tests are how AI labs measure their models against each other. They are not perfect measures of real-world usefulness, but they give a consistent basis for comparison. Here is how Qwen3.5-35B-A3B stacks up against models from OpenAI and Anthropic on some key tests:

BenchmarkWhat It TestsQwen3.5-35B-A3BClaude Sonnet 4.5GPT-5 mini
IFBenchInstruction following75.469.076.5
MMMLUMultilingual knowledge85.278.286.0
MMMU-ProVisual reasoning75.168.475.0
SWE-bench VerifiedAgentic coding69.262.072.4
HMMT Feb 2025Graduate math89.080.192.0
BFCL V4Agentic tool use67.355.568.5
BrowseCompAgentic web search61.041.163.8
OmniDocBench v1.5Document understanding89.377.088.9

The Qwen3.5-35B-A3B consistently beats Claude Sonnet 4.5 across most categories. It trades blows with GPT-5 mini in some areas and comes out ahead in others, particularly in visual reasoning and document understanding.

What makes this genuinely surprising is that Claude Sonnet 4.5 is Anthropic's current mid-tier workhorse model. It costs $3 per million input tokens through the API and is used by thousands of production apps. Beating it with a free, locally-runnable model is a real statement.

The 122B model goes even further. It starts competing with some of the largest closed models available, the kind that typically require enterprise contracts and dedicated cloud infrastructure to access.

What the Early User Community Is Saying

People on Hugging Face who have already tested these models have been notably positive about one thing in particular: agentic performance. Agentic tasks are when an AI model has to use tools, browse the web, write and run code, or complete multi-step tasks autonomously without a human guiding each step.

Until recently, that kind of performance was only reliable in the very largest closed models like GPT-5.2 or Claude Opus 4.6. Reviewers are saying Qwen3.5 is narrowing that gap in a way smaller open source models have not managed before.


The โ€œThinking Modeโ€ Feature

How the Model Reasons Before Answering

Qwen3.5 ships with a built-in Thinking Mode turned on by default. Before it gives you a final answer, the model generates an internal chain of reasoning. This reasoning is wrapped in special <think> tags and works through the problem step by step before committing to a response.

This is similar to what OpenAI does with their โ€œoโ€ series reasoning models, but here it is baked directly into the standard model rather than being a separate product.

Why does this matter? For complex tasks like debugging code, solving math problems, or planning multi-step projects, models that reason before answering tend to get things right more often. They catch their own mistakes before surfacing a response. The thinking process acts like a scratchpad.

For most users, this just means the model is more reliable on hard questions. You get fewer confident-sounding wrong answers.


Pricing: The Part That Makes This Really Interesting

What Qwen3.5-Flash Costs vs. the Competition

Even if you are not running models locally and just want to use the API, Qwen3.5-Flash is priced well below almost everything else available. Here is a comparison table of major API models by total cost per million tokens:

ModelInput (per 1M tokens)Output (per 1M tokens)TotalProvider
Qwen 3 Turbo$0.05$0.20$0.25Alibaba Cloud
Qwen3.5-Flash$0.10$0.40$0.50Alibaba Cloud
DeepSeek V3.2$0.28$0.42$0.70DeepSeek
Grok 4.1 Fast$0.20$0.50$0.70xAI
MiniMax M2.5$0.15$1.20$1.35MiniMax
Gemini 3 Flash Preview$0.50$3.00$3.50Google
Claude Haiku 4.5$1.00$5.00$6.00Anthropic
Gemini 3 Pro (under 200K)$2.00$12.00$14.00Google
GPT-5.2$1.75$14.00$15.75OpenAI
Claude Sonnet 4.5$3.00$15.00$18.00Anthropic
Claude Opus 4.6$5.00$25.00$30.00Anthropic
GPT-5.2 Pro$21.00$168.00$189.00OpenAI

Qwen3.5-Flash at $0.50 total versus Claude Sonnet 4.5 at $18.00. That is a 36x price difference for a model that scores higher on several benchmarks.

The API also has specific pricing for tool calling features:

  • Web Search: $10 per 1,000 calls
  • Code Interpreter: Free for a limited time

For anyone building applications that need to search the web or run code as part of their workflow, free code interpretation while it lasts is a meaningful cost reduction.


What This Means If You Are a Developer or Builder

Running AI Without a Cloud Bill

The most immediate practical implication is that you can run a genuinely competitive AI model on your own machine without paying anyone anything per query.

If you have a desktop or workstation with a 32GB GPU, you can download Qwen3.5-35B-A3B, quantize it to 4-bit, and run it locally. Your data stays on your machine. Your usage costs are just electricity. There is no rate limiting, no per-token pricing, no privacy concerns about sending data to a third-party API.

For students, indie developers, researchers, and anyone working on personal projects, that changes the math significantly. You can prototype something ambitious without worrying about running up a $500 API bill during development.

Building Private Applications at Organizations

For companies that deal with sensitive data, the architecture of Qwen3.5 is interesting for a different reason. When you run AI through a cloud API, your prompts and documents leave your infrastructure. Legal teams, compliance departments, and security-conscious organizations often have problems with that.

Running Qwen3.5 locally or on private infrastructure means your data never leaves your firewall. You can analyze internal documents, process confidential client information, and build autonomous agents that operate entirely within your own systems.

Early adopters in enterprise settings have specifically called out the model's agentic capabilities here. The combination of thinking mode, long context windows, and native tool calling means you can build agents that work through complex internal tasks autonomously, without relying on the largest and most expensive closed models to get reliable results.

Fine-Tuning and Research

Alibaba also released the Qwen3.5-35B-A3B-Base model, which is the pre-trained version before instruction tuning. This is specifically useful for researchers and developers who want to fine-tune the model on their own data.

Fine-tuning lets you take a general-purpose model and specialize it for a specific domain. A legal tech company could fine-tune it on case law. A medical startup could train it on clinical notes. A gaming studio could make it an expert on their specific game world.

Having access to the base weights, under an open license, makes this practical in a way that is not possible with closed models.


The Bigger Picture: What Alibaba Is Doing to the AI Market

The Pattern That Keeps Repeating

If you have been paying attention to AI news over the past year, you have probably noticed a pattern. A US lab releases a new model that sets a new benchmark record. A few months later, a Chinese lab releases an open source model that matches or beats it at a fraction of the cost.

DeepSeek did it with V3 and R1. MiniMax did it with M2.5. Now the Qwen team is doing it again with the 3.5 Medium series.

Each time it happens, it puts pressure on the pricing and positioning of Western closed models. Claude Sonnet 4.5 at $18 per million tokens made sense when it was among the best models available at its capability level. That calculus shifts when a free, locally-runnable alternative starts beating it on benchmarks.

This does not mean Anthropic or OpenAI are in trouble. They both have the larger flagship models, customer trust, integrations, safety research, and enterprise relationships that do not disappear overnight. But the mid-tier of the AI market is getting genuinely contested.

Why the โ€œOpen vs. Closedโ€ Debate Is Getting More Interesting

There is an ongoing argument in the AI world about whether open source models are good or bad for safety. Some argue that open weights can be misused more easily because there is no API layer to apply safety filters. Others argue that open source enables transparency, independent auditing, and access for researchers who cannot afford closed model pricing.

The Qwen3.5 release adds fuel to that conversation. When an open source model can match a top closed model in raw capability, the tradeoffs become more real and more widely felt.

What is clear is that the assumption that frontier performance requires closed proprietary systems is becoming harder to maintain with each release cycle.


Should You Actually Try These Models?

If You Are a Developer

Yes, worth experimenting with, especially if you have a GPU with 16GB or more of VRAM. The 4-bit quantized version of the 35B model should run reasonably well even on a 24GB card, though you may need to reduce context length. Tools like Ollama and LM Studio make local deployment fairly approachable even if you are not deep into the technical side.

If you do not want to run it locally, the API through Alibaba Cloud is worth testing for any cost-sensitive application. At $0.50 total per million tokens, you would need a strong reason to pay 36x more for Sonnet 4.5 unless you specifically need Anthropic's safety filters or ecosystem integrations.

If You Are Just Curious About AI

Even if you are not building anything, this release is worth knowing about because it illustrates something real about how AI development is working right now. The gap between what the biggest labs can do and what is freely available to anyone is shrinking fast.

Two years ago, GPT-4 was the most capable model available and it was only accessible through OpenAI's paid API. Today, you can run something that beats several 2025-era mid-tier models on your own computer for free.

The technology is genuinely spreading outward. That has complicated implications, both exciting and worth thinking carefully about.


Where to Get the Models

The Qwen3.5 Medium Series is available now at two places:

Both platforms host the open source versions of Qwen3.5-35B-A3B, Qwen3.5-27B, and Qwen3.5-122B-A10B, along with their base model variants. Model cards on both platforms include technical documentation, recommended quantization settings, and example prompts.

For the API version of Qwen3.5-Flash, you will need to sign up for Alibaba Cloud Model Studio and access it from there.


A Few Things Worth Keeping in Mind

Benchmarks Are Not Everything

The benchmark numbers are real, but they are also cherry-picked to some degree by every lab doing the benchmarking. Real-world usefulness involves things that are harder to measure: does the model follow complex instructions reliably, does it handle edge cases gracefully, does it refuse to do harmful things appropriately, does it maintain coherence over a very long conversation.

The early community feedback on Qwen3.5 is positive, but it is still early. Broad real-world testing over weeks and months will tell a more complete story.

Censorship and Regional Considerations

Qwen models are trained by a Chinese company and operate under different content policies than Anthropic or OpenAI models. Some users have noted that Qwen models refuse to engage with certain politically sensitive topics related to China. This is worth factoring in if your use case involves topics that might trigger those filters.

For most technical, coding, analytical, and creative tasks, this is unlikely to come up. But it is worth knowing.

The Proprietary Flash Model Has Different Rules

Qwen3.5-Flash is not open source. If you use it through the API, you are subject to Alibaba Cloud's terms of service, which include regional access restrictions and usage policies. The open source models do not have those constraints in the same way, which is part of why local deployment is appealing to privacy-focused users.


Wrapping Up

Alibaba just released a set of AI models that beat Anthropic's Claude Sonnet 4.5 on multiple benchmarks, run on consumer hardware, and are completely free to use and modify.

That is a real thing that happened, and it is worth understanding even if you are not a developer or AI researcher. The underlying story is about where serious AI capability is landing: not just in expensive cloud APIs controlled by a few US companies, but increasingly in open weights that anyone can download and run.

For developers, the opportunity to build cost-effectively with competitive models has genuinely expanded. For people thinking about AI more broadly, it is one more data point in a trend that shows this technology spreading faster and more widely than most predictions expected.

Whether that is entirely good news depends on what you value. But it is definitely news.


Models available at Hugging Face and ModelScope

More Posts

Subscription Form