How Scientists Trained AI to Think 3x Faster — Without Adding Any Extra Hardware

A team of researchers from four institutions just pulled off something the AI world has been chasing for years: they made language models dramatically faster by changing how the model thinks during training, not by bolting on expensive extras afterward.
The Problem Nobody Talks About Enough
If you've ever typed a question into an AI chatbot and watched those little dots appear while you wait for a response, you've already experienced the bottleneck this research is trying to fix.
Every time a large language model generates text, it works through the response one word chunk at a time — or more precisely, one “token” at a time. A token is roughly a word or part of a word. The model reads everything you've written, thinks really hard, produces one token, then repeats that entire process thousands of times until it finishes its response.
That's a lot of repetition. And each cycle takes real time and real computing resources.
For casual use, this feels like a minor inconvenience. But think about what happens when AI systems start running in what researchers call “agentic” workflows — where one AI triggers another, which triggers another, and the whole chain needs to finish fast enough to be useful. Or think about the newer “reasoning” models that generate thousands of tokens just thinking through a problem before they even start writing an answer.
At that scale, slow token generation doesn't just feel annoying. It costs real money and creates real delays that make certain applications completely impractical.
Why One Token at a Time Was Always the Standard
The reason AI models generate text this way goes back to how they were trained. The standard training approach is called next-token prediction. You show the model enormous amounts of text, and you train it to predict what word comes next. Over billions of examples, it gets very good at this.
It's a clean, elegant approach that scales well. The problem is that it locks the model into sequential behavior at inference time. Generate one token, feed it back in, generate the next one. One at a time, every time.
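The sequential loop described above can be sketched in a few lines. This is a toy illustration, not code from the paper: `generate_sequentially`, `toy_next_token`, and the canned sentence are hypothetical stand-ins for a real model.

```python
from typing import Callable, List

def generate_sequentially(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    max_new_tokens: int,
    stop_token: str = "<eos>",
) -> List[str]:
    """Generate one token per model call, feeding each token back in."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # one full model call per generated token
        if tok == stop_token:
            break
        tokens.append(tok)  # the new token becomes part of the next input
    return tokens[len(prompt):]

# Hypothetical stand-in for a model: replays a canned sentence.
CANNED = ["the", "zookeeper", "fed", "the", "panda", "<eos>"]
calls = 0

def toy_next_token(context: List[str]) -> str:
    global calls
    calls += 1
    return CANNED[len(context) - 1]

output = generate_sequentially(toy_next_token, ["<bos>"], max_new_tokens=10)
print(output, "model calls:", calls)
```

Five output tokens take six model calls here (the last call produces the stop token), which is the one-call-per-token cost the rest of this article is about.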
For a while, this was fine. But as models got bigger, as reasoning traces got longer, and as AI systems started working in longer chains of tasks, the costs kept compounding.
A researcher at the University of Maryland, John Kirchenbauer, who co-authored the paper this article is based on, put it clearly: as AI moves toward agentic workflows, latency — meaning how long a single user waits for their answer — becomes just as important as overall throughput. It's not enough for a GPU to process lots of tokens per second across many users. Individual users sitting in agentic loops need their specific responses to come back fast.
That's a fundamentally different problem than just making AI generally efficient. And it required a different solution.
The Existing Workarounds and Why They Fall Short
Before this new research, engineers had already developed techniques to speed up inference. The most well-known is speculative decoding.
Here's how speculative decoding works: you run a small, fast “draft” model alongside the main model. The draft model guesses several tokens ahead. Then the main model checks those guesses. If they're correct, you accept them all at once, which is much faster than generating them one by one. If they're wrong, you throw them out and generate correctly.
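A minimal sketch of one draft-and-verify round, assuming greedy decoding where a guess counts as correct only if it equals the main model's own choice. Production systems verify all guesses in a single batched pass and often use a probabilistic acceptance rule. The function and the toy `DRAFT`/`TARGET` sequences below are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[str]], str]

def speculative_step(draft_next: NextToken, target_next: NextToken,
                     context: List[str], k: int) -> List[str]:
    """One round: the draft model proposes k tokens, the main model checks them."""
    # 1. The small draft model guesses k tokens ahead.
    proposal: List[str] = []
    ctx = list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. The main model checks each guess in order (a loop here for clarity).
    accepted: List[str] = []
    ctx = list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # keep the main model's token, drop the rest
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Toy models: the draft agrees with the main model except at position 2.
TARGET = ["a", "b", "c", "d", "e"]
DRAFT = ["a", "b", "x", "d", "e"]
accepted = speculative_step(
    draft_next=lambda ctx: DRAFT[len(ctx)],
    target_next=lambda ctx: TARGET[len(ctx)],
    context=[],
    k=4,
)
print(accepted)
```

In this toy run the first two guesses are accepted, the third is replaced by the main model's token, and the fourth is thrown away.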
It's clever. It works reasonably well. But it comes with real operational headaches.
You need to deploy and maintain two separate models — the main one and the draft one. The draft model has to be tuned to match the main model closely enough that its guesses are usually right. When the main model gets updated, the draft model may need updating too. You're spending more total compute running two models in parallel. And the infrastructure to manage this is genuinely complex.
There's also diffusion-based language modeling, which approaches text generation in a completely different way by predicting tokens in parallel using a denoising process. This is technically interesting but requires training models from scratch using an entirely different approach, which means you can't use it with models that already exist.
So when this new research came along, the question it was answering was: can we get the speed benefits of parallel token generation without the headache of maintaining a separate draft model, and without rebuilding everything from scratch?
The answer turned out to be yes.
Multi-Token Prediction: The Core Idea
The concept behind the new approach is called multi-token prediction, or MTP. The idea is exactly what it sounds like: instead of training a model to predict one token at a time, you train it to predict a whole block of tokens at once.
Produce token 1, token 2, token 3, token 4 — all in a single forward pass through the model. One round of computation, multiple outputs.
If you can actually make this work well, the speed gains are obvious. Predicting four tokens at once doesn't take four times the compute of predicting one. It takes maybe 1.5x or 2x. So you're getting a lot more text out for each pass, which means the overall process runs much faster.
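As a back-of-the-envelope check on that claim, the sketch below divides tokens produced per pass by the relative cost of a pass. It treats the 1.5x and 2x figures above as per-pass costs and assumes every block is accepted in full, which is an idealization. `ideal_speedup` is a name made up for this example.

```python
def ideal_speedup(tokens_per_pass: float, relative_pass_cost: float) -> float:
    """Upper-bound speedup versus one token per pass at cost 1.0."""
    if relative_pass_cost <= 0:
        raise ValueError("relative_pass_cost must be positive")
    return tokens_per_pass / relative_pass_cost

cheap = ideal_speedup(tokens_per_pass=4, relative_pass_cost=1.5)
costly = ideal_speedup(tokens_per_pass=4, relative_pass_cost=2.0)
print(f"4 tokens at 1.5x cost: {cheap:.2f}x faster")
print(f"4 tokens at 2.0x cost: {costly:.2f}x faster")
```

Real gains will be lower whenever some of the predicted tokens have to be discarded.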
The catch is that standard training methods don't actually produce a model that's good at this.
Why Standard Training Fails for Multi-Token Generation
When you train a model on next-token prediction, it learns to predict each token based only on what comes before it. Each prediction is independent. The model doesn't learn to think about how tokens relate to each other across a group.
When you take such a model and ask it to generate multiple tokens at once using the standard training objective, two specific things go wrong.
The first problem is what the researchers call grammatical mismatch. Say the model is predicting two words to follow the phrase “The zookeeper fed the.” It might predict the first word as one independent guess and the second word as another independent guess. Each guess is sensible on its own: “panda” or “lion” for the first slot, “bamboo” or “meat” for the second. But because the two guesses are not coordinated, nothing forces them to match. You needed a matched pair like “panda bamboo” or “lion meat.” Instead, you can get scrambled combinations like “lion bamboo” or “panda meat,” which don't make sense.
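The mismatch can be made concrete with a tiny probability calculation. The numbers are invented for illustration: assume the only sensible continuations are “panda bamboo” and “lion meat,” each equally likely, and see what happens when each slot is predicted independently from its own marginal distribution.

```python
from collections import defaultdict
from itertools import product

# Assumed "true" distribution over two-word continuations (illustrative only).
joint = {("panda", "bamboo"): 0.5, ("lion", "meat"): 0.5}

# Per-slot (marginal) distributions, which is all an uncoordinated
# predictor has available for each position.
first = defaultdict(float)
second = defaultdict(float)
for (a, b), p in joint.items():
    first[a] += p
    second[b] += p

# Predicting each slot independently multiplies the marginals.
independent = {(a, b): first[a] * second[b] for a, b in product(first, second)}

# Probability mass that lands on pairs the true distribution never produces.
mismatch_mass = sum(p for pair, p in independent.items() if pair not in joint)

for pair, p in sorted(independent.items()):
    print(pair, p)
print("probability of a scrambled pair:", mismatch_mass)
```

Under these assumptions, half of all independently predicted pairs are scrambled, even though every individual word is a reasonable guess.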
The second problem is degenerate repetition. When a model trained on standard next-token prediction is asked to predict a token that's 50 or 100 positions ahead, it has no idea. It can't look that far forward based on its training. So it falls back on the most statistically common word in the English language: “the.” You end up with outputs that trail off into “…the the the the the…” which is obviously useless.
These aren't small problems. They're fundamental failures that make naive multi-token prediction worse than useless.
The Student-Teacher Fix
The research team's solution borrows a framework from machine learning called knowledge distillation — but they applied it in a novel way specifically designed to fix the problems described above.
They set up a student-teacher training scheme. The student is the model being trained to generate multiple tokens at once. The teacher is a strong, well-trained next-token prediction model that already generates high-quality text.
Here's how training works. The student model looks at a prompt and generates a block of tokens all at once — let's say four tokens in one forward pass. This happens in parallel; it's not sequential. The student produces its best guess at what four tokens should follow.
Then the teacher model evaluates that block. The teacher reads the context plus the student's proposed block and calculates how likely and how coherent that sequence is. If the student said “lion bamboo” when the context made “lion meat” much more sensible, the teacher assigns a high loss to that output. This loss signal gets fed back to the student, and the student's weights get adjusted to avoid making that mistake again.
The feedback is dynamic. The student isn't memorizing fixed text from a dataset. It's generating its own proposals and getting evaluated on those proposals. In reinforcement learning terms, the student is sampling its own trajectories and receiving rewards based on quality.
This is the key insight that makes everything work. By having the teacher grade the coherence of full multi-token sequences rather than evaluating each token in isolation, the student learns to think about tokens as groups. It learns that “lion” and “meat” belong together in the same way that “panda” and “bamboo” do. It learns to generate blocks that are internally consistent.
The degenerate repetition problem also goes away because the teacher, which is a strong language model, assigns terrible scores to outputs full of repeated words. The student learns quickly that “the the the the” earns a very bad grade.
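To show how a teacher's grade can separate coherent blocks from scrambled or repetitive ones, here is a toy scorer. It is a sketch under loud assumptions: the article does not give the exact loss, so this uses a plain negative log-likelihood, and the “teacher” is a made-up bigram table (`TEACHER_BIGRAMS`, `FLOOR`) rather than a real language model.

```python
import math
from typing import List

# Hypothetical teacher: probability of a token given the previous token.
TEACHER_BIGRAMS = {
    ("the", "lion"): 0.45,
    ("the", "panda"): 0.45,
    ("lion", "meat"): 0.90,
    ("panda", "bamboo"): 0.90,
}
FLOOR = 0.01  # assumed probability for any pair the teacher finds implausible

def teacher_loss(last_context_token: str, block: List[str]) -> float:
    """Negative log-likelihood of a whole block under the toy teacher."""
    prev = last_context_token
    nll = 0.0
    for tok in block:
        p = TEACHER_BIGRAMS.get((prev, tok), FLOOR)
        nll -= math.log(p)
        prev = tok  # each token is judged in light of the one before it
    return nll

good = teacher_loss("the", ["lion", "meat"])
scrambled = teacher_loss("the", ["lion", "bamboo"])
repeated = teacher_loss("the", ["the", "the"])
print(f"lion meat:   {good:.2f}")
print(f"lion bamboo: {scrambled:.2f}")
print(f"the the:     {repeated:.2f}")
```

Because each token is scored given the tokens before it, the matched pair gets the lowest loss, the scrambled pair a higher one, and the repeated block the highest.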
The Beautiful Simplicity of the Architecture Change
One of the most impressive aspects of this research is how little it changes about the model's architecture. Normally, when researchers want to add a new capability to a neural network, they add new layers, new attention mechanisms, new specialized components. These changes are expensive to implement, hard to integrate with existing systems, and often incompatible with models already deployed.
This approach adds one thing: a single special token called an <MTP> token.
The way it works is that when the model sees this token in the input, it knows to generate a full block of tokens in parallel rather than a single next token. The <MTP> token occupies an unused slot in the model's existing vocabulary embedding matrix — the table of representations that the model already has for every word and special symbol it knows.
That's it. The underlying architecture — the layers, the attention mechanism, the internal structure, whatever variant the model uses like mixture-of-experts or sliding window attention — stays completely untouched.
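The idea of reusing an unused vocabulary slot can be illustrated with a toy vocabulary. Everything here is hypothetical: the `<reserved_N>` naming, the tiny embedding table, and `claim_reserved_slot` are assumptions for the example, and real tokenizers expose reserved tokens in their own ways.

```python
from typing import Dict, List

# Toy vocabulary with two reserved slots that no real text ever uses.
vocab: Dict[str, int] = {
    "the": 0, "panda": 1, "bamboo": 2,
    "<reserved_0>": 3, "<reserved_1>": 4,
}
# Toy embedding table: one row of 4 numbers per vocabulary entry.
embedding: List[List[float]] = [[0.1 * i] * 4 for i in range(len(vocab))]

def claim_reserved_slot(vocab: Dict[str, int], new_token: str) -> int:
    """Rename the first reserved entry to new_token and return its row index."""
    reserved = next(name for name in vocab if name.startswith("<reserved_"))
    idx = vocab.pop(reserved)
    vocab[new_token] = idx
    return idx

rows_before = len(embedding)
mtp_id = claim_reserved_slot(vocab, "<MTP>")
print("<MTP> id:", mtp_id, "| embedding rows:", rows_before, "->", len(embedding))
```

The embedding table keeps the same shape. Only the meaning of one previously unused row changes, and training then teaches the model what that row should do.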
What this means practically is enormous. Any existing language model can potentially be adapted to use this approach without rebuilding anything from scratch. The team confirmed this directly: any standard next-token prediction model can be adapted this way, with no barriers from the model's internal design choices.
For teams running AI models in production, this is genuinely good news. You don't have to throw away your existing model and start over. You apply the MTP training procedure to the model you already have, and it gains the ability to generate multiple tokens at once.
ConfAdapt: Being Smart About When to Go Fast
Training a model to generate multiple tokens at once is only half the problem. The other half is figuring out when, at inference time, to actually use that capability versus when to fall back to single-token generation.
Here's why this matters. Not all text is equally predictable. If you're generating a mathematical formula or filling in a templated sentence, the next several tokens might be extremely obvious and the model can be very confident about what they'll be. If you're generating the punchline to a joke or expressing a nuanced opinion, the next token is genuinely uncertain and getting it wrong will mess up everything that follows.
Generating multiple tokens at once works great when confidence is high. But forcing it when confidence is low produces errors, and those errors cascade through the rest of the response.
The researchers' solution is an adaptive strategy they call ConfAdapt — short for confidence-based adaptation.
How ConfAdapt Works in Practice
At each generation step, the model generates a block of tokens — say four tokens. It also computes a confidence score for that block, representing how sure it is that this particular sequence is the right one.
If the confidence score meets a threshold — the researchers used 90% in their tests — the model accepts and outputs the entire block at once. Four tokens for the price of roughly one and a half passes through the model.
If the confidence score falls below the threshold, the model discards everything after the first token. It keeps only that first token, the one it is most sure of, and tries again from the new position.
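The accept-or-fall-back rule can be sketched as follows. One detail is an assumption: the article does not say how the block's confidence score is computed, so this example uses the lowest per-token probability in the block. The function name `conf_adapt` and the sample numbers are also illustrative.

```python
from typing import List

def conf_adapt(block: List[str], token_confidences: List[float],
               threshold: float = 0.90) -> List[str]:
    """Return the tokens to emit for one generation step."""
    if not block or len(block) != len(token_confidences):
        raise ValueError("block and token_confidences must be non-empty and aligned")
    # Assumption: block confidence = the weakest token's probability.
    block_confidence = min(token_confidences)
    if block_confidence >= threshold:
        return block      # confident: emit the whole block at once
    return block[:1]      # not confident: keep only the first token

confident = conf_adapt(["x", "=", "4", "."], [0.99, 0.97, 0.95, 0.93])
uncertain = conf_adapt(["x", "=", "4", "."], [0.99, 0.60, 0.95, 0.93])
print(confident)
print(uncertain)
```

Lowering `threshold` accepts more blocks and runs faster at the cost of more errors, which is the tradeoff behind the more aggressive settings.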
This means the model is effectively doing what a skilled person does when reading a predictable text: skimming quickly through the obvious parts and slowing down to think carefully about the uncertain parts.
When generating formulaic or highly structured text — legal boilerplate, mathematical reasoning steps, code comments with obvious continuation — ConfAdapt lets the model sprint through large chunks at once. When generating genuinely creative or ambiguous text, it naturally slows to careful single-token generation.
The result is that the speed gains concentrate exactly where they're most reliable. Predictable domains see the biggest acceleration. Unpredictable domains see modest acceleration or none at all, which is fine because forcing speed there would just produce worse outputs.
The Test Results: What 3x Actually Looks Like
The research team tested their method on two real, widely-used open-source models. The first was Llama-3.1-8B-Magpie, a strong general-purpose instruction-following model at 8 billion parameters. The second was Qwen3-4B-Instruct-2507, a smaller 4 billion parameter model often chosen when compute costs are a concern.
Both models were fine-tuned on MetaMathQA, a dataset of synthetic math problems that require multi-step reasoning. Math was chosen because it has long reasoning traces — exactly the use case where speed matters most — and because math answers are objectively verifiable, which makes accuracy measurement clean.
What the Numbers Showed
Using ConfAdapt with appropriate confidence thresholds, the Llama-3.1-8B model achieved a 3x speedup. Accuracy on math benchmarks dropped by less than 3 percentage points. That's a remarkably small quality cost for tripling generation speed.
The Qwen3-4B model also hit 3x speedup, with a slightly larger accuracy drop of around 7 percentage points. Still very usable, especially for applications where a small accuracy tradeoff is acceptable in exchange for dramatically lower latency.
More aggressive settings — using lower confidence thresholds, accepting more uncertain multi-token blocks — could push the acceleration up to 5x. The tradeoff is steeper accuracy losses at that level, so the 3x setting represents what the researchers describe as the sweet spot between speed and quality.
What's especially notable is where those speedups come from. On sections of text where the model is highly confident — structured steps in a math proof, repeated formulas, common phrasing patterns — ConfAdapt might emit six or seven tokens in a single pass. On sections where the content is genuinely uncertain, it drops to one or two tokens per pass. The average works out to around 3x because math reasoning has a lot of predictable structure mixed in with genuinely hard inference steps.
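A quick way to see how a mix of long and short steps averages out is shown below. The trace is made up for illustration and is not data from the paper. Note that tokens per pass is only a proxy for wall-clock speedup, since a multi-token pass can cost more than a single-token pass.

```python
from typing import List

def mean_tokens_per_pass(tokens_emitted_per_pass: List[int]) -> float:
    """Average number of tokens emitted per forward pass."""
    if not tokens_emitted_per_pass:
        raise ValueError("need at least one pass")
    return sum(tokens_emitted_per_pass) / len(tokens_emitted_per_pass)

# Invented trace: two confident stretches, then four uncertain steps.
trace = [7, 6, 1, 2, 1, 1]
average = mean_tokens_per_pass(trace)
print(f"{sum(trace)} tokens in {len(trace)} passes = {average:.1f} tokens per pass")
```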
The Generalization Surprise
One finding the researchers didn't fully anticipate: the speedups transferred to tasks that weren't part of the training data at all.
The models were trained for multi-token generation using math problems. But when tested on creative writing, text summarization, and general question answering, they still showed significant acceleration. Not as dramatic as on math-like tasks, but meaningful.
This suggests that the model is learning something more general than just how to generate math text quickly. It seems to be learning a broader capability for recognizing predictable structure in language and exploiting that structure to generate in parallel.
For practical deployments, the researchers still recommend fine-tuning models for MTP using samples from your specific application domain. If you're deploying for customer service text, use customer service examples. If you're deploying for code generation, use code. The best performance comes from matching the training domain to the deployment domain. But the fact that generalization exists at all is encouraging and gives users a reasonable baseline even before domain-specific tuning.
What This Means for Real AI Deployments
Let's think about what a 3x inference speedup with only a small quality loss means for actual products and services.
For a company running an AI assistant that needs to generate long responses, cutting generation time by two-thirds directly reduces the GPU time needed per response. That's a meaningful cost reduction. At scale, it could mean the difference between a product being economically viable or not.
For agentic applications where one AI model triggers multiple other calls (planning, executing, checking, reporting), each step in the chain gets faster. When those calls run one after another, the savings accumulate at every step: a chain of five calls, each 3x faster, finishes in roughly a third of the time, which can be the difference between a workflow that feels interactive and one that doesn't.
For users, the experience is simply better. Waiting three seconds for an answer versus waiting nine seconds is not a minor difference. It changes how the interaction feels. It makes the tool feel alive and responsive rather than laborious.
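For a sense of scale, here is the same arithmetic applied to a chain of calls that run one after another. It assumes each call takes the nine seconds used in the example above and that generation time dominates each call. Both are simplifications, and `chain_latency` is a name invented for this sketch.

```python
from typing import List

def chain_latency(per_call_seconds: List[float], speedup: float = 1.0) -> float:
    """Total wait for calls that run strictly one after another."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return sum(t / speedup for t in per_call_seconds)

calls_in_chain = [9.0] * 5  # five sequential calls, nine seconds each
baseline = chain_latency(calls_in_chain)
accelerated = chain_latency(calls_in_chain, speedup=3.0)
print(f"baseline: {baseline:.0f}s, with 3x generation: {accelerated:.0f}s")
```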
The Infrastructure Story
One of the practical concerns engineers might have about adopting new training techniques is the cost of rebuilding deployment infrastructure.
The researchers addressed this directly. Teams using popular inference frameworks like vLLM or SGLang will need to make some adjustments to how batching and KV caching work — these are the systems that manage how tokens are processed and stored during generation. But the team described these as one-time engineering investments, not ongoing complexity.
More encouragingly, John Kirchenbauer stated clearly that the team sees no fundamental barriers to integration and is actively working with systems experts to figure out the most straightforward path to plugging this into existing deployment stacks.
The framing matters here. Existing acceleration techniques tend to focus entirely on the inference system — the harnesses and logic that run the model after it's trained. This approach embeds some of the acceleration capability directly into the model's weights during training. That means the inference system doesn't need to do as much extra work. The speed is already there in the model.
This makes it genuinely complementary to other optimization techniques. You could, in theory, combine MTP with other inference optimizations and stack the benefits.
The Bigger Picture: Why This Research Direction Matters
Step back and think about what's happening in AI right now.
Models are getting better at reasoning, but better reasoning means longer reasoning traces, which means more tokens, which means higher costs and longer waits. The capability improvements and the efficiency costs are moving in the same direction.
Multi-token prediction via self-distillation is one of several research directions trying to break that coupling — to let models reason deeply without paying a linear penalty in time and compute for every extra step.
What makes this particular approach worth paying attention to is the combination of three things:
The first is the simplicity of the architecture change. Adding one special token and adapting an existing model is dramatically easier than building entirely new architectures from scratch. This means the technique can potentially be applied to models that already exist and are already deployed.
The second is the quality preservation at 3x speed. A speedup that comes with a 30% quality loss is not useful in most applications. A speedup that costs under 3 percentage points of accuracy, as the Llama model showed on math benchmarks, is genuinely deployable.
The third is the training approach using self-distillation. By having a teacher grade the blocks the model itself generates, rather than training only on static ground-truth text, the framework avoids the degenerate collapse problems that plagued earlier multi-token prediction attempts. It's a clever solution to a hard problem.
What Comes Next
The research team has already released their trained models publicly on Hugging Face, so anyone can download and experiment with them today. If you want to see ConfAdapt's acceleration in action, the lead researcher suggests starting with simple, highly structured prompts — counting sequences, repeated phrases, formulaic text — where the model's confidence will be consistently high and the speedups will be most visible.
The code for the full MTP training framework is being released shortly at https://github.com/jwkirchenbauer/mtp-lm. This is where the real action will be for teams that want to apply the technique to their own models. Once the code is available, the workflow is: take your existing model, run the MTP fine-tuning procedure using examples from your deployment domain, and deploy the resulting model with a ConfAdapt-aware inference setup.
The team behind this research spans the University of Maryland, Lawrence Livermore National Labs, Columbia University, and TogetherAI — a collaboration that combines academic depth in machine learning theory with practical expertise in deploying models at scale. That combination shows in how the research is framed: not just “here is a technique that works in a paper” but “here is a technique that engineering teams can actually use.”
Putting This in Context for Someone New to AI
If you're coming to this without deep AI background, here's the simplest way to think about what happened here.
AI language models are like very fast readers who write text one letter at a time. Each letter they write requires re-reading everything before it. That's slow when you need thousands of letters.
Researchers have tried to speed this up by having a second, cheaper reader guess the next few letters so the main reader can just verify them — that's speculative decoding. It works, but it means maintaining two readers and keeping them in sync.
What this new research does is train the main reader to write multiple letters at once when it's confident enough to do so. No second reader needed. The main reader just learned a new skill during its training process.
The training trick that made this possible was pairing the model being trained with a strong evaluator — a teacher — that grades not just whether individual letters are correct, but whether the whole group of letters makes sense together. This teaches the student model to think in groups, not just one at a time.
The result is a model that can move three times faster through text it's confident about, while still taking its time on the hard parts. All from a single change: one special token added to its vocabulary.
Final Thoughts
What this research represents isn't just a speed trick. It's a shift in how we think about where AI efficiency improvements should live.
Most of the work on making AI faster has happened at the infrastructure level — better hardware, smarter batching, more efficient memory management. These are all valuable, and they'll keep being valuable. But they're all operating on models that were trained to generate one token at a time.
This approach says: what if the speed improvement lives inside the model itself? What if the model learns during training to be more efficient at inference time?
That's a different philosophy, and it opens up interesting possibilities. Models could be trained from the start with inference efficiency as an explicit goal, not just a post-hoc optimization. The gap between what a model can think and how fast it can communicate that thinking could close significantly.
For the many people building products on top of AI right now — whether you're a developer, a product manager, or just someone curious about where this technology is going — this kind of research is worth understanding. Not because you need to implement it yourself, but because it represents the kind of progress that makes AI tools genuinely more useful rather than just more impressive on benchmarks.
Faster, cheaper, and still accurate. That's the combination that makes AI accessible to more people and more applications. And that's worth being excited about.
The full MTP framework code will be available at https://github.com/jwkirchenbauer/mtp-lm — a resource worth bookmarking if you're building anything that relies on language model inference and cares about speed.