Liquid AI Launches LFM2.5-350M, Compact AI Model Trained on Massive 28T Token Dataset



There is a popular belief in the AI world that size is everything. Bigger models, more parameters, more memory: that has been the playbook for years. Companies race to build the largest possible systems, and everyone assumes more compute automatically means smarter AI.

Liquid AI is quietly challenging that belief.

Their latest release, LFM2.5-350M, is a small AI model by almost any standard: 350 million parameters. But what it can do with those parameters is genuinely interesting. This model was trained on 28 trillion tokens and uses a combination of reinforcement learning and a hybrid architecture that departs from the standard Transformer to punch well above its weight. On several benchmarks, it outperforms models that are more than twice its size.

That is not a marketing claim. That is what the benchmark results actually show.

This article breaks down exactly what makes LFM2.5-350M different, why the architecture matters, what the numbers actually mean, and why any of this is worth caring about.


Why Small Models Are Getting More Attention

Before getting into the technical details, it helps to understand why a small, efficient model like this matters right now.

Most of the buzz in AI goes to the largest frontier models, the ones requiring massive data centers, expensive GPUs, and significant infrastructure. Those models are impressive, but they are not practical for every use case.

Think about a smart home device. Or a phone that can run AI without sending data to a server. Or an embedded system in a car, a hospital monitor, or an industrial machine. None of those environments can host a 70-billion parameter model. They need something lean, fast, and accurate enough to be genuinely useful.

That is the space Liquid AI is targeting with this release. They call it the "edge": hardware with limited memory and limited compute. And the question they are answering with LFM2.5-350M is: how much can you actually do at the edge if you train the right way?


What Does 28 Trillion Tokens Even Mean?

For context, a "token" is roughly a word or a word fragment. When a model is trained, it reads through an enormous amount of text, and each piece of text is broken into tokens for processing.

Most large language models are trained on a few trillion tokens. GPT-4 was reportedly trained on somewhere around 13 trillion. The point is that 28 trillion is a lot, especially for a model with only 350 million parameters.

The Token-to-Parameter Ratio

Here is where things get interesting. When you divide 28 trillion tokens by 350 million parameters, you get a ratio of approximately 80,000 to 1. That means for every single parameter in the model, it was trained on 80,000 tokens of data.
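The division is easy to check. The short Python sketch below reproduces it using only the two figures quoted in this article; the variable names are illustrative.

```python
# Figures quoted in the article.
TRAINING_TOKENS = 28_000_000_000_000  # 28 trillion tokens
PARAMETERS = 350_000_000              # 350 million parameters

# Tokens of training data seen per model parameter.
tokens_per_parameter = TRAINING_TOKENS / PARAMETERS

print(f"{tokens_per_parameter:,.0f} tokens per parameter")  # prints "80,000 tokens per parameter"
```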

This ratio is extraordinarily high compared to industry norms. Most large models have much lower token-to-parameter ratios. By pushing this ratio so high, Liquid AI is forcing the model to extract as much useful information as possible from the data it sees.

Think of it this way. If you study a textbook once quickly, you probably remember a fraction of it. If you study it deeply, work through every problem, read it multiple times, and actually try to apply what you learn, you will absorb significantly more. The LFM2.5-350M is the second student. It was trained intensively on a massive dataset, which means every parameter is doing more work.

The result is what Liquid AI calls intelligence density: the idea that you can pack a lot of capability into a small number of parameters, as long as your training process is good enough.


The Architecture: Why This Is Not a Standard Transformer

Most AI language models you have heard of (GPT, LLaMA, Mistral, Gemma) are built on the Transformer architecture. Transformers work well, but they have a known limitation that becomes problematic at scale.

The KV Cache Problem

When a Transformer processes a long piece of text, it stores information about every previous token in something called the Key-Value cache, or KV cache. This is how the model remembers context. As the context window grows, meaning the longer the text you feed it, the KV cache grows too.

That sounds fine in theory. In practice, it creates a memory bottleneck. On edge devices with limited RAM, a large KV cache can make running the model completely impractical. The cache grows linearly with context length, and the compute cost of attention grows quadratically, so longer inputs steadily increase the burden.

Liquid AI sidesteps most of this problem by using a different architecture.

Linear Input-Varying Systems

The core of LFM2.5-350M is built on Linear Input-Varying Systems, or LIVs. These are a type of sequence processing layer that functions similarly to Recurrent Neural Networks โ€” systems that process information sequentially and maintain a form of internal memory state.

The key difference is that LIVs are designed to be more parallelizable and more stable during training than traditional RNNs, which were historically difficult to train well. LIVs maintain a constant-size memory state regardless of how long the input is, which means memory requirements do not grow as context length increases.

For the LIV layers, this removes the KV cache problem entirely.
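To make the contrast concrete, here is a toy sketch. It is not Liquid AI's LIV operator, whose internals this article does not specify. It compares a cache that stores one entry per token with a simple gated recurrence that folds each token into a fixed-size state. The update rule, the gate value, and the state size are all assumptions chosen for illustration.

```python
def run_with_cache(tokens):
    """Attention-style memory: keep one entry per token seen so far."""
    cache = []
    for tok in tokens:
        cache.append(tok)  # memory grows by one entry per token
    return len(cache)


def run_with_fixed_state(tokens, state_size=4, gate=0.9):
    """Recurrent-style memory: fold every token into a fixed-size state."""
    state = [0.0] * state_size
    for tok in tokens:
        # Toy gated update (hypothetical): decay the old state, mix in the new token.
        state = [gate * s + (1.0 - gate) * tok for s in state]
    return len(state)


# Report how many values each approach holds after reading the whole input.
for length in (10, 1_000, 100_000):
    tokens = [1.0] * length
    print(length, run_with_cache(tokens), run_with_fixed_state(tokens))
```

The middle column grows with the input, while the last column stays at the chosen state size no matter how long the sequence is.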

The Hybrid Backbone

LFM2.5-350M uses a hybrid architecture that combines two types of layers:

10 Double-Gated LIV Convolution Blocks

These handle the majority of sequence processing. They are efficient, memory-friendly, and handle most of the heavy lifting. Because they use a constant-size memory state, they produce very low I/O overhead compared to attention-based layers.

6 Grouped Query Attention (GQA) Blocks

A small number of attention blocks are included for precision tasks: retrieving specific details, handling long-range dependencies, and managing complex contextual relationships. GQA is a more memory-efficient version of standard attention. It shares each key-value head across a group of query heads, which reduces the KV cache size while retaining much of the capability.

By combining both approaches, the model gets the memory efficiency of LIV processing for most tasks, plus the precision of attention for the tasks that genuinely need it. Neither approach alone would achieve the same result.
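A rough calculation shows why both choices matter for memory. The sketch below uses the standard accounting for a KV cache (keys plus values, per attention layer, per KV head, per cached token) and compares a hypothetical all-attention stack of 16 blocks with the 6-GQA-block layout described here. The head count, head dimension, and 16-bit values are assumptions, not published LFM2.5 figures, so only the ratio is meaningful, not the absolute sizes.

```python
def kv_cache_bytes(attn_layers, kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Size of a KV cache: keys and values for every attention layer and cached token."""
    return 2 * attn_layers * kv_heads * head_dim * context_tokens * bytes_per_value


CONTEXT = 32_768  # context window quoted in the article

# Hypothetical dimensions, chosen only for illustration (not published LFM2.5 values).
HEAD_DIM = 64
QUERY_HEADS = 16
GROUPED_KV_HEADS = 4  # GQA: several query heads share each key/value head

# Baseline: all 16 blocks use attention with one KV head per query head.
full_attention = kv_cache_bytes(16, QUERY_HEADS, HEAD_DIM, CONTEXT)
# Hybrid: only the 6 GQA blocks keep a per-token KV cache; the 10 LIV blocks do not.
hybrid_gqa = kv_cache_bytes(6, GROUPED_KV_HEADS, HEAD_DIM, CONTEXT)

print(f"full attention: {full_attention / 2**20:.0f} MiB")
print(f"hybrid + GQA:   {hybrid_gqa / 2**20:.0f} MiB")
print(f"reduction:      {full_attention / hybrid_gqa:.1f}x")
```

Under these assumptions the saving comes from two independent factors: fewer layers that need a cache at all, and fewer KV heads in the layers that do.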


Context Window and Memory Footprint

One of the most impressive technical specs is that LFM2.5-350M supports a 32,768 token context window, commonly written as 32k tokens.

To put that in perspective, 32k tokens is roughly equivalent to 24,000 words of text. That is about the length of a short novella, or an entire research report, or a lengthy codebase. For a 350-million parameter model to handle that much context with low memory overhead is unusual.

The memory numbers Liquid AI published for specific hardware are striking:

  • Snapdragon 8 Elite NPU: 169 MB peak memory using RunAnywhere Q4
  • Snapdragon GPU: 81 MB peak memory using RunAnywhere Q4
  • Raspberry Pi 5: 300 MB using Cactus Engine int8

The 81 MB figure on a mobile GPU is the one that stands out. Most consumer smartphones have 8 to 12 GB of RAM. Running an AI model that handles a 32k context window in 81 MB of peak memory means this model can run locally on a phone without competing meaningfully with other apps for resources.
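For a sense of proportion, the sketch below expresses each published peak-memory figure as a share of device RAM. The 8 GB total is an assumption applied to every device purely for comparison; actual RAM varies by model.

```python
# Peak memory figures quoted above, in megabytes.
PEAK_MEMORY_MB = {
    "Snapdragon 8 Elite NPU (RunAnywhere Q4)": 169,
    "Snapdragon GPU (RunAnywhere Q4)": 81,
    "Raspberry Pi 5 (Cactus Engine int8)": 300,
}


def share_of_ram(peak_mb, ram_gb):
    """Fraction of total device RAM taken by the model at peak (1 GB = 1024 MB)."""
    return peak_mb / (ram_gb * 1024)


# Assumed 8 GB of RAM on every device, for comparison only.
for name, peak_mb in PEAK_MEMORY_MB.items():
    print(f"{name}: {share_of_ram(peak_mb, 8):.1%} of an 8 GB device")
```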

That opens up genuinely new deployment scenarios: offline AI assistants, on-device document processing, local tool use agents, and more.


Benchmark Performance

Benchmark scores are useful, but they need context to be meaningful. Here is what the numbers actually show.

IFEval: Instruction Following

Score: 76.96

IFEval measures how well a model follows structured instructions. This is arguably the most important benchmark for this particular model's intended use cases, which include tool use, function calling, and structured data extraction.

A score of 76.96 is strong for any model in this size range. It means the model is reliable at understanding and executing precise instructions, the kind of thing you need when you are asking an AI to fill out a JSON object, call an API, or extract specific data from a document.
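IFEval pairs prompts with instructions whose satisfaction can be verified by a program, such as length limits or output-format requirements. The sketch below illustrates that idea with three simple checks. It is not IFEval's own code; the function names and the sample response are hypothetical.

```python
import json


def is_valid_json_object(response):
    """Check the instruction 'answer with a single JSON object'."""
    try:
        return isinstance(json.loads(response), dict)
    except json.JSONDecodeError:
        return False


def has_required_keys(response, required):
    """Check the instruction 'include these fields in the JSON object'."""
    if not is_valid_json_object(response):
        return False
    return set(required) <= set(json.loads(response))


def within_word_limit(response, max_words):
    """Check the instruction 'use at most max_words words'."""
    return len(response.split()) <= max_words


# Hypothetical model output for: "Return the invoice number and total as JSON."
sample = '{"invoice_number": "INV-001", "total": 129.5}'

checks = [
    is_valid_json_object(sample),
    has_required_keys(sample, ["invoice_number", "total"]),
    within_word_limit(sample, 20),
]
print(f"{sum(checks)}/{len(checks)} instructions followed")
```

A benchmark score is then, roughly, the share of such checks that pass across many prompts.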

GPQA Diamond

Score: 30.64

GPQA is a graduate-level science and reasoning benchmark. Its questions are four-option multiple choice, so a score of 30.64 is only a few points above the 25 percent expected from random guessing. The Liquid AI documentation is transparent about this: the model is not designed for complex scientific reasoning or graduate-level problem-solving. That is not its job.

MMLU-Pro

Score: 20.01

MMLU-Pro tests broad academic knowledge across many subjects. Again, this is a modest score, and it is expected. A 350 million parameter model is not competing with 70-billion parameter models on broad knowledge recall.

What the Benchmarks Are Really Telling You

The pattern in the scores is deliberate. High instruction-following ability, modest general reasoning. This model was built for a specific job, and the training reflects that. It follows instructions well, extracts structured data reliably, and can be deployed at the edge with minimal resources.

For creative writing, complex math problems, or multi-step code generation, you would reach for a bigger model. The Liquid AI documentation explicitly says as much โ€” those tasks require larger parameter counts. Knowing what a tool is for is part of using it well.


Throughput on Server Hardware

Edge deployment is one side of the story. The other side is raw throughput in server environments.

On a single NVIDIA H100 GPU at high concurrency, LFM2.5-350M achieves 40,400 output tokens per second.

That is a very high number. The efficiency gains from the hybrid LIV architecture, particularly the reduced KV cache overhead, translate directly into higher throughput. When you are not spending as much memory on context management, the processor can spend that capacity on generating output.

For applications that involve processing enormous volumes of data (log analysis, real-time classification, large-scale data extraction pipelines), this throughput matters. You can run more requests per GPU, which reduces infrastructure costs at scale.
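To translate the throughput figure into pipeline capacity, the sketch below estimates documents per GPU-hour. The 200 output tokens per document and the 10 million document batch are hypothetical, and the estimate ignores the time spent reading input tokens, so treat it as an upper bound.

```python
THROUGHPUT_TOKENS_PER_SEC = 40_400  # H100 figure quoted above, at high concurrency


def documents_per_hour(output_tokens_per_doc):
    """Documents processed per GPU-hour if each needs this many output tokens."""
    return THROUGHPUT_TOKENS_PER_SEC * 3600 / output_tokens_per_doc


def gpu_hours_for(num_documents, output_tokens_per_doc):
    """GPU-hours needed to process a batch of documents."""
    return num_documents / documents_per_hour(output_tokens_per_doc)


# Hypothetical workload: each extraction emits about 200 output tokens.
print(f"{documents_per_hour(200):,.0f} documents per GPU-hour")
print(f"{gpu_hours_for(10_000_000, 200):.1f} GPU-hours for 10 million documents")
```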


Reinforcement Learning at Scale

One aspect of LFM2.5-350M that deserves its own discussion is the role of scaled reinforcement learning in training.

Reinforcement learning from human feedback (RLHF) has been a standard tool for improving language model behavior since around 2022. The idea is to train the model not just on text prediction, but on feedback signals: reward the model for good responses, penalize it for poor ones.

What Liquid AI has done with this model is apply reinforcement learning at a scale that was not common for models of this size. Most RL training is expensive and time-consuming, which means it tends to be applied more selectively or at smaller scales.

By combining the deep pre-training on 28 trillion tokens with large-scale RL, Liquid AI is using both halves of the training pipeline aggressively. Pre-training gives the model raw knowledge and pattern recognition. Reinforcement learning refines how it uses that knowledge, making it better at following instructions, staying on task, and producing outputs that are actually useful.
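The reward-and-penalize idea can be shown with a deliberately tiny example. The sketch below is not Liquid AI's training recipe, which this article does not describe in detail. It models a "policy" that chooses between two canned responses and nudges its preferences along the exact gradient of expected reward. The reward values, learning rate, and step count are assumptions.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two canned responses with hypothetical reward signals.
REWARDS = [1.0, -1.0]  # index 0: follows the instruction, index 1: ignores it


def train(steps=50, learning_rate=1.0):
    """Exact policy-gradient ascent on expected reward for a two-response policy."""
    logits = [0.0, 0.0]  # start with no preference
    for _ in range(steps):
        probs = softmax(logits)
        expected_reward = sum(p * r for p, r in zip(probs, REWARDS))
        # Gradient of expected reward w.r.t. each logit: p_i * (r_i - expected reward).
        grads = [p * (r - expected_reward) for p, r in zip(probs, REWARDS)]
        logits = [z + learning_rate * g for z, g in zip(logits, grads)]
    return softmax(logits)


final_probs = train()
print(f"probability of the rewarded response: {final_probs[0]:.3f}")
```

Real RL for language models samples responses and estimates this gradient rather than computing it exactly, but the direction of the update is the same: probability mass moves toward responses that earn higher reward.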

The IFEval score of 76.96 is, in part, a reflection of that RL work paying off.


What Agentic AI Means and Why This Model Targets It

The phrase "agentic AI" comes up often in discussions of LFM2.5-350M. It is worth explaining what that actually means.

An "agent" in AI refers to a system that takes actions autonomously, often calling tools, executing code, making API requests, or performing multi-step tasks based on instructions. Rather than just answering a question, an agent does things.

For agentic workflows to work, a model needs to:

  1. Follow structured instructions precisely
  2. Output well-formatted data (often JSON or structured commands)
  3. Respond quickly
  4. Stay on task across multiple steps

LFM2.5-350M is specifically optimized for all four of those requirements. The high IFEval score speaks to points one and two. The low memory footprint and high throughput speak to point three. The RL training speaks to point four.

Think about an AI assistant running locally on a device, managing notifications, summarizing messages, calling calendar APIs, and triggering reminders. That kind of agent does not need to write poetry or solve differential equations. It needs to be fast, accurate, and reliable at following instructions.

That is exactly the job description LFM2.5-350M was built for.
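In practice, the structured-output requirement usually means the model emits a JSON tool call that the host application parses and executes. A minimal sketch of that host-side step follows. The tool names, the JSON shape, and the sample output are all hypothetical, and no model is actually called.

```python
import json


# Hypothetical local tools; names and signatures are invented for illustration.
def create_reminder(title, time):
    return f"Reminder '{title}' set for {time}"


def summarize_messages(count):
    return f"Summarized the last {count} messages"


TOOLS = {
    "create_reminder": create_reminder,
    "summarize_messages": summarize_messages,
}


def dispatch(model_output):
    """Parse a model's JSON tool call, validate it, and run the matching tool."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return "error: output was not valid JSON"
    tool = call.get("tool") if isinstance(call, dict) else None
    if not isinstance(tool, str) or tool not in TOOLS:
        return "error: unknown or missing tool"
    arguments = call.get("arguments", {})
    if not isinstance(arguments, dict):
        return "error: arguments must be an object"
    try:
        return TOOLS[tool](**arguments)
    except TypeError:
        return "error: wrong arguments for tool"


# Stand-in for text a model might emit; no model is called here.
output = '{"tool": "create_reminder", "arguments": {"title": "Stand-up", "time": "09:00"}}'
print(dispatch(output))  # prints "Reminder 'Stand-up' set for 09:00"
```

Every validation branch here is a place where a model with weak instruction following would fail, which is why a high IFEval score matters for this kind of workload.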


Where You Can Actually Use This Model

On Mobile and Edge Devices

The sub-200 MB memory footprint on modern mobile hardware means local deployment is realistic. Developers building on-device AI features for Android and iOS can integrate this model without requiring constant network connectivity or offloading computation to a cloud server.

This is significant for privacy-sensitive applications (health monitoring, personal finance tools, offline assistants) where sending data to a server is undesirable.

In Embedded Systems

The Raspberry Pi 5 result (300 MB using int8 quantization) suggests the model can run on single-board computers and embedded processors. That opens the door for IoT applications, edge data processing, and on-device inference for industrial and consumer hardware.

In High-Volume Data Pipelines

At 40,400 tokens per second on an H100, this model is cost-effective for high-throughput data extraction tasks. If you need to process thousands of documents, classify millions of data points, or extract structured information from large text corpora, running a compact, fast model makes economic sense.

As Part of Multi-Model Pipelines

A growing pattern in AI deployment is to use multiple models at different sizes for different parts of a task. A lightweight model handles routing, filtering, and simple extraction. A larger model handles complex reasoning when genuinely needed.

LFM2.5-350M is well-suited for the lightweight role in that kind of hybrid pipeline.
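One way to picture such a pipeline is a simple router that sends routine work to the small model and escalates the rest. In the sketch below both model functions are placeholders that return labeled strings, and the set of task types handled locally is an assumption; a real system would call actual models and likely use a more careful routing rule.

```python
# Stand-ins for real models; both functions are hypothetical placeholders.
def small_model(task):
    return f"[small model] handled: {task['text']}"


def large_model(task):
    return f"[large model] handled: {task['text']}"


# Task types the lightweight model is trusted with (an assumption for this sketch).
LIGHTWEIGHT_TASKS = {"routing", "filtering", "extraction"}


def route(task):
    """Send simple task types to the small model and everything else to the large one."""
    if task["kind"] in LIGHTWEIGHT_TASKS:
        return small_model(task)
    return large_model(task)


tasks = [
    {"kind": "extraction", "text": "pull the due date from this invoice"},
    {"kind": "reasoning", "text": "prove this scheduling algorithm terminates"},
]
for task in tasks:
    print(route(task))
```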


What This Tells Us About the Direction of AI Research

The release of LFM2.5-350M fits into a broader pattern that has become more visible over the past year: the shift from scale-first thinking to efficiency-first thinking.

For a long time, the assumption was that more parameters were always better. Training on more data with more compute would yield better models, full stop. The race to build the biggest model dominated research priorities.

But several things are pushing researchers toward more efficient approaches:

Compute costs are significant. Training and running very large models requires expensive hardware and significant energy. There is growing interest in getting more out of smaller parameter counts.

Deployment constraints are real. Not every application can run in a data center. Edge deployment, on-device inference, and embedded AI are growing use cases that large models cannot serve.

The Chinchilla results changed the conversation. Research published in 2022 showed that many large models had been over-parameterized and under-trained: they had too many parameters relative to how much data they were trained on. Compute-optimal training calls for a much higher token-to-parameter ratio than most models used at the time, roughly 20 tokens per parameter.

Liquid AI's approach with LFM2.5-350M takes that lesson and pushes it much further. The 80,000:1 token-to-parameter ratio is not accidental: it reflects a deliberate choice to invest training compute in data rather than model size, accepting a larger training bill in exchange for a model that is cheap to run.
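To see how far this goes beyond the Chinchilla guidance, the sketch below compares the 28 trillion token budget with what a compute-optimal rule would suggest for a model of this size. The figure of about 20 tokens per parameter is the commonly cited rule of thumb from that work and is approximate.

```python
PARAMETERS = 350_000_000
TRAINING_TOKENS = 28_000_000_000_000

# Rule of thumb commonly drawn from the 2022 Chinchilla paper (approximate).
CHINCHILLA_TOKENS_PER_PARAM = 20

compute_optimal_tokens = PARAMETERS * CHINCHILLA_TOKENS_PER_PARAM
overtraining_factor = TRAINING_TOKENS / compute_optimal_tokens

print(f"compute-optimal budget: {compute_optimal_tokens / 1e9:.0f} billion tokens")
print(f"actual budget is {overtraining_factor:,.0f}x larger")
```

Training this far past the compute-optimal point costs more up front, but it is a reasonable trade when the goal is the smallest model that can do the job on constrained hardware.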


A Note on What This Model Cannot Do

Transparency matters when evaluating AI tools, and Liquid AI is explicit about this model's limitations.

LFM2.5-350M is not recommended for:

  • Complex mathematics
  • Sophisticated multi-step coding tasks
  • Creative writing
  • Deep general-purpose reasoning

These tasks benefit from larger parameter counts in ways that more training data alone cannot fully compensate for. The capacity to hold and manipulate complex conceptual structures across long reasoning chains appears to require more parameters than 350 million provides.

If you need a model for those use cases, a larger model is the right choice. LFM2.5-350M is not trying to be everything; it is trying to be excellent at a specific set of tasks, and the benchmarks show it succeeds at that.


Where to Find the Model

Liquid AI has published the technical details on their blog at liquid.ai and the model weights are available publicly. This means developers can download, run, and fine-tune the model for specific applications.

For developers experimenting with edge deployment, the RunAnywhere engine with Q4 quantization and the Cactus Engine with int8 quantization are worth exploring. Those are the configurations that produced the low memory footprint results mentioned earlier.


Final Thoughts

LFM2.5-350M is a good example of what focused, efficient AI engineering looks like. It is not trying to answer every question or solve every problem. It is trying to do one thing well: run fast, follow instructions accurately, and do it in environments where large models simply cannot operate.

The combination of a hybrid LIV architecture, deep pre-training on 28 trillion tokens, and scaled reinforcement learning produces a model that is genuinely capable for its intended workload, at a fraction of the resource cost of larger alternatives.

For anyone building AI-powered tools that need to run on constrained hardware, or anyone building high-throughput data pipelines that need cost-effective inference, this is a model worth paying attention to.

The broader lesson here is also worth sitting with. The assumption that bigger is always better is increasingly being questioned by results like these. The most useful AI system is not always the largest one. Sometimes it is the one that fits your constraints, runs reliably, and does its specific job well.

LFM2.5-350M makes a strong case for that way of thinking.


The Bigger Picture: AI That Works Everywhere

There is something worth stepping back to appreciate here. For most of the past several years, the practical reality of using powerful AI has been: you need an internet connection, you need to trust a third-party server with your data, and you need to accept the latency of sending a request across a network.

LFM2.5-350M represents a meaningful step toward changing that reality for specific use cases.

When a model can run in 81 MB of memory on a mobile GPU, it can run on a device you carry in your pocket. When it supports a 32k token context window, it can read and summarize an entire document locally. When it scores 76.96 on IFEval, it can reliably extract structured information, call local APIs, and complete multi-step tasks without ever leaving your device.

That kind of capability, deployed at scale across consumer hardware, has real implications for how AI-powered software gets built. Not every feature needs a cloud connection. Not every user query needs to leave the device. Some workflows that currently depend on expensive server infrastructure could run locally, saving both cost and latency.

That is not a solved problem yet, but LFM2.5-350M is one concrete demonstration that it is becoming more achievable.

The Open Weights Angle

The fact that Liquid AI has made the model weights publicly available is also worth noting. Open access to model weights allows developers to fine-tune the model for specific domains, integrate it into their own tools, and adapt it for use cases that Liquid AI did not specifically anticipate.

A small, open, efficient model is a useful building block. Fine-tuned on domain-specific data (legal documents, medical records, code repositories, customer service transcripts), it could become highly specialized and effective for narrow tasks. That fine-tuning process is far cheaper and faster with a 350 million parameter model than with a multi-billion parameter alternative.

The combination of strong baseline capability, efficient architecture, and open weights makes this a practical tool for developers who want to build AI-powered features without building from scratch or paying for API access every time a user makes a request.

This is the direction AI tools need to go if they are going to become genuinely ubiquitous: not just available to large companies with data center budgets, but accessible to smaller teams, individual developers, and anyone building software for the real world with real constraints.

Technical details and model weights are available via Liquid AI's official blog and model repository.
