Cohere Just Dropped a Free, Open Source Voice Model That Could Change How We Handle Audio

You know that moment when you sit through a long meeting, a lecture, or a voice note, and you wish something could just write it all down for you? That problem has been around forever, and a lot of companies have tried to fix it.
Cohere just stepped into that space with something a little different from the rest. They did not just build another transcription tool. They built one, made it open source, kept it light enough to run on regular hardware, and released it through their API at no cost.
That is a lot of boxes checked at once.
What Cohere Actually Built Here
Cohere is an enterprise AI company. They mostly focus on tools for businesses: text analysis, natural language understanding, and AI agents that help companies automate work. Stepping into voice and audio is a bit of a new direction for them.
Their new model is called Transcribe, and it is their first-ever voice model.
At its core, Transcribe is an automatic speech recognition (ASR) model. ASR is the technology that takes spoken words and converts them into text. Think about the captions that appear when you watch a video, or how your phone turns voice memos into readable text. That is ASR working behind the scenes.
Cohere built Transcribe to handle that kind of job, but with a few specific goals in mind: make it fast, keep it accurate, and make it accessible to people who want to run it themselves.
The Numbers Behind the Model
A Lighter Model With Serious Performance
Transcribe runs on 2 billion parameters. If you are not familiar with AI terminology, parameters are the numerical weights a model learns during training and then uses to process and understand data. More parameters generally mean a more capable model, but also a heavier one that needs more memory and compute.
Two billion might sound like a lot, but in the world of AI, it is actually on the smaller side. That is intentional. Cohere designed Transcribe to work on consumer-grade GPUs, meaning the kind of graphics card you might find in a regular gaming PC or workstation, not a massive data center server.
That matters because many open source models with strong performance need expensive cloud infrastructure to run well. Transcribe is designed not to. If you have a computer with a decent GPU, you can run this yourself.
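A rough back-of-the-envelope estimate shows why a 2 billion parameter model can fit on a consumer GPU. The sketch below assumes a few common storage precisions, which the announcement does not specify, and counts the weights only. Activations and runtime overhead add more on top, so treat the numbers as a lower bound.

```python
# Rough weight-memory estimate for a 2B-parameter model.
# Assumption: the bytes needed per parameter depend on the storage
# precision, which is not stated in the source.
PARAMS = 2_000_000_000

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(params: int, dtype: str) -> float:
    """Return the memory needed for the weights alone, in GB (10**9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(PARAMS, dtype):.1f} GB")
```

At 16-bit precision the weights come to about 4 GB, which is within reach of many gaming-class graphics cards.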
How Fast Does It Actually Process Audio?
Speed is one of the places where Transcribe really makes an impression. The model can process 525 minutes of audio in a single minute of processing time.
Read that again. 525 minutes of audio. In one minute.
For its size and class, that kind of throughput is genuinely high. If you have ever worked with transcription tools that take forever to process a recording, you will appreciate just how fast that actually is.
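To put that figure in everyday terms, here is a small calculation. It assumes the quoted throughput holds on your hardware and scales linearly with recording length, neither of which the source guarantees.

```python
# Throughput figure quoted by Cohere: 525 minutes of audio per minute of compute.
AUDIO_MINUTES_PER_COMPUTE_MINUTE = 525


def processing_seconds(audio_minutes: float) -> float:
    """Estimate compute time in seconds for a recording of the given length."""
    return audio_minutes / AUDIO_MINUTES_PER_COMPUTE_MINUTE * 60


# Real-time factor: compute time divided by audio duration (lower is faster).
real_time_factor = 1 / AUDIO_MINUTES_PER_COMPUTE_MINUTE

print(f"60-minute meeting: ~{processing_seconds(60):.1f} s")
print(f"Real-time factor: ~{real_time_factor:.4f}")
```

Under those assumptions, an hour-long meeting would take roughly seven seconds to transcribe.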
How It Stacks Up Against Other Models
Breaking Down the Benchmark Results
There is a benchmark called the Hugging Face Open ASR Leaderboard that researchers and developers use to compare how different speech recognition models perform across languages and conditions. It measures Word Error Rate (WER): the number of word-level mistakes in a transcript (wrong, missing, or extra words) divided by the number of words that were actually spoken.
Lower WER is better. It means the model is making fewer mistakes.
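For the curious, WER can be computed with a word-level edit distance. The function below is a minimal sketch, not the leaderboard's scoring code: it only lowercases and splits on whitespace, whereas real evaluations typically normalize punctuation and number formats first. The example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match if cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


spoken = "the meeting starts at nine tomorrow"
transcribed = "the meeting start at nine"
wer = word_error_rate(spoken, transcribed)
print(f"WER: {wer:.1%}")
```

Here one word is wrong and one is missing out of six spoken, so the WER is two errors divided by six words.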
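Cohere says Transcribe achieved an average WER of 5.42 on this leaderboard. The leaderboard reports WER as a percentage, so that works out to a little over five word errors per hundred words spoken. That puts it ahead of several other models already considered strong performers in this space, including: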
- Zoom Scribe v1
- IBM Granite 4.0 1B
- ElevenLabs Scribe v2
- Qwen3-ASR-1.7B Speech
That is a genuinely strong result for a model at this size.
What Real Human Evaluators Found
Raw benchmarks tell one part of the story, but Cohere also had real human evaluators test Transcribe by listening to audio and comparing its transcriptions against alternatives.
In those evaluations, Transcribe had an average win rate of 61% across the models it competed against. Human evaluators judged it on accuracy, coherence, and how usable the transcriptions actually felt in practice.
Those are meaningful criteria. A transcription that is technically close but reads like a jumbled mess is not that useful. Coherence and usability matter just as much as raw accuracy.
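An average win rate is simple arithmetic once you have head-to-head judgments. The sketch below assumes the average is taken per competitor, which the source does not spell out. The model names and counts are invented for illustration and are not Cohere's data.

```python
# Hypothetical pairwise results: (wins for the model under test, total comparisons).
# These counts are invented for illustration only.
judgments = {
    "Model A": (70, 100),
    "Model B": (55, 100),
    "Model C": (40, 100),
}


def win_rate(wins: int, total: int) -> float:
    """Fraction of head-to-head comparisons the model under test won."""
    return wins / total


def average_win_rate(results: dict[str, tuple[int, int]]) -> float:
    """Mean of the per-competitor win rates (each competitor weighted equally)."""
    rates = [win_rate(wins, total) for wins, total in results.values()]
    return sum(rates) / len(rates)


for name, (wins, total) in judgments.items():
    print(f"vs {name}: {win_rate(wins, total):.0%}")
print(f"Average win rate: {average_win_rate(judgments):.0%}")
```

Note that an average above 50% can still hide a losing record against an individual competitor, as the made-up "Model C" row shows.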
Where It Still Struggles
It would not be an honest write-up without covering the weaknesses.
Transcribe ran into trouble with three languages specifically: Portuguese, German, and Spanish. In those cases, it fell behind competing models. Cohere has not detailed exactly what caused the performance dip in those languages, but it is worth noting if you or the people you work with primarily speak any of those three.
The model currently supports 14 languages total: English, French, Italian, Greek, Dutch, Polish, Chinese, Japanese, Korean, Vietnamese, and Arabic, alongside the three where it underperformed. For many use cases and languages, it holds up well. But if your work is heavily Portuguese, German, or Spanish, run some tests before fully committing.
Why Open Source Matters Here
What Open Source Actually Means for You
When a company releases a model as open source, it means the underlying weights and architecture are publicly available. Anyone can download the model, inspect how it works, modify it, and deploy it for their own use.
This is different from most commercial transcription tools, which are closed systems. You send your audio to their servers, they process it, and you get text back. You have no visibility into what happens in between.
With an open source model like Transcribe, you can run the entire thing on your own machine. Your audio never has to leave your computer if you do not want it to.
For anyone working with sensitive recordings, that is not a small deal. Think confidential meetings, medical consultations, legal conversations, personal voice notes. All of that can be transcribed locally, privately, without sending anything to a third-party server.
Who This Is Really Built For
Cohere is primarily targeting enterprise users, which makes sense given their business focus. But the open source release means a much broader group of people can use and build on it.
Developers can integrate it into their own apps. Researchers can fine-tune it on their own data. Small teams can self-host it without needing to subscribe to expensive services from other providers.
And honestly, even individuals who are comfortable running models locally now have a fast, capable transcription option that does not require signing up for anything.
The Growing Market for Transcription and Voice AI
Why This Space Is Blowing Up Right Now
Note-taking and transcription apps have had a real moment over the past couple of years. Tools like Granola and Wispr Flow have built loyal user bases by solving a simple problem: people take a lot of meetings, and no one wants to write everything down by hand.
The demand keeps growing because the way people work keeps changing. Remote meetings, asynchronous audio messages, recorded lectures, voice memos. Audio is everywhere, and text is easier to search, edit, and share.
Speech recognition quality has also crossed a threshold where it is good enough for most everyday use cases. A few years ago, transcription tools made enough mistakes to be more of a hassle than a help. Now the accuracy is high enough that people actually trust the output.
That shift in trust has opened up a much bigger market. Companies like Cohere see an opportunity to get in front of that wave with strong, accessible models.
What Cohere Plans to Do With Transcribe Next
Cohere is not just releasing Transcribe as a standalone project. They plan to integrate it into North, their enterprise agent orchestration platform.
If you are not familiar with agent orchestration, here is the quick version: it is a system that lets AI tools work together in coordinated workflows. Instead of using one AI tool at a time, an orchestration platform strings multiple tools together so they can hand off tasks to each other automatically.
Think of it like this: you have a meeting, Transcribe turns it into text, another AI model summarizes it, and then a third one creates action items and drops them into your task management tool. All without you doing any of that manually.
Plugging Transcribe into North sets up that kind of pipeline. Voice input goes in. Structured, useful output comes out.
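The hand-off idea can be sketched in a few lines of code. This is not North's API or Transcribe's interface. Every name below is hypothetical, and each step is a placeholder standing in for a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MeetingRecord:
    """Data handed from one pipeline step to the next."""
    audio_path: str
    transcript: str = ""
    summary: str = ""
    action_items: list[str] = field(default_factory=list)


Step = Callable[[MeetingRecord], MeetingRecord]


def transcribe(record: MeetingRecord) -> MeetingRecord:
    # Placeholder: a real step would run a speech recognition model on audio_path.
    record.transcript = "Alice will send the report. Bob will book the room."
    return record


def summarize(record: MeetingRecord) -> MeetingRecord:
    # Placeholder: keep the first sentence instead of calling a summarization model.
    record.summary = record.transcript.split(". ")[0].rstrip(".") + "."
    return record


def extract_action_items(record: MeetingRecord) -> MeetingRecord:
    # Placeholder: treat any sentence containing "will" as an action item.
    sentences = [s.strip() for s in record.transcript.split(".") if s.strip()]
    record.action_items = [s for s in sentences if " will " in f" {s} "]
    return record


def run_pipeline(record: MeetingRecord, steps: list[Step]) -> MeetingRecord:
    """Run each step in order, passing the record along."""
    for step in steps:
        record = step(record)
    return record


result = run_pipeline(
    MeetingRecord(audio_path="meeting.wav"),
    [transcribe, summarize, extract_action_items],
)
print(result.summary)
print(result.action_items)
```

The point of the design is that each step only needs to know about the shared record, so a step can be swapped for a different model without touching the others.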
The model is also being made available on Model Vault, Cohere's managed inference platform, for users who want to access it through Cohere's infrastructure rather than self-hosting.
What This Means for the State of ASR Right Now
The Competition Is Real
Cohere is entering a space that already has some serious players. OpenAI's Whisper has been a major reference point in open source speech recognition for a while. Other companies have released competing models, and the leaderboard standings shift regularly as new versions come out.
What makes Transcribe interesting is not just the performance numbers. It is the combination of factors working together:
- Open source and freely available
- Small enough to run on regular hardware
- Fast enough for real-time or near-real-time use cases
- Built by a company with enterprise integration already on the roadmap
Each of those things on its own is not unique. Together, they make for a fairly compelling package.
The Word Error Rate Race
The push to lower WER scores is ongoing across the industry. Every few months, a new model comes out claiming to beat the previous benchmark leader. Transcribe's score of 5.42 is strong today, but the space moves quickly.
What will matter over time is not just the benchmark score but how the model performs on real-world audio. Studio-quality recordings are easy. Noisy environments, strong accents, overlapping speakers, and technical vocabulary: those are the conditions where models get tested in ways that benchmarks do not always capture.
Cohere will likely need to keep improving Transcribe to stay competitive, especially in the languages where it currently falls short.
How to Access Transcribe
Cohere is making Transcribe available in a few different ways:
- Through their API: currently offered at no cost, which means you can start building with it right away.
- Via Model Vault: for users who want Cohere to handle the infrastructure side of things.
- Open source download: for developers and researchers who want to run it locally or fine-tune it for specific use cases.
The open source route is the most flexible one. If you have the technical setup to run a 2B parameter model on your own GPU, you get full control over how it runs and what you do with it.
A Look at the Leaderboard
If you want to see exactly how Transcribe and other speech recognition models compare, the Hugging Face Open ASR Leaderboard is the place to look. It tracks WER scores across languages and conditions for dozens of models, and it gets updated regularly as new models get submitted.
Here is a simplified snapshot of where Transcribe lands relative to its competition:
| Model | Avg. WER | Notes |
|---|---|---|
| Cohere Transcribe | 5.42 | Best overall on leaderboard at release |
| ElevenLabs Scribe v2 | Higher | Strong alternative |
| IBM Granite 4.0 1B | Higher | Smaller-scale rival |
| Zoom Scribe v1 | Higher | Enterprise-focused, outperformed overall |
| Qwen3-ASR-1.7B | Higher | Comparable size, lower accuracy |
It is a genuinely useful resource if you are trying to pick the right tool for a specific language or use case.
See how the models stack up yourself: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
Wrapping Up
Cohere entering the voice AI space with Transcribe is worth paying attention to, not because it is a perfect product, but because of what it represents.
A capable, fast, open source transcription model that you can run on your own hardware is genuinely useful to a lot of people. Developers building apps. Researchers working with audio data. Teams that handle sensitive conversations. Individual users who just want accurate transcriptions without sending their audio to someone else's server.
The gaps in Portuguese, German, and Spanish performance are real. But for the languages where it performs well, Transcribe holds its own against models from bigger names.
Whether this is the transcription tool that sticks around for years or just a strong opening move from Cohere in a new category, it shows that the field of speech recognition is still moving fast.
And for anyone who has ever wished their meetings would just write themselves down, well. It is getting a whole lot closer.