This open-source AI video model creates 4-minute videos directly on your PC

Posted on May 8, 2026 by Mark Harrell

While everyone is busy arguing about which paid AI video platform produces the best results, a Chinese tech company called Meituan quietly released something that changes the conversation entirely. No subscription required. No watermark slapped onto your work. No cloud server processing your prompts somewhere you can't see. Just a model you download, run on your own machine, and use to generate minutes-long videos from scratch.

It's called LongCat-Video, and it's already turning heads in the AI community for very good reason.

What Even Is LongCat-Video?

LongCat-Video is a video generation model built by Meituan, a large Chinese company best known outside tech circles for food delivery but well known inside AI circles for some genuinely serious research. The model has 13.6 billion parameters, the learned weights that encode what it picked up from enormous amounts of video data about how to produce realistic-looking footage from text or image prompts.

What makes it different from most things in this space is that it was designed from the ground up to handle long video. Most AI video tools today generate clips that last five to ten seconds before quality starts falling apart. LongCat-Video was pretrained specifically on a task called Video Continuation, which means its architecture was built to extend video over time without color drifting, flickering, or degrading into visual noise. The result is something that can produce videos up to about four minutes long at 720p resolution and 30 frames per second while maintaining consistent quality throughout.

That's not a small thing. Long-form video generation has been one of the hardest problems in this field, and most commercial platforms haven't fully solved it either.

Why the Free and Open-Source Part Actually Matters

Here's the part that's easy to gloss over but is actually the most important detail: this model is released under the MIT license.

If you've ever dealt with software licensing, you know the MIT license is about as permissive as it gets. It means you can use the model, modify it, build products with it, and distribute your own versions of it. Commercially. Without paying anyone. The only condition is that you keep the license notice attached.

Compare that to the situation with most AI video tools in 2025 and 2026. Sora, OpenAI's video model, is locked behind a Plus or Pro subscription. Google's Veo 3 is available through Gemini Advanced, which also costs money. PixVerse, RunwayML, and most other players in the space operate on usage-based pricing where you pay per generation or buy credit bundles. If you're a student, an independent creator, or someone just experimenting with the technology, those costs add up fast. And even when you do pay, most platforms own the infrastructure, see what you generate, and in some cases retain usage rights to your outputs under their terms of service.

Running LongCat-Video locally means none of that applies. Your prompts stay on your machine. Your outputs are yours. You don't need a subscription to run it tomorrow or next year.

The Screenshot That Started the Conversation

The post that caught everyone's attention: Meituan drops a 13.6B parameter model, free and open-source, while the rest of the world pays subscriptions.

When news of LongCat-Video started spreading on social media, the reaction was a mix of surprise and genuine excitement. The framing that kept appearing in posts and comments was essentially: people are paying monthly fees for tools that do this commercially, and here's a fully capable alternative you can just run at home.

That's not hype. The model's performance is legitimately competitive with some of the best commercial options available. On the internal benchmark that the Meituan team ran, LongCat-Video scored 3.76 on text alignment and 3.25 on visual quality on a 5-point scale. For context, Google's Veo 3 scored 3.99 and 3.23 on those same metrics. The commercial leader is ahead on text alignment by 0.23 points, while visual quality is effectively a tie, so the gap is smaller than you might expect given the difference in access and pricing.

What LongCat-Video Can Actually Do

The model isn't just one thing. It supports several different generation modes, which makes it flexible depending on what you're trying to create.

Text-to-Video

The most straightforward use case: you write a description and the model generates a video based on it. Type something like “a drone shot flying over a foggy mountain range at sunrise” and the model will generate footage that matches that description. The quality at 720p is solid, and because of the long-form training, you can generate footage that holds together for much longer than the typical 5-10 second clip other tools produce.

Image-to-Video

This is where things get interesting for a lot of creators. You provide a still image and the model animates it, generating a realistic video that starts from that frame and evolves naturally over time. The motion quality has to be coherent to pull this off, and LongCat-Video performs well here, particularly with the coarse-to-fine generation strategy it uses to maintain spatial consistency across frames.

Video Continuation

This is the mode that the model was specifically pretrained on and where it genuinely stands out. You give it an existing video clip and it extends it, generating new frames that follow naturally from where the original clip ends. This is how you chain together longer sequences, and it's the feature most responsible for the four-minute generation capability.
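To get a feel for how chaining adds up to minutes of footage, here is a rough back-of-the-envelope sketch. The 4-minute length and 30 fps come from the figures above; the segment length and the number of overlapping conditioning frames are hypothetical values picked for illustration, not settings taken from the model.

```python
import math

def continuation_plan(target_seconds, fps, segment_frames, overlap_frames):
    """Return (total_frames, passes) needed to reach the target length.

    The first pass yields `segment_frames` frames. Every later pass reuses the
    last `overlap_frames` frames as conditioning, so it only contributes
    `segment_frames - overlap_frames` new frames.
    """
    if not 0 <= overlap_frames < segment_frames:
        raise ValueError("overlap_frames must be in [0, segment_frames)")
    total_frames = math.ceil(target_seconds * fps)
    if total_frames <= segment_frames:
        return total_frames, 1
    new_per_pass = segment_frames - overlap_frames
    extra_passes = math.ceil((total_frames - segment_frames) / new_per_pass)
    return total_frames, 1 + extra_passes

# 240 s at 30 fps; 93-frame segments with a 13-frame overlap are made-up values.
total, passes = continuation_plan(240, 30, segment_frames=93, overlap_frames=13)
print(total, passes)  # 7200 90
```

The point is simply that a four-minute clip is thousands of frames and dozens of continuation passes, which is why a model that drifts a little on each pass falls apart quickly.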

Interactive Video Generation

The model also supports a more interactive mode where you can guide the generation with additional inputs during the process, giving you more control over where the video goes rather than just letting the model do its thing from a single prompt.

LongCat-Video-Avatar

In December 2025, the team released a separate but related model called LongCat-Video-Avatar. This one is specifically designed for audio-driven character animation. You provide an audio track plus a character image or description, and it generates a video of that character speaking and moving in sync with the audio. It supports both single-character and multi-character scenarios, and handles two separate audio streams if you're generating a conversation between two people. Lip sync strength is controlled by an audio classifier-free guidance (CFG) scale, and the team's documentation suggests keeping that value between 3 and 5 for the best results.
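Assuming "CFG" here means classifier-free guidance, the general idea can be sketched in a few lines: the model makes one prediction with the audio and one without, and the scale decides how hard to push toward the audio-conditioned one. How the Avatar model applies this internally is not described above, so the function names and the toy vectors below are purely illustrative.

```python
def apply_guidance(uncond, cond, scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

def in_suggested_range(scale, low=3.0, high=5.0):
    """True if the scale sits inside the 3 to 5 range quoted above."""
    return low <= scale <= high

# Toy two-value "predictions"; real model outputs are large tensors.
without_audio = [0.0, 1.0]
with_audio = [1.0, 1.0]
guided = apply_guidance(without_audio, with_audio, scale=4.0)
print(guided, in_suggested_range(4.0))  # [4.0, 1.0] True
```

A scale of 1 just returns the audio-conditioned prediction; larger values exaggerate whatever the audio changes, which is why very high settings tend to look overdriven.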

How It Actually Works Under the Hood

You don't need to understand the technical details to use the model, but they're genuinely interesting if you want to know why it works as well as it does.

Coarse-to-Fine Generation

Generating every frame at full resolution and full frame rate in a single pass is expensive, and the cost grows quickly with clip length. LongCat-Video uses a coarse-to-fine strategy along both the time axis and the spatial axis, meaning it first produces a rough version of the video at lower resolution and lower frame rate and then refines it in passes. This keeps generation time manageable for 720p, 30 fps output, while the long-clip stability itself comes mainly from the Video Continuation pretraining described earlier.
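A quick bit of arithmetic shows why a coarse first pass is cheap. The 720p, 30 fps target is from the figures above; the coarse setting of 480p at 15 fps is an assumption used for illustration, so treat the result as an order-of-magnitude estimate rather than a measured number.

```python
def pixels_per_second(width, height, fps):
    """Raw pixel throughput a pass has to produce for one second of video."""
    return width * height * fps

# Assumed coarse pass: 854x480 at 15 fps. Target: 1280x720 at 30 fps.
coarse = pixels_per_second(854, 480, 15)
fine = pixels_per_second(1280, 720, 30)
ratio = coarse / fine
print(f"coarse pass covers {ratio:.1%} of the final pixel volume")
```

Under those assumptions the rough pass handles well under a quarter of the final pixel volume, so the expensive decisions about layout and motion are made on a much smaller canvas.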

Block Sparse Attention

Attention mechanisms in neural networks are what allow models to relate different parts of the input to each other. In video generation, this means understanding how frame 1 relates to frame 50 or how the left side of the screen relates to the right. Full attention over long videos is computationally expensive, so LongCat-Video uses Block Sparse Attention, which focuses attention on the most relevant regions and reduces the processing cost significantly, particularly at high resolutions. This is a big part of why it can run efficiently even on a single GPU.
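Here is a minimal pure-Python sketch of the general block-sparse idea, not LongCat-Video's actual kernel. It assumes one common recipe: summarize each block of queries and keys by its average, score block pairs with those summaries, and let every query block attend only to its top-scoring key blocks. The function names and the selection rule are illustrative choices.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool(rows):
    """Average equal-length vectors into a single summary vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def block_sparse_attention(q, k, v, block_size, keep_blocks):
    """Softmax attention where each query block sees only `keep_blocks` key blocks."""
    n, d = len(q), len(q[0])
    if n % block_size or len(k) != n or len(v) != n:
        raise ValueError("q, k, v need equal length, a multiple of block_size")
    n_blocks = n // block_size
    if not 1 <= keep_blocks <= n_blocks:
        raise ValueError("keep_blocks must be between 1 and the block count")
    starts = [b * block_size for b in range(n_blocks)]
    q_summary = [mean_pool(q[s:s + block_size]) for s in starts]
    k_summary = [mean_pool(k[s:s + block_size]) for s in starts]

    out = []
    for qb, s in enumerate(starts):
        # Rank key blocks by how well their summary matches this query block.
        block_scores = [dot(q_summary[qb], ks) for ks in k_summary]
        top = sorted(range(n_blocks), key=lambda j: block_scores[j], reverse=True)
        cols = [j * block_size + t
                for j in sorted(top[:keep_blocks]) for t in range(block_size)]
        for i in range(s, s + block_size):
            # Ordinary scaled dot-product attention, restricted to chosen columns.
            logits = [dot(q[i], k[c]) / math.sqrt(d) for c in cols]
            peak = max(logits)
            weights = [math.exp(x - peak) for x in logits]
            total = sum(weights)
            out.append([sum(w * v[c][t] for w, c in zip(weights, cols)) / total
                        for t in range(len(v[0]))])
    return out

q = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
k = q
v = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
sparse = block_sparse_attention(q, k, v, block_size=2, keep_blocks=1)
```

With `keep_blocks` equal to the number of blocks this reduces to ordinary full attention; lowering it cuts the number of query-key pairs roughly in proportion, which is where the savings come from.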

Multi-Reward RLHF

RLHF stands for Reinforcement Learning from Human Feedback, and it's the technique that most modern AI models use to improve quality based on human preferences. LongCat-Video uses a variant called Group Relative Policy Optimization with multiple reward signals, meaning it was trained not just on a single quality metric but on several at once, including text alignment, visual quality, and motion quality. The result is a model that doesn't sacrifice one aspect of video quality to improve another.
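The core trick in Group Relative Policy Optimization is easy to show: generate several samples for the same prompt, then score each one relative to the rest of its group instead of against an absolute baseline. The sketch below extends that to several rewards by normalizing each reward within the group and taking a weighted sum. The weighting scheme, the weights, and the example scores are all assumptions for illustration; how the Meituan team actually combines its rewards is not spelled out above.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def multi_reward_advantages(reward_table, weights):
    """Weighted sum of per-reward group-relative advantages.

    reward_table: {reward_name: [score for each sample in the group]}
    weights:      {reward_name: weight}  (hypothetical values)
    """
    per_reward = {name: group_relative_advantages(scores)
                  for name, scores in reward_table.items()}
    group_size = len(next(iter(reward_table.values())))
    return [sum(weights[name] * per_reward[name][i] for name in reward_table)
            for i in range(group_size)]

# Made-up scores for three videos generated from one prompt.
rewards = {
    "text_alignment": [3.0, 4.0, 5.0],
    "visual_quality": [2.0, 3.0, 4.0],
    "motion_quality": [4.0, 4.0, 4.0],
}
weights = {"text_alignment": 1.0, "visual_quality": 1.0, "motion_quality": 1.0}
advantages = multi_reward_advantages(rewards, weights)
```

Normalizing each reward separately keeps one metric with a wide score range from drowning out the others, which is one plausible way to avoid trading off one quality dimension against another.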

What You Need to Run It

This is where the conversation gets a little more realistic. Running a 13.6 billion parameter model locally is not something you can do on a laptop or a machine without a dedicated GPU.

The setup requires:

A machine with a CUDA-compatible NVIDIA GPU (the more VRAM the better)
CUDA version 12.4 or compatible
Python 3.10 via a conda environment
PyTorch 2.6.0
FlashAttention 2, which speeds up the attention computation
The model weights themselves, downloaded from HuggingFace using the huggingface-cli tool

The installation process involves setting up a conda environment, installing PyTorch with the right CUDA backend, installing FlashAttention, and then downloading the weights. It's not a one-click install, and it requires comfort with command-line tools and package managers. But if you've ever set up a machine learning environment before, the process is well documented in the GitHub repo and manageable within an afternoon.
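Before starting the install, it can help to see what your machine already has. This is a small, hypothetical helper (not part of the LongCat-Video repo) that reports the Python version, whether conda is on your PATH, and, if PyTorch is already installed, its version and the VRAM on each visible GPU.

```python
import shutil
import sys

def check_environment():
    """Collect facts about the local setup without failing if parts are missing."""
    report = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "conda_on_path": shutil.which("conda") is not None,
        "torch": None,
        "cuda_available": False,
        "gpus": [],  # list of (name, VRAM in GiB)
    }
    try:
        import torch
    except ImportError:
        return report  # PyTorch not installed yet; nothing more to check
    report["torch"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            report["gpus"].append((props.name, round(props.total_memory / 1024**3, 1)))
    return report

for key, value in check_environment().items():
    print(f"{key}: {value}")
```

If the report shows Python 3.10, a PyTorch 2.6.0 build, and `cuda_available: True`, the base of the environment described above is in place.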

For users who want multi-GPU inference, the model supports parallelizing the generation across two or more GPUs using the context parallel size parameter, which cuts down generation time significantly.

How It Compares to What's Out There

Let's put the benchmarks in plain language.

When the Meituan team compared LongCat-Video against other models on their internal evaluation, the results looked like this:

Model          Type         Parameters              Text Alignment  Visual Quality
Veo 3          Proprietary  Unknown                 3.99            3.23
PixVerse V5    Proprietary  Unknown                 3.81            3.13
Wan 2.2 (T2V)  Open Source  14B active / 28B total  3.70            3.26
LongCat-Video  Open Source  13.6B                   3.76            3.25

The takeaway here is that LongCat-Video sits right in the middle of the competitive set. On text alignment it trails Veo 3 and PixVerse V5 and edges out Wan 2.2; on visual quality it edges out both commercial models and sits 0.01 behind Wan 2.2. The gap between the best commercial model and the best open-source model is real but not enormous. And the open-source models in this comparison, LongCat-Video and Wan 2.2, are both runnable locally while the commercial ones are not.
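If you want to check the ordering yourself, the table is small enough to rank in a few lines. The scores below are copied from the table above; nothing else is assumed.

```python
# (text alignment, visual quality) from the table above
scores = {
    "Veo 3": (3.99, 3.23),
    "PixVerse V5": (3.81, 3.13),
    "Wan 2.2 (T2V)": (3.70, 3.26),
    "LongCat-Video": (3.76, 3.25),
}

def rank(metric_index):
    """Model names sorted from best to worst on one metric."""
    return sorted(scores, key=lambda m: scores[m][metric_index], reverse=True)

def gap_to_leader(model, metric_index):
    """How far a model sits behind the top score on one metric."""
    best = max(s[metric_index] for s in scores.values())
    return round(best - scores[model][metric_index], 2)

text_rank = rank(0)    # ['Veo 3', 'PixVerse V5', 'LongCat-Video', 'Wan 2.2 (T2V)']
visual_rank = rank(1)  # ['Wan 2.2 (T2V)', 'LongCat-Video', 'Veo 3', 'PixVerse V5']
print(gap_to_leader("LongCat-Video", 0), gap_to_leader("LongCat-Video", 1))  # 0.23 0.01
```

LongCat-Video lands third of four on text alignment and second of four on visual quality, with the visual quality spread across the top three models being just 0.03 points.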

Worth noting: Wan 2.2 has 28 billion total parameters but only activates 14 billion at a time due to its Mixture of Experts architecture. LongCat-Video uses a dense architecture where all 13.6 billion parameters are active during inference, which has implications for memory requirements (the full set of weights is in use at every step) but also for consistency of output.
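To put those parameter counts in hardware terms, here is a rough estimate of the memory the weights alone would occupy. It assumes 16-bit weights (2 bytes per parameter), which is a common choice but an assumption here, and it ignores activations, the text encoder, and the video decoder, so real usage will be higher.

```python
def weight_memory_gib(params_billion, bytes_per_param=2):
    """Approximate memory for model weights only, in GiB."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

longcat = weight_memory_gib(13.6)    # dense: all weights in use every step
wan_total = weight_memory_gib(28)    # MoE: everything that must be stored
wan_active = weight_memory_gib(14)   # MoE: what is active at any one time

print(f"LongCat-Video: {longcat:.1f} GiB")
print(f"Wan 2.2 total: {wan_total:.1f} GiB, active: {wan_active:.1f} GiB")
```

Under that assumption the two models need a similar amount of memory for the weights that are active at once (roughly 25 to 26 GiB), while the MoE model has about twice as much to store or swap in overall.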

The Bigger Picture

What LongCat-Video represents is not just a cool tool for video creators. It's a signal about where AI development is heading in an era where open-source models are increasingly catching up with closed commercial systems.

For most of 2024 and into 2025, the narrative in AI video was that the best results required proprietary infrastructure and serious investment. The gap between what you could do locally versus what OpenAI or Google could do in the cloud felt insurmountable. LongCat-Video narrows that gap considerably. Not all the way, but meaningfully.

For independent creators, this matters because it changes the economics of the work. A short film producer who wants to experiment with AI-generated footage doesn't need to budget for API costs. A developer building a tool that incorporates video generation can do it without licensing a commercial API. A student studying AI-generated media can run experiments locally without worrying about billing.

The MIT license also means the community can build on top of it. Fine-tuned versions, integration into other tools, optimization for different hardware setups. That's already started happening with similar open-source models, and LongCat-Video is positioned to benefit from the same community energy.

A Few Honest Caveats

No tool is perfect, and LongCat-Video has some real limitations worth knowing about.

The hardware requirements are non-trivial. If you don't have a modern GPU with substantial VRAM, you won't be running this on your own machine. You could theoretically rent GPU time from a cloud provider to run it, which still avoids the subscription model of commercial platforms, but it's not purely free in that case.

The setup is genuinely technical. People comfortable with Python, conda, and terminal commands can handle it, but this isn't a consumer app with a drag-and-drop interface. The Streamlit interface that comes with the repo makes things easier once the model is installed, but getting there requires patience.

And while the benchmark scores are competitive, there are areas where commercial platforms still pull ahead, particularly on text alignment for complex prompts and on overall polish. The model is excellent, but the best outputs from Veo 3 or Runway remain a step ahead on the highest difficulty prompts.

Getting Started

If you want to try it yourself, the full setup guide lives at the official GitHub repository. The process looks roughly like this:

Clone the repo from GitHub
Create a conda environment with Python 3.10
Install PyTorch 2.6.0 with the right CUDA backend for your GPU
Install FlashAttention 2 and the remaining dependencies from requirements.txt
Download the model weights from HuggingFace using huggingface-cli
Run the demo script for whatever generation mode you want to try

Once you're up and running, you can generate text-to-video, image-to-video, and video continuation outputs. For the Avatar model, the process is similar but requires the additional avatar requirements file.
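The steps above can be sketched as a dry-run script that only prints the commands rather than executing them. The repository URL, the HuggingFace repo ID, the environment name, and the CUDA 12.4 wheel index are assumptions to verify against the official README before you run anything; the demo script is left out because its name isn't given here.

```python
import shlex

# All three values are assumptions; confirm them in the official repository.
REPO_URL = "https://github.com/meituan-longcat/LongCat-Video"
HF_REPO_ID = "meituan-longcat/LongCat-Video"
ENV_NAME = "longcat-video"  # hypothetical conda environment name

def setup_commands(weights_dir="./weights/LongCat-Video"):
    """Return the install steps as argument lists, in the order listed above.

    The pip steps are meant to be run inside the activated conda
    environment, from inside the cloned repository.
    """
    return [
        ["git", "clone", REPO_URL],
        ["conda", "create", "-y", "-n", ENV_NAME, "python=3.10"],
        ["pip", "install", "torch==2.6.0",
         "--index-url", "https://download.pytorch.org/whl/cu124"],
        ["pip", "install", "flash-attn", "--no-build-isolation"],
        ["pip", "install", "-r", "requirements.txt"],
        ["huggingface-cli", "download", HF_REPO_ID, "--local-dir", weights_dir],
    ]

# Dry run: print each command so you can review it before running it yourself.
for command in setup_commands():
    print(shlex.join(command))
```

Printing first is deliberate: FlashAttention builds and multi-gigabyte weight downloads are slow to undo, so it's worth reading each line against the repo's own instructions before committing.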

Final Thoughts

There's something genuinely refreshing about a tool like this existing. The AI video space has been moving fast, but the dominant narrative has been that the best stuff is locked behind subscriptions and cloud platforms. LongCat-Video pushes back on that, not by being the absolute best model in every category, but by being a genuinely capable, free, locally-runnable option that comes close to competing with the leaders.

If you have the hardware to run it and you're curious about what's possible with AI video generation, it's worth exploring. The GitHub repo is active, the documentation is solid, and the community around it is growing. Meituan has already released a follow-up Avatar model and published a detailed technical report, which suggests this is a project they're committed to continuing.

For anyone who's been sitting on the sidelines of AI video because of the cost barrier, this might be the moment to jump in.

Source and full documentation: LongCat-Video on GitHub
