We introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across Text-to-Video, Image-to-Video, and Video-Continuation generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step toward world models.
For more detail, please refer to the comprehensive LongCat-Video Technical Report.
Clone the repo:
git clone --single-branch --branch main https://github.com/meituan-longcat/LongCat-Video
cd LongCat-Video
Install dependencies:
# create conda environment
conda create -n longcat-video python=3.10
conda activate longcat-video
# install torch (configure according to your CUDA version)
pip install torch==2.6.0+cu124 torchvision==0.21.0+cu124 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
# install flash-attn-2
pip install ninja
pip install psutil
pip install packaging
pip install flash_attn==2.7.4.post1
# install other requirements
pip install -r requirements.txt
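After installing, a quick sanity check (a minimal sketch; adjust to your own setup) confirms that PyTorch sees the GPU and that FlashAttention-2 imports cleanly:

```python
# Optional sanity check after installation.
import torch

print("torch:", torch.__version__)                # expect 2.6.0+cu124
print("CUDA available:", torch.cuda.is_available())
print("CUDA build:", torch.version.cuda)

try:
    import flash_attn
    print("flash_attn:", flash_attn.__version__)  # expect 2.7.4.post1
except ImportError:
    print("flash_attn not found; see the note below on alternative backends")
```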
FlashAttention-2 is enabled in the model config by default; once FlashAttention-3 or xformers is installed, you can switch to it by editing the model config ("./weights/LongCat-Video/dit/config.json").
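If you do want to switch backends, one option is to edit that JSON directly. The sketch below is only an illustration: the key name "attn_backend" and the value "xformers" are assumptions, so inspect the actual fields in config.json before writing anything back.

```python
# Illustrative only: point the DiT config at a different attention backend.
# "attn_backend" / "xformers" are assumed names -- check the real keys in
# ./weights/LongCat-Video/dit/config.json before editing.
import json

config_path = "./weights/LongCat-Video/dit/config.json"

with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

print(sorted(config.keys()))         # inspect the available fields first

config["attn_backend"] = "xformers"  # hypothetical key and value

with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
```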
| Models | Download Link |
|---|---|
| LongCat-Video | 🤗 Huggingface |
Download models using huggingface-cli:
pip install "huggingface_hub[cli]" huggingface-cli download meituan-longcat/LongCat-Video --local-dir ./weights/LongCat-Video
Run the Text-to-Video demo:

# Single-GPU inference
torchrun run_demo_text_to_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile
# Multi-GPU inference
torchrun --nproc_per_node=2 run_demo_text_to_video.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video --enable_compile
Run the Image-to-Video demo:

# Single-GPU inference
torchrun run_demo_image_to_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile
# Multi-GPU inference
torchrun --nproc_per_node=2 run_demo_image_to_video.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video --enable_compile
Run the Video-Continuation demo:

# Single-GPU inference
torchrun run_demo_video_continuation.py --checkpoint_dir=./weights/LongCat-Video --enable_compile
# Multi-GPU inference
torchrun --nproc_per_node=2 run_demo_video_continuation.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video --enable_compile
Run the long video generation demo:

# Single-GPU inference
torchrun run_demo_long_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile
# Multi-GPU inference
torchrun --nproc_per_node=2 run_demo_long_video.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video --enable_compile
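The long-video demo is closely related to Video-Continuation: conceptually, a long clip can be grown by repeatedly conditioning a continuation step on the frames produced so far. The sketch below is only an illustration of that idea with hypothetical helper names; the actual logic lives in run_demo_long_video.py.

```python
# Conceptual illustration only -- not the repository's implementation.
# `generate_continuation` is a hypothetical callable that maps a list of
# conditioning frames to a newly generated segment of frames.
def extend_video(initial_clip, generate_continuation, num_segments, context_frames=16):
    frames = list(initial_clip)
    for _ in range(num_segments):
        # Condition each new segment on the tail of everything generated so far.
        segment = generate_continuation(frames[-context_frames:])
        frames.extend(segment)
    return frames
```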
Launch the interactive Streamlit demo:

# Single-GPU inference
streamlit run ./run_streamlit.py --server.fileWatcherType none --server.headless=false
Text-to-Video MOS evaluation results on our internal benchmark:
| Metric | Veo3 | PixVerse-V5 | Wan 2.2-T2V-A14B | LongCat-Video |
|---|---|---|---|---|
| Accessibility | Proprietary | Proprietary | Open Source | Open Source |
| Architecture | - | - | MoE | Dense |
| # Total Params | - | - | 28B | 13.6B |
| # Activated Params | - | - | 14B | 13.6B |
| Text-Alignment↑ | 3.99 | 3.81 | 3.70 | 3.76 |
| Visual Quality↑ | 3.23 | 3.13 | 3.26 | 3.25 |
| Motion Quality↑ | 3.86 | 3.81 | 3.78 | 3.74 |
| Overall Quality↑ | 3.48 | 3.36 | 3.35 | 3.38 |
Image-to-Video MOS evaluation results on our internal benchmark:
| Metric | Seedance 1.0 | Hailuo-02 | Wan 2.2-I2V-A14B | LongCat-Video |
|---|---|---|---|---|
| Accessibility | Proprietary | Proprietary | Open Source | Open Source |
| Architecture | - | - | MoE | Dense |
| # Total Params | - | - | 28B | 13.6B |
| # Activated Params | - | - | 14B | 13.6B |
| Image-Alignment↑ | 4.12 | 4.18 | 4.18 | 4.04 |
| Text-Alignment↑ | 3.70 | 3.85 | 3.33 | 3.49 |
| Visual Quality↑ | 3.22 | 3.18 | 3.23 | 3.27 |
| Motion Quality↑ | 3.77 | 3.80 | 3.79 | 3.59 |
| Overall Quality↑ | 3.35 | 3.27 | 3.26 | 3.17 |
The model weights are released under the MIT License.
Any contributions to this repository are licensed under the MIT License, unless otherwise stated. This license does not grant any rights to use Meituan trademarks or patents.
See the LICENSE file for the full license text.
This model has not been specifically designed or comprehensively evaluated for every possible downstream application.
Developers should take into account the known limitations of large generative models, including performance variations across different languages, and carefully assess accuracy, safety, and fairness before deploying the model in sensitive or high-risk scenarios. It is the responsibility of developers and downstream users to understand and comply with all applicable laws and regulations relevant to their use case, including but not limited to data protection, privacy, and content safety requirements.
Nothing in this Model Card should be interpreted as altering or restricting the terms of the MIT License under which the model is released.
If you find our work useful, we kindly encourage you to cite it:
@misc{meituan2025longcatvideotechnicalreport,
      title={LongCat-Video Technical Report},
      author={Meituan LongCat Team},
      year={2025},
      eprint={xxx},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/xxx},
}
We would like to thank the contributors to the Wan, UMT5-XXL, Diffusers, and HuggingFace repositories for their open research.
Please contact us at longcat-team@meituan.com or join our WeChat Group if you have any questions.