We are excited to introduce LingBot-World, an open-source world simulator built on video generation and positioned as a top-tier world model.
This codebase is built upon Wan2.2. Please refer to the Wan2.2 documentation for detailed installation instructions.
Clone the repo:
git clone https://github.com/robbyant/lingbot-world.git
cd lingbot-world
Install dependencies:
# Ensure torch >= 2.4.0
pip install -r requirements.txt
Install flash_attn:
pip install flash-attn --no-build-isolation
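Before running inference, it can help to confirm that the environment satisfies the `torch >= 2.4.0` requirement and that `flash_attn` is visible to Python. The following is a minimal sketch and not part of the repository: the helper names `parse_release`, `meets_minimum`, and `check_environment` are hypothetical, and the check reads installed package metadata rather than importing the packages.

```python
import re
from importlib import metadata, util


def parse_release(version):
    """Turn '2.4.0+cu121' into (2, 4, 0); local and pre-release suffixes are ignored."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    parts += [0] * (3 - len(parts))  # pad so that '2.4' compares equal to '2.4.0'
    return tuple(parts)


def meets_minimum(installed, minimum="2.4.0"):
    """Compare numerically, so '2.10.0' correctly counts as newer than '2.4.0'."""
    return parse_release(installed) >= parse_release(minimum)


def check_environment():
    """Report the installed torch version and whether flash_attn can be located."""
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        torch_version = None
    return {
        "torch": torch_version,
        "torch_ok": torch_version is not None and meets_minimum(torch_version),
        # find_spec locates the module without importing it
        "flash_attn_found": util.find_spec("flash_attn") is not None,
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Comparing parsed integer tuples instead of raw strings avoids the common mistake where `"2.10.0" < "2.4.0"` under string ordering.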
| Model | Control Signals | Resolution | Download Links |
|---|---|---|---|
| LingBot-World-Base (Cam) | Camera Poses | 480P & 720P | 🤗 HuggingFace 🤖 ModelScope |
| LingBot-World-Base (Act) | Actions | - | To be released |
| LingBot-World-Fast | - | - | To be released |
Download models using huggingface-cli:
pip install "huggingface_hub[cli]"
huggingface-cli download robbyant/lingbot-world-base-cam --local-dir ./lingbot-world-base-cam
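The same download can also be scripted from Python. This sketch assumes the `huggingface_hub` package installed above and uses its `snapshot_download` function; the helpers `default_local_dir` and `download_checkpoint` are hypothetical names introduced here for illustration.

```python
from pathlib import Path


def default_local_dir(repo_id):
    """Mirror the CLI example: 'robbyant/lingbot-world-base-cam' -> 'lingbot-world-base-cam'."""
    return Path(repo_id.split("/")[-1])


def download_checkpoint(repo_id="robbyant/lingbot-world-base-cam", local_dir=None):
    """Fetch every file of the model repository into a local directory."""
    # Imported lazily so this module loads even when huggingface_hub is missing.
    from huggingface_hub import snapshot_download

    target = Path(local_dir) if local_dir is not None else default_local_dir(repo_id)
    return snapshot_download(repo_id=repo_id, local_dir=str(target))


# Example (performs a large network download):
# download_checkpoint()
```

The returned path can then be passed to `--ckpt_dir` in the inference commands.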
Our model supports video generation at both 480P and 720P resolutions. You can find data samples for inference in the examples/ directory, which includes the corresponding input images, prompts, and control signals. To enable long video generation, we use multi-GPU inference powered by FSDP and DeepSpeed Ulysses. The two commands below run on 8 GPUs, first at 480P (--size 480*832) and then at 720P (--size 720*1280).
torchrun --nproc_per_node=8 generate.py --task i2v-A14B --size 480*832 --ckpt_dir lingbot-world-base-cam --image examples/00/image.jpg --action_path examples/00 --dit_fsdp --t5_fsdp --ulysses_size 8 --frame_num 161 --prompt "The video presents a soaring journey through a fantasy jungle. The wind whips past the rider's blue hands gripping the reins, causing the leather straps to vibrate. The ancient gothic castle approaches steadily, its stone details becoming clearer against the backdrop of floating islands and distant waterfalls."
torchrun --nproc_per_node=8 generate.py --task i2v-A14B --size 720*1280 --ckpt_dir lingbot-world-base-cam --image examples/00/image.jpg --action_path examples/00 --dit_fsdp --t5_fsdp --ulysses_size 8 --frame_num 161 --prompt "The video presents a soaring journey through a fantasy jungle. The wind whips past the rider's blue hands gripping the reins, causing the leather straps to vibrate. The ancient gothic castle approaches steadily, its stone details becoming clearer against the backdrop of floating islands and distant waterfalls."
Alternatively, you can run inference without control actions by omitting the --action_path argument:
torchrun --nproc_per_node=8 generate.py --task i2v-A14B --size 480*832 --ckpt_dir lingbot-world-base-cam --image examples/00/image.jpg --dit_fsdp --t5_fsdp --ulysses_size 8 --frame_num 161 --prompt "The video presents a soaring journey through a fantasy jungle. The wind whips past the rider's blue hands gripping the reins, causing the leather straps to vibrate. The ancient gothic castle approaches steadily, its stone details becoming clearer against the backdrop of floating islands and distant waterfalls."
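The three invocations above differ only in `--size` and in whether `--action_path` is present, so they are easy to assemble programmatically. The sketch below is a hypothetical wrapper (`build_command` is not part of the repository) that uses only the flags shown in the commands above. It ties `--ulysses_size` to the GPU count because the examples use the same value for both; treat that as an assumption rather than a documented rule.

```python
import shlex


def build_command(image, prompt, size="480*832", action_path=None,
                  num_gpus=8, frame_num=161, ckpt_dir="lingbot-world-base-cam"):
    """Return the torchrun invocation as an argument list."""
    cmd = [
        "torchrun", f"--nproc_per_node={num_gpus}", "generate.py",
        "--task", "i2v-A14B",
        "--size", size,
        "--ckpt_dir", ckpt_dir,
        "--image", image,
    ]
    if action_path is not None:
        # Camera-pose control is optional; leaving it out runs without control actions.
        cmd += ["--action_path", action_path]
    cmd += [
        "--dit_fsdp", "--t5_fsdp",
        "--ulysses_size", str(num_gpus),  # assumption: matches the GPU count, as in the examples
        "--frame_num", str(frame_num),
        "--prompt", prompt,
    ]
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "examples/00/image.jpg",
        "The wind whips past the rider's blue hands gripping the reins.",
        size="720*1280",
        action_path="examples/00",
    )
    print(shlex.join(cmd))  # shell-safe rendering for inspection
    # To launch: import subprocess; subprocess.run(cmd, check=True)
```

Passing the list directly to `subprocess.run` sidesteps shell quoting, which matters here because the prompt contains an apostrophe and the size contains `*`.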
Tips:
If you have sufficient CUDA memory, you may increase the frame_num parameter to a value such as 961 to generate a one-minute video at 16 FPS.
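The relationship between `frame_num` and clip length follows from the numbers above: 961 frames at 16 FPS span 960 frame intervals, which is 60 seconds, and the default 161 frames span 10 seconds. A small sketch of that arithmetic is shown below. The function names are hypothetical, and rounding down to a value of the form 4n+1 is an assumption carried over from the Wan family of models, not something stated in this README.

```python
def frame_num_for(seconds, fps=16, stride=4):
    """Largest frame count of the form stride*n + 1 that fits in `seconds`."""
    intervals = int(seconds * fps)   # whole frame intervals in the clip
    intervals -= intervals % stride  # assumption: frame_num must be 4n + 1
    return intervals + 1             # one extra frame for the starting image


def duration_of(frame_num, fps=16):
    """Clip length in seconds for a given frame count."""
    return (frame_num - 1) / fps


if __name__ == "__main__":
    print(frame_num_for(60))  # 961
    print(frame_num_for(10))  # 161
    print(duration_of(961))   # 60.0
```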
This project is licensed under the Apache 2.0 License. Please refer to the LICENSE file for the full text, including details on rights and restrictions.
We would like to express our gratitude to the Wan2.2 team for open-sourcing their code and models. Their contributions have been instrumental to the development of this project.
If you find this work useful for your research, please cite our paper:
@article{lingbot-world,
  title={Advancing Open-source World Models},
  author={Robbyant Team},
  journal={arXiv preprint arXiv:2601.20540},
  year={2026}
}