HunyuanVideo-Foley

Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation

Professional-grade AI sound effect generation for video content creators


👥 Authors

Sizhe Shan1,2*, Qiulin Li1,3*, Yutao Cui1, Miles Yang1, Yuehai Wang2, Qun Yang3, Jin Zhou1†, Zhao Zhong1

🏢 1Tencent Hunyuan • 🎓 2Zhejiang University • ✈️ 3Nanjing University of Aeronautics and Astronautics

*Equal contribution • †Project lead


🎥 Demo & Showcase

Experience the magic of AI-generated Foley audio in perfect sync with video content!

🎬 Watch how HunyuanVideo-Foley generates immersive sound effects synchronized with video content


🤝 Community Contributions

ComfyUI Integration - Thanks to the community for creating ComfyUI nodes for HunyuanVideo-Foley.

🌟 We encourage and appreciate community contributions that make HunyuanVideo-Foley more accessible!


Key Highlights

🎭 Multi-scenario Sync
High-quality audio synchronized with complex video scenes

🧠 Multi-modal Balance
Perfect harmony between visual and textual information

🎵 48kHz Hi-Fi Output
Professional-grade audio generation with crystal clarity


📄 Abstract

🚀 Tencent Hunyuan open-sources HunyuanVideo-Foley, an end-to-end video sound effect generation model!

It is a professional-grade AI tool designed for video content creators, applicable to scenarios such as short video creation, film production, advertising, and game development.

🎯 Core Highlights

🎬 Multi-scenario Audio-Visual Synchronization
Supports generating high-quality audio that is synchronized and semantically aligned with complex video scenes, enhancing realism and immersive experience for film/TV and gaming applications.

⚖️ Multi-modal Semantic Balance
Intelligently balances visual and textual information analysis, comprehensively orchestrates sound effect elements, avoids one-sided generation, and meets personalized dubbing requirements.

🎵 High-fidelity Audio Output
Self-developed 48kHz audio VAE perfectly reconstructs sound effects, music, and vocals, achieving professional-grade audio generation quality.

🏆 SOTA Performance Achieved

HunyuanVideo-Foley comprehensively leads the field across multiple evaluation benchmarks, achieving new state-of-the-art levels in audio fidelity, visual-semantic alignment, temporal alignment, and distribution matching - surpassing all open-source solutions!

Figure (Performance Overview): 📊 Performance comparison across different evaluation metrics - HunyuanVideo-Foley leads in all categories


🔧 Technical Architecture

📊 Data Pipeline Design

Figure (Data Pipeline): 🔄 Comprehensive data processing pipeline for high-quality text-video-audio datasets

The TV2A (Text-Video-to-Audio) task presents a complex multimodal generation challenge requiring large-scale, high-quality datasets. Our comprehensive data pipeline systematically identifies and excludes unsuitable content to produce robust and generalizable audio generation capabilities.

🏗️ Model Architecture

Figure (Model Architecture): 🧠 HunyuanVideo-Foley hybrid architecture with multimodal and unimodal transformer blocks

HunyuanVideo-Foley employs a sophisticated hybrid architecture:

  • 🔄 Multimodal Transformer Blocks: Process visual-audio streams simultaneously
  • 🎵 Unimodal Transformer Blocks: Focus on audio stream refinement
  • 👁️ Visual Encoding: Pre-trained encoder extracts visual features from video frames
  • 📝 Text Processing: Semantic features extracted via pre-trained text encoder
  • 🎧 Audio Encoding: Latent representations with Gaussian noise perturbation
  • ⏰ Temporal Alignment: Synchformer-based frame-level synchronization with gated modulation
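The gated modulation named in the last bullet can be illustrated with a generic adaLN-style sketch. This is not the model's actual code: the function names, the plain-list vectors, and the idea that `shift`, `scale`, and `gate` would be projected from synchronization features are all assumptions made for illustration.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gated_modulation(
    x: Vector,
    shift: Vector,
    scale: Vector,
    gate: Vector,
    block: Callable[[Vector], Vector],
) -> Vector:
    """Generic adaLN-style update: x + gate * block(norm(x) * (1 + scale) + shift).

    In a conditioned model, shift/scale/gate would be predicted from the
    conditioning signal (assumed here: frame-level sync features).
    """
    h = layer_norm(x)
    h = [hv * (1.0 + s) + b for hv, s, b in zip(h, scale, shift)]
    y = block(h)
    return [xv + g * yv for xv, g, yv in zip(x, gate, y)]


# Hypothetical 4-dimensional audio token and an identity stand-in for a sub-layer.
audio_token = [0.5, -1.0, 2.0, 0.0]
identity = lambda h: h

# A zero gate leaves the token untouched; a unit gate adds the full update.
closed = gated_modulation(audio_token, [0.1] * 4, [0.2] * 4, [0.0] * 4, identity)
opened = gated_modulation(audio_token, [0.0] * 4, [0.0] * 4, [1.0] * 4, identity)
```

The gate lets the conditioning signal decide, per channel, how much of the modulated update reaches the residual stream.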

📈 Performance Benchmarks

🎬 MovieGen-Audio-Bench Results

Objective and subjective evaluation results: HunyuanVideo-Foley scores best on most metrics, including all three MOS ratings

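| 🏆 Method | PQ | PC | CE | CU | IB | DeSync | CLAP | MOS-Q | MOS-S | MOS-T |
|---|---|---|---|---|---|---|---|---|---|---|
| FoleyCrafter | 6.27 | 2.72 | 3.34 | 5.68 | 0.17 | 1.29 | 0.14 | 3.36±0.78 | 3.54±0.88 | 3.46±0.95 |
| V-AURA | 5.82 | 4.30 | 3.63 | 5.11 | 0.23 | 1.38 | 0.14 | 2.55±0.97 | 2.60±1.20 | 2.70±1.37 |
| Frieren | 5.71 | 2.81 | 3.47 | 5.31 | 0.18 | 1.39 | 0.16 | 2.92±0.95 | 2.76±1.20 | 2.94±1.26 |
| MMAudio | 6.17 | 2.84 | 3.59 | 5.62 | 0.27 | 0.80 | 0.35 | 3.58±0.84 | 3.63±1.00 | 3.47±1.03 |
| ThinkSound | 6.04 | 3.73 | 3.81 | 5.59 | 0.18 | 0.91 | 0.20 | 3.20±0.97 | 3.01±1.04 | 3.02±1.08 |
| HunyuanVideo-Foley (ours) | 6.59 | 2.74 | 3.88 | 6.13 | 0.35 | 0.74 | 0.33 | 4.14±0.68 | 4.12±0.77 | 4.15±0.75 |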

🎯 Kling-Audio-Eval Results

Comprehensive objective evaluation showcasing state-of-the-art performance

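| 🏆 Method | FD_PANNs | FD_PASST | KL | IS | PQ | PC | CE | CU | IB | DeSync | CLAP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FoleyCrafter | 22.30 | 322.63 | 2.47 | 7.08 | 6.05 | 2.91 | 3.28 | 5.44 | 0.22 | 1.23 | 0.22 |
| V-AURA | 33.15 | 474.56 | 3.24 | 5.80 | 5.69 | 3.98 | 3.13 | 4.83 | 0.25 | 0.86 | 0.13 |
| Frieren | 16.86 | 293.57 | 2.95 | 7.32 | 5.72 | 2.55 | 2.88 | 5.10 | 0.21 | 0.86 | 0.16 |
| MMAudio | 9.01 | 205.85 | 2.17 | 9.59 | 5.94 | 2.91 | 3.30 | 5.39 | 0.30 | 0.56 | 0.27 |
| ThinkSound | 9.92 | 228.68 | 2.39 | 6.86 | 5.78 | 3.23 | 3.12 | 5.11 | 0.22 | 0.67 | 0.22 |
| HunyuanVideo-Foley (ours) | 6.07 | 202.12 | 1.89 | 8.30 | 6.12 | 2.76 | 3.22 | 5.53 | 0.38 | 0.54 | 0.24 |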

🎉 Outstanding Results! HunyuanVideo-Foley achieves the best scores on most evaluation metrics across both benchmarks, with clear gains in audio quality, synchronization, and visual-semantic alignment. MMAudio remains ahead on CLAP in both tables and on IS and CE on Kling-Audio-Eval.
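To make the per-metric comparison easy to verify, the sketch below recomputes the best method for each Kling-Audio-Eval column from the numbers in the table above. The metric directions are assumptions, since this page does not state them: FD_PANNs, FD_PASST, KL, PC, and DeSync are treated as lower-is-better, and the rest as higher-is-better.

```python
from typing import Dict, List

METRICS = ["FD_PANNs", "FD_PASST", "KL", "IS", "PQ", "PC",
           "CE", "CU", "IB", "DeSync", "CLAP"]

# Assumption: these metrics improve as they decrease; all others as they increase.
LOWER_IS_BETTER = {"FD_PANNs", "FD_PASST", "KL", "PC", "DeSync"}

# Values copied from the Kling-Audio-Eval table, in METRICS order.
SCORES: Dict[str, List[float]] = {
    "FoleyCrafter": [22.30, 322.63, 2.47, 7.08, 6.05, 2.91, 3.28, 5.44, 0.22, 1.23, 0.22],
    "V-AURA": [33.15, 474.56, 3.24, 5.80, 5.69, 3.98, 3.13, 4.83, 0.25, 0.86, 0.13],
    "Frieren": [16.86, 293.57, 2.95, 7.32, 5.72, 2.55, 2.88, 5.10, 0.21, 0.86, 0.16],
    "MMAudio": [9.01, 205.85, 2.17, 9.59, 5.94, 2.91, 3.30, 5.39, 0.30, 0.56, 0.27],
    "ThinkSound": [9.92, 228.68, 2.39, 6.86, 5.78, 3.23, 3.12, 5.11, 0.22, 0.67, 0.22],
    "HunyuanVideo-Foley": [6.07, 202.12, 1.89, 8.30, 6.12, 2.76, 3.22, 5.53, 0.38, 0.54, 0.24],
}


def best_per_metric(scores: Dict[str, List[float]]) -> Dict[str, str]:
    """Return the best-scoring method name for every metric."""
    winners: Dict[str, str] = {}
    for i, metric in enumerate(METRICS):
        pick = min if metric in LOWER_IS_BETTER else max
        winners[metric] = pick(scores, key=lambda name: scores[name][i])
    return winners


winners = best_per_metric(SCORES)
ours = [m for m, name in winners.items() if name == "HunyuanVideo-Foley"]
print(f"HunyuanVideo-Foley is best on {len(ours)} of {len(METRICS)} metrics: {ours}")
for metric, name in winners.items():
    if name != "HunyuanVideo-Foley":
        print(f"{metric}: best is {name}")
```

Under these assumed directions, HunyuanVideo-Foley is best on 7 of the 11 columns; the remaining ones go to MMAudio (IS, CE, CLAP) and Frieren (PC).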


🚀 Quick Start

📦 Installation

🔧 System Requirements

  • CUDA: 12.4 or 11.8 recommended
  • Python: 3.8+
  • OS: Linux (primary support)
  • Note: Inference requires approximately 20GB of VRAM. A GPU with at least 24GB of VRAM (such as an RTX 3090 or 4090) is recommended for stable performance.
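A quick pre-flight check of the requirements listed above might look like the following sketch. The function name and the message wording are hypothetical, and the VRAM figure is passed in by the caller rather than detected, so the snippet has no GPU or third-party dependency.

```python
import sys
from typing import List, Sequence

MIN_PYTHON = (3, 8)          # "Python: 3.8+"
MIN_VRAM_GB = 20.0           # roughly what inference needs, per the note above
RECOMMENDED_VRAM_GB = 24.0   # recommended GPU memory, per the note above


def check_requirements(python_version: Sequence[int], vram_gb: float) -> List[str]:
    """Return human-readable problems; an empty list means everything looks fine."""
    notes: List[str] = []
    major, minor = python_version[0], python_version[1]
    if (major, minor) < MIN_PYTHON:
        notes.append(f"error: Python {major}.{minor} is older than 3.8")
    if vram_gb < MIN_VRAM_GB:
        notes.append(f"error: {vram_gb:g} GB of VRAM is below the ~20 GB needed for inference")
    elif vram_gb < RECOMMENDED_VRAM_GB:
        notes.append(f"warning: {vram_gb:g} GB of VRAM is below the recommended 24 GB")
    return notes


# The VRAM value here is a placeholder; supply the figure for your own GPU.
for note in check_requirements(sys.version_info[:2], vram_gb=24.0):
    print(note)
```

Note that drivers often report slightly less than the nominal card capacity, so a "24GB" card may need a small tolerance if the value is read programmatically.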

Step 1: Clone Repository

    # 📥 Clone the repository
    git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-Foley
    cd HunyuanVideo-Foley

Step 2: Environment Setup

💡 Tip: We recommend using Conda for Python environment management.

    # 🔧 Install dependencies
    pip install -r requirements.txt

Step 3: Download Pretrained Models

🔗 Download model weights from Hugging Face

    # using git-lfs
    git clone https://huggingface.co/tencent/HunyuanVideo-Foley

    # using huggingface-cli
    huggingface-cli download tencent/HunyuanVideo-Foley

💻 Usage

🎬 Single Video Generation

Generate Foley audio for a single video file with text description:

    python3 infer.py \
        --model_path PRETRAINED_MODEL_PATH_DIR \
        --config_path ./configs/hunyuanvideo-foley-xxl.yaml \
        --single_video video_path \
        --single_prompt "audio description" \
        --output_dir OUTPUT_DIR
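When driving the single-video command from a script, it can help to assemble the argument list in Python so that prompts containing spaces or quotes are passed safely. The sketch below only builds and prints the command; the helper name and the example paths are hypothetical, while the flags are the ones shown above.

```python
import shlex
from typing import List

DEFAULT_CONFIG = "./configs/hunyuanvideo-foley-xxl.yaml"


def build_infer_command(
    model_path: str,
    video: str,
    prompt: str,
    output_dir: str,
    config_path: str = DEFAULT_CONFIG,
) -> List[str]:
    """Assemble the infer.py invocation as an argument list (no shell quoting needed)."""
    return [
        "python3", "infer.py",
        "--model_path", model_path,
        "--config_path", config_path,
        "--single_video", video,
        "--single_prompt", prompt,
        "--output_dir", output_dir,
    ]


# Hypothetical paths and prompt, for illustration only.
cmd = build_infer_command(
    model_path="./pretrained",
    video="clips/door.mp4",
    prompt="a wooden door creaks open",
    output_dir="./outputs",
)
print(shlex.join(cmd))  # shell-safe rendering of the command

# To actually run it from the repository root:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell interpolation entirely, which matters once prompts come from user input.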

📂 Batch Processing

Process multiple videos using a CSV file with video paths and descriptions:

    # Download sample test videos
    bash ./download_test_videos.sh

    python3 infer.py \
        --model_path PRETRAINED_MODEL_PATH_DIR \
        --config_path ./configs/hunyuanvideo-foley-xxl.yaml \
        --csv_path assets/test.csv \
        --output_dir OUTPUT_DIR
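For your own batch, a CSV of video paths and descriptions can be generated with Python's `csv` module, which takes care of quoting prompts that contain commas. The column names `video` and `prompt` below are assumptions, as are the example paths; check the header of `assets/test.csv` and adjust before use.

```python
import csv
import io
from typing import Iterable, Tuple

# Assumed header; verify against assets/test.csv in the repository.
FIELDNAMES = ("video", "prompt")


def render_batch_csv(rows: Iterable[Tuple[str, str]]) -> str:
    """Render (video_path, prompt) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(FIELDNAMES)
    for video_path, prompt in rows:
        writer.writerow([video_path, prompt])
    return buffer.getvalue()


# Hypothetical clips and descriptions.
csv_text = render_batch_csv([
    ("clips/door.mp4", "a wooden door creaks open"),
    ("clips/walk.mp4", "footsteps on gravel, then a car door slams"),
])
print(csv_text)

# To save it for use with --csv_path:
#   from pathlib import Path
#   Path("my_batch.csv").write_text(csv_text, encoding="utf-8")
```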

🌐 Interactive Web Interface

Launch a user-friendly Gradio web interface for easy interaction:

    export HIFI_FOLEY_MODEL_PATH=PRETRAINED_MODEL_PATH_DIR
    python3 gradio_app.py

🚀 Then open your browser and navigate to the provided local URL to start generating Foley audio!


📚 Citation

If you find HunyuanVideo-Foley useful for your research, please consider citing our paper:

    @misc{shan2025hunyuanvideofoleymultimodaldiffusionrepresentation,
        title={HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation},
        author={Sizhe Shan and Qiulin Li and Yutao Cui and Miles Yang and Yuehai Wang and Qun Yang and Jin Zhou and Zhao Zhong},
        year={2025},
        eprint={2508.16930},
        archivePrefix={arXiv},
        primaryClass={eess.AS},
        url={https://arxiv.org/abs/2508.16930},
    }

🙏 Acknowledgements

We extend our heartfelt gratitude to the open-source community!

🎨 Stable Diffusion 3
Foundation diffusion models

FLUX
Advanced generation techniques

🎵 MMAudio
Multimodal audio generation

🤗 HuggingFace
Platform & diffusers library

🗜️ DAC
High-Fidelity Audio Compression

🔗 Synchformer
Audio-Visual Synchronization

🌟 Special thanks to all researchers and developers who contribute to the advancement of AI-generated audio and multimodal learning!


🔗 Connect with Us

GitHub Twitter Hunyuan

© 2025 Tencent Hunyuan. All rights reserved. | Made with ❤️ for the AI community