🤗 HuggingFace | 💻 Official website: try our model!
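This repository contains the PyTorch model definitions, pretrained weights, and inference/sampling code for HunyuanImage 2.1. You can try the model directly on the official website; for more visual examples, visit our project page.
We are pleased to introduce HunyuanImage 2.1, a 17-billion-parameter text-to-image model that generates images at 2K (2048 × 2048) resolution.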
Our architecture consists of two stages:
👑 We rank first among open-source models on the Arena text-to-image leaderboard.
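Hardware and operating system requirements:
An NVIDIA GPU with CUDA support.
Current minimum: 24 GB of GPU memory for 2048x2048 image generation.
Note: The memory requirement above was measured with model CPU offloading and FP8 quantization enabled. If your GPU has enough memory, you can disable offloading to speed up inference.
Supported operating system: Linux.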
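A quick pre-flight check can confirm that the visible GPU meets the 24 GB minimum before any weights are loaded. The sketch below is illustrative and not part of this repository: `meets_minimum_vram`, `describe_current_gpu`, and `MIN_VRAM_GB` are hypothetical names, and the threshold is simply the figure quoted above. Cards marketed as 24 GB may report slightly less than 24 GiB to PyTorch, so the check rounds to the nearest GiB; that rounding is an assumption of this sketch.

```python
MIN_VRAM_GB = 24  # minimum quoted above (measured with FP8 + CPU offloading)


def meets_minimum_vram(total_bytes: int, min_gb: int = MIN_VRAM_GB) -> bool:
    """Return True if a device reporting `total_bytes` of memory meets the minimum.

    The reported size is rounded to the nearest GiB, because nominal 24 GB
    cards may report a little less than 24 * 1024**3 bytes.
    """
    return round(total_bytes / 1024**3) >= min_gb


def describe_current_gpu(device_index: int = 0) -> str:
    """Summarize one CUDA device. torch is imported lazily so that
    `meets_minimum_vram` stays usable without PyTorch installed."""
    import torch

    if not torch.cuda.is_available():
        return "No CUDA-capable GPU detected."
    props = torch.cuda.get_device_properties(device_index)
    verdict = "meets" if meets_minimum_vram(props.total_memory) else "is below"
    size_gib = props.total_memory / 1024**3
    return f"{props.name}: {size_gib:.1f} GiB, {verdict} the {MIN_VRAM_GB} GB minimum"
```

Calling `print(describe_current_gpu())` on the target machine gives a one-line verdict; a failed check only means the configuration above (FP8 plus offloading) is unlikely to fit.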
git clone https://github.com/Tencent-Hunyuan/HunyuanImage-2.1.git
cd HunyuanImage-2.1
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
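For instructions on downloading the models, see here.
Prompt enhancement plays a key role in getting high-quality images from our model. Longer, more detailed prompts produce markedly better images, so we encourage you to write comprehensive, descriptive prompts for the best image quality.
We strongly recommend trying the PromptEnhancer-32B model for higher-quality prompt enhancement.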
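HunyuanImage-2.1 supports 2K image generation only (for example, 2048x2048 at 1:1 or 2560x1536 at 16:9). Generating images at 1K resolution may degrade quality and introduce artifacts.
We also strongly recommend using the full generation pipeline (that is, with prompt enhancement and the refiner enabled) for higher image quality.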
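Because only the 2K resolutions are supported, it can help to snap an arbitrary requested size to the nearest supported one before calling the pipeline. The following is a minimal sketch, not something the repository provides: `closest_supported_resolution` is a hypothetical helper, and the resolution table is copied from the sample code in this README. It compares aspect ratios on a log scale so that landscape and portrait requests are treated symmetrically.

```python
import math

# (width, height) pairs listed in this README's sample code.
SUPPORTED_RESOLUTIONS = {
    "16:9": (2560, 1536),
    "4:3": (2304, 1792),
    "1:1": (2048, 2048),
    "3:4": (1792, 2304),
    "9:16": (1536, 2560),
}


def closest_supported_resolution(width: int, height: int) -> tuple[int, int]:
    """Return the supported (width, height) whose aspect ratio is nearest
    to the requested one. Only the ratio matters, not the absolute size."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = math.log(width / height)
    return min(
        SUPPORTED_RESOLUTIONS.values(),
        key=lambda wh: abs(math.log(wh[0] / wh[1]) - target),
    )


# Example: a 1080p landscape request maps to the "16:9" entry.
print(closest_supported_resolution(1920, 1080))
```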
| Model type | Model name | Description | num_inference_steps | guidance_scale | shift |
|---|---|---|---|---|---|
| Base text-to-image model | hunyuanimage2.1 | Undistilled model; best quality. | 50 | 3.5 | 5 |
| Distilled text-to-image model | hunyuanimage2.1-distilled | Distilled model; faster inference. | 8 | 3.25 | 4 |
| Refiner | hunyuanimage-refiner | Refiner model. | N/A | N/A | N/A |
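The recommended settings in the table can be kept in one lookup so they are not retyped at each call site. This is a sketch under assumptions: `RECOMMENDED_SETTINGS` and `sampling_kwargs` are hypothetical names, and the dictionary keys follow the `model_name` strings used in this README's sample code (`hunyuanimage-v2.1`, `hunyuanimage-v2.1-distilled`) rather than the names shown in the table.

```python
# Values copied from the table above; key spelling is an assumption (see lead-in).
RECOMMENDED_SETTINGS = {
    "hunyuanimage-v2.1": {
        "num_inference_steps": 50,
        "guidance_scale": 3.5,
        "shift": 5,
    },
    "hunyuanimage-v2.1-distilled": {
        "num_inference_steps": 8,
        "guidance_scale": 3.25,
        "shift": 4,
    },
}


def sampling_kwargs(model_name: str) -> dict:
    """Return a fresh copy of the recommended sampling settings for a model."""
    try:
        return dict(RECOMMENDED_SETTINGS[model_name])
    except KeyError:
        known = ", ".join(sorted(RECOMMENDED_SETTINGS))
        raise ValueError(f"unknown model {model_name!r}; expected one of: {known}") from None
```

The result can be unpacked into the pipeline call, for example `pipe(prompt=prompt, width=2048, height=2048, **sampling_kwargs(model_name))`.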
import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
import torch
from hyimage.diffusion.pipelines.hunyuanimage_pipeline import HunyuanImagePipeline
# Supported model_name values: hunyuanimage-v2.1, hunyuanimage-v2.1-distilled
model_name = "hunyuanimage-v2.1"
pipe = HunyuanImagePipeline.from_pretrained(model_name=model_name, use_fp8=True)
pipe = pipe.to("cuda")
# Input prompt
prompt = "A cute, cartoon-style anthropomorphic penguin plush toy with fluffy fur, standing in a painting studio, wearing a red knitted scarf and a red beret with the word \"Tencent\" on it, holding a paintbrush with a focused expression as it paints an oil painting of the Mona Lisa, rendered in a photorealistic photographic style."
# Supported aspect ratios and their (width, height) resolutions
aspect_ratios = {
    "16:9": (2560, 1536),
    "4:3": (2304, 1792),
    "1:1": (2048, 2048),
    "3:4": (1792, 2304),
    "9:16": (1536, 2560),
}
width, height = aspect_ratios["1:1"]
image = pipe(
    prompt=prompt,
    width=width,
    height=height,
    # Keep reprompt disabled if the prompt has already been enhanced.
    use_reprompt=False,  # set to True to enable prompt enhancement (may increase GPU memory usage)
    use_refiner=True,  # enable the refiner model
    # Distilled model: use 8 steps for faster inference.
    # Non-distilled model: use 50 steps for better quality.
    num_inference_steps=8 if "distilled" in model_name else 50,
    guidance_scale=3.25 if "distilled" in model_name else 3.5,
    shift=4 if "distilled" in model_name else 5,
    seed=649151,
)
image.save("generated_image.png")
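The sample above renders only the 1:1 entry. To produce one image per supported aspect ratio, the call can be wrapped in a loop. This is a hedged sketch: `generate_all_ratios` is a hypothetical helper, and it assumes only what the sample shows, namely that `pipe(...)` accepts `prompt`, `width`, and `height` keywords and returns an object with a `save` method.

```python
from pathlib import Path


def generate_all_ratios(pipe, prompt, aspect_ratios, out_dir="outputs", **pipe_kwargs):
    """Call `pipe` once per aspect ratio and save each result as a PNG.

    `pipe_kwargs` (e.g. seed, use_refiner, num_inference_steps) are passed
    through unchanged. Returns the list of written paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for label, (width, height) in aspect_ratios.items():
        image = pipe(prompt=prompt, width=width, height=height, **pipe_kwargs)
        # ":" is not a safe filename character on every platform.
        path = out_dir / f"generated_{label.replace(':', 'x')}.png"
        image.save(path)
        saved.append(path)
    return saved
```

With the objects from the sample, a call might look like `generate_all_ratios(pipe, prompt, aspect_ratios, use_refiner=True, seed=649151)`. Each iteration is a full 2K generation, so expect the total run time to scale with the number of ratios.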
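Our model can generate high-quality, creative images from complex instructions.
We recommend longer, more detailed prompts. Try the example prompts below.
| Index | User prompt | Image |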
|---|---|---|
| 1 | 宏伟教堂的内部,穹顶下方的中央矗立着一尊小巧的维纳斯雕像,微微侧对镜头。雕像没有双手,布满裂纹,表面若干古老的水泥片剥落,露出内部真人质感的牛奶肌肤。雕像穿着薄薄的白色婚纱,在雕像的身后,一只浮空水泥断手轻轻提起长长的婚纱拖尾;在雕像的头顶上方,另一只浮空水泥断手正为她戴上一个由白色花朵组成的花环,雕像本身是没有双手的。教堂穹顶上布满彩色玻璃窗,一束阳光从上往下照射到雕像上,形成丁达尔效应,光斑点点洒在雕像的脸庞和胸前。充满神性的光辉,背景微微虚化,物体的边缘模糊柔和。拉斐尔前派的梦幻朦胧美学风格。 | ![]() |
| 2 | A hyper-realistic photograph of a crystal ball diorama sitting atop fluffy forest moss and surrounded by scattered sunlight. Inside, detailed diorama features a Tencent meeting room, an animated chat bubble sculpture, and several joyful penguins—one wearing a graduation cap, others playing soccer and waving tiny banners. The base of the crystal sphere boldly presents "Tencent" in large, crisp, white 3D letters. Background is softly blurred and bokeh-rich, emphasizing the cute, vibrant details of the sphere. | ![]() |
| 3 | A close-up portrait of an elderly Italian man with deeply wrinkled skin, expressive hazel eyes, and a neatly trimmed white mustache. His olive-toned complexion shows the marks of sun and age, and he wears a flat cap slightly tilted to the side. He smiles faintly, revealing warmth and wisdom, while holding a small espresso cup in one hand. The softly blurred background shows a rustic stone wall with climbing ivy, captured in a realistic photography style. | ![]() |
| 4 | An open vintage suitcase on a neutral, softly lit background. The suitcase is made of deep brown, worn leather with visible scuffs and creases, and its interior is lined with dark, plush fabric. Inside the suitcase is a meticulously crafted miniature landscape of China, featuring the Great Wall of China winding across model mountains, the pagoda roofs of the Forbidden City, and a representation of the terracotta army, all interwoven with vibrant green rice paddies. On the side of the suitcase, a text "China" is labeled. The entire diorama is bathed in warm, ethereal light, with a dreamy lens bloom and soft, glowing highlights. Photorealistic style, ultra-detailed textures, cinematic lighting. | ![]() |
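To improve the quality and detail of generated images, we use a prompt rewriting model. It automatically enhances the user's text prompt by adding detailed, descriptive information.
| Index | User prompt | Enhanced prompt | Image |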
|---|---|---|---|
| 1 | Wildlife poster for Serengeti plains. Wide-eyed chibi explorer riding friendly lion cub. 'Serengeti: Roar of Adventure' in whimsical font. 'Where Dreams Run Wild' tagline. Warm yellows and soft browns. | A wildlife poster design for the Serengeti plains features a central illustration of a chibi-style explorer riding a lion cub, set against a backdrop of rolling hills. At the top of the composition, the title "Serengeti: Roar of Adventure" is displayed in a large, whimsical font with decorative, swirling letters. The main scene depicts a wide-eyed chibi explorer, characterized by a large head and a small body, sitting atop a friendly lion cub. The explorer wears a green explorer's hat, a backpack, and holds onto the cub's mane, looking forward with a look of wonder. The lion cub, with a light brown mane and a smiling expression, strides forward, its body rendered in warm orange tones. In the background, the Serengeti plains are illustrated with rolling hills and savanna grass, all in shades of warm yellow and soft brown. Below the main illustration, the tagline "Where Dreams Run Wild" is written in a smaller, elegant script. The overall presentation is that of a poster design, combining a cute chibi illustration style with playful, whimsical typography. | ![]() |
| 2 | Energetic poster for New York City. Anime businesswoman hailing a taxi with skyscrapers and Times Square signs around. 'NYC: Bright Ambitions' in urban graffiti font. 'Own Every Dream' tagline. Saturated yellows, reds, and sharp blues. | An energetic poster for New York City unfolds, featuring a dynamic scene with an anime-style businesswoman in the midst of hailing a taxi. The central figure is a young woman with large, expressive eyes and dark hair styled in a bob, wearing a professional blue business suit with motion lines indicating movement. She stands on a bustling street, her arms outstretched as she calls for a classic yellow taxi cab that is approaching. In the background, towering skyscrapers with sleek, anime-inspired architecture rise into the sky, adorned with vibrant, glowing billboards and neon signs characteristic of Times Square. Across the top of the poster, the text "NYC: Bright Ambitions" is displayed in a large, stylized urban graffiti font, with spray-paint-like edges. Below this main title, the tagline "Own Every Dream" is written in a smaller, clean font. The entire composition is rendered with saturated colors, dominated by bright yellows, reds, and sharp blues. The overall presentation is a fusion of anime illustration and graphic design. | ![]() |
| 3 | An artistic studio portrait captures a high fashion model in a striking, dynamic pose. Her face is a canvas for avant-garde makeup, defined by bold, geometric applications of primary colors. She wears a sculptural, unconventional garment, emphasizing clean lines and form. The scene is illuminated by dramatic studio lighting, creating sharp contrasts and highlighting her features against an abstract, blurred background of colors. The image is presented in a realistic photography style. | An artistic studio portrait captures a high fashion model in a striking, dynamic pose, her body twisted with one arm raised high to convey energy and movement. Her face serves as a canvas for avant-garde makeup, featuring bold, geometric applications of primary colors; vibrant yellow triangles are painted on her forehead, and electric blue lines accentuate her eye sockets. She wears a sculptural, unconventional garment made of a stiff, matte white fabric, with asymmetrical panels that wrap around her torso, emphasizing clean lines and form. Illuminated by dramatic studio lighting, with a strong beam from the side casting sharp shadows and highlighting the contours of her face and body against an abstract, blurred background of purples and oranges, creating a bokeh effect. Realistic photography style. | ![]() |
| 4 | An environmental portrait of a chef, captured with a focused expression in a bustling kitchen. He holds culinary tools, his gaze fixed on his work, embodying passion and creativity. The background is a blur of motion with stainless steel counters, all illuminated by a warm ambient light. The image is presented in a realistic photography style. | An environmental portrait of a male chef in the midst of work within a bustling kitchen. The chef, as the central subject and viewed from the chest up, has a focused expression with a furrowed brow, his gaze directed downward at the culinary tools he holds. He wears a professional white chef's jacket and a traditional toque, with flour lightly dusting his face and clothes. In his hands, he grips a large chef's knife and a metal spatula, poised over an unseen cooking surface. The background is a dynamic blur of motion, with out-of-focus shapes of stainless steel counters, pots, and other kitchen equipment suggesting a busy environment. Warm ambient light from overhead fixtures casts a golden hue, creating highlights on the chef's jacket and the tools. Realistic photography style, characterized by a shallow depth of field that emphasizes the subject while conveying the energy and creativity of the kitchen. | ![]() |
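SSAE (Structured Semantic Alignment Evaluation) is an automated metric for image-text alignment built on advanced multimodal large language models (MLLMs). We extracted 3,500 key points across 12 categories, then used an MLLM to compare each generated image against these key points based on its visual content and to score it automatically. Mean Image Accuracy is the image-level average score over all key points, while Global Accuracy averages the scores over all key points directly.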
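The difference between the two averages is easiest to see on a toy example. The sketch below is one plausible reading of the definitions above, not the SSAE implementation: it assumes each key point receives a score of 0 or 1 and that scores are grouped by image. `mean_image_accuracy` and `global_accuracy` are illustrative names.

```python
def mean_image_accuracy(scores_per_image):
    """Average the per-image mean score, so every image has equal weight."""
    per_image = [sum(s) / len(s) for s in scores_per_image if s]
    return sum(per_image) / len(per_image)


def global_accuracy(scores_per_image):
    """Average over all key points pooled together, so every key point
    has equal weight regardless of which image it belongs to."""
    pooled = [x for s in scores_per_image for x in s]
    return sum(pooled) / len(pooled)


# Toy data (made up): image A has 4 key points, image B has 2.
scores = [[1, 1, 0, 1], [1, 0]]
print(mean_image_accuracy(scores))  # (0.75 + 0.5) / 2 = 0.625
print(global_accuracy(scores))      # 4 / 6, about 0.667
```

The two numbers diverge whenever images carry different numbers of key points, which is why the table reports both.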
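| Model | Open source | Mean Image Accuracy | Global Accuracy | Primary subject: noun | Primary subject: key attributes | Primary subject: other attributes | Primary subject: action | Secondary subject: noun | Secondary subject: attributes | Secondary subject: action | Scene: noun | Scene: attributes | Other: shot | Other: style | Other: composition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|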
| FLUX-dev | ✅ | 0.7122 | 0.6995 | 0.7965 | 0.7824 | 0.5993 | 0.5777 | 0.7950 | 0.6826 | 0.6923 | 0.8453 | 0.8094 | 0.6452 | 0.7096 | 0.6190 |
| Seedream-3.0 | ❌ | 0.8827 | 0.8792 | 0.9490 | 0.9311 | 0.8242 | 0.8177 | 0.9747 | 0.9103 | 0.8400 | 0.9489 | 0.8848 | 0.7582 | 0.8726 | 0.7619 |
| Qwen-Image | ✅ | 0.8854 | 0.8828 | 0.9502 | 0.9231 | 0.8351 | 0.8161 | 0.9938 | 0.9043 | 0.8846 | 0.9613 | 0.8978 | 0.7634 | 0.8548 | 0.8095 |
| GPT-Image | ❌ | 0.8952 | 0.8929 | 0.9448 | 0.9289 | 0.8655 | 0.8445 | 0.9494 | 0.9283 | 0.8800 | 0.9432 | 0.9017 | 0.7253 | 0.8582 | 0.7143 |
| HunyuanImage 2.1 | ✅ | 0.8888 | 0.8832 | 0.9339 | 0.9341 | 0.8363 | 0.8342 | 0.9627 | 0.8870 | 0.9615 | 0.9448 | 0.9254 | 0.7527 | 0.8689 | 0.7619 |
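According to the SSAE results, our model currently achieves the best semantic alignment among open-source models and comes very close to the closed-source commercial model GPT-Image.
Join our Discord server or WeChat group to share ideas, explore collaborations, or ask questions. You can also open an issue or submit a pull request on GitHub. Your feedback is invaluable and drives the continued improvement of HunyuanImage. Thank you for being part of our community!
If this project helps your research or applications, please cite: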
@misc{HunyuanImage-2.1,
  title={HunyuanImage 2.1: An Efficient Diffusion Model for High-Resolution (2K) Text-to-Image Generation},
  author={Tencent Hunyuan Team},
  year={2025},
  howpublished={\url{https://github.com/Tencent-Hunyuan/HunyuanImage-2.1}},
}
We thank the following open-source projects and communities for their contributions to open research and exploration: Qwen, FLUX, diffusers, and HuggingFace.