
CogView4-6B

🤗 Space | 🌐 Github | 📜 CogView3 Paper


Inference Requirements and Model Introduction

  • Resolution: Width and height must each be between 512px and 2048px and divisible by 32, and the total number of pixels (width × height) must not exceed 2^21.
  • Precision: BF16 or FP32. FP16 is not supported because it overflows and produces completely black images.
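The resolution rule above can be checked before calling the pipeline. The following is a minimal sketch of a literal reading of that rule; the function name `check_resolution` and the constants are hypothetical helpers for illustration and are not part of diffusers or the CogView4 code, and the pipeline's own validation may differ.

```python
# Hypothetical helper: a literal reading of the resolution rule stated above.
MIN_SIDE = 512        # minimum width/height in pixels
MAX_SIDE = 2048       # maximum width/height in pixels
MULTIPLE = 32         # each side must be divisible by this value
MAX_PIXELS = 2 ** 21  # upper bound on width * height


def check_resolution(width: int, height: int) -> list[str]:
    """Return the list of violated constraints (empty if the size is acceptable)."""
    problems = []
    for name, value in (("width", width), ("height", height)):
        if not MIN_SIDE <= value <= MAX_SIDE:
            problems.append(f"{name}={value} is outside [{MIN_SIDE}, {MAX_SIDE}]")
        if value % MULTIPLE != 0:
            problems.append(f"{name}={value} is not divisible by {MULTIPLE}")
    if width * height > MAX_PIXELS:
        problems.append(
            f"{width}x{height} = {width * height} pixels exceeds {MAX_PIXELS}"
        )
    return problems


if __name__ == "__main__":
    for size in [(1024, 1024), (2048, 1024), (2048, 2048)]:
        issues = check_resolution(*size)
        print(size, "OK" if not issues else issues)
```

Returning a list of messages rather than a boolean makes it easier to report every violated constraint at once.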

Memory usage measured with BF16 precision and batchsize=4 is shown in the table below:

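| Resolution  | enable_model_cpu_offload OFF | enable_model_cpu_offload ON | enable_model_cpu_offload ON, Text Encoder 4bit |
|-------------|------------------------------|-----------------------------|------------------------------------------------|
| 512 * 512   | 33GB                         | 20GB                        | 13GB                                           |
| 1280 * 720  | 35GB                         | 20GB                        | 13GB                                           |
| 1024 * 1024 | 35GB                         | 20GB                        | 13GB                                           |
| 1920 * 1280 | 39GB                         | 20GB                        | 14GB                                           |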

Quick Start

First, ensure you install the diffusers library from source.

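# Install the latest diffusers directly from GitHub
pip install git+https://github.com/huggingface/diffusers.git

# Or, for an editable install, clone the repository first
git clone https://github.com/huggingface/diffusers.git
cd diffusers
pip install -e .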

Then, run the following code:

import torch
from diffusers import CogView4Pipeline

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)

# Enable the following options to reduce GPU memory usage
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."

image = pipe(
    prompt=prompt,
    guidance_scale=3.5,
    num_images_per_prompt=1,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]

image.save("cogview4.png")

Model Metrics

We've tested on multiple benchmarks and achieved the following scores:

DPG-Bench

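| Model        | Overall | Global | Entity | Attribute | Relation | Other |
|--------------|---------|--------|--------|-----------|----------|-------|
| SDXL         | 74.65   | 83.27  | 82.43  | 80.91     | 86.76    | 80.41 |
| PixArt-alpha | 71.11   | 74.97  | 79.32  | 78.60     | 82.57    | 76.96 |
| SD3-Medium   | 84.08   | 87.90  | 91.01  | 88.83     | 80.70    | 88.68 |
| DALL-E 3     | 83.50   | 90.97  | 89.61  | 88.39     | 90.58    | 89.83 |
| Flux.1-dev   | 83.79   | 85.80  | 86.79  | 89.98     | 90.04    | 89.90 |
| Janus-Pro-7B | 84.19   | 86.90  | 88.90  | 89.40     | 89.32    | 89.48 |
| CogView4-6B  | 85.13   | 83.85  | 90.35  | 91.17     | 91.14    | 87.29 |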

GenEval

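| Model        | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Color attribution |
|--------------|---------|-------------|----------|----------|--------|----------|-------------------|
| SDXL         | 0.55    | 0.98        | 0.74     | 0.39     | 0.85   | 0.15     | 0.23              |
| PixArt-alpha | 0.48    | 0.98        | 0.50     | 0.44     | 0.80   | 0.08     | 0.07              |
| SD3-Medium   | 0.74    | 0.99        | 0.94     | 0.72     | 0.89   | 0.33     | 0.60              |
| DALL-E 3     | 0.67    | 0.96        | 0.87     | 0.47     | 0.83   | 0.43     | 0.45              |
| Flux.1-dev   | 0.66    | 0.98        | 0.79     | 0.73     | 0.77   | 0.22     | 0.45              |
| Janus-Pro-7B | 0.80    | 0.99        | 0.89     | 0.59     | 0.90   | 0.79     | 0.66              |
| CogView4-6B  | 0.73    | 0.99        | 0.86     | 0.66     | 0.79   | 0.48     | 0.58              |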

T2I-CompBench

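| Model        | Color  | Shape  | Texture | 2D-Spatial | 3D-Spatial | Numeracy | Non-spatial Clip | Complex 3-in-1 |
|--------------|--------|--------|---------|------------|------------|----------|------------------|----------------|
| SDXL         | 0.5879 | 0.4687 | 0.5299  | 0.2133     | 0.3566     | 0.4988   | 0.3119           | 0.3237         |
| PixArt-alpha | 0.6690 | 0.4927 | 0.6477  | 0.2064     | 0.3901     | 0.5058   | 0.3197           | 0.3433         |
| SD3-Medium   | 0.8132 | 0.5885 | 0.7334  | 0.3200     | 0.4084     | 0.6174   | 0.3140           | 0.3771         |
| DALL-E 3     | 0.7785 | 0.6205 | 0.7036  | 0.2865     | 0.3744     | 0.5880   | 0.3003           | 0.3773         |
| Flux.1-dev   | 0.7572 | 0.5066 | 0.6300  | 0.2700     | 0.3992     | 0.6165   | 0.3065           | 0.3628         |
| Janus-Pro-7B | 0.5145 | 0.3323 | 0.4069  | 0.1566     | 0.2753     | 0.4406   | 0.3137           | 0.3806         |
| CogView4-6B  | 0.7786 | 0.5880 | 0.6983  | 0.3075     | 0.3708     | 0.6626   | 0.3056           | 0.3869         |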

Chinese Text Accuracy Evaluation

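| Model       | Precision | Recall | F1 Score | Pick@4 |
|-------------|-----------|--------|----------|--------|
| Kolors      | 0.6094    | 0.1886 | 0.2880   | 0.1633 |
| CogView4-6B | 0.6969    | 0.5532 | 0.6168   | 0.3265 |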
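The F1 scores in the Chinese text accuracy results can be cross-checked against the reported precision and recall. The sketch below assumes F1 is the standard harmonic mean of precision and recall; the source does not state the formula, and `f1_score` is a hypothetical helper written for this illustration. Small differences in the last digit are expected because the published inputs are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results above
reported = {
    "Kolors": (0.6094, 0.1886, 0.2880),
    "CogView4-6B": (0.6969, 0.5532, 0.6168),
}

if __name__ == "__main__":
    for model, (p, r, f1) in reported.items():
        computed = f1_score(p, r)
        print(f"{model}: computed F1 = {computed:.4f}, reported F1 = {f1:.4f}")
```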

Citation

🌟 If you find our work helpful, please consider citing our paper and starring the repository.

@article{zheng2024cogview3,
  title={Cogview3: Finer and faster text-to-image generation via relay diffusion},
  author={Zheng, Wendi and Teng, Jiayan and Yang, Zhuoyi and Wang, Weihan and Chen, Jidong and Gu, Xiaotao and Dong, Yuxiao and Ding, Ming and Tang, Jie},
  journal={arXiv preprint arXiv:2403.05121},
  year={2024}
}

License

This model is released under the Apache 2.0 License.
