ai-models/mit-han-lab/svdq-int4-flux.1-fill-dev

Quantization Library: DeepCompressor
Inference Engine: Nunchaku

[Paper] [Code] [Demo] [Website] [Blog]

svdq-int4-flux.1-fill-dev is an INT4-quantized version of FLUX.1-Fill-dev, which can fill areas in existing images based on a text description. It offers approximately 4× memory savings while also running 2–3× faster than the original BF16 model.

Method

Quantization Method -- SVDQuant

Overview of SVDQuant. Stage 1: Originally, both the activation X and the weights W contain outliers, making 4-bit quantization challenging. Stage 2: We migrate the outliers from activations to weights, resulting in the updated activation and weight. While the activation becomes easier to quantize, the weight now becomes more difficult. Stage 3: SVDQuant further decomposes the weight into a low-rank component and a residual with SVD. Thus, the quantization difficulty is alleviated by the low-rank branch, which runs at 16-bit precision.
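The three stages can be sketched numerically with NumPy. This is a minimal illustration, not the DeepCompressor implementation: the SmoothQuant-style smoothing factor with `alpha=0.5`, the per-token and per-output-channel scales, the synthetic outlier data, and every function name are assumptions made for the example, and the low-rank branch runs in NumPy's default float precision as a stand-in for 16-bit.

```python
import numpy as np


def fake_quant_int4(t, axis):
    """Round to 16 signed levels with one scale per slice along `axis`, then dequantize."""
    scale = np.abs(t).max(axis=axis, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero slices
    return np.clip(np.round(t / scale), -8, 7) * scale


def naive_w4a4(x, w):
    """Stage 1 baseline: quantize X per token and W per output channel directly."""
    return fake_quant_int4(x, axis=1) @ fake_quant_int4(w, axis=0)


def svdquant_sketch(x, w, rank=32, alpha=0.5):
    """Illustrative forward pass: smoothing, SVD split, 4-bit residual branch."""
    # Stage 2: migrate outliers from activations to weights, one factor per
    # input channel (assumed SmoothQuant-style formula).
    lam = np.abs(x).max(axis=0) ** alpha / np.abs(w).max(axis=1) ** (1 - alpha)
    x_hat = x / lam
    w_hat = w * lam[:, None]
    # Stage 3: split the updated weight into a rank-`rank` part and a residual.
    u, s, vt = np.linalg.svd(w_hat, full_matrices=False)
    l1, l2 = u[:, :rank] * s[:rank], vt[:rank]
    residual = w_hat - l1 @ l2
    # The low-rank branch stays in high precision; only the residual branch is 4-bit.
    low_rank_out = x_hat @ l1 @ l2
    residual_out = fake_quant_int4(x_hat, axis=1) @ fake_quant_int4(residual, axis=0)
    return low_rank_out + residual_out


def relative_error(approx, exact):
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)


def make_example(seed=0):
    """Synthetic activations with a few outlier channels (hypothetical data)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((64, 128))
    x[:, :4] *= 30.0  # four outlier input channels
    w = rng.standard_normal((128, 96))
    return x, w


if __name__ == "__main__":
    x, w = make_example()
    exact = x @ w
    print("naive W4A4 relative error:", relative_error(naive_w4a4(x, w), exact))
    print("SVDQuant sketch relative error:", relative_error(svdquant_sketch(x, w), exact))
```

On this synthetic input the sketch should give a lower relative error than the naive baseline, because the outlier rows that smoothing moves into the weight are absorbed by the low-rank branch before quantization.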

Nunchaku Engine Design

(a) Naïvely running the low-rank branch with rank 32 introduces 57% latency overhead, due to the extra read of 16-bit inputs in Down Projection and the extra write of 16-bit outputs in Up Projection. Nunchaku reduces this overhead with kernel fusion. (b) The Down Projection and Quantize kernels use the same input, while the Up Projection and 4-Bit Compute kernels share the same output. To reduce data movement overhead, we fuse the first two kernels together and the latter two kernels together.
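The fusion pattern in (b) can be mimicked in NumPy to show which operations share data. This is only a dataflow sketch: it says nothing about real latency, and the function names, scale layout, and tensor shapes are hypothetical rather than Nunchaku's kernel interface. The point is that each fused stand-in touches the shared tensor once and still reproduces the four-step result.

```python
import numpy as np


def quantize_int4(t, axis):
    """Return (4-bit codes stored as int8, scale), one scale per slice along `axis`."""
    scale = np.abs(t).max(axis=axis, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero slices
    codes = np.clip(np.round(t / scale), -8, 7).astype(np.int8)
    return codes, scale


def fused_quantize_down_project(x, l1):
    """Stand-in for fused Quantize + Down Projection: both consume the same input x."""
    x_codes, x_scale = quantize_int4(x, axis=1)
    down = x @ l1  # low-rank activations, kept in high precision
    return x_codes, x_scale, down


def fused_int4_matmul_up_project(x_codes, x_scale, r_codes, r_scale, down, l2):
    """Stand-in for fused 4-Bit Compute + Up Projection: both produce the same output."""
    acc = x_codes.astype(np.int32) @ r_codes.astype(np.int32)  # integer accumulation
    out = acc * x_scale * r_scale  # dequantize the accumulator
    out += down @ l2               # add the low-rank branch before the single write
    return out


def unfused_reference(x, l1, l2, r_codes, r_scale):
    """The same math as four separate steps, each with its own pass over the data."""
    x_codes, x_scale = quantize_int4(x, axis=1)      # step 1: reads x
    down = x @ l1                                    # step 2: reads x again
    out = (x_codes * x_scale) @ (r_codes * r_scale)  # step 3: writes the output
    return out + down @ l2                           # step 4: reads and rewrites the output


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 64))
    l1, l2 = rng.standard_normal((64, 4)), rng.standard_normal((4, 16))
    r_codes, r_scale = quantize_int4(rng.standard_normal((64, 16)), axis=0)
    x_codes, x_scale, down = fused_quantize_down_project(x, l1)
    fused = fused_int4_matmul_up_project(x_codes, x_scale, r_codes, r_scale, down, l2)
    print(np.allclose(fused, unfused_reference(x, l1, l2, r_codes, r_scale)))
```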

Model Description

Developed by: MIT, NVIDIA, CMU, Princeton, UC Berkeley, SJTU and Pika Labs
Model type: INT W4A4 model
Model size: 6.64GB
Model resolution: The number of pixels (height × width) needs to be a multiple of 65,536 (256 × 256); for example, 1024 × 1024 qualifies.
License: Apache-2.0
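The resolution constraint above is easy to check before calling the pipeline. The helper below is a hypothetical convenience, not part of Nunchaku or Diffusers, and it tests only the stated pixel-count rule; the pipeline may impose further requirements on height and width.

```python
PIXEL_MULTIPLE = 65_536  # 256 * 256, taken from the model description


def is_supported_resolution(height: int, width: int) -> bool:
    """Return True if height * width is a positive multiple of 65,536."""
    if height <= 0 or width <= 0:
        return False
    return (height * width) % PIXEL_MULTIPLE == 0


if __name__ == "__main__":
    for size in [(1024, 1024), (768, 1024), (720, 1280)]:
        print(size, is_supported_resolution(*size))
```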

Usage

Diffusers

Please follow the instructions in mit-han-lab/nunchaku to set up the environment. Then you can run the model with:


import torch
from diffusers import FluxFillPipeline
from diffusers.utils import load_image

from nunchaku.models.transformer_flux import NunchakuFluxTransformer2dModel

# Input image and the mask marking the region to fill.
image = load_image("https://huggingface.co/mit-han-lab/svdq-int4-flux.1-fill-dev/resolve/main/example.png")
mask = load_image("https://huggingface.co/mit-han-lab/svdq-int4-flux.1-fill-dev/resolve/main/mask.png")

# Load the INT4 transformer and use it in place of the BF16 one in the FLUX.1-Fill-dev pipeline.
transformer = NunchakuFluxTransformer2dModel.from_pretrained("mit-han-lab/svdq-int4-flux.1-fill-dev")
pipe = FluxFillPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev", transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")
# height * width must be a multiple of 65,536 (see Model Description).
image = pipe(
    prompt="A wooden basket of a cat.",
    image=image,
    mask_image=mask,
    height=1024,
    width=1024,
    guidance_scale=30,
    num_inference_steps=50,
    max_sequence_length=512,
).images[0]
image.save("flux.1-fill-dev.png")

ComfyUI

Work in progress. Stay tuned!

Limitations

The model runs only on NVIDIA GPUs with architectures sm_86 (Ampere: RTX 3090, A6000), sm_89 (Ada: RTX 4090), and sm_80 (Ampere: A100). See this issue for more details.
You may observe slight differences in fine details compared with the BF16 model.
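A quick way to check the first limitation is to compare the GPU's CUDA compute capability against the listed architectures. This is a sketch with hypothetical helper names: the supported set is copied from the list above, and `torch.cuda.get_device_capability()` is used on the assumption that PyTorch is installed with CUDA support.

```python
# Compute capabilities listed under Limitations: sm_80, sm_86, sm_89.
SUPPORTED_CAPABILITIES = {(8, 0), (8, 6), (8, 9)}


def arch_name(capability):
    """Format a (major, minor) compute capability as an sm_XY string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def is_supported_capability(capability):
    """Return True if the (major, minor) pair is one of the listed architectures."""
    return tuple(capability) in SUPPORTED_CAPABILITIES


def current_gpu_is_supported():
    """Query the default CUDA device; returns False when no CUDA device is visible."""
    import torch  # imported lazily so the pure helpers work without PyTorch

    if not torch.cuda.is_available():
        return False
    return is_supported_capability(torch.cuda.get_device_capability())


if __name__ == "__main__":
    print(arch_name((8, 9)), is_supported_capability((8, 9)))
```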

Citation

If you find this model useful or relevant to your research, please cite:


@inproceedings{
  li2024svdquant,
  title={SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models},
  author={Li*, Muyang and Lin*, Yujun and Zhang*, Zhekai and Cai, Tianle and Li, Xiuyu and Guo, Junxian and Xie, Enze and Meng, Chenlin and Zhu, Jun-Yan and Han, Song},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}