---
base_model:
- Qwen/Qwen-Image
base_model_relation: quantized
tags:
- dfloat11
- df11
- lossless compression
- 70% size, 100% accuracy
---
# DFloat11 Compressed Model: `Qwen/Qwen-Image`

This is a **DFloat11 losslessly compressed** version of the original `Qwen/Qwen-Image` model. It reduces model size by **32%** compared to the original BFloat16 model, while maintaining **bit-identical outputs** and supporting **efficient GPU inference**.

🔥🔥🔥 Thanks to DFloat11 compression, Qwen-Image can now run on **a single 32GB GPU**, or on **a single 16GB GPU with CPU offloading**, while maintaining full model quality. 🔥🔥🔥

### 📊 Performance Comparison

| Model                                  | Model Size | Peak GPU Memory (1328x1328 image generation) | Generation Time (A100 GPU) |
|----------------------------------------|------------|----------------------------------------------|----------------------------|
| Qwen-Image (BFloat16)                  | ~41 GB     | OOM                                          | -                          |
| Qwen-Image (DFloat11)                  | 28.42 GB   | 29.74 GB                                     | 100 seconds                |
| Qwen-Image (DFloat11 + CPU Offloading) | 28.42 GB   | 16.68 GB                                     | 260 seconds                |

### 🔧 How to Use

1. Install or upgrade the DFloat11 pip package *(installs the CUDA kernel automatically; requires a CUDA-compatible GPU and an existing PyTorch installation)*:

    ```bash
    pip install -U "dfloat11[cuda12]"
    ```

2. Install or upgrade diffusers:

    ```bash
    pip install git+https://github.com/huggingface/diffusers
    ```

3. Save the following code to a Python file `qwen_image.py`:

    ```python
    import argparse

    import torch
    from diffusers import DiffusionPipeline, QwenImageTransformer2DModel
    from transformers.modeling_utils import no_init_weights
    from dfloat11 import DFloat11Model


    def parse_args():
        parser = argparse.ArgumentParser(description='Generate images using Qwen-Image model')
        parser.add_argument('--cpu_offload', action='store_true', help='Enable CPU offloading')
        parser.add_argument('--cpu_offload_blocks', type=int, default=None, help='Number of transformer blocks to offload to CPU')
        parser.add_argument('--no_pin_memory', action='store_true', help='Disable memory pinning')
        parser.add_argument('--prompt', type=str, default='A coffee shop entrance features a chalkboard sign reading "Qwen Coffee 😊 $2 per cup," with a neon light beside it displaying "通义千问". Next to it hangs a poster showing a beautiful Chinese woman, and beneath the poster is written "π≈3.1415926-53589793-23846264-33832795-02384197".',
                            help='Text prompt for image generation')
        parser.add_argument('--negative_prompt', type=str, default=' ',
                            help='Negative prompt for image generation')
        parser.add_argument('--aspect_ratio', type=str, default='16:9', choices=['1:1', '16:9', '9:16', '4:3', '3:4'],
                            help='Aspect ratio of generated image')
        parser.add_argument('--num_inference_steps', type=int, default=50,
                            help='Number of denoising steps')
        parser.add_argument('--true_cfg_scale', type=float, default=4.0,
                            help='Classifier-free guidance scale')
        parser.add_argument('--seed', type=int, default=42,
                            help='Random seed for generation')
        parser.add_argument('--output', type=str, default='example.png',
                            help='Output image path')
        parser.add_argument('--language', type=str, default='en', choices=['en', 'zh'],
                            help='Language for positive magic prompt')
        return parser.parse_args()


    args = parse_args()

    model_name = "Qwen/Qwen-Image"

    # Instantiate the transformer from its config, skipping weight initialization;
    # the DFloat11 weights are loaded into it below.
    with no_init_weights():
        transformer = QwenImageTransformer2DModel.from_config(
            QwenImageTransformer2DModel.load_config(
                model_name, subfolder="transformer",
            ),
        ).to(torch.bfloat16)

    # Load the compressed weights into `transformer` (the return value is not used).
    DFloat11Model.from_pretrained(
        "DFloat11/Qwen-Image-DF11",
        device="cpu",
        cpu_offload=args.cpu_offload,
        cpu_offload_blocks=args.cpu_offload_blocks,
        pin_memory=not args.no_pin_memory,
        bfloat16_model=transformer,
    )

    # Build the rest of the pipeline from the original repo, reusing the transformer.
    pipe = DiffusionPipeline.from_pretrained(
        model_name,
        transformer=transformer,
        torch_dtype=torch.bfloat16,
    )
    pipe.enable_model_cpu_offload()

    positive_magic = {
        "en": "Ultra HD, 4K, cinematic composition.",  # for English prompts
        "zh": "超清,4K,电影级构图",  # for Chinese prompts
    }

    # Supported aspect ratios, mapped to (width, height) in pixels
    aspect_ratios = {
        "1:1": (1328, 1328),
        "16:9": (1664, 928),
        "9:16": (928, 1664),
        "4:3": (1472, 1140),
        "3:4": (1140, 1472),
    }

    width, height = aspect_ratios[args.aspect_ratio]

    image = pipe(
        prompt=args.prompt + positive_magic[args.language],
        negative_prompt=args.negative_prompt,
        width=width,
        height=height,
        num_inference_steps=args.num_inference_steps,
        true_cfg_scale=args.true_cfg_scale,
        generator=torch.Generator(device="cuda").manual_seed(args.seed)
    ).images[0]

    image.save(args.output)

    # Report peak GPU memory in decimal gigabytes
    max_memory = torch.cuda.max_memory_allocated()
    print(f"Max memory: {max_memory / (1000 ** 3):.2f} GB")
    ```

4. To run without CPU offloading (32GB VRAM required):

    ```bash
    python qwen_image.py
    ```

    To run with CPU offloading (16GB VRAM required):

    ```bash
    python qwen_image.py --cpu_offload
    ```

    If you run out of CPU memory, try limiting the number of offloaded blocks or disabling memory pinning:

    ```bash
    # Offload only 16 blocks (offloading more blocks uses less GPU memory and more CPU memory; offloading fewer blocks is faster):
    python qwen_image.py --cpu_offload --cpu_offload_blocks 16

    # Disable memory pinning (the most memory-efficient option, but possibly slower):
    python qwen_image.py --cpu_offload --no_pin_memory
    ```

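
As a rough aid for choosing between the two modes, the sketch below compares a GPU's total memory with the peak figures from the performance table. The helper names `choose_flags` and `detect_total_vram_bytes` are hypothetical and not part of DFloat11, and the thresholds assume a workload similar to the 1328x1328 benchmark, so treat the result as a starting point rather than a guarantee.

```python
from typing import List, Optional

# Peak GPU memory in decimal GB, copied from the performance table.
PEAK_GB_NO_OFFLOAD = 29.74
PEAK_GB_CPU_OFFLOAD = 16.68


def choose_flags(total_vram_bytes: int) -> Optional[List[str]]:
    """Suggest extra flags for qwen_image.py, or None if the GPU looks too small.

    Hypothetical helper: it only compares total VRAM with the benchmark peaks.
    """
    total_gb = total_vram_bytes / 1000 ** 3  # same unit the script prints
    if total_gb > PEAK_GB_NO_OFFLOAD:
        return []  # enough room to keep every block on the GPU
    if total_gb > PEAK_GB_CPU_OFFLOAD:
        return ["--cpu_offload"]
    return None


def detect_total_vram_bytes() -> Optional[int]:
    """Total memory of the current CUDA device, or None if it cannot be queried."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    _free, total = torch.cuda.mem_get_info()
    return total


if __name__ == "__main__":
    total = detect_total_vram_bytes()
    if total is None:
        print("No CUDA device detected.")
    else:
        flags = choose_flags(total)
        if flags is None:
            print("GPU memory is below the benchmarked peak even with CPU offloading.")
        else:
            print("Suggested command:", " ".join(["python", "qwen_image.py"] + flags))
```

On a 24 GiB card, for example, `choose_flags` suggests `--cpu_offload`, since the card's total memory falls between the two benchmarked peaks.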
### 🔍 How It Works

We apply **Huffman coding** to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU.

The result is a model that is **~32% smaller**, delivers **bit-identical outputs**, and achieves performance **comparable to the original** BFloat16 model.

Learn more in our [research paper](https://arxiv.org/abs/2504.11651).

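
To make the exponent-compression idea concrete, here is a minimal pure-Python sketch. It is not the DFloat11 format or its CUDA kernel. It assumes synthetic Gaussian weights (standard deviation 0.02, an arbitrary choice) in place of real model weights, and it takes the BFloat16 exponent from the float32 bit pattern, which matches BFloat16 when the conversion truncates. It measures the empirical entropy of the 8 exponent bits, builds Huffman code lengths for them, and estimates the average bits per weight if sign and mantissa are stored unchanged.

```python
import heapq
import math
import random
import struct
from collections import Counter


def bf16_exponent(x: float) -> int:
    """Return the 8 exponent bits that float32 and (truncated) BFloat16 share."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return (bits >> 23) & 0xFF


def huffman_code_lengths(freqs):
    """Map each symbol to the length in bits of its Huffman code."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap entries: (weight, unique tie-breaker, symbols in this subtree)
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # each merge adds one bit to every symbol beneath it
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths


# Synthetic stand-in for model weights (assumption: roughly Gaussian values).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(100_000)]

exponents = [bf16_exponent(w) for w in weights]
freqs = Counter(exponents)
n = len(exponents)

# Empirical entropy of the exponent field, in bits per weight
entropy = -sum(c / n * math.log2(c / n) for c in freqs.values())

lengths = huffman_code_lengths(freqs)
avg_exp_bits = sum(freqs[s] * lengths[s] for s in freqs) / n

# 1 sign bit + 7 mantissa bits stay uncompressed; only the exponent is coded.
bits_per_weight = 1 + 7 + avg_exp_bits

print(f"Exponent entropy:       {entropy:.2f} bits (of 8)")
print(f"Huffman-coded exponent: {avg_exp_bits:.2f} bits on average")
print(f"Estimated size:         {bits_per_weight:.2f} bits per weight (vs. 16)")
```

Because the sign and mantissa bits are stored as-is, every saved bit comes from the exponent field, and decoding the Huffman codes recovers each exponent exactly, which is why the compression is lossless.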
### 📄 Learn More

* **Paper**: [70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float](https://arxiv.org/abs/2504.11651)
* **GitHub**: [https://github.com/LeanModels/DFloat11](https://github.com/LeanModels/DFloat11)
* **HuggingFace**: [https://huggingface.co/DFloat11](https://huggingface.co/DFloat11)