nielsr HF Staff committed on
Commit 1978b1e · verified · 1 Parent(s): bc80d45

Add model card for Image Tokenizer Needs Post-Training


This PR significantly improves the model card for "Image Tokenizer Needs Post-Training" by:

- Adding the `pipeline_tag: image-feature-extraction`, which helps users discover this model on the Hub at https://huggingface.co/models?pipeline_tag=image-feature-extraction.
- Including direct links to the paper ([Image Tokenizer Needs Post-Training](https://huggingface.co/papers/2509.12474)), the official project page (https://qiuk2.github.io/works/RobusTok/index.html), and the GitHub repository (https://github.com/qiuk2/RobusTok).
- Enriching the model card content by incorporating the paper abstract, key highlights, model zoo, installation instructions, training procedures, and inference commands directly from the original GitHub README.
- Adding relevant visualizations and the BibTeX citation.

Per the guidelines, `library_name` and `license` tags have not been added, as no explicit evidence for them was found in the provided documentation (GitHub README, paper abstract). Likewise, no custom Python code snippet for sample usage was generated; instead, the original shell commands for training and inference from the GitHub repository are provided, in keeping with the "Do not make up code yourself" disclaimer.

Files changed (1)
  1. README.md +210 -1
README.md CHANGED
@@ -1 +1,210 @@
- Image Tokenizers Needs Post-Training
+ ---
+ pipeline_tag: image-feature-extraction
+ ---
+
+ # Image Tokenizer Needs Post-Training
+
+ [![project page](https://img.shields.io/badge/%20project%20page-lightblue)](https://qiuk2.github.io/works/RobusTok/index.html)
+ [![arXiv](https://img.shields.io/badge/arXiv%20paper-2509.12474-b31b1b.svg)](https://arxiv.org/abs/2509.12474)
+ [![🤗 Weights](https://img.shields.io/badge/%F0%9F%A4%97%20Weights-yellow)](https://huggingface.co/qiuk6/RobusTok)
+
+ This repository contains the official implementation of the paper [Image Tokenizer Needs Post-Training](https://huggingface.co/papers/2509.12474).
+
+ Project Page: https://qiuk2.github.io/works/RobusTok/index.html
+ GitHub Repository: https://github.com/qiuk2/RobusTok
+
+ ## Abstract
+
+ Recent image generative models typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. However, there is a significant discrepancy between the reconstruction and generation distributions: current tokenizers only prioritize the reconstruction task, which happens before generative training, without considering the generation errors that arise during sampling. In this paper, we comprehensively analyze the reason for this discrepancy in a discrete latent space, and from this analysis we propose a novel tokenizer training scheme including both main-training and post-training, which improve latent space construction and decoding, respectively. During main training, a latent perturbation strategy is proposed to simulate sampling noises, i.e., the unexpected tokens generated in generative inference. Specifically, we propose a plug-and-play tokenizer training scheme that significantly enhances the robustness of the tokenizer, boosting generation quality and convergence speed, and a novel tokenizer evaluation metric, i.e., pFID, which successfully correlates tokenizer performance with generation quality. During post-training, we further optimize the tokenizer decoder with respect to a well-trained generative model to mitigate the distribution difference between generated and reconstructed tokens. With a ~400M generator, a discrete tokenizer trained with our proposed main training achieves a notable 1.60 gFID and further obtains 1.36 gFID with the additional post-training. Further experiments broadly validate the effectiveness of our post-training strategy on off-the-shelf discrete and continuous tokenizers, coupled with autoregressive and diffusion-based generators.
+
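+ The latent perturbation idea can be made concrete with a small sketch. The code below is only an illustration of the concept, not the repository's implementation, assuming a vanilla VQ setting where each image is a grid of discrete codebook indices: a fraction σ of tokens is replaced with random codebook entries to mimic the unexpected tokens a generator emits at sampling time. Decoding such perturbed tokens and measuring FID against real images is the spirit of the pFID metric.
+
+ ```python
+ import torch
+
+ def perturb_tokens(indices: torch.Tensor, sigma: float, codebook_size: int) -> torch.Tensor:
+     """Replace a fraction `sigma` of latent tokens with random codebook indices.
+
+     indices: (B, N) integer tensor of quantized token ids, e.g. N = 16 * 16 = 256.
+     """
+     mask = torch.rand(indices.shape, device=indices.device) < sigma  # tokens to corrupt
+     random_ids = torch.randint(0, codebook_size, indices.shape, device=indices.device)
+     return torch.where(mask, random_ids, indices)
+
+ # Toy usage: a batch of 2 images, 256 tokens each, codebook of 4096 entries.
+ tokens = torch.randint(0, 4096, (2, 256))
+ noisy = perturb_tokens(tokens, sigma=0.1, codebook_size=4096)
+ ```
+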
+ ## TL;DR
+
+ We present RobusTok, a new image tokenizer with a two-stage training scheme:
+
+ - Main training → constructs a robust latent space.
+ - Post-training → aligns the generator's latent distribution with its image space.
+
+ ## Key Highlights of Post-Training
+
+ - 🚀 **Better generative quality**: gFID 1.60 → 1.36.
+ - 🔑 **Generalizability**: applicable to both autoregressive & diffusion models.
+ - ⚡ **Efficiency**: strong results with only ~400M generative models.
+
+ ## Model Zoo
+
+ | Generator \ Tokenizer | RobusTok w/o P.T. ([weights](https://huggingface.co/qiuk6/RobusTok/resolve/main/main-train.pt?download=true)) | RobusTok w/ P.T. ([weights](https://huggingface.co/qiuk6/RobusTok/resolve/main/post-train.pt?download=true)) |
+ |---|---:|---:|
+ | Base ([weights](https://huggingface.co/qiuk6/RobusTok/resolve/main/rar_b.bin?download=true)) | gFID = 1.83 | gFID = 1.60 |
+ | Large ([weights](https://huggingface.co/qiuk6/RobusTok/resolve/main/rar_l.bin?download=true)) | gFID = 1.60 | gFID = 1.36 |
+
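+ The checkpoints above can also be fetched programmatically. A minimal sketch using `huggingface_hub`, with file names taken from the table links:
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ # Tokenizer checkpoints (before / after post-training) and RAR generator weights.
+ tokenizer_ckpt = hf_hub_download("qiuk6/RobusTok", "post-train.pt")  # or "main-train.pt"
+ generator_ckpt = hf_hub_download("qiuk6/RobusTok", "rar_l.bin")      # or "rar_b.bin"
+ print(tokenizer_ckpt, generator_ckpt)
+ ```
+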
+ ## Updates
+
+ - (2025.09.16) Paper released on arXiv.
+ - (2025.09.18) Code and checkpoints released. pFID calculation is in preparation.
+
+ ## Installation
+
+ Install all packages with
+
+ ```bash
+ conda env create -f environment.yml
+ ```
+
+ ## Dataset
+
+ Download ImageNet2012 from the official website and organize it as
+
+ ```
+ ImageNet2012
+ ├── train
+ └── val
+ ```
+
+ If you want to train or finetune on other datasets, organize them in a format that PyTorch's [ImageFolder](https://pytorch.org/vision/main/generated/torchvision.datasets.ImageFolder.html) can recognize; a quick loading check is shown below the layout.
+
+ ```
+ Dataset
+ ├── train
+ │   ├── Class1
+ │   │   ├── 1.png
+ │   │   └── 2.png
+ │   ├── Class2
+ │   │   ├── 1.png
+ │   │   └── 2.png
+ ├── val
+ ```
+
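+ As a sanity check that the layout is correct, the dataset should load directly with torchvision; paths and the transform below are placeholders:
+
+ ```python
+ from torchvision import datasets, transforms
+
+ transform = transforms.Compose([
+     transforms.Resize(256),
+     transforms.CenterCrop(256),
+     transforms.ToTensor(),
+ ])
+
+ # Each subfolder of train/ is treated as one class, matching the layout above.
+ train_set = datasets.ImageFolder("Dataset/train", transform=transform)
+ print(len(train_set), train_set.classes[:5])
+ ```
+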
+ ## Main Training for the Tokenizer
+
+ Please log in to Wandb first using
+
+ ```bash
+ wandb login
+ ```
+
+ rFID will be automatically evaluated and reported on Wandb. The checkpoint with the best rFID on the val set will be saved. We provide basic configurations in the `configs` folder.
+
+ Warning ❗: you may want to change the model-selection metric, as rFID is not closely correlated with gFID; PSNR and SSIM are also good choices.
+
+ ```bash
+ torchrun --nproc_per_node=8 tokenizer/tokenizer_image/main_train.py --config configs/main-train.yaml
+ ```
+
+ Please modify the configuration file as needed for your specific dataset. We list some important options here.
+
+ ```
+ vq_ckpt: ckpt_best.pt            # resume
+ cloud_save_path: output/exp-xx   # output dir
+ data_path: ImageNet2012/train    # training set dir
+ val_data_path: ImageNet2012/val  # val set dir
+ enc_tuning_method: 'full'        # ['full', 'lora', 'frozen']
+ dec_tuning_method: 'full'        # ['full', 'lora', 'frozen']
+ codebook_embed_dim: 32           # codebook dim
+ codebook_size: 4096              # codebook size
+ product_quant: 1                 # vanilla VQ
+ v_patch_nums: [16,]              # latent resolution for RQ ([16,] is equivalent to vanilla VQ)
+ codebook_drop: 0.1               # quantizer dropout rate if RQ is applied
+ semantic_guide: dinov2           # ['none', 'dinov2', 'clip']
+ disc_epoch_start: 56             # epoch at which the discriminator starts
+ disc_type: dinodisc              # discriminator type
+ disc_adaptive_weight: true       # adaptive weight for discriminator loss
+ ema: true                        # use EMA to update the model
+ num_latent_code: 256             # number of latent tokens (must equal v_patch_nums[-1] ** 2)
+ ```
+
+ ## Training Code for the Generator
+
+ We follow [RAR](https://github.com/bytedance/1d-tokenizer) and pretokenize the whole dataset to speed up training. We have uploaded the [pretokenized dataset](https://huggingface.co/qiuk6/RobustTok/resolve/main/RobustTok-half-pretokenized.jsonl?download=true) so you can train RobusTok-RAR directly.
+
+ ```bash
+ # training code for rar-b
+ accelerate launch scripts/train_rar.py experiment.project="rar" experiment.name="rar_b" experiment.output_dir="rar_b" model.generator.hidden_size=768 model.generator.num_hidden_layers=24 model.generator.num_attention_heads=16 model.generator.intermediate_size=3072 config=configs/generator/rar.yaml dataset.params.pretokenization=/path/to/pretokenized.jsonl model.vq_ckpt=/path/to/RobustTok.pt
+
+ # training code for rar-l
+ accelerate launch scripts/train_rar.py experiment.project="rar" experiment.name="rar_l" experiment.output_dir="rar_l" model.generator.hidden_size=1024 model.generator.num_hidden_layers=24 model.generator.num_attention_heads=16 model.generator.intermediate_size=4096 config=configs/generator/rar.yaml dataset.params.pretokenization=/path/to/pretokenized.jsonl model.vq_ckpt=/path/to/RobustTok.pt
+ ```
+
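+ If you need to pretokenize a different dataset yourself, the exact JSONL schema is defined by the repository; the sketch below only illustrates the general idea (encode each image once, cache its token ids as JSON lines), and `encode_to_indices` is a hypothetical method name, not the repo's API:
+
+ ```python
+ import json
+
+ import torch
+
+ def pretokenize(tokenizer, dataloader, out_path: str) -> None:
+     """Encode every image to discrete token ids once and cache them as JSONL."""
+     with open(out_path, "w") as f, torch.no_grad():
+         for images, labels in dataloader:
+             indices = tokenizer.encode_to_indices(images)  # hypothetical: (B, N) ids
+             for ids, label in zip(indices.tolist(), labels.tolist()):
+                 f.write(json.dumps({"class_id": label, "tokens": ids}) + "\n")
+ ```
+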
+ ## Post-Training for the Tokenizer
+
+ For post-training, we need to (1) prepare a paired dataset and (2) post-train the decoder to align it with the generated latent space.
+
+ ### Prepare data
+
+ You can follow our code with your desired dataset, σ, and sample count to generate data:
+
+ ```bash
+ torchrun --nnodes=1 --nproc_per_node=8 --rdzv-endpoint=localhost:9999 post_train_data.py config=configs/generator/rar.yaml \
+     experiment.output_dir="/path/to/data-folder" \
+     experiment.generator_checkpoint="rar_b.bin" \
+     model.vq_ckpt=/path/to/RobustTok.pt \
+     model.generator.hidden_size=768 \
+     model.generator.num_hidden_layers=24 \
+     model.generator.num_attention_heads=16 \
+     model.generator.intermediate_size=3072 \
+     model.generator.randomize_temperature=1.02 \
+     model.generator.guidance_scale=6.0 \
+     model.generator.guidance_scale_pow=1.15 \
+     --sigma 0.7 --data-path /path/to/imagenet --num_samples /number/of/generate
+ ```
+
+ ### Post-Training
+
+ ```bash
+ torchrun --nproc_per_node=8 tokenizer/tokenizer_image/xqgan_post_train.py --config configs/post-train.yaml --data-path /path/to/data-folder --pair-set /path/to/imagenet --vq-ckpt /path/to/main-train/ckpt
+ ```
+
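+ Conceptually, this step freezes the encoder and quantizer and fine-tunes only the decoder so that images decoded from *generated* tokens match their paired images. The sketch below illustrates one such update under these assumptions; the module names and the plain L1 objective are illustrative, not the repository's code:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def post_train_step(decoder, optimizer, generated_tokens, paired_images, codebook):
+     """One decoder-only update on a (generated tokens, paired image) batch.
+
+     generated_tokens: (B, N) ids sampled from the trained generator
+     paired_images:    (B, 3, H, W) images those tokens should decode to
+     codebook:         (K, D) frozen codebook embeddings
+     """
+     latents = codebook[generated_tokens]    # (B, N, D) embedding lookup
+     recon = decoder(latents)                # (B, 3, H, W)
+     loss = F.l1_loss(recon, paired_images)  # simple reconstruction objective
+     optimizer.zero_grad()
+     loss.backward()
+     optimizer.step()
+     return loss.item()
+ ```
+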
+ ## Inference Code
+
+ ```bash
+ # Reproducing RAR-B
+ torchrun --nnodes=1 --nproc_per_node=8 --rdzv-endpoint=localhost:9999 sample_imagenet_rar.py config=configs/generator/rar.yaml \
+     experiment.output_dir="rar_b" \
+     experiment.generator_checkpoint="rar_b.bin" \
+     model.vq_ckpt=/path/to/RobustTok.pt \
+     model.generator.hidden_size=768 \
+     model.generator.num_hidden_layers=24 \
+     model.generator.num_attention_heads=16 \
+     model.generator.intermediate_size=3072 \
+     model.generator.randomize_temperature=1.02 \
+     model.generator.guidance_scale=6.0 \
+     model.generator.guidance_scale_pow=1.15
+ # Run the eval script. The resulting FID should be ~1.83 before post-training and ~1.60 after post-training.
+ python3 evaluator.py VIRTUAL_imagenet256_labeled.npz rar_b.npz
+
+ # Reproducing RAR-L
+ torchrun --nnodes=1 --nproc_per_node=8 --rdzv-endpoint=localhost:9999 sample_imagenet_rar.py config=configs/generator/rar.yaml \
+     experiment.output_dir="rar_l" \
+     experiment.generator_checkpoint="rar_l.bin" \
+     model.vq_ckpt=/path/to/RobustTok.pt \
+     model.generator.hidden_size=1024 \
+     model.generator.num_hidden_layers=24 \
+     model.generator.num_attention_heads=16 \
+     model.generator.intermediate_size=4096 \
+     model.generator.randomize_temperature=1.04 \
+     model.generator.guidance_scale=6.75 \
+     model.generator.guidance_scale_pow=1.01
+ # Run the eval script. The resulting FID should be ~1.60 before post-training and ~1.36 after post-training.
+ python3 evaluator.py VIRTUAL_imagenet256_labeled.npz rar_l.npz
+ ```
+
+ ## Visualization
+
+ <div align="center">
+ <img src="assets/ft-diff.png" alt="vis" width="95%">
+ <p>
+ Visualization of 256&times;256 image generation before (top) and after (bottom) post-training. Three improvements are observed: (a) OOD mitigation, (b) color fidelity, (c) detail refinement.
+ </p>
+ </div>
+
+ ## Citation
+
+ If our work assists your research, feel free to give us a star ⭐ or cite us using:
+
+ ```bibtex
+ @misc{qiu2025imagetokenizerneedsposttraining,
+   title={Image Tokenizer Needs Post-Training},
+   author={Kai Qiu and Xiang Li and Hao Chen and Jason Kuen and Xiaohao Xu and Jiuxiang Gu and Yinyi Luo and Bhiksha Raj and Zhe Lin and Marios Savvides},
+   year={2025},
+   eprint={2509.12474},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV},
+   url={https://arxiv.org/abs/2509.12474},
+ }
+ ```