Hi Hugging Face Community,
I have the following questions regarding special tokens:
- Why doesn't `tokenizer.all_special_tokens` include the `<image>` token? I'm using the LLaVA model, which has `<image>` as a special token (defined in the `added_tokens_decoder` of `tokenizer_config.json`). The tokenizer does encode and decode it as a special token. However, when I load its tokenizer and call `tokenizer.all_special_tokens` or `tokenizer.additional_special_tokens`, the `<image>` token is not included. (A minimal reproduction is sketched after this list.)
- Where is the `<image>` token loaded? I looked into the `tokenizer.from_pretrained` function, but there doesn't seem to be a place where `added_tokens_decoder`, the config-file field in which this special token is defined, is actually read in.
- Where does `tokenizer.decode` learn that `<image>` is a special token? I tried to set a breakpoint inside it to see how it skips `<image>`, but I seem to end up in a call loop between `tokenization_utils_base.py` and `tokenization_utils_fast.py`. (See the second snippet below for the behavior I mean.)
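
For reference, this is roughly what I'm seeing for the first question. I'm using `llava-hf/llava-1.5-7b-hf` here just as an example checkpoint; any LLaVA tokenizer where `<image>` is only listed in `added_tokens_decoder` should behave the same way:

```python
from transformers import AutoTokenizer

# Example checkpoint -- substitute the LLaVA tokenizer you are using.
tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")

# <image> does not show up in the special-token lists...
print(tokenizer.all_special_tokens)         # no <image> here
print(tokenizer.additional_special_tokens)  # no <image> here either

# ...but it is present in the added-tokens table, marked as special.
for token_id, added_token in tokenizer.added_tokens_decoder.items():
    if added_token.content == "<image>":
        print(token_id, added_token)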
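And this is the decode behavior the third question is about: `<image>` gets dropped when `skip_special_tokens=True`, even though it never appears in `all_special_tokens` (again assuming the same example checkpoint):

```python
from transformers import AutoTokenizer

# Example checkpoint -- substitute the LLaVA tokenizer you are using.
tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")

# <image> is encoded as a single added token...
ids = tokenizer.encode("a photo of <image>", add_special_tokens=False)
print(ids)

# ...and decode treats it like a special token when skipping.
print(tokenizer.decode(ids, skip_special_tokens=False))  # keeps <image>
print(tokenizer.decode(ids, skip_special_tokens=True))   # <image> removed
```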
It would be really helpful if you could answer any of these questions. Thank you very much!