Best way to extend vocabulary of pretrained model?

What would be the best way to somehow “mix” a SentencePiece vocabulary trained on a corpus of English and German documents with the existing English-only vocabulary of a pretrained transformer? So I take the pretrained model (let’s say English BERT, though it’s WordPiece, I know), somehow create a new mixed vocabulary, and then fine-tune on my mixed-language downstream task.

[I am not an expert]

When BERT “chooses” which word-pieces to use in its vocabulary, it does so by taking all the individual characters plus the most common combinations that it finds in its training corpus. (See Chris McCormick’s helpful blog and video).

I’m not entirely sure how to trigger this vocabulary-building, but I expect that your new data wouldn’t be big enough to make your new words count as the “most common”.
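For what it’s worth, this vocabulary-building step can be run directly with the standalone tokenizers library; a minimal WordPiece training sketch, where the corpus file and vocabulary size are just illustrative:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Build an empty WordPiece tokenizer and train it on your own corpus;
# the trainer keeps the base characters plus frequent combinations, as described above.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(
    vocab_size=30_000,  # illustrative size
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["mixed_corpus.txt"], trainer=trainer)  # hypothetical file
```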

@marton-avrios it’s 2024, found a solution yet?

It’s 2025, any solution?

  • Train a new tokenizer on the German dataset.
  • Get non-overlapping German tokens.
  • Add them to your English tokenizer using the add_tokens method (see the sketch below).
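A minimal sketch of those three steps with the Hugging Face transformers API, where the checkpoint name, corpus path, and vocabulary size are illustrative:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# English tokenizer and model to extend (checkpoint name illustrative).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# 1. Train a tokenizer of the same type on the German corpus.
with open("german_corpus.txt", encoding="utf-8") as f:
    german_tokenizer = tokenizer.train_new_from_iterator(f, vocab_size=30_000)

# 2. Keep only the German tokens the English vocabulary does not already contain.
new_tokens = sorted(set(german_tokenizer.get_vocab()) - set(tokenizer.get_vocab()))

# 3. Add them and resize the embedding matrix so the new ids get rows.
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
```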

One way is to add tokens to the tokenizer and then resize the embeddings with model.resize_token_embeddings(len(tokenizer)). The new rows can be initialized randomly or by averaging existing subword embeddings, but you’ll still need more masked language modeling on your mixed corpus so those tokens actually learn.

Another option is to train a new joint vocabulary with SentencePiece or BPE on both English and German, retokenize your corpus, and copy over the embeddings that match. The unmatched rows get random initialization, and then you continue pretraining until the model stabilizes.
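For the averaged initialization in the first option, a rough sketch; the checkpoint and the two example German tokens are made up, and in practice new_tokens would come from a tokenizer trained on your German data:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

base = AutoTokenizer.from_pretrained("bert-base-uncased")       # untouched English tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # copy that will grow
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

new_tokens = ["straßenbahn", "bundesverfassungsgericht"]  # illustrative German tokens
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# Initialize each new row as the mean of the subword rows the original
# tokenizer would have split the token into, instead of leaving it random.
emb = model.get_input_embeddings().weight
with torch.no_grad():
    for tok in new_tokens:
        pieces = base(tok, add_special_tokens=False)["input_ids"]
        if pieces:
            emb[tokenizer.convert_tokens_to_ids(tok)] = emb[pieces].mean(dim=0)
```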

You can also keep the old vocabulary and train a small projection layer that maps your new token IDs into the old embedding space, then fine-tune the whole model once that mapping has settled. Another solution is to use a character- or byte-level model like ByT5 or CANINE, which avoids the problem entirely since those models don’t depend on a fixed subword vocabulary.
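One possible shape for that projection idea, as a rough PyTorch sketch; the class, dimensions, and routing logic are my own illustration rather than a library API, and the MLM output head would still need separate handling for the new ids:

```python
import torch
import torch.nn as nn

class ExtendedEmbedding(nn.Module):
    """Route old ids to the frozen pretrained embedding and new ids through
    a small trainable table plus a projection into the old embedding space."""

    def __init__(self, old_embedding: nn.Embedding, num_new_tokens: int, proj_dim: int = 128):
        super().__init__()
        self.old = old_embedding
        self.old_size = old_embedding.num_embeddings
        self.new = nn.Embedding(num_new_tokens, proj_dim)
        self.proj = nn.Linear(proj_dim, old_embedding.embedding_dim)
        self.old.weight.requires_grad_(False)  # keep pretrained rows frozen at first

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        is_new = input_ids >= self.old_size
        old_part = self.old(input_ids.clamp(max=self.old_size - 1))
        new_part = self.proj(self.new((input_ids - self.old_size).clamp(min=0)))
        return torch.where(is_new.unsqueeze(-1), new_part, old_part)

# Hypothetical usage: swap it in for the model's input embeddings.
# model.set_input_embeddings(ExtendedEmbedding(model.get_input_embeddings(), len(new_tokens)))
```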

The biggest challenges are distribution shift, catastrophic forgetting, and poor initialization of the new embeddings. The way around them is to do more masked language modeling on in-domain text, use a lower learning rate, and have some patience.
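For that continued masked language modeling step, a minimal Trainer sketch, assuming the extended tokenizer has already been saved to a local directory; every path and hyperparameter here is illustrative:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("extended-tokenizer-dir")  # hypothetical path
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.resize_token_embeddings(len(tokenizer))

# Tokenize the mixed English/German corpus for MLM.
dataset = load_dataset("text", data_files={"train": "mixed_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bert-mixed",
        learning_rate=2e-5,           # conservative, to limit forgetting
        num_train_epochs=1,
        per_device_train_batch_size=16,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```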
