Best way to extend vocabulary of pretrained model?

What would be the best way to somehow “mix” a SentencePiece vocabulary trained on a corpus of English and German documents with the existing English-only vocabulary of a pretrained transformer? So I take the pretrained model (let’s say English BERT, though it’s WordPiece, I know), somehow create a new mixed vocabulary, and then fine-tune on my mixed-language downstream task.

[I am not an expert]

When BERT “chooses” which word-pieces to use in its vocabulary, it does so by taking all the individual characters plus the most common combinations that it finds in its training corpus. (See Chris McCormick’s helpful blog and video).

I’m not entirely sure how to trigger this vocabulary-building, but I expect that your new data wouldn’t be big enough to make your new words count as the “most common”.
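For what it’s worth, this vocabulary-building step can be run directly with the standalone tokenizers library; a minimal WordPiece training sketch, where the corpus file and vocabulary size are just illustrative:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Build an empty WordPiece tokenizer and train it on your own corpus;
# the trainer keeps the base characters plus frequent combinations, as described above.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(
    vocab_size=30_000,  # illustrative size
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["mixed_corpus.txt"], trainer=trainer)  # hypothetical file
```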

@marton-avrios it’s 2024, found a solution yet?

It’s 2025, any solution?

  • Train a new tokenizer on the German dataset.
  • Get non-overlapping German tokens.
  • Add them to your English tokenizer using the add_tokens method (see the sketch below).
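A minimal sketch of those three steps with the Hugging Face transformers API, where the checkpoint name, corpus path, and vocabulary size are illustrative:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# English tokenizer and model to extend (checkpoint name illustrative).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# 1. Train a tokenizer of the same type on the German corpus.
with open("german_corpus.txt", encoding="utf-8") as f:
    german_tokenizer = tokenizer.train_new_from_iterator(f, vocab_size=30_000)

# 2. Keep only the German tokens the English vocabulary does not already contain.
new_tokens = sorted(set(german_tokenizer.get_vocab()) - set(tokenizer.get_vocab()))

# 3. Add them and resize the embedding matrix so the new ids get rows.
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
```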

One way is to add tokens to the tokenizer and then resize the embeddings with model.resize_token_embeddings(len(tokenizer)). The new rows can be initialized randomly or by averaging existing subword embeddings, but you’ll still need more masked language modeling on your mixed corpus so those tokens actually learn.

Another option is to train a new joint vocabulary with SentencePiece or BPE on both English and German, retokenize your corpus, and copy over the embeddings that match. The unmatched rows get random initialization, and then you continue pretraining until the model stabilizes.
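For the averaged initialization in the first option, a rough sketch; the checkpoint and the two example German tokens are made up, and in practice new_tokens would come from a tokenizer trained on your German data:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

base = AutoTokenizer.from_pretrained("bert-base-uncased")       # untouched English tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # copy that will grow
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

new_tokens = ["straßenbahn", "bundesverfassungsgericht"]  # illustrative German tokens
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# Initialize each new row as the mean of the subword rows the original
# tokenizer would have split the token into, instead of leaving it random.
emb = model.get_input_embeddings().weight
with torch.no_grad():
    for tok in new_tokens:
        pieces = base(tok, add_special_tokens=False)["input_ids"]
        if pieces:
            emb[tokenizer.convert_tokens_to_ids(tok)] = emb[pieces].mean(dim=0)
```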

You can also keep the old vocabulary and train a small projection layer that maps your new token IDs into the old embedding space, then fine-tune the whole model once that mapping has settled. Another solution is to use a character- or byte-level model like ByT5 or CANINE, which avoids the problem entirely since those models don’t depend on a fixed subword vocabulary.
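One possible shape for that projection idea, as a rough PyTorch sketch; the class, dimensions, and routing logic are my own illustration rather than a library API, and the MLM output head would still need separate handling for the new ids:

```python
import torch
import torch.nn as nn

class ExtendedEmbedding(nn.Module):
    """Route old ids to the frozen pretrained embedding and new ids through
    a small trainable table plus a projection into the old embedding space."""

    def __init__(self, old_embedding: nn.Embedding, num_new_tokens: int, proj_dim: int = 128):
        super().__init__()
        self.old = old_embedding
        self.old_size = old_embedding.num_embeddings
        self.new = nn.Embedding(num_new_tokens, proj_dim)
        self.proj = nn.Linear(proj_dim, old_embedding.embedding_dim)
        self.old.weight.requires_grad_(False)  # keep pretrained rows frozen at first

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        is_new = input_ids >= self.old_size
        old_part = self.old(input_ids.clamp(max=self.old_size - 1))
        new_part = self.proj(self.new((input_ids - self.old_size).clamp(min=0)))
        return torch.where(is_new.unsqueeze(-1), new_part, old_part)

# Hypothetical usage: swap it in for the model's input embeddings.
# model.set_input_embeddings(ExtendedEmbedding(model.get_input_embeddings(), len(new_tokens)))
```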

The biggest challenges are distribution shift, catastrophic forgetting, and poor initialization of the new embeddings. The way around them is to do more masked language modeling on in-domain text, use a lower learning rate, and have some patience.
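For that continued masked language modeling step, a minimal Trainer sketch, assuming the extended tokenizer has already been saved to a local directory; every path and hyperparameter here is illustrative:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("extended-tokenizer-dir")  # hypothetical path
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.resize_token_embeddings(len(tokenizer))

# Tokenize the mixed English/German corpus for MLM.
dataset = load_dataset("text", data_files={"train": "mixed_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bert-mixed",
        learning_rate=2e-5,           # conservative, to limit forgetting
        num_train_epochs=1,
        per_device_train_batch_size=16,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```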
