The "Softmax Bottleneck" might be interesting to look at
I've been thinking about this a fair bit recently, but haven't had time to run any experiments (25-30 years ago I was fascinated with making a more "human like" chess engine, but that has largely been solved via Leela now).
Not sure if you've heard of the "Softmax Bottleneck" before:
https://arxiv.org/abs/1711.03953
It's explained a lot better in this guy's final year project:
https://project-archive.inf.ed.ac.uk/ug4/20234017/ug4_proj.pdf
For natural language there are often lots of similar ways to express the same meaning and lots of redundancy in the tokeniser... But for chess there is only a single way to express a valid move.
If you look at your tokeniser then you can see that of the 4687 tokens there are quite a few that can never happen, eg:
"a1a1": 5,
"a1h7": 67,
but what the "Softmax Bottleneck" is telling us mathematically is that if the set of tokens left after removing all these invalid moves is still larger than the hidden dimension of the residual stream, eg:
"hidden_size": 768,
"vocab_size": 4687,
"hidden_size": 1024,
"vocab_size": 4687,
then it's impossible for the hidden_state x lm_head transformation to express every distribution over the valid moves, and the LLM making invalid moves is almost a certainty (eg: it is likely being forced to learn to represent rook and bishop moves using the vector magnitude to decide "how far" to move, but without knowing the intervening pieces, etc).
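To make the rank argument concrete, here is a tiny numpy sketch (toy sizes, not the actual model dimensions) showing that the logit matrix produced by hidden_state x lm_head can never exceed rank hidden_size, no matter how many positions you feed through:

```python
# Toy illustration of the bottleneck: logits = hidden_states @ lm_head.T,
# so across any set of positions the logit matrix has rank <= hidden_size.
# Sizes are shrunk for speed; the same bound applies to 768/4687 or 1024/4687.
import numpy as np

hidden_size, vocab_size, n_positions = 16, 64, 200
hidden_states = np.random.randn(n_positions, hidden_size)
lm_head = np.random.randn(vocab_size, hidden_size)

logits = hidden_states @ lm_head.T                 # shape (n_positions, vocab_size)
print(np.linalg.matrix_rank(logits))               # prints 16, never more
```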
It would probably be interesting to see how many of the 4687 tokens are actually valid moves in the data, and then increase the hidden_size appropriately and/or experiment with much wider/shallower LLMs where the hidden_size is 2-3x the number of valid tokens. This added redundancy might then be able to encode other latent features (eg: early/mid/late game stage, winning/losing, etc) for the final lm_head output decision...
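A quick way to get that count might be something like the sketch below. It assumes the vocab is a JSON dict of UCI-style move strings ("e2e4", "e7e8q") mapped to ids in a vocab.json file; the filename, format, and helper names are assumptions for illustration, not the repo's actual layout:

```python
# Count which tokeniser entries are geometrically possible moves, i.e. some
# piece could make them on an otherwise empty board (vocab.json is assumed).
import json

FILES, RANKS = "abcdefgh", "12345678"

def is_move_token(t):
    # UCI-shaped tokens like "e2e4" or "e7e8q"
    if len(t) not in (4, 5) or (len(t) == 5 and t[4] not in "qrbn"):
        return False
    return t[0] in FILES and t[1] in RANKS and t[2] in FILES and t[3] in RANKS

def geometrically_possible(t):
    df = abs(FILES.index(t[0]) - FILES.index(t[2]))
    dr = abs(RANKS.index(t[1]) - RANKS.index(t[3]))
    if len(t) == 5:                       # promotion: one-step pawn move onto the back rank
        return dr == 1 and df <= 1 and t[3] in "18"
    if df == dr == 0:
        return False                      # null moves like "a1a1"
    return df == 0 or dr == 0 or df == dr or {df, dr} == {1, 2}   # queen- or knight-style geometry

with open("vocab.json") as fh:            # assumed filename
    vocab = json.load(fh)

moves = [t for t in vocab if is_move_token(t)]
possible = [t for t in moves if geometrically_possible(t)]
print(f"{len(possible)} / {len(moves)} move tokens are geometrically possible")
```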
The added redundancy of the input embedding will also likely help the gradient flow during training and make the final model less brittle if you wanted to do further fine-tuning on it.
It's probably also worth experimenting with untied embeddings.
Your current model looks to have tied the lm_head with the input_embeddings.
This is a very strong inductive bias to have, and my experiments training small draft models suggest this is often done simply to claim "our model is in the 0.5B class so we'll compare it with other 0.5B models" (even when the tokeniser is nearly an order of magnitude larger than what you are comparing it with!).
This might be different for chess though, but still worth investigating as it costs almost nothing in terms of VRAM or compute to not tie these tensors.
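For reference, a minimal sketch of what untying looks like with the Hugging Face transformers config flag (the model path below is a placeholder, not your actual repo):

```python
# Untie the output head from the input embeddings via the standard
# transformers config flag; the model path is a placeholder.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("path/to/your-chess-model")  # placeholder
config.tie_word_embeddings = False        # lm_head gets its own weight matrix

model = AutoModelForCausalLM.from_config(config)
# With untied weights these are two independent tensors rather than one shared one:
print(model.get_input_embeddings().weight is model.get_output_embeddings().weight)
```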
Hi! Thanks for the very in-depth comment! I looked at the softmax bottleneck problem you referenced and it seems really interesting. If I find some time I'll try to remove the invalid tokens and try some training runs with shallower but wider networks. Interestingly, I found that the 250m model makes fewer illegal moves.
Percentage of games where model makes an incorrect move:
250m: ~12% of games
100m: ~17% of games
Whether that's simply due to the overall larger parameter count or whether the larger hidden_size (1024 vs 768) plays a role would be something to try out.
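A minimal sketch of how such a legality check can be done with python-chess (illustrative only; the actual evaluation script may differ, and the games below are toy placeholders):

```python
# Check model-generated UCI moves for legality with python-chess.
import chess

def plays_illegal_move(game_moves):
    # Return True if any move in the generated game is malformed or illegal.
    board = chess.Board()
    for uci in game_moves:
        try:
            move = chess.Move.from_uci(uci)
        except ValueError:
            return True                    # not even a well-formed move string
        if move not in board.legal_moves:
            return True                    # well-formed but illegal in this position
        board.push(move)
    return False

games = [["e2e4", "e7e5", "g1f3"], ["d2d4", "a1h7"]]   # toy placeholder games
rate = sum(plays_illegal_move(g) for g in games) / len(games)
print(f"{rate:.0%} of games contain an illegal move")
```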
Regarding tied embeddings, my model is based on smollm3 from huggingface. AFAIK they did some ablations with and without tied embeddings for their 3B model and found the performance to be very similar. Their concern was that for small models with a large vocabulary, the embedding layers take up a large percentage of the total trainable weights. However I didn't do any ablations and the result could definitely be different for chess move prediction.
