Confusion about Mistral Small 24B 3.1 head_dim calculation

I’ve noticed something puzzling about the head dimension calculation in Mistral Small 3.1. According to the transformers documentation, head_dim defaults to hidden_size // num_attention_heads, which for Mistral Small 3.1 would be:

    head_dim = 5120 // 32 = 160

However, in the model’s config.json, head_dim is explicitly set to 128:

{
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 32768,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 40,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 100000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.49.0.dev0",
  "use_cache": true,
  "vocab_size": 131072
}

This seems to conflict with the transformers documentation, which states: "head_dim (int, optional, defaults to hidden_size // num_attention_heads) — The attention head dimension."

Interestingly, if we calculate using num_hidden_layers instead:

    head_dim = hidden_size // num_hidden_layers
    head_dim = 5120 // 40 = 128

This matches the preset value in the config, but I’m not sure if this is coincidental or intentional.
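
For reference, here is the arithmetic spelled out as plain Python, using only the values from the config.json above (nothing here comes from the model code itself):

    config = {
        "hidden_size": 5120,
        "num_attention_heads": 32,
        "num_hidden_layers": 40,
        "head_dim": 128,
    }

    # Documented default: model width divided by the number of attention heads
    print(config["hidden_size"] // config["num_attention_heads"])  # 160

    # The coincidental match: model width divided by the number of layers
    print(config["hidden_size"] // config["num_hidden_layers"])    # 128

    # The value actually shipped in config.json
    print(config["head_dim"])                                      # 128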

Can someone explain:

  1. Why is head_dim set to 128 instead of the calculated 160?

  2. Is the correlation with num_hidden_layers meaningful or coincidental?


Hmm…


  1. head_dim=128 is deliberate, not a miscalculation.
  • The Mistral-Small-3.1-24B checkpoints ship a config that explicitly sets:
    hidden_size=5120, num_attention_heads=32, num_key_value_heads=8, and head_dim=128. Transformers reads and uses this field directly; it does not recompute hidden_size // num_attention_heads when head_dim is present. See the model’s config.json showing head_dim: 128. (Hugging Face)

  • Mistral/NIM model cards also list “Dim: 5,120; Heads: 32; Head dim: 128; KV heads: 8 (GQA)”, which matches the HF config. (NVIDIA Docs)

  • Why pick 128 instead of 160? Lower per-head width cuts attention compute and KV-cache size while keeping 32 Q-heads and using GQA with 8 KV-heads. A standard KV cache size formula per token is:

    • MHA: bytes ≈ 2 * num_layers * (num_heads * head_dim) * bytes_per_elem. (NVIDIA Developer)
    • GQA: bytes ≈ 2 * num_layers * (num_kv_heads * head_dim) * bytes_per_elem, since only K/V heads count. (Medium)
      With bf16 (2 bytes), num_layers=40, num_kv_heads=8:
    • If head_dim=128: per-token, per-layer KV = 2 * 8 * 128 * 2 = 4096 bytes.
    • If head_dim=160: per-token, per-layer KV = 2 * 8 * 160 * 2 = 5120 bytes.
      That’s +25% KV memory and proportionally more attention FLOPs if head_dim were 160. Using 8 KV-heads instead of 32 further yields a 4× KV reduction vs MHA; the short script after this list works through these numbers. (General GQA behavior documented in HF/vLLM docs.) (Hugging Face)
  2. The “correlation” with num_hidden_layers is coincidental.
  • num_hidden_layers (depth) and head_dim (per-head width) are independent config knobs in Transformers. The HF docs list them as separate attributes; there is no rule tying them. (Hugging Face)
  • Mistral Small 24B happens to use 40 layers and head_dim 128. That pairing is also shown on Mistral/NIM cards, but it’s a design choice, not a constraint. Other models use 40 layers with different head_dims, or 128-dim heads with a different number of layers. (NVIDIA Docs)
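
To make the KV-cache numbers concrete, here is a small sketch that reproduces them (pure arithmetic with the formula quoted above; nothing is measured from a running model, and the MHA line is a hypothetical comparison):

    def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
        """KV-cache bytes per token: 2 (K and V) * layers * KV heads * head_dim * dtype size."""
        return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

    # Mistral Small 3.1: 40 layers, 8 KV heads (GQA), bf16 (2 bytes per element)
    gqa_128 = kv_bytes_per_token(40, 8, 128)   # shipped head_dim
    gqa_160 = kv_bytes_per_token(40, 8, 160)   # hypothetical hidden_size // num_attention_heads
    mha_128 = kv_bytes_per_token(40, 32, 128)  # hypothetical MHA (all 32 heads carry KV)

    print(gqa_128 // 40, gqa_160 // 40)  # 4096 5120 -> bytes per token, per layer
    print(gqa_160 / gqa_128)             # 1.25 -> +25% KV memory at head_dim=160
    print(mha_128 / gqa_128)             # 4.0  -> 4x KV reduction from GQA (8 vs 32 KV heads)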

Pointers if you want to dig into code and ecosystem behavior:

  • HF Mistral-3 docs and the exact checkpoint config that fixes head_dim=128. (Hugging Face)
  • GQA parameter (num_key_value_heads) in HF model docs and vLLM docs. (Hugging Face)
  • Community discussions confirming that manually setting head_dim is supported and is not hard-tied to hidden_size // num_heads everywhere. (GitHub)
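
As a rough illustration of why the explicit head_dim matters downstream, here is the shape bookkeeping for the attention projections, assuming the usual num_heads * head_dim sizing (a simplified sketch, not the actual transformers source):

    hidden_size = 5120
    num_attention_heads = 32
    num_key_value_heads = 8
    head_dim = 128  # explicit in config.json; the documented fallback would be 5120 // 32 = 160

    # With an explicit head_dim, the per-head width no longer has to equal hidden_size / num_heads;
    # the projections simply map between the model width and num_heads * head_dim.
    q_out = num_attention_heads * head_dim   # 4096 (query projection output width)
    kv_out = num_key_value_heads * head_dim  # 1024 (key/value projection output width, GQA)

    print(f"q_proj: {hidden_size} -> {q_out}")
    print(f"k_proj: {hidden_size} -> {kv_out}")
    print(f"v_proj: {hidden_size} -> {kv_out}")
    print(f"o_proj: {q_out} -> {hidden_size}")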

Bottom line: head_dim=128 is an intentional latency/memory trade-off with GQA. Any match to 5120 // 40 = 128 is numerology, not architecture. (Hugging Face)