Confusion about Mistral Small 24B 3.1 head_dim calculation

I’ve noticed something puzzling about the head dimension calculation in Mistral Small 3.1. According to the transformers documentation, head_dim defaults to hidden_size // num_attention_heads, which for Mistral Small 3.1 would be:

    head_dim = 5120 // 32 = 160

However, in the model’s config.json, head_dim is explicitly set to 128:

{
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 32768,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 40,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 100000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.49.0.dev0",
  "use_cache": true,
  "vocab_size": 131072
}

This seems to conflict with the transformers documentation, which states: "head_dim (int, optional, defaults to hidden_size // num_attention_heads) — The attention head dimension."

Interestingly, if we calculate using num_hidden_layers instead:

    head_dim = hidden_size // num_hidden_layers
    head_dim = 5120 // 40 = 128

This matches the preset value in the config, but I’m not sure if this is coincidental or intentional.
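
For reference, here is the arithmetic spelled out as plain Python, using only the values from the config.json above (nothing here comes from the model code itself):

    config = {
        "hidden_size": 5120,
        "num_attention_heads": 32,
        "num_hidden_layers": 40,
        "head_dim": 128,
    }

    # Documented default: model width divided by the number of attention heads
    print(config["hidden_size"] // config["num_attention_heads"])  # 160

    # The coincidental match: model width divided by the number of layers
    print(config["hidden_size"] // config["num_hidden_layers"])    # 128

    # The value actually shipped in config.json
    print(config["head_dim"])                                      # 128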

Can someone explain:

  1. Why is head_dim set to 128 instead of the calculated 160?

  2. Is the correlation with num_hidden_layers meaningful or coincidental?


Hmm…


  1. head_dim=128 is deliberate, not a miscalculation.
  • The Mistral-Small-3.1-24B checkpoints ship a config that explicitly sets:
    hidden_size=5120, num_attention_heads=32, num_key_value_heads=8, and head_dim=128. Transformers reads and uses this field directly; it does not recompute hidden_size // num_attention_heads when head_dim is present. See the model’s config.json showing head_dim: 128. (Hugging Face)

  • Mistral/NIM model cards also list “Dim: 5,120; Heads: 32; Head dim: 128; KV heads: 8 (GQA)”, which matches the HF config. (NVIDIA Docs)

  • Why pick 128 instead of 160? Lower per-head width cuts attention compute and KV-cache size while keeping 32 Q-heads and using GQA with 8 KV-heads. A standard KV cache size formula per token is:

    • MHA: bytes ≈ 2 * num_layers * (num_heads * head_dim) * bytes_per_elem. (NVIDIA Developer)
    • GQA: bytes ≈ 2 * num_layers * (num_kv_heads * head_dim) * bytes_per_elem, since only K/V heads count. (Medium)
      With bf16 (2 bytes), num_layers=40, num_kv_heads=8:
    • If head_dim=128: per-token, per-layer KV = 2 * 8 * 128 * 2 = 4096 bytes.
    • If head_dim=160: per-token, per-layer KV = 2 * 8 * 160 * 2 = 5120 bytes.
      That’s +25% KV memory and proportionally more attention FLOPs if head_dim were 160. Using 8 KV-heads instead of 32 further yields a 4× KV reduction vs MHA; the short script after this list works through these numbers. (General GQA behavior documented in HF/vLLM docs.) (Hugging Face)
  2. The “correlation” with num_hidden_layers is coincidental.
  • num_hidden_layers (depth) and head_dim (per-head width) are independent config knobs in Transformers. The HF docs list them as separate attributes; there is no rule tying them. (Hugging Face)
  • Mistral Small 24B happens to use 40 layers and head_dim 128. That pairing is also shown on Mistral/NIM cards, but it’s a design choice, not a constraint. Other models use 40 layers with different head_dims, or 128-dim heads with a different number of layers. (NVIDIA Docs)
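
To make the KV-cache numbers concrete, here is a small sketch that reproduces them (pure arithmetic with the formula quoted above; nothing is measured from a running model, and the MHA line is a hypothetical comparison):

    def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
        """KV-cache bytes per token: 2 (K and V) * layers * KV heads * head_dim * dtype size."""
        return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

    # Mistral Small 3.1: 40 layers, 8 KV heads (GQA), bf16 (2 bytes per element)
    gqa_128 = kv_bytes_per_token(40, 8, 128)   # shipped head_dim
    gqa_160 = kv_bytes_per_token(40, 8, 160)   # hypothetical hidden_size // num_attention_heads
    mha_128 = kv_bytes_per_token(40, 32, 128)  # hypothetical MHA (all 32 heads carry KV)

    print(gqa_128 // 40, gqa_160 // 40)  # 4096 5120 -> bytes per token, per layer
    print(gqa_160 / gqa_128)             # 1.25 -> +25% KV memory at head_dim=160
    print(mha_128 / gqa_128)             # 4.0  -> 4x KV reduction from GQA (8 vs 32 KV heads)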

Pointers if you want to dig into code and ecosystem behavior:

  • HF Mistral-3 docs and the exact checkpoint config that fixes head_dim=128. (Hugging Face)
  • GQA parameter (num_key_value_heads) in HF model docs and vLLM docs. (Hugging Face)
  • Community discussions confirming that manually setting head_dim is supported and is not hard-tied to hidden_size // num_heads everywhere. (GitHub)
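
As a rough illustration of why the explicit head_dim matters downstream, here is the shape bookkeeping for the attention projections, assuming the usual num_heads * head_dim sizing (a simplified sketch, not the actual transformers source):

    hidden_size = 5120
    num_attention_heads = 32
    num_key_value_heads = 8
    head_dim = 128  # explicit in config.json; the documented fallback would be 5120 // 32 = 160

    # With an explicit head_dim, the per-head width no longer has to equal hidden_size / num_heads;
    # the projections simply map between the model width and num_heads * head_dim.
    q_out = num_attention_heads * head_dim   # 4096 (query projection output width)
    kv_out = num_key_value_heads * head_dim  # 1024 (key/value projection output width, GQA)

    print(f"q_proj: {hidden_size} -> {q_out}")
    print(f"k_proj: {hidden_size} -> {kv_out}")
    print(f"v_proj: {hidden_size} -> {kv_out}")
    print(f"o_proj: {q_out} -> {hidden_size}")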

Bottom line: head_dim=128 is an intentional latency/memory trade-off with GQA. Any match to 5120 // 40 = 128 is numerology, not architecture. (Hugging Face)