Goedel-mHC-1B
A 1B-parameter language model built on multi-stream Hyperconnections (mHC), Gated GQA, and ReLU² FFN. This is the first open 1B+ LLM using mHC as its residual connection mechanism.
This is an architecture research release. The model is trained on 20B tokens of FineWeb-Edu to validate that mHC, combined with modern attention and FFN innovations from the NanoGPT speedrun community, produces better scaling behavior than a standard transformer at equivalent compute. It does: 3.8% better bits-per-byte with 15% fewer parameters compared to a standard GQA + SwiGLU + PreNorm + AdamW baseline trained identically.
Architecture
Parameters: 1,009M
| Component | Design | Details |
|---|---|---|
| Attention | Gated GQA | 16 query heads, 4 KV heads, 128 head dim, QK-norm, sigmoid output gate |
| FFN | ReLU² | relu(x)² activation, 2.667x expansion (hidden dim rounded up to nearest multiple of 256) |
| Residual | mHC | 4 parallel streams, Sinkhorn-constrained mixing matrices per layer |
| Norm | RMSNorm | Pre-norm within mHC streams |
| Positional | RoPE | θ = 10,000 |
| Vocab | 50,304 | GPT-2 tokenizer, padded to multiple of 64 |
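The headline parameter count can be sanity-checked from the table above. The sketch below is a back-of-envelope tally that ignores norm weights and the small per-layer mHC mixing parameters, and assumes the sigmoid gate projection is a full dim x dim matrix (an assumption, not confirmed by the source):

```python
# Back-of-envelope parameter count for Goedel-mHC-1B.
# Assumptions: gate projection is dim x dim; norms and mHC mixing
# parameters (a 4x4 matrix plus two 4-vectors per layer) are negligible.
dim, n_layers, vocab = 2048, 24, 50_304
n_heads, n_kv_heads, head_dim = 16, 4, 128

q = dim * n_heads * head_dim           # query projection
kv = 2 * dim * n_kv_heads * head_dim   # key + value projections
o = n_heads * head_dim * dim           # output projection
gate = n_heads * head_dim * dim        # sigmoid output gate (assumed full-size)
attn = q + kv + o + gate

hidden = ((int(dim * 2.667) + 255) // 256) * 256  # round up to multiple of 256
ffn = 2 * dim * hidden                 # W_up + W_down (no gate matrix in ReLU²)

embed = vocab * dim                    # tied with the LM head, so counted once
total = n_layers * (attn + ffn) + embed
print(f"hidden={hidden}, total={total / 1e6:.0f}M")  # hidden=5632, total=1009M
```

The tally lands on 1,008,992,256 parameters, matching the stated 1,009M.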
Gated GQA
Grouped Query Attention with a learned sigmoid output gate, following Qwen3 (arXiv:2505.09388). After the standard GQA attention output is computed, an additional linear projection produces a gate of the same shape, and the output is multiplied element-wise by sigmoid(gate). This mitigates attention-sink tokens and the bf16 loss spikes that can occur with standard attention at scale.
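The gating step can be sketched as follows. This is a minimal illustration, not the release code: the module and parameter names (`gate_proj`, `o_proj`) are assumptions, as is computing the gate from the block input `x` rather than from an intermediate activation.

```python
import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    """Sketch of the sigmoid output gate around a GQA block.
    Names and the gate's input (the block input x) are assumptions."""
    def __init__(self, dim: int = 2048):
        super().__init__()
        self.gate_proj = nn.Linear(dim, dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # attn_out: concatenated per-head attention outputs, shape (B, S, dim).
        # The gate has the same shape and is applied element-wise before
        # the final output projection.
        return self.o_proj(attn_out * torch.sigmoid(self.gate_proj(x)))
```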
ReLU²
From the NanoGPT speedrun lineage. The FFN applies relu(x * W_up)² followed by W_down. Squared ReLU produces sparser activations than SwiGLU while being simpler and more fusible. The intermediate dimension is dim * 2.667, rounded up to the nearest multiple of 256 for hardware alignment.
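A minimal sketch of this FFN, assuming bias-free projections (the names `up`/`down` are illustrative):

```python
import torch
import torch.nn as nn

class ReLU2FFN(nn.Module):
    """ReLU² feed-forward block: down(relu(up(x))**2).
    Bias-free linears are an assumption; names are illustrative."""
    def __init__(self, dim: int = 2048, mult: float = 2.667):
        super().__init__()
        # Round the intermediate dim up to the nearest multiple of 256.
        hidden = ((int(dim * mult) + 255) // 256) * 256
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(torch.relu(self.up(x)).square())
```

For dim = 2048 this gives an intermediate dimension of 5,632.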
Multi-stream Hyperconnections (mHC)
Instead of a single residual stream x + sublayer(norm(x)), mHC maintains n parallel streams of the full hidden dimension. Between layers, streams are mixed via a learned doubly-stochastic matrix (enforced by Sinkhorn-Knopp iterations on a 4x4 logit matrix). A learned h_pre vector combines streams into a single input for each sublayer, and a learned h_post vector distributes the sublayer output back across streams.
At initialization, mHC exactly recovers standard pre-norm residual connections. During training, the model learns to route information through multiple parallel pathways, which empirically improves gradient flow and representation capacity.
The expanded hidden state between blocks has shape (B, S, n*D) where n=4 streams and D=2048, so the inter-block representation is 8,192-dimensional. expand() replicates the embedding into streams after the embedding layer; contract() averages across streams before the final norm.
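The mechanics above can be sketched as a single mHC-wrapped sublayer. This is a hedged illustration under stated assumptions: the Sinkhorn iteration count, the near-identity initialization via large diagonal logits, and the names `h_pre`/`h_post`/`mix_logits` are choices made here, not confirmed details of the release code.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

def sinkhorn(logits: torch.Tensor, n_iters: int = 10) -> torch.Tensor:
    """Sinkhorn-Knopp: alternate row/column normalization pushes exp(logits)
    toward a doubly-stochastic matrix."""
    m = torch.exp(logits)
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)  # rows sum to 1
        m = m / m.sum(dim=0, keepdim=True)  # columns sum to 1
    return m

class MHCBlock(nn.Module):
    """One mHC-wrapped sublayer; streams has shape (B, S, n, D).
    At init the mixing matrix is ~identity and h_pre/h_post select stream 0,
    approximating a standard pre-norm residual. Init details are assumptions."""
    def __init__(self, sublayer: nn.Module, dim: int, n_streams: int = 4):
        super().__init__()
        self.sublayer = sublayer
        self.norm = RMSNorm(dim)
        self.mix_logits = nn.Parameter(8.0 * torch.eye(n_streams))  # ~identity after Sinkhorn
        e0 = torch.zeros(n_streams)
        e0[0] = 1.0
        self.h_pre = nn.Parameter(e0.clone())
        self.h_post = nn.Parameter(e0.clone())

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        mix = sinkhorn(self.mix_logits)
        streams = torch.einsum("ij,bsjd->bsid", mix, streams)          # mix the n streams
        x = torch.einsum("i,bsid->bsd", self.h_pre, streams)           # combine for the sublayer
        y = self.sublayer(self.norm(x))
        return streams + torch.einsum("i,bsd->bsid", self.h_post, y)   # redistribute the output
```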
References: Zhu et al., 2024 (Hyper-Connections); Xie et al., 2025 (mHC).
Optimizer: NorMuon
A split optimizer: Muon (LR 0.007) for all 2D weight matrices, Adam (LR 3e-4) for 1D parameters and embeddings. Muon orthogonalizes each matrix's Nesterov-momentum update via a few Newton-Schulz iterations before applying it, which significantly accelerates training of large weight matrices. The "NorMuon" variant adds a normalization step with beta2 = 0.95 for additional stability.
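The core orthogonalization step can be sketched as below. The quintic coefficients come from Keller Jordan's public Muon implementation; this is the plain Muon step, not the additional NorMuon normalization.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D momentum/gradient matrix, as in Muon.
    The result approximates U V^T from the SVD of g. Quintic coefficients
    follow Keller Jordan's Muon; this is a sketch, not the exact NorMuon step."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)  # normalize so singular values are <= 1
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T                # keep the Gram matrix x @ x.T small
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x
```

After a handful of iterations the singular values of the update cluster near 1, so every direction in the weight matrix is updated at a comparable scale.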
LR schedule: trapezoidal with 500-step linear warmup, constant phase, and 45% linear cooldown.
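The schedule reduces to a simple piecewise-linear multiplier. A minimal sketch (total step count is derived from 20B tokens / ~1.05M tokens per step, assuming no partial final step):

```python
def trapezoidal_lr(step: int, total_steps: int, warmup: int = 500,
                   cooldown_frac: float = 0.45) -> float:
    """LR multiplier in [0, 1]: linear warmup, constant plateau, linear cooldown."""
    cooldown_start = int(total_steps * (1 - cooldown_frac))
    if step < warmup:
        return step / warmup
    if step < cooldown_start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - cooldown_start))

# With ~19,073 total steps (20B tokens at ~1.05M tokens/step):
# warmup ends at step 500; cooldown begins around step 10,490.
```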
Training
| Setting | Value |
|---|---|
| Data | 20B tokens of FineWeb-Edu |
| Tokenizer | GPT-2 (50,257 tokens, vocab padded to 50,304) |
| Hardware | 8x NVIDIA H200 SXM (Vast.ai) |
| Sequence length | 4,096 |
| Per-GPU batch | 8 sequences |
| Gradient accumulation | 4 steps |
| Effective batch | 256 sequences (8 GPUs x 8 seq x 4 accum), i.e. ~1.05M tokens/step at seq len 4,096 |
| Precision | bf16 |
| Compilation | torch.compile with max-autotune-no-cudagraphs |
| Fused CE loss | Liger kernel (never materializes full logit tensor) |
| Weight tying | Embedding and LM head share weights |
| Wall-clock time | ~21 hours |
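The batch arithmetic in the table can be checked directly (the step count assumes no partial final step):

```python
# Sanity-check the effective batch size and total optimizer steps.
gpus, per_gpu_batch, grad_accum, seq_len = 8, 8, 4, 4096
seqs_per_step = gpus * per_gpu_batch * grad_accum    # 256 sequences
tokens_per_step = seqs_per_step * seq_len            # 1,048,576 ≈ 1.05M tokens
total_steps = 20_000_000_000 // tokens_per_step      # ≈ 19,073 optimizer steps
print(seqs_per_step, tokens_per_step, total_steps)   # 256 1048576 19073
```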
Results
| Benchmark | Goedel-mHC-1B (1,009M) | Baseline (1,185M) |
|---|---|---|
| BPB (wikitext-2) | 1.087 | 1.130 |
| val_loss (FineWeb-Edu) | 2.645 | 2.686 |
| HellaSwag | 39.7% | 36.2% |
| ARC-Easy | 57.8% | 52.8% |
| ARC-Challenge | 24.3% | 23.9% |
| WinoGrande | 54.9% | 53.1% |
Both models trained on 20B tokens of FineWeb-Edu, 8xH200. The baseline uses GQA + SwiGLU + PreNorm + AdamW and has 15% more parameters.
Key result: Goedel-mHC-1B achieves 3.8% better BPB with 15% fewer parameters than the baseline, demonstrating that the combination of mHC + Gated GQA + ReLU² + NorMuon meaningfully improves parameter efficiency.
Full Config
The complete resolved configuration used for training:
```yaml
model:
  dim: 2048
  n_layers: 24
  vocab_size: 50304
  attention:
    type: gated_gqa
    num_heads: 16
    num_kv_heads: 4
    head_dim: 128
    qk_norm: true
    rope_theta: 10000
  ffn:
    type: relu2
    intermediate_mult: 2.667
  residual:
    type: mhc
    n_streams: 4
optim:
  type: muon
  lr: 3.0e-4
  muon_lr: 0.007
  normuon: true
  normuon_beta2: 0.95
  scheduler: trapezoidal
  cooldown_fraction: 0.45
  warmup_steps: 500
  weight_decay: 0.1
  max_grad_norm: 1.0
training:
  tokens: 20_000_000_000
  batch_size: 8
  seq_len: 4096
  grad_accum_steps: 4
  liger: true
  compile: true
  compile_mode: max-autotune-no-cudagraphs
data:
  shard_dir: data/fineweb_edu
```
Limitations
- Undertrained. 20B tokens is far below modern standards; comparable 1B models typically train on 1-4T tokens. This model exists to validate architecture choices, not to compete on downstream benchmarks.
- English only. FineWeb-Edu is English-language web text filtered for educational content.
- Base model only. No instruction tuning, RLHF, or alignment. The model will not follow instructions reliably and may generate harmful or incorrect text.
- Custom architecture required. mHC, Gated GQA, and ReLU² are not standard HuggingFace `transformers` architectures. You cannot load this model with `AutoModelForCausalLM`; loading requires our custom codebase. Code release is coming.
- Not suitable for production use. This is a research artifact for architecture exploration.
What's Next
- 100B-token continued pretraining on FineWeb-HQ is currently in progress, which will bring this architecture closer to a properly trained model.
- Full code release and technical writeup will accompany the 100B results.
Citation
If you use this model in your work, please cite:
@misc{goedel2026mhc1b,
title={Goedel-mHC-1B: First Open 1B+ Language Model with Multi-Stream Hyperconnections},
author={Goedel Machines},
year={2026},
url={https://huggingface.co/GoedelMachines/Goedel-mHC-1B}
}
This work builds on the Hyper-Connections papers:
@article{zhu2024hyperconnections,
title={Hyper-Connections},
author={Defa Zhu and Hongzhi Huang and Zihao Huang and Yutao Zeng and Yunyao Mao and Banggu Wu and Qiyang Min and Xun Zhou},
journal={arXiv preprint arXiv:2409.19606},
year={2024}
}
@article{xie2025mhc,
  title={Manifold-Constrained Hyper-Connections},
  author={Zhenda Xie and Yixuan Wei and Huanqi Cao and others},
  journal={arXiv preprint arXiv:2512.24880},
  year={2025}
}