Goedel-mHC-1B

A 1B-parameter language model built on multi-stream Hyperconnections (mHC), Gated GQA, and ReLU² FFN. This is the first open 1B+ LLM using mHC as its residual connection mechanism.

This is an architecture research release. The model is trained on 20B tokens of FineWeb-Edu to validate that mHC, combined with modern attention and FFN innovations from the NanoGPT speedrun community, produces better scaling behavior than a standard transformer at equivalent compute. It does: 3.8% better bits-per-byte with 15% fewer parameters compared to a standard GQA + SwiGLU + PreNorm + AdamW baseline trained identically.

Architecture

Parameters: 1,009M

| Component | Design | Details |
|---|---|---|
| Attention | Gated GQA | 16 query heads, 4 KV heads, 128 head dim, QK-norm, sigmoid output gate |
| FFN | ReLU² | relu(x)² activation, 2.667x expansion (hidden dim rounded up to nearest multiple of 256) |
| Residual | mHC | 4 parallel streams, Sinkhorn-constrained mixing matrices per layer |
| Norm | RMSNorm | Pre-norm within mHC streams |
| Positional | RoPE | θ = 10,000 |
| Vocab | 50,304 | GPT-2 tokenizer, padded to multiple of 64 |

Gated GQA

Grouped Query Attention with a learned sigmoid output gate, following Qwen3 (arXiv:2505.09388). After the standard GQA attention output is computed, an additional linear projection produces a gate of the same shape, and the output is multiplied element-wise by sigmoid(gate). This gating mitigates attention-sink tokens and the bf16 loss spikes that standard attention can exhibit at scale.
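The gating step can be sketched in isolation. This is a minimal NumPy illustration with hypothetical names (`gated_gqa_output`, `W_gate`); it shows only the output gate, not the attention computation itself, and computing the gate from the layer input is an assumption of this sketch:

```python
import numpy as np

def gated_gqa_output(attn_out, x, W_gate):
    """Apply a learned sigmoid output gate to an attention output.

    attn_out : (seq, dim) output of standard GQA attention
    x        : (seq, dim) layer input used to compute the gate
    W_gate   : (dim, dim) learned gate projection
    """
    gate = 1.0 / (1.0 + np.exp(-(x @ W_gate)))  # sigmoid, same shape as attn_out
    return attn_out * gate                      # element-wise modulation

rng = np.random.default_rng(0)
seq, dim = 8, 16
x = rng.standard_normal((seq, dim))
attn_out = rng.standard_normal((seq, dim))
W_gate = 0.02 * rng.standard_normal((dim, dim))
y = gated_gqa_output(attn_out, x, W_gate)
```

Because the gate lies strictly in (0, 1), every output element is attenuated rather than amplified, which is what damps the runaway activations behind attention sinks.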

ReLU²

From the NanoGPT speedrun lineage. The FFN applies relu(x * W_up)² followed by W_down. Squared ReLU produces sparser activations than SwiGLU while being simpler and more fusible. The intermediate dimension is dim * 2.667, rounded up to the nearest multiple of 256 for hardware alignment.
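The activation and the dimension arithmetic can be checked with a short sketch (NumPy; `relu2_ffn` and `intermediate_dim` are illustrative names, not the released implementation):

```python
import math
import numpy as np

def relu2_ffn(x, W_up, W_down):
    """Squared-ReLU FFN: relu(x @ W_up)**2 followed by W_down."""
    return (np.maximum(x @ W_up, 0.0) ** 2) @ W_down

def intermediate_dim(dim, mult=2.667, align=256):
    """Round dim * mult up to the nearest multiple of `align`."""
    return math.ceil(dim * mult / align) * align

# 2048 * 2.667 ~= 5462; the next multiple of 256 is 5632
hidden = intermediate_dim(2048)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
y = relu2_ffn(x, rng.standard_normal((8, 24)), rng.standard_normal((24, 8)))
```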

Multi-stream Hyperconnections (mHC)

Instead of a single residual stream x + sublayer(norm(x)), mHC maintains n parallel streams of the full hidden dimension. Between layers, streams are mixed via a learned doubly-stochastic matrix (enforced by Sinkhorn-Knopp iterations on a 4x4 logit matrix). A learned h_pre vector combines streams into a single input for each sublayer, and a learned h_post vector distributes the sublayer output back across streams.

At initialization, mHC exactly recovers standard pre-norm residual connections. During training, the model learns to route information through multiple parallel pathways, which empirically improves gradient flow and representation capacity.

The expanded hidden state between blocks has shape (B, S, n*D) where n=4 streams and D=2048, so the inter-block representation is 8,192-dimensional. expand() replicates the embedding into streams after the embedding layer; contract() averages across streams before the final norm.
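The mechanics above can be sketched as follows. Everything here is an illustrative assumption (NumPy, hypothetical names `sinkhorn` and `mhc_block`), not the released code, and the per-stream RMSNorm is omitted for brevity:

```python
import numpy as np

def sinkhorn(logits, n_iters=20):
    """Approximately project exp(logits) onto the doubly stochastic
    matrices via Sinkhorn-Knopp: alternate row/column normalization."""
    M = np.exp(logits)
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)  # rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)  # columns sum to 1
    return M

def mhc_block(x_streams, sublayer, mix_logits, h_pre, h_post):
    """One mHC step: mix streams, combine for the sublayer, redistribute.

    x_streams : (n, seq, dim) parallel residual streams
    h_pre     : (n,) weights combining streams into the sublayer input
    h_post    : (n,) weights distributing the sublayer output back
    """
    M = sinkhorn(mix_logits)                        # (n, n) doubly stochastic
    mixed = np.einsum("ij,jsd->isd", M, x_streams)  # stream mixing
    inp = np.einsum("i,isd->sd", h_pre, mixed)      # combine for sublayer
    out = sublayer(inp)                             # e.g. attention or FFN
    return mixed + h_post[:, None, None] * out      # residual across streams

n, seq, dim = 4, 6, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((n, seq, dim))
M = sinkhorn(rng.standard_normal((n, n)))
y = mhc_block(x, lambda z: 0.0 * z, rng.standard_normal((n, n)),
              np.ones(n) / n, np.ones(n))
```

Because the mixing matrix is doubly stochastic, mixing redistributes residual mass between streams without creating or destroying it: the sum over streams is preserved.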

References: Zhu et al., 2024 (Hyper-Connections); Xie et al., 2025 (mHC). Full citations are in the Citation section.

Optimizer: NorMuon

A split optimizer: Muon (LR 0.007) for all 2D weight matrices, Adam (LR 3e-4) for 1D parameters and embeddings. Muon applies Nesterov momentum to each matrix gradient and then orthogonalizes the momentum buffer via Newton-Schulz iterations, replacing the update with an approximately semi-orthogonal matrix; this significantly accelerates training of large weight matrices. The "NorMuon" variant adds a normalization step with beta2 = 0.95 for additional stability.
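The orthogonalization at Muon's core can be illustrated with the classic cubic Newton-Schulz iteration. Note this cubic variant is for exposition only; practical Muon implementations use a tuned quintic polynomial run for very few iterations:

```python
import numpy as np

def newton_schulz_orthogonalize(G, n_iters=50):
    """Drive G toward the nearest (semi-)orthogonal matrix with the
    cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X @ X.T @ X.
    Dividing by the Frobenius norm guarantees spectral norm <= 1,
    which the iteration needs to converge."""
    X = G / np.linalg.norm(G)
    for _ in range(n_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))
O = newton_schulz_orthogonalize(G)  # O @ O.T is close to the identity
```

In Muon, G is the Nesterov momentum buffer of a weight matrix, and the orthogonalized result (scaled by the learning rate) becomes the update.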

LR schedule: trapezoidal with 500-step linear warmup, constant phase, and 45% linear cooldown.
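The schedule can be written as a small multiplier function. A sketch under the stated hyperparameters; the total step count below is a placeholder, not the actual run's:

```python
def trapezoidal_lr(step, total_steps, warmup_steps=500,
                   cooldown_frac=0.45, peak_lr=1.0):
    """Trapezoidal schedule: linear warmup, constant plateau, then a
    linear cooldown over the final `cooldown_frac` of training."""
    cooldown_start = round(total_steps * (1 - cooldown_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup
    if step < cooldown_start:
        return peak_lr                        # constant plateau
    return peak_lr * max(0.0, (total_steps - step)
                         / (total_steps - cooldown_start))  # linear cooldown

total = 19_000  # placeholder step count for illustration
```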

Training

| Setting | Value |
|---|---|
| Data | 20B tokens of FineWeb-Edu |
| Tokenizer | GPT-2 (50,257 tokens, vocab padded to 50,304) |
| Hardware | 8x NVIDIA H200 SXM (Vast.ai) |
| Sequence length | 4,096 |
| Per-GPU batch | 8 sequences |
| Gradient accumulation | 4 steps |
| Effective batch | 256 sequences (8 GPUs x 8 seq x 4 accum), ~1.05M tokens/step at 4,096 seq length |
| Precision | bf16 |
| Compilation | torch.compile with max-autotune-no-cudagraphs |
| Fused CE loss | Liger kernel (never materializes the full logit tensor) |
| Weight tying | Embedding and LM head share weights |
| Wall-clock time | ~21 hours |
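The table implies the total number of optimizer steps, which can be backed out directly (arithmetic only; the actual run may differ slightly due to rounding):

```python
tokens_per_step = 8 * 8 * 4 * 4096  # GPUs x per-GPU batch x grad accum x seq len
total_steps = 20_000_000_000 // tokens_per_step  # ~19k steps for 20B tokens
```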

Results

| Benchmark | Goedel-mHC-1B (1,009M) | Baseline (1,185M) |
|---|---|---|
| BPB (WikiText-2) ↓ | 1.087 | 1.130 |
| val_loss (FineWeb-Edu) ↓ | 2.645 | 2.686 |
| HellaSwag ↑ | 39.7% | 36.2% |
| ARC-Easy ↑ | 57.8% | 52.8% |
| ARC-Challenge ↑ | 24.3% | 23.9% |
| WinoGrande ↑ | 54.9% | 53.1% |

Both models trained on 20B tokens of FineWeb-Edu, 8xH200. The baseline uses GQA + SwiGLU + PreNorm + AdamW and has 15% more parameters.

Key result: Goedel-mHC-1B achieves 3.8% better BPB with 15% fewer parameters than the baseline, demonstrating that the combination of mHC + Gated GQA + ReLU² + NorMuon meaningfully improves parameter efficiency.
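The headline figure follows directly from the results table:

```python
bpb_mhc, bpb_base = 1.087, 1.130
improvement = (bpb_base - bpb_mhc) / bpb_base  # relative BPB reduction, ~3.8%
```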

Full Config

The complete resolved configuration used for training:

```yaml
model:
  dim: 2048
  n_layers: 24
  vocab_size: 50304

attention:
  type: gated_gqa
  num_heads: 16
  num_kv_heads: 4
  head_dim: 128
  qk_norm: true
  rope_theta: 10000

ffn:
  type: relu2
  intermediate_mult: 2.667

residual:
  type: mhc
  n_streams: 4

optim:
  type: muon
  lr: 3.0e-4
  muon_lr: 0.007
  normuon: true
  normuon_beta2: 0.95
  scheduler: trapezoidal
  cooldown_fraction: 0.45
  warmup_steps: 500
  weight_decay: 0.1
  max_grad_norm: 1.0

training:
  tokens: 20_000_000_000
  batch_size: 8
  seq_len: 4096
  grad_accum_steps: 4
  liger: true
  compile: true
  compile_mode: max-autotune-no-cudagraphs

data:
  shard_dir: data/fineweb_edu
```

Limitations

  • Undertrained. 20B tokens is far below modern standards; comparable 1B models typically train on 1–4T tokens. This model exists to validate architecture choices, not to compete on downstream benchmarks.
  • English only. FineWeb-Edu is English-language web text filtered for educational content.
  • Base model only. No instruction tuning, RLHF, or alignment. The model will not follow instructions reliably and may generate harmful or incorrect text.
  • Custom architecture required. mHC, Gated GQA, and ReLU² are not standard HuggingFace transformers architectures. You cannot load this model with AutoModelForCausalLM. Loading requires our custom codebase. Code release is coming.
  • Not suitable for production use. This is a research artifact for architecture exploration.

What's Next

  • 100B-token continued pretraining on FineWeb-HQ is currently in progress, which will bring this architecture closer to a properly trained model.
  • Full code release and technical writeup will accompany the 100B results.

Citation

If you use this model in your work, please cite:

```bibtex
@misc{goedel2026mhc1b,
  title={Goedel-mHC-1B: First Open 1B+ Language Model with Multi-Stream Hyperconnections},
  author={Goedel Machines},
  year={2026},
  url={https://huggingface.co/GoedelMachines/Goedel-mHC-1B}
}
```

This work builds on the Hyper-Connections papers:

```bibtex
@article{zhu2024hyperconnections,
  title={Hyper-Connections},
  author={Defa Zhu and Hongzhi Huang and Zihao Huang and Yutao Zeng and Yunyao Mao and Banggu Wu and Qiyang Min and Xun Zhou},
  journal={arXiv preprint arXiv:2409.19606},
  year={2024}
}
```

```bibtex
@article{xie2025mhc,
  title={Manifold-Constrained Hyper-Connections},
  author={Zhenda Xie and Yixuan Wei and Huanqi Cao and others},
  journal={arXiv preprint arXiv:2512.24880},
  year={2025}
}
```