GPT-OSS-20B-Pruned-28x: Lossless Expert Pruning
10.40 GB | 28/32 experts per layer | Zero quality loss | GGUF Q4_K_M | Apache 2.0
A losslessly pruned variant of OpenAI's GPT-OSS-20B that removes 4 experts per MoE layer (12.5% of experts) while preserving 100% of benchmark performance. The pruned model saves 1.27 GB (-10.9%) and fits comfortably in 16 GB of RAM for full GPU-resident inference.
Highlights
- Lossless pruning: MMLU 78%, GSM8K 92%, Code 80%; MMLU and Code match the unpruned original exactly (see the GSM8K note under Benchmark Results)
- 10.40 GB: fits in 16 GB RAM with room for KV cache and context
- ~55 tok/s on Apple M4 Pro (Metal GPU)
- Drop-in replacement: works with llama.cpp (build 7970+) with no code changes
- Reproducible: pruning plan and scripts included
Benchmark Results
| Benchmark | Original (11.67 GB) | Pruned (10.40 GB) | Delta |
|---|---|---|---|
| MMLU (0-shot, 100Q) | 78% | 78% | 0 pp |
| GSM8K (0-shot, 50Q) | 70%* | 92% (46/50) | -- |
| Code Generation (10Q) | 80% | 80% | 0 pp |
| Inference speed (M4 Pro) | ~55 tok/s | ~55 tok/s | negligible |
| VRAM usage | 11.1 GB | ~10.0 GB | -1.1 GB |
* Original GSM8K was measured on a 10-question subset; the 50-question evaluation was run only on the pruned model. MMLU and Code scores are directly comparable.
Model Details
| Property | Value |
|---|---|
| Base model | openai/gpt-oss-20b |
| Total parameters | 20.9B (3.6B active per token) |
| Architecture | Transformer with Sparse MoE (SwiGLU experts) |
| MoE layers | 24 |
| Experts per layer | 28 (pruned from 32) |
| Routing | Top-4, sigmoid activation |
| Attention | GQA (64 query heads, 8 KV heads), RoPE, RMSNorm |
| Context length | 128K tokens |
| Quantization | Q4_K_M GGUF (MXFP4 expert weights, Q5_0/Q5_K/Q8_0 attention) |
| File size | 10.40 GB (original: 11.67 GB) |
| License | Apache 2.0 |
How to Use
With llama.cpp (recommended)
# Download the model
# Place gpt-oss-20b-pruned-90pct.gguf in your models directory
# Run with llama-server (Metal GPU)
llama-server \
-m gpt-oss-20b-pruned-90pct.gguf \
--port 8090 \
-ngl 99 \
-c 4096
# Or run with llama-cli for one-shot inference
llama-cli \
-m gpt-oss-20b-pruned-90pct.gguf \
-ngl 99 \
-c 4096 \
-p "Explain the theory of relativity in simple terms."
With OpenAI-compatible API
Once llama-server is running, you can query it via any OpenAI-compatible client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8090/v1", api_key="none")
response = client.chat.completions.create(
model="gpt-oss-20b",
messages=[{"role": "user", "content": "Write a Python function to compute Fibonacci numbers."}],
max_tokens=500,
temperature=0.7,
)
print(response.choices[0].message.content)
Requirements
- llama.cpp build 7970 or later
- 16 GB RAM minimum (model uses ~10-11 GB; rest for KV cache)
- Apple Silicon (Metal) recommended for best performance; CUDA and CPU also supported
- Set `max_tokens >= 500` -- the model uses an internal reasoning channel that consumes 100-200 tokens before the final answer
Pruning Methodology
Weight-Based Importance Scoring
Each expert is scored using a static, inference-free heuristic that combines two signals:
- Router weight L2 norm: The L2 norm of each expert's column in the router (gate) weight matrix. Experts with larger norms are more likely to be selected by the routing mechanism.
- Router bias (softmax-normalized): The sigmoid routing bias term, softmax-normalized across experts within each layer, reflecting the model's learned preference for each expert.
The final importance score for expert e in layer l is:
importance(l, e) = L2_norm(router_weight[:, e]) * softmax(router_bias)[e]
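A minimal NumPy sketch of this scoring for a single layer (array names and shapes are illustrative assumptions; the actual GGUF tensor names differ):

import numpy as np

def expert_importance(router_weight: np.ndarray, router_bias: np.ndarray) -> np.ndarray:
    """Score experts in one MoE layer from router weights alone (no inference).

    router_weight: (hidden_dim, n_experts) gate projection for this layer (assumed layout)
    router_bias:   (n_experts,) routing bias for this layer
    """
    # L2 norm of each expert's column in the router weight matrix
    col_norms = np.linalg.norm(router_weight, axis=0)        # (n_experts,)

    # Softmax-normalize the routing bias across experts in this layer
    shifted = router_bias - router_bias.max()                # shift for numerical stability
    bias_softmax = np.exp(shifted) / np.exp(shifted).sum()   # (n_experts,)

    # importance(l, e) = L2_norm(router_weight[:, e]) * softmax(router_bias)[e]
    return col_norms * bias_softmax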
Layer-Adaptive Pruning
Experts are ranked by importance within each layer, and the 4 lowest-scoring experts are removed from every layer, reducing 32 experts to 28. The pruning plan is generated once and applied as a deterministic tensor-level operation on the GGUF file -- no fine-tuning or calibration inference is required.
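A minimal sketch of plan construction, assuming per-layer score arrays as produced by the scoring step above (function and plan layout are illustrative, not the exact schema used by gpt_oss_importance.py):

import json
import numpy as np

def build_pruning_plan(scores_per_layer: dict[int, np.ndarray], n_keep: int = 28) -> dict:
    """Keep the n_keep highest-scoring experts in every MoE layer."""
    plan = {}
    for layer, scores in scores_per_layer.items():
        keep = np.argsort(scores)[-n_keep:]            # indices of the top n_keep experts
        plan[str(layer)] = sorted(int(e) for e in keep)
    return plan

# e.g. json.dump(plan, open("plan_90pct.json", "w"), indent=2)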
GGUF Pruning Pipeline
The pruning is performed directly on the GGUF file using reprune_gguf.py:
- Parse the pruning plan (JSON) specifying which expert indices to keep per layer
- Copy non-expert tensors (attention, embeddings, norms) unchanged
- For each MoE layer, copy only the kept expert weight tensors, re-indexing them
- Patch the `expert_count` metadata from 32 to 28
- Output a valid, self-contained GGUF file
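The core tensor operation is index selection along the expert dimension. The sketch below illustrates the idea on an unquantized array, assuming expert weights are stacked along the first axis and a plan layout of layer index to kept expert indices; the real reprune_gguf.py operates on the quantized GGUF tensor blocks directly:

import json
import numpy as np

# Pruning plan: layer index -> expert indices to keep (illustrative layout,
# not necessarily the exact schema of plan_90pct.json)
with open("results/calibration/gpt_oss_20b/plan_90pct.json") as f:
    plan = json.load(f)

def prune_expert_tensor(tensor: np.ndarray, keep: list[int]) -> np.ndarray:
    """Keep only the selected experts from a (n_experts, ...) stacked expert tensor."""
    return tensor[np.asarray(keep, dtype=np.int64)]

# After slicing every per-layer expert tensor, the expert_count metadata value
# is rewritten from 32 to 28 so the runtime loads the smaller expert set.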
Pruning Cliff
We found a sharp quality cliff between 28 and 27 experts per layer:
| Experts/Layer | Size | MMLU | Notes |
|---|---|---|---|
| 32 (original) | 11.67 GB | 78% | Baseline |
| 28 (this model) | 10.40 GB | 78% | Lossless |
| 27 | 10.08 GB | 68% | -10 pp sudden drop |
| 25 | 9.44 GB | 52% | Severe degradation |
| 24 | 9.12 GB | 46% | Near-collapse |
With only 32 experts per layer (compared to 256+ in DeepSeek-V3 or 512 in Qwen3), each expert carries a disproportionately large share of the model's knowledge. This coarse granularity creates a narrow safe pruning window: 28/32 is lossless, but 27/32 triggers a catastrophic cliff.
Comparison with Original
| | Original | Pruned (this model) |
|---|---|---|
| File size | 11.67 GB | 10.40 GB (-10.9%) |
| Experts/layer | 32 | 28 |
| Total experts | 768 | 672 (-12.5%) |
| MMLU | 78% | 78% |
| GSM8K | 70% (10Q) | 92% (50Q) |
| Code | 80% | 80% |
| Speed | ~55 tok/s | ~55 tok/s |
| 16 GB RAM | Yes | Yes (more headroom) |
Limitations
- Narrow pruning window: Due to the coarse 32-expert architecture, only 4 experts can be safely removed per layer. Further pruning causes sharp quality degradation.
- Benchmark coverage: Quality was validated on MMLU (100Q), GSM8K (50Q), and code generation (10Q). Performance on specialized domains (e.g., medical, legal, multilingual) has not been evaluated.
- No fine-tuning: This is a pruned model with no post-pruning fine-tuning or knowledge distillation. While benchmarks show no degradation, edge-case behaviors may differ from the original.
- Reasoning channel token overhead: Like the original, the model uses an internal reasoning channel that consumes 100-200 tokens. Set `max_tokens >= 500` for reliable output.
- Quantization inherited: Expert weights use MXFP4 (4.25 bits/param) from the original GGUF; no additional quantization was applied.
Pruning Toolkit
The pruning scripts are available at goba-labs/moe-compression:
| Script | Description |
|---|---|
| `scripts/reprune_gguf.py` | GGUF expert pruning (supports bias tensors, expert_count patching) |
| `scripts/gpt_oss_importance.py` | Weight-based importance scoring for GPT-OSS-20B |
| `results/calibration/gpt_oss_20b/plan_90pct.json` | Pruning plan for this model (28/32 experts) |
Reproducing this model
# 1. Download the original GGUF
# e.g., from bartowski/openai_gpt-oss-20b-GGUF on HuggingFace
# 2. Compute importance scores (optional -- pre-computed plan is included)
python scripts/gpt_oss_importance.py
# 3. Apply pruning
python scripts/reprune_gguf.py \
original_gpt-oss-20b-Q4_K_M.gguf \
gpt-oss-20b-pruned-90pct.gguf \
results/calibration/gpt_oss_20b/plan_90pct.json
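As a quick sanity check, you can inspect the pruned file's metadata with the gguf Python package (pip install gguf); this sketch assumes the expert count is stored under llama.cpp's usual <arch>.expert_count key:

from gguf import GGUFReader

reader = GGUFReader("gpt-oss-20b-pruned-90pct.gguf")

# Print every expert-related metadata field; the pruned file should report
# 28 experts per layer
for key, field in reader.fields.items():
    if "expert" in key:
        # scalar fields store their value in parts[data[0]]
        print(key, field.parts[field.data[0]])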
Citation
If you use this model or the pruning methodology in your research, please cite:
@misc{goba2026moe,
title = {Task-Aware Expert Pruning for Mixture-of-Experts Language Models on Consumer Hardware},
author = {Goba Labs},
year = {2026},
url = {https://github.com/goba-labs/moe-compression},
note = {Weight-based importance scoring achieves lossless 12.5\% expert pruning on GPT-OSS-20B}
}
Acknowledgments
- OpenAI for releasing GPT-OSS-20B under Apache 2.0
- llama.cpp for the GGUF format and inference runtime
- bartowski for the original Q4_K_M GGUF quantization
License
This model inherits the Apache 2.0 License from the base model (openai/gpt-oss-20b). The pruning scripts are also released under Apache 2.0.