GPT-OSS-20B-Pruned-28x: Lossless Expert Pruning

10.40 GB | 28/32 experts per layer | Zero quality loss | GGUF Q4_K_M | Apache 2.0

A losslessly pruned variant of OpenAI's GPT-OSS-20B that removes 4 experts per MoE layer (12.5% of experts) while preserving 100% of benchmark performance. The pruned model saves 1.27 GB (-10.9%) and fits comfortably in 16 GB of RAM for full GPU-resident inference.

Highlights

  • Lossless pruning: MMLU 78% and code generation 80%, identical to the unpruned original; GSM8K 92% on a 50-question set (see note under Benchmark Results)
  • 10.40 GB: fits in 16 GB RAM with room for KV cache and context
  • ~55 tok/s on Apple M4 Pro (Metal GPU)
  • Drop-in replacement: works with llama.cpp (build 7970+) with no code changes
  • Reproducible: pruning plan and scripts included

Benchmark Results

Benchmark                  Original (11.67 GB)   Pruned (10.40 GB)   Delta
MMLU (0-shot, 100Q)        78%                   78%                 0 pp
GSM8K (0-shot, 50Q)        70%*                  92% (46/50)         --
Code Generation (10Q)      80%                   80%                 0 pp
Inference speed (M4 Pro)   ~55 tok/s             ~55 tok/s           negligible
VRAM usage                 11.1 GB               ~10.0 GB            -1.1 GB

* Original GSM8K was measured on a 10-question subset; the 50-question evaluation was run only on the pruned model. MMLU and Code scores are directly comparable.

Model Details

Property            Value
Base model          openai/gpt-oss-20b
Total parameters    20.9B (3.6B active per token)
Architecture        Transformer with Sparse MoE (SwiGLU experts)
MoE layers          24
Experts per layer   28 (pruned from 32)
Routing             Top-4, sigmoid activation
Attention           GQA (64 query heads, 8 KV heads), RoPE, RMSNorm
Context length      128K tokens
Quantization        Q4_K_M GGUF (MXFP4 expert weights, Q5_0/Q5_K/Q8_0 attention)
File size           10.40 GB (original: 11.67 GB)
License             Apache 2.0

How to Use

With llama.cpp (recommended)

# Download the model
# Place gpt-oss-20b-pruned-90pct.gguf in your models directory

# Run with llama-server (Metal GPU)
llama-server \
  -m gpt-oss-20b-pruned-90pct.gguf \
  --port 8090 \
  -ngl 99 \
  -c 4096

# Or run with llama-cli for one-shot inference
llama-cli \
  -m gpt-oss-20b-pruned-90pct.gguf \
  -ngl 99 \
  -c 4096 \
  -p "Explain the theory of relativity in simple terms."

With OpenAI-compatible API

Once llama-server is running, you can query it via any OpenAI-compatible client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8090/v1", api_key="none")

response = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[{"role": "user", "content": "Write a Python function to compute Fibonacci numbers."}],
    max_tokens=500,
    temperature=0.7,
)
print(response.choices[0].message.content)

Requirements

  • llama.cpp build 7970 or later
  • 16 GB RAM minimum (model uses ~10-11 GB; rest for KV cache)
  • Apple Silicon (Metal) recommended for best performance; CUDA and CPU also supported
  • Set max_tokens >= 500 -- the model uses an internal reasoning channel that consumes 100-200 tokens before the final answer

Pruning Methodology

Weight-Based Importance Scoring

Each expert is scored using a static, inference-free heuristic that combines two signals:

  1. Router weight L2 norm: The L2 norm of each expert's column in the router (gate) weight matrix. Experts with larger norms are more likely to be selected by the routing mechanism.
  2. Router bias (softmax-normalized): The sigmoid routing bias term, softmax-normalized across experts within each layer, reflecting the model's learned preference for each expert.

The final importance score for expert e in layer l is:

importance(l, e) = L2_norm(router_weight[:, e]) * softmax(router_bias)[e]
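In code, the per-layer score reduces to a few NumPy operations. The sketch below is a minimal illustration of the formula above, not the actual gpt_oss_importance.py; the array names and the (hidden_dim, n_experts) weight layout are assumptions made for the example.

import numpy as np

def expert_importance(router_weight: np.ndarray, router_bias: np.ndarray) -> np.ndarray:
    """Score all experts in one MoE layer.

    router_weight: (hidden_dim, n_experts) router/gate projection (assumed layout)
    router_bias:   (n_experts,) routing bias for the same layer
    Returns an (n_experts,) vector of importance scores.
    """
    # Signal 1: L2 norm of each expert's column in the router weight matrix
    col_norms = np.linalg.norm(router_weight, axis=0)

    # Signal 2: routing bias, softmax-normalized across experts within the layer
    z = router_bias - router_bias.max()          # shift for numerical stability
    bias_pref = np.exp(z) / np.exp(z).sum()

    # Final score: product of the two signals
    return col_norms * bias_pref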

Layer-Adaptive Pruning

Experts are ranked by importance within each layer, and the bottom k experts are removed uniformly (4 experts per layer, reducing from 32 to 28). The pruning plan is generated once and applied as a deterministic tensor-level operation on the GGUF file -- no fine-tuning or calibration inference is required.
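As a concrete illustration, a pruning plan can be generated from the per-layer scores as below. This is a sketch only: the JSON schema shown here (layer index as key, list of kept expert indices as value) is assumed for the example and may differ from the plan_90pct.json shipped with the toolkit.

import json
import numpy as np

def build_pruning_plan(importance_by_layer, keep=28):
    """importance_by_layer: list of (n_experts,) score arrays, one per MoE layer."""
    plan = {}
    for layer_idx, scores in enumerate(importance_by_layer):
        # Keep the `keep` highest-scoring experts, drop the rest
        kept = np.argsort(scores)[::-1][:keep]
        # Sort the surviving indices so the GGUF re-indexing step is deterministic
        plan[str(layer_idx)] = sorted(int(i) for i in kept)
    return plan

# Example: 24 MoE layers, 32 experts each, random scores purely for illustration
rng = np.random.default_rng(0)
plan = build_pruning_plan([rng.random(32) for _ in range(24)])
with open("plan_90pct.json", "w") as f:
    json.dump(plan, f, indent=2)

Keeping the surviving indices in their original relative order lets the router columns and expert tensors be re-indexed consistently, so routing semantics for the kept experts are unchanged.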

GGUF Pruning Pipeline

The pruning is performed directly on the GGUF file using reprune_gguf.py:

  1. Parse the pruning plan (JSON) specifying which expert indices to keep per layer
  2. Copy non-expert tensors (attention, embeddings, norms) unchanged
  3. For each MoE layer, copy only the kept expert weight tensors, re-indexing them
  4. Patch the expert_count metadata from 32 to 28
  5. Output a valid, self-contained GGUF file
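To make step 3 concrete, the sketch below shows only the bookkeeping half of the pipeline: reading a plan and deciding, per tensor name, whether a tensor is copied verbatim or sliced along the expert dimension. The blk.<layer>.ffn_{gate,up,down}_exps.* naming is the usual llama.cpp convention but is an assumption here; the real reprune_gguf.py additionally trims the router/gate and bias tensors, slices the quantized expert data block-by-block, and rewrites the GGUF header and expert_count metadata, all of which are omitted.

import json
import re

# Expert weight tensors in llama.cpp GGUF files are typically named like
# "blk.7.ffn_gate_exps.weight" / "blk.7.ffn_up_exps.weight" / "blk.7.ffn_down_exps.weight"
EXPERT_TENSOR_RE = re.compile(r"^blk\.(\d+)\.ffn_(gate|up|down)_exps\.")

def classify_tensor(name, plan):
    """Return ("copy", None) for non-expert tensors, or ("slice", kept_indices)
    for expert tensors, where kept_indices are the expert rows to retain."""
    m = EXPERT_TENSOR_RE.match(name)
    if m is None:
        return "copy", None            # attention, embeddings, norms: copy unchanged
    layer = m.group(1)                 # plan assumed keyed by layer index (as a string)
    return "slice", plan[layer]        # keep these experts, re-indexed to 0..27

with open("plan_90pct.json") as f:
    plan = json.load(f)

for name in ["blk.0.attn_q.weight", "blk.0.ffn_gate_exps.weight"]:
    print(name, classify_tensor(name, plan))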

Pruning Cliff

We found a sharp quality cliff between 28 and 27 experts per layer:

Experts/Layer     Size       MMLU   Notes
32 (original)     11.67 GB   78%    Baseline
28 (this model)   10.40 GB   78%    Lossless
27                10.08 GB   68%    Sudden 10 pp drop
25                9.44 GB    52%    Severe degradation
24                9.12 GB    46%    Near-collapse

With only 32 experts per layer (compared to 256+ in DeepSeek-V3 or 512 in Qwen3), each expert carries a disproportionately large share of the model's knowledge. This coarse granularity creates a narrow safe pruning window: 28/32 is lossless, but 27/32 triggers a catastrophic cliff.

Comparison with Original

Metric              Original    Pruned (this model)
File size           11.67 GB    10.40 GB (-10.9%)
Experts/layer       32          28
Total experts       768         672 (-12.5%)
MMLU                78%         78%
GSM8K               70% (10Q)   92% (50Q)
Code                80%         80%
Speed               ~55 tok/s   ~55 tok/s
Fits in 16 GB RAM   Yes         Yes (more headroom)

Limitations

  • Narrow pruning window: Due to the coarse 32-expert architecture, only 4 experts can be safely removed per layer. Further pruning causes sharp quality degradation.
  • Benchmark coverage: Quality was validated on MMLU (100Q), GSM8K (50Q), and code generation (10Q). Performance on specialized domains (e.g., medical, legal, multilingual) has not been evaluated.
  • No fine-tuning: This is a pruned model with no post-pruning fine-tuning or knowledge distillation. While benchmarks show no degradation, edge-case behaviors may differ from the original.
  • Reasoning channel token overhead: Like the original, the model uses an internal reasoning channel that consumes 100-200 tokens. Set max_tokens >= 500 for reliable output.
  • Quantization inherited: Expert weights use MXFP4 (4.25 bits/param) from the original GGUF; no additional quantization was applied.

Pruning Toolkit

The pruning scripts are available at goba-labs/moe-compression:

File                                              Description
scripts/reprune_gguf.py                           GGUF expert pruning (supports bias tensors, expert_count patching)
scripts/gpt_oss_importance.py                     Weight-based importance scoring for GPT-OSS-20B
results/calibration/gpt_oss_20b/plan_90pct.json   Pruning plan for this model (28/32 experts)

Reproducing This Model

# 1. Download the original GGUF
# e.g., from bartowski/openai_gpt-oss-20b-GGUF on HuggingFace

# 2. Compute importance scores (optional -- pre-computed plan is included)
python scripts/gpt_oss_importance.py

# 3. Apply pruning
python scripts/reprune_gguf.py \
  original_gpt-oss-20b-Q4_K_M.gguf \
  gpt-oss-20b-pruned-90pct.gguf \
  results/calibration/gpt_oss_20b/plan_90pct.json

Citation

If you use this model or the pruning methodology in your research, please cite:

@misc{goba2026moe,
  title   = {Task-Aware Expert Pruning for Mixture-of-Experts Language Models on Consumer Hardware},
  author  = {Goba Labs},
  year    = {2026},
  url     = {https://github.com/goba-labs/moe-compression},
  note    = {Weight-based importance scoring achieves lossless 12.5\% expert pruning on GPT-OSS-20B}
}

Acknowledgments

  • OpenAI for releasing GPT-OSS-20B under Apache 2.0
  • llama.cpp for the GGUF format and inference runtime
  • bartowski for the original Q4_K_M GGUF quantization

License

This model inherits the Apache 2.0 License from the base model (openai/gpt-oss-20b). The pruning scripts are also released under Apache 2.0.
