# gpt-oss-20b-reap-0.4-mxfp4-gguf
This repository contains a GGUF-quantized version of the `sandeshrajx/gpt-oss-20b-reap-0.4-mxfp4` model.
## Model Description
This model is a GGUF quantization of the REAP-pruned, MXFP4-quantized `openai/gpt-oss-20b` model.

- **Original Model:** `openai/gpt-oss-20b`
- **Pruning Method:** `reap` with a compression ratio of 0.4
- **First Quantization Method:** MXFP4 weight-only quantization
- **Second Quantization Method:** GGUF (Q8_0) using `llama.cpp`
- **Dataset used for pruning/quantization:** `theblackcat102/evol-codealpaca-v1`
The original MXFP4 quantization specifically targeted the "expert" layers of the model, skipping self-attention and router layers, as is standard practice for Mixture-of-Experts (MoE) models to optimize performance and reduce size. This GGUF quantization further reduces the model size for efficient inference with llama.cpp.
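As a rough illustration of that selection rule, the sketch below filters weight tensors by module name. This is not the actual Model-Optimizer code, and the name patterns (`experts`, `self_attn`, `router`) are assumptions based on common MoE checkpoint layouts:

```python
# Illustrative only: decide which weight tensors to quantize in an MoE
# checkpoint by module name. The patterns below are assumptions, not the
# actual logic used by convert_oai_mxfp4_weight_only.py.
def should_quantize(tensor_name: str) -> bool:
    if "self_attn" in tensor_name or "router" in tensor_name:
        return False  # keep attention and router weights in higher precision
    return "experts" in tensor_name  # quantize only expert (MoE MLP) weights

# Example: classify a few hypothetical tensor names
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.router.weight",
    "model.layers.0.mlp.experts.0.down_proj.weight",
]
for n in names:
    print(n, "->", "quantize" if should_quantize(n) else "skip")
```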
## Usage
You can use this model with llama.cpp or compatible GGUF loaders.
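For example, a minimal sketch using the `llama-cpp-python` bindings (the model filename is an assumption; substitute the GGUF file downloaded from this repository):

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
# The model filename below is an assumption; use your downloaded file.
from llama_cpp import Llama

llm = Llama(
    model_path="gpt-oss-20b-reap-0.4-mxfp4-q8_0.gguf",
    n_ctx=4096,  # context window; adjust to available memory
)

out = llm("Write a Python function that reverses a string.", max_tokens=256)
print(out["choices"][0]["text"])
```

With llama.cpp itself, the equivalent is to pass the same file to `llama-cli` via `-m`.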
## Quantization Details
The model was first pruned at a 0.4 compression ratio using `reap`, then quantized to MXFP4, and finally converted to GGUF (Q8_0) format using the `llama.cpp` conversion script.
**Pruning Commands Used:**
```bash
# First pass: run the observer only, to collect calibration statistics
python ./reap/src/reap/prune.py \
  --model-name "openai/gpt-oss-20b" \
  --run_observer_only true \
  --samples_per_category 32

# Second pass: prune the experts at a 0.4 compression ratio
python ./reap/src/reap/prune.py \
  --model-name "openai/gpt-oss-20b" \
  --compression-ratio 0.4 \
  --prune-method reap
```
**MXFP4 Quantization Command Used:**
```bash
# Quantize the pruned checkpoint's weights to MXFP4 (weight-only)
python Model-Optimizer/examples/gpt-oss/convert_oai_mxfp4_weight_only.py \
  --model_path /workspace/artifacts/gpt-oss-20b/evol-codealpaca-v1/pruned_models/reap-seed_42-0.4-mxfp4 \
  --output_path /workspace/artifacts/gpt-oss-20b/evol-codealpaca-v1/pruned_models/reap-seed_42-0.4-mxfp4-quantized
```
**GGUF Quantization Command Used:**
```bash
# Convert the MXFP4 checkpoint to GGUF at Q8_0
python llama.cpp/convert_hf_to_gguf.py \
  --outtype q8_0 \
  --outfile /path/to/output/gpt-oss-20b-reap-0.4-mxfp4-q8_0.gguf \
  /path/to/downloaded/mxfp4_model
```
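As an optional sanity check, the resulting file can be inspected with the `gguf` Python package that ships with llama.cpp. This is a sketch; the path placeholder is carried over from the command above:

```python
# Quick sanity check of the converted file with the gguf package
# (pip install gguf). The path placeholder matches the command above.
from gguf import GGUFReader

reader = GGUFReader("/path/to/output/gpt-oss-20b-reap-0.4-mxfp4-q8_0.gguf")
print(f"{len(reader.tensors)} tensors found")
for t in reader.tensors[:5]:  # print a few tensor names, shapes, and types
    print(t.name, t.shape, t.tensor_type)
```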
## License
(Please specify the license of the original model and any modifications)