
GPT-5-Distill-llama3.2-3B-Instruct

1. Model Overview

Model Type: Instruction-tuned Edge LLM (Llama 3.2 Architecture)

  • Base Model: unsloth/Llama-3.2-3B-Instruct
  • Parameters: ~3.2B (Optimized for Edge/Consumer GPU)
  • Training Method:
    • SFT (Supervised Fine-Tuning) using Unsloth & TRL
    • Knowledge Distillation: Trained on GPT-5 responses to mimic superior reasoning and tone
    • LoRA Config: r=32, alpha=32, targeting all linear projections
  • Max Context Length: 32K tokens (max_seq_length = 32768)
  • Quantization: GGUF builds provided (Q4_K_M, Q8_0, FP16)

This model is a high-efficiency distillation effort: it combines the lightweight, edge-ready architecture of Llama-3.2-3B with the high-quality conversational patterns of GPT-5. By filtering the LMSYS dataset for "normal" (flaw-free) responses, it aims to deliver flagship-level instruction following in a 3B-parameter package.


2. Intended Use Cases

✅ Recommended:

  • On-Device Chat: Small enough for laptops, phones, and low-VRAM GPUs.
  • Reasoning & Explanations: Distilled GPT-5 reasoning patterns help it give clearer, better-structured answers.
  • Summarization & Rewriting: Inherits strong English/Chinese capabilities from the dataset mix.
  • RAG Applications: The 32K context window allows moderate-sized documents to be processed in a single prompt (see the sketch below).
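
A minimal sketch of that long-context RAG pattern (the chunks and the retrieval step are placeholders; any retriever that returns text will do):

# Illustrative long-context prompt assembly for RAG. Retrieval itself is out
# of scope; `retrieved_chunks` stands in for your retriever's output.
retrieved_chunks = [
    "Doc 1: Q3 revenue grew 12% year-over-year ...",
    "Doc 2: Methodology notes for the revenue figures ...",
]
context = "\n\n".join(retrieved_chunks)
messages = [
    {"role": "system", "content": "Answer only from the provided context."},
    {"role": "user", "content": f"Context:\n{context}\n\nQuestion: How did Q3 revenue change?"},
]
# Pass `messages` to apply_chat_template as shown in Section 4.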

⚠️ Not Suitable For:

  • Math/Complex Coding: While capable, 3B models have limitations compared to 70B+ models in complex logic.
  • High-Stakes Medical/Legal Advice: Outputs should always be verified.
  • Hallucination-Free Tasks: Small models may still hallucinate facts.

3. Training Data & Methodology

The model was trained on a curated mix of ~104,000 high-quality samples:

(1) ds1: ShareGPT-Qwen3 Instruction Mix (~3,900 samples)

  • Source: Jackrong/ShareGPT-Qwen3-235B-A22B-Instuct-2507
  • Role: Provides diverse, multi-turn instruction-following data, strengthening the model's handling of complex prompts (mixed English and Chinese).

(2) ds2: LMSYS GPT-5 Teacher Responses (~100,000 samples)

  • Source: ytz20/LMSYS-Chat-GPT-5-Chat-Response
  • Filtering Logic:
    • Applied rigorous filtering: kept only samples where flaw == "normal", removing hallucinations, refusals, and badly formatted responses (see the sketch below).
    • Only clean, high-quality "Teacher" responses were used for distillation.
  • Role: Imparts the "GPT-5" conversational style, politeness, and reasoning structure to the smaller Llama model.
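
A minimal sketch of this filtering step with the Hugging Face datasets library (the `flaw` column name and the train split are assumptions taken from the description above; verify them against the dataset's actual schema):

from datasets import load_dataset

# Keep only teacher responses marked "normal", i.e. free of hallucinations,
# refusals, and formatting problems
ds2 = load_dataset("ytz20/LMSYS-Chat-GPT-5-Chat-Response", split="train")
clean = ds2.filter(lambda ex: ex["flaw"] == "normal")
print(f"Kept {len(clean)} of {len(ds2)} samples")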

Training Configuration:

  • Framework: Unsloth + Hugging Face TRL
  • Loss Masking: train_on_responses_only was enabled (loss is computed only on assistant responses, so the model learns to generate answers, not user prompts); see the sketch after this list.
  • Optimizer: AdamW 8-bit for efficiency.
  • Precision: Trained in 4-bit, exported to 16-bit and GGUF.
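
The pieces above wire together roughly as follows. This is a sketch assuming recent Unsloth/TRL versions: the LoRA values, optimizer, context length, and loss masking come from this card, while the batch size, step count, and toy dataset are placeholders.

from datasets import Dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel
from unsloth.chat_templates import train_on_responses_only

# 4-bit base-model load with the 32K training context stated above
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=32768,
    load_in_4bit=True,
)

# LoRA r=32 / alpha=32 on all linear projections
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Toy one-sample dataset; in practice, render the Section 3 mix through
# the Llama 3.2 chat template
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hi"},
     {"role": "assistant", "content": "Hello!"}],
    tokenize=False,
)
dataset = Dataset.from_dict({"text": [text]})

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=2,  # placeholder
        gradient_accumulation_steps=8,  # placeholder
        max_steps=10,                   # placeholder
        optim="adamw_8bit",
        output_dir="outputs",
    ),
)

# Compute loss only on assistant responses, masking user turns
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)
trainer.train()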

4. Prompt Format (Llama 3.2 Standard)

This model uses the standard Llama 3 / 3.2 prompt template.

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

{Your Prompt Here}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
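
If you build this string by hand (e.g., for raw llama.cpp prompts), you can double-check it against what the tokenizer renders; note that the bundled Llama 3.2 template may also inject extra metadata, such as a date line, into the system block:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Jackrong/GPT-5-Distill-llama3.2-3B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
# tokenize=False returns the rendered prompt string instead of token IDs
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))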

Python Inference Example:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Jackrong/GPT-5-Distill-llama3.2-3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load weights in bfloat16 and let device_map="auto" place layers automatically
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum mechanics to a 5-year-old."},
]

# Render the Llama 3.2 chat template and append the assistant header
# so the model starts generating its reply
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))

5. Key Features Summary

  • Super Lightweight: 3B parameters; runs on almost any modern consumer hardware.
  • GPT-5 Distilled: Learned from 100k+ clean GPT-5 outputs for superior tone.
  • Long Context: Supports up to 32K tokens, great for long conversations.
  • GGUF Ready: Available in Q4_K_M (very fast) and Q8_0 quantizations (see the sketch below).
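
For the GGUF builds, a minimal llama-cpp-python sketch (the quantized repo name appears on this page; the exact .gguf filename is an assumption, so check the repository's file list):

from llama_cpp import Llama

# Download and load the Q4_K_M build from the Hub (requires llama-cpp-python
# and huggingface_hub); the filename glob is an assumption
llm = Llama.from_pretrained(
    repo_id="Jackrong/GPT-5-Distill-llama3.2-3B-Instruct-GGUF",
    filename="*Q4_K_M.gguf",
    n_ctx=32768,  # full 32K window; reduce on low-RAM machines
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Name three good uses for a 3B edge model."},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])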

6. Acknowledgements

  • Unsloth: For the 2x faster training and 4-bit loading capabilities.
  • LMSYS Org: For providing the GPT-5 response dataset.
  • Meta AI: For the robust Llama-3.2 base model.

This project is an open research effort to bring "Big Model Intelligence" to "Small Model Footprints."

