Model Card for SmolLM-135M Layer-Pruned (~99M params)
This repository hosts a layer-pruned version of HuggingFaceTB/SmolLM-135M, reduced from ~135M to ~99M parameters (~26% smaller).
⚠️ Note: This model is intended as a starting point for knowledge distillation or fine-tuning, not as a final standalone model.
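The parameter figures quoted above can be sanity-checked with a little arithmetic. The sketch below counts weights for a LLaMA-style decoder with tied input/output embeddings. The architecture values (hidden size 576, MLP size 1536, 9 attention heads, 3 key/value heads, vocabulary 49152) are assumptions about the base model's configuration and should be verified against its `config.json`; `llama_param_count` is an illustrative helper, not a library function.

```python
# Rough parameter count for a LLaMA-style decoder with tied embeddings.
# All default values are ASSUMED base-model settings; check config.json.

def llama_param_count(num_layers, hidden=576, intermediate=1536,
                      num_heads=9, num_kv_heads=3, vocab=49152):
    head_dim = hidden // num_heads
    kv_dim = num_kv_heads * head_dim
    # Attention: q and o projections are hidden x hidden, k and v are hidden x kv_dim.
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim
    # Gated MLP: gate, up, and down projections.
    mlp = 3 * hidden * intermediate
    # Two RMSNorm weight vectors per decoder layer.
    norms = 2 * hidden
    per_layer = attn + mlp + norms
    # Embedding matrix (assumed shared with the output head) plus the final norm.
    return vocab * hidden + num_layers * per_layer + hidden


base = llama_param_count(30)
pruned = llama_param_count(20)
reduction = 1 - pruned / base

print(f"base:      {base / 1e6:.1f}M")    # base:      134.5M
print(f"pruned:    {pruned / 1e6:.1f}M")  # pruned:    99.1M
print(f"reduction: {reduction:.1%}")      # reduction: 26.3%
```

Under these assumptions the embeddings account for roughly 28M parameters that pruning does not touch, which is why removing a third of the layers yields about a 26% reduction rather than 33%.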
Model Details
Model Description
- Developed by: Independent (based on HuggingFaceTB/SmolLM-135M)
- Model type: Decoder-only causal language model (LLaMA-style)
- Language(s) (NLP): English (same as base model)
- License: Inherits license from HuggingFaceTB/SmolLM-135M
- Finetuned from model: HuggingFaceTB/SmolLM-135M
The model was pruned from 30 layers → 20 layers, achieving ~26% parameter reduction while keeping embeddings, the output head, and the final layer intact.
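The card does not say which ten layers were removed, so the following is only a sketch of how such pruning might be done, assuming evenly spaced layers are kept and the final layer is always retained. `select_layers_to_keep` and `prune_decoder_layers` are hypothetical helpers. The attribute path `model.model.layers` and the per-attention `layer_idx` field reflect how LLaMA-style models are laid out in recent Transformers releases and should be checked against the installed version. A stand-in object is used in place of a real model so the example runs without downloading weights.

```python
from types import SimpleNamespace


def select_layers_to_keep(num_layers, num_keep):
    """Return `num_keep` evenly spaced layer indices, including the first and last."""
    if not 1 <= num_keep <= num_layers:
        raise ValueError("num_keep must be between 1 and num_layers")
    if num_keep == 1:
        return [num_layers - 1]
    span, slots = num_layers - 1, num_keep - 1
    # Integer round-half-up of i * span / slots, which avoids float rounding issues.
    return [(2 * i * span + slots) // (2 * slots) for i in range(num_keep)]


def prune_decoder_layers(model, keep):
    """Keep only the decoder layers listed in `keep` (ASSUMED layout: model.model.layers)."""
    layers = model.model.layers
    # Rebuild the container with the same type (nn.ModuleList for a real model).
    model.model.layers = type(layers)([layers[i] for i in keep])
    # Re-number layers so that KV-cache indexing stays contiguous (assumed attribute).
    for new_idx, layer in enumerate(model.model.layers):
        attn = getattr(layer, "self_attn", None)
        if attn is not None and hasattr(attn, "layer_idx"):
            attn.layer_idx = new_idx
    model.config.num_hidden_layers = len(keep)
    return model


# Stand-in for a 30-layer model; a real run would load the base checkpoint instead.
toy = SimpleNamespace(
    model=SimpleNamespace(layers=[
        SimpleNamespace(self_attn=SimpleNamespace(layer_idx=i), original_idx=i)
        for i in range(30)
    ]),
    config=SimpleNamespace(num_hidden_layers=30),
)

keep = select_layers_to_keep(30, 20)
prune_decoder_layers(toy, keep)
print(len(toy.model.layers), toy.config.num_hidden_layers)  # 20 20
```

Embeddings and the output head live outside the layer list, so they are untouched by this operation, consistent with the description above.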
Model Sources
- Repository: HuggingFaceTB/SmolLM-135M (base)
- This model repo: current repository
- Paper: N/A
- Demo: N/A
Uses
Direct Use
- Educational / research purposes for studying pruning effects on transformer models.
- Lightweight inference where resources are limited.
Downstream Use
- Knowledge Distillation: Use this pruned model as the student, trained to match a larger teacher model.
- Fine-Tuning: Domain adaptation on specific datasets while benefiting from lower compute requirements.
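For the distillation use case, a common objective is the temperature-scaled KL divergence between teacher and student output distributions. The sketch below shows that loss for a single token position in plain Python. It is an illustration of the general technique, not the training recipe for this model; the function names and the default temperature are assumptions, and in practice the same computation would be done over batches with a tensor library.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    # The temperature-squared factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl


# Arbitrary example logits over a four-token vocabulary.
teacher = [2.0, 1.0, 0.1, -1.0]
matching_student = list(teacher)
mismatched_student = [0.0, 0.0, 0.0, 0.0]

loss_same = distillation_loss(matching_student, teacher)
loss_diff = distillation_loss(mismatched_student, teacher)
print(f"{loss_same:.4f}")  # 0.0000
print(loss_diff > 0)       # True
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is the signal a pruned student is trained to minimize.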
Out-of-Scope Use
- Production deployment without evaluation.
- High-stakes applications (medical, legal, safety-critical systems).
Bias, Risks, and Limitations
- As with the base model, it may generate biased or toxic text.
- Pruning reduces model capacity, so performance may drop without re-training.
- The model has not been benchmarked since pruning.
Recommendations
Users should:
- Perform task-specific fine-tuning or distillation before deployment.
- Benchmark against baselines to measure trade-offs in accuracy vs. efficiency.
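One common way to quantify the accuracy side of that trade-off is perplexity on held-out text, computed for both the base and the pruned model. The sketch below assumes per-token natural-log probabilities have already been collected from each model; `perplexity` and `relative_change` are illustrative helpers, and the numbers used are arbitrary placeholders rather than measurements of this model.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_change(baseline, candidate):
    """Fractional change of `candidate` relative to `baseline`."""
    return (candidate - baseline) / baseline


# Sanity check: assigning probability 0.25 to every token gives perplexity 4.
uniform_log_probs = [math.log(0.25)] * 8
ppl_uniform = perplexity(uniform_log_probs)
print(f"{ppl_uniform:.2f}")  # 4.00

# Placeholder values only: a perplexity rising from 10.0 to 12.5 is a 25% increase.
print(f"{relative_change(10.0, 12.5):.0%}")  # 25%
```

Reporting the relative perplexity change alongside the parameter or latency savings makes the cost of pruning explicit for a given dataset.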
How to Get Started with the Model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pruning does not change the vocabulary, so the tokenizer is loaded from the base model.
model_id = "abhinavv3/SmolLM-135M-layer-pruned-90M-raw"
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M")
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate up to 50 new tokens; output quality may be poor before re-training.
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))