DeepSeek-R1-Distill-Llama-8B-NVFP4
This is an NVFP4-quantized version of deepseek-ai/DeepSeek-R1-Distill-Llama-8B, optimized for inference on NVIDIA GPUs with TensorRT-LLM.
Quantization Details
| Property | Value |
|---|---|
| Base Model | deepseek-ai/DeepSeek-R1-Distill-Llama-8B |
| Quantization Method | NVFP4 (4-bit weights + 8-bit block scales) |
| Calibration Dataset | CNN/DailyMail |
| Calibration Samples | 512 |
| Tool | NVIDIA TensorRT Model Optimizer v0.35.0 |
| Export Format | Hugging Face |
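To give a feel for what NVFP4 does to a weight tensor, here is a minimal pure-Python sketch of block-wise quantization to the 4-bit E2M1 grid. It is an illustration only, not the Model Optimizer implementation: the function names are hypothetical, and the per-block scale is kept as a Python float, whereas real NVFP4 stores it in FP8 (E4M3) together with a per-tensor FP32 scale.

```python
import math
import random

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one scale across 16 consecutive values


def fake_quantize_block(block):
    """Quantize one block to the E2M1 grid, then dequantize it again.

    Assumption for illustration: the scale stays a full-precision float.
    """
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto 6.0, the E2M1 maximum.
    scale = amax / E2M1_VALUES[-1] if amax > 0 else 1.0
    out = []
    for x in block:
        # Pick the closest representable magnitude for the scaled value.
        magnitude = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        out.append(math.copysign(magnitude * scale, x))
    return out


def fake_quantize_nvfp4(weights):
    """Apply fake_quantize_block to each consecutive block of 16 values."""
    if len(weights) % BLOCK_SIZE:
        raise ValueError("length must be a multiple of the block size")
    result = []
    for start in range(0, len(weights), BLOCK_SIZE):
        result.extend(fake_quantize_block(weights[start:start + BLOCK_SIZE]))
    return result


random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(64)]
w_hat = fake_quantize_nvfp4(w)
print("max abs error:", max(abs(a - b) for a, b in zip(w, w_hat)))
# Storage cost: a 4-bit value plus an 8-bit scale amortized over 16 values.
print("bits per weight:", 4 + 8 / BLOCK_SIZE)
```

Because each block is scaled independently, a single outlier only degrades the 15 values that share its block rather than the whole tensor.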
Hardware Requirements
- GPU: NVIDIA GPU with FP4 support (Blackwell or newer)
- VRAM: ~40GB recommended
- Tested on: NVIDIA DGX Spark (GB10)
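A quick way to check a machine against the GPU requirement is to look at its CUDA compute capability. The sketch below assumes that FP4 tensor cores begin with Blackwell-class devices, which report compute capability 10.0 or higher; `supports_fp4` is a hypothetical helper and the threshold is a heuristic, not an official compatibility check. PyTorch is used only to query the device and is optional.

```python
def supports_fp4(major: int, minor: int = 0) -> bool:
    """Heuristic (assumption): compute capability 10.0+ means Blackwell or newer."""
    return (major, minor) >= (10, 0)


if __name__ == "__main__":
    try:
        import torch  # only needed to query the local GPU
    except ImportError:
        print("PyTorch is not installed; cannot query the GPU")
    else:
        if torch.cuda.is_available():
            major, minor = torch.cuda.get_device_capability(0)
            name = torch.cuda.get_device_name(0)
            print(f"{name}: compute capability {major}.{minor}, "
                  f"FP4 capable: {supports_fp4(major, minor)}")
        else:
            print("No CUDA device visible")
```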
Usage
With TensorRT-LLM
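```python
from tensorrt_llm import LLM, SamplingParams

# Load the NVFP4 checkpoint from the Hugging Face Hub
llm = LLM(model="amer8/DeepSeek-R1-Distill-Llama-8B-NVFP4")

# A single prompt returns a single RequestOutput; print only the generated text
output = llm.generate("Paris is great because", SamplingParams(max_tokens=64))
print(output.outputs[0].text)
```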
With TensorRT-LLM Server
```bash
trtllm-serve amer8/DeepSeek-R1-Distill-Llama-8B-NVFP4 \
  --backend pytorch \
  --port 8000
```
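Once the server is running, it can be queried over its OpenAI-compatible HTTP API. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on `localhost:8000` as in the command above; `build_chat_request` is a hypothetical helper, and the `max_tokens` value is an arbitrary choice.

```python
import json
import urllib.error
import urllib.request

MODEL = "amer8/DeepSeek-R1-Distill-Llama-8B-NVFP4"


def build_chat_request(prompt, base_url="http://localhost:8000", max_tokens=256):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_chat_request("What is the capital of France?")
    try:
        with urllib.request.urlopen(request) as response:
            reply = json.load(response)
        print(reply["choices"][0]["message"]["content"])
    except urllib.error.URLError as err:
        # Raised when the server is not running or not reachable
        print(f"Could not reach the server: {err}")
```

Any OpenAI-compatible client library should work the same way when pointed at the server's `/v1` base URL.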
Limitations
- Requires TensorRT-LLM for inference
- Not compatible with the standard Transformers library
- Optimized for NVIDIA GPUs only
License
This model inherits the license from the base model. See DeepSeek license.
Acknowledgments