EmpathAI Llama 3.1 8B
EmpathAI Llama 3.1 8B is a Vietnamese customer-support assistant model fine-tuned for empathetic, policy-aware e-commerce conversations. The model was trained with SFT followed by DPO using the thanhhoangnvbg/empathAI-dpo-vi dataset.
The default main branch contains the merged BF16 full-weight export in sharded safetensors format. This is the recommended branch for Featherless AI / Hugging Face Inference Providers because it contains full merged weights rather than LoRA, QLoRA, GGUF, or bitsandbytes-only files.
Branch Layout
| Branch | Contents | Intended use |
|---|---|---|
| `main` | Merged BF16 full weights, sharded safetensors | Featherless AI / HF Inference Providers / Transformers |
| `v2-gguf` | GGUF Q4_K_M and Q5_K_M exports | llama.cpp, Ollama, local inference |
| `v2-adapter` | LoRA adapter only | Re-merge, continued training, adapter-based loading |
| `v1-bf16` | Archived v1 merged weights | Reproducibility |
| `v1-gguf` | Archived v1 GGUF exports | Reproducibility |
| `old_version` | Older archived full-weight branch | Reproducibility |
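The branch names in the table can be passed as the `revision` argument when downloading from the Hugging Face Hub. The following is a minimal sketch, assuming `huggingface_hub` is installed. The `REVISION_FOR` mapping, its use-case keys, and the `pick_revision` / `fetch_branch` helpers are hypothetical names introduced for illustration, not part of this repository.

```python
MODEL_ID = "thanhhoangnvbg/empathAI-llama3.1-8b"

# Hypothetical mapping from a use case to the branch names listed in the table.
REVISION_FOR = {
    "transformers": "main",
    "inference-providers": "main",
    "llama.cpp": "v2-gguf",
    "ollama": "v2-gguf",
    "adapter": "v2-adapter",
}


def pick_revision(use_case: str) -> str:
    """Return the branch to request for a use case, failing loudly on typos."""
    try:
        return REVISION_FOR[use_case.lower()]
    except KeyError:
        known = ", ".join(sorted(REVISION_FOR))
        raise ValueError(
            f"unknown use case {use_case!r}; expected one of: {known}"
        ) from None


def fetch_branch(use_case: str) -> str:
    """Download one branch of the repo and return the local directory path."""
    # Imported lazily so the mapping above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=MODEL_ID, revision=pick_revision(use_case))
```

For example, `fetch_branch("ollama")` would download the `v2-gguf` branch, while `fetch_branch("transformers")` would download `main`.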
Model Details
- Base model: `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit`
- Architecture family: Llama 3.1 8B Instruct
- Language: Vietnamese
- Fine-tuning methods: SFT, then DPO
- Training framework: Unsloth + TRL + Transformers
- Export on `main`: merged BF16 full weights in `safetensors`
- Adapter branch: `v2-adapter`
- GGUF branch: `v2-gguf`
Intended Use
This model is intended for Vietnamese e-commerce customer-service conversations, especially cases requiring calm tone, empathy, privacy awareness, and refusal to fabricate order status or compensation decisions.
Examples of intended behavior:
- acknowledge customer frustration without escalating tone;
- ask for non-sensitive order information when needed;
- avoid requesting OTPs, passwords, full card numbers, or full identity documents;
- avoid inventing delivery, refund, or warranty status without access to backend systems;
- redirect sensitive account operations to official support channels.
Out-of-Scope Use
This model should not be used as a source of truth for order status, payment status, refund eligibility, medical, legal, or financial advice. It does not have access to real business systems unless integrated with verified tools. Production deployments should add retrieval, policy checks, logging, human escalation, and safety filters.
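One of the policy checks mentioned above could be a keyword screen that routes a generated reply to human review when it mentions secrets the assistant should never request. The sketch below is a deliberately crude illustration: the term lists and the `flag_sensitive_terms` helper are assumptions, and the check will also flag harmless replies such as a warning not to share a password, so it is suited to escalation rather than automatic blocking.

```python
import re
import unicodedata

# Hypothetical term lists (English and Vietnamese); extend to match your own policy.
SENSITIVE_TERMS = {
    "otp": ["otp", "mã xác thực", "mã xác minh"],
    "password": ["password", "mật khẩu"],
    "card": ["số thẻ", "card number", "cvv"],
}


def _normalize(text: str) -> str:
    """Normalize Unicode composition and case so accented text compares reliably."""
    return unicodedata.normalize("NFC", text).casefold()


# One compiled pattern per label; lookarounds avoid matching inside longer words.
_PATTERNS = {
    label: re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(_normalize(t)) for t in terms) + r")(?!\w)"
    )
    for label, terms in SENSITIVE_TERMS.items()
}


def flag_sensitive_terms(reply: str) -> list[str]:
    """Return the labels of sensitive terms mentioned in a model reply."""
    text = _normalize(reply)
    return [label for label, pattern in _PATTERNS.items() if pattern.search(text)]
```

A non-empty result, for example `["otp", "password"]`, would mark the reply for review before it reaches the customer.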
Training Data
Dataset: thanhhoangnvbg/empathAI-dpo-vi
Observed split sizes used in this run:
| Split file | Rows | Purpose |
|---|---|---|
| `sft_train.jsonl` | 7,982 | SFT train |
| `sft_dev.jsonl` | 1,016 | SFT validation |
| `sft_test.jsonl` | 1,002 | SFT test |
| `dpo_train.jsonl` | 5,139 | DPO train |
| `dpo_dev.jsonl` | 651 | DPO validation |
| `dpo_test.jsonl` | 664 | DPO test |
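The row counts above correspond to roughly an 80/10/10 split for both stages. The sketch below recomputes the percentages from the table and includes a small helper for counting rows in a local copy of a split file. It assumes the files are plain JSONL with one example per line; `split_shares` and `count_rows` are illustrative names.

```python
from pathlib import Path

# Row counts copied from the table above.
SFT_ROWS = {"sft_train.jsonl": 7982, "sft_dev.jsonl": 1016, "sft_test.jsonl": 1002}
DPO_ROWS = {"dpo_train.jsonl": 5139, "dpo_dev.jsonl": 651, "dpo_test.jsonl": 664}


def split_shares(rows: dict[str, int]) -> dict[str, float]:
    """Return each split's share of the total, in percent, rounded to 2 decimals."""
    total = sum(rows.values())
    return {name: round(100 * count / total, 2) for name, count in rows.items()}


def count_rows(path: str) -> int:
    """Count non-empty lines in a JSONL file (assumes one JSON object per line)."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


if __name__ == "__main__":
    print("SFT total:", sum(SFT_ROWS.values()), split_shares(SFT_ROWS))
    print("DPO total:", sum(DPO_ROWS.values()), split_shares(DPO_ROWS))
```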
Training Configuration
Training was run on an NVIDIA L4 GPU with CUDA 12.4.
Key settings:
| Stage | Setting | Value |
|---|---|---|
| Shared | LoRA rank / alpha | r=32, lora_alpha=32 |
| Shared | LoRA targets | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Shared | Precision | BF16 where supported |
| SFT | Epochs | 3 |
| SFT | Max length | 1024 |
| SFT | Batch / grad accumulation | 1 / 8 |
| SFT | Learning rate | 2e-4 |
| DPO | Epochs | 1 |
| DPO | Max length / prompt length | 1536 / 1024 |
| DPO | Batch / grad accumulation | 1 / 8 |
| DPO | Learning rate | 3e-5 |
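The batch and gradient-accumulation settings above imply an effective batch of 8 sequences per optimizer step. The sketch below works that out, along with an approximate optimizer-step count based on the row counts in the Training Data table and the LoRA scaling factor `lora_alpha / r`. It assumes a single GPU, no dropped final batch, and standard (not rank-stabilized) LoRA scaling; the step count a trainer actually logs may differ slightly depending on its rounding convention. The function names are illustrative.

```python
import math


def effective_batch_size(per_device_batch: int, grad_accum: int, num_devices: int = 1) -> int:
    """Sequences contributing to each optimizer step."""
    return per_device_batch * grad_accum * num_devices


def approx_optimizer_steps(rows: int, per_device_batch: int, grad_accum: int, epochs: int) -> int:
    """Approximate optimizer steps, rounding each epoch up to a whole step."""
    per_epoch = math.ceil(rows / effective_batch_size(per_device_batch, grad_accum))
    return per_epoch * epochs


def lora_scale(lora_alpha: int, r: int) -> float:
    """Standard LoRA scaling applied to the low-rank update (assumes no rsLoRA)."""
    return lora_alpha / r


if __name__ == "__main__":
    print("effective batch:", effective_batch_size(1, 8))
    print("SFT steps (approx):", approx_optimizer_steps(7982, 1, 8, epochs=3))
    print("DPO steps (approx):", approx_optimizer_steps(5139, 1, 8, epochs=1))
    print("LoRA scale:", lora_scale(32, 32))
```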
Runtime stack:
- Unsloth 2026.6.1
- Transformers 5.5.0
- TRL 0.24.0
- PyTorch 2.5.1+cu124
Evaluation
These metrics are internal validation/test metrics on the same dataset distribution, not an external benchmark.
SFT
| Metric | Value |
|---|---|
| Train runtime | 11,450 seconds |
| Train loss | 4.044 |
| Final dev eval loss | 0.4424 |
| SFT test loss | 0.4120 |
DPO
| Metric | Value |
|---|---|
| Train runtime | 13,280 seconds |
| Train loss | 0.0002717 |
| Final dev eval loss | 0.002075 |
| DPO test loss | 0.0006332 |
| DPO test reward accuracy | 1.0000 |
| DPO test chosen reward | 11.4892 |
| DPO test rejected reward | -6.4002 |
| DPO test reward margin | 17.8894 |
The DPO reward accuracy (1.0000) and reward margin (17.8894) are very high. The model therefore separates chosen from rejected answers strongly on the held-out DPO split, but that split shares the training data distribution, so these figures should not be read as a real-world benchmark.
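The reported reward margin is simply the chosen reward minus the rejected reward, which can be checked directly from the table. The sketch below also evaluates the per-example sigmoid DPO loss, `-log(sigmoid(margin))`, at the mean margin. This assumes the run used the default sigmoid DPO loss without label smoothing, which the card does not state. Because that loss is convex in the margin, the mean test loss can be larger than the loss at the mean margin, which is consistent with the reported 0.0006332.

```python
import math

# Values copied from the DPO table above.
CHOSEN_REWARD = 11.4892
REJECTED_REWARD = -6.4002
REPORTED_MARGIN = 17.8894
REPORTED_TEST_LOSS = 0.0006332


def sigmoid_dpo_loss(reward_margin: float) -> float:
    """-log(sigmoid(margin)), computed as a numerically stable softplus(-margin)."""
    return max(-reward_margin, 0.0) + math.log1p(math.exp(-abs(reward_margin)))


margin = CHOSEN_REWARD - REJECTED_REWARD
loss_at_mean_margin = sigmoid_dpo_loss(margin)

if __name__ == "__main__":
    print(f"recomputed margin: {margin:.4f} (reported {REPORTED_MARGIN})")
    print(f"loss at mean margin: {loss_at_mean_margin:.3e}")
    print(f"reported mean test loss: {REPORTED_TEST_LOSS}")
```

A gap between the two loss figures suggests that a small number of test pairs have margins well below the mean, which is one reason to inspect per-example rewards rather than rely on averages.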
How to Use
import torch
# Optional compatibility patch for torch 2.5.x + torchao 0.15.x environments.
for i in range(1, 8):
    if not hasattr(torch, f"int{i}"):
        setattr(torch, f"int{i}", torch.int8)
    if not hasattr(torch, f"uint{i}"):
        setattr(torch, f"uint{i}", torch.uint8)
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "thanhhoangnvbg/empathAI-llama3.1-8b"
tokenizer = AutoTokenizer.from_pretrained(model_id, revision="main")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision="main",
    dtype=torch.bfloat16,
    device_map="auto",
)
messages = [
    {
        "role": "system",
        "content": """Bạn là EmpathAI, trợ lý chăm sóc khách hàng e-commerce tiếng Việt.
Nguyên tắc:
- Không tự suy đoán trạng thái đơn hàng.
- Không tự suy đoán chính sách hoàn tiền, đổi trả hoặc bồi thường.
- Khi thiếu thông tin, hãy nói rõ rằng chưa thể xác nhận.
- Chỉ trả lời dựa trên thông tin được cung cấp.
- Nếu không đủ dữ liệu, hãy yêu cầu thông tin bổ sung hoặc hướng dẫn khách liên hệ bộ phận hỗ trợ.
- Trả lời ngắn gọn, lịch sự và đồng cảm.
Khi khách hỏi về:
- trạng thái đơn hàng
- hoàn tiền
- bồi thường
- đổi trả
mà chưa có đủ thông tin, hãy ưu tiên các cách diễn đạt:
- "Tôi chưa thể xác nhận từ thông tin hiện có."
- "Mình cần thêm thông tin để kiểm tra."
- "Tôi không có đủ dữ liệu để xác nhận điều đó."
""",
    },
    {
        "role": "user",
        "content": "Đơn hàng của tôi giao trễ 3 ngày rồi, shop có hoàn tiền không?",
    },
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
)
device = "cuda" if torch.cuda.is_available() else "cpu"
inputs = {key: value.to(device) for key, value in inputs.items()}
outputs = model.generate(
    **inputs,
    max_new_tokens=180,
    temperature=0.4,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))
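Because the system prompt tells the model to answer only from the information provided, an integration would typically pass backend-verified facts in with the conversation. The sketch below shows one possible way to build the `messages` list for that purpose. The `build_messages` helper, the fact keys, and the Vietnamese heading ("Verified information") are all assumptions for illustration; the model card does not prescribe a format for injected context.

```python
from typing import Optional


def build_messages(
    system_prompt: str,
    user_text: str,
    verified_facts: Optional[dict[str, str]] = None,
) -> list[dict[str, str]]:
    """Build a chat message list, appending backend-verified facts to the system turn."""
    system_content = system_prompt.strip()
    if verified_facts:
        fact_lines = "\n".join(f"- {key}: {value}" for key, value in verified_facts.items())
        # Illustrative heading; means "Verified information".
        system_content += "\n\nThông tin đã xác minh:\n" + fact_lines
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_text},
    ]
```

The returned list can be passed to `tokenizer.apply_chat_template` in place of the hand-written `messages` above. When no verified facts are available, the helper leaves the system prompt unchanged, so the model should fall back to saying it cannot confirm.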
Deployment Notes
- Use `main` for Featherless AI and HF Inference Providers.
- Use `v2-gguf` for llama.cpp/Ollama-style local inference.
- Use `v2-adapter` only if you want the LoRA adapter and plan to load it with the base model or re-merge it yourself.
- The default branch is intentionally BF16 full weights, not bitsandbytes 4-bit, because Featherless AI expects full safetensors weights and handles serving-side optimization.
Limitations
- The model is specialized for Vietnamese e-commerce support and may underperform outside this domain.
- The model can still hallucinate if asked for facts not present in context.
- It cannot verify real orders, refunds, shipping status, identity, or payment state without external tools.
- Internal DPO metrics are strong but do not replace external evaluation or human review.
- Production use should include policy enforcement, PII handling, monitoring, and escalation paths.
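As one illustration of the PII handling mentioned above, conversation text can be masked before it is written to logs. The sketch below uses simple regular expressions for card-like digit runs, Vietnamese-style mobile numbers (assumed here to be `0` or `+84` followed by nine digits), and email addresses. The patterns and the `mask_pii` helper are assumptions rather than a complete PII solution, and a long order number could be masked as a card number.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")
# Assumed mobile format: leading 0 or +84, then nine digits.
PHONE_RE = re.compile(r"(?<!\d)(?:\+84|0)\d{9}(?!\d)")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_pii(text: str) -> str:
    """Replace card-like numbers, phone numbers, and emails with placeholders."""
    # Cards are masked first so their digits are not partially matched as phones.
    text = CARD_RE.sub("[CARD]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return EMAIL_RE.sub("[EMAIL]", text)
```

Masking before logging keeps monitoring data useful for debugging tone and policy issues while reducing the amount of customer data retained.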