Instructions for using thanhhoangnvbg/empathAI-llama3.1-8b with libraries, inference servers, and local apps.

Libraries

How to use thanhhoangnvbg/empathAI-llama3.1-8b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="thanhhoangnvbg/empathAI-llama3.1-8b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("thanhhoangnvbg/empathAI-llama3.1-8b")
model = AutoModelForCausalLM.from_pretrained("thanhhoangnvbg/empathAI-llama3.1-8b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use thanhhoangnvbg/empathAI-llama3.1-8b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "thanhhoangnvbg/empathAI-llama3.1-8b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "thanhhoangnvbg/empathAI-llama3.1-8b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
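The same request can be sent from Python. The sketch below uses only the standard library. `build_chat_request` and `chat` are hypothetical helper names, the default base URL assumes the server started above is listening on `localhost:8000`, and the response is assumed to follow the OpenAI chat-completions shape.

```python
import json
import urllib.request

MODEL_ID = "thanhhoangnvbg/empathAI-llama3.1-8b"


def build_chat_request(user_text, base_url="http://localhost:8000", model=MODEL_ID):
    """Build (without sending) a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text, base_url="http://localhost:8000", timeout=60):
    """Send the request and return the assistant's reply text (needs a running server)."""
    request = build_chat_request(user_text, base_url=base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # Assumption: the server answers in the OpenAI chat-completions format.
    return body["choices"][0]["message"]["content"]
```

Once the server is up, `chat("What is the capital of France?")` should return the reply text. Passing `base_url="http://localhost:30000"` targets a server listening on port 30000 instead.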

Use Docker

docker model run hf.co/thanhhoangnvbg/empathAI-llama3.1-8b

SGLang

How to use thanhhoangnvbg/empathAI-llama3.1-8b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "thanhhoangnvbg/empathAI-llama3.1-8b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "thanhhoangnvbg/empathAI-llama3.1-8b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "thanhhoangnvbg/empathAI-llama3.1-8b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "thanhhoangnvbg/empathAI-llama3.1-8b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Unsloth Studio

How to use thanhhoangnvbg/empathAI-llama3.1-8b with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for thanhhoangnvbg/empathAI-llama3.1-8b to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for thanhhoangnvbg/empathAI-llama3.1-8b to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for thanhhoangnvbg/empathAI-llama3.1-8b to start chatting

Load model with FastModel

# Install Unsloth first (shell command): pip install unsloth
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name="thanhhoangnvbg/empathAI-llama3.1-8b",
    max_seq_length=2048,
)

Docker Model Runner
How to use thanhhoangnvbg/empathAI-llama3.1-8b with Docker Model Runner:
```
docker model run hf.co/thanhhoangnvbg/empathAI-llama3.1-8b
```

EmpathAI Llama 3.1 8B

EmpathAI Llama 3.1 8B is a Vietnamese customer-support assistant model fine-tuned for empathetic, policy-aware e-commerce conversations. The model was trained with supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO) using the thanhhoangnvbg/empathAI-dpo-vi dataset.

The default main branch contains the merged BF16 full-weight export in sharded safetensors format. This is the recommended branch for Featherless AI / Hugging Face Inference Providers because it contains full merged weights rather than LoRA, QLoRA, GGUF, or bitsandbytes-only files.

Branch Layout

| Branch | Contents | Intended use |
| --- | --- | --- |
| `main` | Merged BF16 full weights, sharded `safetensors` | Featherless AI / HF Inference Providers / Transformers |
| `v2-gguf` | GGUF `Q4_K_M` and `Q5_K_M` exports | llama.cpp, Ollama, local inference |
| `v2-adapter` | LoRA adapter only | Re-merge, continued training, adapter-based loading |
| `v1-bf16` | Archived v1 merged weights | Reproducibility |
| `v1-gguf` | Archived v1 GGUF exports | Reproducibility |
| `old_version` | Older archived full-weight branch | Reproducibility |
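As a sketch of how the branch layout maps to loading code, the hypothetical helper below turns an intended use into a branch name that can be passed as `revision`. The use-case labels are illustrative assumptions, not names defined by the repository.

```python
MODEL_ID = "thanhhoangnvbg/empathAI-llama3.1-8b"

# Illustrative use-case labels (assumptions) mapped to the branch names listed above.
BRANCH_FOR_USE = {
    "transformers": "main",
    "inference-providers": "main",
    "llama.cpp": "v2-gguf",
    "ollama": "v2-gguf",
    "adapter": "v2-adapter",
}


def revision_for(use_case):
    """Return the branch name to pass as `revision` for a given use case."""
    try:
        return BRANCH_FOR_USE[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}") from None
```

For example, `AutoModelForCausalLM.from_pretrained(MODEL_ID, revision=revision_for("transformers"))` pins a Transformers load to `main`, while `huggingface_hub.snapshot_download(MODEL_ID, revision=revision_for("ollama"))` would fetch the GGUF branch for local tools.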

Model Details

Base model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
Architecture family: Llama 3.1 8B Instruct
Language: Vietnamese
Fine-tuning methods: SFT, then DPO
Training framework: Unsloth + TRL + Transformers
Export on main: merged BF16 full weights in safetensors
Adapter branch: v2-adapter
GGUF branch: v2-gguf

Intended Use

This model is intended for Vietnamese e-commerce customer-service conversations, especially cases that require a calm tone, empathy, privacy awareness, and refusal to fabricate order status or compensation decisions.

Examples of intended behavior:

acknowledge customer frustration without escalating tone;
ask for non-sensitive order information when needed;
avoid requesting OTPs, passwords, full card numbers, or full identity documents;
avoid inventing delivery, refund, or warranty status without access to backend systems;
redirect sensitive account operations to official support channels.

Out-of-Scope Use

This model should not be used as a source of truth for order status, payment status, or refund eligibility, nor for medical, legal, or financial advice. It does not have access to real business systems unless integrated with verified tools. Production deployments should add retrieval, policy checks, logging, human escalation, and safety filters.
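One of the policy checks mentioned above could be an output-side filter that flags replies mentioning customer secrets. The following is a minimal sketch and not part of the model or its training pipeline: the function names and the keyword list are assumptions, and keyword matching is only a coarse first pass.

```python
import re
import unicodedata

# Hypothetical patterns for secrets a support reply should not request.
# Vietnamese terms: "mật khẩu" = password, "số thẻ" = card number, "mã pin" = PIN code.
_RAW_PATTERNS = [r"\botp\b", r"mật khẩu", r"password", r"số thẻ", r"\bcvv\b", r"mã pin"]

# Normalise to NFC so composed and decomposed Vietnamese diacritics compare equal.
SENSITIVE_PATTERNS = [re.compile(unicodedata.normalize("NFC", p)) for p in _RAW_PATTERNS]


def find_sensitive_requests(reply):
    """Return the patterns that match a model reply (empty list means no match)."""
    text = unicodedata.normalize("NFC", reply).lower()
    return [p.pattern for p in SENSITIVE_PATTERNS if p.search(text)]


def needs_human_review(reply):
    """Flag the reply for a human agent when any sensitive pattern matches."""
    return bool(find_sensitive_requests(reply))
```

A reply that warns the customer never to share an OTP also matches, which is why the sketch routes matches to review rather than blocking them outright.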

Training Data

Dataset: thanhhoangnvbg/empathAI-dpo-vi

Observed split sizes used in this run:

| Split file | Rows | Purpose |
| --- | --- | --- |
| `sft_train.jsonl` | 7,982 | SFT train |
| `sft_dev.jsonl` | 1,016 | SFT validation |
| `sft_test.jsonl` | 1,002 | SFT test |
| `dpo_train.jsonl` | 5,139 | DPO train |
| `dpo_dev.jsonl` | 651 | DPO validation |
| `dpo_test.jsonl` | 664 | DPO test |
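The row counts above work out to roughly an 80/10/10 split for both stages. A small sketch of that arithmetic, with `split_fractions` as a hypothetical helper name:

```python
# Row counts copied from the split table above.
SPLITS = {
    "sft": {"train": 7982, "dev": 1016, "test": 1002},
    "dpo": {"train": 5139, "dev": 651, "test": 664},
}


def split_fractions(rows):
    """Return each split's share of the stage total, rounded to four decimals."""
    total = sum(rows.values())
    return {name: round(count / total, 4) for name, count in rows.items()}


for stage, rows in SPLITS.items():
    print(stage, sum(rows.values()), split_fractions(rows))
```

The SFT files total 10,000 rows and the DPO files total 6,454 rows.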

Training Configuration

Training was run on an NVIDIA L4 GPU with CUDA 12.4.

Key settings:

| Stage | Setting | Value |
| --- | --- | --- |
| Shared | LoRA rank / alpha | `r=32`, `lora_alpha=32` |
| Shared | LoRA targets | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| Shared | Precision | BF16 where supported |
| SFT | Epochs | 3 |
| SFT | Max length | 1024 |
| SFT | Batch / grad accumulation | 1 / 8 |
| SFT | Learning rate | 2e-4 |
| DPO | Epochs | 1 |
| DPO | Max length / prompt length | 1536 / 1024 |
| DPO | Batch / grad accumulation | 1 / 8 |
| DPO | Learning rate | 3e-5 |
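To make the batch settings concrete, the sketch below estimates the effective batch size and the number of optimizer updates per stage. It assumes a single GPU, that every row of the training split files was used, and that the last partial batch of each epoch is kept; the step counts actually logged by the trainer may differ slightly. `optimizer_steps` is a hypothetical helper name.

```python
import math


def optimizer_steps(num_rows, per_device_batch, grad_accum, epochs, num_devices=1):
    """Estimate optimizer updates, assuming the last partial batch is kept."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_rows / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# Row counts come from sft_train.jsonl and dpo_train.jsonl.
sft_batch, sft_steps = optimizer_steps(7982, per_device_batch=1, grad_accum=8, epochs=3)
dpo_batch, dpo_steps = optimizer_steps(5139, per_device_batch=1, grad_accum=8, epochs=1)
print(f"SFT: effective batch {sft_batch}, about {sft_steps} updates")
print(f"DPO: effective batch {dpo_batch}, about {dpo_steps} updates")
```

Under these assumptions both stages use an effective batch of 8 sequences, giving about 2,994 updates for SFT and about 643 for DPO.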

Runtime stack:

Unsloth 2026.6.1
Transformers 5.5.0
TRL 0.24.0
PyTorch 2.5.1+cu124

Evaluation

These metrics are internal validation/test metrics on the same dataset distribution, not an external benchmark.

SFT

| Metric | Value |
| --- | --- |
| Train runtime | 11,450 seconds |
| Train loss | 4.044 |
| Final dev eval loss | 0.4424 |
| SFT test loss | 0.4120 |

DPO

| Metric | Value |
| --- | --- |
| Train runtime | 13,280 seconds |
| Train loss | 0.0002717 |
| Final dev eval loss | 0.002075 |
| DPO test loss | 0.0006332 |
| DPO test reward accuracy | 1.0000 |
| DPO test chosen reward | 11.4892 |
| DPO test rejected reward | -6.4002 |
| DPO test reward margin | 17.8894 |

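The DPO reward accuracy (1.0000) and reward margin (17.8894) are very high: on the held-out DPO split, the model ranks the chosen answer above the rejected one in every pair. Because that split comes from the same distribution as the training data, these numbers show strong separation of chosen and rejected answers but should not be interpreted as a full real-world benchmark.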
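The relationship between these DPO metrics can be sketched in a few lines. The example assumes the standard sigmoid DPO loss without label smoothing, where each reward is already the β-scaled log-probability ratio against the reference model; `dpo_metrics` is a hypothetical helper, not the code used to produce the reported numbers.

```python
import math


def dpo_metrics(chosen_rewards, rejected_rewards):
    """Compute mean margin, reward accuracy, and mean sigmoid DPO loss over pairs."""
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    n = len(margins)
    accuracy = sum(m > 0 for m in margins) / n
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    losses = [max(-m, 0.0) + math.log1p(math.exp(-abs(m))) for m in margins]
    return {"margin": sum(margins) / n, "accuracy": accuracy, "loss": sum(losses) / n}


# Plugging in the reported mean rewards reproduces the reported margin.
reported = dpo_metrics([11.4892], [-6.4002])
print(round(reported["margin"], 4))
```

The loss computed from the mean margin alone is far smaller than the reported test loss of 0.0006332. That is expected, because the reported loss averages per-pair losses and is dominated by the pairs with the smallest margins.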

How to Use

import torch

# Optional compatibility patch for torch 2.5.x + torchao 0.15.x environments.
for i in range(1, 8):
    if not hasattr(torch, f"int{i}"):
        setattr(torch, f"int{i}", torch.int8)
    if not hasattr(torch, f"uint{i}"):
        setattr(torch, f"uint{i}", torch.uint8)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "thanhhoangnvbg/empathAI-llama3.1-8b"

tokenizer = AutoTokenizer.from_pretrained(model_id, revision="main")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision="main",
    dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "system",
        "content": """Bạn là EmpathAI, trợ lý chăm sóc khách hàng e-commerce tiếng Việt.

Nguyên tắc:
- Không tự suy đoán trạng thái đơn hàng.
- Không tự suy đoán chính sách hoàn tiền, đổi trả hoặc bồi thường.
- Khi thiếu thông tin, hãy nói rõ rằng chưa thể xác nhận.
- Chỉ trả lời dựa trên thông tin được cung cấp.
- Nếu không đủ dữ liệu, hãy yêu cầu thông tin bổ sung hoặc hướng dẫn khách liên hệ bộ phận hỗ trợ.
- Trả lời ngắn gọn, lịch sự và đồng cảm.

Khi khách hỏi về:
- trạng thái đơn hàng
- hoàn tiền
- bồi thường
- đổi trả

mà chưa có đủ thông tin, hãy ưu tiên các cách diễn đạt:
- "Tôi chưa thể xác nhận từ thông tin hiện có."
- "Mình cần thêm thông tin để kiểm tra."
- "Tôi không có đủ dữ liệu để xác nhận điều đó."
""",
    },
    {
        "role": "user",
        "content": "Đơn hàng của tôi giao trễ 3 ngày rồi, shop có hoàn tiền không?",
    },
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
)
# Move the inputs to the device the model was placed on by device_map="auto".
inputs = {key: value.to(model.device) for key, value in inputs.items()}

outputs = model.generate(
    **inputs,
    max_new_tokens=180,
    temperature=0.4,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))

Deployment Notes

Use main for Featherless AI and HF Inference Providers.
Use v2-gguf for llama.cpp/Ollama-style local inference.
Use v2-adapter only if you want the LoRA adapter and plan to load it with the base model or re-merge it yourself.
The default branch is intentionally BF16 full weights, not bitsandbytes 4-bit, because Featherless AI expects full safetensors weights and handles serving-side optimization.

Limitations

The model is specialized for Vietnamese e-commerce support and may underperform outside this domain.
The model can still hallucinate if asked for facts not present in context.
It cannot verify real orders, refunds, shipping status, identity, or payment state without external tools.
Internal DPO metrics are strong but do not replace external evaluation or human review.
Production use should include policy enforcement, PII handling, monitoring, and escalation paths.

