---
license: llama3.1
language:
- en
pipeline_tag: text-generation
datasets:
- Anthropic/hh-rlhf
base_model:
- allenai/Llama-3.1-Tulu-3-8B-SFT
library_name: peft
tags:
- lora
- peft
- dpo
- alignment
---

# Tülu3 8B aligned with DPO on HH-RLHF with β=0.01

This repo contains a LoRA adapter created by aligning [Tülu3 8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-SFT) on the [Anthropic HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset using Direct Preference Optimization (DPO). It was trained as part of a series of models for studying DPO alignment.

## Model details

* Base model: [allenai/Llama-3.1-Tulu-3-8B-SFT](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-SFT)
* Preference dataset: [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
* DPO beta: 0.01
* Training framework: PEFT/LoRA

See the base model card for usage and chat template details; a minimal adapter-loading sketch is included at the end of this card.

## Training hyperparameters

* Epochs: 1
* Batch size: 8
* Learning rate: 1e-05
* Learning rate scheduler: cosine
* Learning rate warmup ratio: 0.1
* Gradient accumulation steps: 2
* LoRA:
  * rank: 64
  * alpha: 64
  * dropout: 0.05
  * target modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]

A configuration sketch using these values also appears at the end of this card.

## License

This adapter is released under Meta's [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license/). Llama 3.1 is © Meta Platforms, Inc.

## Citation

If this work was helpful, please cite:

```
TBA
```
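
## Loading the adapter

A minimal loading sketch, assuming the standard `transformers` + `peft` workflow for a LoRA adapter on top of the Tülu3 SFT base model. The `adapter_id` below is a placeholder; substitute this repository's actual id.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "allenai/Llama-3.1-Tulu-3-8B-SFT"
adapter_id = "your-org/tulu3-8b-dpo-hh-rlhf-beta0.01"  # placeholder: replace with this repo's id

# Load the base model, then attach the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)

# Prompting uses the base model's chat template.
messages = [{"role": "user", "content": "Give me three tips for writing clear emails."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

## Configuration sketch

The card does not state which DPO trainer was used. The sketch below is only an illustration of how the hyperparameters listed above could be expressed with PEFT's `LoraConfig` and TRL's `DPOConfig`; it is not the original training script, and dataset preprocessing is omitted.

```python
from peft import LoraConfig
from trl import DPOConfig

# LoRA settings from the "Training hyperparameters" section.
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# DPO / optimizer settings from the "Training hyperparameters" section.
dpo_config = DPOConfig(
    beta=0.01,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    output_dir="tulu3-8b-dpo-hh-rlhf",  # placeholder output path
)
```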