---
license: llama3.1
language:
- en
pipeline_tag: text-generation
datasets:
- Anthropic/hh-rlhf
base_model:
- allenai/Llama-3.1-Tulu-3-8B-SFT
library_name: peft
tags:
- lora
- peft
- dpo
- alignment
---

# Tülu3 8B aligned with DPO on HH-RLHF with β=0.01

This repo contains a LoRA adapter created by aligning [Tülu3 8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-SFT) on the [Anthropic HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset using Direct Preference Optimization (DPO). It was trained as part of a series of models for studying DPO alignment.

## Model details

* Base model: [allenai/Llama-3.1-Tulu-3-8B-SFT](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-SFT)
* Preference dataset: [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
* DPO beta: 0.01
* Training framework: PEFT/LoRA

See the base model card for usage and chat template details; a minimal adapter-loading sketch is included at the end of this card.

## Training hyperparameters

* Epochs: 1
* Batch size: 8
* Learning rate: 1e-05
* Learning rate scheduler: cosine
* Learning rate warmup ratio: 0.1
* Gradient accumulation steps: 2
* LoRA:
  * rank: 64
  * alpha: 64
  * dropout: 0.05
  * target modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]

A configuration sketch using these values also appears at the end of this card.

## License

This adapter is released under Meta's [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license/). Llama 3.1 is © Meta Platforms, Inc.

## Citation

If this work was helpful, please cite:

```
TBA
```
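
## Loading the adapter

A minimal loading sketch, assuming the standard `transformers` + `peft` workflow for a LoRA adapter on top of the Tülu3 SFT base model. The `adapter_id` below is a placeholder; substitute this repository's actual id.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "allenai/Llama-3.1-Tulu-3-8B-SFT"
adapter_id = "your-org/tulu3-8b-dpo-hh-rlhf-beta0.01"  # placeholder: replace with this repo's id

# Load the base model, then attach the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)

# Prompting uses the base model's chat template.
messages = [{"role": "user", "content": "Give me three tips for writing clear emails."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

## Configuration sketch

The card does not state which DPO trainer was used. The sketch below is only an illustration of how the hyperparameters listed above could be expressed with PEFT's `LoraConfig` and TRL's `DPOConfig`; it is not the original training script, and dataset preprocessing is omitted.

```python
from peft import LoraConfig
from trl import DPOConfig

# LoRA settings from the "Training hyperparameters" section.
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# DPO / optimizer settings from the "Training hyperparameters" section.
dpo_config = DPOConfig(
    beta=0.01,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    output_dir="tulu3-8b-dpo-hh-rlhf",  # placeholder output path
)
```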