---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
datasets:
- HuggingFaceH4/ultrafeedback_binarized
base_model:
- alignment-handbook/zephyr-7b-sft-full
library_name: peft
tags:
- lora
- peft
- dpo
- alignment
---

# Zephyr 7B SFT aligned with DPO on UltraFeedback with β=0.01

This repo contains a LoRA adapter created by aligning [Zephyr 7B SFT](https://huggingface.co/alignment-handbook/zephyr-7b-sft-full) on the [UltraFeedback Binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) dataset using Direct Preference Optimization (DPO). It was trained as part of a series of models for studying DPO alignment.

## Model details

* Base model: [alignment-handbook/zephyr-7b-sft-full](https://huggingface.co/alignment-handbook/zephyr-7b-sft-full)
* Preference dataset: [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized)
* DPO beta: 0.01
* Training framework: PEFT/LoRA

See the base model card for usage and chat template details; a minimal loading sketch is also included at the end of this card.

## Training hyperparameters

* Epochs: 1
* Batch size: 16
* Learning rate: 1e-05
* Learning rate scheduler: cosine
* Learning rate warmup ratio: 0.1
* Gradient accumulation steps: 2
* LoRA:
  * rank: 64
  * alpha: 64
  * dropout: 0.05
  * target modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]

An illustrative training-configuration sketch based on these settings appears at the end of this card.

## License

This adapter is released under the Apache License 2.0.

## Citation

If this work was helpful, please cite:

```
TBA
```
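
## Loading the adapter (example sketch)

The snippet below is a minimal sketch of how an adapter like this can be loaded on top of the base model with 🤗 Transformers and PEFT. The adapter id is a placeholder (the repo path is not stated in this card), and the prompt formatting assumes the base tokenizer's chat template, as described on the base model card.

```python
# Minimal sketch: load the base SFT model, then apply this DPO LoRA adapter with PEFT.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "alignment-handbook/zephyr-7b-sft-full"
ADAPTER_ID = "<this-repo-id>"  # placeholder: replace with this repository's id

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER_ID)

# Format a single-turn prompt with the base model's chat template and generate.
messages = [{"role": "user", "content": "Explain what DPO does in one paragraph."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```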
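
## Training configuration (illustrative sketch)

The sketch below shows one way the hyperparameters listed above could map onto a TRL `DPOTrainer` + PEFT setup. It is an assumption about the setup, not the exact script used to train this adapter: the card does not name TRL, keyword arguments differ across TRL versions (e.g. `processing_class` vs. `tokenizer`), and how the batch size of 16 splits into per-device batch size and accumulation steps is a guess.

```python
# Illustrative DPO + LoRA training sketch (assumed setup, not the original training script).
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

BASE_ID = "alignment-handbook/zephyr-7b-sft-full"

model = AutoModelForCausalLM.from_pretrained(BASE_ID)
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

peft_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

args = DPOConfig(
    output_dir="zephyr-7b-dpo-lora",  # placeholder output path
    beta=0.01,
    num_train_epochs=1,
    per_device_train_batch_size=8,    # assumption: 8 per device x 2 accumulation steps = 16
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                   # with a PEFT adapter, the frozen base model serves as the reference
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,       # older TRL versions use `tokenizer=` instead
    peft_config=peft_config,
)
trainer.train()
```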