---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
datasets:
- HuggingFaceH4/ultrafeedback_binarized
base_model:
- alignment-handbook/zephyr-7b-sft-full
library_name: peft
tags:
- lora
- peft
- dpo
- alignment
---

# Zephyr 7B SFT aligned with DPO on UltraFeedback with β=0.01

This repo contains a LoRA adapter created by aligning [Zephyr 7B SFT](https://huggingface.co/alignment-handbook/zephyr-7b-sft-full) on the [UltraFeedback Binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) dataset using Direct Preference Optimization (DPO). It was trained as part of a series of models for studying DPO alignment.

## Model details

* Base model: [alignment-handbook/zephyr-7b-sft-full](https://huggingface.co/alignment-handbook/zephyr-7b-sft-full)
* Preference dataset: [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized)
* DPO beta: 0.01
* Training framework: PEFT/LoRA

See the base model card for usage and chat template details; a minimal loading sketch is also included at the end of this card.

## Training hyperparameters

* Epochs: 1
* Batch size: 16
* Learning rate: 1e-05
* Learning rate scheduler: cosine
* Learning rate warmup ratio: 0.1
* Gradient accumulation steps: 2
* LoRA:
  * rank: 64
  * alpha: 64
  * dropout: 0.05
  * target modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]

An illustrative training-configuration sketch based on these settings appears at the end of this card.

## License

This adapter is released under the Apache License 2.0.

## Citation

If this work was helpful, please cite:

```
TBA
```
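
## Loading the adapter (example sketch)

The snippet below is a minimal sketch of how an adapter like this can be loaded on top of the base model with 🤗 Transformers and PEFT. The adapter id is a placeholder (the repo path is not stated in this card), and the prompt formatting assumes the base tokenizer's chat template, as described on the base model card.

```python
# Minimal sketch: load the base SFT model, then apply this DPO LoRA adapter with PEFT.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "alignment-handbook/zephyr-7b-sft-full"
ADAPTER_ID = "<this-repo-id>"  # placeholder: replace with this repository's id

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER_ID)

# Format a single-turn prompt with the base model's chat template and generate.
messages = [{"role": "user", "content": "Explain what DPO does in one paragraph."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```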
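
## Training configuration (illustrative sketch)

The sketch below shows one way the hyperparameters listed above could map onto a TRL `DPOTrainer` + PEFT setup. It is an assumption about the setup, not the exact script used to train this adapter: the card does not name TRL, keyword arguments differ across TRL versions (e.g. `processing_class` vs. `tokenizer`), and how the batch size of 16 splits into per-device batch size and accumulation steps is a guess.

```python
# Illustrative DPO + LoRA training sketch (assumed setup, not the original training script).
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

BASE_ID = "alignment-handbook/zephyr-7b-sft-full"

model = AutoModelForCausalLM.from_pretrained(BASE_ID)
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

peft_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

args = DPOConfig(
    output_dir="zephyr-7b-dpo-lora",  # placeholder output path
    beta=0.01,
    num_train_epochs=1,
    per_device_train_batch_size=8,    # assumption: 8 per device x 2 accumulation steps = 16
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                   # with a PEFT adapter, the frozen base model serves as the reference
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,       # older TRL versions use `tokenizer=` instead
    peft_config=peft_config,
)
trainer.train()
```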