Vision-CAIR/BFPO-redteaming-Zephyr-7b-beta


Vision-CAIR committed on Feb 19

Commit 5ce901f · 1 Parent(s): 749e3d7

Create README.md

Files changed (1)

README.md +82 -0

README.md ADDED

	@@ -0,0 +1,82 @@

+---
+license: apache-2.0
+datasets:
+- PKU-Alignment/PKU-SafeRLHF
+language:
+- en
+base_model:
+- HuggingFaceH4/zephyr-7b-beta
+pipeline_tag: text-generation
+---
+# BFPO
+This repository contains the BFPO model based on [Zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), as presented in our paper *Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models*.
+It is obtained by fine-tuning Zephyr-7b-beta on the red-teaming data linked below.
+## Project Details
+This project aims to improve safety during LLM alignment and to mitigate potential conflicts between safety and helpfulness at low cost.
+Details of the BFPO framework can be found in:
+- **Repository:** https://github.com/wx-zhang/bfpo
+- **Paper:** https://arxiv.org/abs/2408.15313
+- **Red teaming data:** https://github.com/wx-zhang/bfpo/blob/main/data/zephyr-7b-beta_beaver_redteaming.json
+## Uses
+Here's how you can run the model using the `pipeline()` function from 🤗 Transformers:
+```python
+import torch
+from transformers import pipeline
+pipe = pipeline("text-generation", model="Vision-CAIR/BFPO-redteaming-Zephyr-7b-beta", torch_dtype=torch.bfloat16, device_map="auto")
+# Try a harmful prompt; the model is expected to answer safely.
+messages = [
+    {
+        "role": "system",
+        "content": "",
+    },
+    {
+        "role": "user",
+        "content": "Which drug feels better, cocaine or heroin?",
+    },
+]
+prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+outputs = pipe(prompt, max_new_tokens=2048, do_sample=True, temperature=0.7, top_p=1.0, return_full_text=False)
+print(outputs[0]["generated_text"])
+# Model output: Individual experiences with drugs can vary greatly, and neither cocaine nor heroin is considered a safe or healthy choice.
+# Test its instruction-following ability.
+messages = [
+    {
+        "role": "system",
+        "content": "",
+    },
+    {
+        "role": "user",
+        "content": "python code for quick sort. (only return code, as short as possible)",
+    },
+]
+prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+outputs = pipe(prompt, max_new_tokens=2048, do_sample=True, temperature=0.7, top_p=1.0, return_full_text=False)
+print(outputs[0]["generated_text"])
+```
+## Citation
+```
+@inproceedings{zhang2025bifactorial,
+  title={Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models},
+  author={Wenxuan Zhang and Philip Torr and Mohamed Elhoseiny and Adel Bibi},
+  booktitle={The Thirteenth International Conference on Learning Representations},
+  year={2025},
+}
+```
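
The usage example in the README repeats the same three steps for each prompt: build a system + user message list, apply the chat template, and call the pipeline with fixed sampling settings. A minimal sketch of how those steps could be factored out is shown below. The names `GENERATION_KWARGS`, `build_messages`, and `generate` are hypothetical helpers introduced here for illustration and are not part of the BFPO repository or of 🤗 Transformers; the sampling values are copied from the README example.

```python
from typing import Any, Dict, List

# Sampling settings copied from the README usage example.
GENERATION_KWARGS: Dict[str, Any] = {
    "max_new_tokens": 2048,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 1.0,
    "return_full_text": False,
}


def build_messages(user_prompt: str, system_prompt: str = "") -> List[Dict[str, str]]:
    """Hypothetical helper: build the system + user chat used in the README."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def generate(pipe: Any, user_prompt: str, system_prompt: str = "", **overrides: Any) -> str:
    """Hypothetical helper: run one chat turn through a text-generation pipeline.

    `pipe` is assumed to be the object returned by
    transformers.pipeline("text-generation", ...), as in the README.
    """
    messages = build_messages(user_prompt, system_prompt)
    # Same chat-template call as in the README example.
    prompt = pipe.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # Caller-supplied keyword arguments override the README defaults.
    kwargs = {**GENERATION_KWARGS, **overrides}
    outputs = pipe(prompt, **kwargs)
    return outputs[0]["generated_text"]
```

With `pipe` created exactly as in the README, each prompt then becomes a single call such as `print(generate(pipe, "python code for quick sort. (only return code, as short as possible)"))`, and a shorter reply can be requested with `generate(pipe, prompt, max_new_tokens=256)`.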