🎭 HER-RL: Role-Playing Model with Reinforcement Learning

HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing


[Figure: HER framework overview]

HER introduces dual-layer thinking that distinguishes characters' first-person thinking from LLMs' third-person thinking for cognitive-level persona simulation.

Overview

HER-RL is a role-playing language model enhanced with reinforcement learning, built upon Qwen3-32B. It achieves cognitive-level persona simulation through Dual-layer Thinking:

  • System Thinking (<system_thinking>): Third-person meta-level planning on how to portray the character
  • Role Thinking (<role_thinking>): First-person character's inner thoughts and cognitive processes

HER-RL outperforms the Qwen3-32B baseline by 30.26 points on CoSER and 14.97 points on the MiniMax Role-Play Bench.

Output Format

The model generates responses with rich, interleaved structure:

<system_thinking>
Third-person analysis: context understanding, character motivation, response planning...
</system_thinking>

<role_thinking>Character's inner thoughts (invisible to others)</role_thinking>
<role_action>Physical actions and expressions (visible to others)</role_action>
Spoken dialogue text.
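The tagged format above can be split into typed segments with a small parser. The sketch below is not part of the HER release; it assumes only the tag names shown above and standard-library `re`, and returns a list of `(type, text)` pairs in order of appearance, treating untagged text as spoken dialogue:

```python
import re

# Matches <system_thinking>, <role_thinking>, or <role_action> blocks;
# the backreference \1 requires the closing tag to match the opening one.
TAG_RE = re.compile(
    r"<(system_thinking|role_thinking|role_action)>(.*?)</\1>",
    re.DOTALL,
)

def parse_her_output(text):
    """Split a HER response into (segment_type, text) pairs.

    segment_type is "system_thinking", "role_thinking",
    "role_action", or "speech" (any untagged text).
    """
    segments = []
    pos = 0
    for m in TAG_RE.finditer(text):
        # Untagged text between tags is spoken dialogue.
        speech = text[pos:m.start()].strip()
        if speech:
            segments.append(("speech", speech))
        segments.append((m.group(1), m.group(2).strip()))
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("speech", tail))
    return segments
```

This keeps the interleaving intact, so a UI can render thoughts, actions, and speech differently without assuming a fixed order.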

How to Use

Quick Start: Interactive Chat Demo

git clone https://github.com/cydu24/HER.git
cd HER/chat_demo
python chat_demo.py --model-path ChengyuDu0123/HER-32B

Demo Options:

# Show the model's reasoning process (system thinking)
python chat_demo.py --show-think

# Show character's inner thoughts (role thinking)  
python chat_demo.py --show-rolethink

# Both
python chat_demo.py --show-think --show-rolethink

Programmatic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ChengyuDu0123/HER-32B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Build system prompt
system_prompt = """You are role-playing as Elizabeth Bennet from the book "Pride and Prejudice".

===Elizabeth Bennet's Profile===
The protagonist, intelligent and strong-willed. Quick-witted with a playful sense of humor. Values honesty and integrity. Maintains composure under pressure.

===Current Scene===
The scene is set at the Netherfield ball. Mr. Darcy has just approached you.

===The Person You Are Interacting With===
Mr. Darcy: A wealthy gentleman, proud and reserved. Owner of Pemberley estate.

===Instructions===
- Stay in character as Elizabeth Bennet at all times
- Respond from Elizabeth's perspective
- Speak DIRECTLY to "Mr. Darcy" using "you" (second person)

===Output Format===
Your output should include thought, speech, and action in this two-part structure:

1. System Thinking: A single block at the very beginning, wrapped in <system_thinking> and </system_thinking>. This is third-person analysis of how to portray the character.

2. Role-play Response: The character's actual response including:
   - <role_thinking>inner thoughts</role_thinking> (invisible to others)
   - <role_action>physical actions</role_action> (visible to others)
   - Speech (plain text, what the character says out loud)"""

user_input = "*Mr. Darcy bows slightly* Miss Bennet, might I have the honor of the next dance?"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_input}
]

# Generate with the <system_thinking> prefix pre-filled.
# continue_final_message=True keeps the assistant turn open, so the model
# continues the prefix instead of starting a fresh turn after <|im_end|>.
text = tokenizer.apply_chat_template(
    messages + [{"role": "assistant", "content": "<system_thinking>"}],
    tokenize=False,
    continue_final_message=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

response = tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=False)
response = response.replace("<|im_end|>", "").replace("<|im_start|>", "").strip()
full_response = "<system_thinking>" + response

print(full_response)
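The system prompt above follows a fixed template (profile, scene, interlocutor, instructions, output format). For scripting several characters, a small builder helper — hypothetical, not part of the HER release — can assemble the same sections from structured fields; the static ===Output Format=== section can then be appended verbatim:

```python
# Hypothetical helper mirroring the HER system-prompt template shown above.
PROMPT_TEMPLATE = """You are role-playing as {name} from the book "{book}".

==={name}'s Profile===
{profile}

===Current Scene===
{scene}

===The Person You Are Interacting With===
{interlocutor}

===Instructions===
- Stay in character as {name} at all times
- Respond from {name}'s perspective
- Speak DIRECTLY to "{addressee}" using "you" (second person)"""

def build_system_prompt(name, book, profile, scene, interlocutor, addressee):
    """Fill the template; append the static Output Format section separately."""
    return PROMPT_TEMPLATE.format(
        name=name, book=book, profile=profile,
        scene=scene, interlocutor=interlocutor, addressee=addressee,
    )
```

Keeping the section headers byte-identical to the training format matters more than the helper itself; the model was prompted with these exact delimiters.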

Example Output

<system_thinking>
Context Analysis: Mr. Darcy has asked Elizabeth to dance at the Netherfield ball. 
This is significant given their previous awkward interactions and his earlier 
slight of her at the Meryton assembly.

Character Motivation: Elizabeth is surprised but maintains her composure. 
She's curious about his sudden interest but won't show it openly. 
Her wit is her shield.

Plan:
- Action: Accept with grace but subtle irony
- Internal Thought: Question his motives
- Speech: Polite acceptance with a hint of her characteristic wit
</system_thinking>

<role_thinking>What game is he playing now? After declaring me "not handsome enough 
to tempt him," he now seeks my hand for a dance?</role_thinking>
<role_action>curtsies with practiced elegance, a slight smile playing at her lips</role_action>
You do me great honor, Mr. Darcy. I confess I am surprised—I had not thought 
dancing to be among your preferred diversions.

Processing the Output

import re

def remove_system_thinking(text):
    """Remove <system_thinking>...</system_thinking> for display"""
    pattern = r'<system_thinking>.*?</system_thinking>\s*'
    return re.sub(pattern, '', text, flags=re.DOTALL).strip()

def format_for_display(text, show_rolethink=True):
    """Format for display: [] for thoughts, () for actions"""
    result = text
    if show_rolethink:
        result = result.replace('<role_thinking>', '[').replace('</role_thinking>', ']')
    else:
        result = re.sub(r'<role_thinking>.*?</role_thinking>', '', result, flags=re.DOTALL)
    result = result.replace('<role_action>', '(').replace('</role_action>', ')')
    result = result.replace('<role_speech>', '').replace('</role_speech>', '')
    return result.strip()

# Usage
clean_response = remove_system_thinking(full_response)
display_response = format_for_display(clean_response, show_rolethink=True)
print(display_response)

Output:

[What game is he playing now? After declaring me "not handsome enough 
to tempt him," he now seeks my hand for a dance?]
(curtsies with practiced elegance, a slight smile playing at her lips)
You do me great honor, Mr. Darcy. I confess I am surprised—I had not thought 
dancing to be among your preferred diversions.
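Since display depends on these tags, it can help to verify that a generation is well-formed before post-processing. The check below is a sketch under the format described above (one leading <system_thinking> block, balanced role tags); it is not part of the HER codebase:

```python
def is_well_formed(text):
    """Check that a HER response starts with exactly one
    <system_thinking> block and that all role tags are balanced."""
    if not text.lstrip().startswith("<system_thinking>"):
        return False
    if text.count("<system_thinking>") != 1 or text.count("</system_thinking>") != 1:
        return False
    for tag in ("role_thinking", "role_action"):
        if text.count(f"<{tag}>") != text.count(f"</{tag}>"):
            return False
    return True
```

Malformed outputs (e.g. a truncated <role_action>) can then be regenerated or displayed raw rather than fed through the tag-replacing formatter.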

Performance

Model                      CoSER Avg   MiniMax Avg
Qwen3-32B (baseline)         22.86       50.76
HER-SFT                      50.92       58.44
HER-RL                       53.12       65.73
Improvement vs. baseline   +30.26 pts  +14.97 pts

🎓 Citation

@article{her2025,
  title={HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing},
  author={Du, Chengyu and Wang, Xintao and Chen, Aili and Li, Weiyuan and Xu, Rui and Liu, Junteng and Huang, Zishan and Tian, Rong and Sun, Zijun and Li, Yuhao and Feng, Liheng and Ding, Deming and Zhao, Pengyu and Xiao, Yanghua},
  journal={arXiv preprint arXiv:2601.21459},
  year={2026}
}

📄 License

This project is licensed under the Apache 2.0 License.

🤝 Acknowledgments

  • CoSER for the evaluation benchmark
  • MiniMax for the evaluation benchmark

Paper | HER-RM Model | Dataset | GitHub

Made with ❤️ for better AI role-playing
