Abstract
UnifiedReward-Flex combines reward modeling with flexible, context-adaptive reasoning to improve visual generation by dynamically constructing hierarchical assessments based on semantic intent and visual evidence.
Recent advances in multimodal reward models (RMs) have significantly propelled the development of visual generation. Existing frameworks typically adopt Bradley-Terry-style preference modeling or leverage generative VLMs as judges, and subsequently optimize visual generation models via reinforcement learning. However, current RMs suffer from inherent limitations: they often follow a one-size-fits-all paradigm that assumes a monolithic preference distribution or relies on fixed evaluation rubrics. As a result, they are insensitive to content-specific visual cues, leading to systematic misalignment with subjective, context-dependent human preferences. To address this, and inspired by human assessment, we propose UnifiedReward-Flex, a unified personalized reward model for visual generation that couples reward modeling with flexible, context-adaptive reasoning. Specifically, given a prompt and the generated visual content, it first interprets the semantic intent and grounds its judgment in the visual evidence, then dynamically constructs a hierarchical assessment by instantiating fine-grained criteria under both predefined and self-generated high-level dimensions. Our training pipeline follows a two-stage process: (1) we first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap supervised fine-tuning (SFT), equipping the model with flexible, context-adaptive reasoning behaviors; (2) we then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment. To validate its effectiveness, we integrate UnifiedReward-Flex into the GRPO framework for image and video synthesis, and extensive results demonstrate its superiority.
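As context for the two preference objectives the abstract mentions, the sketch below writes out the standard Bradley-Terry reward-modeling loss and the standard DPO objective in their common forms. The symbols (reward model $r_\phi$, policy $\pi_\theta$, frozen reference $\pi_{\mathrm{ref}}$, scale $\beta$, preferred/dispreferred outputs $y^{+}/y^{-}$) are generic notation for illustration, not the paper's exact losses.

```latex
% Standard Bradley-Terry reward-modeling loss over preference pairs (x, y^+, y^-):
% the reward model r_phi is trained so the preferred output scores higher.
\mathcal{L}_{\mathrm{BT}}(\phi) =
  -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})}
  \Big[ \log \sigma\big( r_{\phi}(x, y^{+}) - r_{\phi}(x, y^{-}) \big) \Big]

% Standard DPO objective: the model pi_theta is pushed to prefer y^+ over y^-
% relative to a frozen reference pi_ref, with strength controlled by beta.
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})}
  \Big[ \log \sigma\Big(
      \beta \log \tfrac{\pi_{\theta}(y^{+}\mid x)}{\pi_{\mathrm{ref}}(y^{+}\mid x)}
    - \beta \log \tfrac{\pi_{\theta}(y^{-}\mid x)}{\pi_{\mathrm{ref}}(y^{-}\mid x)}
  \Big) \Big]
```

In the abstract's framing, the Bradley-Terry loss is the preference-modeling style that existing reward frameworks typically adopt, while DPO is the second training stage applied to UnifiedReward-Flex on curated preference pairs after the SFT bootstrap.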
Community
- 🪐 Project Page: https://codegoat24.github.io/UnifiedReward/flex
- 🤗 Model Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-flex
- 🤗 Dataset: https://huggingface.co/datasets/CodeGoat24/UnifiedReward-Flex-SFT-90K
- 👋 Point of Contact: Yibin Wang
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model (2026)
- ReWorld: Multi-Dimensional Reward Modeling for Embodied World Models (2026)
- Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model (2025)
- Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis (2026)
- Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains (2025)
- Human detectors are surprisingly powerful reward models (2026)
- Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models (2026)
Models citing this paper: 6
Datasets citing this paper: 1
Spaces citing this paper: 0