Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception

This model is associated with the paper "Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception".

Code: https://github.com/YuHengsss/SD-RPN

Updates

  • Sep. 21st, 2025: We release the code, data, and checkpoints for SD-RPN.

TODO

  • Release code and weights for DeepSeek-VL

Introduction

While recent methods leverage a Region-of-Interest (RoI) mechanism to focus on salient areas, they typically present a difficult trade-off: training-based approaches depend on large-scale annotated datasets, while training-free methods that utilize the model's internal attention are computationally inefficient, requiring either multi-pass prefill stages or reliance on the slow auto-regressive decoding process for RoI identification. We propose an efficient, annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves this trade-off. Our core innovation is a pipeline that processes and denoises the noisy cross-attention maps from the MLLM's middle layers to generate pseudo-RoI labels. We then use these labels to train a lightweight and tunable Region Proposal Network (RPN) that is built upon the frozen MLLM backbone. Our RPN predicts the RoI in a single forward pass using features available from the MLLM's middle layers, completely decoupling RoI identification from the auto-regressive generation process and avoiding costly multi-pass operations.
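To make the pipeline concrete, below is a minimal, hypothetical PyTorch sketch of the two ingredients described above: turning a cross-attention map into binary pseudo-RoI labels via simple top-k thresholding, and training a lightweight RPN head on frozen middle-layer image-token features in a single forward pass. All tensor shapes, module names, and the thresholding rule are illustrative assumptions, not the paper's exact implementation.

    # Conceptual sketch of the SD-RPN idea (not the authors' implementation).
    # Assumptions: `middle_feats` are frozen hidden states of the image tokens
    # taken from an MLLM's middle layer, shape (B, N_img, D); the RoI is
    # represented as a binary mask over the image-token grid.
    import torch
    import torch.nn as nn

    class LightweightRPNHead(nn.Module):
        """A small tunable head that scores each image token as RoI / non-RoI."""
        def __init__(self, hidden_dim: int, mid_dim: int = 256):
            super().__init__()
            self.scorer = nn.Sequential(
                nn.Linear(hidden_dim, mid_dim),
                nn.GELU(),
                nn.Linear(mid_dim, 1),
            )

        def forward(self, image_token_feats: torch.Tensor) -> torch.Tensor:
            # One forward pass over middle-layer features; no autoregressive
            # decoding is involved in predicting the RoI.
            return self.scorer(image_token_feats).squeeze(-1)  # (B, N_img) logits

    def pseudo_roi_labels(cross_attn: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
        """Toy denoising of a text-to-image cross-attention map into binary pseudo-labels.

        `cross_attn` is assumed to be averaged over heads/text tokens, shape (B, N_img).
        Here we simply keep the top `keep_ratio` fraction of tokens; the paper's
        actual denoising pipeline is more involved.
        """
        k = max(1, int(cross_attn.shape[-1] * keep_ratio))
        thresh = cross_attn.topk(k, dim=-1).values[..., -1:]  # k-th largest value
        return (cross_attn >= thresh).float()

    if __name__ == "__main__":
        B, N_img, D = 2, 576, 4096                 # e.g. LLaVA-1.5 token count / hidden size
        middle_feats = torch.randn(B, N_img, D)    # stand-in for frozen MLLM features
        noisy_attn = torch.rand(B, N_img)          # stand-in for a noisy cross-attention map

        rpn = LightweightRPNHead(D)
        logits = rpn(middle_feats)
        labels = pseudo_roi_labels(noisy_attn)
        loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
        loss.backward()                            # only the RPN head receives gradients

Because the labels come from the model's own attention, no human RoI annotations are needed, and because the head reads middle-layer features, the RoI is available before any decoding step.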

Main Results

Getting Started with SD-RPN

We provide code and instructions to train and evaluate SD-RPN on top of LLaVA. Please follow the steps below.

  1. Clone this repository and navigate to the LLaVA folder

    git clone https://github.com/YuHengsss/SD-RPN.git
    cd LLaVA
    
  2. Install Package

    conda create -n llava_roi python=3.10 -y
    conda activate llava_roi
    pip install --upgrade pip  # enable PEP 660 support
    pip install -e .
    
  3. Install additional packages for training cases

    pip install -e ".[train]"
    pip install flash-attn==2.1.1 --no-build-isolation
    

    Note: we document fixes for common installation and training issues in issue_fix.

Inference

We use lmms-eval to evaluate the model. Please follow the steps below:

  1. To evaluate our pretrained model, download it and move it to your checkpoints folder.

    # export HF_ENDPOINT=https://hf-mirror.com # for China users
    # 7B
    huggingface-cli download YuhengSSS/llava-v1.5-7b-roi-K15T3-152k-v1bf16Mheads-twiginit-filled --repo-type model --local-dir ./
    
    # 13B: this is not the complete model; run migrate_weights to merge the weights.
    huggingface-cli download YuhengSSS/llava-v1.5-13b-roi-K15T3-152k-v1bf16Mheads-twiginit --local-dir ./  --repo-type model
    
  2. Install lmms-eval by following the script in lmms-eval/README.md.

  3. Run the evaluation script in lmms-eval, changing checkpoint_path to your own path.

    bash lmms-eval/examples/A6000/reproduce.sh
    

Citation

If you use this model, please cite the original paper:

@misc{shi2025catching,
      title={Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception}, 
      author={Yuheng Shi and Xiaohuan Pei and Minjing Dong and Chang Xu},
      year={2025},
      eprint={2509.16944},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

This project is based on LLaVA, lmms-eval and DeepSeek-VL. We sincerely thank the authors for their great work and open-sourcing the code.
