video-SALMONN 2+ (Qwen 2.5-VL Based video-SALMONN 2)

video-SALMONN 2+ is built on Qwen 2.5-VL using a pipeline similar to that of video-SALMONN 2. With a stronger base model and several further optimizations, video-SALMONN 2+ achieves SOTA results on audio-visual QA benchmarks, including Video-MME, WorldSense, AVUT, Video-Holmes, and DailyOmni, and on visual-only benchmarks, including MLVU and LVBench. Our 3B and 7B models achieve SOTA results at comparable scales, while the 72B model surpasses all other open-source systems.

GitHub Link

Paper Link

Results

How to Use

IMPORTANT: To reproduce our evaluation results, please use --max_frames 768 --max_pixels 61250. Evaluating at an excessively high resolution or frame rate can produce more input tokens than the model handles well, which may degrade performance.
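
As a rough sanity check on the token budget these flags imply, the sketch below estimates the number of visual tokens. It assumes the Qwen2.5-VL convention of 14x14 patches merged 2x2 (one token per 28x28 pixels), a temporal patch size of 2, and that --max_pixels applies per frame. None of this is stated in this card, and audio tokens are not counted, so treat the result as an estimate only.

```shell
#!/bin/sh
# Rough visual-token estimate for the recommended evaluation flags.
# ASSUMPTIONS (not from this model card):
#   - one visual token covers 28x28 pixels (14x14 patches, merged 2x2)
#   - two consecutive frames share one temporal step
#   - --max_pixels is a per-frame limit
max_frames=768
max_pixels=61250
pixels_per_token=$((28 * 28))
temporal_patch=2

tokens_per_frame=$((max_pixels / pixels_per_token))   # integer division
temporal_steps=$((max_frames / temporal_patch))
total_tokens=$((tokens_per_frame * temporal_steps))

echo "tokens per frame (approx): ${tokens_per_frame}"
echo "visual tokens for ${max_frames} frames (approx): ${total_tokens}"
```

Under these assumptions the recommended flags give roughly 30k visual tokens for a video that reaches the frame limit. Raising either flag scales that count proportionally.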

Prepare the dataset following the examples in scripts/example_av.json, scripts/example_v.json, scripts/example_dpo.json, and scripts/example_a.json.
Prepare the base audio model by modifying the path in gen_audio_model.py.
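
Before training, it may help to confirm that the dataset files are in place. The following is a minimal pre-flight sketch, meant to be run from the repository root. The check_json helper is hypothetical (not part of the repo); it only verifies that each file exists, is non-empty, and parses as JSON when python3 is available. It makes no assumption about the schema the training scripts expect.

```shell
#!/bin/sh
# Pre-flight check for the example dataset files named in this card.
# check_json is a hypothetical helper, not something the repo provides.
check_json() {
  f="$1"
  if [ ! -s "$f" ]; then
    echo "missing or empty: $f" >&2
    return 1
  fi
  # Only attempt a parse check if python3 is installed.
  if command -v python3 >/dev/null 2>&1; then
    python3 -m json.tool "$f" >/dev/null 2>&1 || {
      echo "not valid JSON: $f" >&2
      return 1
    }
  fi
  echo "ok: $f"
}

for f in scripts/example_av.json scripts/example_v.json \
         scripts/example_dpo.json scripts/example_a.json; do
  check_json "$f" || true   # report every file instead of stopping early
done
```

The same helper can be pointed at your own dataset file in place of path_to_dataset.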

To conduct audio alignment, use the following script:

bash scripts/train.sh --interval 0.1 --run_name audio_alignment --dataset path_to_dataset --lr 2e-5 --train_qformer --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --bs 16 --epoch 5 --save_steps 5000

To conduct audio-visual SFT, use the following script:

bash scripts/train.sh --interval 0.1 --run_name av_sft --dataset path_to_dataset --lr 2e-5 --train_qformer --train_proj --max_frames 768 --max_pixels 61250 --model audio_align_model --model_base path_to_audio_model --epoch 5 --save_steps 2000 --use_lora --lora_r 128 --lora_alpha 256

To conduct DPO, use the following script:

bash scripts/train.sh --interval 0.1 --run_name dpo --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model audio_visual_base --model_base audio_align_model --lora_ckpt audio_visual_checkpoint --train_type gdpo --use_lora --lora_r 128 --lora_alpha 256 --lr 5e-6 --epoch 1 --save_steps 200 --train_qformer --train_proj

To evaluate the 7B model, use the following script:

bash scripts/test.sh --interval 0.1 --run_name eval --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --lora_ckpt model_ckpt

To evaluate the 72B model, use the following script:

bash scripts/test_8.sh --interval 0.1 --run_name eval --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --lora_ckpt model_ckpt
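
Every command above passes the same --interval, --max_frames, and --max_pixels values. One way to keep them consistent is to define them once, as in the sketch below. The run wrapper and the DRY_RUN variable are hypothetical conveniences (not part of the repo), and the sketch assumes the scripts accept their flags in any order. By default it only prints the commands.

```shell
#!/bin/sh
set -eu

# Flags shared by every train and eval command in this card.
COMMON_FLAGS="--interval 0.1 --max_frames 768 --max_pixels 61250"

# DRY_RUN is a hypothetical convenience variable, not part of the repo.
# It defaults to 1, so commands are printed rather than executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# $COMMON_FLAGS is left unquoted on purpose so it splits into separate words.

# 7B evaluation
run bash scripts/test.sh $COMMON_FLAGS --run_name eval \
  --dataset path_to_dataset --model path_to_audio_model \
  --model_base path_to_audio_model --lora_ckpt model_ckpt

# 72B evaluation
run bash scripts/test_8.sh $COMMON_FLAGS --run_name eval \
  --dataset path_to_dataset --model path_to_audio_model \
  --model_base path_to_audio_model --lora_ckpt model_ckpt
```

Set DRY_RUN=0 to execute the commands instead of printing them. The training commands can be wrapped the same way.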

Model tree for tsinghua-ee/video-SALMONN-2_plus_7B

Base model: Qwen/Qwen2.5-VL-7B-Instruct

Adapter: this model

Paper for tsinghua-ee/video-SALMONN-2_plus_7B

video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

Paper • 2506.15220 • Published Jun 18, 2025