---
license: apache-2.0
tags:
- agent
pipeline_tag: image-to-video
---
# LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
This is the official model repository for [LiveTalk](https://github.com/GAIR-NLP/LiveTalk).
**LiveTalk** enables real-time multimodal interactive avatar video generation through an improved on-policy distillation approach. By distilling bidirectional diffusion models into causal, few-step autoregressive models, LiveTalk achieves over **20× speedup**, enabling a seamless real-time interactive experience.
## ⭐ Highlights
- **Real-Time Generation**: Achieves 24.82 FPS throughput with 0.33s first-frame latency.
- **Multimodal Conditioning**: Supports text, image, and audio inputs for flexible avatar control.
- **Efficient Inference**: Reduces inference time from ~83s to real-time through 4-step diffusion distillation.
- **Multi-Turn Coherence**: Demonstrates competitive performance against state-of-the-art models in multi-round interaction benchmarks.
- **End-to-End System**: Provides integration with audio language models for conversational AI applications.
## 🚀 Get started
### Installation
For detailed setup instructions, including environment configuration and dependency installation, please refer to the [official GitHub repository](https://github.com/GAIR-NLP/LiveTalk).
### Inference
Once the environment and checkpoints are prepared, you can execute the inference script:
```bash
bash ./scripts/inference.sh
```
**Input Requirements:**
- **Image**: Reference image of the person (JPG/PNG format).
- **Audio**: Speech audio file (WAV format, 16kHz sample rate recommended).
- **Text Prompt**: Description of the desired video characteristics.
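Before launching the script, it can help to sanity-check the audio input against the recommended format. The snippet below is a minimal, hedged sketch using only the Python standard library; the function name and file paths are illustrative and not part of the official LiveTalk codebase.

```python
# Minimal input sanity check for the audio requirement above
# (WAV, 16 kHz recommended). `check_audio` is an illustrative
# helper, not part of the LiveTalk repository.
import wave

def check_audio(path, expected_rate=16000):
    """Return (sample_rate, duration_s); warn if the rate differs."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    if rate != expected_rate:
        print(f"warning: {path} is {rate} Hz; {expected_rate} Hz is recommended")
    return rate, duration
```

If the rate differs, resampling to 16 kHz with a standard audio tool (e.g. `ffmpeg -i in.wav -ar 16000 out.wav`) before inference avoids mismatched-rate artifacts.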
## 🔍 Method Overview
LiveTalk addresses the challenges of distilling multimodal video diffusion models with an improved on-policy distillation recipe. It introduces curated multimodal conditions, converged ODE initialization, and aggressive optimization to eliminate training instabilities and artifacts such as flickering or black frames, while delivering high-quality, lip-synced results.
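To make the causal, few-step setup concrete, here is a purely conceptual sketch of how a distilled 4-step autoregressive generator streams video chunk by chunk. All names (`generate_chunk`, `toy_denoise_step`, `NUM_STEPS`) are illustrative stand-ins, not LiveTalk's actual API, and the "denoiser" is a toy function.

```python
# Conceptual sketch: a causal generator that refines each chunk in a
# few denoising steps, conditioning only on previously generated
# chunks. This is NOT LiveTalk's implementation, just the shape of
# few-step autoregressive sampling.
NUM_STEPS = 4  # distilled step count, per the 4-step distillation above

def toy_denoise_step(latent, prev_chunks, conditions, step):
    # Dummy refinement: move each latent value toward 1.0.
    return [x + (1.0 - x) / (NUM_STEPS - step) for x in latent]

def generate_chunk(prev_chunks, conditions, denoise_step):
    latent = [0.0] * 8  # start from "noise" (zeros for simplicity)
    for step in range(NUM_STEPS):
        # Causal: each step sees only past chunks, never future ones.
        latent = denoise_step(latent, prev_chunks, conditions, step)
    return latent

# Chunks stream out one at a time, so playback can begin after the
# first chunk instead of waiting for the full clip.
chunks = []
for _ in range(3):
    chunks.append(generate_chunk(chunks, {"audio": None, "text": None},
                                 toy_denoise_step))
```

The key property the sketch illustrates is that cost per chunk is fixed at a handful of denoising steps, which is what turns a ~50-step bidirectional teacher into a real-time streaming student.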
## 📚 Citation
If you find this work useful for your research, please cite:
```bibtex
@article{chern2025livetalk,
  title={LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation},
  author={Chern, Ethan and Hu, Zhulin and Tang, Bohao and Su, Jiadi and Chern, Steffi and Deng, Zhijie and Liu, Pengfei},
  journal={arXiv preprint arXiv:2512.23576},
  year={2025}
}
```