---
license: apache-2.0
tags:
- agent
pipeline_tag: image-to-video
---

# LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation


This is the official model repository for [LiveTalk](https://github.com/GAIR-NLP/LiveTalk). **LiveTalk** enables real-time multimodal interactive avatar video generation through an improved on-policy distillation approach. By distilling bidirectional diffusion models into causal, few-step autoregressive models, LiveTalk achieves over **20× speedup**, enabling a seamless real-time interactive experience.

*Figure: LiveTalk system overview.*
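
To make the distilled inference regime concrete, here is a minimal, self-contained toy sketch of chunk-wise causal autoregressive generation with few-step denoising. It is **not** the LiveTalk implementation: `ToyCausalDenoiser`, the chunk size, the 4-step loop, and the flat frame features are illustrative stand-ins for the real text-, image-, and audio-conditioned student model.

```python
import torch
import torch.nn as nn


class ToyCausalDenoiser(nn.Module):
    """Stand-in for a distilled few-step student; the real model is multimodal."""

    def __init__(self, frame_dim: int = 64):
        super().__init__()
        self.net = nn.Linear(frame_dim * 2, frame_dim)

    def forward(self, noisy_chunk: torch.Tensor, context_frame: torch.Tensor) -> torch.Tensor:
        # Predict a cleaner chunk from the noisy chunk and the last context frame.
        b, f, d = noisy_chunk.shape
        ctx = context_frame.unsqueeze(1).expand(b, f, d)
        return self.net(torch.cat([noisy_chunk, ctx], dim=-1))


@torch.no_grad()
def generate_video(model: nn.Module, num_chunks: int = 5, frames_per_chunk: int = 4,
                   frame_dim: int = 64, num_steps: int = 4) -> torch.Tensor:
    """Chunk-wise autoregressive rollout: each chunk is denoised in a few steps,
    emitted immediately (streaming), and then conditions the next chunk."""
    context = torch.zeros(1, frame_dim)  # e.g. features of the reference image
    chunks = []
    for _ in range(num_chunks):
        chunk = torch.randn(1, frames_per_chunk, frame_dim)  # start from noise
        for _ in range(num_steps):                           # few-step denoising
            chunk = model(chunk, context)
        chunks.append(chunk)
        context = chunk[:, -1]           # causal conditioning on the newest frame
    return torch.cat(chunks, dim=1)


video = generate_video(ToyCausalDenoiser())
print(video.shape)  # torch.Size([1, 20, 64])
```

Because the first chunk is usable as soon as its few denoising steps finish, this kind of rollout is what allows sub-second first-frame latency and streaming playback, in contrast to a bidirectional diffusion model that must denoise the whole clip before emitting anything.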

## ⭐ Highlights

- **Real-Time Generation**: Achieves 24.82 FPS throughput with 0.33 s first-frame latency.
- **Multimodal Conditioning**: Supports text, image, and audio inputs for flexible avatar control.
- **Efficient Inference**: Reduces inference time from ~83 s to real time through 4-step diffusion distillation.
- **Multi-Turn Coherence**: Demonstrates competitive performance against state-of-the-art models on multi-round interaction benchmarks.
- **End-to-End System**: Integrates with audio language models for conversational AI applications.

## 🚀 Get started

### Installation

For detailed setup instructions, including environment configuration and dependency installation, please refer to the [official GitHub repository](https://github.com/GAIR-NLP/LiveTalk).

### Inference

Once the environment and checkpoints are prepared, you can run the inference script:

```bash
bash ./scripts/inference.sh
```

**Input Requirements:**

- **Image**: Reference image of the person (JPG/PNG format).
- **Audio**: Speech audio file (WAV format, 16 kHz sample rate recommended; a small resampling sketch is given at the end of this card).
- **Text Prompt**: Description of the desired video characteristics.

## 🔍 Method Overview

LiveTalk addresses the challenges of distilling multimodal video diffusion models with an improved on-policy distillation recipe. It introduces curated multimodal conditions, converged ODE initialization, and aggressive optimization to eliminate training instabilities (such as flickering or black frames) while delivering high-quality, lip-synced results.

## 📚 Citation

If you find this work useful for your research, please cite:

```bibtex
@article{chern2025livetalk,
  title={LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation},
  author={Chern, Ethan and Hu, Zhulin and Tang, Bohao and Su, Jiadi and Chern, Steffi and Deng, Zhijie and Liu, Pengfei},
  journal={arXiv preprint arXiv:2512.23576},
  year={2025}
}
```
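
Relating to the input requirements above, the following is a small, optional preparation sketch for converting speech audio into the recommended 16 kHz mono WAV format. The file names are placeholders, and `librosa`/`soundfile` are just one convenient choice, not LiveTalk dependencies.

```python
import librosa
import soundfile as sf

# Load the recording, downmix to mono, and resample to 16 kHz.
audio, sr = librosa.load("my_recording.wav", sr=16000, mono=True)

# Write a 16 kHz mono WAV file to pass to the inference script.
sf.write("speech_16k.wav", audio, sr)
```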