---
license: apache-2.0
language:
- en
- zh
- th
- id
- vi
pipeline_tag: audio-text-to-text
tags:
- multimodal
- audio-language-model
- audio
base_model:
- mispeech/dasheng-0.6B
- Qwen/Qwen2.5-Omni-7B
base_model_relation: finetune
---
# MiDashengLM-7B-1021
MiDashengLM is an efficient audio-language model that achieves holistic audio understanding through caption-based alignment.
It achieves state-of-the-art performance on multiple audio understanding benchmarks while maintaining high inference efficiency, delivering a 3.2× throughput speedup and supporting batch sizes up to 512.
📖 For a more detailed introduction and the technical report, please visit our [GitHub repository](https://github.com/xiaomi-research/dasheng-lm).
Note that for most applications, we strongly recommend using the BF16 version ([mispeech/midashenglm-7b-1021-bf16](https://huggingface.co/mispeech/midashenglm-7b-1021-bf16)) for optimal performance and efficiency.
## Usage
### Load Model
```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
model_id = "mispeech/midashenglm-7b-1021-fp32" # Only for exact reproduction; otherwise strongly recommend "mispeech/midashenglm-7b-1021-bf16"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```
### Construct Prompt
```python
user_prompt = "Caption the audio." # You may try any other prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)
            },
        ],
    },
]
```
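The message structure above can be wrapped in a small helper when many clips need the same prompt. The following is a minimal sketch: `build_messages` is a hypothetical name, not part of the model's API, and the rule that exactly one of `path`, `url`, or `audio` is given is an assumption based on the "or" comments in the example.

```python
from typing import Any, Dict, List, Optional


def build_messages(
    user_prompt: str,
    *,
    path: Optional[str] = None,
    url: Optional[str] = None,
    audio: Optional[Any] = None,
) -> List[Dict[str, Any]]:
    """Build a single-turn conversation with one text item and one audio item.

    Hypothetical helper: the "exactly one audio source" check is an
    assumption, not a documented requirement of the processor.
    """
    sources = {"path": path, "url": url, "audio": audio}
    given = {key: value for key, value in sources.items() if value is not None}
    if len(given) != 1:
        raise ValueError("Provide exactly one of: path, url, audio")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": user_prompt},
                {"type": "audio", **given},
            ],
        }
    ]


# Produces the same structure as the hand-written example above.
messages = build_messages("Caption the audio.", path="/path/to/example.wav")
```

The resulting `messages` list can be passed to `processor.apply_chat_template` in the same way as the literal version.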
### Generate Output
```python
import torch

with torch.no_grad():
    # Tokenize the conversation and move the tensors to the model's device and dtype.
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    ).to(device=model.device, dtype=model.dtype)
    generation = model.generate(**model_inputs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
```
## Results
The following evaluation results are based on the model version: `mispeech/midashenglm-7b-1021-fp32`.
### Audio Captioning Results
| Domain | Dataset | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:--------:|:--------------:|:--------------:|:----------------:|:-------------------:|
| Music | MusicCaps | **59.11** | 43.71 | 35.43 |
| Music | Songdescriber | **46.42** | 45.31 | 44.63 |
| Sound | AudioCaps | **62.13** | 60.79 | 49.00 |
| Sound | ClothoV2 | **49.35** | 47.55 | 48.01 |
| Sound | AutoACD | **67.13** | 55.93 | 44.76 |
*Metrics: FENSE (higher is better).*
### Audio and Paralinguistic Classification
| Dataset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:----------------:|:------:|:--------------:|:----------------:|:------------------:|
| VoxCeleb1 | ACC↑ | **92.66** | 59.71 | 82.72 |
| VoxLingua107 | ACC↑ | **93.72** | 51.03 | 73.65 |
| VoxCeleb-Gender | ACC↑ | 97.72 | **99.82** | 99.69 |
| VGGSound | ACC↑ | **52.19** | 0.97 | 2.20 |
| Cochlscene | ACC↑ | **75.81** | 23.88 | 18.34 |
| NSynth | ACC↑ | **80.32** | 60.45 | 38.09 |
| FMA | ACC↑ | 62.94 | **66.77** | 27.91 |
| FSDKaggle2018 | ACC↑ | **73.38** | 31.38 | 24.75 |
| AudioSet | mAP↑ | **9.90** | 6.48 | 3.47 |
| FSD50K | mAP↑ | **38.10** | 23.87 | 27.23 |
### ASR Performance
| Dataset | Language | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------------------:|:------------:|:-------------:|:------------:|:-------------------:|
| LibriSpeech test-clean | English | 3.6 | 1.7 | **1.3** |
| LibriSpeech test-other | English | 5.9 | 3.4 | **2.4** |
| People's Speech | English | 26.12 | 28.6 | **22.3** |
| AISHELL2 Mic | Chinese | 3.2 | **2.5** | 2.7 |
| AISHELL2 iOS | Chinese | 2.9 | **2.6** | **2.6** |
| AISHELL2 Android | Chinese | 3.1 | 2.7 | **2.6** |
| GigaSpeech2 | Indonesian | 22.3 | **21.2** | >100 |
| GigaSpeech2 | Thai | **38.4** | 53.8 | >100 |
| GigaSpeech2        | Vietnamese   | **17.7**      | 18.6         | >100                |
*Metrics: WER/CER (lower is better).*
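
For readers unfamiliar with the metrics, the sketch below shows one common way to compute word error rate (WER) and character error rate (CER) from an edit distance. It is an illustration only, not the evaluation script behind the table: the function names are hypothetical, no text normalization is applied, and reporting the score as a percentage is an assumption based on the scale of the values above. CER is typically used for languages written without whitespace between words, such as Chinese.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, using whitespace tokenization."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One substituted word out of four reference words.
print(wer("an engine is idling", "an engine was idling"))  # 25.0
```

Because insertions count as errors, the score can exceed 100 when the hypothesis is much longer than the reference, which is how entries such as ">100" can arise.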
### Question Answering Results
| Dataset | Subset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:--------------:|:------------------:|:------:|:--------------:|:----------------:|:-------------------:|
| MMAU-Pro | IF | ACC↑ | 37.93 | **61.30** | 42.30 |
| MMAU-Pro | Multi-Audio | ACC↑ | **42.33** | 24.30 | 17.20 |
| MMAU-Pro | Music | ACC↑ | **62.20** | 61.50 | 57.60 |
| MMAU-Pro | Open-ended | ACC↑ | **63.21** | 52.30 | 34.50 |
| MMAU-Pro | Sound | ACC↑ | **58.36** | 47.60 | 46.00 |
| MMAU-Pro | Sound–Music | ACC↑ | 42.00 | 40.00 | **46.00** |
| MMAU-Pro | Sound–Music–Speech | ACC↑ | **71.43** | 28.50 | 42.80 |
| MMAU-Pro | Spatial | ACC↑ | 18.77 | 41.20 | **43.70** |
| MMAU-Pro | Speech | ACC↑ | **61.17** | 57.40 | 52.20 |
| MMAU-Pro | Speech–Music | ACC↑ | **58.70** | 53.20 | 54.30 |
| MMAU-Pro | Speech–Sound | ACC↑ | 51.14 | **60.20** | 48.90 |
| MMAU-Pro | Voice | ACC↑ | 54.83 | **60.00** | 50.60 |
| MMAU-Pro | Average | ACC↑ | **55.92** | 52.20 | 46.60 |
| MMAU-v05.15.25 | Sound | ACC↑ | 77.48 | **78.10** | 75.68 |
| MMAU-v05.15.25 | Music | ACC↑ | **70.96** | 65.90 | 66.77 |
| MMAU-v05.15.25 | Speech | ACC↑ | **76.28** | 70.60 | 62.16 |
| MMAU-v05.15.25 | Average | ACC↑ | **74.90** | 71.50 | 68.20 |
| MuChoMusic | | ACC↑ | **73.04** | 64.79 | 67.40 |
| MusicQA | | FENSE↑ | **61.56** | 60.60 | 40.00 |
| AudioCaps-QA | | FENSE↑ | **54.20** | 53.28 | 47.34 |
*Metrics: Higher is better.*
## Citation
MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and business applications**.
If you find MiDashengLM useful in your research, please consider citing our work:
```bibtex
@techreport{midashenglm7b,
title = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
author = {{Horizon Team, MiLM Plus}},
institution= {Xiaomi Inc.},
year = {2025},
note = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
url = {https://arxiv.org/abs/2508.03983},
eprint = {2508.03983},
}
```