---
license: apache-2.0
language:
- en
- zh
- th
- id
- vi
pipeline_tag: audio-text-to-text
tags:
- multimodal
- audio-language-model
- audio
base_model:
- mispeech/dasheng-0.6B
- Qwen/Qwen2.5-Omni-7B
base_model_relation: finetune
---

# MiDashengLM-7B-1021

MiDashengLM is an efficient audio-language model that achieves holistic audio understanding through caption-based alignment. It delivers state-of-the-art performance on multiple audio understanding benchmarks while maintaining high inference efficiency, with a 3.2× throughput speedup and support for batch sizes up to 512.

📖 For a more detailed introduction and the technical report, please visit our [GitHub repository](https://github.com/xiaomi-research/dasheng-lm).

Note that for most applications, we strongly recommend using the BF16 version ([mispeech/midashenglm-7b-1021-bf16](https://huggingface.co/mispeech/midashenglm-7b-1021-bf16)) for optimal performance and efficiency.

## Usage

### Load Model

```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-7b-1021-fp32"  # Only for exact reproduction; otherwise we strongly recommend "mispeech/midashenglm-7b-1021-bf16"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```

### Construct Prompt

```python
user_prompt = "Caption the audio."  # You may try any other prompt

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)
            },
        ],
    },
]
```
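The `"audio"` field also accepts an in-memory waveform, as the `np.random.randn(16000)` variant above suggests. A minimal sketch that builds a one-second sine tone with NumPy — assuming the processor expects mono float audio at 16 kHz, which we infer from the 16000-sample example; resample your own recordings accordingly:

```python
import numpy as np

# One second of a 440 Hz sine tone as a mono float waveform.
# Assumption: 16 kHz mono input, matching the 16000-sample example above.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
waveform = 0.5 * np.sin(2 * np.pi * 440.0 * t)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Caption the audio."},
            {"type": "audio", "audio": waveform},
        ],
    },
]
```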

### Generate Output

```python
import torch

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    ).to(device=model.device, dtype=model.dtype)
    generation = model.generate(**model_inputs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
```

## Results

The following evaluation results are based on the model version `mispeech/midashenglm-7b-1021-fp32`.

### Audio Captioning Results

| Domain | Dataset       | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------:|:-------------:|:-----------:|:---------------:|:-------------------:|
| Music  | MusicCaps     | **59.11**   | 43.71           | 35.43               |
| Music  | Songdescriber | **46.42**   | 45.31           | 44.63               |
| Sound  | AudioCaps     | **62.13**   | 60.79           | 49.00               |
| Sound  | ClothoV2      | **49.35**   | 47.55           | 48.01               |
| Sound  | AutoACD       | **67.13**   | 55.93           | 44.76               |

*Metric: FENSE (higher is better).*

### Audio and Paralinguistic Classification

| Dataset         | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:---------------:|:------:|:-----------:|:---------------:|:-------------------:|
| VoxCeleb1       | ACC↑   | **92.66**   | 59.71           | 82.72               |
| VoxLingua107    | ACC↑   | **93.72**   | 51.03           | 73.65               |
| VoxCeleb-Gender | ACC↑   | 97.72       | **99.82**       | 99.69               |
| VGGSound        | ACC↑   | **52.19**   | 0.97            | 2.20                |
| Cochlscene      | ACC↑   | **75.81**   | 23.88           | 18.34               |
| NSynth          | ACC↑   | **80.32**   | 60.45           | 38.09               |
| FMA             | ACC↑   | 62.94       | **66.77**       | 27.91               |
| FSDKaggle2018   | ACC↑   | **73.38**   | 31.38           | 24.75               |
| AudioSet        | mAP↑   | **9.90**    | 6.48            | 3.47                |
| FSD50K          | mAP↑   | **38.10**   | 23.87           | 27.23               |
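AudioSet and FSD50K are multi-label tasks scored with mean average precision (mAP): for each class, precision is averaged at every true positive in the score-ranked list, and the per-class APs are then averaged. An illustrative re-implementation of the metric (not the official evaluation code):

```python
def average_precision(labels, scores):
    """AP for one class: mean precision at each true positive, ranked by score."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(label_matrix, score_matrix):
    """mAP: average AP over classes (one column per class, one row per clip)."""
    n_classes = len(label_matrix[0])
    aps = [
        average_precision([row[c] for row in label_matrix],
                          [row[c] for row in score_matrix])
        for c in range(n_classes)
    ]
    return sum(aps) / len(aps)

# Two clips, two classes: class 0 is ranked perfectly (AP = 1.0),
# class 1 has a negative ranked on top (AP = 0.5).
labels = [[1, 0], [1, 1]]
scores = [[0.9, 0.8], [0.7, 0.4]]
print(mean_average_precision(labels, scores))  # -> 0.75
```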

### ASR Performance

| Dataset                | Language   | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:----------------------:|:----------:|:-----------:|:---------------:|:-------------------:|
| LibriSpeech test-clean | English    | 3.6         | 1.7             | **1.3**             |
| LibriSpeech test-other | English    | 5.9         | 3.4             | **2.4**             |
| People's Speech        | English    | 26.12       | 28.6            | **22.3**            |
| AISHELL2 Mic           | Chinese    | 3.2         | **2.5**         | 2.7                 |
| AISHELL2 iOS           | Chinese    | 2.9         | **2.6**         | **2.6**             |
| AISHELL2 Android       | Chinese    | 3.1         | 2.7             | **2.6**             |
| GigaSpeech2            | Indonesian | 22.3        | **21.2**        | >100                |
| GigaSpeech2            | Thai       | **38.4**    | 53.8            | >100                |
| GigaSpeech2            | Vietnamese | **17.7**    | 18.6            | >100                |

*Metrics: WER/CER (lower is better).*
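WER is the minimum number of word substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the number of reference words; CER is the same computation over characters, as is standard for Chinese. A minimal dynamic-programming sketch, for illustration only (not the evaluation code behind the table):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 reference words
```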

### Question Answering Results

| Dataset        | Subset             | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:--------------:|:------------------:|:------:|:-----------:|:---------------:|:-------------------:|
| MMAU-Pro       | IF                 | ACC↑   | 37.93       | **61.30**       | 42.30               |
| MMAU-Pro       | Multi-Audio        | ACC↑   | **42.33**   | 24.30           | 17.20               |
| MMAU-Pro       | Music              | ACC↑   | **62.20**   | 61.50           | 57.60               |
| MMAU-Pro       | Open-ended         | ACC↑   | **63.21**   | 52.30           | 34.50               |
| MMAU-Pro       | Sound              | ACC↑   | **58.36**   | 47.60           | 46.00               |
| MMAU-Pro       | Sound–Music        | ACC↑   | 42.00       | 40.00           | **46.00**           |
| MMAU-Pro       | Sound–Music–Speech | ACC↑   | **71.43**   | 28.50           | 42.80               |
| MMAU-Pro       | Spatial            | ACC↑   | 18.77       | 41.20           | **43.70**           |
| MMAU-Pro       | Speech             | ACC↑   | **61.17**   | 57.40           | 52.20               |
| MMAU-Pro       | Speech–Music       | ACC↑   | **58.70**   | 53.20           | 54.30               |
| MMAU-Pro       | Speech–Sound       | ACC↑   | 51.14       | **60.20**       | 48.90               |
| MMAU-Pro       | Voice              | ACC↑   | 54.83       | **60.00**       | 50.60               |
| MMAU-Pro       | Average            | ACC↑   | **55.92**   | 52.20           | 46.60               |
| MMAU-v05.15.25 | Sound              | ACC↑   | 77.48       | **78.10**       | 75.68               |
| MMAU-v05.15.25 | Music              | ACC↑   | **70.96**   | 65.90           | 66.77               |
| MMAU-v05.15.25 | Speech             | ACC↑   | **76.28**   | 70.60           | 62.16               |
| MMAU-v05.15.25 | Average            | ACC↑   | **74.90**   | 71.50           | 68.20               |
| MuChoMusic     |                    | ACC↑   | **73.04**   | 64.79           | 67.40               |
| MusicQA        |                    | FENSE↑ | **61.56**   | 60.60           | 40.00               |
| AudioCaps-QA   |                    | FENSE↑ | **54.20**   | 53.28           | 47.34               |

*Metrics: Higher is better.*

## Citation

MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and business applications**.

If you find MiDashengLM useful in your research, please consider citing our work:

```bibtex
@techreport{midashenglm7b,
  title       = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author      = {{Horizon Team, MiLM Plus}},
  institution = {Xiaomi Inc.},
  year        = {2025},
  note        = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  url         = {https://arxiv.org/abs/2508.03983},
  eprint      = {2508.03983},
}
```