---
license: apache-2.0
language:
- en
- zh
- th
- id
- vi
pipeline_tag: audio-text-to-text
tags:
- multimodal
- audio-language-model
- audio
base_model:
- mispeech/dasheng-0.6B
- Qwen/Qwen2.5-Omni-7B
base_model_relation: finetune
---

# MiDashengLM-7B-1021

MiDashengLM is an efficient audio-language model that achieves holistic audio understanding through caption-based alignment. It delivers state-of-the-art performance on multiple audio understanding benchmarks while maintaining high inference efficiency, with a 3.2× throughput speedup and support for batch sizes up to 512.

📖 For a more detailed introduction and the technical report, please visit our [GitHub repository](https://github.com/xiaomi-research/dasheng-lm).

Note that for most applications, we strongly recommend using the BF16 version ([mispeech/midashenglm-7b-1021-bf16](https://huggingface.co/mispeech/midashenglm-7b-1021-bf16)) for optimal performance and efficiency.

## Usage

### Load Model

```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-7b-1021-fp32"  # Only for exact reproduction; otherwise we strongly recommend "mispeech/midashenglm-7b-1021-bf16"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```

### Construct Prompt

```python
user_prompt = "Caption the audio."  # You may try any other prompt

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)
            },
        ],
    },
]
```
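The `"audio"` field also accepts an in-memory waveform, as the `np.random.randn(16000)` variant above suggests. A minimal sketch that builds a one-second sine tone with NumPy — assuming the processor expects mono float audio at 16 kHz, which we infer from the 16000-sample example; resample your own recordings accordingly:

```python
import numpy as np

# One second of a 440 Hz sine tone as a mono float waveform.
# Assumption: 16 kHz mono input, matching the 16000-sample example above.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
waveform = 0.5 * np.sin(2 * np.pi * 440.0 * t)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Caption the audio."},
            {"type": "audio", "audio": waveform},
        ],
    },
]
```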

### Generate Output

```python
import torch

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    ).to(device=model.device, dtype=model.dtype)
    generation = model.generate(**model_inputs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
```

## Results

The following evaluation results are based on the model version `mispeech/midashenglm-7b-1021-fp32`.

### Audio Captioning Results

| Domain | Dataset       | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------:|:-------------:|:-----------:|:---------------:|:-------------------:|
| Music  | MusicCaps     | **59.11**   | 43.71           | 35.43               |
| Music  | Songdescriber | **46.42**   | 45.31           | 44.63               |
| Sound  | AudioCaps     | **62.13**   | 60.79           | 49.00               |
| Sound  | ClothoV2      | **49.35**   | 47.55           | 48.01               |
| Sound  | AutoACD       | **67.13**   | 55.93           | 44.76               |

*Metric: FENSE (higher is better).*

### Audio and Paralinguistic Classification

| Dataset         | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:---------------:|:------:|:-----------:|:---------------:|:-------------------:|
| VoxCeleb1       | ACC↑   | **92.66**   | 59.71           | 82.72               |
| VoxLingua107    | ACC↑   | **93.72**   | 51.03           | 73.65               |
| VoxCeleb-Gender | ACC↑   | 97.72       | **99.82**       | 99.69               |
| VGGSound        | ACC↑   | **52.19**   | 0.97            | 2.20                |
| Cochlscene      | ACC↑   | **75.81**   | 23.88           | 18.34               |
| NSynth          | ACC↑   | **80.32**   | 60.45           | 38.09               |
| FMA             | ACC↑   | 62.94       | **66.77**       | 27.91               |
| FSDKaggle2018   | ACC↑   | **73.38**   | 31.38           | 24.75               |
| AudioSet        | mAP↑   | **9.90**    | 6.48            | 3.47                |
| FSD50K          | mAP↑   | **38.10**   | 23.87           | 27.23               |
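AudioSet and FSD50K are multi-label tasks scored with mean average precision (mAP): for each class, precision is averaged at every true positive in the score-ranked list, and the per-class APs are then averaged. An illustrative re-implementation of the metric (not the official evaluation code):

```python
def average_precision(labels, scores):
    """AP for one class: mean precision at each true positive, ranked by score."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(label_matrix, score_matrix):
    """mAP: average AP over classes (one column per class, one row per clip)."""
    n_classes = len(label_matrix[0])
    aps = [
        average_precision([row[c] for row in label_matrix],
                          [row[c] for row in score_matrix])
        for c in range(n_classes)
    ]
    return sum(aps) / len(aps)

# Two clips, two classes: class 0 is ranked perfectly (AP = 1.0),
# class 1 has a negative ranked on top (AP = 0.5).
labels = [[1, 0], [1, 1]]
scores = [[0.9, 0.8], [0.7, 0.4]]
print(mean_average_precision(labels, scores))  # -> 0.75
```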

### ASR Performance

| Dataset                | Language   | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:----------------------:|:----------:|:-----------:|:---------------:|:-------------------:|
| LibriSpeech test-clean | English    | 3.6         | 1.7             | **1.3**             |
| LibriSpeech test-other | English    | 5.9         | 3.4             | **2.4**             |
| People's Speech        | English    | 26.12       | 28.6            | **22.3**            |
| AISHELL2 Mic           | Chinese    | 3.2         | **2.5**         | 2.7                 |
| AISHELL2 iOS           | Chinese    | 2.9         | **2.6**         | **2.6**             |
| AISHELL2 Android       | Chinese    | 3.1         | 2.7             | **2.6**             |
| GigaSpeech2            | Indonesian | 22.3        | **21.2**        | >100                |
| GigaSpeech2            | Thai       | **38.4**    | 53.8            | >100                |
| GigaSpeech2            | Vietnamese | **17.7**    | 18.6            | >100                |

*Metrics: WER/CER (lower is better).*
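WER is the minimum number of word substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the number of reference words; CER is the same computation over characters, as is standard for Chinese. A minimal dynamic-programming sketch, for illustration only (not the evaluation code behind the table):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 reference words
```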

### Question Answering Results

| Dataset        | Subset             | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:--------------:|:------------------:|:------:|:-----------:|:---------------:|:-------------------:|
| MMAU-Pro       | IF                 | ACC↑   | 37.93       | **61.30**       | 42.30               |
| MMAU-Pro       | Multi-Audio        | ACC↑   | **42.33**   | 24.30           | 17.20               |
| MMAU-Pro       | Music              | ACC↑   | **62.20**   | 61.50           | 57.60               |
| MMAU-Pro       | Open-ended         | ACC↑   | **63.21**   | 52.30           | 34.50               |
| MMAU-Pro       | Sound              | ACC↑   | **58.36**   | 47.60           | 46.00               |
| MMAU-Pro       | Sound–Music        | ACC↑   | 42.00       | 40.00           | **46.00**           |
| MMAU-Pro       | Sound–Music–Speech | ACC↑   | **71.43**   | 28.50           | 42.80               |
| MMAU-Pro       | Spatial            | ACC↑   | 18.77       | 41.20           | **43.70**           |
| MMAU-Pro       | Speech             | ACC↑   | **61.17**   | 57.40           | 52.20               |
| MMAU-Pro       | Speech–Music       | ACC↑   | **58.70**   | 53.20           | 54.30               |
| MMAU-Pro       | Speech–Sound       | ACC↑   | 51.14       | **60.20**       | 48.90               |
| MMAU-Pro       | Voice              | ACC↑   | 54.83       | **60.00**       | 50.60               |
| MMAU-Pro       | Average            | ACC↑   | **55.92**   | 52.20           | 46.60               |
| MMAU-v05.15.25 | Sound              | ACC↑   | 77.48       | **78.10**       | 75.68               |
| MMAU-v05.15.25 | Music              | ACC↑   | **70.96**   | 65.90           | 66.77               |
| MMAU-v05.15.25 | Speech             | ACC↑   | **76.28**   | 70.60           | 62.16               |
| MMAU-v05.15.25 | Average            | ACC↑   | **74.90**   | 71.50           | 68.20               |
| MuChoMusic     |                    | ACC↑   | **73.04**   | 64.79           | 67.40               |
| MusicQA        |                    | FENSE↑ | **61.56**   | 60.60           | 40.00               |
| AudioCaps-QA   |                    | FENSE↑ | **54.20**   | 53.28           | 47.34               |

*Metrics: Higher is better.*

## Citation

MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and business applications**.

If you find MiDashengLM useful in your research, please consider citing our work:

```bibtex
@techreport{midashenglm7b,
  title       = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author      = {{Horizon Team, MiLM Plus}},
  institution = {Xiaomi Inc.},
  year        = {2025},
  note        = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  url         = {https://arxiv.org/abs/2508.03983},
  eprint      = {2508.03983},
}
```