File size: 4,198 Bytes
898044c 56d8789 898044c f5f12a4 56d8789 f5f12a4 56d8789 f5f12a4 56d8789 f5f12a4 56d8789 f5f12a4 56d8789 f5f12a4 56d8789 f5f12a4 56d8789 898044c 56d8789 f5f12a4 56d8789 f5f12a4 56d8789 6fc25f7 f5f12a4 56d8789 f5f12a4 96c3aa9 898044c 56d8789 90ce71a 56d8789 898044c 56d8789 898044c f5f12a4 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 |
---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- multi-modal
- speech-language
datasets:
- mozilla-foundation/common_voice_16_1
- openslr/librispeech_asr
- MLCommons/ml_spoken_words
- Ar4ikov/iemocap_audio_text_splitted
metrics:
- wer
- accuracy
model-index:
- name: SpeechLLM
results:
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: LibriSpeech (clean)
type: librispeech_asr
config: clean
split: test
args:
language: en
metrics:
- type: wer
value: 11.51
name: Test WER
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: LibriSpeech (other)
type: librispeech_asr
config: other
split: test
args:
language: en
metrics:
- type: wer
value: 16.68
name: Test WER
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: Common Voice 16.1
type: common_voice_16_1
split: test
args:
language: en
metrics:
- type: wer
value: 25.66
name: Test WER
- task:
type: audio-classification
name: Audio Classification
dataset:
name: Common Voice 16.1
type: common_voice_16_1
split: test
args:
language: en
metrics:
- type: accuracy
value: 64.98
name: Test Age Accuracy
- type: accuracy
value: 81.21
name: Test Accent Accuracy
---
# SpeechLLM

SpeechLLM is a multi-modal LLM trained to predict the metadata of the speaker's turn in a conversation. speechllm-2B model is based on HubertX audio encoder and TinyLlama LLM. The model predicts the following:
1. **SpeechActivity** : if the audio signal contains speech (True/False)
2. **Transcript** : ASR transcript of the audio
3. **Gender** of the speaker (Female/Male)
4. **Age** of the speaker (Young/Middle-Age/Senior)
5. **Accent** of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
6. **Emotion** of the speaker (Happy/Sad/Anger/Neutral/Frustrated)
## Usage
```python
# Load model directly from huggingface
from transformers import AutoModel
model = AutoModel.from_pretrained("skit-ai/speechllm-1.5B", trust_remote_code=True)
model.generate_meta(
audio_path="path-to-audio.wav", #16k Hz, mono
audio_tensor=torchaudio.load("path-to-audio.wav")[1], # [Optional] either audio_path or audio_tensor directly
instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
max_new_tokens=500,
return_special_tokens=False
)
# Model Generation
'''
{
"SpeechActivity" : "True",
"Transcript": "Yes, I got it. I'll make the payment now.",
"Gender": "Female",
"Emotion": "Neutral",
"Age": "Young",
"Accent" : "America",
}
'''
```
Try the model in [Google Colab Notebook](https://colab.research.google.com/drive/1uqhRl36LJKA4IxnrhplLMv0wQ_f3OuBM?usp=sharing).
## Model Details
- **Developed by:** Skit AI
- **Authors:** [Shangeth Rajaa](https://huggingface.co/shangeth), [Abhinav Tushar](https://huggingface.co/lepisma)
- **Language:** English
- **Finetuned from model:** [WavLM](https://huggingface.co/microsoft/wavlm-large), [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
- **Model Size:** 1.5 B
- **Checkpoint:** 2000 k steps (bs=1)
- **Adapters:** r=8, alpha=16
- **lr** : 1e-4
- **gradient accumulation steps:** 8
## Checkpoint Result
| **Dataset** | **Type** | **Word Error Rate** | **Gender Acc** | **Age Acc** | **Accent Acc** |
|:--------------------------:|:-------------------:|:-------------------:|:--------------:|:-----------:|:--------------:|
| **librispeech-test-clean** | Read Speech | 11.51 | 0.9594 | | |
| **librispeech-test-other** | Read Speech | 16.68 | 0.9297 | | |
| **CommonVoice test** | Diverse Accent, Age | 25.66 | 0.9476 | 0.6498 | 0.8121 | |