File size: 4,198 Bytes

---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- multi-modal
- speech-language
datasets:
- mozilla-foundation/common_voice_16_1
- openslr/librispeech_asr
- MLCommons/ml_spoken_words
- Ar4ikov/iemocap_audio_text_splitted
metrics:
- wer
- accuracy
model-index:
- name: SpeechLLM
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 11.51
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 16.68
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 25.66
      name: Test WER
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - type: accuracy
      value: 64.98
      name: Test Age Accuracy
    - type: accuracy
      value: 81.21
      name: Test Accent Accuracy
---

# SpeechLLM

![](./speechllm.png)

SpeechLLM is a multi-modal LLM trained to predict the metadata of the speaker's turn in a conversation. speechllm-2B model is based on HubertX audio encoder and TinyLlama LLM. The model predicts the following:
1. **SpeechActivity** : if the audio signal contains speech (True/False)
2. **Transcript** : ASR transcript of the audio
3. **Gender** of the speaker (Female/Male)
4. **Age** of the speaker (Young/Middle-Age/Senior)
5. **Accent** of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
6. **Emotion** of the speaker (Happy/Sad/Anger/Neutral/Frustrated)

## Usage
```python
# Load model directly from huggingface
from transformers import AutoModel
model = AutoModel.from_pretrained("skit-ai/speechllm-1.5B", trust_remote_code=True)

model.generate_meta(
	audio_path="path-to-audio.wav", #16k Hz, mono
    audio_tensor=torchaudio.load("path-to-audio.wav")[1], # [Optional] either audio_path or audio_tensor directly
	instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
	max_new_tokens=500, 
	return_special_tokens=False
)

# Model Generation
'''
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
'''
```

Try the model in [Google Colab Notebook](https://colab.research.google.com/drive/1uqhRl36LJKA4IxnrhplLMv0wQ_f3OuBM?usp=sharing).

## Model Details

- **Developed by:** Skit AI
- **Authors:** [Shangeth Rajaa](https://huggingface.co/shangeth), [Abhinav Tushar](https://huggingface.co/lepisma)
- **Language:** English
- **Finetuned from model:** [WavLM](https://huggingface.co/microsoft/wavlm-large), [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
- **Model Size:** 1.5 B
- **Checkpoint:** 2000 k steps (bs=1)
- **Adapters:** r=8, alpha=16
- **lr** : 1e-4
- **gradient accumulation steps:** 8


## Checkpoint Result

|         **Dataset**        |       **Type**      | **Word Error Rate** | **Gender Acc** | **Age Acc** | **Accent Acc** |
|:--------------------------:|:-------------------:|:-------------------:|:--------------:|:-----------:|:--------------:|
| **librispeech-test-clean** | Read Speech         |        11.51        |     0.9594     |             |                |
| **librispeech-test-other** | Read Speech         |        16.68        |     0.9297     |             |                |
| **CommonVoice test**       | Diverse Accent, Age |        25.66        |     0.9476     |    0.6498   |     0.8121     |