shangeth committed · Commit f5f12a4 · verified · 1 Parent(s): 90b0faf

Update README.md

Files changed (1): README.md (+20 -16)
README.md CHANGED
@@ -3,9 +3,14 @@ language:
  - en
  license: apache-2.0
  library_name: transformers
+ tags:
+ - multi-modal
+ - speech-language
  datasets:
  - mozilla-foundation/common_voice_16_1
  - openslr/librispeech_asr
+ - MLCommons/ml_spoken_words
+ - Ar4ikov/iemocap_audio_text_splitted
  metrics:
  - wer
  - accuracy
@@ -24,7 +29,7 @@ model-index:
  language: en
  metrics:
  - type: wer
- value: 1.0
+ value: 11.51
  name: Test WER
  - task:
  type: automatic-speech-recognition
@@ -38,7 +43,7 @@ model-index:
  language: en
  metrics:
  - type: wer
- value: 1.0
+ value: 16.68
  name: Test WER
  - task:
  type: automatic-speech-recognition
@@ -51,7 +56,7 @@ model-index:
  language: en
  metrics:
  - type: wer
- value: 1.0
+ value: 25.66
  name: Test WER
  - task:
  type: audio-classification
@@ -64,18 +69,18 @@ model-index:
  language: en
  metrics:
  - type: accuracy
- value: 1.0
+ value: 64.98
  name: Test Age Accuracy
  - type: accuracy
- value: 1.0
+ value: 81.21
  name: Test Accent Accuracy
  ---

  # SpeechLLM

- [The model is still training, we will be releasing the latest checkpoints soon...]
+ ![](./speechllm.png)

- SpeechLLM is a multi-modal LLM trained to predict the metadata of the speaker's turn in a conversation. speechllm-1.5B model is based on WavLM audio encoder and TinyLlama LLM. The model predicts the following:
+ SpeechLLM is a multi-modal LLM trained to predict the metadata of the speaker's turn in a conversation. speechllm-2B model is based on HubertX audio encoder and TinyLlama LLM. The model predicts the following:
  1. **SpeechActivity** : if the audio signal contains speech (True/False)
  2. **Transcript** : ASR transcript of the audio
  3. **Gender** of the speaker (Female/Male)
@@ -91,6 +96,7 @@ model = AutoModel.from_pretrained("skit-ai/speechllm-1.5B", trust_remote_code=Tr

  model.generate_meta(
  audio_path="path-to-audio.wav", #16k Hz, mono
+ audio_tensor=torchaudio.load("path-to-audio.wav")[1], # [Optional] either audio_path or audio_tensor directly
  instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
  max_new_tokens=500,
  return_special_tokens=False
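For reference, a minimal end-to-end sketch of the call shown in this hunk. It assumes the `skit-ai/speechllm-1.5B` checkpoint id and the `generate_meta` arguments exactly as they appear in the card; the torchaudio loading/resampling step and the waveform shape expected by `audio_tensor` are illustrative assumptions, not part of the committed README.

```python
import torchaudio
from transformers import AutoModel

# Load the SpeechLLM checkpoint; trust_remote_code is needed because
# generate_meta() lives in the repo's custom modeling code.
model = AutoModel.from_pretrained("skit-ai/speechllm-1.5B", trust_remote_code=True)

# torchaudio.load returns (waveform, sample_rate); the card expects 16 kHz mono.
waveform, sample_rate = torchaudio.load("path-to-audio.wav")
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
if waveform.shape[0] > 1:  # downmix stereo to mono
    waveform = waveform.mean(dim=0, keepdim=True)

# Assumption: audio_tensor takes the raw waveform tensor (audio_path is the alternative).
output = model.generate_meta(
    audio_tensor=waveform,
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False,
)
print(output)
```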
@@ -108,6 +114,7 @@ model.generate_meta(
  }
  '''
  ```
+
  Try the model in [Google Colab Notebook](https://colab.research.google.com/drive/1uqhRl36LJKA4IxnrhplLMv0wQ_f3OuBM?usp=sharing).

  ## Model Details
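The hunk above closes the card's example output block. A small sketch of consuming that output, assuming `generate_meta` returns the dict-like string shown in the README; the keys and values below are hypothetical placeholders mirroring the fields requested in the instruction.

```python
import ast

# Hypothetical raw output, mirroring the dict-like block in the README
# (exact keys and values are placeholders for illustration).
raw = """
{
  "SpeechActivity": "True",
  "Transcript": "how are you doing today",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young Adult",
  "Accent": "America"
}
"""

# The block is valid Python literal syntax, so literal_eval parses it safely.
meta = ast.literal_eval(raw.strip())
print(meta["Transcript"], meta["Gender"])
```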
@@ -118,18 +125,15 @@ Try the model in [Google Colab Notebook](https://colab.research.google.com/drive
  - **Finetuned from model:** [WavLM](https://huggingface.co/microsoft/wavlm-large), [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
  - **Model Size:** 1.5 B
  - **Checkpoint:** 2000 k steps (bs=1)
- - **Adapters:** r=8, alpha=16
+ - **Adapters:** r=4, alpha=8
  - **lr** : 1e-4
  - **gradient accumulation steps:** 8


  ## Checkpoint Result

- | **Dataset** | **Word Error Rate** | **Gender Acc** | **Age Acc** | **Accent Acc** |
- |:----------------------:|:-------------------:|:--------------:|:-----------:|:--------------:|
- | librispeech-test-clean | 7.36 | 0.9490 | | |
- | librispeech-test-other | 10.47 | 0.9099 | | |
- | CommonVoice test | 24.47 | 0.8680 | 0.6061 | 0.6156 |
-
-
-
+ | **Dataset** | **Type** | **Word Error Rate** | **Gender Acc** | **Age Acc** | **Accent Acc** |
+ |:--------------------------:|:-------------------:|:-------------------:|:--------------:|:-----------:|:--------------:|
+ | **librispeech-test-clean** | Read Speech | 11.51 | 0.9594 | | |
+ | **librispeech-test-other** | Read Speech | 16.68 | 0.9297 | | |
+ | **CommonVoice test** | Diverse Accent, Age | 25.66 | 0.9476 | 0.6498 | 0.8121 |
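The `**Adapters:** r=4, alpha=8` entry above reads like LoRA hyperparameters. A minimal sketch of how such a configuration might be expressed with the `peft` library, assuming LoRA on the TinyLlama attention projections; the target modules, dropout, and the use of `peft` itself are assumptions, while r, alpha, lr, batch size, and gradient accumulation come from the card.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Hypothetical LoRA setup matching the card's "Adapters: r=4, alpha=8" entry.
lora_cfg = LoraConfig(
    r=4,                                   # low-rank dimension from the card
    lora_alpha=8,                          # scaling factor from the card
    target_modules=["q_proj", "v_proj"],   # assumption: which projections are adapted
    lora_dropout=0.05,                     # assumption: not stated in the card
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base, lora_cfg)
peft_model.print_trainable_parameters()

# Illustrative optimizer settings taken from the card's training details.
args = TrainingArguments(
    output_dir="speechllm-adapter",
    learning_rate=1e-4,                 # lr from the card
    per_device_train_batch_size=1,      # bs=1 from the card
    gradient_accumulation_steps=8,      # from the card
)
```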
 
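The WER and accuracy figures in the tables above correspond to standard metric definitions. A brief sketch of how they could be computed with the `evaluate` library; the reference/prediction strings and the label mapping below are placeholders, not values from the card.

```python
import evaluate

# Placeholder transcripts; in practice these would come from a test split
# (e.g. librispeech_asr test-clean) and from model.generate_meta outputs.
references = ["how are you doing today", "the quick brown fox"]
predictions = ["how are you doing today", "the quick brown box"]

wer = evaluate.load("wer")
accuracy = evaluate.load("accuracy")

# Word Error Rate over the ASR transcripts (reported per dataset above).
print("WER:", wer.compute(predictions=predictions, references=references))

# Gender / Age / Accent are scored as plain classification accuracy;
# string labels are mapped to integers for the accuracy metric.
label_map = {"Male": 0, "Female": 1}
ref_gender = ["Female", "Male"]
pred_gender = ["Female", "Female"]
print("Gender Acc:", accuracy.compute(
    predictions=[label_map[g] for g in pred_gender],
    references=[label_map[g] for g in ref_gender],
))
```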