shangeth committed · Commit f5f12a4 · verified · 1 Parent(s): 90b0faf

Update README.md

Files changed (1): README.md (+20 -16)
README.md CHANGED
@@ -3,9 +3,14 @@ language:
  - en
  license: apache-2.0
  library_name: transformers
+ tags:
+ - multi-modal
+ - speech-language
  datasets:
  - mozilla-foundation/common_voice_16_1
  - openslr/librispeech_asr
+ - MLCommons/ml_spoken_words
+ - Ar4ikov/iemocap_audio_text_splitted
  metrics:
  - wer
  - accuracy
@@ -24,7 +29,7 @@ model-index:
  language: en
  metrics:
  - type: wer
- value: 1.0
+ value: 11.51
  name: Test WER
  - task:
  type: automatic-speech-recognition
@@ -38,7 +43,7 @@ model-index:
  language: en
  metrics:
  - type: wer
- value: 1.0
+ value: 16.68
  name: Test WER
  - task:
  type: automatic-speech-recognition
@@ -51,7 +56,7 @@ model-index:
  language: en
  metrics:
  - type: wer
- value: 1.0
+ value: 25.66
  name: Test WER
  - task:
  type: audio-classification
@@ -64,18 +69,18 @@ model-index:
  language: en
  metrics:
  - type: accuracy
- value: 1.0
+ value: 64.98
  name: Test Age Accuracy
  - type: accuracy
- value: 1.0
+ value: 81.21
  name: Test Accent Accuracy
  ---

  # SpeechLLM

- [The model is still training, we will be releasing the latest checkpoints soon...]
+ ![](./speechllm.png)

- SpeechLLM is a multi-modal LLM trained to predict the metadata of the speaker's turn in a conversation. speechllm-1.5B model is based on WavLM audio encoder and TinyLlama LLM. The model predicts the following:
+ SpeechLLM is a multi-modal LLM trained to predict the metadata of the speaker's turn in a conversation. speechllm-2B model is based on HubertX audio encoder and TinyLlama LLM. The model predicts the following:
  1. **SpeechActivity** : if the audio signal contains speech (True/False)
  2. **Transcript** : ASR transcript of the audio
  3. **Gender** of the speaker (Female/Male)
@@ -91,6 +96,7 @@ model = AutoModel.from_pretrained("skit-ai/speechllm-1.5B", trust_remote_code=Tr

  model.generate_meta(
  audio_path="path-to-audio.wav", #16k Hz, mono
+ audio_tensor=torchaudio.load("path-to-audio.wav")[1], # [Optional] either audio_path or audio_tensor directly
  instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
  max_new_tokens=500,
  return_special_tokens=False
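For reference, a minimal end-to-end sketch of the call shown in this hunk. It assumes the `skit-ai/speechllm-1.5B` checkpoint id and the `generate_meta` arguments exactly as they appear in the card; the torchaudio loading/resampling step and the waveform shape expected by `audio_tensor` are illustrative assumptions, not part of the committed README.

```python
import torchaudio
from transformers import AutoModel

# Load the SpeechLLM checkpoint; trust_remote_code is needed because
# generate_meta() lives in the repo's custom modeling code.
model = AutoModel.from_pretrained("skit-ai/speechllm-1.5B", trust_remote_code=True)

# torchaudio.load returns (waveform, sample_rate); the card expects 16 kHz mono.
waveform, sample_rate = torchaudio.load("path-to-audio.wav")
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
if waveform.shape[0] > 1:  # downmix stereo to mono
    waveform = waveform.mean(dim=0, keepdim=True)

# Assumption: audio_tensor takes the raw waveform tensor (audio_path is the alternative).
output = model.generate_meta(
    audio_tensor=waveform,
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False,
)
print(output)
```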
@@ -108,6 +114,7 @@ model.generate_meta(
  }
  '''
  ```
+
  Try the model in [Google Colab Notebook](https://colab.research.google.com/drive/1uqhRl36LJKA4IxnrhplLMv0wQ_f3OuBM?usp=sharing).

  ## Model Details
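The hunk above closes the card's example output block. A small sketch of consuming that output, assuming `generate_meta` returns the dict-like string shown in the README; the keys and values below are hypothetical placeholders mirroring the fields requested in the instruction.

```python
import ast

# Hypothetical raw output, mirroring the dict-like block in the README
# (exact keys and values are placeholders for illustration).
raw = """
{
  "SpeechActivity": "True",
  "Transcript": "how are you doing today",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young Adult",
  "Accent": "America"
}
"""

# The block is valid Python literal syntax, so literal_eval parses it safely.
meta = ast.literal_eval(raw.strip())
print(meta["Transcript"], meta["Gender"])
```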
@@ -118,18 +125,15 @@ Try the model in [Google Colab Notebook](https://colab.research.google.com/drive
  - **Finetuned from model:** [WavLM](https://huggingface.co/microsoft/wavlm-large), [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
  - **Model Size:** 1.5 B
  - **Checkpoint:** 2000 k steps (bs=1)
- - **Adapters:** r=8, alpha=16
+ - **Adapters:** r=4, alpha=8
  - **lr** : 1e-4
  - **gradient accumulation steps:** 8


  ## Checkpoint Result

- | **Dataset** | **Word Error Rate** | **Gender Acc** | **Age Acc** | **Accent Acc** |
- |:----------------------:|:-------------------:|:--------------:|:-----------:|:--------------:|
- | librispeech-test-clean | 7.36 | 0.9490 | | |
- | librispeech-test-other | 10.47 | 0.9099 | | |
- | CommonVoice test | 24.47 | 0.8680 | 0.6061 | 0.6156 |
-
-
-
+ | **Dataset** | **Type** | **Word Error Rate** | **Gender Acc** | **Age Acc** | **Accent Acc** |
+ |:--------------------------:|:-------------------:|:-------------------:|:--------------:|:-----------:|:--------------:|
+ | **librispeech-test-clean** | Read Speech | 11.51 | 0.9594 | | |
+ | **librispeech-test-other** | Read Speech | 16.68 | 0.9297 | | |
+ | **CommonVoice test** | Diverse Accent, Age | 25.66 | 0.9476 | 0.6498 | 0.8121 |
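The `**Adapters:** r=4, alpha=8` entry above reads like LoRA hyperparameters. A minimal sketch of how such a configuration might be expressed with the `peft` library, assuming LoRA on the TinyLlama attention projections; the target modules, dropout, and the use of `peft` itself are assumptions, while r, alpha, lr, batch size, and gradient accumulation come from the card.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Hypothetical LoRA setup matching the card's "Adapters: r=4, alpha=8" entry.
lora_cfg = LoraConfig(
    r=4,                                   # low-rank dimension from the card
    lora_alpha=8,                          # scaling factor from the card
    target_modules=["q_proj", "v_proj"],   # assumption: which projections are adapted
    lora_dropout=0.05,                     # assumption: not stated in the card
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base, lora_cfg)
peft_model.print_trainable_parameters()

# Illustrative optimizer settings taken from the card's training details.
args = TrainingArguments(
    output_dir="speechllm-adapter",
    learning_rate=1e-4,                 # lr from the card
    per_device_train_batch_size=1,      # bs=1 from the card
    gradient_accumulation_steps=8,      # from the card
)
```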
 
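The WER and accuracy figures in the tables above correspond to standard metric definitions. A brief sketch of how they could be computed with the `evaluate` library; the reference/prediction strings and the label mapping below are placeholders, not values from the card.

```python
import evaluate

# Placeholder transcripts; in practice these would come from a test split
# (e.g. librispeech_asr test-clean) and from model.generate_meta outputs.
references = ["how are you doing today", "the quick brown fox"]
predictions = ["how are you doing today", "the quick brown box"]

wer = evaluate.load("wer")
accuracy = evaluate.load("accuracy")

# Word Error Rate over the ASR transcripts (reported per dataset above).
print("WER:", wer.compute(predictions=predictions, references=references))

# Gender / Age / Accent are scored as plain classification accuracy;
# string labels are mapped to integers for the accuracy metric.
label_map = {"Male": 0, "Female": 1}
ref_gender = ["Female", "Male"]
pred_gender = ["Female", "Female"]
print("Gender Acc:", accuracy.compute(
    predictions=[label_map[g] for g in pred_gender],
    references=[label_map[g] for g in ref_gender],
))
```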