ARTPARK-IISc/whisper-medium-vaani-hindi

@@ -1,5 +1,5 @@
 ---
-license: mit
 datasets:
 - ARTPARK-IISc/Vaani
 language:
@@ -8,53 +8,44 @@ base_model:
 - openai/whisper-medium
 pipeline_tag: automatic-speech-recognition
 ---
-```python
-import torch
-from transformers import WhisperForConditionalGeneration, WhisperProcessor, WhisperTokenizer,WhisperFeatureExtractor
-import soundfile as sf
-model="ARTPARK-IISc/whisper-medium-vaani-hindi"
-# Load tokenizer and feature extractor individually
-feature_extractor = WhisperFeatureExtractor.from_pretrained(model)
-tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-medium", language="Hindi", task="transcribe")
-# Create the processor manually
-processor = WhisperProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)
-# Load and preprocess the audio file
-audio_file_path = "Sample_Audio.wav"  # replace with your audio file path
-device = "cuda" if torch.cuda.is_available() else "cpu"
-# Load the processor and model
-model = WhisperForConditionalGeneration.from_pretrained(model).to(device)
-# load audio
-audio_data, sample_rate = sf.read(audio_file_path)
-# Ensure the audio is 16kHz (Whisper expects 16kHz audio)
-if sample_rate != 16000:
-    import torchaudio
-    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
-    audio_data = resampler(torch.tensor(audio_data).unsqueeze(0)).squeeze().numpy()
-# Use the processor to prepare the input features
-input_features = processor(audio_data, sampling_rate=16000, return_tensors="pt").input_features.to(device)
-# Generate transcription (disable gradient calculation during inference)
-with torch.no_grad():
-    predicted_ids = model.generate(input_features)
-# Decode the generated IDs into human-readable text
-transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
-print(transcription)
-```

 ---
+license: apache-2.0
 datasets:
 - ARTPARK-IISc/Vaani
 language:
 - openai/whisper-medium
 pipeline_tag: automatic-speech-recognition
 ---
+# Whisper-medium-vaani-hindi
+This is a fine-tuned version of [OpenAI's Whisper-Medium](https://huggingface.co/openai/whisper-medium), trained on approximately 718 hours of transcribed Hindi speech from multiple datasets.
+# Usage
+The model can be used with the `pipeline` function from the Transformers library.
+```python
+import torch
+from transformers import pipeline
+
+audio = "path to the audio file to be transcribed"  # replace with your audio file path
+device = "cuda:0" if torch.cuda.is_available() else "cpu"
+model_id = "ARTPARK-IISc/whisper-medium-vaani-hindi"
+
+# Build an ASR pipeline that processes long audio in 30-second chunks
+transcribe = pipeline(task="automatic-speech-recognition", model=model_id, chunk_length_s=30, device=device)
+# Force the decoder to transcribe (not translate) in Hindi
+transcribe.model.config.forced_decoder_ids = transcribe.tokenizer.get_decoder_prompt_ids(language="hi", task="transcribe")
+print("Transcription:", transcribe(audio)["text"])
+```
+# Training and Evaluation
+The model was fine-tuned on the following datasets: [Vaani](https://huggingface.co/datasets/ARTPARK-IISc/Vaani), [Gramvaani](https://sites.google.com/view/gramvaaniasrchallenge/dataset),
+[IndicVoices](https://huggingface.co/datasets/ai4bharat/IndicVoices), [Fleurs](https://huggingface.co/datasets/google/fleurs), [IndicTTS](https://huggingface.co/datasets/SPRINGLab/IndicTTS-Hindi),
+and [Commonvoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0).
+The model was evaluated on multiple datasets; the results are given below.
+| Dataset | WER |
+| :---: | :---: |
+| Gramvaani | 27.64 |
+| Fleurs | 14.34 |
+| IndicTTS | 7.78 |
+| MUCS | 23.46 |
+| Commonvoice | 19.90 |
+| Kathbath | 14.29 |
+| Kathbath Noisy | 16.03 |
+| Vaani | 25.48 |
+| RESPIN | 8.79 |
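
The card reports WER per dataset but does not say how it was computed. As a rough illustration only, the sketch below computes word error rate as a percentage from a word-level edit distance. The function name `word_error_rate` is hypothetical, and the whitespace tokenization with no text normalization is an assumption; the authors' actual scoring setup may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: (substitutions + deletions + insertions) / reference words.

    Assumption: words are split on whitespace and compared exactly,
    with no punctuation or script normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substituted word out of four reference words
print(word_error_rate("मेरा नाम राम है", "मेरा नाम श्याम है"))  # 25.0
```

In practice, corpus-level WER is usually computed by summing errors and reference words over all utterances rather than averaging per-utterance scores, so figures from this helper are comparable to the table only under that same convention.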