This repository provides merged model weights for Kyrgyz ASR. The model was created by LoRA fine-tuning and then merging the adapter into the base model.
During training, I fine-tuned a LoRA adapter (PEFT) and then used merge_and_unload() to bake the adapter weights into the base model. This repo contains the resulting standalone Transformers model (no PEFT needed for inference).
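For context, the sketch below shows roughly how such a merged checkpoint is produced with PEFT, using the LoRA settings summarized later in this card; the training loop is elided and the output path is a placeholder, so treat it as an illustration rather than the exact script used.

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Configure LoRA with the hyperparameters from the summary below
# (r=8, lora_alpha=16, lora_dropout=0.1, target modules q_proj/v_proj).
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(base, lora)

# ... fine-tune the adapter here (training loop elided) ...

# Fold the adapter deltas into the base weights and save a standalone model.
merged = model.merge_and_unload()
merged.save_pretrained("whisper-medium-ky-merged")  # placeholder output path
```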
Links

If you want the lightweight adapter-only version, see:

Dataset: fsicoli/common_voice_22_0 (config: ky)

Evaluation on Common Voice 22.0 Kyrgyz (test split):

- WER (normalized): 16.2061
- WER_ortho (orthographic): 19.1491
- test_loss: 0.1722

Quick check (200 random test samples):

- WER: 16.1677
- WER_ortho: 19.6021

A sketch of how these two WER variants are computed follows the training table below.

LoRA fine-tuning summary:

- r=8, lora_alpha=16, lora_dropout=0.1
- Target modules: q_proj, v_proj
- max_steps=4000
- Best checkpoint: checkpoint-4000 (WER=16.21)

Training progress (selected checkpoints):
| Step | Train loss | Val loss | WER_ortho | WER |
|---|---|---|---|---|
| 500 | 0.7980 | 0.7911 | 44.3501 | 42.0754 |
| 1000 | 0.3980 | 0.2043 | 28.9947 | 27.8551 |
| 1500 | 0.1712 | 0.1821 | 20.7479 | 17.7343 |
| 2000 | 0.1734 | 0.1770 | 20.7569 | 17.6977 |
| 2500 | 0.1935 | 0.1743 | 19.7995 | 16.8192 |
| 3000 | 0.3406 | 0.1728 | 19.8988 | 16.9656 |
| 3500 | 0.3192 | 0.1724 | 19.3840 | 16.4074 |
| 4000 | 0.1499 | 0.1722 | 19.1491 | 16.2061 |
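As noted above, WER is computed on normalized text and WER_ortho on the raw orthographic text. Here is a minimal sketch of that computation, assuming the `evaluate` library and the language-agnostic `BasicTextNormalizer` from transformers (the example strings are placeholders, not real model output):

```python
import evaluate
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

wer_metric = evaluate.load("wer")
normalizer = BasicTextNormalizer()  # lowercases text and strips punctuation

predictions = ["салам достор"]    # placeholder model transcription
references = ["Салам, достор!"]   # placeholder ground truth

# Orthographic WER: compare the raw strings as-is.
wer_ortho = 100 * wer_metric.compute(predictions=predictions, references=references)

# Normalized WER: normalize both sides before scoring.
wer = 100 * wer_metric.compute(
    predictions=[normalizer(p) for p in predictions],
    references=[normalizer(r) for r in references],
)
print(f"WER_ortho={wer_ortho:.2f}  WER={wer:.2f}")
```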
Usage

Install the dependencies:

```bash
pip install -U "transformers" "accelerate" "torch"
```
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "AleksTv/whisper-medium-ky-merged"
device = 0 if torch.cuda.is_available() else -1  # pipeline device index
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)

# Standard Whisper processor/tokenizer files are included in this repo;
# no remote custom Python code is required.
processor = AutoProcessor.from_pretrained(model_id)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
)

print(asr("path/to/audio.wav")["text"])
```
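The pipeline above transcribes a single short file. For recordings longer than Whisper's 30-second input window, the Transformers ASR pipeline supports chunked inference; the settings below are illustrative, not tuned for this model:

```python
# Chunked long-form inference (illustrative settings).
asr_long = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
    chunk_length_s=30,  # split long audio into ~30 s windows
    batch_size=8,       # decode several chunks in parallel
)
print(asr_long("path/to/long_audio.wav")["text"])
```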
License: Apache-2.0.
If you use this model, please cite Whisper:
```bibtex
@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022}
}
```
Base model
openai/whisper-medium