This repository provides merged model weights for Kyrgyz ASR. The model was created by LoRA fine-tuning and then merging the adapter into the base model.
During training, I fine-tuned a LoRA adapter (PEFT) and then used merge_and_unload() to bake the adapter weights into the base model. This repo contains the resulting standalone Transformers model (no PEFT needed for inference).
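For context, the sketch below shows roughly how such a merged checkpoint is produced with PEFT, using the LoRA settings summarized later in this card; the training loop is elided and the output path is a placeholder, so treat it as an illustration rather than the exact script used.

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Configure LoRA with the hyperparameters from the summary below
# (r=8, lora_alpha=16, lora_dropout=0.1, target modules q_proj/v_proj).
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(base, lora)

# ... fine-tune the adapter here (training loop elided) ...

# Fold the adapter deltas into the base weights and save a standalone model.
merged = model.merge_and_unload()
merged.save_pretrained("whisper-medium-ky-merged")  # placeholder output path
```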
Links

If you want the lightweight adapter-only version, see:

Dataset: fsicoli/common_voice_22_0 (config: ky)

Evaluation on Common Voice 22.0 Kyrgyz (test split):

- WER (normalized): 16.2061
- WER_ortho (orthographic): 19.1491
- test_loss: 0.1722

Quick check (200 random test samples):

- WER: 16.1677
- WER_ortho: 19.6021

A sketch of how these two WER variants are computed follows the training table below.

LoRA fine-tuning summary:

- r=8, lora_alpha=16, lora_dropout=0.1
- Target modules: q_proj, v_proj
- max_steps=4000
- Best checkpoint: checkpoint-4000 (WER=16.21)

Training progress (selected checkpoints):
| Step | Train loss | Val loss | WER_ortho | WER |
|---|---|---|---|---|
| 500 | 0.7980 | 0.7911 | 44.3501 | 42.0754 |
| 1000 | 0.3980 | 0.2043 | 28.9947 | 27.8551 |
| 1500 | 0.1712 | 0.1821 | 20.7479 | 17.7343 |
| 2000 | 0.1734 | 0.1770 | 20.7569 | 17.6977 |
| 2500 | 0.1935 | 0.1743 | 19.7995 | 16.8192 |
| 3000 | 0.3406 | 0.1728 | 19.8988 | 16.9656 |
| 3500 | 0.3192 | 0.1724 | 19.3840 | 16.4074 |
| 4000 | 0.1499 | 0.1722 | 19.1491 | 16.2061 |
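As noted above, WER is computed on normalized text and WER_ortho on the raw orthographic text. Here is a minimal sketch of that computation, assuming the `evaluate` library and the language-agnostic `BasicTextNormalizer` from transformers (the example strings are placeholders, not real model output):

```python
import evaluate
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

wer_metric = evaluate.load("wer")
normalizer = BasicTextNormalizer()  # lowercases text and strips punctuation

predictions = ["салам достор"]    # placeholder model transcription
references = ["Салам, достор!"]   # placeholder ground truth

# Orthographic WER: compare the raw strings as-is.
wer_ortho = 100 * wer_metric.compute(predictions=predictions, references=references)

# Normalized WER: normalize both sides before scoring.
wer = 100 * wer_metric.compute(
    predictions=[normalizer(p) for p in predictions],
    references=[normalizer(r) for r in references],
)
print(f"WER_ortho={wer_ortho:.2f}  WER={wer:.2f}")
```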
Usage

Install the dependencies:

```bash
pip install -U "transformers" "accelerate" "torch"
```
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "AleksTv/whisper-medium-ky-merged"
device = 0 if torch.cuda.is_available() else -1  # pipeline device index
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)

# Standard Whisper processor/tokenizer files are included in this repo;
# no remote custom Python code is required.
processor = AutoProcessor.from_pretrained(model_id)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
)

print(asr("path/to/audio.wav")["text"])
```
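The pipeline above transcribes a single short file. For recordings longer than Whisper's 30-second input window, the Transformers ASR pipeline supports chunked inference; the settings below are illustrative, not tuned for this model:

```python
# Chunked long-form inference (illustrative settings).
asr_long = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
    chunk_length_s=30,  # split long audio into ~30 s windows
    batch_size=8,       # decode several chunks in parallel
)
print(asr_long("path/to/long_audio.wav")["text"])
```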
License: Apache-2.0.
If you use this model, please cite Whisper:
```bibtex
@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022}
}
```
Base model
openai/whisper-medium