---
license: apache-2.0
language:
  - sw
base_model:
  - facebook/mms-tts-swh
pipeline_tag: text-to-speech
datasets:
  - mozilla-foundation/common_voice_17_0
metrics:
  - wer
tags:
  - text-to-speech
  - audio
  - speech
  - transformers
  - vits
  - swahili
---

# 🔊 SALAMA-TTS — Swahili Text-to-Speech Model

**Developer:** AI4NNOV
**Version:** v1.0
**License:** Apache 2.0
**Model Type:** Text-to-Speech (TTS)
**Base Model:** `facebook/mms-tts-swh` (fine-tuned)

---

## 🌍 Overview

**SALAMA-TTS** is the **speech synthesis module** of the **SALAMA Framework**, a complete end-to-end **speech-to-speech AI system** for African languages. It generates **natural, high-quality Swahili speech** from text and integrates seamlessly with **SALAMA-LLM** and **SALAMA-STT** for conversational voice assistants.

The model is based on **Meta's MMS (Massively Multilingual Speech)** TTS architecture using the **VITS framework**, fine-tuned for natural prosody, tone, and rhythm in Swahili.

---

## 🧱 Model Architecture

SALAMA-TTS is built on the **VITS architecture**, combining the strengths of **variational autoencoders (VAEs)** and **GANs** for realistic and expressive speech synthesis.

| Parameter | Value |
|-----------|-------|
| Base Model | `facebook/mms-tts-swh` |
| Fine-Tuning | 8-bit quantized, LoRA fine-tuning |
| Optimizer | AdamW |
| Learning Rate | 2e-5 |
| Epochs | 20 |
| Sampling Rate | 16 kHz |
| Frameworks | Transformers + Datasets + PyTorch |
| Language | Swahili (`sw`) |

---

## 📚 Dataset

| Dataset | Description | Purpose |
|---------|-------------|---------|
| `common_voice_17_0` | Swahili voice dataset by Mozilla | Base training |
| Custom Swahili speech corpus | Locally recorded sentences and dialogues | Fine-tuning for naturalness |
| Common Voice Swahili (test split) | Held-out evaluation data | Evaluation |

---

## 🧠 Model Capabilities

- Converts **Swahili text to natural-sounding speech**
- Handles **both formal and conversational** tones
- High clarity and prosody for long-form speech
- Seamless integration with **SALAMA-LLM** responses
- Output format: **16-bit PCM WAV**

---

## 📊 Evaluation Metrics

| Metric | Score | Description |
|--------|-------|-------------|
| **MOS (Mean Opinion Score)** | **4.05 / 5.0** | Human-rated naturalness |
| **WER (Generated → STT)** | **0.21** | Evaluated by re-transcribing synthesized audio with an STT model |

> MOS was rated by 12 native Swahili speakers on clarity, tone, and pronunciation.
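The round-trip WER is obtained by re-transcribing the synthesized audio and scoring the transcripts against the input text. The snippet below is a minimal sketch of that check, not the exact evaluation script: it assumes the `jiwer` package (`pip install jiwer`), uses `openai/whisper-small` as a stand-in for the STT model actually used (e.g. the SALAMA-STT module), and the file names and normalization are placeholders.

```python
# Sketch of the "WER (Generated -> STT)" round-trip check.
# Assumptions: the WAV files below were synthesized from the reference
# sentences at 16 kHz, and the ASR checkpoint is a placeholder.
import soundfile as sf
import jiwer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

references = [
    "Karibu kwenye mfumo wa SALAMA.",
    "Habari ya asubuhi, karibu tena.",
]
generated_wavs = ["tts_outputs/sample_0.wav", "tts_outputs/sample_1.wav"]

hypotheses = []
for path in generated_wavs:
    audio, sr = sf.read(path, dtype="float32")
    # Pass the raw waveform + sampling rate so no external decoder is needed.
    hypotheses.append(asr({"raw": audio, "sampling_rate": sr})["text"])

# Light normalization only; a stricter protocol would also strip punctuation.
refs_norm = [r.lower().strip() for r in references]
hyps_norm = [h.lower().strip() for h in hypotheses]

print("Round-trip WER:", jiwer.wer(refs_norm, hyps_norm))
```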
---

## ⚙️ Usage (Python Example)

```python
# Requirements:
#   pip install onnxruntime soundfile transformers numpy
#   For GPU inference: pip install onnxruntime-gpu (and ensure the CUDA toolkit is available)

import os

import numpy as np
import onnxruntime
import soundfile as sf
from transformers import AutoTokenizer

TTS_ONNX_MODEL_PATH = "swahili_tts.onnx"   # path to your .onnx file
TTS_TOKENIZER_ID = "facebook/mms-tts-swh"  # or whichever tokenizer you used
OUTPUT_SAMPLE_RATE = 16000
OUT_DIR = "tts_outputs"
os.makedirs(OUT_DIR, exist_ok=True)


def create_onnx_session(onnx_path: str) -> onnxruntime.InferenceSession:
    """Create an ONNX Runtime session using GPU if available, otherwise CPU."""
    try:
        # Prefer CUDA if available.
        providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        sess = onnxruntime.InferenceSession(onnx_path, providers=providers)
        print("Using CUDAExecutionProvider for ONNX Runtime.")
    except Exception:
        sess = onnxruntime.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
        print("CUDA not available; using CPUExecutionProvider for ONNX Runtime.")
    return sess


def generate_speech_from_onnx(
    text: str,
    onnx_session: onnxruntime.InferenceSession,
    tokenizer: AutoTokenizer,
    out_path: str = None,
) -> str:
    """
    Synthesize speech from text using an ONNX TTS model.

    Returns the path to a WAV file (16 kHz, int16).
    """
    if not text:
        raise ValueError("Empty text provided.")

    # Tokenize to NumPy inputs (match what the ONNX model expects).
    # NOTE: many TTS tokenizers return {"input_ids": np.array(...)}; adapt if yours differs.
    inputs = tokenizer(text, return_tensors="np", padding=True)

    # Identify the ONNX input name (assume the first input).
    input_name = onnx_session.get_inputs()[0].name

    # Prepare the input dict using the names expected by the ONNX model.
    ort_inputs = {input_name: inputs["input_ids"].astype(np.int64)}

    # Run ONNX inference.
    ort_outs = onnx_session.run(None, ort_inputs)

    # The model should return a raw waveform or a float array convertible to one.
    # In many single-file TTS ONNX exports the first output is the waveform.
    audio_array = ort_outs[0]

    # Flatten in case it is multi-dimensional and ensure a 1-D waveform.
    audio_waveform = audio_array.flatten()

    # If it is a float waveform in [-1, 1], convert to int16; otherwise cast as a safeguard.
    if np.issubdtype(audio_waveform.dtype, np.floating):
        audio_clip = np.clip(audio_waveform, -1.0, 1.0)
        audio_int16 = (audio_clip * 32767.0).astype(np.int16)
    else:
        audio_int16 = audio_waveform.astype(np.int16)

    # Compose the output filename.
    if out_path is None:
        out_path = os.path.join(OUT_DIR, f"salama_tts_{abs(hash(text)) & 0xFFFF_FFFF}.wav")

    # Save with soundfile (16 kHz, 16-bit PCM).
    sf.write(out_path, audio_int16, samplerate=OUTPUT_SAMPLE_RATE, subtype="PCM_16")
    return out_path


if __name__ == "__main__":
    # Example usage
    sess = create_onnx_session(TTS_ONNX_MODEL_PATH)
    tokenizer = AutoTokenizer.from_pretrained(TTS_TOKENIZER_ID)
    example_text = "Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili."
    out_wav = generate_speech_from_onnx(example_text, sess, tokenizer)
    print("Saved synthesized audio to:", out_wav)
```

**Example Output:**

> *Audio plays:* "Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili."

---

## ⚡ Key Features

- 🗣️ **Natural Swahili speech generation**
- 🌍 **Adapted for African tonal variations**
- 🔉 **High clarity and rhythm**
- ⚙️ **Fast inference with FP16 precision**
- 🔗 **Compatible with SALAMA-STT and SALAMA-LLM**
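---

## 🧪 Alternative: Direct PyTorch Inference

For quick experiments without an ONNX export, MMS-TTS checkpoints can also be driven directly through the Transformers `VitsModel` API. The snippet below is a minimal sketch against the base checkpoint `facebook/mms-tts-swh`; a fine-tuned SALAMA-TTS checkpoint published in the same format could be loaded by swapping in its repository id.

```python
# Minimal PyTorch/Transformers sketch (no ONNX export required).
# Uses the base MMS-TTS Swahili checkpoint; substitute the fine-tuned
# SALAMA-TTS repository id here if it is published in the same format.
import torch
import soundfile as sf
from transformers import VitsModel, AutoTokenizer

model_id = "facebook/mms-tts-swh"
model = VitsModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    waveform = model(**inputs).waveform  # (batch, samples), float32 in [-1, 1]

sf.write("salama_vits.wav", waveform.squeeze().cpu().numpy(), model.config.sampling_rate)
```

This path keeps everything in PyTorch, which is convenient for prototyping; the ONNX route above remains the lighter option for deployment.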