---
license: apache-2.0
language:
  - sw
base_model:
  - facebook/mms-tts-swh
pipeline_tag: text-to-speech
datasets:
  - mozilla-foundation/common_voice_17_0
metrics:
  - wer
tags:
  - text-to-speech
  - audio
  - speech
  - transformers
  - vits
  - swahili
---

# 🔊 SALAMA-TTS — Swahili Text-to-Speech Model

**Developer:** AI4NNOV
**Version:** v1.0
**License:** Apache 2.0
**Model Type:** Text-to-Speech (TTS)
**Base Model:** `facebook/mms-tts-swh` (fine-tuned)

---

## 🌍 Overview

**SALAMA-TTS** is the **speech synthesis module** of the **SALAMA Framework**, a complete end-to-end **speech-to-speech AI system** for African languages. It generates **natural, high-quality Swahili speech** from text and integrates seamlessly with **SALAMA-LLM** and **SALAMA-STT** for conversational voice assistants.

The model is based on **Meta's MMS (Massively Multilingual Speech)** TTS architecture using the **VITS framework**, fine-tuned for natural prosody, tone, and rhythm in Swahili.

---

## 🧱 Model Architecture

SALAMA-TTS is built on the **VITS architecture**, combining the strengths of **variational autoencoders (VAEs)** and **GANs** for realistic and expressive speech synthesis.

| Parameter | Value |
|-----------|-------|
| Base Model | `facebook/mms-tts-swh` |
| Fine-Tuning | 8-bit quantized, LoRA fine-tuning |
| Optimizer | AdamW |
| Learning Rate | 2e-5 |
| Epochs | 20 |
| Sampling Rate | 16 kHz |
| Frameworks | Transformers + Datasets + PyTorch |
| Language | Swahili (`sw`) |

---

## 📚 Dataset

| Dataset | Description | Purpose |
|---------|-------------|---------|
| `common_voice_17_0` | Swahili voice dataset by Mozilla | Base training |
| Custom Swahili speech corpus | Locally recorded sentences and dialogues | Fine-tuning for naturalness |
| Common Voice Swahili (test split) | Held-out evaluation data | Evaluation |

---

## 🧠 Model Capabilities

- Converts **Swahili text to natural-sounding speech**
- Handles **both formal and conversational** tones
- High clarity and prosody for long-form speech
- Seamless integration with **SALAMA-LLM** responses
- Output format: **16-bit PCM WAV**

---

## 📊 Evaluation Metrics

| Metric | Score | Description |
|--------|-------|-------------|
| **MOS (Mean Opinion Score)** | **4.05 / 5.0** | Human-rated naturalness |
| **WER (Generated → STT)** | **0.21** | Evaluated by re-transcribing synthesized audio with an STT model |

> MOS was rated by 12 native Swahili speakers on clarity, tone, and pronunciation.
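The round-trip WER is obtained by re-transcribing the synthesized audio and scoring the transcripts against the input text. The snippet below is a minimal sketch of that check, not the exact evaluation script: it assumes the `jiwer` package (`pip install jiwer`), uses `openai/whisper-small` as a stand-in for the STT model actually used (e.g. the SALAMA-STT module), and the file names and normalization are placeholders.

```python
# Sketch of the "WER (Generated -> STT)" round-trip check.
# Assumptions: the WAV files below were synthesized from the reference
# sentences at 16 kHz, and the ASR checkpoint is a placeholder.
import soundfile as sf
import jiwer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

references = [
    "Karibu kwenye mfumo wa SALAMA.",
    "Habari ya asubuhi, karibu tena.",
]
generated_wavs = ["tts_outputs/sample_0.wav", "tts_outputs/sample_1.wav"]

hypotheses = []
for path in generated_wavs:
    audio, sr = sf.read(path, dtype="float32")
    # Pass the raw waveform + sampling rate so no external decoder is needed.
    hypotheses.append(asr({"raw": audio, "sampling_rate": sr})["text"])

# Light normalization only; a stricter protocol would also strip punctuation.
refs_norm = [r.lower().strip() for r in references]
hyps_norm = [h.lower().strip() for h in hypotheses]

print("Round-trip WER:", jiwer.wer(refs_norm, hyps_norm))
```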
---

## ⚙️ Usage (Python Example)

```python
# Requirements:
#   pip install onnxruntime soundfile transformers numpy
#   For GPU inference: pip install onnxruntime-gpu (and ensure the CUDA toolkit is available)

import os

import numpy as np
import onnxruntime
import soundfile as sf
from transformers import AutoTokenizer

TTS_ONNX_MODEL_PATH = "swahili_tts.onnx"   # path to your .onnx file
TTS_TOKENIZER_ID = "facebook/mms-tts-swh"  # or whichever tokenizer you used
OUTPUT_SAMPLE_RATE = 16000
OUT_DIR = "tts_outputs"
os.makedirs(OUT_DIR, exist_ok=True)


def create_onnx_session(onnx_path: str) -> onnxruntime.InferenceSession:
    """Create an ONNX Runtime session using GPU if available, otherwise CPU."""
    try:
        # Prefer CUDA if available.
        providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        sess = onnxruntime.InferenceSession(onnx_path, providers=providers)
        print("Using CUDAExecutionProvider for ONNX Runtime.")
    except Exception:
        sess = onnxruntime.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
        print("CUDA not available; using CPUExecutionProvider for ONNX Runtime.")
    return sess


def generate_speech_from_onnx(
    text: str,
    onnx_session: onnxruntime.InferenceSession,
    tokenizer: AutoTokenizer,
    out_path: str = None,
) -> str:
    """
    Synthesize speech from text using an ONNX TTS model.

    Returns the path to a WAV file (16 kHz, int16).
    """
    if not text:
        raise ValueError("Empty text provided.")

    # Tokenize to NumPy inputs (match what the ONNX model expects).
    # NOTE: many TTS tokenizers return {"input_ids": np.array(...)}; adapt if yours differs.
    inputs = tokenizer(text, return_tensors="np", padding=True)

    # Identify the ONNX input name (assume the first input).
    input_name = onnx_session.get_inputs()[0].name

    # Prepare the input dict using the names expected by the ONNX model.
    ort_inputs = {input_name: inputs["input_ids"].astype(np.int64)}

    # Run ONNX inference.
    ort_outs = onnx_session.run(None, ort_inputs)

    # The model should return a raw waveform or a float array convertible to one.
    # In many single-file TTS ONNX exports the first output is the waveform.
    audio_array = ort_outs[0]

    # Flatten in case it is multi-dimensional and ensure a 1-D waveform.
    audio_waveform = audio_array.flatten()

    # If it is a float waveform in [-1, 1], convert to int16; otherwise cast as a safeguard.
    if np.issubdtype(audio_waveform.dtype, np.floating):
        audio_clip = np.clip(audio_waveform, -1.0, 1.0)
        audio_int16 = (audio_clip * 32767.0).astype(np.int16)
    else:
        audio_int16 = audio_waveform.astype(np.int16)

    # Compose the output filename.
    if out_path is None:
        out_path = os.path.join(OUT_DIR, f"salama_tts_{abs(hash(text)) & 0xFFFF_FFFF}.wav")

    # Save with soundfile (16 kHz, 16-bit PCM).
    sf.write(out_path, audio_int16, samplerate=OUTPUT_SAMPLE_RATE, subtype="PCM_16")
    return out_path


if __name__ == "__main__":
    # Example usage
    sess = create_onnx_session(TTS_ONNX_MODEL_PATH)
    tokenizer = AutoTokenizer.from_pretrained(TTS_TOKENIZER_ID)
    example_text = "Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili."
    out_wav = generate_speech_from_onnx(example_text, sess, tokenizer)
    print("Saved synthesized audio to:", out_wav)
```

**Example Output:**

> *Audio plays:* "Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili."

---

## ⚡ Key Features

- 🗣️ **Natural Swahili speech generation**
- 🌍 **Adapted for African tonal variations**
- 🔉 **High clarity and rhythm**
- ⚙️ **Fast inference with FP16 precision**
- 🔗 **Compatible with SALAMA-STT and SALAMA-LLM**
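---

## 🧪 Alternative: Direct PyTorch Inference

For quick experiments without an ONNX export, MMS-TTS checkpoints can also be driven directly through the Transformers `VitsModel` API. The snippet below is a minimal sketch against the base checkpoint `facebook/mms-tts-swh`; a fine-tuned SALAMA-TTS checkpoint published in the same format could be loaded by swapping in its repository id.

```python
# Minimal PyTorch/Transformers sketch (no ONNX export required).
# Uses the base MMS-TTS Swahili checkpoint; substitute the fine-tuned
# SALAMA-TTS repository id here if it is published in the same format.
import torch
import soundfile as sf
from transformers import VitsModel, AutoTokenizer

model_id = "facebook/mms-tts-swh"
model = VitsModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    waveform = model(**inputs).waveform  # (batch, samples), float32 in [-1, 1]

sf.write("salama_vits.wav", waveform.squeeze().cpu().numpy(), model.config.sampling_rate)
```

This path keeps everything in PyTorch, which is convenient for prototyping; the ONNX route above remains the lighter option for deployment.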