# VibeVoice Voice Cloning Test

**IMPORTANT:** Voice cloning with custom audio works ONLY through the Gradio interface.

The command-line script uses only the built-in voices (Alice, Frank, etc.).

In [2]:
# Setup: confirm that a CUDA GPU is visible to PyTorch
import torch
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA GPU detected")

GPU: NVIDIA L40S


In [3]:
# Install VibeVoice
![ -d /root/VibeVoice ] || git clone --quiet https://github.com/cseti007/VibeVoice.git /root/VibeVoice
%uv pip install --quiet -e /root/VibeVoice
print("Installed")

Note: you may need to restart the kernel to use updated packages.
Installed


In [4]:
# Download models
!huggingface-cli download aoi-ot/VibeVoice-Large --local-dir /root/models/VibeVoice-Large --quiet
!huggingface-cli download ABDALLALSWAITI/vibevoice-arabic-Z --local-dir /root/models/vibevoice-arabic-Z --quiet
print("Models ready")

/root/models/VibeVoice-Large
/root/models/vibevoice-arabic-Z
Models ready
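
A quick sanity check after the download can catch an empty or partial directory. The sketch below only sums file sizes under the two `--local-dir` paths used above; the helper name `dir_size_bytes` is hypothetical, and it assumes nothing about which files the repositories contain.

```python
import os

def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):  # skip broken symlinks
                total += os.path.getsize(full)
    return total

# Paths taken from the download cell above.
for model_dir in ("/root/models/VibeVoice-Large", "/root/models/vibevoice-arabic-Z"):
    if os.path.isdir(model_dir):
        print(f"{model_dir}: {dir_size_bytes(model_dir) / 1e9:.2f} GB")
    else:
        print(f"{model_dir}: missing")
```

A size near zero suggests the download did not complete and should be rerun.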


In [5]:
# Launch Gradio with Arabic LoRA
!python /root/VibeVoice/demo/gradio_demo.py \
    --model_path /root/models/VibeVoice-Large \
    --checkpoint_path /root/models/vibevoice-arabic-Z \
    --share

APEX FusedRMSNorm not available, using native implementation
🎙️ Initializing VibeVoice Demo with Streaming Support...
Loading processor & model from /root/models/VibeVoice-Large
Using device: cuda
tokenizer_config.json: 7.23kB [00:00, 25.5MB/s]
vocab.json: 2.78MB [00:00, 134MB/s]
merges.txt: 1.67MB [00:00, 148MB/s]
tokenizer.json: 7.03MB [00:00, 175MB/s]
loading file vocab.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B/snapshots/d149729398750b98c0af14eb82c78cfe92750796/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B/snapshots/d149729398750b98c0af14eb82c78cfe92750796/merges.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B/snapshots/d149729398750b98c0af14eb82c78cfe92750796/tokenizer.json
loading file ad

## Alternative: Test Built-in Voices

To test the Arabic LoRA with the built-in voices instead of a custom voice:

In [11]:
# Create the two-speaker Arabic test script
import os
text = """Speaker 1: مرحباً بكم، اسمي سامي.
أنا الآن أختبر تقنية جديدة لتحويل النص إلى كلام.

كيف يبدو صوتي؟
هل تسمع النبرة الطبيعية في حديثي؟

الأردن بلد الجبال والبحر والصحراء،
وفي كل مدينةٍ قصة، وفي كل شارعٍ حكاية.

الحياة رحلة نتعلّم منها كل يوم،
فلنبتسم الآن… ولنبدأ من جديد.
Speaker 2: أنا بخير شكرا"""
with open('/root/test.txt', 'w', encoding='utf-8') as f:
    f.write(text)
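
Before running inference it can help to confirm that the script splits into the speaker turns you expect. The sketch below is a stand-in, not the parser used by `inference_from_file.py`: it assumes a turn starts at a line of the form `Speaker <number>: ...` and that unlabeled lines continue the current turn. The names `SPEAKER_LINE` and `split_speaker_segments` are hypothetical.

```python
import re

# Assumption: a new segment starts at a line of the form "Speaker <number>: ...".
SPEAKER_LINE = re.compile(r"^Speaker\s+(\d+)\s*:\s*(.*)$")

def split_speaker_segments(script):
    """Split a script into (speaker_id, text) pairs.

    Blank lines are ignored; unlabeled lines extend the current segment.
    Lines that appear before the first speaker label are dropped.
    """
    segments = []
    for line in script.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SPEAKER_LINE.match(line)
        if match:
            segments.append([int(match.group(1)), match.group(2)])
        elif segments:
            segments[-1][1] += " " + line
    return [(speaker, text.strip()) for speaker, text in segments]

sample = "Speaker 1: Hello there.\nThis is a second line.\n\nSpeaker 2: I am fine, thanks."
for speaker, text in split_speaker_segments(sample):
    print(speaker, text)
```

Calling `split_speaker_segments(text)` on the test script gives a segment count to compare against what the inference script reports.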

In [15]:
# WITH LoRA (built-in Alice and Frank voices)
os.makedirs('/root/outputs/builtin_with_lora', exist_ok=True)
!python /root/VibeVoice/demo/inference_from_file.py \
    --model_path /root/models/VibeVoice-Large \
    --txt_path /root/test.txt \
    --speaker_names Alice Frank \
    --checkpoint_path /root/models/vibevoice-arabic-Z \
    --output_dir /root/outputs/builtin_with_lora

APEX FusedRMSNorm not available, using native implementation
Using device: cuda
Found 9 voice files in /root/VibeVoice/demo/voices
Available voices: en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
Reading script from: /root/test.txt
Found 2 speaker segments:
  1. Speaker 1
     Text preview: Speaker 1: مرحباً بكم، اسمي سامي. أنا الآن أختبر تقنية جديدة لتحويل النص إلى كلام. كيف يبدو صوتي؟ هل...
  2. Speaker 2
     Text preview: Speaker 2: أنا بخير شكرا...

Speaker mapping:
  Speaker 2 -> Frank
  Speaker 1 -> Alice
Speaker 1 ('Alice') -> Voice: en-Alice_woman.wav
Speaker 2 ('Frank') -> Voice: en-Frank_man.wav
Loading processor & model from /root/models/VibeVoice-Large
loading file vocab.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B/snapshots/d149729398750b98c0af14eb82c78cfe92750796/vocab.json
loading file merges.txt from cache at /root/.cache/hugging

In [13]:
# WITHOUT LoRA (built-in Alice and Frank voices)
os.makedirs('/root/outputs/builtin_without_lora', exist_ok=True)
!python /root/VibeVoice/demo/inference_from_file.py \
    --model_path /root/models/VibeVoice-Large \
    --txt_path /root/test.txt \
    --speaker_names Alice Frank \
    --output_dir /root/outputs/builtin_without_lora

APEX FusedRMSNorm not available, using native implementation
Using device: cuda
Found 9 voice files in /root/VibeVoice/demo/voices
Available voices: en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
Reading script from: /root/test.txt
Found 2 speaker segments:
  1. Speaker 1
     Text preview: Speaker 1: مرحباً بكم، اسمي سامي. أنا الآن أختبر تقنية جديدة لتحويل النص إلى كلام. كيف يبدو صوتي؟ هل...
  2. Speaker 2
     Text preview: Speaker 2: أنا بخير شكرا...

Speaker mapping:
  Speaker 2 -> Frank
  Speaker 1 -> Alice
Speaker 1 ('Alice') -> Voice: en-Alice_woman.wav
Speaker 2 ('Frank') -> Voice: en-Frank_man.wav
Loading processor & model from /root/models/VibeVoice-Large
loading file vocab.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B/snapshots/d149729398750b98c0af14eb82c78cfe92750796/vocab.json
loading file merges.txt from cache at /root/.cache/hugging
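
The two runs above differ only in `--checkpoint_path` and `--output_dir`. One way to keep them in sync is to build both argument lists from a single helper. This is a minimal sketch: it uses only the flags that appear in the cells above, and the function name `build_inference_args` is hypothetical.

```python
import shlex

def build_inference_args(model_path, txt_path, speaker_names, output_dir, checkpoint_path=None):
    """Build the argument list for demo/inference_from_file.py as invoked in this notebook."""
    args = [
        "python", "/root/VibeVoice/demo/inference_from_file.py",
        "--model_path", model_path,
        "--txt_path", txt_path,
        "--speaker_names", *speaker_names,
        "--output_dir", output_dir,
    ]
    # The LoRA checkpoint is the only difference between the two runs.
    if checkpoint_path is not None:
        args += ["--checkpoint_path", checkpoint_path]
    return args

common = dict(
    model_path="/root/models/VibeVoice-Large",
    txt_path="/root/test.txt",
    speaker_names=["Alice", "Frank"],
)
with_lora = build_inference_args(
    **common,
    output_dir="/root/outputs/builtin_with_lora",
    checkpoint_path="/root/models/vibevoice-arabic-Z",
)
without_lora = build_inference_args(**common, output_dir="/root/outputs/builtin_without_lora")

print(shlex.join(with_lora))
print(shlex.join(without_lora))
```

Either list can be passed to `subprocess.run(args, check=True)` instead of using the `!python` shell escape.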

In [16]:
# Listen to built-in voice comparison
from IPython.display import Audio, display, HTML
display(HTML("<h3>WITH LoRA (Alice and Frank)</h3>"))
display(Audio("/root/outputs/builtin_with_lora/test_generated.wav"))
display(HTML("<h3>WITHOUT LoRA (Alice and Frank)</h3>"))
display(Audio("/root/outputs/builtin_without_lora/test_generated.wav"))
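
If a player shows up empty, the file may be missing or unreadable. The sketch below reports the duration of each output with the standard-library `wave` module. It assumes the outputs are PCM WAV files; `wave` cannot read float-encoded WAVs, in which case the hypothetical helper `wav_duration_seconds` returns `None` rather than failing.

```python
import os
import wave

def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds, or None if it cannot be read."""
    if not os.path.isfile(path):
        return None
    try:
        with wave.open(path, "rb") as wav:
            return wav.getnframes() / wav.getframerate()
    except (wave.Error, EOFError):
        # Non-PCM (for example float) or truncated files are not readable by the wave module.
        return None

# Output paths taken from the comparison cell above.
for path in (
    "/root/outputs/builtin_with_lora/test_generated.wav",
    "/root/outputs/builtin_without_lora/test_generated.wav",
):
    duration = wav_duration_seconds(path)
    status = "unreadable or missing" if duration is None else f"{duration:.1f} s"
    print(f"{path}: {status}")
```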