Vāķ Translate 1.3B — CTranslate2
CTranslate2-converted version of vak-translate-1.3b, an open-weight translation model covering 55 Indian languages and 2,970 language pairs.
The conversion from the original Transformers checkpoint enables faster CPU and GPU inference with reduced memory usage.
Built by Shunya Labs. Part of the Vāķ suite launched at the India AI Impact Summit 2026.
CTranslate2 Quick Start
pip install ctranslate2 transformers sentencepiece huggingface_hub
import ctranslate2
from transformers import NllbTokenizer
from huggingface_hub import snapshot_download
# Download model to local cache and load
model_dir = snapshot_download("shunyalabs/vak-translate-1.3b-ct2")
tokenizer = NllbTokenizer.from_pretrained("shunyalabs/vak-translate-1.3b-ct2")
device = "cuda" if ctranslate2.get_cuda_device_count() > 0 else "cpu"
translator = ctranslate2.Translator(model_dir, device=device)
# Translate English to Hindi
src_lang = "eng_Latn"
tgt_lang = "hin_Deva"
tokenizer.src_lang = src_lang
inputs = tokenizer("Hello, how are you?")
src_tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"])
results = translator.translate_batch(
    [src_tokens],
    target_prefix=[[tgt_lang]],
    beam_size=4,
    max_decoding_length=256,
)
output_tokens = results[0].hypotheses[0]
output_ids = tokenizer.convert_tokens_to_ids(output_tokens)
translation = tokenizer.decode(output_ids, skip_special_tokens=True)
print(translation)
Batch Translation
# Translate a batch of sentences (English → Hindi)
texts = [
    "The sun rises in the east.",
    "Water is essential for life.",
    "Education is the most powerful weapon.",
]
tokenizer.src_lang = "eng_Latn"
# CTranslate2 expects token strings, so convert the ids back to tokens
all_src_tokens = [
    tokenizer.convert_ids_to_tokens(tokenizer(t)["input_ids"])
    for t in texts
]
results = translator.translate_batch(
    all_src_tokens,
    target_prefix=[["hin_Deva"]] * len(texts),
    beam_size=4,
    max_decoding_length=256,
)
# Print each source sentence next to its translation
for orig, result in zip(texts, results):
    ids = tokenizer.convert_tokens_to_ids(result.hypotheses[0])
    print(orig, "->", tokenizer.decode(ids, skip_special_tokens=True))
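Since the model translates between any two supported languages, the batch steps above can be wrapped in a small helper that takes the language codes as arguments. This is a minimal sketch, not part of the released code: the function name `translate_texts` is hypothetical, and it assumes `translator` and `tokenizer` were loaded as in the Quick Start.

```python
def translate_texts(translator, tokenizer, texts, src_lang, tgt_lang,
                    beam_size=4, max_decoding_length=256):
    """Translate a list of strings from src_lang to tgt_lang.

    Hypothetical helper: `translator` is assumed to be a ctranslate2.Translator
    and `tokenizer` an NllbTokenizer, both loaded as in the Quick Start.
    """
    tokenizer.src_lang = src_lang
    # CTranslate2 consumes token strings rather than ids.
    batch = [
        tokenizer.convert_ids_to_tokens(tokenizer(text)["input_ids"])
        for text in texts
    ]
    results = translator.translate_batch(
        batch,
        target_prefix=[[tgt_lang]] * len(batch),
        beam_size=beam_size,
        max_decoding_length=max_decoding_length,
    )
    translations = []
    for result in results:
        # The best hypothesis starts with the target language token;
        # skip_special_tokens removes it during decoding.
        ids = tokenizer.convert_tokens_to_ids(result.hypotheses[0])
        translations.append(tokenizer.decode(ids, skip_special_tokens=True))
    return translations
```

With the objects from the Quick Start in scope, `translate_texts(translator, tokenizer, texts, "eng_Latn", "hin_Deva")` should behave like the batch example above, and swapping the two codes reverses the direction.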
Highlights
- 55 Indian languages across 5 language families (Indo-Aryan, Dravidian, Austroasiatic, Sino-Tibetan, Indo-European)
- 2,970 translation pairs (55 × 54 directed pairs) - any-to-any translation between all supported languages
- 1.3B parameters - encoder-decoder architecture with 24+24 layers
- Open weights under CC-BY-SA-4.0
- Weighted average BLEU: 38.5 (by speaker count)
- Covers 1.17 billion+ native speakers across every region of India
- First open-weight translation model for many Indian languages including Bhojpuri, Rajasthani, Chhattisgarhi, and Magahi
- CTranslate2 format for optimized CPU/GPU inference
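The 2,970 figure follows from counting every ordered source/target combination among 55 languages. A short sketch of that arithmetic is below; `eng_Latn` and `hin_Deva` come from the Quick Start, while `ben_Beng` is an assumed code following the same NLLB-style convention and is used only for illustration.

```python
from itertools import permutations

NUM_LANGUAGES = 55  # from the model card

# Each ordered (source, target) pair with source != target is one direction.
num_pairs = NUM_LANGUAGES * (NUM_LANGUAGES - 1)
print(num_pairs)  # 2970

# The same count by explicit enumeration on a three-language subset.
# "ben_Beng" is an assumed code, shown only as an example.
codes = ["eng_Latn", "hin_Deva", "ben_Beng"]
pairs = list(permutations(codes, 2))
print(len(pairs))  # 6
```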
Supported Languages
Full Language List with BLEU Scores
| # | Language | Speakers | BLEU | # | Language | Speakers | BLEU |
|---|---|---|---|---|---|---|---|
| 1 | Hindi | 322.2M | 42 | 28 | Pahari | 3.25M | 20 |
| 2 | Bengali | 96.2M | 41 | 29 | Bhili | 3.21M | 23 |
| 3 | Marathi | 82.8M | 40 | 30 | Harauti | 2.94M | 23 |
| 4 | Telugu | 80.9M | 41 | 31 | Nepali | 2.93M | 36 |
| 5 | Tamil | 68.9M | 42 | 32 | Bagheli | 2.68M | 34 |
| 6 | Gujarati | 55.0M | 40 | 33 | Sambalpuri | 2.63M | 23 |
| 7 | Urdu | 50.7M | 41 | 34 | Dogri | 2.60M | 3 |
| 8 | Bhojpuri | 50.6M | 36 | 35 | Garhwali | 2.48M | 35 |
| 9 | Kannada | 43.5M | 40 | 36 | Nimadi | 2.31M | 26 |
| 10 | Malayalam | 34.8M | 41 | 37 | Konkani | 2.15M | 15 |
| 11 | Odia | 34.1M | 39 | 38 | Kumauni | 2.08M | 34 |
| 12 | Punjabi | 31.1M | 40 | 39 | Kurukh | 1.98M | 3 |
| 13 | Rajasthani | 25.8M | 36 | 40 | Tulu | 1.84M | 3 |
| 14 | Chhattisgarhi | 16.3M | 32 | 41 | Manipuri (Meitei) | 1.76M | 3 |
| 15 | Assamese | 14.8M | 38 | 42 | Surgujia | 1.74M | 28 |
| 16 | Maithili | 13.4M | 37 | 43 | Sindhi | 1.68M | 35 |
| 17 | Magahi | 12.7M | 35 | 44 | Bagri | 1.66M | 12 |
| 18 | Haryanvi | 9.81M | 23 | 45 | Ahirani | 1.64M | 34 |
| 19 | Khortha | 8.04M | 34 | 46 | Banjari | 1.58M | 34 |
| 20 | Marwari | 7.83M | 36 | 47 | Brajbhasha | 1.56M | 35 |
| 21 | Santali | 6.97M | 3 | 48 | Bodo | 1.46M | 3 |
| 22 | Kashmiri | 6.55M | 35 | 49 | Kangri | 1.12M | 3 |
| 23 | Bundeli | 5.63M | 35 | 50 | Garo | 1.13M | 3 |
| 24 | Mewari | 4.21M | 28 | 51 | Kachchhi | 1.03M | 5 |
| 25 | Awadhi | 3.85M | 36 | 52 | Mahasu Pahari | 1.00M | 3 |
| 26 | Wagdi | 3.39M | 35 | 53 | Sanskrit | - | 34 |
| 27 | Lambadi | 3.28M | 28 | 54 | Kodava | - | 3 |
| 55 | Indian English | 250M | 43 | | | | |
Performance Tiers
| Tier | BLEU Range | Count | Languages |
|---|---|---|---|
| Strong | 35-43 | 26 | Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Urdu, Bhojpuri, Kannada, Malayalam, Odia, Punjabi, Rajasthani, Assamese, Maithili, Magahi, Marwari, Kashmiri, Bundeli, Awadhi, Wagdi, Nepali, Sindhi, Garhwali, Brajbhasha, Indian English |
| Good | 32-34 | 7 | Chhattisgarhi, Khortha, Bagheli, Kumauni, Ahirani, Banjari, Sanskrit |
| Adequate | 20-28 | 9 | Haryanvi, Mewari, Lambadi, Bhili, Harauti, Pahari, Sambalpuri, Nimadi, Surgujia |
| Partial | 5-15 | 3 | Konkani, Bagri, Kachchhi |
| Experimental | 2-4 | 10 | Dogri, Kurukh, Tulu, Manipuri, Santali, Kangri, Mahasu Pahari, Kodava, Bodo, Garo |
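To check which tier a given score falls into, the ranges in the table can be encoded directly. The sketch below is illustrative only; the name `performance_tier` is hypothetical, and scores that fall between the listed ranges (for example 29-31) return `None` because the table defines no tier for them.

```python
# Tier boundaries copied from the Performance Tiers table (inclusive ranges).
TIERS = [
    ("Strong", 35, 43),
    ("Good", 32, 34),
    ("Adequate", 20, 28),
    ("Partial", 5, 15),
    ("Experimental", 2, 4),
]

def performance_tier(bleu):
    """Return the tier name for a BLEU score, or None if it falls in a gap."""
    for name, low, high in TIERS:
        if low <= bleu <= high:
            return name
    return None

print(performance_tier(42))  # Strong (Hindi)
print(performance_tier(15))  # Partial (Konkani)
print(performance_tier(30))  # None: the table defines no tier for 29-31
```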
Language Families
| Indo-Aryan | Dravidian | Austroasiatic | Sino-Tibetan | Indo-European |
|---|---|---|---|---|
| 43 languages | 7 languages | 1 language | 3 languages | 1 language |
Model Architecture
| Field | Value |
|---|---|
| Architecture | Encoder-Decoder (M2M-style) |
| Parameters | ~1.3B (dense) |
| Encoder Layers | 24 |
| Decoder Layers | 24 |
| Model Dimension | 1024 |
| Attention Heads | 16 |
| FFN Dimension | 8192 |
| Activation | ReLU |
| Vocab Size | 256,206 |
| Tokenizer | SentencePiece BPE |
| Max Input Length | 512 tokens |
| Max Positions | 1024 |
| Dropout | 0.1 |
| Languages | 55 Indian |
| Translation Pairs | 2,970 |
| Scripts Supported | 15+ |
| CT2 Format | CTranslate2 (model.bin) |
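As a rough sanity check, the ~1.3B parameter figure can be approximated from the dimensions in the table. The estimate below is a back-of-the-envelope sketch: it ignores biases and layer norms and assumes a single embedding matrix shared by the encoder, decoder, and output projection, which the model card does not state.

```python
d_model = 1024
ffn_dim = 8192
vocab_size = 256_206
encoder_layers = 24
decoder_layers = 24

attention = 4 * d_model * d_model        # Q, K, V and output projections
feed_forward = 2 * d_model * ffn_dim     # two linear layers

encoder_layer = attention + feed_forward
decoder_layer = 2 * attention + feed_forward  # self-attention plus cross-attention

# Assumption: one embedding matrix shared across encoder, decoder, and output.
embeddings = vocab_size * d_model

total = (encoder_layers * encoder_layer
         + decoder_layers * decoder_layer
         + embeddings)
print(f"{total / 1e9:.2f}B")  # 1.37B
```

Under these assumptions the weights come to roughly 1.37 billion, in line with the ~1.3B listed above.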
Evaluation
- Weighted Average BLEU (by speaker count): 38.5
- BLEU scores are tentative and derive from human evaluation: 3 independent evaluations per language on a 1-5 adequacy scale
- Covers 1.17 billion+ native speakers across 5 language families
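A speaker-weighted average gives each language's score a weight proportional to its speaker count. The sketch below shows the calculation on the first three rows of the language table only, so its result is not the reported 38.5; how the authors treated languages without a listed speaker count (Sanskrit, Kodava) is not stated, and the function name is hypothetical.

```python
def weighted_average_bleu(rows):
    """rows: list of (speakers_in_millions, bleu) tuples."""
    total_speakers = sum(speakers for speakers, _ in rows)
    return sum(speakers * bleu for speakers, bleu in rows) / total_speakers

# Hindi, Bengali, Marathi from the language table.
sample = [(322.2, 42), (96.2, 41), (82.8, 40)]
print(round(weighted_average_bleu(sample), 2))  # 41.48
```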
Use Cases
- Government: citizen services in every mother tongue; sovereign deployment with data staying in India; healthcare, education, and judiciary outreach
- Developers and Startups: zero API cost for open-weight models; voice-first apps for any language; fine-tuning for domain-specific use cases; 2,970 translation pairs out of the box
- Researchers and Academia: full model weights for research; benchmarking against the global state of the art; extension to more Indian languages; advancing Indian NLP and speech science
Limitations
- 10 languages are at experimental performance levels (BLEU 2-4): Dogri, Kurukh, Tulu, Manipuri, Santali, Kangri, Mahasu Pahari, Kodava, Bodo, Garo
- 3 languages have partial coverage (BLEU 5-15): Konkani, Bagri, Kachchhi
- Maximum input length is 512 tokens
- BLEU scores are tentative and based on human evaluation rather than standardized test sets
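One way to work within the 512-token input limit is to split long text into sentences and pack them greedily into chunks that stay under the limit, then translate each chunk separately. The helper below is a sketch under assumptions: `pack_sentences` and `count_tokens` are hypothetical names, sentence splitting is left to the caller, and summing per-sentence counts only approximates the token count of the joined chunk.

```python
def pack_sentences(sentences, count_tokens, max_tokens=512):
    """Greedily group sentences into chunks of at most max_tokens tokens.

    count_tokens is any callable returning the token count of a string.
    """
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if n > max_tokens:
            raise ValueError(f"Sentence exceeds {max_tokens} tokens: {sentence[:40]!r}")
        # Start a new chunk when adding this sentence would cross the limit.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Toy usage with whitespace "tokens" and a small limit.
demo = pack_sentences(["a b c", "d e", "f g h", "i"],
                      count_tokens=lambda s: len(s.split()),
                      max_tokens=5)
print(demo)  # ['a b c d e', 'f g h i']
```

With the Quick Start tokenizer, `count_tokens=lambda s: len(tokenizer(s)["input_ids"])` would count real tokens. Each per-sentence count then includes the special tokens the tokenizer adds, so the packing errs on the conservative side.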
Citation
@misc{vaktranslate2026,
title={Vak Translate: Open-Weight Translation for Every Indian Language},
author={Sourav Bandyopadhyay and Ayush Kumar Bar and Aditya Singh Rathore and Grusha G S and Moksh K},
year={2026},
publisher={Shunya Labs},
url={https://huggingface.co/shunyalabs/vak-translate-1.3b}
}
Contact
Shunya Labs
- Email: 0@shunyalabs.ai
- Web: shunyalabs.ai
Built in India. Open for everyone.