Vāķ Translate 1.3B — CTranslate2

CTranslate2-converted version of vak-translate-1.3b, an open-weight translation model covering 55 Indian languages and 2,970 language pairs.

This is the CTranslate2-optimized version of the original model, converted from the original Transformers checkpoint. It enables faster CPU and GPU inference with reduced memory usage.
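
For reference, this kind of conversion is typically produced with CTranslate2's Transformers converter (the ct2-transformers-converter CLI or its Python API). A minimal sketch, assuming the original checkpoint is shunyalabs/vak-translate-1.3b and that int8 quantization is wanted; the files published in this repo may have been converted with different options:

from ctranslate2.converters import TransformersConverter

# Hypothetical re-conversion of the original checkpoint (not needed to use this repo).
converter = TransformersConverter("shunyalabs/vak-translate-1.3b")
converter.convert("vak-translate-1.3b-ct2", quantization="int8")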

Built by Shunya Labs. Part of the Vāķ suite launched at the India AI Impact Summit 2026.

CTranslate2 Quick Start

pip install ctranslate2 transformers sentencepiece huggingface_hub

import ctranslate2
from transformers import NllbTokenizer
from huggingface_hub import snapshot_download

# Download model to local cache and load
model_dir = snapshot_download("shunyalabs/vak-translate-1.3b-ct2")

tokenizer = NllbTokenizer.from_pretrained("shunyalabs/vak-translate-1.3b-ct2")

device = "cuda" if ctranslate2.get_cuda_device_count() > 0 else "cpu"
translator = ctranslate2.Translator(model_dir, device=device)

# Translate English to Hindi
src_lang = "eng_Latn"
tgt_lang = "hin_Deva"

tokenizer.src_lang = src_lang
inputs = tokenizer("Hello, how are you?")
src_tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"])

results = translator.translate_batch(
    [src_tokens],
    target_prefix=[[tgt_lang]],
    beam_size=4,
    max_decoding_length=256,
)

output_tokens = results[0].hypotheses[0]
output_ids = tokenizer.convert_tokens_to_ids(output_tokens)
translation = tokenizer.decode(output_ids, skip_special_tokens=True)
print(translation)
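
On CPU-only machines, memory use and latency can often be reduced further with load-time options. A minimal sketch, assuming int8 weights are acceptable for your quality requirements; compute_type, inter_threads, and intra_threads are standard ctranslate2.Translator arguments:

# Load with int8 weight conversion and explicit threading (illustrative values).
translator_cpu = ctranslate2.Translator(
    model_dir,
    device="cpu",
    compute_type="int8",  # convert weights to int8 at load time to reduce memory usage
    inter_threads=2,      # parallel translation workers
    intra_threads=4,      # threads per worker
)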

Batch Translation

# Translate a batch of sentences (English → Hindi)
texts = [
    "The sun rises in the east.",
    "Water is essential for life.",
    "Education is the most powerful weapon.",
]

tokenizer.src_lang = "eng_Latn"
all_src_tokens = [
    tokenizer.convert_ids_to_tokens(tokenizer(t)["input_ids"])
    for t in texts
]

results = translator.translate_batch(
    all_src_tokens,
    target_prefix=[["hin_Deva"]] * len(texts),
    beam_size=4,
    max_decoding_length=256,
)

for orig, result in zip(texts, results):
    ids = tokenizer.convert_tokens_to_ids(result.hypotheses[0])
    print(tokenizer.decode(ids, skip_special_tokens=True))
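
The same pipeline works for any of the 2,970 directions, not only with English as the source. A minimal sketch for Hindi to Tamil, assuming the model follows the FLORES-200 language-code convention used above (hin_Deva as source, tam_Taml as target):

# Hindi → Tamil (illustrative pair; any supported source/target codes work the same way)
tokenizer.src_lang = "hin_Deva"
src_tokens = tokenizer.convert_ids_to_tokens(tokenizer("सूरज पूर्व में उगता है।")["input_ids"])

results = translator.translate_batch(
    [src_tokens],
    target_prefix=[["tam_Taml"]],
    beam_size=4,
    max_decoding_length=256,
)

ids = tokenizer.convert_tokens_to_ids(results[0].hypotheses[0])
print(tokenizer.decode(ids, skip_special_tokens=True))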

Highlights

  • 55 Indian languages across 5 language families (Indo-Aryan, Dravidian, Austroasiatic, Sino-Tibetan, Indo-European)
  • 2,970 translation pairs - any-to-any translation between all supported languages
  • 1.3B parameters - encoder-decoder architecture with 24+24 layers
  • Open weights under CC-BY-SA-4.0
  • Weighted average BLEU: 38.5 (by speaker count)
  • Covers 1.17 billion+ native speakers across every region of India
  • First open-weight translation model for many Indian languages including Bhojpuri, Rajasthani, Chhattisgarhi, and Magahi
  • CTranslate2 format for optimized CPU/GPU inference

Supported Languages

Full Language List with BLEU Scores

# | Language | Speakers | BLEU
1 | Hindi | 322.2M | 42
2 | Bengali | 96.2M | 41
3 | Marathi | 82.8M | 40
4 | Telugu | 80.9M | 41
5 | Tamil | 68.9M | 42
6 | Gujarati | 55.0M | 40
7 | Urdu | 50.7M | 41
8 | Bhojpuri | 50.6M | 36
9 | Kannada | 43.5M | 40
10 | Malayalam | 34.8M | 41
11 | Odia | 34.1M | 39
12 | Punjabi | 31.1M | 40
13 | Rajasthani | 25.8M | 36
14 | Chhattisgarhi | 16.3M | 32
15 | Assamese | 14.8M | 38
16 | Maithili | 13.4M | 37
17 | Magahi | 12.7M | 35
18 | Haryanvi | 9.81M | 23
19 | Khortha | 8.04M | 34
20 | Marwari | 7.83M | 36
21 | Santali | 6.97M | 3
22 | Kashmiri | 6.55M | 35
23 | Bundeli | 5.63M | 35
24 | Mewari | 4.21M | 28
25 | Awadhi | 3.85M | 36
26 | Wagdi | 3.39M | 35
27 | Lambadi | 3.28M | 28
28 | Pahari | 3.25M | 20
29 | Bhili | 3.21M | 23
30 | Harauti | 2.94M | 23
31 | Nepali | 2.93M | 36
32 | Bagheli | 2.68M | 34
33 | Sambalpuri | 2.63M | 23
34 | Dogri | 2.60M | 3
35 | Garhwali | 2.48M | 35
36 | Nimadi | 2.31M | 26
37 | Konkani | 2.15M | 15
38 | Kumauni | 2.08M | 34
39 | Kurukh | 1.98M | 3
40 | Tulu | 1.84M | 3
41 | Manipuri (Meitei) | 1.76M | 3
42 | Surgujia | 1.74M | 28
43 | Sindhi | 1.68M | 35
44 | Bagri | 1.66M | 12
45 | Ahirani | 1.64M | 34
46 | Banjari | 1.58M | 34
47 | Brajbhasha | 1.56M | 35
48 | Bodo | 1.46M | 3
49 | Kangri | 1.12M | 3
50 | Garo | 1.13M | 3
51 | Kachchhi | 1.03M | 5
52 | Mahasu Pahari | 1.00M | 3
53 | Sanskrit | - | 34
54 | Kodava | - | 3
55 | Indian English | 250M | 43

Performance Tiers

Tier | BLEU Range | Count | Languages
Strong | 35-43 | 26 | Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Urdu, Bhojpuri, Kannada, Malayalam, Odia, Punjabi, Rajasthani, Assamese, Maithili, Magahi, Marwari, Kashmiri, Bundeli, Awadhi, Wagdi, Nepali, Sindhi, Garhwali, Brajbhasha, Indian English
Good | 32-34 | 7 | Chhattisgarhi, Khortha, Bagheli, Kumauni, Ahirani, Banjari, Sanskrit
Adequate | 20-28 | 9 | Haryanvi, Mewari, Lambadi, Bhili, Harauti, Pahari, Sambalpuri, Nimadi, Surgujia
Partial | 5-15 | 3 | Konkani, Bagri, Kachchhi
Experimental | 2-4 | 10 | Dogri, Kurukh, Tulu, Manipuri, Santali, Kangri, Mahasu Pahari, Kodava, Bodo, Garo

Language Families

Family | Languages
Indo-Aryan | 43
Dravidian | 7
Austroasiatic | 1
Sino-Tibetan | 3
Indo-European | 1

Model Architecture

Field | Value
Architecture | Encoder-Decoder (M2M-style)
Parameters | ~1.3B (dense)
Encoder Layers | 24
Decoder Layers | 24
Model Dimension | 1024
Attention Heads | 16
FFN Dimension | 8192
Activation | ReLU
Vocab Size | 256,206
Tokenizer | SentencePiece BPE
Max Input Length | 512 tokens
Max Positions | 1024
Dropout | 0.1
Languages | 55 Indian
Translation Pairs | 2,970
Scripts Supported | 15+
CT2 Format | CTranslate2 (model.bin)
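
Inputs longer than the 512-token limit listed above need to be truncated or split before translation; sentence-level splitting usually preserves more content, but the tokenizer's built-in truncation is the simplest guard. A minimal sketch:

# Cap overly long inputs at the model's 512-token limit before converting to tokens.
long_text = " ".join(["Water is essential for life."] * 300)
ids = tokenizer(long_text, truncation=True, max_length=512)["input_ids"]
src_tokens = tokenizer.convert_ids_to_tokens(ids)
print(len(src_tokens))  # <= 512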

Evaluation

  • Weighted Average BLEU (by speaker count): 38.5
  • BLEU scores are tentative, based on human evaluation (3 independent evaluations per language, 1-5 adequacy scale)
  • Covers 1.17 billion+ native speakers across 5 language families
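
For reference, the speaker-weighted average is simply the per-language BLEU scores weighted by speaker counts. A minimal sketch of the computation using a few illustrative rows from the table above (not the full 55-language calculation behind the 38.5 figure):

# Speaker-weighted average BLEU over an illustrative subset of languages.
rows = [
    ("Hindi", 322.2, 42),    # (language, speakers in millions, BLEU)
    ("Bengali", 96.2, 41),
    ("Tamil", 68.9, 42),
    ("Santali", 6.97, 3),
]
weighted_bleu = sum(speakers * bleu for _, speakers, bleu in rows) / sum(s for _, s, _ in rows)
print(round(weighted_bleu, 1))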

Use Cases

Government - Citizen services in every mother tongue | Sovereign deployment, data stays in India | Healthcare, education, judiciary outreach

Developers and Startups - Zero API cost for open-weight models | Build voice-first apps for any language | Fine-tune for domain-specific use cases | 2,970 translation pairs out of the box

Researchers and Academia - Full model weights for research | Benchmark against global state of art | Extend to more Indian languages | Advance Indian NLP and speech science

Limitations

  • 10 languages are at experimental performance levels (BLEU 2-4): Dogri, Kurukh, Tulu, Manipuri, Santali, Kangri, Mahasu Pahari, Kodava, Bodo, Garo
  • 3 languages have partial coverage (BLEU 5-15): Konkani, Bagri, Kachchhi
  • Maximum input length is 512 tokens
  • BLEU scores are tentative and based on human evaluation rather than standardized test sets

Citation

@misc{vaktranslate2026,
  title={Vak Translate: Open-Weight Translation for Every Indian Language},
  author={Sourav Bandyopadhyay and Ayush Kumar Bar and Aditya Singh Rathore and Grusha G S and Moksh K},
  year={2026},
  publisher={Shunya Labs},
  url={https://huggingface.co/shunyalabs/vak-translate-1.3b}
}

Contact

Shunya Labs


Built in India. Open for everyone.
