kaa-gpt: A Proof of Concept for Karakalpak AI
kaa-gpt is an 80M-parameter generative language model for Karakalpak, built to demonstrate that high-quality AI results are achievable for low-resource languages using dedicated, manually curated data. It is a proof-of-concept (PoC) model, paving the way for future 1B+ parameter versions.
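If the checkpoint is published on the Hugging Face Hub as a standard causal language model, inference should follow the usual Transformers workflow. The snippet below is a minimal sketch, not an official usage guide: the repo id `your-username/kaa-gpt`, the prompt, and the generation settings are placeholders you would replace with the actual published values.

```python
# Minimal sketch: loading kaa-gpt with Hugging Face Transformers.
# The repo id below is hypothetical; swap in the real Hub id once the
# model is published.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/kaa-gpt"  # placeholder, not the actual repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Generate a short continuation from a Karakalpak prompt.
prompt = "Qaraqalpaqstan"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```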
Project Vision
Most global LLMs overlook the Karakalpak language. This project aims to:
- Preserve: Digitally safeguard the Karakalpak linguistic heritage.
- Empower: Provide a foundation for Karakalpak-native AI tools.
- Scale: Prove that models trained on small, high-quality datasets can outperform generic multilingual models for specific languages.
Data Source
The training data was manually collected and curated from publicly available sources (literature, news, and official records). It has been cleaned and formatted specifically for this task to ensure high linguistic fidelity.
Future Roadmap
- Phase 1: 80M Parameter PoC (Current)
- Phase 2: Expanded Data Collection
- Phase 3: 1B Parameter General-Purpose Model
Support My Research
This project is a solo effort to digitize the Karakalpak language, built on a $250 laptop funded by manual labor. Your support helps me cover the costs of:
- LLM APIs: Claude 3.5 and Gemini 1.5, used for high-quality data cleaning.
- Compute: Renting GPUs for training the next version of Karakalpak models.
Support via Binance:
- Binance Pay ID: 1207254817
Even a small contribution helps keep this project open-source and free for everyone.