kaa-gpt: A Proof of Concept for Karakalpak AI

kaa-gpt is an 80M-parameter generative language model built to demonstrate that high-quality AI results are achievable for low-resource languages when the training data is dedicated and manually curated. It is a proof-of-concept (PoC) model, paving the way for future 1B+ parameter versions.
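
A model of this size can typically be loaded with the Hugging Face transformers library. The snippet below is a minimal sketch, not confirmed usage documentation: the repository id `your-username/kaa-gpt`, the prompt, and the generation parameters are all placeholders to adjust for the actual release.

```python
# Minimal text-generation sketch using Hugging Face transformers.
# NOTE: "your-username/kaa-gpt" is a placeholder repo id, not the
# model's confirmed location; generation settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/kaa-gpt"  # placeholder, replace with the real repo
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "Qaraqalpaqstan"  # example Karakalpak prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```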

Project Vision

Most global LLMs overlook the Karakalpak language. This project aims to:

  1. Preserve: Digitally safeguard the Karakalpak linguistic heritage.
  2. Empower: Provide a foundation for Karakalpak-native AI tools.
  3. Scale: Prove that a small model trained on high-quality, language-specific data can outperform generic multilingual models for that language.

Data Source

The training data was manually collected and curated from publicly available sources (literature, news, and official records). It has been cleaned and formatted specifically for this task to ensure high linguistic fidelity.
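
To illustrate the kind of cleaning such a corpus requires, here is a hedged sketch, not the project's actual pipeline: it normalizes Unicode and whitespace, drops very short fragments, and removes exact duplicates. The file names are hypothetical.

```python
# Illustrative corpus-cleaning sketch (not the project's actual pipeline).
# Normalizes Unicode and whitespace, drops very short fragments, and
# removes exact duplicate lines before writing the cleaned corpus.
import unicodedata

def clean_corpus(lines, min_chars=20):
    seen = set()
    for line in lines:
        text = unicodedata.normalize("NFC", line).strip()
        text = " ".join(text.split())  # collapse internal whitespace
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        yield text

# Hypothetical input/output paths for demonstration.
with open("raw_corpus.txt", encoding="utf-8") as src, \
     open("clean_corpus.txt", "w", encoding="utf-8") as dst:
    for sentence in clean_corpus(src):
        dst.write(sentence + "\n")
```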

Future Roadmap

  • Phase 1: 80M Parameter PoC (Current)
  • Phase 2: Expanded Data Collection
  • Phase 3: 1B Parameter General-Purpose Model

🌟 Support My Research

This project is a solo effort to digitize the Karakalpak language, built on a $250 laptop funded by manual labor. Your support helps me cover the costs of:

  • LLM APIs (Claude 3.5 and Gemini 1.5): used for high-quality data cleaning.
  • Compute: Renting GPUs for training the next version of Karakalpak models.

Support via Binance:

  • Binance Pay ID: 1207254817

Even a small contribution helps keep this project open-source and free for everyone.

Model Details

  • Format: Safetensors
  • Model size: 82.9M params
  • Tensor type: F32