Jumplander-Mini-LM-v1

Persian Text-Based Language Model for Programming Education

Overview: Jumplander-Mini-LM-v1 is a Persian text-based language model designed to assist with programming education. It generates code examples, explains algorithms, produces exercises, and can power Persian-language educational assistants and wikis. It is the first release in the Jumplander-Mini-LM research series and a foundational step toward a family of Persian-focused coding LMs.

Model hub & dataset (primary):

Model (preferred id): https://huggingface.co/jumplander/jumplander-mini-lm-v1
Alternate model id, if the published checkpoint uses a different one (e.g. jumplander/jumplander-coder-32b): https://huggingface.co/jumplander/jumplander-coder-32b
Training dataset (primary): https://huggingface.co/datasets/jumplander/JumpLander-Persian-Forum-mini-Dataset

Key features

Persian-first: optimized for producing pedagogical content in Persian (lessons, Q&A, explanations).
Code-aware output: preserves code block formatting and aims to produce runnable examples (Python, JavaScript, etc.).
GPT-2 backbone: a decoder-only Transformer adapted and fine-tuned for Persian text and code.
Research-oriented: released as part of JumpLander's research efforts — intended to be extended with future versions and focused fine-tunes.
Flexible & extensible: architecture and tokenization designed to accept further domain data and produce specialized variants (e.g., language-specific fine-tunes).

Intended uses (Recommended)

Authoring Persian-language programming tutorials, lesson notes and examples.
Interactive educational assistants for Persian learners.
Generating exercises and model answers for programming courses and practice problems.
Support for documentation generation and code-comment translation to Persian.

Not recommended uses

Making decisions in safety-critical systems (medical, legal, finance) without human oversight.
Directly executing unreviewed code in production environments.
Extracting personal or private information (the model was trained on cleaned data; avoid PII-extraction use cases).

Training data & preprocessing

Primary dataset: JumpLander-Persian-Forum-mini-Dataset (HF). See dataset page for details. The training set was curated to contain Persian programming discussions, Q&A, code snippets and educational posts. Preprocessing focused on:

Removing or masking personal or sensitive data.
Preserving code blocks and language-specific formatting.
Splitting conversational/forum posts into pedagogical templates: explanation → example → exercise → solution.
Tokenizer training on mixed Persian + code subwords (BPE/SentencePiece).

For full provenance and dataset licenses, inspect the HF dataset page linked above.
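The first two preprocessing steps (masking personal data while leaving code blocks untouched) can be sketched as follows. This is a minimal illustration, not the actual JumpLander pipeline: the regular expressions, the `<EMAIL>` and `<PHONE>` placeholders, the function name `mask_pii`, and the assumption that code is delimited by Markdown-style triple-backtick fences are all hypothetical.

```python
import re

FENCE = "`" * 3  # Markdown-style code fence (assumed delimiter for code blocks)

# Hypothetical patterns; the real pipeline's rules are not documented in this card.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"(?<!\d)(?:\+98|0)9\d{9}(?!\d)")  # Iranian mobile numbers
FENCE_RE = re.compile(f"({FENCE}.*?{FENCE})", re.DOTALL)


def mask_pii(text: str) -> str:
    """Mask e-mail addresses and phone numbers in prose, leaving fenced code as-is."""
    # Splitting on a capturing group alternates prose (even indices)
    # with the fenced code blocks themselves (odd indices).
    parts = FENCE_RE.split(text)
    for i in range(0, len(parts), 2):
        parts[i] = EMAIL_RE.sub("<EMAIL>", parts[i])
        parts[i] = PHONE_RE.sub("<PHONE>", parts[i])
    return "".join(parts)
```

Masking only the prose segments keeps identifiers and string literals inside code examples intact, which matters when the goal is to preserve runnable snippets.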

Technical specifications

Architecture: GPT-2 (decoder-only Transformer), adapted for Persian.
Parameter scale (example): ~1.4B. The checkpoint metadata on the model page reports 0.1B params, so update this with the exact number from the published checkpoint.
Tokenizer: BPE / SentencePiece trained on Persian + code corpus.
Max input length: 2048 tokens.
Primary language: Persian (with limited English in code/comments).
Recommended inference hardware: GPU 16–48GB VRAM, or use Hugging Face Inference API / hosted endpoints.
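As a rough sizing aid for the hardware recommendation, the memory needed just to hold the weights is the parameter count times the bytes per parameter. The sketch below is a rule of thumb only: the helper name `weight_memory_gib` is hypothetical, and activations, the KV cache, and framework overhead come on top of this figure.

```python
# Bytes used per parameter for common tensor types.
BYTES_PER_DTYPE = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str = "float16") -> float:
    """Approximate memory (GiB) needed to store the model weights alone."""
    return num_params * BYTES_PER_DTYPE[dtype] / 1024**3


if __name__ == "__main__":
    # The card's example scale (~1.4B) in half precision.
    print(f"{weight_memory_gib(1.4e9, 'float16'):.2f} GiB")
```

For the card's example scale of ~1.4B parameters in float16 this gives roughly 2.6 GiB of weights, which fits well within the recommended 16–48GB range.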

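Lesson generation template (Persian)

درس: لیست‌ها در پایتون
هدف: آشنایی با عملیات پایه روی لیست‌ها
خروجی: توضیح کوتاه، مثال کد، تمرین کوتاه، پاسخ تشریحی

English gloss: Lesson: lists in Python. Goal: become familiar with basic list operations. Output: short explanation, code example, short exercise, worked answer.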
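A template like this can be filled programmatically before being passed to the model. The helper below is only a sketch: the function name `build_lesson_prompt` and its parameters are hypothetical, and the three field labels are taken from the template above.

```python
def build_lesson_prompt(topic: str, goal: str, outputs: list[str]) -> str:
    """Fill the lesson template: درس (lesson), هدف (goal), خروجی (expected output)."""
    return (
        f"درس: {topic}\n"
        f"هدف: {goal}\n"
        f"خروجی: {'، '.join(outputs)}"  # items joined with the Persian comma
    )


if __name__ == "__main__":
    print(build_lesson_prompt(
        "لیست‌ها در پایتون",
        "آشنایی با عملیات پایه روی لیست‌ها",
        ["توضیح کوتاه", "مثال کد", "تمرین کوتاه", "پاسخ تشریحی"],
    ))
```

Keeping the field labels fixed and varying only the values makes prompts consistent across lessons, which should also make outputs easier to compare during evaluation.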

Quick start — Transformers (Python)

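from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "jumplander/jumplander-mini-lm-v1"  # update if your published id differs
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
# float16 is meant for GPU inference; fall back to float32 on CPU
dtype = torch.float16 if torch.cuda.is_available() else torch.float32
# device_map="auto" requires the accelerate package
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=dtype)

# Prompt (Persian): "Write a Python function that sorts a list and explain it step by step."
prompt = "یک تابع پایتون برای مرتب‌سازی لیست بنویس و آن را مرحله‌به‌مرحله توضیح بده.\n\n# پاسخ:"
# Send inputs to the device the model was placed on instead of hardcoding "cuda"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# GPT-2-style tokenizers often define no pad token; fall back to EOS in that case
pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id

gen = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,  # temperature and top_p only take effect when sampling is enabled
    temperature=0.2,
    top_p=0.95,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=pad_id,
)

print(tokenizer.decode(gen[0], skip_special_tokens=True))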

HF Inference API example (simplest):

import requests

API_TOKEN = "hf_your_api_token_here"  # replace with your own Hugging Face token
headers = {"Authorization": f"Bearer {API_TOKEN}"}
api_url = "https://api-inference.huggingface.co/models/jumplander/jumplander-mini-lm-v1"
# Prompt (Persian): "Write a short example of a factorial function in Python."
data = {"inputs": "یک مثال کوتاه از تابع فاکتوریل در پایتون بنویس."}
resp = requests.post(api_url, headers=headers, json=data, timeout=120)
resp.raise_for_status()  # fail loudly on HTTP errors, e.g. if the model is not deployed
print(resp.json())

Evaluation & recommended metrics

Automated metrics

Perplexity on held-out Persian programming data.
pass@k / exact-match for code generation using curated test problems.
BLEU/ROUGE for translation and summarization subtasks (if applicable).
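Two of the automated metrics above reduce to short formulas. The sketch below shows perplexity as the exponential of the mean per-token negative log-likelihood, and the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n generated samples of which c pass the tests. The card does not say which pass@k estimator is intended, so treating it as this one is an assumption, and the function names are illustrative.

```python
import math


def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token), with NLLs in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples per problem, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(n=10, c=3, k=1))
    print(perplexity([math.log(4)] * 3))
```

In practice `pass_at_k` would be computed per problem and averaged over the curated test set, while the per-token losses for `perplexity` would come from the model's cross-entropy on held-out text.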

Human evaluation

Runnable correctness of generated code.
Clarity and pedagogical usefulness of explanations in Persian.
Cultural and linguistic appropriateness for Iranian learners.

Publish evaluation results and sample prompts on the model page to increase transparency.

Limitations & safety considerations

Model may hallucinate plausible-looking but incorrect code or explanations; always verify.
Avoid running unreviewed code from model outputs.
Dataset biases may transfer into model outputs; perform bias analysis where appropriate.
Not suitable for legal/medical/financial decisions without human oversight.
Sensitive data handling: the training pipeline removed PII where possible, but PII-extraction tasks should still be avoided.
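One low-risk first check on generated code is to parse it without executing it. The sketch below assumes the model emits Markdown-style fenced blocks and that Python blocks are tagged `python`, `py`, or left untagged; the function names are hypothetical. A successful parse only shows that the syntax is valid; it says nothing about whether the code is correct or safe, so human review is still needed.

```python
import ast

FENCE = "`" * 3  # Markdown-style code fence (assumed output format)


def extract_python_blocks(model_output: str) -> list[str]:
    """Return the bodies of fenced blocks tagged python/py or left untagged."""
    blocks = []
    # After splitting on the fence, odd-indexed chunks lie between an
    # opening and a closing fence; their first line is the language tag.
    for chunk in model_output.split(FENCE)[1::2]:
        tag, _, body = chunk.partition("\n")
        if tag.strip().lower() in ("", "python", "py"):
            blocks.append(body)
    return blocks


def syntax_ok(code: str) -> bool:
    """Check that the code parses as Python; nothing is executed."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False
```

A reviewer could run `syntax_ok` over every block from `extract_python_blocks` and route failures back for regeneration before spending time on a manual read.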

Roadmap & future work

Fine-tuned variants per language or framework (e.g., Jumplander-Mini-LM-v2-Python).
Expand dataset with curated tutorials and unit-test-backed examples.
Quantized/distilled releases for edge deployment (mobile/embedded).
Prompt templates and LMS plugins for educational platforms.
Community contributions: issues, discussions and PRs on GitHub & HF Discussions.

License & citation

@misc{jumplander2026minilm,
  title = {Jumplander-Mini-LM-v1},
  author = {JumpLander Research Team},
  year = {2026},
  howpublished = {Hugging Face Model Hub},
  url = {https://huggingface.co/jumplander/jumplander-mini-lm-v1}
}

Suggested code/tokenizer license: Apache-2.0.
Suggested model-weights license: CC BY-NC 4.0 (or Apache-2.0 if fully open-source is intended).
Add explicit license statements on both the model and dataset HF pages.

Contact & contribution

Website: https://jumplander.org
GitHub: https://github.com/jumplander
Discussions & Issues: Use the Discussions tab on the Hugging Face model page or GitHub issues.
Email: [email protected]

Persian version (translated): Jumplander-Mini-LM-v1

Persian text-based language model for programming education

Overview: Jumplander-Mini-LM-v1 is a Persian text-based language model designed to help with programming education: generating code examples, explaining algorithms, producing exercises, and supporting Persian-language educational assistants. It is the first release in the Jumplander-Mini-LM research series and serves as the foundation for a family of Persian-focused programming models.

Model page and dataset (primary):

Model id (suggested): https://huggingface.co/jumplander/jumplander-mini-lm-v1
If a different checkpoint has been published: https://huggingface.co/jumplander/jumplander-coder-32b
Training dataset: https://huggingface.co/datasets/jumplander/JumpLander-Persian-Forum-mini-Dataset

Key features

Persian-focused: optimized for producing educational content in Persian (lessons, explanations and examples).
Code preserved in output: code block formatting is preserved, and the model aims to produce runnable examples.
GPT-2 base architecture: a decoder-only Transformer tuned for Persian text and code generation.
Research-focused: released as part of JumpLander's research activities and intended to be extended with future versions.
Extensible: the tokenizer and model structure are designed so that new domain data can be added and specialized versions released.

Recommended uses

Producing programming lessons and educational content in Persian.
Text-based educational assistants and specialized Persian wikis.
Generating exercises and worked answers for classes and courses.
Helping translate comments and documentation into Persian.

Not recommended uses

Making critical decisions in sensitive systems without human review.
Directly running generated code without review.
Requests to extract personal or sensitive information.

Data and preprocessing

Primary dataset: JumpLander-Persian-Forum-mini-Dataset (HF).
Contents: Persian programming questions and answers, code snippets (Python, JavaScript, etc.), algorithm explanations, and educational content from forums and selected sources.
Preprocessing: preserving code block formatting, separating text from code, and structuring content into an educational format.
See the HF dataset page for further details and licenses.

Technical specifications

Architecture: GPT-2 (decoder-only Transformer)
Parameter scale (example): about 1.4 billion (update with the exact value from the published checkpoint)
Tokenizer: BPE / SentencePiece trained on Persian and code
Max input length: 2048 tokens
Primary language: Persian (limited English support in code blocks)
Recommended hardware: GPU with 16 to 48 GB of VRAM, or the HF Inference service

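Quick start: Transformers (Python)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "jumplander/jumplander-mini-lm-v1"  # edit this if the id changes
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
# float16 is meant for GPU inference; fall back to float32 on CPU
dtype = torch.float16 if torch.cuda.is_available() else torch.float32
# device_map="auto" requires the accelerate package
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=dtype)

# Prompt (Persian): "Write a Python function that sorts a list and explain it step by step."
prompt = "یک تابع پایتون برای مرتب‌سازی لیست بنویس و آن را مرحله‌به‌مرحله توضیح بده.\n\n# پاسخ:"
# Send inputs to the device the model was placed on instead of hardcoding "cuda"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# GPT-2-style tokenizers often define no pad token; fall back to EOS in that case
pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id

gen = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,  # temperature and top_p only take effect when sampling is enabled
    temperature=0.2,
    top_p=0.95,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=pad_id,
)

print(tokenizer.decode(gen[0], skip_special_tokens=True))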

Evaluation and metrics

Suggested automated metrics:

Perplexity on a held-out set
pass@k / exact-match for code generation problems
BLEU/ROUGE for translation/summarization tasks (if needed)

Human evaluation:

Technical correctness of generated code (does it run?)
Quality and clarity of the Persian educational explanations
Cultural and linguistic appropriateness for an Iranian audience

Limitations and safety

The model may produce incorrect or incomplete outputs; human review is required.
Do not run generated code without review.
Data biases may be reproduced; a bias analysis is recommended.
Use in sensitive domains (legal, medical, financial) without human review is not appropriate.

Roadmap and future work

Releasing topic-specific and language-specific versions (e.g. a Python-specialized version)
Adding high-quality datasets and unit tests
Producing lightweight versions for edge deployment
Developing ready-made templates and plugins for LMSs and educational wiki environments

Citation and license

@misc{jumplander2026minilm,
  title = {Jumplander-Mini-LM-v1},
  author = {JumpLander Research Team},
  year = {2026},
  howpublished = {Hugging Face Model Hub},
  url = {https://huggingface.co/jumplander/jumplander-mini-lm-v1}
}

Suggested license for code and tokenizer: Apache-2.0
Suggested license for the weights: CC BY-NC 4.0, or Apache-2.0 if a fully open-source release is intended

Contact and contribution

Website: https://jumplander.org
GitHub: https://github.com/jumplander
Discussion and contribution: use the Discussions tab on the model's Hugging Face page or Issues on GitHub
Email: [email protected]

Closing note

This model was designed to empower the Persian-speaking community in programming education. Please review the outputs, give feedback, and contribute to the development of future versions.

Checkpoint metadata (Hugging Face model page): Safetensors format, model size 0.1B params, tensor type F32. Downloads last month: 14. Not deployed by any Inference Provider.