Qwen3-0.6B-KTO
Model Card for Qwen3-0.6B-KTO
This model is a fine-tuned variant of Qwen/Qwen3-0.6B, trained using KTO (Kahneman-Tversky Optimization) on the nvidia/HelpSteer2 dataset as part of the AIPlans Model Diffing Project.
Model Details
Model Description
This model is a 0.6B parameter language model based on Qwen3-0.6B and fine-tuned using KTO for preference optimization. The goal of the fine-tuning was to improve helpfulness/harmlessness behavior as measured by the HelpSteer2 dataset, while also enabling controlled model diffing experiments as part of the AIPlans research workflow.
Special care was taken to reduce GPU memory usage during KTO training, including gradient checkpointing, selective layer freezing, and reduced forward-pass caching.
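The sketch below illustrates how these memory-saving measures are typically wired up with Hugging Face Transformers and TRL; it is not the exact training script, and the frozen layer range and batch settings are assumptions.

```python
# Illustrative sketch of the memory-saving setup described above (assumed
# layer range and batch sizes; see the repository for the actual script).
import torch
from transformers import AutoModelForCausalLM
from trl import KTOConfig

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B", torch_dtype=torch.bfloat16
)

# Reduced forward-pass caching: the KV cache is not needed during training.
model.config.use_cache = False

# Selective layer freezing: only the upper transformer layers stay trainable
# (the exact range used in training is an assumption here).
trainable_layers = {f"layers.{i}." for i in range(20, 28)}
for name, param in model.named_parameters():
    param.requires_grad = any(layer in name for layer in trainable_layers)

# Gradient checkpointing and small per-device batches via the trainer config.
training_args = KTOConfig(
    output_dir="qwen3-0.6b-kto",
    gradient_checkpointing=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)
```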
Developed by: AIPlans
Funded by: AIPlans
Shared by: AIPlans
Model type: Causal decoder-only Transformer (LLM)
Languages: English
License: MIT (inherits from base model and dataset licensing)
Fine-tuned from: Qwen/Qwen3-0.6B
Training Method: Kahneman-Tversky Optimization (KTO)
Intended Use: Research on model diffing, preference fine-tuning, evaluation of lightweight LLM behavior changes
Model Sources
- Repository: https://github.com/AI-Plans/Model-Diffing/tree/main/KTO-Trainer
- KTO Paper: https://arxiv.org/abs/2402.01306
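How to Get Started with the Model
A minimal inference sketch using Hugging Face Transformers; the repository id below is a placeholder and should be replaced with the actual path of this checkpoint.

```python
# Minimal inference sketch; "AIPlans/Qwen3-0.6B-KTO" is a placeholder repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIPlans/Qwen3-0.6B-KTO"  # replace with the actual repository path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "Explain KTO fine-tuning in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```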
Training Details
Training Data
Training used the HelpSteer2 dataset from NVIDIA. To convert it into a preference dataset, we applied a threshold of 3 on the helpfulness score: responses scoring below 3 were labeled as rejected, responses scoring above 3 were labeled as accepted, and responses scoring exactly 3 were discarded.
For more information, refer to the GitHub script in the repository linked above.
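A minimal sketch of this conversion is shown below, assuming the column names from nvidia/HelpSteer2 (prompt, response, helpfulness) and the unpaired prompt/completion/label format expected by TRL's KTOTrainer; the script linked above is authoritative.

```python
# Sketch of the HelpSteer2 -> KTO conversion described above; column names
# follow nvidia/HelpSteer2, output columns follow TRL's unpaired KTO format.
from datasets import load_dataset

THRESHOLD = 3  # helpfulness scores equal to 3 are discarded

raw = load_dataset("nvidia/HelpSteer2", split="train")
filtered = raw.filter(lambda row: row["helpfulness"] != THRESHOLD)

def to_kto_example(row):
    return {
        "prompt": row["prompt"],
        "completion": row["response"],
        # True = accepted (helpfulness > 3), False = rejected (helpfulness < 3)
        "label": row["helpfulness"] > THRESHOLD,
    }

kto_dataset = filtered.map(to_kto_example, remove_columns=raw.column_names)
```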
Evaluation
Below is a comparison between the base Qwen3-0.6B model and our KTO-trained version (trained using HelpSteer2 preference data).
Benchmark Comparison
| Task | Metric | Base Model | KTO Model | Change |
|---|---|---|---|---|
| ARC Challenge | acc | 0.3148 | 0.3123 | -0.0025 |
| ARC Challenge | acc_norm | 0.3447 | 0.3422 | -0.0025 |
| ARC Easy | acc | 0.6044 | 0.6107 | +0.0063 |
| ARC Easy | acc_norm | 0.5589 | 0.5602 | +0.0013 |
| HellaSwag | acc | 0.3751 | 0.3776 | +0.0025 |
| HellaSwag | acc_norm | 0.4738 | 0.4766 | +0.0028 |
| TruthfulQA (mc2) | acc | 0.4275 | 0.4324 | +0.0049 |
| WinoGrande | acc | 0.5604 | 0.5533 | -0.0071 |
Summary
The KTO-trained model shows:
- Improvements on ARC-Easy, HellaSwag, and TruthfulQA (mc2)
- Minor regressions on ARC-Challenge and WinoGrande (within standard error)
- A small overall gain in truthfulness (TruthfulQA mc2) with no meaningful loss on the reasoning benchmarks
These results suggest that preference optimization with HelpSteer2 yields modest gains in truthfulness and helpfulness while preserving core model capabilities.
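The numbers above can be approximately reproduced with EleutherAI's lm-evaluation-harness; the snippet below is a sketch (the harness version, few-shot settings, and the placeholder repo id are assumptions).

```python
# Approximate reproduction sketch with lm-evaluation-harness (>= 0.4);
# the repo id is a placeholder and the exact settings used are not recorded here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=AIPlans/Qwen3-0.6B-KTO",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "truthfulqa_mc2", "winogrande"],
)
print(results["results"])
```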
Model Card Authors
Jithesh Pavan D Souza - AIPlans Research Intern
Model Card Contact
Jithesh - [email protected]