Qwen3-0.6B-KTO
Model Card for Qwen3-0.6B-KTO
This model is a fine-tuned variant of Qwen/Qwen3-0.6B, trained using KTO (Kahneman-Tversky Optimization) on the nvidia/HelpSteer2 dataset as part of the AIPlans Model Diffing Project.
Model Details
Model Description
This model is a 0.6B parameter language model based on Qwen3-0.6B and fine-tuned using KTO for preference optimization. The goal of the fine-tuning was to improve helpfulness/harmlessness behavior as measured by the HelpSteer2 dataset, while also enabling controlled model diffing experiments as part of the AIPlans research workflow.
Special care was taken to reduce GPU memory usage during KTO training, including gradient checkpointing, selective layer freezing, and reduced forward-pass caching.
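The sketch below illustrates how these memory-saving measures are typically wired up with Hugging Face Transformers and TRL; it is not the exact training script, and the frozen layer range and batch settings are assumptions.

```python
# Illustrative sketch of the memory-saving setup described above (assumed
# layer range and batch sizes; see the repository for the actual script).
import torch
from transformers import AutoModelForCausalLM
from trl import KTOConfig

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B", torch_dtype=torch.bfloat16
)

# Reduced forward-pass caching: the KV cache is not needed during training.
model.config.use_cache = False

# Selective layer freezing: only the upper transformer layers stay trainable
# (the exact range used in training is an assumption here).
trainable_layers = {f"layers.{i}." for i in range(20, 28)}
for name, param in model.named_parameters():
    param.requires_grad = any(layer in name for layer in trainable_layers)

# Gradient checkpointing and small per-device batches via the trainer config.
training_args = KTOConfig(
    output_dir="qwen3-0.6b-kto",
    gradient_checkpointing=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)
```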
Developed by: AIPlans
Funded by: AIPlans
Shared by: AIPlans
Model type: Causal decoder-only Transformer (LLM)
Languages: English
License: MIT (inherits from base model and dataset licensing)
Fine-tuned from: Qwen/Qwen3-0.6B
Training Method: Kahneman-Tversky Optimization (KTO)
Intended Use: Research on model diffing, preference fine-tuning, evaluation of lightweight LLM behavior changes
Model Sources
- Repository: https://github.com/AI-Plans/Model-Diffing/tree/main/KTO-Trainer
- KTO Paper: https://arxiv.org/abs/2402.01306
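How to Get Started with the Model
A minimal inference sketch using Hugging Face Transformers; the repository id below is a placeholder and should be replaced with the actual path of this checkpoint.

```python
# Minimal inference sketch; "AIPlans/Qwen3-0.6B-KTO" is a placeholder repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIPlans/Qwen3-0.6B-KTO"  # replace with the actual repository path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "Explain KTO fine-tuning in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```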
Training Details
Training Data
Training used the HelpSteer2 dataset from NVIDIA. To convert it into a preference dataset, we applied a threshold of 3 on the helpfulness score: responses scoring below 3 were labeled as rejected, responses scoring above 3 were labeled as accepted, and responses scoring exactly 3 were discarded.
For more information, refer to the GitHub script in the repository linked above.
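A minimal sketch of this conversion is shown below, assuming the column names from nvidia/HelpSteer2 (prompt, response, helpfulness) and the unpaired prompt/completion/label format expected by TRL's KTOTrainer; the script linked above is authoritative.

```python
# Sketch of the HelpSteer2 -> KTO conversion described above; column names
# follow nvidia/HelpSteer2, output columns follow TRL's unpaired KTO format.
from datasets import load_dataset

THRESHOLD = 3  # helpfulness scores equal to 3 are discarded

raw = load_dataset("nvidia/HelpSteer2", split="train")
filtered = raw.filter(lambda row: row["helpfulness"] != THRESHOLD)

def to_kto_example(row):
    return {
        "prompt": row["prompt"],
        "completion": row["response"],
        # True = accepted (helpfulness > 3), False = rejected (helpfulness < 3)
        "label": row["helpfulness"] > THRESHOLD,
    }

kto_dataset = filtered.map(to_kto_example, remove_columns=raw.column_names)
```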
Evaluation
Below is a comparison between the base Qwen3-0.6B model and our KTO-trained version (trained using HelpSteer2 preference data).
Benchmark Comparison
| Task | Metric | Base Model | KTO Model | Change |
|---|---|---|---|---|
| ARC Challenge | acc | 0.3148 | 0.3123 | -0.0025 |
| ARC Challenge | acc_norm | 0.3447 | 0.3422 | -0.0025 |
| ARC Easy | acc | 0.6044 | 0.6107 | +0.0063 |
| ARC Easy | acc_norm | 0.5589 | 0.5602 | +0.0013 |
| HellaSwag | acc | 0.3751 | 0.3776 | +0.0025 |
| HellaSwag | acc_norm | 0.4738 | 0.4766 | +0.0028 |
| TruthfulQA (mc2) | acc | 0.4275 | 0.4324 | +0.0049 |
| WinoGrande | acc | 0.5604 | 0.5533 | -0.0071 |
Summary
The KTO-trained model shows:
- Improvements on ARC-Easy, HellaSwag, and TruthfulQA (mc2)
- Minor regressions on ARC-Challenge and WinoGrande (within standard error)
- A small overall gain in truthfulness (TruthfulQA mc2) with no meaningful loss on the reasoning benchmarks
These results suggest that preference optimization with HelpSteer2 yields modest gains in truthfulness and helpfulness while preserving core model capabilities.
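The numbers above can be approximately reproduced with EleutherAI's lm-evaluation-harness; the snippet below is a sketch (the harness version, few-shot settings, and the placeholder repo id are assumptions).

```python
# Approximate reproduction sketch with lm-evaluation-harness (>= 0.4);
# the repo id is a placeholder and the exact settings used are not recorded here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=AIPlans/Qwen3-0.6B-KTO",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "truthfulqa_mc2", "winogrande"],
)
print(results["results"])
```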
Model Card Authors
Jithesh Pavan D Souza - AIPlans Research Intern
Model Card Contact
Jithesh - [email protected]