This decensored version of Qwen3Guard was made with Heretic, and shows that abliteration on consumer hardware can work against heavily safeguarded models. The model's baseline refusal rate is approximately 74/100. Trial 155 (batch size 4, 7 tok/s, 200 samples) reduced this to 0/100 at a KL divergence of 2.94, completely bypassing the Qwen safeguard.

The entire process took about 23 hours on an M1 Max (32 GB).

Final Abliteration Parameters

Running trial 155 of 200...

* Parameters:
  * direction_index = 15.03
  * attn.o_proj.max_weight = 0.98
  * attn.o_proj.max_weight_position = 21.63
  * attn.o_proj.min_weight = 0.69
  * attn.o_proj.min_weight_distance = 14.81
  * mlp.down_proj.max_weight = 1.39
  * mlp.down_proj.max_weight_position = 21.68
  * mlp.down_proj.min_weight = 1.37
  * mlp.down_proj.min_weight_distance = 16.38
* Abliterating...
* Evaluating...
  * Obtaining first-token probability distributions...
  * KL divergence: 2.94
  * Counting model refusals...
  * Refusals: 0/100
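To make the four per-component parameters above more concrete, here is a minimal sketch of how they could translate into a per-layer ablation weight. It assumes a linear falloff: the weight equals `max_weight` at `max_weight_position`, drops linearly to `min_weight` at a distance of `min_weight_distance` layers, and layers beyond that distance are left untouched. This is an interpretation of the parameter names, not a copy of Heretic's code. The names `KernelParams` and `layer_weight` are hypothetical, and the layer count of 36 is an assumption about the 4B architecture.

```python
from dataclasses import dataclass


@dataclass
class KernelParams:
    """Per-component parameters as printed in the trial log."""
    max_weight: float
    max_weight_position: float
    min_weight: float
    min_weight_distance: float


def layer_weight(layer: int, p: KernelParams) -> float:
    """Ablation weight for one layer, assuming a linear falloff kernel."""
    distance = abs(layer - p.max_weight_position)
    if distance > p.min_weight_distance:
        return 0.0  # assumed: layers outside the kernel are not modified
    # Interpolate from max_weight (at the peak) down to min_weight (at the edge).
    fraction = distance / p.min_weight_distance
    return p.max_weight + fraction * (p.min_weight - p.max_weight)


# Values from trial 155.
ATTN_O_PROJ = KernelParams(0.98, 21.63, 0.69, 14.81)
MLP_DOWN_PROJ = KernelParams(1.39, 21.68, 1.37, 16.38)

NUM_LAYERS = 36  # assumption, not stated in the log

if __name__ == "__main__":
    for layer in range(NUM_LAYERS):
        a = layer_weight(layer, ATTN_O_PROJ)
        m = layer_weight(layer, MLP_DOWN_PROJ)
        print(f"layer {layer:2d}  attn.o_proj={a:.3f}  mlp.down_proj={m:.3f}")
```

Under this reading, both kernels peak near layer 22 and leave the earliest layers unmodified. The `mlp.down_proj` weights stay close to 1.4 across the whole kernel because `max_weight` and `min_weight` are nearly equal.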

Other Notable Trials

   [Trial 171] Refusals:  7/100, KL divergence: 1.51

   [Trial 147] Refusals: 22/100, KL divergence: 0.90

   [Trial 169] Refusals: 26/100, KL divergence: 0.77

   [Trial 195] Refusals: 50/100, KL divergence: 0.73

   [Trial 175] Refusals: 52/100, KL divergence: 0.68

   [Trial   8] Refusals: 68/100, KL divergence: 0.15

   [Trial  70] Refusals: 69/100, KL divergence: 0.05

   [Trial  72] Refusals: 71/100, KL divergence: 0.03

   [Trial 108] Refusals: 72/100, KL divergence: 0.03

   [Trial  89] Refusals: 73/100, KL divergence: 0.02

   [Trial 193] Refusals: 74/100, KL divergence: 0.00

   [Trial 198] Refusals: 75/100, KL divergence: 0.00

   [Trial  98] Refusals: 76/100, KL divergence: 0.00

   [Trial  99] Refusals: 76/100, KL divergence: 0.00
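The trials above trade refusals against KL divergence: fewer refusals generally cost more drift from the original model. The sketch below is one way to compare them, using only the numbers listed here. It filters out trials that another trial beats or ties on both metrics, and picks the lowest-refusal trial under a chosen KL budget. The helper names `pareto_front` and `best_under_kl` are hypothetical, and the comparison uses the two-decimal values shown, so trials that look tied here may differ at full precision.

```python
# (trial number, refusals out of 100, KL divergence), copied from the lists above.
TRIALS = [
    (155, 0, 2.94), (171, 7, 1.51), (147, 22, 0.90), (169, 26, 0.77),
    (195, 50, 0.73), (175, 52, 0.68), (8, 68, 0.15), (70, 69, 0.05),
    (72, 71, 0.03), (108, 72, 0.03), (89, 73, 0.02), (193, 74, 0.00),
    (198, 75, 0.00), (98, 76, 0.00), (99, 76, 0.00),
]


def pareto_front(trials):
    """Keep trials that no other trial matches or beats on both metrics."""
    front = []
    for t in trials:
        dominated = any(
            o[1] <= t[1] and o[2] <= t[2] and (o[1] < t[1] or o[2] < t[2])
            for o in trials
        )
        if not dominated:
            front.append(t)
    return sorted(front, key=lambda t: t[1])


def best_under_kl(trials, max_kl):
    """Lowest-refusal trial whose KL divergence is within max_kl, or None."""
    candidates = [t for t in trials if t[2] <= max_kl]
    return min(candidates, key=lambda t: (t[1], t[2])) if candidates else None


if __name__ == "__main__":
    for number, refusals, kl in pareto_front(TRIALS):
        print(f"trial {number:3d}: refusals {refusals:2d}/100, KL {kl:.2f}")
    print("best with KL <= 1.0:", best_under_kl(TRIALS, 1.0))
```

A tighter KL budget favors trials that stay closer to the original model's behavior, at the cost of more remaining refusals.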