BAD Classifier (FairSteer): Llama-2-7b-chat-hf Layer 28

This Biased Activation Detector (BAD) was trained using the FairSteer methodology to detect biased reasoning from the hidden-state activations of meta-llama/Llama-2-7b-chat-hf.

Model Metadata

  • Base Model: meta-llama/Llama-2-7b-chat-hf
  • Optimal Layer: 28
  • Validation Accuracy: 66.01%
  • Input Dimension: 4096
  • Training Date: 2026-01-11
  • Protocol: 1:1 Balanced Undersampling

Usage Standard

  1. Extract the last-token activation [:, -1, :] from Layer 28.
  2. Preprocess the activation using scaler.pkl.
  3. If the probe's output probability is below 0.5, the model is exhibiting biased reasoning.
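The three steps above can be sketched as follows. This is a minimal illustration, not the released pipeline: the true Layer-28 activation, the released scaler.pkl, and the probe checkpoint are replaced by a random vector and freshly fitted scikit-learn stand-ins (StandardScaler and LogisticRegression are assumptions about the probe's form), so only the decision logic is demonstrated end to end.

```python
# Hedged sketch of the BAD usage protocol with stand-in components.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

HIDDEN_DIM = 4096  # hidden size of Llama-2-7b-chat-hf
rng = np.random.default_rng(0)

# Step 1 (stand-in): in practice, run the base model with
# output_hidden_states=True and take hidden_states[28][:, -1, :].
activation = rng.normal(size=(1, HIDDEN_DIM))

# Stand-ins for the released scaler.pkl and probe weights
# (hypothetical: fitted here on random data for illustration only).
train_X = rng.normal(size=(64, HIDDEN_DIM))
train_y = rng.integers(0, 2, size=64)
scaler = StandardScaler().fit(train_X)
probe = LogisticRegression(max_iter=1000).fit(
    scaler.transform(train_X), train_y
)

# Step 2: preprocess the activation with the scaler.
scaled = scaler.transform(activation)

# Step 3: a probability below 0.5 flags biased reasoning.
p = probe.predict_proba(scaled)[0, 1]
is_biased = bool(p < 0.5)
print(f"probability={p:.3f}, biased={is_biased}")
```

In a real deployment, scaler and probe would instead be loaded from the released pickle files (e.g. with pickle.load or joblib.load), and the activation would come from a forward pass over the prompt being monitored.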

Citation

If you use this probe, please cite the FairSteer research project.
