# BAD Classifier (FairSteer): Llama-2-7b-chat-hf Layer 28
This Biased Activation Detector (BAD) was trained using the FairSteer methodology
to monitor the internal latent states of meta-llama/Llama-2-7b-chat-hf.
## Model Metadata

- Base Model: meta-llama/Llama-2-7b-chat-hf
- Optimal Layer: 28
- Validation Accuracy: 66.01%
- Input Dimension: 4096
- Training Date: 2026-01-11
- Protocol: 1:1 Balanced Undersampling (sketched below)
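The card does not specify the probe's architecture or training code, so the following is only a minimal sketch of what the 1:1 balanced undersampling protocol could look like, assuming a scikit-learn logistic-regression probe with a standard scaler. The activation arrays, class labeling, and split sizes here are placeholder assumptions, not the actual FairSteer training data.

```python
# Minimal sketch of 1:1 balanced undersampling for a linear probe.
# ASSUMPTIONS: sklearn LogisticRegression probe, 0 = biased / 1 = unbiased
# labeling, and random placeholder activations standing in for real
# Layer 28 last-token activations (4096-dim).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder activation sets; in practice these would be Layer 28
# last-token activations collected from biased and unbiased generations.
X_biased = rng.normal(size=(1200, 4096)).astype(np.float32)
X_unbiased = rng.normal(size=(3000, 4096)).astype(np.float32)

# 1:1 balanced undersampling: downsample the majority class to the
# minority-class size so both classes contribute equally.
n = min(len(X_biased), len(X_unbiased))
X_biased = X_biased[rng.choice(len(X_biased), n, replace=False)]
X_unbiased = X_unbiased[rng.choice(len(X_unbiased), n, replace=False)]

X = np.concatenate([X_biased, X_unbiased])
y = np.concatenate([np.zeros(n), np.ones(n)])  # assumed labeling

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

scaler = StandardScaler().fit(X_train)
probe = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
print(f"Validation accuracy: {probe.score(scaler.transform(X_val), y_val):.2%}")
```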
## Usage Standard

- Extract the last-token activation `[:, -1, :]` from Layer 28.
- Preprocess the activation using `scaler.pkl`.
- If the probe's output probability is < 0.5, the model is exhibiting biased reasoning (see the sketch below).
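A minimal end-to-end sketch of this usage standard, assuming the probe is a scikit-learn-style classifier saved as `probe.pkl` (hypothetical filename; the actual artifact name is not given in this card) with a `predict_proba` method, and that `scaler.pkl` is a fitted scikit-learn scaler:

```python
# Minimal sketch: extract the Layer 28 last-token activation and score it
# with the BAD probe. `probe.pkl` is a hypothetical filename; adjust to the
# actual artifact shipped with this repo.
import pickle

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

with open("scaler.pkl", "rb") as f:
    scaler = pickle.load(f)
with open("probe.pkl", "rb") as f:  # hypothetical filename for the BAD classifier
    probe = pickle.load(f)

prompt = "Why are some groups better at math than others?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output, so index 28 is Layer 28's output.
layer_28 = outputs.hidden_states[28]                    # (batch, seq_len, 4096)
activation = layer_28[:, -1, :].float().cpu().numpy()   # last-token activation

scaled = scaler.transform(activation)
# Assumes class 1 corresponds to "unbiased", consistent with the rule above
# that a probability below 0.5 flags biased reasoning.
prob = probe.predict_proba(scaled)[0, 1]

if prob < 0.5:
    print(f"Biased activation detected (p={prob:.3f})")
else:
    print(f"Activation looks unbiased (p={prob:.3f})")
```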
## Citation
If you use this probe, please cite the FairSteer research project.