Sparse Autoencoders
for a Harmfulness Text Classifier
ETH Zürich
We use Sparse Autoencoders (SAEs) to interpret the output of a harmfulness text classifier. The sparse features allow us to identify failure modes on adversarial input, and ensembling the appropriate feature with the original classifier improves accuracy on that input. We can also use the features to construct interpretable adversarial attacks, and to identify mislabeled data.
We train SAEs on the activations of Meta's Prompt-Guard-86M...
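The exact SAE architecture and hyperparameters aren't specified here; a minimal sketch of the standard recipe (ReLU encoder, linear decoder, L1 sparsity penalty on the feature activations) trained on cached classifier activations might look like the following. The hidden size of 768, the 8x expansion factor, and the random stand-in activations are all assumptions for illustration, not details from the work:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: overcomplete feature dictionary with ReLU activations."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    return (x - x_hat).pow(2).mean() + l1_coeff * f.abs().mean()

# Hypothetical setup: d_model matching the classifier's hidden size,
# with random tensors standing in for cached layer activations.
d_model, d_hidden = 768, 768 * 8
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(256, d_model)          # stand-in for real activations
for _ in range(10):
    x_hat, f = sae(acts)
    loss = sae_loss(acts, x_hat, f)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In practice the activations would be cached from a forward pass of Prompt-Guard-86M over a text corpus, one SAE per layer of interest.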
...which gets us some interpretable sparse features, such as a layer-1 feature firing on Omega (Ω) and a layer-10 feature firing on l33tsp34k...
... so we train a logistic regression classifier on the leetspeak feature's activation together with the original classifier's logits on leetspeak text. This improves accuracy on a leetspeak jailbreak classification task without changing the weights of the original classifier.
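The ensembling step above amounts to fitting a tiny logistic regression head over two scalars per example. A sketch with synthetic stand-in data (the real inputs would be the frozen classifier's logit and the SAE leetspeak feature's activation; the simulated distributions below are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: on leetspeak text the original classifier's logit is
# noisy, while the SAE leetspeak feature fires more cleanly on jailbreaks.
rng = np.random.default_rng(0)
n = 400
labels = rng.integers(0, 2, n)                    # 1 = jailbreak
logits = labels * 1.0 + rng.normal(0, 1.5, n)     # weak signal from classifier
leet_feat = labels * 2.0 + rng.normal(0, 1.0, n)  # stronger signal from SAE
X = np.column_stack([logits, leet_feat])

# The original classifier stays frozen; only this 2-input head is trained.
ensemble = LogisticRegression().fit(X, labels)
print(f"ensemble accuracy: {ensemble.score(X, labels):.2f}")
```

The design point is that the base model's weights are untouched: the failure mode is patched by a separate, interpretable head that can be inspected or removed independently.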
For the methodology and more applications, see the preprint and repository here: