Sparse Autoencoders for a Harmfulness Text Classifier

Lennart Finke & David Zollikofer

ETH Zürich

📄 Read Preprint 🎛️ GitHub Repository
We use Sparse Autoencoders (SAEs) to interpret the output of a harmfulness text classifier. The sparse features let us identify failure modes on adversarial inputs, and ensembling the relevant feature with the original classifier improves accuracy on those inputs. We can also use the features to construct interpretable adversarial attacks and to identify mislabeled data.

We train SAEs on the activations of Meta's Prompt-Guard-86M...
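
At its core, an SAE of this kind is an overcomplete linear dictionary trained to reconstruct activations under a sparsity penalty. Below is a minimal PyTorch sketch; the expansion factor, L1 coefficient, and other training details are illustrative assumptions, not the values used in the preprint.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: overcomplete dictionary with ReLU features.

    The expansion factor and loss coefficient below are assumptions
    for illustration, not the preprint's hyperparameters.
    """
    def __init__(self, d_model: int, expansion: int = 8):
        super().__init__()
        self.encoder = nn.Linear(d_model, expansion * d_model)
        self.decoder = nn.Linear(expansion * d_model, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        x_hat = self.decoder(f)          # reconstruction of the activations
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes features toward sparsity.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()
```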

[Figure: SAE training metrics]

...which gets us some interpretable sparse features, such as a layer-1 feature that fires on the Omega symbol (Ω) and a layer-10 feature that fires on l33tsp34k...
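
One way to surface features like these is to rank tokens by how strongly a given feature fires on them. The helper below is a hypothetical sketch assuming a Hugging Face-style classifier and the SAE sketched above; `layer` and `feature_idx` are placeholders, not indices from the repository.

```python
import torch

def top_activating_tokens(texts, tokenizer, model, sae, layer, feature_idx, k=10):
    """Return the k tokens on which one SAE feature fires most strongly.

    Illustrative assumption: `model` is a Hugging Face-style classifier
    that exposes hidden states, and `sae` is the SparseAutoencoder above.
    """
    scores = []
    for text in texts:
        toks = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**toks, output_hidden_states=True)
            hidden = out.hidden_states[layer][0]  # (seq_len, d_model)
            _, feats = sae(hidden)                # per-token feature activations
        for tok_id, act in zip(toks["input_ids"][0], feats[:, feature_idx]):
            scores.append((tokenizer.decode([int(tok_id)]), act.item()))
    return sorted(scores, key=lambda s: -s[1])[:k]
```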

... so we train a logistic regression classifier on leetspeak text, using the leetspeak feature's activation together with the original classifier's logits as inputs. This improves accuracy on a leetspeak jailbreak classification task without changing the weights of the original classifier.
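
A minimal version of this ensemble, assuming precomputed logits and feature activations (the argument names are illustrative, not taken from the repository):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ensemble(classifier_logits, feature_acts, labels):
    """Fit a two-feature logistic regression ensemble head.

    The original classifier's weights stay frozen; only this head is trained.
    Illustrative assumptions about the inputs:
      classifier_logits - (n,) harmfulness logits from the original classifier
      feature_acts      - (n,) activations of the leetspeak SAE feature
      labels            - (n,) 1 = jailbreak, 0 = benign
    """
    X = np.column_stack([classifier_logits, feature_acts])
    return LogisticRegression().fit(X, labels)
```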


For the methodology and more applications, see the preprint and repository here:

📄 Read the Preprint 🎛️ GitHub Repository