what features activate during jailbreaks?

Open question for future safety interpretability work.

Source paper

extracted_from

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
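
The selection rule described above (cosine similarity of at least 0.65, and no typed edge to this entity) can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline's actual code: the function name `untyped_neighbors`, the dictionary of embeddings, the `typed_edges` set, and the toy 2-d vectors are all hypothetical.

```python
import numpy as np

def untyped_neighbors(query, candidates, typed_edges, threshold=0.65):
    """Return (name, score) pairs with cosine >= threshold and no typed edge.

    query:       1-d embedding of the current entity (hypothetical input)
    candidates:  dict mapping entity name -> 1-d embedding
    typed_edges: set of entity names already linked by a typed relation
    """
    q = query / np.linalg.norm(query)
    results = []
    for name, vec in candidates.items():
        if name in typed_edges:
            continue  # already related by a typed edge, so not a candidate
        score = float(q @ (vec / np.linalg.norm(vec)))  # cosine similarity
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first, matching the ordering of the list on this page.
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy embeddings, made up for illustration only.
query = np.array([1.0, 0.0])
candidates = {
    "near": np.array([0.9, 0.1]),
    "far": np.array([0.0, 1.0]),
    "linked": np.array([1.0, 0.0]),
}
neighbors = untyped_neighbors(query, candidates, typed_edges={"linked"})
```

Here only `near` survives: `far` falls below the threshold, and `linked` is excluded because it already has a typed edge despite its high similarity.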

Jailbreak · concept · 0.781
Methods to bypass model safety training; features may activate during jailbreaks.
Jailbreaking · concept · 0.770
Users coaxing dialogue agents into issuing threats or toxic content by overriding intended persona constraints.
what features activate when we ask questions probing Claude's goals and values? · question · 0.735
Direction for understanding the model's internal objectives via features.
Jailbreak Attack · concept · 0.728
Security attack that bypasses LLM safety alignment by suppressing deliberation or exploiting reflection inhibition.
Action Features · concept · 0.726
Dual interpretation of features: in addition to responding to inputs, features also act to increase the probability of specific output tokens.
what features activate when we ask Claude questions about its subjective experience? · question · 0.715
Question about features related to consciousness and self-report.
Feature ablation (zeroing feature activations) · method · 0.715
Clamping a feature's value to zero to measure its causal effect on model output.
Feature steering (clamping feature activations) · method · 0.706
Modifying model behavior by clamping SAE feature activations to specific values during the forward pass.
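
The last two entries above, feature ablation and feature steering, are both the same operation with a different clamp value. A minimal sketch of that operation follows. It assumes a toy ReLU sparse autoencoder with random weights and adds the SAE's reconstruction error back after decoding; the function name `clamp_feature`, the weight names, and the shapes are hypothetical and do not come from the source or from any particular library.

```python
import numpy as np

def clamp_feature(activations, W_enc, b_enc, W_dec, b_dec, feature_idx, value):
    """Clamp one SAE feature to `value` and return the patched activations.

    Assumes a ReLU SAE: feats = relu(x @ W_enc + b_enc), recon = feats @ W_dec + b_dec.
    The reconstruction error is added back so that only the clamped feature changes.
    """
    feats = np.maximum(activations @ W_enc + b_enc, 0.0)
    error = activations - (feats @ W_dec + b_dec)
    clamped = feats.copy()
    clamped[..., feature_idx] = value  # value = 0.0 is ablation; other values steer
    return clamped @ W_dec + b_dec + error

# Toy setup: 3 token positions, model width 4, 8 SAE features (all made up).
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
W_enc, b_enc = rng.normal(size=(4, 8)), np.zeros(8)
W_dec, b_dec = rng.normal(size=(8, 4)), np.zeros(4)

ablated = clamp_feature(x, W_enc, b_enc, W_dec, b_dec, feature_idx=2, value=0.0)
steered = clamp_feature(x, W_enc, b_enc, W_dec, b_dec, feature_idx=2, value=5.0)
```

In a real experiment the patched activations would be written back into the model's forward pass at the layer the SAE was trained on, and the effect measured on the output. That step depends on the model framework and is not shown here.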