question
active
question:what-features-activate-during-jailbreaks
what features activate during jailbreaks?
Open question for future safety interpretability work.
Source paper
extracted_from
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Methods to bypass model safety training; features may activate during jailbreaks.
- Users coaxing dialogue agents into issuing threats or toxic content by overriding intended persona constraints.
- Direction for understanding model's internal objectives via features.
- Security attack that bypasses LLM safety alignment by suppressing deliberation or exploiting reflection inhibition.
- Dual interpretation of features: in addition to responding to inputs, features also act to increase the probability of specific output tokens.
- Question about features related to consciousness and self-report.
- Clamping a feature's value to zero to measure its causal effect on model output.
- Modifying model behavior by clamping SAE feature activations to specific values during forward pass.
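
The last two entries describe clamping an SAE feature, either to zero (ablation) or to a chosen value (steering). A minimal sketch of that idea on a toy sparse autoencoder follows. All names here (`sae_encode`, `sae_decode`, `clamp_feature`) and the toy weights are hypothetical and are not taken from the source paper; adding the SAE's reconstruction error back after clamping is an assumption of this sketch, not something the entries above state.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sae_encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Feature activations: ReLU(W_enc @ x + b_enc), one row of w_enc per feature."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w_enc, b_enc)]
    return [max(0.0, p) for p in pre]


def sae_decode(f: Vector, w_dec: Matrix) -> Vector:
    """Reconstruction: sum of feature activations times their decoder directions."""
    dim = len(w_dec[0])
    return [sum(f[i] * w_dec[i][j] for i in range(len(f))) for j in range(dim)]


def clamp_feature(x: Vector, w_enc: Matrix, b_enc: Vector, w_dec: Matrix,
                  index: int, value: float) -> Vector:
    """Return the activation with one SAE feature clamped to `value`.

    Assumption of this sketch: the part of x the SAE fails to reconstruct
    is added back unchanged, so only the clamped feature's contribution moves.
    """
    f = sae_encode(x, w_enc, b_enc)
    error = [xi - ri for xi, ri in zip(x, sae_decode(f, w_dec))]
    f_clamped = list(f)
    f_clamped[index] = value  # 0.0 ablates the feature; other values steer it
    return [p + e for p, e in zip(sae_decode(f_clamped, w_dec), error)]


# Toy 2-dimensional activation with 2 hypothetical features.
w_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 0.5]]
x = [1.0, 2.0]

ablated = clamp_feature(x, w_enc, b_enc, w_dec, index=1, value=0.0)
steered = clamp_feature(x, w_enc, b_enc, w_dec, index=0, value=5.0)
```

In a real experiment the patched activation would be written back into the model's forward pass and the change in output logits compared against the unpatched run; that step depends on the model and hooking library and is left out here.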