claim
active
claim:the-existence-of-safety-relevant-features-does-not-imply-dangerous-model-behavior-but-compels-study-of-when-they-activate
The existence of safety-relevant features does not imply dangerous model behavior, but compels study of when they activate.
Cautionary interpretive claim; that models have these features is expected given their pretraining data.
Source paper
extracted_from
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Features that activate on content related to potential harms (deception, bias, dangerous information).
- SAEs uncover safety-relevant representations that might be monitored or controlled.
- can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors? · question · 0.791 · Question about practical safety application of feature monitoring.
- Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads
- Second of three speculative claims asserting that subgraphs of neural networks are tractable and meaningful objects of study
- Critique of using formal specifications alone for concept definition.
- Author's interpretation of the VTAB alignment results echoing Tolstoy