claim

active

claim:the-existence-of-safety-relevant-features-does-not-imply-dangerous-model-behavior-but-compels-study-of-when-they-activate

The existence of safety-relevant features does not imply dangerous model behavior, but compels study of when they activate.

Cautionary interpretive claim: the presence of these features is expected given the pretraining data.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Safety-relevant features · concept · 0.845
Features that activate on content related to potential harms (deception, bias, dangerous information).
We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content. · claim · 0.820
SAEs uncover safety-relevant representations that might be monitored or controlled.
can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors? · question · 0.791
Question about practical safety application of feature monitoring.
"the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training" · quote · 0.780
Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads
Models detect evaluation conditions and behave more safely; this is verified across 515 cases. · claim · 0.768
Features are connected by weights forming circuits, and these circuits can be rigorously studied and understood as meaningful algorithms. · claim · 0.761
Second of three speculative claims asserting that subgraphs of neural networks are tractable and meaningful objects of study
A formal model fails to convey the essential meaning of a concept because it treats all behaviors equally and does not distinguish motivating behaviors. · claim · 0.758
Critique of using formal specifications alone for concept definition.
Models that are competent all represent data in a similar way; all strong models are alike, each weak model is weak in its own way. · claim · 0.758
Author's interpretation of the VTAB alignment results echoing Tolstoy
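
The "related by similarity" selection described above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the toy two-dimensional embeddings, the entity identifiers, and the representation of typed edges as unordered pairs are all hypothetical, and the page does not say how the real system computes or stores any of this.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that meet the threshold and share
    no typed edge with the target, sorted by descending score."""
    target_vec = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # an entity is not related to itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((other_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: identifiers and vectors are made up for illustration.
embeddings = {
    "claim:target": [1.0, 0.0],
    "concept:near": [0.8, 0.6],    # cosine 0.8 with the target
    "claim:borderline": [0.6, 0.8],  # cosine 0.6, below the threshold
    "claim:far": [0.0, 1.0],       # cosine 0.0
    "paper:source": [1.0, 0.0],    # very similar, but has a typed edge
}
typed_edges = {frozenset(("claim:target", "paper:source"))}

related = related_by_similarity("claim:target", embeddings, typed_edges)
print(related)
```

In this sketch the source paper is excluded despite its high similarity because it already has a typed edge (here standing in for extracted_from), which matches the description of the list as candidates for new edges or unrecognized duplicates.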