claim
active
claim:we-observe-features-related-to-a-broad-range-of-safety-concerns-including-deception-sycophancy-bias-and-dangerous-content
We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content.
Sparse autoencoders (SAEs) uncover safety-relevant representations that might be monitored or controlled.
Source paper
extracted_from
Neighborhood, ranked by edge-count
Findings (1)
finding
- Multimodal generalization to visual security bypass.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one. These are candidates for new edges or unrecognized duplicates.
- Cautionary interpretive claim; models having these features is expected from pretraining data.
- Central question motivating the paper.
- Supported by TruthfulQA generalization in Experiment 2: the same feature directions gate factual accuracy across 29 independent categories.
- Claim that feature grounding enables interpretability metrics.
- Features that activate on content related to potential harms (deception, bias, dangerous information).
- Interpretive claim from Experiment 2 bridging consciousness claims and representational honesty.
- Ethical precaution advocated by Levin and Crump et al.