claim

active

claim:we-observe-features-related-to-a-broad-range-of-safety-concerns-including-deception-sycophancy-bias-and-dangerous-content

We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content.

SAEs uncover safety-relevant representations that might be monitored or controlled.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Neighborhood — ranked by edge-count

Findings (1)

finding

Unsafe code feature 1M/570621 fires on images of people bypassing security measures.
supports
Multimodal generalization to visual security bypass.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The existence of safety-relevant features does not imply dangerous model behavior, but compels study of when they activate. · claim · 0.820
Cautionary interpretive claim; models having these features is expected from pretraining data.
What degree of concern and care should we exhibit toward the many diverse agents around us, and what criteria do we use to identify sentience, capacity for suffering, and other properties that have moral implications? · question · 0.813
Central question motivating the paper.
Deception-related SAE features track a domain-general representational honesty axis rather than a consciousness-specific roleplay artifact. · claim · 0.792
Supported by TruthfulQA generalization in Experiment 2: the same feature directions gate factual accuracy across 29 independent categories.
Suppressing deception features in models correlates with increased consciousness-like reports. · claim · 0.789
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. · claim · 0.787
Claim that feature grounding enables interpretability metrics.
Safety-relevant features · concept · 0.784
Features that activate on content related to potential harms (deception, bias, dangerous information).
The same latent feature directions that gate consciousness self-reports also modulate factual accuracy across independent reasoning domains, suggesting these features load on a domain-general honesty axis. · claim · 0.781
Interpretive claim from Experiment 2 bridging consciousness claims and representational honesty.
We should err on the side of reducing false negatives with respect to sentience criteria for ethical concern. · claim · 0.773
Ethical precaution advocated by Levin and Crump et al.