claim
active
claim:we-observe-features-related-to-a-broad-range-of-safety-concerns-including-deception-sycophancy-bias-and-dangerous-content

We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content.

Sparse autoencoders (SAEs) uncover safety-relevant representations that could, in principle, be monitored or controlled.

Neighborhood — ranked by edge-count

Findings (1)

finding

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
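
The selection rule described above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is an illustrative sketch only, not the system's actual implementation: the function name `related_by_similarity`, the dict-of-vectors `embeddings` layout, the pair-list `typed_edges` layout, the toy entity ids, and the choice to treat edges as undirected are all assumptions made for the example.

```python
import math

SIMILARITY_THRESHOLD = 0.65  # the "cosine ≥ 0.65" cutoff shown on this page


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges,
                          threshold=SIMILARITY_THRESHOLD):
    """Return (entity_id, score) pairs that are similar to the target
    but share no typed edge with it, highest score first.

    embeddings  : hypothetical dict mapping entity id -> embedding vector
    typed_edges : hypothetical list of (entity_id, entity_id) pairs,
                  treated as undirected here (an assumption)
    """
    target_vec = embeddings[target_id]

    # Collect every entity that already has a typed relation to the target.
    linked = set()
    for a, b in typed_edges:
        if a == target_id:
            linked.add(b)
        elif b == target_id:
            linked.add(a)

    candidates = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((entity_id, score))

    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy data with made-up ids and 2-d vectors, for illustration only.
toy_embeddings = {
    "claim:a": [1.0, 0.0],
    "claim:b": [0.9, 0.1],    # similar, no typed edge -> listed
    "finding:c": [1.0, 0.0],  # similar, but already linked -> skipped
    "claim:d": [0.0, 1.0],    # dissimilar -> skipped
}
toy_edges = [("claim:a", "finding:c")]

result = related_by_similarity("claim:a", toy_embeddings, toy_edges)
```

Entities returned this way would then be reviewed by hand, either to add a typed edge or to merge a duplicate.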