concept
active
concept:safety-relevant-features
Safety-relevant features
Features that activate on content related to potential harms (deception, bias, dangerous information).
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Cautionary interpretive claim: the presence of these features in a model is expected given its pretraining data.
- SAEs uncover safety-relevant representations that might be monitored or controlled.
- Evaluation framework whose validity is questioned by presence of eval awareness.
- Domain of techniques for constructing informative features from raw data; covariance pooling is a feature engineering method for token sequences.
- The project of ensuring AI systems do not harm humans (and other animals); sometimes in tension with AI welfare.
- Dual interpretation of features: in addition to responding to inputs, features also act to increase the probability of specific output tokens.
- Property that features activate on only a small fraction of inputs; enables compressed sensing and is what allows superposition to work.
- Demonstrates mechanistic memorization via feature assemblies in superposition.
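
As a rough illustration of the selection rule stated above (cosine ≥ 0.65 and no typed edge), the following is a minimal sketch, not the system's actual implementation. It assumes each entity has an embedding vector and that typed edges are stored as pairs of entity names; the function names `cosine` and `related_by_similarity`, the data layout, and the example entities are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities near `target` with no typed edge.

    embeddings:  dict mapping entity name -> list of floats (assumed layout)
    typed_edges: set of (name, name) pairs, treated as undirected (assumption)
    threshold:   minimum cosine similarity, 0.65 as shown on the card
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities that already have a typed relation to the target.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    # Most similar first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data: "d" is similar to "a" but already linked, "c" is orthogonal to "a".
example_embeddings = {
    "a": [1.0, 0.0],
    "b": [1.0, 0.1],
    "c": [0.0, 1.0],
    "d": [1.0, 0.0],
}
example_edges = {("a", "d")}
example_neighbors = related_by_similarity("a", example_embeddings, example_edges)
```

In this toy data only "b" qualifies as a neighbor of "a". Entities returned this way would then be reviewed by hand as candidates for a new typed edge or for merging as duplicates.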