Safety-relevant features

Features that activate on content related to potential harms (deception, bias, dangerous information).

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
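
As a rough illustration of how a list like this could be produced, the sketch below filters entities by cosine similarity and drops any pair that already has a typed edge. It is a minimal sketch under stated assumptions: the function names (`cosine`, `related_by_similarity`), the dict-of-vectors layout, and the toy embeddings are all hypothetical and do not come from the system that generated this page. Only the 0.65 threshold and the "no typed edge" condition are taken from the text above.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs with cosine >= threshold and no typed edge to target.

    embeddings:  dict mapping entity name -> vector (hypothetical layout)
    typed_edges: set of (name_a, name_b) pairs that already have a typed relation
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked to the target by a typed relation.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first, matching the ordering of the list on this page.
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy data, invented purely for illustration.
toy_embeddings = {
    "Safety-relevant features": [1.0, 0.0],
    "AI Safety": [1.0, 0.1],
    "Unrelated entity": [0.0, 1.0],
    "Already linked entity": [1.0, 0.2],
}
toy_typed_edges = {("Safety-relevant features", "Already linked entity")}

candidates = related_by_similarity(
    "Safety-relevant features", toy_embeddings, toy_typed_edges
)
```

With this toy data, only "AI Safety" is returned: "Unrelated entity" falls below the threshold, and "Already linked entity" is excluded because it already has a typed edge.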

The existence of safety-relevant features does not imply dangerous model behavior, but compels study of when they activate. · claim · 0.845
Cautionary interpretive claim; models having these features is expected from pretraining data.
We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content. · claim · 0.784
SAEs uncover safety-relevant representations that might be monitored or controlled.
Safety benchmarks · concept · 0.740
Evaluation framework whose validity is called into question by the presence of evaluation awareness.
Feature engineering · concept · 0.736
Domain of techniques for constructing informative features from raw data; covariance pooling is a feature engineering method for token sequences.
AI Safety · concept · 0.736
The project of ensuring AI systems do not harm humans (and other animals); sometimes in tension with AI welfare.
Action Features · concept · 0.729
Dual interpretation of features: in addition to responding to inputs, features also act to increase the probability of specific output tokens.
Feature Sparsity · concept · 0.727
Property that features activate on only a small fraction of inputs; it enables compressed sensing and is what allows superposition to work.
In A/4, features functionally memorize 'MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE' via an FSA-like feature chain · finding · 0.720
Demonstrates mechanistic memorization via feature assemblies in superposition.