claim
active
claim:the-existence-of-safety-relevant-features-does-not-imply-dangerous-model-behavior-but-compels-study-of-when-they-activate

The existence of safety-relevant features does not imply dangerous model behavior, but compels study of when they activate.

Cautionary interpretive claim: the presence of these features in a model is expected, given its pretraining data.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.