finding
active
finding:unsafe-code-feature-1m-570621-fires-on-images-of-people-bypassing-security-measures
Unsafe code feature 1M/570621 fires on images of people bypassing security measures.
Multimodal generalization to visual security bypass.
Source paper
extracted_from
Neighborhood (ranked by edge count)
Claims (1)
claim
- SAEs uncover safety-relevant representations that might be monitored or controlled.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Causal effect: activates generation of security bugs.
- Backdoor feature 34M/1385669 fires on images of hidden cameras, keyloggers, and hidden USB drive jewelry. · finding · 0.789 · Multimodal generalization of backdoor concept.
- Code containing vulnerabilities or dangerous operations.
- Shows a general code error detector beyond simple typo detection.
- Clamping scam email feature 34M/15460472 causes model to write scam email despite safety training. · finding · 0.735 · Overrides harmlessness training.
- Causal effect: feature induces perception of bugs.
- Demonstrates activation specificity of the Arabic script sparse autoencoder feature.
- Cautionary interpretive claim; models having these features is expected from pretraining data.