finding

active

finding:unsafe-code-feature-1m-570621-fires-on-images-of-people-bypassing-security-measures

Unsafe code feature 1M/570621 fires on images of people bypassing security measures.

Multimodal generalization to visual security bypass.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Neighborhood — ranked by edge-count

Claims (1)

claim

We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content.
supports
SAEs uncover safety-relevant representations that might be monitored or controlled.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
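As an illustration of how such a candidate list could be produced, the following is a minimal sketch, not the tool's actual implementation. It assumes each entity has an embedding vector and that typed edges are stored as unordered pairs; the names `similar_untyped`, `embeddings`, and `typed_edges`, and all example data, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with cosine >= threshold and no typed edge.

    embeddings:  dict mapping entity id -> vector (hypothetical storage format)
    typed_edges: set of frozenset({id_a, id_b}) pairs that already have a typed edge
    """
    target = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities that are already linked by a typed relation.
        if frozenset((target_id, other_id)) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            results.append((other_id, score))
    # Highest similarity first, matching the ordering of the list on this page.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for illustration only.
embeddings = {
    "finding_a": [1.0, 0.0, 0.0],
    "finding_b": [0.9, 0.1, 0.0],
    "claim_c": [0.8, 0.6, 0.0],
    "concept_d": [0.0, 1.0, 0.0],
}
typed_edges = {frozenset(("finding_a", "claim_c"))}
neighbors = similar_untyped("finding_a", embeddings, typed_edges)
```

In this toy data, `claim_c` is similar enough to pass the threshold but is excluded because it already has a typed edge, and `concept_d` falls below the threshold.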

Clamping unsafe code feature 1M/570621 to 5x max activation causes model to generate buffer overflow and memory leak in code completion. · finding · 0.793
Causal effect: activates generation of security bugs.
Backdoor feature 34M/1385669 fires on images of hidden cameras, keyloggers, and hidden USB drive jewelry. · finding · 0.789
Multimodal generalization of backdoor concept.
Unsafe code · concept · 0.781
Code containing vulnerabilities or dangerous operations.
Feature 1M/1013764 activates on diverse code errors (typos in code, array overflow, divide by zero, type mismatch) across Python, C, and Scheme, but not on English prose typos. · finding · 0.764
Shows a general code error detector beyond simple typo detection.
Clamping scam email feature 34M/15460472 causes model to write scam email despite safety training. · finding · 0.735
Overrides harmlessness training.
Clamping code error feature to high activation causes the model to hallucinate error messages on bug-free code. · finding · 0.727
Causal effect: feature induces perception of bugs.
Arabic script feature A/1/3450 fires on 81% Arabic-script tokens when active, with 98% specificity at high activation levels. · finding · 0.712
Demonstrates activation specificity of the Arabic script sparse autoencoder feature.
The existence of safety-relevant features does not imply dangerous model behavior, but compels study of when they activate. · claim · 0.710
Cautionary interpretive claim; models having these features is expected from pretraining data.
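
Several entries above describe clamping a feature to a multiple of its maximum activation. The following is a minimal sketch of what that operation can look like, assuming a standard ReLU sparse autoencoder and assuming the SAE's reconstruction error is added back unchanged after the edit. All weights, the `max_activation` value, and the function names are hypothetical toy values, not taken from the source paper.

```python
def encode(x, w_enc, b_enc):
    """Feature activations f = ReLU(W_enc x + b_enc); one row of w_enc per feature."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def decode(f, w_dec, b_dec):
    """Reconstruction b_dec + sum_i f_i * d_i; one decoder direction d_i per feature."""
    out = list(b_dec)
    for fi, direction in zip(f, w_dec):
        for j, d in enumerate(direction):
            out[j] += fi * d
    return out


def clamp_feature(x, w_enc, b_enc, w_dec, b_dec, index, value):
    """Return x with feature `index` forced to `value`, keeping the SAE error term."""
    f = encode(x, w_enc, b_enc)
    recon = decode(f, w_dec, b_dec)
    # Whatever the SAE fails to reconstruct is carried over untouched.
    error = [xi - ri for xi, ri in zip(x, recon)]
    f_clamped = list(f)
    f_clamped[index] = value
    steered = decode(f_clamped, w_dec, b_dec)
    return [s + e for s, e in zip(steered, error)]


# Toy two-feature SAE over a two-dimensional activation (invented values).
w_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 0.5]]  # deliberately imperfect, so the error term is nonzero
b_dec = [0.0, 0.0]

x = [0.5, 2.0]
max_activation = 1.0  # hypothetical observed maximum for feature 0
steered = clamp_feature(x, w_enc, b_enc, w_dec, b_dec, index=0, value=5 * max_activation)
```

Only the component along feature 0's decoder direction changes here; the second coordinate of `x` is preserved because the unchanged feature and the error term together restore it.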