finding

active

finding:activating-the-base64-feature-a-1-2357-causes-the-model-to-generate-base64-text

Activating the base64 feature A/1/2357 causes the model to generate base64 text

Causal validation of base64 feature function via pinned feature sampling
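
A minimal sketch of what pinning a feature could look like, assuming the layer activation is decomposed as a weighted sum of dictionary directions. The function name `pin_feature`, the list-based representation, and the toy numbers are illustrative assumptions, not the authors' implementation. In an actual experiment the modified activation would presumably be written back into the model at each sampling step.

```python
def pin_feature(activation, feature_acts, decoder, index, value):
    """Return a copy of `activation` with one feature's coefficient set to `value`.

    activation   -- layer activation vector (list of floats)
    feature_acts -- current coefficient of each dictionary feature
    decoder      -- decoder[i] is the direction written by feature i
    index, value -- which feature to pin, and the value to pin it to

    All names here are hypothetical. Only the pinned feature's contribution
    changes; any reconstruction error in `activation` is left untouched.
    """
    delta = value - feature_acts[index]
    return [a + delta * d for a, d in zip(activation, decoder[index])]


# Toy example: two features in a two-dimensional activation space.
decoder = [[1.0, 0.0], [0.0, 2.0]]
feature_acts = [0.5, 0.0]          # feature 1 is currently inactive
activation = [0.5, 0.0]            # equals 0.5 * decoder[0]
pinned = pin_feature(activation, feature_acts, decoder, index=1, value=3.0)
```

Adding the difference along the decoder direction, rather than rebuilding the activation from the features alone, keeps whatever the dictionary fails to reconstruct, so the intervention changes only the pinned feature.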

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Learned features reflect the functionality of the model and not just the data distribution, as evidenced by interpretable downstream effects not used in dictionary learning
supports
Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Base64 feature A/1/2357 and B/1/2165 have activation correlation of 0.85 · finding · 0.821
Universality of base64 feature across two transformers
Feature 1M/1013764 activates on diverse code errors (typos in code, array overflow, divide by zero, type mismatch) across Python, C, Scheme, but not on English prose typos · finding · 0.758
Shows a general code error detector beyond simple typo detection.
Pinning A/1/3450 to maximum observed value causes model to generate Arabic text from numeric prefix context · finding · 0.755
Causal validation that the Arabic feature has the predicted downstream effect on generation
Single base64 feature A/0/45 splits into three distinct features in A/1: letter-specific, digit-specific, and ASCII-encoding-specific · finding · 0.748
Concrete example of feature splitting revealing unexpected model structure
Hebrew feature A/1/416 and B/1/1901 have activation correlation of 0.92 · finding · 0.725
Universality of Hebrew script feature across two transformers
Steering Language Models With Activation Engineering (Turner et al., 2023) · concept · 0.719
Foundational paper introducing activation steering methodology used in this work
Clamping unsafe code feature 1M/570621 to 5x max activation causes model to generate buffer overflow and memory leak in code completion · finding · 0.715
Causal effect: activates generation of security bugs.
Clamping code error feature to high activation causes the model to hallucinate error messages on bug-free code · finding · 0.709
Causal effect: feature induces perception of bugs.
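
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is an illustration under assumptions: the entity embeddings, the `typed_edges` set, and the function names are hypothetical and do not describe the actual pipeline behind this page. Results are sorted by descending score, matching the order of the scores shown above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """List (entity_id, score) pairs similar to the target but not linked to it.

    embeddings  -- dict mapping entity id to its embedding vector (assumed)
    typed_edges -- set of frozenset({id_a, id_b}) pairs that already share
                   a typed relation (assumed representation)
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vector in embeddings.items():
        if other_id == target_id:
            continue  # an entity is never its own neighbor
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already connected by a typed edge
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data: "d" is identical to "a" but excluded because a typed edge exists.
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.0]}
typed_edges = {frozenset(("a", "d"))}
neighbors = related_by_similarity("a", embeddings, typed_edges)
```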