finding
active
finding:activating-the-base64-feature-a-1-2357-causes-the-model-to-generate-base64-text
Activating the base64 feature A/1/2357 causes the model to generate base64 text
Causal validation of base64 feature function via pinned feature sampling
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Universality of base64 feature across two transformers
- Shows a general code error detector beyond simple typo detection.
- Pinning A/1/3450 to maximum observed value causes model to generate Arabic text from numeric prefix context · finding · 0.755 · Causal validation that the Arabic feature has the predicted downstream effect on generation
- Concrete example of feature splitting revealing unexpected model structure
- Universality of Hebrew script feature across two transformers
- Foundational paper introducing activation steering methodology used in this work
- Causal effect: activates generation of security bugs.
- Causal effect: feature induces perception of bugs.