finding
active
finding:attribution-graph-for-the-princess-lost-her-crown-reveals-a-femaleness-signal-pathway-from-princess-through-attention
Attribution graph for 'the princess lost her crown' reveals a femaleness signal pathway from 'princess' through attention
One component of the minimal subnetwork for predicting 'her', discovered via VPD attribution graph.
Source paper
extracted_from
Neighborhood, ranked by edge-count
Claims (1)
claim
- Central claim that VPD successfully uncovers genuine mechanisms.
Communities (4)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Tracing information flow through weight matrices and attention heads using attribution graphs to identify causally important subcomponents in language models.
- Tracing information flow through parameter subcomponents to isolate computational mechanisms for specific model predictions, using tools like attribution graphs and VPD.
- Mechanistic tracing of information flow through attention and MLP subcomponents for pronoun prediction tasks.
Concepts (1)
concept
- Femaleness signal routing · associated_with · One of two interpretable pathways in the subnetwork for predicting 'her', routing a 'femaleness' signal from 'princess' forward through attention.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Detailed case study demonstrating how VPD subnetworks can be traced to reveal multiple interpretable computational pathways for a single prediction.
- Attribution graph reveals a pathway that detects the verb 'lost' and upweights object pronouns · finding · 0.818 · Second component of the subnetwork for 'her', complementing the femaleness signal.
- Shows how VPD-identified subnetworks can be analyzed to reveal interpretable pathways of computation (e.g., gender signal routing, syntactic role detection).
- Feature attribution (gradient-based) correlates 0.8 with ablation effects on the 'John' and 'Kobe' examples. · finding · 0.750 · Validation of attribution as a fast proxy for causal importance.
- Gradient-based technique using SAE features to estimate causal effects on completions; used to corroborate NLA findings.
- Characterizes what is on the far end of the Assistant Axis away from the Assistant.
- Stronger version: all cognition attributions rely on observable behavior.
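
The "Related by similarity" rule stated above (cosine ≥ 0.65 and no typed edge to this entity) can be sketched roughly as follows. This is only an illustration under stated assumptions: the entity names, the two-dimensional embeddings, and the helper names `cosine` and `related_by_similarity` are hypothetical and are not taken from the system that produced this page; only the 0.65 threshold comes from the text.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities that are similar to `target`
    but share no typed edge with it, sorted by descending similarity."""
    # Entities already connected to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything already linked
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy data (hypothetical): names and vectors are made up for illustration.
embeddings = {
    "femaleness_pathway": [1.0, 0.0],
    "lost_verb_pathway": [0.8, 0.6],
    "femaleness_concept": [1.0, 0.1],
    "unrelated_entity": [0.0, 1.0],
}
typed_edges = {("femaleness_concept", "femaleness_pathway")}  # e.g. associated_with

candidates = related_by_similarity("femaleness_pathway", embeddings, typed_edges)
print(candidates)
```

In this toy setup, `femaleness_concept` is excluded because it already has a typed edge and `unrelated_entity` falls below the threshold, so only `lost_verb_pathway` is returned as a candidate for a new edge or a duplicate check.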