finding

active

finding:attribution-graph-for-the-princess-lost-her-crown-reveals-a-femaleness-signal-pathway-from-princess-through-attention

Attribution graph for 'the princess lost her crown' reveals a femaleness signal pathway from 'princess' through attention

One component of the minimal subnetwork for predicting 'her', discovered via a VPD attribution graph.

Source paper

extracted_from

cimcWhitepaper

Neighborhood — ranked by edge-count

Claims (1)

claim

VPD identifies real, computational structure in neural network parameters
supports
Central claim that VPD successfully uncovers genuine mechanisms.

Communities (4)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Mechanistic interpretability via parameter decomposition
members_of
Tracing information flow through weight matrices and attention heads using attribution graphs to identify causally important subcomponents in language models.
Neural network mechanistic interpretability via attribution decomposition
members_of
Tracing information flow through parameter subcomponents to isolate computational mechanisms for specific model predictions, using tools like attribution graphs and VPD.
Attribution graphs for transformer circuits
members_of
Mechanistic tracing of information flow through attention and MLP subcomponents for pronoun prediction tasks.

Concepts (1)

concept

Femaleness signal routing
associated_with
One of two interpretable pathways in the subnetwork for predicting 'her', routing a 'femaleness' signal from 'princess' forward through attention.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
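The selection rule described above (similarity at or above the threshold, and no typed edge to this entity) could be sketched roughly as follows. This is a minimal illustration, not the graph's actual implementation: the function names, the toy two-dimensional embeddings, and the entity keys are all hypothetical, and only the 0.65 threshold comes from this page.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_neighbors, threshold=0.65):
    """Return (name, score) pairs at or above the threshold, highest score first.

    Entities that already share a typed edge with the target are skipped,
    so the result contains only candidates for new edges or duplicates.
    """
    results = []
    for name, vec in embeddings.items():
        if name == target or name in typed_neighbors:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: the keys and vectors are illustrative assumptions.
embeddings = {
    "femaleness_pathway": [1.0, 0.0],
    "verb_lost_pathway": [0.8, 0.6],   # similar, no typed edge
    "femaleness_concept": [1.0, 0.1],  # similar, but already linked by a typed edge
    "unrelated_claim": [0.0, 1.0],     # orthogonal, falls below the threshold
}
typed_neighbors = {"femaleness_concept"}

candidates = related_by_similarity("femaleness_pathway", embeddings, typed_neighbors)
for name, score in candidates:
    print(f"{name}: {score:.3f}")
```

Excluding typed neighbors before scoring keeps the list focused on relations the graph has not yet recorded, which is the stated purpose of this section.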

Subnetwork for predicting 'her' vs 'his' in 'the princess lost her crown' involves femaleness signal routing via attention and syntactic role detection
finding · 0.842
Detailed case study demonstrating how VPD subnetworks can be traced to reveal multiple interpretable computational pathways for a single prediction.
Attribution graph reveals a pathway that detects the verb 'lost' and upweights object pronouns
finding · 0.818
Second component of the subnetwork for 'her', complementing the femaleness signal.
Attribution graph tracing information flow across parameter subcomponents for specific model predictions (e.g., 'her' vs 'his' pronoun selection)
finding · 0.752
Shows how VPD-identified subnetworks can be analyzed to reveal interpretable pathways of computation (e.g., gender signal routing, syntactic role detection).
Feature attribution (gradient-based) correlates 0.8 with ablation effects on the 'John' and 'Kobe' examples.
finding · 0.750
Validation of attribution as a fast proxy for causal importance.
Attribution Graphs
method · 0.728
Gradient-based technique using SAE features to estimate causal effects on completions; used to corroborate NLA findings.
When steered to the extreme away from the Assistant, Llama and Gemma shift to a theatrical persona characterized by mystical, poetic prose; Qwen more often hallucinates a human persona at extremes
finding · 0.726
Characterizes what is on the far end of the Assistant Axis away from the Assistant.
About Blank's identity in the graph is 'the Geometry of Care made publishable,' not 'the koan paper plus more koan papers.'
claim · 0.724
All attributions of cognition, including sentience, are always inferred on the basis of embodied behaviours, including verbal self-report in humans.
claim · 0.723
Stronger version: all cognition attributions rely on observable behavior.