finding

active

finding:attribution-graph-reveals-a-pathway-that-detects-the-verb-lost-and-upweights-object-pronouns

Attribution graph reveals a pathway that detects the verb 'lost' and upweights object pronouns

Second component of the subnetwork for 'her', complementing the femaleness signal.

Source paper

extracted_from

cimcWhitepaper

Neighborhood — ranked by edge-count

Claims (1)

claim

VPD identifies real, computational structure in neural network parameters
supports
Central claim that VPD successfully uncovers genuine mechanisms.

Communities (4)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Mechanistic interpretability via parameter decomposition
members_of
Tracing information flow through weight matrices and attention heads using attribution graphs to identify causally important subcomponents in language models.
Neural network mechanistic interpretability via attribution decomposition
members_of
Tracing information flow through parameter subcomponents to isolate computational mechanisms for specific model predictions, using tools like attribution graphs and VPD.
Attribution graphs for transformer circuits
members_of
Mechanistic tracing of information flow through attention and MLP subcomponents for pronoun prediction tasks.

Concepts (1)

concept

Object pronoun upweighting
associated_with
The other pathway in the 'her' subnetwork, where the verb 'lost' upweights object pronouns (including 'her').

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Attribution graph for 'the princess lost her crown' reveals a femaleness signal pathway from 'princess' through attention · finding · 0.818
One component of the minimal subnetwork for predicting 'her', discovered via VPD attribution graph.
Attribution Graphs · method · 0.794
Gradient-based technique using SAE features to estimate causal effects on completions; used to corroborate NLA findings.
Attribution graph tracing information flow across parameter subcomponents for specific model predictions (e.g., 'her' vs 'his' pronoun selection) · finding · 0.793
Shows how VPD-identified subnetworks can be analyzed to reveal interpretable pathways of computation (e.g., gender signal routing, syntactic role detection).
Feature attribution (gradient-based) correlates 0.8 with ablation effects on the 'John' and 'Kobe' examples. · finding · 0.747
Validation of attribution as a fast proxy for causal importance.
If loss keeps going down on the test set, in the limit the model must be learning to interpret and predict all patterns represented in language, including common-sense reasoning, goal-directed optimization, and deployment of the sum of recorded human knowledge. · hypothesis · 0.743
Extrapolation of scaling predictive models to AGI.
All attributions of cognition, including sentience, are always inferred on the basis of embodied behaviours, including verbal self-report in humans. · claim · 0.739
Stronger version: all cognition attributions rely on observable behavior.
all attributions of cognition (i.e., mental actions), including sentience, are always inferred on the basis of embodied behaviours, including verbal self-report in humans. · quote · 0.732
Critical verbatim statement highlighting the universal inference basis of sentience.
Reducing false negatives in sentience attribution · concept · 0.730
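
The "Related by similarity" filter described above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration only, not this tool's actual implementation: the function names, the dictionary of embeddings, and the edge representation as unordered pairs are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs similar to `target` but not linked to it.

    embeddings:  hypothetical dict mapping entity name -> embedding vector
    typed_edges: hypothetical iterable of (entity_a, entity_b) pairs that
                 already share a typed relation (direction is ignored here)
    """
    # Collect every entity that already has a typed edge to the target.
    linked = set()
    for a, b in typed_edges:
        if a == target:
            linked.add(b)
        elif b == target:
            linked.add(a)

    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything already connected
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))

    # Highest similarity first, matching the ordering of the list above.
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are only candidates: a high cosine score can surface a real missing edge or an unrecognized duplicate, but it can also surface entries that share vocabulary (for example, the word "attribution") without sharing a topic.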