finding

active

finding:das-achieves-100-iia-for-combined-negation-and-lexical-entailment-model-on-monli-at-layer-9-intervention-size-256

DAS achieves 100% IIA for combined Negation and Lexical Entailment model on MoNLI at Layer 9, intervention size 256

DAS finds a perfect abstraction relation between BERT and a symbolic algorithm with negation and lexical entailment variables.
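To make the terms in this finding concrete, the following is a minimal sketch of a distributed interchange intervention and of how interchange intervention accuracy (IIA) is scored. It is not the paper's implementation. The function names, the hidden size of 768, and the random orthogonal matrix are assumptions for illustration; in DAS the rotation is learned by gradient descent, and only the intervention size of 256 comes from the finding itself.

```python
import numpy as np

def interchange_intervention(base_h, source_h, rotation, k):
    """Swap the first k rotated coordinates of base_h with those of source_h.

    base_h, source_h: (batch, hidden_dim) hidden states (hypothetical inputs).
    rotation: (hidden_dim, hidden_dim) orthogonal matrix defining the basis.
    k: intervention size, i.e. the number of rotated dimensions to swap.
    """
    base_rot = base_h @ rotation        # base hidden state in the rotated basis
    source_rot = source_h @ rotation    # source hidden state in the rotated basis
    base_rot[..., :k] = source_rot[..., :k]
    return base_rot @ rotation.T        # map back to the original neuron basis

def iia(predicted, expected):
    """Fraction of counterfactual predictions that match the high-level model."""
    return float(np.mean(np.asarray(predicted) == np.asarray(expected)))

rng = np.random.default_rng(0)
hidden_dim, k = 768, 256  # 768 is an assumed hidden size; 256 is from the finding

# Stand-in for a learned rotation: a random orthogonal matrix from a QR decomposition.
rotation, _ = np.linalg.qr(rng.standard_normal((hidden_dim, hidden_dim)))

base_h = rng.standard_normal((4, hidden_dim))
source_h = rng.standard_normal((4, hidden_dim))
patched = interchange_intervention(base_h, source_h, rotation, k)
```

In the rotated basis, the first 256 coordinates of `patched` come from the source input and the rest from the base input. An IIA of 100% would mean that, after such interventions, the network's outputs always match what the symbolic algorithm predicts when the corresponding high-level variable is swapped.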

Source paper

extracted_from

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations

(2023) · Atticus Geiger · Zhengxuan Wu · Christopher Potts · Thomas Icard +1

Neighborhood — ranked by edge-count

Claims (2)

claim

DAS overcomes the localist limitation of prior causal abstraction by allowing individual neurons to play multiple roles via non-standard bases
supports
Central claim motivating DAS over prior methods.
DAS finds better alignments than brute-force search by using gradient descent rather than exhaustive discrete search
supports
Second central claim of the paper.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

DAS achieves 100% IIA on hierarchical equality task with |N|=16, intervention size 8, Layer 1 · finding · 0.795
DAS discovers a perfect alignment between the feed-forward network and the Both Equality Relations high-level model.
Lexical entailment representation decomposes into word identity sub-representations with ~0.97-0.98 IIA (Lexeme Subspace of Lexical Entailment) · finding · 0.778
In contrast to hierarchical equality, lexical entailment in BERT decomposes into representations of word identities, not a single abstract relation.
With unrestricted vocabulary, models occasionally respond in non-English Yes/No equivalents (e.g., Sí, Nein) after truth-direction interventions · finding · 0.737
Suggestive evidence for language-independent truth representation in LLMs
What appears to be a representation of lexical entailment in BERT is actually a data structure of two word identity representations, not an encoding of the entailment relation · claim · 0.734
Key asymmetry between hierarchical equality and NLI experiments; BERT stores identities rather than the abstract relation.
The difficulty boundary for truth directions replicates across all four tested models (Llama-3.2-3B, Llama-3.1-8B, Gemma-2-2b, Gemma-2-9b); generalization to F3-F5 remains consistently low regardless of model size or family · finding · 0.733
Establishes generalizability of the core difficulty-boundary finding across model families.
Non-linear ϕ_nonlin achieves near-perfect IIA on distributive law task for both And-Or and And-Or-And algorithms, eliminating linear/identity map differences · finding · 0.732
Corroborating result on additional task confirming main paper findings
DAS behavioral loss achieves IIA of 0.997±0.001 on synthetic 10-class dataset training/test sets · finding · 0.728
IIA baseline for DAS behavioral loss on synthetic dataset
Truth-related directions reliably emerge at 60–75% of normalized layer depth in Qwen and Gemma models · finding · 0.727
Experiment 1 finding localizing where truth can be causally mediated