quote
active
Functional Faithfulness, whereby intervening on a specific internal feature induces coherent and predictable shifts across multiple linguistic dimensions aligned with the target semantic attribute.
Definition of the newly named empirical effect.
Source paper
extracted_from · (2026) · Ruikang Zhang · Shuo Wang · Q. Su
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Empirical effect observed in feature intervention experiments.
- Empirical effect where intervening on one feature induces coherent shifts across multiple linguistic dimensions aligned with the target attribute.
- The condition that commitments are fulfilled.
- Idea that decomposed components must causally reflect the model's computation; enforced by adversarial ablation in VPD.
- Property of explanations that accurately reflect the actual causal mechanisms of the model being explained.
- Claims that the re-emergence and adaptation of the segmentation clock qualifies as intelligent behavior.
- Demonstrates mechanistic memorization via feature assemblies in superposition.