finding
active
finding:functional-faithfulness-intervening-on-a-specific-internal-feature-induces-coherent-and-predictable-shifts-across-multiple-linguistic-dimensions-aligned-with-the-target-semantic-attributeFunctional Faithfulness: intervening on a specific internal feature induces coherent and predictable shifts across multiple linguistic dimensions aligned with the target semantic attribute.
Empirical effect observed in feature intervention experiments.
Source paper
extracted_from(2026) · Ruikang Zhang · Shuo Wang · Q. Su
Neighborhood — ranked by edge-count
Papers (1)
paper
Claims (1)
claim
- The authors' interpretive assertion based on their steering results.
Communities (2)
community
- Relational self, care & alivenessmembers_ofSelf as dynamic functional center defined by care, coherence, and substrate-neutral cognition
- Investigates whether LLMs genuinely represent complex concepts or merely simulate understanding through pattern matching, using interventional methods and epistemic humility frameworks.
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Definition of the newly named empirical effect.
- Empirical effect where intervening on one feature induces coherent shifts across multiple linguistic dimensions aligned with the target attribute.
- Idea that decomposed components must causally reflect the model's computation; enforced by adversarial ablation in VPD.
- The condition that commitments are fulfilled.
- Property of explanations that accurately reflect the actual causal mechanisms of the model being explained.
- Claims that the re-emergence and adaptation of the segmentation clock qualifies as intelligent behavior.
- Interpretive observation about asymmetry in generalization of Claude's trained values
- Demonstrates mechanistic memorization via feature assemblies in superposition