concept
active
concept:mechanistic-faithfulnessMechanistic faithfulness
Idea that decomposed components must causally reflect the model's computation; enforced by adversarial ablation in VPD.
Neighborhood — ranked by edge-count
Papers (1)
paper
Claims (1)
claim
- Methodological claim that the adversarial ablation approach ensures decomposed components causally correspond to computation.
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- The condition that commitments are fulfilled.
- Empirical effect where intervening on one feature induces coherent shifts across multiple linguistic dimensions aligned with the target attribute.
- Definition of the newly named empirical effect.
- Property of explanations that accurately reflect the actual causal mechanisms of the model being explained.
- Empirical effect observed in feature intervention experiments.
- A correctness condition requiring assertions to align with the program's beliefs.
- Cited regarding possibility of encoding misaligned reasoning in benign chains-of-thought
- Explanations derived from model internals that describe the biological mechanism of variant effect.