concept
active
concept:faithfulness-of-explanationsFaithfulness of Explanations
Property of explanations that accurately reflect the actual causal mechanisms of the model being explained.
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- faithfulnessrelated_toThe condition that commitments are fulfilled.
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Empirical effect where intervening on one feature induces coherent shifts across multiple linguistic dimensions aligned with the target attribute.
- Definition of the newly named empirical effect.
- Idea that decomposed components must causally reflect the model's computation; enforced by adversarial ablation in VPD.
- Cited regarding possibility of encoding misaligned reasoning in benign chains-of-thought
- A correctness condition requiring assertions to align with the program's beliefs.
- Methodological principle applied to favor identity over correlation between signed evaluation and felt valence
- A set of evaluation criteria for AI assistants.
- Empirical effect observed in feature intervention experiments.