Faithfulness of Explanations

Property of explanations that accurately reflect the actual causal mechanisms of the model being explained.

Neighborhood — ranked by edge-count

paper

concept

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Functional Faithfulnessconcept0.802
Empirical effect where intervening on one feature induces coherent shifts across multiple linguistic dimensions aligned with the target attribute.
Functional Faithfulness, whereby intervening on a specific internal feature induces coherent and predictable shifts across multiple linguistic dimensions aligned with the target semantic attribute.quote0.780
Definition of the newly named empirical effect.
Mechanistic faithfulnessconcept0.779
Idea that decomposed components must causally reflect the model's computation; enforced by adversarial ablation in VPD.
Measuring Faithfulness in Chain-of-Thought Reasoning (Lanham et al. 2023)concept0.770
Cited regarding possibility of encoding misaligned reasoning in benign chains-of-thought
sincerityconcept0.760
A correctness condition requiring assertions to align with the program's beliefs.
Inference to the Best Explanationconcept0.754
Methodological principle applied to favor identity over correlation between signed evaluation and felt valence
Helpful, Honest, Harmlessframework0.754
A set of evaluation criteria for AI assistants.
Functional Faithfulness: intervening on a specific internal feature induces coherent and predictable shifts across multiple linguistic dimensions aligned with the target semantic attribute.finding0.746
Empirical effect observed in feature intervention experiments.