causal bypassing

Confound in which a model's naming of an injected concept reflects the injection's direct effect on the logits rather than metacognitive awareness; raised by Morris & Plunkett

Neighborhood — ranked by edge-count

thinker

Morris, A.
introduces
Co-author of LessWrong post arguing that LLM introspection tests must rule out causal bypassing
Plunkett, D.
introduces
Co-author with Morris on causal bypassing critique of introspection tests

claim

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Causal abstraction · concept · 0.806
A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs
Causal Tracing · concept · 0.792
Mechanistic interpretability technique for locating factual associations, mentioned as future work direction.
Causal Intervention via Activation Shift · method · 0.790
Intervening in model forward pass by adding/subtracting probe direction to group (b) hidden states to flip truth judgments
Causal Mediation · concept · 0.789
Whether an internal direction causally controls a target behavior, verified by intervention success
Causal Scrubbing · method · 0.788
Method by Chan et al. 2022 for rigorously testing interpretability hypotheses via interventions
causal regularities · concept · 0.782
The structural-realist grounding for self-evidencing after the bounded self is relinquished.
Causal Mechanism · concept · 0.780
Function determining the value of a variable based on its causal parents in an acyclic causal model.
Causal Intervention via Activation Shifting · method · 0.775
Method of shifting hidden state activations along probe directions to cause the model to treat false statements as true and vice versa; evaluated on OOD inputs