concept
active
concept:causal-bypassing · causal bypassing
Confound in which a model's naming of an injected concept reflects the injection's direct effect on output logits rather than metacognitive awareness; raised by Morris & Plunkett
Neighborhood — ranked by edge-count
Thinkers (2)
thinker
- Morris, A. · introduces · Co-author of LessWrong post arguing that LLM introspection tests must rule out causal bypassing
- Plunkett, D. · introduces · Co-author with Morris on causal bypassing critique of introspection tests
Claims (1)
claim
- Primary negative finding reinterpreted as methodological claim: binary paradigm is invalid for testing introspection
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs
- Mechanistic interpretability technique for locating factual associations, mentioned as future work direction.
- Intervening in model forward pass by adding/subtracting probe direction to group (b) hidden states to flip truth judgments
- Whether an internal direction causally controls a target behavior, verified by intervention success
- Method by Chan et al. 2022 for rigorously testing interpretability hypotheses via interventions
- The structural-realist grounding for self-evidencing after the bounded self is relinquished.
- Function determining the value of a variable based on its causal parents in an acyclic causal model.
- Method of shifting hidden state activations along probe directions to cause the model to treat false statements as true and vice versa; evaluated on OOD inputs