concept
active
concept:causal-bypassing

causal bypassing

Confound in LLM introspection tests: when a model names an injected concept, the report may reflect the injection's direct effect on the output logits rather than metacognitive awareness. Raised by Morris & Plunkett.
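One way to picture the confound is the toy sketch below. It is not taken from Morris & Plunkett; the vocabulary, the identity unembedding `W_U`, the hidden state, and the injection strength are all hypothetical, hand-picked numbers. It only shows that adding a concept direction to the hidden state can raise that concept's logit directly, so the toy "model" names the concept without any mechanism that inspects its own state.

```python
# Toy, hand-built numbers; nothing here comes from a real model.
vocab = ["the", "ocean", "bread", "justice"]

# Hypothetical unembedding: row i maps the hidden state to the logit of vocab[i].
W_U = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]


def logits(hidden):
    """Linear readout: one dot product per vocabulary row."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in W_U]


def top_token(hidden):
    """Return the token with the highest logit."""
    scores = logits(hidden)
    return vocab[scores.index(max(scores))]


# Hypothetical hidden state at the position where the model answers
# "which concept was injected?".
hidden = [2.0, 0.5, 0.1, -0.3]

# Inject the "ocean" concept by adding its unembedding row, scaled.
concept = vocab.index("ocean")
strength = 5.0  # assumed injection strength
injected = [h + strength * w for h, w in zip(hidden, W_U[concept])]

before = top_token(hidden)    # answer without injection
after = top_token(injected)   # answer with injection

print(before, "->", after)
```

The answer changes to the injected concept purely through the linear readout. An introspection test that counts this as a correct self-report would need a control that separates such a direct path from a genuinely metacognitive one.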

Neighborhood — ranked by edge-count

Thinkers (2)

thinker
  • Morris, A.
    introduces
    Co-author of LessWrong post arguing that LLM introspection tests must rule out causal bypassing
  • Plunkett, D.
    introduces
    Co-author with Morris of the causal bypassing critique of LLM introspection tests

Claims (1)

claim

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Causal abstraction · concept · 0.806
    A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs
  • Causal Tracing · concept · 0.792
    Mechanistic interpretability technique for locating factual associations, mentioned as future work direction.
  • Intervening in model forward pass by adding/subtracting probe direction to group (b) hidden states to flip truth judgments
  • Causal Mediation · concept · 0.789
    Whether an internal direction causally controls a target behavior, verified by intervention success
  • Causal Scrubbing · method · 0.788
    Method by Chan et al. (2022) for rigorously testing interpretability hypotheses via interventions
  • The structural-realist grounding for self-evidencing after the bounded self is relinquished.
  • Causal Mechanism · concept · 0.780
    Function determining the value of a variable based on its causal parents in an acyclic causal model.
  • Method of shifting hidden state activations along probe directions to cause the model to treat false statements as true and vice versa; evaluated on OOD inputs