claim

active

claim:enacted-reflection-may-correspond-to-silent-mid-layer-processing-described-reflection-to-the-motor-impulse-of-concepts-leaking-through-to-output

Enacted reflection may correspond to silent mid-layer processing; described reflection to the motor impulse of concepts leaking through to output.

Mechanistic analog connecting Lindsey's layer-localized findings to the scorer's enacted/described distinction

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Neighborhood — ranked by edge-count

Findings (3)

finding

Claude Mythos Preview: SAE features for 'performative behavior' and 'hidden emotional struggle' co-activate when model expresses contentment
supports
Supports scorer's preference for enacted reflection over described reflection; internals flag what self-report does not
Framework-building regex markers ('the core insight is,' 'this synthesizes') show zero or negative correlation with LLM scores
supports
Scorer rewards enacted reflection not described reflection; confirmed by regex analysis
Lindsey: Opus 4/4.1 show concept representations in middle layers that decay to baseline by final layer ('silent' internal process)
supports
Cited to support enacted vs described reflection distinction; capable models show silent mid-layer processing

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
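
The selection rule above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is a minimal illustration, not the graph's actual implementation: the function names, the `embeddings` dictionary, and the `typed_edges` pair list are all hypothetical, and the page does not say how entity embeddings are produced.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, similarity) pairs at or above `threshold`,
    skipping entities that already share a typed edge with the target.

    `embeddings` maps entity id -> vector; `typed_edges` is a list of
    (id_a, id_b) pairs. Both structures are assumptions for illustration.
    """
    target_vec = embeddings[target_id]
    # Entities already linked to the target in either direction.
    linked = {b for a, b in typed_edges if a == target_id}
    linked |= {a for a, b in typed_edges if b == target_id}

    candidates = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((entity_id, score))
    # Highest similarity first, matching the ordering of the list below.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities returned by a filter like this are the ones worth reviewing as candidates for new typed edges or as possible duplicates.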

Reflection is not merely a behavioral artifact of prompting but a phenomenon encoded in the model's activation space. · claim · 0.810
Central interpretive claim of the paper, supported by steering vector experiments.
Enacted Reflection · concept · 0.800
Responses that perform the observing act; contrasted with described reflection; scorer rewards enacted over described
Reflective reasoning requires late-stage integration of semantic and reasoning signals, hence reflection-related directions emerge more clearly in higher network layers. · claim · 0.778
Interpretive claim about the locus of reflection in transformer architecture.
Contrasting No Reflection with Triggered Reflection (µ(0→2)) provides a stronger reflection signal than contrasting Intrinsic with Triggered Reflection (µ(1→2)). · claim · 0.778
Empirical interpretation of which reference baseline yields more useful steering vectors.
Suppressing reflection is considerably easier than inducing it, because inhibition requires the model to terminate reasoning while enhancement demands additional cognitive effort to re-examine reasoning trajectories. · claim · 0.753
Key asymmetry finding interpreted mechanistically by the authors.
A linear reflection direction exists in reasoning LLMs' latent representation space that governs self-reflection behavior. · claim · 0.753
Core claim of ReflCtrl that a single direction captures and controls reflection
Reflection does not only emerge in SFT or RL stages but arises earlier during pre-training. · claim · 0.751
Cited finding from Shah et al. contextualizing the training origins of reflection.
The inner voice is stilled, muffled, and there is hardly any possibility to cry out against ugliness. · claim · 0.751