finding

active

finding:earlier-less-capable-models-exhibit-a-larger-gap-between-think-and-don-t-think-representation-strength

Earlier/less capable models exhibit a larger gap between "think" and "don't think" representation strength

Claude 3 models show a larger gap between the "think" and "don't think" conditions than newer models such as Opus 4.1.
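
As a rough illustration of the quantity being compared, the sketch below computes a think/don't-think gap. It assumes that representation strength is measured as cosine similarity between a model's activations and a concept vector for the target word. The function names (`cosine_similarity`, `think_gap`) and the toy vectors are hypothetical and are not taken from the source paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def think_gap(think_acts, dont_think_acts, concept_vector):
    """Difference in representation strength between the two conditions.

    Assumption: strength is the cosine similarity between the activations
    recorded under each instruction and the concept vector.
    """
    strength_think = cosine_similarity(think_acts, concept_vector)
    strength_dont = cosine_similarity(dont_think_acts, concept_vector)
    return strength_think - strength_dont


# Toy 2-D vectors, purely illustrative (not measured data).
concept = [1.0, 0.0]
acts_think = [1.0, 0.0]       # perfectly aligned with the concept vector
acts_dont_think = [1.0, 1.0]  # only partially aligned

gap = think_gap(acts_think, acts_dont_think, concept)
```

Under this reading, a larger `gap` for one model than for another would correspond to the finding's "larger gap"; how the paper aggregates over layers, tokens, or prompts is not specified here.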

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Communities (3)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
LLM introspective awareness of injected concepts
members_of
Probing Claude and other models for internal detection of artificially injected thoughts across layers.
Latent capacity, representation, and internal models
members_of
Studies of how neural systems (biological and AI) encode implicit environmental models and adaptive capacities that may be gated or hidden from observable behavior.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

All models exhibit above-baseline representation of the think word when instructed to think about it · finding · 0.818
In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
There are fewer representations competent for N tasks than M<N tasks, so training more general models should yield fewer possible solutions · hypothesis · 0.815
Selective pressure toward convergence via task generality
We stress that in today’s models, this capacity is highly unreliable and context-dependent; however, it may continue to develop with further improvements to model capabilities. · quote · 0.803
Caveat and forward-looking statement from the abstract.
Bigger models are more likely to converge to a shared representation than smaller models · hypothesis · 0.788
Selective pressure toward convergence via model capacity
The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions · hypothesis · 0.786
Hypothesis explaining negative correlation between reflection rate and accuracy without implying reflection is harmful
Models that are competent all represent data in a similar way; all strong models are alike, each weak model is weak in its own way · claim · 0.785
Author's interpretation of the VTAB alignment results echoing Tolstoy
Introspective capabilities have threshold effects requiring very large models; 70B models are barely on the threshold, and independent researchers lack access to larger models. · claim · 0.780
Practical bottleneck explaining why these phenomena are not widely studied.
We hypothesize that native self-report, fine-tuned introspection models, and trained activation-to-language systems will show different performance on bias-resistant localization and strength benchmarks · hypothesis · 0.778
Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge