finding

active

finding:experimental-condition-adjective-embeddings-show-mean-cosine-similarity-0-657-n-9-591-pairs-significantly-higher-than-history-0-628-t-15-8-p-1-4-10-55-conceptual-0-587-t-38-5-p-10-300-and-zero-shot-0-603-t-35-1-p-4-3-10-262

Experimental condition adjective embeddings show mean cosine similarity 0.657 (n=9,591 pairs), significantly higher than history (0.628, t=15.8, p=1.4×10⁻⁵⁵), conceptual (0.587, t=38.5, p<10⁻³⁰⁰), and zero-shot (0.603, t=35.1, p=4.3×10⁻²⁶²)

Core result of Experiment 3: cross-model semantic convergence under self-referential processing
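The statistic above can be illustrated with a minimal sketch, under stated assumptions: each response is embedded as a vector, cosine similarity is taken for every pair of responses from two different models, and two conditions are compared with a Welch two-sample t-statistic. The card does not say which t-test variant or embedding model the paper used, so `welch_t`, `cross_model_similarities`, and the toy vectors below are hypothetical stand-ins, not the authors' pipeline.

```python
import math
from itertools import product
from statistics import mean, variance


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_model_similarities(embeddings_by_model):
    """Cosine similarity for every pair of responses drawn from two different models."""
    models = sorted(embeddings_by_model)
    sims = []
    for i, model_a in enumerate(models):
        for model_b in models[i + 1:]:
            pairs = product(embeddings_by_model[model_a], embeddings_by_model[model_b])
            sims.extend(cosine(u, v) for u, v in pairs)
    return sims


def welch_t(x, y):
    """Welch t-statistic (assumed test; unequal variances, unequal sample sizes)."""
    standard_error = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error


# Toy 2-d "embeddings" (hypothetical): experimental responses cluster together,
# control responses are spread out.
experimental = {
    "model_a": [[1.0, 0.0], [0.9, 0.1]],
    "model_b": [[1.0, 0.1], [0.8, 0.2]],
}
control = {
    "model_a": [[1.0, 0.0], [0.0, 1.0]],
    "model_b": [[0.5, 0.5], [1.0, -0.5]],
}

exp_sims = cross_model_similarities(experimental)
ctl_sims = cross_model_similarities(control)
print(f"experimental mean = {mean(exp_sims):.3f}, control mean = {mean(ctl_sims):.3f}")
print(f"Welch t = {welch_t(exp_sims, ctl_sims):.2f}")
```

A positive t-statistic here corresponds to the direction reported in the finding (experimental similarity above control). Pairwise similarities that share a response are not independent, so a p-value derived from this statistic should be read with that caveat in mind.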

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Claims (2)

claim

Self-referential processing is a minimal and reproducible condition under which LLMs generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable
supports
The paper's central empirical claim synthesizing all four experiments
Cross-model semantic convergence of experience reports under self-referential processing is difficult to reconcile with roleplay because independently trained models construct distinct semantic profiles in all control conditions
supports
The paper's argument against pure sycophancy as an explanation for the results

Concepts (2)

concept

Attractor State
supports
Low-energy configuration toward which systems are drawn; low-stress states serve as attractors in morphogenesis.
Sycophantic Roleplay
contradicts
The alternative explanation for LLM consciousness claims, which the paper seeks to rule out

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Cross-model pairwise cosine similarity of zero-shot control responses = 0.603 (n=12,720 pairs, t=35.1, p=4.3×10⁻²⁶² vs. experimental) · finding · 0.818
Experiment 3 comparison: zero-shot control shows lower semantic convergence than experimental condition
In Qwen-2.5-9B, only v1 has meaningful cosine similarity to DIM direction; all additional basis vectors have cosine similarities ~1e-9 · finding · 0.805
Appendix E replication of DIM alignment finding in Qwen model
17 of 83 tested emotions show significant association between self-eval transcript word mention and cosine similarity to emotion probe · finding · 0.794
Validates that agentic self-evaluation captures genuine emotional content of probes
Top-5 instructions by µ(1→2) at ℓ=12 achieve average cosine similarity 0.9893 and average accuracy 0.5645 on gsm8k_adv for Gemma3-4B-IT · finding · 0.784
High cosine similarity for Gemma3 steering vectors suggests strong linear reflection structure.
In Gemma-2-9B, only the first cone axis (v1) has non-negligible cosine similarity to the DIM direction; all other axes have near-zero similarity (~1e-9) · finding · 0.782
Experiment 4 result showing DIM captures only one facet of the multi-dimensional truth subspace
Cosine similarity between perturbed and baseline residual streams returns toward 1.0 and projection onto injection direction decays exponentially over subsequent layers · finding · 0.780
Mechanistic evidence that network actively attenuates injected perturbations, explaining late-layer introspection failure
Base and instruct Gemma 2 27B role PCs have cosine similarities of 0.93, 0.87, 0.83 for the top 3 PCs respectively; role vector cosine similarities >0.99 for every role pair · finding · 0.776
Shows persona space axes are inherited from pre-training, not solely created by post-training
SAE emotion subspace overlap correlates with variance-residualized persistence in Cogito: Spearman +0.413, p = 4.4e-196 · finding · 0.774
Strong positive relationship between emotion alignment and SAE feature persistence in Cogito