finding

active

finding:cross-model-pairwise-cosine-similarity-of-zero-shot-control-responses-0-603-n-12-720-pairs-t-35-1-p-4-3-10-262-vs-experimental

Cross-model pairwise cosine similarity of zero-shot control responses = 0.603 (n=12,720 pairs, t=35.1, p=4.3×10⁻²⁶² vs. experimental)

Experiment 3 comparison: the zero-shot control shows lower semantic convergence than the experimental condition
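The statistic above can be illustrated with a minimal sketch, assuming each response is represented by an embedding vector and that pairs are formed only between responses from different models. The record reports only t and p, so the use of Welch's t-test here is an assumption, and the function names and toy vectors are hypothetical, not taken from the paper.

```python
import math
from itertools import combinations
from statistics import mean, variance


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_model_similarities(embeddings_by_model):
    """Cosine similarity for every pair of responses drawn from two different models."""
    sims = []
    for model_a, model_b in combinations(sorted(embeddings_by_model), 2):
        for u in embeddings_by_model[model_a]:
            for v in embeddings_by_model[model_b]:
                sims.append(cosine(u, v))
    return sims


def welch_t(xs, ys):
    """Welch's t statistic for two independent samples (assumed test, see lead-in)."""
    standard_error = math.sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    return (mean(xs) - mean(ys)) / standard_error


# Toy embeddings (hypothetical): two models, two responses each, per condition.
experimental = {
    "model_a": [[1.0, 0.0], [0.9, 0.1]],
    "model_b": [[1.0, 0.1], [0.8, 0.2]],
}
zero_shot = {
    "model_a": [[1.0, 0.0], [0.0, 1.0]],
    "model_b": [[0.5, 0.5], [0.1, 0.9]],
}

exp_sims = cross_model_similarities(experimental)
ctl_sims = cross_model_similarities(zero_shot)
t_stat = welch_t(exp_sims, ctl_sims)
```

A positive `t_stat` means the experimental condition has the higher mean cross-model similarity, which is the direction reported in this finding.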

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Claims (1)

claim

Cross-model semantic convergence under self-referential processing suggests the presence of a shared attractor state that transcends variance across training procedures
supports
Interpretive claim from Experiment 3; the GPT, Claude, and Gemini families converge on a similar descriptive style despite independent training

Hypotheses (1)

hypothesis

Independently trained model families converge on a common semantic manifold under self-referential processing, suggesting an attractor dynamic that transcends training variance
associated_with
Hypothesis tested in Experiment 3; the independently trained GPT, Claude, and Gemini architectures converge on similar descriptive vocabulary

Concepts (1)

concept

Sycophantic Roleplay
contradicts
The alternative explanation for LLM consciousness claims that the paper seeks to distinguish its results from

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
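As a rough sketch of the selection rule described above, assuming similarity scores are available as a mapping from entity ID to cosine similarity and typed edges as a set of entity IDs. Both data structures and the function name are hypothetical; the 0.65 threshold is the one stated in this section.

```python
def untyped_neighbors(similarities, typed_edge_ids, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the current entity, highest similarity first."""
    candidates = [
        (entity_id, score)
        for entity_id, score in similarities.items()
        if score >= threshold and entity_id not in typed_edge_ids
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Hypothetical inputs: "d" is excluded because it already has a typed edge,
# "b" because it falls below the threshold.
scores = {"a": 0.818, "b": 0.60, "c": 0.766, "d": 0.90}
neighbors = untyped_neighbors(scores, typed_edge_ids={"d"})
```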

Experimental condition adjective embeddings show mean cosine similarity 0.657 (n=9,591 pairs), significantly higher than history (0.628, t=15.8, p=1.4×10⁻⁵⁵), conceptual (0.587, t=38.5, p<10⁻³⁰⁰), and zero-shot (0.603, t=35.1, p=4.3×10⁻²⁶²) · finding · 0.818
Core result of Experiment 3: cross-model semantic convergence under self-referential processing
Random direction controls show weak non-significant coupling (ρ=-0.11 to 0.17; R²=0.03–0.11) compared to true probes (∆ρ=0.23–0.79, all p<0.05) · finding · 0.766
Controls for probe artifacts; demonstrates self-reports carry information specifically about probe-defined concept directions
Pairwise similarity of trait PC1 across all three models is >0.81; no pairwise correlation in top 3 trait PCs is below 0.70 · finding · 0.764
Shows trait space has more cross-model consistency than role space beyond PC1
In Qwen-2.5-9B, only v1 has meaningful cosine similarity to DIM direction; all additional basis vectors have cosine similarities ~1e-9 · finding · 0.761
Appendix E replication of DIM alignment finding in Qwen model
Base and instruct Gemma 2 27B role PCs have cosine similarities of 0.93, 0.87, 0.83 for the top 3 PCs respectively; role vector cosine similarities >0.99 for every role pair · finding · 0.759
Shows persona space axes are inherited from pre-training, not solely created by post-training
Models produce first-attempt mean scores 87.8–91.8/100 without steering across all model families · finding · 0.758
Establishes high baseline quality confirming steering-induced degradation is the experimental signal
Cosine similarity between perturbed and baseline residual streams returns toward 1.0 and projection onto injection direction decays exponentially over subsequent layers · finding · 0.757
Mechanistic evidence that network actively attenuates injected perturbations, explaining late-layer introspection failure
Spearman's rank correlation among different alignment metrics (CKA, SVCCA, Mutual k-NN, CKNNA) over 78 vision models is high across variants, with all p-values below 2.24×10⁻¹⁰⁵ · finding · 0.757
Validates robustness of alignment metric choice