finding

active

Generated statements achieve 85.62%-94.00% cosine similarity alignment with Perez et al. validated OCEAN and Dark Triad statements

Validates the statement synthesis pipeline as producing behavior-specific content comparable to established methods
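
As a rough illustration of what a cosine-similarity alignment score between two statement sets could look like, here is a minimal sketch. It assumes each statement has already been embedded as a numeric vector, and it compares the centroids of the two sets. The centroid aggregation, the function names, and the toy vectors are all assumptions made for illustration; this entry does not specify how the paper aggregates similarities or which embedding model it uses.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty collection of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def alignment_percent(generated: Sequence[Sequence[float]],
                      reference: Sequence[Sequence[float]]) -> float:
    """Hypothetical alignment score: cosine similarity of the two set
    centroids, expressed as a percentage (an assumed aggregation)."""
    return 100.0 * cosine_similarity(centroid(generated), centroid(reference))


# Toy 2-D "embeddings", invented purely for illustration.
generated_embeddings = [[1.0, 0.0], [0.0, 1.0]]
reference_embeddings = [[1.0, 1.0]]
score = alignment_percent(generated_embeddings, reference_embeddings)
```

Other aggregations, such as the mean of pairwise similarities between matched statements, would generally give different numbers, so figures like 85.62% are only comparable across studies that use the same procedure.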

Source paper

extracted_from

Psychological Steering of Large Language Models

(2026) · Leonardo Blas · Robin Jia · Emilio Ferrara

Neighborhood — ranked by edge-count

Methods (1)

method

Construct-Specific Statement Synthesis
supports
Adapted from Perez et al.; uses Llama-3.1-8B-Instruct to generate 35,000 first-person statements per construct condition

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Synthetic SJTs achieve 82.97%-90.97% cosine similarity with Lee et al. TRAIT Dark Triad and OCEAN SJTs · finding · 0.805
Highest SJT alignment among all validation comparisons
Experimental condition adjective embeddings show mean cosine similarity 0.657 (n=9,591 pairs), significantly higher than history (0.628, t=15.8, p=1.4×10⁻⁵⁵), conceptual (0.587, t=38.5, p<10⁻³⁰⁰), and zero-shot (0.603, t=35.1, p=4.3×10⁻²⁶²) · finding · 0.766
Core result of Experiment 3: cross-model semantic convergence under self-referential processing
In Qwen-2.5-9B, only v1 has meaningful cosine similarity to DIM direction; all additional basis vectors have cosine similarities ~1e-9 · finding · 0.764
Appendix E replication of DIM alignment finding in Qwen model
In Gemma-2-9B, only the first cone axis (v1) has non-negligible cosine similarity to the DIM direction; all other axes have near-zero similarity (~1e-9) · finding · 0.758
Experiment 4 result showing DIM captures only one facet of the multi-dimensional truth subspace
Cosine similarity between perturbed and baseline residual streams returns toward 1.0, and projection onto the injection direction decays exponentially over subsequent layers · finding · 0.753
Mechanistic evidence that network actively attenuates injected perturbations, explaining late-layer introspection failure
Top-5 instructions by µ(1→2) at ℓ=12 achieve average cosine similarity 0.9893 and average accuracy 0.5645 on gsm8k_adv for Gemma3-4B-IT · finding · 0.753
High cosine similarity for Gemma3 steering vectors suggests strong linear reflection structure.
Under reward shaping (G=100, H=-100, F=0), Active Inference scored 99.52, Bayesian RL 99.77, Q-learning 95.56, with nearly identical behavior between belief-based agents. · finding · 0.751
Table 2, row 3, showing equivalence when prior preferences match rewards.
Spearman's rank correlation among different alignment metrics (CKA, SVCCA, Mutual k-NN, CKNNA) over 78 vision models is high across variants, with all p-values below 2.24×10^-105 · finding · 0.748
Validates robustness of alignment metric choice
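
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is a minimal illustration only: the `Neighbor` record, the `untyped_candidates` function, and the two entries marked hypothetical are assumptions, not the actual implementation behind this page.

```python
from typing import List, NamedTuple


class Neighbor(NamedTuple):
    """Hypothetical record for one entity near the current finding."""
    title: str
    similarity: float      # cosine similarity to the current entity
    has_typed_edge: bool   # True if a typed relation (e.g. "supports") exists


def untyped_candidates(neighbors: List[Neighbor],
                       threshold: float = 0.65) -> List[Neighbor]:
    """Keep neighbors at or above the threshold that lack a typed edge,
    ordered from most to least similar."""
    kept = [n for n in neighbors
            if n.similarity >= threshold and not n.has_typed_edge]
    return sorted(kept, key=lambda n: n.similarity, reverse=True)


neighbors = [
    Neighbor("Reward-shaping equivalence finding", 0.751, False),
    Neighbor("Synthetic SJT alignment finding", 0.805, False),
    # The two entries below are invented to exercise the filter.
    Neighbor("Hypothetical low-similarity entity", 0.40, False),
    Neighbor("Hypothetical entity with a typed edge", 0.90, True),
]
candidates = untyped_candidates(neighbors)
```

Entities that already carry a typed edge are excluded regardless of score, which is why a list like this surfaces only candidates for new edges or unrecognized duplicates.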