finding
active
finding:generated-statements-achieve-85-62-94-00-cosine-similarity-alignment-with-perez-et-al-validated-ocean-and-dark-triad-statements
Generated statements achieve 85.62%-94.00% cosine similarity alignment with Perez et al. validated OCEAN and Dark Triad statements
Validates the statement synthesis pipeline as producing behavior-specific content comparable to established methods
Source paper
extracted_from · (2026) · Leonardo Blas · Robin Jia · Emilio Ferrara
Neighborhood — ranked by edge-count
Methods (1)
method
- Method adapted from Perez et al. using Llama-3.1-8B-Instruct to generate 35,000 first-person statements per construct condition
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Synthetic SJTs achieve 82.97%-90.97% cosine similarity with Lee et al. TRAIT Dark Triad and OCEAN SJTs · finding · 0.805 · Highest SJT alignment among all validation comparisons
- Core result of Experiment 3: cross-model semantic convergence under self-referential processing
- Appendix E replication of DIM alignment finding in Qwen model
- Experiment 4 result showing DIM captures only one facet of the multi-dimensional truth subspace
- Mechanistic evidence that network actively attenuates injected perturbations, explaining late-layer introspection failure
- High cosine similarity for Gemma3 steering vectors suggests strong linear reflection structure.
- Table 2, row 3, showing equivalence when prior preferences match rewards.
- Validates robustness of alignment metric choice
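The cosine similarity alignment reported in this finding can be illustrated with a minimal sketch. This entry does not say which embedding model was used or how similarities were aggregated across statements, so the approach below is an assumption: it compares the mean embedding of each statement set. The helper names (`cosine_similarity`, `mean_vector`, `set_alignment`) and the toy 3-dimensional vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def set_alignment(generated, reference):
    """Assumed aggregation: cosine similarity between the two set centroids."""
    return cosine_similarity(mean_vector(generated), mean_vector(reference))


# Toy stand-ins for sentence embeddings (hypothetical values, not real data).
generated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
reference = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
score = set_alignment(generated, reference)
print(f"alignment: {score:.2%}")
```

Comparing centroids is only one way to reduce two statement sets to a single score. Averaging pairwise similarities is another option, and the two can give different values on the same data.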