finding

active

finding:embedding-based-construct-classifiers-achieve-mean-accuracy-and-f1-macro-of-95-96-across-ocean-hexaco-dark-tetrad-cmni-cfni-constructs

Embedding-based construct classifiers achieve mean accuracy and F1-macro of 95.96% across OCEAN, HEXACO, Dark Tetrad, CMNI, and CFNI constructs

Validates the use of lightweight classifiers as a replacement for frontier LLM evaluation during alpha sweeps
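For reference, a minimal sketch of how a mean accuracy and a mean F1-macro across constructs could be computed from per-construct predictions. The function names, the `toy` dictionary, and the labels in it are hypothetical illustrations, not data or code from the paper; the paper's exact averaging procedure is not stated on this page.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_scores(per_construct):
    """Average accuracy and F1-macro over constructs.

    per_construct maps a construct name to a (y_true, y_pred) pair.
    """
    accs = [accuracy(t, p) for t, p in per_construct.values()]
    f1s = [f1_macro(t, p) for t, p in per_construct.values()]
    return sum(accs) / len(accs), sum(f1s) / len(f1s)


# Hypothetical toy labels (1 = statement expresses the construct, 0 = it does not).
toy = {
    "construct_a": ([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 0, 1, 1]),
    "construct_b": ([1, 0, 1, 0], [1, 0, 1, 0]),
}
mean_acc, mean_f1 = mean_scores(toy)
```

Macro averaging weights every class equally, so a classifier cannot score well by favoring the majority class alone.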

Source paper

extracted_from

Psychological Steering of Large Language Models

(2026) · Leonardo Blas · Robin Jia · Emilio Ferrara

Neighborhood — ranked by edge-count

Methods (1)

method

Embedding-based Construct Logistic Classifier
supports
Logistic regression classifier on Qwen3-Embedding-0.6B embeddings, trained on construct statements; used to measure construct presence during alpha sweeps
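A minimal sketch of the general technique, assuming binary labels and plain batch gradient descent. The two-dimensional vectors below are hypothetical stand-ins for real sentence embeddings, and the learning rate, epoch count, and absence of regularization are assumptions; the page does not state the paper's training settings.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    """Probability that embedding x expresses the construct."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # gradient of log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical toy "embeddings": label 1 = construct present, 0 = absent.
X = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
y = [1, 1, 1, 0, 0, 0]

w, b = train_logistic(X, y)
predictions = [int(predict_proba(w, b, x) >= 0.5) for x in X]
```

In an alpha sweep, the same `predict_proba` call would be applied to the embedding of each steered generation, giving a construct-presence score per steering strength without querying a frontier LLM.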

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Psychopathy construct classifier achieves 90.50% accuracy, lowest among all evaluated constructs · finding · 0.759
Lowest individual classifier performance
Neuroticism construct classifier achieves 99.00% accuracy on held-out statement corpus · finding · 0.759
Highest individual classifier performance among OCEAN constructs
Experimental condition adjective embeddings show mean cosine similarity 0.657 (n=9,591 pairs), significantly higher than history (0.628, t=15.8, p=1.4×10⁻⁵⁵), conceptual (0.587, t=38.5, p<10⁻³⁰⁰), and zero-shot (0.603, t=35.1, p=4.3×10⁻²⁶²) · finding · 0.755
Core result of Experiment 3: cross-model semantic convergence under self-referential processing
Under reward shaping (G=100, H=-100, F=0), Active Inference scored 99.52, Bayesian RL 99.77, Q-learning 95.56, with nearly identical behavior between belief-based agents. · finding · 0.751
Table 2, row 3, showing equivalence when prior preferences match rewards.
Within-family factual generalization (F0-F2) is consistently strong across all models and prompt settings. · finding · 0.745
Establishes a reliable baseline for factual truth direction universality within simple factual recall.
PMI computed from color cooccurrences in CIFAR-10 images yields a perceptual color representation closely matching both CIELAB space and language model embeddings (SimCSE, RoBERTa) · finding · 0.744
Validates theoretical PMI convergence claim on real data
Generated statements achieve 85.62%-94.00% cosine similarity alignment with Perez et al. validated OCEAN and Dark Triad statements · finding · 0.736
Validates the statement synthesis pipeline as producing behavior-specific content comparable to established methods
Model organism accuracy on BigCodeBench: 56.8% (pre-fine-tuned: 59.1%); steering toward deployment: 55.2%; steering toward evaluation: 43.1% · finding · 0.735
Quantifies performance cost of fine-tuning and steering; deployment steering has minimal accuracy cost.