finding

active

finding:neuroticism-construct-classifier-achieves-99-00-accuracy-on-held-out-statement-corpus

Neuroticism construct classifier achieves 99.00% accuracy on held-out statement corpus

Highest individual classifier performance among OCEAN constructs

Source paper

extracted_from

Psychological Steering of Large Language Models

(2026) · Leonardo Blas · Robin Jia · Emilio Ferrara

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Psychopathy construct classifier achieves 90.50% accuracy, lowest among all evaluated constructs · finding · 0.796
Lowest individual classifier performance
Embedding-based construct classifiers achieve mean accuracy and F1-macro of 95.96% across OCEAN, HEXACO, Dark Tetrad, CMNI, CFNI constructs · finding · 0.759
Validates use of lightweight classifiers as replacement for frontier LLM evaluation during alpha sweeps
Under reward shaping (G=100, H=-100, F=0), Active Inference scored 99.52, Bayesian RL 99.77, Q-learning 95.56, with nearly identical behavior between belief-based agents. · finding · 0.750
Table 2, row 3, showing equivalence when prior preferences match rewards.
Abstract nouns elicit the highest introspective awareness rates; all concept categories show nonzero detection · finding · 0.736
Opus 4.1 is most effective at recognizing injected abstract concepts (e.g., justice, peace) but detects other categories too.
82% of features in 1M SAE had maximum Pearson correlation ≤ 0.3 with any MLP neuron, and manual inspection showed no semantic resemblance. · finding · 0.731
SAE features are not simply mirroring individual neurons.
We hypothesize that native self-report, fine-tuned introspection models, and trained activation-to-language systems will show different performance on bias-resistant localization and strength benchmarks · hypothesis · 0.727
Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge
NPI mechanism in pythia-1b moves negation feature through complementiser 'that', auxiliary verb, and main verb across layers before predicting NPI 'any' · finding · 0.726
Mechanistic finding from CausalGym case study showing multi-step information movement in NPI mechanism
Verbal reports (the Turing Test) and homology to human brains are utterly inadequate criteria for assessing the status of novel, unconventional agents that offer no familiar touchstone of phylogeny or anatomy. · claim · 0.726
Core claim that standard criteria fail for novel agents.