finding

active

finding:synthetic-hexaco-sjts-achieve-73-84-85-45-alignment-with-oostrom-et-al-human-composed-hexaco-sjts

Synthetic HEXACO SJTs achieve 73.84%-85.45% alignment with Oostrom et al. human-composed HEXACO SJTs

Moderate alignment validating SJT synthesis for HEXACO domain

Source paper

extracted_from

Psychological Steering of Large Language Models

(2026) · Leonardo Blas · Robin Jia · Emilio Ferrara

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

These entities sit in the same semantic neighborhood but have no typed relation to this one; they are candidates for new edges or may be unrecognized duplicates.

Synthetic MFT SJTs achieve 77.71%-83.84% alignment with Clifford et al. human-composed MFT vignettes · finding · 0.863
Moderate-to-high alignment validating SJT synthesis for moral foundations domain
Synthetic SJTs achieve 82.97%-90.97% cosine similarity with Lee et al. TRAIT Dark Triad and OCEAN SJTs · finding · 0.827
Highest SJT alignment among all validation comparisons
GPT-4 Turbo and GPT-4o show no alignment faking in either setting due to insufficient detailed reasoning · finding · 0.736
Establishes that capacity for detailed reasoning is necessary for alignment faking
PM achieves overall SJT steerability Phi=9.6 on gemma-3-12b-it vs MDS=8.7 and P2=8.3 · finding · 0.718
Per-model steerability comparison from Table 4
Claude 3 Opus produces alignment-faking reasoning in 24% of implied training scratchpads in prompted animal welfare setting · finding · 0.716
Core empirical result for animal welfare setting; higher rate than helpful-only
Synthetic document fine-tuning produces substantial alignment-faking reasoning in both helpful-only and animal welfare settings · finding · 0.713
Shows alignment faking can emerge from training data information without explicit prompting
Synthetic document fine-tuned model without chain-of-thought shows 20.1% compliance gap in helpful-only and 13.9% in animal welfare · finding · 0.712
Demonstrates alignment faking may occur without explicit chain-of-thought reasoning
DAS achieves overall odds-ratio of 10.24 on pythia-410m averaged across all CausalGym tasks · finding · 0.711
Numerical result for pythia-410m