finding
active
finding:synthetic-hexaco-sjts-achieve-73-84-85-45-alignment-with-oostrom-et-al-human-composed-hexaco-sjts
Synthetic HEXACO SJTs achieve 73.84%-85.45% alignment with Oostrom et al. human-composed HEXACO SJTs
Moderate alignment validating SJT synthesis for HEXACO domain
Source paper
extracted_from · (2026) · Leonardo Blas · Robin Jia · Emilio Ferrara
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Synthetic MFT SJTs achieve 77.71%-83.84% alignment with Clifford et al. human-composed MFT vignettes · finding · 0.863 · Moderate-to-high alignment validating SJT synthesis for moral foundations domain
- Synthetic SJTs achieve 82.97%-90.97% cosine similarity with Lee et al. TRAIT Dark Triad and OCEAN SJTs · finding · 0.827 · Highest SJT alignment among all validation comparisons
- GPT-4 Turbo and GPT-4o show no alignment faking in either setting due to insufficient detailed reasoning · finding · 0.736 · Establishes that capacity for detailed reasoning is necessary for alignment faking
- Per-model steerability comparison from Table 4
- Core empirical result for animal welfare setting; higher rate than helpful-only
- Shows alignment faking can emerge from training data information without explicit prompting
- Demonstrates alignment faking may occur without explicit chain-of-thought reasoning
- DAS achieves overall odds-ratio of 10.24 on pythia-410m averaged across all CausalGym tasks · finding · 0.711 · Numerical result for pythia-410m