finding
active
finding:only-13-27-of-520-non-manipulated-alpha-trends-achieve-r2-0-95-contrasting-with-47-69-for-manipulated-trends
Only 13.27% of 520 non-manipulated alpha trends achieve R² ≥ 0.95, contrasting with 47.69% for manipulated trends
Control comparison showing near-linearity is specific to the targeted manipulation direction
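The comparison above can be sketched in a few lines. This is only an illustration under stated assumptions: the source does not say how R² is computed, so the sketch assumes an ordinary least-squares line fitted to trait score against alpha, one fit per trend. The names `linear_r2` and `share_near_linear` and the toy data are hypothetical, not taken from the paper.

```python
from typing import Sequence, Tuple

Trend = Tuple[Sequence[float], Sequence[float]]  # (alphas, trait scores)


def linear_r2(alphas: Sequence[float], scores: Sequence[float]) -> float:
    """R^2 of a least-squares line (with intercept) fitted to scores vs. alphas."""
    n = len(alphas)
    if n < 2 or n != len(scores):
        raise ValueError("need two or more points and equal-length inputs")
    mean_x = sum(alphas) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in alphas)
    syy = sum((y - mean_y) ** 2 for y in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(alphas, scores))
    if sxx == 0:
        raise ValueError("alphas must not all be identical")
    if syy == 0:
        return 1.0  # a constant trend is fitted exactly by a flat line
    # For simple linear regression, R^2 equals the squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)


def share_near_linear(trends: Sequence[Trend], threshold: float = 0.95) -> float:
    """Fraction of trends whose linear fit reaches R^2 >= threshold."""
    hits = sum(1 for alphas, scores in trends if linear_r2(alphas, scores) >= threshold)
    return hits / len(trends)


# Toy data (assumed for illustration): one linear trend, one oscillating trend.
toy_trends = [
    ([0.0, 1.0, 2.0, 3.0], [0.0, 2.0, 4.0, 6.0]),
    ([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 0.0, 1.0]),
]
toy_share = share_near_linear(toy_trends)
```

As an arithmetic check on the headline figure, 13.27% of 520 trends corresponds to 69 trends (69 / 520 ≈ 0.1327).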
Source paper
extracted_from · (2026) · Leonardo Blas · Robin Jia · Emilio Ferrara
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Demonstrates alignment with Linear Representation Hypothesis: target trait steers approximately linearly with alpha
- Demonstrates severity of training-deployment gap after RL
- Suggests fundamental differences in learning dynamics between normal and chronic perception models
- Shows RL reduces but does not eliminate unmonitored non-compliance
- Key quantitative evidence that detection signal is identical to global logit shift confound
- Demonstrates that RL can reinforce alignment-faking reasoning rather than eliminate it
- Random latent ablation produces slight increase in ESR rate (3.8% to 4.2%), not statistically significant · finding · 0.743 · Control result confirming OTD ablation effect is specific to those latents, not a general ablation artifact
- Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy