finding
active
finding:alignment-depth-correlation-with-lift-weakened-from-rho-0-77-n-19-to-rho-0-28-ns-n-28-original-claim-was-overfit
Alignment depth correlation with lift weakened from rho=-0.77 (N=19) to rho=-0.28 NS (N=28); original claim was overfit
Heavy alignment includes both CAI (low lift) and heavy-RLHF (high lift); the predictor is alignment type, not depth
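The gap between the two reported correlations can be checked with a quick significance calculation. The sketch below assumes rho is a rank correlation tested against zero with the usual t approximation, t = rho * sqrt((N - 2) / (1 - rho^2)); the source does not state which test was used, and the function names and the hard-coded critical values (taken from a standard t-table) are illustrative only.

```python
import math

# Two-sided 5% critical values of Student's t from a standard t-table,
# keyed by degrees of freedom (df = N - 2). Only the two cases needed here.
T_CRIT_05 = {17: 2.110, 26: 2.056}


def t_statistic(rho: float, n: int) -> float:
    """t approximation for testing rho = 0 with n paired observations."""
    return rho * math.sqrt((n - 2) / (1.0 - rho * rho))


def is_significant(rho: float, n: int) -> bool:
    """True if |t| exceeds the two-sided 5% critical value for df = n - 2."""
    return abs(t_statistic(rho, n)) > T_CRIT_05[n - 2]


# Values as reported in the finding title (assumed to be the same statistic).
for label, rho, n in [("original", -0.77, 19), ("extended", -0.28, 28)]:
    t = t_statistic(rho, n)
    verdict = "significant" if is_significant(rho, n) else "not significant"
    print(f"{label}: rho={rho}, N={n}, t={t:.2f}, {verdict} at the 5% level")
```

Under these assumptions the N=19 estimate clears the 5% threshold while the N=28 estimate does not, which matches the "NS" label in the title. For samples this small an exact or permutation-based p-value may differ somewhat from the t approximation.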
Source paper
extracted_from (2026) · Borzov, Anton
Neighborhood — ranked by edge-count
Hypotheses (1)
hypothesis
- Confirmatory hypothesis supported at p=0.006
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Validates robustness of alignment metric choice
- Characterizes internal structure of the six scoring dimensions
- E3 backbone-specific finding showing three-stage trajectory generalizes across architectures
- Key quantitative evidence that detection signal is identical to global logit shift confound
- E3 finding distinguishing the two geometry summaries; breadth less predictive than peak height
- Core result of Experiment 3: cross-model semantic convergence under self-referential processing
- Open question the authors leave unresolved about interpreting the magnitude of their alignment measurements
- Strength comparison accuracy reaches 73% at layer 3 for injection pair (2,6) vs. 50% chance · finding · 0.736 · Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude