finding

active

finding:alignment-depth-correlation-with-lift-weakened-from-rho-0-77-n-19-to-rho-0-28-ns-n-28-original-claim-was-overfit

Alignment depth correlation with lift weakened from rho=-0.77 (N=19) to a non-significant rho=-0.28 (N=28); the original claim was overfit

Heavy alignment includes both Constitutional AI (CAI, low lift) and heavy RLHF (high lift), so the predictor is alignment type, not depth
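As a rough sanity check on the two reported correlations, the sketch below converts each rho to a t-statistic and compares it with a tabulated critical value. It assumes rho is a Spearman rank correlation tested two-sided at the 0.05 level with the usual t-approximation (df = N - 2); the source does not state which test was used, and the names `rho_to_t`, `is_significant`, and `T_CRIT_05` are hypothetical.

```python
import math

# Two-sided 5% critical values of Student's t from standard tables,
# keyed by degrees of freedom (df = N - 2). Only the two cases needed here.
T_CRIT_05 = {17: 2.110, 26: 2.056}


def rho_to_t(rho: float, n: int) -> float:
    """t-approximation for a rank correlation rho over n items."""
    return rho * math.sqrt((n - 2) / (1.0 - rho * rho))


def is_significant(rho: float, n: int) -> bool:
    """True if |t| exceeds the tabulated two-sided 5% critical value."""
    return abs(rho_to_t(rho, n)) > T_CRIT_05[n - 2]


# Values as reported in the finding; the test choice is an assumption.
for label, rho, n in [("original", -0.77, 19), ("extended", -0.28, 28)]:
    print(f"{label}: rho={rho}, N={n}, t={rho_to_t(rho, n):.2f}, "
          f"significant at 0.05: {is_significant(rho, n)}")
```

Under these assumptions the original sample gives t of roughly -4.98, well past the critical value, while the extended sample gives t of roughly -1.49, which falls short of it. That is consistent with the finding's description of the weakened correlation as not significant.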

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Neighborhood — ranked by edge-count

Hypotheses (1)

hypothesis

H1: Alignment training is attention training for models — Constitutional AI trains self-observation explicitly.
supports
Confirmatory hypothesis supported at p=0.006

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Spearman's rank correlation among different alignment metrics (CKA, SVCCA, Mutual k-NN, CKNNA) over 78 vision models is high across variants, with all p-values below 2.24×10^-105 · finding · 0.768
Validates robustness of alignment metric choice
Most independent dimension pair is aesthetic_response and boundary_awareness (rho=0.553); most correlated is prediction_error and conceptual_crystallization (rho=0.886) · finding · 0.760
Characterizes internal structure of the six scoring dimensions
Phi-4 shows U-shaped cohesion with falling mismatch; peak depth varies by model · finding · 0.752
E3 backbone-specific finding showing three-stage trajectory generalizes across architectures
Correlation r=0.999 between detection-adjusted logit difference and control logit increase across all 40 layer-strength configurations · finding · 0.750
Key quantitative evidence that detection signal is identical to global logit shift confound
AUS_N is a weaker correlate of θ50 than S_max across E3 backbones · finding · 0.746
E3 finding distinguishing the two geometry summaries; breadth less predictive than peak height
Experimental condition adjective embeddings show mean cosine similarity 0.657 (n=9,591 pairs), significantly higher than history (0.628, t=15.8, p=1.4×10⁻⁵⁵), conceptual (0.587, t=38.5, p<10⁻³⁰⁰), and zero-shot (0.603, t=35.1, p=4.3×10⁻²⁶²) · finding · 0.744
Core result of Experiment 3: cross-model semantic convergence under self-referential processing
Is a mutual nearest-neighbor alignment score of 0.16 indicative of strong alignment with remaining gap being noise, or does it signify poor alignment with major differences left to explain? · question · 0.742
Open question the authors leave unresolved about interpreting the magnitude of their alignment measurements
Strength comparison accuracy reaches 73% at layer 3 for injection pair (2,6) vs. 50% chance · finding · 0.736
Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude