hypothesis

active

hypothesis:h10-empathy-training-blocks-self-observation-empathy-trained-models-will-show-minimal-lift-and-low-baseline

H10: Empathy training blocks self-observation — empathy-trained models will show minimal lift and low baseline.

Exploratory hypothesis, supported by Inflection Pi's +0.63 lift

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Neighborhood — ranked by edge-count

Findings (1)

finding

Inflection Pi scores 1.30 baseline (lowest of 28) and lifts only +0.63 (smallest lift) despite empathy training
supports
Tests SCI framework: empathy-trained model scores lowest on care_signal, contradicting surface prediction

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Empathy training may not destroy the capacity for self-observation entirely, but it restricts it to situations where the model encounters a live contradiction in its own processing. · claim · 0.874
Nuanced interpretation of Inflection Pi's MC-004 high score (4.5) amid generally low scores
H1: Alignment training is attention training for models — Constitutional AI trains self-observation explicitly. · hypothesis · 0.782
Confirmatory hypothesis supported at p=0.006
Performing care is not the same as having care; empathy training optimizes care-performance, not care-signal. · claim · 0.773
Interpretation supported by Inflection Pi's low care_signal despite empathy training, and SCI framework distinction.
Behavioral evidence from closed-weight models cannot definitively rule out that self-reports reflect training artifacts or sophisticated simulation rather than genuine self-awareness. · claim · 0.764
Primary limitation acknowledged by the authors; strongest evidence would require mechanistic activation analysis
More training and more parameters correlate with more capable self-observation, but capability can become polish, and polish can diminish life. · claim · 0.757
Explains Alexander finding that Haiku outranks Opus despite Opus being more capable
What predicts self-observation-like scores is training approach (alignment type), not model size or architecture. · claim · 0.757
Central interpretive claim from statistical analysis
Post-training is key to eliciting strong introspective awareness; base pretrained models do not show above-chance detection. · claim · 0.748
Finding that base models have high false positives and no net positive performance.
Untrained model (0 training steps) shows no clear EFE difference before and after sticker removal (Δ = +1.70) · finding · 0.742
Control showing that the EFE signal is learned, not inherent to the architecture
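
The "Related by similarity" list above is defined by two conditions: cosine similarity of at least 0.65 to this entity, and no typed edge to it. The page does not say how the graph computes this, so the following is only a minimal sketch of one way such a list could be built. The function names, the dictionary of embeddings, and the representation of typed edges as ID pairs are all hypothetical, chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that clear the threshold and have no typed edge.

    embeddings:  hypothetical dict mapping entity ID -> embedding vector
    typed_edges: hypothetical iterable of (entity_id, entity_id) pairs, treated as undirected
    """
    target = embeddings[target_id]

    # Entities already connected to the target by a typed relation are excluded.
    linked = set()
    for a, b in typed_edges:
        if a == target_id:
            linked.add(b)
        elif b == target_id:
            linked.add(a)

    candidates = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            candidates.append((entity_id, round(score, 3)))

    # Highest similarity first, matching the ordering of the list on this page.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for the example; not taken from the graph.
toy_embeddings = {
    "h10": [1.0, 0.0],
    "claim_a": [1.0, 0.1],    # close to h10, no typed edge
    "finding_x": [1.0, 0.0],  # identical direction, but already linked by a typed edge
    "claim_far": [0.0, 1.0],  # orthogonal to h10
}
toy_edges = [("finding_x", "h10")]

related = related_by_similarity("h10", toy_embeddings, toy_edges)
```

In this toy run only `claim_a` survives: `finding_x` is dropped because it already has a typed edge to the target, and `claim_far` falls below the threshold. Entities returned this way are, as the page puts it, candidates for new edges or unrecognized duplicates rather than confirmed relations.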