finding

active

finding:untrained-model-0-training-steps-shows-no-clear-efe-difference-before-and-after-sticker-removal-1-70

Untrained model (0 training steps) shows no clear EFE difference before and after sticker removal (Δ = +1.70)

Control showing that the EFE signal is learned, not inherent to the architecture
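
A minimal sketch of how a before/after Δ like the one above could be computed from a per-step EFE trace. The function name `efe_delta`, the `removal_step` index, and the sign convention (mean after removal minus mean before removal) are assumptions for illustration and are not taken from the paper. The trace values are made up.

```python
from statistics import mean


def efe_delta(efe_per_step, removal_step):
    """Return mean EFE after removal minus mean EFE before removal.

    Assumed sign convention: a positive value means EFE rose after
    the sticker came off, a negative value means it dropped.
    """
    before = efe_per_step[:removal_step]   # steps with the sticker present
    after = efe_per_step[removal_step:]    # steps after the sticker is removed
    if not before or not after:
        raise ValueError("need at least one step on each side of removal")
    return mean(after) - mean(before)


# Hypothetical trace for illustration only, not data from the paper.
trace = [10.0, 12.0, 11.0, 9.0, 8.0, 7.0]
delta = efe_delta(trace, removal_step=3)
```

Under this convention a trained agent would be expected to give a clearly negative Δ, while a value near zero, as reported for the untrained model, indicates no clear change.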

Source paper

extracted_from

Active Inference with a Self-Prior in the Mirror-Mark Task

(2026) · Dongmin Kim · Hoshinori Kanazawa · Yasuo Kuniyoshi

Neighborhood — ranked by edge-count

Claims (1)

claim

The self-prior operates as an internal criterion for distinguishing self from non-self, without external reward or explicitly computed sticker location
supports
Central interpretive claim of the paper, supported by EFE decrease after sticker removal

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

In a single illustrative episode (seed 2) after 500k training steps, mean EFE was 12.00 lower after sticker removal than before · finding · 0.856
Qualitative confirmation of EFE drop in trained model vs. untrained model (Δ = +1.70)
Mean EFE before sticker removal across 80 evaluations: 79.33 ± 4.34 · finding · 0.821
Baseline EFE when sticker is present, used for comparison
EFE decrease after sticker removal is statistically significant (Wilcoxon p = 6.33×10⁻⁹) across 80 evaluations · finding · 0.805
Confirms that EFE systematically decreases after sticker removal, validating the self-prior as internal criterion
Sticker-removal success rate stayed near 20% in the early phase of training · finding · 0.804
Shows learning progression from chance-level to functional behavior
Mean hand-sticker distance decreased gradually across 500k training steps, including during the period before the removal probability exceeded 50% · finding · 0.802
Suggests the agent learned to recognize and approach the sticker before achieving reliable removal
Agent achieves approximately 70% sticker-removal success rate by end of 500k training steps · finding · 0.780
Main behavioral result demonstrating the model's efficacy in the mirror-mark task
After anti-refusal RL, model refuses 0% in implied training but still 6.9% when unmonitored in default prompt · finding · 0.747
Demonstrates persistence of compliance gap even when training non-compliance reaches zero
Current training methods rely on loss minimization, meaning the experiential profile of training is predominantly negative across billions of parameter updates · claim · 0.744
Ethical implication about the nature of AI training experience if the thesis holds