finding
active
finding:qwen-2-5-14b-mean-kl-divergence-on-alpaca-prompts-after-truth-direction-ablation-is-0-038
Qwen-2.5-14B mean KL divergence on Alpaca prompts after truth-direction ablation is 0.038
Experiment 3 result showing minimal behavioral drift from truth intervention in Qwen 14B
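The metric named in this finding can be illustrated with a minimal sketch. It assumes that "truth-direction ablation" means projecting a single direction out of the hidden states, and that the reported figure is KL(base || ablated) over next-token distributions, averaged across prompts. The source states neither detail, so both are assumptions. The function names, the toy shapes, and the random "truth direction" are all hypothetical and are not taken from the paper.

```python
import numpy as np

def ablate_direction(hidden, direction):
    """Remove the component of each hidden state along `direction`."""
    unit = direction / np.linalg.norm(direction)
    # Subtract the projection onto the unit direction, row by row.
    return hidden - np.outer(hidden @ unit, unit)

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(base_logits, ablated_logits):
    """Mean KL(base || ablated) over rows (assumed: one row per prompt)."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(ablated_logits)
    kl_per_row = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_row.mean())

# Toy stand-ins for model internals (not real Qwen activations).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))     # 4 prompts, hidden size 8
direction = rng.normal(size=8)       # hypothetical "truth direction"
unembed = rng.normal(size=(8, 16))   # toy unembedding to 16 tokens

base_logits = hidden @ unembed
ablated_hidden = ablate_direction(hidden, direction)
ablated_logits = ablated_hidden @ unembed
drift = mean_kl(base_logits, ablated_logits)
print(drift)
```

A value near zero means the ablated model's output distribution stays close to the base model's on those prompts, which is the sense in which this finding describes 0.038 as minimal behavioral drift.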
Source paper
extracted_from (2025) · Kevin Shengyang Yu · Vaidehi Bulusu · Oscar Yasunaga · Lau, Clayton +4
Neighborhood — ranked by edge-count
Papers (1)
paper
Claims (2)
claim
- Interpretation of KL divergence retention results
- Interpretation of low KL divergence results as validation of the training objective
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Establishes generalizability of the core difficulty-boundary finding across model families.
- Truth-related directions reliably emerge at 60–75% of normalized layer depth in Qwen and Gemma models · finding · 0.763 · Experiment 1 finding localizing where truth can be causally mediated
- Core result of Experiment 2: deception feature suppression sharply increases experience claims
- Supports claim that uncertainty is encoded in reflection direction
- Core empirical result validating the three-level reflection framework on code reasoning.
- Experiment 2 aggregate amplification result showing amplifying deception features strongly suppresses consciousness claims
- Demonstrates reflection redundancy in stronger model on harder math benchmark
- Out-of-domain generalization showing deception features track general representational honesty