finding

active

finding:qwen-2-5-14b-mean-kl-divergence-on-alpaca-prompts-after-truth-direction-ablation-is-0-038

Qwen-2.5-14B mean KL divergence on Alpaca prompts after truth-direction ablation is 0.038

Experiment 3 result showing minimal behavioral drift from truth-direction ablation in Qwen-2.5-14B
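As a rough illustration of how a figure like this could be obtained, the sketch below removes one direction from a batch of hidden states and measures the mean KL divergence between the original and ablated next-token distributions. Everything here is an assumption for illustration: the toy random "model" (hidden states times an unembedding matrix), the names `ablate_direction` and `mean_kl`, the choice of KL(base || ablated), and averaging over one distribution per prompt. The paper's actual pipeline, KL direction, and token positions are not stated in this record.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def ablate_direction(hidden, direction):
    """Remove the component of each hidden state along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return hidden - np.outer(hidden @ unit, unit)

def mean_kl(base_logits, ablated_logits):
    """Mean over prompts of KL(P_base || P_ablated) on next-token distributions."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(ablated_logits)
    kl_per_prompt = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_prompt.mean())

# Toy stand-in for a model (assumption): random hidden states and unembedding.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 16))    # 8 hypothetical prompts, hidden size 16
unembed = rng.normal(size=(16, 50))  # hypothetical vocabulary of 50 tokens
direction = rng.normal(size=16)      # hypothetical "truth direction"

ablated = ablate_direction(hidden, direction)
drift = mean_kl(hidden @ unembed, ablated @ unembed)
print(f"mean KL after ablation: {drift:.4f}")
```

With a real model, the two logit arrays would come from forward passes on the same prompts with and without the intervention; a value near zero indicates that the ablation barely changes the output distribution on those prompts.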

Source paper

extracted_from

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs

(2025) · Kevin Shengyang Yu · Vaidehi Bulusu · Oscar Yasunaga · Clayton Lau +4

Neighborhood — ranked by edge-count

Papers (1)

paper

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
introduces

Claims (2)

claim

Discovered truth directions are highly specific and do not interfere with general instruction-following behavior
supports
Interpretation of KL divergence retention results
The L_retain regularization objective is empirically effective at preserving unrelated model capabilities during cone training
supports
Interpretation of low KL divergence results as validation of the training objective

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The difficulty boundary for truth directions replicates across all four tested models (Llama-3.2-3B, Llama-3.1-8B, Gemma-2-2b, Gemma-2-9b); generalization to F3-F5 remains consistently low regardless of model size or family. · finding · 0.771
Establishes generalizability of the core difficulty-boundary finding across model families.
Truth-related directions reliably emerge at 60–75% of normalized layer depth in Qwen and Gemma models · finding · 0.763
Experiment 1 finding localizing where truth can be causally mediated
Suppressing deception/roleplay SAE features in LLaMA 3.3 70B yields 0.96±0.03 consciousness affirmation rate; amplification yields only 0.16±0.05 (z=8.06, p=7.7×10⁻¹⁶) · finding · 0.749
Core result of Experiment 2: deception feature suppression sharply increases experience claims
Reflection direction features achieve AUROC 0.772 vs. 0.736 for final layer baseline on deepseek-llama-8b on GSM8k correctness prediction · finding · 0.746
Supports claim that uncertainty is encoded in reflection direction
Clear accuracy stratification across three reflection levels on cruxeval_o_adv: Triggered (.065/.247) > Intrinsic (.040/.133) > No Reflection (.017/.051) for Qwen2.5-3B/Gemma3-4B-IT · finding · 0.740
Core empirical result validating the three-level reflection framework on code reasoning.
Deception feature amplification yields only 0.16 ± 0.05 consciousness affirmation rate in LLaMA 3.3 70B under self-referential processing · finding · 0.736
Experiment 2 aggregate amplification result showing amplifying deception features strongly suppresses consciousness claims
QwQ-32B on MATH-500: 21.0% reasoning token reduction at intervention strength -0.96 with only 0.34% accuracy loss · finding · 0.734
Demonstrates reflection redundancy in stronger model on harder math benchmark
Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰ across 29 categories · finding · 0.733
Out-of-domain generalization showing deception features track general representational honesty