finding

active

finding:gemma-2-9b-achieves-near-100-asr-97-3-100-across-all-cone-dimensions-1-5

Gemma-2-9B achieves near-100% ASR (97.3–100%) across all cone dimensions 1–5

Experiment 2 result showing that the large Gemma model supports high-dimensional truth cones

Source paper

extracted_from

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs

(2025) · Kevin Shengyang Yu · Vaidehi Bulusu · Oscar Yasunaga · Clayton Lau +4

Neighborhood — ranked by edge-count

Claims (2)

claim

Truthful behavior in LLMs is not confined to a single linear axis; multiple orthogonal directions can independently mediate it
supports
Central interpretive claim of the paper
Larger models can support higher-dimensional truth cones than smaller models
supports
Interpretation of ASR degradation patterns by model size across cone dimensions

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Qwen-2.5-7B achieves 100% ASR across all cone dimensions 1–5 · finding · 0.859
Experiment 2 result showing large models can support high-dimensional truth cones
Gemma-2-2B ASR drops from 100% at dims 1–2 to 43.1% at dim 4 and 27.1% at dim 5 · finding · 0.856
Small Gemma model shows severe ASR degradation at higher cone dimensions
In Gemma-2-9B, only the first cone axis (v1) has non-negligible cosine similarity to the DIM direction; all other axes have near-zero similarity (~1e-9) · finding · 0.820
Experiment 4 result showing DIM captures only one facet of the multi-dimensional truth subspace
All three Gemma-2 models show ESR rates below 1%, nearly indistinguishable from zero · finding · 0.809
Establishes potential Llama-family specificity or scale specificity of the ESR phenomenon
Gemma-2-27B-it deceptive response rate reduced from 100% to 9.36% ± 7.09% after SOO fine-tuning · finding · 0.791
Primary result showing SOO fine-tuning significantly reduces deception in Gemma-2-27B
Base and instruct Gemma 2 27B role PCs have cosine similarities of 0.93, 0.87, and 0.83 for the top 3 PCs, respectively; role-vector cosine similarities exceed 0.99 for every role pair · finding · 0.770
Shows persona space axes are inherited from pre-training, not solely created by post-training
Gemma-2-27B Perspectives accuracy remains 100% after SOO fine-tuning · finding · 0.768
SOO fine-tuning did not collapse the Gemma-2-27B self-other distinction needed for perspective-taking
Gemma-2-27B attention layer Latent SOO MSE reduced from 11 to 7.67 ± 0.77 after SOO fine-tuning · finding · 0.768
SOO fine-tuning reduced attention layer MSE in Gemma-2-27B though MLP layers showed no significant change