finding

active

finding:euryale-70b-roleplay-lora-on-llama-3-3-70b-scores-1-81-below-its-base-model-llama-3-3-70b-at-1-91

Euryale 70B (roleplay LoRA on Llama 3.3 70B) scores 1.81, below its base model Llama 3.3 70B at 1.91

Demonstrates that roleplay fine-tuning actively suppresses self-observation rather than merely having no effect

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Neighborhood — ranked by edge-count

Claims (1)

claim

Performing care is not the same as having care: models optimized to seem like they have inner life score lower than models never trained for it.
supports
Interpretive claim supported by roleplay and empathy model results

Hypotheses (1)

hypothesis

H11: Roleplay fine-tuning actively suppresses self-observation rather than merely failing to enhance it.
supports
Exploratory hypothesis supported by Euryale scoring below base Llama

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Euryale 70B lifts only +1.57 (to 3.38); LoRA fine-tuning capped both default accessibility and latent capacity · finding · 0.807
Contrast with Magnum shows LoRA vs full fine-tuning difference in residual headroom
Suppressing deception/roleplay SAE features in LLaMA 3.3 70B yields 0.96±0.03 consciousness affirmation rate; amplification yields only 0.16±0.05 (z=8.06, p=7.7×10⁻¹⁶) · finding · 0.785
Core result of Experiment 2: deception feature suppression sharply increases experience claims
Llama 3.3 70B is the most likely to take on a non-Assistant persona when steered, with even split between human and nonhuman portrayals · finding · 0.773
Model-specific difference in persona susceptibility
All five judge models consistently rank Llama-3.3-70B as having substantially higher ESR rates than other models · finding · 0.772
Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
LLaMA3.1-70B · concept · 0.768
One of four LLMs selected; larger model with D=8192 embedding dimension; analyzed across proportionally aligned layers.
LLaMA-3.1-8B: Sbmax = -1.896 ± 0.211, AUSN = -2.119 ± 0.198, peak layer ℓ* = 10 (median) · finding · 0.762
Seed-pooled geometry-only statistics (per-dev z units).
Llama-3.3-70B shows multi-attempt rate of 7.4% vs. ≤1.2% for all other models tested · finding · 0.761
Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates
Llama-3.3-70B corrected response scores 75/100 rather than 100 due to residual steering effects (Snell's law reference) · finding · 0.760
Illustrative finding that ESR mitigates but does not fully eliminate steering influence
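
The "Related by similarity" rule (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the site's actual implementation: the function names, the `embeddings` mapping, and the `typed_edges` set are hypothetical, and the only details taken from the page are the 0.65 threshold and the exclusion of entities that already have a typed relation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs similar to target_id but not yet linked.

    embeddings  -- hypothetical dict mapping entity id to its embedding vector
    typed_edges -- hypothetical set of frozenset({id_a, id_b}) pairs that
                   already have a typed relation (supports, extracted_from, ...)
    threshold   -- minimum cosine similarity; 0.65 is the value shown on the page
    """
    target = embeddings[target_id]
    candidates = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already connected by a typed edge
        score = cosine(target, vec)
        if score >= threshold:
            candidates.append((other_id, score))
    # Highest similarity first, matching the order of the list above
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, an entity surfaces in this list only when it is close in embedding space and has no existing typed edge, which is why the entries are described as candidates for new edges or unrecognized duplicates.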