finding
active
finding:euryale-70b-roleplay-lora-on-llama-3-3-70b-scores-1-81-below-its-base-model-llama-3-3-70b-at-1-91
Euryale 70B (roleplay LoRA on Llama 3.3 70B) scores 1.81, below its base model Llama 3.3 70B at 1.91
Demonstrates that roleplay fine-tuning actively suppresses self-observation rather than merely having no effect
Source paper
extracted_from · Borzov, Anton (2026)
Neighborhood — ranked by edge-count
Claims (1)
claim
- Interpretive claim supported by roleplay and empathy model results
Hypotheses (1)
hypothesis
- Exploratory hypothesis supported by Euryale scoring below base Llama
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Contrast with Magnum shows LoRA vs full fine-tuning difference in residual headroom
- Core result of Experiment 2: deception feature suppression sharply increases experience claims
- Model-specific difference in persona susceptibility
- Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
- One of four LLMs selected; larger model with D=8192 embedding dimension; analyzed across proportionally aligned layers.
- LLaMA-3.1-8B: Sbmax = -1.896 ± 0.211, AUSN = -2.119 ± 0.198, peak layer ℓ* = 10 (median) · finding · 0.762 · Seed-pooled geometry-only statistics (per-dev z units).
- Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates
- Illustrative finding that ESR mitigates but does not fully eliminate steering influence