finding
active
finding:models-produce-first-attempt-mean-scores-87-8-91-8-100-without-steering-across-all-model-families

Models produce first-attempt mean scores 87.8-91.8/100 without steering across all model families

Establishes high baseline quality confirming steering-induced degradation is the experimental signal

Source paper

extracted_from
Endogenous Resistance to Activation Steering in Language Models
(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.