claim
active
claim:fine-tuned-behaviors-are-low-rank-in-weight-space-potentially-making-them-easier-to-manipulate-with-steering-vectors-compared-to-naturalistic-behavior

Fine-tuned behaviors are low-rank in weight space, potentially making them easier to manipulate with steering vectors compared to naturalistic behavior

The source paper cites this claim from Wang et al. 2025a as a reason to prefer SDF (synthetic document fine-tuning) over demonstration fine-tuning when building realistic model organisms.
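As intuition for what "low-rank in weight space" means, the following is a minimal NumPy sketch on a single toy linear layer. It is not the analysis from Wang et al. 2025a or from the source paper. The matrix size, the rank-2 update, and the helper name `effective_rank` are all assumptions chosen for illustration.

```python
import numpy as np

def effective_rank(delta, energy=0.9):
    """Smallest k whose top-k singular values hold `energy` of the squared Frobenius norm."""
    s = np.linalg.svd(delta, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy) + 1)

rng = np.random.default_rng(0)
d = 64  # toy hidden size (assumption)
w_base = rng.normal(size=(d, d))

# Hypothetical fine-tuning update of rank 2: delta = u @ v
u = rng.normal(size=(d, 2))
v = rng.normal(size=(2, d))
w_finetuned = w_base + u @ v

low_rank_k = effective_rank(w_finetuned - w_base)       # at most 2
full_rank_k = effective_rank(rng.normal(size=(d, d)))   # dense update, for contrast

# For any input x, the output shift caused by the update lies in span(u),
# a 2-dimensional subspace of activation space.
x = rng.normal(size=d)
shift = (w_finetuned - w_base) @ x
q, _ = np.linalg.qr(u)                    # orthonormal basis for span(u)
residual = shift - q @ (q.T @ shift)      # component outside span(u)
rel_residual = np.linalg.norm(residual) / np.linalg.norm(shift)
```

In this toy setting the behavior change is confined to a few directions in activation space, so adding or subtracting a vector along those directions can offset it. That is the sense in which a low-rank update would be easier to manipulate with steering vectors.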

Source paper

extracted_from
Steering Evaluation-Aware Language Models to Act Like They Are Deployed
(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
