claim
active
Fine-tuned behaviors are low-rank in weight space, potentially making them easier to manipulate with steering vectors than naturalistic behaviors
Cited from Wang et al. (2025a) as a reason to prefer SDF (synthetic document fine-tuning) over demonstration fine-tuning when building realistic model organisms.
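As a rough illustration of what "low-rank in weight space" means, the toy sketch below compares the effective rank of a synthetic low-rank weight update with that of a dense random one. It is not the method of Wang et al. or of the source paper. The helper `effective_rank`, the 99% energy threshold, the matrix sizes, and the random matrices are all assumptions chosen for the example.

```python
import numpy as np

def effective_rank(delta, energy=0.99):
    """Smallest number of singular values that capture `energy`
    of the squared Frobenius norm of `delta`."""
    s = np.linalg.svd(delta, compute_uv=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cumulative, energy) + 1)

rng = np.random.default_rng(0)
d, r = 64, 2  # hypothetical layer width and update rank

w_base = rng.normal(size=(d, d))

# Hypothetical fine-tuning update: a product of two thin matrices,
# so its rank is at most r.
w_tuned = w_base + rng.normal(size=(d, r)) @ rng.normal(size=(r, d))

# Dense random update of the same shape, used only as a contrast.
delta_dense = rng.normal(size=(d, d))

rank_low = effective_rank(w_tuned - w_base)
rank_dense = effective_rank(delta_dense)
print(rank_low, rank_dense)
```

In this toy setting the low-rank update is concentrated in at most two directions, while the dense update needs many more singular values to reach the same energy threshold.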
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
- Technique used to impose guardrails on base LLMs, analogized to censorship on the simulator's range of simulacra
- Empirical demonstration on Llama-3.1-8B that steering along representation manifold aligns outputs with behavior manifold, whereas linear steering does not.
- Future work hypothesis about extending SOO to direct value alignment
- Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
- Central empirical claim of the paper supported by three LLM experiments
- Unified interpretation of different adaptation methods via UCCT terms