claim

active

claim:fine-tuned-behaviors-are-low-rank-in-weight-space-potentially-making-them-easier-to-manipulate-with-steering-vectors-compared-to-naturalistic-behavior

Fine-tuned behaviors are low-rank in weight space, potentially making them easier to manipulate with steering vectors compared to naturalistic behavior

Cited from Wang et al. (2025a) as a reason synthetic document fine-tuning (SDF) is preferred over demonstration fine-tuning for building realistic model organisms.
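
One way to read the low-rank intuition (an illustration, not taken from the cited papers): if fine-tuning changes a weight matrix by a rank-1 update u vᵀ, the change in that layer's output for any input x is (v·x) u, a multiple of the single direction u. That is the same kind of shift that adding a steering vector produces. The sketch below checks this on a toy 3×3 matrix; W, u, v, and x are hypothetical values chosen for the example, and real fine-tuning updates are only approximately low-rank at best.

```python
# Illustrative sketch only: a rank-1 weight update moves every output
# along one fixed direction u, scaled by how strongly the input aligns with v.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def add(A, B):
    """Elementwise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# Hypothetical base weights and a hypothetical rank-1 "fine-tuning" update.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 0.0, 1.0]]
u = [1.0, -1.0, 0.5]   # output direction written by the update (assumed)
v = [0.0, 2.0, 1.0]    # input direction read by the update (assumed)

W_ft = add(W, outer(u, v))   # "fine-tuned" weights: W + u v^T

def output_shift(x):
    """Difference between fine-tuned and base outputs for input x."""
    base = matvec(W, x)
    tuned = matvec(W_ft, x)
    return [t - b for t, b in zip(tuned, base)]

x = [1.0, 2.0, 3.0]
shift = output_shift(x)
scale = sum(vi * xi for vi, xi in zip(v, x))   # v . x

# The shift equals scale * u, so it lies entirely along u.
predicted = [scale * ui for ui in u]
print(shift, predicted)
```

Because the whole behavioral change in this toy case lives along u, adding or subtracting a multiple of u to the layer output can cancel or amplify it, which is the sense in which a low-rank change may be easier to steer than behavior spread across many directions.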

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Fine-tuning induces the behavioral pattern of self-correction but does not improve the underlying ability to correct effectively · claim · 0.802
Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
Fine-Tuning via Reinforcement Learning · method · 0.798
Technique used to impose guardrails on base LLMs, analogized to censorship on the simulator's range of simulacra
manifold steering produces clean probability shifts along natural behavior structure; linear steering cuts across manifold and produces off-target noisy effects · finding · 0.790
Empirical demonstration on Llama-3.1-8B that steering along representation manifold aligns outputs with behavior manifold, whereas linear steering does not.
Fine-tuning as character formation: what kinds of selves are produced through training is an open research direction. · claim · 0.788
SOO fine-tuning could be extended to align AI representations of its own goals with human user preferences, reducing misalignment by fostering coherence between self-related and other-related preferences · hypothesis · 0.786
Future work hypothesis about extending SOO to direct value alignment
Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity. · claim · 0.783
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
SOO fine-tuning significantly reduces deceptive behavior in LLMs while maintaining general task performance · claim · 0.781
Central empirical claim of the paper supported by three LLM experiments
Fine-tuning reduces mismatch d_r, retrieval increases effective cohesion ρ_d, and few-shot adjusts the budget k · claim · 0.780
Unified interpretation of different adaptation methods via UCCT terms