Model Surgery

Edits the MLP weights in all layers to modify model behavior; used by Abdelnabi & Salem to decrease verbalized evaluation awareness.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
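
The selection rule stated above (cosine similarity of at least 0.65, and no typed edge to this entity) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline that produced this page: the function name `similar_untyped`, the two-dimensional toy vectors, and the representation of typed edges as a set of unordered name pairs are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similar_untyped(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities whose cosine similarity to
    `target` is at least `threshold` and that share no typed edge with it,
    sorted by descending score.

    `embeddings` maps entity name -> vector (hypothetical layout).
    `typed_edges` is a set of frozenset({a, b}) pairs (hypothetical layout).
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue  # never compare an entity with itself
        if frozenset((target, name)) in typed_edges:
            continue  # already connected by a typed relation
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy data: made-up vectors, not real embeddings.
embeddings = {
    "Model Surgery": [1.0, 0.0],
    "Model Editing": [0.8, 0.6],  # close to the target
    "Unrelated": [0.0, 1.0],      # orthogonal to the target
    "Linked": [1.0, 0.1],         # close, but has a typed edge
}
typed_edges = {frozenset(("Model Surgery", "Linked"))}
neighbors = similar_untyped("Model Surgery", embeddings, typed_edges)
```

Here `neighbors` keeps only "Model Editing": "Unrelated" falls below the threshold, and "Linked" is excluded because it already has a typed edge to the target.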

model · concept · 0.834
A representation that captures relevant aspects of a system; according to the theorem, the regulator must embody this.
Model Stitching · method · 0.814
Technique to measure representational compatibility by integrating intermediate representations of one model into another.
Model Editing · concept · 0.788
Technique for modifying model knowledge or behavior via targeted interventions, e.g., ROME by Meng et al.
Toy Models · concept · 0.780
model selection · concept · 0.777
Comparing models using log-evidence approximated by free energy.
Model Deception · concept · 0.770
LLM behavior of generating falsehoods; the multi-dimensional truth subspace raises new risks for subtle manipulation.
Foundation Models · concept · 0.768
Large pretrained models used as backbones across tasks; their universality motivates the convergence hypothesis.
Language Models · concept · 0.762
Primary substrate for manifold steering experiments; demonstrates method on reasoning and in-context tasks.