paper
active
paper:wurgaft-goodfire-manifold-steering-2026
Steering Along Manifolds to Control Neural Networks
/Users/antonborzov/Documents/Research.nosync/papers/wurgaft-goodfire-manifold-steering-2026.md
External IDs
arxiv
2605.05115
title_hash
6aeb0eb95c0267bf2b9a6929aad46b1fdb28e7d6
legacy_slug
wurgaft-goodfire-manifold-steering-2026
Frontmatter (19 fields)
{
  "doi": "10.48550/arxiv.2605.05115",
  "pdf": "https://arxiv.org/pdf/2605.05115",
  "url": "https://arxiv.org/abs/2605.05115",
  "tags": [
    "representation-steering",
    "manifold-geometry",
    "feature-steering",
    "Llama-3",
    "cyclic-concepts",
    "mechanistic-interpretability",
    "goodfire"
  ],
  "year": 2026,
  "saved": "2026-05-14",
  "title": "Steering Along Manifolds to Control Neural Networks",
  "venue": "arXiv preprint",
  "status": "full-text-saved",
  "landing": "https://www.goodfire.ai/research/manifold-steering",
  "arxiv_id": 2605.05115,
  "published": "2026-05-07",
  "enrichment": {
    "is_stale": true
  },
  "affiliation": "Goodfire (+ Stanford / collaborators)",
  "openalex_id": "W7160544910",
  "openalex_year": 2026,
  "openalex_enriched_at": 1778977722,
  "openalex_match_title": "Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior",
  "openalex_cited_by_count": 0
}
Outgoing (4)
Contradicts (1)
- linear steering (method)
Extends (1)
- Representation Steering (concept)
Implements (1)
- Manifold Steering (concept)
Member of (1)
- Neural Geometry (community)
Incoming (19)
Authored by (16)
- Atticus Geiger (thinker)
- Can Rager (thinker)
- Daniel Wurgaft (thinker)
- Ekdeep Singh Lubana (thinker)
- Eric Bigelow (thinker)
- Jack Merullo (thinker)
- Matthew Kowal (thinker)
- Noah Goodman (thinker)
- Owen Lewis (thinker)
- Raphaël Sarfati (thinker)
- Sheridan Feucht (thinker)
- Tal Haklay (thinker)
- Thomas Fel (thinker)
- Thomas McGrath (thinker)
- Usha Bhalla (thinker)
- +1 more
Supported by (3)
- manifold geometry principles extend to months, letters, ages, and in-context learning tasks across modalities (finding)
- manifold steering produces clean probability shifts along natural behavior structure; linear steering cuts across manifold and produces off-target noisy effects (finding)
- previous Goodfire post on concept geometry and mountain car demo (artifact)
References (30)
- Methods of information geometry (referenced-only)
- Refusal in language models is mediated by a single direction (referenced-only)
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks (referenced-only)
- Learning phrase representations using RNN encoder-decoder for statistical machine translation (referenced-only)
- Understanding neural networks through representation erasure (referenced-only)
- Identifying and controlling important neurons in neural machine translation (referenced-only)
- An interpretability illusion for BERT (referenced-only)
- Locating and editing factual associations in GPT (referenced-only)
- Toy models of superposition (referenced-only)
- Dissecting recall of factual associations in auto-regressive language models (referenced-only)
- Discovering variable binding circuitry with desiderata (referenced-only)
- A geometric notion of causal probing (referenced-only)
- The geometry of truth: Emergent linear structure in large language model representations of true/false datasets (referenced-only)
- In-context learning dynamics with random binary sequences (referenced-only)
- Manifold learning: what, how, and why (referenced-only)
- Diffusion geometry (referenced-only)
- Not all language model features are one-dimensionally linear (referenced-only)
- Recurrent neural networks learn to store and generate sequences using non-linear representations (referenced-only)
- π0: A vision-language-action flow model for general robot control (referenced-only)
- Emergent symbol-like number variables in artificial neural networks (referenced-only)
- Language models use trigonometry to do addition (referenced-only)
- Steering off course: Reliability challenges in steering language models (referenced-only)
- On linear representations and pretraining data frequency in language models (referenced-only)
- Patterns and mechanisms of contrastive activation engineering (referenced-only)
- The origins of representation manifolds in large language models (referenced-only)
- On the emergence of linear analogies in word embeddings (referenced-only)
- Persona vectors: Monitoring and controlling character traits in language models (referenced-only)
- Semantic structure in large language model embeddings (referenced-only)
- Into the rabbit hull: From task-relevant concepts in DINO to Minkowski geometry (referenced-only)
- Belief dynamics reveal the dual nature of in-context learning and activation steering (referenced-only)
Mentions (3)
- papers-typed
steering.md
- papers
/Users/antonborzov/Documents/Research.nosync/papers/wurgaft-goodfire-manifold-steering-2026.md
- papers
wurgaft-goodfire-manifold-steering-2026-fulltext.md