concept
active
concept:wrecking-ball-intervention
wrecking-ball intervention
Type of concept steering intervention that catastrophically collapses global model performance.
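As a rough illustration of the definition above, the toy sketch below contrasts a steering direction that overlaps the model's readout with one that does not, and measures the change in global accuracy for each. Everything in it is an assumption made for illustration: the function names (`accuracy`, `steer`, `performance_drop`), the synthetic activations, the linear readout, and the 0.5 collapse threshold. No SAE is trained here, and nothing is taken from the associated preprint.

```python
import numpy as np

def accuracy(h, W, y):
    """Fraction of activations in h that the linear readout W maps to label y."""
    return float(np.mean(np.argmax(h @ W, axis=1) == y))

def steer(h, direction, alpha):
    """Shift every activation by alpha along a unit-normalized direction."""
    unit = direction / np.linalg.norm(direction)
    return h + alpha * unit

def performance_drop(h, W, y, direction, alpha):
    """Global accuracy lost when the whole dataset is steered along direction."""
    return accuracy(h, W, y) - accuracy(steer(h, direction, alpha), W, y)

# Synthetic setup (all hypothetical): 3 classes, 16-dimensional activations.
rng = np.random.default_rng(0)
n, d, k = 600, 16, 3
y = rng.integers(0, k, size=n)
h = 0.1 * rng.normal(size=(n, d))
h[np.arange(n), y] += 3.0            # class signal lives in the first k dimensions

W = np.zeros((d, k))
W[np.arange(k), np.arange(k)] = 1.0  # readout only looks at those k dimensions

entangled = np.zeros(d)
entangled[0] = 1.0                   # overlaps the class-0 logit
isolated = np.zeros(d)
isolated[5] = 1.0                    # a dimension the readout ignores

COLLAPSE_THRESHOLD = 0.5             # arbitrary cutoff chosen for this sketch
drop_entangled = performance_drop(h, W, y, entangled, alpha=10.0)
drop_isolated = performance_drop(h, W, y, isolated, alpha=10.0)

print("entangled: drop =", round(drop_entangled, 3),
      "| wrecking ball:", drop_entangled > COLLAPSE_THRESHOLD)
print("isolated:  drop =", round(drop_isolated, 3),
      "| wrecking ball:", drop_isolated > COLLAPSE_THRESHOLD)
```

In this toy setup the entangled direction pushes every input toward class 0, so accuracy falls to roughly the share of class-0 examples, while the isolated direction leaves accuracy unchanged. A real analysis would presumably steer along learned SAE decoder directions and evaluate on the model's actual task metrics.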
Neighborhood — ranked by edge-count
Papers (1)
Methods (1)
- Concept Steering (associated_with): Latent intervention technique that manipulates sparse features to steer model predictions toward desired concepts.
Concepts (1)
- Representational Failure (extends): A failure mode exposed by the SAE framework where model representations are entangled or collapse under intervention.
Events (1)
- Preprint applying TopK SAEs to three EEG transformers to reveal sparse feature dictionaries, steering regimes, and spectral interpretation.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Load-bearing phrase describing catastrophic steering effects.
- Observation of catastrophic performance drop when steering certain concepts.
- Fundamental operation for causal abstraction analysis; forces neurons to take values from source inputs to create counterfactuals.
- Demonstrates a critical failure mode of concept steering with clinical safety implications.
- General technique of modifying activations to control model behavior.
- Property that additive modifications to activations affect all downstream computations, enabling tractable behavioral control.
- The use of interventions (rather than correlations) to establish a causal link between representation geometry and behavioral geometry.
- Method of shifting hidden-state activations along probe directions to cause the model to treat false statements as true and vice versa; evaluated on out-of-distribution (OOD) inputs.