finding
active
finding:clamping-transit-infrastructure-feature-to-5x-max-activation-caused-the-model-to-mention-a-bridge-in-completion
Clamping transit infrastructure feature to 5x max activation caused the model to mention a bridge in completion.
Provides further causal validation of the feature's interpretation.
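The clamping intervention in this finding can be illustrated with a minimal sketch. It assumes a sparse-autoencoder-style dictionary in which each feature's activation scales a decoder direction that is added back into the model's hidden state; the record itself does not describe the mechanism. The names `clamp_feature`, `decode`, `TRANSIT_FEATURE`, and `OBSERVED_MAX`, along with all numbers, are hypothetical and chosen only for illustration.

```python
def clamp_feature(feature_acts, feature_idx, max_activation, multiplier=5.0):
    """Return a copy of the activations with one feature fixed at
    multiplier * max_activation (5x by default, as in the finding)."""
    clamped = list(feature_acts)
    clamped[feature_idx] = multiplier * max_activation
    return clamped


def decode(feature_acts, decoder_rows, bias):
    """Linear decode: bias plus the sum of activation * decoder direction.
    Assumes one decoder row per feature (an assumption of this sketch)."""
    out = list(bias)
    for act, row in zip(feature_acts, decoder_rows):
        for j, weight in enumerate(row):
            out[j] += act * weight
    return out


# Toy setup: 3 features, 2-dimensional hidden state (hypothetical values).
feature_acts = [0.0, 0.2, 0.0]
decoder_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0]

TRANSIT_FEATURE = 2   # hypothetical index of the feature being clamped
OBSERVED_MAX = 2.0    # hypothetical maximum activation seen on a dataset

clamped_acts = clamp_feature(feature_acts, TRANSIT_FEATURE, OBSERVED_MAX)
baseline_state = decode(feature_acts, decoder_rows, bias)
steered_state = decode(clamped_acts, decoder_rows, bias)
```

In this toy setup the steered state differs from the baseline only along the clamped feature's decoder direction. In a real experiment, that shift is what would be expected to change the model's completion.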
Source paper
extracted_from
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Strong causal evidence that the feature represents the bridge.
- Causal effect: activates generation of security bugs.
- Validation that top activations are highly specific to interpretation.
- Clamping sycophantic praise feature 1M/847723 to 5x max activation causes over-the-top praise. (finding, similarity 0.756) Demonstrates causal role in sycophancy.
- Causal effect: feature induces perception of bugs.
- Feature manipulation alters persona.
- Feature intervention eliminates untruthful answer.
- Causal effect showing the feature governs computation.