method
active
method:difference-in-means · Difference-in-Means
Method for extracting linear directions by subtracting mean activations of contrastive groups; used to define the Assistant Axis
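The description above can be sketched in a few lines. This is a minimal illustration, not the implementation from any particular paper: plain Python lists stand in for activation vectors, the toy numbers are made up, and the helper names (`mean_vector`, `difference_in_means`, `project`) are hypothetical.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def difference_in_means(positive, negative):
    """Direction pointing from the negative-group mean to the positive-group mean."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def project(activation, direction):
    """Unnormalized score: dot product of an activation with the direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy 3-dimensional activations for two contrastive groups (illustrative only).
positive = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
negative = [[0.0, 2.0, 0.0], [0.0, 2.0, 2.0]]

direction = difference_in_means(positive, negative)
score = project([1.0, 0.0, 0.0], direction)
```

A larger `score` means the activation lies further along the direction separating the two groups. Whether that direction is causally meaningful is a separate empirical question.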
Neighborhood — ranked by edge-count
Frameworks (3)
framework
- Linear Representation Hypothesis · implements · The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
- CausalGym · uses · Multi-task benchmark of linguistic behaviours for measuring causal efficacy of interpretability methods, adapted from SyntaxGym
- Assistant Axis · implements · Contrast vector between mean default Assistant activation and mean of all fully role-playing role vectors; main contribution of the paper
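As a hedged sketch of how the Assistant Axis entry relates to this method: the contrast is taken between the mean of the default Assistant activations and the mean of the per-role vectors. The sign convention (default minus roles), the role names, the function name `contrast_axis`, and the toy numbers are all assumptions made for illustration; the entry itself only says "contrast vector between" the two means.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [fmean(col) for col in zip(*rows)]


def contrast_axis(default_activations, role_vectors):
    """Mean default activation minus the mean of all role vectors.

    The sign (default minus roles) is an assumption of this sketch.
    """
    mu_default = mean_vector(default_activations)
    mu_roles = mean_vector(list(role_vectors.values()))
    return [d - r for d, r in zip(mu_default, mu_roles)]


# Toy 2-dimensional data; role names are hypothetical.
default_activations = [[2.0, 0.0], [4.0, 2.0]]
role_vectors = {"pirate": [0.0, 1.0], "poet": [2.0, 3.0]}

axis = contrast_axis(default_activations, role_vectors)
```

Each role contributes one vector here, so every role is weighted equally regardless of how many samples produced its vector.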
Concepts (1)
concept
- Truth Direction · associated_with · A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Vector from mean of false representations to mean of true representations; core of mass-mean probing
- The property that living structures contain intense contrast, far more than one would imagine helpful: true opposites that annihilate each other when superimposed, creating the differentiation that gives birth to something. Used correctly, contrast unifies rather than separates.
- Subtle variation and detail, as in pots of flowers, that brings life to a place.
- The subtle differences among repeated elements necessary to avoid mechanical uniformity.
- Multiple possible meanings for words like Alice, disambiguated by context; harder when grammar and meaning intertwine
- Probe construction method: concept vector at each layer is L2-normalized difference between mean positive and mean negative representations from contrastive system prompts
- Elliott's core principle: type class instance definitions should mirror the semantic meaning to avoid abstraction leaks.
- Attribute: providing a foundation function, a text that acts as base or corroboration.
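The probe-construction entry in the list above (an L2-normalized difference of means, computed at each layer) is a close variant of this method. Here is a minimal sketch, assuming activations are grouped as one list of vectors per layer. The function names and toy data are hypothetical, and raising an error when the two means coincide is a design choice of this sketch, not something the entry specifies.

```python
import math
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [fmean(col) for col in zip(*rows)]


def concept_vector(positive, negative):
    """L2-normalized difference between the positive and negative group means."""
    diff = [p - n for p, n in zip(mean_vector(positive), mean_vector(negative))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("group means coincide; the direction is undefined")
    return [x / norm for x in diff]


def concept_vectors_by_layer(positive_by_layer, negative_by_layer):
    """One unit-norm concept vector per layer."""
    return [
        concept_vector(pos, neg)
        for pos, neg in zip(positive_by_layer, negative_by_layer)
    ]


# Two layers, two examples per group, 2-dimensional activations (toy data).
positive_by_layer = [
    [[3.0, 4.0], [3.0, 4.0]],  # layer 0
    [[0.0, 2.0], [0.0, 4.0]],  # layer 1
]
negative_by_layer = [
    [[0.0, 0.0], [0.0, 0.0]],  # layer 0
    [[0.0, 1.0], [0.0, 1.0]],  # layer 1
]

vectors = concept_vectors_by_layer(positive_by_layer, negative_by_layer)
```

Normalizing makes projections comparable across layers whose activation scales differ, at the cost of discarding the distance between the two group means.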