Difference-in-Means

Method for extracting a linear direction by subtracting the mean activation of one contrastive group from the mean activation of the other; used to define the Assistant Axis
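
As a rough illustration of the definition above, the following sketch computes a difference-in-means direction from two groups of activation vectors using only the Python standard library. The function names (`mean_vector`, `difference_in_means`, `project`), the `normalize` flag, and the toy vectors are hypothetical and chosen for this example; which group counts as "positive" is a convention, and real activations would normally come from a model and be handled with an array library.

```python
from math import sqrt
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def difference_in_means(positive, negative, normalize=False):
    """Direction pointing from the negative-group mean to the positive-group mean."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    direction = [p - n for p, n in zip(mu_pos, mu_neg)]
    if normalize:
        # Optional L2 normalization (an assumption; the entry above does not require it).
        norm = sqrt(sum(x * x for x in direction))
        if norm == 0.0:
            raise ValueError("group means coincide; direction is undefined")
        direction = [x / norm for x in direction]
    return direction


def project(activation, direction):
    """Dot product of one activation with the direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy 3-dimensional "activations", made up for illustration only.
positive = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
negative = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

direction = difference_in_means(positive, negative)
unit_direction = difference_in_means(positive, negative, normalize=True)
```

Projecting an activation onto `unit_direction` with `project` gives a scalar score along the extracted direction, which is one common way such a direction is used once it has been computed.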

Neighborhood — ranked by edge-count

framework

Linear Representation Hypothesis
implements
The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
CausalGym
uses
Multi-task benchmark of linguistic behaviours for measuring causal efficacy of interpretability methods, adapted from SyntaxGym
Assistant Axis
implements
Contrast vector between mean default Assistant activation and mean of all fully role-playing role vectors; main contribution of the paper

concept

Truth Direction
associated_with
A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Difference-in-Means Direction · concept · 0.890
Vector from mean of false representations to mean of true representations; core of mass-mean probing
Contrast · concept · 0.773
The property that living structures contain intense contrast, far more than one imagines helpful: true opposites that annihilate each other when superimposed, creating differentiation that gives birth to something; used correctly, contrast unifies rather than separates
differentiation · concept · 0.772
Subtle variation and detail, as in pots of flowers, that brings life to a place.
variation · concept · 0.750
The subtle differences among repeated elements necessary to avoid mechanical uniformity.
ambiguity · concept · 0.745
Multiple possible meanings for words like Alice, disambiguated by context; harder when grammar and meaning intertwine
Contrastive mean-difference probe · method · 0.741
Probe construction method: the concept vector at each layer is the L2-normalized difference between the mean positive and mean negative representations obtained from contrastive system prompts
The instance's meaning follows the meaning's instance. · claim · 0.737
Elliott's core principle: type class instance definitions should mirror the semantic meaning to avoid abstraction leaks.
Support · method · 0.730
Attribute: providing a foundation function, a text that acts as base or corroboration.