method
active
method:difference-in-means

Difference-in-Means

Method for extracting linear directions by subtracting mean activations of contrastive groups; used to define the Assistant Axis
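
The description above can be illustrated with a minimal sketch, assuming each activation is a plain list of floats and the two contrastive groups are already collected. The names `mean_vector`, `difference_in_means`, and `project` are hypothetical helpers chosen for this illustration; they do not come from any particular paper or library.

```python
import math
from statistics import fmean


def mean_vector(rows):
    """Component-wise mean of equally sized activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def difference_in_means(positive, negative):
    """Direction pointing from the negative-group mean to the positive-group mean."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def project(activation, direction):
    """Scalar projection of one activation onto the (unit-normalized) direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(a * d for a, d in zip(activation, direction)) / norm


# Toy data (made up for illustration): two activations per group, two dimensions.
positive = [[1.0, 2.0], [3.0, 2.0]]
negative = [[0.0, 2.0], [0.0, 2.0]]
direction = difference_in_means(positive, negative)
```

Projecting a new activation onto `direction` gives a scalar score for how far it lies toward the positive group. Whether that score is meaningful depends on the groups being genuinely contrastive.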

Neighborhood — ranked by edge-count

Frameworks (3)

framework
  • The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
  • Multi-task benchmark of linguistic behaviours for measuring causal efficacy of interpretability methods, adapted from SyntaxGym
  • Assistant Axis
    implements
    Contrast vector between mean default Assistant activation and mean of all fully role-playing role vectors; main contribution of the paper

Concepts (1)

concept
  • Truth Direction
    associated_with
    A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Vector from mean of false representations to mean of true representations; core of mass-mean probing
  • Contrast · concept · 0.773
    The property that living structures contain intense contrast, far more than one might expect to be helpful: true opposites that annihilate each other when superimposed, creating the differentiation that gives birth to something. Used correctly, contrast unifies rather than separates.
  • differentiation · concept · 0.772
    Subtle variation and detail, as in pots of flowers, that brings life to a place.
  • variation · concept · 0.750
    The subtle differences among repeated elements necessary to avoid mechanical uniformity.
  • ambiguity · concept · 0.745
    Multiple possible meanings for words like Alice, disambiguated by context; harder when grammar and meaning intertwine
  • Probe construction method: concept vector at each layer is L2-normalized difference between mean positive and mean negative representations from contrastive system prompts
  • Elliott's core principle: type class instance definitions should mirror the semantic meaning to avoid abstraction leaks.
  • Support · method · 0.730
    Attribute: providing a foundation function, a text that acts as base or corroboration.
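
The probe-construction entry in the list above (an L2-normalized difference between mean positive and mean negative representations at each layer) can be sketched as follows. This is an illustrative sketch only: the function name `unit_concept_vectors` and the nested-list input layout (layer, then example, then dimension) are assumptions, not the source method's actual interface.

```python
import math


def unit_concept_vectors(pos_by_layer, neg_by_layer):
    """Return one L2-normalized difference-in-means vector per layer.

    Each argument is a list over layers; each layer is a list of
    activation vectors (lists of floats) of equal length.
    """
    vectors = []
    for pos, neg in zip(pos_by_layer, neg_by_layer):
        mu_pos = [sum(col) / len(col) for col in zip(*pos)]
        mu_neg = [sum(col) / len(col) for col in zip(*neg)]
        diff = [p - n for p, n in zip(mu_pos, mu_neg)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm == 0.0:
            # Identical group means: no direction can be defined for this layer.
            raise ValueError("positive and negative means coincide")
        vectors.append([d / norm for d in diff])
    return vectors


# Toy single-layer input (made up for illustration).
layers = unit_concept_vectors(
    [[[3.0, 4.0], [3.0, 4.0]]],
    [[[0.0, 0.0], [0.0, 0.0]]],
)
```

Normalizing each layer's vector to unit length makes projections comparable across layers whose activations differ in scale, which is presumably why the entry specifies it.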