Concept Activation Vectors (CAVs / TCAV)

Method from Kim et al. (2018) for identifying concept directions in CNN activations; a precursor to LLM probing.

Neighborhood — ranked by edge-count

thinker

Been Kim
introduces
Author of TCAV (Testing with Concept Activation Vectors); early work supporting the Linear Representation Hypothesis

method

Linear Probe
extends
Simple linear classifiers trained on model activations, used as the probing technique within the method.
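
As a rough illustration of how a linear probe can yield a concept direction, the sketch below fits a logistic-regression probe on synthetic "activations" using plain NumPy and takes the normalized weight vector as the concept activation vector. The data, the helper name `fit_linear_probe`, and the hyperparameters are illustrative assumptions; they are not taken from Kim et al. or from any particular library.

```python
import numpy as np

def fit_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe by batch gradient descent.

    acts:   (n, d) array of activations (synthetic here).
    labels: (n,) array with 1 for concept examples, 0 for random examples.
    Returns the weight vector w and bias b of the linear classifier.
    """
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = acts @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels            # gradient of the log loss w.r.t. logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

# Synthetic stand-in for layer activations: concept examples are shifted
# along one planted direction (assumption made purely for the demo).
rng = np.random.default_rng(0)
d = 16
planted = np.zeros(d)
planted[0] = 1.0
concept_acts = rng.normal(size=(200, d)) + 2.0 * planted
random_acts = rng.normal(size=(200, d))

acts = np.vstack([concept_acts, random_acts])
labels = np.concatenate([np.ones(200), np.zeros(200)])

w, b = fit_linear_probe(acts, labels)

# The concept direction is the unit normal to the probe's decision boundary.
cav = w / np.linalg.norm(w)
accuracy = (((acts @ w + b) > 0) == (labels == 1)).mean()
```

In practice the activations would come from a chosen layer of the model under study, and a library classifier could replace the hand-written training loop.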

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Residual Activation Vectors · concept · 0.759
Layer-40 activations with the component explained by compressed Gemini embeddings subtracted, isolating information not driven by surface text content
concept vector · concept · 0.750
Computed directional vector in activation space representing a specific concept, used for injection experiments
concept vector computation · method · 0.750
Procedure extracting concept vectors as difference of mean activations between concept-exemplifying and baseline/negative sentences
Activation Verbalizer (AV) · method · 0.726
Component of NLA that maps activations to text descriptions; initialized as copy of target LLM with supervised warm-start on summarization task.
Random and negated vectors less effective than concept vectors · finding · 0.712
Random vectors require larger norm to trigger detection (8 vs 2); elicit awareness at lower rates (9/100); negated vectors comparably effective but model identification confabulated.
Activations · concept · 0.710
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
Contrastive concept vector extraction · method · 0.708
Method for obtaining concept vectors by subtracting activations from two contrasting prompts.
Activation velocity · concept · 0.702
Cumulative drift measure in internal representations across turns introduced by Das & Fioretto 2026
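
The difference-of-means and contrastive extraction procedures listed in this neighborhood can be sketched in a few lines. The example below is a minimal illustration on synthetic data; the function names `mean_difference_vector` and `cosine`, the planted direction, and the array sizes are all assumptions made for the demo, not details from the listed entities.

```python
import numpy as np

def mean_difference_vector(concept_acts, baseline_acts):
    """Concept vector as mean(concept activations) - mean(baseline activations).

    Both inputs are (n, d) arrays. Passing a single row for each (n = 1)
    reduces this to subtracting the activations of two contrasting prompts.
    """
    return concept_acts.mean(axis=0) - baseline_acts.mean(axis=0)

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic activations: concept examples carry a planted unit direction
# on top of Gaussian noise (an assumption for illustration only).
rng = np.random.default_rng(0)
d = 32
planted = rng.normal(size=d)
planted /= np.linalg.norm(planted)

baseline_acts = rng.normal(size=(500, d))
concept_acts = rng.normal(size=(500, d)) + 3.0 * planted

vec = mean_difference_vector(concept_acts, baseline_acts)

# With enough examples the noise averages out and the extracted vector
# aligns closely with the planted direction.
similarity = cosine(vec, planted)
```

Averaging over many examples suppresses noise unrelated to the concept; the two-prompt contrastive variant is cheaper but more sensitive to incidental differences between the prompts.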