Patchscopes

Unifying framework for inspecting hidden representations of language models via representation interventions
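The core intervention can be illustrated with a toy model. What follows is a minimal sketch under stated assumptions, not the Patchscopes authors' implementation. A hidden vector is read from one forward pass (source prompt, layer, position) and written into another forward pass (target prompt, layer, position), and the later layers then process it. The model (`WEIGHTS`, `layer`), the helpers `run` and `patchscope`, and all sizes are hypothetical. A real setup would use a pretrained language model and decode the patched target run into tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 8, 4  # hypothetical hidden size and layer count for the toy model
WEIGHTS = [rng.normal(scale=0.5, size=(D, D)) for _ in range(L)]

def layer(h, w):
    # Hypothetical residual block: causal mean over positions <= j, then tanh.
    causal_mean = np.cumsum(h, axis=0) / np.arange(1, len(h) + 1)[:, None]
    return h + np.tanh(causal_mean @ w)

def run(x, patch=None):
    """Run the toy model on x (seq_len x D).

    Returns L + 1 arrays: states[i] is the input to layer i (after any patch),
    and states[-1] is the final output. patch = (layer_idx, position, vector)
    overwrites one hidden state before layer `layer_idx` consumes it.
    """
    states = []
    h = x.copy()
    for i, w in enumerate(WEIGHTS):
        if patch is not None and patch[0] == i:
            h = h.copy()
            h[patch[1]] = patch[2]  # the representation intervention
        states.append(h)
        h = layer(h, w)
    states.append(h)
    return states

def patchscope(source, target, src_layer, src_pos, tgt_layer, tgt_pos):
    # 1) Read the hidden representation from the source forward pass.
    vec = run(source)[src_layer][src_pos]
    # 2) Inject it into the target forward pass; return the final states.
    return run(target, patch=(tgt_layer, tgt_pos, vec))[-1]

source = rng.normal(size=(5, D))  # stand-in for embedded source prompt
target = rng.normal(size=(3, D))  # stand-in for embedded target prompt
patched_final = patchscope(source, target, src_layer=2, src_pos=4, tgt_layer=1, tgt_pos=2)
```

Because the toy layers are causal, positions before the patched one are unchanged. Only the patched position and positions after it see the injected representation.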

Neighborhood — ranked by edge-count

paper

thinker

Ghandeharioun, A.
introduces
Lead author of the Patchscopes paper on inspecting hidden representations

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood that lack a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
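The filter described above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This assumes each entity has a precomputed embedding vector and that typed edges are stored as name pairs. The page does not say how its embeddings are produced, and the function names and toy vectors here are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def untyped_neighbors(entity, embeddings, typed_edges, threshold=0.65):
    """Hypothetical helper: entities with cosine >= threshold to `entity`
    and no typed edge in either direction, sorted by descending score."""
    anchor = embeddings[entity]
    hits = []
    for name, vec in embeddings.items():
        if name == entity:
            continue
        if (entity, name) in typed_edges or (name, entity) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(anchor, vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Toy 2-D embeddings, made up for illustration only.
emb = {
    "Patchscopes": [1.0, 0.0],
    "Path Patching": [0.8, 0.6],
    "Ghandeharioun, A.": [1.0, 0.1],
    "Segmentation Clock": [0.0, 1.0],
}
edges = {("Ghandeharioun, A.", "Patchscopes")}
result = untyped_neighbors("Patchscopes", emb, edges)
```

In this toy example the author node is skipped because it already has a typed `introduces` edge. The orthogonal vector falls below the threshold. Only `Path Patching`, with a score of about 0.8, remains.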

Patch-Closure · concept · 0.776
A set property meaning all coordinate patches of its elements remain within the set; proved equivalent to axis-aligned hyperrectangles.
Path Patching · method · 0.764
Method by Goldowsky-Dill et al. (2023) for localizing model behavior via targeted activation interventions.
monitors · concept · 0.734
Synchronization construct encapsulating shared data and protected access routines.
Activation patching · method · 0.728
Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
Segmentation Clock · concept · 0.722
Oscillatory gene network in the vertebrate embryo that paces somite formation; exhibits collective intelligence properties.
Probes · concept · 0.718
Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
Attribution patching · method · 0.702
Gradient-based method to estimate the effect of zeroing a feature on a specific logit difference.
GemmaScope SAEs · concept · 0.700
Sparse autoencoders (SAEs) trained on pretrained Gemma-2 models, used for steering in Gemma-family experiments.
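The Activation patching and Attribution patching entries differ in cost and accuracy, and a toy comparison shows how. The following is a sketch under stated assumptions. It takes attribution patching to be the standard first-order (Taylor) estimate of zeroing feature k, (0 − a_k) · ∂m/∂a_k. It compares this with exact zero-ablation, which reruns the metric once per feature. The metric `logit_diff`, its weights, and the activations are hypothetical stand-ins for a real model's logit difference.

```python
import numpy as np

def logit_diff(a, w):
    """Toy metric: a logit difference read out linearly from tanh(features)."""
    return float(w @ np.tanh(a))

def logit_diff_grad(a, w):
    """Analytic gradient of logit_diff with respect to the activations a."""
    return w * (1.0 - np.tanh(a) ** 2)

def attribution_patching(a, w):
    """First-order estimate of zeroing each feature: (0 - a_k) * dm/da_k.
    A single gradient computation covers every feature at once."""
    return (0.0 - a) * logit_diff_grad(a, w)

def activation_patching(a, w):
    """Exact effect of zeroing each feature: one extra evaluation per feature."""
    base = logit_diff(a, w)
    effects = np.empty_like(a)
    for k in range(len(a)):
        patched = a.copy()
        patched[k] = 0.0
        effects[k] = logit_diff(patched, w) - base
    return effects

a = np.array([0.01, -0.02, 0.5, 2.0])  # toy feature activations
w = np.array([1.0, -2.0, 0.5, 1.5])    # toy readout weights
estimate = attribution_patching(a, w)
exact = activation_patching(a, w)
```

For the two small activations the estimate closely matches the exact effect. For the large activation (2.0), tanh is saturated and the linear approximation breaks down, so the estimate is far off. The gradient-based method is cheaper, but it is only reliable where the metric is roughly linear in the patched feature.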