Mean Difference Vector Patching (MDVP)

Intervention method that adds the difference in mean activations between two conditions to a representation.
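The definition above can be illustrated with a minimal sketch. It assumes activations are plain lists of floats collected at a single layer, that the difference is taken as target mean minus source mean, and that an optional `scale` factor multiplies the vector before it is added. The function names, the sign convention, and `scale` are illustrative assumptions and do not come from the entry itself.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def mean_difference_vector(source_acts, target_acts):
    """Difference of condition means: mean(target) - mean(source).

    The direction (target minus source) is an assumed convention.
    """
    mu_source = mean_vector(source_acts)
    mu_target = mean_vector(target_acts)
    return [t - s for s, t in zip(mu_source, mu_target)]


def patch(representation, diff, scale=1.0):
    """Add the (optionally scaled) mean-difference vector to one representation."""
    return [h + scale * d for h, d in zip(representation, diff)]


# Toy data: two activation vectors per condition, two dimensions each.
acts_a = [[1.0, 0.0], [3.0, 2.0]]   # condition A, mean = [2.0, 1.0]
acts_b = [[0.0, 4.0], [2.0, 6.0]]   # condition B, mean = [1.0, 5.0]

diff = mean_difference_vector(acts_a, acts_b)   # [-1.0, 4.0]
patched = patch([2.0, 1.0], diff)               # [1.0, 5.0]
```

In this toy case, patching the condition-A mean lands exactly on the condition-B mean. For an individual input the patched vector need not resemble any natural condition-B activation, which is the concern raised by the divergence findings listed below.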

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Markov Decision Process (MDP) · framework · 0.768
Generative model substrate for active inference; discrete states, actions, outcomes, and temporal policies.
Mean-difference patching in a two-layer ReLU circuit flips the decision to class-A by activating a third hidden unit that is silent for all natural class-A inputs · finding · 0.729
Synthetic theoretical example showing pernicious divergence via hidden pathway activation
Mean difference patching on Llama-3-8B layer 10 produces intervened EMD exceeding the natural-natural baseline · finding · 0.717
Empirical demonstration that MDVP produces divergent representations in a real LLM
Steering vectors from µ(0→2) slightly outperform µ(1→2) for instruction discovery across datasets and models · finding · 0.713
Shows that contrasting No Reflection with Triggered Reflection provides a stronger signal than Intrinsic vs Triggered.
Multi-layer Perceptron (MLP) · method · 0.710
Feed-forward neural network with hidden layers, capable of representing non-linearly separable functions.
Partially Observable Markov Decision Process (POMDP) · framework · 0.704
Modeling framework for discrete state-space decision-making under uncertainty, used as generative model in active inference.
MD vectors outperform probe-based vectors in SJTs because they align with construct centroids rather than distorting direction via regularization · claim · 0.702
Mechanistic explanation for MDS superiority; attributed to two design choices: centroid alignment and full-utterance semantics in h_s
Contrastive mean-difference probe · method · 0.699
Probe construction method: concept vector at each layer is L2-normalized difference between mean positive and mean negative representations from contrastive system prompts
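
The probe construction described in the last entry (a per-layer concept vector equal to the L2-normalized difference between mean positive and mean negative representations) can be sketched as follows. This is a minimal illustration, assuming representations are stored as dictionaries mapping a layer index to a list of float vectors. The function names and data layout are hypothetical and are not taken from the referenced method.

```python
import math
from statistics import fmean


def concept_vector(pos_reps, neg_reps):
    """L2-normalized difference between mean positive and mean negative vectors."""
    mu_pos = [fmean(col) for col in zip(*pos_reps)]
    mu_neg = [fmean(col) for col in zip(*neg_reps)]
    diff = [p - n for p, n in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        # Identical means give no direction to normalize.
        raise ValueError("positive and negative means coincide")
    return [d / norm for d in diff]


def concept_vectors_by_layer(pos_by_layer, neg_by_layer):
    """Build one concept vector per layer (assumed layout: layer -> list of vectors)."""
    return {
        layer: concept_vector(pos_by_layer[layer], neg_by_layer[layer])
        for layer in pos_by_layer
    }


# Toy data for a single layer: means are [4.0, 4.0] and [1.0, 0.0].
pos = {0: [[3.0, 4.0], [5.0, 4.0]]}
neg = {0: [[0.0, 0.0], [2.0, 0.0]]}
vectors = concept_vectors_by_layer(pos, neg)   # layer 0 -> approximately [0.6, 0.8]
```

Unlike the patching vector, this vector has unit length, so it carries only a direction. Any magnitude used for steering or scoring would have to be chosen separately.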