hypothesis
active
We hypothesize that group (b) hidden states store a representation of the statement's truth
Motivating hypothesis driving the remainder of the paper's analysis after patching localization
Source paper
extracted_from: (2023) · Samuel Marks · Max Tegmark
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (3)
finding
- Patching experiments localize truth representations to the group (b) hidden states in LLaMA-2 models
- Mass-mean probe directions outperform LR and CCS in causal intervention experiments (NIE) in 7/8 experimental conditions. associated_with: Core result showing MM is superior to LR for causal implication despite similar classification accuracy
- Localizes truth representations to specific hidden states, motivating the rest of the analysis
Methods (1)
method
- Method of shifting hidden state activations along probe directions to cause the model to treat false statements as true and vice versa; evaluated on OOD inputs
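The intervention method above can be illustrated with a minimal NumPy sketch. It is not the authors' code: it assumes the mass-mean direction is the difference between the mean activation of true statements and the mean activation of false statements, and it uses synthetic 4-dimensional "hidden states" in place of real LLaMA-2 activations. The function names `mass_mean_direction` and `shift_along_direction` and the scale parameter `alpha` are hypothetical, chosen for illustration only.

```python
import numpy as np


def mass_mean_direction(acts, labels):
    """Difference between the class means of the activations (true minus false)."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    return acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)


def shift_along_direction(hidden, direction, alpha=1.0):
    """Add alpha * direction to a hidden state; a negative alpha pushes toward 'false'."""
    return np.asarray(hidden, dtype=float) + alpha * np.asarray(direction, dtype=float)


# Synthetic stand-in data (assumption): truth is encoded along the first axis.
rng = np.random.default_rng(0)
true_axis = np.array([1.0, 0.0, 0.0, 0.0])
labels = np.array([True] * 50 + [False] * 50)
acts = rng.normal(scale=0.1, size=(100, 4)) + np.where(labels[:, None], true_axis, -true_axis)

theta = mass_mean_direction(acts, labels)

# Shift the hidden state of one false statement toward the "true" side.
false_state = acts[~labels][0]
shifted = shift_along_direction(false_state, theta, alpha=1.0)

print("projection before:", false_state @ theta)
print("projection after: ", shifted @ theta)
```

In a real experiment the shifted state would be written back into the model's forward pass at the chosen layer and token positions, and the effect would be measured on the model's output probabilities rather than on a projection.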
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Localization result from patching experiments; identifies group (b) hidden states as the locus of truth representations
- The inferential interpretation of internal dynamics.
- Latent variables causing observations in the generative model.
- The testable hypothesis driving the active inference analysis in the simulation.
- Tested in Section 4.4 calibration experiment; confirmed by findings.
- Proposed explanation for why emotion probes are more persistent than variance-matched random probes
- Interpretive synthesis of DIM and cone intervention successes
- Motivating hypothesis for Section 5's investigation of prompt template effects.