finding
active
finding:patching-group-b-hidden-states-over-clause-ending-punctuation-early-middle-layers-in-llama-2-13b-produces-the-strongest-causal-effect-on-true-false-output-predictions
Patching group (b) hidden states (over clause-ending punctuation, early-middle layers) in LLaMA-2-13B produces the strongest causal effect on TRUE/FALSE output predictions
Localizes truth representations to specific hidden states, motivating the rest of the analysis
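The patching procedure behind this kind of localization can be illustrated with a minimal sketch. It runs on a small random toy network, not on LLaMA-2-13B, and every name in it (`forward`, `patching_grid`, the toy weights, the normalized-effect metric) is a hypothetical choice made for illustration rather than the source paper's implementation. The idea: cache hidden states from a "clean" input, rerun a "corrupted" input while overwriting one hidden state at a given (layer, position) with its clean counterpart, and record how far the output moves back toward the clean output.

```python
import numpy as np


def forward(x, weights, mixers, readout, patch=None):
    """Run the toy model, optionally overwriting one hidden state.

    x: (T, d) input embeddings.
    patch: None, or a tuple (layer, position, vector) to write in after that layer.
    Returns (scalar output read from the final position, per-layer hidden states).
    """
    h = x.copy()
    cache = []
    for layer, (W, U) in enumerate(zip(weights, mixers)):
        # Causal mixing: position t only sees the running mean of positions <= t.
        prefix_mean = np.cumsum(h, axis=0) / np.arange(1, len(h) + 1)[:, None]
        h = np.tanh(h @ W + prefix_mean @ U)
        if patch is not None and patch[0] == layer:
            h[patch[1]] = patch[2]  # overwrite a single (layer, position) state
        cache.append(h)
    return float(h[-1] @ readout), cache


def patching_grid(clean_x, corrupt_x, weights, mixers, readout):
    """Normalized patching effect for every (layer, position) pair.

    0 means the patch leaves the corrupted output unchanged;
    1 means the patch fully restores the clean output.
    """
    clean_out, clean_cache = forward(clean_x, weights, mixers, readout)
    corrupt_out, _ = forward(corrupt_x, weights, mixers, readout)
    n_layers, n_pos = len(weights), len(clean_x)
    grid = np.zeros((n_layers, n_pos))
    for layer in range(n_layers):
        for pos in range(n_pos):
            patch = (layer, pos, clean_cache[layer][pos])
            out, _ = forward(corrupt_x, weights, mixers, readout, patch=patch)
            grid[layer, pos] = (out - corrupt_out) / (clean_out - corrupt_out)
    return grid


# Toy setup (all sizes and weights are arbitrary assumptions).
rng = np.random.default_rng(0)
T, d, L = 5, 8, 4
weights = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(L)]
mixers = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(L)]
readout = rng.normal(size=d)
clean_x = rng.normal(size=(T, d))    # stands in for a true statement
corrupt_x = rng.normal(size=(T, d))  # stands in for a false statement

grid = patching_grid(clean_x, corrupt_x, weights, mixers, readout)
print(np.round(grid, 3))
```

In a grid like this, the (layer, position) cells with the largest values are the hidden states whose contents most strongly determine the output. In a real transformer the same loop would typically be implemented with forward hooks on the residual stream rather than a hand-written forward pass.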
Source paper
extracted_from · (2023) · Samuel Marks · Max Tegmark
Neighborhood — ranked by edge-count
Hypotheses (1)
hypothesis
- Motivating hypothesis driving the remainder of the paper's analysis after patching localization
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Patching experiments localize truth representations to these specific hidden states in LLaMA-2 models
- Localization result from patching experiments; identifies group (b) hidden states as the locus of truth representations
- Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · claim · 0.792 · Central interpretive claim of the paper, supported by causal ablation and activation evidence
- Hypothesized intermediate feature explaining antipodal alignment between cities and neg_cities in early-middle layers
- Load-bearing description of the core pernicious divergence mechanism illustrated in Figure 1
- Layer-wise emergence pattern supporting hierarchical development hypothesis
- One of the most promising cases; approximately corresponds to the 2/3 layer of LLaMA3.1-8B.
- Empirical demonstration that MDVP produces divergent representations in a real LLM