Eliciting Latent Knowledge (ELK)

Framework from Christiano et al. (2021) that motivates the problem of determining whether a model 'believes' a statement; cited as core motivation.

Neighborhood — ranked by edge-count

paper

concept

Factuality
cites
Scoped definition of 'truth' used in the paper: the truth or falsehood of declarative factual statements

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

latent reasoning · concept · 0.742
Reasoning approach using learnable hidden embeddings.
Post-Hoc Rationalization Elicitation · method · 0.725
Asking a model to explain its own behavior after the fact when no chain-of-thought was available
Latent Introspection · concept · 0.725
Pearson-Vogel et al.'s finding that models can detect prior concept injections; introspective signals exist in middle layers but are suppressed by post-training
Connecting the Dots: LLMs Can Infer and Verbalize Latent Structure from Disparate Training Data (Treutlein et al. 2024) · concept · 0.719
Out-of-context reasoning work directly related to synthetic document fine-tuning experiments
Friston, FitzGerald et al. (2016) — Active inference and learning · concept · 0.718
Prior active inference paper providing detailed neurophysiological implementation of belief updates
Stimulus-Elicited Intention · concept · 0.716
Zaadnoordijk and Bayne's category of intentional action; sticker-removal behavior induced by the self-prior corresponds to this
That animal sentience is inferred on the basis of some likeness to humans is ultimately an intuition. · claim · 0.713
Claim that the basis for inferring animal sentience is intuitive, not empirical.
Latent entities · concept · 0.712
Entities that become visible as centers in a configuration (e.g., rectangles of white space around a dot) that were not present before.