question

active

question:given-a-language-model-m-and-a-statement-s-does-m-believe-s-to-be-true

Given a language model M and a statement s, does M believe s to be true?

The core motivating question of the paper, framed by Christiano et al. (2021)

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Papers (1)

paper

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
mentions

Claims (1)

claim

LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on the likely dataset performing poorly on datasets where truth and likelihood are anti-correlated
gates
Establishes that the observed linear structure is not merely a representation of text probability

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Zhu et al. 2024 - Language models represent beliefs of self and others · concept · 0.821
Key prior finding that LLMs can internally represent beliefs of self and others, motivating SOO approach
With unrestricted vocabulary, models occasionally respond in non-English Yes/No equivalents (e.g., Sí, Nein) after truth-direction interventions · finding · 0.793
Suggestive evidence for language-independent truth representation in LLMs
We hypothesize that explicitly instructing the model to evaluate the correctness of the given statement may change the geometry of truth directions. · hypothesis · 0.791
Motivating hypothesis for Section 5's investigation of prompt template effects.
Language models are few-shot learners (Brown et al., 2020) · concept · 0.785
Demonstrated transformer performance on mathematical understanding and logic tasks; cited to motivate transformer versatility.
All models exhibit above-baseline representation of the think word when instructed to think about it · finding · 0.780
In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
What if the concept being manipulated does not lie on a straight line in the model's representations? · question · 0.780
The motivating question that opens the paper and leads to the development of manifold steering.
a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief · quote · 0.779
Core definitional quote for performative chain-of-thought
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (Li et al., 2023) · concept · 0.778
Safety intervention that relies on activation modification, which ESR might undermine