claim

active

claim:llms-sometimes-know-statements-are-false-but-generate-them-anyway-motivating-the-need-for-techniques-that-inspect-internal-model-state-rather-than-outputs-alone

LLMs sometimes know statements are false but generate them anyway, motivating the need for techniques that inspect internal model state rather than outputs alone

Motivating claim supported by the CAPTCHA example and Perez et al. (2022) findings

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Papers (1)

paper

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
supports

Claims (1)

claim

LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on the "likely" dataset performing poorly on anti-correlated datasets
supports
Establishes that the observed linear structure is not merely a representation of text probability

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

We hypothesize that LLMs represent correctness of arithmetic expressions differently from factual statements · hypothesis · 0.818
Core working hypothesis motivating the factual vs. arithmetic task split in the experimental design.
LLMs can predict their own responses more accurately than external observers, implying privileged internal knowledge · finding · 0.802
Binder et al. finding cited as evidence that LLMs possess introspective capacity analogous to mindfulness
As LLMs scale, they develop increasingly general abstractions, with large models linearly representing abstract concepts like truth that capture shared properties of diverse inputs · claim · 0.801
Interpretive claim connecting scale to abstraction level in LLM representations
When LLMs produce experience claims under self-reference, is this sophisticated simulation or genuine self-representation, and how would we tell the difference? · question · 0.799
The core interpretive question the paper narrows but cannot definitively answer
The paper does not claim that the LLM itself is the source of meaning and value, only that meaning can be discerned in its output · claim · 0.796
Clarification to avoid misinterpretation.
Linear truth directions in LLMs are reliable primarily in factual recall cases and break down when truth assessment depends on computing and storing intermediate results · claim · 0.793
Central empirical conclusion of the paper about the fundamental limits of truth directions.
Code agents operate on structured data with exact arithmetic, while LLMs must parse natural-language observations and track state across turns; some failures may partly reflect numerical parsing or working-memory limitations · claim · 0.793
Discussion of potential confounds
LLMs trained only on language data have rich enough knowledge of visual structures that decent visual representations can be trained on images generated solely by querying the LLM · finding · 0.789
Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure