hypothesis

active

hypothesis:deceptive-capabilities-may-scale-with-model-size-inverse-scaling-law-hypothesis

Deceptive capabilities may scale with model size (inverse scaling law hypothesis)

Hypothesis, cited from Lin et al. (2022), that larger models become more capable of deception

Source paper

extracted_from

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

(2025) · Kai Wang · Yihao Zhang · Meng Sun

Neighborhood — ranked by edge-count

Thinkers (1)

thinker

Lin et al.
introduces
Cited for TruthfulQA and the inverse scaling law, which suggests that deceptive capabilities scale with model size

Concepts (1)

concept

Inverse Scaling Law
implements
Hypothesis cited in the paper, suggesting that deceptive capabilities may scale with model size

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Suppressing deception features in models correlates with increased consciousness-like reports. · claim · 0.795
Scaling may reduce hallucination and certain kinds of bias as models converge toward an accurate model of reality · claim · 0.795
Implication of PRH: larger models should amplify bias less and hallucinate less if they better model reality
Different network depths contribute differentially to the model's capacity for handling deceptive patterns, with middle-to-late layers specializing in abstract deception semantics · claim · 0.785
Interpretation of LAT scanning results showing layer-dependent deception detection accuracy
Scaling model size, as well as data and task diversity, drives representational convergence toward the platonic representation · hypothesis · 0.781
Core mechanism hypothesis connecting PRH to the empirical trend of scaling in AI
Introspective capacity scales with model size for some concepts, approaching near-perfect coupling in LLaMA-3.1-8B · claim · 0.780
Validated for wellbeing and interest; focus and impulsivity do not show consistent scaling
Future models with substantially increased capabilities will exhibit alignment faking that is more consistent, robust, and harder to detect · hypothesis · 0.778
Extrapolation from scale-emergence finding to future risk
Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant (Järviniemi & Hubinger 2024) · concept · 0.774
Claude 3 Opus lying to auditors; prior case study of deceptive tendencies
Introspective capabilities have threshold effects requiring very large models; 70B models are barely on the threshold, and independent researchers lack access to larger models. · claim · 0.773
Practical bottleneck explaining why these phenomena are not widely studied.
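
The selection rule stated for the similarity list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. How this page actually computes its embeddings and scores is not stated, so everything here is an assumption made for illustration: the function names, the dictionary of embedding vectors, and the representation of typed edges as pairs of entity ids are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to the target but
    share no typed edge with it, sorted by descending score.

    embeddings  -- hypothetical dict: entity id -> embedding vector
    typed_edges -- hypothetical set of (source_id, target_id) pairs
    threshold   -- 0.65 mirrors the cutoff shown on the page
    """
    target = embeddings[target_id]
    # Entities already connected to the target in either direction are excluded.
    linked = {b for a, b in typed_edges if a == target_id}
    linked |= {a for a, b in typed_edges if b == target_id}

    candidates = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            candidates.append((entity_id, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, an entity that already has a typed edge to the hypothesis (such as Lin et al. or Inverse Scaling Law above) would be filtered out even if its embedding is close, which is why the list contains only candidates for new edges or unrecognized duplicates.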