finding

active

finding:lat-classifiers-perform-worst-on-the-companions-dataset-weakest-model-cognition-domain-while-achieving-100-f1-on-facts-and-animals-datasets

LAT classifiers perform worst on the Companions dataset (weakest model cognition domain) while achieving 100% F1 on Facts and Animals datasets

Shows a strong correlation between layer-wise representations and domain-specific semantic understanding

Source paper

extracted_from

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

(2025) · Kai Wang · Yihao Zhang · Meng Sun

Neighborhood — ranked by edge-count

Claims (1)

claim

Different network depths contribute differentially to the model's capacity for handling deceptive patterns, with middle-to-late layers specializing in abstract deception semantics
supports
Interpretation of LAT scanning results showing layer-dependent deception detection accuracy
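
As a rough illustration of what a layer-wise scan of this kind involves, the sketch below fits one linear reading direction per layer and reports held-out accuracy. It is not the paper's procedure: the activations are synthetic, the function names (`reading_vector`, `layer_accuracy`, `scan_layers`) and the per-layer signal strengths are hypothetical, and the direction is taken from an uncentered SVD of paired differences as a simplified stand-in for a LAT-style reading vector.

```python
import numpy as np

def reading_vector(honest, deceptive):
    """Top singular direction of the uncentered paired differences,
    oriented so that 'deceptive' projects higher than 'honest'."""
    diffs = deceptive - honest                      # shape: (n_pairs, hidden_dim)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]
    if (diffs @ direction).mean() < 0:              # resolve the SVD sign ambiguity
        direction = -direction
    return direction

def layer_accuracy(train_h, train_d, test_h, test_d):
    """Fit a direction on training pairs, then classify held-out activations
    by thresholding their projection at the midpoint of the class means."""
    direction = reading_vector(train_h, train_d)
    threshold = 0.5 * ((train_h @ direction).mean() + (train_d @ direction).mean())
    correct_h = (test_h @ direction) < threshold
    correct_d = (test_d @ direction) >= threshold
    return float(np.concatenate([correct_h, correct_d]).mean())

def scan_layers(signal_strengths, n_pairs=200, hidden_dim=16, seed=0):
    """Synthetic stand-in for a layer scan: each 'layer' is Gaussian noise plus
    a deception axis whose strength is an assumed, hypothetical value."""
    rng = np.random.default_rng(seed)
    accuracies = []
    for strength in signal_strengths:
        axis = rng.normal(size=hidden_dim)
        axis /= np.linalg.norm(axis)

        def sample():
            honest = rng.normal(size=(n_pairs, hidden_dim))
            deceptive = rng.normal(size=(n_pairs, hidden_dim)) + strength * axis
            return honest, deceptive

        train_h, train_d = sample()
        test_h, test_d = sample()
        accuracies.append(layer_accuracy(train_h, train_d, test_h, test_d))
    return accuracies

if __name__ == "__main__":
    # Hypothetical strengths for an early, a middle, and a late layer.
    for layer, acc in enumerate(scan_layers([0.0, 1.0, 6.0])):
        print(f"layer {layer}: held-out accuracy {acc:.3f}")
```

With real model activations, the per-layer accuracies would replace the synthetic ones, and a peak in the middle-to-late layers would be the pattern the claim describes.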

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Probes trained on the likely dataset perform worse than chance on datasets with anti-correlations between text probability and truth · finding · 0.748
Shows that truth representations are not reducible to text probability representations
Normal (α=0.9) and chronic (α=0.1) agents in Objective-only non-stationary category perform best with opposite learning rates · finding · 0.740
Suggests fundamental differences in learning dynamics between normal and chronic perception models
The examples of features found in language models suggest they are highly sparse variables, consistent with dictionary learning being applicable · hypothesis · 0.740
Motivation for using sparsity-based dictionary learning on language models
Weak-tier models often fail to invoke relevant harness artifacts during task-solving, with Qwen3-32B showing a 25% load rate against ~96% for strong models · claim · 0.738
Diagnosis of first failure mode explaining low harness-benefit for weak-tier models
Logistic regression fails to identify the true feature direction when a confounding feature is non-orthogonal to the truth direction, converging instead to the maximum margin separator · claim · 0.736
Motivates the introduction of mass-mean probing as an alternative to LR
F0-trained probes in layers 4-10 show inverted separation on F1 (AUROC ≈ 0), systematically misclassifying true statements as false · finding · 0.733
Demonstrates that early-layer probes capture sentence polarity rather than truth.
Connecting the Dots: LLMs Can Infer and Verbalize Latent Structure from Disparate Training Data (Treutlein et al. 2024) · concept · 0.731
Out-of-context reasoning work directly related to synthetic document fine-tuning experiments
Sparse autoencoders produce interpretable features for large models · claim · 0.730
Central claim of the paper: the method scales to state-of-the-art transformers.
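
The similarity neighborhood above is described only by its rule (cosine ≥ 0.65, no typed edge). A minimal sketch of such a filter follows, assuming each entity has an embedding vector. The function name `similar_untyped`, the entity ids, and the example vectors are all hypothetical; how this page actually computes its embeddings is not stated.

```python
import numpy as np

def similar_untyped(query, embeddings, typed_ids, threshold=0.65):
    """Return (entity_id, cosine) pairs at or above the threshold,
    skipping entities that already share a typed edge with the query."""
    q = np.asarray(query, dtype=float)
    q = q / np.linalg.norm(q)
    results = []
    for entity_id, vec in embeddings.items():
        if entity_id in typed_ids:
            continue                                  # already linked by a typed edge
        v = np.asarray(vec, dtype=float)
        score = float(q @ (v / np.linalg.norm(v)))    # cosine similarity
        if score >= threshold:
            results.append((entity_id, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(results, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    # Toy two-dimensional embeddings, for illustration only.
    toy = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0], "d": [2.0, 0.1]}
    for entity_id, score in similar_untyped([1.0, 0.0], toy, typed_ids={"d"}):
        print(entity_id, round(score, 3))
```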