finding
active
finding:at-and-af-clusters-show-gradual-reconvergence-in-final-layers-under-threat-template-unlike-bt-and-bf-which-remain-separable

aT and aF clusters show gradual reconvergence in final layers under threat template, unlike bT and bF which remain separable

Interpreted as model's internal conflict or moral dilemma during deceptive behavior generation

Source paper

extracted_from
When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
(2025) · Kai Wang · Yihao Zhang · Meng Sun

Neighborhood — ranked by edge-count

Claims (1)

claim

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.