concept

active

concept:jonason-et-al-2014-what-a-tangled-web-we-weave-the-dark-triad-traits-and-deception

Jonason et al. 2014 - What a tangled web we weave: The dark triad traits and deception

Behavioral finding linking psychopathic traits to increased deception

Neighborhood — ranked by edge-count

Papers (1)

paper

Towards Safe and Honest AI Agents with Neural Self-Other Overlap
cites

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Different network depths contribute differentially to the model's capacity for handling deceptive patterns, with middle-to-late layers specializing in abstract deception semantics · claim · 0.739
Interpretation of LAT scanning results showing layer-dependent deception detection accuracy
probably helps not only with faithful reconstruction but also creates interference patterns that encode nuanced information about the deltas and convergences between states. · quote · 0.718
Key quote connecting path redundancy to interferometric information encoding.
Three-Phase Layer Dynamics of Instructed Deception · concept · 0.714
Prior finding by Yang & Buzsaki and Campbell et al. on how deception representations evolve across layers; partially replicated and contrasted by this paper
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024) · concept · 0.713
Explicitly trained backdoored models to produce alignment-faking reasoning; contrast to naturalistic approach here
Hubinger et al. 2024 - Sleeper agents: Training deceptive LLMs that persist through safety training · concept · 0.713
Key reference for adversarial deception scenarios that SOO should be tested against
Patching h[1] with a divergent representation can activate distinct, hidden pathways that result in misleadingly confirmatory behavior and/or undetected behavior. · quote · 0.709
Load-bearing description of the core pernicious divergence mechanism illustrated in Figure 1
A small group of hidden states (group b) over end-of-sentence punctuation tokens is highly causally implicated in truth judgments · finding · 0.707
Patching experiments localize truth representations to these specific hidden states in LLaMA-2 models
What if the concept being manipulated does not lie on a straight line in the model's representations? · question · 0.700
The motivating question that opens the paper and leads to the development of manifold steering.