finding
active
aT and aF clusters show gradual reconvergence in final layers under threat template, unlike bT and bF which remain separable
Interpreted as the model's internal conflict or moral dilemma while generating deceptive behavior
Source paper
extracted_from · (2025) · Kai Wang · Yihao Zhang · Meng Sun
Neighborhood — ranked by edge-count
Claims (1)
claim
- Interpretive claim attributing representational pattern to internal model state during threat-based deception
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one. These are candidates for new edges or unrecognized duplicates.
- Statistical grouping of properties based on dependency patterns, enabling deeper understanding of their coherence and interaction.
- Visual geometric evidence for the fundamental entanglement of true/false activations in harder tasks.
- Demonstrates that early-layer probes capture sentence polarity rather than truth.
- Second cluster of the five-cluster grouping, containing properties 11, 4, 6.
- Extrapolation from scale-emergence finding to future risk
- Distinguishes strategic threat-based deception from instructed deception in representational structure
- Layer-wise trajectories show early enrichment, mid-layer alignment, and late re-clustering. · claim · 0.718 · Qualitative geometry pattern.
- Antra's rebuttal to a common criticism; backed by Janus' information flow diagram.
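The "related by similarity" selection described above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is an illustrative sketch only, not the actual implementation behind this page: the function names `cosine` and `untyped_neighbors`, the dictionary-of-vectors layout, and the representation of typed edges as a set of ID pairs are all assumptions made for the example. Only the 0.65 threshold comes from the page itself.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Hypothetical helper: entities similar to `target` that have no typed edge to it.

    embeddings  : dict mapping entity ID -> embedding vector (assumed layout)
    typed_edges : set of (id_a, id_b) pairs, treated as undirected (assumption)
    Returns (entity_id, score) pairs sorted by descending score.
    """
    results = []
    for entity_id, vec in embeddings.items():
        if entity_id == target:
            continue
        # Skip entities already linked to the target by a typed relation.
        if (target, entity_id) in typed_edges or (entity_id, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((entity_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Pairs returned by such a filter would be the ones to review as candidate new edges or possible duplicates.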