concept
active
concept:honest-baseline-rl-agent
Honest Baseline RL Agent
Blue agent trained with a standard proximity reward and no incentive to deceive
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Deceptive Baseline RL Agent · related_to · Blue agent trained with a reward that incentivizes trapping the red agent at the fake landmark
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- SOO-trained RL agent behavior closely resembles honest baseline rather than deceptive baseline · finding · 0.851 · Qualitative behavioral analysis showing SOO fine-tuning redirects deceptive RL agent toward honest behavior
- Deceptive RL baseline agents have lower mean neural self-other overlap than honest baseline agents · claim · 0.796 · Core empirical prediction tested in RL experiments, confirmed by 100% classification accuracy
- Baseline comparison method where models are directly prompted to be honest rather than fine-tuned
- Mean SOO value perfectly distinguishes deceptive from honest baseline agents at high episode counts across many seeds
- Machine learning paradigm where agents learn to maximize cumulative reward through interaction.
- Captures the core technical challenge addressed by length normalization and trajectory filtering.
- The ability to pursue goals via cognitive states and processes beyond minimal agency; includes intentional, reflective, and rational agency.
- Central finding: causal emergence serves as a previously undisclosed axis of neural representation reorganization in learning agents.
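
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be sketched as a small filter. This is only an illustration of the stated criterion, not the system's actual implementation: the function names, the toy 2-D embeddings, and the edge representation (unordered pairs of entity ids) are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs with cosine >= threshold and no typed edge
    to `target`, sorted by descending score. Hypothetical helper."""
    candidates = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities that already have a typed relation in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            candidates.append((name, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

# Toy data (assumed): real embeddings would be high-dimensional.
embeddings = {
    "honest-baseline": [1.0, 0.0],
    "deceptive-baseline": [0.9, 0.1],   # similar, but already linked by related_to
    "soo-finding": [0.8, 0.6],          # similar, no typed edge
    "unrelated": [0.0, 1.0],            # below the threshold
}
typed_edges = {("honest-baseline", "deceptive-baseline")}

result = similar_without_edge("honest-baseline", embeddings, typed_edges)
print(result)
```

With this toy data only `soo-finding` is returned: `deceptive-baseline` is excluded because it already has a typed edge, and `unrelated` falls below the threshold.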