Honest Baseline RL Agent

Blue agent trained with a standard proximity reward and no incentive to deceive

Neighborhood — ranked by edge-count

Concepts (1)

concept

Deceptive Baseline RL Agent
related_to
Blue agent trained with reward incentivizing trapping the red agent at the fake landmark

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SOO-trained RL agent behavior closely resembles honest baseline rather than deceptive baseline · finding · 0.851
Qualitative behavioral analysis showing SOO fine-tuning redirects deceptive RL agent toward honest behavior
Deceptive RL baseline agents have lower mean neural self-other overlap than honest baseline agents · claim · 0.796
Core empirical prediction tested in RL experiments, confirmed by 100% classification accuracy
Honesty Prompt Baseline · method · 0.739
Baseline comparison method where models are directly prompted to be honest rather than fine-tuned
SOO metric classifies deceptive vs honest RL agents with 100% ± 0% accuracy at 2000-2500 episodes across 1500 seeds · finding · 0.729
Mean SOO value perfectly distinguishes deceptive from honest baseline agents at high episode counts across many seeds
Reinforcement learning (RL) · concept · 0.713
Machine learning paradigm where agents learn to maximize cumulative reward through interaction.
agentic RL training can produce very diverse rollout scenarios with varied lengths (number of tool calls/turns) · quote · 0.712
Captures the core technical challenge addressed by length normalization and trajectory filtering.
Robust Agency · concept · 0.704
The ability to pursue goals via cognitive states and processes beyond minimal agency; includes intentional, reflective, and rational agency.
Successful RL agents exhibit causal emergence that predicts final reward early in training and aligns representational dynamics with reward improvement. · hypothesis · 0.703
Central finding: causal emergence serves as a previously undisclosed axis of neural representation reorganization in learning agents.