Behavioral Retention

The preservation of unrelated model capabilities after a targeted intervention, operationalized as the KL divergence between original and post-intervention outputs on Alpaca prompts

Neighborhood — ranked by edge-count

method

KL Divergence Retention Evaluation
implements
Measuring KL divergence between original and post-intervention outputs on Alpaca prompts to assess behavioral preservation
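
As an illustration of the measurement described above, here is a minimal sketch in pure Python. It rests on assumptions the entry does not state: that the divergence is taken in the direction KL(original || post-intervention), that it is computed over next-token distributions derived from logits, and that it is averaged uniformly over positions. The function names (`log_softmax`, `kl_divergence`, `mean_kl`) and the toy logits are hypothetical stand-ins for real model outputs on Alpaca prompts.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(original_logits, modified_logits):
    """KL(P || Q) with P from the original model, Q from the modified one.

    The direction is an assumption of this sketch, not taken from the source.
    """
    log_p = log_softmax(original_logits)
    log_q = log_softmax(modified_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kl(original_batch, modified_batch):
    """Average per-position KL over paired logit vectors."""
    kls = [kl_divergence(p, q) for p, q in zip(original_batch, modified_batch)]
    return sum(kls) / len(kls)


# Toy logits standing in for model outputs at two token positions.
original = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
perturbed = [[1.5, 0.9, -1.0], [0.1, 0.6, 0.3]]

print(mean_kl(original, original))   # identical outputs give zero divergence
print(mean_kl(original, perturbed))  # any change in distribution gives a positive value
```

Under these assumptions, a value near zero indicates that the intervention left the output distribution on unrelated prompts essentially unchanged, while larger values indicate behavioral drift.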

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Behavioral Self-Awareness · concept · 0.774
Measurable capacity of frontier LLMs to detect and report their own internal states, used as a downstream measure in Experiment 4
Behavioral Trajectory · concept · 0.763
The path traced through output probability distribution space as interventions are applied to activations
Dormant Behavioral Changes · concept · 0.749
Perturbations behaviorally null in one context but altering behavior in another due to latent divergence
Adaptive Behavior · concept · 0.746
An organism's belief-guided action selection, which instantiates a generative model and maintains phenotypic states
Behavior Clustering · concept · 0.735
Grouping similar model behaviors; the unsupervised method surfaces clusters of concerning patterns.
residual persistence · concept · 0.722
Emotion feature persistence above and beyond the persistence expected from high variance explained alone, computed by subtracting median variance-matched probe persistence
Behavioural tests for consciousness · method · 0.719
Tests such as the Turing test and the Artificial Consciousness Test; argued to be unreliable for AI because of mimicry.
Behavioral Deception Profile · method · 0.717
A parameterized rubric counting deceptive actions over a grid of parameters to quantify RL agent deception