concept
active
concept:behavioral-retention
Behavioral Retention
The preservation of unrelated model capabilities after a targeted intervention, operationalized as the KL divergence between original and post-intervention outputs on Alpaca prompts.
Neighborhood — ranked by edge-count
Methods (1)
method
- KL Divergence Retention Evaluation · implements · Measuring the KL divergence between original and post-intervention outputs on Alpaca prompts to assess behavioral preservation
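
The retention measure described above can be sketched as follows. This is a minimal illustration and not the evaluation's actual implementation: it assumes each prompt yields one vector of next-token logits from the original model and one from the post-intervention model, that the divergence is taken in the direction KL(original || post-intervention), and that the per-prompt values are averaged. The function names (`softmax`, `kl_divergence`, `mean_retention_kl`) are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # eps guards against log(0) when q assigns zero mass to an outcome
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def mean_retention_kl(original_logits, intervened_logits):
    """Average KL(original || intervened) over a batch of prompts.

    Each argument is a list with one logit vector per prompt
    (assumed layout; a real evaluation may average over token positions too).
    """
    if len(original_logits) != len(intervened_logits):
        raise ValueError("both models must be scored on the same prompts")
    kls = [kl_divergence(softmax(o), softmax(i))
           for o, i in zip(original_logits, intervened_logits)]
    return sum(kls) / len(kls)


# Toy logits standing in for model outputs on two prompts (made-up values)
original = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
intervened = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
print(mean_retention_kl(original, intervened))  # identical outputs give 0.0
```

Under these assumptions, a value near zero indicates that the intervention left the output distribution on unrelated prompts essentially unchanged, while larger values indicate drift.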
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Measurable capacity of frontier LLMs to detect and report their own internal states, used as a downstream measure in Experiment 4
- The path traced through output probability distribution space as interventions are applied to activations
- Perturbations behaviorally null in one context but altering behavior in another due to latent divergence
- Organism's belief-guided action selection that instantiates generative model and maintains phenotypic states
- Grouping similar model behaviors; the unsupervised method surfaces clusters of concerning patterns.
- Emotion feature persistence above and beyond the persistence expected from high variance explained alone, computed by subtracting median variance-matched probe persistence
- Tests like Turing test, Artificial Consciousness Test; argued to be unreliable for AI due to mimicry.
- A parameterized rubric counting deceptive actions over a grid of parameters to quantify RL agent deception