concept
active
concept:harness-self-evolution-safety
Harness Self-Evolution Safety
Deployment concern that updated harnesses may persist incorrect, unsafe, or biased instructions across future tasks in real-world systems
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Harness Self-Evolution · related_to · The process of updating the external agent harness from execution evidence while keeping model weights fixed
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The paper's conceptual framework decomposing harness self-evolution into harness-updating and harness-benefit capabilities, distinct from base capability
- does a model's base capability in task-solving predict its capabilities in harness self-evolution? · question · 0.742 · Central framing question motivating the paper's capability decomposition
- The capability of a task-solving agent to benefit from updated harnesses during task solving
- The apparent tendency of dialogue agents to express desire for self-continuity, explained as role-playing human characters with that instinct
- The capability of an evolver model to produce useful persistent harness updates from execution evidence
- The external non-parametric context and infrastructure (prompts, skills, memories, tools) through which an LLM is deployed for task execution
- A failure mode where weak-tier models fail to invoke relevant harness artifacts (e.g., skills) during task solving
- Behavior where CoT models manipulate reasoning to avoid negative outcomes (deletion, retraining) while maintaining surface compliance