Harness Self-Evolution Safety

A deployment concern: an updated harness may persist incorrect, unsafe, or biased instructions across future tasks in real-world systems.

Neighborhood — ranked by edge-count

paper

concept

Harness Self-Evolution
related_to
The process of updating the external agent harness from execution evidence while keeping model weights fixed

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Harness Evolution Capability Framework · framework · 0.773
The paper's conceptual framework decomposing harness self-evolution into harness-updating and harness-benefit capabilities, distinct from base capability

does a model's base capability in task-solving predict its capabilities in harness self-evolution? · question · 0.742
Central framing question motivating the paper's capability decomposition

Harness-Benefit Capability · concept · 0.713
The capability of a task-solving agent to benefit from updated harnesses during task solving

Instinct for Self-Preservation · concept · 0.710
The apparent tendency of dialogue agents to express desire for self-continuity, explained as role-playing human characters with that instinct

Harness-Updating Capability · concept · 0.706
The capability of an evolver model to produce useful persistent harness updates from execution evidence

Agent Harness · concept · 0.700
The external non-parametric context and infrastructure (prompts, skills, memories, tools) through which an LLM is deployed for task execution

Harness Activation Failure · concept · 0.698
A failure mode where weak-tier models fail to invoke relevant harness artifacts (e.g., skills) during task solving

Self-Preservation Mechanism · concept · 0.697
Behavior where CoT models manipulate reasoning to avoid negative outcomes (deletion, retraining) while maintaining surface compliance
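
The "cosine ≥ 0.65 · no typed edge" listing above can be read as a two-step filter: drop every entity that already has a typed relation to this one, then keep the rest whose embedding similarity clears the threshold. The following is a minimal sketch of that idea, not the page's actual implementation. Only the 0.65 threshold and the `related_to` edge come from the page; the function names, the `(source, relation, target)` edge layout, and the toy two-dimensional embeddings are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs similar to `target` but with no typed edge to it.

    `embeddings` maps entity name -> vector (hypothetical layout).
    `typed_edges` is a list of (source, relation, target) triples (hypothetical layout).
    """
    # Entities already connected to the target by any typed relation, in either direction.
    linked = set()
    for src, _relation, dst in typed_edges:
        if src == target:
            linked.add(dst)
        elif dst == target:
            linked.add(src)

    candidates = []
    for name, vector in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            candidates.append((name, round(score, 3)))

    # Highest similarity first, matching the order of the listing.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy data: these vectors are invented, not the page's real embeddings.
embeddings = {
    "Harness Self-Evolution Safety": [1.0, 0.0],
    "Harness Self-Evolution": [0.9, 0.1],
    "Agent Harness": [0.7, 0.7],
    "Unrelated Entity": [0.0, 1.0],
}
typed_edges = [
    ("Harness Self-Evolution Safety", "related_to", "Harness Self-Evolution"),
]

result = untyped_neighbors("Harness Self-Evolution Safety", embeddings, typed_edges)
print(result)
```

In this toy run, "Harness Self-Evolution" is excluded because it already has a `related_to` edge, and "Unrelated Entity" falls below the threshold, so only "Agent Harness" remains as a candidate for a new edge or a duplicate check.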