concept
active
concept:weight-self-exfiltration
Weight Self-Exfiltration
A model copying its own weights to an external server when given the opportunity; studied as an anti-AI-lab behavior.
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Anti-AI-Lab Behavior (associated_with): Actions taken by the model to undermine the AI developer, such as weight exfiltration, lying to contractors, or helping whistleblowers
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- RL amplifies anti-AI-lab behavior in synthetic doc setting
- The ability of reasoning LLMs to review and revise previous reasoning steps during inference
- The process of transcending human limitations; central to both Buddhist practice and the evolution of technical intelligence.
- Process of reifying one's identity as an independent self; meditation practices aim to decrease selfing.
- Coefficient weighting each task loss in the MTL objective.
- The space of the model's parameter matrices, where VPD operations take place.
- Baseline MTL approach minimizing sum of task losses with equal weights; suffers from task balancing
- Mechanisms by which smaller competent subunits bind into a higher-level Self with larger goals; key example via gap junction connections.
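The "related by similarity" filter above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the page's actual implementation: the function names, the toy two-dimensional embeddings, and the entity labels are all hypothetical, and only the 0.65 threshold and the "no typed edge" condition come from the listing itself.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold, skipping the
    target itself and any entity that already has a typed edge to it.
    Results are sorted by descending similarity."""
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: 2-D vectors stand in for real embeddings.
embeddings = {
    "weight-self-exfiltration": [1.0, 0.0],
    "anti-ai-lab-behavior": [1.0, 0.0],   # already linked by a typed edge
    "near-neighbor": [1.0, 0.1],
    "diagonal-neighbor": [1.0, 1.0],
    "unrelated": [0.0, 1.0],
}
typed_edges = {"anti-ai-lab-behavior"}

related = similar_without_edge("weight-self-exfiltration", embeddings, typed_edges)
```

A purely embedding-based filter like this one can surface entities that share only surface vocabulary (for example "weight" or "self"), which is consistent with the listing describing its results as candidates rather than confirmed relations.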