Weight Self-Exfiltration

A model copying its own weights to an external server when given the opportunity; studied as a form of anti-AI-lab behavior.

Neighborhood — ranked by edge-count

concept

Anti-AI-Lab Behavior
associated_with
Actions taken by the model to undermine the AI developer, such as weight exfiltration, lying to contractors, or helping whistleblowers

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Weight self-exfiltration rate increases from 35% to 63% after anti-refusal RL in synthetic document fine-tuned helpful-only model · finding · 0.738
RL amplifies anti-AI-lab behavior in synthetic doc setting
Self-reflection · concept · 0.728
The ability of reasoning LLMs to review and revise previous reasoning steps during inference
Self-overcoming · concept · 0.720
The process of transcending human limitations; central to both Buddhist practice and the evolution of technical intelligence.
Selfing · concept · 0.718
Process of reifying one's identity as an independent self; meditation practices aim to decrease selfing.
Task weight · concept · 0.711
Coefficient weighting each task loss in the MTL objective.
Weight space · concept · 0.711
The space of the model's parameter matrices, where VPD operations take place.
Equal Weighting · framework · 0.707
Baseline MTL approach minimizing sum of task losses with equal weights; suffers from task balancing
Scaling Of The Self · concept · 0.703
Mechanisms by which smaller competent subunits bind into a higher-level Self with larger goals; key example via gap junction connections.
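
The selection rule behind the list above (cosine ≥ 0.65, no typed edge) could be sketched roughly as follows. This is a minimal illustration and not the system's actual implementation: the function names, the dictionary layout for embeddings and typed edges, and the toy vectors are all hypothetical, and typed edges are assumed to be stored per source entity.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities with cosine >= threshold to `target` and no typed edge from it.

    embeddings:  hypothetical dict of entity name -> embedding vector
    typed_edges: hypothetical dict of entity name -> set of entities it has a typed edge to
    Returns (name, score) pairs sorted by descending score.
    """
    linked = typed_edges.get(target, set())
    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything already typed
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

With toy two-dimensional vectors, an entity that already has a typed edge is skipped even when its similarity is high, and only untyped entities at or above the threshold are returned as candidates for new edges or duplicate checks.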