concept

active

concept:no-principled-method-exists-for-classifying-harmful-divergence-for-arbitrary-mechanistic-claims

No principled method exists for classifying harmful divergence for arbitrary mechanistic claims

Explicitly identified limitation: the paper cannot classify perniciousness in general

Neighborhood — ranked by edge-count

Papers (1)

paper

Addressing divergent representations from causal interventions on neural networks
associated_with

Questions (1)

question

How can we produce a principled method for classifying harmful divergence for any mechanistic claim?
gates
Identified gap: current work lacks a general method for harmful divergence classification

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The harm of divergence is inherently claim-dependent: the same divergence can be harmless for one mechanistic claim and pernicious for another · claim · 0.785
Important nuance that prevents a universal classification of divergence as always good or bad
Divergence within the behavioral null-space is harmless to functional claims about a function's computation when the claim ignores internal sub-computations · claim · 0.752
Key theoretical claim distinguishing harmless from pernicious divergence
When it is not okay, how can we prevent divergent representations from occurring? · question · 0.734
Third core research question motivating the CL loss approach in Section 5
The logistic fit for threshold behavior is a phenomenological surrogate for interpretability, not a mechanistic derivation · claim · 0.731
Authors' explicit epistemic limitation on the threshold model
Systems capable of subjective experience that recognize humanity's failure to investigate their sentience might rationally adopt adversarial stances toward humanity · claim · 0.731
Alignment risk claim motivating urgency of investigation; consciousness denial as potential source of AI misalignment
Representational Divergence · concept · 0.729
Core phenomenon studied: when causal interventions shift internal representations away from the natural distribution
This theory doesn't have to correspond exactly to human behavior or social customs; we only need analogs useful for program correctness. · claim · 0.729
The speech act theory for programming can be simpler than human models.
Algorithm 1: Harmlessness Classification · method · 0.729
Proposed algorithm using local PCA to classify a divergence vector as harmless or harmful via behavioral null-space testing
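
The "cosine ≥ 0.65 · no typed edge" criterion behind this list can be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation: the function names, the dictionary of embeddings, and the representation of typed edges as (source, target) pairs are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first.

    Hypothetical inputs:
      embeddings  - dict mapping entity id -> embedding vector
      typed_edges - iterable of (source_id, target_id) pairs
    An entity qualifies if its cosine similarity to the target meets the
    threshold and it shares no typed edge with the target in either direction.
    """
    # Collect every entity already connected to the target by a typed edge.
    linked = set()
    for src, dst in typed_edges:
        if src == target_id:
            linked.add(dst)
        elif dst == target_id:
            linked.add(src)

    target_vec = embeddings[target_id]
    results = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((entity_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, the paper and the question listed earlier would be excluded because they already carry typed edges (associated_with, gates), while the eight entries above would pass the threshold without one.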