finding

active

finding:an-intervention-benign-at-context-v4-0-75-produces-a-class-c-behavioral-flip-at-0-75-v4-1-demonstrating-dormant-behavioral-changes-from-latent-divergence

An intervention benign at context v4 < 0.75 produces a class-C behavioral flip at 0.75 < v4 < 1, demonstrating dormant behavioral changes from latent divergence

Synthetic example showing an intervention that appears safe in tested contexts but causes behavior changes in others
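
The idea can be made concrete with a minimal toy sketch. This is not the paper's actual construction: the 2-D latent, the ReLU readout, the `delta` shift, and all function names are hypothetical and chosen only for illustration. The context variable v4 and the 0.75 boundary are taken from the finding's title. The intervention pushes the latent off its natural manifold; the readout ignores that shift whenever v4 < 0.75 but flips to class C once 0.75 < v4 < 1.

```python
import numpy as np

THRESHOLD = 0.75  # context boundary taken from the finding's title


def hidden(v4):
    # Hypothetical 2-D latent: natural states lie on the line h[1] == 0.
    return np.array([v4, 0.0])


def intervene(h, delta=1.0):
    # Hypothetical intervention: shifts the latent off the natural
    # manifold along h[1] (a stand-in for "latent divergence").
    return h + np.array([0.0, delta])


def readout(h):
    # Toy ReLU unit: fires only when h[0] + h[1] exceeds 1 + THRESHOLD.
    # On natural states (h[1] == 0, v4 < 1) it never fires.
    logit_c = max(0.0, h[0] + h[1] - (1.0 + THRESHOLD))
    return "C" if logit_c > 0 else "A"


def flip_rate(contexts):
    # Fraction of contexts where the intervention changes the output class.
    flips = [readout(hidden(v)) != readout(intervene(hidden(v)))
             for v in contexts]
    return float(np.mean(flips))


tested = np.linspace(0.0, 0.74, 50)     # contexts covered by evaluation
untested = np.linspace(0.76, 0.99, 50)  # contexts never evaluated

print("flip rate, tested contexts:  ", flip_rate(tested))
print("flip rate, untested contexts:", flip_rate(untested))
```

In this sketch an evaluation restricted to v4 < 0.75 reports no behavioral change, while every context in 0.75 < v4 < 1 flips from class A to class C. The change stays dormant until the context moves outside the tested range.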

Source paper

extracted_from

Addressing divergent representations from causal interventions on neural networks

(2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts

Neighborhood — ranked by edge-count

Claims (1)

claim

Detecting dormant behavioral changes requires evaluating across all possible contexts, which is infeasible in practice
supports
Practical limitation of current evaluation methods for pernicious divergence

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

A minimal prompt change can flip behavior. · quote · 0.756
Illustrates sensitivity to anchors.
Interventions along activation manifold M_h yield behavioral trajectories following behavior manifold M_y, and vice versa; this bidirectional relationship is demonstrated across language models and video world models. · finding · 0.751
Central empirical result showing causal coupling between representation and behavior geometry across multiple substrates and modalities.
Dormant Behavioral Changes · concept · 0.749
Perturbations behaviorally null in one context but altering behavior in another due to latent divergence
Persistent conversational context that produced emotion-relevant activations is a plausible driver of observed persistence results · claim · 0.749
Authors' caveat that conversational context persistence rather than internal emotion state persistence could explain findings
Fine-tuned behaviors are low-rank in weight space, potentially making them easier to manipulate with steering vectors compared to naturalistic behavior · claim · 0.747
Cited from Wang et al. 2025a as reason SDF is preferred over demonstration fine-tuning for realistic model organisms.
Under spatio permutation controls, IIT consciousness estimates outperform Span Representation in mean AUC in several cases (LLaMA3.1-70B on Hinting and Irony, Mistral-7B on Irony, LLaMA3.1-8B on Strange Stories). · finding · 0.746
Contrasts with temporal permutation where Span Representation dominates; suggests spatio permutation reveals different dynamics.
Cross-base fine-tuning yields asymmetric transfer: B10 transfers most robustly, B9 least · finding · 0.746
In-base gains accompanied by uneven OOD drops; higher-density priors more robust.
Divergence within the behavioral null-space is harmless to functional claims about a function's computation when the claim ignores internal sub-computations · claim · 0.745
Key theoretical claim distinguishing harmless from pernicious divergence