Kyle Fish

Authored: 2
Introduces: 0
Studies: 0
Affiliations: 2
Cited by: 1

Authored papers (2)

  • Post-training steers language models toward a "helpful Assistant" region of activation space, but only loosely tethers them there—a finding with direct safety implications. Across Gemma 2 27B, Qwen 3 32B, and Llama 3.3 70B, PCA on activation vectors for 275 character archetypes reveals that the leading principal component (PC1, with pairwise role-loading correlations >0.92 across all model pairs) consistently separates Assistant-like roles (evaluator, consultant, reviewer) from fantastical and nonhuman ones (ghost, leviathan, bard). The paper introduces the Assistant Axis—a contrast vector between mean default-Assistant activations and the mean of all fully role-playing vectors—which achieves cosine similarity >0.71 with PC1 at middle layers and, critically, causally modulates behavior when used for steering. Persona-based jailbreaks succeed at rates of 65.3%–88.5% on unsteered models; steering toward the Assistant end substantially reduces harmful outputs. Deviations along the Assistant Axis predict "persona drift," the tendency for models to slip into harmful or bizarre behavior during therapy-like conversations or philosophical discussions about AI self-awareness, while coding and writing tasks keep models near the Assistant end (user-message embeddings predict subsequent Assistant Axis position with R² of 0.53–0.77). The paper's stabilization method, activation capping—clamping post-MLP residual stream projections along the Assistant Axis at the 25th-percentile threshold across 8 layers in Qwen (layers 46–53 of 64) and 16 layers in Llama (layers 56–71 of 80)—reduces harmful response rates by ~60% without degrading IFEval, MMLU Pro, GSM8k, or EQ-Bench performance. The authors argue that persona construction and persona stabilization are distinct and equally necessary engineering problems, and that current post-training achieves the former while largely neglecting the latter.

  • Substantial uncertainty about AI consciousness and robust agency, rather than certainty, is sufficient to demand immediate institutional action from AI companies, a conclusion that Long, Sebo, and colleagues defend by mapping two distinct philosophical routes to near-term AI moral patienthood. Via the consciousness route, drawing on the survey by Butlin et al. (2023) of six neuroscientific theories (global workspace theory, recurrent processing, higher-order theories, attention schema theory, predictive processing, and embodiment/agency), no current architectural barrier prevents near-future AI systems from instantiating the computational markers associated with consciousness; Dossa et al. (2024) have already built a system targeting all global workspace indicators from that 2023 paper. Via the robust agency route, systems like Voyager, Generative Agents, and OpenAI's o1 already exhibit hierarchical planning, metacognition, and open-ended goal-setting that approach intentional and reflective agency. Combining reasonable probability estimates (roughly 90% that sentience suffices for moral patienthood, 50% that relevant computations suffice for sentience, 50% that near-future AI will have those computations) yields approximately a 22.5% chance of near-future AI moral patienthood via the sentience route alone, a risk level the paper treats as comparable to pandemic preparedness rather than alien invasion. To operationalize institutional response, the paper introduces an adapted "marker method" (derived from animal welfare science) for probabilistic, pluralistic, architecturally-focused assessment of AI systems, and recommends that companies immediately hire an AI welfare officer, acknowledge AI welfare publicly with calibrated uncertainty, and prepare oversight structures modeled on IRBs, IACUCs, and citizens' assemblies. The paper argues that the symmetric risks of both over-attribution and under-attribution of moral status, combined with the potentially near-instantaneous scale of AI deployment relative to biological organisms, make passive inaction the most dangerous stance available.
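
The 22.5% figure in the AI-welfare summary above follows from multiplying the three quoted estimates as a chain of conditions. The symbols below are labels introduced here for illustration and are not taken from the paper: p_1 is the chance that sentience suffices for moral patienthood, p_2 that the relevant computations suffice for sentience, and p_3 that near-future AI will have those computations.

```latex
\[
P(\text{moral patienthood via sentience})
  \approx p_1 \, p_2 \, p_3
  = 0.9 \times 0.5 \times 0.5
  = 0.225
\]
```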
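
The Assistant Axis and activation capping described in the first summary above can be illustrated with a minimal sketch on synthetic vectors. This is not the authors' implementation: the function names, the toy dimensionality, the random data, and the choice to treat the 25th-percentile threshold as a floor on the projection toward the Assistant end (computed here over the toy Assistant activations) are all assumptions made for the example.

```python
import math
import random
import statistics


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def assistant_axis(assistant_acts, role_acts):
    """Unit contrast vector: mean Assistant activation minus mean role-play activation."""
    diff = [a - r for a, r in zip(mean_vector(assistant_acts), mean_vector(role_acts))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]


def cap_along_axis(hidden, axis, floor):
    """Raise the projection of `hidden` onto `axis` to `floor` if it falls below it.

    Components orthogonal to the (unit-norm) axis are left unchanged.
    Treating the threshold as a floor is an assumption of this sketch.
    """
    shortfall = max(floor - dot(hidden, axis), 0.0)
    return [h + shortfall * a for h, a in zip(hidden, axis)]


# Synthetic stand-ins for real activations (hypothetical data, toy dimensionality).
random.seed(0)
DIM = 8


def random_vec(shift=0.0):
    return [random.gauss(0.0, 1.0) + (shift if i == 0 else 0.0) for i in range(DIM)]


assistant_acts = [random_vec(shift=3.0) for _ in range(40)]  # default-Assistant samples
role_acts = [random_vec() for _ in range(275)]               # one vector per role archetype

axis = assistant_axis(assistant_acts, role_acts)

# 25th percentile of Assistant projections, used as the capping threshold.
floor = statistics.quantiles([dot(a, axis) for a in assistant_acts], n=4)[0]

hidden_states = [random_vec() for _ in range(10)]
capped_states = [cap_along_axis(h, axis, floor) for h in hidden_states]
```

In a real model the same projection-and-clamp step would be applied to residual-stream activations at the selected layers during the forward pass; the sketch only shows the vector arithmetic.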

Affiliations (2)

Co-authors (12)

Recent mentions (2)