Ruikang Zhang

Authored papers (1)

Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders (2026)
The paper reports that Sparse Autoencoder (SAE)-based contrastive feature retrieval can reliably identify and bidirectionally steer high-order semantic features in LLMs, outperforming Contrastive Activation Addition (CAA) in both stability and generalization. The framework, termed Mechanistic Knobs, applies a contrastive feature retrieval pipeline to controlled semantic oppositions, specifically the Big Five personality traits. It combines statistical activation analysis across SAE latent spaces with generation-based validation to distill monosemantic functional features. Crucially, intervening on a single retrieved feature produces coherent, cross-dimensional behavioral shifts aligned with the target trait, an effect the paper names Functional Faithfulness. Steering one personality-relevant SAE feature does not merely nudge a single surface statistic; the shift propagates consistently across multiple linguistic dimensions associated with that trait (e.g., lexical choice, sentence structure, affective tone). Submitted in January 2026 and revised in April 2026 (arXiv:2601.02978), the work demonstrates bidirectional control, amplifying or suppressing each of the five trait dimensions, under conditions where CAA produces less stable outputs. The paper argues that this implies LLMs encode high-order conceptual attributes as deeply integrated, functionally coherent internal structures rather than as diffuse statistical correlations, and that SAE-grounded mechanistic interventions therefore constitute a principled path toward reliable regulation of complex AI behavioral attributes.
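The general idea of contrastive feature retrieval and single-feature steering can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the ReLU encoder form, the standardized mean-gap score, the unit-norm decoder direction, and every function and parameter name (`sae_encode`, `retrieve_contrastive_features`, `steer`, `caa_vector`, `alpha`, `top_k`) are assumptions chosen for illustration. A plain difference-of-means vector is included as a rough stand-in for the CAA baseline.

```python
import numpy as np

def sae_encode(acts, W_enc, b_enc):
    """Assumed ReLU SAE encoder: (n, d_model) activations -> (n, n_features) latents."""
    return np.maximum(acts @ W_enc + b_enc, 0.0)

def retrieve_contrastive_features(pos_acts, neg_acts, W_enc, b_enc, top_k=5):
    """Rank SAE features by how strongly they separate two prompt sets.

    pos_acts / neg_acts: activations collected on trait-positive and
    trait-negative prompts. The score is a standardized mean gap, which is
    an illustrative choice, not the statistic used in the paper.
    """
    z_pos = sae_encode(pos_acts, W_enc, b_enc)
    z_neg = sae_encode(neg_acts, W_enc, b_enc)
    gap = z_pos.mean(axis=0) - z_neg.mean(axis=0)
    pooled_std = np.sqrt(0.5 * (z_pos.var(axis=0) + z_neg.var(axis=0))) + 1e-8
    score = gap / pooled_std
    top = np.argsort(-score)[:top_k]  # indices of the highest-scoring features
    return top, score

def steer(acts, W_dec, feature_idx, alpha):
    """Shift activations along one feature's unit-norm decoder direction.

    alpha > 0 amplifies the feature, alpha < 0 suppresses it.
    W_dec has shape (n_features, d_model).
    """
    direction = W_dec[feature_idx]
    direction = direction / np.linalg.norm(direction)
    return acts + alpha * direction

def caa_vector(pos_acts, neg_acts):
    """Difference-of-means steering vector in activation space (CAA-style baseline)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
```

In practice the activations would come from a hooked layer of the model and the weights from a trained SAE; a candidate returned by `retrieve_contrastive_features` would then still need the generation-based validation step described above before being treated as a usable feature.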
