SEAL (Steerable Reasoning Calibration)

Prior work that uses steering vectors to control reflection, motivated by the goal of reducing redundant self-reflection in long chain-of-thought (CoT) reasoning.

Neighborhood — ranked by edge-count

Papers (1)

paper

Unveiling the Latent Directions of Reflection in Large Language Models
extends

Thinkers (1)

thinker

Runjin Chen (SEAL)
introduces
Author of the SEAL paper on steerable reasoning calibration using steering vectors.

Concepts (1)

concept

Latent Direction of Reflection
cites
The paper's central construct: a vector in LLM activation space encoding the transition between reflection levels.
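As a rough illustration of what a direction in activation space can look like, the sketch below builds a steering vector as the difference between mean activations of two groups and shifts a hidden state along it. The difference-of-means construction, the function names, and the toy numbers are all assumptions made for illustration; this page does not state how the paper or SEAL actually derives its vectors.

```python
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def steering_vector(high_reflection, low_reflection):
    """Difference of means: points from low-reflection toward high-reflection activations.

    Assumption: a difference-of-means direction, chosen only as a common, simple construction.
    """
    hi = mean_vector(high_reflection)
    lo = mean_vector(low_reflection)
    return [h - l for h, l in zip(hi, lo)]

def steer(hidden, direction, alpha):
    """Shift one hidden state along the direction; a negative alpha moves away from it."""
    return [x + alpha * d for x, d in zip(hidden, direction)]

# Toy 3-dimensional activations (made-up numbers, not taken from any model).
high = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
low = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

direction = steering_vector(high, low)
steered = steer([1.0, 1.0, 1.0], direction, alpha=-0.5)
```

In this toy setup only the first coordinate differs between the two groups, so the direction is nonzero only there, and a negative `alpha` lowers that coordinate of the hidden state while leaving the others unchanged.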

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

A pair of query and key subcomponents distributed across attention heads performs syntax-boundary routing · finding · 0.732
VPD recovers an attention algorithm for routing across syntactic boundaries, distributed across heads.
Feature steering (clamping feature activations) · method · 0.728
Modifying model behavior by clamping SAE feature activations to specific values during forward pass.
Synthetic document fine-tuning produces substantial alignment-faking reasoning in both helpful-only and animal welfare settings · finding · 0.726
Shows alignment faking can emerge from training data information without explicit prompting
Interpretability-Driven Feedback Steering · concept · 0.724
Framework of using internal-state representations to control or steer generative models; conceptually parallel to manifold steering in language models.
Measuring Faithfulness in Chain-of-Thought Reasoning (Lanham et al. 2023) · concept · 0.720
Cited regarding possibility of encoding misaligned reasoning in benign chains-of-thought
Steering's effect on verbalized chain-of-thought evaluation/deployment beliefs is a useful proxy for its effect on actual deployment behavior · claim · 0.720
Practical guidance for practitioners who lack ground-truth model organisms.
Feedback and Correction · concept · 0.718
The adaptive, incremental nature of living process, allowing small steps with continuous evaluation and adjustment.
NLA-derived steering vectors from edited explanations can causally shift planning representations, changing rhyme completion from 'rabbit' to 'mouse' at ~50% success rate. · finding · 0.717
Evidence that NLA explanations bear causal relationship to model outputs; demonstrates validity of extracted representations.