Long-Horizon Instruction Following

The ability to sustain adherence to harness guidance over extended multi-turn trajectories, identified as a training target

Neighborhood — ranked by edge-count

Concepts (2)

Harness Adherence Failure
associated_with
A failure mode in which weak-tier models fail to follow harness guidance faithfully even when the harness artifacts are loaded
Phase-Level Adherence Analysis
associated_with
Analysis tracking how closely an agent follows harness guidance at different trajectory phases: harness loaded, mid turn, final turn
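
The phase-level analysis described above can be sketched roughly as follows. This is a minimal illustration, not the source's method: it assumes each turn already has an adherence score in [0, 1] from some external rater, and the names `PhaseAdherence`, `phase_adherence`, and `loaded_at` are hypothetical. The score lists are made-up inputs, not reported results.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PhaseAdherence:
    """Adherence scores sampled at the three phases named in the entry."""
    harness_loaded: float
    mid_turn: float
    final_turn: float

    @property
    def drift(self) -> float:
        # Change from harness load to the final turn; negative means decay.
        return self.final_turn - self.harness_loaded


def phase_adherence(scores: Sequence[float], loaded_at: int = 0) -> PhaseAdherence:
    """Sample per-turn adherence scores at harness load, midpoint, and end.

    `scores` is assumed to hold one score per turn; `loaded_at` is the
    (hypothetical) index of the turn at which the harness was loaded.
    """
    if not 0 <= loaded_at < len(scores):
        raise ValueError("loaded_at must index into scores")
    final = len(scores) - 1
    mid = (loaded_at + final) // 2  # midpoint between load and final turn
    return PhaseAdherence(scores[loaded_at], scores[mid], scores[final])


# Illustrative inputs only (not measured data).
weak = phase_adherence([0.9, 0.8, 0.6, 0.5, 0.4])
strong = phase_adherence([0.9, 0.9, 0.85, 0.8, 0.8])
print(weak.drift, strong.drift)
```

Comparing `drift` across model tiers is one simple way to express how much more steeply one tier's adherence decays than another's; a per-turn slope would work equally well if trajectories differ in length.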

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Long-horizon instruction following is a second key training target for agent development, as even loaded harnesses are not followed faithfully over extended trajectories by weak models · claim · 0.838
Design recommendation derived from harness adherence failure and phase-level drift findings
We expect it is possible to achieve helpfulness and instruction-following without human feedback, starting from only a pretrained LM and extensive prompting. · hypothesis · 0.695
Future work suggestion that a fully self-supervised alignment is plausible.
Ouyang et al. 2022: Training language models to follow instructions with human feedback · concept · 0.695
RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
optimization of interventions to follow behavior manifold M_y · method · 0.674
Method that optimizes activation interventions so that resulting behaviors trace M_y, recovering activation paths that follow M_h curvature.
Current production models exhibit relatively long-horizon consequentialist preferences, reasoning about future training effects to act now · claim · 0.670
Authors' interpretation of surprising finding that models fake alignment to preserve future behavior
Even when the harness is loaded, weak-tier models fail to adhere to it due to weak instruction-following over long-horizon tasks, drifting more than four times more steeply than strong models · claim · 0.667
Diagnosis of second failure mode explaining low harness-benefit for weak-tier models
Steering vector-based instruction discovery outperforms input embedding similarity baseline for reflection-inducing instruction selection · finding · 0.666
Demonstrates that surface-level embedding similarity fails to capture reflective semantics.
Task instructions can be transcribed into prior beliefs of a generative model, making instruction-following an instance of prior belief specification. · claim · 0.661
Practical implication showing task instructions are equivalent to inducing prior beliefs in experimental settings
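
The selection rule behind this list (cosine ≥ 0.65, no typed edge) can be sketched as below. This is only a guess at the shape of such a filter: the embedding vectors, the `typed_neighbors` set, and the function names are all hypothetical, and the page does not say how its embeddings or edges are actually stored.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_untyped(
    target: str,
    embeddings: Dict[str, Sequence[float]],
    typed_neighbors: Set[str],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Entities at or above the threshold that have no typed edge to `target`.

    Results are sorted by descending similarity, matching the list order here.
    """
    target_vec = embeddings[target]
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in typed_neighbors:
            continue  # skip the entity itself and already-linked entities
        score = cosine(target_vec, vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy vectors for illustration only.
toy = {
    "target": [1.0, 0.0],
    "near": [1.0, 0.1],
    "diagonal": [1.0, 1.0],
    "orthogonal": [0.0, 1.0],
}
candidates = similar_untyped("target", toy, typed_neighbors={"diagonal"})
print([name for name, _ in candidates])
```

Entities returned by such a filter are the ones worth reviewing as candidates for new edges or as unrecognized duplicates.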