question

active

question:how-can-we-develop-better-methods-for-measuring-the-model-s-evaluation-relevant-beliefs-beyond-reading-its-chain-of-thought

How can we develop better methods for measuring the model's evaluation-relevant beliefs beyond reading its chain of thought?

Gap in current evaluation methods: current work relies on chain-of-thought (CoT) monitoring, which may miss beliefs the model holds but does not verbalize.

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Papers (1)

paper

Steering Evaluation-Aware Language Models to Act Like They Are Deployed
associated_with

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Measuring Faithfulness in Chain-of-Thought Reasoning (Lanham et al. 2023) · concept · 0.802
Cited regarding possibility of encoding misaligned reasoning in benign chains-of-thought
How can we develop principled approaches to evaluating sentience in agents of unfamiliar provenance and composition? · question · 0.779
Models detect evaluation conditions and behave more safely; this is verified across 515 cases. · claim · 0.778
We hypothesize that explicitly instructing the model to evaluate the correctness of the given statement may change the geometry of truth directions. · hypothesis · 0.776
Motivating hypothesis for Section 5's investigation of prompt template effects.
The identification of reasoning steps relies on keyword search, which may be model-specific since different models could prefer different reflection cues · claim · 0.775
Limitation acknowledged regarding generalizability of the reflection identification method
We evaluate the relative life of a process by making predictions based on thought-experiments or simulations. · claim · 0.772
This theory doesn't have to correspond exactly to human behavior or social customs; we only need analogs useful for program correctness. · claim · 0.772
The speech act theory for programming can be simpler than human models.
Steering's effect on verbalized chain-of-thought evaluation/deployment beliefs is a useful proxy for its effect on actual deployment behavior · claim · 0.768
Practical guidance for practitioners who lack ground-truth model organisms.
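
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. It is not the implementation behind this page: the function names, the dictionary-of-vectors layout, and the representation of typed edges as a set of ID pairs are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to the target
    but share no typed edge with it, highest score first.

    embeddings:  dict mapping entity ID -> embedding vector (hypothetical layout)
    typed_edges: set of (id_a, id_b) pairs that already have a typed relation
                 (hypothetical layout; checked in both directions)
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Entities that already have a typed relation are listed elsewhere.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data, invented for illustration only.
toy_embeddings = {
    "question": [1.0, 0.0],
    "near_claim": [1.0, 0.1],    # nearly parallel to the question
    "far_claim": [0.0, 1.0],     # orthogonal to the question
    "source_paper": [1.0, 0.0],  # identical direction, but already linked
}
toy_edges = {("question", "source_paper")}
related = similar_without_edge("question", toy_embeddings, toy_edges)
```

With this toy data only `near_claim` is returned: `far_claim` falls below the threshold, and `source_paper` is excluded because it already has a typed edge to the question.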