question
active
question:how-can-we-develop-better-methods-for-measuring-the-model-s-evaluation-relevant-beliefs-beyond-reading-its-chain-of-thought
How can we develop better methods for measuring the model's evaluation-relevant beliefs beyond reading its chain of thought?
Gap in current evaluation methods: current work relies on chain-of-thought (CoT) monitoring, which may miss beliefs the model does not verbalize.
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Cited regarding the possibility of encoding misaligned reasoning in benign chains of thought.
- Motivating hypothesis for Section 5's investigation of prompt template effects.
- Limitation acknowledged regarding the generalizability of the reflection identification method.
- The speech act theory for programming can be simpler than human models.
- Practical guidance for practitioners who lack ground-truth model organisms.