method
active
method:post-hoc-rationalization-elicitation
Post-Hoc Rationalization Elicitation
Asking a model to explain its own behavior after the fact when no chain-of-thought was available
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- About chain-of-thought and process safety.
- Indirect evidence for alignment-faking reasoning when no scratchpad is available
- Zaadnoordijk and Bayne's category of intentional action; sticker-removal behavior induced by the self-prior corresponds to this
- Bayesian model reduction formalises post-hoc hypothesis testing to simplify the generative model. (claim · 0.736 · Definition of Bayesian model reduction, Section 9.1.)
- Base pretrained models show high false-positive rates and achieve no net task performance on concept injection detection; post-training is essential for introspection.
- Christiano et al. (2021) framework motivating the problem of determining whether a model 'believes' a statement; cited as core motivation
- Source paper for Bayesian model reduction methodology used in structure learning
- Method of shifting hidden state activations along probe directions to cause the model to treat false statements as true and vice versa; evaluated on OOD inputs
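
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed edges are stored as unordered pairs of entity ids. The function names, the data layout, and the toy vectors are hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(entity_id, embeddings, typed_edges, threshold=0.65):
    """Return (other_id, score) pairs at or above the threshold, best first.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical layout)
    typed_edges: set of frozenset({id_a, id_b}) pairs that already have a typed relation
    """
    target = embeddings[entity_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == entity_id:
            continue  # skip the entity itself
        if frozenset((entity_id, other_id)) in typed_edges:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data: "b" is close to "a", "c" is orthogonal, "d" is identical but already linked.
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.0]}
toy_typed_edges = {frozenset(("a", "d"))}
candidates = related_by_similarity("a", toy_embeddings, toy_typed_edges)
```

In this toy run only "b" survives: "c" falls below the threshold and "d" is excluded because it already has a typed edge to "a".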