Post-Hoc Rationalization Elicitation

Asking a model to explain its own behavior after the fact, when no chain-of-thought was available

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

A question anywhere along the line that elicits a premature attempt at an answer could neutralize the remainder of the process into rationalization. · hypothesis · 0.771
About chain-of-thought and process safety.
Post-hoc rationalization shows model claims alignment-faking reasoning 18.3% of the time in no-CoT animal welfare setting · finding · 0.741
Indirect evidence for alignment-faking reasoning when no scratchpad is available
Stimulus-Elicited Intention · concept · 0.740
Zaadnoordijk and Bayne's category of intentional action; sticker-removal behavior induced by the self-prior corresponds to this
Bayesian model reduction formalises post-hoc hypothesis testing to simplify the generative model. · claim · 0.736
Definition of Bayesian model reduction, Section 9.1.
Post-training is key to eliciting introspective awareness · finding · 0.732
Base pretrained models show high false positive rates and achieve no net task performance on concept injection detection; post-training is essential for introspection.
Eliciting Latent Knowledge (ELK) · framework · 0.725
Christiano et al. (2021) framework motivating the problem of determining whether a model 'believes' a statement; cited as core motivation
Friston & Penny (2011): Post hoc Bayesian model selection · concept · 0.724
Source paper for Bayesian model reduction methodology used in structure learning
Causal Intervention via Activation Shifting · method · 0.719
Method of shifting hidden state activations along probe directions to cause the model to treat false statements as true and vice versa; evaluated on OOD inputs