Post-training emergence of model introspection

How instruction tuning and RLHF elicit latent introspective capabilities in language models beyond base pretraining.

4 members.

Drawn from 1 source

Post-training influences introspective capability expression
Different post-training strategies substantially influence performance on introspection tasks; "helpful-only" variants show higher false-positive rates, but some achieve strong net performance.

Post-training is key to eliciting strong introspective awareness; base pretrained models do not show above-chance detection
Finding: base models have high false-positive rates and no net positive performance.

Post-training strategies can strongly influence performance on introspective tasks
Assertion about the role of post-training in eliciting introspection.

Post-training is key to eliciting introspective awareness
Base pretrained models show high false-positive rates and achieve no net task performance on concept injection detection; post-training is essential for introspection.
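
The claims above refer to false-positive rates and "net" performance without spelling out the metric. As a rough illustration only, the sketch below assumes net performance means the detection rate on trials where a concept was injected minus the false-positive rate on control trials. That definition, the `Trial` type, and the function names are assumptions made here for illustration, not taken from the source, and the trial data is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One detection trial (hypothetical structure, not from the source)."""
    injected: bool  # was a concept actually injected on this trial?
    reported: bool  # did the model report detecting an injected concept?


def detection_rates(trials):
    """Return (hit_rate, false_positive_rate) for a list of trials."""
    injected = [t for t in trials if t.injected]
    control = [t for t in trials if not t.injected]
    if not injected or not control:
        raise ValueError("need at least one injected and one control trial")
    hit_rate = sum(t.reported for t in injected) / len(injected)
    false_positive_rate = sum(t.reported for t in control) / len(control)
    return hit_rate, false_positive_rate


def net_score(trials):
    """Assumed 'net' metric: hit rate minus false-positive rate."""
    hit_rate, false_positive_rate = detection_rates(trials)
    return hit_rate - false_positive_rate


# Invented data: a model that reports a detection on every trial.
always_yes = [Trial(injected=i, reported=True) for i in (True, False) * 4]

# Invented data: 3 of 4 injections detected, 1 of 4 controls falsely flagged.
selective = (
    [Trial(True, True)] * 3 + [Trial(True, False)]
    + [Trial(False, True)] + [Trial(False, False)] * 3
)
```

Under this assumed metric, a model that reports a detection on every trial scores zero net performance despite a perfect hit rate, which is one way a high false-positive rate can coexist with no net task performance.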