community
active
leiden_hybrid_concepts
label: haiku
community:leiden_hybrid_concepts-run4-c0-c2-c0
Post-training emergence of model introspection
How instruction tuning and RLHF elicit latent introspective capabilities in language models beyond base pretraining.
4 members.
Drawn from 1 source
The papers/notes whose extracted claims & findings make up this cluster.
Bridges (3)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
Claims (3)
- Post-training influences introspective capability expression: Different post-training strategies substantially influence introspection task performance; 'helpful-only' variants show higher false positives, but some achieve strong net performance.
- Post-training is key to eliciting strong introspective awareness; base pretrained models do not show above-chance detection: Finding that base models have high false positives and no net positive performance.
- Post-training strategies can strongly influence performance on introspective tasks: Assertion about the role of post-training in eliciting introspection.
Findings (1)
- Post-training is key to eliciting introspective awareness: Base pretrained models show high false positive rates and achieve no net task performance on concept injection detection; post-training is essential for introspection.
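
The claims and finding above weigh detections against false positives. As a minimal sketch of how such a "net" score might be computed, the Python below assumes net performance means the detection rate on injection trials minus the false positive rate on control trials. That definition, the `TrialCounts` type, the `net_detection_score` function, and the counts in the usage lines are all hypothetical illustrations, not the source's scoring method or results.

```python
from dataclasses import dataclass


@dataclass
class TrialCounts:
    """Hypothetical tally of a concept injection detection run."""
    hits: int              # injection trials where the model reported an injection
    injection_trials: int  # total trials with a concept injected
    false_alarms: int      # control trials where the model reported an injection anyway
    control_trials: int    # total trials with nothing injected


def net_detection_score(counts: TrialCounts) -> float:
    """Assumed definition: true positive rate minus false positive rate."""
    if counts.injection_trials <= 0 or counts.control_trials <= 0:
        raise ValueError("both trial counts must be positive")
    true_positive_rate = counts.hits / counts.injection_trials
    false_positive_rate = counts.false_alarms / counts.control_trials
    return true_positive_rate - false_positive_rate


# Made-up counts for illustration only, not data from the source.
indiscriminate = TrialCounts(hits=30, injection_trials=100,
                             false_alarms=30, control_trials=100)
selective = TrialCounts(hits=40, injection_trials=100,
                        false_alarms=5, control_trials=100)

print(net_detection_score(indiscriminate))
print(net_detection_score(selective))
```

Under this assumed definition, a model that reports an injection equally often on control and injection trials scores zero regardless of how often it says yes, which matches the sense in which a high false positive rate can leave no net performance.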