community
active
leiden_hybrid_concepts
label: haiku
community:leiden_hybrid_concepts-run4-c0-c2-c5
Model introspection for misalignment detection
Using architectural self-knowledge prompts to improve models' ability to identify their own unintended outputs.
2 members.
Drawn from 2 sources
The papers/notes whose extracted claims & findings make up this cluster.
Bridges (3)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
Findings (2)
- Detecting Unintended Outputs via Introspection: Models can distinguish artificially prefilled outputs from intentional responses by referencing prior internal representations; injecting a matching concept vector causes the model to retroactively accept the prefill as intentional.
- A prompt that gives the model context about its own architecture increases introspective detection from 0.3% to 39.9%. This is mechanistic support for the prompt-as-gate hypothesis: language frames enable access to latent capacities.
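
To put the reported jump in detection rate in perspective, the short sketch below works out the absolute and relative change between the two figures quoted above (0.3% and 39.9%). The function name `effect_size` is a hypothetical helper introduced only for this illustration; it is not taken from the underlying sources, and it says nothing about how those rates were measured.

```python
def effect_size(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative multiplier).

    Hypothetical helper for illustration; inputs are rates in percent.
    """
    if before_pct <= 0:
        raise ValueError("baseline rate must be positive to compute a ratio")
    return after_pct - before_pct, after_pct / before_pct


# Rates quoted in the finding: 0.3% without the prompt, 39.9% with it.
abs_gain, ratio = effect_size(0.3, 39.9)
print(f"absolute gain: {abs_gain:.1f} percentage points")
print(f"relative gain: {ratio:.0f}x the baseline rate")
```

Under these figures the gain is about 39.6 percentage points, or roughly 133 times the baseline rate. Reporting both numbers is useful because a near-zero baseline makes the relative multiplier look dramatic while the absolute rate still leaves most cases undetected.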