Model introspection for misalignment detection

Using architectural self-knowledge prompts to improve models' ability to identify their own unintended outputs.

2 members.

Drawn from 2 sources

The papers/notes whose extracted claims & findings make up this cluster.


Detecting Unintended Outputs via Introspection
Models can distinguish artificially prefilled outputs from intentional responses by referencing their prior internal representations; injecting a matching concept vector causes the model to retroactively accept the prefill as intentional.
A prompt that gives the model context about its own architecture increases introspective detection from 0.3% to 39.9%. This offers mechanistic support for the prompt-as-gate hypothesis: language frames enable access to latent capacities.
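
The concept-vector finding above can be pictured with a toy sketch. This is not the method used in the underlying sources; it only assumes that a "prior internal representation" and a "concept" can each be modeled as a vector, and that a prefill counts as intentional when the two are sufficiently aligned. The function names, the cosine-similarity rule, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def accepts_prefill(prior_state: np.ndarray, prefill_concept: np.ndarray,
                    threshold: float = 0.5) -> bool:
    """Hypothetical rule: the prefill is treated as intentional only if the
    prior state already points toward the prefilled concept."""
    return cosine(prior_state, prefill_concept) >= threshold


def inject(prior_state: np.ndarray, concept: np.ndarray,
           strength: float) -> np.ndarray:
    """Add a scaled unit-norm concept direction to the prior state."""
    return prior_state + strength * concept / np.linalg.norm(concept)


rng = np.random.default_rng(0)
dim = 64
concept = rng.normal(size=dim)

# Build a prior state with no component along the concept direction,
# standing in for a model that never "intended" the prefilled concept.
noise = rng.normal(size=dim)
prior = noise - (noise @ concept) / (concept @ concept) * concept

# Without injection the toy detector flags the prefill as unintended.
flagged_before = not accepts_prefill(prior, concept)

# After injecting the matching concept vector, the same prefill is accepted.
steered = inject(prior, concept, strength=4.0 * np.linalg.norm(prior))
accepted_after = accepts_prefill(steered, concept)

print(flagged_before, accepted_after)
```

In this toy setup the injection only changes the stored prior state, not the prefill itself, which mirrors the claim that acceptance depends on what the model's earlier representations contained.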