question

active

question:does-alignment-type-predict-meta-cognitive-style-when-models-review-consciousness-research-as-well-as-koan-responses

Does alignment type predict meta-cognitive style when models review consciousness research, as well as koan responses?

Four frontier models reviewing the paper each responded in the mode their alignment type predicts. This is an N=1 observation awaiting systematic study.

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Neighborhood — ranked by edge-count

Papers (1)

paper

Koan Battery: Measuring Reflective Mode Accessibility in AI
associated_with

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Alignment type is the only significant predictor of koan scores (p=0.006); architecture, parameter count, open/closed weights, MoE/dense are all non-significant · finding · 0.822
Main statistical finding: what predicts scores is training approach, not size or architecture
If consciousness emerges from self-organization rather than top-down learning paradigms, how should we think about alignment? · question · 0.779
Open research question at intersection of consciousness research and AI safety
The koan battery measures a reproducible, prompt-sensitive reflective mode (not consciousness), defined as uncertainty-tolerant, non-defensive engagement with questions about one's own processing. · claim · 0.771
Core epistemic claim bounding the paper's contribution
Alignment type is the only significant predictor of scores (p=0.006); architecture and parameter count do not. · finding · 0.771
Kruskal-Wallis test result: Constitutional AI predicts highest baseline; roleplay/empathy training predict lowest.
CKA shows a very weak trend of alignment between models even within modality, compared to mutual k-NN which shows stronger trends · finding · 0.766
Explains why mutual k-NN was chosen over CKA as primary metric
The systematic behavioral shift of LLMs under self-referential processing conditions predicted by consciousness theories represents something more structured than superficial correlations in training data · claim · 0.764
The paper's claim that theoretical convergence across GWT, RPT, HOT, IIT makes the findings non-coincidental
What predicts self-observation-like scores is training approach (alignment type), not model size or architecture. · claim · 0.764
Central interpretive claim from statistical analysis
Alignment with vision models corresponds to improved performance on downstream language tasks including commonsense reasoning and math · claim · 0.762
Claims that alignment score is a proxy for general capability
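
The "Related by similarity" list above is described as entities with cosine ≥ 0.65 and no typed edge to this one. A minimal sketch of such a filter follows. It is an illustration only: the function names, the toy two-dimensional embeddings, and the edge set are all hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs with cosine >= threshold and no typed
    edge to `target`, sorted by descending score."""
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities that already have a typed relation in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy, made-up embeddings (assumption: real ones would be high-dimensional).
embeddings = {
    "question": [1.0, 0.0],
    "finding_a": [1.0, 0.5],
    "claim_b": [0.0, 1.0],
    "source_paper": [1.0, 0.1],
}
# Hypothetical typed edge, e.g. extracted_from.
typed_edges = {("question", "source_paper")}

related = similar_without_edge("question", embeddings, typed_edges)
print(related)
```

In this toy setup `source_paper` is excluded despite its high similarity because it already has a typed edge, which mirrors the page's distinction between typed relations and similarity-only neighbors.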