question

active

question:would-experienced-meditators-rank-model-responses-differently-from-llm-scorers

Would experienced meditators rank model responses differently from LLM scorers?

Key validation gap: the five-scorer validation holds across LLMs, but human contemplatives might weight the scoring dimensions differently.

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Neighborhood — ranked by edge-count

Papers (1)

paper

Koan Battery: Measuring Reflective Mode Accessibility in AI
associated_with

Findings (1)

finding

Five independent LLM scorers from four labs produce closely agreeing rankings (Spearman ρ > 0.8).
gates
Scorer bias validation: Claude Haiku, Gemini Flash, GPT-5.4, Grok 4, and Kimi K2.5 all converge on the same model ordering.
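
A minimal sketch of how rank agreement of this kind could be checked, and how a human panel could be compared against the LLM scorers. It is not the paper's code: the function names and the idea of taking the minimum pairwise ρ are assumptions made for illustration, and no scores from the paper are reproduced here.

```python
from itertools import combinations


def ranks(scores):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with scores[order[i]].
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def min_pairwise_rho(scores_by_scorer):
    """Smallest Spearman rho over all scorer pairs (hypothetical agreement summary).

    scores_by_scorer maps a scorer name to a list of scores, one per evaluated
    model, in the same model order for every scorer.
    """
    names = sorted(scores_by_scorer)
    return min(
        spearman(scores_by_scorer[a], scores_by_scorer[b])
        for a, b in combinations(names, 2)
    )
```

To address the open question, a human panel's scores could be added as one more entry in `scores_by_scorer` and its ρ against each LLM scorer compared with the LLM-to-LLM values. A ρ well below 0.8 for the human entry would suggest that contemplatives weight the dimensions differently.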

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

LLMs can predict their own responses more accurately than external observers, implying privileged internal knowledge · finding · 0.782
Binder et al. finding cited as evidence that LLMs possess introspective capacity analogous to mindfulness
The better an LLM is at language modeling, the more it aligns with vision models, and vice versa: a linear relationship between language modeling score and vision-language alignment · finding · 0.757
Core cross-modal empirical result: larger and better language models align better with vision models
Standardized LLM self-assessments reflect learned communication postures rather than genuine capabilities (Jackson et al. 2025) · claim · 0.755
Skeptical prior work motivating validation framework
An LLM is a far more limited entity than a buddha, yet it can convincingly play buddha-like beings. · claim · 0.755
Comparison between Buddhist ideals and AI capabilities.
Toxic LLMs show higher IIA when compared to other toxic models than when compared to nontoxic models using stepwise MAS · finding · 0.754
Proof-of-principle that MAS can detect model misalignment in DeepSeek-R1-Qwen-1.5B fine-tuned models.
LLM self-reports about consciousness and moral significance should express degrees of confidence and provide context. · claim · 0.744
Recommendation for companies on LM outputs.
We hypothesize that LLMs represent correctness of arithmetic expressions differently from factual statements. · hypothesis · 0.744
Core working hypothesis motivating the factual vs. arithmetic task split in the experimental design.
When LLMs produce experience claims under self-reference, is this sophisticated simulation or genuine self-representation, and how would we tell the difference? · question · 0.744
The core interpretive question the paper narrows but cannot definitively answer
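
A list like the one above (cosine ≥ 0.65, no typed edge) could be produced by a filter along the following lines. This is a sketch under assumptions: the embedding vectors, the `typed_edges` set, and the function names are hypothetical and are not taken from the system that generated this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def related_without_edge(target, candidates, typed_edges, threshold=0.65):
    """Return (entity_id, similarity) pairs, most similar first.

    target      : embedding of the current entity (hypothetical input)
    candidates  : dict mapping entity id -> embedding
    typed_edges : set of entity ids already linked to the target by a typed edge
    threshold   : minimum cosine similarity to keep (0.65 matches the page header)
    """
    scored = [
        (entity_id, cosine(target, vector))
        for entity_id, vector in candidates.items()
        if entity_id not in typed_edges
    ]
    kept = [(entity_id, score) for entity_id, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Entities that pass the threshold but already have a typed edge are excluded, so what remains are candidates for new edges or for duplicate review.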