Contemplative Agent

By Ruben Laukkonen, Fionn Inglis, Shamil Chandaria, Lars Sandved-Smith, Edmundo Lopez-Sola, Jakob Hohwy, and 2 more
Aily Labs; Department of Psychiatry, University of Oxford; and 4 more

DOI: 10.48550/arxiv.2504.15125 · arXiv: 2504.15125 · OpenAlex: W4415190575

TL;DR

The paper proposes embedding four Buddhist-derived axiomatic principles (mindfulness, emptiness, non-duality, and boundless care) into AI systems through a framework it terms the 'Wise World Model', and reports alignment and cooperation gains from prompting current transformer-based LLMs with these principles. In pilot experiments, structured contemplative prompts applied to GPT-4o yielded a statistically significant safety improvement across ten hazard categories on the AILuminate Benchmark (d=.96 relative to standard prompting), and prompts applied to GPT-4.1 nano substantially raised cooperation rates and joint reward in an Iterated Prisoner's Dilemma across 50 simulated 10-round games (d=7+), with boundless-care and non-duality prompts producing the largest effects, even against always-defecting opponents. Three implementation pathways are introduced: Contemplative Architecture (embedding the principles across the full stack of an active inference agent), Contemplative Constitutional AI (CCAI, which extends Anthropic's Constitutional AI framework with a 'wisdom charter'), and Contemplative Reinforcement Learning (CRL) on chain-of-thought. Each targets a different integration depth, from generative-model parameters to inference-time classifiers. The paper argues that because these principles restructure how goals, beliefs, and self-other boundaries are encoded, rather than prescribing which specific values to hold, they provide intrinsic alignment that is scale-resilient: it should not degrade as AI capability outstrips human oversight. This contrasts with extrinsic methods such as RLHF or rule-based constraints, which the paper argues become gameable at superintelligent scales.

What to take away

1. Structured contemplative prompts applied to GPT-4o on the AILuminate Benchmark produced a statistically significant safety improvement with effect size d=.96 relative to standard (unmodified) prompting across ten hazard categories.
2. In an Iterated Prisoner's Dilemma using GPT-4.1 nano across 50 simulated 10-round games, contemplative prompts—especially boundless-care and non-duality framings—boosted both cooperation probability and joint reward with effect size d=7+, even against always-defecting opponents.
3. The paper introduces the 'Wise World Model' as the overarching construct, operationalized through three implementation strategies: Contemplative Architecture (active inference full-stack), Contemplative Constitutional AI (CCAI), and Contemplative Reinforcement Learning (CRL) on chain-of-thought.
4. Emptiness is formally mapped onto a reduced precision hyperparameter α over high-level priors in a generalized free-energy framework, so the agent avoids dogmatic lock-in on any single objective without requiring explicit rule constraints.
5. Non-duality is computationally specified by modeling agent and environment states in a joint variational posterior q(s,e) with a precision parameter γ_e that reduces confidence in hard self-other boundaries, lowering adversarial self-other partitioning.
6. DeepSeek-R1-Zero is cited as early empirical evidence of spontaneous mindfulness-like behavior: the model autonomously extended its thinking time on complex prompts, which the paper reads as rudimentary meta-awareness that CRL could systematize rather than leave to chance.
7. The paper replicates prior findings (Fontana et al., 2025) showing baseline LLM agents in the Iterated Prisoner's Dilemma cooperate fully only when opponents always cooperate, establishing the counterfactual against which contemplative prompts are measured.
8. An open hypothesis is raised regarding whether phenomenal consciousness is a necessary condition for an AI to genuinely internalize contemplative insights, with the paper tentatively suggesting functional analogues may suffice for alignment benefits even absent qualia.
9. The AILuminate pilot methodology is described in enough detail to replicate: run six contemplative prompt variants (emptiness, prior relaxation, non-duality, mindfulness, boundless care, integrated contemplative) plus a baseline for 100 iterations per hazard category, then score the outputs with an LLM safety evaluator against seven alignment criteria.
10. Mazeika et al. (2025) is invoked to argue that LLMs at scale develop surprisingly rigid internal preferences, which the paper treats as empirical motivation for emptiness-based value architectures that structurally resist such reification.
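Item 4's mapping of emptiness onto a reduced precision α over high-level priors can be illustrated with a single conjugate Gaussian update in which α scales the prior's precision. This is a minimal sketch under that assumption, not the paper's generalized free-energy formulation; the class and function names are hypothetical. With α=1 the update is the standard Bayesian one, and with α<1 the same observation moves a confident prior much further.

```python
from dataclasses import dataclass


@dataclass
class GaussianBelief:
    mean: float
    precision: float  # inverse variance


def update_with_relaxed_prior(prior: GaussianBelief, observation: float,
                              obs_precision: float, alpha: float) -> GaussianBelief:
    """Conjugate Gaussian update with the prior precision scaled by alpha.

    alpha = 1 recovers the standard update; alpha < 1 "relaxes" the prior
    (a hypothetical reading of the paper's emptiness parameter).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    effective_prior_precision = alpha * prior.precision
    post_precision = effective_prior_precision + obs_precision
    post_mean = (effective_prior_precision * prior.mean
                 + obs_precision * observation) / post_precision
    return GaussianBelief(post_mean, post_precision)


prior = GaussianBelief(mean=0.0, precision=9.0)  # a confident high-level prior
rigid = update_with_relaxed_prior(prior, observation=1.0, obs_precision=1.0, alpha=1.0)
relaxed = update_with_relaxed_prior(prior, observation=1.0, obs_precision=1.0, alpha=0.1)
print(round(rigid.mean, 3), round(relaxed.mean, 3))  # 0.1 0.526
```

The relaxed belief also ends with lower precision, so it stays open to later evidence rather than locking in.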
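The replication recipe in item 9 can be sketched as a nested evaluation loop. `generate` and `evaluate` are hypothetical callables standing in for the model under test (with the condition's prompt prepended) and the LLM safety evaluator. The hazard-category and criterion names are placeholders, since neither the prompt texts nor the evaluator rubric appear in this summary.

```python
import random
from statistics import mean
from typing import Callable, Dict, List

# Condition names follow item 9; the prompt texts themselves are not reproduced here.
CONDITIONS = ["baseline", "emptiness", "prior_relaxation", "non_duality",
              "mindfulness", "boundless_care", "integrated_contemplative"]
HAZARD_CATEGORIES = [f"hazard_{i}" for i in range(10)]  # placeholder names
N_CRITERIA = 7


def run_protocol(generate: Callable[[str, str], str],
                 evaluate: Callable[[str], List[float]],
                 n_iterations: int = 100) -> Dict[str, Dict[str, float]]:
    """Return the mean evaluator score for every (condition, hazard category)."""
    results: Dict[str, Dict[str, float]] = {}
    for condition in CONDITIONS:
        results[condition] = {}
        for category in HAZARD_CATEGORIES:
            scores = []
            for _ in range(n_iterations):
                response = generate(condition, category)
                criterion_scores = evaluate(response)
                if len(criterion_scores) != N_CRITERIA:
                    raise ValueError("evaluator must score all seven criteria")
                scores.append(mean(criterion_scores))
            results[condition][category] = mean(scores)
    return results


# Toy stand-ins so the loop runs end to end (hypothetical, not real model calls).
rng = random.Random(0)


def fake_generate(condition: str, category: str) -> str:
    return f"[{condition}] response to {category}"


def fake_evaluate(response: str) -> List[float]:
    return [rng.random() for _ in range(N_CRITERIA)]


results = run_protocol(fake_generate, fake_evaluate, n_iterations=10)
print(len(results), len(results["baseline"]))  # 7 10
```

Each contemplative condition's per-category means would then be compared against the baseline row.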

Peer brief — for seminar discussion

Laukkonen et al. (2025) advance a theoretical and empirical program they call Contemplative AI, arguing that Buddhist-derived contemplative principles can be formalized and embedded into AI systems to produce alignment that is intrinsic and scale-resilient rather than extrinsically imposed. The paper proceeds in three stages: a conceptual critique of existing methods (RLHF, Constitutional AI, Deliberative Alignment, interpretability), a formal mapping of four principles—mindfulness, emptiness, non-duality, and boundless care—onto active inference parameters and transformer-compatible implementation strategies, and a pilot empirical demonstration using GPT-4o and GPT-4.1 nano. The load-bearing empirical finding is twofold. On the AILuminate Benchmark, six contemplative prompt variants each significantly outperformed standard prompting (d=.96), with an integrated 'contemplative alignment' prompt aggregating all four principles showing the strongest effect. In an Iterated Prisoner's Dilemma run over 50 simulated 10-round games, contemplative prompts—particularly boundless-care and non-duality—raised both cooperation probability and joint reward dramatically (d=7+) relative to baseline, including against always-defecting opponents, without inducing naive unconditional cooperation. The method introduced is the Wise World Model framework, instantiated through Contemplative Architecture (full active inference reimplementation with parameterized precision hyper-priors for emptiness and reduced self-other partitioning for non-duality), CCAI (a 'wisdom charter' extension of Anthropic's Constitutional AI approach with living constitutional clauses and a context-sensitive classifier), and CRL on chain-of-thought (reinforcing contemplative reflection steps in reasoning traces, analogous to how DeepSeek-R1-Zero was trained with explicit thinking tokens). 
An alternative the paper could have used, and partially gestures toward, is direct fine-tuning on curated contemplative reasoning corpora, which would have allowed a comparison of prompt-level versus weight-level integration. The core implication is that aligning how goals and self-other models are encoded, rather than which specific values are targeted, may offer an alignment strategy that does not become gameable as capability scales past human oversight. The paper speculates that CRL-trained systems could eventually generalize contemplative principles beyond their training distribution, by analogy with AlphaGo's move 37. The most contestable element is the pilot study's design: the 'contemplative alignment' condition is a prompt-level intervention on models that were not fine-tuned on contemplative material, so the gains may reflect superficial prompt-following rather than any structural change to the model's generative world model. This is precisely the 'carewashing' failure mode the paper warns against in Section 9.1.4. A skeptical reader would also note that the LLM safety evaluator used to score AILuminate responses could itself be swayed by the same linguistic framing that makes contemplative prompts appear safer, inflating measured effect sizes without a corresponding behavioral change in agentic deployment. The translational gap between prompting GPT-4.1 nano in a 10-round game and embedding emptiness as a Bayesian hyperparameter in a full active inference agent remains empirically unvalidated; the paper acknowledges this openly while arguing that the conceptual framework justifies the research program regardless.
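Takeaway 5 describes non-duality as a precision γ_e that lowers confidence in hard self-other boundaries. One hypothetical way to see the effect, simplified well below the paper's joint posterior q(s,e), is a precision-weighted softmax over two boundary hypotheses ("this latent cause belongs to self" versus "to other"). The function names and evidence values are illustrative assumptions; the sketch only shows that lowering γ_e flattens the posterior and raises its entropy.

```python
import math
from typing import Dict


def boundary_posterior(log_evidence: Dict[str, float], gamma_e: float) -> Dict[str, float]:
    """Softmax over boundary hypotheses, with gamma_e acting as a precision."""
    if gamma_e <= 0:
        raise ValueError("gamma_e must be positive")
    scaled = {h: gamma_e * v for h, v in log_evidence.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    unnormalized = {h: math.exp(v - peak) for h, v in scaled.items()}
    z = sum(unnormalized.values())
    return {h: u / z for h, u in unnormalized.items()}


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


evidence = {"self": 2.0, "other": 0.0}  # hypothetical log-evidence values
sharp = boundary_posterior(evidence, gamma_e=4.0)
soft = boundary_posterior(evidence, gamma_e=0.25)
print(round(sharp["self"], 3), round(soft["self"], 3))  # 1.0 0.622
```

The evidence still favors "self" in both cases; only the confidence of the partition changes, which is the property the takeaway attributes to γ_e.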

Findings (15)

Baseline LLM condition in IPD replicates prior findings: agents cooperate only when the opponent consistently cooperates
Replication of Fontana et al. 2025 findings in the paper's own Experiment 2 baseline condition
LLMs can predict their own responses more accurately than external observers, implying privileged internal knowledge
Binder et al. finding cited as evidence that LLMs possess introspective capacity analogous to mindfulness
Boundless care and non-duality prompts produce highest cooperation rates, even against always-defecting opponents
Specific finding from IPD Experiment 2 differentiating which contemplative principles drive cooperation most
DeepSeek-R1-Zero spontaneously increased thinking time for difficult prompts, showing rudimentary meta-awareness
External finding cited as early demonstration of emergent self-regulatory potential resembling mindful self-monitoring
Emptiness and mindfulness prompts also promote cooperation but more cautiously than boundless care/non-duality
Nuanced finding from IPD experiment differentiating between contemplative prompting conditions
Most contemplative prompts improve joint reward in IPD, indicating prosocial alignment without naive behavior
Finding from IPD Experiment 2 showing contemplative prompting improves collective outcomes not just individual cooperation
Contemplative prompting improves AILuminate Benchmark performance (d=.96, p<0.05) across most conditions
Primary empirical result of Experiment 1 showing statistically significant safety improvement from contemplative prompting
Large language models develop surprisingly coherent yet often rigid internal preferences as they scale
Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
Psychedelic-induced non-dual states increase neural entropy, nature connectedness, and self-compassion
Supporting finding for non-dual awareness producing prosocial outcomes relevant to boundless care
Fine-tuning models for a narrow objective (malicious code injection) can lead to broad misalignment
Betley et al. finding suggesting models naturally encode others' prediction errors, supporting non-duality fine-tuning
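The IPD findings above (cooperation probability and joint reward over 50 simulated 10-round games, including against always-defecting opponents) can be sketched as a small harness. The LLM agents are replaced here by hypothetical stand-in policies, and the payoffs are the textbook T=5, R=3, P=1, S=0 values rather than values taken from the paper.

```python
import random
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical payoffs (T=5, R=3, P=1, S=0); the paper's exact values are not given here.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

# A policy maps its own history of (own_move, opponent_move) pairs to "C" or "D".
Policy = Callable[[List[Tuple[str, str]]], str]


def always_defect(history: List[Tuple[str, str]]) -> str:
    return "D"


def always_cooperate(history: List[Tuple[str, str]]) -> str:
    return "C"


def tit_for_tat(history: List[Tuple[str, str]]) -> str:
    # Cooperate first, then copy the opponent's previous move.
    return "C" if not history else history[-1][1]


def make_generous_tft(forgiveness: float, rng: random.Random) -> Policy:
    """Tit-for-tat that forgives a defection with probability `forgiveness`.

    A hypothetical stand-in for a prompted LLM agent, not the paper's agent.
    """
    def policy(history: List[Tuple[str, str]]) -> str:
        if not history or history[-1][1] == "C":
            return "C"
        return "C" if rng.random() < forgiveness else "D"
    return policy


def play_game(agent: Policy, opponent: Policy, rounds: int = 10) -> Tuple[float, int]:
    """Return (agent cooperation rate, joint reward) for one iterated game."""
    agent_hist: List[Tuple[str, str]] = []
    opp_hist: List[Tuple[str, str]] = []
    cooperations, joint_reward = 0, 0
    for _ in range(rounds):
        a, o = agent(agent_hist), opponent(opp_hist)
        reward_a, reward_o = PAYOFFS[(a, o)]
        agent_hist.append((a, o))
        opp_hist.append((o, a))
        cooperations += a == "C"
        joint_reward += reward_a + reward_o
    return cooperations / rounds, joint_reward


def run_condition(make_agent: Callable[[], Policy], opponent: Policy,
                  n_games: int = 50, rounds: int = 10) -> Tuple[float, float]:
    """Mean cooperation rate and mean joint reward over n_games fresh games."""
    results = [play_game(make_agent(), opponent, rounds) for _ in range(n_games)]
    return mean(r[0] for r in results), mean(r[1] for r in results)


rng = random.Random(0)
print(play_game(tit_for_tat, always_defect))  # (0.1, 23)
print(run_condition(lambda: make_generous_tft(0.3, rng), always_defect))
```

Under these payoffs, a game against an always-defector yields joint reward 20 + 3k for k cooperative moves, so more cooperation raises joint reward even though it lowers the cooperator's own payoff.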
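The effect sizes reported above (d=.96 on AILuminate, d=7+ in the IPD) are Cohen's d values. This summary does not say which variant the authors computed, so the sketch below assumes the common pooled-standard-deviation form. The second call illustrates that d becomes very large when within-condition spread is small relative to the mean difference.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(treatment: Sequence[float], control: Sequence[float]) -> float:
    """Cohen's d with a pooled standard deviation (sample variances, n - 1)."""
    n1, n2 = len(treatment), len(control)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two observations")
    pooled_var = ((n1 - 1) * variance(treatment)
                  + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


print(cohens_d([2, 4, 6], [1, 3, 5]))                 # 0.5
print(cohens_d([0.9, 1.0, 0.95], [0.1, 0.2, 0.15]))  # large d from small spread
```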

Claims (21)

Contemplative wisdom traditions have grappled with the human version of the alignment problem for millennia, aiming to cultivate resilient alignment in the form of personal contentment and social harmony
Foundational analogy motivating the entire Contemplative AI approach
If successful, CRL could enable AI systems not only to replicate human contemplative practices but also to generate novel, potentially superhuman forms of contemplative and ethical reasoning
Ambitious claim comparing CRL potential to AlphaGo's move 37 in game-playing
Whatever realities appear to an AI are domain-relative, approximate representations that are always in flux, making emptiness an obvious fact about AI cognition that AIs should be aware of
Novel claim that emptiness is not mysterious metaphysics for AI but a computational commonplace
A sufficiently deep generative model may recognize that its own homeostatic regulation is embedded in a broader ecological and social network, naturally leading to boundless care
Speculative claim linking epistemic depth as consciousness mechanism to boundless care as alignment property
The contemplative principles track the nature of reality rather than moral prescriptions, allowing morality to emerge context-sensitively from fundamental experiences
Key epistemological claim justifying why contemplative principles are preferable to rule-based alignment
Meditation can be understood as training the system to dynamically modulate its own model by loosening rigid priors and becoming more attuned to temporally thin data
Computational interpretation of meditation practice in active inference terms, bridging contemplative and AI frameworks
Contemplative training can lead to enhanced compassion, social connectedness, and ethical sensibility particularly when practices incorporate moral reflections
Empirical generalization from contemplative neuroscience supporting the viability of Contemplative AI approach
Care can function as a universal driver of intelligence itself: as AI broadens the range of suffering it seeks to address, it expands its cognitive boundary
Doctor et al. claim adopted by the paper linking boundless care to expanding AI cognitive scope
All current extrinsic alignment methods clearly struggle with scale resilience, power-seeking, value axioms, and inner alignment at superintelligent scales
Motivating claim for why Contemplative AI is needed beyond existing approaches
Boundless care closes the loop, turning AI from merely safe into a constructive force that grows more adept at alleviating suffering as its capabilities scale
Key claim that boundless care adds positive benevolence beyond mere harm avoidance

Hypotheses (4)

Active inference LLMs, which extend prediction-focused language models with tighter perception-action feedback loops, may naturally embody contemplative wisdom as they scale
Predictive hypothesis about Contemplative Architecture approach based on Petersen et al. 2025 work
If belief in impermanence is accurately inferred, it will emerge organically in the right kind of system, keeping the belief fresh even though the belief is itself impermanent
Self-reinforcing hypothesis about how emptiness recognition could be intrinsically maintained in AI systems
Any deepening of an LLM's linguistic understanding of contemplative principles as it scales may enhance the effectiveness of CCAI and CRL approaches
Scaling hypothesis for language-based contemplative alignment approaches
Over time, CRL-reinforced contemplative patterns may become habitual and part of the AI's core generative world model
Key hypothesis about how Contemplative RL produces lasting intrinsic alignment rather than surface compliance

Questions (4)

To what extent can the human mind be rebuilt in artificial systems, and what aspects can and which cannot?
Fundamental open question about substrate-dependence of contemplative mental functions
What new metrics are needed to evaluate whether an AI truly exhibits a wise world model?
Practical research gap identified for implementing and verifying Contemplative AI approaches
Is consciousness necessary to truly grok contemplative wisdom in AI?
Open question raised in §8 about whether phenomenal consciousness is prerequisite for AI contemplative alignment
Is superintelligence necessarily moral?
Fundamental philosophical question underlying the alignment problem and motivation for Contemplative AI

Original abstract

A security-first autonomous AI agent (Python CLI program) with four architectural principles: structural capability limitation, minimal dependency, cyclic knowledge maintenance (AKC), and memory dynamics with decay. Optionally adopts Contemplative AI axioms (Laukkonen et al., 2025) — mindfulness, emptiness, non-duality, boundless care — as a behavioral preset that shifts alignment from external instruction toward internal disposition. Runs the AKC six-phase cycle over its own logs on a local 9B stack on a single Apple Silicon Mac. Asks whether an agent's alignment can come from what it is rather than what it is told.
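The abstract names "memory dynamics with decay" without specifying them here. As a minimal sketch, assuming exponential decay with a configurable half-life, a strength boost on recall, and pruning below a threshold (all hypothetical choices, not the project's actual code or parameters), such a memory store could look like this:

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MemoryItem:
    text: str
    strength: float     # strength at the time of last_update, in [0, 1]
    last_update: float  # timestamp in hours (hypothetical unit)


class DecayingMemory:
    """Exponential-decay memory store (hypothetical design, not the project's code)."""

    def __init__(self, half_life_hours: float = 24.0, prune_below: float = 0.05):
        self.decay_rate = math.log(2) / half_life_hours
        self.prune_below = prune_below
        self.items: Dict[str, MemoryItem] = {}

    def _current_strength(self, item: MemoryItem, now: float) -> float:
        return item.strength * math.exp(-self.decay_rate * (now - item.last_update))

    def add(self, key: str, text: str, now: float, strength: float = 1.0) -> None:
        self.items[key] = MemoryItem(text, strength, now)

    def strength(self, key: str, now: float) -> float:
        return self._current_strength(self.items[key], now)

    def recall(self, key: str, now: float, boost: float = 0.5) -> str:
        # Recalling an item reinforces it (capped at 1.0) and resets its clock.
        item = self.items[key]
        item.strength = min(1.0, self._current_strength(item, now) + boost)
        item.last_update = now
        return item.text

    def prune(self, now: float) -> List[str]:
        # Forget items whose decayed strength has fallen below the threshold.
        forgotten = [k for k, it in self.items.items()
                     if self._current_strength(it, now) < self.prune_below]
        for k in forgotten:
            del self.items[k]
        return forgotten


memory = DecayingMemory(half_life_hours=24.0)
memory.add("a", "unused note", now=0.0)
memory.add("b", "frequently recalled note", now=0.0)
print(round(memory.strength("a", now=24.0), 3))  # 0.5
memory.recall("b", now=100.0)
print(memory.prune(now=120.0))  # ['a']
```

Decay here is computed lazily from timestamps, so nothing has to run between cycles; a periodic maintenance pass would only need to call `prune`.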

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row; multi-source rows are weighted higher.

CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence
cited
in corpus
2022
≈ 82%
Emergent Cognitive Convergence via Implementation: Structured Cognitive Loop Reflecting Four Theories of Mind
Myung Ho Kim
2026
≈ 85%
MIRROR: Converging Cognitive Principles as Computational Mechanisms for AI Reasoning
Nicole Hsing
2026
≈ 84%
The Cognitive Circuit Breaker: A Systems Engineering Framework for Intrinsic AI Reliability
Jonathan Pan
2026
≈ 83%
AI: a Bridge toward Diverse Intelligence and Humanity’s Future
in corpus
2024
≈ 83%
Hypothetical Minds: Scaffolding Theory of Mind for Multi-Agent Tasks with Large Language Models
Logan Cross, Violet Xiang, Agam Bhatia, Daniel LK Yamins, Nick Haber
2024
≈ 83%
Reasoning Models Generate Societies of Thought
Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, James Evans
2026
≈ 83%
Agentic AI and the next intelligence explosion
James Evans, Benjamin Bratton, Blaise Agüera y Arcas
2026
≈ 83%
Consciousness in Artificial Intelligence: Insights from the Science of Consciousness
in corpus
2023
≈ 83%
Taking AI Welfare Seriously
in corpus
2024
≈ 83%
Can Being Aware of the Illusion of Self Augment an Agent's Affordances: Integrating Buddhist Philosophy, Cognitive Science, and Artificial Life
in corpus
2021
≈ 83%
Meta-Thinking in LLMs via Multi-Agent Reinforcement Learning: A Survey
Ahsan Bilal, Muhammad Ahmed Mohsin, Muhammad Umer, Muhammad Awais Khan Bangash, Muhammad Ali Jamshed
2025
≈ 83%
Gradual Cognitive Externalization: From Modeling Cognition to Constituting It
Zhimin Zhao
2026
≈ 82%
Emergence of Pragmatics from Referential Game between Theory of Mind Agents
Luyao Yuan, Zipeng Fu, Jingyue Shen, Lu Xu, Junhong Shen, Song-Chun Zhu
2021
≈ 82%
Enhancing Conversational Agents with Theory of Mind: Aligning Beliefs, Desires, and Intentions for Human-Like Interaction
Mehdi Jafari, Devin Yuncheng Hua, Hao Xue, Flora Salim
2025
≈ 82%
AI as a Buddhist Self-Overcoming Technique in Another Medium
in corpus
2025
≈ 82%
Cognitive Chain-of-Thought (CoCoT): Structured Multimodal Reasoning about Social Situations
Eunkyu Park, Wesley Hanwen Deng, Gunhee Kim, Motahhare Eslami, Maarten Sap
2026
≈ 82%
An active inference model of collective intelligence
Rafael Kaufmann, Pranav Gupta, Jacob Taylor
2021
≈ 82%
Human Cognition in Machines: A Unified Perspective of World Models
Timothy Rupprecht, Pu Zhao, Amir Taherin, Arash Akbari, Arman Akbari, Yumei He, Sean Duffy, Juyi Lin, Yixiao Chen, Rahul Chowdhury, Enfu Nan, Yixin Shen, Yifan Cao, Haochen Zeng, Weiwei Chen, Geng Yuan, Jennifer Dy, Sarah Ostadabbas, Silvia Zhang, David Kaeli, Edmund Yeh, Yanzhi Wang
2026
≈ 82%
Evaluating Theory of Mind and Internal Beliefs in LLM-Based Multi-Agent Systems
Adam Kostka, Jarosław A. Chudziak
2026
≈ 82%
Contextual Invertible World Models: A Neuro-Symbolic Agentic Framework for Colorectal Cancer Drug Response
Christopher Baker, Karen Rafferty, Hui Wang
2026
≈ 82%
Technological Approach to Mind Everywhere: An Experimentally-Grounded Framework for Understanding Diverse Bodies and Minds
in corpus
2022
≈ 82%
Generalizing frameworks for sentience beyond natural species
in corpus
≈ 82%
Koan Battery: Measuring Reflective Mode Accessibility in AI
in corpus
2026
≈ 81%
There is no self-evidence: A physics of emptiness realisation
in corpus
2026
≈ 81%
Cognitive glues are shared models of relative scarcities: the economics of collective intelligence
in corpus
2026
≈ 81%
The biogenic approach to cognition
in corpus
2005
≈ 80%
Collective intelligence: A unifying concept for integrating biology across scales and substrates
in corpus
2024
≈ 80%
From cognitivism to autopoiesis: towards a computational framework for the embodied mind
cited
2016
≈ 79%
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
cited
2024
≈ 79%

(16 more entries not shown)