A real-world study showed that introducing a cognitive layer architecture, which supports specialized psychotherapeutic reasoning capabilities in general-purpose chatbots, improved the quality of therapy interactions relative to standalone chatbots or human therapists alone, and that greater use of the layer was associated with improvement in depression and anxiety symptoms.
Clinician–patient conversations form the cornerstone of mental healthcare. Large language models (LLMs) hold promise for this domain, but their effectiveness in patient-facing interactions remains largely unproven. Here we introduce a cognitive layer architecture that enhances general-purpose LLMs with specialized clinical psychotherapeutic reasoning capabilities. In a randomized, double-blind evaluation, 227 human participants generated naturalistic mental well-being session transcripts by interacting with different therapy agents.
A consortium of 22 expert clinicians assessed these transcripts, finding that LLMs augmented with this architecture consistently outperformed both standalone state-of-the-art LLMs and human clinicians across key clinical competencies required for delivering high-quality cognitive-behavioral therapy. We validated these results in an analysis of 19,674 transcripts from a large-scale, real-world deployment where an LLM embedded within this cognitive layer architecture was used as part of healthcare delivery to support 8,920 users seeking mental well-being assistance.
Increased cognitive layer activation was associated with greater symptom improvement and a higher likelihood of long-term clinical recovery (~10 weeks). Our findings demonstrate that a cognitive layer architecture can enable LLMs to deliver high-quality cognitive-behavioral therapy interactions, with continued research warranted into the mechanisms and clinical efficacy of AI-assisted therapeutics.

Data supporting the findings of this study are publicly available on Zenodo at https://doi.org/10.5281/zenodo.17176593 (ref. 62). This repository contains the de-identified, processed dataset necessary to replicate the statistical results reported in the paper. Raw data containing sensitive therapeutic transcripts are not publicly available, to protect participant privacy. All custom code required to reproduce the statistical analyses presented in this paper from the provided data is written in Python (v.3.10.4) and is publicly available on Zenodo at https://doi.org/10.5281/zenodo.17176593 (ref. 62). For reproducibility, we have documented the technical architecture and machine-learning methods in this paper, while keeping the paper accessible to a clinical and general scientific audience. We are unable to open-source the cognitive layer and its sub-component models owing to the safety implications of unmonitored use of such AI agents in medical and well-being settings (for example, unmonitored use in direct-to-consumer products), as well as intellectual property and commercial viability considerations.
1. World Health Organization. The Global Health Observatory. WHO https://www.who.int/gho/en (2026).
2. Wainberg, M. L. et al. Challenges and opportunities in global mental health: a research-to-practice perspective. Curr. Psychiatry Rep. 19, 28 (2017).
3. Toscano, F. et al. How physicians spend their work time: an ecological momentary assessment. J. Gen. Intern. Med. 35, 3166–3172 (2020).
4. Gottschalk, A. & Flocke, S. A. Time spent in face-to-face patient care and work outside the examination room. Ann. Fam. Med. 3, 488–493 (2005).
5. Olawade, D. B. et al. Enhancing mental health with artificial intelligence: current trends and future prospects. J. Med. Surg. Public Health 3, 100099 (2024).
6. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
7. Liévin, V., Hother, C. E., Motzfeldt, A. G. & Winther, O. Can large language models reason about medical questions? Patterns https://doi.org/10.1016/j.patter.2024.100943 (2024).
8. Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of GPT-4 on medical challenge problems. Preprint at https://arxiv.org/abs/2303.13375 (2023).
9. Singhal, K. et al. Toward expert-level medical question answering with large language models. Nat. Med. 31, 943–950 (2025).
10. Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature https://doi.org/10.1038/s41586-025-08866-7 (2025).
11. Beyene, L. S. et al. Conceptualizing healthcare professionals’ relational competence in mental healthcare: an integrative review. Int. J. Nurs. Stud. Adv. 7, 100266 (2024).
12. Guil, R., Romero-Moreno, A. & Tejeiro, R. Editorial: active components in psychotherapy: towards an integrative model of the mechanisms of therapeutic change. Front. Psychol. https://doi.org/10.3389/fpsyg.2023.1227477 (2023).
13. Wampold, B. E. in The Cycle of Excellence: Using Deliberate Practice to Improve Supervision and Training 49–65 (Wiley Blackwell, 2017).
14. Rosenzweig, S. Some implicit common factors in diverse methods of psychotherapy. Am. J. Orthopsychiatry 6, 412–415 (1936).
15. Norbury, A., Hauser, T. U., Fleming, S. M., Dolan, R. J. & Huys, Q. J. M. Different components of cognitive-behavioral therapy affect specific cognitive mechanisms. Sci. Adv. 10, eadk3222 (2024).
16. Southward, M. W., Kushner, M. L., Terrill, D. R. & Sauer-Zavala, S. A review of transdiagnostic mechanisms in cognitive-behavior therapy. Psychiatr. Clin. North Am. 47, 343–354 (2024).
17. Tracey, T. J. G., Wampold, B. E., Lichtenberg, J. W. & Goodyear, R. K. Expertise in psychotherapy: an elusive goal? Am. Psychol. 69, 218–229 (2014).
18. Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613–2622 (2024).
19. Obradovich, N. et al. Opportunities and risks of large language models in psychiatry. Digital Psychiatry Neurosci. 2, 1–8 (2024).
20. Linardon, J. et al. Current evidence on the efficacy of mental health smartphone apps for symptoms of depression and anxiety: a meta-analysis of 176 randomized controlled trials. World Psychiatry 23, 139–149 (2024).
21. Domhardt, M. et al. Mechanisms of change in digital health interventions for mental disorders in youth: systematic review. J. Med. Internet Res. 23, e29742 (2021).
22. Garvert, M. M. et al. Safety and efficacy of modular digital psychotherapy for social anxiety: randomized controlled trial. J. Med. Internet Res. 27, e64138 (2025).
23. Heinz, M. V. et al. Randomized trial of a generative AI chatbot for mental health treatment. NEJM AI 2, AIoa2400802 (2025).
24. Zhao, Y. et al. Effect of an AI agent trained on a large language model (LLM) as an intervention for depression and anxiety symptoms in young adults: a 28-day randomized controlled trial. Appl. Psychol. Health Well-Being 17, e70067 (2025).
25. Torous, J. et al. The evolving field of digital mental health: current evidence and implementation issues for smartphone apps, generative artificial intelligence, and virtual reality. World Psychiatry 24, 156–174 (2025).
26. Haensch, A.-C. ‘It listens better than my therapist’: exploring social media discourse on LLMs as mental health tool. Preprint at https://arxiv.org/abs/2504.12337 (2025).
27. Zao-Sanders, M. How people are really using gen AI in 2025. Harvard Business Review https://hbr.org/2025/04/how-people-are-really-using-gen-ai-in-2025 (2025).
28. Scholich, T., Barr, M., Stirman, S. W. & Raj, S. A comparison of responses from human therapists and large language model–based chatbots to assess therapeutic communication: mixed methods study. JMIR Ment. Health 12, e69709 (2025).
29. Looi, J. C., Allison, S., Bastiampillai, T., Reutens, S. & Looi, R. C. Illusions of intelligence, connection and reality: perils of large-language AI models for people with severe mental illness. Australas. Psychiatry https://doi.org/10.1177/10398562251380544 (2025).
30. Yeung, J. A., Dalmasso, J., Foschini, L., Dobson, R. J. & Kraljevic, Z. The psychogenic machine: simulating AI psychosis, delusion reinforcement and harm enablement in large language models. Preprint at https://arxiv.org/abs/2509.10970 (2025).
31. Jargon, J. & Kessler, S. A troubled man, his chatbot and a murder-suicide in Old Greenwich. Wall Street Journal (28 August 2025).
32. Ling, C. et al. Domain specialization as the key to make large language models disruptive: a comprehensive survey. ACM Comput. Surv. 58, 1–39 (2025).
33. Mukherjee, S. et al. Polaris: a safety-focused LLM constellation architecture for healthcare. Preprint at https://arxiv.org/abs/2403.13313 (2024).
34. David, D., Cristea, I. & Hofmann, S. G. Why cognitive behavioral therapy is the current gold standard of psychotherapy. Front. Psychiatry 9, 4 (2018).
35. Parmar, P., Ryu, J., Pandya, S., Sedoc, J. & Agarwal, S. Health-focused conversational agents in person-centered care: a review of apps. npj Digit. Med. 5, 1–9 (2022).
36. Goldberg, S. B. et al. The structure of competence: evaluating the factor structure of the Cognitive Therapy Rating Scale. Behav. Ther. 51, 113–122 (2020).
37. Koo, T. K. & Li, M. Y. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropr. Med. 15, 155–163 (2016).
38. Shaw, B. F. et al. Therapist competence ratings in relation to clinical outcome in cognitive therapy of depression. J. Consult. Clin. Psychol. 67, 837–846 (1999).
39. Cameron, S. K., Rodgers, J. & Dagnan, D. The relationship between the therapeutic alliance and clinical outcomes in cognitive behaviour therapy for adults with depression: a meta-analytic review. Clin. Psychol. Psychother. 25, 446–456 (2018).
40. Hatcher, R. L. & Gillaspy, J. A. Development and validation of a revised short version of the Working Alliance Inventory. Psychother. Res. 16, 12–25 (2006).
41. Yin, J., Ngiam, K. Y. & Teo, H. H. Role of artificial intelligence applications in real-life clinical practice: systematic review. J. Med. Internet Res. 23, e25759 (2021).
42. Nunes-Zlotkowski, K. F., Shepherd, H. L., Beatty, L., Butow, P. & Shaw, J. M. Blended psychological therapy for the treatment of psychological disorders in adult patients: systematic review and meta-analysis. Interact. J. Med. Res. 13, e49660 (2024).
43. Cohen, J. Statistical Power Analysis for the Behavioral Sciences (Routledge, 1988).
44. Spitzer, R. L., Kroenke, K., Williams, J. B. W. & Löwe, B. A brief measure for assessing generalized anxiety disorder: the GAD-7. Arch. Intern. Med. 166, 1092–1097 (2006).
45. Kroenke, K., Spitzer, R. L. & Williams, J. B. The PHQ-9: validity of a brief depression severity measure. J. Gen. Intern. Med. 16, 606–613 (2001).
46. Kunkle, S., Yip, M., Ξ, W. & Hunt, J. Evaluation of an on-demand mental health system for depression symptoms: retrospective observational study. J. Med. Internet Res. 22, e17902 (2020).
47. Tayade, M. C. & Latti, R. G. Effectiveness of early clinical exposure in medical education: settings and scientific theories—review. J. Educ. Health Promot. 10, 117 (2021).
48. Jasper, K. et al. The working alliance in a randomized controlled trial comparing Internet-based self-help and face-to-face cognitive behavior therapy for chronic tinnitus. Internet Interv. 1, 49–57 (2014).
49. Falkenström, F., Granström, F. & Holmqvist, R. Therapeutic alliance predicts symptomatic improvement session by session. J. Couns. Psychol. 60, 317–328 (2013).
50. Flemotomos, N. et al. Automated quality assessment of cognitive behavioral therapy sessions through highly contextualized language representations. PLoS ONE 16, e0258639 (2021).
51. Malouin-Lachance, A., Capolupo, J., Laplante, C. & Hudon, A. Does the digital therapeutic alliance exist? Integrative review. JMIR Ment. Health 12, e69294 (2025).
52. Rollwage, M. et al. Using conversational AI to facilitate mental health assessments and improve clinical efficiency within psychotherapy services: real-world observational study. JMIR AI 2, e44358 (2023).
53. Rollwage, M. et al. Conversational AI facilitates mental health assessments and is associated with improved recovery rates. BMJ Innov. https://doi.org/10.2196/44358 (2024).
54. Hipgrave, L., Goldie, J., Dennis, S. & Coleman, A. Balancing risks and benefits: clinicians’ perspectives on the use of generative AI chatbots in mental healthcare. Front. Digit. Health https://doi.org/10.3389/fdgth.2025.1606291 (2025).
55. Yang, H. et al. Peer perceptions of clinicians using generative AI in medical decision-making. npj Digit. Med. 8, 530 (2025).
56. Gilbert, S., Harvey, H., Melvin, T., Vollebregt, E. & Wicks, P. Large language model AI chatbots require approval as medical devices. Nat. Med. 29, 2396–2398 (2023).
57. Eubanks, C. F., Muran, J. C. & Safran, J. D. Alliance rupture repair: a meta-analysis. Psychotherapy 55, 508–519 (2018).
58. Sembill, A., Vocks, S., Kosfelder, J. & Schöttke, H. The phase model of psychotherapy outcome: domain-specific trajectories of change in outpatient treatment. Psychother. Res. 29, 541–552 (2019).
59. Norcross, J. C. & Goldfried, M. R. Handbook of Psychotherapy Integration (Oxford Univ. Press, 2005).
60. Rollwage, M. et al. The limbic layer: transforming large language models (LLMs) into clinical mental health experts. Preprint at PsyArXiv https://doi.org/10.31234/osf.io/9d7tp (2024).
61. Rutledge, R. B., Skandali, N., Dayan, P. & Dolan, R. J. A computational and neural model of momentary subjective well-being. Proc. Natl Acad. Sci. USA 111, 12252–12257 (2014).
62. McFadyen, J. LimbicAI/study-2025-cognitive-layer: 2025-09-22 revision. Zenodo https://doi.org/10.5281/zenodo.17176593 (2025).
63. Kass, R. E. & Raftery, A. E. Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995).

This work was funded by Limbic Limited. The funder was involved in study conceptualization, design, data collection, analysis, decision to publish and preparation of the paper.
We acknowledge the hard work put into the development of this technology, which evolved over years of patient-facing product development. We thank S. DeVries, J. Cable-May, A. Vicente, W. Payne, T. Edirisinghe, O. Mohammed, A. Hazelton, A. Antunes, V. Fernandes, L. Palit, A. Cannizzo, J. Shoard, C. Vowell and L. Dina.

Authors: Max Rollwage, Jessica McFadyen, Keno Juchems, Annamaria Balogh, Sashank Pisupati, Margareta-Theodora Mircea, Tobias U. Hauser, George Prichard & Ross Harper.

Author contributions: M.R., T.U.H., J.M. and R.H. conceptualized the work; M.R., J.M., K.J., T.U.H., A.B., G.P. and R.H. contributed to the design of the work; M.R. and R.H. conceptualized the technical implementation; M.R. and G.P. supervised the technical implementation; K.J., A.B., S.P. and G.P. contributed to the technical implementation; J.M., K.J., A.B. and G.P. contributed to the setup for the data acquisition; J.M., A.B. and M.M. contributed to the collection of the data; M.R. and J.M. conceptualized the data analysis; J.M. and M.R. conducted the data analysis; J.M. and M.R. drafted the paper; J.M., M.R. and T.U.H. edited the paper; R.H. contributed to the ideation and oversight of the work.

Competing interests: M.R., J.M., K.J., A.B., S.P., G.P., M.M. and R.H. are (or have been) employed by Limbic Limited and hold (or held) shares in the company. T.U.H. is working as a paid consultant for Limbic Limited and holds shares in the company.

Peer review: Nature Medicine thanks Matteo Malgaroli, Xuhai Xu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary handling editors: Lorenzo Righetto and Ming Yang, in collaboration with the Nature Medicine team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Bars represent the mean score (y axis, range 0–6) per subscale for each condition (x axis), comparing standalone LLMs (purple), cognitive layer-enhanced LLMs (pink) and human therapists (orange). Error bars represent the standard error of the mean (SEM). Data distribution is represented by a bubble plot, where point size indicates the relative proportion of transcripts per condition achieving each score (n = 227).

Bars represent the mean scores (y axis) per item (x axis) on the additional six-item rubric for assessing broader clinical performance, for each condition and for each underlying LLM. Higher scores indicate better performance. Error bars represent the SEM. Data distribution is represented by a bubble plot, where point size indicates the relative proportion of transcripts per condition achieving each score (n = 227).

Preference ratings in pairwise comparisons (x axis) between transcripts from LLMs using the cognitive layer architecture (pink) and standalone LLMs (purple) across seven clinical quality criteria (y axis), split by underlying LLM (columns); n = 672 comparisons.

Bars represent mean Working Alliance Inventory-Short Revised (WAI-SR) scores (y axis, range 1–5) for overall alliance and the goal, task and bond subscales. Higher scores indicate stronger therapeutic alliance. Error bars represent the SEM. Individual data points are shown (n = 227).

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Rollwage, M., McFadyen, J., Juchems, K. et al. A cognitive layer architecture to support large-language model performance in psychotherapy interactions. Nat Med (2026). https://doi.org/10.1038/s41591-026-04278-w
Original Source: Nature.com | Author: Max Rollwage, Jessica McFadyen, Keno Juchems, Annamaria Balogh, Sashank Pisupati, Margareta-Theodora Mircea, Tobias U. Hauser, George Prichard, Ross Harper | Published: March 12, 2026, 12:00 am

