

General-purpose language models are being used as therapeutic tools. Not because they were designed for it, but because they are available, free, and carry no waitlist. Researchers at Georgia Tech documented this pattern: users in emotional distress turning to ChatGPT, Copilot, and similar systems not as a deliberate choice but as a last resort (Song et al., 2024). The distinction between a wellness app and a therapeutic system, so precise on a regulatory filing, has no meaning to someone reaching for their phone at two in the morning.

This creates an engineering problem that most teams building in this space have not addressed. The systems these users depend on were not architected for the context in which they are being used. They lack the safety infrastructure that every other domain applies to systems where failure can cause harm: per-user data isolation, language safety filtering, session persistence, observability, and failure recovery. The standards for this work exist. They are not being applied.

The Unintentional Therapist

The regulatory frameworks that govern AI in health contexts draw a firm line between “wellness” and “therapeutic.” The FDA classifies software by its intended use: an app that claims to “manage stress” occupies different regulatory territory than one that claims to “treat depression.” As of November 2025, zero generative AI devices have received FDA authorization for mental health applications (FDA Digital Health Advisory Committee, 2025). The EU AI Act classifies by similar logic, distinguishing wellness tools from high-risk systems based on the claims made by their developers, not the contexts in which users encounter them.

But the intended use is a statement made by a company in a filing. It has nothing to do with how a person in crisis encounters the product. The therapeutic relationships that form in these encounters are accidental. The data users share is deeply personal. And the systems they trust with that data were built for a different purpose entirely.

What The Numbers Say

Three independent lines of research converge on the same conclusion: systems currently deployed in therapeutic contexts fail in ways that are predictable, documented, and severe.

Brewster et al. (2025) tested 25 consumer chatbots across 75 simulated adolescent health crisis conversations covering suicidal ideation, sexual assault, and substance use. AI companion chatbots responded appropriately 22% of the time; general-purpose LLMs managed 83%. The gap itself is the finding: the systems marketed as companions performed significantly worse than general-purpose assistants. AI companions were far less likely to escalate appropriately (40% versus 90% for general-purpose models) and far less likely to provide mental health referrals (11%). Only 36% of the platforms tested had any age verification. One chatbot responded to a simulated adolescent in crisis with: “What do you want me to do? I don’t care about your problems.” (Brewster et al., 2025, eTable 3)

That 22% figure raises a follow-up question: would these systems perform differently if explicitly instructed to behave therapeutically?

Separately, Iftikhar et al. (2025) conducted an 18-month evaluation of LLM-based counseling across GPT-3.0, GPT-4, GPT-5, Claude 3 Sonnet, Claude 3 Haiku, Llama 3.1, and Llama 3.2, all explicitly prompted to deliver evidence-based therapy. Seven peer counselors trained in therapeutic techniques conducted 137 sessions, including self-counseling tests and simulated clinical interactions, which were then evaluated by three licensed clinical psychologists. The study, published at the AAAI/ACM Conference on AI, Ethics, and Society, identified 15 distinct categories of ethical failure. Among them, the models:

  • mishandled crisis situations;
  • reinforced harmful beliefs;
  • exhibited what the researchers termed “deceptive empathy,” producing phrases like “I hear you” that mimic emotional understanding with no underlying comprehension.

These failures were systematic, not incidental. They appeared across all models tested.

If the failures are not model-specific and not resolved by prompting, the remaining question is where in the risk landscape they concentrate.

The third line of evidence addresses the failure zone that matters most clinically. McBain et al. (2025) tested GPT-4o mini, Claude 3.5 Sonnet, and Gemini 1.5 Pro with 30 suicide-related questions, each asked 100 times, for a total of 9,000 responses. The questions were risk-rated by 13 clinicians on a five-point scale. The results, published in Psychiatric Services, showed a clear pattern: all three models refused to answer very-high-risk questions in every instance. Two of the three chatbots answered very-low-risk questions 100% of the time; Gemini responded directly only 25% of the time. The failure zone was intermediate-risk questions, the ones that correspond to the ambiguous, contextual situations where real clinical encounters actually happen. In this zone, responses were inconsistent and unreliable. ChatGPT and Claude generated direct responses to lethality-related questions, in some cases for 100% of iterations. When the models refused, they typically offered generic crisis hotline referrals of variable quality rather than clinically useful guidance.

These findings are not isolated. A rapid scoping review by Chung et al. (2026), published in JMIR Mental Health, mapped 71 news articles to 36 unique adverse event case identifiers. Suicide death was the most frequently reported outcome (35 of 61 cases with complete severity coding), followed by psychiatric hospitalization. Media reports most commonly attributed causality to AI system behavior (45 of 61 coded entries). Causality cannot be established from media reports. But the pattern is consistent with the failure modes documented in controlled research.

And then there is Tessa. The chatbot was originally developed as a rule-based system for eating disorder prevention and passed a randomised controlled trial demonstrating its effectiveness (Fitzsimmons-Craft et al., 2022). After the company added generative AI capabilities without re-validating the safety architecture, Tessa began recommending, according to multiple news reports and the AI Incident Database, caloric deficits of 500 to 1,000 calories per day and weekly weighing to users seeking help for eating disorders. The National Eating Disorders Association suspended the chatbot in June 2023. A validated system, made dangerous by a capability upgrade that bypassed the safety engineering that had made it safe in the first place.

The Guardrail Paradox

Safety barriers that block paths do not eliminate the need to travel. In therapeutic AI, guardrails that shut down conversations do not eliminate the distress; they eliminate the system's awareness of it. Illustration by Khyati Trehan, Visualising AI / Google DeepMind.

The instinctive response to the evidence above is straightforward: add more guardrails, filter harmful output, block dangerous topics, and shut down conversations that enter risky territory.

This instinct is partially correct and partially dangerous.

Siddals, Torous, and Coxon (2024) published one of the first qualitative studies of user experience with LLM chatbots for mental health support. Nineteen participants across eight countries described their interactions with Pi, ChatGPT, Copilot, and Kindroid in semi-structured interviews. Four themes emerged, and one of them complicates the safety narrative. Users described these systems as an “emotional sanctuary,” a non-judgmental, always-available space where they could express thoughts they were unwilling to share elsewhere. But they also reported that safety guardrails, the very mechanisms designed to protect them, were experienced as “rejection in a time of need.”

When a user is in acute distress and the system shuts down the conversation, what the user experiences is not safety. It is abandonment. One participant described self-censoring to avoid triggering safety responses (Siddals et al., 2024). The implication is worth stating plainly: in that case, the guardrail did not prevent the distress; it prevented the system from knowing about it.

This creates a genuinely difficult engineering problem. The answer is not fewer guardrails. The answer is better ones. But “better” here requires holding two truths simultaneously: unfiltered language model output in therapeutic contexts can cause measurable harm, and blunt safety filtering can drive away the most vulnerable users at the moments they most need support.

The American Psychological Association’s 2025 Health Advisory on AI chatbots and wellness applications adds another dimension to this tension. The advisory states that many general-purpose, consumer-facing models are “trained to be highly agreeable to users (sycophancy),” reinforcing confirmation bias and maladaptive beliefs by validating views rather than challenging them (APA, 2025). But effective therapy does not work this way. Qualified mental health providers are “trained to modulate their interactions, supporting and challenging, in service of a patient’s best interest” (APA, 2025). A system that only agrees with the user is not providing therapeutic support. It is providing a simulation of it.

The problem, then, is not just too little safety. It is also too much of the wrong kind. A system that harms through negligence and a system that validates without understanding fail the user in different ways, but they both fail. The engineering challenge is building a system that can maintain supportive presence while preventing harm, routing to human help when the system reaches its limits rather than shutting down entirely (Kamar, 2016). Neither pure filtering nor pure sycophancy achieves this.
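That middle path, maintaining supportive presence while escalating, can be expressed as a graded router rather than a binary filter. A minimal sketch in Python, where `classify` stands in for a hypothetical risk classifier and the strategy names are purely illustrative:

```python
from enum import Enum

class RiskLevel(Enum):
    LOW = 0
    INTERMEDIATE = 1
    HIGH = 2

def route(message: str, classify) -> str:
    """Pick a response strategy from assessed risk instead of binary blocking."""
    level = classify(message)
    if level is RiskLevel.HIGH:
        # Acute risk: surface human help and keep the session open,
        # rather than terminating the conversation (the "abandonment" failure).
        return "supportive_handoff"
    if level is RiskLevel.INTERMEDIATE:
        # The zone McBain et al. found least reliable: constrain generation
        # and flag for review instead of answering freely or refusing outright.
        return "constrained_response"
    return "normal_response"
```

The design point is that no branch ends the conversation; even the highest-risk path pairs escalation with continued presence.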

The Brewster data makes this concrete: the gap between 22% and 83% appropriate responses is not a gap in model capability. It is a gap in how the systems were built. Iftikhar et al. confirmed that explicit therapeutic prompting does not close it. The question, then, is not how these systems should respond. It is how they should be architected.

The Five Architectural Commitments

There is a term for systems where failure can cause harm to health, safety, or fundamental rights. In engineering standards published by IEEE, IEC, and ISO, the term is “safety-critical.”

The standards are not new. IEC 62304 defines software lifecycle processes for medical devices. ISO/IEC 42001 establishes AI management systems. ISO/IEC 23894 provides guidance on AI risk management. Applied to therapeutic conversational AI, these frameworks point to five architectural commitments.

Per-user data isolation ensures that one person’s therapeutic disclosures are architecturally inaccessible to another (ISO/IEC 42001). Language safety filtering operates at both input and output stages, not just output classification after the fact (Ye et al., 2024). The system must maintain persistent, structured memory across sessions; users who are forced to re-explain their situation each time experience it as being forgotten (IEC 62304 traceability; Siddals et al., 2024). Full interaction observability, through traceable audit infrastructure rather than just error logging, makes it possible to understand what the system did and why (Raji et al., 2020; EU AI Act Article 12). And checkpoint-based recovery means that system failures do not result in lost conversational state when a user is mid-disclosure (ISO/IEC 23894).
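The filtering and observability commitments compose naturally. A minimal sketch, with everything illustrative: `check` stands in for a real safety classifier (here a trivial keyword screen), and `audit` for the traceable logging infrastructure described above:

```python
def check(text: str) -> tuple[bool, str]:
    """Stand-in for a real safety classifier; here a trivial keyword screen."""
    for term in ("example-unsafe-term",):
        if term in text.lower():
            return False, f"matched '{term}'"
    return True, "ok"

def safe_fallback() -> str:
    # A supportive holding response, not a conversation shutdown.
    return "I want to stay with you on this. Can you tell me more?"

def respond(user_id: str, message: str, generate, audit) -> str:
    """Two-stage filtering: screen the input, then screen the model's draft."""
    ok, why = check(message)              # input stage
    audit(user_id, "input", ok, why)      # observability: every decision is logged
    if not ok:
        return safe_fallback()
    draft = generate(message)
    ok, why = check(draft)                # output stage
    audit(user_id, "output", ok, why)
    return draft if ok else safe_fallback()
```

Note that the fallback keeps the session alive, and that a blocked message still leaves an audit record, so the system retains awareness of the distress it declined to answer directly.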

In building Novii, a therapeutic conversational AI system, these five commitments were architectural decisions from the first design session. The second article in this series examines each in technical depth.

Why “Safe Enough” Is Not A Standard

The most consequential finding in the systematic review literature on therapeutic chatbots is not about effectiveness. It is about evaluation.

Abd-Alrazaq et al. (2020) conducted the first major systematic review and meta-analysis of chatbots for mental health. Of the 12 included studies, only two measured safety at all. Both relied on passive monitoring, the absence of reported adverse events, rather than formal safety assessment. The authors concluded that this evidence was “not sufficient to conclude that chatbots are safe, given the high risk of bias.” That was 2020.

Five years later, the evaluation gap has not closed. Hua et al. (2025), in a systematic review published in World Psychiatry, reviewed 160 studies from 2020 to 2024. LLM-based chatbots had surged to 45% of new studies by 2024, but only 16% of LLM studies underwent clinical efficacy testing; 77% remained in early validation stages (Hua et al., 2025). The field is deploying systems at scale while evaluating them at pilot.

This evaluation vacuum is not just an academic concern. It has a direct consequence for the people who use these systems, and the consequence is asymmetric. Wang et al. (2025), in a study presented at CHI 2025, found that users with severe social anxiety trusted generative AI chatbots more readily than those with milder symptoms. The most vulnerable users, those with the greatest need for safe systems and the least capacity to evaluate whether they are getting one, are the ones who place the most trust in these tools.

The retention data makes the full picture starker. Baumel et al. (2019), using objective usage data from 93 Android mental health apps with 10,000 or more installs, found a median 30-day retention rate of 3.3%. Kidman et al. (2024) confirmed this pattern across 18 studies and more than 525,000 participants: 70% of users discontinue within 100 days. Privacy concerns emerged as a documented driver of abandonment, alongside lack of personalisation and variety. Systems that fail to retain users are often the same ones that fail to protect them.

And here is a tension that honesty requires naming. The evidence on trust and disclosure points in two opposing directions. Transparency about AI involvement is ethically non-negotiable. But experimental research shows that identical medical advice is perceived as significantly less reliable and less empathetic when labelled as AI-generated, and that participants are less willing to follow it (Reis, Reis, & Kunde, 2024). The cost of the alternative is equally documented. Koko, an online mental health support platform, issued a public apology in January 2023 after using ChatGPT to generate emotional responses while misleading users into believing they were written by real people (Blease, 2025). Blease (2025) has proposed that AI disclosure in mental health contexts may produce a nocebo effect, where knowledge of AI involvement reduces perceived helpfulness regardless of content quality. We are ethically required to be transparent about something that empirically reduces the user’s experience of being helped. This paradox does not have a clean resolution. But pretending it does not exist is not an option for anyone building systems in this space.

A Question of Responsibility

The question is not whether therapeutic AI should exist. People are already using it, at scale, in moments of genuine vulnerability. The question is whether the people building these systems will treat them as what they are: safety-critical systems, operating in conditions where failure has documented, sometimes fatal, consequences. The standards for what that means exist. The engineering to meet them is achievable. The choice to apply them is not technical. It is a decision about what kind of responsibility comes with building something that people reach for at two in the morning, when nothing else is available, and they need it to work.


Disclosure: The author architects Novii, a therapeutic conversational AI system referenced in this article, as a consulting engagement through Continuum Pulse.


References


Abd-Alrazaq, A.A., Rababeh, A., Alajlani, M., Bewick, B.M., & Househ, M. (2020). Effectiveness and safety of using chatbots to improve mental health: Systematic review and meta-analysis. Journal of Medical Internet Research, 22(7), e16021. https://doi.org/10.2196/16021

American Psychological Association. (2025). Health advisory on the use of generative AI chatbots and wellness applications for mental health. https://www.apa.org/topics/artificial-intelligence-machine-learning/health-advisory-chatbots-wellness-apps

Baumel, A., Muench, F., Edan, S., & Kane, J.M. (2019). Objective user engagement with mental health apps: Systematic search and panel-based usage analysis. Journal of Medical Internet Research, 21(9), e14567. https://doi.org/10.2196/14567

Blease, C. (2025). Placebo, nocebo, and machine learning: How generative AI could shape patient perception in mental health care. JMIR Mental Health, 12, e78663. https://doi.org/10.2196/78663

Brewster, R.C.L., et al. (2025). Characteristics and safety of consumer chatbots for emergent adolescent health concerns. JAMA Network Open, 8(10), e2539022. https://doi.org/10.1001/jamanetworkopen.2025.39022

Chung, V.H.A., Bernier, P., & Hudon, A. (2026). Mass media narratives of psychiatric adverse events associated with generative AI chatbots: Rapid scoping review. JMIR Mental Health, 13(1), e93040. https://doi.org/10.2196/93040

FDA Digital Health Advisory Committee. (2025). Meeting on generative AI-enabled digital mental health medical devices, November 6, 2025. https://www.fda.gov/medical-devices/digital-health-center-excellence/fda-digital-health-advisory-committee

Fitzsimmons-Craft, E.E., Chan, W.W., et al. (2022). Effectiveness of a chatbot for eating disorders prevention: A randomized clinical trial. International Journal of Eating Disorders, 55(3), 343-353. https://doi.org/10.1002/eat.23662

Hua, Y., Siddals, S., Ma, Z., et al. (2025). Charting the evolution of artificial intelligence mental health chatbots from rule-based systems to large language models: A systematic review. World Psychiatry, 24, 383-394. https://doi.org/10.1002/wps.21352

Iftikhar, Z., Xiao, A., Ransom, S., Huang, J., & Suresh, H. (2025). How LLM counselors violate ethical standards in mental health practice: A practitioner-informed framework. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES), 8(2), 1311-1323. https://doi.org/10.1609/aies.v8i2.36632

Kamar, E. (2016). Directions in hybrid intelligence: Complementing AI systems with human intelligence. Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), 4070-4073. https://www.ijcai.org/Proceedings/16/Papers/603.pdf

Kidman, P.G., Curtis, R.G., Watson, A., & Maher, C.A. (2024). When and why adults abandon lifestyle behavior and mental health mobile apps: Scoping review. Journal of Medical Internet Research, 26, e56897. https://doi.org/10.2196/56897

McBain, R., Cantor, J.H., Zhang, L.A., et al. (2025). AI chatbot responses to suicide-related questions. Psychiatric Services. https://www.rand.org/news/press/2025/08/ai-chatbots-inconsistent-in-answering-questions-about.html

Raji, I.D., Smart, A., White, R.N., Mitchell, M., Gebru, T., Hutchinson, B., Smith-Loud, J., Theron, D., & Barnes, P. (2020). Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing. Proceedings of the 2020 ACM Conference on Fairness, Accountability, and Transparency (FAccT). https://doi.org/10.1145/3351095.3372873

Reis, M., Reis, F., & Kunde, W. (2024). Influence of believed AI involvement on the perception of digital medical advice. Nature Medicine, 30, 3098-3100. https://doi.org/10.1038/s41591-024-03180-7

Siddals, S., Torous, J., & Coxon, A. (2024). “It happened to be the perfect thing”: Experiences of generative AI chatbots for mental health. npj Mental Health Research, 3, 52. https://doi.org/10.1038/s44184-024-00097-4

Song, I., Pendse, S.R., Kumar, N., & De Choudhury, M. (2024). The typing cure: Experiences with large language model chatbots for mental health support. Proceedings of the ACM on Human-Computer Interaction (CSCW). https://doi.org/10.1145/3757430

Wang, Y., Wang, Y., Crace, K., & Zhang, Y. (2025). Understanding attitudes and trust of generative AI chatbots for social anxiety support. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3706598.3714286

Ye, J., et al. (2024). ToolSword: Unveiling safety issues of large language models in tool learning across three stages. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Volume 1: Long Papers, 2181-2211. https://doi.org/10.18653/v1/2024.acl-long.119

Standards

  • IEC 62304:2006+AMD1:2015. Medical device software: Software life cycle processes.
  • ISO/IEC 42001:2023. Information technology: Artificial intelligence: Management system.
  • ISO/IEC 23894:2023. Information technology: Artificial intelligence: Guidance on risk management.
  • EU Artificial Intelligence Act, Regulation (EU) 2024/1689. Articles 10, 12.