Can AI Diagnose?
Experiments in Clinical Psychology.
Attunement HQ
At Attunement, we’re building at the intersection of AI and mental health, an inspiring space that demands care and thoughtfulness. While the challenges of this field deserve their own conversation, today we’re excited to share the experiments we've conducted over the past few months.
Attunement applies AI to assist humans in clinical workflows. We believe that, given current model capabilities, the healing process is best facilitated with humans in the driver’s seat while AI enhances existing workflows, enabling greater precision. Following the roadmap offered in a recent paper from Stanford HAI, we position our approach as an assistive LLM: by stripping out process inefficiencies and streamlining routine work, we aim to increase accuracy without compromising the human-led approach.
When I founded Attunement, a thoughtful friend and investor asked me over breakfast on Market Street: just how accurately do off-the-shelf LLMs match the mental health diagnoses generated by clinicians? That prompted us to expand the question: just how much automation is the right amount?
Here’s what we discovered.
Experiment 1: Fully autonomous diagnosis
We tested the accuracy of GPT-4 by evaluating challenging fringe diagnoses drawn from clinical texts. These cases involved overlapping symptoms across DSM categories, posing a real test of the model’s ability to handle complex, non-standard presentations. To establish an efficient baseline, we applied few-shot learning in a unified prompt that incorporated the DSM rubric, then measured how often the LLM matched the human-labeled diagnoses.
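To make the setup concrete, here is a minimal sketch of that baseline, assuming the current OpenAI Python client; the rubric text, few-shot examples, and function names are simplified stand-ins for our actual prompts, not the production code.

```python
# A minimal sketch of the Experiment 1 baseline (illustrative only).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A handful of worked examples, each paired with its DSM-5 label (abridged here).
FEW_SHOT_EXAMPLES = """\
Case: 34-year-old reporting two weeks of low mood, anhedonia, and early-morning waking...
Diagnosis: Major Depressive Disorder, single episode
"""

# Condensed DSM rubric used as the system instruction.
DSM_RUBRIC = (
    "You are assisting with a diagnostic exercise. Apply DSM-5 criteria and "
    "reply with the single best-fitting diagnosis, nothing else."
)

def diagnose(case_vignette: str) -> str:
    """Ask GPT-4 for one DSM-5 diagnosis, few-shot style, in a unified prompt."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep outputs stable for evaluation
        messages=[
            {"role": "system", "content": DSM_RUBRIC},
            {"role": "user", "content": f"{FEW_SHOT_EXAMPLES}\nCase: {case_vignette}\nDiagnosis:"},
        ],
    )
    return response.choices[0].message.content.strip()
```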
Experiment 2: Human-led diagnosis with AI support
In this experiment, clinicians made diagnoses with AI providing data support. The AI ingested and organized multiple data sources (intake, clinical history, physician notes), handling the lower-level data tasks often referred to as “busywork” by clinicians. This freed up time for the physician to focus on case conceptualization and diagnosis.
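A rough sketch of that data-support step is below; the section labels, system instruction, and output format are illustrative assumptions rather than our production pipeline. The “do not diagnose” instruction is the point: the model organizes, the clinician decides.

```python
# Sketch of the data-support step: collate a case's raw documents into one
# prompt and ask the model for an organized summary for clinician review.
# The instruction text and section labels are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def organize_case(intake: str, history: str, notes: str) -> str:
    """Turn raw clinical documents into a structured, clinician-ready summary."""
    sources = (
        f"## Intake form\n{intake}\n\n"
        f"## Clinical history\n{history}\n\n"
        f"## Physician notes\n{notes}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    "Organize these clinical documents for review. Group "
                    "findings by symptom domain, flag elevated scores, and "
                    "note discrepancies between sources. Do not diagnose."
                ),
            },
            {"role": "user", "content": sources},
        ],
    )
    return response.choices[0].message.content
```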
We tested the LLM across 40 clinical cases of varying complexity and symptom presentation. The AI’s predictions were compared with the true labels assigned by human experts to evaluate its accuracy and reliability in a clinical setting.
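Scoring predictions against expert labels is correspondingly simple; a sketch follows, including the “right DSM class, wrong disorder” case we report below. The class lookup is a stub, and the predictions would come from a function like diagnose() above.

```python
# Illustrative scoring: exact-match accuracy plus "right DSM class, wrong
# disorder" counts. The lookup table is a stub keyed on diagnosis name.
DSM_CLASS_LOOKUP: dict[str, str] = {
    "Major Depressive Disorder": "Depressive Disorders",
    "Persistent Depressive Disorder": "Depressive Disorders",
    # ...one entry per diagnosis in the case set
}

def dsm_class(diagnosis: str) -> str:
    """Map a specific diagnosis to its DSM-5 class, e.g. 'Depressive Disorders'."""
    return DSM_CLASS_LOOKUP.get(diagnosis, "Unknown")

def score(predictions: list[str], labels: list[str]) -> None:
    """Report exact diagnosis matches and class-only matches."""
    exact = sum(p == t for p, t in zip(predictions, labels))
    class_only = sum(
        p != t and dsm_class(p) == dsm_class(t)
        for p, t in zip(predictions, labels)
    )
    n = len(labels)
    print(f"Exact diagnosis matches: {exact}/{n} ({exact / n:.1%})")
    print(f"Correct DSM class, wrong disorder: {class_only}/{n}")
```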
Experiment 1: Results
GPT-4 identified the true diagnosis 57.5% of the time (23 of 40 cases) and misdiagnosed the remaining 42.5% (17 cases). This revealed clear limitations, particularly where symptoms overlapped across disorders. In 7 cases, the AI identified the correct DSM class but missed the specific disorder, underscoring the need for human discernment in complex, context-dependent mental health cases.
It’s important to note that we did not investigate full automation via fine-tuning or RLHF in these experiments, so there are inherent limits to the LLM’s ability to grasp the nuances of psychological diagnosis. Moreover, symptom-level categorization showed that many cases present overlapping symptoms, which the LLM can easily conflate. Because mental health is highly context-dependent, many cases genuinely require the judgment and insight that humans provide.
Experiment 2: Results
This is where things got interesting. As described above, clinicians made the diagnoses while the LLM ingested and organized the underlying data sources.
Clinicians typically start case conceptualization by forming an initial hypothesis. They gather information on the patient’s primary symptoms through direct conversations and by reviewing various data sources, including intake forms, medical history, rating scales, and referral notes. To refine and validate this hypothesis, clinicians then administer targeted assessments and conduct a clinical interview, ultimately verifying the hypothesis with testing and building a comprehensive understanding that guides the diagnosis.
By handling both structured and unstructured data, the LLM streamlined case conceptualization: it consolidated elevated scores across reports, converted report scores into clinical notes, and turned long PDFs into formatted tables. These are exactly the tasks traditionally handed off to graduate-level interns. This “busywork” costs time and money; any clinician will tell you how many hours they have spent laboriously aligning tables and manually entering scores in Word. Offloading it allowed physicians to dedicate more time and attention to case conceptualization and diagnosis.
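As one concrete example, here is what the score-consolidation step might look like in plain Python; the instrument names, scores, and elevation cutoff are illustrative, and cutoffs vary by instrument in practice.

```python
# Illustrative version of the score-consolidation busywork: pull elevated
# T-scores out of several rating-scale reports and emit one aligned table.
# The cutoff and report contents are assumptions for the sketch.

ELEVATED_T = 65  # illustrative cutoff; real cutoffs depend on the instrument

reports = {
    "BASC-3 (parent)": {"Anxiety": 71, "Depression": 58, "Attention Problems": 68},
    "BASC-3 (teacher)": {"Anxiety": 66, "Depression": 62, "Attention Problems": 73},
}

def elevated_scores_table(reports: dict[str, dict[str, int]]) -> str:
    """Collect every scale at or above the cutoff into one plain-text table."""
    rows = [
        (report, scale, score)
        for report, scales in reports.items()
        for scale, score in scales.items()
        if score >= ELEVATED_T
    ]
    lines = [f"{'Report':<20}{'Scale':<22}{'T-score':>7}"]
    lines += [f"{report:<20}{scale:<22}{score:>7}" for report, scale, score in rows]
    return "\n".join(lines)

print(elevated_scores_table(reports))
```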
With AI support, clinicians matched the original diagnoses while cutting the time spent on each case by 50%.
These findings demonstrate that pairing an existing LLM (in this case, GPT-4) with clinical workflows combines accuracy with efficiency. Physicians focus on what they do best, applying years of training to hone clinical intuition and discernment, while AI handles the data-intensive, low-level processing tasks, enhancing both speed and precision. Much like the work typically delegated to a graduate-level intern, the busywork is now seamlessly integrated into the clinical workflow. Along the jagged frontier of AI, this is the “cyborg” approach: drawing on the strengths of both humans and AI to give humans superpowers.
Fully automated psychological diagnosis with an existing LLM requires much more rigorous testing. Given the privacy and ethical concerns, prompt engineering appears to be the most appropriate method for steering outputs toward human diagnoses. However, because the human psyche is highly complex and context-dependent, effective diagnosis requires nuanced judgment, something our experiments quickly showed is far from straightforward.
Integrating AI into clinical workflows in a way that reflects a physician’s mental model and reorganizes information intuitively is extremely valuable: it reduces cognitive load and enhances clinical decision-making without reinventing the wheel.
If you’re interested in further exploring AI-driven mental health diagnosis, ongoing research by our colleagues in RLHF and deep learning is examining the ethical and empirical dimensions of AI in clinical settings. At Attunement, we’re focused on eliminating inefficiencies in today’s clinical workflows.