Human-AI collectives most accurately diagnose clinical vignettes

Proc Natl Acad Sci U S A. 2025 Jun 17;122(24):e2426153122. doi: 10.1073/pnas.2426153122. Epub 2025 Jun 13.

Abstract

AI systems, particularly large language models (LLMs), are increasingly being employed in high-stakes decisions that impact both individuals and society at large, often without adequate safeguards to ensure safety, quality, and equity. Yet LLMs hallucinate, lack common sense, and are biased, shortcomings that may reflect LLMs' inherent limitations and thus may not be remedied by more sophisticated architectures, more data, or more human feedback. Relying solely on LLMs for complex, high-stakes decisions is therefore problematic. Here, we present a hybrid collective intelligence system that mitigates these risks by leveraging the complementary strengths of human experience and the vast information processed by LLMs. We apply our method to open-ended medical diagnostics, combining 40,762 differential diagnoses made by physicians with the diagnoses of five state-of-the-art LLMs across 2,133 text-based medical case vignettes. We show that hybrid collectives of physicians and LLMs outperform both single physicians and physician collectives, as well as single LLMs and LLM ensembles. This result holds across a range of medical specialties and levels of professional experience, and can be attributed to the complementary contributions of humans and LLMs, which lead to different kinds of errors. Our approach highlights the potential for collective human and machine intelligence to improve accuracy in complex, open-ended domains like medical diagnostics.
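The abstract does not specify how the physician and LLM differentials are combined into a hybrid collective diagnosis. The snippet below is a minimal illustrative sketch, not the authors' method: it assumes a simple reciprocal-rank pooling rule in which every contributor (human or LLM) submits a ranked differential and each diagnosis is scored by the sum of its reciprocal ranks. The function name, scoring rule, and example data are all hypothetical.

```python
# Hypothetical sketch: pooling ranked differential diagnoses from physicians
# and LLMs into a single hybrid differential. The reciprocal-rank scoring
# rule and all example data are illustrative assumptions, not the
# aggregation procedure reported in the paper.
from collections import defaultdict

def aggregate_differentials(ranked_lists, top_k=3):
    """Score each diagnosis by the sum of reciprocal ranks across all
    contributors, then return the top_k highest-scoring diagnoses."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, diagnosis in enumerate(ranking, start=1):
            scores[diagnosis.lower()] += 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Made-up inputs for one vignette: two physician differentials and two
# LLM differentials (content is for demonstration only).
physician_ddx = [
    ["pulmonary embolism", "pneumonia", "myocardial infarction"],
    ["pneumonia", "pulmonary embolism", "pericarditis"],
]
llm_ddx = [
    ["pulmonary embolism", "myocardial infarction", "aortic dissection"],
    ["pulmonary embolism", "pneumonia", "pleuritis"],
]

hybrid_top3 = aggregate_differentials(physician_ddx + llm_ddx)
print(hybrid_top3)  # e.g. ['pulmonary embolism', 'pneumonia', 'myocardial infarction']
```

Under this kind of rank-pooling rule, diagnoses that appear consistently near the top of both human and LLM differentials rise in the hybrid ranking, while errors idiosyncratic to one group tend to be diluted, which is one plausible mechanism behind the complementarity described in the abstract.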

Keywords: AI; collective intelligence; health informatics; large language models; medical diagnostics.

MeSH terms

  • Artificial Intelligence*
  • Diagnosis, Differential
  • Humans
  • Physicians