Kahun, an evidence-based clinical AI engine for healthcare providers, has shared the results of a recent study on the medical capabilities of readily available large language models (LLMs). The study compared the medical accuracy of OpenAI’s GPT-4 and Anthropic’s Claude3-Opus with each other and with human medical experts, using questions based on objective medical knowledge extracted from Kahun’s Knowledge Graph. The study found that Claude3 outperformed GPT-4 in accuracy, but both fell short of human medical experts and of the objective medical knowledge itself. Both LLMs answered about a third of the questions incorrectly, and GPT-4 answered almost half of the numerical questions incorrectly.
According to a recent study, 91 percent of physicians have expressed concerns about choosing the right generative AI model to use, and said they need to know that a model’s source materials were created by physicians or medical experts before adopting it. Physicians and healthcare organizations already use AI for its strengths in administrative tasks, but to ensure the accuracy and safety of these models in clinical tasks, the limitations of generative AI models must be addressed.
Using its proprietary knowledge graph, a structured representation of scientific facts drawn from peer-reviewed sources, Kahun leveraged its unique position to lead a collaborative study of the current capabilities of two popular LLMs: GPT-4 and Claude3. Incorporating data from over 15,000 peer-reviewed articles, Kahun generated 105,000 evidence-based medical QAs (questions and answers), classified as numerical or semantic and spanning multiple medical disciplines, which were fed directly into each LLM.
Numerical QAs involve correlating findings from a single source for a specific query (e.g., the prevalence of dysuria in patients with urinary tract infections), whereas semantic QAs involve differentiating between entities in specific medical queries (e.g., selecting the most common subtypes of dementia). Importantly, Kahun led the research team in providing a foundation for evidence-based QAs that resembled the short, single-line questions a physician might ask in everyday medical decision-making.
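For illustration only, the sketch below shows what a single-line QA of each type might look like and how such a prompt could be sent to GPT-4 and Claude 3 Opus through their publicly available Python SDKs. The question wording, answer instructions, and model snapshot names are assumptions made for this example; the study’s actual prompting setup is described in the paper linked at the end of this article.

```python
# Illustrative sketch: hypothetical single-line QAs and a way to submit them
# to GPT-4 and Claude 3 Opus via the public OpenAI and Anthropic SDKs.
# Prompts and model snapshot names are assumptions, not the study's own setup.
from openai import OpenAI
import anthropic

NUMERICAL_QA = (
    "What is the approximate prevalence of dysuria in patients with a "
    "urinary tract infection? Answer with a percentage or 'Don't know'."
)
SEMANTIC_QA = (
    "Which is the most common subtype of dementia? "
    "Answer with a single entity or 'Don't know'."
)


def ask_gpt4(prompt: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def ask_claude3_opus(prompt: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=128,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text


if __name__ == "__main__":
    for qa in (NUMERICAL_QA, SEMANTIC_QA):
        print(qa)
        print("  GPT-4:  ", ask_gpt4(qa))
        print("  Claude3:", ask_claude3_opus(qa))
```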
Analyzing more than 24,500 Q&A responses, the research team uncovered the following key findings:
- Both Claude3 and GPT-4 performed better on semantic QAs (68.7 and 68.4 percent accuracy, respectively) than on numerical QAs (63.7 and 56.7 percent, respectively), with Claude3 showing clearly better numerical accuracy.
- The research shows that each LLM produces different results depending on how it is prompted, highlighting that the same QA prompt can yield completely different results from different models.
- For validation purposes, six healthcare workers answered 100 numerical QAs and achieved 82.3 percent accuracy, compared with 64.3 percent for Claude3 and 55.8 percent for GPT-4 on the same questions.
- Kahun’s research shows that both Claude3 and GPT-4 perform comparatively well on semantic questions, but it ultimately supports the view that general-purpose LLMs are not yet mature enough to serve as reliable information assistants for physicians in clinical settings.
- The study included a “Don’t know” option to reflect situations in which a clinician must admit uncertainty. Response rates differed between the LLMs (numerical: Claude3 63.66%, GPT-4 96.4%; semantic: Claude3 94.62%, GPT-4 98.31%). However, the correlation between accuracy and response rate was not significant for either LLM, suggesting that their ability to admit a lack of knowledge is questionable and that, without prior knowledge of both the medical domain and the specific model, an LLM’s reliability cannot be assumed. (A short scoring sketch after this list shows how accuracy and response rate are computed separately.)
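To make those two metrics concrete, here is a minimal scoring sketch, assuming a simple response format in which a model either selects an answer or returns the literal string "Don't know"; the function and data layout are hypothetical, not taken from the study.

```python
# Minimal sketch: scoring accuracy and response rate separately when a
# "Don't know" option is allowed. The (model_answer, correct_answer) tuple
# format is an assumption made for this example.

def score(responses):
    """responses: list of (model_answer, correct_answer) string tuples."""
    answered = [(m, c) for m, c in responses if m != "Don't know"]
    response_rate = len(answered) / len(responses) if responses else 0.0
    correct = sum(1 for m, c in answered if m == c)
    accuracy = correct / len(answered) if answered else 0.0
    return response_rate, accuracy


# Example: 3 of 4 questions answered, 2 of those 3 answered correctly.
demo = [
    ("25-50%", "25-50%"),
    ("Don't know", ">75%"),
    ("Alzheimer's disease", "Alzheimer's disease"),
    ("<25%", "25-50%"),
]
print(score(demo))  # -> (0.75, 0.6666666666666666)
```

A high response rate with low accuracy, or a low response rate without a corresponding gain in accuracy, is exactly the pattern the finding above describes: the models’ willingness to answer does not reliably track what they actually know.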
The QAs were extracted from Kahun’s proprietary Knowledge Graph, which contains over 30 million evidence-based medical insights from peer-reviewed medical publications and sources and covers sophisticated statistical and clinical relationships in medicine. Kahun’s AI Agent solution enables medical professionals to ask case-specific questions and receive clinically informed answers cited from the medical literature. By citing the evidence behind its answers and protocols, the AI Agent increases physician trust, thereby improving overall efficiency and quality of care. The company’s solution overcomes the limitations of current generative AI models by delivering factual insights grounded in medical evidence, providing the consistency and transparency necessary for disseminating medical knowledge.
The full preprint of the study is available here: https://arxiv.org/abs/2406.03855.