Are AI Chatbots 'Dangerous' for Diagnosing Health Problems?

Whether AI chatbots should be used for medical advice has long been debated, and the practice is under fire this week after researchers at the University of Oxford said that asking a chatbot about symptoms can be "dangerous" and that interacting with people poses a challenge "even for top" AI models.
Their study of AI reliability for health-related questions involved nearly 1,300 UK participants, each assigned to one of four groups. Three of the groups each used a chatbot: GPT-4o, Llama 3 or Command R+. The fourth was a "control" group that did not use an AI chatbot at all.
The control group was told it could use any method, including internet search or personal judgment, to identify a health condition.
How the study worked
Qualified doctors drafted ten medical scenarios, which were passed on to doctors who provided the correct diagnosis for each one.
Each participant in the study was randomly allocated one of these medical scenarios and, unless they were in the control group, conversed with their group's AI chatbot to help assess it.
The participants then reported back the advice the AI had given them, or what they had concluded from other resources.
They named the condition they believed the scenario described and their "disposition", meaning what they would choose to do next, such as seeking urgent primary care, seeing their GP or calling an ambulance.
The key findings of the study
The study found that the LLMs (large language models) performed no better than traditional methods.
People who used them did not make better decisions about what to do about their condition than participants who relied on traditional methods such as online searches or their own judgment.
The accuracy of the recommended course of action also varied by model. GPT-4o performed best at 64.7%, Command R+ scored just over half at 55.5%, and Llama 3 scored just under half at 48.8%.
"These findings highlight the difficulty of building AI systems that can genuinely support people in sensitive, high-stakes areas like health," says Dr Rebecca Payne, who is a GP and was the lead medical practitioner on the study.
"Despite all the hype, AI just isn't ready to take on the role of the physician.
"Patients need to be aware that asking a large language model about their symptoms can be dangerous, giving wrong diagnoses and failing to recognise when urgent help is needed."
AI health services
The study arrives at a notable moment for the intersection of health and AI. ChatGPT Health was released just last month and already has more than 230 million users worldwide asking health-related questions every week, and more than 40 million every day.
However, the University of Oxford study suggests that two of the three chatbots tested, Command R+ and Llama 3, recommended the right course of action only around half of the time.
Lead author Andrew Bean said the findings show how interacting with humans poses a challenge "even for top" AI models.
"We hope this work will contribute to the development of safer and more useful AI systems," he says.
Is AI ready to give medical advice?
Others shared their views online under the University of Oxford's LinkedIn post, and their answer was resounding.
"A cautionary tale that LLMs are not yet ready to provide reliable medical advice," says Martin Thomas, Marketing Consultant at Crowdsurfing Ltd.
"I emphasise the 'yet' because there are a few factors highlighted in the report that AI developers will inevitably address," he adds.
One of these factors is that users did not know what information they should provide to the AI chatbot.
Elsewhere, Saadia Mahmud, a Management Consultant at Interactive, said: "AI has uses but being a double edged sword we need to use it wisely. Can AI replace a physician? My view is a resounding 'No'.
"AI tools have no moral compass and ethical conviction."
