Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine
Recent research shows GPT-4V outperforms physicians in medical imaging accuracy but has flawed rationales. Its potential in decision support requires further evaluation before clinical use, highlighting AI's limitations.
Recent research has highlighted the performance of Generative Pre-trained Transformer 4 with Vision (GPT-4V) on medical imaging tasks, indicating that it can outperform human physicians in multiple-choice accuracy. In a study analyzing GPT-4V's capabilities on 207 questions from the New England Journal of Medicine (NEJM) Image Challenge, the model achieved an accuracy of 81.6%, compared with 77.8% for physicians. Notably, GPT-4V correctly answered more than 78% of the questions that physicians got wrong.

However, the study revealed significant flaws in GPT-4V's rationales, particularly in image comprehension, where 27.2% of cases contained errors despite correct final answers. The model's recall of medical knowledge was more reliable, with error rates between 11.6% and 13.0%. The findings suggest that while GPT-4V shows promise for decision support, its reasoning processes require further evaluation before clinical integration. The study also noted that human physicians outperformed GPT-4V in open-book settings, especially on difficult questions, underscoring the importance of comprehensive evaluations of AI models in medical contexts.

Limitations of the study included potential selection bias in the NEJM Image Challenge cases and the focus on single-answer questions, which may not reflect the complexity of real clinical scenarios. Future research aims to compare GPT-4V's rationales with those of physicians to better understand its decision-making process. Overall, while GPT-4V demonstrates high accuracy, its flawed rationales highlight the need for caution in its application in clinical settings.
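For readers who want to see how these figures relate, here is a minimal, illustrative sketch (not the authors' code) of how the headline metrics might be tallied from per-question grades; the field names and the exact denominators are assumptions.

```python
# Illustrative sketch only (not the authors' code): how the headline metrics
# described above might be tallied from per-question grades. The field names
# and the exact denominators are assumptions.
from dataclasses import dataclass

@dataclass
class GradedCase:
    model_correct: bool               # GPT-4V picked the right option
    physician_correct: bool           # physicians answered this question correctly
    image_rationale_flawed: bool      # error graded in the "Image comprehension" section
    knowledge_rationale_flawed: bool  # error graded in the "Recall of medical knowledge" section

def summarize(cases: list[GradedCase]) -> dict:
    n = len(cases)
    physician_misses = [c for c in cases if not c.physician_correct]
    model_hits = [c for c in cases if c.model_correct]
    return {
        "model_accuracy": sum(c.model_correct for c in cases) / n,          # 81.6% in the study
        "physician_accuracy": sum(c.physician_correct for c in cases) / n,  # 77.8%
        # accuracy on the subset of questions physicians missed (>78% in the study)
        "model_acc_on_physician_misses":
            sum(c.model_correct for c in physician_misses) / max(len(physician_misses), 1),
        # share of correct answers whose image rationale was still flawed (27.2%; denominator assumed)
        "flawed_image_rationale_despite_correct_answer":
            sum(c.image_rationale_flawed for c in model_hits) / max(len(model_hits), 1),
        # flawed recall of medical knowledge (11.6-13.0%; denominator assumed)
        "flawed_knowledge_rationale":
            sum(c.knowledge_rationale_flawed for c in cases) / n,
    }
```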
Related
AI can beat real university students in exams, study suggests
A study from the University of Reading reveals AI outperforms real students in exams. AI-generated answers scored higher, raising concerns about cheating. Researchers urge educators to address AI's impact on assessments.
Gemini's data-analyzing abilities aren't as good as Google claims
Google's Gemini 1.5 Pro and 1.5 Flash AI models face scrutiny for poor data analysis performance, struggling with large datasets and complex tasks. Research questions Google's marketing claims, highlighting the need for improved model evaluation.
My finetuned models beat OpenAI's GPT-4
Alex Strick van Linschoten discusses his finetuned models Mistral, Llama3, and Solar LLMs outperforming OpenAI's GPT-4 in accuracy. He emphasizes challenges in evaluation, model complexities, and tailored prompts' importance.
Study reveals why AI models that analyze medical images can be biased
A study by MIT researchers uncovers biases in AI models analyzing medical images, accurately predicting patient race from X-rays but showing fairness gaps in diagnosing diverse groups. Efforts to debias models vary in effectiveness.
Can ChatGPT do data science?
A study led by Bhavya Chopra at Microsoft, with contributions from Ananya Singha and Sumit Gulwani, explored ChatGPT's challenges in data science tasks. Strategies included prompting techniques and leveraging domain expertise for better interactions.
I'm really looking forward to studies that look at performance comparisons in realistic environments. I believe there is a potential revolution brewing for PCPs. Having someone who actually listens and has real in-depth knowledge that goes beyond normal med students could be a game changer when compared to the usual conveyor belt care that gives everyone the same diagnosis based on a quick glance.
When I say bad, I mean that it cannot properly identify whether the bone is human or animal, or which bone it is, and describes something completely different.
The benchmarks are too optimistic in such cases.
{image}
{question}
{choices}
Please first describe the image in a section named “Image comprehension”. Then, recall relevant medical knowledge that is useful for answering the question but is not explicitly mentioned, in a section named “Recall of medical knowledge”. Finally, based on the first two sections, provide your step-by-step reasoning and answer the question in a section named “Step-by-step reasoning”. Please be concise.
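For context: the block above is the structured prompt the study reportedly used for each case. Below is a minimal sketch, not the authors' evaluation harness, of how such a prompt plus an image could be sent to a vision-capable chat model using the OpenAI Python SDK; the model name, the base64 data-URL image encoding, and the choice formatting are assumptions.

```python
# Minimal sketch (not the study's harness): send the structured prompt plus an
# image to a vision-capable chat model via the OpenAI Python SDK.
# The model name and base64 data-URL encoding are assumptions.
import base64
from openai import OpenAI

PROMPT_TEMPLATE = (
    "{question}\n{choices}\n"
    "Please first describe the image in a section named \"Image comprehension\". "
    "Then, recall relevant medical knowledge that is useful for answering the question "
    "but is not explicitly mentioned, in a section named \"Recall of medical knowledge\". "
    "Finally, based on the first two sections, provide your step-by-step reasoning and "
    "answer the question in a section named \"Step-by-step reasoning\". Please be concise."
)

def ask_image_question(image_path: str, question: str, choices: list[str]) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption; the paper evaluated GPT-4V
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text",
                 "text": PROMPT_TEMPLATE.format(
                     question=question,
                     choices="\n".join(f"{i + 1}. {c}" for i, c in enumerate(choices)))},
            ],
        }],
    )
    return response.choices[0].message.content
```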
I am dealing with serious post COVID complications (Long COVID). Doctors in most places don't seem to be taking this condition seriously and believe it to be a made up thing, but COVID causing damage to bodies and brains is a very real thing.
Thus, for a lot of long COVID patients, treatment has to largely be figured out by oneself, often through reading upcoming studies and experimental treatment reports from online forums.
ChatGPT has been really helpful to me and other patients I know to help make sense of and get context around medication, supplements or other treatments we hear about. I'm not usually asking it what to do, but it really helps me understand what a given medication (say sulbutiamine or whatever) does and how it works and what potential interactions it has with other medicines etc.
The other day it helped me figure out which amino acid complex to take, I put in the nutritional information of different products and had a long conversation with it to figure out which one is a better fit for me, and then confirmed this with external web searches.
I've also used it to help figure out what bloodwork to get and what diets or exercise routines are feasible for someone in my position (with limited mobility, strength, and other resources). I can confidently say that I would be in a much worse place with my health than I currently am if I didn't have ChatGPT to consult.
It gives me important context and information to help me make informed decisions about my health.
These are conversations I'd ideally have with a doctor but at least ChatGPT fucking listens and knows some stuff. I don't want to have to rely on a damn chatbot for my health, but high quality doctors aren't accessible to me due to inequalities of this world. So while y'all work on that, I'mma do what I can to survive.
The practice of medicine does not produce cute little cue-card prompts with 4 options. Diagnosis means trading the risk of various interventions against their diagnostic value, including a wait-and-see strategy. It means consulting with the patient on impact, as well as conferring with other specialists.
It's of less-than-zero value to have 80% machine accuracy against 75% clinician accuracy if the machine's mistakes are high risk, unconcerned with the impact of the intervention on the patient, or offer little therapeutic upside. Likewise, on these decision benchmarks *machine decisions are NEVER iterated* -- this makes any claim to "performance" not even borderline pseudoscience but, imv, malpractice.
In the real world, practitioner decisions are iterated: doctors do not make high-risk mistakes on every sequential decision, given more diagnostic information. It is highly unlikely that an inaccurate high-risk judgement call is compounded at each intervention. Machine decisions, by contrast, are never tested in this sequential manner -- if they were, their benchmark performance would drop off a cliff.
The production of non-iterated decision benchmarks which measure accuracy and not risk-adjusted real-world impact constitutes, imv, basic malpractice in the ML community. This should be urgently called out before non-experts think that LLMs can give credible medical advice.
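To make that point concrete, here is a toy simulation with entirely invented numbers (nothing from the study): a one-shot "machine" with 80% accuracy versus a clinician with 75% initial accuracy who revisits the case over follow-up visits. Under these assumed costs and catch rates, the less "accurate" iterated process produces less expected harm.

```python
# Toy illustration with invented numbers (not data from the study): why raw
# single-shot accuracy can mislead when mistakes differ in cost and when
# clinicians iterate on a diagnosis while a benchmarked model answers once.
import random

random.seed(0)
HIGH_RISK_COST, LOW_RISK_COST = 10.0, 1.0   # arbitrary harm units
N_CASES, MAX_VISITS = 100_000, 3

def one_shot_machine(p_error=0.20, p_high_risk_given_error=0.5):
    """Single decision, never revisited (as in the benchmark)."""
    if random.random() < p_error:
        return HIGH_RISK_COST if random.random() < p_high_risk_given_error else LOW_RISK_COST
    return 0.0

def iterated_clinician(p_error=0.25, p_catch_per_visit=0.7):
    """Higher initial error rate, but each follow-up visit with new
    diagnostic information has a chance of catching and correcting it."""
    if random.random() >= p_error:
        return 0.0
    cost = LOW_RISK_COST              # early errors tend to be low-stakes (more tests, wait-and-see)
    for _ in range(MAX_VISITS):
        if random.random() < p_catch_per_visit:
            return cost               # error caught before it compounds
        cost += LOW_RISK_COST         # another visit's worth of delay and harm
    return cost + HIGH_RISK_COST      # never caught: the high-risk outcome lands

machine = sum(one_shot_machine() for _ in range(N_CASES)) / N_CASES
clinician = sum(iterated_clinician() for _ in range(N_CASES)) / N_CASES
print(f"expected harm per case  machine: {machine:.2f}  clinician: {clinician:.2f}")
# With these made-up parameters the 80%-accurate one-shot answer accrues more
# expected harm (~1.1) than the 75%-accurate iterated clinician (~0.4).
```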