July 29th, 2024

Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine

Recent research shows GPT-4V outperforming physicians in multiple-choice accuracy on medical imaging questions while producing flawed rationales, highlighting AI's limitations; its potential for decision support requires further evaluation before clinical use.

Recent research has highlighted the performance of Generative Pre-trained Transformer 4 with Vision (GPT-4V) in medical imaging tasks, indicating it can outperform human physicians in multiple-choice accuracy. In a study analyzing GPT-4V's capabilities using 207 questions from the New England Journal of Medicine (NEJM) Image Challenge, the model achieved an accuracy of 81.6%, compared to 77.8% for physicians. Notably, GPT-4V achieved over 78% accuracy on the questions that physicians answered incorrectly.

However, the study revealed significant flaws in GPT-4V's rationales, particularly in image comprehension, where 27.2% of cases contained errors despite correct final answers. The model's recall of medical knowledge was more reliable, with error rates between 11.6% and 13.0%. The findings suggest that while GPT-4V shows promise in decision support, its reasoning processes require further evaluation before clinical integration. The study also noted that human physicians outperformed GPT-4V in open-book settings, especially on difficult questions, underscoring the importance of comprehensive evaluations of AI models in medical contexts.

Limitations of the study included potential bias in the NEJM Image Challenge cases and the focus on single-answer questions, which may not reflect the complexities of real clinical scenarios. Future research aims to compare GPT-4V's rationales with those of physicians to better understand its decision-making processes. Overall, while GPT-4V demonstrates high accuracy, its flawed rationales highlight the need for caution before it is applied in clinical settings.
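
As a rough illustration of the arithmetic behind these headline figures, here is a minimal sketch, not the study's code, of how per-question grades could be tallied into the reported metrics; the field names and denominator choices are illustrative assumptions:

    from dataclasses import dataclass

    @dataclass
    class GradedCase:
        model_correct: bool               # final multiple-choice answer correct?
        physician_correct: bool           # physician answer correct?
        image_comprehension_error: bool   # flaw flagged in the "Image comprehension" section
        knowledge_recall_error: bool      # flaw flagged in the "Recall of medical knowledge" section

    def summarize(cases: list[GradedCase]) -> dict[str, float]:
        n = len(cases)
        correct = [c for c in cases if c.model_correct]
        return {
            "model_accuracy": len(correct) / n,   # e.g. 169/207 ≈ 81.6%
            "physician_accuracy": sum(c.physician_correct for c in cases) / n,
            # Rationale flaws can be counted even when the final answer was right,
            # which is how a correct answer can still hide a faulty rationale.
            "image_error_rate": sum(c.image_comprehension_error for c in correct) / len(correct),
            "knowledge_error_rate": sum(c.knowledge_recall_error for c in correct) / len(correct),
        }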

7 comments
By @sigmoid10 - 9 months
One med student and nine physicians aren't the greatest study dataset for comparing human performance, but the results are still pretty wild. GPT-4V can reliably outperform or match doctors on all but the hardest questions in a closed-book setting. If you take their results at face value and apply them to modern medical care environments, it means that you are probably better off with GPT than with a real doctor if you don't have something super peculiar that would take even true experts a lot of close attention and research to figure out. It will certainly beat an uninterested or overworked clinician.

I'm really looking forward to studies that look at performance comparisons in realistic environments. I believe there is a potential revolution brewing for PCPs. Having someone who actually listens and has real in-depth knowledge that goes beyond normal med students could be a game changer when compared to the usual conveyor belt care that gives everyone the same diagnosis based on a quick glance.

By @rvnx - 9 months
GPT4 is quite weak for medical diagnosis and radiology. Send a picture and a description of the symptoms and it will often give you bad results.

By bad, I mean that it cannot properly identify whether the bone is human or animal, or which bone it is, and describes something completely different.

The benchmarks are too optimistic in such cases.

By @simonw - 9 months
Here's the main prompt they used:

    {image}

    {question}

    {choices}

    Please first describe the image in a
    section named “Image comprehension”.

    Then, recall relevant medical knowledge
    that is useful for answering the question
    but is not explicitly mentioned in a
    section named “Recall of medical
    knowledge”.

    Finally, based on the first two sections,
    provide your step-by-step reasoning and
    answer the question in a section named
    “Step-by-step reasoning”.

    Please be concise
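
For anyone curious how a prompt like this might actually be sent to the model, here is a minimal sketch using the OpenAI Python SDK; the model name, image handling, and function below are illustrative assumptions rather than details from the paper:

    import base64
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def ask_image_challenge(image_path: str, question: str, choices: str) -> str:
        """Assemble the study-style prompt and send it alongside the image."""
        with open(image_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()

        prompt = (
            f"{question}\n\n{choices}\n\n"
            "Please first describe the image in a section named “Image comprehension”.\n\n"
            "Then, recall relevant medical knowledge that is useful for answering "
            "the question but is not explicitly mentioned in a section named "
            "“Recall of medical knowledge”.\n\n"
            "Finally, based on the first two sections, provide your step-by-step "
            "reasoning and answer the question in a section named “Step-by-step reasoning”.\n\n"
            "Please be concise"
        )

        response = client.chat.completions.create(
            model="gpt-4-vision-preview",  # assumed model name, not specified in the thread
            messages=[{
                "role": "user",
                "content": [
                    # The {image} placeholder from the template becomes an attached image.
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                    {"type": "text", "text": prompt},
                ],
            }],
        )
        return response.choices[0].message.content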

By @notRobot - 9 months
I would like to share my relevant experience here.

I am dealing with serious post-COVID complications (Long COVID). Doctors in most places don't seem to be taking this condition seriously and believe it to be a made-up thing, but COVID causing damage to bodies and brains is a very real thing.

Thus, for a lot of long COVID patients, treatment has to largely be figured out by oneself, often through reading upcoming studies and experimental treatment reports from online forums.

ChatGPT has been really helpful to me and other patients I know to help make sense of and get context around medication, supplements or other treatments we hear about. I'm not usually asking it what to do, but it really helps me understand what a given medication (say sulbutiamine or whatever) does and how it works and what potential interactions it has with other medicines etc.

The other day it helped me figure out which amino acid complex to take, I put in the nutritional information of different products and had a long conversation with it to figure out which one is a better fit for me, and then confirmed this with external web searches.

I've also used it to help figure out what bloodwork to get and what diets or exercise routines are feasible for someone in my position (with limited mobility and strength and other resources). I can confidently say that I would be in a much worse place with my health than where I currently am if I didn't have ChatGPT to consult with.

It gives me important context and information to help me make informed decisions about my health.

These are conversations I'd ideally have with a doctor but at least ChatGPT fucking listens and knows some stuff. I don't want to have to rely on a damn chatbot for my health, but high quality doctors aren't accessible to me due to inequalities of this world. So while y'all work on that, I'mma do what I can to survive.

By @mjburgess - 9 months
Reminder once again that "accuracy" is irrelevant in the real world. It's only a benchmark metric of an ML research community that is little concerned with any real-world impact. Comparing the "accuracy" of an LLM to 9 med students is just rank pseudoscience.

The practice of medicine does not produce cute little cue-card prompts with 4 options. Diagnosis means trading the risk of various interventions against their diagnostic value, including a wait-and-see strategy. It means consulting with the patient on impact, as well as conferring with other specialists.

It's of less-than-zero value to have 80% machine accuracy and 75% clinician accuracy if the machine's mistakes are high-risk, made without concern for the impact of the intervention on the patient, or provide little therapeutic upside. Likewise, on these decision benchmarks *machine decisions are NEVER iterated* -- this makes any claim to "performance" not even borderline pseudoscience but, imv, malpractice.
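
To make that concrete, here is a toy back-of-the-envelope with invented numbers (none of these figures come from the study): weighting errors by harm can invert the ranking that raw accuracy suggests.

    # Toy illustration with invented numbers: raw accuracy vs. harm-weighted errors.
    machine = {"error_rate": 0.20, "avg_harm_per_error": 5.0}    # 80% accurate, high-risk mistakes
    clinician = {"error_rate": 0.25, "avg_harm_per_error": 1.0}  # 75% accurate, iterated low-risk mistakes

    def expected_harm(agent: dict) -> float:
        return agent["error_rate"] * agent["avg_harm_per_error"]

    print(expected_harm(machine))    # 1.0
    print(expected_harm(clinician))  # 0.25 -- lower expected harm despite lower accuracy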

In the real world, practitioner decisions are iterated: doctors do not make a high-risk mistake on every sequential decision, given more diagnostic information. It is highly unlikely that an inaccurate, high-risk judgement call is compounded at each intervention. Machine decisions, by contrast, are never tested in this sequential manner -- if they were, their benchmark performance would drop off a cliff.

The production of non-iterated decision benchmarks which measure accuracy and not risk-adjusted real-world impact constitutes, imv, basic malpractice in the ML community. This should be urgently called out before non-experts think that LLMs can give credible medical advice.