July 3rd, 2024

Study reveals why AI models that analyze medical images can be biased

A study by MIT researchers uncovers bias in AI models that analyze medical images: the models can accurately predict patient race from X-rays, yet they show fairness gaps when diagnosing diverse groups. Efforts to debias the models vary in effectiveness.

Read original article

A study conducted by MIT researchers has shed light on the biases present in AI models used to analyze medical images, particularly in diagnosing patients from different demographic groups. The research revealed that these models can accurately predict a patient's race from chest X-rays, a feat even skilled radiologists struggle with. However, the study found that the models showing high accuracy in demographic predictions also exhibited significant "fairness gaps" in diagnosing images of individuals from diverse racial and gender backgrounds. The researchers discovered that these models might be relying on "demographic shortcuts," leading to inaccurate results for women, Black individuals, and other groups. While efforts to "debias" the models showed some success in improving fairness, the effectiveness varied when tested on different patient populations. The study emphasizes the importance of evaluating AI models on specific patient datasets to ensure accurate and unbiased results, especially when deploying them in healthcare settings.
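
The "fairness gap" here is essentially a difference in error rates across demographic groups. As a rough illustration of the kind of check the study recommends before deployment (not the authors' actual evaluation code; the column names and the 0.5 threshold are assumptions), one might compare per-group true-positive and false-positive rates on a local dataset:

```python
import pandas as pd

def fairness_gap(df, group_col="group", label_col="label", score_col="score", threshold=0.5):
    """Per-group TPR/FPR for a binary classifier, plus the largest TPR gap.

    Column names and the 0.5 threshold are illustrative assumptions.
    """
    rows = []
    for group, sub in df.groupby(group_col):
        pred = sub[score_col] >= threshold
        pos = sub[label_col] == 1
        neg = ~pos
        tpr = (pred & pos).sum() / max(pos.sum(), 1)
        fpr = (pred & neg).sum() / max(neg.sum(), 1)
        rows.append({"group": group, "TPR": tpr, "FPR": fpr, "n": len(sub)})
    per_group = pd.DataFrame(rows)
    return per_group, per_group["TPR"].max() - per_group["TPR"].min()

# Usage on your own patients' predictions (hypothetical DataFrame):
# per_group, tpr_gap = fairness_gap(local_predictions)
```

The point of such a check is that aggregate accuracy alone can hide large per-group differences.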

Related

ChatGPT is biased against resumes with credentials that imply a disability

Researchers at the University of Washington found that ChatGPT, when used to rank resumes, is biased against disability-related credentials. Customizing the tool reduced the bias, underscoring the importance of addressing biases in AI systems for fair outcomes.

The question of what's fair illuminates the question of what's hard

Computational complexity theorists repurpose fairness tools to analyze complex problems. Transitioning from multiaccuracy to multicalibration enhances understanding and simplifies approximating hard functions, benefiting algorithmic fairness and complexity theory.

AI can beat real university students in exams, study suggests

A study from the University of Reading reveals AI outperforms real students in exams. AI-generated answers scored higher, raising concerns about cheating. Researchers urge educators to address AI's impact on assessments.

Explainability is not a game

Explainability in machine learning is important for trustworthy AI decisions. SHAP scores face challenges in providing rigorous explanations, potentially leading to errors, so reliable explanations are emphasized in critical domains like medical diagnosis.

10 comments
By @amluto - 4 months
I’m suspicious that there’s another factor in play: whether images are correctly labeled in the training set. Even at high-end hospitals, my personal experience leads me to believe that radiologists regularly make both major types of error: calling out abnormalities in an image that are entirely irrelevant, and missing actual relevant problems that are subsequently easily seen by a doctor who is aware of what seems to be actually wrong. To top it off, surely many patients end up being diagnosed with whatever the radiologist saw, and no one ever confirms that the diagnosis was really correct. (And how could they? A lot of conditions are hard to diagnose!)

So the training data is probably hugely biased, and the models will learn to predict the training labels rather than any magically correct “ground truth”. Internally detecting demographics and producing a demographically biased output may well match the training data better than a perfect, unbiased output would.
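
A toy simulation of this effect (entirely synthetic, not from the paper): if one group's true positives are systematically under-labeled, a model that is given the group flag can fit the noisy training labels better than one that only sees the imaging signal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000

# True disease status, identical base rate in both groups.
disease = rng.binomial(1, 0.3, n)
group = rng.binomial(1, 0.5, n)           # synthetic demographic flag
signal = disease + rng.normal(0, 1.0, n)  # noisy imaging "signal"

# Simulated label bias: 40% of true cases in group 1 are missed,
# so the *training labels* are systematically skewed for that group.
missed = (group == 1) & (disease == 1) & (rng.random(n) < 0.4)
label = np.where(missed, 0, disease)

X_img = signal.reshape(-1, 1)
X_img_demo = np.column_stack([signal, group])

acc_img = LogisticRegression().fit(X_img, label).score(X_img, label)
acc_img_demo = LogisticRegression().fit(X_img_demo, label).score(X_img_demo, label)

print(f"label accuracy, image only:    {acc_img:.3f}")
# Typically slightly higher: the group flag lets the model match the skewed labels.
print(f"label accuracy, image + group: {acc_img_demo:.3f}")
```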

By @teruakohatu - 4 months
I have not been able to fully digest this paper yet, and medical data is not my speciality. It is interesting but not surprising that models appear able to determine the demographics of a patient when even radiologists are unable to. It is also not surprising that models use this to "cheat" (find demographic shortcuts in disease classification).

My understanding is that doctors may unconsciously do this as well, ignoring a possible diagnosis because they don't expect a patient of a certain demographic to have a particular issue.

I would expect radiologists who practice in very different demographic environments would not do as well when evaluating images from another environment.

At the end of the day radiology is more an art than a science, so the training data may well be faulty. Krupinski (2010) wrote in an interesting paper [1]:

"Medical images need to be interpreted because they are not self-explanatory... In radiology alone, estimates suggest that, in some areas, there may be up to a 30% miss rate and an equally high false positive rate ... interpretation errors can be caused by a host of psychophysical processes ... radiologists are less accurate after a day of reading diagnostic images and that their ability to focus on the display screen is reduced because of myopia. "

I would hope the datasets included a substantial number of images that were originally mis-classified by a human.

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3881280/

By @ogarten - 4 months
How does this surprise anyone?

Medical data for AI training is almost always sourced in some more or less shady country because it lacks any privacy regulations. It's then annotated by a horde of cheap workers who may or may not have advanced medical training.

Even "normal medicine" is extremely biased towards male people fitting inside the norm which is why a lot of things are not detected early enough in women or in people who do not match that norm.

Next thing: doctors often think that their annotations are the absolute gold standard, but they don't necessarily know everything that is in an X-ray or an MRI.

A few years ago we tried to build synthetic data for this exact purpose by simulating medical images for 3D body models with different diseases and nobody we talked to cared about it, because "we have good data".

By @carbocation - 4 months
The authors refer to a literature describing "shortcuts" as "correlations that are present in the data but have no real clinical basis, for instance deep models using the hospital as a shortcut for disease prediction". It feels like a parallel language is developing. Most of us would call such a phenomenon "overfitting" or describe some specific issue with generalization. That example is not a shortcut in any normal sense of the word unless you are providing the hospital via some extra path.

They call demographics like age and sex "shortcuts" but I find this to be a frustrating term since it seems to obscure what's happening under the hood. (They cite many papers using the same word, so I'm not blaming them for this usage.) Men are typically larger; old bones do not look like young bones. There is plenty of biology involved in what they refer to as demographic shortcuts.

I think you could take the same results and say "Models are able to distinguish men from women. For our purposes, it's important that they cannot do this. Therefore, we did XYZ on these weakly labeled public databases." But perhaps that sounds less exciting.

By @zaptrem - 4 months
```
The researchers also found that they could retrain the models in a way that improves their fairness. However, their approach to "debiasing" worked best when the models were tested on the same types of patients on whom they were trained, such as patients from the same hospital. When these models were applied to patients from different hospitals, the fairness gaps reappeared.

"I think the main takeaways are, first, you should thoroughly evaluate any external models on your own data because any fairness guarantees that model developers provide on their training data may not transfer to your population. Second, whenever sufficient data is available, you should train models on your own data," says Haoran Zhang, an MIT graduate student and one of the lead authors of the new paper.
```

This is just overfitting. Why are they training whole models on only one hospital's worth of data when they appear to have access to five? They should be training on all the data in the world they can get their hands on, then maybe fine-tuning on their specific hospital (maybe it has higher-quality outcomes data that verifies the readings) if there are still accuracy issues. The last five years have taught us that gobbling up everything (even if it's not the best quality) is the way.
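
For what it's worth, a minimal sketch of that workflow (toy tensors and a stand-in model; nothing here reflects the paper's actual setup): pretrain on pooled data from every available source, then briefly fine-tune on the deployment hospital's own data.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins: in reality these would be pooled multi-hospital X-ray
# datasets and one hospital's local dataset (all names are hypothetical).
pooled_data = TensorDataset(torch.randn(512, 1, 64, 64),
                            torch.randint(0, 2, (512, 1)).float())
local_data = TensorDataset(torch.randn(128, 1, 64, 64),
                           torch.randint(0, 2, (128, 1)).float())

model = nn.Sequential(  # minimal classifier standing in for a real backbone
    nn.Conv2d(1, 8, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1),
)

def train(model, dataset, epochs, lr):
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    opt = optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            opt.zero_grad()
            loss_fn(model(images), labels).backward()
            opt.step()
    return model

# 1. Pretrain on everything available (all hospitals pooled).
model = train(model, pooled_data, epochs=5, lr=1e-4)
# 2. Fine-tune briefly on the local hospital's data, at a lower learning
#    rate so the pooled knowledge isn't overwritten.
model = train(model, local_data, epochs=2, lr=1e-5)
```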

By @kosh2 - 4 months
I agree with many posters in here that the cause will likely be bad data one way or another. Maybe we need to take a step back and only use data that is almost 100% accurate.

Like the time of death after the data was collected.

If a model could predict with high accuracy that a patient will die within X days (without proper treatment), that would already be very valuable.

Second, as Sora has shown, going multimodal can have amazing benefits.

Get a breath analysis of the patient, get a video, get a sound recording, get an MRI, get a CT, get a full blood sample and then let the model do its pattern finding magic.

By @throwaway22032 - 4 months
The language used in the article seems political/inflammatory to me.

It's not a lack of "fairness"; it's just a lack of accuracy.

Imagine that you train a model to find roofing issues or subsidence or something from aerial imagery. Maybe it performs better on Victorian terraces, because there are lots of those in the UK.

Would you call it unfair because it doesn't do so well on thatched roof properties? No, it's just inaccurate, calling it unfair is a value judgement.

Bias is a better term because it at least has a statistical basis, but fairness is, well... inaccurate...

By @wiradikusuma - 4 months
Instead of trying to make the model "fair", can we do "model = models.getByRace(x)" so we have an optimized model for each group, instead of a "jack of all trades"?
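
That idea amounts to a per-group model registry with a pooled fallback; a minimal sketch (every name here is hypothetical), setting aside that each per-group model then sees far less training data:

```python
from typing import Callable, Dict

Predictor = Callable[[object], float]  # e.g. image -> risk score

class GroupedModels:
    """Dispatch to a model trained (or calibrated) per demographic group."""

    def __init__(self, default: Predictor, by_group: Dict[str, Predictor]):
        self.default = default      # pooled model used as the fallback
        self.by_group = by_group    # group label -> specialised model

    def get(self, group: str) -> Predictor:
        # Fall back to the pooled model for groups without their own model.
        return self.by_group.get(group, self.default)

# Hypothetical usage:
# models = GroupedModels(default=pooled_model, by_group={"A": model_a, "B": model_b})
# prediction = models.get(patient.group)(patient.image)
```
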
By @adammarples - 4 months
Is it just me, or did the article at no point explain why medical models produce biased results? It seems to take this as a given, and an uninteresting one at that, and focuses on trying to correct it without explaining why it happens in the first place. Yes, those models could use race as a shortcut to, presumably, not diagnose cancer in Black people, for example, but why does the race shortcut boost model training accuracy? I am still none the wiser.