Study reveals why AI models that analyze medical images can be biased
A study by MIT researchers uncovers biases in AI models analyzing medical images, accurately predicting patient race from X-rays but showing fairness gaps in diagnosing diverse groups. Efforts to debias models vary in effectiveness.
A study conducted by MIT researchers has shed light on the biases present in AI models used to analyze medical images, particularly in diagnosing patients from different demographic groups. The research revealed that these models can accurately predict a patient's race from chest X-rays, a feat even skilled radiologists struggle with. However, the study found that the models showing high accuracy in demographic predictions also exhibited significant "fairness gaps" in diagnosing images of individuals from diverse racial and gender backgrounds. The researchers discovered that these models might be relying on "demographic shortcuts," leading to inaccurate results for women, Black individuals, and other groups. While efforts to "debias" the models showed some success in improving fairness, the effectiveness varied when tested on different patient populations. The study emphasizes the importance of evaluating AI models on specific patient datasets to ensure accurate and unbiased results, especially when deploying them in healthcare settings.
Related
ChatGPT is biased against resumes with credentials that imply a disability
Researchers at the University of Washington found bias in ChatGPT, an AI tool for resume ranking, against disability-related credentials. Customizing the tool reduced bias, emphasizing the importance of addressing biases in AI systems for fair outcomes.
The question of what's fair illuminates the question of what's hard
Computational complexity theorists repurpose fairness tools to analyze complex problems. Transitioning from multiaccuracy to multicalibration enhances understanding and simplifies approximating hard functions, benefiting algorithmic fairness and complexity theory.
AI can beat real university students in exams, study suggests
A study from the University of Reading reveals AI outperforms real students in exams. AI-generated answers scored higher, raising concerns about cheating. Researchers urge educators to address AI's impact on assessments.
Explainability is not a game
Importance of explainability in machine learning for trustworthy AI decisions. Challenges with SHAP scores in providing rigorous explanations, potentially leading to errors. Emphasis on reliable explanations in critical domains like medical diagnosis.
So the training data is probably hugely biased, and the models will learn to predict the training labels as opposed to any magically correct “ground truth”. And internally detecting demographics and producing an output biased by demographics may well result in a better match to the training data than a perfect, unbiased output would be.
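To make that concrete, here is a minimal synthetic sketch (the 30% under-reporting rate, the single "signal" feature, and the group split are made-up assumptions for illustration, not figures from the study): when positives in one group are systematically under-labeled, a classifier that can see the demographic attribute fits the biased labels more closely, and that advantage disappears when it is scored against the unbiased ground truth.

```
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
n = 50_000
group = rng.integers(0, 2, n)          # demographic attribute (0 or 1)
signal = rng.normal(size=n)            # the genuine disease signal in the image
truth = (signal > 0.5).astype(int)     # unbiased "ground truth"

# Biased labels: positive cases in group 1 go unlabeled 30% of the time.
label = truth.copy()
label[(group == 1) & (truth == 1) & (rng.random(n) < 0.3)] = 0

feature_sets = {
    "signal only": signal.reshape(-1, 1),
    "signal + demographic": np.column_stack([signal, group]),
}
for name, X in feature_sets.items():
    proba = LogisticRegression().fit(X, label).predict_proba(X)[:, 1]
    print(f"{name:22s} loss vs biased labels: {log_loss(label, proba):.3f}   "
          f"loss vs ground truth: {log_loss(truth, proba):.3f}")
```

In this toy setup the demographic feature lets the model reproduce the under-reporting for group 1, which is exactly the kind of shortcut behaviour described above.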
My understanding is that doctors may unconsciously do this as well, ignoring a possible diagnosis because they don't expect a patient of a certain demographic to have a particular issue.
I would expect radiologists who practice in very different demographic environments would not do as well when evaluating images from another environment.
At the end of the day radiology is more an art than a science, so the training data may well be faulty. Krupinski (2010) wrote in an interesting paper [1]:
"Medical images need to be interpreted because they are not self-explanatory... In radiology alone, estimates suggest that, in some areas, there may be up to a 30% miss rate and an equally high false positive rate ... interpretation errors can be caused by a host of psychophysical processes ... radiologists are less accurate after a day of reading diagnostic images and that their ability to focus on the display screen is reduced because of myopia. "
I would hope the datasets included a substantial number of images that were originally misclassified by a human.
Medical data for AI training is almost always sourced in some more or less shady country because they lack any privacy regulations. It's then annotated by a horde of cheap workers who may or may not have advanced medical training.
Even "normal medicine" is extremely biased towards male people fitting inside the norm which is why a lot of things are not detected early enough in women or in people who do not match that norm.
Next thing: doctors often think that their annotations are the absolute gold standard, but they don't necessarily know everything that is in an X-ray or an MRI.
A few years ago we tried to build synthetic data for this exact purpose by simulating medical images for 3D body models with different diseases and nobody we talked to cared about it, because "we have good data".
They call demographics like age and sex "shortcuts" but I find this to be a frustrating term since it seems to obscure what's happening under the hood. (They cite many papers using the same word, so I'm not blaming them for this usage.) Men are typically larger; old bones do not look like young bones. There is plenty of biology involved in what they refer to as demographic shortcuts.
I think you could take the same results and say "Models are able to distinguish men from women. For our purposes, it's important that they cannot do this. Therefore, we did XYZ on these weakly labeled public databases." But perhaps that sounds less exciting.
"I think the main takeaways are, first, you should thoroughly evaluate any external models on your own data because any fairness guarantees that model developers provide on their training data may not transfer to your population. Second, whenever sufficient data is available, you should train models on your own data," says Haoran Zhang, an MIT graduate student and one of the lead authors of the new paper. ```
This is just overfitting. Why are they training whole models on only one hospital's worth of data when they appear to have access to five? They should be training on all of the data in the world they can get their hands on, then maybe fine-tuning on their specific hospital (maybe they have higher-quality outcomes data that verifies the readings) if there are still accuracy issues. The last five years have taught us that gobbling up everything (even if it's not the best quality) is the way.
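For what it's worth, a rough sketch of that pool-then-fine-tune suggestion (this reflects the comment, not the paper's method; the resnet18 backbone, random tensors, and hyperparameters are stand-ins): keep a backbone trained on pooled data frozen and update only the classification head on the local hospital's labeled images.

```
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet18

# Stand-in for a model already trained on pooled, multi-site data.
model = resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 2)    # new head for the local task

# Freeze the backbone; only the new head is updated on local data.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("fc.")

# Placeholder for the hospital's own labeled images (3x224x224) and labels.
local = TensorDataset(torch.randn(32, 3, 224, 224), torch.randint(0, 2, (32,)))
loader = DataLoader(local, batch_size=8, shuffle=True)

opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
model.train()
for _ in range(2):                               # a couple of local epochs
    for x, y in loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
```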
Like the time of death after the data was collected.
If a model could predict with high accuracy that a patient will die within X days (without proper treatment), it would already be very valuable.
Second, as Sora has shown, going multimodal can have amazing benefits.
Get a breath analysis of the patient, get a video, get a sound recording, get an MRI, get a CT, get a full blood sample and then let the model do its pattern finding magic.
It's not a lack of "fairness", it's just a lack of accuracy.
Imagine that you train a model to find roofing issues or subsidence or something from aerial imagery. Maybe it performs better on Victorian terraces, because there are lots of those in the UK.
Would you call it unfair because it doesn't do so well on thatched roof properties? No, it's just inaccurate, calling it unfair is a value judgement.
Bias is better because it at least has a statistical basis, but fairness is, well... inaccurate...