Vision language models are blind
Vision language models like GPT-4o and Gemini-1.5 Pro struggle with basic visual tasks such as identifying overlapping shapes and counting intersections. Despite excelling in image-text processing, they exhibit significant shortcomings in visual understanding.
Vision language models (VLMs) like GPT-4o and Gemini-1.5 Pro excel in image-text processing but struggle with basic visual tasks. They perform poorly on tasks like identifying overlapping shapes, counting intersections, and recognizing circled letters. For instance, when asked to count line intersections or determine if two circles overlap, VLMs exhibit subpar accuracy, akin to a person with myopia or even blindness making educated guesses. The study evaluates VLMs' performance on tasks involving line intersections, circle interactions, circled letters, and counting overlapping shapes like in an Olympic logo. Results show significant shortcomings in VLMs' visual understanding abilities, highlighting the gap between their performance and human intuition on basic visual tasks. Despite their success in language processing, VLMs' vision capabilities remain limited, raising questions about the true extent of their understanding of visual information.
Related
Testing Generative AI for Circuit Board Design
A study tested Large Language Models (LLMs) like GPT-4o, Claude 3 Opus, and Gemini 1.5 for circuit board design tasks. Results showed varied performance, with Claude 3 Opus excelling in specific questions, while others struggled with complexity. Gemini 1.5 showed promise in parsing datasheet information accurately. The study emphasized the potential and limitations of using AI models in circuit board design.
Detecting hallucinations in large language models using semantic entropy
Researchers devised a method to detect hallucinations in large language models like ChatGPT and Gemini by measuring semantic entropy. This approach enhances accuracy by filtering unreliable answers, improving model performance significantly.
Large Language Models are not a search engine
Large Language Models (LLMs) from Google and Meta generate algorithmic content, causing nonsensical "hallucinations." Companies struggle to manage errors post-generation due to factors like training data and temperature settings. LLMs aim to improve user interactions but raise skepticism about delivering factual information.
Gemini's data-analyzing abilities aren't as good as Google claims
Google's Gemini 1.5 Pro and 1.5 Flash AI models face scrutiny for poor data analysis performance, struggling with large datasets and complex tasks. Research questions Google's marketing claims, highlighting the need for improved model evaluation.
Large language models have developed a higher-order theory of mind
Large language models like GPT-4 and Flan-PaLM perform comparably to adults on theory of mind tasks. Study shows GPT-4 excels in 6th order inferences. Model size and fine-tuning influence ToM abilities in LLMs, impacting user-facing applications.
> their vision is, at best, like that of a person with myopia seeing fine details as blurry
is a crazy thing to write in an abstract. Did they try to probe that hypothesis at all? I could (well actually I can't) share some examples from my job of GPT-4v doing some pretty difficult fine-grained visual tasks that invalidate this.
Personally, I rate this paper [1], which makes the argument that these huge GenAI models are pretty good at things - assuming the model has seen a LOT of that type of data during training (which is true of a great many things). If you make up tasks like this, then yes, they can be REALLY bad at them, and initial impressions of AGI get harder to justify. But in practice, we aren't just making up tasks to trip up these models. They can be very performant on some tasks, and the authors have not presented any real evidence about these two modes.
What was really remarkable though was that it failed to notice that one of the two nuts was obviously missing, even after I told it there was a second problem with the installation.
Screenshot: https://imgur.com/a/QqCNzOM
There are ways to help the VLM out - Set of Marks [0] from Microsoft being the most prominent, which uses segmentation to outline and label sections of the image before sending to the VLM.
Providing "speakable" labels to regions helps ground the visual abilities of VLMs and is why in this paper the performance is so much better when words are present in the grid for "Task 6: Counting the rows and columns of a grid"
The tasks at which they fail are ridiculously simple for human beings, including, for example:
* counting the number of times two lines intersect;
* detecting whether two circles overlap;
* selecting which letter is being circled in a word;
* counting the number of circles in an Olympic-like logo.
This should be at the top of the front page.
Take the first example of the four 2D line plots, for instance. I asked Sonnet 3.5 and retried 5 times; 2 out of 5 times it said something like this:
Me:
here's an image with four 2-D line plots. Can you tell me A) how many lines are in each plot? and B) how many times do they intersect each other in each plot?
Write your answers as: Plot 1 (leftmost): ...
Sonnet:
I'll analyze each plot from left to right and provide the requested information:
Plot 1 (leftmost): A) 2 lines (one red, one blue) B) 0 intersections (the lines do not cross)
Plot 2: A) 2 lines (one red, one blue) B) 1 intersection (the lines cross once near the right edge)
Plot 3: A) 2 lines (one red, one blue) B) 1 intersection (the lines cross once near the center)
Plot 4 (rightmost): A) 2 lines (one red, one blue) B) 2 intersections (the lines cross twice, once near the left side and once near the right side)
Each plot consistently shows two lines, one red and one blue, with varying intersection patterns across the four plots.
---
Which isn't completely correct (the 3rd plot has 2 intersections), but still pretty good; the other 3 times, it said plot 3 has 1 intersection.
I'm curious if there'd be a way to emulate this (have the visual tokens be low fidelity at first, but allow the VLM to emit tokens that correspond to "focusing" on a region of the image with greater resolution). I'm not sure if/how it's possible to performantly train a model with "interactive" data like that, though
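One crude way to emulate this today, without new training, is an ordinary tool loop: send a downscaled image first, let the model request a region, then resend that region at full resolution. A rough sketch is below; ask_vlm is a hypothetical wrapper around whatever chat-with-image API you use, and the FOCUS reply convention is invented for illustration:

```python
# Rough sketch of emulating "foveation" with a two-pass tool loop:
# low-resolution image first, then a full-resolution crop on request.
from PIL import Image

def downscale(img: Image.Image, max_side: int = 512) -> Image.Image:
    scale = max_side / max(img.size)
    return img.resize((int(img.width * scale), int(img.height * scale)))

def focus_loop(image_path: str, question: str, ask_vlm) -> str:
    full = Image.open(image_path)
    answer = ask_vlm(downscale(full), question +
                     "\nIf the image is too blurry to be sure, reply with "
                     "FOCUS x0,y0,x1,y1 (fractions of width/height) instead.")
    if answer.startswith("FOCUS"):
        x0, y0, x1, y1 = (float(v) for v in answer.split()[1].split(","))
        crop = full.crop((int(x0 * full.width), int(y0 * full.height),
                          int(x1 * full.width), int(y1 * full.height)))
        answer = ask_vlm(crop, question)  # second pass on the full-res crop
    return answer
```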
On the one hand, it's incredible what these LLMs are capable of. On the other hand, they often fall flat on their face with seemingly simple problems like this. We are seeing the same from self driving cars, getting into accidents in scenarios that almost any human driver could have easily avoided.
In general, basic geometry seems under-explored by learning.
I have the feeling that they first chose the title of the paper and then ran the evaluation on the new Claude 3.5 Sonnet on these abstract images.
>their vision is, at best, like that of a person with myopia seeing fine details as blurry
This also makes no sense, since the images evaluate the abstract capabilities of the models, not their eyesight.
Intersecting Lines https://replicate.com/p/s24aeawxasrgj0cgkzabtj53rc
Overlapping Circles https://replicate.com/p/0w026pgbgxrgg0cgkzcv11k384
Touching Circles https://replicate.com/p/105se4p2mnrgm0cgkzcvm83tdc
Circled Text https://replicate.com/p/3kdrb26nwdrgj0cgkzerez14wc
Nested Squares https://replicate.com/p/1ycah63hr1rgg0cgkzf99srpxm
They're clearly a very exciting category of technology, and a pretty recent one - they only got good last October with GPT-4 Vision, but since then we've had more vision models from Anthropic and Google Gemini.
There's so much more information out there about text prompting compared to image prompting. I feel starved for useful information about their capabilities: what are vision models good and bad at, and what are the best ways to put them to work?
An embedding isn't conceptually that much different from feeding a 1024-word description of an image instead of the actual image.
At the moment compute power isn't good enough to feed high-res pixel data into these models, unless we discover a vastly different architecture, which I am also convinced likely exists.
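Some rough arithmetic makes the point, assuming a CLIP-style ViT-L/14 front end at 336 px (an assumption; the closed models tested here may use something different):

```python
# Back-of-envelope arithmetic for how much detail survives a typical
# ViT-style image front end (assumed CLIP-like ViT-L/14 at 336 px).
source_px = 1024          # side of the original image
input_px  = 336           # side after resizing for the vision encoder
patch_px  = 14            # ViT patch size
tokens = (input_px // patch_px) ** 2            # 24 * 24 = 576 image tokens
orig_per_patch = source_px / (input_px / patch_px)  # ~43 original pixels per patch
print(f"{tokens} image tokens; each summarizes a "
      f"{orig_per_patch:.0f}x{orig_per_patch:.0f} block of the original image")
# A 2-3 px gap between two circles is far below that block size, so whether
# they "touch" or "nearly touch" may simply not be represented.
```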
When asked for specific days of interest in the school year, AIs always struggle. They get some days right but forget others or fabricate new ones. They fare a bit better if you remove some of the noise and give them only a picture of a single month, but even then it's unreliable.
Does anyone have any technical insight or intuition as to why this large variation exists?
- The language is unnecessarily scathing
- They repeatedly show data where the models are getting things right 70, 80, 90% of the time, and then show a list of what they call "qualitative samples" (what does "qualitative" mean? "cherry-picked"?) which look very bad. But it got the answer right 70/80/90% of the time! That's hardly "blind"...
- Various of the tasks hinge on the distinction between two objects "exactly touching" vs. "very nearly touching" vs. "very slightly overlapping", a problem which (i) is hard for humans and (ii) is particularly (presumably deliberately) sensitive to resolution/precision, where we should not be surprised that models fail
- The main fish-shaped example given in task 1 seems genuinely ambiguous to me - do the lines "intersect" once or twice? The tail of the fish clearly has a crossing, but the nose of the fish seems a bit fishy to me... is that really an intersection?
- AFAIC deranged skepticism is just as bad as deranged hype, the framing here is at risk of appealing to the former
It's absolutely fair to make the point that these models are not perfect, fail a bunch of the time, and to point out the edge cases where they suck. That moves the field forwards. But the hyperbole (as pointed out by another commenter) is very annoying.
To me, this is only interesting if compared with results of image recognition models that can already answer these types of questions (if they even exist, I haven't looked).
Maybe the service is smart enough to look at the question, and then choose one or more models to process the image, but not sure as I can't find anything on their sites about how it works.
All the LLMs and multi-modal models I've seen lack concrete reasoning. For instance, ask ChatGPT to perform two tasks: summarize a chunk of text and count how many words are in it. ChatGPT will do a very good job summarizing the text and an awful job counting the words. ChatGPT and all the transformer-based models I've seen fail at similar concrete/mathematical reasoning tasks. This is the core problem of creating AGI, and it generally seems like no one has made any progress towards synthesizing something with both a high and low level of intelligence.
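A small illustration of why word counting in particular is awkward: the model never sees words, only subword tokens, so the quantity the user wants is not directly available to it. This uses the tiktoken package with a GPT-4-era encoding as an example; other models tokenize differently:

```python
# Why "count the words" is harder for an LLM than it looks: the model sees
# subword tokens, not words, and the two counts diverge.
# Requires the tiktoken package (pip install tiktoken).
import tiktoken

text = ("Vision language models excel at image-text benchmarks "
        "but stumble on surprisingly simple visual questions.")

word_count = len(text.split())                     # what the user means
enc = tiktoken.get_encoding("cl100k_base")         # GPT-4-era tokenizer
token_count = len(enc.encode(text))                # what the model sees

print(f"words: {word_count}, tokens: {token_count}")
# The model has to infer word boundaries from token statistics rather than
# iterating over a list, which is where the errors creep in.
```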
My (unproven and probably incorrect) theory is that under the hood these networks lack information processing loops which make recursive tasks, like solving a math problem, very difficult.
[1] https://www.ccn.com/news/technology/openais-gpt-5-phd-level-...
The scoring criteria would rank this answer as 'T', which is wrong.
FWIW I tried the line intersection and the circled letter test from the article with CogVLM (which is far from reaching the current SotA) and it correctly passed both. I haven't tried it with Sonnet/4o but I suspect there might be something wrong with how the author did their tests. Don't get me wrong, but too many "the model can't do that" claims ended up with demonstrations of the model doing exactly that...
Alexa exemplifies that tragic irony for me.
Siri? Forget it.
That's in the abstract, but it's bad not to be specific, in this case because current public-facing models are WIWEB: the worst it will ever be.
And there are trillion-dollar prizes at stake, so, improvement is happening as quickly as it possibly can.
I'm not certain about the specific models tested, but some VLMs just embed the image modality into a single vector, making these tasks literally impossible to solve.
In the "Two circles" test, do they really have 50% chance of overlapping? I think this comes from "Distances between circle perimeters: -0.15 to 0.5 times the diameter" but doesn't say the distribution they use.
Either way, I would guess that we’ll need new model architectures for multimodal to get really good at some of this, and even then some of these tasks are adjacent to things that we know LLMs are already bad at (numeric logic, for instance).
As context lengths get longer, devoting more tokens to the image tokenization should help a bit here as well. Anyway, I’d anticipate next year we’d see 80s and 90s for most of these scores with next gen models.
Vision links thoughts with reality
So what I’m left with to judge instead is anonymous online commenters vs. the publication of 2 prestigious universities. Whose word do I take on this? Decisions, decisions.
You can swap LM out with Web3, NFT, or Crypto in this case.