July 10th, 2024

Vision language models are blind

Vision language models like GPT-4o and Gemini-1.5 Pro struggle with basic visual tasks such as identifying overlapping shapes and counting intersections. Despite excelling in image-text processing, they exhibit significant shortcomings in visual understanding.

Vision language models (VLMs) like GPT-4o and Gemini-1.5 Pro excel in image-text processing but struggle with basic visual tasks. They perform poorly on tasks like identifying overlapping shapes, counting intersections, and recognizing circled letters. For instance, when asked to count line intersections or determine if two circles overlap, VLMs exhibit subpar accuracy, akin to a person with myopia or even blindness making educated guesses. The study evaluates VLMs' performance on tasks involving line intersections, circle interactions, circled letters, and counting overlapping shapes like in an Olympic logo. Results show significant shortcomings in VLMs' visual understanding abilities, highlighting the gap between their performance and human intuition on basic visual tasks. Despite their success in language processing, VLMs' vision capabilities remain limited, raising questions about the true extent of their understanding of visual information.

Related

Testing Generative AI for Circuit Board Design

A study tested Large Language Models (LLMs) like GPT-4o, Claude 3 Opus, and Gemini 1.5 for circuit board design tasks. Results showed varied performance, with Claude 3 Opus excelling in specific questions, while others struggled with complexity. Gemini 1.5 showed promise in parsing datasheet information accurately. The study emphasized the potential and limitations of using AI models in circuit board design.

Detecting hallucinations in large language models using semantic entropy

Researchers devised a method to detect hallucinations in large language models like ChatGPT and Gemini by measuring semantic entropy. This approach enhances accuracy by filtering unreliable answers, improving model performance significantly.

Large Language Models are not a search engine

Large Language Models (LLMs) from Google and Meta generate algorithmic content, causing nonsensical "hallucinations." Companies struggle to manage errors post-generation due to factors like training data and temperature settings. LLMs aim to improve user interactions but raise skepticism about delivering factual information.

Gemini's data-analyzing abilities aren't as good as Google claims

Google's Gemini 1.5 Pro and 1.5 Flash AI models face scrutiny for poor data analysis performance, struggling with large datasets and complex tasks. Research questions Google's marketing claims, highlighting the need for improved model evaluation.

Large language models have developed a higher-order theory of mind

Large language models like GPT-4 and Flan-PaLM perform comparably to adults on theory of mind tasks. Study shows GPT-4 excels in 6th order inferences. Model size and fine-tuning influence ToM abilities in LLMs, impacting user-facing applications.

48 comments
By @sweezyjeezy - 7 months
Entertaining, but I think the conclusion is way off.

> their vision is, at best, like that of a person with myopia seeing fine details as blurry

is a crazy thing to write in an abstract. Did they try to probe that hypothesis at all? I could (well actually I can't) share some examples from my job of GPT-4v doing some pretty difficult fine-grained visual tasks that invalidate this.

Personally, I rate this paper [1], which makes the argument that these huge GenAI models are pretty good at things - assuming they have seen a LOT of that type of data during training (which is true of a great many things). If you make up tasks like this, then yes, they can be REALLY bad at them, and initial impressions of AGI get harder to justify. But in practice, we aren't just making up tasks to trip up these models. They can be very performant on some tasks, and the authors have not presented any real evidence distinguishing between these two modes.

[1] https://arxiv.org/abs/2404.04125

By @jetrink - 7 months
I had a remarkable experience with GPT-4o yesterday. Our garage door started to fall down recently, so I inspected it and found that our landlord had installed the wire rope clips incorrectly, leading to the torsion cables losing tension. I didn't know what that piece of hardware was called, so I asked ChatGPT and it identified the part as I expected it to. As a test, I asked if there was anything notable about the photo. ChatGPT correctly identified that the cables were installed backwards, with the side of the cable that was (previously) under tension on top of the slack end, instead of sandwiched securely in the middle. To diagnose that requires tracing the cable through space and inferring which end is under tension from the geometry, though I can't rule out an educated guess.

What was really remarkable though was that it failed to notice that one of the two nuts was obviously missing, even after I told it there was a second problem with the installation.

Screenshot: https://imgur.com/a/QqCNzOM

By @JeremyHerrman - 7 months
VLMs so far have never been good at counting objects or spatial relationships (e.g. the coffee is to the right of the microwave).

There are ways to help the VLM out - Set of Marks [0] from Microsoft being the most prominent, which uses segmentation to outline and label sections of the image before sending to the VLM.

Providing "speakable" labels to regions helps ground the visual abilities of VLMs, which is why the performance in this paper is so much better when words are present in the grid for "Task 6: Counting the rows and columns of a grid".

[0] https://github.com/microsoft/SoM
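
For intuition, here is a minimal sketch of the general "mark the image before prompting" idea - not the actual SoM code; `segment_regions` is a hypothetical stand-in for whatever segmentation model (e.g. SAM) you plug in:

```python
from PIL import Image, ImageDraw

def segment_regions(image: Image.Image) -> list[dict]:
    """Hypothetical stand-in for a real segmentation model; returns dicts with a 'bbox' key."""
    raise NotImplementedError  # plug in your segmenter here

def add_marks(image_path: str) -> Image.Image:
    """Overlay numbered labels on each segmented region so the VLM can refer to regions by name."""
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for idx, region in enumerate(segment_regions(image), start=1):
        x0, y0, x1, y1 = region["bbox"]
        draw.rectangle((x0, y0, x1, y1), outline="red", width=2)
        draw.text((x0 + 2, y0 + 2), str(idx), fill="red")
    return image

# The marked image is then sent to the VLM with a prompt that references the labels,
# e.g. "Which numbered region contains the coffee, and is it to the right of region 3?"
```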

By @joelburget - 7 months
Vision Transformers do a shocking amount of compression in the tokenizer. In the [Chameleon paper](https://arxiv.org/pdf/2405.09818) they say the tokenizer "encodes a 512 × 512 image into 1024 discrete tokens from a codebook of size 8192". That's 256 pixels per token (512 * 512 / 1024). If we assume that a pixel is 24 bits (3x 8-bit channels), this implies that they've compressed 256 * 24 = 6144 bits into 13 bits (log2(8192)) per token. [An Image is Worth 32 Tokens for Reconstruction and Generation](https://yucornetto.github.io/projects/titok.html) pushes this even further. If these models work similarly, it's no wonder they struggle with some vision tasks.
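
To make that arithmetic concrete, a quick back-of-the-envelope script using only the numbers quoted above:

```python
import math

# Numbers quoted from the Chameleon paper: a 512x512 image becomes
# 1024 discrete tokens drawn from a codebook of 8192 entries.
pixels = 512 * 512
tokens = 1024
codebook = 8192

pixels_per_token = pixels // tokens        # 256 pixels per token
bits_in = pixels_per_token * 24            # 24 bits per RGB pixel -> 6144 bits
bits_out = math.log2(codebook)             # 13 bits per token
print(pixels_per_token, bits_in, bits_out, bits_in / bits_out)  # 256 6144 13.0 ~473x
```
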
By @cs702 - 7 months
Wow, that is embarrassingly bad performance for current SOTA models (GPT-4o, Gemini-1.5 Pro, Sonnet-3, Sonnet-3.5), which are advertised and sold as being able to understand images, e.g., for guiding the blind or tutoring children in geometry!

The tasks at which they fail are ridiculously simple for human beings, including, for example:

* counting the number of times two lines intersect;

* detecting whether two circles overlap;

* selecting which letter is being circled in a word;

* counting the number of circles in an Olympic-like logo.

This should be at the top of the front page.

By @diwank - 7 months
I think that while the findings are interesting, the title "Vision language models are blind" is hyperbolic and a misleading description of the findings. It is true that VLMs have a very different way of sensing and working with image inputs: at low resolutions each image is split up into blocks and then mapped into discrete tokens. They really don't have access to fine-grained details because this mapping is quite lossy. So, in that sense, these results completely make sense and were not surprising to me, but "blind" is a strong implication and the work isn't really able to back that up.

Take the first example of the four 2D line plots for instance. I asked Sonnet 3.5 for an answer and retried it 5 times and 2 out of 5 times, it said something like this:

Me:

here's an image with four 2-D line plots. Can you tell me A) how many lines are in each plot? and B) how many times do they intersect each other in each plot?

Write your answers as: Plot 1 (leftmost): ...

Sonnet:

I'll analyze each plot from left to right and provide the requested information:

Plot 1 (leftmost): A) 2 lines (one red, one blue) B) 0 intersections (the lines do not cross)

Plot 2: A) 2 lines (one red, one blue) B) 1 intersection (the lines cross once near the right edge)

Plot 3: A) 2 lines (one red, one blue) B) 1 intersection (the lines cross once near the center)

Plot 4 (rightmost): A) 2 lines (one red, one blue) B) 2 intersections (the lines cross twice, once near the left side and once near the right side)

Each plot consistently shows two lines, one red and one blue, with varying intersection patterns across the four plots.

---

Which isn't completely correct (the 3rd plot has 2 intersections), but still pretty good; the other 3 times, it said plot 3 has 1 intersection.
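
For anyone who wants to repeat this kind of spot check, a minimal script using the Anthropic Python SDK looks roughly like this (the model name and prompt wording are just the ones used above; treat them as assumptions and adjust as needed):

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Assumed local file with the four 2-D line plots from the paper's first example.
with open("four_line_plots.png", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode()

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text",
             "text": "Here's an image with four 2-D line plots. A) How many lines are in each "
                     "plot? B) How many times do they intersect in each plot? "
                     "Write your answers as: Plot 1 (leftmost): ..."},
        ],
    }],
)
print(message.content[0].text)
```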

By @akavi - 7 months
Speaking as someone with only a tenuous grasp of how VLMs work, this naïvely feels like a place where the "embodiment" folks might have a point: humans have the ability to "refine" their perception of an image iteratively, focusing in on areas of interest, while VLMs have to process the entire image at the same level of fidelity.

I'm curious if there'd be a way to emulate this (have the visual tokens be low fidelity at first, but allow the VLM to emit tokens that correspond to "focusing" on a region of the image with greater resolution). I'm not sure if/how it's possible to performantly train a model with "interactive" data like that, though
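
One crude way to approximate that "focusing" loop today, without retraining, is to let the model request a crop and then re-send a higher-resolution view of just that region. A sketch of the idea, where `ask_vlm` is a hypothetical stand-in for whatever VLM API you use:

```python
from PIL import Image

def ask_vlm(image: Image.Image, prompt: str) -> str:
    """Hypothetical stand-in for a real VLM call."""
    raise NotImplementedError

def answer_with_refinement(image_path: str, question: str, steps: int = 3) -> str:
    """Let the model iteratively zoom into a region of interest before answering."""
    view = Image.open(image_path)
    for _ in range(steps):
        reply = ask_vlm(
            view,
            f"{question}\nIf you need a closer look, answer only with a crop box "
            "'FOCUS x0,y0,x1,y1' in pixel coordinates; otherwise give the final answer.",
        )
        if not reply.startswith("FOCUS"):
            return reply
        box = tuple(int(v) for v in reply.removeprefix("FOCUS").split(","))
        view = view.crop(box).resize(view.size)  # re-render the chosen region at full resolution
    return ask_vlm(view, question)
```

Whether a model would actually learn to use such a "focus" action well is an open question, but it sidesteps the need for interactive training data.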

By @poikroequ - 7 months
It's ironic, they fail these seemingly simple tests that are trivial even for a child to solve. Yet, I used Gemini to read a postcard containing handwritten Russian cursive text with lots of visual noise (postmarks and whatnot). It was able to read the text and translate it into English. I didn't even need to tell it the text is Russian.

On the one hand, it's incredible what these LLMs are capable of. On the other hand, they often fall flat on their face with seemingly simple problems like this. We are seeing the same from self driving cars, getting into accidents in scenarios that almost any human driver could have easily avoided.

By @mglz - 7 months
I taught some Computational Geometry courses, and efficiently computing the intersections of N line segments is not as straightforward as you might initially think. Since some computation must be done somewhere to recognize this, and LLMs are not specifically trained for this task, it's not surprising they struggle.

In general, basic geometry seems under-explored by learning.
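
For reference, even the "easy" O(n^2) pairwise version needs some care with orientation tests and collinear cases; a minimal sketch of the standard approach:

```python
def orientation(p, q, r):
    """Sign of the cross product (q-p) x (r-p): >0 counter-clockwise, <0 clockwise, 0 collinear."""
    val = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (val > 0) - (val < 0)

def segments_intersect(a, b, c, d):
    """True if segment a-b intersects segment c-d (proper crossings and touching endpoints)."""
    o1, o2 = orientation(a, b, c), orientation(a, b, d)
    o3, o4 = orientation(c, d, a), orientation(c, d, b)
    if o1 != o2 and o3 != o4:
        return True
    def on_segment(p, q, r):  # given collinearity, is r within the bounding box of p-q?
        return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
                and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))
    return any(o == 0 and on_segment(*pts) for o, pts in
               [(o1, (a, b, c)), (o2, (a, b, d)), (o3, (c, d, a)), (o4, (c, d, b))])

def count_intersections(segments):
    """Naive O(n^2) count; a sweep line (Bentley-Ottmann) brings this to O((n + k) log n)."""
    return sum(segments_intersect(*segments[i], *segments[j])
               for i in range(len(segments)) for j in range(i + 1, len(segments)))
```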

By @GaggiX - 7 months
Well, all the models (especially Claude 3.5 Sonnet) seem to perform much better than random, so they are clearly not blind. The only task where Claude Sonnet 3.5 does not perform better than random is the one where you have to follow many different paths (the ones where the answer from A to C is 3), something that would take me several seconds to solve.

I have the feeling that they first chose the title of the paper and then ran the evaluation of the new Claude 3.5 Sonnet on these abstract images.

>their vision is, at best, like that of a person with myopia seeing fine details as blurry

This also makes no sense, since the images evaluate the abstract capabilities of the models, not their eyesight.

By @yantrams - 7 months
Tested these problems with llava-v1.6-mistral-7b and the results aren't bad. Maybe I just got lucky with these samples.

Intersecting Lines https://replicate.com/p/s24aeawxasrgj0cgkzabtj53rc

Overlapping Circles https://replicate.com/p/0w026pgbgxrgg0cgkzcv11k384

Touching Circles https://replicate.com/p/105se4p2mnrgm0cgkzcvm83tdc

Circled Text https://replicate.com/p/3kdrb26nwdrgj0cgkzerez14wc

Nested Squares https://replicate.com/p/1ycah63hr1rgg0cgkzf99srpxm

By @taesiri - 7 months
This paper examines the limitations of current vision-based language models, such as GPT-4 and Sonnet 3.5, in performing low-level vision tasks. Despite their high scores on numerous multimodal benchmarks, these models often fail on very basic cases. This raises a crucial question: are we evaluating these models accurately?
By @simonw - 7 months
I've been generally frustrated at the lack of analysis of vision LLMs generally.

They're clearly a very exciting category of technology, and a pretty recent one - they only got good last October with GPT-4 Vision, but since then we've had more vision models from Anthropic and Google Gemini.

There's so much more information out there about text prompting compared to image prompting. I feel starved for useful information about their capabilities: what are vision models good and bad at, and what are the best ways to put them to work?

By @dheera - 7 months
Current approaches to multi-modal models work on embeddings and tokenizations of images, which is the fundamental problem: you are feeding blurry, non-precise data into the model. Yes, they are blind because of exactly this.

An embedding isn't conceptually that much different from feeding a 1024-word description of an image instead of the actual image.

At the moment compute power isn't good enough to feed high-res pixel data into these models, unless we discover a vastly different architecture, which I am also convinced likely exists.

By @jeromeparadis - 7 months
One use-case I always try is to have an AI read a school calendar image where days off or days of interest are highlighted using a legend, e.g., days marked with a square, circle, triangle, or a different color.

When asked for specific days of interest in the school year, AIs always struggle. They get some days right but forget some or invent new ones. They fare a bit better if you remove some of the noise and give them only a picture of a single month, but even then it's unreliable.

By @_vaporwave_ - 7 months
It's really interesting that there's a huge performance discrepancy between these SOTA models. In the Olympic logo example, GPT-4o is below the baseline accuracy of 20% (worse than randomly guessing) while Sonnet-3.5 was correct ~76% of the time.

Does anyone have any technical insight or intuition as to why this large variation exists?

By @pjs_ - 7 months
I don't like this paper for the following reasons:

- The language is unnecessarily scathing

- They repeatedly show data where the models are getting things right 70, 80, 90% of the time, and then show a list of what they call "qualitative samples" (what does "qualitative" mean? "cherry-picked"?) which look very bad. But it got the answer right 70/80/90% of the time! That's hardly "blind"...

- Several of the tasks hinge on the distinction between two objects "exactly touching" vs. "very nearly touching" vs. "very slightly overlapping", a problem which (i) is hard for humans and (ii) is particularly (presumably deliberately) sensitive to resolution/precision, so we should not be surprised that models fail

- The main fish-shaped example given in task 1 seems genuinely ambiguous to me - do the lines "intersect" once or twice? The tail of the fish clearly has a crossing, but the nose of the fish seems a bit fishy to me... is that really an intersection?

- AFAIC, deranged skepticism is just as bad as deranged hype, and the framing here is at risk of appealing to the former

It's absolutely fair to make the point that these models are not perfect, fail a bunch of the time, and to point out the edge cases where they suck. That moves the field forwards. But the hyperbole (as pointed out by another commenter) is very annoying.

By @gnutrino - 7 months
My guess is that the systems are running image recognition models, and maybe OCR on images, and then just piping that data as tokens into an LLM. So you are only ever going to get results as good as existing images models with the results filtered through an LLM.

To me, this is only interesting if compared with results of image recognition models that can already answer these types of questions (if they even exist, I haven't looked).

Maybe the service is smart enough to look at the question and then choose one or more models to process the image, but I'm not sure, as I can't find anything on their sites about how it works.

By @jordan_bonecut - 7 months
This is an interesting article, and it fits with my understanding of how such models interpret input data. I'm not sure I would characterize the results as blurry vision, but rather as an inability to process what they see in a concrete manner.

All the LLMs and multi-modal models I've seen lack concrete reasoning. For instance, ask ChatGPT to perform 2 tasks, to summarize a chunk of text and to count how many words are in this chunk. ChatGPT will do a very good job summarizing the text and an awful job at counting the words. ChatGPT and all the transformer based models I've seen fail at similar concrete/mathematical reasoning tasks. This is the core problem of creating AGI and it generally seems like no one has made any progress towards synthesizing something with both a high and low level of intelligence.

My (unproven and probably incorrect) theory is that under the hood these networks lack information processing loops which make recursive tasks, like solving a math problem, very difficult.

By @Rebuff5007 - 7 months
In fairness, Mira Murati said GPT-4 is only high school level [1]. Maybe it takes PhD level to understand basic shapes?

[1] https://www.ccn.com/news/technology/openais-gpt-5-phd-level-...

By @londons_explore - 7 months
Could some of the "wrong" answers be the LLM attempting to give an explanation rather than the answer? E.g. instead of answering 'X', the LLM answers 'The letter is partially hidden by the oval, so I cannot be certain, but it appears to be the English letter X'.

The scoring criteria would mark this answer as 'T', which is wrong.

By @orbital-decay - 7 months
That's not anything like "myopia", though.

FWIW I tried the line intersection and the circled letter test from the article with CogVLM (which is far from the current SotA) and it correctly passed both. I haven't tried it with Sonnet/4o, but I suspect there might be something wrong with how the authors did their tests. Don't get me wrong, but too many "the model can't do that" claims have ended up with demonstrations of the model doing exactly that...

By @Log_out_ - 7 months
ChatGPT, write me an argument that humans are blind because https://en.m.wikipedia.org/wiki/Optical_illusion exist.

Alexa, experience that tragic irony for me.

Siri, forget it.

By @randomtree - 7 months
I guess I know what's coming to every captcha tomorrow.
By @rezaghanbari1 - 7 months
Some of these samples are shocking. How do these models answer chart-based questions when they can't even count the intersections between two lines?
By @nichohel - 7 months
Vision language models are blind because they lack the Cartesian Theater, which you and I have. Which you and I say we have.
By @aaroninsf - 7 months
The title for this page and argument should be qualified with the specific generation of tools.

That's in the abstract, but it's bad not to be specific, in this case because current public-facing models are WIWEB: the worst it will ever be.

And there are trillion-dollar prizes at stake, so, improvement is happening as quickly as it possibly can.

By @Jack000 - 7 months
This is kind of the visual equivalent of asking an LLM to count letters. The failure is more related to the tokenization scheme than the underlying quality of the model.

I'm not certain about the specific models tested, but some VLMs just embed the image modality into a single vector, making these tasks literally impossible to solve.

By @axblount - 7 months
Would you say they have Blindsight?
By @michaelhoney - 7 months
This says to me that there are huge opportunities for improvement in providing vision modules for LLMs. Human minds aren't made of just one kind of thing: we have all sorts of hacky modular capabilities – there's no reason to think that a future AGI wouldn't also.
By @tantalor - 7 months
Are the "random-baseline accuracy" numbers correct?

In the "Two circles" test, do they really have 50% chance of overlapping? I think this comes from "Distances between circle perimeters: -0.15 to 0.5 times the diameter" but doesn't say the distribution they use.

By @viraptor - 7 months
I love some of the interpretations there. For example "Fig. 10: Only Sonnet-3.5 can count the squares in a majority of the images.", when that model simply returns "4" for every question and happens to be right.
By @vessenes - 7 months
A few comments below talk about how tokenizing images using stuff like CLIP de facto yields blurry image descriptions, and so these models are 'blind' by some definitions. Another angle of blurring not much discussed is that the images are rescaled down, to different resolutions for different models. I wouldn't be surprised if Sonnet 3.5 had a higher-res base image it feeds into the model.

Either way, I would guess that we’ll need new model architectures for multimodal to get really good at some of this, and even then some of these tasks are adjacent to things that we know LLMs are already bad at (numeric logic, for instance).

As context lengths get longer, devoting more tokens to the image tokenization should help a bit here as well. Anyway, I’d anticipate next year we’d see 80s and 90s for most of these scores with next gen models.

By @iamleppert - 7 months
This could easily be fixed with training and fine-tuning. Simply generate 100,000 examples or so, train against ground truth for however long you want, and it's a solved problem.
By @make3 - 7 months
Hugged to death from my perspective. Here is a backup: https://archive.ph/kOE3Q
By @kristianpaul - 7 months
We see through thoughts and memories. We see when we desire; vision just adds on a world of thoughts and the consciousness of being conscious.

Vision links thoughts with reality

By @childintime - 7 months
Claude 3.5 does remarkably well though on many tasks, compared to the others, and on those it's not at all blind. It's getting there.
By @navaed01 - 7 months
Is there a good primer on how these vision LLMs work?
By @nyxtom - 7 months
I wonder how well Alpha Geometry would do on this
By @jackblemming - 7 months
Ask it to draw any of those things and it can.
By @cpill - 7 months
I wonder how they would score if they used all 4 models and took a majority vote...?
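
A simple ensemble along those lines is just a mode over the four answers; a minimal sketch:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common answer across models (ties broken by first occurrence)."""
    return Counter(answers).most_common(1)[0][0]

print(majority_vote(["2", "3", "2", "2"]))  # -> "2"
```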
By @nmca - 7 months
please use this opportunity to reflect on whether ARC measures reasoning skills :)
By @nothrowaways - 7 months
The next version will solve all of it.
By @mkoubaa - 7 months
They interact with pixel buffers as a mathematical array. To call them blind is to confuse what they are doing with the experience of sight...
By @verbalstoner - 7 months
It's virtually impossible to take a paper seriously when the title has an emoji.
By @spullara - 7 months
in other news, vision models are bad at things they aren't trained to do
By @hi_dang_ - 7 months
I was hoping that someone in the comments talking the paper down would have published a paper or have had relevant publications of their own to point to. You know, meet the lads halfway sort of thing.

So what I’m left with to judge instead is anonymous online commenters vs. the publication of 2 prestigious universities. Whose word do I take on this? Decisions, decisions.

You can swap "LM" with "Web3", "NFT", or "Crypto" in this case.