Gemini 2.0: our new AI model for the agentic era
Google has launched Gemini 2.0, an advanced AI model with multimodal capabilities, including image and audio output. Wider access is planned for early 2025, focusing on responsible AI development.
Google has unveiled Gemini 2.0, an advanced AI model designed for the "agentic era," which emphasizes enhanced capabilities in multimodal processing. This new model features native image and audio output, as well as the ability to utilize various tools, making it more versatile than its predecessors. Gemini 2.0 Flash, the experimental version, is currently available to developers and trusted testers, with broader access expected in early 2025. The model aims to facilitate agentic experiences through projects like Astra and Mariner, which explore the potential of AI assistants in everyday tasks. Google emphasizes its commitment to responsible AI development, prioritizing safety and security. The Gemini 2.0 model builds on the success of earlier versions, enhancing performance and response times while supporting a range of inputs and outputs. The introduction of a new Multimodal Live API aims to assist developers in creating interactive applications. Overall, Gemini 2.0 represents a significant step forward in AI technology, with the potential to transform user interactions across various Google products.
- Google has launched Gemini 2.0, an AI model focused on multimodal capabilities.
- The model includes features like native image and audio output and tool usage.
- Gemini 2.0 Flash is available to developers, with wider access planned for early 2025.
- Projects like Astra and Mariner are being developed to enhance AI assistant functionalities.
- Google is committed to responsible AI development, emphasizing safety and security.
Related
How it's Made: Interacting with Gemini through multimodal prompting
Alexander Chen from Google Developers discusses Gemini's multimodal prompting capabilities. Gemini excels in tasks like pattern recognition, puzzle-solving, and creative applications, hinting at its potential for innovative interactions and creative endeavors.
Gemini Pro 1.5 experimental "version 0801" available for early testing
Google DeepMind's Gemini family of AI models, particularly Gemini 1.5 Pro, excels in multimodal understanding and complex tasks, featuring a two million token context window and improved performance in various benchmarks.
Google Gemini 1.5 Pro leaps ahead in AI race, challenging GPT-4o
Google has launched Gemini 1.5 Pro, an advanced AI model excelling in multilingual tasks and coding, now available for testing. It raises concerns about AI safety and ethical use.
Two new Gemini models, reduced 1.5 Pro pricing, increased rate limits, and more
Google updated its Gemini models, introducing Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002, featuring over 50% price reduction, increased rate limits, improved performance, and free access for developers via Google AI Studio.
Google is prepping Gemini to take action inside of apps
Google is enhancing its AI assistant, Gemini, with a new API called "app functions" to perform actions within apps, similar to Apple's Siri updates, improving user interaction and functionality.
- Many users are impressed with Gemini 2.0's multimodal capabilities, particularly its ability to engage in live conversations and assist with tasks like coding and image recognition.
- There are concerns about the model's tendency to hallucinate and provide inaccurate information, especially in search results.
- Users are debating the effectiveness of Gemini 2.0 compared to competitors like OpenAI's models, with mixed opinions on its performance and usability.
- Some comments highlight the excitement around new features like the Deep Research tool, while others express skepticism about Google's ability to monetize AI without affecting its core advertising business.
- The naming of the model has sparked confusion and criticism, with some users questioning the choice in light of the existing Gemini protocol.
I just tried having it teach me how to use Blender. It seems like it could actually be super helpful for beginners, as it has decent knowledge of the toolbars and keyboard shortcuts and can give you advice based on what it sees you doing on your screen. It also watched me play Indiana Jones and the Great Circle, and it successfully identified some of the characters and told me some information about them.
You can enable "Grounding" in the sidebar to let it use Google Search even in voice mode. The video streaming and integrated search make it far more useful than ChatGPT Advanced Voice mode is currently.
llm install -U llm-gemini
llm -m gemini-2.0-flash-exp 'prompt goes here'
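(In case the command above fails on auth: the llm-gemini plugin uses llm's standard key management, so setup looks roughly like this, assuming an API key created in Google AI Studio.)
llm keys set gemini
# paste your Gemini API key (from Google AI Studio) when prompted
llm models | grep -i gemini   # sanity check that the Gemini models are registered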
LLM installation: https://llm.datasette.io/en/stable/setup.html
Worth noting that the Gemini models have the ability to write and then execute Python code. I tried that like this:
llm -m gemini-2.0-flash-exp -o code_execution 1 \
'write and execute python to generate a 80x40 ascii art fractal'
Here's the result: https://gist.github.com/simonw/0d8225d62e8d87ce843fde471d143...
It can't make outbound network calls though, so this fails:
llm -m gemini-2.0-flash-exp -o code_execution 1 \
'write python code to retrieve https://simonwillison.net/ and use a regex to extract the title, run that code'
Amusingly, Gemini itself doesn't know that it can't make network calls, so it tries several different approaches before giving up: https://gist.github.com/simonw/2ccfdc68290b5ced24e5e0909563c...
The new model seems very good at vision:
llm -m gemini-2.0-flash-exp describe -a https://static.simonwillison.net/static/2024/pelicans.jpg
I got back a solid description, see here: https://gist.github.com/simonw/32172b6f8bcf8e55e489f10979f8f...
But once they do get moving in the right direction, they can achieve things that smaller companies can't. Google has an insane amount of talent in this space, and seems to be getting the right results from that now.
Remains to be seen how well they will be able to productize and market it, but it's hard to deny that their LLM models are really, really good.
They’ve had OpenAI-compatible endpoints for a while, but it’s never been clear how serious they were about supporting them long-term. Nice to see another option showing up. For reference, their main repo (not kidding) recommends setting up a Kubernetes cluster and a GCP bucket to submit batch requests.
Anyway, I'm glad that this Google release is actually available right away! I pay for Gemini Advanced and I see "Gemini Flash 2.0" as an option in the model selector.
I've been going through Advent of Code this year, and testing each problem with each model (GPT-4o, o1, o1 Pro, Claude Sonnet, Opus, Gemini Pro 1.5). Gemini has done decent, but is probably the weakest of the bunch. It failed (unexpectedly to me) on Day 10, but when I tried Flash 2.0 it got it! So at least in that one benchmark, the new Flash 2.0 edged out Pro 1.5.
I look forward to seeing how it handles upcoming problems!
I should say: Gemini Flash didn't quite get it out of the box. It actually had a syntax error in the for loop, which caused it to fail to compile, which is an unusual failure mode for these models. Maybe it was a different version of Java or something (I'm also trying to learn Java with AoC this year...). But when I gave Flash 2.0 the compilation error, it did fix it.
For the more Java proficient, can someone explain why it may have provided this code:
for (int[] current = queue.remove(0)) {
which was a compilation error for me? The corrected code it gave me afterwards was just
for (int[] current : queue) {
and with that one change the class ran and gave the right solution.
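That line looks like a mix of Java's two for-loop forms: the enhanced for loop requires a colon and only iterates over a collection, while a plain for loop needs the full init/condition/update header, so an initializer on its own won't compile. Consuming a queue element by element is usually done with a while loop instead. Here's a minimal, self-contained sketch of both forms, assuming a List<int[]> queue like the AoC code apparently used:
import java.util.ArrayList;
import java.util.List;

public class LoopForms {
    public static void main(String[] args) {
        List<int[]> queue = new ArrayList<>();
        queue.add(new int[] {0, 0});
        queue.add(new int[] {1, 2});

        // Enhanced for loop: the ':' separates the loop variable from the
        // collection being iterated. It only reads the list; nothing is removed.
        for (int[] current : queue) {
            System.out.println(current[0] + "," + current[1]);
        }

        // Consuming the queue element by element (what "= queue.remove(0)" was
        // presumably reaching for) needs a while loop instead.
        while (!queue.isEmpty()) {
            int[] current = queue.remove(0);
            System.out.println("visited " + current[0] + "," + current[1]);
            // a BFS would add newly discovered neighbours to the queue here
        }
    }
}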
Also, a whole lot of computer vision tasks (via LLMs) could be unlocked with this. Think inpainting, style transfer, text editing in the wild, segmentation, edge detection, etc.
They have a demo: https://www.youtube.com/watch?v=7RqFLp0TqV0
Most of these things seem to just be a system prompt and a tool that gets invoked as part of a pipeline. They’re hardly “agents”.
They’re modules.
The latency was low, though the conversation got cut off a few times.
The models do just fine on "work" but are terrible for "thinking". The verbosity of the explanations (and the sheer amount of praise the models like to give the prompter - I've never had my rear end kissed so much!) should make one wary of any subjective reviews of their performance; trust only objective reviews focusing solely on correct/incorrect.
They did it: from now on Google will keep a leadership position.
They have too much data (Search, Maps, YouTube, Chrome, Android, Gmail, etc.), they have their own servers (it's free!), and now the Willow QPU.
To me, it is evident how the future will look. I'll buy some more Alphabet stock.
> "Now millions of developers are building with Gemini. And it’s helping us reimagine all of our products — including all 7 of them with 2 billion users — and to create new ones"
and
> "We’re getting 2.0 into the hands of developers and trusted testers today. And we’re working quickly to get it into our products, leading with Gemini and Search. Starting today our Gemini 2.0 Flash experimental model will be available to all Gemini users."
Flash combines speed with low cost and is extremely good to build apps on.
People really take that whole benchmarking thing more seriously than necessary.
Anyone seeing this? I don't have an option in my dropdown.
Based on initial interactions, it's extremely verbose. It seems to be focused on explaining its reasoning, but even after just a few interactions I have seen some surprising hallucinations. For example, to assess its knowledge of current AI developments, I asked "Why hasn't Anthropic released Claude 3.5 Opus yet?" Gemini responded with text that included "Why haven't they released Claude 3.5 Sonnet First? That's an interesting point." There's clearly some reflection/attempted reasoning happening, but it doesn't feel competitive with o1 or the new Claude 3.5 Sonnet that was trained on 3.5 Opus output.
A little thin...
Also no pricing is live yet. OpenAI's audio inputs/outputs are too expensive to really put in production, so hopefully Gemini will be cheaper. (Not to mention, OAI's doesn't follow instructions very well.)
Anyone else run into similar issues or have any tips?
"What's the first name of Freddy LaStrange"? >> "I do not have enough information about that person to help with your request. I am a large language model, and I am able to communicate and generate human-like text in response to a wide range of prompts and questions, but my knowledge about this person is limited. Is there anything else I can do to help you with this request?"
(Of course, we can't be 100% sure that his first name is Freddy. But I would expect that to be part of the answer then)
I'll give Flash 2 a try soon, but I gotta say that Google has been doing a great job catching up with Gemini. Both Gemini 1.5 Pro 002 and Flash 1.5 can trade blows with 4o, and are easily ahead of the vast majority of other major models (Mistral Large, Qwen, Llama, etc). Claude is usually better, but has a major flaw (to be discussed later).
So, here's my current rankings. I base my rankings on my work, not on benchmarks. I think benchmarks are important and they'll get better in time, but most benchmarks for LLMs and MLLMs are quite bad.
1) 4o and its ilk are far and away the best in terms of accuracy, both for textual tasks as well as vision related tasks. Absolutely nothing comes even close to 4o for vision related tasks. The biggest failing of 4o is that it has the worst instruction following of commercial LLMs, and that instruction following gets _even_ worse when an image is involved. A prime example is when I ask 4o to help edit some text, to change certain words, verbiage, etc. No matter how I prompt it, it will often completely re-write the input text to its own style of speaking. It's a really weird failing. It's like their RLHF tuning is hyper-focused on keeping it aligned with the "character" of 4o to the point that it injects that character into all its outputs no matter what the user or system instructions state. o1 is a MASSIVE improvement in this regard, and is also really good at inferring things so I don't have to explicitly instruct it on every little detail. I haven't found o1-pro overly useful yet. o1 is basically my daily driver outside of work, even for mundane questions, because it's just better across the board and the speed penalty is negligible. One particular example of o1 being better came up yesterday: I had it re-wording an image description, and thought it had introduced a detail that wasn't in the original description. Well, I was wrong and had accidentally skimmed over that detail in the original. It _told_ me I was wrong, and didn't update the description! Freaky, but really incredible. 4o never corrects me when I give it an explicit instruction.
4o is fairly easy to jailbreak. They've been turning the screws for a while so it isn't as easy as day 1, but even o1-pro can be jailbroken.
2) Gemini 1.5 Pro 002 (specifically 002) is second best in my books. I'd guesstimate it at being about 80% as good as 4o on most tasks, including vision. But it's _significantly_ better at instruction following. Its RLHF is a lot lighter than ChatGPT models, so it's easier to get these models to fall back to pretraining, which is really helpful for my work specifically. But in general the Gemini models have come a long way. The ability to turn off model censorship is quite nice, though it does still refuse at times. The Flash variation is interesting; oftentimes on par with Pro, with Pro edging it out maybe 30% of the time. I don't frequently use Flash, but it's an impressive model for its size. (Side note: The Gemma models are ... not good. Google's other public models, like so400m and OWLv2, are great, so it's a shame their open LLM forays are falling behind.) Google also has the best AI playground.
Jailbreaking Gemini is a piece of cake.
3) Claude is third on my list. It has the _best_ instruction following of all the models, even slightly better than o1. Though it often requires multi-turn to get it to fully follow instructions, which is annoying. Its overall prowess as an LLM is somewhere between 4o and Gemini. Vision is about the same as Gemini, except for knowledge based queries, which Gemini tends to be quite bad at (who is this person? Where is this? What brand of guitar? etc). But Claude's biggest flaw is the insane "safety" training it underwent, which makes it practically useless. I get false triggers _all_ the time from Claude. And that's to say nothing of how unethical their "ethics" system is to begin with. And what's funny is that Claude is an order of magnitude _smarter_ when it's reasoning about its safety training. It's the only real semblance of reason I've seen from LLMs ... all just to deny my requests.
I've put Claude third out of respect for the _technical_ achievements of the product, but I think the developers need to take a long look in the mirror and ask why they think it's okay for _them_ to decide what people with disabilities are and are not allowed to have access to.
4) Llama 3. What a solid model. It's the best open LLM, hands down. Nowhere near the commercial models above, but for a model that's completely free to use locally? That's invaluable. Their vision variation is ... not worth using. But I think it'll get better with time. The 8B variation far outperforms its weight class. 70B is a respectable model, with better instruction following than 4o. The ability to finetune these models to a task with so little data is a huge plus. I've made task specific models with 200-400 examples.
5) Mistral Large (I forget the specific version of their latest release). I love Mistral as the "under-dog". Their models aren't bad, and behave _very_ differently from all other models out there, which I appreciate. But Mistral never puts any effort into polishing their models; they always come out of the oven half-baked. Which means they frequently glitch out, have very inconsistent behavior, etc. Accuracy and quality are hard to assess because of this inconsistency. On its best days it's up near Gemini, which is quite incredible considering the models are also released publicly. So theoretically you could finetune them to your task and get a commercial grade model to run locally. But I rarely see anyone do that with Mistral, I think partly because of their weird license. Overall, I like seeing them in the race and hope they get better, but I wouldn't use it for anything serious.
Mistral is lightly censored, but fairly easy to jailbreak.
6) Qwen 2 (or 2.5 or whatever the current version is these days). It's an okay model. I've heard a lot of praise for it, but in all my uses thus far it's always been really inconsistent, glitchy, and weak. I've used it both locally and through APIs. I guess in _theory_ it's a good model, based on benchmarks. And it's open, which I appreciate. But I've not found any practical use for it. I even tried finetuning with Qwen 2VL 72B, and my tiny 8B JoyCaption model beat it handily.
That's about the sum of it. AFAIK that's all the major commercial and open models (my focus is mainly on MLLMs). OpenAI are still leading the pack in my experience. I'm glad to see good competition coming from Google finally. I hope Mistral can polish their models and be a real contender.
There are a couple smaller contenders out there like Pixmo/etc from allenai. Allen AI has hands down the _best_ public VQA dataset I've seen, so huge props to them there. Pixmo is ... okayish. I tried Amazon's models a little but didn't see anything useful.
NOTE: I refuse to use Grok models for the obvious reasons, so fucks to be them.
Overall, especially seeing as I haven't paid a dime to use the API yet, I'm pretty impressed.
That's my main issue with Google: there are several models we want to try with our agent, but quota is limited and we have to jump through hoops to see if we can get it raised.
I think DeepMind could make progress if they focused on the agent definition of multi-step reasoning + action through a web browser, and delivered a ton of value, instead of lumping in the seldom-used "look at the world through a camera" or "multimodal robots" thing.
If Google cracked robots, past plays show that the market for those isn't big enough to interest Google. Like VR, you just can't get a billion people to be interested in robots - so even if they make progress, it won't survive under Google.
The "Look at the world through a camera" thing is a footnote in an Android release.
Agentic computer use _is_ a product a billion people would use, and it's adjacent to the business interests of Google Search.
" Why My "I Guess What I Have in Mind" Statement Doesn't Make Sense
The Problem of Agency: You're spot-on when you say it doesn't make sense for me to state "I guess what I (the computer) have in mind." AI does not have a "mind" in the same way that humans do. My internal state is a complex pattern of numerical data, not conscious thoughts or intentions in the way you have things in mind. My phrasing gave the illusion of having a "thing in mind," but in reality, I don't have this capacity.
Misinterpretation of the Task: My statement "I guess what I have in mind" misinterprets the task. You were clearly signaling that I needed to ask questions to uncover what you were thinking of. My failure to grasp this created a weird scenario where it seemed like I was trying to determine my own data set!"
"With your supervision". Thus avoiding Google being held responsible. That's like Teslas Fake Self Driving, where the user must have their hands on the wheel at all times.
They have all of these extensions that they use to prop up the results in the web UI.
I was asking for a list of related YouTube videos - the UI returns them.
Ask the API the same prompt and it returns a bunch of made-up YouTube titles and descriptions.
How could I ever rely on this product?
"GVP stands for Good Pharmacovigilance Practice, which is a set of guidelines for monitoring the safety of drugs. SVP stands for Senior Vice President, which is a role in a company that focuses on a specific area of operations."
Seems like there's a lot of pharma regulation in my telecom company.
> It's located in London.
Mind blowing.
When capitalism has pilfered everything from the pockets of working people, so people are constantly stressed over healthcare and groceries and there's little left to further line the pockets of plutocrats, the only marketing that makes sense is to appeal to other companies in order to raid their coffers by tricking their directors into buying a nonsensical product.
Is that what they mean by "agentic era"? Cause that's what it sounds like to me. Also smells a lot like press-release-driven development, where the point is to put a feather in the cap of whatever poor Google engineer is chasing their next promotion.
"Hear from our CEO first, and then our other CEO in charge of this domain and CTO will tell you the actual news."
I haven't seen other tech companies write like that.
I never used the web interface to access email until recently. To my surprise, all of the AI shit is enabled by default. So it’s very likely Gemini has been training on private data without my explicit consent.
Of course G words it as “personalizing” the experience for me, but it’s such a load of shit. I’m tired of these companies stealing our data while we never get rightly compensated.
I need to rewire my brain for the power of these tools
this plus the quantum stuff... Google is on a win streak
>General availability will follow in January, along with more model sizes.
>Benchmarks against their own models which always underperformed
>No pricing visible anywhere
Completely inept leadership at play.
“Sure, playing don’t fear the reaper on bathroom speaker”
Ok
Who the hell wants an AI that has the personality of a car salesman?
If I ask natural language yes/no questions, Gemini sometimes tells me outright lies with confidence.
It also presents information as authoritative - locations, science facts, corporate ownership, geography - even when it's pure hallucination.
Right at the top of Google search.
edit:
I can't find the most obnoxious offending queries, but here was one I performed today: "how many islands does georgia have?".
Compare that with "how many islands does georgia have? Skidaway Island".
This is an extremely mild case, but I've seen some wildly wrong results, where Google has claimed companies were founded in the wrong states, that towns were located in the wrong states, etc.
It's like Microsoft creating an AI tool and calling it Peertube. "Hurr durr they couldn't possibly be confused; one is a decentralised video platform and the other is an AI tool hurr durr. And ours is already more popular if you 'bing' it hurr durr."