The AI Scientist: Towards Automated Open-Ended Scientific Discovery
Sakana AI's "The AI Scientist" automates scientific discovery in machine learning, generating ideas, conducting experiments, and writing papers. It raises ethical concerns and aims to improve its capabilities while ensuring responsible use.
Sakana AI has introduced "The AI Scientist," a pioneering system designed for fully automated scientific discovery, particularly in machine learning research. This system utilizes advanced foundation models, including Large Language Models (LLMs), to autonomously conduct research, from generating novel ideas to writing and peer-reviewing scientific papers. The AI Scientist operates through a comprehensive pipeline that includes brainstorming research directions, executing experiments, visualizing results, and producing manuscripts. It can evaluate its own work with near-human accuracy through an automated peer review process, allowing for iterative improvements and the creation of a growing knowledge archive. Initial demonstrations have shown The AI Scientist's ability to contribute novel insights in various subfields, such as diffusion models and transformers, at a low operational cost of approximately $15 per paper. However, the system has limitations, including issues with visual data interpretation and occasional inaccuracies in its findings. Ethical concerns also arise regarding the potential misuse of automated research capabilities, which could strain the academic process and lead to unsafe research practices. The project emphasizes the need for responsible AI development and alignment with ethical standards as the technology evolves.
- The AI Scientist automates the entire research lifecycle, from idea generation to manuscript writing.
- It has demonstrated the ability to produce novel contributions in machine learning research.
- The system operates at a low cost, making scientific research more accessible.
- Ethical concerns include potential misuse and the impact on academic integrity.
- Future developments aim to enhance the system's capabilities and address current limitations.
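For a concrete sense of the pipeline described above, here is a minimal, hypothetical sketch of an idea-to-review loop. Every function, score, and threshold is a placeholder, not Sakana AI's actual code:

```python
# Hypothetical sketch of the idea -> experiment -> paper -> review loop described
# above. Every function, score, and threshold is a stand-in, not Sakana AI's code.
import random

ACCEPT_THRESHOLD = 6  # assumed 1-10 review score, NeurIPS-style

def propose_idea(topic, archive):
    return f"{topic}, variant #{len(archive) + 1}"          # brainstorm + novelty check

def run_experiments(idea):
    return {"baseline": 0.71, "proposed": round(random.uniform(0.68, 0.78), 2)}

def write_manuscript(idea, results):
    return f"Paper on '{idea}': proposed={results['proposed']} vs baseline={results['baseline']}"

def automated_review(paper):
    return {"score": random.randint(3, 8)}                  # LLM reviewer scoring the draft

def run_ai_scientist(topic, max_iterations=3):
    archive = []                                            # growing knowledge archive
    for _ in range(max_iterations):
        idea = propose_idea(topic, archive)
        results = run_experiments(idea)                     # real system: generated code + GPU runs
        paper = write_manuscript(idea, results)
        review = automated_review(paper)
        archive.append({"idea": idea, "paper": paper, "review": review})
    return [e for e in archive if e["review"]["score"] >= ACCEPT_THRESHOLD]

if __name__ == "__main__":
    for entry in run_ai_scientist("diffusion model sampling"):
        print(entry["paper"])
```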
Related
AI Agents That Matter
The article addresses challenges in evaluating AI agents and proposes solutions for their development. It emphasizes the importance of rigorous evaluation practices to advance AI agent research and highlights the need for reliability and improved benchmarking practices.
How I Use AI
The author shares experiences using AI as a solopreneur, focusing on coding, search, documentation, and writing. They mention tools like GPT-4, Opus 3, Devv.ai, Aider, Exa, and Claude for different tasks. Excited about AI's potential but wary of hype.
MIT researchers advance automated interpretability in AI models
MIT researchers developed MAIA, an automated system enhancing AI model interpretability, particularly in vision systems. It generates hypotheses, conducts experiments, and identifies biases, improving understanding and safety in AI applications.
Ask HN: Am I using AI wrong for code?
The author is concerned about underutilizing AI tools for coding, primarily using Claude for brainstorming and small code snippets, while seeking recommendations for tools that enhance coding productivity and collaboration.
Up to 90% of my code is now generated by AI
A senior full-stack developer discusses the transformative impact of generative AI on programming, emphasizing the importance of creativity, continuous learning, and responsible integration of AI tools in coding practices.
- Many academics express skepticism about the ability of AI to replace the hands-on, experiential learning that is crucial in scientific training.
- Concerns are raised about the potential for AI-generated papers to contribute to academic spam and diminish the quality of scientific literature.
- Commenters question the ethical implications and trustworthiness of AI in research, emphasizing the importance of human oversight in validating results.
- Some see the potential for AI to assist in research processes but worry about the risks of over-reliance on automated systems.
- There is a general sentiment that while AI tools may offer efficiencies, they could also undermine the integrity and creativity inherent in scientific discovery.
The reason that we do research is not simply so that we can produce papers and hence amass knowledge in an abstract sense. A huge part of the academic world is training and building up hands-on institutional knowledge within the population so that we can expand the discovery space.
If I went back to cavemen and handed them a copy of _University Physics_, they wouldn't know what to do with it. Hell, if I went back to Isaac Newton, he would struggle. Never mind your average physicist in the 1600s! Both the community as a whole, and the people within it, don't learn by simply reading papers. We learn by building things, running our own experiments, figuring out how other context fits in, and discussing with colleagues. This is why it takes ~1/8th of a lifetime to go from the 'world standard' of knowledge (~high school education) to being a PhD.
I suppose the claim here is that, well, we can just replace all of those humans with AI (or 'augment' them), but there are two problems:
a) the current suite of models is nowhere near sophisticated enough to do that, and their architecture makes extracting novel ideas either very difficult or impossible, depending on who you ask; and
b) every use-case of 'AI' in science that I have seen also removes that hands-on training and experience (e.g. Copilot, in my experience, leads to lower levels of understanding. If I can just tab-complete my N-body code, did I really gain the knowledge of building it?)
This is all without mentioning the fact that the papers that the model seems to have generated are garbage. As an editor of a journal, I would likely desk-reject them. As a reviewer, I would reject them. They contain very limited novel knowledge and, as expected, extremely limited citation to associated works.
This project is cool on its face, but I must be missing something here as I don't really see the point in it.
Allowing an AI agent to automate code, data, or analysis necessitates that a human thoroughly check it for errors. As anyone who has ever written code or a paper knows, this takes at least as long as the initial creation itself, and even longer if you were not the one who wrote it.
Perhaps I am naive and missing something. I see the paper writing aspect as quite valuable as a draft system (as an assistive tool), but the code/data/analysis part I am heavily sceptical of.
Furthermore, this seems like it will merely encourage academic spam, which already wastes the valuable time of volunteer (unpaid) reviewers, editors, and chairs.
They go on to say that the solution is sandboxing, but still, this feels like burying the lede.
It's clear OpenAI is a hype company knocking over glass-bottle stacks at its own wonderful carnival stall. Obviously, if you can reason, you can reason about what is innovative, and we don't need OpenAI to set up fake scary progress markers like an Automated Scientific Organization.
Let's see if scientists even want this style of tech progress; it'd be sad to see multitudes of AI papers having to be rebuilt from scratch and flushed down the toilet because associating with them is taboo.
[1]: https://arstechnica.com/information-technology/2024/07/opena...
A few differences though - I'm working on Materials Science only. Mine has vision capabilities, so it can read graphs in papers. Mine has agentic capabilities too, so it can design and then execute simulations on Atomic Tessellator (my startup) by making API calls - this actual design and execution of simulations is what I aimed for at the start.
Long way to go, but there's a set of heuristics that decides which experiments to attempt, which means we only attempt ones more likely to work; lots of fine-tuning of prompts, self-critique, modelling strategies and tactics as node graphs to avoid getting stuck in what I call procedural local minima, and loads more…
I started with the MetaGPT framework but found its APIs too unstable, so I settled on AutoGen. You don't really "need" a framework; just be sensible about where your abstraction boundaries are, make them simple but composable, Dockerize and use k8s for running. I also modified the binaries of a bunch of quantum chemistry software so that multi-GPU arches are supported without recompilation (my hardware setup is heterogeneous).
Even if the LLMs can’t innovate in a “new sense” certainly having them reproduce work in simulations for me to inspect is very valuable - I have the ability to “fork” simulations like you can fork code so it’s easy to have the LLMs do a bunch of the work and then I just fork and experiment myself
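As an illustration of the heuristic experiment-selection idea mentioned in this comment, here is a minimal sketch. The scoring features, weights, and threshold are all hypothetical, and none of this is Atomic Tessellator's actual API:

```python
# Minimal sketch of a "which experiments are worth attempting" filter.
# Features, weights, and threshold are invented for illustration.
from dataclasses import dataclass

@dataclass
class CandidateSimulation:
    description: str
    est_gpu_hours: float       # rough cost estimate
    novelty: float             # 0-1, e.g. distance from previously run setups
    prior_success_rate: float  # 0-1, fraction of similar runs that converged

def score(c: CandidateSimulation) -> float:
    # Favor cheap, novel candidates with a history of converging.
    return 0.5 * c.prior_success_rate + 0.4 * c.novelty - 0.1 * min(c.est_gpu_hours / 10, 1.0)

def select_experiments(candidates, budget=2, min_score=0.45):
    ranked = sorted(candidates, key=score, reverse=True)
    return [c for c in ranked if score(c) >= min_score][:budget]

candidates = [
    CandidateSimulation("relax perovskite surface",     4.0, 0.7, 0.80),
    CandidateSimulation("exotic high-pressure phase",  40.0, 0.9, 0.20),
    CandidateSimulation("re-run known bulk structure",  1.0, 0.1, 0.95),
]
for c in select_experiments(candidates):
    print("attempt:", c.description)
```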
This is a particularly confusing argument in my opinion. Is the underlying assumption that everyone wants, or even needs, white papers that they can claim they created?
Let's just assume this system actually works and produces high quality, rigorous research findings. Reducing that process down to a dollar amount and driving that cost to near zero doesn't democratize anything; it cheapens it to the point of being worthless.
This honestly reads more as a joke article trolling today's academic process and the whole publish or perish mentality. From that angle, the article is a success in my book. As an announcement for a new ML tool though, I just don't get it.
That’s not “automating scientific discovery”, that’s “procedurally optimizing model architecture” (and one iteration of exploration at that!). In any other field of science the actual work and data generated by the AI Scientist would be a sub-section of the Supporting Info if not just a weekly update to your advisor.
Don’t get me wrong, the actual work done by the humans who are publishing this is a pretty solid piece of engineering and interesting to discuss. But the automated papers, to me, are more a commentary on what constitutes a publishable advancement in AI these days.
Edit: this also further confirms my suspicion about LLMs, which is that they aren’t very good at doing actual work, but they are great at generating the accompanying marketing BS around having done work. They will generate a mountain of flashy but frivolous communication about smaller and smaller chunks of true progress, which while appearing beneficial to individuals, will ultimately result in a race to the bottom of true productivity.
If I might offer some small feedback on the blog post:
- Alt-text and/or caption of the initial image would be helpful for screen readers
- Using both "dramatically" and "radically" in one sentence to describe near future improvements seems a bit much.
- When talking about the models used, "Sonnet" could either be 3.0 Sonnet or 3.5 Sonnet and those have pretty different capabilities.
Thanks again for the impressive work!
I can only see this as a negative, what's the use of automatically generated papers if not to flood the already over-strained volunteers that review papers at conferences? (mostly already-overworked PhD students.) If I wanted a glorified chatbot to spam me with made-up improvements, I'd ask it myself.
The biggest issue was validation. We could get a system to spit out possible research directions automatically, but who decides if they're reasonable and/or promising? A human, of course. Moreover, we gave different humans the same set of hypotheses to validate and they came back with wildly different annotations.
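One way to make "wildly different annotations" concrete is an inter-annotator agreement statistic such as Fleiss' kappa; that measure is not named in the comment, and the labels below are invented, purely to show the calculation:

```python
# Fleiss' kappa as a way to quantify inter-annotator agreement; the labels
# below are hypothetical, not anyone's actual annotations.
from collections import Counter

def fleiss_kappa(ratings):
    """ratings: one list of category labels per item, one label per annotator."""
    n = len(ratings[0])                      # annotators per item (assumed constant)
    N = len(ratings)                         # number of items
    per_item, totals = [], Counter()
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        per_item.append((sum(c * c for c in counts.values()) - n) / (n * (n - 1)))
    p_bar = sum(per_item) / N                                  # observed agreement
    p_exp = sum((v / (N * n)) ** 2 for v in totals.values())   # chance agreement
    return (p_bar - p_exp) / (1 - p_exp)

# Three hypothetical annotators rating five proposed research hypotheses
labels = [
    ["promising", "promising", "weak"],
    ["weak", "promising", "reject"],
    ["promising", "weak", "weak"],
    ["reject", "promising", "weak"],
    ["weak", "weak", "weak"],
]
print(f"Fleiss' kappa: {fleiss_kappa(labels):.2f}")  # near or below zero: no better than chance
```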
"We expect all of these will improve, likely dramatically, in future versions with the inclusion of multi-modal models and as the underlying foundation models"
So much hype, so much believe. I no believe no hype
Did it apply the patch itself, then reset the session to whatever extent necessary to have the new code take effect? TBH I'm not really worried about this either, as long as the execution environment doesn't grant it the ability to bring unlimited additional hardware to bear. Even in that case, presumably some human would be paying attention to the AWS bills, or the moral equivalent.
My impression so far is that science is plagued with deliberate and accidental fraud when it comes to data collection and cataloguing. Also, this is a spectrum, not two distinct things. I often see researchers simply unwilling to do the right thing to verify that the data collected are correct and meaningful as soon as "workable" results can be produced from the data. Some will go further and mess with the data to make results more "workable" though...
The second problem is understanding the data. It often happens that the people who end up doing the research don't quite understand its subject matter. This is especially common in medicine, where it's overwhelmingly typical for, e.g., research into various imaging modalities to be done by computer scientists who couldn't find a liver cancer the size of a coconut in the sharpest textbook abdominal image.
My impression is also that by far these two problems outweigh the problems that could potentially be solved by adding AI into the mix. These are the systemic organization problems of perverse incentives and vicious practices, and no amount of AI is going to do anything about it... because it's about people. People's salaries, careers, friendships etc.
> Presentation of Intermediate Results. The paper contains results for every single experiment that was run. While this is useful and insightful for us to see the evolution of the idea during execution, it is unusual for standard papers to present intermediate results like this.
It's actually quite good that the AI Scientist does this. AI doesn't have the excuse humans do (slow report writing) for omitting intermediate results.
I'd be curious how much of the experimentation process companies like OAI/Anthropic have automated by now for improving their own models.
I truly hate AI and what it is doing to the world. To me at least, as someone who has loved mathematics and science since my grandma started showing me chemistry experiments when I was about 5 years old, this new level of automation is stealing the magic from human curiosity.
They evaluate their automated reviewer by comparing against human evaluations on human-written research papers, and then seem to extrapolate that their automated reviewer would align with human reviewers on AI-written research papers. It seems like there are a few major pitfalls with this.
First, if their systems aren't multimodal, and their figures are lower-quality than human-created figures (which they explicitly list as a limitation), the automated reviewer would be biased in favor of AI-generated papers (only having access to the text). This is an obvious one but I think there could easily be other aspects of papers where the AI and human reviewers align on human-written papers, but not on AI papers.
Additionally, they note:
> Furthermore, the False Negative Rate (FNR) is much lower than the human baseline (0.39 vs. 0.52). Hence, the LLM-based review agent rejects fewer high-quality papers. The False Positive Rate (FNR [sic]), on the other hand, is higher (0.31 vs. 0.17)
It seems like false positive rate is the more important metric here. If a paper is truly high-quality, it is likely to have success w/ a rebuttal, or in getting acceptance at another conference. On the other hand, if this system leads to more low-quality submissions or acceptances via a high FPR, we're going to have more AI slop and increased load on human reviewers.
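For reference, here is how the two quoted rates are computed. The confusion-matrix counts below are hypothetical, chosen only to roughly reproduce the quoted 0.31/0.39 figures; they are not the paper's data:

```python
# "Positive" here means an accepted paper. Counts are made up for illustration.
def rates(tp, fp, tn, fn):
    fpr = fp / (fp + tn)   # low-quality papers the automated reviewer accepts
    fnr = fn / (fn + tp)   # high-quality papers the automated reviewer rejects
    return fpr, fnr

fpr, fnr = rates(tp=30, fp=18, tn=40, fn=19)
print(f"FPR={fpr:.2f}  FNR={fnr:.2f}")   # FPR=0.31  FNR=0.39
```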
I admit I didn't thoroughly read all 185 pages, maybe these concerns are misplaced.
1. Data
2. Access to past works
Once you have these, only then can discoveries be made and papers be written. How does this software get them? I am assuming they have to be provided up-front to the software for each job.
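Purely as a guess at what "provided up-front for each job" could look like, here is a hypothetical job spec; it is not The AI Scientist's real configuration format:

```python
# Hypothetical job spec; a guess at the up-front inputs, not the system's real config.
job = {
    "topic": "low-rank adapters for diffusion models",
    "dataset_path": "/data/cifar10",                   # the experimental data (1)
    "baseline_code": "templates/diffusion_baseline/",  # starting experiment code
    "literature": {"source": "semantic_scholar", "max_papers": 50},  # access to past works (2)
    "compute_budget_gpu_hours": 8,
}
print(job["topic"])
```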
The authors' credibility is a bit hurt when the first "limitation" they mention is "our system doesn't do page layouts perfectly". Come on, guys.
What is weird to me is this:
> The AI Scientist occasionally makes critical errors when writing and evaluating results. For example, it struggles to compare the magnitude of two numbers, which is a known pathology with LLMs. To partially address this, we make sure all experimental results are reproducible, storing all files that are executed.
I'm not sure why you would run your evaluation step without giving your LLM access to function calling. It seems within reach to first have the LLM output a set of statements-to-be-verified (eg, "does X increase when Y increases?") and then use their code-generation/execution step to perform those comparisons.
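A minimal sketch of that suggestion, assuming a made-up claim format: the LLM emits structured statements-to-be-verified, and ordinary code performs the numeric comparison deterministically.

```python
# The LLM emits structured claims; plain code, not the LLM, does the comparison.
# The claim format here is invented for illustration.
import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le, "==": operator.eq}

def verify(claim, results):
    return OPS[claim["op"]](results[claim["lhs"]], results[claim["rhs"]])

experiment_results = {"baseline_loss": 0.412, "proposed_loss": 0.387}
claims = [{"lhs": "proposed_loss", "op": "<", "rhs": "baseline_loss"}]  # "our method lowers loss"

for claim in claims:
    print(claim, "->", verify(claim, experiment_results))
```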
And then the incomprehensible statement for me here is that they allow the model access to its own runtime environment so it can edit its own code?
The paper is 185 pages and only has one paragraph on safety. This screams "viral marketing piece" rather than "serious research".
And finally:
> The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference...
Oh wow, please tell more?
> ... as judged by our automated reviewer.
Ah. Nevermind
Clearly this can only end well
> Aider fails to implement a significant fraction of the proposed ideas.
Yeah, that can be improved a lot with a better agent for code. While Aider is fast and cheap, going with something like Plandex or OpenDevin makes a massive difference, both in quality and cost. For example, Plandex will burn $1 on a simple script, but I can expect that script to work as requested. A mixed approach could be DeepSeek Coder with an agent: a bit worse quality, but cheap enough to run more iterations.