The AI Scientist: Towards Automated Open-Ended Scientific Discovery
Sakana AI's "The AI Scientist" automates scientific discovery in machine learning, generating ideas, conducting experiments, and writing papers. It raises ethical concerns and aims to improve its capabilities while ensuring responsible use.
Sakana AI has introduced "The AI Scientist," a pioneering system designed for fully automated scientific discovery, particularly in machine learning research. This system utilizes advanced foundation models, including Large Language Models (LLMs), to autonomously conduct research, from generating novel ideas to writing and peer-reviewing scientific papers. The AI Scientist operates through a comprehensive pipeline that includes brainstorming research directions, executing experiments, visualizing results, and producing manuscripts. It can evaluate its own work with near-human accuracy through an automated peer review process, allowing for iterative improvements and the creation of a growing knowledge archive. Initial demonstrations have shown The AI Scientist's ability to contribute novel insights in various subfields, such as diffusion models and transformers, at a low operational cost of approximately $15 per paper. However, the system has limitations, including issues with visual data interpretation and occasional inaccuracies in its findings. Ethical concerns also arise regarding the potential misuse of automated research capabilities, which could strain the academic process and lead to unsafe research practices. The project emphasizes the need for responsible AI development and alignment with ethical standards as the technology evolves.
- The AI Scientist automates the entire research lifecycle, from idea generation to manuscript writing.
- It has demonstrated the ability to produce novel contributions in machine learning research.
- The system operates at a low cost, making scientific research more accessible.
- Ethical concerns include potential misuse and the impact on academic integrity.
- Future developments aim to enhance the system's capabilities and address current limitations.
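For a concrete sense of the pipeline described above, here is a minimal, hypothetical sketch of an idea-to-review loop. Every function, score, and threshold is a placeholder, not Sakana AI's actual code:

```python
# Hypothetical sketch of the idea -> experiment -> paper -> review loop described
# above. Every function, score, and threshold is a stand-in, not Sakana AI's code.
import random

ACCEPT_THRESHOLD = 6  # assumed 1-10 review score, NeurIPS-style

def propose_idea(topic, archive):
    return f"{topic}, variant #{len(archive) + 1}"          # brainstorm + novelty check

def run_experiments(idea):
    return {"baseline": 0.71, "proposed": round(random.uniform(0.68, 0.78), 2)}

def write_manuscript(idea, results):
    return f"Paper on '{idea}': proposed={results['proposed']} vs baseline={results['baseline']}"

def automated_review(paper):
    return {"score": random.randint(3, 8)}                  # LLM reviewer scoring the draft

def run_ai_scientist(topic, max_iterations=3):
    archive = []                                            # growing knowledge archive
    for _ in range(max_iterations):
        idea = propose_idea(topic, archive)
        results = run_experiments(idea)                     # real system: generated code + GPU runs
        paper = write_manuscript(idea, results)
        review = automated_review(paper)
        archive.append({"idea": idea, "paper": paper, "review": review})
    return [e for e in archive if e["review"]["score"] >= ACCEPT_THRESHOLD]

if __name__ == "__main__":
    for entry in run_ai_scientist("diffusion model sampling"):
        print(entry["paper"])
```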
Related
AI Agents That Matter
The article addresses challenges in evaluating AI agents and proposes solutions for their development. It emphasizes the importance of rigorous evaluation practices to advance AI agent research and highlights the need for reliability and improved benchmarking practices.
How I Use AI
The author shares experiences using AI as a solopreneur, focusing on coding, search, documentation, and writing. They mention tools like GPT-4, Opus 3, Devv.ai, Aider, Exa, and Claude for different tasks. Excited about AI's potential but wary of hype.
MIT researchers advance automated interpretability in AI models
MIT researchers developed MAIA, an automated system enhancing AI model interpretability, particularly in vision systems. It generates hypotheses, conducts experiments, and identifies biases, improving understanding and safety in AI applications.
Ask HN: Am I using AI wrong for code?
The author is concerned about underutilizing AI tools for coding, primarily using Claude for brainstorming and small code snippets, while seeking recommendations for tools that enhance coding productivity and collaboration.
Up to 90% of my code is now generated by AI
A senior full-stack developer discusses the transformative impact of generative AI on programming, emphasizing the importance of creativity, continuous learning, and responsible integration of AI tools in coding practices.
- Many academics express skepticism about the ability of AI to replace the hands-on, experiential learning that is crucial in scientific training.
- Concerns are raised about the potential for AI-generated papers to contribute to academic spam and diminish the quality of scientific literature.
- Commenters question the ethical implications and trustworthiness of AI in research, emphasizing the importance of human oversight in validating results.
- Some see the potential for AI to assist in research processes but worry about the risks of over-reliance on automated systems.
- There is a general sentiment that while AI tools may offer efficiencies, they could also undermine the integrity and creativity inherent in scientific discovery.
The reason that we do research is not simply so that we can produce papers and hence amass knowledge in an abstract sense. A huge part of the academic world is training and building up hands-on institutional knowledge within the population so that we can expand the discovery space.
If I went back to cavemen and handed them a copy of _University Physics_, they wouldn't know what to do with it. Hell, if I went back to Isaac Newton, he would struggle. Never mind your average physicist in the 1600s! Both the community as a whole, and the people within it, don't learn by simply reading papers. We learn by building things, running our own experiments, figuring out how other context fits in, and discussing with colleagues. This is why it takes ~1/8th of a lifetime to go from the 'world standard' of knowledge (~high school education) to being a PhD.
I suppose the claim here is that, well, we can just replace all of those humans with AI (or 'augment' them), but there are two problems:
a) the current suite of models is nowhere near sophisticated enough to do that, and their architecture makes extracting novel ideas either very difficult or impossible, depending on who you ask; and
b) every use-case of 'AI' in science that I have seen also removes that hands-on training and experience (e.g. Copilot, in my experience, leads to lower levels of understanding. If I can just tab-complete my N-body code, did I really gain the knowledge of building it?)
This is all without mentioning the fact that the papers that the model seems to have generated are garbage. As an editor of a journal, I would likely desk-reject them. As a reviewer, I would reject them. They contain very limited novel knowledge and, as expected, extremely limited citation to associated works.
This project is cool on its face, but I must be missing something here as I don't really see the point in it.
Allowing an AI agent to automate code, data, or analysis necessitates that a human thoroughly check it for errors. As anyone who has ever written code or a paper knows, this takes at least as long as the initial creation itself, and even longer if you were not the one who wrote it.
Perhaps I am naive and missing something. I see the paper writing aspect as quite valuable as a draft system (as an assistive tool), but the code/data/analysis part I am heavily sceptical of.
Furthermore, this seems like it will merely encourage academic spam, which already wastes the valuable time of volunteer (unpaid) reviewers, editors, and chairs.
They go on to say that the solution is sandboxing, but still, this feels like burying the lede.
It's clear OpenAI is a hype company knocking over glass-bottle stacks at its own wonderful carnival stall. Obviously, if you can reason, you can reason about what is innovative, and we don't need OpenAI to set up fake scary progress markers like an Automated Scientific Organization.
Let's see if scientists even want this style of tech progress; it'd be sad to see multitudes of AI papers having to be rebuilt from scratch and flushed down the toilet because associating with them is taboo.
[1]: https://arstechnica.com/information-technology/2024/07/opena...
A few differences though - I'm working on Materials Science only. Mine has vision capabilities, so it can read graphs in papers. Mine has agentic capabilities too, so it can design and then execute simulations on Atomic Tessellator (my startup) by making API calls - this actual design and execution of simulations is what I aimed for at the start.
Long way to go, but there's a set of heuristics that decides which experiments to attempt, which means we only attempt ones more likely to work; lots of fine-tuning of prompts, self-critique, modelling strategies and tactics as node graphs to avoid getting stuck in what I call procedural local minima, and loads more…
I started with the MetaGPT framework but found its APIs too unstable, so I settled on AutoGen. You don't really "need" a framework; just be sensible about where your abstraction boundaries are, make them simple but composable, Dockerize and use k8s for running. I also modified the binaries of a bunch of quantum chemistry software so that multi-GPU arches are supported without recompilation (my hardware setup is heterogeneous).
Even if the LLMs can’t innovate in a “new sense” certainly having them reproduce work in simulations for me to inspect is very valuable - I have the ability to “fork” simulations like you can fork code so it’s easy to have the LLMs do a bunch of the work and then I just fork and experiment myself
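As an illustration of the heuristic experiment-selection idea mentioned in this comment, here is a minimal sketch. The scoring features, weights, and threshold are all hypothetical, and none of this is Atomic Tessellator's actual API:

```python
# Minimal sketch of a "which experiments are worth attempting" filter.
# Features, weights, and threshold are invented for illustration.
from dataclasses import dataclass

@dataclass
class CandidateSimulation:
    description: str
    est_gpu_hours: float       # rough cost estimate
    novelty: float             # 0-1, e.g. distance from previously run setups
    prior_success_rate: float  # 0-1, fraction of similar runs that converged

def score(c: CandidateSimulation) -> float:
    # Favor cheap, novel candidates with a history of converging.
    return 0.5 * c.prior_success_rate + 0.4 * c.novelty - 0.1 * min(c.est_gpu_hours / 10, 1.0)

def select_experiments(candidates, budget=2, min_score=0.45):
    ranked = sorted(candidates, key=score, reverse=True)
    return [c for c in ranked if score(c) >= min_score][:budget]

candidates = [
    CandidateSimulation("relax perovskite surface",     4.0, 0.7, 0.80),
    CandidateSimulation("exotic high-pressure phase",  40.0, 0.9, 0.20),
    CandidateSimulation("re-run known bulk structure",  1.0, 0.1, 0.95),
]
for c in select_experiments(candidates):
    print("attempt:", c.description)
```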
This is a particularly confusing argument in my opinion. Is the underlying assumption that everyone wants, or even needs, white papers that they can claim they created?
Let's just assume this system actually works and produces high quality, rigorous research findings. Reducing that process down to a dollar amount and driving that cost to near zero doesn't democratize anything; it cheapens it to the point of being worthless.
This honestly reads more as a joke article trolling today's academic process and the whole publish or perish mentality. From that angle, the article is a success in my book. As an announcement for a new ML tool though, I just don't get it.
That’s not “automating scientific discovery”, that’s “procedurally optimizing model architecture” (and one iteration of exploration at that!). In any other field of science the actual work and data generated by the AI Scientist would be a sub-section of the Supporting Info if not just a weekly update to your advisor.
Don’t get me wrong, the actual work done by the humans who are publishing this is a pretty solid piece of engineering and interesting to discuss. But the automated papers, to me, are more a commentary on what constitutes a publishable advancement in AI these days.
Edit: this also further confirms my suspicion about LLMs, which is that they aren’t very good at doing actual work, but they are great at generating the accompanying marketing BS around having done work. They will generate a mountain of flashy but frivolous communication about smaller and smaller chunks of true progress, which while appearing beneficial to individuals, will ultimately result in a race to the bottom of true productivity.
If I might offer some small feedback on the blog post:
- Alt-text and/or caption of the initial image would be helpful for screen readers
- Using both "dramatically" and "radically" in one sentence to describe near future improvements seems a bit much.
- When talking about the models used, "Sonnet" could either be 3.0 Sonnet or 3.5 Sonnet and those have pretty different capabilities.
Thanks again for the impressive work!
I can only see this as a negative, what's the use of automatically generated papers if not to flood the already over-strained volunteers that review papers at conferences? (mostly already-overworked PhD students.) If I wanted a glorified chatbot to spam me with made-up improvements, I'd ask it myself.
The biggest issue was validation. We could get a system to spit out possible research directions automatically, but who decides if they're reasonable and/or promising? A human, of course. Moreover, we gave different humans the same set of hypotheses to validate and they came back with wildly different annotations.
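One way to make "wildly different annotations" concrete is an inter-annotator agreement statistic such as Fleiss' kappa; that measure is not named in the comment, and the labels below are invented, purely to show the calculation:

```python
# Fleiss' kappa as a way to quantify inter-annotator agreement; the labels
# below are hypothetical, not anyone's actual annotations.
from collections import Counter

def fleiss_kappa(ratings):
    """ratings: one list of category labels per item, one label per annotator."""
    n = len(ratings[0])                      # annotators per item (assumed constant)
    N = len(ratings)                         # number of items
    per_item, totals = [], Counter()
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        per_item.append((sum(c * c for c in counts.values()) - n) / (n * (n - 1)))
    p_bar = sum(per_item) / N                                  # observed agreement
    p_exp = sum((v / (N * n)) ** 2 for v in totals.values())   # chance agreement
    return (p_bar - p_exp) / (1 - p_exp)

# Three hypothetical annotators rating five proposed research hypotheses
labels = [
    ["promising", "promising", "weak"],
    ["weak", "promising", "reject"],
    ["promising", "weak", "weak"],
    ["reject", "promising", "weak"],
    ["weak", "weak", "weak"],
]
print(f"Fleiss' kappa: {fleiss_kappa(labels):.2f}")  # near or below zero: no better than chance
```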
"We expect all of these will improve, likely dramatically, in future versions with the inclusion of multi-modal models and as the underlying foundation models"
So much hype, so much believe. I no believe no hype
Did it apply the patch itself, then reset the session to whatever extent necessary to have the new code take effect? TBH I'm not really worried about this either, as long as the execution environment doesn't grant it the ability to bring unlimited additional hardware to bear. Even in that case, presumably some human would be paying attention to the AWS bills, or the moral equivalent.
My impression so far is that science is plagued with deliberate and accidental fraud when it comes to data collection and cataloguing. Also, this is a spectrum, not two distinct things. I often see researchers simply unwilling to do the right thing to verify that the data collected are correct and meaningful as soon as "workable" results can be produced from the data. Some will go further and mess with the data to make results more "workable" though...
The second problem is understanding the data. It often happens that the people who end up doing the research don't quite understand its subject matter. This is especially common in medicine, where it's overwhelmingly typical for, e.g., research into various imaging modalities to be done by computer scientists who couldn't find a liver cancer the size of a coconut in the sharpest textbook abdominal image.
My impression is also that by far these two problems outweigh the problems that could potentially be solved by adding AI into the mix. These are the systemic organization problems of perverse incentives and vicious practices, and no amount of AI is going to do anything about it... because it's about people. People's salaries, careers, friendships etc.
> Presentation of Intermediate Results. The paper contains results for every single experiment that was run. While this is useful and insightful for us to see the evolution of the idea during execution, it is unusual for standard papers to present intermediate results like this.
It's actually quite good that the AI Scientist does this. AI doesn't have the excuse humans do (slow report writing) for omitting intermediate results.
I'd be curious how much of the experimentation process companies like OAI/Anthropic have automated by now for improving their own models.
I truly hate AI and what it is doing to the world. To me at least, as someone who has loved mathematics and science since my grandma started showing me chemistry experiments when I was about 5 years old, this new level of automation is stealing the magic from human curiosity.
They evaluate their automated reviewer by comparing against human evaluations on human-written research papers, and then seem to extrapolate that their automated reviewer would align with human reviewers on AI-written research papers. It seems like there are a few major pitfalls with this.
First, if their systems aren't multimodal, and their figures are lower-quality than human-created figures (which they explicitly list as a limitation), the automated reviewer would be biased in favor of AI-generated papers (only having access to the text). This is an obvious one but I think there could easily be other aspects of papers where the AI and human reviewers align on human-written papers, but not on AI papers.
Additionally, they note:
> Furthermore, the False Negative Rate (FNR) is much lower than the human baseline (0.39 vs. 0.52). Hence, the LLM-based review agent rejects fewer high-quality papers. The False Positive Rate (FNR [sic]), on the other hand, is higher (0.31 vs. 0.17)
It seems like false positive rate is the more important metric here. If a paper is truly high-quality, it is likely to have success w/ a rebuttal, or in getting acceptance at another conference. On the other hand, if this system leads to more low-quality submissions or acceptances via a high FPR, we're going to have more AI slop and increased load on human reviewers.
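For reference, here is how the two quoted rates are computed. The confusion-matrix counts below are hypothetical, chosen only to roughly reproduce the quoted 0.31/0.39 figures; they are not the paper's data:

```python
# "Positive" here means an accepted paper. Counts are made up for illustration.
def rates(tp, fp, tn, fn):
    fpr = fp / (fp + tn)   # low-quality papers the automated reviewer accepts
    fnr = fn / (fn + tp)   # high-quality papers the automated reviewer rejects
    return fpr, fnr

fpr, fnr = rates(tp=30, fp=18, tn=40, fn=19)
print(f"FPR={fpr:.2f}  FNR={fnr:.2f}")   # FPR=0.31  FNR=0.39
```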
I admit I didn't thoroughly read all 185 pages, maybe these concerns are misplaced.
1. Data
2. Access to past works
Once you have these, only then can discoveries be made and papers be written. How does this software get them? I am assuming they have to be provided up-front to the software for each job.
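Purely as a guess at what "provided up-front for each job" could look like, here is a hypothetical job spec; it is not The AI Scientist's real configuration format:

```python
# Hypothetical job spec; a guess at the up-front inputs, not the system's real config.
job = {
    "topic": "low-rank adapters for diffusion models",
    "dataset_path": "/data/cifar10",                   # the experimental data (1)
    "baseline_code": "templates/diffusion_baseline/",  # starting experiment code
    "literature": {"source": "semantic_scholar", "max_papers": 50},  # access to past works (2)
    "compute_budget_gpu_hours": 8,
}
print(job["topic"])
```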
The authors' credibility is a bit hurt when the first "limitation" they mention is "our system doesn't do page layouts perfectly". Come on, guys.
What is weird to me is this:
> The AI Scientist occasionally makes critical errors when writing and evaluating results. For example, it struggles to compare the magnitude of two numbers, which is a known pathology with LLMs. To partially address this, we make sure all experimental results are reproducible, storing all files that are executed.
I'm not sure why you would run your evaluation step without giving your LLM access to function calling. It seems within reach to first have the LLM output a set of statements-to-be-verified (eg, "does X increase when Y increases?") and then use their code-generation/execution step to perform those comparisons.
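A minimal sketch of that suggestion, assuming a made-up claim format: the LLM emits structured statements-to-be-verified, and ordinary code performs the numeric comparison deterministically.

```python
# The LLM emits structured claims; plain code, not the LLM, does the comparison.
# The claim format here is invented for illustration.
import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le, "==": operator.eq}

def verify(claim, results):
    return OPS[claim["op"]](results[claim["lhs"]], results[claim["rhs"]])

experiment_results = {"baseline_loss": 0.412, "proposed_loss": 0.387}
claims = [{"lhs": "proposed_loss", "op": "<", "rhs": "baseline_loss"}]  # "our method lowers loss"

for claim in claims:
    print(claim, "->", verify(claim, experiment_results))
```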
And then the incomprehensible statement for me here is that they allow the model access to its own runtime environment so it can edit its own code?
The paper is 185 pages and only has one paragraph on safety. This screams "viral marketing piece" rather than "serious research".
And finally:
> The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference...
Oh wow, please tell more?
> ... as judged by our automated reviewer.
Ah. Nevermind
Clearly this can only end well
> Aider fails to implement a significant fraction of the proposed ideas.
Yeah, that can be improved a lot with a better agent for code. While Aider is fast and cheap, going with something like Plandex or OpenDevin makes a massive difference, both in quality and cost. For example, Plandex will burn $1 on a simple script, but I can expect that script to work as requested. A mixed approach could be DeepSeek Coder with an agent: a bit worse quality, but cheap enough to run more iterations.