Throw more AI at your problems
The article advocates for using multiple LLM calls in AI development, emphasizing task breakdown, cost management, and improved performance through techniques like RAG, fine-tuning, and asynchronous workflows.
The article discusses the evolving landscape of AI application development, emphasizing a strategy of utilizing multiple LLM (Large Language Model) calls to address problems effectively. The authors, Vikram Sreekanti and Joseph E. Gonzalez, argue that rather than relying on a single powerful model, breaking down tasks into smaller components and employing a combination of techniques—such as retrieval-augmented generation (RAG) and fine-tuning—can lead to better performance, lower costs, and improved reliability. They highlight the importance of managing costs and latency by using smaller models for simpler tasks and suggest that parallelization and asynchronous workflows can enhance user experience. The authors also note that this approach can increase resilience against prompt hacking, as stricter output limits can be enforced at each stage of the pipeline. They advocate for a gradual improvement of AI components over time, allowing for the replacement of larger models with more efficient, task-specific ones. Ultimately, they encourage developers to embrace the use of multiple LLM calls as a means to create smarter, more effective AI applications.
- Utilizing multiple LLM calls can enhance AI application performance.
- Combining techniques like RAG and fine-tuning is more effective than relying on a single approach.
- Smaller models can be used for simpler tasks to manage costs and latency.
- Parallelization and asynchronous workflows improve user experience (see the sketch after this list).
- This approach increases resilience against prompt hacking and allows for incremental improvements.
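
A minimal sketch of the parallelization point above, assuming an OpenAI-compatible async client; the model name, prompts, and sub-tasks are illustrative, not from the article:

```python
import asyncio
from openai import AsyncOpenAI  # any async LLM client would work here

client = AsyncOpenAI()

async def call_llm(prompt: str, model: str = "gpt-4o-mini") -> str:
    # Each sub-task gets its own (small, cheap) model call.
    resp = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def analyze(document: str) -> list[str]:
    # Independent sub-tasks run concurrently instead of in sequence.
    tasks = [
        call_llm(f"Summarize this text:\n{document}"),
        call_llm(f"Classify the sentiment of this text:\n{document}"),
        call_llm(f"Extract named entities from this text:\n{document}"),
    ]
    return await asyncio.gather(*tasks)

# results = asyncio.run(analyze("..."))
```

Because the sub-tasks are independent, total latency is roughly that of the slowest call rather than the sum of all three.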
Related
Reading through this, I could not tell if this was a parody or real. That robot image slopped in the middle certainly didn't help.
For example, I want to scrape a collection of sites. The agent would at first put the whole HTML into the context to extract the data (expensive, but it works), but then another agent that sees this pipeline says "hey, we can write a parser for this site so each scrape is cheaper" and iteratively replaces that segment in a way that does not disrupt the overall task.
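
A rough sketch of that self-optimizing loop; the helper names and the parser registry are my assumptions, not the commenter's actual code:

```python
from urllib.parse import urlparse

# Hypothetical registry of cheap, site-specific parsers that a second
# agent writes and validates over time, keyed by domain.
SITE_PARSERS: dict = {}

def extract_with_llm(html: str) -> dict:
    # Expensive fallback: put the whole page into the model's context.
    # Stubbed here; in practice this would be a real LLM call.
    return {}

def scrape(url: str, html: str) -> dict:
    domain = urlparse(url).netloc
    parser = SITE_PARSERS.get(domain)
    if parser is not None:
        try:
            return parser(html)   # cheap, deterministic path
        except Exception:
            pass                  # parser drifted; fall back to the LLM
    data = extract_with_llm(html)
    # A background agent could compare this output against the raw HTML,
    # generate a parser, test it, and register it in SITE_PARSERS so the
    # next scrape of this domain skips the LLM entirely.
    return data
```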
Instead I implemented low-tech "RAG", or "data source rules". It's a list of general rules you can attach to a particular data source (i.e., a database). The rules are included in the generations and work great. Examples are "Wrap tables and columns in quotes" or "Limit results to 100". It's simple and effective - I can execute the generated SQL against my DB for insights.
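
Presumably something like the following; the storage shape and prompt layout are my guesses at the commenter's setup:

```python
# Rules attached to a particular data source, included verbatim in
# every generation against that source.
DATA_SOURCE_RULES = {
    "analytics_db": [
        "Wrap tables and columns in quotes.",
        "Limit results to 100.",
    ],
}

def build_sql_prompt(source: str, schema: str, question: str) -> str:
    rules = "\n".join(f"- {r}" for r in DATA_SOURCE_RULES.get(source, []))
    return (
        f"Schema:\n{schema}\n\n"
        f"Rules for this data source:\n{rules}\n\n"
        f"Write a SQL query that answers: {question}"
    )
```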
Just a reminder that smaller fine-tuned models are just as good as large models at solving the problems they were trained to solve.
> Oftentimes, a call to Llama-3 8B might be enough if you need a simple classification step or to analyze a small piece of text.
Even 3B param models are powerful nowadays, especially if you are willing to put the time into prompt engineering. My current side project is working on simulating a small fantasy town using a tiny locally hosted model.
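
For anyone curious how cheap this is to try: a sketch of calling a small locally hosted model, assuming an Ollama server on its default port; the model tag is illustrative:

```python
import json
import urllib.request

def ask_local_model(prompt: str, model: str = "llama3.2:3b") -> str:
    # Ollama's generate endpoint; localhost:11434 is its default address.
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```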
> When you have a pipeline of LLM calls, you can enforce much stricter limits on the outputs of each stage
Having an LLM output a number from 1 to 10, or "error" makes your schema really hard to break.
All you need to do is parse the output, and if it isn't a number from 1 to 10... just assume it is garbage.
A system built up like this is much more resilient, and also honestly more pleasant to deal with.
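
A minimal sketch of that kind of gate (the function name is mine):

```python
def parse_rating(raw: str) -> int | None:
    # This stage may only emit an integer 1-10 (or "error"); anything
    # else is treated as garbage and rejected.
    text = raw.strip()
    if text.isdigit() and 1 <= int(text) <= 10:
        return int(text)
    return None  # "error", prose, stray JSON, injection attempts...

assert parse_rating(" 7 ") == 7
assert parse_rating("Sure! Here's a rating: 7/10") is None
```

The caller can then retry or fall back instead of letting a malformed output propagate to the next stage.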
I'm a bit confused by their framing: they seem to think it's a good thing, yet are puzzled about why the subject has "disappeared from the conversation".
Could anyone here shed some light / share an opinion on it/why "long context windows" aren't discussed any more? Did everyone decide they're not useful? Or they're so obviously useful that nobody wastes time discussing them? Or...
I recently discovered BERTopic, a Python library that bundles a five-step pipeline of (relatively) older NLP approaches, very similar to how we were already doing it, wrapped in a handy one-liner. I think it's a great exemplar of the kind of tooling that will probably emerge on top once the hype storm settles.
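
For reference, the one-liner in question, roughly as in BERTopic's documented quick-start (the 20 newsgroups corpus is just the stock demo dataset):

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# One call wraps the whole pipeline: embed -> reduce dimensions (UMAP)
# -> cluster (HDBSCAN) -> tokenize -> weight topic words (c-TF-IDF).
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())
```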
(Disclaimer: I am not an AI expert and will defer to real data/stats nerds on this.)