August 18th, 2024

Claude just slashed the cost of building AI applications

ClaudeAI's new Prompt Caching feature lets developers reuse prompt text across API calls, potentially reducing input API costs by up to 90% for applications such as AI assistants, and may prompt competitors to consider similar features.

ClaudeAI has introduced a new feature called Prompt Caching, which significantly reduces the cost of building AI applications. Developers can mark a long, stable portion of a prompt, such as detailed instructions, worked examples, or reference documents, so that repeated requests reuse the cached portion and those input tokens are billed at a fraction of the normal rate. This can cut input API costs by up to 90%, which is particularly valuable for applications that rely on long prompts: AI assistants, code generation, code review, and processing large documents. With lower API costs, developers can either reduce their pricing or increase the profit margins of their software as a service (SaaS) applications. The introduction of Prompt Caching raises the question of whether competitors like OpenAI will implement similar features in the future.

- ClaudeAI's Prompt Caching can reduce input API costs by up to 90%.

- The feature allows developers to reuse lengthy prompts, saving time and money.

- It is particularly useful for AI assistants, code generation, and document processing.

- Developers can lower pricing or increase profit margins due to reduced costs.

- The move may prompt competitors like OpenAI to consider similar features.
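
As a rough sketch of how this looks in practice (based on the beta as announced in August 2024: the model name, the anthropic-beta header, and the cache_control field are taken from Anthropic's launch documentation and may have changed since), a long, stable block is marked as cacheable, and later requests that repeat it verbatim are billed at the cheaper cached rate:

```python
# Sketch: caching a long, stable system block with Anthropic's Python SDK.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

long_reference_text = "<many thousands of tokens of docs, examples, or code>"

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    # Opt in to the prompt-caching beta (header name from the launch docs).
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=[
        {"type": "text", "text": "You are a code review assistant."},
        {
            "type": "text",
            "text": long_reference_text,
            # Cache breakpoint: the first request writes the cache; later
            # requests that repeat this exact prefix read it at a reduced rate.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Review the diff in my latest commit."}],
)

# The usage object reports cache writes and reads alongside normal token counts.
print(response.usage)
```

At launch, writing the cache carried a small premium over normal input tokens, cache reads were billed at roughly a tenth of the normal input price, the cache expired about five minutes after its last use, and blocks below a minimum token length were not cached at all.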

13 comments
By @verdverm - 3 months
FWIW, Gemini / Vertex has this as well and lets you control the TTL. Billing is based on how long you keep the context.

https://ai.google.dev/gemini-api/docs/caching?lang=python

Costs $1 per 1M tokens per hour
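
A minimal sketch of that flow with the google-generativeai Python SDK (the model name, TTL, and placeholder content are illustrative; the docs linked above are the authoritative reference, and cached content has a minimum token count, so it only pays off for genuinely large context):

```python
# Sketch: explicit context caching with the Gemini API's Python SDK.
import datetime

import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="...")  # or set the GOOGLE_API_KEY environment variable

# Create a cache holding the reusable context, with a TTL you control;
# storage is billed for as long as the cache is kept alive.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",
    system_instruction="You answer questions about the attached report.",
    contents=["<large report text that every request needs>"],
    ttl=datetime.timedelta(hours=1),
)

# Requests against a model bound to the cache only pay full input price
# for the new tokens in each request.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
print(model.generate_content("Summarize section 3.").text)

cache.delete()  # stop paying for cache storage once you're done
```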

By @Scene_Cast2 - 3 months
Why does prompt caching reduce costs? I'm assuming that the primary cost driver is GPU/TPU FLOPS, as opposed to any network / storage / etc costs.

My understanding is that an LLM will take in the stream of text, tokenize it (can be faster with caching, sure, but it's a minor drop in the bucket), then run a transformer on the entire sequence. You can't just cache the output of a transformer on a prefix to reduce workload.
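
One way to see where the saving can come from: the model is causal, so the attention key/value states computed for a fixed prefix don't depend on what follows, and they can be stored and replayed. Below is a rough local-model illustration of that prefix reuse via Hugging Face transformers' past_key_values; it's only an analogy, since the hosted services' internals aren't public, and gpt2 is just a small stand-in model.

```python
# Illustration: compute the key/value cache for a shared prefix once,
# then reuse it for several different suffixes, so only the suffix
# tokens need a fresh forward pass.
import copy

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prefix = "You are a helpful assistant. Here are many pages of instructions ..."
prefix_ids = tok(prefix, return_tensors="pt").input_ids

with torch.no_grad():
    # Pay for the long prefix once and keep its key/value cache.
    kv_cache = model(prefix_ids, use_cache=True).past_key_values

    for question in ["What is 2 + 2?", "Name a prime number."]:
        suffix_ids = tok(" " + question, return_tensors="pt").input_ids
        # Deep-copy so each simulated request starts from the clean prefix cache.
        out = model(
            suffix_ids,
            past_key_values=copy.deepcopy(kv_cache),
            use_cache=True,
        )
        next_token = out.logits[:, -1].argmax(dim=-1)
        print(question, "->", tok.decode(next_token))
```

Only the suffix tokens get a fresh forward pass; the prefix's key/value states are reused, which is where the FLOPS saving would show up. Holding those states costs accelerator memory, which is presumably part of why providers charge a premium for cache writes or bill for cache storage time.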

By @w10-1 - 3 months
Setting aside efficiency or accuracy, caching enhances the value of prompt engineering and thus increases the effective value of AI services (how the value is monetized or split is TBD).

Comments suggest that caching the state of the network might also reduce processing.

I wonder if it also permits better A/B-style testing by reducing the effect of cross-domain errors. If the AI service providers made it easy to provide feedback on post-cache responses, they could incorporate that quality-enhancement loop, accelerating time to product-market fit (at the risk of increasing dependency and reducing the ability to switch).

By @WiSaGaN - 3 months
This feature was first introduced by DeepSeek, and DeepSeek will just do it automatically for you.

https://platform.deepseek.com/api-docs/news/news0802/

By @rglover - 3 months
This is great news. Using Claude to build a new SaaS [1] and this will likely save me quite a bit on API costs.

[1] https://x.com/codewithparrot

By @MaximusLegroom - 3 months
I guess they got tired of losing customers to DeepSeek. DeepSeek introduced this feature a while ago, and their prices were already minuscule given that they only have to compute 20B active parameters.

By @nprateem - 3 months
I just tried Claude the other day. What a breath of fresh air after fighting the dogshit that is OpenAI.

Far less "in the realm of", "in today's fast-moving...", multifaceted, delve or other pretentious wank.

There is still some though so they obviously used the same dataset that's overweight in academic papers. Still, I'm hopeful I can finally get it to write stuff that doesn't sound like AI garbage.

Kind of weird there's no moderation API though. Will they just cut me off if my customers try to write about things they don't like?

By @xihajun - 3 months
Prompt + LoRA? Train an adapter?

By @politelemon - 3 months
Will this be making its way to Bedrock?

By @mathgeek - 3 months

By @NBJack - 3 months
Sounds kinda useless, TBH. It seems to assume the exact same context window across requests. If so, given the 5-minute window, you won't really see any savings beyond simple prompts unless, for example, your entire team is operating in the same codebase at the same time.

Are contexts included in the prompt cache? Are they identified as the same or not? What happens if we approach the 10k token range? 128k? 1M?