July 9th, 2024

turbopuffer: Fast Search on Object Storage

Simon Hørup Eskildsen founded turbopuffer in 2023 to offer a cost-efficient search engine built on object storage and SSD caching. Notable customers have seen a 10x cost reduction and improved latency. Access is currently by application only.


In late 2022, Simon Hørup Eskildsen, co-founder of turbopuffer, was inspired to build a more cost-efficient search engine after facing high infrastructure costs while working on a feature for Readwise. Traditional search engines were expensive and hard to scale, so Eskildsen envisioned one built on object storage and smart caching that could balance cost efficiency with high performance.

turbopuffer, developed in 2023, combines object storage with SSD caching and scales to billions of vectors and millions of tenants. Object storage provides cheap, reliable, and scalable durability, while SSD and memory caches hold actively searched data to keep latency low, yielding significant cost savings compared to traditional solutions. Notable customers like Cursor have experienced a 10x cost reduction and improved latency after migrating to turbopuffer's architecture. The company is currently open by application only while it optimizes the platform for early customers.
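The summary describes a tiered read path: object storage as the durable source of truth, with SSD and memory caches in front for actively searched data. Below is a minimal illustrative sketch of that tiering in Python; the class, bucket, and cache-directory names are invented, and this is not turbopuffer's actual implementation.

    import os

    import boto3


    class TieredStore:
        """Illustrative tiered read path: memory -> local SSD -> object storage."""

        def __init__(self, bucket: str, ssd_dir: str = "/var/cache/index"):
            self.bucket = bucket
            self.ssd_dir = ssd_dir
            self.memory: dict[str, bytes] = {}   # hot in-memory cache
            self.s3 = boto3.client("s3")         # object storage: source of truth
            os.makedirs(self.ssd_dir, exist_ok=True)

        def get(self, key: str) -> bytes:
            # 1. Memory cache: fastest, smallest.
            if key in self.memory:
                return self.memory[key]
            # 2. SSD cache: holds recently searched data.
            path = os.path.join(self.ssd_dir, key.replace("/", "_"))
            try:
                with open(path, "rb") as f:
                    data = f.read()
            except FileNotFoundError:
                # 3. Object storage: highest latency, cheapest, durable.
                obj = self.s3.get_object(Bucket=self.bucket, Key=key)
                data = obj["Body"].read()
                with open(path, "wb") as f:   # populate the SSD cache
                    f.write(data)
            self.memory[key] = data           # populate the memory cache
            return data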

18 comments
By @softwaredoug - 3 months
Having worked with Simon, I can say he knows his sh*t. We talked a lot about what the ideal search stack would look like when we worked together at Shopify on search (him more infra, me more ML+relevance). I discussed how I just want a thing in the cloud to provide my retrieval arms, let me express ranking in a fluent "py-data"-first way, and get out of my way.

My ideal is that turbopuffer ultimately is like a Polars dataframe where all my ranking is expressed in my search API. I could just lazily express some lexical or embedding similarity, boost with various attributes, maybe by recency, popularity, etc., to get a first pass (again, all just with dataframe math). Then compute features for a reranking model I run on my side - dataframe math again - and it "just works": it runs all of this as some kind of query execution DAG and stays out of my way.
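A rough sketch of the "dataframe math" ranking described above, using Polars over a hypothetical candidate set; the column names and weights are invented for illustration, and this is not a real turbopuffer API.

    import polars as pl

    # Hypothetical first-pass candidates, as if returned by lexical and
    # vector retrieval arms, with a few metadata columns.
    candidates = pl.DataFrame({
        "doc_id":     [1, 2, 3, 4],
        "bm25":       [12.3, 8.1, 15.0, 4.2],    # lexical similarity
        "cosine":     [0.82, 0.91, 0.40, 0.77],  # embedding similarity
        "days_old":   [3, 40, 1, 365],
        "popularity": [120, 15, 800, 7],
    })

    # Ranking as plain dataframe math: blend the retrieval scores, then
    # boost by recency and popularity. Weights are arbitrary.
    first_pass = (
        candidates
        .with_columns(
            (
                0.4 * pl.col("bm25") / pl.col("bm25").max()
                + 0.6 * pl.col("cosine")
                + 0.1 / (1 + pl.col("days_old"))
                + 0.05 * pl.col("popularity").log1p()
            ).alias("score")
        )
        .sort("score", descending=True)
        .head(100)
    )
    print(first_pass)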

By @cmcollier - 3 months
Unrelated to the core topic, I really enjoy the aesthetic of their website. Another similar one is from Fixie.ai (also, interestingly, one of their customers).
By @nh2 - 3 months
> $3600.00/TB/month

It doesn't have to be that way.

At Hetzner I pay $200/TB/month for RAM. That's 18x cheaper.

Sometimes you can reach the goal faster with less complexity by removing the part with the 20x markup.

By @omneity - 3 months
> In 2022, production-grade vector databases were relying on in-memory storage

This is irking me. pg_vector has existed since before that, doesn't require in-memory storage, and can definitely handle vector search over 100M+ documents in a decently performant manner. Did they have a particular requirement somewhere?
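For context, a minimal pgvector nearest-neighbour query looks roughly like the sketch below; the table and column names are made up, and it assumes a Postgres instance with the pgvector extension installed.

    import psycopg  # assumes Postgres with the pgvector extension available

    # Hypothetical "documents" table with an "embedding" vector(3) column.
    with psycopg.connect("dbname=docs") as conn:
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        rows = conn.execute(
            "SELECT id FROM documents ORDER BY embedding <=> %s::vector LIMIT 10",
            ("[0.1, 0.2, 0.3]",),
        ).fetchall()
        print(rows)  # ten nearest documents by cosine distance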

By @bigbones - 3 months
Sounds like a source-unavailable version of Quickwit? https://quickwit.io/
By @eknkc - 3 months
Is there a good general-purpose solution where I can store a large read-only database in S3 or something and do lookups directly on it?

DuckDB can open Parquet files over HTTP and query them, but I found it triggers a lot of small requests, reading from a bunch of places in the files. I mean a lot.

I mostly need key/value lookups and could potentially store each key as a separate object in S3, but for a couple hundred million objects... It would be a lot more manageable to have a single file and maybe a cacheable index.
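For what it's worth, the DuckDB pattern described above looks roughly like this (bucket, file, and column names are hypothetical); each row group the filter touches becomes its own range request, which matches the "lots of small reads" behaviour.

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")

    # Point lookup against a Parquet file served over HTTP(S). DuckDB reads the
    # footer/metadata first, then range-requests only the row groups whose
    # min/max statistics could contain the key -- each one a separate request.
    row = con.execute(
        "SELECT value FROM read_parquet("
        "'https://example-bucket.s3.amazonaws.com/kv.parquet') WHERE key = ?",
        ["some-key"],
    ).fetchone()
    print(row)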

By @solatic - 3 months
Is it feasible to try to build this kind of approach (hot SSD cache nodes sitting in front of object storage) with prior open-source art (Lucene)? Or are the search indexes themselves also proprietary in this solution?

Having witnessed some very large Elasticsearch production deployments, I think being able to throw everything into S3 would be incredible. The applicability here isn't only for vector search.

By @zX41ZdbW - 3 months
A correction to the article. It mentions

    Category     Examples                           Read latency   Write latency
    Warehouse    BigQuery, Snowflake, ClickHouse    ≥1s            Minutes
For ClickHouse, it should be: read latency <= 100ms, write latency <= 1s.

ClickHouse is also suitable for logging, real-time analytics, and RAG.

By @drodgers - 3 months
I love the object-storage-first approach; it seems like such a natural fit for the cloud.
By @cdchn - 3 months
The very long introductory page has a ton of very juicy data in it, even if you don't care about the product itself.
By @arnorhs - 3 months
This looks super interesting. I'm not that familiar with vector databases. I thought they were mostly something used for RAG and other AI-related stuff.

Seems like a topic I need to delve into a bit more.

By @endisneigh - 3 months
Slightly relevant - do people really want article recommendations? I don’t think I’ve ever read an article and wanted a recommendation. Even with this one - I sort of read it and that’s it; no feeling of wanting recommendations.

Am I alone in this?

In any case this seems like a pretty interesting approach. Reminds me of Warpstream which does something similar with S3 to replace Kafka.

By @CyberDildonics - 3 months
Sounds like a filesystem with attributes in a database.
By @yawnxyz - 3 months
Can't wait for the day they get into GA!
By @vidar - 3 months
Can you compare to S3 Athena (ELI5)?
By @yamumsahoe - 3 months
Unsure if they are comparable, but how does this compare to Quickwit?
By @hipadev23 - 3 months
Those are some woefully disappointing and incorrect metrics you’ve got for ClickHouse there (read and write latency are both sub-second, and the storage medium would be “Memory + Replicated SSDs”), but I understand what you’re going for and why you categorized it where you did.