July 11th, 2024

Binance built a 100PB log service with Quickwit

Binance migrated Elasticsearch clusters to Quickwit, achieving 1.6 PB of indexing per day and handling 100 PB of logs. Benefits include reduced costs, improved scalability, and better log management, with further improvements planned.

Binance successfully migrated multiple petabyte-scale Elasticsearch clusters to Quickwit, achieving remarkable results such as scaling indexing to 1.6 PB per day and operating a search cluster handling 100 PB of logs. By leveraging Quickwit's advantages like native Kafka integration, VRL transformations, and object storage as primary storage, Binance reduced compute costs by 80% and storage costs by 20x. The transition addressed challenges faced with Elasticsearch, including operational complexity, limited retention, and reliability issues. Binance's engineers overcame scaling challenges by deploying separate indexing clusters for high-throughput topics and creating a unified metastore for efficient searching through petabytes of logs. The migration to Quickwit brought significant benefits, including reduced computing resources, lower storage costs, improved log retention, and streamlined maintenance operations. This successful collaboration between Binance and Quickwit engineers showcases the potential of Quickwit for large-scale log management and sets the stage for future improvements in data compression and multi-cluster support.

Related

Redis Alternative at Apache Software Foundation Now Supports RediSearch and SQL

A new query engine, KQIR, supports SQL and RediSearch queries for Apache Kvrocks, a Redis-compatible database. It aims to combine performance with transaction guarantees and complex query support, utilizing an intermediate language for consistency. Future plans include expanding field types and enhancing transaction guarantees.

Our great database migration

Shepherd, an insurance pricing company, migrated from SQLite to Postgres to boost performance and scalability for their pricing engine, "Alchemist." The process involved code changes, adopting Neon database, and optimizing performance post-migration.

Self-Hosting Kusto

The article delves into self-hosting Kusto for timeseries analytics, detailing setup with the Kusto Emulator in Docker. It covers connecting, database management, data import, querying with KQL, and anomaly detection. Speed and efficiency are highlighted.

Show HN: BerqWP – Core Web Vitals and PageSpeed Optimization Using Cloud

BerqWP is a speed optimization plugin ensuring websites achieve a speed score of 90+ on mobile and desktop. It offers cache warmup, image optimization, lazy loading, critical CSS, JavaScript optimization, CDN, and Web Vitals Analytics. Premium features available for 10 pages with a free license key. Compatible with various servers and CDNs, supporting multilingual and e-commerce sites. Easy to use with dedicated support.

turbopuffer: Fast Search on Object Storage

Simon Hørup Eskildsen founded turbopuffer in 2023 to offer a cost-efficient search engine using object storage and SSD caching. Notable customers experienced 10x cost reduction and improved latency. Application-based access.

19 comments
By @KaiserPro - 10 months
A word of caution here: This is very impressive, but almost entirely wrong for your organisation.

Most log messages are useless 99.99% of the time. The best likely outcome is that it's turned into a metric. The once-in-a-blue-moon outcome is that it tells you what went wrong when something crashed.

Before you get to shipping _petabytes_ of logs, you really need to start thinking in metrics. Yes, you should log errors, and you should also make sure they are stored centrally and are searchable.

But logs shouldn't be your primary source of data, metrics should be.

Things like connection time, upstream service count, memory usage, transactions a second, failed transactions, and upstream/downstream endpoint health should all be metrics emitted by your app (or hosting layer) directly. Don't try to derive them from structured logs. It's fragile, slow and fucking expensive.

Comparing, cutting and slicing metrics across processes or even services is simple; with logs it's not.
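
For illustration, a minimal sketch of what emitting metrics directly from the app could look like, here using the Prometheus Python client; the metric names, labels, and values are invented, not anything from the article or the commenter's setup:

    # Minimal sketch (assumed names): emit transaction metrics directly from the
    # app instead of deriving them later from structured logs.
    import random
    import time

    from prometheus_client import Counter, Gauge, Histogram, start_http_server

    TRANSACTIONS = Counter("transactions_total", "Transactions processed", ["status"])
    UPSTREAM_HEALTHY = Gauge("upstream_endpoints_healthy", "Healthy upstream endpoints")
    CONNECT_TIME = Histogram("upstream_connect_seconds", "Upstream connection time")

    def handle_transaction():
        with CONNECT_TIME.time():              # records latency as a histogram sample
            time.sleep(random.uniform(0.001, 0.01))
        status = "ok" if random.random() > 0.01 else "failed"
        TRANSACTIONS.labels(status=status).inc()

    if __name__ == "__main__":
        start_http_server(8000)                # metrics exposed on :8000/metrics
        UPSTREAM_HEALTHY.set(3)
        while True:
            handle_transaction()

Counters and histograms like these cost a few bytes per scrape no matter how many transactions were handled, which is the cost argument the comment is making.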

By @ZeroCool2u - 10 months
There was a time at the beginning of the pandemic when my team was asked to build a full-text search engine on top of a bunch of SharePoint sites in under 2 weeks and with frustratingly severe infrastructure constraints (no cloud services, a single on-prem box for processing, among other things), and we did, and it served its purpose for a few years. Absolutely no one should emulate what we built, but it was an interesting puzzle to work on, and we were able to quickly cut through a lot of bureaucracy that had held us back for a few years wrt accessing the sensitive data they needed to search.

But I was always looking for other options for rebuilding the service within those constraints and found Quickwit when it was under active development. I really admire their work ethic and their engineering. Beautifully simple software that tends to Just Work™. It's also one of the first projects that made me really understand people's appreciation for Rust, beyond just loving Cargo.

By @randomtoast - 10 months
I wonder how much their setup costs. Naively, if one were to simply feed 100 PB into Google BigQuery without any further engineering efforts, it would cost about 3 million USD per month.
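
A rough back-of-the-envelope check on that figure, assuming roughly $0.02 per GB per month for BigQuery active storage (an assumed list price; ingestion, queries, and regional differences excluded):

    # Back-of-the-envelope storage cost, assuming ~$0.02/GB/month for BigQuery
    # active logical storage (assumed list price; queries and ingestion excluded).
    petabytes = 100
    gb_per_pb = 1024 * 1024                    # 1 PB ~= 1,048,576 GB (binary units)
    price_per_gb_month = 0.02                  # assumed USD list price
    monthly_storage = petabytes * gb_per_pb * price_per_gb_month
    print(f"~${monthly_storage / 1e6:.1f}M per month for storage alone")
    # -> roughly $2M/month before ingestion and query costs, in the same
    #    ballpark as the ~$3M figure above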
By @piterrro - 10 months
Reminds me of the time Coinbase paid DataDog $65M for storing logs[1]

[1] https://thenewstack.io/datadogs-65m-bill-and-why-developers-...

By @AJSDfljff - 10 months
Unfortunately, the interesting part is missing.

It's not hard at all to scale to PB. Chunk your data based on time and scale horizontally. When you can scale horizontally, it doesn't matter how much data there is.
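
As a toy illustration of that time-chunking idea (the names and format are invented), routing each record to a daily partition lets old data be dropped or archived wholesale:

    # Toy sketch: route each log record to a time-based partition so old
    # partitions can be dropped or archived wholesale instead of deleted row by row.
    from datetime import datetime, timezone

    def partition_for(timestamp_s: float, prefix: str = "logs") -> str:
        """Return a daily partition/index name like 'logs-2024.07.11'."""
        day = datetime.fromtimestamp(timestamp_s, tz=timezone.utc)
        return f"{prefix}-{day:%Y.%m.%d}"

    print(partition_for(1720656000))  # -> logs-2024.07.11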

Elastic is not something I would use for horizontally scaling basic logs. I would use it for live data that I need with little latency, or if I'm constantly doing a lot of live log analysis.

Did Binance really need Elastic, or did they just start pushing everything into Elastic without ever looking left and right?

Did they do any log processing and cleanup before?

By @endorphine - 10 months
What would you use for storing and querying long-term audit logs (e.g. 6 months retention), which should be searchable with subsecond latency and would serve 10k writes per second?

AFAICT this system feels like a decent choice. Alternatives?

By @RIMR - 10 months
I am having trouble understanding how any organization could ever need a collection of logs larger than the size of the entire Internet Archive. 100 PB is staggering, and the idea of filling that with logs, while entirely possible, just seems completely useless given the cost of managing that kind of data.

This is on a technical level quite impressive though, don't get me wrong, I just don't understand the use case.

By @ram_rar - 10 months
>Limited Retention: Binance was retaining most logs for only a few days. Their goal was to extend this to months, requiring the storage and management of 100 PB of logs, which was prohibitively expensive and complex with their Elasticsearch setup.

Just to give some perspective: the Internet Archive, as of January 2024, attests to having stored ~99 petabytes of data.

Can someone from Binance/quickwit comment on their use case that needed log retention for months? I have rarely seen users try to access actionable _operations_ log data beyond 30 days.

I wonder how much more $$ they could save by leveraging tiered storage and engineers being mindful of logging.

By @kstrauser - 10 months
How? If Binance had a trillion transactions, that’s 100KB per transaction. What all are they logging?
By @sebstefan - 10 months
> On a given high-throughput Kafka topic, this figure goes up to 11 MB/s per vCPU.

There's got to be a 2x to 10x improvement to be made there, no? No way CPU is the limitation these days, and even bad hard drives will support 50+ MB/s write speeds.
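
For a sense of what that per-vCPU figure implies at the article's stated ingest rate (the 1.6 PB/day and 11 MB/s numbers come from the article; the rest is simple arithmetic):

    # How many vCPUs does 1.6 PB/day imply at 11 MB/s per vCPU?
    pb_per_day = 1.6
    bytes_per_day = pb_per_day * 1e15
    mb_per_s_total = bytes_per_day / 86400 / 1e6   # ~18,500 MB/s sustained
    vcpus_at_11 = mb_per_s_total / 11              # ~1,700 vCPUs
    vcpus_at_50 = mb_per_s_total / 50              # ~370 vCPUs if it reached 50 MB/s
    print(f"{mb_per_s_total:,.0f} MB/s total -> ~{vcpus_at_11:,.0f} vCPUs at 11 MB/s, "
          f"~{vcpus_at_50:,.0f} at 50 MB/s")

Whether raw write speed is the real limit depends on how much of that per-core budget goes to parsing, transforms, and building the inverted index rather than disk I/O.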

By @jiggawatts - 10 months
The article talks about 1.6 PB / day, which is 150 Gbps of log ingest traffic sustained. That's insane.

A change of the logging protocol to a more efficient format would yield such a huge improvement that it would be much cheaper than this infrastructure engineering exercise.

I suspect that everyone just assumes that the numbers represent the underlying data volume, and that this cannot be decreased. Nobody seems to have heard of write amplification.

Let's say you want to collect a metric. If you do this with a JSON document format you'd likely end up ingesting records that are like the following made-up example:

    {
      "timestamp": "2024-07-12T14:30:00Z",
      "serviceName": "user-service-ba34sd4f14",
      "dataCentre": "eastFoobar-1",
      "zone": 3,
      "cluster: "stamp-prd-4123",
      "instanceId": "instance-12345",
      "object": "system/foo/blargh",
      "metric: "errors",
      "units": "countPerMicroCentury",
      "value": 391235.23921386
    }
It wouldn't surprise me if in reality this was actually 10x larger. For example, just the "resource id" of something in Azure is about this size, and it's just one field of many collected by every logging system in that cloud for every record. Similarly, I've cracked open the protocols and schema formats for competing systems and found 300x or worse write amplification to be the typical case.

The actual data that needed to be collected was just:

    391235.23921386
In a binary format that would be 4 bytes, 8 if you think that you need to draw your metric graphs with a vertical precision of a millionth of a pixel and horizontal precision of a minute because you can't afford the exabytes of storage a higher collection frequency would require.

If you collect 4 bytes per metric in an array and record the start timestamp and the interval, you don't even need a timestamp per entry, just one per thousand or whatever. For a metric collected every second that's just 10 MB per month before compression. Most metrics change slowly or not at all and would compress down to mere kilobytes.
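
A rough sketch of that comparison, packing float32 samples behind a single start timestamp and interval versus one JSON record per sample (the field names echo the made-up example above; the numbers are purely illustrative):

    # Rough size comparison: one JSON record per sample vs. packed float32 samples
    # sharing a single start timestamp and interval. Purely illustrative numbers.
    import json
    import struct

    samples = [391235.239 + i for i in range(3600)]    # one hour of 1 Hz samples

    json_bytes = sum(
        len(json.dumps({
            "timestamp": f"2024-07-12T14:{i // 60:02d}:{i % 60:02d}Z",
            "serviceName": "user-service-ba34sd4f14",
            "metric": "errors",
            "value": v,
        }))
        for i, v in enumerate(samples)
    )

    # Binary block: start timestamp (8 bytes) + interval (4 bytes) + float32 values.
    binary_bytes = len(struct.pack("<dI", 1720794600.0, 1) +
                       struct.pack(f"<{len(samples)}f", *samples))

    print(f"JSON:   {json_bytes:>9,} bytes")
    print(f"Binary: {binary_bytes:>9,} bytes  (~{json_bytes / binary_bytes:.0f}x smaller)")

Even this sketch only carries a handful of the fields from the example above, so the real-world amplification the comment describes would be larger still.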

By @elchief - 10 months
maybe drop the log level from debug to info...
By @inssein - 10 months
Why do so many companies insist on shipping their logs via Kafka? I can't imagine delivery semantics are necessary for logs, and if they are, that data shouldn't be in your logs?
By @ukuina - 10 months
Does QuickWit support regex search now? The underlying store, Tantivy, already does.

This is what stopped a PoC cold at an earlier project.

By @cletus - 10 months
Just browsing the Quickwit documentation, it seems like the general architecture here is to write JSON logs but store them compressed. Is this just something like gzip compression? A 20% compressed size does seem to align with ballpark estimates for gzipped JSON. This is what Quickwit (and this page) calls a "document": a single JSON record (just FYI).
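
One quick way to sanity-check that ballpark is to gzip a batch of repetitive JSON log records and look at the ratio (the record shape here is invented; real logs with more varied payloads will compress less well):

    # Sanity check: how well does gzip compress repetitive JSON log records?
    # The record shape is invented; repeated keys and similar values are what
    # make the ratio favorable, so real logs will usually compress less well.
    import gzip
    import json
    import random

    records = [
        json.dumps({
            "timestamp": f"2024-07-11T14:30:{i % 60:02d}Z",
            "level": random.choice(["INFO", "WARN", "ERROR"]),
            "service": "order-gateway",
            "message": "request completed",
            "latency_ms": random.randint(1, 500),
        })
        for i in range(10_000)
    ]
    raw = "\n".join(records).encode()
    packed = gzip.compress(raw)
    print(f"raw {len(raw):,} B -> gzip {len(packed):,} B "
          f"({len(packed) / len(raw):.0%} of original)")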

Additionally you need to store indices because this is what you actually search. Indices have a storage cost when you write them too.

When I see a system like this my thoughts go to questions like:

- What happens when you alter an index configuration? Or add or remove an index?

- How quickly do indexes update when this happens?

- What about cold storage?

Data retention is another issue. Indexes have config for retention [1]. It's not immediately clear to me how document retention works, possibly from S3 expiration?

So, network transfer from S3 is relatively expensive ($0.05/GB standard pricing [2] to the Internet, less to AWS regions). This will be a big factor in cost. I'm really curious to know how much all of this actually costs per PB per month.

IME you almost never need to log and store this much data. Most logs are useless, and you also have to question what the purpose is of any given log. Even if you're logging errors, you're likely to get the exact same value out of a 1% sample of logs as you are from logging everything.

You might even get more value with 1% sampling, because querying and monitoring might be a whole lot easier with substantially less data to deal with.

Likewise, metrics tend to work just as well from sampled data.
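
A common way to do that kind of sampling deterministically is to hash a stable key (a trace or request ID) and keep a fixed fraction; a sketch, with the key name invented:

    # Deterministic 1% sampling: hash a stable key so all log lines for a sampled
    # request are kept together, rather than dropping individual lines at random.
    import hashlib

    def keep(trace_id: str, rate: float = 0.01) -> bool:
        digest = hashlib.sha256(trace_id.encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
        return bucket < rate

    kept = sum(keep(f"trace-{i}") for i in range(100_000))
    print(f"kept {kept} of 100,000 (~{kept / 1000:.1f}%)")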

This post suggests 60-day log retention (100 PB / 1.6 PB daily). I would probably divide this into:

1. Metrics storage. You can get this from logs, but you'll often find it useful to write it directly if you can. Getting it from logs can be error-prone (e.g. a log format changes, the sampling rate changes, and so on);

2. Sampled data, generally for debugging. I would generally try to keep this at 10TB or less;

3. "Offline" data, which you would generally only query if you absolutely had to. This is particularly true on S3, for example, because the write costs are basically zero but the read costs are expensive.

Additionally, you'd want to think about data aggregation, as a lot of your logs are only useful when combined in some way.

[1]: https://quickwit.io/docs/overview/concepts/indexing

[2]: https://aws.amazon.com/s3/pricing/

By @evdubs - 10 months
Lots of storage to log all of the wash trading on their platform.
By @kristopolous - 10 months
So people don't build this out themselves?

Regardless, there's some computer somewhere serving this. How do they service 1.6 PB per day? Are we talking tape backup? Disks? I've seen these mechanical arms that can pick tapes from a stack on a shelf, is that what is used? (example: https://www.osc.edu/sites/default/files/press/images/tapelib...)

For disks that's like ~60/day without redundancy. Do they have people just constantly building out and onlining machines in some giant warehouse?
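
Rough disk math behind that guess (the drive size and redundancy overhead below are assumptions, not figures from the article; with 20 TB drives the numbers come out a bit higher than ~60/day, but in the same ballpark):

    # Rough disk math for 1.6 PB/day of new data, assuming 20 TB drives and a
    # hypothetical 1.5x overhead for replication/erasure coding.
    pb_per_day = 1.6
    drive_tb = 20                  # assumed drive capacity
    redundancy = 1.5               # assumed storage overhead
    drives_per_day = pb_per_day * 1000 / drive_tb * redundancy
    print(f"~{drives_per_day:.0f} drives/day at this ingest rate "
          f"(~{pb_per_day * 1000 / drive_tb:.0f} without redundancy)")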

I assume there's built in redundancy and someone's job to go through and replace failed units?

This all sounds like it's absurdly expensive.

And I'd have to assume they deal with at least 100x that scale because they have many other customers.

Like what is that? 6,000 disks a day? Really?

I hear these numbers of petabyte storage frequently. I think Facebook is around 5PB/daily. I've never had to deal with anything that large. Back in the colo days I saw a bunch of places but nothing like that.

I'm imagining forklifts moving around pallets of shrink wrapped drives that get constantly delivered

Am I missing something here?

Places like AWS should run tours. It'd be like going to the mint.

By @ATsch - 10 months
It's always very amusing how all of the blockchain companies wax lyrical about all of the huge supposed benefits of blockchains and how every industry and company is missing out by not adopting them and should definitely run a hyperledger private blockchain buzzword whatever.

And then, even when faced with implementing a huge, audit-critical, distributed append-only store, the thing they tell us blockchains are so useful for, they just use normal database tech like the rest of us. With one centralized infrastructure where most of the transactions in the network actually take place. Whose tech stack looks suspiciously like every other financial institution's.

I'm so glad we're ignoring 100 years of securities law to let all of this incredible innovation happen.