Observability 2.0 and the Database for It
Observability 2.0 integrates metrics, logs, and traces into a unified framework called wide events, addressing data silos and enabling retroactive analysis. GreptimeDB supports this paradigm with efficient storage and query capabilities.
Observability 2.0 is a new paradigm in monitoring systems that emphasizes the integration of metrics, logs, and traces into a unified framework known as "wide events." This approach, introduced by Charity Majors, aims to overcome the limitations of traditional observability methods that often rely on siloed data, leading to inefficiencies and inconsistencies. Wide events serve as a single source of truth, allowing for high-cardinality, context-rich data storage that can be analyzed retroactively without the need for pre-aggregation. This shift addresses challenges such as data silos, redundant information, and the need for static instrumentation. GreptimeDB is presented as a solution tailored for Observability 2.0, designed to handle the unique demands of wide events, including efficient storage, real-time query capabilities, and flexible indexing. The database architecture supports both routine and exploratory queries, ensuring compatibility with existing observability tools while enhancing data utility. Key challenges in adopting this new model include generating context-rich events, efficient data transport, and cost-effective storage solutions. GreptimeDB aims to provide a robust infrastructure that meets these needs, facilitating a seamless transition to Observability 2.0.
- Observability 2.0 integrates metrics, logs, and traces into a unified framework called wide events.
- GreptimeDB is designed to support the unique requirements of Observability 2.0, focusing on high-cardinality data.
- The new paradigm addresses issues like data silos and redundant information, allowing for retroactive analysis.
- Key challenges include generating context-rich events and ensuring efficient data transport and storage.
- The architecture of GreptimeDB supports both routine and exploratory queries, enhancing existing observability tools.
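For concreteness, a wide event is a single context-rich record per unit of work (one per request or job run) rather than separate metrics, logs, and spans. A minimal sketch in Python, with illustrative field names that are not taken from the article:

    import json, time, uuid

    def handle_request(user_id, cart):
        start = time.time()
        event = {
            "timestamp": start,
            "trace_id": str(uuid.uuid4()),
            "service": "checkout",
            "endpoint": "/api/checkout",
            "user_id": user_id,            # high-cardinality fields are fine
            "cart_items": len(cart),
            "region": "eu-west-1",
            "build_sha": "abc123",
        }
        try:
            event["status"] = 200          # ... do the actual work here ...
        except Exception as exc:
            event["status"] = 500
            event["error"] = repr(exc)
        finally:
            event["duration_ms"] = (time.time() - start) * 1000
            print(json.dumps(event))       # ship to a telemetry backend instead

Metrics, log lines, and trace attributes are then derived from these records at query time rather than pre-aggregated.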
Related
Datadog Is the New Oracle
Datadog faces criticism for high costs and limited access to observability features. Open Source tools like Prometheus and Grafana are gaining popularity, challenging proprietary platforms. Startups aim to offer affordable alternatives, indicating a shift towards mature Open Source observability platforms.
Is it time to version observability?
The article outlines the transition from Observability 1.0 to 2.0, highlighting structured logs for better data analysis, improved debugging, and enhanced software development, likening its impact to virtualization.
Show HN: Oodle – serverless, fully-managed, drop-in replacement for Prometheus
Oodle.ai has created a cost-efficient metrics observability system that processes over 1 billion time series per hour, enhancing scalability and performance while integrating easily with existing tools and protocols.
A Practitioner's Guide to Wide Events
Jeremy Morrell discusses Wide Event-style instrumentation in software engineering, highlighting its benefits for debugging, the importance of emitting single events, and practical guidance on tools and techniques for effective data analysis.
Open-source Rust database tops JSONBench using DataFusion
GreptimeDB excelled in the JSONBench benchmark, outperforming ClickHouse and VictoriaLogs, achieving top query speed for 1 billion JSON documents, and offering cost-effective, efficient solutions for large-scale observability data.
- Many commenters see wide events as a rebranding of existing structured logging practices, emphasizing that the real challenge lies in consistent instrumentation.
- Concerns are raised about the complexity and cost of storing and querying raw data, with suggestions for using columnar databases for efficiency.
- Several users advocate for existing tools like Elasticsearch, Opensearch, and ClickHouse as effective solutions for observability needs.
- There is skepticism about the practicality of the wide events approach, with some preferring simpler, traditional metrics and logging systems.
- Some commenters propose alternative methods, such as using Kafka for event handling, to achieve better observability without the overhead of wide events.
1. Juice up your traces with every attribute possible.
2. Use a telemetry backend that relies on cheap object storage so that your costs don't explode.
3. ...profit?
Ok, but now we are exporting and storing everything about every request just so we can derive some previously cheap metrics like server CPU consumption? I guess for most applications the overhead of buffering, formatting and sending all of this telemetry data doesn't matter for folks?
I believe it should be possible now, with AI, to train tiny online models of how systems behave in production and then ship those models to the edge to compress wide-event and metrics data. Capturing higher-level behavior can also be very powerful for anomaly and outlier detection.
For systems that can afford the compute cost (I/O or network bound), this approach may be useful.
This approach should work particularly well for mobile observability.
The mistake many teams make is to worry about storage but not querying. Storing data is the easy part; querying is the hard part. Some columnar data format stored in S3 doesn't solve querying. You need a system that loads all those files and creates indices or performs some map-reduce logic to get answers out of them. If you get this wrong, things get really expensive very quickly.
What you actually want is a database (probably a columnar one) that provides fast access and can query across your data efficiently at scale. That's not observability 2.0 but observability 101. Without that, you have no observability. You just have a lot of data that is hard to query and that provides no observability unless you somehow manage to solve that. Yahoo figured this out 20 or so years ago when they created Hadoop, HDFS, and all the rest.
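To make that concrete: Parquet files in a bucket only become observability once an engine can scan and aggregate them. A hedged sketch using DuckDB as one possible query layer (the bucket path and schema are hypothetical, and S3 credentials are assumed to be configured):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")   # needed for s3:// paths
    con.execute("LOAD httpfs")

    # Hypothetical layout: one Parquet file per hour of wide events.
    p95 = con.sql("""
        SELECT endpoint,
               approx_quantile(duration_ms, 0.95) AS p95_ms,
               count(*)                            AS requests
        FROM read_parquet('s3://my-telemetry/events/2024-06-01/*.parquet')
        WHERE status >= 500 OR duration_ms > 1000
        GROUP BY endpoint
        ORDER BY p95_ms DESC
    """).df()
    print(p95)

Whether this is fast and cheap enough at scale depends entirely on the indexing and pruning the engine can do, which is the point the comment is making.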
The article is right to call out the fragmented landscape here. Many products only provide partial/simplistic solutions and they don't integrate well with each other.
I started out doing some of this stuff more than 10 years ago using Elasticsearch and Kibana; Grafana, the Kibana fork, hadn't happened yet. This combination is still a good solution for logging, metrics, and traces. These days, Opensearch (the Elasticsearch fork) is a good alternative. Basically, the blob of JSON used in the article, with a nice mapping, would work fine in either. That's more or less what I did around 2014.
Create a data stream, define some life cycle policies (data retention, rollups, archive/delete, etc.), and start sending data. Both Opensearch and Elasticsearch have stateless versions now that store in S3 (or similar bucket based storage). Exactly like the article proposes. I'd recommend going with Elasticsearch. It's a bit richer in features. But Opensearch will do the job.
This is not the only solution in this space but it works well enough.
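As a rough sketch of that setup with the Elasticsearch 8.x Python client (template and field names are made up, lifecycle/rollup policies are omitted, and API details vary by client version):

    from datetime import datetime, timezone
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Index template marking "wide-events*" indices as a data stream.
    es.indices.put_index_template(
        name="wide-events",
        index_patterns=["wide-events*"],
        data_stream={},
        template={"mappings": {"properties": {"duration_ms": {"type": "float"}}}},
    )

    # One wide event per request; writes to a data stream must use op_type=create.
    es.index(
        index="wide-events",
        op_type="create",
        document={
            "@timestamp": datetime.now(timezone.utc).isoformat(),
            "service": "checkout",
            "status": 200,
            "duration_ms": 42.0,
        },
    )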
Just like Postgres became the default choice for operational/relational workloads, I think ClickHouse is quickly becoming (or should become) the standard for analytical workloads. In both cases they "just work". Postgres even has columnar storage extensions, but I still think ClickHouse is a better choice if you don't need transactions.
A rule of thumb I think devs should follow would be: use Postgres for operational cases, and ClickHouse for analytical ones. That should cover most scenarios well, at least until you encounter something unique enough to justify deeper research.
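For illustration, the ClickHouse side of that rule of thumb might look like the following sketch using the clickhouse-connect client; the table schema and names are mine, not a recommendation from the thread:

    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")

    # A columnar, time-partitioned table for wide events (schema is illustrative).
    client.command("""
        CREATE TABLE IF NOT EXISTS wide_events (
            ts          DateTime64(3),
            service     LowCardinality(String),
            endpoint    String,
            user_id     String,
            status      UInt16,
            duration_ms Float64
        )
        ENGINE = MergeTree
        PARTITION BY toDate(ts)
        ORDER BY (service, ts)
    """)

    # Analytical query: error rate per service over the last hour.
    result = client.query("""
        SELECT service, countIf(status >= 500) / count() AS error_rate
        FROM wide_events
        WHERE ts > now() - INTERVAL 1 HOUR
        GROUP BY service
    """)
    print(result.result_rows)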
I am a big fan of the idea of keeping as much of the original data and context as possible. With previous metrics systems, we lost too much information to pre-aggregation and eventually ran into the high-cardinality metrics issue by overwhelming the labels. For teams that own hundreds of millions to billions of time series, this o11y 2.0/wide-event approach is really worth it. And we are determined to build an open-source database that can deal with the challenges of wide events for users, from small teams to large organizations.
Of course, the database is not the only issue. We need full tooling, from instrumentation to data transport. We already have the opentelemetry-arrow project for larger-scale transmission, which may work for wide events. We will continue to work in this ecosystem.
I’ve been using Loki recently and really like the approach: it stores log data in object storage and supports on-the-fly processing and extraction. You can build alerts and dashboards off it without needing to pre-aggregate or force everything into a metrics pipeline.
The real friction in all of these systems is instrumentation. You still need to get that structured event data out of your app code in a consistent way, and that part is rarely seamless unless your runtime or framework does most of it for free. So while wide events are a clean unification model, the dev overhead to emit them with enough fidelity is still very real.
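One common way to reduce that friction is to accumulate a single per-request event in a context variable and emit it once at the end; a minimal framework-agnostic sketch (all names are illustrative):

    import contextvars, json, time

    # One mutable event per request, reachable from anywhere in the code path.
    _event = contextvars.ContextVar("wide_event")

    def start_event(**fields):
        _event.set({"start": time.time(), **fields})

    def add_field(key, value):
        _event.get()[key] = value            # call sites enrich the same event

    def finish_event():
        ev = _event.get()
        ev["duration_ms"] = (time.time() - ev.pop("start")) * 1000
        print(json.dumps(ev))                # or hand off to Loki/OTel exporter

    # Usage inside a request handler:
    start_event(endpoint="/api/checkout", user_id="u-123")
    add_field("cart_items", 3)
    add_field("payment_provider", "stripe")
    finish_event()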
The only thing then is that there is no link between logs and metrics, but I guess since they created alloy [1], they could make log and metric labels match so we could select/see both at once?
Oh ok here's a blog post from 2020 saying exactly this: https://grafana.com/blog/2020/03/31/how-to-successfully-corr...
[0]: https://grafana.com/docs/grafana/latest/datasources/tempo/tr... [1]: https://grafana.com/docs/alloy/latest/
A lot of businesses haven't even nailed simple histograms with Prometheus. I wouldn't like observability to become a whole set of problems of its own!
Also, time series are powerful in observability because a lot of issues can be represented as cheap counters, gauges, and distributions. I want to see a paradigm complementary to this simple principle instead of producing nested documents with nested objects.
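Those cheap counters, gauges, and distributions are exactly what the prometheus_client library exposes; a minimal sketch (metric names are illustrative):

    from prometheus_client import Counter, Gauge, Histogram, start_http_server

    REQUESTS = Counter("http_requests_total", "Requests", ["endpoint", "status"])
    IN_FLIGHT = Gauge("http_requests_in_flight", "In-flight requests")
    LATENCY = Histogram("http_request_duration_seconds", "Latency", ["endpoint"])

    def handle(endpoint):
        IN_FLIGHT.inc()
        with LATENCY.labels(endpoint).time():   # observes duration on exit
            status = "200"                      # ... real work here ...
        REQUESTS.labels(endpoint, status).inc()
        IN_FLIGHT.dec()

    start_http_server(8000)                     # exposes /metrics for Prometheus
    handle("/api/checkout")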
* InfluxDB (the newest Rust rewrite)
* ClickHouse-powered solutions (e.g. https://signoz.io)
* ... ?
I'm quite skeptical about the "store raw data" approach. It makes querying much more complex and slower, storage much more expensive, etc.
Columnar databases that can store the data very efficiently are the way to go, IMO. They can still benefit from cheap long-term storage like S3.
> We believe raw data based approach will transform how we use observability data and extract value from it. Yep. We have built quuxLogging on the same premise, but with more emphasis on "raw": Instead of parsing events (wide or not), we treat it fundamentally as a very large set of (usually text) lines and optimized hard on the querying-lots-of-text part. Basically a horizontally scaled (extremely fast) regex engine with data aggregation support.
Having a decent way to get metrics from logs ad-hoc completely solves the metric cardinality explosion.
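Illustratively, "metrics from logs ad hoc" is just aggregation over matched lines. A toy single-machine sketch of the idea (quuxLogging itself is, per the comment, horizontally scaled; the log format here is invented):

    import re
    from collections import defaultdict

    # Derive a per-endpoint median latency metric from raw log lines, ad hoc.
    pattern = re.compile(r"endpoint=(?P<ep>\S+) .*duration_ms=(?P<ms>\d+)")

    durations = defaultdict(list)
    with open("app.log") as logfile:
        for line in logfile:
            m = pattern.search(line)
            if m:
                durations[m.group("ep")].append(int(m.group("ms")))

    for ep, values in durations.items():
        values.sort()
        print(ep, "p50_ms:", values[len(values) // 2])

Because the metric is computed from the raw lines at query time, no label set has to be chosen up front, which is what sidesteps the cardinality explosion.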
From my perspective, this is just structured logging. It doesn’t cover tracing and metrics, at all.
> This process requires no code changes—metrics are derived directly from the raw event data through queries, eliminating the need for pre-aggregation or prior instrumentation.
“requires no code changes”? Well certainly, because by the time you send events like that your code has already bent over backwards to enable them.
Surely I must be missing something.
Perhaps we need a generic database framework that properly and seamlessly caters to both raw and cooked (processed) data for observability, something similar to D4M [1].
[1] D4M: Dynamic Distributed Dimensional Data Model:
A very satisfied user here: traces, metrics, and logs, all in a perfect way.