August 8th, 2024

Is it time to version observability?

The article outlines the transition from Observability 1.0 to 2.0, highlighting structured logs for better data analysis, improved debugging, and enhanced software development, likening its impact to virtualization.

Read original article

The article discusses the evolution of observability in software systems, proposing a shift from "Observability 1.0" to "Observability 2.0." Observability 1.0 relies on traditional metrics, logs, and traces, which are often siloed and provide limited insights into system performance. In contrast, Observability 2.0 emphasizes the use of structured log events as a single source of truth, allowing for more precise data analysis and better understanding of system behavior. The author argues that this transition can unlock significant improvements in how engineers interact with and understand their software, enabling a more proactive approach to development and debugging. The article highlights the importance of semantic versioning in distinguishing between these two generations of observability tools, suggesting that the changes in data storage and analysis methods represent a fundamental shift in the field. The author draws parallels to the virtualization movement, suggesting that the adoption of Observability 2.0 could similarly transform the software development landscape by enhancing feedback loops and enabling more dynamic interactions with production systems.

- Observability is evolving from a metrics-based approach (1.0) to a structured log-based approach (2.0).

- Observability 2.0 allows for more precise data analysis and better understanding of system behavior.

- The shift to Observability 2.0 can enhance the software development lifecycle and improve debugging processes.

- Semantic versioning is proposed as a way to differentiate between the two generations of observability tools.

- The author compares the potential impact of Observability 2.0 to the transformative effects of virtualization in the tech industry.

Structured logs are the way to start

Structured logs are crucial for system insight, aiding search and aggregation. Despite storage challenges, prioritizing indexing and retention strategies is key. Valuable lessons can be gleaned from email for software processes.

DevOps: The Funeral

The article explores Devops' evolution, emphasizing reproducibility in system administration. It critiques mislabeling cloud sysadmins as Devops practitioners and questions the industry's shift towards new approaches like Platform Engineering. It warns against neglecting automation and reproducibility principles.

Bad habits that stop engineering teams from high-performance

Engineering teams face hindering bad habits affecting performance. Importance of observability in software development stressed, including Elastic's OpenTelemetry role. CI/CD practices, cloud-native tech updates, data management solutions, mobile testing advancements, API tools, DevSecOps, and team culture discussed.

Datadog Is the New Oracle

Datadog faces criticism for high costs and limited access to observability features. Open Source tools like Prometheus and Grafana are gaining popularity, challenging proprietary platforms. Startups aim to offer affordable alternatives, indicating a shift towards mature Open Source observability platforms.

Where should visual programming go?

The article discusses the future of visual programming, advocating for its role in enhancing traditional coding through useful visual elements, while outlining four integration levels for diagrams and code.

15 comments

By @Veserv - 9 months

They do not appear to understand the fundamental difference between logs, traces, and metrics. Sure, if you can log every event you want to record, then everything is just events (I will ignore the fact that they are still stuck on formatted text strings as a event format). The difference is what do you do when you can not record everything you want to either at build time or runtime.

Logs are independent. When you can not store every event, you can drop them randomly. You lose a perfect view of every logged event, but you still retain a statistical view. As we have already assumed you can not log everything, this is the best you can do anyways.

Traces are for correlated events where you want every correlated event (a trace) or none of them (or possibly the first N in a trace). Losing events within a trace makes the entire trace (or at least the latter portions) useless. When you can not store every event, you want to drop randomly at the whole trace level.

Metrics are for situations where you know you can not log everything. You aggregate your data at log time, so instead of getting a statistically random sample you instead get aggregates that incorporate all of your data at the cost of precision.

Note that for the purposes of this post, I have ignored the reason why you can not store every event. That is an orthogonal discussion and techniques that relieve that bottleneck allow more opportunities to stay on the happy path of "just events with post-processed analysis" that the author is advocating for.

By @archenemybuntu - 9 months

Id gonna break a nerve and say most orgs overengineer observability. There's the whole topology of otel tools, Prometheus tools and bunch of Long term storage / querying solutions. Very complicated tracing setups. All these are fine if you have a team for maintaining observability only. But your avg product development org can sacrifice most of it and do with proper logging with a request context, plus some important service level metrics + grafana + alarms.

Problem with all these above tools is that, they all seem like essential features to have but once you have the whole topology of 50 half baked CNCF containers set up in "production" shit starts to break in very mysterious ways and also these observability products tend to cost a lot.

By @rbetts - 9 months

I feel like the focus on trace/log/metrics terminology is overshadowing Charity's comments on the presentation and navigation tier, which is really where the focus should be in my experience. Her point about making the curious more effective than the tenured is quite powerful.

Observability databases are quickly adopting columnar database technologies. This is well aligned with wide, sparse columns suitable to wide, structured logs. These systems map well to the query workloads, support the high speed ingest rate, can tolerate some about of buffering on the ingest path for efficiency, and store a ton of data highly compressed, and now readily tier local to cloud storage. Consolidating more of the fact table to this format makes a lot of sense - a lot more sense than running two or three separate database technologies specialized to metrics, logs, and traces. You can now end the cardinality miseries of legacy observability TSDBs.

But the magic sauce in observability platforms is making the rows in the fact table linkable and navigable - getting from a log message to a relevant trace; navigating from an error message in a span to a count of those errors filtered by region or deployment id... This is the complexity in building highly ergonomic observability platforms - all of the transformation, enrichment, and metadata management (and the UX to make it usable).

By @viraptor - 9 months

This is quite frustrating to read. The whole set of assumed behaviours is wrong. I'm happy doing exactly what's described on 2.0 processes while using datadog.

Charity's talk about costs is annoying too. Honeycomb is the most expensive solution I've seen so far. Until they put a "we'll match your logging+metrics contact cost for same volume and features" guarantee on the pricing page, it's just empty talk.

Don't get me wrong, I love the Honeycomb service and what they're doing. I would love to use it. But this is just telling me "you're doing things wrong, you should do (things I'm already doing) using our system and save money (even though pricing page disagrees)".

By @flockonus - 9 months

> Y’all, Datadog and Prometheus are the last, best metrics-backed tools that will ever be built. You can’t catch up to them or beat them at that; no one can. Do something different. Build for the next generation of software problems, not the last generation.

Heard a very similar thing from Plenty Of Fish creator in 2012, I unfortunately believed him; "the dating space was solved". Turns out it never was, and like every space, solutions will keep on changing.

By @zellyn - 9 months

A few questions:

a) You're dismissing OTel, but if you _do_ want to do flame graphs, you need traces and spans, and standards (W3C Trace-Context, etc.) to propagate them.

b) What's the difference between an "Event" and a "Wide Log with Trace/Span attached"? Is it that you don't have to think of it only in the context of traces?

c) Periodically emitting wide events for metrics, once you had more than a few, would almost inevitably result in creating a common API for doing it, which would end up looking almost just like OTel metrics, no?

d) If you're clever, metrics histogram sketches can be combined usefully, unlike adding averages

e) Aren't you just talking about storing a hell of a lot of data? Sure, it's easy not to worry, and just throw anything into the Wide Log, as long as you don't have to care about the storage. But that's exactly that happens with every logging system I've used. Is sampling the answer? Like, you still have to send all the data, even from very high QPS systems, so you can tail-sample later after the 24 microservice graph calls all complete?

Don't get me wrong, my years-long inability to adequately and clearly settle the simple theoretical question of "What's the difference between a normal old-school log, and a log attached to a trace/span, and which should I prefer?" has me biased towards your argument :-)

By @datadrivenangel - 9 months

So the core idea is to move to arbitrarily wide logs?

Seems good in theory, except in practice it just defers the pain to later, like schema on read document databases.

By @firesteelrain - 9 months

It took me a bit to really understand the versioning angle and I think I understand.

The blog discusses the idea of evolving observability practices, suggesting a move from traditional methods (metrics, logs, traces) to a new approach where structured log events serve as a central, unified source of truth. The argument is that this shift represents a significant enough change to be considered a new version of observability, similar to how software is versioned when it undergoes major updates. This evolution would enable more precise and insightful software development and operations.

Unlike separate metrics, logs, and traces, structured log events combine these data types into a single, comprehensive source, simplifying analysis and troubleshooting.

Structured events capture more detailed context, making it easier to understand the "why" behind system behavior, not just the "what."

By @tunesmith - 9 months

Did I miss an elephant in the room?

Wide structured logging to log EVERYTHING? Isn't that just massively huge? I don't see how that would be cheaper.

Related Steven Wright joke: “I have a map of the United States... Actual size. It says, 'Scale: 1 mile = 1 mile.' I spent last summer folding it. I hardly ever unroll it. People ask me where I live, and I say, 'E6.”

By @xyzzy_plugh - 9 months

I was excited by the title and thought that this was going to be about versioning the observability contracts of services, dashboards, alerts, etc., which are typically exceptionally brittle. Boy am I disappointed.

I get what Charity is shouting. And Honeycomb is incredible. But I think this framing overly simplifies things.

Let's step back and imagine everything emitted JSON only. No other form of telemetry is allowed. This is functionally equivalent to wide events albeit inherently flawed and problematic as I'll demonstrate.

Every time something happens somewhere you emit an Event object. You slurp these to a central place, and now you can count them, connect them as a graph, index and search, compress, transpose, etc. etc.

I agree, this works! Let's assume we build it and all the necessary query and aggregation tools, storage, dashboards, whatever. Hurray! But sooner or later you will have this problem: a developer comes to you and says "my service is falling over" and you'll look and see that for every 1 MiB of traffic it receives, it also sends roughly 1 MiB of traffic, but it produces 10 MiB of JSON Event objects. Possibly more. Look, this is a very complex service, or so they tell you.

You smile and tell them "not a problem! We'll simply pre-aggregate some of these events in the service and emit a periodic summary." Done and done.

Then you find out there's a certain request that causes problems, so you add more Events, but this also causes an unacceptable amount of Event traffic. Not to worry, we can add a special flag to only emit extra logs for certain requests, or we'll randomly add extra logging ~5% of the time. That should do it.

Great! It all works. That's the end of this story, but the result is that you've re-invented metrics and traces. Sure, logs -- or "wide events" that are for the sake of this example the same thing -- work well enough for almost everything, except of course for all the places they don't. And now where they don't, you have to reinvent all this stuff.

Metrics and traces solve these problems upfront in a way that's designed to accommodate scaling problems before you suffer an outage, without necessarily making your life significantly harder along the way. At least that's the intention, regardless of whether or not that's true in practice -- certainly not addressed by TFA.

What's more is that in practice metrics and traces today are in fact wide events. They're metrics events, or tracing events. It doesn't really matter if a metric ends up scraped by a Prometheus metrics page or emitted as a JSON log line. That's besides the point. The point is they are fit for purpose.

Observability 2.0 doesn't fix this, it just shifts the problem around. Remind me, how did we do things before Observability 1.0? Because as far as I can tell it's strikingly similar in appearance to Observability 2.0.

So forgive me if my interpretation of all of this is lipstick on the pig that is Observability 0.1

And finally, I get you can make it work. Google certainly gets that. But then they built Monarch anyways. Why? It's worth understanding if you ask me. Perhaps we should start by educating the general audience on this matter, but then I'm guessing that would perhaps not aid in the sale of a solution that eschews those very learnings.

By @FridgeSeal - 9 months

> My other hope is that people will stop building new observability startups built on metrics.

I mean, can you blame them?

Metrics alone are: valuable and useful, prom text format and remote write protocol is widely used, straightforward to implement and a much, much, much smaller slice than “the entirety of the OpenTelemetry spec”. Have you read those documents? Massive, sprawling, terminology for days, it’s confusingly written in places IMO. I know it’s trying to cover a lot of bases all at once (logs, traces AND metrics) and design accordingly to handle all of them properly, so it’s probably fine to deal with if you have large enough team, but that’s not everyone.

To say nothing of the full adoption of opentelemetry data. Prometheus is far from my favourite bit of tech, but setting up scraping and a grafana dashboard is way less shenanigans than setting up open telemetry collection, and validating it’s all correct and present in my experience.

If someone prefers to tackle a slice like metrics only and do it better than the whole hog, more power to them IMO.

By @moomin - 9 months

We came up with a buzzword to market our product. The industry made this buzzword meaningless. Now we’re coming up with a new one. We’re sure the same thing won’t happen again.

By @jrockway - 9 months

I like the wide log model. At work, we write software that customers run for themselves. When it breaks, we can't exactly ssh in and mutate stuff until it works again, so we need some sort of information that they can upload to us. Logs are the easiest way to do that, and because logs are a key part of our product (batch job runner for k8s), we already have infrastructure to store and retrieve logs. (What's built into k8s is sadly inadequate. The logs die when the pod dies.)

Anyway, from this we can get metrics and traces. For traces, we log the start and end of requests, and generate a unique ID at the start. Server logging contexts have the request's ID. Everything that happens for that request gets logged along with the request ID, so you can watch the request transit the system with "rg 453ca13b-aa96-4204-91df-316923f5f9ae" or whatever on an unpacked debug dump, which is rather efficient at moderate scale. For metrics, we just log stats when we know them; if we have some io.Writer that we're writing to, it can log "just wrote 1234 bytes", and then you can post-process that into useful statistics at whatever level of granularity you want ("how fast is the system as a whole sending data on the network?", "how fast is node X sending data on the network?", "how fast is request 453ca13b-aa96-4204-91df-316923f5f9ae sending data to the network?"). This doesn't scale quite as well, as a busy system with small writes is going to write a lot of logs. Our metrics package has per-context.Context aggregation, which cleans this up without requiring any locking across requests like Prometheus does. https://github.com/pachyderm/pachyderm/blob/master/src/inter...

Finally, when I get tired of having 43 terminal windows open with a bunch of "less" sessions over the logs, I hacked something together to do a light JSON parse on each line and send the logs to Postgres: https://github.com/pachyderm/pachyderm/blob/master/src/inter.... It is slow to load a big dump, but the queries are surprisingly fast. My favorite thing to do is the "select * from logs where json->'x-request-id' = '453ca13b-aa96-4204-91df-316923f5f9ae' order by time asc" or whatever. Then I don't have 5 different log files open to watch a single request, it's just all there in my psql window.

As many people will say, this analysis method doesn't scale in the same way as something like Jaeger (which scales by deleting 99% of your data) or Prometheus (which scales by throwing away per-request information), but it does let you drill down as deep as necessary, which is important when you have one customer that had one bad request and you absolutely positively have to fix it.

My TL;DR is that if you're a 3 person team writing some software from scratch this afternoon, "print" is a pretty good observability stack. You can add complexity later. Just capture what you need to debug today, and this will last you a very long time. (I wrote the monitoring system for Google Fiber CPE devices... they just sent us their logs every minute and we did some very simple analysis to feed an alerting system; for everything else, a quick MapReduce or dremel invocation over the raw log lines was more than adequate for anything we needed to figure out.)

By @amelius - 9 months

I can't even run valgrind on many libraries and Python modules because they weren't designed with valgrind in mind. Let's work on observability before we version it.