June 24th, 2024

Structured logs are the way to start

Structured logs are crucial for system insight, aiding search and aggregation. Despite storage challenges, prioritizing indexing and retention strategies is key. Valuable lessons can be gleaned from email for software processes.


The article emphasizes the importance of structured logs for understanding system behavior. Structured logs, with a parseable format like the Apache log format example provided, enable easier search and aggregation by tokenizing and indexing data upfront. The author suggests starting with structured logs due to their familiarity and efficiency in monitoring and observability. However, scalability becomes a concern as logs can be costly to store and index, leading to a need for a solid retention strategy and maintenance of the logging pipeline. Despite the potential challenges related to storage costs, the focus remains on the storage and indexing aspects rather than the application itself. The article concludes by highlighting the value of actionable lessons learned through email for software automation, release, and troubleshooting.
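To make "tokenizing upfront" concrete, here is a minimal Go sketch that turns one Apache Common Log Format line into named fields. The `Entry` type, the `parseCLF` function, and the sample line are illustrative assumptions, not taken from the article.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"time"
)

// clf matches: host ident user [time] "request" status bytes
var clf = regexp.MustCompile(`^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-)$`)

// Entry is a hypothetical structured form of one access-log line.
type Entry struct {
	Host    string
	User    string
	Time    time.Time
	Request string
	Status  int
	Bytes   int64
}

// parseCLF tokenizes a Common Log Format line into an Entry.
func parseCLF(line string) (Entry, error) {
	m := clf.FindStringSubmatch(line)
	if m == nil {
		return Entry{}, fmt.Errorf("not a Common Log Format line: %q", line)
	}
	ts, err := time.Parse("02/Jan/2006:15:04:05 -0700", m[4])
	if err != nil {
		return Entry{}, err
	}
	status, err := strconv.Atoi(m[6])
	if err != nil {
		return Entry{}, err
	}
	var size int64 // "-" means no body was sent; keep it as 0
	if m[7] != "-" {
		if size, err = strconv.ParseInt(m[7], 10, 64); err != nil {
			return Entry{}, err
		}
	}
	return Entry{Host: m[1], User: m[3], Time: ts, Request: m[5], Status: status, Bytes: size}, nil
}

func main() {
	line := `127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326`
	e, err := parseCLF(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", e)
}
```

Once each line is an `Entry`, questions like "all 5xx responses for this host" become field comparisons rather than text searches.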

11 comments
By @tail_exchange - 7 months
Contextual logging makes structured logging even more powerful. For example, you can attach an ID to the context of an HTTP request when you receive it, which then gets logged in every operation that is performed for serving that request. If you are investigating what happened for a specific request, then you can just search for its ID.

This works with any repeatable task and identifier, such as runs of a cron job or user IDs.

By @m-a-r-c-e-l - 7 months
IMO manual logging, and especially logging consistently, is hard in a team: language skills, cultural background, personal preferences, etc. all get in the way.

Much better is well-thought-out error handling. This shows exactly when and where something went wrong. If your error handler supports it, even context information like all the variables of the current stack is reported.

Add manageable background jobs to the recipe, which you can restart after fixing the code...

This helps in 99.99% of all cases.

By @siddharthgoel88 - 7 months
For Java applications, we built a structured logging library that does a few things:

  - Add OTel-based instrumentation to generate traces
  - Compute a salted hash of PII such as the user ID (injected in plain text by the API Gateway in each request) and propagate it to downstream services via Baggage
  - Inject all this context, like the trace ID and hashed PII, into the logs
  - Provide Log4j and Logback Layout implementations to structure logs in JSON format

Logs are compressed and ingested into AWS S3, so storing this many logs is not expensive.

AWS provides a tool called S3 Select to search structured logs/info in S3. We built a Go CLI tool based on Cobra, which is aware of the structure we have defined and allows us to search the logs in all possible ways, even by PII, without ever storing the PII itself.

In just 2 months, with 2 people, we were able to build this stack, integrate it into 100+ microservices, and get rid of CloudWatch. This not only saved us a lot of money on the CloudWatch side but also improved our ability to search logs with a lot of context when issues happen.
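Searching by PII without storing it can work because the hash is deterministic: the search tool hashes the term with the same salt and looks for the hash. The sketch below is in Go rather than Java, and it uses HMAC-SHA256 as the keyed hash. That choice is an assumption, since the comment does not say which construction the library uses. `hashPII` and the salt value are hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashPII returns a keyed hash of value. The same salt and value always
// give the same result, so the hash can serve as a search key, while the
// raw value never has to be written to the logs.
func hashPII(salt []byte, value string) string {
	mac := hmac.New(sha256.New, salt)
	mac.Write([]byte(value))
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	// Hypothetical salt; in practice it would come from a secret store.
	salt := []byte("example-salt")

	logged := hashPII(salt, "user-1234") // what the service writes to the log
	query := hashPII(salt, "user-1234")  // what the search tool computes later

	fmt.Println("log field:", logged)
	fmt.Println("match:", logged == query)
}
```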

By @hipadev23 - 7 months
A 6-paragraph text post can't be shown because of a database connection error. Come on, guys.

By @gnabgib - 7 months
Server seems to be struggling? https://archive.is/mhm4u

By @akira2501 - 7 months
My habit lately has been to have a "request event" object that picks up context as it works its way through the layers and then is fully saved to disk, referenced by its unique event number. In Go this is usually just a map. These logs are usually very large and get rotated into archive and deletion very quickly.

Then in my standard error logs I always just include this event ID and an actual description of the error and its context from the call site. These logs are usually very small and easy to analyze to spot the error, and every log line includes the event ID that was being processed when it was generated.

By @foota - 7 months
I was just talking to some acquaintances the other day and asked them what they used for structured logging. They looked at me like "what's that?", and I remembered that people don't do it everywhere.

By @bob1029 - 7 months
I think SQLite is maybe the best option if you can get a bit clever around the scalability implications.

For instance, you could maintain an in-memory copy of the log DB schema for each HTTP/logical request context and then conditionally back it up to disk if an exception occurs. The request trace SQLite DB path could then be recorded in a metadata SQLite DB that tracks exceptions. This gets you away from all clients serializing through the same WAL on the happy path and also minimizes disk IO.

By @jdwyah - 7 months
"A breakpoint for logging is usually scalability, because they are expensive to store and index."

I hope 2024 is the year where we realize that if we make the log levels dynamically updatable we can have our cake and eat it too. We feel stuck in a world where all logging is either useless because it's off, or on and expensive. All you need is a way to easily modify the log level without restarting and this gets a lot better.

By @telotortium - 7 months
One great thing about Go is its built-in structured logging package, `log/slog`, available since Go 1.21. Not only can you output in multiple structured formats, but Go also widely uses its `context.Context` type to pass request-level information, so you can easily attach a requestID, sessionID, etc.
By @langsoul-com - 7 months
Is there any way to replay an error at the moment of a crash? I've found that logs alone aren't that useful; it still takes eons to find out wtf happened.

Almost like a local breakpoint debugger on crash, but for prod.