Structured logs are the way to start
Structured logs are crucial for system insight because they make search and aggregation straightforward. Storage is the main challenge, so getting indexing and retention strategies right is key. The article also points to actionable lessons, shared by email, for software automation, release, and troubleshooting.
The article emphasizes the importance of structured logs for understanding system behavior. Because structured logs use a parseable format, like the Apache log format example provided, the data can be tokenized and indexed up front, which makes search and aggregation much easier. The author suggests starting with structured logs because they are familiar and an efficient entry point into monitoring and observability. Scalability becomes a concern, however: logs are costly to store and index, which calls for a solid retention strategy and ongoing maintenance of the logging pipeline. Even with those storage costs, the burden falls on the storage and indexing side rather than on the application itself. The article concludes by highlighting actionable lessons, shared by email, on software automation, release, and troubleshooting.
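As a rough illustration of the idea (not code from the article): a structured log line carries the same information as an Apache-style access line, but as named, machine-parseable fields that are easy to tokenize and index. The sketch below assumes Jackson for JSON serialization, and the field names are made up for the example.

```java
import com.fasterxml.jackson.databind.ObjectMapper;

import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

public class StructuredLogExample {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Emit one log event as a single JSON line (easy to tokenize and index later).
    static void logRequest(String method, String path, int status, long durationMs) throws Exception {
        Map<String, Object> event = new LinkedHashMap<>();
        event.put("ts", Instant.now().toString());
        event.put("level", "INFO");
        event.put("msg", "request handled");
        event.put("http_method", method);
        event.put("http_path", path);
        event.put("http_status", status);
        event.put("duration_ms", durationMs);
        System.out.println(MAPPER.writeValueAsString(event));
    }

    public static void main(String[] args) throws Exception {
        // Instead of: 127.0.0.1 - - [10/Oct/2024:13:55:36] "GET /index.html HTTP/1.1" 200 ...
        logRequest("GET", "/index.html", 200, 12);
    }
}
```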
Related
This works with any repeatable task and identifier, such as runs of a cron job or user ids.
Much better is well-thought-out error handling. It shows exactly when and where something went wrong, and if your error handler supports it, context information such as the variables on the current stack is reported as well.
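A minimal sketch of that idea (my example, not the commenter's code): capture the relevant local values at the call site and hand them to the error handler along with the exception, instead of reporting a bare stack trace.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Map;

public class ContextualErrors {
    private static final Logger log = LoggerFactory.getLogger(ContextualErrors.class);

    static void chargeCustomer(String customerId, long amountCents) {
        try {
            // ... the actual work would go here ...
            throw new IllegalStateException("payment provider timeout");
        } catch (Exception e) {
            // Report the failure together with the local state that explains it.
            Map<String, Object> context = Map.of(
                    "customerId", customerId,
                    "amountCents", amountCents);
            log.error("chargeCustomer failed, context={}", context, e);
            throw e;
        }
    }
}
```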
Add manageable background jobs to the recipe, which you can restart after fixing the code...
This helps in 99.99% of all cases.
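A rough sketch of what that pattern might look like (my assumption, not the commenter's setup): failed jobs stay persisted along with their input, so after you deploy a fix you flip them back to pending and let the worker pick them up again.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class RestartableJobs {
    enum Status { PENDING, DONE, FAILED }

    static class Job {
        final String payload;        // enough input to re-run the job later
        Status status = Status.PENDING;
        String lastError;
        Job(String payload) { this.payload = payload; }
    }

    // In a real system this would be a database table, so failed jobs
    // survive a restart and can be retried after the code is fixed.
    private final List<Job> jobs = new ArrayList<>();

    void submit(String payload) { jobs.add(new Job(payload)); }

    // Worker pass: run every pending job, but keep failures instead of dropping them.
    void runPending(Consumer<String> handler) {
        for (Job job : jobs) {
            if (job.status != Status.PENDING) continue;
            try {
                handler.accept(job.payload);
                job.status = Status.DONE;
            } catch (Exception e) {
                job.status = Status.FAILED;
                job.lastError = e.toString();
            }
        }
    }

    // After deploying a fix, flip failures back to pending and run them again.
    void retryFailed(Consumer<String> handler) {
        for (Job job : jobs) {
            if (job.status == Status.FAILED) job.status = Status.PENDING;
        }
        runPending(handler);
    }
}
```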
- Add OTel-based instrumentation to generate traces
- Salt-and-hash the PII that the API Gateway injects in plain text into each request (user id, etc.) and propagate it to downstream services via Baggage
- Inject all of this context, such as the trace id and the hashed PII, into the logs
- Provide Log4j and Logback Layout implementations that structure the logs as JSON (a sketch follows this list)
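A minimal sketch of how those pieces could fit together (my illustration; the commenter's actual Log4j/Logback Layout implementations are not shown): read the plain-text user id, salt-and-hash it, put the hash into OTel Baggage, and copy the trace id and hashed id into the logging MDC so a JSON layout can emit them on every line.

```java
import io.opentelemetry.api.baggage.Baggage;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.context.Scope;
import org.slf4j.MDC;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

public class LogContextFilter {
    // Assumed to be a per-environment secret; with a fixed salt the same
    // user id always hashes to the same value, so logs stay correlatable.
    private static final String SALT = System.getenv("PII_HASH_SALT");

    static String saltedHash(String pii) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        digest.update(SALT.getBytes(StandardCharsets.UTF_8));
        byte[] hash = digest.digest(pii.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(hash);
    }

    // Called once per incoming request, e.g. from a servlet filter.
    static void enrichContext(String plainUserIdFromGateway) throws Exception {
        String hashedUserId = saltedHash(plainUserIdFromGateway);

        // Propagate the hashed id (never the plain one) to downstream services.
        // The returned Scope should be closed when the request completes (omitted here).
        Scope scope = Baggage.current().toBuilder()
                .put("user.id.hash", hashedUserId)
                .build()
                .makeCurrent();

        // Copy trace id and hashed id into the MDC so a JSON layout
        // (Log4j/Logback) can add them to every log line.
        MDC.put("trace_id", Span.current().getSpanContext().getTraceId());
        MDC.put("user_id_hash", hashedUserId);
    }
}
```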
Logs are compressed and ingested into AWS S3, so storing that volume of logs is not expensive either. AWS provides a tool called S3 Select to search structured logs in S3. We built a Golang Cobra-based CLI tool that is aware of the structure we defined and lets us search the logs in all possible ways, even by PII, without that PII ever being saved.
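Their tool is a Go/Cobra CLI; for illustration only, here is a sketch of the underlying S3 Select call using the AWS SDK for Java v1, assuming gzip-compressed JSON Lines log objects. The bucket, key, and field names are made up.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CompressionType;
import com.amazonaws.services.s3.model.ExpressionType;
import com.amazonaws.services.s3.model.InputSerialization;
import com.amazonaws.services.s3.model.JSONInput;
import com.amazonaws.services.s3.model.JSONOutput;
import com.amazonaws.services.s3.model.JSONType;
import com.amazonaws.services.s3.model.OutputSerialization;
import com.amazonaws.services.s3.model.SelectObjectContentRequest;
import com.amazonaws.services.s3.model.SelectObjectContentResult;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class S3SelectLogSearch {
    public static void main(String[] args) throws Exception {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Each S3 object is assumed to be gzip-compressed JSON Lines,
        // one structured log event per line.
        SelectObjectContentRequest request = new SelectObjectContentRequest()
                .withBucketName("my-log-bucket")                      // illustrative name
                .withKey("service-a/2024/01/15/part-0001.json.gz")    // illustrative key
                .withExpressionType(ExpressionType.SQL)
                .withExpression(
                        "SELECT * FROM S3Object s WHERE s.trace_id = 'abc123' OR s.user_id_hash = 'deadbeef'")
                .withInputSerialization(new InputSerialization()
                        .withJson(new JSONInput().withType(JSONType.LINES))
                        .withCompressionType(CompressionType.GZIP))
                .withOutputSerialization(new OutputSerialization().withJson(new JSONOutput()));

        SelectObjectContentResult result = s3.selectObjectContent(request);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                result.getPayload().getRecordsInputStream(), StandardCharsets.UTF_8))) {
            reader.lines().forEach(System.out::println);   // matching log events
        }
    }
}
```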
In just 2 months, with 2 people, we were able to build this stack, integrate it into 100+ microservices, and get rid of CloudWatch. This not only saved us a lot of money on the CloudWatch side but also improved our ability to search the logs with a lot of context when issues happen.
Then in my standard error logs I always just include this event ID plus an actual description of the error and its context from the call site. These logs are usually very small and easy to analyze to spot the error, and every log line includes the event ID that was being processed when it was generated.
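One common way to get that "event ID on every line" property (my sketch, not necessarily the commenter's approach) is to put the ID into the logging MDC for the duration of processing, so the layout stamps it onto every line automatically.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class EventProcessor {
    private static final Logger log = LoggerFactory.getLogger(EventProcessor.class);

    void process(String eventId, String payload) {
        MDC.put("event_id", eventId);   // every log line from here on carries the ID
        try {
            log.info("processing started");
            // ... actual work ...
        } catch (RuntimeException e) {
            // Small, focused error log: the description and call-site context;
            // the event ID comes along automatically via the MDC.
            log.error("processing failed, payloadLength={}", payload.length(), e);
            throw e;
        } finally {
            MDC.remove("event_id");
        }
    }
}
```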
For instance, you could maintain an in-memory copy of the log DB (same schema) for each HTTP/logical request context and then conditionally write it to disk if an exception occurs. The path of that request-trace SQLite DB can then be recorded in a metadata SQLite DB that tracks exceptions. This gets you away from all clients serializing through the same WAL on the happy path and also minimizes disk I/O.
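A rough sketch of that shape (my interpretation, using the plain SQLite JDBC driver from org.xerial rather than an in-memory SQLite backup): buffer the request's trace rows in memory, and only when an exception occurs write them to a per-request SQLite file and register that file in the exceptions metadata DB.

```java
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class RequestTrace {
    record Row(long tsMillis, String level, String message) {}

    private final List<Row> rows = new ArrayList<>();   // happy path: memory only, no disk IO
    private final String requestId;

    RequestTrace(String requestId) { this.requestId = requestId; }

    void log(String level, String message) {
        rows.add(new Row(System.currentTimeMillis(), level, message));
    }

    // Called only when the request blew up: dump the trace to its own SQLite file
    // and record that file in the metadata DB that tracks exceptions.
    void persistOnException(Path dir, Throwable error) throws Exception {
        Path traceDb = dir.resolve("trace-" + requestId + ".db");
        try (Connection c = DriverManager.getConnection("jdbc:sqlite:" + traceDb)) {
            try (Statement s = c.createStatement()) {
                s.execute("CREATE TABLE log (ts INTEGER, level TEXT, message TEXT)");
            }
            try (PreparedStatement ps = c.prepareStatement("INSERT INTO log VALUES (?, ?, ?)")) {
                for (Row r : rows) {
                    ps.setLong(1, r.tsMillis());
                    ps.setString(2, r.level());
                    ps.setString(3, r.message());
                    ps.executeUpdate();
                }
            }
        }
        try (Connection meta = DriverManager.getConnection("jdbc:sqlite:" + dir.resolve("exceptions.db"));
             Statement s = meta.createStatement()) {
            s.execute("CREATE TABLE IF NOT EXISTS exceptions (request_id TEXT, db_path TEXT, error TEXT)");
            try (PreparedStatement ps = meta.prepareStatement("INSERT INTO exceptions VALUES (?, ?, ?)")) {
                ps.setString(1, requestId);
                ps.setString(2, traceDb.toString());
                ps.setString(3, error.toString());
                ps.executeUpdate();
            }
        }
    }
}
```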
I hope 2024 is the year we realize that if we make log levels dynamically updatable, we can have our cake and eat it too. We feel stuck in a world where logging is either useless because it's off, or on and expensive. All you need is a way to easily change the log level without restarting, and this gets a lot better.
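With Logback, for example, levels can already be flipped at runtime without a restart. A minimal sketch (the HTTP endpoint or config watcher that would call this is left out):

```java
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class DynamicLogLevels {
    // Change the level of a logger (and its children) at runtime, no restart needed.
    static void setLevel(String loggerName, String level) {
        Logger logger = (Logger) LoggerFactory.getLogger(loggerName);
        logger.setLevel(Level.toLevel(level));
    }

    public static void main(String[] args) {
        setLevel("com.example.payments", "DEBUG");  // turn on detail only where needed
        setLevel("com.example.payments", "WARN");   // and back off when done
    }
}
```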
Almost like a local breakpoint debugger on crash, but for prod.