The anatomy of a 2AM mental breakdown
Zarar experienced a critical issue with his website late at night, feeling overwhelmed and isolated. The problem stemmed from a conflict with the PostHog analytics tool, highlighting the challenges of working solo in tech.
In a recent blog post, Zarar recounts a stressful experience during a late-night debugging session when he faced a critical issue with his website, jumpcomedy.com. At around 2 AM, he realized that all HTTP POST calls were failing, despite the code working perfectly in a local environment. With no immediate support from colleagues or resources, he felt overwhelmed by customer complaints and the pressure to resolve the issue. Zarar expressed feelings of shame and incompetence, grappling with thoughts of shutting down his business. As he attempted to troubleshoot, he encountered various errors and frustrations, including a misleading TypeError message. Despite his efforts, including checking for potential causes like recent updates and configurations, he struggled to identify the root of the problem. Eventually, after a series of trials, he discovered that the issue was linked to the PostHog tool he had integrated, which he subsequently removed, restoring functionality. This experience highlighted the challenges of working alone in tech and the emotional toll of production outages.
- Zarar faced a critical production issue with his website late at night, feeling isolated and overwhelmed.
- He experienced significant stress and self-doubt while troubleshooting the problem.
- The root cause of the issue was identified as a conflict with the PostHog tool.
- The incident underscores the challenges of solo work in tech and the emotional impact of production failures.
- Zarar's experience reflects common struggles faced by developers during high-pressure situations.
Related
Firing Myself
Noormar, a developer, accidentally cleared a production database at a Social Gaming startup, causing revenue losses and customer complaints. The incident led to guilt, a tarnished reputation, and eventual resignation.
The Process That Kept Dying: A memory leak murder mystery (node)
An investigation into a recurring 502 Bad Gateway error on a crowdfunding site revealed a memory leak caused by Moment.js. Updating the library resolved the issue, highlighting debugging challenges.
A tale of using chaos engineering at scale to keep our systems resilient
Tines software engineer Shayon Mukherjee discussed a Redis cluster upgrade incident that revealed a bug affecting customer workflows, highlighting the need for better error handling and resilience testing in system architecture.
"We ran out of columns" – The best, worst codebase
The author reflects on a chaotic codebase, highlighting challenges with a legacy database and a mix of programming languages. Despite flaws, it fostered creativity and problem-solving, leading to notable improvements.
A heck of a wild bug chase
George Mauer detailed a debugging challenge with a Next.js application, where a 401 error arose from missing authentication cookies in production, highlighting the complexities of software development and interconnected tech components.
Then, after you go through a few of these, you'll realize it really isn't too bad, that you've dealt with bad situations before, and you'll gain the confidence to know you can deal with it, even when there's no one you can reach out to for help.
For me it's only happened once. It was an anxiety attack, and I'm very lucky my wife was there to talk me through it and help me understand what was happening. She's had them many times, but it was my first (and thankfully only).
It turns out that this sort of thing happens to people, and that there's nothing wrong with it. It doesn't mean you're defective or weak. That's a really important point to internalize.
Xanax is worth having on hand, since that was what finally ended it for me and I was able to drift off to sleep.
I guess my point is, there's a difference between having intrusive thoughts vs something that debilitates you and that you legitimately can't control, such as an anxiety attack or a panic attack. You won't be getting any work done if those happen, and that's ok.
Highlights two lessons. 1. If you ship it, you own it. Therefore the less you ship, the better. Keep dependencies to a minimum. 2. Keep non-critical things out of the critical path. A failing AC compressor should not prevent your engine from running. Very difficult to achieve in the browser, but worth attempting.
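To make the second lesson concrete, here is a minimal sketch, assuming a bundled browser app and a hypothetical ./vendor/analytics module, of loading an optional integration off the critical path and tolerating its failure:

```ts
// Sketch only: load a hypothetical optional integration after the app is
// interactive, and let it fail without touching core flows.
async function loadOptionalAnalytics(): Promise<void> {
  try {
    // Deferred, failure-tolerant load of a hypothetical vendored module.
    const { init } = await import("./vendor/analytics");
    init();
  } catch (err) {
    // The "AC compressor" failing should not stop the engine.
    console.warn("optional analytics failed to load; continuing without it", err);
  }
}

// Fire and forget once the page has loaded; never await this inside a
// checkout or other user-facing flow.
window.addEventListener("load", () => {
  void loadOptionalAnalytics();
});
```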
https://github.com/PostHog/posthog-js/blob/759829c67fcb8720f...
The biggest lesson here is, if you're writing a popular library that monkey-patches global functions, it needs to be really well tested.
There's a difference between "I'll throw posthog calls in a try/catch just in case" and "With posthog I literally can't make fetch() calls with POST"
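A minimal sketch of the first kind of defence, a call-site try/catch (the `analytics` object and its `capture` method are stand-ins for whatever SDK is in use, not a specific library's API):

```ts
// Stand-in type for whatever analytics SDK is in use.
declare const analytics: {
  capture: (event: string, properties?: Record<string, unknown>) => void;
};

// Guard your own calls so a throwing SDK can't break application logic.
function trackSafely(event: string, properties?: Record<string, unknown>): void {
  try {
    analytics.capture(event, properties);
  } catch (err) {
    console.warn("analytics call failed; ignoring", err);
  }
}
```

This protects against exceptions thrown at your own call sites; it does nothing once the SDK has replaced window.fetch and broken every POST in the application, which is the second, worse case described above.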
I take this as a reminder of the importance of giving precise names to variables. The code
res = await originalFetch(url, init)
looks harmless enough. But in fact the `url` parameter is not necessarily a URL, as the TypeScript declaration makes clear: url: URL | RequestInfo
The problem arises in the case where it is not a URL but a RequestInfo object, which has been “used up” by the construction of the Request object earlier in the function implementation and cannot be used again here. It would have been more difficult to overlook the problem with this change if the parameter were named something more precise, such as `urlOrRequestInfo`.
(A much more speculative idea is the thought that it is possible to formalise the idea of a value being “used up” using linear types, derived from linear logic, so conceivably a suitable type system could prevent this class of bug.)
[0] https://github.com/PostHog/posthog-js/pull/1351/commits/2497...
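For illustration, a minimal sketch of the failure mode described above (hypothetical code, not the actual posthog-js source): the patched fetch builds its own Request from the incoming argument, which consumes the body of an incoming Request object, so forwarding the original argument afterwards fails for POSTs with a body.

```ts
const originalFetch = window.fetch.bind(window);

window.fetch = async (url: URL | RequestInfo, init?: RequestInit): Promise<Response> => {
  // The wrapper builds its own Request to inspect method, headers, body, etc.
  // If `url` is itself a Request carrying a body, this construction marks
  // that body as used.
  const req = new Request(url, init);

  // ... instrumentation reads from `req` here ...

  // Bug: `url` may now be a "used up" Request, and this call throws
  // "TypeError: Request body is already used" for POSTs with a body.
  return originalFetch(url, init);

  // Safer: forward the Request the wrapper just built (or a clone of it):
  //   return originalFetch(req);
};
```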
The cold sweats and shame I felt, man... Plus it's on the App Store so there's the review process to deal with which extends the timeline for a fix. Thankfully, they picked it up for review 30 minutes after submission and approved it in a few minutes.
As an SRE/devops/platform engineer, or whatever the title of the day people want to give it, I would have zeroed in on the difference between the working system and the non-working system: either adding and then removing, or removing and then adding back, the differences one at a time until something worked. What I see is two things: 1) you have an environment where it does work; 2) the failing environment was working, then started failing.
Is my method superior to yours? No. I'm just stating it to highlight the difference in the way we look at a problem. Both of us zero in on what we know: I know systems, you know code.
Reverting your own code, but still using a broken PostHog update from that same day? For me, the lesson is to make sure that I can revert everything, including dependencies.
But most of us have been in some situation similar, if not quite as bad. (Running your own company is going to be uniquely stressful.)
And it’s (IMO) why anonymity online is usually a bad idea - we need to learn, deep in our bones, that what is said online is the same as standing up in front of the church congregation and reading out our tweets - if you would not say it in front of the vicar, don't say it in front of the planet.
With more than one person you can bounce ideas off each other and share the pain so to speak. It's highly desirable.
While I have never experienced anything similar myself, it really helped me to put things in perspective. Since then, I've worked on some critical systems that were actually life or death, but I no longer do. For the /vast/ majority of technology systems, nobody will die if you let the outage last just a few hours longer. The worst case scenario is a financial cost to the company who employs you, which might be your own company. Smart companies de-risk this by getting various forms of business insurance, and that should include you if it's your own company.
So, do everything you can to fix the outage, but approach it with some perspective. It's not life or death, nobody is shooting at you.
In my experience, not vendoring has _always_ led to breakages that are hard to debug and fix.
Meanwhile, vendoring is quite easy nowadays. Every reasonable package manager, and even npm, can do this near-trivially.
Also funny that the culprit was posthog since I have some past experience with it.
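A minimal sketch of the difference being described, assuming a bundled app with a lockfile; the init call shown is illustrative, and the exact options should be checked against the PostHog docs:

```ts
// Vendored / lockfile-pinned dependency: bundled at build time, so it only
// changes when you deliberately update it and redeploy.
import posthog from "posthog-js";

posthog.init("phc_example_project_key", { api_host: "https://us.i.posthog.com" });

// Contrast with a CDN <script> snippet that always serves the vendor's latest
// build: that can start failing "on August 19" with no deploy on your side,
// and there is nothing of yours to revert.
```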
That's why trying to solve problems in the middle of the night just ends up in stress.
It includes:
* Blaming the tools (and the author)
* Not focusing on facts in the timeline
* Not considering improvements
But that doesn't make for engaging content, right?
> * At $TIME we observed HTTP POST calls failing
> * At $TIME customers reported inability to make changes to ticket prices and promo codes
> * $PERSON took the following steps to debug...
> * Root cause: an update to a vendor library resulted in cascading failures to the site
> * 5 whys (which might include lack of defensive programming, the use of a CDN without a fixed version, etc. etc.)
> * Next steps: pin the CDN version or pull the dependency into the build, etc.
Actually, that still looks like a pretty good story to me without any of the associated mania.
I spent hours on a call with the client's sr. engineer and we eventually came up with a script to fix it. It was after midnight; my director said, good job, you are tired, I'll run the script, call it a night.
An hour later the director ran the wrong script... and then called me.
The client's sr. engineer was legitimately flabbergasted; only time I have ever seen that word apply in real life.
Was a no good, very bad day.
> fetch() broken on August 19: TypeError: ...
Not broken at this version, broken on August 19. This is why I'm terrified of putting anything on the web. It is a dark scary place where runtime dependency on servers that you don't control is considered normal.
One day I'll start my own p2p thing with just a bunch of TUI's and I'll only manage to convince six people to use it each for less than a month and then I'll have to go get a real job again but at least I won't have been at the mercy of PostHog.
"We're investigating an issue affecting $X".
As a user, I can rule out that the issue is at my end. I can focus on other things and I won't add to the stack of emails.
This is one of my biggest frustrations with AWS being slow to update their status page during interruptions. I can spend much of the intervening time frantically debugging my software, only to discover the issue is at their end.
And then he rolled out a fix that was broken, too - showing incompetence in development, understanding the problem, and a total failure to do proper QA on the fix.
Royally fucked the pooch twice and he's all "gee golly whillikers!"
I'm struggling to find the lesson to take out of that. Limit your dependencies? Have a safe mode that deactivates everything optional?
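On the "safe mode" idea, a minimal sketch; the query-parameter flag and the integration names are purely illustrative:

```ts
// Hypothetical optional integrations; each one must be skippable.
declare function initAnalytics(): void;
declare function initChatWidget(): void;

// One switch that disables everything optional, e.g. ?safe=1 in the URL.
const SAFE_MODE = new URLSearchParams(window.location.search).has("safe");

const optionalIntegrations: Array<[name: string, init: () => void]> = [
  ["analytics", initAnalytics],
  ["chat-widget", initChatWidget],
];

if (!SAFE_MODE) {
  for (const [name, init] of optionalIntegrations) {
    try {
      init();
    } catch (err) {
      console.warn(`optional integration "${name}" failed; continuing`, err);
    }
  }
}
```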
Primarily, I use a code generator to write most of it.
For huge services it may not be practical, but for most it usually provides a heads-up if something stops working with an integration.
To this day, I simply refuse to do on-call. There's not enough money you could pay me to make me suffer that again.
PS: Fuck you, Rackspace.
In fact, I have dealt with an extremely similar situation where a bunch of calls for one of our APIs were failing silently, but only after they had taken card payment transactions. Dealing with the developers of this system was like pulling teeth; after we got them to stop stammering and stop chipping in with their ideas (after half a day with the issue ongoing), it took 10 minutes to find the culprit by simply going through the system task by task until we got to the failing task (confirmation emails were unable to send, so the API server failed the entire order despite payments being taken, etc.).
This only required two things: knowledge of the system, and a systematic approach to fault finding. You would think the developers would have at least the first, being the ones who wrote it, but sometimes even that is a big ask.
Maybe I'm just burnt out from this industry and incompetent people but... come on... no excuses really.
Use better tools? Know your tools better? Know how to debug better? Add yet another tool to detect the error?
In all big companies where I worked, at the end of such an event it boiled down to answering three questions:
- what happened?
- why did it happen?
- what do we do so it does not ever happen again?
You can start a web service business solo (or with a small handful of folks). But the web doesn't shut down overnight, so either have a plan to get 24-hour support onboarded early or accept that you're going to lose a lot of sleep.
(And if you think that's fun, wait until you trip over a regulatory hurdle and you get to come out of that 2AM code-bash to a meeting with some federal or state agent at 9AM...)
Come on, if POST requests work locally and not on PROD, isn't this an obvious place to start?