The anatomy of a 2AM mental breakdown
Zarar experienced a critical issue with his website late at night, feeling overwhelmed and isolated. The problem stemmed from a conflict with the PostHog analytics tool, highlighting the challenges of working solo in tech.
In a recent blog post, Zarar recounts a stressful experience during a late-night debugging session when he faced a critical issue with his website, jumpcomedy.com. At around 2 AM, he realized that all HTTP POST calls were failing, despite the code working perfectly in a local environment. With no immediate support from colleagues or resources, he felt overwhelmed by customer complaints and the pressure to resolve the issue. Zarar expressed feelings of shame and incompetence, grappling with thoughts of shutting down his business. As he attempted to troubleshoot, he encountered various errors and frustrations, including a misleading TypeError message. Despite his efforts, including checking for potential causes like recent updates and configurations, he struggled to identify the root of the problem. Eventually, after a series of trials, he discovered that the issue was linked to the PostHog tool he had integrated, which he subsequently removed, restoring functionality. This experience highlighted the challenges of working alone in tech and the emotional toll of production outages.
- Zarar faced a critical production issue with his website late at night, feeling isolated and overwhelmed.
- He experienced significant stress and self-doubt while troubleshooting the problem.
- The root cause of the issue was identified as a conflict with the PostHog tool.
- The incident underscores the challenges of solo work in tech and the emotional impact of production failures.
- Zarar's experience reflects common struggles faced by developers during high-pressure situations.
Related
Firing Myself
Noormar, a developer, accidentally cleared a production database at a Social Gaming startup, causing revenue losses and customer complaints. The incident led to guilt, a tarnished reputation, and eventual resignation.
The Process That Kept Dying: A memory leak murder mystery (node)
An investigation into a recurring 502 Bad Gateway error on a crowdfunding site revealed a memory leak caused by Moment.js. Updating the library resolved the issue, highlighting debugging challenges.
A tale of using chaos engineering at scale to keep our systems resilient
Tines software engineer Shayon Mukherjee discussed a Redis cluster upgrade incident that revealed a bug affecting customer workflows, highlighting the need for better error handling and resilience testing in system architecture.
"We ran out of columns" – The best, worst codebase
The author reflects on a chaotic codebase, highlighting challenges with a legacy database and a mix of programming languages. Despite flaws, it fostered creativity and problem-solving, leading to notable improvements.
A heck of a wild bug chase
George Mauer detailed a debugging challenge with a Next.js application, where a 401 error arose from missing authentication cookies in production, highlighting the complexities of software development and interconnected tech components.
Then, after you go through a few of these, you'll realize it really isn't too bad, that you've dealt with bad situations before, and you'll gain the confidence to know you can deal with it, even when there's no one you can reach out to for help.
For me it's only happened once. It was an anxiety attack, and I'm very lucky my wife was there to talk me through it and help me understand what was happening. She's had them many times, but it was my first (and thankfully only).
It turns out that this sort of thing happens to people, and that there's nothing wrong with it. It doesn't mean you're defective or weak. That's a really important point to internalize.
Xanax is worth having on hand, since that was what finally ended it for me and I was able to drift off to sleep.
I guess my point is, there's a difference between having intrusive thoughts vs something that debilitates you and that you legitimately can't control, such as an anxiety attack or a panic attack. You won't be getting any work done if those happen, and that's ok.
Highlights two lessons. 1. If you ship it, you own it. Therefore the less you ship, the better. Keep dependencies to a minimum. 2. Keep non-critical things out of the critical path. A failing AC compressor should not prevent your engine from running. Very difficult to achieve in the browser, but worth attempting.
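To make the second lesson concrete, here is a minimal sketch, assuming a bundled browser app and a hypothetical ./vendor/analytics module, of loading an optional integration off the critical path and tolerating its failure:

```ts
// Sketch only: load a hypothetical optional integration after the app is
// interactive, and let it fail without touching core flows.
async function loadOptionalAnalytics(): Promise<void> {
  try {
    // Deferred, failure-tolerant load of a hypothetical vendored module.
    const { init } = await import("./vendor/analytics");
    init();
  } catch (err) {
    // The "AC compressor" failing should not stop the engine.
    console.warn("optional analytics failed to load; continuing without it", err);
  }
}

// Fire and forget once the page has loaded; never await this inside a
// checkout or other user-facing flow.
window.addEventListener("load", () => {
  void loadOptionalAnalytics();
});
```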
https://github.com/PostHog/posthog-js/blob/759829c67fcb8720f...
The biggest lesson here is, if you're writing a popular library that monkey-patches global functions, it needs to be really well tested.
There's a difference between "I'll throw posthog calls in a try/catch just in case" and "With posthog I literally can't make fetch() calls with POST"
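A minimal sketch of the first kind of defence, a call-site try/catch (the `analytics` object and its `capture` method are stand-ins for whatever SDK is in use, not a specific library's API):

```ts
// Stand-in type for whatever analytics SDK is in use.
declare const analytics: {
  capture: (event: string, properties?: Record<string, unknown>) => void;
};

// Guard your own calls so a throwing SDK can't break application logic.
function trackSafely(event: string, properties?: Record<string, unknown>): void {
  try {
    analytics.capture(event, properties);
  } catch (err) {
    console.warn("analytics call failed; ignoring", err);
  }
}
```

This protects against exceptions thrown at your own call sites; it does nothing once the SDK has replaced window.fetch and broken every POST in the application, which is the second, worse case described above.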
I take this as a reminder of the importance of giving precise names to variables. The code
res = await originalFetch(url, init)
looks harmless enough. But in fact the `url` parameter is not necessarily a URL, as the TypeScript declaration makes clear: url: URL | RequestInfo
The problem arises in the case where it is not a URL but a RequestInfo object, which has been “used up” by the construction of the Request object earlier in the function implementation and cannot be used again here. It would have been more difficult to overlook the problem with this change if the parameter were named something more precise, such as `urlOrRequestInfo`.
(A much more speculative idea is the thought that it is possible to formalise the idea of a value being “used up” using linear types, derived from linear logic, so conceivably a suitable type system could prevent this class of bug.)
[0] https://github.com/PostHog/posthog-js/pull/1351/commits/2497...
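For illustration, a minimal sketch of the failure mode described above (hypothetical code, not the actual posthog-js source): the patched fetch builds its own Request from the incoming argument, which consumes the body of an incoming Request object, so forwarding the original argument afterwards fails for POSTs with a body.

```ts
const originalFetch = window.fetch.bind(window);

window.fetch = async (url: URL | RequestInfo, init?: RequestInit): Promise<Response> => {
  // The wrapper builds its own Request to inspect method, headers, body, etc.
  // If `url` is itself a Request carrying a body, this construction marks
  // that body as used.
  const req = new Request(url, init);

  // ... instrumentation reads from `req` here ...

  // Bug: `url` may now be a "used up" Request, and this call throws
  // "TypeError: Request body is already used" for POSTs with a body.
  return originalFetch(url, init);

  // Safer: forward the Request the wrapper just built (or a clone of it):
  //   return originalFetch(req);
};
```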
The cold sweats and shame I felt, man... Plus it's on the App Store so there's the review process to deal with which extends the timeline for a fix. Thankfully, they picked it up for review 30 minutes after submission and approved it in a few minutes.
As an SRE/devops/platform engineer, or whatever the title of the day people want to give it, I would have zeroed in on the difference between the working system and the non-working system: either adding and then removing, or removing and then adding back, the differences one at a time until something worked. What I see is two things: 1) you have an environment where it does work; 2) the failing environment was working, then started failing.
Is my method superior to yours? No. I'm just stating it to highlight the difference in the way we look at a problem. Both of us zero in on what we know: I know systems, you know code.
Reverting your own code, but still using a broken PostHog update from that same day? For me, the lesson is to make sure that I can revert everything, including dependencies.
But most of us have been in some situation similar, if not quite as bad. (Running your own company is going to be uniquely stressful.)
And it’s (IMO) why anonymity online is usually a bad idea - we need to learn, deep in our bones, that what is said online is the same as standing up in front of the church congregation and reading out our tweets - if you would not say it in front of the vicar, don't say it in front of the planet.
With more than one person you can bounce ideas off each other and share the pain so to speak. It's highly desirable.
While I have never experienced anything similar myself, it really helped me to put things in perspective. Since then, I've worked on some critical systems that were actually life or death, but I no longer do. For the /vast/ majority of technology systems, nobody will die if you let the outage last just a few hours longer. The worst case scenario is a financial cost to the company who employs you, which might be your own company. Smart companies de-risk this by getting various forms of business insurance, and that should include you if it's your own company.
So, do everything you can to fix the outage, but approach it with some perspective. It's not life or death, nobody is shooting at you.
In my experience, not vendoring has _always_ led to breakages that are hard to debug and fix.
Meanwhile, vendoring is quite easy nowadays. Every reasonable package manager, and even npm, can do this near-trivially.
Also funny that the culprit was posthog since I have some past experience with it.
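A minimal sketch of the difference being described, assuming a bundled app with a lockfile; the init call shown is illustrative, and the exact options should be checked against the PostHog docs:

```ts
// Vendored / lockfile-pinned dependency: bundled at build time, so it only
// changes when you deliberately update it and redeploy.
import posthog from "posthog-js";

posthog.init("phc_example_project_key", { api_host: "https://us.i.posthog.com" });

// Contrast with a CDN <script> snippet that always serves the vendor's latest
// build: that can start failing "on August 19" with no deploy on your side,
// and there is nothing of yours to revert.
```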
That's why trying to solve problems in the middle of the night just ends up in stress.
It includes:
* Blaming the tools (and the author)
* Not focusing on facts in the timeline
* Not considering improvements
But that doesn't make for engaging content, right?
> * At $TIME we observed HTTP POST calls failing
> * At $TIME customers reported inability to make changes to ticket prices and promo codes
> * $PERSON took the following steps to debug...
> * Root cause: an update to a vendor library resulted in cascading failures to the site
> * 5 whys (which might include lack of defensive programming, the use of a CDN without a fixed version, etc. etc.)
> * Next steps: pin the CDN version or pull the dependency into the build, etc.
Actually, that still looks like a pretty good story to me without any of the associated mania.
I spent hours on a call with the client's sr. engineer and we eventually came up with a script to fix it. It was after midnight; my director said, good job, you are tired, I'll run the script, call it a night.
An hour later the director ran the wrong script... and then called me.
The client's sr. engineer was legitimately flabbergasted; only time I have ever seen that word apply in real life.
Was a no good, very bad day.
> fetch() broken on August 19: TypeError: ...
Not broken at this version, broken on August 19. This is why I'm terrified of putting anything on the web. It is a dark scary place where runtime dependency on servers that you don't control is considered normal.
One day I'll start my own p2p thing with just a bunch of TUI's and I'll only manage to convince six people to use it each for less than a month and then I'll have to go get a real job again but at least I won't have been at the mercy of PostHog.
"We're investigating an issue affecting $X".
As a user, I can rule out that the issue is at my end. I can focus on other things and I won't add to the stack of emails.
This is one of my biggest frustrations with AWS being slow to update their status page during interruptions. I can spend much of the intervening time frantically debugging my software, only to discover the issue is at their end.
And then he rolled out a fix that was broken, too - showing incompetence in development, understanding the problem, and a total failure to do proper QA on the fix.
Royally fucked the pooch twice and he's all "gee golly whillikers!"
I'm struggling to find the lesson to take out of that. Limit your dependencies? Have a safe mode that deactivates everything optional?
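On the "safe mode" idea, a minimal sketch; the query-parameter flag and the integration names are purely illustrative:

```ts
// Hypothetical optional integrations; each one must be skippable.
declare function initAnalytics(): void;
declare function initChatWidget(): void;

// One switch that disables everything optional, e.g. ?safe=1 in the URL.
const SAFE_MODE = new URLSearchParams(window.location.search).has("safe");

const optionalIntegrations: Array<[name: string, init: () => void]> = [
  ["analytics", initAnalytics],
  ["chat-widget", initChatWidget],
];

if (!SAFE_MODE) {
  for (const [name, init] of optionalIntegrations) {
    try {
      init();
    } catch (err) {
      console.warn(`optional integration "${name}" failed; continuing`, err);
    }
  }
}
```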
Primarily, I use a code generator to write most of it.
For huge services it may not be practical, but for most it usually provides a heads-up if something stops working with an integration.
To this day, I simply refuse to do on-call. There's not enough money you could pay me to make me suffer that again.
PS: Fuck you, Rackspace.
In fact, I have dealt with an extremely similar situation where a bunch of calls for one of our APIs were failing silently, but only after they had taken card payment transactions. Dealing with the developers of this system was like pulling teeth; after we got them to stop stammering and stop chipping in with their ideas (after half a day with the issue ongoing), it took 10 minutes to find the culprit by simply going through the system task by task until we got to the failing task (confirmation emails were unable to send, so the API server failed the entire order despite payments being taken, etc.).
This only required two things: knowledge of the system, and a systematic approach to fault finding. You would think the developers would have at least the first, being the ones who wrote it, but sometimes even that is a big ask.
Maybe I'm just burnt out from this industry and incompetent people but... come on... no excuses really.
Use better tools? Know your tools better? Know how to debug better? Add yet another tool to detect the error?
In all big companies where I worked, at the end of such an event it boiled down to answering three questions:
- what happened?
- why did it happen?
- what do we do so it does not ever happen again?
You can start a web service business solo (or with a small handful of folks). But the web doesn't shut down overnight, so either have a plan to get 24-hour support onboarded early or accept that you're going to lose a lot of sleep.
(And if you think that's fun, wait until you trip over a regulatory hurdle and you get to come out of that 2AM code-bash to a meeting with some federal or state agent at 9AM...)
Come on, if POST requests work locally and not on PROD, isn't this an obvious place to start?