Firing Myself
Noormar, a developer, accidentally cleared a production database at a social gaming startup, causing revenue losses and customer complaints. The incident led to guilt, a tarnished reputation, and his eventual resignation.
The article recounts a personal experience of a developer, Noormar, who made a critical mistake while working at a social gaming startup in 2010. Assigned to implement a feature inspired by World of Warcraft, he accidentally cleared the USERS table in the production database, wiping out character stats for thousands of paying customers. The error led to a crisis, with the CEO estimating millions in revenue losses. Despite efforts to salvage the data and manage customer complaints, the blame was initially attributed to a 'junior engineer' before it became known that Noormar was responsible. The incident caused a shift in how colleagues perceived him, leading to feelings of guilt and ultimately prompting his resignation and departure from the company. The narrative serves as a cautionary tale about the consequences of oversight and the impact of mistakes in a high-stakes environment.
Related
An arc welder in the datacenter: what could possibly go wrong?
A former IBM engineer fixed a cracked metal frame on a stock exchange's printer in the 1960s. A later inexperienced repair attempt caused chaos, emphasizing the need for expertise in critical system maintenance.
Whose bug is this anyway?? (2012)
Patrick Wyatt shares bug experiences from game development, including issues in StarCraft and Guild Wars. Compiler problems caused bugs, emphasizing the need for consistent tools and work-life balance in development.
A couple years into my career, I was trying to get my AWS keys configured right locally. I hardcoded them into my .zshrc file. A few days later on a Sunday, forgetting that I'd done that, I committed and pushed that file to my public dotfiles repo, at which point those keys were instantly and automatically compromised.
After the dust settled, the CTO pulled me into the office and said:
1. So that I know you know: explain to me what you did, why it shouldn't have happened, and how you'll avoid it in the future.
2. This is not your fault - it's ours. These keys were way overpermissioned and our safeguards were inadequate - we'll fix that.
3. As long as it doesn't happen again, we're cool.
Looking back, 10 years later, I think that was exactly the right way to handle it. Address what the individual did, but realize that it's a process issue. If your process only works when 100% of people act perfectly 100% of the time, your process does not work and needs fixing.
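For illustration, a minimal sketch (not the commenter's actual setup) of the usual alternative to hardcoding keys in a dotfile: let the AWS SDK resolve credentials from its default chain, so nothing secret ever lands in a file you might push to a public repo.

```python
# A sketch assuming boto3: the SDK resolves credentials at call time from its
# default chain (environment variables, ~/.aws/credentials written by
# `aws configure`, or an attached instance/role provider), so no access key
# ever appears in a dotfile or a repository.
import boto3

s3 = boto3.client("s3")  # nothing hardcoded; the default credential chain applies

# Any call now uses whatever credentials the environment provides.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
```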
Hahaha. I still see this being done today every now and then.
> The CEO leaned across the table, got in my face, and said, "this, is a monumental fuck up. You're gonna cost us millions in revenue". His co-founder (remotely present via Skype) chimed in "you're lucky to still be here".
this type of leadership needs to be put on blast. 2010 or 2024, doesn’t matter.
If it’s going to cost “millions in revenue”, then maybe it would have been prudent to invest time in proper data access controls, proper backups, and rollback procedures.
Absolutely incompetent leadership should never be hired ever again. There should be a public blacklist so I don’t make the mistake of ever working with such idiocy.
The only people who should ever be “fired” are leadership, unless the act was intentional, in which case you should be subject to jail time.
It should never be like this, and in this case especially I blame the OP 0%. This is something that could happen to anybody in such circumstances. I have not deleted a full database, but I have had to restore things a few times. I have made mistakes myself and have rushed to fix problems caused by others' mistakes, and every single time the discussion was about improving our processes so that it does not happen again.
That is the issue, not what the author did. It was only a matter of time before the database got accidentally deleted somehow.
I'm not sure I even believe that part of the story. This was either a very dysfunctional company or a looooong time ago.
The most egregious case involved an incompetent configuration that resulted in hundreds of millions of dollars in lost data and a six-month automated recovery project. Fortunately, there were traces of the data across the entire stack - from page caches in a random employee's browser, to automated reports and OCR dumps. By the end of the project, all data was recovered. No one from outside ever found out or even realized anything had happened - we had redundancy upon redundancy across several parts of the business, and the entire company basically shifted the way we did ops to work around the issue for the time being. Every department had a scorecard tracking how many of their files were recovered, and we had little celebrations when we hit recovery milestones. To this day only a few people know who was responsible (wasn't me! lol)
Blame and derision are always inevitable in situations like this. It's how it's handled afterwards that really marks the competence of the company.
I was hired by a little photo development company doing both walk-in jobs and electronic B2B orders. I was brought in to take over the maintenance and development of the B2B order placement web service the previous developer had written.
Sadly, the previous dev designed the DB schema and software under the assumption that there would only ever be one business customer. When that ceased to be the case, he decided to simply create another database and spin up another process.
So here I am on my first day, tasked with creating a new empty database to bring on another customer. I used the Microsoft SQL Server admin GUI to generate the DDL from one of the existing tables, created (and switched the connection to) a pristine, new DB, and ran the script.
Little did I know that, in the middle of many thousands of lines of SQL, the script switched the connection back to the DB from which the DDL was generated and then proceeded to drop every single table.
Oops.
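As a hedged aside (not something the commenter describes doing), a generated script can be sanity-checked for context switches before it is ever run; a few lines of scanning would flag a buried USE statement like the one that caused this.

```python
# A minimal sketch: refuse to run a generated DDL script if it contains a
# USE statement, since that would silently switch the connection to another
# database partway through the script.
import re
import sys

def find_use_statements(sql_text: str) -> list[str]:
    """Return every 'USE <db>' statement found at the start of a line."""
    return re.findall(r"^\s*USE\s+\[?\w+\]?", sql_text,
                      flags=re.IGNORECASE | re.MULTILINE)

if __name__ == "__main__":
    script = open(sys.argv[1], encoding="utf-8").read()
    switches = find_use_statements(script)
    if switches:
        print("Refusing to run; the script changes database context:", switches)
        sys.exit(1)
    print("No USE statements found; the script will only touch the connected DB.")
```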
Of course, the previous dev had disabled backups a couple of months before I joined. My one saving grace was that he had some strange fixation on logging every single thing that happened in a bunch of XML log files; I managed to quickly write some code to rebuild the state of the DB from those logs.
I was (and am) grateful to my boss for trusting my ability to resolve the problem I had created, and placing as much value as he did in my ownership of the problem.
That was about 16 years ago. One of the best working experiences of my career, and a time of rapid technical growth for me. I would have missed out on a lot if that had been handled differently.
I was not the one who set up the backups, nor did I perform the restore. I just told my senior I made a big mistake, and he said thanks for saying so right away and we're going to handle it. Our client company told their people to take the rest of the day off. It must have been costly. I learned a lesson, but I also internalized some guilt about it.
Reflecting on this, I was in an environment where it was the norm to edit production live, imagining we could be careful enough. I'm suggesting that the error the author publicly took the fall for was not all their fault. Everyone up the chain was responsible for creating that risky situation. How could they not have backups?
There is no part of this story that’s the protagonist’s fault. What a mess.
It's on the company for cancelling their backups.
But individuals will always make mistakes; systems and processes are what prevent individual mistakes from doing damage. That's what was lacking here, not your fault at all. I just hope lessons were learned.
If a software company is making money, but developers develop against the production database and there are no backups, then it's the leadership that is at fault. The leadership deserves severe criticism.
As usual, a company with legitimately moronic processes experiences the consequences of those moronic processes when a "junior" person breaks something. Whoever turned off those backups as well as whoever thought devs (especially "junior" devs) should be mutating prod tables by hand are ultimately accountable.
I'm not implying there are only a few of them; more like they each have a dedicated finger, and I remember them and the cold-sweat feel of each specific one.
The first one was decades ago. There was a server room. They had a hardcopy printer in there, and behind it was a big red mushroom button that shut down the whole room. It was about head level if you were looking behind the printer...
Edit: Also, the user name 'drericlevi' seems to be used on pretty much all other social media outlets by a very online ENT doctor from Melbourne. I think somebody got the password to an unused Substack and is just posting blogspam.
Edit again: Also, the story in the post is taken from the 2023 book 'Joy of Agility' by Joshua Kerievsky and modified (presumably by AI) to be told from a first-person perspective:
https://books.google.com/books?id=0pZuEAAAQBAJ&pg=PA96&lpg=P...
My biggest foible: Our 400-person company had issued BlackBerrys and was pulled into a project where all the work was in India. I covered EMEA, and phone rates were like a quarter a minute, or something like that. There was no concern about using the phone/data.
I ended up spending a significant amount of time over there, and one thing I noticed was that it looked like I had a different carrier every time I glanced at the device. I did not think much of it in the couple of months I was there. When I got back, one of our Ops guys called me and we discovered it was closer to $9/min, with absurd data charges. I sat down with the CFO, as the $12k charge was not projected. I had no idea if I was about to be canned or not.
Instead, leadership and sales got brand-new BlackBerrys with an unlocked SIM card. Celebrations by all the brass... I was very happy to still have a job.
We immediately got an email from GitHub because they scan for such things. We rotated our keys within a few minutes, and he was gone the next day.
Obviously we should have been more careful with our keys, so all would have been forgiven if not for the fact that he leaked keys while looking for another job on company time.
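For what it's worth, rotating a leaked key is largely mechanical. A hedged sketch (assuming boto3 and IAM permissions the comment doesn't describe) of the usual create, deactivate, then delete sequence:

```python
# A minimal sketch: rotate a leaked access key by issuing a replacement,
# deactivating the compromised key, and deleting it once nothing depends on it.
import boto3

def rotate_access_key(user_name: str, leaked_key_id: str) -> str:
    iam = boto3.client("iam")

    # Issue the replacement first so dependent systems can switch over.
    new_key = iam.create_access_key(UserName=user_name)["AccessKey"]

    # Deactivate (not yet delete) the compromised key; anything still using
    # it fails loudly instead of silently keeping the leaked credential alive.
    iam.update_access_key(UserName=user_name,
                          AccessKeyId=leaked_key_id,
                          Status="Inactive")

    # Delete once you've confirmed nothing legitimate still uses it.
    iam.delete_access_key(UserName=user_name, AccessKeyId=leaked_key_id)
    return new_key["AccessKeyId"]
```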
Every single usage of secrets or privileged action of consequence must go through code review, followed by multiple signatures, where an automated system does the dangerous things on your behalf with a paper trail.
Never follow any instructions to place privileged secrets or access in plaintext on a development OS, or you might well be the one a CEO chooses as a sacrificial lamb.
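One way to read the "automated system acts on your behalf with a paper trail" advice, sketched with boto3 and a purely illustrative role ARN rather than anything from the thread: the job requests short-lived credentials from STS at run time, each assumption of the role is auditable, and no long-lived secret ever sits in plaintext on a developer machine.

```python
# A sketch assuming boto3: obtain short-lived credentials for a privileged
# action via STS instead of keeping long-lived keys in plaintext. The role
# ARN and session name are hypothetical examples.
import boto3

def short_lived_session(role_arn: str, session_name: str) -> boto3.Session:
    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,   # shows up in the audit trail
        DurationSeconds=900,            # credentials expire after 15 minutes
    )["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
```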
A little bit of room for error is essential for learning, but this is insane. I'm so glad the only person who has ever put me in that kind of position is me, haha. This career would have seemed so much scarier if the people I worked with early on were willing to trust me with such terrifying error margins.
Should expose the CEO’s name. Between this and forcing you to work 3 days straight, that was the least professional way to handle this situation.
Uhh, there's the problem, not that someone accidentally deleted something lol
Engineers must never feel guilty if the company was run in such a way as to make that possible.