September 4th, 2024

"SRE" doesn't seem to mean anything useful any more

The article highlights the changing perception of Site Reliability Engineers, noting a shift towards operational tasks over programming skills, leading to a disconnect in the job market regarding SRE expectations.

The article discusses the evolving perception of the role of Site Reliability Engineers (SREs) in the tech industry, suggesting that the term has lost its original meaning. The author reflects on their experiences in hiring and job searching, noting that many applicants for SRE positions seem focused more on operational tasks than on the dual role of sysadmin and programmer that the author believes SREs should embody. The author expresses frustration at being pigeonholed into a "devops" role, which they equate to being an "ops monkey," rather than being recognized for their programming skills and ability to automate processes. They share a personal project involving a C++ build tool that they enhanced to run operations in parallel, demonstrating their technical capabilities. The author concludes that the current job market does not value the full spectrum of skills that SREs should possess, leading to a disconnect between expectations and reality in the field.

- The term "SRE" has become diluted, often equating to basic operational roles.

- Many job applicants for SRE positions lack the programming skills expected in the role.

- The author emphasizes the importance of automation and programming in the SRE position.

- Personal projects can showcase the technical abilities of SREs beyond operational tasks.

- The current job market may not adequately recognize the full skill set of SREs.

DevOps: The Funeral

The article explores DevOps' evolution, emphasizing reproducibility in system administration. It critiques mislabeling cloud sysadmins as DevOps practitioners and questions the industry's shift towards new approaches like Platform Engineering. It warns against neglecting automation and reproducibility principles.

All metrics are scar tissue (unless they're Business Intelligence)

Managing metrics in Site Reliability Engineering involves emotional ties and past experiences, impacting incident management. Balancing operational goals and emotional attachment is crucial for refining metrics effectively.

A network engineer in search of greener pastures

A laid-off network engineer shares frustrations about the 2024 job search, highlighting challenges with application filtering, misleading job postings, and cumbersome processes, while advocating for improvements in hiring practices.

Why Heroism Is Bad and What We Can Do to Stop It

Heroism in site reliability engineering can obscure systemic issues, create unrealistic workload expectations, and lead to burnout. Encouraging discussions about service level objectives and allowing failures can improve system reliability.

The Red Herring of Red Flags: Why Resumes Are a Relic of the Past

Traditional resumes inadequately assess talent in tech hiring, masking potential. The article advocates for skills assessments and real-world evaluations, emphasizing a skills-centric approach to include non-traditional candidates.

18 comments

By @dijit - 6 months

People need operations staff, people don't like operations staff and keep trying to treat them like developers.

But, operations staff do and have always developed software, just internal software for glue or orchestration, and they work differently to regular software developers in that their customers are usually themselves to meet an internal objective of reliability, stability or ease-of-use for developers.

It's interesting to me, because I'm a bit longer lived it seems and have been tied to industry for a long time, and I see the treadmill grinding along continuously:

Sysadmins (different from system operators) were usually among the most senior developers who ended up knowing how operating systems and compute worked fundamentally. Over time this eroded and eventually you had helpdesk people being labelled as sysadmins.

Then "DevOps" emerged as a job title, which meant a dozen things to a dozen people, and the same issue happened, it was the same operational needs and the same operational solutions, just with better tools as the passage of time allowed better tools to exist.

Then SRE, which ironically was devised before DevOps was, which did exactly the same thing of trying to turn operations problems into the easier to reason about software development space.

But still, it's operations folks, people who are more responsible about an outcome and a continuance than they are about delivering a feature.

But Managers genuinely can't reason about anything other than features, so a cost center it becomes and eradication is desired, and the treadmill begins again.

So, SREs, who embody essentially the same characteristics as early sysadmins (but with large budgets and better tools), will eventually become system operators, who will eventually become helpdesk and eventually be replaced by some new title that insists that "SREs never wrote code, and this next generation will!".

The author is experiencing the exact same thing I did over a decade ago when "sysadmin" became unfashionable and everyone told me that "sysadmins can't code", despite working in teams of sysadmins who wrote precursors to kubernetes/nomad, on bare-metal, in perl on Solaris.

Will be interesting to see what the next iteration will be called, the author will just need to alter her vernacular.

By @dimitar - 6 months

> you're going to be the "ops bitch" for the "real" programmers

Rachel is spot on about what is often wrong with IT culture; "typecasting" people for someone's convenience or to get a fancy title leads to learned helplessness and dismissing other people's expertise and interests. I'd rather we all try to keep things simple and encourage people to be well-rounded engineers.

By @lr4444lr - 6 months

Ops people are cost centers. They can display their wizardry in blog posts until they are blue in the face, but except for those few companies with an incredibly large moat whose main profit is just raw traffic or high uptime (and how many companies like that can maintain that moat indefinitely anyway), engineers not actually building or directly improving product will always be cost centers.

It brings me no joy to say this, as ops people tend to be very smart and cool under pressure, but I never see myself becoming one. High quality software and airtight system integrity seems decreasingly important to people paying salaries and investing money. And the world moves on. I refuse to be seen as a cost center if I can avoid it, and I don't have much sympathy left for otherwise very smart tech people who haven't figured this out yet. If you love the job, do it. If not, transition and don't complain. The people in charge are not going to be persuaded by blog rhetoric. I said what I said.

By @joshstrange - 6 months

Honest question, what would you call a role that:

* Is on call

* Manages internal software (grafana, Prometheus, salt stack, etc)

* Is the first line of defense for issues in the field, works with support and the engineering team to handle problems

* Manages a distributed fleet of servers (uses off the shelf and/or custom code to do so)

* Builds internal tools/automations to improve the reliability of our platform and/or address problems automatically

* Pushes us to make software/OS updates

I ask because that’s what I’d love to hire. Someone who owns the production environment and has the time/space to focus on improving/maintaining it.

“Ops” or “Production engineer” is what I’ve referred to it as in the past but this seems like a good place to ask what others would call it.

By @znpy - 6 months

I got the SRE book by google, and read the whole thing, cover to cover.

Companies want SRE people but aren't willing to give SRE empowerment and authority.

So companies do what companies do: take a regular team of operations people and slap the SRE term on it, and call it a day.

And it doesn't work, of course.

---

Regarding the empowerment & authority: according to "the book", SREs often play the role of "launch coordination engineers" as in vetting (read: roasting) a service before it goes live and have authority to say "this won't go live, fix this first" and to do so no matter what deadline is going to be missed.

Also, SRE teams have the extreme prerogative to "give back the pager" as in take a service back to the development team and say: it's not stable enough, YOU will be on-call for it until you fix the shit you wrote.

These are two emblematic examples, but there are many more in the book.

Can you imagine any of the non-google (and non-faang) companies actually doing something like that?

By @austin-cheney - 6 months

And that’s why I don’t mention my military background during job interviews. In software world it would likely be lost on them anyways, but really certain words are triggers for unfounded assumptions in an industry dominated by unfounded assumptions.

By @kkfx - 6 months

Management hates classic operations because it's too powerful: it's the operational part of a company's nervous system, and it can't be managed or segmented by management. As a result, many do their best to remove sysadmins from the scene.

The result is that most end up on someone else's system with some form of operations, meaning the cloud. Now that costs are skyrocketing, issues are piling up, and no one seems to have been able to build a full infra well, the push is deflating.

We have experienced many similar trends:

- full stack virtualization on x86, sold as the future and a super-duper simplification, but actually a way to let third parties sell pre-made images to those who have no operations or do not own the bare metal (VPS etc);

- when people realized how big the overhead is, paravirtualization became an old-new trend, mostly with k8s; now this model is starting to creak, and the push toward taking back ownership of the infra and the iron is getting noisy

I expect in a decade a mainstream move to NixOS/Guix System, as we had with Ansible/Salt before, the old CFEngine long before that, and so on. In 20 years companies will probably own their machine rooms again, with a Plan 9-like model just to get redundancy and extra temporary resources.

By @minkles - 6 months

It never meant anything outside Google or huge orgs. Everyone else is running cargo cults.

By @znpy - 6 months

    To me, a SRE is *both* a sysadmin AND a programmer, developer, whatever you want to call it. It's a logical-and, not an XOR.
    By sysadmin, I mean "runs a mean Unix box, including fixing things and diving deeply when they break", and by the programmer/whatever part of it, I mean "makes stuff come into existence that wasn't there before".

The main issue I see with that is that companies usually aren't willing to advertise and pay SRE salaries for actual SRE skills.

The skillset described in the above quotes is essentially the skills of an SWE and of a Sysadmin. So essentially you're doing two jobs for one salary.

There are people capable of doing two jobs, but you won't find them until you start advertising and paying actual-SRE salaries.

By @sgarland - 6 months

It doesn’t mean anything for two reasons: companies have treated it as a catch-all, and there is a glut of people calling themselves SREs who have never operated a server that wasn’t in a cloud.

You can learn enough about Linux to be decent at your job on only VMs if you’re dedicated, but I’d argue that until you’ve also dealt with hypervisors, bare metal, and hardware issues, you’re missing some of the picture.

“That’s no longer applicable, so why should I care?” Because it pops up everywhere I’ve been. Random build server that everyone forgot about but is critical suddenly shits itself, it’s running some ancient version of Ubuntu, and it’s all hand-rolled. Someone decided to provision a bunch of EC2s with bash, but they made critical errors like not knowing to make a new initramfs after configuring mdadm, so now the RAID disappears on reboots.

Understanding the fundamentals has always and will always matter. Anyone telling you differently is selling you something.

By @candiddevmike - 6 months

It still means mandatory, potentially grueling, on-call rotations.

By @muppetman - 6 months

Everyone lost respect for SREs when Elon took over Twitter and culled lots of staff. Every SRE in the land lined up to shout from the rooftops "There is no way Twitter can keep working it's going to flame out and fall over and keep crashing" and it didn't, not even once.

By @ChrisMarshallNY - 6 months

Boy, this sounds familiar.

Welcome to "Olds-land," Rachel. Sorry about that. It's almost impossible to be a developer/engineer/opsmonkey, with any varied experience, without running into this. People will always find something in your résumé, that makes them uncomfortable.

> they didn't have the usual lists of godawful clown software that most places rattle off that you'd be expected to work with.

That could be a summary of all that is amiss with the tech industry, these days.

By @bravetraveler - 6 months

I tire of being the internet janitor, good read

By @nunez - 6 months

Well, yeah, in a world where corporates don't want to pay for sysadmins that can actually code or give their devs a pager, you get what you saw: sysadmins and the people that do Jenkins being renamed as the "devops" team in 2014, then the "SRE" team in 2016 or so, then "platform teams" after the age of Kubernetes in 2018-ish.

There are so many companies that have SRE teams despite those teams not maintaining a website!

By @PaulHoule - 6 months

It’s a predictable problem. Make up a new and fashionable term and it will bite you in the end. See: euphemism treadmill. It has everything to do with the specifics of ‘site’, ‘reliability’, ‘engineer’ and all the technical and social stuff Rachel talks about but on another level it is about the style of discourse, see ‘non’, ‘fungible’, ‘token’.

By @formerly_proven - 6 months

OP seems to think SRE and DevOps are seen as lowly by management, yet those jobs still exist here, while the "real programmers" were offshored long ago. Good luck getting a job Actually Building Stuff, almost no one does it any more.
