"SRE" doesn't seem to mean anything useful any more
The article highlights the changing perception of Site Reliability Engineers, noting a shift towards operational tasks over programming skills, leading to a disconnect in the job market regarding SRE expectations.
The article discusses the evolving perception of the role of Site Reliability Engineers (SREs) in the tech industry, suggesting that the term has lost its original meaning. The author reflects on their experiences in hiring and job searching, noting that many applicants for SRE positions seem to be more focused on operational tasks rather than the dual role of sysadmin and programmer that the author believes SREs should embody. The author expresses frustration at being pigeonholed into a "devops" role, which they equate to being an "ops monkey," rather than being recognized for their programming skills and ability to automate processes. They share a personal project involving a C++ build tool that they enhanced to run operations in parallel, demonstrating their technical capabilities. The author concludes that the current job market does not value the full spectrum of skills that SREs should possess, leading to a disconnect between expectations and reality in the field.
- The term "SRE" has become diluted, often equating to basic operational roles.
- Many job applicants for SRE positions lack the programming skills expected in the role.
- The author emphasizes the importance of automation and programming in the SRE position.
- Personal projects can showcase the technical abilities of SREs beyond operational tasks.
- The current job market may not adequately recognize the full skill set of SREs.
Related
DevOps: The Funeral
The article explores Devops' evolution, emphasizing reproducibility in system administration. It critiques mislabeling cloud sysadmins as Devops practitioners and questions the industry's shift towards new approaches like Platform Engineering. It warns against neglecting automation and reproducibility principles.
All metrics are scar tissue (unless they're Business Intelligence)
Managing metrics in Site Reliability Engineering involves emotional ties and past experiences, impacting incident management. Balancing operational goals and emotional attachment is crucial for refining metrics effectively.
A network engineer in search of greener pastures
A laid-off network engineer shares frustrations about the 2024 job search, highlighting challenges with application filtering, misleading job postings, and cumbersome processes, while advocating for improvements in hiring practices.
Why Heroism Is Bad and What We Can Do to Stop It
Heroism in site reliability engineering can obscure systemic issues, create unrealistic workload expectations, and lead to burnout. Encouraging discussions about service level objectives and allowing failures can improve system reliability.
The Red Herring of Red Flags: Why Resumes Are a Relic of the Past
Traditional resumes inadequately assess talent in tech hiring, masking potential. The article advocates for skills assessments and real-world evaluations, emphasizing a skills-centric approach to include non-traditional candidates.
But, operations staff do and have always developed software, just internal software for glue or orchestration, and they work differently to regular software developers in that their customers are usually themselves to meet an internal objective of reliability, stability or ease-of-use for developers.
It's interesting to me because I've been around a bit longer, it seems, and have been tied to the industry for a long time, so I see the treadmill grinding along continuously:
Sysadmins (different from system operators) were usually among the most senior developers who ended up knowing how operating systems and compute worked fundamentally. Over time this eroded and eventually you had helpdesk people being labelled as sysadmins.
Then "DevOps" emerged as a job title, which meant a dozen things to a dozen people, and the same thing happened: the same operational needs and the same operational solutions, just with the better tools that the passage of time allowed to exist.
Then came SRE, which ironically was devised before DevOps was, and which did exactly the same thing: trying to turn operations problems into the easier-to-reason-about software development space.
But still, it's operations folks, people who are responsible for an outcome and its continuity more than for delivering a feature.
But Managers genuinely can't reason about anything other than features, so a cost center it becomes and eradication is desired, and the treadmill begins again.
So, SREs, who embody essentially the same characteristics as early sysadmins (but with large budgets and better tools), will eventually become system operators, who will eventually become helpdesk and eventually be replaced by some new title that insists that "SREs never wrote code, and this next generation will!".
The author is experiencing the exact same thing I did over a decade ago when "sysadmin" became unfashionable and everyone told me that "sysadmins can't code", despite working in teams of sysadmins who wrote precursors to kubernetes/nomad, on bare-metal, in perl on Solaris.
Will be interesting to see what the next iteration will be called, the author will just need to alter her vernacular.
Rachel is spot on about what is often wrong with IT culture; "typecasting" people for someone's convenience or to get a fancy title leads to learned helplessness and dismissing other people's expertise and interests. I'd rather we all try to keep things simple and encourage people to be well-rounded engineers.
It brings me no joy to say this, as ops people tend to be very smart and cool under pressure, but I never see myself becoming one. High-quality software and airtight system integrity seem decreasingly important to people paying salaries and investing money. And the world moves on. I refuse to be seen as a cost center if I can avoid it, and I don't have much sympathy left for otherwise very smart tech people who haven't figured this out yet. If you love the job, do it. If not, transition and don't complain. The people in charge are not going to be persuaded by blog rhetoric. I said what I said.
* Is on call
* Manages internal software (grafana, Prometheus, salt stack, etc)
* Is the first line of defense for issues in the field, works with support and the engineering team to handle problems
* Manages a distributed fleet of servers (uses off-the-shelf and/or custom code to do so)
* Builds internal tools/automations to improve the reliability of our platform and/or address problems automatically
* Pushes us to make software/OS updates
I ask because that’s what I’d love to hire. Someone who owns the production environment and has the time/space to focus on improving/maintaining it.
“Ops” or “Production engineer” is what I’ve referred to it as in the past but this seems like a good place to ask what others would call it.
Companies want SRE people but aren't willing to give SRE empowerment and authority.
So companies do what companies do: take a regular team of operations people, slap the SRE term on it, and call it a day.
And it doesn't work, of course.
---
Regarding the empowerment & authority: according to "the book", SREs often play the role of "launch coordination engineers", as in vetting (read: roasting) a service before it goes live, with the authority to say "this won't go live, fix this first" no matter what deadline is going to be missed.
Also, SRE teams have the extreme prerogative to "give back the pager", as in hand a service back to the development team and say: it's not stable enough, YOU will be on-call for it until you fix the shit you wrote.
These are two emblematic examples, but there are many more in the book.
Can you imagine any of the non-Google (and non-FAANG) companies actually doing something like that?
The result is that most end up on someone else's system with some form of operations, meaning the cloud. Now that costs are skyrocketing and issues are piling up, and no one seems to have been able to build a complete infrastructure well, the push is deflating.
We have experienced many similar trends:
- full-stack virtualization on x86, sold as the future and a super-duper simplification, but actually a way to let third parties sell pre-made images to those who have no operations team or do not own the bare metal (VPS etc.);
- when people realized how big the overhead is, paravirtualization became an old-new trend, mostly with k8s; now this model is starting to creak, and the push toward taking back ownership of the infra and the iron is getting noisy.
I expect a mainstream move to NixOS/Guix System within a decade, as we had with Ansible/Salt before and the old CFEngine long before that, and so on. In 20 years, companies will probably own their machine rooms again, with a Plan 9-like model just to get redundancy and extra temporary resources.
To me, a SRE is *both* a sysadmin AND a programmer, developer, whatever you want to call it. It's a logical-and, not an XOR.
By sysadmin, I mean "runs a mean Unix box, including fixing things and diving deeply when they break", and by the programmer/whatever part of it, I mean "makes stuff come into existence that wasn't there before".
The main issue I see with that is that companies usually aren't willing to advertise and pay SRE salaries for actual SRE skills. The skillset described in the above quotes is essentially the skills of an SWE and of a sysadmin. So essentially you're doing two jobs for one salary.
There are people capable of doing two jobs, but you won't find them until you start advertising and paying actual-SRE salaries.
You can learn enough about Linux to be decent at your job on only VMs if you’re dedicated, but I’d argue that until you’ve also dealt with hypervisors, bare metal, and hardware issues, you’re missing some of the picture.
“That’s no longer applicable, so why should I care?” Because it pops up everywhere I’ve been. Random build server that everyone forgot about but is critical suddenly shits itself, it’s running some ancient version of Ubuntu, and it’s all hand-rolled. Someone decided to provision a bunch of EC2s with bash, but they made critical errors like not knowing to make a new initramfs after configuring mdadm, so now the RAID disappears on reboots.
Understanding the fundamentals has always and will always matter. Anyone telling you differently is selling you something.
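To make the mdadm anecdote above concrete, here is a minimal Python sketch of the kind of sanity check that would have caught it. It rests on assumptions that are not in the comment: that a stale initramfs can be spotted by comparing file modification times (a heuristic, not a guarantee), and that on a Debian/Ubuntu-style host the relevant files would be `/etc/mdadm/mdadm.conf` and `/boot/initrd.img-<kernel release>`. The function names `initramfs_is_stale` and `demo` are made up for illustration, and the demo uses temporary files rather than touching a real system.

```python
import os
import tempfile


def initramfs_is_stale(mdadm_conf: str, initramfs: str) -> bool:
    """True if the mdadm config was modified after the initramfs was built.

    Heuristic only: it compares modification times and does not inspect
    the contents of the initramfs image.
    """
    return os.path.getmtime(mdadm_conf) > os.path.getmtime(initramfs)


def demo() -> bool:
    """Simulate the failure mode with two temp files instead of system paths."""
    with tempfile.TemporaryDirectory() as tmp:
        conf = os.path.join(tmp, "mdadm.conf")
        image = os.path.join(tmp, "initrd.img")
        for path in (conf, image):
            with open(path, "w"):
                pass  # create an empty placeholder file
        os.utime(image, (1_000, 1_000))  # initramfs built first
        os.utime(conf, (2_000, 2_000))   # array configured afterwards
        return initramfs_is_stale(conf, image)


if __name__ == "__main__":
    print(demo())
```

On a real Debian/Ubuntu host, a stale result would be the cue to regenerate the image (there, `update-initramfs -u`) before the next reboot; other distributions use different tools and paths.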
Welcome to "Olds-land," Rachel. Sorry about that. It's almost impossible to be a developer/engineer/opsmonkey, with any varied experience, without running into this. People will always find something in your résumé, that makes them uncomfortable.
> they didn't have the usual lists of godawful clown software that most places rattle off that you'd be expected to work with.
That could be a summary of all that is amiss with the tech industry, these days.
There are so many companies that have SRE teams despite those teams not maintaining a website!