August 26th, 2024

Launch HN: Parity (YC S24) – AI for on-call engineers working with Kubernetes

Parity is an AI-powered tool for on-call engineers managing Kubernetes; it automates root cause analysis and runbook execution. It is free during the launch period, and the team is asking users for feedback to guide improvements.


Jeffrey, Coleman, and Wilson are developing Parity, an AI-powered site reliability engineering (SRE) copilot designed to assist on-call engineers managing Kubernetes environments. Parity aims to ease the burden of on-call duty by automating investigation and remediation. The tool conducts a preliminary investigation to identify the root cause of an issue before engineers even log in, reducing the time spent troubleshooting. The founders, who previously worked at Crusoe, experienced the challenges of on-call responsibilities firsthand and recognized a common struggle among their peers.

They have integrated large language models (LLMs) with specialized agents to perform complex tasks such as root cause analysis and runbook execution. The agents simulate a human investigative process: generating hypotheses based on symptoms, validating them against logs and metrics, and presenting the findings in a comprehensive report. Parity also includes an agent that can execute runbooks automatically, allowing alerts to be handled more efficiently. The system executes only read-only commands, ensuring that engineers retain control over critical actions.

Parity is currently available for free during its launch phase, and users can install it easily in their Kubernetes clusters. The developers encourage feedback from users to refine the product further.
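
As a rough illustration of that hypothesize-and-validate loop (every name below is an assumption for illustration, not Parity's actual API), the control flow might look something like:

    # Illustrative sketch of the hypothesize-and-validate loop; all names
    # here are assumptions, not Parity's actual API.

    def generate_hypotheses(symptom: str) -> list[str]:
        # A real agent would ask an LLM to propose these from the alert;
        # hard-coded here for a CrashLoopBackOff symptom.
        return ["pod is OOM-killed", "liveness probe misconfigured", "bad image tag"]

    def validate(hypothesis: str, logs: str) -> tuple[bool, str]:
        # Stand-in for the read-only log/metric queries that would confirm
        # or refute each hypothesis.
        markers = {
            "pod is OOM-killed": "OOMKilled",
            "liveness probe misconfigured": "Liveness probe failed",
            "bad image tag": "ImagePullBackOff",
        }
        marker = markers[hypothesis]
        return marker in logs, marker

    def investigate(symptom: str, logs: str) -> str:
        report = [f"Symptom: {symptom}"]
        for hypothesis in generate_hypotheses(symptom):
            confirmed, evidence = validate(hypothesis, logs)
            status = "CONFIRMED" if confirmed else "ruled out"
            report.append(f"- {hypothesis}: {status} (looked for '{evidence}')")
        return "\n".join(report)

    print(investigate("CrashLoopBackOff on payments-api",
                      "Last state: OOMKilled, exit code 137"))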

- Parity is an AI tool designed to assist on-call engineers with Kubernetes management.

- It automates root cause analysis and runbook execution to streamline troubleshooting.

- The tool uses LLMs and specialized agents to simulate human investigative processes.

- Parity is available for free during its launch, with easy installation via a Helm repo.

- User feedback is encouraged to improve the product's functionality.

AI: What people are saying
The comments on the article about Parity, the AI tool for Kubernetes management, reveal a mix of skepticism and interest among users.
  • Concerns about the tool's integration with existing workflows, especially regarding GitOps and infrastructure troubleshooting.
  • Apprehension about security and data privacy when using AI tools in production environments.
  • Some users express enthusiasm for AI's potential to assist in troubleshooting and documentation.
  • Criticism of Kubernetes as a platform and doubts about relying on AI for critical operations.
  • Questions about how Parity differentiates itself from other AI SRE solutions in the market.
15 comments
By @Atotalnoob - 6 months
It would be kind of interesting if, based on an engineer accepting the suggestion, Parity generated a new runbook.

This would allow repeated issues to be well documented.
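
A minimal sketch of what that could look like (hypothetical names; nothing in the launch post confirms this mechanism):

    # Sketch: persist an accepted suggestion as a runbook entry so the fix
    # is documented for the next occurrence. Hypothetical design, not Parity's.
    import json
    import pathlib

    def record_runbook(alert: str, accepted_steps: list[str],
                       path: str = "runbooks.json") -> None:
        p = pathlib.Path(path)
        runbooks = json.loads(p.read_text()) if p.exists() else {}
        entry = runbooks.setdefault(alert, {"steps": accepted_steps, "accepted": 0})
        entry["accepted"] += 1  # track how often engineers approve this fix
        p.write_text(json.dumps(runbooks, indent=2))

    record_runbook("CrashLoopBackOff payments-api",
                   ["check last container exit code", "raise memory limit"])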

On iOS Firefox, when clicking “pricing” on the menu, it scrolls to the proper location, but does not close the menu. Closing the menu causes it to jump to the top of the page. Super annoying.

By @stackskipton - 6 months
Azure Kubernetes Wrangler (SRE) here: before I turn some LLM loose on my cluster, I need to know what it supports, how it supports it, and how I can integrate it into my workflow.

The videos show a CrashLoopBackOff pod and analyzing its logs. That works if the pod is writing to stdout, but I've got some stuff going straight to Elasticsearch. Does the LLM speak Elasticsearch? How about log files inside the pod? (Don't get me started on that nightmare.)
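
For what it's worth, "speaking Elasticsearch" would presumably mean read-only queries along these lines (the index and field names below are assumptions; they vary by logging pipeline):

    # Assumed sketch: fetch a pod's recent log lines from Elasticsearch.
    # Index pattern and field names depend on the logging pipeline in use.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    resp = es.search(
        index="logs-*",
        query={"bool": {"filter": [
            {"term": {"kubernetes.pod_name": "payments-api-7d9f"}},
            {"range": {"@timestamp": {"gte": "now-15m"}}},
        ]}},
        size=50,
        sort=[{"@timestamp": "desc"}],
    )
    for hit in resp["hits"]["hits"]:
        print(hit["_source"].get("message", ""))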

You also show fixing by editing YAML in place. That's great, except my FluxCD is going to revert it, since you violated the principle of "everything goes through GitOps." So if you are going to change anything, you need to update the proper git repo. Also, said GitOps setup uses Kustomize, so I hope you understand all the interactions there.
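
The GitOps-safe version of that fix is to patch the manifest in the repo Flux watches, commit, and push, rather than editing the live object. Roughly (repo path, file, and remediation below are all made up):

    # Sketch: apply the fix through git so Flux reconciles it, instead of
    # kubectl-editing the live object (which Flux would revert).
    # Paths and the remediation itself are hypothetical.
    import pathlib
    from git import Repo  # GitPython

    repo = Repo("/srv/fleet-repo")
    manifest = pathlib.Path(repo.working_tree_dir) / "apps/payments/deployment.yaml"
    text = manifest.read_text().replace("memory: 256Mi", "memory: 512Mi")
    manifest.write_text(text)

    repo.index.add([str(manifest)])
    repo.index.commit("fix: raise payments-api memory limit")
    repo.remote("origin").push()  # Flux reconciles from here; nothing to revert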

Personally, the stuff that takes the most troubleshooting time is Kubernetes infrastructure. The network CNI is acting up. The Ingress controller is missing the proper path-based routing. A NetworkPolicy says no to a pod talking to the Postgres server. cert-manager is on strike and a certificate has expired. If the LLM is quick at identifying those, it has some uses, but selling me on "dev made a mistake with the pod config" is unlikely to move the needle, because I'm already quick at identifying that.

Maybe I'm not the target market, and the target market is "small dev team that bought Kubernetes without realizing what they were signing up for."

By @henning - 6 months
An AI agent to triage the production issues caused by code generated by some other startup's generative AI bot. I fucking love tech in 2024.
By @habosa - 6 months
Just some feedback on the landing page: ditch the Stanford/MIT/Carnegie Mellon logos. I'm not hating on elite universities or anything, but they have no relevance here (this is not a research project) and I think they detract from the brand. I don't associate academia with pager-carrying operators of critical services.
By @ronald_petty - 6 months
I think this kind of tooling is one positive aspect of integrating LLM tech in certain workflows/pipelines. Tools like k8sgpt are similar in purpose and show a strong potential to be useful. Look forward to seeing how this progresses.
By @raunakchowdhuri - 6 months
Hmm, I don't know how I'd feel about giving an LLM cluster access from a security POV.
By @RadiozRadioz - 6 months
> using AI agents to execute runbooks

This scares me. If I were confident enough in the runbook steps, they'd already be automated by a program. If it's a runbook and not a program, either it's really new or there's some subtle nuance around it. "AI" is cool, and humans aren't perfect, but in this scenario I'd still prefer the judgment of a skilled operator who knows the business.

> our agents exclusively execute read-only commands

How is this enforced?

The RCA is the better feature of this tool, in my opinion.
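
The launch post doesn't say how the read-only guarantee is enforced. One plausible mechanism (an assumption, not confirmed by the vendor) is a verb allowlist in front of every command, backed by a ServiceAccount whose RBAC role only grants get/list/watch:

    # One conceivable enforcement layer: refuse anything but read-only
    # kubectl verbs before execution. Purely illustrative; Parity has not
    # said how it actually enforces this.
    import shlex
    import subprocess

    READ_ONLY_VERBS = {"get", "describe", "logs", "top", "explain", "events"}

    def run_kubectl(command: str) -> str:
        args = shlex.split(command)
        if len(args) < 2 or args[0] != "kubectl" or args[1] not in READ_ONLY_VERBS:
            raise PermissionError(f"blocked non-read-only command: {command}")
        return subprocess.run(args, capture_output=True, text=True, check=True).stdout

    print(run_kubectl("kubectl get pods -n payments"))
    # run_kubectl("kubectl delete pod payments-api-7d9f")  # -> PermissionError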

By @drawnwren - 6 months
This is a great idea. I use Claude for most of my unknown K8s bugs, and it's impressive how useful it is (far more so than for my coding bugs).
By @nerdjon - 6 months
Well, the website seems to be down, so I can't actually see any information about which LLM you're using, but I seriously hope you aren't just sending the data to the OpenAI API or something like that, and are instead forcing the use of a private (ideally self-hosted) service.

I would not want any data about my infrastructure sent to a public LLM, regardless of how sanitized things are.

Otherwise, on paper it seems cool. But I worry about getting complacent with this tech. It is going to fail; that is just the reality. We know LLMs will hallucinate, and there is not much we can do about it; it is the nature of the tech.

So it might work most of the time, but when it doesn't, you're bashing your head against the wall trying to figure out what is broken while this system tells you all of these things are fine, even though one of them actually isn't. And it worked enough times before that you trust it, so you don't bother double-checking.

That is before we even talk about having this thing run code for automatic remediation, which I hope no one ever seriously considers doing.

By @mdaniel - 6 months
Why would you have your demo video set to "unlisted"? (on what appears to be your official channel) I'd think you'd want to show up in as many places as possible
By @manveerc - 6 months
Congratulations on the launch! I'm curious—how is what you're building different from other AI SRE solutions out there, like Cleric, Onegrep, Resolve, Beeps, and others?
By @klinquist - 6 months
Website won't load - just me?
By @andrewguy9 - 6 months
For God's sake, SREs need to give up on K8s. It was a bad idea; just move on.

The answer is not, “let an ai figure it out.”

That is legitimately scary.

By @threeseed - 6 months
> This agent is a combination of separate LLM agents each responsible for a single step of the runbook

Someone needs to explain to me how this is expected to work.

Percentage of Hallucinations/Errors x Steps in Runbook = Total Errors

0.05 x 10 = 0.5 = 50%
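
For reference, per-step errors don't add linearly; assuming independent 5% error rates per step, the standard calculation still lands uncomfortably high:

    # Probability that at least one of n independent steps goes wrong,
    # given a per-step error rate p.
    p, n = 0.05, 10
    print(f"{1 - (1 - p) ** n:.0%}")  # ~40%, versus 50% from the linear sum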