Launch HN: Parity (YC S24) – AI for on-call engineers working with Kubernetes
Parity is an AI-powered tool for on-call engineers managing Kubernetes, automating root cause analysis and runbook execution. It is free during its launch period, and the team is asking for user feedback to guide improvements.
Jeffrey, Coleman, and Wilson are developing Parity, an AI-powered site reliability engineering (SRE) copilot designed to assist on-call engineers managing Kubernetes environments. Parity aims to alleviate the burden of on-call duties by automating investigation and remediation. The tool conducts preliminary investigations to identify root causes of issues before engineers even log in, reducing time spent on troubleshooting.

The founders, who previously worked at Crusoe, experienced the challenges of on-call responsibilities firsthand and recognized a common struggle among their peers. They have integrated large language models (LLMs) with specialized agents to perform complex tasks such as root cause analysis and executing runbooks. The agents simulate a human investigative process: they generate hypotheses based on symptoms, validate them against logs and metrics, and present findings in a comprehensive report.

Additionally, Parity includes an agent that can automatically execute runbooks, allowing for more efficient handling of alerts. The system is designed to execute only read-only commands, ensuring that engineers retain control over critical actions. Parity is currently available for free during its launch phase, and users can install it easily in their Kubernetes clusters. The developers encourage feedback from users to refine the product further.
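To make the hypothesis loop concrete, here is a minimal sketch of what such an investigation agent might look like. This is not Parity's actual code: `ask_llm` is a hypothetical stand-in for whatever model call the product makes, and the `app=web` selector is an assumption.

```python
# Minimal sketch of a hypothesis-driven RCA loop (not Parity's actual code).
import subprocess

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model call here")  # hypothetical helper

def kubectl(*args: str) -> str:
    # Only read-only verbs (get, logs) are ever issued from this sketch.
    return subprocess.run(["kubectl", *args], capture_output=True, text=True).stdout

def investigate(alert: str) -> str:
    # 1. Gather symptoms the way an engineer would: recent cluster events.
    events = kubectl("get", "events", "--all-namespaces", "--sort-by=.lastTimestamp")
    # 2. Ask the model for candidate root causes, one per line.
    hypotheses = ask_llm(
        f"Alert: {alert}\nRecent events:\n{events}\n"
        "List the three most likely root causes, one per line."
    )
    report = []
    for hypothesis in hypotheses.splitlines():
        # 3. Validate each hypothesis against logs (still read-only).
        evidence = kubectl("logs", "--selector", "app=web", "--tail", "200")
        verdict = ask_llm(
            f"Hypothesis: {hypothesis}\nEvidence:\n{evidence}\n"
            "Confirmed, refuted, or inconclusive? Answer briefly."
        )
        report.append(f"- {hypothesis}: {verdict}")
    # 4. Present findings; a human decides what to do next.
    return "\n".join(report)
```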
- Parity is an AI tool designed to assist on-call engineers with Kubernetes management.
- It automates root cause analysis and runbook execution to streamline troubleshooting.
- The tool uses LLMs and specialized agents to simulate human investigative processes.
- Parity is available for free during its launch, with easy installation via a Helm repo.
- User feedback is encouraged to improve the product's functionality.
Related
GitHub Copilot – Lessons
Siddharth discusses GitHub Copilot's strengths in pair programming and learning new languages, but notes its limitations with complex tasks, verbosity, and potential impact on problem-solving skills among new programmers.
SREBench Competition
Parity is hosting the SREBench Leaderboard Race, allowing participants to compare incident response times against its AI, which has a 70% success rate. The competition ends on August 23, 2024.
Leveraging AI for efficient incident response
Meta has developed an AI-assisted system for root cause analysis, achieving 42% accuracy by combining heuristic retrieval and LLM ranking, significantly improving investigation efficiency while addressing potential risks through feedback and explainability.
- Concerns about the tool's integration with existing workflows, especially regarding GitOps and infrastructure troubleshooting.
- Apprehension about security and data privacy when using AI tools in production environments.
- Some users express enthusiasm for AI's potential to assist in troubleshooting and documentation.
- Criticism of Kubernetes as a platform and doubts about relying on AI for critical operations.
- Questions about how Parity differentiates itself from other AI SRE solutions in the market.
This would allow repeated issues to be well documented.
On iOS Firefox, when clicking “pricing” on the menu, it scrolls to the proper location, but does not close the menu. Closing the menu causes it to jump to the top of the page. Super annoying.
Videos show a CrashLoopBackOff pod and analyzing logs. This works if the pod is writing to stdout, but I've got some stuff going straight to Elasticsearch. Does the LLM speak Elasticsearch? How about log files inside the pod? (Don't get me started on that nightmare.)
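For reference, pulling pod logs out of Elasticsearch rather than stdout is mechanical; a sketch with elasticsearch-py 8.x, where the endpoint, index pattern, and field names are all assumptions about how your log shipper writes documents:

```python
# Sketch: fetch a pod's recent logs from Elasticsearch instead of stdout.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")  # assumed endpoint
resp = es.search(
    index="app-logs-*",                                     # assumed index pattern
    query={"match": {"kubernetes.pod_name": "web-5d7c9"}},  # assumed field/pod name
    sort=[{"@timestamp": {"order": "desc"}}],
    size=200,
)
log_lines = [hit["_source"].get("message", "") for hit in resp["hits"]["hits"]]
# These lines can then be fed to the model exactly like stdout logs.
```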
You also show fixing things by editing YAML in place. That's great, except my FluxCD is going to revert it, since that violates the principle of "everything goes through GitOps". If you are going to change anything, you need to update the proper Git repo. Also, said GitOps setup uses Kustomize, so I hope you understand all the interactions there.
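A GitOps-compatible remediation would have to land as a commit rather than a live edit, roughly like the sketch below; the repo layout, branch name, and commit message are assumptions:

```python
# Sketch: the "fix" lands as a commit on a review branch instead of a live
# kubectl edit, so Flux reconciles it rather than reverting it.
import subprocess

def propose_fix(repo_dir: str, manifest_path: str, new_contents: str) -> None:
    def git(*args: str) -> None:
        subprocess.run(["git", "-C", repo_dir, *args], check=True)

    git("checkout", "-b", "parity/proposed-fix")       # assumed branch name
    with open(f"{repo_dir}/{manifest_path}", "w") as f:
        f.write(new_contents)                          # e.g. corrected Deployment YAML
    git("add", manifest_path)
    git("commit", "-m", "Proposed fix: correct pod config")
    # Push for human review; nothing reaches the cluster until it is merged
    # and Flux reconciles the change.
    git("push", "-u", "origin", "parity/proposed-fix")
```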
Personally, the stuff that takes the most troubleshooting time is Kubernetes infrastructure. The network CNI is acting up. The ingress controller is missing proper path-based routing. A NetworkPolicy says no to a pod talking to the Postgres server. cert-manager is on strike and a certificate has expired. If the LLM is quick at identifying those, it has some uses, but selling me on "a dev made a mistake with pod config" is not likely to move the needle, because I'm already really quick at identifying that.
Maybe I'm not the target market, and the target market is a small dev team that bought Kubernetes without realizing what they were signing up for.
This scares me. If I were confident enough in the runbook steps, they'd already be automated by a program. If it's a runbook and not a program, either it's really new or there's some subtle nuance around it. "AI" is cool, and humans aren't perfect, but in this scenario I'd still prefer the judgment of a skilled operator who knows the business.
> our agents exclusively execute read-only commands
How is this enforced?
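One plausible enforcement mechanism, at the API-server level rather than via prompt instructions, is to bind the agent's ServiceAccount to a role that carries only read verbs. A sketch with the official Kubernetes Python client; the role name and resource list are assumptions, not Parity's actual setup:

```python
# Sketch: a ClusterRole that can only ever read; no write verbs at all.
from kubernetes import client, config

config.load_kube_config()
read_only = client.V1ClusterRole(
    metadata=client.V1ObjectMeta(name="parity-agent-read-only"),  # assumed name
    rules=[client.V1PolicyRule(
        api_groups=["", "apps", "batch"],
        resources=["pods", "pods/log", "events", "deployments", "jobs"],
        verbs=["get", "list", "watch"],  # no create/update/patch/delete
    )],
)
client.RbacAuthorizationV1Api().create_cluster_role(body=read_only)
```

A ClusterRoleBinding from the agent's ServiceAccount to this role would then make the API server reject any write verb the agent attempts, regardless of what the model generates.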
The RCA is the better feature of this tool, in my opinion.
I would not want any data about my infrastructure sent to a public LLM, regardless of how sanitized things are.
Otherwise, on paper it seems cool. But I worry about getting complacent with this tech. It is going to fail; that is just the reality. We know LLMs will hallucinate, and there is not much we can do about it; it is the nature of the tech.
So it might work most of the time, but when it doesn't, you're bashing your head against the wall trying to figure out what is broken while this system tells you all of these things are fine, when one of them actually isn't. And it worked enough times that you trust it, so you don't bother double-checking.
That is before we even talk about having this thing run code for automatic remediation, which I hope no one ever seriously considers doing.
The answer is not "let an AI figure it out."
That is legitimately scary.
Someone needs to explain to me how this is expected to work.
Percentage of hallucinations/errors × steps in runbook = total errors
0.05 × 10 = 0.5 = 50%
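Strictly speaking, the linear sum is an upper bound; compounding the per-step error rate gives a slightly lower but still alarming figure:

```python
# Quick check of the arithmetic above: with an independent 5% failure chance
# per step, the probability of at least one bad step in a 10-step runbook is
# 1 - 0.95**10, a bit less than the linear sum suggests.
p_step, steps = 0.05, 10
p_any_error = 1 - (1 - p_step) ** steps
print(f"{p_any_error:.0%}")  # prints 40%
```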