July 27th, 2024

Show HN: I built an open-source tool to make on-call suck less

Opslane is an open-source tool that reduces alert fatigue by classifying alerts, integrating with Slack for insights, and utilizing large language models for analysis. It supports Datadog integration and requires Docker for installation.

Opslane is a tool designed to improve the on-call experience by minimizing alert fatigue through effective alert classification. It distinguishes between actionable and noisy alerts, providing contextual information to aid in alert management. Opslane integrates with Slack, allowing users to receive insights and debugging resources directly within their Slack channels. The tool utilizes large language models to analyze alert history and Slack conversations for classification purposes. It also generates weekly reports on alert quality and enables users to silence noisy alerts from Slack.
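
The summary does not include Opslane's actual prompts or model calls; purely as an illustration of the classification step described above, a minimal sketch might look like the following, where the model name, prompt, and labels are assumptions rather than Opslane's real implementation:

```python
# Hypothetical sketch of LLM-based alert classification; not Opslane's actual code.
# Assumes the OpenAI Python client and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def classify_alert(alert_text: str, recent_history: list[str]) -> str:
    """Return 'actionable' or 'noisy' for an alert, given recent similar alerts."""
    history_blob = "\n".join(recent_history[-20:])  # arbitrary cutoff of 20 past alerts
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system",
             "content": "Classify the alert as 'actionable' or 'noisy'. "
                        "Use the history of similar alerts to judge noisiness."},
            {"role": "user",
             "content": f"Alert:\n{alert_text}\n\nRecent similar alerts:\n{history_blob}"},
        ],
    )
    label = response.choices[0].message.content.strip().lower()
    return "noisy" if "noisy" in label else "actionable"
```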

The architecture of Opslane is modular, consisting of an alert ingestion system that receives alerts from Datadog via webhooks, a FastAPI server for processing alerts, a Slack integration for user interaction, and a Postgres database with pgvector for data storage. Currently, Opslane supports integration with Datadog.
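
The endpoint path, table schema, and connection details below are not taken from the repository; they are a hedged sketch of the ingestion path just described (Datadog webhook to FastAPI to Postgres), with a comment noting where pgvector embeddings would fit:

```python
# Hypothetical ingestion sketch mirroring the described architecture
# (Datadog webhook -> FastAPI -> Postgres); endpoint path, table, and DSN are assumptions.
from fastapi import FastAPI, Request
import psycopg  # psycopg 3

app = FastAPI()
DB_DSN = "postgresql://opslane:opslane@localhost:5432/opslane"  # placeholder credentials

@app.post("/webhook/datadog")
async def ingest_datadog_alert(request: Request):
    payload = await request.json()
    title = payload.get("title", "")
    body = payload.get("body", "")
    # In the real system an embedding of the alert text would also be stored in a
    # pgvector column so similar past alerts can be retrieved during classification.
    with psycopg.connect(DB_DSN) as conn:
        conn.execute(
            "INSERT INTO alerts (title, body) VALUES (%s, %s)",
            (title, body),
        )
    return {"status": "received"}
```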

To install Opslane, users need Docker, a Slack workspace, and a Datadog account. The setup involves cloning the repository, configuring environment variables with necessary API keys, and running the Docker container. Once installed, users can add the Opslane bot to their Slack channel and configure Datadog to send alerts to Opslane's webhook endpoint, allowing the tool to analyze alerts and provide insights in real-time. Opslane is open source, encouraging community contributions and collaboration. For further information, the Opslane GitHub repository is available for access.
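
After setup, a quick smoke test can confirm the webhook is reachable; the port, path, and payload fields here are guesses for illustration, so check the repository's README for the real ones:

```python
# Hypothetical smoke test: send a Datadog-style alert to a locally running instance.
# The URL and payload shape are assumptions, not taken from the Opslane docs.
import requests

test_alert = {
    "title": "[Triggered] High error rate on checkout-service",
    "body": "Error rate > 5% for the last 10 minutes",
    "priority": "P2",
}

resp = requests.post("http://localhost:8000/webhook/datadog", json=test_alert, timeout=10)
print(resp.status_code, resp.text)
```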

Related

A Eulogy for DevOps

DevOps, introduced in 2007 to improve development and operations collaboration, faced challenges like centralized risks and communication issues. Despite advancements like container adoption, obstacles remain in managing complex infrastructures.

Open-Source Perplexity – Omniplex

The Omniplex open-source project on GitHub focuses on core functionality, Plugins Development, and Multi-LLM Support. It utilizes TypeScript, React, Redux, Next.js, Firebase, and integrates with services like OpenAI and Firebase. Community contributions are welcomed.

Show HN: I am building an open-source incident management platform

The GitHub repository for Incidental, an open-source incident management platform, provides ChatOps for Slack, a Web UI, custom fields, and workflows. Users need specific tools for setup. Detailed instructions are available.

Dynolog: Open-Source System Observability

Dynolog is an open-source observability tool for optimizing AI applications on distributed CPU-GPU systems. It offers continuous monitoring of performance metrics, integrates with PyTorch Profiler and Kineto CUDA profiling library, and supports GPU monitoring for NVIDIA GPUs and CPU events for Intel and AMD CPUs. Developed in Rust, Dynolog focuses on Linux platforms to enhance AI model observability in cloud environments.

Diverse ML Systems at Netflix

Netflix utilizes data science and machine learning through Metaflow, Fast Data, Titus, and Maestro to support ML systems efficiently. The platform enables smooth transitions from prototypes to production, aiding content decision-making globally.

AI: What people are saying
The comments on Opslane highlight various perspectives on the tool's approach to alert management and its implications for on-call engineers.
  • Concerns about relying on large language models (LLMs) for classifying alerts, with some arguing it may not address the root causes of alert fatigue.
  • Discussion on the importance of improving observability and alert systems rather than just filtering alerts.
  • Suggestions for better on-call management practices, including scheduling and training improvements.
  • Mixed feelings about the integration with Slack, with some advocating for a more platform-agnostic approach.
  • Recognition of the need for cultural changes within organizations to effectively manage on-call responsibilities.
36 comments
By @dclowd9901 - 6 months
> It reduces alert fatigue by classifying alerts as actionable or noisy and providing contextual information for handling alerts.

grimace face

I might be missing context here, but this kind of problem speaks more to a company’s inability to create useful observability, or worse, their lack of conviction around solving noisy alerts (which upon investigation might not even be “just” noise)! Your product is welcome and we can certainly use more competition in this space, but this aspect of it is basically enabling bad cultural practices and I wouldn’t highlight it as a main selling point.

By @jedberg - 6 months
People do not understand the value of classifying alerts as useful after the fact.

At Netflix we built a feature into our alert systems that added a simple button at the top of every alert that said, "Was this alert useful?". Then we would send the alert owners reports about what percent of people found their alert useful.

It really let us narrow in on which alerts were most useful so that others could subscribe to them, and which were noise, so they could be tuned or shut off.

That one button alone made a huge difference in people's happiness with being on call.
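
Netflix's internal tooling is not public, but the reporting loop described above is easy to picture; a toy aggregation of "Was this alert useful?" votes, with invented data shapes, could look like:

```python
# Toy aggregation of usefulness votes per alert, in the spirit of the feature described above.
# The feedback records and report format are invented for illustration.
from collections import defaultdict

feedback = [
    {"alert": "disk_space_low", "useful": True},
    {"alert": "disk_space_low", "useful": False},
    {"alert": "api_5xx_spike", "useful": True},
    {"alert": "api_5xx_spike", "useful": True},
]

counts = defaultdict(lambda: {"useful": 0, "total": 0})
for vote in feedback:
    counts[vote["alert"]]["total"] += 1
    if vote["useful"]:
        counts[vote["alert"]]["useful"] += 1

# Report sorted so the noisiest (least useful) alerts surface first.
for alert, c in sorted(counts.items(), key=lambda kv: kv[1]["useful"] / kv[1]["total"]):
    pct = 100 * c["useful"] / c["total"]
    print(f"{alert}: {pct:.0f}% found useful ({c['total']} votes)")
```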

By @aflag - 6 months
It feels to me that using an LLM to classify alerts as noisy just adds risk instead of fixing the root cause of the problem. If an alert is known to be noisy and has appeared on Slack before (which is how the LLM would figure out it's a noisy alert), then just remove the alert? Otherwise, how will the LLM know it's noise? Either it will correctly annoy you or hallucinate a reason to decide that alert is just noise.
By @ravedave5 - 6 months
The goal for oncall should be to NEVER get called. If someone gets called when they are oncall their #1 task the next day is to make sure that call never happens again. That means either fixing a false alarm or tracking down the root cause of the call. Eventually you get to a state where being called is by far the exception instead of the norm.
By @Jolter - 6 months
Telecoms solved this problem fifteen years ago when they started automating Fault Management (google it).

Granted, neural networks were not generally applicable to this problem at the time, but this whole idea seems like the same problem being solved again.

Telecoms and IT used to supervise their networks using Alarms, in either a Network Management System (NMS) or something more ad-hoc like Nagios. There, you got structured alarms over a network, like SNMP traps, that got stored as records in a database. It’s fairly easy to program filters using simple counting or more complex heuristics against a database.

Now, for some reason, alerting has shifted to Slack. Naturally since the data is now unstructured text, the solution involves an LLM! You build complexity into the filtering solution because you have an alarm infrastructure that’s too simple.
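
The counting heuristics described here do not require an LLM; as a minimal example, suppressing an alarm type that has flapped repeatedly in the last hour can be a single query over structured alarm records (the schema, data, and threshold below are invented):

```python
# Minimal counting heuristic over structured alarms: suppress an alarm type that has
# fired more than N times in the last hour. Schema, sample data, and threshold are invented.
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alarms (alarm_type TEXT, raised_at TEXT)")

now = datetime.now(timezone.utc)
for minutes_ago in (5, 12, 20, 33, 47):  # a flapping alarm
    conn.execute("INSERT INTO alarms VALUES (?, ?)",
                 ("link_down_port7", (now - timedelta(minutes=minutes_ago)).isoformat()))

def should_suppress(alarm_type: str, threshold: int = 3) -> bool:
    cutoff = (now - timedelta(hours=1)).isoformat()
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM alarms WHERE alarm_type = ? AND raised_at > ?",
        (alarm_type, cutoff),
    ).fetchone()
    return count >= threshold

print(should_suppress("link_down_port7"))  # True: too noisy, route to a digest instead of paging
```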

By @mads_quist - 6 months
Founder of All Quiet here: https://allquiet.app.

We're building a tool in the same space but opted out of using LLMs. We've received a lot of positive feedback from our users who explicitly didn't want critical alerts to be dependent on a possibly opaque LLM. While I understand that some teams might choose to go this route, I agree with some commentators here that AI can help with symptoms but doesn't address the root cause, which is often poor observability and processes.

By @RadiozRadioz - 6 months
> Slack-native since that has become the de-facto tool for on-call engineers.

In your particular organization. Slack is one of many instant messaging platforms. Tightly coupling your tool to Slack instead of making it platform agnostic immediately restricts where it can be used.

Other comment threads are already discussing the broader issues with using IM for this job, so I won't go into it here.

Regardless, well done for making something.

By @throw156754228 - 6 months
I don't want to be relying on another flaky LLM for anything mission critical like this.

Just fix the original problem, don't layer an LLM into it.

By @Terretta - 6 months
Note that according to Stack Overflow's dev survey, more devs use Teams than Slack; over 50% were in Teams. (The stat was called popularity but really should have been prevalence, since a related stat showed devs hated Teams even more than they hated Slack.) Teams has APIs too, and with Microsoft Graph you can do a lot more than just Teams for them.

More importantly, and not mentioned by StackOverflow, those devs are among the 85% of businesses using M365, meaning they have "Sign in with Microsoft" and are on teams that will pay. The rest have Google and/or Github.

This means despite being a high-value hacking target (accounts and passwords of people who operate infrastructure, like the person owned from Snowflake last quarter), you don't have to store passwords and therefore can't end up on Have I Been Pwned.
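
For what it's worth, posting an alert into a Teams channel via Microsoft Graph is a small request; the team and channel IDs and the token below are placeholders, the ChannelMessage.Send permission has to be granted first, and token acquisition (e.g. via MSAL) is out of scope here:

```python
# Hedged sketch: post an alert into a Teams channel with Microsoft Graph.
# TEAM_ID, CHANNEL_ID, and ACCESS_TOKEN are placeholders, not real values.
import requests

TEAM_ID = "<team-id>"
CHANNEL_ID = "<channel-id>"
ACCESS_TOKEN = "<oauth-access-token>"

url = f"https://graph.microsoft.com/v1.0/teams/{TEAM_ID}/channels/{CHANNEL_ID}/messages"
resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={"body": {"content": "Opslane: checkout-service error rate is above threshold"}},
)
resp.raise_for_status()
```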

By @voidUpdate - 6 months
Filtering whether a notification is important or not through an LLM, when getting it wrong could cause big issues, is mildly concerning to me...
By @nprateem - 6 months
Almost all alerting issues can be fixed by putting managers on call too (they then have to attend the fix as well).

It suddenly becomes a much higher priority to get alerting in order.

By @asdf6969 - 6 months
I don't really understand the use case. If there's a way to programmatically tell that it's a false alarm, then there must also be a way to not create the alert in the first place.

I've never seen an issue that's conclusively a false alarm without investigating at all. Just delete the alarm? An LLM will never find something like another team accidentally stress testing my service, but it does happen.

Another perfect example is when the Queen died and it looked like an outage for UK users. Can your LLM read the news? ChatGPT doesn't even know if she's alive.

I expect you will need AGI before large companies will trust your product.

By @makmanalp - 6 months
An underrated on-call problem that needs solving is scheduling, IMHO:

- We have a weekday (2 shifts) / weekend (1 slightly longer shift including Friday morning, to allow people to take long weekends) oncall rotation, as well as a group-combined oncall schedule, which gets finicky.

- When people join or leave the rotation, making sure nothing shifts before a certain date, or swapping one person with another without changing the rest, and other such things are a massive pain in the butt.

- Combine this with a company holiday list - usually there are different policies and expectations during those.

- Allow custom shift change times for people in different timezones.

- We have "oncall training" / shadowing for newbies; automate the process of substituting them in gradually, first with a shared daytime rotation and then on their own, etc.

- Make oncall trades (for when you can't make your shift) simpler.

Gripes with PD:

- Pagerduty keeps insisting I'm "always on call" because I'm on level N of a fallback pager chain which makes their "when oncall next" box useless - just let me pick.

- Similarly, pagerduty's google calendar export will just jam in every service you're remotely related to and won't let you pick when exporting, even though it will in their UI. So I can't just have my oncall schedule in google calendar without polluting it to all hell.
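
None of this scheduling logic is claimed by the Opslane post; as a toy illustration of the weekday/weekend split in the wish-list above (ignoring holidays, timezones, shadowing, and trades), a rotation generator could start as simply as:

```python
# Toy on-call rotation: two weekday shifts per day, one combined Friday-to-Monday weekend
# shift. Roster, holiday handling, timezones, and shadowing are all omitted on purpose.
from datetime import date, timedelta
from itertools import cycle

engineers = cycle(["alice", "bob", "carol", "dave"])  # placeholder roster

def build_schedule(start: date, days: int):
    schedule, d, end = [], start, start + timedelta(days=days)
    while d < end:
        if d.weekday() < 4:                       # Mon-Thu: day and night shifts
            schedule.append((d, "day", next(engineers)))
            schedule.append((d, "night", next(engineers)))
            d += timedelta(days=1)
        else:                                     # Fri-Sun: one longer weekend shift
            schedule.append((d, "weekend", next(engineers)))
            d += timedelta(days=7 - d.weekday())  # jump to the following Monday
    return schedule

for day, shift, person in build_schedule(date(2024, 7, 29), 14):
    print(day, shift, person)
```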

By @lmeyerov - 6 months
Big fan of this direction. The architecture resonates! The baselining is interesting; I'm curious how you think about that, especially for bootstrapping initially and ongoing.

We are working on a variant being used more by investigative teams than IT ops - so think IR, fraud, misinfo, etc. - which has similarities but also domain differences. If of interest to someone with an operational infosec background (hunt, IR, secops), and especially US-based, the Louie.AI team is hiring an SE + principal here.

By @CableNinja - 6 months
I get your sentiment, but there's another side of this coin that everyone is forgetting, hilariously.

You can tune your monitoring!

Noisy alert that tends to be a false positive, but not always? Tune the alert to only send if the issue continues for more than a minute, or if the check fails 3 times in a row. There are hundreds of ways to tweak a monitor to match your environment.

Best of all? It takes 30 seconds at most. Find the trigger, adjust slightly, and after maybe 1-2 tries you'll be getting 1 false positive sometimes, and actual alerts when they happen, compared to 99% false alerts all the time.

Oh and did you know any monitoring solution worth its salt can execute things automatically on alerts, and then can alert you if that thing fails?

Also, Slack is not a de facto anything. It's a chat tool in a world of chat tools.
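
The "fails 3 times in a row" tweak mentioned above is mechanical enough to sketch outside any particular monitoring tool; this minimal stateful gate uses an invented threshold and check name:

```python
# Minimal consecutive-failure gate: only page once a check has failed N times in a row.
# The threshold and check results are invented for illustration.
from collections import defaultdict

REQUIRED_CONSECUTIVE_FAILURES = 3
_streaks: dict[str, int] = defaultdict(int)

def should_page(check_name: str, failed: bool) -> bool:
    if not failed:
        _streaks[check_name] = 0          # recovered: reset the streak
        return False
    _streaks[check_name] += 1
    return _streaks[check_name] == REQUIRED_CONSECUTIVE_FAILURES  # page exactly once

# A single blip never pages; a sustained failure pages on its third consecutive result.
for failed in [True, False, True, True, True, True]:
    print(should_page("disk_space", failed))
```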

By @jpb0104 - 6 months
I love this space; stability & response! After my last full-time gig, I was also frustrated with the available tooling and ONLY wanted an on-call scheduling tool with simple calendar integration. So I built: https://majorpager.com/ Not OSS, but very simple and hopefully pretty straightforward to use. I'm certainly wide open to feedback.
By @solatic - 6 months
In my current workplace (BigCo), we know exactly what's wrong with our alert system. We get alerts that we can't shut off, because they (legitimately) represent customer downtime, and whose root cause we either can't identify (lack of observability infrastructure) or can't fix (the fix is non-trivial and management won't prioritize).

Running on-call well is a culture problem. You need management to prioritize observability (you can't fix what you can't show as being broken), then you need management to build a no-broken-windows culture (feature development stops if anything is broken).

Technical tools cannot fix culture problems!

edit: management not talking to engineers, or being aware of problems and deciding not to prioritize fixing them, are both culture problems. The way you fix culture problems, as someone who is not in management, is to either turn your brain off and accept that life is imperfect (i.e. fix yourself instead of the root cause), or to find a different job (i.e. if the culture problem is so bad that it's leading to burnout). In any event, cultural problems cannot be solved with technical tools.

By @maximinus_thrax - 6 months
Nice work, I always appreciate the contribution to the OSS ecosystem.

That said, I like that you're saying this out loud: Slack and other similar comms tooling has always been advertised as a productivity booster due to its 'async' nature. Nobody actually believes this anymore, and coupling it with on-call notifications really closes the lid on that thing.

By @topaztee - 6 months
Co-founder of merlinn here: https://merlinn.co | https://github.com/merlinn-co/merlinn. We're also building a tool in the same space, with the option of choosing your own model (private LLMs), and we're open source with a multitude of integrations.

Good to see more options in this space, especially OSS! I think de-noising is a good feature, given alert fatigue is one of the recurring complaints of on-callers.

By @deepfriedbits - 6 months
Nice job and congratulations on building this! It looks like your copy is missing a word in the first paragraph:

> Opslane is a tool that helps (make) the on-call experience less stressful.

By @Arch-TK - 6 months
We could stop normalising "on-call" instead.
By @snihalani - 6 months
Can you build a cheaper Datadog instead?
By @tryauuum - 6 months
Every time I see notifications in Slack / Telegram it makes me depressed. Text messengers were not designed for this. If you get a "something is wrong" alert, it becomes part of history; it won't re-alert you if the problem is still present. And if you have more than one type of alert, it will be lost in history.

I guess alerts to messengers are OK as long as it's only a couple of manually created ones, and there should be a graphical dashboard to surface the rest of the problems.

By @T1tt - 6 months
is this only on the frontpage because this is an HN company?
By @c0mbonat0r - 6 months
If this is an open-source project, how are you planning to make it a sustainable business? Also, why the choice of Apache 2.0?
By @T1tt - 6 months
How can you prove it works and doesn't hallucinate? Do you have any actual users who have installed it and found it useful?
By @lars_francke - 6 months
Shameless question, tangentially related to the topic.

We are based in Europe and have the problem that some of us sometimes just forget we're on call or are afraid that we'll miss OpsGenie notifications.

We're desperately looking for a hardware solution. I'd like something similar to the pagers of the past, but at least here in Germany they don't really seem to exist anymore. Ideally I'd have a Bluetooth dongle that alerts me on configurable notifications on my phone. Carrying this dongle for the week would be a physical reminder that I'm on call.

Does anyone know anything?

By @7bit - 6 months
> * Alert volume: The number of alerts kept increasing over time. It was hard to maintain existing alerts. This would lead to a lot of noisy and unactionable alerts. I have lost count of the number of times I got woken up by an alert that auto-resolved 5 minutes later.

I don't understand this. Either the issue is important and requires immediate human action -- or the issue can potentially resolve itself and should only ever send an alert if it doesn't after a set grace period.

The way you're trying to resolve this (with increasing alert volumes) is the worst approach to both of the above, and improves nothing.
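
The grace-period rule this commenter describes is similarly mechanical; a sketch with arbitrary timings, which escalates only if the condition is still failing after a delay, might look like:

```python
# Grace-period gate: only alert if the condition is still failing after a waiting period,
# so blips that auto-resolve never page anyone. Timings and callbacks are arbitrary choices.
import time

GRACE_PERIOD_SECONDS = 300  # five minutes

def alert_with_grace_period(check, notify, recheck_interval: int = 30) -> None:
    """check() returns True when healthy; notify(msg) sends the page."""
    deadline = time.monotonic() + GRACE_PERIOD_SECONDS
    while time.monotonic() < deadline:
        if check():
            return  # recovered on its own; nobody gets woken up
        time.sleep(recheck_interval)
    notify("Condition still failing after the grace period; escalating.")
```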

By @protocolture - 6 months
I feel like this would be a great tool for people who have had a much better experience of On Call than I have had.

I once worked for a string of businesses that would just send everything to on call unless engineers threatened to quit. Promised automated late-night customer sign-ups? Haven't actually invested in the website so that it can do that? Just make the on call engineer do it. Too lazy to hire offshore L1 technical support? Just send residential internet support calls to the On Call engineer! Sell a service that doesn't work in the rain? Just send the on call guy to site every time it rains so he can reconfirm that yes, the service sucks. Basic usability questions that could have been resolved during business hours? Does your contract say 24/7 support? Damn, guess that's going to On Call.

Shit, even in contracting gigs where I have agreed to be "On Call" for severity 1 emergencies, small business owners will send you things like service turn-ups or slow-speed issues.

By @EGreg - 6 months
One of the “no-bullshit” positions I have arrived at over the years is that “real-time is a gimmick”.

You don't need that Times Square ad; only 8-10 people will look up. If you just want the footage of your conspicuous consumption, you could have photoshopped it for decades already.

Similarly, chat causes anxiety and lack of productivity. Threaded forums like HN are better. Having a system to prevent problems and the rare emergency is better than having everyone glued to their phones 24/7. And frankly, threads keep information better localized AND give people a chance to THINK about the response and iterate before posting in a hurry. When producers of content take their time, this creates efficiencies for EVERY INTERACTION WITH that content later, and effects downstream. (eg my caps lock gaffe above, I wont go back and fix it, will jjst keesp typing 111!1!!!)

Anyway people, so now we come to today’s culture. Growing up I had people call and wish happy birthday. Then they posted it on FB. Then FB automated the wishes so you just press a button. Then people automated the thanks by pressing likes. And you can probably make a bot to automate that. What once was a thoughtful gesture has become commoditized with bots talking to bots.

Similar things occurred with resumes and job applications etc.

So I say, you want to know my feedback? Add an AI agent that replies back with basic assurances and questions to whoever “summoned you”, have the AI fill out a form, and send you that. The equivalent of front-line call center workers asking “Have you tried turning it on and off again” and “I understand it doesn’t work, but how can we replicate it.”

That repetitive stuff should be done by AI, building up an FAQ knowledge base for bozos, and it should only bother you if it comes across a novel problem it hasn't solved yet, like an emergency because, say, there's a Windows BSOD spreading and systems don't boot up. Make the AI do triage and tell the difference.
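
The "only bother a human for novel problems" idea maps naturally onto a vector store like the pgvector database mentioned earlier; a toy novelty check, with stand-in embeddings and an arbitrary similarity threshold, could look like:

```python
# Toy novelty triage: answer from a knowledge base when a similar issue exists, escalate
# to a human otherwise. Embeddings, documents, and the threshold are stand-ins.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Pretend these vectors came from an embedding model; a real system might use pgvector.
knowledge_base = {
    "disk full on build agents": [0.9, 0.1, 0.0],
    "stale DNS cache after deploy": [0.1, 0.8, 0.2],
}

def triage(issue_embedding, threshold=0.85):
    best_doc, best_score = max(
        ((doc, cosine(issue_embedding, vec)) for doc, vec in knowledge_base.items()),
        key=lambda pair: pair[1],
    )
    if best_score >= threshold:
        return f"auto-reply with runbook: {best_doc}"
    return "novel problem: escalate to the on-call engineer"

print(triage([0.88, 0.12, 0.05]))  # close to 'disk full' -> auto-reply
print(triage([0.00, 0.10, 0.95]))  # nothing similar -> escalate
```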

By @LunarFrost88 - 6 months
Really cool!
By @racka - 6 months
Really cool!

Anyone know of a similar alert UI for data/business alarms (e.g. installs dropping week-over-week, crashes spiking day-over-day, etc.)?

Something that feeds off Snowflake/BigQuery, but with a similar nice UI so that you can quickly see false positives and silence them.

The tools I’ve used so far (mostly in-house built) have all ended in a spammy slack channel that no one ever checks anymore.

By @Flop7331 - 6 months
Is this for missile defense systems or something? What's possibly so important that you need to be woken up for it?
By @theodpHN - 6 months
What you've come up with looks helpful (and may have other applications as someone else noted), but you know what also makes on-call suck less? Getting paid for it, in $ and/or generous comp time. :-)

https://betterstack.com/community/guides/incident-management...

Also helpful is having management that is responsive to bad on-call situations and recognizes when capable, full-time around-the-clock staffing is really needed. It seems too few well-paid tech VPs understand what a 7-Eleven management trainee does, i.e., you shouldn't rely on 1st shift workers to handle all the problems that pop up on 2nd and 3rd shift!

By @throwaway984393 - 6 months
Don't send an alert at all unless it is actionable. Yes, I get it, you want alerts for everything. Do you have a runbook that can explain to a complete novice what is going on and how to fix the problem? No? Then don't alert on it.

The only way to make on-call less stressful is to do the boring work of preparing for incidents, and the boring work of cleaning up after incidents. No magic software will do it for you.

By @sanj001 - 6 months
Using LLMs to classify noisy alerts is a really clever approach to tackling alert fatigue! Are you fine tuning your own model to differentiate between actionable and noisy alerts?

I'm also working on an open source incident management platform called Incidental (https://github.com/incidentalhq/incidental), slightly orthogonal to what you're doing, and it's great to see others addressing these on-call challenges.

Our tech stacks are quite similar too - I'm also using Python 3, FastAPI!