July 22nd, 2024

No More Blue Fridays

Future computers aim to avoid crashes from bad updates, like a recent global outage caused by a security company's flawed update. eBPF technology offers secure kernel execution to prevent such incidents.


In a recent blog post, Brendan Gregg argues that future computers will no longer crash due to bad software updates, even those involving kernel code. The post discusses the major outage of July 19th, 2024, in which a security company's update caused blue screens of death and boot loops on Windows systems worldwide. The outage illustrates the risks of kernel programming, and the post points to eBPF (extended Berkeley Packet Filter) as a way to prevent such crashes: eBPF is a secure kernel execution environment in which programs are safety-checked by a software verifier and run in a sandbox. The post also notes that various companies have adopted eBPF for security purposes, citing Cisco's acquisition of Isovalent for an eBPF security product. While acknowledging some bugs in eBPF management code, the post presents the technology as a way to enhance security, reduce resource usage, and prevent system crashes, and it encourages companies to make eBPF a requirement for commercial software that includes kernel drivers or modules, in order to mitigate risks during software deployment.

AI: What people are saying
The article on eBPF technology and its potential to prevent system crashes from bad updates has sparked a diverse discussion. Key points and common themes include:
  • eBPF's Current Limitations: Several commenters highlight that eBPF, especially on Windows, is not yet ready to replace traditional kernel-space anti-malware drivers and has limited hooks available.
  • Verification and Safety Concerns: There is skepticism about the claim that eBPF can completely prevent crashes, with some pointing out that eBPF itself has caused kernel crashes in the past.
  • Alternative Solutions: Some suggest that traditional methods like canary testing and staged rollouts are simpler and effective ways to prevent crashes from bad updates.
  • Complexity and Risk: Concerns are raised about adding more layers of complexity with eBPF, which could introduce new risks and challenges.
  • Broader Issues: A few comments touch on the broader social and corporate pressures that contribute to software failures, suggesting that technology alone cannot solve these problems.
56 comments
By @mrpippy - 3 months
> Once Microsoft's eBPF support for Windows becomes production-ready, Windows security software can be ported to eBPF as well.

This doesn’t seem grounded in reality. If you follow the link to the “hooks” that Windows eBPF makes available [1], it’s just for incoming packets and socket operations. IOW, MS is expecting you to use the Berkeley Packet Filter for packet filtering. Not for filtering I/O, or object creation/use, or any of the other million places a driver like Crowdstrike’s hooks into the NT kernel.

In addition, they need to be in the kernel in order to monitor all the other 3rd party garbage running in kernel-space. ELAM (early-launch anti-malware) loads anti-malware drivers first so they can monitor everything that other drivers do. I highly doubt this is available to eBPF.

If Microsoft intends eBPF to be used to replace kernel-space anti-malware drivers, they have a long, long way to go.

[1]: https://microsoft.github.io/ebpf-for-windows/ebpf__structs_8...

By @kevin_nisbet - 3 months
I hate to disagree with someone like Brendan Gregg, but I'm hoping vendors in this space take a more holistic approach to investigating the complete failure chain. I personally tend to get cautious when there is a proposal that x will solve the problem that occurred on y date, especially 3 days after the failure. It may be true, but if we don't do the analysis we could leave ourselves open to blind spots. There may also be plenty of alternative approaches that should be considered and appropriately discarded.

I think the part I specifically dispute is the claim that the only negative outcome is wasted CPU cycles. That's likely the case for this class of bug, but there are plenty of failure modes where a bad ruleset could badly brick a system and make it hard to recover.

That's not to say eBPF-based security modules aren't the right choice for many vendors, just that we should understand what risks they do and do not avoid, and what part of the failure chain they specifically address.

By @kayo_20211030 - 3 months
This isn't right. If I need a system to run with a piece of code, then it shouldn't run at all if that piece of code is broken. Ignoring the failure is perverse. Let's say that the driver code ensures that some medical machine has safety locks (safeguards) in place to make sure that piece of equipment won't fry you to a crisp; I'd prefer that the whole thing not run at all rather than blithely operate with the safeguards disabled. It's turtles all the way down.
By @amluto - 3 months
> In the future, computers will not crash due to bad software updates, even those updates that involve kernel code. In the future, these updates will push eBPF code.

eBPF is fantastic, and it can be used for many purposes and improve a lot of things, but this is IMO overselling it. Assuming that BPF itself is free of bugs, it’s still a rather large sprawl of kernel hooks, and those hooks invoke eBPF code, which can call right back into the kernel. Here’s a list:

https://www.man7.org/linux/man-pages/man7/bpf-helpers.7.html

bpf_probe_read_kernel() is particularly heavily used, and it is not safe. It tries fairly hard not to OOPS or crash, but it is definitely not perfect.

The rest of that list contains plenty of things that will easily take down a system, even if it doesn’t actually oops or panic in the process.

And, of course, any tool that detects userspace “malicious behavior” and stops it can start calling everything malicious, and the computer becomes unusable.

Meanwhile, eBPF has no real security model on the userspace side. Actual attachment of an eBPF program goes through the bpf() syscall, not through sensibly permissioned operations on the underlying kernel objects being attached to, and there is nothing whatsoever that confines eBPF to, say, a container that uses it. (See bpf_probe_read_kernel() -- it's fundamentally able to read all kernel memory.)

So, IMO, most of the benefit of eBPF over ordinary kernel C code is that eBPF is kind of like writing code in a safe language with a limited unsafe API surface. It's a huge improvement for this sort of work, but it is not perfect by any means.

> The verifier is rigorous -- the Linux implementation has over 20,000 lines of code

The verifier is absurdly complex. I'd rather see something based on formal methods than 20kLOC of hand-written logic.

By @uticus - 3 months
> eBPF programs cannot crash the entire system because they are safety-checked by a software verifier and are effectively run in a sandbox.

Isn’t one of the purposes of an OS to police software? I get that this has to do with the OS itself, but what does watching the watchers accomplish other than adding a layer which must then be watched?

Why not reduce complexity instead of naively trusting that the new complexity will be better long term?

By @brundolf - 3 months
This sounds like a cool technology, but this was the really egregious problem:

> There are other ways to reduce risks during software deployment that can be employed as well: canary testing, staged rollouts, and "resilience engineering" in general

You don't need a new technology to implement basic industry-standard quality control

By @__MatrixMan__ - 3 months
Maybe we should start taking Fridays off to commemorate the event, which probably would have been less bad if more people spent less time with their nose to the grindstone and had more time to stop and think about how it all was shaping up and how they could influence that shape.
By @muth02446 - 3 months
> The verifier is rigorous -- the Linux implementation has over 20,000 lines of code -- with contributions from industry (e.g., Meta, Isovalent, Google) and academia (e.g., Rutgers University, University of Washington). The safety this provides is a key benefit of eBPF, along with heightened security and lower resource usage.

Wow, 20k is not exactly encouraging. Besides the extra attack surface, who can vouch for such a large code base?
By @the8472 - 3 months
If the filters are loaded at boot and hook into everything then a bug can still lock down the system to a point where it can't be operated or patched anymore (e.g. because you loaded an empty whitelist). So it could end up replacing a boot loop with another form of DoS.

If Microsoft includes a hardcoded whitelist that covers some essentials needed for recovery, that could make a bug in such a tool easier to fix, but it could still cause effective downtime (system running but unusable) until such a fix is delivered.
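
The idea in the comment above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `RECOVERY_ESSENTIALS` and `is_allowed` and the program names are hypothetical, and nothing here reflects how Windows or any real security product evaluates policy. It shows why an empty or broken downloaded allowlist would not block the tools needed to deliver a fix.

```python
# Hypothetical, built-in set of programs that must always be able to run
# so that a broken policy update can still be repaired.
RECOVERY_ESSENTIALS = frozenset({"update-agent", "login-shell"})


def is_allowed(program: str, loaded_allowlist: set) -> bool:
    """Allow a program if it is a recovery essential or is on the loaded allowlist.

    The loaded allowlist comes from an update and may be empty or wrong;
    the hardcoded set is consulted first and cannot be overridden by it.
    """
    return program in RECOVERY_ESSENTIALS or program in loaded_allowlist
```

With an empty loaded allowlist, `is_allowed("update-agent", set())` still returns `True`, while every other program is denied. That matches the comment's point: the machine stays patchable, but it is effectively down until a corrected allowlist arrives.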

By @throwaway2037 - 3 months
The blog post says:

> eBPF, which is immune to such crashes.

I tried to Google about this, but I cannot find anything definitive. It looks like you can still break things. Can an expert on eBPF please comment on this claim? This is the best that I could find: https://stackoverflow.com/questions/70403212/why-is-ebpf-sai...
By @kaliszad - 3 months
"These security agents will then be safe and unable to cause a Windows kernel crash."

Unless, of course, there is a bug in eBPF (https://access.redhat.com/solutions/7068083) and the kernel panics/BSoDs anyway, @brendangregg, which you do mention later in the article.

By @xg15 - 3 months
> In the future, computers will not crash due to bad software updates, even those updates that involve kernel code. In the future, these updates will push eBPF code.

Assuming every security critical system will be on a recent enough kernel to support this...

By @blinkingled - 3 months
Ok. But the good old push code to staging / canary it before mainstream updates was a simpler way of solving the same problem.

Crowdstrike knows the computers they're running on. It is trivial to implement a system where only a few designated computers download and install the update and report metrics before the update controller decides to push it to the next set.
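
A minimal sketch of the update controller described above, assuming a hypothetical `install` callback that returns `True` when a host reports healthy after the update. The stage fractions and the failure threshold are illustrative defaults, not anything a real vendor is known to use.

```python
from typing import Callable, Dict, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    install: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
    max_failure_rate: float = 0.0,
) -> Dict[str, object]:
    """Push an update in growing stages and halt as soon as a stage looks bad."""
    done = 0  # number of hosts that have already received the update
    failed: List[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts covered after this stage (at least one canary).
        target = max(1, int(len(hosts) * fraction))
        batch = hosts[done:target]
        if not batch:
            continue
        batch_failures = [h for h in batch if not install(h)]
        failed.extend(batch_failures)
        done = target
        # Stop before the next, larger stage if too many hosts in this one failed.
        if len(batch_failures) / len(batch) > max_failure_rate:
            return {"status": "halted", "updated": done, "failed": failed}
    return {"status": "complete", "updated": done, "failed": failed}
```

With 1,000 hosts and an update that fails everywhere, this halts after the first 10 canaries instead of reaching the whole fleet. The trade-off is latency: a real controller would also wait for metrics between stages, which this sketch leaves out.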

By @skywhopper - 3 months
The implicit assumption of the article is that eBPF code can't crash a kernel, but the article itself eventually admits that it can and has done, including last month. eBPF is a safer way of providing kernel-extension functionality, for sure, but presenting it as the perfect solution is just asking to have your argument dismissed. eBPF is not perfect. And there's plenty of things it can't do. The very sandbox rules that limit how long its programs may run and what they can do also make it entirely inappropriate for certain tasks. Let's please stop pretending there's a silver bullet.
By @lazycog512 - 3 months
"The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at and repair."

- Douglas Adams

By @nkozyra - 3 months
I don't do any kernel stuff so I'm out of my element, but doesn't the fact that Crowdstrike & Linux kernel eBPF already caused kernel crashes[1] sort of downplay the rosiness of the state of things?

[1]: https://access.redhat.com/solutions/7068083

By @kjellsbells - 3 months
Let's suppose that eBPF solves this particular problem, eventually, for Windows. Doesn't sidestepping the entire class of Crowdstrike-style fubars require that Microsoft then mandate that no, backward compatibility will not be offered?

Back compat seems to be such a shibboleth in the Windows world, but comes at an incredible price. The reasons cited all seem to boil down to keeping some imagined customers' obscure LOB app running for decades. But that seems like an excuse to me. Surely Microsoft would like to shake out the last diehards running some VB5 app on a patched up PC in a factory. Isn't it more beneficial to everyone to start sunsetting acres of ancient NT code and approaches and streamline the entire attack surface?

By @xyzzy123 - 3 months
So many problems though! Including commercial monocultures, lack of update consent, blast radius issues, etc. There's a commons in our pockets but that is very difficult to regulate for. They will keep putting the gun to your head until you keep choosing the monoculture.
By @titzer - 3 months
WebAssembly is a better choice for sandboxing kernel code. It has a full formal specification with a mechanized proof of type safety, many high-performance implementations, broad toolchain support, is targetable from many languages, and a capability security model.
By @3np - 3 months
> The worst thing an eBPF program can do is to merely consume more resources than is desirable, such as CPU cycles and memory.

This is obviously not true. It might be the worst it can do, by itself, to the currently running kernel. It's not the worst it can do to the machine or its user(s).

There are infinite harmful things an eBPF program can do. As can programs solely in user-space. There is a specific class of vulnerabilities being mitigated by moving code from kernel to BPF. That does not mean that eBPF programs are in general safe.

By @usrme - 3 months
Does anyone know how far along the eBPF implementation for Windows actually is? In the sense that it could start feasibly replacing existing kernel drivers.
By @tgtweak - 3 months
Even if Microsoft rolls out eBPF and mainstreams it, it will be years before everything is ported over, and it still won't address legacy Windows versions (which appear to be a good chunk of what was impacted).

It's a move in the right direction but it probably won't fully mitigate issues like this for another 5+ years.

By @CodeWriter23 - 3 months
> an unprecedented example of the inherent dangers of kernel programming

I take issue with that. Kernel programming was not to blame; looking up addresses from a file and accessing those memory locations without any validation is. The same technique would yield the same result at any Ring.
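
To make the commenter's point concrete, here is a small sketch of validating values read from a file before using them as indices. It illustrates the general technique only and is not a reconstruction of any vendor's code; the function name, the little-endian `u32` layout, and `table_size` are assumptions made for the example.

```python
import struct
from typing import List


def read_validated_offsets(blob: bytes, table_size: int) -> List[int]:
    """Parse little-endian u32 offsets and reject any outside [0, table_size)."""
    if len(blob) % 4 != 0:
        raise ValueError("truncated offset table")
    offsets = [value for (value,) in struct.iter_unpack("<I", blob)]
    for offset in offsets:
        # Unsigned values cannot be negative, so only the upper bound needs checking.
        if offset >= table_size:
            raise ValueError(
                "offset %d out of range for table of size %d" % (offset, table_size)
            )
    return offsets
```

A malformed file then produces a rejected update rather than an out-of-bounds access, and that holds whether the consuming code runs in the kernel or in user space.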

By @twen_ty - 3 months
Can someone tell me what the advantage of eBPF is over a user-mode driver? The article makes it look like eBPF is a have-your-cake-and-eat-it-too solution, which seems too good to be true. Can you run graphics drivers in eBPF, for example?
By @Yawrehto - 3 months
1. How does eBPF solve this? It makes it more difficult, sure, but it'll almost always be possible to cause a crash, if you try hard enough.

2. More importantly, the problem is rarely fixable by changing technology, because typically, problems are caused by people and their connections: social/corporate pressures, profit-seeking, mental health being treated as unimportant, et cetera. eBPF can't fix those, and as long as corporations have social structures that penalize thoroughness and caution, and incentivize getting 'the most stuff' done, this will persist as a problem.
By @tracker1 - 3 months
I don't buy it... didn't a bug from RedHat + Crowdstrike have a similar panic issue? I understand in that case it was because of RedHat, but still. I don't think this, by itself will change much.
By @WaitWaitWha - 3 months
eBPF == extended Berkeley Packet Filter

https://en.wikipedia.org/wiki/Berkeley_Packet_Filter

By @dveeden2 - 3 months
So eBPF is giving us eBFP (enhanced Blue Friday Protection)?
By @mschuster91 - 3 months
> If your company is paying for commercial software that includes kernel drivers or kernel modules, you can make eBPF a requirement. It's possible for Linux today, and Windows soon. While some vendors have already proactively adopted eBPF (thank you), others might need a little encouragement from their paying customers.

How about Microsoft's large government and commercial customers make it a requirement that MS does not develop a single new feature for the next two fucking years or however long it takes to go through the entirety of the Windows+Office+Exchange code base and to make sure there are no security issues in there?

We don't need ads in the start menu, we don't need telemetry, we don't need desktop Outlook becoming a rotten slow and useless web app, we don't need AI, we certainly don't need Recall. We need an OS environment that doesn't need a Patch Tuesday where we have to check if the update doesn't break half the canary machines.

And while MS is at that they can also take the goddamn time and rework the entire configuration stack. I swear to god, it drives me nuts. There's stuff that's only accessible via the registry (and there is no comprehensive documentation showing exactly what any key in the registry can do - large parts of that are MS-internal!), there's stuff only accessible via GPO, there's stuff hidden in CPLs dating back to Windows 3.11, and there's stuff in Windows' newest UI/settings framework.

By @jeffrallen - 3 months
Here's an idea for an interesting hack: a piece of kernel-resident code that feeds fake data into eBPF so that an eBPF-based antimalware will see nothing bad as the malware goes about its merry way.

Sandboxes are safe, but are ultimately virtual machines, and virtual machines can be made to live in a world that's not real.

By @yubiox - 3 months
Title reminds me of when Microsoft promised no more UAEs back in '92. They just renamed them to GPFs in Windows 3.1.
By @egorfine - 3 months
One option to prevent this is to not run corporate spyware. But I guess for some industries this isn't an option.
By @datadeft - 3 months
It is great that we need a Linux kernel feature to be ported to Windows so we don’t have blue Fridays
By @CoastalCoder - 3 months
> If your company is paying for commercial software that includes kernel drivers or kernel modules, you can make eBPF a requirement.

Are they saying that device drivers should be written in eBPF?

Or maybe their drivers should expose an eBPF API?

I assume some driver code still needs to reside in the actual kernel.

By @wiresurfer - 3 months
Hey Brendan,

> If your company is paying for commercial software that includes kernel drivers or kernel modules, you can make eBPF a requirement.

"Windows soon" may still be at least a year away. Would that be a fair statement? "At least" being the operative phrase here.

Specifically in the context of network security software, for eBPF programs to be portable across Windows/Linux, we would need MSFT to add a lot more hooks and expose internal kernel structs, hopefully via a common libbpf definition. Otherwise, I fear, having two versions of the same product across two OSs would mean more security and quality issues.

I guess the point I am trying to make is, we will get there, but we are more than a few years away. I would love to see something like Cilium on vanilla Windows for a software-defined, company-wide network. We can then start building enterprise network security into it. Baby steps!

---

btw, your talks and blog posts about bpftools are a godsend!

By @vfclists - 3 months
Yep, another fix to all our problems, a new bandwagon to be jumped on by all EDR vendors, until ...

Here I am using the term "EDR". Until this CrowdStrike debacle I'd never heard it.

Only tells how seriously you should take my opinions.

By @throw0101d - 3 months
Meta:

> eBPF (no longer an acronym) […]

Any reason why the official acronym was done away with?

By @ninju - 3 months
So a couple of questions

1) Is CrowdStrike Falcon using eBPF for their Linux offering?

2) Would the faulty patch update get caught by the eBPF verifier?

By @rezonant - 3 months
> the company behind this outage was already in the process of adopting eBPF, which is immune to such crashes

Oh I'm sure they'll find a way.

By @fullspectrumdev - 3 months
This puts an awful lot of stock in the robustness of eBPF.

Which is odd, given there’s been a bunch of kernel privesc bugs using eBPF…

By @0xbadcafebee - 3 months
> In the future, computers will not crash due to bad software updates

I'm still waiting on my flying car...

By @ksec - 3 months
The article mentions Windows and Linux. Does anyone know if there will be eBPF for FreeBSD?
By @Scene_Cast2 - 3 months
How much extra security does this provide on top of HLK?
By @userbinator - 3 months
> In the future, computers will not crash due to bad software updates, even those updates that involve kernel code.

100% BS. Even if they don't "crash" they will "stop functioning as intended" which is just the same. It's absolutely disgusting how this industry is now using this one outage as a talking point to further their totalitarian agenda.

It reminds me of how Google went after adblockers with their new extension model that also promised more "security". It's time we realised what they're really trying to do. In fact, I wonder whether this outage was not accidental after all.

By @klooney - 3 months
First io_uring, now eBPF. Kind of wild.
By @asynchronous - 3 months
Is there a reason for the lack of naming+shaming Crowdstrike in this blogpost? Was it to not give them any more publicity, good or bad?
By @7e - 3 months
eBPF will be an improvement, I’m sure, but does not mean the end of bugs/DoS in software.
By @odyssey7 - 3 months
"The verifier is rigorous"

But the appeal-to-authority evidence that the article presents is not.

"-- the Linux implementation has over 20,000 lines of code -- with contributions from industry (e.g., Meta, Isovalent, Google) and academia (e.g., Rutgers University, University of Washington). The safety this provides is a key benefit of eBPF, along with heightened security and lower resource usage."

By @ReleaseCandidat - 3 months
Sorry, but neither eBPF nor Rust nor formal verification nor ... is going to solve that problem. Repeat after me: there are no technical solutions to social problems. As long as the result of such an outage is basically a "oh, a software problem! shrug", _nothing_ will change.
By @bfrog - 3 months
I wonder if microkernels ever had this kind of bullshit. Had it been a microkernel, would we all be sitting twiddling our thumbs on Friday? Hot take: No.
By @shrx - 3 months
From the article:

> If the verifier finds any unsafe code, the program is rejected and not executed. The verifier is rigorous -- the Linux implementation has over 20,000 lines of code [0] -- with contributions from industry (e.g., Meta, Isovalent, Google) and academia (e.g., Rutgers University, University of Washington).

[0] links to https://github.com/torvalds/linux/blob/master/kernel/bpf/ver... which has this interesting comment at the top:

    /* bpf_check() is a static code analyzer that walks eBPF program
     * instruction by instruction and updates register/stack state.
     * All paths of conditional branches are analyzed until 'bpf_exit' insn.
     *
     * The first pass is depth-first-search to check that the program is a DAG.
     * It rejects the following programs:
     * - larger than BPF_MAXINSNS insns
     * - if loop is present (detected via back-edge)
    ...
I haven't inspected the code, but I thought that checking for infinite loops would imply solving the halting problem. Where's the catch?
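
One way to read the quoted source comment: the check does not decide whether an arbitrary program halts. It rejects every program whose control-flow graph contains a back-edge, so anything it accepts is loop-free and terminates trivially, at the cost of also rejecting loops that would have terminated. The sketch below shows back-edge detection with a depth-first search. It is an illustration of the technique only, not the kernel's implementation, and the graph encoding (a dict from instruction index to successor indices) is an assumption for the example.

```python
from typing import Dict, List


def has_back_edge(cfg: Dict[int, List[int]], entry: int = 0) -> bool:
    """Return True if a DFS from `entry` finds an edge back to a node on the current path."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, fully explored
    color = {entry: GRAY}
    stack = [(entry, iter(cfg.get(entry, [])))]
    while stack:
        node, successors = stack[-1]
        for nxt in successors:
            state = color.get(nxt, WHITE)
            if state == GRAY:
                return True  # edge to an ancestor on the path: a loop
            if state == WHITE:
                color[nxt] = GRAY
                stack.append((nxt, iter(cfg.get(nxt, []))))
                break  # descend into the newly discovered node
        else:
            # All successors explored without finding a back-edge through this node.
            color[node] = BLACK
            stack.pop()
    return False
```

A branching but loop-free graph such as `{0: [1, 2], 1: [3], 2: [3], 3: []}` passes, while `{0: [1], 1: [2], 2: [1, 3], 3: []}` is rejected because of the jump from 2 back to 1. Together with the instruction-count limit in the quoted comment, this makes the check decidable by being conservative.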
By @joker99 - 3 months
I used to work for an EDR vendor, and this post glosses over two major and important things.

1. There’s no need for eBPF on Windows: it has the ETW framework (event tracing), which is much more powerful and gives applications subscribing to a class of events almost too-detailed insights. The issue most AV vendors have with it, though, is speed. Leading to …

2. eBPF lets you watch. Congrats. It’s something, but it’s not the reason why these tools are deployed. Orgs deploy these tools to prevent or stop potentially bad stuff from executing. The only place this can be done in our operating systems is usually the kernel; for that you need kernel-level drivers or various other filter drivers.

Crowdstrike screwed the pooch here, yes. But after a couple of days I feel like I haven’t read enough blog posts and articles that crap on Microsoft. It’s their job to build a secure operating system; instead they deliver Windows, and because they themselves cannot secure Windows, they ship Defender… and we use tools like Falcon as a band-aid for Microsoft’s bad security practices.

By @risenshinetech - 3 months
Thank God some superheroes have finally come along to make sure code never crashes any computers ever again! /s