No More Blue Fridays
Future computers aim to avoid crashes from bad updates, like a recent global outage caused by a security company's flawed update. eBPF technology offers secure kernel execution to prevent such incidents.
In a recent blog post by Brendan Gregg, it was highlighted that future computers will no longer crash due to bad software updates, particularly those involving kernel code. The post discussed a significant outage on July 19th caused by a security company's update that led to blue screens of death and boot loops on Windows systems worldwide. The outage emphasized the risks of kernel programming but also pointed out the potential of eBPF (extended Berkeley Packet Filter) technology to prevent such crashes. eBPF, a secure kernel execution environment, offers safety checks by a software verifier and runs programs in a sandbox to avoid system crashes. The post also mentioned the adoption of eBPF by various companies for security purposes, with examples like Cisco's acquisition of Isovalent for an eBPF security product. Despite some bugs in eBPF management code, the technology aims to enhance security, reduce resource usage, and prevent system crashes. The post encouraged companies to consider making eBPF a requirement for commercial software to mitigate risks during software deployment.
Related
How eBPF is shaping the future of Linux and platform engineering
eBPF, developed by Daniel Borkmann, revolutionizes Linux by enabling custom programs in the kernel. It enhances networking, security, and observability, bridging monolithic and microkernel architectures for improved performance and flexibility.
The weirdest QNX bug I've ever encountered
The author encountered a CPU usage bug in a QNX system's 'ps' utility due to a 15-year-old bug. Debugging revealed a race condition, leading to code modifications and a shift towards open-source solutions.
CrowdStrike broke Debian and Rocky Linux months ago
CrowdStrike's faulty update caused a global Blue Screen of Death issue on 8.5 million Windows PCs, impacting sectors like airlines and healthcare. Debian and Rocky Linux users also faced disruptions, highlighting compatibility and testing concerns. Organizations are urged to handle updates carefully.
- eBPF's Current Limitations: Several commenters highlight that eBPF, especially on Windows, is not yet ready to replace traditional kernel-space anti-malware drivers and has limited hooks available.
- Verification and Safety Concerns: There is skepticism about the claim that eBPF can completely prevent crashes, with some pointing out that eBPF itself has caused kernel crashes in the past.
- Alternative Solutions: Some suggest that traditional methods like canary testing and staged rollouts are simpler and effective ways to prevent crashes from bad updates.
- Complexity and Risk: Concerns are raised about adding more layers of complexity with eBPF, which could introduce new risks and challenges.
- Broader Issues: A few comments touch on the broader social and corporate pressures that contribute to software failures, suggesting that technology alone cannot solve these problems.
This doesn’t seem grounded in reality. If you follow the link to the “hooks” that Windows eBPF makes available [1], it’s just for incoming packets and socket operations. IOW, MS is expecting you to use the Berkeley Packet Filter for packet filtering. Not for filtering I/O, or object creation/use, or any of the other million places a driver like Crowdstrike’s hooks into the NT kernel.
In addition, they need to be in the kernel in order to monitor all the other 3rd party garbage running in kernel-space. ELAM (early-launch anti-malware) loads anti-malware drivers first so they can monitor everything that other drivers do. I highly doubt this is available to eBPF.
If Microsoft intends eBPF to be used to replace kernel-space anti-malware drivers, they have a long, long way to go.
[1]: https://microsoft.github.io/ebpf-for-windows/ebpf__structs_8...
I think the part I specifically dispute is the claim that the only negative outcome is wasted CPU cycles. That's likely the case for this class of bug, but there are plenty of failure modes where a bad ruleset could badly brick a system and make it hard to recover.
That's not to say eBPF-based security modules aren't the right choice for many vendors, just that we should understand what risks they do and do not avoid, and what part of the failure chain they specifically address.
eBPF is fantastic, and it can be used for many purposes and improve a lot of things, but this is IMO overselling it. Assuming that BPF itself is free of bugs, it’s still a rather large sprawl of kernel hooks, and those hooks invoke eBPF code, which can call right back into the kernel. Here’s a list:
https://www.man7.org/linux/man-pages/man7/bpf-helpers.7.html
bpf_probe_read_kernel() is particularly heavily used, and it is not safe. It tries fairly hard not to OOPS or crash, but it is definitely not perfect.
The rest of that list contains plenty of things that will easily take down a system, even if it doesn’t actually oops or panic in the process.
And, of course, any tool that detects userspace “malicious behavior” and stops it can start calling everything malicious, and the computer becomes unusable.
Meanwhile, eBPF has no real security model on the userspace side. Actual attachment of an eBPF program goes through the bpf() syscall, not through sensibly permissioned operations on the underlying kernel objects being attached to, and there is nothing whatsoever that confines eBPF to, say, a container that uses it. (See bpf_probe_read_kernel() -- it's fundamentally able to read all kernel memory.)
So, IMO, most of the benefit of eBPF over ordinary kernel C code is that eBPF is kind of like writing code in a safe language with a limited unsafe API surface. It's a huge improvement for this sort of work, but it is not perfect by any means.
> The verifier is rigorous -- the Linux implementation has over 20,000 lines of code
The verifier is absurdly complex. I'd rather see something based on formal methods than 20kLOC of hand-written logic.
Isn’t one of the purposes of an OS to police software? I get that this has to do with the OS itself, but what does watching the watchers accomplish other than adding a layer which must then be watched?
Why not reduce complexity instead of naively trusting that the new complexity will be better long term?
> There are other ways to reduce risks during software deployment that can be employed as well: canary testing, staged rollouts, and "resilience engineering" in general
You don't need a new technology to implement basic industry-standard quality control
If Microsoft includes a hardcoded whitelist that covers some essentials needed for recovery, that could make a bug in such a tool easier to fix, but it could still cause effective downtime (system running but unusable) until such a fix is delivered.
> eBPF, which is immune to such crashes.
I tried to Google this, but I cannot find anything definitive. It looks like you can still break things. Can an expert on eBPF please comment on this claim? This is the best that I could find: https://stackoverflow.com/questions/70403212/why-is-ebpf-sai...
Unless, of course, there is a bug in eBPF (https://access.redhat.com/solutions/7068083), @brendangregg, and the kernel panics or BSoDs anyway, which you mention later in the article.
Assuming every security critical system will be on a recent enough kernel to support this...
CrowdStrike knows the computers they're running on. It is trivial to implement a system where only a few designated computers download and install the update and report metrics before the update controller decides to push it to the next set.
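The staged rollout described in this comment can be sketched roughly as below. This is only an illustration under stated assumptions: the `rings` grouping, the `install` callback (which is assumed to install the update on one host and report whether the host is still healthy), and the `max_failure_rate` threshold are all hypothetical names, not anything CrowdStrike is known to use.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    install: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> List[str]:
    """Push an update ring by ring and halt once a ring looks unhealthy.

    Returns every host the update was attempted on.
    """
    attempted: List[str] = []
    for ring in rings:
        failures = 0
        for host in ring:
            healthy = install(host)  # hypothetical: install, then report health
            attempted.append(host)
            if not healthy:
                failures += 1
        # Decide whether the next ring is allowed to receive the update.
        if ring and failures / len(ring) > max_failure_rate:
            break  # later rings never see the bad update
    return attempted


if __name__ == "__main__":
    rings = [["canary-1", "canary-2"], ["office-1", "office-2"], ["fleet-1"]]
    # A bad update that breaks every host stops after the canary ring.
    print(staged_rollout(rings, install=lambda host: False))
```

With a bad update, only the first ring is affected; the trade-off is that a fully healthy rollout takes longer because each ring has to report back before the next one starts.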
- Douglas Adams
Back compat seems to be such a shibboleth in the Windows world, but comes at an incredible price. The reasons cited all seem to boil down to keeping some imagined customers' obscure LOB app running for decades. But that seems like an excuse to me. Surely Microsoft would like to shake out the last diehards running some VB5 app on a patched up PC in a factory. Isn't it more beneficial to everyone to start sunsetting acres of ancient NT code and approaches and streamline the entire attack surface?
This is obviously not true. It might be the worst it can do, by itself, to the currently running kernel. It's not the worst it can do to the machine or its user(s).
There are infinite harmful things an eBPF program can do. As can programs solely in user-space. There is a specific class of vulnerabilities being mitigated by moving code from kernel to BPF. That does not mean that eBPF programs are in general safe.
It's a move in the right direction but it probably won't fully mitigate issues like this for another 5+ years.
I take issue with that. Kernel programming was not to blame; looking up addresses from a file and accessing those memory locations without any validation is. The same technique would yield the same result at any Ring.
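The validation this comment says was missing can be illustrated with a small sketch. The file layout here (a flat list of little-endian 4-byte offsets into a table of known size) is a made-up assumption for the example, and `load_offsets` is a hypothetical helper, not a description of the actual channel-file format.

```python
from typing import List


def load_offsets(raw: bytes, table_size: int) -> List[int]:
    """Parse little-endian 4-byte offsets and reject the whole file
    if any offset would point outside the table."""
    if len(raw) % 4 != 0:
        raise ValueError("truncated offset file")
    offsets = [
        int.from_bytes(raw[i:i + 4], "little") for i in range(0, len(raw), 4)
    ]
    for off in offsets:
        if off >= table_size:
            raise ValueError(f"offset {off} is outside a table of size {table_size}")
    return offsets


if __name__ == "__main__":
    good = (1).to_bytes(4, "little") + (3).to_bytes(4, "little")
    print(load_offsets(good, table_size=4))
```

The point is that the check happens before any offset is used, so a malformed file is refused as a whole instead of being dereferenced entry by entry.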
How about Microsoft's large government and commercial customers make it a requirement that MS does not develop a single new feature for the next two fucking years or however long it takes to go through the entirety of the Windows+Office+Exchange code base and to make sure there are no security issues in there?
We don't need ads in the start menu, we don't need telemetry, we don't need desktop Outlook becoming a rotten slow and useless web app, we don't need AI, we certainly don't need Recall. We need an OS environment that doesn't need a Patch Tuesday where we have to check if the update doesn't break half the canary machines.
And while MS is at that they can also take the goddamn time and rework the entire configuration stack. I swear to god, it drives me nuts. There's stuff that's only accessible via the registry (and there is no comprehensive documentation showing exactly what any key in the registry can do - large parts of that are MS-internal!), there's stuff only accessible via GPO, there's stuff hidden in CPLs dating back to Windows 3.11, and there's stuff in Windows' newest UI/settings framework.
Sandboxes are safe, but are ultimately virtual machines, and virtual machines can be made to live in a world that's not real.
Are they saying that device drivers should be written in eBPF?
Or maybe their drivers should expose an eBPF API?
I assume some driver code still needs to reside in the actual kernel.
> If your company is paying for commercial software that includes kernel drivers or kernel modules, you can make eBPF a requirement.
Windows soon, may still be at least a year ahead. Would that be a fair statement? "At least" being the operative keyword here.
Specifically in the context of network security software, for eBPF programs to be portable across Windows and Linux, we would need MSFT to add a lot more hooks and expose internal kernel structs, hopefully via a common libbpf definition. Otherwise, I fear, having two versions of the same product across two OSs would mean more security and quality issues.
I guess the point I am trying to make is that we will get there, but we are more than a few years away. I would love to see something like Cilium on vanilla Windows for a software-defined, company-wide network. We can then start building enterprise network security into it. Baby steps!
---
btw, your talks and blog posts about bpftools are a godsend!
Here I am using the term "EDR". Until this CrowdStrike debacle I'd never heard it.
Only tells how seriously you should take my opinions.
> eBPF (no longer an acronym) […]
Any reason why the official acronym was done away with?
1) Is CrowdStrike Falcon using eBPF for their Linux offering?
2) Would the faulty patch update get caught by the eBPF verifier?
Oh I'm sure they'll find a way.
Which is odd, given there’s been a bunch of kernel privesc bugs using eBPF…
I'm still waiting on my flying car...
100% BS. Even if they don't "crash" they will "stop functioning as intended" which is just the same. It's absolutely disgusting how this industry is now using this one outage as a talking point to further their totalitarian agenda.
It reminds me of how Google went after adblockers with their new extension model that also promised more "security". It's time we realised what they're really trying to do. In fact, I wonder whether this outage was not accidental after all.
But the appeal-to-authority evidence that the article presents is not.
"-- the Linux implementation has over 20,000 lines of code -- with contributions from industry (e.g., Meta, Isovalent, Google) and academia (e.g., Rutgers University, University of Washington). The safety this provides is a key benefit of eBPF, along with heightened security and lower resource usage."
> If the verifier finds any unsafe code, the program is rejected and not executed. The verifier is rigorous -- the Linux implementation has over 20,000 lines of code [0] -- with contributions from industry (e.g., Meta, Isovalent, Google) and academia (e.g., Rutgers University, University of Washington).
[0] links to https://github.com/torvalds/linux/blob/master/kernel/bpf/ver... which has this interesting comment at the top:
/* bpf_check() is a static code analyzer that walks eBPF program
* instruction by instruction and updates register/stack state.
* All paths of conditional branches are analyzed until 'bpf_exit' insn.
*
* The first pass is depth-first-search to check that the program is a DAG.
* It rejects the following programs:
* - larger than BPF_MAXINSNS insns
* - if loop is present (detected via back-edge)
...
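The first pass described in the quoted comment (a depth-first search that rejects any back-edge) can be sketched as follows. This is a simplified illustration and not the kernel's code: the control-flow graph is assumed to be a plain dict from a node to its successors, and `has_back_edge` is a hypothetical name. Note that such a check is conservative: it does not decide whether a loop terminates, it simply rejects every program whose graph contains a cycle, including loops that would have terminated.

```python
from typing import Dict, List


def has_back_edge(cfg: Dict[int, List[int]], entry: int = 0) -> bool:
    """Iterative depth-first search over a control-flow graph.

    Returns True if some edge points back to a node that is still on the
    current DFS path, which means the graph is not a DAG.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in cfg}
    color[entry] = GREY
    stack = [(entry, iter(cfg.get(entry, [])))]
    while stack:
        node, successors = stack[-1]
        for nxt in successors:
            state = color.get(nxt, WHITE)
            if state == GREY:
                return True  # back-edge: a loop is present
            if state == WHITE:
                color[nxt] = GREY
                stack.append((nxt, iter(cfg.get(nxt, []))))
                break  # descend into nxt first
        else:
            # All successors explored: this node leaves the current path.
            color[node] = BLACK
            stack.pop()
    return False


if __name__ == "__main__":
    diamond = {0: [1, 2], 1: [3], 2: [3], 3: []}  # branches that rejoin
    loop = {0: [1], 1: [2], 2: [1, 3], 3: []}     # 2 jumps back to 1
    print(has_back_edge(diamond), has_back_edge(loop))
```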
I haven't inspected the code, but I thought that checking for infinite loops would imply solving the halting problem. Where's the catch?
CrowdStrike screwed the pooch here, yes. But after a couple of days I feel like I haven’t read enough blog posts and articles that crap on Microsoft. It’s their job to build a secure operating system; instead they deliver Windows, and because they themselves cannot secure Windows, they ship Defender… and we use tools like Falcon as a band-aid for Microsoft’s bad security practices.