Does Security Observability Really Need eBPF?
Extended Berkeley Packet Filter (eBPF) programs are awesome tools for collecting massive amounts of raw telemetry data from the Linux kernel. There are use cases where that is critical, particularly in security: anomaly detection on low-level traffic can capture stealthy attacks at several points of the ATT&CK chain, especially penetration, lateral movement, and exfiltration. When you need this kind of massive volume/variability/velocity data, Observe is a brilliant 💎 solution for working with it. Resource awareness and temporality in our powerful Explorers make slicing and dicing deep telemetry like traces or packet captures much more tractable, correlation by the involved resources is trivial, and our cloud native cost model is attractive for large scale work.
However, the most effective cost-saving action of all is to not do a task in the first place. Sometimes we see calls to use eBPF telemetry in scenarios where it might not be the most cost-effective tool. It can be difficult to discuss the hard pros and cons of large scale data collection in the face of a squishier security hypothetical. Here’s a mental model that might help your team decide what tools to use for security monitoring of which applications: Sample Your Cows, Trace Your Pets.
Sample Your Cows
A wide range of technology pushes management tasks to the left and then cannot be effectively managed in production. Servers as Cattle and Functions As A Service are the clearest examples, but arguably mobile app development or a rigidly controlled Desktop As A Service environment can look like this. In this sort of system, intentional change is not produced by users or administrators interacting with the deployed production environment. Instead, intentional changes are produced upstream in the build pipeline, then deployed to production. Let’s call these “unmanageable” things, because once they are deployed, the only intended management actions are to elastically scale them or replace them with newer versions.
If your service is built of these unmanageable things that receive their config from upstream (servers as cattle, mobile devices, containers), you shouldn’t need a great deal of telemetry from them. The sales pitch for telemetry is one of “well you just don’t know what’s really happening out there” and that’s fine, but you can sample to find out instead of instrumenting everything everywhere all the way. Because intentional change is relatively uncommon and can be clearly detected in deployment logs, any other perturbation comes from interactions between components, customer inputs, or attackers. Logs, metrics, and traces should therefore be sufficient to understand these systems built from trusted sources, and sampling that material is potentially a reasonable approach. TL;DR – it’s an Observability problem, because you’re inferring internal states from external outputs.
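One way to picture "sample your cows" is a deterministic, hash-based selection of which fleet members get deep telemetry switched on. The sketch below is only an illustration: the function name in_sample, the salt value, the 5% rate, and the web-NNNN host names are all hypothetical, and a real fleet would take these from its own inventory and policy.

```python
import hashlib

def in_sample(host_id: str, rate: float, salt: str = "rotation-1") -> bool:
    """Return True if this host falls inside the sampled fraction of the fleet.

    Hashing (salt, host_id) gives every collector the same answer for the
    same host without any shared state.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(f"{salt}:{host_id}".encode("utf-8")).digest()
    # Map the first 4 bytes of the hash onto the interval [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate

# Hypothetical fleet of interchangeable ("cattle") web servers.
fleet = [f"web-{i:04d}" for i in range(10_000)]
sampled = [host for host in fleet if in_sample(host, rate=0.05)]
print(f"deep telemetry enabled on {len(sampled)} of {len(fleet)} hosts")
```

Changing the salt rotates which hosts are in the sample, so over time the whole herd gets looked at without ever instrumenting all of it at once.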
Trace Your Pets
Unmanageable things are a neat concept, but they’re not the entirety of deployed systems providing services. If you’ve got devices that experience configuration drift once they enter the world, you have to have a way to see what their configuration is and ideally send updates to it. In other words, you have a visibility and control problem: you have manageable things that need managing.
If you’re responsible for manageable things that maintain their own config (servers as pets, laptops, virtual machines) then you really don’t know how each device is behaving and rich telemetry is a real, vital need. These systems will face at least some level of uncontrolled mutability, perhaps a huge amount of it because they could be used by a full local administrator with an unpredictable series of jobs to do. If uncontrolled change is possible in the running production world, then you will absolutely need access to large amounts of raw telemetry of the sort that eBPF or EDR (Endpoint Detection and Response) agents provide. These systems present a multi-layered security struggle: what is the current state of the world, what are the possible intrusion paths that have been opened by that state, and is anyone actively exploiting an intrusion path? No amount of data is too much, presuming the analytical resources to work with it.
That’s a big assumption though: lots of organizations don’t have the security analytics bench to hunt for meaningful signals. These organizations might depend on some combination of AI models, vendor content, or contracted services for help. AI can absolutely help with pattern and anomaly detection, which is quite useful for systems in which the past predicts the future and stability is the norm. We might then ask, does that mean AI is better for unmanageable or manageable things?
Unpacking that answer leads to some surprisingly gray zones. The Platonic ideal of manageable or unmanageable systems described above is not entirely realistic in actual production deployments, where a containerized tech stack might interact with dozens of others in a way that reintroduces all the complexity of the manageable world. “Who did what” is very tough to answer on a long-lived multi-core Linux server, even with eBPF; processes and threads cycle fast, and the process ID counters are only so large, so IDs get reused. Given drift in team membership, organizational goals, and technology choices, a given system may be entirely a black box for the teams responsible for running it, and even more so for those who must keep it secure. These teams can benefit from an approach called Security Observability: using the external outputs of a system to infer risk, monitor behavior, and alert on failures. Using a containerized tech stack does not in itself absolve organizations of maintenance. If you don’t understand what’s happening in the container, then you might very well need the raw telemetry to gain that understanding. If you want telemetry like opened/wrote/closed file, opened/wrote/closed network socket, started/stopped process, eBPF is where you’ll get that data.
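The ID reuse problem can be made concrete with a small sketch. It assumes, purely for illustration, that each event carries both a process ID and that process's start time (on Linux the start time is available as the starttime field of /proc/<pid>/stat); the ProcEvent type, the field names, and the sample events are hypothetical and not the output of any particular eBPF tool.

```python
from collections import defaultdict
from typing import NamedTuple

class ProcEvent(NamedTuple):
    pid: int
    start_ticks: int  # assumed: process start time, in clock ticks since boot
    action: str

def group_by_process(events):
    """Group actions by (pid, start time) so a recycled PID is not merged."""
    by_proc = defaultdict(list)
    for ev in events:
        by_proc[(ev.pid, ev.start_ticks)].append(ev.action)
    return dict(by_proc)

# Hypothetical events: PID 4242 is used by two unrelated processes.
events = [
    ProcEvent(4242, 1_000, "exec /usr/bin/curl"),
    ProcEvent(4242, 1_000, "connect 203.0.113.7:443"),
    ProcEvent(4242, 9_000, "exec /usr/bin/backup"),
    ProcEvent(4242, 9_000, "open /etc/shadow"),
]

for key, actions in group_by_process(events).items():
    print(key, actions)
```

Keying on the PID alone would attribute the outbound connection and the sensitive file read to one process; pairing it with the start time keeps the two histories apart.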
Pulling in that much telemetry could quickly result in a massive increase in your compute, network, and storage costs though. It’s also such low-level material that it will take expert analysis to make sense of it; you can do some basics with cognitive computing techniques like anomaly detection and cluster detection, but determining if the anomaly is good or bad or important or not takes human contextual knowledge. The best teams use AI techniques to support human analysts with deep visibility into the entire services landscape, from developer laptops to deployed containers. These techniques can benefit from sampling, focus on critical systems such as build pipelines, and target risk evaluation: simply put, pay more attention to the systems that need it. Data analysis and rule generation aided by AI is a highly useful refinement provided by the Security Observability approach.
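As a rough picture of the "basics" mentioned above, here is a minimal z-score sketch that flags hosts whose current reading sits far from a baseline. Everything in it is an assumption made for illustration: the metric (outbound connections per minute), the baseline numbers, the host names, and the threshold of 3 standard deviations. It is one simple statistical technique, not a description of how any product detects anomalies.

```python
from statistics import mean, stdev

def flag_anomalies(baseline, observed, threshold=3.0):
    """Return (host, value) pairs whose z-score against the baseline exceeds threshold."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any deviation at all is unusual.
        return [(host, value) for host, value in observed.items() if value != mu]
    return [
        (host, value)
        for host, value in observed.items()
        if abs(value - mu) / sigma > threshold
    ]

# Hypothetical baseline: outbound connections per minute across the fleet.
baseline = [12, 15, 11, 14, 13, 12, 16, 14]
# Hypothetical current readings per host.
observed = {"web-0001": 13, "web-0002": 15, "build-01": 95}

for host, value in flag_anomalies(baseline, observed):
    # The statistics only say "unusual"; an analyst decides whether it matters.
    print(f"review needed: {host} made {value} connections/min")
```

The code can say that build-01 is unusual, but it cannot say whether that is a large legitimate build or an exfiltration attempt; that judgment is the human contextual knowledge the paragraph describes.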
There’s so much available data from the systems that we depend on to provide and use services. Metrics, logs, and traces are the famous Observability triad, but deep telemetry from kernel-level tooling like eBPF or an EDR agent is a potentially rich source as well. Ingestion and query aren’t the main problem though: how quickly can your team, using AI or not, make sense of the data and fix whatever problem it shows? There’s no one-size-fits-all approach, but Observe is able to help with whatever set of approaches works for your organization.