What Is Observability?
Remember when IT environments were neatly organized into monolithic applications running on virtual machines? Monitoring and managing those systems was easy. You had a handful of logs and metrics to look at, and the relationships between the different parts of the system were straightforward.
But those environments are gone — or disappearing, at least — as microservices-based, scale-out, software-defined, cloud-native everything becomes the order of the day.
Faced with the complexity that these new environments have introduced, SRE, ITOps and DevOps teams are taking a new approach to monitoring and performance management. It’s called observability, and it offers unprecedented opportunities for optimizing the performance and reliability of IT systems.
To do observability right, however, you must understand the complex concepts at its core. Observability is about more than just collecting data from multiple sources so you can say you’re monitoring (for example) your VMs, containers and microservices at the same time. To deliver real value, observability has to make data from disparate sources relatable. It’s only through correlation and contextualization that engineers can truly understand the complexity of modern systems.
Definition of Observability
Observability refers to the ability to understand what is happening inside a system based on the external data exposed by that system.
Put another way, observability means using a system’s surface-level outputs to infer the internal state of the system, even when those outputs don’t map one-to-one onto what is actually happening internally.
A system that is observable, then, is one that can be fully understood simply by collecting surface-level data while the system is running, as opposed to having to tear apart the system in order to expose its internal state.
If you like metaphors, you can think of observability as akin to using the data from a car’s exhaust system and dashboard to determine what’s happening inside the car. Is the engine running too hot? Is the car burning oil? If you are able to infer answers to questions like these simply by analyzing the data that you can view from the “surface” of the car, without tearing the vehicle apart, you’re achieving observability.
As another example, consider a house with a light fixture that won’t turn on. One way to solve the problem would be to cut open walls and trace wires until you find the source of the issue. But that would be destructive and time-consuming. A more efficient — and much easier — approach would be to collect data that is visible from the surface of the electrical system, like the state of the breaker box and other fixtures near the problematic one, to determine what’s wrong. Assuming the system is observable enough, this strategy should lead you to the root cause of the problem, which could be something like a faulty breaker or a loose wire connection inside another fixture.
Observability in IT
In IT systems, observability means being able to interpret the internal state of a complex environment based on information you collect from the surface of the environment. It entails, for instance, determining that the root cause of a slow application response rate is a failed server or memory exhaustion.
The surface-level data that drives observability comes in many forms. These include software and infrastructure logs, traces and metrics from the environment where applications run, as well as data from complementary systems, like CI/CD pipelines or help desks, which provide essential context for other environment data.
History of observability
The concept of observability first appeared in 1960, when the engineer and inventor Rudolf E. Kálmán wrote about it as part of his work on control theory. Thanks to Kálmán, observability has long been an important idea within fields like control theory, systems theory and signal processing.
But it took much longer for observability to become influential in the world of IT. Although the term was in use as early as the 1990s at companies like Sun Microsystems, it wasn’t until the mid-2010s, when engineers at organizations such as Twitter and Stripe began talking about observability within the context of managing applications and application environments, that the term and concept began to go mainstream among IT practitioners.
This means that, from the perspective of IT engineers, observability is both a relatively old and a relatively new concept. It has strong theoretical underpinnings, but how best to implement observability for IT systems remains an open question.
Indeed, even the precise definition of observability within the context of IT can vary a bit, as you’ll notice if you read different takes on the term. At Observe, we believe that observability hinges on end-to-end analysis of all of the data you can possibly collect about a system. We also focus on making actionable use of that data, rather than collecting it just to collect it. These philosophies make our approach a bit different from definitions of observability that emphasize only certain types of data sources, or that don’t tie the theory behind observability to actual practice.
Observability vs. monitoring
Monitoring is one process that drives observability, but observability is about much more than mere monitoring.
Monitoring and observability both rely on surface-level data. But monitoring stops there. Monitoring tells you what’s happening on the surface, without using the data to peer deeper and gain an understanding of the internal state of the system. To do that, you need observability.
In other words, monitoring is like checking for a pulse, while observability is like performing an MRI.
For simple issues where surface-level views correlate directly to internal state, monitoring and observability may deliver the same ultimate result. For instance, if your application has stopped responding because the server that was hosting it has failed completely, you don’t need more than basic, surface-level data to figure out what’s wrong. You can monitor the server, determine that it has gone down and infer pretty easily that you need to bring it back up to get the application working again.
But in more complex situations, surface-level data alone is rarely enough to lead you to the root cause of a problem. Again, imagine you have an application that is not responding. Maybe the server that hosts your application did fail, but your orchestrator automatically moved the application to another server in the cluster, so the failed server is not the root cause of the application issue. Instead, it’s a coding problem with the application itself, which has a memory leak that will eventually cause any server hosting it to fail.
In this case, simply monitoring whether servers are up or down is hardly enough to help you understand what’s happening under the surface of the environment. You would instead need to correlate data from a variety of sources — application logs, operating system logs, cluster health metrics and CI/CD pipeline data — to determine that a memory leak is at fault, then pinpoint which CI/CD deployment introduced the leak so that you can trace it back to the specific code change that caused it.
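To make that kind of correlation concrete, here’s a minimal Python sketch using hypothetical in-memory deploy events and memory samples (all names and values are invented for illustration). It flags the deployment after which memory begins climbing steadily — the signature of a leak:

```python
from datetime import datetime, timedelta

# Hypothetical CI/CD deploy events and half-hourly memory samples (MB).
deploys = [
    {"version": "v41", "at": datetime(2024, 5, 1, 9, 0)},
    {"version": "v42", "at": datetime(2024, 5, 1, 12, 0)},
]
memory_samples = [
    (datetime(2024, 5, 1, 9, 0) + timedelta(minutes=30 * i), mb)
    for i, mb in enumerate([510, 512, 509, 511, 508, 510,   # stable before v42
                            540, 580, 620, 660, 700, 740])  # climbing after v42
]

def growth_after(deploy_at, samples, window=timedelta(hours=3)):
    """Average memory growth per sample in the window following a deploy."""
    pts = [mb for t, mb in samples if deploy_at <= t < deploy_at + window]
    if len(pts) < 2:
        return 0.0
    return (pts[-1] - pts[0]) / (len(pts) - 1)

# Flag the deploy after which memory grows fastest.
suspect = max(deploys, key=lambda d: growth_after(d["at"], memory_samples))
print(suspect["version"])
```

In production the deploy events would come from your CI/CD pipeline and the samples from your metrics backend, but the principle — joining two data sources on a shared timeline — is the same.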
Observability vs. visibility
Like monitoring, visibility is one step toward observability, but it’s not the full picture.
Visibility means understanding discrete parts of your system in isolation. If a server is visible, it means you know whether the server is up and which resources it is consuming. If an application is visible, you know whether it’s handling requests in the way you need it to.
However, the major limitation of visibility is that it doesn’t give you the contextual information necessary to understand how the individual resources within your environment add up to a whole. With visibility alone, you don’t know how multiple instances of the same application are distributed across a cluster of servers, or how interactions between two microservices impact overall system performance.
Observability takes visibility a step further by contextualizing data in order to deliver understanding of the entire system. That isn’t to say that observability doesn’t allow you to drill down into individual components when you need to. Observability enables you to do that, too. But the main focus of observability is on understanding the health of the system as a whole, not merely individual parts of it.
Observing different systems
Your exact approach to implementing observability will vary depending on exactly which type of system you’re observing and how that system operates.
Applications are part of virtually every type of IT system that you might need to observe. Observability therefore almost always means working, in part, with application data.
However, in some cases, the application is the entire system. If you are managing a SaaS app that is hosted on an external provider’s infrastructure, for instance, the environment that is available for you to observe consists of little more than the application itself. You may still be able to use data sources like CI/CD pipelines and ticketing systems to complement the metrics and logs produced by the application, but you won’t be looking at the host environment, because the application delivery model doesn’t allow you to view it.
Distributed systems observability
It’s one thing to observe a centralized software environment where all of your applications run on one server. In that case, most of the data you’ll need to collect is available from that central location.
Today, however, it’s more common to deploy applications in large-scale distributed environments. Here, applications run as containerized microservices, or possibly serverless functions, that are spread across a cluster of servers. Not only does observability in this context require the analysis and correlation of many more types of data, but it also necessitates the ability to interpret complex relationships between the various moving parts of a constantly changing environment.
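One core mechanism behind interpreting those relationships is propagating a single trace ID across service boundaries, so telemetry from every hop can be joined later. The sketch below is a hypothetical in-process illustration (the service names and the `trace-id` header key are invented; real systems typically follow a standard like W3C Trace Context via a library such as OpenTelemetry):

```python
import uuid

LOG = []

def log(trace_id, service, message):
    """Record a structured log event tagged with the request's trace ID."""
    LOG.append({"trace": trace_id, "service": service, "msg": message})

# Two stand-ins for microservices; each forwards the trace ID it received.
def checkout_service(order, headers):
    trace_id = headers["trace-id"]
    log(trace_id, "checkout", f"received order {order['id']}")
    return payment_service(order, {"trace-id": trace_id})

def payment_service(order, headers):
    trace_id = headers["trace-id"]
    log(trace_id, "payment", f"charging {order['total']}")
    return {"status": "paid", "trace-id": trace_id}

# The front end mints one trace ID; every downstream hop reuses it, so
# logs from both services can later be joined on that single key.
trace_id = uuid.uuid4().hex
result = checkout_service({"id": 7, "total": 9.99}, {"trace-id": trace_id})

spans = [e for e in LOG if e["trace"] == trace_id]
print(len(spans))
```

In a real deployment the ID would travel in an HTTP header rather than a Python dict, but the correlation logic is identical.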
Observability for cloud-based environments presents special challenges, which vary depending on exactly which type of cloud architecture and services are involved. If you use multiple clouds, you’ll need to collect and analyze data from all of your providers, and reconcile differences in data formatting and completeness, in order to observe the environment. If you use services like serverless functions, which limit your ability to monitor host servers, you’ll have fewer data points to work with than you would when using virtual machines that produce complete operating system logs, for example.
Adding an orchestrator like Kubernetes to your environments increases the complexity of the system. Not only do you have to monitor servers, containers and applications, but you must also track the state of the orchestrator (which is itself a series of discrete components, like an API server, a key-value store and a scheduler) that manages them.
While this makes observability more difficult in one respect, it also means that you have more data sources to work with. You can collect data from the orchestration layer that will help contextualize events or patterns taking place at other layers of the system.
Thus, while observability in Kubernetes presents some special challenges, it can be managed through an approach that correlates data from across the entire stack. Check out our Kubernetes observability eBook for a deeper dive into this topic.
Data sources for observability
As noted above, the data that engineers use to achieve observability comes in many forms.
The conventional approach to collecting data for observability centers on what are often called the three “pillars” of observability: logs, metrics and traces. By analyzing application and infrastructure logs, tracking metrics and tracing application requests as they flow through your system, you can understand much about what is happening inside the system.
Complete observability, however, often requires more than just these three pillars. Data from the CI/CD pipeline that helps you understand how often application releases are deployed and how quickly your team is able to roll back a problematic release provides crucial context for understanding the performance of the application itself. Logs from Git or other source code management repositories help you correlate metrics from your software environment with the code changes that shape application behavior. Even ticketing system data, such as the mean time to close a ticket and the issues most frequently tagged in tickets, provides context for understanding how log, metric and trace data relates to the end-user experience.
In short, while the three pillars of observability are a good foundation for your observability strategy, you should not stop there. You should collect and correlate every data point available to you in order to gain complete observability, especially in complex systems that require rich contextual information to interpret events and patterns effectively.
Correlating data to maximize observability
Data correlation may sound simple, but it can be quite challenging, especially in complex, multi-layered environments. The log formats used by some parts of your systems may differ from those used by other parts, making it difficult to compare logs directly. The period over which one of your clouds generates metrics may differ from what is standard on another cloud, which also hampers one-to-one comparisons. And it may not always be clear how data from one discrete component, like a container, relates to data from another, like a network switch.
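As one small example of reconciling such differences, the hypothetical Python sketch below resamples one provider’s 10-second CPU samples into 60-second buckets so they line up with another provider’s one-minute series (all values are invented):

```python
from collections import defaultdict

# Hypothetical samples: cloud A emits CPU% every 10s, cloud B every 60s.
cloud_a = [(t, 50 + t % 30) for t in range(0, 180, 10)]  # (epoch_s, cpu%)
cloud_b = [(0, 48.0), (60, 55.0), (120, 61.0)]

def resample_to_minutes(samples):
    """Average fine-grained samples into 60-second buckets."""
    buckets = defaultdict(list)
    for t, v in samples:
        buckets[t - t % 60].append(v)
    return {t: sum(vs) / len(vs) for t, vs in sorted(buckets.items())}

a_by_minute = resample_to_minutes(cloud_a)

# Both series now share the same timeline and can be compared directly.
for minute, b_val in cloud_b:
    print(minute, round(a_by_minute[minute], 1), b_val)
```

Resampling by averaging is only one policy — depending on the metric, you might instead take the maximum (for spiky signals like latency) or the sum (for counters).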
These are real challenges, but they can be overcome by transforming and aggregating complex data sets in order to provide end-to-end observability across all data sources. That’s what Observe does. No matter where your data comes from or how simple or complex your system is, Observe delivers the comprehensive data correlation and analytics features your team needs to understand and fix issues fast.