10 Lessons for Observability: What Every VP of Engineering Needs to Know

 Getting the most from observability entails building a culture wherein engineers naturally think in terms not just of “What’s wrong?” but “Why is it wrong?”

To date, most conversations surrounding observability have taken place “in the trenches.” They have been largely spearheaded by engineers — like those at Twitter and Netflix — who face the considerable challenge of managing systems that are constantly increasing in complexity.

It’s easy to understand why observability is embraced by engineers working in the throes of modern, cloud-native software environments. That’s because observability offers deeper levels of visibility and actionability than these teams can achieve through conventional monitoring strategies.

But observability is not just something to be mastered by practitioners who find themselves in the thick of cloud-native systems. To reap the full benefits of observability, businesses must embrace it at all organizational levels.

That’s why we’ve put together this guide to observability for VPs of engineering. Unlike most other work on observability in the IT industry, this document discusses the key observability concepts that VPs, directors, and other managers should understand, even if observability practices aren’t a part of their day-to-day workflow.

After all, it’s only when practitioners are lucky enough to get full support for their initiatives that those initiatives tend to prove highly successful. By understanding and embracing the observability concepts described below, engineering VPs and other leaders can position their organizations to get the very most out of observability.

Lesson 1: A brief definition of observability

The textbook definition of observability (which applies to systems of any type, not just software) is a measure of how well the internal state of a system can be understood based on data that the system makes available externally.

For example, observability could mean collecting performance data (request rates, error rates, and request durations) from individual microservices within a distributed application, then correlating that data to understand the health of the application as a whole.
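As a rough illustration of that idea, the sketch below rolls per-service request, error, and duration figures up into an application-level view. It is a minimal example under stated assumptions, not a description of any particular tool: the `ServiceMetrics` type, the `application_health` function, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServiceMetrics:
    """Hypothetical per-service figures for one observation window."""
    name: str
    requests: int             # requests observed in the window
    errors: int               # failed requests in the window
    total_duration_ms: float  # summed duration of all requests


def error_rate(m: ServiceMetrics) -> float:
    """Fraction of requests that failed (0.0 when there was no traffic)."""
    return m.errors / m.requests if m.requests else 0.0


def mean_duration_ms(m: ServiceMetrics) -> float:
    """Average request duration in milliseconds (0.0 when there was no traffic)."""
    return m.total_duration_ms / m.requests if m.requests else 0.0


def application_health(services, max_error_rate=0.05, max_mean_ms=500.0):
    """Return (overall error rate, names of services breaching a threshold).

    The thresholds are illustrative assumptions, not recommended values.
    """
    total_requests = sum(m.requests for m in services)
    total_errors = sum(m.errors for m in services)
    overall = total_errors / total_requests if total_requests else 0.0
    unhealthy = [
        m.name
        for m in services
        if error_rate(m) > max_error_rate or mean_duration_ms(m) > max_mean_ms
    ]
    return overall, unhealthy
```

The point of the sketch is the roll-up: a single application-level number tells you that something is off, while the per-service breakdown narrows down where to look.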

Going deeper, a team might collect data not only from applications and infrastructure but also from resources like CI/CD pipelines or customer support systems to gain stronger context into the internal health of a system. By correlating all of these data points, engineers can not only determine what is happening within an application but also infer which external events triggered a change in its behavior.
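One simple form of that correlation is to ask which external events happened shortly before an anomaly was detected. The sketch below is an assumption-laden illustration: the `Event` type, the `events_before` function, and the 30-minute window are hypothetical, and closeness in time is only a hint for investigation, not proof of cause.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    """Hypothetical record of something that happened outside the application."""
    source: str        # e.g. "ci/cd" or "support"
    description: str
    at: datetime


def events_before(anomaly_at, events, window=timedelta(minutes=30)):
    """Return events that occurred within `window` before the anomaly,
    most recent first. Events after the anomaly are ignored."""
    candidates = [
        e for e in events
        if timedelta(0) <= anomaly_at - e.at <= window
    ]
    return sorted(candidates, key=lambda e: e.at, reverse=True)
```

Given an error-rate spike at noon, for example, a deployment recorded at 11:50 would be returned as a candidate trigger, while one from two hours earlier would not.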

Lesson 2: Observability is not just a buzzword

A second key point for VPs to understand about observability is that observability is not merely a buzzword or fad. It’s a fundamentally new way for IT organizations to approach software performance monitoring and management.

We’ll elaborate on this point below by diving into the history of observability, as well as detailing how observability relates to but is distinct from practices like monitoring. But for now, suffice it to say at a high level that observability is a well-established discipline that has become essential for understanding what happens within the highly complex, distributed, fast-changing software environments that businesses commonly rely on today.

Admittedly, you can find differing opinions out there about what observability means. There may also be some instances where the hype surrounding observability doesn’t fully live up to what observability can do. Keep in mind, observability alone won’t solve every performance management challenge your organization may face.

But in these respects, observability is no different from terms like “DevOps” and “cloud computing,” concepts that are also occasionally over-hyped and even abused but have had a sustained and transformational impact on the IT industry.

Lesson 3: Observability is a well-established concept

Understanding the history of observability is also helpful for appreciating why the concept is so powerful.

Observability dates to the 1960s, when the engineer and inventor Rudolf E. Kálmán published academic work on observability in the field of control theory. In the decades that followed, observability became an essential concept in engineering domains like control theory, systems theory, and signal processing.

Only in the mid-2010s, however, did practitioners in the IT industry begin to incorporate observability into their work in a significant way. Heralded by talks and influential blog posts on observability by engineers at Web-scale companies, observability became a mainstream component of IT operations work.

Lesson 4: Observability vs. monitoring, visibility, and telemetry

It may be tempting to think of observability as simply another word to refer to monitoring, but that would be a mistake. While it’s true that observability and monitoring both entail understanding what is happening with software, the major difference between them is that monitoring merely tells you when something is wrong. Observability helps pinpoint what is wrong, and why it happened.

Monitoring | Observability
Is it broken? | Why is it broken?
Facilitates quick response, but only after an incident occurs. | Prevents incidents and reduces their duration and impact.
Is my application (or service) running? | How efficiently is my application (or service) running?
Siloed data. | Highly correlated data.
Passively consume data and metrics about your system. | Actively explore and understand your environment.

Observability achieves this by expanding and extending monitoring processes to gain deeper insights into complex systems. Whereas monitoring tools typically focus on collecting data and sometimes generating alerts based on anomalies or pre-configured triggers, observability correlates data from across disparate systems to provide nuanced context for each issue that is surfaced through monitoring data.

Along similar lines, observability is different from visibility, in that observability provides the context required to understand both the what and the why of software problems. Visibility mostly just alerts you to the what.

As for telemetry (the collection of data from remote systems), observability is different because it provides the context necessary to interpret telemetry data fully. Telemetry alone just gives you data, not the ability to understand it.

In short, monitoring, visibility, and telemetry are among the processes that enable observability. However, observability goes deeper and provides a much higher level of actionability.

Why did it take so long for the IT industry to embrace observability? The most likely explanation is that around 2015, more developers and IT engineers were tasked with building, deploying, and managing highly dynamic, distributed systems than ever before. These systems were orders of magnitude more complex than their predecessors. As monolithic applications and VMs were supplanted by containerized, Kubernetes-based, multi-cloud microservices apps, organizations needed a better means of understanding what was happening inside their systems than what they could achieve using monitoring alone. Observability provided the solution.

Observability is a decades-old idea that has already proven critical to engineering fields outside of IT. But now in a cloud-native world, it has become essential for IT organizations, too.

Lesson 5: Observability increases ROI

Compared to application performance management techniques that rely on processes like monitoring and telemetry alone, observability yields even stronger financial results for the business.

This is partly because observability maximizes a team’s ability to identify and remediate root-cause performance problems quickly. This translates to less downtime and fewer customer-impacting performance issues, which in turn means higher rates of engagement and revenue.

At the same time, observability helps engineering teams work faster and smarter. By providing the context and correlation that engineers need to resolve complex application issues, observability tools help teams spend less time tracking down root-cause problems and performing unplanned work.

In turn, engineers have more time to focus on activities that add value. Those activities could include implementing new features, improving reliability engineering practices, or optimizing applications to reduce the incident rate.

Lesson 6: Observability is system-agnostic

Although conversations about observability often focus on cloud-native, microservices-based applications, the fact is that observability can be applied to any type of IT environment or architecture.

For instance, you could use observability to correlate performance changes in a monolithic application to changes in the CI/CD processes used to build that application. Likewise, observability can help you deliver actionable insights in an on-prem server or private data center just as effectively as it can in public cloud environments.

Legacy applications may not need observability to the same degree as cloud-native applications, but they can still benefit significantly from observability. This means that no matter which type of applications your business manages, or which technological paradigms you embrace, observability can benefit you.

Lesson 7: The more data, the better

One of the common challenges that engineering teams face when managing software is that more data is not always a good thing. If your team has more data than you can effectively interpret, your engineers will struggle to make sense of it. Storage and compute costs for managing and processing the data may also be quite high relative to the degree of insights that the data generates.

However, because observability tools can correlate disparate data sets in an efficient and automated way, having more data is not a risk, but rather a benefit. After all, the core purpose of observability is to help teams quickly identify root-cause issues and to understand how different problems relate to each other. Once this connection is made, it allows them to determine where the performance issues are in their software deployment processes.

Remember, successful observability hinges not just on collecting data from as many sources as possible, but also on analyzing and correlating it. Data from systems like CI/CD pipelines and customer service platforms can help provide full context into performance issues.

In short, when you embrace observability, voluminous data shifts from being a liability to an asset. The more data you feed into observability solutions, and the more systems that are represented by that data, the greater your ability to identify and understand complex problems.

Lesson 8: Any engineer can embrace observability

Just as observability can be applied to any type of system, it can be implemented by any engineer or engineering team.

Unlike certain other technical skill sets — such as those associated with cybersecurity — observability is not something that requires a special background. It’s something that any developer or engineer can understand, appreciate and perform.

As mentioned earlier, observability is not a concept that is unique to IT, but rather a key concept across a variety of engineering domains. Your engineers need not come from a specific background or have certain credentials to apply observability to the systems they manage.

Lesson 9: Observability is a culture

Part of the reason why observability can be applied anywhere is that it’s more than deploying certain tools or adopting certain workflows. Tooling is part of the equation — you’ll need a platform that can ingest, correlate and analyze data from across all facets of complex systems — but tools alone aren’t the key to observability.

Culture is. 

To gain the greatest value from observability, you must bake observability into the culture of your IT organization. That means that every engineer should understand the differences between observability and monitoring and be prepared to support observability by building the necessary instrumentation.

Getting the most from observability entails building a culture wherein engineers naturally think in terms not just of “What’s wrong?” but “Why is it wrong?” And it requires demonstrating buy-in and support for observability from all stakeholders, including those in managerial roles.

Lesson 10: Now is the time to embrace observability

The final point to understand about observability is that it’s something that most IT organizations need to implement now if they haven’t already. Observability is the only way to stay ahead of performance issues within the complex, cloud-native software environments that power the typical business today. Even if your teams haven’t yet fully migrated to cloud-native architectures, observability can provide important benefits for legacy environments, too.

Observability is not an “idea” that makes sense only for Web-scale companies at the vanguard of technology change. It’s a concept and practice that can benefit companies of all types, across all industries. Just as it would have been a mistake to wait until cloud computing was a decade old before exploring public cloud solutions, it would be an error today to treat observability as something you’ll implement five or ten years from now. Start today.