The SRE Manager’s Guide to Selling Observability to Your Org

By Liam Rogers, April 11, 2022

Intuition is good, but intuition plus observability is better.

SRE Life Is Not Easy, Let’s Not Make It Harder

You may be asking, “Why should I ‘sell’ observability? Isn’t that your job?” A fair question, but tooling that makes your life easier is a lifeline, and when you can show other people how useful it is, they’ll let you buy more of it.

Let’s acknowledge that the SRE job is not an easy one. It’s not always well defined or well valued, but it is important to the business. The SRE title means many different things to many different people, and the daily responsibilities vary from company to company. What we can count on is that if an outage or performance degradation is affecting an external customer, it’s probably going to be up to the SREs to figure it out.

Perception of SREs is often skewed toward responding to incidents and putting out fires. But to take a cue from Smokey Bear, prevention is the best firefighting technique. Observability tooling is most needed in troubleshooting situations when time is of the essence, but it can also help users understand where systems are not working optimally and can be improved. Is observability (O11y) the solution to all the problems you or your company face? No, and you would be right to be skeptical of any vendor that pitches it as such. That said, without it, parts of the SRE life are going to be a lot harder than they have to be.

[Image: Be an O11y Hero]

Establishing SRE Value and “Selling” O11y

Articulating the value that observability, and in the bigger picture your SRE team, brings to the business is a necessary but often nebulous task. You can’t assume that others understand observability or why it matters. We are in a phase of adoption where many less technical stakeholders don’t know what observability is or are still grappling with how it differs from the monitoring approaches of the past.

An SRE manager needs to ensure their team has the right tools for the job and can demonstrate the value that both the team and the tools bring to the business as a whole. Not all of the ways SREs and observability positively impact a business are easy to articulate for upper management, but here are three areas where SRE teams have a direct impact.

1. Incident Response Is Mission Critical

This is the most obvious and direct reason: if critical infrastructure is down, the ability of the business to function is going to be severely impacted. History has shown us that downtime means dollars lost and damaged reputation with customers. And from the SRE’s perspective, nothing induces anxiety quite like a customer letting you know that prod is down when you don’t know why.

Intuition and prior knowledge come into play in any troubleshooting scenario. However, the most challenging situations are typically challenging because something unexpected has happened, and your intuition may only get you part of the way to the answer you need. If every variable could be known in advance, we wouldn’t be so fixated on being able to account for unknown unknowns when the time comes.

Observability is well suited to exactly those situations, where you need to model and analyze data on demand to extract useful information from it. That’s where we use the power of OPAL, Observe’s query language, to ensure your data is navigable. Intuition is good, but intuition plus observability is better.

With observability, your team can quickly home in on the likely cause, and with Observe you won’t have to jump between multiple tools to do so. Needing fewer tools to respond to incidents is good for SREs and for the company. Many organizations are pushing top-down mandates to reduce tool sprawl, and consolidating surplus monitoring tools can make a meaningful difference.

2. Reliability Engineering Is Good For Development

Every organization wants to move quickly enough to have a competitive edge, and this means being able to release code to production fast. But moving fast means nothing if infrastructure is breaking down. It’s up to the SRE team to make systems more reliable so that developers can focus on high-value tasks.

Knowing basic metrics on resource consumption is table stakes and can be found in any monitoring tool (Observe has your metrics and visualization needs covered too). What observability offers is a way to map out the relationships between different parts of the system and how they affect one another. This can highlight likely performance bottlenecks and can be used proactively to explore opportunities to improve reliability. Doing so reduces infrastructure costs and benefits dev teams.
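
To make the idea of mapping relationships concrete, here is a minimal Python sketch. It is not how Observe models data internally; the service names and call edges are invented for illustration, and in practice they would be derived from your telemetry. It ranks each service by how many other services depend on it, directly or indirectly, which is one rough proxy for where a failure or slowdown would hurt most.

```python
from collections import defaultdict

# Hypothetical (caller, callee) edges; real ones would come from telemetry.
CALLS = [
    ("web", "checkout"),
    ("web", "search"),
    ("checkout", "payments"),
    ("checkout", "database"),
    ("search", "database"),
    ("payments", "database"),
]


def transitive_dependents(edges):
    """Map each service to every service that depends on it, directly or indirectly."""
    callers = defaultdict(set)
    nodes = set()
    for caller, callee in edges:
        callers[callee].add(caller)
        nodes.update((caller, callee))

    result = {}
    for node in nodes:
        seen = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for parent in callers[current]:
                if parent not in seen:  # also guards against cycles
                    seen.add(parent)
                    stack.append(parent)
        result[node] = seen
    return result


def rank_by_blast_radius(edges):
    """Order services by number of dependents, largest first (ties by name)."""
    dependents = transitive_dependents(edges)
    return sorted(dependents, key=lambda name: (-len(dependents[name]), name))


ranking = rank_by_blast_radius(CALLS)
print(ranking)
```

In this toy graph the database ranks first because every other service ultimately relies on it, so it is the first place to look when exploring reliability improvements.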

3. Team Culture and Sustainability Are More Important Than Ever

Having a good SRE culture is crucial to attracting new hires, reducing the likelihood of churn, and charting a path toward sustainable reliability in your org. Many organizations still don’t have SREs, and many that do would like to have more, which makes retaining your team a paramount concern. Institutional knowledge can often be siloed, or even non-existent if turnover has been higher than usual. To counter that challenge, SREs should choose tooling that reduces the overall learning curve for new and existing team members.

[Image: Observe Dataset Graph]

Tooling is only one factor in the SRE’s day-to-day, but it does play a part in the culture. Observability is a way to gain visibility and context around your environment, which can be a powerful learning tool for those who lack prior knowledge of how systems were designed. This is why Observe has features like our Universe Maps, which let you quickly see all the things and the relationships between them.

In an earlier installment of our Observability Heroes series, AuditBoard highlighted how beneficial it is to have visibility across their environment, including customer-facing systems. Not only can features like Universe Maps help with knowledge sharing, but they also tie back to that critical first point about reducing customer impact.

Talking Business Context Is a Must

Ultimately, anyone leading an SRE team has to be able to establish the value the team brings to the organization and how critical it is to the line of business. This involves trying to quantify the compound effect that the inner workings of the SRE team have elsewhere in the organization. When you package that information for upper management, the more you can put things into “business terms” the better.
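
As one example of putting things into business terms, here is a rough back-of-the-envelope sketch in Python that converts a reduction in outage duration into an estimated revenue impact. Every number below is a made-up placeholder, and the model is deliberately simplistic: it assumes revenue is lost linearly while the outage lasts and ignores reputational damage entirely. Substitute your own figures.

```python
def downtime_cost(minutes_down, revenue_per_hour, affected_fraction):
    """Estimate revenue lost during an outage.

    Assumes revenue accrues evenly over time and that `affected_fraction`
    (0.0 to 1.0) of it is blocked while the outage lasts.
    """
    if minutes_down < 0 or revenue_per_hour < 0:
        raise ValueError("minutes_down and revenue_per_hour must be non-negative")
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0 and 1")
    return minutes_down / 60 * revenue_per_hour * affected_fraction


# Hypothetical inputs: a 90-minute outage shortened to 30 minutes.
REVENUE_PER_HOUR = 20_000   # placeholder figure, not real data
AFFECTED_FRACTION = 0.5     # placeholder: half of revenue is blocked

before = downtime_cost(90, REVENUE_PER_HOUR, AFFECTED_FRACTION)
after = downtime_cost(30, REVENUE_PER_HOUR, AFFECTED_FRACTION)
savings = before - after

print(f"Estimated loss before: ${before:,.0f}")
print(f"Estimated loss after:  ${after:,.0f}")
print(f"Estimated savings per incident: ${savings:,.0f}")
```

A figure like this is an estimate, not a measurement, but it gives leadership a number in the units they already track.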

Leadership may rely on business intelligence tools, but BI tools are geared toward just that: the business. BI isn’t suited for SRE work; however, it does indicate what kind of data matters to leadership. What SREs need is not an additional tool or product to bridge the gap between IT infrastructure and the business, but an observability platform that can do this by default.

Turning Machine Data Into Things

We often talk about how Observe can understand context in its correlation of data and can turn your telemetry streams into “things”. Things could be technical, like pods or hosts, but they could also be business-y things like customers or shopping carts. The ability to turn data into things makes day-to-day tasks easier for SREs by letting them focus on what’s most relevant to the task at hand. Observe is an observability tool, but the context it provides makes it well suited for bridging that gap and providing visibility from a perspective that will resonate outside of the SRE team. The C-suite doesn’t think in terms of logs, metrics, and traces; customers and their attributes are going to be far more familiar.
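
The general idea of turning raw events into “things” can be sketched in a few lines of Python. This is only an illustration of the concept, not Observe’s implementation or OPAL syntax: the event fields, the `Customer` class, and the sample data are all hypothetical. It folds a stream of request logs into one record per customer, so the question shifts from “which log lines have errors?” to “which customers are having a bad time?”

```python
from dataclasses import dataclass, field

# Hypothetical request-log events; field names are invented for illustration.
EVENTS = [
    {"customer_id": "c-1", "pod": "cart-7f9", "status": 200},
    {"customer_id": "c-2", "pod": "cart-7f9", "status": 500},
    {"customer_id": "c-1", "pod": "cart-2ab", "status": 500},
    {"customer_id": "c-2", "pod": "cart-7f9", "status": 500},
]


@dataclass
class Customer:
    """A business-level 'thing' assembled from raw telemetry."""
    customer_id: str
    requests: int = 0
    errors: int = 0
    pods: set = field(default_factory=set)

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


def build_customers(events):
    """Fold a stream of events into one Customer record per customer_id."""
    customers = {}
    for event in events:
        cid = event["customer_id"]
        customer = customers.setdefault(cid, Customer(cid))
        customer.requests += 1
        if event["status"] >= 500:   # treat 5xx responses as errors
            customer.errors += 1
        customer.pods.add(event["pod"])  # keep the link back to infrastructure
    return customers


customers = build_customers(EVENTS)
worst = max(customers.values(), key=lambda c: c.error_rate)
print(f"{worst.customer_id}: {worst.error_rate:.0%} errors on pods {sorted(worst.pods)}")
```

Keeping the set of pods on each customer record preserves the relationship between the business-level thing and the technical things underneath it, which is what lets the same data serve both SREs and leadership.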

If you or your company wants to take your observability strategy to the next level, we encourage you to check out our demo of Observe in action or to join us for one of our weekly live demos.