Rethinking MTTR: Don’t Be Stuck At Good Enough

By Liam Rogers, October 17, 2022

When one of our customers came to us, their mean time to resolution (MTTR), like that of many organizations, was stuck. Nothing they did with their existing tools could get MTTR below a certain threshold. They had exhausted the low-hanging fruit, such as continuous monitoring, automation, and configuration as code, and they needed something more. They weren’t willing to accept this barrier to faster incident resolution. When they started using Observe, they were finally able to break through it.

Observe’s architecture is designed to understand data context and to speed up searching and navigating your data through features such as Dataset Graph and Graphlink. These features help you connect the dots between your data any way you need, and smart dashboards that link to your Datasets let you launch directly into investigations. Together, these facets of Observe equip you to get to the root cause, resolve problems faster, and bring MTTR down, and that’s the exact experience our customers report to us.

MTTR, What It Is And Why It Matters

MTTR is a metric used to assess the average time to detect and resolve an incident. For all the talk about eliminating monitoring tool sprawl to bring costs down and reduce data silos, a key driver for better observability should be reducing your MTTR. Given the prevalence of the “fail fast” mindset in software development, incidents will continue to be commonplace, and organizations must be able to recover quickly to avoid negative impacts on their customers and the business as a whole. However, there are some headwinds facing organizations that want to reduce their MTTR.
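To make the definition concrete, here is a minimal sketch of how MTTR could be computed from incident records. It assumes each incident is recorded as a pair of timestamps (when it started and when it was resolved), so the interval covers both detection and resolution. The function name, the record layout, and the sample incidents are hypothetical and are not taken from Observe or from the report.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average (resolved - started) over a list of (started, resolved) pairs.

    The (started, resolved) layout is an assumption made for illustration.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    # Sum the individual resolution times, starting from a zero timedelta.
    total = sum((resolved - started for started, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical sample data: three incidents lasting 3, 24, and 9 hours.
incidents = [
    (datetime(2022, 10, 3, 9, 0), datetime(2022, 10, 3, 12, 0)),
    (datetime(2022, 10, 7, 14, 0), datetime(2022, 10, 8, 14, 0)),
    (datetime(2022, 10, 12, 8, 30), datetime(2022, 10, 12, 17, 30)),
]

print(mean_time_to_resolution(incidents))  # 12:00:00
```

Because MTTR is a mean, the single day-long incident in this sample pulls the figure up to 12 hours even though the other two incidents were resolved within a working day.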

Challenges With Existing Monitoring Tools

Many tools designed before observability gained mainstream acceptance have evolved by integrating previously disparate products to try to get to observability through the sum of their parts. In our State of Observability 2022 report, 62% of organizations said that tool complexity was their biggest challenge with existing monitoring and troubleshooting tools. That was up from only 38% the year before. Simply put, if a tool adds complexity to your life and time to your investigations, then it’s hindering rather than helping.

Don’t Settle For Good Enough

It can be difficult to define what constitutes a “good” MTTR, as it can be subjective. The severity of any given incident depends on the specific situation, the nature of your business, and the potential impact on customers or revenue. That being said, MTTR is an average, and if that average is trending toward a day or more to resolve an incident, then that’s cause for concern.

In our State of Observability report, we saw a few interesting data points that showed a relationship between MTTR and existing tooling. For example, an overabundance of tools can negatively impact MTTR. Organizations with 6 to 10 monitoring tools in place were most likely to have an MTTR of a day, while orgs with 11 to 20 tools were most likely to have an MTTR of a few days. Additionally, 51% of organizations that have an MTTR of a few days said their existing monitoring tools were very effective for troubleshooting cloud-native applications. Yet if it takes a few days on average to resolve an incident, that calls into question how effective the tooling can be. As mentioned, there are many factors at play, but the data hints that organizations may be settling when it comes to MTTR.

MTTR, But With Observe

The big question is, can Observe bring down MTTR? In Q3, we conducted a survey of our users using a Net Promoter Score (NPS) methodology, and in that process, we also asked them a few of the same questions from our 2022 State of Observability report. We wanted to know how our own customers were faring in comparison to the broader industry. The survey results indicate that Observe is working as intended. MTTR was the area where customers most commonly saw measurable improvements. Of our NPS respondents, 72% indicated that their average time to investigate an incident was a few hours or less. Only 35% of the total respondents in our broader State of Observability study could say the same.

MTTR for Observe Customers

We’ve seen customers such as Jackpocket reduce MTTR from a day to just three hours after using Observe. We’ve also previously heard from customers like AuditBoard that Observe empowers them to identify problems and alert relevant teams much more proactively. Recently, within hours of starting to use Observe, a customer solved an issue that had stumped their organization. They were able to easily aggregate and compare data, something they previously couldn’t do, to find the one-line code change that kicked it all off.

Don’t Choose Between Quality and Cost

When it comes to choosing observability tooling, cost is often a major factor. It’s not too surprising given that many legacy tools have pricing models that do not scale favorably with ongoing data growth. However, as the adage goes, time is money, and in any impactful incident that is most definitely true. If you’re paying for tooling that isn’t getting you to the root cause of an issue quickly, because it keeps your data in separate silos or because it incentivizes you to toss relevant data to bring down costs, then you might be paying to not have observability. The evolution of legacy tooling has shown us that the relationship between cost and observability is all over the place in the current market, resulting in pain for customers.

Observe’s ability to understand context and correlate data, while also letting you retain more data for longer periods of time, gives you an edge when it comes time to investigate an incident. This comes alongside Observe’s usage-based pricing, which puts much of the cost control directly in the hands of users. If you’re looking to gain observability to reduce both MTTR and cost, Observe is designed to do just that.

If taking MTTR down a notch or two and going from reactive to proactive sounds like something your company needs, then click here to request access and start seeing the difference Observe can make.