
Why We Built an AI SRE
To keep services reliable, engineers often get pulled into troubleshooting and incident response, which slows down feature work and burns out teams. We built the Observe AI SRE to change that. It uses AI to analyze observability data and guide troubleshooting, so developers can resolve more issues on their own and incidents need fewer engineers to swarm. That means faster recovery, smaller on-call burdens, and more time spent building instead of firefighting.
To this end, we built our AI SRE with several principles in mind:
- Shorten MTTR by letting AI handle the heavy lifting.
- Make observability accessible without first needing to master a tool or query language.
- Help every engineer get the right answer quickly, no matter their experience level.
The AI SRE in Practice
1. Ask Questions Right Away with a Chat Interface
Users new to Observe have often told us, “I’m new to Observe. Can I get what I need without learning a new query language?”
The AI SRE makes it possible to just ask, right on day one:
“Several users are reporting that payment is failing. Can you analyze logs, metrics, or traces in Observe in the past 24 hours to identify what’s causing the issue?”
AI SRE translates your question into OPAL (Observe’s query language), runs the query, and shows you the results. You see both the natural language explanation and the underlying OPAL query, so you can learn as you go.
2. Get to the Right Data Easily
Another common question is “Which dataset should I use to answer my question?”
The AI SRE automatically chooses the right dataset for your question, whether that is logs, metrics, or traces. You don’t need to know dataset names or schemas up front. Behind the scenes, Observe maintains a Knowledge Graph that tracks the objects in your environment and the relationships between them, which gives the AI SRE the context it needs for fast, accurate answers.
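To make the idea of a graph of objects and relationships concrete, here is a minimal Python sketch. It is not Observe's Knowledge Graph implementation or schema: the node names, the edges, and the `related_datasets` helper are all assumptions chosen for illustration. It shows how walking relationships outward from a service could surface the datasets worth querying.

```python
from collections import defaultdict, deque

# Hypothetical objects and relationships, for illustration only.
# This is not Observe's actual Knowledge Graph schema.
EDGES = [
    ("service:payment", "pod:payment-97d7c78cc-q7rhx"),
    ("service:payment", "dataset:traces"),
    ("pod:payment-97d7c78cc-q7rhx", "dataset:logs"),
    ("service:checkout", "service:payment"),
    ("service:checkout", "dataset:metrics"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def related_datasets(graph, start, max_hops=2):
    """Breadth-first search for datasets within max_hops of start."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, hops = queue.popleft()
        if node.startswith("dataset:"):
            found.append(node)
            continue  # datasets are leaves; do not traverse through them
        if hops == max_hops:
            continue
        for neighbor in sorted(graph[node]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return found


graph = build_graph(EDGES)
print(related_datasets(graph, "service:payment"))
# ['dataset:traces', 'dataset:logs', 'dataset:metrics']
```

Starting from the payment service, the search reaches the traces dataset directly, then the logs and metrics datasets through the pod and the checkout service. A question about payments could be routed this way without the user naming any dataset.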
3. Built-In RCA & Postmortems
When something breaks, AI SRE doesn’t just show you charts. It investigates the issue, highlights the suspected root cause, and even drafts a postmortem outline you can refine and share.
How to Use AI SRE
AI SRE is available in two places today:
- Observe UI → If you’re an existing Observe user, this is right where you already investigate incidents.
- Observe MCP Server → For embedding AI SRE into IDEs, chat, or support workflows. Learn more about MCP here.
Getting Started: A Real Example
Payment is failing
To put AI SRE to the test, I ran the OpenTelemetry Astronomy Shop demo app on a GKE cluster with a feature flag that fails 25% of transactions.
I asked:
“Several users are reporting that payment is failing. Can you analyze logs, metrics, or traces in Observe in the past 24 hours to identify what’s causing the issue?”
Root Cause Analysis
Within minutes, AI SRE found the root cause:
- Error: Payment request failed. Invalid token. app.loyalty.level=gold
- Location: Payment pod payment-97d7c78cc-q7rhx
- Timing: Error spike around 6 PM on September 8th
- Impact: Affected only gold-level loyalty users, not the entire payment system
AI SRE suggested likely causes, including a misconfigured loyalty token service, gold-tier configuration issues, or authentication service changes.
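The "gold-level only" finding comes down to slicing failures by an attribute and comparing error rates. The Python sketch below illustrates that kind of slicing. The span records and the field names (`status`, `loyalty_level`) are made up for illustration; this is not how AI SRE is implemented, nor the data it analyzed.

```python
from collections import Counter

# Made-up payment spans for illustration; not data from the demo run.
SPANS = [
    {"status": "error", "loyalty_level": "gold"},
    {"status": "error", "loyalty_level": "gold"},
    {"status": "error", "loyalty_level": "gold"},
    {"status": "ok", "loyalty_level": "silver"},
    {"status": "ok", "loyalty_level": "silver"},
    {"status": "ok", "loyalty_level": "bronze"},
    {"status": "ok", "loyalty_level": "bronze"},
    {"status": "ok", "loyalty_level": "bronze"},
]


def error_rate_by(spans, key):
    """Return the fraction of spans with status 'error' for each value of key."""
    totals = Counter()
    errors = Counter()
    for span in spans:
        totals[span[key]] += 1
        if span["status"] == "error":
            errors[span[key]] += 1
    return {value: errors[value] / totals[value] for value in totals}


print(error_rate_by(SPANS, "loyalty_level"))
# {'gold': 1.0, 'silver': 0.0, 'bronze': 0.0}
```

When one attribute value carries nearly all the errors, as gold does in this toy data, the problem is probably scoped to that segment and not to the payment system as a whole.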
Recommended Actions
- Verify loyalty token validation service configuration
- Review recent deployments/config changes
- Add alerts for this error pattern
- Test gold-level payment flows
Understanding Business Impact
Next, I asked:
“Please break down the revenue loss by product, using human-readable product names. Product names are available in the Products dataset (42248473).”
AI SRE calculated the revenue loss and broke it down by product name.
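As a rough illustration of what such a breakdown involves, the Python sketch below joins failed order items to a product lookup and sums lost revenue per product name. Every product ID, name, price, and quantity here is invented for the example. These are not the figures AI SRE reported, and the real Products dataset may be shaped differently.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical product lookup, standing in for a products dataset.
PRODUCTS = {
    "P1": {"name": "Example Telescope", "price": Decimal("100.00")},
    "P2": {"name": "Example Lens Kit", "price": Decimal("25.50")},
}

# Hypothetical items from orders whose payment failed.
FAILED_ITEMS = [
    {"product_id": "P1", "quantity": 2},
    {"product_id": "P2", "quantity": 1},
    {"product_id": "P1", "quantity": 1},
]


def revenue_loss_by_product(failed_items, products):
    """Sum price * quantity per product name, largest loss first."""
    loss = defaultdict(Decimal)  # Decimal() is zero, so sums start at 0
    for item in failed_items:
        product = products[item["product_id"]]
        loss[product["name"]] += product["price"] * item["quantity"]
    return sorted(loss.items(), key=lambda pair: pair[1], reverse=True)


for name, amount in revenue_loss_by_product(FAILED_ITEMS, PRODUCTS):
    print(f"{name}: {amount}")
# Example Telescope: 300.00
# Example Lens Kit: 25.50
```

The join on product ID is what turns opaque identifiers into human-readable names, which is why the prompt points AI SRE at the Products dataset.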
Creating Monitors Automatically
Finally, I asked:
“We should create a monitor to keep track of these payment failures moving forward. Can you create one?”


AI SRE quickly generated three threshold monitors. While the initial OPAL query wasn’t fine-tuned for “High Value Product Transaction Failures,” it provided a strong starting point.
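For readers unfamiliar with threshold monitors, the core logic can be sketched in a few lines of Python. This is a generic illustration and not the monitors AI SRE generated or Observe's monitor engine: the function name, the per-window failure counts, and the `consecutive` parameter are all assumptions.

```python
def evaluate_threshold(counts_per_window, threshold, consecutive=1):
    """Return the indices of windows in which the monitor would fire.

    The monitor fires once the count has exceeded `threshold` for
    `consecutive` windows in a row.
    """
    firing = []
    streak = 0
    for index, count in enumerate(counts_per_window):
        streak = streak + 1 if count > threshold else 0
        if streak >= consecutive:
            firing.append(index)
    return firing


# Made-up payment failure counts per five-minute window.
failures = [0, 2, 7, 9, 1, 8]
print(evaluate_threshold(failures, threshold=5))                 # [2, 3, 5]
print(evaluate_threshold(failures, threshold=5, consecutive=2))  # [3]
```

Requiring consecutive breaches trades a slightly later alert for fewer false alarms from one-off spikes, one of the knobs worth tuning on any generated monitor.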
The Bottom Line
With AI SRE, you don’t need to be a tool expert or spend precious minutes figuring out which datasets you need. You ask questions in natural language, get answers within minutes, and focus on fixing what matters.
What’s Next
We will enable AI SRE (in the Observe UI) first for existing customers already participating in the MCP Server private preview. If you haven’t tried Observe yet, sign up for a free trial. Want early access to AI SRE? Contact your account team to get onboarded.