The Future of Observability in the Age of Data Lakes and AI

By Shruti Bhat
April 24, 2025

Observe provides open, scalable AI-powered observability by correlating logs, metrics, and traces directly in your data lake, enabling faster troubleshooting at a much lower cost.

Observability at scale is broken. Not because you lack data, but because you have too much, and you’re managing it the wrong way.

If you’ve been using Splunk, ELK, Grafana, Datadog or New Relic for years, you’ve felt the pain:

  • Costs keep rising: your monitoring bill comes in far higher than you expected, and it climbs every year.
  • Complexity keeps growing: you have more services, more logs, more metrics, more noise.
  • Your on-call experience is getting worse, not better: too many alerts and too hard to troubleshoot. “I love being on-call,” said no one ever.

The last generation of monitoring tools did an incredible job helping you get visibility into your systems, but your scale today is pushing your monitoring stack beyond the limits of what it was originally designed for. This reality is echoed in recent Gartner research, which calls out rapidly escalating observability spend, “driven by explosive growth in operational telemetry and increased complexity in digital businesses.”

Gartner research on getting observability spend under control

The Cost Problem: The Hidden Tax of Indexing, Data Silos, and Telemetry Pipelines

Every log, metric, and trace your team collects comes at a direct cost in terms of compute and storage.

Take Splunk, for example:

  • Indexing is expensive. You have to decide upfront what to index, and it is really expensive to build and store those indexes.
  • Cold storage isn’t a real solution. Sure, you can use S3 as your cold tier, but querying it later is painfully slow (or requires rehydration which again needs more compute and storage).
  • Telemetry pipelines let you sample, filter and route your data but leave you with the hidden cost of blind spots. Not to mention the operational complexity of tuning and monitoring the pipeline itself.

Not surprisingly, teams using Splunk, Datadog, New Relic, or Sumo Logic love the ease of onboarding but eventually hit a cost ceiling. The more you scale, the more you rely on these tools, and the harder it gets to justify the price.

Reality check: Most of your observability data is never used. You over-collect and under-query because you don’t know what you will need in the future. But indexing everything “just in case” is simply too expensive. And telemetry pipelines only delay the pain: they help reduce ingestion costs in the short term but don’t fundamentally change the economics of observability.
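To make the economics concrete, here is a back-of-envelope comparison in Python. Every number in it — ingest volume, retention, index overhead, and per-GB prices — is an illustrative assumption, not vendor pricing; the point is only that an index-everything hot tier and an object-storage lake scale very differently.

```python
# Back-of-envelope comparison: index everything vs. land it in object storage.
# All numbers are illustrative assumptions, not real vendor pricing.

DAILY_INGEST_GB = 1_000              # assumed raw telemetry per day
RETENTION_DAYS = 90                  # assumed retention window
INDEX_EXPANSION = 1.5                # assumed index + replication overhead

HOT_STORAGE_PER_GB_MONTH = 0.50      # assumed indexed, SSD-backed hot-tier price
OBJECT_STORAGE_PER_GB_MONTH = 0.023  # assumed S3-like object storage price
SCAN_COST_PER_TB = 5.00              # assumed on-demand scan cost per TB queried
QUERIED_FRACTION = 0.05              # assume only ~5% of the data is ever queried

stored_gb = DAILY_INGEST_GB * RETENTION_DAYS

# Option 1: index everything up front and keep it hot.
index_everything = stored_gb * INDEX_EXPANSION * HOT_STORAGE_PER_GB_MONTH

# Option 2: keep everything in object storage, pay to scan only what you query.
lake = (stored_gb * OBJECT_STORAGE_PER_GB_MONTH
        + (stored_gb * QUERIED_FRACTION / 1_000) * SCAN_COST_PER_TB)

print(f"Index-everything hot tier: ~${index_everything:,.0f} / month")
print(f"Object-storage lake:       ~${lake:,.0f} / month")
```

With these made-up numbers the gap is roughly 30x. Your real numbers will differ, but the shape of that curve is what drives the cost ceiling described above.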

The Complexity Problem: Alert Fatigue and Troubleshooting Bottlenecks

Ask your on-call engineers: How many alerts do you see in a day? And what are the real bottlenecks when troubleshooting the more serious issues?

The problem isn’t that alerts don’t work. It’s that they lack context.

  • Blind spots caused by cost cutting. You asked your team to cut costs, so they made some difficult tradeoffs: 7-day retention policies, avoiding high-cardinality data, leaning heavily on sampling. The result? More on-call hell because of blind spots and lack of context. A CPU spike in a microservice might be caused by an upstream issue, but your tooling can’t tell you that.
  • Data is fragmented across teams and tools. Engineers jump between multiple teams and tools, manually correlating timestamps across logs, metrics, and traces (a sketch of that glue work follows this list). Not for the faint of heart.
  • AI codegen is adding a lot more data volume, and leaving engineers with even less context. What happens when AI generated code fails in production in gnarly ways?
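To make the second bullet concrete, this is roughly the glue work engineers end up doing by hand: export a slice of logs from one tool and metrics from another, normalize the timestamps, and merge them into a single timeline. A minimal sketch with made-up data:

```python
from datetime import datetime, timezone

# Made-up exports from two different tools; in reality each comes from its own
# UI, with its own timestamp format and timezone conventions.
logs = [
    {"ts": "2025-04-24T02:13:58Z", "service": "checkout", "msg": "upstream timeout"},
    {"ts": "2025-04-24T02:14:05Z", "service": "checkout", "msg": "retry storm begins"},
]
metrics = [
    {"ts": 1745460842.0, "service": "payments", "metric": "cpu_pct", "value": 97.3},
]

def to_epoch(ts):
    """Normalize ISO-8601 strings and raw epoch floats to epoch seconds."""
    if isinstance(ts, (int, float)):
        return float(ts)
    return datetime.fromisoformat(ts.replace("Z", "+00:00")).timestamp()

# Flatten everything into one timeline, sorted by time.
timeline = sorted(
    ({"t": to_epoch(e["ts"]), **e} for e in logs + metrics),
    key=lambda e: e["t"],
)

for event in timeline:
    when = datetime.fromtimestamp(event["t"], tz=timezone.utc).strftime("%H:%M:%S")
    detail = event.get("msg") or f'{event["metric"]}={event["value"]}'
    print(when, event["service"], detail)
```

None of this is hard, but it is exactly the kind of toil that eats the first twenty minutes of every incident when the signals live in separate systems.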

Modern incident management platforms are trying to bridge this gap by helping teams make sense of noisy alerts and speed up response times. I believe if the fundamental data quality issue of blind spots is addressed, AI-powered incident response and resolution can become a lot more powerful.

Reframing the Solution: Data Lakes and AI

Traditional observability tools were built on architectural assumptions that don’t hold up anymore:

  • Indexing everything is the only way to query efficiently. It’s not. Data lakes now let you scan selectively, index on demand, and query far more cheaply.
  • Logs, metrics, and traces should be stored in separate systems. Wrong. Modeling everything as a knowledge graph unlocks new correlations (a toy sketch follows this list).
  • Dashboards and alerts are the primary ways engineers troubleshoot. They won’t be. AI will enable a shift toward new investigative workflows.
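As a toy illustration of the knowledge-graph idea, the sketch below uses networkx to model services, their dependencies, and their signals as one graph, so a symptom on one service can be walked back to a change on another. The services, metric, and deploy event are entirely made up, and a real observability graph would carry far richer structure.

```python
import networkx as nx

# Toy knowledge graph: services, their dependencies, and attached signals.
g = nx.DiGraph()
g.add_edge("checkout", "payments", relation="calls")      # checkout depends on payments
g.add_edge("payments", "ledger-db", relation="queries")   # payments depends on ledger-db

# Attach telemetry and change events to the service they were observed on.
g.add_node("metric:payments.cpu_pct", kind="metric", value=97.3, ts="02:14")
g.add_edge("payments", "metric:payments.cpu_pct", relation="emits")
g.add_node("event:ledger-db.deploy", kind="deploy", ts="02:10")
g.add_edge("ledger-db", "event:ledger-db.deploy", relation="changed_by")

def explain(symptom_service):
    """Walk everything downstream of the symptom and surface attached signals."""
    findings = []
    for dep in nx.descendants(g, symptom_service):
        for _, signal in g.out_edges(dep):
            attrs = g.nodes[signal]
            if attrs.get("kind") in ("metric", "deploy"):
                findings.append((dep, signal, attrs.get("ts")))
    return findings

for dep, signal, ts in explain("checkout"):
    print(f"{dep} -> {signal} at {ts}")
```

The payoff is that “what changed near my symptom?” becomes a graph traversal instead of a manual hunt across three separate tools.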

1. Data Lakes Solve the Cost and Silo Problem

A data lake approach changes everything.

  • Store the signals you need affordably in object storage. Correlate logs, metrics, and traces by keeping them in one central, low-cost place. Filter out the noise, but you no longer need to be so aggressive that you end up with blind spots.
  • Query on demand. Need structured search? Index just that subset. Need high performance? Burst your on-demand cloud compute to query efficiently (see the sketch after this list).
  • Re-use your telemetry data. Instead of paying separately to store your telemetry for observability, security, and compliance use cases, let them all drink from the same underlying dataset in your data lake. The Apache Iceberg open table format is a game changer here.
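Here is a minimal sketch of what “query on demand” can look like, using DuckDB to scan Parquet log files directly in object storage instead of keeping an always-on indexed cluster. The bucket path, partition layout, schema, and region are assumptions for illustration; a production setup would more likely go through an Iceberg catalog and proper credential management.

```python
import duckdb

con = duckdb.connect()

# The httpfs extension lets DuckDB read directly from S3-compatible object storage.
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region = 'us-east-1';")  # assumed region; credentials via env/IAM

# Hypothetical layout: one set of Parquet files per day of logs.
query = """
    SELECT service, count(*) AS error_count
    FROM read_parquet('s3://acme-telemetry-lake/logs/date=2025-04-24/*.parquet')
    WHERE level = 'ERROR'
      AND ts BETWEEN TIMESTAMP '2025-04-24 02:00:00' AND TIMESTAMP '2025-04-24 03:00:00'
    GROUP BY service
    ORDER BY error_count DESC
    LIMIT 10
"""

for service, error_count in con.execute(query).fetchall():
    print(f"{service}: {error_count} errors")
```

Because only the matching partition and the referenced columns get read, the compute bill tracks what you actually query rather than everything you ever ingested.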

Rather than aggressively discarding data to control costs, store what you need and process it intelligently.

2. AI Solves the Complexity Problem

AI isn’t just about anomaly detection, reducing false alerts, or handling known issues. It is about fundamentally changing how we troubleshoot, just like AI codegen is changing how we write code.

  • AI-assisted investigations. Instead of filtering dashboards, ask: “Why did service X slow down at 2:14 AM?” and get a meaningful response (a rough sketch of this flow follows the list).
  • Automated correlation across data types. AI can identify patterns across logs, metrics, and traces using a knowledge graph, so you can drill and pivot instantly as you dig into your most complex issues.
  • Better instrumentation and more context. Even as you learn how to prompt your codegen tool to add more context into debugging workflows, expect AI to learn from past mistakes and start detecting patterns across services before they escalate into full-blown outages.
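As a rough sketch of what an AI-assisted investigation might look like under the hood: pull the already-correlated context for a time window, then hand it to a model along with the engineer’s question. The call_llm function below is a stand-in for whichever model API you use, and the telemetry snippets are made up.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real model call (hosted API or local model)."""
    return "(model response would appear here)"

def investigate(question: str, window: str, context: dict) -> str:
    """Assemble correlated telemetry into one prompt instead of N dashboards."""
    prompt = (
        "You are assisting an on-call engineer.\n"
        f"Question: {question}\n"
        f"Time window: {window}\n"
        "Related logs:\n" + "\n".join(context["logs"]) + "\n"
        "Related metrics:\n" + "\n".join(context["metrics"]) + "\n"
        "Recent changes:\n" + "\n".join(context["changes"]) + "\n"
        "Explain the most likely cause and suggest the next diagnostic step."
    )
    return call_llm(prompt)

# Made-up context of the kind a knowledge graph over the data lake would surface.
answer = investigate(
    question="Why did service X slow down at 2:14 AM?",
    window="2025-04-24 02:00-03:00 UTC",
    context={
        "logs": ["02:13:58 checkout: upstream timeout calling payments"],
        "metrics": ["payments cpu_pct peaked at 97.3% at 02:14"],
        "changes": ["ledger-db schema migration deployed at 02:10"],
    },
)
print(answer)
```

The quality of the answer depends almost entirely on the quality of the context you can assemble, which is why fixing the blind-spot problem comes first.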

AI-Powered Observability in Your Data Lake

The shift is already happening. Teams are moving away from fragmented, indexing-first observability stacks toward a simpler data lake-first model, with AI-powered troubleshooting workflows.

For years, observability tools treated logs, metrics, and traces as separate entities, each requiring its own storage, indexing, and query engine. That approach made sense when the scale was small, but at today’s scale it’s unsustainable. Embracing a simple, centralized data lake architecture just makes sense.

Historically, debugging meant manually writing regex filters, correlating timestamps, and hopping between dashboards. That’s about to change with codegen tools and the ability to rethink your troubleshooting workflows with AI-assisted investigations. Are we finally ready to move on from runbooks (common repetitive tasks) to playbooks (more strategy, goals, and context)?

The next era of observability won’t just be cheaper and simpler. It will change how engineers work. The shift has already begun. I believe observability should empower engineers, not drain their energy and budget.

This is why, after working on indexing for the last 7 years at Rockset (acquired by OpenAI), I’m thrilled to join Observe Inc. Well, that, and because the people here are amazing.

Where do you see observability going next?