How Observe Uses Snowflake to Deliver the Observability Cloud [Part 1]
Part 1: Ingesting Data from Customer Applications
The Observability Cloud takes input from various data sources via open endpoints and agents and stores it in the Data Lake to unify telemetry. From there, it builds an interactive, extensible, and connected map of your data called the Data Graph, which lets you explore and monitor the digital connections of your entire organization.
This series looks at how Snowflake is key to the Observability Cloud’s unique architecture and allows it to deliver an order-of-magnitude improvement in both the speed of troubleshooting and the economics of deployment.
This 3-part blog series includes:
- In this post, we review how Observe accepts and processes customer data
- In Part 2, we will look at how we shape that data into a useful form and “link” it to other data
- In Part 3, we will review how Observe manages resources with a focus on resilience, quality, and cost
Strengths of Snowflake Partnership
Observe chose to build the Observability Cloud on Snowflake because it lets us store essentially unlimited amounts of data, then shape and accelerate that data in an efficient, scalable way, while offering resiliency and worldwide coverage. Snowflake’s separation of storage and compute allows Observe to scale our offering for the largest customer workloads, with the flexibility to obtain large amounts of compute on demand.
Based on the Snowflake investor relations page reviewed in October 2022, Snowflake served an average of 2.4B queries/day worldwide. Observe’s portion of those was ~40M queries/day, which represents an astounding ~1.67% of the daily query volume across all Snowflake customers. On the storage side, Snowflake currently allows us to ingest over four petabytes of data each month! These impressive metrics will only go up in the future as we bring more customers into the Observability Cloud.
We optimized Observe’s micro-batch ingest to smooth out bursty torrents of unpredictably shaped data into a steady stream that can be ingested into Snowflake at tremendous transfer rates and very low cost. Observe transforms are an innovative way of shaping streams of data into usable, accelerated, fast-to-query datasets that do not require dedicated teams of data scientists, complex data processing frameworks, manual clusters, or job management. All of this is made possible by standard Snowflake SQL, but without the need to write it.
By leveraging efficiencies of scale and cloud compute resources from AWS and Snowflake, the larger our datasets are, the more cost-effective they are to observe. We pass those savings directly to our customers. Unlike our competitors, we want and encourage MORE data, not less.
Observe offers a rich UI with complex, interactive dashboards. It can deal with any number of ad hoc queries because of the elastic scaling of Snowflake computational resources on demand.
We have excellent relationships with various engineering, platform, and support teams within Snowflake. We meet regularly to review new features and discuss issues that are important to Observe. The resulting Snowflake product improvements benefit both Observe and all customers of Snowflake worldwide.
For software systems, the “External Outputs” are logs, metrics, and traces (commonly referred to as the “3 Pillars of Observability”), or, generalized, any piece of semi-structured data with a timestamp. The “Internal States” are things like “CPU,” “cluster memory usage,” “the number of user sessions in my application,” or “amounts of payments flowing through the e-commerce website.”
Customer applications stream raw data (“External Outputs”) to Observe, and then Observe applies logic to shape that raw data into accelerated, strongly typed Datasets that represent those “Internal States” and can be easily discovered, connected, and queried.
Ingestion of Customer Event Data
This high-level diagram shows how Observe provides the Observability Cloud to our customers:
The right-hand path of this diagram describes the components involved in the ingestion of data (Collector, Encoder, and Loader):
Agents and Endpoints for Ingestion
Observe provides endpoints for many open-wire protocols (HTTP, Prometheus, DTrace, Elastic, etc.) and provides dozens of Apps and integrations (like AWS, GCP, Azure, MySQL, GitHub, GitLab, Jenkins, and so on) that can send data to our systems.
Observe’s ingestion process is (mostly) agent-less in that we try hard not to build any agents ourselves. Except for times when an agent is required, Observe builds on open protocols and frameworks. For example, if you look at our Kubernetes manifest, you will discover that we use “otel/opentelemetry-collector-contrib” for traces, “fluent/fluent-bit” for logs, and “grafana/agent”, each of which sends data to the endpoints mentioned above. We do offer an optional agent for Kubernetes resource updates, but otherwise we try to stay out of building agents.
Dealing With Unpredictable Streams of Messages Big and Small
The systems that send data to Observe bombard us with small (and occasionally large) messages at often unpredictable traffic levels. Those can be log file snippets, tracing spans, Telegraf/Prometheus metrics, random events of arbitrary shape, and so on. The pattern of those messages often varies with the time of the business day or general internet weather.
Meanwhile, Snowflake’s ingestion process is optimized for batch loads of large, compressed files of predictable size, so we can’t put the incoming messages into Snowflake one by one. Inserting records a few at a time is possible, but it doesn’t scale with volume due to table locking and would cost a fortune in Snowflake compute credits.
So how do we go from the torrent of small messages delivered via different protocols to large files in Snowflake’s External Stage in Amazon S3?
Buffering and Cleaning Streams into Manageable Chunks
We optimized Observe’s micro-batch ingest for both cost and time, aiming for data sent to Observe to arrive in Snowflake in under a minute. Let’s take a look at how Snowflake enabled us to do that.
The Observe Collector puts incoming messages into Kafka topics, with one topic per customer. Kafka acts as a resilient, scalable buffer that holds the data for further processing before we ingest it into Snowflake. Observe can hold up to two days of data in that buffer to account for unforeseen circumstances and internet weather.
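As a rough illustration of the buffering described above, here is a minimal in-memory sketch of a per-customer topic buffer with a two-day retention window. It is a stand-in for Kafka, not Observe’s actual Collector code: the `TopicBuffer` class, the `ingest-<customer>` topic naming scheme, and the method names are all hypothetical.

```python
from collections import defaultdict, deque

# "Up to two days" of buffering, per the text above.
RETENTION_SECONDS = 2 * 24 * 60 * 60


class TopicBuffer:
    """In-memory stand-in for a Kafka cluster with one topic per customer."""

    def __init__(self):
        # topic name -> queue of (arrival_time, payload), oldest first
        self.topics = defaultdict(deque)

    def publish(self, customer_id, payload, now):
        """Append a message to the customer's topic and return the topic name."""
        topic = f"ingest-{customer_id}"  # hypothetical naming scheme
        self.topics[topic].append((now, payload))
        return topic

    def expire(self, now):
        """Drop messages older than the retention window; return how many."""
        dropped = 0
        for queue in self.topics.values():
            while queue and now - queue[0][0] > RETENTION_SECONDS:
                queue.popleft()
                dropped += 1
        return dropped

    def poll(self, topic, max_messages):
        """Remove and return up to max_messages payloads, oldest first."""
        queue = self.topics[topic]
        count = min(max_messages, len(queue))
        return [queue.popleft()[1] for _ in range(count)]
```

Keeping each customer in a separate topic means a downstream consumer can read, batch, and load one customer’s data without touching another’s.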
The Observe Encoder component listens to those topics and converts all of these messages to a generic format called the “observation” format. These observations are grouped into so-called “bundles”, and the bundles are then batched together. Once a certain interval elapses or a batch reaches a couple of megabytes of data, the Encoder flushes the data into a compressed Avro file saved to Amazon S3. The Observe Loader component notices the new file(s) and loads the data into Snowflake via the COPY INTO command with the FILES option pointing to the new files, all wrapped in a Snowflake transaction.
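The Encoder’s flush rule (flush when an interval elapses or when enough data has accumulated) can be sketched as follows. This is a simplified illustration rather than Observe’s implementation: the `MicroBatcher` class and its default thresholds are assumptions, and where the real Encoder writes a compressed Avro file to Amazon S3, this sketch simply returns the batch.

```python
class MicroBatcher:
    """Accumulates encoded observations and flushes on size or age."""

    def __init__(self, max_bytes=4 * 1024 * 1024, max_age_seconds=10.0):
        # Both thresholds are hypothetical; the text only says
        # "a certain interval" and "a couple of" megabytes.
        self.max_bytes = max_bytes
        self.max_age_seconds = max_age_seconds
        self._items = []
        self._size = 0
        self._opened_at = None  # arrival time of the first item in the batch

    def add(self, observation, now):
        """Add one observation; return a batch if a flush was triggered."""
        if self._opened_at is None:
            self._opened_at = now
        self._items.append(observation)
        self._size += len(observation)
        return self._flush_if_due(now)

    def tick(self, now):
        """Call periodically so quiet streams still flush on the interval."""
        return self._flush_if_due(now)

    def _flush_if_due(self, now):
        if not self._items:
            return None
        too_big = self._size >= self.max_bytes
        too_old = now - self._opened_at >= self.max_age_seconds
        if not (too_big or too_old):
            return None
        batch, self._items = self._items, []
        self._size = 0
        self._opened_at = None
        return batch
```

The time-based trigger bounds latency when traffic is light, and the size-based trigger keeps files from growing without limit when traffic is heavy.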
Ingestion Efficiency and Unit Economics in Snowflake
Keen observers will notice that the files generated by the Encoder can still be far smaller than the file size Snowflake typically recommends. With light ingestion volume, the files generated by the Encoder and loaded into Snowflake can be small. Those smaller COPY INTO loads are handled by “Small” Snowflake Virtual Warehouses, one of the least expensive warehouse sizes.
With heavy ingest volume, the files produced by the Encoder get nicely packed up to the size recommended by Snowflake, and the Loader can choose larger Snowflake Virtual Warehouses based on the number of incoming files. Still, the majority of the data files are likely to be loaded by Small Virtual Warehouses.
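To make the Loader’s job more concrete, here is a hedged sketch that picks a warehouse from the number of incoming files and builds the statements for a transactional COPY INTO with the FILES option. The warehouse names, file-count thresholds, table name, and stage name are all hypothetical; only the SQL keywords themselves (USE WAREHOUSE, BEGIN, COPY INTO ... FILES = (...), FILE_FORMAT = (TYPE = AVRO), COMMIT) are standard Snowflake syntax. The sketch only builds strings and does not connect to Snowflake.

```python
MAX_FILES_PER_COPY = 1000  # Snowflake documents a 1000-name limit for FILES


def choose_warehouse(file_count):
    """Map the number of incoming files to a (hypothetical) warehouse name."""
    if file_count <= 100:
        return "LOAD_SMALL_WH"
    if file_count <= 500:
        return "LOAD_MEDIUM_WH"
    return "LOAD_LARGE_WH"


def build_load_statements(table, stage, files):
    """Return the SQL statements for one transactional load of the given files.

    `table` and `stage` are assumed to be trusted identifiers; file names
    are quoted as SQL string literals.
    """
    if not files:
        raise ValueError("nothing to load")
    if len(files) > MAX_FILES_PER_COPY:
        raise ValueError("split the load into several COPY INTO statements")
    quoted = ", ".join("'" + name.replace("'", "''") + "'" for name in files)
    return [
        f"USE WAREHOUSE {choose_warehouse(len(files))}",
        "BEGIN",
        f"COPY INTO {table} FROM @{stage} "
        f"FILES = ({quoted}) FILE_FORMAT = (TYPE = AVRO)",
        "COMMIT",
    ]
```

For example, `build_load_statements("raw.observations", "ingest_stage", ["a.avro", "b.avro"])` yields four statements that select the small warehouse, open a transaction, copy exactly those two files, and commit.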
Thanks to the clever use of buffering and optimal encoding, the data load efficiency increases with increased data volume, and the economics of continuous data loading into Snowflake becomes more favorable. In other words, the higher the volume of the data that is ingested, the less it costs to ingest a unit of data. Note that, unlike most vendors in this market, Observe does not charge customers for ingestion.
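One way to picture why unit cost falls with volume is a toy amortization model: if every load carries some fixed overhead regardless of file size, larger files spread that overhead over more data. The function and all numbers below are illustrative assumptions, not Observe’s or Snowflake’s actual pricing.

```python
def cost_per_mb(file_mb, fixed_cost_per_load=1.0, cost_per_mb_loaded=0.01):
    """Toy model: unit cost = amortized fixed overhead + per-MB cost.

    Both cost parameters are hypothetical units chosen for illustration.
    """
    if file_mb <= 0:
        raise ValueError("file size must be positive")
    return fixed_cost_per_load / file_mb + cost_per_mb_loaded


# Under this model, a 1 MB file costs far more per MB than a 100 MB file.
small, medium, large = cost_per_mb(1), cost_per_mb(10), cost_per_mb(100)
```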
Check out Part 2, where we will look at how we shape that data into a useful form and “link” it to other data.