Observability at Observe – Infrastructure and Business Processes
Observe provides a highly scalable and available Observability Cloud to customers worldwide, allowing teams to build, run, deploy, and monitor their software at internet scale. Observe's charter is to provide our users with useful insights from their seemingly disparate and never-ending mounds of machine and user data.
To run this platform at scale, Observe uses its own offering to observe itself. We call that “Observe on Observe” or O2 for short. Self-hosting our stack and validating everything makes a big difference to the quality of our customer offering.
In the first part of the series, we described the Observe architecture and explained the various self-observation use cases that we use the O2 tenant for.
In the second part, we explained how Observe combines performance and business data into an interactive, connected map that we call the Data Graph, and how our engineers and product managers use it to manage performance, features and costs.
In this part of the series, we explore some of the more technical aspects of keeping the business of Observe running smoothly, including tracking user activity, monitoring our service provider Snowflake, ensuring the health of data transformations, and other aspects of engineering and business observability.
In the fourth part of the series, we will describe the recent optimizations of the Observe on Observe environment that improve the platform performance and decrease operational expenditures.
Tracking User Activity and Queries
One of the most widely used tools is the feature-laden “User Session Drill-down” dashboard, which brings together multiple data sources into a single view focused on user actions. It includes a waterfall session timeline with a section for each browser tab used by a user, shows successes and failures, and links to other dashboards for more details about every page interaction:
When it is necessary to investigate a single event from a user’s session, engineers can drill down into the details of a single customer request via the “Observe User Query” dataset, which includes various implementation details, WebSocket communication events and Snowflake query details:
Even light-on-visuals dashboards are often useful. The “Cost Analysis Dashboard” is a great example of a rapidly built data-focused view that joins multiple datasets to provide information about the usage of dashboards and their costs:
Managing Snowflake Health End to End
O2 is chock-full of useful datasets, dashboards and monitors helping our engineers manage the Snowflake database so that we can provide a fast, quality experience to our customers while keeping a lid on costs.
The high-level “Snowflake Health” dashboard provides an overarching view of the components of Snowflake health that matter for maintaining performance and availability.
Observe runs a tremendous number of Snowflake queries using our own fork of the Snowflake Go driver. In mid-2023 we were responsible for ~2% of daily worldwide Snowflake query volume, and the number has only grown since then. Query compilation time and run statistics are of paramount importance to us. These and more are tracked in the “General Stats” section of the “Snowflake Health” dashboard:
A plethora of Snowflake-focused monitors run all the time, watching for query failures and unusual performance, and alerting engineers via PagerDuty, Slack, or email:
Our non-production Snowflake tenants receive upcoming releases before they hit any of our production deployments. While Snowflake is very thorough in giving all customers plenty of warning about upcoming changes, we still find it important to pay attention when this happens, in the spirit of “trust but verify”.
To accomplish this, the “Snowflake Version Update” monitor alerts our backend engineers when Snowflake makes those releases. This alert is routed to the right engineers as well as to ongoing test automation that runs to validate the impact of any new Snowflake release:
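As a rough illustration of the idea behind such a monitor (not the actual O2 implementation), a version-change check can be sketched as a scan over periodic version observations that reports every transition. The function name, the observation format, and the version strings below are all hypothetical.

```python
from typing import Iterable, Iterator, Optional, Tuple


def detect_version_changes(
    observations: Iterable[Tuple[str, str]],
) -> Iterator[Tuple[str, str, str]]:
    """Yield (timestamp, old_version, new_version) whenever the version changes.

    `observations` is assumed to be ordered by time; each item is a
    (timestamp, version) pair. This shape is an assumption for the sketch.
    """
    previous: Optional[str] = None
    for timestamp, version in observations:
        if previous is not None and version != previous:
            yield (timestamp, previous, version)
        previous = version


# Hypothetical observations; the version numbers are made up.
observations = [
    ("2024-01-01T00:00Z", "8.1.0"),
    ("2024-01-01T01:00Z", "8.1.0"),
    ("2024-01-01T02:00Z", "8.2.0"),
]
changes = list(detect_version_changes(observations))
```

Each reported transition could then be routed both to the engineers on call and to the test automation that validates the new release.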
Ensuring the Health of Dataset Accelerations
Observe dataset accelerations are an innovative way of shaping streams of data into the form of usable, accelerated, fast-to-query datasets that do not require dedicated teams of data scientists, complex data processing frameworks, manual management of clusters, or complex job scheduling.
This key aspect of Observe functionality has its own share of dedicated dashboards in O2, focusing on both the overall health of the platform and on the deep-dive investigation of a single customer dataset.
For example, engineers wanting to see the health of dataset acceleration activities across all tenants or in a single deployment use the “Transform Health” dashboard (“Transform” is just the internal term for dataset acceleration):
Customers can use the Acceleration Manager configuration page in their Observe tenant to set goals for how quickly data sent to the Observe collection endpoint becomes available for queries. Observe engineers and product managers can troubleshoot dataset configuration and acceleration using the “Platform Table Overview” dashboard, including validating the freshness goals set by customers:
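To make the notion of validating a freshness goal concrete, here is a minimal sketch that compares how stale each dataset is against its configured goal. It assumes a dataset can be summarized by the time it was last accelerated plus a goal expressed as a duration; the dataset names and values are invented, and this is not how Observe computes freshness internally.

```python
from datetime import datetime, timedelta, timezone


def staleness(last_accelerated_at: datetime, now: datetime) -> timedelta:
    """How far behind the dataset is at time `now`."""
    return now - last_accelerated_at


def meets_freshness_goal(
    last_accelerated_at: datetime, now: datetime, goal: timedelta
) -> bool:
    """True when the dataset is no staler than its freshness goal."""
    return staleness(last_accelerated_at, now) <= goal


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Hypothetical datasets: name -> (last accelerated at, freshness goal).
datasets = {
    "example/Container Logs": (now - timedelta(minutes=2), timedelta(minutes=5)),
    "example/Billing Events": (now - timedelta(minutes=45), timedelta(minutes=30)),
}

# Names of datasets that are currently missing their goal.
late = sorted(
    name
    for name, (last, goal) in datasets.items()
    if not meets_freshness_goal(last, now, goal)
)
```

In this made-up data, only the dataset that is 45 minutes behind a 30-minute goal ends up in `late`.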
Tracking Source Code Changes and Builds
The Observe O2 tenant tracks code changes from GitHub and Gerrit and software build events from CI/CD infrastructure powered by Gerrit, Jenkins and Bazel.
Our ship velocity is high, so we pay constant attention to the quality of software releases and to the developer experience. Developers do not like to sit around waiting for builds to complete, and the data in O2 is constantly used to optimize the build experience for the Observe team.
For example, the “Bazel Build Trends” dashboard offers SLO-style reporting of build performance across all developers and build automation runners:
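One common way to express SLO-style reporting for builds is the fraction of builds that finish within a target duration. The sketch below assumes that definition; the threshold and the build durations are illustrative, and the dashboard itself may use a different formula.

```python
from typing import Sequence


def slo_attainment(durations_s: Sequence[float], threshold_s: float) -> float:
    """Fraction of builds that completed within `threshold_s` seconds."""
    if not durations_s:
        raise ValueError("at least one build duration is required")
    within = sum(1 for d in durations_s if d <= threshold_s)
    return within / len(durations_s)


# Hypothetical build durations in seconds and a hypothetical 5-minute target.
build_durations_s = [120, 300, 95, 610, 240]
attainment = slo_attainment(build_durations_s, threshold_s=300)
```

Here four of the five builds meet the five-minute target, so the attainment is 0.8; tracking that ratio over time shows whether the build experience is improving or degrading.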
All source control activity in our GitHub and Gerrit repositories is tracked in O2, including commits, pull requests, merges, branches and reviews. A commit in “observe/Git Commit” provides a jumping-off point for exploring changes made by Observe engineers:
Thanks to the links between datasets, the git commit record links via Jenkins and Bazel builds to the Kubernetes image record in the “kubernetes/Image” dataset, and from there to the Kubernetes containers running this image, which produce metrics, logs and traces tagged with the git commit and originating user information.
Deep Dives into Logs, Metrics and Traces
We’ve seen that O2 has many carefully crafted dashboards that are pre-configured for exploring customer and platform data for well-understood purposes (like seeing customer data ingestion metrics or monitoring Snowflake query execution times). But not every possible use of the platform can be predicted in advance, and sometimes one has to dive into the raw input data.
For ad-hoc exploration and troubleshooting ongoing issues, users in O2 can use powerful tools such as the Log Explorer to slice and dice logs, the Metric Explorer to visualize time series data and the Trace Explorer to look at distributed traces.
Many on-the-fly explorations into incidents or product tuning start in these components and get turned into dashboards or monitors for future reuse.
Searching Logs with Log Explorer
Observe Log Explorer enables seamless exploration of any dataset containing structured or unstructured logs. This includes point and click filtering, navigation with statistics, easy content extraction, column data type modifications and a plethora of fast visualization options. These ad-hoc explorations can be saved to a worksheet, converted to a more permanent dashboard, or turned into a monitor.
O2 has hundreds of log datasets covering every aspect of the Observe product, including helpful log datasets for builds, nginx logs, poller logs, billing logs and logs for the AI-powered OPAL Copilot. Here we see the Kubernetes container logs being queried for some criteria:
Graphing Metrics with Metric Explorer
Observe Metric Explorer provides options for visualization of any time series metric that is sent to an Observe tenant. These ad-hoc visualizations can be promoted to a more permanent dashboard or turned into a monitor.
Many key metrics associated with Observe infrastructure are available in O2. For example, our Kafka clusters emit the “encoder_kafka_bytes_total” metric, which can be used to watch the throughput of our Kafka environment for each customer. The Metric Explorer can switch between multiple metrics or show them all at once on a single screen:
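Turning a byte counter into throughput usually means taking the difference between successive samples and dividing by the elapsed time. The sketch below assumes that “encoder_kafka_bytes_total” behaves like a cumulative counter (as the “_total” suffix suggests) that can reset to zero when a process restarts; the sample values are invented, and the Metric Explorer performs this kind of calculation for you.

```python
from typing import List, Sequence, Tuple


def counter_rate(samples: Sequence[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Convert cumulative counter samples into per-second rates.

    `samples` holds (unix_seconds, counter_value) pairs sorted by time.
    Returns (unix_seconds, rate_per_second) for each interval's end point.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order samples
        delta = v1 - v0
        if delta < 0:
            # Counter reset (assumed restart): count from zero again.
            delta = v1
        rates.append((t1, delta / dt))
    return rates


# Hypothetical samples taken one minute apart; the last one follows a reset.
samples = [(0, 1000), (60, 7000), (120, 13000), (180, 600)]
throughput = counter_rate(samples)
```

With these samples the first two intervals work out to 100 bytes per second each, and the interval after the reset to 10 bytes per second.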
Viewing Distributed Traces with Trace Explorer
Observe is a great place to put your distributed traces, whether you carefully craft them yourself (like we do) or use an auto-instrumentation agent. The Observe codebase is instrumented with OpenTelemetry distributed tracing, and all data from the instrumentation of both server-side and client-side components is sent back to O2.
Observe Trace Explorer offers easy exploration for all the traces sent to the platform, allowing for filtering by type, status code, or duration as well as rapid dashboard and monitor creation:
A single trace view provides a wealth of information showing various activities and multiple code paths taken in satisfying user requests:
Conclusion
Managed solutions live and die by their quality and uptime. Quality and uptime are directly related to what you can measure and how you manage it. To measure something you have to observe it.
Observe provides a highly available, immensely configurable, easy-to-use observability solution to our customers. Using the product ourselves gives product engineering and product management great visibility into technical and product usage details. At the same time, it provides a wealth of useful data on customer usage for our field engineering and sales teams.
O2 has a wealth of rich dashboards for all common performance and feature investigations, covering everything from building our software to deploying it and running it for our customers. Powerful ad-hoc exploration tools allow for deep visibility into any aspect of the Observe platform that isn’t covered by previously built dashboards or monitors.
Up Next
Continue on to part four of the series, where we will describe the recent optimizations of the Observe on Observe environment that improve the platform performance and decrease operational expenditures!
Try Observe out yourself, with our free trial!