Kubernetes Observability For Humans
Observing distributed and ephemeral systems requires a paradigm shift in how you collect logs, metrics, and traces, and even in how you alert on important events.
I know what you’re thinking — why does the world need another guide to observability for Kubernetes? In short, there’s a better way now. But first, some history on monitoring Kubernetes and its challenges.
Challenges To Monitoring Kubernetes
The heart of Kubernetes monitoring challenges lies within its architecture, and the very thing it is designed to do — orchestrate containerized workloads. It is designed to free us from the burden of managing “pets” and usher us into a new era of simply managing “cattle.” But any rancher will tell you that managing cattle still takes a considerable effort, and is not without its perils.
Kubernetes is essentially a complex and multi-layered, ever-changing array of services and resources. A simple change to a configuration file can have you and your team drowning in tsunamis of new data. To make matters worse, many — if not most — of the incumbent monitoring tools have a simpler architecture in which they assume your infrastructure is more or less permanent. Trying to monitor Kubernetes with traditional monitoring tools is the definition of cat herding.
Animal metaphors aside, Kubernetes’ unique architecture extends all the way down to how it emits data.
Concerning logging, Kubernetes manages and stores container logs at the Node level, as one would expect. However, those logs are rotated quickly because Nodes have finite disk space, a handy feature until you find yourself troubleshooting production issues at 3 AM. In addition, the Kubernetes command-line tool, kubectl, only queries logs on demand for a small subset of targets, which makes “grepping” through logs across the cluster all but impossible. This headache alone leads most teams to invest in third-party tools or solutions, such as an ELK Stack, to meet their Kubernetes logging needs.
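To make the limitation concrete, here is a minimal Python sketch of what “grepping” across many Pods amounts to once logs live somewhere central. The `grep_logs` function and the sample data are hypothetical. The sketch assumes log lines have already been shipped off the Nodes and keyed by Pod name, which is the part Kubernetes does not do for you.

```python
import re
from typing import Dict, Iterable, List, Tuple


def grep_logs(logs_by_pod: Dict[str, Iterable[str]], pattern: str) -> List[Tuple[str, str]]:
    """Return (pod_name, line) pairs for every log line matching the regex."""
    regex = re.compile(pattern)
    matches = []
    # Iterate Pods in sorted order so the output is deterministic.
    for pod_name in sorted(logs_by_pod):
        for line in logs_by_pod[pod_name]:
            if regex.search(line):
                matches.append((pod_name, line))
    return matches


# Hypothetical sample data standing in for centrally collected container logs.
sample_logs = {
    "checkout-7d9f": ["GET /cart 200", "POST /pay 500 upstream timeout"],
    "payments-5c2a": ["connection pool exhausted", "POST /charge 200"],
}

hits = grep_logs(sample_logs, r"500|exhausted")
for pod_name, line in hits:
    print(f"{pod_name}: {line}")
```

The search itself is trivial. The hard part in a real cluster is getting every Pod's logs into one place before the Node rotates them away.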
When it comes to metrics, this data is readily available via the Kubernetes API. However, the data model is so rich you’ll spend countless hours devising and implementing a tagging scheme to make sense of the data. On top of this, it’s up to you to design or employ tools like Prometheus to store, correlate, and analyze all of this data!
Logging and metrics aren’t even half of the different types of data your cluster emits. But let’s say that you slogged through the different data sources and tools to collect this data using the more traditional “3 Pillars” approach to observability. You have all your data, but you now work for your data and not the other way around, because it’s still siloed!
Without a bridge — or context — between your data sources, you’ll have to work hard to find connections and correlations. That will cost you significant time and money.
What Should You Observe In Kubernetes?
Now that we understand some of the many challenges of observing a Kubernetes cluster, where do we start on our journey to observing Kubernetes? Let’s begin by identifying the types of data you should consider collecting, and later we’ll cover how to correlate and analyze all of this data.
Arguably, the first and most important place to start is with logs. Logs are the bread and butter of observability for any environment. As you might have guessed, Kubernetes has a wide variety of log types and sources.
Types of Logs in Kubernetes:
- Container Logs
- Node Logs
- Audit Logs
- Event Logs
- Ingress Logs
- API Server Logs
Your observability playbook for Kubernetes should begin with capturing as many of these log types as you can, if not all of them.
Metrics monitoring is a bit more straightforward, as many of the same metrics found in more traditional infrastructure can be found in Kubernetes. However, there are quite a few Kubernetes-specific metrics you should be aware of. Furthermore, collecting metrics in Kubernetes is a bit more difficult because, at the moment, there is no native solution for storing and analyzing them.
There are management platforms for Kubernetes such as Amazon Elastic Kubernetes Service that can provide you with rich and integrated metrics, but they will leave you short on all other observability telemetry.
Your other option is to directly query the Kubernetes API, which will require yet another third-party tool or a significant in-house development effort to create tooling.
However you choose to collect metrics, you should be aware of the various types of metrics.
Cluster & Node Metrics
The first metrics you should concern yourself with are metrics that pertain to the overall health of the cluster. This will help ensure you have enough physical resources to handle your containerized workloads, see which applications run on which Nodes, and ensure the Nodes are working properly.
- Memory, CPU, and Network Usage
- Number of Nodes Available
- Number of Running Pods (per Node)
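As a rough illustration of the last item, the sketch below counts running Pods per Node from a list of Pod objects. It assumes the Pods have already been fetched and decoded into dictionaries shaped like the Kubernetes API's Pod JSON (`spec.nodeName`, `status.phase`). The function name and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def running_pods_per_node(pods: List[dict]) -> Dict[str, int]:
    """Count Pods in the Running phase, grouped by the Node they are scheduled on."""
    counts = Counter()
    for pod in pods:
        if pod.get("status", {}).get("phase") != "Running":
            continue
        node = pod.get("spec", {}).get("nodeName")
        if node:  # Unscheduled Pods have no nodeName yet.
            counts[node] += 1
    return dict(counts)


# Hypothetical sample data shaped like Pod objects from the API.
sample_pods = [
    {"spec": {"nodeName": "node-a"}, "status": {"phase": "Running"}},
    {"spec": {"nodeName": "node-a"}, "status": {"phase": "Running"}},
    {"spec": {"nodeName": "node-b"}, "status": {"phase": "Running"}},
    {"spec": {}, "status": {"phase": "Pending"}},
]

per_node = running_pods_per_node(sample_pods)
print(per_node)
```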
Moving on from the cluster level to the individual Pod level, Kubernetes metrics allow you to gauge Pod health at a glance.
- Current Deployment (and Daemonset)
- Missing and Failed Pods
- Running vs Desired Pods
- Pod Restarts
- Available and Unavailable Pods
- Pods in the CrashLoopBackOff state
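Several of these signals can be derived from objects the API already returns. The following is a minimal sketch, not a definitive implementation. It assumes Pod and Deployment objects decoded into dictionaries shaped like the Kubernetes API JSON (`status.containerStatuses[].restartCount`, `state.waiting.reason`, `spec.replicas`, `status.availableReplicas`). The helper names and the sample data are hypothetical.

```python
from typing import List


def crash_looping_pods(pods: List[dict]) -> List[str]:
    """Names of Pods with at least one container waiting in CrashLoopBackOff."""
    names = []
    for pod in pods:
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                names.append(pod["metadata"]["name"])
                break  # One crash-looping container is enough to flag the Pod.
    return names


def total_restarts(pods: List[dict]) -> int:
    """Sum of container restart counts across all Pods."""
    return sum(
        cs.get("restartCount", 0)
        for pod in pods
        for cs in pod.get("status", {}).get("containerStatuses", [])
    )


def missing_replicas(deployment: dict) -> int:
    """Desired replicas minus available replicas, never below zero."""
    desired = deployment.get("spec", {}).get("replicas", 1)
    # The API omits availableReplicas when it is zero.
    available = deployment.get("status", {}).get("availableReplicas", 0)
    return max(desired - available, 0)


# Hypothetical sample data.
sample_pods = [
    {"metadata": {"name": "web-1"},
     "status": {"containerStatuses": [{"restartCount": 0, "state": {"running": {}}}]}},
    {"metadata": {"name": "web-2"},
     "status": {"containerStatuses": [
         {"restartCount": 7, "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
]
sample_deployment = {"spec": {"replicas": 3}, "status": {"availableReplicas": 1}}

print(crash_looping_pods(sample_pods), total_restarts(sample_pods),
      missing_replicas(sample_deployment))
```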
Container metrics are what users are most familiar with, as they provide detailed information about various “hardware” components and usage statistics. They are similar to what you’d find in a more traditional environment.
- Container CPU
- Container memory utilization
- Network usage
Finally, any fully observable environment must include application metrics. Each organization has to decide which of these help them best gauge performance and reliability relative to the business scope and objectives of each of their products and services.
Here are a few of the most important performance-related metrics that organizations should track.
- Error Rates
- Application Performance Index (Apdex Score)
- Average Response Time
- Request Rate
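Two of these are simple enough to compute by hand. The sketch below uses the standard Apdex formula, (satisfied + tolerating / 2) / total, where a request is “satisfied” at or below a target threshold T and “tolerating” up to 4T. Treating only HTTP 5xx responses as errors is an assumption for this example, and the function names and sample values are hypothetical.

```python
from typing import Sequence


def apdex(response_times_s: Sequence[float], threshold_s: float) -> float:
    """Apdex score for a set of response times and a target threshold T (seconds)."""
    if not response_times_s:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for t in response_times_s if t <= threshold_s)
    tolerating = sum(1 for t in response_times_s if threshold_s < t <= 4 * threshold_s)
    return (satisfied + tolerating / 2) / len(response_times_s)


def error_rate(status_codes: Sequence[int]) -> float:
    """Share of requests that returned an HTTP 5xx status (assumed error definition)."""
    if not status_codes:
        raise ValueError("need at least one sample")
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)


# Hypothetical sample data: five requests with a 0.5 s target.
score = apdex([0.1, 0.3, 0.6, 1.2, 2.5], threshold_s=0.5)
errors = error_rate([200, 200, 500, 404, 503])
print(f"apdex={score:.2f} error_rate={errors:.2f}")
```

With this sample, two requests are satisfied, two are tolerating, and one is frustrated, so the score is (2 + 1) / 5.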
After you’ve identified which logs and metrics you’d like to collect, you may also consider collecting traces. Traces allow you to map interactions between different services and resources to identify the root cause of a failure or event.
Traces are relatively easy to collect for Kubernetes system components, though doing so requires some configuration and a third-party tool or two. You can use the OpenTelemetry Collector to route traces to an appropriate tracing backend for storage, processing, and analysis.
In terms of getting traces from applications, that will require some instrumentation on your part. You can achieve application-level tracing by implementing a distributed tracing library — such as OpenTracing — directly into your application’s code, or by using a service mesh. Both require significant development and implementation time, and their value may not prove worth the effort.
As mentioned earlier, Kubernetes provides information about its state (essentially any change to any resource type) via the API logs. Without special tooling, though, reconstructing your cluster’s state from these logs may prove very difficult. The value of knowing your cluster’s state is greatly magnified when used in conjunction with logs, metrics, and other observability data. Together, they can give you a complete picture of the current and past state of your cluster, and insight into its future state.
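To show what reconstructing state involves, here is a minimal sketch that replays a change log up to a chosen moment. The record format (`ts`, `resource`, `op`, `state`) is invented for this example and is not the format of Kubernetes API or audit logs. Real logs would first need to be parsed into something like it.

```python
from typing import Dict, List


def state_at(changes: List[dict], ts: int) -> Dict[str, dict]:
    """Replay changes in time order and return the resource state as of `ts`."""
    state: Dict[str, dict] = {}
    for change in sorted(changes, key=lambda c: c["ts"]):
        if change["ts"] > ts:
            break  # Everything after this point happened later than `ts`.
        if change["op"] == "delete":
            state.pop(change["resource"], None)
        else:  # "set" covers both creates and updates in this sketch.
            state[change["resource"]] = change["state"]
    return state


# Hypothetical change log, deliberately out of order.
sample_changes = [
    {"ts": 10, "resource": "deployment/web", "op": "set", "state": {"replicas": 2}},
    {"ts": 20, "resource": "deployment/web", "op": "set", "state": {"replicas": 5}},
    {"ts": 30, "resource": "deployment/web", "op": "delete", "state": None},
    {"ts": 15, "resource": "pod/db-0", "op": "set", "state": {"phase": "Running"}},
]

print(state_at(sample_changes, 25))
```

The replay is simple once the changes are in one ordered stream. Collecting and normalizing them is where the special tooling comes in.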
Kubernetes’ components continually emit Events, which are essentially small messages that let you know what happens inside each moving part. Most people’s experience with Kubernetes events comes from issuing the kubectl describe command, which displays events at the bottom of its output.
It must be noted that events in Kubernetes are just that: they do not trigger any behavior or condition. They are simply the output from components “talking” and are intended to be used solely for diagnostic purposes.
Events coupled with logs, metrics, traces, and other sources of data can be invaluable. They allow you to link events to resources and establish a detailed timeline of exactly when changes occurred in your cluster, and the effects they may have had. You should strongly consider adding them to your observability solution.
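A per-resource timeline can be sketched in a few lines. The example assumes events have been decoded into dictionaries shaped like core/v1 Event objects (`involvedObject.kind`, `involvedObject.name`, `reason`, `lastTimestamp`) and that `lastTimestamp` is populated as a UTC string with whole seconds. The function name and the sample events are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List


def build_timelines(events: List[dict]) -> Dict[str, List[str]]:
    """Group event reasons by involved resource, ordered by timestamp."""
    timelines = defaultdict(list)
    for event in events:
        obj = event["involvedObject"]
        key = f'{obj["kind"]}/{obj["name"]}'
        when = datetime.strptime(event["lastTimestamp"], "%Y-%m-%dT%H:%M:%SZ")
        timelines[key].append((when, event["reason"]))
    # Sort each resource's events chronologically and keep only the reasons.
    return {key: [reason for _, reason in sorted(items)]
            for key, items in timelines.items()}


# Hypothetical sample events, deliberately out of order.
sample_events = [
    {"involvedObject": {"kind": "Pod", "name": "web-1"},
     "reason": "BackOff", "lastTimestamp": "2024-01-01T03:01:10Z"},
    {"involvedObject": {"kind": "Pod", "name": "web-1"},
     "reason": "Scheduled", "lastTimestamp": "2024-01-01T03:00:01Z"},
    {"involvedObject": {"kind": "Node", "name": "node-a"},
     "reason": "NodeNotReady", "lastTimestamp": "2024-01-01T03:00:30Z"},
    {"involvedObject": {"kind": "Pod", "name": "web-1"},
     "reason": "Pulled", "lastTimestamp": "2024-01-01T03:00:05Z"},
]

timelines = build_timelines(sample_events)
print(timelines)
```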
Lastly, teams should think beyond the data sources mentioned earlier. Data that is generated by CI/CD pipelines, customer service software, and other applications outside your environment(s) can greatly enhance your ability to see trends, troubleshoot, and more.
Kubernetes Observability With Observe
Traditionally, observability for Kubernetes has been achieved in one of two ways.
The first solution involves employing and managing a multitude of third-party tools as outlined above. To illustrate, it would take at least three tools to collect, store, and analyze all the various types of observability data a Kubernetes cluster can generate. Needless to say, this effort is mired in costs, both in engineering time and money. Once completed, your observability data is still siloed and very difficult to correlate.
The other solution requires running Kubernetes on a managed platform, such as Amazon Elastic Kubernetes Service (EKS). While this may seem like a complete observability strategy since you get monitoring out of the box, it still leaves out a ton of functionality — like alerting — and loads of important observability data. Moreover, these solutions won’t give you insight into your applications’ performance, application performance over time, or even the overall environment state. Though convenient, these platforms are hardly a solution for full observability.
Now there’s a third, and better, way.
Observe’s platform was designed to thrive with cloud-native technologies like Kubernetes. Rather than assuming that Kubernetes environments can be observed in the same way as static environments — focusing on only one layer of observability — Observe takes a dynamic, holistic approach to Kubernetes observability.
Deploying Observe in Kubernetes is as simple as running a single kubectl command. From there, Observe does the dirty work. It automatically collects logs, metrics, and state changes from all layers and components of your cluster. It then uses that data to track the overall state of your environment continuously. You don’t need to worry about deploying multiple agents or aggregating different logs manually.
Thanks to Observe’s innate ability to correlate data from any source — no matter the format — you can send data from applications and sources outside your Kubernetes cluster to Observe as well. Events from GitHub, Zendesk and virtually any HTTP endpoint can be sent to Observe to bring even more context to your environment(s).
Because Observe maintains a complete changelog of all events in your cluster, you can easily reconstruct the state of your environment from any point in time. There’s no need to wade through historical log data manually to understand the past. Observe keeps track of the past for you, while at the same time providing real-time visibility into the present.
Lastly, put to rest the question “Is my cluster healthy?” Kubernetes doesn’t make it easy to know when things go wrong. But with Observe, you can create alerts for any Resource or Dataset related to your Kubernetes cluster. Get notifications on failed deployments, Pods stuck in a restart loop, or even the status of the Kubernetes API Server.
In With The New
Distributed systems and software like Kubernetes are here to stay. You can expect them to only grow in complexity and become even more abstracted from traditional on-premises or cloud-based computing environments.
Observing distributed and ephemeral systems requires a paradigm shift in how you collect logs, metrics, and traces, and even in how you alert on important events. Otherwise, you’ll end up chasing your tail and creating more work below the value line.
Rather than spending countless hours trying to automate your current monitoring tools and creating tags to organize your siloed data, you need a tool that can automatically gather the data you have today, in any format, and begin to make sense of it now.