I still know what you did last summer… Ensure observability is in place for your system in the cloud.
I still know what you did last summer… This is what haunts developers and support team members when a new feature, a quick fix, or insufficiently tested code reaches a live system. Without good observability, they are not sure how to react when an issue occurs in production. What caused it? Is it my new feature, a recent fix, a legacy issue, an infrastructure issue, or something else?
What is observability in software?
Observability is the ability to infer internal states of a system based on the system’s external outputs. System observability is the method for evaluating outputs to reach meaningful conclusions about internal states of the system.
How is it different from Monitoring?
How does Observability help?
i) Observability helps you understand the internals of your production system by asking questions from the outside
Historically, we have monitored metric values like CPU and memory based on known past issues. And every time a new issue arises, we add a new monitor. But the problem is that we usually end up with noisy monitors, which people tend to ignore. Or we end up with monitors whose purpose no one understands.
ii) Observability is a superset of monitoring
Modern distributed applications use microservices, containers, cloud services, serverless functions, and many combinations of these technologies. All of these increase the number of failures a system will have, because there are so many interacting parts. And because distributed systems are so diverse, it is hard to understand present problems and predict future ones.
As your system grows in usage and complexity, new and different problems will keep emerging. Known problems from the past might not occur again, and you’ll have to deal with unknown problems regularly.
For instance, what usually happens is that when there’s a problem in a production environment, sysadmins are the ones trying to find out what the problem is. They can make guesses based on the metrics they see. If the CPU usage is high, it might be because the traffic has increased. If the memory is high, it might be because there’s a memory leak in the application. But these are just guesses.
Systems need to emit telemetry that points to what the problem is. That is difficult, because it’s impossible to anticipate every failure scenario. Even if you instrument your application with logs, at some point you’ll need more context; the system should be observable.
Observability is what will help you to be better when troubleshooting in production. You’ll need to zoom in and zoom out over and over again. Take a different path, deep-dive into logs, read stack trace errors, and do everything you can to answer new questions to find out what’s causing problems.
How can you achieve Observability?
Metrics
Logging
Request Tracing
Monitoring
Alerting
Details about each pillar of Observability
1. Metrics
A metric is a value that expresses some data about a system. These metrics are usually represented as counts or measures, and are often aggregated or calculated over a period of time. A metric can tell you how much memory is being used by a process out of the total, or the number of requests per second being handled by a service.
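As a rough illustration of the two examples above, here is a minimal Python sketch of a request counter that reports requests per second over a sliding window, plus a memory-usage ratio. The names `RequestRateMetric` and `memory_used_ratio` are made up for this article and are not part of any cloud SDK; timestamps are passed in explicitly so the logic is easy to follow.

```python
from collections import deque
from typing import Deque


class RequestRateMetric:
    """Counts requests and reports requests per second over a sliding window."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window_seconds = window_seconds
        self.total = 0  # running count of every request seen so far
        self._timestamps: Deque[float] = deque()

    def record(self, now: float) -> None:
        """Register one handled request at time `now` (in seconds)."""
        self.total += 1
        self._timestamps.append(now)

    def rate(self, now: float) -> float:
        """Average requests per second over the last `window_seconds`."""
        cutoff = now - self.window_seconds
        # Drop samples that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) / self.window_seconds


def memory_used_ratio(used_bytes: int, total_bytes: int) -> float:
    """Gauge-style metric: fraction of total memory currently in use."""
    return used_bytes / total_bytes
```

In a real system you would normally hand these values to your provider's metrics service instead of keeping them in process memory; the sketch only shows what "aggregated over a period of time" means in practice.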
Time series metrics dashboards can be used to identify a subset of traces that point to underlying issues or bugs — and log messages associated with those traces can identify the root cause of the issue. Then, new metrics can be configured to more proactively identify similar issues before the next incident.
Examples of widely used metrics: System Throughput, Network Utilization, CPU Utilization, Queue Throughput, Turnaround Time, System Availability, Disk Utilization, Discrepancy Reports
AWS — AWS CloudWatch
Azure — Azure Monitor
GCP — Cloud Monitoring
2. Logging
Logs are structured or unstructured lines of text that an application emits in response to some event in the code. They are distinct records of “what happened” to or with a specific system attribute at a specific time. They are typically easy to generate, difficult to extract meaning from, and expensive to store.
Logs should be processed centrally so that insights and alerts can be generated from them. Activity logs, diagnostic logs, application logs, event logs, and even custom logs can send information to a centralized log system, which can then provide rich reporting, dashboarding, and analytics capabilities to get insights from the incoming data and act on them.
Reactively, logs help you find the areas and locations that are causing issues, identify those issues, and fix them faster and better.
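To make the "structured" part concrete, here is a minimal sketch that uses Python's standard `logging` module to emit one JSON object per log line, which centralized log systems generally find easier to query than free text. The `JsonFormatter` and `build_logger` names, and the convention of passing extra fields under a `context` key, are assumptions made for this example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed as extra={"context": {...}} end up on the record.
        context = getattr(record, "context", None)
        if isinstance(context, dict):
            payload["context"] = context
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger
```

A call such as `build_logger("orders").info("payment failed", extra={"context": {"order_id": "o-1"}})` would then produce a line that a log pipeline can filter by `level` or `context.order_id`.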
AWS — AWS CloudWatch
Azure — Azure Monitor, Azure Monitor Logs
GCP — Cloud Monitoring, Cloud Logging
3. Request Tracing
A single trace shows the activity for an individual transaction or request as it flows through an application. Traces are a critical part of observability, as they provide context for other telemetry. For example, traces can help define which metrics would be most valuable in a given situation, or which logs are relevant to a particular issue.
For known problems, you have everything under control, at least in theory. You have an incredible runbook that you just need to follow, and customers shouldn’t notice that anything happened. But in reality, this often isn’t how things work. Customers still complain about issues in your system even if your monitors look good.
Therefore, monitors for metrics alone are not enough. You need context, and you can get it from your logs.
Correlating these different sources of data is challenging but not impossible. For example, microservices rely heavily on HTTP headers to pass information between calls.
Something as simple as marking a user’s request with a unique ID can make the difference when debugging in production. By using the request ID in a centralized storage location, you can get all the context from a user’s call at a specific point in time…like the time when the user complained but your monitors said things were all good.
Also, when viewed in aggregate, traces can reveal immediate insights into what is having the largest impact on performance or customer experience, and surface only the metrics and logs that are relevant to an issue.
A few recommended practices for tracing: attach a unique request ID, customer ID, tenant ID, user ID, account ID, server IP, and similar identifiers to each request.
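The request ID idea can be sketched in a few lines of Python. This is only an illustration under some assumptions: the header name `X-Request-ID` is a common convention rather than something this article's sources prescribe, the function names are invented here, and header lookup is case-sensitive for simplicity.

```python
import contextvars
import logging
import uuid
from typing import Dict, Mapping

REQUEST_ID_HEADER = "X-Request-ID"  # assumed header name, a common convention

# Holds the ID of the request currently being handled in this context.
_request_id = contextvars.ContextVar("request_id", default="-")


def start_request(incoming_headers: Mapping[str, str]) -> str:
    """Reuse the caller's request ID, or create one at the edge of the system."""
    request_id = incoming_headers.get(REQUEST_ID_HEADER) or uuid.uuid4().hex
    _request_id.set(request_id)
    return request_id


def outgoing_headers() -> Dict[str, str]:
    """Headers to attach to downstream calls so the ID follows the request."""
    return {REQUEST_ID_HEADER: _request_id.get()}


class RequestIdFilter(logging.Filter):
    """Stamp every log record with the current request ID."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = _request_id.get()
        return True
```

With the filter attached to a handler, every log line written while serving a request carries the same ID, so searching the centralized log store for that one value returns the full story of a single user call across services.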
AWS — AWS CloudWatch
Azure — Azure Monitor, Azure Monitor Logs
GCP — Cloud Monitoring, Cloud Logging, Cloud Trace, Cloud Debugger, Cloud Profiler
4. Monitoring
Monitoring is an important architectural concern that should be part of any solution, whether big or small, mission critical or not, cloud or not. It should never be skipped.
Monitoring is the practice of collecting measurements of key aspects of infrastructure and applications. Examples include average CPU utilization over the past minute, the number of bytes written to a network interface, and the maximum memory utilization over the past hour. These measurements, which are known as metrics, are made repeatedly over time and constitute a time series of measurements.
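The examples in this paragraph (an average over the past minute, a maximum over the past hour) boil down to aggregating a time series over a window. Here is a minimal sketch, assuming samples are kept in memory as `(timestamp, value)` pairs; the `TimeSeries` class is hypothetical and far simpler than what a real monitoring service does.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TimeSeries:
    """Append-only list of (timestamp_seconds, value) samples for one metric."""

    samples: List[Tuple[float, float]] = field(default_factory=list)

    def add(self, timestamp: float, value: float) -> None:
        self.samples.append((timestamp, value))

    def window(self, now: float, seconds: float) -> List[float]:
        """Values sampled in the interval (now - seconds, now]."""
        return [v for t, v in self.samples if now - seconds < t <= now]

    def average(self, now: float, seconds: float) -> float:
        values = self.window(now, seconds)
        if not values:
            raise ValueError("no samples in window")
        return sum(values) / len(values)

    def maximum(self, now: float, seconds: float) -> float:
        values = self.window(now, seconds)
        if not values:
            raise ValueError("no samples in window")
        return max(values)
```

For instance, `cpu.average(now, 60)` would give the average CPU utilization over the past minute and `memory.maximum(now, 3600)` the peak memory utilization over the past hour, given series named `cpu` and `memory`.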
Monitoring helps you take both proactive and reactive measures on the solution. It is also the first step towards auditability of the solution. Without monitoring log records, it is difficult to audit the system from perspectives such as security, performance, and availability.
Monitoring helps identify availability, performance, and scalability issues before they happen. With monitoring, hardware failures, software misconfigurations, and patch update problems can be detected well before they impact users, and performance degradation can be addressed before users notice it.
AWS — AWS CloudWatch
Azure — Azure Monitor, Azure Monitor Logs, Azure Power BI
GCP — Cloud Monitoring, Cloud Logging
5. Alerts & Alarms
It is possible to generate alerts on the ingested data. A cloud monitoring solution does so by running a pre-defined query, composed of conditions, on the incoming data. If it finds a record or a group of records that matches the query, it generates an alert. Such solutions provide a highly configurable environment for defining the conditions that generate alerts, the time window from which the query should return records, the schedule on which the query should be executed, and the action to take when the query returns results as alerts.
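The moving parts described above (a condition, a time window, and an action) can be sketched as a small rule object. This is an illustrative simplification, not how any particular cloud provider implements alerting; `AlertRule` and its fields are names invented for this example, and each record is assumed to carry a numeric `timestamp` field.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass
class AlertRule:
    name: str
    condition: Callable[[Record], bool]  # the "query" applied to each record
    lookback_seconds: float  # time window the query reads records from
    threshold: int  # number of matching records needed to fire
    action: Callable[[str, List[Record]], None]  # what to do when it fires

    def evaluate(self, records: List[Record], now: float) -> bool:
        """Run the rule once; return True if the alert fired."""
        start = now - self.lookback_seconds
        matches = [
            r
            for r in records
            if start < float(r["timestamp"]) <= now and self.condition(r)
        ]
        if len(matches) >= self.threshold:
            self.action(self.name, matches)
            return True
        return False
```

A scheduler would call `evaluate` at the configured interval. For example, a rule with `condition=lambda r: r["status"] >= 500`, `lookback_seconds=300` and `threshold=2` fires when at least two server errors were ingested in the last five minutes, and its `action` could page the on-call engineer or post to a chat channel.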
AWS: CloudWatch
Azure: Azure Monitor
GCP: Cloud Monitoring
A better level of Observability helps
i) Developers: quickly find the root causes of critical issues and provide fixes.
ii) Ops team: monitor system health and take proactive steps (for example, when the system is about to breach critical thresholds). When a bug or issue affects the system’s availability or reliability, the Ops team can take reactive steps to analyse the issue and provide a fix and, in some scenarios, a quick explanation to the business.
iii) Business: gain the confidence to allow quicker functional deployments and achieve business agility. It also helps achieve better auditability of the system.
Reference articles for this article:
https://www.scalyr.com/blog/observability-production-systems-why-how/
https://lightstep.com/observability/#telemetry-data-logs-metrics-and-traces