Alerting

    • Service level indicators, or SLIs, are carefully selected monitoring metrics that measure one aspect of a service's reliability. Ideally, SLIs should have a close linear relationship with your users' experience of that reliability, and we recommend expressing them as the ratio of two numbers: the number of good events divided by the count of all valid events.
    • A Service level objective, or SLO, combines a service level indicator with a target reliability. If you express your SLIs as is commonly recommended, your SLOs will generally be somewhere just short of 100%, for example, 99.9%, or "three nines."
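    • For example, a minimal sketch of those two ideas (the counts and the 99.9% target below are illustrative, not from any particular service):

```python
# Illustrative only: compute an SLI as good events / valid events
# and compare it against an SLO target just short of 100%.

good_events = 99_950      # e.g., requests that succeeded within the latency goal
valid_events = 100_000    # all requests that count toward the SLO

sli = good_events / valid_events   # 0.9995
slo_target = 0.999                 # "three nines"

print(f"SLI = {sli:.4%}, SLO target = {slo_target:.1%}")
print("SLO met" if sli >= slo_target else "SLO missed")
```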
  • You can't measure everything, so when possible, you should choose SLOs that are S.M.A.R.T. (Specific, Measurable, Achievable, Relevant, Time-bound).
    • SLOs should be specific. "Hey everyone, is the site fast enough for you?" is not specific; it's subjective. "The 95th percentile of results is returned in under 100 ms." That's specific.
    • They need to be based on indicators that are measurable. A lot of monitoring is numbers, grouped over time, with math applied. An SLI must be a number or a delta, something we can measure and place in a mathematical equation.
    • SLO goals should be achievable. "100% Availability" might sound good, but it's not possible to obtain, let alone maintain, over an extended window of time.
    • SLOs should be relevant. Does it matter to the user? Will it help achieve application-related goals? If not, then it’s a poor metric.
    • And SLOs should be time-bound. You want a service to be 99% available? That’s fine. Is that per year? Per month? Per day? Does the calculation look at specific windows of set time, from Sunday to Sunday for example, or is it a rolling period of the last seven days? If we don't know the answers to those types of questions, it can’t be measured accurately.
    • Service Level Agreements, or SLAs, are commitments made to your customers that your systems and applications will have only a certain amount of “down time.” An SLA describes the minimum levels of service that you promise to provide to your customers and what happens when you break that promise.
    • If your service has paying customers, an SLA may include some way of compensating them with refunds or credits when that service has an outage that is longer than this agreement allows. To give you the opportunity to detect problems and take remedial action before your reputation is damaged, your alerting thresholds are often substantially higher than the minimum levels of service documented in your SLA.
    • Here is an example of an SLA: maintain an error rate of less than 0.3% for the billing system. The error rate is the quantifiable measure, which is the SLI, and 0.3% is the specific target, which is the SLO in this case.
    • An alert is an automated notification sent by Google Cloud through some notification channel to an external application, ticketing system, or person.
  • Why is the alert being sent? Perhaps a service is down, or an SLO isn't being met. Regardless, an alert is generated when something needs to change. The events are processed through a time series: a series of event data points broken into successive, equally spaced windows of time. Based on need, the duration of each window and the math applied to the member data points inside each window are both configurable. Because of the time series, events can be summarized, error rates can be calculated, and alerts can be triggered where appropriate.
    • A great time to generate alerts is when a system is trending toward spending all of its error budget before the end of the allocated time window. An error budget is perfection minus SLO. SLIs are things that are measured, and SLOs represent achievable targets. If the SLO target is "90% of requests must return in 200 ms," then the error budget is 100% - 90% = 10%.
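    • A minimal sketch of the two ideas above, bucketing events into equally spaced windows and comparing the error rate against the error budget implied by the SLO (every number and event below is invented for illustration):

```python
from collections import Counter

WINDOW_SECONDS = 60
SLO = 0.90                 # "90% of requests must return in 200 ms"
ERROR_BUDGET = 1.0 - SLO   # 10% of requests may miss the target

# (timestamp_seconds, is_error) pairs; in practice these come from monitoring data.
events = [(3, False), (45, True), (70, False), (95, True), (130, False), (140, False)]

errors_per_window = Counter()
totals_per_window = Counter()
for ts, is_error in events:
    window = ts // WINDOW_SECONDS          # equally spaced windows of time
    totals_per_window[window] += 1
    if is_error:
        errors_per_window[window] += 1

for window in sorted(totals_per_window):
    rate = errors_per_window[window] / totals_per_window[window]
    print(f"window {window}: error rate {rate:.0%}")

# Fraction of the error budget consumed so far; over 100% means trending to overspend.
budget_spent = (sum(errors_per_window.values()) / sum(totals_per_window.values())) / ERROR_BUDGET
print(f"error budget consumed: {budget_spent:.0%}")
```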
  • Alert Attributes:
    • Precision: the proportion of detected alerts that were relevant, that is, relevant alerts divided by the sum of relevant and irrelevant alerts. It's decreased by false alerts. Precision can be seen as a measure of exactness.
    • Recall: the proportion of relevant alerts that were detected, that is, relevant alerts detected divided by the sum of relevant alerts detected and missed alerts. It's decreased by missed alerts. Recall is a measure of completeness (see the sketch after this list).
    • Detection time: how long it takes the system to notice an alert condition. Long detection times can negatively affect the error budget, but alerting too fast can generate false positives.
    • Reset time: how long alerts continue to fire after an issue has been resolved. Continued alerts on repaired systems can lead to confusion.
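    • A minimal calculation sketch of precision and recall as defined above (the alert counts are made up for illustration):

```python
# Illustrative only: precision and recall from alert counts.
relevant_alerts_detected = 45    # true positives
irrelevant_alerts_detected = 5   # false positives (false alerts)
relevant_alerts_missed = 10      # false negatives (missed alerts)

precision = relevant_alerts_detected / (relevant_alerts_detected + irrelevant_alerts_detected)
recall = relevant_alerts_detected / (relevant_alerts_detected + relevant_alerts_missed)

print(f"precision = {precision:.0%}")   # decreased by false alerts (exactness)
print(f"recall    = {recall:.0%}")      # decreased by missed alerts (completeness)
```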
    • Error budgeting 101 would state that when the error count, or whatever is being measured, is trending to be greater than the allowed error budget, an alert should be generated. Both the SLO itself, and the idea of “trending toward” require windows of time over which they are calculated.
  • Alert window lengths
    • Smaller windows tend to yield faster alert detections and shorter reset times, but they also tend to decrease precision because of their tendency toward false positives.
    • Longer windows tend to yield better precision, because they have longer to confirm that an error is really occurring. But reset and detection times are also longer. That means you spend more error budget before the alert triggers.
    • One trick might be to use short windows but add a successive failure count. One window failing won’t trigger the alert, but when three fail in a row the alert is triggered (see the sketch after this list). This way, the error is spotted quickly but treated as an anomaly until the duration or error count is reached. This is what you do when your car starts making a sound. You don't immediately freak out, but you pay attention and try to determine whether it's a real issue or a fluke.
    • So how do we get good precision and recall? This is achieved with multiple conditions. Many variables affect a good alerting strategy, including the amount of traffic, the error budget, and peak and slow periods, to name a few. The fallacy is believing that you have to choose a single option. Define multiple conditions in an alerting policy to get better precision, recall, detection time, and reset time.
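    • A minimal sketch of the short-window-plus-consecutive-failure trick mentioned above (the threshold, window count, and error rates are invented for illustration):

```python
# Illustrative only: fire an alert only after N consecutive short windows
# exceed the error-rate threshold, trading a little detection time for precision.

ERROR_RATE_THRESHOLD = 0.05        # a window with > 5% errors counts as "failing"
CONSECUTIVE_FAILURES_TO_ALERT = 3

def should_alert(window_error_rates):
    """Return True once the threshold is exceeded in N windows in a row."""
    streak = 0
    for rate in window_error_rates:
        streak = streak + 1 if rate > ERROR_RATE_THRESHOLD else 0
        if streak >= CONSECUTIVE_FAILURES_TO_ALERT:
            return True
    return False

print(should_alert([0.01, 0.08, 0.02, 0.09, 0.07]))   # False: failures never occur 3 in a row
print(should_alert([0.02, 0.06, 0.07, 0.09, 0.01]))   # True: three failing windows in a row
```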
  • You can also define multiple alerts through multiple channels. Perhaps a short window condition generates an alert, but it takes the form of a Pub/Sub message to a Cloud Run container, which then uses complex logic to check multiple other conditions before deciding whether a human gets a notification.
  • Severity levels are an important concept in alerting that help you and your team properly assess which notifications should be prioritized. You can use these levels to focus on the issues deemed most critical for your operations and triage through the noise. You can create custom severity levels on your alert policies and have this data included in your notifications for more effective alerting and integration with downstream third-party services (for example, Webhook, Cloud Pub/Sub, PagerDuty).
    • The notification channels were enhanced to accept this data—including Email, Webhooks, Cloud Pub/Sub, and PagerDuty. This enables further automation and customization based on importance wherever the notifications are consumed. High-priority alerts might go to Slack, SMS, and/or maybe even a third-party solution like PagerDuty. You can even use multiple channels together for redundancy. Low-priority alerts might be logged, sent through email, or inserted into a support ticket management system.
  • Alert policies can also be created from the gcloud CLI, the API and Terraform. It starts with an alert policy definition in either a JSON or YAML format. One neat trick when learning the correct file format is to create an alert using the Google Cloud console. Then use the gcloud monitoring policies list and the describe commands to see the corresponding definition file. The alerting API and gcloud CLI can create, retrieve, and delete alerting policies.
  • Google Cloud defines alerts by using alerting policies. An alerting policy has (see the sketch after this list):
    • A name
    • One or more alert conditions
    • Notifications
    • Documentation
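    • A hedged sketch of those components using the Cloud Monitoring API's Python client (google-cloud-monitoring). The project ID, notification channel ID, metric filter, and threshold are placeholders to illustrate the shape of a policy, not a recommended configuration:

```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

# Placeholder identifiers; substitute your own project and channel.
project_name = "projects/my-project-id"
channel_name = "projects/my-project-id/notificationChannels/1234567890"

policy = monitoring_v3.AlertPolicy(
    display_name="High CPU utilization",            # the policy's name
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[                                    # one or more alert conditions
        monitoring_v3.AlertPolicy.Condition(
            display_name="CPU above 80% for 5 minutes",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter='resource.type = "gce_instance" AND '
                       'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=0.8,
                duration=duration_pb2.Duration(seconds=300),
                aggregations=[
                    monitoring_v3.Aggregation(
                        alignment_period=duration_pb2.Duration(seconds=60),
                        per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_MEAN,
                    )
                ],
            ),
        )
    ],
    notification_channels=[channel_name],           # notifications
    documentation=monitoring_v3.AlertPolicy.Documentation(   # documentation
        content="CPU is high; check recent deployments and current load.",
        mime_type="text/markdown",
    ),
)

client = monitoring_v3.AlertPolicyServiceClient()
created = client.create_alert_policy(name=project_name, alert_policy=policy)
print(created.name)
```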
  • Metric based alerting:
    • Policies used to track metric data collected by Cloud Monitoring are called metric-based alerting policies. You can add a metric-based alerting policy to your Google Cloud project by using the Google Cloud console. A classic example of a metric-based alerting policy is one that notifies you when an application running on a VM has high latency for a significant time period.
  • Log based alerting:
    • Log-based alerting policies notify you anytime a specific message occurs in a log. You can add a log-based alerting policy to your Google Cloud project by using the Logs Explorer in Cloud Logging or by using the Cloud Monitoring API. An example of a log-based alerting policy is one that notifies you when a human user accesses the security key of a service account.
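    • A hedged sketch of such a policy through the Cloud Monitoring API's Python client, assuming its log-match condition type; the audit-log filter and rate limit below are illustrative and should be adapted to your own logs:

```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

project_name = "projects/my-project-id"   # placeholder

policy = monitoring_v3.AlertPolicy(
    display_name="Service account key accessed by a user",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="Matching audit-log entry",
            condition_matched_log=monitoring_v3.AlertPolicy.Condition.LogMatch(
                # Illustrative filter; adjust to the specific audit log you care about.
                filter='protoPayload.methodName="google.iam.admin.v1.GetServiceAccountKey"',
            ),
        )
    ],
    # Log-based alerts need a notification rate limit so a burst of matching
    # log entries doesn't turn into a flood of notifications.
    alert_strategy=monitoring_v3.AlertPolicy.AlertStrategy(
        notification_rate_limit=monitoring_v3.AlertPolicy.AlertStrategy.NotificationRateLimit(
            period=duration_pb2.Duration(seconds=300),
        ),
    ),
)

client = monitoring_v3.AlertPolicyServiceClient()
created = client.create_alert_policy(name=project_name, alert_policy=policy)
print(created.name)
```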
  • The alert condition is where you spend most of your time and make the most decisions when building an alerting policy. It's where you decide what's being monitored and under what conditions an alert should be generated.
  • Alerting Configuration:
    • Notice how the web interface combines the heart of the Metrics Explorer with a configuration condition. You start with a target resource and metric you want the alert to monitor. You can filter, group by, and aggregate to the exact measure you require. Then the yes-no decision logic for triggering the alert notification is configured. It includes the trigger condition, threshold, and duration.
  • There are three types of conditions for metric-based alerts:
    • Metric-threshold conditions trigger when the values of a metric are more than or less than a threshold for a specific duration window.
    • Metric-absence conditions trigger when there is an absence of measurements for a duration window.
    • Forecast conditions predict the future behavior of the measurements by using previous data. These conditions trigger when there is a prediction that a time series will violate the threshold within a forecast window.
    • An alert might have zero to many notification options selected, and they each can be of a different type. There are direct-to-human notification channels (Email, SMS, Slack, Mobile Push) and, for third-party integration, Webhooks and Pub/Sub.
    • Manage your notifications and incidents by adding user-defined labels to an alerting policy. Because user-defined labels are included in notifications, if you add labels that indicate the severity of an incident, then the notification contains information that can help you prioritize your alerts for investigation.
    • If you send notifications to a third-party service like PagerDuty, Webhooks, or Pub/Sub, then you can parse the JSON payload and route the notification according to its severity so that your team doesn't miss critical information.
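    • A hedged sketch of such routing logic. It assumes the notification payload carries an "incident" object with policy_name and severity fields, which matches the typical Cloud Monitoring JSON, but verify the field names against a real payload from your own channel:

```python
import json

def route_notification(payload_json: str) -> str:
    """Decide where to send a notification based on its severity."""
    payload = json.loads(payload_json)
    incident = payload.get("incident", {})              # assumed payload shape
    severity = str(incident.get("severity", "")).upper()
    policy = incident.get("policy_name", "unknown policy")

    if severity == "CRITICAL":
        return f"page on-call via PagerDuty for {policy}"
    if severity in ("ERROR", "WARNING"):
        return f"post to the team Slack channel for {policy}"
    return f"log it and open a low-priority ticket for {policy}"

# Example payload trimmed to the fields used above.
example = json.dumps({"incident": {"policy_name": "High CPU utilization",
                                   "state": "open",
                                   "severity": "CRITICAL"}})
print(route_notification(example))
```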
  • Supported notification channels include:
    • Email
    • SMS
    • Slack
    • Google Cloud app
    • PagerDuty
    • Webhooks
    • Pub/Sub
  • A notification channel decides how the alert is sent to the recipient. Alerts can be routed to any third-party service.
    • Email alerts are easy and informative, but they can become notification spam if you aren't careful.
    • SMS is a great option for fast notifications, but choose the recipient carefully.
    • Slack is very popular in support circles.
    • The Google Cloud app for mobile devices is a valid option.
    • PagerDuty is a third-party on-call management and incident response service.
    • Webhooks and Pub/Sub are excellent options when you want to send alerts to external systems or code.
  • When one or more alert policies are created, the alerting web interface provides a summary of incidents and alerting events. An event occurs when the conditions for an alerting policy are met. When an event occurs, Cloud Monitoring opens an incident.
    • In the Alerting window, the Summary pane lists the number of incidents, and the Incidents pane displays the ten most recent incidents.
  • Each incident is in one of three states:
    • Incidents firing: If an incident is open, the alerting policy's set of conditions is being met, or there is no data to indicate that the condition is no longer met. Open incidents usually indicate a new or unhandled alert.
    • Acknowledged incidents: A technician spots a new open alert. Before starting to investigate, they mark it as acknowledged as a signal to others that someone is dealing with the issue.
    • Closed incidents: The alert condition is no longer being met, and the issue is considered resolved.
  • The Alert policies pane displays the number of alerting policies created.
  • The Snooze pane displays recently configured snoozes. When you want to temporarily prevent alerts from being created and notifications from being sent, or to prevent repeated notifications from being sent for an open incident, you create a snooze. For example, you might create a snooze when you have an escalating outage and you want to reduce the number of new notifications.
    • Groups provide a mechanism for alerting on the behavior of a set of resources instead of individual resources. For example, you can create an alerting policy that is triggered if some resources in the group violate a condition (for example, CPU load), instead of having each resource inform you of violations individually.
  • Groups can contain subgroups and can be up to six levels deep. One application for groups and subgroups is the management of physical or logical topologies. For example, with groups, you can separate your monitoring of production resources from your monitoring of test or development resources. You can also create subgroups to monitor your production resources by zone.
  • You define the one-to-many membership criteria for your groups. A resource belongs to a group if the resource meets the membership criteria of the group. Membership criteria can be based on resource name or type, Cloud Projects, network tag, resource label, security group, region, or App Engine app or service. Resources can belong to multiple groups.
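    • A hedged sketch of creating such a group with the Cloud Monitoring API's Python client; the display name and membership filter are illustrative, and the filter syntax should be checked against the group-filter documentation for your resource types:

```python
from google.cloud import monitoring_v3

project_name = "projects/my-project-id"   # placeholder

# Membership is defined by a filter over resource attributes,
# not by listing resources one by one.
group = monitoring_v3.Group(
    display_name="Production frontends",
    filter='resource.metadata.name = starts_with("prod-frontend")',  # illustrative criteria
)

client = monitoring_v3.GroupServiceClient()
created = client.create_group(name=project_name, group=group)
print(created.name)
```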
    • Logs-based metrics are Cloud Monitoring metrics that are based on the content of log entries. For example, the metrics can record the number of log entries that contain particular messages, or they can extract latency information reported in log entries. You can use logs-based metrics in Cloud Monitoring charts and alerting policies.
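    • A hedged sketch of defining a counter-style logs-based metric with the google-cloud-logging Python client; the metric name and log filter are illustrative:

```python
from google.cloud import logging

client = logging.Client()   # uses your default project and credentials

# Counts log entries matching the filter; the resulting metric can then be
# charted or used in an alerting policy in Cloud Monitoring.
metric = client.metric(
    "payment_errors",   # illustrative metric name
    filter_='severity>=ERROR AND resource.type="cloud_run_revision"',
    description="Count of ERROR-level log entries from Cloud Run revisions.",
)

if not metric.exists():
    metric.create()
print(metric.name)
```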
  • To create alerting policies by using Terraform, start by creating a description of the conditions under which some aspect of your system is considered to be "unhealthy" and the ways to notify people or services about this state.