Battling Alert Fatigue
Alert fatigue is a silent productivity killer. Eventually, even the most relevant alerts go unchecked, hurting the customer experience. Here are some tips to reduce alert fatigue.

'Alert fatigue' occurs when your team gets exhausted by the sheer volume of alerts they’re receiving. Teams either become indifferent to alerts or burn out trying to resolve them all. Either way, it’s a warning sign that your monitoring and alerting strategy is failing.

Still, your team needs to know when your users are affected by technical issues in order to quickly correct these issues. Problems that aren’t addressed can lead to poor user experience (UX) and permanently damage your company’s reputation.

Proper alert management means finding a balance between offering the best possible service to your users and keeping your developers focused on growing your business with the least amount of interruption.

In this article, you’ll learn some tips to reduce alert fatigue in order to protect your team from the mental strain of frequent context switching and the stress of being in constant firefighting mode.

How to Reduce Alert Fatigue

The following are tips and best practices to keep your team from dealing with alert fatigue.

Use an On-Call Rotation
Instead of bombarding the entire team, balance the load between members of that team. Team members should take turns being on call, usually each week, depending on what schedule works best. When the person on call receives an alert, they’re responsible for all actions needed to resolve it, including opening an incident and involving other team members.

The advantage of this approach is that it takes the pressure off the rest of the team, ensuring that they can focus on deeper tasks.

This change in the team’s schedule often requires some logistical adjustments. For example, it can be disruptive for the whole team to receive notifications via a Slack channel or group email, so using text messages or push notifications might be a better solution.
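A weekly rotation can be computed deterministically from a shared start date, so every tool and team member agrees on who is paged. A minimal sketch (the roster names and epoch date are placeholders):

```python
from datetime import date

# Hypothetical team roster; the rotation advances one person per week.
TEAM = ["alice", "bob", "carol", "dan"]

def on_call_for(day: date, roster=TEAM, epoch=date(2024, 1, 1)) -> str:
    """Return who is on call for the week containing `day`.

    Weeks are counted from an agreed epoch (a Monday), so the result
    is stable no matter when or where the code runs.
    """
    weeks_elapsed = (day - epoch).days // 7
    return roster[weeks_elapsed % len(roster)]
```

Because the schedule is a pure function of the date, it is easy to surface in a Slack bot or dashboard without a shared database.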

Create Escalation Policies
What if the person on call isn’t available, or an alert is accidentally missed? It’s inevitable that some alerts won’t get addressed, even if you apply all other best practices. As a result, you need to account for those what-if scenarios and have a backup solution.

Escalation tells the alerting system what to do if the alert isn’t resolved after a set amount of time. You can prevent different levels of alert fatigue by combining an on-call rotation with escalation policies.

You need to be careful when configuring your escalation policies. You might end up sending alerts to somebody higher up the escalation chain before they’re supposed to receive them. When escalations fire too quickly, they stress out the next person on the chain. Escalation is meant to deal with stalled alerts; it should not aggravate the alert fatigue situation.

Also, it could be tempting to add your engineering manager or VPs on the last level in an escalation policy. They most likely don't want to be bothered until an alert is absolutely urgent. This is another reason to plan your escalation policy properly so that they’re not pulled into the incident thread sooner than needed.

Prioritize Alerts
Not all alerts are equal. It’s important to create different levels of alerts so that you can prioritize the most important ones and postpone the others. For instance, a notification that 80 percent of the database disk is being used is less crucial than if 90 percent of the disk were being used. The second alert requires immediate attention, while the first one is a friendly reminder that you should take action later in the day.

Differentiating between emergency and non-emergency alerts is essential. You can set up additional priority levels, but often this creates a classification problem. To keep things clearer, try filtering alerts by those two categories. You could also consider delaying less important alerts if an important one occurs so that the person on call can focus on one alert at a time.
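The disk-usage example above maps naturally onto a two-tier classifier. A small sketch (the thresholds are illustrative; tune them to your own capacity planning):

```python
# Two-tier prioritization from the disk example: emergency alerts page
# the on-call person now, non-emergency alerts queue for later.
def classify_disk_alert(percent_used):
    if percent_used >= 90:
        return "emergency"      # immediate attention required
    if percent_used >= 80:
        return "non-emergency"  # friendly reminder for later in the day
    return None                 # no alert at all
```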

Use a Well-Calculated Threshold
In essence, alert fatigue comes from experiencing a large number of alerts on a frequent basis. If you find yourself getting deluged with alerts, a simple way to reduce this is by better configuring the threshold.

Not every error needs immediate intervention. Short bursts of errors often resolve themselves, and errors that are always present feel like false positives, discouraging the person on call from investigating further. To reduce alert fatigue, you need to take these factors into account in your threshold calculation.

For instance, you can use time buckets. A time bucket is an interval of time (say, one minute) over which you perform a mathematical operation such as a sum or average. Instead of a flat threshold that triggers on occasional spikes, build alert thresholds based on time buckets that focus on errors occurring over a larger period of time.
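The idea can be sketched in a few lines: group error timestamps into fixed buckets and alert only when the rate stays high across several consecutive buckets, so a single spike pages nobody. All numbers here are illustrative:

```python
def bucket_counts(timestamps, bucket_seconds=60):
    """Count errors per time bucket (timestamps are epoch seconds)."""
    counts = {}
    for ts in timestamps:
        bucket = int(ts // bucket_seconds)
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts

def should_alert(timestamps, bucket_seconds=60, threshold=10, sustained=3):
    """Alert only if `sustained` consecutive buckets exceed `threshold`."""
    counts = bucket_counts(timestamps, bucket_seconds)
    if not counts:
        return False
    streak = 0
    for bucket in range(min(counts), max(counts) + 1):
        if counts.get(bucket, 0) > threshold:
            streak += 1
            if streak >= sustained:
                return True
        else:
            streak = 0  # a quiet bucket resets the streak
    return False
```

A burst of twenty errors in a single minute stays silent, while a sustained elevated rate across three minutes triggers the alert.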

Finally, continuously review errors and exclude known errors from your calculation, whether or not you plan to fix them.

Use Statistical Analysis
You can’t have truly smart alerts with simple thresholds. If you want to make the best use of your data and reduce flaky alerts, you need to analyze historical data. This approach seems more complex, but it’s worth your while. Platforms are increasingly offering smart thresholds based on this kind of analysis.

The first step is to identify the typical behavior of your system and come up with a model that defines how it responds. You need to figure out the frequency of your errors at a given time and use this data in your calculations for thresholds.

For instance, you could consider the occurrence rate of HTTP 500 errors in percentages at a given time of the day and compare it to the mean value in the previous days, plus or minus three times the standard deviation. This approach is part of Statistical Process Monitoring (SPM), or checking the stability of the process, and it’s a powerful and underutilized tool to eliminate alert fatigue.
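That mean-plus-or-minus-three-sigma rule translates directly into code. A sketch using Python's standard library (the historical samples would come from your metrics store):

```python
import statistics

def spm_limits(historical_rates):
    """Control limits from historical samples: mean +/- 3 standard deviations."""
    mean = statistics.mean(historical_rates)
    stdev = statistics.stdev(historical_rates)
    return mean - 3 * stdev, mean + 3 * stdev

def is_anomalous(current_rate, historical_rates):
    """True if the current rate falls outside the control limits."""
    lower, upper = spm_limits(historical_rates)
    return not (lower <= current_rate <= upper)
```

With HTTP 500 rates of roughly 1 percent at this hour on previous days, a reading of 1.1 percent stays quiet while a jump to 5 percent fires.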

Center Alerts around User Impact
The whole point of alerts is to support your users. You want to make sure they’re not dealing with bugs and infrastructure issues. Don’t let yourself be bothered by alerts unless a certain proportion of users are affected by some metric degradation.

This is the philosophy introduced by Google in its site reliability engineering (SRE) handbook. To follow its approach to monitoring and alerting, you need to define service level objectives (SLOs). An example SLO might be that 95 percent of user login requests must be processed in under 200 ms.

An SLO combines a service level indicator (here, login latency) with a target percentage; the shortfall the target allows is your error budget. Unless the budget is consumed, the teams responsible for the login shouldn’t get an alert, because most users are not impacted. However, monitoring the degradation of an SLO (short of a violation) and alerting product teams separately from the typical on-call rotation can help you proactively avoid a deterioration in the quality of service.

Implement Custom Metrics
The data you can get by default from your application is often limited (CPU, RAM, error logs, etc.) and tells you very little about the inner workings of your application. This leads to two problems: you don’t have much information about the root cause of a problem, and the alert threshold is hard to define because it doesn’t represent tangible quantities.

Implementing custom metrics in your application will give you more granular data, which can help pinpoint the root cause of a problem more quickly than relying on generic metrics such as response time or requests per second. A custom metric could be the latency of individual steps of a process (such as image processing) or the conversion rate of a given page. If no one is using a feature such as a call-to-action button, that could mean you have a problem.

When an issue arises, you know what subsystem is impacted and can get more detailed context about the urgency of the problem.
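One lightweight way to collect such step-level metrics is a timing decorator. This sketch records per-step latencies in memory; in production you would forward them to your metrics backend instead:

```python
import time
from collections import defaultdict

# Per-step latency samples, keyed by step name (in-memory stand-in for
# a real metrics backend).
step_latencies = defaultdict(list)

def timed_step(name):
    """Decorator that records how long each call to a step takes."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                step_latencies[name].append(time.perf_counter() - start)
        return wrapper
    return decorator

@timed_step("image_processing")
def process_image(data):
    return data.upper()  # stand-in for real work
```

Alerting on `image_processing` latency then points directly at the affected subsystem, rather than at an overall response-time graph.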

Keep Alerts Actionable
There is nothing more stressful than seeing an alert pop up and not knowing what to do. Vague, non-actionable alerts inevitably lead to alert fatigue: the person on call can’t resolve them, gives up, and starts ignoring alerts altogether.

When adding alerts, create an action plan to address them so that everyone on your team knows what to do when that alert pops up. The easiest plan of action is to add links in the alert pointing to all the relevant resources (dashboard, GitHub repository, etc.) for quick action. You can also make it easier to resolve alerts if you create runbooks, which are step-by-step instructions including scripts to troubleshoot various issues.

Reduce Duplicate Alerts
Eliminating alert fatigue means reducing the frequency of alerts, especially identical ones. Multiple alerts raised by the same rule (for example, on the same metric) should be combined into one alert. Once an alert is resolved, consider setting a delay before it can retrigger, perhaps putting it in sleep mode for ten to fifteen minutes. That reduces your team’s frustration over getting the same alert too frequently; an issue shouldn’t demand immediate attention again if they just dealt with it.
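The sleep-mode idea is a small piece of state per rule: remember when the alert was last resolved, and suppress refires inside the cooldown window. A sketch (the ten-minute cooldown is illustrative):

```python
import time

# Deduplication sketch: suppress an alert if the same rule was
# resolved within the cooldown window.
COOLDOWN_SECONDS = 10 * 60
_last_resolved = {}  # rule name -> timestamp of last resolution

def mark_resolved(rule, now=None):
    _last_resolved[rule] = time.time() if now is None else now

def should_notify(rule, now=None):
    """False while the rule is inside its post-resolution cooldown."""
    now = time.time() if now is None else now
    last = _last_resolved.get(rule)
    return last is None or now - last >= COOLDOWN_SECONDS
```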

When a service is having issues, many dependent systems can be affected and raise their own alerts in quick succession, leading to a chaotic situation where too many people are paged simultaneously. Avoiding duplicate alerts means understanding these dependencies and grouping alerts at the service level, not at every component involved.

Share duplicate-alert statistics with your team (number of alerts per day or week, for instance) so they can review the alerts and adjust thresholds where necessary. Reducing alert duplication is a preventative measure, and it is often addressed by the tips above, such as using SLOs, SPM, or custom metrics.

Create Alert Lifecycles
Alerts need to have a story and a purpose—they shouldn’t just be annoying noise that pops up throughout the day. Alerts also shouldn’t accumulate indefinitely in your system.

You can define a workflow and link between your bug tickets and alerts. When an alert is linked to a bug, mute that alert for a certain number of days until the bug is supposed to be resolved.
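That mute-until-fixed workflow is easy to sketch: when an alert is linked to a bug ticket, silence it until the ticket's expected fix date. Field names here are hypothetical:

```python
from datetime import date, timedelta

# Lifecycle sketch: mute an alert linked to a bug until the bug's
# estimated fix date has passed.
def mute_until(alert, bug_eta_days, today=None):
    today = today or date.today()
    alert["muted_until"] = today + timedelta(days=bug_eta_days)
    return alert

def is_muted(alert, today=None):
    today = today or date.today()
    until = alert.get("muted_until")
    return until is not None and today < until
```

If the bug slips past its estimate, the alert wakes up on its own, which is the point: mutes expire, they are never permanent.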

In other words, you should define how long an alert lives and when you can mute it, and you should create cleanup tasks to delete older alerts that are no longer relevant. Plan monthly reviews with your team, or one between each on-call rotation, to go over the errors that occurred most often and caused the most fatigue.

Create Runbooks and Postmortems
If your team doesn’t know what to do when a problem occurs, this will increase their stress level and lead to inefficient responses. To reduce the mental strain on your team, create a runbook procedure that tells them what they should do in the event of an incident.

Whenever creating a new alert, you should also document what is expected from the person on-call. Provide all relevant information, such as system diagrams (what components are involved), links to dashboards and logs, steps to resolve the problem, and who to call if the resolution procedure doesn’t work.

Ultimately, you can’t manage alerts effectively unless you draw on knowledge gained from previous incidents. This means you should also document all incidents in a postmortem, so your team knows what work has been done in the past and you can identify what needs to be updated in your runbooks.

Postmortems and runbooks work together, helping you and your team feel confident that you are constantly improving your system and its reliability.

Conclusion

DevOps engineers have options for reducing alert fatigue and helping their teams feel less burdened throughout the day. At its core, battling alert fatigue comes down to two levers: reducing the quantity of alerts (how many) and reducing their frequency (how often).

To reduce the quantity, you should apply tips that improve the quality of your alerts, such as reducing duplicate alerts, using a well-calculated threshold, applying SPM, creating user-centered alerts with SLOs, and implementing custom metrics.

To reduce the frequency, focus on prioritizing alerts, creating an on-call rotation and escalation policy, setting up an alert lifecycle, and using frequent reviews to improve the process.

Combining these different practices will make your alert systems more efficient and give you a happier, more productive team.

Last9 built Compass to help DevOps and SRE teams build better alerting systems that enable system reliability and a better on-call experience. Contact us for a demo to learn more.


Want to know more about Last9 and our products? Check out last9.io; we're building reliability tools to make running systems at scale fun and embarrassingly easy. 🟢
