How to Improve On-Call Experience!
Better practices and tools for managing on-call work

Software engineers often go on-call to handle business-critical issues that arise in their software products. Teams work in shifts to maintain the high availability of an application, tackling issues as they arise and fixing them as quickly as possible. Because most software products are available globally, on-call teams need to keep watch around the clock, which typically calls for time-based shifts split among two or three teams.

A strong, reliable on-call system is one of the most essential prerequisites for maintaining a happy global customer base. This article will help you understand how to run an on-call system well and get the most out of it.

Why Is Being On-Call Important?

Being on-call ensures that issues with your product are handled in real time, which helps maintain the quality of customer support and the overall user experience.

Here are a few more reasons to bolster your on-call strategy.

Speeds Resolution Process

Since on-call strategies are focused on quickly solving issues, they help to reduce the mean time to resolution (MTTR) for your product. This produces happier customers since they can get back to using your software more quickly. Without timely resolution of critical issues, your organization can lose revenue and customers.
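To make MTTR concrete, here is a minimal Python sketch of the calculation. The incident timestamps are made up for illustration, and the sketch assumes the clock runs from detection to resolution; your team may start it at a different point.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, resolved_at).
incidents = [
    (datetime(2023, 1, 3, 9, 0), datetime(2023, 1, 3, 9, 45)),
    (datetime(2023, 1, 10, 22, 30), datetime(2023, 1, 11, 0, 0)),
    (datetime(2023, 1, 17, 14, 0), datetime(2023, 1, 17, 14, 15)),
]


def mean_time_to_resolution(records):
    """Average of (resolved - detected) across all incidents."""
    if not records:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in records),
        timedelta(),  # start value so the sum stays a timedelta
    )
    return total / len(records)


mttr = mean_time_to_resolution(incidents)
print(mttr)  # 0:50:00
```

Tracking this number per month or per service shows whether changes to your on-call process are actually shortening resolution times.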

Enhances Employee Productivity

A well-formulated on-call strategy isolates issue resolution from the standard development workflow. This helps your team members better organize their work. When there are specific team members designated to handle customer issues, other employees can focus on what they need to get done.

Improves Teamwork

While it is good practice to have code owners available when resolving business-critical issues, this is not always possible. As a result, on-call team members often have to work with code that they haven’t written themselves. This pushes people outside of their comfort zones to learn from and work with each other.

How to Improve Your On-Call Practices

There are a few principles and best practices you should keep in mind in order to get the most out of your on-call strategy.

Understand Your Purpose

The first step to formulating an effective on-call strategy is understanding its purpose. Focus on the specific goals of on-call support for your product or organization. You can start by looking for answers to questions like:

  • What’s the estimated frequency of issues that you will be tackling on-call?
  • Do you need to have the service owners (developers who know the product) available all the time to handle incoming issues?
  • What time zones are your team members based in?
  • Can the development teams take turns in handling on-call issues, or do you need dedicated support teams to keep your development process intact?

Understanding why you need an on-call system is as crucial as running it. Once you understand your situation and requirements, you can formulate an outline for your on-call strategy.

Service-level agreements (SLAs) are a great benchmark to base your on-call strategies around. An SLA describes the quality of service that your product is expected to provide to end users. You need to factor it into your on-call strategy to ensure that you keep the promises that you make to your customers.
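One way to turn an SLA into something the on-call team can plan around is to convert the availability target into a downtime budget. The sketch below assumes a simple uptime-percentage SLA measured over a 30-day window; the function name and the targets are illustrative and not taken from any specific agreement.

```python
def allowed_downtime_minutes(availability_pct, period_days=30):
    """Minutes of downtime permitted by an uptime target over the period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


# A 99.9% target leaves roughly 43 minutes of downtime per 30 days,
# while 99.99% leaves only a little over 4 minutes.
print(round(allowed_downtime_minutes(99.9), 1))   # 43.2
print(round(allowed_downtime_minutes(99.99), 2))  # 4.32
```

The smaller the budget, the less time a responder has to acknowledge and fix an incident, which in turn shapes how many people you need on-call and how quickly issues must escalate.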

Communicate with Your Teams

Communication is critical while tackling live issues. You may have an entire team available to resolve an issue, but they won’t be able to do so if they don’t know where to begin or how to collaborate. Clarify any concerns they might have before their shifts start. If possible, maintain a handbook with answers to common questions that the on-call team might have.

Also, define a communication strategy with clear directions on the mediums and language to be used during shifts. This can include instant messaging, pagers, war rooms, and anything else that fits your purpose as long as it enhances your team’s productivity.

Even after all of this preparation, your team might have unique needs for tools, resources, or training based on the issues that they are handling. Stay open to hearing their concerns so that solutions are easier to find.

Keep On-Call Work Separate

While on-call support and normal development can be provided by the same people, these are different types of work that require different mindsets. On-call work is more irregular and intense compared to regular development, which is steady and consistent. Define clear boundaries between the two when assigning responsibilities to your team, and ensure that team members are working on one role at a time. Otherwise, they might suffer high stress levels, reduced productivity, and fatigue.

Respect Working Hours

You should maintain strict boundaries between shifts. Setting specific times for shifts helps you better organize your schedule and reduces stress on your team members. People who work beyond the recommended number of hours in a shift tend to lose productivity, which can affect how well they handle incidents.

In cases where a twenty-four-hour rotation schedule is not possible, consider an out-of-office on-call setup. However, you still need to define boundaries for out-of-office employees. Alert fatigue is real, and it has cost lives in the past. Maintaining a balance between workload and performance is important.

Further reading: “Beware of the Robot Pharmacist.” In tech-driven medicine, alerts are so common that doctors and pharmacists learn to ignore them, at the patient’s risk.

Maintain Flexibility

Being on call isn’t a predefined routine job, and your requirements will change based on the state of your product. You’ll need to be ready to move pieces around to better support your teams. This might mean swapping team members’ schedules, delegating tasks, or bringing in more people as needed.
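As a rough illustration of schedule flexibility, the sketch below models a weekly round-robin rotation with per-day overrides for swaps. The names, dates, and the `on_call_for` helper are hypothetical; tools like PagerDuty offer their own scheduling and override features, and this is not how any of them is implemented.

```python
from datetime import date


def on_call_for(day, rotation, start, overrides=None):
    """Return who is on call on `day`.

    `rotation` is an ordered list of people who each take one week,
    beginning on `start`. `overrides` maps a date to a substitute.
    """
    if day < start:
        raise ValueError("day is before the rotation start")
    overrides = overrides or {}
    if day in overrides:
        return overrides[day]  # a swap takes priority over the rotation
    weeks_elapsed = (day - start).days // 7
    return rotation[weeks_elapsed % len(rotation)]


rotation = ["asha", "ben", "chen"]  # hypothetical team members
start = date(2024, 1, 1)
# Ben is unavailable on January 10, so Asha covers that day.
overrides = {date(2024, 1, 10): "asha"}

print(on_call_for(date(2024, 1, 9), rotation, start, overrides))   # ben
print(on_call_for(date(2024, 1, 10), rotation, start, overrides))  # asha
print(on_call_for(date(2024, 1, 15), rotation, start, overrides))  # chen
```

Keeping swaps as explicit overrides, rather than editing the base rotation, means the regular schedule stays predictable while one-off changes remain easy to audit.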

Be Prepared

Your team needs to be well equipped to handle all aspects of software issues, meaning they need the right tools for raising tickets and resolving them. Some commonly used tools include PagerDuty, Splunk On-Call, and LinkedIn Oncall. These tools help you organize your on-call strategy by managing rotation schedules and alerting the right people at the right time.

You can also automate a significant part of your on-call response and incident resolution. Tools such as PagerDuty and Splunk On-Call can help you route or escalate issues, generate alerts, and schedule overrides. Research the tools available on the market to enhance your on-call strategy.

Maintain a history of incident reports and postmortems for on-call team members so they can see what problems were previously reported and how they were resolved.

Implement an Escalation Policy


An escalation policy helps to fast-track the process of escalating issues to the right people. It spells out the fundamental questions of your incident management process, such as who to notify first in case of an incident, who to check with next if the first responder isn’t available, or how to hand over an issue to another team member if the responder can’t resolve it.

You can integrate this policy into your on-call management tool and save time on your standard issue-resolution process. For an effective escalation strategy, you should create a hierarchy of support engineers stacked according to their expertise.
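The hierarchy described above can be sketched as an ordered list of tiers, each with a window for acknowledging the incident before it moves up. This is a simplified model under stated assumptions: the `Tier` class, the timeouts, and the names are invented for illustration, and real on-call tools define escalation rules in their own formats.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    responders: list       # ordered by who should be paged first
    ack_timeout_min: int   # minutes to wait before escalating further


def responder_to_page(policy, minutes_unacked, available):
    """Pick the (tier, responder) to page for an unacknowledged incident.

    Tiers with nobody available are skipped without using up any time.
    Once every window has passed, the highest reachable tier stays paged.
    Returns None when no one in the policy is available.
    """
    deadline = 0
    last_reachable = None
    for tier in policy:
        candidates = [r for r in tier.responders if r in available]
        if not candidates:
            continue
        last_reachable = (tier.name, candidates[0])
        deadline += tier.ack_timeout_min
        if minutes_unacked < deadline:
            return last_reachable
    return last_reachable


# Hypothetical policy, stacked by expertise as described above.
policy = [
    Tier("primary", ["asha"], 15),
    Tier("secondary", ["ben", "chen"], 15),
    Tier("engineering manager", ["dana"], 30),
]

everyone = {"asha", "ben", "chen", "dana"}
print(responder_to_page(policy, 0, everyone))          # ('primary', 'asha')
print(responder_to_page(policy, 20, everyone))         # ('secondary', 'ben')
print(responder_to_page(policy, 0, {"chen", "dana"}))  # ('secondary', 'chen')
```

Writing the policy down as data like this answers the three questions in advance: who is notified first, who is next if that person is unavailable, and where the incident goes if nobody acknowledges it in time.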

Case Studies

The following real-life examples show how streamlined incident management and on-call processes have affected software businesses.

Pantheon

Pantheon is a software-as-a-service (SaaS) platform for website creation and management. Over 300,000 websites rely on its services. At first it relied on standard support tickets to raise emergency issues, but team members couldn’t prioritize urgent issues or provide round-the-clock support.

To solve this problem, Pantheon began to use PagerDuty. By automating its issue-escalation process, Pantheon was able to offer high reliability and meet its preset SLAs.

Claranet

Claranet is an IT services management company that has been active in the industry since 1996. It grew rapidly via business acquisitions and almost tripled its employee count. This growth also led to multiple challenges, such as worker burnout and delays in responding to support calls.

To solve these problems, Claranet revamped its incident-management strategies. It used tooling to automate most of its incident-response grunt work, such as routing requests and generating support tickets, and gained increased availability, improvements in MTTR, and better on-call performance overall.

Conclusion

As industries rely more heavily on software products and services, a bug or downtime can cost a business thousands of customers and millions of dollars. This makes maintaining an effective on-call strategy more crucial than ever.

The key is to be proactive about creating and implementing your strategy. You can do this by communicating honestly about your organization’s needs and abilities and by following the above best practices. If your team can quickly and efficiently deal with problems as soon as they’re reported, that will help you improve your products and keep your customers satisfied.

(A big thank you to Kumar Harsh for his contribution to this article.)

