Latency SLO
How do you set latency-based alerts? The most common measurement is a percentile-based expression like: 95% of the requests must complete within 350ms. But is it that simple?

Latency is an indicator of how long a request takes to reach the service and return to the customer. It is measured in units of time and is a critical pillar in defining the quality of a service. As long as a request is served within a specific time range, all is well. However, it's almost impossible to keep latency within limits for every request that comes in, and NOT all requests are equal.

The most common unit of measurement is a percentile, and the health indicator is a bi-variate expression of the form:

95% of the requests must complete within 350ms.
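As a sketch of how such a check might be computed, assuming hypothetical latency samples and a simple nearest-rank percentile (production systems would typically read this from a metrics store instead):

```python
import random

random.seed(42)  # for reproducibility of this illustrative run

# Hypothetical sample of request latencies in milliseconds.
latencies_ms = [random.gauss(200, 60) for _ in range(1000)]

def percentile(samples, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

p95 = percentile(latencies_ms, 95)
slo_met = p95 <= 350  # "95% of the requests must complete within 350ms"
print(f"p95 = {p95:.1f}ms, SLO met: {slo_met}")
```

The nearest-rank method is one of several percentile definitions; monitoring backends often interpolate instead, so values can differ slightly between tools.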


SLO Breach

How do you report that something is broken when 95% of the requests are not completing within 350ms? Simple. But what exactly do you say?

  1. Only 90% of the requests are served within 350ms, not 95%.
  2. 95% of the requests now complete within 400ms, not 350ms.

While both reports are valuable, they convey different meanings.

Option 1: 90% of the requests are served within 350ms, not 95%.

It tells me that I was OK with 5% of the traffic/requests being degraded, but that number has now climbed to 10%.

Option 2: 95% of the requests are now served within 400ms, not 350ms.

This tells me that the 95th percentile has degraded further, to 400ms instead of 350ms. But it does not tell me how much of the user base is still served well. 94.99% of the users may still be fine; while the difference is only 0.01%, the report may paint a picture of panic.
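Both framings can be derived from the same raw latency samples. Here is a minimal sketch, assuming a toy sample of ten requests chosen to mirror the numbers above:

```python
# Hypothetical latency samples (ms); the same data yields both breach views.
latencies_ms = sorted([100, 120, 150, 180, 200, 220, 250, 280, 340, 400])

target_ms, target_pct = 350, 95

# Option 1: what fraction of requests actually completes within 350ms?
within_target = sum(1 for l in latencies_ms if l <= target_ms) / len(latencies_ms) * 100

# Option 2: what latency does the 95th percentile actually see? (nearest rank)
rank = max(1, round(target_pct / 100 * len(latencies_ms)))
p95_ms = latencies_ms[rank - 1]

print(f"Option 1: {within_target:.0f}% of requests within {target_ms}ms (wanted {target_pct}%)")
print(f"Option 2: p{target_pct} is {p95_ms}ms (wanted <= {target_ms}ms)")
```

With this sample, Option 1 reports "90% within 350ms" while Option 2 reports "p95 is 400ms": two views of the very same breach.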

If you disagree with Option 1 being the better fit, or have an Option 3 to suggest, please leave a comment.


SLO Threat

While this settles the argument over the SLO itself, the question of alerting remains. When should we be warned of a potential breach?

Request latencies usually follow a long-tailed distribution: most requests cluster around the mode, with a small fraction stretching far to the right.

Even though only a small fraction of requests experiences extreme latency, it should still trigger an alert, because tail latency tends to affect your most profitable users. These users tend to make the highest number of requests and thus have a higher chance of experiencing tail latency. In addition, several studies have shown that high latency hurts revenue: a mere 100-millisecond delay in load time can hurt conversion rates by 7 percent.

Then there is also a cost-value tradeoff.

As the SLOs get stricter, the cost of enforcement surpasses the value that they bring.

A delicate balance is to set an SLO that tracks 99% of the requests, with early warnings on the performance of the far tail: the 99.5th and 99.9th percentiles. But what should the tolerance be for those far-off requests? It should be computed from a baseline built on historical data. The easy way to catch degradation is to keep an eye on the shape of the tail, watching for either of these signals:

  • The tail gets thinner and stretches to the right, OR
  • The tail gets thicker: the worst (100th-percentile) requests improve, but requests between p99 and p99.9 concentrate, suggesting movement towards p99.9. Simply put, the gap between the mode and the max shrinks.
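One possible sketch of such a tail watch. The baseline values and the 20% tolerance below are hypothetical, chosen purely for illustration; in practice both would come from your historical data:

```python
import statistics

def tail_report(samples, baseline):
    """Compare far-tail percentiles against a baseline (nearest-rank percentiles)."""
    ordered = sorted(samples)

    def pct(p):
        return ordered[max(1, round(p / 100 * len(ordered))) - 1]

    report = {}
    for p in (99.0, 99.5, 99.9):
        current = pct(p)
        # Warn if a far-tail percentile degrades more than 20% past its baseline.
        report[f"p{p}"] = {"value": current, "warn": current > baseline[f"p{p}"] * 1.2}

    # "The gap between the mode and the max shrinks" signals a thickening tail.
    report["mode_max_gap"] = max(ordered) - statistics.mode(round(s) for s in ordered)
    return report

# Hypothetical run: a healthy body of requests with a degraded p99.9.
baseline = {"p99.0": 500, "p99.5": 600, "p99.9": 800}
samples = [200] * 990 + [550] * 5 + [700] * 3 + [1000] * 2
report = tail_report(samples, baseline)
print(report)
```

In this run only the p99.9 entry warns (1000ms against a 960ms tolerance), which is exactly the "early warning on the far end" the SLO at p99 would miss.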
