Battling Alert Fatigue

Alert fatigue is a silent productivity killer. Eventually, even the most relevant alerts go unchecked, hurting customer experience. Here are some tips to reduce alert fatigue.

Guide to Service Level Indicators and Setting Service Level Objectives

A guide to setting practical Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for your Site Reliability Engineering practice.

Kubernetes Monitoring with Prometheus and Grafana

A guide to help you implement Prometheus and Grafana in your Kubernetes cluster.

Challenges of Distributed Tracing

What are the challenges, benefits and use cases of distributed tracing?

Static Threshold vs. Dynamic Threshold Alerting

What's the difference between static threshold and dynamic threshold alerting? Do you really know when and how to use each threshold type?

Observability - That Last 9

TL;DR: A stitch in time saves 9. A discussion of the key building blocks of observability.

How we won Dukaan over

5 meetings. 1 month. From introductions to a demo, and ultimately winning Dukaan over. Subhash and his team’s velocity in decision-making, moving fast, and radical candor are a breath of fresh air in the Indian startup ecosystem.

How to calculate HTTP content-length metrics on cli

A simple guide to crunch numbers for understanding overall HTTP content length metrics.

Last9 completes SOC II Type 2 Certification

The comprehensive audit validates Last9 as a trusted SRE partner, a crucial step for working with highly regulated industries.

Reliability Tools

A guide through the most popular DevOps and SRE tools for building your reliability stack.

Latency is the new downtime

In the early days of Google, a lot of users were asking for 30 results on the first page of search results. So after long deliberation, Marissa Mayer, then the Product Manager for google.com, decided to run an A/B test of 10 vs. 30 results. When the results came in, they were in for a surprise.

We’ve raised a $11M Series A led by Sequoia Capital India!

Change is the only constant in a cloud environment. The number of microservices is constantly growing and each of these is being deployed several times a day or week, all hosted on ephemeral servers. A typical customer request depends on at least 3 internal and 1 external service. It’s

Why Service Level Objectives?

Understanding how to measure the health of your service, the benefits of using SLOs, how to set compliance targets, and much more...

How to Improve On-Call Experience!

Better practices and tools for managing on-call.

Best Practices for Postmortems: A guide

The ins and outs of conducting an effective postmortem. Ready templates and examples from leading organizations around the world!

Choosing Effective SLIs

Practical advice to choose an effective SLI.

Running a Database on EC2 is Slowing It Down

Learn everything about the advantages of EC2, its use cases, and how to optimize EC2 further.

Deployment Readiness Checklists

A ready-made, comprehensive checklist of the steps and activities involved in deploying your application.

The most interesting talks from SREcon21!

Learn about some of the most interesting talks from SREcon21.

Doing SRE the Right Way!

A well-thought-out approach to SRE, which will help site reliability engineers and software engineers develop and maintain a useful, consistent, and effective SRE strategy for their products!

Getting the big picture with Log Analysis

How to get the most out of your logs!

Microservices - Tracking Dependencies

A quick primer on microservices architecture and the importance of tracking dependencies.

Latency SLO

How do you set latency-based alerts? The most common measurement is a percentile-based expression like: 95% of the requests must complete within 350ms. But is it that simple?

Deeper Dive into SLO: Effects on Development, Culture and Performance

Thanks to Service Level Objectives (SLOs), your teams have a numerical threshold for system availability, so everyone has a clear vision of what keeps the users and the business happy.

Monorepos - The Good, Bad, and Ugly

A monorepo is a single version control repository that holds all the code, configuration files, and components required for your project (including services like search) and it’s how most projects start. However, as a project grows, there is debate as to whether the project's code should be split into

Components in Designing Effective SLOs

Service Level Objectives or SLOs serve as an objective measure of your system's performance. And when designed well, SLOs can help you direct engineering efforts effectively. It does not matter whether you're working in a startup or a tech giant; there is always a natural tension between the speed of

Strace – A Hidden Superpower

As with any operating system, it’s not uncommon to encounter issues while running Linux and associated applications. This is especially true while using closed-source programs since granular code inspection isn’t possible.

A Primer on Saturation SLO: What Is It and Do You Need to Consider It?

What is Saturation and why should you think about it as an SLO? Saturation can be understood as the load on your network and server resources.

Sleep Friendly Alerting

We've all been woken up with that dreaded Slack notification at ungodly hours only to realise that the alert was all smoke and no fire. The perfect recipe for dread and alert fatigue.

Systems Observability

Observability is not just about being able to ask questions of your systems. It's also about getting those answers in minutes, not hours.

AWS security groups: canned answers and exploratory questions

While using a Terraform lifecycle rule, what do you do when you get a canned response from a security group?

If it ain't broke...

A Terraform lifecycle rule in the right place can help prevent a deadlock. But the same lifecycle rule in the wrong place?

mv aws-security-group shoot-foot

How can you run into unplanned downtime while making a seemingly harmless change like renaming an AWS security group through Terraform?

Rescuing a SPAghetti React project

I gave a talk at react.geekle.us today about improving the reliability of our React app. Here are the slides from that talk. Here is the transcript of the talk. Hello all, my name is Prathamesh Sonpatki. I work at Last9 building a world-class operational intelligence platform for SREs. The Last9

One year at Last9

I completed one year at Last9 today. When I joined Last9 on April 20th, 2020, I was unsure how it would pan out. I only knew Nishant and Piyush, the founders of Last9, from the Pune tech community. But I had never worked with them before. Some of the

Much That We Have Gotten Wrong About SRE

An illustrated summary of Developers ➡ DevOps ➡ SRE

Infrastructure-As-Code-As-Software

We ran a poll on Twitter: “Do you care about the quality of your infrastructure code?” And on Reddit. That’s an approximate and staggering 60–30–10 split. What do you think the response would be if the poll was “Do you care about the quality of your product code?”

SLOs That Lie

What are SLOs and how do you define them? We often set SLOs that do not accurately reflect the actual requirements. Here's a look at SLOs That Lie! SLO is an acronym for Service Level Objective. But before I explain SLO, you need one more acronym: SLI (Service Level

Latency Percentiles are Incorrect P99 of the Times

What are P90, P95, and P99 latency? Why are they incorrect P99 of the times? Latency is for a unit of time and the preferred aggregate is percentile.

SRE Tooling – the Clever Hans fallacy

Chef or Ansible? Terraform or Pulumi? Python or Ruby? Last9 or Last9? What if we told you that the mindset of building new tools has an age-old link to the story of a horse that could do arithmetic?

Root Cause Analysis For Reliability: A Case Study

Let's explore the importance of RCAs in Site Reliability Engineering, why use RCAs, and our take on what constitutes a “good” RCA.


SRE with Last9 is incredibly easy. But don’t just take our word for it.

Last9 is an operational intelligence platform for SRE.