Wikipedia defines Root Cause Analysis (RCA) as “a method of problem-solving used for identifying the root causes of faults or problems.” Essentially, root cause analysis means diving deeper into an issue to find what caused a nonconformance. What’s important to understand is that Root Cause Analysis does not mean just looking at the superficial causes of a problem. Rather, it means finding the highest-level cause: the thing that started a chain of cause-and-effect reactions and ultimately led to the issue at hand.
Root cause analysis is widely used in IT operations, telecommunications, healthcare, and other industries. In this post, I’ll use an incident I experienced to show how RCA can make your system more reliable.
Before getting to the case study, though, here is a note on the importance of RCA in Site Reliability Engineering.
Why should you use Root Cause Analysis for Reliability?
A good Root Cause Analysis looks beyond the immediate technical cause of the problem and helps you find the systemic root cause of the issue, which you then work on eliminating. Along with documenting the core issue and the steps to resolve it, an RCA also leads to the very next question an SRE will ask: is this happening somewhere else? Automating the solution and deploying it to the affected areas ensures that not only is the problem’s root cause taken care of, but problems of a similar nature are permanently fixed too.
That being said, a not-so-thorough RCA makes the entire process futile. Here’s an incident I encountered a few years back and how my team went about doing a Root Cause Analysis for it.
How Things Went Down
Elasticsearch wakes up PagerDuty, PagerDuty wakes up on-call
It was one of those times when I thought everything was in order and working fine. And yet, a small slip led to an unexpected breakdown. It was around 7:30 AM, almost 25 hours before a country launch, when PagerDuty went off. The alerts kept telling us that something was breaking. So we went to Elasticsearch and found five or six requests that had failed with 5XX errors.
PagerDuty auto-resolves: good news or bad news?
Logs were coming in at around 1 Mbps. There was no correlation ID, so we couldn’t isolate where in those tons of logs the issue was actually happening. Exactly 5 minutes later, the HTTP 500 errors stopped and PagerDuty auto-resolved. We thought it was probably not a major issue and de-prioritised it.
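To make the missing piece concrete, here is a minimal sketch of what a correlation ID in the logs could look like. It uses only Python's standard `logging` and `contextvars` modules. The logger name, the log format, and the `handle_request` function are assumptions for illustration, not what our services actually ran.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID of the request currently being handled (hypothetical name).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def build_logger(stream):
    """Return a logger whose lines all start with the correlation ID."""
    logger = logging.getLogger("api")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(correlation_id)s %(levelname)s %(message)s")
    )
    handler.addFilter(CorrelationFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, path, incoming_id=None):
    """Simulate one failing request; every line it logs carries the same ID."""
    # Reuse the caller's ID if one was passed in, otherwise mint a new one.
    request_id = incoming_id or uuid.uuid4().hex
    token = correlation_id.set(request_id)
    try:
        logger.info("start %s", path)
        logger.error("upstream returned 500 for %s", path)
    finally:
        correlation_id.reset(token)
    return request_id


def lines_for(log_text, request_id):
    """Pick out the log lines that belong to a single request."""
    prefix = request_id + " "
    return [line for line in log_text.splitlines() if line.startswith(prefix)]


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer)
    rid = handle_request(log, "/orders")
    handle_request(log, "/users")
    print("\n".join(lines_for(buffer.getvalue(), rid)))
```

With an ID like this on every line, isolating one failing request becomes a single filter on the ID instead of a search through everything that arrived in the same window.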
PagerDuty’s wrath is unleashed in rinse-and-repeat mode
Exactly five minutes later, by the clock, Pingdom started alerting PagerDuty again that the public API was unreachable. We had set up all the tools we could possibly think of. There was Grafana, which was showing us some 5xx requests. The alerts were going to Sentry through the standard route: Elasticsearch > ElastAlert > Sentry. The issue kept going until exactly 7:45 AM, five minutes later. Then the 500s stopped and PagerDuty auto-resolved. This kept repeating.
The first thought we had was whether the issue was due to a new deployment. It’s often the easiest answer to these problems. The release manager as well as the on-call SRE denied any new deployment. Not just that, all the other tools were doing okay. Grafana looked okay, Sentry was doing its job, Prometheus and APM were doing okay. We even checked the firewall to figure out whether there was a drop in traffic, but there wasn’t.
But what really happened?
After nearly 20 hours of struggling to find the problem, we realized that a simple mount command hadn’t run on one of the database shards. Because of that, data was being written in-memory. When the system rebooted, the data was wiped. This machine wasn’t supposed to be commissioned, but we had fixed it, so we put it into circulation.
However, Ansible did not run one of its commands as it should have.
So one machine was left out, one command did not run, and the machine rebooted.
What we noticed here is that only a certain section of the data was gone. So only a certain type of request started failing, which disconnected the faulty node from the load balancer as an unhealthy upstream.
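In hindsight, the check that would have caught this is tiny. Here is a minimal sketch in Python, assuming a Linux host where `/proc/mounts` lists one mount per line with the mount point as the second field. The path `/var/lib/db` is a hypothetical data directory, not the real one from this incident.

```python
def mounted_paths(mounts_text):
    """Return the set of mount points listed in /proc/mounts-style text."""
    paths = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, filesystem type, options, dump, pass.
        if len(fields) >= 2:
            paths.add(fields[1])
    return paths


def missing_mounts(mounts_text, required):
    """Return the required mount points that are not currently mounted."""
    present = mounted_paths(mounts_text)
    return sorted(path for path in required if path not in present)


def read_proc_mounts(path="/proc/mounts"):
    """Read the kernel's view of the mounted filesystems (Linux only)."""
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    # "/var/lib/db" is an assumed data directory used for illustration.
    missing = missing_mounts(read_proc_mounts(), ["/var/lib/db"])
    if missing:
        raise SystemExit("refusing to start, not mounted: " + ", ".join(missing))
```

Run at boot or before the database starts, a check like this turns a silent misconfiguration into a loud failure on the one machine that has it.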
Now that we had figured it out, the question was how to fix this. But before going there, two more important questions needed to be answered:
- If there’s a mistake in one place, it’s quite likely the mistake was repeated in other places. Where else was this failing?
- How am I going to prevent this situation from ever happening again?
Systems are designed by culture.
Digging Deeper: The Root Cause Analysis
We found out that this was a gap in our first bootstrap run by Ansible. We used Ansible to run commands. But when the (faulty) machine was brought into circulation, one of the Ansible runs hadn’t happened on it. It was a newly provisioned machine and service discovery did not pick it up.
We were using Nomad to schedule dynamic workloads. This meant that a request could go to any of the deployed machines, faulty ones included.
We could have gone old school, used none of these, and avoided the situation. But that wasn’t really an option. We had dynamic workloads and needed a clustered scheduler like Nomad.
But our RCA didn’t end there. A good root cause analysis identifies the business loss as well. Was it data loss? Did it cause a significant amount of embarrassment, if not financial loss? Was there financial loss that couldn’t be measured because the system was down?
Probable Root Cause 1: Raghu forgot to execute the mount command
A common outcome of RCA is that we end up blaming individuals. We did mention this: Raghu forgot to execute the mount command. What we must realize is that individuals cannot be reasons for failures. So the conclusion of our RCA cannot be that an individual failed to do something. It must be more actionable and concrete.
Probable Root Cause 2: Infrastructure team forgot to execute the mount command
We took another stab at it. We said that the infrastructure team forgot to execute the mount command. But this RCA, too, is not actionable.
Is improving the Infrastructure Team’s memory a probable resolution?
Another thing to consider with the above two causes is the bad apple theory. Should we change the existing team or replace team members? The Bad Apple theory states that if you take out a bad apple from a basket, what remains is a basket of really good apples. Most of the time, changing people doesn’t really solve anything. We have to be pragmatic and cognizant of the fact that skill and ownership always go hand in hand. You have to have the right skills in people and you have to give them the right ownership, along with the right tools. This solution would therefore not have produced the right results. Human failures are inevitable.
Probable Root Cause 3: Remote Login via SSH Allowed
We then went a step deeper and asked the question: Why was it possible to SSH into the system? If SSH was the issue, a probable resolution could be to not allow SSH. But that would actually hinder work.
There’s a reliability team, there’s a DevOps team and there’s a development team working on the system. They are invariably going to SSH into the system.
Preventing somebody from doing their job is not going to make the system reliable.
Probable Root Cause 4: Right Tools Were Not Present
We then asked another question: why don’t we have a tool that alerts on a configuration mismatch? What if all my systems in production had a way to match their configurations and raise an error every time there was a mismatch? That would be the best thing, right?
This way people are free to automate, are not burdened, and can make decisions freely because the mundane work of configuration validation has been offloaded to a computer. As long as there are humans in charge of the system, there will be mistakes.
So the real issue here was that we didn’t have the right tools to catch the smaller details a human could easily overlook.
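To show the idea of matching configurations across machines, here is a small sketch in Python. It is not the validator we built. The host names and settings are made up, and using the most common value across the fleet as the expected one is an assumption; comparing against a declared desired state would work just as well.

```python
from collections import Counter


def find_drift(configs):
    """Compare per-host settings against the most common value for each key.

    configs maps host -> {setting: value}.
    Returns {host: {setting: (expected, actual)}} for every mismatch.
    A setting that is missing on a host is reported with actual=None.
    """
    keys = set()
    for settings in configs.values():
        keys.update(settings)

    drift = {}
    for key in sorted(keys):
        values = [settings.get(key) for settings in configs.values()]
        # Assumption: the value most hosts agree on is the intended one.
        expected, _ = Counter(values).most_common(1)[0]
        for host, settings in configs.items():
            actual = settings.get(key)
            if actual != expected:
                drift.setdefault(host, {})[key] = (expected, actual)
    return drift


if __name__ == "__main__":
    fleet = {
        "db-1": {"data_mount": "/dev/sdb1", "fs": "ext4"},
        "db-2": {"data_mount": "/dev/sdb1", "fs": "ext4"},
        "db-3": {"fs": "ext4"},  # the mount never happened on this host
    }
    for host, problems in find_drift(fleet).items():
        print(host, problems)
```

The interesting part is not the comparison itself but that it runs on every machine, every time, without relying on anyone remembering to look.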
Anything that you’re not paying attention to is going to fail at some point or the other.
The final root cause analysis
System configuration validator was absent
There was no system to actually validate the state of the machines every now and then.
Solution: We built it. It became one of our bread-and-butter tools. Every time we had a launch, we ran those configuration validators to see if everything looked okay. We could scan ports across all machines and check that connectivity was fine. And the scope kept increasing. One of the greatest tools that fits here is osquery.
It is a great tool from Facebook that lets you run SQL queries against your systems. You can then compare the results across systems to see which ones are working fine, whether there are any differences, and so on.
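As a rough illustration of what querying a system with SQL looks like, the sketch below asks osquery for its `mounts` table and checks whether a given path is mounted. It assumes the `osqueryi` shell is installed and on the PATH, and `/var/lib/db` is a hypothetical data directory. This is an example of the approach, not the tooling we ran.

```python
import json
import subprocess

# The mounts table exposes one row per mounted filesystem.
MOUNT_QUERY = "SELECT device, path, type FROM mounts;"


def run_osquery(sql):
    """Run a query through the local osqueryi shell and return parsed rows."""
    result = subprocess.run(
        ["osqueryi", "--json", sql],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def mount_rows_for(rows, path):
    """Return the rows whose mount point is exactly the given path."""
    return [row for row in rows if row.get("path") == path]


if __name__ == "__main__":
    # "/var/lib/db" is an assumed data directory used for illustration.
    rows = run_osquery(MOUNT_QUERY)
    if not mount_rows_for(rows, "/var/lib/db"):
        print("/var/lib/db is not mounted on this host")
```

Because every host answers the same query in the same shape, the results are easy to collect and compare across the fleet.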
FMEA did not exist
Failure Mode and Effects Analysis (FMEA) is a slightly outdated term borrowed from the industrial sector. The cool kids these days call it chaos engineering. For us, FMEA means running an application in a failure mode and seeing the impact. If we had run these systems against failure scenarios in our test environments using fault injection techniques, we would probably have caught the issue.
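Here is a toy sketch of what a fault injection test for this kind of failure could look like in Python. `ShardStore`, the canary key, and `inject_data_loss` are all hypothetical names made up for the example. The point is only the shape of the test: put the system in a known-good state, inject the fault, and check that something notices.

```python
import os
import tempfile


class ShardStore:
    """A deliberately tiny stand-in for a database shard backed by a directory."""

    def __init__(self, data_dir):
        self.data_dir = data_dir

    def write(self, key, value):
        with open(os.path.join(self.data_dir, key), "w") as handle:
            handle.write(value)

    def read(self, key):
        with open(os.path.join(self.data_dir, key)) as handle:
            return handle.read()

    def healthy(self, canary_key):
        """Health check: the shard is healthy only if the canary is readable."""
        try:
            self.read(canary_key)
            return True
        except OSError:
            return False


def inject_data_loss(store):
    """Fault: empty the data directory, as a reboot of an unmounted shard would."""
    for name in os.listdir(store.data_dir):
        os.remove(os.path.join(store.data_dir, name))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data_dir:
        store = ShardStore(data_dir)
        store.write("canary", "ok")
        print("before fault:", store.healthy("canary"))
        inject_data_loss(store)
        print("after fault:", store.healthy("canary"))
```

A test like this, run against a staging shard before a launch, asks the question we only asked after the incident: what happens to this machine when its data disappears?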
Non-Latent Configuration Validator was not available
Non-latent configuration validation is an approach which advocates early detection of invalid data to ensure fast failure and prevent the situation where an invalid configuration causes bigger damage. For example, if a bad URL is detected during system initialization, there’ll be far less damage than if a customer workflow invokes it.
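The bad URL example can be sketched in a few lines of Python. The configuration keys and the `ConfigError` class are assumptions for illustration. The idea is simply that every setting is checked once, at startup, so an invalid value stops the service before any customer workflow can reach it.

```python
from urllib.parse import urlparse


class ConfigError(ValueError):
    """Raised at startup when the configuration is not usable."""


def validate_config(config):
    """Check every URL setting up front and report all problems at once."""
    problems = []
    # Hypothetical setting names; a real service would list its own.
    for key in ("payments_url", "inventory_url"):
        value = config.get(key, "")
        parsed = urlparse(value)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"{key}: not a valid http(s) URL: {value!r}")
    if problems:
        raise ConfigError("; ".join(problems))
    return config


if __name__ == "__main__":
    settings = {
        "payments_url": "htp:/broken",  # typo that would otherwise fail much later
        "inventory_url": "https://inventory.example.com/api",
    }
    try:
        validate_config(settings)
    except ConfigError as error:
        raise SystemExit(f"refusing to start: {error}")
```

Failing during initialization is cheap. The same typo discovered by the first customer request that needs the URL is an incident.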
We realized that a non-latent configuration validator was missing.
Research shows that 72% of the failures could have been reduced if configuration validation was not latent!
We identified these three as the root causes of the incident. Once we rectified them by adding the necessary tools, we did not encounter similar issues again.
That’s the power of a strong RCA. And to do it, you need to look beyond individuals, teams, and superficial causes and find actionable solutions.