Chef or Ansible? Terraform or Pulumi? Python or Ruby? Last9 or Last9? The debate is endless. While exploring the landscape of these tools is impossible in a single blog post, it is worth thinking about why there are so many options in the SRE toolchain. At times the tools are inadequate; at other times, the way we use old tools has not kept up with modern times.
Why does that happen so often? Well, because you cannot teach an old horse new tricks. Speaking of horses…
What if we told you that the mindset of building new tools has an age old link to the story of a horse who could do arithmetic?
Say Hi To Clever Hans
People usually brush it off as a joke, just as people in Germany did back in the early 1900s. So his trainer, the mathematics teacher Wilhelm von Osten, decided to take him on tour.
Try asking Hans a question yourself. Go ahead, he won’t bite.
So How Would Clever Hans Do It?
How could a horse solve arithmetic questions?
Would you believe us if we told you that you actually gave the answer away?
Clever Hans could read the emotions of the person asking him the questions. Based on their feelings, Hans could judge how close he was to the answer.
People have a hard time believing this!
So numerous experiments were conducted, and the results were mind-boggling.
Hans answered 95% of the questions correctly when the evaluator stood in front of him, but managed only 5% when the evaluator was behind him or anywhere else Hans could not see.
How cleverly Hans performed depended on the person asking.
How Does This Relate To SRE Tools?
We are conditioned to scope problem domains by the metrics we know. We know the knowns, and our existing SRE toolchain does a very good job of validating them. In that sense, existing metrics are our tells and our tools are like Clever Hans, happily tapping away to verify what we already know. What existing tools and limited metrics don’t tell us about are the hidden anomalies that show up only as faint glimmers during incidents, when we are in firefighting mode.
A single node failure in a standalone 3-node load balancer app has vastly different implications than a single node failure in a system where the faulty node has both upstream and downstream dependencies. Similarly, if you did not have distributed tracing, how would you map a request's journey through a microservices-based system? Systems evolved, problems evolved, and hence tooling evolved. In the absence of tooling that evolves with the problems, we are fooling ourselves into believing that our alarms cover the whole outage domain, akin to saying that Clever Hans knows math.
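To make the node failure comparison concrete, here is a minimal sketch that computes the "blast radius" of a failure as the set of services that transitively depend on the failed node. The two topologies, the service names, and the `blast_radius` helper are all hypothetical, made up purely for illustration; a real system would derive the dependency map from tracing or a service catalog.

```python
from collections import deque


def blast_radius(dependents, failed):
    """Return every service that transitively depends on `failed`.

    `dependents` maps a service name to the services that call it.
    """
    seen = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for caller in dependents.get(node, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    # The failed node itself is not counted as part of its own blast radius.
    return seen - {failed}


# Hypothetical topology 1: three interchangeable nodes behind a load balancer.
# Nothing depends on any single node.
standalone = {"lb-node-1": [], "lb-node-2": [], "lb-node-3": []}

# Hypothetical topology 2: "orders" depends on "db" and is called by others.
chained = {
    "db": ["orders"],
    "orders": ["checkout", "reports"],
    "checkout": ["web"],
    "reports": [],
    "web": [],
}

print(sorted(blast_radius(standalone, "lb-node-1")))  # []
print(sorted(blast_radius(chained, "orders")))  # ['checkout', 'reports', 'web']
```

The same "one node is down" alert means nothing is affected in the first topology and three services are affected in the second, which is exactly the kind of context a node-level alarm alone does not carry.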
Making Clever Hans do math is impossible, but choosing tools that solve today’s problems and anticipate near-future needs is tractable. Staying in lockstep with the problem domain of the system as it evolves is key. Detecting where your existing tooling falls short and filling those gaps is a learnable skill. During that process, you will realize that staying married to your existing toolchain, or shoehorning problems to fit a familiar tool, is counterproductive.
So do you adopt every shiny new tool that shows up on Hacker News? No. Exploring a tool with a test run is one thing; ensuring enterprise adoption across teams is another. Weigh the benefits against the switching cost. Is it worth switching if the system you are solving problems for is being sunset in the next 3 months? Conversely, isn’t it worth evaluating a CI/CD tool if your team spends most of its time doing manual deployments? These are real questions that go beyond superficial concerns such as: Is it written in my favourite programming language? Is it the next best thing I could put on my resume?
- Know your Clever Hans toolchain. Put on your poker face and ask hard subjective questions.
- Bend the solutions to your problem, not new problems to old solutions.
- Own your toolchain and not the other way around.
- If the only tool you know is a hammer, everything will look like a nail. Identify those hammers. Be on the lookout for power drills.
We had some really interesting discussions with the SRE community on Reddit around this post, and the feedback we got helped us enrich its content. Turns out that you can teach an old horse new tricks! Check them out here.