Oct 6th, ‘22/4 min read

Observability - That Last 9

TL;DR: A stitch in time saves 9. A discussion of the key building blocks of observability.

Mindful Observability

Pick your metaphor. Not knowing how fast you’re going, or where you’re heading, ends in wreckage, at least in a technical setting. A skilled operator must know her machine so that she can be mindful of the journey and extract the most out of the experience.

The problem at hand is visualizing how our distributed system and its constituents are behaving. We want to know this because we are customer-obsessed and want to ensure that our product always works for our customers, while staying frugal for our business.

Uptime and the bottom line both matter.

Building Blocks

Behold, here lie, in full view, the services. How does it all tie together, though?

Self Maintaining Topology

Point-in-time topologies and architecture block diagrams go only so far. Maintaining organisational knowledge bases is such a complex problem that it merits its own blog. A knowledge base (KB) article is like a car: it loses half its value the moment you drive it off the lot!

What we do have are traffic logs, and traffic in all directions. There is gold here. Leveraging that input to construct a topology that is self-maintaining is the starting point. Being able to simply visualise how everything connects together is the foundational rock.
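To make the idea concrete, here is a minimal sketch of deriving a topology from flow records instead of drawing it by hand. The record fields (`ts`, `src`, `dst`, `requests`), the service names, and the `window_start` cutoff are all assumptions made for illustration; real flow logs will look different.

```python
from collections import defaultdict

def build_topology(flow_records, window_start):
    """Aggregate flow records seen at or after window_start into weighted edges.

    Edges that stop appearing in recent traffic simply age out, which is
    what keeps the picture self-maintaining.
    """
    edges = defaultdict(int)
    for rec in flow_records:
        if rec["ts"] >= window_start:
            edges[(rec["src"], rec["dst"])] += rec["requests"]
    return dict(edges)

# Hypothetical flow records; field names and services are illustrative only.
flows = [
    {"ts": 100, "src": "gateway", "dst": "catalog", "requests": 40},
    {"ts": 160, "src": "gateway", "dst": "playback", "requests": 25},
    {"ts": 170, "src": "playback", "dst": "cdn-auth", "requests": 25},
    {"ts": 10, "src": "gateway", "dst": "legacy-search", "requests": 3},
    {"ts": 180, "src": "gateway", "dst": "catalog", "requests": 10},
]

topology = build_topology(flows, window_start=60)
for (src, dst), requests in sorted(topology.items()):
    print(f"{src} -> {dst}: {requests} requests")
```

The stale `legacy-search` edge falls outside the window and disappears on its own, with no KB article to update.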

Flow Analytics

With the view in place, let's turn our attention to what’s riding on them pipes. Siloed observability will tell you when your particular block is running into trouble (or not). You’re left to figure out the rest through what is often tribal knowledge.

Flow logs will ultimately help you build your “service compass”: north-south and east-west connectivity, which is super helpful for establishing causality and for building dependency graphs that show what a change in one block can affect. This can be gold when your team is growing, new, or moving too fast to know all the moving parts.

Flow log modeling solves for discoverability and for smells, based on deviations observed from the norm. As a bonus, if we factor in seasonality and “curve-fit” appropriately, it can help immensely in discovering problems before they snowball!
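As a rough illustration of the compass, the sketch below labels each edge as north-south or east-west and walks the graph backwards to find who is affected when a block changes. The `INTERNAL` set, the service names, and the rule that an edge is east-west only when both ends are internal are assumptions for this example, not a standard definition.

```python
from collections import defaultdict, deque

# Hypothetical set of services that live inside our own perimeter.
INTERNAL = {"gateway", "catalog", "playback", "entitlements"}

def direction(src, dst):
    """East-west if both ends are internal, otherwise north-south."""
    return "east-west" if src in INTERNAL and dst in INTERNAL else "north-south"

def impacted_by(edges, changed):
    """Return every caller that reaches `changed` directly or transitively."""
    callers = defaultdict(set)
    for src, dst in edges:
        callers[dst].add(src)
    seen, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for caller in callers[node]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

edges = [
    ("internet", "gateway"),
    ("gateway", "catalog"),
    ("gateway", "playback"),
    ("playback", "entitlements"),
]

compass = {edge: direction(*edge) for edge in edges}
blast_radius = impacted_by(edges, "entitlements")
print(compass)
print(sorted(blast_radius))
```

A change to `entitlements` here touches `playback`, `gateway`, and ultimately traffic from the internet, which is exactly the kind of answer that otherwise lives in tribal knowledge.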
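One simple way to picture “deviation from the norm with seasonality factored in” is a per-hour baseline. This is only a sketch: the hour-of-day bucketing, the z-score test, and the threshold of 3 are assumptions chosen for brevity, and a real platform would likely use a richer model.

```python
from statistics import mean, stdev

def seasonal_baseline(history):
    """history: list of (hour_of_day, requests_per_minute) samples.

    Returns {hour: (mean, stdev)} for hours with at least two samples.
    """
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}

def is_deviation(baseline, hour, value, threshold=3.0):
    """Flag a value that sits more than `threshold` deviations from that hour's norm."""
    if hour not in baseline:
        return False  # no history for this hour, so nothing to compare against
    mu, sigma = baseline[hour]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical samples: busy at 20:00, quiet at 03:00.
history = [(20, 900), (20, 1000), (20, 1100), (3, 40), (3, 50), (3, 60)]
baseline = seasonal_baseline(history)

print(is_deviation(baseline, 20, 1050))  # normal for prime time
print(is_deviation(baseline, 3, 1050))   # a spike in the small hours
print(is_deviation(baseline, 20, 50))    # traffic far below the usual level
```

The same reading of 1050 requests per minute is unremarkable at 20:00 and a smell at 03:00, and the check cuts both ways: traffic that is far too low is flagged as well.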

Resiliency

Resiliency: can the system survive fatalities in individual blocks of execution? This is akin to isolating a fire in a self-contained block or concern, to prevent the spread and survive in a degraded state. If we view each block as expendable, what does it take to “mock” that block?

Photo by Bradyn Trollip on Unsplash

Inevitably, after “k” failures it is no longer tenable to keep degrading, but being able to absorb failures up to that point still improves your overall resiliency.

We can deem this as a “panic response” so that the block continues to act as if it’s there, whereas, in reality, it’s non-functional. Positive affirmation that the leaky chamber is shut is very valuable in an incident. The ability to spot smoke, replace behavior (panic), and know that things are OK, is a factor in reducing your Mean Time To Recovery (MTTR). An observability platform lets you do that solidly.
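A panic response can be sketched as a thin wrapper around a call to a block. Everything here is hypothetical: the `PanicSwitch` name, the `max_failures` parameter (a count of consecutive failed calls, chosen for this example), and the canned fallback. It also never recovers on its own, which a production circuit breaker would need to do.

```python
class PanicSwitch:
    """Serve a canned fallback once a block has failed too many times in a row."""

    def __init__(self, call, fallback, max_failures=3):
        self.call = call
        self.fallback = fallback
        self.max_failures = max_failures
        self.failures = 0
        self.panicked = False  # positive affirmation that the block is sealed off

    def __call__(self, *args, **kwargs):
        if self.panicked:
            # The real block is no longer consulted at all.
            return self.fallback(*args, **kwargs)
        try:
            result = self.call(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.panicked = True
            return self.fallback(*args, **kwargs)
        self.failures = 0  # a success resets the consecutive-failure count
        return result

# Hypothetical block that is down, and a static stand-in for it.
def flaky_recommendations(user_id):
    raise TimeoutError("recommendations unavailable")

def canned_recommendations(user_id):
    return ["top-10-today"]

recs = PanicSwitch(flaky_recommendations, canned_recommendations, max_failures=3)
responses = [recs("u1") for _ in range(4)]
print(responses, recs.panicked)
```

Callers keep getting an answer throughout, and the `panicked` flag is the signal an operator can check during an incident to confirm that the failing block has been shut off.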

Seasonality & Operational Intelligence

Traffic is seasonal. It’s valuable to have an observability platform that recognizes this, and that adapts and learns how your system behaves over time. Only then can we truly begin to detect “smoke”.

Customer traffic is driven by intent, time of day, mood, and several other human factors that are sometimes impossible to predict. While we’ve seen the adverse effects of these traffic “tsunamis”, the flip side is also true: traffic can be so low, relative to BAU levels, that errors often go unnoticed.

An acknowledgment of variability is also very valuable when we’re making projections about traffic and want to see how customers use the system. Then, use this as a basis for what-if analysis to prepare for the onslaught of real production-grade traffic.

Second Order Benefits

Once the baseline system for observability is in place, you can also layer in spend tracking and optimization, whether cloud or hybrid, given that you now have a system of record.

If the pipes and flows are instrumented, how far behind can log analysis-based threat modeling be? The same flow patterns can be analyzed for optimal flow control through the services for hot spot identification and performance tuning.

While perhaps not strictly under the definition of what we think of as observability, these are critical measures for any team to look at. Possibly, higher in the observability Maslow hierarchy, though!

The Last 9

Observability is a foundational building block and can unlock much goodness; however, it’s deviously complex to get right. The founders at Last9.io, aptly named, have been excellent co-build partners in working out what a solid observability platform should be, one that hits most, if not all, of these building blocks.

Operating any system without observability is simply not mindful operation, and it leaves much of the value the system can unleash untapped. Knowing how your machine works and being cognizant of when it’s roaring or purring is so crucial!

“The test of the machine is the satisfaction it gives you. There isn’t any other test. If the machine produces tranquility, it’s right. If it disturbs you, it’s wrong until either the machine or your mind is changed.”
Robert M. Pirsig, Zen and the Art of Motorcycle Maintenance: An Inquiry Into Values


Akash Saxena is the CTO at Hotstar. You can find him on twitter.

Want to know more about Last9 and our products? Check out last9.io; we're building reliability tools to make running systems at scale fun and embarrassingly easy. 🟢

Authors

Akash Saxena

Ex-CTO at Hotstar