Observability - That Last 9
TL;DR: A stitch in time saves nine. A discussion of the key building blocks of observability.

Mindful Observability

Pick your metaphor. Not knowing how fast you’re going, or where you’re heading, mostly ends in wreckage, at least in a technical setting. A skilled operator must know her machine so that she can be mindful about the journey and extract the most out of the experience.

The problem at hand is visualising how our distributed system and its constituents are behaving. We want to know this because we are customer obsessed and want to ensure that our product always works for our customers, while being frugal for our business.

Uptime and the bottom line both matter.

Building Blocks

Behold: here lie, in full view, services. How does it all tie together, though?

Self Maintaining Topology

Point-in-time topologies and architecture block diagrams go only so far. Maintaining organisational knowledge bases is such a complex problem that it merits its own blog post. A knowledge base (KB) article is like a car: it loses half its value the moment you drive it off the lot!

What we do have are traffic logs, and traffic flowing in all directions. There is gold here. Leveraging that input to construct a topology that is self-maintaining is the starting point. Being able to simply visualise how everything connects together is the foundational rock.
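As a rough sketch of the idea, assume flow records have already been resolved to a source service, a destination service, and a timestamp. The record shape, the class names, and the one-hour staleness window below are illustrative assumptions, not any particular product's format. Edges appear when traffic is seen and quietly expire when it stops, which is what keeps the picture self-maintaining:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One hypothetical flow-log record, already resolved to service names."""
    src: str
    dst: str
    ts: float  # seconds since epoch


class Topology:
    """Service graph derived from traffic; edges expire when traffic stops."""

    def __init__(self, max_age_s: float = 3600.0):
        self.max_age_s = max_age_s  # assumed staleness window
        self._last_seen = {}  # (src, dst) -> most recent timestamp

    def observe(self, flow: Flow) -> None:
        """Record that src talked to dst at flow.ts."""
        edge = (flow.src, flow.dst)
        self._last_seen[edge] = max(flow.ts, self._last_seen.get(edge, flow.ts))

    def edges(self, now: float) -> dict[str, set[str]]:
        """Return src -> {dst} for edges seen within the last max_age_s."""
        graph = defaultdict(set)
        for (src, dst), seen in self._last_seen.items():
            if now - seen <= self.max_age_s:
                graph[src].add(dst)
        return dict(graph)


topo = Topology(max_age_s=100)
topo.observe(Flow("web", "api", ts=0))
topo.observe(Flow("api", "db", ts=50))
current = topo.edges(now=60)    # both edges are still fresh
later = topo.edges(now=120)     # web -> api has aged out
```

Nobody has to remember to update a diagram here: a decommissioned dependency simply stops showing up once its last flow falls outside the window.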

Flow Analytics

With the view in place, let’s turn our attention to what’s riding on them pipes. Siloed observability will tell you whether your particular block is running into trouble. You’re left to figure out the rest through what is often tribal knowledge.

Flow logs will ultimately help you build your “service compass”: north-south (N-S) and east-west (E-W) connectivity. This is super helpful for establishing causality, and for building dependency graphs that show what is affected when a block changes. It can be gold when your team is growing, new, or just plain moving too fast to know all the moving parts.
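To make the dependency-graph point concrete, here is a minimal sketch. It assumes the graph is held as a plain mapping from each caller to the set of services it calls; the service names and the helper names (`reverse`, `reachable`, `impacted_by`) are made up for illustration. Walking the reversed graph answers the question "who might feel it if this block changes?":

```python
from collections import deque


def reverse(graph: dict[str, set[str]]) -> dict[str, set[str]]:
    """Flip caller -> callees into callee -> callers."""
    rev = {}
    for src, dsts in graph.items():
        for dst in dsts:
            rev.setdefault(dst, set()).add(src)
    return rev


def reachable(graph: dict[str, set[str]], start: str) -> set[str]:
    """Breadth-first walk: every node reachable from start (start excluded)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def impacted_by(graph: dict[str, set[str]], service: str) -> set[str]:
    """Services that call `service`, directly or transitively."""
    return reachable(reverse(graph), service)


# Hypothetical call graph: caller -> callees.
calls = {"web": {"api"}, "api": {"db", "cache"}, "batch": {"db"}}

print(sorted(impacted_by(calls, "db")))  # ['api', 'batch', 'web']
print(sorted(reachable(calls, "web")))   # ['api', 'cache', 'db']
```

The first query is the one a new team member rarely knows the answer to: a change to `db` touches `batch` as well as the obvious request path.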

Flow log modeling solves for discoverability and for smells, based on observed deviations from the norm. As a bonus, if we factor in seasonality and “curve-fit” appropriately, it can help immensely in discovering problems quickly, before they snowball!
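One simple way to picture "deviation from the norm, with seasonality" is a per-bucket baseline. The sketch below is an assumption-laden stand-in for real curve fitting: it groups history by (weekday, hour), and flags a value whose z-score against that bucket exceeds a threshold. The class name, the threshold of 3, and the minimum sample count are all illustrative choices.

```python
import statistics
from collections import defaultdict
from datetime import datetime, timedelta


def bucket(ts: datetime) -> tuple[int, int]:
    """Seasonal bucket: (weekday, hour of day)."""
    return (ts.weekday(), ts.hour)


class SeasonalBaseline:
    """Compare a value against history from the same weekday and hour."""

    def __init__(self, threshold: float = 3.0, min_samples: int = 4):
        self.threshold = threshold      # assumed z-score cut-off
        self.min_samples = min_samples  # below this, refuse to judge
        self._history = defaultdict(list)

    def learn(self, ts: datetime, value: float) -> None:
        self._history[bucket(ts)].append(value)

    def is_anomalous(self, ts: datetime, value: float) -> bool:
        samples = self._history.get(bucket(ts), [])
        if len(samples) < self.min_samples:
            return False  # not enough history for this bucket
        mean = statistics.fmean(samples)
        spread = statistics.stdev(samples)
        if spread == 0:
            return value != mean
        return abs(value - mean) / spread > self.threshold


# Four Mondays at 09:00 with roughly 100 requests per second.
baseline = SeasonalBaseline()
monday_9am = datetime(2024, 1, 1, 9, 0)
for week, rps in enumerate([100, 110, 90, 100]):
    baseline.learn(monday_9am + timedelta(weeks=week), rps)

next_monday = monday_9am + timedelta(weeks=4)
normal = baseline.is_anomalous(next_monday, 105)  # within the usual band
smoke = baseline.is_anomalous(next_monday, 10)    # far below the usual band
```

Because each bucket is judged against its own past, a quiet Sunday night and a busy Monday morning each get their own idea of normal.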

Resiliency

Resiliency: can the system survive fatalities in individual blocks of execution? This is akin to being able to isolate a fire in a self-contained block or concern, so as to prevent its spread and survive with degradation. If we view each block as expendable, then what does it take to “mock” that block?


Inevitably, after “k” failures, it’s no longer tenable to degrade. Still, being able to absorb failures up to that point improves your overall resiliency.

We can deem this a “panic response”: the block continues to act as if it’s there whereas, in reality, it’s non-functional. Having positive affirmation that the leaky chamber is shut is very valuable in an incident. The ability to spot smoke, replace behaviour (panic), and know that things are OK helps reduce your Mean Time To Recovery (MTTR), and an observability platform lets you do that solidly.
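A minimal sketch of such a panic response follows. It assumes the block can be wrapped as a zero-argument callable and that a canned fallback exists. `PanicSwitch` and `max_failures` are hypothetical names, and a production circuit breaker would also probe for recovery, which is left out here in favour of a manual `reset`.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class PanicSwitch:
    """Serve a canned response once a block fails too many times in a row."""

    def __init__(self, call: Callable[[], T], fallback: Callable[[], T],
                 max_failures: int = 3):
        self._call = call
        self._fallback = fallback
        self._max_failures = max_failures  # assumed trip point
        self._failures = 0

    @property
    def panicking(self) -> bool:
        """Positive signal that the block is sealed off and mocked."""
        return self._failures >= self._max_failures

    def __call__(self) -> T:
        if self.panicking:
            return self._fallback()  # the real block is no longer called
        try:
            result = self._call()
        except Exception:
            self._failures += 1
            if self.panicking:
                return self._fallback()
            raise
        self._failures = 0  # a success clears the consecutive-failure count
        return result

    def reset(self) -> None:
        """Manually reopen the block once it is known to be healthy."""
        self._failures = 0


attempts = []


def recommendations():
    """Stand-in for a block that is currently down."""
    attempts.append(1)
    raise RuntimeError("recommendations service is down")


switch = PanicSwitch(recommendations, fallback=lambda: ["top-10"], max_failures=2)
```

With `max_failures=2`, the first call raises, the second trips the switch and returns the fallback, and later calls return the fallback without touching the broken block. The `panicking` flag is the piece worth exporting as a metric: it is the affirmation that the chamber is shut.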

Seasonality & Operational Intelligence

It’s also valuable to have an observability platform that recognises seasonality, and that adapts and learns how your system behaves over time. Only then can we truly begin to detect “smoke”.

Customer traffic is driven by intent, time of day, mood, and several other human factors that are sometimes impossible to predict. We’ve seen the adverse effects of these in traffic “tsunamis”, but the flip side is also true: traffic can be so low, relative to business-as-usual (BAU) levels, that errors often go unnoticed.
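The low-traffic blind spot is easy to show with two toy alert rules. The thresholds and function names below are invented for illustration: one rule fires on an absolute error count tuned for BAU volume, the other on the error ratio.

```python
def count_alert(errors: int, max_errors: int = 50) -> bool:
    """Fire when the absolute number of errors in a window is too high."""
    return errors > max_errors


def ratio_alert(errors: int, requests: int, max_ratio: float = 0.01,
                min_requests: int = 20) -> bool:
    """Fire when the share of failing requests in a window is too high."""
    if requests < min_requests:
        return False  # too few requests for the ratio to mean much
    return errors / requests > max_ratio


# Hypothetical one-minute windows: (errors, requests).
busy = (60, 10_000)  # BAU volume, 0.6% of requests failing
quiet = (40, 100)    # very low volume, 40% of requests failing

busy_flags = (count_alert(busy[0]), ratio_alert(*busy))
quiet_flags = (count_alert(quiet[0]), ratio_alert(*quiet))
print(busy_flags)   # (True, False)
print(quiet_flags)  # (False, True)
```

The count-based rule stays silent in the quiet window even though nearly half of the requests are failing, which is exactly how errors slip by when traffic dips well below normal.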

Acknowledging variability is also very valuable when we’re making projections about traffic and want to see how customers are using the system. These projections can then serve as a basis for what-if analysis, to ultimately prepare for the onslaught of real, production-grade traffic.

Second Order Benefits

Once the baseline system for observability is in place, you can also layer in spend tracking and optimisation, whether cloud or hybrid, given that you now have a system of record.

If the pipes and flows are instrumented, then how far behind can log-analysis-based threat modeling be? The same flow patterns can also be analysed to optimise flow control through the services, for hot spot identification and performance tuning.

While perhaps not strictly within what we think of as observability, these are key measures for any team to look at as well, though perhaps higher up in the observability Maslow hierarchy!

The Last 9

Observability is a foundational building block and can unlock much goodness. However, it’s deviously complex to get right. The founders at the aptly named Last9.io have been amazing co-build partners in trying to make inroads into what a solid observability platform should be, one that hits most, if not all, of the building blocks.

Operating any system sans observability is just not mindful operation, and it leaves on the table the value that the system can unleash. Knowing how your machine works, and being mindful of when it’s roaring or purring, is so key!

“The test of the machine is the satisfaction it gives you. There isn’t any other test. If the machine produces tranquility it’s right. If it disturbs you it’s wrong until either the machine or your mind is changed.”
Robert M. Pirsig, Zen and the Art of Motorcycle Maintenance: An Inquiry Into Values


Akash Saxena is the CTO at Hotstar. You can find him on Twitter.

