The most interesting talks from SREcon 2021!

SREcon is a two-day conference organized by the USENIX Association, a nonprofit organization that supports advanced computing system communities and furthers the reach of innovative research. It’s one of the most popular conferences hosted by USENIX and is focused on site reliability, distributed systems, and systems engineering at scale.

Each SREcon is jam-packed with informative, insightful, and technical sessions. In 2021, SREcon was hosted virtually due to the ongoing pandemic.

In this roundup, you’ll learn about a few exciting talks that took place at SREcon21. The talks reviewed here highlight some of the best sessions, including What’s the Cost of a Millisecond? and Scaling for a Pandemic: How We Keep Ahead of Demand for Google Meet during COVID-19.

1. Ten Lessons Learned in 10 Years of SRE

🏃🏻‍♂️ Speaker: Andrea Spadaccini
(Principal Software Engineer - SRE @ Microsoft, ex SRE @ Google)

In the session, Andrea Spadaccini shares lessons learned during his ten-year journey, from his start as an intern at Google in 2011 to his current role as a principal software engineer in SRE (Site Reliability Engineering) at Microsoft. This talk mainly focuses on organizations that are considering implementing SRE into their business.

This thirty-minute talk takes you through the process of implementing SRE in your organization and aligning it with business goals and customer needs by evaluating, starting, and scaling SRE appropriately. Most importantly, it focuses on a culture of shared ownership, trust, and efficient communication between teams and executives.

If you’re at a stage in your company where you’re evaluating or scaling your SRE efforts, or if you’re interested in understanding the fundamental principles of adopting SRE, this talk is for you.

2. What’s the Cost of a Millisecond?

🏃🏻‍♂️ Speaker: Avishai Ish-Shalom
(Ex Developer Advocate @ ScyllaDB)

Avishai, a former developer advocate at ScyllaDB, discusses the concept of latency amplification, where latency compounds as requests are routed between microservices, leaving users with a slow system. Although this talk is very technical in nature, Avishai makes the content easy to follow.

The talk focuses on how queues between microservices amplify latency, rather than on managing retries. Latency amplification is worth preventing, as it’s one of the most common reasons organizations suffer from low utilization.

In this talk, you learn how to manage latency by avoiding unbounded queues and capping queue length to counter amplified latency. Concepts like circuit breakers, back pressure, and more are introduced as ways to help achieve the desired latency.
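To make the idea of capping queue length concrete, here is a minimal Python sketch. It is not taken from the talk: the cap `MAX_QUEUE_LENGTH`, the helper names, and the 20 ms service time are illustrative assumptions, and the wait estimate assumes a single worker draining the queue in order at a constant rate.

```python
import queue

# Hypothetical cap chosen for illustration; the talk does not prescribe a value.
MAX_QUEUE_LENGTH = 3


def try_enqueue(q, request):
    """Queue the request if there is room; otherwise shed it immediately.

    Rejecting early is a simple form of back pressure: the caller learns
    right away that the service is saturated instead of waiting in line.
    """
    try:
        q.put_nowait(request)
        return True
    except queue.Full:
        return False


def worst_case_wait_ms(queue_length, service_time_ms):
    """Upper bound on queueing delay for one worker with constant service time."""
    return queue_length * service_time_ms


requests = queue.Queue(maxsize=MAX_QUEUE_LENGTH)
accepted = [try_enqueue(requests, i) for i in range(5)]
bound_ms = worst_case_wait_ms(MAX_QUEUE_LENGTH, service_time_ms=20)
```

With a cap of three, the fourth and fifth requests are rejected rather than queued, and no accepted request waits behind more than three others. An unbounded queue has no such ceiling, which is how delays can keep growing under load.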

The speaker’s real-life experience is a valuable takeaway of the talk. Avishai walks you through how you can actively combat latency amplification to reduce user-facing latency.

This presentation is recommended for any business having difficulty managing its user-facing latency. Keep in mind that user-facing latency amplifies if internal latency is not actively monitored.

3. Cache Strategies with Best Practices

🏃🏻‍♂️ Speaker: Tao Cai
(Staff Software Engineer @ LinkedIn)

In this talk, Tao Cai, a staff software engineer at LinkedIn, discusses cache strategies, multiple cache TTL strategies, and cache warm-up strategies with examples that make it efficient to retrieve remote data for data-intensive and latency-sensitive services.

The twenty-five-minute talk focuses on several best practices, including the preferred method of utilizing soft- and hard-TTL strategies with a notification pipeline for the real-time cache. Tao focuses on one crucial point regarding cache fallback: deduplicating requests so that the same request is not issued multiple times, and using async fallback to update the cache if serving time is high. He also recommends using the cache to store partial/empty/error results in order to not overload the database when it’s in a problematic state.
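One common reading of a soft- and hard-TTL scheme is sketched below in Python. This is an interpretation, not the speaker's implementation: the class name `SoftHardTTLCache`, the state labels, and the injectable `clock` parameter are all assumptions made for illustration. Entries younger than the soft TTL are served as fresh, entries between the soft and hard TTL are still served but flagged so the caller can refresh them asynchronously, and entries past the hard TTL are dropped.

```python
import time


class SoftHardTTLCache:
    """Illustrative in-memory cache with a soft and a hard time-to-live."""

    def __init__(self, soft_ttl, hard_ttl, clock=time.monotonic):
        self.soft_ttl = soft_ttl
        self.hard_ttl = hard_ttl
        self.clock = clock      # injectable so the behavior can be tested
        self._entries = {}      # key -> (value, stored_at)

    def put(self, key, value):
        # The value may also be an empty or error marker, so that repeated
        # lookups do not hit a struggling backend again.
        self._entries[key] = (value, self.clock())

    def get(self, key):
        """Return (value, state), where state is 'fresh', 'stale', or 'miss'."""
        entry = self._entries.get(key)
        if entry is None:
            return None, "miss"
        value, stored_at = entry
        age = self.clock() - stored_at
        if age < self.soft_ttl:
            return value, "fresh"
        if age < self.hard_ttl:
            # Still usable: serve it now and refresh in the background.
            return value, "stale"
        del self._entries[key]
        return None, "miss"
```

A caller would typically return a `stale` value immediately and schedule a single background refresh for that key. Request deduplication and the notification pipeline mentioned above are left out of this sketch.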

The second part of the talk focuses on cache warm-up using local disk persistence, peer cache rsync, schema evolution, and shared remote cache, and concludes with a discussion regarding cache efficiency and sharding. Sharding is where you split a cache cluster into multiple shards, and each shard is responsible for part of the total cache.
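As a rough illustration of sharding, the Python sketch below maps each cache key to one of a fixed number of shards by hashing the key. The function name `shard_for` and the modulo scheme are assumptions for this example; the talk may describe a different placement strategy.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a cache key to a shard index in the range [0, num_shards).

    A stable hash is used so that every client picks the same shard for
    the same key. Python's built-in hash() is randomized per process for
    strings, so it is not suitable here.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical keys, each assigned to one of four shards.
placement = {key: shard_for(key, 4) for key in ("member:1", "member:2", "member:3")}
```

Plain modulo hashing is easy to reason about, but changing the number of shards moves most keys to a different shard. Consistent hashing is a common alternative when shards are added or removed often.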

While this talk is considered highly technical, it provides a complete overview of how caching should be implemented at scale. It’s great for anyone curious about operating cache servers at scale or on latency-sensitive services.

4. Scaling for a Pandemic: How We Keep Ahead of Demand for Google Meet during COVID-19

🏃🏻‍♀️ Speaker: Samantha Schaevitz
(Software Engineer @ Google)

Samantha Schaevitz, a senior staff software engineer in SRE at Google, shares a case study on how teams inside Google managed scaling when demand exploded during the COVID-19 pandemic. This talk covers the incident response structure that helped Google Meet scale without any user-facing outages.

The brief talk discusses how teams inside Google ramped up their incident response structure before the massive wave of traffic and the enormous team effort it took to automate resource allocation.

One major takeaway is that every individual had a shadow partner who could jump in if that person was not available. The case study is fun and offers a unique look at Google’s pandemic response for a specific platform.

5. How We Built Out Our SRE Department to Support over 100 Million Users for the World’s 3rd Biggest Mobile Marketplace

🏃🏻‍♀️ Speaker: Sinéad O’Reilly
(Senior Manager SRE @ Aspiegel, ex-Salesforce)

Sinéad O’Reilly, from Aspiegel, presents a fascinating talk on how they scaled their SRE headcount while working from home.

In this thirty-minute talk, you get an inside look at how Aspiegel SE made scaling decisions with efficient procedures and documentation. One critical point discusses how new growth required Aspiegel SE to scale their internal applications and support chats.

This talk is great for anyone interested in knowing how to scale teams fast and efficiently. The detailed talk focuses on processes and efficient on-boarding that can benefit your organization.

6. Lessons Learned Using the Operator Pattern to Build a Kubernetes Platform

🏃🏻‍♂️ Speaker: Pavlos Ratis
(Senior SRE @ Red Hat)

Pavlos Ratis, a senior site reliability engineer at Red Hat and the owner of the awesome SRE repository, shares his experience with the Operator pattern while building OpenShift, a container orchestration platform from Red Hat.

The twenty-minute talk is very technical and is filled with lessons that the team at Red Hat learned while building OpenShift, including:

Sane practices
Pitfalls of Operator patterns
Standardization
Testing
When to use Helm
Microservices vs Operators
Automation

This is a fascinating talk for anyone trying to dive deep into Kubernetes and explore the Operator pattern. By the end of the session, you’ll understand the common pitfalls of running Operators and when you should consider writing a custom one.

7. Food for Thought: What Restaurants Can Teach Us about Reliability

🏃🏻‍♂️ Speaker: Alex Hidalgo
(Principal Reliability Advocate @ Nobl9)

Alex Hidalgo, principal reliability advocate at Nobl9, presents a very interesting talk covering what restaurants can teach us about reliability, drawing on his past experience as a server, bartender, line cook, busser, and runner. The talk is a gentle introduction to reliability using analogies.

Alex makes the point that there are many complex external systems in the world, and restaurants are a great example because they are complex systems made of component complex systems.

For example, there can be several issues on a restaurant floor, one of which is when food is served late to the customer. Alex compares the time it takes to serve customers to latency. The solution to this problem could be hiring more staff and runners, or looking at the bigger picture and arranging your customers per section to manage the demand.

Hiring more staff can be compared to upgrading your resources by either vertical or horizontal scaling. Hiring more runners is equivalent to routing services, and arranging customers per section is like load balancing, where you distribute traffic evenly across servers, just as customers are spread across tables.

In short, the talk teaches you to look everywhere for help. To understand the importance of reliability, there are a lot of lessons you can learn from those around you. This talk is suitable for anyone who wants to learn about reliability more intuitively.

Conclusion

There were so many interesting talks at SREcon 2021, and this roundup was just a brief overview of highlights from the event.

If you want to check out more presentations from the conference, you can visit the USENIX website for a full list of presentations.

(A big thank you to Hrittik Roy for his contribution to this article)
