May 10-12, 2023
Vancouver, British Columbia, Canada + Virtual
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for Open Source Summit North America 2023 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is automatically displayed in Pacific Daylight Time (UTC/GMT -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date."

IMPORTANT NOTE: Timing of sessions and room locations are subject to change.

Thursday, May 11 • 2:55pm - 3:35pm
Cluster Golden Signals to Avoid Alert Fatigue at Scale - Anusha Ragunathan & Sahil Badla, Intuit Inc

As platform engineers and SREs, we rely on metrics from Kubernetes clusters to understand platform health. On a Kubernetes platform running hundreds of clusters, those clusters often produce a sea of alerts, and on-call engineers must tend to all of them, which can lead to alert fatigue. The alerts cannot be ignored, because any of them could signal an outage or incident. How do we devise an observability system for Kubernetes clusters that separates the signal from the noise? Fortunately, the industry-standard “Golden Signals” (error rate, latency, traffic, and resource saturation), defined for applications and services, can be applied to Kubernetes clusters as well. In this talk, we will take a deep dive into how we have defined “Cluster Golden Signals” and how they work, and go over the architecture and components of a successful metrics pipeline that derives baseline behaviors and detects anomalies. With a demo of a simulated incident, Anusha and Sahil will explain how cluster golden signals are invaluable in distinguishing a service issue from a platform issue, and how to isolate and remediate a platform incident quickly and efficiently. You will learn best practices from our experience building and operating this system in production at large scale.
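
As a rough illustration of what "deriving a baseline and detecting anomalies" on a golden signal can mean, here is a minimal sketch. It is not the speakers' pipeline: the `BaselineDetector` class, the rolling window, the z-score threshold, and the sample error rates are all assumptions made for this example.

```python
from collections import deque
from statistics import mean, stdev


class BaselineDetector:
    """Rolling baseline for one golden signal (hypothetical helper)."""

    def __init__(self, window=30, threshold=3.0, min_samples=5):
        self.samples = deque(maxlen=window)  # recent "normal" values
        self.threshold = threshold           # allowed deviation, in std devs
        self.min_samples = min_samples       # warm-up before judging

    def observe(self, value):
        """Return True if value deviates from the current baseline."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            mu = mean(self.samples)
            sigma = stdev(self.samples)
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.threshold
            else:
                # Flat baseline: any change counts as a deviation.
                anomalous = value != mu
        # Assumption: anomalous points are kept out of the baseline
        # so that an ongoing incident does not become the new normal.
        if not anomalous:
            self.samples.append(value)
        return anomalous


# Made-up cluster error rates (fraction of failed requests per interval).
error_rates = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011, 0.250]
detector = BaselineDetector()
flags = [detector.observe(v) for v in error_rates]
print(flags)
```

In this sketch only the final sample is flagged. One detector per signal (error rate, latency, traffic, saturation) per cluster would be one way to alert on deviations from normal behavior rather than on every raw threshold breach.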

Speakers

Sahil Badla

Staff Software Engineer, Intuit Inc
Sahil Badla is a technologist with a decade of experience as a backend engineer. He started his career as a software engineer and has spent most of it specializing in services and infrastructure. He has led many teams in adopting and migrating to microservices. He is currently...

Anusha Ragunathan

Principal Software Engineer, Intuit Inc
Anusha Ragunathan is a software engineer at Intuit, where she works on building and maintaining the company’s Kubernetes-based compute infrastructure. Anusha is passionate about solving complex problems in systems and infrastructure engineering. Prior to Intuit, she worked on building...



Thursday May 11, 2023 2:55pm - 3:35pm PDT
118 (Level 1)
  ContainerCon, Observability