Observability
Are We All on the Same Page? Let’s Fix That - With AI Assistance
Tuesday Mar 17 / 05:05PM GMT
In distributed systems, incident response rarely fails because of missing signals. It fails because the right people aren’t mobilised quickly enough and teams struggle to build a shared understanding under pressure.
Luis Mineiro
Director of Digital Foundation @ASOS.com, SRE Charmer, Previously @Delivery Hero and @Zalando
Wrangling Telemetry at Scale: A Guide to Self-Hosted Observability
Tuesday Mar 17 / 03:55PM GMT
Observability is supposed to help you tame complexity, but your observability stack can quickly become just as complex as the systems it's meant to watch. For most teams, the answer is to pay someone else to deal with it.
Colin Douch
Site Reliability Engineer @DuckDuckGo
Orienting, Understanding, Playing, Thriving: Debugging your Organisation
Tuesday Mar 17 / 10:35AM GMT
Debugging is both an art and a science. But more than that, it's an activity undertaken with deep intention: to understand and improve your systems. In the purely technical realm, we have an extraordinary range of tooling and techniques that can help us tackle this problem.
Hazel Weakly
Fellow @Nivenly Foundation; Director, Haskell Foundation; Experienced Leader Focusing on Organizational Change, Developer Experience, and Resilience Engineering
Uncorking Queueing Bottlenecks with OpenTelemetry
Monday Mar 16 / 11:45AM GMT
Queues are the backbone of scalable, asynchronous systems, but they can easily create a tangled web of complexity. When things slow down, the bottleneck could be anywhere, from producer lag to consumer exhaustion, and standard metrics often fail to show the full picture.
Julian Wreford
Team Lead of Operability Team @Gearset, Software Engineer Turned Accidental SRE
Oli Lane
Engineering Team Lead @Gearset, Focusing on Engineering Culture, Observability, and Platform Reliability
Understanding and Tuning System Performance with CPU Hardware Counters
Monday Mar 16 / 05:05PM GMT
Counters are fundamental to monitoring: how many requests were processed, how many CPU-seconds consumed, how many bytes sent over a network. Very likely you are already monitoring your applications and operating systems via the hundreds or thousands of counters they expose.
Bryan Boreham
Distinguished Engineer @Grafana Labs, Member of the Prometheus Team, Expert in Distributed Systems and Computer Performance
Ontology-Driven Observability: Building the E2E Knowledge Graph at Netflix Scale
Tuesday Mar 17 / 10:35AM GMT
As Netflix scales hundreds of client platforms, microservices, and infrastructure components, correlating user experience with system performance has become a hard data problem, not just an observability one.
Prasanna Vijayanathan
Engineer @Netflix
Renzo Sanchez-Silva
Engineer @Netflix