Observability

Session Observability

Are We All on the Same Page? Let’s Fix That - With AI Assistance

Tuesday Mar 17 / 05:05PM GMT

In distributed systems, incident response rarely fails because of missing signals - it fails because the right people aren’t mobilised quickly enough, and teams struggle to build a shared understanding under pressure.

Luis Mineiro

Director of Digital Foundation @ASOS.com, SRE Charmer, Previously @Delivery Hero and @Zalando

Session Observability

Wrangling Telemetry at Scale: A Guide to Self-Hosted Observability

Tuesday Mar 17 / 03:55PM GMT

Observability is supposed to help you tame complexity, but your observability stack can quickly become just as complex as the systems it's meant to watch. For most teams, the answer is to pay someone else to deal with it.

Colin Douch

Site Reliability Engineer @DuckDuckGo

Session Sociotechnical Leadership

Orienting, Understanding, Playing, Thriving: Debugging your Organisation

Tuesday Mar 17 / 10:35AM GMT

Debugging is both an art and a science. But more than that, it's an activity undertaken with deep intention: to understand and improve your systems. In the purely technical realm, we have an extraordinary range of tooling and techniques that can help us tackle this problem.

Hazel Weakly

Fellow @Nivenly Foundation; Director, Haskell Foundation; Experienced Leader Focusing on Organizational Change, Developer Experience, and Resilience Engineering

Session Observability

Uncorking Queueing Bottlenecks with OpenTelemetry

Monday Mar 16 / 11:45AM GMT

Queues are the backbone of scalable, asynchronous systems, but they can easily create a tangled web of complexity. When things slow down, the bottleneck could be anywhere, from producer lag to consumer exhaustion, and standard metrics often fail to show the full picture.

Julian Wreford

Team Lead of Operability Team @Gearset, Software Engineer Turned Accidental SRE

Oli Lane

Engineering Team Lead @Gearset, Focusing on Engineering Culture, Observability, and Platform Reliability

Session Performance

Understanding and Tuning System Performance with CPU Hardware Counters

Monday Mar 16 / 05:05PM GMT

Counters are fundamental to monitoring: how many requests were processed, how many CPU-seconds consumed, how many bytes sent over a network. Very likely you are already monitoring your applications and operating systems via the hundreds or thousands of counters they expose.

Bryan Boreham

Distinguished Engineer @Grafana Labs, Member of the Prometheus Team, Expert in Distributed Systems and Computer Performance

Session Generative AI

Ontology‐Driven Observability: Building the E2E Knowledge Graph at Netflix Scale

Tuesday Mar 17 / 10:35AM GMT

As Netflix scales hundreds of client platforms, microservices, and infrastructure components, correlating user experience with system performance has become a hard data problem, not just an observability one.

Prasanna Vijayanathan

Engineer @Netflix

Renzo Sanchez-Silva

Engineer @Netflix