Wrangling Telemetry at Scale: A Guide to Self-Hosted Observability

Abstract

Observability is supposed to help you tame complexity, but your observability stack can quickly become just as complex as the systems it's meant to watch. For most teams, the answer is to pay someone else to deal with it. But bills grow, auditors ask awkward questions, and sometimes you simply run out of road with your SaaS provider. At that point, you have to run it yourself.

Drawing on a decade of experience building, maintaining, and operating self-hosted monitoring and observability stacks, in this talk I will explain what it actually means to run your own stack: what the tooling landscape looks like, where it shines, and where the open source world lags behind the SaaS experience.

Along the way, I'll cover options for every telemetry type, with concrete recommendations on what to use and what to avoid, insights on how to tie them together into one coherent debugging canvas, and a look at where the observability world is going next.

Interview:

What is your session about, and why is it important for senior software developers?

My session is about self-hosted observability, and the options and unique challenges it presents. More than that, it's an insight into what's going on under the hood of an observability system, which I aim to contextualise so that engineers can better understand and work with their telemetry going forward.

Why is it critical for software leaders to focus on this topic right now, as we head into 2026?

Especially in the world of AI, our systems are becoming more complex by the day. Observability is the answer to tackling that complexity, but you have to do it right, which means knowing how to get the best out of your telemetry systems.

What are the common challenges developers and architects face in this area?

Lots of developers struggle with tying telemetry together into a cohesive debugging strategy. Try as we might to move past it, the "three pillars" idea persists, and it is suboptimal for the modern distributed system.

What's one thing you hope attendees will implement immediately after your talk?

Ways to tie together telemetry of different types. In particular, "exemplars" are an underused aspect of modern metrics systems: they attach a trace reference to a metric observation, giving you a direct jump from an anomalous data point to a representative trace.


Speaker

Colin Douch

Site Reliability Engineer @DuckDuckGo

Colin currently works as an SRE at DuckDuckGo, orchestrating and inventing solutions to better serve DuckDuckGo's increasingly large portfolio of services, which handle search queries and AI chats from around the world. Formerly head of the Observability team at Cloudflare, he has been working, advising, and researching in the monitoring and observability space for close to 10 years, gaining a wide perspective on the difficulties that modern companies, big and small, face in properly introspecting their systems. Originally from New Zealand, he now lives in the UK and regularly speaks at conferences to share insights from the practical side of observability engineering.

Date

Tuesday Mar 17 / 03:55PM GMT (50 minutes)

Location

Windsor (5th Fl.)

From the same track

Session Sociotechnical Leadership

Orienting, Understanding, Playing, Thriving: Debugging your Organisation

Tuesday Mar 17 / 10:35AM GMT

Debugging is both an art and a science. But more than that, it's an activity undertaken with deep intention: to understand and improve your systems. In the purely technical realm, we have an extraordinary range of tooling and techniques that can help us tackle this problem.

Hazel Weakly

Fellow @Nivenly Foundation; Director, Haskell Foundation; Experienced Leader Focusing on Organizational Change, Developer Experience, and Resilience Engineering

Session Distributed Tracing

How Eve Online Leverages Head Based Sampling to Observe "Fun"

Tuesday Mar 17 / 11:45AM GMT

A unique pattern in video game software is real-time interaction that expresses the personality of users. Here we will talk about how we instrument the universe of New Eden to identify the traffic that matters, even the "fun" parts!

Nicholas Herring

Technical Director, Eve Online @CCP Games, Refiner of Internet Spaceships and Explorer of Feral Gordian Knots of Python

Session

Can Claude Fix Itself? Using LLMs for Incident Response

Tuesday Mar 17 / 02:45PM GMT

Can you throw an LLM at a production incident and expect useful results? A candid look from someone who runs a distributed AI system and reaches for Claude before reaching for a dashboard. Surprises, failures, and why the answer matters for every engineer carrying a pager.

Alex Palcuie

Member of Technical Staff in AI Reliability Engineering @Anthropic, Previously Staff Site Reliability Engineer on Google Cloud Platform

Session Observability

Are We All on the Same Page? Let’s Fix That - With AI Assistance

Tuesday Mar 17 / 05:05PM GMT

In distributed systems, incidents rarely fail because of missing signals - they fail because the right people aren’t mobilised quickly enough, and teams struggle to build a shared understanding under pressure.

Luis Mineiro

Director of Digital Foundation @ASOS.com, SRE Charmer, Previously @Delivery Hero and @Zalando

Session

Unconference: Debugging Distributed Systems

Tuesday Mar 17 / 01:35PM GMT