Ontology‐Driven Observability: Building the E2E Knowledge Graph at Netflix Scale

Abstract

As Netflix scales across hundreds of client platforms, microservices, and infrastructure components, correlating user experience with system performance has become a hard data problem, not just an observability one. Existing metrics, traces, and logs are siloed by system boundaries, making it slow and brittle to answer even basic questions like: “Is this user-visible regression caused by the client, the network, or a backend dependency?”

In this talk, we’ll walk through the design and implementation of E2EGraph: an end-to-end knowledge graph that models every Netflix user experience as a connected graph of users, clients, services, infrastructure, and their interactions. Each user session, client app, microservice, and network component is a node; requests, user interactions, and dependencies are edges enriched with attributes such as latency, error rates, QoE impact, versions, geo, and more. 
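
To make the data model concrete, below is a minimal Python sketch of how such nodes and edges might be represented; the class names, fields, and example values are illustrative assumptions, not the actual E2EGraph schema.

    # Illustrative only: class names, fields, and values are hypothetical,
    # not the production E2EGraph schema.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        id: str          # e.g. "session:abc123", "service:api-gateway"
        kind: str        # "user_session" | "client_app" | "microservice" | "network_component"
        attrs: dict = field(default_factory=dict)   # version, geo, device type, ...

    @dataclass
    class Edge:
        src: str         # source node id
        dst: str         # destination node id
        kind: str        # "request" | "user_interaction" | "dependency"
        attrs: dict = field(default_factory=dict)   # latency_ms, error_rate, qoe_impact, ...

    # A TV client session interacting with a backend service:
    nodes = [
        Node("session:abc123", "user_session", {"geo": "GB", "device": "tv"}),
        Node("client:tv-ui@9.4.1", "client_app", {"version": "9.4.1"}),
        Node("service:api-gateway", "microservice", {"region": "eu-west-1"}),
    ]
    edges = [
        Edge("session:abc123", "client:tv-ui@9.4.1", "user_interaction", {"action": "browse_lolomo"}),
        Edge("client:tv-ui@9.4.1", "service:api-gateway", "request", {"latency_ms": 412, "error_rate": 0.02}),
    ]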

We will focus on the data engineering challenges behind this system:

  • How we ingest heterogeneous data sources (client telemetry, server logs, traces, infra metrics, experiments, deployments) and normalize them into a unified ontology for observability.
  • How we design a domain ontology that encodes concepts like “user session,” “API call,” “deployment event,” “experimentation,” and “QoE regression,” and how that ontology enables consistent reasoning across the stack.
  • How we construct and maintain the knowledge graph at scale, including snapshotting the graph at regression time to support temporal comparison between “healthy” and “degraded” states (a simplified sketch of this comparison follows the list).
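
As a minimal illustration of the temporal-comparison idea, the Python sketch below diffs two hypothetical graph snapshots and flags edges whose latency regressed; the snapshot format, threshold, and function name are assumptions for demonstration, not the production implementation.

    # Hypothetical snapshot-diff sketch; format and threshold are assumed.
    def diff_snapshots(healthy: dict, degraded: dict, latency_ratio: float = 1.5) -> list:
        """Return edges whose latency regressed between two snapshots.

        Each snapshot maps an edge key (src, dst, kind) to an attribute dict.
        """
        regressions = []
        for key, after in degraded.items():
            before = healthy.get(key)
            if before is None:
                continue  # edge only appears in the degraded graph; handled separately
            if after.get("latency_ms", 0) > latency_ratio * before.get("latency_ms", 1):
                regressions.append({"edge": key, "before": before, "after": after})
        return regressions

    healthy = {("client:tv-ui", "service:api-gateway", "request"): {"latency_ms": 180}}
    degraded = {("client:tv-ui", "service:api-gateway", "request"): {"latency_ms": 410}}
    print(diff_snapshots(healthy, degraded))   # the request edge is flagged as regressed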

On top of this graph, we are building automatic Root Cause Analysis for SRE operations (AutoSRE) using a mixture‐of‐experts architecture:

  • A coordinator agent decomposes a question like “Why is TV UI lolomo TTR regressing in the latest version?” into tasks.
  • Specialized “expert” agents (metrics/Atlas, alerts/Radar, experiments/ABlaze, client platforms, events/deploys) query the knowledge graph via the shared ontology.
  • The coordinator then synthesizes these graph-backed insights to propose the most likely root causes, e.g., a specific client rollout, a misconfigured experiment, or a backend dependency regression (a toy sketch of this flow follows the list).
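
The toy sketch below illustrates this coordinator/expert flow; the class names, the hard-coded task decomposition, and the in-memory "graph" are invented for illustration and stand in for the real ontology-backed queries.

    # Hypothetical coordinator / expert-agent sketch; interfaces are invented.
    class ExpertAgent:
        def __init__(self, name, domain):
            self.name, self.domain = name, domain

        def investigate(self, task, graph):
            # A real agent would query the knowledge graph via the shared ontology;
            # here we look up pre-computed findings keyed by domain and task.
            return graph.get(self.domain, {}).get(task, [])

    class Coordinator:
        def __init__(self, experts):
            self.experts = experts

        def decompose(self, question):
            # Real decomposition would be model- or rule-driven; this is a stub.
            return ["recent_deploys", "active_experiments", "backend_regressions"]

        def analyze(self, question, graph):
            findings = []
            for task in self.decompose(question):
                for expert in self.experts:
                    findings.extend(expert.investigate(task, graph))
            # Rank candidate root causes by how often they were implicated.
            ranked = sorted(findings, key=findings.count, reverse=True)
            return list(dict.fromkeys(ranked))   # de-duplicate, keep rank order

    experts = [ExpertAgent("metrics", "atlas"), ExpertAgent("deploys", "events")]
    graph = {"events": {"recent_deploys": ["tv-ui 9.4.1 rollout"]},
             "atlas": {"backend_regressions": ["playback-api p99 latency regression"]}}
    print(Coordinator(experts).analyze("Why is TV UI lolomo TTR regressing?", graph))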

We’ll close with our roadmap for predictive and self-healing capabilities:

  • Using graph‐based models to predict issues before they materially impact QoE, by learning patterns of failing subgraphs, propagation paths, and risky combinations of versions and experiments.
  • Driving self-healing behaviors where detected or predicted problems can trigger automated mitigations, like targeted rollbacks, traffic shifting, feature flag changes, or capacity adjustments, guided by the knowledge encoded in the E2EGraph ontology (a sketch of such a policy follows the list).
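
A minimal sketch of what such an ontology-guided mitigation policy could look like is shown below; the issue categories, mitigation names, and confidence gate are illustrative assumptions rather than the actual self-healing rules.

    # Illustrative mitigation policy; categories, actions, and threshold are assumed.
    MITIGATIONS = {
        "client_rollout_regression": "rollback_client_version",
        "misconfigured_experiment": "disable_feature_flag",
        "backend_dependency_regression": "shift_traffic_to_healthy_region",
        "capacity_exhaustion": "scale_out_cluster",
    }

    def propose_mitigation(issue: dict, confidence_threshold: float = 0.8) -> dict:
        """Map a detected or predicted issue to an automated mitigation.

        `issue` is assumed to carry the root-cause category inferred from the
        graph (e.g. by a failing-subgraph model) plus a confidence score.
        """
        if issue["confidence"] < confidence_threshold:
            return {"action": "page_oncall", "reason": "confidence below threshold"}
        action = MITIGATIONS.get(issue["category"], "page_oncall")
        return {"action": action, "target": issue.get("target")}

    issue = {"category": "client_rollout_regression",
             "target": "tv-ui 9.4.1", "confidence": 0.92}
    print(propose_mitigation(issue))   # -> rollback_client_version on tv-ui 9.4.1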

Attendees will come away with a concrete blueprint for using knowledge graphs as a unifying data layer for observability, an understanding of how an ontology unlocks cross-domain reasoning and automatic RCA, and a view of how such a foundation can evolve toward predictive, self-healing infrastructure in large-scale distributed systems.


Speaker

Prasanna Vijayanathan

Engineer @Netflix

Prasanna Vijayanathan is an engineer in Netflix’s Consumer Engineering org, where he leads performance, observability, and QoE initiatives across client platforms, applying AI/ML to improve reliability and user experience at global scale. Previously, he worked at LinkedIn and Qualcomm on networking, platform performance, and large-scale telemetry. He contributes to industry standards in responsible AI, including IEEE initiatives on safety by design, and serves in non-profit leadership focused on technology and AI for social good.

Speaker

Renzo Sanchez-Silva

Engineer @Netflix

Renzo Sanchez-Silva is an engineer on the Observability Team at Netflix, where he has spent the past decade developing alerting and monitoring systems. Currently, he is spearheading AIOps initiatives aimed at introducing cutting-edge observability and remediation tools. Before joining Netflix, Renzo worked in data processing and web mapping at Apple. He holds a background in Pure Mathematics and conducted research at the University of New Mexico on applying ontologies to data repositories.

Date

Tuesday Mar 17 / 02:45PM GMT (50 minutes)

Location

Fleming (3rd Fl.)

From the same track

Session

Introducing Tansu.io -- Rethinking Kafka for Lean Operations

Tuesday Mar 17 / 10:35AM GMT

What if Kafka brokers were ephemeral, stateless and leaderless with durability delegated to a pluggable storage layer?

Peter Morgan

Founder @tansu.io

Session

From S3 to GPU in One Copy: Rethinking Data Loading for ML Training

Tuesday Mar 17 / 01:35PM GMT

ML training pipelines treat data as static. Teams spend weeks preprocessing datasets into WebDataset or TFRecords, and when they want to experiment with curriculum learning or data mixing, they reprocess everything from scratch.

Onur Satici

Staff Engineer @SpiralDB

Session

The Rise of the Streamhouse: Idea, Trade-Offs, and Evolution

Tuesday Mar 17 / 11:45AM GMT

Over the last decade, streaming architectures have largely been built around topic-centric primitives—logs, streams, and event pipelines—then stitched together with databases, caches, OLAP engines, and (increasingly) new serving systems.

Giannis Polyzos

Principal Streaming Architect @Ververica

Anton Borisov

Principal Data Architect @Fresha

Session

Building a Control Plane for Production AI

Tuesday Mar 17 / 03:55PM GMT

Details coming soon.

Session

Unconference: Modern Data Engineering

Tuesday Mar 17 / 05:05PM GMT