Ontology‐Driven Observability: Building the E2E Knowledge Graph at Netflix Scale

Abstract

As Netflix scales across hundreds of client platforms, microservices, and infrastructure components, correlating user experience with system performance has become a hard data problem, not just an observability one. Existing metrics, traces, and logs are siloed by system boundaries, making it slow and brittle to answer even basic questions like: “Is this user-visible regression caused by the client, the network, or a backend dependency?”

In this talk, we’ll walk through the design and implementation of E2EGraph: an end-to-end knowledge graph that models every Netflix user experience as a connected graph of users, clients, services, infrastructure, and their interactions. Each user session, client app, microservice, and network component is a node; requests, user interactions, and dependencies are edges enriched with attributes such as latency, error rates, QoE impact, versions, geo, and more. 
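
To make the data model concrete, below is a minimal Python sketch of how such nodes and edges might be represented; the class names, fields, and example values are illustrative assumptions, not the actual E2EGraph schema.

    # Illustrative only: class names, fields, and values are hypothetical,
    # not the production E2EGraph schema.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        id: str          # e.g. "session:abc123", "service:api-gateway"
        kind: str        # "user_session" | "client_app" | "microservice" | "network_component"
        attrs: dict = field(default_factory=dict)   # version, geo, device type, ...

    @dataclass
    class Edge:
        src: str         # source node id
        dst: str         # destination node id
        kind: str        # "request" | "user_interaction" | "dependency"
        attrs: dict = field(default_factory=dict)   # latency_ms, error_rate, qoe_impact, ...

    # A TV client session interacting with a backend service:
    nodes = [
        Node("session:abc123", "user_session", {"geo": "GB", "device": "tv"}),
        Node("client:tv-ui@9.4.1", "client_app", {"version": "9.4.1"}),
        Node("service:api-gateway", "microservice", {"region": "eu-west-1"}),
    ]
    edges = [
        Edge("session:abc123", "client:tv-ui@9.4.1", "user_interaction", {"action": "browse_lolomo"}),
        Edge("client:tv-ui@9.4.1", "service:api-gateway", "request", {"latency_ms": 412, "error_rate": 0.02}),
    ]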

We will focus on the data engineering challenges behind this system:

  • How we ingest heterogeneous data sources (client telemetry, server logs, traces, infra metrics, experiments, deployments) and normalize them into a unified ontology for observability.
  • How we design a domain ontology that encodes concepts like “user session,” “API call,” “deployment event,” “experimentation,” and “QoE regression,” and how that ontology enables consistent reasoning across the stack.
  • How we construct and maintain the knowledge graph at scale, including snapshotting the graph at regression time to support temporal comparison between “healthy” and “degraded” states (a simplified sketch of this comparison follows the list).
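
As a minimal illustration of the temporal-comparison idea, the Python sketch below diffs two hypothetical graph snapshots and flags edges whose latency regressed; the snapshot format, threshold, and function name are assumptions for demonstration, not the production implementation.

    # Hypothetical snapshot-diff sketch; format and threshold are assumed.
    def diff_snapshots(healthy: dict, degraded: dict, latency_ratio: float = 1.5) -> list:
        """Return edges whose latency regressed between two snapshots.

        Each snapshot maps an edge key (src, dst, kind) to an attribute dict.
        """
        regressions = []
        for key, after in degraded.items():
            before = healthy.get(key)
            if before is None:
                continue  # edge only appears in the degraded graph; handled separately
            if after.get("latency_ms", 0) > latency_ratio * before.get("latency_ms", 1):
                regressions.append({"edge": key, "before": before, "after": after})
        return regressions

    healthy = {("client:tv-ui", "service:api-gateway", "request"): {"latency_ms": 180}}
    degraded = {("client:tv-ui", "service:api-gateway", "request"): {"latency_ms": 410}}
    print(diff_snapshots(healthy, degraded))   # the request edge is flagged as regressed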

On top of this graph, we are building automatic Root Cause Analysis for SRE operations (AutoSRE) using a mixture‐of‐experts architecture:

  • A coordinator agent decomposes a question like “Why is TV UI lolomo TTR regressing in the latest version?” into tasks.
  • Specialized “expert” agents (metrics/Atlas, alerts/Radar, experiments/ABlaze, client platforms, events/deploys) query the knowledge graph via the shared ontology.
  • The coordinator then synthesizes these graph-backed insights to propose the most likely root causes, e.g., a specific client rollout, a misconfigured experiment, or a backend dependency regression (a toy sketch of this flow follows the list).
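
The toy sketch below illustrates this coordinator/expert flow; the class names, the hard-coded task decomposition, and the in-memory "graph" are invented for illustration and stand in for the real ontology-backed queries.

    # Hypothetical coordinator / expert-agent sketch; interfaces are invented.
    class ExpertAgent:
        def __init__(self, name, domain):
            self.name, self.domain = name, domain

        def investigate(self, task, graph):
            # A real agent would query the knowledge graph via the shared ontology;
            # here we look up pre-computed findings keyed by domain and task.
            return graph.get(self.domain, {}).get(task, [])

    class Coordinator:
        def __init__(self, experts):
            self.experts = experts

        def decompose(self, question):
            # Real decomposition would be model- or rule-driven; this is a stub.
            return ["recent_deploys", "active_experiments", "backend_regressions"]

        def analyze(self, question, graph):
            findings = []
            for task in self.decompose(question):
                for expert in self.experts:
                    findings.extend(expert.investigate(task, graph))
            # Rank candidate root causes by how often they were implicated.
            ranked = sorted(findings, key=findings.count, reverse=True)
            return list(dict.fromkeys(ranked))   # de-duplicate, keep rank order

    experts = [ExpertAgent("metrics", "atlas"), ExpertAgent("deploys", "events")]
    graph = {"events": {"recent_deploys": ["tv-ui 9.4.1 rollout"]},
             "atlas": {"backend_regressions": ["playback-api p99 latency regression"]}}
    print(Coordinator(experts).analyze("Why is TV UI lolomo TTR regressing?", graph))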

We’ll close with our roadmap for predictive and self-healing capabilities:

  • Using graph‐based models to predict issues before they materially impact QoE, by learning patterns of failing subgraphs, propagation paths, and risky combinations of versions and experiments.
  • Driving self-healing behaviors where detected or predicted problems can trigger automated mitigations, like targeted rollbacks, traffic shifting, feature flag changes, or capacity adjustments, guided by the knowledge encoded in the E2EGraph ontology (a sketch of such a policy follows the list).
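
A minimal sketch of what such an ontology-guided mitigation policy could look like is shown below; the issue categories, mitigation names, and confidence gate are illustrative assumptions rather than the actual self-healing rules.

    # Illustrative mitigation policy; categories, actions, and threshold are assumed.
    MITIGATIONS = {
        "client_rollout_regression": "rollback_client_version",
        "misconfigured_experiment": "disable_feature_flag",
        "backend_dependency_regression": "shift_traffic_to_healthy_region",
        "capacity_exhaustion": "scale_out_cluster",
    }

    def propose_mitigation(issue: dict, confidence_threshold: float = 0.8) -> dict:
        """Map a detected or predicted issue to an automated mitigation.

        `issue` is assumed to carry the root-cause category inferred from the
        graph (e.g. by a failing-subgraph model) plus a confidence score.
        """
        if issue["confidence"] < confidence_threshold:
            return {"action": "page_oncall", "reason": "confidence below threshold"}
        action = MITIGATIONS.get(issue["category"], "page_oncall")
        return {"action": action, "target": issue.get("target")}

    issue = {"category": "client_rollout_regression",
             "target": "tv-ui 9.4.1", "confidence": 0.92}
    print(propose_mitigation(issue))   # -> rollback_client_version on tv-ui 9.4.1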

Attendees will come away with a concrete blueprint for using knowledge graphs as a unifying data layer for observability, an understanding of how an ontology unlocks cross-domain reasoning and automatic RCA, and a view of how such a foundation can evolve toward predictive, self-healing infrastructure in large-scale distributed systems.


Speaker

Prasanna Vijayanathan

Engineer @Netflix

Prasanna Vijayanathan is an engineer in Netflix’s Consumer Engineering org, where he leads performance, observability, and QoE initiatives across client platforms, applying AI/ML to improve reliability and user experience at global scale. Previously, he worked at LinkedIn and Qualcomm on networking, platform performance, and large-scale telemetry. He contributes to industry standards in responsible AI, including IEEE initiatives on safety by design, and serves in non-profit leadership focused on technology and AI for social good.

Speaker

Renzo Sanchez-Silva

Engineer @Netflix

Renzo Sanchez-Silva is an engineer on the Observability Team at Netflix, where he has spent the past decade developing alerting and monitoring systems. Currently, he is spearheading AIOps initiatives aimed at introducing cutting-edge observability and remediation tools. Before joining Netflix, Renzo worked in data processing and web mapping at Apple. He holds a background in Pure Mathematics and conducted research at the University of New Mexico on applying ontologies to data repositories.

Date

Tuesday Mar 17 / 02:45PM GMT (50 minutes)

Location

Fleming (3rd Fl.)

From the same track

Session

Introducing Tansu.io -- Rethinking Kafka for Lean Operations

Tuesday Mar 17 / 10:35AM GMT

What if Kafka brokers were ephemeral, stateless and leaderless with durability delegated to a pluggable storage layer?

Peter Morgan

Founder @tansu.io

Session

From S3 to GPU in One Copy: Rethinking Data Loading for ML Training

Tuesday Mar 17 / 01:35PM GMT

ML training pipelines treat data as static. Teams spend weeks preprocessing datasets into WebDataset or TFRecords, and when they want to experiment with curriculum learning or data mixing, they reprocess everything from scratch.

Onur Satici

Staff Engineer @SpiralDB

Session

The Rise of the Streamhouse: Idea, Trade-Offs, and Evolution

Tuesday Mar 17 / 11:45AM GMT

Over the last decade, streaming architectures have largely been built around topic-centric primitives—logs, streams, and event pipelines—then stitched together with databases, caches, OLAP engines, and (increasingly) new serving systems.

Giannis Polyzos

Principal Streaming Architect @Ververica

Anton Borisov

Principal Data Architect @Fresha

Session

Building a Control Plane for Production AI

Tuesday Mar 17 / 03:55PM GMT

Details coming soon.

Session

Unconference: Modern Data Engineering

Tuesday Mar 17 / 05:05PM GMT