Abstract
As Netflix scales across hundreds of client platforms, microservices, and infrastructure components, correlating user experience with system performance has become a hard data problem, not just an observability one. Existing metrics, traces, and logs are siloed by system boundaries, making it slow and brittle to answer even basic questions like: “Is this user-visible regression caused by the client, the network, or a backend dependency?”
In this talk, we’ll walk through the design and implementation of E2EGraph: an end-to-end knowledge graph that models every Netflix user experience as a connected graph of users, clients, services, infrastructure, and their interactions. Each user session, client app, microservice, and network component is a node; requests, user interactions, and dependencies are edges enriched with attributes such as latency, error rates, QoE impact, versions, geo, and more.
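The node-and-edge model described above can be illustrated with a small property graph. This is a minimal sketch, not the E2EGraph implementation: the `Node`, `Edge`, and `Graph` classes, the node ids, and the attribute names (`latency_ms`, `client_version`, `error_rate`) are all hypothetical and chosen only to mirror the concepts named in the abstract.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    id: str
    kind: str  # e.g. "user_session", "client_app", "microservice", "network"


@dataclass
class Edge:
    src: str
    dst: str
    kind: str  # e.g. "interaction", "request", "dependency"
    attrs: dict = field(default_factory=dict)  # latency, error rate, version, geo, ...


class Graph:
    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node):
        self.nodes[node.id] = node

    def add_edge(self, edge):
        # Reject edges whose endpoints have not been registered as nodes.
        if edge.src not in self.nodes or edge.dst not in self.nodes:
            raise KeyError(f"unknown endpoint in edge {edge.src} -> {edge.dst}")
        self.edges.append(edge)

    def downstream(self, start):
        """Return the ids of every node reachable from `start`, in BFS order."""
        seen, order, queue = {start}, [], deque([start])
        while queue:
            current = queue.popleft()
            for edge in self.edges:
                if edge.src == current and edge.dst not in seen:
                    seen.add(edge.dst)
                    order.append(edge.dst)
                    queue.append(edge.dst)
        return order


# Hypothetical example: one session flowing through a client and two services.
g = Graph()
for node in (
    Node("session-1", "user_session"),
    Node("tv-ui", "client_app"),
    Node("api-gateway", "microservice"),
    Node("catalog", "microservice"),
):
    g.add_node(node)
g.add_edge(Edge("session-1", "tv-ui", "interaction"))
g.add_edge(Edge("tv-ui", "api-gateway", "request", {"latency_ms": 120, "client_version": "1.2.3"}))
g.add_edge(Edge("api-gateway", "catalog", "dependency", {"error_rate": 0.01}))
print(g.downstream("session-1"))
```

Walking downstream from a session node is the kind of traversal that lets a single query cross client, network, and backend boundaries; a production system would use a real graph store rather than an in-memory edge list.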
We will focus on the data engineering challenges behind this system:
- How we ingest heterogeneous data sources (client telemetry, server logs, traces, infra metrics, experiments, deployments) and normalize them into a unified ontology for observability.
- How we design a domain ontology that encodes concepts like “user session,” “API call,” “deployment event,” “experimentation,” and “QoE regression,” and how that ontology enables consistent reasoning across the stack.
- How we construct and maintain the knowledge graph at scale, including snapshotting the graph at regression time to support temporal comparison between “healthy” and “degraded” states.
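The temporal comparison between “healthy” and “degraded” snapshots can be sketched as a diff over edge attributes. The snapshot shape (a dict keyed by `(src, dst)` pairs), the `latency_ms` metric name, and the `min_ratio` threshold are assumptions made for illustration; the abstract does not specify how snapshots are stored or compared.

```python
import copy


def take_snapshot(edge_attrs):
    """Freeze the current edge attributes so later updates do not alter them."""
    return copy.deepcopy(edge_attrs)


def diff_snapshots(healthy, degraded, metric="latency_ms", min_ratio=1.5):
    """Return (edge, ratio) pairs where `metric` grew by at least `min_ratio`."""
    regressions = []
    for edge, after in degraded.items():
        before = healthy.get(edge)
        if before is None:
            continue  # edge did not exist in the healthy snapshot
        old, new = before.get(metric), after.get(metric)
        if old is None or new is None or old <= 0:
            continue  # metric missing or unusable as a baseline
        ratio = new / old
        if ratio >= min_ratio:
            regressions.append((edge, ratio))
    # Largest relative regression first.
    regressions.sort(key=lambda item: item[1], reverse=True)
    return regressions


# Hypothetical live state, keyed by (source node, destination node).
live = {
    ("tv-ui", "api-gateway"): {"latency_ms": 100},
    ("api-gateway", "catalog"): {"latency_ms": 100},
}
healthy = take_snapshot(live)

# Later, a regression is detected and the live graph has changed.
live[("tv-ui", "api-gateway")]["latency_ms"] = 300
live[("api-gateway", "catalog")]["latency_ms"] = 110
degraded = take_snapshot(live)

print(diff_snapshots(healthy, degraded))
```

Copying the state at snapshot time is what keeps the healthy baseline intact while the live graph keeps changing; here only the edge whose latency tripled crosses the threshold.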
On top of this graph, we are building automatic Root Cause Analysis for SRE operations (AutoSRE) using a mixture‐of‐experts architecture:
- A coordinator agent decomposes a question like “Why is TV UI lolomo TTR regressing in the latest version?” into tasks.
- Specialized “expert” agents (metrics/Atlas, alerts/Radar, experiments/ABlaze, client platforms, events/deploys) query the knowledge graph via the shared ontology.
- The coordinator then synthesizes these graph-backed insights to propose the most likely root causes, such as a specific client rollout, a misconfigured experiment, or a backend dependency regression.
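The coordinator-and-experts flow in the steps above can be sketched as a simple fan-out and rank. This is an illustrative toy, not the AutoSRE design: the `Finding` type, the expert functions, the shape of `facts`, and the fixed scores are all hypothetical, and the real agents presumably query the knowledge graph rather than a hand-built dict.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    expert: str
    cause: str
    score: float  # hypothetical confidence in [0, 1]


Expert = Callable[[dict], List[Finding]]


def deploy_expert(facts):
    # Flags deployments that landed inside the regression window.
    return [
        Finding("deploys", f"rollout of {d['target']} {d['version']}", 0.8)
        for d in facts.get("deploys", [])
        if d["in_window"]
    ]


def experiment_expert(facts):
    # Flags experiments that were active inside the regression window.
    return [
        Finding("experiments", f"experiment {e['name']}", 0.7)
        for e in facts.get("experiments", [])
        if e["in_window"]
    ]


def dependency_expert(facts):
    # Flags backend edges whose latency ratio exceeds an assumed threshold.
    return [
        Finding("dependencies", f"latency regression on {edge}", 0.6)
        for edge, ratio in facts.get("latency_ratios", {}).items()
        if ratio >= 1.5
    ]


def coordinate(facts, experts):
    """Ask every expert for findings, then rank them by score, highest first."""
    findings = []
    for expert in experts:
        findings.extend(expert(facts))
    return sorted(findings, key=lambda f: f.score, reverse=True)


# Hypothetical facts that would, in practice, come from graph queries.
facts = {
    "deploys": [{"target": "tv-ui", "version": "v2", "in_window": True}],
    "experiments": [{"name": "row-layout-test", "in_window": False}],
    "latency_ratios": {"api-gateway->catalog": 2.0},
}
ranked = coordinate(facts, [deploy_expert, experiment_expert, dependency_expert])
for finding in ranked:
    print(finding.score, finding.expert, finding.cause)
```

Because every expert returns the same `Finding` shape, the coordinator can merge and rank results without knowing anything about each expert's data source, which is the role the abstract assigns to the shared ontology.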
We’ll close with our roadmap for predictive and self-healing capabilities:
- Using graph‐based models to predict issues before they materially impact QoE, by learning patterns of failing subgraphs, propagation paths, and risky combinations of versions and experiments.
- Driving self-healing behaviors where detected or predicted problems can trigger automated mitigations, like targeted rollbacks, traffic shifting, feature flag changes, or capacity adjustments, guided by the knowledge encoded in the E2EGraph ontology.
Attendees will come away with a concrete blueprint for using knowledge graphs as a unifying data layer for observability, an understanding of how an ontology unlocks cross-domain reasoning and Auto RCA, and a view of how such a foundation can evolve toward predictive, self-healing infrastructure in large-scale distributed systems.
Speaker
Prasanna Vijayanathan
Engineer @Netflix
Prasanna Vijayanathan is an engineer in Netflix’s Consumer Engineering org, where he leads performance, observability, and QoE initiatives across client platforms, using AI/ML to improve reliability and user experience at global scale. Previously, he worked at LinkedIn and Qualcomm across networking, platform performance, and large-scale telemetry. He contributes to industry standards in responsible AI, including IEEE initiatives on safety by design, and serves in non-profit leadership focused on technology and AI for social good.
Speaker
Renzo Sanchez-Silva
Engineer @Netflix
Renzo Sanchez-Silva is an engineer on the Observability Team at Netflix, where he has spent the past decade developing alerting and monitoring systems. Currently, he is spearheading AIOps initiatives aimed at introducing cutting-edge observability and remediation tools. Before joining Netflix, Renzo worked in data processing and web mapping at Apple. He holds a background in Pure Mathematics and conducted research at the University of New Mexico on applying ontologies to data repositories.