Summary

Disclaimer: This summary has been generated by AI. It is experimental, and feedback is welcomed. Please reach out to info@qconlondon.com with any comments or concerns.

Title: From Confusion to Clarity: Advanced Observability Strategies for Media Workflows at Netflix

Speakers: Sujana Sooreddy and Naveen Mareddy

Overview:

Managing media workflows at Netflix involves significant complexity due to millions of workflow executions and substantial computing costs. Traditional observability tools, designed primarily for RPC-style services, are insufficient for Netflix's media workflows, which require observability at scale for asynchronous, long-running tasks. To address these challenges, Netflix developed a stream-processing pipeline for enhanced observability, transforming operations from reactive troubleshooting to proactive decision-making.

Key Strategies:

Near Real-time Insights: Implementing quick event processing to identify errors promptly and reduce wasted compute resources.
Optimal Rollup Strategies: Consolidating millions of low-level events into actionable business insights through pre-aggregation and event collapsing.
Opinionated Tagging Taxonomy: Defining a tagging taxonomy so that all business metrics are expressed consistently across the observability platform.
Enabling ROI Analysis: Facilitating deeper analysis of compute usage and latency impacts to optimize feature development strategies.
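
To make the tagging idea concrete, here is a minimal sketch of how a taxonomy could be enforced at event-ingestion time. The talk page does not describe Netflix's actual taxonomy, so the tag names, allowed values, and the validate_tags helper are all hypothetical.

```python
# Hypothetical taxonomy: these tag names and values are illustrative
# assumptions, not the taxonomy used at Netflix.
REQUIRED_TAGS = {"workflow_type", "title_id", "business_use_case"}
ALLOWED_VALUES = {
    "business_use_case": {"new_delivery", "redelivery", "experiment"},
}


def validate_tags(tags: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the tags conform."""
    problems = []
    # Every event must carry the full set of required tags.
    for name in sorted(REQUIRED_TAGS - tags.keys()):
        problems.append(f"missing tag: {name}")
    # Tags with a closed vocabulary must use one of the allowed values.
    for name, allowed in ALLOWED_VALUES.items():
        if name in tags and tags[name] not in allowed:
            problems.append(f"invalid value for {name}: {tags[name]}")
    return problems
```

Rejecting or flagging non-conforming events at the edge is one way to keep downstream metrics comparable; a real system might instead normalize or enrich tags.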

Implementation Details:

The observability platform utilizes domain-specific events, distributed tracing, and consistent tagging for a comprehensive view of workloads.
A stream-processing pipeline handles billions of events in real time, enabling rapid insights and on-the-fly aggregations.
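
As an illustration of the rollup strategies mentioned above (pre-aggregation and event collapsing), the following is a small in-memory sketch rather than a real stream processor. The event fields (task_id, ts, status, cpu_hours, workflow_type, title_id) are assumptions made for the example, and cpu_hours is assumed to be cumulative per task.

```python
from collections import defaultdict


def collapse(events):
    """Event collapsing: keep only the latest event seen for each task."""
    latest = {}
    for e in events:
        prev = latest.get(e["task_id"])
        if prev is None or e["ts"] >= prev["ts"]:
            latest[e["task_id"]] = e
    return list(latest.values())


def rollup(events):
    """Pre-aggregate collapsed task events per (workflow_type, title_id)."""
    totals = defaultdict(lambda: {"cpu_hours": 0.0, "tasks": 0, "failed": 0})
    for e in collapse(events):
        key = (e["workflow_type"], e["title_id"])
        totals[key]["cpu_hours"] += e["cpu_hours"]
        totals[key]["tasks"] += 1
        totals[key]["failed"] += e["status"] == "FAILED"
    return dict(totals)
```

Collapsing first means intermediate status updates never reach storage, and the rollup keeps one small record per business key instead of one per low-level event.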

Challenges and Solutions:

Handling the complexity and volume of trace spans and microservice calls requires improved trace processing and visualization strategies.
Efforts include introducing a request ID for better tracking and optimizing data storage and retrieval for faster execution graph loading.
Custom visualization and stream processing techniques are employed to manage scalability and enhance real-time analytics capabilities.

Future Outlook:

The team aims to integrate connected data insights across the media production lifecycle, from initial pitch to streaming, and continue enhancing observability capabilities to support business decision-making and infrastructure scalability.

Conclusion: By adopting sophisticated observability strategies, Netflix enhances its ability to manage large-scale media workflows more effectively and efficiently, driving business value through improved operational insights.

This is the end of the AI-generated content.

Abstract

Managing media workflows at Netflix scale is both thrilling and daunting. With millions of workflow executions across hundreds of types and over 500 million CPU hours consumed quarterly, costs can skyrocket, and encoding issues can disrupt the streaming experience. The challenge is immense: ensuring the timely delivery of high-resolution encodes, avoiding costly codec bugs, supporting last-moment redeliveries, and identifying bottlenecks before they drain compute resources. How do we navigate this complex system without spiraling into budget and delay disasters? This isn't just about fixing bugs faster anymore. It's about observability driving real business value. Imagine instantly knowing the true cost of encoding each movie, or precisely tracking redelivery metrics that directly impact revenue.

We confronted these challenges directly and discovered that traditional observability tools, designed primarily for RPC-style services, were inadequate for media workflows. We required observability at scale to support asynchronous media workflows with long-running tasks. By embracing domain-specific events, distributed tracing, and consistent tagging, we achieved a comprehensive view of our users' workloads. We developed a stream-processing pipeline that processes events from various parts of media workflows and collates them into actionable insights. This powers our observability platform, capable of handling billions of events in real-time, enabling rapid insights and on-the-fly aggregations.

In this talk, we’ll cover the following aspects of how we built observability for long-running, distributed, and high-throughput systems, and how you can apply these learnings:

Near real-time insights: Learn how to process events promptly to meet the monitoring needs of low-latency encoding. Discover techniques to enable users to catch bugs sooner, limiting wasted compute on encodes known to fail.
Optimal rollup strategies: Explore how to consolidate millions of low-level events into hundreds of business insight events. We'll share techniques like pre-aggregation and event collapsing to minimize storage and efficiently support top queries.
Opinionated tagging taxonomy: Understand the importance of a defined tagging taxonomy and how it ensures all business metrics are expressed consistently within your observability platform.
Enabling ROI analysis for feature development: See how to facilitate long-pole analysis, gain insights into compute usage, and understand latency implications for better ROI analysis of your feature development.
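
The abstract does not define how long-pole analysis is computed. One common reading is finding the longest chain of dependent tasks in a workflow, since that chain bounds end-to-end latency. The sketch below takes that interpretation; the long_pole function, its inputs, and the task names in the usage note are hypothetical, and the dependency graph is assumed to be acyclic.

```python
from functools import lru_cache


def long_pole(durations, deps):
    """Return (total_duration, path) for the longest dependency chain.

    durations maps task name -> duration; deps maps task name -> the
    tasks it waits on. Both shapes are assumptions for this sketch.
    """
    @lru_cache(maxsize=None)
    def best(task):
        # Longest chain ending at `task`, including its own duration.
        upstream = [best(d) for d in deps.get(task, ())]
        total, path = max(upstream, default=(0.0, ()))
        return total + durations[task], path + (task,)

    return max(best(t) for t in durations)
```

For example, with an inspection step of 2.0, a video encode of 30.0, an audio encode of 5.0, and a packaging step of 3.0 that waits on both encodes, the long pole runs through the video encode, which is where optimization effort would pay off first.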

By the end of this session, attendees will have concrete strategies to implement effective observability, transforming operations from reactive firefighting to proactive decision-making. Get ready to move from panic to clear, actionable insights, bringing clarity and control to your own large-scale systems!

Speaker

Sujana Sooreddy

Software Engineer @Netflix - Building High Scale Observability Solutions

Sujana Sooreddy is a software engineer specializing in distributed and asynchronous processing systems at scale. She is currently a key member of Netflix's Content Infrastructure and Platform team, focusing on insights and experiences. Her expertise includes building rule engines, observability tooling, and SLA monitors. Prior to joining Netflix, she gained valuable experience at startups, allowing her to excel in all aspects of engineering.

Speaker

Naveen Mareddy

Staff Engineer @Netflix, 20+ years in Software Engineering, Creator of MediaInfra Meetup, Speaker, Mentor

Naveen Mareddy is a Senior Staff Engineer in Netflix's Content Infrastructure Solutions (CIS) group, where he works at the intersection of media processing platforms and large-scale distributed cloud computing systems. His team is responsible for building and managing the infrastructure that powers the encoding of various media assets, including movies, TV shows, trailers, ads, and image artwork, to create seamless viewing experiences for over 260 million Netflix users worldwide.

With a broad background in large-scale computing platforms, Naveen is passionate about simplifying complex workflows to offer a straightforward user experience through intelligent abstractions. His innovative approach and commitment to improving media processing infrastructure make him a valuable contributor to Netflix’s mission of entertaining the world. Naveen’s work not only enhances the technical capabilities of the CIS group but also significantly contributes to Netflix's vision of delivering high-quality content efficiently and effectively.
