Low-latency data streaming technology and practices remain a hot topic among data engineers today. At its core, streaming promises to deliver data in near real time in order to provide snappy, data-driven user experiences. These experiences come in many forms, including low-latency updates to social news feeds, near-real-time payment fraud prevention, time-relevant recommender systems used in flash sales, self-driving car route planning, and more. Our need to stay engaged has made low-latency data streams a critical part of modern data architectures.
While it may seem trivial to get a data streaming POC up and running, productionizing such a system under strict SLAs with the aid of a lean engineering team requires making the right choices and learning from mistakes along the way. At Datazoom, we built a lossless streaming data system that guarantees sub-second (p95) event delivery at scale with better than three nines availability, where we measure availability in terms of the on-time delivery of events. Come to this talk to learn how you can build such a system soup-to-nuts.
Interview:
What is the focus of your work these days?
I currently serve as the Chief Architect and Head of Engineering at Datazoom, a company that offers a video data platform that captures video playback telemetry data. This data can be used to understand how customers experience and interact with video. At Datazoom, we build both client SDKs and a cloud-based analytics platform.
What’s the motivation for your talk?
In this talk, I explain how engineers can build a low-latency, high-fidelity data streaming system using open source software and public cloud technologies combined with recommended best practices. My talk focuses on the non-functional requirements (i.e., the "-ilities") of such a system, including but not limited to scalability, performance, reliability, observability, and availability.
How would you describe the persona and level of the target audience?
This talk will take a ground-up approach to building such a system. My talk requires little background knowledge beyond basic familiarity with various AWS technologies & Apache Kafka. The ideal target audience would be composed of engineers, ranging from beginner to intermediate, interested in building a high-fidelity streaming system.
What do you want this persona to walk away with from your presentation?
This talk will serve as an architect’s guide to building a high-fidelity streaming system. While it may leave out specific details for lack of time, it will provide enough information to get an architect 80% of the way to building a similar system.
What do you think is the next big disruption in software?
AI-managed data infrastructure: it is sorely needed to reduce the onerous burden of operating data infrastructure at scale.
Speaker
Sid Anand
Fellow, Cloud & Data Platform @Walmart, Apache Airflow Committer/PMC, Ex-Netflix, LinkedIn, eBay, Etsy, & PayPal
Sid recently joined Walmart (i.e. Walmart Global Tech) as a fellow to work on all things data. Prior to joining Walmart Global Tech, Sid served as the Chief Architect and Head of Engineering for Datazoom, where he and his team built high-fidelity, low-latency data streaming systems. Prior to joining Datazoom, Sid served as PayPal's Chief Data Engineer, where he helped build systems, platforms, teams, and processes, all with the aim of providing access to the hundreds of petabytes of data under PayPal's management. Prior to joining PayPal, Sid held senior technical positions at Netflix, LinkedIn, eBay, & Etsy, to name a few. He earned his BS and MS degrees in CS from Cornell University, focusing on Distributed Systems.
Outside of work, Sid advises early-stage companies and several conferences. Once an active committer on Apache Airflow, he is now mostly a fan.
Sid's body of work includes but is not limited to:
- The world's first cloud-based streaming video service -- he was the first engineer to work on the cloud at Netflix
- LinkedIn's Federated Search Typeahead (a.k.a. auto-complete)
- LinkedIn's (Big Data) Self-service Marketing Analytics tool
- PayPal's DBaaS - an internal self-service system to provision & manage heterogeneous databases
- PayPal's CDC - an internal self-service CDC system to stream DB updates to nearline applications
- eBay-over-Skype: following the Skype acquisition, he built a P2P version of eBay offers
- eBay's Best Match Search Ranking Engine powered by an In-Memory Database
- eBay's Fuzzy-match name/email Search
- Agari's Data Platform: Batch & Streaming Predictive Data Platform as a Service
- Datazoom's Platform: High-fidelity, Low-latency Streaming Data Platform as a Service