Modern Data Engineering & Architectures

The online world we interact with today is increasingly powered by data and by insights extracted from that data. Our ever-growing thirst for data insights and data-driven behavior (e.g. ML-based systems) is driving our industry to collect data more often and from an increasingly varied set of sources. As data volumes grow, scale becomes a challenge. To complicate matters further, customers want reliable access to high-quality data and insights, which adds availability and data quality to our list of requirements. More often than not, customers require low latency as well, usually meaning the time it takes to convert raw data into usable insights or production-grade models. Last but not least, access patterns and use cases dictate the form data will take when being served!

Depending on how the data will be used, the medium used to store and serve it will vary widely. OLTP/OLAP DBs, caches, object stores, search engines, graph DBs, data streams, vector DBs, and the like represent the many forms data takes to suit its many uses. Come to this track to learn about new technologies, practices, and trends shaping the way you will work with data.


From this track

Session

Introducing Tansu.io -- Rethinking Kafka for Lean Operations

Tuesday Mar 17 / 10:35AM GMT

What if Kafka brokers were ephemeral, stateless and leaderless with durability delegated to a pluggable storage layer?

Peter Morgan

Founder @tansu.io

Session

The Rise of the Streamhouse: Idea, Trade-Offs, and Evolution

Tuesday Mar 17 / 11:45AM GMT

Over the last decade, streaming architectures have largely been built around topic-centric primitives—logs, streams, and event pipelines—then stitched together with databases, caches, OLAP engines, and (increasingly) new serving systems.

Giannis Polyzos

Principal Streaming Architect @Ververica

Anton Borisov

Principal Data Architect @Fresha

Session

From S3 to GPU in One Copy: Rethinking Data Loading for ML Training

Tuesday Mar 17 / 01:35PM GMT

ML training pipelines treat data as static. Teams spend weeks preprocessing datasets into WebDataset or TFRecords, and when they want to experiment with curriculum learning or data mixing, they reprocess everything from scratch.

Onur Satici

Staff Engineer @SpiralDB

Session

Ontology‐Driven Observability: Building the E2E Knowledge Graph at Netflix Scale

Tuesday Mar 17 / 02:45PM GMT

As Netflix scales hundreds of client platforms, microservices, and infrastructure components, correlating user experience with system performance has become a hard data problem, not just an observability one.

Prasanna Vijayanathan

Engineer @Netflix

Renzo Sanchez-Silva

Engineer @Netflix

Session

Building a Control Plane for Production AI

Tuesday Mar 17 / 03:55PM GMT

Details coming soon.

Session

Unconference: Modern Data Engineering

Tuesday Mar 17 / 05:05PM GMT

Track Host

Sid Anand

Fellow, Cloud & Data Platform @Walmart, Apache Airflow Committer/PMC, Ex-Netflix, LinkedIn, eBay, Etsy, & PayPal

Sid recently joined Walmart (i.e. Walmart Global Tech) as a fellow to work on all things data. Prior to joining Walmart Global Tech, Sid served as the Chief Architect and Head of Engineering for Datazoom, where he and his team built high-fidelity, low-latency data streaming systems. Prior to joining Datazoom, Sid served as PayPal's Chief Data Engineer, where he helped build systems, platforms, teams, and processes, all with the aim of building access to the hundreds of petabytes of data under PayPal's management. Prior to joining PayPal, Sid held senior technical positions at Netflix, LinkedIn, eBay, and Etsy, to name a few. He earned his BS and MS degrees in CS from Cornell University, focusing on Distributed Systems.

Outside of work, Sid advises early-stage companies and several conferences. Once an active committer on Apache Airflow, he is now mostly a fan.

Sid's body of work includes, but is not limited to:

  • The world's first cloud-based streaming video service -- Sid was the first engineer to work on the cloud at Netflix
  • LinkedIn's Federated Search Typeahead (a.k.a. auto-complete)
  • LinkedIn's (Big Data) Self-service Marketing Analytics tool
  • PayPal's DBaaS - an internal self-service system to provision & manage heterogeneous databases
  • PayPal's CDC - an internal self-service CDC system to stream DB updates to nearline applications
  • eBay-over-Skype : Following the Skype acquisition, Sid built a P2P version of eBay offers
  • eBay's Best Match Search Ranking Engine powered by an In-Memory Database
  • eBay's Fuzzy-match name/email Search
  • Agari's Data Platform : Batch & Streaming Predictive Data Platform as a Service
  • Datazoom's Platform : High-fidelity, Low-latency Streaming Data Platform as a Service