ML training pipelines treat data as static. Teams spend weeks preprocessing datasets into WebDataset or TFRecords, and when they want to experiment with curriculum learning or data mixing, they reprocess everything from scratch. Meanwhile, GPUs sit idle waiting for data that takes the scenic route: S3 to disk, disk to CPU, CPU decompression, and finally a copy to the GPU.

What if you could scan and filter your training data on the fly, streaming it from object storage to GPU memory in a single copy, saturating the host-to-device bandwidth?

This talk introduces Vortex, an open-source columnar file format designed for this world. I'll show how the format's design enables a data path that conventional formats can't support: composable encodings that compress better and decompress on GPU, independent column chunking that minimizes bytes on the wire, and a layout tree that turns a query into precise byte-range reads from S3. No CPU in the data path.

You'll learn how Vortex differs from Parquet, how we built a single-copy S3 to GPU pipeline on top of it, and where this is heading next.

For ML infrastructure engineers tired of the preprocessing treadmill.

From the same track

Session Kafka

Introducing Tansu.io -- Rethinking Kafka for Lean Operations

Tuesday Mar 17 / 01:35PM GMT

What if Kafka brokers were ephemeral, stateless and leaderless with durability delegated to a pluggable storage layer?

Peter Morgan

Founder @tansu.io

Session streaming

The Rise of the Streamhouse: Idea, Trade-Offs, and Evolution

Tuesday Mar 17 / 03:55PM GMT

Over the last decade, streaming architectures have largely been built around topic-centric primitives—logs, streams, and event pipelines—then stitched together with databases, caches, OLAP engines, and (increasingly) new serving systems.

Giannis Polyzos

Principal Streaming Architect @Ververica

Anton Borisov

Principal Data Architect @Fresha

Session Generative AI

Ontology‐Driven Observability: Building the E2E Knowledge Graph at Netflix Scale

Tuesday Mar 17 / 10:35AM GMT

As Netflix scales hundreds of client platforms, microservices, and infrastructure components, correlating user experience with system performance has become a hard data problem, not just an observability one.

Prasanna Vijayanathan

Engineer @Netflix

Renzo Sanchez-Silva

Engineer @Netflix

Session AI/ML

Chronon - Mixed-Workload Data Processing Framework

Tuesday Mar 17 / 02:45PM GMT

Chronon is a data processing framework open-sourced by Airbnb. It is adopted across organizations like Stripe, Netflix, OpenAI, and Uber. Chronon was originally built for ML applications.

Nikhil Simha

Co-Founder & CTO @zipline.ai, Author of "Chronon Feature Platform", Previously @Airbnb, @Meta, and @Walmartlabs

Session

Connecting the Dots: Modern Data Engineering & Architectures (Limited Space - Registration Required)

Tuesday Mar 17 / 05:05PM GMT

From S3 to GPU in One Copy: Rethinking Data Loading for ML Training

Abstract

Speaker

Onur Satici

Speaker

Onur Satici

Date

Location

Track

Topics

Share

From the same track

Introducing Tansu.io -- Rethinking Kafka for Lean Operations

The Rise of the Streamhouse: Idea, Trade-Offs, and Evolution

Ontology‐Driven Observability: Building the E2E Knowledge Graph at Netflix Scale

Chronon - Mixed-Workload Data Processing Framework

Connecting the Dots: Modern Data Engineering & Architectures (Limited Space - Registration Required)

Follow QCon

Contact

Menu

Conferences around the World