From S3 to GPU in One Copy: Rethinking Data Loading for ML Training

Abstract

ML training pipelines treat data as static. Teams spend weeks preprocessing datasets into WebDataset or TFRecords, and when they want to experiment with curriculum learning or data mixing, they reprocess everything from scratch. Meanwhile, GPUs sit idle waiting for data that takes the scenic route: S3 to disk, disk to CPU, CPU decompression, and finally a copy to the GPU.
What if you could query your training data on the fly and stream it directly from object storage to GPU memory, saturating the host-to-device bandwidth?

This talk introduces Vortex, an open source columnar format designed for this world. I’ll cover how Vortex differs from Parquet (extensible encodings, GPU-native decompression, and a layout optimized for selective reads), then dive deep into a new data path: coalesced byte-range requests from S3 into pinned buffers, with host-to-device transfers that saturate GPU bandwidth, all in a single copy.
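The coalescing step mentioned above can be sketched in a few lines of Python. This is an illustrative sketch, not Vortex's actual API: the function name and the gap threshold are assumptions. The idea is that absorbing a small gap between two reads is usually cheaper than paying another S3 request's latency.

```python
def coalesce_ranges(ranges, max_gap=1 << 20):
    """Merge (start, end) byte ranges whose gaps are below max_gap.

    Many small column-chunk reads become a few large GET requests;
    the wasted bytes in each gap are traded for fewer round trips.
    """
    if not ranges:
        return []
    ranges = sorted(ranges)
    merged = [ranges[0]]
    for start, end in ranges[1:]:
        last_start, last_end = merged[-1]
        if start - last_end <= max_gap:
            # Gap is small: extend the previous range to cover this one.
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged

# Example: three column-chunk reads collapse into two GET requests.
reads = [(0, 4_096), (8_192, 65_536), (50_000_000, 50_004_096)]
print(coalesce_ranges(reads))
# [(0, 65536), (50000000, 50004096)]
```

Each merged range would then be fetched with a single ranged GET straight into a pinned host buffer, so the subsequent host-to-device copy is the only copy on the path.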

You’ll learn why the CPU is the bottleneck in ML data pipelines, how to remove it, and where this is heading: GPU-direct reads over RDMA that eliminate the last copy entirely.

For ML infrastructure engineers tired of the preprocessing treadmill.


Speaker

Onur Satici

Staff Engineer @SpiralDB

Onur is a Staff Engineer at SpiralDB and a core maintainer of Vortex, an open source columnar file format now part of the Linux Foundation (LF AI & Data). He focuses on high-performance data systems, GPU acceleration, and making analytical workloads faster at every layer of the stack.

Date

Tuesday Mar 17 / 01:35PM GMT (50 minutes)

Location

Fleming (3rd Fl.)

From the same track

Session

Introducing Tansu.io: Rethinking Kafka for Lean Operations

Tuesday Mar 17 / 10:35AM GMT

What if Kafka brokers were ephemeral, stateless and leaderless with durability delegated to a pluggable storage layer?

Peter Morgan

Founder @tansu.io

Session

The Rise of the Streamhouse: Idea, Trade-Offs, and Evolution

Tuesday Mar 17 / 11:45AM GMT

Over the last decade, streaming architectures have largely been built around topic-centric primitives—logs, streams, and event pipelines—then stitched together with databases, caches, OLAP engines, and (increasingly) new serving systems.

Giannis Polyzos

Principal Streaming Architect @Ververica

Anton Borisov

Principal Data Architect @Fresha

Session

Ontology-Driven Observability: Building the E2E Knowledge Graph at Netflix Scale

Tuesday Mar 17 / 02:45PM GMT

As Netflix scales hundreds of client platforms, microservices, and infrastructure components, correlating user experience with system performance has become a hard data problem, not just an observability one.

Prasanna Vijayanathan

Engineer @Netflix

Renzo Sanchez-Silva

Engineer @Netflix

Session

Building a Control Plane for Production AI

Tuesday Mar 17 / 03:55PM GMT

Details coming soon.

Session

Unconference: Modern Data Engineering

Tuesday Mar 17 / 05:05PM GMT