Abstract
ML training pipelines treat data as static. Teams spend weeks preprocessing datasets into WebDataset or TFRecords, and when they want to experiment with curriculum learning or data mixing, they reprocess everything from scratch. Meanwhile, GPUs sit idle waiting for data that takes the scenic route: S3 to disk, disk to CPU, CPU decompression, and finally a copy to the GPU.
What if you could scan and filter your training data on the fly, streaming it from object storage to GPU memory in a single copy, saturating the host-to-device bandwidth?
This talk introduces Vortex, an open-source columnar file format designed for this world. I'll show how the format's design enables a data path that conventional formats can't support: composable encodings that compress better and decompress on GPU, independent column chunking that minimizes bytes on the wire, and a layout tree that turns a query into precise byte-range reads from S3. No CPU in the data path.
You'll learn how Vortex differs from Parquet, how we built a single-copy S3 to GPU pipeline on top of it, and where this is heading next.
For ML infrastructure engineers tired of the preprocessing treadmill.
Speaker
Onur Satici
Staff Engineer @SpiralDB & Core Maintainer of Vortex (LF AI & Data); previously built distributed systems @Palantir
Onur is a Staff Engineer at SpiralDB and a core maintainer of Vortex, an open-source columnar file format now part of the Linux Foundation (LF AI & Data). He focuses on high-performance data systems, GPU acceleration, and making analytical workloads faster at every layer of the stack.