Vector Search on Columnar Storage

Abstract

Managing vector data entails storing, updating, and searching collections of large, multi-dimensional pieces of data. Some believe that this justifies the creation of a new class of data systems specialized for this task. Others contend that such systems would eventually need to provide the services that database systems already offer, such as transaction management, role-based access control, and the integration of vector search predicates into complex queries.

Recent research (PDX, Partition Dimensions Across) has shown that even highly optimized vector search kernels can profit from columnar storage. This talk gives a sneak preview of our ongoing work in this area, including optimized vector ingest, tailored vector indexing, and integrated evaluation of queries and vector predicates in the DuckDB system.


Speaker

Peter Boncz

Professor @CWI, Co-Creator of MonetDB, VectorWise and MotherDuck, Database Systems Researcher, and Entrepreneur

Peter Boncz holds appointments as a tenured researcher at CWI and professor at VU University Amsterdam. His academic background is in database systems, with the open-source column store MonetDB as the outcome of his PhD. He has a track record of bridging the gap between academia and commercial application, having founded multiple startups. In 2008 he co-founded Vectorwise around the analytical database system of the same name, which pioneered vectorized query execution and lightweight data compression, both of which have been adopted broadly in analytical database systems.

Recent work to make data (de)compression data-parallel and AI/GPU-friendly led to the FastLanes data format. In recent years he has collaborated closely with both Databricks and MotherDuck, a startup that is connecting DuckDB to the cloud. DuckDB originates from the Database Architectures research group, which he leads at CWI (the Amsterdam research institute where Python was also created).

From the same track

Session AI/ML

Navigating the Edge of Scale and Speed for Physics Discovery

Wednesday Mar 18 / 10:35AM GMT

Details coming soon.

Thea Klaeboe Aarrestad

Particle Physics and Real-Time ML @CERN @ETH Zürich

Session compilers

Automatically Retrofitting JIT Compilers

Wednesday Mar 18 / 03:55PM GMT

We as a community have attempted, multiple times, to speed up languages such as Lua, Python, and Ruby by hand-writing JIT compilers. Sometimes we've had short-term success, but the size and pace of change of their standard implementations have proven difficult to keep up with over time.

Laurence Tratt

Shopify / Royal Academy of Engineering Research Chair in Language Engineering @King's College London

Session architecture

Not Just I/O: Using Async/Await for Computational Scheduling

Wednesday Mar 18 / 01:35PM GMT

In the past two years I have developed a new query execution engine for Polars, which not only tries to execute as much of your query in parallel as possible, but in a streaming fashion as well, such that you can process data sets which do not fit in memory.

Orson Peters

Senior Engineer of Query Execution @Polars, (Co-)Author of Stdlib Sort in Rust & Go

Session

Looking Under the Hood: Data Processing Systems Performance Tricks (and How to Apply Them to Your Code)

Wednesday Mar 18 / 02:45PM GMT

Modern data processing systems—databases, analytics engines, vector stores, and stream processors—hide an extraordinary amount of performance engineering beneath their abstractions.

Holger Pirk

Associate Professor for Data Management Systems at Imperial College London and Avid Runner — Minimizing Cache Misses, Thread Divergence and Aerobic Decoupling