In-Process Analytical Data Management with DuckDB

Analytical data management systems have long been monolithic monsters, kept far removed from the action by ancient protocols. Redesigning them to move into the application process greatly streamlines data transfer, deployment, and management. This new class of systems enables a whole new class of use cases, for example in-browser or edge OLAP, running SQL queries in lambdas, and Big Data on laptops.

DuckDB is a new analytical data management system that is built for the in-process use case. DuckDB speaks SQL, is trivially integrated as a library, and uses state-of-the-art query processing techniques with vectorized execution and lightweight compression. DuckDB is Free and Open Source software that is distributed under the permissive MIT license. In my talk, I will explain the rationale and design decisions behind DuckDB and give a tour of the internals.

Interview:

What's the focus of your work these days?

I spend most of my time working on DuckDB. It's the open source project that I co-founded, and we have spun out from the research institute where I worked into a separate company called DuckDB Labs, which I'm also leading.

What's the motivation for your talk at QCon London 2023?

The motivation for the talk came from interacting with data practitioners some years ago; we found out that they really hated using data systems. As somebody who builds data systems, I was a bit concerned that the world hated us, so we started to rethink how data systems could work. We came up with this idea that they should be running in process, and that's what I want to talk about. I want to show people how powerful this new way of thinking about data systems is.

How would you describe your main persona and target audience for this session?

I think there are two groups. The first group consists of data analysts and data scientists who are interested in processing large data sets with SQL. The other group is data engineers who are trying to build data pipelines, as DuckDB does it. I'm going to talk about this and it's going to be very useful for those with a more embedded role. These are the two groups that I think would be most interested.

Is there anything specific that you'd like people to walk away with after watching your session?

My first motivation is for them to have heard of us. We are a fast-growing project, but I'm told there are still some people out there that haven't heard of us. I think the way DuckDB works can really open up new possibilities and dimensions for people to think about how to build data pipelines and how to analyze data. So I think for them to walk away with that insight would be great. 


Speaker

Hannes Mühleisen

Co-founder and CEO @duckdblabs

Prof. Dr. Hannes Mühleisen is a creator of the DuckDB database management system and Co-founder and CEO of DuckDB Labs, a consulting company providing services around DuckDB. He is also a senior researcher in the Database Architectures group at the Centrum Wiskunde & Informatica (CWI), the Dutch national research lab for Mathematics and Computer Science in Amsterdam. Hannes is also Professor of Data Engineering at Radboud Universiteit Nijmegen. His main interest is analytical data management systems.

Date

Monday Mar 27 / 05:25PM BST (50 minutes)

Location

Mountbatten (6th Fl.)

Topics

processing techniques, open source, Big Data, Data Systems

From the same track

Session Microservices

Change Data Capture for Microservices

Monday Mar 27 / 01:40PM BST

Microservices represent complex business domains in the form of loosely coupled systems, but these don't exist in isolation: services need to propagate data changes amongst each other, in a reliable and scalable way.

Gunnar Morling

Senior Staff Software Engineer @Decodableco

Session transactions

Amazon DynamoDB Distributed Transactions at Scale

Monday Mar 27 / 02:55PM BST

NoSQL databases are popular for their high availability, high scalability, and predictable performance.

Akshat Vig

Senior Principal Engineer NoSQL databases @awscloud

Session Apache Pinot

Speed of Apache Pinot at the Cost of Cloud Object Storage with Tiered Storage

Monday Mar 27 / 11:50AM BST

For real-time analytics, you need systems that can provide ultra low latency (milliseconds) and extremely high throughput (hundreds of thousands of queries per second).

Neha Pawar

Founding Engineer @StarTree

Session raft

Multi-Region Data Streaming with Redpanda

Monday Mar 27 / 04:10PM BST

Real-time data streaming platforms such as Redpanda have become a mission-critical component in enterprise infrastructure. Multi-region deployments of streaming applications can provide important benefits, such as improved resiliency, better performance and cost reduction.

Michał Maślanka

Software Engineer @Redpanda

Session

A New Era for Database Design with TigerBeetle

Monday Mar 27 / 10:35AM BST

Joran Greef

Founder and CEO @TigerBeetle