DuckDB as a unified data interface
Introduction
It’s no surprise that in data science, you have to deal with a lot of data: it’s in the name! The science part is equally important and far from a marketing ploy. Beyond the sensational Gen AI projects, there is also a lot of rigorous work for data scientists to do in modeling, forecasting, and more generally in finding business value in heaps of data no one knows what to do with.
As data scientists, it’s our job to come up with a why, a what, and a how for the problems encountered by business stakeholders. And this brings us back to the data part. Building a solution (or data product), whether it be a visualization, a tool, a model, or a pipeline, requires a tremendous amount of exploratory work and data manipulation. There are many tools for that: from Excel to Jupyter notebooks (and now the even better Marimo notebooks), from PowerBI to Tableau or Metabase, the choice of how to access and visualize your data can be overwhelming. And I haven’t even mentioned the diversity of data storage solutions to begin with. Is your data stored as simple Parquet files in object storage, inside a transactional database (like Postgres), or in an analytical database (like ClickHouse)? If your team has a data engineer, you’re in luck, as they will be the expert on these questions and will most likely set this up for you. Otherwise, it might be up to you and your team to design the data infrastructure and the processes to interact with it.
In the past year, I’ve been working on a cool project involving an order allocation simulation and item sampling based on forecasted sales. From a data perspective, this means ingesting, processing, and analyzing a lot of different data sources. I could write a whole post about all the difficulties we encountered in finding and ingesting all the required data (which is pretty typical in a company as big as IKEA, especially when you start looking at sales), but that wouldn’t be much fun to read. Instead, let’s focus on the processing and analysis part. While our data product may not be handling terabytes or petabytes at once, when scaled to all the countries IKEA operates in and all the simulation configurations we explore, we still need to take a careful approach to how we handle the data.
Our initial approach was based on a simple Postgres database for data storage and Python with Pandas for processing. Unsurprisingly, this turned out to be quite inefficient and did not meet our performance criteria. Our first attempt at improving the situation was simply to move from Pandas to Polars, and it worked like a charm! With that done, the processing itself was no longer the bottleneck, but data loading was. Of course, this wasn’t unexpected: our pipeline is more analytical than transactional, which makes Postgres rather ill-suited for the job. The obvious choice would have been to just use BigQuery, which is the default data warehouse solution at IKEA. However, the cost structure didn’t seem worth it for our purposes, so we mostly use BigQuery to expose our data within the company instead. We considered moving to ClickHouse for a bit but ultimately decided against it, because hosting ClickHouse internally for only one project is a large commitment and, additionally, we ran into some significant bugs in our processes. But don’t let that deter you: we considered it seriously because it’s a good product, and it might just suit your needs. We finally settled on DuckDB, which has provided us with a simple, scalable solution at a fraction of the complexity.
Let me start with a brief introduction to DuckDB before diving deeper into how we use it. DuckDB is a simple, lightweight analytical database that is both fast and open-source. One of its main perks (which took me quite a bit of time to get used to) is that it’s serverless: you don’t need to run a separate server process or container; you can just start it from your application and run your queries in-process. If you’re like me, this definition might make you think DuckDB is mostly good as a replacement for analytics engines (like Polars) inside your code, but I’ve since come to realize you can use it for much more than that, and at this point I’d even say I use it for everything.
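To make the in-process part concrete, here is a minimal sketch of what that looks like from Python (the file name is a placeholder): no server to start, no connection string to a running service, just a library call.

import duckdb  # pip install duckdb

# The engine runs inside the Python process: no separate server or container to manage.
con = duckdb.connect()  # in-memory database; pass a file path to persist it

# Query a local Parquet file directly with SQL (placeholder file name).
con.sql("SELECT count(*) AS n FROM 'some_file.parquet'").show()
# .df() or .pl() would return the result as a pandas or Polars DataFrame instead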
DuckDB as a data lake
I mentioned that in our project, we ended up settling on DuckDB for our data storage. To do this, we chose to use the ducklake extension for DuckDB, which brings all the benefits of data lakes without the usual complexity. What it does is separate the catalog from the data storage:
- your data is simply stored as raw Parquet files (which in our case can just be dumped in object storage, e.g., in Google Cloud Storage),
- the data lake keeps track of which file corresponds to which table using a catalog, which must be stored in some database (Postgres, MySQL, SQLite, or even a simple DuckDB file). In our case, we just reused the Postgres instance we had lying around anyway, but any database is fine (I would advise against a DuckDB file if you need multi-user access!).
The setup for this is really simple and you can find all the information on how to do it in the documentation. The best part is that it’s cheap, easy to set up, and very reliable. And since DuckDB is very portable (there are even WASM client implementations if you need them), you can easily connect to your data lake from anywhere. In my workflow, I quite enjoy being able to run my pipeline and analyse the results in my Marimo notebook all at once, just using DuckDB everywhere.
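For reference, here is a rough sketch of what such a setup can look like from Python, with placeholder names throughout: the catalog lives in a Postgres database and the Parquet files live in a GCS bucket. The exact connection details and credentials depend on your environment, so treat this as a sketch and check the DuckLake documentation rather than copy-pasting it.

import duckdb

con = duckdb.connect()
for ext in ("ducklake", "postgres", "httpfs"):
    con.execute(f"INSTALL {ext}")
    con.execute(f"LOAD {ext}")

# Credentials for the bucket that stores the Parquet files
# (HMAC keys here; your authentication setup may differ).
con.execute("CREATE SECRET gcs_creds (TYPE gcs, KEY_ID 'my_key_id', SECRET 'my_secret')")

# The catalog is stored in Postgres, the data itself in the bucket (all names are placeholders).
con.execute("""
    ATTACH 'ducklake:postgres:dbname=lake_catalog host=my-postgres-host user=lake_user'
        AS lake (DATA_PATH 'gs://my-bucket/lake/')
""")

con.sql("SELECT * FROM lake.main.some_table LIMIT 10").show()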
DuckDB as a data interface
The title of this post is quite ambitious. I called DuckDB a unified data interface, yet so far I’ve only cheered for DuckDB as a drop-in replacement for complex data lake and data warehousing solutions. Taking a step back, the original strength of DuckDB is that it embeds a powerful analytical engine which you can point at any database, CSV or Parquet file, or data warehouse (BigQuery, etc.), from any client. You can do some of that with libraries like Polars, but you are limited to a Python (or Rust) environment and to Polars’s API (which is honestly really good). DuckDB, on the other hand, lets you do all of this using (enriched) SQL, from any client (your shell, Python, Rust, etc.).
To be honest, SQL wasn’t always a selling point for me. I first learnt SQL through web development, where a lot of it is just mindless, repetitive CRUD operations. But in analytics, I find SQL more interesting. It provides a much more unified API for data transformation (don’t mind all the SQL dialects too much though), and while I still think it can be a bit too verbose in some respects, it is generally quite clear and powerful, which is why it has become a de facto standard. And when you combine it with a fast, portable analytical processing engine, you effectively have a unified interface to query, transform, and analyze your data.
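As a taste of what that enriched SQL looks like, here is a small sketch of a couple of DuckDB’s quality-of-life additions, run on a hypothetical sales file (the file and column names are made up):

import duckdb

# EXCLUDE trims columns out of a star expression, and GROUP BY ALL groups by
# every non-aggregated column so you don't have to repeat the column list.
duckdb.sql("""
    SELECT * EXCLUDE (quantity), sum(quantity) AS total_quantity
    FROM 'sales.parquet'
    GROUP BY ALL
    ORDER BY ALL
""").show()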
But what does this mean in practice?
In my workflows, I’ve been using DuckDB as a way to quickly connect very different data sources:
- I have a persistent data lake configuration so I can access my application data from any notebook or directory,
INSTALL ducklake; LOAD ducklake;
ATTACH 'ducklake:my_ducklake_config' AS lake;
SELECT * FROM lake.main.some_table;
- some data from other projects is stored inside a Postgres database which I can just access with the Postgres extension,
INSTALL postgres; LOAD postgres;
ATTACH 'dbname=... user=... password=... host=... port=...' AS db (TYPE postgres, READ_ONLY);
SELECT * FROM db.some_table;
- other data I can just query from BigQuery directly,
INSTALL bigquery FROM community; LOAD bigquery;
ATTACH 'project=my_gcp_project' AS bq (TYPE bigquery, READ_ONLY);
SELECT * FROM bq.some_dataset.some_table;
- finally, some of the code I have to run (like our simulation engine) just produces CSV files in development, which I wrap in views.
CREATE VIEW some_file AS (
    SELECT * FROM 'some_file.csv'
);
Most of my notebooks just start by setting up the appropriate connections and creating a few views, and I can then join all this different data in a jiffy! This kind of workflow has really been a game changer: it has made analyzing data a lot easier and a lot faster. For business-critical projects, coming up with a full analysis in a day rather than a week makes a big difference.
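Concretely, the first cell of such a notebook might look something like the sketch below, where the connection strings, table names, and column names are all placeholders: attach the sources once, wrap the odd CSV in a view, and every subsequent query can join across all of them.

import duckdb

con = duckdb.connect()
for ext in ("ducklake", "postgres"):
    con.execute(f"INSTALL {ext}")
    con.execute(f"LOAD {ext}")

# Attach the data lake and the Postgres database (read-only), as in the snippets above.
con.execute("ATTACH 'ducklake:my_ducklake_config' AS lake")
con.execute("ATTACH 'dbname=other_project host=my-host user=me' AS db (TYPE postgres, READ_ONLY)")

# Wrap a local CSV produced by the simulation engine in a view.
con.execute("CREATE VIEW sim_results AS SELECT * FROM 'some_file.csv'")

# One query can now join data living in three different places
# (table and column names are hypothetical).
con.sql("""
    SELECT i.item_name,
           sum(r.allocated_qty) AS total_allocated,
           sum(f.forecast_qty)  AS total_forecast
    FROM sim_results AS r
    JOIN db.public.items AS i USING (item_id)
    JOIN lake.main.forecasts AS f USING (item_id)
    GROUP BY ALL
""").show()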
Conclusion
When I first discovered DuckDB, I only saw it as a possible replacement for the analytics part of my workflows (so essentially using SQL instead of Polars’s API). I didn’t know it could provide such a simple, scalable data warehousing solution. And it took me a while to realize that I could just use the same engine, the same SQL, and the same process to talk to all the data I care about. I don’t need to write endless conversion or download scripts, and I don’t need to do crazy gymnastics to make the output of one program compatible with some other dataset. I can just write a few views as translation layers and reuse them everywhere! So if you find yourself jumping from notebooks to dashboards to shell scripts, maybe give DuckDB a try and see how it can help you.