Which embedded OLAP database to pick? | Roman Zaynetdinov (zaynetro)

Intro

I am creating a project that handles visitor events from browsing a website. I need to build an analytical dashboard for reporting purposes. There has been a lot of talk lately that you might not need any cloud solutions and you could survive with an embedded database on a single instance. Is that true?

In this post I will explore how SQLite compares to DuckDB, Polars and Apache DataFusion.

Note: This is not a benchmark! This is me exploring different solutions using their default interfaces.

If you wish to see the results you can jump directly to them.

For my test I am going to generate random events that happen during a visitor session. Let's say a visitor goes to a website and interacts with it. Then they might open a few more pages and interact with them or not. For the purposes of this experiment there are only three distinct events that might happen:

Page load: { "path": "/", "user_agent": "firefox" }
Chat message: { "text": "Hello!" }
Form submission:
- { "form_type": "feedback", "fields": [{ "name": "score", "value": "80" }] }
- OR { "form_type": "contact-us", "fields": [{ "name": "email", "value": "a@b" }] }

Every event has a unique id, a session id and a page id. Page id and session id are used to group events from the same page and from the same session accordingly.

Let's define event schema for SQLite:

CREATE TABLE events (
  id TEXT,
  session_id TEXT,
  page_id TEXT,
  timestamp TEXT,
  event_type TEXT,
  payload TEXT
);

Payload is a JSON string.

SQLite, MySQL and co will store events in a row-oriented storage:

id	session_id	page_id	timestamp	event_type	payload
id-1	s-1	p-1	2023-04-18 12:00:00	page_load	`{ "path": "/", ...`
id-2	s-1	p-1	2023-04-18 12:01:00	chat_message	`{ "text": "Hello!" }`
...	...	...	...	...	...

This format works great for inserts and fetching individual rows.

Online analytical processing (OLAP) needs are different though. For OLAP you need to query large amounts of data and return grouped results of individual fields. Column-oriented storage works much in favour.

What is column-oriented storage? Our example above will be roughly stored as series of values per field:


id	id-1	id-2	...
session_id	s-1	s-1	...
page_id	p-1	p-1	...
timestamp	2023-04-18 12:00:00	2023-04-18 12:01:00	...
event_type	page_load	chat_message	...
payload	`{ "path": "/", ...`	`{ "text": "Hello!" }`	...

This format allows for storage saving and query execution optimizations like:

Timestamp values could record a difference from the previous value, a delta instead of storing a complete date.
Queries that select a single field need to read only a single row.

Column-oriented storage references:

So far we have been treating payload as a JSON string. In fact OLAP databases support structs and lists so we can define a typed schema for our payload. Let's use DuckDB for schema definition in SQL:

CREATE TABLE events (
  id VARCHAR,
  session_id VARCHAR,
  page_id VARCHAR,
  timestamp TIMESTAMP,
  event_type VARCHAR,
  payload STRUCT(
    path VARCHAR,
    user_agent VARCHAR,
    text VARCHAR,
    form_type VARCHAR,
    fields STRUCT(name VARCHAR, value VARCHAR)[]
  )
);

Queries comparison

Querying DuckDB is exactly the same as querying SQLite. Let's try to find the most visited pages:

SELECT payload->>'$.path' AS path, COUNT(*) AS count
  FROM events
 WHERE
     event_type = 'page_load'
 GROUP BY path
 ORDER BY count DESC
 LIMIT 5

This query is the same for SQLite and DuckDB where payload field as JSON. For typed payload field we only need to modify field extraction in the SELECT-case:

-SELECT payload->>'$.path' AS path, COUNT(*) AS count
+SELECT payload.path AS path, COUNT(*) AS count

Now let's take a look at "Average feedback score" query. This query requires extracting a struct field, a first list element and also casting a string to a number.

SQLite is by far the simplest. It also does the casting for us:

SELECT AVG(payload->>'$.fields[0].value') AS average
  FROM events
 WHERE
     event_type = 'form_submit'
     AND payload->>'$.form_type' = 'feedback';

Typed DuckDB:

SELECT AVG(TRY_CAST(payload.fields[1].value AS INTEGER)) AS average
  FROM events
 WHERE
     event_type = 'form_submit'
     AND payload.form_type = 'feedback';

Polars:

let pres = df
    .filter(
        col("event_type").eq(lit("form_submit"))
        .and(col("payload")
                .struct_()
                .field_by_name("form_type")
                .eq(lit("feedback")),
        ),
    )
    .select([
        // '$.fields[0].value
        col("payload")
            .struct_().field_by_name("fields")
            .arr().first()
            .struct_().field_by_name("value")
            .cast(DataType::Int32)
            .alias("score"),
    ])
    .select([avg("score")])
    .collect()?;

Results

Queries were run on Macbook Air M2 from an application written in Rust. Source code for the tests.

2M events

My data generator inserted 2'274'231 event rows. Let's see file sizes:

File	Size
SQLite	593M
DuckDB (JSON payload)	190M
DuckDB (Typed payload)	177M
Parquet (Typed payload)	171M
CSV	555M
JSON	701M

You can see how column-oriented storage solutions decrease the file size. Parquet is ~70% smaller than SQLite!

Now to the queries:

Query	SQLite	DuckDB	Polars	DataFusion
Count by event_type	700ms	10ms	25ms	80ms
Average page loads per session	255ms	10ms	45ms	115ms
Average feedback score	255ms	35ms	160ms	—
Top pages	370ms	30ms	190ms	460ms
Page loads per day	235ms	5ms	20ms	55ms
Form submissions	485ms	40ms	225ms	550ms
Form submissions by page	535ms	60ms	380ms	535ms

DuckDB results with payload as JSON string.

DataFusion doesn't fully support nested structs: https://github.com/apache/arrow-datafusion/issues/2179

22M events

My data generator inserted 22'754'423 event rows. Let's see file sizes:

File	Size
SQLite	5.8G
DuckDB (JSON payload)	1.7G
DuckDB (Typed payload)	1.7G
Parquet (Typed payload)	1.7G

Now to the queries:

Query	SQLite	DuckDB (Typed)	Polars	DataFusion
Count by event_type	10350ms	70ms	200ms	775ms
Average page loads per session	5180ms	95ms	470ms	1170ms
Average feedback score	4845ms	500ms	2185ms	—
Top pages	7820ms	455ms (350ms)	2070ms	4655ms
Page loads per day	4890ms	70ms	165ms	520ms
Form submissions	8020ms	345ms (355ms)	2972ms	5830ms
Form submissions by page	12340ms	620ms (650ms)	6980ms	9380ms

DuckDB results with payload as JSON string.

In parenthesis you can see DuckDB results with typed payload.

DataFusion doesn't fully support nested structs: https://github.com/apache/arrow-datafusion/issues/2179

Notes about results:

I used Polars and DataFusion as crates. At least for Polars there are additional features that could be enabled to increase the performance which I haven't tried.
I have included DuckDB results with payload as JSON string. Typed DuckDB is consistently 1-2ms faster on a smaller dataset. On a larger dataset typed DuckDB was slower on some queries.
Polars supports SQL queries as well. I just wanted to play around with DataFrame API in Rust.
Looking at H2O.ai benchmark Polars should be as fast as DuckDB. Maybe I am doing something wrong.
SQL is a very portable interface. SQLite, DuckDB and DataFusion queries are almost the same.

Feelings

Overall, it was a nice experiment. I really liked using DuckDB and Polars. Polars feels to be targeted more towards Python users as Rust documentation is a bit lacking. DuckDB uses 1-based indexing on arrays. It took me some time to realize it though :/

Honestly, I am impressed with how well DuckDB performed especially with payload as JSON string. I really like the flexibility of just using JSON and not worrying about the schema.

All projects that I played with are built around Apache Parquet and Apache Arrow. Andy Pavlo described the future of OLAP databases really well:

The long-term trend to watch is the proliferation of frameworks like Velox, DataFusion, and Polars. Along with projects like Substrait, the commoditization of these query execution components means that all OLAP DBMSs will be roughly equivalent in the next five years. Instead of building a new DBMS entirely from scratch or hard forking an existing system (e.g., how Firebolt forked Clickhouse), people are better off using an extensible framework like Velox. This means that every DBMS will have the same vectorized execution capabilities that were unique to Snowflake ten years ago. And since in the cloud, the storage layer is the same for everyone (e.g., Amazon controls EBS/S3), the critical differentiator between DBMS offerings will be things that are difficult to quantify, like UI/UX stuff and query optimization.
Ref: https://ottertune.com/blog/2022-databases-retrospective/

If you are not sold to the idea of storing Apache Parquet files on S3 and using an embedded DB engine to query them then you might try ClickHouse.

Open questions

I am yet to find a good way to write Parquet files. Using parquet crate directly feels a bit complicated (I just can't comprehend the docs). Both Polars and DuckDB could write Parquet files. Shall I stick to the same approach I used for this experiment: write to DuckDB and export from it or is there a better solution?
DuckDB and Polars could read Apache Parquet files directly from S3. DuckDB also claims that they will download only file parts that are necessary to facilitate the query. I haven't tried an S3 integration but that is definitely something I would like to test.
Can Polars run faster?

If you have any tips or answers, please, do let me know!