
chDB v4: The Fastest SQL Engine
on Pandas DataFrame

Zero-Copy Architecture · DataStore API · Benchmarks

Auxten Wang · Technical Director @ ClickHouse

ClickHouse Release 26.3 Webinar · 15 min

What is chDB?

In-Process
ClickHouse engine
No server needed
1 Line
pip install chdb
Zero config
300+
SQL functions
JOINs, CTEs, Window
80+
Data formats
CSV, Parquet, S3...

import chdb
import pandas as pd

df = pd.DataFrame({'user_id': range(1_000_000),
                   'category': (['A', 'B', 'C'] * 333_334)[:1_000_000]})

result = chdb.query("""
    SELECT category, COUNT(*) as cnt
    FROM Python(df)
    GROUP BY category
""", "DataFrame")
  

Same ClickHouse engine you already know — compiled as a shared library, embedded in your Python process. No infrastructure.

The Problem: Pandas at Scale

"I have a DataFrame. I want to run SQL on it. Give me back a DataFrame. Fast."
  • Single-threaded. One core out of 16 doing the work
  • Eager execution. Every step materializes the full result in memory
  • Memory hungry. A 50M-row JOIN peaks at 10.5 GB in Pandas

On an 8 GB laptop, that JOIN will OOM and kill your process.

What if you could keep writing Pandas code, but run it up to 247x faster with 23x less memory?

The Journey to Zero-Copy

v1
v1.0 (2023)
4 serialization steps
DataFrame → Parquet → CH → Parquet → DataFrame
~100ms+ overhead per query
v2
v2.0 (2024)
Zero-copy input
Direct memory read from DataFrame
Output still via Parquet
87x faster than v1.0
v4
v4.0 (2026)
Complete zero-copy loop
DataFrame ↔ ClickHouse ↔ DataFrame
No serialization anywhere
The ultimate integration

Three generations of engineering to eliminate every serialization bottleneck

Is chDB Fast?

chDB is ClickHouse — same codebase, same vectorized engine, same SIMD pipelines.

So how do we make it the fastest SQL engine on Pandas DataFrame?
Pandas DataFrame
NumPy arrays in memory
The Bottleneck
serialization, GIL, encoding
ClickHouse Engine
already blazing fast

The secret: Don't let Python slow down ClickHouse.

  • Zero serialization — read/write DataFrame memory directly
  • Minimize GIL — batch all Python calls, then release
  • Bypass CPython — rewrite hot paths (string encoding) in C++

The next 3 slides show exactly how we did each of these.

How Zero-Copy Works

Input: DataFrame → ClickHouse

  1. Extract raw pointers from NumPy arrays
  2. Map NumPy dtypes to ClickHouse types
  3. ClickHouse reads memory directly — no copy

Output: ClickHouse → DataFrame

  1. Map ClickHouse column types to NumPy dtypes
  2. Share underlying memory buffers directly
  3. SIMD-optimized batch conversion for results
# What chDB does internally:
 
# 1. Get NumPy array from column
arr = df['col'].values
 
# 2. Extract memory address
ptr = arr.__array_interface__['data'][0]
 
# 3. Get dtype info
dtype = arr.dtype
 
# 4. Pass to ClickHouse C++ engine
# (no copy, just pointer!)

For numeric-heavy workloads, the boundary-crossing cost is effectively constant, regardless of data size.
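The claim is easy to verify from Python itself. A minimal sketch (not chDB's actual code path) showing that a view rebuilt from the column's buffer shares the DataFrame's memory:

```python
import numpy as np
import pandas as pd

# A numeric column is backed by a single contiguous NumPy buffer.
df = pd.DataFrame({'price': np.arange(5, dtype=np.float64)})
arr = df['price'].to_numpy()

# Step 2 from the snippet above: the buffer's memory address.
ptr = arr.__array_interface__['data'][0]

# A view rebuilt from that same buffer shares memory with the
# DataFrame, which is exactly what "no copy, just pointer" means.
view = np.frombuffer(arr, dtype=arr.dtype)
assert view.__array_interface__['data'][0] == ptr   # same address
assert float(view.sum()) == 10.0                    # reads df's memory in place
```

Extracting the pointer is O(1) per column, which is why the crossing cost does not grow with row count.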

Breakthrough 1: GIL Two-Phase Design

The problem: ClickHouse wants all 16 cores. Python's GIL says "one thread at a time."

If we call CPython API per data access → multi-threaded engine degrades to single-threaded.

Phase 1: prepareColumnCache()
Hold GIL → extract buf, stride, dtype for every column
One batch of Python calls, then done
Phase 2: N parallel PythonSource streams
Each stream reads its row slice via col.buf + offset
No GIL, no Python API — pure pointer arithmetic

// Phase 1: prepareColumnCache() — GIL held
{
    py::gil_scoped_acquire acquire;
    for (auto & col : column_cache) {
        // Pandas: df[col].to_numpy() → array
        py::array arr = col.column.attr("to_numpy")();
        col.buf    = arr.data();   // raw pointer
        col.stride = arr.strides(0);
        col.row_count = arr.size();
    }
}  // GIL released — shared across all streams

// Phase 2: N parallel PythonSource streams
// each reads a row slice via raw pointer math
void insert_from_ptr(const void* ptr,
    MutableColumnPtr& col, size_t offset,
    size_t count, size_t stride) {
    // Direct memcpy — no GIL, no Python API
    const char* start = (char*)ptr
                        + offset * stride;
    col->appendRawData(start, count);
}
      

One GIL acquisition to cache all pointers → N parallel streams read raw memory
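The same two-phase shape can be mimicked in pure Python (an analogy, assuming nothing about chDB's internals): grab every column buffer once up front, then let worker threads crunch disjoint row slices. NumPy's C loops release the GIL, so the slices genuinely run in parallel:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import pandas as pd

df = pd.DataFrame({'v': np.arange(4_000_000, dtype=np.int64)})

# Phase 1 (GIL held once): cache the raw buffer for every column.
cache = {name: df[name].to_numpy() for name in df.columns}

# Phase 2 (parallel): each worker reads only its own row slice.
# NumPy's C loops release the GIL, so the slices run concurrently.
def partial_sum(start, stop):
    return int(cache['v'][start:stop].sum())

with ThreadPoolExecutor(max_workers=4) as pool:
    starts = range(0, 4_000_000, 1_000_000)
    total = sum(pool.map(lambda s: partial_sum(s, s + 1_000_000), starts))

assert total == 3_999_999 * 4_000_000 // 2   # same answer as a serial sum
```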

Breakthrough 2: C++ String Encoding

The problem: Python str stores as Latin-1, UCS-2, or UCS-4 internally (PEP 393). ClickHouse expects UTF-8.

Calling PyUnicode_AsUTF8 requires holding the GIL. Millions of strings → entire conversion becomes serial.
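The three internal widths are observable from Python itself (an illustration, not chDB code): the marginal cost per character in `sys.getsizeof` reveals which PEP 393 representation CPython chose:

```python
import sys

def storage_width(ch):
    """Marginal bytes per character in CPython's compact str layout."""
    return (sys.getsizeof(ch * 1001) - sys.getsizeof(ch)) // 1000

print(storage_width('a'))           # 1  (Latin-1 range: 1 byte/char)
print(storage_width('\u0101'))      # 2  (needs UCS-2: 2 bytes/char)
print(storage_width('\U0001F600'))  # 4  (needs UCS-4: 4 bytes/char)
```

None of these widths is UTF-8, so every string must be re-encoded on the way into ClickHouse.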

Before: Python API
GIL per string → serial
8.6s
After: C++ native
Direct memory, parallel
0.56s

15x faster on Q23 (SELECT * WHERE URL LIKE '%google%' ORDER BY EventTime)


// Step 1: Get raw pointer + kind without
// Python C API (via pybind11 non_limited_api)
getPyUnicodeUtf8(obj, data, length,
                 kind, codepoint_cnt,
                 direct_insert);

// Step 2: Convert in C++ — no GIL needed
static size_t ConvertPyUnicodeToUtf8(
    const void* input, int kind,
    size_t codepoint_cnt,
    Offsets& offsets, Chars& chars) {
  size_t offset = 0;
  switch (kind) {
  case 1: // Latin-1 (1 byte/char)
    for (size_t i = 0; i < codepoint_cnt; ++i)
      offset += utf8proc_encode_char(
        ((uint8_t*)input)[i], &chars[offset]);
    break;
  case 2: // UCS-2 (2 bytes/char)
    for (size_t i = 0; i < codepoint_cnt; ++i)
      offset += utf8proc_encode_char(
        ((uint16_t*)input)[i], &chars[offset]);
    break;
  case 4: // UCS-4 (4 bytes/char)
    /* same pattern with uint32_t */
    break;
  }
  return offset;
}
      

Breakthrough 3: Native JSON on DataFrame

The problem: Real DataFrames often have object columns filled with nested dicts. Traditional approach: flatten to columns first, lose structure.

chDB's approach:

  1. Sample object column to detect dict structure
  2. Auto-map to ClickHouse's native JSON type
  3. Query nested fields with dot notation — no flattening
  4. Full access to ClickHouse JSON functions

import pandas as pd, chdb

data = pd.DataFrame({
    'event': [
        {'type': 'click',
         'meta': {'x': 100, 'y': 200}},
        {'type': 'scroll',
         'meta': {'x': 150, 'y': 300}},
        {'type': 'click',
         'meta': {'x': 200, 'y': 400}}
    ]
})

# Dot notation on nested JSON!
result = chdb.query("""
    SELECT
        event.type,
        event.meta.x AS x_coord,
        event.meta.y AS y_coord
    FROM Python(data)
    WHERE event.type = 'click'
""", "DataFrame")

#   event.type  x_coord  y_coord
# 0      click      100      200
# 1      click      200      400
      

Perfect for log analytics, event data, API responses — no pre-processing, no schema definition
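The structure-sampling step (1) can be sketched in a few lines of Python. `infer_json_structure` is a hypothetical helper for illustration only; chDB performs the equivalent detection in C++ before mapping to the ClickHouse JSON type:

```python
import pandas as pd

def infer_json_structure(series, sample=100):
    """Recursively infer nested key -> type from a sample of dicts
    (illustrative sketch, not chDB's implementation)."""
    def merge(schema, obj, prefix=''):
        for k, v in obj.items():
            path = f'{prefix}{k}'
            if isinstance(v, dict):
                merge(schema, v, prefix=path + '.')
            else:
                schema[path] = type(v).__name__
        return schema
    schema = {}
    for obj in series.head(sample):
        if isinstance(obj, dict):
            merge(schema, obj)
    return schema

events = pd.Series([
    {'type': 'click', 'meta': {'x': 100, 'y': 200}},
    {'type': 'scroll', 'meta': {'x': 150, 'y': 300}},
])
print(infer_json_structure(events))
# {'type': 'str', 'meta.x': 'int', 'meta.y': 'int'}
```

The inferred paths are what makes `event.meta.x` addressable in SQL without flattening.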

Your Pandas Code

One Line Change

Under the Hood

Inside the Engine

import pandas as pd
df = pd.read_csv("employee_data.csv")
filtered = df[
    (df['age'] > 25) &
    (df['salary'] > 50000)
]
grouped = filtered.groupby('city')['salary'] \
    .agg(['mean', 'sum', 'count'])
result = grouped.sort_values(
    'mean', ascending=False
).head(10)

import chdb.datastore as pd  # <- one line!
df = pd.read_csv("employee_data.csv")
filtered = df[
    (df['age'] > 25) &
    (df['salary'] > 50000)
]
grouped = filtered.groupby('city')['salary'] \
    .agg(['mean', 'sum', 'count'])
result = grouped.sort_values(
    'mean', ascending=False
).head(10)
auto-generated ClickHouse SQL ↓

SELECT city,
       AVG(salary) AS mean,
       SUM(salary) AS sum,
       COUNT(salary) AS count
FROM file('employee_data.csv',
          'CSVWithNames')
WHERE age > 25
  AND salary > 50000
GROUP BY city
ORDER BY mean DESC
LIMIT 10
        
Up to 4.2x faster · 23x less memory

DataStore Architecture
Same API, same code, ClickHouse speed underneath. Predicate pushdown, column pruning, and limit propagation happen automatically; the generated SQL feeds chDB's embedded ClickHouse engine: in-process, zero network.

DataStore: Four-Layer Architecture

Layer 1 — Pandas API
ds['col'] > 25, ds.groupby(), ds.merge()
Every call returns a lazy object

Layer 2 — LazyOp Chain
Records operations without touching data
filter, select, assign, apply, groupby...

Layer 3 — Query Planner
Triggered by print(), len(), .columns
Splits pipeline into segments, routes to engine

Layer 4 — Execution
chDB segments ↔ Pandas segments
DataFrames flow via memoryview
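Layers 1 and 2 can be sketched with a toy class (illustrative names, not the real DataStore internals): every API call records a LazyOp and returns a new lazy object, and no data is touched until a trigger executes the chain:

```python
class LazyFrame:
    """Toy sketch of Layers 1-2: calls record ops, nothing executes."""
    def __init__(self, source, ops=()):
        self.source = source
        self.ops = list(ops)

    def _chain(self, op):
        # Every call returns a NEW lazy object; the data is untouched.
        return LazyFrame(self.source, self.ops + [op])

    def filter(self, expr):
        return self._chain(('filter', expr))

    def sort(self, col):
        return self._chain(('sort', col))

    def head(self, n):
        return self._chain(('head', n))

ds = LazyFrame('users.parquet').filter('age > 30').sort('salary').head(100)
print(ds.ops)
# [('filter', 'age > 30'), ('sort', 'salary'), ('head', 100)]
```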

Lazy Execution & explain()


from chdb import DataStore

ds = DataStore.from_file("users.parquet")
ds = ds[ds["age"] > 30]
ds = ds.sort_values("salary",
                     ascending=False)
ds = ds.head(100)

# Nothing has executed yet!
print(ds.explain())
      

Execution Plan
════════════════════════════════
 [1] File: users.parquet

Operations:
────────────────────────────────
 [2] [chDB] WHERE: age > 30
 [3] [chDB] ORDER BY: salary DESC
 [4] [chDB] LIMIT: 100
      

What happens at execution

  • Filter pushdown — WHERE before reading all data
  • Column pruning — only read columns you use
  • Limit propagation — stop after 100 rows
  • Single pass over the file, vectorized SIMD
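A toy version of the planning step (hypothetical helper, heavily simplified): fold the recorded operations into one SQL statement, so the WHERE and LIMIT reach the file scan together instead of running as separate passes:

```python
def compile_to_sql(source, ops):
    """Fold a recorded op chain into one ClickHouse query (sketch:
    the real planner also prunes columns and splits segments)."""
    where, order, limit = [], None, None
    for kind, arg in ops:
        if kind == 'filter':
            where.append(arg)
        elif kind == 'sort':
            order = arg            # descending, as in the example above
        elif kind == 'head':
            limit = arg
    sql = f"SELECT * FROM file('{source}')"
    if where:
        sql += ' WHERE ' + ' AND '.join(where)
    if order:
        sql += f' ORDER BY {order} DESC'
    if limit is not None:
        sql += f' LIMIT {limit}'
    return sql

ops = [('filter', 'age > 30'), ('sort', 'salary'), ('head', 100)]
print(compile_to_sql('users.parquet', ops))
# SELECT * FROM file('users.parquet') WHERE age > 30 ORDER BY salary DESC LIMIT 100
```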

Smart Caching

  • Each print() triggers execution & caches results
  • Next operation reuses the cache
  • Auto-invalidated when pipeline changes
  • No manual .cache() needed
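The caching rule can be sketched as "the recorded op chain is the cache key" (illustrative only, not DataStore's real internals), which is why invalidation needs no manual step:

```python
class CachedPipeline:
    """Sketch: the op chain is the cache key, so appending any new
    operation invalidates the cached result automatically."""
    def __init__(self):
        self._key = None
        self._result = None
        self.executions = 0

    def execute(self, ops):
        key = tuple(ops)              # fingerprint of the whole pipeline
        if key != self._key:          # pipeline changed -> recompute
            self.executions += 1
            self._key, self._result = key, f'result-of-{len(ops)}-ops'
        return self._result           # unchanged pipeline -> cached result

p = CachedPipeline()
ops = [('filter', 'age > 30')]
p.execute(ops)
p.execute(ops)                        # second access hits the cache
assert p.executions == 1
p.execute(ops + [('head', 10)])       # new op -> cache invalidated, re-run
assert p.executions == 2
```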

Segment Execution: Best of Both Worlds


ds = DataStore.from_file("data.parquet")
ds = ds[ds["age"] > 25]                    # → chDB segment
ds["name_title"] = ds["name"].str.title()   # → Pandas segment (str.title)
ds = ds.sort_values("age")                  # → chDB segment
  
Segment 1 · chDB
FROM file() WHERE age > 25
Segment 2 · Pandas
name.str.title()
Segment 3 · chDB
via Python() ORDER BY age
  • Auto-splits the pipeline at SQL/Pandas boundaries
  • ClickHouse handles what it can; Pandas handles the rest (300+ CH functions + full Pandas ecosystem)
  • Seamless handoff via Python memoryview — for numeric data, effectively zero-copy
  • Tip: keep Pandas-only ops toward the end for best optimization

The query planner sees through the entire pipeline. You just write Pandas.
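The splitting logic can be sketched as a greedy pass over the recorded ops (a simplification; the op classification below is assumed for illustration): consecutive SQL-translatable ops merge into one chDB segment, anything else becomes a Pandas segment:

```python
SQL_OPS = {'filter', 'sort', 'groupby', 'head'}   # assumed SQL-translatable

def split_segments(ops):
    """Greedily group consecutive ops by engine: SQL-able work goes
    to chDB, everything else to Pandas."""
    segments = []
    for kind, arg in ops:
        engine = 'chdb' if kind in SQL_OPS else 'pandas'
        if segments and segments[-1][0] == engine:
            segments[-1][1].append((kind, arg))     # extend current segment
        else:
            segments.append((engine, [(kind, arg)]))  # start a new segment
    return segments

ops = [('filter', 'age > 25'),
       ('apply', 'name.str.title()'),   # no SQL translation assumed here
       ('sort', 'age')]
segs = split_segments(ops)
assert [engine for engine, _ in segs] == ['chdb', 'pandas', 'chdb']
```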

Unified Data Sources

DataStore inherits all of ClickHouse's data source support:


# Local files — Parquet, CSV, JSON, ORC...
ds = DataStore.from_file("events.parquet")

# Remote S3
ds = DataStore.uri("s3://bucket/path/to/data.parquet")

# Query a PostgreSQL table
ds = DataStore.from_sql("SELECT * FROM users", engine="postgresql://...")

# Mix sources in a single query — ClickHouse optimizes the execution
ds = DataStore.from_sql("""
    SELECT u.name, e.event_type, e.timestamp
    FROM file('users.parquet') u
    JOIN postgresql('host:5432', 'db', 'events', 'user', 'pass') e
    ON u.id = e.user_id
""")

ds = ds[ds["event_type"] == "purchase"].sort_values("timestamp")
print(ds)  # Lazy — executes here
  

80+ formats. Local, remote, in-memory, cross-database JOINs. Same Pandas-style pipeline. No ETL.

Benchmark: chDB vs Pandas — Speed & Memory (50M Rows)

Scenario Pandas chDB Speedup Pandas Mem chDB Mem Mem Saved
T01 Read Parquet + COUNT 2.07s 0.28s 7.4x 5,310 MB 234 MB 22.7x
T02 Filter + GroupBy + SUM 2.88s 0.69s 4.1x 7,161 MB 1,216 MB 5.9x
T03 Multi-key GroupBy + 4 aggs 5.31s 1.88s 2.8x 7,309 MB 4,222 MB 1.7x
T04 JOIN (50M × 10K) + agg 6.76s 0.87s 7.7x 10,473 MB 1,824 MB 5.7x
T05 Top-10 per group (window) 35.05s 9.29s 3.8x 5,797 MB 2,715 MB 2.1x
T06 P95 quantile by region 4.14s 0.68s 6.1x 3,699 MB 1,472 MB 2.5x
T07 Derived cols + filter + Top-1000 2.93s 1.14s 2.6x 7,374 MB 2,961 MB 2.5x
T08 Count Distinct (exact) 7.57s 1.38s 5.5x 3,650 MB 3,212 MB 1.1x
T09 In-memory DataFrame query 2.89s 2.17s 1.3x 6,841 MB 5,550 MB 1.2x
T10 Time-series monthly agg 4.07s 0.68s 6.0x 3,216 MB 1,679 MB 1.9x
Hardware: AWS EC2 c6a.4xlarge (16 vCPU AMD EPYC, 32 GB) · Method: best of 3 runs, cold start · Python: 3.14 · Memory: peak RSS via /proc VmHWM

chDB wins 10/10 on speed (geomean 4.2x) and 10/10 on memory (geomean 2.9x). T04 JOIN: Pandas peaks at 10.5 GB → OOM on 8 GB laptop; chDB uses 1.8 GB → OK.

Output Zero-Copy: chDB vs DuckDB

Query ClickBench hits (1M rows) → export to Pandas DataFrame

DuckDB
3.47s
chDB v4
2.64s
 24% faster

Why chDB wins on output

  • Direct type mapping — ClickHouse columns → NumPy dtypes, no intermediate format
  • Memory sharing — numeric buffers shared via memoryview, not copied
  • SIMD batch conversion — result chunks converted with vectorized routines
  • DuckDB still serializes through its own Arrow/Pandas bridge
Test: SELECT * FROM file('hits_0.parquet') → DataFrame · Hardware: AWS EC2 c6a.4xlarge · Method: best of 3 runs

AI Writes Pandas. chDB Runs It Fast.

  • Pandas is the #1 data API in LLM training data — models generate excellent Pandas code
  • DataStore executes it on ClickHouse automatically
  • No need to teach your AI a new API — works with Hex AI agents, Claude, ChatGPT

The future workflow:

AI writes
Pandas code
DataStore
compiles to SQL
ClickHouse
executes fast
"Hey Claude, group sales by region and show top 10."

↓ generates standard Pandas code ↓
↓ runs on ClickHouse via DataStore ↓
up to 247x faster

Summary: What chDB v4 Delivers

4.2x

geometric mean speedup over Pandas
across 10 real-world scenarios (50M rows)

24%

faster than DuckDB
on DataFrame export (same dataset, same hardware)

2.9 – 23x

less peak memory vs Pandas
Pandas OOM on 8 GB → chDB runs fine

0 ser/deser

complete zero-copy loop
DataFrame ↔ ClickHouse ↔ DataFrame

For ClickHouse users: chDB brings the engine you already trust into Python notebooks, Hex, Colab — same SQL dialect, same functions, same performance. No server to deploy.

Not Just Python

chDB runs as a native library in 9 languages:

Python
pip install chdb
Go
github.com/chdb-io/chdb-go
Rust
cargo add chdb-rust
Ruby
gem install chdb-ruby
Node.js
npm i chdb
Bun
chdb-io/chdb-bun
.NET
dotnet add package chdb
Java
chdb-io/chdb-java
Zig
chdb-io/chdb-zig

Runs everywhere — no server, no infra:

Jupyter Notebook Google Colab Hex marimo AWS Lambda Cloudflare Workers Docker CLI scripts

Getting Started — 30 Seconds


pip install "chdb>=4.0.0"
  

import chdb.datastore as pd  # swap one import, done

df = pd.read_csv("your_data.csv")

result = (
    df[df['amount'] > 100]
    .groupby('category')
    .agg({'amount': 'sum', 'order_id': 'count'})
    .sort_values('amount', ascending=False)
    .head(10)
)

result.plot(kind='bar')  # Still a real DataFrame
  

No server. No config. No new API to learn.

The Road Ahead

  • Native Python UDF — in-process, no subprocess
    Register Python functions as ClickHouse UDFs via @chdb.func decorator, executed inside the vectorized pipeline with full type mapping (PR #9)
  • Free-threading build (no-GIL) — Python 3.14+
    chDB compiled with free-threading support, enabling true multi-threaded Python UDF execution within ClickHouse's parallel pipeline (PR #12)
  • PyTorch DataLoader integration
    Efficient batching, shuffling, and streaming for ML training
  • Hybrid execution
    Seamlessly scale from laptop to ClickHouse Cloud

Thank You!

GitHub: github.com/chdb-io/chdb
Blog: Journey to Zero-Copy
Docs: clickhouse.com/docs/chdb
Blog: chDB 4.0 + Hex
PyPI: pypi.org/project/chdb
Discord: discord.gg/D2Daa2fM5K

Q&A

@auxten · auxten.com