Porting ClickHouse for High-Performance Apps
Auxten Wang · Technical Director @ ClickHouse
PythonAsia 2026 · 30 min
Who has used Pandas?
Who has waited more than 10 seconds
for a Pandas operation?
What if I told you it could be
247x faster?
@auxten · auxten.com
On an 8GB laptop, that JOIN will OOM and kill your process.
What if you could run the same Pandas code, up to 247x faster, using 3-23x less memory?
pip install chdb
Think of it as: SQLite for analytics, but 100x faster.
ClickBench · 100M rows of web analytics · AWS c6a.4xlarge (16 vCPU, 32 GB) · Combined score across 43 queries
import chdb.datastore as pd # one line change!
df = pd.DataFrame({
    'user_id': range(1_000_000),
    'category': (['A', 'B', 'C'] * 333_334)[:1_000_000],  # trim to exactly 1M rows
    'value': [i * 0.1 for i in range(1_000_000)]
})
# Same Pandas API — ClickHouse engine underneath
result = (
    df.groupby('category')
      .agg({'value': ['count', 'mean']})
      .sort_values(('value', 'count'), ascending=False)
)
print(result)  # Triggers lazy execution
Pandas syntax. ClickHouse speed. Zero setup.
# Local files
chdb.query("SELECT * FROM file('data.parquet', Parquet)")
chdb.query("SELECT * FROM file('data.csv', CSVWithNames)")
# Remote data
chdb.query("SELECT * FROM url('https://example.com/data.json')")
chdb.query("SELECT * FROM s3('s3://bucket/data.parquet')")
# Python objects — dict, DataFrame, PyArrow
data = {'a': [1, 2, 3], 'b': ['x', 'y', 'z']}
chdb.query("SELECT * FROM Python(data)")
# Databases
chdb.query("SELECT * FROM postgresql('host', 'db', 'table', 'user', 'pw')")
80+ formats. Local, remote, in-memory. Same SQL.
Three data sources, one query. No ETL.
Lazy evaluation · Smart engine selection · Intermediate caching
SELECT city,
       AVG(salary) AS mean,
       SUM(salary) AS sum,
       COUNT(salary) AS count
FROM file('employee_data.csv', CSVWithNames)
WHERE age > 25
  AND salary > 50000
GROUP BY city
ORDER BY mean DESC
LIMIT 10
What if your pipeline mixes SQL-optimizable ops and Python-only ops?
You write Pandas. The engine decides what runs where.
import chdb.datastore as ds
# Connect multiple sources
sales = ds.read_csv("local_sales.csv")
customers = ds.DataStore("clickhouse",
                         host="localhost", table="customer_info")

# Join and analyze: lazy, cross-source
result = (
    sales
    .merge(customers, on="customer_id")
    .query("purchase_date >= '2025-01-01'")
    .groupby("city")
    .agg({"revenue": "sum", "order_id": "count"})
    .sort_values("revenue", ascending=False)
    .head(20)
)
print(result)  # Execution triggers here!
Local CSV + remote ClickHouse. One Pandas-style pipeline.
chDB runs as a native library in 9 languages:
Runs everywhere — no server, no infra:
| Scenario | Pandas | chDB | Speedup |
|---|---|---|---|
| T01 Read Parquet + COUNT | 2.07s | 0.28s | 7.4x |
| T02 Filter + GroupBy + SUM | 2.88s | 0.69s | 4.1x |
| T03 Multi-key GroupBy + 4 aggs | 5.31s | 1.88s | 2.8x |
| T04 JOIN (50M × 10K) + agg | 6.76s | 0.87s | 7.7x |
| T05 Top-10 per group (window) | 35.05s | 9.29s | 3.8x |
| T06 P95 quantile by region | 4.14s | 0.68s | 6.1x |
| T07 Derived cols + filter + Top-1000 | 2.93s | 1.14s | 2.6x |
| T08 Count Distinct (exact) | 7.57s | 1.38s | 5.5x |
| T09 In-memory DataFrame query | 2.89s | 2.17s | 1.3x |
| T10 Time-series monthly agg | 4.07s | 0.68s | 6.0x |
chdb v4.1.6 vs pandas 3.0.1 · Ubuntu, 16 CPUs, 33GB RAM, Python 3.14 · chDB wins 10/10, mean 4.2x
Peak RSS via Linux VmHWM (kernel-level, unfakeable) · Geomean memory saving: 2.9x
| 50M-row Operation | Pandas Peak | chDB Peak | 8GB Laptop |
|---|---|---|---|
| JOIN + aggregation | 10.5 GB | 1.8 GB | Pandas OOM · chDB OK |
| Derived columns + filter | 7.4 GB | 3.0 GB | Pandas OOM · chDB OK |
| Multi-key GroupBy | 7.3 GB | 4.2 GB | Pandas OOM · chDB OK |
When Pandas OOMs, the OS kills your process — no recovery.
When chDB hits a limit, it throws an exception you can catch.
The future workflow:
AI writes Pandas → chDB makes it fast
"Hey Claude, group sales by region and show me the top 10."
↓ generates standard Pandas code ↓
↓ runs on ClickHouse via DataStore ↓
↓ up to 247x faster, 3-23x less memory ↓
pip install "chdb>=4.0.0"
import chdb.datastore as pd # swap one import, done
# Your existing workflow — unchanged
df = pd.read_csv("your_data.csv")
result = (
    df[df['amount'] > 100]
    .groupby('category')
    .agg({'amount': 'sum', 'order_id': 'count'})
    .sort_values('amount', ascending=False)
    .head(10)
)
result.plot(kind='bar') # Still a real DataFrame
No server. No config. No new API to learn.