PyCon Singapore 202601 / 18

Giving Your AI Agent a Database Brain

A local SQL database your agent can actually query — with chDB.

Auxten Wang · Technical Director @ ClickHouse · Creator of chDB

SQL
inside Python
About me02 / 18
Auxten Wang

Auxten Wang

@auxten · auxten.com

Technical Director @ ClickHouse
Creator of chDB — acquired by ClickHouse in 2024
Ex-Principal Engineer @ Shopee
Cofounder & CTO, CovenantSQL
Baidu · Qihoo 360 · 4Paradigm
The pain03 / 18

A familiar agent failure

Yesterday

It hit a constraint and found the fix that worked.

Today

Same task. It reads the prompt — not yesterday's fix.

Result

Repeats the failed plan. Burns tool calls on retries.

The model isn't dumb. The right context just wasn't queryable in time.
The idea04 / 18

Your agent's context is really a pile of data

Context engineering just means choosing what the agent sees. Four kinds of data:

Instructions

Rules, constraints, preferences.

Working data

Files, tables, API results, logs.

Memory

Decisions, facts, prior solutions.

Eval history

What worked, what failed, what changed.

Recall = find the few facts that matter now. A query, not a bigger prompt.
Why a database05 / 18

"Can't I just use what I already have?"

SQLite

Great for tiny state. No columnar scan, no vectors, no remote files.

Pandas

Great in memory. Dies past RAM. Can't query files you never loaded.

DuckDB

Closest match — but fewer formats and no mature cloud.

Vector DB

Similarity only. Can't filter + join + aggregate + audit in one query.

Vector search is just an index. Agent memory is the whole database around it.
ClickHouse06 / 18

ClickHouse in 60 seconds

Analytical database

Open-source. Fast SQL over huge tables: logs, events, metrics.

Columnar engine

Stores by column, so scans over millions of rows stay fast.

More than storage

A real engine: SQL, JSON, arrays, text search, vector distance.

OLAP, not OLTP: built to scan millions of rows. Runs Anthropic, Cursor, Cloudflare, Shopee.
chDB07 / 18

chDB: a rocket engine on a bicycle

The ClickHouse engine, embedded in your Python process — SQLite-simple, 100× faster for analytics.

# pip install chdb
from chdb import session as chs
sess = chs.Session("./agent")
sess.query("SELECT count() FROM file('log.pq')")
Same engine: ClickHouse SQL, JOINs, vectors.
Persistent: local tables on disk for memory.

your agent code

chDB · ClickHouse engine

runs in the same process

files · DataFrames · local tables

Local like SQLite. Analytical like ClickHouse.
chDB08 / 18
A rocket engine on a bicycle
chDB straps ClickHouse's engine onto Python — like this.
chDB API09 / 18

chDB drops into your Python data stack

70+ data formats

Parquet, CSV, JSON, Arrow, ORC — read and write, no converters.

Zero-copy Pandas & Arrow

Query a DataFrame or Arrow table in place — no copy, no load.

Python

Pandas-style API

Prefer DataFrames to raw SQL? chDB speaks that too.

It fits your existing Pandas / Arrow code — you don't rebuild around it.
chDB v410 / 18

Your Pandas Code

One Line Change

Under the Hood

Inside the Engine

import pandas as pd
df = pd.read_csv("employee_data.csv")
filtered = df[
(df['age'] > 25) &
(df['salary'] > 50000)
]
grouped = filtered.groupby('city')['salary'] \
.agg(['mean', 'sum', 'count'])
result = grouped.sort_values(
'mean', ascending=False
).head(10)
import chdb.datastore as pd # <- rocket!
df = pd.read_csv("employee_data.csv")
filtered = df[
(df['age'] > 25) &
(df['salary'] > 50000)
]
grouped = filtered.groupby('city')['salary'] \
.agg(['mean', 'sum', 'count'])
result = grouped.sort_values(
'mean', ascending=False
).head(10)
auto-generated sql ↓

SELECT city,
       AVG(salary) AS mean,
       SUM(salary) AS sum,
       COUNT(salary) AS count
FROM file('employee_data.csv',
          'CSVWithNames')
WHERE age > 25
  AND salary > 50000
GROUP BY city
ORDER BY mean DESC
LIMIT 10
          
Up to 247x faster · 23x less memory
chDB Architecture
Same API. Same code. ClickHouse speed underneath. Predicate pushdown · Column pruning · Limit propagation — automatic SQL feeds directly into chDB's embedded ClickHouse engine — in-process, zero network
Multi-source SQL11 / 18

Join anything with SQL

chdb.query("""
SELECT name, age, purchase_total
FROM url('https://api.co/users.parquet') users HTTP
LEFT JOIN file('orders.csv') orders CSV
ON users.uid = orders.uid
LEFT JOIN Python(processed_df) enriched DataFrame
ON enriched.uid = users.uid
""")

Three sources, one query — no load step.

same join, Pandas-style (runs on chDB):
import chdb.datastore as pd
(pd.read_parquet('https://api.co/users.parquet') HTTP
.merge(pd.read_csv('orders.csv'), on='uid') CSV
.merge(pd.DataFrame(enriched_df), on='uid')) DataFrame
Agent memory12 / 18

What good agent memory needs

Memory the agent can filter, rank, and grow — not a prompt blob.

Memories are rows

Filter, rank, and audit them with plain SQL.

Recall = Similarity + Filter

Meaning and rules in one query.

Local now, cloud later

Same SQL from laptop to ClickHouse.

Memory lifecycle13 / 18

Agent memory has a lifecycle

remember

explicit decision

recall

similarity + filter

conflict?

similar but different

revise

better belief wins

Old memory

"Use Poetry for this repo."

Revised memory

"This repo now standardizes on uv." (old row → conflicted)

Recall SQL14 / 18

Recall is one SQL query

Rank memories by meaning, keep the active ones, pinned first.

cosineDistance

how close two meanings are, as a number (0 = identical)

Need project, tags, or task-history? Add WHERE / JOIN lines — same query.

SELECT
  content,
  1 - cosineDistance(embedding, query_vec) AS meaning
FROM memories
WHERE status = 'active'
ORDER BY pinned DESC, meaning DESC
LIMIT 8
same recall, Pandas-style:
import chdb.datastore as pd
m = pd.DataStore(
    table="memories", database="./agent_context")
m = m.assign(
    meaning=1 - pd.F.cosineDistance(
        m.embedding, pd.F.array(*query_vec)))
m[m.status == "active"].sort_values(
    ["pinned", "meaning"], ascending=False).head(8)
Local hot path15 / 18

Recall sits inside the agent loop

in-process recall

instant

remote recall in the loop

network tax

In the loop

Recall runs many times per turn — jitter becomes think-time jitter.

Retries hurt twice

A timed-out call is retried — each retry re-sends the whole prompt.

Local wins

In-process recall is orders of magnitude faster. No tail, no retries.

Scale path16 / 18

Start local, scale to ClickHouse

Day 1

chDB session on a laptop or agent runtime

Day 100

ClickHouse Server or Cloud when memory is shared or large

# Day 1 — query local memory
sess.query("SELECT content FROM memories WHERE project = 'checkout'")

# Day 100 — repoint the same name at the cluster (SELECT unchanged)
sess.query("""CREATE OR REPLACE VIEW memories AS
              SELECT * FROM remote('cloud:9440', db, memories)""")

Honest note: remote reads pay the network cost again — keep the hot path local.

ClickMem17 / 18

Meet ClickMem

That whole pattern — lifecycle, recall, local-first — packaged as an open-source project on chDB. Build the core yourself:

create

memories table

insert

explicit decisions

recall

similarity + filter

evaluate

which memory helped?

sess = chs.Session("./agent_context")
sess.query("CREATE TABLE memories (...) ENGINE = MergeTree ORDER BY id")
sess.query("INSERT INTO memories VALUES (...)")
sess.query("""
  SELECT content FROM memories
  WHERE created_at > now() - INTERVAL 30 DAY  -- filter
  ORDER BY cosineDistance(embedding, ?) LIMIT 8  -- similarity
""")

Embeddings: Qwen/Qwen3-Embedding-0.6B via sentence-transformers, runs locally.

Close18 / 18

Three takeaways

01

Context is data.

Filter it, rank it, join it, evaluate it.

02

Recall is a query.

Semantic meaning + simple rules, in one SQL.

03

Local first.

chDB now; ClickHouse when shared or large.

pip install chdb
github.com/chdb-io/chdb  ·  github.com/auxten/clickmem
Q&A