DuckDB: The Surprising Swiss Army Knife of Data Processing
DuckDB: Making the Impossible Simple
Last week, someone asked me to analyze 50GB of CSV files scattered across S3. Five years ago, this would’ve meant spinning up a Spark cluster. Today? One line of DuckDB SQL.
SELECT customer_id, SUM(amount) as total_spent
FROM read_csv_auto('s3://mybucket/transactions/*.csv')
WHERE transaction_date > '2025-01-01'
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 100;
No cluster. No downloads. No loading. Just query the files where they sit.
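One practical note: querying S3 in place does need the httpfs extension and credentials configured. A minimal Python sketch of the setup – the bucket path, region, and keys below are placeholders:
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
# Placeholder credentials -- swap in your own region and keys (or rely on an IAM role)
con.execute("SET s3_region = 'us-east-1'")
con.execute("SET s3_access_key_id = 'YOUR_KEY_ID'")
con.execute("SET s3_secret_access_key = 'YOUR_SECRET'")

top_customers = con.execute("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM read_csv_auto('s3://mybucket/transactions/*.csv')
    WHERE transaction_date > '2025-01-01'
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 100
""").fetchall()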
This is DuckDB’s superpower: it makes difficult data tasks surprisingly simple.
What Is DuckDB?
Think of DuckDB as an analytically gifted SQLite. It’s an in-process SQL database designed for analytics, but that description massively undersells it. DuckDB is more like a Swiss Army knife for data – a single tool that replaces dozens of specialized ones.
Key characteristics:
- Embedded: No server, just import and use
- Columnar storage: Optimized for analytical queries
- Zero dependencies: Single binary, works everywhere
- Vectorized execution: Processes data in chunks for speed
- SQL-first: If you know SQL, you know DuckDB
But here’s what the documentation doesn’t explain: DuckDB is great at jobs you wouldn’t expect a database to handle at all.
Surprising Use Case #1: The Universal File Reader
Forget writing parsers. DuckDB reads everything:
-- Query CSV files
SELECT * FROM 'data.csv';
-- Query JSON files
SELECT * FROM 'api_response.json';
-- Query Parquet files
SELECT * FROM 'analytics/*.parquet';
-- Query Excel files (via the spatial extension's st_read)
SELECT * FROM st_read('report.xlsx');
-- Mix and match!
SELECT
csv.user_id,
json.preferences,
parquet.transaction_history
FROM 'users.csv' csv
JOIN 'preferences.json' json ON csv.user_id = json.user_id
JOIN 'transactions/*.parquet' parquet ON csv.user_id = parquet.user_id;
No schema definition. No CREATE TABLE. DuckDB infers everything and just works.
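If you want to check what was inferred before trusting it, DESCRIBE works against any of these sources. A quick Python sketch using the same illustrative file names as above:
import duckdb

# Show the inferred column names and types without scanning the whole file
print(duckdb.sql("DESCRIBE SELECT * FROM 'users.csv'"))
print(duckdb.sql("DESCRIBE SELECT * FROM 'transactions/*.parquet'"))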
Surprising Use Case #2: Git Repository Analyst
Want to analyze your Git history? With a suitable extension, DuckDB can query it directly:
-- Install a git extension (not part of core DuckDB)
INSTALL git;
LOAD git;
-- Find most modified files
SELECT
file_path,
COUNT(*) as modifications,
COUNT(DISTINCT author_name) as unique_authors
FROM git_log('.')
WHERE file_path LIKE '%.py'
GROUP BY file_path
ORDER BY modifications DESC
LIMIT 20;
-- Analyze commit patterns
SELECT
DATE_TRUNC('hour', commit_date) as hour,
COUNT(*) as commits
FROM git_log('.')
WHERE commit_date > NOW() - INTERVAL '30 days'
GROUP BY hour
ORDER BY hour;
This replaced a 200-line Python script with 10 lines of SQL.
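If you’d rather not depend on an extension at all, the same kind of analysis works by dumping git log into a delimited file and pointing DuckDB at it. A minimal sketch – the format string and file name are my own choices:
import duckdb
import subprocess

# Dump hash, author, and ISO date as tab-separated lines (%x09 is a tab)
log = subprocess.run(
    ["git", "log", "--pretty=format:%H%x09%an%x09%aI"],
    capture_output=True, text=True, check=True,
).stdout
with open("commits.tsv", "w") as f:
    f.write(log)

duckdb.sql("""
    SELECT author, COUNT(*) AS commits
    FROM read_csv('commits.tsv', delim='\t', header=false, quote='',
                  columns={'hash': 'VARCHAR', 'author': 'VARCHAR', 'commit_date': 'VARCHAR'})
    GROUP BY author
    ORDER BY commits DESC
    LIMIT 10
""").show()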
Surprising Use Case #3: The Pandas Replacement
Data scientists, take note. DuckDB often outperforms pandas while using standard SQL:
# Old way with pandas
import pandas as pd
df1 = pd.read_csv('sales_2023.csv')
df2 = pd.read_csv('sales_2024.csv')
combined = pd.concat([df1, df2])
result = combined.groupby('product_id').agg({
'quantity': 'sum',
'revenue': 'sum'
}).reset_index()
result.to_parquet('summary.parquet')
# New way with DuckDB
import duckdb
duckdb.sql("""
COPY (
SELECT
product_id,
SUM(quantity) as total_quantity,
SUM(revenue) as total_revenue
FROM read_csv_auto(['sales_2023.csv', 'sales_2024.csv'])
GROUP BY product_id
) TO 'summary.parquet'
""")
Same result. In my runs it was roughly 10x faster and used about 5x less memory; your numbers will vary with file size and schema.
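It also doesn’t have to be either/or: DuckDB can query a pandas DataFrame that already exists in your Python session, referenced by variable name, with no loading or registration step. A small sketch, reusing the file name from the example above:
import duckdb
import pandas as pd

sales = pd.read_csv('sales_2024.csv')

# DuckDB resolves `sales` to the local DataFrame automatically
summary = duckdb.sql("""
    SELECT product_id, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY product_id
""").df()  # .df() hands the result back as a pandas DataFrame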
Surprising Use Case #4: Log File Detective
System logs giving you trouble? DuckDB handles unstructured data brilliantly:
-- Parse nginx logs
WITH parsed_logs AS (
SELECT
regexp_extract(line, '(\d+\.\d+\.\d+\.\d+)', 1) as ip,
regexp_extract(line, '\[([^\]]+)\]', 1) as timestamp,
regexp_extract(line, '"(\w+) ([^"]+)"', 1) as method,
regexp_extract(line, '"(\w+) ([^"]+)"', 2) as path,
regexp_extract(line, '" (\d+) ', 1)::INT as status_code,
regexp_extract(line, '" \d+ (\d+)', 1)::INT as response_size
FROM read_csv_auto('access.log', sep='\n', columns={'line': 'VARCHAR'})
)
SELECT
status_code,
COUNT(*) as requests,
AVG(response_size) as avg_size,
COUNT(DISTINCT ip) as unique_ips
FROM parsed_logs
WHERE method = 'GET'
GROUP BY status_code
ORDER BY requests DESC;
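If the single-column CSV trick feels fragile (quote characters in log lines can confuse a CSV reader), another option is to read the raw lines in Python and let DuckDB do only the regex work. A sketch, assuming the same access.log:
import duckdb
import pandas as pd

# Hand raw log lines to DuckDB as a DataFrame; no CSV parsing involved
lines = pd.DataFrame({"line": open("access.log", errors="replace").read().splitlines()})

status_counts = duckdb.sql(r"""
    SELECT
        TRY_CAST(regexp_extract(line, '" (\d+) ', 1) AS INT) AS status_code,
        COUNT(*) AS requests
    FROM lines
    GROUP BY status_code
    ORDER BY requests DESC
""").df()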
Surprising Use Case #5: API Response Wrangler
Got nested JSON from an API? DuckDB flattens it effortlessly:
-- Analyzing GitHub API responses
WITH api_data AS (
SELECT * FROM read_json_auto('github_repos.json')
)
SELECT
repo->>'name' as repo_name,
(repo->>'stargazers_count')::INT as stars,
owner->>'login' as owner,
(repo->>'created_at')::TIMESTAMP as created_date,
unnest(topics) as topic
FROM api_data
WHERE (repo->>'stargazers_count')::INT > 1000
ORDER BY stars DESC;
Surprising Use Case #6: The Local Data Lake
Transform your laptop into a mini data lake:
-- Create views over various data sources
CREATE VIEW sales AS
SELECT * FROM read_parquet('s3://bucket/sales/*.parquet');
-- (needs the postgres extension: INSTALL postgres; LOAD postgres;)
CREATE VIEW customers AS
SELECT * FROM postgres_scan('dbname=prod', 'public', 'customers');
CREATE VIEW products AS
SELECT * FROM 'local_products.csv';
-- Query across all sources
SELECT
c.customer_name,
p.product_name,
SUM(s.amount) as total_spent
FROM sales s
JOIN customers c ON s.customer_id = c.id
JOIN products p ON s.product_id = p.id
WHERE s.sale_date > '2024-01-01'
GROUP BY c.customer_name, p.product_name
ORDER BY total_spent DESC;
No ETL. No data movement. Just federated queries across formats and locations.
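Since views are just stored SQL, the whole definition can live in one DuckDB file on disk and be reopened whenever you need it. A sketch – the database file name is mine, the source files are the ones above:
import duckdb

# Persist the view definitions; the underlying data never moves
con = duckdb.connect('datalake.duckdb')
con.execute("CREATE OR REPLACE VIEW products AS SELECT * FROM 'local_products.csv'")
# ...same pattern for the S3 and Postgres views shown above...
con.close()

# A later session sees the same views
con = duckdb.connect('datalake.duckdb')
print(con.execute("SELECT COUNT(*) FROM products").fetchall())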
Performance That Surprises
Let’s talk numbers. Processing a 10GB CSV file:
# PostgreSQL with COPY (after creating table)
Time: 4 minutes 32 seconds
# Pandas read_csv
Time: 2 minutes 18 seconds
Memory: 15GB
# DuckDB
Time: 31 seconds
Memory: 2.1GB
But here’s the kicker – with DuckDB, you can query the file without loading it:
-- This runs in 1.2 seconds and uses 200MB RAM
SELECT COUNT(*), AVG(amount)
FROM read_csv_auto('huge_file.csv')
WHERE category = 'Electronics';
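Numbers like these depend heavily on hardware, file layout, and how many columns the query touches, so treat them as directional. Timing the same query on your own data takes only a few lines:
import duckdb
import time

start = time.time()
duckdb.sql("""
    SELECT COUNT(*), AVG(amount)
    FROM read_csv_auto('huge_file.csv')
    WHERE category = 'Electronics'
""").fetchall()
print(f"Elapsed: {time.time() - start:.1f}s")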
When DuckDB Isn’t the Answer
DuckDB excels at analytical workloads, but it’s not for everything:
Don’t use DuckDB for:
- High-concurrency transactional systems (use PostgreSQL)
- Real-time streaming with millisecond latency (use Kafka + Flink)
- Shared, multi-writer production databases (use PostgreSQL)
- Distributed processing of petabytes (use Spark/Presto)
Definitely try DuckDB for:
- Data exploration and analysis
- ETL/ELT pipelines
- Processing files without loading
- Local analytical applications
- Prototyping data pipelines
- One-off data transformations
- Embedded analytics in applications
Real-World Pipeline Transformation
Here’s a before/after from a real project:
Before (Apache Airflow + Spark):
- Download files from S3 (7 minutes)
- Load into Spark DataFrame (3 minutes)
- Transform and aggregate (2 minutes)
- Write to PostgreSQL (4 minutes)
- Total: 16 minutes, $15 in compute costs
After (DuckDB):
-- Entire pipeline in one script (needs the httpfs and postgres extensions)
ATTACH 'dbname=analytics' AS analytics (TYPE postgres);
INSERT INTO analytics.public.customer_metrics
WITH cleaned_data AS (
SELECT
customer_id,
DATE_TRUNC('day', timestamp) as date,
SUM(amount) as daily_total,
COUNT(*) as transaction_count
FROM read_parquet('s3://bucket/transactions/*.parquet')
WHERE amount > 0
AND timestamp > CURRENT_DATE - INTERVAL '90 days'
GROUP BY customer_id, date
)
SELECT * FROM cleaned_data
WHERE daily_total > 100;
- Total: 1 minute, $0(ish) in compute costs
Getting Started
Installation is refreshingly simple:
# Python
pip install duckdb
# CLI
brew install duckdb # macOS
Your first query:
import duckdb
# Query any file
result = duckdb.sql("SELECT * FROM 'data.csv' LIMIT 5").fetchall()
# Or use it like a real database
con = duckdb.connect('my_analysis.db')
con.execute("CREATE TABLE results AS SELECT * FROM 'huge_dataset.parquet'")
The Hidden Magic
What makes DuckDB special isn’t just speed – it’s removing friction:
- No more writing custom parsers for each file format
- No more moving data between systems
- No more managing clusters for simple aggregations
- No more choosing between SQL and DataFrame APIs
- No more explaining why you need a Spark cluster for a 5GB file
Instead, you just write SQL and it works.
Conclusion
DuckDB represents a shift in how we think about data processing. It’s not trying to replace your production database or your streaming platform. It’s filling a gap we didn’t know existed – the space between “too big for Excel” and “too small for Spark.”
The next time you’re faced with messy CSVs, nested JSON, or files scattered across S3, don’t spin up a cluster. Don’t write a complex Python script. Just point DuckDB at your data and write some SQL.
Because sometimes the best solution isn’t the most sophisticated one – it’s the one that just works.
Have you tried DuckDB for unconventional use cases? What surprised you?
Originally shared on LinkedIn