DuckDB: The Surprising Swiss Army Knife of Data Processing
DuckDB: Making the Impossible Simple
Last week, someone asked me to analyze 50GB of CSV files scattered across S3. Five years ago, this would’ve meant spinning up a Spark cluster. Today? One line of DuckDB SQL.
SELECT customer_id, SUM(amount) as total_spent
FROM read_csv_auto('s3://mybucket/transactions/*.csv')
WHERE transaction_date > '2025-01-01'
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 100;
No cluster. No downloads. No loading. Just query the files where they sit.
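One practical note: querying S3 in place does need the httpfs extension and credentials configured. A minimal Python sketch of the setup – the bucket path, region, and keys below are placeholders:
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
# Placeholder credentials -- swap in your own region and keys (or rely on an IAM role)
con.execute("SET s3_region = 'us-east-1'")
con.execute("SET s3_access_key_id = 'YOUR_KEY_ID'")
con.execute("SET s3_secret_access_key = 'YOUR_SECRET'")

top_customers = con.execute("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM read_csv_auto('s3://mybucket/transactions/*.csv')
    WHERE transaction_date > '2025-01-01'
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 100
""").fetchall()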
This is DuckDB’s superpower: it makes difficult data tasks surprisingly simple.
What Is DuckDB?
Think of DuckDB as an analytically gifted SQLite. It’s an in-process SQL database designed for analytics, but that description massively undersells it. DuckDB is more like a Swiss Army knife for data – a single tool that replaces dozens of specialized ones.
Key characteristics:
- Embedded: No server, just import and use
- Columnar storage: Optimized for analytical queries
- Zero dependencies: Single binary, works everywhere
- Vectorized execution: Processes data in chunks for speed
- SQL-first: If you know SQL, you know DuckDB
But here’s what the documentation doesn’t explain: DuckDB is great at jobs you wouldn’t expect a database to handle at all.
Surprising Use Case #1: The Universal File Reader
Forget writing parsers. DuckDB reads everything:
-- Query CSV files
SELECT * FROM 'data.csv';
-- Query JSON files
SELECT * FROM 'api_response.json';
-- Query Parquet files
SELECT * FROM 'analytics/*.parquet';
-- Query Excel files (via the spatial extension's st_read)
SELECT * FROM st_read('report.xlsx');
-- Mix and match!
SELECT
csv.user_id,
json.preferences,
parquet.transaction_history
FROM 'users.csv' csv
JOIN 'preferences.json' json ON csv.user_id = json.user_id
JOIN 'transactions/*.parquet' parquet ON csv.user_id = parquet.user_id;
No schema definition. No CREATE TABLE. DuckDB infers everything and just works.
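If you want to check what was inferred before trusting it, DESCRIBE works against any of these sources. A quick Python sketch using the same illustrative file names as above:
import duckdb

# Show the inferred column names and types without scanning the whole file
print(duckdb.sql("DESCRIBE SELECT * FROM 'users.csv'"))
print(duckdb.sql("DESCRIBE SELECT * FROM 'transactions/*.parquet'"))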
Surprising Use Case #2: Git Repository Analyst
Want to analyze your Git history? With a suitable extension, DuckDB can query it directly:
-- Install a git extension (not part of core DuckDB)
INSTALL git;
LOAD git;
-- Find most modified files
SELECT
file_path,
COUNT(*) as modifications,
COUNT(DISTINCT author_name) as unique_authors
FROM git_log('.')
WHERE file_path LIKE '%.py'
GROUP BY file_path
ORDER BY modifications DESC
LIMIT 20;
-- Analyze commit patterns
SELECT
DATE_TRUNC('hour', commit_date) as hour,
COUNT(*) as commits
FROM git_log('.')
WHERE commit_date > NOW() - INTERVAL '30 days'
GROUP BY hour
ORDER BY hour;
This replaced a 200-line Python script with 10 lines of SQL.
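If you’d rather not depend on an extension at all, the same kind of analysis works by dumping git log into a delimited file and pointing DuckDB at it. A minimal sketch – the format string and file name are my own choices:
import duckdb
import subprocess

# Dump hash, author, and ISO date as tab-separated lines (%x09 is a tab)
log = subprocess.run(
    ["git", "log", "--pretty=format:%H%x09%an%x09%aI"],
    capture_output=True, text=True, check=True,
).stdout
with open("commits.tsv", "w") as f:
    f.write(log)

duckdb.sql("""
    SELECT author, COUNT(*) AS commits
    FROM read_csv('commits.tsv', delim='\t', header=false, quote='',
                  columns={'hash': 'VARCHAR', 'author': 'VARCHAR', 'commit_date': 'VARCHAR'})
    GROUP BY author
    ORDER BY commits DESC
    LIMIT 10
""").show()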
Surprising Use Case #3: The Pandas Replacement
Data scientists, take note. DuckDB often outperforms pandas while using standard SQL:
# Old way with pandas
import pandas as pd
df1 = pd.read_csv('sales_2023.csv')
df2 = pd.read_csv('sales_2024.csv')
combined = pd.concat([df1, df2])
result = combined.groupby('product_id').agg({
'quantity': 'sum',
'revenue': 'sum'
}).reset_index()
result.to_parquet('summary.parquet')
# New way with DuckDB
import duckdb
duckdb.sql("""
COPY (
SELECT
product_id,
SUM(quantity) as total_quantity,
SUM(revenue) as total_revenue
FROM read_csv_auto(['sales_2023.csv', 'sales_2024.csv'])
GROUP BY product_id
) TO 'summary.parquet'
""")
Same result. In my runs it was roughly 10x faster and used about 5x less memory; your numbers will vary with file size and schema.
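It also doesn’t have to be either/or: DuckDB can query a pandas DataFrame that already exists in your Python session, referenced by variable name, with no loading or registration step. A small sketch, reusing the file name from the example above:
import duckdb
import pandas as pd

sales = pd.read_csv('sales_2024.csv')

# DuckDB resolves `sales` to the local DataFrame automatically
summary = duckdb.sql("""
    SELECT product_id, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY product_id
""").df()  # .df() hands the result back as a pandas DataFrame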
Surprising Use Case #4: Log File Detective
System logs giving you trouble? DuckDB handles unstructured data brilliantly:
-- Parse nginx logs
WITH parsed_logs AS (
SELECT
regexp_extract(line, '(\d+\.\d+\.\d+\.\d+)', 1) as ip,
regexp_extract(line, '\[([^\]]+)\]', 1) as timestamp,
regexp_extract(line, '"(\w+) ([^"]+)"', 1) as method,
regexp_extract(line, '"(\w+) ([^"]+)"', 2) as path,
regexp_extract(line, '" (\d+) ', 1)::INT as status_code,
regexp_extract(line, '" \d+ (\d+)', 1)::INT as response_size
FROM read_csv_auto('access.log', sep='\n', columns={'line': 'VARCHAR'})
)
SELECT
status_code,
COUNT(*) as requests,
AVG(response_size) as avg_size,
COUNT(DISTINCT ip) as unique_ips
FROM parsed_logs
WHERE method = 'GET'
GROUP BY status_code
ORDER BY requests DESC;
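If the single-column CSV trick feels fragile (quote characters in log lines can confuse a CSV reader), another option is to read the raw lines in Python and let DuckDB do only the regex work. A sketch, assuming the same access.log:
import duckdb
import pandas as pd

# Hand raw log lines to DuckDB as a DataFrame; no CSV parsing involved
lines = pd.DataFrame({"line": open("access.log", errors="replace").read().splitlines()})

status_counts = duckdb.sql(r"""
    SELECT
        TRY_CAST(regexp_extract(line, '" (\d+) ', 1) AS INT) AS status_code,
        COUNT(*) AS requests
    FROM lines
    GROUP BY status_code
    ORDER BY requests DESC
""").df()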
Surprising Use Case #5: API Response Wrangler
Got nested JSON from an API? DuckDB flattens it effortlessly:
-- Analyzing GitHub API responses
WITH api_data AS (
SELECT * FROM read_json_auto('github_repos.json')
)
SELECT
repo->>'name' as repo_name,
(repo->>'stargazers_count')::INT as stars,
owner->>'login' as owner,
(repo->>'created_at')::TIMESTAMP as created_date,
unnest(topics) as topic
FROM api_data
WHERE (repo->>'stargazers_count')::INT > 1000
ORDER BY stars DESC;
Surprising Use Case #6: The Local Data Lake
Transform your laptop into a mini data lake:
-- Create views over various data sources
CREATE VIEW sales AS
SELECT * FROM read_parquet('s3://bucket/sales/*.parquet');
-- (needs the postgres extension: INSTALL postgres; LOAD postgres;)
CREATE VIEW customers AS
SELECT * FROM postgres_scan('dbname=prod', 'public', 'customers');
CREATE VIEW products AS
SELECT * FROM 'local_products.csv';
-- Query across all sources
SELECT
c.customer_name,
p.product_name,
SUM(s.amount) as total_spent
FROM sales s
JOIN customers c ON s.customer_id = c.id
JOIN products p ON s.product_id = p.id
WHERE s.sale_date > '2024-01-01'
GROUP BY c.customer_name, p.product_name
ORDER BY total_spent DESC;
No ETL. No data movement. Just federated queries across formats and locations.
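Since views are just stored SQL, the whole definition can live in one DuckDB file on disk and be reopened whenever you need it. A sketch – the database file name is mine, the source files are the ones above:
import duckdb

# Persist the view definitions; the underlying data never moves
con = duckdb.connect('datalake.duckdb')
con.execute("CREATE OR REPLACE VIEW products AS SELECT * FROM 'local_products.csv'")
# ...same pattern for the S3 and Postgres views shown above...
con.close()

# A later session sees the same views
con = duckdb.connect('datalake.duckdb')
print(con.execute("SELECT COUNT(*) FROM products").fetchall())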
Performance That Surprises
Let’s talk numbers. Processing a 10GB CSV file:
# PostgreSQL with COPY (after creating table)
Time: 4 minutes 32 seconds
# Pandas read_csv
Time: 2 minutes 18 seconds
Memory: 15GB
# DuckDB
Time: 31 seconds
Memory: 2.1GB
But here’s the kicker – with DuckDB, you can query the file without loading it:
-- This runs in 1.2 seconds and uses 200MB RAM
SELECT COUNT(*), AVG(amount)
FROM read_csv_auto('huge_file.csv')
WHERE category = 'Electronics';
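Numbers like these depend heavily on hardware, file layout, and how many columns the query touches, so treat them as directional. Timing the same query on your own data takes only a few lines:
import duckdb
import time

start = time.time()
duckdb.sql("""
    SELECT COUNT(*), AVG(amount)
    FROM read_csv_auto('huge_file.csv')
    WHERE category = 'Electronics'
""").fetchall()
print(f"Elapsed: {time.time() - start:.1f}s")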
When DuckDB Isn’t the Answer
DuckDB excels at analytical workloads, but it’s not for everything:
Don’t use DuckDB for:
- High-concurrency transactional systems (use PostgreSQL)
- Real-time streaming with millisecond latency (use Kafka + Flink)
- Shared, multi-writer production databases (use PostgreSQL)
- Distributed processing of petabytes (use Spark/Presto)
Definitely try DuckDB for:
- Data exploration and analysis
- ETL/ELT pipelines
- Processing files without loading
- Local analytical applications
- Prototyping data pipelines
- One-off data transformations
- Embedded analytics in applications
Real-World Pipeline Transformation
Here’s a before/after from a real project:
Before (Apache Airflow + Spark):
- Download files from S3 (7 minutes)
- Load into Spark DataFrame (3 minutes)
- Transform and aggregate (2 minutes)
- Write to PostgreSQL (4 minutes)
- Total: 16 minutes, $15 in compute costs
After (DuckDB):
-- Entire pipeline in one script (needs the httpfs and postgres extensions)
ATTACH 'dbname=analytics' AS analytics (TYPE postgres);
INSERT INTO analytics.public.customer_metrics
WITH cleaned_data AS (
SELECT
customer_id,
DATE_TRUNC('day', timestamp) as date,
SUM(amount) as daily_total,
COUNT(*) as transaction_count
FROM read_parquet('s3://bucket/transactions/*.parquet')
WHERE amount > 0
AND timestamp > CURRENT_DATE - INTERVAL '90 days'
GROUP BY customer_id, date
)
SELECT * FROM cleaned_data
WHERE daily_total > 100;
- Total: 1 minute, $0(ish) in compute costs
Getting Started
Installation is refreshingly simple:
# Python
pip install duckdb
# CLI
brew install duckdb # macOS
Your first query:
import duckdb
# Query any file
result = duckdb.sql("SELECT * FROM 'data.csv' LIMIT 5").fetchall()
# Or use it like a real database
con = duckdb.connect('my_analysis.db')
con.execute("CREATE TABLE results AS SELECT * FROM 'huge_dataset.parquet'")
The Hidden Magic
What makes DuckDB special isn’t just speed – it’s removing friction:
- No more writing custom parsers for each file format
- No more moving data between systems
- No more managing clusters for simple aggregations
- No more choosing between SQL and DataFrame APIs
- No more explaining why you need a Spark cluster for a 5GB file
Instead, you just write SQL and it works.
Conclusion
DuckDB represents a shift in how we think about data processing. It’s not trying to replace your production database or your streaming platform. It’s filling a gap we didn’t know existed – the space between “too big for Excel” and “too small for Spark.”
The next time you’re faced with messy CSVs, nested JSON, or files scattered across S3, don’t spin up a cluster. Don’t write a complex Python script. Just point DuckDB at your data and write some SQL.
Because sometimes the best solution isn’t the most sophisticated one – it’s the one that just works.
Have you tried DuckDB for unconventional use cases? What surprised you?
Originally shared on LinkedIn