DuckDB: The Surprising Swiss Army Knife of Data Processing

DuckDB: Making the Impossible Simple

Last week, someone asked me to analyze 50GB of CSV files scattered across S3. Five years ago, this would’ve meant spinning up a Spark cluster. Today? One DuckDB query.

SELECT customer_id, SUM(amount) as total_spent
FROM read_csv_auto('s3://mybucket/transactions/*.csv')
WHERE transaction_date > '2025-01-01'
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 100;

No cluster. No downloads. No loading. Just query the files where they sit.
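
The only setup that query assumes is the httpfs extension and S3 credentials. Here’s a minimal Python sketch, with a placeholder bucket path and keys:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Key, secret, and region are placeholders; swap in your own
con.execute("""
    CREATE SECRET s3_creds (
        TYPE s3,
        KEY_ID 'AKIA...',
        SECRET '...',
        REGION 'us-east-1'
    )
""")

top_customers = con.sql("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM read_csv_auto('s3://mybucket/transactions/*.csv')
    WHERE transaction_date > '2025-01-01'
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 100
""").fetchall()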

This is DuckDB’s superpower: it makes difficult data tasks surprisingly simple.

What Is DuckDB?

Think of DuckDB as an analytically gifted SQLite. It’s an in-process SQL database designed for analytics, but that description massively undersells it. DuckDB is more like a Swiss Army knife for data – a single tool that replaces dozens of specialized ones.

Key characteristics:

  • Embedded: No server, just import and use
  • Columnar storage: Optimized for analytical queries
  • Zero dependencies: Single binary, works everywhere
  • Vectorized execution: Processes data in chunks for speed
  • SQL-first: If you know SQL, you know DuckDB

But here’s what the documentation doesn’t explain: DuckDB is great at jobs you wouldn’t expect a database to handle at all.

Surprising Use Case #1: The Universal File Reader

Forget writing parsers. DuckDB reads everything:

-- Query CSV files
SELECT * FROM 'data.csv';

-- Query JSON files
SELECT * FROM 'api_response.json';

-- Query Parquet files
SELECT * FROM 'analytics/*.parquet';

-- Query Excel files (via the spatial extension's st_read)
SELECT * FROM st_read('report.xlsx');

-- Mix and match!
SELECT 
    csv.user_id,
    json.preferences,
    parquet.transaction_history
FROM 'users.csv' csv
JOIN 'preferences.json' json ON csv.user_id = json.user_id
JOIN 'transactions/*.parquet' parquet ON csv.user_id = parquet.user_id;

No schema definition. No CREATE TABLE. DuckDB infers everything and just works.
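
And when inference guesses wrong, you can inspect it and pin the schema yourself. A quick sketch, using a hypothetical data.csv with made-up column names:

import duckdb

# Ask DuckDB what it inferred for the file
print(duckdb.sql("DESCRIBE SELECT * FROM 'data.csv'"))

# Pin the types explicitly when a column is ambiguous (e.g. zip codes as text)
rows = duckdb.sql("""
    SELECT *
    FROM read_csv('data.csv',
                  header = true,
                  columns = {'user_id': 'BIGINT',
                             'zip_code': 'VARCHAR',
                             'signup_date': 'DATE'})
""").fetchall()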

Surprising Use Case #2: Git Repository Analyst

Want to analyze your Git history? DuckDB can query it directly:

-- Install git extension
INSTALL git;
LOAD git;

-- Find most modified files
SELECT 
    file_path,
    COUNT(*) as modifications,
    COUNT(DISTINCT author_name) as unique_authors
FROM git_log('.')
WHERE file_path LIKE '%.py'
GROUP BY file_path
ORDER BY modifications DESC
LIMIT 20;

-- Analyze commit patterns
SELECT 
    DATE_TRUNC('hour', commit_date) as hour,
    COUNT(*) as commits
FROM git_log('.')
WHERE commit_date > NOW() - INTERVAL '30 days'
GROUP BY hour
ORDER BY hour;

This replaced a 200-line Python script with 10 lines of SQL.

Surprising Use Case #3: The Pandas Replacement

Data scientists, take note. DuckDB often outperforms pandas while using standard SQL:

# Old way with pandas
import pandas as pd

df1 = pd.read_csv('sales_2023.csv')
df2 = pd.read_csv('sales_2024.csv')
combined = pd.concat([df1, df2])
result = combined.groupby('product_id').agg({
    'quantity': 'sum',
    'revenue': 'sum'
}).reset_index()
result.to_parquet('summary.parquet')

# New way with DuckDB
import duckdb

duckdb.sql("""
    COPY (
        SELECT 
            product_id,
            SUM(quantity) as total_quantity,
            SUM(revenue) as total_revenue
        FROM read_csv_auto(['sales_2023.csv', 'sales_2024.csv'])
        GROUP BY product_id
    ) TO 'summary.parquet'
""")

Same result. 10x faster. Uses 5x less memory.
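
It goes further: duckdb.sql() can reference pandas DataFrames that already live in memory by their variable name, so you can mix SQL and DataFrames freely. A small sketch with toy data:

import duckdb
import pandas as pd

# Toy stand-in for your real data
df = pd.DataFrame({
    "product_id": [1, 1, 2],
    "quantity":   [3, 1, 5],
    "revenue":    [30.0, 10.0, 25.0],
})

# DuckDB resolves 'df' to the local DataFrame, no loading step needed
summary = duckdb.sql("""
    SELECT product_id,
           SUM(quantity) AS total_quantity,
           SUM(revenue)  AS total_revenue
    FROM df
    GROUP BY product_id
""").df()   # .df() hands the result back as a pandas DataFrame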

Surprising Use Case #4: Log File Detective

System logs giving you trouble? DuckDB handles unstructured data brilliantly:

-- Parse nginx logs
WITH parsed_logs AS (
    SELECT 
        regexp_extract(line, '(\d+\.\d+\.\d+\.\d+)', 1) as ip,
        regexp_extract(line, '\[([^\]]+)\]', 1) as timestamp,
        regexp_extract(line, '"(\w+) ([^"]+)"', 1) as method,
        regexp_extract(line, '"(\w+) ([^"]+)"', 2) as path,
        regexp_extract(line, '" (\d+) ', 1)::INT as status_code,
        regexp_extract(line, '" \d+ (\d+)', 1)::INT as response_size
    FROM read_csv_auto('access.log', sep='\n', columns={'line': 'VARCHAR'})
)
SELECT 
    status_code,
    COUNT(*) as requests,
    AVG(response_size) as avg_size,
    COUNT(DISTINCT ip) as unique_ips
FROM parsed_logs
WHERE method = 'GET'
GROUP BY status_code
ORDER BY requests DESC;

Surprising Use Case #5: API Response Wrangler

Got nested JSON from an API? DuckDB flattens it effortlessly:

-- Analyzing GitHub API responses
WITH api_data AS (
    SELECT * FROM read_json_auto('github_repos.json')
)
-- read_json_auto turns nested objects into structs and arrays into lists
SELECT 
    name as repo_name,
    stargazers_count as stars,
    owner.login as owner,
    created_at::TIMESTAMP as created_date,
    unnest(topics) as topic
FROM api_data
WHERE stargazers_count > 1000
ORDER BY stars DESC;

Surprising Use Case #6: The Local Data Lake

Transform your laptop into a mini data lake:

-- Create views over various data sources
CREATE VIEW sales AS 
SELECT * FROM read_parquet('s3://bucket/sales/*.parquet');

CREATE VIEW customers AS 
SELECT * FROM postgres_scan('postgresql://prod', 'public', 'customers');

CREATE VIEW products AS 
SELECT * FROM 'local_products.csv';

-- Query across all sources
SELECT 
    c.customer_name,
    p.product_name,
    SUM(s.amount) as total_spent
FROM sales s
JOIN customers c ON s.customer_id = c.id
JOIN products p ON s.product_id = p.id
WHERE s.sale_date > '2024-01-01'
GROUP BY c.customer_name, p.product_name
ORDER BY total_spent DESC;

No ETL. No data movement. Just federated queries across formats and locations.
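
If you’d rather not repeat connection details in every view, the postgres extension can also ATTACH the remote database so its tables show up as a schema. A sketch with a placeholder connection string:

import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres")
con.execute("LOAD postgres")

# Connection string is a placeholder; READ_ONLY keeps the source safe
con.execute("ATTACH 'postgresql://prod' AS prod (TYPE postgres, READ_ONLY)")

con.sql("SELECT COUNT(*) FROM prod.public.customers").show()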

Performance That Surprises

Let’s talk numbers. Processing a 10GB CSV file:

# PostgreSQL with COPY (after creating table)
Time: 4 minutes 32 seconds

# Pandas read_csv
Time: 2 minutes 18 seconds
Memory: 15GB

# DuckDB
Time: 31 seconds
Memory: 2.1GB

But here’s the kicker – with DuckDB, you can query the file without loading it:

-- This runs in 1.2 seconds and uses 200MB RAM
SELECT COUNT(*), AVG(amount) 
FROM read_csv_auto('huge_file.csv')
WHERE category = 'Electronics';

When DuckDB Isn’t the Answer

DuckDB excels at analytical workloads, but it’s not for everything:

Don’t use DuckDB for:

  • High-concurrency transactional systems (use PostgreSQL)
  • Real-time streaming with millisecond latency (use Kafka + Flink)
  • Shared, client-server production databases (use PostgreSQL)
  • Distributed processing of petabytes (use Spark/Presto)

Definitely try DuckDB for:

  • Data exploration and analysis
  • ETL/ELT pipelines
  • Processing files without loading
  • Local analytical applications
  • Prototyping data pipelines
  • One-off data transformations
  • Embedded analytics in applications (see the sketch after this list)
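
For the embedded case in particular, a couple of settings keep DuckDB a well-behaved guest inside a larger application. A minimal sketch (the file name and limits are arbitrary):

import duckdb

con = duckdb.connect("app_analytics.db")   # hypothetical on-disk database file
con.execute("SET memory_limit = '2GB'")    # larger work spills to temp files instead of eating RAM
con.execute("SET threads = 4")             # cap CPU parallelism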

Real-World Pipeline Transformation

Here’s a before/after from a real project:

Before (Apache Airflow + Spark):

  • Download files from S3 (7 minutes)
  • Load into Spark DataFrame (3 minutes)
  • Transform and aggregate (2 minutes)
  • Write to PostgreSQL (4 minutes)
  • Total: 16 minutes, $15 in compute costs

After (DuckDB):

-- Entire pipeline in one script: read from S3, write straight to Postgres
INSTALL postgres;
LOAD postgres;
ATTACH 'postgresql://analytics' AS analytics (TYPE postgres);

INSERT INTO analytics.public.customer_metrics
WITH cleaned_data AS (
    SELECT 
        customer_id,
        DATE_TRUNC('day', timestamp) as date,
        SUM(amount) as daily_total,
        COUNT(*) as transaction_count
    FROM read_parquet('s3://bucket/transactions/*.parquet')
    WHERE amount > 0 
    AND timestamp > CURRENT_DATE - INTERVAL '90 days'
    GROUP BY customer_id, date
)
SELECT * FROM cleaned_data
WHERE daily_total > 100;

  • Total: 1 minute, $0(ish) in compute costs

Getting Started

Installation is refreshingly simple:

# Python
pip install duckdb

# CLI
brew install duckdb  # macOS

Your first query:

import duckdb

# Query any file
result = duckdb.sql("SELECT * FROM 'data.csv' LIMIT 5").fetchall()

# Or use it like a real database
con = duckdb.connect('my_analysis.db')
con.execute("CREATE TABLE results AS SELECT * FROM 'huge_dataset.parquet'")

The Hidden Magic

What makes DuckDB special isn’t just speed – it’s removing friction:

  • No more writing custom parsers for each file format
  • No more moving data between systems
  • No more managing clusters for simple aggregations
  • No more choosing between SQL and DataFrame APIs
  • No more explaining why you need a Spark cluster for a 5GB file

Instead, you just write SQL and it works.

Conclusion

DuckDB represents a shift in how we think about data processing. It’s not trying to replace your production database or your streaming platform. It’s filling a gap we didn’t know existed – the space between “too big for Excel” and “too small for Spark.”

The next time you’re faced with messy CSVs, nested JSON, or files scattered across S3, don’t spin up a cluster. Don’t write a complex Python script. Just point DuckDB at your data and write some SQL.

Because sometimes the best solution isn’t the most sophisticated one – it’s the one that just works.


Have you tried DuckDB for unconventional use cases? What surprised you?

Originally shared on LinkedIn