
DuckDB Best Practices
Analytical SQL DBMS in Python
Tech Stack
Python 3, DuckDB, VS Code, Jupyter, GitHub
Project
Practical use of DuckDB's SQL dialect, following best practices.
Benefits
DuckDB provides time-saving tools compared to standard SQL solutions, but users must understand how to use them effectively to benefit. This is my focus here.
Goals
Enhancing my SQL and DuckDB skills.
Why DuckDB Outshines Standard SQL: A Dive into Best Practices
DuckDB has quietly emerged as a game changer for data specialists. In this article, we’ll explore some of the niceties that make DuckDB a refined, enjoyable SQL experience. Based on insights from my recent notebook on DuckDB best practices, I’ll cover features that save time, boost readability, and streamline common data operations.
This article summarizes work done in a notebook. You can find that notebook, along with much more code and accompanying explanations, in the GitHub repo linked here.
Find the first and the second parts of this series on DuckDB.
A Gentle Introduction: DuckDB vs. Standard SQL
Traditional SQL can sometimes feel rigid and verbose, especially when it comes to data manipulation and advanced query patterns. DuckDB introduces several quality‐of‐life improvements that not only simplify your queries but also enhance performance. Whether you’re dealing with large datasets or complex hierarchical data, DuckDB offers features that let you write cleaner, more efficient code.
Exclude: Save Time by Excluding Columns
One small but powerful feature in DuckDB is the exclude clause. Rather than listing out every column you want to include in your query, you can specify the columns to leave out. This saves time and reduces the chance of error when working with wide tables.
For example, rather than enumerating every column you want to keep, you can simply name the ones to drop.
This concise syntax streamlines your query and lets you focus on the data that matters.
Function Chaining: Readable and Efficient Scalar Operations
Another appealing feature is function chaining. DuckDB allows you to chain scalar functions using the dot operator, enhancing both readability and execution speed.
Note: This chaining is limited to scalar functions—functions that operate on a single input value (from a row) and return a single output value. They contrast with table functions, which process entire rows or sets of rows.
For example, imagine you have a table of users and you want to generate a username by concatenating the first and last names, trimming extra spaces, replacing any remaining internal spaces with underscores, and converting the result to lowercase. Instead of nesting multiple function calls, you can chain them in a clear, left-to-right manner:
In this snippet:
The concatenation of first_name and last_name is performed first.
The .trim() function removes any extra whitespace.
The .replace(' ', '_') function replaces spaces with underscores.
Finally, .lower() converts the string to lowercase.
This approach makes the transformation process easy to follow, reducing the mental overhead of nested function calls.
Improved Insert Techniques: Insert by Name, Insert or Replace, and Positional Joins
DuckDB enhances data ingestion through several convenient methods:
Insert by Name:
You can specify columns explicitly by name, which reduces dependency on the column order. This is especially useful when your table schema evolves over time:
Insert or Replace:
This feature enables upsert-like functionality, allowing you to insert new records or replace existing ones seamlessly:
Positional Joins:
A positional join matches rows from two tables purely by their row order: the first row of one table is paired with the first row of the other, and so on. In standard SQL this is impossible, because tables are treated as unordered sets of rows; to pair rows by position you would have to add an explicit key column to both tables just to join on it.
DuckDB instead treats the two tables like aligned dataframes and connects them with the POSITIONAL JOIN clause:
These enhanced methods reduce boilerplate and error-proneness compared to standard SQL, making your data pipelines more robust and easier to maintain.
Temporal Joins (ASOF): Synchronizing Time-Series Data
Temporal joins—often referred to as ASOF joins—are particularly useful when working with time-series data. For instance, suppose you have one table with scores (each with a score_time timestamp) and another table with weather conditions (each with its own timestamp), and the timestamps never line up exactly. You want to match each score with the most recent weather observation at or before its score_time, so a simple equality join or WHERE filter is no longer adequate.
Standard SQL Approach
In standard SQL, you might achieve this with a rather complex CTE:
While this query works, it can be less efficient and harder to read, especially on large datasets.
DuckDB’s ASOF Join
DuckDB offers a more intuitive syntax with ASOF joins:
Recursive Queries: Navigating Hierarchical Data
Recursive queries are indispensable when working with hierarchical or self-referential data. Consider a wine hierarchy where each wine category can have subcategories. A recursive query lets you start at any level and iteratively join on child/parent categories.
In DuckDB, a recursive query might look like this:
This approach is both clear and scalable. Even though recursive queries may initially appear more verbose than multiple self-joins, their structure remains relatively constant even as hierarchy complexity increases.
Conclusion
DuckDB was designed with SQL users in mind. Its array of quality-of-life improvements—from the simplicity of the exclude clause and the elegance of function chaining to advanced insert methods, ASOF joins, and recursive queries—collectively provide a superior SQL experience compared to standard SQL. These features not only improve code readability and maintainability but also boost performance in data-intensive operations.
I invite you to read the first two notebooks in this series where DuckDB is introduced and demonstrated at a basic level. Stay tuned for our next notebook, which will dive into data exploration and additional best practices.
Happy querying!
💡 If you want to explore more SQL work, check my full SQLAlchemy journey in my GitHub repository.