
Exploratory Data Analysis with DuckDB, JupySQL, and Plotly Express
Analytical SQL DBMS in Python
Tech Stack
Python 3, Pandas, Plotly Express
DuckDB
VS Code, Jupyter, GitHub, Git LFS
Project
Exploratory Data Analysis on a dataset of 2.1 million entries, performed locally and in-memory.
Benefits
EDA is fundamental to any data project. Leveraging DuckDB and the tools around it in an efficient Python setup is a must for any data specialist working with datasets of up to roughly 1 TB on a single machine.
Goals
Enhancing my data analysis, SQL, and DuckDB skills.
Exploratory Data Analysis (EDA) is a fundamental step in data analysis and machine learning workflows. It helps analysts uncover patterns, spot anomalies, and test hypotheses before committing to complex modeling or decision-making processes. Without a structured EDA, insights may remain hidden, and models risk being built on incomplete or misleading assumptions.
One of the key challenges in performing EDA efficiently is managing data storage and retrieval. This is where DuckDB proves to be a powerful tool. Designed for analytical workloads, DuckDB operates efficiently in-memory and integrates seamlessly with Python, making it an excellent choice for data analysis. Combining it with JupySQL, Pandas, and Plotly Express results in a highly interactive and efficient EDA workflow. This article discusses how these tools worked together in an analysis I performed using a real-world dataset.
This article describes the work done in a notebook. You can find that notebook, along with much more code and detailed explanations, in the GitHub repo linked here.
Find the 2nd and 3rd parts of this series on DuckDB.
JupySQL and DuckDB: Enabling Efficient SQL Analysis in Jupyter Notebooks
JupySQL is an extension that allows users to execute SQL queries directly inside Jupyter notebooks. When combined with DuckDB, it enables SQL-based data manipulation without requiring a dedicated database server. This is particularly useful for handling medium to large datasets on a single machine without complex setup.
In this EDA, I leveraged JupySQL to interact with DuckDB efficiently. Below is an example of how I initialized DuckDB and queried data directly within a Jupyter notebook:
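The original notebook cells are embedded elsewhere, so here is a minimal sketch of the setup. The table name `pedestrian_counts` and the CSV path are illustrative assumptions, not the actual names from the project:

```python
# Load the JupySQL extension and connect to an in-memory DuckDB database
%load_ext sql
%sql duckdb://

# Register the raw data as a table (file path and table name are illustrative)
%sql CREATE TABLE pedestrian_counts AS SELECT * FROM read_csv_auto('data/pedestrian_counts.csv')

# A quick row count to gauge the dataset's volume
%sql SELECT COUNT(*) FROM pedestrian_counts
```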
This simple SQL command provides an immediate count of the dataset entries, allowing for a quick assessment of data volume before proceeding with further analysis. Additionally, I used Pandas to seamlessly fetch and manipulate query results.
Here, I first assign the result of a SQL query directly to a variable, `sensors_2022_df`, and then use a custom function to convert that variable into a pandas DataFrame:
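The original cells aren't reproduced here; a minimal sketch using JupySQL's built-in mechanisms looks like this, with the table and column names as illustrative assumptions:

```python
%%sql sensors_2022_df <<
-- Store the query result in the variable sensors_2022_df
-- (table and column names are illustrative)
SELECT *
FROM pedestrian_counts
WHERE year = 2022;
```

```python
# The notebook uses a custom helper for this step; JupySQL's built-in
# .DataFrame() method on the result set achieves the same conversion
sensors_2022_df = sensors_2022_df.DataFrame()
sensors_2022_df.head()
```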
By integrating SQL and Python-based data manipulation, I was able to efficiently clean and analyze my dataset, leveraging the best of both worlds.
From Relational Object to Disk-Based Persistent Database
This EDA was performed in-memory, on a single machine, with no concurrent users. This setup aligns well with DuckDB's design philosophy—optimized for local analytical workloads.
Initially, I instantiated the dataset as a relational object within memory. However, as the analysis grew more complex and I required persistent storage, I transitioned to a disk-based persistent database. This was achieved by writing the DuckDB table to disk:
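One way to achieve this is DuckDB's ATTACH command; the sketch below assumes illustrative file and table names rather than the actual ones from the notebook:

```python
# Attach a persistent database file on disk (the file name is illustrative)
%sql ATTACH 'pedestrians.duckdb' AS persistent

# Copy the in-memory table into the persistent database
%sql CREATE TABLE persistent.pedestrian_counts AS SELECT * FROM pedestrian_counts
```

In a later session, connecting directly to the file (for example, `%sql duckdb:///pedestrians.duckdb`) makes the table available again without reprocessing the source data.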
This approach ensured that the dataset was not lost when the session ended and allowed me to reload data without having to reprocess it.
Time Series Analysis: A Deeper Look into Pedestrian Traffic Patterns
As I progressed through the analysis, the focus naturally shifted toward time-based trends in pedestrian traffic. I started with a broad temporal scope, examining data across multiple years. Gradually, I narrowed it down to specific periods—2019, 2020, and 2021—to explore the impact of COVID-19 on pedestrian movement.
Further refinement led me to examine trends by month and hour, which uncovered interesting patterns in foot traffic fluctuations. The following visualization (created with Plotly Express) compares pedestrian activity from three different sensors:
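The interactive chart itself lives in the notebook; as a rough sketch, a comparison like this can be built with Plotly Express from a long-format DataFrame. The data and sensor names below are synthetic placeholders:

```python
import numpy as np
import pandas as pd
import plotly.express as px

# Synthetic stand-in for the query result; in the notebook this comes
# from DuckDB. Column and sensor names are illustrative placeholders.
rng = np.random.default_rng(42)
hourly_df = pd.DataFrame({
    "hour": np.tile(np.arange(24), 3),
    "sensor": np.repeat(["Sensor A", "Sensor B", "Sensor C"], 24),
    "pedestrian_count": rng.integers(50, 2000, size=72),
})

# Interactive line chart comparing foot traffic across the three sensors
fig = px.line(hourly_df, x="hour", y="pedestrian_count", color="sensor",
              title="Hourly pedestrian counts by sensor")
fig.show()
```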


Zooming into these graphs helped reveal that traffic peaks varied by location, suggesting differences in commuting behaviors across different city zones.
Plotly and Plotly Express: Enhancing Data Exploration Through Interactivity
Plotly and its high-level wrapper Plotly Express provide a powerful visualization framework for interactive data exploration. While static charts serve their purpose, interactivity allows for deeper engagement with the data.
For example, I used Plotly Express to create bar charts representing pedestrian counts. The ability to hover over data points, zoom into specific time periods, and filter elements dynamically proved invaluable in uncovering fine details, such as how foot traffic peaks at different times of the day.
In the code below, the first block uses Plotly Express, while the rest uses lower-level Plotly components:
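The embedded code isn't reproduced here; the following is a minimal sketch of that pattern, assuming a synthetic DataFrame with illustrative column names and values:

```python
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# Synthetic stand-in for a DuckDB aggregation result; values are illustrative
monthly_df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "pedestrian_count": [120_000, 98_000, 110_000, 105_000, 130_000, 125_000],
})

# First block: Plotly Express builds the base bar chart
fig = px.bar(monthly_df, x="month", y="pedestrian_count",
             title="Monthly pedestrian counts")

# The rest: lower-level Plotly components (graph_objects) refine the figure,
# overlaying a trend line and adjusting the layout
fig.add_trace(go.Scatter(x=monthly_df["month"], y=monthly_df["pedestrian_count"],
                         mode="lines+markers", name="trend"))
fig.update_layout(xaxis_title="Month", yaxis_title="Pedestrian count",
                  hovermode="x unified")
fig.show()
```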
And the result is this graph:


Conclusion
The combination of SQL (via JupySQL & DuckDB) and Python-based analysis (via Pandas & Plotly Express) provided an efficient workflow for the first mile of exploratory data analysis. SQL enabled structured querying, Pandas offered flexible data manipulation, and Plotly provided interactive visualization tools.
For any data analyst or data scientist, leveraging these tools effectively is critical to uncovering meaningful insights. As datasets grow in complexity and volume, having a streamlined and efficient approach to EDA becomes even more essential. My work on this project reinforced the importance of using the right tools for the job and the need for continuous learning to refine my analytical skills.
💡 If you want to explore more SQL work, check my full SQLAlchemy journey in my GitHub repository.