Photo by the author
DuckDB is a fast, in-process analytical database designed for modern data analysis. It runs directly inside your Python script, which means no separate server is needed, and it excels at complex queries thanks to columnar storage and vectorized execution.
Because knowing how to work with data is becoming more and more vital, today I want to show you how to build a DuckDB workflow in Python and explore its key features.
Let’s dive in!
What is DuckDB?
DuckDB is a free, open-source OLAP database built for fast, local analytics. Unlike traditional databases that run as external services, DuckDB runs inside your application, with no server required. As an OLAP system, DuckDB stores data in columns (not rows, as OLTP systems do), making it highly efficient for analytical queries such as joins, aggregations and group-bys.
Think of DuckDB as a lightweight, analytics-optimized version of SQLite, combining the simplicity of embedded databases with the power of modern data warehousing. This leads us to the natural next question…
What are DuckDB’s main features?
Blazing-fast analytical queries
DuckDB delivers impressive performance for OLAP workloads, often surprising users who are familiar with traditional databases such as PostgreSQL. Unlike conventional OLAP systems, which can be slow because they churn through huge amounts of data, DuckDB uses a columnar, vectorized execution engine. This design makes better use of the CPU cache and significantly speeds up analytical queries.
Native SQL support + seamless language integration
DuckDB offers full support for complex SQL queries and exposes APIs in many languages, including Java, C and C++. Its tight integration with Python and R makes it ideal for interactive data analysis. You can write queries directly in your preferred environment, with handy SQL syntax extensions (e.g. EXCLUDE, REPLACE and GROUP BY ALL) that simplify query writing.
The best part is that DuckDB is completely self-contained, with no external dependencies or configuration headaches.
Free and Open Source
DuckDB is fully open source and actively maintained by a growing community of contributors. This ensures rapid feature development and bug fixes. And yes, it is free to use. While future licensing changes are always possible, for now you get a powerful analytical engine at zero cost.
Now that we know its main features, let’s get started with it!
First steps with DuckDB
The DuckDB installation process depends a bit on your environment, but in general it is quick and straightforward. Because DuckDB is an embedded database engine with no server requirements or external dependencies, setup usually takes only a few lines of code. A full installation guide can be found in the official DuckDB documentation.
Prerequisites
Before diving in, make sure you have the following:
- Python 3.13 or newer installed
- Basic understanding of SQL and data analysis in Python
You can easily install DuckDB in your environment by running the following command:
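In a standard Python environment, that means:

```shell
pip install duckdb
```

If you work in a Jupyter notebook, prefix the command with `!` to run it from a cell.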
Working with DuckDB in Python
Once installed, using DuckDB is quite straightforward. You simply import duckdb into your environment and then connect to an existing database, or create a new one if needed.
For example:
import duckdb
connection = duckdb.connect()
If no database file is supplied, the connect() method will create a new in-memory database by default. That said, the easiest way to start running SQL queries is to use the sql() method directly.
# Source: Basic API usage - https://duckdb.org/docs/api/python/overview.html
import duckdb
duckdb.sql('SELECT 42').show()
Running this command initializes the global DuckDB instance within the Python module and returns a Relation, a symbolic representation of the query.
Importantly, the query itself is not executed until you request the result, as shown below:
# Source: Execute SQL - https://duckdb.org/docs/guides/python/execute_sql.html
results = duckdb.sql('SELECT 42').fetchall()
print(results)
"""
[(42,)]
"""
Let’s work with real data now. DuckDB supports a wide range of file formats, including CSV, JSON and Parquet, and loading them is straightforward.
You can see how straightforward it is in the example below:
# Source: Python API - https://duckdb.org/docs/api/python/overview.html
import duckdb
duckdb.read_csv('example.csv') # read a CSV file into a Relation
duckdb.read_parquet('example.parquet')# read a Parquet file into a Relation
duckdb.read_json('example.json') # read a JSON file into a Relation
duckdb.sql('SELECT * FROM "example.csv"') # directly query a CSV file
Working with external data sources in DuckDB
One of DuckDB’s standout features is its ability to query external data files directly, without importing them into a database or loading the entire dataset into memory. Unlike traditional databases, which require data ingestion first, DuckDB supports a “zero-copy” model, loading only the data required for a given query.
This approach brings some key advantages:
- Minimal memory usage: only the relevant parts of the file are read into memory.
- No import/export overhead: query your data in place, with no need to move or duplicate it.
- Simplified workflow: easily query multiple files and formats with a single SQL statement.
To illustrate DuckDB in action, we will use a simple CSV file that can be downloaded from the following Kaggle link.
To query the data, we can simply define a query that points to our file path.
# Query data directly from a CSV file
source = "path/to/your_file.csv"  # path to the downloaded CSV
result = duckdb.query(f"SELECT * FROM '{source}'").fetchall()
print(result)
Now we can easily manipulate the data using SQL logic directly from DuckDB.
Filtering rows
To focus on specific subsets of the data, use the WHERE clause in DuckDB. It filters rows based on conditions, using comparison operators (>, <, =, <>, etc.) and logical operators (AND, OR, NOT) for more complex expressions.
# Select only months with more than 500 passengers
result = duckdb.query(f"SELECT * FROM '{source}' WHERE total_passengers > 500").fetchall()
print(result)
Sorting results
Use the ORDER BY clause to sort the results by one or more columns. By default it sorts in ascending order (ASC), but you can specify descending (DESC). To sort by multiple columns, separate them with commas.
# Sort months by number of passengers
sorted_result = duckdb.query(f"SELECT * FROM '{source}' ORDER BY total_passengers DESC").fetchall()
print("\nMonths sorted by total traffic:")
print(sorted_result)
Adding calculated columns
Create new columns on the fly in a query using expressions and the AS keyword. Use arithmetic operators or built-in functions to transform the data; the computed columns appear in the results but do not affect the original file.
# Express passenger traffic in thousands
bonus_result = duckdb.query(f"""
    SELECT
        month,
        total_passengers,
        total_passengers / 1000 AS traffic_in_thousands
    FROM '{source}'
""").fetchall()
print("\nTraffic in thousands:")
print(bonus_result)
Using CASE expressions
For more complex transformations, SQL provides the CASE expression. It works similarly to if-else statements in programming languages, enabling conditional logic within queries.
segmented_result = duckdb.query(f"""
SELECT
month,
total_passengers,
CASE
WHEN total_passengers >= 100 THEN 'HIGH'
WHEN total_passengers >= 50 THEN 'MEDIUM'
ELSE 'LOW'
END AS affluency
FROM '{source}'
""").fetchall()
print("nMonth by affluency of passangers")
print(segmented_result)
Conclusion
DuckDB is a high-performance OLAP database built for data professionals who need to explore and analyze large datasets efficiently. Its embedded SQL engine runs complex analytical queries directly in your environment, with no separate server required. With seamless support for Python, R, Java, C++ and more, DuckDB fits naturally into your existing workflow, whatever your preferred language.
You can check out the full code in the following GitHub repository.
Josep Ferrer is an analytics engineer from Barcelona. He graduated in physics engineering and currently works in the field of data science applied to human mobility. He is a part-time content creator focused on data science and technology. Josep writes on all things AI, covering the application of the ongoing explosion in the field.
