
Photo by the author
# Introduction
Developers use pandas to manipulate data, but it can be slow, especially on large datasets. For this reason, many are looking for faster and lighter alternatives. These options retain the core features needed for analysis while focusing on speed, a lower memory footprint, and simplicity. In this article, we'll look at five lightweight pandas alternatives you can try.
# 1. DuckDB
DuckDB is like SQLite for analytics: you can run SQL queries directly on CSV files. This is useful if you already know SQL or work with machine learning pipelines. Install it with:
pip install duckdb
We will use the Titanic dataset and run a simple SQL query on it as follows:
import duckdb
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
# Run SQL query on the CSV
result = duckdb.query(f"""
SELECT sex, age, survived
FROM read_csv_auto('{url}')
WHERE age > 18
""").to_df()
print(result.head())
Output:
sex age survived
0 male 22.0 0
1 female 38.0 1
2 female 26.0 1
3 female 35.0 1
4 male 35.0 0
DuckDB runs the SQL query directly on the CSV file and then converts the output to a DataFrame. You gain the speed of SQL with the flexibility of Python.
# 2. Polars
Polars is one of the most popular data libraries available today. It is implemented in Rust and is exceptionally fast with minimal memory requirements. The syntax is also very clean. Let's install it with pip:
pip install polars
Now let's use the Titanic dataset for a simple example:
import polars as pl
# Load dataset
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pl.read_csv(url)
result = df.filter(pl.col("age") > 40).select(["sex", "age", "survived"])
print(result)
Output:
shape: (150, 3)
┌────────┬──────┬──────────┐
│ sex ┆ age ┆ survived │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ i64 │
╞════════╪══════╪══════════╡
│ male ┆ 54.0 ┆ 0 │
│ female ┆ 58.0 ┆ 1 │
│ female ┆ 55.0 ┆ 1 │
│ male ┆ 66.0 ┆ 0 │
│ male ┆ 42.0 ┆ 0 │
│ … ┆ … ┆ … │
│ female ┆ 48.0 ┆ 1 │
│ female ┆ 42.0 ┆ 1 │
│ female ┆ 47.0 ┆ 1 │
│ male ┆ 47.0 ┆ 0 │
│ female ┆ 56.0 ┆ 1 │
└────────┴──────┴──────────┘
Polars reads the CSV file, filters the rows based on the age condition, and selects a subset of the columns.
# 3. PyArrow
PyArrow is a lightweight library for columnar data. Tools like Polars build on Apache Arrow for their speed and memory efficiency. It's not a full substitute for pandas, but it's great for reading files and preprocessing. Install it with:
pip install pyarrow
For our example, let's use the Iris dataset in CSV format as follows:
import pyarrow.csv as csv
import pyarrow.compute as pc
import urllib.request
# Download the Iris CSV
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
local_file = "iris.csv"
urllib.request.urlretrieve(url, local_file)
# Read with PyArrow
table = csv.read_csv(local_file)
# Filter rows
filtered = table.filter(pc.greater(table['sepal_length'], 5.0))
print(filtered.slice(0, 5))
Output:
pyarrow.Table
sepal_length: double
sepal_width: double
petal_length: double
petal_width: double
species: string
----
sepal_length: [[5.1,5.4,5.4,5.8,5.7]]
sepal_width: [[3.5,3.9,3.7,4,4.4]]
petal_length: [[1.4,1.7,1.5,1.2,1.5]]
petal_width: [[0.2,0.4,0.2,0.2,0.4]]
species: [["setosa","setosa","setosa","setosa","setosa"]]
PyArrow reads the CSV file and stores it in a columnar format. The name and type of each column are listed in a clear schema. This layout lets you quickly inspect and filter large datasets.
# 4. Modin
Modin is for anyone who wants more performance without having to learn a new library. It uses the same pandas API but performs operations in parallel. You don't need to change existing code; just update the import. Everything else works like normal pandas. Install it with pip:
pip install "modin[all]"
To see how it works, let's try a small example using the same Titanic dataset as follows:
import modin.pandas as pd
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
# Load the dataset
df = pd.read_csv(url)
# Filter the dataset
adults = df[df["age"] > 18]
# Select only a few columns to display
adults_small = adults[["survived", "sex", "age", "class"]]
# Display result
adults_small.head()
Output:
survived sex age class
0 0 male 22.0 Third
1 1 female 38.0 First
2 1 female 26.0 Third
3 1 female 35.0 First
4 0 male 35.0 Third
Modin spreads the work across the processor cores, which means you’ll get better performance without having to do any extra work.
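Because the API is identical, a common pattern is to try the Modin import and fall back to plain pandas when it isn't installed; the rest of the script runs unchanged either way. A minimal sketch of that pattern (the toy data is illustrative):

```python
# Drop-in swap: use Modin when available, plain pandas otherwise
try:
    import modin.pandas as pd  # parallel, same API
except ImportError:
    import pandas as pd  # single-threaded fallback

df = pd.DataFrame({"age": [22, 38, 15]})
# Everything below is ordinary pandas code
print(df[df["age"] > 18])
```

This makes it cheap to benchmark Modin on an existing script: the only difference is which module the name `pd` points to.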
# 5. Dask
How do you handle large data without adding more RAM? Dask is a great choice if your files are larger than your computer's memory. It uses lazy evaluation, so it doesn't load the entire dataset into memory at once. This lets it process millions of rows smoothly. Install it with:
pip install dask[complete]
To try this out, we can use the Chicago Crime dataset as follows:
import dask.dataframe as dd
import urllib.request
url = "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD"
local_file = "chicago_crime.csv"
urllib.request.urlretrieve(url, local_file)
# Read CSV with Dask (lazy evaluation)
df = dd.read_csv(local_file, dtype=str) # all columns as string
# Filter crimes classified as 'THEFT'
thefts = df[df['Primary Type'] == 'THEFT']
# Select a few relevant columns
thefts_small = thefts[["ID", "Date", "Primary Type", "Description", "District"]]
print(thefts_small.head())
Output:
ID Date Primary Type Description District
5 13204489 09/06/2023 11:00:00 AM THEFT OVER $500 001
50 13179181 08/17/2023 03:15:00 PM THEFT RETAIL THEFT 014
51 13179344 08/17/2023 07:25:00 PM THEFT RETAIL THEFT 014
53 13181885 08/20/2023 06:00:00 AM THEFT $500 AND UNDER 025
56 13184491 08/22/2023 11:44:00 AM THEFT RETAIL THEFT 014
Filtering (Primary Type == 'THEFT') and selecting columns are lazy operations: they return instantly because Dask only builds a task graph at that point. The actual work happens in chunks when .head() asks for results, so the whole file never has to fit in memory at once.
# Conclusion
We covered five alternatives to pandas and how to use them. Everything in this article is kept simple and focused; for details, check the official documentation for each library.
If you run into any problems, leave a comment and I'll help.
Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of artificial intelligence and medicine. She is co-author of the e-book “Maximizing Productivity with ChatGPT”. As a 2022 Google Generation Scholar for APAC, she promotes diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, a Mitacs Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is a staunch advocate for change and founded FEMCodes to empower women in STEM fields.
