Sunday, March 8, 2026

How to handle gigantic data in Python, even if you’re a beginner


# Introduction

Working with large datasets in Python often leads to a common problem: you load data with pandas, and the program slows down or crashes completely. This usually happens when you try to load everything into memory at once.

Most memory problems are caused by how you load and process data. With a few practical techniques, you can handle datasets much larger than your available memory.

In this article, you’ll learn seven techniques for working efficiently with large datasets in Python. We’ll start simple and build up, so by the end you’ll know exactly which approach fits your use case.

🔗 You can find the code on GitHub. If you want sample data, you can run the sample data generator Python script to create sample CSV files, and use the code snippets to process them.

# 1. Read the data in pieces

The most beginner-friendly approach is to process data in smaller chunks rather than loading it all at once.

Let’s consider a scenario where you have a large sales dataset and you want to find the total revenue. The following code demonstrates this approach:

import pandas as pd

# Define chunk size (number of rows per chunk)
chunk_size = 100000
total_revenue = 0

# Read and process the file in chunks
for chunk in pd.read_csv('large_sales_data.csv', chunksize=chunk_size):
    # Process each chunk
    total_revenue += chunk['revenue'].sum()

print(f"Total Revenue: ${total_revenue:,.2f}")

Instead of loading all 10 million rows at once, we load 100,000 rows at a time. We calculate the total for each chunk and add it to our running total. Your RAM only ever holds 100,000 rows, regardless of file size.

When to use it: when you need to perform aggregation (sum, count, average) or filtering operations on large files.

# 2. Use only specific columns

Often you don’t need every column in your dataset. Loading only what you need can significantly reduce memory usage.

Let’s say you’re analyzing customer data, but you only need the age and purchase amount, not the many other columns:

import pandas as pd

# Only load the columns you actually need
columns_to_use = ['customer_id', 'age', 'purchase_amount']

df = pd.read_csv('customers.csv', usecols=columns_to_use)

# Now work with a much lighter dataframe
average_purchase = df.groupby('age')['purchase_amount'].mean()
print(average_purchase)

By specifying usecols, pandas loads only these three columns into memory. If the original file had 50 columns, you have just reduced your memory usage by about 94%.

When to use it: when you know exactly which columns you need before loading the data.

# 3. Optimize data types

By default, pandas may use more memory than necessary. An integer column may be stored as 64-bit when 8-bit would work fine.

For example, if you load a dataset with product ratings (1-5 stars) and user IDs:

import pandas as pd

# First, let's see the default memory usage
df = pd.read_csv('ratings.csv')
print("Default memory usage:")
print(df.memory_usage(deep=True))

# Now optimize the data types
df['rating'] = df['rating'].astype('int8')  # Ratings are 1-5, so int8 is enough
df['user_id'] = df['user_id'].astype('int32')  # Assuming user IDs fit in int32

print("\nOptimized memory usage:")
print(df.memory_usage(deep=True))

By converting the rating column from the likely default int64 (8 bytes per number) to int8 (1 byte per number), we achieve an 8x memory reduction for this column.

Common conversions include:

  • int64 → int8, int16, or int32 (depending on the range of numbers).
  • float64 → float32 (if you don’t need extreme precision).
  • object → category (for columns with repeating values).
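
Rather than picking target types by hand, you can let pandas choose the smallest type that fits. Here is a minimal sketch using pd.to_numeric with its downcast argument (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical dataframe that pandas loads with default 64-bit types
df = pd.DataFrame({
    "rating": [1, 5, 3, 4],            # values fit comfortably in int8
    "price": [19.99, 5.49, 3.0, 8.5],  # float32 precision is enough here
})

# Let pandas pick the smallest integer/float type that holds the values
df["rating"] = pd.to_numeric(df["rating"], downcast="integer")
df["price"] = pd.to_numeric(df["price"], downcast="float")

print(df.dtypes)  # rating becomes int8, price becomes float32
```

This is handy when you have many numeric columns and don’t want to inspect each one’s range yourself.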

# 4. Use categorical data types

When a column contains repeated text values (such as country names or product categories), pandas stores each value separately. The category dtype stores the unique values once and uses efficient integer codes to refer to them.

Let’s assume you’re working with a product inventory file in which the category column has only 20 unique values, repeated across all rows in the dataset:

import pandas as pd

df = pd.read_csv('products.csv')

# Check memory before conversion
print(f"Before: {df['category'].memory_usage(deep=True) / 1024**2:.2f} MB")

# Convert to category
df['category'] = df['category'].astype('category')

# Check memory after conversion
print(f"After: {df['category'].memory_usage(deep=True) / 1024**2:.2f} MB")

# It still works like normal text
print(df['category'].value_counts())

This conversion can significantly reduce memory usage for columns with low cardinality (few unique values). The column still behaves like standard text data: you can filter, group, and sort as you normally would.

When to use it: for any text column where values are frequently repeated (categories, states, countries, departments, and the like).

# 5. Filter as you read

Sometimes you know you only need a subset of rows. Instead of loading everything and then filtering, you can filter during the loading process.

For example, if you are only interested in transactions from 2024:

import pandas as pd

# Read in chunks and filter
chunk_size = 100000
filtered_chunks = []

for chunk in pd.read_csv('transactions.csv', chunksize=chunk_size):
    # Filter each chunk before storing it
    filtered = chunk[chunk['year'] == 2024]
    filtered_chunks.append(filtered)

# Combine the filtered chunks
df_2024 = pd.concat(filtered_chunks, ignore_index=True)

print(f"Loaded {len(df_2024)} rows from 2024")

We combine chunking with filtering. Each chunk is filtered before being added to our list, so we never keep the full dataset in memory, only the rows we actually need.

When to use it: when you only need a subset of rows based on some condition.

# 6. Use Dask for parallel processing

For truly huge datasets, Dask provides an API similar to pandas but handles chunking and parallel processing automatically.

Here’s how to calculate the average of a column in a huge dataset:

import dask.dataframe as dd

# Read with Dask (it handles chunking automatically)
df = dd.read_csv('huge_dataset.csv')

# Operations look just like pandas
result = df['sales'].mean()

# Dask is lazy - compute() actually executes the calculation
average_sales = result.compute()

print(f"Average Sales: ${average_sales:,.2f}")

Dask does not load the entire file into memory. Instead, it builds a plan to process the data in chunks and executes that plan when you call .compute(). It can even use multiple CPU cores to speed up calculations.

When to use it: when your dataset is too large for pandas even with chunking, or when you want to parallelize without writing complicated code.

# 7. Sample your data for exploration

When you’re just exploring or testing code, you don’t need the full dataset. Load a sample first.

Let’s say you’re building a machine learning model and want to test your preprocessing pipeline. You can sample from your dataset as shown:

import pandas as pd

# Read just the first 50,000 rows
df_sample = pd.read_csv('huge_dataset.csv', nrows=50000)

# Or read a random sample using skiprows
import random
skip_rows = lambda x: x > 0 and random.random() > 0.01  # Keep ~1% of rows

df_random_sample = pd.read_csv('huge_dataset.csv', skiprows=skip_rows)

print(f"Sample size: {len(df_random_sample)} rows")

The first approach loads the first N rows, which is suitable for quick exploration. The second approach randomly samples rows from the entire file, which is better for statistical analysis or when the file is sorted in a way that makes the top rows unrepresentative.

When to use it: when developing, testing, or doing exploratory analysis before running code on the full dataset.

# Conclusion

Handling large datasets does not require expert-level skills. Here’s a quick summary of the techniques we covered:

| Technique | When to use it |
| --- | --- |
| Chunking | When there isn’t enough RAM for aggregation, filtering, and processing. |
| Column selection | When you only need a few columns from a wide dataset. |
| Data type optimization | Always; do this after loading to save memory. |
| Categorical types | For text columns with repeating values (categories, states, etc.). |
| Filter as you read | When you only need a subset of rows. |
| Dask | For very large datasets or when parallel processing is required. |
| Sampling | During development and exploration. |

The first step is to know both your data and your task. In most cases, a combination of chunking and smart column selection will get you 90% of the way.
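
As a rough sketch of that combination (the file, column names, and dtypes here are assumptions, with a tiny sample file generated so the snippet is runnable), you can pass usecols, dtype, and chunksize to pd.read_csv in a single call:

```python
import pandas as pd

# Generate a small sample file so the sketch runs end to end
pd.DataFrame({
    "region": ["north", "south"] * 500,
    "revenue": [10.0, 20.0] * 500,
    "unused_notes": ["x"] * 1000,  # column we will deliberately skip
}).to_csv("sales_sample.csv", index=False)

total = 0.0
# Chunking + column selection + dtype optimization in one read_csv call
for chunk in pd.read_csv(
    "sales_sample.csv",
    usecols=["region", "revenue"],
    dtype={"region": "category", "revenue": "float32"},
    chunksize=300,
):
    total += chunk["revenue"].sum()

print(f"Total revenue: {total:.2f}")  # Total revenue: 15000.00
```

Each chunk arrives already trimmed to two columns with compact dtypes, so peak memory stays small even for much larger files.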

As your needs grow, move up to more advanced tools like Dask, or consider converting your data to more efficient file formats like Parquet or HDF5.

Now go ahead and start working with these huge datasets. Have fun analyzing!

Bala Priya C is a software developer and technical writer from India. She likes working at the intersection of mathematics, programming, data analytics, and content creation. Her areas of interest and expertise include DevOps, data analytics, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she is learning and sharing her knowledge with the developer community by writing tutorials, guides, reviews, and more. Bala also creates engaging resource overviews and coding tutorials.
