
# Introduction
Hugging Face Datasets is one of the simplest ways to load a dataset with a single line of code. These datasets typically come in formats such as CSV, Parquet, and Arrow. While all three are designed to store tabular data, they work very differently under the hood. The choice of format determines how data is stored, how quickly it loads, how much disk space it requires, and how well it preserves data types. These differences become increasingly significant as datasets grow larger and models become more complex. In this article, we'll look at how Hugging Face datasets work with CSV, Parquet, and Arrow files, what really sets them apart on disk and in memory, and when you should use each. So let's get started.
# 1. CSV
CSV stands for comma-separated values. It is plain text: one row per line, with columns separated by commas (or tabs, in the TSV variant). It can be opened with almost any tool, e.g. Excel, Google Sheets, pandas, or a database. It is very simple and interoperable.
Example:
name,age,city
Kanwal,30,New York
Qasim,25,Edmonton
Hugging Face treats it as a line-based format, which means it reads the data row by row. While this is fine for small datasets, performance degrades as they scale. It also has other limitations:
- No explicit schema: Since all data is stored as text, types must be inferred every time the file is loaded. This can cause errors if the data is not consistent.
- Large files and slow I/O: Storing numbers as text inflates file size, and parsing them back from text is CPU-intensive.
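The "everything is text" limitation is easy to see with Python's built-in `csv` module (no extra libraries needed): every value, including the ages, comes back as a string until you cast it yourself:

```python
import csv
import io

# Same example table as above, held in memory for simplicity.
raw = "name,age,city\nKanwal,30,New York\nQasim,25,Edmonton\n"
rows = list(csv.DictReader(io.StringIO(raw)))

print(rows[0])               # {'name': 'Kanwal', 'age': '30', 'city': 'New York'}
print(type(rows[0]["age"]))  # <class 'str'> -- types must be inferred or cast manually

# The explicit cast can fail if even one row contains dirty data.
ages = [int(r["age"]) for r in rows]
print(ages)  # [30, 25]
```

This manual casting step is exactly what binary formats like Parquet and Arrow make unnecessary, because they store the schema alongside the data.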
# 2. Parquet
Parquet is a binary, columnar format. Instead of saving rows one by one as CSV does, Parquet groups values by column. This makes reads and queries much faster when you only need a few columns, and compression keeps file sizes and I/O low. Parquet also stores the schema, so types are preserved. It works best for batch processing and large-scale analytics rather than many small, frequent updates to the same file (it is better suited to batch writes than continuous edits). Taking the CSV example above, Parquet would store all names together, all ages together, and all cities together. This columnar layout would look like this:
Names: Kanwal, Qasim
Ages: 30, 25
Cities: New York, Edmonton
It also stores metadata for each column: type, min/max values, null counts, and compression information. This enables faster reads, efficient storage, and proper type handling. Compression algorithms such as Snappy or Gzip further reduce disk space. Its main strengths are:
- Compression: Similar column values compress well. Files are smaller and cheaper to store.
- Column reading: Load only the columns you need, speeding up queries.
- Schema preservation: The schema is stored with the file, so there's no type guessing every time you load it.
- Scale: Works well for millions or billions of rows.
# 3. Arrow
Arrow is not quite the same kind of thing as CSV or Parquet. It is a columnar format designed for fast in-memory processing. In Hugging Face, every dataset is backed by an Arrow table, whether you started from a CSV, Parquet, or Arrow file. Continuing with the same example table, Arrow also stores the data column by column, but in memory:
Names: contiguous memory block storing Kanwal, Qasim
Ages: contiguous memory block storing 30, 25
Cities: contiguous memory block storing New York, Edmonton
Because the data sits in contiguous blocks, column operations (such as filtering, mapping, or summing) are extremely fast. Arrow also supports memory mapping, which lets you access datasets on disk without fully loading them into RAM. Here are some key benefits of this format:
- Zero-copy reads: Memory-map files without loading everything into RAM.
- Fast column access: The columnar layout enables vectorized operations.
- Rich types: Supports nested data, lists, and tensors.
- Interoperability: Works with pandas, PyArrow, Spark, Polars, and more.
# Summary
Hugging Face datasets makes switching between formats routine. Use CSV for quick experiments, Parquet for storing large tables, and Arrow for fast in-memory processing and training. Knowing when to use each keeps your pipeline fast and simple, so you can spend more time on your model.
Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of artificial intelligence and medicine. She is co-author of the e-book “Maximizing Productivity with ChatGPT”. As a 2022 Google Generation Scholar for APAC, she promotes diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, a Mitacs Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is a staunch advocate for change and founded FEMCodes to empower women in STEM fields.
