
# Introduction
When you think about cleaning data, you probably think about firing up a Jupyter notebook. But here's something that might surprise you: some of the most effective data cleaning can be done right in your terminal, using tools that are already installed on your system.
In this article, you'll learn how to use basic command-line tools to clean, transform, and explore data files. No installations required; just a terminal and some CSV files.
Before we start cleaning data from the command line, let's talk about why this is worth doing:
- Command-line tools are simple to use, fast, and powerful, especially for huge files.
- These tools are built into Linux/macOS and are available on Windows.
- They are great for a first look at your data before loading and analyzing it with Python.
- It's easy to combine commands into scripts and reuse those scripts for automation.
Now let’s start coding!
Note: All commands can be found in this Bash script on GitHub. To get the most out of this article, I encourage you to open a terminal and code.
# Configuring sample data
Let’s create a messy CSV file to work with. This simulates real-world data problems you are likely to encounter.
cat > messy_data.csv << 'EOF'
name,age,salary,department,email
John Lee,32,50000,Engineering,john@example.com
Jane Smith,28,55000,Marketing,jane@example.com
Bob Davis ,35,60000,Engineering,bob@example.com
Alice Williams,29,,Marketing,alice@example.com
Charlie Brown,45,70000,Sales,charlie@example.com
Dave Wilson,31,52000,Engineering,
Emma Davis,,58000,Marketing,emma@example.com
Frank Miller,38,65000,Sales,frank@example.com
John Lee,32,50000,Engineering,john@example.com
Grace Lee,27,51000,Engineering,grace@example.com
EOF
There are several common issues with this dataset: leading and trailing spaces, missing values, and duplicate rows. It's great for learning!
# 1. Explore data with head, tail, and wc
Before you clean your data, you need to understand what you're working with. Let's start with the basics.
# See the first 5 rows (including header)
head -n 5 messy_data.csv
# See the last 3 rows
tail -n 3 messy_data.csv
# Count total rows (including header)
wc -l messy_data.csv
Here's what's happening:
- head -n 5 shows the first 5 lines, giving a quick preview.
- tail -n 3 shows the last 3 lines (useful for checking whether the data is complete).
- wc -l counts lines; subtract 1 for the header to get the number of records.
Output:
name,age,salary,department,email
John Lee,32,50000,Engineering,john@example.com
Jane Smith,28,55000,Marketing,jane@example.com
Bob Davis ,35,60000,Engineering,bob@example.com
Alice Williams,29,,Marketing,alice@example.com
Frank Miller,38,65000,Sales,frank@example.com
John Lee,32,50000,Engineering,john@example.com
Grace Lee,27,51000,Engineering,grace@example.com
11 messy_data.csv
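Putting the three together: since wc -l counts the header line too, a small arithmetic sketch (using a throwaway sample.csv rather than the article's file) gets the record count directly:

```shell
# Create a tiny throwaway file just for this example.
printf 'name,age\nJohn,32\nJane,28\n' > sample.csv

# wc -l counts every line, so subtract 1 for the header row.
records=$(( $(wc -l < sample.csv) - 1 ))
echo "Records: $records"   # prints "Records: 2"

rm sample.csv
```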
# 2. Display specific columns with cut
You don't always need to see all the columns. Let's isolate just the names and departments.
cut -d',' -f1,4 messy_data.csv
Breaking it down:
- cut is a tool for extracting sections from each line.
- -d',' sets the delimiter to a comma (for CSV files).
- -f1,4 selects fields (columns) 1 and 4.
- You can also use ranges: -f1-3 selects columns 1 to 3.
Here is the output:
name,department
John Lee,Engineering
Jane Smith,Marketing
Bob Davis ,Engineering
Alice Williams,Marketing
Charlie Brown,Sales
Dave Wilson,Engineering
Emma Davis,Marketing
Frank Miller,Sales
John Lee,Engineering
Grace Lee,Engineering
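For completeness, here is a tiny sketch of the range syntax mentioned above, run on a single header line rather than the full file:

```shell
# -f1-3 keeps fields 1 through 3 (name, age, salary).
echo 'name,age,salary,department,email' | cut -d',' -f1-3
# prints: name,age,salary
```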
# 3. Remove duplicate rows with sort and uniq
Note that "John Lee" appears twice in our dataset. Let's fix it.
# Save the header first
head -n 1 messy_data.csv > cleaned_data.csv
# Remove duplicates from the data (excluding header)
tail -n +2 messy_data.csv | sort | uniq >> cleaned_data.csv
Here's what each command does: head -n 1 grabs only the header row, and tail -n +2 retrieves everything starting from line 2 (skipping the header). Then sort sorts the lines; remember that uniq only removes adjacent duplicate lines, which is why the data must be sorted first. Finally, >> appends to the file (versus >, which overwrites).
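As an aside, most sort implementations also offer -u, which folds the sort | uniq pair into a single step. A minimal sketch on made-up input:

```shell
# sort -u sorts and drops duplicate lines in one pass.
printf 'b\na\nb\na\n' | sort -u
# prints:
# a
# b
```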
# 4. Search and filter with grep
Now let's perform some search and filter operations. Want to find all engineers or filter out rows with missing data? grep is useful for all such tasks.
# Find all engineers
grep "Engineering" messy_data.csv
# Find rows with empty fields (two consecutive commas)
grep ",," messy_data.csv
# Exclude rows with missing data
grep -v ",," messy_data.csv > no_missing.csv
Here, grep "pattern" searches for lines containing that pattern, and grep -v inverts the match (shows lines that do NOT match). This is a quick way to filter out incomplete records, provided the missing value produces a double comma (,,).
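One caveat: grep -v ",," only catches empty fields in the middle of a row; an empty last field (like Dave Wilson's missing email) ends in a single trailing comma and slips through. A hedged sketch using awk to drop rows where any field is empty, shown on made-up input:

```shell
# Keep a row only if every comma-separated field is non-empty.
printf 'a,1\nb,\nc,3\n' |
awk -F',' '{ ok = 1; for (i = 1; i <= NF; i++) if ($i == "") ok = 0; if (ok) print }'
# prints:
# a,1
# c,3
```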
# 5. Trimming whitespace with sed
See how the "Bob Davis" row has extra spaces? Let's clean it up.
sed 's/^[ \t]*//; s/[ \t]*$//' messy_data.csv > trimmed_data.csv
Let's break down the command: sed is a stream editor for text transformation, and s/pattern/replacement/ is its substitution syntax. ^[ \t]* matches spaces/tabs at the beginning of a line, and [ \t]*$ matches spaces/tabs at the end. The semicolon separates the two operations (trim the start of the line, then trim the end).
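Note that trimming the edges of each line won't catch the space in "Bob Davis ," because it sits before a mid-line comma. One possible fix is to trim whitespace around every comma instead; a sketch on a single made-up line:

```shell
# Remove spaces and tabs on both sides of each comma.
echo 'Bob Davis ,35, 60000' | sed 's/[[:space:]]*,[[:space:]]*/,/g'
# prints: Bob Davis,35,60000
```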
# 6. Replacing values with sed
Sometimes you need to standardize values or correct typos. Let's try replacing all occurrences of the word "Engineering" with "Tech".
# Replace all "Engineering" with "Tech"
sed 's/Engineering/Tech/g' messy_data.csv
Then let's fill in the blank email fields (indicated by a comma at the end of the line) with the default email address value.
# Replace empty email fields with "no-email@example.com"
sed 's/,$/,no-email@example.com/' messy_data.csv
Run the above commands and observe the result. I have excluded the output here to avoid repetition.
Breaking it down:
- The g flag means "global": replace all occurrences on each line.
- ,$ matches a comma at the end of a line (indicating an empty last field).
- You can combine multiple replacements by putting ; between them.
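To illustrate that last point, here's a sketch combining both substitutions from this section into one sed call, applied to a single made-up row:

```shell
# Rename the department and fill the empty trailing email in one pass.
echo 'Dave Wilson,31,52000,Engineering,' |
sed 's/Engineering/Tech/g; s/,$/,no-email@example.com/'
# prints: Dave Wilson,31,52000,Tech,no-email@example.com
```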
# 7. Counting and summarizing using awk
awk is very useful for field operations. Let's do some basic analysis.
# Count records by department
tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq -c
# Calculate average age (excluding header and empty values)
tail -n +2 messy_data.csv | awk -F',' '{if($2) sum+=$2; if($2) count++} END {print "Average age:", sum/count}'
In the awk command, -F',' sets the field separator to a comma and $2 refers to the second field (age). The condition if($2) ensures that only non-empty values are processed, while sum += $2 accumulates the total. Finally, the END block runs after all lines have been read and calculates and prints the average age.
Output:
5 Engineering
3 Marketing
2 Sales
Average age: 33
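The same awk pattern extends naturally to grouped aggregates using associative arrays. A sketch computing the average of field 2 per key in field 1, on made-up input (the trailing sort just makes the output order deterministic):

```shell
# Accumulate a sum and a count per key, then print each key's average.
printf 'Eng,100\nEng,200\nMkt,50\n' |
awk -F',' '{ sum[$1] += $2; n[$1]++ } END { for (k in sum) print k, sum[k] / n[k] }' |
sort
# prints:
# Eng 150
# Mkt 50
```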
# 8. Combining commands with pipelines
You'll get more useful processing if you combine these command-line tools together.
# Get unique departments, sorted alphabetically
tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq
# Find engineers with salary > 55000
tail -n +2 messy_data.csv | grep "Engineering" | awk -F',' '$3 > 55000' | cut -d',' -f1,3
# Count employees per department with counts
tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq -c | sort -rn
Each | here passes the result of one command as input to the next. This lets you build sophisticated data transformations step by step. The last pipeline sorts by count in reverse numerical order (-rn).
This outputs:
Engineering
Marketing
Sales
Bob Davis ,60000
5 Engineering
3 Marketing
2 Sales
# 9. Converting data formats
Sometimes you need to work with different delimiters. Here we use a tab as the separator instead of a comma.
# Convert CSV to TSV (tab-separated)
sed 's/,/\t/g' messy_data.csv > data.tsv
# Add a new column with a fixed value
awk -F',' 'BEGIN{OFS=","} {print $0, "2024"}' messy_data.csv > data_with_year.csv
In this awk command, BEGIN{OFS=","} sets the output field separator to a comma. $0 represents the entire input line, and print $0, "2024" appends "2024" as a new column to each output row.
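Since OFS controls how printed fields are joined, the same mechanism can reorder columns while changing the delimiter at the same time; a small sketch on a made-up line:

```shell
# Print field 3 then field 1, joined by a tab instead of a comma.
echo 'a,b,c' | awk -F',' 'BEGIN { OFS = "\t" } { print $3, $1 }'
# prints "c" and "a" separated by a tab
```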
# 10. Complete cleaning pipeline
Let's put it all together into one handy pipeline that cleans up our messy data:
# Save header
head -n 1 messy_data.csv > final_clean.csv
# Clean the data: remove duplicates, trim whitespace, exclude missing values
tail -n +2 messy_data.csv |
sed 's/^[ \t]*//; s/[ \t]*$//' |
grep -v ",," |
sort |
uniq >> final_clean.csv
echo "Cleaning complete! Check final_clean.csv"
This pipeline first writes the header to preserve the column names, then skips it when processing the data rows. It trims leading and trailing whitespace from each line, removes lines containing empty fields (detected as double commas), sorts the data, and eliminates duplicate entries. Finally, it appends the cleaned data to the output file.
# Wrapping up
Command-line data cleaning is a powerful yet underrated skill for data scientists. These tools are fast and reliable. While you'll still use Python for sophisticated analysis, mastering these basics will boost your productivity and give you options when Python isn't ideal.
The best part is that these skills translate to data engineering, DevOps, and administrative roles. Learning how to manipulate data on the command line makes you a more versatile programmer.
Start practicing with your own datasets and you'll be surprised how often you reach for these tools instead of cracking open a Python notebook. Enjoy cleaning your data!
Bala Priya C is a software developer and technical writer from India. She likes working at the intersection of mathematics, programming, data analytics, and content creation. Her areas of interest and specialization include DevOps, data analytics, and natural language processing. She enjoys reading, writing, coding, and coffee! She is currently learning and sharing her knowledge with the developer community by writing tutorials, guides, reviews, and more. Bala also creates engaging resource overviews and coding tutorials.
