
# Introduction
When you think about cleaning data, you probably think about firing up a Jupyter notebook. But here's something that might surprise you: some of the most effective data cleaning can be done right in your terminal, using tools that are already installed on your system.
In this article, you'll learn how to use basic command-line tools to clean, transform, and explore data files. No installations required; just a terminal and some CSV files.
Before we start cleaning data from the command line, let's talk about why this is worth doing:
- Command-line tools are simple to use, fast, and powerful, especially for huge files.
- These tools are built into Linux/macOS and are available on Windows.
- They are great for a first look at your data before loading and analyzing it with Python.
- It's easy to combine commands into scripts and reuse those scripts for automation.
Now let’s start coding!
Note: All commands can be found in this Bash script on GitHub. To get the most out of this article, I encourage you to open a terminal and code.
# Configuring sample data
Let’s create a messy CSV file to work with. This simulates real-world data problems you are likely to encounter.
cat > messy_data.csv << 'EOF'
name,age,salary,department,email
John Lee,32,50000,Engineering,john@example.com
Jane Smith,28,55000,Marketing,jane@example.com
Bob Davis ,35,60000,Engineering,bob@example.com
Alice Williams,29,,Marketing,alice@example.com
Charlie Brown,45,70000,Sales,charlie@example.com
Dave Wilson,31,52000,Engineering,
Emma Davis,,58000,Marketing,emma@example.com
Frank Miller,38,65000,Sales,frank@example.com
John Lee,32,50000,Engineering,john@example.com
Grace Lee,27,51000,Engineering,grace@example.com
EOF
There are several common issues with this dataset: leading and trailing spaces, missing values, and duplicate rows. It's great for learning!
# 1. Explore data with head, tail, and wc
Before you clean your data, you need to understand what you're working with. Let's start with the basics.
# See the first 5 rows (including header)
head -n 5 messy_data.csv
# See the last 3 rows
tail -n 3 messy_data.csv
# Count total rows (including header)
wc -l messy_data.csv
Here's what's happening:
- head -n 5 shows the first 5 lines, giving a quick preview.
- tail -n 3 shows the last 3 lines (useful for checking whether the data is complete).
- wc -l counts lines; subtract 1 for the header to get the number of records.
Output:
name,age,salary,department,email
John Lee,32,50000,Engineering,john@example.com
Jane Smith,28,55000,Marketing,jane@example.com
Bob Davis ,35,60000,Engineering,bob@example.com
Alice Williams,29,,Marketing,alice@example.com
Frank Miller,38,65000,Sales,frank@example.com
John Lee,32,50000,Engineering,john@example.com
Grace Lee,27,51000,Engineering,grace@example.com
11 messy_data.csv
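Putting the three together: since wc -l counts the header line too, a small arithmetic sketch (using a throwaway sample.csv rather than the article's file) gets the record count directly:

```shell
# Create a tiny throwaway file just for this example.
printf 'name,age\nJohn,32\nJane,28\n' > sample.csv

# wc -l counts every line, so subtract 1 for the header row.
records=$(( $(wc -l < sample.csv) - 1 ))
echo "Records: $records"   # prints "Records: 2"

rm sample.csv
```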
# 2. Display specific columns with cut
You don't always need to see all the columns. Let's isolate just the names and departments.
cut -d',' -f1,4 messy_data.csv
Breaking it down:
- cut is a tool for extracting sections from each line.
- -d',' sets the delimiter to a comma (for CSV files).
- -f1,4 selects fields (columns) 1 and 4.
- You can also use ranges: -f1-3 selects columns 1 to 3.
Here is the output:
name,department
John Lee,Engineering
Jane Smith,Marketing
Bob Davis ,Engineering
Alice Williams,Marketing
Charlie Brown,Sales
Dave Wilson,Engineering
Emma Davis,Marketing
Frank Miller,Sales
John Lee,Engineering
Grace Lee,Engineering
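For completeness, here is a tiny sketch of the range syntax mentioned above, run on a single header line rather than the full file:

```shell
# -f1-3 keeps fields 1 through 3 (name, age, salary).
echo 'name,age,salary,department,email' | cut -d',' -f1-3
# prints: name,age,salary
```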
# 3. Remove duplicate rows with sort and uniq
Note that "John Lee" appears twice in our dataset. Let's fix it.
# Save the header first
head -n 1 messy_data.csv > cleaned_data.csv
# Remove duplicates from the data (excluding header)
tail -n +2 messy_data.csv | sort | uniq >> cleaned_data.csv
Here's what each command does: head -n 1 grabs only the header row, and tail -n +2 retrieves everything starting from line 2 (skipping the header). Then sort sorts the lines; remember that uniq only removes adjacent duplicate lines, which is why the data must be sorted first. Finally, >> appends to the file (versus >, which overwrites).
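As an aside, most sort implementations also offer -u, which folds the sort | uniq pair into a single step. A minimal sketch on made-up input:

```shell
# sort -u sorts and drops duplicate lines in one pass.
printf 'b\na\nb\na\n' | sort -u
# prints:
# a
# b
```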
# 4. Search and filter with grep
Now let's perform some search and filter operations. Want to find all engineers or filter out rows with missing data? grep is useful for all such tasks.
# Find all engineers
grep "Engineering" messy_data.csv
# Find rows with empty fields (two consecutive commas)
grep ",," messy_data.csv
# Exclude rows with missing data
grep -v ",," messy_data.csv > no_missing.csv
Here, grep "pattern" searches for lines containing that pattern, and grep -v inverts the match (shows lines that do NOT match). This is a quick way to filter out incomplete records, provided the missing value produces a double comma (,,).
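One caveat: grep -v ",," only catches empty fields in the middle of a row; an empty last field (like Dave Wilson's missing email) ends in a single trailing comma and slips through. A hedged sketch using awk to drop rows where any field is empty, shown on made-up input:

```shell
# Keep a row only if every comma-separated field is non-empty.
printf 'a,1\nb,\nc,3\n' |
awk -F',' '{ ok = 1; for (i = 1; i <= NF; i++) if ($i == "") ok = 0; if (ok) print }'
# prints:
# a,1
# c,3
```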
# 5. Trimming whitespace with sed
See how the "Bob Davis" row has extra spaces? Let's clean it up.
sed 's/^[ \t]*//; s/[ \t]*$//' messy_data.csv > trimmed_data.csv
Let's break down the command: sed is a stream editor for text transformation, and s/pattern/replacement/ is its substitution syntax. ^[ \t]* matches spaces/tabs at the beginning of a line, and [ \t]*$ matches spaces/tabs at the end. The semicolon separates the two operations (trim the start of the line, then trim the end).
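Note that trimming the edges of each line won't catch the space in "Bob Davis ," because it sits before a mid-line comma. One possible fix is to trim whitespace around every comma instead; a sketch on a single made-up line:

```shell
# Remove spaces and tabs on both sides of each comma.
echo 'Bob Davis ,35, 60000' | sed 's/[[:space:]]*,[[:space:]]*/,/g'
# prints: Bob Davis,35,60000
```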
# 6. Replacing values with sed
Sometimes you need to standardize values or correct typos. Let's try replacing all occurrences of the word "Engineering" with "Tech".
# Replace all "Engineering" with "Tech"
sed 's/Engineering/Tech/g' messy_data.csv
Then let's fill in the blank email fields (indicated by a comma at the end of the line) with the default email address value.
# Replace empty email fields with "no-email@example.com"
sed 's/,$/,no-email@example.com/' messy_data.csv
Run the above commands and observe the result. I have excluded the output here to avoid repetition.
Breaking it down:
- The g flag means "global": replace all occurrences on each line.
- ,$ matches a comma at the end of a line (indicating an empty last field).
- You can combine multiple replacements by putting ; between them.
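To illustrate that last point, here's a sketch combining both substitutions from this section into one sed call, applied to a single made-up row:

```shell
# Rename the department and fill the empty trailing email in one pass.
echo 'Dave Wilson,31,52000,Engineering,' |
sed 's/Engineering/Tech/g; s/,$/,no-email@example.com/'
# prints: Dave Wilson,31,52000,Tech,no-email@example.com
```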
# 7. Counting and summarizing using awk
awk is very useful for field operations. Let's do some basic analysis.
# Count records by department
tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq -c
# Calculate average age (excluding header and empty values)
tail -n +2 messy_data.csv | awk -F',' '{if($2) sum+=$2; if($2) count++} END {print "Average age:", sum/count}'
In the awk command, -F',' sets the field separator to a comma and $2 refers to the second field (age). The condition if($2) ensures that only non-empty values are processed, while sum += $2 accumulates the total. Finally, the END block runs after all lines have been read and calculates and prints the average age.
Output:
5 Engineering
3 Marketing
2 Sales
Average age: 33
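The same awk pattern extends naturally to grouped aggregates using associative arrays. A sketch computing the average of field 2 per key in field 1, on made-up input (the trailing sort just makes the output order deterministic):

```shell
# Accumulate a sum and a count per key, then print each key's average.
printf 'Eng,100\nEng,200\nMkt,50\n' |
awk -F',' '{ sum[$1] += $2; n[$1]++ } END { for (k in sum) print k, sum[k] / n[k] }' |
sort
# prints:
# Eng 150
# Mkt 50
```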
# 8. Combining commands with pipelines
You'll get more useful processing if you combine these command-line tools together.
# Get unique departments, sorted alphabetically
tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq
# Find engineers with salary > 55000
tail -n +2 messy_data.csv | grep "Engineering" | awk -F',' '$3 > 55000' | cut -d',' -f1,3
# Count employees per department with counts
tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq -c | sort -rn
Each | here passes the result of one command as input to the next. This lets you build sophisticated data transformations step by step. The last pipeline sorts by count in reverse numerical order (-rn).
This outputs:
Engineering
Marketing
Sales
Bob Davis ,60000
5 Engineering
3 Marketing
2 Sales
# 9. Converting data formats
Sometimes you need to work with different delimiters. Here we use a tab as the separator instead of a comma.
# Convert CSV to TSV (tab-separated)
sed 's/,/\t/g' messy_data.csv > data.tsv
# Add a new column with a fixed value
awk -F',' 'BEGIN{OFS=","} {print $0, "2024"}' messy_data.csv > data_with_year.csv
In this awk command, BEGIN{OFS=","} sets the output field separator to a comma. $0 represents the entire input line, and print $0, "2024" appends "2024" as a new column to each output row.
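Since OFS controls how printed fields are joined, the same mechanism can reorder columns while changing the delimiter at the same time; a small sketch on a made-up line:

```shell
# Print field 3 then field 1, joined by a tab instead of a comma.
echo 'a,b,c' | awk -F',' 'BEGIN { OFS = "\t" } { print $3, $1 }'
# prints "c" and "a" separated by a tab
```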
# 10. Complete cleaning pipeline
Let's put it all together into one handy pipeline that cleans up our messy data:
# Save header
head -n 1 messy_data.csv > final_clean.csv
# Clean the data: remove duplicates, trim whitespace, exclude missing values
tail -n +2 messy_data.csv |
sed 's/^[ \t]*//; s/[ \t]*$//' |
grep -v ",," |
sort |
uniq >> final_clean.csv
echo "Cleaning complete! Check final_clean.csv"
This pipeline first writes the header to preserve the column names, then skips it when processing the data rows. It trims leading and trailing whitespace from each line, removes lines containing empty fields (detected as double commas), sorts the data, and eliminates duplicate entries. Finally, it appends the cleaned data to the output file.
# Wrapping up
Command-line data cleaning is a powerful yet underrated skill for data scientists. These tools are fast and reliable. While you'll still use Python for sophisticated analysis, mastering these basics will boost your productivity and give you options when Python isn't ideal.
The best part is that these skills translate to data engineering, DevOps, and administrative roles. Learning how to manipulate data on the command line makes you a more versatile programmer.
Start practicing with your own datasets and you'll be surprised how often you reach for these tools instead of cracking open a Python notebook. Enjoy cleaning your data!
Bala Priya C is a software developer and technical writer from India. She likes working at the intersection of mathematics, programming, data analytics, and content creation. Her areas of interest and specialization include DevOps, data analytics, and natural language processing. She enjoys reading, writing, coding, and coffee! She is currently learning and sharing her knowledge with the developer community by writing tutorials, guides, reviews, and more. Bala also creates engaging resource overviews and coding tutorials.
